Tutorial
Example uses
Basic use, access to metadata and data from the Core file
from dwca.read import DwCAReader
from dwca.darwincore.utils import qualname as qn

# Let's open our archive...
# Using the with statement ensures that resources will be properly freed/cleaned after use.
with DwCAReader('my-archive.zip') as dwca:
    # We can now interact with the 'dwca' object

    # We can read scientific metadata (EML) through an xml.etree.ElementTree.Element object in the 'metadata'
    # attribute.
    dwca.metadata

    # The 'descriptor' attribute gives access to the Archive Descriptor (meta.xml) and allows
    # inspecting the archive:
    # For example, discover the type of the Core file (Occurrence, Taxon, ...):
    print("Core type is: %s" % dwca.descriptor.core.type)
    # => Core type is: http://rs.tdwg.org/dwc/terms/Occurrence

    # Check if a Darwin Core term is present in the core file
    if 'http://rs.tdwg.org/dwc/terms/locality' in dwca.descriptor.core.terms:
        print("This archive contains the 'locality' term in its core file.")
    else:
        print("Locality term is not present.")

    # Using full qualnames for Darwin Core terms (such as 'http://rs.tdwg.org/dwc/terms/country') is verbose...
    # The qualname() helper function makes life easier for common terms
    # (here, it has been imported as 'qn'):
    qn('locality')
    # => u'http://rs.tdwg.org/dwc/terms/locality'

    # Combined with the previous examples, this can be used to make things clearer.
    # For example:
    if qn('locality') in dwca.descriptor.core.terms:
        pass

    # Or:
    if dwca.descriptor.core.type == qn('Occurrence'):
        pass

    # Finally, let's iterate over the archive core rows and get the data:
    for row in dwca:
        # row is an instance of CoreRow
        # iteration respects the order in which rows appear in the core file

        # print() can be used for debugging purposes...
        print(row)
        # => --
        # => Rowtype: http://rs.tdwg.org/dwc/terms/Occurrence
        # => Source: Core file
        # => Row ID:
        # => Data: {u'http://rs.tdwg.org/dwc/terms/basisOfRecord': u'Observation',
        # =>        u'http://rs.tdwg.org/dwc/terms/family': u'Tetraodontidae',
        # =>        u'http://rs.tdwg.org/dwc/terms/locality': u'Borneo',
        # =>        u'http://rs.tdwg.org/dwc/terms/scientificName': u'tetraodon fluviatilis'}
        # => --

        # You can get the value of a specific Darwin Core term through
        # the "data" dict:
        print("Value of 'locality' for this row: %s" % row.data[qn('locality')])
        # => Value of 'locality' for this row: Borneo

    # Alternatively, we can get a list of core rows instead of iterating:
    # BEWARE: all rows will be loaded in memory!
    rows = dwca.rows

    # Or retrieve a specific row by its id:
    occurrence_number_three = dwca.get_row_by_id(3)

    # Caution: ids are generally a fragile way to identify a core row in an archive, since the standard doesn't
    # guarantee uniqueness (or even that there will be an id). The index (position) of the row (starting at 0) is
    # generally preferable.
    occurrence_on_second_line = dwca.get_row_by_index(1)

    # We can retrieve the (absolute) path of embedded files.
    # NOTE: this path points to a temporary directory that will be removed at the end of the DwCAReader object's
    # life cycle.
    path = dwca.absolute_temporary_path('occurrence.txt')
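To try the snippet above without real data, it can help to have a tiny archive at hand. The following is a minimal sketch, using only the standard library, that assembles a toy Darwin Core Archive. The helper names (`build_archive_bytes`, `write_archive`), the sample rows and the skeletal EML document are assumptions made for illustration; they are not part of python-dwca-reader, and the EML here is not a complete metadata record.

```python
import io
import zipfile

# Archive descriptor: declares one tab-separated core file of type Occurrence.
# The raw string keeps "\t" and "\n" as literal backslash sequences, which is
# how meta.xml spells the field and line delimiters.
META_XML = r"""<?xml version="1.0" encoding="UTF-8"?>
<archive xmlns="http://rs.tdwg.org/dwc/text/" metadata="eml.xml">
  <core encoding="UTF-8" fieldsTerminatedBy="\t" linesTerminatedBy="\n"
        fieldsEnclosedBy="" ignoreHeaderLines="1"
        rowType="http://rs.tdwg.org/dwc/terms/Occurrence">
    <files><location>occurrence.txt</location></files>
    <id index="0"/>
    <field index="1" term="http://rs.tdwg.org/dwc/terms/basisOfRecord"/>
    <field index="2" term="http://rs.tdwg.org/dwc/terms/locality"/>
    <field index="3" term="http://rs.tdwg.org/dwc/terms/scientificName"/>
  </core>
</archive>
"""

# Deliberately skeletal metadata document (illustrative only).
EML_XML = """<?xml version="1.0" encoding="UTF-8"?>
<eml:eml xmlns:eml="eml://ecoinformatics.org/eml-2.1.1">
  <dataset><title>Toy dataset</title></dataset>
</eml:eml>
"""

# Header line followed by two made-up occurrence records.
ROWS = [
    ("id", "basisOfRecord", "locality", "scientificName"),
    ("1", "Observation", "Borneo", "Tetraodon fluviatilis"),
    ("2", "Observation", "Mumbai", "Betta splendens"),
]


def build_archive_bytes():
    """Return the bytes of a zip file holding meta.xml, eml.xml and the core file."""
    occurrence_txt = "".join("\t".join(row) + "\n" for row in ROWS)
    buffer = io.BytesIO()
    with zipfile.ZipFile(buffer, "w") as archive:
        archive.writestr("meta.xml", META_XML)
        archive.writestr("eml.xml", EML_XML)
        archive.writestr("occurrence.txt", occurrence_txt)
    return buffer.getvalue()


def write_archive(path):
    """Write the toy archive to 'path' (for example 'my-archive.zip')."""
    with open(path, "wb") as handle:
        handle.write(build_archive_bytes())
```

Calling `write_archive('my-archive.zip')` should produce a file that can be passed to `DwCAReader`, assuming the reader accepts an archive this small.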
Access to Darwin Core Archives with extensions (star schema)
from dwca.read import DwCAReader

with DwCAReader('archive_with_vernacularnames_extension.zip') as dwca:
    # Let's ask the archive what kind of extensions are in use:
    for e in dwca.descriptor.extensions:
        print(e.type)
    # => http://rs.gbif.org/terms/1.0/VernacularName

    first_core_row = dwca.rows[0]

    # Extension rows are accessible from a core row as a list of ExtensionRow instances:
    for extension_line in first_core_row.extensions:
        # Display all rows from extension files referring to the first Core row
        print(extension_line)
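For readers unfamiliar with the star schema, the relationship between core rows and extension rows can be sketched with plain dictionaries: each extension row carries the id of the core row it refers to, and grouping on that id yields the per-row list that `extensions` exposes. This is only an illustration of the idea, not the library's implementation; the column names (`id`, `coreid`), the sample data and the helper functions are assumptions.

```python
import csv
import io
from collections import defaultdict

# Toy core file: one line per taxon, identified by "id".
CORE_TXT = "id\tscientificName\n1\tStruthio camelus\n2\tAlauda arvensis\n"

# Toy extension file: each line points back to a core row through "coreid".
EXTENSION_TXT = (
    "coreid\tvernacularName\tlanguage\n"
    "1\tOstrich\ten\n"
    "1\tAutruche\tfr\n"
    "2\tSkylark\ten\n"
)


def read_tsv(text):
    """Parse tab-separated text with a header line into a list of dicts."""
    return list(csv.DictReader(io.StringIO(text), delimiter="\t"))


def attach_extensions(core_rows, extension_rows):
    """Pair every core row with the extension rows that reference its id."""
    by_core_id = defaultdict(list)
    for ext in extension_rows:
        by_core_id[ext["coreid"]].append(ext)
    return [(core, by_core_id.get(core["id"], [])) for core in core_rows]


joined = attach_extensions(read_tsv(CORE_TXT), read_tsv(EXTENSION_TXT))
for core, extensions in joined:
    names = [ext["vernacularName"] for ext in extensions]
    print(core["scientificName"], "->", names)
# Struthio camelus -> ['Ostrich', 'Autruche']
# Alauda arvensis -> ['Skylark']
```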
Another example with multiple extensions (no new API here)
from dwca.read import DwCAReader

with DwCAReader('multiext_archive.zip') as dwca:
    rows = dwca.rows
    ostrich = rows[0]

    print("You'll find below all extension rows referring to Ostrich")
    print("There should be 3 vernacular names and 2 taxon descriptions")
    for ext in ostrich.extensions:
        print(ext)

    print("We can then simply filter by type...")
    for ext in ostrich.extensions:
        if ext.rowtype == 'http://rs.gbif.org/terms/1.0/VernacularName':
            print(ext)
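When several extension types are involved, filtering inside a loop for each type gets repetitive. One possible approach is to group the extension rows by their `rowtype` once. The sketch below assumes only that each row exposes a `rowtype` attribute, as in the example above; `group_by_rowtype` is a hypothetical helper, the `Description` row type URI is assumed for the taxon descriptions, and the `SimpleNamespace` objects (with a made-up `label` attribute) are stand-ins so the snippet runs without an archive.

```python
from collections import defaultdict
from types import SimpleNamespace

VERNACULAR = "http://rs.gbif.org/terms/1.0/VernacularName"
DESCRIPTION = "http://rs.gbif.org/terms/1.0/Description"  # assumed row type


def group_by_rowtype(extension_rows):
    """Return a dict mapping each rowtype to the list of rows of that type."""
    groups = defaultdict(list)
    for ext in extension_rows:
        groups[ext.rowtype].append(ext)
    return dict(groups)


# Stand-ins for ExtensionRow instances: only 'rowtype' mirrors the real API.
fake_extensions = [
    SimpleNamespace(rowtype=VERNACULAR, label="Ostrich"),
    SimpleNamespace(rowtype=DESCRIPTION, label="Large flightless bird"),
    SimpleNamespace(rowtype=VERNACULAR, label="Autruche"),
    SimpleNamespace(rowtype=VERNACULAR, label="Struisvogel"),
    SimpleNamespace(rowtype=DESCRIPTION, label="Native to Africa"),
]

groups = group_by_rowtype(fake_extensions)
for rowtype, rows_of_type in groups.items():
    print(rowtype, len(rows_of_type))
```

With a real archive, the same helper could be called as `group_by_rowtype(ostrich.extensions)`.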
GBIF Downloads
The GBIF website allows visitors to export occurrences as a Darwin Core Archive. The resulting file contains a few additions that are not part of the Darwin Core Archive standard. These additions also work with python-dwca-reader. See The GBIF Occurrence download format for explanations of the file format and how to use it.
Data analysis and manipulation with Pandas
Python-dwca-reader provides specific tools to make working with Pandas easier. See Interaction with Pandas Package for concrete examples.