Tutorial

Example uses

Basic use, access to metadata and data from the Core file

from dwca.read import DwCAReader
from dwca.darwincore.utils import qualname as qn

# Let's open our archive...
# Using the with statement ensures that resources are properly freed/cleaned up after use.
with DwCAReader('my-archive.zip') as dwca:
    # We can now interact with the 'dwca' object

    # We can read scientific metadata (EML) through an xml.etree.ElementTree.Element object in the 'metadata'
    # attribute.
    dwca.metadata

    # The 'descriptor' attribute gives access to the Archive Descriptor (meta.xml) and allows
    # inspecting the archive.
    # For example, discover the type of the Core file (Occurrence, Taxon, ...):
    print("Core type is: %s" % dwca.descriptor.core.type)
    # => Core type is: http://rs.tdwg.org/dwc/terms/Occurrence

    # Check if a Darwin Core term is present in the core file
    if 'http://rs.tdwg.org/dwc/terms/locality' in dwca.descriptor.core.terms:
        print("This archive contains the 'locality' term in its core file.")
    else:
        print("Locality term is not present.")

    # Using full qualnames for Darwin Core terms (such as 'http://rs.tdwg.org/dwc/terms/country') is verbose...
    # The qualname() helper function makes life easier for common terms.
    # (here, it has been imported as 'qn'):
    qn('locality')
    # => u'http://rs.tdwg.org/dwc/terms/locality'

    # Combined with the previous examples, this can be used to make things clearer.
    # For example:
    if qn('locality') in dwca.descriptor.core.terms:
        pass

    # Or:
    if dwca.descriptor.core.type == qn('Occurrence'):
        pass

    # Finally, let's iterate over the archive core rows and get the data:
    for row in dwca:
        # row is an instance of CoreRow
        # iteration respects their order of appearance in the core file

        # print() can be used for debugging purposes...
        print(row)

        # => --
        # => Rowtype: http://rs.tdwg.org/dwc/terms/Occurrence
        # => Source: Core file
        # => Row ID:
        # => Data: {u'http://rs.tdwg.org/dwc/terms/basisOfRecord': u'Observation',
        # =>        u'http://rs.tdwg.org/dwc/terms/family': u'Tetraodontidae',
        # =>        u'http://rs.tdwg.org/dwc/terms/locality': u'Borneo',
        # =>        u'http://rs.tdwg.org/dwc/terms/scientificName': u'tetraodon fluviatilis'}
        # => --

        # You can get the value of a specific Darwin Core term through
        # the "data" dict:
        print("Value of 'locality' for this row: %s" % row.data[qn('locality')])
        # => Value of 'locality' for this row: Borneo

    # Alternatively, we can get a list of core rows instead of iterating:
    # BEWARE: all rows will be loaded in memory!
    rows = dwca.rows

    # Or retrieve a specific row by its id:
    occurrence_number_three = dwca.get_row_by_id(3)

    # Caution: ids are generally a fragile way to identify a core row in an archive, since the standard doesn't
    # guarantee uniqueness (nor even that there will be an id). The index (position) of the row (starting at 0) is
    # generally preferable.

    occurrence_on_second_line = dwca.get_row_by_index(1)

    # We can retrieve the (absolute) path of embedded files.
    # NOTE: this path points to a temporary directory that will be removed at the end of the DwCAReader object's
    # life cycle.
    path = dwca.absolute_temporary_path('occurrence.txt')
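
The iteration pattern above can be wrapped in a small helper that tolerates rows lacking a given term. What follows is a minimal sketch, not part of python-dwca-reader: `dwc_term` and `collect_term` are hypothetical names, and the stand-in rows only imitate the `data` attribute of a core row as shown in this tutorial. With a real archive, the `dwca` object (or `dwca.rows`) would be passed instead of `sample_rows`.

```python
from types import SimpleNamespace

DWC_NAMESPACE = "http://rs.tdwg.org/dwc/terms/"


def dwc_term(short_name):
    """Hypothetical stand-in for qn(): build the full URI of a Darwin Core term."""
    return DWC_NAMESPACE + short_name


def collect_term(rows, term, default=None):
    """Return the value of `term` for each row, using `default` when it is absent.

    `rows` is any iterable of objects exposing a `data` dict keyed by full
    term URIs (the shape shown for core rows in this tutorial).
    """
    return [row.data.get(term, default) for row in rows]


# Stand-in rows (assumption: they only imitate the `data` attribute).
sample_rows = [
    SimpleNamespace(data={dwc_term("locality"): "Borneo",
                          dwc_term("family"): "Tetraodontidae"}),
    SimpleNamespace(data={dwc_term("family"): "Osphronemidae"}),
]

localities = collect_term(sample_rows, dwc_term("locality"), default="")
print(localities)
```

Using `dict.get` with a default avoids a `KeyError` when a term is declared in the archive but a lookup targets a term that the core file does not contain.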

Access to Darwin Core Archives with extensions (star schema)

from dwca.read import DwCAReader

with DwCAReader('archive_with_vernacularnames_extension.zip') as dwca:
    # Let's ask the archive what kind of extensions are in use:
    for e in dwca.descriptor.extensions:
        print(e.type)
    # => http://rs.gbif.org/terms/1.0/VernacularName

    first_core_row = dwca.rows[0]

    # Extension rows are accessible from a core row as a list of ExtensionRow instances:
    for extension_line in first_core_row.extensions:
        # Display all rows from extension files referring to the first Core row
        print(extension_line)
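
To pull a single field out of the extension rows rather than printing them whole, something along these lines may help. It is a sketch under assumptions: extension rows are taken to expose a `data` dict like core rows do (the tutorial only shows them being printed), and `extension_values` is a hypothetical helper, not a library function. Stand-in objects replace a real archive so that the snippet runs on its own.

```python
from types import SimpleNamespace

VERNACULAR_NAME = "http://rs.tdwg.org/dwc/terms/vernacularName"


def extension_values(core_row, term):
    """Collect `term` from every extension row of `core_row`, skipping rows without it."""
    return [ext.data[term] for ext in core_row.extensions if term in ext.data]


# Stand-in core row (assumption: only `extensions` and `data` are imitated).
core_row = SimpleNamespace(extensions=[
    SimpleNamespace(data={VERNACULAR_NAME: "Ostrich"}),
    SimpleNamespace(data={VERNACULAR_NAME: "Struisvogel"}),
    SimpleNamespace(data={}),  # an extension row that lacks the term
])

names = extension_values(core_row, VERNACULAR_NAME)
print(names)
```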

Another example with multiple extensions (no new API here)

from dwca.read import DwCAReader

with DwCAReader('multiext_archive.zip') as dwca:
    rows = dwca.rows
    ostrich = rows[0]

    print("You'll find below all extension rows referring to Ostrich")
    print("There should be 3 vernacular names and 2 taxon descriptions")
    for ext in ostrich.extensions:
        print(ext)

    print("We can then simply filter by type...")
    for ext in ostrich.extensions:
        if ext.rowtype == 'http://rs.gbif.org/terms/1.0/VernacularName':
            print(ext)
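
When several extension types are attached to the same core row, grouping them in one pass can be tidier than filtering once per type. The sketch below assumes only the `rowtype` attribute used above; `group_by_rowtype` is a hypothetical helper, the stand-in objects replace a real archive, and the Description URI is given for illustration.

```python
from collections import defaultdict
from types import SimpleNamespace

VERNACULAR = "http://rs.gbif.org/terms/1.0/VernacularName"
DESCRIPTION = "http://rs.gbif.org/terms/1.0/Description"  # illustrative rowtype


def group_by_rowtype(extensions):
    """Map each rowtype URI to the list of extension rows of that type."""
    groups = defaultdict(list)
    for ext in extensions:
        groups[ext.rowtype].append(ext)
    return dict(groups)


# Stand-in extension rows: 3 vernacular names and 2 taxon descriptions.
extensions = ([SimpleNamespace(rowtype=VERNACULAR) for _ in range(3)]
              + [SimpleNamespace(rowtype=DESCRIPTION) for _ in range(2)])

groups = group_by_rowtype(extensions)
counts = {rowtype: len(rows) for rowtype, rows in groups.items()}
for rowtype, count in counts.items():
    print("%s: %d row(s)" % (rowtype, count))
```

With a real archive, `ostrich.extensions` would be passed in place of the stand-in list.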

GBIF Downloads

The GBIF website allows visitors to export occurrences as a Darwin Core Archive. The resulting file contains a few additions that are not part of the Darwin Core Archive standard. These additions also work with python-dwca-reader. See The GBIF Occurrence download format for an explanation of the file format and how to use it.

Data analysis and manipulation with Pandas

Python-dwca-reader provides specific tools to make working with Pandas easier. See Interaction with Pandas Package for concrete examples.