Complete API¶
Reader objects¶
High-level classes to open and read Darwin Core Archives.
-
class
dwca.read.
DwCAReader
(path: str, extensions_to_ignore: List[str] = None, tmp_dir: str = None)¶ Bases:
object
This class is used to represent a Darwin Core Archive as a whole.
It gives read access to the contained data, to the scientific metadata, … It supports archives with or without Metafile, such as described on page 2 of the Reference Guide to the XML Descriptor.
- Parameters
path (str) – path to the Darwin Core Archive (either a zip/tgz file or a directory) to open.
extensions_to_ignore (list) – paths (relative to the archive root) of extension data files to ignore. This improves speed and memory usage for large archives. Missing files are silently ignored.
tmp_dir (str) – temporary directory to use to uncompress the archive (if needed). If not provided, Python default will be used.
- Raises
Usage:
from dwca.read import DwCAReader

dwca = DwCAReader('my_archive.zip')

# Iterating on core rows is easy:
for core_row in dwca:
    # core_row is an instance of dwca.rows.CoreRow
    print(core_row)

# Scientific metadata (EML) is available as an ElementTree.Element object
print(dwca.metadata)

# Close the archive to free resources
dwca.close()
The archive can also be opened using the with statement. This is recommended, since it ensures resources will be properly cleaned after usage:
from dwca.read import DwCAReader

with DwCAReader('my-archive.zip') as dwca:
    pass  # Do what you want

# When leaving the block, resources are automatically freed.
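To illustrate why the with statement guarantees cleanup even when reading fails, here is a minimal sketch of the pattern. It is not the library's implementation: `TempWorkspace` and `demo` are hypothetical names, and the class only stands in for any object that owns a temporary directory.

```python
import os
import shutil
import tempfile


class TempWorkspace:
    """Hypothetical stand-in for an object that owns a temporary directory."""

    def __init__(self):
        self.path = tempfile.mkdtemp()

    def close(self):
        # Remove the temporary directory; ignore_errors makes a second call harmless.
        shutil.rmtree(self.path, ignore_errors=True)

    def __enter__(self):
        return self

    def __exit__(self, exc_type, exc_value, traceback):
        self.close()
        return False  # do not swallow exceptions raised inside the block


def demo():
    """Return (existed_inside_block, exists_after_block)."""
    try:
        with TempWorkspace() as workspace:
            path = workspace.path
            inside = os.path.isdir(path)
            raise RuntimeError("simulated failure while reading")
    except RuntimeError:
        pass
    return inside, os.path.isdir(path)
```

Calling `demo()` shows that the directory exists inside the block and is gone afterwards, although the block exited through an exception.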
-
absolute_temporary_path
(relative_path: str) → str¶ Return the absolute path of a file located within the archive.
This method allows raw access to the files contained in the archive. It can be useful to open additional, non-standard files embedded in the archive, or to open a standard file with another library.
- Parameters
relative_path (str) – the path (relative to the archive root) of the file.
- Returns
the absolute path to the file.
Usage:
dwca.absolute_temporary_path('occurrence.txt') # => /tmp/afdfsec7/occurrence.txt
Warning
If the archive is contained in a zip or tgz file, the returned path will point to a temporary file that will be removed when closing the
dwca.read.DwCAReader
instance.
Note
File existence is not tested.
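Because file existence is not tested, a caller may want to check it before opening the returned path. The sketch below assumes a hypothetical helper, `open_if_exists`, that accepts any callable mapping a relative path to an absolute one (for example the bound method `dwca.absolute_temporary_path`).

```python
import os


def open_if_exists(resolve_path, relative_path, encoding="utf-8"):
    """Open relative_path for reading, or return None if the file is missing.

    resolve_path is any callable that maps a relative path to an absolute one.
    This helper is hypothetical and not part of the library.
    """
    absolute_path = resolve_path(relative_path)
    if not os.path.isfile(absolute_path):
        return None
    return open(absolute_path, "r", encoding=encoding)
```

The caller remains responsible for closing the returned file object, ideally before the reader itself is closed.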
-
archive_path
= None¶ The path to the Darwin Core Archive file, as passed to the constructor.
-
close
() → None¶ Close the Darwin Core Archive and remove temporary/working files.
Note
Alternatively,
DwCAReader
can be instantiated using the with statement (see example above).
-
core_contains_term
(term_url: str) → bool¶ Return True if the Core file of the archive contains the term_url term.
-
core_file
= None¶ An instance of
dwca.files.CSVDataFile
for the core data file.
-
property
core_file_location
¶ The (relative) path to the core data file.
Example: ‘occurrence.txt’
-
descriptor
= None¶ An
descriptors.ArchiveDescriptor
instance giving access to the archive descriptor/metafile (meta.xml).
-
extension_files
= None¶ A list of
dwca.files.CSVDataFile
, one entry for each extension data file, sorted by order of appearance in the Metafile (or an empty list if the archive doesn’t use extensions).
-
get_corerow_by_id
(row_id: str) → dwca.rows.CoreRow¶ Return the (core) row whose ID is row_id.
- Parameters
row_id (str) – ID of the core row you want
- Returns
dwca.rows.CoreRow
– the matching row.
- Raises
Warning
It is rarely a good idea to rely on the row ID, because: 1) Not all Darwin Core Archives specify row IDs. 2) Nothing guarantees that the ID will actually be unique within the archive (it depends on the data publisher). If several rows share the ID, this method doesn’t guarantee which one will be returned.
get_corerow_by_position()
may be more appropriate in this case.
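Since IDs are not guaranteed to be unique, it can be useful to check an archive before relying on them. A minimal sketch, assuming a hypothetical helper `find_duplicate_ids` that takes any iterable of row IDs:

```python
from collections import Counter


def find_duplicate_ids(row_ids):
    """Return {row_id: occurrences} for every ID that appears more than once."""
    counts = Counter(row_ids)
    return {row_id: n for row_id, n in counts.items() if n > 1}
```

With an open reader, this could be called as `find_duplicate_ids(row.id for row in dwca)`; an empty result means every core row ID is unique.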
-
get_corerow_by_position
(position: int) → dwca.rows.CoreRow¶ Return a core row according to its position/index in core file.
- Parameters
position (int) – the position (starting at 0) of the row you want in the core file.
- Returns
dwca.rows.CoreRow
– the matching row.
- Raises
Note
If position is greater than the number of rows in the core file, None is returned.
The position is often an appropriate way to unambiguously identify a core row in a DwCA.
-
get_descriptor_for
(relative_path: str) → dwca.descriptors.DataFileDescriptor¶ Return a descriptor for the data file located at relative_path.
- Parameters
relative_path (str) – the path (relative to the archive root) to the data file you want info about.
- Returns
- Raises
dwca.exceptions.NotADataFile
if relative_path doesn’t reference a valid data file.
Examples:
dwca.get_descriptor_for('occurrence.txt')
dwca.get_descriptor_for('verbatim.txt')
-
metadata
= None¶ A
xml.etree.ElementTree.Element
instance containing the (scientific) metadata of the archive, or None if the archive has no metadata.
-
open_included_file
(relative_path: str, *args: Any, **kwargs: Any) → IO¶ Simple wrapper around Python’s built-in open function.
To be used only for reading.
Warning
Don’t forget to close the files after usage. This is especially important on Windows because temporary (extracted) files won’t be cleanable if not closed.
-
orphaned_extension_rows
() → Dict[str, Dict[str, List[int]]]¶ Return a dict of the orphaned extension rows.
Orphaned extension rows are extension rows that reference non-existing core rows. This method returns a dict such as:
{'description.txt': {u'5': [3, 4], u'6': [5]}, 'vernacularname.txt': {u'7': [4]}}
Meaning:
in description.txt, rows at positions 3 and 4 reference a core row whose ID is ‘5’, but such a core row doesn’t exist. The row at position 5 references an imaginary core row with ID ‘6’
in vernacularname.txt, the row at position 4 references an imaginary core row with ID ‘7’
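The nested structure can be summarized with a few lines of code. The sketch below uses a hypothetical helper, `count_orphans`, applied to the example dict shown above.

```python
def count_orphans(orphaned):
    """Return {filename: number of orphaned rows} from the nested dict."""
    return {
        filename: sum(len(positions) for positions in by_core_id.values())
        for filename, by_core_id in orphaned.items()
    }


example = {
    'description.txt': {'5': [3, 4], '6': [5]},
    'vernacularname.txt': {'7': [4]},
}
summary = count_orphans(example)
# summary == {'description.txt': 3, 'vernacularname.txt': 1}
```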
-
pd_read
(relative_path, **kwargs)¶ Return a Pandas DataFrame for the data file located at relative_path.
This method wraps pandas.read_csv() and accepts the same keyword arguments. The following arguments will be ignored (because they are set appropriately for the data file): delimiter, skiprows, header and names.
- Parameters
relative_path (str) – path to the data file (relative to the archive root).
- Raises
ImportError if Pandas is not installed.
- Raises
dwca.exceptions.NotADataFile
if relative_path doesn’t designate a valid data file in the archive.
Warning
You’ll need to install Pandas before using this method.
Note
Default values of Darwin Core Archive are supported: A column will be added to the DataFrame if a term has a default value in the Metafile (but no corresponding column in the CSV Data File).
-
property
rows
¶ A list of
rows.CoreRow
objects representing the content of the archive.
Warning
All rows will be loaded in memory. In case of a large Darwin Core Archive, you may prefer using a for loop.
-
source_metadata
= None¶ If the archive contains source-level metadata (typically, GBIF downloads), this is a dict such as:
{'dataset1_UUID': <dataset1 EML> (xml.etree.ElementTree.Element object),
 'dataset2_UUID': <dataset2 EML> (xml.etree.ElementTree.Element object),
 ...}
See The GBIF Occurrence download format for more details.
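As an illustration of working with such a dict, the sketch below extracts one title per dataset. `dataset_titles` is a hypothetical helper, and the `dataset/title` path is an assumption based on the usual EML layout; documents structured differently will yield None.

```python
import xml.etree.ElementTree as ET


def dataset_titles(source_metadata):
    """Return {dataset_key: title or None} from a dict of EML root elements."""
    titles = {}
    for dataset_key, eml_root in source_metadata.items():
        # Assumption: the title sits at <dataset><title> under the EML root.
        title = eml_root.findtext("dataset/title")
        titles[dataset_key] = title.strip() if title else None
    return titles


sample_eml = ET.fromstring(
    '<eml:eml xmlns:eml="eml://ecoinformatics.org/eml-2.1.1">'
    '<dataset><title> Birds of Belgium </title></dataset>'
    '</eml:eml>'
)
titles = dataset_titles({'dataset1_UUID': sample_eml})
```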
-
property
use_extensions
¶ True if the archive makes use of extensions.
Row objects¶
Objects that represent data rows coming from Darwin Core Archives.
-
class
dwca.rows.
CoreRow
(csv_line: str, position: int, datafile_descriptor: dwca.descriptors.DataFileDescriptor)¶ Bases:
dwca.rows.Row
This class is used to represent a row/line from a Darwin Core Archive core data file.
You probably won’t instantiate it manually but rather obtain it via
dwca.read.DwCAReader.get_corerow_by_position()
, dwca.read.DwCAReader.get_corerow_by_id()
or simply by looping over a dwca.read.DwCAReader
object.
-
property
extensions
¶ A list of
ExtensionRow
instances that relate to this Core row.
-
id
= None¶ The row id
-
property
-
class
dwca.rows.
ExtensionRow
(csv_line: str, position: int, datafile_descriptor: dwca.descriptors.DataFileDescriptor)¶ Bases:
dwca.rows.Row
This class is used to represent a row/line from a Darwin Core Archive extension data file.
Most of the time, you won’t instantiate it manually but rather obtain it through the extensions attribute of
CoreRow
.
-
core_id
= None¶ The id of the core row this extension row is referring to.
-
-
class
dwca.rows.
Row
(csv_line: str, position: int, datafile_descriptor: dwca.descriptors.DataFileDescriptor)¶ Bases:
object
This class is used to represent a row/line in a Darwin Core Archive.
This class is intended to be subclassed rather than used directly.
-
data
= None¶ A dict containing the Row data, such as:
{'dwc_term_1': 'value', 'dwc_term_2': 'value', ...}
Usage:
myrow.data['http://rs.tdwg.org/dwc/terms/locality'] # => "Brussels"
Note
The
dwca.darwincore.utils.qualname()
helper is available to make such calls less verbose.
-
descriptor
= None¶ An instance of
dwca.descriptors.DataFileDescriptor
describing the originating data file.
-
position
= None¶ The row position/index (starting at 0) in the source data file. This can be used, for example with
dwca.read.DwCAReader.get_corerow_by_position()
or dwca.files.CSVDataFile.get_row_by_position()
.
-
raw_fields
= None¶
-
rowtype
= None¶ The CSV line type as stated in the archive descriptor (or None if the archive has no descriptor). Examples: http://rs.tdwg.org/dwc/terms/Occurrence, http://rs.gbif.org/terms/1.0/VernacularName, …
-
-
dwca.rows.
csv_line_to_fields
(csv_line, line_ending, field_ending, fields_enclosed_by)¶ Split a line from a CSV file.
Return a list of fields. Content is not trimmed.
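The splitting step can be sketched as follows. This is a simplified stand-in, not the library's code: `split_line` is a hypothetical name, and the sketch does not handle field separators that appear inside enclosed fields.

```python
def split_line(csv_line, line_ending, field_ending, fields_enclosed_by):
    """Split one CSV line into fields without trimming their content."""
    # Drop the line terminator, if present.
    if line_ending and csv_line.endswith(line_ending):
        csv_line = csv_line[:-len(line_ending)]

    fields = csv_line.split(field_ending)

    # Remove the enclosing characters only when a field has them on both sides.
    if fields_enclosed_by:
        n = len(fields_enclosed_by)
        fields = [
            f[n:-n]
            if len(f) >= 2 * n
            and f.startswith(fields_enclosed_by)
            and f.endswith(fields_enclosed_by)
            else f
            for f in fields
        ]
    return fields


fields = split_line('"1"\t" Brussels "\t\n', '\n', '\t', '"')
# fields == ['1', ' Brussels ', '']
```

Note that the surrounding spaces of the second field are preserved, which matches the "content is not trimmed" behavior described above.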
Descriptor objects¶
Classes to represent descriptors of a DwC-A.
ArchiveDescriptor
represents the full archive descriptor, initialized from the metafile content.
DataFileDescriptor
describes characteristics of a given data file in the archive. It’s created either from a subsection of the ArchiveDescriptor describing the data file, or by introspecting the CSV data file (useful for archives without a metafile).
-
class
dwca.descriptors.
ArchiveDescriptor
(metaxml_content: str, files_to_ignore: List[str] = None)¶ Bases:
object
Class used to encapsulate the whole Metafile (meta.xml).
-
extensions
= None¶ A list of
dwca.descriptors.DataFileDescriptor
instances describing each of the archive’s extension data files.
-
extensions_type
= None¶ A list of extension (types) in use in the archive.
Example:
["http://rs.gbif.org/terms/1.0/VernacularName", "http://rs.gbif.org/terms/1.0/Description"]
-
metadata_filename
= None¶ The path (relative to archive root) of the (scientific) metadata of the archive.
-
raw_element
= None¶ A
xml.etree.ElementTree.Element
instance containing the complete Archive Descriptor.
-
-
class
dwca.descriptors.
DataFileDescriptor
(created_from_file: bool, raw_element: xml.etree.ElementTree.Element, represents_corefile: bool, datafile_type: Optional[str], file_location: str, file_encoding: str, id_index: int, coreid_index: int, fields: List[Dict], lines_terminated_by: str, fields_enclosed_by: str, fields_terminated_by: str)¶ Bases:
object
These objects describe a data file from the archive.
They’re generally not instantiated manually, but rather by calling:
make_from_metafile_section()
(if the archive contains a metafile)
make_from_file()
(created by analyzing the data file)
-
coreid_index
= None¶ If the section represents an extension data file, the index/position of the core_id column in that file. The core_id in an extension is the foreign key to the “extended” core row.
-
created_from_file
= None¶ True if this descriptor was created by analyzing the data file.
-
fields
= None¶ A list of dicts where each entry represents a data field in use.
- Each dict contains:
The term identifier
(Possibly) a default value
The column index/position in the CSV file (except if we use a default value instead)
Example:
[{'term': 'http://rs.tdwg.org/dwc/terms/scientificName', 'index': '1', 'default': None},
 {'term': 'http://rs.tdwg.org/dwc/terms/locality', 'index': '2', 'default': ''},
 # The data for `country` is the default value 'Belgium' for all rows, so there's
 # no column in the CSV file.
 {'term': 'http://rs.tdwg.org/dwc/terms/country', 'index': None, 'default': 'Belgium'}]
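To show how such a list can be combined with the raw values of a line, here is a minimal sketch. `row_data` is a hypothetical helper, not the library's implementation, and it makes a simplifying assumption: the default value is used only when a field has no column index.

```python
def row_data(raw_fields, fields):
    """Build a {term: value} dict from raw column values and field descriptions."""
    data = {}
    for field in fields:
        if field['index'] is not None:
            # Indexes are stored as strings in the example above, hence int().
            data[field['term']] = raw_fields[int(field['index'])]
        else:
            # No column in the file: fall back to the default value.
            data[field['term']] = field['default']
    return data


fields = [
    {'term': 'http://rs.tdwg.org/dwc/terms/scientificName', 'index': '1', 'default': None},
    {'term': 'http://rs.tdwg.org/dwc/terms/locality', 'index': '2', 'default': ''},
    {'term': 'http://rs.tdwg.org/dwc/terms/country', 'index': None, 'default': 'Belgium'},
]
data = row_data(['42', 'Tetraodon fluviatilis', 'Brussels'], fields)
```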
-
fields_enclosed_by
= None¶ The string or character used to enclose fields in the data file.
-
fields_terminated_by
= None¶ The string or character used as a field separator in the data file. Example: “\t”.
-
file_encoding
= None¶ The encoding of the data file. Example: “utf-8”.
-
file_location
= None¶ The data file location, relative to the archive root.
-
property
headers
¶ A list of (ordered) column names that can be used to create a header line for the data file.
Example:
['id', 'http://rs.tdwg.org/dwc/terms/scientificName', 'http://rs.tdwg.org/dwc/terms/basisOfRecord', 'http://rs.tdwg.org/dwc/terms/family', 'http://rs.tdwg.org/dwc/terms/locality']
See also
short_headers
if you prefer less verbose headers.
-
id_index
= None¶ If the section represents a core data file, the index/position of the id column in that file.
-
lines_terminated_by
= None¶ The string or character used as a line separator in the data file. Example: “\n”.
-
property
lines_to_ignore
¶ Return the number of header lines/lines to ignore in the data file.
-
classmethod
make_from_file
(datafile_path)¶ Create and return a DataFileDescriptor by analyzing the file at datafile_path.
- Parameters
datafile_path (str) – Relative path to a data file to analyze in order to instantiate the descriptor.
-
classmethod
make_from_metafile_section
(section_tag)¶ Create and return a DataFileDescriptor from a metafile <section> tag.
- Parameters
section_tag (
xml.etree.ElementTree.Element
) – The XML Element section containing details about the data file.
-
raw_element
= None¶ The <section> element describing the data file, from the metafile. None if the archive contains no metafile.
-
represents_corefile
= None¶ True if this descriptor is used to represent the core file of an archive.
-
represents_extension
= None¶ True if this descriptor is used to represent an extension file in an archive.
-
property
short_headers
¶ A list of (ordered) column names (short version) that can be used to create a header line for the data file.
Example:
['id', 'scientificName', 'basisOfRecord', 'family', 'locality']
See also
headers
.
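The relation between the two header forms can be sketched as below. This is an assumption about the mapping, based only on the two examples above: the hypothetical `short_header` keeps whatever follows the last slash, and does not handle term URLs that use a `#` fragment.

```python
def short_header(header):
    """Return the last path segment of a term URL; plain names are unchanged."""
    return header.rstrip('/').rsplit('/', 1)[-1]


headers = [
    'id',
    'http://rs.tdwg.org/dwc/terms/scientificName',
    'http://rs.tdwg.org/dwc/terms/basisOfRecord',
    'http://rs.tdwg.org/dwc/terms/family',
    'http://rs.tdwg.org/dwc/terms/locality',
]
short = [short_header(h) for h in headers]
# short == ['id', 'scientificName', 'basisOfRecord', 'family', 'locality']
```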
-
property
terms
¶ Return a Python set containing all the Darwin Core terms appearing in the file.
-
type
= None¶
File objects¶
File-related classes and functions.
-
class
dwca.files.
CSVDataFile
(work_directory: str, file_descriptor: dwca.descriptors.DataFileDescriptor)¶ Object used to access a DwCA-enclosed CSV data file.
- Parameters
work_directory – absolute path to the target directory (archive content, previously extracted if necessary).
file_descriptor – an instance of
dwca.descriptors.DataFileDescriptor
describing the data file.
The file content can be accessed:
By iterating on this object: a str is returned, including separators.
With
get_row_by_position()
(a dwca.rows.CoreRow
or dwca.rows.ExtensionRow
object is returned)
For an extension data file, with
get_all_rows_by_coreid()
(a list of dwca.rows.ExtensionRow
objects is returned)
On initialization, an index of new lines is built. This may take time, but makes random access faster.
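The idea of a new-line index can be sketched as follows: one pass over the file records the byte offset of each line, after which any line can be reached with a single seek. This is an illustration of the general technique with hypothetical function names, not the library's code, and it assumes lines terminated by "\n".

```python
import io
from array import array


def build_line_offsets(binary_file):
    """Scan the file once and return the byte offset of each line start."""
    offsets = array('L')
    binary_file.seek(0)
    position = 0
    for line in binary_file:
        offsets.append(position)
        position += len(line)
    return offsets


def read_line(binary_file, offsets, line_number):
    """Jump directly to a line using the precomputed offsets."""
    binary_file.seek(offsets[line_number])
    return binary_file.readline()


sample = io.BytesIO(b"id\tname\n1\tfoo\n2\tbar\n")
offsets = build_line_offsets(sample)
third_line = read_line(sample, offsets, 2)
# third_line == b"2\tbar\n"
```

Building the index costs one full read of the file, which is the trade-off mentioned above: slower initialization in exchange for fast random access.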
-
close
() → None¶ Close the file.
The content of the file will not be accessible in any way afterwards.
-
property
coreid_index
¶ An index of the core rows referenced by this data file.
It is a Python dict such as:
{
    core_id1: [1],      # Row at position 1 references a Core Row whose ID is core_id1
    core_id2: [8, 10],  # Rows at positions 8 and 10 reference a Core Row whose ID is core_id2
}
- Raises
AttributeError if accessed on a core data file.
Warning
For performance reasons, dictionary values are arrays(‘L’) instead of regular Python lists.
Warning
coreid_index is only available for extension data files.
Warning
Creating this index can be time and memory consuming for large archives, so it’s created on the fly at first access.
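A structure of this shape can be built in a single pass over the extension rows. The sketch below is only an illustration, not the library's implementation: `build_coreid_index` is a hypothetical name, and its input is assumed to be the core ID of each extension row, in file order.

```python
from array import array
from collections import defaultdict


def build_coreid_index(core_ids):
    """Map each core ID to an array('L') of the positions that reference it."""
    index = defaultdict(lambda: array('L'))
    for position, core_id in enumerate(core_ids):
        index[core_id].append(position)
    return dict(index)


index = build_coreid_index(['a', 'b', 'a'])
# list(index['a']) == [0, 2] and list(index['b']) == [1]
```

Storing positions in `array('L')` rather than in lists keeps the memory footprint lower for large files, at the cost of values that only accept unsigned integers.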
-
file_descriptor
= None¶ An instance of
dwca.descriptors.DataFileDescriptor
, as given to the constructor.
-
get_all_rows_by_coreid
(core_id: int) → List[dwca.rows.ExtensionRow]¶ Return a list of
dwca.rows.ExtensionRow
objects whose Core Id field matches core_id.
-
get_row_by_position
(position: int) → Union[dwca.rows.CoreRow, dwca.rows.ExtensionRow]¶ Return the row at position in the file.
Header lines are ignored.
- Raises
IndexError if there’s no line at position.
-
lines_to_ignore
= None¶ Number of lines to ignore (header lines) in the CSV file.
Helpers¶
This module contains small helpers to make life easier.
-
dwca.darwincore.utils.
qualname
(short_term)¶ Takes a Darwin Core term (short form) and returns the corresponding qualname.
Note
It is generally used to make data access less verbose (see example below).
- Raises
StopIteration
if short_term is not found.
Typical real-world example:
from dwca.darwincore.utils import qualname as qn

qn("Occurrence")
# => "http://rs.tdwg.org/dwc/terms/Occurrence"

# To access data row:
myrow.data[qn('scientificName')]  # => u"Tetraodon fluviatilis"

# Instead of the verbose:
myrow.data['http://rs.tdwg.org/dwc/terms/scientificName']  # => u"Tetraodon fluviatilis"
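Because an unknown term raises StopIteration, callers that accept arbitrary input may prefer a wrapper that returns None instead. The sketch below is hypothetical: `qualname_or_none` takes the lookup function as a parameter (in practice `qualname` could be passed), and `fake_lookup` with its two-entry term list exists only to make the example self-contained.

```python
def qualname_or_none(short_term, lookup):
    """Call lookup(short_term) and return None if the term is not found."""
    try:
        return lookup(short_term)
    except StopIteration:
        return None


# Hypothetical stand-in for the real helper, limited to two terms.
KNOWN_TERMS = [
    "http://rs.tdwg.org/dwc/terms/Occurrence",
    "http://rs.tdwg.org/dwc/terms/scientificName",
]


def fake_lookup(short_term):
    # next() on an exhausted generator raises StopIteration.
    return next(t for t in KNOWN_TERMS if t.endswith("/" + short_term))


found = qualname_or_none("scientificName", fake_lookup)
missing = qualname_or_none("notATerm", fake_lookup)
```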
Exceptions¶
Exceptions for the whole package.
-
exception
dwca.exceptions.
InvalidArchive
¶ The archive appears to be invalid.
-
exception
dwca.exceptions.
InvalidSimpleArchive
¶ The simple archive appears to be invalid.
-
exception
dwca.exceptions.
NotADataFile
¶ The file doesn’t exist or is not a data file.
-
exception
dwca.exceptions.
RowNotFound
¶ The DwC-A Row cannot be found.