Data quality and completeness
There are a few caveats for working with these data that are important for researchers to understand. These are detailed in our limitations docs.
Incomplete or invalid records
Metadata fields such as
year are always required to make a
Record whereas others such as catch weight
cpue_kgkm2 are not present on all records returned by the API and are optional. See the data structure section for additional details. For fields with optional values:
- A maybe getter (like
get_cpue_weight_maybe) is provided which will return None without error if the value is not provided or could not be parsed.
- A regular getter (like
get_cpue_weight) is provided which asserts the value is not None before it is returned.
Record objects also have an
is_complete method which returns true if both all optional fields on the
Record are non-None and the
date_time field on the
Record is a valid ISO 8601 string. By default, records for which
is_complete are false are returned when iterating or through
get_page but this can be overridden by with the
filter_incomplete keyword argument like so:
import afscgap query = afscgap.Query() query.filter_year(eq=2021) query.filter_srvy(eq='GOA') query.filter_scientific_name(eq='Pasiphaea pacifica') query.set_filter_incomplete(True) results = query.execute() for result in results: assert result.is_complete()
Results returned by the API for which non-Optional fields could not be parsed (like missing
year) are considered "invalid" and always excluded during iteration when those raw unreadable records are kept in a
queue.Queue[dict] that can be accessed via
get_invalid like so:
import afscgap query = afscgap.Query() query.filter_year(eq=2021) query.filter_srvy(eq='GOA') query.filter_scientific_name(eq='Pasiphaea pacifica') results = query.execute() valid = list(results) invalid_queue = results.get_invalid() percent_invalid = invalid_queue.qsize() / len(valid) * 100 print('Percent invalid: %%%.2f' % percent_invalid) complete = filter(lambda x: x.is_complete(), valid) num_complete = sum(map(lambda x: 1, complete)) percent_complete = num_complete / len(valid) * 100 print('Percent complete: %%%.2f' % percent_complete)
Note that this queue is filled during iteration (like
for result in results or
list(results)) and not
get_page whose invalid record handling behavior can be specified via the
Though not officially mentioned by the NOAA API, the authors of this library observe some positive longitudes in returned data where negative longitudes of the same magnitude would be expected. Users of the library should be careful to determine how to handle these records (inferring they should have been the same magnitude of longitude but negative or excluded). Publications should be careful to document their decision.
Fully empty hauls
We've been made aware that some historical surveys from NOAA may not report hauls in the hauls flat file if no specimens were captured. This is a limitation of upstream data and may cause some approximation in CPUE calculations depending on one's definition of effort (should hauls which did not get any specimens be considered a "successful" haul and should they be included as effort?). Note that this inclusion decision is decided by NOAA and not this project. Instead this package uses any hauls reported by NOAA in zero catch inference. One can use an alternative inclusion criteria by providing a static hauls file via