Absence vs presence data
The API itself provides access to presence-only data. This means that records are only returned for hauls in which a species was found. This can cause issues when aggregating data, for example to determine the prevalence of a species in a region in terms of catch weight per hectare. The AFSC GAP API on its own would not necessarily provide the total number of hectares surveyed in that region because hauls without the species present would be excluded. For instance, if a species appeared in only one of three 4 hectare hauls with 10 kg caught, presence-only data would suggest 2.5 kg / hectare even though the true value across all 12 surveyed hectares is closer to 0.83 kg / hectare. With that in mind, this library provides a method for inferring absence data.
Example of absence data in aggregation
Here is a practical, memory-efficient example using geolib and toolz to aggregate catch data by 5 character geohash.
```python
import afscgap
import geolib.geohash
import toolz.itertoolz

# Build a query for Pacific cod in the 2021 Gulf of Alaska survey with
# zero catch (absence) inference enabled.
query = afscgap.Query()
query.filter_year(eq=2021)
query.filter_srvy(eq='GOA')
query.filter_scientific_name(eq='Gadus macrocephalus')
query.set_presence_only(False)
results = query.execute()


def simplify_record(full_record):
    # Reduce each record to the geohash of its location plus the two fields
    # needed for the aggregation.
    latitude = full_record.get_latitude(units='dd')
    longitude = full_record.get_longitude(units='dd')
    geohash = geolib.geohash.encode(latitude, longitude, 5)
    return {
        'geohash': geohash,
        'area': full_record.get_area_swept(units='ha'),
        'weight': full_record.get_weight(units='kg')
    }


def combine_record(a, b):
    # Sum area swept and catch weight for two records in the same geohash.
    assert a['geohash'] == b['geohash']
    return {
        'geohash': a['geohash'],
        'area': a['area'] + b['area'],
        'weight': a['weight'] + b['weight']
    }


# Lazily simplify then reduce, holding only the per-geohash totals in memory.
simplified_records = map(simplify_record, results)
totals_by_geohash = toolz.itertoolz.reduceby(
    'geohash',
    combine_record,
    simplified_records
)
weight_by_area_tuples = map(
    lambda x: (x['geohash'], x['weight'] / x['area']),
    totals_by_geohash.values()
)
weight_by_area_by_geohash = dict(weight_by_area_tuples)
```
For more details, see the Python functional programming guide. All that said, note that for some queries, loading full results into Pandas may lead to very heavy memory usage.
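For illustration, here is a minimal sketch of the Pandas route, assuming each returned record exposes a `to_dict` method (verify against the library's record documentation):

```python
import pandas

# Caution: the list comprehension materializes every record at once, which
# defeats the library's streaming design for large presence_only=False queries.
results = query.execute()
df = pandas.DataFrame([record.to_dict() for record in results])
```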
Absence inference algorithm
Though it is not possible to resolve this issue using the AFSC GAP API service alone, this library can infer those missing records using a separate static flat file provided by NOAA and the following algorithm (sketched in code after the list):
- Record the set of species observed in results returned by the API service.
- Record the set of hauls observed in results returned by the API service.
- Return records normally while records remain available from the API service.
- Upon exhaustion of the API service results, download the ~10M hauls flat file from this library's community.
- For each species observed in the API returned results, check if that species had a record for each haul reported in the flat file.
- For any haul without a record for a given species, yield a zero catch record from the iterator for that query.
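The sketch below illustrates that procedure; it is not the library's actual implementation, and `get_haul_key` and `make_zero_record` are hypothetical helpers standing in for the library's internals:

```python
def infer_absences(api_records, all_hauls):
    # Illustrative sketch of the inference algorithm above.
    species_seen = set()
    observed_pairs = set()

    # Return records normally while noting which (species, haul) pairs the
    # API service actually reported.
    for record in api_records:
        species = record.get_scientific_name()
        species_seen.add(species)
        observed_pairs.add((species, get_haul_key(record)))  # hypothetical helper
        yield record

    # Upon exhaustion of the API results, walk the hauls flat file and yield
    # a zero catch record for each (species, haul) pair never reported.
    for haul in all_hauls:
        for species in species_seen:
            if (species, get_haul_key(haul)) not in observed_pairs:
                yield make_zero_record(haul, species)  # hypothetical helper
```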
This procedure is disabled by default. However, it can be enabled through the presence only option on the query builder like so: `query.set_presence_only(False)`.
Memory efficiency of absence inference
Note that `presence_only=False` will return a lot of records. Indeed, for some queries, this may stretch to many millions. As described in the community guidelines, a goal of this project is to provide those data in a memory-efficient way and, specifically, these "zero catch" records are generated by the library's iterator as requested but are never all held in memory at the same time. It is recommended that client code also take care with memory efficiency. This can be as simple as aggregating via `for` loops which only hold one record in memory at a time. Similarly, consider using `map`, `filter`, `reduce`, itertools, etc.
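For example, a running total like the one below holds only one record and two accumulators in memory at a time (reusing the query built earlier in this section):

```python
# Stream records one at a time, accumulating catch weight and area swept
# without ever materializing the full result set.
total_weight = 0
total_area = 0

for record in query.execute():
    total_weight += record.get_weight(units='kg')
    total_area += record.get_area_swept(units='ha')

cpue = total_weight / total_area  # kg per hectare across all hauls
```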
Manual pagination of zero catch records
The goal of `Cursor.get_page` is to pull results from a page returned for a query as it appears in the NOAA API service. Note that `get_page` will not return zero catch records even with `presence_only=False` because the "page" requested does not technically exist in the API service. In order to use the negative records inference feature, please use the iterator option instead.
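As a sketch of the distinction (detecting absences by zero weight here is purely for illustration):

```python
results = query.execute()

# Iterating the cursor interleaves inferred zero catch records with the
# presence records returned by the API service.
zero_catch_count = sum(
    1 for record in results
    if record.get_weight(units='kg') == 0
)

# By contrast, a manual page fetch such as results.get_page() only surfaces
# what the API service itself returns, so no zero catch records appear there.
```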
Filtering absence data
Note that the library will emulate filtering in Python so that haul records are filtered just as presence records are filtered by the API service. This works for "basic" and "advanced" filtering. However, at time of writing, "manual filtering" as described below using ORDS syntax is not supported when `presence_only=False`. Also, by default, a warning will be emitted when using this feature to help new users be aware of potential memory issues. This can be suppressed through the suppress_large_warning option when building the query.
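A sketch combining both behaviors follows; the `set_suppress_large_warning` setter name is an assumption mirroring the option above, so verify it against the query builder's API reference:

```python
query = afscgap.Query()
query.filter_year(eq=2021)
query.filter_srvy(eq='GOA')
query.filter_scientific_name(eq='Gadus macrocephalus')
query.set_presence_only(False)
query.set_suppress_large_warning(True)  # assumed setter for the option above

# Zero catch records are filtered client-side to match the query, so every
# yielded record (presence or absence) satisfies the year filter.
for record in query.execute():
    assert record.get_year() == 2021
```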
Cached hauls
If desired, a cached set of hauls data can be used instead of downloading the flat file. It must be a list of Haul objects and can be passed like so:
```python
import csv

import afscgap
import afscgap.inference

# Parse previously downloaded hauls metadata into Haul objects.
with open('hauls.csv') as f:
    rows = csv.DictReader(f)
    hauls = [afscgap.inference.parse_haul(row) for row in rows]

# Pass the cached hauls to the query so the flat file is not re-downloaded.
query = afscgap.Query()
query.filter_year(eq=2021)
query.filter_srvy(eq='GOA')
query.filter_scientific_name(eq='Gadus macrocephalus')
query.set_presence_only(False)
query.set_hauls_prefetch(hauls)
results = query.execute()
```
This can be helpful when executing a lot of queries, where the bandwidth to download the hauls metadata file multiple times may not be desirable.
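For instance, the same hauls list can be shared across multiple queries (a sketch; the years shown are illustrative):

```python
# Reuse the prefetched hauls across several queries instead of downloading
# the hauls metadata file once per query.
for year in [2019, 2021, 2023]:
    query = afscgap.Query()
    query.filter_year(eq=year)
    query.filter_srvy(eq='GOA')
    query.filter_scientific_name(eq='Gadus macrocephalus')
    query.set_presence_only(False)
    query.set_hauls_prefetch(hauls)
    count = sum(1 for record in query.execute())
    print(year, count)
```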