Absence vs presence data
The API itself provides access to presence-only data. This means that records are only returned for hauls in which a species was found. This can cause issues when aggregating data, for example to determine the prevalence of a species in a region in terms of catch weight per hectare. The AFSC GAP API on its own would not necessarily provide the total number of hectares surveyed in that region because hauls without the species present would be excluded. For instance, if a species appeared in only one of three 4 hectare hauls with 10 kg caught, presence-only data would suggest 2.5 kg / hectare even though the true value across all 12 surveyed hectares is closer to 0.83 kg / hectare. With that in mind, this library provides a method for inferring absence data.
Example of absence data in aggregation
Here is a practical, memory-efficient example using geolib and toolz to aggregate catch data by 5 character geohash.
```python
import afscgap
import geolib.geohash
import toolz.itertoolz

# Build a query for Pacific cod in the 2021 Gulf of Alaska survey with
# zero catch (absence) inference enabled.
query = afscgap.Query()
query.filter_year(eq=2021)
query.filter_srvy(eq='GOA')
query.filter_scientific_name(eq='Gadus macrocephalus')
query.set_presence_only(False)
results = query.execute()


def simplify_record(full_record):
    # Reduce each record to the geohash of its location plus the two fields
    # needed for the aggregation.
    latitude = full_record.get_latitude(units='dd')
    longitude = full_record.get_longitude(units='dd')
    geohash = geolib.geohash.encode(latitude, longitude, 5)
    return {
        'geohash': geohash,
        'area': full_record.get_area_swept(units='ha'),
        'weight': full_record.get_weight(units='kg')
    }


def combine_record(a, b):
    # Sum area swept and catch weight for two records in the same geohash.
    assert a['geohash'] == b['geohash']
    return {
        'geohash': a['geohash'],
        'area': a['area'] + b['area'],
        'weight': a['weight'] + b['weight']
    }


# Lazily simplify then reduce, holding only the per-geohash totals in memory.
simplified_records = map(simplify_record, results)
totals_by_geohash = toolz.itertoolz.reduceby(
    'geohash',
    combine_record,
    simplified_records
)
weight_by_area_tuples = map(
    lambda x: (x['geohash'], x['weight'] / x['area']),
    totals_by_geohash.values()
)
weight_by_area_by_geohash = dict(weight_by_area_tuples)
```
For more details, see the Python functional programming guide. All that said, note that for some queries, loading full results into Pandas may lead to very heavy memory usage.
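For illustration, here is a minimal sketch of the Pandas route, assuming each returned record exposes a `to_dict` method (verify against the library's record documentation):

```python
import pandas

# Caution: the list comprehension materializes every record at once, which
# defeats the library's streaming design for large presence_only=False queries.
results = query.execute()
df = pandas.DataFrame([record.to_dict() for record in results])
```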
Absence inference algorithm
Though it is not possible to resolve this issue using the AFSC GAP API service alone, this library can infer those missing records using a separate static flat file provided by NOAA and the following algorithm (sketched in code after the list):
- Record the set of species observed in results returned by the API service.
- Record the set of hauls observed in results returned by the API service.
- Return records normally while records remain available from the API service.
- Upon exhaustion of the API service results, download the ~10M hauls flat file from this library's community.
- For each species observed in the API returned results, check if that species had a record for each haul reported in the flat file.
- For any haul without a record for a given species, yield a zero catch record from the iterator for that query.
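The sketch below illustrates that procedure; it is not the library's actual implementation, and `get_haul_key` and `make_zero_record` are hypothetical helpers standing in for the library's internals:

```python
def infer_absences(api_records, all_hauls):
    # Illustrative sketch of the inference algorithm above.
    species_seen = set()
    observed_pairs = set()

    # Return records normally while noting which (species, haul) pairs the
    # API service actually reported.
    for record in api_records:
        species = record.get_scientific_name()
        species_seen.add(species)
        observed_pairs.add((species, get_haul_key(record)))  # hypothetical helper
        yield record

    # Upon exhaustion of the API results, walk the hauls flat file and yield
    # a zero catch record for each (species, haul) pair never reported.
    for haul in all_hauls:
        for species in species_seen:
            if (species, get_haul_key(haul)) not in observed_pairs:
                yield make_zero_record(haul, species)  # hypothetical helper
```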
This procedure is disabled by default. However, it can be enabled through the presence only option on the query builder like so: `query.set_presence_only(False)`.
Memory efficiency of absence inference
Note that `presence_only=False` will return a lot of records. Indeed, for some queries, this may stretch to many millions. As described in the community guidelines, a goal of this project is to provide those data in a memory-efficient way and, specifically, these "zero catch" records are generated by the library's iterator as requested but are never all held in memory at the same time. It is recommended that client code also take care with memory efficiency. This can be as simple as aggregating via `for` loops which only hold one record in memory at a time. Similarly, consider using `map`, `filter`, `reduce`, itertools, etc.
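For example, a running total like the one below holds only one record and two accumulators in memory at a time (reusing the query built earlier in this section):

```python
# Stream records one at a time, accumulating catch weight and area swept
# without ever materializing the full result set.
total_weight = 0
total_area = 0

for record in query.execute():
    total_weight += record.get_weight(units='kg')
    total_area += record.get_area_swept(units='ha')

cpue = total_weight / total_area  # kg per hectare across all hauls
```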
Manual pagination of zero catch records
The goal of `Cursor.get_page` is to pull results from a page returned for a query as it appears in the NOAA API service. Note that `get_page` will not return zero catch records even with `presence_only=False` because the "page" requested does not technically exist in the API service. In order to use the negative records inference feature, please use the iterator option instead.
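As a sketch of the distinction (detecting absences by zero weight here is purely for illustration):

```python
results = query.execute()

# Iterating the cursor interleaves inferred zero catch records with the
# presence records returned by the API service.
zero_catch_count = sum(
    1 for record in results
    if record.get_weight(units='kg') == 0
)

# By contrast, a manual page fetch such as results.get_page() only surfaces
# what the API service itself returns, so no zero catch records appear there.
```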
Filtering absence data
Note that the library will emulate filtering in Python so that haul records are filtered just as presence records are filtered by the API service. This works for "basic" and "advanced" filtering. However, at time of writing, "manual filtering" as described below using ORDS syntax is not supported when `presence_only=False`. Also, by default, a warning will be emitted when using this feature to help new users be aware of potential memory issues. This can be suppressed through the suppress_large_warning option when building the query.
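A sketch combining both behaviors follows; the `set_suppress_large_warning` setter name is an assumption mirroring the option above, so verify it against the query builder's API reference:

```python
query = afscgap.Query()
query.filter_year(eq=2021)
query.filter_srvy(eq='GOA')
query.filter_scientific_name(eq='Gadus macrocephalus')
query.set_presence_only(False)
query.set_suppress_large_warning(True)  # assumed setter for the option above

# Zero catch records are filtered client-side to match the query, so every
# yielded record (presence or absence) satisfies the year filter.
for record in query.execute():
    assert record.get_year() == 2021
```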
Cached hauls
If desired, a cached set of hauls data can be used instead of downloading the flat file. It must be a list of Haul objects and can be passed like so:
```python
import csv

import afscgap
import afscgap.inference

# Parse previously downloaded hauls metadata into Haul objects.
with open('hauls.csv') as f:
    rows = csv.DictReader(f)
    hauls = [afscgap.inference.parse_haul(row) for row in rows]

# Pass the cached hauls to the query so the flat file is not re-downloaded.
query = afscgap.Query()
query.filter_year(eq=2021)
query.filter_srvy(eq='GOA')
query.filter_scientific_name(eq='Gadus macrocephalus')
query.set_presence_only(False)
query.set_hauls_prefetch(hauls)
results = query.execute()
```
This can be helpful when executing a lot of queries, where the bandwidth to download the hauls metadata file multiple times may not be desirable.
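For instance, the same hauls list can be shared across multiple queries (a sketch; the years shown are illustrative):

```python
# Reuse the prefetched hauls across several queries instead of downloading
# the hauls metadata file once per query.
for year in [2019, 2021, 2023]:
    query = afscgap.Query()
    query.filter_year(eq=year)
    query.filter_srvy(eq='GOA')
    query.filter_scientific_name(eq='Gadus macrocephalus')
    query.set_presence_only(False)
    query.set_hauls_prefetch(hauls)
    count = sum(1 for record in query.execute())
    print(year, count)
```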