Absence vs presence data

The API itself provides access to presence-only data. This means that records are only given for hauls in which a species was found. This can cause issues when trying to aggregate data, for example, to determine the density of a species in a region in terms of catch weight per hectare. The AFSC GAP API on its own would not necessarily provide the total number of hectares surveyed in that region because hauls without the species present would be excluded. With that in mind, this library provides a method for inferring absence data.
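
For instance, consider a toy illustration with made-up numbers in which a species was caught in only 2 of 4 equally sized hauls; averaging over presence-only records doubles the apparent density:

# Hypothetical hauls as (area swept in ha, catch weight in kg). The last
# two hauls caught nothing and so would be missing from presence-only data.
hauls = [(1, 60), (1, 40), (1, 0), (1, 0)]

presence_only = [h for h in hauls if h[1] > 0]
biased = sum(w for _, w in presence_only) / sum(a for a, _ in presence_only)
correct = sum(w for _, w in hauls) / sum(a for a, _ in hauls)

print(biased)   # 100 kg / 2 ha = 50.0 kg per ha, overestimated
print(correct)  # 100 kg / 4 ha = 25.0 kg per ha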



Example of absence data in aggregation

Here is a practical, memory-efficient example using geolib and toolz to aggregate catch data by 5-character geohash.

import afscgap
import geolib.geohash
import toolz.itertoolz

query = afscgap.Query()
query.filter_year(eq=2021)
query.filter_srvy(eq='GOA')
query.filter_scientific_name(eq='Gadus macrocephalus')
query.set_presence_only(False)
results = query.execute()

def simplify_record(full_record):
    latitude = full_record.get_latitude(units='dd')
    longitude = full_record.get_longitude(units='dd')
    geohash = geolib.geohash.encode(latitude, longitude, 5)

    return {
        'geohash': geohash,
        'area': full_record.get_area_swept(units='ha'),
        'weight': full_record.get_weight(units='kg')
    }

def combine_record(a, b):
    assert a['geohash'] == b['geohash']
    return {
        'geohash': a['geohash'],
        'area': a['area'] + b['area'],
        'weight': a['weight'] + b['weight']
    }

simplified_records = map(simplify_record, results)
totals_by_geohash = toolz.itertoolz.reduceby(
    'geohash',
    combine_record,
    simplified_records
)
weight_by_area_tuples = map(
    lambda x: (x['geohash'], x['weight'] / x['area']),
    totals_by_geohash.values()
)
weight_by_area_by_geohash = dict(weight_by_area_tuples)
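
The resulting dictionary maps each 5-character geohash to catch weight per hectare, which can then be inspected directly, for example:

for geohash, cpue in weight_by_area_by_geohash.items():
    print(geohash, cpue)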

For more details, see the Python functional programming guide. All that said, for some queries, the use of Pandas may lead to very heavy memory usage.


Absence inference algorithm

Though it is not possible to resolve this issue using the AFSC GAP API service alone, this library can infer those missing records using a separate static flat file provided by NOAA and the following algorithm:

 - Record the set of species observed in the results returned by the API service.
 - Record the set of hauls observed in the results returned by the API service.
 - Request the static flat file of all hauls.
 - For each species observed in any returned result, check if it has a record for each haul in the flat file.
 - For any haul without a record of that species, yield an inferred record with zero catch weight and count from the iterator returned by the library.

This procedure is disabled by default. However, it can be enabled through the presence_only setting on a query like so: query.set_presence_only(False).


Memory efficiency of absence inference

Note that presence_only=False will return a lot of records. Indeed, for some queries, this may stretch to many millions. As described in the community guidelines, a goal of this project is to provide those data in a memory-efficient way and, specifically, these "zero catch" records are generated by the library's iterator as requested but never all held in memory at the same time. It is recommended that client code also take care with memory efficiency. This can be as simple as aggregating via for loops which hold only one record in memory at a time, as in the sketch below. Similarly, consider using map, filter, reduce, itertools, etc.
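
For instance, here is a minimal sketch of such a streaming aggregation, computing overall catch weight per hectare with a plain for loop while holding only one record in memory at a time (the query construction mirrors the example above):

import afscgap

query = afscgap.Query()
query.filter_year(eq=2021)
query.filter_srvy(eq='GOA')
query.filter_scientific_name(eq='Gadus macrocephalus')
query.set_presence_only(False)
results = query.execute()

# Accumulate just two numbers instead of keeping records in memory.
total_weight = 0
total_area = 0
for record in results:
    total_weight += record.get_weight(units='kg')
    total_area += record.get_area_swept(units='ha')

weight_per_hectare = total_weight / total_area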


Manual pagination of zero catch records

The goal of Cursor.get_page is to pull results from a page returned for a query as it appears in the NOAA API service. Note that get_page will not return zero catch records even with presence_only=False because the "page" requested does not technically exist in the API service. In order to use the negative records inference feature, please use the iterator option instead.
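
For instance, this minimal sketch counts inferred zero catch records by iterating over the cursor; the same records would never appear in output from get_page because they do not exist in any actual page of the service:

import afscgap

# Same query construction as in the examples above.
query = afscgap.Query()
query.filter_year(eq=2021)
query.filter_srvy(eq='GOA')
query.filter_scientific_name(eq='Gadus macrocephalus')
query.set_presence_only(False)
results = query.execute()

# Zero catch records only appear when iterating; they are inferred by
# the library rather than served in any page of the API service.
zero_catch_count = sum(
    1 for record in results if record.get_weight(units='kg') == 0
)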


Filtering absence data

Note that the library will emulate filtering in Python so that haul records are filtered just as presence records are filtered by the API service. This works for "basic" and "advanced" filtering. However, at the time of writing, "manual filtering" as described below using ORDS syntax is not supported when presence_only=False. Also, by default, a warning will be emitted when using this feature to help new users be aware of potential memory issues. This can be suppressed through the suppress_large_warning option on the query.
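
As a sketch of further client-side refinement, conditions beyond the supported filters can be applied lazily with Python's built-in filter function; the latitude threshold below is arbitrary and purely illustrative:

import afscgap

query = afscgap.Query()
query.filter_year(eq=2021)
query.filter_srvy(eq='GOA')
query.filter_scientific_name(eq='Gadus macrocephalus')
query.set_presence_only(False)
results = query.execute()

# Apply an additional client-side condition lazily; records still
# stream through one at a time, preserving memory efficiency.
north_results = filter(
    lambda record: record.get_latitude(units='dd') > 57,
    results
)
count_north = sum(1 for _ in north_results)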


Cached hauls

If desired, a cached set of hauls data can be used instead. It must be a list of Haul objects and can be passed like so:

import csv

import afscgap
import afscgap.inference

with open('hauls.csv') as f:
    rows = csv.DictReader(f)
    hauls = [afscgap.inference.parse_haul(row) for row in rows]

query = afscgap.Query()
query.filter_year(eq=2021)
query.filter_srvy(eq='GOA')
query.filter_scientific_name(eq='Gadus macrocephalus')
query.set_presence_only(False)
query.set_hauls_prefetch(hauls)
results = query.execute()

This can be helpful when executing many queries, where the bandwidth required to download the hauls metadata file multiple times may not be desirable.
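
For instance, here is a minimal sketch that reuses the same parsed hauls list across queries for multiple species (the second species name is an arbitrary example):

import csv

import afscgap
import afscgap.inference

with open('hauls.csv') as f:
    rows = csv.DictReader(f)
    hauls = [afscgap.inference.parse_haul(row) for row in rows]

# Reuse the same hauls list across queries, avoiding repeated
# downloads of the hauls metadata file.
for species in ['Gadus macrocephalus', 'Gadus chalcogrammus']:
    query = afscgap.Query()
    query.filter_year(eq=2021)
    query.filter_srvy(eq='GOA')
    query.filter_scientific_name(eq=species)
    query.set_presence_only(False)
    query.set_hauls_prefetch(hauls)
    results = query.execute()
    total = sum(map(lambda x: x.get_weight(units='kg'), results))
    print(species, total)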