1 Background

Citizen science–herein defined as the involvement of volunteers who are typically not professional scientists in the production of scientific knowledge–can generate biodiversity data at spatial, temporal, and taxonomic scales difficult to achieve by any other means. As a result, citizen science is becoming increasingly important in research, management, and policy. Over 50% of data in the Global Biodiversity Information Facility (GBIF)–the international government-funded repository for the world’s species occurrence data in space and time–are citizen science observations, and 6 of the top 10 datasets on the GBIF network are citizen science datasets (GBIF 2020). Citizen science data form the basis of hundreds of peer-reviewed research publications every year, and increasingly of conservation efforts and strategies at local, regional and global scales.

The Dynamic Observatory of Biodiversity (DOB) for the California coast integrates and models information from crowd-sourced citizen science data together with long-term monitoring surveys and oceanographic model outputs to produce synthetic indicators of biodiversity and biodiversity change on the California coast. At its core, the DOB takes advantage of the vast amounts of information on the occurrence of many organisms across the whole coast and throughout the year that can be extracted from crowd-sourced observations gathered through the large-scale citizen science initiatives Snapshot Cal Coast and iNaturalist.org.

Observation biases in crowd-sourced citizen science data are addressed using a series of steps including data filtering, joint statistical inference across space, time, and taxonomy and integration with more systematic yet scarcer long-term monitoring data. Regularly updated as new information becomes available, the DOB provides insights into the pulse of biodiversity on the California coast.

Read the project’s full summary report here

2 Five steps to extract biodiversity indicators from crowd-sourced citizen science data

Although they are noisier than standardized ecological data sources, many of the systematic biases of crowd-sourced citizen science data also occur in data collected systematically by professional scientists (Kosmala et al. 2016): spatially and/or temporally non-random observations, uneven sampling effort over space, time, or taxa, and uneven detectability between rare and common species. Because these forms of bias have been known to occur for many years in data collected by professional scientists, many methods have been developed to control for and model these biases, provided that the relevant metadata are recorded (Bird et al. 2014).

This section provides an overview of the five most useful steps–a recipe of sorts–for dealing with bias in crowd-sourced citizen science data and, thus, help realize the potential of these data for research and management. We note that the steps presented below are not necessarily intended as sequential and are certainly not mutually exclusive. Instead, depending on the desired output, appropriately accounting for bias in crowd-sourced citizen science data may well involve combining several or all of these steps and iterating through them until the noise in the data is minimized.

2.1 Filtering data

Filtering is the process of selecting a smaller part of a data set and using that subset in subsequent analyses. Filtering ecological datasets is frequently used as a way to minimize bias to reveal a signal of biological change (Hickling et al. 2006, Roy et al. 2012). In the context of crowd-sourced citizen science data, filtering data serves two main purposes: reducing measurement error and equalizing observer effort. First, filtering data can be an effective means of reducing spatial, temporal, and/or taxonomic uncertainty in the data. iNaturalist labels its observations based on specific data quality filters, with Research Grade observations achieving the highest data quality standards. Research Grade observations meet all the basic metadata standards for useful species occurrence data-a taxon identification, a geospatial reference, and a timestamp-and are less likely to be subject to error in all those metadata fields than non-Research Grade observations. In particular, Research Grade observations have previously been found to be subject to rates of taxonomic identification error comparable to that of other ecological data sources: Research Grade identifications are correct approximately 85% of the time based on expert identifications (Loarie 2017, Ueda 2019). iNaturalist Research Grade observations are integrated in the Global Biodiversity Information Facility and make up a significant proportion of hundreds of papers every year. As a result, we calculated biodiversity indicators on the California coast based exclusively on iNaturalist Research Grade observations and did not consider any other type of observation on iNaturalist. Moreover, we only used Research Grade observations identified at the rank of species. These basic filters apply to all analyses described herein and all outputs presented. Second, filtering data can be an effective means to equalize observation effort over space, time, and/or taxa. Basic filters can ensure that only areas, time periods, and/or taxa that contain a minimum amount of information are included in analyses. For instance, we derived place-based biodiversity indicators-species richness, species rarity, biodiversity uniqueness, and biodiversity irreplaceability-exclusively for places in which at least 10 species had been observed (see section 3.3 for more details). Similarly, we provide species-based indicators exclusively for species with at least 100 observations on iNaturalist. A more in depth form of filtering-rarefaction-can help to even out observation effort between two areas, time periods, or taxa. Rarefaction is the process of rarefying or thinning a reference sample by repeatedly drawing random subsets of observations in order to standardize comparisons of biodiversity on the basis of a shared number of observations between samples (Gotelli and Chao 2013). Subsampling and rarefaction have been previously used to estimate changes between two broad time periods based on large-scale volunteer-based datasets such as the published Atlases of British birds, butterflies and plants (Warren et al. 2001, Thomas et al. 2004). During the course of this project, we occasionally used rarefaction to tease out ecological signals from crowd-sourced citizen science data on the California coast. For instance, yearly ecological community change values for different places are based on rarefied samples between years (see section 3.3.1.5).

2.2 Aggregating data across space and time

Aggregating data is the process of gathering and combining data and expressing those in a summary form in subsequent statistical analyses. The primary purpose of data aggregation is to derive joint inferences based on grouped data points, whilst reducing the influence of any single data point on the overall inference. Owing to the law of large numbers, the larger the aggregated dataset, the more information-rich and the less biased the overall inference (Bird et al. 2014). As a result, collecting and aggregating a sufficiently large amount of data can reduce bias in crowd-sourced citizen science data. For example, in the past, species occurrence data from opportunistic sources including citizen science were collated into broad time periods and over larger spatial extents, such as in published Atlases (e.g. British and North American Breeding Birds). This compensates to some degree for variation in observer effort and activity, enabling the assessment of changes in species’ distributions between large time periods and areas (Shaffer et al. 1998, Thomas et al. 2004, Tingley and Beissinger 2009, Botts et al. 2012). Fortunately, crowd-sourced citizen science data from Snapshot Cal Coast and iNaturalist accumulate at the rate of tens of thousands of observations every year throughout the California Coast. As a result, aggregating over certain spatial and temporal scales can be an effective means of reducing the noise in the data. Our analyses focused on providing place-based indicators that summarize biodiversity over the last decade at the level of a watershed, county, or Marine Protected Area. We found that a trade-off exists between the spatial, temporal, and taxonomic scale of inference, such that increasing the resolution in any one of these dimensions entails decreasing the resolution in another, else data aggregates do not hold enough information. For this reason, we reduced the spatial scale of our estimates of species-based yearly trends to a given coastal region or statewide, because of the lack of information to further resolve temporal trends in geographical space. When analyzing crowd-sourced citizen science data, a particularly useful instance of spatiotemporal aggregation is the definition of a discrete observation event. This step is key to determine the basic units of analysis-the units that are compared with each other across the whole crowd-sourced dataset-and identify important attributes of each unit such as the observation effort or observer behavior. Analyses of opportunistic species occurrence data commonly involve defining observation events, even if this is not explicitly highlighted as such. For instance, the analysis of historic surveys within the context of the Grinnell Resurvey Project involved making decisions about which museum specimens from the 1930s should be aggregated as part of the same trapping event (Moritz et al. 2008, Tingley et al. 2009). Selecting the degree of spatiotemporal aggregation that defines an observation event is a complicated attempt to reverse-engineer survey structure when none exists, and should likely be based on a combination of data availability and species biology. A particularly in-depth example of spatiotemporal aggregation using crowd-sourced citizen science data is eBird’s Adaptive Spatio-Temporal Exploratory Model (AdaSTEM; Fink et al. 2013). AdaSTEM involves mining the full set of observations to identify discrete spatiotemporal blocks that satisfy a bias-variance tradeoff between a sufficient sample size to fit a good model-limiting the variance of estimates- and a small enough sample size to assume stationarity-controlling for bias (Fink et al. 2013, Fink et al. 2020). Across all analyses and outputs presented herein, we defined an observation event as any set of iNaturalist observations made by a single user over a single day within the same 10km2 area (see section 3.3.2.1).

2.3 Borrowing strength across taxa

Data aggregation across multiple species and/or higher taxa constitutes a key step to deal with bias in crowd-sourced citizen science data. Even in cases where a single species is the primary focus of analyses, taking advantage of existing information on additional species associated with the focal species spatially, temporally, and/or taxonomically (or phylogenetically) can significantly improve inferences on the focal species itself. The overarching assumption here is that the observation of a second species can indicate the likelihood of presence or absence of a focal species by informing on either the observation process or the ecological process that produced the set of observations of the focal species. For instance, evidence that many commonly observed species were not observed during an observation event could indicate that the absence of a focal species is the result of a low observation effort at that location and time. Therefore, by assuming that the primary observation biases are shared across all species, we can borrow strength across species to efficiently estimate bias and improve inferences across all species, no matter how common or rare. Furthermore, evidence of the presence of several prey species may be used to infer the presence of a focal species that predates on them, even if the focal species was not itself observed. Estimating the likelihood of false absence for a focal species from observations of additional species assumes that multiple species are observed together using a pre-defined “search” list (or checklist) of potentially observable species. Unfortunately, this information is seldom recorded as metadata in crowd-sourced citizen science databases (except e.g. eBird 2020). Deciding which additional species are relevant to inferences on a focal species is essentially an attempt to reverse-engineer the observation process and simulate a survey structure. The concept of borrowing strength across taxa has been central to many of the approaches developed to derive inferences from volunteer-based species occurrence data. Telfer et al. (2002) used the change in the number of observations across all species combined as an indirect measure of how observation effort differed between two samples. If observation effort is higher in a later period compared to an earlier period, all species are expected to show increases between these two periods and deviations from this expected trend represent an index of change in a focal species. Expanding on this idea, Ball et al. (2011) proposed a model called Reporting Rate, which estimates the occurrence rate of a focal species as the proportion of all observation events in a given year that include an observation of the focal species, assuming that this would effectively control for changes in overall observation effort across years. Yet another way to borrow strength across taxa is to use a benchmark species-a common species whose occurrence is assumed to show no overall trend-as a proxy for overall observation activity (Hill 2012). Consequently, the process of borrowing strength across species permeated all our species-based analyses. For instance, we derived species’ yearly trends based on a modified version of the Reporting Rate model of Ball et al. (2011) whereby the focal species’ population in a year was estimated as the proportional rate of observation of the focal species in relation to all observations of associated species that year (see section 3.3.2.3). To define the list of associated species-species which we can assume were part of the same “search” list across observers-we mined all iNaturalist observations to identify aggregates of 50 species with the most similar observation histories in space and time (see section 3.3.2.2). Moreover, we built species distribution models jointly for multiple species of invertebrates and seaweeds, assuming that all species would inform each other’s observation biases (see section 3.3.2.4).

2.4 Statistically correcting for observation bias

Filtering and aggregating data are useful approaches to reverse-engineer survey structure and bring the ecological signals in the data to the foreground. However, these approaches require trading-off the spatial, temporal, or taxonomic resolution of inferences and cannot by themselves generate conclusions on the drivers of biodiversity change. For a higher resolution picture of biodiversity change and its underlying drivers, it is necessary to include measures of observation bias as correction factors within a statistical model. For instance, Szabo et al. (2010) proposed a modification of the Reporting Rate model in which observation events are not aggregated over years or regions but instead constitute the units of analysis in a linear model, effectively accounting for variation in the number of observation events over space and time. Importantly, to account for variation in effort among observation events, this model includes a term for the number of species observed in each observation event as a correction factor for observation effort. This correction factor, also known as the length of the species “search” list or list length, has previously been found to provide a useful proxy for uneven observation effort (Szabo et al. 2010, van Strien et al. 2013, Isaac et al. 2014, Rapacciuolo et al. 2017, Powney et al. 2019). The switch to a linear modeling framework enabled researchers to better control for uneven coverage in space by adding a spatial autocorrelation term (e.g. Kuussaari et al. 2007, Roy et al. 2012), as well as trial the use of several predictor variables on the right hand side of the equation to help explain variation in the frequency of observation of a focal species among observation events. Additional predictor variables that have been found to potentially help account for bias include the duration spent searching, the distance traveled during an observation event, as well as estimates of observer expertise (Kelling et al. 2015, Johnston et al. 2018). Though these metadata are not being tracked by iNaturalist, our analyses have shown that these predictors could all theoretically be derived after the fact from the metadata that are actually being recorded. Statistically correcting for the fourth form of bias in crowd-sourced citizen science data- uneven detectability among species-required yet another innovation: occupancy models (MacKenzie 2006). Occupancy models take advantage of replicated visits within finite spatiotemporal windows-where the ecological process is assumed to remain unchanged-to estimate the conditional probability that a species was observed when present (MacKenzie 2006). These models consist of two hierarchically coupled submodels: one modeling the ecological process (is the species present given the biotic and abiotic environment?) and the other modeling the observation process (was the species observed given that it is present?). While these models have been developed for use with standardized survey data, they have more recently been applied to species occurrence data (van Strien et al. 2010, van Strien et al. 2013, Powney et al. 2019, Outhwhite et al. 2020). For the purposes of our analyses, we included several statistical correction factors of observation bias in our models of species’ geographic ranges and changes in geographic range. Specifically, we included three predictors of uneven observation effort in space and time in our species distribution models: the total number of observations, the total number of observers, and the number of high-effort observation events (that is observation events generating at least 10 observations).

2.5 Integrating standardized data sources

Unstructured data from crowd-sourced citizen science cannot and should not be thought of as a substitute for structured and standardized survey data. Standardized survey data provide high-quality and reliable information on the presence and absence of species. However, due to their intensive, standardized and expert-based nature, these programs are limited to specific species and/or sites at a handful of times per year. On the other hand, crowd-sourced observations of biodiversity obtained through large-scale citizen science campaigns and iNaturalist provide vast amounts of information on the presence of many organisms and ecosystems globally and throughout the year. However, due to their opportunistic and unstructured nature, crowd-sourced citizen science data require careful treatment for observation bias. Fortunately, recent developments in large-scale ecological models mean that users need no longer choose between these two data types but can instead combine them to exploit their respective strengths (Fithian et al. 2014, Isaac et al. 2020). Indeed, the model-based integration of smaller quantities of long-term biodiversity monitoring data with larger quantities of crowd-sourced citizen science data can provide the best inferences on the abundance and distribution of species. Using this approach, monitoring data can help to pinpoint biases in citizen science data while citizen science data can help to fill gaps in the sparse coverage of monitoring data. Model-based data integration has recently become more accessible thanks to the development of point process models: a shared statistical framework to conceptualize and capture mathematically how multiple data types can emerge from a single ecological system (Warton et al. 2013, Fithian et al. 2014, Isaac et al. 2020). A point process is a statistical description of how points are distributed in space. In the context of ecological systems, the ‘points’ can be thought of as individuals of one or multiple species dotted around a landscape, while the ‘process’ describes the density of points as a function of area via an intensity surface. The intensity can vary in space and time, with a higher intensity indicating that a species is more likely to occur at a particular location. From this relatively simple conceptualization of an ecological process, multiple data types can arise based on different observation processes: counting all individuals over a specified area generates abundance data, recording the detection or non-detection of a species over a specified area generates presence-absence data, and the opportunistic observation of a subset of individuals across the whole landscape generates presence-only data (Isaac et al. 2020). The point process framework therefore encompasses the most common data types and model structures that exist to describe ecological systems (Warton et al. 2013). When using point processes to model the abundance and distribution of species, the intensity of points can be statistically related to variation in the physical environment (e.g. habitat, sea surface temperature, sea level) to obtain estimates of the species’ density across the whole landscape and over time. In the current project, we modeled the density and distribution of a subset of rocky intertidal species on the US West Coast using the point process modeling framework of Fithian et al. (2014), which enables the integration of opportunistic presence-only observations and presence-absence data from standardized surveys. Specifically, we integrated crowd-sourced citizen science data from iNaturalist with standardized survey data from the Multi-Agency Rocky Intertidal Network’s long-term plots and counts across multiple species. We then modeled these biological data as a function of physical environmental variables including sea surface temperature, salinity, sea level, bathymetry, and shore type (see section 3.3.2.4 for more details). These models enabled us to generate the best available estimates of the density and distribution of 14 species of invertebrates, kelp, and seaweed on the US West Coast from 2012 to 2019.

3 Methodology

3.1 Places

3.1.1 California Current System

At the largest spatial scale, we limited all analyses to data collected within the California Current System off the United States west coast from Washington State to the Baja California peninsula in Mexico. This choice was based on the necessity to include information beyond California for a more comprehensive understanding of the ecology of California species, many of which are found outside California and across the Pacific Northwest. Unfortunately, due to data constraints, we could not include data from Baja California, British Columbia, or Alaska in present analyses though this may be desirable for future analyses.

3.1.2 California coast

The main region of focus for most outputs presented in this report was the California coast. We defined California’s coastal area broadly, based on extending CDFW’s California coastal habitat layer offshore to State waters and inland to include bays and estuaries, as well as some urban areas in the vicinity of the coast. The main coastal habitats included across this area were rocky shores, beaches, dunes, and wetlands.

3.1.3 Coastal regions

For a subset of analyses, we further subdivided the California coast into three coastal regions: a North region, a Central region, and a South region. These three coastal regions were defined in line with California’s OPC definitions (MPA Monitoring Action Plan 2018). The North region goes from Alder Creek on the Oregon border to Pigeon Point, the central region goes from Pigeon Point to Point Conception, and the South region goes from Point Conception to the US/Mexico Border.

3.1.4 Coastal watersheds

For a subset of analyses, we considered all California watersheds overlapping the California coast (as defined in 3.1.2) to any extent.

3.1.5 Coastal counties

For a subset of analyses, we considered all California counties overlapping the California coast (as defined in 3.1.2) to any extent.

3.1.6 Marine Protected Areas

For a subset of analyses, we considered all California Marine Protected Areas (MPAs), excluding Special Closures.

3.1.7 Hexagonal grid cells

In order to provide higher resolution estimates for a number of biodiversity indicators, we generated an arbitrary hexagonal grid over the California coast (as defined in 3.1.2). We set the size of each hexagonal grid cell as the highest resolution that would still meet data quality standards. This resolution was approximately 20km.

3.2 Data

3.2.1 iNaturalist

We extracted all available Research Grade observations on iNaturalist.org over the California Current System (as defined in 3.1.1) from 2010 to 2019. These included but were not limited to all Research Grade observations generated as part of Snapshot Cal Coast, from its inception to the present. We subsequently filtered out all observations without spatial coordinates, year, or species-level identifications. Additionally, for all analyses, we focused on intertidal organisms, since those are the most thoroughly documented taxa over the study areas selected. As a result, we selected exclusively observations for the kingdoms Animalia (animals) and Chromista (kelp, seaweed, diatoms, and allies)-thus excluding all fungi, mosses, lichens, ferns, red algae, green algae, and vascular plants (except seagrasses from the genera Zostera and Phyllospadix which are intertidal organisms). Moreover, we excluded a number of animal taxa that cannot be considered primarily intertidal including all vertebrates other than fish, insects, and arachnids. We do not mean to imply that analyses for these taxa are not possible on the coast or elsewhere, or that these taxa do not constitute coastal organisms; however, the choice to exclude these taxa was motivated by the fact that a thorough understanding of their abundance and distribution would require extending the study area beyond strictly coastal areas (to either subtidal or terrestrial environments).

3.2.2 Multi-Agency Rocky Intertidal Network

We obtained data from the Multi-Agency Rocky Intertidal Network (MARINe; https://marine.ucsc.edu) long-term plots and/or species counts for 14 species from 2012 to 2019 across all sites in California, Oregon, and Washington overlapping with the California Current System (as defined in 3.1.1). Specifically, we obtained data for the following species or species lumping codes: Pisaster ochraceus, Lottia gigantea, Tetraclita rubescens, Mytilus californianus, Pollicipes polymerus, Semibalanus cariosus, Anthopleura elegantissima/sola, Chthamalus spp/Balanus glandula, Egregia menziesii, Eisenia arborea, Pelvetiopsis limitata, Sargassum muticum, Silvetia compressa, and Phyllospadix spp.

3.2.3 Regional Ocean Modeling System

We extracted data from the West Coast configuration of the Regional Ocean Modeling System (ROMS; Shchepetkin and McWilliams 2005). This ROMS configuration is run by the University of California Santa Cruz Ocean Modeling group and covers the whole California Current System (as defined in 3.1.1) at a resolution of 10km2. Specifically, we extracted daily values from 2011 to 2019 for the following physical variables: sea surface temperature, sea surface salinity, bathymetry, and sea level. We calculated yearly medians, minima (the 10th percentiles), and maxima (the 90th percentile), for these four variables by summarizing daily values in each year.

3.2.4 Environmental Sensitivity Index

We extracted data on shoreline habitat types from NOAA’s Environmental Sensitivity Index (ESI; https://response.restoration.noaa.gov/esi_download) for California, Oregon, and Washington. Data included vector lines and polygons representing the shoreline and coastal habitats of the California coast, classified according to the Environmental Sensitivity Index (ESI) classification system. The main shoreline and coastal habitat classes included in ESI are (1) Armored, (2) Rocky and Steep Shorelines (Bedrock/Sand/Clay), (3) Beaches (Sand/Gravel), (4) Flats (Mud/Sand), and (5) Vegetated (Grass/Marsh/Mangroves/Scrub-Shrub).

3.3 Place-based Biodiversity Indicators

3.3.1 Species richness

We calculated species richness using the Chao1 nonparametric estimator (Chao 1984), which provides an estimate of the asymptotic species richness of a community, that is the presumed asymptote of species richness beyond which additional sampling does not yield any new species. Based on the concept that rare species carry the most information about the number of unobserved species, Chao1 estimates the frequency of unobserved species in a community (those that were there but do not have corresponding observations on iNaturalist for that place) using the number of infrequently observed species (those observed only once or twice). Specifically, the Chao1 estimator uses the relationship between singletons (species observed only once) and doubletons (species observed only twice), together with the observed number of species, to obtain an estimate of asymptotic species richness. The higher the ratio of singletons to doubletons, the higher the estimated number of unobserved species and, therefore, the higher the difference between estimated and observed species richness.

3.3.2 Species rarity

We estimated species rarity using the community-weighted endemism index (Crisp et al. 2001, Guerin et al. 2015), which measures the richness of narrow-ranged or endemic species in a given place. This metric is obtained by dividing weighted endemism-the number of species in a place weighted by the inverse of their range sizes-by species richness (Crisp et al. 2001, Guerin et al. 2015).

3.3.3 Biodiversity uniqueness

Biodiversity uniqueness measures the distinctiveness of the ecological community in a given place compared with all other places. We examined patterns of biodiversity uniqueness by calculating the median dissimilarity in the ecological community composition of each place compared with all other places (i.e. the beta-diversity) in the network-whether watersheds, counties, MPAs, or hexagons. We used the complement of the Morisita-Horn overlap index (Beck et al. 2013) as our measure of ecological community dissimilarity between places. Moreover, we calculated the complement of the Morisita-Horn overlap index using the number of observations of each species in each place, therefore taking into account the relative rate of observation of species as well as their overall detection-nondetection in place. We chose the complement of the Morisita-Horn index as our measure of dissimilarity because this metric has previously been found to be insensitive to under-sampling of communities and, as such, to produce robust estimates of dissimilarity even when diverse communities may have been incompletely surveyed (Beck et al. 2013). Values range from 1 (or 100% if expressed as a percentage) for ecological communities completely distinct from all others to 0 for ecological communities identical to all others.

3.3.4 Biodiversity irreplaceability

Biodiversity irreplaceability measures how important a place is compared with all others in the network to ensure that a sufficient portion of the range of all species in the network is preserved in the most efficient manner. We estimated biodiversity irreplaceability using the core-area zonation algorithm implemented within the spatial conservation planning software Zonation (Moilanen et al. 2012). The core-area zonation algorithm takes into account the number of places in the network in which each species occurs and starts by assigning the highest irreplaceability score to the place in which the single rarest species occurs, followed by the place where the next rarest species occurs and so on, until places that only include species that are found in several other places in the network are assigned the lowest irreplaceability scores. In practice, the core-area zonation algorithm first prioritizes the places that include the highest proportion of any single species (for example, a species endemic to a given MPA or watershed), then prioritizes the next remaining cell that includes the highest proportion of any single species, and so forth, until all species are represented. Irreplaceability values range from 1 (or 100% if expressed as a percentage) for the most irreplaceable place to 0 for the least irreplaceable place.

3.3.5 Ecological community change

We estimated ecological community change using a measure of rarefied temporal beta-diversity. Specifically, for a given place, we estimated the similarity in community composition between successive years using the complement of the Morisita-Horn overlap index (Beck et al. 2013). Importantly, we controlled for uneven observation effort across years by subsampling the minimum number of observations generated in any year with at least 100 total observations (essentially setting a minimum sample size of 100 observations) from years with more observations. To avoid loss of information, we re-ran the subsampling procedure 100 times and used the median dissimilarity measure.

3.4 Species-based indicators

For a subset of species, we estimated changes in regional population trends over the last decade, geographic ranges at present and changes in geographic range over the last decade.

3.4.1 Observation events

As discussed in section 2.2, aggregating data is a powerful way to increase its information content, albeit at the expense of increased resolution over space, time, and/or taxa. As a result, the fundamental units of analysis we used for species-based indicators were data aggregates we termed “observation events”. Observation events are events or “visits” (as defined by Szabo et al. 2010, Isaac et al. 2014) that can be recognized as discrete and somewhat independent from other such events. We defined an observation event as any set of iNaturalist observations made by a single user over a single day within the same 10km2 grid cell. The choice of area was made to maximize the trade-off between resolution and information content of observation events. We assigned observation event IDs to all iNaturalist observations in our dataset (see section 3.2.1) and grouped observations by their respective observation event IDs.

3.4.2 Species associations

As discussed in section 2.3, another benefit of aggregating data for many different species is the ability to “borrow strength across taxa” when deriving inferences on a focal species from the data. One of the main advantages of doing so is to attempt to define an overall list of species that observers could have presumed to be searching for (a “search” list). Evidence that one or many species on that list were found but not the focal species, increases the likelihood that the focal species may have been present but was unobserved. Deciding which species may have constituted an observer’s search list may be relatively easy for some datasets, for instance if they have a reduced taxonomic scope (e.g. all birds), but can be challenging in most cases. For the purposes of this project, we took the approach of letting the data speak for themselves as to what constitutes observers’ typical search lists on the California coast. Specifically, we implemented the following methodology to identify species frequently found on the same search lists and, therefore, associated with each other. First, we extracted all observation events that included at least two species from the full set of observations. Second, we constructed a species-by-event presence-absence matrix indicating the events each species had been observed in. Third, we calculated the similarity between species’ search histories-that is the sequence of events in which those species were observed-using the complement of the Jaccard dissimilarity index. This step assigned values of species association ranging from 0, for species never having been observed during the same event, to 1, for species having only ever been observed at the same time. Finally, we extracted the 49 species with the most similar observation histories to each species in the set to identify subsets of 50 associated species for each species in the set. This procedure enabled us to identify subsets of associated species which were more likely to provide information on the detectability and observation bias among each other than random species in the dataset.

3.4.3 Population trends

We estimated yearly population trends from 2013 to 2019 for a subset of species on the California coast-both Statewide and for each Coastal region (see 3.1.3)-by combining species’ observation and search histories using a modified version of the Reporting Rate model (Ball et al. 2011, Isaac et al. 2014). Based on the Reporting Rate model, the yearly population trend is estimated as the yearly change in the proportional rate of observation: the number of observation events within a given year where the focal species was observed divided by the number of observation events within a given year where any of the 50 species associated with the focal species (see 3.3.2.2) were observed. Since not all observation events have equal weight, with some including observations of many more associated species than others, we applied a small modification to the reporting rate model, whereby observation events of associated species in the denominator were counted as many times as the number of distinct associated species included- and not only once. In effect, this modification weights observation events by the number of associated species observed - also known as the list length (Szabo et al. 2010, Isaac et al. 2014). To ensure that the trends presented are as reliable as possible, we did not calculate trends for species with less than 100 observations on the California coast from 2013 to the present, or for species for which species associations could be confidently estimated from the data. The final set of species for which temporal trends were calculated can be found at URL. For a subset of species, we validated yearly population trend estimates derived from iNaturalist observations with long-term data from the Multi-Agency Rocky Intertidal Network (MARINe). Specifically, we used data from MARINe’s long-term plots for the following species or species groups: Pisaster ochraceus, Lottia gigantea, Tetraclita rubescens, Mytilus californianus, Pollicipes polymerus, Semibalanus cariosus, Anthopleura elegantissima/sola, Chthamalus spp/Balanus glandula, Egregia menziesii, Eisenia arborea, Pelvetiopsis limitata, Sargassum muticum, Silvetia compressa, and Phyllospadix spp. To make MARINe data comparable, we summarized MARINe data yearly and at the statewide or regional levels, by averaging the percentage cover or individual counts over all surveys in a year at MARINe sites included in the respective spatial boundaries (Statewide, North coast, Central coast, or South coast). To estimate the agreement between yearly trends from iNaturalist and MARINe data, we calculated Pearson’s linear correlation coefficient between the two yearly trends. In addition, to gain insight into the correlates of positive versus negative agreement between iNaturalist- and MARINe-derived trends, we built linear models of how agreement varied with attributes of the species modeled. The species attributes we considered were the local mean population size of the species (estimated from MARINe data), the number of iNaturalist observations for the species, the number of iNaturalist observations across all associated species, and the species’ mean proportional rate of observation on iNaturalist, calculated across all years.

3.4.4 Species distribution models

We built species distribution models for a subset of species across the California Current System (as defined in 3.1.1) using the point process modeling framework of Fithian et al. (2014). We chose this model because it enables taking advantage of all five steps to minimize bias in crowd-sourced citizen science data outlined in section 2.1. Specifically, in addition to accepting data pre-processed based on filters (see section 2.1) and aggregation rules (see section 2.2), this framework enables borrowing strength across taxa by jointly modeling multiple species (see section 2.3), incorporating statistical correction factors for uneven observation effort in space and time (see section 2.4), and integrating opportunistic presence-only observations with presence-absence data from standardized surveys (see section 2.5). As a result, this and similar frameworks are considered to be the emerging gold standard for estimating changes in the abundance and distribution of species in space and time (Isaac et al. 2020). We believe that this approach has the highest potential to make the best use of crowd-sourced citizen science data in combination with additional data sources. Our species distribution models integrated crowd-sourced citizen science data from iNaturalist with standardized survey data from MARINe’s long-term plots and counts for multiple species across the California Current System (as defined in 3.1.1) from 2012 to 2019. Specifically, we built two separate models: an invertebrate model with 8 species and a kelp and seaweed model with 6 species. The invertebrate model included iNaturalist and MARINe data for the following 8 species or MARINe species lumping codes: Pisaster ochraceus, Lottia gigantea, Tetraclita rubescens, Mytilus californianus, Pollicipes polymerus, Semibalanus cariosus, Anthopleura elegantissima/sola, and Chthamalus spp/Balanus glandula. The kelp and seaweed model included iNaturalist and MARINe data for the following 8 species or MARINe species lumping codes: Egregia menziesii, Eisenia arborea, Pelvetiopsis limitata, Sargassum muticum, Silvetia compressa, and Phyllospadix spp. For species observed individually on iNaturalist but lumped into a cluster in MARINe, we grouped iNaturalist observations to mirror the corresponding MARINe lumping code; that is, we grouped iNaturalist observations of Anthopleura elegantissima, Anthopleura sola, and Anthopleura xanthogrammica in a Anthopleura elegantissima/sola group, iNaturalist observations of Chthamalus dalli, Chthamalus fissus, and Balanus glandula in a Chthamalus spp/Balanus glandula group, and iNaturalist observations of Phyllospadix scouleri and Phyllospadix torreyi in a Phyllospadix spp group. We converted MARINe estimates of percentage cover for each species in each survey at each site into binary presence-absence data over a year at each site. Importantly, we assumed that MARINe long-term plots included information on the absence of species from a MARINe site, that is that species with a 0% cover or 0 count in all surveys in a year at a site were absent from that site in that year. In parallel, we used iNaturalist observations as presence-only observations. We modeled presence-only and presence-absence biological data as a function of two sets of predictors-predictors of observation effort and ecological predictors-at a spatial resolution of 10km2 (approximately 1/10 degree) over the extent of the California Current System (similar to the spatial scale of the West Coast ROMS model; see section 3.2.3). We included the following predictors of bias in the presence-only observation process portion of the model: the total number of Research Grade observations and the total number of distinct iNaturalist observers. These predictors of bias were calculated from the full iNaturalist dataset described in section 3.2.1, not exclusively based on the iNaturalist observations for the 14 species included in the models. We included the following predictors in the ecological process portion of the model: minimum sea surface temperature, median sea surface salinity, median bathymetry, maximum sea level, and shore type. Shore type was derived from the ESI database (see section 3.2.4), while all other physical variables were derived from the California Current implementation of the ROMS model (see section 3.2.3). We included in the models linear effects for the number of observers, median bathymetry, and shore type, and modeled all other predictors as natural cubic splines to allow for non-linear effects. We ran two instances each of the invertebrate and kelp and seaweed models: one overall model for all years combined from 2013 to 2019, and one yearly model separating all data types among years. Both models used the same response and predictor variables, but the yearly model included an additional term for year. We derived two types of outputs from the species distribution models. First, we generated maps of density for each of the 14 species modeled at the resolution of 10km2 over the extent of the California Current System for all years combined or yearly from 2013 to 2019. Second, we generated response curves to represent the modeled relationships between species density and ecological predictors in the model; response curves visualize how species density changes across ecological gradients of sea surface temperature, sea surface salinity, bathymetry, sea level, and shore type. We validated species distribution models using either spatially or temporally independent validation. For overall models, we ran block cross-validation to set aside 3 spatially autocorrelated subsets of 10-15% of all data, and used those set to validate models built using the remaining 85-90% of the data. For yearly models, we left out data in each year in turn to validate models built using data from all other years. We quantified the performance of models on validation data (either on presence-only points or presence-absence data) using the Area Under the receiver operating characteristic Curve (AUC) averaged over all validation runs.

4 Literature Cited

Ball S., Morris R., Rotheray G., and Watt K. (2011). Atlas of the Hoverflies of Great Britain (Diptera, Syrphidae). Centre for Ecology and Hydrology: Wallingford, UK.
Beck J., Holloway J. D., and Schwanghart W. (2013). Undersampling and the measurement of beta diversity. Methods in Ecology & Evolution: 370-382.
Bird T. J., Bates A. E., Lefcheck J. S., Hill N. A., Thomson R. J., Edgar G. J., et al. (2014). Statistical solutions for error and bias in global citizen science datasets. Biological Conservation, 173: 144-154.
Botts E.A., Erasmus B.F.N. and Alexander G.J. (2012). Methods to detect species range size change from biological atlas data: a comparison using the South African Frog Atlas Project. Biological Conservation, 146: 72-80.
Chao A. (1984). Nonparametric estimation of the number of classes in a population. Scandinavian Journal of Statistics 11: 265-270.
Crisp M.D., Laffan S., Linder H.P., and Monro A. (2001). Endemism in the Australian flora. Journal of Biogeography 28: 183-198.
Fink D., Damoulas T., Dave. J. (2013). Adaptive Spatio-Temporal Exploratory Models: Hemisphere-wide species distributions from massively crowdsourced eBird data. In Twenty-Seventh AAAI Conference on Artificial Intelligence (AAAI-13) July 14-18, 2013. AAAI Press, Bellevue, Washington, USA.
Fink D., Auer T., Johnston A., Ruiz-Gutierrez V., Hochachka W. M., and Kelling S. (2020). Modeling avian full annual cycle distribution and population trends with citizen science data. Ecological Applications 30: 1-16.
Fithian W., Elith J., Hastie T., and Keith D. A. (2015). Bias correction in species distribution models: Pooling survey and collection data for multiple species. Methods in Ecology and Evolution 6: 424-438.
GBIF: The Global Biodiversity Information Facility (2020). What is GBIF?. Available from https://www.gbif.org/what-is-gbif [1 May 2020].
Gotelli N. J., and Chao A. (2013). “Measuring and estimating species richness, species diversity, and biotic similarity from sampling data,” in Encyclopedia of Biodiversity, second Edition, ed. S. A. Levin. Academic Press: Waltham, MA.
Guerin G.R., Ruokolainen L., and Lowe A.J. (2015). A georeferenced implementation of weighted endemism. Methods in Ecology & Evolution 6: 845-852.
Hickling R., Roy D. B., Hill J. K., Fox R., and Thomas C. D. (2006). The distributions of a wide range of taxonomic groups are expanding polewards. Global Change Biology 12: 450-455.
Hill M.O. (2012) Local frequency as a key to interpreting species occurrence data when recording effort is not known. Methods in Ecology and Evolution, 3: 195- 205.
Horns J.J., Adler F.R., Sekercioglu Ç.H. (2018). Using opportunistic citizen science data to estimate avian population trends. Biological Conservation. 221: 151-159.
iNaturalist. 2020. Available from https://www.inaturalist.org. Accessed 1 May 2020.
Isaac N. J. B., Jarzyna M. A., Keil P., Dambly L. I., Boersch-Supan P. H., Browning E., et al. (2020). Data Integration for Large-Scale Models of Species Distributions. Trends in Ecology and Evolution 35: 56-67.
Isaac N. J. B., and Pocock M. J. O. (2015). Bias and information in biological records. Biological Journal of the. Linnean Society 115: 522-531.
Isaac N. J. B., van Strien A. J., August T. A., de Zeeuw M. P., and Roy D. B. (2014). Statistics for citizen science: Extracting signals of change from noisy ecological data. Methods in Ecology and Evolution 5: 1052-1060.
Johnston A., Fink D., Hochachka W. M., and Kelling S. (2018). Estimates of observer expertise improve species distributions from citizen science data. Methods in Ecology and Evolution 9: 88-97.
Kelling S., Fink D., La Sorte F. A., Johnston A., Bruns N. E., and Hochachka W. M. (2015). Taking a ‘Big Data’ approach to data quality in a citizen science project. Ambio 44: 601-611.
Kosmala M., Wiggins A., Swanson A., and Simmons B. (2016). Assessing data quality in citizen science. Frontiers in Ecology and the Environment 14: 551-560.
Kuussaari M., Heliölä J., Pöyry J., and Saarinen K. (2007) Contrasting trends of butterfly species preferring semi-natural grasslands, field margins and forest edges in northern Europe. Journal of Insect Conservation 11: 351-366.
Loarie, S. (2017). Identification Quality Experiment Update. iNaturalist.org. https://www.inaturalist.org/journal/loarie/9260-identification-quality-experiment-update.
MacKenzie D.I. (2006) Occupancy Estimation and Modeling: Inferring Patterns and Dynamics of Species Occurrence, pp. 324. Academic Press: Burlington, Massachusetts, USA.
Moilanen A., Meller L., Leppänen J., Montesino Pouzols F., Arponen A. & Kujala H. (2012). Zonation spatial conservation planning framework and software v. 3.1, User manual, 287 pp. www.helsinki.fi/bioscience/consplan.
Moritz C., Patton J. L., Conroy C. J., Parra J. L., White G. C., and Beissinger S. R. (2008). Impact of a century of climate change on small-mammal communities in Yosemite National Park, USA. Science 322: 261-264.
Outhwaite C.L., Gregory R.D., Chandler R.E. et al. (2020) Complex long-term biodiversity change among invertebrates, bryophytes and lichens. Nature Ecology & Evolution 4: 384-392.
Powney G. D., Carvell C., Edwards M., Morris R. K. A., Roy H. E., Woodcock B. A., et al. (2019). Widespread losses of pollinating insects in Britain. Nature Communications 101: 10.
Rapacciuolo G., Ball-Damerow J. E., Zeilinger A. R., and Resh V. H. (2017). Detecting long-term occupancy changes in Californian odonates from natural history and citizen science records. Biodiversity and Conservation 26: 2933-2949.
Roy H.E., Adriaens T., Isaac N.J.B., Kenis M., Martin G.S., Brown P.M.J., et al. (2012) Invasive alien predator causes rapid declines of native European ladybirds. Diversity and Distributions, 18: 717-725.
Shaffer H. B., Fisher R. N., and Davidson C. (1998). The role of natural history collections in documenting species declines. Trends in Ecology and Evolution 13: 27-30.
Shchepetkin A.F., and McWilliams J.C. (2005). The regional oceanic modeling system (ROMS): A split-explicit, free-surface, topography-following-coordinate oceanic model. Ocean Modelling 9: 347-404.
Sullivan B.L., Aycrigg J.L., Barry J.H., Bonney R.E., Bruns N., Cooper C.B., Damoulas T., Dhondt A.A., Dietterich T., Farnsworth A., Fink D., Fitzpatrick J.W., Fredericks T., Gerbracht J., Gomes C., Hochachka W.M., Iliff M.J., Lagoze C., La Sorte F.A., Merrifield M., Morris W., Phillips T.B., Reynolds M., Rodewald A.D., Rosenberg K.V., Trautmann N.M., Wiggins A., Winkler D.W., Wong W.-K., Wood C.L., Yu J., and Kelling S. (2014). The eBird enterprise: an integrated approach to development and application of citizen science. Biological Conservation 169: 31-40.
Szabo J.K., Vesk P.A., Baxter P.W.J., and Possingham, H.P. (2010) Regional avian species declines estimated from volunteer-collected long-term data using List Length Analysis. Ecological Applications 20: 2157-2169.
Telfer M.G., Preston C.D. & Rothery P. (2002) A general method for measuring relative change in range size from biological atlas data. Biological Conservation 107: 99-109.
Tingley M. W., and Beissinger S. R. (2009). Detecting range shifts from historical species occurrences: new perspectives on old data. Trends in Ecology and Evolution 24: 625-633.
Ueda K. (2019). Identification Quality On iNaturalist. iNat Forum. https://forum.inaturalist.org/t/identification-quality-on-inaturalist/7507.
van Strien A.J., van Swaay C.A.M. & Termaat T. (2013) Opportunistic citizen science data of animal species produce reliable estimates of distribution trends if analysed with occupancy models. Journal of Applied Ecology 50: 1450-1458.
van Strien A.J., Termaat T., Groenendijk D., Mensing V. & Kéry M. (2010) Site-occupancy models may offer new opportunities for dragonfly monitoring based on daily species lists. Basic and Applied Ecology 11: 495-503.
Warren M. S., Hill J. K., Thomas J. A., Asher J., Fox R., and Huntley B. (2001). Rapid responses of British butterflies to opposing forces of climate and habitat change. Nature 414: 65-69.
Warton D.I., Renner I.W., and Ramp D. (2013). Model-based control of observer bias for the analysis ofpresence-only data in ecology. PLoS ONE 8: e79168.

Author: Gio Rapacciuolo | grapacciuolo@calacademy.org | https://github.com/giorap