Sustainable Agri-Finance: Leveraging Synthetic Data & Scientific Models to Estimate Greenhouse Gas Emissions

Authors: Hajarah Nantege, Souvik Roy, Tiago Machado, Takashi Nishikawa

Introduction

The global imperative to address climate change and promote sustainable development has placed financial institutions, particularly banks, at the forefront of supporting environmentally responsible practices. In the realm of sustainable finance, accurately estimating and reporting emissions from financed agricultural activities has emerged as a critical challenge.

These activities, encompassing crop production, livestock farming, and land-use change, significantly contribute to greenhouse gas (GHG) emissions and other environmental impacts. Understanding the emissions associated with agricultural financing is vital for banks to assess their environmental footprint, inform investment decisions, and drive positive change in the agricultural sector. However, this task is fraught with complexities and obstacles that necessitate the use of innovative approaches.

Challenges in Estimating and Reporting Agricultural Emissions

Accurately estimating and reporting agricultural emissions is a significant challenge for banks, as they navigate the unique characteristics of the agricultural sector and the complexities involved in data collection, emissions quantification, and reporting. There is a critical need to promote regenerative agriculture and mitigate the adverse effects of synthetic fertilizers on the nitrogen cycle, soil health, water quality, and the environment. The following key challenges were identified based on insights from publications by the Principles for Responsible Investment [1], Ceres [2], and the Task Force on Climate-related Financial Disclosures (TCFD) [3,4]:

Data Availability and Quality. Banks face the obstacle of ensuring comprehensive and accurate emissions data from various actors in the agricultural supply chain. To enable reliable emissions estimation, it is essential that the data collection processes are robust, standardized, and transparent.
Scope and Boundaries. Defining the scope and boundaries for emissions estimation within agricultural supply chains is critical. Collaborating with stakeholders is necessary to establish consistent methodologies and determine the responsibilities of different actors in the supply chain.
Measurement and Verification. Accurately quantifying and verifying emissions across diverse agricultural supply chains requires appropriate measurement methodologies and consistent application. Verification of emissions data enhances credibility and reliability.
Supply Chain Complexity and Transparency. Agricultural supply chains are complex, involving numerous actors and intermediaries. Banks must navigate this complexity and foster transparency, collaboration, and data sharing among stakeholders to trace emissions effectively.
Integration of Emerging Technologies. Adopting and integrating emerging technologies, such as remote sensing, satellite imagery, Internet of Things (IoT) sensors, and specialized tools for assessing GHG emissions, can improve estimation accuracy. Overcoming technical barriers, data compatibility issues, and cost constraints is necessary to leverage these technologies effectively.
Alignment with Reporting Standards. Aligning emissions estimation and reporting with recognized reporting standards, such as the TCFD guidelines, enhances transparency and comparability. Climate-related financial disclosures provide consistent and standardized information to stakeholders for informed decision-making and risk assessment.

This Omdena Challenge project aimed to address some of the challenges listed above by developing a machine learning-based system for estimating nitrous oxide (N2O) emissions from input data such as soil properties, weather conditions, and crop information. The initial part of the project focused on exploring and analyzing available data sets, particularly satellite data. We identified APIs for accessing satellite data and summarized the data each one provides (see Table 1 below). We also gathered information on the characteristics of available satellites, including their available bands and their spatial and temporal resolution (see Table 2 below).

Name	API Link	Satellite Data Available
Google Earth Engine	https://developers.google.com/earth-engine	Landsat
STAC	https://stacindex.org	STAC Catalogs
Satellite Imaging Corporation	https://www.satimagingcorp.com/applications/natural-resources/agriculture/	NDVI
Planet Explorer	https://account.planet.com	Includes imagery from Planet’s catalog (PlanetScope, SkySat, and RapidEye) as well as public imagery from Sentinel-2 and Landsat 8.
SentinelSat Python API	https://pypi.org/project/sentinelsat/	Sentinel satellite images

Table 1: Available satellite data APIs

Name	Link	Spatial Resolution	Temporal Resolution
Sentinel-2	https://eos.com/find-satellite/sentinel-2/	10m to 60m (band-dependent)	5 days
Landsat 7	https://eos.com/find-satellite/landsat-7/	15m	16 days
Pleiades-1A	https://www.satimagingcorp.com/satellite-sensors/pleiades-1/	0.5m	1 day
MODIS	https://lpdaac.usgs.gov/data/get-started-data/collection-overview/missions/modis-overview/	250m	1-2 days
SPOT-6/7	https://www.satimagingcorp.com/satellite-sensors/spot-6/	1.5m	26 days

Table 2: Characteristics of available satellites

A Solution for Predicting Nitrous Oxide Emissions

Knowledge-Guided Machine Learning (KGML)

The machine learning framework adopted in this project to develop a solution for predicting N2O emissions is known as Knowledge-Guided Machine Learning (KGML) [5]. This solution leverages synthetic data and scientific models to address the limitations of current approaches and provide more accurate predictions. The developed technology can promote regenerative agriculture and contribute to the transition to a low-carbon economy by empowering both banks and farmers with information to reduce reliance on synthetic fertilizers, mitigate nitrogen loss, and adopt sustainable practices.

In the KGML framework, models are designed by combining scientific rigor with data-driven methodologies. In this project, the scientific rigor comes from software that implements the physical laws and scientific principles underlying agricultural processes and biophysical chemistry, which we use to generate synthetic data. The data-driven aspect of KGML comes from the use of machine learning algorithms to learn the relation between the dependent and independent variables from ground-truth data.

Software implementations of agricultural processes are referred to as process-based models, and researchers have used them to predict N2O emissions. Systems such as Ecosys, DNDC, and APSIM are among the most cited in the scientific literature. Despite their success in educating and assisting in agricultural practices, such systems have limitations: they are constrained by the assumptions implemented in the software and may have outdated code bases, poor usability, and steep learning curves. However, empirical evidence [5] suggests that many of these issues can be addressed with the use of adapted machine learning models.

To overcome these limitations, KGML performs model training in two steps. First, the model learns input-output relations from the synthetic data generated by simulation with a process-based model. This initial, pre-training step allows the model to incorporate knowledge from the scientific principles encoded in the software. In the second step, the model is enhanced using ground-truth data on measured N2O emissions (along with measured inputs). This fine-tuning step combines the scientific knowledge from the process-based model with the ground-truth information, defining a KGML model.
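The two-step schedule can be illustrated with a deliberately tiny stand-in. The sketch below is not the project's GRU model: it uses a one-variable linear model, a made-up "simulator" that is plentiful but biased, and a handful of noisy "measurements". All functions, coefficients, learning rates, and sample sizes are assumptions chosen for illustration only.

```python
import random


def train(weights, data, lr, epochs):
    """Full-batch gradient descent on mean squared error for y = w0 + w1 * x."""
    w0, w1 = weights
    n = len(data)
    for _ in range(epochs):
        g0 = sum(2 * (w0 + w1 * x - y) for x, y in data) / n
        g1 = sum(2 * (w0 + w1 * x - y) * x for x, y in data) / n
        w0 -= lr * g0
        w1 -= lr * g1
    return w0, w1


def mse(weights, data):
    w0, w1 = weights
    return sum((w0 + w1 * x - y) ** 2 for x, y in data) / len(data)


def simulator(x):
    # Hypothetical stand-in for a process-based model: cheap to run, but biased.
    return 1.0 + 2.0 * x


def reality(x):
    # Hypothetical "true" relation that the measurements follow.
    return 1.5 + 2.4 * x


rng = random.Random(0)
synthetic = [(x, simulator(x)) for x in (rng.uniform(0, 1) for _ in range(500))]
observed = [(x, reality(x) + rng.gauss(0, 0.05))
            for x in (rng.uniform(0, 1) for _ in range(12))]
held_out = [(i / 20, reality(i / 20)) for i in range(21)]

# Step 1: pre-train on abundant synthetic data.
pretrained = train((0.0, 0.0), synthetic, lr=0.1, epochs=2000)
# Step 2: fine-tune on scarce ground-truth data, starting from the pre-trained weights.
fine_tuned = train(pretrained, observed, lr=0.05, epochs=200)

print("held-out MSE after pre-training:", mse(pretrained, held_out))
print("held-out MSE after fine-tuning: ", mse(fine_tuned, held_out))
```

The point of the sketch is only the ordering: the second call to `train` starts from the weights produced by the first, so the few measurements correct the simulator's bias instead of having to teach the model everything from scratch.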

As the main reference, we used the work developed by Liu et al. [6], which, to the best of our knowledge, was the first KGML model designed for agricultural processes. In their study, the model was first trained with synthetic data from Ecosys and then fine-tuned with ground-truth measurement data from six chambers located on a farm in Minnesota, US. Among the model architectures based on Gated Recurrent Units (GRUs) that are discussed in Ref. [6], two caught our attention. The first one (Figure 1, left) stacks layers of GRU units and maps soil and crop properties (SCP), weather conditions (W), nitrogen fertilizer application rate (N), and intermediate variables (IMVs) to the N2O emission (flux). The second one (Figure 1, right) has two GRU modules: one module maps the variables SCP, W, N, and a subset of the IMVs to the remaining IMVs, and the other module maps SCP, W, N, and the resulting full set of IMVs to the N2O flux.

Figure 1: Two KGML architectures. The left architecture stacks layers of GRU units and directly maps fertilizer rate, soil and crop properties, weather conditions, and IMVs to the N2O flux. The right architecture contains two independent modules of GRU layers, one for predicting IMVs and the other for predicting the N2O flux (reproduced and adapted from Ref. [6]).
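To make the left-hand architecture more concrete, here is a minimal forward-pass sketch of stacked GRU layers followed by a linear output head, written with the Python standard library only. It follows one common formulation of the GRU update; the layer sizes, the number of input features, the random weights, and all names are assumptions for illustration and are not taken from Ref. [6] or from the project code.

```python
import math
import random


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def matvec(matrix, vector):
    return [sum(a * b for a, b in zip(row, vector)) for row in matrix]


def init_matrix(rows, cols, rng):
    return [[rng.uniform(-0.1, 0.1) for _ in range(cols)] for _ in range(rows)]


class GRUCell:
    """One GRU layer: update gate z, reset gate r, candidate state n."""

    def __init__(self, n_in, n_hidden, rng):
        self.n_hidden = n_hidden
        self.W = {g: init_matrix(n_hidden, n_in, rng) for g in "zrn"}      # input weights
        self.U = {g: init_matrix(n_hidden, n_hidden, rng) for g in "zrn"}  # recurrent weights
        self.b = {g: [0.0] * n_hidden for g in "zrn"}                      # biases

    def step(self, x, h):
        def pre(gate, hidden):
            wx = matvec(self.W[gate], x)
            uh = matvec(self.U[gate], hidden)
            return [a + c + d for a, c, d in zip(wx, uh, self.b[gate])]

        z = [sigmoid(v) for v in pre("z", h)]
        r = [sigmoid(v) for v in pre("r", h)]
        reset_h = [ri * hi for ri, hi in zip(r, h)]
        n = [math.tanh(v) for v in pre("n", reset_h)]
        # Blend the previous state with the candidate state.
        return [(1 - zi) * ni + zi * hi for zi, ni, hi in zip(z, n, h)]


def predict_flux(days, layers, head_w, head_b):
    """Map a sequence of daily feature vectors to one flux value per day."""
    states = [[0.0] * cell.n_hidden for cell in layers]
    fluxes = []
    for x in days:
        layer_input = x
        for i, cell in enumerate(layers):
            states[i] = cell.step(layer_input, states[i])
            layer_input = states[i]
        fluxes.append(sum(w * h for w, h in zip(head_w, layer_input)) + head_b)
    return fluxes


rng = random.Random(0)
n_features = 9  # hypothetical: 4 SCP + 3 W + 1 N + 1 IMV values per day
layers = [GRUCell(n_features, 8, rng), GRUCell(8, 8, rng)]
head_w = [rng.uniform(-0.1, 0.1) for _ in range(8)]
days = [[rng.uniform(-1, 1) for _ in range(n_features)] for _ in range(365)]

fluxes = predict_flux(days, layers, head_w, head_b=0.0)
print(len(fluxes), "daily flux values from untrained weights")
```

With untrained random weights the output values are meaningless; the sketch only shows how a year of daily inputs flows through the stacked layers to yield one flux estimate per day.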

According to the authors, the first architecture gives the best results. However, it requires feeding a large number of IMVs (76 to be exact) into the model. The second architecture was designed to overcome this data dependency and requires only four IMVs: the fluxes of NH4+, CO2, and NO3-, and the soil’s volumetric water content (VWC). While this made the second architecture our preferred choice initially, we found that ground-truth data containing all four IMVs was not available. As an alternative solution, we tried an architecture with an IMV module requiring only the NH4+ flux (the only IMV for which we could find ground-truth data). However, this modified IMV module was not able to learn well and was only adding unnecessary complexity. Therefore, we ultimately chose the first architecture since it could map the variables SCP, W, N, and the NH4+ flux directly to the N2O flux without the need for a module specifically to predict IMVs. After substantial work in data collection, exploration, analysis, and synthesis, we were able to prepare an adequate training dataset and build a model that can accurately predict N2O flux without relying on a large number of IMVs as inputs.

Besides modifying the architecture, we also chose a different process-based model for synthetic data generation, using DNDC instead of Ecosys. The reason for our choice was Ecosys’s usability issues and steep learning curve. We also used two additional variables, soil clay content and nitrogen inhibitor rate, which were not used in Ref. [6]. In particular, we found evidence in the literature that soil clay content is an important factor when analyzing soil properties in the UK [7].

Data Collection

Two types of data were required for the development of the KGML model for predicting N2O emissions: 1) input data for DNDC to generate synthetic data for pre-training, and 2) ground-truth data for fine-tuning.

DNDC Input Data for Pre-Training

To generate synthetic data with DNDC, we used the graphical user interface (GUI) of the DNDC software to supply location-specific input data on climate, soil characteristics, vegetation, and management practice. A tabular dataset was created including daily and annual climate data for the selected years, various soil properties for the specific locations, and crop and management practices for each year, as listed in Table 3 below. Vegetation was assumed to be crops, and management practices (tillage and fertilizer application) were set specifically for the tests/experiments considered. Multiple DNDC runs were performed for a range of configurations to capture various scenarios.

	Variable	Data Type	Unit	Source
Site Information	Longitude	Site Specific	Decimal degrees	COSMOS
	Latitude	Site Specific	Decimal degrees	COSMOS
	Elevation	Site Specific	m	COSMOS
Crop	Name of crop	Crop in each scenario		CROME
Climate	N concentration in rainfall	Annual Avg	mg N/l or ppm
	Atmospheric background NH3 concentration	Annual Avg	µg N/m^3	NASA
	Atmospheric background CO2 concentration	Annual Avg	ppm
	Annual increase rate of atmospheric CO2 concentration	Annual Avg	ppm/yr	STATISTA
	Min and max air temperature	Daily	deg C	NASA Power Data
	Precipitation	Daily	cm	NASA Power Data
	Wind speed	Daily	m/s	NASA Power Data
	Humidity	Daily	%	NASA Power Data
	Annual rainfall	Annual Avg	mm	COSMOS
Soil	Land-use type	Site Specific		COSMOS
	Soil texture	Site Specific		UKSO, HWSD
	Bulk density	Site Specific	g/cm^3	COSMOS
	Soil pH	Site Specific		UKSO, HWSD
	SOC at surface soil (0-5 cm)	Site Specific	g/g	COSMOS
	Topsoil clay fraction	Site Specific		HWSD
	Slope	Site Specific	deg	UKSO
Tillage	Number of tilling applications in the year	Test/Experiment Specific
	Tilling application date	Test/Experiment Specific
	Tilling method	Test/Experiment Specific
Nitrogen Fertilizer Application	Number of fertilizer applications in the year	Test/Experiment Specific
	Fertilizer application date	Test/Experiment Specific
	Application depth: surface or injection	Test/Experiment Specific	m
	Applied quantity of fertilizers	Test/Experiment Specific	kg N/ha	Based on Nutrient Management Guide RB209

Table 3: Site, crop, climate, soil, and management data collected as DNDC inputs
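The multiple DNDC runs over a range of configurations can be thought of as a grid of scenarios. The sketch below enumerates such a grid and includes a small unit helper, since Table 3 lists daily precipitation in cm but annual rainfall in mm. The crop names, fertilizer rates, tillage methods, and years are hypothetical placeholders, not the values used in the project, and the sketch does not write DNDC input files.

```python
from itertools import product

# Hypothetical scenario dimensions; the project's actual values are not listed here.
crops = ["winter wheat", "spring barley"]
fertilizer_rates_kg_n_ha = [0, 80, 160, 240]
tillage_methods = ["no-till", "ploughing"]
years = [2018, 2019]


def mm_to_cm(value_mm):
    """Convert a precipitation depth from millimetres to centimetres."""
    return value_mm / 10.0


# One dictionary per DNDC run: every combination of the dimensions above.
scenarios = [
    {"crop": crop, "n_rate_kg_ha": rate, "tillage": tillage, "year": year}
    for crop, rate, tillage, year in product(
        crops, fertilizer_rates_kg_n_ha, tillage_methods, years
    )
]

print(len(scenarios), "scenario configurations")
print(scenarios[0])
print(mm_to_cm(25.0), "cm of precipitation")
```

Keeping the scenario definitions in one place like this makes it easier to record which configuration produced which output file, even when the runs themselves are launched by hand from the GUI.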

Running DNDC simulations with these input data generated output files containing daily values for each variable for the selected site, including soil temperature, moisture, oxygen content, microbial activity, pools and fluxes of elements (carbon, nitrogen, phosphorus), soil water, field management, crop information, and grazing. These outputs were essential for further analysis, ensuring a robust dataset for training the KGML model to predict N2O emissions.

Ground-Truth Data for Fine-Tuning

We obtained the ground-truth N2O flux measurements for real UK sites from GHG Nitrous Oxide Datasets in the Agricultural and Environmental Data Archive (AEDA). The raw files from these datasets included most of the required input variables, such as geographical coordinates and daily values for soil moisture, soil mineral nitrogen, rainfall, and air temperature. However, they lacked some crucial variables, which we obtained from other sources. For example, we obtained wind and humidity data from NASA’s data access viewer for the specific sites of the experiments where the N2O flux was measured.

To prepare the dataset for fine-tuning, the raw data was split into time series samples based on location, block number, and treatment. Sand and silt content were taken from the Harmonized World Soil Database (HWSD). The variables needed for the model were then obtained from these sources by selecting, renaming, unit-converting, and calculating as required. The datasets were arranged as a full-year time series for each input variable, with missing time steps generated and filled with known constants and values from the weather data sources. Missing NH4+ flux values were filled by interpolation, and missing N2O flux values were imputed with values predicted by DNDC.
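The gap-filling steps described above can be sketched as follows. The sketch assumes linear interpolation for the NH4+ flux (the interpolation method is not specified here) and assumes that gaps at the start or end of the year take the nearest observed value. All function names and sample numbers are illustrative.

```python
def to_full_year(observations, n_days=365):
    """Expand {day_of_year: value} (1-based days) into a list with None for missing days."""
    return [observations.get(day) for day in range(1, n_days + 1)]


def interpolate_gaps(values):
    """Fill None entries by linear interpolation between neighbouring observations."""
    known = [i for i, v in enumerate(values) if v is not None]
    if not known:
        raise ValueError("at least one observed value is required")
    filled = list(values)
    # Edge gaps: extend the nearest observation outward (an assumption of this sketch).
    for i in range(known[0]):
        filled[i] = values[known[0]]
    for i in range(known[-1] + 1, len(values)):
        filled[i] = values[known[-1]]
    # Interior gaps: straight line between the two surrounding observations.
    for left, right in zip(known, known[1:]):
        span = right - left
        for i in range(left + 1, right):
            filled[i] = values[left] + (values[right] - values[left]) * (i - left) / span
    return filled


def fill_from_fallback(values, fallback):
    """Replace None entries with the value at the same position in a fallback series."""
    return [f if v is None else v for v, f in zip(values, fallback)]


# Tiny six-day example instead of a full year, to keep the output readable.
nh4_sparse = {2: 2.0, 5: 5.0}
nh4_series = interpolate_gaps(to_full_year(nh4_sparse, n_days=6))

n2o_measured = [0.4, None, 0.9, None]
n2o_simulated = [0.5, 0.6, 0.7, 0.8]  # stand-in for DNDC-predicted values
n2o_series = fill_from_fallback(n2o_measured, n2o_simulated)

print(nh4_series)
print(n2o_series)
```

Imputing the target variable with simulated values means the fine-tuning set is no longer purely observational, so it can be useful to keep a mask of which days were measured and which were filled.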

Training and Results

With the design choices made and clean data prepared, we had everything in place for the two steps required to train our KGML model. The synthetic data obtained from DNDC was used for the first training step, and the ground-truth data for the UK sites was used to fine-tune the model.

For both pre-training and fine-tuning, we evaluated our results using K-fold cross-validation to check for overfitting. The dataset was divided into K = 5 folds. In each iteration, the model was trained on four of the five folds, with the remaining fold held out for validation. Figure 2 below illustrates the results for one such iteration, which is representative of the whole training and validation process. The validation loss was found to be consistently lower than the training loss. This is most likely because the 20% dropout applied to each GRU layer acts as a regularizer during training but is automatically disabled when the model is used for inference in the validation phase. Table 4 below presents the training and validation results averaged over all five iterations of the cross-validation.
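The five-fold procedure can be sketched with the standard library alone. The example below assumes each sample is one complete time series, so that a whole series lands either in training or in validation. The shuffling seed, the toy targets, and the mean-only baseline used as a "model" are assumptions for illustration.

```python
import random


def k_fold_splits(n_samples, k=5, seed=0):
    """Return k (training_indices, validation_indices) pairs covering all samples."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    folds = [indices[i::k] for i in range(k)]
    splits = []
    for held_out in range(k):
        validation = folds[held_out]
        training = [i for j, fold in enumerate(folds) if j != held_out for i in fold]
        splits.append((training, validation))
    return splits


def cross_validate(targets, k=5):
    """Average validation MSE of a mean-only baseline across the k folds."""
    scores = []
    for training, validation in k_fold_splits(len(targets), k):
        # "Training" here is just taking the mean of the training targets.
        baseline = sum(targets[i] for i in training) / len(training)
        fold_mse = sum((targets[i] - baseline) ** 2 for i in validation) / len(validation)
        scores.append(fold_mse)
    return sum(scores) / len(scores)


targets = [0.1 * i for i in range(20)]  # toy values standing in for per-sample targets
splits = k_fold_splits(len(targets), k=5)
mean_validation_mse = cross_validate(targets, k=5)

for training, validation in splits:
    print(len(training), "training samples,", len(validation), "validation samples")
print("mean validation MSE:", mean_validation_mse)
```

Every sample is used for validation exactly once, which is why the averaged score is a more stable estimate than a single train/validation split.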

The results from our trained model indicate that the machine learning approach has the potential to overcome the limitations of relying solely on process-based models. By combining the ground-truth data with the synthetic data coming from DNDC, we can mitigate the challenges of data scarcity in certain locations. These findings are consistent with recent publications from the scientific community and aligned with the guidelines of the Intergovernmental Panel on Climate Change (IPCC).

By utilizing the KGML model, banks can improve their estimation and reporting of emissions from financed agricultural activities. This solution leverages synthetic data to train the machine learning model, allowing it to learn from existing scientific knowledge in addition to real-world observations.

Figure 2: Results from pre-training (top) and fine-tuning (bottom) after 1000 epochs.

Table 4: Pre-training and fine-tuning results for the root mean square error (RMSE) and the coefficient of determination (R-squared) after 1000 epochs.
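For readers less familiar with the two metrics named in the Table 4 caption, the sketch below shows how RMSE and R-squared are conventionally computed from measured and predicted values. The sample numbers are made up and are not results from the project.

```python
import math


def rmse(y_true, y_pred):
    """Root mean square error: typical size of the prediction error, in the target's units."""
    squared_errors = [(t - p) ** 2 for t, p in zip(y_true, y_pred)]
    return math.sqrt(sum(squared_errors) / len(squared_errors))


def r_squared(y_true, y_pred):
    """Coefficient of determination: 1 minus residual variation over total variation."""
    mean_true = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)
    return 1.0 - ss_res / ss_tot


measured = [1.0, 2.0, 3.0, 4.0]   # made-up flux values
predicted = [1.5, 2.0, 3.0, 3.5]  # made-up model output

print("RMSE:", rmse(measured, predicted))
print("R-squared:", r_squared(measured, predicted))
```

A lower RMSE and an R-squared closer to 1 both indicate a better fit; R-squared can become negative when the predictions are worse than simply predicting the mean of the measurements.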

Enabling Sustainable Farming with Artificial Intelligence (AI)

The integration of AI technologies, such as the developed KGML model delivered as a B2B SaaS platform, can empower banks to facilitate sustainable farming practices and support the transition to a low-carbon economy.

The KGML model is built upon the foundations of AI and harnesses the power of machine learning and scientific insights to predict N2O emissions from agricultural systems. By leveraging synthetic data and scientific models, the KGML model provides accurate estimates of emissions, enabling banks to make informed financing decisions and promote sustainable farming practices. This integration of AI enhances the precision and efficiency of emissions estimation, supporting banks in their commitment to sustainable finance.

Furthermore, the solution offers remote sensing insights on farm-level N2O emissions. Through the utilization of satellite imagery, drones, and other remote sensing tools, the platform collects comprehensive data on agricultural activities, enabling banks to gain valuable insights into emissions hotspots and identify opportunities for emission reduction. The AI-powered analysis and visualization capabilities of the solution empower banks to navigate the complexities of sustainable farming by providing them with actionable information to support decision-making and risk assessment.

The KGML model and the B2B SaaS platform would work in tandem to enhance the accuracy of emissions estimation, improve risk assessment, and promote environmentally conscious lending practices. Through the integration of AI, banks can proactively drive positive change in the agricultural sector, aligning financial incentives with sustainable farming practices and promoting the adoption of regenerative agriculture.

The use of AI in enabling sustainable farming can also extend beyond emissions estimation and risk assessment. AI technologies can be leveraged to optimize resource management, improve crop yield predictions, and support precision agriculture practices. By analyzing vast amounts of data and generating actionable insights, AI empowers farmers to make data-driven decisions, maximize resource efficiency, and minimize environmental impact.

Toward Regenerative Agriculture and a Low-Carbon Economy

Estimating and reporting emissions from financed agricultural activities pose significant challenges for banks in their efforts to promote regenerative agriculture and facilitate the transition to a low-carbon economy. However, by combining the improved emission estimates from the proposed KGML model with other data, banks can overcome these obstacles and have a substantial impact on sustainable finance.

Accurate estimation and reporting of emissions enable banks to make more informed financing decisions, prioritize regenerative agricultural practices, and support sustainable land management. By considering the environmental impact of agricultural activities, banks can reduce carbon emissions, enhance ecosystem services, and promote soil health, biodiversity, and water conservation. More specifically, the KGML solution can help banks with the following:

Risk Assessment. More accurate emissions estimation from the solution allows banks to assess climate-related risks associated with agricultural investments. Understanding the emissions profile of agricultural supply chains helps banks identify potential risks from regulatory changes, market shifts, and climate change impacts. By implementing proactive risk mitigation strategies and supporting the transition to low-carbon agricultural systems, banks can contribute to a more resilient and sustainable agricultural sector.
Market Incentives. Accurate estimation and reporting of emissions also contribute to the development of market incentives for the transition to a low-carbon economy. Transparent information on emissions intensity incentivizes practices that reduce carbon footprints and promote regenerative agriculture. By encouraging farmers and agribusinesses to adopt sustainable practices, invest in renewable energy, and implement climate-smart technologies, banks play a crucial role in driving the shift towards a sustainable and low-carbon agricultural sector.
Standardization and Transparency. Addressing the challenges of estimating and reporting agricultural emissions requires collaboration, knowledge sharing, and the integration of emerging technologies. The KGML model can be used to promote transparency in emissions estimates, enabling banks to work together with other financial institutions, agricultural stakeholders, and scientific communities to develop standardized methodologies and share best practices. This collaborative approach fosters innovation, accelerates the adoption of sustainable practices, and strengthens the resilience of the agricultural sector.
Influencing Policy and Regulations. By accurately estimating and reporting agricultural emissions, banks gain credibility and a basis for engaging in policy discussions and advocating for supportive regulatory frameworks. Banks can leverage their insights and data to support the development of policies that incentivize regenerative agricultural practices, promote carbon pricing mechanisms, and facilitate the transition to a low-carbon economy. Through policy influence and advocacy, banks can create an enabling environment for sustainable finance and drive systemic changes in the agricultural sector.

In summary, accuracy in the estimation and reporting of agricultural emissions is critical for promoting regenerative agriculture and facilitating the transition to a low-carbon economy. By addressing the challenges, leveraging synthetic data, incorporating scientific models, and collaborating with stakeholders, banks can make informed financing decisions, assess risks, create market incentives, foster collaboration, and advocate for supportive policies. These actions contribute to sustainable land management, reduced carbon emissions, enhanced ecosystem services, and the development of a resilient and sustainable agricultural sector.

Acknowledgment

The authors would like to thank Reed Walker and his team at Agreed Earth (the partner company for this project) for the critical reading and feedback that significantly improved this article. The authors would also like to thank the following people for the fruitful collaboration that led to the development of the KGML model described in this article: Janice Wong, Guy Maskall, and Kelly Price at Agreed Earth, as well as the collaborators Alice Lépissier, Amanda Eames, Ameya Chaudhari, Ananthakrishnan S, Bakhtiyar Babashli, Deepali Bidwai, Deniz Can Elçi, Farhan Hasan, Gaylyn Ruvere, Gizem Dulat, Isaiah LeBlanc, Krishna Anand V G, Michael Adeyeri, Sairam Kannan, and Satish Satpal, who all worked hard in the Omdena AI Challenge.

References

[1] Principles for Responsible Investment (PRI), Understanding the data needs of responsible investors: The PRI’s investor data needs framework.

[2] Ceres, Measure the Chain: Managing GHG Emissions in Agricultural Supply Chains.

[3] Task Force on Climate-related Financial Disclosures, Guidance on Metrics, Targets, and Transition Plans.

[4] Task Force on Climate-related Financial Disclosures, 2022 Status Report.

[5] Karpatne A, Kannan R, Kumar V, editors. Knowledge Guided Machine Learning: Accelerating Discovery Using Scientific Knowledge and Data. CRC Press; 2022 Aug 15.

[6] Liu L, Xu S, Tang J, Guan K, Griffis TJ, Erickson MD, Frie AL, Jia X, Kim T, Miller LT, Peng B. KGML-ag: a modeling framework of knowledge-guided machine learning to simulate agroecosystems: a case study of estimating N2O emission using data from mesocosm experiments. Geoscientific Model Development. 2022 Apr 7;15(7):2839-58.

[7] Fitton N, Datta A, Cloy JM, Rees RM, Topp CF, Bell MJ, Cardenas LM, Williams J, Smith K, Thorman R, Watson CJ. Modelling spatial and inter-annual variations of nitrous oxide emissions from UK cropland and grasslands using DailyDayCent. Agriculture, Ecosystems & Environment. 2017 Dec 1;250:1-11.

The post Sustainable Agri-Finance: Leveraging Synthetic Data & Scientific Models to Estimate Greenhouse Gas Emissions appeared first on Omdena | Building AI Solutions for Real-World Problems.