Crop yield prediction with ensemble algorithms and Artificial Neural Networks (ANN)
Author(s):
Ruiz Moreno, Tobías
Advisor(s):
Merener, Martín
Degree program:
Master in Management + Analytics
Date:
2019
Abstract
World cereal production is set to grow by around 1% per year over the next decade and, with crop areas not expanding, the main driver of that growth is expected to be yield improvement. Crop yields have commonly been modelled in two ways: process-based modelling (also known as crop simulation) and statistical modelling. More recently, machine learning has started to deliver interesting results, mainly because it can handle non-linear relationships between factors. Weather plays an important role in determining crop yields, and being able to simulate realistic weather conditions and predict crop yield has been an important topic in the industry. The objective of this work is to model crop yields using a Random Forest regressor and Long Short-Term Memory (LSTM) neural networks for 9 annual crops in Argentina: wheat, barley, maize, soybean, sunflower, sorghum, rice, cotton and peanut. Soil and weather data were collected and transformed for 80 counties in Argentina. Hyperparameters for the two models were optimized and accuracy metrics were compared. Weather information was simulated by estimating the distribution of the historical data with a kernel density estimator (KDE) and drawing random samples from it by Monte Carlo. Feature importance analysis made it possible to reduce the number of factors down to 7 without compromising model accuracy. Of the 9 crops studied, the soybean, maize, sunflower, sorghum, wheat and barley models returned reasonable accuracy metrics. Setting aside wheat and barley, which are winter crops, the 4 summer crops (soybean, maize, sorghum and sunflower) were forecast by simulating rainfall at different stages of the growing season, and the resulting estimates had an error below 20% (MAPE) before harvest. Random Forest outperformed a classic multiple linear regression (MLR) statistical model by more than 30% on average across all crops, but it overfitted considerably.
LSTM did not perform as well as Random Forest: although it did not overfit, its performance was only slightly better than the baseline, with large variations between crops. This work demonstrates that machine learning algorithms are a competitive alternative to statistical modelling for crop yield prediction, and that weather simulations can yield reasonably accurate predictions before harvest. This allows the agricultural community to anticipate strategic decisions based on crop production forecasts.
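As an illustration of the weather simulation step and the MAPE metric mentioned in the abstract, the following is a minimal sketch, not the thesis's implementation. It assumes a Gaussian kernel, a normal-reference rule-of-thumb bandwidth, and clipping of negative rainfall to zero; none of these choices are stated in the abstract. The function names and all numeric values are hypothetical and serve only to show the mechanics.

```python
import random
import statistics


def rule_of_thumb_bandwidth(samples):
    """Normal-reference bandwidth for a Gaussian kernel (assumed, not from the thesis)."""
    n = len(samples)
    return 1.06 * statistics.stdev(samples) * n ** (-1 / 5)


def sample_from_kde(samples, n_draws, rng):
    """Monte Carlo draws from a Gaussian KDE fitted to `samples`.

    A Gaussian KDE is an equal-weight mixture of normals centred on the
    observations, so one draw is: pick an observation, then add kernel noise.
    """
    h = rule_of_thumb_bandwidth(samples)
    draws = []
    for _ in range(n_draws):
        centre = rng.choice(samples)
        # Rainfall cannot be negative, so clip at zero (a simplification).
        draws.append(max(0.0, rng.gauss(centre, h)))
    return draws


def mape(actual, predicted):
    """Mean absolute percentage error, in percent."""
    errors = [abs((a - p) / a) for a, p in zip(actual, predicted)]
    return 100.0 * sum(errors) / len(errors)


# Hypothetical historical rainfall totals (mm) for one growth stage.
historical_rain_mm = [82.0, 110.5, 64.2, 131.0, 95.3, 77.8, 120.1, 88.6]

rng = random.Random(42)  # fixed seed so the sketch is reproducible
simulated_rain_mm = sample_from_kde(historical_rain_mm, 1000, rng)

# Hypothetical observed vs. predicted yields (t/ha), only to illustrate MAPE.
observed_yield = [3.0, 2.5, 4.0]
predicted_yield = [2.7, 2.75, 4.4]
yield_mape = mape(observed_yield, predicted_yield)
```

In a pipeline like the one the abstract describes, each simulated rainfall value would be fed to the trained yield model in place of the not-yet-observed weather, and the spread of the resulting predictions would give a pre-harvest forecast range.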