Datasets used in the published paper titled "Next-generation hybrid precipitation forecasts that integrate indigenous knowledge using machine learning"

Authors: Samuel Sutanto 1), Joep Bosdijk 2)
1) Earth Systems and Global Change Group, Environmental Sciences Department, Wageningen University and Research, Droevendaalsesteeg 3, 6708 PB, Wageningen, the Netherlands
2) Weather Impact, Amersfoort, the Netherlands

Contact information: samuel.sutanto@wur.nl, joep.bosdijk@weatherimpact.com


The dataset is organized into two folders:

1. Data
2. Script

Data
All input and output data used in the paper are provided in this folder.


Script
There are five main Python scripts used in the paper, described as follows:
1. Pre-processing SF and FF.py. SF means scientific forecast and FF means farmer forecast.
2. Indicator_pre-processing.py. Creates the key dataframe df_time_to_rain, which contains only the days on which indicators were seen, together with the rain value on each of those days.
3. Farmer forecast figure.py. Carries out three tasks:
   a. The forecasts of Nakpanzoo and Yapalsi are combined into one dataframe.
   b. The best farmers are selected, and a dataframe is created in which their forecasts overwrite the other forecasts (part 1).
   c. All bar graphs of my thesis are made, including the last one.
4. Indicator multiple days complete model.py. An attempt at using multiple previous days to improve the model.
5. kfold cross.py. The model used to train the hybrid forecast. This script also contains a more elaborate figure creation section.
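
As an illustration of the filtering described for Indicator_pre-processing.py, the sketch below keeps only the days on which an indicator was seen and attaches the rain value of that same day. It is a minimal example under assumptions, not the code from the script: the column names (date, indicator_seen, rain_mm) and the toy values are hypothetical, and only the name df_time_to_rain comes from the description above.

```python
import pandas as pd

# Hypothetical daily observations; column names and values are made up
# for illustration and do not come from the published data files.
dates = pd.to_datetime(["2019-06-01", "2019-06-02", "2019-06-03", "2019-06-04"])
observations = pd.DataFrame(
    {"date": dates, "indicator_seen": [True, False, True, False]}
)
rainfall = pd.DataFrame({"date": dates, "rain_mm": [0.0, 12.5, 3.2, 0.0]})

# Keep only the days on which an indicator was seen ...
indicator_days = observations[observations["indicator_seen"]]

# ... and attach the rain value recorded on that same day.
df_time_to_rain = indicator_days.merge(rainfall, on="date", how="left")
```

With these toy values, df_time_to_rain holds one row per indicator day, each carrying the rainfall of that day.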

Extra
The Extra folder contains additional scripts used in the analysis, described as follows:
Data_janina.py. Loads Janina’s data and puts it into the correct format.
Testing forecasts.py. Multiple scripts that can be used to make plots as seen in the publication.
Feature importance calculations.py. Generates the figures seen in the appendix (using another method of calculating feature importance).
Feature selector.py. Calculates the skill when using fewer features, using sequential feature selection, and creates the figure seen in the publication.
Nayive Bayes Classification.py. I created these scripts at the start, when I was still figuring out how things worked.
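
To show the idea behind the sequential feature selection mentioned for Feature selector.py, the sketch below implements a greedy forward variant in plain Python. It is an assumption-laden illustration only: the script itself may rely on a library routine and a different selection direction, and the feature names, the toy_score function, and its values are hypothetical stand-ins for a real skill score of the model.

```python
from typing import Callable, List, Sequence, Tuple


def forward_selection(
    features: Sequence[str],
    score: Callable[[List[str]], float],
    n_select: int,
) -> List[Tuple[List[str], float]]:
    """Greedy forward selection: repeatedly add the feature whose
    inclusion gives the highest score, and record the score per step."""
    selected: List[str] = []
    remaining = list(features)
    history: List[Tuple[List[str], float]] = []
    while remaining and len(selected) < n_select:
        # Try each remaining feature on top of the current selection.
        best = max(remaining, key=lambda name: score(selected + [name]))
        selected.append(best)
        remaining.remove(best)
        history.append((list(selected), score(selected)))
    return history


# Hypothetical skill contributions; in practice the score would be a
# cross-validated skill measure of the forecast model.
TOY_SKILL = {"feature_a": 0.30, "feature_b": 0.20, "feature_c": 0.05}


def toy_score(subset: List[str]) -> float:
    return sum(TOY_SKILL[name] for name in subset)


history = forward_selection(list(TOY_SKILL), toy_score, n_select=2)
```

Each entry in history pairs a feature subset with its score, which is the kind of information needed to plot skill against the number of features used.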


Note: Not all data included in the folder are used in the scripts; in fact, most are not. I included them anyway in case I missed some.
