Modeling Change In Per Hectare Crop Yield Over Time

Last updated on Dec 24, 2021 18 min read Data Science

In this blog post, I’ll walk my readers through the R code I wrote in one of my Kaggle Notebooks to visualize the trends of different crop yields in most populated countries over time (years). Besides that, I’ll also visualize the population growth in the world and the populated countries. Initially, I have worked on a few countries in this notebook to optimally utilize the time. However, readers can use this code for the countries and crops of their choice. At the end, I’ll model and visualize the change in per hectare crop yield over time in the selected countries.

Load libraries

In this data visualization and analysis post, I’ll use the following R packages. I have labeled the category of each package as data science, modeling, visualization and tables.

# Data science
library(tidyverse)
library(janitor)
library(DataExplorer)
library(skimr)

# Modeling
library(tidymodels)

# Visualization
library(ggrepel)

# Tables
library(gt)

Upload data

In this blog post, I have used the data for crop production (yield) and population in countries provided on github. I used the read_csv function to remotely read the data from github.

crop <- read_csv("https://raw.githubusercontent.com/rfordatascience/tidytuesday/master/data/2020/2020-09-01/key_crop_yields.csv")

land <- read_csv("https://raw.githubusercontent.com/rfordatascience/tidytuesday/master/data/2020/2020-09-01/land_use_vs_yield_change_in_cereal_production.csv")

Exploratory data analysis

The first task of any data analysis and visualization project after data loading is the Exploratory Data Analysis or EDA. Exploratory data analysis helps analysts swiftly understanding the data structure and patterns in it.

Data structure

The first and foremost thing we do in EDA is understanding the data structure. Luckily, developers have built few packages such as skimr that can help us in quickly understanding the data structure. Here, I used the skim function of skimr package to see different attributes of the data.

See the structure of crop data below.

skim(crop)

Table 1: Data summary
Name	crop
Number of rows	13075
Number of columns	14
_______________________
Column type frequency:
character	2
numeric	12
________________________
Group variables	None

Variable type: character

skim_variable	n_missing	complete_rate	min	max	empty	n_unique	whitespace
Entity	0	1.00	4	39	0	249	0
Code	1919	0.85	3	8	0	214	0

Variable type: numeric

skim_variable	n_missing	complete_rate	mean	sd	p0	p25	p50	p75	p100	hist
Year	0	1.00	1990.37	16.73	1961.00	1976.00	1991.00	2005.00	2018.00	▇▆▇▇▇
Wheat (tonnes per hectare)	4974	0.62	2.43	1.69	0.00	1.23	1.99	3.12	10.67	▇▅▂▁▁
Rice (tonnes per hectare)	4604	0.65	3.16	1.85	0.20	1.77	2.74	4.16	10.68	▇▇▃▁▁
Maize (tonnes per hectare)	2301	0.82	3.02	3.13	0.03	1.14	1.83	3.92	36.76	▇▁▁▁▁
Soybeans (tonnes per hectare)	7114	0.46	1.45	0.75	0.00	0.86	1.33	1.90	5.95	▇▇▂▁▁
Potatoes (tonnes per hectare)	3059	0.77	15.40	9.29	0.84	8.64	13.41	20.05	75.30	▇▅▁▁▁
Beans (tonnes per hectare)	5066	0.61	1.09	0.82	0.03	0.59	0.83	1.35	9.18	▇▁▁▁▁
Peas (tonnes per hectare)	6840	0.48	1.48	1.01	0.04	0.72	1.15	1.99	7.16	▇▃▁▁▁
Cassava (tonnes per hectare)	5887	0.55	9.34	5.11	1.00	5.55	8.67	11.99	38.58	▇▇▁▁▁
Barley (tonnes per hectare)	6342	0.51	2.23	1.50	0.09	1.05	1.88	3.02	9.15	▇▆▂▁▁
Cocoa beans (tonnes per hectare)	8466	0.35	0.39	0.28	0.00	0.24	0.36	0.49	3.43	▇▁▁▁▁
Bananas (tonnes per hectare)	4166	0.68	15.20	12.08	0.66	5.94	11.78	20.79	77.59	▇▃▁▁▁

skim function provides information about:

Data dimensions i.e., number of rows and columns in the data.
Data types i.e., character, factor or numeric, and their frequency.
Descriptive statistics or attributes of each column based on data type.
For example, if the data is of numeric type, it provides statistics such as mean, standard deviation, quantiles and missing values.
And if the data is of character type, it provides character data descriptive and missing values.

In short, this function is very useful and you can use it to quickly understand the structure of any data you are working on.

Below is the structure of population data.

skim(land)

Table 2: Data summary
Name	land
Number of rows	49259
Number of columns	6
_______________________
Column type frequency:
character	3
numeric	3
________________________
Group variables	None

Variable type: character

skim_variable	n_missing	complete_rate	min	max	n_unique
Entity	0	1.00	4	50	286
Code	3178	0.94	3	8	235
Year	0	1.00	1	10	258

Variable type: numeric

skim_variable	n_missing	complete_rate	mean	sd	p0	p25	p50	p75	p100	hist
Cereal yield index	38837	0.21	169.29	8.43900e+01	0	111.0	147	209	1335	▇▁▁▁▁
Change to land area used for cereal production since 1961	38837	0.21	133.71	1.22230e+02	0	94.0	112	145	3503	▇▁▁▁▁
Total population (Gapminder)	2376	0.95	29827896.89	2.53086e+08	905	201733.5	1542937	5886795	7713467904	▇▁▁▁▁

If you have looked through the summaries of both data sets, you might have noticed that both the crop and population data have missing values in some of the columns.
We can also visualize the amount of missing data in each column using the plot_missing function from the DataExplorer library. These missing data plots categorize the amount of missing data into 3 categories such as OK, Good and Bad. Columns under Bad category needs to be fixed before moving towards data analysis.

Crop data

crop %>%
  plot_missing()

Population data

land %>% 
  plot_missing()

After the data structure, we can start exploring the data distributions. Here, I ’ll start with the population data. Before visualizing the distribution of population, I’ll calculate the population in million column to use instead of raw population data. At first, I’ll only visualize the worldwide population data from 2019.

Population data distribution (2019)

# Population in Millions
Pop <- land %>%
  filter(!is.na(Code)) %>%
  mutate(Year = as.numeric(Year)) %>% 
  mutate(pop = `Total population (Gapminder)`/1000000)

## Distribution (2019)
Pop %>% 
  filter(Entity != "World") %>%
  group_by(Entity) %>%
  filter(Year == max(Year)) %>%
  ungroup() %>% 
  ggplot(aes(x = pop)) +
  geom_histogram(fill = "steelblue") +
  labs(x = "Population", y = "Count") +
  theme_bw()

The population data is right skewed.

Now, I’ll extract the top 10 populated countries and present them along with their population using gt function. gt library is another wonderful package of R that builds professional quality tables in Rmarkdown.

Pop %>%
  filter(Entity != "World") %>%
  group_by(Entity) %>%
  filter(Year == max(Year)) %>%
  ungroup() %>% 
  slice_max(pop, n = 10) %>% 
  select(Entity, pop) %>%
  rename(Country = Entity, `Population (millions)` = pop) %>% 
  gt() %>% 
  tab_header(
    title = "Top 10 Populated Countries of the World"
  )

Top 10 Populated Countries of the World
Country	Population (millions)
China	1433.784
India	1366.418
United States	329.065
Indonesia	270.626
Pakistan	216.565
Brazil	211.050
Nigeria	200.964
Bangladesh	163.046
Russia	145.872
Mexico	127.576

These are the top 10 populated countries according to the data. However, I’ll use only the data from China, India, United States, Pakistan, Brazil and Russia for comparison in subsequent steps.

Now, I’ll plot the worldwide population growth trend from 1900 to 2019. Afterwards, I’ll plot the population growth trends in the selected 6 countries.

Worldwide Population Growth

Pop %>% 
  filter(Entity == "World") %>% 
  filter(Year>=1900) %>% 
  ggplot(aes(x = Year, y = pop)) +
  geom_line() +
  geom_point() +
  labs(x= "Year", y = "Population (Millions)") +
  theme_bw()

Populated countries

Pop %>% 
  filter(Entity %in% c("Pakistan", "United States", "India", "China", "Brazil", "Russia")) %>%
  filter(Year>=1900) %>%  
  ggplot(aes(x = Year, y = pop, color = Entity)) +
  geom_line() +
  geom_point() +
  labs(x = "Year", y = "Population (millions)", color = NULL) +
  theme_bw() +
  theme(legend.position = "top")

Now, I’ll move towards crop data and will do some cleaning before making visualizations.

At first, I’ll gather the data in a tidy format for subsequent visualizations and data analysis.
Also, I’ll clean the crop names to make them easily readable.

Crop <- crop %>% 
  gather(crop, yield, `Wheat (tonnes per hectare)`:`Bananas (tonnes per hectare)`, na.rm = FALSE, convert = TRUE) %>%
  mutate(crop = str_remove(crop, regex(" \\(tonnes per hectare\\)", ignore_case = TRUE)))

Crop

## # A tibble: 143,825 x 5
##    Entity      Code   Year crop  yield
##    <chr>       <chr> <dbl> <chr> <dbl>
##  1 Afghanistan AFG    1961 Wheat 1.02 
##  2 Afghanistan AFG    1962 Wheat 0.974
##  3 Afghanistan AFG    1963 Wheat 0.832
##  4 Afghanistan AFG    1964 Wheat 0.951
##  5 Afghanistan AFG    1965 Wheat 0.972
##  6 Afghanistan AFG    1966 Wheat 0.867
##  7 Afghanistan AFG    1967 Wheat 1.12 
##  8 Afghanistan AFG    1968 Wheat 1.16 
##  9 Afghanistan AFG    1969 Wheat 1.19 
## 10 Afghanistan AFG    1970 Wheat 0.956
## # ... with 143,815 more rows

Distribution of crop yields

Distribution of crop yield data obtained from 1961 to 2019.

Crop %>% 
  ggplot(aes(x = yield)) +
  geom_histogram(fill = "steelblue") +
  theme_bw() +
  labs(x = "Yield") +
  facet_wrap(~crop, scales = "free")

Jitter plots

Crop %>% 
  ggplot(aes(x = reorder(crop, yield), y = yield, color = crop)) +
  geom_jitter(position = position_jitter(0.2), alpha = 0.1) +
  labs(x = "Crop", y = "Yield") +
  theme_bw() +
  theme(legend.position = "none")

Yield trends

Crop %>% 
  group_by(Year, crop) %>% 
  summarise(yield = mean(yield, na.rm = TRUE)) %>% 
  ggplot(aes(x = Year, y = yield, color = crop)) +
  geom_line() +
  geom_point() +
  labs(x = "Year", y = "Yield (tons per hectare)", color = NULL) +
  theme_bw()

  theme(legend.position = "top")

## List of 1
##  $ legend.position: chr "top"
##  - attr(*, "class")= chr [1:2] "theme" "gg"
##  - attr(*, "complete")= logi FALSE
##  - attr(*, "validate")= logi TRUE

Trends by Countries

`%nin%` = Negate(`%in%`)
  
Crop %>% 
  filter(Entity %in% c("Pakistan", "United States", "India", "China", "Brazil", "Russia")) %>% 
  filter(crop %nin% c("Cassava", "Cocoa beans")) %>% 
  group_by(Entity, Year, crop) %>% 
  summarise(yield = mean(yield, na.rm = TRUE)) %>% 
  ggplot(aes(x = Year, y = yield, color = Entity)) +
  geom_line() +
  geom_point() +
  labs(x = "Year", y = "Yield (tons per hectare)", color = NULL) +
  theme_bw() +
  facet_wrap(~crop, scales = "free") +
  theme(legend.position = "top")

Now, I’ll plot both the population growth and per hectare yield change over time in same graph. These plots will visualize the change in population and crop yield over time simultaneously. At first, I’ll join both the crop and population data sets for the six most populated countries using the left_join function in dplyr package. For population data, I’ll filter the data after 1960 to match with crop production data.

Pop_Crop <- Pop %>% 
  filter(Entity %in% c("Pakistan", "United States", "India", "China", "Brazil", "Russia")) %>%
  filter(Year >= 1960) %>% 
  unite(ID, Entity, Year, sep = "_", remove = FALSE, na.rm = FALSE) %>% 
  left_join(
            Crop %>% 
              filter(Entity %in% c("Pakistan", "United States", "India", "China", "Brazil", "Russia")) %>%
              unite(ID, Entity, Year, sep = "_", remove = FALSE, na.rm = FALSE) %>% 
              select(-Entity, -Code, -Year),
            by = c("ID" = "ID")
  )

Joint Trends

At first, I’ll present the joint crop yield and population trends of each country separately.

By Country

China

cols <- c("Yield" = "#08519c", "Population" = "red")

Pop_Crop %>% 
  filter(Entity == "China" & !is.na(crop)) %>% 
  filter(crop %nin% c("Cassava", "Cocoa beans")) %>% 
  ggplot(aes(x = Year)) +
  geom_line(aes(y = yield, colour = "Yield")) +
  geom_line(aes(y = 0.05*pop, color = "Population"), group = 1) +
  scale_y_continuous(sec.axis = sec_axis(~./0.05, name = "Population (millions)")) +
  scale_colour_manual("", values=cols) +
  labs(y = "Yield (tons per hectare)", color = "Legend") +
  theme_bw() +
  facet_wrap(~crop, scales = "free") +
   theme(legend.position = "top")

India

Pop_Crop %>% 
  filter(Entity == "India" & !is.na(crop)) %>% 
  filter(crop %nin% c("Cassava", "Cocoa beans")) %>% 
  ggplot(aes(x = Year)) +
  geom_line(aes(y = yield, colour = "Yield")) +
  geom_line(aes(y = 0.05*pop, color = "Population"), group = 1) +
  scale_y_continuous(sec.axis = sec_axis(~./0.05, name = "Population (millions)")) +
  scale_colour_manual("", values=cols) +
  labs(y = "Yield (tons per hectare)", color = "Legend") +
  theme_bw() +
  facet_wrap(~crop, scales = "free") +
   theme(legend.position = "top")

United States

Pop_Crop %>% 
  filter(Entity == "United States" & !is.na(crop)) %>% 
  filter(crop %nin% c("Cassava", "Cocoa beans")) %>% 
  ggplot(aes(x = Year)) +
  geom_line(aes(y = yield, colour = "Yield")) +
  geom_line(aes(y = 0.05*pop, color = "Population"), group = 1) +
  scale_y_continuous(sec.axis = sec_axis(~./0.05, name = "Population (millions)")) +
  scale_colour_manual("", values=cols) +
  labs(y = "Yield (tons per hectare)", color = "Legend") +
  theme_bw() +
  facet_wrap(~crop, scales = "free") +
   theme(legend.position = "top")

Pakistan

Pop_Crop %>% 
  filter(Entity == "Pakistan" & !is.na(crop)) %>% 
  filter(crop %nin% c("Cassava", "Cocoa beans")) %>% 
  ggplot(aes(x = Year)) +
  geom_line(aes(y = yield, colour = "Yield")) +
  geom_line(aes(y = 0.05*pop, color = "Population"), group = 1) +
  scale_y_continuous(sec.axis = sec_axis(~./0.05, name = "Population (millions)")) +
  scale_colour_manual("", values=cols) +
  labs(y = "Yield (tons per hectare)", color = "Legend") +
  theme_bw() +
  facet_wrap(~crop, scales = "free") +
   theme(legend.position = "top")

Brazil

Pop_Crop %>% 
  filter(Entity == "Brazil" & !is.na(crop)) %>% 
  filter(crop %nin% c("Cassava", "Cocoa beans")) %>% 
  ggplot(aes(x = Year)) +
  geom_line(aes(y = yield, colour = "Yield")) +
  geom_line(aes(y = 0.05*pop, color = "Population"), group = 1) +
  scale_y_continuous(sec.axis = sec_axis(~./0.05, name = "Population (millions)")) +
  scale_colour_manual("", values=cols) +
  labs(y = "Yield (tons per hectare)", color = "Legend") +
  theme_bw() +
  facet_wrap(~crop, scales = "free") +
   theme(legend.position = "top")

Russia

Pop_Crop %>% 
  filter(Entity == "Russia" & !is.na(crop)) %>% 
  filter(crop %nin% c("Cassava", "Cocoa beans")) %>% 
  ggplot(aes(x = Year)) +
  geom_line(aes(y = yield, colour = "Yield")) +
  geom_line(aes(y = 0.05*pop, color = "Population"), group = 1) +
  scale_y_continuous(sec.axis = sec_axis(~./0.05, name = "Population (millions)")) +
  scale_colour_manual("", values=cols) +
  labs(y = "Yield (tons per hectare)", color = "Legend") +
  theme_bw() +
  facet_wrap(~crop, scales = "free") +
   theme(legend.position = "top")

By Crop

Now, I’ll plot these joint trends by crop to compare the populated countries simultaneously.

Wheat

Pop_Crop %>% 
  filter(!is.na(crop)) %>% 
  filter(crop == "Wheat") %>% 
  ggplot(aes(x = Year)) +
  geom_line(aes(y = yield, colour = "Yield")) +
  geom_line(aes(y = 0.05*pop, color = "Population"), group = 1) +
  scale_y_continuous(sec.axis = sec_axis(~./0.05, name = "Population (millions)")) +
  scale_colour_manual("", values=cols) +
  labs(y = "Yield (tons per hectare)", color = "Legend") +
  theme_bw() +
  facet_wrap(~Entity, scales = "free") +
  theme(legend.position = "top")

Rice

Pop_Crop %>% 
  filter(!is.na(crop)) %>% 
  filter(crop == "Rice") %>% 
  ggplot(aes(x = Year)) +
  geom_line(aes(y = yield, colour = "Yield")) +
  geom_line(aes(y = 0.05*pop, color = "Population"), group = 1) +
  scale_y_continuous(sec.axis = sec_axis(~./0.05, name = "Population (millions)")) +
  scale_colour_manual("", values=cols) +
  labs(y = "Yield (tons per hectare)", color = "Legend") +
  theme_bw() +
  facet_wrap(~Entity, scales = "free") +
  theme(legend.position = "top")

Maize

Pop_Crop %>% 
  filter(!is.na(crop)) %>% 
  filter(crop == "Maize") %>% 
  ggplot(aes(x = Year)) +
  geom_line(aes(y = yield, colour = "Yield")) +
  geom_line(aes(y = 0.05*pop, color = "Population"), group = 1) +
  scale_y_continuous(sec.axis = sec_axis(~./0.05, name = "Population (millions)")) +
  scale_colour_manual("", values=cols) +
  labs(y = "Yield (tons per hectare)", color = "Legend") +
  theme_bw() +
  facet_wrap(~Entity, scales = "free") +
  theme(legend.position = "top")

Potatoes

Pop_Crop %>% 
  filter(!is.na(crop)) %>% 
  filter(crop == "Potatoes") %>% 
  ggplot(aes(x = Year)) +
  geom_line(aes(y = yield, colour = "Yield")) +
  geom_line(aes(y = 0.05*pop, color = "Population"), group = 1) +
  scale_y_continuous(sec.axis = sec_axis(~./0.05, name = "Population (millions)")) +
  scale_colour_manual("", values=cols) +
  labs(y = "Yield (tons per hectare)", color = "Legend") +
  theme_bw() +
  facet_wrap(~Entity, scales = "free") +
  theme(legend.position = "top")

Soybeans

Pop_Crop %>% 
  filter(!is.na(crop)) %>% 
  filter(crop == "Soybeans") %>% 
  ggplot(aes(x = Year)) +
  geom_line(aes(y = yield, colour = "Yield")) +
  geom_line(aes(y = 0.05*pop, color = "Population"), group = 1) +
  scale_y_continuous(sec.axis = sec_axis(~./0.05, name = "Population (millions)")) +
  scale_colour_manual("", values=cols) +
  labs(y = "Yield (tons per hectare)", color = "Legend") +
  theme_bw() +
  facet_wrap(~Entity, scales = "free") +
  theme(legend.position = "top")

Note: Readers can use this code to produce maps for other crops.

Modeling per hectare crop yield over time

Now, I’ll use simple linear models to estimate the improvement in crop yield (tons per hectare) per year. For this purpose, I’ll use tidymodels library. In tidymodels, we can optimally use the tidy functionality to analyze data without creating subsets of data. We can nest the variables for which we want to run the model simultaneously.

For instance, I want to model yield per hectare over years for all of the provided crops in the selected six populated countries. I can easily remove the unwanted columns from the data and then can nest the yield and year information for subsequent linear regression analysis. After the analysis, I can easily un-nest the tidied information of the model fits including slop coefficients, standard error, confidence intervals and p-value.

Slop coefficient = change in crop yield (tons per hectare ) per year.

slopes <- Pop_Crop %>% 
  select(-ID, -Code, -`Cereal yield index`, -`Change to land area used for cereal production since 1961`, -`Total population (Gapminder)`, -pop) %>% 
  filter(crop %nin% c("Cassava", "Cocoa beans") & !is.na(yield)) %>%
  nest(yields = c(Year, yield)) %>% 
  mutate(
    fit = map(yields, ~ lm(yield ~ Year, data = .x))
    ) %>% 
  mutate(tidied = map(fit, tidy)) %>%
  unnest(tidied) %>% 
  filter(term == "Year") %>%
  mutate(p.value = p.adjust(p.value)) %>% 
  select(-yields, -fit)

slopes

## # A tibble: 53 x 7
##    Entity crop     term  estimate std.error statistic  p.value
##    <chr>  <chr>    <chr>    <dbl>     <dbl>     <dbl>    <dbl>
##  1 Brazil Wheat    Year   0.0366    0.00222     16.5  1.17e-21
##  2 Brazil Rice     Year   0.0755    0.00490     15.4  2.22e-20
##  3 Brazil Maize    Year   0.0709    0.00395     18.0  1.96e-23
##  4 Brazil Soybeans Year   0.0388    0.00154     25.1  1.48e-30
##  5 Brazil Potatoes Year   0.440     0.0148      29.7  2.51e-34
##  6 Brazil Beans    Year   0.00770   0.00113      6.81 9.10e- 8
##  7 Brazil Barley   Year   0.0466    0.00319     14.6  2.30e-19
##  8 Brazil Bananas  Year  -0.0349    0.0169      -2.06 4.36e- 2
##  9 Brazil Peas     Year   0.0385    0.00717      5.37 7.30e- 5
## 10 China  Wheat    Year   0.0880    0.00141     62.6  8.23e-52
## # ... with 43 more rows

Visualization of slope coefficients

Now, I’ll visualize the slope coefficients against p-values of null hypothesis.
This visualization will help us in understanding that which countries have demonstrated the improved yield over time in each crop.

slopes %>%
  ggplot(aes(x = estimate, y = p.value, label = Entity)) +
  geom_vline(
    xintercept = 0, lty = 2,
    size = 1.5, alpha = 0.7, color = "gray50"
  ) +
  geom_point(aes(color = crop), alpha = 0.8, size = 2.5, show.legend = FALSE) +
  scale_y_log10() +
  facet_wrap(~crop) +
  geom_text_repel(size = 3) +
  theme_light() +
  theme(strip.text = element_text(size = 12)) +
  labs(x = "Change in yield (tons per hectare) per year", y = "P-value")

The vertical line indicates slope = 0 no change in per hectare crop yield.
Countries on the right side of the dashed line have demonstrated improvement in per hectare yield over time.
While, countries on the left side have demonstrated decrease in yield over time.

That’s all for this blog post. I hope you have enjoyed reading it and might get to know some new functions and R packages. You can use this code freely to clean, visualize and analyze your data. If you got any questions, leave those in comments and I’ll be happy to reply.

R Food Security Data Science Data visualization ggplot2