library(tidyverse)
library(kableExtra)
library(readxl)
library(janitor)
Data Exercise, Week 1
This week, our Data Exercise focuses on an interesting source of data, the Statistical Review of World Energy, now published by the bp-affiliated Energy Institute. We’ll use this to learn a couple of new commands in R. The idea is to ramp you on-board with R slowly but surely, and this exercise is not graded. Later this month, you will have a graded exercise that uses some of these skills. For today, let’s make a graph of primary energy consumption by source.
The reason I wanted you to use this data this week is so that you can set yourself up in parallel in Excel and R so that you can visualize what’s happening in R behind the scenes and so that you can also get used to using spreadsheet-based data in R. You can do this exercise all in scripts, but I’d encourage you to build an RMarkdown document so that you get used to using them as you’ll need them for assignments later on. If you want to do so, then you can adapt all of these code chunks into your own R-Markdown (RMD) file, and you can use week2.Rmd
as your starting point. Knit it before you start playing with it, just so you see that it works, then make a clean version to use for this week’s exercise.
There is also a lot of data cleaning to do, which will help you get your head around how data are stored in R.
First, as you likely normally would, download the XLSX file for the Statistical Review. Open it in Excel or Sheets, whichever you normally use. For this exercise, we’re going to use the third tab, for Total Primary Energy Consumption By Fuel.
We’re going to learn about a few things here: reading from Excel using library(readxl)
(make sure you have the library installed), more filtering skills, and making stacked bar charts. I’m going to use some kableExtra
tables in this document just to make things a bit easier for you to see, but you don’t need to use those parts to make the graph at the end.
As usual, we’ll start with loading the packages we need. Create a code chunk in your document that looks like this:
Let’s download the data. Create a code chunk to do this and then to read it into R.
#check to see if the file exists. The first line after the if will only run if it doesn't
#you don't need this line, but it's a nice efficiency improvement for your code and one that I should use more often
if(!file.exists("bp_2023.xlsx"))
#we'll download the data giving it a local file name that's easy to deal with - remember to use the mode="wb" so that you get the windows binary file
download.file("https://www.energyinst.org/__data/assets/excel_doc/0007/1055545/EI-stats-review-all-data.xlsx",destfile = "bp_2023.xlsx",mode="wb")
#and now, let's read the data in, but specify the range for the data to read in from the Excel that you have open!
<-read_excel("bp_2023.xlsx",sheet = "Primary Energy - Cons by fuel",range="A3:O94")%>%clean_names() bp_data
Now’s the first lesson of the day: how to execute code chunks while working on your markdown document.
These parts are code chunks and you can execute them individually (you should start from the top and do them in sequence to make sure packages are loaded - by clicking that green triangle). You can also basically start over by clicking the grey triangle with the green bar, which will run all your code chunks from the start of your document. This is the equivalent to running the commands in sequence in a script.
Important note: running a code chunk is like running an R script or executing commands from the console. When you eventually go to knit your RMarkdown document, it is basically running in a new, clean environment, so it will re-do all the code chunks in sequence starting from scratch.
So, try it - copy and add the first two code chunks in this document to a new document of your own (or use the template) and, once you run then, you’ll see that you have the bp_data
in memory. You can see that in the top right corner of your RStudio screen.
Click on it and you’ll open a viewer window and you’ll be able to see how R read in the data. It will look something like this:
It’s ugly. But, you know what you’re dealing with. Let’s clean it up. In this case, it will be much easier for you to do while looking at the Excel as well.
First thing we know is that, for 2023 data, we only want columns 1 and 9-15 of the worksheet, so let’s do that using the select
command.
#select is for picking columns. You can use it with numbers or names
#right now, our names aren't clear, so let's use numbers
<-bp_data %>% select(1,seq(9,15)) bp_data
That gives us a much cleaner data set, but the names are all messed up. So, let’s fix them.
#assign new column names
names(bp_data)<-c("country","Oil","Natural Gas","Coal","Nuclear","Hydro","Renewables","Total")
#make sure they're "clean"
<-bp_data%>%clean_names() bp_data
So, now we have a much nicer data set, but there’s still the problem of all the blank rows. So, let’s clean that up too, using filter
:
#remove observations that have an NA for country (ie. choose the ones that don't)
#this is going to annoy you, but ! is like "not" in this context.
<-bp_data%>%filter(!is.na(country)) bp_data
If you mess code up at some point, you can always go back and re-read the data and start over, executing each of the chunks on the way back to here to make sure it works.
Now, I’m going to use a little bit of a fancier table maker to show you the data I’ve got after running all of this, using the kableExtra
package:
%>%
bp_data kbl(escape = FALSE) %>%
kable_styling(fixed_thead = T,bootstrap_options = c("hover", "condensed","responsive"))%>%
scroll_box(width = "1000px", height = "400px")%>%
#adding this I() at the end of a chain of commands is like a placeholder in case you would otherwise end with a %>%.
# It makes it easier to paste in code from elsewhere
I()
country | oil | natural_gas | coal | nuclear | hydro | renewables | total |
---|---|---|---|---|---|---|---|
Canada | 4.267 | 4.379 | 0.386 | 0.780 | 3.740 | 0.591 | 14.143 |
Mexico | 4.118 | 3.477 | 0.251 | 0.098 | 0.335 | 0.446 | 8.725 |
US | 36.150 | 31.724 | 9.868 | 7.315 | 2.427 | 8.427 | 95.910 |
Total North America | 44.535 | 39.579 | 10.506 | 8.193 | 6.502 | 9.464 | 118.778 |
Argentina | 1.378 | 1.645 | 0.052 | 0.067 | 0.224 | 0.238 | 3.604 |
Brazil | 5.007 | 1.151 | 0.586 | 0.131 | 4.009 | 2.526 | 13.410 |
Chile | 0.797 | 0.269 | 0.220 | 0.000 | 0.209 | 0.294 | 1.789 |
Colombia | 0.961 | 0.453 | 0.103 | 0.000 | 0.604 | 0.073 | 2.194 |
Ecuador | 0.527 | 0.020 | 0.003 | 0.000 | 0.231 | 0.006 | 0.787 |
Peru | 0.497 | 0.346 | 0.029 | 0.000 | 0.279 | 0.054 | 1.206 |
Trinidad & Tobago | 0.049 | 0.542 | 0.000 | 0.000 | 0.000 | 0.000 | 0.591 |
Venezuela | 0.532 | 1.050 | 0.001 | 0.000 | 0.625 | 0.000 | 2.209 |
Other S. & Cent. America | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 |
Total S. & Cent. America | 12.372 | 5.822 | 1.186 | 0.198 | 7.004 | 3.526 | 30.108 |
Austria | 0.484 | 0.285 | 0.102 | 0.000 | 0.334 | 0.169 | 1.374 |
Belgium | 1.156 | 0.524 | 0.115 | 0.394 | 0.003 | 0.259 | 2.450 |
Czech Republic | 0.413 | 0.266 | 0.589 | 0.279 | 0.020 | 0.104 | 1.671 |
Finland | 0.335 | 0.039 | 0.122 | 0.227 | 0.128 | 0.325 | 1.176 |
France | 2.911 | 1.381 | 0.214 | 2.654 | 0.418 | 0.808 | 8.388 |
Germany | 4.259 | 2.782 | 2.330 | 0.313 | 0.164 | 2.450 | 12.299 |
Greece | 0.616 | 0.223 | 0.072 | 0.000 | 0.043 | 0.183 | 1.137 |
Hungary | 0.347 | 0.330 | 0.050 | 0.142 | 0.002 | 0.086 | 0.957 |
Italy | 2.470 | 2.350 | 0.305 | 0.000 | 0.264 | 0.756 | 6.144 |
Netherlands | 1.785 | 0.977 | 0.232 | 0.037 | 0.000 | 0.509 | 3.541 |
Norway | 0.361 | 0.144 | 0.034 | 0.000 | 1.198 | 0.161 | 1.896 |
Poland | 1.459 | 0.645 | 1.805 | 0.000 | 0.018 | 0.386 | 4.314 |
Portugal | 0.457 | 0.202 | 0.000 | 0.000 | 0.061 | 0.208 | 0.928 |
Romania | 0.447 | 0.352 | 0.165 | 0.100 | 0.130 | 0.105 | 1.298 |
Spain | 2.657 | 1.191 | 0.169 | 0.528 | 0.171 | 1.039 | 5.755 |
Sweden | 0.499 | 0.026 | 0.073 | 0.464 | 0.655 | 0.559 | 2.276 |
Switzerland | 0.384 | 0.107 | 0.004 | 0.208 | 0.277 | 0.070 | 1.049 |
Turkey | 2.101 | 1.844 | 1.745 | 0.000 | 0.631 | 0.691 | 7.012 |
Ukraine | 0.386 | 0.694 | 0.519 | 0.559 | 0.104 | 0.069 | 2.331 |
United Kingdom | 2.668 | 2.591 | 0.211 | 0.430 | 0.050 | 1.365 | 7.315 |
Other Europe | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 |
Total Europe | 28.719 | 17.956 | 10.070 | 6.678 | 5.321 | 11.064 | 79.809 |
Azerbaijan | 0.246 | 0.436 | 0.000 | 0.000 | 0.015 | 0.003 | 0.699 |
Belarus | 0.313 | 0.665 | 0.039 | 0.042 | 0.003 | 0.007 | 1.069 |
Kazakhstan | 0.779 | 0.782 | 1.435 | 0.000 | 0.086 | 0.040 | 3.122 |
Russian Federation | 7.055 | 14.690 | 3.194 | 2.015 | 1.855 | 0.084 | 28.893 |
Turkmenistan | 0.301 | 1.351 | 0.000 | 0.000 | 0.000 | 0.000 | 1.652 |
Uzbekistan | 0.216 | 1.739 | 0.103 | 0.000 | 0.050 | 0.002 | 2.110 |
Other CIS | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 |
Total CIS | 9.096 | 19.844 | 4.871 | 2.083 | 2.326 | 0.139 | 38.359 |
Iran | 3.691 | 8.241 | 0.076 | 0.059 | 0.070 | 0.019 | 12.156 |
Iraq | 1.591 | 0.679 | 0.000 | 0.000 | 0.031 | 0.004 | 2.305 |
Israel | 0.461 | 0.406 | 0.156 | 0.000 | 0.000 | 0.071 | 1.095 |
Kuwait | 0.806 | 0.784 | 0.005 | 0.000 | 0.000 | 0.002 | 1.596 |
Oman | 0.450 | 1.027 | 0.004 | 0.000 | 0.000 | 0.015 | 1.495 |
Qatar | 0.558 | 1.321 | 0.000 | 0.000 | 0.000 | 0.005 | 1.884 |
Saudi Arabia | 7.151 | 4.333 | 0.005 | 0.000 | 0.000 | 0.008 | 11.496 |
United Arab Emirates | 2.186 | 2.515 | 0.103 | 0.181 | 0.000 | 0.065 | 5.050 |
Other Middle East | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 |
Total Middle East | 17.967 | 20.180 | 0.370 | 0.240 | 0.116 | 0.256 | 39.130 |
Algeria | 0.859 | 1.594 | 0.005 | 0.000 | 0.001 | 0.006 | 2.466 |
Egypt | 1.516 | 2.187 | 0.051 | 0.000 | 0.129 | 0.095 | 3.980 |
Morocco | 0.570 | 0.008 | 0.280 | 0.000 | 0.003 | 0.064 | 0.924 |
South Africa | 1.064 | 0.164 | 3.313 | 0.091 | 0.029 | 0.158 | 4.819 |
Other Africa | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 |
Total Africa | 8.387 | 5.850 | 3.973 | 0.091 | 1.471 | 0.485 | 20.257 |
Australia | 2.067 | 1.499 | 1.551 | 0.000 | 0.161 | 0.704 | 5.980 |
Bangladesh | 0.550 | 1.051 | 0.180 | 0.000 | 0.007 | 0.006 | 1.795 |
China | 28.157 | 13.525 | 88.414 | 3.763 | 12.231 | 13.303 | 159.393 |
China Hong Kong SAR | 0.470 | 0.163 | 0.150 | 0.000 | 0.000 | 0.003 | 0.786 |
India | 10.050 | 2.095 | 20.093 | 0.416 | 1.642 | 2.149 | 36.444 |
Indonesia | 3.064 | 1.333 | 4.379 | 0.000 | 0.256 | 0.741 | 9.774 |
Japan | 6.609 | 3.618 | 4.916 | 0.466 | 0.703 | 1.530 | 17.842 |
Malaysia | 1.727 | 1.778 | 0.938 | 0.000 | 0.305 | 0.089 | 4.837 |
New Zealand | 0.301 | 0.130 | 0.046 | 0.000 | 0.247 | 0.115 | 0.839 |
Pakistan | 0.991 | 1.382 | 0.638 | 0.201 | 0.329 | 0.062 | 3.602 |
Philippines | 0.894 | 0.111 | 0.839 | 0.000 | 0.095 | 0.169 | 2.108 |
Singapore | 2.659 | 0.470 | 0.017 | 0.000 | 0.000 | 0.018 | 3.164 |
South Korea | 5.470 | 2.229 | 2.875 | 1.586 | 0.033 | 0.515 | 12.708 |
Sri Lanka | 0.199 | 0.000 | 0.061 | 0.000 | 0.067 | 0.015 | 0.342 |
Taiwan | 1.766 | 1.010 | 1.581 | 0.214 | 0.055 | 0.156 | 4.783 |
Thailand | 2.386 | 1.595 | 0.711 | 0.000 | 0.062 | 0.309 | 5.063 |
Vietnam | 1.031 | 0.281 | 2.049 | 0.000 | 0.901 | 0.327 | 4.589 |
Other Asia Pacific | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 |
Total Asia Pacific | 69.615 | 32.655 | 130.499 | 6.646 | 17.940 | 20.241 | 277.595 |
Total World | 190.691 | 141.887 | 161.475 | 24.128 | 40.679 | 45.176 | 604.036 |
of which: OECD | 87.979 | 64.676 | 28.899 | 16.116 | 13.217 | 23.533 | 234.420 |
Non-OECD | 102.712 | 77.211 | 132.575 | 8.013 | 27.462 | 21.643 | 369.616 |
European Union | 22.134 | 12.361 | 6.979 | 5.481 | 2.599 | 8.627 | 58.180 |
I still don’t have exactly what I want, because I still have a lot of lines that don’t contain data I want to graph - I just want to use the totals, so I’ll show you a couple of tricks to clean those out.
#keep only observations that have total in the country cell using the grepl command
#grepl is a logical search function, so it will keep any observation (row) for which the term Total (case sensitive) appears in the country column
<-bp_data%>%filter(grepl("Total",country)) bp_data
grepl
is a really handy command that returns true for any cell that contains the word “Total” and false otherwise. It will be case sensitive, so check to make sure it worked. There are tricks to make it more general, but we’ll deal with those later.
Now, in this case, I want to make a graph by region, so I’m going to clean up the data a bit more. I’ll use gsub
to find and replace text to strip the word “Total” out of all my regions.
#mutate to strip out the totals (note the space after total inside the quotes)
#gsub is a text substitution command, like a find and replace
<-bp_data%>%mutate(country=gsub("Total ","",country)) bp_data
And now we have the data that we want to graph:
%>%
bp_data kbl(escape = FALSE) %>%
kable_styling(fixed_thead = T,bootstrap_options = c("hover", "condensed","responsive"))%>%
#scroll_box(width = "1000px", height = "400px")%>%
#adding this I() at the end of a chain of commands is like a placeholder in case you would otherwise end with a %>%.
# It makes it easier to paste in code from elsewhere
I()
country | oil | natural_gas | coal | nuclear | hydro | renewables | total |
---|---|---|---|---|---|---|---|
North America | 44.53 | 39.58 | 10.51 | 8.193 | 6.502 | 9.464 | 118.8 |
S. & Cent. America | 12.37 | 5.82 | 1.19 | 0.198 | 7.004 | 3.526 | 30.1 |
Europe | 28.72 | 17.96 | 10.07 | 6.678 | 5.321 | 11.064 | 79.8 |
CIS | 9.10 | 19.84 | 4.87 | 2.083 | 2.326 | 0.139 | 38.4 |
Middle East | 17.97 | 20.18 | 0.37 | 0.240 | 0.116 | 0.256 | 39.1 |
Africa | 8.39 | 5.85 | 3.97 | 0.091 | 1.471 | 0.485 | 20.3 |
Asia Pacific | 69.61 | 32.66 | 130.50 | 6.646 | 17.940 | 20.241 | 277.6 |
World | 190.69 | 141.89 | 161.47 | 24.128 | 40.679 | 45.176 | 604.0 |
This is going to be the part that you find counter-intuitive, and is one of the hardest things about graphing data in groups (or using flat data files), but there are going to be a lot of those in our future, so let’s learn about them.
To make this work well, we want to transform our data from wide, with the variables one in each column, to long with data stacked vertically and an indicator variable for the groupings we need. It’s easier to see than to describe, so let’s do it and then talk about it:
# we're going to pivot the data to a longer style, by country
# we're going to end up values going to a variable called tpes and names going to a variable called source
<-bp_data%>%pivot_longer(-country,names_to = "source", values_to = "tpes")
bp_data
#show me the data
%>%
bp_data kbl(escape = FALSE) %>%
kable_styling(fixed_thead = T,bootstrap_options = c("hover", "condensed","responsive"),full_width = T)%>%
scroll_box(width = "1000px", height = "400px")%>%
#adding this I() at the end of a chain of commands is like a placeholder in case you would otherwise end with a %>%.
# It makes it easier to paste in code from elsewhere
I()
country | source | tpes |
---|---|---|
North America | oil | 44.535 |
North America | natural_gas | 39.579 |
North America | coal | 10.506 |
North America | nuclear | 8.193 |
North America | hydro | 6.502 |
North America | renewables | 9.464 |
North America | total | 118.778 |
S. & Cent. America | oil | 12.372 |
S. & Cent. America | natural_gas | 5.822 |
S. & Cent. America | coal | 1.186 |
S. & Cent. America | nuclear | 0.198 |
S. & Cent. America | hydro | 7.004 |
S. & Cent. America | renewables | 3.526 |
S. & Cent. America | total | 30.108 |
Europe | oil | 28.719 |
Europe | natural_gas | 17.956 |
Europe | coal | 10.070 |
Europe | nuclear | 6.678 |
Europe | hydro | 5.321 |
Europe | renewables | 11.064 |
Europe | total | 79.809 |
CIS | oil | 9.096 |
CIS | natural_gas | 19.844 |
CIS | coal | 4.871 |
CIS | nuclear | 2.083 |
CIS | hydro | 2.326 |
CIS | renewables | 0.139 |
CIS | total | 38.359 |
Middle East | oil | 17.967 |
Middle East | natural_gas | 20.180 |
Middle East | coal | 0.370 |
Middle East | nuclear | 0.240 |
Middle East | hydro | 0.116 |
Middle East | renewables | 0.256 |
Middle East | total | 39.130 |
Africa | oil | 8.387 |
Africa | natural_gas | 5.850 |
Africa | coal | 3.973 |
Africa | nuclear | 0.091 |
Africa | hydro | 1.471 |
Africa | renewables | 0.485 |
Africa | total | 20.257 |
Asia Pacific | oil | 69.615 |
Asia Pacific | natural_gas | 32.655 |
Asia Pacific | coal | 130.499 |
Asia Pacific | nuclear | 6.646 |
Asia Pacific | hydro | 17.940 |
Asia Pacific | renewables | 20.241 |
Asia Pacific | total | 277.595 |
World | oil | 190.691 |
World | natural_gas | 141.887 |
World | coal | 161.475 |
World | nuclear | 24.128 |
World | hydro | 40.679 |
World | renewables | 45.176 |
World | total | 604.036 |
You’re probably not used to seeing data presented this way, and it’s not as nice for table visuals, but it’s great for graphing.
Let me show you what I mean: let’s make a stacked bar plot of the individual constituents of primary energy consumption by region.
%>%
bp_data #take out the global observations
filter(country!="World")%>%
#take out the total column
filter(source!="total")%>%
#send the data to ggplot
ggplot()+ #make a graph
# here, we are going to add a column "geom" to our graph
# the data are country on the x axis and tpes on the y axis
geom_col(aes(country,tpes,fill=source),color="black",linewidth=.25)#the color "black" are the "linewidth" describe the outlines between blocks
That’s a basic graph.
I think we might like to clean it up a bit, and get some colors to suit our sources a bit better. Let’s use a couple of other libraries, ggthemes
and tools
, so make sure you have those installed if you’re going to try to use them.
library(tools)
library(ggthemes)
%>%
bp_data filter(country!="World")%>%
filter(source!="total")%>%
mutate(source=toTitleCase(gsub("_"," ",source)))%>% #strip out the underscore, convert to title case
#take out the global observations and total column
ggplot()+
geom_col(aes(country,tpes,fill=source),color="black",linewidth=.25)+
theme_economist_white()+#for fun, let's use the economist theme
guides(fill=guide_legend(nrow = 1))+
theme(legend.position = "bottom")+
scale_fill_manual("",values=c("Coal"="black","Oil"="grey30","Natural Gas"="grey70","Nuclear"="orange","Hydro"="dodgerblue","Renewables"="darkgreen"))+
theme(text = element_text(size=16))+
labs(y="Primary Energy Consumption (Exajoules)",x="",
title="Primary Energy Consumption by Region and Source, 2023",
caption="Source: BP Statistical Review of World Energy")
And, one of the frustrations in ggplot is that it assigns the variable groups as factors and, by default, they are sorted alphabetically. In this case, it doesn’t matter much, but just so you know what’s out there, let me show you one additional step which is to create your own factor for the sources of energy, and use the package forcats
to reorder them. We’ll deal with these more in the future, so this is just a little added feature you might decide you need at some point.
library(forcats)
%>%
bp_data filter(country!="World")%>%
filter(source!="total")%>%
mutate(source=toTitleCase(gsub("_"," ",source)))%>% #strip out the underscore, convert to title case
mutate(source=as_factor(source))%>% #make the sources factors in the order they appear in the data
mutate(source=fct_relevel(source,"Coal"))%>% #use fct-relevel to make coal the first factor level
mutate(source=fct_rev(source))%>% #reverse the order
#take out the global observations and total column
ggplot()+
geom_col(aes(country,tpes,fill=source),color="black",linewidth=.25)+
theme_economist_white()+#for fun, let's use the economist theme
guides(fill=guide_legend(nrow = 1))+
theme(legend.position = "bottom")+
scale_fill_manual("",values=c("Coal"="black","Oil"="grey30","Natural Gas"="grey70","Nuclear"="orange","Hydro"="dodgerblue","Renewables"="darkgreen"))+
theme(text = element_text(size=16))+
labs(y="Primary Energy Consumption (Exajoules)",x="",
title="Primary Energy Consumption by Region and Source, 2023",
caption="Source: BP Statistical Review of World Energy")
And, that’s this week’s data lesson. Try it out for yourself in your own document, using these code chunks as you see fit, to start to get ready for the first graded assignment later in the month. Here’s my week1_ajl.Rmd
file for your reference. If you got lost here a bit, then you can download that file, knit it, and then work backwards through it.