ECON 366 — Energy Economics
  • Syllabus
  • Schedule
  • Course Deliverables
  • Data Exercises

Data Exercise, Week 1

BP Statistical Review of World Energy

This week, the first part of our Data Exercise focuses on an interesting source of data, the Statistical Review of World Energy, now published by the bp-affiliated Energy Institute. We’ll use this to learn a couple of new commands in R. The idea is to ramp you on-board with R slowly but surely, and this exercise is not graded. Later this month, you will have a graded exercise that uses some of these skills. For today, let’s make a graph of primary energy consumption by source.

The reason I wanted you to use this data this week is so that you can set yourself up in parallel in Excel and R so that you can visualize what’s happening in R behind the scenes and so that you can also get used to using spreadsheet-based data in R. You can do this exercise all in scripts, but I’d encourage you to build an RMarkdown document so that you get used to using them as you’ll need them for assignments later on. If you want to do so, then you can adapt all of these code chunks into your own R-Markdown (RMD) file, and you can use week2.Rmd as your starting point. Knit it before you start playing with it, just so you see that it works, then make a clean version to use for this week’s exercise.

There is also a lot of data cleaning to do, which will help you get your head around how data are stored in R.

First, as you likely normally would, download the XLSX file for the Statistical Review. Open it in Excel or Sheets, whichever you normally use. For this exercise, we’re going to use the third tab, for Total Primary Energy Consumption By Fuel (labelled PE Cons by fuel EJ).

We’re going to learn about a few things here: reading from Excel using library(readxl) (make sure you have the library installed), more filtering skills, and making stacked bar charts. I’m going to use some kableExtra tables in this document just to make things a bit easier for you to see, but you don’t need to use those parts to make the graph at the end.

As usual, we’ll start with loading the packages we need. Create a code chunk in your document that looks like this:

library(tidyverse)
library(kableExtra)
library(readxl)
library(janitor)

Let’s download the data. Create a code chunk to do this and then to read it into R.

#check to see if the file exists. The first line after the if will only run if it doesn't
#you don't need this line, but it's a nice efficiency improvement for your code and one that I should use more often
if(!file.exists("bp_2024.xlsx"))
#we'll download the data giving it a local file name that's easy to deal with - remember to use the mode="wb" so that you get the windows binary file
    download.file("https://www.energyinst.org/__data/assets/excel_doc/0020/1540550/EI-Stats-Review-All-Data.xlsx",destfile = "bp_2024.xlsx",mode="wb")
#and now, let's read the data in, but specify the range for the data to read in from the Excel that you have open!
bp_data<-read_excel("bp_2024.xlsx",sheet = "PE Cons by fuel EJ",range="A3:O94")%>%clean_names()

Now’s the first lesson of the day: how to execute code chunks while working on your markdown document.

These parts are code chunks and you can execute them individually (you should start from the top and do them in sequence to make sure packages are loaded - by clicking that green triangle). You can also basically start over by clicking the grey triangle with the green bar, which will run all your code chunks from the start of your document. This is the equivalent to running the commands in sequence in a script.

Code Chunks

Important note: running a code chunk is like running an R script or executing commands from the console. When you eventually go to knit your RMarkdown document, it is basically running in a new, clean environment, so it will re-do all the code chunks in sequence starting from scratch.

So, try it - copy and add the first two code chunks in this document to a new document of your own (or use the template) and, once you run then, you’ll see that you have the bp_data in memory. You can see that in the top right corner of your RStudio screen.

Data Environment

Click on it and you’ll open a viewer window and you’ll be able to see how R read in the data. It will look something like this:

Data Viewer

It’s ugly. But, you know what you’re dealing with. Let’s clean it up. In this case, it will be much easier for you to do while looking at the Excel as well.

First thing we know is that, for 2023 data, we only want columns 1 and 9-15 of the worksheet, so let’s do that using the select command.

#select is for picking columns. You can use it with numbers or names
#right now, our names aren't clear, so let's use numbers
bp_data<-bp_data %>% select(1,seq(9,15))

That gives us a much cleaner data set, but the names are all messed up. So, let’s fix them.

#assign new column names
names(bp_data)<-c("country","Oil","Natural Gas","Coal","Nuclear","Hydro","Renewables","Total")
#make sure they're "clean"
bp_data<-bp_data%>%clean_names()

So, now we have a much nicer data set, but there’s still the problem of all the blank rows. So, let’s clean that up too, using filter:

#remove observations that have an NA for country (ie. choose the ones that don't)
#this is going to annoy you, but ! is like "not" in this context.
bp_data<-bp_data%>%filter(!is.na(country))

If you mess code up at some point, you can always go back and re-read the data and start over, executing each of the chunks on the way back to here to make sure it works.

Now, I’m going to use a little bit of a fancier table maker to show you the data I’ve got after running all of this, using the kableExtra package:

  bp_data %>%
    kbl(escape = FALSE) %>%
    kable_styling(fixed_thead = T,bootstrap_options = c("hover", "condensed","responsive"))%>%
  scroll_box(width = "1000px", height = "400px")%>%
#adding this I() at the end of a chain of commands is like a placeholder in case you would otherwise end with a %>%. 
# It makes it easier to paste in code from elsewhere
      I() 
country oil natural_gas coal nuclear hydro renewables total
Canada 4.352 4.346 0.371 0.799 3.406 0.676 13.950
Mexico 3.837 3.513 0.264 0.111 0.191 0.537 8.453
US 35.858 31.913 8.197 7.324 2.210 8.779 94.281
Total North America 44.048 39.772 8.831 8.234 5.807 9.993 116.684
Argentina 1.409 1.618 0.046 0.080 0.280 0.238 3.671
Brazil 5.108 1.080 0.573 0.130 4.009 2.974 13.873
Chile 0.832 0.253 0.176 0.000 0.225 0.329 1.815
Colombia 0.977 0.470 0.156 0.000 0.560 0.093 2.255
Ecuador 0.579 0.018 0.002 0.000 0.237 0.007 0.844
Peru 0.519 0.364 0.029 0.000 0.272 0.054 1.238
Trinidad & Tobago 0.068 0.522 0.000 0.000 0.000 0.000 0.590
Venezuela 0.843 1.069 0.007 0.000 0.613 0.000 2.531
Other S. & Cent. America 2.676 0.430 0.176 0.000 0.819 0.359 4.459
Total S. & Cent. America 13.010 5.821 1.165 0.211 7.013 4.055 31.276
Austria 0.486 0.248 0.096 0.000 0.363 0.192 1.385
Belgium 1.134 0.493 0.099 0.295 0.004 0.286 2.311
Czech Republic 0.418 0.241 0.471 0.273 0.022 0.100 1.524
Finland 0.327 0.043 0.088 0.307 0.142 0.288 1.196
France 2.764 1.219 0.178 3.035 0.519 0.948 8.664
Germany 4.012 2.724 1.826 0.065 0.184 2.600 11.410
Greece 0.619 0.194 0.048 0.000 0.037 0.201 1.099
Hungary 0.337 0.295 0.036 0.143 0.002 0.098 0.911
Italy 2.473 2.110 0.217 0.000 0.364 0.782 5.945
Netherlands 1.703 0.928 0.158 0.036 0.001 0.611 3.436
Norway 0.385 0.137 0.035 0.000 1.273 0.163 1.991
Poland 1.405 0.705 1.506 0.000 0.022 0.478 4.116
Portugal 0.445 0.161 0.000 0.000 0.113 0.229 0.948
Romania 0.455 0.327 0.114 0.100 0.171 0.108 1.276
Spain 2.568 1.055 0.119 0.510 0.238 1.173 5.662
Sweden 0.466 0.026 0.065 0.435 0.618 0.542 2.151
Switzerland 0.387 0.099 0.004 0.209 0.346 0.088 1.133
Turkey 2.301 1.743 1.649 0.000 0.597 0.713 7.003
Ukraine 0.393 0.675 0.487 0.470 0.119 0.076 2.220
United Kingdom 2.688 2.286 0.184 0.366 0.049 1.378 6.950
Other Europe 2.566 0.973 1.015 0.360 0.791 0.808 6.513
Total Europe 28.333 16.681 8.395 6.604 5.973 11.860 77.846
Azerbaijan 0.257 0.544 0.000 0.000 0.016 0.003 0.820
Belarus 0.304 0.604 0.041 0.105 0.003 0.009 1.066
Kazakhstan 0.689 0.771 1.386 0.000 0.082 0.053 2.981
Russian Federation 7.212 16.321 3.834 1.951 1.878 0.095 31.291
Turkmenistan 0.279 1.323 0.000 0.000 0.000 0.000 1.602
Uzbekistan 0.222 1.677 0.137 0.000 0.065 0.003 2.104
Other CIS 0.199 0.215 0.089 0.024 0.321 0.010 0.857
Total CIS 9.161 21.457 5.486 2.080 2.366 0.172 40.722
Iran 3.502 8.840 0.079 0.060 0.212 0.020 12.712
Iraq 1.803 0.673 0.000 0.000 0.032 0.004 2.512
Israel 0.445 0.452 0.138 0.000 0.000 0.079 1.114
Kuwait 0.762 0.808 0.006 0.000 0.000 0.002 1.577
Oman 0.466 1.062 0.005 0.000 0.000 0.015 1.548
Qatar 0.613 1.591 0.000 0.000 0.000 0.014 2.218
Saudi Arabia 7.431 4.109 0.005 0.000 0.000 0.054 11.598
United Arab Emirates 2.204 2.408 0.102 0.290 0.000 0.129 5.133
Other Middle East 1.059 0.854 0.044 0.000 0.014 0.074 2.046
Total Middle East 18.285 20.798 0.379 0.349 0.258 0.390 40.459
Algeria 0.858 1.667 0.007 0.000 0.001 0.006 2.538
Egypt 1.494 2.162 0.050 0.000 0.129 0.103 3.938
Morocco 0.571 0.032 0.292 0.000 0.002 0.081 0.977
South Africa 1.086 0.171 3.325 0.080 0.016 0.175 4.853
Other Africa 4.484 2.133 0.408 0.000 1.364 0.174 8.564
Total Africa 8.492 6.164 4.082 0.080 1.512 0.539 20.869
Australia 2.166 1.444 1.508 0.000 0.143 0.759 6.021
Bangladesh 0.518 1.013 0.282 0.000 0.006 0.012 1.831
China 32.726 14.574 91.939 3.901 11.465 16.134 170.739
China Hong Kong SAR 0.574 0.174 0.147 0.000 0.000 0.008 0.903
India 10.573 2.254 21.981 0.433 1.395 2.380 39.016
Indonesia 3.096 1.636 4.320 0.000 0.230 0.826 10.108
Japan 6.655 3.327 4.536 0.695 0.697 1.493 17.404
Malaysia 1.787 1.659 0.983 0.000 0.295 0.085 4.807
New Zealand 0.319 0.137 0.041 0.000 0.248 0.116 0.861
Pakistan 0.782 1.362 0.617 0.201 0.350 0.060 3.372
Philippines 0.933 0.115 0.882 0.000 0.095 0.168 2.193
Singapore 2.994 0.444 0.014 0.000 0.000 0.018 3.470
South Korea 5.363 2.163 2.694 1.620 0.035 0.559 12.434
Sri Lanka 0.218 0.000 0.059 0.000 0.055 0.025 0.358
Taiwan 1.635 1.006 1.490 0.160 0.037 0.201 4.529
Thailand 2.311 1.698 0.605 0.000 0.062 0.331 5.007
Vietnam 1.195 0.260 2.323 0.000 0.757 0.356 4.891
Other Asia Pacific 1.256 0.407 1.274 0.000 0.855 0.037 3.829
Total Asia Pacific 75.102 33.674 135.695 7.009 16.723 23.571 291.774
Total World 196.430 144.366 164.034 24.567 39.651 50.580 619.629
of which: OECD 87.145 63.252 25.150 16.437 12.957 24.956 229.897
Non-OECD 109.285 81.114 138.884 8.130 26.694 25.625 389.732
European Union 21.446 11.501 5.483 5.559 3.046 9.349 56.383

I still don’t have exactly what I want, because I still have a lot of lines that don’t contain data I want to graph - I just want to use the totals, so I’ll show you a couple of tricks to clean those out.

#keep only observations that have total in the country cell using the grepl command
#grepl is a logical search function, so it will keep any observation (row) for which the term Total (case sensitive) appears in the country column
bp_data<-bp_data%>%filter(grepl("Total",country))

grepl is a really handy command that returns true for any cell that contains the word “Total” and false otherwise. It will be case sensitive, so check to make sure it worked. There are tricks to make it more general, but we’ll deal with those later.

Now, in this case, I want to make a graph by region, so I’m going to clean up the data a bit more. I’ll use gsub to find and replace text to strip the word “Total” out of all my regions.

#mutate to strip out the totals (note the space after total inside the quotes)
#gsub is a text substitution command, like a find and replace
bp_data<-bp_data%>%mutate(country=gsub("Total ","",country))

And now we have the data that we want to graph:

  bp_data %>%
    kbl(escape = FALSE) %>%
    kable_styling(fixed_thead = T,bootstrap_options = c("hover", "condensed","responsive"))%>%
  #scroll_box(width = "1000px", height = "400px")%>%
#adding this I() at the end of a chain of commands is like a placeholder in case you would otherwise end with a %>%. 
# It makes it easier to paste in code from elsewhere
      I() 
country oil natural_gas coal nuclear hydro renewables total
North America 44.05 39.77 8.831 8.234 5.807 9.993 116.7
S. & Cent. America 13.01 5.82 1.165 0.211 7.013 4.055 31.3
Europe 28.33 16.68 8.395 6.604 5.973 11.860 77.8
CIS 9.16 21.46 5.486 2.080 2.366 0.172 40.7
Middle East 18.29 20.80 0.379 0.349 0.258 0.390 40.5
Africa 8.49 6.16 4.082 0.080 1.512 0.539 20.9
Asia Pacific 75.10 33.67 135.695 7.009 16.723 23.571 291.8
World 196.43 144.37 164.034 24.567 39.651 50.580 619.6

This is going to be the part that you find counter-intuitive, and is one of the hardest things about graphing data in groups (or using flat data files), but there are going to be a lot of those in our future, so let’s learn about them.

To make this work well, we want to transform our data from wide, with the variables one in each column, to long with data stacked vertically and an indicator variable for the groupings we need. It’s easier to see than to describe, so let’s do it and then talk about it:

# we're going to pivot the data to a longer style, by country
# we're going to end up values going to a variable called tpes and names going to a variable called source
  bp_data<-bp_data%>%pivot_longer(-country,names_to = "source", values_to = "tpes")

#show me the data
bp_data %>%
    kbl(escape = FALSE) %>%
    kable_styling(fixed_thead = T,bootstrap_options = c("hover", "condensed","responsive"),full_width = T)%>%
  scroll_box(width = "1000px", height = "400px")%>%
#adding this I() at the end of a chain of commands is like a placeholder in case you would otherwise end with a %>%. 
# It makes it easier to paste in code from elsewhere
      I() 
country source tpes
North America oil 44.048
North America natural_gas 39.772
North America coal 8.831
North America nuclear 8.234
North America hydro 5.807
North America renewables 9.993
North America total 116.684
S. & Cent. America oil 13.010
S. & Cent. America natural_gas 5.821
S. & Cent. America coal 1.165
S. & Cent. America nuclear 0.211
S. & Cent. America hydro 7.013
S. & Cent. America renewables 4.055
S. & Cent. America total 31.276
Europe oil 28.333
Europe natural_gas 16.681
Europe coal 8.395
Europe nuclear 6.604
Europe hydro 5.973
Europe renewables 11.860
Europe total 77.846
CIS oil 9.161
CIS natural_gas 21.457
CIS coal 5.486
CIS nuclear 2.080
CIS hydro 2.366
CIS renewables 0.172
CIS total 40.722
Middle East oil 18.285
Middle East natural_gas 20.798
Middle East coal 0.379
Middle East nuclear 0.349
Middle East hydro 0.258
Middle East renewables 0.390
Middle East total 40.459
Africa oil 8.492
Africa natural_gas 6.164
Africa coal 4.082
Africa nuclear 0.080
Africa hydro 1.512
Africa renewables 0.539
Africa total 20.869
Asia Pacific oil 75.102
Asia Pacific natural_gas 33.674
Asia Pacific coal 135.695
Asia Pacific nuclear 7.009
Asia Pacific hydro 16.723
Asia Pacific renewables 23.571
Asia Pacific total 291.774
World oil 196.430
World natural_gas 144.366
World coal 164.034
World nuclear 24.567
World hydro 39.651
World renewables 50.580
World total 619.629

You’re probably not used to seeing data presented this way, and it’s not as nice for table visuals, but it’s great for graphing.

Let me show you what I mean: let’s make a stacked bar plot of the individual constituents of primary energy consumption by region.

bp_data %>%
  #take out the global observations
  filter(country!="World")%>%
  #take out the total column
  filter(source!="total")%>%
  #send the data to ggplot
  ggplot()+ #make a graph
  # here, we are going to add a column "geom" to our graph
  # the data are country on the x axis and tpes on the y axis
  geom_col(aes(country,tpes,fill=source),color="black",linewidth=.25)#the color "black" are the "linewidth" describe the outlines between blocks

That’s a basic graph.

I think we might like to clean it up a bit, and get some colors to suit our sources a bit better. Let’s use a couple of other libraries, ggthemes and tools, so make sure you have those installed if you’re going to try to use them.

library(tools)
library(ggthemes)
bp_data %>%
  filter(country!="World")%>%
  filter(source!="total")%>%
  mutate(source=toTitleCase(gsub("_"," ",source)))%>% #strip out the underscore, convert to title case
  #take out the global observations and total column

  ggplot()+
  geom_col(aes(country,tpes,fill=source),color="black",linewidth=.25)+
  theme_economist_white()+#for fun, let's use the economist theme
  guides(fill=guide_legend(nrow = 1))+
  theme(legend.position = "bottom")+
  scale_fill_manual("",values=c("Coal"="black","Oil"="grey30","Natural Gas"="grey70","Nuclear"="orange","Hydro"="dodgerblue","Renewables"="darkgreen"))+
  theme(text = element_text(size=16))+
  labs(y="Primary Energy Consumption (Exajoules)",x="",
       title="Primary Energy Consumption by Region and Source, 2023",
       caption="Source: BP Statistical Review of World Energy")

And, one of the frustrations in ggplot is that it assigns the variable groups as factors and, by default, they are sorted alphabetically. In this case, it doesn’t matter much, but just so you know what’s out there, let me show you one additional step which is to create your own factor for the sources of energy, and use the package forcats to reorder them. We’ll deal with these more in the future, so this is just a little added feature you might decide you need at some point.

library(forcats)
bp_data %>%
  filter(country!="World")%>%
  filter(source!="total")%>%
  mutate(source=toTitleCase(gsub("_"," ",source)))%>% #strip out the underscore, convert to title case
  mutate(source=as_factor(source))%>% #make the sources factors in the order they appear in the data
  mutate(source=fct_relevel(source,"Coal"))%>% #use fct-relevel to make coal the first factor level
  mutate(source=fct_rev(source))%>% #reverse the order
  #take out the global observations and total column

  ggplot()+
  geom_col(aes(country,tpes,fill=source),color="black",linewidth=.25)+
  theme_economist_white()+#for fun, let's use the economist theme
  guides(fill=guide_legend(nrow = 1))+
  theme(legend.position = "bottom")+
  scale_fill_manual("",values=c("Coal"="black","Oil"="grey30","Natural Gas"="grey70","Nuclear"="orange","Hydro"="dodgerblue","Renewables"="darkgreen"))+
  theme(text = element_text(size=16))+
  labs(y="Primary Energy Consumption (Exajoules)",x="",
       title="Primary Energy Consumption by Region and Source, 2023",
       caption="Source: BP Statistical Review of World Energy")

CANSIM and R for Statistics Canada Tables

I also wanted to quickly introduce to you the idea of specialized packages in R that make access to certain data repositories very easy. For this week, I’ll show you the cansim (or StatCan) data access package. Install the cansim package, and then let’s grab some energy supply data to make one of the graphs in the first sets of slides.

Table 25-10-0029-01 from Statistics Canada has Canadian primary energy supply data:

StatCan Data

The nice thing about the r cansim package for R is that we can use it to download and process any of the data from those tables.

library(cansim)#load the cansim package
options(scipen = 999) #force it to use decimals not scientific notation
tpes_data<-get_cansim("25-10-0029-01")%>% clean_names()

This is quite a large set of data, going back to 1995 with a whole lot of different data points by energy source and region. We’re going to do a little bit of manipulation of the data to make a relatively simple graph of natural gas production, imports, and exports over time.

The first step is to modify the data using filter so that we get only the national data (geo==Canada) and for natural gas (fuel_type=="Natural gas") and we want only a select set of supply and demand charateristics (supply_and_demand_characteristics%in% c("Production","Exports","Imports")). Since the data are also provided for more than one unit of measurement, we also filter our to uom==Terajoules to leave us with only one. Here’s a nice example of how to use the filter command to pare down your data:

tpes_data<-tpes_data%>%
  filter(supply_and_demand_characteristics %in% c("Production","Exports","Imports"),
         geo=="Canada",
         fuel_type =="Natural gas",
         uom=="Terajoules")

and then we can make a graph:

 tpes_data %>% #pass the data to the graph
  ggplot(aes(ref_date,value/1000))+ #divide by 1000 to put it in petajoules
  #add lines for each group of data
  geom_line(aes(group=supply_and_demand_characteristics,color=supply_and_demand_characteristics),linewidth=1.25)+
  #create a colour scale
  scale_color_manual("",values=c("black","red","dodgerblue"))+
  labs(y="Primary Energy Production, Imports and Exports (PJ)", x="",
       title="Canadian Primary Energy Production, Imports, and Exports of Natural Gas")+
  scale_y_continuous(expand=c(0,0))+
  scale_x_discrete(expand=c(0,0))+
  expand_limits(y=0)+
  theme_classic()+
  NULL

We’ll likely use the cansim package a few more times this year, so make sure you’re familiar with it.

And, that’s this week’s data lesson. Try it out for yourself in your own document, using these code chunks as you see fit, to start to get ready for the first graded assignment later in the month.

Content 2025 by Andrew Leach
All content licensed under a Creative Commons Attribution-NonCommercial 4.0 International license (CC BY-NC 4.0)

 

Made with and Quarto
View the source at GitHub