ECON 366 — Energy Economics
  • Syllabus
  • Schedule
  • Course Deliverables
  • Data Exercises

Filtering Demo

The concept of a filter is going to be really useful to you through the year, so I thought it was worth a quick look into the idea of binary (TRUE/FALSE) conditions, since we use them a lot in economics data. I’m going to show you all the code for this demo so that you have it all in front of you.

library(kableExtra)
library(readxl)
library(janitor)
library(tidyverse)
library(lubridate)
library(scales)

knitr::opts_chunk$set(message=F,
                      warning=F)

True or False

Let’s start with some basic ideas: filtering is based around conditions being true or false, which the computer will recognize with a binary or [0,1] variable. For example, if I enter (1==2) in r, I’ll get a reply of FALSE. Try it. And, if I force a numeric value, I’ll get as.numeric(1==2)=``0.

You can do this with standard operators:

  • ==, which means “is equal to” and the double-equal-sign is used to distinguish it from assignment 1 to be equal to 2, which you would not want to do;
  • <, for example, try (1<2) in r and you’ll get TRUE. For ≤, you can use <=. Try to enter (1<=2) in r and you’ll get TRUE;
  • >, for example, try (1>2) in r and you’ll get FALSE. For ≥, you can use >=. Try to enter (1>=2) in r and you’ll get FALSE;
  • you can also check for elements in a set. For example, lets’ use the set seq(1,10,1) which is a sequence of 1 to 10 counting by ones:
test_set<-seq(1,10,1)
print(test_set)
 [1]  1  2  3  4  5  6  7  8  9 10

I can then create set of tested items:

print(test_set>=5)
 [1] FALSE FALSE FALSE FALSE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE

Or, I could check whether a particular number is in my set:

print(6 %in% test_set)
[1] TRUE

and, just to show you that it works:

print(6.5 %in% test_set)
[1] FALSE

I can also stack conditions:

print( (6.5 %in% test_set)|(6 %in% test_set)) #or
[1] TRUE
print( (6.5 %in% test_set)&(6 %in% test_set)) #and      
[1] FALSE

And, one other function which might be useful is a logical lookup, grepl:

grepl(6, test_set)
 [1] FALSE FALSE FALSE FALSE FALSE  TRUE FALSE FALSE FALSE FALSE
grepl(6.5, test_set)
 [1] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE

Using this with data

Let me show you this with one of the other data sets from the CER Energy Futures report, on crude oil production.

#we'll download the data - remember to use the mode="wb" so that you get the windows binary file
download.file("https://www.cer-rec.gc.ca/open/energy/energyfutures2023/crude-oil-production-2023.csv",destfile = "cer_crude.csv",mode="wb")
cer_crude<- read_csv("cer_crude.csv")%>%clean_names()

Now, let’s look at the first 10 observations, just as they were read into memory with not alterations:

#use the first 10 observations from the data 
cer_crude %>% head(10)%>% kbl(escape = FALSE,table.attr = "style='width:80%;'",digits=2,align=rep('c', 2)) %>%
  kable_styling(fixed_thead = T,bootstrap_options = c("hover", "condensed","responsive"),full_width = T)%>%
  I() 
x1 scenario unit region variable year value
1 Canada Net-zero Thousand Barrels per day Alberta (Upgraded Bitumen) 2005 522
2 Canada Net-zero Thousand Barrels per day Alberta (Upgraded Bitumen) 2006 619
3 Canada Net-zero Thousand Barrels per day Alberta (Upgraded Bitumen) 2007 652
4 Canada Net-zero Thousand Barrels per day Alberta (Upgraded Bitumen) 2008 620
5 Canada Net-zero Thousand Barrels per day Alberta (Upgraded Bitumen) 2009 722
6 Canada Net-zero Thousand Barrels per day Alberta (Upgraded Bitumen) 2010 703
7 Canada Net-zero Thousand Barrels per day Alberta (Upgraded Bitumen) 2011 810
8 Canada Net-zero Thousand Barrels per day Alberta (Upgraded Bitumen) 2012 817
9 Canada Net-zero Thousand Barrels per day Alberta (Upgraded Bitumen) 2013 835
10 Canada Net-zero Thousand Barrels per day Alberta (Upgraded Bitumen) 2014 842

Now, just to build your initution, let’s do a couple of things with a column we’ll call test. First, let’s use a new column called test to identify all the observations with a value greater than 100:

#use the first 20 observations from the data 
cer_crude %>% head(20)%>% 
  mutate(test=value>100)%>%
  kbl(escape = FALSE,table.attr = "style='width:80%;'",digits=2,align=rep('c', 2)) %>%
  kable_styling(fixed_thead = T,bootstrap_options = c("hover", "condensed","responsive"),full_width = T)%>%
  I() 
x1 scenario unit region variable year value test
1 Canada Net-zero Thousand Barrels per day Alberta (Upgraded Bitumen) 2005 522 TRUE
2 Canada Net-zero Thousand Barrels per day Alberta (Upgraded Bitumen) 2006 619 TRUE
3 Canada Net-zero Thousand Barrels per day Alberta (Upgraded Bitumen) 2007 652 TRUE
4 Canada Net-zero Thousand Barrels per day Alberta (Upgraded Bitumen) 2008 620 TRUE
5 Canada Net-zero Thousand Barrels per day Alberta (Upgraded Bitumen) 2009 722 TRUE
6 Canada Net-zero Thousand Barrels per day Alberta (Upgraded Bitumen) 2010 703 TRUE
7 Canada Net-zero Thousand Barrels per day Alberta (Upgraded Bitumen) 2011 810 TRUE
8 Canada Net-zero Thousand Barrels per day Alberta (Upgraded Bitumen) 2012 817 TRUE
9 Canada Net-zero Thousand Barrels per day Alberta (Upgraded Bitumen) 2013 835 TRUE
10 Canada Net-zero Thousand Barrels per day Alberta (Upgraded Bitumen) 2014 842 TRUE
11 Canada Net-zero Thousand Barrels per day Alberta (Upgraded Bitumen) 2015 850 TRUE
12 Canada Net-zero Thousand Barrels per day Alberta (Upgraded Bitumen) 2016 931 TRUE
13 Canada Net-zero Thousand Barrels per day Alberta (Upgraded Bitumen) 2017 1028 TRUE
14 Canada Net-zero Thousand Barrels per day Alberta (Upgraded Bitumen) 2018 1059 TRUE
15 Canada Net-zero Thousand Barrels per day Alberta (Upgraded Bitumen) 2019 1114 TRUE
16 Canada Net-zero Thousand Barrels per day Alberta (Upgraded Bitumen) 2020 1099 TRUE
17 Canada Net-zero Thousand Barrels per day Alberta (Upgraded Bitumen) 2021 1145 TRUE
18 Canada Net-zero Thousand Barrels per day Alberta (Upgraded Bitumen) 2022 1186 TRUE
19 Canada Net-zero Thousand Barrels per day Alberta (Upgraded Bitumen) 2023 1204 TRUE
20 Canada Net-zero Thousand Barrels per day Alberta (Upgraded Bitumen) 2024 1204 TRUE

Now, let’s do the same test but, instead of creating a new column, I’m going to filter on that test and keep only the observations with value>100:

#use the first 20 observations from the data 
cer_crude %>% head(20)%>% 
  filter(value>100)%>%
  kbl(escape = FALSE,table.attr = "style='width:80%;'",digits=2,align=rep('c', 2)) %>%
  kable_styling(fixed_thead = T,bootstrap_options = c("hover", "condensed","responsive"),full_width = T)%>%
  I() 
x1 scenario unit region variable year value
1 Canada Net-zero Thousand Barrels per day Alberta (Upgraded Bitumen) 2005 522
2 Canada Net-zero Thousand Barrels per day Alberta (Upgraded Bitumen) 2006 619
3 Canada Net-zero Thousand Barrels per day Alberta (Upgraded Bitumen) 2007 652
4 Canada Net-zero Thousand Barrels per day Alberta (Upgraded Bitumen) 2008 620
5 Canada Net-zero Thousand Barrels per day Alberta (Upgraded Bitumen) 2009 722
6 Canada Net-zero Thousand Barrels per day Alberta (Upgraded Bitumen) 2010 703
7 Canada Net-zero Thousand Barrels per day Alberta (Upgraded Bitumen) 2011 810
8 Canada Net-zero Thousand Barrels per day Alberta (Upgraded Bitumen) 2012 817
9 Canada Net-zero Thousand Barrels per day Alberta (Upgraded Bitumen) 2013 835
10 Canada Net-zero Thousand Barrels per day Alberta (Upgraded Bitumen) 2014 842
11 Canada Net-zero Thousand Barrels per day Alberta (Upgraded Bitumen) 2015 850
12 Canada Net-zero Thousand Barrels per day Alberta (Upgraded Bitumen) 2016 931
13 Canada Net-zero Thousand Barrels per day Alberta (Upgraded Bitumen) 2017 1028
14 Canada Net-zero Thousand Barrels per day Alberta (Upgraded Bitumen) 2018 1059
15 Canada Net-zero Thousand Barrels per day Alberta (Upgraded Bitumen) 2019 1114
16 Canada Net-zero Thousand Barrels per day Alberta (Upgraded Bitumen) 2020 1099
17 Canada Net-zero Thousand Barrels per day Alberta (Upgraded Bitumen) 2021 1145
18 Canada Net-zero Thousand Barrels per day Alberta (Upgraded Bitumen) 2022 1186
19 Canada Net-zero Thousand Barrels per day Alberta (Upgraded Bitumen) 2023 1204
20 Canada Net-zero Thousand Barrels per day Alberta (Upgraded Bitumen) 2024 1204

Now, let’s do the same thing with a string, to generate a column for a true/false test on whether the variable is "Conventional Light":

#use the first 20 observations from the data 
cer_crude %>% head(20)%>% 
  mutate(test=variable=="Conventional Light")%>%
  kbl(escape = FALSE,table.attr = "style='width:80%;'",digits=2,align=rep('c', 2)) %>%
  kable_styling(fixed_thead = T,bootstrap_options = c("hover", "condensed","responsive"),full_width = T)%>%
  I() 
x1 scenario unit region variable year value test
1 Canada Net-zero Thousand Barrels per day Alberta (Upgraded Bitumen) 2005 522 FALSE
2 Canada Net-zero Thousand Barrels per day Alberta (Upgraded Bitumen) 2006 619 FALSE
3 Canada Net-zero Thousand Barrels per day Alberta (Upgraded Bitumen) 2007 652 FALSE
4 Canada Net-zero Thousand Barrels per day Alberta (Upgraded Bitumen) 2008 620 FALSE
5 Canada Net-zero Thousand Barrels per day Alberta (Upgraded Bitumen) 2009 722 FALSE
6 Canada Net-zero Thousand Barrels per day Alberta (Upgraded Bitumen) 2010 703 FALSE
7 Canada Net-zero Thousand Barrels per day Alberta (Upgraded Bitumen) 2011 810 FALSE
8 Canada Net-zero Thousand Barrels per day Alberta (Upgraded Bitumen) 2012 817 FALSE
9 Canada Net-zero Thousand Barrels per day Alberta (Upgraded Bitumen) 2013 835 FALSE
10 Canada Net-zero Thousand Barrels per day Alberta (Upgraded Bitumen) 2014 842 FALSE
11 Canada Net-zero Thousand Barrels per day Alberta (Upgraded Bitumen) 2015 850 FALSE
12 Canada Net-zero Thousand Barrels per day Alberta (Upgraded Bitumen) 2016 931 FALSE
13 Canada Net-zero Thousand Barrels per day Alberta (Upgraded Bitumen) 2017 1028 FALSE
14 Canada Net-zero Thousand Barrels per day Alberta (Upgraded Bitumen) 2018 1059 FALSE
15 Canada Net-zero Thousand Barrels per day Alberta (Upgraded Bitumen) 2019 1114 FALSE
16 Canada Net-zero Thousand Barrels per day Alberta (Upgraded Bitumen) 2020 1099 FALSE
17 Canada Net-zero Thousand Barrels per day Alberta (Upgraded Bitumen) 2021 1145 FALSE
18 Canada Net-zero Thousand Barrels per day Alberta (Upgraded Bitumen) 2022 1186 FALSE
19 Canada Net-zero Thousand Barrels per day Alberta (Upgraded Bitumen) 2023 1204 FALSE
20 Canada Net-zero Thousand Barrels per day Alberta (Upgraded Bitumen) 2024 1204 FALSE

Now, let’s do the same test but, instead of creating a new column, I’m going to filter on that test and keep only the observations for which variable is "Conventional Light":

#use the first 20 observations from the data 
cer_crude %>% head(20)%>% 
  filter(variable=="Conventional Light")%>%
  kbl(escape = FALSE,table.attr = "style='width:80%;'",digits=2,align=rep('c', 2)) %>%
  kable_styling(fixed_thead = T,bootstrap_options = c("hover", "condensed","responsive"),full_width = T)%>%
  I() 
x1 scenario unit region variable year value

Groups, Step by Step

Here are a couple of hints for filtering by groups of characteristics. For example, you might wish to look at the CER data for a set of oil production types.

Let’s see what we’re dealing with. You can use the command unique with a single column to see all the different values that appear in that column. For example:

unique(cer_crude%>%select(variable))
# A tibble: 8 × 1
  variable          
  <chr>             
1 (Upgraded Bitumen)
2 C5+               
3 Total             
4 Conventional Heavy
5 Conventional Light
6 Field Condensate  
7 In Situ Bitumen   
8 Mined Bitumen     

The nice thing about printing it out like that is you can cut and paste from the output, so you won’t miss any weird spacing that was in the data.

Let’s assume I want to make a graph or a table using only bitumen production data. I could approach this a couple of different ways:

  • I could use a filter with two conditions (variable=="Mined Bitumen") and (variable=="In Situ Bitumen") separated by an | (or):
cer_crude %>% 
  filter((variable=="Mined Bitumen")|(variable=="In Situ Bitumen"))%>%
#use the first 20 observations from the data 
  head(20)%>%
  kbl(escape = FALSE,table.attr = "style='width:80%;'",digits=2,align=rep('c', 2)) %>%
  kable_styling(fixed_thead = T,bootstrap_options = c("hover", "condensed","responsive"),full_width = T)%>%
  I() 
x1 scenario unit region variable year value
6385 Canada Net-zero Thousand Barrels per day Alberta In Situ Bitumen 2005 438
6386 Canada Net-zero Thousand Barrels per day Alberta In Situ Bitumen 2006 494
6387 Canada Net-zero Thousand Barrels per day Alberta In Situ Bitumen 2007 536
6388 Canada Net-zero Thousand Barrels per day Alberta In Situ Bitumen 2008 583
6389 Canada Net-zero Thousand Barrels per day Alberta In Situ Bitumen 2009 664
6390 Canada Net-zero Thousand Barrels per day Alberta In Situ Bitumen 2010 752
6391 Canada Net-zero Thousand Barrels per day Alberta In Situ Bitumen 2011 847
6392 Canada Net-zero Thousand Barrels per day Alberta In Situ Bitumen 2012 990
6393 Canada Net-zero Thousand Barrels per day Alberta In Situ Bitumen 2013 1106
6394 Canada Net-zero Thousand Barrels per day Alberta In Situ Bitumen 2014 1263
6395 Canada Net-zero Thousand Barrels per day Alberta In Situ Bitumen 2015 1377
6396 Canada Net-zero Thousand Barrels per day Alberta In Situ Bitumen 2016 1392
6397 Canada Net-zero Thousand Barrels per day Alberta In Situ Bitumen 2017 1544
6398 Canada Net-zero Thousand Barrels per day Alberta In Situ Bitumen 2018 1570
6399 Canada Net-zero Thousand Barrels per day Alberta In Situ Bitumen 2019 1546
6400 Canada Net-zero Thousand Barrels per day Alberta In Situ Bitumen 2020 1494
6401 Canada Net-zero Thousand Barrels per day Alberta In Situ Bitumen 2021 1656
6402 Canada Net-zero Thousand Barrels per day Alberta In Situ Bitumen 2022 1671
6403 Canada Net-zero Thousand Barrels per day Alberta In Situ Bitumen 2023 1762
6404 Canada Net-zero Thousand Barrels per day Alberta In Situ Bitumen 2024 1797
  • I could create a set bitumen_set=c("Mined Bitumen","In Situ Bitumen") and filter based on membership in the set:
bitumen_set<-c("Mined Bitumen","In Situ Bitumen")
cer_crude %>% 
  filter(variable %in% bitumen_set)%>%
  head(20)%>% #use the first 20 observations from the filtered data 
  kbl(escape = FALSE,table.attr = "style='width:80%;'",digits=2,align=rep('c', 2)) %>%
  kable_styling(fixed_thead = T,bootstrap_options = c("hover", "condensed","responsive"),full_width = T)%>%
  I() 
x1 scenario unit region variable year value
6385 Canada Net-zero Thousand Barrels per day Alberta In Situ Bitumen 2005 438
6386 Canada Net-zero Thousand Barrels per day Alberta In Situ Bitumen 2006 494
6387 Canada Net-zero Thousand Barrels per day Alberta In Situ Bitumen 2007 536
6388 Canada Net-zero Thousand Barrels per day Alberta In Situ Bitumen 2008 583
6389 Canada Net-zero Thousand Barrels per day Alberta In Situ Bitumen 2009 664
6390 Canada Net-zero Thousand Barrels per day Alberta In Situ Bitumen 2010 752
6391 Canada Net-zero Thousand Barrels per day Alberta In Situ Bitumen 2011 847
6392 Canada Net-zero Thousand Barrels per day Alberta In Situ Bitumen 2012 990
6393 Canada Net-zero Thousand Barrels per day Alberta In Situ Bitumen 2013 1106
6394 Canada Net-zero Thousand Barrels per day Alberta In Situ Bitumen 2014 1263
6395 Canada Net-zero Thousand Barrels per day Alberta In Situ Bitumen 2015 1377
6396 Canada Net-zero Thousand Barrels per day Alberta In Situ Bitumen 2016 1392
6397 Canada Net-zero Thousand Barrels per day Alberta In Situ Bitumen 2017 1544
6398 Canada Net-zero Thousand Barrels per day Alberta In Situ Bitumen 2018 1570
6399 Canada Net-zero Thousand Barrels per day Alberta In Situ Bitumen 2019 1546
6400 Canada Net-zero Thousand Barrels per day Alberta In Situ Bitumen 2020 1494
6401 Canada Net-zero Thousand Barrels per day Alberta In Situ Bitumen 2021 1656
6402 Canada Net-zero Thousand Barrels per day Alberta In Situ Bitumen 2022 1671
6403 Canada Net-zero Thousand Barrels per day Alberta In Situ Bitumen 2023 1762
6404 Canada Net-zero Thousand Barrels per day Alberta In Situ Bitumen 2024 1797

Layered filters

You can layer filters.

For example, if I wanted just Alberta bitumen production, I could do this (note, I’m using all the data at the start, then trimming after the filters):

#use all the data 
bitumen_set<-c("Mined Bitumen","In Situ Bitumen")
cer_crude %>% 
  filter(variable %in% bitumen_set)%>%
  filter(region == "Alberta")%>%
  head(10)%>%
  kbl(escape = FALSE,table.attr = "style='width:80%;'",digits=2,align=rep('c', 2)) %>%
  kable_styling(fixed_thead = T,bootstrap_options = c("hover", "condensed","responsive"),full_width = T)%>% 
  I() 
x1 scenario unit region variable year value
6385 Canada Net-zero Thousand Barrels per day Alberta In Situ Bitumen 2005 438
6386 Canada Net-zero Thousand Barrels per day Alberta In Situ Bitumen 2006 494
6387 Canada Net-zero Thousand Barrels per day Alberta In Situ Bitumen 2007 536
6388 Canada Net-zero Thousand Barrels per day Alberta In Situ Bitumen 2008 583
6389 Canada Net-zero Thousand Barrels per day Alberta In Situ Bitumen 2009 664
6390 Canada Net-zero Thousand Barrels per day Alberta In Situ Bitumen 2010 752
6391 Canada Net-zero Thousand Barrels per day Alberta In Situ Bitumen 2011 847
6392 Canada Net-zero Thousand Barrels per day Alberta In Situ Bitumen 2012 990
6393 Canada Net-zero Thousand Barrels per day Alberta In Situ Bitumen 2013 1106
6394 Canada Net-zero Thousand Barrels per day Alberta In Situ Bitumen 2014 1263

Bonus

And, since you’ve made it this far, here’s a bonus that you might find useful:

library(viridis)
bitumen_set<-c("In Situ Bitumen","Mined Bitumen")
cer_crude %>%
  mutate(scenario=as_factor(scenario),
         scenario=fct_relevel(scenario,"Current Measures")
         )%>%
  filter(variable %in% bitumen_set)%>%
  filter(unit=="Thousand Barrels per day")%>%
  filter(region %in% c("Alberta"))%>%
  ggplot()+ #make a graph
  geom_line(aes(year,value,group=interaction(variable,scenario),color=variable,linetype=scenario),linewidth=.65)+
  scale_x_continuous(breaks=pretty_breaks(), expand=c(0,0))+
  scale_y_continuous(breaks=pretty_breaks(), expand=c(0,0))+
  theme_minimal()+
  expand_limits(y=c(0,3000))+
  scale_linetype_manual("",values=c("solid","21","dotted"))+
  scale_color_viridis("",option = "A",discrete=T,begin = 0,end = .8, direction=-1)+
  theme(legend.position = "bottom", legend.box = "vertical",
        legend.margin=margin(t = 0,b=0, unit='cm'))+
  #t, r, b, l (To remember order, think trouble)
  theme(plot.margin = unit(c(1,1,0.2,1), "cm"))+
  guides(color = guide_legend(keywidth = unit(1,"cm"),keyheight = unit(1,"cm"),nrow = 1),
    linetype = guide_legend(keywidth = unit(1.6,"cm"),nrow = 1))+
  labs(y="Production (Thousands bbl/d)",x="",
         title="Alberta Bitumen Production",
         subtitle="CER Current Measures, Canada Net-Zero and Global Net-Zero Scenarios",
         caption = "Data via Canadian Energy Regulator, Energy Futures (2023). Graph by Andrew Leach"
         )+
  NULL

I hope some of this is helpful for you.

Content 2025 by Andrew Leach
All content licensed under a Creative Commons Attribution-NonCommercial 4.0 International license (CC BY-NC 4.0)
 
Made with and Quarto
View the source at GitHub