library(ggplot2)
library(dplyr)
library(lubridate)
library(tidyr)
library(readr)
Interactive Learning with WebR
Lab Demonstration
I built an interactive lab using webR to run R code directly in the browser, streamlining the setup process and providing students with immediate access to real-time data wrangling and visualization tools. I demonstrated key techniques—filtering, selecting, mutating, summarizing, reshaping data, and handling missing values—integrated with dynamic visualizations using ggplot2 to make complex concepts tangible and accessible. This lab showcases how webR transforms the learning experience by offering instant feedback and hands-on exploration, enabling students to master data manipulation skills efficiently.
The Case for webR
Integrating webR into your educational tools offers a robust way to elevate interactive learning experiences. This demonstration focuses on how webR enables real-time data wrangling and visualization within a web-based environment, allowing educators to create engaging, hands-on labs for students.
Why use webR in Education?
webR brings the power of R to the browser, removing the need for complex local installations. This accessibility is crucial for modern classrooms, where students might be working on different devices and operating systems. By embedding R code directly into educational materials, webR facilitates instant feedback, interactive exercises, and dynamic visualizations, making data science education more intuitive and approachable.
Key Features of webR
Cross-Platform Compatibility: Students can run R code on any device with a web browser, ensuring a consistent learning experience.
Immediate Feedback: webR allows for instant execution of code, enabling students to see the results of their actions right away.
Interactive Visualizations: By integrating with packages like ggplot2, educators can create interactive plots that students can manipulate directly in their browser.
Implementation in Data Wrangling Labs
For example, when teaching data wrangling concepts such as filtering, selecting, mutating, and summarizing, webR allows students to experiment with real datasets interactively. They can instantly see the impact of different data manipulation techniques, reinforcing their understanding through practice.
Benefits for Educators
webR enables educators to:
Simplify Setup: No more lengthy instructions on installing R and its dependencies.
Enhance Engagement: Interactive content keeps students involved, making complex topics more digestible.
Facilitate Learning: By allowing students to explore data manipulation and visualization in real time, webR helps solidify their understanding of essential concepts.
Summary
webR transforms how data science is taught, making R's powerful features accessible and interactive in the classroom. Whether you're teaching undergraduates new to R or advanced students refining their skills, webR offers a dynamic platform that enhances learning and encourages exploration.
Lab Demonstration
What is Data Wrangling?
Data wrangling is the process of converting messy, untidy data into a tidy format, making it suitable for data visualization and analysis.
Data is often messy: Real-world data is rarely provided in a tidy format.
Industry challenges: Many industries have poorly designed data structures, requiring data preparation before visualization.
Rarely tidy datasets: It is uncommon to receive a dataset that is already tidy.
What is Tidy Data?
Tidy data is a structured format that aligns the organization of a dataset with its underlying meaning. In tidy data:
Each variable has its own column: Every column in the dataset corresponds to a specific variable or attribute.
Each observation has its own row: Every row captures a single observation or data entry.
Each cell contains a single value: Each cell holds one distinct piece of information for a particular variable and observation.
In practice, data is often imported using SQL to create narrower datasets. While we won't cover SQL in this course, it's a valuable skill to learn in the future. For now, we'll focus on using R to manipulate and create subsets of larger datasets for focused analysis.
What Causes Untidy Data?
Incorrect/Inconsistent Dates: Dates can be tricky because they might be formatted differently across datasets or have errors like typos or missing parts. For example, some data might use “MM/DD/YYYY” while others use “YYYY-MM-DD,” leading to confusion and potential errors when analyzing time-based data.
Wide Format Times: Time data is sometimes presented in a wide format, where each column represents a different time period. This structure can make it difficult to perform certain types of analysis, as many statistical and visualization tools prefer data in a long format, where each row represents a single observation at a specific time.
Void or Misspelled Descriptions: Descriptions and labels are often incomplete, missing, or contain typos. These errors can make it challenging to interpret the data correctly, especially when variables are not clearly defined or are inconsistent across different parts of the dataset.
Missing Values: Missing data is common and can occur in any part of a dataset, leading to gaps that can skew analysis or result in errors. Handling these missing values is crucial for ensuring that any conclusions drawn from the data are accurate.
Condensed or Incorrect Headers: Column names might be too short, unclear, or incorrectly labeled, leading to confusion about what the data actually represents. For example, a column labeled “Pop” might be ambiguous—does it refer to population, popularity, or something else?
Row Content Split: Sometimes, a single column contains data that should be divided into multiple columns, such as when a “Location” column includes both city and state. This issue can make it difficult to analyze the data separately or perform operations that rely on more granular details. These common issues contribute to untidy data, which can complicate analysis and lead to inaccurate results.
Mastering data wrangling is crucial because you might have to handle datasets with millions of rows and hundreds of columns.
Getting Started
First, ensure you have the necessary packages installed and loaded. We will use the dplyr, lubridate, readr, ggplot2, and tidyr packages for our examples.
Use install.packages('Name of Package') to install an R package. Careful! Package names are case sensitive, so install.packages('GGplot2') will not work, but install.packages('ggplot2') will.
Download the Data
Mac users: Use ⌘ + return to run single or highlighted line(s). Use shift + return to run an entire code block.
Windows users: Use ctrl + enter to run single or highlighted line(s). Use shift + enter to run an entire code block.
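The code below is a minimal sketch of the import step, assuming the lab provides the data as a CSV file; the file name global_data.csv and the object name global_data are placeholders for whatever the lab materials actually supply.

# Hypothetical file name and object name -- substitute the path or URL given in the lab
global_data <- read_csv("global_data.csv")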
Verify Datasets with head()
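Assuming the dataset was read into the placeholder object global_data, a quick check looks like this:

# Preview the first six rows to confirm the import worked as expected
head(global_data)

# Inspect column names and types at a glance
glimpse(global_data)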
Description of the Dataset
country: Country name from a predefined list of 10 countries.
year: Years between 2010 and 2023.
population: Real-world population size for each country and year.
gdp: Gross Domestic Product (GDP) in USD millions for each country and year.
gdp_per_capita: GDP per capita, calculated as GDP divided by population.
life_expectancy: Life expectancy of citizens for each country and year.
birth_rate: Birth rate for each country and year.
temperature: Average temperature in Celsius for each country.
region: Geographical region corresponding to each country.
category: Classification of the country as "First World," "Second World," or "Third World."
Tidy Data Wrangling
Important Concepts
Filtering: Filter data based on conditions such as year, country, or region.
Selecting: Select specific columns for focused analysis.
Mutating: Create new columns, such as cases per 100,000 population.
Summarizing: Aggregate data by country, year, or region to find totals and averages.
Filtering
Filtering is essential for narrowing down datasets to the most relevant information, making patterns easier to identify.
Example 1
You are tasked with visualizing trends in life expectancy in Asian countries between 2010 and 2020.
Filtering Process
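A sketch of the filtering step, assuming the dataset lives in the placeholder object global_data and that the region column labels Asian countries as "Asia":

# Keep only Asian countries and the years 2010 through 2020
asia_life_exp <- global_data %>%
  filter(region == "Asia", year >= 2010, year <= 2020)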
Visualization
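One way to plot the filtered data with ggplot2:

# Life expectancy over time, one line per country
ggplot(asia_life_exp, aes(x = year, y = life_expectancy, color = country)) +
  geom_line() +
  geom_point() +
  labs(
    title = "Life Expectancy in Asian Countries, 2010-2020",
    x = "Year",
    y = "Life Expectancy (years)",
    color = "Country"
  ) +
  theme_minimal()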
What did we do here?
By focusing on specific countries and years, filtering allows for more targeted and relevant visualizations, making it easier to analyze trends and patterns specific to the context.
Example 2
Analyze the relationship between GDP per capita and life expectancy in European countries with a GDP per capita above $30,000 for the years 2015-2020.
Filtering Process
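A sketch of the filter, again using the placeholder object global_data and assuming the region label "Europe":

# European countries, 2015-2020, with GDP per capita above $30,000
europe_wealthy <- global_data %>%
  filter(region == "Europe",
         year >= 2015, year <= 2020,
         gdp_per_capita > 30000)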
Visualization
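A possible scatter plot of the filtered data:

# GDP per capita vs. life expectancy for the filtered countries
ggplot(europe_wealthy, aes(x = gdp_per_capita, y = life_expectancy, color = country)) +
  geom_point(size = 3, alpha = 0.8) +
  labs(
    title = "GDP per Capita vs. Life Expectancy, Wealthy European Countries (2015-2020)",
    x = "GDP per Capita (USD)",
    y = "Life Expectancy (years)",
    color = "Country"
  ) +
  theme_minimal()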
What did we do here?
Filtering based on economic indicators allows for focused analysis on the relationship between wealth and life expectancy, removing noise from countries with different economic conditions.
Filtering Challenges
- Population Size in Africa:
  - Filter: Only countries from Africa where the population exceeds 50 million.
  - Purpose: Isolate data for large African nations to analyze trends specific to highly populated areas.
- Economic Data in Europe:
  - Filter: Show data only for the years 2015-2020 for European countries with GDP per capita above $30,000.
  - Purpose: Focus on wealthy European countries during a specific period to study economic outcomes.
- High Birth Rates in Asia:
  - Filter: Data for Asian countries where the birth rate is above 2.5.
  - Purpose: Analyze regions with high birth rates, possibly indicating population growth trends.
- Cold Regions in Asia:
  - Filter: Asian countries where the average temperature is below 10°C between 2010 and 2020.
  - Purpose: Focus on colder regions in Asia to study how temperature may correlate with other demographic factors.
- GDP Data with Missing Values:
  - Filter: Remove any entries with missing gdp_per_capita values for the years 2010-2020.
  - Purpose: Ensure clean data for economic analysis, removing incomplete records that could skew results.
Selecting
Selecting allows you to focus on specific columns relevant to your analysis.
Example 1
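A minimal sketch of select(), using the placeholder object global_data; the particular columns chosen here are just for illustration:

# Keep only the columns needed for a focused look at GDP per capita
gdp_subset <- global_data %>%
  select(country, year, gdp_per_capita)

head(gdp_subset)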
Visualization
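A possible plot of the selected columns:

# GDP per capita over time for each country
ggplot(gdp_subset, aes(x = year, y = gdp_per_capita, color = country)) +
  geom_line() +
  labs(
    title = "GDP per Capita Over Time",
    x = "Year",
    y = "GDP per Capita (USD)",
    color = "Country"
  ) +
  theme_minimal()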
Selecting Challenges
- Select columns related to economic indicators (e.g., country, gdp_per_capita, population) for further analysis.
- Create a dataset with only the year, life_expectancy, and temperature columns for all countries and show the first 5 rows.
- Choose columns that exclude any geographical information and check the first 10 rows.
- Select and rename the country and population columns to nation and pop_size, respectively.
- Create a new dataset with only the year, population, and a newly created column, population_in_millions (which should be calculated as population / 1e6).
Mutating
Mutating helps create new columns based on existing data.
Example 1
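A sketch of mutate(), with the new columns chosen purely for illustration (gdp is in USD millions per the dataset description):

# Add GDP in billions and temperature in Fahrenheit as new columns
mutated_data <- global_data %>%
  mutate(
    gdp_billions  = gdp / 1e3,                # millions -> billions
    temperature_f = temperature * 9 / 5 + 32  # Celsius -> Fahrenheit
  )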
Visualization
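One way to visualize a mutated column:

# Compare GDP (in billions) across countries in the most recent year
mutated_data %>%
  filter(year == max(year)) %>%
  ggplot(aes(x = reorder(country, gdp_billions), y = gdp_billions)) +
  geom_col(fill = "steelblue") +
  coord_flip() +
  labs(
    title = "GDP by Country, Most Recent Year",
    x = "Country",
    y = "GDP (USD billions)"
  ) +
  theme_minimal()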
Mutating Challenges
- Create a new column called gdp_total that multiplies gdp_per_capita by population.
- Add a new column that indicates whether a country's GDP per capita is above or below a certain threshold (e.g., $20,000).
- Mutate the temperature column to create a new column, temperature_f, that converts Celsius to Fahrenheit.
- Create a population_density column by dividing population by a given area (assuming you have area data).
- Generate a column that calculates the ratio of birth rate to life expectancy for each country.
Summarizing
Summarizing aggregates data by country, year, or region to find totals and averages.
Example 1
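A sketch of group_by() + summarize(), again using the placeholder object global_data:

# Average life expectancy and total population by region and year
region_summary <- global_data %>%
  group_by(region, year) %>%
  summarize(
    avg_life_expectancy = mean(life_expectancy, na.rm = TRUE),
    total_population    = sum(population, na.rm = TRUE),
    .groups = "drop"
  )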
Visualization
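A possible plot of the summarized data:

# Average life expectancy per region over time
ggplot(region_summary, aes(x = year, y = avg_life_expectancy, color = region)) +
  geom_line() +
  labs(
    title = "Average Life Expectancy by Region",
    x = "Year",
    y = "Average Life Expectancy (years)",
    color = "Region"
  ) +
  theme_minimal()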
Summarizing Challenges
- Summarize the dataset by finding the average temperature for each region.
- Aggregate the data to find the total population and average life expectancy for each continent.
- Group the data by country and summarize to find the maximum and minimum GDP per capita for each country.
- Summarize by year to find the total population and average birth rate each year.
- Create a summary that calculates the total population and average GDP per capita for countries classified as “First World.”
Untidy Data Wrangling
Handling Wide vs. Long Formats
Untidy data often comes in a “wide” format, where multiple variables are stored across columns rather than in a long format where each observation is a row.
Creating an Untidy Version
To demonstrate this, let’s take our dataset and convert it into a wide format, then back to a long format.
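A sketch of the wide conversion with tidyr's pivot_wider(), spreading life expectancy across year columns (the choice of value column is illustrative):

# One row per country, one column per year of life expectancy
wide_data <- global_data %>%
  select(country, year, life_expectancy) %>%
  pivot_wider(names_from = year, values_from = life_expectancy)

head(wide_data)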
Now, let’s convert this wide data back to a long format.
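And the reverse with pivot_longer():

# Gather the year columns back into year/life_expectancy pairs
long_data <- wide_data %>%
  pivot_longer(
    cols = -country,
    names_to = "year",
    values_to = "life_expectancy"
  ) %>%
  mutate(year = as.integer(year))

head(long_data)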
This process shows how data can be reshaped for different analytical needs. Wide format is useful for certain analyses but often needs to be converted to long format for modeling and visualization.
Handling Misspelled Header Column Names
Sometimes datasets come with misspelled or inconsistent column names, which can lead to errors in data manipulation.
Creating Misspelled Header Names
Let’s create a dataset with intentionally misspelled column names and then fix them.
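A sketch of deliberately breaking a few column names (the particular misspellings are arbitrary):

# Copy the data and misspell some column names on purpose
messy_names <- global_data %>%
  rename(
    contry         = country,
    life_expectncy = life_expectancy,
    populaton      = population
  )

names(messy_names)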
Fixing the Misspelled Column Names
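And fixing them again with rename():

# Rename the misspelled columns back to their correct names
fixed_names <- messy_names %>%
  rename(
    country         = contry,
    life_expectancy = life_expectncy,
    population      = populaton
  )

names(fixed_names)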
This illustrates how to identify and correct misspelled column names, which is a crucial step in data cleaning.
Handling Row Content Split and Reverse
Sometimes, data stored in a single column needs to be split into multiple columns or vice versa.
Merging Two Columns into One
Let's take the country and year columns and merge them into a single column.
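A sketch with tidyr's unite(), using an underscore as the separator (the separator choice is illustrative):

# Combine country and year into a single column such as "<country>_<year>"
united_data <- global_data %>%
  unite("country_year", country, year, sep = "_")

head(united_data)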
Splitting the Merged Column Back into Two
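And the reverse with separate():

# Split country_year back into separate country and year columns
separated_data <- united_data %>%
  separate(country_year, into = c("country", "year"), sep = "_") %>%
  mutate(year = as.integer(year))

head(separated_data)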
This demonstrates how to handle situations where data needs to be recombined or separated for different purposes.
Handling Dates
In some cases, it might be necessary to convert a year column from a numeric format (double) into a proper date format for time series analysis or plotting purposes. Here's how you can do that in R using the lubridate package.
Example 1
Converting year to a Date
Let's convert the year column into a date format, setting it as January 1st of that year.
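The conversion described in the explanation below can be sketched like this (global_data remains the placeholder dataset name):

# Build a "YYYY-01-01" string from the year and parse it as a Date with lubridate
dated_data <- global_data %>%
  mutate(year_date = ymd(paste0(year, "-01-01")))

head(dated_data$year_date)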
What did we do here?
paste0(year, "-01-01"): Combines the year with the string "-01-01" to create a date string like "2010-01-01".
ymd(): Converts the resulting string into a date object in the "Year-Month-Day" format.
This creates a new column, year_date, which is now in the proper date format.
Challenges [Solutions]
Convert year to end of year date:
- Task: Convert the year column to a date format, but set it as December 31st of that year.
- Purpose: Useful for representing data that summarizes annual results.
Create a quarterly date:
- Task: Convert the year column into a date representing the first quarter (e.g., "2010-03-31").
- Purpose: Useful for quarterly analysis.
Mid-year date conversion:
- Task: Convert the year column to a date format, setting it as June 30th of each year.
- Purpose: Represents mid-year data points.
Use year as a dynamic time period:
- Task: Convert the year column to represent the last day of a chosen month (e.g., November).
- Purpose: Allows for flexibility depending on the analysis context.
Convert year to fiscal year start date:
- Task: Convert the year column into a date representing the start of the fiscal year (e.g., April 1st).
- Purpose: Useful for financial and budgetary analyses.
Handling Missing Data
Dealing with missing data is a crucial aspect of data wrangling. Missing data can occur for various reasons, and understanding the nature of these missing values is essential for appropriate handling.
Types of Missing Data
MCAR (Missing Completely at Random): Data is missing entirely at random, with no relationship between the missing data and any other observed or unobserved data. The analysis remains unbiased if this missing data is ignored.
MAR (Missing at Random): The likelihood of missing data on a variable is related to other observed variables but not to the value of the variable itself.
MNAR (Missing Not at Random): The missingness is related to the unobserved data itself, meaning the missing values are related to the actual value that is missing.
Simulating MAR Data
We'll create a scenario where gdp_per_capita is more likely to be missing if the population is below a certain threshold, making it "missing at random" based on population size.
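A sketch of one way to simulate this; the 50-million population threshold and 40% missingness probability are arbitrary choices for illustration:

set.seed(123)  # make the simulated missingness reproducible

# gdp_per_capita is more likely to be missing when population is small (MAR)
mar_data <- global_data %>%
  mutate(
    gdp_per_capita_mar = ifelse(
      population < 50e6 & runif(n()) < 0.4,  # illustrative threshold and probability
      NA,
      gdp_per_capita
    )
  )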
Simulating MNAR Data
We'll simulate a scenario where the likelihood of life_expectancy being missing is higher if life expectancy is lower than 60 years, making it "missing not at random."
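A sketch of the MNAR simulation; the 50% missingness probability is an arbitrary illustrative choice:

set.seed(456)

# life_expectancy is more likely to be missing when its own value is low (MNAR)
mnar_data <- global_data %>%
  mutate(
    life_expectancy_mnar = ifelse(
      life_expectancy < 60 & runif(n()) < 0.5,
      NA,
      life_expectancy
    )
  )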
Missing Data | Before Wrangling
Before Handling Missing Data
Let’s visualize the data before handling missing values, focusing on the relationship between GDP per capita and life expectancy.
Visualization with MAR Data
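A possible before-imputation plot for the MAR case (ggplot2 drops the rows with missing values, with a warning):

# GDP per capita (with MAR gaps) vs. life expectancy
ggplot(mar_data, aes(x = gdp_per_capita_mar, y = life_expectancy, color = region)) +
  geom_point(alpha = 0.7) +
  labs(
    title = "GDP per Capita vs. Life Expectancy (MAR, before imputation)",
    x = "GDP per Capita (USD)",
    y = "Life Expectancy (years)",
    color = "Region"
  ) +
  theme_minimal()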
Visualization with MNAR Data
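The MNAR plot follows the same pattern, swapping in the MNAR variable:

# GDP per capita vs. life expectancy with MNAR gaps in life expectancy
ggplot(mnar_data, aes(x = gdp_per_capita, y = life_expectancy_mnar, color = region)) +
  geom_point(alpha = 0.7) +
  labs(
    title = "GDP per Capita vs. Life Expectancy (MNAR, before imputation)",
    x = "GDP per Capita (USD)",
    y = "Life Expectancy (years)",
    color = "Region"
  ) +
  theme_minimal()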
Missing Data | How to Wrangle
To handle the missing data, we’ll apply a simple imputation strategy, filling in missing values with the median of the respective variable.
Imputing Missing Values for MAR Data
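A sketch of median imputation for the MAR variable:

# Replace missing gdp_per_capita values with the median of the observed values
mar_imputed <- mar_data %>%
  mutate(
    gdp_per_capita_mar = ifelse(
      is.na(gdp_per_capita_mar),
      median(gdp_per_capita_mar, na.rm = TRUE),
      gdp_per_capita_mar
    )
  )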
Missing Data | After Wrangling
Let’s visualize the data again after imputing the missing values.
Visualization with MAR Data (Imputed)
Imputing Missing Values for MNAR Data
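The same median strategy applied to the MNAR variable; note that because the low values were the ones that went missing, this imputation will bias life expectancy upward:

# Median imputation for life_expectancy_mnar
mnar_imputed <- mnar_data %>%
  mutate(
    life_expectancy_mnar = ifelse(
      is.na(life_expectancy_mnar),
      median(life_expectancy_mnar, na.rm = TRUE),
      life_expectancy_mnar
    )
  )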
Visualization with MNAR Data (Imputed)
Summary
Understanding the different types of missing data—MCAR, MAR, and MNAR—is crucial for choosing the right approach to handle them. We explored how to simulate and visualize MAR and MNAR scenarios in our dataset, highlighting the importance of addressing missing data for accurate analysis. By comparing visualizations before and after imputation, students can grasp the significant impact missing data can have on their results and learn effective strategies to mitigate these issues.
Creating a Pseudo-Publication Ready Visualization
We'll combine all the data wrangling techniques you've learned—filtering, selecting, mutating, and summarizing—to perform a detailed analysis and produce a polished, publication-ready visualization.
Creating a Professional-Quality Visualization
Here’s a step-by-step guide to transform and visualize data from 10 countries in the dataset:
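A sketch of such a pipeline, using the placeholder object global_data; the specific year range, variables, and styling choices are illustrative rather than prescriptive:

# 1. Filter to the most recent decade
# 2. Select the variables of interest
# 3. Mutate GDP (USD millions) into billions for readable axis labels
# 4. Summarize by country
# 5. Plot with informative labels and a clean theme
publication_plot <- global_data %>%
  filter(year >= 2014) %>%
  select(country, region, year, gdp, life_expectancy) %>%
  mutate(gdp_billions = gdp / 1e3) %>%
  group_by(country, region) %>%
  summarize(
    avg_gdp_billions    = mean(gdp_billions, na.rm = TRUE),
    avg_life_expectancy = mean(life_expectancy, na.rm = TRUE),
    .groups = "drop"
  ) %>%
  ggplot(aes(x = avg_gdp_billions, y = avg_life_expectancy,
             color = region, label = country)) +
  geom_point(size = 3) +
  geom_text(vjust = -0.8, size = 3, show.legend = FALSE) +
  labs(
    title = "Average GDP vs. Life Expectancy by Country, 2014-2023",
    subtitle = "Ten countries, averaged over the most recent decade",
    x = "Average GDP (USD billions)",
    y = "Average Life Expectancy (years)",
    color = "Region",
    caption = "Source: course dataset"
  ) +
  theme_minimal(base_size = 12)

publication_plot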
Review & Scrutinize
What is this visualization?: Write a brief paragraph describing the design, purpose, and key message of the plot. Explain what the visualization is intended to show and how it effectively communicates the data.
Why is this visualization nearly publication-ready?: In 2-3 sentences, discuss what makes the plot polished and professional, highlighting any elements that could make it suitable for publication.
WebR in Pure Browser Form
As the package gains support from the community, students will be able to run their entire R scripts in the browser. Of course, learning how to use the IDE is still important.