Pokédex Database

R
PostgreSQL
Data Visualization
Data Engineering
Unveil the power of PostgreSQL! Explore the intricate ETL process, advanced tools, and meticulous schema design that created a functional database. Join the journey of unraveling Pokémon mysteries through data engineering.
Author

Brian Cervantes Alvarez

Published

December 5, 2022

Modified

April 8, 2025

Yapper Labs | AI Summary Model: ChatGPT o3-mini-high

I built a fully operational PostgreSQL database for a comprehensive Pokémon dataset in under two weeks using the ETL methodology, which involved extracting data from multiple sources, cleansing it to remove inconsistencies, and standardizing it before loading. I designed and implemented a robust schema with primary and foreign key constraints for tables such as Pokémon, types, abilities, generations, and moves, then transformed and exported the refined data into a CSV file for further analysis. This project demonstrates my proficiency in data engineering, SQL, and complex data integration, enabling efficient and accurate analysis of detailed Pokémon insights.

Purpose

The main objective of this project was to construct a fully operational Postgresql database in a time frame of fewer than two weeks by employing the Extract, Transform, Load (ETL) methodology. The purpose of this approach was to extract data from various sources, transform it into a format that could be easily integrated into the database, and finally load the transformed data into the database.

The process involved several intricate steps, including identifying the relevant data sources, cleansing the extracted data to remove inconsistencies, standardizing the data to a uniform format, and applying data validation and verification techniques to ensure accuracy and completeness. Furthermore, it required careful consideration of the database schema, including the design of tables, relationships between tables, and the use of appropriate data types.

The successful implementation of this project was dependent on the utilization of cutting-edge technologies and tools, such as data integration software, data profiling tools, and scripting languages. The result was a functional database that can efficiently store and manage data, making it readily available for analysis, decision-making, and reporting purposes.

Summary

The inquiry of identifying the optimal base stat Pokemon type sparked my interest, prompting me to delve into the realm of data engineering. In order to craft a well-informed response to this question, I began by utilizing the expansive and multifaceted “Pokémon of Kanto, Johto, and Hoenn Region” dataset to establish a structured and organized database.

Unleashing the Power of PostgreSQL: Building a Database

This SQL code creates several tables for storing Pokémon data. The tables include information about Pokémon, their types, abilities, generations, and moves. The code establishes primary keys, foreign key constraints, and defines the data types for each column. These tables form the foundation for a comprehensive Pokémon database, enabling efficient storage and retrieval of Pokémon-related information.

Data Transformation and CSV Preparation in SQL: A Step-by-Step Guide

This section of the SQL file focuses on transforming and preparing a CSV file for analysis. It involves multiple SELECT statements that extract relevant data from different tables and join them together. The extracted data is then inserted into temporary tables, including ‘temp1’, ‘temp2’, and ‘temp3’, with each step refining the data further. Finally, the transformed data in ‘temp3’ is selected and filtered based on specific conditions, ordered, and then exported to a CSV file named ‘scuffed_pokedex.csv’. This process prepares the data for further analysis and exploration in external tools or applications.


SELECT 
    identifier AS pokemon_name, 
    pokemon_types.type_id,
    pokemon_abilities.ability_id
INTO temp1
FROM pokemon
LEFT JOIN pokemon_types
ON pokemon.id = pokemon_types.pokemon_id
LEFT JOIN pokemon_abilities
ON pokemon.id  = pokemon_abilities.pokemon_id;


SELECT 
    pokemon_name,
    types.identifier AS pokemon_type,
    abilities.identifier AS pokemon_ability,
    types.generation_id AS gen_id,
    types.id AS type_id
INTO temp2
FROM temp1
LEFT JOIN types
ON temp1.type_id = types.id
LEFT JOIN abilities
ON temp1.ability_id = abilities.id;


DROP TABLE temp3;
SELECT 
    pokemon_name,
    pokemon_type,
    pokemon_ability,
    generations.identifier AS pokemon_generation,
    moves.identifier AS pokemon_move,
    moves.power AS pokemon_power,
    moves.accuracy AS pokemon_accuracy,
    moves.pp AS pokemon_pp
INTO temp3
FROM temp2
LEFT JOIN generations
ON temp2.gen_id = generations.main_region_id
LEFT JOIN moves
ON temp2.type_id = moves.type_id;

SELECT *
FROM temp3
WHERE pokemon_power IS NOT NULL 
    AND pokemon_accuracy IS NOT NULL
ORDER BY pokemon_accuracy, pokemon_power;


COPY temp3
TO '/Users/Shared/Data_503/Datasets/scuffed_pokedex.csv'
WITH (FORMAT CSV, HEADER);

Unveiling Pokémon Insights: Analyzing Damage Output and Accuracy

Through the power of data analysis and visualization, we have delved into the world of Pokémon to uncover insights about types, damage output, accuracy, and their relationships. Our exploration has shed light on the best Pokémon type for damage output, the most accurate contenders, and the interplay between power and accuracy. By combining the captivating nature of Pokémon with the analytical capabilities of R, we have gained valuable knowledge and set the stage for further investigations in the vast Pokémon universe.

library(tidyverse)
library(RColorBrewer)
pokemon <- read_csv("../../../Assets/Datasets/scuffed_pokedex.csv")

names(pokemon)

nb.cols <- 18
mycolors <- colorRampPalette(brewer.pal(8, "YlOrRd"))(nb.cols)

pokemon %>% 
  mutate(pokemon_type = str_to_title(pokemon_type)) %>%
  group_by(pokemon_type) %>%
  summarise(avg_power = mean(pokemon_power, na.rm = TRUE),
            avg_accuracy = mean(pokemon_accuracy, na.rm = TRUE)) %>%
  ggplot(aes(x = avg_power, y = reorder(pokemon_type, avg_power), fill = reorder(pokemon_type, avg_power)))+
  geom_col(show.legend = FALSE, color = "black") +
  labs(x = "Average power",
       y = "Pokemon type",
       title = "FIRE! The Best Pokemon Type For Damage Output Is...?",
       subtitle = "Based on an Average of All Moves Per Pokemon Type",
       caption = "Source: Pokédex of Kanto, Johto, and Hoenn Regions @ Kaggle.com") +
  scale_fill_manual(values = mycolors) +
  theme(plot.background = element_blank(),
        panel.background = element_blank(),
        axis.ticks.y = element_blank(),
        panel.grid.major.x = element_line(color = "grey"))


nb.cols <- 18
mycolors <- colorRampPalette(brewer.pal(8, "Blues"))(nb.cols)

pokemon %>% 
  mutate(pokemon_type = str_to_title(pokemon_type)) %>%
  group_by(pokemon_type) %>%
  summarise(avg_power = mean(pokemon_power, na.rm = TRUE),
            avg_accuracy = mean(pokemon_accuracy, na.rm = TRUE)) %>%
  ggplot(aes(x = avg_accuracy, y = reorder(pokemon_type, avg_accuracy), fill = reorder(pokemon_type, avg_accuracy)))+
  geom_col(show.legend = FALSE, color = "black") +
  labs(x = "Average accuracy",
       y = "Pokemon type",
       title = "Ouch! Who wins the bullseye competition?",
       subtitle = "Based on an Average of All Moves Per Pokemon Type",
       caption = "Source: Pokédex of Kanto, Johto, and Hoenn Regions @ Kaggle.com") +
  scale_fill_manual(values = mycolors) +
  theme(plot.background = element_blank(),
        panel.background = element_blank(),
        axis.ticks.y = element_blank(),
        panel.grid.major.x = element_line(color = "grey"))


pokemon %>% 
  group_by(pokemon_type) %>%
  summarise(avg_power = mean(pokemon_power, na.rm = TRUE),
            avg_accuarcy = mean(pokemon_accuracy, na.rm = TRUE)) %>%
  ggplot(aes(x = avg_power, y = avg_accuarcy, color = pokemon_type)) +
  geom_point()


stats <- pokemon %>% 
  group_by(pokemon_type) %>%
  summarise(avg_power = mean(pokemon_power, na.rm = TRUE),
            avg_accuarcy = mean(pokemon_accuracy, na.rm = TRUE))
            
model <- lm(data = stats, avg_accuracy ~ avg_power)
plot(model)

Presentation

Data References

Grolemund, Garrett, and Hadley Wickham. 2011. “Dates and Times Made Easy with lubridate.” Journal of Statistical Software 40 (3): 1–25. https://www.jstatsoft.org/v40/i03/.
Müller, Kirill, and Hadley Wickham. 2023. Tibble: Simple Data Frames. https://tibble.tidyverse.org/.
Neuwirth, Erich. 2022. RColorBrewer: ColorBrewer Palettes. https://CRAN.R-project.org/package=RColorBrewer.
Spinu, Vitalie, Garrett Grolemund, and Hadley Wickham. 2024. Lubridate: Make Dealing with Dates a Little Easier. https://lubridate.tidyverse.org.
Wickham, Hadley. 2016. Ggplot2: Elegant Graphics for Data Analysis. Springer-Verlag New York. https://ggplot2.tidyverse.org.
———. 2023a. Forcats: Tools for Working with Categorical Variables (Factors). https://forcats.tidyverse.org/.
———. 2023b. Stringr: Simple, Consistent Wrappers for Common String Operations. https://stringr.tidyverse.org.
———. 2023c. Tidyverse: Easily Install and Load the Tidyverse. https://tidyverse.tidyverse.org.
Wickham, Hadley, Mara Averick, Jennifer Bryan, Winston Chang, Lucy D’Agostino McGowan, Romain François, Garrett Grolemund, et al. 2019. “Welcome to the tidyverse.” Journal of Open Source Software 4 (43): 1686. https://doi.org/10.21105/joss.01686.
Wickham, Hadley, Winston Chang, Lionel Henry, Thomas Lin Pedersen, Kohske Takahashi, Claus Wilke, Kara Woo, Hiroaki Yutani, Dewey Dunnington, and Teun van den Brand. 2024. Ggplot2: Create Elegant Data Visualisations Using the Grammar of Graphics. https://ggplot2.tidyverse.org.
Wickham, Hadley, Romain François, Lionel Henry, Kirill Müller, and Davis Vaughan. 2023. Dplyr: A Grammar of Data Manipulation. https://dplyr.tidyverse.org.
Wickham, Hadley, and Lionel Henry. 2025. Purrr: Functional Programming Tools. https://purrr.tidyverse.org/.
Wickham, Hadley, Jim Hester, and Jennifer Bryan. 2024. Readr: Read Rectangular Text Data. https://readr.tidyverse.org.
Wickham, Hadley, Davis Vaughan, and Maximilian Girlich. 2024. Tidyr: Tidy Messy Data. https://tidyr.tidyverse.org.