Pitch, Hit, Profit: A Data-Backed Approach to Gaining an Edge in MLB Betting

Author

Andrew Cerqui, Jace Higa

Published

June 24, 2025

1 Data Summary

1.1 Data Ingestion

Our main data source was the StatsAPI, which provided MLB box score data in JSONL format. After uploading the raw files to Railway and our Postgres database, we had to figure out how to parse the nested JSON using SQL. This process involved trial and error until we found queries that matched our table structure. Scraping setup took time as we explored different sources before settling on an API that supported our needs. Once we finalized our approach, the ingestion process was fairly smooth, mainly focused on correctly inserting parsed JSONL into our schema.

The second data source was Statcast data from Baseball Savant, provided as CSV files. We cleaned these by deleting irrelevant columns, renaming headers, and standardizing the format to match our database requirements. Uploading from the terminal into our PostgreSQL database went smoothly, resulting in well-structured and usable tables.

Most of the challenges we encountered involved experimentation rather than critical issues.

1.2 Data Organization

Entity Relationship Diagram of MLB Data Schema

This is what our Entity Relationship Diagram looks like. The main table stems from games, which connects to many other tables. The teams table adds context by joining via team_id, allowing us to analyze team-level data.

From batter_stats and pitcher_stats, we can access game context using game_id, and then retrieve player-specific details through the batters and pitchers tables. These two tables are keyed on batter_id, pitcher_id, team_id, and position or type, which lets us track players who change positions (or teams) throughout the season.

For example, Cal Raleigh is primarily a catcher. Over 227 at-bats in that role, he has a .295 average and 1.136 OPS—MVP-caliber numbers. As a DH over 67 at-bats, however, he has just a .194 average and .659 OPS. That kind of role-based discrepancy highlights why tracking positional splits is important.

Finally, we have Statcast data: batter_statcast and pitcher_statcast, along with batter_pitches and pitcher_pitches. These contain advanced metrics, including expected stats and run values per pitch, and are joined via player IDs.

1.3 Data Compilation

1.3.1 Figure 1

This is an example of a table that we created to take a look at all players from the different teams that have played shortstop (SS). We used the teams table to extract the team_name, the batters table to find players who have logged game time at shortstop, and the batter_pitches table to get the player_name of the batters. We chose that table because the name of the players was in a nicer format on Statcast than they were in the box scores data we pulled.

-- Players Who Have Played SS Across Each Team --
SELECT DISTINCT
  t3.player_name, 
  t1.team_name, 
  t2.position
FROM teams AS t1
JOIN batters AS t2 ON t1.team_id = t2.team_id
JOIN batter_pitches AS t3 ON t2.batter_id = t3.batter_id
WHERE t2.position = 'SS'
ORDER BY t1.team_name, t3.player_name;

player_name	team_name	position
Perdomo, Geraldo	ARI	SS
Allen, Nick	ATL	SS
Henderson, Gunnar	BAL	SS
Story, Trevor	BOS	SS
Hoerner, Nico	CHC	SS
Swanson, Dansby	CHC	SS
De La Cruz, Elly	CIN	SS
Arias, Gabriel	CLE	SS
Schneemann, Daniel	CLE	SS
Meidroth, Chase	CWS	SS
Báez, Javier	DET	SS
Sweeney, Trey	DET	SS
Dubón, Mauricio	HOU	SS
Peña, Jeremy	HOU	SS
Witt Jr., Bobby	KC	SS
Neto, Zach	LAA	SS
Betts, Mookie	LAD	SS
Rojas, Miguel	LAD	SS
Lopez, Otto	MIA	SS
Ortiz, Joey	MIL	SS
Correa, Carlos	MIN	SS
Lee, Brooks	MIN	SS
Lindor, Francisco	NYM	SS
Peraza, Oswald	NYY	SS
Volpe, Anthony	NYY	SS
Wilson, Jacob	OAK	SS
Turner, Trea	PHI	SS
Kiner-Falefa, Isiah	PIT	SS
Bogaerts, Xander	SD	SS
Iglesias, Jose	SD	SS
Crawford, J.P.	SEA	SS
Adames, Willy	SF	SS
Winn, Masyn	STL	SS
Caballero, José	TB	SS
Walls, Taylor	TB	SS
Seager, Corey	TEX	SS
Bichette, Bo	TOR	SS
Giménez, Andrés	TOR	SS
Abrams, CJ	WSH	SS

1.3.2 Figure 2

With this table we found the highest rv or run values for individual pitch types. It just basically means which batters hit certain pitches the best, so Aaron Judge has the highest run value overall against a single pitch—and that happens to be the 4-Seam Fastball.

We used the batter_pitches table to get the player_name again, the teams table to get the team_name they are from, and had to use the batters table to get the position through a common table expression (CTE). We used the CTE to determine the player’s most frequently played position, since it makes the most sense to list the most common position they play. Then we used the batter_pitches table again to get the rv values for the various pitch_name types. Together it gave us the table pictured to the left.

-- Top 10 Highest Run Values Against Different Pitches --
WITH most_common_positions AS (
  SELECT 
    batter_id,
    position,
    COUNT(*) AS position_count,
    ROW_NUMBER() OVER (PARTITION BY batter_id ORDER BY COUNT(*) DESC) AS row_num
  FROM batter_stats
  GROUP BY batter_id, position
),
ranked_pitches AS (
  SELECT
    t3.player_name,
    t1.team_name,
    t4.position,  
    t3.pitch_name,
    t3.rv,
    ROW_NUMBER() OVER (PARTITION BY t3.batter_id ORDER BY t3.rv DESC) AS row
  FROM batter_pitches AS t3
  JOIN batters AS t2 ON t3.batter_id = t2.batter_id
  JOIN teams AS t1 ON t2.team_id = t1.team_id
  JOIN most_common_positions AS t4 
    ON t3.batter_id = t4.batter_id
  WHERE t4.row_num = 1
)
SELECT 
  player_name,
  team_name,
  position,
  pitch_name,
  rv
FROM ranked_pitches
WHERE row = 1
ORDER BY rv DESC
LIMIT 10;

player_name	team_name	position	pitch_name	rv
Judge, Aaron	NYY	RF	4-Seam Fastball	17
Schwarber, Kyle	PHI	DH	4-Seam Fastball	13
Tucker, Kyle	CHC	RF	4-Seam Fastball	12
Guerrero Jr., Vladimir	TOR	1B	4-Seam Fastball	12
Wilson, Jacob	OAK	SS	Slider	11
Ramírez, José	CLE	3B	Sinker	11
Díaz, Yandy	TB	DH	4-Seam Fastball	11
Castro, Willi	MIN	2B	4-Seam Fastball	10
Alonso, Pete	NYM	1B	Sinker	10
Marte, Ketel	ARI	2B	4-Seam Fastball	10

1.3.3 Figure 3

This table displays the pitchers with the highest diff_era. It basically shows which pitcher has had the worst luck, since diff_era is calculated by ERA - xERA. So, it’s a measure of how high their actual earned run average is in comparison to what it is expected to be.

To create this table, we used the pitchers table to get the position, the teams table to get team_name, the pitcher_pitches table to get the player_name in a nice format, and the pitcher_statcast table to get the diff_era stat that this analysis is based on.

-- Highest diff_era, Pitchers Who Have Been Unlucky --
WITH ranked_pitchers AS (
  SELECT 
    t1.pitcher_id,
    t3.player_name,
    t1.position,
    t2.team_name,
    t4.diff_era,
    ROW_NUMBER() OVER (PARTITION BY t1.pitcher_id ORDER BY t4.diff_era DESC) AS row
  FROM pitchers AS t1
  JOIN teams AS t2 ON t1.team_id = t2.team_id
  JOIN pitcher_pitches AS t3 ON t1.pitcher_id = t3.pitcher_id
  JOIN pitcher_statcast AS t4 ON t1.pitcher_id = t4.pitcher_id
  WHERE t3.pitches > 150
)
SELECT 
  player_name,
  position,
  team_name,
  diff_era
FROM ranked_pitchers
WHERE row = 1
ORDER BY diff_era DESC
LIMIT 10;

player_name	position	team_name	diff_era
Alcala, Jorge	RP	BOS	4.007
Kinley, Tyler	RP	COL	3.187
Romano, Jordan	RP	PHI	2.689
Ferguson, Tyler	RP	OAK	2.506
Fluharty, Mason	RP	TOR	2.043
Alcantara, Sandy	SP	MIA	1.999
Ferrer, Jose A.	RP	WSH	1.898
Soroka, Michael	SP	WSH	1.883
Miller, Mason	RP	OAK	1.834
Rodriguez, Eduardo	SP	ARI	1.700

--- title: "Pitch, Hit, Profit: A Data-Backed Approach to Gaining an Edge in MLB Betting" author: "Andrew Cerqui, Jace Higa" date: "June 24, 2025" format: html: theme: lux toc: true toc-depth: 2 code-fold: true code-tools: true fig-align: center number-sections: true page-layout: full smooth-scroll: true link-external-newwindow: true code-line-numbers: true embed-resources: true --- ## Data Summary ### Data Ingestion Our main data source was the **StatsAPI**, which provided MLB box score data in JSONL format. After uploading the raw files to Railway and our Postgres database, we had to figure out how to parse the nested JSON using SQL. This process involved trial and error until we found queries that matched our table structure. Scraping setup took time as we explored different sources before settling on an API that supported our needs. Once we finalized our approach, the ingestion process was fairly smooth, mainly focused on correctly inserting parsed JSONL into our schema. The second data source was **Statcast data from Baseball Savant**, provided as CSV files. We cleaned these by deleting irrelevant columns, renaming headers, and standardizing the format to match our database requirements. Uploading from the terminal into our PostgreSQL database went smoothly, resulting in well-structured and usable tables. Most of the challenges we encountered involved experimentation rather than critical issues. ------------------------------------------------------------------------ ### Data Organization  ![Entity Relationship Diagram of MLB Data Schema](images/erd.png){fig-align="center" width="100%"} This is what our Entity Relationship Diagram looks like. The main table stems from `games`, which connects to many other tables. The `teams` table adds context by joining via `team_id`, allowing us to analyze team-level data. From `batter_stats` and `pitcher_stats`, we can access game context using `game_id`, and then retrieve player-specific details through the `batters` and `pitchers` tables. These two tables are keyed on `batter_id`, `pitcher_id`, `team_id`, and `position` or `type`, which lets us track players who change positions (or teams) throughout the season. For example, **Cal Raleigh** is primarily a catcher. Over 227 at-bats in that role, he has a .295 average and 1.136 OPS—MVP-caliber numbers. As a DH over 67 at-bats, however, he has just a .194 average and .659 OPS. That kind of role-based discrepancy highlights why tracking **positional splits** is important. Finally, we have Statcast data: `batter_statcast` and `pitcher_statcast`, along with `batter_pitches` and `pitcher_pitches`. These contain advanced metrics, including expected stats and run values per pitch, and are joined via player IDs. ------------------------------------------------------------------------ ### Data Compilation #### Figure 1 This is an example of a table that we created to take a look at all players from the different teams that have played **shortstop (SS)**. We used the `teams` table to extract the **team_name**, the `batters` table to find players who have logged game time at **shortstop**, and the `batter_pitches` table to get the **player_name** of the batters. We chose that table because the name of the players was in a nicer format on **Statcast** than they were in the **box scores** data we pulled. ``` sql -- Players Who Have Played SS Across Each Team -- SELECT DISTINCT t3.player_name, t1.team_name, t2.position FROM teams AS t1 JOIN batters AS t2 ON t1.team_id = t2.team_id JOIN batter_pitches AS t3 ON t2.batter_id = t3.batter_id WHERE t2.position = 'SS' ORDER BY t1.team_name, t3.player_name; ``` ```{r ss-query, message=FALSE, warning=FALSE, echo=FALSE} library(DBI) library(RPostgres) library(knitr) # Connect to Railway PostgreSQL con <- dbConnect( RPostgres::Postgres(), dbname = "railway", host = "switchyard.proxy.rlwy.net", port = 42251, user = "postgres", password = "onfxNlZFioFScuucmNhZKHhzPggcMfvd" ) # Define and run query ss <- " SELECT DISTINCT t3.player_name, t1.team_name, t2.position FROM teams AS t1 JOIN batters AS t2 ON t1.team_id = t2.team_id JOIN batter_pitches AS t3 ON t2.batter_id = t3.batter_id WHERE t2.position = 'SS' ORDER BY t1.team_name, t3.player_name; " ss_players <- dbGetQuery(con, ss) # Display result kable(ss_players) ``` ------------------------------------------------------------------------ #### Figure 2 With this table we found the highest **rv** or **run values** for individual **pitch types**. It just basically means which batters hit certain pitches the best, so **Aaron Judge** has the highest run value overall against a single pitch—and that happens to be the **4-Seam Fastball**. We used the `batter_pitches` table to get the **player_name** again, the `teams` table to get the **team_name** they are from, and had to use the `batters` table to get the **position** through a **common table expression (CTE)**. We used the CTE to determine the player’s most frequently played **position**, since it makes the most sense to list the most common position they play. Then we used the `batter_pitches` table again to get the **rv** values for the various **pitch_name** types. Together it gave us the table pictured to the left. ``` sql -- Top 10 Highest Run Values Against Different Pitches -- WITH most_common_positions AS ( SELECT batter_id, position, COUNT(*) AS position_count, ROW_NUMBER() OVER (PARTITION BY batter_id ORDER BY COUNT(*) DESC) AS row_num FROM batter_stats GROUP BY batter_id, position ), ranked_pitches AS ( SELECT t3.player_name, t1.team_name, t4.position, t3.pitch_name, t3.rv, ROW_NUMBER() OVER (PARTITION BY t3.batter_id ORDER BY t3.rv DESC) AS row FROM batter_pitches AS t3 JOIN batters AS t2 ON t3.batter_id = t2.batter_id JOIN teams AS t1 ON t2.team_id = t1.team_id JOIN most_common_positions AS t4 ON t3.batter_id = t4.batter_id WHERE t4.row_num = 1 ) SELECT player_name, team_name, position, pitch_name, rv FROM ranked_pitches WHERE row = 1 ORDER BY rv DESC LIMIT 10; ``` ```{r rv-query, message=FALSE, warning=FALSE, echo=FALSE} # rv <- " WITH most_common_positions AS ( SELECT batter_id, position, COUNT(*) AS position_count, ROW_NUMBER() OVER (PARTITION BY batter_id ORDER BY COUNT(*) DESC) AS row_num FROM batter_stats GROUP BY batter_id, position ), ranked_pitches AS ( SELECT t3.player_name, t1.team_name, t4.position, t3.pitch_name, t3.rv, ROW_NUMBER() OVER (PARTITION BY t3.batter_id ORDER BY t3.rv DESC) AS row FROM batter_pitches AS t3 JOIN batters AS t2 ON t3.batter_id = t2.batter_id JOIN teams AS t1 ON t2.team_id = t1.team_id JOIN most_common_positions AS t4 ON t3.batter_id = t4.batter_id WHERE t4.row_num = 1 ) SELECT player_name, team_name, position, pitch_name, rv FROM ranked_pitches WHERE row = 1 ORDER BY rv DESC LIMIT 10; " top_rv <- dbGetQuery(con, rv) knitr::kable(top_rv) ``` ------------------------------------------------------------------------ #### Figure 3 This table displays the pitchers with the highest **diff_era**. It basically shows which pitcher has had the worst luck, since **diff_era** is calculated by **ERA - xERA**. So, it's a measure of how high their actual earned run average is in comparison to what it is expected to be. To create this table, we used the `pitchers` table to get the **position**, the `teams` table to get **team_name**, the `pitcher_pitches` table to get the **player_name** in a nice format, and the `pitcher_statcast` table to get the **diff_era** stat that this analysis is based on. ``` sql -- Highest diff_era, Pitchers Who Have Been Unlucky -- WITH ranked_pitchers AS ( SELECT t1.pitcher_id, t3.player_name, t1.position, t2.team_name, t4.diff_era, ROW_NUMBER() OVER (PARTITION BY t1.pitcher_id ORDER BY t4.diff_era DESC) AS row FROM pitchers AS t1 JOIN teams AS t2 ON t1.team_id = t2.team_id JOIN pitcher_pitches AS t3 ON t1.pitcher_id = t3.pitcher_id JOIN pitcher_statcast AS t4 ON t1.pitcher_id = t4.pitcher_id WHERE t3.pitches > 150 ) SELECT player_name, position, team_name, diff_era FROM ranked_pitchers WHERE row = 1 ORDER BY diff_era DESC LIMIT 10; ``` ```{r era-query, message=FALSE, warning=FALSE, echo=FALSE} # era <- " WITH ranked_pitchers AS ( SELECT t1.pitcher_id, t3.player_name, t1.position, t2.team_name, t4.diff_era, ROW_NUMBER() OVER (PARTITION BY t1.pitcher_id ORDER BY t4.diff_era DESC) AS row FROM pitchers AS t1 JOIN teams AS t2 ON t1.team_id = t2.team_id JOIN pitcher_pitches AS t3 ON t1.pitcher_id = t3.pitcher_id JOIN pitcher_statcast AS t4 ON t1.pitcher_id = t4.pitcher_id WHERE t3.pitches > 150 ) SELECT player_name, position, team_name, diff_era FROM ranked_pitchers WHERE row = 1 ORDER BY diff_era DESC LIMIT 10; " diff_era <- dbGetQuery(con, era) knitr::kable(diff_era) ```