Pitch, Hit, Profit: A Data-Backed Approach to Gaining an Edge in MLB Betting

Author

Andrew Cerqui, Jace Higa

Published

June 24, 2025

1 Data Summary

1.1 Data Ingestion

Our main data source was the StatsAPI, which provided MLB box score data in JSONL format. After uploading the raw files to Railway and our Postgres database, we had to figure out how to parse the nested JSON using SQL. This process involved trial and error until we found queries that matched our table structure. Scraping setup took time as we explored different sources before settling on an API that supported our needs. Once we finalized our approach, the ingestion process was fairly smooth, mainly focused on correctly inserting parsed JSONL into our schema.

The second data source was Statcast data from Baseball Savant, provided as CSV files. We cleaned these by deleting irrelevant columns, renaming headers, and standardizing the format to match our database requirements. Uploading from the terminal into our PostgreSQL database went smoothly, resulting in well-structured and usable tables.

Most of the challenges we encountered involved experimentation rather than critical issues.


1.2 Data Organization

Entity Relationship Diagram of MLB Data Schema

This is what our Entity Relationship Diagram looks like. The main table stems from games, which connects to many other tables. The teams table adds context by joining via team_id, allowing us to analyze team-level data.

From batter_stats and pitcher_stats, we can access game context using game_id, and then retrieve player-specific details through the batters and pitchers tables. These two tables are keyed on batter_id, pitcher_id, team_id, and position or type, which lets us track players who change positions (or teams) throughout the season.

For example, Cal Raleigh is primarily a catcher. Over 227 at-bats in that role, he has a .295 average and 1.136 OPS—MVP-caliber numbers. As a DH over 67 at-bats, however, he has just a .194 average and .659 OPS. That kind of role-based discrepancy highlights why tracking positional splits is important.

Finally, we have Statcast data: batter_statcast and pitcher_statcast, along with batter_pitches and pitcher_pitches. These contain advanced metrics, including expected stats and run values per pitch, and are joined via player IDs.


1.3 Data Compilation

1.3.1 Figure 1

This is an example of a table that we created to take a look at all players from the different teams that have played shortstop (SS). We used the teams table to extract the team_name, the batters table to find players who have logged game time at shortstop, and the batter_pitches table to get the player_name of the batters. We chose that table because the name of the players was in a nicer format on Statcast than they were in the box scores data we pulled.

-- Players Who Have Played SS Across Each Team --
SELECT DISTINCT
  t3.player_name, 
  t1.team_name, 
  t2.position
FROM teams AS t1
JOIN batters AS t2 ON t1.team_id = t2.team_id
JOIN batter_pitches AS t3 ON t2.batter_id = t3.batter_id
WHERE t2.position = 'SS'
ORDER BY t1.team_name, t3.player_name;
player_name team_name position
Perdomo, Geraldo ARI SS
Allen, Nick ATL SS
Henderson, Gunnar BAL SS
Story, Trevor BOS SS
Hoerner, Nico CHC SS
Swanson, Dansby CHC SS
De La Cruz, Elly CIN SS
Arias, Gabriel CLE SS
Schneemann, Daniel CLE SS
Meidroth, Chase CWS SS
Báez, Javier DET SS
Sweeney, Trey DET SS
Dubón, Mauricio HOU SS
Peña, Jeremy HOU SS
Witt Jr., Bobby KC SS
Neto, Zach LAA SS
Betts, Mookie LAD SS
Rojas, Miguel LAD SS
Lopez, Otto MIA SS
Ortiz, Joey MIL SS
Correa, Carlos MIN SS
Lee, Brooks MIN SS
Lindor, Francisco NYM SS
Peraza, Oswald NYY SS
Volpe, Anthony NYY SS
Wilson, Jacob OAK SS
Turner, Trea PHI SS
Kiner-Falefa, Isiah PIT SS
Bogaerts, Xander SD SS
Iglesias, Jose SD SS
Crawford, J.P. SEA SS
Adames, Willy SF SS
Winn, Masyn STL SS
Caballero, José TB SS
Walls, Taylor TB SS
Seager, Corey TEX SS
Bichette, Bo TOR SS
Giménez, Andrés TOR SS
Abrams, CJ WSH SS

1.3.2 Figure 2

With this table we found the highest rv or run values for individual pitch types. It just basically means which batters hit certain pitches the best, so Aaron Judge has the highest run value overall against a single pitch—and that happens to be the 4-Seam Fastball.

We used the batter_pitches table to get the player_name again, the teams table to get the team_name they are from, and had to use the batters table to get the position through a common table expression (CTE). We used the CTE to determine the player’s most frequently played position, since it makes the most sense to list the most common position they play. Then we used the batter_pitches table again to get the rv values for the various pitch_name types. Together it gave us the table pictured to the left.

-- Top 10 Highest Run Values Against Different Pitches --
WITH most_common_positions AS (
  SELECT 
    batter_id,
    position,
    COUNT(*) AS position_count,
    ROW_NUMBER() OVER (PARTITION BY batter_id ORDER BY COUNT(*) DESC) AS row_num
  FROM batter_stats
  GROUP BY batter_id, position
),
ranked_pitches AS (
  SELECT
    t3.player_name,
    t1.team_name,
    t4.position,  
    t3.pitch_name,
    t3.rv,
    ROW_NUMBER() OVER (PARTITION BY t3.batter_id ORDER BY t3.rv DESC) AS row
  FROM batter_pitches AS t3
  JOIN batters AS t2 ON t3.batter_id = t2.batter_id
  JOIN teams AS t1 ON t2.team_id = t1.team_id
  JOIN most_common_positions AS t4 
    ON t3.batter_id = t4.batter_id
  WHERE t4.row_num = 1
)
SELECT 
  player_name,
  team_name,
  position,
  pitch_name,
  rv
FROM ranked_pitches
WHERE row = 1
ORDER BY rv DESC
LIMIT 10;
player_name team_name position pitch_name rv
Judge, Aaron NYY RF 4-Seam Fastball 17
Schwarber, Kyle PHI DH 4-Seam Fastball 13
Tucker, Kyle CHC RF 4-Seam Fastball 12
Guerrero Jr., Vladimir TOR 1B 4-Seam Fastball 12
Wilson, Jacob OAK SS Slider 11
Ramírez, José CLE 3B Sinker 11
Díaz, Yandy TB DH 4-Seam Fastball 11
Castro, Willi MIN 2B 4-Seam Fastball 10
Alonso, Pete NYM 1B Sinker 10
Marte, Ketel ARI 2B 4-Seam Fastball 10

1.3.3 Figure 3

This table displays the pitchers with the highest diff_era. It basically shows which pitcher has had the worst luck, since diff_era is calculated by ERA - xERA. So, it’s a measure of how high their actual earned run average is in comparison to what it is expected to be.

To create this table, we used the pitchers table to get the position, the teams table to get team_name, the pitcher_pitches table to get the player_name in a nice format, and the pitcher_statcast table to get the diff_era stat that this analysis is based on.

-- Highest diff_era, Pitchers Who Have Been Unlucky --
WITH ranked_pitchers AS (
  SELECT 
    t1.pitcher_id,
    t3.player_name,
    t1.position,
    t2.team_name,
    t4.diff_era,
    ROW_NUMBER() OVER (PARTITION BY t1.pitcher_id ORDER BY t4.diff_era DESC) AS row
  FROM pitchers AS t1
  JOIN teams AS t2 ON t1.team_id = t2.team_id
  JOIN pitcher_pitches AS t3 ON t1.pitcher_id = t3.pitcher_id
  JOIN pitcher_statcast AS t4 ON t1.pitcher_id = t4.pitcher_id
  WHERE t3.pitches > 150
)
SELECT 
  player_name,
  position,
  team_name,
  diff_era
FROM ranked_pitchers
WHERE row = 1
ORDER BY diff_era DESC
LIMIT 10;
player_name position team_name diff_era
Alcala, Jorge RP BOS 4.007
Kinley, Tyler RP COL 3.187
Romano, Jordan RP PHI 2.689
Ferguson, Tyler RP OAK 2.506
Fluharty, Mason RP TOR 2.043
Alcantara, Sandy SP MIA 1.999
Ferrer, Jose A. RP WSH 1.898
Soroka, Michael SP WSH 1.883
Miller, Mason RP OAK 1.834
Rodriguez, Eduardo SP ARI 1.700