Baseball Strategies


Data Source: 

Two R packages, pitchRx and Lahman, and Retrosheet (


Dave Campbell (

Of all sports, baseball has probably generated the most extensive and complex statistical analyses. Albert (2010) proposes a number of questions to explore with data obtained by combining three readily available data sets. Two data sets are available through packages in R (R Core Team, 2014): the Lahman package (Friendly, 2014) and the pitchRx package (Sievert, 2014). A third data set with play-by-play data, Retrosheet (Pankin, 2015), can be downloaded for use in R following instructions provided by Albert (2014).


Research Question: 


Question 1: What is the relative value to a team of a (non-pitcher) player's hitting and defence? How much would a batter's offence have to improve to compensate (in terms of runs scored/allowed or wins/losses) for weaker defence in the field? The answer will depend on how critical the fielding position is (Green, 2015).
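One possible way to put offence and defence on a common scale is to convert both into wins through a runs-to-wins formula. The sketch below assumes the well-known Pythagorean expectation (winning percentage approximated by RS^2 / (RS^2 + RA^2)), which is not prescribed by the case study; the function names and the team totals are hypothetical and chosen only for illustration.

```python
def pythagorean_win_pct(runs_scored, runs_allowed, exponent=2.0):
    """Approximate winning percentage from season run totals.

    Assumption: the Pythagorean expectation with exponent 2; other
    exponents or models could be substituted.
    """
    rs = runs_scored ** exponent
    ra = runs_allowed ** exponent
    return rs / (rs + ra)


def runs_needed_to_offset(runs_scored, runs_allowed, extra_runs_allowed):
    """Extra runs scored that keep the Pythagorean winning percentage
    unchanged after the team allows `extra_runs_allowed` more runs.

    The formula depends only on the ratio RS/RA, so holding it fixed
    means scaling runs scored by the same factor as runs allowed.
    """
    new_runs_allowed = runs_allowed + extra_runs_allowed
    return runs_scored * new_runs_allowed / runs_allowed - runs_scored


# Hypothetical team: 750 runs scored, 650 allowed; a weaker fielder
# is assumed to cost 10 additional runs allowed over the season.
offset = runs_needed_to_offset(750, 650, 10)
print(f"Extra runs scored needed: {offset:.1f}")
print(f"Win pct before: {pythagorean_win_pct(750, 650):.3f}")
print(f"Win pct after:  {pythagorean_win_pct(750 + offset, 660):.3f}")
```

Under these assumptions, the number of runs needed to offset a defensive loss depends on the team's overall run environment, which is one reason the answer may differ by team and by position.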

Question 2: What is the impact (on runs scored or wins) of the order of the batting lineup? Traditionally the fastest player (with a high batting average) hits first. A strong contact hitter (few strikeouts) bats second. Third and fourth are "power" hitters who hit many home runs. Some have argued against this tradition and claim that you should simply put your "best" overall hitter first, your "second best" hitter second, and so forth, on the justification that your best hitters will come to bat more often, on average, over the course of a season. How to measure the "best" and "second best" overall hitter is itself a matter of some controversy (Green, 2015).
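The claim that earlier lineup slots come to bat more often follows from simple counting: the batting order cycles, so whichever slots are reached in the final, incomplete turn through the order get one extra plate appearance. A minimal sketch of that arithmetic is given below; the function name and the per-game team totals are hypothetical, not real data.

```python
def plate_appearances_by_slot(team_pa_per_game, lineup_size=9):
    """Count plate appearances for each lineup slot, given the team's
    total plate appearances in each game."""
    totals = [0] * lineup_size
    for team_pa in team_pa_per_game:
        full_turns, extra = divmod(team_pa, lineup_size)
        for slot in range(lineup_size):
            # Every slot bats once per full turn through the order; the
            # first `extra` slots bat once more before the game ends.
            totals[slot] += full_turns + (1 if slot < extra else 0)
    return totals


# Hypothetical team plate-appearance totals for five games.
example_games = [36, 38, 41, 35, 40]
pa_by_slot = plate_appearances_by_slot(example_games)
for slot, pa in enumerate(pa_by_slot, start=1):
    print(f"slot {slot}: {pa} plate appearances")
```

With these illustrative totals the first slot receives 23 plate appearances and the ninth receives 19. Real per-game totals could be taken from the play-by-play data to estimate the size of this effect over a full season.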



Each team chooses which of the two questions it wishes to explore and should present results on only one question.

The goal of this case study is to use these data sets to explore one of the two questions proposed in a personal communication by Green (2015).


  1. Jim Albert (2010). Baseball data at Season, Play-by-Play, and Pitch-by-Pitch Levels. Journal of Statistics Education, 18, Retrieved from
  2. Jim Albert (2014). Exploring Baseball Data with R, Retrieved from
  3. Christopher Green (2015). Personal communication.
  4. Michael Friendly (2014). Lahman: Sean Lahman's Baseball Database. R package version 3.0-1.
  5. Mark Pankin (2015). Retrosheet. Retrieved from
  6. R Core Team (2014). R: A language and environment for statistical computing. R Foundation for Statistical Computing, Vienna, Austria. URL
  7. Carson Sievert (2014). Taming PITCHf/x Data with {pitchRx} and {XML2R}. The R Journal, 6(1). Retrieved from
  8. Carson Sievert (2014). pitchRx: Tools for Harnessing MLBAM Gameday data and Visualizing PITCHf/x. R package version 1.6.