Not a Story of Elly De La Cruz: Unraveling the Hidden Gems and Exploring Striking Similarities Among Baseball Prospects

andresmitre · 14 min read · Jul 23, 2023


The world of baseball is constantly abuzz with talk of new superstars who have the potential to change the game. One such rising star is Elly De La Cruz, whose electrifying talent and meteoric rise to fame have captured the attention of everyone who likes baseball. In this article, we delve into the methodology used to explore similarities among baseball prospects, with a focus on identifying potential future stars similar to the likes of Elly De La Cruz. By employing the K-Nearest Neighbors (KNN) algorithm, we aim to provide valuable insights into the next generation of MLB standouts.

The Hype of Elly De La Cruz

Elly De La Cruz’s rapid ascent to stardom has been highlighted in numerous stories.

As C. Trent Rosecrans reported, the excitement surrounding Elly De La Cruz continues to grow at Great American Ball Park, where he finds himself under the spotlight. Notably, he was even featured in a national ad for the new “Mission: Impossible” movie alongside renowned figures like the Ravens’ Odell Beckham Jr. and USWNT star Alex Morgan. De La Cruz possesses a combination of size, strength, speed, and skill that sets him apart as a potential superstar in the making. The overwhelming hype and the emergence of this young star captivated my interest and prompted me to delve into his story through this article, as did tweets from Sarah Langs and Codify.

Exploratory methodology

As we draw insights from the performances of past players like Jonathan India, Yordan Alvarez, Pete Alonso, and Aaron Judge, the burning curiosity arises — who might emerge as the next baseball sensation in the upcoming year? To quench this curiosity, I embarked on a structured exploratory journey, following these essential steps:

  1. Collecting Data: The foundation of our analysis lies in gathering a comprehensive set of examples, each accurately labeled and categorized.
  2. Getting Data Ready: To ensure seamless comprehension by the computer, meticulous data cleaning and organization were undertaken. We rid the data of any clutter or irrelevant information.
  3. Making it Smarter: Our quest to identify top-performing future rookies akin to established stars led us to employ the powerful K-Nearest Neighbors (KNN) algorithm. This methodology parallels the way we would find fruits with similar color, size, and weight when encountering a new fruit in another domain.
  4. Making a Guess: Armed with the KNN’s results, we meticulously study the categories to which similar examples belong. Based on this informed analysis, we venture into making educated predictions about the category of the future prospects.
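As a concrete sketch of the fruit analogy in step 3, here is a minimal K-Nearest Neighbors example using scikit-learn. The fruit measurements and labels are invented purely for illustration:

```python
import numpy as np
from sklearn.neighbors import KNeighborsClassifier

# Invented fruit data: [color hue, diameter in cm, weight in g]
X = np.array([
    [0.05, 7.5, 150.0],  # apple
    [0.12, 8.0, 160.0],  # apple
    [0.30, 6.0, 130.0],  # orange
    [0.28, 6.5, 140.0],  # orange
])
y = ["apple", "apple", "orange", "orange"]

knn = KNeighborsClassifier(n_neighbors=3)
knn.fit(X, y)

# A new fruit is labeled by majority vote of its 3 nearest neighbors
new_fruit = [[0.10, 7.8, 155.0]]
print(knn.predict(new_fruit))  # -> ['apple']
```

In exactly the same way, a prospect’s statistical profile can be matched against those of established players. In practice, features on different scales should be standardized first so no single feature dominates the distance calculation.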

In the forthcoming section, I shall dive deep into each step, providing comprehensive explanations that illuminate the inner workings of the K-Nearest Neighbors algorithm and its practical applications. Together, let us unravel the secrets behind this potent technique, as we eagerly seek the future stars of baseball!

Collecting Data: Selecting the Future Stars and Recent Top-Performing Rookies in Baseball

To construct a robust dataset, I delved into the performances of the recent rookie top performers from the 2023, 2022, 2021, and 2019 seasons. It is important to note that I did not conduct an exhaustive analysis of every rookie who debuted during these years; instead, I carefully extracted the names of the top performers from published reports. The resulting comprehensive set of players includes remarkable talents such as:

Adley Rutschman, Akil Baddoo, Alfonso Rivas III, Bo Bichette, Brett Baty, Bryan Reynolds, Cavan Biggio, Christopher Morel, Corbin Carroll, Elly De La Cruz, Eloy Jiménez, Esteury Ruiz, Fernando Tatis Jr., Gunnar Henderson, James Outman, Jarren Duran, Jerar Encarnación, Jonathan India, Kyle Isbel, Luis Arraez, Matt McLain, Nick Fortes, Oneil Cruz, Patrick Bailey, Pete Alonso, Spencer Steer, Tommy Edman, Vaughn Grissom, Vladimir Guerrero Jr., Wander Franco, Wynton Bernard, and Yordan Alvarez.

The data was meticulously curated from the advanced table on FanGraphs, encompassing both the MiLB and MLB statistics, as exemplified in Figure 1, displaying Elly de la Cruz’s advanced stats.

Figure 1. Elly de la Cruz’s advanced stats from Fangraphs (source).

Future Prospects: Aiming to unearth the potential superstars of tomorrow, I carefully handpicked 15 players from the esteemed top 100 prospect ranking list. To ensure a focused analysis, I employed specific criteria for player selection, excluding those who are RHP or LHP (right-handed pitchers or left-handed pitchers), players who have already made their debut in the major leagues, and players currently below the AA level.

By implementing these rigorous criteria, I aimed to zero in on the talents who are on the precipice of reaching the majors, allowing us to identify and evaluate the promising future stars and compare them with recent top-performing players. This approach enables us to gain valuable insights into the next generation of MLB standouts and explore the potential similarities between these two categories of players. With a wealth of data at hand, we are now primed to delve deeper into the analysis and unravel the hidden gems within the world of baseball prospects. Image 2 shows the complete set of players considered for this study.

Image 2. Set of players considered for the study.

Getting Data Ready: Preparing for the Comparison of Future Prospects and Recent Top Rookie Performers

To conduct a meaningful comparison between the two groups of players — future prospects and recent rookie top performers — meticulous data preparation was of utmost importance. To begin with, I collected all the advanced statistics for each player group from FanGraphs, with a particular focus on crucial metrics such as OPS (On-Base Plus Slugging), wOBA (Weighted On-Base Average), and wRC+ (Weighted Runs Created Plus).

Once the data was gathered, the next step involved analyzing the rate of change between seasons for each player. This was a pivotal process as it allowed us to identify potential similarities or patterns with players who achieved remarkable success during their debut season. By exploring these patterns, we can gain valuable insights into the factors contributing to a successful rookie season.
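The season-over-season rate of change described here can be sketched with pandas; the seasons and values below are illustrative, not the actual FanGraphs data:

```python
import pandas as pd

# Illustrative per-season aggregates for a single player
df = pd.DataFrame({
    "season": [2019, 2021, 2022],
    "OPS":  [0.955, 0.890, 0.815],
    "wOBA": [0.410, 0.385, 0.360],
})

# Rate of change between consecutive recorded seasons
for col in ["OPS", "wOBA"]:
    df[f"{col}_change"] = df[col].pct_change()

print(df.round(3))
```

Each change value compares a season to the player’s previous recorded season, which is what lets us look for trajectories rather than single-year snapshots.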

Let’s illustrate this with a specific example, considering the recorded seasons and levels for Adley Rutschman, as shown in Image 3. To prepare Adley’s data for each season, I calculated the mean of his statistics across all the levels he played at, from R (Rookie level) to A (full-season A). This approach helped eliminate potential outliers caused by small sample sizes and enabled us to focus on Adley’s overall development as a player, irrespective of the league level.

The resulting aggregated data for Adley Rutschman is displayed in Image 4, providing a clearer picture of his performance trajectory over the years. With this meticulous data preparation in place, we are now well-equipped to proceed with a comprehensive analysis of both future prospects and recent top rookie performers, unraveling the potential hidden gems and similarities that lie within the realm of baseball prospects.

Image 3. Advanced stats from Adley Rutschman. The data was extracted from FanGraphs.
Image 4. Aggregated results from Adley Rutschman.
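A minimal sketch of this per-season aggregation with pandas, using invented stat lines rather than Rutschman’s actual numbers:

```python
import pandas as pd

# Invented multi-level season lines for one player
stats = pd.DataFrame({
    "season": [2019, 2019, 2021, 2021],
    "level":  ["R", "A", "A+", "AA"],
    "OPS":    [1.130, 0.820, 0.960, 0.930],
    "wOBA":   [0.480, 0.360, 0.420, 0.405],
    "wRC+":   [180, 115, 160, 150],
})

# One row per season: the mean across every level played that year
per_season = stats.groupby("season")[["OPS", "wOBA", "wRC+"]].mean()
print(per_season)
```

Averaging across levels smooths out small-sample spikes at any single level, which is exactly the outlier-damping effect described above.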

Once the means for the three crucial statistics (OPS, wOBA, and wRC+) were calculated for each player, including Adley Rutschman, the next step was to analyze the rate of change for these metrics over consecutive seasons (e.g., from 2019 to 2021, 2021 to 2022, and so on). This examination was vital to understand how player performance evolved over time and identify trends that could uncover potential standout prospects.

To accurately interpret the rate of change, I factored in the magnitude of the statistics, focusing specifically on OPS and wOBA. It was essential to treat players fairly, recognizing that drastic changes in statistics from a high base level might not be comparable to similar changes from a lower base level. For instance, penalizing a player who went from a near 1.100 OPS and dropped to 0.900 in the next season in the same way as another player who went from a near 0.750 OPS to 0.600 would be unjust. The latter’s drop might be considered more significant in terms of performance.

To address this issue, I employed a linear function to adjust the rate of change, which ensured a fair and equitable evaluation. By applying this adjustment, players with lower OPS values experienced higher penalization, while players with higher OPS values were assessed more softly. This approach is highlighted in Image 5, where the impact of the linear function on Rutschman’s OPS changes from 0.955 to 0.815 in 2022 can be observed.

# Linear function
import numpy as np

def calculate_penalty(woba, percentage_change):
    max_penalty = 0.5
    min_penalty = 0.2

    # Shift percentage_change slightly to avoid division by zero
    adjusted_percentage_change = percentage_change + 0.001

    # Linearly interpolate the penalty from the wOBA-to-change ratio
    penalty = max_penalty - (max_penalty - min_penalty) * (woba / adjusted_percentage_change)

    # Clamp the penalty into the range [min_penalty, max_penalty]
    penalty = np.clip(penalty, min_penalty, max_penalty)

    return penalty

Originally, the rate of change was calculated at -14%, but with the application of the linear function, the adjusted rate of change became -7%. This adjustment not only provided a more balanced evaluation but also contributed to a comprehensive analysis of player progression over time.

Similar observations were made for other players, like Bo Bichette from 2016 to 2017, where the linear function helped level the playing field and provide meaningful insights into their performance trends.

By conducting this meticulous analysis, we were able to account for the nuances in player statistics, ensuring a fair and accurate assessment of their development and performance trajectories over consecutive seasons. These adjustments have contributed significantly to our ability to identify and uncover potential hidden gems among baseball prospects.

Image 5. Aggregated results of OPS

A crucial aspect of our analysis is the application of the penalty effect, which not only pertains to negative changes but also encompasses positive changes in player statistics. By implementing this adjustment, we ensure a fair and consistent evaluation across all players, regardless of whether their performance improves or declines over time. Adley Rutschman’s performance in 2021 serves as an excellent example of how positive changes are also subject to the same linear adjustment, maintaining the integrity of our study’s results.

While OPS and wOBA underwent the linear adjustment, I made a conscious decision not to consider a similar linear function for wRC+. The reason for this lies in the nature of wRC+, which provides a relative measure of a player’s offensive performance compared to the league average, factoring in various statistics. Given its standardized nature, applying a linear function to wRC+ values would not yield meaningful insights, as it could potentially distort the already normalized data.

Once the rate of change for OPS and wOBA was obtained and appropriately adjusted, the next crucial step involved calculating the mean and standard deviation across all seasons for each player. This statistical analysis allowed us to gauge both the average performance level and the degree of variability in a player’s statistics over time.

By computing the mean and standard deviation as seen in image 6, we gained valuable insights into the consistency of each player’s performance. Players with steady and reliable performances would exhibit smaller deviations, indicating a consistent level of play across seasons. On the other hand, players with more significant deviations might have experienced fluctuations in their statistics due to a range of factors.

Image 6. Mean and standard deviation of each top-performance player.
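This mean-and-deviation summary can be sketched with a pandas groupby; the adjusted change values are invented for illustration:

```python
import pandas as pd

# Invented adjusted rates of change per player per season transition
changes = pd.DataFrame({
    "player": ["Adley Rutschman", "Adley Rutschman",
               "Bo Bichette", "Bo Bichette"],
    "OPS_change": [-0.07, 0.05, 0.10, -0.02],
})

# Mean = average year-to-year movement; std = consistency of that movement
summary = changes.groupby("player")["OPS_change"].agg(["mean", "std"])
print(summary)
```

A small standard deviation marks the steady performers; a large one flags players whose statistics swung between seasons.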

Making it smarter

To unleash the full potential of the K-Nearest Neighbors (KNN) algorithm, a critical step involved classifying the new top future rookies into categories like those of the recent top-performing players. Analogous to how we would identify a new fruit by seeking other fruits with similar color, size, and weight, the KNN method allows us to identify hidden patterns and similarities between player performances.

However, a challenge emerged during data preparation for the KNN analysis. The limited data availability resulted in only one instance of each class, representing one player per category, as depicted in Image 6. The insufficiency of data points (31 in total) posed a potential obstacle to conducting a robust KNN analysis, as the algorithm relies on a sufficient number of samples to identify close neighbors and make accurate predictions.

To surmount this obstacle, a powerful technique called “data augmentation” was implemented. This ingenious method facilitated an increase in the number of samples for each player by generating additional synthetic data points. By creating copies of each player and introducing random noise to each synthetic instance, we effectively expanded the dataset, as illustrated in Image 7.

Image 7. Synthetic data points of each top-performance player.
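A minimal sketch of this augmentation step with NumPy: each player’s single feature vector is duplicated with small Gaussian noise. The feature values, noise scale, and copy count are assumptions for illustration:

```python
import numpy as np

rng = np.random.default_rng(42)

# Invented per-player feature vectors, e.g. [mean OPS change, std, mean wOBA change]
players = {
    "Adley Rutschman": np.array([-0.01, 0.08, 0.02]),
    "Bo Bichette":     np.array([0.04, 0.05, 0.03]),
}

n_copies = 30
X, y = [], []
for name, features in players.items():
    noise = rng.normal(scale=0.01, size=(n_copies, features.size))
    X.append(features)          # keep the original point
    X.extend(features + noise)  # add the noisy synthetic copies
    y.extend([name] * (n_copies + 1))

X = np.array(X)
print(X.shape)  # (62, 3): 31 points per player
```

The noise scale controls how tightly the synthetic points cluster around each original; too much noise would blur the class boundaries that KNN relies on.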

Next, before feeding the augmented data into the KNN algorithm, I aggregated the variables and reduced them to two components using Principal Component Analysis (PCA). This transformation enabled the visualization of a scatter plot, providing invaluable insights into the proximity of synthetic and original data points, as displayed in Image 8. This process was equally applied to future prospects without increasing the number of data points, ensuring a fair and balanced comparison.


Image 8. Unraveling Insights: PCA Scatter Plot of aggregated variables for KNN Analysis. The proximity of synthetic and original data points.
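The dimensionality reduction can be sketched with scikit-learn’s PCA; a random matrix stands in here for the augmented player features:

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
X = rng.normal(size=(62, 6))  # placeholder for the augmented feature matrix

# Project onto the two directions of maximum variance for a 2-D scatter plot
pca = PCA(n_components=2)
X_2d = pca.fit_transform(X)
print(X_2d.shape)  # (62, 2)
```

The same fitted PCA must also transform the prospects’ features (via `pca.transform`) so that both groups live in one shared 2-D space.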

With the data now fully prepared, it was time to introduce it to the KNN algorithm. However, prior to making definitive predictions, a preliminary validation was conducted to ascertain the accuracy of the KNN classification. This validation involved employing a confusion matrix, a powerful tool used to evaluate the performance of classification models in supervised problems.

The confusion matrix, shown in Image 9, tabulates the number of correct and incorrect predictions made by the model for each class. From this preliminary validation, the following evaluations of the implemented algorithm were obtained:

  • Precision: The precision is 0.9219, meaning that 92.19% of the model’s positive predictions are true positives. In other words, among all instances the model assigned to a specific class, 92.19% actually belong to that class.
  • Recall (or sensitivity): The recall is also 0.9219, indicating that the model correctly identifies 92.19% of all instances of a specific class, capturing most of the real instances in the test set.
  • F1-score: The F1-score is likewise 0.9219, a metric that combines precision and recall into a single value; a score close to 1 indicates a good balance between the two. Together, these metrics indicate solid and reliable classification performance on the test set. The confusion matrix for the case study can be observed in Image 9.
Image 9. Confusion matrix of top-performance players.

In the above image, an incorrect classification of Alfonso Rivas III and Luis Arráez can be observed, indicating that the implemented model has difficulties distinguishing between these two players due to their similarity (Image 8). To overcome this, it is necessary to consider additional features or metrics and increase the data. In the specific case of the study, classifying a future prospect as Alfonso Rivas or Luis Arráez makes no difference as they have similar patterns.
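These evaluations can be reproduced in sketch form with scikit-learn’s metrics; the toy labels below, including a Rivas/Arráez-style mix-up like the one just described, are invented:

```python
from sklearn.metrics import confusion_matrix, f1_score, precision_score, recall_score

# Toy true vs. predicted labels standing in for the 31 player classes
y_true = ["Arraez", "Rivas", "Arraez", "Cruz", "Cruz"]
y_pred = ["Arraez", "Arraez", "Arraez", "Cruz", "Cruz"]  # one Rivas -> Arraez error

print(confusion_matrix(y_true, y_pred, labels=["Arraez", "Cruz", "Rivas"]))

# With micro averaging, every error counts as both a false positive and a
# false negative, so precision, recall, and F1 all coincide; that would
# explain the three identical 0.9219 values reported above.
print(precision_score(y_true, y_pred, average="micro"))  # 0.8
print(recall_score(y_true, y_pred, average="micro"))     # 0.8
print(f1_score(y_true, y_pred, average="micro"))         # 0.8
```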

Making an Educated Prediction

Now, let’s delve into the exciting realm of classifying new prospects, a process that unfolds through a series of crucial steps:

  • Step 1 - Loading the Trained KNN Model: To initiate the classification process, I load the KNN model previously trained with historical player data. This model is armed with the feature matrix of known players, meticulously gathered through our exploratory methodology.
  • Step 2 - Introducing New Prospect Players: The moment of revelation has arrived for the as-yet unclassified prospects. I input their performance data into the model as a feature matrix. While their potential remains unknown, their traits bear striking resemblance to those of established players, allowing the KNN model to seek out the most akin historical counterparts.
  • Step 3 - Calculating Distances and Selecting the Closest Players: Intricate calculations ensue, as the algorithm measures the distances between the features of the new prospects and the known players. The KNN model then selects the two (k = 2) closest players in terms of similarity. This vital step enables me to discern hidden patterns and similarities in player profiles.
  • Step 4 - Majority Voting Process: Now it’s time to harness the power of collective wisdom. I assign a label to each future player based on the labels of the nearest players. This collective insight sharpens our understanding of the potential trajectory of these promising prospects.
Image 10. Similarity between future prospects and top-performance players.
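Steps 1–4 can be sketched end to end: fit KNN on the known players’ (augmented) features, then query each prospect’s two nearest neighbors. All data here is synthetic; the real model would use the engineered features described earlier:

```python
import numpy as np
from sklearn.neighbors import KNeighborsClassifier

rng = np.random.default_rng(1)

# Synthetic stand-in for the augmented known-player matrix and labels
X_known = rng.normal(size=(62, 2))
y_known = ["Esteury Ruiz"] * 31 + ["Oneil Cruz"] * 31

knn = KNeighborsClassifier(n_neighbors=2)
knn.fit(X_known, y_known)

# A synthetic "prospect": label by majority vote and inspect the distances
prospect = rng.normal(size=(1, 2))
distances, indices = knn.kneighbors(prospect)
print(knn.predict(prospect))  # label drawn from the nearest known players
print(distances)              # how close those neighbors are
```

The distances returned by `kneighbors` are what the results section reports when it quotes similarity values for each prospect.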

Results: The Revelations Unveiled

In this passage, we witness the intriguing results of a comparison analysis that showcases striking similarities between prospective players and existing ones in the training set. These similarities offer valuable insights into the potential skill sets and playing styles of the prospects by drawing parallels with established players.

Firstly, let’s delve into the prospect ‘Brooks Lee,’ whose closest players are ‘Esteury Ruiz’ and ‘Curtis Mead’ with remarkably close distances of 0.2918 and 0.2926, respectively. This suggests that Brooks Lee shares comparable attributes and characteristics with these existing players, indicating a high potential for success if he follows a similar development path.

Another remarkable discovery emerges as we explore the case of ‘Curtis Mead,’ whose strong resemblance to ‘Oneil Cruz’ is evident through the consistent distances of approximately 34.92 in both instances. This similarity opens up exciting possibilities, implying that Curtis Mead might possess talents akin to those displayed by the established player Oneil Cruz.

The analysis further unfolds, revealing diverse scenarios with ‘George Valera’ and ‘Jackson Chourio.’ George Valera exhibits likenesses with two different players, ‘Bo Bichette’ and ‘Vaughn Grissom,’ implying that he might showcase a mix of skills and playing styles demonstrated by these established athletes. On the other hand, ‘Jackson Chourio’ showcases similarities with ‘Bo Bichette’ and ‘Wander Franco,’ suggesting a potential for versatility and adaptability in his game.

Overall, the revelation of high similarities between prospects and existing players carries tremendous significance. It provides valuable guidance to talent scouts, coaches, and team managers, allowing them to make informed decisions about player recruitment and development strategies. By recognizing these captivating parallels, the pathway to success for these prospective players becomes clearer, offering the potential to thrive and make a significant impact in the world of sports.

In summary

It is essential to highlight that this process can provide valuable information for decision-making in the sports field. Prospects classified as similar to successful, talented players may have high potential for future success. This classification can help teams and coaches identify emerging talents and make informed decisions about player selection and development. However, it is also essential to remember that similarity classification is just a tool and does not guarantee a player’s future performance. Evaluating players is a complex, multifaceted process that must consider additional factors, such as ethics, teamwork, and adaptability.

In conclusion, the process of similarity classification using machine learning techniques like KNN represents a valuable tool for evaluating the potential and performance of new prospect players in the sports field. Its application enables sports teams to make more informed and strategic decisions in forming competitive and successful teams in the future.

Feel free to reach out for any feedback or to chat on Twitter, LinkedIn, or via email (andres.mitre@outlook.com). More analytical projects can be found in my portfolio. The GitHub repository will be made accessible soon, containing all the files used for this study. Thank you for joining me on this exciting journey into baseball prospects and data analysis. Stay tuned for more updates!
