## Sunday, February 27, 2011

### EV Data for Games 1-939

[If you're having difficulty viewing the document, click here to view the spreadsheet directly at googledocs.]

Notes

The document contains three worksheets. The first sheet shows even strength data for all situations. The second shows even strength data for when the score was close (i.e. whenever the score margin was 1 or 0 in the first two periods, or tied in the third period or overtime). The last sheet contains data for when the score was tied.

Empty net goals have been removed from the data.
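The "score close" and "score tied" filters described in the notes above could be expressed roughly as follows. This is only a sketch: the function names and the encoding of overtime as period 4 or higher are assumptions, not something taken from the spreadsheet.

```python
def is_score_close(period: int, goal_margin: int) -> bool:
    """Return True when the game state counts as 'close'.

    period: 1, 2 or 3 for regulation; 4 or higher for overtime (assumed encoding).
    goal_margin: goals for minus goals against; only the size matters.
    """
    margin = abs(goal_margin)
    if period <= 2:
        # First two periods: tied or a one-goal game.
        return margin <= 1
    # Third period and overtime: tied only.
    return margin == 0


def is_score_tied(goal_margin: int) -> bool:
    """Return True when the score is level, regardless of period."""
    return goal_margin == 0
```

Every tied state is also a close state under these definitions, so the third worksheet is a subset of the second.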

Missing Games List

Game 124 - WSH@CAR

Game 429 - ATL@NYI

Abbreviations

GF: goals for

GA: goals against

SF: shots for, where shots = goals + saved shots

SA: shots against

SHOT%: shots for/(shots for + shots against)

SH%: shooting percentage

SV%: save percentage

PDO: shooting percentage + save percentage

FF: fenwick for, where fenwick = shots + missed shots

FA: fenwick against

F%: fenwick for/ (fenwick for + fenwick against)

CF: corsi for, where corsi = shots + missed shots + blocked shots

CA: corsi against

C%: corsi for/ (corsi for + corsi against)
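For readers who want to reproduce the columns from raw event counts, here is a minimal sketch of the abbreviations above. The `TeamCounts` container and its field names are hypothetical, and SH% and SV% use the usual definitions (goals for / shots for, and 1 - goals against / shots against), which the list above does not spell out.

```python
from dataclasses import dataclass


@dataclass
class TeamCounts:
    """Raw even strength event counts for one team (hypothetical layout)."""
    goals_for: int
    goals_against: int
    saved_for: int        # the team's shots that were saved
    saved_against: int    # opposing shots that were saved
    missed_for: int
    missed_against: int
    blocked_for: int      # the team's attempts that were blocked
    blocked_against: int


def share(events_for: int, events_against: int) -> float:
    """for / (for + against); assumes at least one event in total."""
    return events_for / (events_for + events_against)


def ev_metrics(c: TeamCounts) -> dict:
    sf = c.goals_for + c.saved_for          # shots = goals + saved shots
    sa = c.goals_against + c.saved_against
    ff = sf + c.missed_for                  # fenwick = shots + missed shots
    fa = sa + c.missed_against
    cf = ff + c.blocked_for                 # corsi = fenwick + blocked shots
    ca = fa + c.blocked_against
    sh_pct = c.goals_for / sf               # assumed standard definition
    sv_pct = 1 - c.goals_against / sa       # assumed standard definition
    return {
        "SF": sf, "SA": sa, "SHOT%": share(sf, sa),
        "SH%": sh_pct, "SV%": sv_pct, "PDO": sh_pct + sv_pct,
        "FF": ff, "FA": fa, "F%": share(ff, fa),
        "CF": cf, "CA": ca, "C%": share(cf, ca),
    }
```

The percentages are returned as fractions between 0 and 1; multiply by 100 if you prefer them displayed as percentages.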

## Wednesday, February 16, 2011

### Shots, Fenwick and Corsi

From time to time, I'll find myself surfing aimlessly throughout the hockey blogging world in search of articles, discussions, and other interesting stuff. In doing so, I'll occasionally find that others have referenced or linked to my blog. Typically, the link or reference will relate to the even strength data that I've been publishing periodically throughout the season.

There seems to be a decided preference for EV tied and EV close data over the raw numbers. That makes sense - the raw data is subject to score effects, which makes the information less valuable with respect to distinguishing good teams from bad ones.

Interestingly, however, there doesn't appear to be a general agreement as to which of shot, fenwick and corsi percentage serves as the best metric to use once score effects have been controlled for. While fenwick seems to be the most popular, there are some who like corsi, and there are even a few who prefer shot percentage over both. This raises the question: which of the three measures ought to be looked to for the purpose of team evaluation?

As Gabe Desjardins once correctly observed, there is a stronger relationship between fenwick and winning percentage than there is between corsi and winning or between shot differential and winning. In fact, according to Gabe's numbers, the correlation between corsi and winning percentage was about the same as the correlation between shot differential and winning percentage, even though including blocked shots substantially increased the sample size. The upshot is that the inclusion of blocked shots in the analysis doesn't add much information.

Gabe's discovery may account for the slight preference towards fenwick discussed above.

However, the weaker relationship between corsi and winning can be partially accounted for by score effects. In particular, the trailing team does better in terms of corsi than it does with respect to either shot percentage or fenwick.

As such, while overall corsi has a lower correlation with winning than overall fenwick, the same may not hold with respect to score tied corsi and score tied fenwick.

In an attempt to resolve this issue, I performed a series of calculations, the results of which have been posted below.

This table shows the split-half reliabilities for score tied corsi, score tied fenwick and score tied shot percentage. The split-half reliabilities for each variable were calculated by randomly selecting 40 games, randomly selecting an independent group of 40 games (that is, a game chosen in one group was necessarily excluded from the other), and using the two data sets to determine the correlations for each variable. This was repeated 1000 times, with the above table showing the average values.
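The resampling procedure just described might look something like the sketch below, written with only the Python standard library. The data layout (a dict mapping each team to a list of per-game `(events_for, events_against)` tuples recorded with the score tied), the function names, and the decision to pool events within each 40-game group are all assumptions on my part rather than a record of the exact calculation.

```python
import random
from statistics import mean


def pearson(xs, ys):
    """Pearson correlation between two equal-length sequences."""
    mx, my = mean(xs), mean(ys)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    return sxy / (sxx * syy) ** 0.5


def pooled_share(games):
    """Events for / (for + against), summed over a group of games."""
    total_for = sum(g[0] for g in games)
    total_against = sum(g[1] for g in games)
    return total_for / (total_for + total_against)


def split_half_reliability(team_games, half_size=40, trials=1000, seed=0):
    """Average correlation between two disjoint random groups of games."""
    rng = random.Random(seed)
    correlations = []
    for _ in range(trials):
        first, second = [], []
        for games in team_games.values():
            # Sampling without replacement keeps the two groups disjoint.
            picked = rng.sample(games, 2 * half_size)
            first.append(pooled_share(picked[:half_size]))
            second.append(pooled_share(picked[half_size:]))
        correlations.append(pearson(first, second))
    return mean(correlations)
```

Each team needs at least `2 * half_size` games in its list, otherwise `rng.sample` raises a `ValueError`.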

Not surprisingly, corsi is more reliable than either fenwick or shot ratio at the half-season level, which is a product of the fact that there are simply more corsi events than fenwick or shot events in our sample. Thus, corsi should prima facie be considered the superior metric of the three due to its superior reliability.

Ignore the goal ratio column for now - it's only been included for the purpose of performing a subsequent calculation.

This table shows the predictive validity of the same three variables with respect to overall goal ratio. Here, predictive validity was determined by randomly selecting 40 games, calculating each team's score tied corsi, fenwick and shot percentage within that sample, and looking at how each variable correlated with overall goal ratio in an independently selected 40 game sample. As with the first table, the numbers here are the averaged values over 1000 trials.
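As a rough illustration of the predictive validity calculation described above, here is a standard-library sketch. The `Game` record, its field names, and the definition of goal ratio as goals for / (goals for + goals against) are assumptions made for the example.

```python
import random
from statistics import mean
from typing import NamedTuple


class Game(NamedTuple):
    """One team's numbers for one game (hypothetical layout)."""
    tied_for: int        # score tied events for (corsi, fenwick or shots)
    tied_against: int
    goals_for: int       # overall even strength goals
    goals_against: int


def pearson(xs, ys):
    """Pearson correlation between two equal-length sequences."""
    mx, my = mean(xs), mean(ys)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    return sxy / (sxx * syy) ** 0.5


def predictive_validity(team_games, sample_size=40, trials=1000, seed=0):
    """Average correlation between the score tied metric in one random
    sample of games and overall goal ratio in a disjoint sample."""
    rng = random.Random(seed)
    correlations = []
    for _ in range(trials):
        predictor, outcome = [], []
        for games in team_games.values():
            picked = rng.sample(games, 2 * sample_size)
            sample_a = picked[:sample_size]
            sample_b = picked[sample_size:]
            tf = sum(g.tied_for for g in sample_a)
            ta = sum(g.tied_against for g in sample_a)
            gf = sum(g.goals_for for g in sample_b)
            ga = sum(g.goals_against for g in sample_b)
            predictor.append(tf / (tf + ta))
            outcome.append(gf / (gf + ga))
        correlations.append(pearson(predictor, outcome))
    return mean(correlations)
```

Running the function once per metric (corsi, fenwick, shots) on the same game list would give one value for each column.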

The predictive validity of each variable is commensurate with its reliability co-efficient, with corsi having the most predictive validity. In other words, a team's score tied corsi over a 40 game sample is a better indicator of how it will perform over the remainder of its schedule than is score tied fenwick or score tied shot percentage.

Of course, the fact that corsi has the most predictive validity in practice doesn't necessarily mean that it serves as the best measure of team skill in theory. As discussed in a previous post, the observed correlation between two variables is contingent upon the reliability with which each variable can be measured. Fortunately, there exists a formula that can be used to calculate what the correlation between two variables would be if each could be measured with perfect reliability. That formula involves dividing the observed correlation by the square root of the product of the two variables' reliability co-efficients.

$$r_{xy}^{\text{adjusted}} = \frac{r_{xy}^{\text{observed}}}{\sqrt{\text{reliability}_x \times \text{reliability}_y}}$$

As we already have the split-half reliability co-efficients for all of the variables, we only need to determine the split-half correlations between score tied corsi, score tied fenwick and score tied shot percentage, on the one hand, and goal ratio, on the other.
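The adjustment itself is a one-line calculation. A small sketch, with a hypothetical function name and made-up inputs that are purely illustrative and not taken from the tables in this post:

```python
import math


def disattenuate(r_observed: float, reliability_x: float, reliability_y: float) -> float:
    """Estimate the correlation two variables would show if both were
    measured with perfect reliability."""
    if reliability_x <= 0 or reliability_y <= 0:
        raise ValueError("reliability co-efficients must be positive")
    return r_observed / math.sqrt(reliability_x * reliability_y)


# Illustrative numbers only: an observed correlation of 0.40 between a
# variable with reliability 0.80 and one with reliability 0.50.
adjusted = disattenuate(0.40, 0.80, 0.50)
print(round(adjusted, 3))  # 0.632
```

Because the divisor is at most 1, the adjusted value is never smaller in magnitude than the observed one, and noisy reliability estimates can push it above 1.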

After inputting all of the relevant variables into the above formula, the following values are obtained:

Therefore, while corsi has more predictive validity with respect to goal ratio at the within-season level, fenwick and shot percentage appear to correlate more strongly with goal ratio over a sufficiently large sample of games. In other words, in theory, both fenwick and shot percentage seem to serve as better measures of team quality than corsi does.

One caveat: the differences between the values here are small, and we only have three seasons of data. It may very well be that all three variables correlate equally well with goal ratio over the long run. This subject may require further study in the future when more data is available.

## Friday, February 11, 2011

### EV Data for Games 1-820

[If you're having difficulty viewing the document, click here to view the spreadsheet directly at googledocs.]

Notes

The document contains three worksheets. The first sheet shows even strength data for all situations. The second shows even strength data for when the score was close (i.e. whenever the score margin was 1 or 0 in the first two periods, or tied in the third period or overtime). The last sheet contains data for when the score was tied.

Empty net goals have been removed from the data.

I didn't make an adjustment for schedule difficulty this go around because - and this is embarrassing - when I recently performed a system restore, I forgot to transfer that particular file to my external hard drive. Having said that:

1. There seems to be more interest in the raw numbers.

2. The schedule adjustment would be negligible for most teams at this point in the season.

Missing Games List

Game 124 - WSH@CAR

Game 429 - ATL@NYI

Abbreviations

GF: goals for

GA: goals against

SF: shots for, where shots = goals + saved shots

SA: shots against

SHOT%: shots for/(shots for + shots against)

SH%: shooting percentage

SV%: save percentage

PDO: shooting percentage + save percentage

FF: fenwick for, where fenwick = shots + missed shots

FA: fenwick against

F%: fenwick for/ (fenwick for + fenwick against)

CF: corsi for, where corsi = shots + missed shots + blocked shots

CA: corsi against

C%: corsi for/ (corsi for + corsi against)
