Handling Outliers with R

Recently, I attended a presentation where the following graph was shown illustrating the response to stimulation with Thalidomide among a cohort of HIV-1 patients. The biomarker used to measure the response in this case was TNF (tumor necrosis factor), and the response was measured at four time points: the time of drug administration and 3, 11 and 23 weeks after administration. This graph appears in a published article, which I won’t cite directly because, except for this one statistical problem, the research is outstanding.

[Figure TV_outlier_2: TNF response by visit in the Thalidomide group, with the questionable week-23 value highlighted]

Part of the problem in this graph comes from the software being used. Clearly not R and clearly not ggplot. It comes from one of the highly automated programs that assumes you, the scientist, are a total jerk and incapable of specifying a graph on your own. I have highlighted the problem I saw. That very high value in week 23 seems questionable and could be distorting the summary measures that are being used in the inferences in the article. The following table of the data gives some detail to the problem:

[Table cases_table: TNF values by patient and visit]

Besides the large number of missing values, we can see that the value for Patient 25 in week 23 (visit 4), 1339.552, is 10 times higher than the next value (130.997). The fact that the two values share a number of digits also raises the suspicion that this could simply be a data entry error, or an error in translating the data from its original home in a spreadsheet to the statistical program used. (I will also not mention the name of the program, since I try to follow the English comedian Frankie Howerd’s advice: “Don’t mock the afflicted”.)

Is such a high value reasonable? And, if it is simply a data error, could the use of this value lead to the wrong treatment recommendation based on this cohort?

First Question — Is It an Outlier?

The first question that we need to address is whether this value, 1339.552, is really an outlier that needs attention and, if so, how should we modify it?

What’s an outlier? The most common definition is that it is a value that lies far away from the main body of observations for a variable and could distort summaries of the distribution of values. In practice, this translates to a value that is more than 1.5 times the IQR (interquartile range) below the first quartile or above the third quartile.
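
As a quick numeric illustration, here is a minimal sketch of the 1.5 × IQR rule in base R. The vector x is made-up data for the example, not the TNF values.

# A minimal sketch of the 1.5 * IQR rule; x is hypothetical data
x <- c(12, 15, 18, 22, 25, 30, 1340)
q <- quantile(x, c(0.25, 0.75))
fences <- c(q[[1]] - 1.5 * IQR(x), q[[2]] + 1.5 * IQR(x))
x[x < fences[1] | x > fences[2]] # values beyond the fences

## [1] 1340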

The most effective way to see an outlier is to use a boxplot. The following figure relates the parts of a boxplot to a distribution and its histogram. I have taken it from R for Data Science, the excellent book on R by Hadley Wickham and Garrett Grolemund, which is available for reading here.

[Figure: the parts of a boxplot related to a distribution and its histogram, from R for Data Science]

Boxplots are an excellent way to identify outliers and other data anomalies. We can draw them either with the base R function boxplot() or the ggplot2 geometry geom_boxplot(). Here, I am going to use the ggboxplot() function from the ggpubr package. I find that the functions from ggpubr keep me from making many mistakes in specifying parameters for the equivalent ggplot2 functions. It makes life a little simpler.

Here is the code that reads in the original data and tidies it to facilitate the graphing and subsequent analysis. (The tnf.rds file is available here.)

library(tidyverse)
library(DescTools)
library(ggpubr)
# read the wide-format data: one row per patient
tnf <- readRDS("../tnf.rds")
# reshape to long format: one row per patient visit
tnf_long <- tnf %>% 
 gather(vis0:vis4, key = "visit", value = "value") %>% 
 group_by(visit)

The next step is to look at the boxplot, which I have created with the following code:

gr_tnf_vis2 <- tnf_long %>% 
 filter(!is.na(visit) & !is.na(value)) %>% 
 ggboxplot(x = "visit", y = "value",
 palette = "aaas", 
 title = "TNF Levels by Visit - Thalidomide Group",
 xlab = "Visit",
 ylab = "TNF level",
 ggtheme = theme_gray())
suppressMessages(gr_tnf_vis2)

[Figure tnf_boxplot: TNF levels by visit, Thalidomide group]

While the graph shows three other outliers, they are not well away from the quartile boxes. However, the value at 1339.552 really stands out as an outlier.

So, what are we going to do about it?

What to Do about Outliers

Clearly, if a value has a reasonable scientific basis, it should not be modified even if it lies well outside the range of the other values for the sample or cohort. I normally refer to this as needing to pass the “compelling reason” test. There should be a compelling biological or medical reason to retain the data point.

If not, there are three commonly accepted ways of modifying outlier values.

  1. Remove the case. If you have many cases and there does not appear to be an explanation for the appearance of this value, or if the explanation is that it is in error, you can simply get rid of it (a sketch appears after this list).
  2. Assign the next value nearer to the median in place of the outlier value. Here that would be 130.997, the next lower value. You will find that this approach leaves the distribution close to what it would be without the outlier. You can use this approach if you have few cases and are trying to maintain the number of observations you do have.
  3. Calculate the mean of the remaining values without the outlier and assign that to the outlier case. While I have seen this frequently, I don’t really understand its justification, and I think it distorts the distribution more than the previous solution (also sketched below).
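
Before we settle on one of these, here is a hedged sketch of what options 1 and 3 might look like with the tnf_long data frame from above. The names tnf_opt1, mean_vis4 and tnf_opt3 are mine, for illustration only, and are not part of the published analysis.

# Option 1: drop the outlier case entirely (a sketch)
tnf_opt1 <- tnf_long %>% 
  filter(!(id == "25" & visit == "vis4"))

# Option 3: replace the outlier with the mean of the remaining vis4 values
mean_vis4 <- tnf_long %>% 
  filter(visit == "vis4", id != "25", !is.na(value)) %>% 
  pull(value) %>% 
  mean()
tnf_opt3 <- tnf_long
tnf_opt3$value[tnf_opt3$id == "25" & tnf_opt3$visit == "vis4"] <- mean_vis4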

Because we want to keep as many of the few values that we have (a commonplace concern in medical studies), I would choose the second option in this case and assign the value 130.997 to the outlier case. Here is the code that does that. Patient 7 holds the value we seek, and we are putting the value of that case into the slot for case 25.

# substitute the next lower value for the outlier
tnf2 <- tnf_long
# Patient 7 holds the next lower value (130.997)
tnf2$value[tnf2$id == "25" & tnf2$visit == "vis4"] <- 
 tnf2$value[tnf2$id == "7" & tnf2$visit == "vis4"]

The boxplot now shows that the extreme outlier has been controlled.

[Figure tnf2_box: TNF boxplot after replacing the outlier]

Problem Resolved?

Yes and No. We now have values for tnf that fit in the distribution more comfortably. However, we needed to manipulate a value to cause that to happen.

There are three conclusions I would suggest from this type of analysis:

  1. Data always needs to be checked for outliers. Without doing this, you are likely to introduce a bias that could distort the results of your study.
  2. Conduct your analysis on the data both with and without the outlier. If the results are very close, you can use the original data without too many qualms. If not, you need to follow the next recommendation even more closely.
  3. When you find an outlier, you should report it in your findings, along with everything you have done to correct for it, including differences between the analyses you conducted with and without the outlier value.

A last note is that canned statistical or data analysis software tends not to alert you to the presence and danger of outliers. You need to be especially vigilant when using these programs to avoid being lulled into believing that everything is OK when it’s not. By forcing you to write your program, R avoids this problem but does not totally eliminate it. You still need to take responsibility for going on an outlier hunt as you move through the data cleaning process.


Learning about Algorithms and Program Structure

Outline of the Problem

When learning to program, we often start with trivial exercises that don’t motivate anyone to treat programming seriously. Here I want to develop an example step-by-step that will illustrate key concepts of programming in R and do something useful. We are going to elaborate a method of searching through a list called “binary search”.

Suppose you were living in New York City back in the antediluvian times when we used phone books. In those times, the phone book for Manhattan, the principal borough of New York, was ridiculously thick, as you can see in this photo of the Yellow Pages (commercial directory) from 10 years ago. The White Pages, the directory of just names, addresses and phone numbers, was similarly thick.

[Photo manhattan_phone_book: the Manhattan Yellow Pages]

How would you find a friend’s phone number in this multiple thousand page monstrosity? Or, to put the same question in the language of computer science, how would we design an algorithm to find our friend’s number?

What’s An Algorithm?

“Algorithm” is a word that simply means a solution method, the steps we adopt to solve a problem. We often enshrine such a solution in a computer program as we will do here. However, before we do that, let’s conceptualize the problem and try to find an efficient method to solve the problem.

A Naive Approach

We could simply start at the beginning, at page 1, and leaf through page by page until we find our friend’s name. If her last name were “Zimmer”, this could take a while. Not very efficient.

Second Approach

A second way to find your friend’s name is to open the book at random and see if your friend’s name is on that page. If not, you close the book and open to a new random page. You continue on until you actually find the page you are looking for. Perhaps this is even worse than the page-by-page method, because here we could end up checking the same page multiple times.

A More Reasonable (and Common) Approach

After years of asking this question, I have found that most people start out by suggesting that you open the book in the middle and check to see if your friend’s name is on that page. There is only a very small probability that this is the case if the phone book is the size of the Manhattan book. If not, you note whether the name comes before or after the point at which you opened the book.

At this point, instead of ignorantly closing the book and trying again, you make use of the information you gained. If the name comes before the point where you opened the book, you can forget the second half of the book. You can toss that half away. Likewise, if the name comes after that point, you can toss away the first half of the book, because it cannot help you solve your problem.

You have cut your problem literally in half. Now, instead of trying to find your friend in a list of, say, 1,000 pages, you’ve only got to deal with 500 pages.

As a peripheral note, in the days when phone books were still an important tool for everyday life, professors illustrated this method by literally tearing a phone book in half. Because I can’t find one, I can’t give you a demonstration of that. However, the following link will show Prof. David Malan of Harvard doing this for his course CS50, a course I highly recommend. [LINK]

You can execute this approach repeatedly, each time cutting the problem in half, until you find the page you need. For example, in the worst case, you could search through a 1,000 page phone book in 11 iterations, always using integer values (you can’t have a fraction of a page in this system).

1,000 > 500 > 250 > 125 > 63 > 32 > 16 > 8 > 4 > 2 > 1
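
We can sanity-check that count in R: the worst case is roughly the base-2 logarithm of the number of pages.

# Number of halvings needed for a 1,000-page book
ceiling(log2(1000)) # 10 halvings, i.e., at most 11 pages examined

## [1] 10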

This is binary search, and we can implement it in a computer program.

Planning Our Binary Search Program

Pseudocode

The first phase in writing any computer program is to plan it. We write out a version of the program in our working language: English, Portuguese, whatever. Thus, before trying to translate our ideas into computer code, we are clear about what those ideas are.

Loading the Data

Our first task is to identify the list which contains the name we are looking for and read it into memory. In our case, I have created a list of 500 last names of professional American baseball players taken from the famous Lahman database of baseball statistics. [1] The list is saved in a compressed R data file, players_list.rds, so we will read that and install it in memory as players.

So, in pseudocode, we could lay out the first step as:

  1. Read file players_list.rds into memory as players

To understand how to proceed from here, let’s take a quick look at the beginning of the list:

"Victorino" "Morales" "Evans" "Grube" "Figaro" "Cruz" ...

Preprocessing

players is a character vector, a simple set of 500 family names of baseball players. To make our example a bit more concrete, we are going to seek the name “Mantle”, the name of one of my childhood heroes, Mickey Mantle of the New York Yankees.

Does players have the same characteristics as the last names in a phone book? What would we need to do to pre-process the list, to make it ready for the binary search algorithm? In a phone book, the names are in alphabetical order. This is what enables us to tear it in half and know that the half we are now working with must contain the name we seek. If I am searching for the name “Mantle” and I tear a phone book in half at the letter “N”, I know I can throw away the second half of the book (“N” – “Z”) as “Mantle” cannot be there. Binary search depends on the list being searched actually being in order.

Our list is in random order. So, our next task is to put it in alphabetical order. We can do this with the base R sort function or the arrange function within dplyr, one of the key components of the tidyverse set of packages.

Our pseudocode with both steps:

  1. Read file players_list.rds into memory as players
  2. Put players in alphabetical order with sort or arrange

Binary Search

In binary search, we have to identify the middle entry in our list. This will be the dividing point between the first and second halves of the list. To divide the list in the middle, we can divide the number of names by 2, using integer division. Integer division forces any non-integer result to the truncated number, that is, it simply drops the decimal part of the number. Whether we are rounding up or down makes little practical difference in binary searches, as we are still dividing the remaining list in half.
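
For example, R’s integer division operator %/% simply drops the decimal part:

500 %/% 2 # exact division
## [1] 250

501 %/% 2 # the 0.5 is dropped
## [1] 250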

  1. Read file players_list.rds into memory as players
  2. Put players in alphabetical order with sort or arrange
  3. Count the entries in the list and integer divide the total by 2

Now, we can check the entry we have selected and see if it is “Mantle”. If so, we’re done and we stop.

  1. Read file players_list.rds into memory as players
  2. Put players in alphabetical order with sort or arrange
  3. Count the number of entries in the list and divide the total by 2
  4. If the entry chosen equals “Mantle”, report the result and end the program.

Cutting the List Down to Size

If not, we need to find the new middle. First, we have to decide whether we are dealing with the first half or the second half of the list. If “Mantle” is later in the alphabet than the entry, we can redefine the limits of the list to be the next entry higher than the one we just looked at and the end of the list. If “Mantle” is earlier in the alphabet than the entry we examined, we would choose limits of the first entry up to the entry before the one we chose. For example, if we have a list with seven names:

Baker, Carter, George, Katz, Landis, Mantle, Taylor

we would first choose the middle name–“Katz”–and see if it equaled “Mantle”. Since it does not, we would compare “Katz” and “Mantle”. “Mantle” is later than “Katz” alphabetically, so we would define the new list as “Landis” through “Taylor”, since “Mantle” must be located there.

If, on the other hand, we had the following list:

Baker, Mantle, Nixon, Parker, Rosen, Smith, Taylor

we would start with “Parker”, the middle name, compare the name to “Mantle”, see that “Mantle” comes before “Parker” and so start the new list at the beginning “Baker” and go to “Nixon”. This defines a new step in the program, setting the new list, which is based on an “if . . . then . . . else” construct that we will see when we translate the pseudocode to R code.

  1. Read file players_list.rds into memory as players
  2. Put players in alphabetical order with sort or arrange
  3. Count the number of entries in the list and divide the total by 2
  4. If the entry chosen equals “Mantle”, report the result and end the program
  5a. If “Mantle” is greater than the chosen item, define the new list as the chosen item + 1 to the end of the list
  5b. If “Mantle” is lesser than the chosen item, define the new list as the beginning of the list up to the previously chosen item – 1

Now, we just have to keep repeating steps 3 through 5b until we reach the point where step 4 returns TRUE, that is, we have found “Mantle” and simply have to report the results.

This is then our program in pseudocode. Let’s turn it into real programming code and try it out.

Translate Program to R Code

Now, we will start to use code blocks to elaborate our program. Normally, the first block of code in a program specifies the libraries of functions that facilitate our programming. However, we won’t need any libraries for this exercise, since the objective is to focus on the thought process of creating a program. Here, we will use base R functions, that is, those functions that come with the default distribution of R from CRAN.

In this program, I will also emphasize comments. It is very important to explain your computer code to yourself and to others. The best way to do this is to put observations in your program that explain why you coded as you did. This is primarily for your benefit. When you return to your program six months after you wrote it, the comments will help you understand what you did and what choices you made. The first step in commenting is to put your pseudocode steps in your code to serve as an outline of what your program does.

Code Block 1 – Read the Data

# Read file `players_list.rds` into memory as `players`
 players <- readRDS("players_list.rds")
 # Have a look at the structure `str` of `players`
 str(players)

##  chr [1:500] "Victorino" "Morales" "Evans" "Grube" "Figaro" "Cruz" ...

As I stated above, players is a character vector containing 500 names. Technically, you need to put the entire path of folders from the working directory down to the folder where you placed “players_list.rds”. To keep things simple here, I put the data in the same directory as the program.

Also, the symbol <- in R is the primary way we indicate assignment, that is assigning some data to a name. We tend not to use =, as it can be confused with a logical test of equality, which is 2 equals signs together, ==.
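
A quick illustration of the difference:

x <- 5  # assignment: x now holds the value 5
x == 5  # logical test: does x equal 5?
## [1] TRUE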

Code Block 2 – Arrange the List in Alphabetical Order

# Put `players` in alphabetical order with `sort` 
 players <- sort(players)

In R, you can redefine an object like players to contain a transformed version of itself. In some languages, you need to give it a new name and then assign it back to its original name. In R, you can do it directly.

Code Block 3 – Find the Middle Entry

# Find the middle entry using integer division
size <- length(players)
 lower_limit <- 1 # for later use
 upper_limit <- size # for later use
 middle_entry <- size %/% 2

Code Block 4 – Test the Entry against “Mantle”

#  If the entry chosen equals "Mantle", report the result 
#  and end the program
#  Define Mickey Mantle as the `target`
 target <- "Mantle"
 finished <- FALSE # create a logical variable to mark if we are finished
 if(players[middle_entry] == target) {
   print(paste(target, "is entry number", middle_entry, "in the list."))
   finished <- TRUE
 }

When you use an “if . . . then” construction to create a logical test and you want the program to do multiple sub-steps, you need to enclose the steps in curly brackets {}. Also note that we will deal with implementing an end to the program when we construct a loop. In this block as well, I have assigned “Mantle” to the variable target, which will allow us to search for anyone without having to rewrite many lines of code whenever we have a new person to search for.

Code Block 5 – Determining the New, Half-Size List

# A. If "Mantle" is greater than the chosen item, define new list 
#      as chosen item + 1 to the end of the list
# B. If "Mantle" is lesser than the chosen item, define new list 
#      as the beginning of the list up to the previously chosen item - 1
 ifelse(target > players[middle_entry], 
                        lower_limit <- middle_entry + 1,
                        upper_limit <- middle_entry - 1)

## [1] 251

new_size <- upper_limit - lower_limit + 1 
# 1 needed to include both end points
middle_entry <- lower_limit + (new_size %/% 2) # offset from the lower limit

I am laying out the steps to define each iteration so you can follow the logic a bit better. There are faster ways to code this block, but I think you want a somewhat clearer demonstration of what is going on.

However, I have used one shortcut. ifelse is a very efficient version of an “if . . . then . . .else” construction. Here, I used it to redefine the limits of the list we still want to consider. The conditional tests if Mantle is later in the sorted list than the player we have identified as the middle_entry. If he is, we adjust the lower limit to be 1 entry higher than the player we just considered. If he comes before that player, we adjust the upper limit to be one lower than the entry we just tested. From these new limits, we can get a new size count of the list we are considering and determine a new midpoint.
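
If you have mostly seen ifelse in its usual vectorized form, note that there it returns a value rather than performing an assignment; here I am exploiting the fact that, with a single TRUE or FALSE test, only the matching branch is evaluated. A tiny illustration of the usual form:

ifelse(c(1, 4, 7) > 3, "big", "small")
## [1] "small" "big"   "big"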

players is still in memory in its original form (500 entries), but we are using progressively smaller portions of it to conduct our search for Mickey Mantle.

But, We Haven’t Finished

No, finished is still FALSE. We have only gone through our program once. Now, we have to modify the program so it will iterate through the list until it finds “Mantle”.

R, like most programming languages, can create loops to process data on an industrial basis. Loops can iterate through a series of data, such as our players list. Two varieties of loop are common: for loops and while loops. A for loop runs a set number of times, which the programmer defines through a variable; the program applies the statements within the loop to the data exactly that number of times. A while loop operates while a condition holds; the program keeps applying the loop’s programming statements to the data until the condition becomes FALSE.
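
A toy illustration of the two loop types:

# A for loop runs a fixed number of times
for (i in 1:3) print(i)

# A while loop runs until its condition becomes FALSE
count <- 1
while (count <= 3) {
  print(count)
  count <- count + 1
}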

We don’t know how many iterations our search will take until it finds “Mantle” in the list of players, so a for loop is not appropriate. However, we can use our finished variable as a condition to define the loop. We are going to put code blocks 3, 4 and 5 inside the loop as they contain the program steps we want to iterate through.

Revision of Pseudocode

Because we need to incorporate a loop into our program, let’s return to the pseudocode and build in the looping structure. It’s very common to do revisions of the pseudocode to help you understand what you intended to do when you (or someone else) examines the code at a later date.

The first three code blocks are fine. However, we next will define a loop in step 4 and create a series of sub-steps that will execute the loop. We have most of these steps defined already in 4, 5a and 5b. First, we will execute the test to see if we have found “Mantle”. If not, we will redefine the limits as we did in 5a and 5b above and cycle around until the finished flag is TRUE. Then we will print the results and a counter to show how many iterations the binary search took.

  1. Read file players_list.rds into memory as players
  2. Put players in alphabetical order with sort or arrange
  3. Count the number of entries in the list and divide the total by 2
  4. Enter a while loop that continues until finished becomes TRUE
    4a. Test to see if the middle_entry equals “Mantle”. If so, report the result and end the program
    4b. If “Mantle” is greater than the chosen item, define new lower_limit of the list as the chosen item + 1 and leave the upper_limit as it was
    4c. If “Mantle” is lesser than the chosen item, leave the lower_limit the same and change the upper_limit to the previously chosen item – 1
    4d. Calculate new size of the remaining list
    4e. Calculate a new middle_entry to test in step 4a
    4f. Update the entries tested by the binary search in the variable guesses
    4g. Repeat the loop until finished is TRUE.

Now, we can go ahead and write the code incorporating the statements from our old program structure. Note that I have repeated some of the variable initializations we stated above so this code block will run correctly without reference to code blocks 4 and 5 above.

Code Block 4 Revised – Loop Structure

# Initializing variables 
 size <- length(players)
 lower_limit <- 1 
 upper_limit <- size 
 middle_entry <- size %/% 2
 finished <- FALSE
 iter <- 1
 guesses <- middle_entry
 
 # loop
 while (!finished) { # keep repeating until finished == TRUE
   if (players[middle_entry] == target) { # passes test; go to end
     print(paste(target, "is entry number", middle_entry, "in the list."))
     print(paste("We used", iter, "iterations to find the target."))
     finished <- TRUE
   }
   else { # Needs additional iteration
     iter <- iter + 1 # increase iteration counter
     ifelse(target > players[middle_entry],               # test
                        lower_limit <- middle_entry + 1,  # TRUE
                        upper_limit <- middle_entry - 1)  # FALSE
     new_size <- upper_limit - lower_limit + 1 
     # 1 needed to include both end points
     middle_entry <- lower_limit + (new_size %/% 2)
      guesses <- c(guesses, middle_entry) # tally which indices were chosen
   }
 }

## [1] "Mantle is entry number 275 in the list."
 ## [1] "We used 9 iterations to find the target."

We Did It! Mission Accomplished

We have found my childhood hero in our list of players. It took 9 iterations to hit the name. The guesses were 250, 376, 313, 282, 266, 274, 278, 276, 275. If you run this program on your system, you will see that it runs very fast. However, it is very inefficient. I have opted for clarity over speed in every case I could so the reader can have a clearer idea of what decisions I made and how I implemented them. I also did not encapsulate our procedures in functions that make repeated uses of blocks of code easier to manage. I would normally do that.
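
To give a flavor of what that encapsulation might look like, here is a hedged sketch of the whole search wrapped in a function. The name binary_search is mine, not part of the program above; it also returns NA instead of looping forever when the target is absent, and its iteration count can differ slightly from the run above because the midpoint convention differs.

binary_search <- function(target, vec) {
  vec <- sort(vec) # binary search requires an ordered list
  lower <- 1
  upper <- length(vec)
  iter <- 0
  while (lower <= upper) {
    iter <- iter + 1
    middle <- (lower + upper) %/% 2
    if (vec[middle] == target) {
      return(list(position = middle, iterations = iter))
    } else if (target > vec[middle]) {
      lower <- middle + 1 # discard the first half
    } else {
      upper <- middle - 1 # discard the second half
    }
  }
  list(position = NA, iterations = iter) # target not in the list
}
binary_search("Mantle", players)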

How to achieve efficient, faster running code is the subject for a future lesson.

If You Want to Play with This Program

I have stored three files on GitHub in my repository MADBlog so you can manipulate the program as well. The first is the program itself as I’ve laid it out here: bin_search_program.R. The second is the 500 name player list: players_list.rds. The third is the 5,000 name player list: big_player_list.rds. Click on the file names in the list and you can download the files. If you are not familiar with GitHub, you will need to click on the “Raw” button when you open the bin_search_program.R file. The other two files will download directly. Also, if you are unfamiliar with GitHub and work with data science or programming, don’t miss out on this valuable resource, which facilitates version control and simple file sharing.

If you want to explore this code further, I can recommend two options. First, have a look at the names in the list with the command players. This will print out on your console the full list of five hundred names. Pick one and modify the target variable to that name. Then run the Code Block 4 Revised again and see how many iterations it took to find the name you chose.

The second option is to replace the 500 name list with the 5,000 name list by modifying the line of code

players <- readRDS("players_list.rds")

with the following line of code

players <- readRDS("big_player_list.rds")

and then see how many iterations it takes to find “Mantle” or any other player who interests you and appears in the list. How much more time did it take than with the short list?
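
A rough way to anticipate the answer: the worst-case number of iterations grows with the base-2 logarithm of the list size.

ceiling(log2(500))  # the 500-name list
## [1] 9

ceiling(log2(5000)) # the 5,000-name list
## [1] 13

So a list ten times as long should only cost a handful of extra iterations, which is one of the charms of binary search.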

Summary

We learned about how to structure a program and move from an idea (“binary search”) through pseudocode to actual R code. We saw how a while loop functions and we conducted a number of logical tests.

Now, you modify this code to fit your needs. Have fun.


[1] Lahman, S. (2017). Lahman’s Baseball Database, 1871–2016. http://www.seanlahman.com/baseball-archive/statistics/


A Useful Logical Shortcut in R

While dealing with a small problem in which I needed a list of odd numbers, I remembered a shortcut that I’ve used for more than 30 years to reduce the typing and complication of logical tests such as the “if…then” construct. I find this shortcut very useful.

Approach to Programming

It’s not elegant. However, my approach to programming in R (and other languages) is highly practical. I am a virologist, not a specialist in programming. I want my programs to work and to work efficiently enough to accomplish the tasks I’ve set myself. The most concise and elegant code is normally not my goal. If a simple but brutish solution to a problem (using for loops and other antiquated constructs) works and runs in a reasonable time on my dataset, I don’t see the need to spend another three hours finding a more elegant solution. I would rather move on to the next problem and solve that.

With large datasets, I’m obviously more sensitive to the efficiency of the code and I will work harder and longer to get a solution that will do the calculations I need rapidly. In the study of viruses, many of our datasets are stupendously large–multiple gigabytes–and these get some really serious attention. Our lab is working on a study of histone methylation, which is exemplary of the efforts to find an efficient solution. Even before working through the dataset, we are testing various Bioconductor packages against other web-based and client software packages to find the most practical method to execute our analysis.

Preparation for the Shortcut

Before I show you the shortcut itself, we need to return to the set of odd numbers I want to create. What is the defining characteristic of odd numbers that we can use to separate them from even numbers? Simply, odd numbers are not divisible by 2. In R, we can use the modulus operator (“%%”) to determine if a number is divisible by 2. Let’s look at a little code, because modulus is a concept that gives some people a hard time.

Let’s first try some simple modulus arithmetic. We know that 4 is even (that is, it is divisible by 2) and 5 is not. Let’s see if the modulus operator distinguishes them:

4 %% 2
## [1] 0

5 %% 2
## [1] 1

If a number is even, it will have a 0 remainder after being divided by 2. An odd number (e.g., 5) will have a remainder of 1 after being divided by 2. If we create a logical test from this property, odd numbers will have the result TRUE and even numbers FALSE.

Let’s see how the logical test would work. Let’s first try an odd number, then an even number.

27 %% 2 == 1 # test of the oddness of an odd number
## [1] TRUE

424 %% 2 == 1 # test of the oddness of an even number
## [1] FALSE

These values are logical in type as confirmed by checking the class (i.e., the type) of the statement.

class(27 %% 2 == 1)
## [1] "logical"

The Shortcut

We also know that R gives logical values a numerical equivalent: FALSE is treated as 0 and TRUE is treated as 1. This property of logical values in computer languages–and it’s almost universal across languages–enables us to perform a logical test on any number and multiply the result by that number. The result will either be 0 (if the logical test fails, that is gives the result FALSE) or the value of the number. We can see that in our examples so far.
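
You can see this coercion directly:

TRUE + TRUE # logical values become 1 and 0 in arithmetic
## [1] 2

sum(c(TRUE, FALSE, TRUE))
## [1] 2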

27 * (27 %% 2 == 1)
## [1] 27

424 * (424 %% 2 == 1)
## [1] 0

The result is either the value of the odd number or 0 in the case of the even number.

Let’s put the shortcut to work on a set of integers that we will choose at random. First, we will use runif (the Uniform Distribution) to choose a set of 10 numbers. Then, we will apply the shortcut and find out the sum of the odd numbers we have in the set. So you can check the result, I’ll print the original 10 numbers and the set showing the value of the odd numbers.

set.seed(42)      #so you can replicate my results
numbers <- as.integer(runif(10, min = 3, max = 2000))
sum(numbers * (numbers %% 2 == 1))
## [1] 8999

numbers
##  [1] 1829 1874  574 1661 1284 1039 1473  271 1315 1411

(odd_nums <- numbers[numbers %% 2 == 1])
## [1] 1829 1661 1039 1473  271 1315 1411

sum(odd_nums)
## [1] 8999

Alternative to the Shortcut

What is the shortcut equivalent to in R? The closest I can think of is a simple if conditional statement, which in our example would need to work with a for loop to calculate the sum:

odd_nums2 <- vector(mode = "integer")
 for (i in seq_along(numbers)) {
   if (numbers[i] %% 2 == 1) {
     odd_nums2 <- c(odd_nums2, numbers[i])
   }
 } 
 odd_nums2
## [1] 1829 1661 1039 1473  271 1315 1411

sum(odd_nums2)
## [1] 8999

This produces an identical result. I could have programmed the sum within the for loop. However, I wanted you to be able to see the set of numbers themselves as well as the sum.

Put Shortcut to the Speed Test

Let’s now create a really big vector of numbers, say 50,000 numbers long and apply both versions of the algorithm to the set and see how long each takes to calculate. To do this, I will use the microbenchmark package. To make the microbenchmark timing work, I will convert both algorithms to functions. To make the comparison a bit more clear, microbenchmark will calculate the sum 100 times for each algorithm. For this comparison, I am going to have the for…if structure simply calculate the sum, not store the entire sequence of odd numbers. This will lose some information, but make the timing more comparable.

suppressPackageStartupMessages(library(microbenchmark))
suppressMessages(library(tidyverse))

## Shortcut function
shortcut <- function(numvec){
   sum(numvec * (numvec %% 2 == 1))
 }
 
## For...if  structure function
forif <- function(numvec) {
   sumodds  <- 0
   for (i in seq_along(numvec)) {
     if (numvec[i] %% 2 == 1) {
       sumodds <- sumodds + numvec[i]
     }
   }
   return(sumodds)
 }
 
# get vector of numbers
 bigvec <- as.integer(runif(50000, min = 3, max = 2000))
 
## calculate sum of odd numbers with shortcut
 paste("Shortcut Total =", shortcut(bigvec))
## [1] "Shortcut Total = 25198126"

## calculate the sum of odd numbers with the for...if structure
 paste("For...if Total =", forif(bigvec))
## [1] "For...if Total = 25198126"

## calculate the timing 
 res <- microbenchmark(shortcut(bigvec), forif(bigvec), times = 100L)
 
print(res)
## Unit: milliseconds
 ##              expr    min   mean median    max       
 ##  shortcut(bigvec)  1.372  2.052  1.923  3.722
 ##     forif(bigvec) 14.191 19.036 18.021 34.338
 
print(res, unit = "relative")
## Unit: relative
 ##              expr    min    mean median    max
 ##  shortcut(bigvec)  1.000   1.000  1.000  1.000
 ##     forif(bigvec) 10.342   9.277  9.369  9.226
 
suppressMessages(autoplot(res) + 
   scale_y_continuous(breaks = seq(0, 36, by = 4)) +
   ylab("Time in milliseconds"))

[Figure shortcut_timing: autoplot of the microbenchmark results]

As you can see from the graph and the results, the shortcut saves considerable processing time. One important reason for this is that it can take advantage of R’s inherent vectorization. It can process each value in the set without having to construct a loop and test each value in a separate command. The shortcut provides approximately a 9.3 times speed advantage in processing a vector of 50,000 values.
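
A third vectorized option, which I did not benchmark above, is simple logical subsetting. I would expect it to land in the same league as the shortcut, though I have not timed it here.

# Subsetting alternative (a sketch, not part of the benchmark above)
subset_sum <- function(numvec) {
  sum(numvec[numvec %% 2 == 1]) # keep only the odd values, then sum
}
# microbenchmark(shortcut(bigvec), subset_sum(bigvec), forif(bigvec), times = 100L)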

Conclusion

We can summarize the shortcut by the following expression:

Result of logical test * numerical value

The shortcut can be used with any expression that produces a logical value (TRUE/FALSE). It can also be applied to any numerical value. For example, if I wanted to calculate the price of a ticket to a show for an elderly person in a city where persons over 65 years of age receive a discount of 30%, I could use the following expression in R:

(age > 65) * 0.30

We can now use this shortcut to calculate a ticket price for two people, one above 65 and one younger.

ages <- c(67, 45)
discount <- .30
full_price <- 40
prices <- (1 - (ages > 65) * discount) * full_price
prices

## [1] 28 40

As you can see, the 67 year old received the discount, the younger person did not. Try using this in your programming. Once you begin to apply it, you will find it a handy tool.


Last Day for Stanton Predictions

This week I’ve been looking at two models in R that attempt to predict whether Giancarlo Stanton will break Roger Maris’ mark of 61 home runs in a season. I’ve not been counting Barry Bonds’ official record of 73 home runs because it was a result of the PED juicing scandal. However, in my two posts up to now, I made a simple but serious error.

So, before detailing what the error was and how I made it, contrite apologies to Bob Carpenter of the Stan Group. I had misinterpreted a number in his Poisson estimation of the number of at bats Stanton would have left in the season. I’m very sorry Bob for any implied criticism of your model. It really works fine.

My Error

Let me first show you Carpenter’s original model from his blog post and where I goofed. The model:

sum(rbinom(1e5,
 rpois(1e5, 10 * 555 / 149),
 rbeta(1e5, 1 + 56, 1 + 555 - 56))
 >= (61 - 56)) / 1e5

[1] 0.33376

Note that all the numbers are hard-wired into the model. Sorry, Bob, but this is a bad practice. Over the last forty years of teaching statistics and quantitative analysis, I have emphasized to students the importance of naming variables.

  • Give things names and assign values to the names.

That way, you know what you are calculating and can much more easily examine other alternatives. With 10 games left in the season, his estimate makes a lot of sense. I would have liked to match it with Albert’s model with 10 games remaining, but obtaining the data for hitters as of that date is a bit daunting for the purposes of this exercise.

However, with 10 games remaining, Carpenter’s estimate of the at bats remaining for Stanton in the season is 37.2, which amounts to 3.7 per game, a reasonable average. It’s worth noting that Stanton is playing every day to maximize his chances of breaking the record and not getting days off, so 3.7 might be a bit low, since his season-long average of 3.72 assumes he probably got a day off every ten games, a value I would have to check after the season.
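
We can check the per-game rate implied by the Poisson mean directly from the numbers hard-wired into the model above:

# lambda of the Poisson draw in Carpenter's model with 10 games left
10 * 555 / 149        # expected remaining at bats
## [1] 37.24832

(10 * 555 / 149) / 10 # implied at bats per game
## [1] 3.724832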

The Final Predictions

Both models agree that as of Sunday morning, with 1 game to go, Stanton has probably left it too late to catch Maris, although he has some chance of catching up to the Babe.

Here is the Albert model with this morning’s data:

set.seed(42) # to get consistent results from run to run
source("pred-hr.R")
model <- pred_hr(hits, "Stanton", standings, 10000)
current_hrs <- model$current_HR
homers <- tibble(hrs = model$future_HR)
ggplot(homers, aes(x = hrs)) + geom_histogram(bins = 20)
print(paste("Probability of tying or breaking the record:", 
      mean(round(homers$hrs >= (61 - current_hrs), 3))))
print(paste("Probability of tying or breaking 60 (Ruth's mark):",      mean(round(homers$hrs >= (60 - current_hrs), 3))))
print(paste("Average number of home runs Stanton will hit =", 
            round(mean(homers$hrs), 3)))
print(paste("Standard deviation:", round(sd(homers$hrs), 3)))

[1] "Probability of tying or breaking the record: 0.0478"
[1] "Probability of tying or breaking 60 (Ruth's mark): 0.3156"
[1] "Average number of home runs Stanton will hit = 0.367"
[1] "Standard deviation: 0.585"

While Albert’s prediction now gives Stanton less than a 5% chance of reaching or breaking 61 (Maris’ record), he does have a 32% chance of reaching Ruth’s mark. That is, he has a 32% chance of hitting at least 1 home run today. You can see this clearly in the histogram of Albert’s model.

[Figure AlberthrSunday: histogram of simulated future home runs from Albert’s model]

Carpenter’s model echoes this prediction fairly closely. The model updated to today:

ab <- 592
hr <- 59
games_played <- 158
games_remaining <- 1
phrs <- sum(rbinom(1e5,
 rpois(1e5, games_remaining * ab / games_played),
 rbeta(1e5, 1 + hr, 1 + ab - hr))
 >= (61 - hr)) / 1e5
print(paste("Probability of tying or breaking the record:",
 round(phrs, 3)))

[1] "Probability of tying or breaking the record: 0.057"

A 5.7% chance of catching Maris. Carpenter’s model also closely matches Albert’s estimate, giving a 31.4% chance that Stanton will catch Babe Ruth.

We can but hope. As we say in Brazil, “esperança é a última a morrer” (Hope is the last thing to die).

Going, going, … 2

In several posts this month, I have highlighted a new version of the statistics course we offer in São Paulo. It is an introductory course that uses R as the key tool for teaching how to build the kinds of quantitative analyses that matter in academic, scientific and business settings.

Unfortunately, we have had little response, and we are not sure we have an audience for this session of the course. If we do not see an indication of interest, we will need to focus on other projects.

But this project is important to me, and I believe it would be of great value to the students.

If you are interested in the course described here: Ementa — Matéria de Análise de Dados, please get in touch so that we can finalize the details and start the course.

If you have any other thoughts, share them in the comments below.

Thank you.


Hold the Phones . . . Stanton’s Alive Again

Sometimes, events move faster than we predict them. This is one of the things that makes statistics as much of an art as a science. Last night, Giancarlo Stanton hit not 1 but 2 home runs. He is now tied with Babe Ruth’s old record of 59 home runs in a single season (the mark Ruth set in 1921 before raising it to 60). So, my fairly negative prediction of last night needs a drastic revision. He only needs 2 more dingers to equal Maris’ mark, and he still has 6 games to do it.

As I write this, I’m waiting for Sports Illustrated’s site to update its hitting database (the source of numbers for Jim Albert’s model). However, I’ve been thinking more about Bob Carpenter’s concise model and its Poisson estimation for the number of at bats that Stanton is likely to have. Here is the Carpenter model updated with last night’s numbers.

ab <- 583
hr <- 59
games <- 156
phrs <- sum(rbinom(1e5,
     rpois(1e5, 10 * ab / games),
     rbeta(1e5, 1 + hr, 1 + ab - hr)) >= (61 - hr)) / 1e5
print(paste("Probability of tying or breaking the record:",
     round(phrs, 3)))

[1] "Probability of tying or breaking the record: 0.888"

In this model, the probability of success (tying or breaking the record) has exploded to 0.888. This has a lot to do with the number of at bats: the Poisson simulation of the number of trials (the second parameter of the binomial estimator, ‘rbinom’). Let’s look at this value across the 100,000 simulations that Carpenter makes:

new_abs <- rpois(1e5, 10 * ab / games)
print(paste("Estimated number of remaining at bats:",
    round(mean(new_abs), 2)))

## [1] "Estimated number of remaining at bats: 37.38"

summary(new_abs)

 Min.  1st Qu.  Median  Mean  3rd Qu.    Max. 
13.00   33.00   37.00   37.38   41.00   67.00 

# Show histogram of at bat distribution

suppressPackageStartupMessages(library(ggpubr))
nab <- tibble(new_abs)
gghistogram(nab, x = "new_abs", binwidth = 1, add = "mean",
     add_density = TRUE, ggtheme = theme_gray(), fill = "gray",
     title = "Distribution of New At Bats for Stanton",
     xlab = "New At Bats")
[Figure newatbats: distribution of simulated new at bats for Stanton]

This simulation suggests that Stanton will have more or less 37 at bats remaining in six games. That’s an average of about 6 at bats per game. I’ve never heard of a series of six games either so long (in terms of extra innings) or so high scoring for a mediocre team like the Marlins that batters would average 6 at bats per game.

Also, it is important to remember that we are measuring at bats here, not plate appearances. Plate appearances are the actual number of times the batter comes to the plate to bat. At bats reduce that number, as they do not count bases on balls, hit by pitch and other categories.

Let us try a revised version of the Carpenter model that limits the at bats to a more realistic 4 per game, or 24 for the remainder of the season. Here we see that the probability of tying or breaking the record is a bit reduced to 0.714.

at_bats_game <- 4
phrs2 <- sum(rbinom(1e5, 
             at_bats_game * (162 - games),
             rbeta(1e5, 1 + hr, 1 + ab - hr))
       >= (61 - hr)) / 1e5
print (paste("Revised prob. of tying or breaking the record:", round (phrs2, 3)))

## [1] "Revised prob. of tying or breaking the record: 0.714"

This probability is still quite high and not very believable given some basic feel for the game. As well, the probability of hitting a home run on any given at bat (the ‘rbeta’ estimate of the probability of success in any binomial trial) is overestimated. Assigning the parameters α (alpha) and β (beta) to the beta distribution requires some more experimentation, although (1 + hr) and (1 + ab – hr) are reasonable starting points. Many Bayesian tutorials would recognize these as good estimates for a true Bayesian model, one that calculates a posterior probability based on a prior probability and evidence (how many home runs Stanton has hit so far in x at bats and y games). However, this is not a true Bayesian model. The ‘rbeta’ function is calculating an initial probability, not a posterior. It is merely the p term in the ‘rbinom’ function.

‘rbinom’ requires three parameters as shown below:

rbinom(simulations, trials, probability)

Simulations is the number of times we are going to repeat the experiment. In this case, it means how many times are we going to simulate Stanton’s 24 remaining at bats. Trials is the number of at bats in each experiment. Probability is the probability of success in any one trial. There is no second, posterior probability in this model. While there is no overriding necessity to use the rbeta function, it turns out it performs fairly well. Let’s compare it to using a frequentist approach and substituting Stanton’s real home run rate (close to 1 home run per 10 at bats) for the rbeta function.
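
To make the three parameters concrete, here is a toy call using the 24 remaining at bats from above and roughly Stanton’s 1-in-10 home run rate. The output is a random draw, so I won’t show it.

# Five simulated ends of season: 24 at bats each,
# with roughly a 10% chance of a home run per at bat
rbinom(5, 24, 0.10)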

phrs3 <- sum(rbinom(1e5,
             at_bats_game * (162 - games),
             hr / ab)
          >= (61 - hr)) / 1e5
print(paste("Revised prob. of tying or breaking the record:", round(phrs3, 3)))

[1] "Revised prob. of tying or breaking the record: 0.715"

As we can see, the rbeta gives us virtually the same answer as Stanton’s existing home run hitting rate.

Let’s now look at the Albert model and see if it has become more optimistic about Stanton’s chances. Albert’s estimate has increased to 23.6% from 3%. So, the two new homers have had a significant effect on its view of Stanton’s chances. However, over its 10,000 simulations, it suggests that Stanton will, on average, hit only one additional homer, leaving him in limbo without a share of the record. Given the model’s standard deviation of almost 1 home run, it allows that Stanton could tie Maris at 61. However, it is still more pessimistic than Carpenter’s model. As you can see from the graph below, the simulations in Albert’s model suggest that 0 and 1 home runs are the most likely outcomes in the 10,000 simulations. The outcomes of 2 or more account for the 23.6% cited above.

[Figure Albertv3: distribution of simulated home run outcomes in Albert’s model]

R is an amazing tool for conducting these analyses. To do the number of variations I’ve experimented with while writing this post, I’ve modified Jim Albert’s code somewhat. Since he wrote it clearly, it was a piece of cake to mold it to my purposes (largely to avoid wasting time scraping the web on each variation I tried since once the data is downloaded, it’s only going to be updated after the next game).

So, 23.6% or 71.5% probability of tying or breaking Maris’ record. Which is correct? I lean toward Albert’s analysis. As the race between Maris and Mickey Mantle in 1961 showed, there are many factors, physical and psychological, that affect players’ performance at the end of a six-month season.

Let’s see what happens Sunday. Until then, if you want to see the R code I’ve used for this, have a look at this file: Stanton_HR_Prediction_v2a.

Going, Going . . . 1

Two posts today with similar themes: time is running out. First, time is running out for Giancarlo Stanton. His bat has been very silent so far this week. The Marlins have 7 more games, and he still needs 4 dingers to reach Roger Maris’ old home run record. I have updated the Albert and Carpenter models. Stanton’s probability of tying the record has fallen in three days from around 20% to 3% according to Jim Albert’s model. Strangely, the Carpenter model continues to predict a 50% probability of tying or breaking the record.

Looking at Carpenter’s model in greater detail, the Poisson estimation of the number of remaining at bats seems optimistic and appears to be driving the estimate higher because it gives Stanton more chances to hit home runs. A rule of thumb in baseball is that a batter will get around 4 at bats per game. With 7 games remaining, that would be 28 at bats. Carpenter’s model gives him more.

rpois(1e5, 10 * ab / games)

There would have to be an abnormal number of very high scoring games to reach that level. His R statement (above), when today’s numbers are applied to it, suggests 37 at bats.

ab <- 579
games <- 155
new_abs <- rpois(1e5, 10 * ab / games)
print(paste("Estimated number of remaining at bats:",
 round(mean(new_abs), 2)))

[1] "Estimated number of remaining at bats: 37.35"

While hope is the last thing to die, I’m beginning to despair that Stanton will be the one. As we say when we see a home run leaving the park, Stanton’s opportunity is “going, going, …” but will it be gone?

More on Sunday, when the season comes to a close.