simple tool in R to analyze data : Graph representation, p-value, test, odds ratio, etc.

news
Author

Jumbong Junior

Published

November 16, 2023

1 Introduction

This document is primarily intended for second year students at ENSAI in general, and for those who are particularly interested in biostatistics. When we deal with data, an easy way to understand our work is to create a graph. This gives us an intuition about what we are doing. The advice to follow is to choose simplicity when using graphs and to favor those that everyone knows how to interpret. So when you create a graph, think about your audience and not your personal preferences.

There are also many concerns that we use every day to take decision and we don’t know what they mean. For example the p-value. In this document, we will go through those concepts and explain them in a simple way.

2 Presentation of data.

The data used in this document come from ParisSaclay Introduction à la statistique avec R - Session 21 if the mooc is available when you will read this document. The dataset has 799 observations and 26 variables. One observation represents a prisoner, and variables correspond to the mental health status of the prisoner.

## import library for readcsv
library(readr)

## import data
setwd("C:/Users/johns/OneDrive/Bureau/code/blog/ensai/posts/ClimateScenario")

getwd()
[1] "C:/Users/johns/OneDrive/Bureau/code/blog/ensai/posts/ClimateScenario"
smp2 <- read_delim("data/smp2.csv", delim = ";",  
    escape_double = FALSE, trim_ws = TRUE,show_col_types = FALSE)

str(smp2)
spc_tbl_ [799 × 26] (S3: spec_tbl_df/tbl_df/tbl/data.frame)
 $ age         : num [1:799] 31 49 50 47 23 34 24 52 42 45 ...
 $ prof        : chr [1:799] "autre" NA "prof.intermediaire" "ouvrier" ...
 $ duree       : num [1:799] 4 NA 5 NA 4 NA NA 5 4 NA ...
 $ discip      : num [1:799] 0 0 0 0 1 0 0 0 1 0 ...
 $ n.enfant    : num [1:799] 2 7 2 0 1 3 5 2 1 2 ...
 $ n.fratrie   : num [1:799] 4 3 2 6 6 2 3 9 12 5 ...
 $ ecole       : num [1:799] 1 2 2 1 1 2 1 2 1 2 ...
 $ separation  : num [1:799] 0 1 0 1 1 0 1 0 1 0 ...
 $ juge.enfant : num [1:799] 0 0 0 0 NA 0 1 0 1 0 ...
 $ place       : num [1:799] 0 0 0 1 1 0 1 0 0 0 ...
 $ abus        : num [1:799] 0 0 0 0 0 0 0 0 1 1 ...
 $ grav.cons   : num [1:799] 1 2 2 1 2 1 5 1 5 5 ...
 $ dep.cons    : num [1:799] 0 0 0 0 1 0 1 0 1 0 ...
 $ ago.cons    : num [1:799] 1 0 0 0 0 0 0 0 0 0 ...
 $ ptsd.cons   : num [1:799] 0 0 0 0 0 0 0 0 0 0 ...
 $ alc.cons    : num [1:799] 0 0 0 0 0 0 0 0 1 1 ...
 $ subst.cons  : num [1:799] 0 0 0 0 0 0 1 0 1 0 ...
 $ scz.cons    : num [1:799] 0 0 0 0 0 0 0 0 0 0 ...
 $ char        : num [1:799] 1 1 1 1 1 1 1 1 4 1 ...
 $ rs          : num [1:799] 2 2 2 2 2 1 3 2 3 2 ...
 $ ed          : num [1:799] 1 2 3 2 2 2 3 2 3 2 ...
 $ dr          : num [1:799] 1 1 2 2 2 1 2 2 1 2 ...
 $ suicide.s   : num [1:799] 0 0 0 1 0 0 3 0 4 0 ...
 $ suicide.hr  : num [1:799] 0 0 0 0 0 0 1 0 1 0 ...
 $ suicide.past: num [1:799] 0 0 0 0 1 0 1 0 1 0 ...
 $ dur.interv  : num [1:799] NA 70 NA 105 NA NA 105 84 78 60 ...
 - attr(*, "spec")=
  .. cols(
  ..   age = col_double(),
  ..   prof = col_character(),
  ..   duree = col_double(),
  ..   discip = col_double(),
  ..   n.enfant = col_double(),
  ..   n.fratrie = col_double(),
  ..   ecole = col_double(),
  ..   separation = col_double(),
  ..   juge.enfant = col_double(),
  ..   place = col_double(),
  ..   abus = col_double(),
  ..   grav.cons = col_double(),
  ..   dep.cons = col_double(),
  ..   ago.cons = col_double(),
  ..   ptsd.cons = col_double(),
  ..   alc.cons = col_double(),
  ..   subst.cons = col_double(),
  ..   scz.cons = col_double(),
  ..   char = col_double(),
  ..   rs = col_double(),
  ..   ed = col_double(),
  ..   dr = col_double(),
  ..   suicide.s = col_double(),
  ..   suicide.hr = col_double(),
  ..   suicide.past = col_double(),
  ..   dur.interv = col_double()
  .. )
 - attr(*, "problems")=<externalptr> 

For our analysis, we will use four variables :

  • age : numerical variable (the age of the prisoner in years)
  • n.enfant : numerical variable (the number of children)
  • dep.cons : Existence of a depressive disorder (0 = no, 1 = yes)
  • prof : prisoner’s profession (1 = unemployed, 2 = farmer, 3 = independent, 4 = executive, 5 = employee, 6 = worker, 7 = retired, 8 = student).

The first thing we want to do is to analyze the distribution of each variable.

3 Distribution of categorical variables: profession.

3.1 Barplot

The graphic tool allow us to represent a distribution of categorical variables is the barplot. Before we built the distribution of variable, we have an amazing tool in our hands to summarize the data. It is the table function. This function allows us to count the number of occurrences of each category in a variable.

## count the number of occurrences of each category in a variable
table(smp2$prof)

       agriculteur            artisan              autre              cadre 
                 6                 90                 31                 24 
           employe            ouvrier prof.intermediaire        sans emploi 
               135                227                 58                222 

The result of the table function is a vector. We can use this vector to create a barplot.

## create a barplot
barplot(table(smp2$prof), main = "Distribution of profession", xlab = "Profession", ylab = "Frequency")

3.2 Pie chart

Another way to represent the distribution of the category variable is the pie chart. The pie chart is a circle divided into sectors. Each sector represents a category of the variable. The size of the sector is proportional to the frequency of the category. Statistics don’t like the pie chart because it is difficult to compare the size of the sectors. However, it is a good way to represent the distribution of a variable when the number of categories is small.

## create a pie chart and put the legend on the right and the percentage
pie(table(smp2$dep.cons), main = "Distribution of depressive disorder", col = c("red", "blue"), labels = c("No", "Yes"), cex = 0.8)

4 Distribution of numerical variables: age

4.1 Histogram

The histogram is a graphical representation of the distribution of a numerical variable. Unlike barplot, We don’t need to use the table function to create an histogram.

## create an histogram
hist(smp2$age, main = "Distribution of age", xlab = "Age", ylab = "Frequency")

We can also plot in the same graphic the density of the variable. The density is the probability of the variable.

## create an histogram with density
hist(smp2$age, main = "Distribution of age", xlab = "Age", ylab = "Density", freq = FALSE)
lines(density(smp2$age,na.rm =T), col = "red")

4.2 Boxplot

The boxplot is a graphical representation of the distribution of a numerical variable. The boxplot is composed of a box and two whiskers. The box represents the interquartile range (IQR). The whiskers represent the minimum and maximum values of the variable. The boxplot is a good way to detect outliers in the data.

## create a boxplot
boxplot(smp2$age, main = "Distribution of age", xlab = "Age", ylab = "Frequency")

4.3 Association between categorical and numerical variables

The boxplot is a good way to represent the distribution of a numerical variable according to a categorical variable.

## create a boxplot
boxplot(smp2$age ~ smp2$dep.cons, main = "Distribution of age according to depressive disorder", xlab = "Depressive disorder", ylab = "Age")    

5 Association between two numerical variables

When you deal with two numerical variables.The best way to represent the association between them is the scatter plot.

## create a scatter plot
plot(smp2$age, smp2$n.enfant, main = "Association between age and number of children", xlab = "Age", ylab = "Number of children")

As you see in this graphic, it seem like all the points are not represented. This is because the number of point is too large. R have a good function to deal with the problems : jitter. This function add a little noise to the data to avoid the overplotting.

## create a scatter plot
plot(jitter(smp2$age), jitter(smp2$n.enfant), main = "Association between age and number of children", xlab = "Age", ylab = "Number of children")

6 Association between two categorical variables

The best way to represent the association between two categorical variables is to use a contingency table. A contingency table displays the frequency or count of observations that fall into each combination of the categories of the two variables. We can use the table of contingency to create a barplot.

## create a contingency table
table(smp2$dep.cons, smp2$prof)
   
    agriculteur artisan autre cadre employe ouvrier prof.intermediaire
  0           4      62    22    16      79     131                 31
  1           2      28     9     8      56      96                 27
   
    sans emploi
  0         133
  1          89
## create a barplot
barplot(table(smp2$dep.cons, smp2$prof), main = "Association between depressive disorder and profession", xlab = "Depressive disorder", ylab = "Frequency")

8 Temporal evolution of a numerical variable

I will finish by showing how to represent the evolution of the mean of a numerical variable over time. This is very useful In biostatistic When we want to show the evolution of a disease over time. I will use the package gplots to create the graph. You can use ggplot2 to create the same graph. We will use a temporal data outils_hdrs.csv. The data have 3 variables : - NUMERO : the id of the patient - VISIT : the visit of the patient - HDRS : the score of the patient

hdrs <- read_csv2("data/outils_hdrs.csv")
ℹ Using "','" as decimal and "'.'" as grouping mark. Use `read_delim()` for more control.
Rows: 1053 Columns: 3
── Column specification ────────────────────────────────────────────────────────
Delimiter: ";"
dbl (3): NUMERO, VISIT, HDRS

ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
str(hdrs)
spc_tbl_ [1,053 × 3] (S3: spec_tbl_df/tbl_df/tbl/data.frame)
 $ NUMERO: num [1:1053] 96 96 96 96 96 96 96 96 157 157 ...
 $ VISIT : num [1:1053] 0 4 7 14 21 28 42 56 0 4 ...
 $ HDRS  : num [1:1053] 34 26 12 7 5 1 1 1 27 19 ...
 - attr(*, "spec")=
  .. cols(
  ..   NUMERO = col_double(),
  ..   VISIT = col_double(),
  ..   HDRS = col_double()
  .. )
 - attr(*, "problems")=<externalptr> 
# install packages

#install.packages("gplots")  

library(gplots)  

### Representation of the evolution of the mean of the HDRS score over time
plotmeans(hdrs$HDRS~hdrs$VISIT,gap=0,barcol="black",main="Evolution moyenne du HDRS dans le temps",xlab="Visite",ylab="HDRS moyen")

To compute the same graph with ggplot2, you need to compute the mean, the first quartile and the third quartile of the HDRS score for each visit.

library(dplyr)
hdrs_mean <- hdrs %>% 
  group_by(VISIT) %>%
  summarise(mean = mean(HDRS,na.rm =T),
            lower = quantile(HDRS,0.25,na.rm = T),
            upper = quantile(HDRS,0.75,na.rm = T))
head(hdrs_mean)
# A tibble: 6 × 4
  VISIT  mean lower upper
  <dbl> <dbl> <dbl> <dbl>
1     0  27.9  25      30
2     4  25.2  23      28
3     7  20.6  16      25
4    14  16.7  12      22
5    21  14.3   8      20
6    28  11.5   6.5    16

Now we can create the graph with ggplot2.

library(ggplot2)
ggplot(hdrs_mean, aes(x = VISIT, y = mean)) +
  geom_point(size = 3) +
  geom_line(size = 0) +
  geom_errorbar(aes(ymin=lower, ymax=upper), width=1)+
  scale_x_continuous(expand=c(0,0), breaks=c(0,4,7,14,21,28,42,56)) +
  geom_vline(xintercept = 21, linetype="dashed",
               color = "black", size=1) +
    theme_classic() +
    theme(axis.line.x = element_line(color="black", size = 0.5),
          axis.line.y = element_line(color="black", size = 0.5)) +
    theme(axis.text=element_text(size=12, face="bold"),
          legend.title=element_blank(), legend.key = element_blank()) +
    theme(axis.text=element_text(size=12), axis.title=element_text(size=12)) +
    xlab("VISIT")+
    ylab("The mean score of the patient") +
    theme(legend.position = c(1,1)) +
    theme(plot.margin = margin(1, 1, 1, 0.5, "cm"))

Let’s explain each part of this graphic :

  • The ggplot(hdrs_mean, aes(x = VISIT, y = mean)) + function creates the graph. The VISIT variable is on the x-axis and the mean variable is on the y-axis.
  • The geom_point(size = 3) + function creates the points of the graph.
  • The geom_line(size = 0) + function creates the line of the graph.
  • The geom_errorbar(aes(ymin=lower, ymax=upper), width=1) + function creates the error bars of the graph.
  • The scale_x_continuous(expand=c(0,0), breaks=c(0,4,7,14,21,28,42,56)) + function sets the x-axis.
  • The geom_vline(xintercept = 21, linetype=“dashed”, color = “black”, size=1) + function creates a vertical line at the 21st visit.
  • The theme_classic() + function sets the theme of the graph.
  • The theme(axis.line.x = element_line(color=“black”, size = 0.5), axis.line.y = element_line(color=“black”, size = 0.5)) + function sets the color and size of the axis lines.
  • The theme(axis.text=element_text(size=12, face=“bold”), legend.title=element_blank(), legend.key = element_blank()) + function sets the size and style of the axis text.
  • The theme(axis.text=element_text(size=12), axis.title=element_text(size=12)) + function sets the size of the axis text.
  • The xlab(“VISIT”)+ ylab(“The mean score of the patient”) + function sets the labels of the x-axis and y-axis.
  • The theme(legend.position = c(1,1)) + function sets the position of the legend.
  • The theme(plot.margin = margin(1, 1, 1, 0.5, “cm”)) + function sets the margins of the graph.

9 Conclusion

I will finish here And I hope that you will enjoy it. I will continue to write about the p-value, the test, the odds ratio, etc. in the next document.