---
title: "Introduction to tableone"
author: "Kazuki Yoshida"
date: "2020-07-25"
output: rmarkdown::html_vignette
vignette: >
  %\VignetteIndexEntry{Introduction to tableone}
  %\VignetteEngine{knitr::rmarkdown}
  \usepackage[utf8]{inputenc}
---

```{r, message = FALSE, tidy = FALSE, echo = F}
## Create a header using devtools::use_vignette("my-vignette")
## knitr configuration: https://yihui.name/knitr/options#chunk_options
library(knitr)
showMessage <- FALSE
showWarning <- TRUE
set_alias(w = "fig.width", h = "fig.height", res = "results")
opts_chunk$set(comment = "", error= TRUE, warning = showWarning, message = showMessage,
               tidy = FALSE, cache = F, echo = T,
               fig.width = 7, fig.height = 7)

## R configuration
options(width = 116, scipen = 5)
```

## What is tableone?

The tableone package is an R package that eases the construction of "Table 1", *i.e.*, patient baseline characteristics table commonly found in biomedical research papers. The packages can summarize both continuous and categorical variables mixed within one table. Categorical variables can be summarized as counts and/or percentages. Continuous variables can be summarized in the "normal" way (means and standard deviations) or "nonnormal" way (medians and interquartile ranges).


## Load packages

```{r}
## tableone package itself
library(tableone)
## survival package for Mayo Clinic's PBC data
library(survival)
data(pbc)
```

## Single group summary

### Simplest use case

The simplest use case is summarizing the whole dataset. You can just feed in the data frame to the main workhorse function CreateTableOne(). You can see there are 418 patients in the dataset.
```{r}
CreateTableOne(data = pbc)
```

### Categorical variable conversion

Most of the categorical variables are coded numerically, so we either have to transform them to factors in the dataset or use factorVars argument to transform them on-the-fly. Also it's a better practice to specify which variables to summarize by the vars argument, and exclude the ID variable(s). How do we know which ones are numerically-coded categorical variables? Please check your data dictionary (in this case help(pbc)). This time I am saving the result object in a variable.

```{r}
## Get variables names
dput(names(pbc))
## Vector of variables to summarize
myVars <- c("time", "status", "trt", "age", "sex", "ascites", "hepato",
          "spiders", "edema", "bili", "chol", "albumin", "copper", "alk.phos",
          "ast", "trig", "platelet", "protime", "stage")
## Vector of categorical variables that need transformation
catVars <- c("status", "trt", "ascites", "hepato",
             "spiders", "edema", "stage")
## Create a TableOne object
tab2 <- CreateTableOne(vars = myVars, data = pbc, factorVars = catVars)
```

OK. It's more interpretable now. Binary categorical variables are summarized as counts and percentages of the second level. For example, if it is coded as 0 and 1, the "1" level is summarized. For 3+ category variable all levels are summarized. Please bear in mind, the percentages are calculated after excluding missing values.

```{r}
tab2
```

### Showing all levels for categorical variables

If you want to show all levels, you can use showAllLevels argument to the print() method.

```{r}
print(tab2, showAllLevels = TRUE, formatOptions = list(big.mark = ","))
```

### Detailed information including missingness

If you need more detailed information including the number/proportion missing. Use the summary() method on the result object. The continuous variables are shown first, and the categorical variables are shown second.

```{r}
summary(tab2)
```

### Summarizing nonnormal variables

It looks like most of the continuous variables are highly skewed except time, age, albumin, and platelet (biomarkers are usually distributed with strong positive skews). Summarizing them as such may please your future peer reviewer(s). Let's do it with the nonnormal argument to the print() method. Can you see the difference. If you just say nonnormal = TRUE, all variables are summarized the "nonnormal" way.

```{r}
biomarkers <- c("bili","chol","copper","alk.phos","ast","trig","protime")
print(tab2, nonnormal = biomarkers, formatOptions = list(big.mark = ","))
```

### Fine tuning

If you want to fine tune the table further, please check out ?print.TableOne for the full list of options.


## Multiple group summary

Often you want to group patients and summarize group by group. It's also pretty simple. Grouping by exposure categories is probably the most common way, so let's do it by the treatment variable. According to ?pbc, it is coded as (1) D-penicillmain (it's really "D-penicillamine"), (2) placebo, and (NA) not randomized. NA's do not function as a grouping variable, so it is dropped. If you do want to show the result for the NA group, then you need to recoded it something other than NA.

```{r}
tab3 <- CreateTableOne(vars = myVars, strata = "trt" , data = pbc, factorVars = catVars)
print(tab3, nonnormal = biomarkers, formatOptions = list(big.mark = ","))
```

### Testing

As you can see in the previous table, when there are two or more groups group comparison p-values are printed along with the table (well, let's not argue the appropriateness of hypothesis testing for table 1 in an RCT for now.). Very small p-values are shown with the less than sign. The hypothesis test functions used by default are chisq.test() for categorical variables (with continuity correction) and oneway.test() for continous variables (with equal variance assumption, i.e., regular ANOVA). Two-group ANOVA is equivalent of t-test.

You may be worried about the nonnormal variables and small cell counts in the stage variable. In such a situation, you can use the nonnormal argument like before as well as the exact (test) argument in the print() method. Now kruskal.test() is used for the nonnormal continous variables and fisher.test() is used for categorical variables specified in the exact argument. kruskal.test() is equivalent to wilcox.test() in the two-group case. The column named test is to indicate which p-values were calculated using the non-default tests.

To also show standardized mean differences, use the smd option.

```{r}
print(tab3, nonnormal = biomarkers, exact = "stage", smd = TRUE)
```

## Exporting

My typical next step is to export the table to Excel for editing, and then to Word (clinical medical journals usually do not offer LaTeX submission).


### Quick and dirty way

The quick and dirty way that I used to do is copy and paste. Use the quote = TRUE argument to show the quotes and noSpaces = TRUE to remove spaces used to align text in the R console (the latter is optional). Now you can just copy and paste the whole thing to an Excel spread sheet. After pasting, click the small pasting icon to choose Use Text Import Wizard..., in the dialogue you can just click finish to fit the values in the appropriate cells. Then you can edit or re-align things as you like. I usualy center-align the group summaries, and right-aligh the p-values.

```{r}
print(tab3, nonnormal = biomarkers, exact = "stage", quote = TRUE, noSpaces = TRUE)
```

### Real export way

If you do not like the manual labor of copy-and-paste, you can potentially automate the task by the following way. The print() method for a TableOne object invisibly return a matrix identical to what you see. You can capture this by assignment to a variable (here tab3Mat). Do not use the quote argument in this case, the noSpaces argument is again optional. The self-contradictory printToggle = FALSE for the print() method avoids unnecessary printing if you wish. Then you can save the object to a CSV file. As it is a regular matrix object, you can save it to an Excel file using packages such as XLConnect.

```{r, eval = FALSE}
tab3Mat <- print(tab3, nonnormal = biomarkers, exact = "stage", quote = FALSE, noSpaces = TRUE, printToggle = FALSE)
## Save to a CSV file
write.csv(tab3Mat, file = "myTable.csv")
```

## Miscellaneous

### Categorical or continous variables-only

You may want to see the categorical or continous variables only. You can do this by accessing the CatTable part and ContTable part of the TableOne object as follows. summary() methods are defined for both as well as print() method with various arguments. Please see ?print.CatTable and ?print.ContTable for details.

```{r}
## Categorical part only
tab3$CatTable
## Continous part only
print(tab3$ContTable, nonnormal = biomarkers)
```

--------------------
- Authored by Kazuki Yoshida
- CRAN page: https://cran.r-project.org/package=tableone
- github page: https://github.com/kaz-yos/tableone