Introduction to EDA(Exploratory Data Analysis) in the dlookr package
After you have acquired the data, you should do the following:
The dlookr package makes these steps fast and easy:
This document introduces EDA(Exploratory Data Analysis) methods provided by the dlookr
package. You will learn how to EDA of tbl_df
data that inherits from data.frame and data.frame
with functions provided by dlookr
.
dlookr
increases synergy when used with the dplyr
package. Particularly in data exploration and data wrangle, it increases the efficiency of the tidyverse
package group.
Types | Tasks | Descriptions | Functions | Support DBI |
---|---|---|---|---|
categorical | summaries | frequency tables | univar_category() |
|
categorical | summaries | chi-squared test | summary.univar_category() |
|
categorical | visualize | bar charts | plot.univar_category() |
|
categorical | visualize | bar charts | plot_bar_category() |
|
numerical | summaries | descriptive statistics | describe() |
x |
numerical | summaries | descriptive statistics | univar_numeric() |
|
numerical | summaries | descriptive statistics of standardized variable | summary.univar_numeric() |
|
numerical | visualize | histogram, box plot | plot.univar_numeric() |
|
numerical | visualize | Q-Q plots | plot_qq_numeric() |
|
numerical | visualize | box plot | plot_box_numeric() |
Types | Tasks | Descriptions | Functions | Support DBI |
---|---|---|---|---|
categorical | summaries | frequency tables cross cases | compare_category() |
|
categorical | summaries | contingency tables, chi-squared test | summary.compare_category() |
|
categorical | visualize | mosaics plot | plot.compare_category() |
|
numerical | summaries | correlation coefficient, linear model summaries | compare_numeric() |
|
numerical | summaries | correlation coefficient, linear model summaries with threshold | summary.compare_numeric() |
|
numerical | visualize | scatter plot with marginal box plot | plot.compare_numeric() |
|
numerical | Correlate | correlation coefficient | correlate() |
x |
numerical | Correlate | visualization of a correlation matrix | plot_correlate() |
x |
Types | Tasks | Descriptions | Functions | Support DBI |
---|---|---|---|---|
numerical | summaries | Shapiro-Wilk normality test | normality() |
x |
numerical | summaries | normality diagnosis plot (histogram, Q-Q plots) | plot_normality() |
x |
Target Variable | Predictor | Descriptions | Functions | Support DBI |
---|---|---|---|---|
categorical | categorical | contingency tables | relate() |
x |
categorical | categorical | mosaics plot | plot.relate() |
x |
categorical | numerical | descriptive statistic for each levels and total observation | relate() |
x |
categorical | numerical | density plot | plot.relate() |
x |
categorical | categorical | bar charts | plot_bar_category() |
|
numerical | categorical | ANOVA test | relate() |
x |
numerical | categorical | scatter plot | plot.relate() |
x |
numerical | numerical | simple linear model | relate() |
x |
numerical | numerical | box plot | plot.relate() |
x |
categorical | numerical | Q-Q plots | plot_qq_numeric() |
|
categorical | numerical | box plot | plot_box_numeric() |
Types | Descriptions | Functions | Support DBI |
---|---|---|---|
reporting the information of EDA into pdf file | reporting the information of EDA. | eda_report() |
x |
reporting the information of EDA into html file | reporting the information of EDA. | eda_report() |
x |
To illustrate the basic use of EDA in the dlookr package, I use a Carseats
datasets. Carseats
in the ISLR
package is simulation dataset that sells children’s car seats at 400 stores. This data is a data.frame created for the purpose of predicting sales volume.
library(dlookr)
library(dplyr)
library(ggplot2)
library(flextable)
library(rmarkdown)
glimpse(ISLR::Carseats)
Rows: 400
Columns: 11
$ Sales <dbl> 9.50, 11.22, 10.06, 7.40, 4.15, 10.81, 6.63, 11.…
$ CompPrice <dbl> 138, 111, 113, 117, 141, 124, 115, 136, 132, 132…
$ Income <dbl> 73, 48, 35, 100, 64, 113, 105, 81, 110, 113, 78,…
$ Advertising <dbl> 11, 16, 10, 4, 3, 13, 0, 15, 0, 0, 9, 4, 2, 11, …
$ Population <dbl> 276, 260, 269, 466, 340, 501, 45, 425, 108, 131,…
$ Price <dbl> 120, 83, 80, 97, 128, 72, 108, 120, 124, 124, 10…
$ ShelveLoc <fct> Bad, Good, Medium, Medium, Bad, Bad, Medium, Goo…
$ Age <dbl> 42, 65, 59, 55, 38, 78, 71, 67, 76, 76, 26, 50, …
$ Education <dbl> 17, 10, 12, 14, 13, 16, 15, 10, 10, 17, 10, 13, …
$ Urban <fct> Yes, Yes, Yes, Yes, Yes, No, Yes, Yes, No, No, N…
$ US <fct> Yes, Yes, Yes, Yes, No, Yes, No, Yes, No, Yes, Y…
The contents of individual variables are as follows. (Refer to ISLR::Carseats Man page)
When data analysis is performed, data containing missing values is often encountered. However, Carseats is complete data without missing. Therefore, the missing values are generated as follows. And I created a data.frame object named carseats.
describe()
computes descriptive statistics for numerical data. The descriptive statistics help determine the distribution of numerical variables. Like function of dplyr, the first argument is the tibble (or data frame). The second and subsequent arguments refer to variables within that data frame.
The variables of the tbl_df
object returned by describe()
are as follows.
n
: number of observations excluding missing valuesna
: number of missing valuesmean
: arithmetic averagesd
: standard deviationse_mean
: standard error mean. sd/sqrt(n)IQR
: interquartile range (Q3-Q1)skewness
: skewnesskurtosis
: kurtosisp25
: Q1. 25% percentilep50
: median. 50% percentilep75
: Q3. 75% percentilep01
, p05
, p10
, p20
, p30
: 1%, 5%, 20%, 30% percentilesp40
, p60
, p70
, p80
: 40%, 60%, 70%, 80% percentilesp90
, p95
, p99
, p100
: 90%, 95%, 99%, 100% percentilesFor example, we can computes the statistics of all numerical variables in carseats
:
describe(carseats) %>%
paged_table()
skewness
: The left-skewed distribution data, that is, the variables with large positive skewness should consider the log
or sqrt
transformations to follow the normal distribution. The variables Advertising
seem to need to consider variable transformations.mean
and sd
, se_mean
: ThePopulation
with a large standard error of the mean
(se_mean) has low representativeness of the arithmetic mean
(mean). The standard deviation
(sd) is much larger than the arithmetic average.The following explains the descriptive statistics only for a few selected variables.:
# Select columns by name
describe(carseats, Sales, CompPrice, Income) %>%
paged_table()
# Select all columns between year and day (inclusive)
describe(carseats, Sales:Income) %>%
paged_table()
# Select all columns except those from year to day (inclusive)
describe(carseats, -(Sales:Income)) %>%
paged_table()
By using dplyr, You can sort by left or right skewed size
(skewness).:
carseats %>%
describe() %>%
select(variable, skewness, mean, p25, p50, p75) %>%
filter(!is.na(skewness)) %>%
arrange(desc(abs(skewness))) %>%
flextable()
variable | skewness | mean | p25 | p50 | p75 |
Advertising | 0.63958577 | 6.635000 | 0.00 | 5.00 | 12.00 |
Sales | 0.18556036 | 7.496325 | 5.39 | 7.49 | 9.32 |
Price | -0.12528619 | 115.795000 | 100.00 | 117.00 | 131.00 |
Age | -0.07718173 | 53.322500 | 39.75 | 54.50 | 66.00 |
Population | -0.05122664 | 264.840000 | 139.00 | 272.00 | 398.50 |
Education | 0.04400683 | 13.900000 | 12.00 | 14.00 | 16.00 |
CompPrice | -0.04275458 | 124.975000 | 115.00 | 125.00 | 135.00 |
Income | 0.03601821 | 69.321053 | 44.00 | 69.00 | 92.00 |
The describe()
function supports the group_by()
function syntax of dplyr
.
carseats %>%
group_by(US) %>%
describe(Sales, Income) %>%
paged_table()
carseats %>%
group_by(US, Urban) %>%
describe(Sales, Income) %>%
paged_table()
normality()
performs a normality test on numerical data. Shapiro-Wilk normality test
is performed. If the number of observations is larger than 5000, 5000 observations are extracted by random simple sampling and then tested.
The variables of tbl_df
object returned by normality()
are as follows.
statistic
: Statistics of the Shapiro-Wilk testp_value
: p-value of the Shapiro-Wilk testsample
: Number of sample observations performed Shapiro-Wilk testnormality()
performs the normality test for all numerical variables of carseats
as follows.:
vars | statistic | p_value | sample |
Sales | 0.9952023 | 0.25397543034064512523784 | 400 |
CompPrice | 0.9984315 | 0.97715118935082667661618 | 400 |
Income | 0.9608319 | 0.00000001548722780441916 | 400 |
Advertising | 0.8735360 | 0.00000000000000001491831 | 400 |
Population | 0.9520068 | 0.00000000040808515402126 | 400 |
Price | 0.9959161 | 0.39021281790522266419430 | 400 |
Age | 0.9567201 | 0.00000000186455097764598 | 400 |
Education | 0.9241976 | 0.00000000000024269275538 | 400 |
The following example performs a normality test on only a few selected variables.
vars | statistic | p_value | sample |
Sales | 0.9952023 | 0.25397543034065 | 400 |
CompPrice | 0.9984315 | 0.97715118935083 | 400 |
Income | 0.9608319 | 0.00000001548723 | 400 |
# Select all columns between year and day (inclusive)
normality(carseats, Sales:Income) %>%
flextable()
vars | statistic | p_value | sample |
Sales | 0.9952023 | 0.25397543034065 | 400 |
CompPrice | 0.9984315 | 0.97715118935083 | 400 |
Income | 0.9608319 | 0.00000001548723 | 400 |
# Select all columns except those from year to day (inclusive)
normality(carseats, -(Sales:Income)) %>%
flextable()
vars | statistic | p_value | sample |
Advertising | 0.8735360 | 0.00000000000000001491831 | 400 |
Population | 0.9520068 | 0.00000000040808515402126 | 400 |
Price | 0.9959161 | 0.39021281790522266419430 | 400 |
Age | 0.9567201 | 0.00000000186455097764598 | 400 |
Education | 0.9241976 | 0.00000000000024269275538 | 400 |
You can use dplyr to sort non-normal distribution variables by p_value.:
vars | statistic | p_value | sample |
Advertising | 0.8735360 | 0.00000000000000001491831 | 400 |
Education | 0.9241976 | 0.00000000000024269275538 | 400 |
Population | 0.9520068 | 0.00000000040808515402126 | 400 |
Age | 0.9567201 | 0.00000000186455097764598 | 400 |
Income | 0.9608319 | 0.00000001548722780441916 | 400 |
In particular, the Advertising
variable is considered to be the most out of the normal distribution.
The normality()
function supports the group_by()
function syntax in the dplyr
package.
carseats %>%
group_by(ShelveLoc, US) %>%
normality(Income) %>%
arrange(desc(p_value)) %>%
flextable()
variable | ShelveLoc | US | statistic | p_value | sample |
Income | Bad | No | 0.9647643 | 0.350396371 | 34 |
Income | Good | Yes | 0.9575857 | 0.035892391 | 61 |
Income | Bad | Yes | 0.9524369 | 0.023567053 | 62 |
Income | Good | No | 0.8790030 | 0.013998625 | 24 |
Income | Medium | Yes | 0.9643501 | 0.001900678 | 135 |
Income | Medium | No | 0.9440914 | 0.001606087 | 84 |
The Income
variable does not follow the normal distribution. However, if the US
is No
and the ShelveLoc
is Good
or Bad
at the significance level of 0.01, it follows the normal distribution.
In the following, we perform normality test of log(Income)
for each combination of ShelveLoc
and US
variables to inquire about normal distribution cases.
carseats %>%
mutate(log_income = log(Income)) %>%
group_by(ShelveLoc, US) %>%
normality(log_income) %>%
filter(p_value > 0.01) %>%
flextable()
variable | ShelveLoc | US | statistic | p_value | sample |
log_income | Bad | No | 0.9457546 | 0.09998244 | 34 |
plot_normality()
visualizes the normality of numeric data.
The information that plot_normality()
visualizes is as follows.
Histogram of original data
Q-Q plot of original data
histogram of log transformed data
Histogram of square root transformed data
Numerical data following a power-law distribution
are often encountered in data analysis. Since the numerical data following the power distribution is transformed into the normal distribution by performing the log and sqrt transform, the histogram of the data for the log and sqrt transform is drawn.
plot_normality()
can also specify several variables like normality()
function.
# Select columns by name
plot_normality(carseats, Sales, CompPrice)
The plot_normality()
function also supports the group_by()
function syntax in the dplyr
package.
carseats %>%
filter(ShelveLoc == "Good") %>%
group_by(US) %>%
plot_normality(Income)
Correlate()
finds the correlation coefficient of all combinations of carseats
numerical variables as follows:
correlate(carseats) %>%
paged_table()
The following example performs a normality test only on combinations that include several selected variables.
# Select columns by name
correlate(carseats, Sales, CompPrice, Income) %>%
paged_table()
# Select all columns between year and day (inclusive)
correlate(carseats, Sales:Income) %>%
paged_table()
# Select all columns except those from year to day (inclusive)
correlate(carseats, -(Sales:Income)) %>%
paged_table()
correlate()
produces two pairs of variable
combinations. So you can use the following filter()
function to get the correlation coefficient for a pair of variable
combinations:
carseats %>%
correlate(Sales:Income) %>%
filter(as.integer(var1) > as.integer(var2)) %>%
flextable()
var1 | var2 | coef_corr |
CompPrice | Sales | 0.06407873 |
Income | Sales | 0.15316603 |
Income | CompPrice | -0.09184062 |
The correlate()
function also supports the group_by()
function syntax in the dplyr
package.
carseats %>%
filter(ShelveLoc == "Good") %>%
group_by(Urban, US) %>%
correlate(Sales) %>%
filter(abs(coef_corr) > 0.5) %>%
flextable()
Urban | US | var1 | var2 | coef_corr |
No | No | Sales | Population | -0.5301429 |
No | No | Sales | Price | -0.8382820 |
No | Yes | Sales | Price | -0.6554729 |
Yes | No | Sales | Price | -0.8366861 |
Yes | No | Sales | Age | -0.6444156 |
Yes | Yes | Sales | Price | -0.6037759 |
plot_correlate()
visualizes the correlation matrix.
plot_correlate(carseats)
plot_correlate()
can also specify multiple variables, like the correlate()
function. The following is a visualization of the correlation matrix including several selected variables.
# Select columns by name
plot_correlate(carseats, Sales, Price)
The plot_correlate()
function also supports the group_by()
function syntax in the dplyr
package.
carseats %>%
filter(ShelveLoc == "Good") %>%
group_by(Urban, US) %>%
plot_correlate(Sales)
To perform EDA based on target variable
, you need to create atarget_by class
object. target_by()
creates a target_by class with an object inheriting data.frame or data.frame. target_by()
is similar to group_by()
in dplyr
which createsgrouped_df
. The difference is that you specify only one variable.
The following is an example of specifying US as target variable in carseats data.frame.:
categ <- target_by(carseats, US)
Let’s do the EDA when the target variable is categorical. When the categorical variable US is the target variable, the relationship between the target variable and the predictor is examined.
relate()
shows the relationship between the target variable and the predictor. The following example shows the relationship between Sales and the target variable US. The predictor Sales is a numeric variable. In this case, the descriptive statistics are shown for each level of the target variable.
# If the variable of interest is a numerical variable
cat_num <- relate(categ, Sales)
cat_num %>%
paged_table()
summary(cat_num)
variable US n na
Length:3 No :1 Min. :142.0 Min. :0
Class :character Yes :1 1st Qu.:200.0 1st Qu.:0
Mode :character total:1 Median :258.0 Median :0
Mean :266.7 Mean :0
3rd Qu.:329.0 3rd Qu.:0
Max. :400.0 Max. :0
mean sd se_mean IQR
Min. :6.823 Min. :2.603 Min. :0.1412 Min. :3.442
1st Qu.:7.160 1st Qu.:2.713 1st Qu.:0.1602 1st Qu.:3.686
Median :7.496 Median :2.824 Median :0.1791 Median :3.930
Mean :7.395 Mean :2.768 Mean :0.1796 Mean :3.866
3rd Qu.:7.682 3rd Qu.:2.851 3rd Qu.:0.1988 3rd Qu.:4.077
Max. :7.867 Max. :2.877 Max. :0.2184 Max. :4.225
skewness kurtosis p00
Min. :0.07603 Min. :-0.32638 Min. :0.0000
1st Qu.:0.13080 1st Qu.:-0.20363 1st Qu.:0.0000
Median :0.18556 Median :-0.08088 Median :0.0000
Mean :0.19489 Mean : 0.13350 Mean :0.1233
3rd Qu.:0.25432 3rd Qu.: 0.36344 3rd Qu.:0.1850
Max. :0.32308 Max. : 0.80776 Max. :0.3700
p01 p05 p10 p20
Min. :0.4675 Min. :3.147 Min. :3.917 Min. :4.754
1st Qu.:0.6868 1st Qu.:3.148 1st Qu.:4.018 1st Qu.:4.910
Median :0.9062 Median :3.149 Median :4.119 Median :5.066
Mean :1.0072 Mean :3.183 Mean :4.073 Mean :5.051
3rd Qu.:1.2771 3rd Qu.:3.200 3rd Qu.:4.152 3rd Qu.:5.199
Max. :1.6480 Max. :3.252 Max. :4.184 Max. :5.332
p25 p30 p40 p50
Min. :5.080 Min. :5.306 Min. :5.994 Min. :6.660
1st Qu.:5.235 1st Qu.:5.587 1st Qu.:6.301 1st Qu.:7.075
Median :5.390 Median :5.867 Median :6.608 Median :7.490
Mean :5.411 Mean :5.775 Mean :6.506 Mean :7.313
3rd Qu.:5.576 3rd Qu.:6.010 3rd Qu.:6.762 3rd Qu.:7.640
Max. :5.763 Max. :6.153 Max. :6.916 Max. :7.790
p60 p70 p75 p80
Min. :7.496 Min. :7.957 Min. :8.523 Min. : 8.772
1st Qu.:7.787 1st Qu.:8.386 1st Qu.:8.921 1st Qu.: 9.265
Median :8.078 Median :8.815 Median :9.320 Median : 9.758
Mean :8.076 Mean :8.740 Mean :9.277 Mean : 9.665
3rd Qu.:8.366 3rd Qu.:9.132 3rd Qu.:9.654 3rd Qu.:10.111
Max. :8.654 Max. :9.449 Max. :9.988 Max. :10.464
p90 p95 p99 p100
Min. : 9.349 Min. :11.28 Min. :13.64 Min. :14.90
1st Qu.:10.325 1st Qu.:11.86 1st Qu.:13.78 1st Qu.:15.59
Median :11.300 Median :12.44 Median :13.91 Median :16.27
Mean :10.795 Mean :12.08 Mean :13.86 Mean :15.81
3rd Qu.:11.518 3rd Qu.:12.49 3rd Qu.:13.97 3rd Qu.:16.27
Max. :11.736 Max. :12.54 Max. :14.03 Max. :16.27
The relate class
object created withrelate()
visualizes the relationship between the target variable and the predictor with plot()
. The relationship between US and Sales is represented by a density plot.
plot(cat_num)
The following example shows the relationship between ShelveLoc
and the target variable US
. The predictor, ShelveLoc, is a categorical variable. In this case, we show the contigency table
of two variables. The summary()
function also performs an independence test
on the contigency table.
# If the variable of interest is a categorical variable
cat_cat <- relate(categ, ShelveLoc)
cat_cat
ShelveLoc
US Bad Good Medium
No 34 24 84
Yes 62 61 135
summary(cat_cat)
Call: xtabs(formula = formula_str, data = data, addNA = TRUE)
Number of cases in table: 400
Number of factors: 2
Test for independence of all factors:
Chisq = 2.7397, df = 2, p-value = 0.2541
plot()
visualizes the relationship between the target variable and the predictor. The relationship between US
and ShelveLoc
is represented by a mosaics plot
.
plot(cat_cat)
Let’s do the EDA when the target variable is numeric. When the numeric variable Sales is the target variable, the relationship between the target variable and the predictor is examined.
# If the variable of interest is a numerical variable
num <- target_by(carseats, Sales)
The following example shows the relationship between Price
and the target variable Sales
. Price, a predictor, is a numeric variable. In this case, we show the result of simple regression model
of target ~ predictor
relation. The summary()
function represents the details of the model.
# If the variable of interest is a numerical variable
num_num <- relate(num, Price)
num_num
Call:
lm(formula = formula_str, data = data)
Coefficients:
(Intercept) Price
13.64192 -0.05307
summary(num_num)
Call:
lm(formula = formula_str, data = data)
Residuals:
Min 1Q Median 3Q Max
-6.5224 -1.8442 -0.1459 1.6503 7.5108
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 13.641915 0.632812 21.558 <2e-16 ***
Price -0.053073 0.005354 -9.912 <2e-16 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 2.532 on 398 degrees of freedom
Multiple R-squared: 0.198, Adjusted R-squared: 0.196
F-statistic: 98.25 on 1 and 398 DF, p-value: < 2.2e-16
plot()
visualizes the relationship between the target variable and the predictor. The relationship between Sales and Price is represented as a scatter plot. The plot on the left represents the scatter plot of Sales and Price and the confidence interval of the regression line and the regression line. The plot on the right represents the relationship between the original data and the predicted value of the linear model as a scatter plot. If there is a linear relationship between the two variables, the observations will converge on the red diagonal in the scatter plot.
plot(num_num)
The following example shows the relationship between ShelveLoc
and the target variable Sales
. The predictor, ShelveLoc, is a categorical variable. It shows the result of performing one-way ANOVA
of target ~ predictor
relation. The results are represented in terms of an analysis of variance. The summary()
function also shows the regression coefficients
for each level of the predictor. In other words, it shows detailed information of simple regression analysis
of target ~ predictor
relation.
# If the variable of interest is a categorical variable
num_cat <- relate(num, ShelveLoc)
num_cat
Analysis of Variance Table
Response: Sales
Df Sum Sq Mean Sq F value Pr(>F)
ShelveLoc 2 1009.5 504.77 92.23 < 2.2e-16 ***
Residuals 397 2172.7 5.47
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
summary(num_cat)
Call:
lm(formula = formula(formula_str), data = data)
Residuals:
Min 1Q Median 3Q Max
-7.3066 -1.6282 -0.0416 1.5666 6.1471
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 5.5229 0.2388 23.131 < 2e-16 ***
ShelveLocGood 4.6911 0.3484 13.464 < 2e-16 ***
ShelveLocMedium 1.7837 0.2864 6.229 1.2e-09 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 2.339 on 397 degrees of freedom
Multiple R-squared: 0.3172, Adjusted R-squared: 0.3138
F-statistic: 92.23 on 2 and 397 DF, p-value: < 2.2e-16
plot()
visualizes the relationship between the target variable and the predictor. The relationship between Sales
and ShelveLoc
is represented by a box plot
.
plot(num_cat)
eda_report()
performs EDA on all variables of the data frame or object (tbl_df
,tbl
, etc.) that inherits the data frame.
eda_report()
creates an EDA report in two forms:
The contents of the report are as follows.:
The following will create an EDA report for carseats
. The file format is pdf, and the file name is EDA_Report.pdf
.
carseats %>%
eda_report(target = Sales)
The following generates an HTML-formatted report named EDA.html
.
carseats %>%
eda_report(target = Sales, output_format = "html", output_file = "EDA.html")
The EDA report is an automated report to assist in the EDA process. Design the data analysis scenario with reference to the report results.
Text and figures are licensed under Creative Commons Attribution CC BY 4.0. The figures that have been reused from other sources don't fall under this license and can be recognized by a note in their caption: "Figure from ...".