Introduction to data fransformation in the dlookr package
After you have acquired the data, you should do the following:
The dlookr package makes these steps fast and easy:
This document introduces data transformation methods provided by the dlookr package. You will learn how to transform of tbl_df
data that inherits from data.frame and data.frame
with functions provided by dlookr.
dlookr
increases synergy when used with the dplyr
package. Particularly in data exploration and data wrangle, it increases the efficiency of the tidyverse
package group.
Types | Descriptions | Functions | Support DBI |
---|---|---|---|
missing values | find the variable that contains the missing value in the object that inherits the data.frame | find_na() |
|
outliers | find the numerical variable that contains outliers in the object that inherits the data.frame | find_outliers() |
|
skewed variable | find the numerical variable that skewed variable that inherits the data.frame | find_skewness() |
Types | Descriptions | Functions | Support DBI |
---|---|---|---|
missing values | missing values are imputed with some representative values and statistical methods. | imputate_na() |
|
outliers | outliers are imputed with some representative values and statistical methods. | imputate_outlier() |
|
summaries | calculate descriptive statistics of the original and imputed values. | summary.imputation() |
|
visualize | the imputation of a numerical variable is a density plot, and the imputation of a categorical variable is a bar plot. | plot.imputation() |
Types | Descriptions | Functions | Support DBI |
---|---|---|---|
binning | converts a numeric variable to a categorization variable | binning() |
|
summaries | calculate frequency and relative frequency for each levels(bins) | summary.bins() |
|
visualize | visualize two plots on a single screen. The plot at the top is a histogram representing the frequency of the level. The plot at the bottom is a bar chart representing the frequency of the level. | plot.bins() |
|
optimal binning | categorizes a numeric characteristic into bins for ulterior usage in scoring modeling | binning_by() |
|
summaries | summary metrics to evaluate the performance of binomial classification model | summary.optimal_bins() |
|
visualize | generates plots for understand distribution, bad rate, and weight of evidence after running binning_by() | plot.optimal_bins() |
|
evaluate | calculates metrics to evaluate the performance of binned variable for binomial classification model | performance_bin() |
|
summaries | summary metrics to evaluate the performance of binomial classification model after performance_bin() | summary.performance_bin() |
|
visualize | It generates plots for understand frequency, WoE by bins using performance_bin after running binning_by() | plot.performance_bin() |
|
visualize | extract bins from “bins” and “optimal_bins” objects | extract.bins() |
Types | Descriptions | Functions | Support DBI |
---|---|---|---|
diagnosis | performs diagnose performance that calculates metrics to evaluate the performance of binned variable for binomial classification model. | performance_bin() |
|
summaries | summary method for “performance_bin”. summary metrics to evaluate the performance of binomial classification model. | summary.performance_bin() |
|
visualize | visualize for understand frequency, WoE by bins using performance_bin and something else. | plot.performance_bin() |
Types | Descriptions | Functions | Support DBI |
---|---|---|---|
transformation | performs variable transformation for standardization and resolving skewness of numerical variables. | transform() |
|
summaries | compares the distribution of data before and after data transformation | summary.transform() |
|
visualize | visualize two kinds of plot by attribute of ‘transform’ class. The transformation of a numerical variable is a density plot. | plot.transform() |
Types | Descriptions | Functions | Support DBI |
---|---|---|---|
reporting the information of transformation into pdf | reporting the information of transformation. | transformation_report() |
|
reporting the information of transformation into html | reporting the information of transformation. | transformation_report() |
To illustrate the basic use of EDA in the dlookr package, I use a Carseats
datasets. Carseats
in the ISLR
package is simulation dataset that sells children’s car seats at 400 stores. This data is a data.frame created for the purpose of predicting sales volume.
Rows: 400
Columns: 11
$ Sales <dbl> 9.50, 11.22, 10.06, 7.40, 4.15, 10.81, 6.63, 11.…
$ CompPrice <dbl> 138, 111, 113, 117, 141, 124, 115, 136, 132, 132…
$ Income <dbl> 73, 48, 35, 100, 64, 113, 105, 81, 110, 113, 78,…
$ Advertising <dbl> 11, 16, 10, 4, 3, 13, 0, 15, 0, 0, 9, 4, 2, 11, …
$ Population <dbl> 276, 260, 269, 466, 340, 501, 45, 425, 108, 131,…
$ Price <dbl> 120, 83, 80, 97, 128, 72, 108, 120, 124, 124, 10…
$ ShelveLoc <fct> Bad, Good, Medium, Medium, Bad, Bad, Medium, Goo…
$ Age <dbl> 42, 65, 59, 55, 38, 78, 71, 67, 76, 76, 26, 50, …
$ Education <dbl> 17, 10, 12, 14, 13, 16, 15, 10, 10, 17, 10, 13, …
$ Urban <fct> Yes, Yes, Yes, Yes, Yes, No, Yes, Yes, No, No, N…
$ US <fct> Yes, Yes, Yes, Yes, No, Yes, No, Yes, No, Yes, Y…
The contents of individual variables are as follows. (Refer to ISLR::Carseats Man page)
When data analysis is performed, data containing missing values is often encountered. However, Carseats is complete data without missing. Therefore, the missing values are generated as follows. And I created a data.frame object named carseats.
find_na()
find the variable that contains the missing value in the object that inherits the data.frame or data.frame.
find_na(carseats)
[1] 3 10
find_na(carseats, index = FALSE)
[1] "Income" "Urban"
find_na(carseats, rate = TRUE)
Sales CompPrice Income Advertising Population
0.0 0.0 5.0 0.0 0.0
Price ShelveLoc Age Education Urban
0.0 0.0 0.0 0.0 2.5
US
0.0
## using dplyr -------------------------------------
# Perform simple data quality diagnosis of variables with missing values.
carseats %>%
select(find_na(.)) %>%
diagnose() %>%
flextable()
variables | types | missing_count | missing_percent | unique_count | unique_rate |
Income | numeric | 20 | 5.0 | 99 | 0.2475 |
Urban | factor | 10 | 2.5 | 3 | 0.0075 |
find_outliers()
find the numerical variable that contains outliers in the object that inherits the data.frame or data.frame.
find_outliers(carseats)
[1] 1 2 6
find_outliers(carseats, index = FALSE)
[1] "Sales" "CompPrice" "Price"
find_outliers(carseats, rate = TRUE)
Sales CompPrice Income Advertising Population
0.50 0.50 0.00 0.00 0.00
Price Age Education
1.25 0.00 0.00
## using dplyr -------------------------------------
# Perform simple data quality diagnosis of variables with outliers.
carseats %>%
select(find_outliers(.)) %>%
diagnose() %>%
flextable()
variables | types | missing_count | missing_percent | unique_count | unique_rate |
Sales | numeric | 0 | 0 | 336 | 0.8400 |
CompPrice | numeric | 0 | 0 | 73 | 0.1825 |
Price | numeric | 0 | 0 | 101 | 0.2525 |
find_skewness()
find the numerical variable that skewed variable that inherits the data.frame or data.frame.
find_skewness(carseats)
[1] 4
find_skewness(carseats, index = FALSE)
[1] "Advertising"
find_skewness(carseats, thres = 0.1)
[1] 1 4 6
find_skewness(carseats, value = TRUE)
Sales CompPrice Income Advertising Population
0.185 -0.043 0.036 0.637 -0.051
Price Age Education
-0.125 -0.077 0.044
find_skewness(carseats, value = TRUE, thres = 0.1)
Sales Advertising Price
0.185 0.637 -0.125
## using dplyr -------------------------------------
# Perform simple data quality diagnosis of skewed variables
carseats %>%
select(find_skewness(.)) %>%
diagnose() %>%
flextable()
variables | types | missing_count | missing_percent | unique_count | unique_rate |
Advertising | numeric | 0 | 0 | 28 | 0.07 |
imputate_na()
imputes the missing value contained in the variable. The predictor with missing values support both numeric and categorical variables, and supports the following method
.
In the following example, imputate_na()
imputes the missing value of Income
, a numeric variable of carseats, using the “rpart” method. summary()
summarizes missing value imputation information, and plot()
visualizes missing information.
if (requireNamespace("rpart", quietly = TRUE)) {
income <- imputate_na(carseats, Income, US, method = "rpart")
# result of imputation
income
# summary of imputation
summary(income)
# viz of imputation
plot(income)
} else {
cat("If you want to use this feature, you need to install the rpart package.\n")
}
* Impute missing values based on Recursive Partitioning and Regression Trees
- method : rpart
* Information of Imputation (before vs after)
Original Imputation
n 380.00000000 400.00000000
na 20.00000000 0.00000000
mean 69.32105263 69.07811282
sd 28.06686473 27.53886441
se_mean 1.43979978 1.37694322
IQR 48.00000000 45.50000000
skewness 0.03601821 0.05313579
kurtosis -1.10286001 -1.04030028
p00 21.00000000 21.00000000
p01 21.79000000 21.99000000
p05 26.00000000 26.00000000
p10 31.90000000 32.00000000
p20 40.00000000 41.00000000
p25 44.00000000 44.75000000
p30 50.00000000 51.00000000
p40 62.00000000 62.00000000
p50 69.00000000 69.00000000
p60 78.00000000 77.00000000
p70 87.30000000 84.60000000
p75 92.00000000 90.25000000
p80 98.00000000 96.00000000
p90 108.10000000 107.00000000
p95 115.05000000 115.00000000
p99 119.21000000 119.01000000
p100 120.00000000 120.00000000
The following imputes the categorical variable urban
by the “mice” method.
library(mice)
urban <- imputate_na(carseats, Urban, US, method = "mice")
iter imp variable
1 1 Income Urban
1 2 Income Urban
1 3 Income Urban
1 4 Income Urban
1 5 Income Urban
2 1 Income Urban
2 2 Income Urban
2 3 Income Urban
2 4 Income Urban
2 5 Income Urban
3 1 Income Urban
3 2 Income Urban
3 3 Income Urban
3 4 Income Urban
3 5 Income Urban
4 1 Income Urban
4 2 Income Urban
4 3 Income Urban
4 4 Income Urban
4 5 Income Urban
5 1 Income Urban
5 2 Income Urban
5 3 Income Urban
5 4 Income Urban
5 5 Income Urban
# result of imputation
urban
[1] Yes Yes Yes Yes Yes No Yes Yes No No No Yes Yes Yes Yes No
[17] Yes Yes No Yes Yes No Yes Yes Yes No No Yes Yes Yes Yes Yes
[33] No Yes Yes No No No Yes No No Yes Yes Yes Yes Yes No Yes
[49] Yes Yes Yes Yes Yes Yes No Yes Yes Yes Yes Yes Yes No Yes Yes
[65] No No Yes Yes Yes Yes Yes No Yes No No No Yes No Yes Yes
[81] Yes Yes Yes Yes No No Yes No Yes Yes No Yes Yes Yes Yes Yes
[97] No Yes No No No Yes No Yes Yes Yes No Yes Yes No Yes Yes
[113] Yes Yes Yes Yes No Yes Yes Yes Yes Yes Yes No Yes No Yes Yes
[129] Yes No Yes Yes Yes Yes Yes No No Yes Yes No Yes Yes Yes Yes
[145] No Yes Yes No No Yes No No No No No Yes Yes No Yes No
[161] No No Yes No No Yes Yes Yes Yes Yes Yes Yes Yes Yes No Yes
[177] No Yes No Yes Yes Yes Yes Yes No Yes No Yes Yes No No Yes
[193] No Yes Yes Yes Yes Yes Yes Yes No Yes No Yes Yes Yes Yes No
[209] Yes No No Yes Yes Yes Yes Yes Yes No Yes Yes Yes Yes Yes Yes
[225] No Yes Yes Yes No No No No Yes No No Yes No Yes Yes Yes
[241] Yes Yes No Yes Yes No Yes Yes Yes Yes Yes No Yes No Yes Yes
[257] Yes Yes No No Yes Yes Yes Yes Yes Yes No No Yes Yes Yes Yes
[273] Yes Yes Yes Yes Yes Yes No Yes No No Yes No No Yes No Yes
[289] No Yes No Yes Yes Yes Yes No Yes Yes Yes No Yes Yes Yes Yes
[305] Yes Yes Yes Yes Yes Yes Yes Yes Yes Yes Yes Yes Yes No No No
[321] Yes Yes Yes Yes Yes Yes Yes Yes Yes Yes No Yes Yes Yes Yes Yes
[337] Yes Yes Yes Yes Yes No No Yes No Yes No No Yes No No No
[353] Yes No Yes Yes Yes Yes Yes Yes No No Yes Yes Yes No No Yes
[369] No Yes Yes Yes No Yes Yes Yes Yes Yes Yes Yes Yes Yes Yes Yes
[385] Yes Yes Yes No Yes Yes Yes Yes Yes No Yes Yes No Yes Yes Yes
attr(,"var_type")
[1] categorical
attr(,"method")
[1] mice
attr(,"na_pos")
[1] 38 90 159 206 237 252 281 283 335 378
attr(,"seed")
[1] 43320
attr(,"type")
[1] missing values
attr(,"message")
[1] complete imputation
attr(,"success")
[1] TRUE
Levels: No Yes
# summary of imputation
summary(urban)
* Impute missing values based on Multivariate Imputation by Chained Equations
- method : mice
- random seed : 43320
* Information of Imputation (before vs after)
original imputation original_percent imputation_percent
No 115 119 28.75 29.75
Yes 275 281 68.75 70.25
<NA> 10 0 2.50 0.00
# viz of imputation
plot(urban)
The following example imputes the missing value of the Income
variable, and then calculates the arithmetic mean for each level of US
. In this case, dplyr
is used, and it is easily interpreted logically using pipes.
# The mean before and after the imputation of the Income variable
carseats %>%
mutate(Income_imp = imputate_na(carseats, Income, US, method = "knn")) %>%
group_by(US) %>%
summarise(orig = mean(Income, na.rm = TRUE),
imputation = mean(Income_imp))
# A tibble: 2 x 3
US orig imputation
* <fct> <dbl> <dbl>
1 No 65.7 65.5
2 Yes 71.3 71.5
imputate_outlier()
imputes the outliers value. The predictor with outliers supports only numeric variables and supports the following methods.
imputate_outlier()
imputes the outliers with the numeric variable Price
as the “capping” method, as follows. summary()
summarizes outliers imputation information, and plot()
visualizes imputation information.
price <- imputate_outlier(carseats, Price, method = "capping")
# result of imputation
price
[1] 120.00 83.00 80.00 97.00 128.00 72.00 108.00 120.00 124.00
[10] 124.00 100.00 94.00 136.00 86.00 118.00 144.00 110.00 131.00
[19] 68.00 121.00 131.00 109.00 138.00 109.00 113.00 82.00 131.00
[28] 107.00 97.00 102.00 89.00 131.00 137.00 128.00 128.00 96.00
[37] 100.00 110.00 102.00 138.00 126.00 124.00 77.00 134.00 95.00
[46] 135.00 70.00 108.00 98.00 149.00 108.00 108.00 129.00 119.00
[55] 144.00 154.00 84.00 117.00 103.00 114.00 123.00 107.00 133.00
[64] 101.00 104.00 128.00 91.00 115.00 134.00 99.00 99.00 150.00
[73] 116.00 104.00 136.00 92.00 70.00 89.00 145.00 90.00 79.00
[82] 128.00 139.00 94.00 121.00 112.00 134.00 126.00 111.00 119.00
[91] 103.00 107.00 125.00 104.00 84.00 148.00 132.00 129.00 127.00
[100] 107.00 106.00 118.00 97.00 96.00 138.00 97.00 139.00 108.00
[109] 103.00 90.00 116.00 151.00 125.00 127.00 106.00 129.00 128.00
[118] 119.00 99.00 128.00 131.00 87.00 108.00 155.00 120.00 77.00
[127] 133.00 116.00 126.00 147.00 77.00 94.00 136.00 97.00 131.00
[136] 120.00 120.00 118.00 109.00 94.00 129.00 131.00 104.00 159.00
[145] 123.00 117.00 131.00 119.00 97.00 87.00 114.00 103.00 128.00
[154] 150.00 110.00 69.00 157.00 90.00 112.00 70.00 111.00 160.00
[163] 149.00 106.00 141.00 155.05 137.00 93.00 117.00 77.00 118.00
[172] 55.00 110.00 128.00 155.05 122.00 154.00 94.00 81.00 116.00
[181] 149.00 91.00 140.00 102.00 97.00 107.00 86.00 96.00 90.00
[190] 104.00 101.00 173.00 93.00 96.00 128.00 112.00 133.00 138.00
[199] 128.00 126.00 146.00 134.00 130.00 157.00 124.00 132.00 160.00
[208] 97.00 64.00 90.00 123.00 120.00 105.00 139.00 107.00 144.00
[217] 144.00 111.00 120.00 116.00 124.00 107.00 145.00 125.00 141.00
[226] 82.00 122.00 101.00 163.00 72.00 114.00 122.00 105.00 120.00
[235] 129.00 132.00 108.00 135.00 133.00 118.00 121.00 94.00 135.00
[244] 110.00 100.00 88.00 90.00 151.00 101.00 117.00 156.00 132.00
[253] 117.00 122.00 129.00 81.00 144.00 112.00 81.00 100.00 101.00
[262] 118.00 132.00 115.00 159.00 129.00 112.00 112.00 105.00 166.00
[271] 89.00 110.00 63.00 86.00 119.00 132.00 130.00 125.00 151.00
[280] 158.00 145.00 105.00 154.00 117.00 96.00 131.00 113.00 72.00
[289] 97.00 156.00 103.00 89.00 74.00 89.00 99.00 137.00 123.00
[298] 104.00 130.00 96.00 99.00 87.00 110.00 99.00 134.00 132.00
[307] 133.00 120.00 126.00 80.00 166.00 132.00 135.00 54.00 129.00
[316] 171.00 72.00 136.00 130.00 129.00 152.00 98.00 139.00 103.00
[325] 150.00 104.00 122.00 104.00 111.00 89.00 112.00 134.00 104.00
[334] 147.00 83.00 110.00 143.00 102.00 101.00 126.00 91.00 93.00
[343] 118.00 121.00 126.00 149.00 125.00 112.00 107.00 96.00 91.00
[352] 105.00 122.00 92.00 145.00 146.00 164.00 72.00 118.00 130.00
[361] 114.00 104.00 110.00 108.00 131.00 162.00 134.00 77.00 79.00
[370] 122.00 119.00 126.00 98.00 116.00 118.00 124.00 92.00 125.00
[379] 119.00 107.00 89.00 151.00 121.00 68.00 112.00 132.00 160.00
[388] 115.00 78.00 107.00 111.00 124.00 130.00 120.00 139.00 128.00
[397] 120.00 159.00 95.00 120.00
attr(,"method")
[1] "capping"
attr(,"var_type")
[1] "numerical"
attr(,"outlier_pos")
[1] 43 126 166 175 368
attr(,"outliers")
[1] 24 49 191 185 53
attr(,"type")
[1] "outliers"
attr(,"message")
[1] "complete imputation"
attr(,"success")
[1] TRUE
attr(,"class")
[1] "imputation" "numeric"
# summary of imputation
summary(price)
Impute outliers with capping
* Information of Imputation (before vs after)
Original Imputation
n 400.0000000 400.0000000
na 0.0000000 0.0000000
mean 115.7950000 115.8927500
sd 23.6766644 22.6109187
se_mean 1.1838332 1.1305459
IQR 31.0000000 31.0000000
skewness -0.1252862 -0.0461621
kurtosis 0.4518850 -0.3030578
p00 24.0000000 54.0000000
p01 54.9900000 67.9600000
p05 77.0000000 77.0000000
p10 87.0000000 87.0000000
p20 96.8000000 96.8000000
p25 100.0000000 100.0000000
p30 104.0000000 104.0000000
p40 110.0000000 110.0000000
p50 117.0000000 117.0000000
p60 122.0000000 122.0000000
p70 128.3000000 128.3000000
p75 131.0000000 131.0000000
p80 134.0000000 134.0000000
p90 146.0000000 146.0000000
p95 155.0500000 155.0025000
p99 166.0500000 164.0200000
p100 191.0000000 173.0000000
# viz of imputation
plot(price)
The following example imputes the outliers of the Price
variable, and then calculates the arithmetic mean for each level of US
. In this case, dplyr
is used, and it is easily interpreted logically using pipes.
# The mean before and after the imputation of the Price variable
carseats %>%
mutate(Price_imp = imputate_outlier(carseats, Price, method = "capping")) %>%
group_by(US) %>%
summarise(orig = mean(Price, na.rm = TRUE),
imputation = mean(Price_imp, na.rm = TRUE))
# A tibble: 2 x 3
US orig imputation
* <fct> <dbl> <dbl>
1 No 114. 114.
2 Yes 117. 117.
binning()
transforms a numeric variable into a categorical variable by binning it. The following types of binning are supported.
Here are some examples of how to bin Income
using binning()
.:
# Binning the carat variable. default type argument is "quantile"
bin <- binning(carseats$Income)
# Print bins class object
bin
binned type: quantile
number of bins: 10
x
[21,31.36667] (31.36667,40] (40,50] (50,62]
38 40 37 40
(62,69] (69,78] (78,87.56667] (87.56667,98]
40 34 37 41
(98,108.6333] (108.6333,120] <NA>
35 38 20
# Summarize bins class object
summary(bin)
levels freq rate
1 [21,31.36667] 38 0.0950
2 (31.36667,40] 40 0.1000
3 (40,50] 37 0.0925
4 (50,62] 40 0.1000
5 (62,69] 40 0.1000
6 (69,78] 34 0.0850
7 (78,87.56667] 37 0.0925
8 (87.56667,98] 41 0.1025
9 (98,108.6333] 35 0.0875
10 (108.6333,120] 38 0.0950
11 <NA> 20 0.0500
# Plot bins class object
plot(bin)
# Using labels argument
bin <- binning(carseats$Income, nbins = 4,
labels = c("LQ1", "UQ1", "LQ3", "UQ3"))
bin
binned type: quantile
number of bins: 4
x
LQ1 UQ1 LQ3 UQ3 <NA>
98 97 91 94 20
# Using another type argument
binning(carseats$Income, nbins = 5, type = "equal")
binned type: equal
number of bins: 5
x
[21,40.8] (40.8,60.6] (60.6,80.4] (80.4,100.2] (100.2,120]
78 68 92 79 63
<NA>
20
binning(carseats$Income, nbins = 5, type = "pretty")
binned type: pretty
number of bins: 5
x
[20,40] (40,60] (60,80] (80,100] (100,120] <NA>
78 68 92 79 63 20
if (requireNamespace("classInt", quietly = TRUE)) {
binning(carseats$Income, nbins = 5, type = "kmeans")
binning(carseats$Income, nbins = 5, type = "bclust")
} else {
cat("If you want to use this feature, you need to install the classInt package.\n")
}
binned type: bclust
number of bins: 5
x
[21,36.5] (36.5,57.5] (57.5,85] (85,108.5] (108.5,120]
65 67 129 81 38
<NA>
20
# Extract the binned results
extract(bin)
[1] LQ3 UQ1 LQ1 UQ3 UQ1 UQ3 UQ3 LQ3 UQ3 UQ3 LQ3 UQ3 LQ1
[14] <NA> UQ3 UQ3 LQ1 LQ3 UQ3 LQ3 LQ3 LQ1 UQ1 LQ1 UQ3 LQ1
[27] UQ3 UQ3 LQ3 UQ3 UQ3 UQ1 LQ1 LQ1 UQ1 LQ3 LQ3 LQ1 LQ3
[40] UQ1 UQ3 UQ1 UQ1 LQ1 LQ3 UQ1 LQ3 UQ3 UQ1 UQ3 LQ1 LQ3
[53] LQ1 UQ1 UQ3 LQ3 LQ3 LQ3 UQ3 LQ3 UQ3 LQ1 UQ1 LQ3 UQ1
[66] LQ1 LQ3 UQ1 UQ1 UQ1 LQ3 UQ1 UQ1 LQ3 UQ1 UQ3 LQ3 LQ3
[79] UQ1 UQ1 UQ3 LQ3 LQ3 LQ1 LQ1 UQ3 LQ3 UQ1 LQ1 <NA> <NA>
[92] UQ1 UQ3 LQ1 UQ3 LQ1 LQ1 LQ3 LQ3 UQ1 UQ1 UQ3 LQ1 LQ3
[105] UQ3 UQ3 LQ1 UQ3 LQ3 UQ1 UQ1 UQ3 UQ3 LQ1 LQ3 LQ1 LQ3
[118] <NA> LQ3 UQ3 UQ3 LQ3 UQ3 UQ3 UQ3 LQ3 UQ1 UQ1 UQ3 UQ3
[131] LQ3 UQ1 LQ3 UQ3 LQ1 UQ3 <NA> LQ1 UQ3 UQ1 UQ1 LQ1 LQ3
[144] LQ3 UQ1 UQ1 LQ3 UQ1 UQ3 UQ3 LQ3 UQ1 <NA> LQ1 UQ1 LQ3
[157] LQ1 UQ1 LQ3 UQ1 LQ1 LQ1 LQ3 UQ1 UQ1 UQ1 UQ1 LQ3 LQ3
[170] LQ1 LQ1 UQ3 UQ3 LQ3 LQ1 LQ3 UQ3 LQ3 <NA> LQ1 UQ3 LQ3
[183] UQ1 LQ3 LQ1 UQ3 UQ1 LQ1 LQ1 UQ3 LQ1 LQ1 LQ1 LQ3 <NA>
[196] UQ3 <NA> UQ1 LQ3 LQ3 LQ3 LQ3 LQ3 LQ3 LQ3 LQ1 UQ1 UQ3
[209] UQ1 LQ1 LQ1 UQ3 UQ1 LQ3 UQ3 LQ3 LQ1 LQ1 UQ1 LQ3 UQ3
[222] LQ1 UQ3 UQ1 LQ3 LQ1 LQ1 UQ1 <NA> UQ3 UQ1 UQ1 LQ3 LQ3
[235] UQ1 LQ1 LQ1 LQ1 LQ1 UQ3 LQ3 UQ1 UQ1 <NA> LQ1 LQ1 UQ1
[248] UQ3 UQ1 UQ1 UQ3 UQ3 UQ3 LQ1 UQ3 <NA> LQ1 UQ1 LQ1 LQ1
[261] UQ3 LQ1 LQ3 LQ1 LQ1 LQ1 UQ3 LQ3 UQ1 UQ1 LQ1 UQ1 LQ1
[274] UQ3 UQ3 UQ3 UQ1 UQ1 UQ3 UQ1 LQ3 UQ1 UQ3 UQ3 UQ1 LQ1
[287] UQ3 LQ1 LQ1 LQ3 UQ3 LQ3 UQ1 LQ3 LQ3 LQ1 LQ1 LQ3 <NA>
[300] LQ1 LQ3 UQ3 LQ3 UQ1 UQ3 <NA> LQ1 LQ3 LQ3 UQ3 UQ1 UQ1
[313] UQ3 LQ3 LQ1 LQ1 LQ1 LQ1 LQ3 UQ1 LQ3 LQ1 UQ1 UQ3 UQ1
[326] UQ1 LQ1 <NA> UQ1 UQ1 UQ1 UQ1 LQ1 UQ1 UQ3 LQ3 LQ1 LQ1
[339] LQ1 LQ1 LQ1 UQ3 UQ3 LQ1 LQ3 UQ1 UQ3 <NA> UQ3 LQ1 UQ3
[352] UQ3 UQ3 UQ1 <NA> UQ3 UQ3 LQ3 UQ3 UQ1 LQ3 LQ1 UQ1 LQ3
[365] LQ1 LQ1 UQ1 UQ3 LQ1 UQ3 LQ1 LQ3 UQ1 <NA> UQ1 UQ1 UQ1
[378] UQ1 LQ3 UQ3 UQ1 UQ1 LQ1 UQ3 LQ1 LQ3 UQ3 <NA> LQ3 LQ1
[391] LQ3 UQ1 LQ1 UQ1 UQ1 UQ3 LQ1 LQ1 <NA> LQ1
Levels: LQ1 < UQ1 < LQ3 < UQ3
# -------------------------
# Using pipes & dplyr
# -------------------------
library(dplyr)
carseats %>%
mutate(Income_bin = binning(carseats$Income) %>%
extract()) %>%
group_by(ShelveLoc, Income_bin) %>%
summarise(freq = n()) %>%
arrange(desc(freq)) %>%
head(10)
# A tibble: 10 x 3
# Groups: ShelveLoc [1]
ShelveLoc Income_bin freq
<fct> <ord> <int>
1 Medium [21,31.36667] 25
2 Medium (62,69] 23
3 Medium (50,62] 22
4 Medium (31.36667,40] 21
5 Medium (40,50] 20
6 Medium (69,78] 20
7 Medium (108.6333,120] 20
8 Medium (78,87.56667] 19
9 Medium (87.56667,98] 19
10 Medium (98,108.6333] 19
binning_by()
transforms a numeric variable into a categorical variable by optimal binning. This method is often used when developing a scorecard model
.
The following binning_by()
example optimally binning Advertising
considering the target variable US
with a binary class.
library(dplyr)
# optimal binning using character
bin <- binning_by(carseats, "US", "Advertising")
# optimal binning using name
bin <- binning_by(carseats, US, Advertising)
bin
binned type: optimal
number of bins: 3
x
[-1,0] (0,6] (6,29]
144 69 187
# summary optimal_bins class
summary(bin)
── Binning Table ──────────────────────── Several Metrics ──
Bin CntRec CntPos CntNeg RatePos RateNeg Odds WoE
1 [-1,0] 144 19 125 0.07364 0.88028 0.1520 -2.48101
2 (0,6] 69 54 15 0.20930 0.10563 3.6000 0.68380
3 (6,29] 187 185 2 0.71705 0.01408 92.5000 3.93008
4 Total 400 258 142 1.00000 1.00000 1.8169 NA
IV JSD AUC
1 2.00128 0.20093 0.03241
2 0.07089 0.00869 0.01883
3 2.76272 0.21861 0.00903
4 4.83489 0.42823 0.06028
── General Metrics ─────────────────────────────────────────
● Gini index : -0.87944
● IV (Jeffrey) : 4.83489
● JS (Jensen-Shannon) Divergence : 0.42823
● Kolmogorov-Smirnov Statistics : 0.80664
● HHI (Herfindahl-Hirschman Index) : 0.37791
● HHI (normalized) : 0.06687
● Cramer's V : 0.81863
── Significance Tests ──────────────────── Chisquare Test ──
Bin A Bin B statistics p_value
1 [-1,0] (0,6] 87.67064 7.731349e-21
2 (0,6] (6,29] 34.73349 3.780706e-09
# performance table
attr(bin, "performance")
Bin CntRec CntPos CntNeg CntCumPos CntCumNeg RatePos RateNeg
1 [-1,0] 144 19 125 19 125 0.07364 0.88028
2 (0,6] 69 54 15 73 140 0.20930 0.10563
3 (6,29] 187 185 2 258 142 0.71705 0.01408
4 Total 400 258 142 NA NA 1.00000 1.00000
RateCumPos RateCumNeg Odds LnOdds WoE IV JSD
1 0.07364 0.88028 0.1520 -1.88387 -2.48101 2.00128 0.20093
2 0.28295 0.98592 3.6000 1.28093 0.68380 0.07089 0.00869
3 1.00000 1.00000 92.5000 4.52721 3.93008 2.76272 0.21861
4 NA NA 1.8169 0.59713 NA 4.83489 0.42823
AUC
1 0.03241
2 0.01883
3 0.00903
4 0.06028
# visualize optimal_bins class
plot(bin)
[1] (6,29] (6,29] (6,29] (0,6] (0,6] (6,29] [-1,0] (6,29] [-1,0]
[10] [-1,0] (6,29] (0,6] (0,6] (6,29] (6,29] (0,6] [-1,0] (6,29]
[19] [-1,0] (6,29]
Levels: [-1,0] < (0,6] < (6,29]
performance_bin()
calculates metrics to evaluate the performance of binned variable for binomial classification model.
# Generate data for the example
heartfailure2 <- heartfailure
set.seed(123)
heartfailure2[sample(seq(NROW(heartfailure2)), 5), "creatinine"] <- NA
# Change the target variable to 0(negative) and 1(positive).
heartfailure2$death_event_2 <- ifelse(heartfailure2$death_event %in% "Yes", 1, 0)
# Binnig from creatinine to platelets_bin.
breaks <- c(0, 1, 2, 10)
heartfailure2$creatinine_bin <- cut(heartfailure2$creatinine, breaks)
# Diagnose performance binned variable
perf <- performance_bin(heartfailure2$death_event_2, heartfailure2$creatinine_bin)
perf
Bin CntRec CntPos CntNeg CntCumPos CntCumNeg RatePos RateNeg
1 (0,1] 128 24 104 24 104 0.25000 0.51232
2 (1,2] 132 51 81 75 185 0.53125 0.39901
3 (2,10] 34 21 13 96 198 0.21875 0.06404
4 <NA> 5 0 5 96 203 0.00000 0.02463
5 Total 299 96 203 NA NA 1.00000 1.00000
RateCumPos RateCumNeg Odds LnOdds WoE IV JSD
1 0.25000 0.51232 0.23077 -1.46634 -0.71748 0.18821 0.02303
2 0.78125 0.91133 0.62963 -0.46262 0.28623 0.03785 0.00472
3 1.00000 0.97537 1.61538 0.47957 1.22843 0.19005 0.02237
4 1.00000 1.00000 0.00000 -Inf NA NA NA
5 NA NA 0.47291 -0.74886 NA NA NA
AUC
1 0.06404
2 0.20574
3 0.05704
4 0.02463
5 0.35145
summary.performance_bin()
summarize metrics to evaluate the performance of binomial classification model.
summary(perf)
── Binning Table ──────────────────────── Several Metrics ──
Bin CntRec CntPos CntNeg RatePos RateNeg Odds WoE
1 (0,1] 128 24 104 0.25000 0.51232 0.23077 -0.71748
2 (1,2] 132 51 81 0.53125 0.39901 0.62963 0.28623
3 (2,10] 34 21 13 0.21875 0.06404 1.61538 1.22843
4 <NA> 5 0 5 0.00000 0.02463 0.00000 NA
5 Total 299 96 203 1.00000 1.00000 0.47291 NA
IV JSD AUC
1 0.18821 0.02303 0.06404
2 0.03785 0.00472 0.20574
3 0.19005 0.02237 0.05704
4 NA NA 0.02463
5 NA NA 0.35145
── General Metrics ─────────────────────────────────────────
● Gini index : -0.29711
● IV (Jeffrey) : NA
● JS (Jensen-Shannon) Divergence : NA
● Kolmogorov-Smirnov Statistics : 0.26232
● HHI (Herfindahl-Hirschman Index) : 0.39137
● HHI (normalized) : 0.1885
● Cramer's V : 0.30024
── Significance Tests ──────────────────── Chisquare Test ──
Bin A Bin B statistics p_value
1 (0,1] (1,2] 12.520885 0.0004024283
2 (1,2] (2,10] 5.888296 0.0152418462
plot.performance_bin()
visualize metrics to evaluate the performance of binomial classification model.
plot(perf)
transform()
performs data transformation. Only numeric variables are supported, and the following methods are provided.
Use the methods “zscore” and “minmax” to perform standardization.
find_skewness()
searches for variables with skewed data. This function finds data skewed by search conditions and calculates skewness.
# find index of skewed variables
find_skewness(carseats)
[1] 4
# find names of skewed variables
find_skewness(carseats, index = FALSE)
[1] "Advertising"
# compute the skewness
find_skewness(carseats, value = TRUE)
Sales CompPrice Income Advertising Population
0.185 -0.043 0.036 0.637 -0.051
Price Age Education
-0.125 -0.077 0.044
# compute the skewness & filtering with threshold
find_skewness(carseats, value = TRUE, thres = 0.1)
Sales Advertising Price
0.185 0.637 -0.125
The skewness of Advertising
is 0.637. This means that the distribution of data is somewhat inclined to the left. So, for normal distribution, use transform()
to convert to “log” method as follows. summary()
summarizes transformation information, and plot()
visualizes transformation information.
Advertising_log = transform(carseats$Advertising, method = "log")
# result of transformation
head(Advertising_log)
[1] 2.397895 2.772589 2.302585 1.386294 1.098612 2.564949
# summary of transformation
summary(Advertising_log)
* Resolving Skewness with log
* Information of Transformation (before vs after)
Original Transformation
n 400.0000000 400.0000000
na 0.0000000 0.0000000
mean 6.6350000 -Inf
sd 6.6503642 NaN
se_mean 0.3325182 NaN
IQR 12.0000000 Inf
skewness 0.6395858 NaN
kurtosis -0.5451178 NaN
p00 0.0000000 -Inf
p01 0.0000000 -Inf
p05 0.0000000 -Inf
p10 0.0000000 -Inf
p20 0.0000000 -Inf
p25 0.0000000 -Inf
p30 0.0000000 -Inf
p40 2.0000000 0.6931472
p50 5.0000000 1.6094379
p60 8.4000000 2.1265548
p70 11.0000000 2.3978953
p75 12.0000000 2.4849066
p80 13.0000000 2.5649494
p90 16.0000000 2.7725887
p95 19.0000000 2.9444390
p99 23.0100000 3.1359198
p100 29.0000000 3.3672958
# viz of transformation
plot(Advertising_log)
It seems that the raw data contains 0, as there is a -Inf in the log converted value. So this time, convert it to “log+1”.
Advertising_log <- transform(carseats$Advertising, method = "log+1")
# result of transformation
head(Advertising_log)
[1] 2.484907 2.833213 2.397895 1.609438 1.386294 2.639057
# summary of transformation
summary(Advertising_log)
* Resolving Skewness with log+1
* Information of Transformation (before vs after)
Original Transformation
n 400.0000000 400.00000000
na 0.0000000 0.00000000
mean 6.6350000 1.46247709
sd 6.6503642 1.19436323
se_mean 0.3325182 0.05971816
IQR 12.0000000 2.56494936
skewness 0.6395858 -0.19852549
kurtosis -0.5451178 -1.66342876
p00 0.0000000 0.00000000
p01 0.0000000 0.00000000
p05 0.0000000 0.00000000
p10 0.0000000 0.00000000
p20 0.0000000 0.00000000
p25 0.0000000 0.00000000
p30 0.0000000 0.00000000
p40 2.0000000 1.09861229
p50 5.0000000 1.79175947
p60 8.4000000 2.23936878
p70 11.0000000 2.48490665
p75 12.0000000 2.56494936
p80 13.0000000 2.63905733
p90 16.0000000 2.83321334
p95 19.0000000 2.99573227
p99 23.0100000 3.17846205
p100 29.0000000 3.40119738
# viz of transformation
# plot(Advertising_log)
transformation_report()
generates a data transformation report for all the variables in the data frame or objects that inherit the data frame (tbl_df
, tbl
, etc.).
transformation_report()
generates a data transformation report in two forms:
The contents of the report are as follows.:
The following generates a data transformation report for carseats
. The file format is pdf, and the file name is Transformation_Report.pdf
.
carseats %>%
transformation_report(target = US)
The following generates a report in html format called transformation_carseats.html
.
carseats %>%
transformation_report(target = US, output_format = "html",
output_file = "transformation_carseats.html")
Data transformation reports are automated reports to assist in the data transformation process. Design data conversion scenarios by referring to the report results.
Text and figures are licensed under Creative Commons Attribution CC BY 4.0. The figures that have been reused from other sources don't fall under this license and can be recognized by a note in their caption: "Figure from ...".