OkCupid Binary Predictors
Source
Kim (2015), "OkCupid Data for Introductory Statistics and Data Science Courses", Journal of Statistics Education, Volume 23, Number 2. doi:10.1080/10691898.2015.11889737
Kuhn and Johnson (2020), Feature Engineering and Selection, Chapman and Hall/CRC. https://bookdown.org/max/FES/ and https://github.com/topepo/FES
Details
The data, originally from Kim (2015), include a training set and a test set consistent with Kuhn and Johnson (2020). The predictors are binary (0/1) indicators: a set of ethnicity indicators and a set of keywords derived from the text essay data.
Examples
data(okc_binary)
str(okc_binary_train)
#> tibble [38,809 × 61] (S3: tbl_df/tbl/data.frame)
#> $ software : num [1:38809] 0 0 0 0 0 0 0 0 0 0 ...
#> $ engineer : num [1:38809] 0 0 0 0 0 0 0 0 0 0 ...
#> $ startup : num [1:38809] 0 0 0 0 0 0 0 0 0 0 ...
#> $ tech : num [1:38809] 0 0 0 0 1 0 0 0 0 1 ...
#> $ computers : num [1:38809] 0 0 0 0 0 0 0 0 0 0 ...
#> $ engineering : num [1:38809] 0 0 0 0 0 0 0 0 0 0 ...
#> $ computer : num [1:38809] 0 0 0 0 0 0 0 0 0 0 ...
#> $ internet : num [1:38809] 0 0 0 0 1 0 0 0 0 0 ...
#> $ technology : num [1:38809] 0 0 0 0 0 0 0 0 0 0 ...
#> $ science : num [1:38809] 0 0 0 0 1 0 0 0 0 0 ...
#> $ programming : num [1:38809] 0 0 0 0 0 0 0 0 0 1 ...
#> $ technical : num [1:38809] 0 0 0 0 0 0 0 0 0 0 ...
#> $ web : num [1:38809] 0 0 0 0 0 0 0 0 0 0 ...
#> $ developer : num [1:38809] 0 0 0 0 0 0 0 0 0 0 ...
#> $ im : num [1:38809] 1 0 1 0 1 1 0 1 1 0 ...
#> $ programmer : num [1:38809] 0 0 0 0 0 0 0 0 0 1 ...
#> $ scientist : num [1:38809] 0 0 0 0 0 0 0 0 0 0 ...
#> $ code : num [1:38809] 0 0 0 0 0 0 0 0 0 0 ...
#> $ stephenson : num [1:38809] 0 0 0 0 0 0 0 0 0 0 ...
#> $ geek : num [1:38809] 0 0 0 0 0 0 0 0 0 1 ...
#> $ nerd : num [1:38809] 0 0 0 0 0 0 0 0 0 1 ...
#> $ lol : num [1:38809] 0 0 0 0 0 0 0 0 0 0 ...
#> $ biotech : num [1:38809] 0 0 0 0 0 0 0 0 0 0 ...
#> $ matrix : num [1:38809] 0 0 0 0 0 0 0 0 0 0 ...
#> $ coding : num [1:38809] 0 0 0 0 0 0 0 0 0 0 ...
#> $ geeky : num [1:38809] 0 0 0 0 0 0 0 0 0 1 ...
#> $ solving : num [1:38809] 0 0 0 0 0 0 0 0 0 0 ...
#> $ problems : num [1:38809] 0 0 1 0 1 0 0 0 0 0 ...
#> $ data : num [1:38809] 0 0 0 0 0 0 0 0 0 0 ...
#> $ fixing : num [1:38809] 0 0 0 0 0 0 0 0 0 0 ...
#> $ teacher : num [1:38809] 0 0 0 0 0 0 0 0 0 0 ...
#> $ student : num [1:38809] 0 0 0 0 0 0 0 0 0 0 ...
#> $ silicon : num [1:38809] 0 0 0 0 0 0 0 0 0 0 ...
#> $ law : num [1:38809] 0 0 0 0 0 0 0 1 0 0 ...
#> $ mechanical : num [1:38809] 0 0 0 0 0 0 0 0 0 0 ...
#> $ electronic : num [1:38809] 0 0 0 0 0 0 0 0 0 1 ...
#> $ pratchett : num [1:38809] 0 0 0 0 0 0 0 0 0 0 ...
#> $ wikipedia : num [1:38809] 0 0 0 0 0 0 0 0 0 0 ...
#> $ neal : num [1:38809] 0 0 0 0 0 0 0 0 0 0 ...
#> $ mobile : num [1:38809] 0 0 0 0 0 0 0 0 0 0 ...
#> $ math : num [1:38809] 0 0 0 0 0 0 0 0 0 0 ...
#> $ lab : num [1:38809] 0 0 0 0 0 0 0 0 0 1 ...
#> $ systems : num [1:38809] 0 0 0 0 0 0 0 0 0 0 ...
#> $ electronics : num [1:38809] 0 0 0 0 0 0 0 0 0 0 ...
#> $ futurama : num [1:38809] 0 0 0 0 0 0 0 0 0 0 ...
#> $ alot : num [1:38809] 0 0 0 0 0 0 0 0 0 0 ...
#> $ solve : num [1:38809] 0 0 0 0 0 0 0 0 0 0 ...
#> $ websites : num [1:38809] 0 0 0 0 0 0 0 0 0 0 ...
#> $ firefly : num [1:38809] 0 0 0 0 0 0 0 0 0 1 ...
#> $ valley : num [1:38809] 0 0 0 0 0 0 0 0 0 0 ...
#> $ apps : num [1:38809] 0 0 0 0 0 0 0 0 0 0 ...
#> $ lawyer : num [1:38809] 0 0 0 0 0 0 0 0 0 0 ...
#> $ asian : num [1:38809] 1 0 0 0 0 0 0 0 0 0 ...
#> $ black : num [1:38809] 0 0 0 0 0 0 0 0 0 0 ...
#> $ hispanic_latin : num [1:38809] 0 0 0 0 0 0 0 1 0 0 ...
#> $ indian : num [1:38809] 0 0 0 0 0 0 0 0 0 0 ...
#> $ middle_eastern : num [1:38809] 0 0 0 0 0 0 0 0 0 0 ...
#> $ native_american : num [1:38809] 0 0 0 0 0 0 0 0 0 0 ...
#> $ other : num [1:38809] 0 0 0 0 0 0 0 0 0 0 ...
#> $ pacific_islander: num [1:38809] 0 0 0 0 0 0 0 0 0 0 ...
#> $ white : num [1:38809] 1 1 1 1 1 1 1 1 1 1 ...
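
The two predictor groups described under Details can be separated by column name. The following is a minimal sketch, not part of the original examples: it assumes the data set ships with the modeldata package, and it takes the ethnicity column names from the str() output above. The object names ethnicity_cols, keyword_cols, and keyword_rates are hypothetical and chosen only for illustration.

```r
# Assumption: okc_binary is provided by the modeldata package.
data(okc_binary, package = "modeldata")

# Ethnicity indicator columns, copied from the str() output above.
ethnicity_cols <- c(
  "asian", "black", "hispanic_latin", "indian", "middle_eastern",
  "native_american", "other", "pacific_islander", "white"
)

# Treat every remaining column as a keyword indicator.
keyword_cols <- setdiff(names(okc_binary_train), ethnicity_cols)

# Because each column is coded 0/1, a column mean is the proportion of
# training profiles flagged for that keyword.
keyword_rates <- sort(colMeans(okc_binary_train[keyword_cols]), decreasing = TRUE)
head(keyword_rates)
```

Sorting the proportions makes it easy to see which keywords are common and which are rare before using them in a model.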
