10.5 C
Washington

Analyzing rtweet Data with kerasformula


Overview
The kerasformula package offers a high-level interface for the R interface to Keras. It’s main interface is the kms function, a regression-style interface to keras_model_sequential that uses formulas and sparse matrices.
The kerasformula package is available on CRAN, and can be installed with:

# install the kerasformula package
install.packages(“kerasformula”)
# or devtools::install_github(“rdrr1990/kerasformula”)

library(kerasformula)

# install the core keras library (if you haven’t already done so)
# see ?install_keras() for options e.g. install_keras(tensorflow = “gpu”)
install_keras()

The kms() function
Many classic machine learning tutorials assume that data come in a relatively homogenous form (e.g., pixels for digit recognition or word counts or ranks) which can make coding somewhat cumbersome when data is contained in a heterogenous data frame. kms() takes advantage of the flexibility of R formulas to smooth this process.
kms builds dense neural nets and, after fitting them, returns a single object with predictions, measures of fit, and details about the function call. kms accepts a number of parameters including the loss and activation functions found in keras. kms also accepts compiled keras_model_sequential objects allowing for even further customization. This little demo shows how kms can aid is model building and hyperparameter selection (e.g., batch size) starting with raw data gathered using library(rtweet).

Let’s look at #rstats tweets (excluding retweets) for a six-day period ending January 24, 2018 at 10:40. This happens to give us a nice reasonable number of observations to work with in terms of runtime (and the purpose of this document is to show syntax, not build particularly predictive models).

rstats <- search_tweets(“#rstats”, n = 10000, include_rts = FALSE)
dim(rstats)

[1] 2840 42
Suppose our goal is to predict how popular tweets will be based on how often the tweet was retweeted and favorited (which correlate strongly).

cor(rstats$favorite_count, rstats$retweet_count, method=”spearman”)

[1] 0.7051952
Since few tweeets go viral, the data are quite skewed towards zero.

Getting the most out of formulas
Let’s suppose we are interested in putting tweets into categories based on popularity but we’re not sure how finely-grained we want to make distinctions. Some of the data, like rstats$mentions_screen_name comes in a list of varying lengths, so let’s write a helper function to count non-NA entries.

Let’s start with a dense neural net, the default of kms. We can use base R functions to help clean the data–in this case, cut to discretize the outcome, grepl to look for key words, and weekdays and format to capture different aspects of the time the tweet was posted.

breaks <- c(-1, 0, 1, 10, 100, 1000, 10000)
popularity <- kms(cut(retweet_count + favorite_count, breaks) ~ screen_name +
source + n(hashtags) + n(mentions_screen_name) +
n(urls_url) + nchar(text) +
grepl(‘photo’, media_type) +
weekdays(created_at) +
format(created_at, ‘%H’), rstats)
plot(popularity$history)
+ ggtitle(paste(“#rstat popularity:”,
paste0(round(100*popularity$evaluations$acc, 1), “%”),
“out-of-sample accuracy”))
+ theme_minimal()

popularity$confusion

popularity$confusion

(-1,0] (0,1] (1,10] (10,100] (100,1e+03] (1e+03,1e+04]
(-1,0] 37 12 28 2 0 0
(0,1] 14 19 72 1 0 0
(1,10] 6 11 187 30 0 0
(10,100] 1 3 54 68 0 0
(100,1e+03] 0 0 4 10 0 0
(1e+03,1e+04] 0 0 0 1 0 0
The model only classifies about 55% of the out-of-sample data correctly and that predictive accuracy doesn’t improve after the first ten epochs. The confusion matrix suggests that model does best with tweets that are retweeted a handful of times but overpredicts the 1-10 level. The history plot also suggests that out-of-sample accuracy is not very stable. We can easily change the breakpoints and number of epochs.

breaks <- c(-1, 0, 1, 25, 50, 75, 100, 500, 1000, 10000)
popularity <- kms(cut(retweet_count + favorite_count, breaks) ~
n(hashtags) + n(mentions_screen_name) + n(urls_url) +
nchar(text) +
screen_name + source +
grepl(‘photo’, media_type) +
weekdays(created_at) +
format(created_at, ‘%H’), rstats, Nepochs = 10)

plot(popularity$history)
+ ggtitle(paste(“#rstat popularity (new breakpoints):”,
paste0(round(100*popularity$evaluations$acc, 1), “%”),
“out-of-sample accuracy”))
+ theme_minimal()

That helped some (about 5% additional predictive accuracy). Suppose we want to add a little more data. Let’s first store the input formula.

pop_input <- “cut(retweet_count + favorite_count, breaks) ~
n(hashtags) + n(mentions_screen_name) + n(urls_url) +
nchar(text) +
screen_name + source +
grepl(‘photo’, media_type) +
weekdays(created_at) +
format(created_at, ‘%H’)”

Here we use paste0 to add to the formula by looping over user IDs adding something like:
grepl(“12233344455556″, mentions_user_id)

mentions <- unlist(rstats$mentions_user_id)
mentions <- unique(mentions[which(table(mentions) > 5)]) # remove infrequent
mentions <- mentions[!is.na(mentions)] # drop NA

for(i in mentions)
pop_input <- paste0(pop_input, ” + “, “grepl(“, i, “, mentions_user_id)”)

popularity <- kms(pop_input, rstats)

That helped a touch but the predictive accuracy is still fairly unstable across epochs…
Customizing layers with kms()
We could add more data, perhaps add individual words from the text or some other summary stat (mean(text %in% LETTERS) to see if all caps explains popularity). But let’s alter the neural net.
The input.formula is used to create a sparse model matrix. For example, rstats$source (Twitter or Twitter-client application type) and rstats$screen_name are character vectors that will be dummied out. How many columns does it have?

[1] 1277
Say we wanted to reshape the layers to transition more gradually from the input shape to the output.

popularity <- kms(pop_input, rstats,
layers = list(
units = c(1024, 512, 256, 128, NA),
activation = c(“relu”, “relu”, “relu”, “relu”, “softmax”),
dropout = c(0.5, 0.45, 0.4, 0.35, NA)
))

kms builds a keras_sequential_model(), which is a stack of linear layers. The input shape is determined by the dimensionality of the model matrix (popularity$P) but after that users are free to determine the number of layers and so on. The kms argument layers expects a list, the first entry of which is a vector units with which to call keras::layer_dense(). The first element the number of units in the first layer, the second element for the second layer, and so on (NA as the final element connotes to auto-detect the final number of units based on the observed number of outcomes). activation is also passed to layer_dense() and may take values such as softmax, relu, elu, and linear. (kms also has a separate parameter to control the optimizer; by default kms(… optimizer=”rms_prop”).) The dropout that follows each dense layer rate prevents overfitting (but of course isn’t applicable to the final layer).
Choosing a Batch Size
By default, kms uses batches of 32. Suppose we were happy with our model but didn’t have any particular intuition about what the size should be.

Nbatch <- c(16, 32, 64)
Nruns <- 4
accuracy <- matrix(nrow = Nruns, ncol = length(Nbatch))
colnames(accuracy) <- paste0(“Nbatch_”, Nbatch)

est <- list()
for(i in 1:Nruns){
for(j in 1:length(Nbatch)){
est[[i]] <- kms(pop_input, rstats, Nepochs = 2, batch_size = Nbatch[j])
accuracy[i,j] <- est[[i]][[“evaluations”]][[“acc”]]
}
}

colMeans(accuracy)

Nbatch_16 Nbatch_32 Nbatch_64
0.5088407 0.3820850 0.5556952
For the sake of curtailing runtime, the number of epochs was set arbitrarily short but, from those results, 64 is the best batch size.
Making predictions for new data
Thus far, we have been using the default settings for kms which first splits data into 80% training and 20% testing. Of the 80% training, a certain portion is set aside for validation and that’s what produces the epoch-by-epoch graphs of loss and accuracy. The 20% is only used at the end to assess predictive accuracy.
But suppose you wanted to make predictions on a new data set…

popularity <- kms(pop_input, rstats[1:1000,])
predictions <- predict(popularity, rstats[1001:2000,])
predictions$accuracy

[1] 0.579

Because the formula creates a dummy variable for each screen name and mention, any given set of tweets is all but guaranteed to have different columns. predict.kms_fit is an S3 method that takes the new data and constructs a (sparse) model matrix that preserves the original structure of the training matrix. predict then returns the predictions along with a confusion matrix and accuracy score.
If your newdata has the same observed levels of y and columns of x_train (the model matrix), you can also use keras::predict_classes on object$model.
Using a compiled Keras model
This section shows how to input a model compiled in the fashion typical to library(keras), which is useful for more advanced models. Here is an example for lstm analogous to the imbd with Keras example.

k <- keras_model_sequential()
k %>%
layer_embedding(input_dim = popularity$P, output_dim = popularity$P) %>%
layer_lstm(units = 512, dropout = 0.4, recurrent_dropout = 0.2) %>%
layer_dense(units = 256, activation = “relu”) %>%
layer_dropout(0.3) %>%
layer_dense(units = 8, # number of levels observed on y (outcome)
activation = ‘sigmoid’)

k %>% compile(
loss = ‘categorical_crossentropy’,
optimizer = ‘rmsprop’,
metrics = c(‘accuracy’)
)

popularity_lstm <- kms(pop_input, rstats, k)

Drop me a line via the project’s Github repo. Special thanks to @dfalbel and @jjallaire for helpful suggestions!!

Enjoy this blog? Get notified of new posts by email:

Posts also available at r-bloggers

━ more like this

Newbury BS cuts resi, expat, landlord rates by up to 30bps  – Mortgage Strategy

Newbury Building Society has cut fixed-rate offers by up to 30 basis points across a range of mortgage products including standard residential, shared...

Rate and Term Refinances Are Up a Whopping 300% from a Year Ago

What a difference a year makes.While the mortgage industry has been purchase loan-heavy for several years now, it could finally be starting to shift.A...

Goldman Sachs loses profit after hits from GreenSky, real estate

Second-quarter profit fell 58% to $1.22 billion, or $3.08 a share, due to steep declines in trading and investment banking and losses related to...

Building Data Science Pipelines Using Pandas

Image generated with ChatGPT   Pandas is one of the most popular data manipulation and analysis tools available, known for its ease of use and powerful...

#240 – Neal Stephenson: Sci-Fi, Space, Aliens, AI, VR & the Future of Humanity

Podcast: Play in new window | DownloadSubscribe: Spotify | TuneIn | Neal Stephenson is a sci-fi writer (Snow Crash, Cryptonomicon, and new book Termination...