Loading CSV data
Note: this is the R version of this tutorial from the official TensorFlow website.
This tutorial provides an example of how to load CSV data from a file into a TensorFlow Dataset using tfdatasets.
The data used in this tutorial are taken from the Titanic passenger list. The model will predict the likelihood a passenger survived based on characteristics like age, gender, ticket class, and whether the person was traveling alone.
Setup
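The code below assumes the keras and tfdatasets packages are attached:
library(keras)
library(tfdatasets)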
TRAIN_DATA_URL <- "https://storage.googleapis.com/tf-datasets/titanic/train.csv"
TEST_DATA_URL <- "https://storage.googleapis.com/tf-datasets/titanic/eval.csv"
train_file_path <- get_file("train_csv", TRAIN_DATA_URL)
test_file_path <- get_file("eval.csv", TEST_DATA_URL)
You could load this using read.csv and pass the arrays to TensorFlow. If you need to scale up to a large set of files, or need a loader that integrates with TensorFlow and tfdatasets, use the make_csv_dataset function.
Now read the CSV data from the file and create a dataset.
train_dataset <- make_csv_dataset(
train_file_path,
field_delim = ",",
batch_size = 5,
num_epochs = 1
)
test_dataset <- make_csv_dataset(
test_file_path,
field_delim = ",",
batch_size = 5,
num_epochs = 1
)
We can see an element of the dataset with:
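# convert the dataset to an iterator and pull a single batch
train_dataset %>%
  reticulate::as_iterator() %>%
  reticulate::iter_next()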
## $survived
## tf.Tensor([0 0 1 1 0], shape=(5,), dtype=int32)
##
## $sex
## tf.Tensor([b'male' b'male' b'female' b'female' b'male'], shape=(5,), dtype=string)
##
## $age
## tf.Tensor([20. 54. 28. 29. 45.5], shape=(5,), dtype=float32)
##
## $n_siblings_spouses
## tf.Tensor([0 0 1 0 0], shape=(5,), dtype=int32)
##
## $parch
## tf.Tensor([0 0 0 2 0], shape=(5,), dtype=int32)
##
## $fare
## tf.Tensor([ 7.925 51.8625 26. 15.2458 28.5 ], shape=(5,), dtype=float32)
##
## $class
## tf.Tensor([b'Third' b'First' b'Second' b'Third' b'First'], shape=(5,), dtype=string)
##
## $deck
## tf.Tensor([b'unknown' b'E' b'unknown' b'unknown' b'C'], shape=(5,), dtype=string)
##
## $embark_town
## tf.Tensor([b'Southampton' b'Southampton' b'Southampton' b'Cherbourg' b'Southampton'], shape=(5,), dtype=string)
##
## $alone
## tf.Tensor([b'y' b'y' b'n' b'n' b'y'], shape=(5,), dtype=string)
You can see that make_csv_dataset creates a list of Tensors, each representing a column. This looks a lot like R’s data.frame; the most significant difference is that a TensorFlow dataset is an iterator, meaning that each time you call iter_next it will yield a different batch of rows from the dataset.
As you can see above, the columns in the CSV are named. The dataset constructor will pick these names up automatically. If the file you are working with does not contain the column names in the first line, pass them in a character vector to the column_names argument of the make_csv_dataset function.
If you need to omit some columns from the dataset, create a list of just the columns you plan to use, and pass it into the (optional) select_columns argument of the constructor.
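For example, a quick sketch (the titanic_subset name is hypothetical, and the column names are the Titanic columns shown above) might look like:
# illustrative only: keep just a handful of columns
titanic_subset <- make_csv_dataset(
  train_file_path,
  field_delim = ",",
  batch_size = 5,
  num_epochs = 1,
  select_columns = list("survived", "sex", "age", "class")
)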
Data preprocessing
A CSV file can contain a variety of data types. Typically you want to convert from those mixed types to a fixed length vector before feeding the data into your model.
You can preprocess your data using any tool you like (like nltk or sklearn), and just pass the processed output to TensorFlow.
TensorFlow has a built-in system for describing common input conversions: feature_column, which we are going to use via the high-level interface called feature_spec.
The primary advantage of doing the preprocessing inside your model is that when you export the model it includes the preprocessing. This way you can pass the raw data directly to your model.
First let’s define the spec:
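# the spec is built from the training dataset and a formula; survived is the response
spec <- feature_spec(train_dataset, survived ~ .)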
We can now add steps to our spec describing how to transform our data.
Continuous data
For continuous data we use the step_numeric_column:
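# declare every numeric column as a numeric feature column
spec <- spec %>%
  step_numeric_column(all_numeric())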
After adding a step we need to fit our spec:
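# fit the spec to the training data
spec <- fit(spec)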
We can then create a layer_dense_features that receives our dataset as input and returns an array containing all the dense features:
layer <- layer_dense_features(feature_columns = dense_features(spec))
train_dataset %>%
reticulate::as_iterator() %>%
reticulate::iter_next() %>%
layer()
## tf.Tensor(
## [[48. 34.375 1. 3. ]
## [40. 7.225 0. 0. ]
## [28. 8.7125 0. 0. ]
## [42. 8.4042 0. 1. ]
## [33. 7.8958 0. 0. ]], shape=(5, 4), dtype=float32)
It’s usually a good idea to normalize all numeric features in a neural network. We can use the same step_numeric_column with an additional argument, normalizer_fn:
spec <- feature_spec(train_dataset, survived ~ .)
spec <- spec %>%
step_numeric_column(all_numeric(), normalizer_fn = scaler_standard())
We can then fit the spec and create the layer_dense_features to take a look at the output:
spec <- fit(spec)
layer <- layer_dense_features(feature_columns = dense_features(spec))
train_dataset %>%
reticulate::as_iterator() %>%
reticulate::iter_next() %>%
layer()
## tf.Tensor(
## [[-1.6755021 0.02076224 0.54201424 1.9260981 ]
## [-0.0509259 -0.13340482 0.54201424 0.73245984]
## [ 1.8561853 -0.3152102 -0.48006976 -0.4611784 ]
## [ 1.644284 -0.38669372 -0.48006976 -0.4611784 ]
## [-0.26282716 0.4161861 0.54201424 1.9260981 ]], shape=(5, 4), dtype=float32)
Now, the outputs are scaled.
Categorical data
Categorical data can’t be directly included in the model matrix - we need to perform some kind of transformation in order to represent them as numbers. Representing categorical variables as a set of one-hot encoded columns is very common in practice.
We can also perform this transformation using the feature_spec API. Let’s again define our spec and add some steps:
spec <- feature_spec(train_dataset, survived ~ .)
spec <- spec %>%
step_categorical_column_with_vocabulary_list(sex) %>%
step_indicator_column(sex)
We can now see the output with:
spec <- fit(spec)
layer <- layer_dense_features(feature_columns = dense_features(spec))
train_dataset %>%
reticulate::as_iterator() %>%
reticulate::iter_next() %>%
layer()
## tf.Tensor(
## [[0. 1.]
## [0. 1.]
## [1. 0.]
## [1. 0.]
## [0. 1.]], shape=(5, 2), dtype=float32)
We can see that this generates 2 columns, one for each different category in the sex column of the dataset.
It’s straightforward to make this transformation for all the categorical features in the dataset:
spec <- feature_spec(train_dataset, survived ~ .)
spec <- spec %>%
step_categorical_column_with_vocabulary_list(all_nominal()) %>%
step_indicator_column(all_nominal())
Now let’s see the output:
spec <- fit(spec)
layer <- layer_dense_features(feature_columns = dense_features(spec))
train_dataset %>%
reticulate::as_iterator() %>%
reticulate::iter_next() %>%
layer()
## tf.Tensor(
## [[0. 1. 1. 0. 0. 0. 0. 0. 0. 1. 0. 0. 0. 0. 1. 0. 0. 1.]
## [1. 0. 0. 0. 1. 0. 0. 0. 0. 1. 0. 0. 0. 0. 1. 0. 1. 0.]
## [0. 1. 0. 0. 1. 0. 0. 0. 0. 0. 0. 1. 0. 1. 0. 0. 0. 1.]
## [0. 1. 0. 0. 1. 0. 0. 0. 0. 0. 0. 1. 0. 0. 1. 0. 0. 1.]
## [0. 1. 1. 0. 0. 0. 0. 1. 0. 0. 0. 0. 1. 0. 0. 0. 0. 1.]], shape=(5, 18), dtype=float32)
Combining everything
We demonstrated how to use the feature_spec interface for both continuous and categorical data separately. It’s also possible to combine all the transformations into a single spec:
spec <- feature_spec(train_dataset, survived ~ .) %>%
step_numeric_column(all_numeric(), normalizer_fn = scaler_standard()) %>%
step_categorical_column_with_vocabulary_list(all_nominal()) %>%
step_indicator_column(all_nominal())
Now, let’s fit the spec and take a look at the output:
spec <- fit(spec)
layer <- layer_dense_features(feature_columns = dense_features(spec))
train_dataset %>%
reticulate::as_iterator() %>%
reticulate::iter_next() %>%
layer()
## tf.Tensor(
## [[ 0.93794656 0. 1. 0. 0. 1.
## 0. 0. 0. 0. 0. 1.
## 0. 0. 0. 1. 0. -0.5539651
## -0.48006976 -0.4611784 0. 1. ]
## [-0.0509259 0. 1. 0. 0. 1.
## 0. 0. 0. 0. 0. 0.
## 1. 1. 0. 0. 0. -0.5659972
## -0.48006976 -0.4611784 0. 1. ]
## [-1.9700449 1. 0. 0. 1. 0.
## 0. 0. 0. 0. 0. 0.
## 1. 0. 0. 1. 0. -0.23657836
## 0.54201424 0.73245984 0. 1. ]
## [-0.5453621 0. 1. 0. 0. 1.
## 0. 0. 0. 0. 0. 0.
## 1. 0. 0. 1. 0. -0.5461019
## -0.48006976 -0.4611784 0. 1. ]
## [-0.8985309 1. 0. 0. 0. 1.
## 0. 0. 0. 0. 0. 0.
## 1. 0. 0. 1. 0. 0.5683259
## 4.63035 1.9260981 1. 0. ]], shape=(5, 22), dtype=float32)
This concludes our data preprocessing step, and we can now focus on building and training a model.
Building the model
We will use the Keras sequential API to build a model that uses the dense features we have defined in the spec:
model <- keras_model_sequential() %>%
layer_dense_features(feature_columns = dense_features(spec)) %>%
layer_dense(units = 128, activation = "relu") %>%
layer_dense(units = 128, activation = "relu") %>%
layer_dense(units = 1, activation = "sigmoid")
model %>% compile(
loss = "binary_crossentropy",
optimizer = "adam",
metrics = "accuracy"
)
Train, evaluate and predict
Now the model can be instantiated and trained.
model %>%
fit(
train_dataset %>% dataset_use_spec(spec) %>% dataset_shuffle(500),
epochs = 20,
validation_data = test_dataset %>% dataset_use_spec(spec),
verbose = 2
)
## Epoch 1/20
## 53/53 - 1s - loss: 0.5752 - accuracy: 0.7197 - val_loss: 0.0000e+00 - val_accuracy: 0.0000e+00
## Epoch 2/20
## 53/53 - 0s - loss: 0.4546 - accuracy: 0.8144 - val_loss: 0.4310 - val_accuracy: 0.8068
## Epoch 3/20
## 53/53 - 0s - loss: 0.4273 - accuracy: 0.8068 - val_loss: 0.4009 - val_accuracy: 0.8371
## Epoch 4/20
## 53/53 - 0s - loss: 0.4070 - accuracy: 0.8333 - val_loss: 0.3742 - val_accuracy: 0.8598
## Epoch 5/20
## 53/53 - 0s - loss: 0.3938 - accuracy: 0.8561 - val_loss: 0.3579 - val_accuracy: 0.8523
## Epoch 6/20
## 53/53 - 0s - loss: 0.3824 - accuracy: 0.8333 - val_loss: 0.3593 - val_accuracy: 0.8598
## Epoch 7/20
## 53/53 - 0s - loss: 0.3675 - accuracy: 0.8182 - val_loss: 0.3362 - val_accuracy: 0.8750
## Epoch 8/20
## 53/53 - 0s - loss: 0.3647 - accuracy: 0.8598 - val_loss: 0.3340 - val_accuracy: 0.8598
## Epoch 9/20
## 53/53 - 0s - loss: 0.3317 - accuracy: 0.8788 - val_loss: 0.3158 - val_accuracy: 0.8902
## Epoch 10/20
## 53/53 - 0s - loss: 0.3411 - accuracy: 0.8750 - val_loss: 0.3180 - val_accuracy: 0.8864
## Epoch 11/20
## 53/53 - 0s - loss: 0.3312 - accuracy: 0.8826 - val_loss: 0.3023 - val_accuracy: 0.8826
## Epoch 12/20
## 53/53 - 0s - loss: 0.3209 - accuracy: 0.8598 - val_loss: 0.3082 - val_accuracy: 0.8826
## Epoch 13/20
## 53/53 - 0s - loss: 0.3173 - accuracy: 0.8712 - val_loss: 0.2955 - val_accuracy: 0.8864
## Epoch 14/20
## 53/53 - 0s - loss: 0.3176 - accuracy: 0.8788 - val_loss: 0.2820 - val_accuracy: 0.8902
## Epoch 15/20
## 53/53 - 0s - loss: 0.3172 - accuracy: 0.8561 - val_loss: 0.2811 - val_accuracy: 0.8977
## Epoch 16/20
## 53/53 - 0s - loss: 0.3068 - accuracy: 0.8902 - val_loss: 0.2851 - val_accuracy: 0.8939
## Epoch 17/20
## 53/53 - 0s - loss: 0.2970 - accuracy: 0.8750 - val_loss: 0.2674 - val_accuracy: 0.9053
## Epoch 18/20
## 53/53 - 0s - loss: 0.2874 - accuracy: 0.8826 - val_loss: 0.2665 - val_accuracy: 0.8977
## Epoch 19/20
## 53/53 - 0s - loss: 0.2823 - accuracy: 0.8977 - val_loss: 0.2544 - val_accuracy: 0.9091
## Epoch 20/20
## 53/53 - 0s - loss: 0.2818 - accuracy: 0.8826 - val_loss: 0.2601 - val_accuracy: 0.9091
Once the model is trained, you can check its accuracy on the test_dataset:
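# evaluate on the test dataset, applying the same spec used during training
model %>%
  evaluate(test_dataset %>% dataset_use_spec(spec), verbose = 0)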
## $loss
## [1] 0.2602749
##
## $accuracy
## [1] 0.9090909
You can also use predict to infer labels on a batch or a dataset of batches:
batch <- test_dataset %>%
reticulate::as_iterator() %>%
reticulate::iter_next() %>%
reticulate::py_to_r()
predict(model, batch)
## [,1]
## [1,] 0.01933812
## [2,] 0.17113033
## [3,] 0.07307167
## [4,] 0.98227388
## [5,] 0.98028392