Loading CSV data
Note: this is the R version of this tutorial from the official TensorFlow website.
This tutorial provides an example of how to load CSV data from a file into a TensorFlow Dataset using tfdatasets.
The data used in this tutorial are taken from the Titanic passenger list. The model will predict the likelihood a passenger survived based on characteristics like age, gender, ticket class, and whether the person was traveling alone.
Setup
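The code below assumes the keras and tfdatasets packages are attached:
library(keras)
library(tfdatasets)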
TRAIN_DATA_URL <- "https://storage.googleapis.com/tf-datasets/titanic/train.csv"
TEST_DATA_URL <- "https://storage.googleapis.com/tf-datasets/titanic/eval.csv"
train_file_path <- get_file("train_csv", TRAIN_DATA_URL)
test_file_path <- get_file("eval.csv", TEST_DATA_URL)
You could load this using read.csv and pass the arrays to TensorFlow. If you need to scale up to a large set of files, or need a loader that integrates with TensorFlow and tfdatasets, use the make_csv_dataset function.
Now read the CSV data from the file and create a dataset.
train_dataset <- make_csv_dataset(
train_file_path,
field_delim = ",",
batch_size = 5,
num_epochs = 1
)
test_dataset <- make_csv_dataset(
test_file_path,
field_delim = ",",
batch_size = 5,
num_epochs = 1
)
We can see an element of the dataset with:
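# convert the dataset to an iterator and pull a single batch
train_dataset %>%
  reticulate::as_iterator() %>%
  reticulate::iter_next()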
## $survived
## tf.Tensor([0 0 1 1 0], shape=(5,), dtype=int32)
##
## $sex
## tf.Tensor([b'male' b'male' b'female' b'female' b'male'], shape=(5,), dtype=string)
##
## $age
## tf.Tensor([20. 54. 28. 29. 45.5], shape=(5,), dtype=float32)
##
## $n_siblings_spouses
## tf.Tensor([0 0 1 0 0], shape=(5,), dtype=int32)
##
## $parch
## tf.Tensor([0 0 0 2 0], shape=(5,), dtype=int32)
##
## $fare
## tf.Tensor([ 7.925 51.8625 26. 15.2458 28.5 ], shape=(5,), dtype=float32)
##
## $class
## tf.Tensor([b'Third' b'First' b'Second' b'Third' b'First'], shape=(5,), dtype=string)
##
## $deck
## tf.Tensor([b'unknown' b'E' b'unknown' b'unknown' b'C'], shape=(5,), dtype=string)
##
## $embark_town
## tf.Tensor([b'Southampton' b'Southampton' b'Southampton' b'Cherbourg' b'Southampton'], shape=(5,), dtype=string)
##
## $alone
## tf.Tensor([b'y' b'y' b'n' b'n' b'y'], shape=(5,), dtype=string)
You can see that make_csv_dataset creates a list of Tensors, each representing a column. This looks a lot like R’s data.frame; the most significant difference is that a TensorFlow dataset is an iterator, meaning that each time you call iter_next it will yield a different batch of rows from the dataset.
As you can see above, the columns in the CSV are named. The dataset constructor will pick these names up automatically. If the file you are working with does not contain the column names in the first line, pass them in a character vector to the column_names argument of the make_csv_dataset function.
If you need to omit some columns from the dataset, create a list of just the columns you plan to use, and pass it into the (optional) select_columns argument of the constructor.
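For example, a quick sketch (the titanic_subset name is hypothetical, and the column names are the Titanic columns shown above) might look like:
# illustrative only: keep just a handful of columns
titanic_subset <- make_csv_dataset(
  train_file_path,
  field_delim = ",",
  batch_size = 5,
  num_epochs = 1,
  select_columns = list("survived", "sex", "age", "class")
)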
Data preprocessing
A CSV file can contain a variety of data types. Typically you want to convert from those mixed types to a fixed length vector before feeding the data into your model.
You can preprocess your data using any tool you like (like nltk or sklearn), and just pass the processed output to TensorFlow.
TensorFlow has a built-in system for describing common input conversions: feature_column, which we are going to use via the high-level interface called feature_spec.
The primary advantage of doing the preprocessing inside your model is that when you export the model it includes the preprocessing. This way you can pass the raw data directly to your model.
First let’s define the spec:
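# the spec is built from the training dataset and a formula; survived is the response
spec <- feature_spec(train_dataset, survived ~ .)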
We can now add steps to our spec describing how to transform our data.
Continuous data
For continuous data we use the step_numeric_column:
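# declare every numeric column as a numeric feature column
spec <- spec %>%
  step_numeric_column(all_numeric())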
After adding a step we need to fit our spec:
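# fit the spec to the training data
spec <- fit(spec)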
We can then create a layer_dense_features that receives our dataset as input and returns an array containing all the dense features:
layer <- layer_dense_features(feature_columns = dense_features(spec))
train_dataset %>%
reticulate::as_iterator() %>%
reticulate::iter_next() %>%
layer()
## tf.Tensor(
## [[48. 34.375 1. 3. ]
## [40. 7.225 0. 0. ]
## [28. 8.7125 0. 0. ]
## [42. 8.4042 0. 1. ]
## [33. 7.8958 0. 0. ]], shape=(5, 4), dtype=float32)
It’s usually a good idea to normalize all numeric features in a neural network. We can use the same step_numeric_column with an additional argument, normalizer_fn:
spec <- feature_spec(train_dataset, survived ~ .)
spec <- spec %>%
step_numeric_column(all_numeric(), normalizer_fn = scaler_standard())
We can then fit the spec and create the layer_dense_features to take a look at the output:
spec <- fit(spec)
layer <- layer_dense_features(feature_columns = dense_features(spec))
train_dataset %>%
reticulate::as_iterator() %>%
reticulate::iter_next() %>%
layer()
## tf.Tensor(
## [[-1.6755021 0.02076224 0.54201424 1.9260981 ]
## [-0.0509259 -0.13340482 0.54201424 0.73245984]
## [ 1.8561853 -0.3152102 -0.48006976 -0.4611784 ]
## [ 1.644284 -0.38669372 -0.48006976 -0.4611784 ]
## [-0.26282716 0.4161861 0.54201424 1.9260981 ]], shape=(5, 4), dtype=float32)
Now, the outputs are scaled.
Categorical data
Categorical data can’t be directly included in the model matrix - we need to perform some kind of transformation in order to represent them as numbers. Representing categorical variables as a set of one-hot encoded columns is very common in practice.
We can also perform this transformation using the feature_spec API. Let’s again define our spec and add some steps:
spec <- feature_spec(train_dataset, survived ~ .)
spec <- spec %>%
step_categorical_column_with_vocabulary_list(sex) %>%
step_indicator_column(sex)
We can now see the output with:
spec <- fit(spec)
layer <- layer_dense_features(feature_columns = dense_features(spec))
train_dataset %>%
reticulate::as_iterator() %>%
reticulate::iter_next() %>%
layer()
## tf.Tensor(
## [[0. 1.]
## [0. 1.]
## [1. 0.]
## [1. 0.]
## [0. 1.]], shape=(5, 2), dtype=float32)
We can see that this generates 2 columns, one for each different category in the sex column of the dataset.
It’s straightforward to make this transformation for all the categorical features in the dataset:
spec <- feature_spec(train_dataset, survived ~ .)
spec <- spec %>%
step_categorical_column_with_vocabulary_list(all_nominal()) %>%
step_indicator_column(all_nominal())
Now let’s see the output:
spec <- fit(spec)
layer <- layer_dense_features(feature_columns = dense_features(spec))
train_dataset %>%
reticulate::as_iterator() %>%
reticulate::iter_next() %>%
layer()
## tf.Tensor(
## [[0. 1. 1. 0. 0. 0. 0. 0. 0. 1. 0. 0. 0. 0. 1. 0. 0. 1.]
## [1. 0. 0. 0. 1. 0. 0. 0. 0. 1. 0. 0. 0. 0. 1. 0. 1. 0.]
## [0. 1. 0. 0. 1. 0. 0. 0. 0. 0. 0. 1. 0. 1. 0. 0. 0. 1.]
## [0. 1. 0. 0. 1. 0. 0. 0. 0. 0. 0. 1. 0. 0. 1. 0. 0. 1.]
## [0. 1. 1. 0. 0. 0. 0. 1. 0. 0. 0. 0. 1. 0. 0. 0. 0. 1.]], shape=(5, 18), dtype=float32)
Combining everything
We demonstrated how to use the feature_spec interface for both continuous and categorical data separately. It’s also possible to combine all the transformations into a single spec:
spec <- feature_spec(train_dataset, survived ~ .) %>%
step_numeric_column(all_numeric(), normalizer_fn = scaler_standard()) %>%
step_categorical_column_with_vocabulary_list(all_nominal()) %>%
step_indicator_column(all_nominal())
Now, let’s fit the spec and take a look at the output:
spec <- fit(spec)
layer <- layer_dense_features(feature_columns = dense_features(spec))
train_dataset %>%
reticulate::as_iterator() %>%
reticulate::iter_next() %>%
layer()
## tf.Tensor(
## [[ 0.93794656 0. 1. 0. 0. 1.
## 0. 0. 0. 0. 0. 1.
## 0. 0. 0. 1. 0. -0.5539651
## -0.48006976 -0.4611784 0. 1. ]
## [-0.0509259 0. 1. 0. 0. 1.
## 0. 0. 0. 0. 0. 0.
## 1. 1. 0. 0. 0. -0.5659972
## -0.48006976 -0.4611784 0. 1. ]
## [-1.9700449 1. 0. 0. 1. 0.
## 0. 0. 0. 0. 0. 0.
## 1. 0. 0. 1. 0. -0.23657836
## 0.54201424 0.73245984 0. 1. ]
## [-0.5453621 0. 1. 0. 0. 1.
## 0. 0. 0. 0. 0. 0.
## 1. 0. 0. 1. 0. -0.5461019
## -0.48006976 -0.4611784 0. 1. ]
## [-0.8985309 1. 0. 0. 0. 1.
## 0. 0. 0. 0. 0. 0.
## 1. 0. 0. 1. 0. 0.5683259
## 4.63035 1.9260981 1. 0. ]], shape=(5, 22), dtype=float32)
This concludes our data preprocessing step, and we can now focus on building and training a model.
Building the model
We will use the Keras sequential API to build a model that uses the dense features we have defined in the spec:
model <- keras_model_sequential() %>%
layer_dense_features(feature_columns = dense_features(spec)) %>%
layer_dense(units = 128, activation = "relu") %>%
layer_dense(units = 128, activation = "relu") %>%
layer_dense(units = 1, activation = "sigmoid")
model %>% compile(
loss = "binary_crossentropy",
optimizer = "adam",
metrics = "accuracy"
)
Train, evaluate and predict
Now the model can be instantiated and trained.
model %>%
fit(
train_dataset %>% dataset_use_spec(spec) %>% dataset_shuffle(500),
epochs = 20,
validation_data = test_dataset %>% dataset_use_spec(spec),
verbose = 2
)
## Epoch 1/20
## 53/53 - 1s - loss: 0.5752 - accuracy: 0.7197 - val_loss: 0.0000e+00 - val_accuracy: 0.0000e+00
## Epoch 2/20
## 53/53 - 0s - loss: 0.4546 - accuracy: 0.8144 - val_loss: 0.4310 - val_accuracy: 0.8068
## Epoch 3/20
## 53/53 - 0s - loss: 0.4273 - accuracy: 0.8068 - val_loss: 0.4009 - val_accuracy: 0.8371
## Epoch 4/20
## 53/53 - 0s - loss: 0.4070 - accuracy: 0.8333 - val_loss: 0.3742 - val_accuracy: 0.8598
## Epoch 5/20
## 53/53 - 0s - loss: 0.3938 - accuracy: 0.8561 - val_loss: 0.3579 - val_accuracy: 0.8523
## Epoch 6/20
## 53/53 - 0s - loss: 0.3824 - accuracy: 0.8333 - val_loss: 0.3593 - val_accuracy: 0.8598
## Epoch 7/20
## 53/53 - 0s - loss: 0.3675 - accuracy: 0.8182 - val_loss: 0.3362 - val_accuracy: 0.8750
## Epoch 8/20
## 53/53 - 0s - loss: 0.3647 - accuracy: 0.8598 - val_loss: 0.3340 - val_accuracy: 0.8598
## Epoch 9/20
## 53/53 - 0s - loss: 0.3317 - accuracy: 0.8788 - val_loss: 0.3158 - val_accuracy: 0.8902
## Epoch 10/20
## 53/53 - 0s - loss: 0.3411 - accuracy: 0.8750 - val_loss: 0.3180 - val_accuracy: 0.8864
## Epoch 11/20
## 53/53 - 0s - loss: 0.3312 - accuracy: 0.8826 - val_loss: 0.3023 - val_accuracy: 0.8826
## Epoch 12/20
## 53/53 - 0s - loss: 0.3209 - accuracy: 0.8598 - val_loss: 0.3082 - val_accuracy: 0.8826
## Epoch 13/20
## 53/53 - 0s - loss: 0.3173 - accuracy: 0.8712 - val_loss: 0.2955 - val_accuracy: 0.8864
## Epoch 14/20
## 53/53 - 0s - loss: 0.3176 - accuracy: 0.8788 - val_loss: 0.2820 - val_accuracy: 0.8902
## Epoch 15/20
## 53/53 - 0s - loss: 0.3172 - accuracy: 0.8561 - val_loss: 0.2811 - val_accuracy: 0.8977
## Epoch 16/20
## 53/53 - 0s - loss: 0.3068 - accuracy: 0.8902 - val_loss: 0.2851 - val_accuracy: 0.8939
## Epoch 17/20
## 53/53 - 0s - loss: 0.2970 - accuracy: 0.8750 - val_loss: 0.2674 - val_accuracy: 0.9053
## Epoch 18/20
## 53/53 - 0s - loss: 0.2874 - accuracy: 0.8826 - val_loss: 0.2665 - val_accuracy: 0.8977
## Epoch 19/20
## 53/53 - 0s - loss: 0.2823 - accuracy: 0.8977 - val_loss: 0.2544 - val_accuracy: 0.9091
## Epoch 20/20
## 53/53 - 0s - loss: 0.2818 - accuracy: 0.8826 - val_loss: 0.2601 - val_accuracy: 0.9091
Once the model is trained, you can check its accuracy on the test_dataset:
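# evaluate on the test dataset, applying the same spec used during training
model %>%
  evaluate(test_dataset %>% dataset_use_spec(spec), verbose = 0)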
## $loss
## [1] 0.2602749
##
## $accuracy
## [1] 0.9090909
You can also use predict to infer labels on a batch or a dataset of batches:
batch <- test_dataset %>%
reticulate::as_iterator() %>%
reticulate::iter_next() %>%
reticulate::py_to_r()
predict(model, batch)
## [,1]
## [1,] 0.01933812
## [2,] 0.17113033
## [3,] 0.07307167
## [4,] 0.98227388
## [5,] 0.98028392