
Conversion of Movie-review data to one-hot encoding

By: Sören Dobberschütz

Re-posted from: https://tensorflowjulia.blogspot.com/2018/09/conversion-of-movie-review-data-to-one.html

In the last post, we obtained the files test_data.h5 and train_data.h5, containing text data from movie reviews (from the ACL 2011 IMDB dataset). In the next exercise, we need to access a one-hot encoded version of these files, based on a large vocabulary. The following code converts the data and stores it on disk for later use. It takes about two hours to run on my laptop and uses 13GB of storage for the converted file.

The Jupyter notebook can be downloaded here.

 

# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# https://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.

Conversion of Movie-review data to one-hot encoding

The final exercise of Google’s Machine Learning Crash Course uses the ACL 2011 IMDB dataset to train a neural network on movie review data. In this step, we are not concerned with building an input pipeline or with efficient handling and storage of the data.
The following code converts the movie review data we extracted from a .tfrecord file in the previous step to a one-hot encoded matrix and stores it on disk for later use:
In [3]:
using HDF5
using JLD
The following function handles the conversion to a one-hot encoding:
In [4]:
# Function for creating categorical columns from a vocabulary list as a one-hot encoding
function create_data_columns(data, informative_terms)
    onehotmat = zeros(length(data), length(informative_terms))

    for i = 1:length(data)
        str = data[i]
        for j = 1:length(informative_terms)
            if contains(str, informative_terms[j])
                onehotmat[i, j] = 1
            end
        end
    end
    return onehotmat
end
Out[4]:
create_data_columns (generic function with 1 method)
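As a quick illustration (not part of the original notebook), calling the function on two made-up reviews and a two-term vocabulary gives a 2×2 matrix:
create_data_columns(["a great movie", "a terrible plot"], ["great", "terrible"])
# returns [1.0 0.0; 0.0 1.0]: review 1 contains "great", review 2 contains "terrible"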
Let’s load the data from disk:
In [5]:
c = h5open("train_data.h5", "r") do file
global train_labels=read(file, "output_labels")
global train_features=read(file, "output_features")
end
c = h5open("test_data.h5", "r") do file
global test_labels=read(file, "output_labels")
global test_features=read(file, "output_features")
end
train_labels=train_labels'
test_labels=test_labels';
We will use the full vocabulary file, which can be obtained here. Put it in the same folder as the notebook and open it using:
In [6]:
vocabulary = Array{String}(0)
open("terms.txt") do file
    for ln in eachline(file)
        push!(vocabulary, ln)
    end
end
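As a side note, the same vector could also be built in a single call (assuming, as above, that the file contains one term per line):
# Alternative: read all lines of terms.txt at once
vocabulary = readlines("terms.txt")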
We will now create the training and test feature matrices based on the full vocabulary file. This code does not create sparse matrices and takes a long time to run (about 2 hours on my laptop); a sparse alternative is sketched after the cell below.
In [7]:
# This takes a looong time. Only run it once and save the result
train_features_full=create_data_columns(train_features, vocabulary)
test_features_full=create_data_columns(test_features, vocabulary);
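Since most reviews contain only a small fraction of the vocabulary, a sparse variant would reduce the memory footprint considerably. The following sketch is not part of the original notebook; it takes the same data and informative_terms arguments and only stores the nonzero entries:
# Hypothetical sparse variant (not part of the original notebook).
# On Julia 0.7+ add `using SparseArrays` and replace `contains(a, b)` with `occursin(b, a)`.
function create_sparse_data_columns(data, informative_terms)
    I = Int[]   # row indices of nonzero entries
    J = Int[]   # column indices of nonzero entries
    for i = 1:length(data)
        str = data[i]
        for j = 1:length(informative_terms)
            if contains(str, informative_terms[j])
                push!(I, i)
                push!(J, j)
            end
        end
    end
    # one entry of 1.0 per stored (row, column) pair
    return sparse(I, J, ones(length(I)), length(data), length(informative_terms))
end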
Finally, we save the data to disk. Uncompressed, it takes about 13GB of storage.
In [8]:
save("IMDB_fullmatrix_datacolumns.jld", "train_features_full", train_features_full, "test_features_full", test_features_full)

Data Extraction from TFrecord-files

By: Sören Dobberschütz

Re-posted from: https://tensorflowjulia.blogspot.com/2018/09/data-extraction-from-tfrecord-files.html

The last exercise of the Machine Learning Crash Course uses text data from movie reviews (from the ACL 2011 IMDB dataset). The data has been processed into the tf.Example format and can be downloaded as .tfrecord files from Google’s servers.

Tensorflow.jl does not support this file type, so in order to follow the exercise, we need to extract the data from the tfrecord dataset. This Jupyter notebook contains Python code to access the data, store it as an HDF5 file, and upload it to Google Drive. It can be run directly on Google’s Colaboratory platform without installing Python locally. We obtain the files test_data.h5 and train_data.h5, which will be used in the next post.

In [0]:
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# https://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.

TFrecord Extraction

We will load a tfrecord dataset and extract the data so it can be used with another framework, for example TensorFlow.jl in Julia.

Prepare Packages and Parse Function

In [1]:
from __future__ import print_function

import collections
import io
import math

import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import tensorflow as tf
from IPython import display
from sklearn import metrics

tf.logging.set_verbosity(tf.logging.ERROR)
train_url = 'https://storage.googleapis.com/mledu-datasets/sparse-data-embedding/train.tfrecord'
train_path = tf.keras.utils.get_file(train_url.split('/')[-1], train_url)
test_url = 'https://storage.googleapis.com/mledu-datasets/sparse-data-embedding/test.tfrecord'
test_path = tf.keras.utils.get_file(test_url.split('/')[-1], test_url)
Downloading data from https://storage.googleapis.com/mledu-datasets/sparse-data-embedding/train.tfrecord
41631744/41625533 [==============================] - 0s 0us/step
41639936/41625533 [==============================] - 0s 0us/step
Downloading data from https://storage.googleapis.com/mledu-datasets/sparse-data-embedding/test.tfrecord
40689664/40688441 [==============================] - 0s 0us/step
40697856/40688441 [==============================] - 0s 0us/step
In [0]:
def _parse_function(record):
  """Extracts features and labels.

  Args:
    record: File path to a TFRecord file
  Returns:
    A `tuple` `(labels, features)`:
      features: A dict of tensors representing the features
      labels: A tensor with the corresponding labels.
  """
  features = {
    "terms": tf.VarLenFeature(dtype=tf.string),                # terms are strings of varying length
    "labels": tf.FixedLenFeature(shape=[1], dtype=tf.float32)  # labels are 0 or 1
  }

  parsed_features = tf.parse_single_example(record, features)

  terms = parsed_features['terms'].values
  labels = parsed_features['labels']

  return {'terms': terms}, labels

Training Data

We start with the training data.
In [0]:
# Create the Dataset object.
ds = tf.data.TFRecordDataset(train_path)
# Map features and labels with the parse function.
ds = ds.map(_parse_function)
In [0]:
# Make a one shot iterator
n = ds.make_one_shot_iterator().get_next()
sess = tf.Session()
The number of records in a tfrecord file is unfortunately not available as metadata. We use the following trick to get the total number of entries by iterating over the whole file.
In [6]:
sum(1 for _ in tf.python_io.tf_record_iterator(train_path))
Out[6]:
25000
Now we create two lists to store the output labels and features. Looping over the tfrecord dataset extracts the entries one by one.
In [0]:
output_features = []
output_labels = []

# Loop over all 25000 records (count determined above)
for i in range(25000):
  value = sess.run(n)
  output_features.append(value[0]['terms'])
  output_labels.append(value[1])

Export to File

We create a file to export using the h5py package.
In [0]:
import h5py
In [0]:
dt = h5py.special_dtype(vlen=str)

h5f = h5py.File('train_data.h5', 'w')
h5f.create_dataset('output_features', data=output_features, dtype=dt)
h5f.create_dataset('output_labels', data=output_labels)
h5f.close()

Test Data

We repeat the same steps for the test data.
In [0]:
# Create the Dataset object.
ds = tf.data.TFRecordDataset(test_path)
# Map features and labels with the parse function.
ds = ds.map(_parse_function)
In [0]:
n = ds.make_one_shot_iterator().get_next()
sess = tf.Session()
The total number of records is
In [7]:
sum(1 for _ in tf.python_io.tf_record_iterator(test_path))
Out[7]:
25000
In [0]:
output_features = []
output_labels = []

# Loop over all 25000 records of the test set
for i in range(25000):
  value = sess.run(n)
  output_features.append(value[0]['terms'])
  output_labels.append(value[1])

Export to File

In [0]:
dt = h5py.special_dtype(vlen=str)

h5f = h5py.File('test_data.h5', 'w')
h5f.create_dataset('output_features', data=output_features, dtype=dt)
h5f.create_dataset('output_labels', data=output_labels)
h5f.close()

Google Drive Export

Finally, we export the two files containing the training and test data to Google Drive. If necessary, install the PyDrive package using !pip install -U -q PyDrive. The folder ID is the string of letters and numbers that appears in your browser’s URL after https://drive.google.com/drive/u/0/folders/ when you open the desired folder.
In [0]:
!pip install -U -q PyDrive
In [0]:
from pydrive.auth import GoogleAuth
from pydrive.drive import GoogleDrive
from google.colab import auth
from oauth2client.client import GoogleCredentials

# Authenticate and create the PyDrive client.
auth.authenticate_user()
gauth = GoogleAuth()
gauth.credentials = GoogleCredentials.get_application_default()
drive = GoogleDrive(gauth)

# PyDrive reference:
# https://googledrive.github.io/PyDrive/docs/build/html/index.html
In [0]:
# Adjust the id to the folder of your choice in Google Drive
# Use `file = drive.CreateFile()` to write to root directory
file = drive.CreateFile({'parents':[{"id": "insert_folder_id"}]})
file.SetContentFile('train_data.h5')
file.Upload()
In [0]:
# Adjust the id to the folder of your choice in Google Drive
# Use `file = drive.CreateFile()` to write to root directory
file = drive.CreateFile({'parents':[{"id": "insert_folder_id"}]})
file.SetContentFile('test_data.h5')
file.Upload()