Synthetic Features and Outliers

By: Sören Dobberschütz

Re-posted from: https://tensorflowjulia.blogspot.com/2018/08/synthetic-features-and-outliers.html

In this second part, we create a synthetic feature and remove some outliers from the data set.

The Jupyter notebook can be downloaded here. For the version displayed below, I needed to remove some scatter plots.


This notebook is based on the file Synthetic Features and Outliers, which is part of Google’s Machine Learning Crash Course.
In [0]:
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# https://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.

Synthetic Features and Outliers

Learning Objectives:
  • Create a synthetic feature that is the ratio of two other features
  • Use this new feature as an input to a linear regression model
  • Improve the effectiveness of the model by identifying and clipping (capping) outliers in the input data
Let’s revisit our model from the previous First Steps with TensorFlow exercise.
First, we’ll import the California housing data into a DataFrame:

Setup

In [2]:
using Plots
gr()
using DataFrames
using TensorFlow
import CSV

sess=Session()

california_housing_dataframe = CSV.read("california_housing_train.csv", delim=",");
california_housing_dataframe[:median_house_value] /= 1000.0
california_housing_dataframe
Out[2]:
Row longitude latitude housing_median_age total_rooms total_bedrooms population households median_income median_house_value
1 -114.31 34.19 15.0 5612.0 1283.0 1015.0 472.0 1.4936 66.9
2 -114.47 34.4 19.0 7650.0 1901.0 1129.0 463.0 1.82 80.1
3 -114.56 33.69 17.0 720.0 174.0 333.0 117.0 1.6509 85.7
4 -114.57 33.64 14.0 1501.0 337.0 515.0 226.0 3.1917 73.4
5 -114.57 33.57 20.0 1454.0 326.0 624.0 262.0 1.925 65.5
6 -114.58 33.63 29.0 1387.0 236.0 671.0 239.0 3.3438 74.0
7 -114.58 33.61 25.0 2907.0 680.0 1841.0 633.0 2.6768 82.4
8 -114.59 34.83 41.0 812.0 168.0 375.0 158.0 1.7083 48.5
9 -114.59 33.61 34.0 4789.0 1175.0 3134.0 1056.0 2.1782 58.4
10 -114.6 34.83 46.0 1497.0 309.0 787.0 271.0 2.1908 48.1
11 -114.6 33.62 16.0 3741.0 801.0 2434.0 824.0 2.6797 86.5
12 -114.6 33.6 21.0 1988.0 483.0 1182.0 437.0 1.625 62.0
13 -114.61 34.84 48.0 1291.0 248.0 580.0 211.0 2.1571 48.6
14 -114.61 34.83 31.0 2478.0 464.0 1346.0 479.0 3.212 70.4
15 -114.63 32.76 15.0 1448.0 378.0 949.0 300.0 0.8585 45.0
16 -114.65 34.89 17.0 2556.0 587.0 1005.0 401.0 1.6991 69.1
17 -114.65 33.6 28.0 1678.0 322.0 666.0 256.0 2.9653 94.9
18 -114.65 32.79 21.0 44.0 33.0 64.0 27.0 0.8571 25.0
19 -114.66 32.74 17.0 1388.0 386.0 775.0 320.0 1.2049 44.0
20 -114.67 33.92 17.0 97.0 24.0 29.0 15.0 1.2656 27.5
21 -114.68 33.49 20.0 1491.0 360.0 1135.0 303.0 1.6395 44.4
22 -114.73 33.43 24.0 796.0 243.0 227.0 139.0 0.8964 59.2
23 -114.94 34.55 20.0 350.0 95.0 119.0 58.0 1.625 50.0
24 -114.98 33.82 15.0 644.0 129.0 137.0 52.0 3.2097 71.3
25 -115.22 33.54 18.0 1706.0 397.0 3424.0 283.0 1.625 53.5
26 -115.32 32.82 34.0 591.0 139.0 327.0 89.0 3.6528 100.0
27 -115.37 32.82 30.0 1602.0 322.0 1130.0 335.0 3.5735 71.1
28 -115.37 32.82 14.0 1276.0 270.0 867.0 261.0 1.9375 80.9
29 -115.37 32.81 32.0 741.0 191.0 623.0 169.0 1.7604 68.6
30 -115.37 32.81 23.0 1458.0 294.0 866.0 275.0 2.3594 74.3
Next, we’ll set up our input functions, and define the function for model training:
In [3]:
function create_batches(features, targets, steps, batch_size=5, num_epochs=0)

    # If no epoch count is given, repeat the data often enough to cover all steps.
    if num_epochs == 0
        num_epochs = ceil(Int, batch_size*steps/length(features))
    end

    features_batches = Union{Float64, Missings.Missing}[]
    target_batches = Union{Float64, Missings.Missing}[]

    for i = 1:num_epochs
        # Append one freshly shuffled copy of the data per epoch.
        select = shuffle(1:length(features))

        append!(features_batches, features[select])
        append!(target_batches, targets[select])
    end

    return features_batches, target_batches
end
Out[3]:
create_batches (generic function with 3 methods)
In [4]:
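function next_batch(features_batches, targets_batches, batch_size, iter)

    # Index range of batch number `iter`. mod1 keeps the indices within 1:n,
    # which matches Julia's 1-based arrays (mod would return 0 at the boundary).
    n = length(features_batches)
    select = mod1((iter-1)*batch_size+1, n):mod1(iter*batch_size, n)

    ds = features_batches[select]
    target = targets_batches[select]

    return ds, target
end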
Out[4]:
next_batch (generic function with 1 method)
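next_batch derives its index range from iter and batch_size with modular arithmetic. Because Julia arrays are 1-based, the choice between mod and mod1 matters whenever a batch ends exactly at the last stored sample. The following is a minimal sketch of that boundary case in plain, current Julia; the helper names range_mod and range_mod1 and the sizes are made up for illustration and are not part of the notebook.

```julia
# Hypothetical helpers: the same index arithmetic, once with mod and once with mod1.
range_mod(iter, batch_size, n) = mod((iter - 1) * batch_size + 1, n):mod(iter * batch_size, n)
range_mod1(iter, batch_size, n) = mod1((iter - 1) * batch_size + 1, n):mod1(iter * batch_size, n)

n = 10          # made-up number of stored samples
batch_size = 5

# The second batch ends exactly at index n.
r_mod = range_mod(2, batch_size, n)    # mod(10, 10) is 0, so the range 6:0 is empty
r_mod1 = range_mod1(2, batch_size, n)  # mod1(10, 10) is 10, so the range is 6:10

println(collect(r_mod))
println(collect(r_mod1))
```

Neither variant handles a batch that straddles the end of the stored data; this cannot happen when the number of stored samples is a multiple of batch_size.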
In [5]:
function my_input_fn(features_batches, targets_batches, iter, batch_size=5, shuffle_flag=1)
    """Returns the next batch of features and targets.

    Args:
      features_batches: Vector of feature values, as returned by create_batches
      targets_batches: Vector of target values, as returned by create_batches
      iter: Number of the current training step, starting at 1
      batch_size: Size of batches to be passed to the model
      shuffle_flag: 1 to shuffle the data within the batch, any other value to keep its order
    Returns:
      Tuple of (features, labels) for next data batch
    """

    # Select the batch that belongs to the current step.
    ds, target = next_batch(features_batches, targets_batches, batch_size, iter)

    # Shuffle the data, if specified.
    if shuffle_flag == 1
        select = shuffle(1:size(ds, 1))
        ds = ds[select, :]
        target = target[select, :]
    end

    # Return the next batch of data.
    return convert.(Float64, ds), convert.(Float64, target)
end
Out[5]:
my_input_fn (generic function with 3 methods)
In [11]:
function train_model(learning_rate, steps, batch_size, input_feature=:total_rooms)
    """Trains a linear regression model of one feature.

    Args:
      learning_rate: A `float`, the learning rate.
      steps: A non-zero `int`, the total number of training steps. A training step
        consists of a forward and backward pass using a single batch.
      batch_size: A non-zero `int`, the batch size.
      input_feature: A `symbol` specifying a column from `california_housing_dataframe`
        to use as input feature.
    """

    periods = 10
    steps_per_period = div(steps, periods)  # integer number of steps per period

    my_feature = input_feature
    my_feature_data = convert.(Float32, california_housing_dataframe[my_feature])
    my_label = :median_house_value
    targets = convert.(Float32, california_housing_dataframe[my_label])

    # Create placeholders for the input feature and the targets.
    feature_columns = placeholder(Float32)
    target_columns = placeholder(Float32)

    # Create the linear model y = m*x + b and its squared-error loss.
    m = Variable(0.0)
    b = Variable(0.0)
    y = m .* feature_columns .+ b
    loss = reduce_sum((target_columns - y).^2)
    run(sess, global_variables_initializer())
    features_batches, targets_batches = create_batches(my_feature_data, targets, steps, batch_size)

    # Use gradient descent as the optimizer for training the model,
    # and clip each gradient to a maximum norm of 5.
    #my_optimizer=train.minimize(train.GradientDescentOptimizer(learning_rate), loss)
    my_optimizer = train.GradientDescentOptimizer(learning_rate)
    gvs = train.compute_gradients(my_optimizer, loss)
    capped_gvs = [(clip_by_norm(grad, 5.), var) for (grad, var) in gvs]
    my_optimizer = train.apply_gradients(my_optimizer, capped_gvs)

    # Set up to plot the state of our model's line each period.
    sample = california_housing_dataframe[rand(1:size(california_housing_dataframe,1), 300),:];
    p1 = scatter(sample[my_feature], sample[my_label], title="Learned Line by Period", ylabel=my_label, xlabel=my_feature, color=:coolwarm)
    colors = [ColorGradient(:coolwarm)[i] for i in linspace(0, 1, periods+1)]

    # Train the model, but do so inside a loop so that we can periodically assess
    # loss metrics.
    println("Training model...")
    println("RMSE (on training data):")
    root_mean_squared_errors = []
    for period in 1:periods
        # Train the model, starting from the prior state.
        for i = 1:steps_per_period
            features, labels = my_input_fn(features_batches, targets_batches, (period-1)*steps_per_period + i, batch_size)
            run(sess, my_optimizer, Dict(feature_columns=>features, target_columns=>labels))
        end
        # Take a break and compute predictions.
        predictions = run(sess, y, Dict(feature_columns=>my_feature_data));

        # Compute loss.
        mean_squared_error = mean((predictions - targets).^2)
        root_mean_squared_error = sqrt(mean_squared_error)
        # Occasionally print the current loss.
        println(" period ", period, ": ", root_mean_squared_error)
        # Add the loss metrics from this period to our list.
        push!(root_mean_squared_errors, root_mean_squared_error)
        # Finally, track the weights and biases over time.

        # Apply some math to ensure that the data and line are plotted neatly.
        y_extents = [0 maximum(sample[my_label])]
        weight = run(sess, m)
        bias = run(sess, b)
        x_extents = (y_extents .- bias) ./ weight
        x_extents = max.(min.(x_extents, maximum(sample[my_feature])),
                         minimum(sample[my_feature]))
        y_extents = weight .* x_extents .+ bias
        p1 = plot!(p1, x_extents', y_extents', color=colors[period], linewidth=2)
    end

    predictions = run(sess, y, Dict(feature_columns=>my_feature_data));
    weight = run(sess, m)
    bias = run(sess, b)

    println("Model training finished.")

    # Output a graph of loss metrics over periods.
    p2 = plot(root_mean_squared_errors, title="Root Mean Squared Error vs. Periods", ylabel="RMSE", xlabel="Periods")

    # Output a table with calibration data.
    calibration_data = DataFrame()
    calibration_data[:predictions] = predictions
    calibration_data[:targets] = targets
    describe(calibration_data)

    println("Final RMSE (on training data): ", root_mean_squared_errors[end])
    println("Final Weight (on training data): ", weight)
    println("Final Bias (on training data): ", bias)

    return p1, p2, calibration_data
end
Out[11]:
train_model (generic function with 2 methods)
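Two pieces of arithmetic inside train_model may be easier to follow in isolation: the gradient clipping done by clip_by_norm(grad, 5.) and the RMSE reported after each period. The following is a minimal sketch in plain, current Julia without TensorFlow; the helper names clip_norm and rmse and the input values are made up for illustration, and the clipping formula is the usual rescaling to a maximum L2 norm, which is assumed here to match what clip_by_norm does.

```julia
using LinearAlgebra  # norm
using Statistics     # mean

# Hypothetical helper: rescale g so that its L2 norm does not exceed max_norm.
# Vectors that are already short enough are returned unchanged.
clip_norm(g, max_norm) = g .* (max_norm / max(norm(g), max_norm))

# Hypothetical helper: root mean squared error between predictions and targets.
rmse(predictions, targets) = sqrt(mean((predictions .- targets) .^ 2))

g = [30.0, 40.0]               # made-up gradient with L2 norm 50
clipped = clip_norm(g, 5.0)    # same direction, L2 norm 5
unchanged = clip_norm([0.3, 0.4], 5.0)

println(clipped)
println(unchanged)
println(rmse([1.0, 2.0, 3.0], [1.0, 2.0, 5.0]))
```

In train_model the gradients of m and b are scalars, so clipping to a norm of 5 simply limits the absolute value of each gradient to 5.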

Task 1: Try a Synthetic Feature

Both the total_rooms and population features count totals for a given city block.
But what if one city block were more densely populated than another? We can explore how block density relates to median house value by creating a synthetic feature that is the ratio of total_rooms to population.
In the cell below, we create a feature called rooms_per_person, and use that as the input_feature to train_model().
In [7]:
california_housing_dataframe[:rooms_per_person] =(
california_housing_dataframe[:total_rooms] ./ california_housing_dataframe[:population]);
In [14]:
p1, p2, calibration_data= train_model(
0.05, # learning rate
1000, # steps
5, # batch size
:rooms_per_person #feature
)
Training model...
RMSE (on training data):
period 1: 174.73499015754794
period 2: 134.71078976839658
period 3: 125.55076328927971
period 4: 126.57741465589378
period 5: 126.25333063681862
period 6: 126.96155073215469
period 7: 126.46464736821247
period 8: 127.36567810703599
period 9: 126.96500792643184
period 10: 129.1270898729325
Model training finished.
Out[14]:
(Plot{Plots.GRBackend() n=11}, Plot{Plots.GRBackend() n=1}, 17000×2 DataFrames.DataFrame
│ Row │ predictions │ targets │
├───────┼─────────────┼─────────┤
│ 1 │ 473.972 │ 66.9 │
│ 2 │ 564.655 │ 80.1 │
│ 3 │ 229.099 │ 85.7 │
│ 4 │ 283.821 │ 73.4 │
│ 5 │ 241.315 │ 65.5 │
│ 6 │ 222.183 │ 74.0 │
│ 7 │ 186.689 │ 82.4 │
│ 8 │ 229.33 │ 48.5 │
│ 9 │ 182.983 │ 58.4 │
│ 10 │ 210.19 │ 48.1 │
│ 11 │ 183.63 │ 86.5 │

│ 16989 │ 210.883 │ 66.9 │
│ 16990 │ 214.725 │ 58.1 │
│ 16991 │ 214.596 │ 78.3 │
│ 16992 │ 242.92 │ 73.2 │
│ 16993 │ 220.514 │ 50.8 │
│ 16994 │ 241.927 │ 106.7 │
│ 16995 │ 311.485 │ 76.1 │
│ 16996 │ 249.621 │ 111.4 │
│ 16997 │ 214.93 │ 79.0 │
│ 16998 │ 228.355 │ 103.6 │
│ 16999 │ 221.564 │ 85.8 │
│ 17000 │ 236.074 │ 94.6 │)
Final RMSE (on training data): 129.1270898729325
Final Weight (on training data): 72.72948312797806
Final Bias (on training data): 71.84641197969654
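As a sanity check, the first two predictions in the calibration table can be reproduced by hand from the printed weight and bias. The sketch below uses plain, current Julia; the total_rooms and population values are copied from rows 1 and 2 of the data table near the top, and the variable names are chosen for illustration.

```julia
# Final weight and bias as printed by the training run.
weight = 72.72948312797806
bias = 71.84641197969654

# total_rooms and population of the first two rows of the data table.
total_rooms = [5612.0, 7650.0]
population = [1015.0, 1129.0]

# Synthetic feature and the resulting linear prediction.
rooms_per_person = total_rooms ./ population
predictions = weight .* rooms_per_person .+ bias

println(rooms_per_person)
println(predictions)
```

Up to rounding, the results agree with the values 473.972 and 564.655 listed in the predictions column, both far above the corresponding targets of 66.9 and 80.1.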
In [15]:
plot(p1, p2, layout=(1,2), legend=false)
Out[15]:
[Figure: left panel "Learned Line by Period" (median_house_value vs. rooms_per_person); right panel "Root Mean Squared Error vs. Periods" (RMSE vs. Periods)]

Task 2: Identify Outliers

We can visualize the performance of our model by creating a scatter plot of predictions vs. target values. Ideally, these would lie on a perfectly correlated diagonal line.
We use scatter to create a scatter plot of predictions vs. targets, using the rooms-per-person model you trained in Task 1.
Do you see any oddities? Trace these back to the source data by looking at the distribution of values in rooms_per_person.
In [28]:
#scatter(calibration_data[:predictions], calibration_data[:targets], legend=false)
The calibration data shows most scatter points aligned to a line. The line is almost vertical, but we’ll come back to that later. Right now let’s focus on the ones that deviate from the line. We notice that they are relatively few in number.
If we plot a histogram of rooms_per_person, we find that we have a few outliers in our input data:
In [17]:
histogram(california_housing_dataframe[:rooms_per_person], nbins=20, legend=false)
Out[17]:
[Figure: histogram of rooms_per_person; x-axis from 0 to 60, counts up to 1.5×10^4]
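Besides the histogram, a few summary statistics can point to outliers numerically. The following is a minimal sketch in plain, current Julia using the Statistics standard library; the values in ratios are made up for illustration and are not taken from the housing data, and the cut-off of 5.0 is just a candidate threshold.

```julia
using Statistics  # median, quantile

# Made-up ratios for illustration only; the last value plays the role of an outlier.
ratios = [1.2, 1.8, 2.0, 2.2, 2.5, 3.1, 55.0]

println("median: ", median(ratios))
println("99th percentile: ", quantile(ratios, 0.99))
println("maximum: ", maximum(ratios))

# Share of values above a candidate cut-off.
cutoff = 5.0
share_above = count(x -> x > cutoff, ratios) / length(ratios)
println("share above ", cutoff, ": ", share_above)
```

A maximum far above the median, combined with a small share of values above the cut-off, suggests that clipping at that cut-off affects only a few samples.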

Task 3: Clip Outliers

Let’s see whether we can further improve the model fit by setting the outlier values of rooms_per_person to some reasonable minimum or maximum.
The histogram we created in Task 2 shows that the majority of values are less than 5. Let’s clip rooms_per_person to 5, and plot a histogram to double-check the results.
In [18]:
california_housing_dataframe[:rooms_per_person] = min.(
california_housing_dataframe[:rooms_per_person],5)

histogram(california_housing_dataframe[:rooms_per_person], nbins=20, legend=false)
Out[18]:
[Figure: histogram of rooms_per_person after clipping; x-axis from 0 to 5, counts up to 5000]
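Note that clipping caps values rather than dropping rows, so the data set keeps all of its samples. The min. call above only limits the upper end; if a lower limit were wanted as well, Julia's clamp could be broadcast in the same way. The following is a minimal sketch in plain, current Julia; the values in ratios and the lower bound of 0.5 are arbitrary assumptions for illustration.

```julia
# Made-up ratios for illustration only.
ratios = [0.02, 1.5, 2.0, 4.9, 18.0, 55.0]

# min. only caps the upper end, as in the notebook cell.
upper_clipped = min.(ratios, 5.0)

# clamp. caps both ends; the lower bound 0.5 is an arbitrary assumption.
both_clipped = clamp.(ratios, 0.5, 5.0)

println(upper_clipped)
println(both_clipped)
```

Both results have the same length as the input, which is why the calibration table still lists all rows after clipping.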
To verify that clipping worked, let’s train again and print the calibration data once more:
In [23]:
p1, p2, calibration_data= train_model(
0.05, # learning rate
500, # steps
10, # batch size
:rooms_per_person #feature
)
Training model...
RMSE (on training data):
period 1: 204.65393150901195
period 2: 173.7183427312223
period 3: 145.97809305428905
period 4: 123.6036198238067
period 5: 112.8142399617989
period 6: 108.63058108212915
period 7: 107.55735462898159
period 8: 107.53097708301351
period 9: 107.5025442282244
period 10: 107.44954799028854
Model training finished.
Out[23]:
(Plot{Plots.GRBackend() n=11}, Plot{Plots.GRBackend() n=1}, 17000×2 DataFrames.DataFrame
│ Row │ predictions │ targets │
├───────┼─────────────┼─────────┤
│ 1 │ 413.677 │ 66.9 │
│ 2 │ 413.677 │ 80.1 │
│ 3 │ 217.866 │ 85.7 │
│ 4 │ 269.782 │ 73.4 │
│ 5 │ 229.456 │ 65.5 │
│ 6 │ 211.305 │ 74.0 │
│ 7 │ 177.63 │ 82.4 │
│ 8 │ 218.085 │ 48.5 │
│ 9 │ 174.115 │ 58.4 │
│ 10 │ 199.926 │ 48.1 │
│ 11 │ 174.728 │ 86.5 │

│ 16989 │ 200.584 │ 66.9 │
│ 16990 │ 204.229 │ 58.1 │
│ 16991 │ 204.107 │ 78.3 │
│ 16992 │ 230.979 │ 73.2 │
│ 16993 │ 209.721 │ 50.8 │
│ 16994 │ 230.037 │ 106.7 │
│ 16995 │ 296.027 │ 76.1 │
│ 16996 │ 237.335 │ 111.4 │
│ 16997 │ 204.423 │ 79.0 │
│ 16998 │ 217.16 │ 103.6 │
│ 16999 │ 210.717 │ 85.8 │
│ 17000 │ 224.484 │ 94.6 │)
Final RMSE (on training data): 107.44954799028854
Final Weight (on training data): 69.0
Final Bias (on training data): 68.67712220400571
In [24]:
plot(p1, p2, layout=(1,2), legend=false)
Out[24]:
[Figure: left panel "Learned Line by Period" (median_house_value vs. rooms_per_person, clipped to 0 to 5); right panel "Root Mean Squared Error vs. Periods" (RMSE vs. Periods)]
In [27]:
#scatter(calibration_data[:predictions], calibration_data[:targets], legend=false)