MemoryError when running Python script on Google Cloud - python

I am trying to use Google Cloud to run a script that makes predictions for every line of a test.csv file. I use the cloud because it looks like Google Colab would take too long. However, when I run the script there is a memory error:
(pre_env) mikempc3@instance-1:~$ python predictSales.py
Traceback (most recent call last):
File "predictSales.py", line 7, in <module>
sales = pd.read_csv("sales_train.csv")
File "/home/mikempc3/pre_env/lib/python3.5/site-packages/pandas/io/parsers.py", line 685, in parser_f
return _read(filepath_or_buffer, kwds)
File "/home/mikempc3/pre_env/lib/python3.5/site-packages/pandas/io/parsers.py", line 463, in _read
data = parser.read(nrows)
File "/home/mikempc3/pre_env/lib/python3.5/site-packages/pandas/io/parsers.py", line 1169, in read
df = DataFrame(col_dict, columns=columns, index=index)
File "/home/mikempc3/pre_env/lib/python3.5/site-packages/pandas/core/frame.py", line 411, in __init__
mgr = init_dict(data, index, columns, dtype=dtype)
File "/home/mikempc3/pre_env/lib/python3.5/site-packages/pandas/core/internals/construction.py", line 257, in init_dict
return arrays_to_mgr(arrays, data_names, index, columns, dtype=dtype)
File "/home/mikempc3/pre_env/lib/python3.5/site-packages/pandas/core/internals/construction.py", line 87, in arrays_to_mgr
return create_block_manager_from_arrays(arrays, arr_names, axes)
File "/home/mikempc3/pre_env/lib/python3.5/site-packages/pandas/core/internals/managers.py", line 1694, in create_block_manager_from_arrays
blocks = form_blocks(arrays, names, axes)
File "/home/mikempc3/pre_env/lib/python3.5/site-packages/pandas/core/internals/managers.py", line 1764, in form_blocks
int_blocks = _multi_blockify(items_dict["IntBlock"])
File "/home/mikempc3/pre_env/lib/python3.5/site-packages/pandas/core/internals/managers.py", line 1846, in _multi_blockify
values, placement = _stack_arrays(list(tup_block), dtype)
File "/home/mikempc3/pre_env/lib/python3.5/site-packages/pandas/core/internals/managers.py", line 1874, in _stack_arrays
stacked = np.empty(shape, dtype=dtype)
MemoryError: Unable to allocate 67.2 MiB for an array with shape (3, 2935849) and data type int64
Here is my script:
import statsmodels
import statsmodels.tsa.arima.model as smt
import pandas as pd
import datetime
import numpy as np

sales = pd.read_csv("sales_train.csv")
test = pd.read_csv("test.csv")

sales.date = sales.date.apply(lambda x: datetime.datetime.strptime(x, "%d.%m.%Y"))
sales_monthly = sales.groupby(
    ["date_block_num", "shop_id", "item_id"])[["date", "item_price", "item_cnt_day"]].agg({
        "date": ["min", "max"],
        "item_price": "mean",
        "item_cnt_day": "sum"})

array = []
for i, row in test.iterrows():
    print("row['shop_id']: ", row['shop_id'], " row['item_id']: ", row['item_id'])
    print(statsmodels.__version__)
    ts = pd.DataFrame(sales_monthly.loc[pd.IndexSlice[:, [row['shop_id']], [row['item_id']]], :]['item_price'].values *
                      sales_monthly.loc[pd.IndexSlice[:, [row['shop_id']], [row['item_id']]], :]['item_cnt_day'].values).T.iloc[0]
    print(ts.values)
    if ts.values != [] and len(ts.values) > 2:
        best_aic = np.inf
        best_order = None
        best_model = None
        ranges = range(1, 5)
        for difference in ranges:
            # try:
            tmp_model = smt.ARIMA(ts.values, order=(0, 1, 0), trend='t').fit()
            tmp_aic = tmp_model.aic
            if tmp_aic < best_aic:
                best_aic = tmp_aic
                best_difference = difference
                best_model = tmp_model
            # except Exception as e:
            #     print(e)
            #     continue
        if best_model is not None:
            y_hat = best_model.forecast()[0]
            if y_hat < 0:
                y_hat = 0
        else:
            y_hat = 0
    else:
        y_hat = 0
    print("predicted:", y_hat)
    d = {'id': row['ID'], 'item_cnt_month': y_hat}
    array.append(d)
    print("-------------------")

df = pd.DataFrame(array)
df.to_csv("submission.csv")

You can use the Fil memory profiler (https://pythonspeed.com/fil) to figure out which lines of code are responsible for peak memory use. It will also handle out-of-memory conditions and dump a report when you run out.
The only caveats are that (1) it requires Python 3.6 or later and (2) it only runs on Linux or macOS. We're up to Python 3.9 now, so it's probably time to upgrade regardless.
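Beyond profiling, note that the failed allocation is only about 67 MiB, which suggests the instance is already near its memory ceiling by the time pandas assembles the int64 blocks. Below is a minimal sketch of one way to shrink the initial load, assuming the columns of sales_train.csv used in the script above; the specific dtypes are an assumption, so check them against the value ranges in your own file:
import pandas as pd

# Smaller dtypes mean pandas never has to build the big int64 blocks that
# triggered the MemoryError. Verify the ranges before trusting these choices.
dtypes = {
    "date_block_num": "int16",
    "shop_id": "int16",
    "item_id": "int32",
    "item_price": "float32",
    "item_cnt_day": "float32",
}

sales = pd.read_csv("sales_train.csv", dtype=dtypes)
print(sales.memory_usage(deep=True).sum() / 1024 ** 2, "MiB")
Alternatively, a machine type with more RAM sidesteps the problem entirely; the profiler will tell you how much headroom you actually need.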

Related

Memory error when exporting a large data set processed in Python and Dask

I am trying to clean and filter a large data set that I get each month. Recently the data size has grown even larger for a variety of reasons and I can no longer use pandas. I've been attempting to find an alternative, and so far Dask has seemed to work until it comes to the export step. My simplified code is:
import pandas as pd
import numpy as np
import dask.dataframe as dd
import dask.array as da

MAP = pd.read_excel('map.csv')
MAP = dd.from_pandas(MAP, npartitions=1)
MAP2 = dd.read_csv('map2.csv')
MAP3 = dd.read_csv('map3.csv')

MAP = dd.merge(
    MAP,
    MAP2,
    how="left",
    left_on=("id1", "id2", "id3"),
    right_on=("id1", "id2", "id3"),
    indicator=False)
MAP = MAP.drop_duplicates(subset=["id1", "id2", "id3", 'col1', 'col2'])

BIG_DATA = dd.read_csv("BIG_DATA.gz",
                       sep='|',
                       compression='gzip',
                       header=None,
                       blocksize=None,
                       dtype={0: np.int64, 1: np.int64, 4: np.int16, 6: str, 8: np.int64, 9: np.float32, 10: np.float32,
                              11: np.float32, 19: str, 32: str, 37: np.int32, 40: np.float32})

BIG_DATA = dd.merge(
    BIG_DATA,
    MAP3,
    how="left",
    left_on=("id3", "id4"),
    right_on=("id3", "id4"),
    indicator=False)

BIG_DATA = BIG_DATA[filter condition]    # placeholder
groupvars = [g1, g2, g3, g4, g5, g6, g7...g17]    # placeholder
sumlist = [s1, s2, s3, s4]    # placeholder

BIG_DATA = BIG_DATA.groupby(groupvars)[sumlist].sum().reset_index()

BIG_DATA = dd.merge(
    BIG_DATA,
    MAP,
    how="outer",
    left_on=("id1", "id2"),
    right_on=("id1", "id2"),
    indicator=True)
BIG_DATA = BIG_DATA[(BIG_DATA['_merge'].isin(['right_only']) == False)]

BIG_DATA1 = BIG_DATA[filter condition1]    # placeholder
BIG_DATA2 = BIG_DATA[filter condition2]    # placeholder
OUTPUT = dd.concat([BIG_DATA1, BIG_DATA2]).reset_index()
OUTPUT = OUTPUT.repartition(npartitions=100000)  # have tried multiple values here
OUTPUT.to_csv(r'\\files\User\test.csv', single_file=True)
When using pandas, this process crashes at the groupby statement. I thought Dask might be the way around this, but it seems to always fail when I try to export to CSV. I'm new to Python and Dask, but I'm guessing it is delaying the groupby until the export and then failing for the same reason as pandas? I've created the same result set using Fortran and it results in a 100 MB CSV file with approximately 600k rows of data. I'm not really sure how to go about changing this so that it will work.
Exact error:
Traceback (most recent call last):
File "C:\ProgramData\Anaconda3\envs\pythonProject1\lib\site-packages\IPython\core\interactiveshell.py", line 3553, in run_code
exec(code_obj, self.user_global_ns, self.user_ns)
File "<ipython-input-52-29ce59d87600>", line 350, in <cell line: 350>
plinePSA.to_csv(r'\\files\User\test.csv', single_file=True, chunksize = 100)
File "C:\ProgramData\Anaconda3\envs\pythonProject1\lib\site-packages\dask\dataframe\core.py", line 1699, in to_csv
return to_csv(self, filename, **kwargs)
File "C:\ProgramData\Anaconda3\envs\pythonProject1\lib\site-packages\dask\dataframe\io\csv.py", line 972, in to_csv
return list(dask.compute(*values, **compute_kwargs))
File "C:\ProgramData\Anaconda3\envs\pythonProject1\lib\site-packages\dask\base.py", line 598, in compute
results = schedule(dsk, keys, **kwargs)
File "C:\ProgramData\Anaconda3\envs\pythonProject1\lib\site-packages\dask\threaded.py", line 89, in get
results = get_async(
File "C:\ProgramData\Anaconda3\envs\pythonProject1\lib\site-packages\dask\local.py", line 511, in get_async
raise_exception(exc, tb)
File "C:\ProgramData\Anaconda3\envs\pythonProject1\lib\site-packages\dask\local.py", line 319, in reraise
raise exc
File "C:\ProgramData\Anaconda3\envs\pythonProject1\lib\site-packages\dask\local.py", line 224, in execute_task
result = _execute_task(task, data)
File "C:\ProgramData\Anaconda3\envs\pythonProject1\lib\site-packages\dask\core.py", line 119, in _execute_task
return func(*(_execute_task(a, cache) for a in args))
File "C:\ProgramData\Anaconda3\envs\pythonProject1\lib\site-packages\dask\optimization.py", line 990, in __call__
return core.get(self.dsk, self.outkey, dict(zip(self.inkeys, args)))
File "C:\ProgramData\Anaconda3\envs\pythonProject1\lib\site-packages\dask\core.py", line 149, in get
result = _execute_task(task, cache)
File "C:\ProgramData\Anaconda3\envs\pythonProject1\lib\site-packages\dask\core.py", line 119, in _execute_task
return func(*(_execute_task(a, cache) for a in args))
File "C:\ProgramData\Anaconda3\envs\pythonProject1\lib\site-packages\dask\dataframe\io\csv.py", line 129, in __call__
df = pandas_read_text(
File "C:\ProgramData\Anaconda3\envs\pythonProject1\lib\site-packages\dask\dataframe\io\csv.py", line 182, in pandas_read_text
df = reader(bio, **kwargs)
File "C:\ProgramData\Anaconda3\envs\pythonProject1\lib\site-packages\pandas\util\_decorators.py", line 311, in wrapper
return func(*args, **kwargs)
File "C:\ProgramData\Anaconda3\envs\pythonProject1\lib\site-packages\pandas\io\parsers\readers.py", line 586, in read_csv
return _read(filepath_or_buffer, kwds)
File "C:\ProgramData\Anaconda3\envs\pythonProject1\lib\site-packages\pandas\io\parsers\readers.py", line 488, in _read
return parser.read(nrows)
File "C:\ProgramData\Anaconda3\envs\pythonProject1\lib\site-packages\pandas\io\parsers\readers.py", line 1059, in read
df = DataFrame(col_dict, columns=columns, index=index)
File "C:\ProgramData\Anaconda3\envs\pythonProject1\lib\site-packages\pandas\core\frame.py", line 614, in __init__
mgr = dict_to_mgr(data, index, columns, dtype=dtype, copy=copy, typ=manager)
File "C:\ProgramData\Anaconda3\envs\pythonProject1\lib\site-packages\pandas\core\internals\construction.py", line 464, in dict_to_mgr
return arrays_to_mgr(
File "C:\ProgramData\Anaconda3\envs\pythonProject1\lib\site-packages\pandas\core\internals\construction.py", line 135, in arrays_to_mgr
return create_block_manager_from_arrays(
File "C:\ProgramData\Anaconda3\envs\pythonProject1\lib\site-packages\pandas\core\internals\managers.py", line 1773, in create_block_manager_from_arrays
blocks = _form_blocks(arrays, names, axes, consolidate)
File "C:\ProgramData\Anaconda3\envs\pythonProject1\lib\site-packages\pandas\core\internals\managers.py", line 1838, in _form_blocks
numeric_blocks = _multi_blockify(
File "C:\ProgramData\Anaconda3\envs\pythonProject1\lib\site-packages\pandas\core\internals\managers.py", line 1928, in _multi_blockify
values, placement = _stack_arrays(
File "C:\ProgramData\Anaconda3\envs\pythonProject1\lib\site-packages\pandas\core\internals\managers.py", line 1957, in _stack_arrays
stacked = np.empty(shape, dtype=dtype)
numpy.core._exceptions._ArrayMemoryError: Unable to allocate 2.11 GiB for an array with shape (3, 189229412) and data type float32
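The traceback shows the failure happens while dask is still reading the gzipped input: with blocksize=None a gzip file cannot be split, so the whole 189-million-row file becomes a single in-memory partition and pandas has to allocate the 2.11 GiB block in one go. Here is a minimal sketch of one workaround, assuming the file can be decompressed first so dask is free to chunk it; the blocksize value and the multi-file output are assumptions to illustrate the idea, not a tested fix for this pipeline:
import numpy as np
import dask.dataframe as dd

# Decompress once (e.g. `gunzip -k BIG_DATA.gz`) so the text file is splittable,
# then let dask read it in many small partitions instead of one giant block.
BIG_DATA = dd.read_csv("BIG_DATA",           # decompressed copy of BIG_DATA.gz
                       sep='|',
                       header=None,
                       blocksize="64MB",
                       dtype={0: np.int64, 1: np.int64, 4: np.int16, 6: str,
                              8: np.int64, 9: np.float32, 10: np.float32,
                              11: np.float32, 19: str, 32: str,
                              37: np.int32, 40: np.float32})

# ... same merges, filters and groupby as above ...

# Writing one file per partition avoids assembling everything at once; the parts
# can be concatenated afterwards if a single CSV is required.
# OUTPUT.to_csv(r'\\files\User\test-*.csv', index=False)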

I was trying to make a random forest AI model to predict the retention of an employee in a company; that's when I got this error: Exception in Tkinter

Code related to it:
for sh in xlrd.open_workbook(myPath).sheets():
    popp = findCell(sh, searchedValue)
    test_str = popp
    print(popp)
    index_list = [1]
    new_str = [test_str[i] for i in index_list]
    rownum = "".join(new_str)
    rownum = int(rownum)

for roww in ws.iter_rows(min_row=rownum, min_col=3, max_row=rownum, max_col=3):
    for cell in roww:
        total_hours_worked = cell.value
        print()

for rowww in ws.iter_rows(min_row=rownum, min_col=4, max_row=rownum, max_col=4):
    for cell in rowww:
        leaves_taken = cell.value
        print()

Prediction_result = (' Predicted Result: ', forest.predict([[total_hours_worked, leaves_taken]]))
Error:
Exception in Tkinter callback
Traceback (most recent call last):
File "C:\Users\parth\AppData\Local\Programs\Python\Python36\lib\tkinter\__init__.py", line 1705, in __call__
return self.func(*args)
File "C:/Parth/ChurnPrediciton/not so accurate model/error.py", line 75, in values
Prediction_result = (' Predicted Result: ', forest.predict([[total_hours_worked, leaves_taken]]))
File "C:\Parth\ChurnPrediciton\venv\lib\site-packages\sklearn\ensemble\_forest.py", line 612, in predict
proba = self.predict_proba(X)
File "C:\Parth\ChurnPrediciton\venv\lib\site-packages\sklearn\ensemble\_forest.py", line 656, in predict_proba
X = self._validate_X_predict(X)
File "C:\Parth\ChurnPrediciton\venv\lib\site-packages\sklearn\ensemble\_forest.py", line 412, in _validate_X_predict
return self.estimators_[0]._validate_X_predict(X, check_input=True)
File "C:\Parth\ChurnPrediciton\venv\lib\site-packages\sklearn\tree\_classes.py", line 380, in _validate_X_predict
X = check_array(X, dtype=DTYPE, accept_sparse="csr")
File "C:\Parth\ChurnPrediciton\venv\lib\site-packages\sklearn\utils\validation.py", line 531, in check_array
array = np.asarray(array, order=order, dtype=dtype)
File "C:\Users\parth\AppData\Local\Programs\Python\Python36\lib\site-packages\numpy\core\_asarray.py", line 83, in asarray
return array(a, dtype, copy=False, order=order)
ValueError: could not convert string to float: 'Total Working Hours'
Process finished with exit code 0
I made this model a few days ago and it was working perfectly fine; it was giving the proper output. But it suddenly stopped working, and I'm not able to understand why.
In my case the error was in the data itself: some cells contained text where numeric values were expected (the value passed to the model was the string 'Total Working Hours'), so it could not be converted to float. I went into my Excel sheet and CSV files and corrected those values, and everything is working perfectly now.
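A small guard before calling predict makes this kind of data problem visible right away instead of failing deep inside scikit-learn. This is a minimal sketch rather than the original program: total_hours_worked, leaves_taken and forest are the names used in the code above, and the coercion helper itself is an assumption about how one might validate the cells.
import math

def to_number(value, label):
    """Coerce a spreadsheet cell to float, failing with a clear message."""
    try:
        number = float(value)
    except (TypeError, ValueError):
        raise ValueError("{0} should be numeric, got {1!r}".format(label, value))
    if math.isnan(number):
        raise ValueError("{0} is missing (NaN)".format(label))
    return number

hours = to_number(total_hours_worked, "total_hours_worked")
leaves = to_number(leaves_taken, "leaves_taken")
Prediction_result = (' Predicted Result: ', forest.predict([[hours, leaves]]))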

Generating data via SDV GaussianCopula throws "numpy.linalg.LinAlgError: SVD did not converge" in Python

I am currently using SDV and GaussianCopula (https://sdv.dev/SDV/user_guides/single_table/gaussian_copula.html) to train my models. I have a given data set which is loaded for training.
However, I get the following error message when creating the datasets:
Saving Model to path D:/.../GaussianCopula/model_MLB_1.pkl
Generating 22479 rows of synthetic data
Traceback (most recent call last):
File ".\generate_gaussian_model.py", line 47, in <module>
samples = gaussianCopula.sample(len(data.index))
File "C:\Users\...\AppData\Local\Programs\Python\Python37\lib\site-packages\sdv\tabular\base.py", line 442, in sample
return self._sample_batch(num_rows, max_retries, max_rows_multiplier)
File "C:\Users\...\AppData\Local\Programs\Python\Python37\lib\site-packages\sdv\tabular\base.py", line 300, in _sample_batch
num_rows, conditions, transformed_conditions, float_rtol)
File "C:\Users\...\AppData\Local\Programs\Python\Python37\lib\site-packages\sdv\tabular\base.py", line 228, in _sample_rows
sampled = self._sample(num_rows)
File "C:\Users\...\AppData\Local\Programs\Python\Python37\lib\site-packages\sdv\tabular\copulas.py", line 319, in _sample
return self._model.sample(num_rows, conditions=conditions)
File "C:\Users\...\AppData\Local\Programs\Python\Python37\lib\site-packages\copulas\__init__.py", line 36, in wrapper
return function(self, *args, **kwargs)
File "C:\Users\...\AppData\Local\Programs\Python\Python37\lib\site-packages\copulas\multivariate\gaussian.py", line 249, in sample
samples = self._get_normal_samples(num_rows, conditions)
File "C:\Users\...\AppData\Local\Programs\Python\Python37\lib\site-packages\copulas\multivariate\gaussian.py", line 223, in _get_normal_samples
samples = np.random.multivariate_normal(means, covariance, size=num_rows)
File "mtrand.pyx", line 4120, in numpy.random.mtrand.RandomState.multivariate_normal
File "<__array_function__ internals>", line 6, in svd
File "C:\Users\...\AppData\Local\Programs\Python\Python37\lib\site-packages\numpy\linalg\linalg.py", line 1660, in svd
u, s, vh = gufunc(a, signature=signature, extobj=extobj)
File "C:\Users\...\AppData\Local\Programs\Python\Python37\lib\site-packages\numpy\linalg\linalg.py", line 97, in _raise_linalgerror_svd_nonconvergence
raise LinAlgError("SVD did not converge")
numpy.linalg.LinAlgError: SVD did not converge
I also checked out the following thread (linked in the code below) and tried to apply its solution, but it didn't work.
This is my script (generate_gaussian_model.py) and what I've tried so far:
from sdv.tabular import GaussianCopula
import pickle
import pandas as pd

header_import_path = "C:/Users/.../headers/all_headers.txt"
all_mlb_names = ['MLB_1', 'MLB_7', 'MLB_19', 'MLB_31', 'MLB_41', 'MLB_45', 'MLB_49', 'MLB_53', 'MLB_58']

with open(header_import_path, 'rb') as fp:
    all_headers = pickle.load(fp)

for mlb_file_name in all_mlb_names:
    # Create a separate model for each MLB table
    model_export_path = "D:/.../GaussianCopula/model_{0}.pkl".format(mlb_file_name)
    synth_data_export_path = "C:/Users/.../models/generated/{0}_samples.csv".format(mlb_file_name)
    data_import_path = "C:/Users/.../models/original/{0}.csv".format(mlb_file_name)
    headers = all_headers[mlb_file_name]

    print("Read data for table {0}".format(mlb_file_name))
    data = pd.read_csv(data_import_path, sep='|', names=headers)

    # This is necessary to remove invalid columns from my original dataset
    for colname in data.columns:
        if colname.startswith("Calculation"):
            data = data.drop(axis=1, labels=[colname])

    # Thought this would fix my issue but it didn't
    # https://stackoverflow.com/questions/21827594/raise-linalgerrorsvd-did-not-converge-linalgerror-svd-did-not-converge-in-m
    data.dropna(inplace=True)

    # print("Takes a third of the dataset")
    data = data.sample(frac=0.3)
    print(data)

    gaussianCopula = GaussianCopula()
    print("Start training of GaussianCopula Model")
    gaussianCopula.fit(data)

    print("Saving Model to path {0}".format(model_export_path))
    gaussianCopula.save(model_export_path)

    print("Generating {0} rows of synthetic data".format(len(data.index)))
    # Here it begins to crash
    samples = gaussianCopula.sample(len(data.index))
    samples.to_csv(synth_data_export_path, header=True, sep='|', index=False)
The following line would work, but 1000 rows are not enough data for my purposes: data = data.sample(n=1000)
Hope you guys can help me out and explain this error message to me.
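For context, gaussian.py ends up calling np.random.multivariate_normal, which runs an SVD on the covariance matrix the copula estimated; that SVD typically fails when the training data still contains NaN/inf values or degenerate (constant or duplicated) columns. Below is a minimal diagnostic sketch, assuming data is the DataFrame from the loop above; the checks are an educated guess at the cause, not a confirmed fix:
import numpy as np
import pandas as pd

def diagnose(data):
    numeric = data.select_dtypes(include=[np.number])

    # Values a covariance estimate cannot handle.
    print("NaN per column:")
    print(numeric.isna().sum()[lambda s: s > 0])
    print("Inf per column:")
    print(np.isinf(numeric).sum()[lambda s: s > 0])

    # Constant columns have zero variance and make the covariance matrix singular.
    constant = [c for c in numeric.columns if numeric[c].nunique(dropna=True) <= 1]
    print("Constant columns:", constant)
    return constant

constant_columns = diagnose(data)
# One option before gaussianCopula.fit(data):
# data = data.drop(columns=constant_columns)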

string.split() giving memory error in pandas dataframe

I am trying to split a string but I am getting a memory error. Is there any way to solve this, or is there an alternative approach?
I am getting the error in the code below:
content_str = str(content_str).split('\n')
df1 = pd.DataFrame(content_str)
df1 = df1[0].str.split(',', expand=True)
Error:
Traceback (most recent call last):
File "ravi_sir.py", line 47, in <module>
df1 = df1[0].str.split(',', expand=True)
File "/app/python3/lib/python3.6/site-packages/pandas/core/strings.py", line 2001, in wrapper
return func(self, *args, **kwargs)
File "/app/python3/lib/python3.6/site-packages/pandas/core/strings.py", line 2690, in split
return self._wrap_result(result, expand=expand, returns_string=expand)
File "/app/python3/lib/python3.6/site-packages/pandas/core/strings.py", line 2272, in _wrap_result
result = cons(result, columns=name, index=index, dtype=dtype)
File "/app/python3/lib/python3.6/site-packages/pandas/core/frame.py", line 520, in __init__
mgr = arrays_to_mgr(arrays, columns, index, columns, dtype=dtype)
File "/app/python3/lib/python3.6/site-packages/pandas/core/internals/construction.py", line 93, in arrays_to_mgr
return create_block_manager_from_arrays(arrays, arr_names, axes)
File "/app/python3/lib/python3.6/site-packages/pandas/core/internals/managers.py", line 1650, in create_block_manager_from_arrays
blocks = form_blocks(arrays, names, axes)
File "/app/python3/lib/python3.6/site-packages/pandas/core/internals/managers.py", line 1739, in form_blocks
object_blocks = _simple_blockify(items_dict["ObjectBlock"], np.object_)
File "/app/python3/lib/python3.6/site-packages/pandas/core/internals/managers.py", line 1784, in _simple_blockify
values, placement = _stack_arrays(tuples, dtype)
File "/app/python3/lib/python3.6/site-packages/pandas/core/internals/managers.py", line 1830, in _stack_arrays
stacked = np.empty(shape, dtype=dtype)
MemoryError
I am trying to read a gzipped file from an S3 bucket and save the content into a dataframe in order to get the total record count for that file. My full code is given below:
import datetime
import gzip
from collections import OrderedDict
from io import BytesIO

import boto3
import pandas as pd

list_table = []
for table in d:
    dict_table = OrderedDict()
    s_time = datetime.datetime.now().strftime("%H:%M:%S")
    print("start_time--->>", s_time)
    print("tablename--->>", table)

    s3 = boto3.resource('s3')
    key = 'raw/vs-1/load-1619/data' + '/' + table
    obj = s3.Object('********', key)
    n = obj.get()['Body'].read()

    gzipfile = BytesIO(n)
    gzipfile = gzip.GzipFile(fileobj=gzipfile)
    content = gzipfile.read()
    # print(content)
    content_str = content.decode('utf-8')
    content_str = str(content_str).split('\n')
    df1 = pd.DataFrame(content_str)
    df1 = df1[0].str.split(',', expand=True)
    # df1 = pd.DataFrame([x.split(',') for x in str(content_str).split('\n')])
    # print(df1)
    # count = os.popen('aws s3 cp s3://itx-agu-lake/raw/vs-1/load-1619/data/{0} - | wc -l'.format(table)).read()
    count = int(len(df1)) - 2
    del df1

    e_time = datetime.datetime.now().strftime("%H:%M:%S")
    print("End_time---->>", e_time)
    print(count)

    dict_table['Table_Name'] = str(table)
    dict_table['Count'] = count
    list_table.append(dict_table)
Since you are splitting a huge string into a DataFrame column and then deleting the DataFrame, it looks like you only need the count of commas for each row. So get that count directly, which is simple, rather than splitting with expand=True, which can generate a huge number of columns and therefore cause your memory error.
row1list = ['1,2,3,4']
row2list = ['5,6']
row3list = ['7,8,9']
df = pd.DataFrame([row1list, row2list, row3list], columns=['col'])
df['count_commas'] = df['col'].str.count(',')
print(df)
#        col  count_commas
# 0  1,2,3,4             3
# 1      5,6             1
# 2    7,8,9             2
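If the goal in the question is really just the number of records per gzipped object, the count can be taken from the decoded text directly and the wide DataFrame never has to be built at all. Here is a minimal sketch, assuming the same loop as in the question where content holds the decompressed bytes; the helper name is made up for illustration:
def record_count(raw_bytes):
    """Same result as len(df1) - 2 in the question, without building a DataFrame:
    str.split('\n') yields (newline count + 1) pieces, so the equivalent
    adjustment on a plain newline count is -1 instead of -2."""
    text = raw_bytes.decode('utf-8')
    return text.count('\n') - 1

# Inside the loop from the question:
# count = record_count(content)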

Expected binary or unicode string, got nan - tensorflow/pandas

I am fairly new to TensorFlow/machine learning and therefore have a few difficulties. I have a dataset in CSV format and want to read it with pandas. This worked on a different dataset, but I modified and extended the code and I think I am missing something important here. Basically, all I am trying to do is predict the "overall" rating from the given dataset. Here's my code and the traceback I get:
import pandas as pd
import tensorflow as tf
import tempfile

COLUMNS = ["reviewerID", "asin", "reviewerName", "helpful_0", "helpful_1", "reviewText",
           "overall", "summary", "unixReviewTime"]
CATEGORICAL_COLUMNS = ["reviewerID", "reviewerName", "reviewText", "summary"]
CONTINUOUS_COLUMNS = ["helpful_0", "helpful_1", "unixReviewTime"]

df_train = pd.read_csv('Digital_Music_5.csv', names=COLUMNS, skipinitialspace=True,
                       low_memory=False, skiprows=1)
df_test = pd.read_csv('Digital_Music_5_test.csv', names=COLUMNS,
                      skipinitialspace=True, skiprows=1)

LABEL_COLUMN = "label"
df_train[LABEL_COLUMN] = df_train["overall"]
df_test[LABEL_COLUMN] = df_test["overall"]
print(df_train)

def input_fn(df):
    # Creates a dictionary mapping from each continuous feature column name (k)
    # to the values of that column stored in a constant Tensor.
    continuous_cols = {k: tf.constant(df[k].values)
                       for k in CONTINUOUS_COLUMNS}
    # Creates a dictionary mapping from each categorical feature column name
    # (k) to the values of that column stored in a tf.SparseTensor.
    categorical_cols = {k: tf.SparseTensor(
        indices=[[i, 0] for i in range(df[k].size)],
        values=df[k].values,
        dense_shape=[df[k].size, 1],) for k in CATEGORICAL_COLUMNS}
    # Merges the two dictionaries into one.
    feature_cols = dict(continuous_cols)
    feature_cols.update(categorical_cols)
    # Converts the label column into a constant Tensor.
    label = tf.constant(df[LABEL_COLUMN].values)
    # Returns the feature columns and the label.
    return feature_cols, label

def train_input_fn():
    return input_fn(df_train)

def eval_input_fn():
    return input_fn(df_test)

reviewText = tf.contrib.layers.sparse_column_with_hash_bucket("reviewText", hash_bucket_size=100000)
reviewerID = tf.contrib.layers.sparse_column_with_hash_bucket("reviewerID", hash_bucket_size=100000)
reviewerName = tf.contrib.layers.sparse_column_with_hash_bucket("reviewerName", hash_bucket_size=100000)
summary = tf.contrib.layers.sparse_column_with_hash_bucket("summary", hash_bucket_size=100000)
asin = tf.contrib.layers.real_valued_column("asin")
helpful_0 = tf.contrib.layers.real_valued_column("helpful_0")
helpful_1 = tf.contrib.layers.real_valued_column("helpful_1")
unixReviewTime = tf.contrib.layers.real_valued_column("unixReviewTime")

# reviewText_x_summary = tf.contrib.layers.crossed_column([reviewText, summary], hash_bucket_size=100000)
# reviewerID_x_reviewerName = tf.contrib.layers.crossed_column([reviewerID, reviewerName], hash_bucket_size=100000)
# reviewText_x_reviewerID_x_reviewerName = tf.contrib.layers.crossed_column([reviewText, reviewerID, reviewerName], hash_bucket_size=100000)

model_dir = tempfile.mkdtemp()
m = tf.contrib.learn.LinearClassifier(feature_columns=[reviewText, reviewerName, summary,
                                                       asin, helpful_0, helpful_1, unixReviewTime],
                                      optimizer=tf.train.FtrlOptimizer(
                                          learning_rate=0.1,
                                          l1_regularization_strength=1.0,
                                          l2_regularization_strength=1.0),
                                      model_dir=model_dir)
m.fit(input_fn=train_input_fn, steps=200)

# results = m.evaluate(input_fn=eval_input_fn, steps=1)
# for key in sorted(results):
#     print("{}: {}".format(key, results[key]))
Traceback:
Traceback (most recent call last):
File "amazon_reviews.py", line 78, in <module>
m.fit(input_fn=train_input_fn, steps=200)
File "/home/cfritz/virtualenvs/tensorflow/lib/python3.6/site-packages/tensorflow/python/util/deprecation.py", line 280, in new_func
return func(*args, **kwargs)
File "/home/cfritz/virtualenvs/tensorflow/lib/python3.6/site-packages/tensorflow/contrib/learn/python/learn/estimators/estimator.py", line 426, in fit
loss = self._train_model(input_fn=input_fn, hooks=hooks)
File "/home/cfritz/virtualenvs/tensorflow/lib/python3.6/site-packages/tensorflow/contrib/learn/python/learn/estimators/estimator.py", line 932, in _train_model
features, labels = input_fn()
File "amazon_reviews.py", line 47, in train_input_fn
return input_fn(df_train)
File "amazon_reviews.py", line 36, in input_fn
dense_shape=[df[k].size, 1],) for k in CATEGORICAL_COLUMNS}
File "amazon_reviews.py", line 36, in <dictcomp>
dense_shape=[df[k].size, 1],) for k in CATEGORICAL_COLUMNS}
File "/home/cfritz/virtualenvs/tensorflow/lib/python3.6/site-packages/tensorflow/python/framework/sparse_tensor.py", line 125, in __init__
values, name="values", as_ref=True)
File "/home/cfritz/virtualenvs/tensorflow/lib/python3.6/site-packages/tensorflow/python/framework/ops.py", line 702, in internal_convert_to_tensor
ret = conversion_func(value, dtype=dtype, name=name, as_ref=as_ref)
File "/home/cfritz/virtualenvs/tensorflow/lib/python3.6/site-packages/tensorflow/python/framework/constant_op.py", line 110, in _constant_tensor_conversion_function
return constant(v, dtype=dtype, name=name)
File "/home/cfritz/virtualenvs/tensorflow/lib/python3.6/site-packages/tensorflow/python/framework/constant_op.py", line 99, in constant
tensor_util.make_tensor_proto(value, dtype=dtype, shape=shape, verify_shape=verify_shape))
File "/home/cfritz/virtualenvs/tensorflow/lib/python3.6/site-packages/tensorflow/python/framework/tensor_util.py", line 451, in make_tensor_proto
append_fn(tensor_proto, proto_values)
File "/home/cfritz/virtualenvs/tensorflow/lib/python3.6/site-packages/tensorflow/python/framework/tensor_util.py", line 109, in SlowAppendObjectArrayToTensorProto
tensor_proto.string_val.extend([compat.as_bytes(x) for x in proto_values])
File "/home/cfritz/virtualenvs/tensorflow/lib/python3.6/site-packages/tensorflow/python/framework/tensor_util.py", line 109, in <listcomp>
tensor_proto.string_val.extend([compat.as_bytes(x) for x in proto_values])
File "/home/cfritz/virtualenvs/tensorflow/lib/python3.6/site-packages/tensorflow/python/util/compat.py", line 65, in as_bytes
(bytes_or_text,))
TypeError: Expected binary or unicode string, got nan
Your input DataFrame contains empty reviewer names and review texts, which are mapped to NaN by pd.read_csv(); however, TensorFlow expects a string, not NaN.
Check the empty cells using this command:
df_train[df_train.isnull().any(axis=1)]
You can simply convert these NaNs to an empty string using
df_train.fillna('', inplace=True)
or have pd.read_csv() produce empty strings instead of NaNs directly by turning off the default NA detection with keep_default_na=False:
df_train = pd.read_csv('Digital_Music_5.csv', names=COLUMNS,
                       skipinitialspace=True, low_memory=False,
                       skiprows=1, keep_default_na=False)
