ValueError: import data via chunks into pandas.read_csv() - python

I have a large gzip file which I would like to import into a pandas dataframe. Unfortunately, the rows have an uneven number of columns. The data has roughly this format:
.... Col_20: 25 Col_21: 23432 Col22: 639142
.... Col_20: 25 Col_22: 25134 Col23: 243344
.... Col_21: 75 Col_23: 79876 Col25: 634534 Col22: 5 Col24: 73453
.... Col_20: 25 Col_21: 32425 Col23: 989423
.... Col_20: 25 Col_21: 23424 Col22: 342421 Col23: 7 Col24: 13424 Col 25: 67
.... Col_20: 95 Col_21: 32121 Col25: 111231
As a test, I tried this:
import pandas as pd

filename = 'path/to/filename.gz'
for chunk in pd.read_csv(filename, sep='\t', chunksize=10**5, engine='python'):
    print(chunk)
Here is the error I get in return:
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "/nfs/sw/python/python-3.5.1/lib/python3.5/site-packages/pandas/io/parsers.py", line 795, in __next__
return self.get_chunk()
File "/nfs/sw/python/python-3.5.1/lib/python3.5/site-packages/pandas/io/parsers.py", line 836, in get_chunk
return self.read(nrows=size)
File "/nfs/sw/python/python-3.5.1/lib/python3.5/site-packages/pandas/io/parsers.py", line 815, in read
ret = self._engine.read(nrows)
File "/nfs/sw/python/python-3.5.1/lib/python3.5/site-packages/pandas/io/parsers.py", line 1761, in read
alldata = self._rows_to_cols(content)
File "/nfs/sw/python/python-3.5.1/lib/python3.5/site-packages/pandas/io/parsers.py", line 2166, in _rows_to_cols
raise ValueError(msg)
ValueError: Expected 18 fields in line 28, saw 22
How can you allocate a certain number of columns for pandas.read_csv()?

You could also try this:
for chunk in pd.read_csv(filename, sep='\t', chunksize=10**5,
                         engine='python', error_bad_lines=False):
    print(chunk)
error_bad_lines will skip the bad lines, though. I will see if a better alternative can be found.
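(Note: in pandas 1.3 and later, error_bad_lines is deprecated; the equivalent spelling is on_bad_lines='skip'.)

for chunk in pd.read_csv(filename, sep='\t', chunksize=10**5,
                         engine='python', on_bad_lines='skip'):
    print(chunk)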
EDIT: In order to keep track of the lines skipped by error_bad_lines, we can parse the error message, record the offending line numbers, and retry the read:
line = []
expected = []
saw = []
cont = True
while cont:
    try:
        data = pd.read_csv('file1.csv', skiprows=line)
        cont = False
    except Exception as e:
        msg = str(e)  # e.message is Python 2 only
        errortype = msg.split('.')[0].strip()
        if errortype == 'Error tokenizing data':
            # parse 'Expected N fields in line L, saw M' out of the message
            cerror = msg.split(':')[1].strip().replace(',', '')
            nums = [n for n in cerror.split(' ') if n.isdigit()]
            expected.append(int(nums[0]))
            saw.append(int(nums[2]))
            line.append(int(nums[1]) - 1)
        else:
            cerror = 'Unknown'
            print('Unknown Error - 222')
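Alternatively, to answer the original question directly: you can pre-allocate a fixed number of columns by passing an explicit names list to read_csv, sized to the widest row. A minimal sketch, assuming no row has more than 30 fields (max_cols is an assumption; adjust it to your data):

max_cols = 30  # assumed upper bound on fields per row
names = ['col_{}'.format(i) for i in range(max_cols)]
for chunk in pd.read_csv(filename, sep='\t', chunksize=10**5,
                         engine='python', names=names):
    # rows with fewer fields are padded with NaN instead of raising ValueError
    print(chunk)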

Related

string.split() giving memory error in pandas dataframe

I am trying to split a string but am getting a memory error. Is there any way to solve this, or an alternative solution?
I am getting the error in the code below:
content_str = str(content_str).split('\n')
df1 = pd.DataFrame(content_str)
df1 = df1[0].str.split(',', expand=True)
Error-
Traceback (most recent call last):
File "ravi_sir.py", line 47, in <module>
df1 = df1[0].str.split(',', expand=True)
File "/app/python3/lib/python3.6/site-packages/pandas/core/strings.py", line 2001, in wrapper
return func(self, *args, **kwargs)
File "/app/python3/lib/python3.6/site-packages/pandas/core/strings.py", line 2690, in split
return self._wrap_result(result, expand=expand, returns_string=expand)
File "/app/python3/lib/python3.6/site-packages/pandas/core/strings.py", line 2272, in _wrap_result
result = cons(result, columns=name, index=index, dtype=dtype)
File "/app/python3/lib/python3.6/site-packages/pandas/core/frame.py", line 520, in __init__
mgr = arrays_to_mgr(arrays, columns, index, columns, dtype=dtype)
File "/app/python3/lib/python3.6/site-packages/pandas/core/internals/construction.py", line 93, in arrays_to_mgr
return create_block_manager_from_arrays(arrays, arr_names, axes)
File "/app/python3/lib/python3.6/site-packages/pandas/core/internals/managers.py", line 1650, in create_block_manager_from_arrays
blocks = form_blocks(arrays, names, axes)
File "/app/python3/lib/python3.6/site-packages/pandas/core/internals/managers.py", line 1739, in form_blocks
object_blocks = _simple_blockify(items_dict["ObjectBlock"], np.object_)
File "/app/python3/lib/python3.6/site-packages/pandas/core/internals/managers.py", line 1784, in _simple_blockify
values, placement = _stack_arrays(tuples, dtype)
File "/app/python3/lib/python3.6/site-packages/pandas/core/internals/managers.py", line 1830, in _stack_arrays
stacked = np.empty(shape, dtype=dtype)
MemoryError
I am trying to read a zip file from an S3 bucket and save the content into a dataframe, to get the total count of files inside that zip file. My full code is given below:
list_table = []
for table in d:
    dict_table = OrderedDict()
    s_time = datetime.datetime.now().strftime("%H:%M:%S")
    print("start_time--->>", s_time)
    print("tablename--->>", table)
    s3 = boto3.resource('s3')
    key = 'raw/vs-1/load-1619/data' + '/' + table
    obj = s3.Object('********', key)
    n = obj.get()['Body'].read()
    gzipfile = BytesIO(n)
    gzipfile = gzip.GzipFile(fileobj=gzipfile)
    content = gzipfile.read()
    #print(content)
    content_str = content.decode('utf-8')
    content_str = str(content_str).split('\n')
    df1 = pd.DataFrame(content_str)
    df1 = df1[0].str.split(',', expand=True)
    #df1 = pd.DataFrame([x.split(',') for x in str(content_str).split('\n')])
    #print(df1)
    #count = os.popen('aws s3 cp s3://itx-agu-lake/raw/vs-1/load-1619/data/{0} - | wc -l'.format(table)).read()
    count = int(len(df1)) - 2
    del(df1)
    e_time = datetime.datetime.now().strftime("%H:%M:%S")
    print("End_time---->>", e_time)
    print(count)
    dict_table['Table_Name'] = str(table)
    dict_table['Count'] = count
    list_table.append(dict_table)
Since you are splitting a huge string into a df column and then deleting the df, it looks like you only need a per-row count. So get that count directly, rather than splitting the df, which can generate a huge number of columns and therefore cause your memory error.
row1list = ['1,2,3,4']
row2list = ['5,6']
row3list = ['7,8,9']
df = pd.DataFrame([row1list, row2list, row3list], columns=['col'])
df['count_commas'] = df['col'].str.count(',')
print(df)
# col count_commas
# 0 1,2,3,4 3
# 1 5,6 1
# 2 7,8,9 2
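And if the goal is really just the count that len(df1) - 2 computes, you can skip the DataFrame entirely and count newlines in the raw bytes; a sketch reusing the content variable from the loop in the question:

count = content.count(b'\n') - 1  # same value as len(df1) - 2 above

This avoids allocating the (rows x columns) object array that np.empty fails on in the traceback.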

Why can dask.dataframe.apply only process a column called 'name'?

I am attempting to port some Pandas (Python) code to Dask instead. I am using Pandas 1.1.3 and Dask 2.30.0. I keep ramming my head against a wall I can't see. That is, I cannot understand what is going on here. I have boiled it down to the following minimal working example:
My data is the file 'test.csv' containing the following:
age,name
28,Alice
The following Python script (using Pandas) works fine:
import pandas as pd
df = pd.read_csv("test.csv", dtype={'name': str})
result = df['name'].apply(lambda text: text.upper())
#result = df['age'].apply(lambda num: num + 1)
print(result)
and prints:
0 ALICE
Name: name, dtype: object
The commented-out line operating on the 'age' column also works and prints:
0 29
Name: age, dtype: int64
Now, with Dask instead, my example becomes:
import dask.dataframe as dd
df = dd.read_csv("test.csv", dtype={'name': str})
result = df['name'].apply(lambda text: text.upper(), meta={'name': str})
#result = df['age'].apply(lambda num: num + 1, meta={'age': int})
print(result.compute())
which works fine just like the Pandas example. However, if I try the commented-out line operating on the 'age' column instead, Python complains with the following error message:
Traceback (most recent call last):
File "test_dask.py", line 7, in <module>
print(result.compute())
File "/some/path/miniconda3/envs/testdask/lib/python3.8/site-packages/dask/base.py", line 167, in compute
(result,) = compute(self, traverse=False, **kwargs)
File "/some/path/miniconda3/envs/testdask/lib/python3.8/site-packages/dask/base.py", line 452, in compute
results = schedule(dsk, keys, **kwargs)
File "/some/path/miniconda3/envs/testdask/lib/python3.8/site-packages/dask/threaded.py", line 76, in get
results = get_async(
File "/some/path/miniconda3/envs/testdask/lib/python3.8/site-packages/dask/local.py", line 486, in get_async
raise_exception(exc, tb)
File "/some/path/miniconda3/envs/testdask/lib/python3.8/site-packages/dask/local.py", line 316, in reraise
raise exc
File "/some/path/miniconda3/envs/testdask/lib/python3.8/site-packages/dask/local.py", line 222, in execute_task
result = _execute_task(task, data)
File "/some/path/miniconda3/envs/testdask/lib/python3.8/site-packages/dask/core.py", line 121, in _execute_task
return func(*(_execute_task(a, cache) for a in args))
File "/some/path/miniconda3/envs/testdask/lib/python3.8/site-packages/dask/optimization.py", line 961, in __call__
return core.get(self.dsk, self.outkey, dict(zip(self.inkeys, args)))
File "/some/path/miniconda3/envs/testdask/lib/python3.8/site-packages/dask/core.py", line 151, in get
result = _execute_task(task, cache)
File "/some/path/miniconda3/envs/testdask/lib/python3.8/site-packages/dask/core.py", line 121, in _execute_task
return func(*(_execute_task(a, cache) for a in args))
File "/some/path/miniconda3/envs/testdask/lib/python3.8/site-packages/dask/utils.py", line 29, in apply
return func(*args, **kwargs)
File "/some/path/miniconda3/envs/testdask/lib/python3.8/site-packages/dask/dataframe/core.py", line 5306, in apply_and_enforce
c = meta.name
File "/some/path/miniconda3/envs/testdask/lib/python3.8/site-packages/pandas/core/generic.py", line 5139, in __getattr__
return object.__getattribute__(self, name)
AttributeError: 'DataFrame' object has no attribute 'name'
Even if I just call the 'name' column something else, it also fails like this. It is as if Dask is only able to work on columns of a DataFrame that are called 'name'. This seems extraordinarily weird to me, and I must be misunderstanding something. What is really going on here?
The docs seem to suggest that the dict should work, so that's weird, but if you replace the meta argument with a tuple instead, your code runs as expected:
df = dd.read_csv("test.csv")
result = df['age'].apply(lambda num: num + 1, meta=('age', 'int64'))
print(result.compute())
and prints:
0 29
Name: age, dtype: int64
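As for what is actually going on: a dict meta tells Dask that apply returns a DataFrame, so apply_and_enforce runs c = meta.name, a line meant for a Series name. On a pandas DataFrame, attribute access falls back to column lookup, so meta.name only resolves when a column is literally called 'name'; for any other column it raises the AttributeError seen in the traceback. The pandas fallback can be seen in isolation (a sketch):

import pandas as pd

meta = pd.DataFrame({'name': pd.Series(dtype=str)})
print(meta.name)   # attribute access falls back to the 'name' column

meta = pd.DataFrame({'age': pd.Series(dtype=int)})
# meta.name        # would raise AttributeError, exactly as in the traceback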

MemoryError when running python script on google cloud

I am trying to use Google Cloud to run a script that makes predictions for every line of a test.csv file. I use the cloud because it looks like Google Colab is going to take some time. However, when I run it, there is a memory error:
(pre_env) mikempc3@instance-1:~$ python predictSales.py
Traceback (most recent call last):
File "predictSales.py", line 7, in <module>
sales = pd.read_csv("sales_train.csv")
File "/home/mikempc3/pre_env/lib/python3.5/site-packages/pandas/io/parsers.py", line 685, in parser_f
return _read(filepath_or_buffer, kwds)
File "/home/mikempc3/pre_env/lib/python3.5/site-packages/pandas/io/parsers.py", line 463, in _read
data = parser.read(nrows)
File "/home/mikempc3/pre_env/lib/python3.5/site-packages/pandas/io/parsers.py", line 1169, in read
df = DataFrame(col_dict, columns=columns, index=index)
File "/home/mikempc3/pre_env/lib/python3.5/site-packages/pandas/core/frame.py", line 411, in __init__
mgr = init_dict(data, index, columns, dtype=dtype)
File "/home/mikempc3/pre_env/lib/python3.5/site-packages/pandas/core/internals/construction.py", line 257, in init_dict
return arrays_to_mgr(arrays, data_names, index, columns, dtype=dtype)
File "/home/mikempc3/pre_env/lib/python3.5/site-packages/pandas/core/internals/construction.py", line 87, in arrays_to_mgr
return create_block_manager_from_arrays(arrays, arr_names, axes)
File "/home/mikempc3/pre_env/lib/python3.5/site-packages/pandas/core/internals/managers.py", line 1694, in create_block_manager_from_arrays
blocks = form_blocks(arrays, names, axes)
File "/home/mikempc3/pre_env/lib/python3.5/site-packages/pandas/core/internals/managers.py", line 1764, in form_blocks
int_blocks = _multi_blockify(items_dict["IntBlock"])
File "/home/mikempc3/pre_env/lib/python3.5/site-packages/pandas/core/internals/managers.py", line 1846, in _multi_blockify
values, placement = _stack_arrays(list(tup_block), dtype)
File "/home/mikempc3/pre_env/lib/python3.5/site-packages/pandas/core/internals/managers.py", line 1874, in _stack_arrays
stacked = np.empty(shape, dtype=dtype)
MemoryError: Unable to allocate 67.2 MiB for an array with shape (3, 2935849) and data type int64
Here is my script:
import statsmodels
import statsmodels.tsa.arima.model as smt
import pandas as pd
import datetime
import numpy as np

sales = pd.read_csv("sales_train.csv")
test = pd.read_csv("test.csv")

sales.date = sales.date.apply(lambda x: datetime.datetime.strptime(x, "%d.%m.%Y"))
sales_monthly = sales.groupby(
    ["date_block_num", "shop_id", "item_id"])["date", "item_price",
                                              "item_cnt_day"].agg({
        "date": ["min", "max"],
        "item_price": "mean",
        "item_cnt_day": "sum"})

array = []
for i, row in test.iterrows():
    print("row['shop_id']: ", row['shop_id'], " row['item_id']: ", row['item_id'])
    print(statsmodels.__version__)
    ts = pd.DataFrame(sales_monthly.loc[pd.IndexSlice[:, [row['shop_id']], [row['item_id']]], :]['item_price'].values *
                      sales_monthly.loc[pd.IndexSlice[:, [row['shop_id']], [row['item_id']]], :][
                          'item_cnt_day'].values).T.iloc[0]
    print(ts.values)
    if ts.values != [] and len(ts.values) > 2:
        best_aic = np.inf
        best_order = None
        best_model = None
        ranges = range(1, 5)
        for difference in ranges:
            # try:
            tmp_model = smt.ARIMA(ts.values, order=(0, 1, 0), trend='t').fit()
            tmp_aic = tmp_model.aic
            if tmp_aic < best_aic:
                best_aic = tmp_aic
                best_difference = difference
                best_model = tmp_model
            # except Exception as e:
            #     print(e)
            #     continue
        if best_model is not None:
            y_hat = best_model.forecast()[0]
            if y_hat < 0:
                y_hat = 0
        else:
            y_hat = 0
    else:
        y_hat = 0
    print("predicted:", y_hat)
    d = {'id': row['ID'], 'item_cnt_month': y_hat}
    array.append(d)
    print("-------------------")

df = pd.DataFrame(array)
df.to_csv("submission.csv")
You can use the Fil memory profiler (https://pythonspeed.com/fil) to figure out which lines of code are responsible for peak memory use. It will also handle out-of-memory conditions and dump a report when you run out.
The only caveats are that (1) it requires Python 3.6 or later and (2) it only runs on Linux or macOS. We're up to 3.9, so it's probably time to upgrade regardless.
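If the initial read_csv is itself the peak, one common mitigation is to load only the columns you use and give them narrower dtypes. A sketch; the column names are taken from the script above, but the chosen widths are assumptions to verify against your data:

import pandas as pd

sales = pd.read_csv(
    "sales_train.csv",
    usecols=["date", "date_block_num", "shop_id", "item_id",
             "item_price", "item_cnt_day"],
    dtype={"date_block_num": "int16", "shop_id": "int16",
           "item_id": "int32", "item_price": "float32",
           "item_cnt_day": "float32"},
)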

Value Error Mismatch While Converting Using Pandas

Here is the mismatch error I keep getting. I'm inputting "202710".
Traceback (most recent call last):
File "nbastatsrecieveit.py", line 29, in <module>
df.columns = headers
File "C:\Users\*\AppData\Local\Packages\PythonSoftwareFoundation.Python.3.8_qbz5n2kfra8p0\LocalCache\local-packages\Python38\site-packages\pandas\core\generic.py", line 5149, in __setattr__
return object.__setattr__(self, name, value)
File "pandas\_libs\properties.pyx", line 66, in pandas._libs.properties.AxisProperty.__set__
File "C:\Users\*\AppData\Local\Packages\PythonSoftwareFoundation.Python.3.8_qbz5n2kfra8p0\LocalCache\local-packages\Python38\site-packages\pandas\core\generic.py", line 564, in _set_axis
self._mgr.set_axis(axis, labels)
File "C:\Users\*\AppData\Local\Packages\PythonSoftwareFoundation.Python.3.8_qbz5n2kfra8p0\LocalCache\local-packages\Python38\site-packages\pandas\core\internals\managers.py", line 226, in set_axis
raise ValueError(
ValueError: Length mismatch: Expected axis has 0 elements, new values have 24 elements
To be honest, I'm not sure how to go about fixing this problem, as the code works with specific player IDs but not all of them. Here is the rest of my code:
from nba_api.stats.endpoints import shotchartdetail
import pandas as pd
import json
from openpyxl import Workbook

print('Player ID?')
playerid = input()
filename = str(playerid) + '.xlsx'

response = shotchartdetail.ShotChartDetail(
    team_id=0,
    context_measure_simple='FGA',
    #last_n_games = numGames,
    game_id_nullable='0041900403',
    player_id=playerid
)
content = json.loads(response.get_json())

# transform contents into dataframe
results = content['resultSets'][0]
headers = results['headers']
rows = results['rowSet']
#df = pd.DataFrame(rows)
df = pd.DataFrame(rows)
df.columns = headers

# write to excel file
df.to_excel(filename, index=False)
This is because your df is empty for ID 202710, so there are no columns to rename. Exception handling will resolve the issue here:
df = pd.DataFrame(rows)
try:
    df.columns = headers
except ValueError:
    # no rows came back for this player/game, so leave the empty df as-is
    pass
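Alternatively (a sketch), pass the headers at construction time; an empty rowSet then still yields a zero-row DataFrame with the expected 24 named columns instead of raising:

df = pd.DataFrame(rows, columns=headers)
df.to_excel(filename, index=False)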

I have two CSV files in the format shown below and I want to fuzzy match them and keep the highest-ratio match

File-1           File-2
Name      Age    Name      Age
Hiites    21     Hitesh    21
Hardick   11     Hardik    11
Rajes     48     Rajesh    48
Snha      47     Sneha     47
Here I want to match the names and get the best match. Below is the code I have used, and I am getting the following error:
import pandas as pd
from pandas import DataFrame
from fuzzywuzzy import process
import csv

save_file = open('fuzzy_match_results.csv', 'w')
writer = csv.writer(save_file, lineterminator='\n')

def parse_csv(path):
    with open(path, 'r') as f:
        reader = csv.reader(f, delimiter=',')
        for row in reader:
            yield row

if __name__ == "__main__":
    ## Create lookup dictionary by parsing the products csv
    data = {}
    for row in parse_csv('file1.csv'):
        data[row[0]] = row[0]

    ## For each row in the lookup compute the partial ratio
    for row in parse_csv("file2.csv"):
        for found, score, matchrow in process.extractOne(row, data, score_cutoff=60):
            if score >= 60:
                print('%d%% partial match: "%s" with "%s" ' % (score, row, found))
                Digi_Results = [row, score, found]
                writer.writerow(Digi_Results)
    save_file.close()
Below is the error:
File "script.py", line 26, in <module>
for found, score, matchrow in process.extractOne(row, data, score_cutoff = 60):
File "/usr/local/lib/python2.7/dist-packages/fuzzywuzzy/process.py", line 245, in extractOne
return max(best_list, key=lambda i: i[1])
File "/usr/local/lib/python2.7/dist-packages/fuzzywuzzy/process.py", line 103, in extractWithoutOrder
processed_query = processor(query)
File "/usr/local/lib/python2.7/dist-packages/fuzzywuzzy/utils.py", line 89, in full_process
string_out = StringProcessor.replace_non_letters_non_numbers_with_whitespace(s)
File "/usr/local/lib/python2.7/dist-packages/fuzzywuzzy/string_processing.py", line 26, in replace_non_letters_non_numbers_with_whitespace
return cls.regex.sub(" ", a_string)
TypeError: expected string or buffer
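For what it's worth, the TypeError comes from handing process.extractOne a whole CSV row (a list) where it expects a string, and the surrounding for loop then tries to iterate over what is actually a single result tuple. A minimal sketch of a fix, assuming the name is the first field of each row:

choices = [row[0] for row in parse_csv('file1.csv')]
for row in parse_csv('file2.csv'):
    best = process.extractOne(row[0], choices, score_cutoff=60)
    if best is not None:  # extractOne returns None when nothing clears the cutoff
        found, score = best
        print('%d%% partial match: "%s" with "%s"' % (score, row[0], found))
        writer.writerow([row[0], score, found])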
