using pandas read_csv with missing data - python

I am attempting to read a csv file where some rows may be missing chunks of data.
This seems to cause a problem for the pandas read_csv function when you specify the dtype: in order to convert from str to whatever the dtype specifies, pandas apparently just tries to cast the value directly, so when a field is missing the conversion breaks down.
A MWE follows (it uses StringIO in place of a true file; however, the issue also happens with a real file):
import pandas as pd
import numpy as np
import io
datfile = io.StringIO("12 23 43| | 37| 12.23| 71.3\n12 23 55|X| | | 72.3")
names = ['id', 'flag', 'number', 'data', 'data2']
dtypes = [np.str, np.str, np.int, np.float, np.float]
dform = {name: dtypes[ind] for ind, name in enumerate(names)}
colconverters = {0: lambda s: s.strip(), 1: lambda s: s.strip()}
df = pd.read_table(datfile, sep='|', dtype=dform, converters=colconverters, header=None,
index_col=0, names=names, na_values=' ')
The error I get when I run this is
Traceback (most recent call last):
File "pandas/parser.pyx", line 1084, in pandas.parser.TextReader._convert_tokens (pandas/parser.c:12580)
TypeError: Cannot cast array from dtype('O') to dtype('int64') according to the rule 'safe'
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
File "/Users/aliounis/Repos/stellarpy/source/mwe.py", line 15, in <module>
index_col=0, names=names, na_values=' ')
File "/usr/local/lib/python3.5/site-packages/pandas/io/parsers.py", line 562, in parser_f
return _read(filepath_or_buffer, kwds)
File "/usr/local/lib/python3.5/site-packages/pandas/io/parsers.py", line 325, in _read
return parser.read()
File "/usr/local/lib/python3.5/site-packages/pandas/io/parsers.py", line 815, in read
ret = self._engine.read(nrows)
File "/usr/local/lib/python3.5/site-packages/pandas/io/parsers.py", line 1314, in read
data = self._reader.read(nrows)
File "pandas/parser.pyx", line 805, in pandas.parser.TextReader.read (pandas/parser.c:8748)
File "pandas/parser.pyx", line 827, in pandas.parser.TextReader._read_low_memory (pandas/parser.c:9003)
File "pandas/parser.pyx", line 904, in pandas.parser.TextReader._read_rows (pandas/parser.c:10022)
File "pandas/parser.pyx", line 1011, in pandas.parser.TextReader._convert_column_data (pandas/parser.c:11397)
File "pandas/parser.pyx", line 1090, in pandas.parser.TextReader._convert_tokens (pandas/parser.c:12656)
ValueError: invalid literal for int() with base 10: ' '
Is there some way I can fix this? I looked through the documentation but didn't see anything that directly addresses this problem. Is this just a bug that needs to be reported to pandas?

Try this:
import pandas as pd
import numpy as np
import io
datfile = io.StringIO(u"12 23 43| | 37| 12.23| 71.3\n12 23 55|X| | | 72.3")
names = ['id', 'flag', 'number', 'data', 'data2']
dtypes = [np.str, np.str, np.str, np.float, np.float]
dform = {name: dtypes[ind] for ind, name in enumerate(names)}
colconverters = {0: lambda s: s.strip(), 1: lambda s: s.strip()}
df = pd.read_table(datfile, sep='|', dtype=dform, converters=colconverters, header=None, na_values=' ')
df.columns = names
Edit: to convert dtypes after import:
df["number"] = df["data"].astype('int')
df["data"] = df["data"].astype('float')
Your data has a mix of blanks (read in as str) and numbers.
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 2 entries, 0 to 1
Data columns (total 5 columns):
id 2 non-null object
flag 2 non-null object
number 2 non-null object
data 2 non-null object
data2 2 non-null float64
dtypes: float64(1), object(4)
memory usage: 152.0+ bytes
If you look at data, it is np.float but gets converted to object, and data2 stays np.float until it hits a blank, at which point it also turns into object.
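If you would rather not leave the numeric columns as plain strings, a hedged alternative (assuming a reasonably recent pandas: pd.to_numeric has been available since 0.17, and the nullable 'Int64' dtype since 0.24) is to coerce after reading:
# coerce blanks to NaN, then use the nullable integer dtype so the column
# can stay integer-like despite the missing value
df["number"] = pd.to_numeric(df["number"], errors='coerce').astype('Int64')
df["data"] = pd.to_numeric(df["data"], errors='coerce')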

So, as Merlin pointed out, the main problem is that NaNs can't be ints, which is probably why pandas acts this way to begin with. I unfortunately didn't have a choice, so I had to make some changes to the pandas source code myself. I ended up having to change lines 1087-1096 of parser.pyx to
na_count_old = na_count
print(col_res)
for ind, row in enumerate(col_res):
    k = kh_get_str(na_hashset, row.strip().encode())
    if k != na_hashset.n_buckets:
        col_res[ind] = np.nan
        na_count += 1
    else:
        col_res[ind] = np.array(col_res[ind]).astype(col_dtype).item(0)

if na_count_old == na_count:
    # float -> int conversions can fail the above
    # even with no nans
    col_res_orig = col_res
    col_res = col_res.astype(col_dtype)
    if (col_res != col_res_orig).any():
        raise ValueError("cannot safely convert passed user dtype of "
                         "{col_dtype} for {col_res} dtyped data in "
                         "column {column}".format(col_dtype=col_dtype,
                                                  col_res=col_res_orig.dtype.name,
                                                  column=i))
which essentially goes through each element of a column and checks whether it is contained in the na list (note that we have to strip the values so that multi-spaces show up as being in the na list). If it is, the element is set to a double np.nan. If it is not in the na list, it is cast to the original dtype specified for that column (which means the column will have multiple dtypes).
While this isn't a perfect fix (and is likely slow), it works for my needs, and maybe someone else who has a similar problem will find it useful.
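For readers who would rather not patch pandas, a rough converter-based sketch (my own workaround, reusing the names and datfile from the MWE above; the column ends up as object/float rather than a true int column) could look like:
import numpy as np
import pandas as pd

def int_or_nan(s):
    # blanks become NaN, everything else is parsed as an integer
    s = s.strip()
    return np.nan if not s else int(s)

df = pd.read_table(datfile, sep='|', header=None, names=names, index_col=0,
                   converters={'flag': lambda s: s.strip() or np.nan,
                               'number': int_or_nan,
                               'data': lambda s: float(s) if s.strip() else np.nan})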

Related

How to handle duplicate Pandas DataFrame columns when also specifying dtype?

I am parsing some data with predefined columns, and sometimes these columns are duplicated e.g.:
df = pd.DataFrame([['A','B']], columns=['A','A'])
The above works just fine, but I also want to specify the dtype for the column, e.g.
df = pd.DataFrame([['A','B']], columns=['A','A'],dtype={'A':str})
However, the above errors out with the following:
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "/home/anaconda3/lib/python3.7/site-packages/pandas/core/frame.py", line 513, in __init__
dtype = self._validate_dtype(dtype)
File "/home/anaconda3/lib/python3.7/site-packages/pandas/core/generic.py", line 345, in _validate_dtype
dtype = pandas_dtype(dtype)
File "/home/anaconda3/lib/python3.7/site-packages/pandas/core/dtypes/common.py", line 1799, in pandas_dtype
npdtype = np.dtype(dtype)
File "/home/anaconda3/lib/python3.7/site-packages/numpy/core/_internal.py", line 62, in _usefields
names, formats, offsets, titles = _makenames_list(adict, align)
File "/home/anaconda3/lib/python3.7/site-packages/numpy/core/_internal.py", line 30, in _makenames_list
n = len(obj)
TypeError: object of type 'type' has no len()
Is there a way around this?
Your syntax is invalid. Irrespective of the duplicated columns, the dtype parameter expects a single dtype:
dtype dtype, default None
Data type to force. Only a single dtype is allowed. If None, infer.
You can use:
df = pd.DataFrame([['A','B']], columns=['A','A']).astype({'A':str})
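As an illustrative check (assuming a pandas version where astype with a dict handles the duplicated label, which is what the snippet above relies on), the conversion applies to every column carrying that label:
df = pd.DataFrame([['A', 'B']], columns=['A', 'A']).astype({'A': str})
print(df.dtypes)  # both 'A' columns report object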

Pandas TypeError: object of type 'float' has no len()

I'm doing some data-discovery using Python/Pandas.
MVCE: I have a CSV file with some street addresses and I want to find the length of the longest address in my file. (this is a simplified version of my actual problem)
I wrote this simple Python code:
import sys
import pandas as pd
df = pd.read_csv(sys.argv[1])
print(df['address'].map(len).max())
The address column is of type str, or so I thought (see below).
Why then do I get this error?
Traceback (most recent call last):
File "eval-lengths.py", line 8, in <module>
print(df['address'].map(len).max())
File "C:\Python35\lib\site-packages\pandas\core\series.py", line 2996, in map
arg, na_action=na_action)
File "C:\Python35\lib\site-packages\pandas\core\base.py", line 1004, in _map_values
new_values = map_f(values, mapper)
File "pandas/_libs/src\inference.pyx", line 1472, in pandas._libs.lib.map_infer
TypeError: object of type 'float' has no len()
Here's the output of df.info()
RangeIndex: 154733 entries, 0 to 154732
Data columns (total 2 columns):
address 154510 non-null object
zip 154732 non-null object
dtypes: object(2)
memory usage: 2.4+ MB
UPDATE
Here's a sample CSV file
address,zip
555 APPLE STREET,82101
1180 BANANA LAKE ROAD,81913
577 LEMON DR,81911
,99999
The last line is key to reproducing the problem.
You have missing data in your column, represented by NaNs (which are of float type).
Don't use map/apply etc. for things like finding the length; just do this with str.len:
df['address'].str.len()
Items for which len() is not applicable automatically show up in the result as NaN. You can fillna(-1) those to indicate that the result is invalid there.
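Applied to the sample CSV from the question, an illustrative run looks like this:
import io
import pandas as pd

csv = io.StringIO("address,zip\n555 APPLE STREET,82101\n,99999")
df = pd.read_csv(csv)
print(df['address'].str.len())        # the missing address shows up as NaN
print(df['address'].str.len().max())  # max() skips the NaN, giving 16.0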
My solution was to fillna with an empty string and then run the apply, like this:
df['address'].fillna('', inplace=True)
print(df['address'].map(len).max())

Python pandas, errors using .str.contains to search dataframe column for substring

I'd appreciate any help you could offer on this issue using python 2.7, pandas 0.22, and easygui 0.98.1.
I'm trying to load a csv into pandas, assign column names from a user-chosen list using easygui (returns as a list of strings, I think) and search in a certain column of the dataframe for a substring.
import easygui as eg
import pandas as pd
# define vars_vars from choices of imagej outputs
vars_vars = eg.multchoicebox(
    msg="\n\n\n\nPlease highlight variables included in ImageJ analysis:",
    title="IF Analysis - 2017",
    choices=["integrated density", "mean",
             "mean grey value", "area fraction"])
# add required imagej columns for later processing
vars_vars.insert(0, "label")
vars_vars.insert(0, "#")
#User input for csv file
file = eg.fileopenbox()
# load into dataframe using pandas and assign columns using chosen variables
df = pd.read_csv(file, header=None, names=None)
df.columns = vars_vars
# Search 'label' column for certain substring
df[df['label'].str.contains('substring')]
But I'm getting this error:
Traceback (most recent call last):
File "C:/Users/User/.PyCharmCE2017.3/config/scratches/scratch.py", line 61, in <module>
df[df['label'].str.contains('nsv')]
File "C:\Users\User\Miniconda2\envs\test2\lib\site-packages\pandas\core\generic.py", line 3614, in __getattr__
return object.__getattribute__(self, name)
AttributeError: 'DataFrame' object has no attribute 'str'
I tried googling it and found these fixes:
df['label'] = df['label'].map(str)
df['label'] = df['label'].astype(str)
df['label'] = df['label'].astype(baseline)
and variations of those where I call the entire dataframe rather than df['label'].
However, all of these run without error on the fix line itself, but the .str.contains line then invariably fails with a similar error to before, stating that the DataFrame object has no attribute map (for the .map(str) fix) or str (for the .astype(x) fixes).
print type(df['label'])
print df.dtypes
print df['label'].head()
print (df['label'].info())
print type(df['label'][0])
returns
<class 'pandas.core.frame.DataFrame'>
# int64
label object
integrated density int64
dtype: object
label
0 nc4_al1_I+pP_4x_contra_ctx_blue.tif
1 nc4_al1_I+pP_4x_contra_ctx_green.tif
2 nc4_al1_I1+pP_4x_contra_ctx_red.tif
3 nc4_al1_I+pP_4x_contra_hc_blue.tif
4 nc4_al1_I+pP_4x_contra_hc_green.tif
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 695 entries, 0 to 694
Data columns (total 1 columns):
(label,) 695 non-null object
dtypes: object(1)
memory usage: 5.5+ KB
None
Traceback (most recent call last):
File "C:/Users/Shon/.PyCharmCE2017.3/config/scratches/scratch.py", line 67, in <module>
print type(df['label'][0])
File "C:\Users\User\Miniconda2\envs\test2\lib\site-packages\pandas\core\frame.py", line 2137, in __getitem__
return self._getitem_multilevel(key)
File "C:\Users\User\Miniconda2\envs\test2\lib\site-packages\pandas\core\frame.py", line 2181, in _getitem_multilevel
loc = self.columns.get_loc(key)
File "C:\Users\User\Miniconda2\envs\test2\lib\site-packages\pandas\core\indexes\multi.py", line 2072, in get_loc
loc = self._get_level_indexer(key, level=0)
File "C:\Users\User\Miniconda2\envs\test2\lib\site-packages\pandas\core\indexes\multi.py", line 2362, in _get_level_indexer
loc = level_index.get_loc(key)
File "C:\Users\User\Miniconda2\envs\test2\lib\site-packages\pandas\core\indexes\base.py", line 2527, in get_loc
return self._engine.get_loc(self._maybe_cast_indexer(key))
File "pandas/_libs/index.pyx", line 117, in pandas._libs.index.IndexEngine.get_loc
File "pandas/_libs/index.pyx", line 139, in pandas._libs.index.IndexEngine.get_loc
File "pandas/_libs/hashtable_class_helper.pxi", line 1265, in pandas._libs.hashtable.PyObjectHashTable.get_item
File "pandas/_libs/hashtable_class_helper.pxi", line 1273, in pandas._libs.hashtable.PyObjectHashTable.get_item
KeyError: 0
I'd greatly appreciate any help you guys could offer.
UPDATE:
Turns out the issue was having a MultiIndex dataframe.
df.columns = df.columns.map(''.join).str.strip()
changed the column into a Series (confirmed by printing its type), which enabled me to correctly run .str.contains on the data.
Looking at the output of print(df['label'].info()):
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 695 entries, 0 to 694
Data columns (total 1 columns):
(label,) 695 non-null object
dtypes: object(1)
memory usage: 5.5+ KB
None
It appears that the column label in this dataframe is a tuple with an empty element, which indicates a MultiIndex column heading where the second level is empty.
I'd recommend first trying to fix the pd.read_csv call to prevent the creation of a MultiIndex. However, using:
df.columns = df.columns.map(''.join).str.strip()
will flatten that MultiIndex column header to a single-level column index, making df['label'] a pd.Series and allowing the use of the string accessor and contains.
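A small constructed illustration of the flattening step (not the original ImageJ csv; the file name is made up):
import pandas as pd

cols = pd.MultiIndex.from_tuples([('#', ''), ('label', '')])
df = pd.DataFrame([[0, 'nc4_al1_blue.tif']], columns=cols)
print(type(df['label']))                    # DataFrame, so .str fails
df.columns = df.columns.map(''.join).str.strip()
print(type(df['label']))                    # Series
print(df['label'].str.contains('blue'))     # now works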

Changing abbreviations for strings

I'm trying to change the abbreviated address endings to their full descriptions, but the traceback does not make any sense to me. Please tell me what I am doing wrong here.
import pandas as pd
edit = pd.read_csv('mycsvfile')
edit['Home'] = edit['Home'].apply(lambda s: s.replace('Ct', 'Court'))
edit['Home'] = edit['Home'].apply(lambda s: s.replace('Rd', 'Road'))
edit['Home'] = edit['Home'].apply(lambda s: s.replace('Ln', 'Lane'))
edit.to_csv('newcsvfile',index = False)
Traceback (most recent call last):
File "C:\Users\.py", line 20, in <module>
edit['Home'] = edit['Home'].apply(lambda s: s.replace('Ct', 'Court'))
File "C:\********.py", line 2294, in apply
mapped = lib.map_infer(values, f, convert=convert_dtype)
File "pandas\src\inference.pyx", line 1207, in pandas.lib.map_infer (pandas\lib.c:66124)
File "C:******.py", , in <lambda>
edit['Home'] = edit['Home'].apply(lambda s: s.replace('Ct', 'Court'))
AttributeError: 'float' object has no attribute 'replace'
These are few of the values in the Home column:
1458 Clearlight Rd
7458 Grove Ln
8574 Grove Ct
2222 Grove Ln
1258 Grove Ct
1478 Grove Ln
Some of the values in the Home column are missing. Pandas treats missing values as numpy nan, which are of type float.
You have a few options:
Fill your missing values with something other than that np.nan: http://pandas.pydata.org/pandas-docs/version/0.17.0/generated/pandas.DataFrame.fillna.html
(Fill missing values when reading csv: http://pandas.pydata.org/pandas-docs/stable/generated/pandas.read_csv.html)
Filter for non-null values, then apply your function:
edit[edit['Home'].notnull()]['Home'].apply(lambda s: s.replace('Ct', 'Court'))
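Another option worth sketching: Series.str.replace propagates NaN instead of raising, so the whole column can be processed without filtering (the replacements mirror the question):
edit['Home'] = edit['Home'].str.replace('Ct', 'Court')
edit['Home'] = edit['Home'].str.replace('Rd', 'Road')
edit['Home'] = edit['Home'].str.replace('Ln', 'Lane')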

How to get n longest entries of DataFrame?

I'm trying to get the n longest entries of a dask DataFrame. I tried calling nlargest on a dask DataFrame with two columns like this:
import dask.dataframe as dd
df = dd.read_csv("opendns-random-domains.txt", header=None, names=['domain_name'])
df['domain_length'] = df.domain_name.map(len)
print(df.head())
print(df.dtypes)
top_3 = df.nlargest(3, 'domain_length')
print(top_3.head())
The file opendns-random-domains.txt contains just a long list of domain names. This is what the output of the above code looks like:
domain_name domain_length
0 webmagnat.ro 12
1 nickelfreesolutions.com 23
2 scheepvaarttelefoongids.nl 26
3 tursan.net 10
4 plannersanonymous.com 21
domain_name object
domain_length float64
dtype: object
Traceback (most recent call last):
File "nlargest_test.py", line 9, in <module>
print(top_3.head())
File "/home/work/Dokumente/ModZero/Commerzbank/DNS_und_Proxylog-Analyse/dask-log-analyzer/venv/lib/python3.5/site-packages/dask/dataframe/core.py", line 382, in head
result = result.compute()
File "/home/work/Dokumente/ModZero/Commerzbank/DNS_und_Proxylog-Analyse/dask-log-analyzer/venv/lib/python3.5/site-packages/dask/base.py", line 86, in compute
return compute(self, **kwargs)[0]
File "/home/work/Dokumente/ModZero/Commerzbank/DNS_und_Proxylog-Analyse/dask-log-analyzer/venv/lib/python3.5/site-packages/dask/base.py", line 179, in compute
results = get(dsk, keys, **kwargs)
File "/home/work/Dokumente/ModZero/Commerzbank/DNS_und_Proxylog-Analyse/dask-log-analyzer/venv/lib/python3.5/site-packages/dask/threaded.py", line 57, in get
**kwargs)
File "/home/work/Dokumente/ModZero/Commerzbank/DNS_und_Proxylog-Analyse/dask-log-analyzer/venv/lib/python3.5/site-packages/dask/async.py", line 484, in get_async
raise(remote_exception(res, tb))
dask.async.TypeError: Cannot use method 'nlargest' with dtype object
Traceback
---------
File "/home/work/Dokumente/ModZero/Commerzbank/DNS_und_Proxylog-Analyse/dask-log-analyzer/venv/lib/python3.5/site-packages/dask/async.py", line 267, in execute_task
result = _execute_task(task, data)
File "/home/work/Dokumente/ModZero/Commerzbank/DNS_und_Proxylog-Analyse/dask-log-analyzer/venv/lib/python3.5/site-packages/dask/async.py", line 249, in _execute_task
return func(*args2)
File "/home/work/Dokumente/ModZero/Commerzbank/DNS_und_Proxylog-Analyse/dask-log-analyzer/venv/lib/python3.5/site-packages/dask/dataframe/core.py", line 2040, in <lambda>
f = lambda df: df.nlargest(n, columns)
File "/home/work/Dokumente/ModZero/Commerzbank/DNS_und_Proxylog-Analyse/dask-log-analyzer/venv/lib/python3.5/site-packages/pandas/core/frame.py", line 3355, in nlargest
return self._nsorted(columns, n, 'nlargest', keep)
File "/home/work/Dokumente/ModZero/Commerzbank/DNS_und_Proxylog-Analyse/dask-log-analyzer/venv/lib/python3.5/site-packages/pandas/core/frame.py", line 3318, in _nsorted
ser = getattr(self[columns[0]], method)(n, keep=keep)
File "/home/work/Dokumente/ModZero/Commerzbank/DNS_und_Proxylog-Analyse/dask-log-analyzer/venv/lib/python3.5/site-packages/pandas/util/decorators.py", line 91, in wrapper
return func(*args, **kwargs)
File "/home/work/Dokumente/ModZero/Commerzbank/DNS_und_Proxylog-Analyse/dask-log-analyzer/venv/lib/python3.5/site-packages/pandas/core/series.py", line 1898, in nlargest
return algos.select_n(self, n=n, keep=keep, method='nlargest')
File "/home/work/Dokumente/ModZero/Commerzbank/DNS_und_Proxylog-Analyse/dask-log-analyzer/venv/lib/python3.5/site-packages/pandas/core/algorithms.py", line 559, in select_n
raise TypeError("Cannot use method %r with dtype %s" % (method, dtype))
I'm confused, because I'm calling nlargest on the column which is of type float64 but still get this error saying it cannot be called on dtype object. Also this works fine in pandas. How can I get the n longest entries from a DataFrame?
I was helped by explicit type conversion:
df['column'].astype(str).astype(float).nlargest(5)
This is how my first data frame looks, and this is how it looks after getting the top 5 (screenshots omitted):
station_count.nlargest(5, 'count')
You have to point nlargest at a column that has an int (or other numeric) dtype, not a string one, so that it can rank the values: give the top-n number first, followed by the corresponding numeric column.
I tried to reproduce your problem but things worked fine. Can I recommend that you produce a Minimal Complete Verifiable Example?
Pandas example
In [1]: import pandas as pd
In [2]: df = pd.DataFrame({'x': ['a', 'bb', 'ccc', 'dddd']})
In [3]: df['y'] = df.x.map(len)
In [4]: df
Out[4]:
x y
0 a 1
1 bb 2
2 ccc 3
3 dddd 4
In [5]: df.nlargest(3, 'y')
Out[5]:
x y
3 dddd 4
2 ccc 3
1 bb 2
Dask dataframe example
In [1]: import pandas as pd
In [2]: df = pd.DataFrame({'x': ['a', 'bb', 'ccc', 'dddd']})
In [3]: import dask.dataframe as dd
In [4]: ddf = dd.from_pandas(df, npartitions=2)
In [5]: ddf['y'] = ddf.x.map(len)
In [6]: ddf.nlargest(3, 'y').compute()
Out[6]:
x y
3 dddd 4
2 ccc 3
1 bb 2
Alternatively, perhaps this is just working now on the git master version?
You only need to change the type of the respective column to int or float using .astype().
For example, in your case:
top_3 = df['domain_length'].astype(float).nlargest(3)
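In the dask version from the question, the same idea would look roughly like this (a sketch, assuming df was read as in the question):
df['domain_length'] = df.domain_name.map(len).astype(float)
top_3 = df.nlargest(3, 'domain_length')
print(top_3.compute())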
If you want to get the values with the most occurrences from a string-type column, you can use value_counts() with nlargest(n), where n is the number of elements you want returned.
df['your_column'].value_counts().nlargest(3)
This returns the top 3 occurrences from that column.
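For instance (constructed data):
s = pd.Series(['a.com', 'b.com', 'a.com', 'c.com', 'a.com', 'b.com'])
print(s.value_counts().nlargest(3))
# a.com    3
# b.com    2
# c.com    1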
