Changing abbreviations for strings - python

I'm trying to replace the abbreviated street types at the end of addresses with their full words, but the traceback doesn't make sense to me. Please tell me what I am doing wrong here.
import pandas as pd
edit = pd.read_csv('mycsvfile')
edit['Home'] = edit['Home'].apply(lambda s: s.replace('Ct', 'Court'))
edit['Home'] = edit['Home'].apply(lambda s: s.replace('Rd', 'Road'))
edit['Home'] = edit['Home'].apply(lambda s: s.replace('Ln', 'Lane'))
edit.to_csv('newcsvfile',index = False)
Traceback (most recent call last):
File "C:\Users\.py", line 20, in <module>
edit['Home'] = edit['Home'].apply(lambda s: s.replace('Ct', 'Court'))
File "C:\********.py", line 2294, in apply
mapped = lib.map_infer(values, f, convert=convert_dtype)
File "pandas\src\inference.pyx", line 1207, in pandas.lib.map_infer (pandas\lib.c:66124)
File "C:******.py", , in <lambda>
edit['Home'] = edit[Home'].apply(lambda s: s.replace('Ct', 'Court'))
AttributeError: 'float' object has no attribute 'replace'
These are a few of the values in the Home column:
1458 Clearlight Rd
7458 Grove Ln
8574 Grove Ct
2222 Grove Ln
1258 Grove Ct
1478 Grove Ln

Some of the values in the Home column are missing. Pandas represents missing values as numpy NaN (np.nan), which is of type float, so string methods like replace fail on them.
You have a few options:
Fill your missing values with something other than np.nan: http://pandas.pydata.org/pandas-docs/version/0.17.0/generated/pandas.DataFrame.fillna.html
(Or fill missing values while reading the csv: http://pandas.pydata.org/pandas-docs/stable/generated/pandas.read_csv.html)
Filter for non-null values, then apply your function:
edit[edit['Home'].notnull()]['Home'].apply(lambda s: s.replace('Ct', 'Court'))
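A minimal end-to-end sketch of the fillna route, assuming empty strings are an acceptable stand-in for the missing addresses (the chained .str.replace also collapses the three apply calls into one column-wise pass per abbreviation):
import pandas as pd

edit = pd.read_csv('mycsvfile')

# Replace missing addresses with empty strings so every value is a str.
edit['Home'] = edit['Home'].fillna('')

# .str.replace works element-wise on the whole column.
edit['Home'] = (edit['Home']
                .str.replace('Ct', 'Court')
                .str.replace('Rd', 'Road')
                .str.replace('Ln', 'Lane'))

edit.to_csv('newcsvfile', index=False)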


Pandas TypeError: object of type 'float' has no len()

I'm doing some data-discovery using Python/Pandas.
MVCE: I have a CSV file with some street addresses, and I want to find the length of the longest address in my file. (This is a simplified version of my actual problem.)
I wrote this simple Python code:
import sys
import pandas as pd
df = pd.read_csv(sys.argv[1])
print(df['address'].map(len).max())
The address column is of type str, or so I thought (see below).
Why then do I get this error?
Traceback (most recent call last):
File "eval-lengths.py", line 8, in <module>
print(df['address'].map(len).max())
File "C:\Python35\lib\site-packages\pandas\core\series.py", line 2996, in map
arg, na_action=na_action)
File "C:\Python35\lib\site-packages\pandas\core\base.py", line 1004, in _map_values
new_values = map_f(values, mapper)
File "pandas/_libs/src\inference.pyx", line 1472, in pandas._libs.lib.map_infer
TypeError: object of type 'float' has no len()
Here's the output of df.info()
RangeIndex: 154733 entries, 0 to 154732
Data columns (total 2 columns):
address 154510 non-null object
zip 154732 non-null object
dtypes: object(2)
memory usage: 2.4+ MB
UPDATE
Here's a sample CSV file
address,zip
555 APPLE STREET,82101
1180 BANANA LAKE ROAD,81913
577 LEMON DR,81911
,99999
The last line is key to reproducing the problem.
You have missing data in your column, represented by NaNs (which are of float type).
Don't use map/apply etc. for things like finding the length; just use the vectorized str.len:
df['address'].str.len()
Items for which len() is not applicable automatically show up in the result as NaN. You can fillna(-1) those to indicate the result is invalid there.
My solution was to fillna with an empty string and then run the apply, like this:
df['address'].fillna('', inplace=True)
print(df['address'].map(len).max())
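A small self-contained sketch of the str.len route on the sample data above; max() skips the NaN produced by the missing address, so no fillna is strictly needed:
import io
import pandas as pd

csv = io.StringIO('address,zip\n555 APPLE STREET,82101\n'
                  '1180 BANANA LAKE ROAD,81913\n577 LEMON DR,81911\n,99999')
df = pd.read_csv(csv, dtype={'zip': str})

# str.len() leaves NaN for the missing address; max() ignores it by default.
print(df['address'].str.len().max())  # 21.0 (float because of the NaN)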

using pandas read_csv with missing data

I am attempting to read a csv file where some rows may be missing chunks of data.
This seems to cause a problem with the pandas read_csv function when you specify the dtype. The problem appears to be that, in order to convert from str to whatever the dtype specifies, pandas just tries to cast it directly, so if something is missing the cast breaks down.
A MWE follows (it uses StringIO in place of a true file; however, the issue also happens with a real file):
import pandas as pd
import numpy as np
import io
datfile = io.StringIO("12 23 43| | 37| 12.23| 71.3\n12 23 55|X| | | 72.3")
names = ['id', 'flag', 'number', 'data', 'data2']
dtypes = [np.str, np.str, np.int, np.float, np.float]
dform = {name: dtypes[ind] for ind, name in enumerate(names)}
colconverters = {0: lambda s: s.strip(), 1: lambda s: s.strip()}
df = pd.read_table(datfile, sep='|', dtype=dform, converters=colconverters, header=None,
                   index_col=0, names=names, na_values=' ')
The error I get when I run this is
Traceback (most recent call last):
File "pandas/parser.pyx", line 1084, in pandas.parser.TextReader._convert_tokens (pandas/parser.c:12580)
TypeError: Cannot cast array from dtype('O') to dtype('int64') according to the rule 'safe'
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
File "/Users/aliounis/Repos/stellarpy/source/mwe.py", line 15, in <module>
index_col=0, names=names, na_values=' ')
File "/usr/local/lib/python3.5/site-packages/pandas/io/parsers.py", line 562, in parser_f
return _read(filepath_or_buffer, kwds)
File "/usr/local/lib/python3.5/site-packages/pandas/io/parsers.py", line 325, in _read
return parser.read()
File "/usr/local/lib/python3.5/site-packages/pandas/io/parsers.py", line 815, in read
ret = self._engine.read(nrows)
File "/usr/local/lib/python3.5/site-packages/pandas/io/parsers.py", line 1314, in read
data = self._reader.read(nrows)
File "pandas/parser.pyx", line 805, in pandas.parser.TextReader.read (pandas/parser.c:8748)
File "pandas/parser.pyx", line 827, in pandas.parser.TextReader._read_low_memory (pandas/parser.c:9003)
File "pandas/parser.pyx", line 904, in pandas.parser.TextReader._read_rows (pandas/parser.c:10022)
File "pandas/parser.pyx", line 1011, in pandas.parser.TextReader._convert_column_data (pandas/parser.c:11397)
File "pandas/parser.pyx", line 1090, in pandas.parser.TextReader._convert_tokens (pandas/parser.c:12656)
ValueError: invalid literal for int() with base 10: ' '
Is there some way I can fix this? I looked through the documentation but didn't see anything that directly addresses this issue. Is this just a bug that needs to be reported to pandas?
Try this:
import pandas as pd
import numpy as np
import io
datfile = io.StringIO(u"12 23 43| | 37| 12.23| 71.3\n12 23 55|X| | | 72.3")
names = ['id', 'flag', 'number', 'data', 'data2']
dtypes = [np.str, np.str, np.str, np.float, np.float]
dform = {name: dtypes[ind] for ind, name in enumerate(names)}
colconverters = {0: lambda s: s.strip(), 1: lambda s: s.strip()}
df = pd.read_table(datfile, sep='|', dtype=dform, converters=colconverters, header=None, na_values=' ')
df.columns = names
Edit: to convert the dtypes after the import:
df["number"] = df["number"].astype('int')
df["data"] = df["data"].astype('float')
Your data is a mix of blanks (read as str) and numbers.
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 2 entries, 0 to 1
Data columns (total 5 columns):
id 2 non-null object
flag 2 non-null object
number 2 non-null object
data 2 non-null object
data2 2 non-null float64
dtypes: float64(1), object(4)
memory usage: 152.0+ bytes
If you look at data, it is np.float but gets converted to object; data2 stays np.float until it hits a blank, at which point it turns into object as well.
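As a side note, here is a sketch of the same read-as-str-then-convert idea that avoids the failing cast entirely, assuming a pandas version with pd.to_numeric and the nullable 'Int64' dtype (0.24+):
import io
import pandas as pd

datfile = io.StringIO(u"12 23 43| | 37| 12.23| 71.3\n12 23 55|X| | | 72.3")
names = ['id', 'flag', 'number', 'data', 'data2']

# Read the troublesome columns as plain strings first, with no dtype forced.
df = pd.read_table(datfile, sep='|', header=None, names=names,
                   converters={'id': str.strip, 'flag': str.strip})

# to_numeric turns blanks into NaN instead of failing the cast;
# 'Int64' (capital I) is the nullable integer dtype that tolerates NaN.
df['number'] = pd.to_numeric(df['number'], errors='coerce').astype('Int64')
df['data'] = pd.to_numeric(df['data'], errors='coerce')
print(df.dtypes)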
So, as Merlin pointed out, the main problem is that NaNs can't be ints, which is probably why pandas acts this way to begin with. Unfortunately I didn't have a choice, so I had to make some changes to the pandas source code myself. I ended up having to change lines 1087-1096 of the file parser.pyx to
na_count_old = na_count
print(col_res)
for ind, row in enumerate(col_res):
    k = kh_get_str(na_hashset, row.strip().encode())
    if k != na_hashset.n_buckets:
        col_res[ind] = np.nan
        na_count += 1
    else:
        col_res[ind] = np.array(col_res[ind]).astype(col_dtype).item(0)

if na_count_old == na_count:
    # float -> int conversions can fail the above
    # even with no nans
    col_res_orig = col_res
    col_res = col_res.astype(col_dtype)
    if (col_res != col_res_orig).any():
        raise ValueError("cannot safely convert passed user dtype of "
                         "{col_dtype} for {col_res} dtyped data in "
                         "column {column}".format(col_dtype=col_dtype,
                                                  col_res=col_res_orig.dtype.name,
                                                  column=i))
This essentially goes through each element of a column and checks whether it is contained in the na list (note that we have to strip the values so that multi-space blanks still show up as being in the na list). If it is, that element is set to a double np.nan. If it is not in the na list, it is cast to the original dtype specified for that column (which means the column will hold multiple dtypes).
While this isn't a perfect fix (and is likely slow), it works for my needs, and maybe someone else with a similar problem will find it useful.

pandas pivot_table without grouping

What is the best way to use pandas.pivot_table to calculate aggregated functions over the whole table without providing the grouping?
For example, if I want to calculate the sum of A, B, and C into one table with a single row, without grouping by any of the columns:
>>> x = pd.DataFrame({'A':[1,2,3],'B':[8,7,6],'C':[0,3,2]})
>>> x
A B C
0 1 8 0
1 2 7 3
2 3 6 2
>>> x.pivot_table(values=['A','B','C'],aggfunc=np.sum)
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "/tool/pandora64/.package/python-2.7.5/lib/python2.7/site-packages/pandas/tools/pivot.py", line 103, in pivot_table
grouped = data.groupby(keys)
File "/tool/pandora64/.package/python-2.7.5/lib/python2.7/site-packages/pandas/core/generic.py", line 2434, in groupby
sort=sort, group_keys=group_keys, squeeze=squeeze)
File "/tool/pandora64/.package/python-2.7.5/lib/python2.7/site-packages/pandas/core/groupby.py", line 789, in groupby
return klass(obj, by, **kwds)
File "/tool/pandora64/.package/python-2.7.5/lib/python2.7/site-packages/pandas/core/groupby.py", line 238, in __init__
level=level, sort=sort)
File "/tool/pandora64/.package/python-2.7.5/lib/python2.7/site-packages/pandas/core/groupby.py", line 1622, in _get_grouper
raise ValueError('No group keys passed!')
ValueError: No group keys passed!
Also, I would like to use a custom aggfunc; the above np.sum is just an example.
Thanks.
I think you're asking how to apply a function to all columns of a DataFrame. To do this, call the apply method of your dataframe:
def myfunc(col):
    return np.sum(col)

x.apply(myfunc)
Out[1]:
A 6
B 21
C 5
dtype: int64
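Since the question asks about custom aggfuncs: any callable that maps a column to a scalar works the same way, for example a (hypothetical) peak-to-peak range:
x.apply(lambda col: col.max() - col.min())
Out[2]:
A    2
B    2
C    3
dtype: int64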
I had the same error; I was using pivot_table on a pandas data frame:
import numpy as np
# Pivot for mean weekly_sales for each store type
mean_sales_by_type = sales.pivot_table(values='weekly_sales')
# Print mean_sales_by_type
print(mean_sales_by_type)
Here's the error:
File "<stdin>", line 889, in __init__
grouper, exclusions, obj = get_grouper(
File "<stdin>", line 896, in get_grouper
raise ValueError("No group keys passed!")
ValueError: No group keys passed!
I finally got it fixed by specifying the index argument of the pivot_table function (after values):
mean_sales_by_type = sales.pivot_table(values='weekly_sales',index='type')
In your case, try specifying an index as well:
x.pivot_table(values=['A','B','C'], index=..., aggfunc=np.sum)
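For reference, a minimal sketch of one way to get a single aggregated row without a real grouping column: pass a constant, array-like key as the index. This is one hypothetical workaround, not the only one; plain DataFrame.agg avoids pivot_table entirely.
import numpy as np
import pandas as pd

x = pd.DataFrame({'A': [1, 2, 3], 'B': [8, 7, 6], 'C': [0, 3, 2]})

# Every row gets the same key, so pivot_table has something to group by
# but still produces a single row.
print(x.pivot_table(values=['A', 'B', 'C'], index=[0] * len(x), aggfunc=np.sum))

# Or skip pivot_table and aggregate the whole frame directly.
print(x.agg(np.sum))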

convert_to_r_dataframe gives error no attribute dtype

I have a pandas dataframe of 153895 rows x 644 columns (read from a csv file); a few of the columns are strings and the others are integer and float. I am trying to save it as an Rda file.
I tried:
import pandas.rpy.common as com
myDFinR = com.convert_to_r_dataframe(myDF)
I get the following error:
Traceback (most recent call last):
File "C:\PF\WinPython-64bit-3.3.3.3\python-3.3.3.amd64\lib\site-packages\IPython\core\interactiveshell.py", line 2828, in run_code
exec(code_obj, self.user_global_ns, self.user_ns)
File "<ipython-input-101-7d2a8ae98ea4>", line 1, in <module>
dDataR=com.convert_to_r_dataframe(dData)
File "C:\PF\WinPython-64bit-3.3.3.3\python-3.3.3.amd64\lib\site-packages\pandas\rpy\common.py", line 305, in convert_to_r_dataframe
value_type = value.dtype.type
File "C:\PF\WinPython-64bit-3.3.3.3\python-3.3.3.amd64\lib\site-packages\pandas\core\generic.py", line 1815, in __getattr__
(type(self).__name__, name))
AttributeError: 'DataFrame' object has no attribute 'dtype'
I tried myDF.dtypes and it didn't give me any unusual output:
col1 object
col2 object
col3 int64
...
col642 float64
col643 float64
col644 float64
Length: 644, dtype: object
When I tried for i,j in enumerate(myDF.columns): print(i,":",myDF[j].dtype), it gave me an error at column 359. However, if I try myDF[[359]].dtypes, it gives me
col359 float64
dtype: object
What could be the issue?
I can reproduce the error messages when myDF has non-unique column names:
import pandas as pd
import pandas.rpy.common as com
myDF = pd.DataFrame([[1,2],[3,4]], columns=['A','B'])
myDFinR = com.convert_to_r_dataframe(myDF)
print(myDFinR) # 1
myDF2 = pd.DataFrame([[1,2],[3,4]], columns=['A','A'])
myDFinR2 = com.convert_to_r_dataframe(myDF2)
print(myDFinR2) # 2
The first print outputs
   A  B
0  1  2
1  3  4
while the second raises
AttributeError: 'DataFrame' object has no attribute 'dtype'
If this is indeed the source of your problem, you can fix it by renaming the columns to something unique:
myDF.columns = ['col{i}'.format(i=i) for i in range(len(myDF.columns))]
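If you want to confirm that duplicate names are the culprit before renaming, a quick check using the standard columns.duplicated method:
# Show any repeated column labels before converting.
print(myDF.columns[myDF.columns.duplicated()])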

Apply SequenceMatcher to DataFrame

I'm new to pandas and Python in general, so I'm hoping someone can help me with this simple question. I have a large dataframe m with several million rows and seven columns, including an ITEM_NAME_x and ITEM_NAME_y. I want to compare ITEM_NAME_x and ITEM_NAME_y using SequenceMatcher.ratio(), and add a new column to the dataframe with the result.
I've tried to come at this several ways, but keep running into errors:
>>> m.apply(SequenceMatcher(None, str(m.ITEM_NAME_x), str(m.ITEM_NAME_y)).ratio(), axis=1)
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "C:\Python33\lib\site-packages\pandas\core\frame.py", line 4416, in apply
return self._apply_standard(f, axis)
File "C:\Python33\lib\site-packages\pandas\core\frame.py", line 4491, in _apply_standard
raise e
File "C:\Python33\lib\site-packages\pandas\core\frame.py", line 4480, in _apply_standard
results[i] = func(v)
TypeError: ("'float' object is not callable", 'occurred at index 0')
Could someone help me fix this?
You have to apply a function, not a float, which is what the expression SequenceMatcher(None, str(m.ITEM_NAME_x), str(m.ITEM_NAME_y)).ratio() evaluates to.
Working demo (a draft):
import difflib
from functools import partial
import pandas as pd
def apply_sm(s, c1, c2):
    return difflib.SequenceMatcher(None, s[c1], s[c2]).ratio()

df = pd.DataFrame({'A': {1: 'one'}, 'B': {1: 'two'}})
print(df.apply(partial(apply_sm, c1='A', c2='B'), axis=1))
output:
1 0.333333
dtype: float64
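Applied to the question's frame m, it might look like this (the new column name name_ratio is just an example; the str() coercion guards against non-string values, as in the original attempt):
m['name_ratio'] = m.apply(
    lambda row: difflib.SequenceMatcher(None,
                                        str(row['ITEM_NAME_x']),
                                        str(row['ITEM_NAME_y'])).ratio(),
    axis=1)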
