Apply SequenceMatcher to DataFrame - python

I'm new to pandas and Python in general, so I'm hoping someone can help me with this simple question. I have a large dataframe m with several million rows and seven columns, including an ITEM_NAME_x and ITEM_NAME_y. I want to compare ITEM_NAME_x and ITEM_NAME_y using SequenceMatcher.ratio(), and add a new column to the dataframe with the result.
I've tried to come at this several ways, but keep running into errors:
>>> m.apply(SequenceMatcher(None, str(m.ITEM_NAME_x), str(m.ITEM_NAME_y)).ratio(), axis=1)
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "C:\Python33\lib\site-packages\pandas\core\frame.py", line 4416, in apply
return self._apply_standard(f, axis)
File "C:\Python33\lib\site-packages\pandas\core\frame.py", line 4491, in _apply_standard
raise e
File "C:\Python33\lib\site-packages\pandas\core\frame.py", line 4480, in _apply_standard
results[i] = func(v)
TypeError: ("'float' object is not callable", 'occurred at index 0')
Could someone help me fix this?

You have to pass a function to apply, not a float. The expression SequenceMatcher(None, str(m.ITEM_NAME_x), str(m.ITEM_NAME_y)).ratio() is evaluated immediately and yields a float, which apply then tries to call.
Working demo (a draft):
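import difflib
from functools import partial
import pandas as pd
def apply_sm(s, c1, c2):
    # s is a single row; compare the strings in columns c1 and c2
    return difflib.SequenceMatcher(None, s[c1], s[c2]).ratio()
df = pd.DataFrame({'A': {1: 'one'}, 'B': {1: 'two'}})
# partial fixes the column names, so apply receives a one-argument function
print(df.apply(partial(apply_sm, c1='A', c2='B'), axis=1))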
output:
1 0.333333
dtype: float64
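To end up with the result as a new column, as the question asks, the same idea can be sketched as follows. The column names ITEM_NAME_x and ITEM_NAME_y come from the question; the sample rows, the helper name name_ratio and the new column name ratio are made up for illustration. For several million rows, a plain list comprehension over the two columns is often faster than apply with axis=1, though that is worth measuring on your own data.

```python
import difflib

import pandas as pd

# Tiny stand-in for the real frame m (sample values are invented).
m = pd.DataFrame({
    "ITEM_NAME_x": ["one", "same"],
    "ITEM_NAME_y": ["two", "same"],
})


def name_ratio(row):
    """Similarity of the two name columns for a single row."""
    return difflib.SequenceMatcher(
        None, str(row["ITEM_NAME_x"]), str(row["ITEM_NAME_y"])
    ).ratio()


# Pass the function itself (no parentheses); apply calls it once per row.
m["ratio"] = m.apply(name_ratio, axis=1)

# Alternative without apply: iterate over the two columns in parallel.
ratios = [
    difflib.SequenceMatcher(None, str(x), str(y)).ratio()
    for x, y in zip(m["ITEM_NAME_x"], m["ITEM_NAME_y"])
]
```

Both variants compute the same numbers; the second one could be assigned with m["ratio"] = ratios.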

Related

Pysolar get_azimuth function applied to pandas DataFrame

I have a pandas DataFrame with columns latitude and longitude (both integer type) and a date column (datetime64[ns, UTC], as needed for the function). I use the following line to produce a new column with the sun's azimuth:
daa['azimuth'] = daa.apply(lambda row: get_azimuth(row['latitude'], row['longitude'], row['date']), axis=1)
It crashes and I cannot figure out why; the only thing I know is that the problem is in the date:
File "pandas\_libs\tslibs\timestamps.pyx", line 1332, in pandas._libs.tslibs.timestamps.Timestamp.__new__
TypeError: an integer is required
If anyone had an idea what I am supposed to do with the date, it would be great, thanks.
This goes back to a bug in pandas, see issue #32174. pysolar.solar.get_azimuth calls the .utctimetuple() method of the given datetime object (or pd.Timestamp), which fails:
import pandas as pd
s = pd.to_datetime(pd.Series(["2020-01-01", "2020-01-02"])).dt.tz_localize('UTC')
s.iloc[0]
Out[3]: Timestamp('2020-01-01 00:00:00+0000', tz='UTC')
s.iloc[0].utctimetuple()
Traceback (most recent call last):
File "<ipython-input-4-f5e393f18fdb>", line 1, in <module>
s.iloc[0].utctimetuple()
File "pandas\_libs\tslibs\timestamps.pyx", line 1332, in pandas._libs.tslibs.timestamps.Timestamp.__new__
TypeError: an integer is required
You can work around this by converting the pandas Timestamp to a Python datetime object, where utctimetuple works as expected. For the given example, you can use
daa.apply(lambda row: get_azimuth(row['latitude'], row['longitude'], row['date'].to_pydatetime()), axis=1)
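A minimal sketch of the workaround that runs without pysolar installed: utc_hour below is a hypothetical stand-in for any function that, like get_azimuth according to the answer, calls utctimetuple() on its argument. The frame contents and the hour column are invented for illustration.

```python
import datetime as dt

import pandas as pd


def utc_hour(when):
    """Hypothetical stand-in for a library call that uses utctimetuple()."""
    return when.utctimetuple().tm_hour


# Invented sample data with a tz-aware date column, as in the question.
daa = pd.DataFrame({
    "latitude": [50, 51],
    "longitude": [14, 15],
    "date": pd.to_datetime(
        ["2020-01-01 06:00", "2020-01-02 18:30"]
    ).tz_localize("UTC"),
})

# Convert each pandas Timestamp to a plain datetime before handing it over.
daa["hour"] = daa.apply(
    lambda row: utc_hour(row["date"].to_pydatetime()), axis=1
)

# to_pydatetime() returns a standard-library datetime, not a Timestamp.
first = daa["date"].iloc[0].to_pydatetime()
```

Replacing utc_hour with the real get_azimuth call, plus the latitude and longitude arguments, gives the line from the answer.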

Need to assign dic to Pandas Dataframe

I run into a problem when I try to assign a dict to a cell of the DataFrame df:
df.loc[index,'count'] = dict()
as I get this error message:
Incompatible indexer with Series
To work around this problem, I can do this,
df.loc[index,'count'] = [dict()]
but I don't like this solution, since I have to index into the list before getting the dictionary, i.e.
a = (df.loc[index,'count'])[0]
How can I solve this situation in a more elegant way?
EDIT1
One way to reproduce the problem is as follows.
Code:
import pandas as pd
df = pd.DataFrame(columns= ['count', 'aaa'])
d = dict()
df.loc[0, 'count'] = [d]; print('OK!');
df.loc[0, 'count'] = d
Output:
OK!
Traceback (most recent call last):
File "<ipython-input-193-67bbd89f2c69>", line 4, in <module>
df.loc[0, 'count'] = d
File "/usr/lib64/python3.6/site-packages/pandas/core/indexing.py", line 194, in __setitem__
self._setitem_with_indexer(indexer, value)
File "/usr/lib64/python3.6/site-packages/pandas/core/indexing.py", line 625, in _setitem_with_indexer
value = self._align_series(indexer, Series(value))
File "/usr/lib64/python3.6/site-packages/pandas/core/indexing.py", line 765, in _align_series
raise ValueError('Incompatible indexer with Series')
ValueError: Incompatible indexer with Series

How can I add rows in pandas by using "loc" and "for"?

I want to add data from one DataFrame to a new DataFrame using "loc". I tried "loc", but an error occurred. Is there a way to add the data?
>>> import pandas as pd
>>> df = pd.DataFrame({'A': [1.0, 1.2, 3.4, 4.1, 8.2]})
>>> import pandas as pd
>>> df_new = pd.DataFrame(columns=['A'])
>>> for i in df:
...     df_new.loc[i] = df.loc[i]
...
Traceback (most recent call last):
File "/Users/Hajime/anaconda3/lib/python3.6/site-packages/pandas/core/indexing.py", line 1434, in _has_valid_type
error()
File "/Users/Hajime/anaconda3/lib/python3.6/site-packages/pandas/core/indexing.py", line 1429, in error
(key, self.obj._get_axis_name(axis)))
KeyError: 'the label [A] is not in the [index]'
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
File "<stdin>", line 2, in <module>
File "/Users/Hajime/anaconda3/lib/python3.6/site-packages/pandas/core/indexing.py", line 1328, in __getitem__
return self._getitem_axis(key, axis=0)
File "/Users/Hajime/anaconda3/lib/python3.6/site-packages/pandas/core/indexing.py", line 1551, in _getitem_axis
self._has_valid_type(key, axis)
File "/Users/Hajime/anaconda3/lib/python3.6/site-packages/pandas/core/indexing.py", line 1442, in _has_valid_type
error()
File "/Users/Hajime/anaconda3/lib/python3.6/site-packages/pandas/core/indexing.py", line 1429, in error
(key, self.obj._get_axis_name(axis)))
KeyError: 'the label [A] is not in the [index]'
But the following code succeeds:
>>> df_new.loc[1] = df.loc[1]
>>> df_new
A
1 1.2
Why don't you take a look at what for is iterating over here?
In [353]: for i in df:
...:     print(i)
...:
A
Conclusion - Iteration over df results in iteration over the column names. What you're looking for is something along the lines of df.iterrows, or iterating over df.index.
For example,
for i, r in df.iterrows():
    df_new.loc[i, :] = r
df_new
A
0 1.0
1 1.2
2 3.4
3 4.1
4 8.2
The error is in this part:
for i in df:
    df_new.loc[i] = df.loc[i]
For loc, the first argument is the row index label, but i is a column name.
If you just want to add df to df_new, use concat:
df_new = pd.concat([df_new, df])
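Both suggestions can be sketched side by side on the frame from the question. This is only an illustration: the names labels, looped and combined are made up, and the explicit float conversion at the end is an assumption added because a frame created with only column names starts out with object dtype.

```python
import pandas as pd

df = pd.DataFrame({"A": [1.0, 1.2, 3.4, 4.1, 8.2]})

# Iterating over a DataFrame yields its column labels, not its rows.
labels = list(df)

# Option 1: loop over the row labels and copy row by row.
looped = pd.DataFrame(columns=["A"])
for i in df.index:
    looped.loc[i] = df.loc[i]

# Option 2: append all rows at once.
combined = pd.concat([pd.DataFrame(columns=["A"]), df])

# Make the dtype explicit, since the empty frame started as object dtype.
looped = looped.astype(float)
combined = combined.astype(float)
```

The loop grows the frame one row at a time, so for anything beyond a handful of rows the single concat call is usually the better fit.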
import pandas as pd
df = pd.DataFrame({'A': [1.0, 1.2, 3.4, 4.1, 8.2]})
import pandas as pd
df_new = pd.DataFrame(columns=['A'])
for i in df:
Just adding ":," before i will do what you wanted in the first place. The general form is:
df.loc[index of row, column name]
What are you doing wrong? You are passing a column name as the row index, and that label does not exist in the index. Use this instead:
df_new.loc[:,i] = df.loc[:,i]
Anyhow, you can assign all the columns in one go:
df_new[col_names]=df[col_names]
where col_names is a list of column names.

Zeppelin fail to z.show() pandas DataFrame

According to the docs, DataFrames are displayed nicely by Zeppelin if I just do:
import pandas as pd
rates = pd.read_csv("bank.csv", sep=";")
z.show(rates)
But when I try the same with my DataFrame:
df = pd.DataFrame.from_dict({
('a', 'b') : {'value': 1}
}, orient='index')
z.show(df)
it gives:
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "<stdin>", line 27, in show
File "<stdin>", line 43, in show_dataframe
TypeError: string argument expected, got 'numpy.int64'
I thought it was a problem with Multilevel Indexes, but even after using df.reset_index() I can't make it work.
UPDATE
I can also reproduce this with a csv like
id,name,score
a,b,1.1
and using the read_csv method. If I drop the score column, the z.show works.
Is this a known issue? Is there a workaround? Or is it my mistake? I'm using Zeppelin version 0.6.2.

Get the first pandas DataFrame's column?

I want to calculate the standard deviation of the first column of my prices DataFrame.
Here is my code:
import pandas as pd
def std(returns):
    return pd.DataFrame(returns.std(axis=0, ddof=0))
prices = pd.DataFrame([[-0.33333333, -0.25343423, -0.1666666667],
[+0.23432323, +0.14285714, -0.0769230769],
[+0.42857143, +0.07692308, +0.1818181818]])
print(std(prices.ix[:,0]))
When I run it, I get the following error:
Traceback (most recent call last):
File "C:\Users\*****\Documents\******\******\****.py", line 12, in <module>
print(std(prices.ix[:,0]))
File "C:\Users\*****\Documents\******\******\****.py", line 10, in std
return pd.DataFrame(returns.std(axis=0, ddof=0))
File "C:\Python33\lib\site-packages\pandas\core\frame.py", line 453, in __init__
raise PandasError('DataFrame constructor not properly called!')
pandas.core.common.PandasError: DataFrame constructor not properly called!
How can I fix that?
Thank you!
Take a closer look at what is going on in your code:
>>> prices.ix[:,0]
0 -0.333333
1 0.234323
2 0.428571
>>> prices.ix[:,0].std(axis=0, ddof=0)
0.32325861621668445
So you are calling the DataFrame constructor like this:
pd.DataFrame(0.32325861621668445)
The constructor has no idea what to do with a single float parameter. It needs some kind of sequence or iterable. Maybe what you want is this:
>>> pd.DataFrame([0.32325861621668445])
0
0 0.323259
It should be as simple as this:
In [0]: prices[0].std()
Out[0]: 0.39590933234452624
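Columns of DataFrames are Series. You can call Series methods on them directly. Note that Series.std() defaults to ddof=1 (sample standard deviation), which is why this value differs from the ddof=0 result above; pass ddof=0 to match it.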
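Putting the two answers together, here is a small sketch using the prices frame from the question. It uses .iloc instead of .ix, because .ix was deprecated and later removed from pandas; the variable names are illustrative only.

```python
import pandas as pd

prices = pd.DataFrame([[-0.33333333, -0.25343423, -0.1666666667],
                       [+0.23432323, +0.14285714, -0.0769230769],
                       [+0.42857143, +0.07692308, +0.1818181818]])

# First column by position; the result is a Series.
first_col = prices.iloc[:, 0]

# ddof=0 divides by N, as in the question.
population_std = first_col.std(ddof=0)

# The pandas default is ddof=1, which divides by N - 1.
sample_std = first_col.std()

# If a DataFrame is really wanted, wrap the scalar in a list first.
std_frame = pd.DataFrame([population_std])
```

Which of the two values is appropriate depends on whether the rows are treated as the whole population or as a sample.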
