Zeppelin fails to z.show() pandas DataFrame - python

According to the docs, DataFrames are displayed nicely by Zeppelin if I just do:
import pandas as pd
rates = pd.read_csv("bank.csv", sep=";")
z.show(rates)
But when I try the same with my own DataFrame:
df = pd.DataFrame.from_dict({
    ('a', 'b'): {'value': 1}
}, orient='index')
z.show(df)
it gives:
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "<stdin>", line 27, in show
  File "<stdin>", line 43, in show_dataframe
TypeError: string argument expected, got 'numpy.int64'
I thought it was a problem with Multilevel Indexes, but even after using df.reset_index() I can't make it work.
UPDATE
I can also reproduce this with a csv like
id,name,score
a,b,1.1
and loading it with the read_csv method. If I drop the score column, z.show works.
Is this a known issue? Is there a workaround? Or is it my mistake? I'm using Zeppelin version 0.6.2.
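The traceback suggests that show_dataframe writes a cell into a text buffer without converting it to str first. That is an assumption inferred from the error message, not something confirmed against the Zeppelin 0.6.2 source. Under that assumption, a minimal sketch of the failure and of a possible workaround (casting every cell to str before display) might look like this. The z object only exists inside Zeppelin, so the z.show call is left as a comment and is untested.

```python
import io

import numpy as np
import pandas as pd

# Reproduce the same kind of error outside Zeppelin:
# a text buffer only accepts str, not NumPy scalars.
buf = io.StringIO()
try:
    buf.write(np.int64(1))
    error_message = None
except TypeError as exc:
    error_message = str(exc)
print(error_message)

df = pd.DataFrame.from_dict({('a', 'b'): {'value': 1}}, orient='index')

# Possible workaround: flatten the index and cast every cell to str,
# so that no NumPy scalar reaches the buffer.
printable = df.reset_index().astype(str)
print(printable)

# Inside a Zeppelin paragraph (untested assumption): z.show(printable)
```

If this works in your notebook, the numbers are displayed as text, so any numeric sorting in the table view would be lost.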

Related

Pysolar get_azimuth function applied to pandas DataFrame

I have a pandas DataFrame with columns latitude and longitude (both integer type) and a date column (datetime64[ns, UTC], as the function requires). I use the following line to produce a new column with the sun's azimuth:
daa['azimuth'] = daa.apply(lambda row: get_azimuth(row['latitude'], row['longitude'], row['date']), axis=1)
It crashes and I cannot figure out why. The only thing I know is that the problem is related to the date:
File "pandas\_libs\tslibs\timestamps.pyx", line 1332, in pandas._libs.tslibs.timestamps.Timestamp.__new__
TypeError: an integer is required
If anyone had an idea what I am supposed to do with the date, it would be great, thanks.
This goes back to a bug in pandas; see issue #32174. pysolar.solar.get_azimuth calls the .utctimetuple() method of the given datetime object (here a pd.Timestamp), which fails:
import pandas as pd
s = pd.to_datetime(pd.Series(["2020-01-01", "2020-01-02"])).dt.tz_localize('UTC')
s.iloc[0]
Out[3]: Timestamp('2020-01-01 00:00:00+0000', tz='UTC')
s.iloc[0].utctimetuple()
Traceback (most recent call last):
  File "<ipython-input-4-f5e393f18fdb>", line 1, in <module>
    s.iloc[0].utctimetuple()
  File "pandas\_libs\tslibs\timestamps.pyx", line 1332, in pandas._libs.tslibs.timestamps.Timestamp.__new__
TypeError: an integer is required
You can work around this by converting the pandas Timestamp to a Python datetime object, where utctimetuple works as expected. For the given example, you can use
daa.apply(lambda row: get_azimuth(row['latitude'], row['longitude'], row['date'].to_pydatetime()), axis=1)
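To check the conversion in isolation, without pysolar installed, here is a small sketch that reuses the series s from the example above. The variable names py_dt and utc_tuple are made up for illustration.

```python
import pandas as pd

s = pd.to_datetime(pd.Series(["2020-01-01", "2020-01-02"])).dt.tz_localize("UTC")

# Convert the pandas Timestamp to a standard-library datetime first.
py_dt = s.iloc[0].to_pydatetime()

# On a plain datetime, utctimetuple() is the standard-library method,
# so the pandas code path that raised the error is not involved.
utc_tuple = py_dt.utctimetuple()
print(type(py_dt).__name__, utc_tuple.tm_year, utc_tuple.tm_hour)
```

The converted object keeps its time zone information, so it should still meet get_azimuth's need for an aware datetime.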

I get "TypeError: 'type' object is not iterable" when I use pandas.groupby

I have a CSV file containing a user's social media activity over 20 days, and I want to get the details of the user's activity on day 1.
I got it using this piece of code:
df['d'] = pd.to_datetime(df['date_time'], format='(%Y,%m,%d,%H,%M,%S)')
day1 = df['d'].dt.date[0]
df = df[df['d'].dt.date.eq(day1)]
df = df.melt(['date_time','d'])
df = df[df['value'].eq('Y')]
d = df.groupby('variable')['date_time'].agg(list).to_dict()
for x, y in d.items():
    print(x, y)
When I print it, this is what I get in a Google Colab notebook:
Instagram [Timestamp('2020-08-23 04:19:05.637617'), Timestamp('2020-08-23 04:20:07.351783'), Timestamp('2020-08-23 04:21:09.069061')]
Facebook [Timestamp('2020-08-23 04:44:49.635657'), Timestamp('2020-08-23 04:45:51.402162'), Timestamp('2020-08-23 05:01:18.989306')]
But when I save the same code as a Python file and run it from my terminal, I get this error:
Traceback (most recent call last):
  File "example.py", line 43, in <module>
    d=df.groupby('variable')['date_time'].agg(list).to_dict()
TypeError: 'type' object is not iterable
EDIT:
The problem was the pandas version. For pandas 0.22, change:
d=df.groupby('variable')['date_time'].agg(list).to_dict()
to:
d=df.groupby('variable')['date_time'].apply(list).to_dict()
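As a self-contained illustration of the apply(list) form, here is a small sketch on made-up rows in the long format that melt() produces in the question. The timestamps are shortened copies of the ones shown above, used only as sample data.

```python
import pandas as pd

# Made-up rows in the long format produced by melt() in the question.
df = pd.DataFrame({
    "variable": ["Instagram", "Facebook", "Instagram"],
    "date_time": pd.to_datetime([
        "2020-08-23 04:19:05",
        "2020-08-23 04:44:49",
        "2020-08-23 04:20:07",
    ]),
})

# apply(list) collects each group's values into a plain Python list.
d = df.groupby("variable")["date_time"].apply(list).to_dict()
for name, stamps in d.items():
    print(name, stamps)
```

Printing pd.__version__ in both environments is a quick way to confirm that Colab and the local terminal really run different pandas releases.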

Concatenating the contents of two pandas columns in Python: UnboundLocalError in ops.py

I have a DataFrame (any DataFrame will do), e.g.
d = ({
    'A' : ['Foo','Bar','Foo','zee'],
    'B' : ['X','Bar','X','Bar'],
    'C' : ['foo','bar','Nacho','Y'],
})
df = pd.DataFrame(data=d)
I want to join two columns into a new column:
df['UID']=df['A']+df['B']
but I'm getting an UnboundLocalError:
Traceback (most recent call last):
  File "<ipython-input-26-d763f11a6023>", line 1, in <module>
    df['UID']=df['A']+df['B']
  File "C:\Anaconda3\lib\site-packages\pandas\core\ops.py", line 723, in wrapper
    result,
UnboundLocalError: local variable 'result' referenced before assignment
Is my installation corrupt? I can get it to work on my laptop.
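Since the same code works on another machine, comparing pandas versions between the two is a reasonable first check. As a sketch of an alternative way to build the column, Series.str.cat concatenates element-wise without using the + operator. Whether it avoids the error on the affected installation is an untested assumption.

```python
import pandas as pd

df = pd.DataFrame({
    'A': ['Foo', 'Bar', 'Foo', 'zee'],
    'B': ['X', 'Bar', 'X', 'Bar'],
    'C': ['foo', 'bar', 'Nacho', 'Y'],
})

# Compare this value between the working and the failing machine.
print(pd.__version__)

# Series.str.cat joins the two columns row by row.
df['UID'] = df['A'].str.cat(df['B'])
print(df)
```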

Python Pandas: creating a dataframe using a function for one of the fields

I am trying to create a dataframe where one of the fields is calculated using a function. To do this I use the following code:
import pandas as pd
def didSurvive(sex):
    return int(sex == "female")
titanic_df = pd.read_csv("test.csv")
submission = pd.DataFrame({
    "PassengerId": titanic_df["PassengerId"],
    "Survived": didSurvive(titanic_df["Sex"])
})
submission.to_csv('titanic-predictions.csv', index=False)
When I run this code, I get the following error:
D:\Documents\kaggle\titanic>python predictor.py
File "predictor.py", line 3
def didSurvive() {
^
SyntaxError: invalid syntax
D:\Documents\kaggle\titanic>python predictor.py
D:\Documents\kaggle\titanic>python predictor.py
D:\Documents\kaggle\titanic>python predictor.py
Traceback (most recent call last):
File "predictor.py", line 10, in
"Survived": didSurvive(titanic_df["Sex"])
File "predictor.py", line 4, in didSurvive
return int(sex == "female")
File "C:\Python34\lib\site-packages\pandas\core\series.py", line 92,
in wrapper
"{0}".format(str(converter)))
TypeError: cannot convert the series to
D:\Documents\kaggle\titanic>
I think what is happening is that I'm calling int() on a Series of booleans instead of an individual boolean. How do I go about fixing this?
To convert the data type of a Series, you can use the astype() method. This should work:
def didSurvive(sex):
    return (sex == "female").astype(int)
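You can also convert the data while importing the CSV file. Note that read_csv applies a converter to each value individually, so this variant needs the original scalar version of didSurvive (the one returning int(sex == "female")), and the converted values stay in the Sex column: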
titanic_df = pd.read_csv("test.csv", converters={'Sex':didSurvive})
submission = pd.DataFrame(titanic_df, columns=['PassengerId', 'Sex'])
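To try the astype() approach without the Kaggle file, here is a minimal sketch on an in-memory frame. The passenger rows are invented for illustration, and did_survive is just a renamed copy of the function above.

```python
import pandas as pd

# Stand-in for test.csv; the values are invented for illustration.
titanic_df = pd.DataFrame({
    "PassengerId": [892, 893, 894],
    "Sex": ["male", "female", "male"],
})

def did_survive(sex):
    # The comparison yields a boolean Series; astype(int) maps True/False to 1/0.
    return (sex == "female").astype(int)

submission = pd.DataFrame({
    "PassengerId": titanic_df["PassengerId"],
    "Survived": did_survive(titanic_df["Sex"]),
})
print(submission)
```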

Apply SequenceMatcher to DataFrame

I'm new to pandas and Python in general, so I'm hoping someone can help me with this simple question. I have a large dataframe m with several million rows and seven columns, including an ITEM_NAME_x and ITEM_NAME_y. I want to compare ITEM_NAME_x and ITEM_NAME_y using SequenceMatcher.ratio(), and add a new column to the dataframe with the result.
I've tried to come at this several ways, but keep running into errors:
>>> m.apply(SequenceMatcher(None, str(m.ITEM_NAME_x), str(m.ITEM_NAME_y)).ratio(), axis=1)
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "C:\Python33\lib\site-packages\pandas\core\frame.py", line 4416, in apply
    return self._apply_standard(f, axis)
  File "C:\Python33\lib\site-packages\pandas\core\frame.py", line 4491, in _apply_standard
    raise e
  File "C:\Python33\lib\site-packages\pandas\core\frame.py", line 4480, in _apply_standard
    results[i] = func(v)
TypeError: ("'float' object is not callable", 'occurred at index 0')
Could someone help me fix this?
You have to pass a function to apply, not a float, which is what the expression SequenceMatcher(None, str(m.ITEM_NAME_x), str(m.ITEM_NAME_y)).ratio() evaluates to.
Working demo (a draft):
import difflib
from functools import partial
import pandas as pd
def apply_sm(s, c1, c2):
    # s is one row of the frame; compare the values in columns c1 and c2
    return difflib.SequenceMatcher(None, s[c1], s[c2]).ratio()
df = pd.DataFrame({'A': {1: 'one'}, 'B': {1: 'two'}})
print(df.apply(partial(apply_sm, c1='A', c2='B'), axis=1))
output:
1    0.333333
dtype: float64
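For a frame with several million rows, a variant that iterates over the two columns directly may be worth trying, since it avoids building a Series object for every row. This is a sketch using the column names from the question on a two-row stand-in frame. The helper name similarity is made up, and any speed gain has not been measured here.

```python
from difflib import SequenceMatcher

import pandas as pd

# Tiny stand-in for the large frame m from the question.
m = pd.DataFrame({
    "ITEM_NAME_x": ["one", "apple"],
    "ITEM_NAME_y": ["two", "apple"],
})

def similarity(a, b):
    # str() guards against non-string cells such as NaN.
    return SequenceMatcher(None, str(a), str(b)).ratio()

# zip walks both columns in step and yields one pair of values per row.
m["ratio"] = [similarity(a, b) for a, b in zip(m["ITEM_NAME_x"], m["ITEM_NAME_y"])]
print(m)
```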
