Py-Polars DateTime Conversion - python

I am currently exploring Py-Polars and am having some difficulty getting the Date32 format into its dataframe. I have tried the following approaches:
Conversion from Pandas to PyPolars directly
import pandas as pd
import pypolars as pyp
a = pd.read_csv(*CSV File*)
b = pyp.from_pandas(a)
The error code is as follows:
Traceback (most recent call last):
File "<pyshell#29>", line 1, in <module>
pyp.from_pandas(a)
File "C:\Users\*Username*\AppData\Local\Programs\Python\Python37\lib\site-packages\pypolars\functions.py", line 235, in from_pandas
pl_s = Series(k, s, nullable=True).cast(datatypes.Date64)
File "C:\Users\*Username*\AppData\Local\Programs\Python\Python37\lib\site-packages\pypolars\series.py", line 783, in cast
return wrap_s(f())
RuntimeError: Any(ArrowError(ComputeError("Casting from Int32 to Date64 not supported")))
Converting DateTime to String in Pandas, converting to PyPolars, then converting String back to DateTime in PyPolars
def changeDateTime(value):
    return str(value)
a["ACTUAL_DROP_DATE"] = a["ACTUAL_DROP_DATE"].apply(changeDateTime)
a["ACTUAL_END_DATE"] = a["ACTUAL_END_DATE"].apply(changeDateTime)
b = pyp.from_pandas(a)
def changeStrBack(value):
    if value == np.str("NaT"):
        return ""
    else:
        year = int(value[0:4])
        month = int(value[5:7])
        day = int(value[8:10])
        return pyp.datetime(year, month, day)
b["ACTUAL_DROP_DATE"] = b["ACTUAL_DROP_DATE"].apply(changeStrBack, dtype_out=pyp.Date32)
b["ACTUAL_END_DATE"] = b["ACTUAL_END_DATE"].apply(changeStrBack, dtype_out=pyp.Date32)
After this conversion, both columns are completely null.
I hope someone has an idea of how I can get these columns into a datetime type in PyPolars.
Thank you!

Related

Pysolar get_azimuth function applied to pandas DataFrame

I have a pandas DataFrame with latitude and longitude columns (both integer type) and a date column (datetime64[ns, UTC], as the function requires). I use the following line to produce a new column with the sun's azimuth:
daa['azimuth'] = daa.apply(lambda row: get_azimuth(row['latitude'], row['longitude'], row['date']), axis=1)
It crashes and I cannot figure out why. The only thing I know is that the problem is in the date:
File "pandas\_libs\tslibs\timestamps.pyx", line 1332, in pandas._libs.tslibs.timestamps.Timestamp.__new__
TypeError: an integer is required
If anyone has an idea what I am supposed to do with the date, it would be great, thanks.
This goes back to a bug in pandas, see issue #32174. pysolar.solar.get_azimuth calls the .utctimetuple() method of the given datetime object (or pd.Timestamp), which fails:
import pandas as pd
s = pd.to_datetime(pd.Series(["2020-01-01", "2020-01-02"])).dt.tz_localize('UTC')
s.iloc[0]
Out[3]: Timestamp('2020-01-01 00:00:00+0000', tz='UTC')
s.iloc[0].utctimetuple()
Traceback (most recent call last):
File "<ipython-input-4-f5e393f18fdb>", line 1, in <module>
s.iloc[0].utctimetuple()
File "pandas\_libs\tslibs\timestamps.pyx", line 1332, in pandas._libs.tslibs.timestamps.Timestamp.__new__
TypeError: an integer is required
You can work around this by converting the pandas Timestamp to a Python datetime object, where utctimetuple works as expected. For the given example, you can use
daa.apply(lambda row: get_azimuth(row['latitude'], row['longitude'], row['date'].to_pydatetime()), axis=1)
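The workaround can be tried without pysolar installed. The following is a minimal sketch: day_of_year_utc is a hypothetical stand-in for get_azimuth (it only mimics the fact that the function calls .utctimetuple() on its argument), and the coordinates and dates are made-up sample values.

```python
import datetime as dt

import pandas as pd


def day_of_year_utc(when):
    # Hypothetical stand-in for pysolar's get_azimuth: like that function,
    # it calls .utctimetuple() on the datetime object it receives.
    return when.utctimetuple().tm_yday


# Made-up sample data with a timezone-aware (UTC) date column.
daa = pd.DataFrame({
    "latitude": [48, 52],
    "longitude": [11, 13],
    "date": pd.to_datetime(["2020-01-01", "2020-01-02"]).tz_localize("UTC"),
})

# Convert each pandas Timestamp to a plain datetime.datetime before the call.
daa["day_of_year"] = daa.apply(
    lambda row: day_of_year_utc(row["date"].to_pydatetime()), axis=1
)

first_as_datetime = daa["date"].iloc[0].to_pydatetime()
```

The converted value keeps its UTC tzinfo, so a function that expects a timezone-aware datetime should still accept it.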

tz_localize: KeyError: ('Asia/Singapore', u'occurred at index 0')

Reference to: Python pandas convert unix timestamp with timezone into datetime
Did a search on this topic but still can't find the answer.
I have a dataframe in the following format:
df timestamp
1 1549914000
2 1549913400
3 1549935000
3 1549936800
5 1549936200
I use the following to convert epoch to date:
df['date'] = pd.to_datetime(df['timestamp'], unit='s')
This line will produce a date that is always 8 hours behind my local time.
So I followed the example in the link and used apply with tz_localize to localize to Asia/Singapore. I added the following line right after the code above:
df['date'] = df.apply(lambda x: x['date'].tz_localize(x['Asia/Singapore']), axis=1)
but Python returns the error below:
Traceback (most recent call last):
File "/home/test/script.py", line 479, in <module>
schedule.every(10).minutes.do(main).run()
File "/opt/cloudera/parcels/Anaconda-4.0.0/lib/python2.7/site-packages/schedule/__init__.py", line 411, in run
ret = self.job_func()
File "/home/test/script.py", line 361, in main
df['date'] = df.apply(localize_ts, axis = 1)
File "/opt/cloudera/parcels/Anaconda-4.0.0/lib/python2.7/site-packages/pandas/core/frame.py", line 4877, in apply
ignore_failures=ignore_failures)
File "/opt/cloudera/parcels/Anaconda-4.0.0/lib/python2.7/site-packages/pandas/core/frame.py", line 4973, in _apply_standard
results[i] = func(v)
File "/home/test/script.py", line 359, in localize_ts
return pd.to_datetime(row['date']).tz_localize(row['Asia/Singapore'])
File "/opt/cloudera/parcels/Anaconda-4.0.0/lib/python2.7/site-packages/pandas/core/series.py", line 623, in __getitem__
result = self.index.get_value(self, key)
File "/opt/cloudera/parcels/Anaconda-4.0.0/lib/python2.7/site-packages/pandas/core/indexes/base.py", line 2574, in get_value
raise e1
KeyError: ('Asia/Singapore', u'occurred at index 0')
Did I replace .tz_localize(x['tz']) incorrectly?
As written, your code is looking for a column named Asia/Singapore. Try this instead:
df['date'] = df['date'].dt.tz_localize('Asia/Singapore')
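Alternatively, you can add the 8-hour offset (28800 seconds) to the epoch value before converting. The result is a timezone-naive column that shows Singapore wall-clock time:
import pandas as pd
df = pd.DataFrame({'timestamp': [1549952400, 1549953600]}, index=['1', '2'])
df['timestamp2'] = df['timestamp'] + 28800
df['date'] = pd.to_datetime(df['timestamp2'], unit='s')
df = df.drop('timestamp2', axis=1)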
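A possible alternative to a hard-coded offset, sketched under the assumption that the epoch values are UTC seconds (as Unix timestamps normally are): parse them as UTC and then convert with tz_convert. Note that tz_localize only attaches a zone label to the existing wall-clock time, whereas tz_convert shifts the displayed time into the target zone.

```python
import pandas as pd

# Two of the epoch values from the question.
df = pd.DataFrame({"timestamp": [1549914000, 1549913400]})

# Parse as timezone-aware UTC, then convert to Singapore time.
df["date"] = (
    pd.to_datetime(df["timestamp"], unit="s", utc=True)
    .dt.tz_convert("Asia/Singapore")
)

first = df["date"].iloc[0]
print(df)
```

The column stays timezone-aware, so later arithmetic or conversion to another zone keeps working without tracking the offset by hand.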

How to print unique values of a column in a group using Pandas?

I am trying to print the unique values of the column ADO_name in my data set. Following is the example data set and the code I tried (which gives an error):
import numpy as np
import pandas as pd
from pandas import Series, DataFrame
data = {'ADO_name': ['car1', 'car1', 'car1', 'car2', 'car2', 'car2'],
        'Time_sec': [0, 1, 2, 0, 1, 2],
        'Speed.kph': [50, 51, 52, 0, 0, 52]}
dframe = DataFrame(data)
for ado in dframe.groupby('ADO_name'):
    ado_name = ado["ADO_name"]
    adoID = ado_name.unique()
    print(adoID)
Traceback (most recent call last):
File "C:\Users\Quinton\AppData\Local\Temp\Rtmp88ifpB\chunk-code-188c39fc7de8.txt", line 14, in <module>
ado_name = ado["ADO_name"]
TypeError: tuple indices must be integers or slices, not str
What am I doing wrong and how do I fix it? Please help.
You can do: dframe["ADO_name"].unique().
If you want to keep the loop, note that iterating over a groupby object yields (name, group) tuples, so the group DataFrame is the second element. Here is the correction to your code:
for ado in dframe.groupby('ADO_name'):
    ado_name = ado[1]["ADO_name"]
    adoID = ado_name.unique()
    print(adoID)
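The same loop can be written by unpacking the tuple directly, which avoids the ado[1] indexing. This is a small sketch using the sample data from the question; the variable names name, group and speeds_per_car are chosen here for illustration.

```python
import pandas as pd

dframe = pd.DataFrame({
    "ADO_name": ["car1", "car1", "car1", "car2", "car2", "car2"],
    "Time_sec": [0, 1, 2, 0, 1, 2],
    "Speed.kph": [50, 51, 52, 0, 0, 52],
})

# Unique group names, no loop needed.
unique_names = dframe["ADO_name"].unique()

# Each iteration yields the group key and the sub-DataFrame for that key.
speeds_per_car = {}
for name, group in dframe.groupby("ADO_name"):
    speeds_per_car[name] = group["Speed.kph"].unique().tolist()
    print(name, speeds_per_car[name])
```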

Getting error slicing time series with pandas

I'm trying to slice a time series. This works perfectly:
subseries = series['2015-07-07 01:00:00':'2015-07-07 03:30:00']
But the following code doesn't work:
def GetDatetime():
    Y = int(raw_input("Year "))
    M = int(raw_input("Month "))
    D = int(raw_input("Day "))
    d = datetime.datetime(Y, M, D)  # creates a datetime object
    return d
filePath = "pathtofile.csv"
series = pd.read_csv(str(filePath), index_col='date')
series.index = pd.to_datetime(series.index, unit='s')
d = GetDatetime()
f = GetDatetime()
subseries = series[d:f]
The last line generates this error:
Traceback (most recent call last):
File "dontgivemeerrorsbrasommek.py", line 37, in <module>
brasla7nina= df[d:f]
File "/usr/local/lib/python2.7/dist-packages/pandas-0.20.2-py2.7-linux-x86_64.egg/pandas/core/frame.py", line 1952, in __getitem__
indexer = convert_to_index_sliceable(self, key)
File "/usr/local/lib/python2.7/dist-packages/pandas-0.20.2-py2.7-linux-x86_64.egg/pandas/core/indexing.py", line 1896, in convert_to_index_sliceable
return idx._convert_slice_indexer(key, kind='getitem')
File "/usr/local/lib/python2.7/dist-packages/pandas-0.20.2-py2.7-linux-x86_64.egg/pandas/core/indexes/base.py", line 1407, in _convert_slice_indexer
indexer = self.slice_indexer(start, stop, step, kind=kind)
File "/usr/local/lib/python2.7/dist-packages/pandas-0.20.2-py2.7-linux-x86_64.egg/pandas/core/indexes/datetimes.py", line 1515, in slice_indexer
return Index.slice_indexer(self, start, end, step, kind=kind)
File "/usr/local/lib/python2.7/dist-packages/pandas-0.20.2-py2.7-linux-x86_64.egg/pandas/core/indexes/base.py", line 3350, in slice_indexer
kind=kind)
File "/usr/local/lib/python2.7/dist-packages/pandas-0.20.2-py2.7-linux-x86_64.egg/pandas/core/indexes/base.py", line 3538, in slice_locs
start_slice = self.get_slice_bound(start, 'left', kind)
File "/usr/local/lib/python2.7/dist-packages/pandas-0.20.2-py2.7-linux-x86_64.egg/pandas/core/indexes/base.py", line 3487, in get_slice_bound
raise err
KeyError: 1435802520000000000
I think it's a timestamp conversion problem, so I tried the following, but it still wouldn't work:
d3 = pandas.Timestamp(datetime(Y, M, D, H, m))
d2 = pandas.to_datetime(d)
Your help would be appreciated, thank you. :)
Change the return value of GetDatetime() to:
return str(d)
This returns a datetime string, which the time series index is able to handle.
If I understand your code correctly, when you do this:
subseries = series['2015-07-07 01:00:00':'2015-07-07 03:30:00']
you are slicing series between two strings (by the way, the name is confusing, since pandas has a Series data type).
If that works, then for subseries = df[d:f] to work, d and f need to be strings as well.
You can get them by calling the datetime method .strftime(), e.g.:
d = GetDatetime().strftime('%Y-%m-%d 00:00:00')
f = GetDatetime().strftime('%Y-%m-%d 00:00:00')
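Another option, sketched here under an assumption the thread does not confirm: a KeyError on a label slice can also come from a DatetimeIndex that is not sorted, because pandas then needs both bounds to exist as exact labels. If that is the case, sorting the index first lets datetime objects be used directly as slice bounds. The sample data below is made up for illustration.

```python
import datetime

import pandas as pd

# Made-up sample frame whose DatetimeIndex is deliberately out of order.
series = pd.DataFrame(
    {"value": [3, 1, 2, 4]},
    index=pd.to_datetime([
        "2015-07-07 03:00",
        "2015-07-07 01:00",
        "2015-07-07 02:00",
        "2015-07-07 04:00",
    ]),
)

# Sorting makes the index monotonic, so slice bounds need not be exact labels.
series = series.sort_index()

d = datetime.datetime(2015, 7, 7, 1, 30)
f = datetime.datetime(2015, 7, 7, 3, 30)
subseries = series.loc[d:f]
print(subseries)
```

Using .loc makes it explicit that the slice is label-based rather than positional.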

Python Pandas: creating a dataframe using a function for one of the fields

I am trying to create a dataframe where one of the fields is calculated using a function. To do this I use the following code:
import pandas as pd
def didSurvive(sex):
    return int(sex == "female")
titanic_df = pd.read_csv("test.csv")
submission = pd.DataFrame({
    "PassengerId": titanic_df["PassengerId"],
    "Survived": didSurvive(titanic_df["Sex"])
})
submission.to_csv('titanic-predictions.csv', index=False)
When I run this code I get the following error:
D:\Documents\kaggle\titanic>python predictor.py
File "predictor.py", line 3
def didSurvive() {
^
SyntaxError: invalid syntax
D:\Documents\kaggle\titanic>python predictor.py
D:\Documents\kaggle\titanic>python predictor.py
D:\Documents\kaggle\titanic>python predictor.py
Traceback (most recent call last):
File "predictor.py", line 10, in
"Survived": didSurvive(titanic_df["Sex"])
File "predictor.py", line 4, in didSurvive
return int(sex == "female")
File "C:\Python34\lib\site-packages\pandas\core\series.py", line 92,
in wrapper
"{0}".format(str(converter)))
TypeError: cannot convert the series to
D:\Documents\kaggle\titanic>
I think what is happening is I'm trying to run the int() on a series of booleans instead of an individual boolean. How do I go about fixing this?
To convert the data type of a Series, you can use the astype() method. This should work:
def didSurvive(sex):
    return (sex == "female").astype(int)
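You can also convert the values while importing the CSV file. A converter is called on each cell value rather than on the whole column, so it has to be the scalar version of didSurvive from the question (the one returning int(sex == "female")):
titanic_df = pd.read_csv("test.csv", converters={'Sex': didSurvive})
submission = pd.DataFrame(titanic_df, columns=['PassengerId', 'Sex'])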
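Both approaches can be checked side by side without the Kaggle file. The sketch below uses a few made-up rows in place of test.csv, and the names did_survive, csv_text and converted are chosen here for illustration.

```python
import io

import pandas as pd

# Made-up rows standing in for test.csv.
csv_text = "PassengerId,Sex\n892,male\n893,female\n894,male\n"


def did_survive(sex):
    # Vectorised: compares the whole Series at once, then casts bool to int.
    return (sex == "female").astype(int)


titanic_df = pd.read_csv(io.StringIO(csv_text))
submission = pd.DataFrame({
    "PassengerId": titanic_df["PassengerId"],
    "Survived": did_survive(titanic_df["Sex"]),
})

# Converter variant: the function receives one cell value (a string) at a time.
converted = pd.read_csv(
    io.StringIO(csv_text),
    converters={"Sex": lambda value: int(value == "female")},
)
print(submission)
```

The vectorised version is usually the better fit when the raw Sex column is still needed afterwards, since the converter replaces it during the read.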
