Good morning Stack Overflow.
I am trying to find a better way to read in a CSV file and parse its datetimes. Unfortunately my data comes in as '%j:%H:%M:%S.%f', such as 234:17:33:00.000206700. The year sits in another field in the header that I skip over, so this was my method of converting before setting the index, since I have date rollovers to account for. It works, but it is slower than I would like and is not intuitive.
import numpy as np
import pandas as pd

dataframe = pd.read_csv(data_file, skiprows=np.arange(0, meta_lines), header=[0, 1, 2])
dataframe['Temp'] = meta['Date'].split('-')[2] + ' '  # splitting the year off of '08-22-2019'
dataframe['Temp'] = dataframe[['Temp', 'AbsoluteTime']].apply(lambda x: ''.join(x), axis=1)
dataframe['AbsoluteTime'] = pd.to_datetime(dataframe['Temp'], format='%Y %j:%H:%M:%S.%f')
del dataframe['Temp']
dataframe.set_index('AbsoluteTime', inplace=True)
Originally I wanted to have pd.to_datetime parse without the %Y, yielding the year 1900, and then add an offset of X years, but when I started down that path I came across this error.
dataframe['AbsoluteTime']
Out[8]:
DDD:HH:MM:SS.sssssssss
Absolute Time
0 234:17:33:00.000206700
1 234:17:33:00.011264914
2 234:17:33:00.015721314
...
pd.to_datetime(dateframe['AbsoluteTime'],format='%j:%H:%M:%S.%f')
Traceback (most recent call last):
File "<ipython-input-9-6dfc074c2dc4>", line 1, in <module>
pd.to_datetime(dateframe['AbsoluteTime'],format='%j:%H:%M:%S.%f')
NameError: name 'dateframe' is not defined
pd.to_datetime(dataframe['AbsoluteTime'],format='%j:%H:%M:%S.%f')
Traceback (most recent call last):
File "<ipython-input-10-bfbf7ee22833>", line 1, in <module>
pd.to_datetime(dataframe['AbsoluteTime'],format='%j:%H:%M:%S.%f')
File "C:\Users\fkatzenb\AppData\Local\Continuum\Anaconda3\lib\site-packages\pandas\core\tools\datetimes.py", line 512, in to_datetime
result = _assemble_from_unit_mappings(arg, errors=errors)
File "C:\Users\fkatzenb\AppData\Local\Continuum\Anaconda3\lib\site-packages\pandas\core\tools\datetimes.py", line 582, in _assemble_from_unit_mappings
unit = {k: f(k) for k in arg.keys()}
File "C:\Users\fkatzenb\AppData\Local\Continuum\Anaconda3\lib\site-packages\pandas\core\tools\datetimes.py", line 582, in <dictcomp>
unit = {k: f(k) for k in arg.keys()}
File "C:\Users\fkatzenb\AppData\Local\Continuum\Anaconda3\lib\site-packages\pandas\core\tools\datetimes.py", line 577, in f
if value.lower() in _unit_map:
AttributeError: 'tuple' object has no attribute 'lower'
What gives? My problem isn't from having double brackets [[]] like the other threads with this error address. If I do this as a test, I see...
pd.to_datetime(['234:17:33:00.000206700'],format='%j:%H:%M:%S.%f')
Out[6]: DatetimeIndex(['1900-08-22 17:33:00.000206700'], dtype='datetime64[ns]', freq=None)
I was then just going to add an offset to that to shift the year from 1900 to the current year.
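A minimal sketch of that idea, with a hypothetical hard-coded year standing in for the one parsed from meta['Date'], and pd.DateOffset rather than a fixed Timedelta since year lengths vary:
import pandas as pd

ts = pd.to_datetime(['234:17:33:00.000206700'], format='%j:%H:%M:%S.%f')  # year defaults to 1900
year = 2019  # hypothetical: would come from meta['Date'], e.g. '08-22-2019'
shifted = ts + pd.DateOffset(years=year - 1900)
# caveat: depending on the pandas version, nanosecond precision may not survive DateOffset arithmetic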
My only remaining thought is that it has to do with my multi-level column header (see my read_csv call above).
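Rereading the traceback, _assemble_from_unit_mappings only runs when pd.to_datetime receives a DataFrame, so my guess is that with header=[0,1,2], dataframe['AbsoluteTime'] is still a DataFrame whose column keys are tuples, and .lower() is being called on those tuples. A sketch of a possible workaround (untested; the sub-level names are taken from the output above), selecting all three header levels to get a true Series:
col = dataframe[('AbsoluteTime', 'DDD:HH:MM:SS.sssssssss', 'Absolute Time')]
pd.to_datetime(col, format='%j:%H:%M:%S.%f')
Thoughts? Suggestions?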
Thanks!
Related
I have a dataframe with a column whose value depends on the existence of another column, so I tried to use np.where to condition the value like this:
the_dataframe["dependant_value"] = np.where(
"independant_value_" + the_dataframe["suffix"].str.lower()
in the_dataframe.columns,
the_dataframe["another_column"]
* the_dataframe[
"independent_value_" + the_dataframe["suffix"].str.lower()
],
0,
)
But I'm getting this error:
File "C:\the_file.py", line 271, in the_method
"independent_value_" + the_dataframe["suffix"].str.lower()
File "C:\the_file.py", line 4572, in __contains__
hash(key)
TypeError: unhashable type: 'Series'
I suppose there must be a proper way to do the logical evaluation of the condition, but I haven't found it.
There is no such syntax as "Series in Series/Index":
>>> s = pd.Series([1, 2])
>>> s in s
Traceback (most recent call last):
File "~/sourcecode/test/so/73460893.py", line 44, in <module>
s in s
File "~/.local/lib/python3.10/site-packages/pandas/core/generic.py", line 1994, in __contains__
return key in self._info_axis
File "~/.local/lib/python3.10/site-packages/pandas/core/indexes/range.py", line 365, in __contains__
hash(key)
TypeError: unhashable type: 'Series'
You might want Series.isin
("independant_value_" + the_dataframe["suffix"].str.lower()).isin(the_dataframe.columns),
I have a vaex.dataframe.DataFrame called df holding a time column called timestamp of type string. I convert the column to datetime as follows:
import numpy as np
from pandas.api.types import is_datetime64_any_dtype as is_datetime

if not is_datetime(df['timestamp']):
    df['timestamp'] = df['timestamp'].apply(np.datetime64)
Then I just want to select rows of df where the timestamp is in a specific range. Let's say:
sliced_df = df[(df['timestamp'] > np.datetime64("2022-01-01"))]
I am doing that in SageMaker and it throws a huge error, the main messages of which are:
Traceback (most recent call last):
File "/home/ec2-user/SageMaker/custom-miniconda/miniconda/envs/custom_python/lib/python3.7/site-packages/vaex/scopes.py", line 106, in evaluate
result = self[expression]
File "/home/ec2-user/SageMaker/custom-miniconda/miniconda/envs/custom_python/lib/python3.7/site-packages/vaex/scopes.py", line 166, in __getitem__
raise KeyError("Unknown variables or column: %r" % (variable,))
KeyError: "Unknown variables or column: 'datetime64(__timestamp)'"
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
File "/home/ec2-user/SageMaker/custom-miniconda/miniconda/envs/custom_python/lib/python3.7/multiprocessing/pool.py", line 121, in worker
result = (True, func(*args, **kwds))
File "/home/ec2-user/SageMaker/custom-miniconda/miniconda/envs/custom_python/lib/python3.7/site-packages/vaex/expression.py", line 1327, in _apply
scalar_result = self.f(*[fix_type(k[i]) for k in args], **{key: value[i] for key, value in kwargs.items()})
ValueError: Error parsing datetime string "nan" at position 0
The above exception was the direct cause of the following exception:
Traceback (most recent call last):
File "/home/ec2-user/SageMaker/custom-miniconda/miniconda/envs/custom_python/lib/python3.7/site-packages/vaex/scopes.py", line 265, in __getitem__
values = self.evaluate(expression) # , out=self.buffers[variable])
File "/home/ec2-user/SageMaker/custom-miniconda/miniconda/envs/custom_python/lib/python3.7/site-packages/vaex/scopes.py", line 188, in evaluate
result = eval(expression, expression_namespace, self)
File "<string>", line 1, in <module>
File "/home/ec2-user/SageMaker/custom-miniconda/miniconda/envs/custom_python/lib/python3.7/site-packages/vaex/arrow/numpy_dispatch.py", line 136, in wrapper
result = f(*args, **kwargs)
File "/home/ec2-user/SageMaker/custom-miniconda/miniconda/envs/custom_python/lib/python3.7/site-packages/vaex/expression.py", line 1312, in __call__
return vaex.multiprocessing.apply(self._apply, args, kwargs, self.multiprocessing)
File "/home/ec2-user/SageMaker/custom-miniconda/miniconda/envs/custom_python/lib/python3.7/site-packages/vaex/multiprocessing.py", line 32, in apply
result = _get_pool().apply(f, args, kwargs)
File "/home/ec2-user/SageMaker/custom-miniconda/miniconda/envs/custom_python/lib/python3.7/multiprocessing/pool.py", line 261, in apply
return self.apply_async(func, args, kwds).get()
File "/home/ec2-user/SageMaker/custom-miniconda/miniconda/envs/custom_python/lib/python3.7/multiprocessing/pool.py", line 657, in get
raise self._value
ValueError: Error parsing datetime string "nan" at position 0
ERROR:MainThread:vaex.scopes:error in evaluating: 'timestamp'
"""
The df holds values similar to these under the column timestamp
<pyarrow.lib.StringArray object at 0x7f569e5f54b0>
[
"2021-12-19 06:01:10.789",
"2021-12-20 07:02:11.89",
"2022-01-01 08:02:12.678",
"2022-01-02 09:03:13.567",
"2022-01-03 10:04:14.456"
]
The timestamps look fine to me. I compared them with previous data where the comparison worked and nothing seems to be different. I have no clue why this is not working anymore. I have been trying to wrap my head around it for days now but really can't find why it's throwing that error.
When I check for
df[df.timestamp.isna()]
it returns nothing. So I don't understand why it found nan in the first position as stated in the error message above.
I appreciate any help. Thanks in advance!
It is probably because you are comparing arrow timestamps to numpy timestamps. You need to choose one framework and work with that.
This issue on vaex's GitHub discusses what you are facing a bit, so it might clear things up more:
https://github.com/vaexio/vaex/issues/1704
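As a sketch of keeping everything on the numpy side (this assumes vaex's Expression.astype accepts a 'datetime64[ns]' target here, which I have not verified):
import numpy as np

# cast the arrow-backed string column to numpy datetime64 once, up front
df['timestamp'] = df['timestamp'].astype('datetime64[ns]')
# now both sides of the comparison are numpy datetimes
sliced_df = df[df['timestamp'] > np.datetime64('2022-01-01')]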
I have a pandas dataframe with columns latitude and longitude (which are integer type) and a date column (datetime64[ns, UTC], as needed for the function). I use the following line to produce a new column holding the sun's azimuth:
daa['azimuth'] = daa.apply(lambda row: get_azimuth(row['latitude'], row['longitude'], row['date']), axis=1)
It crashes and I cannot figure out why; the only thing I know is that the problem is in the date:
File "pandas\_libs\tslibs\timestamps.pyx", line 1332, in pandas._libs.tslibs.timestamps.Timestamp.__new__
TypeError: an integer is required
If anyone has an idea what I am supposed to do with the date, it would be great, thanks.
This goes back to a bug in pandas, see issue #32174. pysolar.solar.get_azimuth calls the .utctimetuple() method of the given datetime object (or pd.Timestamp), which fails:
import pandas as pd
s = pd.to_datetime(pd.Series(["2020-01-01", "2020-01-02"])).dt.tz_localize('UTC')
s.iloc[0]
Out[3]: Timestamp('2020-01-01 00:00:00+0000', tz='UTC')
s.iloc[0].utctimetuple()
Traceback (most recent call last):
File "<ipython-input-4-f5e393f18fdb>", line 1, in <module>
s.iloc[0].utctimetuple()
File "pandas\_libs\tslibs\timestamps.pyx", line 1332, in pandas._libs.tslibs.timestamps.Timestamp.__new__
TypeError: an integer is required
You can work around this by converting the pandas Timestamp to a Python datetime object, where utctimetuple works as expected. For the given example, you can use:
daa.apply(lambda row: get_azimuth(row['latitude'], row['longitude'], row['date'].to_pydatetime()), axis=1)
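As a self-contained sketch (hypothetical coordinates and date; assumes pysolar is installed):
import pandas as pd
from pysolar.solar import get_azimuth

daa = pd.DataFrame({
    'latitude': [48],
    'longitude': [16],
    'date': pd.to_datetime(['2020-06-01 12:00:00']).tz_localize('UTC'),
})
# .to_pydatetime() hands pysolar a plain datetime.datetime, sidestepping the Timestamp bug
daa['azimuth'] = daa.apply(
    lambda row: get_azimuth(row['latitude'], row['longitude'], row['date'].to_pydatetime()),
    axis=1,
)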
I'm just trying to create a column in my dataframe with the difference between a column's value and the same column's value in the previous month. If the previous month doesn't exist, don't calculate the difference.
[Result table example image]
df_ranking['cat_race'] = df_ranking.groupby(df_ranking['ID'], df_ranking['DATE'])['POINTS'].shift(1)
But the error message I get is:
Traceback (most recent call last):
File "C:/Users/jhoyo/PycharmProjects/Tennis-Ranking/venv/ranking_2_db.py", line 95, in <module>
df_ranking['cat_race'] = df_ranking.groupby(df_ranking['licencia'], df_ranking['date'])['puntos'].shift(1)
File "C:\Users\jhoyo\AppData\Local\Programs\Python\Python37-32\lib\site-packages\pandas\core\generic.py", line 7629, in groupby
axis = self._get_axis_number(axis)
File "C:\Users\jhoyo\AppData\Local\Programs\Python\Python37-32\lib\site-packages\pandas\core\generic.py", line 351, in _get_axis_number
axis = cls._AXIS_ALIASES.get(axis, axis)
File "C:\Users\jhoyo\AppData\Local\Programs\Python\Python37-32\lib\site-packages\pandas\core\generic.py", line 1816, in __hash__
' hashed'.format(self.__class__.__name__))
TypeError: 'Series' objects are mutable, thus they cannot be hashed
You have to define the groupby like this:
df_ranking['cat_race'] = df_ranking.groupby(['ID','Date'])['POINTS'].shift(1)
Hope it will work
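To get the actual difference from the previous month, one sketch (hypothetical data; note it groups by ID only, assuming one row per ID per month, so shift(1) walks back one month within each player):
import pandas as pd

df_ranking = pd.DataFrame({
    'ID': [1, 1, 1, 2],
    'DATE': pd.to_datetime(['2019-01-01', '2019-02-01', '2019-03-01', '2019-01-01']),
    'POINTS': [10, 15, 12, 7],
})
df_ranking = df_ranking.sort_values(['ID', 'DATE'])
prev = df_ranking.groupby('ID')['POINTS'].shift(1)    # previous month's points per ID
df_ranking['cat_race'] = df_ranking['POINTS'] - prev  # NaN where no previous month exists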
When I run the following code, I keep getting "InterfaceError: Error binding parameter 2 - probably unsupported type", and I need help identifying where the problem is. Everything works fine up until I try to send the data to SQL with:
anagramsdf.to_sql('anagrams',con=conn,if_exists='replace',index=False)
cdf=pd.read_sql("select (distinct ID) from anagrams;",conn)
Full code:
import pandas as pd
import sqlite3
conn = sqlite3.connect("anagrams")
xsorted=sorted(anagrams,key=sorted)
xunique=[x[0] for x in anagrams]
xunique=pd.Series(xunique)
xanagrams=pd.Series(anagrams)
anagramsdf=pd.concat([xunique,dfcount,xanagrams],axis=1)
anagramsdf.columns=['ID','anagram_count','anagram_list']
c=conn.cursor()
c.execute("create table anagrams(ID, anagram_count, anagram_list)")
conn.commit()
anagramsdf.to_sql('anagrams',con=conn,if_exists='replace',index=False)
cdf=pd.read_sql("select (distinct ID) from anagrams;",conn)
cdf=pd.read_sql("select max(anagram_count) from anagrams;",conn)
cdf
def print_full(x):
    pd.set_option('display.max_rows', len(x))
    print(x)
    pd.reset_option('display.max_rows')
cdf=pd.read_sql("select * from anagrams where anagram_count=12;",conn)
pd.set_option('max_colwidth',200)
Full traceback error:
Traceback (most recent call last):
File "sqlpandas.py", line 88, in <module>
anagramsdf.to_sql('anagrams',con=conn,if_exists='replace',index=False)
File "/Users/andrewclark/anaconda/lib/python2.7/site-packages/pandas/core/generic.py", line 982, in to_sql
dtype=dtype)
File "/Users/andrewclark/anaconda/lib/python2.7/site-packages/pandas/io/sql.py", line 549, in to_sql
chunksize=chunksize, dtype=dtype)
File "/Users/andrewclark/anaconda/lib/python2.7/site-packages/pandas/io/sql.py", line 1567, in to_sql
table.insert(chunksize)
File "/Users/andrewclark/anaconda/lib/python2.7/site-packages/pandas/io/sql.py", line 728, in insert
self._execute_insert(conn, keys, chunk_iter)
File "/Users/andrewclark/anaconda/lib/python2.7/site-packages/pandas/io/sql.py", line 1357, in _execute_insert
conn.executemany(self.insert_statement(), data_list)
sqlite3.InterfaceError: Error binding parameter 2 - probably unsupported type.
Snippet from Dataframe:
ID anagram_count anagram_list
0 aa 1 (aa,)
1 anabaena 1 (anabaena,)
2 baaskaap 1 (baaskaap,)
3 caracara 1 (caracara,)
4 caragana 1 (caragana,)
I used the following code to change the datatypes to strings, and this solved the problem:
anagramsdf.dtypes
anagramsdf['ID']= anagramsdf['ID'].astype('str')
anagramsdf['anagram_list']= anagramsdf['anagram_list'].astype('str')
anagramsdf.to_sql('anagramsdf',con=conn,if_exists='append',index=False)
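The root cause is that anagram_list holds Python tuples (see the snippet above), and sqlite3 can only bind scalar types: numbers, strings, bytes, and None. A minimal demonstration of the failure:
import sqlite3

conn = sqlite3.connect(':memory:')
conn.execute('create table t (x)')
conn.execute('insert into t values (?)', ('aa',))     # a str binds fine
conn.execute('insert into t values (?)', (('aa',),))  # a tuple raises sqlite3.InterfaceError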
Using pandas 0.23.4, I had a column with datetime values (format '%Y-%m-%d %H:%M:%S') that was of data type "string" and threw the same error when I passed it to the to_sql method. After converting it to "datetime" dtype it worked. Hope that's helpful to someone with the same issue :).
To convert:
df['date'] = pd.to_datetime(df['date'], format=datetimeFormat, errors='coerce')  # e.g. datetimeFormat = '%Y-%m-%d %H:%M:%S'
Sparrow's solution worked for me. If the index is not converted to a datetime, SQL will throw "error binding parameter".
I used the column that held the datetimes, converted it to the correct format, and then set it as the index:
df.set_index(pd.to_datetime(df['datetime']), inplace=True)