Vaex datetime error "Unknown variables or column" - Python

I have a vaex.dataframe.DataFrame called df with a time column called timestamp of type string. I convert the column to datetime as follows:
import numpy as np
from pandas.api.types import is_datetime64_any_dtype as is_datetime

if not is_datetime(df['timestamp']):
    df['timestamp'] = df['timestamp'].apply(np.datetime64)
Then I just want to select rows of df where the timestamp is in a specific range. Let's say:
sliced_df = df[(df['timestamp'] > np.datetime64("2022-01-01"))]
I am doing this in SageMaker and it throws a huge error, the core of which is the following:
Traceback (most recent call last):
File "/home/ec2-user/SageMaker/custom-miniconda/miniconda/envs/custom_python/lib/python3.7/site-packages/vaex/scopes.py", line 106, in evaluate
result = self[expression]
File "/home/ec2-user/SageMaker/custom-miniconda/miniconda/envs/custom_python/lib/python3.7/site-packages/vaex/scopes.py", line 166, in __getitem__
raise KeyError("Unknown variables or column: %r" % (variable,))
KeyError: "Unknown variables or column: 'datetime64(__timestamp)'"
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
File "/home/ec2-user/SageMaker/custom-miniconda/miniconda/envs/custom_python/lib/python3.7/multiprocessing/pool.py", line 121, in worker
result = (True, func(*args, **kwds))
File "/home/ec2-user/SageMaker/custom-miniconda/miniconda/envs/custom_python/lib/python3.7/site-packages/vaex/expression.py", line 1327, in _apply
scalar_result = self.f(*[fix_type(k[i]) for k in args], **{key: value[i] for key, value in kwargs.items()})
ValueError: Error parsing datetime string "nan" at position 0
The above exception was the direct cause of the following exception:
Traceback (most recent call last):
File "/home/ec2-user/SageMaker/custom-miniconda/miniconda/envs/custom_python/lib/python3.7/site-packages/vaex/scopes.py", line 265, in __getitem__
values = self.evaluate(expression) # , out=self.buffers[variable])
File "/home/ec2-user/SageMaker/custom-miniconda/miniconda/envs/custom_python/lib/python3.7/site-packages/vaex/scopes.py", line 188, in evaluate
result = eval(expression, expression_namespace, self)
File "<string>", line 1, in <module>
File "/home/ec2-user/SageMaker/custom-miniconda/miniconda/envs/custom_python/lib/python3.7/site-packages/vaex/arrow/numpy_dispatch.py", line 136, in wrapper
result = f(*args, **kwargs)
File "/home/ec2-user/SageMaker/custom-miniconda/miniconda/envs/custom_python/lib/python3.7/site-packages/vaex/expression.py", line 1312, in __call__
return vaex.multiprocessing.apply(self._apply, args, kwargs, self.multiprocessing)
File "/home/ec2-user/SageMaker/custom-miniconda/miniconda/envs/custom_python/lib/python3.7/site-packages/vaex/multiprocessing.py", line 32, in apply
result = _get_pool().apply(f, args, kwargs)
File "/home/ec2-user/SageMaker/custom-miniconda/miniconda/envs/custom_python/lib/python3.7/multiprocessing/pool.py", line 261, in apply
return self.apply_async(func, args, kwds).get()
File "/home/ec2-user/SageMaker/custom-miniconda/miniconda/envs/custom_python/lib/python3.7/multiprocessing/pool.py", line 657, in get
raise self._value
ValueError: Error parsing datetime string "nan" at position 0
ERROR:MainThread:vaex.scopes:error in evaluating: 'timestamp'
"""
The df holds values similar to these under the timestamp column:
<pyarrow.lib.StringArray object at 0x7f569e5f54b0>
[
"2021-12-19 06:01:10.789",
"2021-12-20 07:02:11.89",
"2022-01-01 08:02:12.678",
"2022-01-02 09:03:13.567",
"2022-01-03 10:04:14.456"
]
The timestamps look fine to me. I compared them with previous data where the comparison worked and nothing seems different. I have no clue why this is no longer working. I've been trying to wrap my head around it for days now but really can't find why it's throwing that error.
When I check for
df[df.timestamp.isna()]
it returns nothing. So I don't understand where the "nan" at position 0 in the error message above comes from.
I appreciate any help. Thanks in advance!

It is probably because you are comparing Arrow timestamps to NumPy timestamps. You need to choose one framework and work with it.
This issue on vaex's GitHub discusses what you are facing and might clear things up:
https://github.com/vaexio/vaex/issues/1704
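For example, instead of parsing each value with apply(np.datetime64), you could cast the whole column inside vaex so the comparison stays within one framework. A minimal sketch with made-up data; whether astype('datetime64') accepts your exact string format may depend on your vaex version:
import numpy as np
import vaex

# Made-up data mirroring the timestamps in the question.
df = vaex.from_arrays(timestamp=np.array([
    "2021-12-19 06:01:10.789",
    "2022-01-01 08:02:12.678",
    "2022-01-03 10:04:14.456",
]))

# Cast the string column to a NumPy-backed datetime64 column in one go.
df['timestamp'] = df['timestamp'].astype('datetime64')

# Now both sides of the comparison are NumPy datetimes.
sliced_df = df[df['timestamp'] > np.datetime64("2022-01-01")]
print(sliced_df)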

Related

Pandas update column conditionally if another column exists

I have a dataframe with a column whose value depends on the existence of another column, so I tried to use np.where to condition the value like this:
the_dataframe["dependant_value"] = np.where(
"independant_value_" + the_dataframe["suffix"].str.lower()
in the_dataframe.columns,
the_dataframe["another_column"]
* the_dataframe[
"independent_value_" + the_dataframe["suffix"].str.lower()
],
0,
)
But I'm getting this error:
File "C:\the_file.py", line 271, in the_method
"independent_value_" + the_dataframe["suffix"].str.lower()
File "C:\the_file.py", line 4572, in __contains__
hash(key)
TypeError: unhashable type: 'Series'
I suppose there must be a proper way to evaluate the condition, but I haven't found it.
There is no syntax like Series in Series/Index:
>>> s = pd.Series([1, 2])
>>> s in s
Traceback (most recent call last):
File "~/sourcecode/test/so/73460893.py", line 44, in <module>
s in s
File "~/.local/lib/python3.10/site-packages/pandas/core/generic.py", line 1994, in __contains__
return key in self._info_axis
File "~/.local/lib/python3.10/site-packages/pandas/core/indexes/range.py", line 365, in __contains__
hash(key)
TypeError: unhashable type: 'Series'
You might want Series.isin
("independant_value_" + the_dataframe["suffix"].str.lower()).isin(the_dataframe.columns),

pandas filter dataframe based on chained splits

I have a pandas dataframe which contains a column (column name filenames) with filenames. The filenames look something like:
long_file1_name_0.jpg
long_file2_name_1.jpg
long_file3_name_0.jpg
...
To filter, I do this (let's say select_string = "0"):
df_fp = df_fp[~df_fp["filenames"].str.split(".jpg")[0].split("_")[-1]==select_string]
but I get thrown this:
Traceback (most recent call last):
File "/file/location/dir/lib/python3.7/site-packages/pandas/core/indexes/base.py", line 2889, in get_loc
return self._engine.get_loc(casted_key)
File "pandas/_libs/index.pyx", line 70, in pandas._libs.index.IndexEngine.get_loc
File "pandas/_libs/index.pyx", line 97, in pandas._libs.index.IndexEngine.get_loc
File "pandas/_libs/hashtable_class_helper.pxi", line 1032, in pandas._libs.hashtable.Int64HashTable.get_item
File "pandas/_libs/hashtable_class_helper.pxi", line 1039, in pandas._libs.hashtable.Int64HashTable.get_item
KeyError: 0
The above exception was the direct cause of the following exception:
Traceback (most recent call last):
File "python_file.py", line 118, in <module>
main()
File "inference.py", line 57, in main
_=some_function(config_dict=config_dict, logger=logger, select_string=config_dict['global']['select_string'])
File "/file/location/dir/etc/fprint/dataloaders.py", line 31, in some_function2
logger=logger, select_string=select_string)
File "/file/location/dir/etc/fprint/preprocess.py", line 25, in df_preprocess
df_fp = df_fp[~df_fp["filenames"].str.split(".jpg")[0].split("_")[-1]==select_string]
File "/file/location/dir/lib/python3.7/site-packages/pandas/core/series.py", line 882, in __getitem__
return self._get_value(key)
File "/file/location/dir/lib/python3.7/site-packages/pandas/core/series.py", line 991, in _get_value
loc = self.index.get_loc(label)
File "/file/location/dir/lib/python3.7/site-packages/pandas/core/indexes/base.py", line 2891, in get_loc
raise KeyError(key) from err
KeyError: 0
I think it does not like me chaining the splits, but I vaguely remember doing this some time ago and it worked, so I am perplexed why it throws this error.
PS: I do know how to solve this using .contains, but I would like to use this approach of comparing strings.
Any pointers would be great!
Here is another way, with .str.extract():
import pandas as pd

df = pd.DataFrame({'filename': ['long_file1_name_0.jpg',
                                'long_file2_name_1.jpg',
                                'long_file3_name_0.jpg',
                                'long_file3_name_33.jpg']})
Now, create a boolean mask. The squeeze() method ensures we have a series, so the mask will work:
mask = (df['filename'].str.extract(r'\w+_(\d+)\.jpg')
        .astype(int)
        .eq(0)
        .squeeze())
print(df.loc[mask])
filename
0 long_file1_name_0.jpg
2 long_file3_name_0.jpg
Assuming all rows contain .jpg; if not, change it to split on . instead:
select_string = str(0)  # select_string should be of type str
df_fp = df_fp[df_fp["filenames"].apply(lambda x: x.split(".jpg")[0].split("_")[-1]).astype(str) == select_string]
This part:
df_fp["filenames"].str.split(".jpg")[0]
returns the first row of the resulting Series (the row labeled 0), not the first element of each list.
What you are looking for is the expand parameter (it creates a new column for every element in the list after the split). Combined with rsplit, you can grab the part after the last underscore directly:
df[df['filenames'].str.split('.jpg', expand=True)[0].str.rsplit('_', n=1, expand=True)[1] == '0']
Alternatively you could do that via apply:
df[df['filenames'].apply(lambda x: x.split('.jpg')[0].split('_')[-1]) == '0']
but contains is definitely more appropriate here.
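If you'd rather keep the original chained-split style, the missing piece is the .str element accessor, which indexes into each row's list instead of into the Series itself (a sketch reusing the question's names):
# .str[0] / .str[-1] pick elements inside each row's list, whereas a bare
# [0] looks up the row labeled 0 in the Series.
mask = (df_fp["filenames"].str.split(".jpg").str[0]
        .str.split("_").str[-1] == select_string)
df_fp = df_fp[mask]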

Using iloc on a dataframe gives me out-of-bounds errors even when using a list of indices derived from the dataframe

So I have a DataFrame with about 400,000 rows. When I try to get all the data using iloc, it throws out-of-bounds errors. Here is what I have tried:
index_second_update = the_data.index.tolist()
the_data.iloc[index_second_update]
Traceback (most recent call last):
File "/home/dev/.local/lib/python3.6/site-packages/pandas/core/indexing.py",
line 2130, in _get_list_axis
return self.obj.take(key, axis=axis)
File "/home/dev/.local/lib/python3.6/site-packages/pandas/core/generic.py",
line 3604, in take
indices, axis=self._get_block_manager_axis(axis), verify=True
File "/home/dev/.local/lib/python3.6/site-packages/pandas/core/internals/managers.py",
line 1389, in take
indexer = maybe_convert_indices(indexer, n)
File "/home/dev/.local/lib/python3.6/site-packages/pandas/core/indexers.py",
line 201, in maybe_convert_indices
raise IndexError("indices are out-of-bounds")
IndexError: indices are out-of-bounds
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "/home/dev/.local/lib/python3.6/site-packages/pandas/core/indexing.py",
line 1424, in getitem
return self._getitem_axis(maybe_callable, axis=axis)
File "/home/dev/.local/lib/python3.6/site-packages/pandas/core/indexing.py",
line 2148, in _getitem_axis
return self._get_list_axis(key, axis=axis)
File "/home/dev/.local/lib/python3.6/site-packages/pandas/core/indexing.py",
line 2133, in _get_list_axis
raise IndexError("positional indexers are out-of-bounds")
IndexError: positional indexers are out-of-bounds
Some more details:
len(index_second_update) = 446882
index_second_update == the_data.index.tolist()
True
The strange thing is that it breaks down at around 200,000 rows; up until then it works perfectly fine.
df.loc accesses the DataFrame by row label, which is not necessarily the row number (iloc, by contrast, is purely positional).
Here is code that will work for you, accessing the data by row label:
index_second_update = the_data.index.tolist()
the_data.loc[index_second_update]
or even more simply:
the_data.loc[the_data.index]
As an example of an index that is not row numbers, look at the DataFrame below, where the rows are labeled by name:
import pandas as pd
csv = """\
Name,Birth Year
Joe,2000
Bill,1998
Mike,1996
Frank,1995"""
from io import StringIO
df = pd.read_csv(StringIO(csv))
df.set_index('Name')
Birth Year
Name
Joe 2000
Bill 1998
Mike 1996
Frank 1995
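To see why a valid-looking label list can blow up under iloc, here is a minimal sketch with made-up data: after filtering, labels can exceed the frame's length, so positional lookup fails while label lookup succeeds:
import pandas as pd

df = pd.DataFrame({'x': range(5)})
subset = df[df['x'] > 2]    # keeps the rows labeled 3 and 4 -> length 2

print(subset.loc[subset.index.tolist()])   # fine: lookup is by label
# subset.iloc[subset.index.tolist()] raises IndexError, because
# positions 3 and 4 do not exist in a 2-row frame.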

How to create a diff column with the previous period value in python?

I'm just trying to create a column in my dataframe holding the difference between a column's value and the same column's value for the previous month. If the previous month doesn't exist, don't calculate the difference.
Result table example
df_ranking['cat_race'] = df_ranking.groupby(df_ranking['ID'], df_ranking['DATE'])['POINTS'].shift(1)
But the error message I get is:
Traceback (most recent call last):
File "C:/Users/jhoyo/PycharmProjects/Tennis-Ranking/venv/ranking_2_db.py", line 95, in <module>
df_ranking['cat_race'] = df_ranking.groupby(df_ranking['licencia'], df_ranking['date'])['puntos'].shift(1)
File "C:\Users\jhoyo\AppData\Local\Programs\Python\Python37-32\lib\site-packages\pandas\core\generic.py", line 7629, in groupby
axis = self._get_axis_number(axis)
File "C:\Users\jhoyo\AppData\Local\Programs\Python\Python37-32\lib\site-packages\pandas\core\generic.py", line 351, in _get_axis_number
axis = cls._AXIS_ALIASES.get(axis, axis)
File "C:\Users\jhoyo\AppData\Local\Programs\Python\Python37-32\lib\site-packages\pandas\core\generic.py", line 1816, in __hash__
' hashed'.format(self.__class__.__name__))
TypeError: 'Series' objects are mutable, thus they cannot be hashed
You have to define the groupby like this:
df_ranking['cat_race'] = df_ranking.groupby(['ID', 'Date'])['POINTS'].shift(1)
Hope it works.
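Since the question asks for a difference rather than the shifted value itself, you would typically sort, shift within each ID, and subtract. A sketch using the question's column names, assuming one row per ID and month; note that grouping by ID and DATE together would put every row in its own group, so shift(1) would return only NaN:
# Sort so that shift(1) really is the previous month within each ID.
df_ranking = df_ranking.sort_values(['ID', 'DATE'])
prev = df_ranking.groupby('ID')['POINTS'].shift(1)

# Where no previous month exists, prev is NaN and so is the difference.
df_ranking['cat_race'] = df_ranking['POINTS'] - prev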

Memory Error while applying a .loc filter on Dataframe

I have a large dataframe with about 392 million rows and 9 columns. I want to apply a filter on the data set to extract a subset.
Here, my original dataset is dh_activity_recos:
dh_activity_approved = dh_activity_recos.loc[dh_activity_recos.approved_flag == 1]
Now, when I apply this filter I get the following memory error:
Traceback (most recent call last):
File "/mnt01/eh-datasci/ravinder/working/final_recos_processing.py", line 144, in <module>
dh_activity_approved = dh_activity_recos.loc[dh_activity_recos.approved_flag == 1]
File "/home/ubuntu/anaconda/lib/python2.7/site-packages/pandas/core/indexing.py", line 1227, in __getitem__
return self._getitem_axis(key, axis=0)
File "/home/ubuntu/anaconda/lib/python2.7/site-packages/pandas/core/indexing.py", line 1344, in _getitem_axis
return self._getbool_axis(key, axis=axis)
File "/home/ubuntu/anaconda/lib/python2.7/site-packages/pandas/core/indexing.py", line 1239, in _getbool_axis
raise self._exception(detail)
KeyError: MemoryError()
I am not able to understand the exact reason behind this. I have checked with the dir() command; there aren't any other memory-consuming objects besides this large dataset. Moreover, I am executing this in the cloud with 128 GB of RAM, so I'm not sure why this error is surfacing.
