Index dimension must be <= 2 when adding to a DataFrame - python

I am trying to add a new column to a DataFrame using pandas, but I keep getting an error. Here is my code:
classes = pd.DataFrame(data['class'])
df['class'] = classes
And every time I run this I get:
Traceback (most recent call last):
File "c:\Users\gjohn\Documents\code\machineLearning\trading_bot\filter.py", line 147, in <module>
df['class'] = list(classes)
File "C:\Users\gjohn\AppData\Local\Programs\Python\Python39\lib\site-packages\scipy\sparse\_index.py", line 76, in __setitem__
row, col = self._validate_indices(key)
File "C:\Users\gjohn\AppData\Local\Programs\Python\Python39\lib\site-packages\scipy\sparse\_index.py", line 138, in _validate_indices
row = self._asindices(row, M)
File "C:\Users\gjohn\AppData\Local\Programs\Python\Python39\lib\site-packages\scipy\sparse\_index.py", line 162, in _asindices
raise IndexError('Index dimension must be <= 2')
IndexError: Index dimension must be <= 2
Why am I getting this?
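One thing the traceback itself shows: the failing `__setitem__` lives in `scipy/sparse/_index.py`, which suggests that `df` is a SciPy sparse matrix at that point rather than a pandas DataFrame. A minimal sketch of checking for that and converting before adding the column, assuming that diagnosis is right (the matrix contents and the `classes` labels below are made up for illustration):

```python
import pandas as pd
from scipy import sparse

# Hypothetical stand-in for the asker's `df`: a SciPy sparse matrix.
df = sparse.csr_matrix([[1.0, 0.0], [0.0, 2.0], [3.0, 0.0]])
classes = ["a", "b", "a"]  # hypothetical labels, one per row

if sparse.issparse(df):
    # Densify before wrapping in a DataFrame. For very large matrices,
    # pd.DataFrame.sparse.from_spmatrix(df) avoids the dense copy.
    df = pd.DataFrame(df.toarray())

# Column assignment by name now goes through pandas, not scipy.
df["class"] = classes
```

If `df` turns out to be a regular DataFrame already, the `issparse` check is simply skipped and the cause lies elsewhere.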

Related

Vaex datetime error unknown variables or column

I have a vaex.dataframe.DataFrame called df with a time column called timestamp of type string. I convert the column to datetime as follows:
import numpy as np
from pandas.api.types import is_datetime64_any_dtype as is_datetime
if not is_datetime(df['timestamp']):
    df['timestamp'] = df['timestamp'].apply(np.datetime64)
Then I just want to select the rows of df where the timestamp is in a specific range, say:
sliced_df = df[(df['timestamp'] > np.datetime64("2022-01-01"))]
I am doing this in SageMaker, and it throws a long error that mainly consists of the following messages:
Traceback (most recent call last):
File "/home/ec2-user/SageMaker/custom-miniconda/miniconda/envs/custom_python/lib/python3.7/site-packages/vaex/scopes.py", line 106, in evaluate
result = self[expression]
File "/home/ec2-user/SageMaker/custom-miniconda/miniconda/envs/custom_python/lib/python3.7/site-packages/vaex/scopes.py", line 166, in __getitem__
raise KeyError("Unknown variables or column: %r" % (variable,))
KeyError: "Unknown variables or column: 'datetime64(__timestamp)'"
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
File "/home/ec2-user/SageMaker/custom-miniconda/miniconda/envs/custom_python/lib/python3.7/multiprocessing/pool.py", line 121, in worker
result = (True, func(*args, **kwds))
File "/home/ec2-user/SageMaker/custom-miniconda/miniconda/envs/custom_python/lib/python3.7/site-packages/vaex/expression.py", line 1327, in _apply
scalar_result = self.f(*[fix_type(k[i]) for k in args], **{key: value[i] for key, value in kwargs.items()})
ValueError: Error parsing datetime string "nan" at position 0
The above exception was the direct cause of the following exception:
Traceback (most recent call last):
File "/home/ec2-user/SageMaker/custom-miniconda/miniconda/envs/custom_python/lib/python3.7/site-packages/vaex/scopes.py", line 265, in __getitem__
values = self.evaluate(expression) # , out=self.buffers[variable])
File "/home/ec2-user/SageMaker/custom-miniconda/miniconda/envs/custom_python/lib/python3.7/site-packages/vaex/scopes.py", line 188, in evaluate
result = eval(expression, expression_namespace, self)
File "<string>", line 1, in <module>
File "/home/ec2-user/SageMaker/custom-miniconda/miniconda/envs/custom_python/lib/python3.7/site-packages/vaex/arrow/numpy_dispatch.py", line 136, in wrapper
result = f(*args, **kwargs)
File "/home/ec2-user/SageMaker/custom-miniconda/miniconda/envs/custom_python/lib/python3.7/site-packages/vaex/expression.py", line 1312, in __call__
return vaex.multiprocessing.apply(self._apply, args, kwargs, self.multiprocessing)
File "/home/ec2-user/SageMaker/custom-miniconda/miniconda/envs/custom_python/lib/python3.7/site-packages/vaex/multiprocessing.py", line 32, in apply
result = _get_pool().apply(f, args, kwargs)
File "/home/ec2-user/SageMaker/custom-miniconda/miniconda/envs/custom_python/lib/python3.7/multiprocessing/pool.py", line 261, in apply
return self.apply_async(func, args, kwds).get()
File "/home/ec2-user/SageMaker/custom-miniconda/miniconda/envs/custom_python/lib/python3.7/multiprocessing/pool.py", line 657, in get
raise self._value
ValueError: Error parsing datetime string "nan" at position 0
ERROR:MainThread:vaex.scopes:error in evaluating: 'timestamp'
The df holds values similar to these under the column timestamp
<pyarrow.lib.StringArray object at 0x7f569e5f54b0>
[
"2021-12-19 06:01:10.789",
"2021-12-20 07:02:11.89",
"2022-01-01 08:02:12.678",
"2022-01-02 09:03:13.567",
"2022-01-03 10:04:14.456"
]
The timestamps look fine to me. I compared them with previous data where the comparison worked, and nothing seems to be different. I have no clue why this is no longer working. I have been trying to wrap my head around it for days but really can't find why it's throwing that error.
When I check for
df[df.timestamp.isna()]
it returns nothing, so I don't understand why it found nan at the first position, as stated in the error message above.
I appreciate any help. Thanks in advance!
It is probably because you are comparing Arrow timestamps to NumPy timestamps. You need to choose one framework and work with that.
This issue on vaex's GitHub discusses what you are facing, so it might clear things up more:
https://github.com/vaexio/vaex/issues/1704
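Separately from the Arrow/NumPy mismatch, the final `ValueError` in the traceback is the one `np.datetime64` raises when it is handed the literal string "nan". The sketch below uses NumPy only (not vaex) to show a tolerant parser that maps unparseable values to NaT; the helper name `to_datetime64` and the sample values are assumptions for illustration, and it is not a confirmed fix for the vaex behavior.

```python
import numpy as np


def to_datetime64(value):
    """Parse a timestamp string; return NaT instead of raising."""
    if not isinstance(value, str):
        # Missing values may arrive as float nan or None.
        return np.datetime64("NaT")
    try:
        return np.datetime64(value)
    except ValueError:
        # e.g. the string "nan", which np.datetime64 cannot parse.
        return np.datetime64("NaT")


raw = ["2021-12-19 06:01:10.789", "nan", "2022-01-02 09:03:13.567"]
parsed = [to_datetime64(v) for v in raw]

cutoff = np.datetime64("2022-01-01")
# Skip NaT entries explicitly, then keep timestamps after the cutoff.
selected = [t for t in parsed if not np.isnat(t) and t > cutoff]
```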

Using iloc on a dataframe gives me out of bound errors even when using a list of indices derived from the dataframe

So I have a DataFrame with about 400,000 rows. When I try to get all the data using iloc, it raises out-of-bounds errors. Here is what I have tried.
index_second_update = the_data.index.tolist()
the_data.iloc[index_second_update]
Traceback (most recent call last):
File "/home/dev/.local/lib/python3.6/site-packages/pandas/core/indexing.py",
line 2130, in _get_list_axis
return self.obj.take(key, axis=axis)
File "/home/dev/.local/lib/python3.6/site-packages/pandas/core/generic.py",
line 3604, in take
indices, axis=self._get_block_manager_axis(axis), verify=True
File "/home/dev/.local/lib/python3.6/site-packages/pandas/core/internals/managers.py",
line 1389, in take
indexer = maybe_convert_indices(indexer, n)
File "/home/dev/.local/lib/python3.6/site-packages/pandas/core/indexers.py",
line 201, in maybe_convert_indices
raise IndexError("indices are out-of-bounds")
IndexError: indices are out-of-bounds
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "/home/dev/.local/lib/python3.6/site-packages/pandas/core/indexing.py",
line 1424, in __getitem__
return self._getitem_axis(maybe_callable, axis=axis)
File "/home/dev/.local/lib/python3.6/site-packages/pandas/core/indexing.py",
line 2148, in _getitem_axis
return self._get_list_axis(key, axis=axis)
File "/home/dev/.local/lib/python3.6/site-packages/pandas/core/indexing.py",
line 2133, in _get_list_axis
raise IndexError("positional indexers are out-of-bounds")
IndexError: positional indexers are out-of-bounds
Some more details:
len(index_second_update) = 446882
index_second_update == the_data.index.tolist()
True
Strange thing is that it breaks down at around 200000 rows. Up until then it works perfectly fine.
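df.iloc selects rows by integer position, whereas df.loc selects them by the label of each row, which is not necessarily the row number. The list you pass comes from the index, so it contains labels, not positions.
Here is code that will work for you; it accesses the data by row label: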
index_second_update = the_data.index.tolist()
the_data.loc[index_second_update]
or even more simply:
the_data.loc[the_data.index]
As an example of an index that does not consist of row numbers, look at the DataFrame below, where the rows are labeled by name.
import pandas as pd
csv = """\
Name,Birth Year
Joe,2000
Bill,1998
Mike,1996
Frank,1995"""
from io import StringIO
df = pd.read_csv(StringIO(csv))
df.set_index('Name')
       Birth Year
Name
Joe          2000
Bill         1998
Mike         1996
Frank        1995
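The difference between label-based and position-based access can be sketched on a tiny frame whose integer labels do not match the row positions. The frame and its values are made up for illustration; only the variable name `the_data` is borrowed from the question.

```python
import pandas as pd

# Three rows whose labels (10, 20, 30) differ from their positions (0, 1, 2).
the_data = pd.DataFrame({"value": [1, 2, 3]}, index=[10, 20, 30])
labels = the_data.index.tolist()

# Label-based lookup: works, because 10, 20 and 30 are valid labels.
by_label = the_data.loc[labels]

# Position-based lookup with the same list: 10, 20 and 30 are not
# valid positions in a 3-row frame, so pandas raises IndexError.
try:
    the_data.iloc[labels]
    iloc_failed = False
except IndexError:
    iloc_failed = True

# Positions always run from 0 to len - 1, whatever the labels are.
by_position = the_data.iloc[list(range(len(the_data)))]
```

This would also be consistent with the failure appearing only partway through a large frame: labels smaller than the number of rows are still valid positions, so `iloc` only fails once a label exceeds the row count.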

How to create a diff column with the previous period value in python?

I'm just trying to create a column in my dataframe with the difference of the column value and the same column of the previous month. In case the previous month doesn't exist, don't calculate the difference.
Result table example
df_ranking['cat_race'] = df_ranking.groupby(df_ranking['ID'], df_ranking['DATE'])['POINTS'].shift(1)
But the error message I get is:
Traceback (most recent call last):
File "C:/Users/jhoyo/PycharmProjects/Tennis-Ranking/venv/ranking_2_db.py", line 95, in <module>
df_ranking['cat_race'] = df_ranking.groupby(df_ranking['licencia'], df_ranking['date'])['puntos'].shift(1)
File "C:\Users\jhoyo\AppData\Local\Programs\Python\Python37-32\lib\site-packages\pandas\core\generic.py", line 7629, in groupby
axis = self._get_axis_number(axis)
File "C:\Users\jhoyo\AppData\Local\Programs\Python\Python37-32\lib\site-packages\pandas\core\generic.py", line 351, in _get_axis_number
axis = cls._AXIS_ALIASES.get(axis, axis)
File "C:\Users\jhoyo\AppData\Local\Programs\Python\Python37-32\lib\site-packages\pandas\core\generic.py", line 1816, in __hash__
' hashed'.format(self.__class__.__name__))
TypeError: 'Series' objects are mutable, thus they cannot be hashed
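groupby expects a list of column names, not several Series passed as separate positional arguments (the second positional argument is interpreted as axis, which is why pandas tries to hash the Series). Define the groupby like this:
df_ranking['cat_race'] = df_ranking.groupby(['ID','DATE'])['POINTS'].shift(1)
Hope this works.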
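For the stated goal (difference from the previous month, skipped when the previous month is missing), one possible approach is sketched below. It assumes one row per ID per month and a datetime DATE column; the sample data and the column name `diff_prev_month` are hypothetical. Note that it groups by ID only, so that `shift(1)` reaches back to the previous row of the same ID.

```python
import pandas as pd

# Hypothetical sample: ID 1 has no March row, so April gets no diff.
df_ranking = pd.DataFrame({
    "ID": [1, 1, 1, 2, 2],
    "DATE": pd.to_datetime(
        ["2020-01-31", "2020-02-29", "2020-04-30", "2020-01-31", "2020-02-29"]
    ),
    "POINTS": [100, 120, 150, 80, 70],
})

df_ranking = df_ranking.sort_values(["ID", "DATE"]).reset_index(drop=True)
grouped = df_ranking.groupby("ID")

# Previous row's points and date within the same ID.
prev_points = grouped["POINTS"].shift(1)
prev_date = grouped["DATE"].shift(1)

# Express each date as a running month count so gaps are easy to detect.
month = df_ranking["DATE"].dt.year * 12 + df_ranking["DATE"].dt.month
prev_month = prev_date.dt.year * 12 + prev_date.dt.month
is_consecutive = (month - prev_month) == 1

# Keep the difference only where the previous row is exactly one month back.
df_ranking["diff_prev_month"] = (df_ranking["POINTS"] - prev_points).where(is_consecutive)
```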

ValueError in DataFrame Pandas

My objective is to..
if the dataframe is empty, i need to insert a row with index->value of the variable URL and columns-> value of URL along with the sorted_list
if non-empty, i need to insert a row with index->value of the variable URL and columns->sorted_list
What I did was... I initialized a DataFrame self.pd and then for each row with values as above said I created a local DataFrame variable df1 and append it to self.df.
My code:
import pandas as pd
class Reward_Matrix:
    def __init__(self):
        self.df = pd.DataFrame()
    def add(self, URL, webpage_list):
        sorted_list = []
        check_list = list(self.df.columns.values)
        print('check_list: ',check_list)
        for i in webpage_list: #to ensure no duplication columns
            if i not in check_list:
                sorted_list.append(i)
        if self.df.empty:
            sorted_list.insert(0, URL)
            df1 = pd.DataFrame(0,index=[URL], columns=[sorted_list])
        else:
            df1 = pd.DataFrame(0,index=[URL], columns=[sorted_list])
        print(df1)
        print('sorted_list: ',sorted_list)
        print("length: ",len(df1.columns))
        self.df.append(df1)
But I get the following error:
Traceback (most recent call last):
File "...Continuum\anaconda3\lib\site-packages\pandas\core\internals.py", line 4294, in create_block_manager_from_blocks
placement=slice(0, len(axes[0])))]
File "...Continuum\anaconda3\lib\site-packages\pandas\core\internals.py", line 2719, in make_block
return klass(values, ndim=ndim, fastpath=fastpath, placement=placement)
File "...Continuum\anaconda3\lib\site-packages\pandas\core\internals.py", line 115, in __init__
len(self.mgr_locs)))
ValueError: Wrong number of items passed 1, placement implies 450
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
File "...eclipse-workspace\Crawler\crawl_core\src_main\run.py", line 23, in test_start
test.crawl_run(self.URL)
File "...eclipse-workspace\Crawler\crawl_core\src_main\test_crawl.py", line 42, in crawl_run
self.reward.add(URL, webpage_list)
File "...eclipse-workspace\Crawler\crawl_core\src_main\dynamic_matrix.py", line 21, in add
df1 = pd.DataFrame(0,index=[URL], columns=[sorted_list])
File "...Continuum\anaconda3\lib\site-packages\pandas\core\frame.py", line 352, in __init__
copy=False)
File "...Continuum\anaconda3\lib\site-packages\pandas\core\frame.py", line 483, in _init_ndarray
return create_block_manager_from_blocks([values], [columns, index])
File "...Continuum\anaconda3\lib\site-packages\pandas\core\internals.py", line 4303, in create_block_manager_from_blocks
construction_error(tot_items, blocks[0].shape[1:], axes, e)
File "...Continuum\anaconda3\lib\site-packages\pandas\core\internals.py", line 4280, in construction_error
passed, implied))
ValueError: Shape of passed values is (1, 1), indices imply (450, 1)
I am not well versed in DataFrames and pandas. I have been getting this error for quite some time, and going through similar questions on Stack Overflow only confuses me, as I can't understand where I went wrong!
Can someone help me out?
I think you need to remove the [], because otherwise you get a nested list:
df1 = pd.DataFrame(0,index=[URL], columns=sorted_list)
Sample:
sorted_list = ['a','b','c']
URL = 'url1'
df1 = pd.DataFrame(0,index=[URL], columns=sorted_list)
print (df1)
      a  b  c
url1  0  0  0
df1 = pd.DataFrame(0,index=[URL], columns=[sorted_list])
print (df1)
>ValueError: Shape of passed values is (1, 1), indices imply (3, 1)
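A further point about the question's code: `self.df.append(df1)` returns a new DataFrame rather than modifying `self.df` in place, so its result is discarded, and `DataFrame.append` was removed in pandas 2.0. A minimal sketch of building rows with a flat column list and accumulating them with `pd.concat` instead, using made-up URLs and column names:

```python
import pandas as pd

sorted_list = ["a", "b", "c"]

# Flat list of column names: one row, three columns.
df1 = pd.DataFrame(0, index=["url1"], columns=sorted_list)

# A second hypothetical row that shares one column and adds a new one.
df2 = pd.DataFrame(0, index=["url2"], columns=["c", "d"])

# concat returns a new frame, so the result must be kept.
# Cells missing from either row come back as NaN; fill them with 0.
combined = pd.concat([df1, df2]).fillna(0).astype(int)
```

Inside the class from the question, the equivalent would be to assign the result back, for example `self.df = pd.concat([self.df, df1])`.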

Strange error in pandas indexing with range when length >= 1,000,000

Pandas raises a ValueError when assigning multiple values to a Series (or DataFrame) using range(x) where x > 1. The error is raised only when the length of the Series is one million or larger.
import pandas as pd
for x in [5, 999999, 1000000]:
    s = pd.Series(index=range(x))
    print('series length =', len(s))
    # assigning value with range(1), always works
    s.loc[range(1)] = 42
    # reading values with range(x>1), always works
    _ = s.loc[range(2)]
    # assigning values with range(x>1), fails only len >= 1 million
    s.loc[range(2)] = 42
Output:
series length = 5
series length = 999999
series length = 1000000
Traceback (most recent call last):
File "<stdin>", line 9, in <module>
File "/home/nekobon/.env_exp/lib/python3.4/site-packages/pandas/core/indexing.py", line 114, in __setitem__
indexer = self._get_setitem_indexer(key)
File "/home/nekobon/.env_exp/lib/python3.4/site-packages/pandas/core/indexing.py", line 109, in _get_setitem_indexer
return self._convert_to_indexer(key, is_setter=True)
File "/home/nekobon/.env_exp/lib/python3.4/site-packages/pandas/core/indexing.py", line 1042, in _convert_to_indexer
return labels.get_loc(obj)
File "/home/nekobon/.env_exp/lib/python3.4/site-packages/pandas/core/index.py", line 1692, in get_loc
return self._engine.get_loc(_values_from_object(key))
File "pandas/index.pyx", line 137, in pandas.index.IndexEngine.get_loc (pandas/index.c:3979)
File "pandas/index.pyx", line 145, in pandas.index.IndexEngine.get_loc (pandas/index.c:3680)
File "pandas/index.pyx", line 464, in pandas.index._bin_search (pandas/index.c:9124)
ValueError: The truth value of an array with more than one element is ambiguous. Use a.any() or a.all()
I use python 3.4 and pandas 0.17.0. This behavior doesn't seem to be reported yet. Does pandas do anything special on Series with length >= 1,000,000?
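The traceback ends in `get_loc`, which suggests the `range` object is being looked up as a single label rather than as a collection of labels. A workaround worth trying, sketched here under that assumption and not verified against pandas 0.17.0, is to pass the labels as an explicit list:

```python
import pandas as pd

# Same size as the failing case in the question.
s = pd.Series(index=range(1_000_000), dtype="float64")

# A list is treated as a collection of labels by .loc,
# so both labels 0 and 1 receive the value.
s.loc[list(range(2))] = 42
```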
