IndexError when replacing missing values with mode using groupby in pandas - python

I have a dataset which requires missing value treatment.
Column                  Missing Values
Complaint_ID            0
Date_received           0
Transaction_Type        0
Complaint_reason        0
Company_response        22506
Date_sent_to_company    0
Complaint_Status        0
Consumer_disputes       7698
The problem arises when I try to replace the missing values with the per-group mode using groupby:
Code:
data11["Company_response"] =
data11.groupby("Complaint_reason").transform(lambda x: x.fillna(x.mode()
[0]))["Company_response"]
data11["Consumer_disputes"] =
data11.groupby("Transaction_Type").transform(lambda x: x.fillna(x.mode()
[0]))["Consumer_disputes"]
I get the following error:
Stacktrace
Traceback (most recent call last):
File "<ipython-input-89-8de6a010a299>", line 1, in <module>
data11["Company_response"] = data11.groupby("Complaint_reason").transform(lambda x: x.fillna(x.mode()[0]))["Company_response"]
File "C:\Anaconda3\lib\site-packages\pandas\core\groupby.py", line 3741, in transform
return self._transform_general(func, *args, **kwargs)
File "C:\Anaconda3\lib\site-packages\pandas\core\groupby.py", line 3699, in _transform_general
res = path(group)
File "C:\Anaconda3\lib\site-packages\pandas\core\groupby.py", line 3783, in <lambda>
lambda x: func(x, *args, **kwargs), axis=self.axis)
File "C:\Anaconda3\lib\site-packages\pandas\core\frame.py", line 4360, in apply
ignore_failures=ignore_failures)
File "C:\Anaconda3\lib\site-packages\pandas\core\frame.py", line 4456, in _apply_standard
results[i] = func(v)
File "C:\Anaconda3\lib\site-packages\pandas\core\groupby.py", line 3783, in <lambda>
lambda x: func(x, *args, **kwargs), axis=self.axis)
File "<ipython-input-89-8de6a010a299>", line 1, in <lambda>
data11["Company_response"] = data11.groupby("Complaint_reason").transform(lambda x: x.fillna(x.mode()[0]))["Company_response"]
File "C:\Anaconda3\lib\site-packages\pandas\core\series.py", line 601, in __getitem__
result = self.index.get_value(self, key)
File "C:\Anaconda3\lib\site-packages\pandas\core\indexes\base.py", line 2434, in get_value
return libts.get_value_box(s, key)
File "pandas\_libs\tslib.pyx", line 923, in pandas._libs.tslib.get_value_box (pandas\_libs\tslib.c:18843)
File "pandas\_libs\tslib.pyx", line 939, in pandas._libs.tslib.get_value_box (pandas\_libs\tslib.c:18560)
IndexError: ('index out of bounds', 'occurred at index Consumer_disputes')
I have checked the length of the dataframe and all of its columns, and it is the same: 43266.
I have also found a similar question, but it does not have a correct answer: Click here
Please help me resolve this error.
Here is a snapshot of the dataset if it helps in any way: Dataset Snapshot
I am using the code below successfully, and it does fill the missing values, but it does not serve my purpose exactly:
data11['Company_response'].fillna(data11['Company_response'].mode()[0], inplace=True)
data11['Consumer_disputes'].fillna(data11['Consumer_disputes'].mode()[0], inplace=True)
Edit1: (Attaching Sample)
Input Given:
Expected Output:
You can see that the missing values of Company_response for Tr-1 and Tr-3 are filled by taking the mode within each Complaint_reason group, and similarly the missing Consumer_disputes for Tr-5 by taking the mode within its Transaction_Type group.
The snippet below contains the dataframe and the code for anyone who wants to replicate the problem:
Replication Code
import pandas as pd
import numpy as np
data11 = pd.DataFrame({'Complaint_ID': ['Tr-1', 'Tr-2', 'Tr-3', 'Tr-4', 'Tr-5', 'Tr-6'],
                       'Transaction_Type': ['Mortgage', 'Credit card', 'Bank account or service', 'Debt collection', 'Credit card', 'Mortgage'],
                       'Complaint_reason': ['Loan servicing, payments, escrow account', 'Incorrect information on credit report', "Cont'd attempts collect debt not owed", "Cont'd attempts collect debt not owed", 'Payoff process', 'Loan servicing, payments, escrow account'],
                       'Company_response': [np.nan, 'Company chooses not to provide a public response', np.nan, 'Company believes it acted appropriately as authorized by contract or law', 'Company has responded to the consumer and the CFPB and chooses not to provide a public response', 'Company disputes the facts presented in the complaint'],
                       'Consumer_disputes': ['Yes', 'No', 'No', 'No', np.nan, 'Yes']})
data11.isnull().sum()
data11["Company_response"] = data11.groupby("Complaint_reason").transform(lambda x: x.fillna(x.mode()[0]))["Company_response"]
data11["Consumer_disputes"] = data11.groupby("Transaction_Type").transform(lambda x: x.fillna(x.mode()[0]))["Consumer_disputes"]

The error is raised because, for at least one of the groups, the corresponding aggregated column contains only np.nan values. In that case pd.Series([np.nan]).mode() returns an empty Series, which leads to an IndexError when you take its first value.
So you may use something like transform(lambda x: x.fillna(x.mode()[0] if not x.mode().empty else "Empty")).
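Applied to the two columns from the question, that could look like the sketch below; "Empty" is just an arbitrary placeholder label for all-NaN groups:
# Sketch of the fallback fix, applied per column
data11["Company_response"] = data11.groupby("Complaint_reason")["Company_response"].transform(
    lambda x: x.fillna(x.mode()[0] if not x.mode().empty else "Empty"))
data11["Consumer_disputes"] = data11.groupby("Transaction_Type")["Consumer_disputes"].transform(
    lambda x: x.fillna(x.mode()[0] if not x.mode().empty else "Empty"))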

Try selecting the column before calling transform:
data11["Company_response"] = data11.groupby("Complaint_reason")['Company_response'].transform(lambda x: x.fillna(x.mode()[0]))
data11["Consumer_disputes"] = data11.groupby("Transaction_Type")['Consumer_disputes'].transform(lambda x: x.fillna(x.mode()[0]))
This works because your original code ran the lambda over every column of every group (note that the traceback says 'occurred at index Consumer_disputes' even though you were assigning Company_response), so a group that is all-NaN in any unrelated column triggers the IndexError. Selecting the column first restricts the transform to the column you actually want to fill.

@Mikhail Berlinkov is almost certainly correct. I was able to reproduce your error, and then avoid it by using dropna():
data11.groupby("Transaction_Type").transform(
    lambda x: x.fillna(x.mode()[0]))["Consumer_disputes"]
# Returns IndexError

data11.dropna().groupby("Transaction_Type").transform(
    lambda x: x.fillna(x.mode()[0]))["Consumer_disputes"]
# Works


how can I make a for loop to populate a DataFrame?

First of all, thanks to everyone who tries to help.
I have started to learn Python and came across an opportunity to use it to my advantage at work.
I basically made a script that reads a Google Sheets file, imports it into pandas, and cleans up the data.
In the end, I just want the names of the agents as the columns, with all of their values from the resolucao column below them, so I can take the average amount of time for each agent. But I'm struggling to build that with a list comprehension / for loop.
This is what the DataFrame looks like after I cleaned it up
And this is the code that I tried to run:
PS: Sorry for the messy code.
agentes_unique = list(df['Agente'].unique())
agentes_duplicated = df['Agente']
value_resolucao_duplicated = df['resolucao']
n_of_rows = []
for row in range(len(df)):
    n_of_rows.append(row)
i = 0
while n_of_rows[i] < len(n_of_rows):
    df2 = pd.DataFrame({agentes_unique[i]: (value for value in df['resolucao'][i] if df['Agente'][i] == agentes_unique[i])})
    i += 1
df2.to_excel('teste.xlsx', index=True, header=True)
But in the end it came to this error:
Traceback (most recent call last):
File "C:\Users\FELIPE\Desktop\Python\webscraping\.venv\lib\site-packages\pandas\core\indexes\range.py", line 385, in get_loc
return self._range.index(new_key)
ValueError: 0 is not in range
The above exception was the direct cause of the following exception:
Traceback (most recent call last):
File "c:\Users\FELIPE\Desktop\Python\webscraping\bot_csv_extract\bot_login\main.py", line 50, in <module>
df2 = pd.DataFrame({'Agente': (valor for valor in df['resolucao'][i] if df['Agente'][i] == 'Gabriel')})
File "C:\Users\FELIPE\Desktop\Python\webscraping\.venv\lib\site-packages\pandas\core\series.py", line 958, in __getitem__
return self._get_value(key)
File "C:\Users\FELIPE\Desktop\Python\webscraping\.venv\lib\site-packages\pandas\core\series.py", line 1069, in _get_value
loc = self.index.get_loc(label)
File "C:\Users\FELIPE\Desktop\Python\webscraping\.venv\lib\site-packages\pandas\core\indexes\range.py", line 387, in get_loc
raise KeyError(key) from err
KeyError: 0
I feel like I'm making some obvious mistake, but I can't fix it.
Again, thanks to anyone who tries to help
Are you looking to do something like this? This is just sample data, but if I understand what you're after, it's a good start:
data = {
    'Column1': ['Data', 'Another_Data', 'More_Data', 'Last_Data'],
    'Agente': ['Name', 'Another_Name', 'More_Names', 'Last_Name'],
    'Data': [1, 2, 3, 4]
}
df = pd.DataFrame(data)
df = df.pivot(index='Column1', columns=['Agente'], values='Data')
df = df.reset_index()
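A side note, not from the original answer: df.pivot raises a ValueError if the same Column1/Agente pair occurs more than once, while pivot_table aggregates duplicates instead, which also gives you the average directly:
# pivot_table variant: duplicate index/column pairs are averaged instead of raising
df = pd.DataFrame(data)
df = df.pivot_table(index='Column1', columns='Agente', values='Data', aggfunc='mean')
df = df.reset_index()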
Using for loops against pandas DataFrames is not recommended: it is considered messy and inefficient.
With some practice you will be able to approach problems in such a way that you will not need for loops in these cases.
From what I understand, your goal can be realized in 3 simple steps:
1. Select the 2 columns of interest. I recommend you take a look at how to access different elements of a DataFrame:
df = df[["Agente", "resolucao"]]
2. Convert the column you want to average to a numeric value. Say seconds:
df["resolucao"] = pd.to_timedelta(df['resolucao'].astype(str)).dt.total_seconds()
3. Apply an average aggregation, via the groupby() function:
df = df.groupby(["Agente"]).mean().reset_index()
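Put together, the three steps might look like the sketch below; the tiny frame and its resolucao values are hypothetical stand-ins for the cleaned sheet:
import pandas as pd

# Hypothetical stand-in: one row per ticket, handling time as an HH:MM:SS string
df = pd.DataFrame({'Agente': ['Gabriel', 'Gabriel', 'Maria'],
                   'resolucao': ['00:10:00', '00:20:00', '00:05:00']})

df = df[["Agente", "resolucao"]]
df["resolucao"] = pd.to_timedelta(df['resolucao'].astype(str)).dt.total_seconds()
df = df.groupby(["Agente"]).mean().reset_index()
print(df)  # average resolution time in seconds per Agente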
Hope this helps.
For next time, I also recommend not posting the data as an image, so that others can reproduce your code.
Cheers and keep it up!

Vaex datetime error unknown variables or column

I have a vaex.dataframe.DataFrame called df with a time column called timestamp of type string. I convert the column to datetime as follows:
import numpy as np
from pandas.api.types import is_datetime64_any_dtype as is_datetime
if not is_datetime(df['timestamp']):
    df['timestamp'] = df['timestamp'].apply(np.datetime64)
Then I just want to select the rows of df where the timestamp is in a specific range. Let's say:
sliced_df = df[(df['timestamp'] > np.datetime64("2022-01-01"))]
I am doing this in SageMaker and it throws a huge error; the main messages are:
Traceback (most recent call last):
File "/home/ec2-user/SageMaker/custom-miniconda/miniconda/envs/custom_python/lib/python3.7/site-packages/vaex/scopes.py", line 106, in evaluate
result = self[expression]
File "/home/ec2-user/SageMaker/custom-miniconda/miniconda/envs/custom_python/lib/python3.7/site-packages/vaex/scopes.py", line 166, in __getitem__
raise KeyError("Unknown variables or column: %r" % (variable,))
KeyError: "Unknown variables or column: 'datetime64(__timestamp)'"
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
File "/home/ec2-user/SageMaker/custom-miniconda/miniconda/envs/custom_python/lib/python3.7/multiprocessing/pool.py", line 121, in worker
result = (True, func(*args, **kwds))
File "/home/ec2-user/SageMaker/custom-miniconda/miniconda/envs/custom_python/lib/python3.7/site-packages/vaex/expression.py", line 1327, in _apply
scalar_result = self.f(*[fix_type(k[i]) for k in args], **{key: value[i] for key, value in kwargs.items()})
ValueError: Error parsing datetime string "nan" at position 0
The above exception was the direct cause of the following exception:
Traceback (most recent call last):
File "/home/ec2-user/SageMaker/custom-miniconda/miniconda/envs/custom_python/lib/python3.7/site-packages/vaex/scopes.py", line 265, in __getitem__
values = self.evaluate(expression) # , out=self.buffers[variable])
File "/home/ec2-user/SageMaker/custom-miniconda/miniconda/envs/custom_python/lib/python3.7/site-packages/vaex/scopes.py", line 188, in evaluate
result = eval(expression, expression_namespace, self)
File "<string>", line 1, in <module>
File "/home/ec2-user/SageMaker/custom-miniconda/miniconda/envs/custom_python/lib/python3.7/site-packages/vaex/arrow/numpy_dispatch.py", line 136, in wrapper
result = f(*args, **kwargs)
File "/home/ec2-user/SageMaker/custom-miniconda/miniconda/envs/custom_python/lib/python3.7/site-packages/vaex/expression.py", line 1312, in __call__
return vaex.multiprocessing.apply(self._apply, args, kwargs, self.multiprocessing)
File "/home/ec2-user/SageMaker/custom-miniconda/miniconda/envs/custom_python/lib/python3.7/site-packages/vaex/multiprocessing.py", line 32, in apply
result = _get_pool().apply(f, args, kwargs)
File "/home/ec2-user/SageMaker/custom-miniconda/miniconda/envs/custom_python/lib/python3.7/multiprocessing/pool.py", line 261, in apply
return self.apply_async(func, args, kwds).get()
File "/home/ec2-user/SageMaker/custom-miniconda/miniconda/envs/custom_python/lib/python3.7/multiprocessing/pool.py", line 657, in get
raise self._value
ValueError: Error parsing datetime string "nan" at position 0
ERROR:MainThread:vaex.scopes:error in evaluating: 'timestamp'
"""
The df holds values similar to these under the column timestamp
<pyarrow.lib.StringArray object at 0x7f569e5f54b0>
[
"2021-12-19 06:01:10.789",
"2021-12-20 07:02:11.89",
"2022-01-01 08:02:12.678",
"2022-01-02 09:03:13.567",
"2022-01-03 10:04:14.456"
]
The timestamps look fine to me. I compared with previous data where the comparison worked, and nothing seems to be different. I have been trying to wrap my head around this for days, but really can't find why it's throwing that error.
When I check
df[df.timestamp.isna()]
it returns nothing, so I don't understand why it found a NaN at the first position, as stated in the error message above.
I appreciate any help. Thanks in advance!
This is probably because you are comparing Arrow timestamps to numpy timestamps. You need to choose one framework and work with that.
This issue on vaex's GitHub discusses what you are facing and might clear things up:
https://github.com/vaexio/vaex/issues/1704
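A minimal sketch of one way to stay in the numpy world, assuming vaex's Expression.astype accepts 'datetime64' for a string column here (not verified against the exact vaex version in the question):
import numpy as np
import vaex

# Hypothetical frame mirroring the question's column
df = vaex.from_arrays(timestamp=np.array(["2021-12-19 06:01:10.789",
                                          "2022-01-02 09:03:13.567"]))

# Cast once with astype instead of apply(np.datetime64); the cast is evaluated
# by vaex itself, avoiding the multiprocessing apply() path that failed above
df['timestamp'] = df['timestamp'].astype('datetime64')
sliced_df = df[df['timestamp'] > np.datetime64("2022-01-01")]
print(sliced_df)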

Merging two dataframes with pd.NA in merge column yields 'TypeError: boolean value of NA is ambiguous'

With pandas 1.0.1, I'm unable to merge two dataframes when the merge column contains pd.NA. The call
df = df.merge(df2, on=some_column)
yields
File /home/torstein/code/fintechdb/Sheets/sheets/gild.py, line 42, in gild
df = df.merge(df2, on=some_column)
File /home/torstein/anaconda3/lib/python3.7/site-packages/pandas/core/frame.py, line 7297, in merge
validate=validate,
File /home/torstein/anaconda3/lib/python3.7/site-packages/pandas/core/reshape/merge.py, line 88, in merge
return op.get_result()
File /home/torstein/anaconda3/lib/python3.7/site-packages/pandas/core/reshape/merge.py, line 643, in get_result
join_index, left_indexer, right_indexer = self._get_join_info()
File /home/torstein/anaconda3/lib/python3.7/site-packages/pandas/core/reshape/merge.py, line 862, in _get_join_info
(left_indexer, right_indexer) = self._get_join_indexers()
File /home/torstein/anaconda3/lib/python3.7/site-packages/pandas/core/reshape/merge.py, line 841, in _get_join_indexers
self.left_join_keys, self.right_join_keys, sort=self.sort, how=self.how
File /home/torstein/anaconda3/lib/python3.7/site-packages/pandas/core/reshape/merge.py, line 1311, in _get_join_indexers
zipped = zip(*mapped)
File /home/torstein/anaconda3/lib/python3.7/site-packages/pandas/core/reshape/merge.py, line 1309, in <genexpr>
for n in range(len(left_keys))
File /home/torstein/anaconda3/lib/python3.7/site-packages/pandas/core/reshape/merge.py, line 1918, in _factorize_keys
rlab = rizer.factorize(rk)
File pandas/_libs/hashtable.pyx, line 77, in pandas._libs.hashtable.Factorizer.factorize
File pandas/_libs/hashtable_class_helper.pxi, line 1817, in pandas._libs.hashtable.PyObjectHashTable.get_labels
File pandas/_libs/hashtable_class_helper.pxi, line 1732, in pandas._libs.hashtable.PyObjectHashTable._unique
File pandas/_libs/missing.pyx, line 360, in pandas._libs.missing.NAType.__bool__
TypeError: boolean value of NA is ambiguous
while this works:
df[some_column].fillna(np.nan, inplace=True)
df2[some_column].fillna(np.nan, inplace=True)
df = df.merge(df2, on=some_column)
# Works
If instead, I do
df[some_column].fillna(pd.NA, inplace=True)
then the error returns.
This has to do with pd.NA being introduced in pandas 1.0.0 and with how the pandas team decided it should work in a boolean context. Also, take into account that it is an experimental feature, hence it shouldn't be used for anything but experimenting:
Warning Experimental: the behaviour of pd.NA can still change without warning.
Another part of the pandas documentation, the section on working with missing values, is where I believe the reason, and the answer you are looking for, can be found:
NA in a boolean context:
Since the actual value of an NA is unknown, it is ambiguous to convert NA to a boolean value. The following raises an error: TypeError: boolean value of NA is ambiguous
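A quick demonstration of that ambiguity (assumes pandas >= 1.0):
import pandas as pd

try:
    bool(pd.NA)  # forcing pd.NA into a boolean context
except TypeError as e:
    print(e)  # boolean value of NA is ambiguous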
Furthermore, it provides a valuable piece of advice:
"This also means that pd.NA cannot be used in a context where it is evaluated to a boolean, such as if condition: ... where condition can potentially be pd.NA. In such cases, isna() can be used to check for pd.NA or condition being pd.NA can be avoided, for example by filling missing values beforehand."
I decided that the pd.NA instances in my data were valid, and hence I needed to deal with them rather than fill them, e.g. with fillna(). If you're in the same situation, convert the value from pd.NA to either True or False by simply using pd.isna(val). Only you can decide whether the null should come out true or false, but here's a simple example:
val = pd.NA
if pd.isna(val):
    print('it is null')
else:
    print('it is not null')
returns: it is null
Then,
val = 7
if pd.isna(val):
    print('it is null')
else:
    print('it is not null')
returns: it is not null
Hope this helps others trying to get a definitive course of action (Celius's answer is accurate, but I wanted to provide actionable code for those struggling with this).

Need to assign dict to pandas DataFrame

I have a problem when I try to assign a dict to the df DataFrame:
df.loc[index,'count'] = dict()
as I get this error message:
Incompatible indexer with Series
To work around this problem, I can do this:
df.loc[index,'count'] = [dict()]
But I don't like this solution, since I have to unwrap the list before getting the dictionary, i.e.
a = (df.loc[index,'count'])[0]
How can I solve this in a more elegant way?
EDIT1
One way to replicate the issue is as follows.
Code:
import pandas as pd
df = pd.DataFrame(columns= ['count', 'aaa'])
d = dict()
df.loc[0, 'count'] = [d]; print('OK!');
df.loc[0, 'count'] = d
Output:
OK!
Traceback (most recent call last):
File "<ipython-input-193-67bbd89f2c69>", line 4, in <module>
df.loc[0, 'count'] = d
File "/usr/lib64/python3.6/site-packages/pandas/core/indexing.py", line 194, in __setitem__
self._setitem_with_indexer(indexer, value)
File "/usr/lib64/python3.6/site-packages/pandas/core/indexing.py", line 625, in _setitem_with_indexer
value = self._align_series(indexer, Series(value))
File "/usr/lib64/python3.6/site-packages/pandas/core/indexing.py", line 765, in _align_series
raise ValueError('Incompatible indexer with Series')
ValueError: Incompatible indexer with Series
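No answer was recorded here, but a commonly suggested workaround, not from the original thread, is the scalar accessor .at, which stores the dict as a single cell value instead of trying to align it like a Series:
import pandas as pd

df = pd.DataFrame(columns=['count', 'aaa'])
df.loc[0] = [None, None]      # make sure row 0 exists; cells stay object dtype
df.at[0, 'count'] = dict()    # .at treats the dict as one scalar cell value
a = df.at[0, 'count']         # read it back directly, no list to unwrap
print(a)  # {}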

Error in fit_transform: Input contains NaN, infinity or a value too large for dtype('float64')

I have a dataframe of shape (14407, 2564). I am trying to remove low variance features using the VarianceThreshold function. However, when I call fit_transform, I get the following error:
ValueError: Input contains NaN, infinity or a value too large for dtype('float64').
Before using VarianceThreshold, I replaced all the missing values in my df using the code below:
df.replace('null',np.NaN, inplace=True)
df.replace(r'^\s*$', np.NaN, regex=True, inplace=True)
df.fillna(value=df.median(), inplace=True)
I checked my dataframe afterwards for any empty/infinite values using:
m = df.isnull().any()
print "========= COLUMNS WITH NULL VALUES ================="
print m[m]
print "========= COLUMNS WITH INFINITE VALUES ================="
m = np.isfinite(df.select_dtypes(include=['float64'])).any()
print m[m]
and I got an empty Series as output, which means none of my columns have missing values. The output was:
========= COLUMNS WITH NULL VALUES =================
Series([], dtype: bool)
========= COLUMNS WITH INFINITE VALUES =================
Series([], dtype: bool)
Full error trace:
Traceback (most recent call last):
File "/home/users/MyUsername/MyProject/src/main/python/Main.py", line 222, in <module>
main()
File "/home/users/MyUsername/MyProject/src/main/python/Main.py", line 218, in main
getAllData()
File "/home/users/MyUsername/MyProject/src/main/python/Main.py", line 95, in getAllData
predictors, labels, dropped_features = fselector.process(variance=True, corr=True, bestf=True, bestfk=200)
File "/home/users/MyUsername/MyProject/src/main/python/classes/featureselector.py", line 54, in process
self.getVariance(threshold=(.95 * (1 - .95)))
File "/home/users/MyUsername/MyProject/src/main/python/classes/featureselector.py", line 136, in getVariance
self.removeLowVarianceColumns(df=self.X, thresh=threshold)
File "/home/users/MyUsername/MyProject/src/main/python/classes/featureselector.py", line 213, in removeLowVarianceColumns
selector.fit_transform(df)
File "/usr/lib64/python2.7/site-packages/sklearn/base.py", line 494, in fit_transform
return self.fit(X, **fit_params).transform(X)
File "/usr/lib64/python2.7/site-packages/sklearn/feature_selection/variance_threshold.py", line 64, in fit
X = check_array(X, ('csr', 'csc'), dtype=np.float64)
File "/usr/lib64/python2.7/site-packages/sklearn/utils/validation.py", line 407, in check_array
_assert_all_finite(array)
File "/usr/lib64/python2.7/site-packages/sklearn/utils/validation.py", line 58, in _assert_all_finite
" or a value too large for %r." % X.dtype)
ValueError: Input contains NaN, infinity or a value too large for dtype('float64').
So I am not sure what to check; this does not seem like a missing-value issue, but I am also not able to tell which columns/values are causing the problem.
I've seen several threads here that all end up being a missing-value problem, but that does not seem to be the case here.
I solved this by casting my data to numeric. It appears that, although the error message mentions 'float64', my data was all objects, and objects do not work well with fit_transform.
Casting my data to float using
df = df.apply(lambda x: pd.to_numeric(x, errors='ignore'))
solved the issue.
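End to end, the fix might look like the sketch below; the tiny frame is hypothetical, standing in for the (14407, 2564) dataframe from the question:
import numpy as np
import pandas as pd
from sklearn.feature_selection import VarianceThreshold

# Hypothetical data: numbers stored as strings (object dtype), as in the question
df = pd.DataFrame({'a': ['1', '2', '3', '4'],
                   'b': ['0.5', 'null', '1.5', '2.5']})

df.replace('null', np.nan, inplace=True)
df = df.apply(lambda x: pd.to_numeric(x, errors='ignore'))  # object -> float64
df.fillna(value=df.median(), inplace=True)

selector = VarianceThreshold(threshold=.95 * (1 - .95))
X = selector.fit_transform(df)  # no ValueError once the dtypes are numeric
print(X.shape)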
