Pandas Error Matching String - python

I have data like the SampleDf data below. I'm trying to check values in one column of my dataframe to see if they contain 'Sum', 'Count', or 'Avg', and then create a new column with the value 'sum', 'count', or 'Avg'. When I run the code below on my real dataframe I get the error shown. When I run dtypes on my real dataframe it says all the columns are objects. The code below is based on the post below. Unfortunately I can't reproduce the error with the SampleDf I've provided, but I couldn't post my whole dataframe.
post:
Pandas and apply function to match a string
Code:
SampleDf = pd.DataFrame([['tom', "Avg(case when Value1 in ('Value2') and [DateType] in ('Value3') then LOS end)"],
                         ['bob', "isnull(Avg(case when XferToValue2 in (1) and DateType in ('Value3') and [Value1] in ('HM') then LOS end),0)"]],
                        columns=['ReportField', 'OtherField'])

search1 = 'Sum'
search2 = 'Count'
search3 = 'Avg'

def Agg_type(x):
    if search1 in x:
        return 'sum'
    elif search2 in x:
        return 'count'
    elif search3 in x:
        return 'Avg'
    else:
        return 'Other'

SampleDf['AggType'] = SampleDf['OtherField'].apply(Agg_type)
SampleDf.head()
Error:
---------------------------------------------------------------------------
TypeError Traceback (most recent call last)
<ipython-input-17-a2b4920246a7> in <module>()
17 return 'Other'
18
---> 19 SampleDf['AggType'] = SampleDf['OtherField'].apply(Agg_type)
20
21 #SampleDf.head()
C:\Users\Name\AppData\Local\Continuum\Anaconda3\lib\site-packages\pandas\core\series.py in apply(self, func, convert_dtype, args, **kwds)
2292 else:
2293 values = self.asobject
-> 2294 mapped = lib.map_infer(values, f, convert=convert_dtype)
2295
2296 if len(mapped) and isinstance(mapped[0], Series):
pandas\src\inference.pyx in pandas.lib.map_infer (pandas\lib.c:66124)()
<ipython-input-17-a2b4920246a7> in Agg_type(x)
8
9 def Agg_type(x):
---> 10 if search1 in x:
11 return 'sum'
12 elif search2 in x:
TypeError: argument of type 'float' is not iterable

You can try this:
SampleDf['new_col'] = np.where(SampleDf.OtherField.str.contains("Avg"), "Avg",
                      np.where(SampleDf.OtherField.str.contains("Count"), "Count",
                      np.where(SampleDf.OtherField.str.contains("Sum"), "Sum", "Nothing")))
Please note that this works properly only if a string never contains more than one of "Avg", "Count", and "Sum". If it can, let me know and I'll look for a better approach. Of course, if anything else doesn't suit your needs, report that back too.
Hope this was helpful.
explanation:
np.where finds the rows where "Avg" appears in the OtherField string and fills new_col with "Avg" at those rows. For the remaining rows (those without "Avg"), it looks for "Count" and does the same, and finally it does the same for "Sum".
documentation:
np.where
pandas.series.str.contains
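For the record, the original TypeError ("argument of type 'float' is not iterable") means the real column contains non-string values, typically NaN, which is a float. A minimal sketch of a guard that makes the apply approach work too, assuming NaNs should map to 'Other':

```python
import numpy as np
import pandas as pd

def agg_type(x):
    # NaN is a float, and `'Sum' in x` fails on it, so guard first
    if not isinstance(x, str):
        return 'Other'
    if 'Sum' in x:
        return 'sum'
    elif 'Count' in x:
        return 'count'
    elif 'Avg' in x:
        return 'Avg'
    return 'Other'

s = pd.Series(["Avg(case when Value1 in ('Value2') then LOS end)", np.nan, "Count(x)"])
print(s.apply(agg_type).tolist())  # ['Avg', 'Other', 'count']
```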

Related

Is there a way to automate data cleaning for pandas DataFrames?

I am cleaning my data for a machine learning project by replacing the missing values with zero and the column mean for the 'Age' and 'Fare' columns respectively. The code for this is given below:
train_data['Age'] = train_data['Age'].fillna(0)
mean = train_data['Fare'].mean()
train_data['Fare'] = train_data['Fare'].fillna(mean)
Since I would have to do this multiple times for other sets of data, I want to automate this process by creating a generic function that takes the DataFrame as input, performs the operations, and returns the modified DataFrame. The code for that is given below:
def data_cleaning(df):
    df['Age'] = df['Age'].fillna(0)
    fare_mean = df['Fare'].mean()
    df['Fare'] = df['Fare'].fillna()
    return df
However when I pass the training data DataFrame:
train_data = data_cleaning(train_data)
I get the following error:
/opt/conda/lib/python3.7/site-packages/ipykernel_launcher.py:2:
SettingWithCopyWarning:
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead
See the caveats in the documentation: https://pandas.pydata.org/pandas-
docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
---------------------------------------------------------------------------
ValueError Traceback (most recent call last)
/tmp/ipykernel_42/1440633985.py in <module>
1 #print(train_data)
----> 2 train_data = data_cleaning(train_data)
3 cross_val_data = data_cleaning(cross_val_data)
/tmp/ipykernel_42/3053068338.py in data_cleaning(df)
2 df['Age'] = df['Age'].fillna(0)
3 fare_mean = df['Fare'].mean()
----> 4 df['Fare'] = df['Fare'].fillna()
5 return df
/opt/conda/lib/python3.7/site-packages/pandas/util/_decorators.py in wrapper(*args,
**kwargs)
309 stacklevel=stacklevel,
310 )
--> 311 return func(*args, **kwargs)
312
313 return wrapper
/opt/conda/lib/python3.7/site-packages/pandas/core/series.py in fillna(self, value,
method, axis, inplace, limit, downcast)
4820 inplace=inplace,
4821 limit=limit,
-> 4822 downcast=downcast,
4823 )
4824
/opt/conda/lib/python3.7/site-packages/pandas/core/generic.py in fillna(self, value,
method, axis, inplace, limit, downcast)
6311 """
6312 inplace = validate_bool_kwarg(inplace, "inplace")
-> 6313 value, method = validate_fillna_kwargs(value, method)
6314
6315 self._consolidate_inplace()
/opt/conda/lib/python3.7/site-packages/pandas/util/_validators.py in
validate_fillna_kwargs(value, method, validate_scalar_dict_value)
368
369 if value is None and method is None:
--> 370 raise ValueError("Must specify a fill 'value' or 'method'.")
371 elif value is None and method is not None:
372 method = clean_fill_method(method)
ValueError: Must specify a fill 'value' or 'method'.
On some research, I found that I would have to use apply() and map() functions instead, but I am not sure how to pass in the mean value of the column. Furthermore, this does not scale well, as I would have to calculate all the fillna values before passing them into the function, which is cumbersome. Therefore I want to ask: is there a better way to automate data cleaning?
In this line in your function, df['Fare'] = df['Fare'].fillna(), you did not fill the NaNs with anything, so it raises an error. You should change it to df['Fare'] = df['Fare'].fillna(fare_mean).
If you intend to make this usable from another file in the same directory, you can just import it:
from file_that_contain_function import function_name
And if you intend to make it reusable across your workspace/virtual environment, you may need to create your own Python package.
So yes, the other answer explains where the error is coming from.
However, the warning at the beginning has nothing to do with filling NaNs. The warning is telling you that you are modifying a slice of a copy of your dataframe. Change your code to
def data_cleaning(df):
    df['Age'] = df.loc[:, 'Age'].fillna(0)
    fare_mean = df['Fare'].mean()
    df['Fare'] = df.loc[:, 'Fare'].fillna(fare_mean)  # <- and also fix this error
    return df
I suggest also searching that specific warning here, as there are hundreds of posts detailing this warning and how to deal with it. Here's a good one.
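Beyond fixing the bug, the function can be generalised so it scales to other datasets. A sketch, assuming a caller-supplied mapping from column name to fill strategy (the `strategies` argument is hypothetical, not part of the original code):

```python
import numpy as np
import pandas as pd

def data_cleaning(df, strategies):
    """Fill NaNs per column: pass a literal fill value, or 'mean'
    to use that column's mean."""
    df = df.copy()  # work on a copy to avoid SettingWithCopyWarning
    for col, strategy in strategies.items():
        fill = df[col].mean() if strategy == 'mean' else strategy
        df[col] = df[col].fillna(fill)
    return df

train_data = pd.DataFrame({'Age': [20.0, np.nan], 'Fare': [10.0, np.nan]})
cleaned = data_cleaning(train_data, {'Age': 0, 'Fare': 'mean'})
print(cleaned['Fare'].tolist())  # [10.0, 10.0]
```

This keeps the per-dataset specifics (which columns, which strategy) out of the function body, so the same helper works for the training, validation, and test frames.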

Need to strip CSV Column Number Data of Letters - Pandas

I am working on a .csv that has columns in which numerical data includes letters. I want to strip the letters so that the column can be a float or int.
I have tried the following:
using a loop and a function to strip the string data from object columns (the "MPG" column) and leave only numerical values.
It should print the names of the columns where there is at least one entry ending in the characters 'mpg'.
CODING IN JUPYTER NOTEBOOK CELLS:
Step 1:
MPG_cols = []
for colname in df.columns[df.dtypes == 'object']:
    if df[colname].str.endswith('mpg').any():
        MPG_cols.append(colname)
print(MPG_cols)
using .str so I can use an element-wise string method
only want to consider the string columns
THIS GIVES ME OUTPUT:
['Power']  # good so far
STEP 2:
#define the value to be removed using loop
def remove_mpg(pow_val):
"""For each value, take the number before the 'mpg'
unless it is not a string value. This will only happen
for NaNs so in that case we just return NaN.
"""
if isinstance(pow_val, str):
i=pow_val.replace('mpg', '')
return float(pow_val.split(' ')[0])
else:
return np.nan
position_cols = ['Vehicle_type']
for colname in MPG_cols:
df[colname] = df[colname].apply(remove_mpg)
df[Power_cols].head()
The Error I get:
ValueError Traceback (most recent call last)
<ipython-input-37-45b7f6d40dea> in <module>
15
16 for colname in MPG_cols:
---> 17 df[colname] = df[colname].apply(remove_mpg)
18
19 df[MPG_cols].head()
~/opt/anaconda3/lib/python3.8/site-packages/pandas/core/series.py in apply(self, func, convert_dtype, args, **kwds)
3846 else:
3847 values = self.astype(object).values
-> 3848 mapped = lib.map_infer(values, f, convert=convert_dtype)
3849
3850 if len(mapped) and isinstance(mapped[0], Series):
pandas/_libs/lib.pyx in pandas._libs.lib.map_infer()
<ipython-input-37-45b7f6d40dea> in remove_mpg(pow_val)
8 if isinstance(pow_val, str):
9 i=pow_val.replace('mpg', '')
---> 10 return float(pow_val.split(' ')[0])
11 else:
12 return np.nan
ValueError: could not convert string to float: 'null'
I applied similar code to a different column and it worked on that column, but not here.
Any guidance will be greatly appreciated.
Bests,
I think you would need to revisit the logic of the remove_mpg function. One way you can tweak it is as follows:
import re
import numpy as np

def get_me_float(pow_val):
    my_numbers = re.findall(r"(\d+.*\d+)mpg", pow_val)
    if len(my_numbers) > 0:
        return float(my_numbers[0])
    else:
        return np.nan
For example, to test the function:
my_pow_val = ['34mpg', '34.6mpg', '0mpg', 'mpg', 'anything']
for each_pow in my_pow_val:
    print(get_me_float(each_pow))
output:
34.0
34.6
nan
nan
nan
This will work:
import pandas as pd

pd.to_numeric(pd.Series(['$2', '3#', '1mpg']).str.replace('[^0-9]', '', regex=True))
0    2
1    3
2    1
dtype: int64
For a complete solution:
for i in range(df.shape[1]):
    if df.iloc[:, i].dtype == 'object':
        df.iloc[:, i] = pd.to_numeric(df.iloc[:, i].str.replace('[^0-9]', '', regex=True))
df.dtypes
To select columns that should not be changed:
for i in range(df.shape[1]):
    # 'colA', 'colB' are columns which should remain the same.
    if (df.iloc[:, i].dtype == 'object') and (df.columns[i] not in ['colA', 'colB']):
        df.iloc[:, i] = pd.to_numeric(df.iloc[:, i].str.replace('[^0-9]', '', regex=True))
df.dtypes
Why don't you use the converters parameter of the read_csv function to strip the extra characters when you load the csv file?
def strip_mpg(s):
    return float(s.rstrip(' mpg'))

df = pd.read_csv(..., converters={'Power': strip_mpg}, ...)
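Another option for the specific ValueError (the literal string 'null' in the data): `pd.to_numeric` with `errors='coerce'` turns anything unparseable into NaN instead of raising. A sketch on toy data, not tested against the asker's file:

```python
import pandas as pd

s = pd.Series(['34 mpg', '34.6 mpg', 'null', None])
# strip the unit, then coerce anything non-numeric (e.g. 'null') to NaN
result = pd.to_numeric(s.str.replace('mpg', '').str.strip(), errors='coerce')
print(result.tolist())  # [34.0, 34.6, nan, nan]
```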

How to separate the pandas table into separate groups?

I am struggling to separate the data. I tried looking at the groupby function pandas has, but it doesn't seem to work. I don't understand what I am doing wrong.
data = pd.read_csv("path/file")
y = data['JIN-DIF']
y1 = data['JEX-DIF']
y2 = data['JEL-DIF']
y3 = data['D3E']
d = {'Induction': y, 'Exchange': y1, 'Dispersion': y3, 'Electrostatic': y2}
df = pd.DataFrame(d)
grouped_df2 = df.groupby('Exchange')
grouped_df2.filter(lambda x: x.Exchange > 0)
When I run this code, I get a "TypeError: filter function returned a Series, but expected a scalar bool" error. I'm not sure how to upload the data, so I have just attached a picture of it.
It will work when I change line 9 to
grouped_df2.filter(lambda x: x.Exchange.mean() > 0)
Here is a picture of sample data
The error message
---------------------------------------------------------------------------
TypeError Traceback (most recent call last)
<ipython-input-71-0e26eb8f080b> in <module>
7 df=pd.DataFrame(d)
8 grouped_df2= df.groupby('Exchange')
----> 9 grouped_df2.filter(lambda x: x.Exchange > -0.1)
~/anaconda3/lib/python3.6/site-packages/pandas/core/groupby/generic.py in filter(self, func, dropna, *args, **kwargs)
1584 # non scalars aren't allowed
1585 raise TypeError(
-> 1586 f"filter function returned a {type(res).__name__}, "
1587 "but expected a scalar bool"
1588 )
TypeError: filter function returned a Series, but expected a scalar bool
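If the intent is just to keep the rows where Exchange is positive, no groupby is needed at all; a plain boolean mask does it (a sketch on made-up numbers):

```python
import pandas as pd

df = pd.DataFrame({'Induction': [1.0, 2.0, 3.0],
                   'Exchange': [-0.5, 0.2, 0.7]})
# groupby.filter expects one scalar bool per group;
# a boolean mask filters row by row instead
positive = df[df['Exchange'] > 0]
print(positive['Exchange'].tolist())  # [0.2, 0.7]
```

That is why `.mean() > 0` works in the filter: it collapses each group's Exchange values to a single scalar, which is what filter requires.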

Find which row in dataframe causes groupby or transform to fail

I have a dataframe of shape (2061, 5) and the following line:
df[6] = df.groupby(df.index)[6].transform(lambda x: ' '.join(x))
..causes the following error:
---------------------------------------------------------------------------
TypeError Traceback (most recent call last)
<ipython-input-19-27721ddd8064> in <module>
----> 1 df.groupby(df.index)[6].transform(lambda x: ' '.join(x))
~/.pyenv/versions/miniconda3-latest/lib/python3.7/site-packages/pandas/core/groupby/generic.py in transform(self, func, *args, **kwargs)
463
464 if not isinstance(func, str):
--> 465 return self._transform_general(func, *args, **kwargs)
466
467 elif func not in base.transform_kernel_whitelist:
~/.pyenv/versions/miniconda3-latest/lib/python3.7/site-packages/pandas/core/groupby/generic.py in _transform_general(self, func, *args, **kwargs)
487 for name, group in self:
488 object.__setattr__(group, "name", name)
--> 489 res = func(group, *args, **kwargs)
490
491 if isinstance(res, (ABCDataFrame, ABCSeries)):
<ipython-input-19-27721ddd8064> in <lambda>(x)
----> 1 df.groupby(df.index)[6].transform(lambda x: ' '.join(x))
TypeError: sequence item 0: expected str instance, float found
I developed that code on a subset of the dataframe and it seemed to be doing exactly what I wanted to the data. So now if I for example do this:
df = df.head(50)
..and run the code, the error message goes away again.
I think a type cast is happening somewhere, but at one of the rows it ends up as something else. How can I efficiently find which row in the df is causing this, without manually reading through the whole two-thousand-row column or doing trial and error with .head() of different sizes?
EDIT: Mask the column in question to keep only the rows where it holds a float value, then check the first index. I.e.:
mask = df['column_in_q'].apply(lambda x: type(x) == float)
# This returns a Boolean Series that can be used to keep only the True rows
float_df = df[mask]  # subset of df that meets the condition
print(float_df.head())
I think this is because the groupby method returns a GroupBy object, not a DataFrame. You have to specify an aggregation method, which you could then subset. That is:
df[6] = df.groupby(df.index).sum()[6].transform(lambda x: ' '.join(x))
See here for more: https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.groupby.html
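Putting the masking idea together on toy data: rows holding a float (i.e. NaN) in the column are exactly the ones that break ' '.join, and the mask's index labels point straight at them (a sketch):

```python
import numpy as np
import pandas as pd

# index [0, 0, 1, 1] mimics the groupby-on-index setup; NaN hides in group 1
df = pd.DataFrame({6: ['a', 'b', np.nan, 'c']}, index=[0, 0, 1, 1])
mask = df[6].apply(lambda x: isinstance(x, float))  # NaN is a float
bad_rows = df[mask]
print(bad_rows.index.tolist())  # [1]  <- the group that would fail
```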

Trying to make a new column in pandas dataframe by filtering another column using a if statement

Trying to make a column named loan_status_is_great on my pandas dataframe. It should contain the integer 1 if loan_status is "Current" or "Fully Paid." Else it should contain the integer 0.
I'm using https://resources.lendingclub.com/LoanStats_2018Q4.csv.zip as my dataset.
My problem code is:
def loan_great():
    if (df['loan_status']).any == 'Current' or (df['loan_status']).any == 'Fully Paid':
        return 1
    else:
        return 0

df['loan_status_is_great'] = df['loan_status'].apply(loan_great())
TypeError Traceback (most recent call last)
in ()
----> 1 df['loan_status_is_great']=df['loan_status'].apply(loan_great())
/usr/local/lib/python3.6/dist-packages/pandas/core/series.py in apply(self, func, convert_dtype, args, **kwds)
4043 else:
4044 values = self.astype(object).values
-> 4045 mapped = lib.map_infer(values, f, convert=convert_dtype)
4046
4047 if len(mapped) and isinstance(mapped[0], Series):
pandas/_libs/lib.pyx in pandas._libs.lib.map_infer()
TypeError: 'int' object is not callable
Let's try a different approach, using isin to create a boolean series and converting it to integer:
df['loan_status_is_great'] = df['loan_status'].isin(['Current', 'Fully Paid']).astype(int)
I find that the numpy where function is a good choice for these simple column creations while maintaining good speed. Something like the below should work (note the parentheses around each comparison, since | binds more tightly than ==):
import numpy as np

df['loan_status_is_great'] = np.where((df['loan_status'] == 'Current') |
                                      (df['loan_status'] == 'Fully Paid'),
                                      1,
                                      0)
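Either answer can be sanity-checked on a small frame (a sketch; the real LendingClub column has many more statuses than these):

```python
import pandas as pd

df = pd.DataFrame({'loan_status': ['Current', 'Late (31-120 days)', 'Fully Paid']})
df['loan_status_is_great'] = df['loan_status'].isin(['Current', 'Fully Paid']).astype(int)
print(df['loan_status_is_great'].tolist())  # [1, 0, 1]
```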
