replace with multiple conditions not updating in pandas - python

I am trying to replace a value based on the row index, and only for certain columns in a dataframe.
For columns b and c, I want to replace the value 1 with np.nan, for rows 1, 2 and 3:
df = pd.DataFrame(data={'a': ['"dog", "cat"', '"dog"', '"mouse"', '"mouse", "cat", "bird"', '"circle", "square"', '"circle"', '"triangle", "square"', '"circle"'],
                        'b': [1, 1, 3, 4, 5, 1, 2, 3],
                        'c': [3, 4, 1, 3, 2, 1, 0, 0],
                        'd': ['a', 'a', 'b', 'c', 'b', 'c', 'd', 'e'],
                        'id': ['group1', 'group1', 'group1', 'group1', 'group2', 'group2', 'group2', 'group2']})
I am using the following line, but it's not updating in place, and if I try assigning it, it returns only the subset of amended rows rather than an updated version of the original dataframe.
df[df.index.isin([1,2,3])][['b','c']].replace(1, np.nan, inplace=True)

The chained indexing df[...][['b','c']] returns a copy, so the inplace replace never touches the original dataframe. You could do it like this instead:
df.loc[1:3, ['b', 'c']] = df.loc[1:3, ['b', 'c']].replace(1, np.nan)
Output:
>>> df
a b c d id
0 "dog", "cat" 1.0 3.0 a group1
1 "dog" NaN 4.0 a group1
2 "mouse" 3.0 NaN b group1
3 "mouse", "cat", "bird" 4.0 3.0 c group1
4 "circle", "square" 5.0 2.0 b group2
5 "circle" 1.0 1.0 c group2
6 "triangle", "square" 2.0 0.0 d group2
7 "circle" 3.0 0.0 e group2
A more dynamic version:
cols = ['b', 'c']
rows = slice(1, 3) # or [1, 2, 3] if you want
df.loc[rows, cols] = df.loc[rows, cols].replace(1, np.nan)
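If the target rows are not a contiguous range, a boolean mask works as well. A minimal sketch, assuming the same df and imports as above:
mask = df.index.isin([1, 2, 3])  # rows need not be contiguous
df.loc[mask, ['b', 'c']] = df.loc[mask, ['b', 'c']].replace(1, np.nan)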

Related

How to get average between first row and current row per each group in data frame?

I have a data frame like this:
id  value
a   2
a   4
a   3
a   5
b   1
b   4
b   3
c   1
c   nan
c   5
The resulting data frame contains a new column 'average', whose values are computed as follows:
group by id
the first row of 'average' in each group is equal to its corresponding value in 'value'
every other row of 'average' in the group is equal to the mean of all previous rows of 'value' (excluding the current value)
The resulting data frame must be:
id  value  average
a   2      2
a   4      2
a   3      3
a   5      3
b   1      1
b   4      1
b   3      2.5
c   1      1
c   nan    1
c   5      1
You can group the dataframe by id, calculate the expanding mean of the value column for each group, shift it by one row, and assign it back to the original dataframe. Once you have that, you just need to ffill on axis=1 across the value and average columns, so the first row of each group picks up its own value:
out = df.assign(
    average=df.groupby('id')['value']
              .transform(lambda x: x.expanding().mean().shift(1))
)
out[['value', 'average']] = out[['value', 'average']].ffill(axis=1)
Output:
id value average
0 a 2.0 2.0
1 a 4.0 2.0
2 a 3.0 3.0
3 a 5.0 3.0
4 b 1.0 1.0
5 b 4.0 1.0
6 b 3.0 2.5
7 c 1.0 1.0
8 c NaN 1.0
9 c 5.0 1.0
Here is a solution which, I think, satisfies the requirements. The first row in each group of ids simply passes its value to the average column; for every other row, we take the average of the rows whose index is smaller than the current index.
You may want to specify how you want to handle the NaN values. In the code below they are simply skipped, since pandas' mean ignores NaN by default.
import numpy as np
import pandas as pd

df = pd.DataFrame([
    ['a', 2],
    ['a', 4],
    ['a', 3],
    ['a', 5],
    ['b', 1],
    ['b', 4],
    ['b', 3],
    ['c', 1],
    ['c', np.nan],
    ['c', 5]
], columns=['id', 'value'])

id_groups = df.groupby('id')
id_level_frames = []
for group, frame in id_groups:
    # Reset the index for each id-level frame
    frame = frame.reset_index(drop=True)
    for index, row in frame.iterrows():
        # If this is the first row, pass the value straight through
        if index == 0:
            frame.at[index, 'average'] = row['value']
        else:
            # Mean of all earlier rows; pandas' mean skips NaN by default
            earlier_rows = frame[frame.index < index]
            frame.at[index, 'average'] = earlier_rows['value'].mean()
    id_level_frames.append(frame)
final_df = pd.concat(id_level_frames, ignore_index=True)
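A quick sanity check, assuming the loop above has run:
print(final_df)
This reproduces the desired 'average' column: 2, 2, 3, 3, 1, 1, 2.5, 1, 1, 1.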

How to aggregate and groupby in pandas

I have the following df. I'd like to group by customer and then count and sum
at the same time. I also wish to add conditional grouping.
Is there any way to achieve this?
customer product score
A a 1
A b 2
A c 3
B a 4
B a 5
B b 6
My desired result is the following:
customer count sum count(product =a) sum(product=a)
A 3 6 1 1
B 3 15 2 9
My work so far is like this:
grouped=df.groupby('customer')
grouped.agg({"product":"count","score":"sum"})
Thanks
Let us try crosstab:
s = pd.crosstab(df['customer'], df['product'], df['score'], margins=True, aggfunc=['sum', 'count']).drop('All')
Out[76]:
sum count
product a b c All a b c All
customer
A 1.0 2.0 3.0 6 1.0 1.0 1.0 3
B 9.0 6.0 NaN 15 2.0 1.0 NaN 3
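To pull the four desired columns out of s, here is a sketch, assuming the crosstab above; its MultiIndex columns are addressed as ('aggfunc', 'product') tuples, and the column names are just illustrative:
result = pd.DataFrame({'count': s[('count', 'All')],
                       'sum': s[('sum', 'All')],
                       'count(product=a)': s[('count', 'a')],
                       'sum(product=a)': s[('sum', 'a')]})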
import pandas as pd
df = pd.DataFrame({'customer': ['A', 'A', 'A', 'B', 'B', 'B'], 'product': ['a', 'b', 'c', 'a', 'a', 'b'], 'score': [1, 2, 3, 4, 5, 6]})
df = df[df['product'] == 'a']
grouped = df.groupby('customer')
grouped = grouped.agg({"product": "count", "score": "sum"}).reset_index()
print(grouped)
Output:
customer product score
0 A 1 1
1 B 2 9
Then merge this dataframe with the unfiltered grouped dataframe; a sketch of that step follows.
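A sketch of the full combination, assuming the original unfiltered df is kept around (the merge suffix is just illustrative; columns can be renamed afterwards to match the desired headers):
import pandas as pd
df = pd.DataFrame({'customer': ['A', 'A', 'A', 'B', 'B', 'B'], 'product': ['a', 'b', 'c', 'a', 'a', 'b'], 'score': [1, 2, 3, 4, 5, 6]})
# Aggregate over all rows
overall = df.groupby('customer').agg({"product": "count", "score": "sum"}).reset_index()
# Aggregate over product == 'a' only
only_a = df[df['product'] == 'a'].groupby('customer').agg({"product": "count", "score": "sum"}).reset_index()
# Merge, marking the conditional columns with a suffix
result = overall.merge(only_a, on='customer', suffixes=('', '(product=a)'))
print(result)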

Divide each element by 2 and it should ignore "String" values

Divide each element by 2; it should ignore "string" values. The end result should be a pandas DataFrame.
df=pd.DataFrame({'a':[3,6,9], 'b':[2,4,6], 'c':[1,2,3]})
print(df)
You can do the following (I added a column with string values for demonstration):
df = pd.DataFrame({'a': [3, 6, 9], 'b': [2, 4, 6], 'c': [1, 2, 3], 'd': ['a', 'b', 'c']})
for i in df.columns:
    try:
        df[i] = df[i] / 2
    except TypeError:
        pass  # leave non-numeric columns untouched
print(df)
This gives:
a b c d
0 1.5 1.0 0.5 a
1 3.0 2.0 1.0 b
2 4.5 3.0 1.5 c
If you have columns with mixed integer and string types:
df = pd.DataFrame({'a': [3, 6, 9, 'a'], 'b': [2, 4, 6, 8], 'c': [1, 2, 3, 4], 'd': ['a', 'b', 'c', 'd']})
for i in df.columns:
    try:
        df[i] = df[i] / 2
    except TypeError:
        # Fall back to element-wise handling for mixed columns
        df[i] = df[i].astype(str)
        df[i] = [int(s) / 2 if s.isdigit() else s for s in df[i].values]
print(df)
which gives:
a b c d
0 1.5 1.0 0.5 a
1 3 2.0 1.0 b
2 4.5 3.0 1.5 c
3 a 0.5 0.5 d
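An alternative sketch that avoids the try/except entirely by testing each cell's type (applymap visits every cell; it was renamed to DataFrame.map in pandas 2.1):
import numbers

# Divide only numeric cells; strings and anything else non-numeric pass through
df = df.applymap(lambda x: x / 2 if isinstance(x, numbers.Number) else x)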

Python: How to replace missing values column wise by median

I have a dataframe as follows
df = pd.DataFrame({'A': [1, 2, 3], 'B': [1.45, 2.33, np.nan], 'C': [4, 5, 6], 'D': [4.55, 7.36, np.nan]})
I want to replace the missing values, i.e. np.nan, in a generic way. For this I have created a function as follows:
def treat_mis_value_nu(df):
    df_nu = df.select_dtypes(include=['number'])
    lst_null_col = df_nu.columns[df_nu.isnull().any()].tolist()
    if len(lst_null_col) > 0:
        for i in lst_null_col:
            if df_nu[i].isnull().sum()/len(df_nu[i]) > 0.10:
                df_final_nu = df_nu.drop([i], axis=1)
            else:
                df_final_nu = df_nu[i].fillna(df_nu[i].median(), inplace=True)
    return df_final_nu
When I apply this function as follows
df_final = treat_mis_value_nu(df)
I am getting a dataframe as follows
A B C
0 1 1.0 4
1 2 2.0 5
2 3 NaN 6
So it has actually removed column D correctly, but failed to remove column B.
I know there have been discussions on this topic in the past (here). Still, am I missing something?
Use:
df = pd.DataFrame({'A': [1, 2, 3, 5, 7], 'B': [1.45, 2.33, np.nan, np.nan, np.nan],
                   'C': [4, 5, 6, 8, 7], 'D': [4.55, 7.36, np.nan, 9, 10],
                   'E': list('abcde')})
print (df)
A B C D E
0 1 1.45 4 4.55 a
1 2 2.33 5 7.36 b
2 3 NaN 6 NaN c
3 5 NaN 8 9.00 d
4 7 NaN 7 10.00 e
def treat_mis_value_nu(df):
    # get only numeric columns
    df_nu = df.select_dtypes(include=['number'])
    # keep only the columns that contain NaNs
    df_nu = df_nu.loc[:, df_nu.isnull().any()]
    # columns to remove: isnull().mean() is the fraction of missing values
    # (the same as sum/len)
    cols_to_drop = df_nu.columns[df_nu.isnull().mean() > 0.30]
    # fill the remaining missing values with the median, and drop the
    # columns above the threshold
    return df.fillna(df_nu.median()).drop(cols_to_drop, axis=1)
print (treat_mis_value_nu(df))
A C D E
0 1 4 4.55 a
1 2 5 7.36 b
2 3 6 8.18 c
3 5 8 9.00 d
4 7 7 10.00 e
I would recommend looking at the scikit-learn SimpleImputer transformer (the old sklearn.preprocessing.Imputer has since been removed). I don't think it can drop columns, but it can definitely fill them in a 'generic way' - for example, filling in missing values with the median of the relevant column.
You could use it as such:
from sklearn.impute import SimpleImputer

imputer = SimpleImputer(strategy='median')
num_df = df.values
names = df.columns.values
df_final = pd.DataFrame(imputer.fit_transform(num_df), columns=names)
If you have additional transformations you would like to make, you could consider making a transformation Pipeline, or could even make your own transformers to do bespoke tasks.
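A minimal sketch of such a Pipeline, assuming the question's all-numeric df and that you also want to scale the data afterwards (the scaler step is purely illustrative):
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

pipeline = Pipeline([
    ('imputer', SimpleImputer(strategy='median')),
    ('scaler', StandardScaler()),
])
df_final = pd.DataFrame(pipeline.fit_transform(df), columns=df.columns)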

Pandas create a data frame based on two other 'sub' frames

I have two Pandas data frames. df1 has columns ['a','b','c'] and df2 has columns ['a','c','d']. Now, I create a new data frame df3 with columns ['a','b','c','d'].
I want to fill df3 with all the inputs from df1 and df2. For example, if I have x rows in df1, and y rows in df2, then I will have x+y rows in df3.
Which Pandas function fills the new dataframe based on partial columns?
Example data:
df1 = pd.DataFrame({'a':[1, 2, 3], 'b':[2, 3, 4], 'd':['h', 'j', 'k']})
df2 = pd.DataFrame({'a':[5, 6, 7], 'b':[1, 1, 1], 'c':[2, 2, 2]})
Code:
df1.append(df2)
Out:
a b c d
0 1 2 NaN h
1 2 3 NaN j
2 3 4 NaN k
0 5 1 2.0 NaN
1 6 1 2.0 NaN
2 7 1 2.0 NaN
Use the append function of DataFrame: https://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.append.html
anotherFrame = df1.append(df2, ignore_index=True)
Another way is merge: https://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.merge.html
df1.merge(df2, how='outer')
How about:
df1 = pd.DataFrame({"a": [1,2], "b": [3,4], "c": [5,6]})
df2 = pd.DataFrame({"a": [7,8], "c": [9,10], "d": [11,12]})
df3 = df1.append(df2, sort=False)
df3
a b c d
0 1 3.0 5 NaN
1 2 4.0 6 NaN
0 7 NaN 9 11.0
1 8 NaN 10 12.0
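Note that DataFrame.append was deprecated in pandas 1.4 and removed in 2.0; pd.concat is the modern equivalent. A sketch with the frames above:
df3 = pd.concat([df1, df2], ignore_index=True, sort=False)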
