Pandas: Get mean of different rows when columns are equal - python

I'm trying to find the mean of values in different rows, grouped by similarities in other columns. Example:
In [14]: pd.DataFrame({'col1': [1, 2, 1, 2], 'col2': ['A', 'C', 'A', 'B'], 'col3': [1, 5, 6, 9]})
Out[14]:
   col1 col2  col3
0     1    A     1
1     2    C     5
2     1    A     6
3     2    B     9
What I would like is to add a column with the mean of col3 over all rows where the combination of col1 and col2 matches. Desired output:
Out[14]:
   col1 col2  col3  mean
0     1    A     1   3.5
1     2    C     5   5
2     1    A     6   3.5
3     2    B     9   9
I have tried several things with groupby in combination with apply but couldn't get proper results.

It's a transform, my man:
df['mean'] = df.groupby(['col1','col2']).col3.transform('mean')
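For completeness, a minimal runnable sketch of the above, using the frame from the question:

import pandas as pd

df = pd.DataFrame({'col1': [1, 2, 1, 2],
                   'col2': ['A', 'C', 'A', 'B'],
                   'col3': [1, 5, 6, 9]})

# transform broadcasts each group's mean back onto the original rows,
# so the result aligns with df's index and can be assigned directly.
df['mean'] = df.groupby(['col1', 'col2']).col3.transform('mean')
print(df)
#    col1 col2  col3  mean
# 0     1    A     1   3.5
# 1     2    C     5   5.0
# 2     1    A     6   3.5
# 3     2    B     9   9.0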

How to remove duplicate rows with a condition in pandas

I.e. I want to drop duplicate pairs, using col1 and col2 as the subset, but only if the values in col3 are opposites (one negative and one positive). This is similar to the drop_duplicates function, but I want to impose a condition and only remove the first pair (i.e. if there are 3 duplicates, just remove 2 and leave 1).
my dataset (df):
   col1  col2  col3
0     1     1     1
1     2     2     2
2     1     1     1
3     3     5     7
4     1     2    -1
5     1     2     1
6     1     2     1
I want:
   col1  col2  col3
0     1     1     1
1     2     2     2
2     1     1     1
3     3     5     7
6     1     2     1
Rows 4 and 5 are duplicated in col1 and col2 but their col3 values are opposites, therefore we remove both. Rows 0 and 2 have duplicate values in col1 and col2 but the same col3, so we don't remove those rows.
I've tried using drop_duplicates but realised it wouldn't work, as it only removes all duplicates and doesn't consider anything else.
We can do it with transform:
out = df[df.groupby(['col1','col2']).col3.transform('sum').ne(0) & df.col3.ne(0)]
Out[252]:
   col1  col2  col3
0     1     1     1
1     2     2     2
2     1     1     1
3     3     5     7
Recreating the dataset:
import pandas as pd

data = [
    [1, 1, 1],
    [2, 2, 2],
    [1, 1, 1],
    [3, 5, 7],
    [1, 2, -1],
    [1, 2, 1],
    [1, 2, 1],
]
df = pd.DataFrame(data, columns=['col1', 'col2', 'col3'])
If your data is not massive, you can use an iterrows loop on a subset of the data. The subset contains all rows that are duplicated once every value has been turned into its absolute value. Next, we check whether col3 is negative and whether the sign-flipped row is in the duplicate subset. If so, we drop the row from df.
df_dupes = df[df.abs().duplicated(keep=False)]
df_dupes_list = df_dupes.to_numpy().tolist()

for i, row in df_dupes.iterrows():
    if row.col3 < 0 and [row.col1, row.col2, -row.col3] in df_dupes_list:
        df.drop(labels=i, axis=0, inplace=True)
This code should remove row 4. In your desired output, however, you also left out row 5. If you can explain why you dropped row 5 but kept row 0, I can adjust my code to match your desired output more accurately.
I used @Petar Luketina's code here with an adjustment and it worked. However, I would like to use it on a massive dataset (1 million rows and 43 columns), and this code takes forever:
import numpy as np

df_dupes = df[df['col3'].abs().duplicated(keep=False)]
df_dupes_list = df_dupes.to_numpy().tolist()

for i, row in df_dupes.iterrows():
    if row.col3 < 0 and [row.col1, row.col2, -row.col3] in df_dupes_list:
        print(row.col3)
        try:
            c = np.where((df['col1'] == row.col1) & (df['col2'] == row.col2) &
                         (df['col3'] == -row.col3))[0][0]
            df.drop(labels=[i, df.index.values[c]], axis=0, inplace=True)
        except (IndexError, KeyError):
            # no matching opposite row, or it was already dropped
            pass
I know this is an old question, but for those who are interested, here is an alternative that avoids iterating over the rows.
First, use a flag to identify the pair of rows to be removed (a row plus the next row, when col1 and col2 match and the col3 values are negatives of each other):
df.loc[(df.col1 == df.col1.shift(1)) & (df.col2 == df.col2.shift(1)) & (df.col3 == -df.col3.shift(1)), 'removeFlag'] = True
df.loc[df.removeFlag.shift(-1) == True, 'removeFlag'] = True
   col1  col2  col3 removeFlag
0     1     1     1        NaN
1     2     2     2        NaN
2     1     1     1        NaN
3     3     5     7        NaN
4     1     2    -1       True
5     1     2     1       True
6     1     2     1        NaN
Then use this flag to delete the offending rows:
df = df[~(df.removeFlag == True)]
df.drop(columns=['removeFlag'], inplace=True)
   col1  col2  col3
0     1     1     1
1     2     2     2
2     1     1     1
3     3     5     7
6     1     2     1
This approach would probably need a little more refinement if row 6 had been the same as row 4 (i.e. the first half of a repeated identical pair), but you get the idea.
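If you need that refinement, here is a vectorized sketch (my own adaptation, not from the answers above) that pairs the k-th negative with the k-th positive of the same magnitude within each (col1, col2) group, so repeated identical pairs and odd-sized groups are handled without iterating:

import pandas as pd

df = pd.DataFrame(
    [[1, 1, 1], [2, 2, 2], [1, 1, 1], [3, 5, 7],
     [1, 2, -1], [1, 2, 1], [1, 2, 1]],
    columns=['col1', 'col2', 'col3'],
)

mag = df['col3'].abs()   # magnitude of col3
neg = df['col3'].lt(0)   # sign bucket (negative vs. non-negative)

# Size of each (col1, col2, magnitude, sign) bucket and of the whole
# (col1, col2, magnitude) group; their difference is the opposite bucket.
own = df.groupby(['col1', 'col2', mag, neg])['col3'].transform('size')
total = df.groupby(['col1', 'col2', mag])['col3'].transform('size')
opposite = total - own

# 0-based position of each row within its own bucket: the k-th row of a
# bucket is cancelled as long as the opposite bucket also has a k-th row.
rank = df.groupby(['col1', 'col2', mag, neg]).cumcount()

out = df[rank >= opposite]
print(out)
#    col1  col2  col3
# 0     1     1     1
# 1     2     2     2
# 2     1     1     1
# 3     3     5     7
# 6     1     2     1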

Fetch row data from known index in pandas

df1:
  col1  col2
0    a     5
1    b     2
2    c     1

df2:
  col1
0  qa0
1  qa1
2  qa2
3  qa3
4  qa4
5  qa5

final output:
  col1  col2 col3
0    a     5  qa5
1    b     2  qa2
2    c     1  qa1
Basically, col2 of df1 stores index values into another dataframe. I have to fetch the corresponding data from df2 and append it to df1, but I don't know how to fetch data by index number.
Use Series.map by another Series:
df1['col3'] = df1['col2'].map(df2['col1'])
Or use DataFrame.join after renaming the column:
df1 = df1.join(df2.rename(columns={'col1':'col3'})['col3'], on='col2')
print (df1)
  col1  col2 col3
0    a     5  qa5
1    b     2  qa2
2    c     1  qa1
You can use iloc to get the data and then to_numpy for the values:
df1["col3"] = df2.iloc[df1.col2].to_numpy()
df1
  col1  col2 col3
0    a     5  qa5
1    b     2  qa2
2    c     1  qa1
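A self-contained sketch tying the two frames together (built to match the question; note that Series.map matches on df2's index labels while iloc is purely positional, and the two coincide here because df2 has the default RangeIndex):

import pandas as pd

df1 = pd.DataFrame({'col1': ['a', 'b', 'c'], 'col2': [5, 2, 1]})
df2 = pd.DataFrame({'col1': [f'qa{i}' for i in range(6)]})

# Look each value of df1.col2 up in df2.col1 by index label.
df1['col3'] = df1['col2'].map(df2['col1'])
print(df1)
#   col1  col2 col3
# 0    a     5  qa5
# 1    b     2  qa2
# 2    c     1  qa1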

How to multiply numeric elements of a dataframe by a scalar? [duplicate]

How to multiply all the numeric values in the data frame by a constant without having to specify column names explicitly? Example:
In [13]: df = pd.DataFrame({'col1': ['A', 'B', 'C'], 'col2': [1, 2, 3], 'col3': [30, 10, 20]})
In [14]: df
Out[14]:
  col1  col2  col3
0    A     1    30
1    B     2    10
2    C     3    20
I tried df.multiply but it affects the string values as well by concatenating them several times.
In [15]: df.multiply(3)
Out[15]:
  col1  col2  col3
0  AAA     3    90
1  BBB     6    30
2  CCC     9    60
Is there a way to preserve the string values intact while multiplying only the numeric values by a constant?
You can use select_dtypes(), either including the number dtype or excluding all columns of object and datetime64 dtypes.
Demo:
In [162]: df
Out[162]:
  col1  col2  col3       date
0    A     1    30 2016-01-01
1    B     2    10 2016-01-02
2    C     3    20 2016-01-03

In [163]: df.dtypes
Out[163]:
col1            object
col2             int64
col3             int64
date    datetime64[ns]
dtype: object

In [164]: df.select_dtypes(exclude=['object', 'datetime']) * 3
Out[164]:
   col2  col3
0     3    90
1     6    30
2     9    60
or a much better solution (c) ayhan:
df[df.select_dtypes(include=['number']).columns] *= 3
From docs:
To select all numeric types use the numpy dtype numpy.number
The other answer specifies how to multiply only numeric columns. Here's how to update it:
import numpy as np
import pandas as pd

df = pd.DataFrame({'col1': ['A', 'B', 'C'], 'col2': [1, 2, 3], 'col3': [30, 10, 20]})
s = df.select_dtypes(include=[np.number]) * 3
df[s.columns] = s
print(df)
  col1  col2  col3
0    A     3    90
1    B     6    30
2    C     9    60
One way would be to get the dtypes, match them against object and datetime dtypes and exclude them with a mask, like so (.loc replaces the old .ix indexer, which has been removed from pandas):
df.loc[:, ~np.in1d(df.dtypes, ['object', 'datetime'])] *= 3
Sample run:
In [273]: df
Out[273]:
  col1  col2  col3
0    A     1    30
1    B     2    10
2    C     3    20

In [274]: df.loc[:, ~np.in1d(df.dtypes, ['object', 'datetime'])] *= 3

In [275]: df
Out[275]:
  col1  col2  col3
0    A     3    90
1    B     6    30
2    C     9    60
This should work even over mixed types within columns but is likely slow over large dataframes.
def mul(x, y):
    # Multiply when x can be coerced to a number; otherwise return x unchanged.
    try:
        return pd.to_numeric(x) * y
    except (ValueError, TypeError):
        return x

df.applymap(lambda x: mul(x, 3))
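For instance, on a hypothetical frame whose col2 genuinely mixes numbers and strings, only the coercible cells get scaled:

mixed = pd.DataFrame({'col1': ['A', 'B'], 'col2': [1, 'x']})
print(mixed.applymap(lambda x: mul(x, 3)))
#   col1 col2
# 0    A    3
# 1    B    x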
A simple solution using assign() and select_dtypes():
df.assign(**(df.select_dtypes('number') * 3))
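Note that assign returns a new frame rather than modifying df in place, so to keep the result you would bind it back:

df = df.assign(**(df.select_dtypes('number') * 3))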

how to iterate in pandas dataframe columns

I need to do some operations with my dataframe. My dataframe is:
df = pd.DataFrame(data={'col1': [1, 2], 'col2': [3, 4]})

   col1  col2
0     1     3
1     2     4
My operation is column dependent. For example, I need to add (+) the .max() of a column to each value in that column. So df.col1.max() is 2 and df.col2.max() is 4, and my output should be:

   col1  col2
0     3     7
1     4     8
I have been trying this:
for i in df.columns:
    df.i += df.i.max()
but I get:
AttributeError: 'DataFrame' object has no attribute 'i'
You can combine df.add and df.max and specify the axis, which avoids any loops:
df1 = df.add(df.max(axis=0))
print(df1)
   col1  col2
0     3     7
1     4     8
To loop through the columns and add the maximum of each column you can do the following:
for col in df:
    df[col] += df[col].max()
This gives
   col1  col2
0     3     7
1     4     8

Pandas pivoting and adding a column from CSV with consecutive rows

I have consecutive duplicate rows in two columns. I want to delete the second row of each duplicate pair, based on [col1, col2], and move its value from another column into a new one.
Example:
Input
col1  col2  col3
   X     A     1
   X     A     2
   Y     A     3
   Y     A     4
   X     B     5
   X     B     6
   Z     C     7
   Z     C     8
Output
col1  col2  col3  col4
   X     A     1     2
   Y     A     3     4
   X     B     5     6
   Z     C     7     8
I found out about pivoting, but I am struggling to understand how to add another column and avoid the extra index level; I would like to preserve everything as written in the example.
This is similar to Question 10 here:
(df.assign(col=df.groupby(['col1', 'col2']).cumcount())
   .pivot_table(index=['col1', 'col2'], columns='col', values='col3')
   .reset_index()
)
Output:
col col1 col2  0  1
0      X    A  1  2
1      X    B  5  6
2      Y    A  3  4
3      Z    C  7  8
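To get the exact column labels from the desired output, a small follow-up (the rename step is my addition, not part of the original answer; pivot_table's default mean aggregation makes the values floats):

import pandas as pd

df = pd.DataFrame({'col1': list('XXYYXXZZ'),
                   'col2': list('AAAABBCC'),
                   'col3': range(1, 9)})

out = (df.assign(col=df.groupby(['col1', 'col2']).cumcount())
         .pivot_table(index=['col1', 'col2'], columns='col', values='col3')
         .reset_index()
         .rename(columns={0: 'col3', 1: 'col4'})  # relabel the pivoted 0/1 columns
         .rename_axis(None, axis=1))              # drop the leftover 'col' axis name
print(out)
#   col1 col2  col3  col4
# 0    X    A   1.0   2.0
# 1    X    B   5.0   6.0
# 2    Y    A   3.0   4.0
# 3    Z    C   7.0   8.0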
