Filling cells with conditional column means - python

Consider the following DataFrame:
import pandas as pd

df2 = pd.DataFrame({
    'VAR_1': [1, 1, 1, 3, 3],
    'GROUP': [1, 1, 1, 2, 2],
})
My goal is to create a separate column "GROUP_MEAN" which holds the arithmetic mean of column "VAR_1".
But it should always be conditional on the row's value in "GROUP".
GROUP VAR_1 GROUP_MEAN
0 1 1 Mean Value GROUP = 1
1 1 1 Mean Value GROUP = 1
2 1 1 Mean Value GROUP = 1
3 2 3 Mean Value GROUP = 2
4 2 3 Mean Value GROUP = 2
I can easily access the overall mean:
df2['GROUP_MEAN'] = df2['VAR_1'].mean()
How do I go about making this conditional on another column's value?

I think this is a perfect use case for transform:
>>> df2 = pd.DataFrame({'VAR_1' : [1,2,3,4,5], 'GROUP': [1,1,1,2,2]})
>>> df2["GROUP_MEAN"] = df2.groupby('GROUP')['VAR_1'].transform('mean')
>>> df2
GROUP VAR_1 GROUP_MEAN
0 1 1 2.0
1 1 2 2.0
2 1 3 2.0
3 2 4 4.5
4 2 5 4.5
[5 rows x 3 columns]
Typically you use transform when you want to broadcast the result across all entries of the group.
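As a side note, the difference between aggregation and transform is easiest to see side by side; here is a minimal sketch on the same small frame (nothing beyond what the answer already uses):
import pandas as pd

df2 = pd.DataFrame({'VAR_1': [1, 2, 3, 4, 5], 'GROUP': [1, 1, 1, 2, 2]})

# mean() after groupby collapses each group to a single row (index = GROUP)
per_group = df2.groupby('GROUP')['VAR_1'].mean()                # 1 -> 2.0, 2 -> 4.5

# transform('mean') keeps the original index and repeats each group's mean
broadcast = df2.groupby('GROUP')['VAR_1'].transform('mean')     # length 5, aligned with df2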

Assuming that the actual DataFrame has columns in addition to VAR_1:
import numpy as np

ts = df2.groupby('GROUP')['VAR_1'].aggregate(np.mean)
df2['GROUP_MEAN'] = ts[df2.GROUP].values
Alternatively, the last line could also be:
df2 = df2.join(ts, on='GROUP', rsuffix='_MEAN')
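A third spelling of the same idea, sketched here as an alternative (not part of the original answer), is Series.map, which looks each row's GROUP up in ts without needing .values or a join:
# ts is the per-group mean Series computed above, indexed by GROUP
df2['GROUP_MEAN'] = df2['GROUP'].map(ts)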

Related

Count number of occurrences in DataFrame per column

I have a sample DataFrame where all the numbers are user IDs:
from  to
   1   3
   1   2
   2   3
How do I count the number of occurrences for each column, sum them up based on the same value, and display the result in the following format in a new DataFrame?
UserID  Occurences
     1           2
     2           2
     3           2
Thank you.
IIUC, you can stack then value_counts
out = (df.stack()
         .value_counts()
         .to_frame('Occurences')
         .rename_axis('UserID')
         .reset_index())
print(out)
UserID Occurences
0 1 2
1 2 2
2 3 2
Use DataFrame.melt with GroupBy.size:
df = df.melt(value_name='UserID').groupby('UserID').size().reset_index(name='Occurences')
print (df)
UserID Occurences
0 1 2
1 2 2
2 3 2
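If you only need the counts, a NumPy-based sketch is another option (this is an assumption on my part, not one of the answers above):
import numpy as np
import pandas as pd

df = pd.DataFrame({'from': [1, 1, 2], 'to': [3, 2, 3]})

# np.unique flattens the 2D array, so both columns are counted in one pass
ids, counts = np.unique(df.to_numpy(), return_counts=True)
out = pd.DataFrame({'UserID': ids, 'Occurences': counts})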
The pd.Series.value_counts method may be used to count the instances of each user ID in the "from" and "to" columns, and pd.concat can be used to combine the results. At the end, create a DataFrame from the resulting Series using the reset_index method:
import pandas as pd

data_frame = pd.DataFrame({'from': [1, 1, 2], 'to': [3, 2, 3]})
occur = pd.concat([data_frame['from'].value_counts(), data_frame['to'].value_counts()])
result_df = occur.reset_index()
result_df.columns = ['UserID', 'occur']
result_df = result_df.groupby(['UserID'])['occur'].sum().reset_index()
UserID  occur
0 1 2
1 2 2
2 3 2
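The same idea can also be collapsed into a single grouped sum over the concatenated counts; a small sketch, assuming the same data_frame as above:
occur = pd.concat([data_frame['from'].value_counts(),
                   data_frame['to'].value_counts()])
result_df = (occur.groupby(level=0).sum()      # add up counts for the same user id
                  .rename_axis('UserID')
                  .reset_index(name='occur'))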

Pandas: return the occurrences of the most frequent value for each group (possibly without apply)

Let's assume the input dataset:
import pandas as pd

test1 = [[0, 7, 50], [0, 3, 51], [0, 3, 45], [1, 5, 50], [1, 0, 50], [2, 6, 50]]
df_test = pd.DataFrame(test1, columns=['A', 'B', 'C'])
that corresponds to:
A B C
0 0 7 50
1 0 3 51
2 0 3 45
3 1 5 50
4 1 0 50
5 2 6 50
I would like to obtain a dataset grouped by 'A', together with the most common value of 'B' in each group, and the number of occurrences of that value:
A most_freq freq
0 3 2
1 5 1
2 6 1
I can obtain the first 2 columns with:
grouped = df_test.groupby("A")
out_df = pd.DataFrame(index=grouped.groups.keys())
out_df['most_freq'] = df_test.groupby('A')['B'].apply(lambda x: x.value_counts().idxmax())
but I am having problems with the last column.
Also: is there a faster way that doesn't involve apply? This solution doesn't scale well with larger inputs (I also tried dask).
Thanks a lot!
Use SeriesGroupBy.value_counts, which sorts by count by default, then add DataFrame.drop_duplicates to keep the top value per group after Series.reset_index:
df = (df_test.groupby('A')['B']
             .value_counts()
             .rename_axis(['A', 'most_freq'])
             .reset_index(name='freq')
             .drop_duplicates('A'))
print (df)
A most_freq freq
0 0 3 2
2 1 0 1
4 2 6 1
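Since the question explicitly asks for a route without apply, one more sketch (my own assumption, not taken from the answer above) is pd.crosstab, which builds the group-by-value count table once and then reads the top value per row; note that ties (as in group A=1) may resolve differently than in the desired output:
# rows = groups of 'A', columns = values of 'B', cells = counts
ct = pd.crosstab(df_test['A'], df_test['B'])

out = pd.DataFrame({
    'most_freq': ct.idxmax(axis=1),  # B value with the highest count per group
    'freq': ct.max(axis=1),          # how often that value occurs
}).reset_index()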

How to combine numeric columns in pandas dataframe with NaN?

I have a dataframe with this format:
ID measurement_1 measurement_2
0 3 NaN
1 NaN 5
2 NaN 7
3 NaN NaN
I want to combine to:
ID measurement measurement_type
0 3 1
1 5 2
2 7 2
For each row there will be a value in either the measurement_1 or the measurement_2 column, not in both; the other column will be NaN.
In some rows both columns will be NaN.
I want to add a column for the measurement type (depending on which column has the value) and take the actual value out of both columns, and remove the rows that have NaN in both columns.
Is there an easy way of doing this?
Thanks!
Use DataFrame.stack to reshape the DataFrame, then reset_index, and use DataFrame.assign to build the measurement_type column by applying Series.str.split + Series.str[-1] to level_1:
df1 = (
    df.set_index('ID').stack().reset_index(name='measurement')
      .assign(measurement_type=lambda x: x.pop('level_1').str.split('_').str[-1])
)
Result:
print(df1)
ID measurement measurement_type
0 0 3.0 1
1 1 5.0 2
2 2 7.0 2
Maybe combine_first could help?
import numpy as np

df["measurement"] = df["measurement_1"].combine_first(df["measurement_2"])
df["measurement_type"] = np.where(df["measurement_1"].notnull(), 1, 2)
df.dropna(subset=["measurement"]).drop(["measurement_1", "measurement_2"], axis=1)
ID measurement measurement_type
0 0 3 1
1 1 5 2
2 2 7 2
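If there were more than two measurement columns, a variation on the same idea (an assumption, not part of the answer above) is to back-fill across the columns and take the first non-NaN value per row:
vals = df[["measurement_1", "measurement_2"]]
df["measurement"] = vals.bfill(axis=1).iloc[:, 0]              # first non-NaN per row
df["measurement_type"] = vals.notna().idxmax(axis=1).str[-1]   # suffix of the column that held it
df = df.dropna(subset=["measurement"]).drop(columns=vals.columns)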
Set a threshold and drop any row that has more than one NaN, then use df.assign to fillna() measurement_1 with measurement_2 and apply np.where to measurement_2:
df = df.dropna(thresh=2)
df = (df.assign(measurement=df.measurement_1.fillna(df.measurement_2),
                measurement_type=np.where(df.measurement_2.isna(), 1, 2))
        .drop(columns=['measurement_1', 'measurement_2']))
ID measurement measurement_type
0 0 3 1
1 1 5 2
2 2 7 2
You could use pandas melt:
(
    df.melt("ID", var_name="measurement_type", value_name="measurement")
      .dropna()
      .assign(measurement_type=lambda x: x.measurement_type.str[-1])
      .iloc[:, [0, -1, 1]]
      .astype("int8")
)
or wide_to_long:
(
    pd.wide_to_long(df, stubnames="measurement", i="ID",
                    j="measurement_type", sep="_")
      .dropna()
      .reset_index()
      .astype("int8")
      .iloc[:, [0, -1, 1]]
)
ID measurement measurement_type
0 0 3 1
1 1 5 2
2 2 7 2

Pandas dataframe ratio of difference of consecutive columns to first value

Suppose I have the DataFrame (called df)
'name' 'order' 'quantity'
'A' 1 10
'A' 2 15
'A' 3 5
'B' 1 2
'B' 2 6
What I want is to build another dataframe containing a column with the ratio of the difference between consecutive rows (consecutive in terms of the 'order' column) to the earlier value.
I am easily able to retrieve the difference (the numerator of that ratio) as:
def compute_diff(x):
    quantity_diff = x.quantity.diff()
    return quantity_diff

diff_df = df.sort_values('order').groupby('name').apply(compute_diff).reset_index(name='diff')
This gives me:
  name  level_1   diff
0    A        0    NaN
1    A        1    5.0
2    A        2  -10.0
3    B        3    NaN
4    B        4    4.0
Now I want the ratio instead, as per description. Specifically, I'd want
'name' 'level_1' 'quantity'
'A' 1 NaN
'A' 2 0.5
'A' 3 -0.6666
'B' 1 NaN
'B' 2 2
How to?
After performing your groupby, use pct_change:
# Sort the DataFrame, if necessary.
df = df.sort_values(['name', 'order'])
# Use groupby and pct_change on the 'quantity' column.
df['quantity'] = df.groupby('name')['quantity'].pct_change()
The resulting output:
name order quantity
0 A 1 NaN
1 A 2 0.500000
2 A 3 -0.666667
3 B 1 NaN
4 B 2 2.000000
You could take your result and divide it by the shifted 'quantity' column in df:
diff_df['diff'] = diff_df['diff'] / df.quantity.shift(1)
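For what it's worth, both answers describe the same quantity: pct_change is exactly diff divided by the previous (shifted) value within each group. A quick sketch to check that on the example data (column names taken from the question):
import pandas as pd

df = pd.DataFrame({'name': ['A', 'A', 'A', 'B', 'B'],
                   'order': [1, 2, 3, 1, 2],
                   'quantity': [10, 15, 5, 2, 6]})
df = df.sort_values(['name', 'order'])

g = df.groupby('name')['quantity']
manual = g.diff() / g.shift(1)          # (current - previous) / previous, per group
print(manual.equals(g.pct_change()))    # True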

Drop Rows by Multiple Column Criteria in DataFrame

I have a pandas DataFrame from which I'm trying to drop rows based on a criterion across select columns. If the values in these select columns are all zero, the rows should be dropped. Here is an example.
import pandas as pd
t = pd.DataFrame({'a':[1,0,0,2],'b':[1,2,0,0],'c':[1,2,3,4]})
a b c
0 1 1 1
1 0 2 2
2 0 0 3
3 2 0 4
I would like to try something like:
cols_of_interest = ['a','b'] #Drop rows if zero in all these columns
t = t[t[cols_of_interest]!=0]
This doesn't drop the rows, so I tried:
t = t.drop(t[t[cols_of_interest]==0].index)
And all rows are dropped.
What I would like to end up with is:
a b c
0 1 1 1
1 0 2 2
3 2 0 4
Where the 3rd row (index 2) was dropped because it took on value 0 in BOTH the columns of interest, not just one.
Your problem here is that you first assigned the result of your boolean condition, t = t[t[cols_of_interest]!=0], which overwrites your original df and sets the entries where the condition is not met to NaN.
What you want to do is generate the boolean mask, then drop the NaN rows, passing thresh=1 so that there must be at least a single non-NaN value in that row; we can then use loc with the resulting index to get the desired df:
In [124]:
cols_of_interest = ['a','b']
t.loc[t[t[cols_of_interest]!=0].dropna(thresh=1).index]
Out[124]:
a b c
0 1 1 1
1 0 2 2
3 2 0 4
EDIT
As pointed out by @DSM, you can achieve this simply by using any with axis=1 to test the condition and use this to index into your df:
In [125]:
t[(t[cols_of_interest] != 0).any(axis=1)]
Out[125]:
a b c
0 1 1 1
1 0 2 2
3 2 0 4
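The same condition can also be written the other way round, which reads closer to the original wording ("drop rows if zero in all these columns"); a sketch for comparison:
# drop a row only when every column of interest is zero
t[~(t[cols_of_interest] == 0).all(axis=1)]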
