pandas groupby dataframe and get mean and most common value per group - python

I have a dataframe with 2 columns.
df = pd.DataFrame({'values': arrays, 'ii': lin_index})
I want to group the values by lin_index and get the mean per group and the most common value per group.
I tried this:
bii = df.groupby('ii').mean()
bii2 = df.groupby('ii').agg(lambda x: x.value_counts().index[0])
bii3 = df.groupby('ii')['values'].agg(pd.Series.mode)
I wonder whether bii2 and bii3 return the same values.
Then I want to map the mean and most common value back to the original array:
bs = np.zeros((np.unique(lin_index).shape[0], 1))
bs[bii.index.values] = bii.values
Does this look good?
df looks like:
values ii
0 1.0 10446786
1 1.0 11316289
2 1.0 16416704
3 1.0 12151686
4 1.0 30312736
... ...
93071038 3.0 28539525
93071039 3.0 19667948
93071040 3.0 22240849
93071041 3.0 22212513
93071042 3.0 41641943
[93071043 rows x 2 columns]

Something like this, maybe:
# get the mean
df.groupby(['ii']).mean()
# get the most frequent
df.groupby(['ii']).agg(pd.Series.mode)
Your question seems similar to GroupBy pandas DataFrame and select most common value.
This link might also be useful: https://pandas.pydata.org/pandas-docs/stable/reference/frame.html#computations-descriptive-stats
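Whether bii2 and bii3 agree depends on ties: value_counts().index[0] always collapses each group to a single value (whichever tied value happens to sort first), while pd.Series.mode returns every tied value, so a tied group comes back as an array instead of a scalar. A minimal sketch on toy data (the toy frame here is illustrative, not your data):
import pandas as pd

df = pd.DataFrame({'values': [1.0, 1.0, 3.0, 3.0, 2.0],
                   'ii':     [0,   0,   0,   0,   1]})

# value_counts().index[0] picks exactly one value per group, even on a tie
bii2 = df.groupby('ii')['values'].agg(lambda x: x.value_counts().index[0])

# pd.Series.mode returns all tied values, so the tied group comes back
# as an array (exact behavior can vary across pandas versions)
bii3 = df.groupby('ii')['values'].agg(pd.Series.mode)

print(bii2)  # ii=0 -> a single value (one of 1.0 / 3.0); ii=1 -> 2.0
print(bii3)  # ii=0 -> array([1., 3.]);                   ii=1 -> 2.0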

Related

python Pandas lambda apply doesn't work for NaN

I've been trying to do an efficient VLOOKUP-style operation in pandas, with an IF function...
Basically, I want to apply this to the column ccy_grp: if the value in a particular row is NaN, it should take the value from another column, ccy.
def func1(tkn1, tkn2):
    if tkn1 == 'NaN':
        return tkn2
    else:
        return tkn1

tmp1_.ccy_grp = tmp1_.apply(lambda x: func1(x.ccy_grp, x.ccy), axis=1)
But nope, it doesn't work. The code cannot seem to detect NaN. I tried another way with np.isnan(tkn1), but I just get a boolean error message...
Does any experienced Python pandas developer know?
Use pandas.isna to detect whether a value is NaN.
Generate data:
import pandas as pd
import numpy as np

data = pd.DataFrame({'value': [np.nan, None, 1, 2, 3],
                     'label': ['str: np.nan', 'str: None', 'str: 1', 'str: 2', 'str: 3']})
data
Create a function:
def func1(x):
    if pd.isna(x):
        return 'is a na'
    else:
        return f'{x}'
Apply the function to the data:
data['func1_result'] = data['value'].apply(func1)
data
There is a pandas method for what you are trying to do. Check out combine_first:
Update null elements with value in the same location in ‘other’.
Combine two Series objects by filling null values in one Series with
non-null values from the other Series.
tmp1_.ccy_grp = tmp1_.ccy_grp.combine_first(tmp1_.ccy)
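A minimal sketch of that behavior on throwaway data (the currency values here are made up for illustration):
import pandas as pd
import numpy as np

tmp1_ = pd.DataFrame({'ccy_grp': [np.nan, 'EUR', np.nan],
                      'ccy':     ['USD', 'GBP', 'JPY']})

# null entries in ccy_grp are filled from ccy; existing values are kept
tmp1_['ccy_grp'] = tmp1_['ccy_grp'].combine_first(tmp1_['ccy'])
print(tmp1_)
#   ccy_grp  ccy
# 0     USD  USD
# 1     EUR  GBP
# 2     JPY  JPY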
This looks like it should be a pandas mask/where/fillna problem, not an apply:
Given:
value values2
0 NaN 0.0
1 NaN 0.0
2 1.0 1.0
3 2.0 2.0
4 3.0 3.0
Doing:
df.value.fillna(df.values2, inplace=True)
print(df)
# or
df.value.mask(df.value.isna(), df.values2, inplace=True)
print(df)
# or
df.value.where(df.value.notna(), df.values2, inplace=True)
print(df)
Output:
value values2
0 0.0 0.0
1 0.0 0.0
2 1.0 1.0
3 2.0 2.0
4 3.0 3.0

Fill missing data with random values from categorical column - Python

I'm working on a hotel booking dataset. Within the data frame, there's a discrete numerical column called 'agent' that has 13.7% missing values. My intuition is to just drop the rows with missing values, but considering the number of missing values is not that small, I now want to use random sampling imputation to replace them in proportion to the existing values.
My code is:
new_agent = hotel['agent'].dropna()
agent_2 = hotel['agent'].fillna(lambda x: random.choice(new_agent,inplace=True))
The result:
The first 3 rows were NaN but are now replaced with <function at 0x7ffa2c53d700>. Is there something wrong with my code, maybe in the lambda syntax?
UPDATE:
Thanks to ti7, who helped me solve the problem:
new_agent = hotel['agent'].dropna()  # get a Series of just the available values
n_null = hotel['agent'].isnull().sum()  # number of missing entries
# sample with repetition, take the values, then fill and replace
hotel.loc[hotel['agent'].isnull(), 'agent'] = new_agent.sample(n_null, replace=True).values
.fillna() is naively assigning your function to the missing values. It can do this because functions are really objects!
You probably want to generate a new Series of random values from your current Series (you know the shape needed by subtracting the lengths) and use that for the missing values.
get a Series of just the available values (.dropna())
.sample() it with repetition (replace=True) to a new Series of the same length as the missing entries (df["agent"].isna().sum())
get the .values (this is a flat numpy array)
filter the column and assign
Quick code:
df.loc[df["agent"].isna(), "agent"] = df["agent"].dropna().sample(
    df["agent"].isna().sum(),  # get the same number of values as are missing
    replace=True               # repeat values
).values                       # throw out the index
Demo:
>>> import pandas as pd
>>> df = pd.DataFrame({'agent': [1,2, None, None, 10], 'b': [3,4,5,6,7]})
>>> df
agent b
0 1.0 3
1 2.0 4
2 NaN 5
3 NaN 6
4 10.0 7
>>> df["agent"].isna().sum()
2
>>> df["agent"].dropna().sample(df["agent"].isna().sum(), replace=True).values
array([2., 1.])
>>> df["agent"].dropna().sample(df["agent"].isna().sum(), replace=True).values
array([2., 2.])
>>> df.loc[df["agent"].isna(), "agent"] = df["agent"].dropna().sample(
... df["agent"].isna().sum(),
... replace=True
... ).values
>>> df
agent b
0 1.0 3
1 2.0 4
2 10.0 5
3 2.0 6
4 10.0 7
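One caveat worth adding: .sample() draws differently on every call, which is why the two sampling calls in the demo returned different arrays. If the imputation needs to be reproducible, Series.sample accepts a random_state parameter to pin the seed; a sketch, reusing the df from the demo above:
filled = df["agent"].dropna().sample(
    df["agent"].isna().sum(),
    replace=True,
    random_state=0  # fixed seed, so reruns draw the same values
).values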

Finding the mean of consecutive columns

I have a very large data file (tens of thousands of rows and columns) formatted similarly to this:
name x y gh_00hr_bio_rep1 gh_00hr_bio_rep2 gh_00hr_bio_rep3 gh_06hr_bio_rep1
gene1 x y 2 3 2 1
gene2 x y 5 7 6 2
My goal for each gene is to find the mean of each set of repetitions.
At the end I would like to only have columns of mean values titled something like "00hr_bio" and delete all the individual repetitions.
My thinking right now is to use something like this:
for row in df:
    df[avg] = df.iloc[3:].rolling(window=3, axis=1).mean()
But I have no idea how to actually make this work.
The df.iloc[3] is my way of trying to start from the 3rd column but I am fairly certain doing it this way does not work.
I don't even know where to begin in terms of "merging" the 3 columns into only 1.
Any suggestions you have will be greatly appreciated as I obviously have no idea what I am doing.
I would first build a Series of final names indexed by the original columns:
names = pd.Series(['_'.join(i.split('_')[:-1]) for i in df.columns[3:]],
                  index=df.columns[3:])
I would then use it to take the mean of a groupby on axis=1:
tmp = df.iloc[:, 3:].groupby(names, axis=1).agg('mean')
It gives a new dataframe indexed like the original one and having the averaged columns:
gh_00hr_bio gh_06hr_bio
0 2.333333 1.0
1 6.000000 2.0
You can then horizontally concat it to the first dataframe or to its first 3 columns:
result = pd.concat([df.iloc[:, :3], tmp], axis=1)
to get:
name x y gh_00hr_bio gh_06hr_bio
0 gene1 x y 2.333333 1.0
1 gene2 x y 6.000000 2.0
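One hedge for newer environments: groupby(..., axis=1) is deprecated in recent pandas releases, so on an up-to-date install the same grouping can be expressed by transposing, grouping on rows, and transposing back. A sketch, assuming the same df and names as above:
# transpose so the rep columns become rows, group them by their final name,
# average, then transpose back to columns
tmp = df.iloc[:, 3:].T.groupby(names).mean().T
result = pd.concat([df.iloc[:, :3], tmp], axis=1)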
You're pretty close.
df['avg'] = df.iloc[:, 2:].mean(axis=1)
will get you this:
x y gh_00hr_bio_rep1 gh_00hr_bio_rep2 gh_00hr_bio_rep3 gh_06hr_bio_rep1 avg
gene1 x y 2 3 2 1 2.0
gene2 x y 5 7 6 2 5.0
If you wish to get the mean from different sets of columns, you could do something like this:
for col in range(10):
    df['avg%i' % col] = df.iloc[:, 2+col*5:7+col*5].mean(axis=1)
That works if you have the same number of columns per average. Otherwise you'd probably want to use the names of the rep columns, depending on what your data looks like.

python pandas: Find top n and then m in the top n

I have a pandas data frame that looks like the following:
fastmoving[['dist','unique','id']]
Out[683]:
dist unique id
1 0.406677 4.0 4.997434e+09
2 0.406677 4.0 4.452593e+09
5 0.406677 4.0 4.188395e+09
1 0.434386 4.0 8.288070e+09
4 0.434386 4.0 3.274609e+09
What I want to achieve is to:
Find the top n longest-distance entries (column 'dist').
Find which ids are in the top m percent within those top n entries (column 'id').
So far I have been able to write the code for the maximum entries:
# Get the first id with the largest dist:
fastmoving.loc[fastmoving['dist'].idxmax(), 'id']
# Get all ids with the largest dist:
fastmoving.loc[fastmoving['dist'] == fastmoving['dist'].max(), 'id']
What I'm missing is how to make my code work for more than one value.
So instead of the single maximum value, it should work for a range of maximum values (the top n values).
And then get all the ids that fall within some top m percent of those n maximum values.
Can you please help me on how I can achieve that in pandas?
Thanks a lot
Alex
You can use nlargest for the top n and quantile for the top m%, like this:
import pandas as pd
from io import StringIO

fastmoving = pd.read_csv(StringIO("""
dist unique id
1 0.406677 4.0 4.997434e+09
2 0.406677 4.0 4.452593e+09
5 0.406677 4.0 4.188395e+09
1 0.434386 4.0 8.288070e+09
4 0.434386 4.0 3.274609e+09"""), sep=r"\s+")

n = 3
m = 50

top_n_dist = fastmoving.nlargest(n, ["dist"])
top_m_percent_id_in_top_n_dist = top_n_dist[top_n_dist['id'] > top_n_dist['id'].quantile(m/100)]
print(top_m_percent_id_in_top_n_dist)
IIUC, you can leverage nlargest. The following example would take the top 3 values of dist, and from that, extract the top 2 values of id:
fastmoving.nlargest(3, ["dist", "id"]).nlargest(2, "id")
dist unique id
1 0.434386 4.0 8.288070e+09
1 0.406677 4.0 4.997434e+09

Slice column in pandas dataframe and average results

If I have a pandas dataframe such as:
timestamp  label  value  new
etc.       a      1      3.5
           b      2      5
           a      5      ...
           b      6      ...
           a      2      ...
           b      4      ...
I want the new column to be the average of the last two a's and the last two b's... so for the first row it would be the average of 5 and 2, which gives 3.5. It will be sorted by the timestamp. I know I could use a groupby to get the average of all the a's or all the b's, but I'm not sure how to get an average of just the last two. I'm kind of new to Python and coding, so this might not be possible.
Edit: I should also mention this is not for a class or anything; it's just something I'm doing on my own, and it will be run on a very large dataset. I'm just using this as an example. Also, I would want each a and each b to have its own value for the last-two average, so the new column will have the same dimension as the others. So for the third line it would be the average of 2 and whatever the next a is in the dataset.
IIUC one way (among many) to do that:
In [139]: df.groupby('label').tail(2).groupby('label').mean().reset_index()
Out[139]:
label value
0 a 3.5
1 b 5.0
Edited to reflect a change in the question specifying the last two, not the ones following the first, and that you wanted the same dimensionality with values repeated.
import pandas as pd

data = {'label': ['a', 'b', 'a', 'b', 'a', 'b'], 'value': [1, 2, 5, 6, 2, 4]}
df = pd.DataFrame(data)

grouped = df.groupby('label')
results = {'label': [], 'tail_mean': []}
for item, grp in grouped:
    subset_mean = grp['value'].tail(2).mean()  # mean of the last two values in this group
    results['label'].append(item)
    results['tail_mean'].append(subset_mean)

res_df = pd.DataFrame(results)
df = df.merge(res_df, on='label', how='left')
Outputs:
>>> res_df
label tail_mean
0 a 3.5
1 b 5.0
>>> df
label value tail_mean
0 a 1 3.5
1 b 2 5.0
2 a 5 3.5
3 b 6 5.0
4 a 2 3.5
5 b 4 5.0
Now you have a dataframe of your results only, if you need it, plus a column with those results merged back into the main dataframe. Someone else posted a more succinct way to get the results dataframe; there's probably no reason to do it the longer way shown here unless you also need to perform more operations inside the same loop.
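As a footnote, if all you need is the broadcast column and not the separate res_df, one compact alternative (a sketch, assuming the same df as above) is groupby().transform, which maps each group's statistic back onto every row of that group:
# transform broadcasts the per-group tail mean back to each row of the group
df['tail_mean'] = df.groupby('label')['value'].transform(lambda s: s.tail(2).mean())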
