Pandas: Syntax in Groupby - python

I have the following dataset:
id window Rank member
1 2 2 0
1 3 2 0
2 3 1 0
2 2 1 0
I want member to equal Rank where window==3. To do that, I have the following command:
df["member"]= df[df['window']==3]['Rank'][0]
However, I want to do that in a groupby statement, grouping on id. The command below returns an error. It is probably a simple thing that I am missing, but I cannot work out how to use groupby in the above command. Any help is greatly appreciated.
df["member"]= df.groupby("id")[df[df['window']==3]['Rank'][0]]

You can achieve this by using pandas.Series.where:
df = pd.DataFrame({'id':[1,1,2,2],'window':[2,3,3,2],'Rank':[2,2,1,1],'member':[0,0,0,0]})
=>
Rank id member window
0 2 1 0 2
1 2 1 0 3
2 1 2 0 3
3 1 2 0 2
df['member'] = df['Rank'].where(df['window']==3, df['member'])
print(df)
=>
Rank id member window
0 2 1 0 2
1 2 1 2 3
2 1 2 1 3
3 1 2 0 2

You can use numpy.where or DataFrame.loc:
df['member'] = np.where(df['window']==3, df['Rank'], df['member'])
print (df)
id window Rank member
0 1 2 2 0
1 1 3 2 2
2 2 3 1 1
3 2 2 1 0
df.loc[df['window']==3, 'member'] = df['Rank']
print (df)
id window Rank member
0 1 2 2 0
1 1 3 2 2
2 2 3 1 1
3 2 2 1 0
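The answers above update member row by row. If the intent of the groupby in the question was instead to give every row of an id the Rank found on that id's window==3 row, a minimal sketch could look like the following. The name rank_at_3 is made up for illustration, and the code assumes each id has at most one window==3 row (ids with none would end up with NaN).

```python
import pandas as pd

df = pd.DataFrame({'id': [1, 1, 2, 2],
                   'window': [2, 3, 3, 2],
                   'Rank': [2, 2, 1, 1],
                   'member': [0, 0, 0, 0]})

# Rank of the window == 3 row for each id.
# Assumption: at most one such row per id; drop_duplicates keeps the first otherwise.
rank_at_3 = (df.loc[df['window'] == 3]
               .drop_duplicates('id')
               .set_index('id')['Rank'])

# Broadcast that per-id value to every row of the group.
df['member'] = df['id'].map(rank_at_3)
print(df)
```

Here map looks up each row's id in rank_at_3, so no explicit groupby call is needed for the broadcast.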

Related

keep second level of multi-index intact while sorting on first one pandas python

I have sorted the first level of my index using the method from "Custom sort order function for groupby pandas python":
def my_func(group):
    return sum(group["B"]*group["C"])
idx=df.groupby('A').apply(my_func).reindex(df.index.get_level_values(0))
df.iloc[idx.argsort()]
The issue is that the second-level ordering is jumbled up after sorting on the first. How can I make sure that the intra-group order is kept?
from
A B C
1 0 1 8
1 3 3
2 0 1 2
1 2 2
3 0 1 3
1 2 4
to
A B C
2 0 1 2
1 2 2
3 0 1 3
1 2 4
1 0 1 8
1 3 3
and not (last 2 lines inverted)
A B C
2 0 1 2
1 2 2
3 0 1 3
1 2 4
1 1 3 3
0 1 8
I think you need a stable sorting algorithm, mergesort:
idx=df.index.get_level_values(0).map(df.groupby('A').apply(my_func))
df = df.iloc[idx.argsort(kind='mergesort')]
print (df)
B C
A
2 0 1 2
1 2 2
3 0 1 3
1 2 4
1 0 1 8
1 3 3
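To see why stability matters here, the sketch below rebuilds the question's frame and sorts it by the per-group key. It is a variation on the answer rather than the answer's exact code: it computes the key with transform instead of groupby.apply, and the second index level is given the hypothetical name L because the question leaves it unnamed.

```python
import pandas as pd

df = pd.DataFrame({'A': [1, 1, 2, 2, 3, 3],
                   'L': [0, 1, 0, 1, 0, 1],   # hypothetical name for the 2nd level
                   'B': [1, 3, 1, 2, 1, 2],
                   'C': [8, 3, 2, 2, 3, 4]}).set_index(['A', 'L'])

# Per-row sort key: sum of B*C over the row's A group (17, 6 and 11 here).
key = (df['B'] * df['C']).groupby(level='A').transform('sum')

# A stable sort keeps rows with equal keys in their original relative order,
# so the second index level stays intact inside each group.
order = key.to_numpy().argsort(kind='stable')
result = df.iloc[order]
print(result)
```

The default argsort algorithm is not guaranteed to be stable, which is how rows sharing the same key can come out swapped.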

Insert values into columns without NaN

I'm trying to count the occurrences of some values in a data frame like
user_id event_type
1 a
1 a
1 b
2 a
2 b
2 c
and I want to get table like
user_id event_type event_type_a event_type_b event_type_c
1 a 2 1 0
1 a 2 1 0
1 b 2 1 0
2 a 1 1 1
2 b 1 1 1
2 c 1 1 1
I've tried code like
df[' event_type_a'] = df['user_id', 'event_type'].where(df['event_type']=='a').groupby([user_id]).count()
and get table like
user_id count_a
1 2
2 1
How should I insert these values into the original df so that all rows are filled, without NaN items?
Maybe there exists a method like, for example, "insert into df_1['column'] from df_2['column'] where df_1['user_id'] == df_2['user_id']"
Use crosstab with add_prefix for new columns names and join:
df2 = pd.crosstab(df['user_id'],df['event_type'])
#alternatives
#df2 = df.groupby(['user_id','event_type']).size().unstack(fill_value=0)
#df2 = df.pivot_table(index='user_id', columns='event_type', fill_value=0, aggfunc='size')
df = df.join(df2.add_prefix('event_type_'), on='user_id')
print (df)
user_id event_type event_type_a event_type_b event_type_c
0 1 a 2 1 0
1 1 a 2 1 0
2 1 b 2 1 0
3 2 a 1 1 1
4 2 b 1 1 1
5 2 c 1 1 1
Here is another way to get df2, similar to what Jez mentioned but slightly different: because it uses transform rather than an aggregation, df2 has the same number of rows as the original df.
df2= df.set_index('user_id').event_type.str.get_dummies().groupby(level=0).transform('sum')
df2
Out[11]:
a b c
user_id
1 2 1 0
1 2 1 0
1 2 1 0
2 1 1 1
2 1 1 1
2 1 1 1
Then using concat
df2.index=df.index
pd.concat([df,df2],axis=1)
Out[19]:
user_id event_type a b c
0 1 a 2 1 0
1 1 a 2 1 0
2 1 b 2 1 0
3 2 a 1 1 1
4 2 b 1 1 1
5 2 c 1 1 1
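The "insert into ... where the user_id values match" idea from the question corresponds closely to a left merge. The sketch below is one possible way to write it, reusing the crosstab counts from the first answer; the names counts and result are chosen here for illustration.

```python
import pandas as pd

df = pd.DataFrame({'user_id': [1, 1, 1, 2, 2, 2],
                   'event_type': ['a', 'a', 'b', 'a', 'b', 'c']})

# One row per user, one count column per event type (0 where a type never occurs).
counts = (pd.crosstab(df['user_id'], df['event_type'])
            .add_prefix('event_type_')
            .reset_index())

# Left merge on user_id: every original row picks up its user's counts.
result = df.merge(counts, on='user_id', how='left')
print(result)
```

Because crosstab already fills absent combinations with 0, the merged columns contain no NaN as long as every user_id in df appears in counts.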

Pandas - updating sequence of values

I have this Sample DataFrame:
pd.DataFrame(data={1:[0,3,4,1], 2:[4,1,0,0], 3:[0,0,1,2], 4:[1,2,3,4] })
1 2 3 4
0 0 4 0 1
1 3 1 0 2
2 4 0 1 3
3 1 0 2 4
But I want to convert it to the format below:
pd.DataFrame(data={1:[1,1,1,1], 2:[0,2,0,2], 3:[0,3,3,0], 4:[4,0,4,4] })
1 2 3 4
0 1 0 0 4
1 1 2 3 0
2 1 0 3 4
3 1 2 0 4
Is there a function or some other way to do this? I have more than 100,000 rows, so for loops, dictionaries, and lists won't work.
My entry:
data = df.reset_index().melt("index").query("value > 0")
out = data.pivot(index="index", columns="value", values="value").fillna(0).astype(int)
giving
In [273]: out
Out[273]:
value 1 2 3 4
index
0 1 0 0 4
1 1 2 3 0
2 1 0 3 4
3 1 2 0 4
Unfortunately, you have to clear the index and column names if you want to get rid of them, using either out.index.name = out.columns.name = None or out.rename_axis(None).rename_axis(None, axis=1).
Using get_dummies:
s = pd.get_dummies(df, columns=df.columns, prefix_sep='', prefix='')
out = s.T.groupby(level=0).sum().T.drop(columns='0')
out.mask(out.ne(0)).fillna(dict(zip(out.columns, out.columns))).astype(int)
1 2 3 4
0 1 0 0 4
1 1 2 3 0
2 1 0 3 4
3 1 2 0 4
Using zip and np.isin
pd.DataFrame([np.isin(x, y) * df.columns.values for x, y in zip([df.columns.values]*len(df), df.values)], columns=df.columns)
Out[900]:
1 2 3 4
0 1 0 0 4
1 1 2 3 0
2 1 0 3 4
3 1 2 0 4
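For a frame with many rows, the same membership test can also be written with NumPy broadcasting and no Python-level loop. This is only a sketch of that idea, not one of the answers above: it assumes the column labels are the integers being searched for, as in the sample, and it builds an intermediate boolean array of shape (rows, columns, columns).

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(data={1: [0, 3, 4, 1], 2: [4, 1, 0, 0],
                        3: [0, 0, 1, 2], 4: [1, 2, 3, 4]})

cols = df.columns.to_numpy()

# present[i, j] is True when column label cols[j] occurs anywhere in row i.
present = (df.to_numpy()[:, :, None] == cols).any(axis=1)

# Keep the label where it is present, 0 elsewhere.
out = pd.DataFrame(present * cols, index=df.index, columns=df.columns)
print(out)
```

The intermediate array grows with the square of the number of columns, so this suits narrow frames like the four-column sample best.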

How to add incremental number to Dataframe using Pandas

I have original dataframe:
ID T value
1 0 1
1 4 3
2 0 0
2 4 1
2 7 3
Each missing row should take the value of the previous row.
The output should be like:
ID T value
1 0 1
1 1 1
1 2 1
1 3 1
1 4 3
2 0 0
2 1 0
2 2 0
2 3 0
2 4 1
2 5 1
2 6 1
2 7 3
... ... ...
I tried a loop, but it takes a long time to run.
Any idea how to solve this for a large dataframe?
Thanks!
This solution requires unique integer values in T within each group.
Use groupby with a custom function: for each group, reindex over the full range of T, then fill the resulting NaNs in the value column by forward filling with ffill:
df1 = (df.groupby('ID')[['T', 'value']]
         .apply(lambda x: x.set_index('T').reindex(np.arange(x['T'].min(), x['T'].max() + 1)))
         .ffill()
         .astype(int)
         .reset_index())
print (df1)
ID T value
0 1 0 1
1 1 1 1
2 1 2 1
3 1 3 1
4 1 4 3
5 2 0 0
6 2 1 0
7 2 2 0
8 2 3 0
9 2 4 1
10 2 5 1
11 2 6 1
12 2 7 3
If you get the error:
ValueError: cannot reindex from a duplicate axis
it means some T values are duplicated within a group, for example:
print (df)
ID T value
0 1 0 1
1 1 4 3
2 2 0 0
3 2 4 1 <- T=4 is duplicated within group 2
4 2 4 3 <- T=4 is duplicated within group 2
5 2 7 3
The solution is to aggregate the values first so that T is unique within each group, e.g. by sum:
df = df.groupby(['ID', 'T'], as_index=False)['value'].sum()
print (df)
ID T value
0 1 0 1
1 1 4 3
2 2 0 0
3 2 4 4
4 2 7 3
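As an alternative to the groupby/reindex approach, the full grid of (ID, T) pairs can be built explicitly and left-merged with the data before forward filling. This is a sketch under the same assumption of unique integer T values per ID; the names bounds, full and result are illustrative only.

```python
import pandas as pd

df = pd.DataFrame({'ID': [1, 1, 2, 2, 2],
                   'T': [0, 4, 0, 4, 7],
                   'value': [1, 3, 0, 1, 3]})

# Smallest and largest T for each ID.
bounds = df.groupby('ID')['T'].agg(['min', 'max'])

# Every (ID, T) pair between those bounds, in order.
full = pd.DataFrame(
    [(i, t) for i, (lo, hi) in zip(bounds.index, bounds.to_numpy())
            for t in range(lo, hi + 1)],
    columns=['ID', 'T'])

# Attach the known values, then forward fill inside each ID.
result = full.merge(df, on=['ID', 'T'], how='left')
result['value'] = result.groupby('ID')['value'].ffill().astype(int)
print(result)
```

Forward filling per ID keeps a value from leaking across group boundaries; each group's first row is its minimum T, which always has a value.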

Pandas - Assign unique ID to each group in grouped data

I have a dataframe with many attributes. I want to assign an id for all unique combinations of these attributes.
assume, this is my df:
df = pd.DataFrame(np.random.randint(1,3, size=(10, 3)), columns=list('ABC'))
A B C
0 2 1 1
1 1 1 1
2 1 1 1
3 2 2 2
4 1 2 2
5 1 2 1
6 1 2 2
7 1 2 1
8 1 2 2
9 2 2 1
Now, I need to append a new column with an id for unique combinations. It has to be 0 if the combination occurs only once. In this case:
A B C unique_combination
0 2 1 1 0
1 1 1 1 1
2 1 1 1 1
3 2 2 2 0
4 1 2 2 2
5 1 2 1 3
6 1 2 2 2
7 1 2 1 3
8 1 2 2 2
9 2 2 1 0
My first approach was to use a for loop and check for every row, if I find more than one combination in the dataframe of the row's values with .query:
unique_combination = 1  # acts as a counter
df['unique_combination'] = 0
for idx, row in df.iterrows():
    if len(df.query('A == @row.A & B == @row.B & C == @row.C')) > 1:
        # check, if one occurrence of the combination already has a value > 0???
        df.loc[idx, 'unique_combination'] = unique_combination
        unique_combination += 1
However, I have no idea how to check whether an ID has already been assigned to a combination (see the comment in the code). Additionally, my approach feels very slow and hacky (I have over 15000 rows). Do you data wranglers see a different approach to my problem?
Thank you very much!
Step 1: Assign a new column filled with 0:
df['new'] = 0
Step 2: Create a mask for combinations that occur more than once:
mask = df.groupby(['A','B','C'])['new'].transform(lambda x : len(x)>1)
Step 3: Factorize the masked rows and assign the resulting ids:
df.loc[mask,'new'] = df.loc[mask,['A','B','C']].astype(str).sum(1).factorize()[0] + 1
# or
# df.loc[mask,'new'] = df.loc[mask,['A','B','C']].groupby(['A','B','C']).ngroup()+1
Output:
A B C new
0 2 1 1 0
1 1 1 1 1
2 1 1 1 1
3 2 2 2 0
4 1 2 2 2
5 1 2 1 3
6 1 2 2 2
7 1 2 1 3
8 1 2 2 2
9 2 2 1 0
A new feature added in Pandas version 0.20.2 creates a column of unique ids automatically for you.
df['unique_id'] = df.groupby(['A', 'B', 'C']).ngroup()
gives the following output
A B C unique_id
0 2 1 2 3
1 2 2 1 4
2 1 2 1 1
3 1 2 2 2
4 1 1 1 0
5 1 2 1 1
6 1 1 1 0
7 2 2 2 5
8 1 2 2 2
9 1 2 2 2
The groups are given ids based on the order they would be iterated over.
See the documentation here: https://pandas.pydata.org/pandas-docs/stable/user_guide/groupby.html#enumerate-groups
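Putting ngroup together with the "0 for combinations that occur only once" requirement, one possible sketch is shown below. It uses the sample frame typed out from the question, and it swaps in df.duplicated for the mask and sort=False for the numbering, which is a variation on the answers above rather than their exact code.

```python
import pandas as pd

df = pd.DataFrame({'A': [2, 1, 1, 2, 1, 1, 1, 1, 1, 2],
                   'B': [1, 1, 1, 2, 2, 2, 2, 2, 2, 2],
                   'C': [1, 1, 1, 2, 2, 1, 2, 1, 2, 1]})
keys = ['A', 'B', 'C']

# True for every row whose combination appears more than once.
repeated = df.duplicated(keys, keep=False)

# Singletons keep 0; repeated combinations get 1, 2, 3, ...
# sort=False numbers the groups in order of first appearance.
df['unique_combination'] = 0
df.loc[repeated, 'unique_combination'] = (
    df[repeated].groupby(keys, sort=False).ngroup() + 1)
print(df)
```

With the default sort=True the ids would follow the sorted order of the key values instead, which still yields unique ids but not necessarily the numbering shown in the question.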
