roll value within grouped regions in a dataframe - python

Consider the df
idx = list(map('first {}'.format, range(2))) + list(map('last {}'.format, range(3)))
df = pd.DataFrame(np.arange(25).reshape(5, -1), idx, idx)
df
I want to group the dataframe into four quadrants based on the text in the row and column headers. Meaning that the upper left quadrant consists of columns with 'first' and rows with 'first'. The upper right quadrant consists of columns with 'last' and rows with 'first' and so on.
Then, within each group, I want to:
roll each element one position to the right if it can,
otherwise start on the next row at the beginning if it can,
otherwise start at the very beginning.
This should help illustrate
The expected output should look like this.

Using a nested groupby-apply pattern and np.roll. Perform a groupby on the columns, followed by a groupby on the index to get the desired subgroups to roll. Then use np.roll to perform the roll, wrapping the output in a DataFrame since np.roll only returns an array.
def roll_frame(df, shift):
    return pd.DataFrame(np.roll(df, shift), index=df.index, columns=df.columns)
# Groupers for the index and the columns.
idx_groups = df.index.map(lambda x: x.split()[0])
col_groups = df.columns.map(lambda x: x.split()[0])
# Nested groupby, then perform the roll.
# Note: groupby(..., axis=1) is deprecated in recent pandas versions.
df = df.groupby(col_groups, axis=1) \
    .apply(lambda grp: grp.groupby(idx_groups).apply(roll_frame, 1))
Kind of gross, but gets the job done. The order in which you perform the nested groupby doesn't really matter.
The resulting output:
         first 0  first 1  last 0  last 1  last 2
first 0        6        0       9       2       3
first 1        1        5       4       7       8
last 0        21       10      24      12      13
last 1        11       15      14      17      18
last 2        16       20      19      22      23
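As a cross-check of the output above, here is a minimal sketch that avoids the nested groupby entirely. It assumes the same df as in the question and that the first word of each label identifies the group; the names row_keys, col_keys and rolled are made up for this illustration.

```python
import numpy as np
import pandas as pd

idx = [f"first {i}" for i in range(2)] + [f"last {i}" for i in range(3)]
df = pd.DataFrame(np.arange(25).reshape(5, -1), idx, idx)

# Group key of every row / column label: the first word ("first" or "last").
row_keys = df.index.str.split().str[0]
col_keys = df.columns.str.split().str[0]

out = df.to_numpy().copy()
for r in row_keys.unique():
    rows = np.flatnonzero(row_keys == r)
    for c in col_keys.unique():
        cols = np.flatnonzero(col_keys == c)
        block = out[np.ix_(rows, cols)]
        # np.roll without an axis flattens the block, shifts it by one
        # and restores the shape, which is the wrap-around asked for.
        out[np.ix_(rows, cols)] = np.roll(block, 1)

rolled = pd.DataFrame(out, index=df.index, columns=df.columns)
print(rolled)
```

The explicit loop trades the groupby machinery for plain positional indexing, which may be easier to follow when there are only a handful of quadrants.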

my solution
sdf = df.stack()
tups = sdf.index.to_series().apply(lambda x: tuple(pd.Series(x).str.split().str[0]))
sdf.groupby(tups).apply(lambda x: pd.Series(np.roll(x.values, 1), x.index)).unstack()

Related

How to add multiple rows in dataframe in python

I have a dataframe(df) like below (there are more rows actually).
   number
0      21
1      35
2     467
3     965
4    2754
5     34r
6    5743
7     841
8    8934
9     275
I want to insert 6 rows between each pair of consecutive rows. For example, I want to draw 6 random values from the range between the values at index 0 and 1 and add them as 6 new rows between index 0 and 1.
The same goes for index 1 and 2, 2 and 3, and so forth until the end.
np.linspace(df["number"][0], df["number"][1],8)
Is there a function or any other method to generate 6 additional rows between each pair of consecutive rows, so that the final number of rows is not 10 but 64 (after adding 54 rows)?
You could try the following:
from random import uniform

import pandas as pd

def rng_numbers(row):
    left, right = row.iat[0], row.iat[1]
    n = left
    # last row: there is no next value, so nothing is inserted
    if pd.isna(right):
        return [n]
    # make sure left <= right before drawing
    if right < left:
        left, right = right, left
    return [n] + [uniform(left, right) for _ in range(6)]

df["number"] = (
    pd.concat([df["number"], df["number"].shift(-1)], axis=1)
    .apply(rng_numbers, axis=1)
)
df = df.explode("number", ignore_index=True)
First, create a dataframe with 2 columns that form the interval boundaries: the number column and the number column shifted up by one row (.shift(-1)).
Then .apply the function rng_numbers to the rows of the new dataframe: rng_numbers sorts the interval boundaries and then returns a list that starts with the respective item from column number, followed by 6 random numbers from the interval. In the last row the right boundary is NaN (due to the .shift(-1)): in this case the function returns the list without the random numbers.
Then .explode df on the new column number.
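To make the first step concrete, this small sketch builds only the intermediate boundary frame. The three sample values and the column names left and right are assumptions for illustration.

```python
import pandas as pd

# Small numeric sample (assumed data, not the question's full column).
df = pd.DataFrame({"number": [21, 35, 467]})

# Pair each value with the value of the next row.
limits = pd.concat([df["number"], df["number"].shift(-1)], axis=1)
limits.columns = ["left", "right"]
print(limits)
```

The last row ends up with a missing right boundary, which is the case the pd.isna check in rng_numbers handles.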
You could do something similar with NumPy, which is probably faster:
import numpy as np
import pandas as pd

rng = np.random.default_rng()
limits = pd.concat([df["number"], df["number"].shift(-1)], axis=1)
left = limits.min(axis=1).values.reshape(-1, 1)
right = limits.max(axis=1).values.reshape(-1, 1)
df["number"] = (
    pd.Series(df["number"].values.reshape(len(df), 1).tolist())
    + pd.Series(rng.uniform(left, right, size=(len(df), 6)).tolist())
)
# the last row has no successor: keep only its original value
df["number"].iat[-1] = df["number"].iat[-1][:1]
df = df.explode("number", ignore_index=True)
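A variant that stays in NumPy until the very end might look like the sketch below. The helper name insert_random_rows, its parameters k and seed, and the four sample values are all hypothetical; the column is assumed to be purely numeric.

```python
import numpy as np
import pandas as pd

def insert_random_rows(numbers, k=6, seed=None):
    """Return a Series with k uniform random values between each pair of neighbours."""
    rng = np.random.default_rng(seed)
    values = numbers.to_numpy(dtype=float)
    # Lower and upper bound of every gap between consecutive values.
    lo = np.minimum(values[:-1], values[1:])
    hi = np.maximum(values[:-1], values[1:])
    fills = rng.uniform(lo[:, None], hi[:, None], size=(len(lo), k))
    # Each row is [original value, k random values]; flatten row by row.
    blocks = np.column_stack([values[:-1], fills]).ravel()
    return pd.Series(np.append(blocks, values[-1]), name=numbers.name)

numbers = pd.Series([21, 35, 467, 965], name="number")  # assumed sample data
result = insert_random_rows(numbers, k=6, seed=0)
print(len(result))
```

With n original rows this yields n + k * (n - 1) rows, which matches the 10 + 54 = 64 rows asked for in the question.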

nunique compare two Pandas dataframe with duplicates and pivot them

My input:
df1 = pd.DataFrame({'frame':[ 1,1,1,2,3,0,1,2,2,2,3,4,4,5,5,5,8,9,9,10,],
'label':['GO','PL','ICV','CL','AO','AO','AO','ICV','PL','TI','PL','TI','PL','CL','CL','AO','TI','PL','ICV','ICV'],
'user': ['user1','user1','user1','user1','user1','user1','user1','user1','user1','user1','user1','user1','user1','user1','user1','user1','user1','user1','user1','user1']})
df2 = pd.DataFrame({'frame':[ 1, 1, 2, 3, 4,0,1,2,2,2,4,4,5,6,6,7,8,9,10,11],
'label':['ICV','GO', 'CL','TI','PI','AO','GO','ICV','TI','PL','ICV','TI','PL','CL','CL','CL','AO','AO','PL','ICV'],
'user': ['user2','user2','user2','user2','user2','user2','user2','user2','user2','user2','user2','user2','user2','user2','user2','user2','user2','user2','user2','user2']})
df_c = pd.concat([df1,df2])
I am trying to compare two dataframes frame by frame, checking whether a label in df1 exists in the same frame in df2, and then do some calculations with the result (a pivot, for example).
This is my code:
m_df = df1.merge(df2, on=['frame'], how='outer')
m_df['cross'] = m_df.apply(lambda row: 'Matched'
                           if row['label_x'] == row['label_y']
                           else 'Mismatched', axis='columns')
pv_m_unq = pd.pivot_table(m_df,
                          columns='cross',
                          index='label_x',
                          values='frame',
                          aggfunc=pd.Series.nunique, fill_value=0, margins=True)
pv_mc = pd.pivot_table(m_df,
                       columns='cross',
                       index='label_x',
                       values='frame',
                       aggfunc=pd.Series.count, fill_value=0, margins=True)
But this creates some problems:
First, I can't calculate a "simple" total (column All) of matched and mismatched as depicted in the picture: it is either "duplicated", as for AO in pv_m, or the wrong number, as for CL in pv_m_unq.
Second, I think the merge method as I use it is not a clever way, because if a frame+label combination is repeated in a df (which happens often), the merged df contains (number of rows in df1) x (number of rows in df2) rows for this specific frame+label.
Is there maybe a smarter way to compare the dataframes and pivot them?
You got the unexpected result in the margin total because the margin is computed with the same function that is passed to aggfunc (pd.Series.nunique in this case). In these 2 rows, the Matched and Mismatched values are both 1 and refer to the same frame, so the unique count across them is only 1. (You are currently getting the unique count of frame ids.)
Probably, you can achieve more or less what you want by taking the count (for the margin, Matched and Mismatched alike) instead of the unique count of frame ids, using pd.Series.count as the aggfunc in the last line of code:
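The effect can be reproduced on a tiny frame. This is only an illustrative sketch; the three rows and the column name label are made up.

```python
import pandas as pd

# Label AO appears in frame 1 both as Matched and as Mismatched.
m = pd.DataFrame({
    "label": ["AO", "AO", "CL"],
    "cross": ["Matched", "Mismatched", "Matched"],
    "frame": [1, 1, 2],
})

pv = pd.pivot_table(m, index="label", columns="cross", values="frame",
                    aggfunc="nunique", fill_value=0, margins=True)
print(pv)
```

For AO the Matched and Mismatched cells each hold 1, but the All cell is the number of distinct frames for AO, not the sum of the two cells.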
pv_m = pd.pivot_table(m_df,columns='cross',index='label_x',values='frame', aggfunc=pd.Series.count, margins=True, fill_value=0)
Result
cross    Matched  Mismatched  All
label_x
AO             0           1    1
CL             1           0    1
GO             1           1    2
ICV            1           1    2
PL             0           2    2
All            3           5    8
Edit
If all you need is to have the All column being the sum of Matched and Mismatched, you can do it as follows:
Change your code of generating pv_m_unq without building margin:
pv_m_unq = pd.pivot_table(m_df,
                          columns='cross',
                          index='label_x',
                          values='frame',
                          aggfunc=pd.Series.nunique, fill_value=0)
Then, we create the column All as the sum of Matched and Mismatched for each row, as follows:
pv_m_unq['All'] = pv_m_unq['Matched'] + pv_m_unq['Mismatched']
Finally, create the row All as the sum of Matched and Mismatched for each column and append it as the last row, as follows:
row_All = pd.Series({'Matched': pv_m_unq['Matched'].sum(),
                     'Mismatched': pv_m_unq['Mismatched'].sum(),
                     'All': pv_m_unq['All'].sum()},
                    name='All')
# DataFrame.append was removed in pandas 2.0; pd.concat adds the row instead
pv_m_unq = pd.concat([pv_m_unq, row_All.to_frame().T])
Result:
print(pv_m_unq)
         Matched  Mismatched  All
label_x
AO             1           3    4
CL             1           2    3
GO             1           1    2
ICV            2           4    6
PL             1           5    6
TI             2           3    5
All            8          18   26
You can use the isin() function like this:
df3 = df1[df1.label.isin(df2.label)]
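Note that isin on the label column alone ignores the frame. One possible way to compare on frame and label together without the row multiplication described in the question is a merge on both keys with indicator=True, after dropping duplicate frame+label pairs. The sample frames below are made up, and whether duplicates should be dropped is an assumption about what is being counted.

```python
import numpy as np
import pandas as pd

df1 = pd.DataFrame({"frame": [1, 1, 2, 2], "label": ["GO", "PL", "CL", "CL"]})
df2 = pd.DataFrame({"frame": [1, 2, 2], "label": ["GO", "CL", "TI"]})

# One row per distinct (frame, label) pair on each side.
left = df1.drop_duplicates(["frame", "label"])
right = df2.drop_duplicates(["frame", "label"])

# indicator=True adds a _merge column: "both" or "left_only" for a left merge.
cmp = left.merge(right, on=["frame", "label"], how="left", indicator=True)
cmp["cross"] = np.where(cmp["_merge"] == "both", "Matched", "Mismatched")

counts = pd.crosstab(cmp["label"], cmp["cross"], margins=True)
print(counts)
```

Because crosstab counts rows, its All margins are plain sums of the Matched and Mismatched cells.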

Finding Top-n based on a column in Dataframe and lumping the rest into others

I have a dataframe that looks like this:
I would like to create another one that looks like this (for the purposes of plotting and reporting) :
I am able to get this done but it feels ultra clunky and inelegant. I would really appreciate it if someone could suggest a nice and pythonic way of going about it. My code is below. Thx
import pandas as pd
import numpy as np
import random
#Creating a sample dataset
alphabet = [chr(letter) for letter in range(97,108)]
scores = [random.randint(0,15) for item in range(11)]
res = dict(zip(alphabet, scores))
test_df = pd.DataFrame(list(res.items()),columns = ['Name','Score'])
#Sorting the dataframe in descending order based on column of interest
test_df = test_df.sort_values(by=['Score'], ascending = False)
#Creating a column of rank, where low rank equals high number
test_df['ranking'] = np.arange(1,len(test_df)+1)
# converting the rank into top-5 and then everything else 6
test_df['ranking'] = test_df['ranking'].apply(lambda x: x if x < 6 else 6)
# Grouping by rank columns, this lumps everything else together
test_df_group = test_df.groupby(['ranking']).agg({"Name" : ''.join, "Score" : sum})
#renaming the data as "others"
test_df_group['Name'] = test_df_group['Name'].apply(lambda x: x if len(x) < 2 else "Others")
nlargest and append
We can use nlargest to select the top n rows (ordered by the Score column in descending order), then create a dictionary that holds the aggregation of the remaining rows, and finally append this dictionary to the topn dataframe to get the desired result. DataFrame.append was removed in pandas 2.0, so the row is added with pd.concat here:
topn = test_df.nlargest(5, 'Score')
remaining = {'Name': 'Others', 'Score': test_df['Score'].drop(topn.index).sum()}
pd.concat([topn, pd.DataFrame([remaining])], ignore_index=True)
     Name  Score
0       h     15
1       b     10
2       a      9
3       k      9
4       f      8
5  Others     32
We could solve this by combining nlargest with np.where and groupby:
(test_df
 .assign(Name=lambda df: np.where(df.Score.isin(df.Score.nlargest()),
                                  df.Name,
                                  'others'))
 .groupby('Name', as_index=False)
 .sum()
)
     Name  Score
0       a     12
1       f     10
2       h     10
3       j     15
4       k     14
5  others     26
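Both answers can be wrapped in a small reusable helper. The sketch below is one possible version with fixed sample scores so the result is reproducible; the function name top_n_with_others and its parameters are hypothetical.

```python
import pandas as pd

def top_n_with_others(frame, n, name_col="Name", score_col="Score"):
    """Keep the n highest-scoring rows and lump the rest into one 'Others' row."""
    top = frame.nlargest(n, score_col)
    rest = frame.drop(top.index)
    others = pd.DataFrame({name_col: ["Others"],
                           score_col: [rest[score_col].sum()]})
    return pd.concat([top[[name_col, score_col]], others], ignore_index=True)

# Assumed sample data instead of the random scores from the question.
sample = pd.DataFrame({"Name": list("abcdefg"),
                       "Score": [9, 10, 3, 1, 4, 8, 15]})
result = top_n_with_others(sample, 3)
print(result)
```

Unlike the isin-based variant, nlargest always returns exactly n rows, so tied scores at the cut-off do not enlarge the top group.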

Merge subgroup into adjacent subgroup after groupby

If we run the following code
np.random.seed(0)
features = ['f1','f2','f3']
df = pd.DataFrame(np.random.rand(5000,4), columns=features+['target'])
for f in features:
    df[f] = np.digitize(df[f], bins=[0.13,0.66])
df['target'] = np.digitize(df['target'], bins=[0.5]).astype(float)
df.groupby(features)['target'].agg(['mean','count']).head(9)
We get average values for each grouping of the feature set:
              mean  count
f1 f2 f3
0  0  0   0.571429      7
      1   0.414634     41
      2   0.428571     28
   1  0   0.490909     55
      1   0.467337    199
      2   0.486726    113
   2  0   0.518519     27
      1   0.446281    121
      2   0.541667     72
In the table above, some of the groups have too few observations, and I want to merge them into an 'adjacent' group according to some rules. For example, I may want to merge the group [0,0,0] with the group [0,0,1] since it has no more than 30 observations. Is there a good way to perform such group combinations based on column values without creating a separate dictionary? More specifically, I may want to merge the group with the smallest count into its adjacent group (the next group in index order) until the total number of groups is no more than 10.
A simple way to do it is with a for loop over the indexes that meet your condition:
df_group = df.groupby(features)['target'].agg(['mean','count'])
# First reset_index to make manipulation easier
df_group = df_group.reset_index()
list_indexes = df_group[df_group['count'] <= 58].index.values  # put any value you want
# loop over list_indexes
for ind in list_indexes:
    # check your condition again, in case merging a row at a previous
    # iteration has increased the count above your criterion;
    # the last row has no next row to merge into, so it is skipped
    if ind + 1 in df_group.index and df_group.loc[ind, 'count'] <= 58:
        # add the count value to the next row
        df_group.loc[ind + 1, 'count'] += df_group.loc[ind, 'count']
        # do anything you want on mean
        # drop the row
        df_group = df_group.drop(axis=0, index=ind)
# Reindex your df
df_group = df_group.set_index(features)
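The loop above leaves the mean untouched. For the rule stated in the question (merge the smallest group into the next one until at most a given number of groups remain), a sketch could look as follows. The helper name merge_small_groups, the count-weighted mean, the fallback of merging the last row into the previous one, and the sample numbers are all assumptions.

```python
import pandas as pd

def merge_small_groups(stats, max_groups):
    """Merge the smallest-count row into its neighbour until max_groups rows remain."""
    out = stats.reset_index()
    # Work with totals so that merged means stay count-weighted.
    out["total"] = out["mean"] * out["count"]
    while len(out) > max(max_groups, 1):
        pos = int(out["count"].to_numpy().argmin())
        # Merge into the next row, or into the previous one for the last row.
        target = pos + 1 if pos + 1 < len(out) else pos - 1
        out.loc[target, "count"] += out.loc[pos, "count"]
        out.loc[target, "total"] += out.loc[pos, "total"]
        out = out.drop(index=pos).reset_index(drop=True)
    out["mean"] = out["total"] / out["count"]
    return out.drop(columns="total")

# Assumed sample statistics with a single-level index for brevity.
stats = pd.DataFrame({"mean": [0.5, 0.4, 0.6, 0.5], "count": [7, 41, 28, 55]},
                     index=pd.Index(["a", "b", "c", "d"], name="group"))
merged = merge_small_groups(stats, max_groups=2)
print(merged)
```

The same function should also work on the MultiIndex result of the groupby in the question, since reset_index turns the feature levels into ordinary columns.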

Duplicating Pandas Dataframe rows based on string split, without iteration

I have a dataframe with a multiindex, where one of the columns represents multiple values, separated by a "|", like this:
            value
left right
x    a|b        2
y    b|c|d     -1
I want to duplicate the rows based on the "right" column, to get something like this:
            values
left right
x    a           2
x    b           2
y    b          -1
y    c          -1
y    d          -1
The solution I have to this feels wrong and runs slow, because it's based on iteration:
df2 = df.iloc[:0]
for index, row in df.iterrows():
    stgs = index[1].split("|")
    for s in stgs:
        row.name = (index[0], s)
        df2 = df2.append(row)
Is there a more vectorized way to do this?
Pandas Series have a dedicated str.split method to perform this operation.
split works only on a Series, so isolate the column you want:
SO = df['right']
Now 3 steps at once: split returns a Series of lists, apply(pd.Series, 1) converts the lists into columns, and stack stacks the columns into a single column:
S1 = SO.str.split(',').apply(pd.Series, 1).stack()
The only issue is that you now have a multi-index, so just drop the level you don't need:
S1.index = S1.index.droplevel(-1)
Full example
SO = pd.Series(data=["a,b", "b,c,d"])
S1 = SO.str.split(',').apply(pd.Series, 1).stack()
S1
Out[4]:
0  0    a
   1    b
1  0    b
   1    c
   2    d
S1.index = S1.index.droplevel(-1)
S1
Out[5]:
0 a
0 b
1 b
1 c
1 d
Building upon the answer by @xNoK, I am adding here the additional step needed to put the result back into the original DataFrame.
We have this data:
arrays = [['x', 'y'], ['a|b', 'b|c|d']]
midx = pd.MultiIndex.from_arrays(arrays, names=['left', 'right'])
df = pd.DataFrame(index=midx, data=[2, -1], columns=['value'])
df
Out[17]:
            value
left right
x    a|b        2
y    b|c|d     -1
First, let's generate the values for the right index level as @xNoK suggested. Take the index level we want to work on with index.levels[1] and convert it to a Series so that we can apply str.split(), then stack() it to get the result we want.
new_multi_idx_val = df.index.levels[1].to_series().str.split('|').apply(pd.Series).stack()
new_multi_idx_val
Out[18]:
right
a|b    0    a
       1    b
b|c|d  0    b
       1    c
       2    d
dtype: object
Now we want to put these values into the original DataFrame df. To do that, let's change its shape so that the result generated in the previous step can be copied in.
To do so, we can repeat each row (including its index) once for every value separated by | in the right level of the multi-index. df.index.levels[1].to_series().str.split('|').apply(lambda x: len(x)) gives the number of times each row (including its index) should be repeated. We pass this to index.repeat() and fetch the values at those indexes to create a new DataFrame df_repeted.
df_repeted = df.loc[df.index.repeat(df.index.levels[1].to_series().str.split('|').apply(lambda x: len(x)))]
df_repeted
Out[19]:
            value
left right
x    a|b        2
     a|b        2
y    b|c|d     -1
     b|c|d     -1
     b|c|d     -1
Now the df_repeted DataFrame is in a shape where we can change the index to get the answer we want.
Replace the index of df_repeted with the desired values as follows:
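The repeat step can be looked at in isolation. In this small sketch the standalone index idx and the names counts and repeated are made up for illustration.

```python
import pandas as pd

idx = pd.Index(["a|b", "b|c|d"], name="right")

# Number of pieces per label: 2 for "a|b", 3 for "b|c|d".
counts = idx.to_series().str.split("|", regex=False).str.len().to_numpy()

# Index.repeat repeats each element as many times as its count.
repeated = idx.repeat(counts)
print(list(repeated))
```

Passing regex=False makes sure the pipe character is treated as a literal separator.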
df_repeted.index = [df_repeted.index.droplevel(1), new_multi_idx_val]
df_repeted.index.rename(names=['left', 'right'], inplace=True)
df_repeted
Out[20]:
            value
left right
x    a          2
     b          2
y    b         -1
     c         -1
     d         -1
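In pandas versions that provide DataFrame.explode (0.25 and later), the same result can probably be reached more directly. This is a sketch of an alternative, not the method used in the answers above; it assumes the same df with index levels left and right.

```python
import pandas as pd

midx = pd.MultiIndex.from_arrays([["x", "y"], ["a|b", "b|c|d"]],
                                 names=["left", "right"])
df = pd.DataFrame({"value": [2, -1]}, index=midx)

result = (
    df.reset_index()
      # turn "a|b" into ["a", "b"] so that explode can split it into rows
      .assign(right=lambda d: d["right"].str.split("|", regex=False))
      .explode("right")
      .set_index(["left", "right"])
)
print(result)
```

Moving the index levels into columns first keeps the whole operation in ordinary column methods, at the cost of rebuilding the index at the end.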
