I have a dataframe with three columns. The first column specifies the group into which each row is classified. Each group normally consists of 3 data points (rows), but the last group may be "cut off" and contain fewer than three. In the real world, this could happen when the experiment or data collection process is cut off prematurely. In the example below, group 3 is cut off and only contains one data point.
import pandas as pd
data = {
"group_id": [0, 0, 0, 1, 1, 1, 2, 2, 2, 3],
"valueA": [420, 380, 390, 500, 270, 220, 150, 400, 330, 170],
"valueB": [50, 40, 45, 22, 20, 50, 10, 60, 90, 10]
}
# load data into a DataFrame object:
df = pd.DataFrame(data)
print(df)
I also have two lists with additional values.
x_list = [1, 3, 5]
y_list = [2, 4, 6]
I want to add these lists to my dataframe as new columns, and have the values repeat for each group. In other words, I want my output to look like this.
group_id valueA valueB x y
0 0 420 50 1 2
1 0 380 40 3 4
2 0 390 45 5 6
3 1 500 22 1 2
4 1 270 20 3 4
5 1 220 50 5 6
6 2 150 10 1 2
7 2 400 60 3 4
8 2 330 90 5 6
9 3 170 10 1 2
Notice that even though the number of rows is not divisible by the length of the lists, no rows are lost or added.
How do I achieve this without losing dataframe rows or adding new rows with NaN values?
You can use GroupBy.cumcount to generate an indexer, then use it to duplicate the values in order of the groups:
new = pd.DataFrame({'x': x_list, 'y': y_list})
idx = df.groupby('group_id').cumcount()
df[['x', 'y']] = new.reindex(idx).to_numpy()
Output:
group_id valueA valueB x y
0 0 420 50 1 2
1 0 380 40 3 4
2 0 390 45 5 6
3 1 500 22 1 2
4 1 270 20 3 4
5 1 220 50 5 6
6 2 150 10 1 2
7 2 400 60 3 4
8 2 330 90 5 6
9 3 170 10 1 2
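To see why this works, look at the intermediate indexer (a quick check, not part of the original answer):
idx = df.groupby('group_id').cumcount()
print(idx.tolist())
# [0, 1, 2, 0, 1, 2, 0, 1, 2, 0]
# new.reindex(idx) uses these positions as row labels into `new`, so each
# group pulls rows 0, 1, 2 of `new` in order, and the truncated group 3
# simply stops after row 0.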
As your lists have the same length as the groups (3), you can use:
df[['x', 'y']] = (pd.DataFrame({'x': x_list, 'y': y_list})
.reindex(df.groupby('group_id').cumcount().mod(3)).values)
print(df)
# Output
group_id valueA valueB x y
0 0 420 50 1 2
1 0 380 40 3 4
2 0 390 45 5 6
3 1 500 22 1 2
4 1 270 20 3 4
5 1 220 50 5 6
6 2 150 10 1 2
7 2 400 60 3 4
8 2 330 90 5 6
9 3 170 10 1 2
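The .mod(3) is what makes this variant safe if a group ever contained more than three rows: a plain cumcount would then run past the lists' index and reindex would produce NaN, whereas the modulo wraps the position around (the 3 is the list length, which this variant hard-codes):
# With the example data mod(3) changes nothing, but a 4th row in a group
# would map back to position 0 instead of the missing label 3.
print(df.groupby('group_id').cumcount().mod(3).tolist())
# [0, 1, 2, 0, 1, 2, 0, 1, 2, 0]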
Let's use np.resize:
import pandas as pd
import numpy as np
data = {
"group_id": [0, 0, 0, 1, 1, 1, 2, 2, 2, 3],
"valueA": [420, 380, 390, 500, 270, 220, 150, 400, 330, 170],
"valueB": [50, 40, 45, 22, 20, 50, 10, 60, 90, 10]
}
# load data into a DataFrame object:
df = pd.DataFrame(data)
df['x'] = np.resize(x_list, len(df))
df['y'] = np.resize(y_list, len(df))
df
Output:
group_id valueA valueB x y
0 0 420 50 1 2
1 0 380 40 3 4
2 0 390 45 5 6
3 1 500 22 1 2
4 1 270 20 3 4
5 1 220 50 5 6
6 2 150 10 1 2
7 2 400 60 3 4
8 2 330 90 5 6
9 3 170 10 1 2
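This works because np.resize repeats the input cyclically until the requested length is reached:
import numpy as np

print(np.resize([1, 3, 5], 10))
# [1 3 5 1 3 5 1 3 5 1]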
An alternative in case the lists have different lengths:
def duplicate_to_length(values, length):
    # Tile the list until it is at least `length` long, then truncate.
    return (values * (length // len(values) + 1))[:length]

df['x'] = duplicate_to_length(x_list, df.shape[0])
df['y'] = duplicate_to_length(y_list, df.shape[0])
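An equivalent sketch using itertools.cycle, if you prefer the standard library's spelling of "repeat until long enough" (my variation, not from the original answer):
from itertools import cycle, islice

# cycle repeats the list forever; islice cuts it off at len(df) items.
df['x'] = list(islice(cycle(x_list), len(df)))
df['y'] = list(islice(cycle(y_list), len(df)))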
(Versions: Python 3.10.4, Pandas 1.4.3, NumPy 1.23.1)
I have this dataframe:
df = pd.DataFrame({
"Group" : ["A", "A", "A", "A", "B", "B", "B", "B"],
"Mass" : [100, 200, 300, 400, 100, 200, 300, 400],
"Speed" : [ 5, 3, 1, 7, 2, 2, 4, 9]
})
Group Mass Speed
0 A 100 5
1 A 200 3
2 A 300 1
3 A 400 7
4 B 100 2
5 B 200 2
6 B 300 4
7 B 400 9
And I have a function that takes a (sub-)dataframe and returns a scalar:
def max_speed_of_small_masses(sub_df):
speed_of_small_masses = sub_df.loc[sub_df["Mass"] < 400, "Speed"]
return speed_of_small_masses.max()
I want to apply this function to every group and add the results as a new column to the dataframe.
expected_output = pd.DataFrame({
"Group" : ["A", "A", "A", "A", "B", "B", "B", "B"],
"Mass" : [100, 200, 300, 400, 100, 200, 300, 400],
"Speed" : [ 5, 3, 1, 7, 2, 2, 4, 9],
"SmallMax" : [ 5, 5, 5, 5, 4, 4, 4, 4]
})
Group Mass Speed SmallMax
0 A 100 5 5
1 A 200 3 5
2 A 300 1 5
3 A 400 7 5
4 B 100 2 4
5 B 200 2 4
6 B 300 4 4
7 B 400 9 4
So first I group by Group:
grouped = df.groupby(["Group"])[["Mass", "Speed"]]
I cannot use apply in a single step here, since it gives
applied = grouped.apply(max_speed_of_small_masses)
Group
A 5
B 4
which doesn't have the proper shape, and if I tried to add this as a column, I'd get NaNs (the assignment aligns on the index, and the group labels 'A'/'B' don't match df's RangeIndex of 0 through 7):
df["SmallMax"] = applied
Group Mass Speed SmallMax
0 A 100 5 NaN
1 A 200 3 NaN
2 A 300 1 NaN
3 A 400 7 NaN
4 B 100 2 NaN
5 B 200 2 NaN
6 B 300 4 NaN
7 B 400 9 NaN
But I cannot use transform either, since it cannot access columns of the sub-dataframe:
transformed = grouped.transform(max_speed_of_small_masses)
KeyError: 'Mass'
What is an elegant way to achieve this?
IMO, the best approach is to pre-process the data, replacing the non-small values with NaN before the groupby:
df["SmallMax"] = (df['Speed']
.where(df['Mass'].lt(400))
.groupby(df['Group']).transform('max')
)
output:
Group Mass Speed SmallMax
0 A 100 5 5.0
1 A 200 3 5.0
2 A 300 1 5.0
3 A 400 7 5.0
4 B 100 2 4.0
5 B 200 2 4.0
6 B 300 4 4.0
7 B 400 9 4.0
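To see the intermediate step, here is what the masking alone produces (a quick check; it is also why the result column becomes float):
masked = df['Speed'].where(df['Mass'].lt(400))
print(masked.tolist())
# [5.0, 3.0, 1.0, nan, 2.0, 2.0, 4.0, nan]
# transform('max') then broadcasts each group's NaN-ignoring max
# (5.0 for A, 4.0 for B) back onto every row of that group.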
A quite straightforward way would be to use merge, after filtering out the rows with 'Mass' of 400 or more and taking the maximum 'Speed' per 'Group':
pd.merge(df,
         df.loc[df["Mass"] < 400]
           .groupby('Group', as_index=False)['Speed'].max()
           .rename({'Speed': 'SmallMax'}, axis=1),
         on='Group', how='left')
prints:
Group Mass Speed SmallMax
0 A 100 5 5
1 A 200 3 5
2 A 300 1 5
3 A 400 7 5
4 B 100 2 4
5 B 200 2 4
6 B 300 4 4
7 B 400 9 4
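For clarity, this is the small lookup frame the merge builds before joining it back onto df (the same expression as above, just assigned to a name):
small_max = (df.loc[df["Mass"] < 400]
               .groupby('Group', as_index=False)['Speed'].max()
               .rename({'Speed': 'SmallMax'}, axis=1))
print(small_max)
#   Group  SmallMax
# 0     A         5
# 1     B         4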
You can try
out = (df.groupby(df['Group'])
.apply(lambda g: g.assign(SmallMax=g.loc[g["Mass"] < 400, 'Speed'].max())))
print(out)
Group Mass Speed SmallMax
0 A 100 5 5
1 A 200 3 5
2 A 300 1 5
3 A 400 7 5
4 B 100 2 4
5 B 200 2 4
6 B 300 4 4
7 B 400 9 4
I have a data frame like the following:
import random
import pandas as pd

values = random.sample(range(1, 101), 15)
df = pd.DataFrame({
    'x': [3, 3, 3, 3, 3, 3, 3, 3, 3, 4, 4, 4, 4, 4, 4],
    'n': [100, 100, 100, 'reference', 'reference', 'reference',
          500, 500, 500, 100, 100, 100,
          'reference', 'reference', 'reference'],
    'value': values
})
The values labeled as 'reference' in the n column are reference values, which I will eventually plot against. To help with this, I need to make a data frame that has the reference values in a different column, so columns = ['x', 'n', 'value', 'value_reference']
value_reference holds the reference value for all values of n that share the same x. Therefore, I want to make a data frame like the following:
desired_df = pd.DataFrame({
    'x': [3, 3, 3, 3, 3, 3, 4, 4, 4],
    'n': [100, 100, 100, 500, 500, 500, 100, 100, 100],
    'value': [values[i] for i in [0, 1, 2, 6, 7, 8, 9, 10, 11]],
    'value_reference': [values[i] for i in [3, 4, 5, 3, 4, 5, 12, 13, 14]]
})
I hard-coded the result here to make a reproducible example. However, I am looking for the correct way of doing this operation.
How can this be done?
One way might be this:
df["tick"] = df.groupby(["x", "n"]).cumcount()
numbers = df.loc[df["n"] != "reference"]
ref = df.loc[df["n"] == "reference"]
ref = ref.drop("n", axis=1).rename(columns={"value": "reference"})
out = numbers.merge(ref).drop("tick", axis=1)
out = out.sort_values(["x", "n"])
which gives me
In [282]: out
Out[282]:
x n value reference
0 3 100 6 67
2 3 100 9 29
4 3 100 34 51
1 3 500 42 67
3 3 500 36 29
5 3 500 12 51
6 4 100 74 5
7 4 100 48 37
8 4 100 7 70
Step by step, first we add a tick column so we know which row of value matches with which row of reference:
In [290]: df
Out[290]:
x n value tick
0 3 100 6 0
1 3 100 9 1
2 3 100 34 2
3 3 reference 67 0
4 3 reference 29 1
5 3 reference 51 2
6 3 500 42 0
7 3 500 36 1
8 3 500 12 2
9 4 100 74 0
10 4 100 48 1
11 4 100 7 2
12 4 reference 5 0
13 4 reference 37 1
14 4 reference 70 2
Then we separate out the value and reference parts of the table:
In [291]: numbers = df.loc[df["n"] != "reference"]
...: ref = df.loc[df["n"] == "reference"]
...: ref = ref.drop("n", axis=1).rename(columns={"value": "reference"})
...:
...:
In [292]: numbers
Out[292]:
x n value tick
0 3 100 6 0
1 3 100 9 1
2 3 100 34 2
6 3 500 42 0
7 3 500 36 1
8 3 500 12 2
9 4 100 74 0
10 4 100 48 1
11 4 100 7 2
In [293]: ref
Out[293]:
x reference tick
3 3 67 0
4 3 29 1
5 3 51 2
12 4 5 0
13 4 37 1
14 4 70 2
and then we merge; the merge aligns on the shared columns, which are "x" and "tick". A final sort cleans things up and we're done.
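If you prefer to spell out the keys, this is equivalent:
# merge aligns on the shared columns "x" and "tick" by default;
# naming them explicitly gives the same result.
out = numbers.merge(ref, on=["x", "tick"]).drop("tick", axis=1)
out = out.sort_values(["x", "n"])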
I have a pandas DataFrame like so:
import pandas as pd
df = pd.DataFrame({
'cohort': [1, 1, 1, 1, 2, 2, 2],
'age': [-1, 0, 1, 2, 0, 1, 2],
'bal': [100, 1000, 1400, 1500, 1000, 1200, 1300]
})
For each cohort, where applicable, I want to add the bal values where age is less than 0 to the bal value where age is zero. Ultimately I want df to look like this:
df
cohort age bal
1 1 -1 100
2 1 0 1100
3 1 1 1400
4 1 2 1500
5 2 0 1000
6 2 1 1200
7 2 2 1300
This is one way to achieve it in pandas. Note that the positional addition assumes the age < 0 rows pair up one-to-one with the age == 0 rows; the output below was evidently produced with a df in which cohort 2 also has an age -1 row (bal 120):
df.loc[df.age==0, 'bal'] = df.loc[df.age==0, 'bal'] + df.loc[df.age<0, 'bal'].values
df
Out[339]:
age bal cohort
0 -1 100 1
1 0 1100 1
2 1 1400 1
3 2 1500 1
4 -1 120 2
5 0 1120 2
6 1 1200 2
7 2 1300 2
Update: for the df in the question, where cohort 2 has no age < 0 row, sum bal over age <= 0 within each cohort instead:
df.loc[df.age==0, 'bal']=df.loc[df.age<=0].groupby('cohort').bal.sum().values
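For the df in the question, the intermediate per-cohort sums look like this (a quick check):
print(df.loc[df.age <= 0].groupby('cohort').bal.sum())
# cohort
# 1    1100
# 2    1000
# Name: bal, dtype: int64
# .values strips the cohort index so the two sums can be assigned
# positionally to the two age == 0 rows.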
I will describe my problem through an example:
x = pd.DataFrame.from_dict({'row':[5, 10, 12], 'val_x': [11,222, 333]})
y = pd.DataFrame.from_dict({'row':[2, 4, 9, 13], 'val_y': [1, 12, 123, 4]})
In [4]: x
row val_x
0 5 11
1 10 222
2 12 333
In [5]: y
row val_y
0 2 1
1 4 12
2 9 123
3 13 4
I want each row in x to be joined with the row in y that is immediately before it according to the row column (equal values are also allowed).
In other words, the output looks like
row val_x row_y val_y
0 5 11 4 12
1 10 222 9 123
2 12 333 9 123
I know I need some sort of special merge on the row columns, but I have no idea exactly how to express it.
Try using pd.merge_asof
pd.merge_asof(x, y, on='row', direction='backward').merge(y, on='val_y')
Out[828]:
row_x val_x val_y row_y
0 5 11 12 4
1 10 222 123 9
2 12 333 123 9
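A variation I find slightly cleaner (my own sketch, assuming the same x and y): renaming y's key up front keeps it in the result and avoids the second merge. Note that merge_asof requires both frames to be sorted on the key, which they are here.
out = pd.merge_asof(x, y.rename(columns={'row': 'row_y'}),
                    left_on='row', right_on='row_y',
                    direction='backward')
print(out)
#    row  val_x  row_y  val_y
# 0    5     11      4     12
# 1   10    222      9    123
# 2   12    333      9    123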
EDIT:
from itertools import product
import pandas as pd

# Build every (x.row, y.row) pair, keep only pairs where y.row <= x.row,
# then for each x.row keep the closest y.row (smallest difference).
DF = pd.DataFrame(list(product(x.row, y.row)), columns=['l1', 'l2'])
DF['DIFF'] = DF.l1 - DF.l2
DF = DF.loc[DF.DIFF >= 0, :]
DF = DF.sort_values(['l1', 'DIFF']).drop_duplicates(['l1'], keep='first')
x.merge(DF, left_on='row', right_on='l1', how='left') \
 .merge(y, left_on='l2', right_on='row')[['row_x', 'val_x', 'row_y', 'val_y']]
row_x val_x row_y val_y
0 5 11 4 12
1 10 222 9 123
2 12 333 9 123
I have a pandas dataframe which looks like this:
qseqid sseqid qstart qend
2 1 125 345
4 1 150 320
3 2 150 450
6 2 25 300
8 2 50 500
I would like to remove rows based on the values of other rows, with these criteria: a row (r1) must be removed if another row (r2) exists with the same sseqid and r1[qstart] > r2[qstart] and r1[qend] < r2[qend].
Is this possible with pandas?
df = pd.DataFrame({'qend': [345, 320, 450, 300, 500],
'qseqid': [2, 4, 3, 6, 8],
'qstart': [125, 150, 150, 25, 50],
'sseqid': [1, 1, 2, 2, 2]})
def remove_rows(df):
merged = pd.merge(df.reset_index(), df, on='sseqid')
mask = ((merged['qstart_x'] > merged['qstart_y'])
& (merged['qend_x'] < merged['qend_y']))
df_mask = ~df.index.isin(merged.loc[mask, 'index'].values)
result = df.loc[df_mask]
return result
result = remove_rows(df)
print(result)
yields
qend qseqid qstart sseqid
0 345 2 125 1
3 300 6 25 2
4 500 8 50 2
The idea is to use pd.merge to form a DataFrame with every pairing of rows
with the same sseqid:
In [78]: pd.merge(df.reset_index(), df, on='sseqid')
Out[78]:
index qend_x qseqid_x qstart_x sseqid qend_y qseqid_y qstart_y
0 0 345 2 125 1 345 2 125
1 0 345 2 125 1 320 4 150
2 1 320 4 150 1 345 2 125
3 1 320 4 150 1 320 4 150
4 2 450 3 150 2 450 3 150
5 2 450 3 150 2 300 6 25
6 2 450 3 150 2 500 8 50
7 3 300 6 25 2 450 3 150
8 3 300 6 25 2 300 6 25
9 3 300 6 25 2 500 8 50
10 4 500 8 50 2 450 3 150
11 4 500 8 50 2 300 6 25
12 4 500 8 50 2 500 8 50
Each row of merged contains data from two rows of df. You can then compare every two rows using
mask = ((merged['qstart_x'] > merged['qstart_y'])
& (merged['qend_x'] < merged['qend_y']))
and find the labels in df.index that do not match this condition:
df_mask = ~df.index.isin(merged.loc[mask, 'index'].values)
and select those rows:
result = df.loc[df_mask]
Note that this assumes df has a unique index.
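If you are not sure the index is unique, resetting it beforehand makes that assumption hold (a minimal sketch):
# remove_rows drops rows by index label, so duplicate labels
# would remove more rows than intended.
df = df.reset_index(drop=True)
result = remove_rows(df)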