Pandas: assign a column value based on a different column's value in the previous row - python

I have a df like this:

Product   Month  Order  Supply  Available
xx-xxx   202107    718    1531        813
xx-xxx   202108    668       0       -668
xx-xxx   202109   5030    2161      -2869
xx-xxx   202110    667       0       -667

and the resultDF I want needs to be like this:

Product   Month  Order  Supply  Available
xx-xxx   202107    718    1531        813
xx-xxx   202108    668     813        145
xx-xxx   202109   5030    2306      -2724
xx-xxx   202110    667   -2724      -3391

So, except for the first row, I want the previous row's Available value added to the Supply value, and the Order value then subtracted from it. E.g. for row 3 in resultDF, the Supply value (2306) is generated by adding the Available value of row 2 from resultDF (145) to the Supply value of row 3 from df (2161). Then Available is simply calculated as Supply - Order. Can anyone help me with how to generate resultDF?

Use cumsum: each row's Available is the previous Available plus Supply minus Order, so Available is just the cumulative sum of Supply minus the cumulative sum of Order, and the new Supply follows by adding Order back:
df["Available"] = df["Supply"].cumsum() - df["Order"].cumsum()
df["Supply"] = df["Available"] + df["Order"]
>>> df
  product   Month  Order  Supply  Available
0  xx-xxx  202107    718  1531.0      813.0
1    None  202108    668   813.0      145.0
2    None  202109   5030  2306.0    -2724.0
3    None  202110    667 -2724.0    -3391.0

Use cumsum to compute the right values.
Assuming:
- you want to fix your rows per product
- your rows are already ordered by (product, month)
# Setup
data = {'Product': ['xx-xxx', 'xx-xxx', 'xx-xxx', 'xx-xxx'],
        'Month': [202107, 202108, 202109, 202110],
        'Order': [718, 668, 5030, 667],
        'Supply': [1531, 0, 2161, 0],
        'Available': [813, -668, -2869, -667]}
df = pd.DataFrame(data)

# group_keys=False keeps the original index so the assignment aligns
df[['Supply', 'Available']] = df.groupby('Product', group_keys=False).apply(
    lambda x: pd.DataFrame({
        'Supply': x['Order'] + x['Supply'].cumsum() - x['Order'].cumsum(),
        'Available': x['Supply'].cumsum() - x['Order'].cumsum()
    })
)
Output:
>>> df
  Product   Month  Order  Supply  Available
0  xx-xxx  202107    718    1531        813
1  xx-xxx  202108    668     813        145
2  xx-xxx  202109   5030    2306      -2724
3  xx-xxx  202110    667   -2724      -3391
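If the cumsum identity feels opaque, here is a plain loop that encodes the question's recurrence directly; a minimal sketch for sanity-checking the vectorised answers above, not a replacement for them:

import pandas as pd

df = pd.DataFrame({'Product': ['xx-xxx'] * 4,
                   'Month': [202107, 202108, 202109, 202110],
                   'Order': [718, 668, 5030, 667],
                   'Supply': [1531, 0, 2161, 0]})

prev_available = 0
supply, available = [], []
for s, o in zip(df['Supply'], df['Order']):
    s += prev_available           # add previous row's Available to this row's Supply
    prev_available = s - o        # Available = Supply - Order
    supply.append(s)
    available.append(prev_available)

df['Supply'], df['Available'] = supply, available
# df now matches the resultDF shown above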

Related

Assign new column in DataFrame based on whether value is in a certain value range

I have two DataFrames as follows:
df_discount = pd.DataFrame(data={'Graduation' : np.arange(0,1000,100), 'Discount %' : np.arange(0,50,5)})
df_values = pd.DataFrame(data={'Sum' : [20,801,972,1061,1251]})
Now my goal is to get a new column df_values['New Sum'] that applies the corresponding discount to df_values['Sum'] based on the value of df_discount['Graduation']. If the Sum is >= the Graduation, the corresponding discount is applied.
Examples: Sum 801 should get a discount of 40%, resulting in 480.6; Sum 1061 gets 45%, resulting in 583.55.
I know I could write a function with if/else conditions returning the values. However, is there a better way to do this if you have very many different conditions?
You could try if pd.merge_asof() works for you:
df_discount = pd.DataFrame({
    'Graduation': np.arange(0, 1000, 100), 'Discount %': np.arange(0, 50, 5)
})
df_values = pd.DataFrame({'Sum': [20, 100, 101, 350, 801, 972, 1061, 1251]})

df_values = (
    pd.merge_asof(
        df_values, df_discount,
        left_on="Sum", right_on="Graduation",
        direction="backward"
    )
    .assign(New_Sum=lambda df: df["Sum"] * (1 - df["Discount %"] / 100))
    .drop(columns=["Graduation", "Discount %"])
)
Result (without the last .drop(columns=...) to see what's happening):
    Sum  Graduation  Discount %  New_Sum
0    20           0           0    20.00
1   100         100           5    95.00
2   101         100           5    95.95
3   350         300          15   297.50
4   801         800          40   480.60
5   972         900          45   534.60
6  1061         900          45   583.55
7  1251         900          45   688.05
pandas.cut() is made for problems like this where you need to segment your data into bins (i.e. discount % based on value range).
First define the column, the ranges, and the corresponding bins.
# The column we need to segment
col = df_values['Sum']
# The bin edges: [0, 100, 200, ..., 900, inf]; with right=False below these
# become [0, 100), [100, 200), ..., [900, inf), so a Sum equal to a
# Graduation value falls into the higher bin, matching the >= rule
graduation = np.append(df_discount['Graduation'], np.inf)
# For each range, the corresponding label (i.e. discount)
discount = df_discount['Discount %']
Now call pandas.cut() and do the discount calculation.
df_values['Discount %'] = pd.cut(col,
                                 graduation,
                                 labels=discount,
                                 right=False)
# Convert the categorical label to an int for calculation
df_values['Discount %'] = df_values['Discount %'].astype(int)
df_values['New Sum'] = df_values['Sum'] * (1 - df_values['Discount %'] / 100)
    Sum  Discount %  New Sum
0    20           0    20.00
1   801          40   480.60
2   972          45   534.60
3  1061          45   583.55
4  1251          45   688.05
You can use pandas.DataFrame.mask: wherever your condition is true, it replaces the value. But for that, your Sum column has to be inside the first dataframe.
https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.mask.html
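A minimal sketch of that idea, assuming you first give df_values a default discount column and then overwrite it threshold by threshold (the loop and the intermediate 'Discount %' column are illustrative additions, not part of the original answer):

import numpy as np
import pandas as pd

df_discount = pd.DataFrame({'Graduation': np.arange(0, 1000, 100),
                            'Discount %': np.arange(0, 50, 5)})
df_values = pd.DataFrame({'Sum': [20, 801, 972, 1061, 1251]})

# Start at 0% and overwrite wherever Sum clears the next threshold;
# higher thresholds win because they are applied last
df_values['Discount %'] = 0
for grad, disc in zip(df_discount['Graduation'], df_discount['Discount %']):
    df_values['Discount %'] = df_values['Discount %'].mask(
        df_values['Sum'] >= grad, disc)

df_values['New Sum'] = df_values['Sum'] * (1 - df_values['Discount %'] / 100)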

Calculate and add up Data from a reference dataframe

I have two pandas dataframes. The first one contains some data I want to multiply with the second dataframe, which is a reference table.
So in my example I want to get a new column in df1 for every column in my reference table - where each value is the row's data multiplied by the reference weights and added up.
Like this (Index 205368421 with R21 17): (1205 * 0.526499) + (7562 * 0.003115) + (1332 * 0.000267) = 658
In Excel VBA I iterated through both tables and did it that way - but it took very long. I've read that pandas is way better for this, without iterating.
df1 = pd.DataFrame({'Index': ['205368421', '206321177', '202574796', '200212811', '204376114'],
                    'L1.09A': [1205, 1253, 1852, 1452, 1653],
                    'L1.10A': [7562, 7400, 5700, 4586, 4393],
                    'L1.10C': [1332, 0, 700, 1180, 290]})
df2 = pd.DataFrame({'WorkerID': ['L1.09A', 'L1.10A', 'L1.10C'],
                    'R21 17': [0.526499, 0.003115, 0.000267],
                    'R21 26': [0.458956, 0, 0.001819]})
Index      L1.09A  L1.10A  L1.10C
205368421    1205    7562    1332
206321177    1253    7400       0
202574796    1852    5700     700
200212811    1452    4586    1180
204376114    1653    4393     290

WorkerID    R21 17    R21 26
L1.09A    0.526499  0.458956
L1.10A    0.003115  0
L1.10C    0.000267  0.001819
I want this:

Index      L1.09A  L1.10A  L1.10C  R21 17  R21 26
205368421    1205    7562    1332     658     555
206321177    1253    7400       0     683     575
202574796    1852    5700     700     993     851
200212811    1452    4586    1180     779     669
204376114    1653    4393     290     884     759
I would be okay with some hints. Someone told me this might be matrix multiplication, so .dot() would be helpful. Is this the right direction?
Edit:
I have now done the following:
df1 = df1.set_index('Index')
df2 = df2.set_index('WorkerID')
common_cols = list(set(df1.columns).intersection(df2.index))
df2 = df2.loc[common_cols]
df1_sorted = df1.reindex(sorted(df1.columns), axis=1)
df2_sorted = df2.sort_index(axis=0)
df_multiplied = df1_sorted @ df2_sorted
This works with my example dataframes, but not with my real ones, which have these dimensions: df1_sorted (10429, 69) and df2_sorted (69, 18).
It should work, but my df_multiplied is full of NaN.
Alright, I did it!
I had to replace all NaN with 0 first - otherwise any NaN in a row turns the whole dot product into NaN.
So the final solution is:
df1 = df1.set_index('Index')
df2 = df2.set_index('WorkerID')
common_cols = list(set(df1.columns).intersection(df2.index))
df2 = df2.loc[common_cols]
df1_sorted = df1.reindex(sorted(df1.columns), axis=1)
df2_sorted = df2.sort_index(axis=0)
df1_sorted= df1_sorted.fillna(0)
df2_sorted= df2_sorted.fillna(0)
df_multiplied = df1_sorted @ df2_sorted
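For reference, the @ operator on DataFrames is DataFrame.dot, which aligns the columns of the left frame with the index of the right one, so the last line can equivalently be written as below (a sketch reusing the frames built above):

# Equivalent to df1_sorted @ df2_sorted; pandas aligns df1_sorted's columns
# with df2_sorted's index and raises if the labels don't match
df_multiplied = df1_sorted.dot(df2_sorted)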

Selecting rows with the lowest values based on a combination of two columns in pandas

I'm not even sure if the title makes sense.
I have a pandas dataframe with 3 columns: x, y, time. There are a few thousand rows. Example below:
       x    y       time
0    225    0  20.295270
1    225    1  21.134015
2    225    2  21.382298
3    225    3  20.704367
4    225    4  20.152735
5    225    5  19.213522
.......
900  437  900  27.748966
901  437  901  20.898460
902  437  902  23.347935
903  437  903  22.011992
904  437  904  21.231041
905  437  905  28.769945
906  437  906  21.662975
.... and so on
What I want to do is retrieve the rows that have the smallest time for each y. Basically, for every element of y, I want to find the row with the smallest time value, but I want to exclude those that have time 0.0 (which happens when x has the same value as y).
So, for example, the fastest way to get to y-0 is by starting from x-225, and so on; it can therefore happen that the same x appears again, but for a different y.
e.g.
  x    y       time
225    0  20.295270
438    1  19.648954
 27   20   4.342732
  9  438  17.884423
225  907  24.560400
Up until now I tried groupby, but I'm only getting the same x as y.
print(df.groupby('id_y', sort=False)['time'].idxmin())
y
0 0
1 1
2 2
3 3
4 4
The one below just returns the df that I already have.
df.loc[df.groupby("id_y")["time"].idxmin()]
Just to point out one thing: I'm open to options other than groupby; if there are other ways, that is very good too.
You need to remove the rows where time equals 0 first, by boolean indexing, and then use your solution:
df = df[df['time'] != 0]
df2 = df.loc[df.groupby("y")["time"].idxmin()]
A similar alternative, filtering with query:
df = df.query('time != 0')
df2 = df.loc[df.groupby("y")["time"].idxmin()]
Or use sort_values with drop_duplicates:
df2 = df[df['time'] != 0].sort_values(['y','time']).drop_duplicates('y')

Calculate new column as the mean of other columns in pandas [duplicate]

This question already has answers here:
Row-wise average for a subset of columns with missing values
(3 answers)
Closed 5 years ago.
I have this data frame and I would like to calculate a new column as the mean of salary_1, salary_2 and salary_3:
df = pd.DataFrame({
    'salary_1': [230, 345, 222],
    'salary_2': [235, 375, 292],
    'salary_3': [210, 385, 260]
})
   salary_1  salary_2  salary_3
0       230       235       210
1       345       375       385
2       222       292       260
How can I do it in pandas in the most efficient way? Actually I have many more columns and I don't want to write this one by one.
Something like this:
   salary_1  salary_2  salary_3  salary_mean
0       230       235       210  (230+235+210)/3
1       345       375       385  ...
2       222       292       260  ...
Use .mean(). By specifying the axis you can take the average across each row (axis=1) or down each column (axis=0).
df['average'] = df.mean(axis=1)
df
returns
   salary_1  salary_2  salary_3     average
0       230       235       210  225.000000
1       345       375       385  368.333333
2       222       292       260  258.000000
If you only want the mean of a few you can select only those columns. E.g.
df['average_1_3'] = df[['salary_1', 'salary_3']].mean(axis=1)
df
returns
   salary_1  salary_2  salary_3  average_1_3
0       230       235       210        220.0
1       345       375       385        365.0
2       222       292       260        241.0
An easy way to solve this problem is shown below:
col = df.loc[:, "salary_1":"salary_3"]
where "salary_1" is the start column name and "salary_3" is the end column name.
df['salary_mean'] = col.mean(axis=1)
df
This will give you a new dataframe with a new column that shows the mean of all the other columns.
This approach is really helpful when you have a large set of columns, and also when you need to operate on only a selected range of columns rather than all of them.
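If, as here, the relevant columns share a name pattern, DataFrame.filter can select them without naming a start and end column. A small sketch; the like='salary' pattern is inferred from the example column names and assumes df still contains only the salary_1 to salary_3 columns:

# Select every column whose name contains 'salary', then average row-wise
df['salary_mean'] = df.filter(like='salary').mean(axis=1)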

How to filter pivot tables in Python

How do I filter pivot tables to return specific columns? Currently my dataframe is this:
print(table)
                     sum
Sex               Female  Male   All
Date (Intervals)
April                166   191   357
August               212   263   475
December             173   263   436
February             192   298   490
January              148   195   343
July                 189   260   449
June                 165   238   403
March                165   278   443
May                  236   253   489
November             167   247   414
October              185   287   472
September            175   306   481
All                 2173  3079  5252
I want to display the results of only the Male column. I tried the following code:
table.query('Sex == "Male"')
However, I got this error:
TypeError: Expected tuple, got str
How would I be able to filter my table by specified rows or columns?
It looks like table has a column MultiIndex:
     sum
Sex  Female  Male  All
One way to check if your table has a column MultiIndex is to inspect table.columns:
In [178]: table.columns
Out[178]:
MultiIndex(levels=[['sum'], ['All', 'Female', 'Male']],
           labels=[[0, 0, 0], [1, 2, 0]],
           names=[None, 'sex'])
To access a column of table you need to specify a value for each level of the MultiIndex:
In [179]: list(table.columns)
Out[179]: [('sum', 'Female'), ('sum', 'Male'), ('sum', 'All')]
Thus, to select the Male column, you would use
In [176]: table[('sum', 'Male')]
Out[176]:
date
April       42.0
August      34.0
December    32.0
...
Since the sum level is unnecessary, you could get rid of it by specifying the values parameter when calling df.pivot or df.pivot_table.
table2 = df.pivot_table(index='date', columns='sex', aggfunc='sum',
                        margins=True, values='sum')
# sex       Female  Male   All
# date
# April       40.0  40.0  80.0
# August      48.0  32.0  80.0
# December    48.0  44.0  92.0
For example,
import numpy as np
import pandas as pd
import calendar
np.random.seed(2016)
N = 1000
sex = np.random.choice(['Male', 'Female'], size=N)
date = np.random.choice(calendar.month_name[1:13], size=N)
df = pd.DataFrame({'sex':sex, 'date':date, 'sum':1})
# This reproduces a table similar to yours
table = df.pivot_table(index='date', columns='sex', aggfunc='sum', margins=True)
print(table[('sum', 'Male')])
# table2 has a single level Index
table2 = df.pivot_table(index='date', columns='sex', aggfunc='sum',
                        margins=True, values='sum')
print(table2['Male'])
Another way to remove the sum level would be to use table = table['sum'],
or table.columns = table.columns.droplevel(0).
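DataFrame.xs is one more way to pull a single label out of one level of a column MultiIndex without flattening the columns first (a sketch reusing the table built in the example above):

# Select the 'Male' label from the 'sex' level of the column MultiIndex;
# the remaining 'sum' level stays in the result's columns
male_counts = table.xs('Male', axis=1, level='sex')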
