Add column in dataframe from another dataframe doing some arithmetic calculations - python

I have a table in a pandas DataFrame df:

id  product_1  product_2  count
1   100        200        10
2   200        600        20
3   100        500        30
4   400        100        40
5   500        700        50
6   200        500        60
7   100        400        70
I also have another table in DataFrame df2:

product  price
100      5
200      10
300      15
400      20
500      25
600      30
700      35
I have to create a new column price_product_2 in my first df, taking the price values from df2 that correspond to product_2. I also have to find the percentage difference of product_2 with respect to product_1 and put it in one more column, %_diff. For example, if product_1 = 100 and product_2 = 200, then product_2 is 200% of the price of product 100. Similarly, if product_1 = 400 and product_2 = 100, it is a decline in price, so product_2 is -25% of product_1.
My final output should be:

df =
id  product_1  product_2  count  price_product_2  %_diff
1   100        200        10     10               +200
2   200        600        20     30               +300
3   100        500        30     25               +500
4   400        100        40     5                -25
5   500        700        50     35               +140
6   200        500        60     25               +250
7   100        400        70     20               +400
Any ideas how to achieve it? I was trying to use map:
df['price_product_2'] = df['product_2'].map(df2.set_index('product')['price'])
but I could only get that one column. How do I get the %_diff column?

Use merge (or map) twice, once for each product, then calculate the difference.
# Add prices for products 1 and 2
df3 = (df1
       .merge(df2, left_on='product_1', right_on='product')
       .merge(df2, left_on='product_2', right_on='product'))
# Calculate the percent difference
df3['pct_diff'] = (df3.price_y - df3.price_x) / df3.price_x
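Note that pct_diff as written above is a fraction (1.0 means a 100% increase). To express it as a percentage like the question's %_diff column, scaling by 100 is a minimal tweak:

# scale the fractional change to a percentage
df3['pct_diff'] = (df3.price_y - df3.price_x) / df3.price_x * 100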

Suppose you have the following data frames:
In [32]: df1
Out[32]:
   index  id  product_1  product_2  count
0      0   1        100        200     10
1      1   2        200        600     20
2      2   3        100        500     30
3      3   4        400        100     40
4      4   5        500        700     50
5      5   6        200        500     60
6      6   7        100        400     70

In [33]: df2
Out[33]:
   product  price
0      100      5
1      200     10
2      300     15
3      400     20
4      500     25
5      600     30
6      700     35
It is probably easier simply to set product as the index for df2:
In [35]: df2.set_index('product', inplace=True)

In [36]: df2
Out[36]:
         price
product
100          5
200         10
300         15
400         20
500         25
600         30
700         35
Then you can do things like the following:
In [37]: df2.loc[df1['product_2']]
Out[37]:
         price
product
200         10
600         30
500         25
100          5
700         35
500         25
400         20
Assign with the underlying values explicitly, or else alignment on the product index will screw things up:
In [38]: df1['price_product_2'] = df2.loc[df1['product_2']].values

In [39]: df1
Out[39]:
   index  id  product_1  product_2  count  price_product_2
0      0   1        100        200     10               10
1      1   2        200        600     20               30
2      2   3        100        500     30               25
3      3   4        400        100     40                5
4      4   5        500        700     50               35
5      5   6        200        500     60               25
6      6   7        100        400     70               20
For the percentage difference you can use vectorized operations; here the price is proportional to the product number, so dividing the product columns gives the same ratio as dividing the prices:
In [40]: df1.product_2 / df1.product_1 * 100
Out[40]:
0    200.0
1    300.0
2    500.0
3     25.0
4    140.0
5    250.0
6    400.0
dtype: float64
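Putting the pieces together as a runnable sketch, assuming the df1 and df2 shown above and the question's column name %_diff:

import pandas as pd

df1 = pd.DataFrame({'id': [1, 2, 3, 4, 5, 6, 7],
                    'product_1': [100, 200, 100, 400, 500, 200, 100],
                    'product_2': [200, 600, 500, 100, 700, 500, 400],
                    'count': [10, 20, 30, 40, 50, 60, 70]})
df2 = pd.DataFrame({'product': [100, 200, 300, 400, 500, 600, 700],
                    'price': [5, 10, 15, 20, 25, 30, 35]}).set_index('product')

# look up the price of product_2; .values sidesteps index alignment
df1['price_product_2'] = df2.loc[df1['product_2']].values
df1['%_diff'] = df1['product_2'] / df1['product_1'] * 100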

Solution with map using a dict d, dividing with div:
d = df2.set_index('product')['price'].to_dict()
df['price_product_2'] = df['product_2'].map(d)
df['price_product_1'] = df['product_1'].map(d)
df['diff'] = df['price_product_2'].div(df['price_product_1']).mul(100)
print(df)

   id  product_1  product_2  count  price_product_2  price_product_1   diff
0   1        100        200     10               10                5  200.0
1   2        200        600     20               30               10  300.0
2   3        100        500     30               25                5  500.0
3   4        400        100     40                5               20   25.0
4   5        500        700     50               35               25  140.0
5   6        200        500     60               25               10  250.0
6   7        100        400     70               20                5  400.0
But it seems only the division is necessary: because the prices are the product_1 and product_2 values multiplied by the same constant, dividing the product columns directly gives the same result:
df['diff1'] = df['product_2'].div(df['product_1']).mul(100)
print(df)

   id  product_1  product_2  count  diff1
0   1        100        200     10  200.0
1   2        200        600     20  300.0
2   3        100        500     30  500.0
3   4        400        100     40   25.0
4   5        500        700     50  140.0
5   6        200        500     60  250.0
6   7        100        400     70  400.0

Related

Groupby sequence in order by date, find the min, max based on other column value

I started to learn pandas 40 days ago, and I only know basic pandas functions. I have a data frame as shown below:
    ID Status        Date  Cost
0    1      F  2017-06-22   500
1    1      M  2017-07-22   100
2    1      P  2017-10-22   100
3    1      F  2018-06-22   600
4    1      P  2018-08-22   150
5    1      F  2018-10-22   120
6    1      F  2019-03-22   750
7    2      M  2017-06-29   200
8    2      F  2017-09-29   600
9    2      F  2018-01-29   500
10   2      M  2018-03-29   100
11   2      P  2018-08-29   100
12   2      M  2018-10-29   100
13   2      F  2018-12-29   500
14   3      M  2017-03-20   300
15   3      F  2018-06-20   700
16   3      P  2018-08-20   100
17   3      M  2018-10-20   250
18   3      F  2018-11-20   100
19   3      P  2018-12-20   100
20   3      F  2019-03-20   600
22   4      M  2017-08-10   800
23   4      F  2018-06-10   100
24   4      P  2018-08-10   120
25   4      F  2018-10-10   500
26   4      M  2019-01-10   200
27   4      F  2019-06-10   600
31   7      M  2017-08-10   800
32   7      F  2018-06-10   100
33   7      P  2018-08-10    20
34   7      F  2018-10-10   500
35   7      F  2019-01-10   200
The data set is sorted by ID and Date. Please note that the last Status of every ID is F. From the above data frame I would like to prepare the data frame below:
ID  SLS  Cost#SLS  Min_Cost  Max_Cost  Avg_Cost
1   F    120       100       600       261.67
2   M    100       100       600       266.67
3   P    100       100       700       258.33
4   M    200       100       800       344.00
7   F    500       20        800       355.00
SLS = Second Last Status. Please note that Min, Max and Avg Cost are calculated without considering the last row of each ID. Then replace Cost#SLS with 1000 wherever SLS == F.
The expected data frame is as shown below.
ID  SLS  Cost#SLS  Min_Cost  Max_Cost  Avg_Cost
1   F    1000      100       600       261.67
2   M    100       100       600       266.67
3   P    100       100       700       258.33
4   M    200       100       800       344.00
7   F    1000      20        800       355.00
Here is one way, slightly modifying piR's answer:

s = df[df.ID.duplicated(keep='last')].groupby('ID').agg(
        {'Status': ['last'], 'Cost': ['last', 'min', 'max', 'mean']})
s.loc[s[('Status', 'last')] == 'F', ('Cost', 'last')] = 1000
s
   Status  Cost
     last  last  min  max        mean
ID
1       F  1000  100  600  261.666667
2       M   100  100  600  266.666667
3       P   100  100  700  258.333333
4       M   200  100  800  344.000000
7       F  1000   20  800  355.000000
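To get the flat column names from the expected output, the MultiIndex columns can be renamed afterwards; a minimal sketch, assuming the s computed above:

# flatten the two-level columns into the requested names
s.columns = ['SLS', 'Cost#SLS', 'Min_Cost', 'Max_Cost', 'Avg_Cost']
s = s.reset_index().round(2)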

Python pandas: groupby and divide by the first value of each group

I have a pandas dataframe like this:

>data
ID  Distance  Speed
1   100       40
1   200       20
1   200       10
2   400       20
2   500       30
2   100       40
2   600       20
2   700       90
3   800       80
3   700       10
3   400       20
I want to group the table by ID and create a new column Time by dividing each value in the Distance column by the first value of the Speed column within each ID group. The result should look like this:

>data
ID  Distance  Speed  Time
1   100       40     2.5
1   200       20     5
1   200       10     5
2   400       20     20
2   500       30     25
2   100       40     5
2   600       20     30
2   700       90     35
3   800       80     10
3   700       10     8.75
3   400       20     5
My attempt:
data['Time'] = data['Distance'] / data.loc[data.groupby('ID')['Speed'].head(1).index, 'Speed']
But the result is not correct. How do you do it?
Use transform with 'first' to return a Series the same length as the original df:

data['Time'] = data['Distance'] / data.groupby('ID')['Speed'].transform('first')
Or use drop_duplicates with map:
s = data.drop_duplicates('ID').set_index('ID')['Speed']
data['Time'] = data['Distance'] / data['ID'].map(s)
print(data)

    ID  Distance  Speed   Time
0    1       100     40   2.50
1    1       200     20   5.00
2    1       200     10   5.00
3    2       400     20  20.00
4    2       500     30  25.00
5    2       100     40   5.00
6    2       600     20  30.00
7    2       700     90  35.00
8    3       800     80  10.00
9    3       700     10   8.75
10   3       400     20   5.00
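For reference, the original attempt misbehaves because the .loc lookup returns a Series indexed only at the first row of each group, so the division aligns on index labels and yields NaN for the remaining rows. A map-based variant of the same idea, as a sketch assuming the data above:

first_speed = data.groupby('ID')['Speed'].first()  # Series indexed by ID
data['Time'] = data['Distance'] / data['ID'].map(first_speed)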

Complex Map operation on two dataframes in pandas with condition

I have a table in a pandas df:

id  prod1  prod2  count
1   10     30     100
2   10     20     200
3   20     10     200
4   30     10     100
5   30     40     300
I also have another table in df2:

product  price  master_product
1000     1      10
5000     2      10
2000     2      20
9000     5      20
8000     1      20
30       3      0
4000     4      50
Check whether prod1 and prod2 belong to the values in master_product. If yes, I want to replace prod1 and prod2 in my first df with the cheapest product under that master_product. If the prod1 and prod2 values do not match any value in master_product, leave them as they are. I am looking for this final table:
id  prod1  prod2  count
1   1000   4000   100
2   1000   8000   200
3   8000   1000   200
4   30     1000   100   # since 30 is not in master_product, leave as is
5   30     40     300
I was trying to use the map function to achieve this, but I could only get this far:
df['prod1'] = df['prod1'].map(df2.set_index('master_product')['product'])
df['prod2'] = df['prod2'].map(df2.set_index('master_product')['product'])
but it will try to replace every value in prod1 and prod2 with the matching product from df2.
Any ideas how to achieve this?
You can first reduce the second DataFrame (called df1 here) to the row with the minimal price per master_product, using groupby with idxmin to get the index of the cheapest product in each group:
df1 = df1.loc[df1.groupby('master_product')['price'].idxmin()]
print(df1)

   product  price  master_product
5       30      3               0
0     1000      1              10
4     8000      1              20
6     4000      4              50
Create a dict for mapping:

d = df1.set_index('master_product')['product'].to_dict()
print(d)

{0: 30, 10: 1000, 20: 8000, 50: 4000}
Finally map, and fill in the values that are missing from the mapping with the originals via combine_first:
df.prod1 = df.prod1.map(d).combine_first(df.prod1)
df.prod2 = df.prod2.map(d).combine_first(df.prod2)
print(df)

   id   prod1   prod2  count
0   1  1000.0    30.0    100
1   2  1000.0  8000.0    200
2   3  8000.0  1000.0    200
3   4    30.0  1000.0    100
4   5    30.0    40.0    300
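The map step introduces NaN before combine_first fills them in, so prod1 and prod2 come back as floats; if integer dtypes are wanted, a cast restores them (a small sketch):

# restore integer dtype after the NaN-introducing map
df[['prod1', 'prod2']] = df[['prod1', 'prod2']].astype(int)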

Pandas: Delete rows based on other rows

I have a pandas dataframe which looks like this:

qseqid  sseqid  qstart  qend
2       1       125     345
4       1       150     320
3       2       150     450
6       2       25      300
8       2       50      500
I would like to remove rows based on the values of other rows, with this criterion: a row r1 must be removed if another row r2 exists with the same sseqid, r1[qstart] > r2[qstart], and r1[qend] < r2[qend]. Is this possible with pandas?
import pandas as pd

df = pd.DataFrame({'qend': [345, 320, 450, 300, 500],
                   'qseqid': [2, 4, 3, 6, 8],
                   'qstart': [125, 150, 150, 25, 50],
                   'sseqid': [1, 1, 2, 2, 2]})

def remove_rows(df):
    # pair every row with every other row that shares its sseqid
    merged = pd.merge(df.reset_index(), df, on='sseqid')
    mask = ((merged['qstart_x'] > merged['qstart_y'])
            & (merged['qend_x'] < merged['qend_y']))
    # keep rows whose index label never appears as a contained row
    df_mask = ~df.index.isin(merged.loc[mask, 'index'].values)
    result = df.loc[df_mask]
    return result

result = remove_rows(df)
print(result)
yields
   qend  qseqid  qstart  sseqid
0   345       2     125       1
3   300       6      25       2
4   500       8      50       2
The idea is to use pd.merge to form a DataFrame with every pairing of rows
with the same sseqid:
In [78]: pd.merge(df.reset_index(), df, on='sseqid')
Out[78]:
    index  qend_x  qseqid_x  qstart_x  sseqid  qend_y  qseqid_y  qstart_y
0       0     345         2       125       1     345         2       125
1       0     345         2       125       1     320         4       150
2       1     320         4       150       1     345         2       125
3       1     320         4       150       1     320         4       150
4       2     450         3       150       2     450         3       150
5       2     450         3       150       2     300         6        25
6       2     450         3       150       2     500         8        50
7       3     300         6        25       2     450         3       150
8       3     300         6        25       2     300         6        25
9       3     300         6        25       2     500         8        50
10      4     500         8        50       2     450         3       150
11      4     500         8        50       2     300         6        25
12      4     500         8        50       2     500         8        50
Each row of merged contains data from two rows of df. You can then compare every two rows using
mask = ((merged['qstart_x'] > merged['qstart_y'])
        & (merged['qend_x'] < merged['qend_y']))
and find the labels in df.index that do not match this condition:
df_mask = ~df.index.isin(merged.loc[mask, 'index'].values)
and select those rows:
result = df.loc[df_mask]
Note that this assumes df has a unique index.
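If the index might not be unique, resetting it first keeps the isin filtering sound; a minimal sketch:

result = remove_rows(df.reset_index(drop=True))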

Multiple new columns dependent on other column value

I have a dataframe that looks like this:
Node  Node1  Length  Spaces  Dist     T
1     2      600     30      300    100
1     3      400     20      200    100
2     1      600     30      300    100
2     6      500     25      250    400
3     1      400     20      200    100
3     4      400     20      200    200
3     12     400     20      200    200
4     3      400     20      200    200
4     5      200     10      100    500
4     11     600     30      300   1400
5     4      200     10      100    500
5     6      400     20      200    200
5     9      500     25      250    800
6     2      500     25      250    400
6     5      400     20      200    200
6     8      200     10      100    800
This tells us that, for example in the first row, there are 30 spaces between nodes 1 and 2. How could I create, say, 30 new columns, each with a value of 1, to represent each space separately? Then do the same for each row.
The code below should work (column 'A' plays the role of your 'Spaces' column):
import numpy as np
import pandas as pd

df = pd.DataFrame(np.random.randint(0, 10, size=(100, 4)), columns=list('ABCD'))
max_val = df['A'].max()
for itr in range(max_val):
    colname = 'A%d' % itr
    # indicator column: 1 when the row's 'A' value is at least itr
    df[colname] = (df['A'] >= itr).astype('int')
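Applied to the question's frame, the same pattern might look like this; a sketch assuming the Spaces column above, with hypothetical column names space_1 through space_N:

# one indicator column per space: space_i is 1 when the row has at least i spaces
max_spaces = int(df['Spaces'].max())
for i in range(1, max_spaces + 1):
    df['space_%d' % i] = (df['Spaces'] >= i).astype(int)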
