I need to do a 'pandas non-equi join':
each row of the first table should be joined with the row of the second table whose range contains its value.
first_table
EMPLOYEE_ID SALARY
100 3000.00
101 17000.00
102 17000.00
103 9000.00
104 6000.00
105 4800.00
106 4800.00
………….. …………
………………. …………
second_table
grade_id lowest_sal highest_sal grade_level
1 0 3500 GRADE-A
2 3501 7000 GRADE-B
3 7001 10000 GRADE-C
4 10000 20000 GRADE-D
Need_table(OUTPUT):
EMPLOYEE_ID SALARY grade_level
115 3000 GRADE-A
116 17000 GRADE-D
117 17000 GRADE-D
118 9000 GRADE-C
119 6000 GRADE-B
125 4800 GRADE-B
126 4800 GRADE-B
The equivalent SQL query is:
SELECT f.EMPLOYEE_ID,
f.SALARY,
s.grade_level
FROM first_table f JOIN second_table s
ON f.SALARY BETWEEN s.lowest_sal AND s.highest_sal
I can't use the 'pd.merge' method to join the tables because they don't have any common column.
Please help me find a method.
Thanks
If df1 is your first table and df2 is your second table, you could, for example, do this:
import numpy as np

# d['index'] holds grade_level; d['data'] rows hold [grade_id, lowest_sal, highest_sal]
d = df2.set_index('grade_level').to_dict('split')

# take the first grade whose [lowest_sal, highest_sal] range contains the salary
df1['GRADE'] = df1['SALARY'].apply(
    lambda x: next((c for i, c in enumerate(d['index']) if d['data'][i][1] <= x <= d['data'][i][2]), np.nan)
)
print(df1)
Prints:
EMPLOYEE_ID SALARY GRADE
0 100 3000.0 GRADE-A
1 101 17000.0 GRADE-D
2 102 17000.0 GRADE-D
3 103 9000.0 GRADE-C
4 104 6000.0 GRADE-B
5 105 4800.0 GRADE-B
6 106 4800.0 GRADE-B
One option is conditional_join from pyjanitor, which avoids a Cartesian join (helpful for memory and performance, depending on the data size):
# pip install pyjanitor
import pandas as pd
import janitor
(first_table
 .astype({'SALARY': int})
 .conditional_join(
     second_table,
     ('SALARY', 'lowest_sal', '>='),
     ('SALARY', 'highest_sal', '<='))
 .loc[:, ['EMPLOYEE_ID', 'SALARY', 'grade_level']]
)
EMPLOYEE_ID SALARY grade_level
0 100 3000 GRADE-A
1 101 17000 GRADE-D
2 102 17000 GRADE-D
3 103 9000 GRADE-C
4 104 6000 GRADE-B
5 105 4800 GRADE-B
6 106 4800 GRADE-B
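If you prefer to stay in plain pandas, a minimal sketch using pd.IntervalIndex could look like the following. It assumes the salary bands do not overlap (in the sample, grade 4 starts at 10000 while grade 3 ends at 10000, so one boundary would need adjusting) and that every salary falls inside some band:
import pandas as pd

# Salary bands as closed intervals, mirroring SQL's BETWEEN
bins = pd.IntervalIndex.from_arrays(second_table['lowest_sal'],
                                    second_table['highest_sal'],
                                    closed='both')

# Position of the band containing each salary (-1 would mean "no band matched")
pos = bins.get_indexer(first_table['SALARY'])
first_table['grade_level'] = second_table['grade_level'].to_numpy()[pos]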
I have a pandas dataframe, shown below:
ID Price
100 1040.0
101 1025.0
102 750.0
103 891.0
104 924.0
Expected output shown below
ID Price Price_new
100 1040.0 1050
101 1025.0 1050
102 750.0 750
103 891.0 900
104 924.0 900
This is what I have done, but it's not what I want. I want to round to the nearest fifty in such a way that 1025 rounds up to 1050.
df['Price_new'] = (df['Price'] / 50).round().astype(int) * 50
This is due to Python 3's round, which rounds halves to the nearest even number (banker's rounding). You can work around it with np.where:
import numpy as np

s = df['Price'] % 50                                      # distance above the previous multiple of 50
df['new'] = df['Price'] + np.where(s >= 25, 50 - s, -s)   # round up from the halfway point
df
Out[33]:
ID Price new
0 100 1040 1050
1 101 1025 1050
2 102 750 750
3 103 891 900
4 104 924 900
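If you prefer to keep the division-based formula from the question, a minimal sketch (my own variant, not part of the answer above) that sidesteps banker's rounding by adding half a step before flooring:
import numpy as np
import pandas as pd

df = pd.DataFrame({'ID': [100, 101, 102, 103, 104],
                   'Price': [1040.0, 1025.0, 750.0, 891.0, 924.0]})

# floor(x / 50 + 0.5) always rounds halves upward, so 1025 -> 1050
df['Price_new'] = (np.floor(df['Price'] / 50 + 0.5) * 50).astype(int)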
Here is my suggestion:
import pandas as pd

dt = pd.DataFrame({'ID': [100, 101, 102, 103, 104],
                   'Price': [1040, 1025, 750, 891, 924]})

# VERSION 1: add 1 before rounding so exact halves are pushed up
dt['Price_new'] = round((dt['Price'] + 1) / 50).astype(int) * 50

# VERSION 2: strip the remainder, then add 50 back when the remainder rounds up
dt['Price_new_v2'] = dt['Price'] - dt['Price'].map(lambda x: x % 50) + \
    dt['Price'].map(lambda x: round(((x % 50) + 1) / 50)) * 50
ID Price Price_new Price_new_v2
0 100 1040 1050 1050
1 101 1025 1050 1050
2 102 750 750 750
3 103 891 900 900
4 104 924 900 900
Just add 1 in your math and you will get the correct answer. There is another way to do it as well; in my opinion the first version is more understandable than the second, even though the second uses the modulo operator.
I have a dataframe with the open, high, low, and close prices of a stock. I want to add a column that has the percent change between today's open and yesterday's high. This is my current implementation; however, the resulting column contains the percent change between the current day's high and open.
df
open high low close
0 100 110 95 103
1 103 113 103 111
2 111 132 109 124
3 124 136 114 130
My attempt (incorrect):
df['prevhigh_curropen'] = (df['open'] - df['high']).shift(-1) / df['high'].shift(-1)
Output (incorrect):
open high low close prevhigh_curropen
0 100 110 95 103 -0.091
1 103 113 103 111 -0.089
2 111 132 109 124 -0.159
3 124 136 114 130 -0.088
Desired output:
open high low close prevhigh_curropen
0 100 110 95 103 nan
1 103 113 103 111 -0.064
2 111 132 109 124 -0.018
3 124 136 114 130 -0.061
Is there a non-iterative way to do this like I attempted above?
Your formula is wrong; you have to use df['high'].shift():
df = pd.DataFrame({'open': range(1, 11), 'high': range(1, 11)})
df['prevhigh_curropen'] = df['open'].sub(df['high'].shift()) \
.div(df['high'].shift()) \
.mul(100)
>>> df
open high prevhigh_curropen
0 1 1 NaN
1 2 2 100.000000
2 3 3 50.000000
3 4 4 33.333333
4 5 5 25.000000
5 6 6 20.000000
6 7 7 16.666667
7 8 8 14.285714
8 9 9 12.500000
9 10 10 11.111111
For your sample the output is:
>>> df
open high low close prevhigh_curropen
0 100 110 95 103 NaN
1 103 113 103 111 -6.363636
2 111 132 109 124 -1.769912
3 124 136 114 130 -6.060606
The first value is NaN because we don't know the high value from the previous day.
We can simplify the terms slightly from (a - b) / b to (a / b) - (b / b) to (a / b) - 1.
Mathematical Operators:
df['prevhigh_curropen'] = (df['open'] / df['high'].shift()) - 1
or with Series Methods:
df['prevhigh_curropen'] = df['open'].div(df['high'].shift()).sub(1)
The benefit here is that we only need to shift once and keep a single copy of df['high'].shift().
Resulting df:
open high low close prevhigh_curropen
0 100 110 95 103 NaN
1 103 113 103 111 -0.063636
2 111 132 109 124 -0.017699
3 124 136 114 130 -0.060606
Setup Used:
import pandas as pd
df = pd.DataFrame({
'open': [100, 103, 111, 124],
'high': [110, 113, 132, 136],
'low': [95, 103, 109, 114],
'close': [103, 111, 124, 130]
})
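For completeness, here is a small check (my own snippet, using the setup above) that the long form and the simplified form agree while shifting only once:
prev_high = df['high'].shift()                      # yesterday's high; NaN on the first row
long_form = (df['open'] - prev_high) / prev_high    # (a - b) / b
short_form = df['open'] / prev_high - 1             # a / b - 1
print((long_form - short_form).abs().max())         # ~0 up to floating-point noise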
I have a dataframe df, as below:
Stud_id card Nation Gender Age Code Amount yearmonth
111 1 India M Adult 543 100 201601
111 1 India M Adult 543 100 201601
111 1 India M Adult 543 150 201602
111 1 India M Adult 612 100 201602
111 1 India M Adult 715 200 201603
222 2 India M Adult 715 200 201601
222 2 India M Adult 543 100 201604
222 2 India M Adult 543 100 201603
333 3 India M Adult 543 100 201601
333 3 India M Adult 543 100 201601
333 4 India M Adult 543 150 201602
333 4 India M Adult 612 100 201607
Now, I want two dataframes as below :
df_1 :
card Code Total_Amount Avg_Amount
1 543 350 175
2 543 200 100
3 543 200 200
4 543 150 150
1 612 100 100
4 612 100 100
1 715 200 200
2 715 200 200
Logic for df_1:
1. Total_Amount: for each unique Card and Code, get the sum of Amount (e.g. Card 1, Code 543 = 350).
2. Avg_Amount: divide Total_Amount by the number of unique yearmonth values for each unique Card and Code (e.g. Total_Amount = 350, number of unique yearmonths = 2, so 350 / 2 = 175).
df_2 :
Code Avg_Amount
543 156.25
612 100
715 200
Logic for df_2:
1. Avg_Amount: mean of Avg_Amount for each Code in df_1 (e.g. for Code 543 the sum of Avg_Amount is 175 + 100 + 200 + 150 = 625; divide it by the number of rows, 4, so 625 / 4 = 156.25).
Code to create the data frame - df :
df=pd.DataFrame({'Cus_id': (111,111,111,111,111,222,222,222,333,333,333,333),
'Card': (1,1,1,1,1,2,2,2,3,3,4,4),
'Nation':('India','India','India','India','India','India','India','India','India','India','India','India'),
'Gender': ('M','M','M','M','M','M','M','M','M','M','M','M'),
'Age':('Adult','Adult','Adult','Adult','Adult','Adult','Adult','Adult','Adult','Adult','Adult','Adult'),
'Code':(543,543,543,612,715,715,543,543,543,543,543,612),
'Amount': (100,100,150,100,200,200,100,100,100,100,150,100),
'yearmonth':(201601,201601,201602,201602,201603,201601,201604,201603,201601,201601,201602,201607)})
Code to get the required df_1 and df_2:
df1 = df.groupby(['Card', 'Code'])[['yearmonth', 'Amount']].apply(
    lambda x: [sum(x.Amount), sum(x.Amount) / len(set(x.yearmonth))]).apply(
    pd.Series).reset_index()
df1.columns = ['Card', 'Code', 'Total_Amount', 'Avg_Amount']

df2 = df1.groupby('Code')['Avg_Amount'].apply(lambda x: sum(x) / len(x)).reset_index(
    name='Avg_Amount')
Though the code works fine, my dataset is huge, so it takes a long time. I am looking for optimized code; I think the apply function is what's slow. Is there a better, optimized approach, please?
For DataFrame 1 you can do this:
tmp = df.groupby(['Card', 'Code'], as_index=False) \
.agg({'Amount': 'sum', 'yearmonth': pd.Series.nunique})
df1 = tmp.assign(Avg_Amount=tmp.Amount / tmp.yearmonth) \
.drop(columns=['yearmonth'])
Card Code Amount Avg_Amount
0 1 543 350 175.0
1 1 612 100 100.0
2 1 715 200 200.0
3 2 543 200 100.0
4 2 715 200 200.0
5 3 543 200 200.0
6 4 543 150 150.0
7 4 612 100 100.0
For DataFrame 2 you can do this:
df1.groupby('Code', as_index=False) \
.agg({'Avg_Amount': 'mean'})
Code Avg_Amount
0 543 156.25
1 612 100.00
2 715 200.00
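On pandas 0.25 or newer you could also express this with named aggregation, which keeps everything in one chained expression; a minimal sketch, assuming the same df as in the question:
# One pass over the data: sum the amounts and count distinct yearmonths per Card/Code
df1 = (df.groupby(['Card', 'Code'], as_index=False)
         .agg(Total_Amount=('Amount', 'sum'),
              n_months=('yearmonth', 'nunique'))
         .assign(Avg_Amount=lambda t: t['Total_Amount'] / t['n_months'])
         .drop(columns='n_months'))

# df_2 is then the mean of the per-card averages for each Code
df2 = df1.groupby('Code', as_index=False).agg(Avg_Amount=('Avg_Amount', 'mean'))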
I have a DataFrame df_sale in Python that I want to reshape, summing across the price column into a new column total. Below is df_sale:
b_no a_id price c_id
120 24 50 2
120 56 100 2
120 90 25 2
120 45 20 2
231 89 55 3
231 45 20 3
231 10 250 3
Expected output after reshaping:
b_no a_id_1 a_id_2 a_id_3 a_id_4 total c_id
120 24 56 90 45 195 2
231 89 45 10 0 325 3
What I have tried so far is using sum() on df_sale['price'] separately for 120 and 231. I do not understand how I should reshape the data, add the new column headers, and get the total without being computationally inefficient. Thanks.
This might not be the cleanest method (at all), but it gets the outcome you want:
reshaped_df = (df.groupby('b_no')[['price', 'c_id']]
               .first()
               .join(df.groupby('b_no')['a_id']
                       .apply(list)
                       .apply(pd.Series)
                       .add_prefix('a_id_'))
               .drop(columns='price')
               .join(df.groupby('b_no')['price'].sum().to_frame('total'))
               .fillna(0))
>>> reshaped_df
c_id a_id_0 a_id_1 a_id_2 a_id_3 total
b_no
120 2 24.0 56.0 90.0 45.0 195
231 3 89.0 45.0 10.0 0.0 325
You can achieve this by grouping by b_no and c_id, summing price into total, and flattening a_id:
import pandas as pd
d = {"b_no": [120,120,120,120,231,231, 231],
"a_id": [24,56,90,45,89,45,10],
"price": [50,100,25,20,55,20,250],
"c_id": [2,2,2,2,3,3,3]}
df = pd.DataFrame(data=d)
df2 = df.groupby(['b_no', 'c_id'])['a_id'].apply(list).apply(pd.Series).add_prefix('a_id_').fillna(0)
df2["total"] = df.groupby(['b_no', 'c_id'])['price'].sum()
print(df2)
a_id_0 a_id_1 a_id_2 a_id_3 total
b_no c_id
120 2 24.0 56.0 90.0 45.0 195
231 3 89.0 45.0 10.0 0.0 325
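If you want output closer to the requested layout (columns a_id_1 … a_id_4 filled with integer zeros), a minimal sketch with cumcount and pivot, assuming the same df as above:
# Number the a_id values within each b_no, then spread them into columns
wide = (df.assign(n=df.groupby('b_no').cumcount() + 1)
          .pivot(index='b_no', columns='n', values='a_id')
          .add_prefix('a_id_')
          .fillna(0)
          .astype(int))

# Attach the per-order total and c_id, and move b_no back to a column
out = (wide.join(df.groupby('b_no').agg(total=('price', 'sum'),
                                        c_id=('c_id', 'first')))
           .reset_index())
print(out)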
I want to achieve the behaviour of this simple R code in pandas, with similarly simple syntax.
Here is the R code:
> head(mtcars)
mpg cyl disp hp drat wt qsec vs am gear carb
Mazda RX4 21.0 6 160 110 3.90 2.620 16.46 0 1 4 4
Mazda RX4 Wag 21.0 6 160 110 3.90 2.875 17.02 0 1 4 4
Datsun 710 22.8 4 108 93 3.85 2.320 18.61 1 1 4 1
Hornet 4 Drive 21.4 6 258 110 3.08 3.215 19.44 1 0 3 1
Hornet Sportabout 18.7 8 360 175 3.15 3.440 17.02 0 0 3 2
Valiant 18.1 6 225 105 2.76 3.460 20.22 1 0 3 1
> mtcars$year <- c(1973, 1974)
> head(mtcars)
mpg cyl disp hp drat wt qsec vs am gear carb year
Mazda RX4 21.0 6 160 110 3.90 2.620 16.46 0 1 4 4 1973
Mazda RX4 Wag 21.0 6 160 110 3.90 2.875 17.02 0 1 4 4 1974
Datsun 710 22.8 4 108 93 3.85 2.320 18.61 1 1 4 1 1973
Hornet 4 Drive 21.4 6 258 110 3.08 3.215 19.44 1 0 3 1 1974
Hornet Sportabout 18.7 8 360 175 3.15 3.440 17.02 0 0 3 2 1973
Valiant 18.1 6 225 105 2.76 3.460 20.22 1 0 3 1 1974
As you can see, a year column has been added to the data frame and filled with the two values repeated until the column ends.
How can I achieve this in pandas with simple code?
Please note that I don't want to use a for loop in the solution, as it would take too much time on a big data set.
Thanks!
When adding a column to a Pandas DF, you must supply an object whose length matches the number of rows in the DF (unless every value is the same, in which case a scalar can be assigned to the column). To do this, you can repeat the elements of a list until it is at least as long as the DF, then slice it to the correct length:
mtcars['year'] = ([1973, 1974] * (len(mtcars) // 2 + 1))[:len(mtcars)]
Thanks to MaxU for inspiration with this solution.
For the case where the DF has an even number of rows you could simply repeat the elements of a list to the length of the DF:
mtcars['year'] = [1973, 1974] * (len(mtcars) // 2)
Using numpy tile (much faster than the list generation technique):
import numpy as np
years = (1973, 1974)
mtcars['year'] = np.tile(years, int(len(mtcars) / len(years)) + 1)[:len(mtcars)]
Numpy tile with a 1 million row dataframe:
mtcars = pd.DataFrame(np.arange(1000000))
years = (1973, 1974)
mtcars['year'] = np.tile(years, int(len(mtcars) / len(years)) + 1)[:len(mtcars)]
CPU times: user 0 ns, sys: 4 ms, total: 4 ms
Wall time: 3.81 ms
List generation with a 1 million row dataframe:
mtcars['year'] = ([1973, 1974] * (len(mtcars) // 2 + 1))[:len(mtcars)]
CPU times: user 140 ms, sys: 0 ns, total: 140 ms
Wall time: 136 ms
I propose this:
def new_vect(vect, n_row):
    # Repeat the list enough times to cover n_row rows, then trim to exactly n_row
    l_vect = len(vect)
    l_new_vect = n_row // l_vect + 1   # integer division so the repeat count is an int
    new_vect = vect * l_new_vect
    return new_vect[:n_row]

mtcars['year'] = new_vect([1973, 1974], mtcars.shape[0])
It's maybe a bit more verbose, but it also works when the number of rows is not a multiple of the vector length.
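As a further variant (not from the answers above), numpy's resize does the repeat-and-trim in a single call; a minimal sketch with a stand-in frame:
import numpy as np
import pandas as pd

mtcars = pd.DataFrame({'mpg': np.arange(7)})   # stand-in frame with an odd row count

# np.resize repeats the values until the requested length is reached, then truncates
mtcars['year'] = np.resize([1973, 1974], len(mtcars))
print(mtcars)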