Searching for the first closer value within my dataframe with two conditions - python

I have one dataframe with N Timestamps.
I need an additional column, 'Timestamp_reached_TP', with these constraints:
'Timestamp_reached_TP' must be the first 'Timestamp' value higher than the current row's 'Timestamp'.
Additionally, the current row's 'Long_TP' value must be between that row's 'Open' and 'Close'.
Here is my code below to create the dataframe:
import pandas as pd

d = {'Timestamp': [1,2,3,4,5,6,7,8,9,10], 'Open': [100,110,200,200,240,250,300,180,200,200], 'Close': [110,200,200,240,250,300,180,200,200,100], 'Long_TP': [220,220,250,300,400,260,305,200,210,205]}
df = pd.DataFrame(data=d)
Actual df :
Timestamp Open Close Long_TP
0 1 100 110 220
1 2 110 200 220
2 3 200 200 250
3 4 200 240 300
4 5 240 250 400
5 6 250 300 260
6 7 300 180 305
7 8 180 200 200
8 9 200 200 210
9 10 200 100 205
Expected result :
Timestamp Open Close Long_TP Timestamp_reached_TP
0 1 100 110 220 4
1 2 110 200 220 4
2 3 200 200 250 5
3 4 200 240 300 6
4 5 240 250 400 NaN
5 6 250 300 260 6
6 7 300 180 305 NaN
7 8 180 200 200 8
8 9 200 200 210 NaN
9 10 200 100 205 NaN
I have tried to find a workaround with a left join / merge, but it does not seem that I can join on multiple conditions.
Thank you very much in advance for your help, guys!
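A minimal sketch of one possible approach (not from the original post), continuing from the df built above: for each row, look at the rows from the current Timestamp onwards and take the first one whose Open/Close range contains this row's Long_TP. In the expected output the rows with Timestamp 6 and 8 are matched by themselves, so the sketch assumes the search may include the current row (>= rather than a strict >), and it treats "between Open and Close" as between the lower and the higher of the two values.
import numpy as np

def first_tp_reached(row):
    # Candidate rows: the current row and everything after it
    later = df[df['Timestamp'] >= row['Timestamp']]
    # Lower / upper bound of each candidate, whichever way Open and Close fall
    lo = later[['Open', 'Close']].min(axis=1)
    hi = later[['Open', 'Close']].max(axis=1)
    # First Timestamp whose range contains this row's Long_TP
    hits = later.loc[(row['Long_TP'] >= lo) & (row['Long_TP'] <= hi), 'Timestamp']
    return hits.iloc[0] if not hits.empty else np.nan

df['Timestamp_reached_TP'] = df.apply(first_tp_reached, axis=1)
This reproduces the expected column (as floats, because of the NaN rows). It is O(N^2) with df.apply, so a very large frame would need a vectorised or numba version.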

Related

Pandas groupby apply a random day to each group of years

I am trying to generate a different random day within each year group of a dataframe, so I need replace=False so that days don't repeat within a year.
I can't just add a single column of random numbers, because my list of years has more than 365 entries, and once you hit 365 you can't create any more random samples without replacement.
I have explored agg, aggregate, apply and transform. The closest I have got is with this:
import numpy as np
import pandas as pd

years = pd.DataFrame({"year": [1,1,2,2,2,3,3,4,4,4,4]})
years["day"] = 0
grouped = years.groupby("year")["day"]
grouped.transform(lambda x: np.random.choice(366, replace=False))
Which gives this:
0 8
1 8
2 319
3 319
4 319
5 149
6 149
7 130
8 130
9 130
10 130
Name: day, dtype: int64
But I want this:
0 8
1 16
2 119
3 321
4 333
5 4
6 99
7 30
8 129
9 224
10 355
Name: day, dtype: int64
You can use your code with a minor modification. You have to specify the number of samples.
random_days = lambda x: np.random.choice(range(1, 366), len(x), replace=False)
years['day'] = years.groupby('year').transform(random_days)
Output:
>>> years
year day
0 1 18
1 1 300
2 2 154
3 2 355
4 2 311
5 3 18
6 3 14
7 4 160
8 4 304
9 4 67
10 4 6
With numpy broadcasting:
years["day"] = np.random.choice(366, years.shape[0], False) % 366
years["day"] = years.groupby("year").transform(lambda x: np.random.permutation(x))
Output:
print(years)
year day
0 1 233
1 1 147
2 2 1
3 2 340
4 2 267
5 3 204
6 3 256
7 4 354
8 4 94
9 4 196
10 4 164

Add a new column to Pandas Dataframe based on values from other column

This is my first time posting here. I tried almost everything but could not find a solution. Please help!
I have a Python pandas dataframe with these columns: ID, Step, X, Y. Each ID has a number of steps. I want to add a new column (new_id) that takes integer values starting from 1 and is the same for every row of an ID. Two IDs should get the same new_id if they contain the same "X" and "Y" values for all of their steps (I was trying to do this with a for loop); otherwise, add 1 to the previous new_id value.
DataFrame (df)
ID Step X Y
1001 0 100 200
1001 1 200 300
1001 2 100 250
1001 3 150 200
1002 0 150 200
1002 1 200 250
1002 2 250 300
1002 3 300 150
1003 0 100 200
1003 1 200 300
1003 2 100 250
1003 3 150 200
1004 0 150 200
1004 1 200 250
1004 2 250 300
1004 3 300 150
1005 0 125 220
1005 1 200 250
1005 2 250 300
1005 3 300 150
Newly Created DataFrame (df)
ID Step X Y new_id
1001 0 100 200 1
1001 1 200 300 1
1001 2 100 250 1
1001 3 150 200 1
1002 0 150 200 2
1002 1 200 250 2
1002 2 250 300 2
1002 3 300 150 2
1003 0 100 200 1
1003 1 200 300 1
1003 2 100 250 1
1003 3 150 200 1
1004 0 150 200 2
1004 1 200 250 2
1004 2 250 300 2
1004 3 300 150 2
1005 0 125 220 3
1005 1 200 250 3
1005 2 250 300 3
1005 3 300 150 3
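A minimal sketch of one possible way to get new_id without an explicit for loop (not from the original post): build a hashable signature of each ID's full (X, Y) trajectory and then number the distinct signatures in order of first appearance.
import pandas as pd

# Example frame rebuilt from the question
df = pd.DataFrame({
    'ID':   [1001]*4 + [1002]*4 + [1003]*4 + [1004]*4 + [1005]*4,
    'Step': [0, 1, 2, 3] * 5,
    'X':    [100, 200, 100, 150,  150, 200, 250, 300,
             100, 200, 100, 150,  150, 200, 250, 300,  125, 200, 250, 300],
    'Y':    [200, 300, 250, 200,  200, 250, 300, 150,
             200, 300, 250, 200,  200, 250, 300, 150,  220, 250, 300, 150],
})

# One (X, Y) signature per ID, in Step order
sig = (df.sort_values(['ID', 'Step'])
         .groupby('ID')[['X', 'Y']]
         .apply(lambda g: tuple(map(tuple, g.to_numpy()))))

# Identical trajectories get the same integer, numbered from 1 in order of first appearance
new_id = pd.Series(pd.factorize(sig)[0] + 1, index=sig.index)
df['new_id'] = df['ID'].map(new_id)
pd.factorize assigns codes in order of first appearance, so 1003 and 1004 reuse the numbers given to 1001 and 1002 while 1005 gets a new one, matching the table above.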

How do I select columns while having few conditions in pandas

I've got a dataframe:
1990 1991 1992 .... 2015 2016 2017
0 9 40 300 100 200 554
1 9 70 700 3300 200 554
2 5 70 900 100 200 554
3 8 80 900 176 200 554
4 7 50 200 250 280 145
5 9 30 900 100 207 554
6 2 80 700 180 200 554
7 2 80 400 100 200 554
8 5 80 300 100 200 554
9 7 70 800 100 200 554
How do I select df<2000 & df>2005 (i.e. the columns named before 2000 and after 2005)?
I tried the code below but it failed:
1. df[(df.loc[:, :2000]) & (df.loc[:, 2005:])]
2. df[(df <2000) & (df>2005)]
Compare column names:
print (df)
1999 2002 2003 2005 2006 2017
0 9 40 300 100 200 554
1 9 70 700 3300 200 554
2 5 70 900 100 200 554
3 8 80 900 176 200 554
4 7 50 200 250 280 145
5 9 30 900 100 207 554
6 2 80 700 180 200 554
7 2 80 400 100 200 554
8 5 80 300 100 200 554
9 7 70 800 100 200 554
df = df.loc[:, (df.columns <2000) | (df.columns>2005)]
print (df)
1999 2006 2017
0 9 200 554
1 9 200 554
2 5 200 554
3 8 200 554
4 7 280 145
5 9 207 554
6 2 200 554
7 2 200 554
8 5 200 554
9 7 200 554

Is there a way to optimize pandas apply function during groupby?

I have a dataframe, df, as below:
Stud_id card Nation Gender Age Code Amount yearmonth
111 1 India M Adult 543 100 201601
111 1 India M Adult 543 100 201601
111 1 India M Adult 543 150 201602
111 1 India M Adult 612 100 201602
111 1 India M Adult 715 200 201603
222 2 India M Adult 715 200 201601
222 2 India M Adult 543 100 201604
222 2 India M Adult 543 100 201603
333 3 India M Adult 543 100 201601
333 3 India M Adult 543 100 201601
333 4 India M Adult 543 150 201602
333 4 India M Adult 612 100 201607
Now, I want two dataframes as below :
df_1 :
card Code Total_Amount Avg_Amount
1 543 350 175
2 543 200 100
3 543 200 200
4 543 150 150
1 612 100 100
4 612 100 100
1 715 200 200
2 715 200 200
Logic for df_1 :
1. Total_Amount: For each unique card and unique Code, get the sum of Amount (e.g. card 1, Code 543 = 350).
2. Avg_Amount: Divide the Total_Amount by the number of unique yearmonth values for each unique card and Code (e.g. Total_Amount = 350, number of unique yearmonth values = 2, so 350 / 2 = 175).
df_2 :
Code Avg_Amount
543 156.25
612 100
715 200
Logic for df_2 :
1. Avg_Amount: The mean of the Avg_Amount values of each Code in df_1 (e.g. for Code 543 the sum of Avg_Amount is 175 + 100 + 200 + 150 = 625; divide it by the number of rows, 4, so 625 / 4 = 156.25).
Code to create the dataframe, df:
import pandas as pd

df = pd.DataFrame({'Cus_id': (111,111,111,111,111,222,222,222,333,333,333,333),
                   'Card': (1,1,1,1,1,2,2,2,3,3,4,4),
                   'Nation': ('India','India','India','India','India','India','India','India','India','India','India','India'),
                   'Gender': ('M','M','M','M','M','M','M','M','M','M','M','M'),
                   'Age': ('Adult','Adult','Adult','Adult','Adult','Adult','Adult','Adult','Adult','Adult','Adult','Adult'),
                   'Code': (543,543,543,612,715,715,543,543,543,543,543,612),
                   'Amount': (100,100,150,100,200,200,100,100,100,100,150,100),
                   'yearmonth': (201601,201601,201602,201602,201603,201601,201604,201603,201601,201601,201602,201607)})
Code to get the required df_1 and df_2:
df1 = df.groupby(['Card','Code'])[['yearmonth','Amount']].apply(
    lambda x: [sum(x.Amount), sum(x.Amount)/len(set(x.yearmonth))]).apply(
    pd.Series).reset_index()
df1.columns = ['Card','Code','Total_Amount','Avg_Amount']
df2 = df1.groupby('Code')['Avg_Amount'].apply(lambda x: sum(x)/len(x)).reset_index(
    name='Avg_Amount')
Though the code works fine, my dataset is huge and it is taking time. I am looking for optimized code; I think the apply function is what takes the time. Is there a better, optimized way, please?
For DataFrame 1 you can do this:
tmp = df.groupby(['Card', 'Code'], as_index=False) \
        .agg({'Amount': 'sum', 'yearmonth': pd.Series.nunique})
df1 = tmp.assign(Avg_Amount=tmp.Amount / tmp.yearmonth) \
         .drop(columns=['yearmonth'])
Card Code Amount Avg_Amount
0 1 543 350 175.0
1 1 612 100 100.0
2 1 715 200 200.0
3 2 543 200 100.0
4 2 715 200 200.0
5 3 543 200 200.0
6 4 543 150 150.0
7 4 612 100 100.0
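A small side note (not part of the original answer): the aggregated column keeps the name Amount; to match the Total_Amount header from the question it can be renamed afterwards:
df1 = df1.rename(columns={'Amount': 'Total_Amount'})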
For DataFrame 2 you can do this:
df1.groupby('Code', as_index=False) \
   .agg({'Avg_Amount': 'mean'})
Code Avg_Amount
0 543 156.25
1 612 100.00
2 715 200.00

Summing up previous 10 rows of a dataframe

I'm wondering how to sum up 10 rows of a dataframe starting from any point.
I tried using rolling(10, min_periods=1).sum(), but the very first row should sum up the 10 rows below it. There is a similar issue with cumsum().
So if my data frame is just the A column, I'd like it to output B.
A B
0 10 550
1 20 650
2 30 750
3 40 850
4 50 950
5 60 1050
6 70 1150
7 80 1250
8 90 1350
9 100 1450
10 110 etc
11 120 etc
12 130 etc
13 140
14 150
15 160
16 170
17 180
18 190
It would be similar to doing this operation in Excel and copying it down.
You can reverse your series before using pd.Series.rolling, and then reverse the result:
df['B'] = df['A'][::-1].rolling(10, min_periods=0).sum()[::-1]
print(df)
A B
0 10 550.0
1 20 650.0
2 30 750.0
3 40 850.0
4 50 950.0
5 60 1050.0
6 70 1150.0
7 80 1250.0
8 90 1350.0
9 100 1450.0
10 110 1350.0
11 120 1240.0
12 130 1120.0
13 140 990.0
14 150 850.0
15 160 700.0
16 170 540.0
17 180 370.0
18 190 190.0
