I have two files.
The first one looks like this:
ID1;ID2;ID3;VALUE
1000;100;01;12
1000;100;02;4129
1000;100;03;128
1000;100;04;412
1000;100;05;12818
1000;100;06;4129
1000;100;07;546
1000;100;08;86
1000;100;09;12818
1000;100;10;754
1000;100;11;2633
1000;100;12;571
1000;200;01;13
1000;200;02;319
1000;200;03;828
1000;200;04;46
1000;200;05;118
1000;200;06;41
1000;200;07;546
1000;200;08;86
1000;200;09;129
1000;200;10;7564
1000;200;11;233
1000;200;12;572
The second one looks like this:
01
01
02
01
02
03
01
02
03
04
....
01
02
03
04
05
06
07
08
09
10
11
12
I want to search through the first file, using the groups from the second one, and then sum up the values.
The output should look like this:
ID1;ID2;ID3;VALUE
1000;100;M01;12
1000;100;M02;4141 (12+4129)
1000;100;M03;4269
---
1000;100;M12;39036
1000;200;M01;13
1000;200;M02;332
Is there any way to do this? Thank you in advance; any idea is welcome.
import pandas as pd

df1 = pd.read_csv('df1.txt', sep=';', header=0)
df2 = pd.read_csv('df2.txt', header=None)
df1['preliminary'] = 0
df1['test'] = 0
print(df1)
print(df2)

# Last index of df2: a group's slice of df2 must not run past it.
aaa = df2.index[-1]

def my_func(x):
    # Row labels of this group that lie beyond the end of df2.
    ind = x.index[x.index > aaa]
    if len(ind) > 0:
        bbb = df2[x.index[0]:ind[0]]
    else:
        bbb = df2[x.index[0]:x.index[-1] + 1]
    # Keep only the rows whose ID3 occurs in the matching slice of df2 ...
    indGR = x[x['ID3'].isin(bbb[0].values)].index
    df1.loc[indGR, 'preliminary'] = df1.loc[indGR, 'VALUE']
    # ... and accumulate those values within the group.
    df1.loc[x.index, 'test'] = df1.loc[x.index, 'preliminary'].cumsum()

df1.groupby('ID2').apply(my_func)
print(df1)
Input
ID1 ID2 ID3 VALUE preliminary test
0 1000 100 1 12 0 0
1 1000 100 2 4129 0 0
2 1000 100 3 128 0 0
3 1000 100 4 412 0 0
4 1000 100 5 12818 0 0
5 1000 100 6 4129 0 0
6 1000 100 7 546 0 0
7 1000 100 8 86 0 0
8 1000 100 9 12818 0 0
9 1000 100 10 754 0 0
10 1000 100 11 2633 0 0
11 1000 100 12 571 0 0
12 1000 200 1 13 0 0
13 1000 200 2 319 0 0
14 1000 200 3 828 0 0
15 1000 200 4 46 0 0
16 1000 200 5 118 0 0
17 1000 200 6 41 0 0
18 1000 200 7 546 0 0
19 1000 200 8 86 0 0
20 1000 200 9 129 0 0
21 1000 200 10 7564 0 0
22 1000 200 11 233 0 0
23 1000 200 12 572 0 0
0
0 1
1 1
2 2
3 1
4 2
5 3
6 1
7 2
8 3
9 4
10 1
11 2
12 3
13 4
14 5
15 6
16 7
17 8
18 9
19 10
20 11
21 12
Output
ID1 ID2 ID3 VALUE preliminary test
0 1000 100 1 12 12 12
1 1000 100 2 4129 4129 4141
2 1000 100 3 128 128 4269
3 1000 100 4 412 412 4681
4 1000 100 5 12818 0 4681
5 1000 100 6 4129 0 4681
6 1000 100 7 546 0 4681
7 1000 100 8 86 0 4681
8 1000 100 9 12818 0 4681
9 1000 100 10 754 0 4681
10 1000 100 11 2633 0 4681
11 1000 100 12 571 0 4681
12 1000 200 1 13 0 0
13 1000 200 2 319 0 0
14 1000 200 3 828 828 828
15 1000 200 4 46 46 874
16 1000 200 5 118 118 992
17 1000 200 6 41 41 1033
18 1000 200 7 546 546 1579
19 1000 200 8 86 86 1665
20 1000 200 9 129 129 1794
21 1000 200 10 7564 7564 9358
22 1000 200 11 233 233 9591
23 1000 200 12 572 572 10163
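A side note, not part of the code above: if the groups in the second file are always the prefixes 01, 01-02, 01-03, ... of the month list, as they are in the sample, the whole computation reduces to a per-(ID1, ID2) cumulative sum. A minimal sketch under that assumption (out is a made-up variable name):
out = df1[['ID1', 'ID2', 'ID3', 'VALUE']].copy()
# Running total of VALUE within each (ID1, ID2) pair.
out['VALUE'] = out.groupby(['ID1', 'ID2'])['VALUE'].cumsum()
# Rebuild the 'M01', 'M02', ... labels (ID3 was parsed as an integer).
out['ID3'] = 'M' + out['ID3'].astype(str).str.zfill(2)
print(out)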
I have a dataframe with N timestamps.
I need an additional column 'Timestamp_reached_TP' with these constraints:
'Timestamp_reached_TP' must be the earliest Timestamp at or after the current row's 'Timestamp'
Additionally, at that Timestamp the current row's 'Long_TP' value must lie between 'Open' and 'Close' (see the expected result below)
Here is my code to create the dataframe:
import pandas as pd

d = {'Timestamp': [1, 2, 3, 4, 5, 6, 7, 8, 9, 10],
     'Open': [100, 110, 200, 200, 240, 250, 300, 180, 200, 200],
     'Close': [110, 200, 200, 240, 250, 300, 180, 200, 200, 100],
     'Long_TP': [220, 220, 250, 300, 400, 260, 305, 200, 210, 205]}
df = pd.DataFrame(data=d)
Actual df:
Timestamp Open Close Long_TP
0 1 100 110 220
1 2 110 200 220
2 3 200 200 250
3 4 200 240 300
4 5 240 250 400
5 6 250 300 260
6 7 300 180 305
7 8 180 200 200
8 9 200 200 210
9 10 200 100 205
Expected result:
Timestamp Open Close Long_TP Timestamp_reached_TP
0 1 100 110 220 4
1 2 110 200 220 4
2 3 200 200 250 5
3 4 200 240 300 6
4 5 240 250 400 NaN
5 6 250 300 260 6
6 7 300 180 305 NaN
7 8 180 200 200 8
8 9 200 200 210 NaN
9 10 200 100 205 NaN
I have tried to find a workaround with a left join / merge, but it does not seem that I can join on multiple conditions.
Thank you very much in advance for your help!
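A minimal sketch of one possible row-wise approach (not from the original post; first_reach is a made-up helper name). It is O(N^2), which is fine for small frames; note that the search starts at the current row itself, which is what the expected result shows for Timestamps 6 and 8:
import numpy as np

def first_reach(row):
    # Rows from the current Timestamp onward (the row itself included).
    later = df[df['Timestamp'] >= row['Timestamp']]
    # 'Between Open and Close' regardless of which of the two is larger.
    lo = later[['Open', 'Close']].min(axis=1)
    hi = later[['Open', 'Close']].max(axis=1)
    hit = later[(row['Long_TP'] >= lo) & (row['Long_TP'] <= hi)]
    return hit['Timestamp'].iloc[0] if len(hit) else np.nan

df['Timestamp_reached_TP'] = df.apply(first_reach, axis=1)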
I have a dataframe df as below:
Cus_id Card Nation Gender Age Code Amount yearmonth
111 1 India M Adult 543 100 201601
111 1 India M Adult 543 100 201601
111 1 India M Adult 543 150 201602
111 1 India M Adult 612 100 201602
111 1 India M Adult 715 200 201603
222 2 India M Adult 715 200 201601
222 2 India M Adult 543 100 201604
222 2 India M Adult 543 100 201603
333 3 India M Adult 543 100 201601
333 3 India M Adult 543 100 201601
333 4 India M Adult 543 150 201602
333 4 India M Adult 612 100 201607
Now, I want two dataframes as below:
df_1:
Card Code Total_Amount Avg_Amount
1 543 350 175
2 543 200 100
3 543 200 200
4 543 150 150
1 612 100 100
4 612 100 100
1 715 200 200
2 715 200 200
Logic for df_1:
1. Total_Amount: for each unique Card and unique Code, the sum of Amount (e.g. Card 1, Code 543 gives 350).
2. Avg_Amount: Total_Amount divided by the number of unique yearmonth values for that Card and Code (e.g. Total_Amount = 350 with 2 unique yearmonths gives 350 / 2 = 175).
df_2:
Code Avg_Amount
543 156.25
612 100
715 200
Logic for df_2:
1. Avg_Amount: the mean of the Avg_Amount values for each Code in df_1 (e.g. for Code 543, the sum of Avg_Amount is 175 + 100 + 200 + 150 = 625; divided by the 4 rows, that gives 625 / 4 = 156.25).
Code to create the dataframe df:
df=pd.DataFrame({'Cus_id': (111,111,111,111,111,222,222,222,333,333,333,333),
'Card': (1,1,1,1,1,2,2,2,3,3,4,4),
'Nation':('India','India','India','India','India','India','India','India','India','India','India','India'),
'Gender': ('M','M','M','M','M','M','M','M','M','M','M','M'),
'Age':('Adult','Adult','Adult','Adult','Adult','Adult','Adult','Adult','Adult','Adult','Adult','Adult'),
'Code':(543,543,543,612,715,715,543,543,543,543,543,612),
'Amount': (100,100,150,100,200,200,100,100,100,100,150,100),
'yearmonth':(201601,201601,201602,201602,201603,201601,201604,201603,201601,201601,201602,201607)})
Code to get the required df_1 and df_2:
df1 = df.groupby(['Card', 'Code'])[['yearmonth', 'Amount']].apply(
    lambda x: [sum(x.Amount), sum(x.Amount) / len(set(x.yearmonth))]).apply(
    pd.Series).reset_index()
df1.columns = ['Card', 'Code', 'Total_Amount', 'Avg_Amount']
df2 = df1.groupby('Code')['Avg_Amount'].apply(lambda x: sum(x) / len(x)).reset_index(
    name='Avg_Amount')
Though the code works fine, my dataset is huge and it takes a long time. I suspect the apply calls are what is slow. Is there a better, optimized way to do this?
For DataFrame 1 you can do this:
# Sum Amount and count the distinct yearmonths per (Card, Code) in one pass.
tmp = df.groupby(['Card', 'Code'], as_index=False) \
        .agg({'Amount': 'sum', 'yearmonth': pd.Series.nunique})
# Average per active month, then drop the helper count column.
df1 = tmp.assign(Avg_Amount=tmp.Amount / tmp.yearmonth) \
         .drop(columns=['yearmonth'])
Card Code Amount Avg_Amount
0 1 543 350 175.0
1 1 612 100 100.0
2 1 715 200 200.0
3 2 543 200 100.0
4 2 715 200 200.0
5 3 543 200 200.0
6 4 543 150 150.0
7 4 612 100 100.0
For DataFrame 2 you can do this:
df1.groupby('Code', as_index=False) \
.agg({'Avg_Amount': 'mean'})
Code Avg_Amount
0 543 156.25
1 612 100.00
2 715 200.00
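If you also want the sum column to be called Total_Amount, as in the question's df_1, a rename can be chained on (a small addition, not part of the answer above):
df1 = df1.rename(columns={'Amount': 'Total_Amount'})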
I have data that I've left in a format that should allow me to pivot on dates. It looks like this:
Region 0 1 2 3
Date 2005-01-01 2005-02-01 2005-03-01 ....
East South Central 400 500 600
Pacific 100 200 150
.
.
Mountain 500 600 450
I need to pivot this table so it looks like:
   Date        Region              value
0  2005-01-01  East South Central  400
1  2005-02-01  East South Central  500
2  2005-03-01  East South Central  600
.
.
3  2005-01-01  Pacific             100
4  2005-02-01  Pacific             200
5  2005-03-01  Pacific             150
.
.
Since both Date and Region are under one another, I'm not sure how to melt or pivot around these strings to get my desired output.
How can I go about this?
I think this is the solution you are looking for, shown by example.
import pandas as pd
import numpy as np

# Build an example frame shaped like the question: 'Region' and 'Date' are
# two of the rows, and each remaining row holds one region's values.
N = 100
df = pd.DataFrame([[i for i in range(N)],
                   ['2016-{}'.format(i) for i in range(N)],
                   list(np.random.randint(0, 500, N)),
                   list(np.random.randint(0, 500, N)),
                   list(np.random.randint(0, 500, N)),
                   list(np.random.randint(0, 500, N))])
df.index = ['Region', 'Date', 'a', 'b', 'c', 'd']
print(df)
This gives
0 1 2 3 4 5 6 7 \
Region 0 1 2 3 4 5 6 7
Date 2016-0 2016-1 2016-2 2016-3 2016-4 2016-5 2016-6 2016-7
a 96 432 181 64 87 355 339 314
b 360 23 162 98 450 78 114 109
c 143 375 420 493 321 277 208 317
d 371 144 207 108 163 67 465 130
And the solution to pivot this into the form you want is
df.transpose().melt(id_vars=['Date'], value_vars=['a', 'b', 'c', 'd'])
which gives
Date variable value
0 2016-0 a 96
1 2016-1 a 432
2 2016-2 a 181
3 2016-3 a 64
4 2016-4 a 87
5 2016-5 a 355
6 2016-6 a 339
7 2016-7 a 314
8 2016-8 a 111
9 2016-9 a 121
10 2016-10 a 124
11 2016-11 a 383
12 2016-12 a 424
13 2016-13 a 453
...
393 2016-93 d 176
394 2016-94 d 277
395 2016-95 d 256
396 2016-96 d 174
397 2016-97 d 349
398 2016-98 d 414
399 2016-99 d 132
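To end up with the column names from the question, the variable column can be renamed afterwards (a small addition to the answer above; result is a made-up variable name):
result = df.transpose().melt(id_vars=['Date'], value_vars=['a', 'b', 'c', 'd']) \
           .rename(columns={'variable': 'Region'})
print(result[['Date', 'Region', 'value']])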
I'm wondering how to sum up 10 rows of a dataframe starting from any point.
I tried using rolling(10, min_periods=1).sum(), but that looks at the rows above: the very first row should instead sum up itself and the 9 rows below it. There's a similar issue with cumsum().
So if my dataframe is just column A, I'd like it to output B.
A B
0 10 550
1 20 650
2 30 750
3 40 850
4 50 950
5 60 1050
6 70 1150
7 80 1250
8 90 1350
9 100 1450
10 110 etc
11 120 etc
12 130 etc
13 140
14 150
15 160
16 170
17 180
18 190
It would be similar to doing this operation in Excel and copying the formula down.
You can reverse your series before using pd.Series.rolling, and then reverse the result:
df['B'] = df['A'][::-1].rolling(10, min_periods=0).sum()[::-1]
print(df)
A B
0 10 550.0
1 20 650.0
2 30 750.0
3 40 850.0
4 50 950.0
5 60 1050.0
6 70 1150.0
7 80 1250.0
8 90 1350.0
9 100 1450.0
10 110 1350.0
11 120 1240.0
12 130 1120.0
13 140 990.0
14 150 850.0
15 160 700.0
16 170 540.0
17 180 370.0
18 190 190.0
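Here min_periods=0 is what lets the last few rows, which have fewer than 10 rows left below them, still produce a partial sum, matching the Excel copy-down behaviour. Rolling sums come back as floats; since no NaNs remain here, you could cast back to integers if you prefer (a small variation on the answer above):
df['B'] = df['A'][::-1].rolling(10, min_periods=0).sum()[::-1].astype(int)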