I am trying to generate a different random day within each year group of a dataframe, so I need to sample with replace=False within each group, otherwise it will fail.
I can't just add a single column of random numbers drawn without replacement across the whole frame, because I'm going to have more than 365 entries in my list of years, and once you hit 365 it can't create any more samples without replacement.
I have explored agg, aggregate, apply and transform. The closest I have got is with this:
import numpy as np
import pandas as pd

years = pd.DataFrame({"year": [1, 1, 2, 2, 2, 3, 3, 4, 4, 4, 4]})
years["day"] = 0
grouped = years.groupby("year")["day"]
# draws a single day per group, which transform then broadcasts to every row
grouped.transform(lambda x: np.random.choice(366, replace=False))
Which gives this:
0 8
1 8
2 319
3 319
4 319
5 149
6 149
7 130
8 130
9 130
10 130
Name: day, dtype: int64
But I want this:
0 8
1 16
2 119
3 321
4 333
5 4
6 99
7 30
8 129
9 224
10 355
Name: day, dtype: int64
You can use your code with a minor modification: you have to specify the number of samples to draw, one per row in the group.
# draw len(x) distinct days (1-365) for each year group
random_days = lambda x: np.random.choice(range(1, 366), len(x), replace=False)
years['day'] = years.groupby('year').transform(random_days)
Output:
>>> years
year day
0 1 18
1 1 300
2 2 154
3 2 355
4 2 311
5 3 18
6 3 14
7 4 160
8 4 304
9 4 67
10 4 6
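Note that range(1, 366) draws days 1 to 365, while np.random.choice(366) in the question draws days 0 to 365; use range(366) if day 0 should be possible.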
Another option with NumPy: draw distinct days for the whole frame, then shuffle them within each group (shuffling preserves within-group uniqueness):
# one distinct day (0-365) per row across the whole frame
years["day"] = np.random.choice(366, years.shape[0], replace=False)
# reshuffle the days within each year group
years["day"] = years.groupby("year").transform(lambda x: np.random.permutation(x))
Output:
print(years)
year day
0 1 233
1 1 147
2 2 1
3 2 340
4 2 267
5 3 204
6 3 256
7 4 354
8 4 94
9 4 196
10 4 164
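Note that the first line draws without replacement across the whole frame, so it assumes the frame has at most 366 rows; with more rows than that, draw per group instead, as in the answer above.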
This is my first time posting here. I tried almost everything but could not find a solution. Please help!
I have a pandas dataframe with the columns ID, Step, X and Y. Each ID has a number of steps. I want to add a new column (new_id) that takes integer values starting from 1. Two IDs should get the same new_id if they have the same X and Y values for all of their steps; otherwise the new_id should be the previous new_id plus 1. I tried to do this with a for loop but could not get it working.
DataFrame (df)
ID Step X Y
1001 0 100 200
1001 1 200 300
1001 2 100 250
1001 3 150 200
1002 0 150 200
1002 1 200 250
1002 2 250 300
1002 3 300 150
1003 0 100 200
1003 1 200 300
1003 2 100 250
1003 3 150 200
1004 0 150 200
1004 1 200 250
1004 2 250 300
1004 3 300 150
1005 0 125 220
1005 1 200 250
1005 2 250 300
1005 3 300 150
Newly Created DataFrame (df)
ID Step X Y new_id
1001 0 100 200 1
1001 1 200 300 1
1001 2 100 250 1
1001 3 150 200 1
1002 0 150 200 2
1002 1 200 250 2
1002 2 250 300 2
1002 3 300 150 2
1003 0 100 200 1
1003 1 200 300 1
1003 2 100 250 1
1003 3 150 200 1
1004 0 150 200 2
1004 1 200 250 2
1004 2 250 300 2
1004 3 300 150 2
1005 0 125 220 3
1005 1 200 250 3
1005 2 250 300 3
1005 3 300 150 3
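One possible approach (a sketch, not from the original post, assuming the dataframe above is named df): collapse each ID's (X, Y) sequence into a hashable key, then number the keys with pandas' factorize, which assigns the same integer to identical sequences in order of first appearance:
import pandas as pd

# one (X, Y)-sequence key per ID, e.g. ((100, 200), (200, 300), ...)
key = df.groupby('ID').apply(lambda g: tuple(zip(g['X'], g['Y'])))
# factorize numbers the keys 0, 1, 2, ... by first appearance; +1 makes it 1-based
df['new_id'] = df['ID'].map(dict(zip(key.index, pd.factorize(key)[0] + 1)))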
I've got a dataframe:
1990 1991 1992 .... 2015 2016 2017
0 9 40 300 100 200 554
1 9 70 700 3300 200 554
2 5 70 900 100 200 554
3 8 80 900 176 200 554
4 7 50 200 250 280 145
5 9 30 900 100 207 554
6 2 80 700 180 200 554
7 2 80 400 100 200 554
8 5 80 300 100 200 554
9 7 70 800 100 200 554
How do I select the columns with names < 2000 and > 2005?
I tried the code below, but it failed:
1. df[(df.loc[:, :2000]) & (df.loc[:, 2005:])]
2. df[(df <2000) & (df>2005)]
Compare the column names:
print (df)
1999 2002 2003 2005 2006 2017
0 9 40 300 100 200 554
1 9 70 700 3300 200 554
2 5 70 900 100 200 554
3 8 80 900 176 200 554
4 7 50 200 250 280 145
5 9 30 900 100 207 554
6 2 80 700 180 200 554
7 2 80 400 100 200 554
8 5 80 300 100 200 554
9 7 70 800 100 200 554
df = df.loc[:, (df.columns < 2000) | (df.columns > 2005)]
print (df)
1999 2006 2017
0 9 200 554
1 9 200 554
2 5 200 554
3 8 200 554
4 7 280 145
5 9 207 554
6 2 200 554
7 2 200 554
8 5 200 554
9 7 200 554
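Note that comparing df.columns to a number assumes the column labels are integers; if they were read in as strings (e.g. from a CSV), convert them first:
df.columns = df.columns.astype(int)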
I have a dataframe df as below:
Cus_id card Nation Gender Age Code Amount yearmonth
111 1 India M Adult 543 100 201601
111 1 India M Adult 543 100 201601
111 1 India M Adult 543 150 201602
111 1 India M Adult 612 100 201602
111 1 India M Adult 715 200 201603
222 2 India M Adult 715 200 201601
222 2 India M Adult 543 100 201604
222 2 India M Adult 543 100 201603
333 3 India M Adult 543 100 201601
333 3 India M Adult 543 100 201601
333 4 India M Adult 543 150 201602
333 4 India M Adult 612 100 201607
Now, I want two dataframes as below:
df_1:
Card Code Total_Amount Avg_Amount
1 543 350 175
2 543 200 100
3 543 200 200
4 543 150 150
1 612 100 100
4 612 100 100
1 715 200 200
2 715 200 200
Logic for df_1:
1. Total_Amount: for each unique Card and Code, the sum of Amount (e.g. Card 1, Code 543 gives 350).
2. Avg_Amount: Total_Amount divided by the number of unique yearmonth values for that Card and Code (e.g. Total_Amount = 350 over 2 unique yearmonths gives 175).
df_2:
Code Avg_Amount
543 156.25
612 100
715 200
Logic for df_2:
1. Avg_Amount: the mean of Avg_Amount for each Code in df_1 (e.g. for Code 543, the sum of Avg_Amount is 175 + 100 + 200 + 150 = 625; divided by the 4 rows, that gives 625 / 4 = 156.25).
Code to create the dataframe df:
import pandas as pd

df = pd.DataFrame({'Cus_id': (111,111,111,111,111,222,222,222,333,333,333,333),
                   'Card': (1,1,1,1,1,2,2,2,3,3,4,4),
                   'Nation': ('India','India','India','India','India','India','India','India','India','India','India','India'),
                   'Gender': ('M','M','M','M','M','M','M','M','M','M','M','M'),
                   'Age': ('Adult','Adult','Adult','Adult','Adult','Adult','Adult','Adult','Adult','Adult','Adult','Adult'),
                   'Code': (543,543,543,612,715,715,543,543,543,543,543,612),
                   'Amount': (100,100,150,100,200,200,100,100,100,100,150,100),
                   'yearmonth': (201601,201601,201602,201602,201603,201601,201604,201603,201601,201601,201602,201607)})
Code to get the required df_1 and df_2:
df1 = df.groupby(['Card','Code'])[['yearmonth','Amount']].apply(
    lambda x: [sum(x.Amount), sum(x.Amount)/len(set(x.yearmonth))]).apply(
    pd.Series).reset_index()
df1.columns = ['Card','Code','Total_Amount','Avg_Amount']
df2 = df1.groupby('Code')['Avg_Amount'].apply(lambda x: sum(x)/len(x)).reset_index(
    name='Avg_Amount')
Though the code works fine, my dataset is huge and it is taking a long time. I suspect the apply calls are the bottleneck. Is there a more optimized way to write this?
For DataFrame 1 you can do this:
tmp = df.groupby(['Card', 'Code'], as_index=False) \
        .agg({'Amount': 'sum', 'yearmonth': pd.Series.nunique})
df1 = tmp.assign(Avg_Amount=tmp.Amount / tmp.yearmonth) \
         .drop(columns=['yearmonth'])
Card Code Amount Avg_Amount
0 1 543 350 175.0
1 1 612 100 100.0
2 1 715 200 200.0
3 2 543 200 100.0
4 2 715 200 200.0
5 3 543 200 200.0
6 4 543 150 150.0
7 4 612 100 100.0
For DataFrame 2 you can do this:
df1.groupby('Code', as_index=False) \
.agg({'Avg_Amount': 'mean'})
Code Avg_Amount
0 543 156.25
1 612 100.00
2 715 200.00
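On pandas 0.25 or newer, named aggregation gets you the Total_Amount column name directly, without a separate rename (a sketch equivalent to the above):
df1 = df.groupby(['Card', 'Code'], as_index=False) \
        .agg(Total_Amount=('Amount', 'sum'),
             n_months=('yearmonth', 'nunique'))
df1['Avg_Amount'] = df1['Total_Amount'] / df1['n_months']
df1 = df1.drop(columns=['n_months'])
df2 = df1.groupby('Code', as_index=False)['Avg_Amount'].mean()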
I'm wondering how to sum up the next 10 rows of a dataframe from any point.
I tried rolling(10, min_periods=1).sum(), but rolling windows look backwards, and the very first row should sum up the 10 rows below it. I have a similar issue with cumsum().
So if my dataframe is just the A column, I'd like it to output B.
A B
0 10 550
1 20 650
2 30 750
3 40 850
4 50 950
5 60 1050
6 70 1150
7 80 1250
8 90 1350
9 100 1450
10 110 etc
11 120 etc
12 130 etc
13 140
14 150
15 160
16 170
17 180
18 190
It would be similar to doing this operation in Excel and copying the formula down.
You can reverse your series before using pd.Series.rolling, and then reverse the result:
df['B'] = df['A'][::-1].rolling(10, min_periods=0).sum()[::-1]
print(df)
A B
0 10 550.0
1 20 650.0
2 30 750.0
3 40 850.0
4 50 950.0
5 60 1050.0
6 70 1150.0
7 80 1250.0
8 90 1350.0
9 100 1450.0
10 110 1350.0
11 120 1240.0
12 130 1120.0
13 140 990.0
14 150 850.0
15 160 700.0
16 170 540.0
17 180 370.0
18 190 190.0
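On pandas 1.1 or newer, you can also build the forward-looking window directly and skip the double reverse:
from pandas.api.indexers import FixedForwardWindowIndexer

# window = the current row plus the 9 rows below it
indexer = FixedForwardWindowIndexer(window_size=10)
df['B'] = df['A'].rolling(indexer, min_periods=1).sum()
This gives the same result, including the partial sums in the last nine rows.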