I am trying to generate a different random day within each year group of a dataframe, so I need replace=False within each group; otherwise days can repeat.
I can't just add a single column of random numbers drawn without replacement across the whole frame, because I will have more than 365 rows in total, and once 365 values have been drawn there are no more samples available without replacement.
I have explored agg, aggregate, apply and transform. The closest I have got is this:
years = pd.DataFrame({"year": [1,1,2,2,2,3,3,4,4,4,4]})
years["day"] = 0
grouped = years.groupby("year")["day"]
grouped.transform(lambda x: np.random.choice(366, replace=False))
Which gives this:
0 8
1 8
2 319
3 319
4 319
5 149
6 149
7 130
8 130
9 130
10 130
Name: day, dtype: int64
But I want this:
0 8
1 16
2 119
3 321
4 333
5 4
6 99
7 30
8 129
9 224
10 355
Name: day, dtype: int64
You can use your code with a minor modification: you have to specify the number of samples to draw for each group.
random_days = lambda x: np.random.choice(range(1, 366), len(x), replace=False)
years['day'] = years.groupby('year').transform(random_days)
Output:
>>> years
year day
0 1 18
1 1 300
2 2 154
3 2 355
4 2 311
5 3 18
6 3 14
7 4 160
8 4 304
9 4 67
10 4 6
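For reproducibility, the same idea can be written with NumPy's newer Generator API; a minimal sketch (the seed value is arbitrary, purely for illustration):
import numpy as np
import pandas as pd

rng = np.random.default_rng(42)  # fixed seed so the draw is repeatable
years = pd.DataFrame({"year": [1, 1, 2, 2, 2, 3, 3, 4, 4, 4, 4]})

# draw len(x) distinct days (1-365) independently within each year group
years["day"] = years.groupby("year")["year"].transform(
    lambda x: rng.choice(np.arange(1, 366), size=len(x), replace=False)
)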
With NumPy broadcasting:
years["day"] = np.random.choice(366, years.shape[0], False) % 366
years["day"] = years.groupby("year").transform(lambda x: np.random.permutation(x))
Output:
print(years)
year day
0 1 233
1 1 147
2 2 1
3 2 340
4 2 267
5 3 204
6 3 256
7 4 354
8 4 94
9 4 196
10 4 164
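Note that the first line draws all years.shape[0] values without replacement from a single pool of 366, so the days are unique across the entire frame, not just within each group, and this raises an error once the frame has more than 366 rows. The trailing % 366 is a no-op, since np.random.choice(366, ...) already returns values in 0-365. The second line then just reshuffles those already-unique values within each year group.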
I am trying to append every column from one row onto another row. I want to do this for every row, but some rows will not have any values. Take a look at my code; it will be clearer:
Here is my data
date day_of_week day_of_month day_of_year month_of_year
5/1/2017 0 1 121 5
5/2/2017 1 2 122 5
5/3/2017 2 3 123 5
5/4/2017 3 4 124 5
5/8/2017 0 8 128 5
5/9/2017 1 9 129 5
5/10/2017 2 10 130 5
5/11/2017 3 11 131 5
5/12/2017 4 12 132 5
5/15/2017 0 15 135 5
5/16/2017 1 16 136 5
5/17/2017 2 17 137 5
5/18/2017 3 18 138 5
5/19/2017 4 19 139 5
5/23/2017 1 23 143 5
5/24/2017 2 24 144 5
5/25/2017 3 25 145 5
5/26/2017 4 26 146 5
Here is my current code:
def GetNextDayMarketData(row, dataframe):
    if row['next_calendarday'] is pd.NaT:
        return
    key = row['next_calendarday'].strftime("%Y-%m-%d")
    nextrow = dataframe.loc[key]
    for index, val in nextrow.items():
        if index != "next_calendarday":
            dataframe.loc[row.name, index + '_nextday'] = val

s = df_md['date'].shift(-1)
df_md['next_calendarday'] = s.mask(s.dt.dayofweek.diff().lt(0))
df_md.set_index('date', inplace=True)
df_md.apply(lambda row: GetNextDayMarketData(row, df_md), axis=1)
This works but it's so slow it might as well not work. Here is what the result should look like, you can see that the value from the next row has been added to the previous row. The kicker is that it's the next calendar date and not just the next row in the sequence. If a row does not have an entry for next calendar date, it will simply be blank.
Here is the expected result in csv
date day_of_week day_of_month day_of_year month_of_year next_workingday day_of_week_nextday day_of_month_nextday day_of_year_nextday month_of_year_nextday
5/1/2017 0 1 121 5 5/2/2017 1 2 122 5
5/2/2017 1 2 122 5 5/3/2017 2 3 123 5
5/3/2017 2 3 123 5 5/4/2017 3 4 124 5
5/4/2017 3 4 124 5
5/8/2017 0 8 128 5 5/9/2017 1 9 129 5
5/9/2017 1 9 129 5 5/10/2017 2 10 130 5
5/10/2017 2 10 130 5 5/11/2017 3 11 131 5
5/11/2017 3 11 131 5 5/12/2017 4 12 132 5
5/12/2017 4 12 132 5
5/15/2017 0 15 135 5 5/16/2017 1 16 136 5
5/16/2017 1 16 136 5 5/17/2017 2 17 137 5
5/17/2017 2 17 137 5 5/18/2017 3 18 138 5
5/18/2017 3 18 138 5 5/19/2017 4 19 139 5
5/19/2017 4 19 139 5
5/23/2017 1 23 143 5 5/24/2017 2 24 144 5
5/24/2017 2 24 144 5 5/25/2017 3 25 145 5
5/25/2017 3 25 145 5 5/26/2017 4 26 146 5
5/26/2017 4 26 146 5
5/30/2017 1 30 150 5
Use DataFrame.join, then remove the column next_calendarday_nextday:
df = df.set_index('date')
df = (df.join(df, on='next_calendarday', rsuffix='_nextday')
.drop('next_calendarday_nextday', axis=1))
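For context, here is the full vectorized flow with the question's own preprocessing folded in (a sketch assuming df_md['date'] is a datetime column, as in the question):
import pandas as pd

# compute the next calendar day exactly as in the question
s = df_md['date'].shift(-1)
df_md['next_calendarday'] = s.mask(s.dt.dayofweek.diff().lt(0))

# a single vectorized join replaces the row-wise apply entirely
df_md = df_md.set_index('date')
df_md = (df_md.join(df_md, on='next_calendarday', rsuffix='_nextday')
              .drop('next_calendarday_nextday', axis=1))
The join performs one indexed lookup for all rows at once, which is why it is dramatically faster than writing cell-by-cell with .loc inside apply.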
I have the following dataframe.
hour sensor_id hourly_count
0 1 101 651
1 1 102 19
2 2 101 423
3 2 102 12
4 3 101 356
5 4 101 79
6 4 102 21
7 5 101 129
8 6 101 561
Notice that for sensor_id 102, there are no values for hour = 3. This is due to the fact that the sensors do not generate a separate row of data if the hourly_count is equal to zero. This means that sensor 102 should have hourly_counts = 0 at hour = 3, but this is just the way the original data was collected.
I would ideally wish for a code that fills in this gap. So it should understand that if there are 2 sensors, each sensor should have an hourly record, and if not, insert a row in the dataframe for that sensor for that hour and fill the hourly_count column at that row as 0.
hour sensor_id hourly_count
0 1 101 651
1 1 102 19
2 2 101 423
3 2 102 12
4 3 101 356
5 3 102 0
6 4 101 79
7 4 102 21
8 5 101 129
9 5 102 0
10 6 101 561
11 6 102 0
Any help is really appreciated.
Using DataFrame.reindex, you can explicitly define your index. This is useful if you are missing data from both sensors for a particular hour. You can also extend the hour beyond what you have. In the following example, it extends out to hour 8.
new_ix = pd.MultiIndex.from_product([range(1,9), [101, 102]], names=['hour', 'sensor_id'])
df_new = df.set_index(['hour', 'sensor_id'])
df_new.reindex(new_ix, fill_value=0).reset_index()
Output:
hour sensor_id hourly_count
0 1 101 651
1 1 102 19
2 2 101 423
3 2 102 12
4 3 101 356
5 3 102 0
6 4 101 79
7 4 102 21
8 5 101 129
9 5 102 0
10 6 101 561
11 6 102 0
12 7 101 0
13 7 102 0
14 8 101 0
15 8 102 0
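A hedged variation on the same answer: derive both index levels from the data itself instead of hardcoding them (this assumes df is the question's frame; the names new_ix and out are mine):
import pandas as pd

new_ix = pd.MultiIndex.from_product(
    [range(df['hour'].min(), df['hour'].max() + 1),
     sorted(df['sensor_id'].unique())],
    names=['hour', 'sensor_id'],
)
out = df.set_index(['hour', 'sensor_id']).reindex(new_ix, fill_value=0).reset_index()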
Use pandas.DataFrame.pivot, then unstack with reset_index (renaming the otherwise-unnamed value column back to hourly_count while we are at it):
new_df = (df.pivot(index='sensor_id', columns='hour', values='hourly_count')
            .fillna(0)
            .unstack()
            .reset_index(name='hourly_count'))
print(new_df)
Output:
hour sensor_id hourly_count
0 1 101 651.0
1 1 102 19.0
2 2 101 423.0
3 2 102 12.0
4 3 101 356.0
5 3 102 0.0
6 4 101 79.0
7 4 102 21.0
8 5 101 129.0
9 5 102 0.0
10 6 101 561.0
11 6 102 0.0
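One small follow-up: the counts come back as floats because the pivoted frame holds NaNs before fillna(0), and NaN forces a float dtype. If integer counts matter, a one-line cast restores them:
# cast the filled counts back to integers
new_df['hourly_count'] = new_df['hourly_count'].astype(int)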
Assume the missing rows are for sensor 102 only. One way is to build a new frame with every combination of hour and sensor_id, then left-merge the original onto it to pick up hourly_count and fill the gaps with 0.
a = df.hour.unique()
df1 = pd.MultiIndex.from_product([a, [101, 102]]).to_frame(index=False, name=['hour', 'sensor_id'])
Out[157]:
hour sensor_id
0 1 101
1 1 102
2 2 101
3 2 102
4 3 101
5 3 102
6 4 101
7 4 102
8 5 101
9 5 102
10 6 101
11 6 102
df1.merge(df, on=['hour','sensor_id'], how='left').fillna(0)
Out[161]:
hour sensor_id hourly_count
0 1 101 651.0
1 1 102 19.0
2 2 101 423.0
3 2 102 12.0
4 3 101 356.0
5 3 102 0.0
6 4 101 79.0
7 4 102 21.0
8 5 101 129.0
9 5 102 0.0
10 6 101 561.0
11 6 102 0.0
Another way: use unstack with fill_value:
df.set_index(['hour', 'sensor_id']).unstack(fill_value=0).stack().reset_index()
Out[171]:
hour sensor_id hourly_count
0 1 101 651
1 1 102 19
2 2 101 423
3 2 102 12
4 3 101 356
5 3 102 0
6 4 101 79
7 4 102 21
8 5 101 129
9 5 102 0
10 6 101 561
11 6 102 0
Say I have a DataFrame full of positive samples and context features for a given user:
target user cashtag sector industry
0 1 170 4979 3 70
1 1 170 5539 3 70
2 1 170 7271 3 70
3 1 170 7428 3 70
4 1 170 686 7 139
where a positive sample is a user having interacted with a cashtag and is denoted by target = 1.
What is a quick way for me to generate negative samples in the ratio 1:2 (+ve:-ve) for each interaction, denoted by target = -1?
EDIT: Sample for clarity below (for the first two positive samples)
target user cashtag sector industry
0 1 170 4979 3 70
1 -1 170 3224 7 181
2 -1 170 4331 7 180
3 1 170 5539 3 70
4 -1 170 9304 4 59
5 -1 170 3833 6 185
For instance, for each cashtag a user has interacted with, I'd like to pick at random 2 other cashtags that they haven't interacted with and add them as negative samples to the dataframe; effectively increasing the size of the dataframe to 3 times its original size.
It would also be helpful to check if the negative sample hasn't already been entered for that user, cashtag combination.
Here is my solution:
import io
import numpy as np
import pandas as pd

data = """
target user cashtag sector industry
1 170 4979 3 70
1 170 5539 3 70
1 170 7271 3 70
1 170 7428 3 70
1 170 686 7 139
"""
df = pd.read_csv(io.StringIO(data), sep=r'\s+')
df1 = pd.DataFrame(columns = df.columns)
cashtag = df['cashtag'].values.tolist()
# helper to draw one random integer below v
def randomnumber(v):
    return np.random.randint(v, size=1)

def addNewRow(x):
    for i in range(2):  # add 2 new rows per positive sample
        cash = cashtag[0]
        while cash in cashtag:  # check if cashtag already used
            cash = randomnumber(5000)[0]  # random number between 0 and 5000
        cashtag.append(cash)
        sector = randomnumber(10)[0]
        industry = randomnumber(200)[0]
        df1.loc[df1.shape[0]] = [-1, x.user, cash, sector, industry]
df.apply(lambda x: addNewRow(x), axis=1)
df = pd.concat([df, df1]).reset_index()  # DataFrame.append is deprecated; concat does the same job
print(df)
output:
index target user cashtag sector industry
0 0 1 170 4979 3 70
1 1 1 170 5539 3 70
2 2 1 170 7271 3 70
3 3 1 170 7428 3 70
4 4 1 170 686 7 139
5 0 -1 170 544 2 59
6 1 -1 170 3202 8 165
7 2 -1 170 2673 0 40
8 3 -1 170 4021 1 30
9 4 -1 170 682 6 3
10 5 -1 170 2446 1 80
11 6 -1 170 4026 9 193
12 7 -1 170 4070 9 197
13 8 -1 170 2900 1 57
14 9 -1 170 3287 0 21
The new random rows are appended at the end of the dataframe.
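The question also asks that negatives come from cashtags the user genuinely has not interacted with, rather than arbitrary random ids. A hedged alternative sketch that samples per user from the observed cashtag universe (df is the question's frame; df_neg and df_all are names I introduce; on the tiny five-row example the single user has seen every cashtag, so the pool is empty there, but on real data it would not be):
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)  # arbitrary seed, for reproducibility only
all_tags = df['cashtag'].unique()

rows = []
for user, grp in df.groupby('user'):
    seen = set(grp['cashtag'])
    pool = np.array([t for t in all_tags if t not in seen])
    n = min(2 * len(grp), len(pool))  # two negatives per positive, capped by pool size
    if n == 0:
        continue  # the user has interacted with every known cashtag
    for tag in rng.choice(pool, size=n, replace=False):
        rows.append({'target': -1, 'user': user, 'cashtag': tag})

df_neg = pd.DataFrame(rows)
# sector/industry for the negatives could be merged in afterwards from a
# cashtag -> (sector, industry) lookup, if one is available
df_all = pd.concat([df, df_neg], ignore_index=True)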
I have some data that looks like the df shown below. I am trying to first calculate the mean angle for each group using the function mean_angle. The calculated mean angle is then used in another per-group calculation using the function fun.
import pandas as pd
import numpy as np
# generate sample data
a = np.array([1,2,3,4]).repeat(4)
x1 = 90 + np.random.randint(-15, 15, size=a.size//2 - 2 )
x2 = 270 + np.random.randint(-50, 50, size=a.size//2 + 2 )
b = np.concatenate((x1, x2))
np.random.shuffle(b)
df = pd.DataFrame({'a':a, 'b':b})
The returned dataframe is printed below.
a b
0 1 295
1 1 78
2 1 280
3 1 94
4 2 308
5 2 227
6 2 96
7 2 299
8 3 248
9 3 288
10 3 81
11 3 78
12 4 103
13 4 265
14 4 309
15 4 229
My functions are mean_angle and fun
def mean_angle(deg):
    deg = np.deg2rad(deg)
    deg = deg[~np.isnan(deg)]
    S = np.sum(np.sin(deg))
    C = np.sum(np.cos(deg))
    mu = np.arctan2(S, C)
    mu = np.rad2deg(mu)
    if mu < 0:
        mu = 360 + mu
    return mu

def fun(x, mu):
    return np.where(abs(mu - x) < 45, x, np.where(x + 180 < 360, x + 180, x - 180))
What I have tried:
mu = df.groupby(['a'])['b'].apply(mean_angle)
df2 = df.groupby(['a'])['b'].apply(fun, args = (mu,)) #this function should be element wise
I know it is totally wrong but I could not come up with a better way.
The desired output is something like this, where mu is the mean angle of the group:
a b c
0 1 295 np.where(abs(mu - 295) < 45, 295, np.where(295 +180<360, 295 +180, 295 -180))
1 1 78 np.where(abs(mu - 78) < 45, 78, np.where(78 +180<360, 78 +180, 78 -180))
2 1 280 np.where(abs(mu - 280) < 45, 280, np.where(280 +180<360, 280 +180, 280 -180))
3 1 94 ...
4 2 308 ...
5 2 227 .
6 2 96 .
7 2 299 .
8 3 248 .
9 3 288 .
10 3 81 .
11 3 78 .
12 4 103 .
13 4 265 .
14 4 309 .
15 4 229 .
Any help is appreciated
You don't need your second function; just pass the necessary columns to np.where(). Creating your dataframe in the same manner and leaving your mean_angle function unmodified, we have the following sample dataframe:
a b
0 1 228
1 1 291
2 1 84
3 1 226
4 2 266
5 2 311
6 2 82
7 2 274
8 3 79
9 3 250
10 3 222
11 3 88
12 4 80
13 4 291
14 4 100
15 4 293
Then create your c column (containing your mu values) using groupby() and transform(), and finally apply your np.where() logic:
df['c'] = df.groupby(['a'])['b'].transform(mean_angle)
df['c'] = np.where(abs(df['c'] - df['b']) < 45, df['b'], np.where(df['b']+180<360, df['b']+180, df['b']-180))
Yields:
a b c
0 1 228 228
1 1 291 111
2 1 84 264
3 1 226 226
4 2 266 266
5 2 311 311
6 2 82 262
7 2 274 274
8 3 79 259
9 3 250 70
10 3 222 42
11 3 88 268
12 4 80 260
13 4 291 111
14 4 100 280
15 4 293 113
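The key step is transform(mean_angle): unlike apply, transform broadcasts the single per-group scalar back onto every row of that group, so the subsequent np.where can operate element-wise on aligned columns of the same length.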
I have golf data from the PGA Tour that I am trying to clean. Originally, I had white space in front of some of the hole scores. This is what a unique count looked like:
df["Hole Score"].value_counts()
Out[76]:
4 566072
5 272074
3 218873
6 48596
4 38306
5 19728
2 17339
3 15093
7 7750
4232
6 3011
8 1313
2 1080
7 389
9 369
10 66
8 61
11 38
1 27
9 20
Name: Hole Score, dtype: int64
I was able to run a whitespace-remover function that got rid of the leading white space. However, value_counts() returns the same frequency counts:
df["Hole Score_2"].value_counts()
Out[74]:
4 566072
5 272074
3 218873
6 48596
4 38306
5 19728
2 17339
3 15093
7 7750
4232
6 3011
8 1313
2 1080
7 389
9 369
10 66
8 61
11 38
1 27
9 20
Name: Hole Score_2, dtype: int64
For reference, this is the helper function that I used:
def remove_whitespace(x):
    try:
        x = "".join(x.split())
    except AttributeError:
        pass  # non-string values (e.g. ints) pass through unchanged
    return x

df["Hole Score_2"] = df["Hole Score"].apply(remove_whitespace)
My question: How do I get the unique counts to list one Hole Score per count? What could be causing the double counting?
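A likely cause, judging from the helper above (hedged, since the column dtypes are not shown): the column holds a mix of strings and integers. Integers hit the except branch and pass through untouched, so after stripping, the string "4" and the integer 4 are still distinct values, and value_counts() tallies them separately. A quick diagnostic and a normalizing fix:
# Diagnostic: list the Python types actually present in the column.
print(df["Hole Score"].map(type).value_counts())

# Fix (assuming the mix is confirmed): normalize everything to one numeric
# dtype; astype(str) makes stripping uniform, to_numeric converts back.
df["Hole Score_2"] = pd.to_numeric(df["Hole Score"].astype(str).str.strip())
print(df["Hole Score_2"].value_counts())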