Pandas: Quick random negative sampling

Pandas: Quick random negative sampling - python

Say I have a DataFrame full of positive samples and context features for a given user:
target user cashtag sector industry
0 1 170 4979 3 70
1 1 170 5539 3 70
2 1 170 7271 3 70
3 1 170 7428 3 70
4 1 170 686 7 139
where a positive sample is a user having interacted with a cashtag and is denoted by target = 1.
What is a quick way for me to generate negative samples in the ratio 1:2 (+ve:-ve) for each interaction, denoted by target = -1?
EDIT: Sample for clarity below (for the first two positive samples)
target user cashtag sector industry
0 1 170 4979 3 70
1 -1 170 3224 7 181
2 -1 170 4331 7 180
3 1 170 5539 3 70
4 -1 170 9304 4 59
5 -1 170 3833 6 185
For instance, for each cashtag a user has interacted with, I'd like to pick at random 2 other cashtags that they haven't interacted with and add them as negative samples to the dataframe; effectively increasing the size of the dataframe to 3 times its original size.
It would also be helpful to check if the negative sample hasn't already been entered for that user, cashtag combination.

Here my solution:
data="""
target user cashtag sector industry
1 170 4979 3 70
1 170 5539 3 70
1 170 7271 3 70
1 170 7428 3 70
1 170 686 7 139
"""
df = pd.read_csv(pd.compat.StringIO(data), sep='\s+')
df1 = pd.DataFrame(columns = df.columns)
cashtag = df['cashtag'].values.tolist()
#function to randomize some numbers
def randomnumber(v):
return np.random.randint(v, size=1)
def addNewRow(x):
for i in range(2): #add 2 new rows
cash = cashtag[0]
while cash in cashtag: #check if cashtag already used
cash = randomnumber(5000)[0] #random number between 0 and 5000
cashtag.append(cash)
sector = randomnumber(10)[0]
industry = randomnumber(200)[0]
df1.loc[df1.shape[0]] = [-1, x.user, cash, sector, industry]
df.apply(lambda x: addNewRow(x), axis=1)
df = df.append(df1).reset_index()
print(df)
output:
index target user cashtag sector industry
0 0 1 170 4979 3 70
1 1 1 170 5539 3 70
2 2 1 170 7271 3 70
3 3 1 170 7428 3 70
4 4 1 170 686 7 139
5 0 -1 170 544 2 59
6 1 -1 170 3202 8 165
7 2 -1 170 2673 0 40
8 3 -1 170 4021 1 30
9 4 -1 170 682 6 3
10 5 -1 170 2446 1 80
11 6 -1 170 4026 9 193
12 7 -1 170 4070 9 197
13 8 -1 170 2900 1 57
14 9 -1 170 3287 0 21
The new random rows are put at the end of dataframe

Related

Pandas groupby apply a random day to each group of years

I am trying to generate a different random day within each year group of a dataframe. So I need replacement = False, otherwise it will fail.
You can't just add a column of random numbers because I'm going to have more than 365 years in my list of years and once you hit 365 it can't create any more random samples without replacement.
I have explored agg, aggreagte, apply and transform. The closest I have got is with this:
years = pd.DataFrame({"year": [1,1,2,2,2,3,3,4,4,4,4]})
years["day"] = 0
grouped = years.groupby("year")["day"]
grouped.transform(lambda x: np.random.choice(366, replace=False))
Which gives this:
0 8
1 8
2 319
3 319
4 319
5 149
6 149
7 130
8 130
9 130
10 130
Name: day, dtype: int64
But I want this:
0 8
1 16
2 119
3 321
4 333
5 4
6 99
7 30
8 129
9 224
10 355
Name: day, dtype: int64

You can use your code with a minor modification. You have to specify the number of samples.
random_days = lambda x: np.random.choice(range(1, 366), len(x), replace=False)
years['day'] = years.groupby('year').transform(random_days)
Output:
>>> years
year day
0 1 18
1 1 300
2 2 154
3 2 355
4 2 311
5 3 18
6 3 14
7 4 160
8 4 304
9 4 67
10 4 6

With numpy broadcasting :
years["day"] = np.random.choice(366, years.shape[0], False) % 366

years["day"] = years.groupby("year").transform(lambda x: np.random.permutation(x))

Output :
print(years)
year day
0 1 233
1 1 147
2 2 1
3 2 340
4 2 267
5 3 204
6 3 256
7 4 354
8 4 94
9 4 196
10 4 164

Get Row in Other Table

I have a dataframe 'df'. Using the validation data validData, I want to compute the response rate (Florence = 1/Yes) using the rfm_aboveavg (RFM combinations response rates above the overall response). Response rate is given by considering 0/No and 1/Yes, so it would be rfm_crosstab[1] / rfm_crosstab['All'].
Using the results from the validation data, I want to only display the rows that are also shown in the training data output by the RFM column. How do I do this?
Data: 'df'
Seq# ID# Gender M R F FirstPurch ChildBks YouthBks CookBks ... ItalCook ItalAtlas ItalArt Florence Related Purchase Mcode Rcode Fcode Yes_Florence No_Florence
0 1 25 1 297 14 2 22 0 1 1 ... 0 0 0 0 0 5 4 2 0 1
1 2 29 0 128 8 2 10 0 0 0 ... 0 0 0 0 0 4 3 2 0 1
2 3 46 1 138 22 7 56 2 1 2 ... 1 0 0 0 2 4 4 3 0 1
3 4 47 1 228 2 1 2 0 0 0 ... 0 0 0 0 0 5 1 1 0 1
4 5 51 1 257 10 1 10 0 0 0 ... 0 0 0 0 0 5 3 1 0 1
My code: Crosstab for training data trainData
trainData, validData = train_test_split(df, test_size=0.4, random_state=1)
# Response rate for training data as a whole
responseRate = (sum(trainData.Florence == 1) / sum(trainData.Florence == 0)) * 100
# Response rate for RFM categories
# RFM: Combine R, F, M categories into one category
trainData['RFM'] = trainData['Mcode'].astype(str) + trainData['Rcode'].astype(str) + trainData['Fcode'].astype(str)
rfm_crosstab = pd.crosstab(index = [trainData['RFM']], columns = trainData['Florence'], margins = True)
rfm_crosstab['Percentage of 1/Yes'] = 100 * (rfm_crosstab[1] / rfm_crosstab['All'])
# RFM combinations response rates above the overall response
rfm_aboveavg = rfm_crosstab['Percentage of 1/Yes'] > responseRate
rfm_crosstab[rfm_aboveavg]
Output: Training data
Florence 0 1 All Percentage of 1/Yes
RFM
121 3 2 5 40.000000
131 9 1 10 10.000000
212 1 2 3 66.666667
221 6 3 9 33.333333
222 6 1 7 14.285714
313 2 1 3 33.333333
321 17 3 20 15.000000
322 20 4 24 16.666667
323 2 1 3 33.333333
341 61 10 71 14.084507
343 17 2 19 10.526316
411 12 3 15 20.000000
422 26 5 31 16.129032
423 32 8 40 20.000000
441 96 12 108 11.111111
511 19 4 23 17.391304
513 44 8 52 15.384615
521 24 5 29 17.241379
523 74 16 90 17.777778
533 177 28 205 13.658537
My code: Crosstab for validation data validData
# Response rate for RFM categories
# RFM: Combine R, F, M categories into one category
validData['RFM'] = validData['Mcode'].astype(str) + validData['Rcode'].astype(str) + validData['Fcode'].astype(str)
rfm_crosstab1 = pd.crosstab(index = [validData['RFM']], columns = validData['Florence'], margins = True)
rfm_crosstab1['Percentage of 1/Yes'] = 100 * (rfm_crosstab1[1] / rfm_crosstab1['All'])
rfm_crosstab1
Output: Validation data
Florence 0 1 All Percentage of 1/Yes
RFM
131 3 1 4 25.000000
141 8 0 8 0.000000
211 2 1 3 33.333333
212 2 0 2 0.000000
213 0 1 1 100.000000
221 5 0 5 0.000000
222 2 0 2 0.000000
231 21 1 22 4.545455
232 3 0 3 0.000000
233 1 0 1 0.000000
241 11 1 12 8.333333
242 8 0 8 0.000000
243 2 0 2 0.000000
311 7 0 7 0.000000
312 8 0 8 0.000000
313 1 0 1 0.000000
321 12 0 12 0.000000
322 13 0 13 0.000000
323 4 1 5 20.000000
331 19 1 20 5.000000
332 25 2 27 7.407407
333 11 1 12 8.333333
341 36 2 38 5.263158
342 30 2 32 6.250000
343 12 0 12 0.000000
411 8 2 10 20.000000
412 7 0 7 0.000000
413 13 1 14 7.142857
421 21 2 23 8.695652
422 30 1 31 3.225806
423 26 1 27 3.703704
431 51 3 54 5.555556
432 42 7 49 14.285714
433 41 5 46 10.869565
441 68 2 70 2.857143
442 78 3 81 3.703704
443 70 5 75 6.666667
511 17 0 17 0.000000
512 13 1 14 7.142857
513 26 6 32 18.750000
521 19 1 20 5.000000
522 25 6 31 19.354839
523 50 6 56 10.714286
531 66 3 69 4.347826
532 65 3 68 4.411765
533 128 24 152 15.789474
541 86 7 93 7.526882
542 100 6 106 5.660377
543 178 17 195 8.717949
All 1474 126 1600 7.875000

Add a new line to .txt file in python

I have the following .txt file:
0 40 50 0 0 1236 0 0 0
1 45 70 -20 825 870 90 3 0
2 42 68 -10 727 782 90 4 0
3 40 69 20 621 702 90 0 1
4 38 70 10 534 605 90 0 2
5 25 85 -20 652 721 90 11 0
6 22 75 30 30 92 90 0 10
7 22 85 -40 567 620 90 9 0
8 20 80 -10 384 429 90 12 0
9 20 85 40 475 528 90 0 7
10 18 75 -30 99 148 90 6 0
11 15 75 20 179 254 90 0 5
12 15 80 10 278 345 90 0 8
I need to copy the first line and add it to the .txt file as last line in order to get this:
0 40 50 0 0 1236 0 0 0
1 45 70 -20 825 870 90 3 0
2 42 68 -10 727 782 90 4 0
3 40 69 20 621 702 90 0 1
4 38 70 10 534 605 90 0 2
5 25 85 -20 652 721 90 11 0
6 22 75 30 30 92 90 0 10
7 22 85 -40 567 620 90 9 0
8 20 80 -10 384 429 90 12 0
9 20 85 40 475 528 90 0 7
10 18 75 -30 99 148 90 6 0
11 15 75 20 179 254 90 0 5
12 15 80 10 278 345 90 0 8
13 40 50 0 0 1236 0 0 0
How can I do that? (Notice the 13 as the first entry of the last line)

Try the following. I have added some comments to describe the steps
with open('yourfile.txt') as f:
t=f.readlines()
row=t[-1].split()[0] #get the last row index
row=str(int(row)+1) #increase the last row index
new_line=t[0] #copy first line
new_line=new_line.replace('0', row, 1).replace(' ', '',len(row)-1) #add the next row index to new line, taking care of spaces
t[-1]=t[-1]+'\n'
t.append(new_line) #append the new line
with open('yourfile.txt', 'w') as f:
f.writelines(t)
Applied to your existing .txt, result of the above code is:
0 40 50 0 0 1236 0 0 0
1 45 70 -20 825 870 90 3 0
2 42 68 -10 727 782 90 4 0
3 40 69 20 621 702 90 0 1
4 38 70 10 534 605 90 0 2
5 25 85 -20 652 721 90 11 0
6 22 75 30 30 92 90 0 10
7 22 85 -40 567 620 90 9 0
8 20 80 -10 384 429 90 12 0
9 20 85 40 475 528 90 0 7
10 18 75 -30 99 148 90 6 0
11 15 75 20 179 254 90 0 5
12 15 80 10 278 345 90 0 8
13 40 50 0 0 1236 0 0 0

You can use:
with open('fileName.txt') as file:
first_line = file.readline()
count = sum(1 for _ in file)
line1 = first_line.split()
line1[0] = count
str = ' '.join(line1)
#then you can add it to the end of the file with:
file_object.write(str)

Matplotlib heatmap using pandas dataframe

I want to produce a heatmap using matplolib and this pandas dataframe:
class time day
0 URxvt 2 6
1 Emacs 3 6
2 Firefox 90 6
3 KeePassXC 5 6
4 URxvt 91 6
.. ... ... ...
144 Matplotlib 1 1
145 Matplotlib 1 1
146 Matplotlib 2 1
147 Matplotlib 5 1
148 Matplotlib 93 1
[149 rows x 3 columns]
I want to produce a heatmap with day (from 0 to 6 (but for the moment 0, 1 and 6)) on x-axis and class on y-axis, values are aggregate sums according to class and day).
I tried to groupby these two columns which produces:
time
class day
Emacs 0 1149
1 130
6 634
Eog 1 83
6 66
Evince 0 775
6 60
File-roller 0 40
Firefox 0 32109
1 6344
6 9887
GParted 1 25
Gedit 0 77
1 7
Gimp-2.10 6 25
Gmpc 1 73
Gnome-disks 1 21
Gtk-recordmydesktop 0 57
Gufw.py 6 100
KeePassXC 0 44
1 17
6 126
Matplotlib 1 151
Org.gnome.Nautilus 0 141
1 559
6 68
Scangearmp2 6 28
Totem 0 12
URxvt 0 346
1 488
6 3364
vlc 0 22
but I can't get a proper heatmap (with X:day Y:class and values: time)

You can try:
sns.heatmap(df.groupby(['class','day'])['time'].sum()
.unstack('day',fill_value=0)
)
Output:

Python pandas groupby with cumsum and percentage

Given the following dataframe df:
app platform uuid minutes
0 1 0 a696ccf9-22cb-428b-adee-95c9a97a4581 67
1 2 0 8e17a2eb-f0ee-49ae-b8c2-c9f9926aa56d 1
2 2 1 40AD6CD1-4A7B-48DD-8815-1829C093A95C 13
3 1 0 26c1022a-7a8e-42a2-b7cc-bea6bffa7a6f 2
4 2 0 34271596-eebb-4423-b890-dc3761ed37ca 8
5 3 1 C57D0F52-B565-4322-85D2-C2798F7CA6FF 16
6 2 0 245501ec2e39cb782bab1fb02d7813b7 1
7 3 1 DE6E4714-5A3C-4C80-BD81-EAACB2364DF0 30
8 3 0 f88eb774-fdf3-4d1d-a91d-0b4ab95cf36e 10
9 2 0 9c08c860-7a6d-4810-a5c3-f3af2a3fcf66 470
10 3 1 19fdaedfd0dbdaf6a7a6b49619f11a19 3
11 3 1 AAF1CFF7-4564-4C79-B2D8-F0AAF9C9971B 58
12 2 0 4eb1024b-c293-42a4-95a2-31b20c3b524b 24
13 3 1 8E0B0BE3-8553-4F38-9837-6C907E01F84C 7
14 3 1 E8B2849C-F050-4DCD-B311-5D57015466AE 465
15 2 0 ec7fedb6-b118-424a-babe-b8ffad579685 266
16 1 0 7e302dcb-ceaf-406c-a9e5-66933d921064 184
17 2 0 f786528ded200c9f553dd3a5e9e9bb2d 10
18 3 1 1E291633-AF27-4DFB-8DA4-4A5B63F175CF 13
19 2 0 953a525c-97e0-4c2f-90e0-dfebde3ec20d 2408`
I'll group it:
y=df.groupby(['app','platform','uuid']).sum().reset_index().sort(['app','platform','minutes'],ascending=[1,1,0]).set_index(['app','platform','uuid'])
minutes
app platform uuid
1 0 7e302dcb-ceaf-406c-a9e5-66933d921064 184
a696ccf9-22cb-428b-adee-95c9a97a4581 67
26c1022a-7a8e-42a2-b7cc-bea6bffa7a6f 2
2 0 953a525c-97e0-4c2f-90e0-dfebde3ec20d 2408
9c08c860-7a6d-4810-a5c3-f3af2a3fcf66 470
ec7fedb6-b118-424a-babe-b8ffad579685 266
4eb1024b-c293-42a4-95a2-31b20c3b524b 24
f786528ded200c9f553dd3a5e9e9bb2d 10
34271596-eebb-4423-b890-dc3761ed37ca 8
245501ec2e39cb782bab1fb02d7813b7 1
8e17a2eb-f0ee-49ae-b8c2-c9f9926aa56d 1
1 40AD6CD1-4A7B-48DD-8815-1829C093A95C 13
3 0 f88eb774-fdf3-4d1d-a91d-0b4ab95cf36e 10
1 E8B2849C-F050-4DCD-B311-5D57015466AE 465
AAF1CFF7-4564-4C79-B2D8-F0AAF9C9971B 58
DE6E4714-5A3C-4C80-BD81-EAACB2364DF0 30
C57D0F52-B565-4322-85D2-C2798F7CA6FF 16
1E291633-AF27-4DFB-8DA4-4A5B63F175CF 13
8E0B0BE3-8553-4F38-9837-6C907E01F84C 7
19fdaedfd0dbdaf6a7a6b49619f11a19 3
So that I got its minutes per uuid in decrescent order.
Now, I will sum the cumulative minutes per app/platform/uuid:
y.groupby(level=[0,1]).cumsum()
app platform uuid
1 0 7e302dcb-ceaf-406c-a9e5-66933d921064 184
a696ccf9-22cb-428b-adee-95c9a97a4581 251
26c1022a-7a8e-42a2-b7cc-bea6bffa7a6f 253
2 0 953a525c-97e0-4c2f-90e0-dfebde3ec20d 2408
9c08c860-7a6d-4810-a5c3-f3af2a3fcf66 2878
ec7fedb6-b118-424a-babe-b8ffad579685 3144
4eb1024b-c293-42a4-95a2-31b20c3b524b 3168
f786528ded200c9f553dd3a5e9e9bb2d 3178
34271596-eebb-4423-b890-dc3761ed37ca 3186
245501ec2e39cb782bab1fb02d7813b7 3187
8e17a2eb-f0ee-49ae-b8c2-c9f9926aa56d 3188
1 40AD6CD1-4A7B-48DD-8815-1829C093A95C 13
3 0 f88eb774-fdf3-4d1d-a91d-0b4ab95cf36e 10
1 E8B2849C-F050-4DCD-B311-5D57015466AE 465
AAF1CFF7-4564-4C79-B2D8-F0AAF9C9971B 523
DE6E4714-5A3C-4C80-BD81-EAACB2364DF0 553
C57D0F52-B565-4322-85D2-C2798F7CA6FF 569
1E291633-AF27-4DFB-8DA4-4A5B63F175CF 582
8E0B0BE3-8553-4F38-9837-6C907E01F84C 589
19fdaedfd0dbdaf6a7a6b49619f11a19 592
My question is: how can I get the percent agains the total cumulative sum, per group, i.e, something like this:
app platform uuid
1 0 7e302dcb-ceaf-406c-a9e5-66933d921064 184 0.26
a696ccf9-22cb-428b-adee-95c9a97a4581 251 0.36
26c1022a-7a8e-42a2-b7cc-bea6bffa7a6f 253 0.36
...
...
...

It's not clear you came up with 0.26, 0.36 in your desired output - but assuming those are just dummy numbers, to get a running % of total for each group, you could do this:
y['cumsum'] = y.groupby(level=[0,1]).cumsum()
y['running_pct'] = y.groupby(level=[0,1])['cumsum'].transform(lambda x: x / x.iloc[-1])
Should give the right output.
In [398]: y['running_pct'].head()
Out[398]:
app platform uuid
1 0 7e302dcb-ceaf-406c-a9e5-66933d921064 0.727273
a696ccf9-22cb-428b-adee-95c9a97a4581 0.992095
26c1022a-7a8e-42a2-b7cc-bea6bffa7a6f 1.000000
2 0 953a525c-97e0-4c2f-90e0-dfebde3ec20d 0.755332
9c08c860-7a6d-4810-a5c3-f3af2a3fcf66 0.902760
Name: running_pct, dtype: float64
EDIT:
Per the comments, if you're looking to wring out a little more performance, this will be faster as of version 0.14.1
y['cumsum'] = y.groupby(level=[0,1])['minutes'].transform('cumsum')
y['running_pct'] = y['cumsum'] / y.groupby(level=[0,1])['minutes'].transform('sum')
And as #Jeff notes, in 0.15.0 this may be faster yet.
y['running_pct'] = y['cumsum'] / y.groupby(level=[0,1])['minutes'].transform('last')

We Keep Coding

Python is a programming language that lets you work quickly and integrate systems more effectively.

Pandas: Quick random negative sampling - python

Related

Pandas groupby apply a random day to each group of years

Get Row in Other Table

Add a new line to .txt file in python

Matplotlib heatmap using pandas dataframe

Python pandas groupby with cumsum and percentage

Categories

Resources