Renaming a subset of index from a dataframe - python

I have a dataframe which looks like this
Geneid PRKCZ.exon1 PRKCZ.exon2 PRKCZ.exon3 PRKCZ.exon4 PRKCZ.exon5 PRKCZ.exon6 PRKCZ.exon7 PRKCZ.exon8 PRKCZ.exon9 PRKCZ.exon10 ... FLNA.exon31 FLNA.exon32 FLNA.exon33 FLNA.exon34 FLNA.exon35 FLNA.exon36 FLNA.exon37 FLNA.exon38 MTCP1.exon1 MTCP1.exon2
S28 22 127 135 77 120 159 49 38 409 67 ... 112 104 37 83 47 18 110 70 167 19
22 3 630 178 259 142 640 77 121 521 452 ... 636 288 281 538 276 109 242 314 790 484
S04 16 658 320 337 315 881 188 162 769 577 ... 1291 420 369 859 507 208 554 408 1172 706
56 26 663 343 390 314 1090 263 200 844 592 ... 675 243 250 472 280 133 300 275 750 473
S27 13 1525 571 1081 560 1867 427 370 1348 1530 ... 1817 926 551 1554 808 224 971 1313 1293 701
5 rows × 8297 columns
In the dataframe above I need to add an extra column with information about the index. So I made a list, healthy, containing all the index values that should be labelled h; everything else should be labelled d.
So I tried the following lines:
healthy=['39','41','49','50','51','52','53','54','56']
H_type =pd.Series( ['h' for x in df.loc[healthy]
else 'd' for x in df]).to_frame()
But it throws the following error:
SyntaxError: invalid syntax
Any help would be really appreciated
In the end I am aiming for something like this:
Geneid sampletype SSX4.exon4 SSX2.exon11 DUX4.exon5 SSX2.exon3 SSX4.exon5 SSX2.exon10 SSX4.exon7 SSX2.exon9 SSX4.exon8 ... SETD2.exon21 FAT2.exon15 CASC5.exon8 FAT1.exon21 FAT3.exon9 MLL.exon31 NACA.exon7 RANBP2.exon20 APC.exon16 APOB.exon4
S28 h 0 0 0 0 0 0 0 0 0 ... 2480 2003 2749 1760 2425 3330 4758 2508 4367 4094
22 h 0 0 0 0 0 0 0 0 0 ... 8986 7200 10123 12422 14528 18393 9612 15325 8788 11584
S04 h 0 0 0 0 0 0 0 0 0 ... 14518 16657 17500 15996 17367 17948 18037 19446 24179 28924
56 h 0 0 0 0 0 0 0 0 0 ... 17784 17846 20811 17337 18135 19264 19336 22512 28318 32405
S27 h 0 0 0 0 0 0 0 0 0 ... 10375 20403 11559 18895 18410 12754 21527 11603 16619 37679
Thank you

I think you can use numpy.where with isin, if Geneid is a column.
EDIT by comment:
There can be integers in the Geneid column, so you can cast it to string with astype before calling isin.
healthy=['39','41','49','50','51','52','53','54','56']
df['type'] = np.where(df['Geneid'].astype(str).isin(healthy), 'h', 'd')
#get last column to list
print(df.columns[-1].split())
['type']
#create new list from last column and all columns without last
cols = df.columns[-1].split() + df.columns[:-1].tolist()
print(cols)
['type', 'Geneid', 'PRKCZ.exon1', 'PRKCZ.exon2', 'PRKCZ.exon3', 'PRKCZ.exon4',
'PRKCZ.exon5', 'PRKCZ.exon6', 'PRKCZ.exon7', 'PRKCZ.exon8', 'PRKCZ.exon9',
'PRKCZ.exon10', 'FLNA.exon31', 'FLNA.exon32', 'FLNA.exon33', 'FLNA.exon34',
'FLNA.exon35', 'FLNA.exon36', 'FLNA.exon37', 'FLNA.exon38', 'MTCP1.exon1', 'MTCP1.exon2']
#reorder columns
print(df[cols])
type Geneid PRKCZ.exon1 PRKCZ.exon2 PRKCZ.exon3 PRKCZ.exon4 \
0 d S28 22 127 135 77
1 d 22 3 630 178 259
2 d S04 16 658 320 337
3 h 56 26 663 343 390
4 d S27 13 1525 571 1081
PRKCZ.exon5 PRKCZ.exon6 PRKCZ.exon7 PRKCZ.exon8 ... \
0 120 159 49 38 ...
1 142 640 77 121 ...
2 315 881 188 162 ...
3 314 1090 263 200 ...
4 560 1867 427 370 ...
FLNA.exon31 FLNA.exon32 FLNA.exon33 FLNA.exon34 FLNA.exon35 \
0 112 104 37 83 47
1 636 288 281 538 276
2 1291 420 369 859 507
3 675 243 250 472 280
4 1817 926 551 1554 808
FLNA.exon36 FLNA.exon37 FLNA.exon38 MTCP1.exon1 MTCP1.exon2
0 18 110 70 167 19
1 109 242 314 790 484
2 208 554 408 1172 706
3 133 300 275 750 473
4 224 971 1313 1293 701
[5 rows x 22 columns]

You could use pandas isin()
First add an extra column called 'sampletype' and fill it with 'd'. Then find all samples whose Geneid is in healthy and set them to 'h'. Suppose your main dataframe is called df; then you would use something like:
healthy = ['39','41','49','50','51','52','53','54','56']
df['sampletype'] = 'd'
# use .loc for the assignment; chained indexing may write to a copy
df.loc[df['Geneid'].isin(healthy), 'sampletype'] = 'h'
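Both answers assume Geneid is a regular column. If it is actually the index, as the question title suggests, the same idea can be applied to df.index. This is only a minimal sketch on a made-up five-row frame: the column values are placeholders, and the 'sampletype' name and its first position are taken from the desired output in the question.

```python
import numpy as np
import pandas as pd

# Made-up stand-in for the real data: Geneid is the index and mixes
# strings and integers, as in the question.
df = pd.DataFrame(
    {"PRKCZ.exon1": [22, 3, 16, 26, 13], "PRKCZ.exon2": [127, 630, 658, 663, 1525]},
    index=pd.Index(["S28", 22, "S04", 56, "S27"], name="Geneid"),
)

healthy = ["39", "41", "49", "50", "51", "52", "53", "54", "56"]

# Cast the index to str so that the integer label 56 matches the string '56'.
is_healthy = df.index.astype(str).isin(healthy)

# insert() places the new column at position 0, so no reordering is needed.
df.insert(0, "sampletype", np.where(is_healthy, "h", "d"))
print(df)
```

With insert there is no need to rebuild the column list afterwards, which matters little for 22 columns but is convenient for 8297.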

Related

How could I count the ratings for each item_id?

From the u.data file, which is loaded as [100000 rows x 4 columns],
I have to find out which are the best movies.
I am trying to find the overall rating for each unique item_id (there are 1682 of them) separately.
import pandas as pd
import csv
ratings = pd.read_csv("erg3/files/u.data", encoding="utf-8", sep=r"\s+",
names = ["user_id", "item_id", "rating", "timestamp"]
)
The data has this form:
196 242 3 881250949
186 302 3 891717742
22 377 1 878887116
....
244 51 2 880606923
166 346 1 886397596
298 474 4 884182806
My expected output:
item_id
1 1753
2 420
3 273
4 742
...
1570 1
1486 1
1626 1
1580 1
I used best_m = ratings.groupby("item_id")["rating"].sum()
followed by best_m = best_m.sort_values(ascending=False)
and the output looks like:
50 2541
100 2111
181 2032
258 1936
174 1786
...
1581 1
1570 1
1486 1
1626 1
1580 1
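One thing that may help here: a plain sum favours movies that were simply rated often, so it can be useful to look at the sum, the number of ratings and the mean side by side. The sketch below uses a tiny invented table in the same four-column layout instead of the real file; the column names total, n_ratings and mean_rating are just illustrative.

```python
import pandas as pd

# Tiny invented sample with the same columns as the ratings file.
ratings = pd.DataFrame(
    {
        "user_id": [1, 2, 3, 1, 2, 3],
        "item_id": [10, 10, 10, 20, 20, 30],
        "rating": [5, 4, 3, 2, 1, 5],
        "timestamp": [0, 0, 0, 0, 0, 0],
    }
)

# One row per item_id with three different summaries of its ratings.
summary = ratings.groupby("item_id")["rating"].agg(
    total="sum", n_ratings="count", mean_rating="mean"
)

# Ordered by item_id, which is the layout shown in the expected output.
by_item = summary.sort_index()

# Ordered by mean rating, highest first.
best = summary.sort_values("mean_rating", ascending=False)
print(by_item)
print(best)
```

Note that sort_values reorders the rows by value, while sort_index keeps them in item_id order, which may be the difference between the two outputs shown above.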

pad rows on a pandas dataframe with zeros till N count

I am loading data via pandas read_csv like so:
data = pd.read_csv(file_name_item, sep=" ", header=None, usecols=[0,1,2])
which looks like so:
0 1 2
0 257 503 48
1 167 258 39
2 172 242 39
3 172 403 81
4 180 228 39
5 183 394 255
6 192 179 15
7 192 347 234
8 192 380 243
9 192 437 135
10 211 358 234
I would like to pad this data with zeros till a row count of 256, meaning:
0 1 2
0 257 503 48
1 167 258 39
2 172 242 39
3 172 403 81
4 180 228 39
5 183 394 255
6 192 179 15
7 192 347 234
8 192 380 243
9 192 437 135
10 211 358 234
11 0 0 0
.. .. .. ..
256 0 0 0
How do I go about doing this? The file could have anything from 1 row to 200 odd rows and I am looking for something generic which pads this dataframe with 0's till 256 rows.
I am quite new to pandas and could not find any function to do this.
Use reindex with fill_value:
df_final = data.reindex(range(257), fill_value=0)
Out[1845]:
0 1 2
0 257 503 48
1 167 258 39
2 172 242 39
3 172 403 81
4 180 228 39
.. ... ... ..
252 0 0 0
253 0 0 0
254 0 0 0
255 0 0 0
256 0 0 0
[257 rows x 3 columns]
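Alternatively, reindex first and then fill the missing rows with fillna (the cast back to int assumes the data is integer-valued, since reindex introduces NaN and turns the columns into floats):
new_df = data.reindex(range(257)).fillna(0).astype(int)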
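For reference, here is a small self-contained version of the reindex approach. The three rows are copied from the question; n_rows is a hypothetical parameter. Note that range(257), as used above, gives 257 rows (labels 0 to 256), which matches the example in the question; range(256) gives exactly 256 rows.

```python
import pandas as pd

# First three rows from the question, with the default RangeIndex 0..2.
data = pd.DataFrame([[257, 503, 48], [167, 258, 39], [172, 242, 39]])

n_rows = 256  # hypothetical target row count

# Labels missing from the current index become new rows filled with 0.
padded = data.reindex(range(n_rows), fill_value=0)
print(padded.shape)
print(padded.tail())
```

This relies on the dataframe having the default 0..N-1 index, which is what read_csv produces with header=None and no index_col.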

make another column in dataframe to filter out the week of the month based on date

I have a code as below:
import random
from datetime import datetime
import pandas as pd
df = pd.DataFrame({'date': pd.date_range(datetime.today(), periods=100).tolist(),
                   'country': random.sample(range(1, 101), 100),
                   'amount': random.sample(range(1, 101), 100),
                   'others': random.sample(range(1, 101), 100)})
I wish to have an output such as:
month_week sum(country) sum(amount) sum(others)
4_1
4_2
4_3
4_4
Each sum is the total of that column's values for the week.
Something like this:
In [713]: df['month_week'] = df['date'].dt.month.map(str) + '_' + df['date'].apply(lambda d: (d.day-1) // 7 + 1).map(str)
In [725]: df.groupby('month_week')[['country', 'amount', 'others']].sum().reset_index()
Out[725]:
month_week country amount others
0 4_3 377 367 290
1 4_4 315 445 475
2 4_5 128 48 47
3 5_1 395 355 293
4 5_2 382 500 430
5 5_3 286 196 250
6 5_4 291 448 343
7 5_5 151 147 109
8 6_1 434 359 437
9 6_2 371 301 487
10 6_3 303 475 243
11 6_4 327 270 274
12 6_5 174 114 161
13 7_1 432 253 360
14 7_2 272 321 361
15 7_3 353 404 327
16 7_4 59 47 163
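The week number here is (day - 1) // 7 + 1, so days 1 to 7 are week 1, days 8 to 14 are week 2, and days 29 to 31 end up in a short week 5. It counts from the first of the month, not from calendar weeks. A small sketch with fixed dates, so the result is reproducible (the dates and amounts are made up):

```python
import pandas as pd

# Ten consecutive days starting on 1 April, with amounts 1..10.
df = pd.DataFrame(
    {
        "date": pd.date_range("2023-04-01", periods=10, freq="D"),
        "amount": list(range(1, 11)),
    }
)

# Days 1-7 -> week 1, days 8-14 -> week 2, and so on.
week_of_month = (df["date"].dt.day - 1) // 7 + 1
df["month_week"] = df["date"].dt.month.astype(str) + "_" + week_of_month.astype(str)

# Sum only the numeric column; the datetime column cannot be summed.
weekly = df.groupby("month_week", sort=False)["amount"].sum().reset_index()
print(weekly)
```

Because month_week is a string, the default sorted groupby would put '10_1' before '4_1' on a full year of data; sort=False keeps the groups in order of first appearance instead.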

Can I separate data for each curve?

I want to get the points on the quadratic curve so that I can fit the quadratic equation:
ay^2 + by + c = d
I have this set of data:
x = [ 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
0 92 0 92 0 92 0 92 0 92 0 92 0 92 0 92 0 92
0 92 0 92 0 92 0 92 0 92 0 92 0 92 0 92 0 92
0 92 0 92 0 92 0 92 0 92 0 92 0 92 0 92 0 92
0 92 0 92 0 92 0 92 0 92 0 92 0 92 0 92 0 92
0 92 0 92 0 92 0 92 0 92 0 92 0 92 0 92 0 92
0 92 0 92 0 92 0 92 0 92 0 92 0 92 0 92 0 92
95 0 92 0 92 96 0 92 96 0 92 96 0 92 0 92 0 92
0 92 0 92 0 92 0 92 0 92 0 92 0 92 0 92 0 92
0 92 0 92 0 92 0 92 0 92 0 92 0 92 0 92 0 92
0 92 0 92 0 92 0 92 0 92 0 92 0 92 0 92 0 92
0 92 0 92 0 92 0 92 0 92 0 92 0 92 0 92 0 92
0 92 0 92 0 92 0 92 0 92 0 92 0 92 0 92 0 92
0 92 0 92 0 92 0 92 0 92 0 92 0 92 0 92 0 92
0 92 0 92 0 92 0 92 0 92 153 0 92 0 92 0 92 149
0 92 0 92 146 0 92 145 0 92 144 0 92 0 92 0 92 140
0 92 139 0 92 138 0 92 137 0 92 136 0 92 135 0 92 134
0 92 133 0 92 132 0 92 131 0 92 130 0 92 128 129 0 92
128 0 92 127 0 92 126 127 0 92 125 126 0 92 124 125 0 92
124 0 92 123 0 92 122 0 121 0 120 121 0 119 120 0 118 119
0 117 118 0 117 0 116 117 0 115 116 0 114 115 0 114 0 113
114 0 112 113 0 112 0 111 0 110 111 0 109 110 0 109 0 108
0 107 108 0 107 0 106 0 105 106 0 105 0 104 105 0 103 104
0 103 0 102 103 0 102 0 101 0 100 0 99 100 0 99 0 98
99 0 98 0 97 0 96 97 0 96 0 95 96 0 95 0 94 0
94 0 93 0 93 0 92 0 91 92 0 91 0 90 91 0 90 0
89 90 0 89 0 88 89 0 88 89 0 88 0 88 0 88 0 87
0 87 0 0 0 0 0 0 0 0 0 0 0 0 0]
y =
[ 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17
18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35
36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 89 90 91
92 94 95 96 97 98 99 100 101 102 103 104 105 106 107 108 109 110
111 112 113 114 115 116 117 118 119 120 121 122 123 124 125 126 127 128
129 130 131 132 133 134 135 136 137 138 139 140 141 142 143 144 145 146
147 148 149 150 151 152 153 154 155 156 157 158 159 160 161 162 163 164
165 166 167 168 169 170 171 172 173 174 175 176 177 178 179 180 181 182
183 184 185 186 187 188 189 190 191 192 193 194 195 196 197 198 199 200
201 201 202 202 203 203 204 204 205 205 206 206 207 207 208 208 209 209
210 210 211 211 212 212 213 213 214 214 215 215 216 216 217 217 218 218
219 219 220 220 221 221 222 222 223 223 224 224 225 225 226 226 227 227
228 228 229 229 230 230 231 231 232 232 233 233 234 234 235 235 236 236
237 237 238 238 239 239 240 240 241 241 242 242 243 243 244 244 245 245
246 246 247 247 248 248 249 249 250 250 251 251 252 252 253 253 254 254
254 255 255 256 256 256 257 257 257 258 258 258 259 259 260 260 261 261
262 262 263 263 264 264 265 265 266 266 267 267 268 268 269 269 270 270
271 271 272 272 273 273 274 274 275 275 276 276 277 277 278 278 279 279
280 280 281 281 282 282 283 283 284 284 285 285 286 286 287 287 288 288
289 289 290 290 291 291 292 292 293 293 294 294 295 295 296 296 297 297
298 298 299 299 300 300 301 301 302 302 303 303 304 304 305 305 306 306
307 307 308 308 309 309 310 310 311 311 312 312 313 313 314 314 315 315
316 316 317 317 318 318 319 319 320 320 320 321 321 322 322 323 323 323
324 324 325 325 325 326 326 326 327 327 327 328 328 329 329 330 330 330
331 331 331 332 332 332 333 333 333 334 334 334 335 335 335 336 336 336
337 337 337 338 338 338 339 339 339 340 340 340 341 341 341 341 342 342
342 343 343 343 344 344 344 344 345 345 345 345 346 346 346 346 347 347
347 348 348 348 349 349 349 350 350 351 351 351 352 352 352 353 353 353
354 354 354 355 355 356 356 356 357 357 357 358 358 358 359 359 360 360
360 361 361 361 362 362 363 363 364 364 364 365 365 365 366 366 367 367
368 368 368 369 369 370 370 371 371 371 372 372 373 373 373 374 374 374
375 375 376 376 376 377 377 378 378 379 379 380 380 380 381 381 382 382
382 383 383 384 384 385 385 385 386 386 387 387 387 388 388 389 389 390
390 391 391 392 392 393 393 394 394 394 395 395 396 396 396 397 397 398
398 398 399 399 400 400 400 401 401 401 402 402 403 403 404 404 405 405
406 406 407 408 409 410 411 412 413 414 415 416 417 418 419]
In the plot I can see 3 lines. Can I separate the data for each curve?
Or can I extract only the values of the quadratic curve?
Try the DBSCAN algorithm; it is already implemented in sklearn.
It works well if the samples within each curve are very close to one another but far from the samples in the other curves.
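A rough sketch of how that could look. It does not use the data from the question; it builds a synthetic set with two vertical lines and one parabola, and the eps and min_samples values are guesses that suit this toy data only, so they would need tuning on the real points. Picking the curve by its spread in x is also just an assumption that happens to work when the other clusters are vertical lines.

```python
import numpy as np
from sklearn.cluster import DBSCAN

# Synthetic stand-in for the plotted data: two vertical lines and a parabola.
y = np.arange(0, 101, dtype=float)
line_a = np.column_stack([np.zeros_like(y), y])
line_b = np.column_stack([np.full_like(y, 50.0), y])
parabola = np.column_stack([0.01 * y**2 - y + 125.0, y])
points = np.vstack([line_a, line_b, parabola])

# eps / min_samples are tuned by hand for this toy data.
labels = DBSCAN(eps=3.0, min_samples=3).fit_predict(points)

# Label -1 marks noise; every other label is one cluster.
cluster_ids = [k for k in np.unique(labels) if k != -1]

def x_spread(k):
    """Range of x values inside cluster k (zero for a vertical line)."""
    xs = points[labels == k, 0]
    return xs.max() - xs.min()

# Take the cluster whose x values vary the most as the quadratic curve.
curve = points[labels == max(cluster_ids, key=x_spread)]

# Least-squares fit of x = a*y^2 + b*y + c on that cluster only.
coeffs = np.polyfit(curve[:, 1], curve[:, 0], 2)
print(len(cluster_ids), coeffs)
```

If the curves touch or cross, as they may near the ends in the real data, DBSCAN will merge them into one cluster, so it is worth plotting the labels before trusting the fit.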

Editing values in a dataframe based on the information in another dataframe

I have one dataframe called _df1 which looks like this. Please note that this is not the entire dataframe, only parts of it.
_df1:
frame id x1 y1 x2 y2
1 1 1363 569 103 241
2 1 1362 568 103 241
3 1 1362 568 103 241
4 1 1362 568 103 241
964 5 925 932 80 255
965 5 925 932 79 255
966 5 925 932 79 255
967 5 924 932 80 255
968 5 924 932 79 255
16 6 631 761 100 251
17 6 631 761 100 251
18 6 631 761 100 251
19 6 631 761 100 251
20 6 631 761 100 251
21 6 631 761 100 251
88 7 623 901 144 123
89 7 623 901 144 123
90 7 623 901 144 123
91 7 623 901 144 123
92 7 623 901 144 123
93 7 623 901 144 123
94 7 623 901 144 123
In the full dataframe there are 108003 rows and 141 unique IDs. An ID represents a specific object, and the ID is repeated for as long as the frames contain that object. In other words, my data has 141 different objects and 108003 frames. I wrote code to identify frames that show the same object but are labelled with a different ID. The result is saved in another dataframe called _df2, which looks like this. This is also only part of the dataframe, not the entire thing.
_df2:
indexID matchID
4 5
6 7
8 9
12 13
18 19
20 21
.
.
.
The second dataframe shows which IDs have been wrongly classified as different objects. This means that the ID in 'matchID' is actually the same object as the one in 'indexID'. Both columns in _df2 refer to values of 'id' in _df1.
Taking the first line in _df2 as an example, it says that IDs 4 and 5 are the same object. Therefore I need to change the 'id' value, in _df1, of all the frames with 'id' 5 to 4. Below is an example of what the final table should look like, since 5 has to be reclassified as 4 and 7 has to be reclassified as 6.
Output:
frame id x1 y1 x2 y2
1 1 1363 569 103 241
2 1 1362 568 103 241
3 1 1362 568 103 241
4 1 1362 568 103 241
964 4 925 932 80 255
965 4 925 932 79 255
966 4 925 932 79 255
967 4 924 932 80 255
968 4 924 932 79 255
16 6 631 761 100 251
17 6 631 761 100 251
18 6 631 761 100 251
19 6 631 761 100 251
20 6 631 761 100 251
21 6 631 761 100 251
88 6 623 901 144 123
89 6 623 901 144 123
90 6 623 901 144 123
91 6 623 901 144 123
92 6 623 901 144 123
93 6 623 901 144 123
94 6 623 901 144 123
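Use replace with a dictionary that maps each matchID to its indexID (so that 5 becomes 4 and 7 becomes 6):
_df1['id'] = _df1['id'].replace(dict(zip(_df2['matchID'], _df2['indexID'])))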
