How could I count the rating for each item_id? - python

From the u.data file, which has [100000 rows x 4 columns],
I have to find out which are the best movies.
For each unique item_id (there are 1682), I am trying to find the overall rating separately:
import pandas as pd
import csv
ratings = pd.read_csv("erg3/files/u.data", encoding="utf-8", sep=r"\s+",
                      names=["user_id", "item_id", "rating", "timestamp"])
The data has this form:
196 242 3 881250949
186 302 3 891717742
22 377 1 878887116
....
244 51 2 880606923
166 346 1 886397596
298 474 4 884182806
My expected output:
item_id
1 1753
2 420
3 273
4 742
...
1570 1
1486 1
1626 1
1580 1
I used this: best_m = ratings.groupby("item_id")["rating"].sum()
followed by: best_m = best_m.sort_values(ascending=False)
And the output looks like this:
50 2541
100 2111
181 2032
258 1936
174 1786
...
1581 1
1570 1
1486 1
1626 1
1580 1
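The difference between the two listings above appears to be only the ordering: sort_values orders by the summed rating, while the expected output starts in item_id order. A minimal sketch, assuming a small made-up ratings frame in place of u.data, that computes the sum, the number of ratings and the mean per item_id and shows both orderings:

```python
import pandas as pd

# Tiny stand-in for u.data (values are made up for illustration).
ratings = pd.DataFrame({
    "user_id": [196, 186, 22, 244, 166, 298],
    "item_id": [242, 302, 242, 51, 302, 242],
    "rating": [3, 3, 1, 2, 1, 4],
})

# One row per item_id with the total, the number of ratings and the average.
per_item = ratings.groupby("item_id")["rating"].agg(["sum", "count", "mean"])

# Ordered by item_id (groupby already sorts the keys; sort_index makes it explicit).
by_item_id = per_item.sort_index()

# Ordered by total rating, best first.
by_total = per_item.sort_values("sum", ascending=False)

print(by_item_id)
print(by_total)
```

Whether "best" should mean the highest sum or the highest mean is a judgment call: a sum favours movies with many ratings, while a mean can be dominated by movies rated only once.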

Related

pad rows on a pandas dataframe with zeros till N count

I am loading data via pandas read_csv like so:
data = pd.read_csv(file_name_item, sep=" ", header=None, usecols=[0,1,2])
which looks like so:
0 1 2
0 257 503 48
1 167 258 39
2 172 242 39
3 172 403 81
4 180 228 39
5 183 394 255
6 192 179 15
7 192 347 234
8 192 380 243
9 192 437 135
10 211 358 234
I would like to pad this data with zeros till a row count of 256, meaning:
0 1 2
0 257 503 48
1 167 258 39
2 172 242 39
3 172 403 81
4 180 228 39
5 183 394 255
6 192 179 15
7 192 347 234
8 192 380 243
9 192 437 135
10 211 358 234
11 0 0 0
.. .. .. ..
256 0 0 0
How do I go about doing this? The file could have anything from 1 row to 200 odd rows and I am looking for something generic which pads this dataframe with 0's till 256 rows.
I am quite new to pandas and could not find any function to do this.
Use reindex with fill_value:
df_final = data.reindex(range(257), fill_value=0)
Out[1845]:
0 1 2
0 257 503 48
1 167 258 39
2 172 242 39
3 172 403 81
4 180 228 39
.. ... ... ..
252 0 0 0
253 0 0 0
254 0 0 0
255 0 0 0
256 0 0 0
[257 rows x 3 columns]
Alternatively, we can reindex and then fill the missing values (df is the DataFrame from the question):
new_df = df.reindex(range(257)).fillna(0, downcast='infer')
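Since the question asks for something generic, the reindex approach can be wrapped in a small helper. This is only a sketch: pad_rows and its parameters are hypothetical names, and it assumes the frame should end up with a plain 0..n_rows-1 index.

```python
import pandas as pd

def pad_rows(frame, n_rows, fill_value=0):
    """Return a copy of frame extended to n_rows rows (hypothetical helper).

    Rows that do not exist yet are filled with fill_value. A frame that
    already has n_rows or more rows is returned unchanged.
    """
    if len(frame) >= n_rows:
        return frame.copy()
    # reset_index gives a 0..len-1 index so reindex only appends new labels.
    return frame.reset_index(drop=True).reindex(range(n_rows), fill_value=fill_value)

data = pd.DataFrame([[257, 503, 48], [167, 258, 39]])
padded = pad_rows(data, 5)
print(padded)
```

Passing fill_value to reindex avoids the intermediate NaN values, so integer columns stay integers.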

make another column in dataframe to filter out the week of the month based on date

I have the code below:
from datetime import datetime
import random
import pandas as pd
df = pd.DataFrame({'date': pd.date_range(datetime.today(), periods=100).tolist(),
                   'country': random.sample(range(1,101), 100),
                   'amount': random.sample(range(1,101), 100),
                   'others': random.sample(range(1,101), 100)})
I wish to have an output such as:
month_week sum(country) sum(amount) sum(other)
4_1
4_2
4_3
4_4
Each sum is the total of that column's values for the week.
Something like this:
In [713]: df['month_week'] = df['date'].dt.month.map(str) + '_' + df['date'].apply(lambda d: (d.day-1) // 7 + 1).map(str)
In [725]: df.groupby('month_week').sum().reset_index()
Out[725]:
month_week country amount others
0 4_3 377 367 290
1 4_4 315 445 475
2 4_5 128 48 47
3 5_1 395 355 293
4 5_2 382 500 430
5 5_3 286 196 250
6 5_4 291 448 343
7 5_5 151 147 109
8 6_1 434 359 437
9 6_2 371 301 487
10 6_3 303 475 243
11 6_4 327 270 274
12 6_5 174 114 161
13 7_1 432 253 360
14 7_2 272 321 361
15 7_3 353 404 327
16 7_4 59 47 163
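The same idea as a self-contained sketch, assuming fixed dates and a single made-up amount column so the result is reproducible. It uses the vectorized .dt.day accessor instead of apply, and selects the numeric column explicitly before summing, because the date column itself may not be summable in recent pandas versions.

```python
import pandas as pd

# Ten consecutive days starting on 1 April (dates and amounts are made up).
df = pd.DataFrame({
    "date": pd.date_range("2024-04-01", periods=10),
    "amount": range(1, 11),
})

# Week of the month: days 1-7 -> 1, days 8-14 -> 2, and so on.
week_of_month = (df["date"].dt.day - 1) // 7 + 1
df["month_week"] = df["date"].dt.month.astype(str) + "_" + week_of_month.astype(str)

# Sum only the numeric column(s) per month_week, keeping first-seen order.
weekly = df.groupby("month_week", sort=False)[["amount"]].sum().reset_index()
print(weekly)
```

Note that this definition counts 7-day blocks from the first day of the month, not calendar weeks, so a month can have a short fifth "week".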

Extract complete country name from a string and make it as a dataframe column

I have data as shown below. How can I convert it into a dataframe? I need the country name (some country names contain a comma) as the first column and the other values as separate columns.
The input is a txt file with many countries:
Czech Republic,22,22,22,21,21,21,21,21,19,18,16,14,13,12,11,11,10,9
Congo,Dem.Rep.,275,306,327,352,376,411,420,466,472,528,592,643,697,708,710,702,692,666
Congo,Rep.,209,222,231,243,255,269,424,457,367,545,313,354,402,509,477,482,511,485
Output should be a dataframe with country name as first column
Czech Republic 22 22 22 21 21 21 21 21 19 18 16 14 13 12 11 11 10 9
Congo,Rep. 209 222 231 243 255 269 424 457 367 545 313 354 402 509 477 482 511 485
Congo, Dem.Rep. 275 306 327 352 376 411 420 466 472 528 592 643 697 708 710 702 692 666
You can first use read_csv (it is no problem if the file is .txt) with a separator that does not occur in the values, such as |, to get a Series. Then extract and strip the country names into one column and split the remaining values by ,:
import numpy as np
import pandas as pd
from io import StringIO
temp=u"""Czech Republic,22,22,22,21,21,21,21,21,19,18,16,14,13,12,11,11,10,9
Congo,Dem.Rep.,275,306,327,352,376,411,420,466,472,528,592,643,697,708,710,702,692,666
Congo,Rep.,209,222,231,243,255,269,424,457,367,545,313,354,402,509,477,482,511,485"""
# after testing, replace 'StringIO(temp)' with 'filename.csv'
s = pd.read_csv(StringIO(temp), sep="|", header=None).squeeze("columns")
print (s)
0 Czech Republic,22,22,22,21,21,21,21,21,19,18,1...
1 Congo,Dem.Rep.,275,306,327,352,376,411,420,466...
2 Congo,Rep.,209,222,231,243,255,269,424,457,367...
Name: 0, dtype: object
df = s.str.extract('([A-Za-z ,.]+)([0-9,]+)', expand=True)
df[0] = df[0].str.strip(',')
df = df.set_index(0)[1].str.split(',', expand=True).rename_axis(None).reset_index()
#reset column names by 0,1,2...
df.columns = np.arange(len(df.columns))
print (df)
0 1 2 3 4 5 6 7 8 9 10 11 12 \
0 Czech Republic 22 22 22 21 21 21 21 21 19 18 16 14
1 Congo,Dem.Rep. 275 306 327 352 376 411 420 466 472 528 592 643
2 Congo,Rep. 209 222 231 243 255 269 424 457 367 545 313 354
13 14 15 16 17 18
0 13 12 11 11 10 9
1 697 708 710 702 692 666
2 402 509 477 482 511 485
If you need the countries as the index:
df = df.set_index(0)[1].str.split(',', expand=True).rename_axis(None)
print (df)
0 1 2 3 4 5 6 7 8 9 10 11 \
Czech Republic 22 22 22 21 21 21 21 21 19 18 16 14
Congo,Dem.Rep. 275 306 327 352 376 411 420 466 472 528 592 643
Congo,Rep. 209 222 231 243 255 269 424 457 367 545 313 354
12 13 14 15 16 17
Czech Republic 13 12 11 11 10 9
Congo,Dem.Rep. 697 708 710 702 692 666
Congo,Rep. 402 509 477 482 511 485
Here is a solution using the regex from another answer as the sep parameter; engine='python' is needed because a regex separator otherwise triggers a warning:
import pandas as pd
from io import StringIO
temp=u"""Czech Republic,22,22,22,21,21,21,21,21,19,18,16,14,13,12,11,11,10,9
Congo,Dem.Rep.,275,306,327,352,376,411,420,466,472,528,592,643,697,708,710,702,692,666
Congo,Rep.,209,222,231,243,255,269,424,457,367,545,313,354,402,509,477,482,511,485"""
# after testing, replace 'StringIO(temp)' with 'filename.csv'
df = pd.read_csv(StringIO(temp), sep=r",(?=\d)", header=None, engine='python')
print (df)
0 1 2 3 4 5 6 7 8 9 10 11 12 \
0 Czech Republic 22 22 22 21 21 21 21 21 19 18 16 14
1 Congo,Dem.Rep. 275 306 327 352 376 411 420 466 472 528 592 643
2 Congo,Rep. 209 222 231 243 255 269 424 457 367 545 313 354
13 14 15 16 17 18
0 13 12 11 11 10 9
1 697 708 710 702 692 666
2 402 509 477 482 511 485
jezrael's answer is the way to go if you want the complete output asap.
If you want to really understand some simpler code, try doing the following:
Split the string into some lists like this:
data = "Czech Republic..."
lines = data.split('\n')
rows = []
Then iterate over the lines, and append them to a list of lists:
def is_number(s):
    try:
        float(s)
        return True
    except ValueError:
        return False

for line in lines:
    temp = line.split(',')
    if is_number(temp[1]):
        # the country name contains no comma, so the fields are already correct
        rows.append(temp)
    else:
        # the name was split at its comma: rejoin the first two fields
        rows.append([','.join(temp[:2])] + temp[2:])
Then use this list of lists to build a DataFrame; the pandas DataFrame documentation explains how to pretty-print it. (Hint: make the list of lists a dict first)
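Putting those steps together, here is a minimal end-to-end sketch. It assumes shortened sample rows (three values per country) and, as a simplification, passes the list of lists to pd.DataFrame directly instead of going through a dict as the hint suggests.

```python
import pandas as pd

# Shortened sample input: three values per country instead of eighteen.
data = """Czech Republic,22,22,21
Congo,Dem.Rep.,275,306,327
Congo,Rep.,209,222,231"""

def is_number(s):
    """Return True if s can be parsed as a float."""
    try:
        float(s)
        return True
    except ValueError:
        return False

rows = []
for line in data.split("\n"):
    fields = line.split(",")
    if is_number(fields[1]):
        # Name has no comma: fields are already name, value, value, ...
        rows.append(fields)
    else:
        # Name was split at its comma: glue the first two fields back together.
        rows.append([",".join(fields[:2])] + fields[2:])

# Column 0 holds the country name; the rest are converted from str to int.
df = pd.DataFrame(rows).set_index(0).astype(int)
print(df)
```

This only handles names with at most one comma; the regex-based answers in this thread handle any number of commas before the first digit.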
The solution using re.split() function and labeled data structure with columns:
import pandas as pd, re
s = '''
Czech Republic,22,22,22,21,21,21,21,21,19,18,16,14,13,12,11,11,10,9
Congo,Dem.Rep.,275,306,327,352,376,411,420,466,472,528,592,643,697,708,710,702,692,666
Congo,Rep.,209,222,231,243,255,269,424,457,367,545,313,354,402,509,477,482,511,485
'''
data = []
for l in s.split('\n'):
    if l:
        data.append(re.split(r',(?=\d)', l))
# setting output options
pd.set_option('display.max_columns', 20)
pd.set_option('display.width', 1000)
df = pd.DataFrame(data, columns=['Country name'] + list(range(len(data[0][1:]))))
print(df)
The output:
Country name 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17
0 Czech Republic 22 22 22 21 21 21 21 21 19 18 16 14 13 12 11 11 10 9
1 Congo,Dem.Rep. 275 306 327 352 376 411 420 466 472 528 592 643 697 708 710 702 692 666
2 Congo,Rep. 209 222 231 243 255 269 424 457 367 545 313 354 402 509 477 482 511 485

Renaming a subset of index from a dataframe

I have a dataframe which looks like this
Geneid PRKCZ.exon1 PRKCZ.exon2 PRKCZ.exon3 PRKCZ.exon4 PRKCZ.exon5 PRKCZ.exon6 PRKCZ.exon7 PRKCZ.exon8 PRKCZ.exon9 PRKCZ.exon10 ... FLNA.exon31 FLNA.exon32 FLNA.exon33 FLNA.exon34 FLNA.exon35 FLNA.exon36 FLNA.exon37 FLNA.exon38 MTCP1.exon1 MTCP1.exon2
S28 22 127 135 77 120 159 49 38 409 67 ... 112 104 37 83 47 18 110 70 167 19
22 3 630 178 259 142 640 77 121 521 452 ... 636 288 281 538 276 109 242 314 790 484
S04 16 658 320 337 315 881 188 162 769 577 ... 1291 420 369 859 507 208 554 408 1172 706
56 26 663 343 390 314 1090 263 200 844 592 ... 675 243 250 472 280 133 300 275 750 473
S27 13 1525 571 1081 560 1867 427 370 1348 1530 ... 1817 926 551 1554 808 224 971 1313 1293 701
5 rows × 8297 columns
To the dataframe above I need to add an extra column with information about the index. So I made a list, healthy, with all the index values to be labelled as h; everything else should be d.
I tried the following lines:
healthy=['39','41','49','50','51','52','53','54','56']
H_type =pd.Series( ['h' for x in df.loc[healthy]
else 'd' for x in df]).to_frame()
But it is throwing me following error:
SyntaxError: invalid syntax
Any help would be really appreciated
In the end I am aiming for something like this:
Geneid sampletype SSX4.exon4 SSX2.exon11 DUX4.exon5 SSX2.exon3 SSX4.exon5 SSX2.exon10 SSX4.exon7 SSX2.exon9 SSX4.exon8 ... SETD2.exon21 FAT2.exon15 CASC5.exon8 FAT1.exon21 FAT3.exon9 MLL.exon31 NACA.exon7 RANBP2.exon20 APC.exon16 APOB.exon4
S28 h 0 0 0 0 0 0 0 0 0 ... 2480 2003 2749 1760 2425 3330 4758 2508 4367 4094
22 h 0 0 0 0 0 0 0 0 0 ... 8986 7200 10123 12422 14528 18393 9612 15325 8788 11584
S04 h 0 0 0 0 0 0 0 0 0 ... 14518 16657 17500 15996 17367 17948 18037 19446 24179 28924
56 h 0 0 0 0 0 0 0 0 0 ... 17784 17846 20811 17337 18135 19264 19336 22512 28318 32405
S27 h 0 0 0 0 0 0 0 0 0 ... 10375 20403 11559 18895 18410 12754 21527 11603 16619 37679
Thank you
I think you can use numpy.where with isin, if Geneid is a column.
EDIT after a comment:
There can be integers in the Geneid column, so you can cast it to string with astype.
import numpy as np
healthy=['39','41','49','50','51','52','53','54','56']
df['type'] = np.where(df['Geneid'].astype(str).isin(healthy), 'h', 'd')
#get last column to list
print(df.columns[-1].split())
['type']
#create new list from last column and all columns without last
cols = df.columns[-1].split() + df.columns[:-1].tolist()
print(cols)
['type', 'Geneid', 'PRKCZ.exon1', 'PRKCZ.exon2', 'PRKCZ.exon3', 'PRKCZ.exon4',
'PRKCZ.exon5', 'PRKCZ.exon6', 'PRKCZ.exon7', 'PRKCZ.exon8', 'PRKCZ.exon9',
'PRKCZ.exon10', 'FLNA.exon31', 'FLNA.exon32', 'FLNA.exon33', 'FLNA.exon34',
'FLNA.exon35', 'FLNA.exon36', 'FLNA.exon37', 'FLNA.exon38', 'MTCP1.exon1', 'MTCP1.exon2']
#reorder columns
print(df[cols])
type Geneid PRKCZ.exon1 PRKCZ.exon2 PRKCZ.exon3 PRKCZ.exon4 \
0 d S28 22 127 135 77
1 d 22 3 630 178 259
2 d S04 16 658 320 337
3 h 56 26 663 343 390
4 d S27 13 1525 571 1081
PRKCZ.exon5 PRKCZ.exon6 PRKCZ.exon7 PRKCZ.exon8 ... \
0 120 159 49 38 ...
1 142 640 77 121 ...
2 315 881 188 162 ...
3 314 1090 263 200 ...
4 560 1867 427 370 ...
FLNA.exon31 FLNA.exon32 FLNA.exon33 FLNA.exon34 FLNA.exon35 \
0 112 104 37 83 47
1 636 288 281 538 276
2 1291 420 369 859 507
3 675 243 250 472 280
4 1817 926 551 1554 808
FLNA.exon36 FLNA.exon37 FLNA.exon38 MTCP1.exon1 MTCP1.exon2
0 18 110 70 167 19
1 109 242 314 790 484
2 208 554 408 1172 706
3 133 300 275 750 473
4 224 971 1313 1293 701
[5 rows x 22 columns]
You could use pandas isin().
First add an extra column called 'sampletype' and fill it with 'd'. Then find all samples whose Geneid is in healthy and set them to 'h'. Supposing your main dataframe is called df, you would use something like:
healthy = ['39','41','49','50','51','52','53','54','56']
df['sampletype'] = 'd'
df.loc[df['Geneid'].isin(healthy), 'sampletype'] = 'h'
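Both answers assume Geneid is a regular column. If the sample labels are instead the index, as the question's title suggests, the same isin test can be applied to df.index. A minimal sketch, assuming a two-column stand-in for the real data and a mixed string/integer index:

```python
import numpy as np
import pandas as pd

# Small stand-in for the real frame; the sample labels live in the index.
df = pd.DataFrame(
    {"PRKCZ.exon1": [22, 3, 16, 26, 13], "PRKCZ.exon2": [127, 630, 658, 663, 1525]},
    index=pd.Index(["S28", 22, "S04", 56, "S27"], name="Geneid"),
)
healthy = ["39", "41", "49", "50", "51", "52", "53", "54", "56"]

# Cast the labels to str so integer labels such as 56 match the string '56'.
is_healthy = df.index.astype(str).isin(healthy)

# insert places the new column first, so no separate reordering is needed.
df.insert(0, "sampletype", np.where(is_healthy, "h", "d"))
print(df)
```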

Splitting the header into multiple headers in DataFrame

I have a DataFrame where I need to split the header into multiple header rows for the same DataFrame.
The DataFrame looks like this:
gene ALL_ID_1 AML_ID_1 AML_ID_2 AML_ID_3 AML_ID_4 AML_ID_5 Stroma_ID_1 Stroma_ID_2 Stroma_ID_3 Stroma_ID_4 Stroma_ID_5 Stroma_CR_Pat_4 Stroma_CR_Pat_5 Stroma_CR_Pat_6 Stroma_CR_Pat_7 Stroma_CR_Pat_8
ENSG 8 1 11 5 10 0 628 542 767 578 462 680 513 968 415 623
ENSG 0 0 1 0 0 0 0 28 1 3 0 1 4 0 0 0
ENSG 661 1418 2580 6817 14727 5968 9 3 5 9 2 9 3 3 5 1
ENSG 20 315 212 8 790 471 1283 2042 1175 2839 1110 857 1880 1526 2262 2624
ENSG 11 26 24 9 11 2 649 532 953 463 468 878 587 245 722 484
And I want the above header to be split as follows:
network ID ID REL
node B_ALL AML Stroma
hemi 1 1 2 3 4 5 1 2 3 4 5 6 7 8 9 10
ENSG 8 1 11 5 10 0 628 542 767 578 462 680 513 968 415 623
ENSG 0 0 1 0 0 0 0 28 1 3 0 1 4 0 0 0
ENSG 661 1418 2580 6817 14727 5968 9 3 5 9 2 9 3 3 5 1
ENSG 20 315 212 8 790 471 1283 2042 1175 2839 1110 857 1880 1526 2262 2624
ENSG 11 26 24 9 11 2 649 532 953 463 468 878 587 245 722 484
Any help is greatly appreciated ..
This is probably not the best minimal example; very few people have the subject knowledge to understand what network, node and hemi mean in your context.
You just need to create a MultiIndex and replace your column index with it.
There are 3 rules in your example:
1. Whenever 'Stroma' is found, the column belongs to REL; otherwise it belongs to ID.
2. node is the first field of the initial column names.
3. hemi is the last field of the initial column names.
Then, just code away:
In [110]:
df.columns = pd.MultiIndex.from_tuples(list(zip(np.where(df.columns.str.find('Stroma')!=-1, 'REL', 'ID'),
                                                df.columns.map(lambda x: x.split('_')[0]),
                                                df.columns.map(lambda x: x.split('_')[-1]))),
                                       names=['network', 'node', 'hemi'])
print(df)
network ID REL \
node ALL AML Stroma
hemi 1 1 2 3 4 5 1 2 3 4 5
gene
ENSG 8 1 11 5 10 0 628 542 767 578 462
ENSG 0 0 1 0 0 0 0 28 1 3 0
ENSG 661 1418 2580 6817 14727 5968 9 3 5 9 2
ENSG 20 315 212 8 790 471 1283 2042 1175 2839 1110
ENSG 11 26 24 9 11 2 649 532 953 463 468
network
node
hemi 4 5 6 7 8
gene
ENSG 680 513 968 415 623
ENSG 1 4 0 0 0
ENSG 9 3 3 5 1
ENSG 857 1880 1526 2262 2624
ENSG 878 587 245 722 484
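The three rules can also be written with the vectorized .str accessor and MultiIndex.from_arrays. This is a sketch on a cut-down frame (five of the original columns, two rows), assuming gene is already the index. Like the answer above, it keeps the original suffixes for hemi and does not renumber the Stroma columns 1 to 10 as in the requested output.

```python
import numpy as np
import pandas as pd

# Cut-down version of the question's frame, with gene as the index.
df = pd.DataFrame(
    [[8, 1, 11, 628, 680], [0, 0, 1, 0, 1]],
    index=pd.Index(["ENSG", "ENSG"], name="gene"),
    columns=["ALL_ID_1", "AML_ID_1", "AML_ID_2", "Stroma_ID_1", "Stroma_CR_Pat_4"],
)

# Rule 1: columns containing 'Stroma' belong to REL, all others to ID.
network = np.where(df.columns.str.contains("Stroma"), "REL", "ID")
# Rules 2 and 3: first and last underscore-separated field of each name.
parts = df.columns.str.split("_")
node = parts.str[0]
hemi = parts.str[-1]

df.columns = pd.MultiIndex.from_arrays(
    [network, node, hemi], names=["network", "node", "hemi"]
)
print(df)
```

With the MultiIndex in place, a whole group can be selected by its top level, for example df["REL"].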
