2D list in Python - accessing through column names

I'm parsing two files which have data as shown below.
File1:
UID A B C D
------ ---------- ---------- ---------- ----------
456 536 1 148 304
1071 908 1 128 243
1118 4 8 52 162
249 4 8 68 154
1072 296 416 68 114
118 180 528 68 67
file2:
UID X Y A Z B
------ ---------- ---------- ---------- ---------- ---------
456 536 1 148 304 234
1071 908 1 128 243 12
1118 4 8 52 162 123
249 4 8 68 154 987
1072 296 416 68 114 45
118 180 528 68 67 6
I will be comparing two such files; however, the number of columns and the column names might vary. For every unique UID, I need to match the column names, compare, and find the differences.
Questions
1. Is there a way to access columns by column names instead of index?
2. Is there a way to dynamically assign column names based on the file data?
I'm able to load each file into a list and compare using indexes, but that's not a proper solution.
Thanks in advance.

You might consider using csv.DictReader. It lets you both address columns by name and handle a variable list of columns for each file opened. Consider removing the ------ row that separates the header from the actual data, as it might otherwise be read as a data row.
Example:
import csv
with open('File1', 'r', newline='') as f:
    # If you don't pass field names,
    # they are taken from the first row.
    reader = csv.DictReader(f)
    for line in reader:
        # `line` is a dict {'UID': val, 'A': val, ... }
        print(line)
If your input format has no single-character delimiter (columns are separated by runs of spaces), you can wrap the file in a generator that compresses continuous whitespace into e.g. a comma:
import csv
import re
r = re.compile(r'[ ]+')

def trim_whitespaces(f):
    # Collapse runs of spaces into a single comma on each line.
    for line in f:
        yield r.sub(',', line)

with open('test.txt', 'r', newline='') as f:
    reader = csv.DictReader(trim_whitespaces(f))
    for line in reader:
        print(line)
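Building on that, here is a minimal sketch of the actual per-UID comparison, assuming UID values are unique within each file and the data is whitespace-separated as shown in the question (load_by_uid is a hypothetical helper; File1 and file2 are the question's file names):

import csv
import re

def load_by_uid(path):
    # Hypothetical helper: index rows by UID, skipping the '------' separator
    # row and compressing runs of spaces into commas so csv can parse them.
    with open(path, 'r', newline='') as f:
        cleaned = (re.sub(r' +', ',', line.strip()) for line in f
                   if not line.lstrip().startswith('-'))
        reader = csv.DictReader(cleaned)
        rows = {row['UID']: row for row in reader}
        return rows, reader.fieldnames

rows1, cols1 = load_by_uid('File1')
rows2, cols2 = load_by_uid('file2')

# Compare only the columns both files share, for UIDs present in both.
shared = [c for c in cols1 if c in cols2 and c != 'UID']
for uid in rows1.keys() & rows2.keys():
    for col in shared:
        if rows1[uid][col] != rows2[uid][col]:
            print(uid, col, rows1[uid][col], rows2[uid][col])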

This is a good use case for pandas; loading the data is as simple as:
import pandas as pd
from io import StringIO
data = """ UID A B C D
------ ---------- ---------- ---------- ----------
456 536 1 148 304
1071 908 1 128 243
1118 4 8 52 162
249 4 8 68 154
1072 296 416 68 114
118 180 528 68 67 """
df = pd.read_csv(StringIO(data), skiprows=[1], delimiter=r'\s+')
Let's inspect results:
>>> df
UID A B C D
0 456 536 1 148 304
1 1071 908 1 128 243
2 1118 4 8 52 162
3 249 4 8 68 154
4 1072 296 416 68 114
5 118 180 528 68 67
After obtaining df2 by similar means, we can merge the results:
>>> df.merge(df2, on=['UID'])
UID A_x B_x C D X Y A_y Z B_y
0 456 536 1 148 304 536 1 148 304 234
1 1071 908 1 128 243 908 1 128 243 12
2 1118 4 8 52 162 4 8 52 162 123
3 249 4 8 68 154 4 8 68 154 987
4 1072 296 416 68 114 296 416 68 114 45
5 118 180 528 68 67 180 528 68 67 6
The resulting pandas.DataFrame has a rich API, and all SQL-like analysis operations such as joining, filtering, grouping, and aggregating are easy to perform. Look for examples on this site or in the documentation.
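Since the question ultimately asks for the per-UID differences on the shared columns, here is a minimal sketch continuing from the merge above (df and df2 are the frames built earlier; the _x/_y suffixes are pandas' defaults for overlapping column names):

# Columns present in both frames, apart from the join key.
shared = [c for c in df.columns if c in df2.columns and c != 'UID']

merged = df.merge(df2, on=['UID'], suffixes=('_x', '_y'))
for col in shared:
    # Difference between File1 and file2 for each shared column.
    merged[col + '_diff'] = merged[col + '_x'] - merged[col + '_y']

print(merged[['UID'] + [c + '_diff' for c in shared]])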

my_text = """UID A B C D
------ ---------- ---------- ---------- ----------
456 536 1 148 304
1071 908 1 128 243
1118 4 8 52 162
249 4 8 68 154
1072 296 416 68 114
118 180 528 68 67 """
lines = my_text.splitlines() #split your text into lines
keys= lines[0].split() #headers is your first line
table = [line.split() for line in lines[1:]] #the data is the rest
columns = zip(*table) #transpose the rows array to a columns array
my_dict = dict(zip(keys,columns)) #create a dict using your keys from earlier and matching them with columns
print(my_dict['A']) #access a column by name
Obviously you would need to change this if you had to read from a file instead.
Alternatively, this is what packages like pandas were made for:
import pandas
table = pandas.read_csv('foo.csv', index_col=0)
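With UID as the index (index_col=0), columns and individual cells are then addressable by name; a quick usage sketch, assuming foo.csv holds the File1 data:

print(table['A'])           # the whole 'A' column, indexed by UID
print(table.loc[456, 'A'])  # the 'A' value for the row with UID 456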

Related

How could I count the rating for each item_id?

From the u.data file, which is divided into [100000 rows x 4 columns],
I have to find out which are the best movies.
For each unique item_id (there are 1682 of them), I try to find the overall rating for each one separately.
import pandas as pd

ratings = pd.read_csv("erg3/files/u.data", encoding="utf-8", delim_whitespace=True,
                      names=["user_id", "item_id", "rating", "timestamp"])
The data has this form:
196 242 3 881250949
186 302 3 891717742
22 377 1 878887116
....
244 51 2 880606923
166 346 1 886397596
298 474 4 884182806
My expected output:
item_id
1 1753
2 420
3 273
4 742
...
1570 1
1486 1
1626 1
1580 1
I used best_m = ratings.groupby("item_id")["rating"].sum()
followed by best_m = best_m.sort_values(ascending=False)
and the output looks like:
50 2541
100 2111
181 2032
258 1936
174 1786
...
1581 1
1570 1
1486 1
1626 1
1580 1
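If the goal is literally to count the ratings per item_id (as the title asks) rather than to sum them, a minimal sketch, assuming the ratings frame loaded above:

# Count how many ratings each item_id received, most-rated first.
counts = ratings.groupby("item_id")["rating"].count()  # or .size()
counts = counts.sort_values(ascending=False)
print(counts)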

Pad rows on a pandas dataframe with zeros up to N rows

I am loading data via pandas read_csv like so:
data = pd.read_csv(file_name_item, sep=" ", header=None, usecols=[0,1,2])
which looks like so:
0 1 2
0 257 503 48
1 167 258 39
2 172 242 39
3 172 403 81
4 180 228 39
5 183 394 255
6 192 179 15
7 192 347 234
8 192 380 243
9 192 437 135
10 211 358 234
I would like to pad this data with zeros till a row count of 256, meaning:
0 1 2
0 257 503 48
1 167 258 39
2 172 242 39
3 172 403 81
4 180 228 39
5 183 394 255
6 192 179 15
7 192 347 234
8 192 380 243
9 192 437 135
10 211 358 234
11 0 0 0
.. .. .. ..
256 0 0 0
How do I go about doing this? The file could have anything from 1 row to 200 odd rows and I am looking for something generic which pads this dataframe with 0's till 256 rows.
I am quite new to pandas and could not find any function to do this.
Use reindex with fill_value:
df_final = data.reindex(range(257), fill_value=0)
Out[1845]:
0 1 2
0 257 503 48
1 167 258 39
2 172 242 39
3 172 403 81
4 180 228 39
.. ... ... ..
252 0 0 0
253 0 0 0
254 0 0 0
255 0 0 0
256 0 0 0
[257 rows x 3 columns]
We can also do:
new_df = df.reindex(range(257)).fillna(0, downcast='infer')
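Note the dtype difference: reindex with fill_value=0 keeps the integer dtypes, whereas plain reindex introduces NaN (hence floats) and needs fillna(..., downcast='infer') to get back to integers. Wrapped as a generic helper (pad_rows is a hypothetical name; it assumes the default 0..n-1 RangeIndex):

def pad_rows(df, n):
    # Pad with zero-filled rows until the frame has n rows (indices 0..n-1).
    return df.reindex(range(n), fill_value=0)

padded = pad_rows(data, 257)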

Comparing CSVs with rows as multiples of 6

I have two CSV files with row counts that are multiples of 6, and I want to compare them. If a row in CSV1 has the same values in columns 1 to 3 as any row in CSV2, but their column 4 values differ, replace column 4 in CSV1 with column 4 from CSV2. So far I have written the code below, which reads both CSVs and groups the rows in sixes, but I don't know what to do next, as it results in a list of lists of lists which I cannot handle. N.B. One CSV has more rows than the other.
My Code:
import csv

def datareader(datafile):
    with open(datafile, 'r') as f:
        reader = csv.reader(f)
        next(reader, None)  # skip the header row
        List1 = [lines for lines in reader]
    return [List1[pos:pos + 6] for pos in range(0, len(List1), 6)]

list1 = datareader('CSV1.csv')

def datareader1(datafile):
    # Read the csv
    with open(datafile, 'r') as f:
        reader = csv.reader(f)
        next(reader, None)  # skip the header row
        List2 = [lines for lines in reader]
    return [List2[pos:pos + 6] for pos in range(0, len(List2), 6)]

list2 = datareader1('CSV2.csv')
CSV1
frm to grp dur
192 177 1 999999
192 177 2 749
192 177 3 895
192 177 4 749
192 177 5 749
192 177 6 222222
192 178 1 222222
192 178 2 222222
192 178 3 222222
192 178 4 222222
192 178 5 1511
192 178 6 999999
192 179 1 999999
192 179 2 387
192 179 3 969
192 179 4 387
192 179 5 387
192 179 6 999999
CSV2
from_BAKCode to_BAKCode interval duration
192 177 1 999999
192 177 2 749
192 177 3 749
192 177 4 749
192 177 5 749
192 177 6 999999
192 178 1 999999
192 178 2 999999
192 178 3 999999
192 178 4 999999
192 178 5 1511
192 178 6 999999
You can use the pandas module for this kind of data manipulation; it will be much easier.
import pandas as pd

def add_new_dur(row):
    if row['dur'] == row['duration']:
        return row['dur']
    else:
        return row['duration']

fileNameCSV1 = 'csv1.csv'
fileNameCSV2 = 'csv2.csv'

df = dict()
for f in [fileNameCSV1, fileNameCSV2]:
    df[f.split('.')[0]] = pd.read_csv(f)

result = df['csv1'].merge(df['csv2'],
                          left_on=['frm', 'to', 'grp'],
                          right_on=['from_BAKCode', 'to_BAKCode', 'interval'])
result['new_dur'] = result.apply(add_new_dur, axis=1)
result = result[['frm', 'to', 'grp', 'new_dur']]
result = result.rename(columns={'new_dur': 'dur'})
The result will look like this.
frm to grp dur
0 192 177 1 999999
1 192 177 2 749
2 192 177 3 749
3 192 177 4 749
4 192 177 5 749
5 192 177 6 999999
6 192 178 1 999999
7 192 178 2 999999
8 192 178 3 999999
9 192 178 4 999999
10 192 178 5 1511
11 192 178 6 999999
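To complete the task as stated (replacing column 4 in CSV1), the updated table can be written back out; a one-line sketch with a hypothetical output name:

# Write the updated CSV1 back to disk (output file name is an assumption).
result.to_csv('CSV1_updated.csv', index=False)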
In case one CSV file has more rows than the other, the extra rows will be omitted (merge performs an inner join by default).
Hope it helps.

Renaming a subset of index from a dataframe

I have a dataframe which looks like this:
Geneid PRKCZ.exon1 PRKCZ.exon2 PRKCZ.exon3 PRKCZ.exon4 PRKCZ.exon5 PRKCZ.exon6 PRKCZ.exon7 PRKCZ.exon8 PRKCZ.exon9 PRKCZ.exon10 ... FLNA.exon31 FLNA.exon32 FLNA.exon33 FLNA.exon34 FLNA.exon35 FLNA.exon36 FLNA.exon37 FLNA.exon38 MTCP1.exon1 MTCP1.exon2
S28 22 127 135 77 120 159 49 38 409 67 ... 112 104 37 83 47 18 110 70 167 19
22 3 630 178 259 142 640 77 121 521 452 ... 636 288 281 538 276 109 242 314 790 484
S04 16 658 320 337 315 881 188 162 769 577 ... 1291 420 369 859 507 208 554 408 1172 706
56 26 663 343 390 314 1090 263 200 844 592 ... 675 243 250 472 280 133 300 275 750 473
S27 13 1525 571 1081 560 1867 427 370 1348 1530 ... 1817 926 551 1554 808 224 971 1313 1293 701
5 rows × 8297 columns
In the above dataframe I need to add an extra column with information about the index. So I made a list, healthy, with all the index values that should be labelled 'h'; everything else should be 'd'.
And so I tried the following lines:
healthy=['39','41','49','50','51','52','53','54','56']
H_type =pd.Series( ['h' for x in df.loc[healthy]
else 'd' for x in df]).to_frame()
But it is throwing me following error:
SyntaxError: invalid syntax
Any help would be really appreciated
In the end I am aiming for something like this:
Geneid sampletype SSX4.exon4 SSX2.exon11 DUX4.exon5 SSX2.exon3 SSX4.exon5 SSX2.exon10 SSX4.exon7 SSX2.exon9 SSX4.exon8 ... SETD2.exon21 FAT2.exon15 CASC5.exon8 FAT1.exon21 FAT3.exon9 MLL.exon31 NACA.exon7 RANBP2.exon20 APC.exon16 APOB.exon4
S28 h 0 0 0 0 0 0 0 0 0 ... 2480 2003 2749 1760 2425 3330 4758 2508 4367 4094
22 h 0 0 0 0 0 0 0 0 0 ... 8986 7200 10123 12422 14528 18393 9612 15325 8788 11584
S04 h 0 0 0 0 0 0 0 0 0 ... 14518 16657 17500 15996 17367 17948 18037 19446 24179 28924
56 h 0 0 0 0 0 0 0 0 0 ... 17784 17846 20811 17337 18135 19264 19336 22512 28318 32405
S27 h 0 0 0 0 0 0 0 0 0 ... 10375 20403 11559 18895 18410 12754 21527 11603 16619 37679
Thank you
I think you can use numpy.where with isin, if Geneid is a column.
EDIT by comment:
There can be integers in the Geneid column, so you can cast it to string with astype.
import numpy as np

healthy = ['39','41','49','50','51','52','53','54','56']
df['type'] = np.where(df['Geneid'].astype(str).isin(healthy), 'h', 'd')

#get the last column as a one-element list
print(df.columns[-1].split())
['type']

#create a new list from the last column plus all columns without the last
cols = df.columns[-1].split() + df.columns[:-1].tolist()
print(cols)
['type', 'Geneid', 'PRKCZ.exon1', 'PRKCZ.exon2', 'PRKCZ.exon3', 'PRKCZ.exon4',
'PRKCZ.exon5', 'PRKCZ.exon6', 'PRKCZ.exon7', 'PRKCZ.exon8', 'PRKCZ.exon9',
'PRKCZ.exon10', 'FLNA.exon31', 'FLNA.exon32', 'FLNA.exon33', 'FLNA.exon34',
'FLNA.exon35', 'FLNA.exon36', 'FLNA.exon37', 'FLNA.exon38', 'MTCP1.exon1', 'MTCP1.exon2']

#reorder columns
print(df[cols])
type Geneid PRKCZ.exon1 PRKCZ.exon2 PRKCZ.exon3 PRKCZ.exon4 \
0 d S28 22 127 135 77
1 d 22 3 630 178 259
2 d S04 16 658 320 337
3 h 56 26 663 343 390
4 d S27 13 1525 571 1081
PRKCZ.exon5 PRKCZ.exon6 PRKCZ.exon7 PRKCZ.exon8 ... \
0 120 159 49 38 ...
1 142 640 77 121 ...
2 315 881 188 162 ...
3 314 1090 263 200 ...
4 560 1867 427 370 ...
FLNA.exon31 FLNA.exon32 FLNA.exon33 FLNA.exon34 FLNA.exon35 \
0 112 104 37 83 47
1 636 288 281 538 276
2 1291 420 369 859 507
3 675 243 250 472 280
4 1817 926 551 1554 808
FLNA.exon36 FLNA.exon37 FLNA.exon38 MTCP1.exon1 MTCP1.exon2
0 18 110 70 167 19
1 109 242 314 790 484
2 208 554 408 1172 706
3 133 300 275 750 473
4 224 971 1313 1293 701
[5 rows x 22 columns]
You could use pandas isin().
First add an extra column called 'sampletype' and fill it with 'd'. Then find all samples that have a Geneid in healthy and set them to 'h'. Supposing your main dataframe is called df, you would use something like:
healthy = ['39','41','49','50','51','52','53','54','56']
df['sampletype'] = 'd'
df.loc[df['Geneid'].isin(healthy), 'sampletype'] = 'h'
(Using .loc for the assignment avoids chained indexing, which can fail to write through and raises SettingWithCopyWarning.)

Grouping, sorting lines and removing redundancies

I have an input file having 15 columns,
con13 tr|M0VCZ1| 91.39 267 23 0 131 211 1 267 1 480 239 267 33.4 99.6
con13 tr|M8B287| 97.12 590 17 0 344 211 1 267 0 104 239 590 74.0 99.8
con15 tr|M0WV77| 92.57 148 11 0 73 516 1 148 2 248 256 148 17.3 99.3
con15 tr|C5WNQ0| 85.14 148 22 0 73 516 1 178 4 233 256 148 17.3 99.3
con15 tr|B8AQC2| 83.78 148 24 0 73 516 1 148 6 233 256 148 17.3 99.3
con18 tr|G9HXG9| 99.66 293 1 0 144 102 1 293 7 527 139 301 63.1 97.0
con18 tr|M0XCZ0| 98.29 293 5 0 144 102 1 293 2 519 139 301 63.1 97.0
I need to: 1) group and iterate inside each con (using groupby); 2) sort by line[2] from lowest to highest value; 3) see inside each group whether line[0], line[8] and line[9] are similar; 4) if they are, remove the repetitive elements, keeping the one with the highest value in line[2], and print the results to a new .txt file, so that my output file looks like this:
con13 tr|M8B287| 97.12 590 17 0 344 211 1 267 0 104 239 590 74.0 99.8
con15 tr|M0WV77| 92.57 148 11 0 73 516 1 148 2 248 256 148 17.3 99.3
con15 tr|C5WNQ0| 85.14 148 22 0 73 516 1 178 4 233 256 148 17.3 99.3
con18 tr|G9HXG9| 99.66 293 1 0 144 102 1 293 7 527 139 301 63.1 97.0
My attempted script prints only a single con and does not sort:
from itertools import groupby

f1 = open('example.txt','r')
f2 = open('result1', 'w')
f3 = open('result2.txt','w')

for k, g in groupby(f1, key=lambda x: x.split()[0]):
    seen = set()
    for line in g:
        hsp = tuple(line.rsplit())
        if hsp[8] and hsp[9] not in seen:
            seen.add(hsp)
            f2.write(line.rstrip() + '\n')
        else:
            f3.write(line.rstrip() + '\n')
Use the csv module to pre-split your lines and to write out formatted data again, and group on a tuple of just the 9th and 10th columns to track similar rows:
import csv
from itertools import groupby
from operator import itemgetter

with open('example.txt', 'r', newline='') as f1, \
     open('result1', 'w', newline='') as f2, \
     open('result2.txt', 'w', newline='') as f3:
    reader = csv.reader(f1, delimiter='\t')
    writer1 = csv.writer(f2, delimiter='\t')
    writer2 = csv.writer(f3, delimiter='\t')
    for group, rows in groupby(reader, itemgetter(0)):
        rows = sorted(rows, key=itemgetter(8, 9, 2))
        for k, subrows in groupby(rows, itemgetter(8, 9)):
            # now we are grouping on columns 8 and 9,
            # *and* these are sorted on column 2;
            # everything but the *last* row is written to writer2
            subrows = list(subrows)
            writer1.writerow(subrows[-1])
            writer2.writerows(subrows[:-1])
The sorted(rows, key=itemgetter(8, 9, 2)) call sorts the grouped rows (all rows that share the same row[0] value) on columns 8, 9 and 2, in that order.
Because you want to write only the row with the highest value in column 2 (per group of rows with columns 8 and 9 equal) to the first result file, we group again, this time on just columns 8 and 9. Since the rows are already sorted, each inner group is in ascending order of column 2, so the last row is written to result1 and the rest to result2.txt.
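One caveat: csv yields strings, so itemgetter(8, 9, 2) compares column 2 lexicographically. That happens to work for the sample values, but for general numeric data a float conversion on column 2 is safer; a small variant of the sort key:

# Sort numerically on column 2 instead of lexicographically.
rows = sorted(rows, key=lambda r: (r[8], r[9], float(r[2])))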
