I have an Excel file with 400k+ rows of protein-protein interactions identified by Entrez identifiers, and I want to map those identifiers to the corresponding identifiers of a different database, UniProt.
The database looks like this: [screenshot of Entrez ID pairs omitted]
and I want this: [screenshot of UniProt ID pairs omitted]
Assume that I already have the corresponding UniProt ID for each Entrez ID.
Could you please suggest an efficient way to do this? I can't think of anything other than iterating over the database.
OK, this took me a minute to grok, but I think I have this for you. We discussed the example in chat, so you should probably update your question to reflect my answer, since it varies from the original.
This just iterates over the tables, so it's not a more efficient version, but I wasn't sure whether you had anything to start from at this point, so at least this is something.
We're trying to create table2 from table1 and table3:
Starting with these CSV files:
table1.csv
paperA_db1,paperB_db1
9240,8601
8933,91289
table3.csv
paper_db1,paper_db2
9240,Q8ND90
8933,A6ZKI3
8601,O76081
91289,Q9BU23
We can do this using Python's csv module, like this:
import csv

# Build a lookup from Entrez ID (paper_db1) to UniProt ID (paper_db2).
mappings = {}
with open("table3.csv", newline="") as mapping_csv:
    reader = csv.DictReader(mapping_csv)
    for row in reader:
        mappings[row["paper_db1"]] = row["paper_db2"]

# Translate each interaction pair in table1 via the lookup.
# A list keeps duplicate first-column proteins, which a dict keyed on
# the paperA column would silently overwrite.
table2 = []
with open("table1.csv", newline="") as table1_csv:
    reader = csv.DictReader(table1_csv)
    for row in reader:
        table2.append((mappings[row["paperA_db1"]], mappings[row["paperB_db1"]]))

with open("table2.csv", mode="w", newline="") as table2_csv:
    fieldnames = ["paperA_db2", "paperB_db2"]
    writer = csv.DictWriter(table2_csv, fieldnames=fieldnames)
    writer.writeheader()
    for paperA_db2, paperB_db2 in table2:
        writer.writerow({"paperA_db2": paperA_db2, "paperB_db2": paperB_db2})
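If the plain loop ever feels slow at 400k+ rows, the same mapping can be done vectorized with pandas. This is only a sketch under the assumptions above (same table1.csv/table3.csv layout; pandas is not required by anything else in this answer):

import pandas as pd

# Same example files as above; pd.read_excel("file.xlsx") would work
# directly on the Excel file if openpyxl is installed.
interactions = pd.read_csv("table1.csv")  # columns: paperA_db1, paperB_db1
mapping = pd.read_csv("table3.csv")       # columns: paper_db1, paper_db2

# A Series indexed by Entrez ID gives a vectorized lookup table.
entrez_to_uniprot = mapping.set_index("paper_db1")["paper_db2"]

table2 = pd.DataFrame({
    "paperA_db2": interactions["paperA_db1"].map(entrez_to_uniprot),
    "paperB_db2": interactions["paperB_db1"].map(entrez_to_uniprot),
})
table2.to_csv("table2.csv", index=False)

Unmatched IDs come out as NaN, which makes missing mappings easy to spot afterwards with table2.isna().any().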
Mine is very similar to @DanielSchroederDev's.
I've got an error check on the lookup, so the script keeps going.
I also just use a csv.reader rather than a csv.DictReader; two columns is pretty easy to keep in your head.
It also seems like overkill to use pandas, but if your data is in Excel, you'll need an Excel reader. It's much easier to work with text files, so save it as CSV!
import csv

# Build the Entrez -> UniProt translation table from the key file.
trans = dict()
with open("key_file.csv", "r", encoding="utf8") as f:
    c = csv.reader(f)
    next(c)  # skip the header row
    for row in c:
        trans[row[0]] = row[1]
print(trans)

def lookup(p):
    try:
        return trans[p]
    except KeyError:
        print(f"No translation for {p}")
        return 0

with open("protiens.csv", "r", encoding="utf8") as f:
    c = csv.reader(f)
    next(c)  # skip the header row
    new_protiens = list(map(lambda x: [lookup(x[0]), lookup(x[1])], c))
print(new_protiens)

# newline="" avoids blank lines between rows on Windows.
with open("translated.csv", "w", encoding="utf8", newline="") as f:
    c = csv.writer(f)
    c.writerow(["proA_unipro", "proB_unipro"])
    for row in new_protiens:
        c.writerow(row)
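On the save-as-CSV point: if there are many Excel files to convert, doing it programmatically first is an option. A minimal sketch, not part of the original answer, assuming pandas plus openpyxl are installed and a made-up input file name:

import pandas as pd

# "interactions.xlsx" is a hypothetical name; openpyxl is needed for .xlsx.
pd.read_excel("interactions.xlsx").to_csv("protiens.csv", index=False)

After that one-off conversion, the plain csv scripts above work unchanged.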
Related
I have a small issue that I hope you'll be able to help me with :) I've tried to provide simplified examples to help you see what I mean. I'm using Python 2.6.
So, I'm currently trying to re-assign some values in a file which represents interactions between two objects. The interaction file (file1) looks something like this:
Thing1 Thing2 0.625
Thing2 Thing3 0.191
Thing1 Thing3 0.173
Whilst my other file (file2), also a tsv, looks something like:
DiffName1 Thing1 ...
DiffName2 Thing2 ...
DiffName3 Thing3 ...
Essentially, I'd like to take file1, find the corresponding 'DiffName' value in file2, and make a new file with the same layout as file1 but with 'Thing1' replaced by 'DiffName1' and so on, whilst maintaining the structure of file1, i.e. two columns with a corresponding interaction value.
So far, from asking questions and reading answers on here, I've achieved similar results with this script (I've checked, but there may be some redundant/wrong things in here):
import csv
import sys

interaction_file = sys.argv[1]
Out_file = sys.argv[2]
f_output = open(Out_file, 'wb')

ids = {}
with open('file2') as f_file2:
    csv_file2 = csv.reader(f_file2, skipinitialspace=True)
    header = next(csv_file2)
    for cols in csv_file2:
        ids[cols[7]] = cols[0]

with open(interaction_file, 'rb') as f_file1:
    csv_file1 = csv.reader(f_file1, delimiter='\t')
    csv_output = csv.writer(f_output, delimiter='\t')
    for cols in csv_file1:
        csv_output.writerow([ids.get(cols[0], cols[0]), ids.get(cols[1], cols[1]), cols[2]])
But for whatever reason (I suspect the slightly different layout of file2 compared to the file this script was originally written for), I've been unable to make this work for me. I've spent quite a bit of time trying to understand each line, but I still can't quite get it running, possibly because I don't fully understand the final line:
csv_output.writerow([ids.get(cols[0], cols[0]), ids.get(cols[1], cols[1]), cols[2]])
Is anyone able to give me some advice?
Cheers,
Matthew
Is ids[cols[7]] = cols[0] in that line just a typo? You seem to have only two columns in your example, yet you are trying to use the column at index 7 (the eighth column).
What the script does is declare a dictionary and populate it from the second file. Then, when you look up ids.get(cols[0], cols[0]), it searches for the key cols[0]; if that key is not in the dictionary, it returns the second argument of get, in this case cols[0] itself.
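A minimal illustration of that dict.get fallback (the values here are invented for the example):

ids = {'Thing1': 'DiffName1'}

# Key present: returns the mapped value.
print(ids.get('Thing1', 'Thing1'))  # DiffName1

# Key missing: returns the default, here the key itself.
print(ids.get('Thing9', 'Thing9'))  # Thing9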
I added some annotations to your script and changed/shortened some bits. The docs on dict.get should help you understand the last line:
import csv, sys

interaction_file, out_file = sys.argv[1], sys.argv[2]
f_output = open(out_file, 'wb')

with open('file2') as f_file2:
    # get lines as list and slice off header row
    rows = list(csv.reader(f_file2, skipinitialspace=True, delimiter='\t'))[1:]
    # ids: Thing* as key, DiffName* as value
    ids = {row[1]: row[0] for row in rows}

with open(interaction_file, 'rb') as f_file1:
    csv_file1 = csv.reader(f_file1, delimiter='\t')
    csv_output = csv.writer(f_output, delimiter='\t')
    for row in csv_file1:
        csv_output.writerow([ids.get(row[0], row[0]), ids.get(row[1], row[1]), row[2]])
        # ids.get(row[0], row[0]): dict.get(key[, default])
        # use value (DiffName*) for key row[0] (Thing*) from ids,
        # or use row[0] (Thing*) itself if it is not present as a key in ids
Check that your input files use the correct delimiters. Seeing the actual error message would also help.
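If you're unsure what delimiter a file actually uses, csv.Sniffer can guess it from a sample; a quick check, reusing the file2 name from the question:

import csv

# Sniffer infers the dialect (delimiter, quoting) from a text sample.
with open('file2') as f:
    dialect = csv.Sniffer().sniff(f.read(1024))
print(repr(dialect.delimiter))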
I'm somewhat new to Python and still trying to learn all its tricks and idioms.
I'm looking to see if it's possible to collect column data from two separate files to create a single dictionary, rather than two distinct dictionaries. The code that I've used to import files before looks like this:
import csv
from collections import defaultdict

columns = defaultdict(list)
with open("myfile.txt") as f:
    reader = csv.DictReader(f, delimiter='\t')
    for row in reader:
        for (header, variable) in row.items():
            columns[header].append(variable)
f.close()
This code makes each element of the first line of the file into a header for the columns of data below it. What I'd like to do now is to import a file that only contains one line which I'll use as my header, and import another file that only contains data that I'll match the headers up to. What I've tried so far resembles this:
columns = defaultdict(list)

with open("headerData.txt") as g:
    reader1 = csv.DictReader(g, delimiter='\t')
    for row in reader1:
        for (h, v) in row.items():
            columns[h].append(v)

with open("variableData.txt") as f:
    reader = csv.DictReader(f, delimiter='\t')
    for row in reader:
        for (h, v) in row.items():
            columns[h].append(v)
Is nesting the open statements the right way to attempt this? Honestly I am totally lost on what to do. Any help is greatly appreciated.
You can't use DictReader like that if the headers are not in the file. But you can create a fake file object that would yield the headers and then the data, using itertools.chain:
import csv
from itertools import chain

with open('headerData.txt') as h, open('variableData.txt') as data:
    f = chain(h, data)
    reader = csv.DictReader(f, delimiter='\t')
    # proceed with your code from the first snippet
    # no close() calls needed when using open() with "with" statements
Another way of course would be to just read the headers into a list and use regular csv.reader on variableData.txt:
with open('headerData.txt') as h:
    # strip the trailing newline so the last column name is clean
    names = next(h).rstrip('\n').split('\t')

with open('variableData.txt') as f:
    reader = csv.reader(f, delimiter='\t')
    for row in reader:
        for name, value in zip(names, row):
            columns[name].append(value)
By default, DictReader will take the first line in your csv file and use that as the keys for the dict. However, according to the docs, you can also pass it a fieldnames parameter, which is a sequence containing the names of the keys to use for the dict. So you could do this:
columns = defaultdict(list)
with open("headerData.txt") as f, open("variableData.txt") as data:
    reader = csv.DictReader(data,
                            fieldnames=f.read().rstrip().split('\t'),
                            delimiter='\t')
    for row in reader:
        for (h, v) in row.items():
            columns[h].append(v)
I am trying to append two data sets to my CSV file. Below is my code. The code runs, but my data gets appended below an existing set of data in the first column (i.e. col[0]). I would, however, like to append my data sets as separate columns at the end of the file. Could I please get advice on how I might be able to do this? Thanks.
import csv

Trial = open('Trial_test.csv', 'rt', newline='')
reader = csv.reader(Trial)

Trial_New = open('Trial_test.csv', 'a', newline='')
writer = csv.writer(Trial_New, delimiter=',')

Cortex = []
Liver = []
for col in reader:
    Cortex_Diff = float(col[14])
    Liver_Diff = float(col[17])
    Cortex.append(Cortex_Diff)
    Liver.append(Liver_Diff)

Avg_diff_Cortex = sum(Cortex)/len(Cortex)
Data1 = str(Avg_diff_Cortex)

Avg_diff_Liver = sum(Liver)/len(Liver)
Data2 = str(Avg_diff_Liver)

writer.writerows(Data1 + Data2)

Trial.close()
Trial_New.close()
I think I see what you are trying to do. I won't try to rewrite your function entirely for you, but here's a tip: assuming you are dealing with a manageable size of dataset, try reading your entire CSV into memory as a list of lists (or a list of tuples), then perform your calculations on the values in this object, and then write the Python object back out to the new CSV in a separate block of code. You may find this article or this one of use. Naturally, the official documentation should be helpful too.
Also, I would suggest using different files for input and output to make your life easier.
For example:
import csv

data = []
with open('Trial_test.csv', 'rb') as csvfile:
    reader = csv.reader(csvfile, delimiter=' ', quotechar='|')
    for row in reader:
        data.append(row)

# now do your calculations on the 'data' object.

with open('Trial_test_new.csv', 'wb') as csvfile:
    writer = csv.writer(csvfile, delimiter=' ', quotechar='|')
    for row in data:
        writer.writerow(row)
Something like that, anyway!
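One more note on the original symptom: writerows() expects an iterable of rows, and Data1 + Data2 is a single string, so each character gets written as its own row in the first column. To write the two averages as one row with two columns, pass a list to writerow instead:

# Instead of writer.writerows(Data1 + Data2):
writer.writerow([Data1, Data2])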
I have no knowledge of Python.
What I want to be able to do is create a script that will edit a CSV file so that it wraps every field in column 3 in quotes. I haven't been able to find much help. Is this quick and easy to do? Thanks.
column1,column2,column3
1111111,2222222,333333
This is a fairly crude solution, very specific to your request (assuming your source file is called "csvfile.csv" and is in C:\Temp).
import csv

newrow = []
csvFileRead = open('c:/temp/csvfile.csv', 'rb')
csvFileNew = open('c:/temp/csvfilenew.csv', 'wb')

# Open the CSV
csvReader = csv.reader(csvFileRead, delimiter=',')

# Append the rows to variable newrow
for row in csvReader:
    newrow.append(row)

# Add quotes around the third list item
for row in newrow:
    row[2] = "'" + str(row[2]) + "'"

csvFileRead.close()

# Create a new CSV file
csvWriter = csv.writer(csvFileNew, delimiter=',')

# Append the csv with rows from newrow variable
for row in newrow:
    csvWriter.writerow(row)

csvFileNew.close()
There are MUCH more elegant ways of doing what you want, but I've tried to break it down into basic chunks to show how each bit works.
I would start by looking at the csv module.
import csv

filename = 'file.csv'
rows = []
with open(filename, 'r') as f:  # open for reading, not 'wb'
    reader = csv.reader(f)
    for row in reader:
        row[2] = "'%s'" % row[2]
        rows.append(row)
And then write the modified rows back out to a CSV file.
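A minimal sketch of that write-back step (the output file name here is made up):

with open('file_quoted.csv', 'w', newline='') as f:
    writer = csv.writer(f)
    writer.writerows(rows)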
For example, my CSV has columns as below:
ID, ID2, Date, Job No, Code
I need to write the columns back in the same order. The dict jumbles the order immediately, so I believe it's more of a problem with the reader.
Python's dicts do NOT maintain order prior to 3.6 (but, regardless, in that version the csv.DictReader class was modified to return OrderedDicts).
However, the instance of csv.DictReader that you're using (after you've read the first row!-) does have a .fieldnames list of strings, which IS in order.
So,
for rowdict in myReader:
    print(['%s:%s' % (f, rowdict[f]) for f in myReader.fieldnames])
will show you that the order is indeed maintained (in .fieldnames of course, NEVER in the dict -- that's intrinsically impossible in Python!-).
So, suppose you want to read a.csv and write b.csv with the same column order. Using plain reader and writer is too easy, so you want to use the Dict varieties instead;-). Well, one way is...:
import csv

a = open('a.csv', 'r')
b = open('b.csv', 'w')
ra = csv.DictReader(a)
wb = csv.DictWriter(b, None)
for d in ra:
    if wb.fieldnames is None:
        # initialize and write b's headers
        dh = dict((h, h) for h in ra.fieldnames)
        wb.fieldnames = ra.fieldnames
        wb.writerow(dh)
    wb.writerow(d)
b.close()
a.close()
assuming you have headers in a.csv (otherwise you can't use a DictReader on it) and want just the same headers in b.csv.
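On current Python versions, DictWriter.writeheader() replaces the manual dict((h, h) ...) trick; a shorter equivalent sketch of the same a.csv-to-b.csv copy:

import csv

with open('a.csv', newline='') as a, open('b.csv', 'w', newline='') as b:
    ra = csv.DictReader(a)
    # accessing .fieldnames reads the header row from a.csv
    wb = csv.DictWriter(b, fieldnames=ra.fieldnames)
    wb.writeheader()
    wb.writerows(ra)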
Make an OrderedDict from each row dict sorted by DictReader.fieldnames.
import csv
from collections import OrderedDict

reader = csv.DictReader(open("file.csv"))
for row in reader:
    sorted_row = OrderedDict(sorted(row.items(),
                                    key=lambda item: reader.fieldnames.index(item[0])))
from csv import DictReader, DictWriter

with open("input.csv", 'r') as input_file:
    reader = DictReader(f=input_file)
    with open("output.csv", 'w') as output_file:
        writer = DictWriter(f=output_file, fieldnames=reader.fieldnames)
        writer.writeheader()  # without this, the output has no header row
        for row in reader:
            writer.writerow(row)
I know this question is old... but if you use DictReader, you can pass it an ordered list with the fieldnames to the fieldnames param (sketched below).
Edit: as of Python 3.6, dicts are ordered by insertion order, essentially making all dicts in Python OrderedDicts by default. That being said, the docs say don't rely on this behaviour because it may change. I will challenge that; let's see if it ever changes back :)
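A minimal sketch of that fieldnames suggestion (the file name and column names are borrowed from the question above):

import csv

# When fieldnames is given, DictReader does NOT treat the first row as
# a header, so skip it manually if the file has one.
with open('file.csv', newline='') as f:
    reader = csv.DictReader(f, fieldnames=['ID', 'ID2', 'Date', 'Job No', 'Code'])
    next(reader)  # skip the header row
    for row in reader:
        print(row)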
Unfortunately, the default DictReader does not allow for overriding the dict class, but a custom DictReader will do the trick:
import csv

class DictReader(csv.DictReader):
    def __init__(self, *args, **kwargs):
        self.dict_class = kwargs.pop('dict_class', dict)
        super(DictReader, self).__init__(*args, **kwargs)

    def __next__(self):
        ''' copied from python source '''
        if self.line_num == 0:
            # Used only for its side effect.
            self.fieldnames
        row = next(self.reader)
        self.line_num = self.reader.line_num

        # unlike the basic reader, we prefer not to return blanks,
        # because we will typically wind up with a dict full of None
        # values
        while row == []:
            row = next(self.reader)

        # using the customized dict_class
        d = self.dict_class(zip(self.fieldnames, row))
        lf = len(self.fieldnames)
        lr = len(row)
        if lf < lr:
            d[self.restkey] = row[lf:]
        elif lf > lr:
            for key in self.fieldnames[lr:]:
                d[key] = self.restval
        return d
Use it like so:

import collections

csv_reader = DictReader(f, dict_class=collections.OrderedDict)
# ...
I wrote a little tool to sort the order of CSV columns.
I don't claim that it's great (I know little of Python), but it does the job:
import csv
import sys

with open(sys.argv[1], 'r') as infile:
    csvReader = csv.DictReader(infile)
    sorted_fieldnames = sorted(csvReader.fieldnames)
    writer = csv.DictWriter(sys.stdout, fieldnames=sorted_fieldnames)
    # reorder the header first
    writer.writeheader()
    for row in csvReader:
        # writes the reordered rows to the new file
        writer.writerow(row)