Compare 2 .csv Files via Loop - python

I have 15 .csv files with the following formats:
**File 1**
MYC
RASSF1
DAPK1
MDM2
TP53
E2F1
...
**File 2**
K06227
C00187
GLI1
PTCH1
BMP2
TP53
...
I would like to create a loop that runs through each of the 15 files and compares two at a time, creating unique pairs. So File 1 and File 2 would be compared with each other, giving an output telling me how many matches were found and what they were. In the above example, the output would be:
1 match and TP53
The loop would be used to compare all the files against each other: 1,3 (File 1 against File 3), 1,4, and so on.
f1 = set(open(str(cancers[1]) + '.csv', 'r'))
f2 = set(open(str(cancers[2]) + '.csv', 'r'))
f3 = open(str(cancers[1]) + '_vs_' + str(cancers[2]) + '.txt', 'wb').writelines(f1 & f2)
The above works but I'm having a hard time creating the looping portion.

To avoid comparing a file with itself, and to keep the code flexible about the number of cancers, I would write it like this. I assume cancers is a list:
# example list of cancers
cancers = ['BRCA', 'BLCA', 'HNSC']
fout = open('match.csv', 'w')
for i in range(len(cancers)):
    for j in range(len(cancers)):
        if j > i:
            # if the elements of cancers are already strings,
            # there is no need for str(cancers[i])
            f1 = [x.strip() for x in set(open(cancers[i] + '.csv', 'r'))]
            f2 = [x.strip() for x in set(open(cancers[j] + '.csv', 'r'))]
            match = list(set(f1) & set(f2))
            # use ; to separate matched genes so Excel can read them
            fout.write('{}_vs_{},{} matches,{}\n'.format(
                cancers[i], cancers[j], len(match), ';'.join(match)))
fout.close()
Results
BRCA_vs_BLCA,1 matches,TP53
BRCA_vs_HNSC,6 matches,TP53;BMP2;GLI1;C00187;PTCH1;K06227
BLCA_vs_HNSC,1 matches,TP53

To loop through all unique pairs of the 15 files, something like this can do it (note the indices must stay inside the list, so range over len(cancers) rather than hard-coding 1 through 16, which would skip the first file and overrun the last):
for i in range(len(cancers)):
    for j in range(i + 1, len(cancers)):
        f1 = set(open(str(cancers[i]) + '.csv', 'r'))
        f2 = set(open(str(cancers[j]) + '.csv', 'r'))
        open(str(cancers[i]) + '_vs_' + str(cancers[j]) + '.txt',
             'w').writelines(f1 & f2)
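An alternative that avoids index bookkeeping entirely is itertools.combinations, which yields each unordered pair exactly once. A minimal sketch, reusing the example cancers list from the answer above:
from itertools import combinations

cancers = ['BRCA', 'BLCA', 'HNSC']  # assumed list of file-name stems

for a, b in combinations(cancers, 2):  # every unique pair, each once
    with open(a + '.csv') as fa, open(b + '.csv') as fb:
        matches = {x.strip() for x in fa} & {x.strip() for x in fb}
    print('{}_vs_{}: {} matches: {}'.format(
        a, b, len(matches), ';'.join(matches)))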

Related

Python: write function return value to output file

I have 5 input files. I'm reading the target and the array from the 2nd and 3rd lines of each. I have a twoSum function, and I need to take the function's return value(s) and output them to 5 output files.
I know my function is correct; it prints out just fine.
I know I'm reading the 5 input files and creating and writing to the 5 output files.
What I can't figure out is how to write the return value(s) from the twoSum function INTO the output file.
def twoSum(arr, target):
    for i in range(len(arr)):
        for j in range(i, len(arr)):
            curr = arr[i] + arr[j]
            if arr[i] * 2 == target:
                return [i, i]
            if curr == target:
                return [i, j]
    return None  # no pair sums to target
Read 5 files
inPrefix = "in"
outPrefix = "out"
for i in range(1, 6):
    inFile = inPrefix + str(i) + ".txt"
    with open(inFile, 'r') as f:
        fileLines = f.readlines()
        target = fileLines[1]
        arr = fileLines[2]
How do I write twoSum's return value to the output file here?
???????
Output to 5 files
outFile = outPrefix + str(i) + ".txt"
with open(outFile, 'a') as f:
    f.write(target)  # just a test to make sure I'm writing successfully
    f.write(arr)     # just a test to make sure I'm writing successfully
Two things come to mind:
Open the file in wt mode (Write Text).
processedOutput appears to be a list. When you write to a file, you need to write a string. Are you wanting CSV-style output of those values, or just a line of space-separated values? The simplest way here would be something like " ".join(processedOutput), which joins all the items in your processedOutput list with spaces.
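Putting those pieces together, a minimal sketch of the whole read/compute/write loop might look like this. It assumes the target line holds one integer and the array line holds comma-separated integers; adjust the parsing to match the real file format:
inPrefix = "in"
outPrefix = "out"

for i in range(1, 6):
    with open(inPrefix + str(i) + ".txt") as f:
        fileLines = f.readlines()
    # assumption: line 2 is the target, line 3 the comma-separated array
    target = int(fileLines[1])
    arr = [int(x) for x in fileLines[2].split(",")]

    result = twoSum(arr, target)  # e.g. [0, 2], or None if no pair exists

    with open(outPrefix + str(i) + ".txt", "w") as f:
        if result is None:
            f.write("no pair found\n")
        else:
            # a list must be turned into a string before writing
            f.write(" ".join(str(n) for n in result) + "\n")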

Trying to stepwise iterate through 2 files in python

I am trying to merge two LARGE input files into one output file, sorting as I go.
## Above I counted the number of lines in each table
print("Processing Table Lines: table 1 has " + str(count1) + " and table 2 has " + str(count2))
newLine, compare, line1, line2 = [], 0, [], []
while count1 + count2 > 0:
    if count1 > 0 and compare <= 0:
        count1, line1 = count1 - 1, ifh1.readline().rstrip().split('\t')
    else:
        line1 = []
    if count2 > 0 and compare >= 0:
        count2, line2 = count2 - 1, ifh2.readline().rstrip().split('\t')
    else:
        line2 = []
    compare = compareTableLines(line1, line2)
    newLine = mergeLines(line1, line2, compare, tIndexes)
    ofh.write('\t'.join(newLine) + '\n')
What I expect to happen is that, as lines are written to the output, the next line is pulled from whichever file was just consumed, if one is available. I also expect the loop to stop once both files are exhausted.
However I keep getting this error:
ValueError: Mixing iteration and read methods would lose data
I just don't see how to get around it. Either file is too large to keep in memory so I want to read as I go.
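About the error itself: Python 2 file objects raise this ValueError when iteration (for line in f, or next(f)) is mixed with read methods such as readline() on the same handle, because iteration uses an internal read-ahead buffer. The usual fix is to pick one access style per file object. A minimal sketch using the iterator protocol throughout (the file name is hypothetical):
def read_fields(fh):
    # fetch the next tab-separated line via the iterator protocol
    try:
        return next(fh).rstrip().split('\t')
    except StopIteration:
        return []  # end of file reached

with open("table1.txt") as ifh1:  # hypothetical input file
    line1 = read_fields(ifh1)     # never call ifh1.readline() elsewhere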
Here's an example of merging two ordered files, CSV files in this case, using heapq.merge() and itertools.groupby(). Given 2 CSV files:
x.csv:
key1,99
key2,100
key4,234
y.csv:
key1,345
key2,4
key3,45
Running:
import csv, heapq, itertools

keyfun = lambda row: row[0]
with open("x.csv") as inf1, open("y.csv") as inf2, open("z.csv", "w") as outf:
    in1, in2, out = csv.reader(inf1), csv.reader(inf2), csv.writer(outf)
    for key, rows in itertools.groupby(heapq.merge(in1, in2, key=keyfun), keyfun):
        out.writerow([key, sum(int(r[1]) for r in rows)])
we get:
z.csv:
key1,444
key2,104
key3,45
key4,234
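Note that heapq.merge() assumes both inputs are already sorted by the merge key; that is what lets the merge (and the groupby() pass over it) run with only one row per input in memory at a time. The key= parameter of heapq.merge() requires Python 3.5 or later.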

Performance issue when parsing a large file line by line

I have a set of several million small numbers stored in a file.
I wrote a Python script that reads numbers from a tab-delimited text file line by line, computes the remainders, and appends the results to an output file. For some reason it consumes a lot of RAM (20 GB of RAM on Ubuntu to parse a million numbers). It also freezes the system because of the frequent writes.
What is the correct way to tweak this script?
import os
import re

my_path = '/media/me/mSata/res/'
# output_file.open() before the first loop didn't help
for file_id in range(10, 11):  # 10,201
    filename = my_path + "in" + str(file_id) + ".txt"
    fstr0 = my_path + "out" + str(file_id) + "_0.log"
    fstr1 = my_path + "res" + str(file_id) + "_1.log"
    with open(filename) as fp:
        stats = [0] * 512
        line = fp.readline()
        while line:
            raw_line = line.strip()
            arr_of_parsed_numbers = re.split(r'\t+', raw_line.rstrip('\t'))
            for num_index in range(0, len(arr_of_parsed_numbers)):
                my_number = int(arr_of_parsed_numbers[num_index])
                v0 = (my_number % 257) - 1  # value 257 is correct
                my_number = my_number // 257
                stats[v0] += 1
                v1 = my_number % 256
                stats[256 + v1] += 1
                f0 = open(fstr0, "a")
                f1 = open(fstr1, "a")
                f0.write("{}\n".format(str(v0).rjust(3)))
                f1.write("{}\n".format(str(v1).rjust(3)))
                f0.close()
                f1.close()
            line = fp.readline()
    print(stats)
    # tried output_file.close() here as well
print("done")
Updated:
I've run this script under Windows 10 (about 10 MB of memory used by python.exe) and Ubuntu (about 10 GB of memory consumed). What can cause this discrepancy? A factor of a thousand is a lot.
Try something like this. Note that the files are only opened and closed once each, and the loop iterates once per line:
import os
import re

my_path = '/media/me/mSata/res/'
# output_file.open() before the first loop didn't help
for file_id in range(10, 11):  # 10,201
    filename = my_path + "in" + str(file_id) + ".txt"
    fstr0 = my_path + "out" + str(file_id) + "_0.log"
    fstr1 = my_path + "res" + str(file_id) + "_1.log"
    with open(filename, "r") as fp, open(fstr0, "a") as f0, open(fstr1, "a") as f1:
        stats = [0] * 512
        for line in fp:
            raw_line = line.strip()
            arr_of_parsed_numbers = re.split(r'\t+', raw_line.rstrip('\t'))
            for num_index in range(0, len(arr_of_parsed_numbers)):
                my_number = int(arr_of_parsed_numbers[num_index])
                v0 = (my_number % 257) - 1  # value 257 is correct
                my_number = my_number // 257
                stats[v0] += 1
                v1 = my_number % 256
                stats[256 + v1] += 1
                f0.write("{}\n".format(str(v0).rjust(3)))
                f1.write("{}\n".format(str(v1).rjust(3)))
    print(stats)
print("done")
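The key change: the original script opened and closed both log files once per parsed number, and every close() flushes the write buffer, so the OS was hit with millions of tiny open/write/close cycles. Opening each output file once per input file removes that overhead and lets normal write buffering do its job.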

Find sum of numbers in line

This is what I have to do:
Read the contents of a text file where each line has two numbers separated by a comma (like 10, 5\n, 12, 8\n, …)
Sum those two numbers
Write the two original numbers and the result of the summation to a new text file, like 10 + 5 = 15\n, 12 + 8 = 20\n, …
So far, I've got this:
import os
import sys

relative_path = "Homework 2.txt"
if not os.path.exists(relative_path):
    print "not found"
    sys.exit()

read_file = open(relative_path, "r")
lines = read_file.readlines()
read_file.close()
print lines

path_output = "data_result4.txt"
write_file = open(path_output, "w")
for line in lines:
    line_array = line.split()
    print line_array
You need a good understanding of Python to understand this.
First, read the file, and get all of the lines by splitting it with a line feed (\n)
For each expression, calculate the answer and write it. Remember, you need to cast the numbers to integers so that they can be added together.
with open('Original.txt') as f:
    lines = f.read().split('\n')

with open('answers.txt', 'w+') as f:
    for expression in lines:  # expression should be in format '12, 8'
        nums = [int(i) for i in expression.split(', ')]
        f.write('{} + {} = {}\n'.format(nums[0], nums[1], nums[0] + nums[1]))
        # That should write '12 + 8 = 20\n'
Make your last for loop look like this:
for line in lines:
    splitline = line.strip().split(",")
    summation = sum(map(int, splitline))
    write_file.write(" + ".join(splitline) + " = " + str(summation) + "\n")
One beautiful thing about that way is that you can have as many numbers as you want on a line, and it will still display correctly.
Seems like the input file is CSV, so just use Python's csv reader module.
Input File Homework 2.txt
1, 2
1,3
1,5
10,6
The script
import csv

f = open('Homework 2.txt', 'rb')
reader = csv.reader(f)
result = []
for line in list(reader):
    nums = [int(i) for i in line]
    result.append(["%(a)s + %(b)s = %(c)s" % {'a': nums[0], 'b': nums[1], 'c': nums[0] + nums[1]}])

f = open('Homework 2 Output.txt', 'wb')
writer = csv.writer(f, delimiter=',', quotechar='"', quoting=csv.QUOTE_MINIMAL)
for line in result:
    writer.writerow(line)
The output file is then Homework 2 Output.txt
1 + 2 = 3
1 + 3 = 4
1 + 5 = 6
10 + 6 = 16

Writing and naming multiple files in python

The following code splits "t.txt" into multiple files of a certain number of lines each, naming each file with an increasing count. Now I want to name the files "1, 2, 3, 4, 5, 6, 7, 8..." or "mya, myb, myc, myd...". How do I change the code?
with open("t.txt") as f:
    probelist = [x.strip() for x in f.readlines()]
for i in probelist:
    if not itemcount % filesize:
        outfile = open("{}".format(filenum).zfill(8), "w")
        filenum += 1
    outfile.write(i + "\n")
    itemcount += 1
outfile.close()
You can use itertools.islice to take filesize-line slices, using enumerate to get a unique name for each file, and passing 1 as the start to enumerate so the indexing begins at 1:
from itertools import islice

with open("t.txt") as f:
    for ind, sli in enumerate(iter(lambda: list(islice(f, filesize)), []), 1):
        with open("{}.txt".format(ind), "w") as out:
            out.writelines(sli)
I added "file_count" to do what I assumed "filenum" was doing. Does this do the job for you? Sorry about the earlier confusion; I was paying attention to the variable name instead of the definition.
file_count = 1
itemcount = 0  # added: counts lines written; filesize is assumed defined
with open("t.txt") as f:
    probelist = [x.strip() for x in f.readlines()]
for i in probelist:
    if not itemcount % filesize:
        file_name = str(file_count) + ".txt"
        outfile = open(file_name, "w")
        file_count += 1
    outfile.write(i + "\n")
    itemcount += 1
outfile.close()
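For the "mya, myb, myc, ..." style names the question also asks about, one option is to index into string.ascii_lowercase instead of converting a counter to digits. A minimal sketch, assuming filesize is defined and at most 26 output files are needed:
import string

filesize = 100  # assumed number of lines per output file

with open("t.txt") as f:
    probelist = [x.strip() for x in f.readlines()]

outfile = None
for itemcount, line in enumerate(probelist):
    if not itemcount % filesize:
        if outfile:
            outfile.close()
        # 'my' + a, b, c, ... (supports up to 26 files)
        name = "my" + string.ascii_lowercase[itemcount // filesize] + ".txt"
        outfile = open(name, "w")
    outfile.write(line + "\n")
if outfile:
    outfile.close()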
