First time posting here so please be patient!
I have a file that looks like this:
POS {ALLELE:COUNT}
1 G:27 A:11
2 C:40 T:0
3 C:40 A:0
4 T:40 G:0
5 G:0 C:40
6 C:40 T:0
7 G:24 A:14
8 G:40 A:0
9 A:40 G:0
...
I want to combine the information from the second and third columns of each line into the format "number[A],number[C],number[G],number[T]", so that the example above would look like this:
POS {ALLELE:COUNT}
1 11,0,27,0
2 0,40,0,0
3 0,40,0,0
4 0,0,0,40
5 0,40,0,0
6 0,40,0,0
7 14,0,24,0
8 0,0,40,0
9 40,0,0,0
...
Any idea on how I could do that would be much appreciated!
Here's a method that works:
lines = open('test.txt', 'r').read().splitlines()
place = {'A': 0, 'C': 1, 'G': 2, 'T': 3}
counts = [[0 for _ in range(4)] for _ in range(len(lines[1:]))]
for i, row in enumerate(lines[1:]):
    for ct in row.split()[1:]:
        a, b = ct.split(':')
        counts[i][place[a]] = int(b)
out_str = '\n'.join([lines[0]] + ['{:<4}{},{},{},{}'.format(i + 1, *ct)
                                  for i, ct in enumerate(counts)])
with open('output.txt', 'w') as f:
    f.write(out_str)
The resulting file reads
POS {ALLELE:COUNT}
1 11,0,27,0
2 0,40,0,0
3 0,40,0,0
4 0,0,0,40
5 0,40,0,0
6 0,40,0,0
7 14,0,24,0
8 0,0,40,0
9 40,0,0,0
I assume that your file is a regular text file (not a CSV or otherwise delimited file) and that G:27 A:11 is a line of this text file.
For each line you can do as follows (taking the first line as an example):
remove the useless spaces using strip():
'G:27 A:11'.strip() gives 'G:27 A:11'; then split on whitespace to obtain ['G:27', 'A:11']. Then for each element of this list, split on ':' to get the allele type and its count. Altogether it would look like:
resulting_table = []
for line in file:  # you can easily find how to read a file line by line
    split_line = line.strip().split()
    A, T, G, C = 0, 0, 0, 0
    for pair in split_line[1:]:  # skip the position column
        element = pair.split(':')
        if element[0] == 'A':
            A = int(element[1])
        elif element[0] == 'T':
            T = int(element[1])
        elif element[0] == 'G':
            G = int(element[1])
        elif element[0] == 'C':
            C = int(element[1])
    resulting_table.append([A, C, G, T])  # A,C,G,T order, as in the desired output
And here you go! You can then easily transform it into a DataFrame or a NumPy array.
This is absolutely not the most efficient nor the most elegant way to get your desired output, but it is clear and understandable for a Python beginner.
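As a follow-up to the "transform it easily" remark, here is a minimal sketch of that conversion, assuming NumPy is installed; the short `resulting_table` sample and the `col_of` / `g_counts` names are mine, for illustration only:

```python
import numpy as np

# resulting_table as produced by a loop like the one above; assumes the
# counts were converted to int and stored in A, C, G, T order
resulting_table = [[11, 0, 27, 0], [0, 40, 0, 0]]

arr = np.array(resulting_table)      # shape: (number of positions, 4)
col_of = {'A': 0, 'C': 1, 'G': 2, 'T': 3}
g_counts = arr[:, col_of['G']]       # counts for allele G at every position
```

Once the counts are in an array, per-allele columns can be pulled out with a single slice instead of another loop.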
source.txt
POS {ALLELE:COUNT}
1 G:27 A:11
2 C:40 T:0
3 C:40 A:0
4 T:40 G:0
5 G:0 C:40
6 C:40 T:0
7 G:24 A:14
8 G:40 A:0
9 A:40 G:0
import re

template = ('A', 'C', 'G', 'T')

def proc_line(line: str):
    index, *elements = re.findall(r'\S+', line)
    data = dict([*map(lambda x: x.split(':'), elements)])
    return f'{index}\t' + ','.join([data.get(item, '0') for item in template]) + '\n'

with open('source.txt', encoding='utf-8') as file:
    header, *lines = file.readlines()

with open('output.txt', 'w', encoding='utf-8') as new_file:
    new_file.writelines(
        [header] + list(map(proc_line, lines))
    )
output.txt
POS {ALLELE:COUNT}
1 11,0,27,0
2 0,40,0,0
3 0,40,0,0
4 0,0,0,40
5 0,40,0,0
6 0,40,0,0
7 14,0,24,0
8 0,0,40,0
9 40,0,0,0
$ awk -F'[ :]+' '
NR>1 {
    delete f
    f[$2] = $3
    f[$4] = $5
    $0 = sprintf("%s %d,%d,%d,%d", $1, f["A"], f["C"], f["G"], f["T"])
}
1' file
POS {ALLELE:COUNT}
1 11,0,27,0
2 0,40,0,0
3 0,40,0,0
4 0,0,0,40
5 0,40,0,0
6 0,40,0,0
7 14,0,24,0
8 0,0,40,0
9 40,0,0,0
While reading a file in python, I was wondering how to get the next n lines when we encounter a line that meets my condition.
Say there is a file like this
mangoes:
1 2 3 4
5 6 7 8
8 9 0 7
7 6 8 0
apples:
1 2 3 4
8 9 0 9
Now, whenever we find a line starting with mangoes:, I want to be able to read the next 4 lines.
I was able to find out how to get the next immediate line, but not the next n immediate lines:
if line.startswith("mangoes:"):
    print(next(ifile))  # where ifile is the input file being iterated over
Just repeat what you did:
if line.startswith("mangoes:"):
    for i in range(n):
        print(next(ifile))
Unless it's a huge file and you don't want to read all the lines into memory at once, you could do something like this:
n = 4
with open(fn) as f:
    lines = f.readlines()
for idx, ln in enumerate(lines):
    if ln.startswith("mangoes"):
        break
mangoes = lines[idx:idx + n]
This gives you a list of n lines, including the mangoes line itself. If you used idx + 1 instead, you'd skip the title too.
With the itertools.islice feature:
from itertools import islice

with open('yourfile') as ifile:
    n = 4
    for line in ifile:
        if line.startswith('mangoes:'):
            mango_lines = list(islice(ifile, n))
From your input sample the resulting mango_lines list would be:
['1 2 3 4 \n', '5 6 7 8\n', '8 9 0 7\n', '7 6 8 0\n']
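If the file contains more than one mangoes: section, the same islice pattern collects every block in a single pass. A self-contained sketch, with io.StringIO standing in for the open file and blocks as a name introduced here:

```python
from itertools import islice
import io

n = 4
text = ("mangoes:\n"
        "1 2 3 4\n5 6 7 8\n8 9 0 7\n7 6 8 0\n"
        "apples:\n"
        "1 2 3 4\n8 9 0 9\n")

blocks = []
ifile = io.StringIO(text)            # stands in for an open file
for line in ifile:
    if line.startswith('mangoes:'):
        # islice advances the same iterator, so the outer loop
        # continues after the collected block
        blocks.append(list(islice(ifile, n)))
```

Because islice consumes from the same file iterator, the outer loop never re-reads the collected lines.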
I have the BIG data text file for example:
#01textline1
1 2 3 4 5 6
2 3 5 6 7 3
3 5 6 7 6 4
4 6 7 8 9 9
1 2 3 6 4 7
3 5 7 7 8 4
4 6 6 7 8 5
3 4 5 6 7 8
4 6 7 8 8 9
..
..
I want to extract the data between empty lines and write each section to a new file. It is hard to know how many empty lines are in the file, which means you also don't know how many new files you will be writing. Can anyone guide me? Thank you. I hope my question is clear.
Unless your file is very large, split it all into individual sections using re, splitting on two or more whitespace characters:
import re

with open("in.txt") as f:
    lines = re.split(r"\s{2,}", f.read())
print(lines)
['#01textline1\n1 2 3 4 5 6\n2 3 5 6 7 3\n3 5 6 7 6 4\n4 6 7 8 9 9', '1 2 3 6 4 7\n3 5 7 7 8 4\n4 6 6 7 8 5', '3 4 5 6 7 8\n4 6 7 8 8 9']
Just iterate over lines and write a new file on each iteration.
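That writing step can be sketched as follows; the two shortened sections, the tempfile output directory, and the out-N.txt naming are illustrative assumptions, not part of the original answer:

```python
import os
import tempfile

# the sections produced by re.split above (shortened here for the sketch)
sections = ['#01textline1\n1 2 3 4 5 6', '1 2 3 6 4 7\n3 5 7 7 8 4']

outdir = tempfile.mkdtemp()          # hypothetical output directory
for num, section in enumerate(sections):
    # one file per section: out-0.txt, out-1.txt, ...
    with open(os.path.join(outdir, "out-{}.txt".format(num)), "w") as out:
        out.write(section + "\n")
```

Enumerating the sections sidesteps the "you don't know how many files" worry: the count simply falls out of the loop.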
Reading files is not data-mining. Please choose more appropriate tags...
Splitting a file on empty lines is trivial:
num = 0
out = open("file-0", "w")
for line in open("file"):
    if line == "\n":
        num = num + 1
        out.close()
        out = open("file-" + str(num), "w")
        continue
    out.write(line)
out.close()
As this approach is reading just one line at a time, file size does not matter. It should process data as fast as your disk can handle it, with near-constant memory usage.
Perl would have had a neat trick, because you can set the input record separator to two newlines via $/ = "\n\n"; and then process the data one record at a time as usual. I could not find something similar in Python, but the hack of splitting on empty lines is not bad either.
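The nearest Python equivalent of that Perl record-separator trick is to read the whole file and split on the blank-line separator; unlike the line-by-line loop above, this is a memory-bound sketch, and the inline text string merely stands in for the file's contents:

```python
# Emulating Perl's $/ = "\n\n" in Python: read everything, then split on
# blank lines. Loads the whole file into memory, so only for modest files.
text = "#01textline1\n1 2 3 4 5 6\n\n1 2 3 6 4 7\n3 5 7 7 8 4\n"  # stands in for open("file").read()
records = [rec for rec in text.split("\n\n") if rec.strip()]
```

Each element of records is then one blank-line-delimited block, ready to be written out or parsed further.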
Here is a start:
with open('in_file') as input_file:
    processing = False
    i = 0
    for line in input_file:
        if line.strip() and not processing:
            out_file = open('output - {}'.format(i), 'w')
            out_file.write(line)
            processing = True
            i += 1
        elif line.strip():
            out_file.write(line)
        else:
            processing = False
            out_file.close()
This code keeps track of whether a file is currently being written to, with the processing flag. It resets the flag when it sees a blank line, and opens a new output file at the next non-blank line.
Hope it helps.
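An alternative to the flag-based loop is itertools.groupby, grouping consecutive lines by whether they are blank and keeping only the data runs. A sketch on an in-memory list of lines (the sample lines list stands in for the file):

```python
from itertools import groupby

# stands in for the file's lines
lines = ["#01textline1\n", "1 2 3 4 5 6\n", "\n",
         "1 2 3 6 4 7\n", "3 5 7 7 8 4\n"]

# group consecutive lines by blank / non-blank; keep only the data runs
sections = [list(group)
            for is_data, group in groupby(lines, key=lambda ln: bool(ln.strip()))
            if is_data]
```

Each element of sections is then one block of lines, and runs of several consecutive blank lines collapse into a single separator for free.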
I have the BIG data text file for example:
#01textline1
1 2 3 4 5 6
2 3 5 6 7 3
3 5 6 7 6 4
4 6 7 8 9 9
1 2 3 6 4 7
3 5 7 7 8 4
4 6 6 7 8 5
3 4 5 6 7 8
4 6 7 8 8 9
..
..
You do not need a loop to accomplish your purpose. Just use the list's index method to get the indices of the two marker lines and take all the lines between them.
Note that I changed your file.readlines() to strip trailing newlines.
(Using file.read().splitlines() can fail, if read() ends in the middle of a line of data.)
file1 = open("data.txt", "r")
file2 = open("newdata.txt", "w")
lines = [line.rstrip() for line in file1.readlines()]
firstIndex = lines.index("#02textline2")
secondIndex = lines.index("#03textline3")
print(firstIndex, secondIndex)
file2.write("\n".join(lines[firstIndex + 1 : secondIndex]))
file1.close()
file2.close()
There is a line-return character at the end of every line, so this:
if line == "#03textline3":
will never be true, as the line is actually "#03textline3\n". Why didn't you use the same syntax as the one you used for "#02textline2"? It would have worked:
if "#03textline3" in line:  # or: line == "#03textline3\n"
    break
Besides, you have to correct the indentation of the always_print = True line.
Here's what I would suggest doing:
firstKey = "#02textline2"
secondKey = "#03textline3"

with open("data.txt", "r") as fread:
    for line in fread:
        if line.rstrip() == firstKey:
            break
    with open("newdata.txt", "w") as fwrite:
        for line in fread:
            if line.rstrip() == secondKey:
                break
            else:
                fwrite.write(line)
This approach takes advantage of the fact that Python treats files as iterators. The first for loop iterates through the file iterator fread until the first key is found. The loop breaks, but the iterator stays at the current position. When it gets picked back up, the second loop starts where the first left off. We then write exactly the lines you want to a new file and discard the rest.
Advantages:
This does not load the entire file into memory; only the lines between firstKey and secondKey are stored, and only the lines up to secondKey are ever read by the script.
No lines are looked at or processed more than once.
The with context manager is a safer way to consume files.
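That pick-up-where-you-left-off behavior of the second loop can be seen in a tiny self-contained sketch, with io.StringIO standing in for an open file and the FIRST/SECOND marker lines as hypothetical stand-ins for the keys:

```python
import io

# io.StringIO behaves like an open text file for iteration purposes
f = io.StringIO("a\nFIRST\nb\nc\nSECOND\nd\n")

for line in f:
    if line.rstrip() == "FIRST":
        break

between = []
for line in f:                      # resumes right after the FIRST line
    if line.rstrip() == "SECOND":
        break
    between.append(line.rstrip())
```

After the first loop breaks, the iterator is parked just past the marker, so the second loop sees only "b" and "c" before hitting the second marker.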