First time posting here so please be patient!
I have a file that looks like this:
POS {ALLELE:COUNT}
1 G:27 A:11
2 C:40 T:0
3 C:40 A:0
4 T:40 G:0
5 G:0 C:40
6 C:40 T:0
7 G:24 A:14
8 G:40 A:0
9 A:40 G:0
...
I want to combine the information from the second and third columns of each line in the following format: "number[A],number[C],number[G],number[T]", so that the example above would look like this:
POS {ALLELE:COUNT}
1 11,0,27,0
2 0,40,0,0
3 0,40,0,0
4 0,0,0,40
5 0,40,0,0
6 0,40,0,0
7 14,0,24,0
8 0,0,40,0
9 40,0,0,0
...
Any idea on how I could do that would be much appreciated!
Here's a method that works:
lines = open('test.txt','r').read().splitlines()
place = {'A':0,'C':1,'G':2,'T':3}
counts = [[0 for _ in range(4)] for _ in range(len(lines[1:]))]
for i,row in enumerate(lines[1:]):
    for ct in row.split()[1:]:
        a,b = ct.split(':')
        counts[i][place[a]] = int(b)
out_str = '\n'.join([lines[0]] + ['{:<4}{},{},{},{}'.format(i+1,*ct)
                                  for i,ct in enumerate(counts)])
with open('output.txt','w') as f:
    f.write(out_str)
The resulting file reads
POS {ALLELE:COUNT}
1 11,0,27,0
2 0,40,0,0
3 0,40,0,0
4 0,0,0,40
5 0,40,0,0
6 0,40,0,0
7 14,0,24,0
8 0,0,40,0
9 40,0,0,0
I assume that your file is a regular text file (not a CSV or other delimited file) and that G:27 A:11 is a line of this text file.
For each line you can proceed as follows (taking the first line as an example):
Remove the surrounding whitespace using strip:
' G:27 A:11 '.strip() gives 'G:27 A:11'; then split on whitespace to obtain ['G:27','A:11']. Then, for each element of this list, split on : to get the allele type and its count. Altogether it would look like this:
resulting_table=[]
for line in file: # You can easily find how to read a file line by line
    split_line=line.strip().split(' ')
    A,T,G,C=0,0,0,0
    for pair in split_line:
        element=pair.split(':')
        if element[0]=='A':
            A=element[1]
        elif element[0]=='T':
            ...
    resulting_table.append([A,C,G,T]) # A,C,G,T is the column order asked for in the question
And there you go! You can then easily transform it into a dataframe or a numpy array.
This is absolutely not the most efficient or elegant way to get your desired output, but it is clear and understandable for a Python beginner.
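The steps above can be put together into one small function. This is only a sketch: the names ORDER and format_counts are made up for illustration, and it assumes every line has the form "POS ALLELE:COUNT ALLELE:COUNT" with alleles drawn from A, C, G, T.

```python
ORDER = "ACGT"  # hypothetical constant: output column order A, C, G, T

def format_counts(line):
    """Turn a line such as '1 G:27 A:11' into '1 11,0,27,0'."""
    pos, *pairs = line.split()          # first field is the position
    counts = dict.fromkeys(ORDER, 0)    # alleles missing from the line stay 0
    for pair in pairs:
        allele, count = pair.split(":")
        counts[allele] = int(count)
    return pos + " " + ",".join(str(counts[a]) for a in ORDER)

print(format_counts("1 G:27 A:11"))
```

Keeping the counts in a dict avoids the chain of if/elif branches, at the cost of being a little less explicit for a beginner.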
source.txt
POS {ALLELE:COUNT}
1 G:27 A:11
2 C:40 T:0
3 C:40 A:0
4 T:40 G:0
5 G:0 C:40
6 C:40 T:0
7 G:24 A:14
8 G:40 A:0
9 A:40 G:0
import re
template = ('A', 'C', 'G', 'T')
def proc_line(line: str):
    index, *elements = re.findall(r'\S+', line)
    data = dict([*map(lambda x: x.split(':'), elements)])
    return f'{index}\t' + ','.join([data.get(item, '0') for item in template]) + '\n'
with open('source.txt', encoding='utf-8') as file:
    header, *lines = file.readlines()
with open('output.txt', 'w', encoding='utf-8') as new_file:
    new_file.writelines(
        [header] + list(map(proc_line, lines))
    )
output.txt
POS {ALLELE:COUNT}
1 11,0,27,0
2 0,40,0,0
3 0,40,0,0
4 0,0,0,40
5 0,40,0,0
6 0,40,0,0
7 14,0,24,0
8 0,0,40,0
9 40,0,0,0
$ awk -F'[ :]+' '
NR>1 {
    delete f
    f[$2] = $3
    f[$4] = $5
    $0 = sprintf("%s %d,%d,%d,%d", $1, f["A"], f["C"], f["G"], f["T"])
}
1' file
POS {ALLELE:COUNT}
1 11,0,27,0
2 0,40,0,0
3 0,40,0,0
4 0,0,0,40
5 0,40,0,0
6 0,40,0,0
7 14,0,24,0
8 0,0,40,0
9 40,0,0,0
While reading a file in python, I was wondering how to get the next n lines when we encounter a line that meets my condition.
Say there is a file like this
mangoes:
1 2 3 4
5 6 7 8
8 9 0 7
7 6 8 0
apples:
1 2 3 4
8 9 0 9
Now whenever we find a line starting with mangoes, I want to be able to read all the next 4 lines.
I was able to find out how to do the next immediate line but not next n immediate lines
if (line.startswith("mangoes:")):
    print(next(ifile)) #where ifile is the input file being iterated over
Just repeat what you did:
if (line.startswith("mangoes:")):
    for i in range(n):
        print(next(ifile))
Unless it's a huge file and you don't want to read all lines into memory at once, you could do something like this:
n = 4
with open(fn) as f:
    lines = f.readlines()
for idx, ln in enumerate(lines):
    if ln.startswith("mangoes"):
        break
mangoes = lines[idx:idx+n]
This gives you a list of n lines that starts with the mangoes header itself, so it holds only n-1 data lines. To skip the header and get the n lines after it, slice lines[idx+1:idx+1+n] instead.
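As a sketch of the header-skipping variant, the search and the slice can be wrapped in a helper. The name lines_after and the in-memory sample list are assumptions made for illustration; in practice the list would come from f.readlines().

```python
def lines_after(lines, prefix, n):
    """Return up to n lines following the first line that starts with prefix."""
    for idx, ln in enumerate(lines):
        if ln.startswith(prefix):
            return lines[idx + 1 : idx + 1 + n]  # skip the header line itself
    return []  # no line starts with prefix

# Stand-in for f.readlines() on the file from the question.
sample = ["mangoes:\n", "1 2 3 4\n", "5 6 7 8\n", "8 9 0 7\n", "7 6 8 0\n",
          "apples:\n", "1 2 3 4\n", "8 9 0 9\n"]
print(lines_after(sample, "mangoes", 4))
```

Returning an empty list when nothing matches avoids silently slicing from the last line, which is what happens when the loop ends without a break.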
With itertools.islice:
from itertools import islice
with open('yourfile') as ifile:
    n = 4
    for line in ifile:
        if line.startswith('mangoes:'):
            mango_lines = list(islice(ifile, n))
From your input sample the resulting mango_lines list would be:
['1 2 3 4 \n', '5 6 7 8\n', '8 9 0 7\n', '7 6 8 0\n']
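The islice approach works because islice pulls from the same iterator that the for loop is consuming, so the loop resumes after the lines already taken. A minimal sketch of that behaviour, using io.StringIO as a stand-in for the open file (the function name blocks_after is hypothetical):

```python
import io
from itertools import islice

def blocks_after(ifile, prefix, n):
    """For every line starting with prefix, collect the next n lines."""
    blocks = []
    for line in ifile:
        if line.startswith(prefix):
            # islice advances ifile itself, so the outer loop continues
            # with the line that follows the n lines taken here.
            blocks.append([ln.rstrip("\n") for ln in islice(ifile, n)])
    return blocks

text = ("mangoes:\n1 2 3 4\n5 6 7 8\n8 9 0 7\n7 6 8 0\n"
        "apples:\n1 2 3 4\n8 9 0 9\n")
print(blocks_after(io.StringIO(text), "mangoes:", 4))
```

If the header occurs several times, this collects one block per occurrence rather than keeping only the last one.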
I have two txt files.
The first one contains one number per line, like this:
22
15
32
53
.
.
and the other file contains 20 numbers per line, like this:
0.1 2.3 4.5 .... 5.4
3.2 77.4 2.1 .... 8.1
....
.
.
I want to split the second file according to the numbers in the first. For example, the first line of the first file is 22, which means I take all 20 columns of the first line of the second file plus the first two columns of its second line, and discard the remaining columns of that second line. The second line of the first file is 15, so I then take 15 columns from the third line of the second file and discard the rest of that line, and so on. How can I do this?
with open ('numbers.txt', 'r') as f:
    with open ('contiuousNumbers.txt', 'r') as f2:
        with open ('results.txt', 'w') as fOut:
            for line in f:
                ...
Thanks.
Treat the number on each line of the first file as a target total. Use a while loop that keeps calling next on the second file object to read a line of numbers, subtracting the count of numbers read from the total until it reaches 0 or below. Slice each line with the smaller of the remaining total and the line's length, so that you output exactly the requested number of numbers:
for line in f:
    output = []
    total = int(line)
    while total > 0:
        try:
            items = next(f2).split()
            output.extend(items[:min(total, len(items))])
            total -= len(items)
        except StopIteration:
            break
    fOut.write(' '.join(output) + '\n')
so that given the first file with:
3
6
1
5
and the second file with:
2 5
3 7
2 1
3 6
7 3
2 2
9 1
3 4
8 7
1 2
3 8
the output file will have:
2 5 3
2 1 3 6 7 3
2
9 1 3 4 8
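To check the example above without touching the disk, the same loop can be sketched as a function over any two line iterables. The name take_counts is made up for illustration, and io.StringIO stands in for the two open files.

```python
import io

def take_counts(counts, rows):
    """For each count, consume whole rows until that many numbers are collected."""
    rows = iter(rows)
    result = []
    for line in counts:
        total = int(line)
        picked = []
        while total > 0:
            row = next(rows, None)
            if row is None:             # second input is exhausted
                break
            items = row.split()
            picked.extend(items[:total])  # the rest of this row is discarded
            total -= len(items)
        result.append(" ".join(picked))
    return result

counts = io.StringIO("3\n6\n1\n5\n")
rows = io.StringIO("2 5\n3 7\n2 1\n3 6\n7 3\n2 2\n9 1\n3 4\n8 7\n1 2\n3 8\n")
for out_line in take_counts(counts, rows):
    print(out_line)
```

Slicing with items[:total] is equivalent to the min(total, len(items)) form, because a slice end past the end of the list is simply clipped.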
I have a big data text file, for example:
#01textline1
1 2 3 4 5 6
2 3 5 6 7 3
3 5 6 7 6 4
4 6 7 8 9 9

1 2 3 6 4 7
3 5 7 7 8 4
4 6 6 7 8 5

3 4 5 6 7 8
4 6 7 8 8 9
..
..
I want to extract the data between empty lines and write each block to a new file. I don't know in advance how many empty lines the file contains, so I also don't know how many new files will be written, which is what makes this hard. Can anyone guide me? Thank you. I hope my question is clear.
Unless your file is very large, split it into individual sections using re, splitting on 2 or more whitespace characters:
import re
with open("in.txt") as f:
    lines = re.split(r"\s{2,}",f.read())
print(lines)
['#01textline1\n1 2 3 4 5 6\n2 3 5 6 7 3\n3 5 6 7 6 4\n4 6 7 8 9 9', '1 2 3 6 4 7\n3 5 7 7 8 4\n4 6 6 7 8 5', '3 4 5 6 7 8\n4 6 7 8 8 9']
Just iterate over lines and write a new file on each iteration.
Reading files is not data-mining. Please choose more appropriate tags...
Splitting a file on empty lines is trivial:
num = 0
out = open("file-0", "w")
for line in open("file"):
    if line == "\n":
        num = num + 1
        out.close()
        out = open("file-" + str(num), "w")  # num is an int, so convert it before concatenating
        continue
    out.write(line)
out.close()
As this approach is reading just one line at a time, file size does not matter. It should process data as fast as your disk can handle it, with near-constant memory usage.
Perl would have had a neat trick, because you can set the input record separator to two newlines via $/="\n\n"; and then process the data one record at a time as usual... I could not find something similar in python; but the hack with "split on empty lines" is not bad either.
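One way to get a similar record-at-a-time loop in Python is itertools.groupby, grouping consecutive lines by whether they are blank. This is a sketch, not a drop-in replacement for the code above: the name records is hypothetical and io.StringIO stands in for the input file.

```python
import io
from itertools import groupby

def records(lines):
    """Yield each run of consecutive non-blank lines as one list."""
    for has_text, group in groupby(lines, key=lambda ln: bool(ln.strip())):
        if has_text:            # runs of blank lines are skipped entirely
            yield list(group)

text = "#01textline1\n1 2 3 4 5 6\n\n1 2 3 6 4 7\n3 5 7 7 8 4\n\n\n3 4 5 6 7 8\n"
for num, record in enumerate(records(io.StringIO(text))):
    # In the real task, write record to a file named after num here.
    print(num, record)
```

Because groupby reads lazily, only one record is held in memory at a time, and several blank lines in a row do not produce empty records.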
Here is a start:
with open('in_file') as input_file:
    processing = False
    i = 0
    for line in input_file:
        if line.strip() and not processing:
            out_file = open('output - {}'.format(i), 'w')
            out_file.write(line)
            processing = True
            i += 1
        elif line.strip():
            out_file.write(line)
        elif processing:  # first blank line after a block: finish the current file
            processing = False
            out_file.close()
    if processing:  # the input did not end with a blank line
        out_file.close()
This code uses the processing flag to keep track of whether a file is currently being written to. It resets the flag and closes the current file when it sees a blank line, and opens a new file at the next non-blank line.
Hope it helps.
I have 2 files of the following form:
file1:
work1
7 8 9 10 11
1 2 3 4 5
6 7 8 9 10
file2:
work2
2 3 4 5 5
2 4 7 8 9
work1
7 8 9 10 11
1 2 4 4 5
6 7 8 9 10
work3
1 7 8 9 10
Now I want to compare the two files and, wherever a header (say work1) appears in both, compare the sections that follow and print the line at which a difference is found. E.g.
work1 (file1)
7 8 9 10 11
1 2 3 4 5
6 7 8 9 10
work1 (file2)
7 8 9 10 11
1 2 4 4 5
6 7 8 9 10
Now I want to print the line where difference occurs i.e. "1 2 4 4 5"
For doing so I have written the following code:
with open("file1",) as r, open("file2") as w:
    for line in r:
        if "work1" in line:
            for line1 in w:
                if "work1" in line1:
                    print("work1")
However, from here on I am confused: once the "work1" headers have been matched, how can I read both files in parallel? Can someone please help me with this?
You would probably want to try out the itertools module in Python.
It contains a function called izip (Python 2; on Python 3 the built-in zip behaves the same way) that can do what you need, along with a function called islice. You can iterate through the second file until you hit the header you were looking for, and then slice the file starting from that header.
Here's a bit of the code.
from itertools import islice
w = open('file2')
for (i,line) in enumerate(w):
    if "work1" in line:
        iter2 = islice(open('file2'), i, None, 1) # Starts at the correct line
        break # stop at the first matching header
f = open('file1')
for (line1,line2) in zip(f,iter2): # on Python 2, use itertools.izip here
    print(line1, line2) # Place your comparisons of the two lines here.
As long as file1 starts with the "work1" header, as in the example, the first run through the loop gives you "work1" on both lines. After that you can compare. Since f is shorter than w, the iterator will exhaust itself and stop once you hit the end of f.
Hopefully I explained that well.
EDIT: Added import statement.
EDIT: We need to reopen file2. This is because iterating through iterables in Python consumes the iterable. So, we need to pass a brand new one to islice so it works!
with open('f1.csv') as f1, open('f2.csv') as f2 :
    i=0
    break_needed = False
    while True :
        r1, r2 = f1.readline(), f2.readline()
        if len(r1) == 0 :
            print("eof found for f1")
            break_needed = True
        if len(r2) == 0 :
            print("eof found for f2")
            break_needed = True
        if break_needed :
            break
        i += 1
        if r1 != r2 :
            print(" line %i"%i)
            print("file 1 : " + r1)
            print("file 2 : " + r2)
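For the original question, where the matching sections sit at different positions in the two files, another option is to read each file into a dict keyed by header and compare the sections that share a header. The sketch below rests on an assumption that is not stated in the question: data lines start with a digit and every other non-blank line is a header. The names read_sections and diff_sections are hypothetical, and io.StringIO stands in for the two files.

```python
import io

def read_sections(lines):
    """Map each header line to the list of data lines that follow it."""
    sections, current = {}, None
    for line in lines:
        line = line.strip()
        if not line:
            continue
        if line[0].isdigit():           # assumption: data lines start with a digit
            if current is not None:
                sections[current].append(line)
        else:                           # anything else is treated as a header
            current = line
            sections[current] = []
    return sections

def diff_sections(a, b):
    """List (header, line_no, line_a, line_b) for differing lines in shared sections."""
    diffs = []
    for header in a:
        if header in b:
            pairs = zip(a[header], b[header])   # stops at the shorter section
            for n, (la, lb) in enumerate(pairs, start=1):
                if la != lb:
                    diffs.append((header, n, la, lb))
    return diffs

file1 = io.StringIO("work1\n7 8 9 10 11\n1 2 3 4 5\n6 7 8 9 10\n")
file2 = io.StringIO("work2\n2 3 4 5 5\n2 4 7 8 9\nwork1\n7 8 9 10 11\n"
                    "1 2 4 4 5\n6 7 8 9 10\nwork3\n1 7 8 9 10\n")
print(diff_sections(read_sections(file1), read_sections(file2)))
```

This holds both files in memory and ignores any extra lines when one section is longer than the other, so it suits files of moderate size with sections of equal length.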