First time posting here so please be patient!
I have a file that looks like this:
POS {ALLELE:COUNT}
1 G:27 A:11
2 C:40 T:0
3 C:40 A:0
4 T:40 G:0
5 G:0 C:40
6 C:40 T:0
7 G:24 A:14
8 G:40 A:0
9 A:40 G:0
...
I want to combine the information from the second and third columns of each line into the format "number[A],number[C],number[G],number[T]", so that the example above would look like this:
POS {ALLELE:COUNT}
1 11,0,27,0
2 0,40,0,0
3 0,40,0,0
4 0,0,0,40
5 0,40,0,0
6 0,40,0,0
7 14,0,24,0
8 0,0,40,0
9 40,0,0,0
...
Any idea on how I could do that would be much appreciated!
Here's a method that works:
lines = open('test.txt', 'r').read().splitlines()
place = {'A': 0, 'C': 1, 'G': 2, 'T': 3}
counts = [[0 for _ in range(4)] for _ in range(len(lines[1:]))]
for i, row in enumerate(lines[1:]):
    for ct in row.split()[1:]:
        a, b = ct.split(':')
        counts[i][place[a]] = int(b)
out_str = '\n'.join([lines[0]] + ['{:<4}{},{},{},{}'.format(i + 1, *ct)
                                  for i, ct in enumerate(counts)])
with open('output.txt', 'w') as f:
    f.write(out_str)
The resulting file reads
POS {ALLELE:COUNT}
1 11,0,27,0
2 0,40,0,0
3 0,40,0,0
4 0,0,0,40
5 0,40,0,0
6 0,40,0,0
7 14,0,24,0
8 0,0,40,0
9 40,0,0,0
I assume that your file is a regular text file (not a CSV or otherwise delimited file) and that G:27 A:11 is a line of this text file.
For each line you can do as follows (taking the first line as an example):
remove the useless spaces using strip():
'G:27 A:11'.strip() gives 'G:27 A:11'; then split on whitespace to obtain ['G:27', 'A:11']. Then for each element of this list, split on ':' to get the allele type and its count. Altogether it would look like:
resulting_table = []
for line in file:  # you can easily find how to read a file line by line
    split_line = line.strip().split()
    A, T, G, C = 0, 0, 0, 0
    for pair in split_line[1:]:  # skip the position column
        element = pair.split(':')
        if element[0] == 'A':
            A = int(element[1])
        elif element[0] == 'T':
            T = int(element[1])
        elif element[0] == 'G':
            G = int(element[1])
        elif element[0] == 'C':
            C = int(element[1])
    resulting_table.append([A, C, G, T])  # A,C,G,T order, as in the desired output
And here you go! You can then easily transform it into a DataFrame or a NumPy array.
This is absolutely not the most efficient nor the most elegant way to get your desired output, but it is clear and understandable for a Python beginner.
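As a follow-up to the "transform it easily" remark, here is a minimal sketch of that conversion, assuming NumPy is installed; the short `resulting_table` sample and the `col_of` / `g_counts` names are mine, for illustration only:

```python
import numpy as np

# resulting_table as produced by a loop like the one above; assumes the
# counts were converted to int and stored in A, C, G, T order
resulting_table = [[11, 0, 27, 0], [0, 40, 0, 0]]

arr = np.array(resulting_table)      # shape: (number of positions, 4)
col_of = {'A': 0, 'C': 1, 'G': 2, 'T': 3}
g_counts = arr[:, col_of['G']]       # counts for allele G at every position
```

Once the counts are in an array, per-allele columns can be pulled out with a single slice instead of another loop.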
source.txt
POS {ALLELE:COUNT}
1 G:27 A:11
2 C:40 T:0
3 C:40 A:0
4 T:40 G:0
5 G:0 C:40
6 C:40 T:0
7 G:24 A:14
8 G:40 A:0
9 A:40 G:0
import re

template = ('A', 'C', 'G', 'T')

def proc_line(line: str):
    index, *elements = re.findall(r'\S+', line)
    data = dict([*map(lambda x: x.split(':'), elements)])
    return f'{index}\t' + ','.join([data.get(item, '0') for item in template]) + '\n'

with open('source.txt', encoding='utf-8') as file:
    header, *lines = file.readlines()

with open('output.txt', 'w', encoding='utf-8') as new_file:
    new_file.writelines(
        [header] + list(map(proc_line, lines))
    )
output.txt
POS {ALLELE:COUNT}
1 11,0,27,0
2 0,40,0,0
3 0,40,0,0
4 0,0,0,40
5 0,40,0,0
6 0,40,0,0
7 14,0,24,0
8 0,0,40,0
9 40,0,0,0
$ awk -F'[ :]+' '
NR>1 {
    delete f
    f[$2] = $3
    f[$4] = $5
    $0 = sprintf("%s %d,%d,%d,%d", $1, f["A"], f["C"], f["G"], f["T"])
}
1' file
POS {ALLELE:COUNT}
1 11,0,27,0
2 0,40,0,0
3 0,40,0,0
4 0,0,0,40
5 0,40,0,0
6 0,40,0,0
7 14,0,24,0
8 0,0,40,0
9 40,0,0,0
While reading a file in python, I was wondering how to get the next n lines when we encounter a line that meets my condition.
Say there is a file like this
mangoes:
1 2 3 4
5 6 7 8
8 9 0 7
7 6 8 0
apples:
1 2 3 4
8 9 0 9
Now, whenever we find a line starting with mangoes:, I want to be able to read the next 4 lines.
I was able to find out how to get the next immediate line, but not the next n immediate lines:
if line.startswith("mangoes:"):
    print(next(ifile))  # where ifile is the input file being iterated over
Just repeat what you did:
if line.startswith("mangoes:"):
    for i in range(n):
        print(next(ifile))
Unless it's a huge file and you don't want to read all the lines into memory at once, you could do something like this:
n = 4
with open(fn) as f:
    lines = f.readlines()
for idx, ln in enumerate(lines):
    if ln.startswith("mangoes"):
        break
mangoes = lines[idx:idx + n]
This gives you a list of n lines, including the mangoes line itself. If you used idx + 1 instead, you'd skip the title too.
With the itertools.islice feature:
from itertools import islice

with open('yourfile') as ifile:
    n = 4
    for line in ifile:
        if line.startswith('mangoes:'):
            mango_lines = list(islice(ifile, n))
From your input sample the resulting mango_lines list would be:
['1 2 3 4 \n', '5 6 7 8\n', '8 9 0 7\n', '7 6 8 0\n']
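If the file contains more than one mangoes: section, the same islice pattern collects every block in a single pass. A self-contained sketch, with io.StringIO standing in for the open file and blocks as a name introduced here:

```python
from itertools import islice
import io

n = 4
text = ("mangoes:\n"
        "1 2 3 4\n5 6 7 8\n8 9 0 7\n7 6 8 0\n"
        "apples:\n"
        "1 2 3 4\n8 9 0 9\n")

blocks = []
ifile = io.StringIO(text)            # stands in for an open file
for line in ifile:
    if line.startswith('mangoes:'):
        # islice advances the same iterator, so the outer loop
        # continues after the collected block
        blocks.append(list(islice(ifile, n)))
```

Because islice consumes from the same file iterator, the outer loop never re-reads the collected lines.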
I have the BIG data text file for example:
#01textline1
1 2 3 4 5 6
2 3 5 6 7 3
3 5 6 7 6 4
4 6 7 8 9 9
1 2 3 6 4 7
3 5 7 7 8 4
4 6 6 7 8 5
3 4 5 6 7 8
4 6 7 8 8 9
..
..
I want to extract the data between empty lines and write each section to a new file. It is hard to know how many empty lines are in the file, which means you also don't know how many new files you will be writing. Can anyone guide me? Thank you. I hope my question is clear.
Unless your file is very large, split it all into individual sections using re, splitting on two or more whitespace characters:
import re

with open("in.txt") as f:
    lines = re.split(r"\s{2,}", f.read())
print(lines)
['#01textline1\n1 2 3 4 5 6\n2 3 5 6 7 3\n3 5 6 7 6 4\n4 6 7 8 9 9', '1 2 3 6 4 7\n3 5 7 7 8 4\n4 6 6 7 8 5', '3 4 5 6 7 8\n4 6 7 8 8 9']
Just iterate over lines and write a new file on each iteration.
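That writing step can be sketched as follows; the two shortened sections, the tempfile output directory, and the out-N.txt naming are illustrative assumptions, not part of the original answer:

```python
import os
import tempfile

# the sections produced by re.split above (shortened here for the sketch)
sections = ['#01textline1\n1 2 3 4 5 6', '1 2 3 6 4 7\n3 5 7 7 8 4']

outdir = tempfile.mkdtemp()          # hypothetical output directory
for num, section in enumerate(sections):
    # one file per section: out-0.txt, out-1.txt, ...
    with open(os.path.join(outdir, "out-{}.txt".format(num)), "w") as out:
        out.write(section + "\n")
```

Enumerating the sections sidesteps the "you don't know how many files" worry: the count simply falls out of the loop.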
Reading files is not data-mining. Please choose more appropriate tags...
Splitting a file on empty lines is trivial:
num = 0
out = open("file-0", "w")
for line in open("file"):
    if line == "\n":
        num = num + 1
        out.close()
        out = open("file-" + str(num), "w")
        continue
    out.write(line)
out.close()
As this approach is reading just one line at a time, file size does not matter. It should process data as fast as your disk can handle it, with near-constant memory usage.
Perl would have had a neat trick, because you can set the input record separator to two newlines via $/ = "\n\n"; and then process the data one record at a time as usual. I could not find something similar in Python, but the hack of splitting on empty lines is not bad either.
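The nearest Python equivalent of that Perl record-separator trick is to read the whole file and split on the blank-line separator; unlike the line-by-line loop above, this is a memory-bound sketch, and the inline text string merely stands in for the file's contents:

```python
# Emulating Perl's $/ = "\n\n" in Python: read everything, then split on
# blank lines. Loads the whole file into memory, so only for modest files.
text = "#01textline1\n1 2 3 4 5 6\n\n1 2 3 6 4 7\n3 5 7 7 8 4\n"  # stands in for open("file").read()
records = [rec for rec in text.split("\n\n") if rec.strip()]
```

Each element of records is then one blank-line-delimited block, ready to be written out or parsed further.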
Here is a start:
with open('in_file') as input_file:
    processing = False
    i = 0
    for line in input_file:
        if line.strip() and not processing:
            out_file = open('output - {}'.format(i), 'w')
            out_file.write(line)
            processing = True
            i += 1
        elif line.strip():
            out_file.write(line)
        else:
            processing = False
            out_file.close()
This code keeps track of whether a file is currently being written to, with the processing flag. It resets the flag when it sees a blank line, and opens a new output file at the next non-blank line.
Hope it helps.
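An alternative to the flag-based loop is itertools.groupby, grouping consecutive lines by whether they are blank and keeping only the data runs. A sketch on an in-memory list of lines (the sample lines list stands in for the file):

```python
from itertools import groupby

# stands in for the file's lines
lines = ["#01textline1\n", "1 2 3 4 5 6\n", "\n",
         "1 2 3 6 4 7\n", "3 5 7 7 8 4\n"]

# group consecutive lines by blank / non-blank; keep only the data runs
sections = [list(group)
            for is_data, group in groupby(lines, key=lambda ln: bool(ln.strip()))
            if is_data]
```

Each element of sections is then one block of lines, and runs of several consecutive blank lines collapse into a single separator for free.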
I have the BIG data text file for example:
#01textline1
1 2 3 4 5 6
2 3 5 6 7 3
3 5 6 7 6 4
4 6 7 8 9 9
1 2 3 6 4 7
3 5 7 7 8 4
4 6 6 7 8 5
3 4 5 6 7 8
4 6 7 8 8 9
..
..
You do not need a loop to accomplish your purpose. Just use the list's index method to get the indices of the two marker lines and take all the lines between them.
Note that I changed your file.readlines() to strip trailing newlines.
(Using file.read().splitlines() can fail, if read() ends in the middle of a line of data.)
file1 = open("data.txt", "r")
file2 = open("newdata.txt", "w")
lines = [line.rstrip() for line in file1.readlines()]
firstIndex = lines.index("#02textline2")
secondIndex = lines.index("#03textline3")
print(firstIndex, secondIndex)
file2.write("\n".join(lines[firstIndex + 1 : secondIndex]))
file1.close()
file2.close()
There is a line-return character at the end of every line, so this:
if line == "#03textline3":
will never be true, as the line is actually "#03textline3\n". Why didn't you use the same syntax as the one you used for "#02textline2"? It would have worked:
if "#03textline3" in line:  # or: line == "#03textline3\n"
    break
Besides, you have to correct the indentation of the always_print = True line.
Here's what I would suggest doing:
firstKey = "#02textline2"
secondKey = "#03textline3"

with open("data.txt", "r") as fread:
    for line in fread:
        if line.rstrip() == firstKey:
            break
    with open("newdata.txt", "w") as fwrite:
        for line in fread:
            if line.rstrip() == secondKey:
                break
            else:
                fwrite.write(line)
This approach takes advantage of the fact that Python treats files as iterators. The first for loop iterates through the file iterator fread until the first key is found. The loop breaks, but the iterator stays at the current position. When it gets picked back up, the second loop starts where the first left off. We then write exactly the lines you want to a new file and discard the rest.
Advantages:
This does not load the entire file into memory; only the lines between firstKey and secondKey are stored, and only the lines up to secondKey are ever read by the script.
No lines are looked at or processed more than once.
The with context manager is a safer way to consume files.
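That pick-up-where-you-left-off behavior of the second loop can be seen in a tiny self-contained sketch, with io.StringIO standing in for an open file and the FIRST/SECOND marker lines as hypothetical stand-ins for the keys:

```python
import io

# io.StringIO behaves like an open text file for iteration purposes
f = io.StringIO("a\nFIRST\nb\nc\nSECOND\nd\n")

for line in f:
    if line.rstrip() == "FIRST":
        break

between = []
for line in f:                      # resumes right after the FIRST line
    if line.rstrip() == "SECOND":
        break
    between.append(line.rstrip())
```

After the first loop breaks, the iterator is parked just past the marker, so the second loop sees only "b" and "c" before hitting the second marker.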