Complex parsing query - python

I have a very complex parsing problem. Any thoughts would be appreciated here. I have a test.dat file. The file to be parsed looks like this:
* Number = 40
Time = 0
1 10.13 10 10.11 12 13
.
.
Time = n
1 10 10 10 12.50 13
.
.
There are N time blocks and each block has 40 lines, as shown above. What I would like to do is add, e.g., the 1st line of the first block, then the 1st line of block #2, and so on, to a new file, test_1.dat. Similarly, the 2nd line of every block goes to test_2.dat, and so on. The lines in the block should be written as is to the new test_n.dat file. Is there any way to do this? The number I have assumed here is 40, so if * Number = 40 there will be 40 lines under each time block.
regards,
Ris

You can read the file in as a list of strings (call it fileList), where each string is a different line:
f = open('filename')
fileList = f.readlines()
f.close()
Then, remove the "header" part of your file (the * Number line and the first Time = line) with
fileList.pop(0)
fileList.pop(0)
Then, do
blockStride = 42 # 40 data rows plus the Time = row and a blank row (assumes a blank row between blocks)
numBlocks = (len(fileList) + 2) // blockStride # the first block lost its 2 header rows above
outFileContents = {} # This will be a dict, where number -> content of test_number.dat
for outFileName in range(1,41): # outFileName will be the number going after the _ in your filename
    outFileContents[outFileName] = []
    for n in range(numBlocks): # Counting through the time blocks
        currentRowIndex = (blockStride * n) + (outFileName - 1) # list indices start at 0
        outFileContents[outFileName].append(fileList[currentRowIndex])
Finally you can loop through outFileContents and write the contents of each value to separate files.
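If the blank row or the block length is not guaranteed, one variation is to key off the Time = lines themselves instead of a fixed stride. This is only a sketch, assuming every block starts with a line beginning with Time and the header line starts with *; the names split_by_line_number and write_groups are made up for illustration:

```python
from collections import defaultdict


def split_by_line_number(lines):
    """Group data lines by their position inside each 'Time =' block."""
    grouped = defaultdict(list)
    position = 0
    for line in lines:
        stripped = line.strip()
        if not stripped or stripped.startswith("*"):
            continue  # skip blank lines and the '* Number = 40' header
        if stripped.startswith("Time"):
            position = 0  # a new block starts, so restart the line counter
            continue
        position += 1
        grouped[position].append(line)  # keep the line exactly as read
    return grouped


def write_groups(grouped, prefix="test"):
    """Write group k to '<prefix>_k.dat' (hypothetical naming, as in the question)."""
    for position, rows in grouped.items():
        with open(f"{prefix}_{position}.dat", "w") as out:
            out.writelines(rows)
```

Usage would be something like write_groups(split_by_line_number(open('test.dat'))). Because the counter resets on each Time line, this does not need to know N or the 40 in advance.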

Related

I want to use the sum function to count multiple occurrences of specific characters, but my script only works for one character

This script is supposed to calculate the total weight of a protein, so I decided to count the occurrences of certain characters in a script. However, only the first equation produces a result which causes the total weight to be the same value (everything under the first one equals zero, which is definitely incorrect). How do I get my script to pay attention to the other lines???
This is a shortened version:
akt3_file = open('AKT3 fasta.txt', 'r') #open up the fasta file
for line in akt3_file:
    ala =(sum(line.count('A') for line in akt3_file)*89*1000) #this value is 1780000
    arg =(sum(line.count('R') for line in akt3_file)*174*1000)
    asn =(sum(line.count('N') for line in akt3_file)*132*1000)
    asp =(sum(line.count('D') for line in akt3_file)*133*1000)
    asx =(sum(line.count('B') for line in akt3_file)*133*1000)
protein_weight = ala+arg+asn+asp+asx
print(protein_weight) # the problem is that this value is also 1780000
akt3_file.close() #close the fasta file
The issue you have is that you're trying to iterate over your file's lines several times. While that's actually possible (unlike most iterators, file objects can be rewound with seek), you're not doing it properly, so all the iterations except for the first don't see any data.
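To see the effect in isolation, here is a small sketch that uses an in-memory io.StringIO object as a stand-in for the real file (the sequence text is made up). The second pass sees nothing until the object is rewound with seek(0):

```python
import io

# Stand-in for an open file; the two lines of sequence are invented.
handle = io.StringIO("AAR\nND\n")

first = sum(line.count("A") for line in handle)   # consumes every line
second = sum(line.count("R") for line in handle)  # nothing left to read
handle.seek(0)                                    # rewind to the start
third = sum(line.count("R") for line in handle)   # now the lines are seen again

print(first, second, third)
```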
In this case, you probably don't need to iterate over the lines at all. Just read the full text of the file into a string, and count the characters you want out of that string:
with open('AKT3 fasta.txt', 'r') as akt_3file: # A with is not necessary, but a good idea.
    data = akt_3file.read() # Read the whole file into the data string.
ala = data.count('A') * 89 * 1000 # Now we can count all the occurrences in all lines at
arg = data.count('R') * 174 * 1000 # once, and there's no issue iterating the file, since
asn = data.count('N') * 132 * 1000 # we're not dealing with the file any more, just the
asp = data.count('D') * 133 * 1000 # immutable data string.
asx = data.count('B') * 133 * 1000
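One caveat: FASTA files normally begin with a description line starting with >, and its letters would be counted too. A possible variation, sketched here with collections.Counter, skips those lines. The weights are the ones from the question; the function name protein_weight and the scale parameter are assumptions for illustration:

```python
from collections import Counter

# Per-residue weights copied from the question (only the five shown there).
RESIDUE_WEIGHTS = {"A": 89, "R": 174, "N": 132, "D": 133, "B": 133}


def protein_weight(fasta_lines, weights=RESIDUE_WEIGHTS, scale=1000):
    """Sum weight * count over all residues, ignoring '>' description lines."""
    counts = Counter()
    for line in fasta_lines:
        if line.startswith(">"):
            continue  # description line, not sequence data
        counts.update(line.strip())  # count every character on the line
    return sum(counts[residue] * weight * scale
               for residue, weight in weights.items())
```

It could be called as protein_weight(open('AKT3 fasta.txt')), since the file is only iterated once.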

How to "copy-paste" blocks of text from one file to another using python?

I am running some molecular simulations, and I have a file with coordinates (a .xyz file, which is basically a file with tabbed columns) that I need to send to another file, which will be the input file for my simulation.
To give you a picture, this is what my input files look like with my coordinates (there is more stuff at the bottom that remains untouched):
inputfile.py
# One-electron Properties
# Methacrylic acid (MA0)
# Neutral
# 86.09 g/mol
memory 8 GB
molecule MA0 {
0 1
C 2.87618 -0.84254 0.16797
C 2.96148 0.13611 1.08491
C 2.43047 -0.01082 2.47698
C 3.62022 1.40750 0.67356
O 3.45819 2.47668 1.24567
}
.
.
.
I have generated some coordinates which are in another file. That file looks like:
conformer_coords.xyz
15
conformer index = 0001, molecular weight is = 100.052429496, MMA.pdb
O 2.98687 0.35207 1.05259
C 2.40900 0.04400 0.02100
O 1.13058 0.37171 -0.29283
C 0.85476 1.77012 -0.33847
.
.
.
What I want to do is replace the coordinates in inputfile.py with the coordinates in conformer_coords.xyz. The number of coordinate positions in my conformer is known. Let's call it N for now. So, conformer_coords.xyz has N+2 lines.
I basically want to take coordinates from conformer_coords.xyz and place them between the
{
0 1
and } (yes, the 0 1 are needed there).
How should I go about this? Can python pull it off? I am using subprocess anyway, so if awk or bash can do it, I would be really grateful if someone could point me in the right direction!!
import re
def insert_data(conformer_filepath,input_filepath,output_filepath):
    #grab conformer data
    with open(conformer_filepath,'r') as f:
        conformer_text = f.read()
    conformer_data = re.search('conformer[^\n]+\n(.+)',conformer_text,re.M|re.S).group(1)
    #this looks for the line that has conformer in it and takes all lines after it
    #grab input file before and after text
    with open(input_filepath,'r') as f:
        input_text = f.read()
    input_pre,input_post = re.search('(^.+\n0 1\n).+?(\n}.*)$',input_text,re.M|re.S).groups()
    #this looks for the "0 1" line and takes everything before that. Then it skips down to the next curly bracket which is at the start of a line and takes that and everything past it.
    #write them to the output file
    with open(output_filepath,'w') as f:
        f.write(input_pre + conformer_data + input_post)
    #this writes the three pieces collected to the output file
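The same splice can also be sketched without regular expressions by working on lists of lines. This assumes the 0 1 line and the closing } each sit alone on their own line, and that the first two lines of the .xyz file are the atom count and the comment; splice_coordinates is a made-up name:

```python
def splice_coordinates(input_lines, conformer_lines):
    """Return input_lines with the rows between '0 1' and '}' replaced."""
    coords = conformer_lines[2:]  # drop the atom-count and comment lines
    # Index of the '0 1' line that opens the coordinate block.
    start = next(i for i, line in enumerate(input_lines)
                 if line.strip() == "0 1")
    # Index of the first '}' line after it, which closes the block.
    end = next(i for i in range(start + 1, len(input_lines))
               if input_lines[i].strip() == "}")
    return input_lines[:start + 1] + coords + input_lines[end:]
```

The result can be written out with writelines; everything before 0 1 and from } onward is kept untouched.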

Looking for first line of data with python

I have a data file that looks like this, and the file type is a list.
############################################################
# Tool
# File: test
#
# mass: mass in GeV
# spectrum: from 1 to 100 GeV
###########################################################
# mass (GeV) spectrum (1-100 GeV)
10 0.2822771608053263
20 0.8697454394829301
30 1.430461657476815
40 1.9349004472432392
50 2.3876849629827412
60 2.796620869276766
70 3.1726347734996727
80 3.5235401505002244
90 3.8513847250834106
100 4.157478780924807
For me to read the data I would normally have to count how many lines come before the first set of numbers and then loop through the file with a for loop. In this file it's 8 lines.
spectrum=[]
mass=[]
with open ('test.in') as m:
    test=m.readlines()
for i in range(8,len(test)):
    single_line=test[i].split('\t')
    mass.append(float(single_line[0]))
    spectrum.append(float(single_line[1]))
Let's say I didn't want to open the file to check how many lines there are from the intro statement to the first line of data points. How would I make python automatically start at the first line of data points, and go through the end of the file?
This is a general solution, but it should work in your specific case.
You could check, for each line, whether it starts with a number.
Pseudo-code:
for line in test:
    fields = line.split()
    if fields and fields[0].isdigit(): # skips blank lines and comment lines
        DoStuffWithData
spectrum=[]
mass=[]
with open ('test.in') as m:
    test=m.readlines()
for line in test:
    if line[0] == '#':
        continue
    single_line=line.split('\t')
    mass.append(float(single_line[0]))
    spectrum.append(float(single_line[1]))
You can filter out all lines that start with # using a regex or the startswith method of strings:
import re
spectrum=[]
mass=[]
with open ('test.in') as m:
    test= [i for i in m.readlines() if not re.match("^#.*", i)]
for i in test:
    single_line = i.split('\t')
    mass.append(float(single_line[0]))
    spectrum.append(float(single_line[1]))
OR
spectrum = []
mass = []
with open('test.in') as m:
    test = [i for i in m.readlines() if not i.startswith("#")]
for i in test:
    single_line = i.split('\t')
    mass.append(float(single_line[0]))
    spectrum.append(float(single_line[1]))
This will filter out all the lines that start with #.
pseudo code:
for r in m:
    if r.startswith('#'):
        continue
    spt = r.split('\t')
    if len(spt) < 2:
        continue
    ## todo: .....
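Putting these ideas together, one possible sketch skips anything that is a comment, is too short, or does not parse as two numbers. It splits on any whitespace rather than only tabs, which is an assumption about the file; read_spectrum is a made-up name:

```python
def read_spectrum(lines):
    """Return (mass, spectrum) lists from lines, ignoring non-data lines."""
    mass, spectrum = [], []
    for line in lines:
        fields = line.split()  # any whitespace, so tabs and spaces both work
        if len(fields) < 2 or fields[0].startswith("#"):
            continue  # blank, too short, or a comment line
        try:
            m_value, s_value = float(fields[0]), float(fields[1])
        except ValueError:
            continue  # not numeric, so not a data row
        mass.append(m_value)
        spectrum.append(s_value)
    return mass, spectrum
```

With the file from the question this would be used as mass, spectrum = read_spectrum(open('test.in')), with no need to know how long the header is.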

Python: How to compare string from two text files and retrieve an additional line of one in case of match

I have found so much information from previous search on this website but I seem to be stuck on the following issue.
I have two text files that look like this
Inter.txt ( n-lines but only showed 4 lines,you get the idea)
7275
30000
6693
855
....
rules.txt (2n-lines)
7275
8500
6693
7555
....
3
1000
8
5
....
I want to compare each line of Inter.txt with rules.txt and, in case of a match, jump n lines ahead in order to get the score of that line. (E.g. with 7275 there is a match, so I jump n lines to get the score 3.)
I produced the following code but, for some reason, I only get the output for the first line when I should have one for each match from my first file. With the previous example, I should have 8 as an output for 6693.
import linecache
inter = open("Inter.txt", "r")
rules = open("rules.txt", "r")
iScore = 0
jump = 266
i=0
for lineInt in inter:
    #i = i+1
    #print(i)
    for lineRul in rules:
        i = i+1
        #print(i)
        if lineInt == lineRul:
            print("Match")
            inc = linecache.getline("rules.txt", i + jump)
            #print(inc)
            iScore = iScore + int(inc)
            print(iScore)
            #break
        else:
            continue
All the print(i) are there because I checked that all the lines were read. I am a novice in Python.
To sum up, I don't understand why I only have one output. Thanks in advance !
Ok, I think the main thing blocking you is that a for loop over a file moves the file pointer to the end of the file, and the pointer doesn't reset when you start the loop again.
So when you open rules.txt only once and use that instance in the inner loop, it goes through all the lines only during the first iteration of the outer loop. The second time, it tries to go over the remaining lines, of which there are none.
The solution is to open the file before each run of the inner loop and close it afterwards.
This code worked for me.
import linecache
inter = open("Inter.txt", "r")
iScore = 0
jump = 4
for lineInt in inter:
    i=0
    #i = i+1
    #print(i)
    rules = open("rules.txt", "r")
    for lineRul in rules:
        i = i+1
        #print(i)
        if lineInt == lineRul:
            print("Match")
            inc = linecache.getline("rules.txt", i + jump)
            #print(inc)
            iScore = iScore + int(inc)
            print(iScore)
            #break
        else:
            continue
    rules.close()
I also moved the line that sets i to 0 to the beginning of the outer loop, but I guess you'd have found that yourself.
And I changed jump to 4 to fit the example files you gave :p
Can you please try this solution:
def get_rules_values(rules_file):
    with open(rules_file, "r") as rules:
        return list(map(int, rules.readlines()))
def get_rules_dict(rules_values):
    half = len(rules_values) // 2 # first half: values, second half: scores
    return dict(zip(rules_values[:half], rules_values[half:]))
def get_inter_values(inter_file):
    with open(inter_file, "r") as inter:
        return list(map(int, inter.readlines()))
rules_dict = get_rules_dict(get_rules_values("rules.txt"))
inter_values = get_inter_values("Inter.txt")
for inter_value in inter_values:
    if inter_value in rules_dict: # skip values with no matching rule, e.g. 30000
        print(inter_value, rules_dict[inter_value])
Hope it's working for you!
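If the goal is the running total that iScore was meant to hold, the dictionary idea can be reduced to a single sum. This is a sketch under the assumption that rules.txt really is n values followed by their n scores; total_score is a made-up name and it takes already-read lines so it can be tried without files:

```python
def total_score(inter_lines, rules_lines):
    """Sum the scores of every Inter.txt value that appears in rules.txt."""
    rules = [line.strip() for line in rules_lines if line.strip()]
    half = len(rules) // 2  # first half: values, second half: their scores
    scores = dict(zip(rules[:half], (int(s) for s in rules[half:])))
    # Values without a matching rule contribute 0.
    return sum(scores.get(line.strip(), 0) for line in inter_lines)
```

With the four-line example from the question, the matches are 7275 (score 3) and 6693 (score 8). Each file is read only once, so nothing has to be reopened or rewound.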

Iterating Over Two Large Lists using Python

I have two files, both of which are tab delimited. One of them is almost 800k lines and is an exonic coordinates file; the other is almost 200k lines (it is a VCF file).
I am writing code in Python to find and filter the positions in the VCF that fall within exonic coordinates (exon start and end from the exonic coordinates file) and write them to a file.
However, because the files are big, it took a couple of days to get the filtered output file.
The code below partially solves the speed issue, but I still need to speed up the filtration process, which is why I used a break to exit the second loop. How can I start again from the beginning of the outer loop instead of taking the next element from the first (outer) loop?
Here is my code:
import csv
import os
import sys
list_coord = []
with open('ref_ordered.txt', 'rb') as csvfile:
    reader = csv.reader(csvfile, delimiter='\t')
    for row in reader:
        list_coord.append((row[0],row[1],row[2]))
def parseVcf(vcf,src):
    done = False
    with open(vcf,'r') as f:
        reader=csv.reader((f),delimiter='\t')
        vcf_out_split = vcf.split('.')
        vcf_out_split.insert(2,"output_CORRECT2")
        outpt = open('.'.join(vcf_out_split),'a')
        for coord in list_coord:
            for row in reader:
                if '#' not in row[0]:
                    coor_genom = int(row[1])
                    coor_exon1 = int(coord[1])+1
                    coor_exon2 = int(coord[2])
                    coor_genom_chr = row[0]
                    coor_exon_chr = coord[0]
                    ComH = row[7].split(';')
                    for x in ComH:
                        if 'DP4=' in x:
                            DP4_split=x[4:].split(',')
                    if (coor_exon1 <= coor_genom <= coor_exon2):
                        if (coor_genom_chr == coor_exon_chr):
                            if ((int(DP4_split[2]) >= 1 and int(DP4_split[3]) >= 1)):
                                done = True
                                outpt.write('\t'.join(row) + '\n')
                if done:
                    break
        outpt.close()
for root,dirs,files in os.walk("."):
    for file in files:
        pathname=os.path.join(root,file)
        if file.find("1_1")==0:
            print "Parsing " + file
            parseVcf(pathname, "1_1")
ref_ordered.txt:
1 69090 70008
1 367658 368597
1 621095 622034
1 861321 861393
1 865534 865716
1 866418 866469
1 871151 871276
1 874419 874509
1_1 Input File:
#CHROM POS ID REF ALT QUAL FILTER INFO FORMAT directory
1 14907 rs79585140 A G 20 . DP=10;VDB=5.226464e-02;RPB=-6.206015e-01;AF1=0.5;AC1=1;DP4=1,2,5,2;MQ=32;FQ=20.5;PV4=0.5,0.07,0.16,0.33;DN=131;DA=A/G;GM=NR_024540.1;GL=WASH7P;FG=intron;FD=intron-variant;CP=0.001;CG=-0.312;CADD=1.415;AA=A;CN=dgv1e1,dgv2n71,dgv3e1,esv27265,nsv428112,nsv7879;DV=by-frequency,by-cluster;DSP=61 GT:PL:GQ 0/1:50,0,51:50
1 14930 rs75454623 A G 44 . DP=9;VDB=7.907652e-02;RPB=3.960091e-01;AF1=0.5;AC1=1;DP4=1,2,6,0;MQ=41;FQ=30.9;PV4=0.083,1,0.085,1;DN=131;DA=A/G;GM=NR_024540.1;GL=WASH7P;FG=intron;FD=intron-variant;CP=0.000;CG=-1.440;CADD=1.241;AA=A;CN=dgv1e1,dgv2n71,dgv3e1,esv27265,nsv428112,nsv7879;DV=by-frequency,by-cluster;DSP=38 GT:PL:GQ 0/1:74,0,58:61
1 15211 rs78601809 T G 9.33 . DP=6;VDB=9.014600e-02;RPB=-8.217058e-01;AF1=1;AC1=2;DP4=1,0,3,2;MQ=21;FQ=-37;PV4=1,0.35,1,1;DN=131;DA=T/G;GM=NR_024540.1;GL=WASH7P;FG=intron;FD=intron-variant;CP=0.001;CG=-0.145;CADD=1.611;AA=T;CN=dgv1e1,dgv2n71,dgv3e1,esv27265,nsv428112,nsv7879;DV=by-frequency,by-cluster;DSP=171 GT:PL:GQ 1/1:41,10,0:13
1 16146 . A C 25 . DP=10;VDB=2.063840e-02;RPB=-2.186229e+00;AF1=0.5;AC1=1;DP4=7,0,3,0;MQ=39;FQ=27.8;PV4=1,0.0029,1,0.0086;GM=NR_024540.1;GL=WASH7P;FG=intron;FD=unknown;CP=0.001;CG=-0.555;CADD=2.158;AA=A;CN=dgv1e1,dgv2n71,dgv3e1,esv27265,nsv428112,nsv7879;DSP=197 GT:PL:GQ 0/1:55,0,68:58
1 16257 rs78588380 G C 40 . DP=18;VDB=9.421102e-03;RPB=-1.327486e+00;AF1=0.5;AC1=1;DP4=3,11,4,0;MQ=50;FQ=43;PV4=0.011,1,1,1;DN=131;DA=G/C;GM=NR_024540.1;GL=WASH7P;FG=intron;FD=intron-variant;CP=0.001;CG=-2.500;CADD=0.359;AA=G;CN=dgv1e1,dgv2n71,dgv3e1,esv27265,nsv428112,nsv7879;DSP=308 GT:PL:GQ 0/1:70,0,249:73
1 16378 rs148220436 T C 39 . DP=7;VDB=2.063840e-02;RPB=-9.980746e-01;AF1=0.5;AC1=1;DP4=0,4,0,3;MQ=50;FQ=42;PV4=1,0.45,1,1;DN=134;DA=T/C;GM=NR_024540.1;GL=WASH7P;FG=intron;FD=intron-variant;CP=0.016;CG=-2.880;CADD=0.699;AA=T;CN=dgv1e1,dgv2n71,dgv3e1,esv27265,nsv428112,nsv7879;DV=by-cluster;DSP=227 GT:PL:GQ 0/1:69,0,90:72
OUTPUT File:
1 877831 rs6672356 T C 44.8 . DP=2;VDB=6.720000e-02;AF1=1;AC1=2;DP4=0,0,1,1;MQ=50;FQ=-33;DN=116;DA=T/C;GM=NM_152486.2,XM_005244723.1,XM_005244724.1,XM_005244725.1,XM_005244726.1,XM_005244727.1;GL=SAMD11;FG=missense,missense,missense,missense,missense,intron;FD=unknown;AAC=TRP/ARG,TRP/ARG,TRP/ARG,TRP/ARG,TRP/ARG,none;PP=343/682,343/715,328/667,327/666,234/573,NA;CDP=1027,1027,982,979,700,NA;GS=101,101,101,101,101,NA;PH=0;CP=0.994;CG=2.510;CADD=0.132;AA=C;CN=dgv10n71,dgv2n67,dgv3e1,dgv8n71,dgv9n71,essv2408,essv4734,nsv10161,nsv428334,nsv509035,nsv517709,nsv832980,nsv871547,nsv871883;DG;DV=by-cluster,by-1000G;DSP=38;CPG=875731-878363;GESP=C:8470/T:0;PAC=NP_689699.2,XP_005244780.1,XP_005244781.1,XP_005244782.1,XP_005244783.1,NA GT:PL:GQ 1/1:76,6,0:10
1 878000 . C T 44.8 . DP=2;VDB=7.520000e-02;AF1=1;AC1=2;DP4=0,0,1,1;MQ=50;FQ=-33;GM=NM_152486.2,XM_005244723.1,XM_005244724.1,XM_005244725.1,XM_005244726.1,XM_005244727.1;GL=SAMD11;FG=synonymous,synonymous,synonymous,synonymous,synonymous,intron;FD=unknown;AAC=LEU,LEU,LEU,LEU,LEU,none;PP=376/682,376/715,361/667,360/666,267/573,NA;CDP=1126,1126,1081,1078,799,NA;CP=0.986;CG=3.890;CADD=2.735;AA=C;CN=dgv10n71,dgv2n67,dgv3e1,dgv8n71,dgv9n71,essv2408,essv4734,nsv10161,nsv428334,nsv509035,nsv517709,nsv832980,nsv871547,nsv871883;DSP=62;CPG=875731-878363;PAC=NP_689699.2,XP_005244780.1,XP_005244781.1,XP_005244782.1,XP_005244783.1,NA GT:PL:GQ 1/1:76,6,0:10
1 881627 rs2272757 G A 205 . DP=9;VDB=1.301207e-01;AF1=1;AC1=2;DP4=0,0,5,4;MQ=50;FQ=-54;DN=100;DA=G/A;GM=NM_015658.3,XM_005244739.1;GL=NOC2L;FG=synonymous;FD=synonymous-codon,unknown;AAC=LEU;PP=615/750,615/755;CDP=1843;CP=0.082;CG=5.170;CADD=0.335;AA=G;CN=dgv10n71,dgv2n67,dgv3e1,dgv8n71,dgv9n71,essv2408,essv4734,nsv10161,nsv428334,nsv509035,nsv517709,nsv832980,nsv871547,nsv871883;DG;DV=by-frequency,by-cluster,by-1000G;DSP=40;GESP=A:6174/G:6830;PAC=NP_056473.2,XP_005244796.1 GT:PL:GQ 1/1:238,27,0:51
First of all, I did not include any code because it looks like homework to me (I have had homework like this). I will however try to explain the steps I took to improve my scripts, even though I know my solutions are far from perfect.
Your script could be slow because for every line in your csv file you open, write and close your output file. Try to make a list of the lines you want to add to the output file, and start writing only after you are done with reading and filtering.
You also might want to consider writing one function per filter and calling these functions with the line as the argument. That way you can easily add filters later on. I use a counter to keep track of the number of filters that succeeded, and if in the end counter == len(amountOfUsedFilers) I add my line to the list.
Also, why do you use both outpt = open('.'.join(vcf_out_split),'a') and with open(vcf,'r') as f:? Try to be consistent and smart in your choices.
Bioinformatics for the win!
If both of your files are ordered, you can save a lot of time by iterating over them in parallel, always advancing the one with lowest coordinates. This way you will only handle each line once, not many times.
Here's a basic version of your code that only does the coordinate checking (I don't fully understand your DP4 condition, so I'll leave it to you to add that part back in):
import csv
# coords_fn, vcf_fn and out_fn are the paths to your three files
with open(coords_fn) as coords_f, open(vcf_fn) as vcf_f, open(out_fn, "w") as out_f:
    coords = csv.reader(coords_f, delimiter="\t")
    vcf = csv.reader(vcf_f, delimiter="\t")
    out = csv.writer(out_f, delimiter="\t")
    next(vcf) # discard header row, or use out.writerow(next(vcf)) to preserve it!
    try:
        c = next(coords)
        r = next(vcf)
        while True:
            if int(c[1]) >= int(r[1]): # vcf file is behind
                r = next(vcf)
            elif int(c[2]) < int(r[1]): # coords file is behind
                c = next(coords)
            else: # int(c[1]) < int(r[1]) <= int(c[2])
                out.writerow(r) # add DP4 check here, and indent this line under it
                r = next(vcf) # don't indent this line
    except StopIteration: # one of the files has ended
        pass
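For the DP4 part, one way it might be added back is sketched below, following the condition in the question (third and fourth DP4 values both at least 1). It works on already-split rows so it can be tried without files, and it assumes a single chromosome with both inputs sorted by position; passes_dp4 and rows_in_exons are made-up names:

```python
def passes_dp4(info_field, minimum=1):
    """True if the 3rd and 4th DP4 values in a VCF INFO string are >= minimum."""
    for entry in info_field.split(";"):
        if entry.startswith("DP4="):
            values = [int(v) for v in entry[4:].split(",")]
            return values[2] >= minimum and values[3] >= minimum
    return False  # no DP4 entry at all


def rows_in_exons(exons, vcf_rows):
    """Yield VCF rows with start < POS <= end for some exon and a passing DP4."""
    exons, vcf_rows = iter(exons), iter(vcf_rows)
    try:
        exon = next(exons)
        row = next(vcf_rows)
        while True:
            pos = int(row[1])
            if pos <= int(exon[1]):      # VCF row is before this exon
                row = next(vcf_rows)
            elif pos > int(exon[2]):     # VCF row is past this exon
                exon = next(exons)
            else:                        # row lies inside the exon
                if passes_dp4(row[7]):
                    yield row
                row = next(vcf_rows)
    except StopIteration:                # one of the inputs has ended
        return
```

The rows it yields can be passed straight to csv.writer.writerows. Handling several chromosomes would also need the chromosome column compared in the same order as the files are sorted, which is left out here.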
