Reading a file outside a function to iterate through - Python

I have created a function that I want to run over an entire file, but I am having some trouble: I only get output for the last line of the file.
I have two different input files. The idea is to take lines from one file, collect certain terms into a dictionary, and then search the second file for the corresponding lines and print the output. I suspect the problem is the placement of my call to the function.
The matrix file looks like this:
Sp_ds Sp_hs Sp_log Sp_plat
c3833_g1_i2 4.00 0.07 16.84 26.37
c4832_g1_i1 24.55 116.87 220.53 28.82
c5161_g1_i1 107.49 89.39 26.95 698.97
c4399_g1_i2 27.91 72.57 5.56 36.58
c5916_g1_i1 82.57 19.03 48.55 258.22
The Blast file looks like this:
c0_g1_i1|m.1 gi|74665200|sp|Q9HGP0.1|PVG4_SCHPO 100.00 372 0 0 1 372 1 372 0.0 754
c1000_g1_i1|m.799 gi|48474761|sp|O94288.1|NOC3_SCHPO 100.00 747 0 0 5 751 1 747 0.0 1506
c1001_g1_i1|m.800 gi|259016383|sp|O42919.3|RT26A_SCHPO 100.00 268 0 0 1 268 1 268 0.0 557
c1002_g1_i1|m.801 gi|1723464|sp|Q10302.1|YD49_SCHPO 100.00 646 0 0 1 646 1 646 0.0 1310
c1003_g1_i1|m.803 gi|74631197|sp|Q6BDR8.1|NSE4_SCHPO 100.00 246 0 0 1 246 1 246 1e-179 502
c1004_g1_i1|m.804 gi|74676184|sp|O94325.1|PEX5_SCHPO 100.00 598 0 0 1 598 1 598 0.0 1227
c1005_g1_i1|m.805 gi|9910811|sp|O42832.2|SPB1_SCHPO 100.00 802 0 0 1 802 1 802 0.0 1644
c1006_g1_i1|m.806 gi|74627042|sp|O94631.1|MRM1_SCHPO 100.00 255 0 0 1 255 47 301 0.0 525
c1007_g1_i1|m.807 gi|20137702|sp|O74370.1|ISY1_SCHPO 100.00 201 0 0 1 201 1 201 4e-146 412
The program that I have so far is this:
def parse_blast(blast_line="NA"):
    transcript = blast_line[0][0]
    swissProt = blast_line[1][3]
    return (transcript, swissProt)

blast = open("/scratch/RNASeq/blastp.outfmt6")
for line in blast:
    line = [item.split('|') for item in line.split()]
    (transcript, swissProt) = parse_blast(blast_line=line)

transcript_to_protein = {}
transcript_to_protein[transcript] = swissProt

if transcript in transcript_to_protein:
    protein = transcript_to_protein.get(transcript)

matrix = open("/scratch/RNASeq/diffExpr.P1e-3_C2.matrix")
for line in matrix:
    matrixFields = line.rstrip("\n").split("\t")
    transcript = matrixFields[0]
    Sp_ds = matrixFields[1]
    Sp_hs = matrixFields[2]
    Sp_log = matrixFields[3]
    Sp_plat = matrixFields[4]

tab = "\t"
fields = (protein, Sp_ds, Sp_hs, Sp_log, Sp_plat)
out = open("parsed_blast.txt", "w")
out.write(tab.join(fields))

matrix.close()
blast.close()
out.close()

It's a scope problem: your indentation is not correct.
for line in blast:
    line = [item.split('|') for item in line.split()]
    (transcript, swissProt) = parse_blast(blast_line=line)
So you keep looping until the last line without saving the values you get.
I think you should change your indentation to this:
transcript_to_protein = {}  # 1. declare the dictionary
for line in blast:
    line = [item.split('|') for item in line.split()]
    (transcript, swissProt) = parse_blast(blast_line=line)
    transcript_to_protein[transcript] = swissProt  # 2. add the data to the dictionary
This will solve the problem with your first file, but not your second, as you don't use the dictionary inside the loop.
So you have to move these lines inside the second loop:
if transcript in transcript_to_protein:
    protein = transcript_to_protein.get(transcript)
I think you got the idea. I will leave the rest for you to do; there are a few lines that need to be moved before the loops and one or two inside the second loop.

This:
for line in blast:
    line = [item.split('|') for item in line.split()]
    (transcript, swissProt) = parse_blast(blast_line=line)
reads all the lines, but after it is finished, (transcript, swissProt) will only hold the values from the last line.
Same for:
for line in matrix:
    matrixFields = line.rstrip("\n").split("\t")
    transcript = matrixFields[0]
    Sp_ds = matrixFields[1]
    Sp_hs = matrixFields[2]
    Sp_log = matrixFields[3]
    Sp_plat = matrixFields[4]
You need to put the rest of your line processing inside your loops.
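Putting the two answers together, here is a sketch of the reorganized script as functions. The names `build_lookup` and `write_parsed` and the parameterized file paths are mine, not the asker's; the parsing logic itself is unchanged from the question:

```python
def parse_blast(blast_line):
    # unchanged from the question: the transcript ID from the first token,
    # and the fourth |-separated field of the second token (SwissProt accession)
    transcript = blast_line[0][0]
    swissProt = blast_line[1][3]
    return transcript, swissProt

def build_lookup(blast_path):
    """Build the whole transcript -> protein dictionary INSIDE the loop."""
    transcript_to_protein = {}
    with open(blast_path) as blast:
        for line in blast:
            fields = [item.split('|') for item in line.split()]
            transcript, swissProt = parse_blast(fields)
            transcript_to_protein[transcript] = swissProt
    return transcript_to_protein

def write_parsed(matrix_path, out_path, transcript_to_protein):
    """Look up each matrix row while still inside the loop, and write it out."""
    with open(matrix_path) as matrix, open(out_path, 'w') as out:
        for line in matrix:
            fields = line.rstrip('\n').split('\t')
            transcript = fields[0]
            if transcript in transcript_to_protein:
                protein = transcript_to_protein[transcript]
                out.write('\t'.join([protein] + fields[1:]) + '\n')
```

Note that the output file is opened once, before the write loop; if `open("parsed_blast.txt", "w")` ends up inside the loop it re-truncates the file on every iteration, which is another way to end up with only the last line.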

Related

How to replace multiple lines of specific columns of a text file

I have multiple files. I want to copy two columns from one file and use them to replace two columns in another file.
The first file contains:
ag-109 3.905E-07
am-241 1.121E-06
am-243 7.294E-09
cs-133 1.210E-05
eu-151 2.393E-08
eu-153 4.918E-07
gd-155 2.039E-08
mo-95 1.139E-05
nd-143 9.869E-06
.......
........
and the second file is:
h-1 10 0 0.06674 293 end
zr 11 0 0.0423 293 end
u-234 101 0 7.471e-06 293 end
u-235 101 0 0.0005265 293 end
u-236 101 0 0.0001285 293 end
u-238 101 0 0.02278 293 end
np-237 101 0 1.018e-05 293 end
pu-238 101 0 2.262e-06 293 end
pu-239 101 0 0.000147 293 end
.........
.......
.
.
u-234 1018 0 7.471e-06 293 end
u-235 1018 0 0.0005265 293 end
u-236 1018 0 0.0001285 293 end
u-238 1018 0 0.02278 293 end
np-237 1018 0 1.018e-05 293 end
pu-238 1018 0 2.262e-06 293 end
I want to replace the first column of file2 with the first column of file1, and put the 2nd column of file1 into the 4th column of file2. File2 contains more lines that I want to keep reading without changing.
The second problem is: file2 repeats column 1 eighteen times, the column going from "101" to "1018". Each group of 18 nuclides in the first column has different values in column 4.
I have tried to read file1 line by line, and the same for file2, then start replacing from the specific value '11', including a condition on column 2 to change every time the nuclide iteration finishes (I have 29 nuclides).
with open('100_60.inp', 'a+') as fapp:
    with open("20_3.2_10_100_18.txt") as copf:
        line = fapp.readline()
        # if not line:
        #     break
        source = re.split(r"\s+", line.strip())
        nuclide = copf.readline()
        # if not nuclide:
        #     break
        comp = re.split(r"\s+", nuclide.strip())
        if len(source) == 6 and source[1] != '11':
            for i in range(29):
                source[3][i] = nuclide[1][i]
                source[0][i] = nuclide[0][i]
                fapp.append(replace(source[0][i], nuclide[0][i]))
        if len(source) == 6 and source[1] != '101':
            for i in range(29):
                source[3][i] = nuclide[1][i]
                source[0][i] = nuclide[0][i]
                fapp.append(replace(source[0][i], nuclide[0][i]))
The expected result must be like this:
h-1 10 0 0.06674 293 end
zr 11 0 0.0423 293 end
ag-109 101 0 3.905E-07 293 end
am-241 101 0 1.121E-06 293 end
am-243 101 0 7.294E-09 293 end
cs-133 101 0 1.210E-05 293 end
eu-151 101 0 2.393E-08 293 end
eu-153 101 0 4.918E-07 293 end
gd-155 101 0 2.039E-08 293 end
....
....
....
I think that if you manage to convert the text file into a CSV, working with columns is going to be much easier.
If the columns are separated by tabs you can also do it in Excel without having to script it up yourself: https://support.geekseller.com/knowledgebase/convert-txt-file-csv/
After that you could use the csv module and read the file into a dict where you can add or remove keys (columns). I can't script up a full working solution for you right now, but I hope this gives you some hint on how to approach it.
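For completeness, here is a plain-Python sketch of that approach. It is a sketch only: it assumes every fuel row in file2 has exactly six fields, that rows with '10' or '11' in the second column pass through untouched, and that file1 lists the nuclides in the order they should be inserted, repeating once per zone; the function name `merge_files` is mine:

```python
import itertools

def merge_files(file1_lines, file2_lines, skip_ids=('10', '11')):
    # (name, value) pairs from file1; cycle() repeats them for each of the
    # 18 zones ("101" .. "1018") in file2
    nuclides = [line.split() for line in file1_lines if line.strip()]
    pairs = itertools.cycle(nuclides)
    out = []
    for line in file2_lines:
        fields = line.split()
        if len(fields) == 6 and fields[1] not in skip_ids:
            name, value = next(pairs)
            fields[0] = name   # 1st column of file2 <- 1st column of file1
            fields[3] = value  # 4th column of file2 <- 2nd column of file1
        out.append(' '.join(fields))
    return out
```

Working on whole split lines this way avoids the `source[3][i]` indexing in the attempt above, which indexes into the *characters* of a field rather than into rows.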

Quantifying reads to a reference with Python

Input file (test.sam):
SN398:470:C8RD3ACXX:7:1111:19077:53994 16 chrI 65374 255 51M * 0 0 TGAGAAATTCTTGAACATTCGTCTGTATTGATAAATAAAACTAGTATACAG IJJJJJJJJJJJJJIJJJIJJJJJJHJJJJJJJJJJJJHHHHHFFFFDB#B AS:i:0 XN:i:0 XM:i:0 XO:i:0 XG:i:0 NM:i:0 MD:Z:51 YT:Z:UU NH:i:1
genes.bed file is the reference:
chrI 130798 131983 YAL012W 0 + 130798 131983 0 1 1185, 0,
chrI 334 649 YAL069W 0 + 334 649 0 1 315, 0,
chrI 537 792 YAL068W-A 0 + 537 792 0 1 255, 0,
chrI 1806 2169 YAL068C 0 - 1806 2169 0 1 363, 0,
chrI 2479 2707 YAL067W-A 0 + 2479 2707 0 1 228, 0,
chrI 7234 9016 YAL067C 0 - 7234 9016 0 1 1782, 0,
chrI 10090 10399 YAL066W 0 + 10090 10399 0 1 309, 0,
chrI 11564 11951 YAL065C 0 - 11564 11951 0 1 387, 0,
chrI 12045 12426 YAL064W-B 0 + 12045 12426 0 1 381, 0,
The script is the following: it checks whether "chr" matches between the two files and, if the fourth column of test.sam (called genomic_location) is within the second and third columns of the genes.bed file, prints the fourth column of genes.bed and counts it as "1".
#!/usr/bin/env python
import sys

samfile = open('test.sam')   # sorted sam file
bedfile = open('genes.bed')  # reference genome
sys.stdout = open('merged.txt', 'w')

lookup = {}
for line in bedfile:
    fields = line.strip().split()
    chrm = fields[0]
    st = int(fields[1])
    end = int(fields[2])
    name = fields[3]
    if chrm not in lookup:
        lookup[chrm] = {}
    for i in range(st, end):
        if i not in lookup[chrm]:
            lookup[chrm][i] = [name]
        else:
            lookup[chrm][i].append(name)

gene_counts = {}
for line in samfile:
    reads = line.split()
    qname = reads[0]
    flag = reads[1]  # can be 0 or 16
    rname = reads[2]
    genomic_location = int(reads[3])
    mapq = int(reads[4])
    if rname in lookup:
        if genomic_location in lookup[rname]:
            for gene in lookup[rname][genomic_location]:
                if gene not in gene_counts:
                    gene_counts[gene] = 0
                else:
                    gene_counts[gene] += 1
print gene_counts
I need to change it in such a way that when the flag (second column in test.sam) is 16, it subtracts 51 from the fourth column of test.sam and then checks whether that newly made integer is within st and end of the genes.bed file.
What do you think is the best way to do this? I need to implement it within the script, not make a new input file (test.sam) that first changes the fourth column when the second is 16.
I would like to do this in Python. Thank you for your help, and please let me know if something is unclear.
Maybe there's some hidden complexity that I'm missing here, but the most obvious python implementation of "when flag (second column in input file test.sam) is 16, then subtract 51 from the fourth column in inputfile" is:
if flag == '16':  # reads[1] is a string, so compare with '16' (or use int(flag))
    genomic_location = int(reads[3]) - 51
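Since `reads[1]` comes out of `split()` as a string, and SAM flags are bit fields in general, a slightly more defensive sketch might look like this (the function name is mine; the 51 is the read length from the question):

```python
def adjust_location(flag, pos, read_len=51):
    """Return the position to test against the BED lookup: for
    reverse-strand reads (flag bit 0x10 set, e.g. flag 16)
    subtract the read length from the reported position."""
    if int(flag) & 16:
        return int(pos) - read_len
    return int(pos)
```

With the sample read above (flag 16, position 65374) this yields 65323, which you would then look up in `lookup[rname]` as before.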

Reading values from a text file with different row and column size in Python

I have read other similar posts, but they don't seem to work in my case; hence I'm posting mine here.
I have a text file with varying row and column sizes. I am interested in the rows that have a specific parameter. E.g. in the sample text file below, I want the last two values of each line that has the number '1' in the second position: that is, the values '1, 101', '101, 2', '2, 102' and '102, 3' from the lines starting with '101' to '104', because they have the number '1' in the second position.
$MeshFormat
2.2 0 8
$EndMeshFormat
$Nodes
425
.
.
$EndNodes
$Elements
630
.
97 15 2 0 193 97
98 15 2 0 195 98
99 15 2 0 197 99
100 15 2 0 199 100
101 1 2 0 201 1 101
102 1 2 0 201 101 2
103 1 2 0 202 2 102
104 1 2 0 202 102 3
301 2 2 0 303 178 78 250
302 2 2 0 303 250 79 178
303 2 2 0 303 198 98 249
304 2 2 0 303 249 99 198
.
.
.
$EndElements
The problem is, with the code I have come up with below, it starts from '101' but keeps reading values from the other lines up to '304' or more. What am I doing wrong, or does someone have a better way to tackle this?
# Here, (additional_lines + anz_knoten_gmsh - 2) are additional lines that
# need to be skipped at the beginning of the .txt file. Initially I find out
# where the range of the lines I need lies.
# two_noded_elem_start is the first line having '1' at the second position,
# and four_noded_elem_start is the first line having '2' in the second position.
# So, basically, I'm reading between these two parameters.
input_file = open(os.path.join(gmsh_path, "mesh_outer_region.msh"))
output_file = open(os.path.join(gmsh_path, "mesh_skip_nodes.txt"), "w")

for i, line in enumerate(input_file):
    if i == (additional_lines + anz_knoten_gmsh + two_noded_elem_start - 2):
        break

for i, line in enumerate(input_file):
    if i == additional_lines + anz_knoten_gmsh + four_noded_elem_start - 2:
        break
    elem_list = line.strip().split()
    del elem_list[:5]
    writer = csv.writer(output_file)
    writer.writerow(elem_list)

input_file.close()
output_file.close()
EDIT: The piece of code used to find parameters like two_noded_elem_start is as follows:
# anz_elemente_ueberg_gmsh is another parameter that is found out
# from a previous piece of code, and '$EndElements' is what is
# at the end of the text file "mesh_outer_region.msh".
input_file = open(os.path.join(gmsh_path, "mesh_outer_region.msh"), "r")
for i, line in enumerate(input_file):
    if line.strip() == anz_elemente_ueberg_gmsh:
        break
for i, line in enumerate(input_file):
    if line.strip() == '$EndElements':
        break
    element_list = line.strip().split()
    if element_list[1] == '1':
        two_noded_elem_start = element_list[0]
        two_noded_elem_start = int(two_noded_elem_start)
        break
input_file.close()
>>> with open('filename') as fh:                      # Open the file
...     for line in fh:                               # For each line in the file
...         values = line.split()                     # Split the values into a list
...         if len(values) > 1 and values[1] == '1':  # Compare the second value
...             print values[-2], values[-1]          # Print the 2nd from last and the last
1 101
101 2
2 102
102 3
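On the full .msh file, short lines like `$Nodes` or the element count would make the `values[1]` test misfire, so in practice you also want to stay inside the `$Elements` block. A sketch combining the two ideas (the function name `two_noded_rows` is mine):

```python
def two_noded_rows(lines):
    """Collect the last two columns of $Elements rows whose second field is '1'."""
    rows, in_elements = [], False
    for line in lines:
        text = line.strip()
        if text == '$Elements':
            in_elements = True   # start looking only after this marker
            continue
        if text == '$EndElements':
            break                # stop at the end of the section
        if not in_elements:
            continue
        values = text.split()
        # skip the element-count line and any row that is too short
        if len(values) > 2 and values[1] == '1':
            rows.append((values[-2], values[-1]))
    return rows
```

This way no line-number bookkeeping (`additional_lines`, `anz_knoten_gmsh`, ...) is needed at all.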

How to split a line of a txt file into 3 values per line

I have a file which has a line of values separated by spaces:
0 380 -222 0 382 -218 45 428 174 . . .
and so on.
What is the fastest way to split every 3 values onto a new line?
Like this:
0 380 -222
0 382 -218
45 428 174
.
.
.
You initially need to split the string you have based on the spaces. Then you can choose among many ways to combine elements of that list and print them.
To split the string into a list you generally use split().
# test string
s = "0 380 -222 0 382 -218 45 428 174"
# splitting based on the spaces
l = s.split()
One of the options to combine them is to use zip taking note of the slices you take in order to define the elements you want. This will create tuples which will contain your elements.
You can then unpack these tuples in a for loop and print or do anything else you want with them:
for a, b, c in zip(l[0::3], l[1::3], l[2::3]):
    print a, b, c
Which in turn prints:
0 380 -222
0 382 -218
45 428 174
After #boardrider's comment, I'll note that in case the list length is not divisible by 3 you can use izip_longest (zip_longest in Py3) from itertools (and supply an optional fillvalue if you need it) in order to get all possible values in the string s:
from itertools import izip_longest

s = "0 380 -222 0 382 -218 45 428 293 9298 8192 919 919"
l = s.split()
for a, b, c in izip_longest(l[0::3], l[1::3], l[2::3]):
    print a, b, c
Which now yields:
0 380 -222
0 382 -218
45 428 293
9298 8192 919
919 None None
You can use xrange too:
l = '0 380 -222 0 382 -218 45 428 174'.split(' ')
result = [l[i:i+3] for i in xrange(0, len(l), 3)]
for three in result:
    print ' '.join(three)
Prints:
0 380 -222
0 382 -218
45 428 174
You could do it with a generalized grouping function:
def grouper(n, iterable):
    "s -> (s0,s1,...sn-1), (sn,sn+1,...s2n-1), (s2n,s2n+1,...s3n-1), ..."
    return zip(*[iter(iterable)] * n)

line = "0 380 -222 0 382 -218 45 428 174"
for group in grouper(3, line.split()):
    print(' '.join(group))
This is how you would do it:
num_columns = 3
with open(file_name) as fd:
    for line in fd:
        for count, number in enumerate(line.strip().split(), 1):
            print '{:4}'.format(number),
            if count % num_columns == 0:
                print
The following will read an input file called input.txt containing your space delimited entries, and create you an output file called output.txt with the entries split 3 per line:
from itertools import islice

with open('input.txt', 'r') as f_input, open('output.txt', 'w') as f_output:
    values = []
    for line in f_input:
        values.extend(line.split())
    ivalues = iter(values)
    for triple in iter(lambda: list(islice(ivalues, 3)), []):
        f_output.write(' '.join(triple) + '\n')
Giving you an output file as follows:
0 380 -222
0 382 -218
45 428 174
Tested using Python 2.7.9

Efficiently finding intersecting regions in two huge dictionaries

I wrote a piece of code that finds common IDs in line[1] of two different files. My input file is huge (2 million lines). If I split it into many small files it gives me more intersecting IDs, while if I run the whole file at once, far fewer. I cannot figure out why; can you suggest what is wrong and how to improve this code to avoid the problem?
fileA = open("file1.txt", 'r')
fileB = open("file2.txt", 'r')
output = open("result.txt", 'w')

dictA = dict()
for line1 in fileA:
    listA = line1.split('\t')
    dictA[listA[1]] = listA

dictB = dict()
for line1 in fileB:
    listB = line1.split('\t')
    dictB[listB[1]] = listB

for key in dictB:
    if key in dictA:
        output.write(dictA[key][0] + '\t' + dictA[key][1] + '\t' +
                     dictB[key][4] + '\t' + dictB[key][5] + '\t' +
                     dictB[key][9] + '\t' + dictB[key][10])
My file1 is sorted by line[0] and has columns 0-15:
contig17 GRMZM2G052619_P03 98 109 2 0 15 67 78.8 0 127 5 420 0 304 45
contig33 AT2G41790.1 98 420 2 0 21 23 78.8 1 127 5 420 2 607 67
contig98 GRMZM5G888620_P01 87 470 1 0 17 28 78.8 1 127 7 420 2 522 18
contig102 GRMZM5G886789_P02 73 115 1 0 34 45 78.8 0 134 5 421 0 456 50
contig123 AT3G57470.1 83 201 2 1 12 43 78.8 0 134 9 420 0 305 50
My file2 is not sorted and has columns 0-10:
GRMZM2G052619 GRMZM2G052619_P03 4 2345 GO:0043531 ADP binding "Interacting selectively and non-covalently with ADP" [GOC:jl] molecular_function PF07525 1
GRMZM5G888620 GRMZM5G888620_P01 1 2367 GO:0011551 DNA binding "Any molecular function by which a gene product interacts selectively and non-covalently with DNA" [GOC:jl] molecular_function PF07589 4
GRMZM5G886789 GRMZM5G886789_P02 1 4567 GO:0055516 ADP binding "Interacting selectively and non-covalently with ADP" [GOC:jl] molecular_function PF07526 0
My desired output:
contig17 GRMZM2G052619_P03 GO:0043531 ADP binding molecular_function PF07525
contig98 GRMZM5G888620_P01 GO:0011551 DNA binding molecular_function PF07589
contig102 GRMZM5G886789_P02 GO:0055516 ADP binding molecular_function PF07526
I really recommend you to use pandas to cope with this kind of problem.
For proof that it can be done simply with pandas:
import pandas as pd            # install this, and read the docs
from StringIO import StringIO  # you don't need this

# simulating reading the first file
first_file = """contig17 GRMZM2G052619_P03 x
contig33 AT2G41790.1 x
contig98 GRMZM5G888620_P01 x
contig102 GRMZM5G886789_P02 x
contig123 AT3G57470.1 x"""

# simulating reading the second file
second_file = """y GRMZM2G052619_P03 y
y GRMZM5G888620_P01 y
y GRMZM5G886789_P02 y"""

# Here is how you open the files. Instead of StringIO you will simply pass
# the file path. Give the correct separator: sep="\t" for tabular data
# (here I'm using a space). In names, put some relevant names for your columns.
f_df = pd.read_table(StringIO(first_file),
                     header=None,
                     sep=" ",
                     names=['a', 'b', 'c'])
s_df = pd.read_table(StringIO(second_file),
                     header=None,
                     sep=" ",
                     names=['d', 'e', 'f'])

# This is the hard bit, where I use a bit of my experience with pandas:
# select the rows of the second data frame whose values in column e
# are "isin" column b of the first data frame.
my_df = s_df[s_df.e.isin(f_df.b)]
Output:
Out[180]:
d e f
0 y GRMZM2G052619_P03 y
1 y GRMZM5G888620_P01 y
2 y GRMZM5G886789_P02 y
# you can save this with:
my_df.to_csv("result.txt", sep="\t")
Cheers!
This is almost the same but within a function.
import re

# Creates a function to do the reading for each file
def read_store(file_, dictio_):
    """Given a file name and a dictionary, stores each line of the
    file in the dictionary, keyed by the matched ID."""
    with open(file_, 'r') as file_0:
        lines_file_0 = file_0.readlines()
        for line in lines_file_0:
            ID = re.findall(r"^.+\s+(\w+)", line)
            # I couldn't check it, but it should match whatever is after a
            # separator character: letters, numbers or underscores.
            # findall returns a list, so take the first match as the key.
            if ID:
                dictio_[ID[0]] = line
To use it, do:
file1 = {}
read_store("file1.txt", file1)
And then compare them as you normally do, but I would use \s instead of \t to split. Even though it will also split between words, that is easy to rejoin with " ".join(DictA[1:5]).
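For reference, here is the same comparison done with plain dicts, rstrip-ing the newline that split('\t') otherwise leaves on the last field of each line. This is only a sketch: the column indices 0, 1, 4, 5, 9, 10 are taken from the original script, and the function name `common_rows` is mine:

```python
def common_rows(fileA_lines, fileB_lines):
    # index file1 by its second column (the shared ID)
    dictA = {}
    for line in fileA_lines:
        fields = line.rstrip('\n').split('\t')
        if len(fields) > 1:
            dictA[fields[1]] = fields
    # for each file2 row sharing that ID, emit the merged columns
    out = []
    for line in fileB_lines:
        fields = line.rstrip('\n').split('\t')
        if len(fields) > 10 and fields[1] in dictA:
            a = dictA[fields[1]]
            out.append('\t'.join([a[0], a[1], fields[4], fields[5],
                                  fields[9], fields[10]]))
    return out
```

Note that a dict keeps only one row per ID: if either file contains duplicate IDs, later rows silently overwrite earlier ones, which could explain getting different intersection counts for the whole file versus many small files.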
