quantifying reads to a reference with Python - python

Input file (test.sam):
SN398:470:C8RD3ACXX:7:1111:19077:53994 16 chrI 65374 255 51M * 0 0 TGAGAAATTCTTGAACATTCGTCTGTATTGATAAATAAAACTAGTATACAG IJJJJJJJJJJJJJIJJJIJJJJJJHJJJJJJJJJJJJHHHHHFFFFDB#B AS:i:0 XN:i:0 XM:i:0 XO:i:0 XG:i:0 NM:i:0 MD:Z:51 YT:Z:UU NH:i:1
genes.bed file is the reference:
chrI 130798 131983 YAL012W 0 + 130798 131983 0 1 1185, 0,
chrI 334 649 YAL069W 0 + 334 649 0 1 315, 0,
chrI 537 792 YAL068W-A 0 + 537 792 0 1 255, 0,
chrI 1806 2169 YAL068C 0 - 1806 2169 0 1 363, 0,
chrI 2479 2707 YAL067W-A 0 + 2479 2707 0 1 228, 0,
chrI 7234 9016 YAL067C 0 - 7234 9016 0 1 1782, 0,
chrI 10090 10399 YAL066W 0 + 10090 10399 0 1 309, 0,
chrI 11564 11951 YAL065C 0 - 11564 11951 0 1 387, 0,
chrI 12045 12426 YAL064W-B 0 + 12045 12426 0 1 381, 0,
The script is the following. It checks whether "chr" matches between the two files, and if the fourth column of test.sam (called genomic_location) lies between the second and third columns of genes.bed, it prints the fourth column of genes.bed and counts it as "1".
#!/usr/bin/env python
import sys
samfile = open('test.sam')   # sorted sam file
bedfile = open('genes.bed')  # reference genome
sys.stdout = open('merged.txt', 'w')
# Map every covered position on each chromosome to the genes that span it
lookup = {}
for line in bedfile:
    fields = line.strip().split()
    chrm = fields[0]
    st = int(fields[1])
    end = int(fields[2])
    name = fields[3]
    if chrm not in lookup:
        lookup[chrm] = {}
    for i in range(st, end):
        if i not in lookup[chrm]:
            lookup[chrm][i] = [name]
        else:
            lookup[chrm][i].append(name)
gene_counts = {}
for line in samfile:
    reads = line.split()
    qname = reads[0]
    flag = reads[1]  # be 0 or 16
    rname = reads[2]
    genomic_location = int(reads[3])
    mapq = int(reads[4])
    if rname in lookup:
        if genomic_location in lookup[rname]:
            for gene in lookup[rname][genomic_location]:
                if gene not in gene_counts:
                    gene_counts[gene] = 1  # the first hit counts as 1, not 0
                else:
                    gene_counts[gene] += 1
print gene_counts
I need to change it so that when flag (the second column of the input file test.sam) is 16, the script subtracts 51 from the fourth column of test.sam and then checks whether that new integer lies between st and end in genes.bed.
What do you think is the best way to do this? I need to implement this within the script, not by creating a new input file (test.sam) in which the fourth column has already been changed wherever the second column is 16.
I would like to do this in Python. Thank you for your help, and please let me know if something is unclear.

Maybe there's some hidden complexity that I'm missing here, but the most obvious Python implementation of "when flag (second column in input file test.sam) is 16, then subtract 51 from the fourth column in inputfile" is the following. Note that your script reads flag as a string, so convert it before comparing (or compare against '16'):
if int(flag) == 16:
    genomic_location = int(reads[3]) - 51
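To make that concrete, here is a small sketch of the adjustment as a helper you could call inside the SAM loop. The function name adjusted_location and the read_length parameter are made up for illustration. It also assumes you may eventually see flags other than plain 0 or 16, so it tests the reverse-strand bit (0x10) instead of comparing for equality; with only 0 and 16 in the file, both checks behave the same.

```python
def adjusted_location(sam_line, read_length=51):
    """Return (rname, position) for one SAM line, shifting reverse-strand reads.

    Hypothetical helper: read_length=51 mirrors the fixed offset in the question.
    """
    fields = sam_line.split()
    flag = int(fields[1])      # SAM columns are text, so convert before testing
    position = int(fields[3])
    if flag & 16:              # bit 0x10 set: read mapped to the reverse strand
        position -= read_length
    return fields[2], position
```

In the original script you would then look up the returned position in lookup[rname] instead of the raw fourth column.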

Related

how to replace multiple lines of specific columns of text file

I have multiple files. I want to copy two columns from one file and use them to replace two columns in another file.
The first file contains:
ag-109 3.905E-07
am-241 1.121E-06
am-243 7.294E-09
cs-133 1.210E-05
eu-151 2.393E-08
eu-153 4.918E-07
gd-155 2.039E-08
mo-95 1.139E-05
nd-143 9.869E-06
.......
........
The second file is:
h-1 10 0 0.06674 293 end
zr 11 0 0.0423 293 end
u-234 101 0 7.471e-06 293 end
u-235 101 0 0.0005265 293 end
u-236 101 0 0.0001285 293 end
u-238 101 0 0.02278 293 end
np-237 101 0 1.018e-05 293 end
pu-238 101 0 2.262e-06 293 end
pu-239 101 0 0.000147 293 end
.........
.......
.
.
u-234 1018 0 7.471e-06 293 end
u-235 1018 0 0.0005265 293 end
u-236 1018 0 0.0001285 293 end
u-238 1018 0 0.02278 293 end
np-237 1018 0 1.018e-05 293 end
pu-238 1018 0 2.262e-06 293 end
I want to replace the first column of file2 with the first column of file1, and the 4th column of file2 with the 2nd column of file1.
File2 contains more lines, which I want to keep reading without changing them.
The second problem is:
File2 repeats the nuclides in column 1 18 times, for the values "101" to "1018" in column 2.
Each set of nuclides in the first column has different values in column 4.
I have tried to read file1 line by line, and the same for file2,
then start replacing from the specific value '11',
including a condition for column 2 to change every time the nuclide iteration finishes (I have 29 nuclides).
with open('100_60.inp','a+') as fapp:
    with open("20_3.2_10_100_18.txt") as copf:
        line = fapp.readline()
        # if not line:
        #     break
        source = re.split(r"\s+", line.strip())
        nuclide = copf.readline()
        # if not nuclide:
        #     break
        comp = re.split(r"\s+", nuclide.strip())
        if len(source)==6 and source[1] != '11':
            for i in range(29):
                source[3][i]= nuclide[1][i]
                source[0][i] = nuclide[0][i]
                fapp.append(replace(source[0][i],nuclide[0][i]))
        if len(source)==6 and source[1] !='101':
            for i in range(29):
                source[3][i]= nuclide[1][i]
                source[0][i] = nuclide[0][i]
                fapp.append(replace(source[0][i],nuclide[0][i]))
The expected result should look like this:
h-1 10 0 0.06674 293 end
zr 11 0 0.0423 293 end
ag-109 101 0 3.905E-07 293 end
am-241 101 0 1.121E-06 293 end
am-243 101 0 7.294E-09 293 end
cs-133 101 0 1.210E-05 293 end
eu-151 101 0 2.393E-08 293 end
eu-153 101 0 4.918E-07 293 end
gd-155 101 0 2.039E-08 293 end
....
....
....
I think that if you manage to convert the text file into a CSV, working with the columns is going to be much easier.
If the columns are separated by tabs, you can also do it in Excel without having to script it yourself: https://support.geekseller.com/knowledgebase/convert-txt-file-csv/
After that, you could use the csv module and read each row as a dict (csv.DictReader), where you can add or remove keys (columns). I can't script a full working solution for you right now, but I hope this gives you a hint on how to approach it.
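If you'd rather stay in plain Python, here is a rough sketch of the replacement itself, working on whitespace-separated fields. It rests on my reading of the question: every run of consecutive 6-field lines sharing the same value in column 2 is one block, and each block whose column-2 value you select is rewritten with one line per nuclide from file1, keeping columns 3, 5 and 6 from the block's first line. The name replace_nuclides and its parameters are invented for illustration, and you need to supply the set of column-2 values yourself, since the question does not spell out the exact numbering.

```python
def replace_nuclides(inp_lines, nuclides, material_ids):
    """Rewrite the blocks of inp_lines whose column 2 is in material_ids.

    inp_lines    -- lines of file2 (the input deck)
    nuclides     -- list of (name, density) string pairs read from file1
    material_ids -- set of column-2 strings whose blocks are replaced (assumed)
    """
    out = []
    last_replaced = None                   # column-2 value of the block just rewritten
    for line in inp_lines:
        fields = line.split()
        is_target = len(fields) == 6 and fields[1] in material_ids
        if not is_target:
            out.append(line.rstrip("\n"))  # copy every other line unchanged
            last_replaced = None
            continue
        if fields[1] == last_replaced:
            continue                       # old nuclide of a block already rewritten
        last_replaced = fields[1]
        for name, density in nuclides:     # one new line per nuclide from file1
            out.append(" ".join([name, fields[1], fields[2], density] + fields[4:]))
    return out
```

You would build nuclides with something like [line.split() for line in open(file1)] and write the returned lines to a new file rather than appending to the original.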

reading a file outside a function to iterate through

I have created a function, that I want to run over an entire file, but I am having some trouble. I am only getting output from the last line of the file.
I have two different input files, and the idea is to take the lines from one file and collecting certain terms, adding them to a dictionary, and then searching the second file for the corresponding lines and printing the output. I know the problem is most likely the placement of my call for the function.
The matrix file looks like this
Sp_ds Sp_hs Sp_log Sp_plat
c3833_g1_i2 4.00 0.07 16.84 26.37
c4832_g1_i1 24.55 116.87 220.53 28.82
c5161_g1_i1 107.49 89.39 26.95 698.97
c4399_g1_i2 27.91 72.57 5.56 36.58
c5916_g1_i1 82.57 19.03 48.55 258.22
The Blast file looks like this
c0_g1_i1|m.1 gi|74665200|sp|Q9HGP0.1|PVG4_SCHPO 100.00 372 0 0 1 372 1 372 0.0 754
c1000_g1_i1|m.799 gi|48474761|sp|O94288.1|NOC3_SCHPO 100.00 747 0 0 5 751 1 747 0.0 1506
c1001_g1_i1|m.800 gi|259016383|sp|O42919.3|RT26A_SCHPO 100.00 268 0 0 1 268 1 268 0.0 557
c1002_g1_i1|m.801 gi|1723464|sp|Q10302.1|YD49_SCHPO 100.00 646 0 0 1 646 1 646 0.0 1310
c1003_g1_i1|m.803 gi|74631197|sp|Q6BDR8.1|NSE4_SCHPO 100.00 246 0 0 1 246 1 246 1e-179 502
c1004_g1_i1|m.804 gi|74676184|sp|O94325.1|PEX5_SCHPO 100.00 598 0 0 1 598 1 598 0.0 1227
c1005_g1_i1|m.805 gi|9910811|sp|O42832.2|SPB1_SCHPO 100.00 802 0 0 1 802 1 802 0.0 1644
c1006_g1_i1|m.806 gi|74627042|sp|O94631.1|MRM1_SCHPO 100.00 255 0 0 1 255 47 301 0.0 525
c1007_g1_i1|m.807 gi|20137702|sp|O74370.1|ISY1_SCHPO 100.00 201 0 0 1 201 1 201 4e-146 412
The program that I have gotten so far is this
def parse_blast(blast_line="NA"):
    transcript = blast_line[0][0]
    swissProt = blast_line[1][3]
    return(transcript, swissProt)
blast = open("/scratch/RNASeq/blastp.outfmt6")
for line in blast:
    line= [item.split('|') for item in line.split()]
    (transcript, swissProt) = parse_blast(blast_line = line)
transcript_to_protein = {}
transcript_to_protein[transcript] = swissProt
if transcript in transcript_to_protein:
    protein = transcript_to_protein.get(transcript)
matrix = open("/scratch/RNASeq/diffExpr.P1e-3_C2.matrix")
for line in matrix:
    matrixFields = line.rstrip("\n").split("\t")
    transcript = matrixFields[0]
    Sp_ds = matrixFields[1]
    Sp_hs = matrixFields[2]
    Sp_log = matrixFields[3]
    Sp_plat = matrixFields[4]
tab = "\t"
fields = (protein,Sp_ds,Sp_hs,Sp_log,Sp_plat)
out = open("parsed_blast.txt","w")
out.write(tab.join(fields))
matrix.close()
blast.close()
out.close()
The problem is where your statements sit relative to the loops, i.e. your indentation is not correct.
for line in blast:
    line= [item.split('|') for item in line.split()]
    (transcript, swissProt) = parse_blast(blast_line = line)
This loop runs through to the last line without saving the values it gets on each iteration.
I think you should change your code to this:
transcript_to_protein = {}  # 1. declare the dictionary once, before the loop
for line in blast:
    line= [item.split('|') for item in line.split()]
    (transcript, swissProt) = parse_blast(blast_line = line)
    transcript_to_protein[transcript] = swissProt  # 2. add the data to the dictionary
This solves the problem for your first file, but not for your second, because you don't use the dictionary inside the second loop.
So you have to move these lines inside the second loop:
    if transcript in transcript_to_protein:
        protein = transcript_to_protein.get(transcript)
I think you get the idea. I will leave the rest for you to do: a few lines need to be moved before the loops, and one or two need to go inside the second loop.
This:
for line in blast:
    line= [item.split('|') for item in line.split()]
    (transcript, swissProt) = parse_blast(blast_line = line)
Reads all the lines, but after it is finished (transcript, swissProt) will only have the value from the last line.
Same for:
for line in matrix:
    matrixFields = line.rstrip("\n").split("\t")
    transcript = matrixFields[0]
    Sp_ds = matrixFields[1]
    Sp_hs = matrixFields[2]
    Sp_log = matrixFields[3]
    Sp_plat = matrixFields[4]
You need to put the rest of your line processing inside your loops.
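Putting both answers together, the whole thing could look roughly like the sketch below. It is one possible arrangement, not the only one. The function name annotate is made up, it takes open file handles so it can be tried without the real files, and it splits the matrix lines on any whitespace rather than only on tabs (an assumption on my part). Matrix rows whose transcript has no BLAST hit, including the header row, are skipped.

```python
def annotate(blast_handle, matrix_handle, out_handle):
    """Write one line per matrix row, with the transcript replaced by its protein."""
    # First pass: build the full transcript -> SwissProt lookup inside the loop.
    transcript_to_protein = {}
    for line in blast_handle:
        fields = [item.split("|") for item in line.split()]
        if len(fields) < 2:
            continue                      # skip blank or malformed lines
        transcript = fields[0][0]         # e.g. c1000_g1_i1
        swiss_prot = fields[1][3]         # e.g. O94288.1
        transcript_to_protein[transcript] = swiss_prot

    # Second pass: look up each matrix row and write it out straight away.
    for line in matrix_handle:
        matrix_fields = line.split()
        if not matrix_fields:
            continue
        protein = transcript_to_protein.get(matrix_fields[0])
        if protein is None:
            continue                      # header row or transcript without a hit
        out_handle.write("\t".join([protein] + matrix_fields[1:]) + "\n")
```

Open the output file once, before the second loop, and pass it in. Opening it with "w" inside the loop would truncate it on every row, which is another way to end up with only the last line.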

Reading values from a text file with different row and column size in python

I have read other similar posts, but they don't seem to work in my case, so I'm posting a new question here.
I have a text file which has varying row and column sizes. I am interested in the rows of values which have a specific parameter. E.g. in the sample text file below, I want the last two values of each line which has the number '1' in the second position. That is, I want the values '1, 101', '101, 2', '2, 102' and '102, 3' from the lines starting with the values '101 to 104' because they have the number '1' in the second position.
$MeshFormat
2.2 0 8
$EndMeshFormat
$Nodes
425
.
.
$EndNodes
$Elements
630
.
97 15 2 0 193 97
98 15 2 0 195 98
99 15 2 0 197 99
100 15 2 0 199 100
101 1 2 0 201 1 101
102 1 2 0 201 101 2
103 1 2 0 202 2 102
104 1 2 0 202 102 3
301 2 2 0 303 178 78 250
302 2 2 0 303 250 79 178
303 2 2 0 303 198 98 249
304 2 2 0 303 249 99 198
.
.
.
$EndElements
The problem is that the code I have come up with (shown below) starts from '101', but then keeps reading values from the other lines up to '304' or beyond. What am I doing wrong, or does someone have a better way to tackle this?
# Here, (additional_lines + anz_knoten_gmsh - 2) are additional lines that need to be skipped
# at the beginning of the .txt file. Initially I find out where the range
# of the lines lies which I need.
# The two_noded_elem_start is the first line having the '1' at the second position
# and four_noded_elem_start is the first line number having '2' in the second position.
# So, basically I'm reading between these two parameters.
input_file = open(os.path.join(gmsh_path, "mesh_outer_region.msh"))
output_file = open(os.path.join(gmsh_path, "mesh_skip_nodes.txt"), "w")
for i, line in enumerate(input_file):
    if i == (additional_lines + anz_knoten_gmsh + two_noded_elem_start - 2):
        break
for i, line in enumerate(input_file):
    if i == additional_lines + anz_knoten_gmsh + four_noded_elem_start - 2:
        break
    elem_list = line.strip().split()
    del elem_list[:5]
    writer = csv.writer(output_file)
    writer.writerow(elem_list)
input_file.close()
output_file.close()
EDIT: The piece of code used to find the parameters like two_noded_elem_start is as follows:
# anz_elemente_ueberg_gmsh is another parameter that is found out
# from a previous piece of code and '$EndElements' is what
# is at the end of the text file "mesh_outer_region.msh".
input_file = open(os.path.join(gmsh_path, "mesh_outer_region.msh"), "r")
for i, line in enumerate(input_file):
    if line.strip() == anz_elemente_ueberg_gmsh:
        break
for i, line in enumerate(input_file):
    if line.strip() == '$EndElements':
        break
    element_list = line.strip().split()
    if element_list[1] == '1':
        two_noded_elem_start = element_list[0]
        two_noded_elem_start = int(two_noded_elem_start)
        break
input_file.close()
>>> with open('filename') as fh: # Open the file
...     for line in fh:  # For each line in the file
...         values = line.split()  # Split the values into a list
...         if len(values) > 1 and values[1] == '1':  # Skip one-word lines, then compare the second value
...             print values[-2], values[-1]  # Print the 2nd from last and last
1 101
101 2
2 102
102 3
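One caveat with scanning the whole file: a line in the $Nodes section whose second value happens to be 1 would also match. A slightly more defensive sketch, assuming the file follows the layout shown in the question (an $Elements marker, a count line, the element rows, then $EndElements), only looks inside that section. The function name two_node_elements is made up for illustration.

```python
def two_node_elements(lines):
    """Return the node pairs of all rows with '1' in the second position,
    looking only between $Elements and $EndElements."""
    pairs = []
    in_elements = False
    for line in lines:
        text = line.strip()
        if text == "$Elements":
            in_elements = True
            continue
        if text == "$EndElements":
            break
        values = text.split()
        # Inside the section, skip the element-count line and any short line.
        if in_elements and len(values) > 2 and values[1] == "1":
            pairs.append((int(values[-2]), int(values[-1])))
    return pairs
```

Because a file object iterates line by line, you can pass it in directly, e.g. two_node_elements(open(path)), and then write the pairs out with csv.writer as in the question.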

How to rename labels in list?

I have a file like this:
301 my name is joe
303 whatsup
306 how are you doing today
308 what happened?
308 going home
309 let's go
I want to convert the labels 301, 303, 306, 308, 308, 309 to 1, 2, 3, 4, 4, 5
How can I rename these labels in order in such a way that similar ones get the same number?
Use a dictionary to store the mapping from original to new label, and use the current len of the dictionary for values that have not yet been mapped, using setdefault.
>>> labels = 301, 303, 306, 308, 308, 309
>>> names = {}
>>> for l in labels:
...     names.setdefault(l, len(names)+1)
...
>>> names
{301: 1, 303: 2, 306: 3, 308: 4, 309: 5}
More complete example:
text = """301 my name is joe
303 whatsup
306 how are you doing today
308 what happened?
308 going home
309 let's go""".splitlines()
import re
names = {}
replacer = lambda x: str(names.setdefault(x.group(), len(names) + 1))
for line in text:
    replaced = re.sub(r'^\d+', replacer, line)
    print(replaced)
Output:
1 my name is joe
2 whatsup
3 how are you doing today
4 what happened?
4 going home
5 let's go
You could use an index which only increments when the label is different from the last one:
data = ["301 my name is joe", "303 whatsup", "306 how are you doing today", "308 what happened?", "308 going home", "309 let's go"]
idx = 0
last_index = ""
for i in range(len(data)):
    if last_index != data[i].split(" ")[0]: idx += 1
    print str(idx) + " " + ' '.join(data[i].split(" ")[1:])
    last_index = data[i].split(" ")[0]
Result:
1 my name is joe
2 whatsup
3 how are you doing today
4 what happened?
4 going home
5 let's go
Use a dict to collect the prefixes and a counter.
data = """301 my name is joe
303 whatsup
306 how are you doing today
308 what happened?
308 going home
309 let's go"""
prefixes = {}
i = 1
for line in data.split("\n"):
    prefix, rest = line.split(" ", 1)
    pr = int(prefix)
    if pr not in prefixes:
        prefixes[pr] = i
        i = i + 1
    newPrefix = prefixes[pr]
    print("{} {}".format(newPrefix, rest))
Output:
1 my name is joe
2 whatsup
3 how are you doing today
4 what happened?
4 going home
5 let's go
def update_text(data):
    labels = sorted(set([line.split()[0] for line in data.splitlines()]))
    for inx, line in enumerate(data.splitlines()):
        yield str(labels.index(line.split()[0]) + 1) + ' ' + ' '.join(line.split()[1:])
data = '''301 my name is joe
303 whatsup
306 how are you doing today
308 what happened?
308 going home
309 let's go'''
print '\n'.join(update_text(data))
Output:
1 my name is joe
2 whatsup
3 how are you doing today
4 what happened?
4 going home
5 let's go
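Another simple solution (note that str.replace substitutes every occurrence of the label text, so this only works when the labels never appear anywhere else in the data):
>>> keys = sorted(set([line.split()[0] for line in data.splitlines()]))
>>> for k, v in enumerate(keys):
...     data = data.replace(v, str(k + 1))
...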
>>> print data
1 my name is joe
2 whatsup
3 how are you doing today
4 what happened?
4 going home
5 let's go

Unsure why program similar to bubble-sort is not working

I have been working on a programming challenge, which basically states:
Given integer array, you are to iterate through all pairs of neighbor
elements, starting from beginning - and swap members of each pair
where first element is greater than second.
The task is then to return the number of swaps made and the checksum of the resulting array. My program seems to do both the swapping pass and the checksum the way the problem describes, but my final answer is off for everything except the test input they gave.
So: 1 4 3 2 6 5 -1
Results in the correct output: 3 5242536 with my program.
But something like:
2 96 7439 92999 240 70748 3 842 74 706 4 86 7 463 1871 7963 904 327 6268 20955 92662 278 57 8 5912 724 70916 13 388 1 697 99666 6924 2 100 186 37504 1 27631 59556 33041 87 9 45276 -1
Results in: 39 1291223 when the correct answer is 39 3485793.
Here's what I have at the moment:
# Python 2.7
def check_sum(data):
    data = [str(x) for x in str(data)[::]]
    numbers = len(data)
    result = 0
    for number in range(numbers):
        result += int(data[number])
        result *= 113
        result %= 10000007
    return(str(result))
def bubble_in_array(data):
    numbers = data[:-1]
    numbers = [int(x) for x in numbers]
    swap_count = 0
    for x in range(len(numbers)-1):
        if numbers[x] > numbers[x+1]:
            temp = numbers[x+1]
            numbers[x+1] = numbers[x]
            numbers[x] = temp
            swap_count += 1
    raw_number = int(''.join([str(x) for x in numbers]))
    print('%s %s') % (str(swap_count), check_sum(raw_number))
bubble_in_array(raw_input().split())
Does anyone have any idea where I am going wrong?
The issue is with the way you calculate the checksum. It fails when the array contains numbers with more than one digit. For example:
2 96 7439 92999 240 70748 3 842 74 706 4 86 7 463 1871 7963 904 327 6268 20955 92662 278 57 8 5912 724 70916 13 388 1 697 99666 6924 2 100 186 37504 1 27631 59556 33041 87 9 45276 -1
You are calculating Checksum for 2967439240707483842747064867463187179639043276268209559266227857859127247091613388169792999692421001863750412763159556330418794527699666
digit by digit while you should calculate the Checksum of [2, 96, 7439, 240, 70748, 3, 842, 74, 706, 4, 86, 7, 463, 1871, 7963, 904, 327, 6268, 20955, 92662, 278, 57, 8, 5912, 724, 70916, 13, 388, 1, 697, 92999, 6924, 2, 100, 186, 37504, 1, 27631, 59556, 33041, 87, 9, 45276, 99666]
The fix:
# Python 2.7
def check_sum(data):
    # Fold the numbers themselves into the checksum, not their digits
    result = 0
    for number in data:
        result += number
        result *= 113
        result %= 10000007
    return(result)
def bubble_in_array(data):
    numbers = [int(x) for x in data[:-1]]  # drop the trailing -1 terminator
    swap_count = 0
    for x in xrange(len(numbers)-1):
        if numbers[x] > numbers[x+1]:
            numbers[x+1], numbers[x] = numbers[x], numbers[x+1]
            swap_count += 1
    print('%d %d' % (swap_count, check_sum(numbers)))
bubble_in_array(raw_input().split())
More notes:
To swap two variables in Python, you don't need a temp variable; just use a, b = b, a.
In Python 2.x, prefer xrange over range for loops like this, since xrange does not build the whole list in memory.
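For anyone who wants to check the arithmetic without typing input, here is the same logic as a small sketch that returns the values instead of printing them, so it runs unchanged on Python 2 and 3. The names single_pass and MODULUS are mine, not from the challenge.

```python
MODULUS = 10000007

def check_sum(numbers):
    """Fold each number (not each digit) into the running checksum."""
    result = 0
    for number in numbers:
        result = (result + number) * 113 % MODULUS
    return result

def single_pass(values):
    """Do one left-to-right pass of neighbor swaps; return (swaps, checksum)."""
    numbers = list(values)            # work on a copy, leave the input alone
    swaps = 0
    for i in range(len(numbers) - 1):
        if numbers[i] > numbers[i + 1]:
            numbers[i], numbers[i + 1] = numbers[i + 1], numbers[i]
            swaps += 1
    return swaps, check_sum(numbers)

print(single_pass([1, 4, 3, 2, 6, 5]))  # (3, 5242536), the sample from the question
```

The sample input only contains single-digit numbers, which is why the digit-by-digit version happened to give the right answer for it.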
