how to keep ordered rows in dictionary? - python

I wrote the following script to retrieve the gene count for each contig. It works well, but the order of the ID list I use as input is not preserved in the output.
I need to keep the same order, because my input contig list is sorted by expression level.
Can anyone help me?
Thanks for your help.
from collections import defaultdict
import numpy as np

gene_list = {}
for line in open('idlist.txt'):
    columns = line.strip().split()
    gene = columns[0]
    rien = columns[1]
    gene_list[gene] = rien

gene_count = defaultdict(lambda: np.zeros(6, dtype=int))

out_file = open('out.txt', 'w')
esem_file = open('Aquilonia.txt')
esem_file.readline()
for line in esem_file:
    fields = line.strip().split()
    exon = fields[0]
    numbers = [float(field) for field in fields[1:]]
    if exon in gene_list.keys():
        gene = gene_list[exon]
        gene_count[gene] += numbers
        print >> out_file, gene, gene_count[gene]
input file:
comp54678_c0_seq3
comp56871_c2_seq8
comp56466_c0_seq5
comp57004_c0_seq1
comp54990_c0_seq11
...
The output file comes back in numerical order instead:
comp100235_c0_seq1 [22 13 15 6 15 16]
comp101274_c0_seq1 [55 2 27 26 6 6]
comp101915_c0_seq1 [20 2 34 12 8 7]
comp101956_c0_seq1 [13 21 11 17 17 28]
comp101964_c0_seq1 [30 73 45 36 0 1]

Use collections.OrderedDict(); it preserves entries in input order.
from collections import OrderedDict

with open('idlist.txt') as idlist:
    gene_list = OrderedDict(line.split(None, 1) for line in idlist)
The above code builds your gene_list as an ordered dictionary in one line.
However, it looks as if you generate the output file purely based on the order of the input file lines:
for line in esem_file:
    # ...
    if exon in gene_list:  # no need to call `.keys()` here
        gene = gene_list[exon]
        gene_count[gene] += numbers
        print >> out_file, gene, gene_count[gene]
Rework your code to first collect the counts, then use a separate loop to write out your data:
with open('Aquilonia.txt') as esem_file:
    next(esem_file, None)  # skip first line
    for line in esem_file:
        fields = line.split()
        exon = fields[0]
        numbers = [float(field) for field in fields[1:]]
        if exon in gene_list:
            gene_count[gene_list[exon]] += numbers

with open('out.txt', 'w') as out_file:
    for gene in gene_list:
        print >> out_file, gene, gene_count[gene]
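As a side note, on Python 3.7+ a plain dict already preserves insertion order, so OrderedDict is optional there. A minimal Python 3 sketch of the whole rework under that assumption (it keeps the file names from the question, assumes idlist.txt has two whitespace-separated columns and Aquilonia.txt has six count columns, and writes one line per mapped gene name, i.e. per value in gene_list):

from collections import defaultdict
import numpy as np

# plain dicts keep insertion order on Python 3.7+
with open('idlist.txt') as idlist:
    gene_list = dict(line.split() for line in idlist if line.strip())

# float accumulator: in-place += of floats into an int array raises in newer NumPy
gene_count = defaultdict(lambda: np.zeros(6))

with open('Aquilonia.txt') as esem_file:
    next(esem_file, None)                      # skip the header line
    for line in esem_file:
        fields = line.split()
        exon, numbers = fields[0], [float(f) for f in fields[1:]]
        if exon in gene_list:
            gene_count[gene_list[exon]] += numbers

with open('out.txt', 'w') as out_file:
    for gene in dict.fromkeys(gene_list.values()):   # mapped gene names, first-seen order
        print(gene, gene_count[gene], file=out_file)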

Related

Counting items in txt file with Python dictionaries

I have the following txt file (only a fragment is given):
## DISTANCE : Shortest distance from variant to transcript
## a lot of comments here
## STRAND : Strand of the feature (1/-1)
## FLAGS : Transcript quality flags
#Uploaded_variation Location Allele Gene Feature Feature_type Consequence cDNA_position CDS_position Protein_position Amino_acids Codons Existing_variation Extra
chr1_69270_A/G chr1:69270 G ENSG00000186092 ENST00000335137 Transcript upstream_gene_variant 216 180 60 S tcA/tcG - IMPACT=LOW;STRAND=1
chr1_69270_A/G chr1:69270 G ENSG00000186092 ENST00000641515 Transcript intron_variant 303 243 81 S tcA/tcG - IMPACT=LOW;STRAND=1
chr1_69511_A/G chr1:69511 G ENSG00000186092 ENST00000335137 Transcript upstream_gene_variant 457 421 141 T/A Aca/Gca - IMPACT=MODERATE;STRAND=1
with many different ENSG numbers, such as ENSG00000187583, etc. Each ENSG identifier contains 11 digits.
I have to count how many intron_variant and upstream_gene_variant entries each gene (ENSGxxx) contains,
and output the result to a CSV file.
I use a dictionary for this purpose. The logic should be: if these 11 digits are not in the dictionary, they should be added with a value of 1; if they are already in the dictionary, the value should be incremented to x + 1. I tried to write the code below, but I am not really a Python programmer and am not sure about the correct syntax.
with open(file, 'rt') as f:
    data = f.readlines()

Count = 0
d = {}
for line in data:
    if line[0] == "#":
        output.write(line)
    if line.__contains__('ENSG'):
        d[line.split('ENSG')[1][0:11]] = 1
        if 1 in d:
            d = 1
        else:
            Count += 1
Any suggestions?
Thank you!
Can you try this:
from collections import Counter

with open('data.txt') as fp:
    ensg = []
    for line in fp:
        idx = line.find('ENSG')
        if not line.startswith('#') and idx != -1:
            ensg.append(line[idx+4:idx+15])

count = Counter(ensg)
>>> count
Counter({'00000187961': 2, '00000187583': 2})
Update
I need to know how many ENSGs contain "intron_variant" and "upstream_gene_variant".
Use regex to extract desired patterns:
from collections import Counter
import re

PAT_ENSG = r'ENSG(?P<ensg>\d{11})'
PAT_VARIANT = r'(?P<variant>intron_variant|upstream_gene_variant)'
PATTERN = re.compile(fr'{PAT_ENSG}.*\b{PAT_VARIANT}\b')

with open('data.txt') as fp:
    ensg = []
    for line in fp:
        sre = PATTERN.search(line)
        if not line.startswith('#') and sre:
            ensg.append(sre.groups())

count = Counter(ensg)
Output:
>>> count
Counter({('00000186092', 'upstream_gene_variant'): 2,
('00000186092', 'intron_variant'): 1})
Here's another interpretation of your requirement.
I have modified your sample data so that the first ENSG value is ENSG00000187971, to highlight how this works.
D = {}
with open('eng.txt') as eng:
    for line in eng:
        if not line.startswith('#'):
            t = line.split()
            V = t[6]
            E = t[3]
            if not V in D:
                D[V] = {}
            if not E in D[V]:
                D[V][E] = 1
            else:
                D[V][E] += 1
print(D)
The output of this is:
{'intron_variant': {'ENSG00000187971': 1, 'ENSG00000187961': 1}, 'upstream_gene_variant': {'ENSG00000187583': 2}}
So what you have now is a dictionary keyed by variant. Each variant has its own dictionary keyed by the ENSG values, with a count of occurrences of each ENSG value.
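Neither snippet writes the CSV output the question asks for. A minimal sketch of that last step, assuming an input file named data.txt laid out like the sample (gene in column 4, consequence in column 7) and an output file named counts.csv:

import csv
from collections import defaultdict, Counter

counts = defaultdict(Counter)          # gene -> Counter of consequences
wanted = {'intron_variant', 'upstream_gene_variant'}

with open('data.txt') as fp:
    for line in fp:
        if line.startswith('#'):
            continue                    # skip comment and header lines
        fields = line.split()
        gene, consequence = fields[3], fields[6]
        if consequence in wanted:
            counts[gene][consequence] += 1

with open('counts.csv', 'w', newline='') as out:
    writer = csv.writer(out)
    writer.writerow(['gene', 'intron_variant', 'upstream_gene_variant'])
    for gene, c in counts.items():
        writer.writerow([gene, c['intron_variant'], c['upstream_gene_variant']])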

Separating values from text file in python

I am trying to read in a text file with the following data:
362 147
422 32
145 45
312 57
35 421
361 275
and I want to separate the values into pairs, so 362 and 147 would be pair 1, 422 and 32 pair 2, and so on.
However, I run into a problem with the fifth pair, which should be 35, 421, but for some reason my code does not split this pair correctly. I think this is because of the spaces, since only this pair has a two-digit number followed by a three-digit number, but I'm not sure how to fix it. Here's my code:
def __init__(filename):
    f = open(filename, "r")  # reads file
    #print(f.read)  # test if file was actually read
    f1 = f.readlines()  # reads individual lines
    counter = 0
    for line in f1:
        values = line.split(" ")  # splits the two values for each line into an array
        value1 = values[0].strip()  # .strip removes spaces around each value
        value2 = values[1].strip()
        counter = counter + 1
        print('\npair: {}'.format(counter))
        #print(values)
        print(value1)
        print(value2)
The output I get:
pair: 1
362
147
pair: 2
422
32
pair: 3
145
45
pair: 4
312
57
pair: 5
35
pair: 6
361
275
Try this:
def __init__(filename):
    with open(filename, "r") as f:
        lines = [i.strip() for i in f.readlines()]
    for line_num, line in enumerate(lines):
        p1, p2 = [i for i in line.split() if i]
        print(f"pair: {line_num+1}\n{p1}\n{p2}\n\n")
Note: always try to use with open(); that way Python takes care of closing the file automatically at the end.
The problem with your code is that you're not checking whether the items produced by splitting each line are empty strings. If you print values for each line, you'd notice that for pair 5 it is ['', '35', '421\n']; the first element is an empty string. You can change your code to this:
def __init__(filename):
    f = open(filename, "r")  # reads file
    #print(f.read)  # test if file was actually read
    f1 = f.readlines()  # reads individual lines
    counter = 0
    for line in f1:
        values = line.split()  # .split() with no argument splits on any whitespace and drops empty strings
        values = [i for i in values if i]  # removes any remaining empty strings
        value1 = values[0].strip()  # .strip removes spaces around each value
        value2 = values[1].strip()
        counter = counter + 1
        print('\npair: {}'.format(counter))
        #print(values)
        print(value1)
        print(value2)
Change this line:
values = line.split(" ")
to:
values = line.split()
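For reference, the difference is that str.split(' ') keeps the empty strings produced by a leading or doubled space, while str.split() with no argument splits on any run of whitespace and discards them. A quick interactive check (the leading space mimics the ' 35 421' line implied by the ['', '35', '421\n'] output above):

>>> ' 35 421'.split(' ')
['', '35', '421']
>>> ' 35 421'.split()
['35', '421']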

How to make Python read new lines and just lines?

I know that Python can read numbers like:
8
5
4
2
2
6
But I am not sure how to make it read it like:
8 5 4 2 2 6
Also, is there a way to make python read both ways? For example:
8 5 4
2
6
I think reading with new lines would be:
info = open("info.txt", "r")
lines = info.readlines()
info.close()
How can I change the code so it would read downwards and to the sides like in my third example above?
I have a program like this:
info = open("1.txt", "r")
lines = info.readlines()
numbers = []
for l in lines:
num = int(l)
numbers.append(str(num**2))
info.close()
info = open("1.txt", "w")
for num in numbers:
info.write(num + "\n")
info.close()
How can I make the program read each number separately in new lines and in just lines?
Keeping them as strings:
with open("info.txt") as fobj:
numbers = fobj.read().split()
Or, converting them to integers:
with open("info.txt") as fobj:
numbers = [int(entry) for entry in fobj.read().split()]
This works with one number and several numbers per line.
This file content:
1
2
3 4 5
6
7
8 9 10
11
will result in this output for numbers:
[1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11]
This approach reads the whole file at once. Unless your file is really large this is fine.
info = open("1.txt", "r")
lines = info.readlines()
numbers = []
for line in lines:
for num_str in line.split(' '):
num = int(num_str)
numbers.append(str(num**2))
info.close()
info = open("test.txt", "r")
lines = info.readlines()
numbers = []
for l in lines:
l = l.strip()
lSplit = l.split(' ')
if len(lSplit) == 1:
num = int(l)
numbers.append(str(num**2))
else:
for num in lSplit:
num2 = int(num)
numbers.append(str(num2**2))
print numbers
info.close()
A good way to do this is with a generator that iterates over the lines, and, for each line, yields each of the numbers on it. This works fine if there is only one number on the line (or none), too.
def numberfile(filename):
    with open(filename) as input:
        for line in input:
            for number in line.split():
                yield int(number)
Then you can just write, for example:
for n in numberfile("info.txt"):
    print(n)
If you don't care how many numbers per line, then you could try this to create the list of the squares of all the numbers.
I have simplified your code a bit by iterating over the open file directly inside a with statement. Iterating over the readlines() result works just as well for small files; for large files, this method avoids holding the whole file content in memory.
numbers = []
with open("1.txt", 'r') as f:
    for line in f:
        nums = line.split()
        for n in nums:
            numbers.append(str(int(n)**2))
Just another not yet posted way...
numbers = []
with open('info.txt') as f:
    for line in f:
        numbers.extend(map(int, line.split()))
file_ = """
1 2 3 4 5 6 7 8
9 10
11
12 13 14
"""
for number in file_ .split():
print number
>>
1
2
3
4
5
6
7
8
9
10
11
12
13
14
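Putting this back into the original squaring program, a minimal Python 3 sketch that reads the numbers regardless of how they are split across lines (assuming, as in the question, that the file is 1.txt and the squares should be written back one per line):

with open("1.txt") as info:
    numbers = [int(tok) for tok in info.read().split()]  # handles one or many numbers per line

with open("1.txt", "w") as info:
    for num in numbers:
        info.write(str(num ** 2) + "\n")                 # write the squares, one per line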

Finding the rating of words using python

This is my program. It displays the value if I give the complete name; for example, if I type eng, it will show me only eng with its value.
import re

sent = "eng"
#sent = raw_input("Enter word")
#regex = re.compile('(^|\W)sent(?=(\W|$))')
for line in open("sir_try.txt").readlines():
    if sent == line.split()[0].strip():
        k = line.rsplit(',', 1)[0].strip()
        print k
gene name utr length
ensbta 24
ensg1 12
ensg24 30
ensg37 65
enscat 22
ensm 30
Actually, what I want to do is search for the highest value in the text file (not by exact word) and delete from the file all rows of the same word whose value is lower than that maximum; from the text above it should delete 12 and 30 for ensg. Then it should find the minimum of the remaining utr values and display it with its name.
What you people are answering I have already done, and I mentioned that before by showing my program.
Please try this:
file=open("sir_try.txt","r")
list_line=file.readlines()
file.close()
all_text=""
dic={}
sent="ensg"
temp_list=[]
for line in list_line:
all_text=all_text+line
name= line.rsplit()[0].strip()
score=line.rsplit()[1].strip()
dic[name]=score
for i in dic.keys():
if sent in i:
temp_list.append(dic[i])
hiegh_score=max(temp_list)
def check(index):
reverse_text=all_text[index+1::-1]
index2=reverse_text.find("\n")
if sent==reverse_text[:index2+1][::-1][1:len(sent)+1]:
return False
else:
return True
list_to_min=dic.values()
for i in temp_list:
if i!=hiegh_score:
index=all_text.find(str(i))
while check(index):
index=all_text.find(str(i),index+len(str(i)))
all_text=all_text[0:index]+all_text[index+len(str(i)):]
list_to_min.remove(str(i))
#write all text to "sir_try.txt"
file2=open("sir_try.txt","w")
file2.write(all_text)
file2.close()
min_score= min(list_to_min)
for j in dic.keys():
if min_score==dic[j]:
print "min score is :"+str(min_score)+" for person "+j
The check function works around a bug in the solution. To explain: when your file is
gene name utr length
ali 12
ali87 30
ensbta 24
ensg1 12
ensg24 30
ensg37 65
enscat 22
ensm 30
the program would delete ali's score even though it is not an ensg gene.
By adding the check function I solved that,
and this version is the final answer.
Instead of if sent ==, try if sent in (line.split()[0].strip()):
That should check whether the value of sent (ensg) appears anywhere in the argument (line.split()[0].strip()) in this case.
If you're still trying to only take the highest value, I would just create a variable value, then something along the lines of
if line.split()[1].strip() > value:
    value = line.split()[1].strip()
Test that out and let us know how it works for you.
To find out the name (first column) with the maximum value associated (second column), you need to first split the lines at the whitespace between name and value. Then you can find the maximum value using the built-in max() function. Let it take the value column as sorting criterion. You can then easily find out the corresponding name.
Example:
file_content = """
gene name utr length
ensbta 24
ensg1 12
ensg24 30
ensg37 65
enscat 22
ensm 30
"""
# split lines at whitespace
l = [line.split() for line in file_content.splitlines()]
# skip headline and empty lines
l = [line for line in l if len(line) == 2]
print l
# find the maximum of second column
max_utr_length_tuple = max(l, key=lambda x:x[1])
print max_utr_length_tuple
print max_utr_length_tuple[0]
the output is:
$ python test.py
[['ensbta', '24'], ['ensg1', '12'], ['ensg24', '30'], ['ensg37', '65'], ['enscat', '22'], ['ensm', '30']]
['ensg37', '65']
ensg37
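One caveat: the values are still strings, so max() compares them lexicographically. That happens to work here ('65' beats '30'), but a value like '9' would also beat '65'. Converting the key to an integer is safer:

# compare utr lengths numerically rather than as strings
max_utr_length_tuple = max(l, key=lambda x: int(x[1]))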
Short and sweet:
In [01]: t=file_content.split()[4:]
In [02]: b=((zip(t[0::2], t[1::2])))
In [03]: max(b, key=lambda x:x[1])
Out[03]: ('ensg37', '65')
import operator

f = open('./sir_try.txt', 'r')
f = f.readlines()
del f[0]

gene = {}
matched_gene = {}
for line in f:
    words = line.strip().split(' ')
    words = [word for word in words if not word == '']
    gene[words[0]] = words[1]

# getting user input
user_input = raw_input('Enter gene name: ')
for gene_name, utr_length in gene.iteritems():
    if user_input in gene_name:
        matched_gene[gene_name] = utr_length
m = max(matched_gene.iteritems(), key=operator.itemgetter(1))[0]
print m, matched_gene[m]  # expected answer

# code to remove redundant gene names as per requirement
for key in matched_gene.keys():
    if not key == m:
        matched_gene.pop(key)
for key in gene.keys():
    if user_input in key:
        gene.pop(key)
final_gene = dict(gene.items() + matched_gene.items())

out = open('./output.txt', 'w')
out.write('gene name' + '\t\t' + 'utr length' + '\n\n')
for key, value in final_gene.iteritems():
    out.write(key + '\t\t\t\t' + value + '\n')
out.close()
Output:
Enter gene name: ensg
ensg37 65
Since you have tagged your question regex,
here's something you might want to see; it's the only answer (at the moment) that uses regex!
import re

sent = 'ensg'  # your sequence
# regex that will "filter" the lines containing the value of sent
my_re = re.compile(r'(.*?%s.*?)\s+?(\d+)' % sent)

with open('stack.txt') as f:
    lines = f.read()  # get data from file

filtered = my_re.findall(lines)  # "filter" your data
print filtered

# get the desired tuple (the one with maximum "utr length")
max_tuple = max(filtered, key=lambda x: x[1])
print max_tuple
Output:
[('ensg1', '12'), ('ensg24', '30'), ('ensg37', '65')]
('ensg37', '65')

Reading and editing a font file and using a dictionary

I have to take the values from a text file which contains the co-ordinates to draw characters out in TurtleWorld. An example of the text file is the following:
<character=B, width=21, code=66>
4 21
4 0
-1 -1
4 21
13 21
16 20
17 19
18 17
18 15
17 13
16 12
13 11
-1 -1
4 11
13 11
16 10
17 9
18 7
18 4
17 2
16 1
13 0
4 0
</character>
I then have to write a function to take all of these points and convert them into a dictionary, where each key is a character and the corresponding value is the set of points that can be used to draw that character in TurtleWorld.
The code I have tried is the following:
def read_font():
    """
    Read the text from font.txt and convert the lines into instructions
    for how to plot specific characters.
    """
    filename = raw_input("\n\nInsert a file path to read the text of that file (or press any letter to use the default font.txt): ")
    if len(filename) == 1:
        filename = 'E:\words.txt'
        words = open(filename, 'r')
    else:
        words = open(filename, 'r')
    while True:  # Restarts the function if the file path is invalid
        line = words.readline()
        line = line.strip()
        if line[0] == '#' or line[0] == ' ':  # Used to omit any unwanted lines of text
            continue
        elif line[0] == '<' and line[1] == '/':  # Conditional used for the end of each character
            font_dictionary[character] = numbers_list
        elif line[0] == '<' and line[1] != '/':
Take a look at http://oreilly.com/catalog/pythonxml/chapter/ch01.html, specifically the example titled Example 1-1: bookhandler.py.
You can more or less copy that and tweak it to read your particular XML. Once you get the 'guts' (your coords), you can split them into a list of x/y coords really easily,
such as
a = "1 3\n23 4\n3 9\n"
coords = map(int,a.split())
then chunk it into a list with groups of 2 (see "How do you split a list into evenly sized chunks?")
and store the result: letters[letter] = result.
Or you can do the chunking differently using the re module:
>>> import re
>>> a = "1 13\n4 5\n"
>>> b = re.findall("\d+ *\d+", a)
>>> c = [map(int, item.split()) for item in b]
>>> c
[[1, 13], [4, 5]]
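If you don't want to bring in an XML parser for such a simple format, here is a minimal sketch of the dictionary-building function (assuming the file consists only of <character=...> blocks like the fragment above, plus optional blank or # comment lines, and that each value should be a list of (x, y) pairs):

import re

def read_font(filename):
    """Parse <character=X, ...> blocks into {char: [(x, y), ...]}."""
    font_dictionary = {}
    char, points = None, []
    with open(filename) as f:
        for line in f:
            line = line.strip()
            if not line or line.startswith('#'):
                continue                                   # skip blanks and comments
            m = re.match(r'<character=(.), width=\d+, code=\d+>', line)
            if m:                                          # start of a new character block
                char, points = m.group(1), []
            elif line.startswith('</character>'):          # end of the block
                font_dictionary[char] = points
            else:
                x, y = map(int, line.split())              # a coordinate pair
                points.append((x, y))
    return font_dictionary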
