Creating a dictionary from FASTA file - python

I have a file that looks like this:
%Labelinfo
string1
string2
%Labelinfo2
string3
string4
string5
I would like to create a dictionary whose keys are the %Labelinfo lines and whose values are the concatenation of the strings between one Labelinfo line and the next. Basically this:
{%Labelinfo : string1+string2 , %Labelinfo : string2+string3+string4}
The problem is that there can be any number of lines between two "Labelinfo" lines. For example, there might be 5 lines between %Labelinfo and %Labelinfo2, and then 4 lines between %Labelinfo2 and %Labelinfo3.
However, the line that contains "Labelinfo" always starts with the same character, for example %.
How do I solve this problem?

#!/usr/bin/env python
# coding:utf-8
'''黄哥Python'''
d = {}
with open('Labelinfo.txt') as f:
    for line in f:
        if len(line) > 1:  # skip empty lines (a lone "\n")
            if '%Labelinf' in line:
                key = line.strip()
                d[key] = ""
            else:
                d[key] += line.strip() + "+"
# drop the trailing "+" from every value
d = {key: d[key][:-1] for key in d}
print(d)
{'%Labelinfo2': 'string3+string4+string5', '%Labelinfo': 'string1+string2'}

Here's how I would write it:
The program loops through every line in the file and checks whether the line is empty; if it is, the line is ignored. Otherwise we process it. Any line that starts with % denotes a label, so we add it to the dictionary as a new key and remember it in a variable, current. Every following line is then appended to the dictionary entry at key current, until the next % line.
di = {}
with open("fasta.txt","r") as f:
    current = ""
    for line in f:
        line = line.strip()
        if line == "":
            continue
        if line[0] == "%":
            di[line] = ""
            current = line
        else:
            if di[current] == "":
                di[current] = line
            else:
                di[current] += "+" + line
print(di)
Output:
{'%Labelinfo2': 'string3+string4+string5', '%Labelinfo': 'string1+string2'}
Note: Before Python 3.7, dictionaries did not preserve order, so the keys may print out of order, but they are still accessible in the same way. And, just a heads up, your example output is slightly wrong: you forgot to put the 2 after one of the %Labelinfo keys.

import re
d = {}
text = open('fasta.txt').read()
for el in [ x for x in re.split(r'\s+', text) if x]:
    if el.startswith('%'):
        key = el
        d[key] = ''
    else:
        value = d[key] + el
        d[key] = value
print(d)
{'%Labelinfo': 'string1string2', '%Labelinfo2': 'string3string4string5'}
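The answers above share one pattern: remember the most recent header line and attach every following line to it. Here is a minimal sketch of that pattern as a function, assuming the header marker is % and the pieces are joined with + as in the question. The name parse_labeled_lines and its parameters are made up for illustration, and it reads from any iterable of lines so it can be tried without a file:

```python
import io

def parse_labeled_lines(lines, marker="%", sep="+"):
    """Map each header line to the joined lines that follow it (hypothetical helper)."""
    groups = {}
    current = None
    for raw in lines:
        line = raw.strip()
        if not line:
            continue  # skip blank lines
        if line.startswith(marker):
            current = line
            groups[current] = []
        elif current is not None:
            groups[current].append(line)
        # lines that appear before the first header are ignored
    return {key: sep.join(parts) for key, parts in groups.items()}

# In-memory stand-in for the file shown in the question
sample = io.StringIO(
    "%Labelinfo\nstring1\nstring2\n%Labelinfo2\nstring3\nstring4\nstring5\n"
)
result = parse_labeled_lines(sample)
print(result)
```

Collecting the pieces in a list and joining once at the end avoids having to trim a trailing separator afterwards.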

Related

How to print out lines longer than a specific length

I have an input file like this:
#sample1
ATGGTTCCAAGGCCTTGGTTAATTGGGGGGTTTTTTTTTTTTTTTTTTT
#sample2
TTGGAACCTTGGCCAATTAAGGGGGGGGGTTTTTTTCCCCCCCCCCCCC
#sample3
GGTTGGTTGGGAATTTGGTTAACCTTTTTAAATTTTTTTTTTTGGGGGG
AATTTTTTTTTTTTTGG
I want to print out the sequences that have a specific minimum length. For example, if the minimum length I want is 66, then the output will be:
#sample3
GGTTGGTTGGGAATTTGGTTAACCTTTTTAAATTTTTTTTTTTGGGGGG
AATTTTTTTTTTTTTGG
Since only the sequence of sample 3 has the minimum length of 66.
Below is my code so far:
fastfile = {}
with open(sys.argv[1]) as f:
    for line in f:
        line = line.strip()
        if not line:
            continue
        if line.startswith("#"):
            sequencenumber = line[1:]
            if sequencenumber not in fastfile:
                fastfile[sequencenumber] = []
            continue
        sequence = line
        fastfile[sequencenumber].append(sequence)
output = []
for key, value in fastfile.items():
    if len(value) >= sys.argv[2]:
        output.append(value)
print (output)
Argv[1] is the path of the input file and argv[2] is the specific minimum length.
You want the values of the fastfile dictionary to be strings, not lists, so instead of appending consecutive sequences to a running list, you need to concatenate them to a running string:
import sys

fastfile = {}
with open(sys.argv[1]) as f:
    for line in f:
        line = line.strip()
        if not line:
            continue
        if line[0] == "#":
            sequencenumber = line[1:]
            if sequencenumber not in fastfile:
                fastfile[sequencenumber] = ""
            continue
        fastfile[sequencenumber] += line
# sys.argv[2] is a string, so convert it before comparing it with a length
min_length = int(sys.argv[2])
output = []
for key, value in fastfile.items():
    if len(value) >= min_length:
        output.append(value)
print(output)
Or, if you need to store the strings in a list as you originally did, use "".join(value) to concatenate all the strings together, like so:
output = []
for key, value in fastfile.items():
    # int() is needed because sys.argv[2] is a string
    if len("".join(value)) >= int(sys.argv[2]):
        output.append("".join(value))
print(output)
This looks much simpler:
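from sys import argv

with open(argv[1]) as fin:
    text = fin.read()
min_length = int(argv[2])
# each part looks like "name\nsequence line\nsequence line\n";
# the piece before the first '#' is dropped
parts = text.split('#')[1:]
# choose only the parts whose joined sequence lines reach min_length
parts = [p for p in parts if len(''.join(p.split('\n')[1:])) >= min_length]
output = ''.join('#' + p for p in parts)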
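Another way to look at the problem is to first turn the file into (name, sequence) records and only then filter by length. The following is a rough sketch under that assumption; read_records is a hypothetical helper, and the short sequences and the minimum length of 8 are made-up stand-ins for the real data so the result is easy to check by eye:

```python
def read_records(lines, marker="#"):
    """Yield (name, sequence) pairs from FASTA-like lines (hypothetical helper)."""
    name, chunks = None, []
    for raw in lines:
        line = raw.strip()
        if not line:
            continue  # skip blank lines
        if line.startswith(marker):
            if name is not None:
                yield name, "".join(chunks)  # finish the previous record
            name, chunks = line[len(marker):], []
        elif name is not None:
            chunks.append(line)
    if name is not None:
        yield name, "".join(chunks)  # finish the last record

# Made-up sample data: s2 spans two lines, 6 + 2 = 8 characters in total
sample = ["#s1", "ACGT", "#s2", "ACGTAC", "GT"]
min_length = 8
long_records = [(n, s) for n, s in read_records(sample) if len(s) >= min_length]
for name, seq in long_records:
    print("#" + name)
    print(seq)
```

Because the length test is applied to the joined sequence, a record split across several short lines is still counted as one long sequence.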

How do I compare a value in one line to a value in another line?

I have a file that puts out lines that have two values each. I need to compare the second value in every line to make sure those values are not repeated more than once. I'm very new to coding so any help is appreciated.
My thinking was to turn each line into a list with two items each, and then I could compare the same position from a couple lists.
This is a sample of what my file contains:
20:19:18 -1.234567890
17:16:15 -1.098765432
14:13:12 -1.696969696
11:10:09 -1.696969696
08:07:06 -1.696969696
Here's the code I'm trying to use. Basically I want it to ignore those first two lines and print out the third line, since it gets repeated more than once:
with open('my_file') as txt:
    for line in txt: #this section turns the file into lists
        linelist = '%s' % (line)
        lista = linelist.split(' ')
    n = 1
    for line in lista:
        listn = line[n]
        listo = line[n + 1]
        listp = line[n + 2]
        if listn[1] == listo[1] and listn[1] == listp[1]:
            print line
        else:
            pass
        n += 1
What I want to see is:
14:13:12 -1.696969696
But I keep getting an error on the long if statement of "string index out of range"
You would be a lot better off using a dictionary-type structure. A dictionary lets you quickly check whether a key exists.
Basically, check whether the 2nd value is already a key in your dict. If it is, print the line. Otherwise, add the 2nd value as a key for later.
myDict = {}
with open('/home/dmoraine/pylearn/%s' % (file)) as txt:
    for line in txt:
        key = line.split()[1]
        if key in myDict:
            print(line)
        else:
            myDict[key] = None #value doesn't matter
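Note that the loop above prints every later occurrence of a repeated value (the 11:10:09 and 08:07:06 lines in the sample), whereas the question asks for the first line of the repeated group. One possible way to get that, sketched here with collections.Counter and two passes over the lines; the list named lines is an in-memory copy of the sample from the question, used instead of a file:

```python
from collections import Counter

# In-memory copy of the sample file from the question
lines = [
    "20:19:18 -1.234567890",
    "17:16:15 -1.098765432",
    "14:13:12 -1.696969696",
    "11:10:09 -1.696969696",
    "08:07:06 -1.696969696",
]

# First pass: count how often each second field occurs.
counts = Counter(line.split()[1] for line in lines)

# Second pass: keep the first line of every value that occurs more than once.
seen = set()
first_repeats = []
for line in lines:
    value = line.split()[1]
    if counts[value] > 1 and value not in seen:
        seen.add(value)
        first_repeats.append(line)

print(first_repeats)
```

The values are compared as strings here, which sidesteps any floating-point rounding questions as long as the file always writes a given value with the same digits.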
Some simple debugging highlights the functional problem:
with open('my_file.txt') as txt:
    for line in txt: #this section turns the file into lists
        linelist = '%s' % (line)
        lista = linelist.split(' ')
        print(linelist, lista)
    n = 1
    for line in lista:
        print("line", n, ":\t", line)
        listn = line[n]
        listo = line[n + 1]
        listp = line[n + 2]
        print(listn, '|',listo, '|',listp)
        if listn[1] == listo[1] and listn[1] == listp[1]:
            print(line)
        n += 1
Output:
20:19:18 -1.234567890
['20:19:18', '-1.234567890\n']
17:16:15 -1.098765432
['17:16:15', '-1.098765432\n']
14:13:12 -1.696969696
['14:13:12', '-1.696969696\n']
11:10:09 -1.696969696
['11:10:09', '-1.696969696\n']
08:07:06 -1.696969696
['08:07:06', '-1.696969696\n']
line 1 : 08:07:06
8 | : | 0
In short, you've mis-handled the variables. When you get to the second loop, lista is the "words" of the final line; you've read and discarded all of the others. line iterates through these individual words. Your listn/o/p variables are, therefore, individual characters. Thus, there is no such thing as listn[1], and you get an error.
Instead, you need to build some sort of list of the floating-point numbers. For instance, using your top loop as a starting point:
float_list = []
for line in txt: #this section turns the file into lists
    lista = line.split(' ')
    my_float = float(lista[1]) # Convert the second field into a float
    float_list.append(my_float)
Now you need to write code that will find duplicates in float_list. Can you take it from there?
Ended up turning each line into a list, and then making a dictionary of all the lists. Thank you all for your help.

How to parse a "here document" in Python?

I want to write a Python method that reads a text file with key-values:
FOO=BAR
BUZ=BLEH
I also want to support newlines, both through quoting with \n and through here-docs:
MULTILINE1="This\nis a test"
MULTILINE2= <<DOC
This
is a test
DOC
While the first one is easy to implement, I'm struggling with the second. Is there maybe something in Python's stdlib (e.g. shlex) that I can use already?
"test.txt" content:
FOO=BAR
BUZ=BLEH
MULTILINE1="This\nis a test"
MULTILINE2= <<DOC
This
is a test
DOC
Function:
def read_strange_file(filename):
    with open(filename) as f:
        file_content = f.read().splitlines()
    res = {}
    key, value, delim = "", "", ""
    for line in file_content:
        if "=" in line and not delim:
            # split on the first "=" only, so values may contain "="
            key, value = line.split("=", 1)
            if value.strip(" ").startswith("<<"):
                delim = value.strip(" ")[2:]  # extracting delimiter keyword
                value = ""
                continue
        if not delim or (delim and line == delim):
            if value.startswith("\"") and value.endswith("\""):
                # [1: -1] delete quotes
                value = bytes(value[1: -1], "utf-8").decode("unicode_escape")
            if delim:
                value = value[:-1]  # delete "\n"
            res[key] = value
            delim = ""
        if delim:
            # inside a here-doc: collect the line
            value += line + "\n"
    return res
Usage:
result = read_strange_file("test.txt")
print(result)
Output:
{'FOO': 'BAR', 'BUZ': 'BLEH', 'MULTILINE1': 'This\nis a test', 'MULTILINE2': 'This\nis a test'}
I'm assuming that this is the test string (i.e., there are unseen \n characters at the end of each line):
s = ''
s += 'MULTILINE1="This\nis a test"\n'
s += 'MULTILINE2= <<DOC\n'
s += 'This\n'
s += 'is a test\n'
s += 'DOC\n'
The best I can do is to cheat using NumPy:
import numpy as np
A = np.asarray([ss.rsplit('\n', 1) for ss in ('\n'+s).split('=')])
keys = A[:-1,1].tolist()
values = A[1:,0].tolist()
#optionally parse here-documents
di = 'DOC' #delimiting identifier
values = [v.strip().lstrip('<<%s\n'%di).rstrip('\n%s'%di) for v in values]
print('Keys: ', keys)
print('Values: ', values)
#if you want a dictionary:
d = dict( zip(keys, values) )
This results in:
Keys: ['MULTILINE1', 'MULTILINE2']
Values: ['"This\nis a test"', '"This\nis a test"']
It works by sneakily adding a \n character to the beginning of the string, splitting the whole string on = characters, and finally using rsplit to retain all values to the right of =, even when those values contain multiple \n characters. Printing the array A makes things clearer:
[['', 'MULTILINE1'],
['"This\nis a test"', 'MULTILINE2'],
[' <<DOC\nThis\nis a test\nDOC', '' ]]

How do you sort by ascending order from a txt file in python?

I am new to Python and have a hopefully simple question. I have a txt file with road names and lengths, e.g. Main street 56 miles. I am stumped on how to read that file in Python and sort all the road lengths in ascending order. Thanks for your help.
Let's assume that there is a "miles" after every number. (This code was untested when first posted, so feel free to edit it, but I think the idea is right.)
EDIT: THIS IS TESTED
import collections

originaldict = {}
newdict = collections.OrderedDict()

def isnum(string):
    # True for a single digit or a decimal point
    try:
        if string == ".":
            return True
        float(string)
        return True
    except Exception:
        return False

# input_file and output_file must be set to the paths of your files
for line in open(input_file, "r"):
    string = line[:line.find("miles") - 1]
    print(string)
    curnum = ""
    # walk backwards from the end to collect the number before "miles"
    for c in reversed(string):
        if not isnum(c):
            break
        curnum = c + curnum
    # setdefault keeps every road, even when two roads share a length
    originaldict.setdefault(float(curnum), []).append(line)
for num in sorted(originaldict):
    newdict[num] = originaldict[num]
with open(output_file, "a") as o:
    for value in newdict.values():
        for line in value:
            # each line read from the file may already end with a newline
            o.write(line.rstrip("\n") + "\n")
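The same idea, pulling out the number in front of "miles" and sorting on it, can also be sketched with a regular expression and a sort key. This is only an illustration under the same assumption that every line contains a number followed by "miles"; road_length is a hypothetical helper, and apart from "Main street 56 miles" the sample lines are made up:

```python
import re

# "Main street 56 miles" is from the question; the other lines are invented
lines = [
    "Main street 56 miles",
    "Oak avenue 12.5 miles",
    "Elm road 120 miles",
]

def road_length(line):
    """Return the number that precedes 'miles' as a float (hypothetical helper)."""
    match = re.search(r"(\d+(?:\.\d+)?)\s*miles", line)
    if match is None:
        raise ValueError("no length found in: %r" % line)
    return float(match.group(1))

# sorted() calls road_length once per line and orders by the returned number
sorted_lines = sorted(lines, key=road_length)
for line in sorted_lines:
    print(line)
```

Sorting on the float rather than on the text matters: as strings, "120" would sort before "56".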

Dictionaries overwriting in Python

This program is meant to take the grammar rules found in Binary.txt and store them in a dictionary, where the rules are:
N = N D
N = D
D = 0
D = 1
but the current code returns D: D = 1, N: N = D, whereas I want N: N D, N: D, D: 0, D: 1
import sys
import string
#default length of 3
stringLength = 3
#get last argument of command line(file)
filename1 = sys.argv[-1]
#get a length from user
try:
    stringLength = int(input('Length? '))
    filename = input('Filename: ')
except ValueError:
    print("Not a number")
#checks
print(stringLength)
print(filename)
def str2dict(filename="Binary.txt"):
    result = {}
    with open(filename, "r") as grammar:
        #read file
        lines = grammar.readlines()
        count = 0
        #loop through
        for line in lines:
            print(line)
            result[line[0]] = line
    print (result)
    return result
print (str2dict("Binary.txt"))
Firstly, your data structure of choice is wrong. A dictionary in Python is a simple key-to-value mapping. What you'd like is a map from a key to multiple values. For that you'll need:
from collections import defaultdict
result = defaultdict(list)
Next, where are you splitting on '='? You'll need to do that to get the key and value you are looking for:
key, value = line.split('=', 1) #Returns a list, which gets unpacked into 2 variables
Putting the above two together, you'd go about in the following way:
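from collections import defaultdict

def str2dict(filename="Binary.txt"):
    result = defaultdict(list)
    with open(filename, "r") as grammar:
        #read file
        lines = grammar.readlines()
        #loop through
        for line in lines:
            print(line)
            if '=' not in line:
                continue  # skip blank or malformed lines
            key, value = line.split('=', 1)
            result[key.strip()].append(value.strip())
    return result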
Dictionaries, by definition, cannot have duplicate keys. Therefore there can only ever be a single 'D' key. You could, however, store a list of values at that key if you'd like. Ex:
from collections import defaultdict
# rest of your code...
def str2dict(filename="Binary.txt"):
    result = defaultdict(list) # Use defaultdict so that an insert to an empty key creates a new list automatically
    with open(filename, "r") as grammar:
        #read file
        lines = grammar.readlines()
        count = 0
        #loop through
        for line in lines:
            print(line)
            result[line[0]].append(line)
    print (result)
    return result
This will result in something like:
{"N" : ["N = N D", "N = D"], "D" : ["D = 0", "D = 1"]}
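If you would rather not import anything, a plain dict with setdefault can play the same role as defaultdict. A small sketch, assuming the four rules from the question are already available as a list of strings (the names rules and grammar are made up for illustration), and storing only the right-hand side of each rule as the question asks:

```python
# The four rules from the question, as they would appear line by line
rules = ["N = N D", "N = D", "D = 0", "D = 1"]

grammar = {}
for rule in rules:
    # split on the first "=" only: left side is the key, right side the value
    left, right = rule.split("=", 1)
    # setdefault creates the empty list the first time a key is seen
    grammar.setdefault(left.strip(), []).append(right.strip())

print(grammar)
```

Each key now maps to a list of right-hand sides, so a second rule for the same symbol is appended instead of overwriting the first.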
