Make a list in python from a FASTA text file - python

I have a text file like this small example:
>ENST00000491024.1|ENSG00000187583.6|OTTHUMG00000040756.4|OTTHUMT00000097942.2|PLEKHN1-003|PLEKHN1|176
SLESSPDAPDHTSETSHSPLYADPYTPPATSHRRVTDVRGLEEFLSAMQSARGPTPSSPLPSVPVSVPASDPRSCSSGPAGPYLLSKKGALQSRAAQRHRGSAKDGGPQPPDAPQLVSSAREGSPEPWLPLTDGRSPRRSRDPGYDHLWDETLSSSHQKCPQLGGPEASGGLVQWI
>ENST00000433179.2|ENSG00000187642.5|OTTHUMG00000040757.3|-|C1orf170-201|C1orf170|696
MPTQDGQLRRPARPPGPRAWMEPRGGGSSQFSSCPGPASSGDQMQRLLQGPAPRPPGEPPGSPKSPGHSTGSQRPPDSPGAPPRSPSRKKRRAVGAKGGGHTGASASAQTGSPLLPAASPETAKLMAKAGQEELGPGPAGAPEPGPRSPVQEDRPGPGLGLSTPVPVTEQGTDQIRTPRRAKLHTVSTTVWEALPDVSRAKSDMAVSTPASEPQPDRDMAVSTPASEPQSDRDMAVSTPASEPQPDTDMAVSTPASEPQPDRDMAVSIPASKPQSDTAVSTPASEPQSSVALSTPISKPQLDTDVAVSTPASKHGLDVALPTAGPVAKLEVASSPPVSEAVPRMTESSGLVSTPVPRADAAGLAWPPTRRAGPDVVEMEAVVSEPSAGAPGCCSGAPALGLTQVPRKKKVRFSVAGPSPNKPGSGQASARPSAPQTATGAHGGPGAWEAVAVGPRPHQPRILKHLPRPPPSAVTRVGPGSSFAVTLPEAYEFFFCDTIEENEEAEAAAAGQDPAGVQWPDMCEFFFPDVGAQRSRRRGSPEPLPRADPVPAPIPGDPVPISIPEVYEHFFFGEDRLEGVLGPAVPLPLQALEPPRSASEGAGPGTPLKPAVVERLHLALRRAGELRGPVPSFAFSQNDMCLVFVAFATWAVRTSDPHTPDAWKTALLANVGTISAIRYFRRQVGQGRRSHSPSPSS
>ENST00000341290.2|ENSG00000187642.5|OTTHUMG00000040757.3|OTTHUMT00000097943.2|C1orf170-001|C1orf170|676
MEPRGGGSSQFSSCPGPASSGDQMQRLLQGPAPRPPGEPPGSPKSPGHSTGSQRPPDSPGAPPRSPSRKKRRAVGAKGGGHTGASASAQTGSPLLPAASPETAKLMAKAGQEELGPGPAGAPEPGPRSPVQEDRPGPGLGLSTPVPVTEQGTDQIRTPRRAKLHTVSTTVWEALPDVSRAKSDMAVSTPASEPQPDRDMAVSTPASEPQSDRDMAVSTPASEPQPDTDMAVSTPASEPQPDRDMAVSIPASKPQSDTAVSTPASEPQSSVALSTPISKPQLDTDVAVSTPASKHGLDVALPTAGPVAKLEVASSPPVSEAVPRMTESSGLVSTPVPRADAAGLAWPPTRRAGPDVVEMEAVVSEPSAGAPGCCSGAPALGLTQVPRKKKVRFSVAGPSPNKPGSGQASARPSAPQTATGAHGGPGAWEAVAVGPRPHQPRILKHLPRPPPSAVTRVGPGSSFAVTLPEAYEFFFCDTIEENEEAEAAAAGQDPAGVQWPDMCEFFFPDVGAQRSRRRGSPEPLPRADPVPAPIPGDPVPISIPEVYEHFFFGEDRLEGVLGPAVPLPLQALEPPRSASEGAGPGTPLKPAVVERLHLALRRAGELRGPVPSFAFSQNDMCLVFVAFATWAVRTSDPHTPDAWKTALLANVGTISAIRYFRRQVGQGRRSHSPSPSS
>ENST00000428771.2|ENSG00000188290.6|OTTHUMG00000040758.2|OTTHUMT00000097945.2|HES4-002|HES4|247
MAADTPGKPSASPMAGAPASASRTPDKPRSAAEHRKVGSRPGVRGATGGREGRGTQPVPDPQSSKPVMEKRRRARINESLAQLKTLILDALRKESSRHSKLEKADILEMTVRHLRSLRRVQVTAALSADPAVLGKYRAGFHECLAEVNRFLAGCEGVPADVRSRLLGHLAACLRQLGPSRRPASLSPAAPAEAPAPEVYAGRPLLPSLGGPFPLLAPPLLPGLTRALPAAPRAGPQGPGGPWRPWLR
This file is split into groups. Each group has two parts: the first part starts with ">" and its elements are separated by "|", and the line after it is the second part. I am trying to make a list in Python containing the 6th element of the ID part of each group. Here is the expected output for the small example:
list = ["PLEKHN1", "C1orf170", "C1orf170", "HES4"]
I am trying to first import into a dictionary and then make a list like expected output using:
from itertools import groupby
with open('infile.txt') as f:
    groups = groupby(f, key=lambda x: not x.startswith(">"))
    d = {}
    for k,v in groups:
        if not k:
            key, val = list(v)[0].rstrip(), "".join(map(str.rstrip,next(groups)[1],""))
            d[key] = val
k = d.keys()
res = [el[5:] for s in k for el in s.split('|')]
But it does not return what I am looking for. Do you know how to fix it?

Since these are clearly protein sequences in FASTA format, I suggest you use Biopython. It will save you time and be more robust than building your own parser:
from Bio import SeqIO
lst = [record.description.split('|')[5] for record in SeqIO.parse('in_file.fasta', 'fasta')]
print(lst)
# ['PLEKHN1', 'C1orf170', 'C1orf170', 'HES4']

Try this:
res = [s[5] for s in [el.split('|') for el in k]]
output:
['HES4', 'C1orf170', 'PLEKHN1', 'C1orf170']

You can get the tokens you want by reading every line in your file and selecting only the lines that start with '>'. Then you split each of those lines on the '|' character and take the 6th element. This code does that in one line:
with open('infile.txt') as f:
    tokens = [line.split('|')[5] for line in f if line.startswith('>')]
print(tokens)
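If you need the sequences as well, or a sequence may wrap over several lines, the same "starts with '>'" test can drive a small parser. This is only a minimal sketch, not a replacement for Biopython: the name read_fasta and the choice to yield (gene, sequence) pairs are my own assumptions, and it assumes every header has at least six '|'-separated fields.

```python
def read_fasta(lines):
    """Yield (gene_name, sequence) pairs from an iterable of FASTA lines."""
    gene, chunks = None, []
    for line in lines:
        line = line.strip()
        if not line:
            continue  # ignore blank lines
        if line.startswith('>'):
            if gene is not None:
                # A new header closes the previous record
                yield gene, ''.join(chunks)
            # Drop the leading '>' and take the 6th '|'-separated field
            gene, chunks = line[1:].split('|')[5], []
        else:
            chunks.append(line)
    if gene is not None:
        yield gene, ''.join(chunks)


# Hypothetical, shortened records using the header layout from the question
sample = [
    ">ENST00000491024.1|ENSG00000187583.6|OTTHUMG00000040756.4|OTTHUMT00000097942.2|PLEKHN1-003|PLEKHN1|176\n",
    "SLESSPDAPD\n",
    "HTSETSHSPL\n",
    ">ENST00000428771.2|ENSG00000188290.6|OTTHUMG00000040758.2|OTTHUMT00000097945.2|HES4-002|HES4|247\n",
    "MAADTPGKPS\n",
]
records = list(read_fasta(sample))
genes = [gene for gene, _ in records]
print(genes)
```

With a real file you would pass the open file object instead of the sample list, since a file object is also an iterable of lines.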

Related

How do I split a txt file based on a condition of a certain element in a certain order list (Python)

So I need to split a txt file into a dictionary.
The txt file could look like this:
Keyone -2
key-two 1
Key'Three -3
Key four-here 5
I think I would need to check the list reversed to check if the second to last element is either a " " or a "-", but since there could be "-" between the words in the string, I'm a bit confused as to how to approach this.
I need the dict to look like [str(key); int(value)]
My attempts so far look like:
for line in file:
    a = line.split()
    value = a[-1]
    key = line[0:-2]
    key = key.replace("-", "")
Try the following code:
# Define input
txt = "Keyone -2\nkey-two 1\nKey'Three -3\nKey four-here 5"
print(txt)
# Split the text by newlines
lines = txt.split('\n')
print(lines)
# Iterate over all lines
d = {}
for line in lines:
    # Split only once, at the last space:
    # the key is everything before it, the value is the element after it
    key, value = line.rsplit(' ', 1)
    # Assuming it can only be an integer
    value = int(value)
    d[key] = value
print(d)
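# The same idea, reading the lines from a file
d = {}
with open("text.txt") as f:
    for line in f:
        # Skip blank lines
        if not line.strip():
            continue
        # Split once at the last run of whitespace: key on the left, value on the right
        key, value = line.rsplit(None, 1)
        d[key] = int(value)
print(d)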
The following is the answer using regex:
import re
data_to_parse = """
Keyone -2
key-two 1
Key'Three -3
Key four-here 5
"""
data_to_parse = data_to_parse.splitlines()
pattern = r" -?\d"
new = {}
for line in data_to_parse:
    x = re.findall(pattern, line)
    if x:
        new[line[:line.find(x[0])].strip()] = line[line.find(x[0]):].strip()
print(new)
See the output:
{'Keyone': '-2', 'key-two': '1', "Key'Three": '-3', 'Key four-here': '5'}
EDITED:
If the values need to be integers, change the assignment line as follows:
new[line[:line.find(x[0])].strip()] = int(line[line.find(x[0]):].strip())
The output then becomes:
{'Keyone': -2, 'key-two': 1, "Key'Three": -3, 'Key four-here': 5}
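One thing to watch with a pattern that is searched from the left: a key that itself contains a space followed by a digit would be split too early. Anchoring the pattern at the end of the line avoids that. This is just a sketch under the assumption that every value is a single integer at the end of the line; parse_pairs, LINE_RE and the extra "Key 4 here 7" test line are names and data I made up for illustration.

```python
import re

# Key: everything up to the last run of whitespace; value: a trailing integer
LINE_RE = re.compile(r"^(.*\S)\s+(-?\d+)$")


def parse_pairs(text):
    """Return a dict mapping each key to its trailing integer value."""
    result = {}
    for line in text.splitlines():
        match = LINE_RE.match(line.strip())
        if match:  # blank or malformed lines are skipped
            result[match.group(1)] = int(match.group(2))
    return result


data = "Keyone -2\nkey-two 1\nKey'Three -3\nKey four-here 5\nKey 4 here 7\n"
parsed = parse_pairs(data)
print(parsed)
```

Because the first group is greedy, it takes as much of the line as it can, so only the last number ends up as the value.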

Splitting a list until finding an element

I'm reading a file that has lines like these:
2SomethingHere
3Whatever
3Whatever
4foo
4bar
5baz
2SomethingHere
3Whatever
3Whatever
4foo
4bar
5baz
This is a test file, and I've been reading it like this:
file = open('data.txt', 'r')
contents = file.readlines()
to get the lines into a list. But I want to split this list into a list of lists like this:
main_list = [['2SomethingHere', '3Whatever', '3Whatever', '4foo', '4bar', '5baz'], ['2SomethingHere', '3Whatever', '3Whatever', '4foo', '4bar', '5baz']]
An element beginning with 2 marks the start of a new list.
I've been trying this:
from itertools import groupby
result = [list(g) for k,g in groupby(contents,lambda x:x.startswith('2')) if k]
But the result contains only the elements starting with 2.
I want all the elements following this 2 until the next one is found.
If you know that the file will start with a 2 on the first line, then you can just do:
file = open('data.txt', 'r')
contents = file.readlines()
print(contents)
main_list = []
for el in contents:
    if el.startswith("2"):
        main_list.append([])  # add a new sub-list
    main_list[-1].append(el.strip())  # add line (without leading/trailing whitespace) to the last sub-list
print(main_list)
but if it might not, then you would have to do something like:
main_list = [[]]
for el in contents:
    if el.startswith("2") and main_list[-1]:
        main_list.append([])
    main_list[-1].append(el.strip())
so that the start is handled a little bit differently: an initial sublist is already present ready for the items, even if the first line does not start with "2", but if the first line does start with 2, then it does not immediately move onto a new sub-list (which would leave an empty sub-list at the start of the output).
If you're trying to group the lines by their first character (groupby only merges consecutive lines that share a key), then:
import itertools
with open("test.txt", "r") as fp:
    lines = fp.readlines()
groups = itertools.groupby(lines, key=lambda line: line[:1])
results = [list(g) for k, g in groups if k]
print(results)
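The append-a-new-sub-list loop generalises to any "start a new group when a marker line appears" problem. Here is a minimal sketch of that idea; split_at_marker and its is_start parameter are names I chose for illustration, and it assumes that lines seen before the first marker should be kept in a group of their own.

```python
def split_at_marker(lines, is_start):
    """Split lines into groups; each group begins at a line where is_start is true."""
    groups = []
    for line in lines:
        line = line.strip()
        if not line:
            continue  # ignore blank lines
        # Open a new group at a marker line, or for the very first line
        if is_start(line) or not groups:
            groups.append([])
        groups[-1].append(line)
    return groups


contents = ["2SomethingHere\n", "3Whatever\n", "4foo\n", "5baz\n",
            "2SomethingHere\n", "3Whatever\n", "4bar\n"]
main_list = split_at_marker(contents, lambda s: s.startswith("2"))
print(main_list)
```

Passing the marker test as a function means the same loop works for other start conditions without changes.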

Converting user details stored in a text file into a dictionary

I have tried converting the text file into a dictionary using the code below:
d = {}
with open('staff.txt', 'r') as file:
    for line in file:
        (key, val) = line.split()
        d[str(key)] = val
print(d)
The contents in the file staff.txt:
username1 jaynwauche
password1 juniornwauche123
e_mail1 juniornwauche#gmail.com
Fullname1 Junior Nwauche
Error: too many values to unpack
What am I doing wrong?
According to your file, the last line has three words. Splitting it on whitespace gives three values, but you only have two variables to unpack them into.
You need to limit the number of splits. line.split() already splits on whitespace, so passing ' ' on its own does not help; also pass a maximum of one split, so that everything after the first space stays in the value:
d = {}
with open('staff.txt', 'r') as file:
    for line in file:
        (key, val) = line.strip().split(' ', 1)
        d[str(key)] = val
print(d)
This splits each line only at the first space, so the key is the first word and the value is the rest of the line.
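Another option is str.partition, which always returns exactly three parts (before, separator, after), so the unpacking cannot fail however many words the value has. A minimal sketch, assuming the key never contains a space; parse_staff is a name I made up, and the sample lines are a subset of the file from the question.

```python
def parse_staff(lines):
    """Build a dict from 'key value...' lines, splitting at the first space only."""
    details = {}
    for line in lines:
        key, sep, value = line.strip().partition(' ')
        if sep:  # skip lines that contain no space at all
            details[key] = value.strip()
    return details


sample = [
    "username1 jaynwauche\n",
    "password1 juniornwauche123\n",
    "Fullname1 Junior Nwauche\n",
]
d = parse_staff(sample)
print(d)
```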

Python reading file and analysing lines with substring

In Python, I'm reading a large file with many many lines. Each line contains a number and then a string such as:
[37273738] Hello world!
[83847273747] Hey my name is James!
And so on...
After I read the txt file and put it into a list, I was wondering how I would be able to extract the number and then sort that whole line of code based on the number?
file = open("info.txt","r")
myList = []
for line in file:
    line = line.split()
    myList.append(line)
What I would like to do:
Since the number in message one falls between 37273700 and 38000000, I'll sort that (along with all other lines that follow that rule) into a separate list
This does exactly what you need (for the sorting part):
my_sorted_list = sorted(myList, key=lambda line: int(line[0][1:-1]))
Use a tuple with the number as its first element:
for line in file:
    line = line.split()
    keyval = (int(line[0].replace('[', '').replace(']', '')), line[1:])
    print(keyval)
    myList.append(keyval)
Sort:
my_sorted_list = sorted(myList, key=lambda line: line[0])
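Whichever way you extract the number, it matters whether the sort key is a string or an integer, because strings compare character by character. A small illustration (the list is made up; the "9" is there only to make the difference visible):

```python
ids = ["83847273747", "37273738", "9"]

# Compared as text: '3' < '8' < '9', regardless of length
as_text = sorted(ids)

# Compared as numbers
as_numbers = sorted(ids, key=int)

print(as_text)
print(as_numbers)
```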
How about:
import re
# ---
# Function which gets a number from a line like so:
# - searches for the pattern: start_of_line, [, sequence of digits, ]
# - if that's not found (e.g. empty line) return 0
# - if it is found, try to convert it to a number type
# - return the number, or 0 if that conversion fails
def extract_number(line):
    search_result = re.findall(r'^\[(\d+)\]', line)
    if not search_result:
        num = 0
    else:
        try:
            num = int(search_result[0])
        except ValueError:
            num = 0
    return num
# ---
# Read all the lines into a list
with open("info.txt") as f:
    lines = f.readlines()
# Sort them using the number function above, and print them
lines = sorted(lines, key=extract_number)
print(''.join(lines))
It's more resilient in the case of lines without numbers, and it's easier to adjust if the numbers might appear in different places (e.g. spaces at the start of the line).
(Obligatory suggestion not to use file as a variable name: in Python 2 it shadows the builtin file type, and that's confusing.)
Now that there's an extract_number() function, it's easier to filter:
lines2 = [L for L in lines if 37273700 < extract_number(L) < 38000000]
print(''.join(lines2))
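If you want both lists in one pass (the lines inside the range and everything else), something along these lines might do. It is only a sketch: split_by_range and NUMBER_RE are names I invented, the range is treated as low <= number < high (an assumption, adjust to taste), and the third sample line is made up.

```python
import re

NUMBER_RE = re.compile(r"^\[(\d+)\]")


def split_by_range(lines, low, high):
    """Return (inside, outside): lines whose number is in [low, high), and the rest."""
    inside, outside = [], []
    for line in lines:
        match = NUMBER_RE.match(line)
        number = int(match.group(1)) if match else None
        if number is not None and low <= number < high:
            inside.append(line)
        else:
            outside.append(line)  # also collects lines without a number
    return inside, outside


lines = [
    "[83847273747] Hey my name is James!\n",
    "[37273900] Another line\n",
    "[37273738] Hello world!\n",
]
inside, outside = split_by_range(lines, 37273700, 38000000)
# Every line in `inside` matched the pattern, so the lookup below is safe
inside.sort(key=lambda line: int(NUMBER_RE.match(line).group(1)))
print(''.join(inside))
```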

Highest to Lowest from a textfile?

I'm pretty new to python and I have been having trouble in trying to print out a score list in the form of highest to lowest. The scorelist is saved in a text file and is set out like this...
Jax:6
Adam:10
Rav:2
I have looked in books but I haven't been getting anywhere, does anyone know how I could go about receiving the scores in the form of highest to lowest from a textfile. Thank You.
I am using Python 3.3.2 version.
Try it like this:
with open("your_file") as f:
    my_dict = {}
    for x in f:
        x = x.strip().split(":")
        my_dict[x[0]] = x[1]
print(sorted(my_dict.items(), key=lambda x: int(x[1]), reverse=True))
First, you need to load the file (let's say its name is file.txt), then read the values, sort them, and print them. It's not as difficult as it seems.
This version works only when the scores are unique:
# init a dictionary where you store the results
results = {}
# open the file with results in a "read" mode
with open("file.txt", "r") as fileinput:
    # for each line in file with results, do following
    for line in fileinput:
        # remove whitespaces at the end of the line and split the line by ":"
        items = line.strip().split(":")
        # store the result, the score will be the key
        results[int(items[1])] = items[0]
# sort the scores (keys of results dictionary) in descending order
sorted_results = sorted(results.keys(), reverse=True)
# for each score in sorted_results do the following
for i in sorted_results:
    # print the result in the format of the scores in your file
    print("{}:{}".format(results[i], i))
The steps are explained in the example code.
The links to the relevant documentation or examples follow:
Sorting
Printing (string.format())
Dictionary data structure (dict)
Reading and writing files
EDIT:
This version works even when several scores have the same value.
(Thanks to @otorrillas for pointing out the problem)
# init a list where you store the results
results = []
# open the file with results in a "read" mode
with open("file.txt", "r") as fileinput:
    # for each line in file with results, do following
    for line in fileinput:
        # remove whitespaces at the end of the line and split the line by ":"
        items = line.strip().split(":")
        # store the result as a list of tuples
        results.append(tuple(items))
# first sort all the tuples in the `results` list by the second item (score, as a number)
# for each result record in sorted results list do the following
for result_item in sorted(results, key=lambda x: int(x[1]), reverse=True):
    # print the result in the format of the scores in your file
    print("{}:{}".format(result_item[0], result_item[1]))
Comments in the code describe what it does. The main difference is that the code no longer uses a dict and stores tuples in a list instead. It also sorts by a key.
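For Python 3, the same approach can be wrapped in a small function that converts the score to an int once and breaks ties by name. This is a sketch rather than the one true way: sorted_scores is a name I made up, the tie-break rule is my assumption, and "Zoe:6" is an invented line added to show a tie.

```python
def sorted_scores(lines):
    """Return (name, score) tuples, highest score first, ties ordered by name."""
    scores = []
    for line in lines:
        line = line.strip()
        if not line:
            continue  # ignore blank lines
        # Split at the last ':' so a name containing ':' would still work
        name, _, score = line.rpartition(":")
        scores.append((name, int(score)))
    # Negating the score sorts it in descending order while names stay ascending
    return sorted(scores, key=lambda item: (-item[1], item[0]))


sample = ["Jax:6\n", "Adam:10\n", "Rav:2\n", "Zoe:6\n"]
ranking = sorted_scores(sample)
for name, score in ranking:
    print("{}:{}".format(name, score))
```

With a real file, pass the open file object instead of the sample list.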
Just for fun: if all you need to do is sort the data from a file you can use the UNIX sort command
sort -k 2 -t : -n -r $your_file
(the arguments are: sort by second key, split fields by ':', numeric sort, reverse order).
tldr
sorted([l.rstrip().split(':') for l in open('d.d')], key=lambda i:int(i[1]))
You need to operate on the lines in the file, that you can get simply as
[l for l in open('FILE')]
but possibly without the new lines
[l.rstrip() for l in open('FILE')]
and eventually split over the : colon character
[l.rstrip().split(':') for l in open('FILE')]
so that you have obtained a list of lists
>>> print([l.rstrip().split(':') for l in open('FILE')])
[['Jax', '6'], ['Adam', '10'], ['Rav', '2']]
that is the thing that you want to have sorted. Specifically,
you want to sort it according to the numerical value of the 2nd field
>>> print([int(r[1]) for r in [l.rstrip().split(':') for l in open('FILE')]])
[6, 10, 2]
The sorted builtin accepts the optional argument key, a function to extract the part to compare in each element of the iterable to be sorted
>>> sd = sorted([l.rstrip().split(':') for l in open('FILE')], key=lambda r: int(r[1]))
>>> print(sd)
[['Rav', '2'], ['Jax', '6'], ['Adam', '10']]
Pass reverse=True as well to get the scores from highest to lowest.
and that's all folks...
