I'm reading a file that has lines like these:
2SomethingHere
3Whatever
3Whatever
4foo
4bar
5baz
2SomethingHere
3Whatever
3Whatever
4foo
4bar
5baz
This is a test file, and I've been reading it like this:
file = open('data.txt', 'r')
contents = file.readlines()
to split the file into lines and collect them in a list. But I want to split this list into a list of lists like this:
main_list = [['2SomethingHere', '3Whatever', '3Whatever', '4foo', '4bar', '5baz'], ['2SomethingHere', '3Whatever', '3Whatever', '4foo', '4bar', '5baz']]
A 2 at the beginning of an element marks the start of a new list.
I've been trying this:
from itertools import groupby
result = [list(g) for k,g in groupby(contents,lambda x:x.startswith('2')) if k]
But the result is showing only the elements starting with 2
I want all the elements following this 2 until finding another.
If you know that the file will start with a 2 on the first line, then you can just do:
with open('data.txt', 'r') as file:
    contents = file.readlines()
print(contents)
main_list = []
for el in contents:
    if el.startswith("2"):
        main_list.append([])  # add a new sub-list
    main_list[-1].append(el.strip())  # add line (without leading/trailing whitespace) to the last sub-list
print(main_list)
but if it might not, then you would have to do something like:
main_list = [[]]
for el in contents:
    if el.startswith("2") and main_list[-1]:
        main_list.append([])
    main_list[-1].append(el.strip())
so that the start is handled a little differently. An initial sub-list is already present, ready for the items, even if the first line does not start with "2". If the first line does start with "2", the main_list[-1] check (an empty list is falsy) stops the loop from immediately moving on to a new sub-list, which would leave an empty sub-list at the start of the output.
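The same idea can be packaged as a small reusable function. This is only a sketch: the name split_on_prefix and its prefix parameter are made up for illustration and do not come from the question, and the sample lines are typed in directly instead of being read from data.txt.

```python
def split_on_prefix(lines, prefix="2"):
    """Group lines into sub-lists, opening a new one at each line that starts with prefix.

    Hypothetical helper: lines that come before the first prefixed line are
    kept in their own leading group instead of being dropped.
    """
    groups = []
    for raw in lines:
        line = raw.strip()
        if not line:
            continue  # ignore blank lines
        if line.startswith(prefix) or not groups:
            groups.append([])  # start a new sub-list
        groups[-1].append(line)
    return groups


# Sample data copied from the question, with the newlines readlines() would keep.
sample = ["2SomethingHere\n", "3Whatever\n", "3Whatever\n", "4foo\n", "4bar\n", "5baz\n",
          "2SomethingHere\n", "3Whatever\n", "3Whatever\n", "4foo\n", "4bar\n", "5baz\n"]
main_list = split_on_prefix(sample)
print(main_list)
```

Because a new sub-list is only opened when a header line arrives or when no sub-list exists yet, the function never produces an empty sub-list, whether or not the first line starts with "2".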
If you're trying to group consecutive lines by their first character, then:
import itertools
with open("test.txt", "r") as fp:
    lines = fp.readlines()
groups = itertools.groupby(lines, key=lambda line: line[:1])
results = [list(g) for k, g in groups if k]
print(results)
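The groupby attempt in the question returns only the lines starting with 2 because its key is True for those lines alone, and the "if k" filter throws every other run away. One way to keep using groupby, sketched here as an illustration and not as the approach any answer above proposes, is to key each line on the number of header lines seen so far, so that a header and all the lines after it share one key. The names block_ids and grouped are invented for the example.

```python
from itertools import accumulate, groupby

# Sample data copied from the question (already stripped of newlines).
lines = ["2SomethingHere", "3Whatever", "3Whatever", "4foo", "4bar", "5baz",
         "2SomethingHere", "3Whatever", "3Whatever", "4foo", "4bar", "5baz"]

# Running count of header lines: 1 for the whole first block, 2 for the second, ...
block_ids = list(accumulate(int(line.startswith("2")) for line in lines))

# Pair every line with its block id, group on the id, then drop the id again.
grouped = [
    [line for _, line in group]
    for _, group in groupby(zip(block_ids, lines), key=lambda pair: pair[0])
]
print(grouped)
```

Any lines that appear before the first header get the id 0 and end up in a group of their own.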
Related
I have text file like this small example:
>ENST00000491024.1|ENSG00000187583.6|OTTHUMG00000040756.4|OTTHUMT00000097942.2|PLEKHN1-003|PLEKHN1|176
SLESSPDAPDHTSETSHSPLYADPYTPPATSHRRVTDVRGLEEFLSAMQSARGPTPSSPLPSVPVSVPASDPRSCSSGPAGPYLLSKKGALQSRAAQRHRGSAKDGGPQPPDAPQLVSSAREGSPEPWLPLTDGRSPRRSRDPGYDHLWDETLSSSHQKCPQLGGPEASGGLVQWI
>ENST00000433179.2|ENSG00000187642.5|OTTHUMG00000040757.3|-|C1orf170-201|C1orf170|696
MPTQDGQLRRPARPPGPRAWMEPRGGGSSQFSSCPGPASSGDQMQRLLQGPAPRPPGEPPGSPKSPGHSTGSQRPPDSPGAPPRSPSRKKRRAVGAKGGGHTGASASAQTGSPLLPAASPETAKLMAKAGQEELGPGPAGAPEPGPRSPVQEDRPGPGLGLSTPVPVTEQGTDQIRTPRRAKLHTVSTTVWEALPDVSRAKSDMAVSTPASEPQPDRDMAVSTPASEPQSDRDMAVSTPASEPQPDTDMAVSTPASEPQPDRDMAVSIPASKPQSDTAVSTPASEPQSSVALSTPISKPQLDTDVAVSTPASKHGLDVALPTAGPVAKLEVASSPPVSEAVPRMTESSGLVSTPVPRADAAGLAWPPTRRAGPDVVEMEAVVSEPSAGAPGCCSGAPALGLTQVPRKKKVRFSVAGPSPNKPGSGQASARPSAPQTATGAHGGPGAWEAVAVGPRPHQPRILKHLPRPPPSAVTRVGPGSSFAVTLPEAYEFFFCDTIEENEEAEAAAAGQDPAGVQWPDMCEFFFPDVGAQRSRRRGSPEPLPRADPVPAPIPGDPVPISIPEVYEHFFFGEDRLEGVLGPAVPLPLQALEPPRSASEGAGPGTPLKPAVVERLHLALRRAGELRGPVPSFAFSQNDMCLVFVAFATWAVRTSDPHTPDAWKTALLANVGTISAIRYFRRQVGQGRRSHSPSPSS
>ENST00000341290.2|ENSG00000187642.5|OTTHUMG00000040757.3|OTTHUMT00000097943.2|C1orf170-001|C1orf170|676
MEPRGGGSSQFSSCPGPASSGDQMQRLLQGPAPRPPGEPPGSPKSPGHSTGSQRPPDSPGAPPRSPSRKKRRAVGAKGGGHTGASASAQTGSPLLPAASPETAKLMAKAGQEELGPGPAGAPEPGPRSPVQEDRPGPGLGLSTPVPVTEQGTDQIRTPRRAKLHTVSTTVWEALPDVSRAKSDMAVSTPASEPQPDRDMAVSTPASEPQSDRDMAVSTPASEPQPDTDMAVSTPASEPQPDRDMAVSIPASKPQSDTAVSTPASEPQSSVALSTPISKPQLDTDVAVSTPASKHGLDVALPTAGPVAKLEVASSPPVSEAVPRMTESSGLVSTPVPRADAAGLAWPPTRRAGPDVVEMEAVVSEPSAGAPGCCSGAPALGLTQVPRKKKVRFSVAGPSPNKPGSGQASARPSAPQTATGAHGGPGAWEAVAVGPRPHQPRILKHLPRPPPSAVTRVGPGSSFAVTLPEAYEFFFCDTIEENEEAEAAAAGQDPAGVQWPDMCEFFFPDVGAQRSRRRGSPEPLPRADPVPAPIPGDPVPISIPEVYEHFFFGEDRLEGVLGPAVPLPLQALEPPRSASEGAGPGTPLKPAVVERLHLALRRAGELRGPVPSFAFSQNDMCLVFVAFATWAVRTSDPHTPDAWKTALLANVGTISAIRYFRRQVGQGRRSHSPSPSS
>ENST00000428771.2|ENSG00000188290.6|OTTHUMG00000040758.2|OTTHUMT00000097945.2|HES4-002|HES4|247
MAADTPGKPSASPMAGAPASASRTPDKPRSAAEHRKVGSRPGVRGATGGREGRGTQPVPDPQSSKPVMEKRRRARINESLAQLKTLILDALRKESSRHSKLEKADILEMTVRHLRSLRRVQVTAALSADPAVLGKYRAGFHECLAEVNRFLAGCEGVPADVRSRLLGHLAACLRQLGPSRRPASLSPAAPAEAPAPEVYAGRPLLPSLGGPFPLLAPPLLPGLTRALPAAPRAGPQGPGGPWRPWLR
This file is split into different groups. Each group has 2 parts. The 1st part is a line that starts with ">" and whose elements are separated by "|", and the line after it is the 2nd part. I am trying to make a list in Python from my file that holds the 6th element of the ID part of each group. Here is the expected output for the small example:
list = ["PLEKHN1", "C1orf170", "C1orf170", "HES4"]
I am trying to first import into a dictionary and then make a list like expected output using:
from itertools import groupby
with open('infile.txt') as f:
    groups = groupby(f, key=lambda x: not x.startswith(">"))
    d = {}
    for k,v in groups:
        if not k:
            key, val = list(v)[0].rstrip(), "".join(map(str.rstrip,next(groups)[1],""))
            d[key] = val
k = d.keys()
res = [el[5:] for s in k for el in s.split('|')]
But it does not return what I am looking for. Do you know how to fix it?
Since these are clearly protein sequences in FASTA format, I suggest you use Biopython. It will save you time and be more robust than building your own parser:
from Bio import SeqIO
lst = [record.description.split('|')[5] for record in SeqIO.parse('in_file.fasta', 'fasta')]
print(lst)
# ['PLEKHN1', 'C1orf170', 'C1orf170', 'HES4']
Try this:
res = [s[5] for s in [el.split('|') for el in k]]
output:
['HES4', 'C1orf170', 'PLEKHN1', 'C1orf170']
You can get the tokens you want by reading every line in your file and selecting only the lines that start with '>'. Then you split each of those lines on the '|' character and take the 6th element. This code does that in one line:
with open('infile.txt') as f:
    tokens = [line.split('|')[5] for line in f if line.startswith('>')]
print(tokens)
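If some header lines might have fewer than six fields, indexing with [5] raises IndexError. The sketch below shows one way to guard against that. The function name gene_names and its field parameter are hypothetical, the headers are copied from the question, the sequence lines are shortened for the example, and the last header is a deliberately malformed one added for the demonstration.

```python
def gene_names(lines, field=5):
    """Return the chosen '|'-separated field of every header line (lines starting with '>')."""
    names = []
    for line in lines:
        if not line.startswith(">"):
            continue  # sequence line, not a header
        parts = line.rstrip("\n").split("|")
        if len(parts) > field:  # skip malformed headers instead of raising IndexError
            names.append(parts[field])
    return names


sample = [
    ">ENST00000491024.1|ENSG00000187583.6|OTTHUMG00000040756.4|OTTHUMT00000097942.2|PLEKHN1-003|PLEKHN1|176\n",
    "SLESSPDAPDHTSETSHSPLYADPYTPPATSH\n",  # shortened sequence
    ">ENST00000433179.2|ENSG00000187642.5|OTTHUMG00000040757.3|-|C1orf170-201|C1orf170|696\n",
    "MPTQDGQLRRPARPPGPRAWMEPRGGGSSQFS\n",  # shortened sequence
    ">ENST00000428771.2|ENSG00000188290.6|OTTHUMG00000040758.2|OTTHUMT00000097945.2|HES4-002|HES4|247\n",
    "MAADTPGKPSASPMAGAPASASRTPDKPRSAA\n",  # shortened sequence
    ">broken|header\n",  # invented malformed header, only here to exercise the guard
]
names = gene_names(sample)
print(names)
```

With a real file, the same function can be called as gene_names(f) inside a with open(...) block, since a file object yields its lines one at a time.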
I'm a little stuck here. I'm trying to read a data file in Python 3.
I want to make a list of lists
*The first 36 lines:
each line is a list that's appended to the main list
f = open("a.data","r")
h = []
a = []
for word in range(0,797):
    g = f.readline()
    h.append(g.strip())
    a.append(h)
    h = []
But from the 37th line and beyond:
I need a loop where this happens:
If the new line is a blank line, skip it
the next 4 lines should go into a new list 'h', and 'h' should then be appended to 'a'
The thing is that readline() acts crazy for everything I tried
Any suggestions?
Thanks in advance.
ps the strings in the 4 lines are divided by a ;
Try this:
import re
with open('a.data', 'r') as f:
    lst = re.split(r';|\n{1,2}', f.read())
length = 36
lstoflst = [lst[i:i+length] for i in range(0, len(lst)-1, length)]
print(lstoflst)
I read the whole file, split it at newlines and semicolons, and build the list of lists with a list comprehension.
Please consider a better data format for your next report, like csv if possible.
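For comparison, here is a sketch that follows the layout described in the question more literally. It rests on several assumptions that the question does not confirm: the first header_lines lines each become a one-item list, the remaining lines form blocks separated by blank lines, and every block is flattened into one list of its ';'-separated fields. The function name parse_data, its parameters, and the tiny demo string are all invented for the illustration.

```python
def parse_data(text, header_lines=36):
    """Hypothetical parser: one single-item list per header line, then one list per block."""
    lines = text.splitlines()
    result = [[line.strip()] for line in lines[:header_lines]]
    block = []
    for line in lines[header_lines:]:
        if line.strip():
            # Assumption: fields inside a block line are separated by ';'.
            block.extend(field.strip() for field in line.split(";"))
        elif block:
            result.append(block)  # a blank line closes the current block
            block = []
    if block:
        result.append(block)  # last block, when the text does not end with a blank line
    return result


# Tiny made-up input: 2 header lines, then two blank-line-separated blocks.
demo = "h1\nh2\n\na;b\nc;d\n\ne;f\n"
parsed = parse_data(demo, header_lines=2)
print(parsed)
```

Reading the whole file with f.read() and passing the text in avoids mixing readline() calls with manual line counting, which is where this kind of loop usually goes wrong.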
I want to generate a list of server addresses and credentials by reading them from a file, as a single list split on the newlines in the file.
The file is in this format:
login:username
pass:password
destPath:/directory/subdir/
ip:10.95.64.211
ip:10.95.64.215
ip:10.95.64.212
ip:10.95.64.219
ip:10.95.64.213
The output I want looks like this:
[['login:username', 'pass:password', 'destPath:/directory/subdirectory', 'ip:10.95.64.211;ip:10.95.64.215;ip:10.95.64.212;ip:10.95.64.219;ip:10.95.64.213']]
I tried this:
with open('file') as f:
    credentials = [x.strip().split('\n') for x in f.readlines()]
and this returns lists within a list:
[['login:username'], ['pass:password'], ['destPath:/directory/subdir/'], ['ip:10.95.64.211'], ['ip:10.95.64.215'], ['ip:10.95.64.212'], ['ip:10.95.64.219'], ['ip:10.95.64.213']]
I am new to Python. How can I split by the newline character and create a single list? Thank you in advance.
You could do it like this
with open('servers.dat') as f:
    L = [[line.strip() for line in f]]
print(L)
Output
[['login:username', 'pass:password', 'destPath:/directory/subdir/', 'ip:10.95.64.211', 'ip:10.95.64.215', 'ip:10.95.64.212', 'ip:10.95.64.219', 'ip:10.95.64.213']]
Just use a list comprehension to read the lines. You don't need to split on \n, as the regular file iterator reads line by line. The double list is a bit unconventional; just remove the outer [] if you decide you don't want it.
I just noticed you wanted the list of ip addresses joined in one string. It's not clear, as it's off the screen in the question and you make no attempt to do it in your own code.
To do that, read the first three lines individually using next, then join up the remaining lines using ; as your delimiter.
def reader(f):
    yield next(f)
    yield next(f)
    yield next(f)
    yield ';'.join(ip.strip() for ip in f)
with open('servers.dat') as f:
    L2 = [[line.strip() for line in reader(f)]]
For which the output is
[['login:username', 'pass:password', 'destPath:/directory/subdir/', 'ip:10.95.64.211;ip:10.95.64.215;ip:10.95.64.212;ip:10.95.64.219;ip:10.95.64.213']]
It does not match your expected output exactly, because the expected output contains a typo: 'destPath:/directory/subdirectory' instead of 'destPath:/directory/subdir/' from the data.
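The reader above relies on the first three lines always being login, pass and destPath. If that order is not guaranteed, one alternative is to sort the lines by their prefix. This is a sketch under the assumption that every address line starts with "ip:"; the name read_credentials is made up, and io.StringIO stands in for the real file so the example runs on its own.

```python
import io


def read_credentials(f):
    """Keep non-ip lines as they are and join all 'ip:' lines into one ';'-separated string."""
    others, ips = [], []
    for line in f:
        line = line.strip()
        if not line:
            continue  # ignore blank lines
        if line.startswith("ip:"):
            ips.append(line)
        else:
            others.append(line)
    if ips:
        others.append(";".join(ips))
    return [others]  # wrapped in an outer list to match the requested shape


# Stand-in for the file, using the first lines of the data from the question.
sample = io.StringIO(
    "login:username\npass:password\ndestPath:/directory/subdir/\n"
    "ip:10.95.64.211\nip:10.95.64.215\n"
)
creds = read_credentials(sample)
print(creds)
```

With the real file, the call would be read_credentials(f) inside a with open('servers.dat') as f: block.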
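This should work:
def read_file(path='file'):  # wrapped in a function so that return is valid; the name is arbitrary
    arr = []
    with open(path) as f:
        for line in f:
            arr.append(line.rstrip('\n'))  # drop the trailing newline
    return [arr]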
You could just treat the file as an iterable of lines and go through it with a for loop:
arr = []
with open('file', 'r') as f:
    for line in f:
        arr.append(line.strip('\n'))
There are several dictionaries in the variable highscores. I need to sort it by its key values, and sorted() isn't working.
global highscores
f = open('RPS.txt', 'r')
highscores = [line.strip() for line in f]
sorted(highscores)
highscores = reverse=True[:5]
for line in f:
    x = line.strip()
    print(x)
f.close()
this is the error:
TypeError: 'bool' object is not subscriptable
sorted(v) returns a new list containing the elements of v in order; it does not sort v in place, so calling it without using the result has no effect. You can use the result directly in a for loop to process the elements one at a time:
for k in sorted(elements): ...
You can transform each element and store the result in a list:
v = [f(k) for k in sorted(elements)]
Or you can just keep the sorted list itself:
v = sorted(elements)
Note that in the code above, elements are strings from a file, not a dictionary.
The following should do what (I think) you want:
with open('RPS.txt', 'r') as f:  # will automatically close f
    highscores = [line.strip() for line in f]
highscores = sorted(highscores, reverse=True)[:5]
for line in highscores:
    print(line)
The primary problem was the way you were using sorted(): its return value was discarded, and reverse=True has to be passed to it as a keyword argument instead of being subscripted on its own. At the end, iterating over the lines of the file again won't work, because a file object is exhausted once it has been read through. What the code above does is sort the lines read from the file and then take the first 5 of that list, which is saved in highscores. Following that, it prints them. There's no need to strip the lines again; that was taken care of when the file was first read.
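One more thing to watch for: the lines read from the file are strings, so sorted() compares them character by character. If each line of RPS.txt holds a single whole number (an assumption, since the file contents are not shown in the question), passing key=int sorts them by value instead. The list below is made-up sample data standing in for the stripped lines of the file.

```python
# Made-up stand-in for the stripped lines of RPS.txt.
lines = ["9", "10", "100", "25", "3", "7"]

# Plain string comparison: "9" sorts above "100" because "9" > "1".
as_text = sorted(lines, reverse=True)[:5]

# Comparing by numeric value gives the real top five.
as_numbers = sorted(lines, key=int, reverse=True)[:5]

print(as_text)
print(as_numbers)
```

If the lines contain more than a bare number, for example a name followed by a score, the key function would need to pull out the numeric part first.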
I want my program to read from a .txt file, which has data in its lines arranged like this:
NUM NUM NAME NAME NAME. How could I read its lines into a list so that each line becomes an element of the list, and each element would have its first two values as ints and the other three as strings?
So the first line from the file: 1 23 Joe Main Sto should become lst[0] = [1, 23, "Joe", "Main", "Sto"].
I already have this, but it doesn't work perfectly and I'm sure there must be a better way:
read = open("info.txt", "r")
line = read.readlines()
text = []
for item in line:
    fullline = item.split(" ")
    text.append(fullline)
Use str.split() without an argument to have whitespace collapsed and removed for you automatically, then apply int() to the first two elements:
with open("info.txt", "r") as read:
    lines = []
    for item in read:
        row = item.split()
        row[:2] = map(int, row[:2])
        lines.append(row)
Note that here we loop directly over the file object, so there is no need to read all lines into memory first.
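As a list comprehension (in Python 3, map() returns an iterator, so it has to be wrapped in list() before it can be concatenated with a list):
with open("info.txt") as f:
    text = [list(map(int, l.split()[:2])) + l.split()[2:] for l in f]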
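If the file might contain short or malformed lines, it can help to move the conversion into a small function that fails with a clear message. This is a sketch: the name parse_row is invented, the first sample line comes from the question, and the second one is made up to show that extra whitespace is handled.

```python
def parse_row(line):
    """Turn 'NUM NUM NAME NAME NAME' into [int, int, str, str, str]."""
    fields = line.split()  # no argument: collapses runs of whitespace, drops the newline
    if len(fields) < 2:
        raise ValueError(f"expected at least two numeric fields, got: {line!r}")
    # int() itself raises ValueError if either of the first two fields is not a number.
    return [int(fields[0]), int(fields[1]), *fields[2:]]


sample = ["1 23 Joe Main Sto\n", "  2   7 Ann Elm Rd\n"]  # second line is invented
lst = [parse_row(line) for line in sample]
print(lst)
```

With the real file this becomes lst = [parse_row(line) for line in f] inside a with open("info.txt") as f: block.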