I have this code:
while i<len(line):
    if re.findall(pattern, line[i]):
        k,v = line[i].split('=')
        print k
        token = dict(k=v)
        print token
        break
and the result I'm getting is:
ptk
{'k': 'ptk_first'}
How can I make these few lines of code nicer and get a dictionary that looks like this:
{'ptk': 'ptk_first'}
for line in lines:
    if re.match(pattern, line):
        k,v = line.split('=')
        token = {k:v}
        print token
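The reason dict(k=v) produced {'k': 'ptk_first'} is that keyword-argument names are taken literally as the string 'k', whereas a dict display such as {k: v} evaluates the variable k. A minimal sketch in Python 3, assuming the sample line ptk=ptk_first from the question; the helper name parse_pair is hypothetical and not part of the original code:

```python
def parse_pair(line):
    """Split 'key=value' into a one-item dict (parse_pair is a hypothetical helper)."""
    # maxsplit=1 keeps any further '=' characters inside the value.
    k, v = line.split("=", 1)
    # A dict display evaluates k; dict(k=v) would use the literal name 'k'.
    return {k: v}


line = "ptk=ptk_first"
k, v = line.split("=", 1)
print(dict(k=v))         # the keyword name 'k' becomes the key
print(parse_pair(line))  # the value of k becomes the key
```

Passing 1 as the second argument to split is a defensive choice: without it, a value that itself contains '=' would yield more than two pieces and the unpacking would raise ValueError.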
Something like this:
lines="""\
key1=data on the rest of line 1
key2=data on the rest of line 2
key3=data on line 3"""
d={}
for line in lines.splitlines():
    k,v=line.split('=')
    d[k]=v
print d
In [112]: line="ptk=ptk_first"
In [113]: dict([line.split("=")])
Out[113]: {'ptk': 'ptk_first'}
For your code:
for line in lines:
    if re.findall(pattern, line):
        token = dict([line.split("=")])
        print token
With a regex you can try this:
>>> import re
>>> lines="""
... ptk=ptk_first
... ptk1=ptk_second
... """
>>> dict(re.findall(r'(\w+)=(\w+)',lines,re.M))
{'ptk1': 'ptk_second', 'ptk': 'ptk_first'}
I have a text file like this:
>ENST00000511961.1|ENSG00000013561.13|OTTHUMG00000129660.5|OTTHUMT00000370661.3|RNF14-003|RNF14|278
MSSEDREAQEDELLALASIYDGDEFRKAESVQGGETRIYLDLPQNFKIFVSGNSNECLQNSGFEYTICFLPPLVLNFELPPDYPSSSPPSFTLSGKWLSPTQLSALCKHLDNLWEEHRGSVVLFAWMQFLKEETLAYLNIVSPFELKIGSQKKVQRRTAQASPNTELDFGGAAGSDVDQEEIVDERAVQDVESLSNLIQEILDFDQAQQIKCFNSKLFLCSICFCEKLGSECMYFLECRHVYCKACLKDYFEIQIRDGQVQCLNCPEPKCPSVATPGQ
>ENST00000506822.1|ENSG00000013561.13|OTTHUMG00000129660.5|OTTHUMT00000370662.1|RNF14-004|GAPDH|132
MSSEDREAQEDELLALASIYDGDEFRKAESVQGGETRIYLDLPQNFKIFVSGNSNECLQNSGFEYTICFLPPLVLNFELPPDYPSSSPPSFTLSGKWLSPTQLSALCKHLDNLWEEHRGSVVLFAWMQFLKE
>ENST00000513019.1|ENSG00000013561.13|OTTHUMG00000129660.5|OTTHUMT00000370663.1|RNF14-005|ACTB|99
MSSEDREAQEDELLALASIYDGDEFRKAESVQGGETRIYLDLPQNFKIFVSGNSNECLQNSGFEYTICFLPPLVLNFELPPDYPSSSPPSFTLSGKWLS
>ENST00000356143.1|ENSG00000013561.13|OTTHUMG00000129660.5|-|RNF14-202|HELLE|474
MSSEDREAQEDELLALASIYDGDEFRKAESVQGGETRIYLDLPQNFKIFVSGNSNECLQNSGFEYTICFLPPLVLNFELPPDYPSSSPPSFTLSGKWLSPTQLSALCKHLDNLWEEHRGSVVLFAWMQFLKEETLAYLNIVSPFELKIGSQKKVQRRTAQASPNTELDFGGAAGSDVDQEEIVDERAVQDVESLSNLIQEILDFDQAQQIKCFNSKLFLCSICFCEKLGSECMYFLECRHVYCKACLKDYFEIQIRDGQVQCLNCPEPKCPSVATPGQVKELVEAELFARYDRLLLQSSLDLMADVVYCPRPCCQLPVMQEPGCTMGICSSCNFAFCTLCRLTYHGVSPCKVTAEKLMDLRNEYLQADEANKRLLDQRYGKRVIQKAL
I want to build a Python list containing the 6th element of each line that starts with ">".
To do so, I first build a dictionary whose keys are those lines, and then try to extract the list from the keys, like this:
from itertools import groupby
with open('infile.txt') as f:
    groups = groupby(f, key=lambda x: not x.startswith(">"))
    d = {}
    for k,v in groups:
        if not k:
            key, val = list(v)[0].rstrip(), "".join(map(str.rstrip,next(groups)[1],""))
            d[key] = val
k = d.keys()
res = [el[5:] for s in k for el in s.split("|")]
But it returns all the elements of the lines that start with ">".
Do you know how to fix it?
Here is the expected output:
["RNF14", "GAPDH", "ACTB", "HELLE"]
This should help. It uses simple iteration with str.startswith and str.split.
Demo:
res = []
with open(filename, "r") as infile:
    for line in infile:
        if line.startswith(">"):
            val = line.split("|")
            res.append(val[5])
print(res)
Output:
['RNF14', 'GAPDH', 'ACTB', 'HELLE']
In your code, replace
res = [el[5:] for s in k for el in s.split("|")]
with
res = [s.split("|")[5] for s in k ] #Should work.
A solution close to yours, using filter and map instead of groupby:
with open('infile.txt') as f:
    lines = f.readlines()
groups = filter(lambda x: x.startswith(">"), lines)
res = list(map(lambda x: x.split('|')[5],groups))
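The original comprehension returned every field because el[5:] slices each field string from its sixth character onward, instead of indexing the list of fields. A minimal Python 3 sketch of the indexing approach, assuming header lines shaped like those in the question; the function name gene_names and the shortened sequence lines are illustrative only:

```python
def gene_names(lines):
    """Return the 6th '|'-separated field of every line that starts with '>'."""
    return [
        line.rstrip("\n").split("|")[5]  # index 5 is the 6th field
        for line in lines
        if line.startswith(">")
    ]


# Headers copied from the question; the sequence lines are shortened stand-ins.
sample = [
    ">ENST00000511961.1|ENSG00000013561.13|OTTHUMG00000129660.5|OTTHUMT00000370661.3|RNF14-003|RNF14|278\n",
    "MSSEDREAQEDELLALASIYDGDEF\n",
    ">ENST00000506822.1|ENSG00000013561.13|OTTHUMG00000129660.5|OTTHUMT00000370662.1|RNF14-004|GAPDH|132\n",
    "MSSEDREAQEDELLALASIYDGDEF\n",
]
print(gene_names(sample))
```

Because the function accepts any iterable of lines, an open file object can be passed in place of the list.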
Suppose we have the following text file with column a and column b:
D000001 T109
D000001 T195
D000002 T115
D000002 T131
D000003 T073
D000004 T170
I wonder how to produce the following structure:
D000001 T109 T195
D000002 T115 T131
D000003 T073
D000004 T170
Pasted below is an initial skeleton in Python.
from __future__ import print_function
with open('descr2semtype_short.txt') as f:
    for line in f:
        line = line.rstrip()
        a, b = line.split()
        print(a + ' ' + b)
You can use itertools.groupby:
import itertools, operator
with open('descr2semtype_short.txt') as f:
    for key, items in itertools.groupby(
            (line.rstrip().split(None,1) for line in f),
            operator.itemgetter(0)):
        print(key, ' '.join(item[1] for item in items))
which gives the desired output:
D000001 T109 T195
D000002 T115 T131
D000003 T073
D000004 T170
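One caveat with the groupby approach: itertools.groupby only merges adjacent items with the same key, so it relies on the file already being sorted or grouped by the first column, as the sample is. A small Python 3 sketch of the difference, using hypothetical out-of-order rows that are not taken from the original file:

```python
from collections import OrderedDict
from itertools import groupby

# Hypothetical rows in which D000001 appears in two separate runs.
rows = [("D000001", "T109"), ("D000002", "T115"), ("D000001", "T195")]

# groupby starts a new group every time the key changes,
# so D000001 shows up as two separate groups here.
adjacent = [(key, [b for _, b in grp])
            for key, grp in groupby(rows, key=lambda r: r[0])]

# Accumulating into a mapping merges equal keys wherever they occur.
merged = OrderedDict()
for a, b in rows:
    merged.setdefault(a, []).append(b)

print(adjacent)
print(list(merged.items()))
```

If the input cannot be assumed to be grouped, either sort it first or accumulate into a mapping as in the second half of the sketch.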
Instead of printing each line as you read it, you can keep a dictionary keyed on the first element of the line, with a list of the second elements as the value (so that when another element arrives for the same key, you can append to it).
Then print everything at the end.
Example -
from __future__ import print_function
d = {}
with open('descr2semtype_short.txt') as f:
    for line in f:
        line = line.rstrip()
        a, b = line.split()
        if a not in d:
            d[a] = []
        d[a].append(b)
for k,v in d.iteritems():
    print(k + ' ' + ' '.join(v))
From Python 2.7 onwards, if the order of the lines is important, we can use an OrderedDict instead of a plain dictionary.
Example -
from __future__ import print_function
from collections import OrderedDict
d = OrderedDict()
with open('descr2semtype_short.txt') as f:
    for line in f:
        line = line.rstrip()
        a, b = line.split()
        if a not in d:
            d[a] = []
        d[a].append(b)
for k,v in d.items():
    print(k + ' ' + ' '.join(v))
I would do it with OrderedDict, this way:
from collections import OrderedDict
d = OrderedDict()
with open('1.txt', 'r') as f:
    for line in f:
        a,b = line.strip().split()
        print a,b
        if a not in d:
            d[a] = [b]
        else:
            d[a].append(b)
print d
Output:
OrderedDict([('D000001', ['T109', 'T109', 'T195']), ('D000002', ['T115', 'T115', 'T131']), ('D000003', ['T073', 'T073']), ('D000004', ['T170', 'T170', 'T175', 'T180'])])
I have the following input and am trying to write a regular expression that yields the output below. Can anyone provide input on how to do this?
INPUT:-
refs/changes/44/1025744/3
refs/changes/62/1025962/5
refs/changes/45/913745/2
OUTPUT:-
1025744/3
1025962/5
913745/2
If that is the actual input format, a regex is not needed:
>>> source = """\
... refs/changes/44/1025744/3
... refs/changes/62/1025962/5
... refs/changes/45/913745/2
... """
>>> output = [line.split('/', 3)[-1] for line in source.splitlines()]
>>> output[0]
'1025744/3'
>>> output[1]
'1025962/5'
You can also have them all in one string, like this:
>>> ' '.join(line.split('/', 3)[-1] for line in source.splitlines())
'1025744/3 1025962/5 913745/2'
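A variation on the same idea is to count fields from the right with str.rsplit, which keeps working even if the number of leading path components differs. This is only a sketch in Python 3, and the function name change_and_patchset is a made-up label for the last two components:

```python
def change_and_patchset(ref):
    """Return the last two '/'-separated components of ref, joined by '/'."""
    # rsplit with maxsplit=2 splits from the right: [prefix, change, patchset]
    return "/".join(ref.rsplit("/", 2)[-2:])


refs = [
    "refs/changes/44/1025744/3",
    "refs/changes/62/1025962/5",
    "refs/changes/45/913745/2",
]
print([change_and_patchset(r) for r in refs])
```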
If you are feeding the input line by line, you could do this:
import re

def get_match(instr):
    # Capture everything after the "refs/changes/<number>/" prefix.
    match = re.match("^(refs/changes/[0-9]*/)([0-9/]*)", instr)
    if match:
        return match.group(2)

instr = 'refs/changes/44/1025744/3'
print get_match(instr)
>>> import re
>>> input="""refs/changes/44/1025744/3
... refs/changes/62/1025962/5
... refs/changes/45/913745/2"""
>>> res=re.findall(r'.*/.*/.*/(.*/.*)',input)
>>> for i in res:
...     print i
...
1025744/3
1025962/5
913745/2
Let's say I want to convert the following to a dictionary where the first column holds the keys and the second column holds the values.
http://pastebin.com/29bXkYhd
The following code works for this (assume romEdges.txt is the name of the file):
f = open('romEdges.txt')
dic = {}
for l in f:
    k, v = l.split()
    if k in dic:
        dic[k].extend(v)
    else:
        dic[k] = [v]
f.close()
OK
But why doesn't the code work for this file?
http://pastebin.com/Za0McsAM
If anyone can tell me the correct code for the 2nd text file to work as well I would appreciate it.
Thanks in advance.
You should use append instead of extend, because extend iterates over the string v and adds each character as a separate element:
from collections import defaultdict
d = defaultdict(list)
with open("romEdges.txt") as fin:
    for line in fin:
        k, v = line.strip().split()
        d[k].append(v)
print d
Or use sets to prevent duplicates:
d = defaultdict(set)
with open("romEdges.txt") as fin:
    for line in fin:
        k, v = line.strip().split()
        d[k].add(v)
print d
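To see why the question's code misbehaves once a key repeats, it helps to compare list.extend and list.append on a string argument. The sketch below is written for Python 3 and uses two hypothetical edges that share a key; the contents of the pastebin files are not reproduced here:

```python
from collections import defaultdict

values = ["0100360"]
values.extend("0100039")  # iterates over the string: one element per character
print(values)

fixed = ["0100360"]
fixed.append("0100039")   # adds the whole string as a single element
print(fixed)

# The same accumulation with defaultdict(list), on hypothetical input lines.
edges = defaultdict(list)
for line in ["0100464 0100360", "0100464 0100039"]:
    k, v = line.split()
    edges[k].append(v)
print(dict(edges))
```

With extend the problem only shows up for keys that occur more than once, which may be why a file without repeated keys appears to work.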
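If you want to add the data to the dictionary, you can use update, but call it on the dictionary instance (dic) rather than on the dict type, and pass the extended list as the new value. Please use the following code:
f = open('your file name')
dic = {}
for l in f:
    k,v = l.split()
    if k in dic:
        # store the old list plus the new value under the same key
        dic.update({k: dic[k] + [v]})
    else:
        dic[k] = [v]
print dic
f.close()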
output:
{'0100464': ['0100360'], '0100317': ['0100039'], '0100405': ['0100181'], '0100545': ['0100212'], '0100008': ['0000459'], '0100073': ['0100072'], '0100044': ['0100426'], '0100062': ['0100033'], '0100061': ['0000461'], '0100066': ['0100067'], '0100067': ['0100164'], '0100064': ['0100353'], '0100080': ['0100468'], '0100566': ['0100356'], '0100048': ['0100066'], '0100005': ['0100448'], '0100007': ['0100008'], '0100318': ['0100319'], '0100045': ['0100046'], '0100238': ['0100150'], '0100040': ['0100244'], '0100024': ['0100394'], '0100025': ['0100026'], '0100022': ['0100419'], '0100009': ['0100010'], '0100020': ['0100021'], '0100313': ['0100350'], '0100297': ['0100381'], '0100490': ['0100484'], '0100049': ['0100336'], '0100075': ['0100076'], '0100074': ['0100075'], '0100077': ['0000195'], '0100071': ['0100072'], '0100265': ['0000202'], '0100266': ['0000201'], '0100035': ['0100226'], '0100079': ['0100348'], '0100050': ['0100058'], '0100017': ['0100369'], '0100030': ['0100465'], '0100033': ['0100322'], '0100058': ['0100056'], '0100013': ['0100326'], '0100036': ['0100463'], '0100321': ['0100320'], '0100323': ['0100503'], '0100003': ['0100004'], '0100056': ['0100489'], '0100055': ['0100033'], '0100053': ['0100495'], '0100286': ['0100461'], '0100285': ['0100196'], '0100482': ['0100483']}
I'm pretty new to Python, and I'm trying to parse a file. Only certain lines in the file contain data of interest, and I want to end up with a dictionary of the stuff parsed from valid matching lines in the file.
The code below works, but it's a bit ugly and I'm trying to learn how it should be done, perhaps with a comprehension, or else with a multiline regex. I'm using Python 3.2.
file_data = open('x:\\path\\to\\file','r').readlines()
my_list = []
for line in file_data:
    # discard lines which don't match at all
    if re.search(pattern, line):
        # icky, repeating search!!
        one_tuple = re.search(pattern, line).group(3,2)
        my_list.append(one_tuple)
my_dict = dict(my_list)
Can you suggest a better implementation?
Thanks for the replies. After putting them together I got
file_data = open('x:\\path\\to\\file','r').read()
my_list = re.findall(pattern, file_data, re.MULTILINE)
my_dict = {c:b for a,b,c in my_list}
but I don't think I could have gotten there today without the help.
Here are some quick'n'dirty optimisations to your code:
my_dict = dict()
with open(r'x:\path\to\file', 'r') as data:
    for line in data:
        match = re.search(pattern, line)
        if match:
            one_tuple = match.group(3, 2)
            my_dict[one_tuple[0]] = one_tuple[1]
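In the spirit of EAFP I'd suggest
my_dict = {}
with open(r'x:\path\to\file', 'r') as data:
    for line in data:
        try:
            m = re.search(pattern, line)
            # group 3 is the key and group 2 the value, as in the question
            my_dict[m.group(3)] = m.group(2)
        except AttributeError:
            # re.search returned None: the line did not match
            pass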
Another way is to keep using lists, but redesign the pattern so that it contains only two groups (key, value). Then you could simply do:
matches = [re.findall(pattern, line) for line in data]
mydict = dict(x[0] for x in matches if x)
matchRes = pattern.match(line)
if matchRes:
    my_dict = matchRes.groupdict()
I'm not sure I'd recommend it, but here's a way you could use a comprehension instead (I substituted a string for the file for simplicity):
>>> import re
>>> data = """1foo bar
... 2bing baz
... 3spam eggs
... nomatch
... """
>>> pattern = r"(.)(\w+)\s(\w+)"
>>> {x[0]: x[1] for x in (m.group(3, 2) for m in (re.search(pattern, line) for line in data.splitlines()) if m)}
{'baz': 'bing', 'eggs': 'spam', 'bar': 'foo'}
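When matching against the whole file contents at once, note that \s also matches a newline, so a pattern like the one above can in principle match across two lines. One possible way to guard against that, sketched here in Python 3, is to anchor the pattern to line boundaries with re.MULTILINE and allow only spaces or tabs as the separator. The pattern and the group names tag, value and key are assumptions made for this illustration, since the pattern from the original question is not shown:

```python
import re

# Hypothetical line-anchored pattern with named groups.
PATTERN = re.compile(
    r"^(?P<tag>.)(?P<value>\w+)[ \t]+(?P<key>\w+)$",
    re.MULTILINE,
)


def parse(text):
    """Build {key: value} from every line of text that matches PATTERN."""
    return {m.group("key"): m.group("value") for m in PATTERN.finditer(text)}


# The non-matching line sits in the middle on purpose.
data = "1foo bar\n2bing baz\nnomatch\n3spam eggs\n"
print(parse(data))
```

Named groups make the key/value order explicit, which avoids having to remember which numbered group is which.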