I've tried debugging this script but I'm not sure what's causing the error.
list1 = ['<p>Text ([0-9]):(.*)</p>' ,'<p>Text2 ([0-9]):(.*)</p>','<p>Text ([0-9]):(.*)</p>']
list2 = ["<p class='text'>Text \1:\2</p>" ,"<p class='text'>Text \1:\2</p>","<p class='text'>TEXT ([0-9]):(.*)</p>"]
translation = dict(zip(list1, list2))
pattern = re.compile('(%s)' % '|'.join(dicts.list1))
file.close()
file = open(args.file,'r+')
def translate(match):
    return dicts.translation[match.group(0)]
with open(args.file, 'r+') as output:
    with open(args.file, 'r+') as book:
        for line in book:
            output.write(pattern.sub(translate, line))
Error:
return dicts.translation5[match.group(0)]
KeyError: '<p>Text 1:1-1</p>'
I believe you are trying to match each line you read against a set of regular expressions, find out which one it matches, and then apply the corresponding change to it (also written in regex form). This approach might work, but using a dictionary is redundant in this case.
The broad approach is:
1. Match the line against the compiled pattern to find a match.
2. Compare each pattern in list1 to the matched string to see if it matches.
3. If it does, convert the matched string to the corresponding form in list2.
Something like:
import re
list1 = [r'<p>Text ([0-9]):(.*)</p>', r'<p>Text2 ([0-9]):(.*)</p>', r'<p>Text3 ([0-9]):(.*)</p>']
# Raw strings keep \1 and \2 as backreferences instead of control characters
list2 = [r"<p class='text'>Text \1:\2</p>", r"<p class='text'>Text \1:\2</p>", r"<p class='text'>TEXT \1:\2</p>"]
translation = dict(zip(list1, list2))
pattern = re.compile('(%s)' % '|'.join(list1))
def translate(m):
    # Find which individual pattern matched and apply its replacement
    for x, v in translation.items():
        if re.search(x, m.group()):
            return re.sub(x, v, m.group())
for line in book:
    m = pattern.search(line)  # search() returns a match object or None
    ret = translate(m) if m else None
    if ret is not None:
        output.write(ret)
    else:
        # No match. Echo back original line
        output.write(line)
Input
<p>Text 1:1-1</p>
Output
<p class='text'>Text 1:1-1</p>
There are probably other better ways to do it
The issue is that the text '<p>Text 1:1-1</p>' is not a key in your dict. As dicts is a free variable in your code, there is nothing more we can tell you.
Try match.group(1) instead. In regex results, group(0) is the entire matched string and groups 1 and following are the capturing groups in the regex itself. In your case group(0) == "<p>Text 1:1-1</p>", and for the pattern '<p>Text ([0-9]):(.*)</p>' on its own, group(1) == "1".
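A minimal sketch of why the lookup fails, using only the first pattern from the question: the dictionary is keyed by the pattern text, while group(0) is the text that was matched, so the two never coincide. Letting re.sub expand \1 and \2 avoids the lookup altogether. The names sample and replacement are made up for illustration.

```python
import re

pattern = re.compile(r'<p>Text ([0-9]):(.*)</p>')
replacement = r"<p class='text'>Text \1:\2</p>"
# Keyed by the pattern source, as in the question's dict(zip(list1, list2))
translation = {pattern.pattern: replacement}

sample = '<p>Text 1:1-1</p>'
m = pattern.search(sample)

whole = m.group(0)    # the full matched text
first = m.group(1)    # first capturing group
second = m.group(2)   # second capturing group

# The matched text is not a key, which is what raises the KeyError
key_present = whole in translation

# re.sub substitutes the captured groups into the replacement template
converted = pattern.sub(replacement, sample)

print(whole, first, second, key_present)
print(converted)
```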
Related
What's a cute way to do this in python?
Say we have a list of strings:
clean_be
clean_be_al
clean_fish_po
clean_po
and we want the output to be:
be
be_al
fish_po
po
Another approach which will work for all scenarios:
import re
data = ['clean_be',
'clean_be_al',
'clean_fish_po',
'clean_po', 'clean_a', 'clean_clean', 'clean_clean_1']
for item in data:
    item = re.sub('^clean_', '', item)
    print(item)
Output:
be
be_al
fish_po
po
a
clean
clean_1
Here is a possible solution that works with any prefix:
prefix = 'clean_'
result = [s[len(prefix):] if s.startswith(prefix) else s for s in lst]
You've merely provided minimal information on what you're trying to achieve, but the desired output for the 4 given inputs can be created via the following function:
def func(string):
    return "_".join(string.split("_")[1:])
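You can do this:
strlist = ['clean_be','clean_be_al','clean_fish_po','clean_po']
def func(myList: list, start: str):
    ret = []
    for element in myList:
        # lstrip(start) would strip any of the characters in start, not the prefix itself,
        # so slice the prefix off explicitly
        ret.append(element[len(start):] if element.startswith(start) else element)
    return ret
print(func(strlist, 'clean_'))
I hope it was useful. Nohab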
There are many ways to do this, based on what you have provided.
Apart from the above answers, you can also do it this way:
string = 'clean_be_al'
string = string.replace('clean_','',1)
This would remove the first occurrence of clean_ in the string.
Also, if the first word is guaranteed to be 'clean', you can simply slice off the first six characters:
string = 'clean_be_al'
print(string[6:])
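You can use lstrip to strip leading characters and rstrip to strip trailing ones, which happens to work for this input:
line = "clean_be"
print(line.lstrip("clean_"))
Drawback:
lstrip([chars])
The [chars] argument is not a prefix; rather, all combinations of its values are stripped. For example, "clean_al".lstrip("clean_") returns an empty string.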
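The drawback is easy to reproduce. The sketch below compares lstrip with an explicit prefix check; strip_prefix is a hypothetical helper name, not something from the answers above. On Python 3.9 and later, str.removeprefix does the same job.

```python
def strip_prefix(s, prefix):
    """Return s without a leading prefix; leave s unchanged if it is absent."""
    return s[len(prefix):] if s.startswith(prefix) else s


# lstrip treats its argument as a set of characters, so it keeps stripping
# as long as the next character is one of c, l, e, a, n or _
over_stripped = "clean_al".lstrip("clean_")

# The explicit check removes exactly one occurrence of the prefix
safe = strip_prefix("clean_al", "clean_")
untouched = strip_prefix("be_clean_", "clean_")

print(repr(over_stripped), repr(safe), repr(untouched))
```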
I'm attempting to work through finding the amount of consecutive STRs (a substring pattern, i.e. "AGAT") in a sequence file.
String Patterns: AGATC,TTTTTTCT,AATG,TCTAG,GATA,TATC,GAAA,TCTG
Sequence file(one of many other sequence files): AAGGTAAGTTTAGAATATAAAAGGTGAGTTAAATAGAATAGGTTAAAATTAAAGGAGATCAGATCAGATCAGATCTATCTATCTATCTATCTATCAGAAAAGAGTAAATAGTTAAAGAGTAAGATATTGAATTAATGGAAAATATTGTTGGGGAAAGGAGGGATAGAAGG
In the above sequence, TATC is the maximum with a run of 5 consecutive TATC pairs. With my regular expression, it is returning matches whether they are consecutive or not.
I believe using regular expressions is my best bet. This is my first time working in Python so don't expect too much. I've used the regex tool at regex101.com and it has given me some good insight into regex formulations. I'm passing a variable into the regex with {head}, which is the string pattern, but I want to find the matched string {head} 2 or more times. My regex below matches head 1 or more times, so I know why it returns what it does.
groups = re.findall(rf'(?:{head})+', text)
If I use r"(AGAT){2,}" in regex101.com, this works the way I expect. It finds the matched string of characters 2 or more times. If I pass it into my code as groups = re.findall(rf'(?:{head}){2,}', text), it doesn't return anything.
My code is below:
import csv
import re
import sys
if len(sys.argv) != 3:
    print("missing command-line argument")
    sys.exit(1)
if re.search(r"\.csv$", sys.argv[1]) is None:
    print("CSV file not found!")
    print("Usage: 'python.py *.csv *.txt'")
    sys.exit(1)
if re.search(r"\.txt$", sys.argv[2]) is None:
    print("TXT file not found!")
    print("Usage: 'python.py *.csv *.txt'")
    sys.exit(1)
# use reader or DictReader from the CSV module
# use sys.argv for command-line arguments
# use open(filename) and f.read() to read its contents.
# open CSV and DNA sequence and read into memory
with open(sys.argv[1], newline='') as database, open(sys.argv[2], newline='') as sequence:
    reader = csv.DictReader(database)
    headers = reader.fieldnames
    text = sequence.read()
for head in headers:
    groups = re.findall(rf'(?:{head})+', text)
    print(head, groups)
If I use the above groups = re.findall(rf'(?:{head})+', text) variable I get the below output
AGATC ['AGATCAGATCAGATCAGATC']
TTTTTTCT []
AATG ['AATG']
TCTAG []
GATA ['GATA', 'GATA']
TATC ['TATCTATCTATCTATCTATC']
GAAA ['GAAA', 'GAAA', 'GAAA']
TCTG []
If I use groups = re.findall(rf'(?:{head}){2,}', text) I get nothing.
AGATC []
TTTTTTCT []
AATG []
TCTAG []
GATA []
TATC []
GAAA []
TCTG []
So, I suppose I'm asking, how can I use regex to find a string of characters(passed as a variable) 2 or more times?
You can use the pattern ((your pattern)\2*) in your regular expression to find the longest consecutive run of a pattern (regex101 for pattern TATC):
import re
seq = 'AAGGTAAGTTTAGAATATAAAAGGTGAGTTAAATAGAATAGGTTAAAATTAAAGGAGATCAGATCAGATCAGATCTATCTATCTATCTATCTATCAGAAAAGAGTAAATAGTTAAAGAGTAAGATATTGAATTAATGGAAAATATTGTTGGGGAAAGGAGGGATAGAAGG'
patterns = ['AGATC','TTTTTTCT','AATG','TCTAG','GATA','TATC','GAAA','TCTG']
m = max([x for p in patterns for x in re.findall(r'(({})\2*)'.format(p), seq)], key=lambda k: len(k[0]) // len(k[1]))
print('Most repeated pattern: {}, number of repetitions {}'.format(m[1], len(m[0]) // len(m[1])))
Prints:
Most repeated pattern: TATC, number of repetitions 5
This answer was given by a user, yeahIProgram, on Reddit's cs50 subreddit.
"That's what I was referring to, but I had to look it up and you escape the braces inside the formatted string by doubling them."
So, the regular expression I was looking for was groups = re.findall(rf'(({head}){{2,}})', text), which returned the output below that I was expecting.
AGATC [('AGATCAGATCAGATCAGATC', 'AGATC')]
TTTTTTCT []
AATG []
TCTAG []
GATA []
TATC [('TATCTATCTATCTATCTATC', 'TATC')]
GAAA []
TCTG []
Now, I just need to get the total number of times the string occurs and I should be well on my way.
Thank you @Andrej Kesely for your input.
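To make the brace-doubling point concrete, here is a small sketch that prints what each f-string actually produces and then counts the repeats in the longest run. The text value is a shortened stand-in for the sequence file (an assumption to keep the example small), and longest_run is a hypothetical helper name.

```python
import re

# Shortened stand-in for the sequence: 4 x AGATC followed by 5 x TATC
text = "GG" + "AGATC" * 4 + "TATC" * 5 + "AGAAAAG"
head = "TATC"

# Single braces are evaluated by the f-string, so {2,} becomes the tuple (2,)
broken = rf'(?:{head}){2,}'
# Doubled braces produce literal braces, giving the intended quantifier
fixed = rf'(?:{head}){{2,}}'


def longest_run(head, text):
    """Return the largest number of back-to-back repeats of head (0 if fewer than 2)."""
    runs = re.findall(rf'((?:{re.escape(head)}){{2,}})', text)
    # Each run is a whole multiple of head, so integer division gives the repeat count
    return max((len(run) // len(head) for run in runs), default=0)


print(broken)
print(fixed)
print(longest_run("AGATC", text), longest_run("TATC", text), longest_run("GAAA", text))
```

Dividing the length of the matched run by the length of the pattern is the same trick the max(...) answer above uses to rank the patterns.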
I am learning Python and am struggling with finding an exact word in each string in a list of strings.
Apologies if this is an already asked question for this situation.
This is what my code looks like so far:
with open('text.txt') as f:
    lines = f.readlines()
lines = [line.rstrip('\n') for line in open('text.txt')]
keyword = input("Enter a keyword: ")
matching = [x for x in lines if keyword.lower() in x.lower()]
match_count = len(matching)
print('\nNumber of matches: ', match_count, '\n')
print(*matching, sep='\n')
Right now, matching will return all strings containing the word, not just strings containing the exact word. For example, if I enter 'local' as the keyword, strings with 'locally' and 'localized' are returned in addition to 'local', when I only want instances of 'local'.
I have tried:
match_test = re.compile(r"\b" + keyword+ r"\b")
match_test = ('\b' + keyword + '\b')
match_test = re.compile('(?:^|\s|$){0}'.format(keyword))
matching = [x for x in lines if keyword.lower() == x.lower()]
matching = [x for x in lines if keyword.lower() == x.lower().strip()]
And none of them have worked, so I'm a bit stuck.
How do I take the keyword entered from the user, and then return all strings in a list that contain that exact keyword?
Thanks
in means contained in, 'abc' in 'abcd' is True. For exact match use ==
matching = [x for x in lines if keyword.lower() == x.lower()]
You might need to remove spaces and newlines as well
matching = [x for x in lines if keyword.lower().strip() == x.lower().strip()]
Edit:
To find a line containing the keyword you can use loops:
matches = []
for line in lines:
    for string in line.split(' '):
        if string.lower().strip() == keyword.lower().strip():
            matches.append(line)
            break  # avoid adding the same line twice
This method avoids having to read the whole file into memory. It also deals with cases like "LocaL" or "LOCAL" assuming you want to capture all such variants. There is a bit of performance overhead on making the temp string each time the line is read, however:
import re
def reader(filename, target):
    # this regexp matches a word at the front, end or in the middle of a line stripped
    # of all punctuation and other non-alpha, non-whitespace characters:
    regexp = re.compile(r'(^| )' + target.lower() + r'($| )')
    with open(filename) as fin:
        matching = []
        # read lines one at a time:
        for line in fin:
            line = line.rstrip('\n')
            # generates a line of lowercase and whitespace to test against
            temp = ''.join([x.lower() for x in line if x.isalpha() or x == ' '])
            print(temp)
            if regexp.search(temp):
                matching.append(line)  # store unaltered line
    return matching
Given the following tests:
locally local! localized
locally locale nonlocal localized
the magic word is Local.
Localized or nonlocal or LOCAL
This is returned:
['locally local! localized',
'the magic word is Local.',
'Localized or nonlocal or LOCAL']
Here is my solution, which should match only 'local' among the text below from the text file. I used re.search to find the lines that contain only 'local'; lines containing other words that include 'local' are not matched.
Lines provided in the text file:
local
localized
locally
local
local diwakar
local
local##!
Code to find only instances of 'local' in text file :
import os
import sys
import time
import re
with open('C:/path_to_file.txt') as f:
    for line in f:
        a = re.search(r'local\W$', line)
        if a:
            print(line)
Output
local
local
local
Let me know if this is what you were looking for
Your first test seems to be on the right track
Using input:
import re
lines = [
'local student',
'i live locally',
'keyboard localization',
'what if local was in middle',
'end with local',
]
keyword = 'local'
Try this:
pattern = re.compile(r'.*\b{}\b'.format(keyword.lower()))
matching = [x for x in lines if pattern.match(x.lower())]
print(matching)
Output:
['local student', 'what if local was in middle', 'end with local']
pattern.match returns a match object if the regex matches at the start of the string, or None otherwise. Using this as your if condition filters for strings that contain the whole keyword somewhere. This works because \b matches at the beginning or end of a word. The leading .* consumes any characters that occur on the line before your keyword shows up.
For more info about using Python's re, check out the docs here: https://docs.python.org/3.8/library/re.html
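A self-contained variant of the same idea, sketched under two assumptions: the keyword comes from user input, so it is passed through re.escape, and matching should ignore case. lines_with_word is a hypothetical name. Because \b only works next to word characters, this assumes the keyword itself starts and ends with a letter, digit or underscore.

```python
import re


def lines_with_word(lines, keyword):
    """Return the lines that contain keyword as a whole word, ignoring case."""
    # re.escape keeps characters such as '.' or '+' in the keyword literal
    pattern = re.compile(r"\b{}\b".format(re.escape(keyword)), re.IGNORECASE)
    # search() scans the whole line, so no leading .* is needed
    return [line for line in lines if pattern.search(line)]


lines = [
    'local student',
    'i live locally',
    'keyboard localization',
    'what if local was in middle',
    'end with local',
    'LOCAL news',
]

matching = lines_with_word(lines, 'local')
print(matching)
```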
You can try
pattern = re.compile(r"\b{}\b".format(keyword))
match_test = pattern.search(line)
as shown in
Python - Concat two raw strings with an user name
I need a regex in python to match and return the integer after the string "id": in a text file.
The text file contains the following:
{"page":1,"results": [{"adult":false,"backdrop_path":"/ba4CpvnaxvAgff2jHiaqJrVpZJ5.jpg","id":807,"original_title":"Se7en","release_date":"1995-09-22","p
I need to get the 807 after the "id", using a regular expression.
Is this what you mean?
#!/usr/bin/env python
import re
subject = '{"page":1,"results": [{"adult":false,"backdrop_path":"/ba4CpvnaxvAgff2jHiaqJrVpZJ5.jpg","id":807,"original_title":"Se7en","release_date":"1995-09-22","p'
match = re.search('"id":([^,]+)', subject)
if match:
    result = match.group(1)
else:
    result = "no result"
print(result)
The Output: 807
Edit:
In response to your comment, adding one simple way to ignore the first match. If you use this, remember to add something like "id":809,"etc to your subject so that we can ignore 807 and find 809.
n = 1
for match in re.finditer('"id":([^,]+)', subject):
    if n == 1:
        print("ignoring the first match")
    else:
        print(match.group(1))
    n += 1
Assuming that there is more to the file than that:
import json
with open('/path/to/file.txt') as f:
    data = json.loads(f.read())
print(data['results'][0]['id'])
If the file is not valid JSON, then you can get the value of id with:
from re import compile, IGNORECASE
r = compile(r'"id"\s*:\s*(\d+)', IGNORECASE)
with open('/path/to/file.txt') as f:
    # findall returns the captured strings directly, not match objects
    for match in r.findall(f.read()):
        print(match)
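The two approaches can be compared without a file. In the sketch below, complete is a made-up, well-formed stand-in for the full response (the text in the question is cut off), and truncated is a shortened version of the cut-off text. Parsing is the more robust choice when the JSON is valid; the regex is the fallback when it is not.

```python
import json
import re

# Hypothetical complete response; the real one is truncated in the question
complete = '{"page":1,"results":[{"adult":false,"id":807,"original_title":"Se7en"}]}'
ids_from_json = [entry["id"] for entry in json.loads(complete)["results"]]

# Cut-off text is not valid JSON, so json.loads raises JSONDecodeError
truncated = '{"page":1,"results": [{"adult":false,"id":807,"original_title":"Se7en","p'
try:
    json.loads(truncated)
    parsed = True
except json.JSONDecodeError:
    parsed = False

# The regex still finds every integer that follows "id":
ids_from_regex = [int(m) for m in re.findall(r'"id"\s*:\s*(\d+)', truncated)]

print(ids_from_json, parsed, ids_from_regex)
```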
I'm trying to build a list of domain names from an Enom API call. I get back a lot of information and need to locate the domain name related lines, and then join them together.
The string that comes back from Enom looks somewhat like this:
SLD1=domain1
TLD1=com
SLD2=domain2
TLD2=org
TLDOverride=1
SLD3=domain3
TLD4=co.uk
SLD5=domain4
TLD5=net
TLDOverride=1
I'd like to build a list from that which looks like this:
[domain1.com, domain2.org, domain3.co.uk, domain4.net]
To find the different domain name components I've tried the following (where "enom" is the string above) but have only been able to get the SLD and TLD matches.
re.findall("^.*(SLD|TLD).*$", enom, re.M)
Edit:
Every time I see a question asking for a regular expression solution I have this bizarre urge to try and solve it without regular expressions. Most of the time it's more efficient than using a regex; I encourage the OP to test which of the solutions is most efficient.
Here is the naive approach:
a = """SLD1=domain1
TLD1=com
SLD2=domain2
TLD2=org
TLDOverride=1
SLD3=domain3
TLD4=co.uk
SLD5=domain4
TLD5=net
TLDOverride=1"""
b = a.split("\n")
c = [x.split("=")[1] for x in b if x != 'TLDOverride=1']
for x in range(0, len(c), 2):
    print(".".join(c[x:x+2]))
>> domain1.com
>> domain2.org
>> domain3.co.uk
>> domain4.net
You have a capturing group in your expression. re.findall documentation says:
If one or more groups are present in the pattern, return a list of groups; this will be a list of tuples if the pattern has more than one group.
That's why only the content of the capturing group is returned.
Try:
re.findall(r"^.*((?:SLD|TLD)\d*)=(.*)$", enom, re.M)
This would return a list of tuples:
[('SLD1', 'domain1'), ('TLD1', 'com'), ('SLD2', 'domain2'), ('TLD2', 'org'), ('SLD3', 'domain3'), ('TLD4', 'co.uk'), ('SLD5', 'domain4'), ('TLD5', 'net')]
Combining SLDs and TLDs is then up to you.
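One way to do that combining step, as a sketch rather than the answer's own method: walk the (key, value) tuples in order and pair each SLD with the next TLD that follows it. This assumes every SLD line is followed by its TLD line, as in the sample, and it ignores the numeric suffixes because they do not line up in the sample (SLD3 is followed by TLD4).

```python
import re

enom = """SLD1=domain1
TLD1=com
SLD2=domain2
TLD2=org
TLDOverride=1
SLD3=domain3
TLD4=co.uk
SLD5=domain4
TLD5=net
TLDOverride=1"""

# TLDOverride lines do not match: after "TLD" the pattern requires digits or "="
pairs = re.findall(r"^((?:SLD|TLD)\d*)=(.*)$", enom, re.M)

domains = []
sld = None
for key, value in pairs:
    if key.startswith("SLD"):
        sld = value                      # remember the name until its TLD arrives
    elif sld is not None:
        domains.append(sld + "." + value)
        sld = None

print(domains)
```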
This works for your example:
>>> sld_list = re.findall(r"^.*SLD[0-9]*?=(.*?)$", enom, re.M)
>>> tld_list = re.findall(r"^.*TLD[0-9]*?=(.*?)$", enom, re.M)
>>> [sld + '.' + tld for sld, tld in zip(sld_list, tld_list)]
['domain1.com', 'domain2.org', 'domain3.co.uk', 'domain4.net']
I'm not sure why are you talking about regular expressions. I mean, why don't you just run a for loop?
A famous quote seems to be appropriate here:
Some people, when confronted with a problem, think “I know, I'll use
regular expressions.” Now they have two problems.
domains = []
components = []
for line in enom.split('\n'):
    k, v = line.split('=')
    if k == 'TLDOverride':
        continue
    components.append(v)
    if k.startswith('TLD'):
        domains.append('.'.join(components))
        components = []
P.S. I'm not sure what's this TLDOverride so the code just ignores it.
Here's one way:
import re
print(list(map('.'.join, zip(*[iter(re.findall(r'^(?:S|T)LD\d+=(.*)$', text, re.M))]*2))))
# ['domain1.com', 'domain2.org', 'domain3.co.uk', 'domain4.net']
Just for fun, map -> filter -> map:
input = """
SLD1=domain1
TLD1=com
SLD2=domain2
TLD2=org
TLDOverride=1
SLD3=domain3
TLD4=co.uk
SLD5=domain4
TLD5=net
"""
splited = list(map(lambda x: x.split("="), input.split()))
slds = filter(lambda x: x[1][0].startswith('SLD'), enumerate(splited))
print(list(map(lambda x: '.'.join([x[1][1], splited[x[0] + 1][1]]), slds)))
>>> ['domain1.com', 'domain2.org', 'domain3.co.uk', 'domain4.net']
This appears to do what you want:
domains = re.findall(r'SLD\d+=(.+)', re.sub(r'\nTLD\d+=', '.', enom))
It assumes that the lines are sorted and that each SLD always comes directly before its TLD. If that may not be the case, try this slightly more verbose code without regexes:
d = dict(x.split('=') for x in enom.strip().splitlines())
domains = [
    d[key] + '.' + d.get('T' + key[1:], '')
    for key in d if key.startswith('SLD')
]
You need to use multiline regex for this. This is similar to this post.
data = """SLD1=domain1
TLD1=com
SLD2=domain2
TLD2=org
TLDOverride=1
SLD3=domain3
TLD4=co.uk
SLD5=domain4
TLD5=net
TLDOverride=1"""
# [\w.]+ so that multi-part TLDs such as co.uk are captured whole
domain_seq = re.compile(r"SLD\d=(\w+)\nTLD\d=([\w.]+)", re.M)
for item in domain_seq.finditer(data):
    domain, tld = item.group(1), item.group(2)
    print("%s.%s" % (domain, tld))
As some other answers already said, there's no need to use a regular expression here. A simple split and some filtering will do nicely:
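lines = data.split("\n")  # assuming data contains your input string
# drop the TLDOverride lines so they are not mistaken for TLD values
lines = [x for x in lines if not x.startswith("TLDOverride")]
sld, tld = [[x.split("=")[1] for x in lines if x[:3] == t] for t in ("SLD", "TLD")]
result = [x + "." + y for x, y in zip(sld, tld)]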