Searching and capturing a character using regular expressions in Python

While going through one of the problems in Python Challenge, I am trying to solve it as follows:
Read the input in a text file with characters as follows:
DQheAbsaMLjTmAOKmNsLziVMenFxQdATQIjItwtyCHyeMwQTNxbbLXWZnGmDqHhXnLHfEyvzxMhSXzd
BEBaxeaPgQPttvqRvxHPEOUtIsttPDeeuGFgmDkKQcEYjuSuiGROGfYpzkQgvcCDBKrcYwHFlvPzDMEk
MyuPxvGtgSvWgrybKOnbEGhqHUXHhnyjFwSfTfaiWtAOMBZEScsOSumwPssjCPlLbLsPIGffDLpZzMKz
jarrjufhgxdrzywWosrblPRasvRUpZLaUbtDHGZQtvZOvHeVSTBHpitDllUljVvWrwvhpnVzeWVYhMPs
kMVcdeHzFZxTWocGvaKhhcnozRSbWsIEhpeNfJaRjLwWCvKfTLhuVsJczIYFPCyrOJxOPkXhVuCqCUgE
luwLBCmqPwDvUPuBRrJZhfEXHXSBvljqJVVfEGRUWRSHPeKUJCpMpIsrV.......
What I need is to go through this text file and pick all lower case letters that are enclosed by only three upper-case letters on each side.
The python script that I wrote to do the above is as follows:
import re
pattern = re.compile("[a-z][A-Z]{3}([a-z])[A-Z]{3}[a-z]")
f = open('/Users/Dev/Sometext.txt','r')
for line in f:
    result = pattern.search(line)
    if result:
        print result.groups()
f.close()
The above script, instead of returning the capture (a list of lower-case characters), returns all the text blocks that meet the regular expression criteria, like
aXCSdFGHj
vCDFeTYHa
nHJUiKJHo
.........
.........
Can somebody tell me what exactly I am doing wrong here? And instead of looping through the entire file, is there an alternate way to run the regular expression search on the entire file?
Thanks

Change result.groups() to result.group(1) and you will get just the single letter match.
A second problem with your code is that it will not find multiple results on one line. So instead of using re.search you'll need re.findall or re.finditer. findall will return strings or tuples of strings, whereas finditer returns match objects.
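For example, the difference can be sketched like this (Python 3 syntax; the sample line is made up for illustration):

```python
import re

pattern = re.compile(r"[a-z][A-Z]{3}([a-z])[A-Z]{3}[a-z]")
line = "aBBBcDDDe xQQQyRRRz"  # hypothetical line with two hits

# findall returns just the captured group for each match
print(pattern.findall(line))                          # ['c', 'y']

# finditer yields match objects; group(1) is the capture
print([m.group(1) for m in pattern.finditer(line)])   # ['c', 'y']
```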
Here's how I approached the same problem:
import urllib
import re
pat = re.compile('[a-z][A-Z]{3}([a-z])[A-Z]{3}[a-z]')
print ''.join(pat.findall(urllib.urlopen(
    "http://www.pythonchallenge.com/pc/def/equality.html").read()))
Note that re.findall and re.finditer return non-overlapping results. So when using the above pattern with re.findall to search the string 'aBBBcDDDeFFFg', your only match will be 'c', not 'e'. Fortunately, this Python Challenge problem contains no such examples.
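The non-overlapping behaviour can be verified directly (Python 3 sketch):

```python
import re

pat = re.compile(r"[a-z][A-Z]{3}([a-z])[A-Z]{3}[a-z]")
# the first match consumes 'cDDDe', so a match around 'e' is never attempted
print(pat.findall("aBBBcDDDeFFFg"))  # ['c']
```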

I'd suggest using lookaround:
(?<=[A-Z]{3})(?<![A-Z].{3})([a-z])(?=[A-Z]{3})(?!.{3}[A-Z])
This will have no problem with overlapping matches.
Explanation:
(?<=[A-Z]{3}) # assert that there are 3 uppercase letters before the current position
(?<![A-Z].{3}) # assert that there is no uppercase letter 4 characters before the current position
([a-z]) # match a lowercase character (all characters in the example are ASCII)
(?=[A-Z]{3}) # assert that there are 3 uppercase letter after the current position
(?!.{3}[A-Z]) # assert that there is no uppercase letter 4 characters after the current position
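Since the lookarounds consume no characters, overlapping contexts both count. A quick Python 3 check against the overlap example from the earlier answer:

```python
import re

lookaround = re.compile(
    r"(?<=[A-Z]{3})(?<![A-Z].{3})([a-z])(?=[A-Z]{3})(?!.{3}[A-Z])"
)
# both 'c' and 'e' are found, even though their contexts share 'DDD'
print(lookaround.findall("aBBBcDDDeFFFg"))  # ['c', 'e']
```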

import re
with open('/Users/Dev/Sometext.txt', 'r') as f:
    tokens = re.findall(r'[a-z][A-Z]{3}([a-z])[A-Z]{3}[a-z]', f.read())
for token in tokens:
    print token
What findall does:
Return all non-overlapping matches of pattern in string, as a list of strings. The string is scanned left-to-right, and matches are returned in the order found. If one or more groups are present in the pattern, return a list of groups; this will be a list of tuples if the pattern has more than one group. Empty matches are included in the result unless they touch the beginning of another match.
Maybe the most useful function in the re module.
The read() function reads the whole file into one big string. This is especially useful if you need to match a regular expression against the whole file.
Warning: Depending on the size of the file, you may prefer iterating over the file line by line as you did in your first approach.
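For large files, the line-by-line variant can be sketched like this (Python 3; io.StringIO stands in here for a real file handle, and the sample text is made up):

```python
import io
import re

pattern = re.compile(r"[a-z][A-Z]{3}([a-z])[A-Z]{3}[a-z]")

# io.StringIO stands in for an open file; only one line is held at a time
f = io.StringIO("aBBBcDDDe\nxQQQyRRRz\n")
letters = [m.group(1) for line in f for m in pattern.finditer(line)]
print(letters)  # ['c', 'y']
```

Note that this assumes no match spans a line break, which holds for this challenge's input.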

Related

Make a list return each of its elements as individual strings to be placed in a regular expression

I am facing a challenge in Python where I have a list that contains multiple strings. I want to use a Regex (findall) to search for any occurrence of each of the list's elements in a text file.
import re
name_list = ['friend', 'boy', 'man']
example_string = "friend"
file= open('file.txt', 'r')
lines= file.read()
Then comes the re.findall expression. I configured it such that it finds any occurrence in the text file where a desired string is found between a number in parentheses (\d) and a period. It works perfectly when I place a string variable inside the regular expression, as seen below.
find = re.findall(r"([^(\d)]*?"+example_string+r"[^.]*)", lines)
However, I want to be able to replace example_string with some sort of mechanism that returns each of the elements in name_list as individual strings to be placed and searched for in the regular expression. The lists I work with can get much larger than the list in this example, so please keep that in mind.
As a beginner, I tried simply replacing the string in re.findall with the list I have, only to quickly realize that that would result in an error. The solution to this must allow me to use re.findall in the aforementioned manner, so most of the challenge lies in manipulating the list so that it can produce each of its elements as individual strings to be placed within re.findall.
Thank you for your insights.
for name in name_list:
    find = re.findall(r"([^(\d)]*?" + name + r"[^.]*)", lines)
    # ... do stuff with the results
This iterates through each item in name_list and runs the same regex as before.
The pattern that you use for this match, ([^(\d)]*?...[^.]*), is not correct.
I configured it such that it finds any occurrence in the text file
where a desired string is found between a number in parentheses (\d)
and a period.
It is due to this construct [^(\d)] that is a negated character class matching any character except for what is in between the square brackets.
The next negated character class [^.]* matches any char except a dot, but the final dot is not matched.
The pattern to find everything between a number in parentheses and a dot can use a capture group, which will be returned by re.findall:
\(\d+\)([^.]*(?:friend|boy|man)[^.]*)\.
For example, if the content of file.txt is:
this is (10) with friend and a text.
Example code, assembling the words into a non-capturing group using '|'.join(name_list):
import re
name_list = ['friend', 'boy', 'man']
pattern = rf"\(\d+\)([^.]*(?:{'|'.join(name_list)})[^.]*)\."
with open('file.txt', 'r') as file:
    lines = file.read()
print(re.findall(pattern, lines))
Output
[' with friend and a text']

Regex to search for a specific substring [duplicate]

This question already has answers here:
How to find overlapping matches with a regexp?
(4 answers)
Closed 4 years ago.
I tried this code:
re.findall(r"d.*?c", "dcc")
to search for substrings with first letter d and last letter c.
But I get output ['dc']
The correct output should be ['dc', 'dcc'].
What did I do wrong?
What you're looking for isn't possible using any built-in regexp functions that I know of. re.findall() only returns non-overlapping matches. After it matches dc, it looks for another match starting after that. Since the rest of the string is just c, and that doesn't match, it's done, so it just returns ["dc"].
When you use a quantifier like *, you have a choice of making it greedy, or non-greedy -- either it finds the longest or shortest match of the regexp. To do what you want, you need a way of telling it to look for successively longer matches until it can't find anything. There's no simple way to do this. You can use a quantifier with a specific count, but you'd have to loop it in your code:
d.{0}c
d.{1}c
d.{2}c
d.{3}c
...
If you have a regexp with multiple quantified sub-patterns, you'd have to try all combinations of lengths.
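That looping idea can be sketched in Python 3 like this (the helper name is made up):

```python
import re

def overlapping_dc(text):
    """Collect d...c matches by trying every bounded gap length in turn."""
    found = []
    for n in range(len(text)):
        # d.{0}c, d.{1}c, d.{2}c, ... one fixed gap length per pass
        for m in re.finditer(r"d.{%d}c" % n, text):
            found.append(m.group())
    return found

print(overlapping_dc("dcc"))  # ['dc', 'dcc']
```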
Your two problems are that .* is greedy while .*? is minimal, and that re.findall() only returns non-overlapping matches. Here's a possible solution:
import re

def findall_inner(expr, text):
    explore = list(re.findall(expr, text))
    matches = set()
    while explore:
        word = explore.pop()
        if len(word) >= 2 and word not in matches:
            explore.extend(re.findall(expr, word[1:]))   # try again without the first letter
            explore.extend(re.findall(expr, word[:-1]))  # try again without the last letter
            matches.add(word)
    return list(matches)

found = findall_inner(r"d.*c", "dcc")
print(found)
This is a little bit of overkill, using findall instead of search and using >= 2 instead of > 2, as in this case there can only be one non-overlapping match of d.*c and one-character strings cannot match the pattern. But there is some flexibility in it depending on what other kinds of patterns you might want.
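Another stdlib-only way to enumerate every overlapping match is to test each substring with fullmatch (Python 3.4+; the function name here is made up). It is quadratic in the number of substrings, so it only suits short inputs:

```python
import re

def all_overlapping(pattern, text):
    """Return every substring of text that fully matches pattern."""
    compiled = re.compile(pattern)
    return [
        text[i:j]
        for i in range(len(text))
        for j in range(i + 1, len(text) + 1)
        if compiled.fullmatch(text[i:j])
    ]

print(all_overlapping(r"d.*c", "dcc"))  # ['dc', 'dcc']
```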
Try this regex:
^d.*c$
Essentially, you are looking for the start of the string to be d and the end of the string to be c.
This is a very important point to understand: a regex engine always returns the leftmost match, even if a "better" match could be found later. When applying a regex to a string, the engine starts at the first character of the string. It tries all possible permutations of the regular expression at the first character. Only if all possibilities have been tried and found to fail does the engine continue with the second character in the text. So once it finds 'dc', the engine moves past it and continues with the second 'c', which is why it cannot also match 'dcc'.
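The leftmost-match behaviour, and the greedy/minimal difference, are easy to see with search (Python 3 sketch):

```python
import re

# both quantifiers start matching at the leftmost 'd';
# greediness only decides how far that one match extends
print(re.search(r"d.*?c", "dcc").group())  # 'dc'
print(re.search(r"d.*c", "dcc").group())   # 'dcc'
```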

Replace string between tags if string begins with "1"

I have a huge XML file (about 100MB) and each line contains something along the lines of <tag>10005991</tag>. So for example:
textextextextext<tag>10005991<tag>textextextextext
textextextextext<tag>20005992</tag>textextextextext
textextextextext<tag>10005993</tag>textextextextext
textextextextext<tag>20005994</tag>textextextextext
I want to replace any string between the tags and that begins with "1" to be replaced with a string of my choice and then write back to the file. I've tried using the line.replace function which works but only if I specify the string.
line=line.replace("<tag>10005991</tag>","<tag>YYYYYY</tag>")
Ideal output:
textextextextext<tag>YYYYYY<tag>textextextextext
textextextextext<tag>20005992</tag>textextextextext
textextextextext<tag>YYYYYY</tag>textextextextext
textextextextext<tag>20005994</tag>textextextextext
I've thought about using an array to pass each string in and then replace but I'm sure there's a much simpler solution.
You can use the re module
>>> text = 'textextextextext<tag>10005991</tag>textextextextext'
>>> re.sub(r'<tag>1(\d+)</tag>','<tag>YYYYY</tag>',text)
'textextextextext<tag>YYYYY</tag>textextextextext'
re.sub will replace the matched text with the second argument.
Quote from the doc
Return the string obtained by replacing the leftmost non-overlapping occurrences of pattern in string by the replacement repl. If the pattern isn’t found, string is returned unchanged.
Usage may be like (opening the output file once, in write mode):
with open("file") as f, open("output", "w") as f2:
    for i in f:
        f2.write(re.sub(r'<tag>1(\d+)</tag>', '<tag>YYYYY</tag>', i))
You can also run the regex over the whole multi-line string at once; in your pattern, positive look-arounds let you match only the value between the tags without consuming the tags themselves:
>>> print re.sub(r'(?<=<tag>)1\d+(?=</?tag>)', r'YYYYYY', s)
textextextextext<tag>YYYYYY<tag>textextextextext
textextextextext<tag>20005992</tag>textextextextext
textextextextext<tag>YYYYYY</tag>textextextextext
textextextextext<tag>20005994</tag>textextextextext
Also, as #Bhargav Rao did in his answer, you can use grouping instead of look-arounds (note that the replacement must spell out the literal closing tag):
>>> print re.sub(r'<tag>(1\d+)</?tag>', r'<tag>YYYYYY</tag>', s)
textextextextext<tag>YYYYYY</tag>textextextextext
textextextextext<tag>20005992</tag>textextextextext
textextextextext<tag>YYYYYY</tag>textextextextext
textextextextext<tag>20005994</tag>textextextextext
I think your best bet is to use ElementTree
The main idea:
1) Parse the file
2) Find each element's value
3) Test your condition
4) Replace the value if the condition is met
Here is a good place to start parsing : How do I parse XML in Python?
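A minimal ElementTree sketch of those steps (this assumes the file is well-formed XML; the <root> wrapper and the sample values here are made up):

```python
import xml.etree.ElementTree as ET

xml = "<root><tag>10005991</tag><tag>20005992</tag></root>"
root = ET.fromstring(xml)
for el in root.iter('tag'):
    # condition: the value starts with "1"
    if el.text and el.text.startswith('1'):
        el.text = 'YYYYYY'
print(ET.tostring(root, encoding='unicode'))
# <root><tag>YYYYYY</tag><tag>20005992</tag></root>
```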

Backreferencing in Python: findall() method output for HTML string

I am trying to learn some regular expressions in Python. The following does not produce the output I expected:
with open('ex06-11.html') as f:
    a = re.findall("<div[^>]*id\\s*=\\s*([\"\'])header\\1[^>]*>(.*?)</div>", f.read())
# output: [('"', 'Some random text')]
The output I was expecting (same code, but without the backreference):
with open('ex06-11.html') as f:
    print re.findall("<div[^>]*id\\s*=\\s*[\"\']header[\"\'][^>]*>(.*?)</div>", f.read())
# output: ['Some random text']
The question really boils down to: why is there a quotation mark in my first output, but not in my second? I thought that ([abc]) ... \1 == [abc] ... [abc]. Am I incorrect?
From the docs on re.findall:
If one or more groups are present in the pattern, return a list of groups; this will be a list of tuples if the pattern has more than one group.
If you want the entire match to be returned, remove the capturing groups or change them to non-capturing groups by adding ?: after the opening paren. For example you would change (foo) in your regex to (?:foo).
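A quick illustration of the difference (Python 3, with a made-up sample string):

```python
import re

text = "a1 b2 c3"
# capturing group: findall returns what the group captured
print(re.findall(r"([a-z])\d", text))    # ['a', 'b', 'c']
# non-capturing group: findall returns the whole match
print(re.findall(r"(?:[a-z])\d", text))  # ['a1', 'b2', 'c3']
```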
Of course in this case you need the capturing group for the backreference, so your best bet is to keep your current regex and then use a list comprehension with re.finditer() to get a list of only the second group:
regex = re.compile(r"""<div[^>]*id\s*=\s*(["'])header\1[^>]*>(.*?)</div>""")
with open('ex06-11.html') as f:
    a = [m.group(2) for m in regex.finditer(f.read())]
A couple of side notes: you should really consider using an HTML parser like BeautifulSoup instead of regex. You should also use triple-quoted strings if you need to include single or double quotes within your string, and use raw string literals when writing regular expressions so that you don't need to escape the backslashes.
The behaviour is clearly documented. See re.findall:
Return all non-overlapping matches of pattern in string, as a list of strings. The string is scanned left-to-right, and matches are returned in the order found.
If one or more groups are present in the pattern, return a list of groups; this will be a list of tuples if the pattern has more than one group. Empty matches are included in the result unless they touch the beginning of another match.
So, if you have capture groups in your regex pattern, the findall method returns a list of tuples, containing all the captured groups for a particular match; the full match (group(0)) is not included.
So, either you use a non-capturing group - (?:[\"\']), or don't use any group at all, as in your 2nd case.
P.S: Use raw string literals for your regex pattern, to avoid escaping your backslashes. Also, compile your regex outside the loop, so that it is not re-compiled on every iteration. Use re.compile for that.
When I asked this question I was just starting with regular expressions. I have since read the docs completely, and I just wanted to share what I found out.
Firstly, what Rohit and F.J suggested, use raw strings (to make the regex more readable and less error-prone) and compile your regex beforehand using re.compile. To match an HTML string whose id is 'header':
s = "<div id='header'>Some random text</div>"
We would need a regex like:
p = re.compile(r'<div[^>]*id\s*=\s*([\"\'])header\1[^>]*>(.*?)</div>')
In the Python implementation of regex, a capturing group is made by enclosing part of your regex in parentheses (...). Capturing groups capture the span of text that they match. They are also needed for backreferencing. So in my regex above, I have two capturing groups: ([\"\']) and (.*?). The first one is needed to make the backreference \1 possible. The use of a backreference (and the fact that it refers back to a capturing group) has consequences, however. As pointed out in the other answers to this question, when using findall on my pattern p, findall will return matches from all groups and put them in a list of tuples:
print p.findall(s)
# [("'", 'Some random text')]
Since we only want the plain text from between the HTML tags, this is not the output we're looking for.
(Arguably, we could use:
print p.findall(s)[0][1]
# Some random text
But this may be a bit contrived.)
So in order to return only the text from between the HTML tags (captured by the second group), we use the group() method on p.search():
print p.search(s).group(2)
# Some random text
I'm fully aware that all but the most simple HTML should not be handled by regex, and that instead you should use a parser. But this was just a tutorial example for me to grasp the basics of regex in Python.

Can't make an exact match of a string in a txt file

I solved a lot of questions by reading your posts, but now I'm stuck at the following.
My problem is that i can't make an absolute match of a given word in my txt file.
I wrote the following:
for word in listtweet:
    #print word,
    pattern = re.compile(r'\b%s\b' % (word))
    with open('testsentiwords_fullTotal_clean1712.txt', 'r') as f:
        for n, line in enumerate(f):
            if pattern.search(line):
                print 'found word: ', word, 'in line ', line
My output is partly correct:
found word dirty in line '-0.458333333333', 'dirty'
But i also get:
found word dirty in line '-0.5', 'dirty-minded'
found word dirty in line '-0.625', 'dirty-faced'
I only want to get the exact match and nothing more!
Any help, please?
Try with this pattern :
pattern=re.compile(r'[^-a-zA-Z]%s[^-a-zA-Z]' %(word))
The problem with your pattern is that \b treats '-' as a word boundary, because '-' is not a word character.
If you need numbers in your word, you can add 0-9 to this pattern.
pattern=re.compile(r'[^-a-zA-Z0-9]%s[^-a-zA-Z0-9]' %(word))
If the print output you provide shows the actual lines in the file (where the word you're looking for is always enclosed in single quotes), I think that your re pattern wants to be like
p = re.compile(r"'%s'" % target_word)
so results would be something like:
>>> p = re.compile(r"'%s'" % "dirty")
>>> p.search("'12345', 'dirty'")
<_sre.SRE_Match object at 0x631b10>
>>> p.search("'12345', 'dirty-faced'")
>>>
Firstly, switch from \b word-boundary checks to [^-a-zA-Z], since - counts as a word boundary. Secondly, if you have long lines, consider using the in keyword first:
if word in line and pattern.search(line):
That way Python can do a fast literal match for the letters of the word before deploying the regular expression engine. This should speed things up for large files where most lines do not match at all.
Thirdly, fix your code sample: printing line will print the line contents, whereas you presumably also want the line number n (converted to a string).
Fourth, consider using grep instead:
grep -nwf needles_on_separate_lines haystack.txt
Which will do all you want, and far faster than Python.
Your problem is that \b matches at word boundaries. These are defined as "a position between an alphanumeric character and a non-alphanumeric character".
So \bdirty\b will match dirty in the string This is dirty! but not in dirtying your clothes. So far so good, but since - is also a non-alphanumeric character, \b will also trigger in dirty-minded as you observed.
What you therefore need to do is to think about what characters you do not want to allow as word-separators. If it's only the dash, you could add another pair of assertions to exclude those:
r"(?<!-)\b%s\b(?!-)" % word
If you want to add more characters to exclude as valid word boundaries, for example the apostrophe, use a character class:
r"(?<!['-])\b%s\b(?!['-])" % word
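For example (Python 3 sketch, using sample lines shaped like the ones in the question):

```python
import re

word = "dirty"
pattern = re.compile(r"(?<!-)\b%s\b(?!-)" % word)

print(bool(pattern.search("'-0.458333333333', 'dirty'")))  # True
print(bool(pattern.search("'-0.5', 'dirty-minded'")))      # False
print(bool(pattern.search("'-0.625', 'dirty-faced'")))     # False
```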
