Detect pattern in a string in Python

I need to be able to detect patterns in a string in Python. For example:
xx/xx/xx (where x is an integer).
How could I do this?

Assuming you want to match more than just dates, you'll want to look into using Regular Expressions (also called Regex). Here is the link for the re Python doc: https://docs.python.org/2/library/re.html This will tell you all of the special character sequences that you can use to build your regex matcher. If you're new to regex matching, then I suggest taking a look at some tutorials, for example: http://www.tutorialspoint.com/python/python_reg_expressions.htm
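For the concrete xx/xx/xx case in the question, a minimal sketch (the sample string here is made up) could look like this:

```python
import re

# Two digits, a slash, two digits, a slash, two digits,
# with \b word boundaries so longer digit runs don't match.
pattern = re.compile(r'\b\d{2}/\d{2}/\d{2}\b')
matches = pattern.findall("Meeting on 01/02/20 and 15/06/21")
print(matches)  # ['01/02/20', '15/06/21']
```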

This is a case for regular expressions. The best resource I have read so far for starting out is the book "Automate the Boring Stuff with Python".
Here is a sample of how you might implement regular expressions to solve your problem.
import re
regex = re.compile(r'\d\d/\d\d/\d\d')
mo = regex.findall("This is a test01/02/20")
print(mo)
and here is the output
['01/02/20']
First, import Python's library for dealing with regexes:
import re
Then you create a regex object with:
regex = re.compile(r'\d\d/\d\d/\d\d')
This might look scary, but it's actually very straightforward.
You're defining a pattern. In this case the pattern is \d\d, or two digits, followed by /, then two more digits, and so on.
Here is a good link to learn more
https://docs.python.org/2/library/re.html
Though I definitely suggest picking up the book.

Related

Python regex to split dutch address

I have a bit of an issue with regex in Python. I am familiar with this regex script in PHP: https://gist.github.com/benvds/350404, but in Python, using the re module, I keep getting no results:
re.findall(r"#^([\w[:punct:] ]+) ([0-9]{1,5})([\w[:punct:]\-/]*)$#", "Wilhelminakade 173")
Output is []
Any ideas?
PHP supports alternative characters as regex delimiters. Your sample Gist uses # for that purpose. They are not part of the regex in PHP, and they are not needed in Python at all. They prevent a match. Remove them.
re.findall(r"^([\w[:punct:] ]+) ([0-9]{1,5})([\w[:punct:]\-/]*)$", "Wilhelminakade 173")
This still gives no result because Python's regex engine does not know what [:punct:] is supposed to mean. There is no support for POSIX character classes in Python's re. Replace them with something else (e.g. the punctuation you expect, probably something like dots, apostrophes, dashes). This results in
re.findall(r"^([\w.'\- ]+) ([0-9]{1,5})([\w.'\-/]*)$", "Wilhelminakade 173")
which gives [('Wilhelminakade', '173', '')].
Long story short, there are different regex engines in different programming languages. You cannot just copy regex from PHP to Python without looking at it closely, and expect it to work.
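Putting the answer's steps together (the second address below is an invented example to show the optional house-number suffix):

```python
import re

# [:punct:] replaced with an assumed set: dots, apostrophes, dashes.
addr_re = re.compile(r"^([\w.'\- ]+) ([0-9]{1,5})([\w.'\-/]*)$")

print(addr_re.findall("Wilhelminakade 173"))
print(addr_re.findall("Van der Heijdenstraat 12b"))
```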

How do I extract definitions from a html file?

I'm trying to practice with regular expressions by extracting function definitions from Python's standard library built-in functions page. What I do know so far is that the definitions are generally printed between <dd><p> and </dd></dl>. When I try
import re
fname = open('functions.html').read()
deflst = re.findall(r'<dd><p>([\D3]+)</dd></dl>', fname)
it doesn't actually stop at </dd></dl>. This is probably something very silly that I'm missing here, but I've been really having a hard time trying to figure this one out.
Regular expressions are evaluated left to right, in a sense. So in your regular expression,
r'<dd><p>([\D3]+)</dd></dl>'
the regex engine will first look for a <dd><p>, then it will look at each of the following characters in turn, checking each for whether it's a nondigit or 3, and if so, add it to the match. It turns out that all the characters in </dd></dl> are in the class "nondigit or 3", so all of them get added to the portion matched by [\D3]+, and the engine dutifully keeps going. It will only stop when it finds a character that is a digit other than 3, and then go on and "notice" the rest of the regex (the </dd></dl>).
To fix this, you can use the reluctant quantifier like so:
r'<dd><p>([\D3]+?)</dd></dl>'
(note the added ?) which means the regex engine should be conservative in how much it adds to the match. Instead of trying to "gobble" as many characters as possible, it will now try to match the [\D3]+? to just one character and then go on and see if the rest of the regex matches, and if not it will try to match [\D3]+? with just two characters, and so on.
Basically, [\D3]+ matches the longest possible string of [\D3]'s that it can while still letting the full regex match, whereas [\D3]+? matches the shortest possible string of [\D3]'s that it can while still letting the full regex match.
Of course one shouldn't really be using regular expressions to parse HTML in "the real world", but if you just want to practice regular expressions, this is probably as good a text sample as any.
By default all quantifiers are greedy, which means they try to match as many characters as possible. You can add ? after a quantifier to make it lazy, which matches as few characters as possible. \d+? matches at least one digit, but as few as possible.
Try r'<dd><p>([\D3]+?)</dd></dl>'
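A small demonstration of the difference, on a made-up two-entry snippet:

```python
import re

html = "<dd><p>first def</dd></dl> junk <dd><p>second def</dd></dl>"

# Greedy: [\D3]+ swallows the first closing tags and keeps going,
# so both entries collapse into one long match.
greedy = re.findall(r'<dd><p>([\D3]+)</dd></dl>', html)

# Lazy: [\D3]+? stops at the first point where the rest of the
# pattern can match, yielding one result per entry.
lazy = re.findall(r'<dd><p>([\D3]+?)</dd></dl>', html)

print(greedy)
print(lazy)
```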

Python regular expression expansion

I am really bad with regular expressions, and stuck on generating all the possible combinations for a regular expression.
When the regular expression is abc-defghi00[1-24,2]-[1-20,23].walmart.com, it should generate all its possible combinations.
The text before the braces can be anything and the pattern inside the braces is optional.
I need the Python experts to help me with the code.
Sample output
Here is the expected output:
abc-defghi001-1.walmart.com
.........
abc-defghi001-20.walmart.com
abc-defghi001-23.walmart.com
..............
abc-defghi002-1.walmart.com
Repeat this from 1-24 and 2.
Regex tried
([a-z]+)(-)([a-z]+)(\[)(\d)(-)(\d+)(,?)(\d?)(\])(-)(\[)(\d)(-)(\d+)(,?)(\d?)(\])(.*)
Let's say we would like to match against abc-defghi001-1.walmart.com. Now, if we write the following regex, it does the job.
s = 'abc-defghi001-1.walmart.com'
re.match('.*[1-24]-[1-20|23]\.walmart\.com', s)
and the output:
<_sre.SRE_Match object at 0x029462C0>
So, it's found. If you want to match to 27 in the first bracket, you simply replace it by [1-24|27], or if you want to match 0 to 29, you simply replace it by [1-29]. And of course, you know that you have to write import re before all the above commands.
Edit1: As far as I understand, you want to generate all instances of a regular expression and store them in a list.
Use the exrex python library to do so. You can find further information about it here. Then, you have to limit the regex you use.
import re
s = 'abc-defghi001-1.walmart.com'
obj=re.match(r'^\w{3}-\w{6}00(1|2)-([1-20]|23)\.walmart\.com$',s)
print(obj.group())
The above regex will match the template you're looking for, I hope!
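Since the bracket expressions in the question are really numeric ranges plus extras, a plain itertools sketch (no regex at all) is enough to enumerate every hostname. The interpretation of [1-24,2] and [1-20,23] below is an assumption based on the sample output:

```python
from itertools import product

# Assumed expansion of the question's bracket ranges:
# [1-24,2] -> 1..24 (the extra 2 is already in the range),
# [1-20,23] -> 1..20 plus 23.
first = list(range(1, 25))
second = list(range(1, 21)) + [23]

hosts = ['abc-defghi%03d-%d.walmart.com' % (a, b)
         for a, b in product(first, second)]

print(hosts[0])    # abc-defghi001-1.walmart.com
print(len(hosts))  # 24 * 21 = 504
```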

How to specify long url patterns using Regex so that they follow PEP8 guidelines

I have a long url pattern in Django similar to this:
url(r'^(?i)top-dir/(?P<first_slug>[-\w]+?)/(?P<second_slug>[-\w]+?)/(?P<third_slug>[-\w]+?).html/$',
'apps.Discussion.views.pricing',
It definitely doesn't follow the PEP8 guide, as there are more than 80 characters on a single line. I have found two approaches to solving this:
The first one (using backslash):
url(r'^(?i)top-dir/(?P<first_slug>[-\w]+?)/(?P<second_slug>[-\w]+?)'\
'/(?P<third_slug>[-\w]+?).html/$',
'apps.Discussion.views.pricing',
The second one - using ():
url((r'^(?i)top-dir/(?P<first_slug>[-\w]+?)/(?P<second_slug>[-\w]+?)',
r'/(?P<third_slug>[-\w]+?).html/$'),
'apps.Discussion.views.pricing'),
Both of them break the regex. Is there a better approach to solving this issue? Or is it bad practice to write such long regexes for URLs?
Adjacent strings are concatenated, so you can do something like this:
url(r'^(?i)top-dir/(?P<first_slug>[-\w]+?)/'
r'(?P<second_slug>[-\w]+?)/'
r'(?P<third_slug>[-\w]+?).html/$',
'apps.Discussion.views.pricing',)
PEP8 has no regex formatting tips. But try these:
use re.compile and have these benefits
quicker to match/search on them
reference them under a (short) name!
write the regex multiline with (white)spaces
use re.VERBOSE to ignore whitespace in the regex string
use flags instead of “magic groups” ((?i) → re.IGNORECASE)
slugs = re.compile(r'''
^
top-dir/
(?P<first_slug>[-\w]+?)/
(?P<second_slug>[-\w]+?)/
(?P<third_slug>[-\w]+?)\.html/
$
''', re.VERBOSE|re.IGNORECASE)
url(slugs, 'apps.Discussion.views.pricing', ...)
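As a usage sketch outside Django (with the dot in .html escaped, which the original pattern arguably intended, and a made-up URL path):

```python
import re

slugs = re.compile(r'''
    ^
    top-dir/
    (?P<first_slug>[-\w]+?)/
    (?P<second_slug>[-\w]+?)/
    (?P<third_slug>[-\w]+?)\.html/
    $
''', re.VERBOSE | re.IGNORECASE)

m = slugs.match('Top-Dir/foo/bar/baz.html/')
print(m.groupdict())
```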

Combine case sensitive regex and case insensitive regex into one

I have multiple filters for files (I'm using python). Some of them are glob filters some of them are regular expressions. I have both case sensitive and case insensitive globs and regexes. I can transform the glob into a regular expression with translate.
I can combine the case sensitive regular expressions into one big regular expression. Let's call it R_sensitive.
I can combine the case insensitive regular expressions into one big regular expression (case insensitive). Let's call it R_insensitive.
Is there a way to combine R_insensitive and R_sensitive into one regular expression? The expression would be (of course) case sensitive?
Thanks,
Iulian
NOTE: The way I combine expressions is the following:
Having R1,R2,R3 regexes I make R = (R1)|(R2)|(R3).
EXAMPLE:
I'm searching for "*.txt" (insensitive glob). But I have another glob that is like this: "*abc*" (case sensitive). How to combine (from programming) the 2 regex resulted from "fnmatch.translate" when one is case insensitive while the other is case sensitive?
Unfortunately, the regex ability you describe calls for either ordinal modifiers or a modifier span. Python's re did not support either at the time (modifier spans like (?i:...) were only added later, in Python 3.6), though here is what they would look like:
Ordinal Modifiers: (?i)case_insensitive_match(?-i)case_sensitive_match
Modifier Spans: (?i:case_insensitive_match)(?-i:case_sensitive_match)
In older Python, they both fail to parse in re. The closest thing you could do (for simple or small matches) would be letter groups:
[Cc][Aa][Ss][Ee]_[Ii][Nn][Ss][Ee][Nn][Ss][Ii][Tt][Ii][Vv][Ee]_[Mm][Aa][Tt][Cc][Hh]case_sensitive_match
Obviously, this approach would be best for something where the insensitive portion is very brief, so I'm afraid it wouldn't be the best choice for you.
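A quick sketch of the letter-group workaround, using a shortened made-up pattern ("case" insensitive, "_MATCH" sensitive):

```python
import re

# [Cc][Aa][Ss][Ee] matches "case" in any letter case;
# the literal _MATCH part stays case sensitive.
rx = re.compile(r'[Cc][Aa][Ss][Ee]_MATCH')
print(bool(rx.search('CaSe_MATCH')))  # True
print(bool(rx.search('case_match')))  # False
```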
What you need is a way to convert a case-insensitive-flagged regexp into a regexp that works equivalent without the flag.
To do this fully generally is going to be a nightmare.
To do this just for fnmatch results is a whole lot easier.
If you need to handle full Unicode case rules, it will still be very hard.
If you only need to handle making sure each character c also matches c.upper() and c.lower(), it's very easy.
I'm only going to explain the easy case, because it's probably what you want, given your examples, and it's easy. :)
Some modules in the Python standard library are meant to serve as sample code as well as working implementations; these modules' docs start with a link directly to their source code. And fnmatch has such a link.
If you understand regexp syntax, and glob syntax, and look at the source to the translate function, it should be pretty easy to write your own translatenocase function.
Basically: In the inner else clause for building character classes, iterate over the characters, and for each character, if c.upper() != c.lower(), append both instead of c. Then, in the outer else clause for non-special characters, if c.upper() != c.lower(), append a two-character character class consisting of those two characters.
So, translatenocase('*.txt') will return something like r'.*\.[tT][xX][tT]' instead of something like r'.*\.txt'. But normal translate('*abc*') will of course return the usual r'.*abc.*'. And you can combine these just by using an alternation, as you apparently already know how to do.
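A sketch of that idea, deliberately simplified to handle only *, ? and literal characters (translate_nocase is a made-up name, not part of fnmatch, and character classes from the glob are not handled):

```python
import fnmatch
import re

def translate_nocase(pat):
    # Simplified stand-in for a case-insensitive fnmatch.translate:
    # letters become two-character classes, everything else is escaped.
    out = []
    for c in pat:
        if c == '*':
            out.append('.*')
        elif c == '?':
            out.append('.')
        elif c.upper() != c.lower():
            out.append('[%s%s]' % (c.lower(), c.upper()))
        else:
            out.append(re.escape(c))
    return r'(?s:%s)\Z' % ''.join(out)

# Combine an insensitive glob with a sensitive one by alternation.
combined = re.compile('(%s)|(%s)' % (translate_nocase('*.txt'),
                                     fnmatch.translate('*abc*')))
```

So for example, combined then matches 'Report.TXT' (insensitive branch) and 'xx_abc_xx' (sensitive branch), but not 'xx_ABC_xx'.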
