Python regex to split dutch address - python

I have a bit of an issue with regex in python, I am familiar with this regex script in PHP: https://gist.github.com/benvds/350404, but in Python, using the re module, I keep getting no results:
re.findall(r"#^([\w[:punct:] ]+) ([0-9]{1,5})([\w[:punct:]\-/]*)$#", "Wilhelminakade 173")
Output is []
Any ideas?

PHP supports alternative characters as regex delimiters. Your sample Gist uses # for that purpose. They are not part of the regex in PHP, and they are not needed in Python at all. They prevent a match. Remove them.
re.findall(r"^([\w[:punct:] ]+) ([0-9]{1,5})([\w[:punct:]\-/]*)$", "Wilhelminakade 173")
This still gives no result because Python regex does not know what [:punct:] is supposed to mean. There is no support for POSIX character classes in Python's re. Replace them with something else (i.e. the punctuation you expect, probably something like "dots, apostrophes, dashes"). This results in
re.findall(r"^([\w.'\- ]+) ([0-9]{1,5})([\w.'\-/]*)$", "Wilhelminakade 173")
which gives [('Wilhelminakade', '173', '')].
Long story short, there are different regex engines in different programming languages. You cannot just copy regex from PHP to Python without looking at it closely, and expect it to work.

Related

Are regular expressions the same in every programming language?

I am a python user looking to learn regular expressions and I have just a good good course on Udemy that seems to be OK. However it is neither a python course nor a python regular expression course.
Are regular expressions the same on any programming language ?
I mean would they be the same and use the exact same syntax I would be using with the re package in python ?
there are variations on them...
this site will give you a way to test your expression for some common languages (including python)...
https://regex101.com/
There are significant differences both large and subtle between implementations.
According to the (2.7) regex howto, Python's re module was based on Perl regular expressions. The regular expression syntax is almost the same. The usage in Perl is quite different; more compact (or more unreadable, depending on your views :-).
Also keep in mind that there are differences in regular expressions between Python 2 and 3, depending on which flags are used. Simplifying somewhat you could say that out of the box, Python 2 regexes handle ASCII strings while Python 3 handle unicode strings.
In Python regular expressions, the * and + qualifiers are greedy, that is they match as much text as possible. That makes for results that are not intuitive. For example, suppose you want to search for text between angle brackets. You might think that <.*> might do that. But observe:
In [1]: import re
In [2]: re.findall('<.*>', '<a> <b> <c>')
Out[2]: ['<a> <b> <c>']
You have to add a ? to make them non-greedy.
In [3]: re.findall('<.*?>', '<a> <b> <c>')
Out[3]: ['<a>', '<b>', '<c>']
To be explicit, you'd have to look for anything but the end character.
In [4]: re.findall('<[^>]*>', '<a> <b> <c>')
Out[4]: ['<a>', '<b>', '<c>']
UNIX-like systems such as Linux and *BSD generally support POSIX regular expressions in many utilities. Those come in two flavors, basic and extended. Basic POSIX regular expressions do not support the branching metacharacter |.

Detect pattern in a string in python

I need to be able to detect patterns in a string in Python. For example:
xx/xx/xx (where x is an integer).
How could I do this?
Assuming you want to match more than just dates, you'll want to look into using Regular Expressions (also called Regex). Here is the link for the re Python doc: https://docs.python.org/2/library/re.html This will tell you all of the special character sequences that you can use to build your regex matcher. If you're new to regex matching, then I suggest taking a look at some tutorials, for example: http://www.tutorialspoint.com/python/python_reg_expressions.htm
This is a case for regular expressions. The best resource to start out that I have read so far would be from a book called "Automate The boring Stuff with Python.
This is just a sample of how you migth implement regular expressions to solve your problem.
import re
regex = re.compile(r'\d\d\/\d\d/\d\d$')
mo = regex.findall("This is a test01/02/20")
print(mo)
and here is the output
['01/02/20']
Imports python's library to deal with regex(es)
import re
Then you ceate a regex object with:
regex = re.compile(r'\d\d\/\d\d/\d\d')
This migth look scary but it's actually very straight forward.
Youre defining a pattern. In this case the pattern is \d\d or two digits, followed by / then two more and so on.
Here is a good link to learn more
https://docs.python.org/2/library/re.html
Thou I defenetly suggest picking up the book.

How do I append a list of negative lookbehinds to a python regular expression?

I'm trying to split a paragraph into sentences using regex split and I'm trying to use the second answer posted here:
a Regex for extracting sentence from a paragraph in python
But I have a list of abbreviations that I don't want to end the sentence on even though there's a period. But I don't know how to append it to that regular expression properly. I'm reading in the abbreviations from a file that contains terms like Mr. Ms. Dr. St. (one on each line).
Short answer: You can't, unless all lookbehind assertions are of the same, fixed width (which they probably aren't in your case; your example contained only two-letter abbreviations, but Mrs. would break your regex).
This is a limitation of the current Python regex engine.
Longer answer:
You could write a regex like (?s)(?<!.Mr|Mrs|.Ms|.St)\., padding each alternating part of the lookbehind assertion with as many .s as needed to get all of them to the same width. However, that would fail in some circumstances, for example when a paragraph begins with Mr..
Anyway, you're not using the right tool here. Better use a tool designed for the job, for example the Natural Language Toolkit.
If you're stuck with regex (too bad!), then you could try and use a findall() approach instead of split():
(?:(?:\b(?:Mr|Ms|Dr|Mrs|St)\.)|[^.])+\.\s*
would match a sentence that ends in . (optionally followed by whitespace) and may contain no dots unless preceded by one of the allowed abbreviations.
>>> import re
>>> s = "My name is Mr. T. I pity the fool who's not on the A-Team."
>>> re.findall(r"(?:(?:\b(?:Mr|Ms|Dr|Mrs|St)\.)|[^.])+\.\s*", s)
['My name is Mr. T. ', "I pity the fool who's not on the A-Team."]
I don't directly answer your question, but this post should contain enough information for you to write a working regex for your problem.
You can append a list of negative look-behinds. Remember that look-behinds are zero-width, which means that you can put as many look-behinds as you want next to each other, and you are still look-behind from the same position. As long as you don't need to use "many" quantifier (e.g. *, +, {n,}) in the look-behind, everything should be fine (?).
So the regex can be constructured like this:
(?<!list )(?<!of )(?<!words )(?<!not )(?<!allowed )(?<!to )(?<!precede )pattern\w+
It is a bit too verbose. Anyway, I write this post just to demonstrate that it is possible to look-behind on a list of fixed string.
Example run:
>>> s = 'something patterning of patterned crap patternon not patterner, not allowed patternes to patternsses, patternet'
>>> re.findall(r'(?<!list )(?<!of )(?<!words )(?<!not )(?<!allowed )(?<!to )(?<!precede )pattern\w+', s)
['patterning', 'patternon', 'patternet']
There is a catch in using look-behind, though. If there are dynamic number of spaces between the blacklisted text and the text matching the pattern, the regex above will fail. I really doubt there exists a way to modify the regex so that it works for the case above while keeping the look-behinds. (You can always replace consecutive spaces into 1, but it won't work for more general cases).

Match letter in any language

How can I match a letter from any language using a regex in python 3?
re.match([a-zA-Z]) will match the english language characters but I want all languages to be supported simultaneously.
I don't wish to match the ' in can't or underscores or any other type of formatting. I do wish my regex to match: c, a, n, t, Å, é, and 中.
For Unicode regex work in Python, I very strongly recommend the following:
Use Matthew Barnett’s regex library instead of standard re, which is not really suitable for Unicode regular expressions.
Use only Python 3, never Python 2. You want all your strings to be Unicode strings.
Use only string literals with logical/abstract Unicode codepoints, not encoded byte strings.
Set your encoding on your streams and forget about it. If you find yourself ever manually calling .encode and such, you’re almost certainly doing something wrong.
Use only a wide build where code points and code units are the same, never ever ever a narrow one — which you might do well to consider deprecated for Unicode robustness.
Normalize all incoming strings to NFD on the way in and then NFC on the way out. Otherwise you can’t get reliable behavior.
Once you do this, you can safely write patterns that include \w or \p{script=Latin} or \p{alpha} and \p{lower} etc and know that these will all do what the Unicode Standard says they should. I explain all of this business of Python Unicode regex business in much more detail in this answer. The short story is to always use regex not re.
For general Unicode advice, I also have several talks from last OSCON about Unicode regular expressions, most of which apart from the 3rd talk alone is not about Python, but much of which is adaptable.
Finally, there’s always this answer to put the fear of God (or at least, of Unicode) in your heart.
What's wrong with using the \w special sequence?
# -*- coding: utf-8 -*-
import re
test = u"can't, Å, é, and 中ABC"
print re.findall('\w+', test, re.UNICODE)
You can match on
\p{L}
which matches any Unicode code point that represents a letter of a script. That is, assuming you actually have a Unicode-capable regex engine, which I really hope Python would have.
Build a match class of all the characters you want to match. This might become very, very large. No, there is no RegEx shorthand for "All Kanji" ;)
Maybe it is easier to match for what you do not want, but even then, this class would become extremely large.
import re
text = "can't, Å, é, and 中ABC"
print(re.findall('\w+', text))
This works in Python 3. But it also matches underscores. However this seems to do the job as I wish:
import regex
text = "can't, Å, é, and 中ABC _ sh_t"
print(regex.findall('\p{alpha}+', text))
For Portuguese language, use try this one:
[a-zA-ZÀ-ú ]+
As noted by others, it would be very difficult to keep the up-to-date database of all letters in all existing languages. But in most cases you don't actually need that and it can be perfectly fine for your code to begin by supporing just several chosen languages and adding others as needed.
The following simple code supports matching for Czech, German and Polish language. The character sets can be easily obtained from Wikipedia.
import re
LANGS = [
'ÁáČčĎďÉéĚěÍíŇňÓóŘřŠšŤťÚúŮůÝýŽž', # Czech
'ÄäÖöÜüẞß', # German
'ĄąĆćĘꣳŃńÓ󌜏źŻż', # Polish
]
pattern = '[A-Za-z{langs}]'.format(langs=''.join(LANGS))
pattern = re.compile(pattern)
result = pattern.findall('Žluťoučký kůň')
print(result)
# ['Ž', 'l', 'u', 'ť', 'o', 'u', 'č', 'k', 'ý', 'k', 'ů', 'ň']

Comments in string and strings in comments

I am trying to count characters in comments included in C code using Python and Regex, but no success. I can erase strings first to get rid of comments in strings, but this will erase string in comments too and result will be bad ofc. Is there any chance to ask by using regex to not match strings in comments or vice versa?
No, not really.
Regex is not the correct tool to parse nested structures like you describe; instead you will need to parse the C syntax (or the "dumb subset" of it you're interested in, anyway), and you might find regex helpful in that. A relatively simple state machine with three states (CODE, STRING, COMMENT) would do it.
Regular expressions are not always a replacement for a real parser.
You can strip out all strings that aren't in comments by searching for the regular expression:
'[^'\r\n]+'|(//.*|/\*(?s:.*?)\*/)
and replacing with:
$1
Essentially, this searches for the regex string|(comment) which matches a string or a comment, capturing the comment. The replacement is either nothing if a string was matched or the comment if a comment was matched.
Though regular expressions are not a replacement for a real parser you can quickly build a rudimentary parser by creating a giant regex that alternates all of the tokens you're interested in (comments and strings in this case). If you're writing a bit of code to handle comments, but not those in strings, iterate over all the matches of the above regex, and count the characters in the first capturing group if it participated in the match.

Categories