Python pattern matching with language-specific characters - python

From a list of strings, I want to extract all words and add them to a new list. I was able to do so using pattern matching in the form of:
import re
p = re.compile('[a-z]+', re.IGNORECASE)
p.findall("02_Sektion_München_Gruppe_Süd")
Unfortunately, the language contains language-specific characters, so strings like the given example yield:
['Sektion', 'M', 'nchen', 'Gruppe', 'S', 'd']
I want it to yield:
['Sektion', 'München', 'Gruppe', 'Süd']
I am grateful for suggestions how to solve this problem.

You may use
import re
p = re.compile(r'[^\W\d_]+')
print(p.findall("02_Sektion_München_Gruppe_Süd"))
# => ['Sektion', 'München', 'Gruppe', 'Süd']
See the Python 3 demo.
The [^\W\d_]+ pattern matches one or more characters that are word characters but neither digits nor _, that is, letters only.
In Python 2.x you will have to add re.UNICODE flag to make it match Unicode letters:
p = re.compile(r'[^\W\d_]+', re.U)
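To see the difference side by side, here is a minimal sketch for Python 3. The names ascii_only and letters are only illustrative and do not come from the question.

```python
import re

text = "02_Sektion_München_Gruppe_Süd"

# ASCII-only class from the question: words are cut at every umlaut.
ascii_only = re.compile(r'[a-z]+', re.IGNORECASE)
# Word characters minus digits and underscore; in Python 3, \w is Unicode-aware for str patterns.
letters = re.compile(r'[^\W\d_]+')

broken = ascii_only.findall(text)
fixed = letters.findall(text)
print(broken)  # ['Sektion', 'M', 'nchen', 'Gruppe', 'S', 'd']
print(fixed)   # ['Sektion', 'München', 'Gruppe', 'Süd']
```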

Related

Python regex negative lookahead matching where it shouldn't

Example first:
import re
details = 'input1 mem001 output1 mem005 data2 mem002 output12 mem006'
input_re = re.compile(r'(?!output[0-9]*) mem([0-9a-f]+)')
print(input_re.findall(details))
# Out: ['001', '005', '002', '006']
I am using negative lookahead to extract the hex part of the mem entries that are not preceded by an output, however as you can see it fails. The desired output should be: ['001', '002'].
What am I missing?
You may use this regex in findall:
\b(?!output\d+)\w+\s+mem([a-fA-F\d]+)
RegEx Demo
RegEx Details:
\b: Word boundary
(?!output\d+): Negative lookahead to assert that we don't have output and 1+ digits ahead
\w+: Match 1+ word characters
\s+: Match 1+ whitespaces
mem([a-fA-F\d]+): Match mem followed by 1+ of any hex character
Code:
import re
s = 'input1 mem001 output1 mem005 data2 mem002 output12 mem006'
print(re.findall(r'\b(?!output\d+)\w+\s+mem([a-fA-F\d]+)', s))
Output:
['001', '002']
Maybe an easier approach is to split it up into two regular expressions?
First filter out anything that starts with output and is followed by mem like so
output[0-9]* mem([0-9a-f]+)
If you filter this out it would result in
input1 mem001 data2 mem002
When you have filtered them out just search for mem again
mem([0-9a-f]+)
That would result in your desired output
['001', '002']
Maybe not an answer to the original question, but it is a solution to your problem
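The two steps could look roughly like this. It is only a sketch: the answer does not say how to do the filtering, so using re.sub for it is my assumption, and the variable names are made up.

```python
import re

details = 'input1 mem001 output1 mem005 data2 mem002 output12 mem006'

# Step 1: remove every "outputN memXXX" pair from the text.
without_outputs = re.sub(r'output[0-9]* mem[0-9a-f]+', '', details)

# Step 2: collect the mem values that are left.
remaining = re.findall(r'mem([0-9a-f]+)', without_outputs)
print(remaining)  # ['001', '002']
```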
First of all, let's understand why your original regex doesn't work:
A regex encapsulates two pieces of information: a description of a location within a text, and a description of what to capture from that location. Your original regex tells the regex matcher: "Find a location within the text where the following characters are not 'output'+digits but they are ' mem'+alphanumerics". Think of the logic of that expression: if the matcher finds a location in the text where the following characters are ' mem'+alphanumerics, then, in particular, the following characters are not 'output'+digits. Your look-ahead does not add anything to the expression.
What you really need is to tell the matcher: "Find a location in the text where the following characters are ' mem'+alphanumerics, and the previous characters are not 'output'+digits". So what you really need is a look-behind, not a look-ahead.
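@ArtyomVancyan proposed a good regex with a look-behind, and it could be modified to what you need: instead of a single digit after the 'output', you want potentially more digits. Note that Python's built-in re module only accepts fixed-width look-behind, so putting an asterisk (*) after the '\d' works only with an engine that supports variable-width look-behind, such as the third-party regex module.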
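Here is a small sketch of both points. It uses only fixed-width look-behind assertions, which is what Python's built-in re accepts, so it chains one assertion per digit count. That 'output' is followed by at most two digits is my assumption, based on the sample string.

```python
import re

details = 'input1 mem001 output1 mem005 data2 mem002 output12 mem006'

# The look-ahead sits right before ' mem', so it can never fail there:
# both patterns return the same matches.
lookahead = re.compile(r'(?!output[0-9]*) mem([0-9a-f]+)')
plain = re.compile(r' mem([0-9a-f]+)')
same = lookahead.findall(details) == plain.findall(details)

# Look-behind checks what precedes ' mem'. One fixed-width assertion per
# digit count (assumption: 'output' carries one or two digits).
lookbehind = re.compile(r'(?<!output\d)(?<!output\d\d) mem([0-9a-f]+)')
result = lookbehind.findall(details)

print(same)    # True
print(result)  # ['001', '002']
```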

Need Regex that matches all patterns with format as `{word}{.,#}{word}` with strict matching

So I have been trying to construct a regex that can detect the pattern {word}{.,#}{word} and separate it into [word, ',' (or '.', '#'), word].
But I am not able to create one that does strict matching for this pattern and ignores everything else.
I used the following regex
r"[\w]+|[.]"
This one does well, but it doesn't do strict matching: if none of the characters (, # or .) occur in the text, it still gives me the words, which I don't want.
I would like a regex that strictly matches the above pattern and gives me the splits (using re.findall), and otherwise returns the whole word as it is.
Please note: the words on either side of the {,.#} need not both be present, but at least one must be.
Some example text for reference:
no.16 would give me ['no','.','16']
#400 would give me ['#','400']
word1.word2 would give me ['word1','.','word2']
Looking forward to some help and assistance from all regex gurus out there
EDIT:
I forgot to add this. @viktor's version works as needed with only one problem: it ignores ALL other words during re.findall.
E.g. ONE TWO THREE #400 with viktor's regex gives me ['','#','400']
but what was expected was ['ONE','TWO','THREE','#','400'].
This can be done with NLTK or spaCy, but I am restricted from using those.
I suggest using
(\w+)?([.,#])((?(1)\w*|\w+))
See the regex demo.
Details
(\w+)? - An optional group #1: one or more word chars
([.,#]) - Group #2: ., , or #
((?(1)\w*|\w+)) - Group #3: if Group 1 matched, match zero or more word chars (the word is optional on the right side then), else, match one or more word chars (there must be a word on the right side of the punctuation chars since there is no word before them).
See the Python demo:
import re
pattern = re.compile(r'(\w+)?([.,#])((?(1)\w*|\w+))')
strings = ['no.16', '#400', 'word1.word2', 'word', '123']
for s in strings:
    print(s, '->', pattern.findall(s))
Output:
no.16 -> [('no', '.', '16')]
#400 -> [('', '#', '400')]
word1.word2 -> [('word1', '.', 'word2')]
word -> []
123 -> []
The answer to your edit is
if re.search(r'\w[.,#]|[.,#]\w', text):
    print(re.findall(r'[.,#]|[^\s.,#]+', text))
If the input string contains a word char followed by any of the three punctuation symbols, or one of those symbols followed by a word char, you can find and extract all occurrences of the [.,#]|[^\s.,#]+ pattern, namely a ., , or #, or one or more chars other than whitespace, ., , and #.
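Wrapped into a function, the check and the extraction might look like the sketch below. The helper name split_on_marks is hypothetical, and returning the input unchanged when nothing matches is my reading of "if not returns the whole word as it is" from the question.

```python
import re

def split_on_marks(text):
    """Hypothetical helper: tokenize only if a mark touches a word char."""
    if re.search(r'\w[.,#]|[.,#]\w', text):
        # Each token is either a single mark or a run of other non-space chars.
        return re.findall(r'[.,#]|[^\s.,#]+', text)
    # Assumption: with no qualifying mark, hand back the input as it is.
    return [text]

print(split_on_marks('ONE TWO THREE #400'))  # ['ONE', 'TWO', 'THREE', '#', '400']
print(split_on_marks('no.16'))               # ['no', '.', '16']
print(split_on_marks('word'))                # ['word']
```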
I hope this code will solve your problem if you want to split the string by any of the mentioned special characters:
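import re
a = 'no.16'
b = '#400'
c = 'word1.word2'
lst = [a, b, c]
for elem in lst:
    # The capture group keeps the delimiters in the split result.
    result = re.split(r'(\.|#|,)', elem)
    # Drop the empty strings that split leaves at the edges.
    while '' in result:
        result.remove('')
    print(result)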
You could do something like this:
import re
text = "no.16"
pattern = re.compile(r"(\w+)([.,#])(\w+)")
result = list(filter(None, pattern.split(text)))
The list(filter(...)) part is needed to remove the empty strings that split returns (see Python - re.split: extra empty strings that the beginning and end list).
However, this will only work if your string only contains these two words separated by one of the delimiters specified by you. If there is additional content before or after the pattern, this will also be returned by split.

Python regex to split both on number and on capital letter

I cannot get exactly what I want with regex
I have, for example a string
2000H2HfH
I need to get ['2000','H','2','Hf','H'].
So, I need to split on numbers and on capital letters, where a capital letter may be followed by lowercase letters.
I use ([A-Z][a-z]?)(\d+)? and lose the starting number. I understand why, but how can I get it back so that the result is readable?
You may use
re.findall(r'\d+|[A-Z][a-z]*', text)
See a regex demo. Details:
\d+ - 1+ digits
| - or
[A-Z][a-z]* - an upper case letter and then zero or more lowercase ones.
See a Python demo:
import re
text = "2000H2HfH"
print( re.findall(r'\d+|[A-Z][a-z]*', text) )
# => ['2000', 'H', '2', 'Hf', 'H']
You have two capture groups one after another, so they are captured one after the other. To achieve your goal, put both alternatives into a single group:
([A-Z][a-z]?|\d+)
Here the | symbol means that you capture either a capital letter optionally followed by one lowercase letter, OR a number. (Do not make the group optional with a trailing ?, since that would also let the pattern match empty strings.)
regex101 (https://regex101.com/) is a very nice service for composing and testing regular expressions.

repeated pattern in regex

I am trying to catch a repeated pattern in my string. The subpattern starts with the beginning of a word or ":" and ends with ":" or the end of a word. I tried findall and search in combination with multiple matching ((subpattern)__(subpattern))+ but was not able to figure out what is wrong:
cc = "GT__abc23_1231:TF__XYZ451"
import regex
ma = regex.match("(\b|\:)([a-zA-Z]*)__(.*)(:|\b)", cc)
Expected output:
GT, abc23_1231, TF, XYZ451
I saw a bunch of questions like this, but it did not help.
It seems you can use
(?:[^_:]|(?<!_)_(?!_))+
See the regex demo
Pattern details:
(?:[^_:]|(?<!_)_(?!_))+ - 1 or more sequences of:
[^_:] - any character but _ and :
(?<!_)_(?!_) - a single _ not enclosed with other _s
Python demo with re based solution:
import re
p = re.compile(r'(?:[^_:]|(?<!_)_(?!_))+')
s = "GT__abc23_1231:TF__XYZ451"
print(p.findall(s))
# => ['GT', 'abc23_1231', 'TF', 'XYZ451']
If the first character is always not a : and _, you may use an unrolled regex like:
r'[^_:]+(?:_(?!_)[^_:]*)*'
It won't match values that start with a single _, though, so the first (non-unrolled) regex is safer.
Use the smallest common denominator in "starts and ends with a : or a word-boundary", that is the word-boundary (your substrings are composed with word characters):
>>> import re
>>> cc = "GT__abc23_1231:TF__XYZ451"
>>> re.findall(r'\b([A-Za-z]+)__(\w+)', cc)
[('GT', 'abc23_1231'), ('TF', 'XYZ451')]
Testing if there are : around is useless.
(Note: no need to add a \b after \w+, since the quantifier is greedy, the word-boundary becomes implicit.)
[EDIT]
According to your comment: "I want to first split on ":", then split on double underscore.", perhaps you don't need a regex at all:
>>> [x.split('__') for x in cc.split(':')]
[['GT', 'abc23_1231'], ['TF', 'XYZ451']]
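If you want the flat list from the question (GT, abc23_1231, TF, XYZ451) rather than pairs, one possible sketch is to flatten the nested split, or to split on both delimiters at once with re.split. Neither variant is from the answer above; both assume the values never contain ":" or a double underscore.

```python
import re

cc = "GT__abc23_1231:TF__XYZ451"

# Plain string methods: split on ':' first, then on '__', and flatten.
flat = [part for chunk in cc.split(':') for part in chunk.split('__')]

# Same result with a single regex split on either delimiter.
flat_re = re.split(r'__|:', cc)

print(flat)     # ['GT', 'abc23_1231', 'TF', 'XYZ451']
print(flat_re)  # ['GT', 'abc23_1231', 'TF', 'XYZ451']
```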

Ignore optional suffix without preceding delimiter

I would like to capture the first part of a word, ignoring the optional suffix. Both the suffix and preceding text are composed of the same class of characters (that is, there is no delimiter before the suffix).
My first try only captures the first letter:
m = re.search(r'([A-Za-z]+?)(?:Suff)?', 'textSuff')
m.groups()
>>> ('t',)
I want to capture "text" only, but when I make the first group element greedy, it grabs the entire string.
m = re.search(r'([A-Za-z]+)(?:Suff)?', 'textSuff')
m.groups()
>>> ('textSuff',)
Is it feasible without a different character to delimit the suffix?
If your pattern is all constructed from optional patterns, be sure you will get as few characters in return as possible. Thus, there must be at least a boundary. I guess the word boundary \b is a valid way to go here (since you need to match words):
([A-Za-z]+?)(?:Suff)?\b
See demo
IDEONE DEMO:
import re
p = re.compile(r'([A-Za-z]+?)(?:Suff)?\b')
test_str = "textSuff more words tSuff"
print(re.findall(p, test_str))
Outputs:
['text', 'more', 'words', 't']
You need to specify that, after everything, either the string must end or there must be a non-acceptable character:
m = re.search(r'([A-Za-z]+?)(?:Suff)?(?:[^A-Za-z]|$)', 'textSuff')
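When each input is a single word, the "string must end" requirement can also be sketched with re.fullmatch, which forces the pattern to cover the whole string so the lazy group has to stop right before the suffix. The helper name strip_suffix is hypothetical, and treating every input as one word is my assumption.

```python
import re

# Lazy group plus optional suffix; fullmatch supplies the end-of-string anchor.
pattern = re.compile(r'([A-Za-z]+?)(?:Suff)?')

def strip_suffix(word):
    """Hypothetical helper: return the word without a trailing 'Suff'."""
    m = pattern.fullmatch(word)
    # Assumption: inputs that are not purely alphabetic are returned unchanged.
    return m.group(1) if m else word

print(strip_suffix('textSuff'))  # text
print(strip_suffix('text'))      # text
print(strip_suffix('tSuff'))     # t
```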
