Python Regex - get words around match - python

I want to get the words before and after my match. I could use string.split(' ') - but as I already use regex, isn't there a much better way using only regex?
Using a match object, I can get the exact location. However, this location is character indexed.
import re
myString = "this. is 12my90\nExample string"
pattern = re.compile(r"(\b12(\w+)90\b)",re.IGNORECASE | re.UNICODE)
m = pattern.search(myString)
print("Hit: "+m.group())
print("Indix range: "+str(m.span()))
print("Words around match: "+myString[m.start()-1:m.end()+1]) # should be +/-1 in _words_, not characters
Output:
Hit: 12my90 Indix
range: (9, 15)
Words around match: 12my90
For getting the matching word and the word before, I tried:
pattern = re.compile(r"(\b(w+)\b)\s(\b12(\w+)90\b)",re.IGNORECASE |
re.UNICODE)
Which yields no matches.

In the second pattern you have to escape the w+ like \w+.
Apart from that, there is a newline in your example which you can match using another following \s
Your pattern with 3 capturing groups might look like
(\b\w+\b)\s(\b12\w+90\b)\s(\b\w+\b)
Regex demo
You could use the capturing groups to get the values
print("Words around match: " + m.group(1) + " " + m.group(3))

new line character is missing
regx = r"(\w+)\s12(\w+)90\n(\w+)"

Related

Python get N characters from string with specific substring

I have a very long string that I extracted from a image file.
The string can look like this
...\n\nDate: 01.01.2022\n\nArticle-no: 123456789\n\nArticle description: asdfqwer 1234...\n...
How do I extract just the 10 characters after the substring "Article-no:"?
I tried solving it with a different approach using rfind like this but it tends to fail every now and then if the start and end string is not accurate.
s = "... string shown above ..."
start = "Article-no: "
end = "Article description: "
print(s[s.find(start)+len(start):s.rfind(end)])
you can use split:
string.split("Article-no: ", 1)[1][0:10]
For this, a regular expression might come in very handy.
import re
# Create a pattern which matches "Article-no: " literally,
# and then grabs the digits that follow.
pattern = re.compile(r"Article-no: (\d+)")
s = "...\n\nDate: 01.01.2022\n\nArticle-no: 123456789\n\nArticle description: asdfqwer 1234...\n..."
match = pattern.search(s)
if match:
print(match.group(1))
This outputs:
123456789
The regular expression used is Article-no: (\d+), which has the following parts:
Article-no: # Match this text literally
( # Open a new group (i.e. group 1)
\d+ # Match 1 or more occurrences of a digit
) # Close group 1
The re module will search the string for places where this matches, and then you can extract the digit from the matches.

Match everything except a pattern and replace matched with string

I want to use python in order to manipulate a string I have.
Basically, I want to prepend"\x" before every hex byte except the bytes that already have "\x" prepended to them.
My original string looks like this:
mystr = r"30336237613131\x90\x01\x0A\x90\x02\x146F6D6D616E64\x90\x01\x06\x90\x02\x0F52656C6174\x90\x01\x02\x90\x02\x50656D31\x90\x00"
And I want to create the following string from it:
mystr = r"\x30\x33\x62\x37\x61\x31\x31\x90\x01\x0A\x90\x02\x14\x6F\x6D\x6D\x61\x6E\x64\x90\x01\x06\x90\x02\x0F\x52\x65\x6C\x61\x74\x90\x01\x02\x90\x02\x50\x65\x6D\x31\x90\x00"
I thought of using regular expressions to match everything except /\x../g and replace every match with "\x". Sadly, I struggled with it a lot without any success. Moreover, I'm not sure that using regex is the best approach to solve such case.
Regex: (?:\\x)?([0-9A-Z]{2}) Substitution: \\x$1
Details:
(?:) Non-capturing group
? Matches between zero and one time, match string \x if it exists.
() Capturing group
[] Match a single character present in the list 0-9 and A-Z
{n} Matches exactly n times
\\x String \x
$1 Group 1.
Python code:
import re
text = R'30336237613131\x90\x01\x0A\x90\x02\x146F6D6D616E64\x90\x01\x06\x90\x02\x0F52656C6174\x90\x01\x02\x90\x02\x50656D31\x90\x00'
text = re.sub(R'(?:\\x)?([0-9A-Z]{2})', R'\\x\1', text)
print(text)
Output:
\x30\x33\x62\x37\x61\x31\x31\x90\x01\x0A\x90\x02\x14\x6F\x6D\x6D\x61\x6E\x64\x90\x01\x06\x90\x02\x0F\x52\x65\x6C\x61\x74\x90\x01\x02\x90\x02\x50\x65\x6D\x31\x90\x00
Code demo
You don't need regex for this. You can use simple string manipulation. First remove all of the "\x" from your string. Then add add it back at every 2 characters.
replaced = mystr.replace(r"\x", "")
newstr = "".join([r"\x" + replaced[i*2:(i+1)*2] for i in range(len(replaced)/2)])
Output:
>>> print(newstr)
\x30\x33\x62\x37\x61\x31\x31\x90\x01\x0A\x90\x02\x14\x6F\x6D\x6D\x61\x6E\x64\x90\x01\x06\x90\x02\x0F\x52\x65\x6C\x61\x74\x90\x01\x02\x90\x02\x50\x65\x6D\x31\x90\x00
You can get a list with your values to manipulate as you wish, with an even simpler re pattern
mystr = r"30336237613131\x90\x01\x0A\x90\x02\x146F6D6D616E64\x90\x01\x06\x90\x02\x0F52656C6174\x90\x01\x02\x90\x02\x50656D31\x90\x00"
import re
pat = r'([a-fA-F0-9]{2})'
match = re.findall(pat, mystr)
if match:
print('\n\nNew string:')
print('\\x' + '\\x'.join(match))
#for elem in match: # match gives you a list of strings with the hex values
# print('\\x{}'.format(elem), end='')
print('\n\nOriginal string:')
print(mystr)
This can be done without replacing existing \x by using a combination of positive lookbehinds and negative lookaheads.
(?!(?<=\\x)|(?<=\\x[a-f\d]))([a-f\d]{2})
Usage
See code in use here
import re
regex = r"(?!(?<=\\x)|(?<=\\x[a-f\d]))([a-f\d]{2})"
test_str = r"30336237613131\x90\x01\x0A\x90\x02\x146F6D6D616E64\x90\x01\x06\x90\x02\x0F52656C6174\x90\x01\x02\x90\x02\x50656D31\x90\x00"
subst = r"\\x$1"
result = re.sub(regex, subst, test_str, 0, re.IGNORECASE)
if result:
print (result)
Explanation
(?!(?<=\\x)|(?<=\\x[a-f\d])) Negative lookahead ensuring either of the following doesn't match.
(?<=\\x) Positive lookbehind ensuring what precedes is \x.
(?<=\\x[a-f\d]) Positive lookbehind ensuring what precedes is \x followed by a hexidecimal digit.
([a-f\d]{2}) Capture any two hexidecimal digits into capture group 1.

About how to find all desired format in a str

I have a text like this format,
s = '[aaa]foo[bbb]bar[ccc]foobar'
Actually the text is Chinese car review like this
【最满意】整车都很满意,最满意就是性价比,...【空间】空间真的超乎想象,毫不夸张,...【内饰】内饰还可以吧,没有多少可以说的...
Now I want to split it to these parts
[aaa]foo
[bbb]bar
[ccc]foobar
first I tried
>>> re.findall(r'\[.*?\].*?',s)
['[aaa]', '[bbb]', '[ccc]']
only got first half.
Then I tried
>>> re.findall(r'(\[.*?\].*?)\[?',s)
['[aaa]', '[bbb]', '[ccc]']
still only got first half
At last I have to get the two parts respectively then zip them
>>> re.findall(r'\[.*?\]',s)
['[aaa]', '[bbb]', '[ccc]']
>>> re.split(r'\[.*?\]',s)
['', 'foo', 'bar', 'foobar']
>>> for t in zip(re.findall(r'\[.*?\]',s),[e for e in re.split(r'\[.*?\]',s) if e]):
... print(''.join(t))
...
[aaa]foo
[bbb]bar
[ccc]foobar
So I want to know if exists some regex could directly split it to these parts?
One of the approaches:
import re
s = '[aaa]foo[bbb]bar[ccc]foobar'
result = re.findall(r'\[[^]]+\][^\[\]]+', s)
print(result)
The output:
['[aaa]foo', '[bbb]bar', '[ccc]foobar']
\[ or \] - matches the bracket literally
[^]]+ - matches one or more characters except ]
[^\[\]]+ - matches any character(s) except brackets \[\]
I think this could work:
r'\[.+?\]\w+'
Here it is:
>>> re.findall(r"(\[\w*\]\w+)",s)
['[aaa]foo', '[bbb]bar', '[ccc]foobar']
Explanation:
parenthesis means the group to search. Witch group:
it should start by a braked \[ followed by some letters \w
then the matched braked braked \] followed by more letters \w
Notice you should to escape braked with \.
I think if input string format is "strict enough", it's possible to try something w/o regexp. It may look as a microoptimisation, but could be interesting as a challenge.
result = map(lambda x: '[' + x, s[1:].split("["))
So I tried to check performance on a 1Mil iterations and here are my results (seconds):
result = map(lambda x: '[' + x, s[1:].split("[")) # 0.89862203598
result = re.findall(r'\[[^]]+\][^\[\]]+', s) # 1.48306798935
result = re.findall(r'\[.+?\]\w+', s) # 1.47224497795
result = re.findall(r'(\[\w*\]\w+)', s) # 1.47370815277
\[.*?\][a-zA-Z]*
This regex should capture anything that start with [somethinghere]Any letters from a to Z
you can play on regex101 to try out different ones and it's easy to make your own regex there
All you need is findall and here is very simple pattern without making it complicated:
import re
print(re.findall(r'\[\w+\]\w+','[aaa]foo[bbb]bar[ccc]foobar'))
output:
['[aaa]foo', '[bbb]bar', '[ccc]foobar']
Detailed solution:
import re
string_1='[aaa]foo[bbb]bar[ccc]foobar'
pattern=r'\[\w+\]\w+'
print(re.findall(pattern,string_1))
explanation:
\[\w+\]\w+
\[ matches the character [ literally (case sensitive)
\w+ matches any word character (equal to [a-zA-Z0-9_])
+ Quantifier — Matches between one and unlimited times, as many times as possible, giving back as needed

repeated pattern in regex

I am trying to catch a repeated pattern in my string. The subpattern starts with the beginning of word or ":" and ends with ":" or end of word. I tried findall and search in combination of multiple matching ((subpattern)__(subpattern))+ but was not able what is wrong:
cc = "GT__abc23_1231:TF__XYZ451"
import regex
ma = regex.match("(\b|\:)([a-zA-Z]*)__(.*)(:|\b)", cc)
Expected output:
GT, abc23_1231, TF, XYZ451
I saw a bunch of questions like this, but it did not help.
It seems you can use
(?:[^_:]|(?<!_)_(?!_))+
See the regex demo
Pattern details:
(?:[^_:]|(?<!_)_(?!_))+ - 1 or more sequences of:
[^_:] - any character but _ and :
(?<!_)_(?!_) - a single _ not enclosed with other _s
Python demo with re based solution:
import re
p = re.compile(r'(?:[^_:]|(?<!_)_(?!_))+')
s = "GT__abc23_1231:TF__XYZ451"
print(p.findall(s))
# => ['GT', 'abc23_1231', 'TF', 'XYZ451']
If the first character is always not a : and _, you may use an unrolled regex like:
r'[^_:]+(?:_(?!_)[^_:]*)*'
It won't match the values that start with single _ though (so, an unrolled regex is safer).
Use the smallest common denominator in "starts and ends with a : or a word-boundary", that is the word-boundary (your substrings are composed with word characters):
>>> import re
>>> cc = "GT__abc23_1231:TF__XYZ451"
>>> re.findall(r'\b([A-Za-z]+)__(\w+)', cc)
[['GT', 'abc23_1231'], ['TF', 'XYZ451']]
Testing if there are : around is useless.
(Note: no need to add a \b after \w+, since the quantifier is greedy, the word-boundary becomes implicit.)
[EDIT]
According to your comment: "I want to first split on ":", then split on double underscore.", perhaps you dont need regex at all:
>>> [x.split('__') for x in cc.split(':')]
[['GT', 'abc23_1231'], ['TF', 'XYZ451']]

regex - how to select a word that has a '-' in it?

I am learning Regular Expressions, so apologies for a simple question.
I want to select the words that have a '-' (minus sign) in it but not at the beginning and not at the end of the word
I tried (using findall):
r'\b-\b'
for
str = 'word semi-column peace'
but, of course got only:
['-']
Thank you!
What you actually want to do is a regex like this:
\w+-\w+
What this means is find a alphanumeric character at least once as indicated by the utilization of '+', then find a '-', following by another alphanumeric character at least once, again, as indicated by the '+' again.
str is a built in name, better not to use it for naming
st = 'word semi-column peace'
# \w+ word - \w+ word after -
print(re.findall(r"\b\w+-\w+\b",st))
['semi-column']
a '-' (minus sign) in it but not at the beginning and not at the end of the word
Since "-" is not a word character, you can't use word boundaries (\b) to prevent a match from words with hyphens at the beggining or end. A string like "-not-wanted-" will match both \b\w+-\w+\b and \w+-\w+.
We need to add an extra condition before and after the word:
Before: (?<![-\w]) not preceded by either a hyphen nor a word character.
After: (?![-\w]) not followed by either a hyphen nor a word character.
Also, a word may have more than 1 hyphen in it, and we need to allow it. What we can do here is repeat the last part of the word ("hyphen and word characters") once or more:
\w+(?:-\w+)+ matches:
\w+ one or more word characters
(?:-\w+)+ a hyphen and one or more word characters, and also allows this last part to repeat.
Regex:
(?<![-\w])\w+(?:-\w+)+(?![-\w])
regex101 demo
Code:
import re
pattern = re.compile(r'(?<![-\w])\w+(?:-\w+)+(?![-\w])')
text = "-abc word semi-column peace -not-wanted- one-word dont-match- multi-hyphenated-word"
result = re.findall(pattern, text)
ideone demo
You can also use the following regex:
>>> st = "word semi-column peace"
>>> print re.findall(r"\S+\-\S+", st)
['semi-column']
You can try something like this: Centering on the hyphen, I match until there is a white space in either direction from the hyphen I also make check to see if the words are surrounded by hyphens (e.g -test-cats-) and if they are I make sure not to include them. The regular expression should also work with findall.
st = 'word semi-column peace'
m = re.search(r'([^ | ^-]+-[^ | ^-]+)', st)
if m:
print m.group(1)

Categories