Find pattern in string using regex with python 3 - python

I have a string like the one below:
string = "your invoice number IVR/20170531/XVII/V/12652967 and IVR/20170531/XVII/V/13652967"
I want to get the invoice numbers IVR/20170531/XVII/V/12652967 and IVR/20170531/XVII/V/13652967 into a list, using a regex with this pattern:
result = re.findall(r'INV[/]\d{8}[/](M{1,4}(CM|CD|D?C{0,3})(XC|XL|L?X{0,3})(IX|IV|V?I{0,3})|M{0,4}(CM|C?D|D?C{1,3})(XC|XL|L?X{0,3})(IX|IV|V?I{0,3})|M{0,4}(CM|CD|D?C{0,3})(XC|X?L|L?X{1,3})(IX|IV|V?I{0,3})|M{0,4}(CM|CD|D?C{0,3})(XC|XL|L?X{0,3})(IX|I?V|V?I{1,3}))[/](M{1,4}(CM|CD|D?C{0,3})(XC|XL|L?X{0,3})(IX|IV|V?I{0,3})|M{0,4}(CM|C?D|D?C{1,3})(XC|XL|L?X{0,3})(IX|IV|V?I{0,3})|M{0,4}(CM|CD|D?C{0,3})(XC|X?L|L?X{1,3})(IX|IV|V?I{0,3})|M{0,4}(CM|CD|D?C{0,3})(XC|XL|L?X{0,3})(IX|I?V|V?I{1,3}))[/]\d{7,9}',string)
but the result is
[('XVII', '', '','', '', '', '', '', 'X', 'VII', '', '', '', 'V','','','', '', '', '', '', '', '', '', '', 'V')]
I tried this pattern at http://regexr.com/ and the result there is as expected, but in Python it is not.

You should modify your pattern: either wrap the whole regular expression in an extra pair of parentheses and read that text from the first group, or take the whole match from group 0 of each match object.
invoices = []
# Your pattern was slightly incorrect: it searched for INV, but the string contains IVR
pattern = re.compile(r'IVR[/]\d{8}[/](M{1,4}(CM|CD|D?C{0,3})|(XC|XL|L?X{0,3})(IX|IV|V?I{0,3})|M{0,4}(CM|C?D|D?C{1,3})(XC|XL|L?X{0,3})(IX|IV|V?I{0,3})|M{0,4}(CM|CD|D?C{0,3})(XC|X?L|L?X{1,3})(IX|IV|V?I{0,3})|M{0,4}(CM|CD|D?C{0,3})(XC|XL|L?X{0,3})(IX|I?V|V?I{1,3}))[/](M{1,4}(CM|CD|D?C{0,3})|(XC|XL|L?X{0,3})|(IX|IV|V?I{0,3})|M{0,4}(CM|C?D|D?C{1,3})(XC|XL|L?X{0,3})(IX|IV|V?I{0,3})|M{0,4}(CM|CD|D?C{0,3})(XC|X?L|L?X{1,3})(IX|IV|V?I{0,3})|M{0,4}(CM|CD|D?C{0,3})(XC|XL|L?X{0,3})(IX|I?V|V?I{1,3}))[/]\d{7,9}')
# For each invoice found in the string, append the whole match to the list
for invoice in pattern.finditer(string):
    # group(0) is the full matched text; group(1) would only be the first Roman numeral
    invoices.append(invoice.group(0))
Note:
You should also use pattern.finditer(), because that way you can iterate through all the matches of the pattern in the text you called string. From the re.finditer documentation:
re.finditer(pattern, string, flags=0)
Return an iterator yielding
MatchObject instances over all non-overlapping matches for the RE
pattern in string. The string is scanned left-to-right, and matches
are returned in the order found. Empty matches are included in the
result unless they touch the beginning of another match.

string = "your invoice number IVR/20170531/XVII/V/12652967 and IVR/20170531/XVII/V/13652967"
results = []
matches = re.finditer(regexpattern, string)  # regexpattern is your pattern
for match in matches:
    results.append(match.group())  # group() with no argument is the whole match

You need to add ?: at the start of every group to make it non-capturing. When a pattern contains capturing groups, re.findall returns the groups instead of the whole match.
Try this regex:
IVR[/]\d{8}[/](?:M{0,4}(?:CM|CD|D?C{0,3})|(?:XC|XL|L?X{0,3})(?:IX|IV|V?I{0,3}))[/](?:M{0,4}(?:CM|CD|D?C{0,3})|(?:XC|XL|L?X{0,3})(?:IX|IV|V?I{0,3}))[/]\d{8}
With no capturing groups left, re.findall returns the full text of each match.
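As a rough illustration of the difference, here is a small sketch. The [IVXLCDM]+ stand-in for the Roman-numeral part is my own simplification (an assumption, not the pattern from the question); it is only there to keep the example short.

```python
import re

text = ("your invoice number IVR/20170531/XVII/V/12652967 "
        "and IVR/20170531/XVII/V/13652967")

# Simplified stand-in for the Roman-numeral alternation (assumption).
capturing = r'IVR/\d{8}/([IVXLCDM]+)/([IVXLCDM]+)/\d{7,9}'
non_capturing = r'IVR/\d{8}/(?:[IVXLCDM]+)/(?:[IVXLCDM]+)/\d{7,9}'

# With capturing groups, findall returns one tuple of groups per match.
groups_only = re.findall(capturing, text)

# With only non-capturing groups, findall returns the full matched text.
full_matches = re.findall(non_capturing, text)

print(groups_only)
print(full_matches)
```

The first call gives a list of tuples, which is the same kind of result the question shows; the second gives the two invoice strings.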

You can try this one to retrieve number, roman, roman and number values:
IVR\/(\d{8})\/(M{0,4}(?:CM|CD|D?C{0,3})(?:XC|XL|L?X{0,3})(?:IX|IV|V?I{0,3}))\/(M{0,4}(?:CM|CD|D?C{0,3})(?:XC|XL|L?X{0,3})(?:IX|IV|V?I{0,3}))\/(\d{7,9})
Snippet
import re
string = "your invoice number IVR/20170531/XVII/V/12652967 and IVR/20170531/XVII/V/13652967"
pattern = r"IVR\/(\d{8})\/(M{0,4}(?:CM|CD|D?C{0,3})(?:XC|XL|L?X{0,3})(?:IX|IV|V?I{0,3}))\/(M{0,4}(?:CM|CD|D?C{0,3})(?:XC|XL|L?X{0,3})(?:IX|IV|V?I{0,3}))\/(\d{7,9})"
for match in re.findall(pattern, string):
    print(match)

Related

Find all occurrences of regex pattern, but ignore occurrences that contain another pattern

I have a block of text that I'm trying to parse:
「<%sM_item2><%sM_plusnum2>の| <%sM_slot>の部分を| <%sM_change_color>に カラーリングするのですね?|<br>|「それでは <%sM_item>が 10本と| <%nM_gold>ゴールドが必要ですが よろしいですか?|<yesno><close>
In this block of text, I'm trying to regex split on all occurrences of <???>, EXCEPT for when it matches on <%???>.
I have it mostly working with this:
re.split(r'<((?!%).+?)>', source_text)
['「<%sM_item2><%sM_plusnum2>の|\u3000<%sM_slot>の部分を|\u3000<%sM_change_color>に\u3000カラーリングするのですね?|', 'br', '|「それでは\u3000<%sM_item>が\u300010本と|\u3000<%nM_gold>ゴールドが必要ですが\u3000よろしいですか?|', 'yesno', '', 'close', '']
My problem is although it kept the <%???> tags in place, it somehow stripped the <> characters from the matches (notice 'yesno', 'close', and 'br' tags no longer have those characters).
Based on the documentation of re.split:
Split string by the occurrences of pattern. If capturing parentheses are used
in pattern, then the text of all groups in the pattern are also returned as
part of the resulting list.
In this case, my parentheses need to be placed around the whole match, outside the angle brackets, to preserve the <>.
re.split('(<(?!%).+?>)', source_text)
['「<%sM_item2><%sM_plusnum2>の|\u3000<%sM_slot>の部分を|\u3000<%sM_change_color>に\u3000カラーリングするのですね?|', '<br>', '|「それでは\u3000<%sM_item>が\u300010本と|\u3000<%nM_gold>ゴールドが必要ですが\u3000よろしいですか?|', '<yesno>', '', '<close>', '']
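A shorter way to see the same effect, as a sketch on a made-up ASCII string (the sample text below is hypothetical, not the original block):

```python
import re

# Hypothetical, shortened sample with the same kinds of tags.
source_text = "<%sM_item>A|<br>|B<%nM_gold>C|<yesno><close>"

# Group inside the angle brackets: only the tag name is kept.
inner = re.split(r'<((?!%).+?)>', source_text)

# Group around the whole tag: the angle brackets are kept too.
outer = re.split(r'(<(?!%).+?>)', source_text)

print(inner)
print(outer)
```

In both cases the <%...> tags are left alone, because the negative lookahead (?!%) rejects a match that starts with <%.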

Remove Twitter mentions from Pandas column

I have a dataset that includes Tweets from Twitter. Some of them also have user mentions such as #thisisauser. I try to remove that text at the same time I do other cleaning processes.
def clean_text(row, options):
    if options['lowercase']:
        row = row.lower()
    if options['decode_html']:
        txt = BeautifulSoup(row, 'lxml')
        row = txt.get_text()
    if options['remove_url']:
        row = row.replace('http\S+|www.\S+', '')
    if options['remove_mentions']:
        row = row.replace('#[A-Za-z0-9]+', '')
    return row
clean_config = {
    'remove_url': True,
    'remove_mentions': True,
    'decode_utf8': True,
    'lowercase': True
}
df['tweet'] = df['tweet'].apply(clean_text, args=(clean_config,))
However, when I run the above code, all the Twitter mentions are still in the text. I verified with an online regex tool that my regex works correctly, so the problem should be in the Pandas code.
You are misusing the replace method of a string: it does not accept regular expressions, only fixed strings (see the docs at https://docs.python.org/2/library/stdtypes.html#str.replace for more).
The right way to do this is to use the re module:
import re
re.sub("#[A-Za-z0-9]+","", "#thisisauser text")
' text'
The problem is with the way you used the replace method, not with pandas.
See the output from the REPL:
>>> my_str ="#thisisause"
>>> my_str.replace('#[A-Za-z0-9]+', '')
'#thisisause'
str.replace doesn't support regex. Use Python's regular expression library instead, as mentioned in the other answer:
>>> import re
>>> my_str
'hello #username hi'
>>> re.sub("#[A-Za-z0-9]+","",my_str)
'hello  hi'
To remove Twitter mentions, or words that start with a # char, in Pandas, you can use one of the following:
df['tweet'] = df['tweet'].str.replace(r'\s*#\w+', '', regex=True)
df['tweet'] = df['tweet'].str.replace(r'\s*\B#\w+', '', regex=True)
df['tweet'] = df['tweet'].str.replace(r'\s*#\S+', '', regex=True)
df['tweet'] = df['tweet'].str.replace(r'\s*#\S+\b', '', regex=True)
If you need to remove remaining leading/trailing spaces after the replacement, add .str.strip() after .str.replace call.
Details:
\s*#\w+ - zero or more whitespaces, # and one or more "word" chars (letters, digits and underscores; in Python 3.x also other connector punctuation and some diacritics)
\s*\B#\w+ - zero or more whitespaces, a position other than a word boundary, # and one or more "word" chars
\s*#\S+ - zero or more whitespaces, # and one or more non-whitespace chars
\s*#\S+\b - zero or more whitespaces, # and one or more non-whitespace chars (as many as possible) followed by a word boundary.
Without Pandas, use one of the above regexps in re.sub:
text = re.sub(r'...pattern here...', '', text)
## or
text = re.sub(r'...pattern here...', '', text).strip()
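Putting the answers together, a minimal sketch of the cleaning function from the question rewritten with re.sub could look like the following. The leading \s* in the URL pattern, the escaped dot in www\. and the use of options.get() are my own choices (assumptions), and the HTML decoding step is left out to keep the sketch free of third-party imports.

```python
import re

# Patterns compiled once; the URL pattern is an adapted form of the one in the question.
URL_RE = re.compile(r'\s*(?:http\S+|www\.\S+)')
MENTION_RE = re.compile(r'\s*#\w+')

def clean_text(text, options):
    """Apply the enabled cleaning steps to a single tweet."""
    if options.get('lowercase'):
        text = text.lower()
    if options.get('remove_url'):
        text = URL_RE.sub('', text)
    if options.get('remove_mentions'):
        text = MENTION_RE.sub('', text)
    return text.strip()

clean_config = {'remove_url': True, 'remove_mentions': True, 'lowercase': True}
cleaned = clean_text("Hello #thisisauser see http://example.com now", clean_config)
print(cleaned)
```

With a DataFrame, the same function can be passed to df['tweet'].apply(clean_text, args=(clean_config,)) as in the question.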

How to comprehend the python regex compile matching result: `re.compile(r'a*')`

import re
pattern = re.compile(r'a*')
pattern.findall("aba")
result:
['a', '', 'a', '']
Why are there empty matches in the result? How should I understand this?
To be more specific, what do the two empty matches ('') in the result stand for in the string "aba"?
findall(pattern, string, flags=0)
Return all non-overlapping matches of pattern in string, as a list of strings. The string is scanned left-to-right, and matches are returned in the order found. If one or more groups are present in the pattern, return a list of groups; this will be a list of tuples if the pattern has more than one group. Empty matches are included in the result.
You are searching for a*. * matches zero or more repetitions of the character, so a* also matches the empty string at any position where no a starts, such as at b and at the end of the string. It seems like you want a+ instead, which matches one or more repetitions of the character.
Let me try to explain, as I also could not find good information on the outputs. The documentation states that
re.findall(pattern, string, flags=0)
Return all non-overlapping matches of pattern in string, as a list of strings. The string is scanned left-to-right, and matches are returned in the order found. If one or more groups are present in the pattern, return a list of groups; this will be a list of tuples if the pattern has more than one group. Empty matches are included in the result.
Changed in version 3.7: Non-empty matches can now start just after a previous empty match.
import re
text = 'abcaad'
print(f"'a' matches {re.findall('a' , text)}")
print(f"'a+' matches {re.findall('a+', text)}")
print(f"'a*' matches {re.findall('a*', text)}")
print(f"'z*' matches {re.findall('z*', text)}")
The output is
'a' matches ['a', 'a', 'a']
'a+' matches ['a', 'aa']
'a*' matches ['a', '', '', 'aa', '', '']
'z*' matches ['', '', '', '', '', '', '']
a matches exactly the character a, three times.
a+ matches one or more occurrences of the character a.
a* matches zero or more occurrences of the character a.
Besides matching a and aa, it also produces an empty match at every position where no a starts: at b, at c, at d and at the end of the string.
z* matches zero or more occurrences of the character z.
There is no z in the text, so it produces an empty match at each of a, b, c, a, a, d and at the end of the string.
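To see where each match in the original "aba" example comes from, one option is to look at the match positions with re.finditer. This is just a sketch, run under Python 3.7 or later:

```python
import re

# Record where each match of a* starts and ends in "aba".
spans = [(m.start(), m.end(), m.group()) for m in re.finditer(r'a*', 'aba')]

for start, end, matched in spans:
    kind = 'empty match' if start == end else 'match'
    print(f"{kind} {matched!r} at [{start}:{end}]")
```

The two empty matches sit at index 1 (where b is, so zero a's are matched) and at index 3 (the end of the string).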

Python split with multiple delimiters not working

I have a string:
feature.append(freq_and_feature(text, freq))
I want a list containing each word of the string, like [feature, append, freq, and, feature, text, freq], where each word is a string, of course.
These strings are contained in a file called helper.txt, so I'm doing the following, as suggested by multiple SO posts, like the accepted answer for this one (Python: Split string with multiple delimiters):
import re
with open("helper.txt", "r") as helper:
    for row in helper:
        print(re.split('\' .,()_', row))
However, I get the following, which is not what I want.
[' feature.append(freq_pain_feature(text, freq))\n']
re.split('\' .,()_', row)
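This looks for the whole sequence ' .,()_ (a quote, a space, any character, a comma, an empty group and an underscore) to split on, not for any single one of those characters. You probably meant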
re.split('[\' .,()_]', row)
re.split takes a regular expression as the first argument. To say "this OR that" in regular expressions, you can write a|b and it will match either a or b. If you wrote ab, it would only match a followed by b. Luckily, so that we don't have to write '| |.|,|(|..., there's a shorter form: put the characters inside [] to say "match any one of these". Inside the brackets, . and the parentheses are treated as literal characters.
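A quick sketch of the difference, using the string from the question. The + after the character class and the filter that drops empty strings are my additions, so that runs of delimiters do not leave empty items:

```python
import re

row = "feature.append(freq_and_feature(text, freq))"

# Without brackets the pattern is one long sequence, which never occurs here.
no_split = re.split(r"' .,()_", row)

# With brackets, any single one of the listed characters is a delimiter.
pieces = [p for p in re.split(r"[' .,()_]+", row) if p]

print(no_split)
print(pieces)
```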
It seems you want to split a string with non-word or underscore characters. Use
import re
s = 'feature.append(freq_and_feature(text, freq))'
print([x for x in re.split(r'[\W_]+', s) if x])
# => ['feature', 'append', 'freq', 'and', 'feature', 'text', 'freq']
The [\W_]+ regex matches one or more characters that are either non-word characters (\W, i.e. [^a-zA-Z0-9_] for ASCII text) or underscores.
You can get rid of the if x if you remove initial and trailing non-word characters from the input string, e.g. re.sub(r'^[\W_]+|[\W_]+$', '', s).
You can try this:
parts = re.split('[.(_,)]+', row)
parts.pop()  # drop the last piece left after the closing parentheses
print(parts)
This will result in:
['feature', 'append', 'freq', 'and', 'feature', 'text', ' freq']
I think you are trying to split on the basis of non-word characters. It should be
re.split(r'[^A-Za-z0-9]+', s)
For ASCII text, [^A-Za-z0-9] is equivalent to [\W_].
Python code:
s = 'feature.append(freq_and_feature(text, freq))'
print([x for x in re.split(r'[^A-Za-z0-9]+', s) if x])
Matching the words directly instead of splitting will also work:
p = re.compile(r'[^\W_]+')
test_str = "feature.append(freq_and_feature(text, freq))"
print(re.findall(p, test_str))

Python - Regex - findall duplicates

I'm trying to match e-mails in HTML text using the following code in Python:
my_second_pat = '((\w+)( *?))(#|[aA][tT]|\([aA][tT]\))(((( *?)(\w+)( *?))(\.|[dD][oO][tT]|\([dD][oO][tT]\)))+)([eE][dD][uU]|[cC][oO][mM])'
matches = re.findall(my_second_pat,line)
for m in matches:
    s = "".join(m)
    email = "".join(s.split())
    res.append((name,'e',email))
when I run it on a line = shoham#stanford.edu
I get:
[('shoham', 'shoham', '', '#', 'stanford.', 'stanford.', 'stanford', '', 'stanford', '', '.', 'edu')]
what I expect:
[('shoham','#', 'stanford.', 'edu')]
It's matched as one string on regexpal.com, so I guess I'm having trouble with re.findall.
I'm new to both regex and Python. Any optimizations or modifications are welcome.
Try this:
(?i)([^#\s]{2,})(?:#|\s*at\s*)([^#\s.]{2,})(?:\.|\s*dot\s*)([^#\s.]{2,})
If you need to limit to .com and .edu:
(?i)([^#\s]{2,})(?:#|\s*at\s*)([^#\s.]{2,})(?:\.|\s*dot\s*)(com|edu)
Note that I have used the case-insensitive flag (?i) at the start of the regex, instead of using syntax like [Ee].
re.findall returns every capture group of your pattern, including the nested and optional ones, which is where the duplicates and empty strings come from.
Try this:
((?:(?:\w+)(?: *?))(?:#|[aA][tT]|\([aA][tT]\))(?:(?:(?:(?: *?)(?:\w+)(?: *?))(?:\.|[dD][oO][tT]|\([dD][oO][tT]\)))+)(?:[eE][dD][uU]|[cC][oO][mM]))
See this link to debug your expression:
http://regex101.com/r/jW4mP1
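For comparison, here is a small sketch using the .com/.edu pattern from the first answer on the sample line from the question. It shows what re.findall returns when there are only three capturing groups, and how re.finditer gives the whole match if that is what you need:

```python
import re

line = 'shoham#stanford.edu'

# Pattern from the first answer, limited to .com and .edu.
pattern = re.compile(
    r'(?i)([^#\s]{2,})(?:#|\s*at\s*)([^#\s.]{2,})(?:\.|\s*dot\s*)(com|edu)'
)

# findall returns only the three capturing groups of each match.
parts = pattern.findall(line)

# finditer yields match objects, so group(0) is the whole matched text.
whole = [m.group(0) for m in pattern.finditer(line)]

print(parts)
print(whole)
```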
