Python - Regex - findall duplicates

I'm trying to match e-mail addresses in HTML text using the following code in Python:
my_second_pat = r'((\w+)( *?))(#|[aA][tT]|\([aA][tT]\))(((( *?)(\w+)( *?))(\.|[dD][oO][tT]|\([dD][oO][tT]\)))+)([eE][dD][uU]|[cC][oO][mM])'
matches = re.findall(my_second_pat, line)
for m in matches:
    s = "".join(m)
    email = "".join(s.split())
    res.append((name, 'e', email))
When I run it on the line shoham#stanford.edu
I get:
[('shoham', 'shoham', '', '#', 'stanford.', 'stanford.', 'stanford', '', 'stanford', '', '.', 'edu')]
What I expect:
[('shoham','#', 'stanford.', 'edu')]
It's matched as one string on regexpal.com, so I guess I'm having trouble with re.findall.
I'm new to both regex and Python. Any optimizations/modifications are welcome.

Try this:
(?i)([^#\s]{2,})(?:#|\s*at\s*)([^#\s.]{2,})(?:\.|\s*dot\s*)([^#\s.]{2,})
Debuggex Demo
If you need to limit to .com and .edu:
(?i)([^#\s]{2,})(?:#|\s*at\s*)([^#\s.]{2,})(?:\.|\s*dot\s*)(com|edu)
Debuggex Demo
Note that I have used the case-insensitive flag (?i) at the start of the regex, instead of using syntax like [Ee].
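For example, here is a minimal sketch (using the restricted pattern above and the line from the question) showing how the three capture groups returned by re.findall can be stitched back together:
import re

pattern = r'(?i)([^#\s]{2,})(?:#|\s*at\s*)([^#\s.]{2,})(?:\.|\s*dot\s*)(com|edu)'
line = 'shoham#stanford.edu'
for user, domain, tld in re.findall(pattern, line):
    # Each match is a (user, domain, tld) tuple; rejoin it in the question's format.
    print('{}#{}.{}'.format(user, domain, tld))  # shoham#stanford.edu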

re.findall is returning all of your capture groups, including the nested and optional ones, which is why you see the duplicated pieces.
Try this:
((?:(?:\w+)(?: *?))(?:#|[aA][tT]|\([aA][tT]\))(?:(?:(?:(?: *?)(?:\w+)(?: *?))(?:\.|[dD][oO][tT]|\([dD][oO][tT]\)))+)(?:[eE][dD][uU]|[cC][oO][mM]))
See this link to debug your expression:
http://regex101.com/r/jW4mP1
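With every inner group made non-capturing and only the outer group kept, re.findall returns one string per match instead of a tuple of every group. A quick check, assuming the line from the question:
import re

pattern = r'((?:(?:\w+)(?: *?))(?:#|[aA][tT]|\([aA][tT]\))(?:(?:(?:(?: *?)(?:\w+)(?: *?))(?:\.|[dD][oO][tT]|\([dD][oO][tT]\)))+)(?:[eE][dD][uU]|[cC][oO][mM]))'
print(re.findall(pattern, 'shoham#stanford.edu'))  # ['shoham#stanford.edu']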

Related

python re.sub not replacing all the occurrences of a string

I'm not getting the desired output: re.sub leaves only the last occurrence when I use a Python regular expression. Please explain what I'm doing wrong.
srr = "http://www.google.com/#image-1CCCC| http://www.google.com/#image-1VVDD| http://www.google.com/#image-123| http://www.google.com/#image-123| http://www.google.com/#image-1CE005XG03"
re.sub("http://.*[#]", "", srr)
'image-1CE005XG03'
Desired output, with the http://www.google.com/# prefixes removed from the above string:
image-1CCCC|image-1VVDD|image-123|image-1CE005XG03
I would use re.findall here, rather than trying to do a replacement to remove the portions you don't want:
src = "http://www.google.com/#image-1CCCC| http://www.google.com/#image-1VVDD| http://www.google.com/#image-123| http://www.google.com/#image-123| http://www.google.com/#image-1CE005XG03"
matches = re.findall(r'https?://www\.\S+#([^|\s]+)', src)
output = '|'.join(matches)
print(output) # image-1CCCC|image-1VVDD|image-123|image-123|image-1CE005XG03
Note that if you want to be more specific and match only Google URLs, you may use the following pattern instead:
https?://www\.google\.\S+#([^|\s]+)
>>> "|".join(re.findall(r'#([^|\s]+)', srr))
'image-1CCCC|image-1VVDD|image-123|image-123|image-1CE005XG03'
Here is another solution, without a regex:
>>> "|".join(i.split("#")[-1] for i in srr.split("|"))
'image-1CCCC|image-1VVDD|image-123|image-123|image-1CE005XG03'
Using a corrected regex in re.sub, as suggested in the comments above:
import re
srr = "http://www.google.com/#image-1CCCC| http://www.google.com/#image-1VVDD| http://www.google.com/#image-123| http://www.google.com/#image-123| http://www.google.com/#image-1CE005XG03"
print(re.sub(r"\s*https?://[^#\s]*#", "", srr))
Output:
image-1CCCC|image-1VVDD|image-123|image-123|image-1CE005XG03
RegEx Details:
\s*: Match 0 or more whitespaces
https?: Match http or https
://: Match ://
[^#\s]*: Match 0 or more of any characters that are not # and whitespace
#: Match a #
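To illustrate why the original greedy pattern removed too much, here is a minimal side-by-side sketch on a shortened version of the input:
import re

srr = "http://www.google.com/#image-1CCCC| http://www.google.com/#image-1VVDD"
# Greedy .* runs to the end of the string and backtracks to the LAST '#',
# so one big match swallows everything before the final piece.
print(re.sub("http://.*[#]", "", srr))           # image-1VVDD
# [^#\s]* cannot cross '#' or whitespace, so each URL prefix is removed separately.
print(re.sub(r"\s*https?://[^#\s]*#", "", srr))  # image-1CCCC|image-1VVDD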

Regular expression to remove special characters from start or end of a string

I want to remove special characters from the start or end of a string, e.g.
#Can't&
Using a regular expression. I've tried
`[^\w\s]`
but this regular expression also removes the ' that is inside the word and returns the word below:
Cant
Can't seem to wrap my head around this; any ideas would be highly appreciated.
Can be simplified like this:
res = re.sub(r'^\W+|\W+$', '', txt)
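A quick check of that pattern against the string from the question:
>>> import re
>>> re.sub(r'^\W+|\W+$', '', "#Can't&")
"Can't"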
Use the following approach (using regex alternation ..|..):
import re
s = "#Can't&"
res = re.sub(r'^[^\w\s]+|[^\w\s]+$', '', s)
print(res) # Can't

Remove Twitter mentions from Pandas column

I have a dataset that includes tweets from Twitter. Some of them also have user mentions such as #thisisauser. I'm trying to remove that text at the same time as the other cleaning steps.
def clean_text(row, options):
    if options['lowercase']:
        row = row.lower()
    if options['decode_html']:
        txt = BeautifulSoup(row, 'lxml')
        row = txt.get_text()
    if options['remove_url']:
        row = row.replace('http\S+|www.\S+', '')
    if options['remove_mentions']:
        row = row.replace('#[A-Za-z0-9]+', '')
    return row

clean_config = {
    'remove_url': True,
    'remove_mentions': True,
    'decode_utf8': True,
    'lowercase': True
}
df['tweet'] = df['tweet'].apply(clean_text, args=(clean_config,))
However, when I run the above code, all the Twitter mentions are still in the text. I verified with an online regex tool that my regex is correct, so the problem should be in the pandas code.
You are misusing the replace method on a string: it does not accept regular expressions, only fixed strings (see the docs at https://docs.python.org/2/library/stdtypes.html#str.replace for more).
The right way to achieve this is with the re module:
>>> import re
>>> re.sub("#[A-Za-z0-9]+", "", "#thisisauser text")
' text'
The problem is with the way you used the replace method, not with pandas.
See the output from the REPL:
>>> my_str = "#thisisause"
>>> my_str.replace('#[A-Za-z0-9]+', '')
'#thisisause'
str.replace doesn't support regex. Instead, use Python's re module, as mentioned in the other answer:
>>> import re
>>> my_str
'hello #username hi'
>>> re.sub("#[A-Za-z0-9]+","",my_str)
'hello hi'
To remove Twitter mentions, or words that start with a # char, in Pandas, you can use any of the following:
df['tweet'] = df['tweet'].str.replace(r'\s*#\w+', '', regex=True)
df['tweet'] = df['tweet'].str.replace(r'\s*\B#\w+', '', regex=True)
df['tweet'] = df['tweet'].str.replace(r'\s*#\S+', '', regex=True)
df['tweet'] = df['tweet'].str.replace(r'\s*#\S+\b', '', regex=True)
If you need to remove remaining leading/trailing spaces after the replacement, add .str.strip() after .str.replace call.
Details:
\s*#\w+ - zero or more whitespaces, # and one or more "word" chars (letters, digits, underscores and, in Python 3.x, other connector punctuation and some diacritics) (see regex demo)
\s*\B#\w+ - zero or more whitespaces, a position other than word boundary, # and one or more "word" chars (see regex demo)
\s*#\S+ - zero or more whitespaces, # and one or more non-whitespace chars (see regex demo)
\s*#\S+\b - zero or more whitespaces, # and one or more non-whitespace chars (as many as possible) followed with a word boundary (see regex demo).
Without Pandas, use one of the above regexps in re.sub:
text = re.sub(r'...pattern here...', '', text)
## or
text = re.sub(r'...pattern here...', '', text).strip()
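Applied to the question's helper, here is a minimal sketch (assuming the same clean_config keys and the question's df, with the BeautifulSoup step left out for brevity) that swaps the str.replace calls for re.sub so the patterns are actually applied:
import re

def clean_text(row, options):
    # Hypothetical rewrite of the asker's helper using re.sub instead of str.replace.
    if options.get('lowercase'):
        row = row.lower()
    if options.get('remove_url'):
        row = re.sub(r'http\S+|www\.\S+', '', row)
    if options.get('remove_mentions'):
        row = re.sub(r'\s*#\w+', '', row)
    return row

clean_config = {'remove_url': True, 'remove_mentions': True, 'lowercase': True}
df['tweet'] = df['tweet'].apply(clean_text, args=(clean_config,))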

Find pattern in string using regex with python 3

I have a string like the one below:
string = "your invoice number IVR/20170531/XVII/V/12652967 and IVR/20170531/XVII/V/13652967"
I want to get the invoice numbers IVR/20170531/XVII/V/12652967 and IVR/20170531/XVII/V/13652967 into a list using regex with this pattern:
result = re.findall(r'INV[/]\d{8}[/](M{1,4}(CM|CD|D?C{0,3})(XC|XL|L?X{0,3})(IX|IV|V?I{0,3})|M{0,4}(CM|C?D|D?C{1,3})(XC|XL|L?X{0,3})(IX|IV|V?I{0,3})|M{0,4}(CM|CD|D?C{0,3})(XC|X?L|L?X{1,3})(IX|IV|V?I{0,3})|M{0,4}(CM|CD|D?C{0,3})(XC|XL|L?X{0,3})(IX|I?V|V?I{1,3}))[/](M{1,4}(CM|CD|D?C{0,3})(XC|XL|L?X{0,3})(IX|IV|V?I{0,3})|M{0,4}(CM|C?D|D?C{1,3})(XC|XL|L?X{0,3})(IX|IV|V?I{0,3})|M{0,4}(CM|CD|D?C{0,3})(XC|X?L|L?X{1,3})(IX|IV|V?I{0,3})|M{0,4}(CM|CD|D?C{0,3})(XC|XL|L?X{0,3})(IX|I?V|V?I{1,3}))[/]\d{7,9}',string)
but the result is
[('XVII', '', '','', '', '', '', '', 'X', 'VII', '', '', '', 'V','','','', '', '', '', '', '', '', '', '', 'V')]
I tried this pattern at http://regexr.com/ and the result is as expected, but not in Python.
You should modify your pattern: add parentheses around the whole regular expression, and then access that text as the first group. You can read more about groups and back-references in the re documentation.
import re

invoices = []
# Your pattern was slightly incorrect: the prefix is IVR, not INV, and the whole
# expression is wrapped in one outer group so that group(1) is the full invoice number.
pattern = re.compile(r'(IVR[/]\d{8}[/](M{1,4}(CM|CD|D?C{0,3})|(XC|XL|L?X{0,3})(IX|IV|V?I{0,3})|M{0,4}(CM|C?D|D?C{1,3})(XC|XL|L?X{0,3})(IX|IV|V?I{0,3})|M{0,4}(CM|CD|D?C{0,3})(XC|X?L|L?X{1,3})(IX|IV|V?I{0,3})|M{0,4}(CM|CD|D?C{0,3})(XC|XL|L?X{0,3})(IX|I?V|V?I{1,3}))[/](M{1,4}(CM|CD|D?C{0,3})|(XC|XL|L?X{0,3})|(IX|IV|V?I{0,3})|M{0,4}(CM|C?D|D?C{1,3})(XC|XL|L?X{0,3})(IX|IV|V?I{0,3})|M{0,4}(CM|CD|D?C{0,3})(XC|X?L|L?X{1,3})(IX|IV|V?I{0,3})|M{0,4}(CM|CD|D?C{0,3})(XC|XL|L?X{0,3})(IX|I?V|V?I{1,3}))[/]\d{7,9})')
# For each invoice number found in string, append it to the list
for invoice in pattern.finditer(string):
    invoices.append(invoice.group(1))
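With the string from the question, invoices should then contain the two full invoice numbers:
>>> invoices
['IVR/20170531/XVII/V/12652967', 'IVR/20170531/XVII/V/13652967']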
Note:
You should also use pattern.finditer() because that way you can iterate through all matches of the pattern in the text you called string. From the re.finditer documentation:
re.finditer(pattern, string, flags=0)
Return an iterator yielding MatchObject instances over all non-overlapping matches for the RE pattern in string. The string is scanned left-to-right, and matches are returned in the order found. Empty matches are included in the result unless they touch the beginning of another match.
string = "your invoice number IVR/20170531/XVII/V/12652967 and IVR/20170531/XVII/V/13652967"
results = []
matches = re.finditer(regexpattern, string)
for matchNum, match in enumerate(matches):
    results.append(match.group())
You need to add ?: at the start of all the groups so that they become non-capturing groups.
Try with this regex:
IVR[/]\d{8}[/](?:M{0,4}(?:CM|CD|D?C{0,3})|(?:XC|XL|L?X{0,3})(?:IX|IV|V?I{0,3}))[/](?:M{0,4}(?:CM|CD|D?C{0,3})|(?:XC|XL|L?X{0,3})(?:IX|IV|V?I{0,3}))[/]\d{8}
Basically you need to add ?: for each group.
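With every group non-capturing, re.findall returns the full matches. A quick check, assuming string holds the question's input:
>>> re.findall(r'IVR[/]\d{8}[/](?:M{0,4}(?:CM|CD|D?C{0,3})|(?:XC|XL|L?X{0,3})(?:IX|IV|V?I{0,3}))[/](?:M{0,4}(?:CM|CD|D?C{0,3})|(?:XC|XL|L?X{0,3})(?:IX|IV|V?I{0,3}))[/]\d{8}', string)
['IVR/20170531/XVII/V/12652967', 'IVR/20170531/XVII/V/13652967']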
You can try this one to retrieve number, roman, roman and number values:
IVR\/(\d{8})\/(M{0,4}(?:CM|CD|D?C{0,3})(?:XC|XL|L?X{0,3})(?:IX|IV|V?I{0,3}))\/(M{0,4}(?:CM|CD|D?C{0,3})(?:XC|XL|L?X{0,3})(?:IX|IV|V?I{0,3}))\/(\d{7,9})
Demo
Snippet
import re
string = "your invoice number IVR/20170531/XVII/V/12652967 and IVR/20170531/XVII/V/13652967"
pattern = r"IVR\/(\d{8})\/(M{0,4}(?:CM|CD|D?C{0,3})(?:XC|XL|L?X{0,3})(?:IX|IV|V?I{0,3}))\/(M{0,4}(?:CM|CD|D?C{0,3})(?:XC|XL|L?X{0,3})(?:IX|IV|V?I{0,3}))\/(\d{7,9})"
for match in re.findall(pattern, string):
    print(match)
Run online
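Each match from re.findall is a tuple of the four captured fields, so the snippet should print:
('20170531', 'XVII', 'V', '12652967')
('20170531', 'XVII', 'V', '13652967')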

How to write regular expression to use re.split in python

I have a string like this:
----------
FT Weekend
----------
Why do we run marathons?
Are marathons and cycling races about more than exercise? What does the
literature of endurance tell us about our thirst for self-imposed hardship?
I want to delete the part from the first ---------- up to and including the next ----------.
I have been using re.sub:
pattern = r"-+\n.+\n-+"
re.sub(pattern, '', thestring)
pattern = r"-+\n.+?\n-+"
re.sub(pattern, '', thestring, flags=re.DOTALL)
Just use the DOTALL flag. The problem with your regex was that, by default, . does not match \n, so you need to explicitly add the DOTALL flag to make it match \n.
See demo.
https://regex101.com/r/hR7tH4/24
Or, if you don't want to add a flag:
pattern = r"-+\n[\s\S]+?\n-+"
re.sub(pattern, '', thestring)
Your regex doesn't match the expected part because .+ doesn't match the newline character. You can use the re.DOTALL (or re.S) flag to force . to match newlines, but instead of that you can use a negated character class:
>>> print re.sub(r"-+[^-]+-+", '', s)

Why do we run marathons?
Are marathons and cycling races about more than exercise? What does the
literature of endurance tell us about our thirst for self-imposed hardship?
>>>
Or, more precisely, you can do:
>>> print re.sub(r"-+[^-]+-+[^\w]+", '', s)
Why do we run marathons?
Are marathons and cycling races about more than exercise? What does the
literature of endurance tell us about our thirst for self-imposed hardship?
>>>
The problem with your regex (-+\n.+\n-+) is that . matches any character except a newline, and that .+ is greedy and can span across multiple ------- blocks.
You can use the following regex:
pattern = r"(?s)-+\n.+?\n-+"
The (?s) singleline option makes . match any character including newline.
The .+? pattern will match 1 or more characters but as few as possible to match up to the next ----.
See IDEONE demo
For a more profound cleanup, I'd recommend:
pattern = r"(?s)\s*-+\n.+?\n-+\s*"
See another demo
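A quick check with the text from the question (a minimal sketch; thestring is assumed to hold the sample above):
import re

thestring = """----------
FT Weekend
----------
Why do we run marathons?
Are marathons and cycling races about more than exercise? What does the
literature of endurance tell us about our thirst for self-imposed hardship?"""

print(re.sub(r"(?s)\s*-+\n.+?\n-+\s*", '', thestring))
# Why do we run marathons?
# Are marathons and cycling races about more than exercise? What does the
# literature of endurance tell us about our thirst for self-imposed hardship?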
