Python Regex to extract codes from a string

I have a string like this:
Srting = "$33.53 with 2 coupon codes : \r\n\r\n1) CODEONE\r\n\r\n2) CODETWO \r\n\r\nBoth coupons only work if you buy 1 by 1"
I want to extract the coupon codes "CODEONE" and "CODETWO" from this string when the following condition is true:
if "coupon code" in Srting:
How can I extract these coupon codes? I need a generic regular expression, because other strings may have the codes in different positions, and some may contain only one code.

This might help.
import re
Srting = "$33.53 with 2 coupon codes : \r\n\r\n1) CODEONE\r\n\r\n2) CODETWO \r\n\r\nBoth coupons only work if you buy 1 by 1"
# Capture whatever follows a numbered marker such as "1)" up to the end of the line
for i in re.findall(r"\d+\)(.*)", Srting):
    print(i.strip())
Output:
CODEONE
CODETWO
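If you also want the "coupon code" check from the question built in, something like the following might work. This is only a sketch: the function name extract_coupon_codes is made up for illustration, and it assumes that every code is a run of uppercase letters or digits directly after a numbered marker such as "1)".

```python
import re

def extract_coupon_codes(text):
    """Hypothetical helper: return the codes that follow markers like '1)'."""
    # Only look for codes when the text mentions coupon codes at all.
    if "coupon code" not in text:
        return []
    # Assumption: a code is uppercase letters/digits right after "N)".
    return re.findall(r"\d+\)\s*([A-Z0-9]+)", text)

sample = ("$33.53 with 2 coupon codes : \r\n\r\n1) CODEONE\r\n\r\n2) "
          "CODETWO \r\n\r\nBoth coupons only work if you buy 1 by 1")
print(extract_coupon_codes(sample))
```

Because the pattern does not depend on where the markers sit in the string, it should also cope with a single code.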

Related

phonenumbers python module not giving correct country code

I am trying to use the phonenumbers module in Python and am stuck on the issue below: it reports the wrong country code, even though both are US phone numbers. Can someone suggest how to proceed?
import phonenumbers
print(phonenumbers.parse("+301.795.1400"))
Output: Country Code: 30 National Number: 17951400 (wrong)
print(phonenumbers.parse("+1301.795.1400"))  # correct after adding +1 or removing the '+'
Output: Country Code: 1 National Number: 3017951400
For example:
+44 7923 903949 gives country code +44, which is correct
+782-205-2583 gives country code +7, which is wrong
I expect +1 as the country code and 782-205-2583 as the phone number.

A plus ('+') means that the digits that follow begin with a country code. The country code for the US is '1' (i.e. '+1'). You're including the plus, which tells the parser that the next digit or digits are a country code, but then omitting the country code itself.
It looks to me like the module is working correctly.
see:
https://countrycode.org/
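If the input data really does contain 10-digit US numbers with a stray '+', one option is to clean them up before parsing. The sketch below uses only the standard library; normalize_nanp is a hypothetical helper, and the rule "exactly 10 digits means a +1 number" is an assumption about your data, not something the phonenumbers module does for you.

```python
import re

def normalize_nanp(raw):
    """Hypothetical helper: make a 10-digit number explicit as +1."""
    digits = re.sub(r"\D", "", raw)
    # Assumption: exactly 10 digits is a US/Canada national number
    # whose leading '+' was added by mistake.
    if len(digits) == 10:
        return "+1" + digits
    # Otherwise keep the digits and preserve an existing leading '+'.
    prefix = "+" if raw.strip().startswith("+") else ""
    return prefix + digits

for number in ["+301.795.1400", "+782-205-2583", "+44 7923 903949"]:
    print(number, "->", normalize_nanp(number))
```

The normalized string can then be passed to phonenumbers.parse. For numbers written without a leading '+', phonenumbers.parse also accepts a default region as its second argument (for example "US").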

How to apply regex for multiple phrases on a dataframe column?

Hello, I have a dataframe from which I want to remove the prefixes 'Fwd', 'Re' and 'RE' in every row that starts with or contains them. The issue is that I do not know how to write a regex that covers each case.
my dataframe looks like this:
summary
0 Fwd: Please look at the attached documents and take action
1 NSN for the ones who care
2 News for all team members
3 Fwd:RE:Re: Please take action on the action needed items
4 Fix all the mistakes please
5 Fwd:Re: Take action on the attachments in this email
6 Fwd:RE: Action is required
I want a result dataframe like this:
summary
0 Please look at the attached documents and take action
1 NSN for the ones who care
2 News for all team members
3 Please take action on the action needed items
4 Fix all the mistakes please
5 Take action on the attachments in this email
6 Action is required
To get rid of 'Fwd' I used df['msg'].str.replace(r'^Fwd: ','')

If the prefixes can repeat at the start of the string, you could use a repeating pattern:
^(?:(?:Fwd|R[eE]):)+\s*
^ Start of string
(?: Non-capturing group
(?:Fwd|R[eE]): Match either Fwd, Re or RE, followed by a colon
)+ Close the non-capturing group and repeat it 1+ times
\s* Match optional trailing whitespace
In the replacement use an empty string.
You could also make the pattern case insensitive using re.IGNORECASE and use (?:fwd|re) if you want to match all possible variations.
For example (pass regex=True, because since pandas 2.0 str.replace treats the pattern as a literal string by default):
df['summary'].str.replace(r'^(?:(?:Fwd|R[eE]):)+\s*', '', regex=True)
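To try the repeating-prefix idea without a dataframe, here is a small sketch using only the re module on a few of the subjects from the question. It uses the case-insensitive variant; the name strip_prefixes is made up for this example.

```python
import re

# Case-insensitive variant: one or more "fwd:" / "re:" prefixes at the start.
PREFIXES = re.compile(r"^(?:(?:fwd|re):)+\s*", re.IGNORECASE)

def strip_prefixes(subject):
    """Hypothetical helper: drop leading Fwd:/Re: prefixes from one subject."""
    return PREFIXES.sub("", subject)

subjects = [
    "Fwd: Please look at the attached documents and take action",
    "NSN for the ones who care",
    "Fwd:RE:Re: Please take action on the action needed items",
    "Redemption!",
]
for s in subjects:
    print(strip_prefixes(s))
```

Because the pattern is anchored with ^ and requires the colon, a subject such as "Redemption!" is left alone.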

The key concept here, I believe, is the | operator, which works as an "or" between the alternatives in the pattern. It's very useful for these cases.
This is how I would solve the problem:
import pandas as pd
df = pd.DataFrame({'index':[0,1,2,3,4,5,6,7],
'summary':['Fwd: Please look at the attached documents and take action ',
'NSN for the ones who care',
'News for all team members ',
'Fwd:RE:Re: Please take action on the action needed items',
'Fix all the mistakes please ',
'Fwd:Re: Take action on the attachments in this email',
'Fwd:RE: Action is required',
'Redemption!']})
# regex=True is required since pandas 2.0, where the pattern is literal by default
df['clean'] = df['summary'].str.replace(r'^Fwd:|R[eE]:\s*', '', regex=True)
print(df)
Output:
index ... clean
0 0 ... Please look at the attached documents and tak...
1 1 ... NSN for the ones who care
2 2 ... News for all team members
3 3 ... Please take action on the action needed items
4 4 ... Fix all the mistakes please
5 5 ... Take action on the attachments in this email
6 6 ... Action is required
7 7 ... Redemption!

How to extract specific information from multi-line string

I have extracted some invoice-related information from email bodies into Python strings; my next task is to extract the invoice numbers from those strings.
The format of the emails can vary, which makes it difficult to find the invoice number in the text. I also tried Named Entity Recognition from spaCy, but since in most cases the invoice number appears on the line after the heading 'Invoice' or 'Invoice#', the NER doesn't understand the relation and returns incorrect details.
Below are 2 examples of the text extracted from mail body:
Example - 1.
Dear Customer:
The past due invoices listed below are still pending. This includes the
following:
Invoice Date Purchase Order Due Date Balance
8754321 8/17/17 7200016508 9/16/18 140.72
5245344 11/7/17 4500199620 12/7/18 301.54
We would appreciate quick payment of these invoices.
Example - 2.
Hi - please confirm the status of below two invoices.
Invoice# Amount Invoice Date Due Date
7651234 $19,579.06 29-Jan-19 28-Apr-19
9872341 $47,137.20 27-Feb-19 26-Apr-19
My problem is that if I convert this entire text to a single string then this becomes something like this:
Invoice Date Purchase Order Due Date Balance 8754321 8/17/17
7200016508 9/16/18 140.72
As you can see, the invoice number (8754321 in this case) has changed position and no longer follows the keyword "Invoice", which makes it harder to find.
My desired output is something like this:
Output Example - 1 -
8754321
5245344
Output Example - 2 -
7651234
9872341
I don't know how to retrieve the text just under the keyword "Invoice" or "Invoice#", which is the invoice number.
Please let me know if further information is required. Thanks!
Edit: The invoice number doesn't have a predefined length; it can be 7 digits or more.

Code per my comments.
email = '''Dear Customer:
The past due invoices listed below are still pending. This includes the
following:
Invoice Date Purchase Order Due Date Balance
8754321 8/17/17 7200016508 9/16/18 140.72
5245344 11/7/17 4500199620 12/7/18 301.54
We would appreciate quick payment of these invoices.'''
index = -1
# Get first line of table, print line and index of 'Invoice'
for line in email.split('\n'):
    if all(x != x.lower() for x in line.split()) and ('Invoice' in line) and len(line) > 0:
        print('--->', line, ' --- index of Invoice:', line.find('Invoice'))
        index = line.find('Invoice')
This uses the heuristic that every word in the column header row is capitalized or all capitals (ID). It would fail if, say, a heading was exactly 'Account no.' rather than 'Account No.'
# Get the number at that index on every line
for line in email.split('\n'):
    words = line[index:].split()
    if words == []: continue
    word = words[0]
    try:
        print(int(word))
    except ValueError:
        continue
Reliability here depends on the data. In my code the Invoice column must come first in the table header, i.e. you can't have 'Invoice Date' before 'Invoice'. Obviously this would need fixing.
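The same idea (find the header row, then read the first column of the rows under it) can be wrapped in one function. This is a rough sketch, not a general solution: invoice_numbers_under_header is a made-up name, and it assumes the header row starts with 'Invoice' or 'Invoice#' and that the table ends at the first line that does not start with digits.

```python
def invoice_numbers_under_header(text):
    """Hypothetical helper: first-column numbers below an 'Invoice' header row."""
    numbers = []
    in_table = False
    for line in text.splitlines():
        words = line.split()
        if not in_table:
            # Assumption: the header row starts with 'Invoice' or 'Invoice#'.
            in_table = bool(words) and words[0] in ("Invoice", "Invoice#")
            continue
        # Assumption: each table row starts with an all-digit invoice number;
        # the first line that does not ends the table.
        if words and words[0].isdigit():
            numbers.append(words[0])
        else:
            break
    return numbers

email2 = """Hi - please confirm the status of below two invoices.
Invoice# Amount Invoice Date Due Date
7651234 $19,579.06 29-Jan-19 28-Apr-19
9872341 $47,137.20 27-Feb-19 26-Apr-19"""
print(invoice_numbers_under_header(email2))
```

Since it does not check the length of the number, it also copes with invoice numbers longer than 7 digits.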

Going off what Andrew Allen was saying, as long as these 2 assumptions are true:
Invoice numbers are always exactly 7 numerical digits
Invoice numbers always follow a whitespace character and are followed by one
using regex should work. Something along the lines of:
import re
email = '''Dear Customer:
The past due invoices listed below are still pending. This includes the
following:
Invoice Date Purchase Order Due Date Balance
8754321 8/17/17 7200016508 9/16/18 140.72
5245344 11/7/17 4500199620 12/7/18 301.54
We would appreciate quick payment of these invoices.'''
invoices = re.findall(r'\s(\d\d\d\d\d\d\d)\s', email)
invoices in this case is a list of 2 strings, ['8754321', '5245344']

Use regex with re.findall.
Example:
import re
email = '''Dear Customer:
The past due invoices listed below are still pending. This includes the
following:
Invoice Date Purchase Order Due Date Balance
8754321 8/17/17 7200016508 9/16/18 140.72
5245344 11/7/17 4500199620 12/7/18 301.54
We would appreciate quick payment of these invoices.'''
email2 = """Hi - please confirm the status of below two invoices.
Invoice# Amount Invoice Date Due Date
7651234 $19,579.06 29-Jan-19 28-Apr-19
9872341 $47,137.20 27-Feb-19 26-Apr-19 """
for eml in [email, email2]:
    # re.DOTALL is not needed here: it only affects '.', which the pattern does not use
    print(re.findall(r"\b\d{7}\b", eml))
Output:
['8754321', '5245344']
['7651234', '9872341']
\b - word boundary
\d{7} - exactly 7 digits
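The question's edit says the invoice number can be longer than 7 digits. One possible adaptation, sketched below, is to accept 7 or more digits but only at the start of a line, so that a long purchase order number in the middle of a row is not picked up. The "start of line" rule and the name find_invoice_numbers are assumptions based on the two sample emails.

```python
import re

# Assumption: an invoice number is a run of 7+ digits at the start of a line.
INVOICE_AT_LINE_START = re.compile(r"^[ \t]*(\d{7,})\b", re.MULTILINE)

def find_invoice_numbers(text):
    """Hypothetical helper: return line-leading numbers with 7 or more digits."""
    return INVOICE_AT_LINE_START.findall(text)

email = """Invoice Date Purchase Order Due Date Balance
8754321 8/17/17 7200016508 9/16/18 140.72
5245344 11/7/17 4500199620 12/7/18 301.54"""
print(find_invoice_numbers(email))
```

re.MULTILINE makes ^ match at the start of every line rather than only at the start of the string.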

Replace word between two substrings (keeping other words)

I'm trying to replace a word (e.g. on) if it falls between two substrings (e.g. <temp> & </temp>) however other words are present which need to be kept.
string = "<temp>The sale happened on February 22nd</temp>"
The desired string after the replace would be:
Result = <temp>The sale happened {replace} February 22nd</temp>
I've tried using regex, but I've only been able to figure out how to replace everything between the two <temp> tags (because of the .*?).
result = re.sub('<temp>.*?</temp>', '{replace}', string, flags=re.DOTALL)
However, "on" may also appear later in the string, outside <temp></temp>, and I wouldn't want to replace it there.

re.sub('(<temp>.*?) on (.*?</temp>)', lambda x: x.group(1)+" <replace> "+x.group(2), string, flags=re.DOTALL)
Output:
<temp>The sale happened <replace> February 22nd</temp>
Edit:
Changed the regex based on suggestions by Wiktor and HolyDanna.
P.S: Wiktor's comment on the question provides a better solution.
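Another way to approach this is a nested substitution: the outer re.sub finds each <temp>...</temp> span, and a callback rewrites only the text of that span. The sketch below is one possible version; replace_word_in_temp is a hypothetical name, and it assumes the tags are not nested and that whole-word matching (\b) is what you want.

```python
import re

def replace_word_in_temp(text, word="on", replacement="{replace}"):
    """Hypothetical helper: replace whole-word matches only inside <temp>...</temp>."""
    word_re = re.compile(r"\b" + re.escape(word) + r"\b")
    # Outer pattern: each tagged span. Callback: rewrite only that span.
    return re.sub(
        r"<temp>.*?</temp>",
        lambda m: word_re.sub(lambda _: replacement, m.group(0)),
        text,
        flags=re.DOTALL,
    )

s = "<temp>The sale happened on February 22nd</temp> and again on Monday"
print(replace_word_in_temp(s))
```

Unlike a single pattern with two groups, this replaces every occurrence inside the tags and leaves any "on" outside them untouched.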

Try lxml:
from lxml import etree
root = etree.fromstring("<temp>The sale happened on February 22nd</temp>")
root.text = root.text.replace(" on ", " {replace} ")
# encoding="unicode" returns a str; the default returns bytes, which print() shows as b'...'
print(etree.tostring(root, encoding="unicode"))
Output:
<temp>The sale happened {replace} February 22nd</temp>
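If installing lxml is not an option, the standard library's xml.etree.ElementTree can do something similar. This sketch assumes the whole input is well-formed XML with a single root element (here a made-up <doc> wrapper); replace_in_temp_elements is a hypothetical name.

```python
import xml.etree.ElementTree as ET

def replace_in_temp_elements(xml_text, old=" on ", new=" {replace} "):
    """Hypothetical helper: edit only the text inside <temp> elements."""
    root = ET.fromstring(xml_text)
    # iter("temp") also yields the root itself when the root is <temp>.
    for elem in root.iter("temp"):
        if elem.text:
            elem.text = elem.text.replace(old, new)
    return ET.tostring(root, encoding="unicode")

doc = "<doc>Based on this: <temp>The sale happened on February 22nd</temp> and later on Monday</doc>"
print(replace_in_temp_elements(doc))
```

Text outside the <temp> elements lives in other nodes' text and tail attributes, so it is not touched.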

Parsing file name with RegEx - Python

I'm trying to get the "real" name of a movie from its name when you download it.
So for instance, I have
Star.Wars.Episode.4.A.New.Hope.1977.1080p.BrRip.x264.BOKUTOX.YIFY
and would like to get
Star Wars Episode 4 A New Hope
So I'm using this regex:
.*?\d{1}?[ .a-zA-Z]*
which works fine, but only for a movie with a number, as in 'Iron Man 3' for example.
I'd like to be able to get movies like 'Interstellar' from
Interstellar.2014.1080p.BluRay.H264.AAC-RARBG
and I currently get
Interstellar 2
I tried several ways, and spent quite a lot of time on it already, but figured it wouldn't hurt asking you guys if you had any suggestion/idea/tip on how to do it...
Thanks a lot!

Given your examples and assuming you always download in 1080p (or know that field's value):
x = 'Interstellar.2014.1080p.BluRay.H264.AAC-RARBG'
y = x.split('.')
print(" ".join(y[:y.index('1080p')-1]))
Forget the regex (for now anyway!) and work with the fixed field layout. Find a field you know (1080p) and remove the information you don't want (the year). Recombine the results and you get "Interstellar" and "Star Wars Episode 4 A New Hope".
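If the resolution is not always 1080p, the same fixed-field idea can look for any field shaped like a resolution instead of a hard-coded value. A minimal sketch, assuming the year always sits directly before a field such as 720p or 1080p; movie_title is a made-up name.

```python
import re

def movie_title(filename):
    """Hypothetical helper: title = the fields before '<year>.<resolution>'."""
    fields = filename.split(".")
    for i, field in enumerate(fields):
        # Assumption: the resolution field looks like 720p or 1080p
        # and is immediately preceded by the year.
        if re.fullmatch(r"\d{3,4}p", field):
            return " ".join(fields[:max(i - 1, 0)])
    # No resolution field found: just turn the dots into spaces.
    return " ".join(fields)

print(movie_title("Star.Wars.Episode.4.A.New.Hope.1977.1080p.BrRip.x264.BOKUTOX.YIFY"))
print(movie_title("Interstellar.2014.1080p.BluRay.H264.AAC-RARBG"))
```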

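The following regex would work, assuming the format is something like moviename.year.1080p.anything or moviename.year.720p.anything:
.*(?=.\d{4}.*\d{3,}p)
Explanation: .* matches the title, and the lookahead (?=.\d{4}.*\d{3,}p) requires it to be followed by a separator, a four-digit year and, later, a resolution such as 720p or 1080p, without including them in the match. The match still contains the dots, which you can replace with spaces afterwards.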
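A minimal sketch of applying this lookahead pattern in Python follows. The helper name title_from and the fallback for names without a year/resolution pair are my own additions for illustration.

```python
import re

TITLE = re.compile(r".*(?=.\d{4}.*\d{3,}p)")

def title_from(filename):
    """Hypothetical helper: apply the lookahead pattern, then turn dots into spaces."""
    m = TITLE.match(filename)
    # Fall back to the whole name when no year/resolution pair is found.
    raw = m.group(0) if m else filename
    return raw.replace(".", " ")

print(title_from("Star.Wars.Episode.4.A.New.Hope.1977.1080p.BrRip.x264.BOKUTOX.YIFY"))
print(title_from("Interstellar.2014.1080p.BluRay.H264.AAC-RARBG"))
```

The greedy .* backtracks to the last position where the lookahead still succeeds, which is just before the separator in front of the year.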

\.(?=.*?(?:19|20)\d{2}\b)|(?:19|20)\d{2}\b.*$
Try this with re.sub. See the demo:
https://regex101.com/r/hR7tH4/10
import re
p = re.compile(r'\.(?=.*?(?:19|20)\d{2}\b)|(?:19|20)\d{2}\b.*$', re.MULTILINE)
test_str = "Star.Wars.Episode.4.A.New.Hope.1977.1080p.BrRip.x264.BOKUTOX.YIFY\nInterstellar.2014.1080p.BluRay.H264.AAC-RARBG\nIron Man 3"
subst = " "
result = re.sub(p, subst, test_str)

Assuming there is always a four-digit year or a four-digit resolution in the movie's file name, a simple solution replaces the unwanted parts matched by
"(?:\.|\d{4}.+$)"
with a blank and strip()s the result afterwards.
For example:
import re
test1 = "Star.Wars.Episode.4.A.New.Hope.1977.1080p.BrRip.x264.BOKUTOX.YIFY"
test2 = "Interstellar.2014.1080p.BluRay.H264.AAC-RARBG"
res1 = re.sub(r"(?:\.|\d{4}.+$)", ' ', test1).strip()
res2 = re.sub(r"(?:\.|\d{4}.+$)", ' ', test2).strip()
print(res1, res2, sep='\n')
Output:
Star Wars Episode 4 A New Hope
Interstellar
