I'm trying to build an application for text extraction use case but I was not able to extract exact price from it.
I have a text like this,
string1 = 'Friscos #8603\n8100 E. Orchard Road\nGreenwood Village, Colorado 80111\n2013-11-02\nTable 00\nGuest\n1 Oysters 1/2 Shell #1\n1 Crab Cake\n1 Filet 1602 Bone In\n1 Ribeye 22oz Bone In\n1 Asparagus\n1 Potato Au Gratin\n$17.00\n$19.00\n$66.00\n$53.00\n$12.00\n$11.50\nSub Total\nTax\n$178.50\n$12.94\nTotal\n$191.44\n'
string2 = 'Berghotel\nGrosse Scheidegg\n3818 Grindelwald\nFamilie R. Müller\nRech. Nr. 4572\nBar\n30.07.2007/13:29:17\nTisch 7/01\nNM\n#ರ\n2xLatte Macchiato à 4.50 CHF\n1xGloki\nà 5.00 CHF\n1xSchweinschnitzel à 22.00 CHF\n1xChässpätzli à 18.50 CHF\n#ರ #ರ #1ರ\n5.00\n22.00\n18.50\nTotal:\nCHF\n54.50\nIncl. 7.6% MwSt\n54.50 CHF:\n3.85\nEntspricht in Euro 36.33 EUR\nEs bediente Sie: Ursula\nMwSt Nr. : 430 234\nTel.: 033 853 67 16\nFax.: 033 853 67 19\nE-mail: grossescheidegg#bluewin.ch\n'
I want to extract the price that appearing after the word total using regex but I was only able to extract all floating numbers. Also do note some-times you may also see words such as sub total but I only need price that appears after the word total. Also sometimes after total there may occur other words as well. So Regex should match word total and extract the floating numbers that appears next to it.
Any help is appreciated.
This is what I've tried,
re.findall("\d+\.\d+", string1) # this returns all floating numbers.
You can try
(?<=\\nTotal)\:?\D+([\d\.]+)
Demo
You could try this, should work for the example and the other restrictions you mentioned
import re
result = re.search('Total\n\$(\d+.\d+)', string1)
result.group(1) # 191.44
result = re.search('Total\:\n.+\n(\d+.\d+)', string2)
result.group(1) # 54.50
EDIT: If you want only one expression for both, you could try
result = re.search('\nTotal\:?(\n\D+)*\n\$?(\d+.\d+)', string)
re.group(2)
You could use a positive lookbehind to prevent sub being before total, word boundaries to prevent the words being part of a larger word and a capturing group to capture the price.
(?<!\bsub )\btotal\b\D*(\d+(?:\.\d+))
In parts:
(?<!\bsub ) Negative lookbehind, assert what is on the left is not the word sub and a space
\btotal\b Match total between word boundaries to prevent it being part of a larger word
\D* Match 0+ times any char that is not a digit
( Capture group 1
\d+(?:\.\d+) Match 1+ digits with an optional decimal part
) Close group
Regex demo | Python demo
For example
import re
regex = r"(?<!\bsub )\btotal\b\D*(\d+(?:\.\d+))"
string1 = 'Friscos #8603\n8100 E. Orchard Road\nGreenwood Village, Colorado 80111\n2013-11-02\nTable 00\nGuest\n1 Oysters 1/2 Shell #1\n1 Crab Cake\n1 Filet 1602 Bone In\n1 Ribeye 22oz Bone In\n1 Asparagus\n1 Potato Au Gratin\n$17.00\n$19.00\n$66.00\n$53.00\n$12.00\n$11.50\nSub Total\nTax\n$178.50\n$12.94\nTotal\n$191.44\n'
string2 = 'Berghotel\nGrosse Scheidegg\n3818 Grindelwald\nFamilie R. Müller\nRech. Nr. 4572\nBar\n30.07.2007/13:29:17\nTisch 7/01\nNM\n#ರ\n2xLatte Macchiato à 4.50 CHF\n1xGloki\nà 5.00 CHF\n1xSchweinschnitzel à 22.00 CHF\n1xChässpätzli à 18.50 CHF\n#ರ #ರ #1ರ\n5.00\n22.00\n18.50\nTotal:\nCHF\n54.50\nIncl. 7.6% MwSt\n54.50 CHF:\n3.85\nEntspricht in Euro 36.33 EUR\nEs bediente Sie: Ursula\nMwSt Nr. : 430 234\nTel.: 033 853 67 16\nFax.: 033 853 67 19\nE-mail: grossescheidegg#bluewin.ch\n'
print(re.findall(regex, string1, re.IGNORECASE))
print(re.findall(regex, string2, re.IGNORECASE))
Output
['191.44']
['54.50']
If what precedes the price should be a dollar sign of the text CHF, you might use an alternation (?:\$|CHF)\s* matching of the values followed by matching 0+ whitespace chars:
(?<!\bsub )\btotal\b\D*(?:\$|CHF)\s*(\d+(?:\.\d+))
Regex demo
Something like this might do the trick:
(?<!sub )total.*?(\d+.\d+)
Make sure to ignore the case.
If I have text like this:
CARBON 1569
1.00% IRON 234
99% CARBON, 1% IRON 181
98.2% CARBON 1% ZINC 181
99% CARBON#1% IRON 141
ASD CARBON 2% IRON RANDOMWORD 23
Let's say I want to retain only the element names and percentage values (which includes numbers, decimal point and percentage sign). I can run a regex substitution to do so. I tried out plenty of combinations of stuff like (CARBON|IRON|ZINC), which replaces all occurences of element names, and [^0-9.\%]+ which retains all percentage values.
But I can't figure out how to combine these such that I retain both the percentage values and element names. Any help would be appreciated.
EDIT: The spaces would also need to be retained for the output to make sense. All unnecessary characters can be replaced by spaces. The expected output would be
CARBON 1569
1.00% IRON 234
99% CARBON 1% IRON 181
98.2% CARBON 1% ZINC 181
99% CARBON 1% IRON 141
CARBON 2% IRON 23
You may use this regex to match your desired text:
\b(CARBON\b|IRON\b|ZINC\b|\d+(?:\.\d+)?(?:%|\b))|\S
And replace it by '\1 ' (will add trailing spaces in input lines)
RegEx Demo
RegEx Detail:
\b: Word boundary
(: Start capture group
CARBON\b: Match CARBON followed by word boundary
|: OR
IRON\b: Match IRON followed by word boundary
|: OR
ZINC\b: Match ZINC followed by word boundary
|: OR
\d+(?:\.\d+)?: Match an integer or float number
(?:%|\b): Followed by % or word boundary
):
|: OR
\S: Match a non-whitespace character
To simplify you May start with this as per your requirement:
\b(?!CARBON|ZINC|IRON)[a-zA-Z#]+
Then you may have to post process something (like # being replaced by blank) as per your comments.
REGEX101
You can try replacing all the words except:
* Element names
* Numbers
* Percentage.
To achieve this you can use negative lookahead:
(?!CARBON|IRON|ZINC|(\d+\.\d+\%)|\d+)\b[a-zA-Z#]+
Demo
I want to rename a long list of file names to make them more searchable. The names where auto generated so there is some odd spacing issues. I wrote a little python script that does what I want. But I don't want to remove white spaces between words. For instance I have two names:
0 130 — HG — 1500 — 12" (Page 1 of 2)
01 30 — HD LOW POINT DRAIN
They should read :
0130-HG-1500-12"
0130-HD LOW POINT DRAIN
My code so far :
import os
import re
for filename in os.listdir("."):
if not filename.endswith(".py"):
os.replace(filename, re.sub("[(].*?[)]", "", # Remove anything between ()
"".join(filename.split() # Remove any whitespaces
).replace("—", "-"))) # Replace Em dash with hyphen
Everything is working except I cant figure out how to not strip white spaces between words only.
If by "words" you mean "strings made up of letters" then
re.sub('((?<=[^a-zA-Z]) | (?=[^a-zA-Z]))', '', filename)
will do the trick. In plain language, that would be "replace every space that is either after or before a non-letter character with nothing". Output:
In [24]: re.sub('((?<=[^A-Z]) | (?=[^A-Z]))', '', '01 30 — HD LOW POINT DRAIN ')
Out[24]: '0130—HD LOW POINT DRAIN'
In [25]: re.sub('((?<=[^A-Z]) | (?=[^A-Z]))', '', '0 130 — HG — 1500 — 12"')
Out[25]: '0130—HG—1500—12"'
Below given are the UK phone numbers need to fetch from text file:
07791523634
07910221698
But it only print 0779152363, 0791022169 skipping the 11th character.
Also it produce unnecessary values like ('')
Ex : '', '07800 854536'
Below is the regex I've used:
phnsrch = re.compile(r'\d{3}[-\.\s]??\d{3}[-\.\s]??\d{4}|\(\d{3}\)\s*\d{3}[-\.\s]??\d{5}|\d{3}[-\.\s]??\d{4}[-\.\s]??\d{4}|\d{5}[-\.\s]??\d{3}[-\.\s]??\d{3}|/^(?:(?:\(?(?:0(?:0|11)\)?[\s-]?\(?|\+)44\)?[\s-]?(?:\(?0\)?[\s-]?)?)|(?:\(?0))(?:(?:\d{5}\)?[\s-]?\d{4,5})|(?:\d{4}\)?[\s-]?(?:\d{5}|\d{3}[\s-]?\d{3}))|(?:\d{3}\)?[\s-]?\d{3}[\s-]?\d{3,4})|(?:\d{2}\)?[\s-]?\d{4}[\s-]?\d{4}))(?:[\s-]?(?:x|ext\.?|\#)\d{3,4})?$/|')
Need help to fetch the complete set of 11 numbers without any unnecessary symbols
Finally figured out the solution for matching the UK numbers below:
07540858798
0113 2644489
02074 735 217
07512 850433
01942 896007
01915222200
01582 492734
07548 021 475
020 8563 7296
07791523634
re.compile(r'\d{3}[-\.\s]??\d{4}[-\.\s]??\d{4}|\d{5}[-\.\s]??\d{3}[-\.\s]??\d{3}|(?:\d{4}\)?[\s-]?\d{3}[\s-]?\d{4})')
Thanks to those who helped me with this issue.
I think your regex is too long and can be more easier, try this regex instead:
^(07\d{8,12}|447\d{7,11})$
the following python script allows me to scrape email addresses from a given file using regular expressions.
How could I add to this so that I can also get phone numbers? Say, if it was either the 7 digit or 10 digit (with area code), and also account for parenthesis?
My current script can be found below:
# filename variables
filename = 'file.txt'
newfilename = 'result.txt'
# read the file
if os.path.exists(filename):
data = open(filename,'r')
bulkemails = data.read()
else:
print "File not found."
raise SystemExit
# regex = something#whatever.xxx
r = re.compile(r'(\b[\w.]+#+[\w.]+.+[\w.]\b)')
results = r.findall(bulkemails)
emails = ""
for x in results:
emails += str(x)+"\n"
# function to write file
def writefile():
f = open(newfilename, 'w')
f.write(emails)
f.close()
print "File written."
Regex for phone numbers:
(\d{3}[-\.\s]\d{3}[-\.\s]\d{4}|\(\d{3}\)\s*\d{3}[-\.\s]\d{4}|\d{3}[-\.\s]\d{4})
Another regex for phone numbers:
(?:(?:\+?1\s*(?:[.-]\s*)?)?(?:\(\s*([2-9]1[02-9]|[2-9][02-8]1|[2-9][02-8][02-9])\s*\)|([2-9]1[02-9]|[2-9][02-8]1|[2-9][02-8][02-9]))\s*(?:[.-]\s*)?)?([2-9]1[02-9]|[2-9][02-9]1|[2-9][02-9]{2})\s*(?:[.-]\s*)?([0-9]{4})(?:\s*(?:#|x\.?|ext\.?|extension)\s*(\d+))?
If you are interested in learning Regex, you could take a stab at writing it yourself. It's not quite as hard as it's made out to be. Sites like RegexPal allow you to enter some test data, then write and test a Regular Expression against that data. Using RegexPal, try adding some phone numbers in the various formats you expect to find them (with brackets, area codes, etc), grab a Regex cheatsheet and see how far you can get. If nothing else, it will help in reading other peoples Expressions.
Edit:
Here is a modified version of your Regex, which should also match 7 and 10-digit phone numbers that lack any hyphens, spaces or dots. I added question marks after the character classes (the []s), which makes anything within them optional. I tested it in RegexPal, but as I'm still learning Regex, I'm not sure that it's perfect. Give it a try.
(\d{3}[-\.\s]??\d{3}[-\.\s]??\d{4}|\(\d{3}\)\s*\d{3}[-\.\s]??\d{4}|\d{3}[-\.\s]??\d{4})
It matched the following values in RegexPal:
000-000-0000
000 000 0000
000.000.0000
(000)000-0000
(000)000 0000
(000)000.0000
(000) 000-0000
(000) 000 0000
(000) 000.0000
000-0000
000 0000
000.0000
0000000
0000000000
(000)0000000
This is the process of building a phone number scraping regex.
First, we need to match an area code (3 digits), a trunk (3 digits), and an extension (4 digits):
reg = re.compile("\d{3}\d{3}\d{4}")
Now, we want to capture the matched phone number, so we add parenthesis around the parts that we're interested in capturing (all of it):
reg = re.compile("(\d{3}\d{3}\d{4})")
The area code, trunk, and extension might be separated by up to 3 characters that are not digits (such as the case when spaces are used along with the hyphen/dot delimiter):
reg = re.compile("(\d{3}\D{0,3}\d{3}\D{0,3}\d{4})")
Now, the phone number might actually start with a ( character (if the area code is enclosed in parentheses):
reg = re.compile("(\(?\d{3}\D{0,3}\d{3}\D{0,3}\d{4}).*?")
Now that whole phone number is likely embedded in a bunch of other text:
reg = re.compile(".*?(\(?\d{3}\D{0,3}\d{3}\D{0,3}\d{4}).*?")
Now, that other text might include newlines:
reg = re.compile(".*?(\(?\d{3}\D{0,3}\d{3}\D{0,3}\d{4}).*?", re.S)
Enjoy!
I personally stop here, but if you really want to be sure that only spaces, hyphens, and dots are used as delimiters then you could try the following (untested):
reg = re.compile(".*?(\(?\d{3})? ?[\.-]? ?\d{3} ?[\.-]? ?\d{4}).*?", re.S)
I think this regex is very simple for parsing phone numbers
re.findall("[(][\d]{3}[)][ ]?[\d]{3}-[\d]{4}", lines)
Below is completion of the answers above. This regex is also able to detect country code:
((?:\+\d{2}[-\.\s]??|\d{4}[-\.\s]??)?(?:\d{3}[-\.\s]??\d{3}[-\.\s]??\d{4}|\(\d{3}\)\s*\d{3}[-\.\s]??\d{4}|\d{3}[-\.\s]??\d{4}))
It can detect the samples below:
000-000-0000
000 000 0000
000.000.0000
(000)000-0000
(000)000 0000
(000)000.0000
(000) 000-0000
(000) 000 0000
(000) 000.0000
000-0000
000 0000
000.0000
0000000
0000000000
(000)0000000
# Detect phone numbers with country code
+00 000 000 0000
+00.000.000.0000
+00-000-000-0000
+000000000000
0000 0000000000
0000-000-000-0000
00000000000000
+00 (000)000 0000
0000 (000)000-0000
0000(000)000-0000
Updated as of 03.05.2022:
I fixed some issues in the phone numbers detection regex above, you find it in the link below. Complete the regex to include more country codes.
https://regex101.com/r/6Qcrk1/1
For spanish phone numbers I use this with quite success:
re.findall( r'[697]\d{1,2}.\d{2,3}.\d{2,3}.\d{0,2}',str)
You can check : http://regex.inginf.units.it/. With some training data and target, it constructs you an appropriate regex. It is not always perfect (check F-score). Let's try it with 15 examples :
re.findall("\w\d \w\w \w\w \w\w \w\d|(?<=[^\d][^_][^_] )[^_]\d[^ ]\d[^ ][^ ]+|(?<= [^<]\w\w \w\w[^:]\w[^_][^ ][^,][^_] )(?: *[^<]\d+)+",
"""Lorem ipsum © 04-42-00-00-00 dolor 1901 sit amet, consectetur +33 (0)4 42 00 00 00 adipisicing elit. 2016 Sapiente dicta fugit fugiat hic 04 42 00 00 00 aliquam itaque 04.42.00.00.00 facere, 13205 number: 100 000 000 00013 soluta. 4 Totam id dolores!""")
returns ['04 42 00 00 00', '04.42.00.00.00', '04-42-00-00-00', '50498,']
add more examples to gain precision
Since nobody has posted this regex yet, I will. This is what I use to find phone numbers. It matches all regular phone number formats you see in the United States. I did not need this regex to match international numbers so I didn't make adjustments to regex for that purpose.
phone_number_regex_pattern = r"\(?\d{3}\)?[-.\s]\d{3}[-.\s]\d{4}"
Use this pattern if you want simple phone numbers with no characters in between to match. An example of this would be: "4441234567".
phone_number_regex_pattern = r"\(?\d{3}\)?[-.\s]?\d{3}[-.\s]?\d{4}"
//search phone number using regex in python
//form the regex according to your output
// with this you can get single mobile number
phoneRegex = re.compile(r"\d\d\d-\d\d\d-\d\d\d\d")
Mobile = phoneRegex.search("my number is 123-456-6789")
print(Mobile.group())
Output: 123-456-6789
phoneRegex1 = re.compile(r"(\d\d\d-)?\d\d\d-\d\d\d\d")
Mobile1 = phoneRegex1.search("my number is 123-456-6789")
print(Mobile1.group())
Output: 123-456-789
Mobile1 = phoneRegex1.search("my number is 456-6789")
print(Mobile1.group())
Output: 456-678
While these are simple solutions they are all incorrect for North America. The problem lies in the fact that area-code and exchange numbers cannot start with a zero or a one.
r"(\\(?[2-9]\d{2}\\)?[ -])?[2-9]\d{2}-\d{4}"
would be the correct way to parse a 7 or 10-digit phone number.
(202) 555-4111
(202)-555-4111
202-555-4111
555-4111
will all parse correctly.
Use this code to find the number like "416-676-4560"
doc=browser.page_source
phones=re.findall(r'[\d]{3}-[\d]{3}-[\d]{4}',doc)