Regex: Replace all except numbers, specific characters and specific words - python

If I have text like this:
CARBON 1569
1.00% IRON 234
99% CARBON, 1% IRON 181
98.2% CARBON 1% ZINC 181
99% CARBON#1% IRON 141
ASD CARBON 2% IRON RANDOMWORD 23
Let's say I want to retain only the element names and percentage values (which include digits, the decimal point and the percent sign). I can run a regex substitution to do so. I tried out plenty of combinations of things like (CARBON|IRON|ZINC), which replaces all occurrences of element names, and [^0-9.\%]+, which retains all percentage values.
But I can't figure out how to combine these such that I retain both the percentage values and element names. Any help would be appreciated.
EDIT: The spaces would also need to be retained for the output to make sense. All unnecessary characters can be replaced by spaces. The expected output would be
CARBON 1569
1.00% IRON 234
99% CARBON 1% IRON 181
98.2% CARBON 1% ZINC 181
99% CARBON 1% IRON 141
CARBON 2% IRON 23

You may use this regex to match your desired text:
\b(CARBON\b|IRON\b|ZINC\b|\d+(?:\.\d+)?(?:%|\b))|\S
And replace it with '\1 ' (this will add trailing spaces to the output lines)
RegEx Demo
RegEx Detail:
\b: Word boundary
(: Start capture group
CARBON\b: Match CARBON followed by word boundary
|: OR
IRON\b: Match IRON followed by word boundary
|: OR
ZINC\b: Match ZINC followed by word boundary
|: OR
\d+(?:\.\d+)?: Match an integer or float number
(?:%|\b): Followed by % or word boundary
): End capture group
|: OR
\S: Match a non-whitespace character
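As a rough sketch of how this pattern could be applied in Python: every match is replaced by group 1 plus a space, so unwanted characters (where group 1 is empty) simply turn into spaces. The function name clean_line and the final whitespace clean-up are my own additions, not part of the answer, and this assumes Python 3.5+, where an unmatched group in the replacement is treated as an empty string.

```python
import re

# Pattern from the answer above: group 1 holds what we keep; a bare \S is anything else.
KEEP_RE = re.compile(r"\b(CARBON\b|IRON\b|ZINC\b|\d+(?:\.\d+)?(?:%|\b))|\S")

def clean_line(line):
    # Kept tokens become "token "; unwanted characters become " " (group 1 is empty there).
    replaced = KEEP_RE.sub(r"\1 ", line)
    # Extra step (an assumption, not in the answer): collapse the leftover runs of spaces.
    return " ".join(replaced.split())

lines = [
    "CARBON 1569",
    "1.00% IRON 234",
    "99% CARBON, 1% IRON 181",
    "98.2% CARBON 1% ZINC 181",
    "99% CARBON#1% IRON 141",
    "ASD CARBON 2% IRON RANDOMWORD 23",
]
cleaned = [clean_line(line) for line in lines]
for line in cleaned:
    print(line)
```

On the sample lines from the question this should print exactly the expected output listed there.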

To simplify, you may start with this, as per your requirement:
\b(?!CARBON|ZINC|IRON)[a-zA-Z#]+
Then you may have to post process something (like # being replaced by blank) as per your comments.
REGEX101
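A minimal sketch of that two-step idea, assuming the unwanted words are replaced by a space and that the post-processing step blanks out any remaining character that is not a word character, dot, percent sign or space. The name strip_unwanted and the second substitution are hypothetical, chosen only to illustrate the approach.

```python
import re

# Pattern from the answer above: a run of letters/# that does not start with an element name.
UNWANTED_RE = re.compile(r"\b(?!CARBON|ZINC|IRON)[a-zA-Z#]+")

def strip_unwanted(line):
    # Step 1: blank out unwanted words (and # when it directly follows a word).
    line = UNWANTED_RE.sub(" ", line)
    # Step 2 (assumed post-processing): blank out leftover punctuation such as commas.
    line = re.sub(r"[^\w.% ]", " ", line)
    # Collapse the resulting runs of spaces.
    return " ".join(line.split())

print(strip_unwanted("99% CARBON#1% IRON 141"))
print(strip_unwanted("ASD CARBON 2% IRON RANDOMWORD 23"))
print(strip_unwanted("99% CARBON, 1% IRON 181"))
```

Note that the lookahead only checks how a word starts, so a word such as IRONIC would be kept as well; adding \b after the alternation inside the lookahead would be one way to tighten that.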

You can try replacing all the words except:
* Element names
* Numbers
* Percentage.
To achieve this you can use negative lookahead:
(?!CARBON|IRON|ZINC|(\d+\.\d+\%)|\d+)\b[a-zA-Z#]+
Demo


Using RegEx in Python to extract contents

Good evening,
I am very new to Python and RegEx. I have the following sentence:
-75.76 Card INSURANCEGrabPay ASIA DIRECT to Paid AM 1:16 +100.00 3257 UpAmex Top PM 9:55 +300.00 3257 UpAmex Top PM 9:55 -400.00 Card LTDGrabPay PTE AXS to Paid PM 9:57 (SGD) Amount Details Time here. appear will transactions cashless your All 2022 Feb 15 on made transactions GrabPay points 52 earned points Rewards 475.76 SGD spent Amount 0.24 SGD balance Wallet 2022 Feb 15 Summary statement daily your here
I would like to search for just '-' and the amount after it.
After that, I would like to skip 2 words and extract ALL the words just before 'Paid', in a single group if need be (I will read more about groups, but for now I need a single group, which I can later split to get the individual words).
For instance, I would get
-75.76 ASIA Direct to
-400 PTE AXS to
What would the regex be? Also, is there a good regex tutorial I can read up on?
For now I have created one match with 2 groups, i.e. group 1 for the amount and group 2 for all the words (including the "to " string).
Regex:
(-\d+\.?\d+) \w+ \w+ ([\w ]+)?Paid
You can check the details here: https://regex101.com/r/eUMgdW/1
Python code:
import re
output = re.findall(r"(-\d+\.?\d+) \w+ \w+ ([\w ]+)?Paid", your_input_string)
for found in output:
    print(found)
# ('-75.76', 'ASIA DIRECT to ')
# ('-400.00', 'PTE AXS to ')
Rather than give you the actual regex, I'll gently nudge you in the right direction. It's more satisfying that way.
"Words" here are separated by spaces. So what you're searching for is a group of characters (captured), a space, characters again, a space, characters, a space, then capture everything up to the literal word "Paid". Try to create a regex that does that.
If you'd like to brush up on regex, check out Regex101. It's a web tool to test out regex, along with a debugger and a cheat sheet.
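For reference, one possible way to put those pieces together and then split the captured words, as the question asks. This is only a sketch: it assumes words are runs of \w characters separated by single spaces, as in the sample, and the names TXN_RE and extract_debits are made up for illustration.

```python
import re

# amount, two skipped words, then lazily as many "word " chunks as needed before "Paid"
TXN_RE = re.compile(r"(-\d+(?:\.\d+)?) \w+ \w+ ((?:\w+ )*?)Paid\b")

def extract_debits(text):
    # Returns (amount, [words before "Paid"]) for each negative amount found.
    return [(m.group(1), m.group(2).split()) for m in TXN_RE.finditer(text)]

# Shortened copy of the sentence from the question.
sample = (
    "-75.76 Card INSURANCEGrabPay ASIA DIRECT to Paid AM 1:16 +100.00 3257 UpAmex Top PM 9:55 "
    "+300.00 3257 UpAmex Top PM 9:55 -400.00 Card LTDGrabPay PTE AXS to Paid PM 9:57 (SGD) Amount"
)
debits = extract_debits(sample)
print(debits)
# [('-75.76', ['ASIA', 'DIRECT', 'to']), ('-400.00', ['PTE', 'AXS', 'to'])]
```

Splitting group 2 with str.split() also drops the trailing space that the single-group version keeps.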

Getting quantity and unit

I want to get bold parts in sentences below.
Examples:
SmellNice Coffee 450 gr
Clean 2 k Rice
LukaLuka 1,5lt cold drink
Jumbo 7 gutgut eggs 12'li
Espresso 5 Klasik 10 Ad
The expression below works well except for the last two.
\d+[.,]?\d*\s*[’']?\s*(gr|g|kg|k|adet|ad|lı|li|lu|lü|cc|cl|ml|lt|l|mm|cm|mt|m)
I have added \s|$ to the end of the expression, thinking that if the unit is not the last word then there should be a space after it, but it didn't work. In short, how can I capture all the bold expressions?
It works with brackets:
\d+[.,]?\d*\s*[’']?\s*(gr|g|kg|k|adet|ad|lı|li|lu|lü|cc|cl|ml|lt|l|mm|cm|mt|m)(\s+|$)
x2 = (
    r"\d+"  # digit
    r"[,'\s]"  # space, comma or apostrophe
    r"[\d*\s*]?"  # optional digit or space
    r"((gr)|g|(kg)|k|(adet)|([Aa]d)|(lı)|(li)|(lu)|(lü)|(cc)|(cl)|(ml)|(lt)|l|(mm)|(cm)|(mt)|m)"  # all the units to look for
    r"(\s+|$)"  # must be followed by whitespace or by the end of the line
)
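A small sketch that tries this idea against the five example lines. It is a variation rather than either pattern above verbatim: the trailing (\s+|$) is turned into a lookahead so the space is not part of the match, the decimal part is written as (?:[.,]\d+)?, and re.IGNORECASE is assumed so that "Ad" is found. The names UNITS, QTY_RE and find_quantity are hypothetical.

```python
import re

UNITS = "gr|g|kg|k|adet|ad|lı|li|lu|lü|cc|cl|ml|lt|l|mm|cm|mt|m"
QTY_RE = re.compile(
    r"(?P<qty>\d+(?:[.,]\d+)?)"   # number, optionally with a . or , decimal part
    r"\s*[’']?\s*"                # optional spaces and apostrophe, as in 12'li
    r"(?P<unit>" + UNITS + r")"   # one of the known units
    r"(?=\s|$)",                  # the unit must end at whitespace or end of string
    re.IGNORECASE,
)

def find_quantity(text):
    # Returns (quantity, unit) for the first match, or None.
    m = QTY_RE.search(text)
    return (m.group("qty"), m.group("unit")) if m else None

examples = [
    "SmellNice Coffee 450 gr",
    "Clean 2 k Rice",
    "LukaLuka 1,5lt cold drink",
    "Jumbo 7 gutgut eggs 12'li",
    "Espresso 5 Klasik 10 Ad",
]
for example in examples:
    print(example, "->", find_quantity(example))
```

The lookahead is what rejects "7 gutgut" and "5 Klasik": the unit alternatives g and k do match there, but they are not followed by whitespace or the end of the string.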

How to extract first floating numbers appearing after a word?

I'm trying to build an application for a text extraction use case, but I have not been able to extract the exact price from the text.
I have a text like this,
string1 = 'Friscos #8603\n8100 E. Orchard Road\nGreenwood Village, Colorado 80111\n2013-11-02\nTable 00\nGuest\n1 Oysters 1/2 Shell #1\n1 Crab Cake\n1 Filet 1602 Bone In\n1 Ribeye 22oz Bone In\n1 Asparagus\n1 Potato Au Gratin\n$17.00\n$19.00\n$66.00\n$53.00\n$12.00\n$11.50\nSub Total\nTax\n$178.50\n$12.94\nTotal\n$191.44\n'
string2 = 'Berghotel\nGrosse Scheidegg\n3818 Grindelwald\nFamilie R. Müller\nRech. Nr. 4572\nBar\n30.07.2007/13:29:17\nTisch 7/01\nNM\n#ರ\n2xLatte Macchiato à 4.50 CHF\n1xGloki\nà 5.00 CHF\n1xSchweinschnitzel à 22.00 CHF\n1xChässpätzli à 18.50 CHF\n#ರ #ರ #1ರ\n5.00\n22.00\n18.50\nTotal:\nCHF\n54.50\nIncl. 7.6% MwSt\n54.50 CHF:\n3.85\nEntspricht in Euro 36.33 EUR\nEs bediente Sie: Ursula\nMwSt Nr. : 430 234\nTel.: 033 853 67 16\nFax.: 033 853 67 19\nE-mail: grossescheidegg#bluewin.ch\n'
I want to extract the price that appears after the word total using regex, but I was only able to extract all floating-point numbers. Note that sometimes you may also see words such as sub total, but I only need the price that appears after the word total. Sometimes other words may also occur after total. So the regex should match the word total and extract the floating-point number that appears next after it.
Any help is appreciated.
This is what I've tried,
re.findall("\d+\.\d+", string1) # this returns all floating numbers.
You can try
(?<=\nTotal):?\D+([\d.]+)
Demo
You could try this, should work for the example and the other restrictions you mentioned
import re
result = re.search(r'Total\n\$(\d+\.\d+)', string1)
result.group(1) # 191.44
result = re.search(r'Total:\n.+\n(\d+\.\d+)', string2)
result.group(1) # 54.50
EDIT: If you want only one expression for both, you could try
result = re.search(r'\nTotal:?(\n\D+)*\n\$?(\d+\.\d+)', string1)  # works the same way for string2
result.group(2)
You could use a negative lookbehind to prevent sub from appearing before total, word boundaries to prevent the words from being part of a larger word, and a capturing group to capture the price.
(?<!\bsub )\btotal\b\D*(\d+(?:\.\d+))
In parts:
(?<!\bsub ) Negative lookbehind, assert what is on the left is not the word sub and a space
\btotal\b Match total between word boundaries to prevent it being part of a larger word
\D* Match 0+ times any char that is not a digit
( Capture group 1
\d+(?:\.\d+) Match 1+ digits followed by a decimal part
) Close group
Regex demo | Python demo
For example
import re
regex = r"(?<!\bsub )\btotal\b\D*(\d+(?:\.\d+))"
string1 = 'Friscos #8603\n8100 E. Orchard Road\nGreenwood Village, Colorado 80111\n2013-11-02\nTable 00\nGuest\n1 Oysters 1/2 Shell #1\n1 Crab Cake\n1 Filet 1602 Bone In\n1 Ribeye 22oz Bone In\n1 Asparagus\n1 Potato Au Gratin\n$17.00\n$19.00\n$66.00\n$53.00\n$12.00\n$11.50\nSub Total\nTax\n$178.50\n$12.94\nTotal\n$191.44\n'
string2 = 'Berghotel\nGrosse Scheidegg\n3818 Grindelwald\nFamilie R. Müller\nRech. Nr. 4572\nBar\n30.07.2007/13:29:17\nTisch 7/01\nNM\n#ರ\n2xLatte Macchiato à 4.50 CHF\n1xGloki\nà 5.00 CHF\n1xSchweinschnitzel à 22.00 CHF\n1xChässpätzli à 18.50 CHF\n#ರ #ರ #1ರ\n5.00\n22.00\n18.50\nTotal:\nCHF\n54.50\nIncl. 7.6% MwSt\n54.50 CHF:\n3.85\nEntspricht in Euro 36.33 EUR\nEs bediente Sie: Ursula\nMwSt Nr. : 430 234\nTel.: 033 853 67 16\nFax.: 033 853 67 19\nE-mail: grossescheidegg#bluewin.ch\n'
print(re.findall(regex, string1, re.IGNORECASE))
print(re.findall(regex, string2, re.IGNORECASE))
Output
['191.44']
['54.50']
If what precedes the price should be a dollar sign or the text CHF, you might use an alternation (?:\$|CHF) matching either of the values, followed by \s* matching 0+ whitespace chars:
(?<!\bsub )\btotal\b\D*(?:\$|CHF)\s*(\d+(?:\.\d+))
Regex demo
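A short sketch of that currency variant in Python. As a small variation, the alternation is written as a capturing group here so the currency marker comes back together with the amount; the names TOTAL_RE and extract_total and the shortened receipts are mine, for illustration only.

```python
import re

# Same structure as the pattern above, but (\$|CHF) captures the currency marker.
TOTAL_RE = re.compile(
    r"(?<!\bsub )\btotal\b\D*(\$|CHF)\s*(\d+(?:\.\d+))",
    re.IGNORECASE,
)

def extract_total(text):
    # Returns (currency, amount) for the first total that is not a sub total, or None.
    m = TOTAL_RE.search(text)
    return (m.group(1), m.group(2)) if m else None

# Shortened excerpts of string1 and string2 from the question.
receipt1 = "Sub Total\nTax\n$178.50\n$12.94\nTotal\n$191.44\n"
receipt2 = "18.50\nTotal:\nCHF\n54.50\nIncl. 7.6% MwSt\n"

print(extract_total(receipt1))  # ('$', '191.44')
print(extract_total(receipt2))  # ('CHF', '54.50')
```

Because \D* is greedy it first swallows the currency marker too, and the engine then backtracks until (\$|CHF) can match, so no change to \D* is needed.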
Something like this might do the trick:
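(?<!sub )total.*?(\d+\.\d+)
Make sure to ignore case and, because the price is on a later line than the word total, to let the dot match newlines as well (re.IGNORECASE | re.DOTALL in Python).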

Removing varying text phrases through RegEx in a Python Data frame

Basically, I want to remove certain phrase patterns embedded in my text data:
1. Starts with an upper case letter and ends with an Em Dash "—"
2. Starts with an Em Dash "—" and ends with "Read Next"
Say, I've got the following data:
CEBU CITY—The widow of slain human rights lawyer .... citing figures from the NUPL that showed that 34 lawyers had been killed in the past two years. —WITH REPORTS FROM JULIE M. AURELIO AND DJ YAPRead Next
and
Manila, Philippines—President .... but justice will eventually push its way through their walls of impunity, ... —REPORTS FROM MELVIN GASCON, JULIE M. AURELIO, DELFIN T. MALLARI JR., JEROME ANING, JOVIC YEE, GABRIEL PABICO LALU, PATHRICIA ANN V. ROXAS, DJ YAP, AFP, APRead Next
I want to remove the following phrases:
"CEBU CITY—"
"—WITH REPORTS FROM JULIE M. AURELIO AND DJ YAPRead Next"
"Manila, Philippines—"
"—REPORTS FROM MELVIN GASCON, JULIE M. AURELIO, DELFIN T. MALLARI JR., JEROME ANING, JOVIC YEE, GABRIEL PABICO LALU, PATHRICIA ANN V. ROXAS, DJ YAP, AFP, APRead Next"
I am assuming this would need two regexes, one for each of the patterns enumerated above.
The regex: —[A-Z].*Read Next\s*$ may work on pattern # 2, but only when there are no other em dashes in the text data. It will not work when pattern # 1 occurs, as it will remove the whole chunk from the first em dash it sees up to the "Read Next" string.
I have tried the following regex for pattern # 1:
^[A-Z]([A-Za-z]).+(—)$
But it does not work, and I don't understand why. That regex was supposed to look for a phrase that starts with any upper case letter, followed by a string of any length, as long as it ends with an "—".
What you are treating as a hyphen (-) is not actually a hyphen but an em dash (—), so you need to use this regex, which starts with an em dash instead of a hyphen:
^—[A-Z].*Read Next\s*$
Here is the explanation for this regex,
^ --> Start of input
— --> Matches a literal Em Dash whose Unicode Decimal Code is 8212
[A-Z] --> Matches an upper case letter
.* --> Matches any character zero or more times
Read Next --> Matches these literal words
\s* --> This is for matching any optional white space that might be present at the end of line
$ --> End of input
Online demo
The regex that should take care of this -
^—[A-Z]+(.)*(Read Next)$
You can try implementing this regex on your data and see if it works out.
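For comparison, here is a hedged sketch that handles both patterns from the question with two separate substitutions. It rests on two assumptions that are mine, not the asker's: a dateline is short (at most about 50 characters) and contains no em dash, and the credits run from the last em dash to a trailing "Read Next". The names DATELINE_RE, CREDITS_RE and strip_phrases are hypothetical.

```python
import re

# Pattern 1: an upper case letter at the very start, a short run without em dashes, then an em dash.
DATELINE_RE = re.compile(r"^[A-Z][^—]{0,50}—")
# Pattern 2: an em dash, no further em dashes, then "Read Next" at the end of the text.
CREDITS_RE = re.compile(r"—[^—]*Read Next\s*$")

def strip_phrases(text):
    text = DATELINE_RE.sub("", text)
    text = CREDITS_RE.sub("", text)
    return text.strip()

sample = (
    "CEBU CITY—The widow of slain human rights lawyer .... citing figures from the NUPL "
    "that showed that 34 lawyers had been killed in the past two years. "
    "—WITH REPORTS FROM JULIE M. AURELIO AND DJ YAPRead Next"
)
cleaned = strip_phrases(sample)
print(cleaned)
```

Using [^—]* instead of .* is what keeps pattern 2 from starting at the first em dash in the text. On a data frame column, the same function could presumably be applied with Series.apply.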

getting words between m and n characters

I am trying to get all names that start with a capital letter and end with a full stop on the same line, where the number of characters is between 3 and 5.
My text is as follows:
King. Great happinesse
Rosse. That now Sweno, the Norwayes King,
Craues composition:
Nor would we deigne him buriall of his men,
Till he disbursed, at Saint Colmes ynch,
Ten thousand Dollars, to our generall vse
King. No more that Thane of Cawdor shall deceiue
Our Bosome interest: Goe pronounce his present death,
And with his former Title greet Macbeth
Rosse. Ile see it done
King. What he hath lost, Noble Macbeth hath wonne.
I am testing it out on this link. I am trying to get all words between 3 and 5 but haven't succeeded.
Does this produce your desired output?
import re
re.findall(r'[A-Z].{2,4}\.', text)
When text contains the text in your question it will produce this output:
['King.', 'Rosse.', 'King.', 'Rosse.', 'King.']
The regex pattern matches a capital letter followed by any 2 to 4 characters and then a literal dot. You can tighten that up if required: e.g. [A-Z][a-z]{2,4}\. would match an upper case character followed by between 2 and 4 lowercase characters followed by a literal dot/period.
If you don't want duplicates you can use a set to get rid of them:
>>> set(re.findall(r'[A-Z].{2,4}\.', text))
{'Rosse.', 'King.'}
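If the names of interest are the speaker labels at the start of each line (an assumption on my part), the tightened pattern can also be anchored with re.MULTILINE. A minimal sketch, using a few lines of the text from the question and capturing the name without its full stop:

```python
import re

# ^ matches at the start of every line because of re.MULTILINE;
# the group captures a capital letter plus 2 to 4 lowercase letters (3 to 5 characters).
NAME_RE = re.compile(r"^[ \t]*([A-Z][a-z]{2,4})\.", re.MULTILINE)

text = """King. Great happinesse
Rosse. That now Sweno, the Norwayes King,
Craues composition:
King. No more that Thane of Cawdor shall deceiue
Rosse. Ile see it done
King. What he hath lost, Noble Macbeth hath wonne."""

names = NAME_RE.findall(text)
print(names)               # ['King', 'Rosse', 'King', 'Rosse', 'King']
print(sorted(set(names)))  # ['King', 'Rosse']
```

Sorting the set gives a stable order, which a bare set does not guarantee.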
You may have your own reasons for wanting to use regexs here, but Python provides a rich set of string methods and (IMO) it's easier to understand the code using these:
matched_words = []
with open('text.txt') as f:
    for line in f:
        for word in line.split():
            # word[-1] is the full stop, so the name itself is len(word) - 1 characters long
            if word[0].isupper() and word[-1] == '.' and 3 <= len(word) - 1 <= 5:
                matched_words.append(word)
print(matched_words)
