Proper regex (re) pattern in python [closed] - python

I'm trying to come up with a proper regex pattern (and I am very bad at it) for the strings that I have. Each time I end up with something that only works partly. I'll show the pattern that I made later below, but first, I want to specify what I want to extract out of a text.
Data:
Company Fragile9 Closes €9M Series B Funding
Appplle21 Receives CAD$17.5K in Equity Financing
Cat Raises $10.8 Millions in Series A Funding
Sun Raises EUR35M in Funding at a $1 Billion Valuation
Japan1337 Announces JPY 1.78 Billion Funding Round
From that data I only need to extract the amount of money a company receives (including $/€ etc., and the currency code if it's there, like CAD for Canadian dollars).
So, as a result, I expect to receive this:
€9M
CAD$17.5K
$10.8 Millions
EUR35M
JPY 1.78 Billion
The pattern that I use (throw rotten tomatoes at me):
try:
    pattern = '(\bAU|\bUSD|\bUS|\bCHF)*\s*[\$\€\£\¥\₣\₹\?]\s*\d*\.?\d*\s*(K|M)*[(B|M)illion]*'
    raises = re.search(pattern, text, re.IGNORECASE) # text – a row of data mentioned above
    raises = raises.group().upper().strip()
    print(raises)
except:
    raises = '???'
    print(raises)
Also, sometimes a pattern that works in an online Python regex editor does not work in the actual script.

Some issues in your regex:
The list of currency acronyms (AU USD US CHF) is too limited. It will not match JPY, nor many other acronyms. Maybe allow any word of 2-3 capitals.
Not a problem, but there is no need to escape the currency symbols with a backslash.
The \? in the currency list is not a currency symbol.
The regex always requires a currency symbol, so an amount that has only an acronym (EUR35M, JPY 1.78 Billion) cannot match. Maybe you intended to make the currency symbol optional with \?, but then the ? should appear unescaped after the character class, and it should still be possible to have only the symbol without the acronym.
\d*\.?\d* makes every part of the number optional, so it can match without any digits at all. Require at least one digit and make the decimal part optional as a unit.
(K|M)* will allow KKKKKKK. You don't want a * here.
[(B|M)illion]* will allow the letters BMilon, a literal pipe and literal parentheses to occur in any order and any number. Like it will match "in" and "non" and "(BooM)"
The previous two mentioned patterns are put in sequence, while they should be mutually exclusive.
The regex does not provide for matching the final "s" in "millions".
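The character-class pitfall described above is easy to check in an interpreter. This is a minimal sketch; the variable names and test strings are illustrative and not taken from the question.

```python
import re

# Inside [...] the characters ( | ) are literals, so this is only a set of characters.
loose = re.compile(r"[(B|M)illion]*", re.IGNORECASE)
loose_hits = [s for s in ("in", "non", "(BooM)") if loose.fullmatch(s)]

# A group with explicit alternatives accepts only the intended suffixes.
strict = re.compile(r"(?:K|[BM](?:illions?)?)")
candidates = ("K", "M", "Billion", "Millions", "non", "KK")
strict_hits = [s for s in candidates if strict.fullmatch(s)]

print(loose_hits)
print(strict_hits)
```

The loose class accepts all three nonsense strings, while the grouped alternation rejects "non" and "KK".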
Here is a correction:
(?:\b[A-Z]{2,3}\s*[$€£¥₣₹]?|[$€£¥₣₹])\s*\d+(?:\.\d+)?(?:\s*(?:K|[BM](?:illions?)?)\b)?
On regex101
In Python syntax:
pattern = r"(?:\b[A-Z]{2,3}\s*[$€£¥₣₹]?|[$€£¥₣₹])\s*\d+(?:\.\d+)?(?:\s*(?:K|[BM](?:illions?)?)\b)?"
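As a quick check, the corrected pattern can be run against the five sample rows from the question. This is a sketch only: the list name rows and the '???' fallback mirror the question's code, and other input may need a wider set of currency symbols.

```python
import re

pattern = re.compile(
    r"(?:\b[A-Z]{2,3}\s*[$€£¥₣₹]?|[$€£¥₣₹])\s*\d+(?:\.\d+)?"
    r"(?:\s*(?:K|[BM](?:illions?)?)\b)?"
)

rows = [
    "Company Fragile9 Closes €9M Series B Funding",
    "Appplle21 Receives CAD$17.5K in Equity Financing",
    "Cat Raises $10.8 Millions in Series A Funding",
    "Sun Raises EUR35M in Funding at a $1 Billion Valuation",
    "Japan1337 Announces JPY 1.78 Billion Funding Round",
]

amounts = []
for row in rows:
    m = pattern.search(row)  # search() returns the leftmost match, or None
    amounts.append(m.group() if m else "???")

for amount in amounts:
    print(amount)
```

Testing the result of search() against None avoids the bare try/except, which would also hide unrelated errors.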

Related

How to grab multiple paragraphs in the capture group? [duplicate]

I'm using this code: (?i)(?<!\d)Item.*?1A.*?Risk.*?Factors.*?\n*(.+?)\n*Item.*?1B to grab the following text:
ITEM 1A. RISK FACTORS
In addition to other information in this Form 10-K, the following risk factors should be carefully considered in evaluating us and our business because these factors currently have a significant impact or
In addition to other information in this Form 10-K, the following risk factors should be carefully considered in evaluating us and our business because these factors currently have a significant impact or
ITEM 1B.
But it would not grab anything in the capturing group, unless it's one paragraph like this:
ITEM 1A. RISK FACTORS
In addition to other information in this Form 10-K, the following risk factors should be carefully considered in evaluating us and our business because these factors currently have a significant impact or
ITEM 1B.
Your regex is matching any number of newlines, then any amount of text on one line, then any number of newlines. It's only looking for a single "paragraph" between newlines, since . does not match newline characters by default.
Try replacing it with something like [\s\S], which will capture everything - including newlines, paragraphs, text, space, anything you want. Of special note is that this will capture any number of paragraphs, with any amount of whitespace between them.
(?i)(?<!\d)Item.*?1A.*?Risk.*?Factors\n*([\s\S]*?)\n*Item.*?1B
(?i)(?<!\d)Item.*?1A.*?Risk.*?Factors Match up to the end of risk factors.
\n* Match as many newlines as needed 'till we hit the next paragraph.
([\s\S]*?) Capture anything, across any number of lines (lazy).
\n* Match as many newlines as needed 'till we hit the next paragraph.
Item.*?1B Match the rest of the content. (This doesn't match the . at the very end, did you mean for it to? If so, add \. to the end).
Try it here!
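To see the difference in Python, here is a small sketch with a shortened stand-in for the filing text (the two paragraph strings are made up). It also shows re.DOTALL, which is an alternative way to let . match newlines and is not part of the answer above.

```python
import re

text = (
    "ITEM 1A. RISK FACTORS\n"
    "First paragraph of risk factors.\n"
    "Second paragraph of risk factors.\n"
    "ITEM 1B."
)

# Pattern from the question: (.+?) cannot cross a newline, so two paragraphs fail.
original = re.compile(r"(?i)(?<!\d)Item.*?1A.*?Risk.*?Factors.*?\n*(.+?)\n*Item.*?1B")

# Pattern from the answer: [\s\S] matches any character, newlines included.
fixed = re.compile(r"(?i)(?<!\d)Item.*?1A.*?Risk.*?Factors\n*([\s\S]*?)\n*Item.*?1B")

# Alternative: keep the dot and pass re.DOTALL so that . also matches newlines.
dotall = re.compile(
    r"(?<!\d)Item.*?1A.*?Risk.*?Factors\n*(.*?)\n*Item.*?1B",
    re.IGNORECASE | re.DOTALL,
)

original_match = original.search(text)
body = fixed.search(text).group(1)
body_dotall = dotall.search(text).group(1)

print(original_match)
print(body)
```

Both working variants capture the two paragraphs without the surrounding newlines.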
Try
(?i)(?<!\d)Item.*?1A.*?Risk.*?Factors.*?\n*((.*\n*)+)\n*Item.*?1B
And for the sake of your future regex headaches, an incredible resource:
https://regex101.com
Cheers-

Why does Regex finditer only return the first result

My string is a transcript. I want to capture the speaker, specifically their surname (which should only match when fully capitalised).
Additionally, I want to match their speech until the next speaker begins. Eventually I want to loop this process over a huge text file.
The problem is that finditer only returns one match object, even though there are two different speakers. I have also tried an online regex tester with the Python flavor, but it returns very different results (not sure why).
import re
text = 'Senator BACK\n (Western Australia) (21:15): This evening I had the pleasure (...) Senator DAY\n (South Australia) (21:34): Well, what a week it h(...) '
pattern = re.compile(r"(:?(Senator|Mr|Dr)\s+([A-Z]{2,})\s*(\(.+?\))\s+(\(\d{2}:\d{2}\):)(.*))(?=Senator)")
for match in re.finditer(pattern, text):
    print(match)
I want 2 match objects, each with a group for the speaker's surname and a group for their speech. It's important to note that the online regex debuggers I have used give different results with the Python flavor than Python does in my terminal.
Just replace the regex with:
(:?(Senator|Mr|Dr)\s+([A-Z]{2,})\s*(\(.+?\))\s+(\(\d{2}:\d{2}\):)(.*))(?=Senator|$)
demo: https://regex101.com/r/gJDaWM/1/
With your current regex, you are enforcing the condition that each match must be followed by Senator through the positive lookahead.
You might actually have to change the positive lookahead into:
(?=Senator|Mr|Dr|$)
if you want to take into account Mr and Dr on top of Senator.
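The effect of the extra |$ alternative can be checked with the sample string from the question. This sketch uses named groups for readability, which is a choice made here rather than something the original regex does; the group names are hypothetical.

```python
import re

text = (
    'Senator BACK\n (Western Australia) (21:15): This evening I had the pleasure (...) '
    'Senator DAY\n (South Australia) (21:34): Well, what a week it h(...) '
)

# Shared part of both patterns, written with named groups (names are illustrative).
core = (
    r"(?P<title>Senator|Mr|Dr)\s+(?P<surname>[A-Z]{2,})\s*"
    r"(?P<state>\(.+?\))\s+(?P<time>\(\d{2}:\d{2}\)):"
    r"(?P<speech>.*)"
)

only_senator = re.compile(core + r"(?=Senator)")      # lookahead from the question
senator_or_end = re.compile(core + r"(?=Senator|$)")  # lookahead from the answer

count_before = len(list(only_senator.finditer(text)))
speakers = [
    (m.group("surname"), m.group("speech").strip())
    for m in senator_or_end.finditer(text)
]

print(count_before)
for surname, speech in speakers:
    print(surname, "->", speech)
```

The last speaker is never followed by "Senator", so only the second lookahead lets that final speech match.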

Match a specific string in Python except if it a given pattern is missing

This question might have already been asked, but I have not found any solution so far.
I want to match strings that contain inhibition(.+)toxicity but I do not want to match growth inhibition(.+)toxicity.
I tried (!?growth )inhibition(.+)toxicity but it returns the string I want to exclude. However, using (!?growth) returns everything except the strings containing growth.
I do not understand what I am doing wrong with this regex.
EDIT: add example
string I want to match: Inhibition of recombinant human TNF-alpha-induced cytotoxicity of mouse L929 cells
string I do not want to match: Evaluated for the inhibitory concentration required to cause growth inhibition of A427Mer- cell line of lung using the MTT Cytotoxicity Assay
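There is a syntax error: (!?growth ) is an ordinary group that matches an optional ! followed by "growth ", not a negative assertion. A negative lookbehind is written (?<!...), so the correct regex is: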
(?<!growth )inhibition(.+)toxicity
Take a look at Regex Tutorial - Lookahead and Lookbehind Zero-Length Assertions.
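A short sketch of the lookbehind on the two example strings from the question. It assumes case-insensitive matching (re.IGNORECASE), because the wanted string starts with a capital "Inhibition"; the question does not say which flags were used.

```python
import re

# Negative lookbehind: "inhibition" must not be directly preceded by "growth ".
pattern = re.compile(r"(?<!growth )inhibition(.+)toxicity", re.IGNORECASE)

wanted = (
    "Inhibition of recombinant human TNF-alpha-induced cytotoxicity "
    "of mouse L929 cells"
)
unwanted = (
    "Evaluated for the inhibitory concentration required to cause growth "
    "inhibition of A427Mer- cell line of lung using the MTT Cytotoxicity Assay"
)

wanted_match = pattern.search(wanted)
unwanted_match = pattern.search(unwanted)

print(bool(wanted_match))
print(bool(unwanted_match))
```

Python requires a lookbehind to have a fixed width, which the literal "growth " satisfies.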

Python Regex code that captures information between 2 characters ('=' and 'I')

(Yes, I know there are relevant regex questions that ask how to capture information between two characters. I tried, they didn't work for me. I also read the regex tutorials as deep as possible.)
I have this code that uses BeautifulSoup to scrape some information from a website in this form: Exchange rate: 1 USD = 60.50 INR
This string is stored in a variable called 'data'. I have to capture '60.50' from this string. I have this code for that:
data = _funct()
rate = re.search("?<=\=)(.*?)(?=\I" , data)
print rate
It doesn't work. Where am I going wrong?
You can use a simple regex like this:
(\w+\.\w+)
Working demo
As you can see the idea behind the regex is:
( ... ) Use parentheses to capture the content
\w+\.\w+ any alphanumeric followed by a dot plus more alphanumeric.
If you only want to capture digits you could use:
\d+\.\d+
If you take a look at the Code Generator for python you can get the code which is:
import re
p = re.compile(r'(\w+\.\w+)')
test_str = "Exchange rate: 1 USD = 60.50 INR"
match = re.search(p, test_str)
print(match.group(1))  # the captured rate
I believe your regex isn't working because you are missing an open parenthesis at the beginning and a close parenthesis at the end. Also, the backslash \ before I is not necessary: \I is not a recognized escape sequence, and while Python 2 tolerated it, Python 3.6 and later reject unknown letter escapes in a pattern with an error. So you could do the following:
(?<=\=)(.*?)(?=I)
Please see Regex 101 Demo here.
I think, however, as others have mentioned, there are better ways of going about this, namely to look for digits and a decimal point preceded by spaces. There is a difficulty in what was suggested, however, namely that the exchange rate could be missing a leading digit (it could lead with a decimal point), or the decimal point may not be present at all. With that in mind, I would suggest the following:
(?<=\=)(?:\s*)(\d+(?:\.\d*)?|\.\d+)
See Regex Demo here.
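A minimal sketch of that last pattern in Python. The first sample is the string from the question; the other two are made-up inputs that cover the cases mentioned above (no leading digit, no decimal point).

```python
import re

# "=" followed by optional whitespace, then digits with an optional fraction,
# or a fraction with no leading digit.
rate_pattern = re.compile(r"(?<=\=)(?:\s*)(\d+(?:\.\d*)?|\.\d+)")

samples = [
    "Exchange rate: 1 USD = 60.50 INR",
    "Exchange rate: 1 USD = .5 INR",   # hypothetical: no leading digit
    "Exchange rate: 1 USD = 60 INR",   # hypothetical: no decimal point
]

rates = [rate_pattern.search(s).group(1) for s in samples]
print(rates)
```

group(1) returns the number as a string; pass it to float() if a numeric value is needed.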

finding the occurrence of strings in python [closed]

I have a long string which I have parsed through beautifulsoup and I need advice on the best way to extract data from this soup object.
The number I want is contained inside the soup object, inside () after this text.
View All (8)
What is the most efficient way to locate this, and get the number out of it.
In VBA I would have done this.
(1) Find where this text string starts. For example, if the soup is 1000 characters long, the text might start at position 200.
Then I would loop until I found the ending ), grab that text, store it in a variable, and process each character removing everything which is not a number.
So if I have > View All (8) I would end up with 8. The number inside is not known in advance; it could be 100, 110, or 2000.
I have just started learning python, don't yet know how to use regular expression but that seems the way to go?
Sample String
">View All (90)</a>
Expected Result - hopeful
90
Sample String
">View All (8)</a>
Expected Result - hopeful
8
Seeing how my comment provoked some more questions, let me expand it a bit. First, welcome to the wonderful world of regular expressions. Regular expressions can be quite a headache, but mastering them is a very useful skill. A very clear tutorial was written by A.M. Kuchling, one of Python's old hackers from the early days. If memory serves me he wrote the re library, with (as an additional bonus) an undocumented implementation of lex in some 15 odd lines of python. But I digress. You can find the tutorial here. https://docs.python.org/2/howto/regex.html
Let me go over the expression bit by bit:
import re
m = re.compile(r'View All \((\d*?)\)').search(soupstring)  # soupstring holds the page text
print(m.group(1))
The r in front of the quotation marks makes it a raw string in Python. Python preprocesses normal string literals, so a backslash is interpreted as the start of an escape sequence. E.g. a '\t' in a string will be replaced by the tab character. Try print('a\tb') to see what I mean. To include a '\' in a string you have to escape it like this: '\\'. This can be a problem, as the backslash is also the escape character for the regular expression engine. If you have to match patterns that contain backslashes, you will soon be writing patterns like this: '\\\\'. Which can be fun . . . If you like 50 shades of grey, give it a try.
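The difference between normal and raw string literals can be seen directly. This is a small sketch; the sample strings are made up for illustration.

```python
import re

# In a normal literal Python turns \t into one tab character; a raw literal keeps both characters.
normal_tab = '\t'
raw_tab = r'\t'
print(len(normal_tab), len(raw_tab))

# Without the r prefix, \b becomes a backspace character before the regex engine sees it,
# so the word-boundary escape is lost.
no_raw = re.search('\bcat', 'a cat')
with_raw = re.search(r'\bcat', 'a cat')

# Matching one literal backslash: four backslashes without r, two with it.
path = 'C:\\temp'  # the string C:\temp
plain_hit = re.search('\\\\', path)
raw_hit = re.search(r'\\', path)

print(no_raw, with_raw.group(), plain_hit.group() == raw_hit.group())
```

This is also a common reason why a pattern that works in an online tester fails in a script: the tester receives the pattern text as typed, while a normal Python literal may rewrite it first.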
Inside the regular expression language, '(' characters are special. They are used to group parts of the match together. Since you are only interested in the digits between the parentheses, I used a group to extract this data. Other special characters are '{', '[', '*', '?', '\' and their matching counterparts. I am sure I have forgotten a few, but you can look them up.
With that information, the '\(' will make more sense. Since I have escaped the '(' it tells the regular expression parser to ignore the special meaning of '(' and instead match it against a literal '(' character.
The sequence '\d' is again special. An escaped '\d' means, do not interpret this as a literal 'd', but interpret it as "any digit character".
The '*' means take the last pattern and match it zero or more times.
The '*?' variant means: use non-greedy (lazy) matching. It returns the shortest possible match instead of the longest possible match. In the context of regular expressions greed is usually good. As Sebastian has noted, the '?' is not needed here. However, if you ever need to find html elements or quoted strings, then you can use '<.*?>' or '".*?"'.
Please note that '.' is again special. It means match "any character (except the newline (well most of the time anyway))".
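A short sketch of the two points above, using the sample strings from the question and a made-up HTML fragment. It uses \d+ rather than \d*? for the count, which is a small change from the expression discussed here.

```python
import re

# Extract the number between the parentheses after "View All".
samples = ['">View All (90)</a>', '">View All (8)</a>']
counts = [re.search(r'View All \((\d+)\)', s).group(1) for s in samples]
print(counts)

# Greedy .* grabs as much as it can; lazy .*? stops at the first possible end.
html = '<b>bold</b>'
greedy = re.findall(r'<.*>', html)
lazy = re.findall(r'<.*?>', html)
print(greedy)
print(lazy)
```

The greedy version returns the whole fragment as one match, while the lazy version returns each tag separately.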
Have fun . . .
