Catastrophic backtracking error with any single character or number? - python

First of all, I know the title is not as objective as it should be. I don't understand why the error below occurs with the Python "flavor" on the regex101 website.
Just to explain what I'm trying to do, I have to match any number after "item", followed by everything until "consumo estimado".
Regex:
^item\s*(\d{0,})(.*?)consumo
Example text:
ITEM 1 – AGULHA DE PUNÇÃO
Agulha de punção 18 ga x 70 mm
Consumo Estimado Anual: 284
Ampla Participação
ITEM 2 - CATETER ANGIOGRAFICO PIGTAIL
Cateter angiográfico diagnóstico pigtail 5f x 100 cm
Consumo Estimado Anual: 210
Ampla Participação
ITEM 3 – Próteses Vasculares Dracon Reta 80 Cm
PROTESES VASCULARES ANELADA - Enxerto vascular reto constituído
em politetrafluoretileno (PTFE) extrudado e expandido construído com
suporte externo anelado que aumentam a resistência mecânica.
Tamanho
aproximado 8mm (diâmetro) x 70 -80 cm (comprimento)
Consumo Estimado Anual: 34
Ampla Participação
But once I add a space after the word "consumo", anything I type after it results in "catastrophic backtracking".
Example Regex with error:
^item\s*(\d{0,})(.*?)consumo e
^item\s*(\d{0,})(.*?)consumo 1
The solution was to use .*? to capture everything between "consumo" and "estimado", which worked properly.
^item\s*(\d{0,})(.*?)consumo.*?estimado
Why is this error occurring? I couldn't find any explanation for it.
I already have the solution for the problem, but I just wanna know why the error happened.
https://regex101.com/r/uqm7ra/1
Edit 1:
As suggested, I have added the link to the current saved regex with the problem.
Edit 2:
As suggested, I also have tried to follow the "meta" when asking for anything here in Stack Overflow. Thanks for the advice!
I hope the question is better now.

\d{0,} looks iffy: since it can also match zero digits, the regex engine will retry with fewer and fewer digits, which can be catastrophic. Anchor it with (\D.*?)?consumo to prevent that.
Also, if you want a number, you mean {1,} (or the more idiomatic and brief +; similarly, {0,} is customarily written *).
^item\s*(\d+)(\D.*?)?consumo
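A minimal sketch of the suggested pattern, extended with " estimado" and run against a shortened copy of the sample text. The flags are assumptions: the question does not say which ones were set on regex101, but a lowercase pattern matching "ITEM" implies case-insensitive matching, `^` has to match at the start of each item line, and `.` has to cross line breaks.

```python
import re

# Shortened copy of the sample text from the question.
text = """ITEM 1 – AGULHA DE PUNÇÃO
Agulha de punção 18 ga x 70 mm
Consumo Estimado Anual: 284
Ampla Participação
ITEM 2 - CATETER ANGIOGRAFICO PIGTAIL
Cateter angiográfico diagnóstico pigtail 5f x 100 cm
Consumo Estimado Anual: 210
Ampla Participação
"""

# Assumed flags (not stated in the question): IGNORECASE, MULTILINE, DOTALL.
pattern = re.compile(
    r"^item\s*(\d+)(\D.*?)?consumo estimado",
    re.IGNORECASE | re.MULTILINE | re.DOTALL,
)

# Group 1 is the item number; group 2 is everything up to "Consumo Estimado".
item_numbers = [m.group(1) for m in pattern.finditer(text)]
print(item_numbers)
```

Because \d+ now needs at least one digit and the lazy part must begin with a non-digit, there is only one way to split the text between the two groups.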


Using RegEx in Python to extract contents

Good evening,
I am very new to Python and RegEx. I have the following sentence:
-75.76 Card INSURANCEGrabPay ASIA DIRECT to Paid AM 1:16 +100.00 3257 UpAmex Top PM 9:55 +300.00 3257 UpAmex Top PM 9:55 -400.00 Card LTDGrabPay PTE AXS to Paid PM 9:57 (SGD) Amount Details Time here. appear will transactions cashless your All 2022 Feb 15 on made transactions GrabPay points 52 earned points Rewards 475.76 SGD spent Amount 0.24 SGD balance Wallet 2022 Feb 15 Summary statement daily your here
I would like to search for just '-' and the amount after that.
After that, I would like to skip 2 words and extract ALL the words just before 'Paid' into a single group. (I will read more about groups, but for now I need a single group, which I can later split to get the individual words.)
For instance, I would get
-75.76 ASIA Direct to
-400 PTE AXS to
What would be the regex command? Also, is there a good regex tutorial I can read?
For now I have created one match with 2 groups: group 1 for the amount and group 2 for all the words (including the "to " string).
Regex:
(-\d+\.?\d+) \w+ \w+ ([\w ]+)?Paid
You can check the details here: https://regex101.com/r/eUMgdW/1
Python code:
import re
# your_input_string is the statement text from the question
output = re.findall(r"(-\d+\.?\d+) \w+ \w+ ([\w ]+)?Paid", your_input_string)
for found in output:
    print(found)
# ('-75.76', 'ASIA DIRECT to ')
# ('-400.00', 'PTE AXS to ')
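Since the question mentions splitting the second group into words later, here is a small sketch of that step. It reuses the pattern above on a shortened copy of the statement; the variable names are only illustrative.

```python
import re

# Shortened copy of the statement from the question.
statement = (
    "-75.76 Card INSURANCEGrabPay ASIA DIRECT to Paid AM 1:16 "
    "+100.00 3257 UpAmex Top PM 9:55 "
    "-400.00 Card LTDGrabPay PTE AXS to Paid PM 9:57"
)

pattern = re.compile(r"(-\d+\.?\d+) \w+ \w+ ([\w ]+)?Paid")

# findall returns (amount, words) tuples; str.split() drops the trailing space.
rows = [(amount, words.split()) for amount, words in pattern.findall(statement)]
print(rows)
```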
Rather than give you the actual regex, I'll gently nudge you in the right direction. It's more satisfying that way.
"Words" here are separated by spaces. So what you're searching for is a group of characters (captured), a space, characters again, a space, characters, a space, then capture everything and end with "Paid". Try to create a regex to do that.
If you'd like to brush up on regex, check out Regex101. It's a web tool to test out regex, along with a debugger and a cheat sheet.

find words in a text [closed]

I have a problem with searching for words in a text.
In my code I look for words within an Italian text (it is divided into strings, one per paragraph), but when I search for words like "e", "in", "ad", it reports many matches that are really inside other words like "begin" or "adduce", or any word that contains an "e". Is there an efficient way to avoid this "mistake"? I have searched everywhere but can't find anything. I think it's a simple problem, but I'm not an expert at all; thanks to anyone who can help. I would like to do it without importing any libraries.
sample text:
['sostanza di cieli ed astri cercai per oceani. di donarmi il diluvio ti dissi io, o musa, scorgendo il destino.', " o zeus che infiniti addurre volle, principiando con stormi arditi fulmini di ira molto funesta laddove si alzasse eccessivamente il volare negato all'uomo.", 'imperterrita irrefrenabile poiché poiché memore di ciò, da qualunque principio, memore di di di ciò di ciò, da qualunque principio, ad ogni costo, dea figlia di zeus, narrane cagione e spirito. ']
I have to find these words (not all of them are necessarily in the text; for example, 'e' is missing):
uomo,
dissi io,
o musa,
molto,
eccessivamente,
e,
in,
di ciò
expected output: uomo, dissi io, o musa, molto, eccessivamente, di ciò
You likely want something more advanced which understands the grammar of the language you're trying to parse, but this may work for you
split each paragraph up into individual words
check each word for closeness to your word (ie Levenshtein distance or another metric)
Perhaps
import difflib
def iter_test_words(source_paragraph, words_to_check):
    for word_test in source_paragraph.split():  # split by whitespace
        yield difflib.get_close_matches(word_test, words_to_check, n=1, cutoff=0.9)
Some further help
you could try/except and find the first index in the returned list [0] to find anomalous words (IndexError)
you likely need to tune your cutoff as-needed (or even dynamically; ie re-try for anomalies) to get good results
again, using and configuring a library for your needs will probably give better results .. ideally something which
understands the grammar
understands subtle (for computers) word variations (i.e. for your case, are the Italian forms of "to go" andando and andato the same word, while ondato "wave" is another concept despite being a better textual match?)
>>> import difflib
>>> difflib.get_close_matches("andato", ["andando", "ondato"])
['ondato', 'andando']
>>> difflib.SequenceMatcher(None, "andato", "andando").ratio()
0.7692307692307693
>>> difflib.SequenceMatcher(None, "andato", "ondato").ratio()
0.8333333333333334
You can use a regular expression for this purpose. The special sequence \b matches word boundaries. For example, searching for the pattern \bin\b will search for the beginning of a word, followed by "in", followed by the end of a word.
Here is the code:
>>> import re
>>> len(re.findall(r'\bin\b', 'begin in begin end'))
1
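The same idea can be stretched to the whole word list from the question. This is only a sketch: it uses the first two paragraphs of the sample text and a shortened word list, and the function name is made up for illustration. re.escape keeps any special characters in a phrase from being read as regex syntax.

```python
import re

# First two paragraphs of the sample text from the question.
paragraphs = [
    "sostanza di cieli ed astri cercai per oceani. di donarmi il diluvio "
    "ti dissi io, o musa, scorgendo il destino.",
    " o zeus che infiniti addurre volle, principiando con stormi arditi "
    "fulmini di ira molto funesta laddove si alzasse eccessivamente il "
    "volare negato all'uomo.",
]
wanted = ["uomo", "dissi io", "o musa", "molto", "eccessivamente", "e", "in"]

def phrases_present(paragraphs, wanted):
    """Return the phrases that occur as whole words in any paragraph."""
    found = []
    for phrase in wanted:
        pattern = re.compile(r"\b" + re.escape(phrase) + r"\b")
        if any(pattern.search(p) for p in paragraphs):
            found.append(phrase)
    return found

print(phrases_present(paragraphs, wanted))
```

Here "e" and "in" are not reported, because in these two paragraphs they only occur inside longer words such as "ed" and "infiniti".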

Removing varying text phrases through RegEx in a Python Data frame

Basically, I want to remove the certain phrase patterns embedded in my text data:
Starts with an upper case letter and ends with an Em Dash "—"
Starts with an Em Dash "—" and ends with a "Read Next"
Say, I've got the following data:
CEBU CITY—The widow of slain human rights lawyer .... citing figures from the NUPL that showed that 34 lawyers had been killed in the past two years. —WITH REPORTS FROM JULIE M. AURELIO AND DJ YAPRead Next
and
Manila, Philippines—President .... but justice will eventually push its way through their walls of impunity, ... —REPORTS FROM MELVIN GASCON, JULIE M. AURELIO, DELFIN T. MALLARI JR., JEROME ANING, JOVIC YEE, GABRIEL PABICO LALU, PATHRICIA ANN V. ROXAS, DJ YAP, AFP, APRead Next
I want to remove the following phrases:
"CEBU CITY—"
"—WITH REPORTS FROM JULIE M. AURELIO AND DJ YAPRead Next"
"Manila, Philippines—"
"—REPORTS FROM MELVIN GASCON, JULIE M. AURELIO, DELFIN T. MALLARI JR., JEROME ANING, JOVIC YEE, GABRIEL PABICO LALU, PATHRICIA ANN V. ROXAS, DJ YAP, AFP, APRead Next"
I am assuming this would be needing two regex for each patterns enumerated above.
The regex: —[A-Z].*Read Next\s*$ may work on the pattern # 2 but only when there are no other em dashes in the text data. It will not work when pattern # 1 occurs as it will remove the chunk from the first em dash it has seen until the "Read Next" string.
I have tried the following regex for pattern # 1:
^[A-Z]([A-Za-z]).+(—)$
But it does not work, and I don't see why. That regex was supposed to look for a phrase that starts with any upper case letter, followed by any length of string as long as it ends with an "—".
What you are treating as a hyphen - is not actually a hyphen; it is an Em Dash. Hence you need a regex with an em dash instead of a hyphen at the start:
^—[A-Z].*Read Next\s*$
Here is the explanation for this regex,
^ --> Start of input
— --> Matches a literal Em Dash whose Unicode Decimal Code is 8212
[A-Z] --> Matches an upper case letter
.* --> Matches any character zero or more times
Read Next --> Matches these literal words
\s* --> This is for matching any optional white space that might be present at the end of line
$ --> End of input
The regex that should take care of this -
^—[A-Z]+(.)*(Read Next)$
You can try implementing this regex on your data and see if it works out.
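For reference, a hedged sketch that handles both patterns from the question with two substitutions. It is not either answer's regex: the trailing pattern drops the ^ anchor because the credit line sits at the end of the text, and the leading pattern has no $ because the dateline's em dash is followed by the article body. It assumes the dateline contains only letters, spaces, commas and periods, and that the credit line contains no further em dash.

```python
import re

# Leading dateline: an upper case letter, then letters/spaces/commas/periods, then an em dash.
LEADING = re.compile(r"^[A-Z][A-Za-z .,]*—")
# Trailing credit: an em dash, no further em dash, then "Read Next" at the very end.
TRAILING = re.compile(r"—[^—]*Read Next\s*$")

def strip_wrappers(text):
    """Remove the dateline prefix and the credit suffix, if present."""
    text = LEADING.sub("", text, count=1)
    text = TRAILING.sub("", text)
    return text.strip()

sample = (
    "CEBU CITY—The widow of slain human rights lawyer spoke. "
    "—WITH REPORTS FROM JULIE M. AURELIO AND DJ YAPRead Next"
)
print(strip_wrappers(sample))
```

In a pandas data frame the same two patterns could presumably be applied with Series.str.replace(..., regex=True) on whichever column holds the text.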

Select string which contains punctuation

So I'm trying to remove the titles from a set of professors' names.
Like Dr.Eng, Dr.rer.nat, M.S., Dr., S.Si so on and so forth. Basically any string that contains more than one dot.
This is an example list after I have split the name and the title based on ","
2 [CHOTIMAH, Dr., M.S., RINTO ANUGRAHA NQZ, S...
3 [HARSOJO, S.U., M.Sc., Dr., SUDARMAJI, S.S...
4 [IKHSAN SETIAWAN, S.Si., M.Si., ARI SETIAWAN...
5 [EKO SULISTYA, Dr., M.Si., YOSEF ROBERTUS UT...
6 [SUNARTA, Drs., M.S., WAGINI R., Drs., M.S.]
7 [BAMBANG MURDAKA EKA JATI, Drs., M.S., KAMSU...
8 [AHMAD KUSUMA ATMAJA, S.Si., M.Sc., Dr.Eng....
9 [MOH. ALI JOKO WASONO, M.S., Dr.]
I have tried r'\S*[^\w\s]\S' but it returned
CHOTIMAH, INTO ANUGRAHA NQZ, .
HARSOJO, UDARMAJI, i.
IKHSAN SETIAWAN, RI SETIAWAN, ng.
EKO SULISTYA, OSEF ROBERTUS UTOMO, Dr.
SUNARTA, AGINI .
BAMBANG MURDAKA EKA JATI, AMSUL ABRAHA, Prof.
AHMAD KUSUMA ATMAJA, ITRAYANA, Dr.
MOH. ALI JOKO WASONO, Dr.
Some professors' names are shortened to XXX. Ex: MOHAMMAD TO MOH. And I don't want that to get removed.
Any help is appreciated!
\w{0,}\.(\w{0,}\.)? This regex will grab a word of any length followed by a period, and will optionally look for another word of any length followed by a period. This captures Dr., M.S. etc. I'm pretty sure that's what you're asking for; if not, let me know.
In the future you can use regexr.com to easily test regex matches. Also you've tagged this post with Python and Pandas but those aren't really relevant tags. Please either include more code to make tags relevant or avoid using irrelevant tags
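Because the names have already been split on ",", one alternative is to filter whole list entries rather than match inside the string. The heuristic below is an assumption, not something stated in the thread: an entry counts as a title when it has no spaces and is made of dot-separated letter groups (Dr., M.S., Dr.Eng), so "MOH. ALI JOKO WASONO" and "WAGINI R." survive because they contain a space.

```python
import re

# Assumed heuristic: a title is one or more letter groups each ending in a dot,
# optionally followed by a final letter group, with no spaces anywhere.
TITLE = re.compile(r"(?:[A-Za-z]+\.)+[A-Za-z]*")

def drop_titles(parts):
    """Keep only the entries that do not look like academic titles."""
    cleaned = (p.strip() for p in parts)
    return [p for p in cleaned if p and not TITLE.fullmatch(p)]

row = "SUNARTA, Drs., M.S., WAGINI R., Drs., M.S.".split(",")
print(drop_titles(row))
```

A single-word name written with a trailing period would be dropped by this rule, so it is worth checking against the real data.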

Python regex for UK number

Below are the UK phone numbers I need to fetch from a text file:
07791523634
07910221698
But it only prints 0779152363, 0791022169, skipping the 11th character.
It also produces unnecessary values like ('')
Ex : '', '07800 854536'
Below is the regex I've used:
phnsrch = re.compile(r'\d{3}[-\.\s]??\d{3}[-\.\s]??\d{4}|\(\d{3}\)\s*\d{3}[-\.\s]??\d{5}|\d{3}[-\.\s]??\d{4}[-\.\s]??\d{4}|\d{5}[-\.\s]??\d{3}[-\.\s]??\d{3}|/^(?:(?:\(?(?:0(?:0|11)\)?[\s-]?\(?|\+)44\)?[\s-]?(?:\(?0\)?[\s-]?)?)|(?:\(?0))(?:(?:\d{5}\)?[\s-]?\d{4,5})|(?:\d{4}\)?[\s-]?(?:\d{5}|\d{3}[\s-]?\d{3}))|(?:\d{3}\)?[\s-]?\d{3}[\s-]?\d{3,4})|(?:\d{2}\)?[\s-]?\d{4}[\s-]?\d{4}))(?:[\s-]?(?:x|ext\.?|\#)\d{3,4})?$/|')
I need help fetching the complete 11-digit numbers without any unnecessary symbols.
Finally figured out the solution for matching the UK numbers below:
07540858798
0113 2644489
02074 735 217
07512 850433
01942 896007
01915222200
01582 492734
07548 021 475
020 8563 7296
07791523634
re.compile(r'\d{3}[-\.\s]??\d{4}[-\.\s]??\d{4}|\d{5}[-\.\s]??\d{3}[-\.\s]??\d{3}|(?:\d{4}\)?[\s-]?\d{3}[\s-]?\d{4})')
Thanks to those who helped me with this issue.
I think your regex is too long and can be simpler. Try this regex instead:
^(07\d{8,12}|447\d{7,11})$
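As a possible explanation of the original symptoms: alternatives are tried left to right, and the first alternative in the question's pattern is only ten digits long, so an 11-digit number matches it first and loses its last digit. The trailing | adds an empty alternative, which likely accounts for the '' results. The sketch below takes a different route and is only an assumption about what counts as a valid number: national format, a leading 0 plus ten more digits, with at most one space or hyphen between digits.

```python
import re

# Assumption: 11 digits starting with 0, optional single space/hyphen between digits.
UK_NUMBER = re.compile(r"\b0(?:[ -]?\d){10}\b")

def find_uk_numbers(text):
    """Return the matched numbers with spaces and hyphens removed."""
    return [re.sub(r"[ -]", "", m.group()) for m in UK_NUMBER.finditer(text)]

sample = "07791523634\n0113 2644489\n02074 735 217\n020 8563 7296"
print(find_uk_numbers(sample))
```

Numbers written with a +44 prefix are not covered by this sketch.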
