Avoid Regex catastrophic backtracking in "dateparser" Python package

I'm trying to optimize the dateparser (https://pypi.org/project/dateparser/) logic. I've copied the source, and it contains a regex like this:
(
(
(?P<digits>\d+)
|
## Grab any digits
(?P<digits_modifier>{digits_modifier})
{time}
|
(?P<days>{days})
|
(?P<months>{months})
|
## Delimiters, ie Tuesday[,] July 18 or 6[/]17[/]2008
## as well as whitespace
(?P<delimiters>{delimiters})
|
## These tokens could be in phrases that dateutil does not yet recognize
## Some are US Centric
(?P<extra_tokens>{extra_tokens})
){{3,}}
)
I don't want to copy and paste the substituted {days}, {months}, ... fragments here because they are too verbose,
but here are the sources:
the code: https://www.codepile.net/pile/a8KVbjGb
the resulting regex: https://www.codepile.net/pile/NnDJ4rzm
The Regex successfully searches for dates in text like
Beginning on February 1, 1998, and\ncontinuing until July 18, 2002,
But sometimes the regex is too slow. Especially when dealing with tables in plain text like this:
710 5,208 1,577 2,274
3,302 15,638 5,603 -
32,584 166,848
I guess there is a catastrophic backtracking issue (it doesn't happen if I remove the "(?P<digits>\d+)" part of the regex, but then I ruin my parsing logic), but I don't know how to avoid it here.
What I have tried:
1) simplifying and tuning some simple sub-patterns:
\d+ -> \d\d* or \d+ -> \d{1,8}
ACDT|ACST|ACT|ACWDT|ACWST -> AC(DT|ST|T|WDT|WST)
All I got was a marginal performance improvement, from 44s to 42.5s.
2) I've tried to break the source text by separators like ' ', '|', '['
re.split(r'\|. |; |\s\s', text)
and then processing the resulting text chunks. But then the unit tests failed for text like
" Beginning on February 1, 1998"

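A minimal sketch of one idea that might be worth testing here, not verified against the full dateparser pattern: the repeated group can consume a run of digits as one `digits` token or as several shorter ones, and that ambiguity is a typical source of heavy backtracking. Adding a negative lookahead (`\d+(?!\d)`) forces a digit run to be taken whole. The `MONTHS` and `DELIMITERS` fragments below are hypothetical, heavily simplified stand-ins for the real substituted fragments.

```python
import re

# Hypothetical, heavily simplified stand-ins for the {months} and {delimiters} fragments.
MONTHS = "January|February|March|April|May|June|July|August|September|October|November|December"
DELIMITERS = r"[/\-,\s]"

# \d+(?!\d) only matches a complete run of digits, so "166848" cannot be
# re-split into several shorter digit tokens when the engine backtracks.
DATE_CHUNK = re.compile(
    rf"""
    (?:
        (?P<digits>\d+(?!\d))
        |
        (?P<months>{MONTHS})
        |
        (?P<delimiters>{DELIMITERS})
    ){{3,}}
    """,
    re.VERBOSE,
)

# With plain \d+ a six-digit run can be consumed as two tokens; with the lookahead it cannot.
splittable = re.fullmatch(r"(?:\d+){2}", "166848") is not None
unsplittable = re.fullmatch(r"(?:\d+(?!\d)){2}", "166848") is None

text = "Beginning on February 1, 1998, and\ncontinuing until July 18, 2002,"
found = [m.group(0).strip(" ,\n") for m in DATE_CHUNK.finditer(text)]
print(splittable, unsplittable, found)
```

On Python 3.11 and later, a possessive quantifier (`\d++`) should express the same intent more directly. Whether either change removes the slowdown on the table-like input would still have to be measured with the full pattern.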
Related

Using RegEx in Python to extract contents

Good evening,
I am very new to Python and RegEx. I have the following sentence:
-75.76 Card INSURANCEGrabPay ASIA DIRECT to Paid AM 1:16 +100.00 3257 UpAmex Top PM 9:55 +300.00 3257 UpAmex Top PM 9:55 -400.00 Card LTDGrabPay PTE AXS to Paid PM 9:57 (SGD) Amount Details Time here. appear will transactions cashless your All 2022 Feb 15 on made transactions GrabPay points 52 earned points Rewards 475.76 SGD spent Amount 0.24 SGD balance Wallet 2022 Feb 15 Summary statement daily your here
I would like to search for just '-' and the amount after that.
After that, I would like to skip 2 words and extract all the words just before 'Paid' into a single group (I will read more about groups, but for now I need a single group that I can later split to get the individual words).
For instance, I would get
-75.76 ASIA Direct to
-400 PTE AXS to
What would be the regex command? Also, is there a good regex tutorial I can read up on?

For now I have created one match with 2 groups, i.e. group 1 for the amount and group 2 for all the words (which also includes the "to " string).
Regex:
(-\d+\.?\d+) \w+ \w+ ([\w ]+)?Paid
You can check the details here: https://regex101.com/r/eUMgdW/1
Python code:
import re
# Raw string so that \d and \w reach the regex engine unchanged
output = re.findall(r"(-\d+\.?\d+) \w+ \w+ ([\w ]+)?Paid", your_input_string)
for found in output:
    print(found)
#('-75.76', 'ASIA DIRECT to ')
#('-400.00', 'PTE AXS to ')
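As a follow-up sketch, the second group can be split into individual words with `str.split()`, which also drops the trailing space. The `statement` string is a shortened copy of the sentence from the question, and the variable names are only illustrative.

```python
import re

# Shortened copy of the sentence from the question.
statement = (
    "-75.76 Card INSURANCEGrabPay ASIA DIRECT to Paid AM 1:16 "
    "+100.00 3257 UpAmex Top PM 9:55 "
    "-400.00 Card LTDGrabPay PTE AXS to Paid PM 9:57"
)

pattern = re.compile(r"(-\d+\.?\d+) \w+ \w+ ([\w ]+)?Paid")

# findall returns (amount, words) tuples; split the words group on whitespace.
rows = [(amount, words.split()) for amount, words in pattern.findall(statement)]
for amount, words in rows:
    print(amount, words)
```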

Rather than give you the actual regex, I'll gently nudge you in the right direction. It's more satisfying that way.
"Words" here are separated by spaces. So what you're searching for is a group of characters (captured), a space, characters again, space, characters, space, then capture everything and end with "Paid". Try to create a regex to do that.
If you'd like to brush up on regex, check out Regex101. It's a web tool to test out regex, along with a debugger and a cheat sheet.

Extract date from a string with a lot of numbers

There seem to be quite a few ways to extract datetimes in various formats from a string. But there seems to be an issue when the string contains many numbers and symbols.
Here is an example:
t = 'Annual Transmission Revenue Requirements and Rates Transmission Owner (Transmission Zone) Annual Transmission Revenue Requirement Network Integration Transmission Service Rate ($/MW-Year) AE (AECO) $136,632,319 $53,775 AEP (AEP) $1,295,660,732 $59,818.14 AP (APS) $128,000,000 $17,895 ATSI (ATSI) $659,094,666 $54,689.39 BC (BGE) $230,595,535 $35,762 ComEd, Rochelle (CE) $702,431,433 $34,515.60 Dayton (DAY) $40,100,000 $13,295.76 Duke (DEOK) $121,250,903 $24,077 Duquesne (DLCO) $139,341,808 $51,954.44 Dominion (DOM) $1,031,382,000 $52,457.21 DPL, ODEC (DPL) $163,224,128 $42,812 East Kentucky Power Cooperative (EKPC) $83,267,903 $24,441 MAIT (METED, PENELEC) $150,858,703 $26,069.39 JCPL $135,000,000 $23,597.27 PE (PECO) $155,439,100 $19,093 PPL, AECoop, UGI (PPL) $435,349,329 $58,865 PEPCO, SMECO (PEPCO) $190,876,083 $31,304.21 PS (PSEG) $1,248,819,352 $130,535.22 Rockland (RECO) $17,724,263 $44,799 TrAILCo $226,652,117.80 n/a Effective June 1, 2018 '
import datefinder
m = datefinder.find_dates(t)
for match in m:
    print(match)
Is there a way to smoothly extract the date? I can resort to re for specific formats if no better way exists. From datefinder's GitHub repository, it seems that it was abandoned a year ago.

Although I don't know exactly how your dates are formatted, here's a regex solution that will work with dates separated by '/'. It should work with dates where the months and days are expressed as a single number or include a leading zero.
If your dates are separated by hyphens instead, replace the 9th and 18th character of the regex with a hyphen instead of /. (If using the second print statement, replace the 12th and 31st character)
Edit: Added the second print statement with some better regex. That's probably the better way to go.
import re
mystring = r'joasidj9238nlsd93901/01/2021oijweo8939n'
print(re.findall(r'\d{1,2}\/\d{1,2}\/\d{2,4}', mystring)) # This would probably work in most cases
print(re.findall(r'[0-1]{0,2}\/[0-3]{0,1}\d{0,1}\/\d{2,4}', mystring)) # This one is probably a better solution. (More protection against weirdness.)
Edit #2: Here's a way to do it with the month name spelled out (in full, or 3-character abbreviation), followed by day, followed by comma, followed by a 2 or 4 digit year.
import re
mystring = r'Jan 1, 2020'
print(re.findall(r'(?:Jan(?:uary)?|Feb(?:ruary)?|Mar(?:ch)?|Apr(?:il)?|May|Jun(?:e)?|Jul(?:y)?|Aug(?:ust)?|Sep(?:tember)?|Oct(?:ober)?|Nov(?:ember)?|Dec(?:ember)?)\s+\d{1,2}\,\s+\d{2,4}',mystring))
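A small sketch of how the month-name approach could be applied to text like the string in the question and turned into `date` objects. It assumes four-digit years and English month names, and the helper name `find_month_name_dates` is made up for illustration. The `sample` string is only the tail of the question's text.

```python
import re
from datetime import datetime

MONTH = (
    r"(?:Jan(?:uary)?|Feb(?:ruary)?|Mar(?:ch)?|Apr(?:il)?|May|Jun(?:e)?|Jul(?:y)?"
    r"|Aug(?:ust)?|Sep(?:tember)?|Oct(?:ober)?|Nov(?:ember)?|Dec(?:ember)?)"
)
# Month name, day, comma, four-digit year (an assumption; the answer also allows 2 digits).
DATE_RE = re.compile(rf"\b({MONTH})\s+(\d{{1,2}}),\s+(\d{{4}})\b")

def find_month_name_dates(text):
    """Return a list of datetime.date objects for 'Month D, YYYY' style dates."""
    dates = []
    for month, day, year in DATE_RE.findall(text):
        # %b parses the three-letter abbreviation (assumes the default English/C locale).
        parsed = datetime.strptime(f"{month[:3]} {day} {year}", "%b %d %Y")
        dates.append(parsed.date())
    return dates

sample = "Rockland (RECO) $17,724,263 $44,799 TrAILCo $226,652,117.80 n/a Effective June 1, 2018 "
print(find_month_name_dates(sample))
```

Because the pattern requires a month name, the dollar amounts in the text are never considered as date candidates.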

Parse Output for Python

My software outputs these two types of output:
-rwx------ Administrators/Domain Users 456220672 0% 2018-04-16 16:04:40 E:\\_WiE10-18.0.100-77.iso
-rwxrwx--- Administrators/unknown 6677 0% 2018-04-17 01:33:23 E:\\program files\\cluster groups\\sql server (mssqlserver)\\logs\\progress-MOD-1523883344023-3001-Windows.log
I would like to get the file names from both outputs:
E:\\_WiE10-18.0.100-77.iso, for the first one
E:\\program files\\cluster groups\\sql server (mssqlserver)\\logs\\progress-MOD-1523883344023-3001-Windows.log, for the second one
If I use something like the code below, it won't work if the second parameter has spaces in it. It works if there aren't any spaces in the Domain Username.
for item in outputs:
    outputs.extend(item.split())
    for item2 in [' '.join(outputs[6:])]:
        new_list.append(item2)
How can I get all the parameters individually, including the filenames?

If regex is an option:
text = """-rwx------ Administrators/Domain Users 456220672 0% 2018-04-16 16:04:40 E:\\_WiE10-18.0.100-77.iso
-rwxrwx--- Administrators/unknown 6677 0% 2018-04-17 01:33:23 E:\\program files\\cluster groups\\sql server (mssqlserver)\\logs\\progress-MOD-1523883344023-3001-Windows.log"""
import re
for h in re.findall(r"^.*?\d\d:\d\d:\d\d (.*)", text, flags=re.MULTILINE):
    print(h)
Output:
E:\_WiE10-18.0.100-77.iso
E:\program files\cluster groups\sql server (mssqlserver)\logs\progress-MOD-1523883344023-3001-Windows.log
Pattern explained:
The pattern r"^.*?\d\d:\d\d:\d\d (.*)" looks for the line start '^', then as few characters as possible '.*?', then the timestamp '\d\d:\d\d:\d\d' followed by a space, and captures everything after it up to the end of the line into a group.
The re.MULTILINE flag makes '^' match at the start of every line, not only at the start of the string.
Edit:
Capturing the individual things needs some more capturing groups:
import re
# Groups: (1) permission flags, (2) group name, (3) datetime, (4) filename
for h in re.findall(r"^([rwexXst-]+) ([^0-9]+) +\d+.+? +(\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2}) (.*)", text, flags=re.MULTILINE):
    for k in h:
        print(k)
    print("")
Output:
-rwx------
Administrators/Domain Users
2018-04-16 16:04:40
E:\_WiE10-18.0.100-77.iso
-rwxrwx---
Administrators/unknown
2018-04-17 01:33:23
E:\program files\cluster groups\sql server (mssqlserver)\logs\progress-MOD-1523883344023-3001-Windows.log
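If every field is needed, including the two numeric columns, named groups may be easier to work with than positional ones. This is only a sketch: the names `owner`, `size` and `percent` are guesses at what the columns mean, and `parse_line` is a made-up helper.

```python
import re

# Field names "owner", "size" and "percent" are guesses at what the columns mean.
LINE_RE = re.compile(
    r"^(?P<flags>\S+) +(?P<owner>.+?) +(?P<size>\d+) +(?P<percent>\d+%) +"
    r"(?P<timestamp>\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2}) +(?P<path>.*)$"
)

def parse_line(line):
    """Return a dict of the fields in one output line, or None if it does not match."""
    match = LINE_RE.match(line)
    return match.groupdict() if match else None

lines = [
    r"-rwx------ Administrators/Domain Users 456220672 0% 2018-04-16 16:04:40 E:\_WiE10-18.0.100-77.iso",
    r"-rwxrwx--- Administrators/unknown 6677 0% 2018-04-17 01:33:23 E:\program files\cluster groups\sql server (mssqlserver)\logs\progress-MOD-1523883344023-3001-Windows.log",
]
for line in lines:
    print(parse_line(line))
```

The lazy `.+?` for the owner field stops at the first point where a number, a percentage and a timestamp follow, so spaces inside "Domain Users" are kept.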

You could use a regular expression like
\b[A-Z]:\\\\.+

Aside from using regex, you can try something similar to this.
output = '-rwx------ ... 2018-04-16 16:04:40 E:\\\\_WiE10-18.0.100-77.iso'
drive_letter_start = output.find(':\\\\')
filename = output[drive_letter_start - 1:]
It looks for the first occurrence of ':\\' and gets the drive letter before the substring (i.e. ':\\') and the full file path after the substring.
EDIT
Patrick Artner's answer is better and completely answers OP's question compared to this answer. This only encompasses capturing the file path. I am leaving this answer here should anyone find it useful.

How to ignore white space inbetween words but not other characters?

I want to rename a long list of file names to make them more searchable. The names were auto-generated, so there are some odd spacing issues. I wrote a little Python script that does most of what I want, but I don't want to remove white spaces between words. For instance, I have two names:
0 130 — HG — 1500 — 12" (Page 1 of 2)
01 30 — HD LOW POINT DRAIN
They should read :
0130-HG-1500-12"
0130-HD LOW POINT DRAIN
My code so far :
import os
import re
for filename in os.listdir("."):
    if not filename.endswith(".py"):
        os.replace(filename, re.sub("[(].*?[)]", "",  # Remove anything between ()
            "".join(filename.split()                  # Remove any whitespaces
            ).replace("—", "-")))                     # Replace Em dash with hyphen
Everything is working except that I can't figure out how to keep the white spaces between words.

If by "words" you mean "strings made up of letters" then
re.sub('((?<=[^a-zA-Z]) | (?=[^a-zA-Z]))', '', filename)
will do the trick. In plain language, that would be "replace every space that is either after or before a non-letter character with nothing". Output:
In [24]: re.sub('((?<=[^A-Z]) | (?=[^A-Z]))', '', '01 30 — HD LOW POINT DRAIN')
Out[24]: '0130—HD LOW POINT DRAIN'
In [25]: re.sub('((?<=[^A-Z]) | (?=[^A-Z]))', '', '0 130 — HG — 1500 — 12"')
Out[25]: '0130—HG—1500—12"'
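Putting the pieces together, here is a sketch of a helper that produces the names from the question. It removes the parenthesised part and converts the em dash before applying the regex. `clean_name` is a hypothetical name, and the actual `os.replace` call is left out so nothing gets renamed by accident.

```python
import re

def clean_name(name):
    """Normalise one file name: drop (...) parts, use hyphens, keep spaces between words."""
    name = re.sub(r"\(.*?\)", "", name)       # remove anything between ()
    name = name.replace("—", "-").strip()      # em dash -> hyphen, trim both ends
    # Remove a space when the character before or after it is not a letter.
    return re.sub(r"(?<=[^a-zA-Z]) | (?=[^a-zA-Z])", "", name)

for original in ['0 130 — HG — 1500 — 12" (Page 1 of 2)', "01 30 — HD LOW POINT DRAIN"]:
    print(repr(clean_name(original)))
```

Stripping the ends first matters because a trailing space that follows a letter is not matched by either lookaround.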

Search strings with python

I would like to know how I can search for specific strings with Python. I opened a Markdown file which contains a table like the one below:
| --------- | -------- | --------- |
|**propped**| - | -a flashlight in one hand and a large leather-bound book (A History of Magic by Bathilda Bagshot) propped open against the pillow. |
|**Pointless**| - | -“Witch Burning in the Fourteenth Century Was Completely Pointless — discuss.”|
|**unscrewed**| - | -Slowly and very carefully he unscrewed the ink bottle, dipped his quill into it, and began to write,|
|**downtrodden**| - | -For years, Aunt Petunia and Uncle Vernon had hoped that if they kept Harry as downtrodden as possible, they would be able to squash the magic out of him.|
|**sheets,**| - | -As long as he didn’t leave spots of ink on the sheets, the Dursleys need never know that he was studying magic by night.|
|**flinch**| - | -But he hoped she’d be back soon — she was the only living creature in this house who didn’t flinch at the sight of him.|
And I have to get the string that is wrapped in |** **| on each line, like:
propped
Pointless
unscrewed
downtrodden
sheets
flinch
I tried to use the regular expression but failed to extract it.

import re
y = r'(?<=\|\*{2}).+?(?=,{0,1}\*{2}\|)'
reg = re.compile(y)
a = '| --------- | -------- | --------- | |**propped**| - | -a flashlight in one hand and a large leather-bound book (A History of Magic by Bathilda Bagshot) propped open against the pillow. | |**Pointless**| - | -“Witch Burning in the Fourteenth Century Was Completely Pointless — discuss.”|'
reg.findall(a)
Regex(y) above explained:
(?<=\|\*{2}) - Matches if the current position in the string is preceded by a match for \|\*{2} i.e. |**
.+? - Matches any character (except a newline) 1 or more times. Adding ? after the quantifier makes the match non-greedy (minimal): as few characters as possible will be matched.
(?=,{0,1}\*{2}\|) - A lookahead: it matches if the current position is followed by a match for ,{0,1}\*{2}\| (zero or one comma, then two asterisks, then a closing |), without consuming those characters.

Try using the following regex :
(?<=\|)(?!\s).*?(?!\s)(?=\|)

If the asterisks are in the text you are searching and you do not want the comma after sheets, the pattern would be a pipe followed by two asterisks, then anything that is not an asterisk or a comma:
\|\*{2}([^*,]+)
If you can live with the comma, or if there might be commas you want to catch:
\|\*{2}([^*]+)
Use either pattern with re.findall or re.finditer to capture the text you want.
If using the second pattern, you would need to run through the groups and strip any unwanted commas.
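For illustration, the first pattern can be run over the whole table in one call when it is anchored to line starts with `re.MULTILINE`. The `table` string below is a shortened copy of the rows from the question.

```python
import re

# Shortened copy of the table rows from the question.
table = """| --------- | -------- | --------- |
|**propped**| - | -a flashlight in one hand |
|**sheets,**| - | -As long as he didn't leave spots of ink on the sheets |
|**flinch**| - | -she was the only living creature who didn't flinch |"""

# ^ with re.MULTILINE anchors at each line start; the group stops at the first * or comma.
words = re.findall(r"^\|\*{2}([^*,]+)", table, flags=re.MULTILINE)
print(words)
```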

I wrote the program below to produce the required output. I created a file named string_test containing all the raw strings:
import re
a = re.compile(r"^\|\*\*([^*,]+)")
with open("string_test", "r") as file1:
    for i in file1.readlines():
        match = a.search(i)
        if match:
            print(match.group(1))
