regex count occurrences - python

I am looking for a way to count the occurrences found in the string based on my regex. I used findall() and it returns a list, but the len() of the list is only 1. Shouldn't the len() of the list be 2?
import re
string1 = r'Total $200.00 Total $900.00'
regex = r'(.*Total.*|.*Invoice.*|.*Amount.*)?(\s+?\$\s?[1-9]{1,10}.*(?:[.,]\d{3})*(?:[.,]\d{2})?)'
patt = re.findall(regex,string1)
print(patt)
print(len(patt))
Result:
> [('Total $200.00 Total', ' $900.00')]
> 1
Not sure if my regex is causing it to miscount. I am looking to get the Total from a file, but it appears in many variations.
Examples:
Total $900.00
Invoice Amt $500.00
Total 800.00
etc.
I am looking to count this because there could be multiple invoice details in one file.

First off, because that's a common misconception:
There is no need to match "all text up to the match" or "all the text after a match". You can drop those .* in your regex. Start with what you actually want to match.
import re
string1 = 'Total $200.00 Total $900.00'
amount_pattern = r'(?:Total|Amt|Invoice Amt|Others)[:\s]*\$([\d\.,]*\d)'
amount_expr = re.compile(amount_pattern, re.IGNORECASE)
amount_expr.findall(string1)
# -> ['200.00', '900.00']
\$([\d\.,]*\d) is a half-way reasonable approximation of prices ("things that start with a $ and then contain a bunch of digits and possibly dots and commas"). The final \d makes sure we are not accidentally matching sentence punctuation. It might be good enough, but you know what data you are working with. Feel free to come up with a more specific sub-expression. Include an optional leading - if you expect to see negative amounts.

Try:
>>> re.findall(r'(\w*\s+\$\d+\.\d+)', string1)
['Total $200.00', 'Total $900.00']
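The issue you are having is that your regex has two capture groups, so re.findall returns one tuple per match, with one item per group. Your pattern matched the string only once (the leading .* swallowed the first amount), so the list holds a single tuple with two items inside and has a length of 1.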
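To make the counting behaviour concrete, here is a minimal sketch. The pattern is an assumption based only on the examples in the question (Total, Invoice Amt, Amount, with an optional $), not a complete invoice parser. It uses only non-capturing groups, so findall returns one full string per match and len() gives the count.

```python
import re

text = 'Total $200.00 Total $900.00'

# Assumed pattern: a label, an optional "$", then an amount.
# All groups are non-capturing, so findall returns whole matches.
pattern = re.compile(
    r'(?:Total|Invoice(?:\s+Amt)?|Amount)\s*\$?\s*\d[\d,]*(?:\.\d{2})?',
    re.IGNORECASE,
)

matches = pattern.findall(text)
count = len(matches)
print(matches)  # ['Total $200.00', 'Total $900.00']
print(count)    # 2
```

If the pattern needs capture groups for other reasons, counting with sum(1 for _ in pattern.finditer(text)) gives the same number regardless of how many groups there are.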

Related

Python extract information after phrase or group of words

I am trying to extract information from PDF.
Simple search worked:
filecontent = ReadDoc.getContent("c:\\temp\\pdf_1.pdf")
match = re.search(r'Document ID: (\d+)', filecontent)
if match:
    docid = match.group(1)
But I am stuck when I need to search for a longer phrase.
For example, I want to extract '$999,999.00', which may appear in the document like "Total Cumulative Payment (USD) $999,999.00" or "Total cumulative payment $55587323.23". Note that the text differs between the two, so I need some kind of fuzzy search that finds the sentence and then extracts the $ amount from it.
Similarly I also need to extract some date, number, amount, money in between phrases/words.
Appreciate your help!
I think this should do what you want:
import re
textlist = ["some other amount as $32,4545.34 and Total Cumulative Payment (USD) $999,999.00 and such","Total cumulative payment $55587323.23"]
matchlist = []
for text in textlist:
    # Collect every "$" followed by digits, commas and periods.
    match = re.findall(r"(\$[.\d,]+)", text)
    if match:
        matchlist.extend(match)
print(matchlist)
results:
['$32,4545.34', '$999,999.00', '$55587323.23']
The regex looks for a $ and grabs the digits, commas and periods that follow it, stopping at the first other character (such as a space). Depending on what other kind of data you are parsing it may need to be tweaked; I am assuming you only want to capture periods, commas and numbers.
update:
it will now find any number of occurrences and put them all in a list
Well something like this can be done with regular expressions:
import re
source = 'total cumulative payment $2000.00; some other amount $1234.56. Total Cumulative Payment (USD) $5,600,000.06'
matches = re.findall( r'total\s+cumulative\s+payment[^$0-9]+\$([0-9,.]+)', source, re.IGNORECASE )
amounts = [ float( x.replace( ',', '' ).rstrip('.') ) for x in matches ]
This will match the two specific examples you've given. But you haven't given much of an idea of how loose the matching criteria should be, or what the rules are. The solution above will miss amounts if the source document has a spelling mistake in the word "cumulative". Or if the amount appears without the dollar sign. It also allows any amount of intervening text between "total cumulative payment" and the dollar amount (so you'll get a false positive from source = "This document contains information about total cumulative payment values, (...3 more pages of introductory material...) and by the way you owe me $20.") Now, these things can be tweaked and improved - but only if you know what is going to be important and what is not, and tighten the specification of the question accordingly.
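One of the tweaks mentioned above can be sketched as follows: cap the amount of text allowed between the phrase and the dollar sign. The limit of 15 characters (MAX_GAP) and the sample text are assumptions chosen for illustration; the right value depends on the real documents.

```python
import re

source = ('This document is about total cumulative payment values, '
          'and by the way you owe me $20. '
          'Total Cumulative Payment (USD) $5,600,000.06')

MAX_GAP = 15  # assumed limit on characters between the phrase and the "$"

pattern = re.compile(
    r'total\s+cumulative\s+payment[^$0-9]{0,%d}\$([0-9][0-9,]*(?:\.[0-9]+)?)' % MAX_GAP,
    re.IGNORECASE,
)

# Strip thousands separators before converting to float.
amounts = [float(m.replace(',', '')) for m in pattern.findall(source)]
print(amounts)  # [5600000.06]
```

The "$20" is skipped because it sits more than 15 characters after the phrase, while "(USD)" still fits inside the allowed gap.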

Searching for non uniform time mentions in a string

I'm having trouble with my python script
import re
text = 'asd;lkas;ldkasld12:00 AMalskjdadlakjasdasdas1:24 PMasldkjaskldjaslkdjd'
banana = re.findall ('\d\d:\d{2} \wM', text)
print (banana)
I'm trying to search for any mentions of time, but I can't find the strings if they are single digit in the text.
You are searching for exactly 2 numbers with \d\d. You need to change it to:
'\d{1,2}:\d{2} \wM'
This will look for 1 or 2 numbers. Also, I suppose that you want to match AM or PM with \wM in that case you could use:
'\d{1,2}:\d{2} [AP]M'
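As a quick check, the corrected pattern can be run against the string from the question. This is a minimal sketch; the names time_pattern and times are only illustrative.

```python
import re

text = 'asd;lkas;ldkasld12:00 AMalskjdadlakjasdasdas1:24 PMasldkjaskldjaslkdjd'

# One or two hour digits, a colon, two minute digits, a space, then AM or PM.
time_pattern = re.compile(r'\d{1,2}:\d{2} [AP]M')

times = time_pattern.findall(text)
print(times)  # ['12:00 AM', '1:24 PM']
```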
date = re.findall(r"\d{1,2}:\d{2} [AP]M", text)
The {1,2} gives a lower and upper limit to the number of digits it should expect.
The [AP]M tells it to find either AM or PM specifically, reducing the risk of false positives. Note that inside a character class | is a literal character, so [A|P] would also accept "|M".
If you want some more information on what regex can do here is the documentation that helped me learn:
https://docs.python.org/2/library/re.html
I think this is what you are looking for:
banana = re.findall ('\d?\d:\d{2} \wM', text)

Python: Replace all substring occurrences with regular expressions

I would like to replace all substring occurrences with regular expressions. The original sentences would be like:
mystring = "Carl's house is big. He is asking 1M for that(the house)."
Now let's suppose I have two substrings I would like to bold. I bold the words by adding ** at the beginning and at the end of the substring. The 2 substrings are:
substring1 = "house", so bolded it would be "**house**"
substring2 = "the house", so bolded it would be "**the house**"
At the end I want the original sentence like this:
mystring = "Carl's **house** is big. He is asking 1M for that(**the house**)."
The main problem is that as I have several substrings to replace, they can overlap words like the example above. If I analyze the longest substring at first, I am getting this:
Carl's **house** is big. He is asking 1M for that(**the **house****).
On the other hand, if I analyze the shortest substring first, I am getting this:
Carl's **house** is big. He is asking 1M for that(the **house**).
It seems I need to replace from the longest substring to the shortest, but I wonder how to make sure that text already bolded in the first replacement is not matched again in the second. Also remember that a substring can appear several times in the string.
Note: Suppose the string ** will never occur in the original string, so we can use it to bold our words
You can search for all of the strings at once, so that the fact that one is a substring of another doesn't matter:
re.sub(r"(house|the house)", r"**\1**", mystring)
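The single-pass idea generalizes to any list of substrings. The sketch below is one way to do it, with bold_all as a hypothetical helper name: it escapes each substring, sorts the longest first so that a longer alternative wins when two start at the same position, and wraps whatever matched in **.

```python
import re

def bold_all(text, substrings):
    # Longest first: regex alternation tries alternatives left to right.
    ordered = sorted(substrings, key=len, reverse=True)
    # re.escape keeps characters such as "(" or "." literal.
    pattern = re.compile("|".join(re.escape(s) for s in ordered))
    # Each match is replaced once, so bolded text is never matched again.
    return pattern.sub(lambda m: f"**{m.group(0)}**", text)

mystring = "Carl's house is big. He is asking 1M for that(the house)."
result = bold_all(mystring, ["house", "the house"])
print(result)
# Carl's **house** is big. He is asking 1M for that(**the house**).
```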
You could use a group that is not captured and is not required. In the regex pattern (?P<repl>(?:the )?house), the (?:the )? part says that there might be a "the " in the string and, if it is present, it should be included in the match. This way, you let the re library optimize the way it matches. Here is the complete example:
>>> data = "Carl's house is big. He is asking 1M for that(the house)."
>>> re.sub(r'(?P<repl>(?:the )?house)', r'**\g<repl>**', data)
"Carl's **house** is big. He is asking 1M for that(**the house**)."
Note: \g<repl> is used to get all the string matched by the group <repl>
You could do two passes:
First: Go through from longest to shortest and replace with something like:
'the house': 'AA_THE_HOUSE'
'house': 'BB_HOUSE'
Second: Go through replace like:
'AA_THE_HOUSE': '**the house**'
'BB_HOUSE': '**house**'
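The two passes can be automated along these lines. This is a sketch under one loud assumption: the text and the substrings never contain Unicode private-use characters, which are used here as single-character placeholders so that no substring can match inside a placeholder. The name bold_two_pass is hypothetical.

```python
def bold_two_pass(text, substrings):
    # Longest first, so "the house" is protected before "house" is handled.
    ordered = sorted(substrings, key=len, reverse=True)
    # ASSUMPTION: private-use characters do not occur in the input.
    tokens = [chr(0xE000 + i) for i in range(len(ordered))]

    # Pass 1: swap every substring for its placeholder.
    for s, tok in zip(ordered, tokens):
        text = text.replace(s, tok)
    # Pass 2: swap every placeholder for the bolded original.
    for s, tok in zip(ordered, tokens):
        text = text.replace(tok, f"**{s}**")
    return text

mystring = "Carl's house is big. He is asking 1M for that(the house)."
result = bold_two_pass(mystring, ["house", "the house"])
print(result)
# Carl's **house** is big. He is asking 1M for that(**the house**).
```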
Replace the strings with some unique placeholder values, then replace the placeholders with the original strings enclosed in ** to make them bold. The placeholders must not contain any of the substrings themselves.
For example:
'the house' with 'temp_1'
'house' with 'temp_2'
then 'temp_2' with '**house**'
'temp_1' with '**the house**'
Should work fine. You can automate this by using two lists.

Finding the recurring pattern

Let's say I have a number with a recurring pattern, i.e. there exists a string of digits that repeat themselves in order to make the number in question. For example, such a number might be 1234123412341234, created by repeating the digits 1234.
What I would like to do, is find the pattern that repeats itself to create the number. Therefore, given 1234123412341234, I would like to compute 1234 (and maybe 4, to indicate that 1234 is repeated 4 times to create 1234123412341234)
I know that I could do this:
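def findPattern(num):
    num = str(num)
    # Start at 1: an empty prefix would cause a division by zero.
    for i in range(1, len(num)):
        patt = num[:i]
        # Skip prefixes whose length does not divide the total length.
        if len(num) % len(patt):
            continue
        if patt * (len(num) // len(patt)) == num:
            return patt, len(num) // len(patt)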
However, this seems a little too hacky. I figured I could use itertools.cycle to compare two cycles for equality, which doesn't really pan out:
In [25]: c1 = itertools.cycle(list(range(4)))
In [26]: c2 = itertools.cycle(list(range(4)))
In [27]: c1==c2
Out[27]: False
Is there a better way to compute this? (I'd be open to a regex, but I have no idea how to apply it there, which is why I didn't include it in my attempts)
EDIT:
I don't necessarily know that the number has a repeating pattern, so I have to return None if there isn't one.
Right now, I'm only concerned with detecting numbers/strings that are made up entirely of a repeating pattern. However, later on, I'll likely also be interested in finding patterns that start after a few characters:
magic_function(78961234123412341234)
would return 1234 as the pattern, 4 as the number of times it is repeated, and 4 as the first index in the input where the pattern first presents itself
(.+?)\1+
Try this and grab the first capture group.
import re
p = re.compile(r'(.+?)\1+')
test_str = "1234123412341234"
re.findall(p, test_str)
# -> ['1234']
Add anchors and flag Multiline if you want the regex to fail on 12341234123123, which should return None.
^(.+?)\1+$
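Wrapped into functions, the anchored pattern could look like the sketch below. The names find_repeat and find_repeat_suffix are hypothetical. The first uses re.fullmatch, which anchors both ends, and returns None when the whole input is not a repetition. The second is one possible reading of the later requirement in the question: it anchors only the end, so it reports any repeated tail (even a short one such as a trailing "11") together with the index where it starts.

```python
import re

def find_repeat(value):
    """Return (unit, count) if value is made entirely of a repeated unit."""
    s = str(value)
    m = re.fullmatch(r'(.+?)\1+', s)
    if m is None:
        return None
    unit = m.group(1)
    return unit, len(s) // len(unit)

def find_repeat_suffix(value):
    """Return (unit, count, start) for a repeated unit that runs to the end."""
    s = str(value)
    m = re.search(r'(.+?)\1+$', s)
    if m is None:
        return None
    unit = m.group(1)
    return unit, len(m.group(0)) // len(unit), m.start()

print(find_repeat(1234123412341234))             # ('1234', 4)
print(find_repeat(12341234123123))               # None
print(find_repeat_suffix(78961234123412341234))  # ('1234', 4, 4)
```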
One way to find a recurring pattern and number of times repeated is to use this pattern:
(.+?)(?=\1+$|$)
with the g (global) option, which corresponds to re.findall or re.finditer in Python.
It will return the repeated pattern and number of matches (times repeated)
Non-repeated patterns (fails) will return only "1" match
Repeated patterns will return 2 or more matches (number of times repeated).

Regex to match 'lol' to 'lolllll' and 'omg' to 'omggg', etc

Hey there, I love regular expressions, but I'm just not good at them at all.
I have a list of some 400 shortened words such as lol, omg, lmao...etc. Whenever someone types one of these shortened words, it is replaced with its English counterpart ([laughter], or something to that effect). Anyway, people are annoying and type these short-hand words with the last letter(s) repeated x number of times.
examples:
omg -> omgggg, lol -> lollll, haha -> hahahaha, lol -> lololol
I was wondering if anyone could hand me the regex (in Python, preferably) to deal with this?
Thanks all.
(It's a Twitter-related project for topic identification if anyone's curious. If someone tweets "Let's go shoot some hoops", how do you know the tweet is about basketball, etc)
FIRST APPROACH -
Well, using regular expression(s) you could do like so -
import re
re.sub('g+', 'g', 'omgggg')
re.sub('l+', 'l', 'lollll')
etc.
Let me point out that using regular expressions is a very fragile & basic approach to dealing with this problem. You could so easily get strings from users which will break the above regular expressions. What I am trying to say is that this approach requires lot of maintenance in terms of observing the patterns of mistakes the users make & then creating case specific regular expressions for them.
SECOND APPROACH -
Instead have you considered using difflib module? It's a module with helpers for computing deltas between objects. Of particular importance here for you is SequenceMatcher. To paraphrase from official documentation-
"SequenceMatcher is a flexible class for comparing pairs of sequences of any type, so long as the sequence elements are hashable. SequenceMatcher tries to compute a "human-friendly diff" between two sequences. The fundamental notion is the longest contiguous & junk-free matching subsequence."
import difflib as dl
# Treat spaces as junk and compare in both directions.
x = dl.SequenceMatcher(lambda x: x == ' ', "omg", "omgggg")
y = dl.SequenceMatcher(lambda x: x == ' ', "omgggg", "omg")
avg = (x.ratio() + y.ratio()) / 2.0
if avg >= 0.6:
    print('Match!')
else:
    print('Sorry!')
According to the documentation, any ratio() over 0.6 is a close match. You might need to tweak the threshold for your data. If you need stricter matching, I found that any value over 0.8 serves well.
How about
\b(?=lol)\S*(\S+)(?<=\blol)\1*\b
(replace lol with omg, haha etc.)
This will match lol, lololol, lollll, lollollol etc. but fail lolo, lollllo, lolly and so on.
The rules:
Match the word lol completely.
Then allow any repetition of one or more characters at the end of the word (i. e. l, ol or lol)
So \b(?=zomg)\S*(\S+)(?<=\bzomg)\1*\b will match zomg, zomggg, zomgmgmg, zomgomgomg etc.
In Python, with comments:
result = re.sub(
    r"""(?ix)\b    # assert position at a word boundary
    (?=lol)        # assert that "lol" can be matched here
    \S*            # match any number of characters except whitespace
    (\S+)          # match at least one character (to be repeated later)
    (?<=\blol)     # until we have reached exactly the position after the 1st "lol"
    \1*            # then repeat the preceding character(s) any number of times
    \b             # and ensure that we end up at another word boundary""",
    "lol", subject)
This will also match the "unadorned" version (i. e. lol without any repetition). If you don't want this, use \1+ instead of \1*.
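Since the question mentions a list of some 400 words, the pattern can be built per word instead of being written out by hand. The sketch below is one way to do that; elongation_pattern and normalize are hypothetical helper names, and it assumes each word starts and ends with a word character so that \b behaves as in the lol example.

```python
import re

def elongation_pattern(word):
    w = re.escape(word)
    # Same structure as the lol pattern above, with the word substituted in.
    return re.compile(rf"\b(?={w})\S*(\S+)(?<=\b{w})\1*\b", re.IGNORECASE)

def normalize(text, word):
    # Collapse every elongated form of `word` back to the plain word.
    return elongation_pattern(word).sub(word, text)

print(normalize("lol lollll lololol lolly", "lol"))
# lol lol lol lolly
print(normalize("omgggg that was zomg", "omg"))
# omg that was zomg
```

For a long word list it would probably be worth compiling each pattern once and reusing it, rather than rebuilding it on every call as this sketch does.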
