Searching for non uniform time mentions in a string - python

I'm having trouble with my Python script:
import re
text = 'asd;lkas;ldkasld12:00 AMalskjdadlakjasdasdas1:24 PMasldkjaskldjaslkdjd'
banana = re.findall ('\d\d:\d{2} \wM', text)
print (banana)
I'm trying to search for any mentions of a time, but the pattern misses times whose hour is a single digit.

You are searching for exactly 2 digits with \d\d. You need to change it to:
'\d{1,2}:\d{2} \wM'
This will look for 1 or 2 digits. Also, I suppose that you want to match AM or PM with \wM; in that case you could use:
'\d{1,2}:\d{2} [AP]M'
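As a quick check, here is a minimal sketch that runs the second pattern against the text from the question. The variable name times is only for illustration, and the raw-string prefix is my addition so the backslashes are passed to the regex engine unchanged.

```python
import re

text = 'asd;lkas;ldkasld12:00 AMalskjdadlakjasdasdas1:24 PMasldkjaskldjaslkdjd'

# \d{1,2} accepts a one- or two-digit hour; [AP]M accepts only AM or PM.
times = re.findall(r'\d{1,2}:\d{2} [AP]M', text)
print(times)  # ['12:00 AM', '1:24 PM']
```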

date = re.findall(r"\d{1,2}:\d{2} [AP]M", text)
The {1,2} sets a lower and upper limit on the number of digits to expect.
The [AP]M tells it to find either AM or PM, which reduces the risk of false positives. (Inside a character class, | is a literal character, so [A|P] would also match a pipe.)
If you want more information on what regex can do, here is the documentation that helped me learn:
https://docs.python.org/2/library/re.html

I think this is what you are looking for:
banana = re.findall ('\d?\d:\d{2} \wM', text)

Related

Matching a String in Python using regex

I have a string say like this:
ARAN22 SKY BYT and TRO_PAN
In the above string, the first letter can be A, S, T, or N, and the two characters after RAN can be any two digits. The rest is always the same, and the string always ends with _PAN.
So the few possibilities of the string are :
SRAN22 SK BYT and TRO_PAN
TRAN25 SK BYT and TRO_PAN
NRAN25 SK BYT and TRO_PAN
So I was trying to extract the string every time in python using regex as follows:
import re
pattern = "([ASTN])RAN" + "\w+\s+" +"_PAN"
pat_check = re.compile(pattern, flags=re.IGNORECASE)
sample_test_string = 'NRAN28 SK BYT and TRO_PAN'
re.match(pat_check, sample_test_string)
Here the string can be any of the examples I gave above.
But it's not working: I am not getting back the string (the sample test string) as I should. Not sure what I am doing wrong. Any help will be very much appreciated.
You are using \w+\s+, which will match one or more word (0-9A-Za-z_) characters, followed by one or more space characters. So it will match the two digits and space after RAN but then nothing more. Since the next characters are not _PAN, the match will fail. You need to use [\w\s]+ instead:
pattern = "([ASTN])RAN" + "[\w\s]+" +"_PAN"
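A small sketch to check the [\w\s]+ version against the sample strings from the question. The names samples and matched are made up for this example. Note that _ counts as a word character, so [\w\s]+ first swallows the trailing _PAN and then backtracks to let the literal _PAN match.

```python
import re

# Same pattern as above, written as one raw string.
pat_check = re.compile(r"([ASTN])RAN[\w\s]+_PAN", flags=re.IGNORECASE)

samples = [
    'ARAN22 SKY BYT and TRO_PAN',
    'SRAN22 SK BYT and TRO_PAN',
    'TRAN25 SK BYT and TRO_PAN',
    'NRAN28 SK BYT and TRO_PAN',
]

# group(0) is the whole matched text; group(1) is the leading letter.
matched = [m.group(0) for m in map(pat_check.match, samples) if m]
print(matched == samples)  # True
```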

Match Regex permutations without repeating but with a twist

It seems that I can't find a solution for this perhaps an easy problem: I want to be able to match with a simple regex all possible permutations of 5 specified digits, without repeating, where all digits must be used. So, for this sequence:
12345
the valid permutation is:
54321
but
55555
is not valid.
However, if the provided digits have the same number once or more, only in that case the accepted permutations will have those repeated digits, but each digit must be used only once. For example, if the provided number is:
55432
we see that 5 is provided 2 times, so it must be also present two times in each permutation, and some of the accepted answers would be:
32545
45523
but this is wrong:
55523
(not all original digits are used and 5 is repeated more than twice)
I came very close to solving this using:
(?:([43210])(?!.*\1)){5}
but unfortunately it doesn't work when the provided digits contain duplicates (like 43211).
One way to solve this is to make a character class out of the search digits and build a regex to search for as many digits in that class as are in the search string. Then you can filter the regex results based on the sorted match string being the same as the sorted search string. For example:
import re
def find_perms(search, text):
    search = sorted(search)
    regex = re.compile(rf'\b[{"".join(search)}]{{{len(search)}}}\b')
    matches = [m for m in regex.findall(text) if sorted(m) == search]
    return matches
print(find_perms('54321', '12345 54321 55432'))
print(find_perms('23455', '12345 54321 55432'))
print(find_perms('24455', '12345 54321 55432'))
Output:
['12345', '54321']
['55432']
[]
Note I've included word boundaries (\b) in the regex so that (for example) 12345 won't match 654321. If you want to match substrings as well, just remove the word boundaries from the regex.
The mathematical term for this is a multiset. In Python, this is handled by the Counter data type. For example,
from collections import Counter
target = '55432'
candidate = '32545'
Counter(candidate) == Counter(target)
If you want to generate all of the multisets, here's one question dealing with that: How to generate all the permutations of a multiset?
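To make the Counter comparison concrete, here is a minimal sketch using the digits from the question. The helper name is_multiset_permutation is hypothetical, not something from the standard library.

```python
from collections import Counter

def is_multiset_permutation(candidate, target):
    """True if candidate uses exactly the same digits, the same number of times."""
    return Counter(candidate) == Counter(target)

target = '55432'
print(is_multiset_permutation('32545', target))  # True
print(is_multiset_permutation('45523', target))  # True
print(is_multiset_permutation('55523', target))  # False: three 5s and no 4
```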

regex count occurrences

I am looking for a way to count the occurrences found in the string based on my regex. I used findall() and it returns a list but then the len() of the list is only 1? shouldn't the len() of the list be 2?
import re
string1 = r'Total $200.00 Total $900.00'
regex = r'(.*Total.*|.*Invoice.*|.*Amount.*)?(\s+?\$\s?[1-9]{1,10}.*(?:[.,]\d{3})*(?:[.,]\d{2})?)'
patt = re.findall(regex,string1)
print(patt)
print(len(patt))
Result:
> [('Total $200.00 Total', ' $900.00')]
> 1
Not sure if my regex is causing it to miscount. I am looking to get the Total from a file, but there are many variations of it.
Examples:
Total $900.00
Invoice Amt $500.00
Total 800.00
etc.
I am looking to count this because there could be multiple invoice details in one file.
First off, because that's a common misconception:
There is no need to match "all text up to the match" or "all the text after a match". You can drop those .* in your regex. Start with what you actually want to match.
import re
string1 = 'Total $200.00 Total $900.00'
amount_pattern = r'(?:Total|Amt|Invoice Amt|Others)[:\s]*\$([\d\.,]*\d)'
amount_expr = re.compile(amount_pattern, re.IGNORECASE)
amount_expr.findall(string1)
# -> ['200.00', '900.00']
\$([\d\.,]*\d) is a half-way reasonable approximation of prices ("things that start with a $ and then contain a bunch of digits and possibly dots and commas"). The final \d makes sure we are not accidentally matching sentence punctuation. It might be good enough, but you know what data you are working with. Feel free to come up with a more specific sub-expression. Include an optional leading - if you expect to see negative amounts.
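As one possible more specific sub-expression, here is a sketch that allows an optional leading minus and an optional $ (the question lists "Total 800.00" without one). The sample string is invented for illustration and the label list is an assumption; adjust both to your data.

```python
import re

# Optional minus, optional $, then digits with dots/commas, ending on a digit.
amount_expr = re.compile(
    r'(?:Total|Invoice Amt|Amt)[:\s]*(-?\$?\d(?:[\d.,]*\d)?)',
    re.IGNORECASE,
)

sample = 'Total $200.00 Invoice Amt -$500.00 Total 800.00'  # made-up input
amounts = amount_expr.findall(sample)
print(amounts)       # ['$200.00', '-$500.00', '800.00']
print(len(amounts))  # 3
```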
Try:
>>> re.findall(r'(\w*\s+\$\d+\.\d+)', string1)
['Total $200.00', 'Total $900.00']
The issue you are having is that your regex has two capture groups, so re.findall returns a list of tuples, one tuple per match, each holding the two groups. Your greedy .* lets a single match span both totals, so the list holds one tuple and its length is 1.
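To see how capture groups change what re.findall returns, here is a small sketch with simplified patterns of my own (not the asker's regex):

```python
import re

string1 = 'Total $200.00 Total $900.00'

# Two capture groups: findall returns one tuple per match.
with_groups = re.findall(r'(\w+)\s+(\$\d+\.\d+)', string1)
print(with_groups)  # [('Total', '$200.00'), ('Total', '$900.00')]

# No capture groups: findall returns the whole matched text.
without_groups = re.findall(r'\w+\s+\$\d+\.\d+', string1)
print(without_groups)  # ['Total $200.00', 'Total $900.00']
```

Either way len() gives 2 here, because each total is matched separately.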

Finding the recurring pattern

Let's say I have a number with a recurring pattern, i.e. there exists a string of digits that repeat themselves in order to make the number in question. For example, such a number might be 1234123412341234, created by repeating the digits 1234.
What I would like to do, is find the pattern that repeats itself to create the number. Therefore, given 1234123412341234, I would like to compute 1234 (and maybe 4, to indicate that 1234 is repeated 4 times to create 1234123412341234)
I know that I could do this:
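def findPattern(num):
    num = str(num)
    # Start at 1 so the candidate pattern is never empty.
    for i in range(1, len(num)):
        patt = num[:i]
        # Skip lengths that do not divide the total length evenly.
        if len(num) % len(patt):
            continue
        if patt * (len(num) // len(patt)) == num:
            return patt, len(num) // len(patt)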
However, this seems a little too hacky. I figured I could use itertools.cycle to compare two cycles for equality, which doesn't really pan out:
In [25]: c1 = itertools.cycle(list(range(4)))
In [26]: c2 = itertools.cycle(list(range(4)))
In [27]: c1==c2
Out[27]: False
Is there a better way to compute this? (I'd be open to a regex, but I have no idea how to apply it there, which is why I didn't include it in my attempts)
EDIT:
I don't necessarily know that the number has a repeating pattern, so I have to return None if there isn't one.
Right now, I'm only concerned with detecting numbers/strings that are made up entirely of a repeating pattern. However, later on, I'll likely also be interested in finding patterns that start after a few characters:
magic_function(78961234123412341234)
would return 1234 as the pattern, 4 as the number of times it is repeated, and 4 as the first index in the input where the pattern first presents itself
(.+?)\1+
Try this. Grab the capture. See demo.
import re
p = re.compile(r'(.+?)\1+')
test_str = u"1234123412341234"
re.findall(p, test_str)
Add anchors and flag Multiline if you want the regex to fail on 12341234123123, which should return None.
^(.+?)\1+$
See demo.
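Wrapping the two regexes above in functions might look like the sketch below. The function names are hypothetical; the second one covers the later requirement from the question (a pattern that starts after a few characters) by using the unanchored regex with re.search.

```python
import re

def find_pattern(num):
    """Return (pattern, repeats) if num is made entirely of a repeated pattern, else None."""
    s = str(num)
    m = re.match(r'^(.+?)\1+$', s)
    if m is None:
        return None
    unit = m.group(1)
    return unit, len(s) // len(unit)

def find_pattern_anywhere(num):
    """Return (pattern, repeats, start index) for the first repeated run, else None."""
    s = str(num)
    m = re.search(r'(.+?)\1+', s)
    if m is None:
        return None
    unit = m.group(1)
    return unit, len(m.group(0)) // len(unit), m.start()

print(find_pattern(1234123412341234))               # ('1234', 4)
print(find_pattern(12341234123123))                 # None
print(find_pattern_anywhere(78961234123412341234))  # ('1234', 4, 4)
```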
One way to find a recurring pattern and number of times repeated is to use this pattern:
(.+?)(?=\1+$|$)
with the g (global) option.
It will return the repeated pattern once for each repetition, so the number of matches is the number of times the pattern is repeated.
Non-repeated inputs return only 1 match.
Repeated patterns return 2 or more matches (the number of times repeated).
Demo
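Python has no g flag; re.findall plays that role. A minimal sketch of the lookahead pattern, with variable names of my own choosing:

```python
import re

lookahead = re.compile(r'(.+?)(?=\1+$|$)')

repeated = lookahead.findall('1234123412341234')
print(repeated)       # ['1234', '1234', '1234', '1234']
print(len(repeated))  # 4

not_repeated = lookahead.findall('12345')
print(not_repeated)   # ['12345']
```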

Regex to match 'lol' to 'lolllll' and 'omg' to 'omggg', etc

Hey there, I love regular expressions, but I'm just not good at them at all.
I have a list of some 400 shortened words such as lol, omg, lmao...etc. Whenever someone types one of these shortened words, it is replaced with its English counterpart ([laughter], or something to that effect). Anyway, people are annoying and type these short-hand words with the last letter(s) repeated x number of times.
examples:
omg -> omgggg, lol -> lollll, haha -> hahahaha, lol -> lololol
I was wondering if anyone could hand me the regex (in Python, preferably) to deal with this?
Thanks all.
(It's a Twitter-related project for topic identification if anyone's curious. If someone tweets "Let's go shoot some hoops", how do you know the tweet is about basketball, etc)
FIRST APPROACH -
Well, using regular expression(s) you could do like so -
import re
re.sub('g+', 'g', 'omgggg')
re.sub('l+', 'l', 'lollll')
etc.
Let me point out that using regular expressions is a very fragile & basic approach to dealing with this problem. You could so easily get strings from users which will break the above regular expressions. What I am trying to say is that this approach requires lot of maintenance in terms of observing the patterns of mistakes the users make & then creating case specific regular expressions for them.
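If you go the regex route anyway, one way to avoid hand-writing 400 patterns is to build them from the word list. This is only a sketch: squeeze_trailing and normalize are hypothetical helper names, and it handles just the repeated-last-letter case (omgggg, lollll), not lololol or hahahaha.

```python
import re

def squeeze_trailing(word):
    """Compile a pattern for word with its last letter repeated one or more times."""
    stem, last = word[:-1], word[-1]
    return re.compile(r'\b' + re.escape(stem) + re.escape(last) + r'+\b',
                      re.IGNORECASE)

def normalize(text, words):
    """Replace elongated forms of each word in text with the plain word."""
    for word in words:
        text = squeeze_trailing(word).sub(word, text)
    return text

print(normalize('omgggg that was funny lollll', ['omg', 'lol']))
# omg that was funny lol
```

The word boundaries keep lolly from being treated as an elongated lol.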
SECOND APPROACH -
Instead have you considered using difflib module? It's a module with helpers for computing deltas between objects. Of particular importance here for you is SequenceMatcher. To paraphrase from official documentation-
SequenceMatcher is a flexible class
for comparing pairs of sequences of
any type, so long as the sequence
elements are hashable. SequenceMatcher
tries to compute a "human-friendly
diff" between two sequences. The
fundamental notion is the longest
contiguous & junk-free matching subsequence.
import difflib as dl
x = dl.SequenceMatcher(lambda x: x == ' ', "omg", "omgggg")
y = dl.SequenceMatcher(lambda x: x == ' ', "omgggg", "omg")
# Average the ratio computed in both directions.
avg = (x.ratio() + y.ratio()) / 2.0
if avg >= 0.6:
    print('Match!')
else:
    print('Sorry!')
According to the documentation, any ratio() over 0.6 is a close match. You might need to tweak the threshold for your data. If you need stricter matching, I found that any value over 0.8 serves well.
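difflib also ships get_close_matches, which applies the same ratio with a default cutoff of 0.6 against a whole list of candidates. A minimal sketch, assuming a short made-up word list in place of your 400 words:

```python
import difflib

known = ['lol', 'omg', 'lmao']  # stand-in for the real word list

# Returns the candidates whose similarity ratio is at least the cutoff, best first.
print(difflib.get_close_matches('omgggg', known, n=1, cutoff=0.6))  # ['omg']
print(difflib.get_close_matches('lollll', known, n=1, cutoff=0.6))  # ['lol']
```

Raise cutoff toward 0.8 for stricter matching.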
How about
\b(?=lol)\S*(\S+)(?<=\blol)\1*\b
(replace lol with omg, haha etc.)
This will match lol, lololol, lollll, lollollol etc. but fail lolo, lollllo, lolly and so on.
The rules:
Match the word lol completely.
Then allow any repetition of one or more characters at the end of the word (i. e. l, ol or lol)
So \b(?=zomg)\S*(\S+)(?<=\bzomg)\1*\b will match zomg, zomggg, zomgmgmg, zomgomgomg etc.
In Python, with comments:
result = re.sub(
r"""(?ix)\b # assert position at a word boundary
(?=lol) # assert that "lol" can be matched here
\S* # match any number of characters except whitespace
(\S+) # match at least one character (to be repeated later)
(?<=\blol) # until we have reached exactly the position after the 1st "lol"
\1* # then repeat the preceding character(s) any number of times
\b # and ensure that we end up at another word boundary""",
"lol", subject)
This will also match the "unadorned" version (i. e. lol without any repetition). If you don't want this, use \1+ instead of \1*.
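To apply this to a whole word list, the pattern can be built per word. The sketch below is one way to do that; build_pattern and normalize are hypothetical names, and re.escape is there in case a word contains regex metacharacters.

```python
import re

def build_pattern(word):
    """Same pattern as above, with the word substituted in."""
    w = re.escape(word)
    return re.compile(rf'\b(?={w})\S*(\S+)(?<=\b{w})\1*\b', re.IGNORECASE)

def normalize(text, word):
    """Replace every elongated form of word in text with the plain word."""
    return build_pattern(word).sub(word, text)

print(normalize('lollll lololol lolly lolo', 'lol'))
# lol lol lolly lolo
```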
