Mapping regex to additional "instruction" using a dictionary - Python

I looked over some old Java code of mine, in which I extracted dates and their formats from a number of strings. It was a terrible mess of if conditions, regex patterns and matchers.
So I thought about how I would solve this nowadays in Python. I have a number of regex patterns, each of which maps to a date format from which a timestamp is subsequently created. I heard "If there is a switch statement in Java, in Python there should be a dictionary":
pattern_dic = {
    "[\\d]{2}:[\\d]{2}, .{3} [\\d]{1,2}, [\\d]{4} \\(UTC\\)": "HH:mm, MMM dd, yyyy (zzz)",
    "[\d]{2}:[\d]{2}, [\d]{1,2} .{3} [\d]{4} \(UTC\)": "HH:mm, dd MMM yyyy (zzz)",
    ...
}
*I think that I have to change these date patterns because I just copied them from the Java solution.
In another problem, in which I had regex / replacement pairs, I found a pretty nice solution using the dictionary like this
(courtesy of some brilliant person on Stack Overflow). This works only if the matching regex is a simple string, so that it can be looked up in the dictionary (I think).
pattern_acc = re.compile(r'\b(' + '|'.join(pattern_dic.keys()) + r')\b')
comment = pattern_acc.sub(lambda x: pattern_dic[x.group()], comment)
Here is what I came up with so far. My problem is that I don't know how I can get the matching part of the regex to look up in my dictionary ("matching_date_pattern"):
def multi_match(input_string, pattern_dic):
    date_pattern = re.compile(r'\b(' + '|'.join(pattern_dic.keys()) + r')\b')
    matches = date_pattern.findall(input_string)
    date_formats = []
    for match in matches:
        matching_string = match.group()
        date_format = pattern_dic["matching_date_pattern"]
        date_formats.append((matching_string, date_format))
Edit:
I should have stated that I would like to solve this as a preliminary problem. I would like to separate the matching from the searching, while still being able to access the pattern that matched.
Think, for example, of regular expressions that consist of many groups, and of the "instructions" they are mapped to becoming more complex. Imagine that you expect a lot of different text objects, like links, markdown elements, and so on. My problem at the moment boils down to knowing which pattern matched, between matching and searching.
Maybe the question is also how expensive it is to compile patterns, since compiling them separately of course makes it easier to access them.

The code you snatched from Stack Overflow is good if you want to match any of multiple regexps, but it doesn't solve your problem of finding which of your regexps matched in each particular case. You should rather just iterate over pattern_dic and check every key in turn:
def multi_match(input_string, pattern_dic):
    date_formats = []
    for regexp in pattern_dic:
        match = re.search(regexp, input_string)
        if match:
            matching_string = match.group()
            date_format = pattern_dic[regexp]
            date_formats.append((matching_string, date_format))
    return date_formats
Side remark: .append takes one argument, so it is necessary to form a tuple, hence the additional pair of parentheses.
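The loop over the dictionary can be sketched end to end as follows. This is only an illustration under some assumptions: it targets Python 3, the Java-style format strings are swapped for `strptime` directives, and the name `PATTERN_FORMATS`, the two patterns and the sample text are made up for the example. The patterns are compiled once when the dictionary is built, so the per-call cost is just the search.

```python
import re
from datetime import datetime

# Hypothetical mapping: compiled regex -> strptime format (not the asker's real data).
PATTERN_FORMATS = {
    re.compile(r"\d{2}:\d{2}, [A-Za-z]{3} \d{1,2}, \d{4} \(UTC\)"): "%H:%M, %b %d, %Y (UTC)",
    re.compile(r"\d{2}:\d{2}, \d{1,2} [A-Za-z]{3} \d{4} \(UTC\)"): "%H:%M, %d %b %Y (UTC)",
}

def multi_match(input_string, pattern_formats):
    """Return (matched_text, format, parsed datetime) for every match of every pattern."""
    results = []
    for pattern, date_format in pattern_formats.items():
        # finditer yields match objects, so we always know which pattern produced them
        for match in pattern.finditer(input_string):
            text = match.group()
            results.append((text, date_format, datetime.strptime(text, date_format)))
    return results

sample = "edited 14:05, Mar 7, 2012 (UTC) and again 09:30, 12 Apr 2013 (UTC)"
found = multi_match(sample, PATTERN_FORMATS)
for text, fmt, stamp in found:
    print(text, "->", stamp.isoformat())
```

Because each match is produced inside the loop over one specific pattern, the "which pattern matched" question never arises. Note that `%b` depends on the current locale; the sketch assumes English month abbreviations.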

Related

How to filter out specific strings from a string

Python beginner here. I'm stumped on part of this code for a bot I'm writing.
I am making a reddit bot using Praw to comb through posts and remove a specific set of characters (steam CD keys).
I made a test post here: https://www.reddit.com/r/pythonforengineers/comments/91m4l0/testing_my_reddit_scraping_bot/
This should have all the formats of keys.
Currently, my bot is able to find the post using a regular expression. I have these variables:
steamKey15 = (r'\w\w\w\w\w.\w\w\w\w\w.\w\w\w\w\w')
steamKey25 = (r'\w\w\w\w\w.\w\w\w\w\w.\w\w\w\w\w.\w\w\w\w\w.\w\w\w\w\w.')
steamKey17 = (r'\w\w\w\w\w\w\w\w\w\w\w\w\w\w\w\s\w\w')
I am finding the text using this:
subreddit = reddit.subreddit('pythonforengineers')
for submission in subreddit.new(limit=20):
    if submission.id not in steamKeyPostID:
        if re.search(steamKey15, submission.selftext, re.IGNORECASE):
            searchLogic()
            saveSteamKey()
So this is just to show that the things I should be using in a filter function are a combination of steamKey15/25/17 and submission.selftext.
So here is the part where I am confused. I can't find a function that works or does what I want. My goal is to remove all the text from submission.selftext (the body of the post) BUT the keys, which will eventually be saved in a .txt file.
Any advice on a good way to go around this? I've looked into re.sub and .translate but I don't understand how the parts fit together.
I am using Python 3.7 if it helps.
Can't you just get the regexp results?
m = re.search(steamKey15, submission.selftext, re.IGNORECASE)
if m:
    print(m.group(0))
Also note that a dot . means any char in a regexp. If you want to match only dots, you should use \. instead. You can probably write your regexp like this:
r'\w{5}[-.]\w{5}[-.]\w{5}'
This will match the key when separated by . or by -.
Note that this will also match anything that begins or ends with a key, or has a key in the middle. That can cause you problems, as your 15-character key regexp is contained in the 25-character one! To fix that, use negative lookahead and negative lookbehind:
r'(?<![\w.-])\w{5}[-.]\w{5}[-.]\w{5}(?![\w.-])'
That will only find the keys if there are no extraneous characters before and after them.
Another hint is to use re.findall instead of re.search - some posts contain more than one steam key in the same post! findall will return all matches while search only returns the first one.
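Putting these hints together, here is a small sketch of pulling only the keys out of a post body. The names `KEY_15`, `KEY_25` and `extract_keys`, as well as the sample text and keys, are invented for illustration; in the bot, the text would come from `submission.selftext`.

```python
import re

# Separator may be '-' or '.'; the lookarounds stop a 15-character key
# from matching inside a longer 25-character key.
KEY_15 = re.compile(r"(?<![\w.-])\w{5}(?:[-.]\w{5}){2}(?![\w.-])")
KEY_25 = re.compile(r"(?<![\w.-])\w{5}(?:[-.]\w{5}){4}(?![\w.-])")

def extract_keys(text):
    """Return every 15- and 25-character key found in text (15s first, then 25s)."""
    # The groups are non-capturing, so findall returns the full matched strings.
    return KEY_15.findall(text) + KEY_25.findall(text)

# Made-up post body with made-up keys.
post_body = "Free keys: ABCDE-FGHIJ-KLMNO and 11111-22222-33333-44444-55555, enjoy"
keys = extract_keys(post_body)
print(keys)
```

The returned list contains nothing but the keys, so it can be written to a .txt file with ordinary file I/O instead of trying to delete the surrounding text with re.sub.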
So, a couple of things first: . means any character in regex. I think you know that, but just to be sure. Also, \w\w\w\w\w can be replaced with \w{5}, which specifies five word characters (letters, digits or underscore). I would use re.findall.
import re
steamKey15 = (r'(?:\w{5}.){2}\w{5}')
steamKey25 = (r'(?:\w{5}.){5}')
steamKey17 = (r'\w{15}\s\w\w')
subreddit = reddit.subreddit('pythonforengineers')
for submission in subreddit.new(limit=20):
    if submission.id not in steamKeyPostID:
        finds_15 = re.findall(steamKey15, submission.selftext)
        finds_25 = re.findall(steamKey25, submission.selftext)
        finds_17 = re.findall(steamKey17, submission.selftext)

Extract part of string according to pattern using regular expression Python

I have files that follow a specific naming format, which looks something like this:
test_0800_20180102_filepath.csv
anotherone_0800_20180101_hello.csv
The numbers in the middle represent timestamps, so I would like to extract that information. I know that there is a specific pattern which will always be _time_date_, so essentially I want the part of the string that lies between the first and third underscores. I found some examples and somehow similar problems, but I am new to Python and I am having trouble adapting them.
This is what I have implemented thus far:
datetime = re.search(r"\d+_(\d+)_", "test_0800_20180102_filepath.csv")
But the result I get is only the date part:
20180102
But what I actually need is:
0800_20180102
That's quite simple:
match = re.search(r"_((\d+)_(\d+))_", your_string)
print(match.group(1)) # print time_date >> 0800_20180101
print(match.group(2)) # print time >> 0800
print(match.group(3)) # print date >> 20180101
Note that for such tasks the group operator () inside the regexp is really helpful, it allows you to access certain substrings of a bigger pattern without having to match each one individually (which can sometimes be much more ambiguous than matching a larger one).
Groups are then accessed by number, from 1 up to the number of groups in the pattern; group 0 is the whole match. The numbers are assigned from left to right, in the order in which the opening parentheses appear in your pattern.
On a side note, if you have control over it, use unix timestamps so you only have one number defining both date and time universally.
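As a sketch of how the groups can feed a real timestamp: the example below assumes the time is always four digits (HHMM) and the date eight digits (YYYYMMDD), which the question's file names suggest but do not guarantee. `STAMP` and `parse_stamp` are names made up for the example.

```python
import re
from datetime import datetime

# Outer group captures "time_date"; the named inner groups capture each part.
STAMP = re.compile(r"_((?P<time>\d{4})_(?P<date>\d{8}))_")

def parse_stamp(filename):
    """Return (raw 'time_date' text, datetime), or None if the name has no stamp."""
    match = STAMP.search(filename)
    if match is None:
        return None
    raw = match.group(1)
    # Assumed layout: date is YYYYMMDD, time is HHMM.
    parsed = datetime.strptime(match.group("date") + match.group("time"), "%Y%m%d%H%M")
    return raw, parsed

print(parse_stamp("test_0800_20180102_filepath.csv"))
```

Named groups are optional here; they only make the later `group("date")` and `group("time")` lookups independent of the group numbering.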
The key here is that you want everything between the first and the third underscores on each line, so there is no need to worry about designing a regex to match your time and date pattern.
with open('myfile.txt', 'r') as f:
    for line in f:
        x = '_'.join(line.split('_')[1:3])
        print(x)
The problem with your implementation is that you are only capturing the date part of your pattern. If you want to stick with a regex solution then simply move your parentheses to capture the entire pattern you want:
re.search(r"(\d+_\d+)_", "test_0800_20180102_filepath.csv").group(1)
gives:
'0800_20180102'
This is very easy to do with .split():
time = filename.split("_")[1]
date = filename.split("_")[2]

Regex fuzzy word match

Tough regex question: I want to use regexes to extract information from news sentences about crackdowns. Here are some examples:
doc1 = "5 young students arrested"
doc2 = "10 rebels were reported killed"
I want to match sentences based on lists of entities and outcomes:
entities = ['students','rebels']
outcomes = ['arrested','killed']
How can I use a regex to extract the number of participants from 0-99999, any of the entities, any of the outcomes, all while ignoring random text (such as 'young' or 'were reported')? This is what I have:
re.findall(r'\d{1,5} \D{1,50}'+ '|'.join(entities) + '\D{1,50}' + '|'.join(outcomes),doc1)
i.e., a number, some optional random text, an entity, some more optional random text, and an outcome.
Something is going wrong, I think because of the OR statements. Thanks for your help!
This regex should match your two examples:
pattern = r'\d+\s+.*?(' + '|'.join(entities) + r').*?(' + '|'.join(outcomes) + ')'
What you were missing were parentheses around the ORs.
However, using only regex likely won't give you good results. Consider using Natural Language Processing libraries like NLTK that parses sentences.
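For the literal question, the grouped pattern can be tried out like this. It is a sketch only: the `\b` word boundaries, the `re.escape` calls and the `extract_event` helper are additions made for the example, not part of the answer above.

```python
import re

entities = ['students', 'rebels']
outcomes = ['arrested', 'killed']

# re.escape guards against list entries that contain regex metacharacters.
entity_alt = '|'.join(re.escape(e) for e in entities)
outcome_alt = '|'.join(re.escape(o) for o in outcomes)

# number, lazy filler, one entity, lazy filler, one outcome
pattern = re.compile(
    r'(\d{1,5})\b.*?\b(' + entity_alt + r')\b.*?\b(' + outcome_alt + r')\b'
)

def extract_event(sentence):
    """Return (count, entity, outcome), or None when the sentence does not fit."""
    match = pattern.search(sentence)
    if match is None:
        return None
    count, entity, outcome = match.groups()
    return int(count), entity, outcome

print(extract_event("5 young students arrested"))
print(extract_event("10 rebels were reported killed"))
```

The parentheses around each joined list are what keep the | operators local to that list; everything else in the pattern is just filler between the three captured pieces.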
As @ReutSharabani already answered, this is not a proper way to do NLP, but this answers the literal question.
The regex should read:
import re
entities = ['students', 'rebels']
outcomes = ['arrested', 'killed']
p = re.compile(r'(\d{1,5})\D{1,50}(' + '|'.join(entities) + r')\D{1,50}(' + '|'.join(outcomes) + ')')
m = p.match(doc1)
number = m.group(1)
entity = m.group(2)
outcome = m.group(3)
You forgot to group () your OR operations. Without the parentheses, each | applies to everything on either side of it, so the pattern you generated had the form \d{1,5} \D{1,50}a|b\D{1,50}c|d, which is three separate top-level alternatives.
You ought to try out the regex module!
It has built in fuzzy match capabilities. The other answers seem much more robust and sleek, but this could be done simply with fuzzy matching as well!
pattern = r'\d{1,5}(%(entities)s)(%(outcomes)s){i}' %{'entities' : '|'.join(entities), 'outcomes' : '|'.join(outcomes)}
regex.match(pattern, news_sentence)
What's happening here is that the {i} indicates you want a match with any number of inserts. The problem here is that it could insert characters into one of the entities or outcomes and still yield a match. If you want to accept slight alterations on spelling to any of your outcomes or entities, then you could also use {e<=1} or something. Read more in the provided link about approximate matching!

Regular expression dictionary in python

Is it possible to implement a dictionary with keys as regular expressions and actions (with parameters) as values?
For example:
key = "actionname 1 2", value = "method(1, 2)"
key = "differentaction par1 par2", value = "appropriate_method(par1, par2)"
The user types in the key, and I need to execute the matching method with the parameters provided as part of the user input.
It would be great if we could achieve the lookup in O(1) time. Even if that is not possible, I am at least looking for solutions to this problem.
I will have a few hundred regular expressions (say 300) and matching parameterized actions to execute.
I can write a loop to achieve this, but is there an elegant way to do this without using a for loop?
Related question: Hashtable/dictionary/map lookup with regular expressions
Yes, it's perfectly possible:
import re
dict = {}
dict[re.compile('actionname (\d+) (\d+)')] = method
dict[re.compile('differentaction (\w+) (\w+)')] = appropriate_method
def execute_method_for(str):
    # Match each regex on the string
    matches = (
        (regex.match(str), f) for regex, f in dict.iteritems()
    )
    # Filter out empty matches, and extract groups
    matches = (
        (match.groups(), f) for match, f in matches if match is not None
    )
    # Apply all the functions
    for args, f in matches:
        f(*args)
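A Python 3 variant of the same idea, as a runnable sketch. The two handler functions are stand-ins invented for the example, and unlike the code above this version stops at the first pattern that matches the whole command instead of applying every matching function.

```python
import re

def method(a, b):
    # Stand-in handler, invented for this example.
    return "method(%s, %s)" % (a, b)

def appropriate_method(a, b):
    # Stand-in handler, invented for this example.
    return "appropriate_method(%s, %s)" % (a, b)

# List of (compiled pattern, handler) pairs; list order decides priority.
ACTIONS = [
    (re.compile(r'actionname (\d+) (\d+)'), method),
    (re.compile(r'differentaction (\w+) (\w+)'), appropriate_method),
]

def execute_method_for(command):
    """Call the handler of the first pattern that matches the whole command."""
    for pattern, handler in ACTIONS:
        match = pattern.fullmatch(command)
        if match is not None:
            # The captured groups become the handler's positional arguments.
            return handler(*match.groups())
    raise ValueError("no action matches %r" % command)

print(execute_method_for("actionname 1 2"))
print(execute_method_for("differentaction par1 par2"))
```

A list of pairs is used rather than a dict because the patterns are never looked up by key; they are only iterated, and a list makes the matching order explicit.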
Of course, the values of your dictionary can be Python functions.
Your matching function can try to match your string against each key and execute the appropriate function if there is a match. This will be linear in time in the best case, but I don't think you can get anything better if you want to use regular expressions.
But looking at your example data, I think you should reconsider whether you need regular expressions at all. Perhaps you can just parse your input string into, e.g., <procedure-name> <parameter>+ and then look up the appropriate procedure by its name (a simple string). That lookup can be O(1).
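A minimal sketch of that name-based lookup, assuming commands are whitespace-separated and the first word is the procedure name. `HANDLERS`, `dispatch` and the two handler functions are hypothetical names for the example.

```python
def method(a, b):
    # Stand-in handler, invented for this example.
    return ("method", a, b)

def appropriate_method(a, b):
    # Stand-in handler, invented for this example.
    return ("appropriate_method", a, b)

# Plain string keys, so finding the handler is a single hash lookup.
HANDLERS = {
    "actionname": method,
    "differentaction": appropriate_method,
}

def dispatch(command):
    """Split '<procedure-name> <parameter>+' and call the named handler."""
    name, *params = command.split()
    handler = HANDLERS.get(name)
    if handler is None:
        raise KeyError("unknown action: %s" % name)
    return handler(*params)

print(dispatch("actionname 1 2"))
```

The parameters arrive as strings; any validation or conversion (for example to int) would be the handler's job in this design.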
Unfortunately this is not possible. You will need to iterate over the regular expressions in order to find out if they match. The lookup in the dictionary will be O(1) though (but that doesn't solve your problem).
IMHO, you are asking the WRONG QUESTION.
You ask if there's an elegant way to do this. Answer: The most elegant way is the most OBVIOUS way. Code will be read 10x to 20x as often as it's modified. Therefore, if you write something 'elegant' that's hard to read and quickly understand, you've just sabotaged the guy after you who has to modify it somehow.
BETTER CODE:
Another answer here reads like this:
matches = ( (regex.match(str), f) for regex, f in dict.iteritems() )
The whole function can be written as a plain loop that does the same thing:
# IMHO the 'regex' variable should be named 'pattern', since its type is <sre.SRE_Pattern>
for pattern, func in dictname.items():
    match = pattern.match(str)
    if match:
        func(*match.groups())
But the loop version is far easier to read and understand at a glance.
I apologize (a little) if you're one of those people who is offended by code that is even slightly more wordy than you think it could be. My criterion, and Guido's as mentioned in PEP 8, is that the clearest code is the best code.

Apply multiple negative regex to expression in Python

This question is similar to "How to concisely cascade through multiple regex statements in Python" except instead of matching one regular expression and doing something I need to make sure I do not match a bunch of regular expressions, and if no matches are found (aka I have valid data) then do something. I have found one way to do it but am thinking there must be a better way, especially if I end up with many regular expressions.
Basically I am filtering URLs for bad stuff ("", \\", etc.) that occurs when I yank what looks like a valid URL out of an HTML document but it turns out to be part of a JavaScript (and thus needs to be evaluated, and thus the escaping characters). I can't use Beautiful Soup to process these pages since they are far too mangled (actually I use BeautifulSoup, then fall back to my ugly but workable parser).
So far I have found that the following works relatively well: I compile a dict of regular expressions outside the main loop (so I only have to compile it once, but benefit from the speed increase every time I use it). I then loop a URL through this dict; if there is a match then the URL is bad, if not the URL is good:
regex_bad_url_components = {"1" : re.compile('\"\"'),
                            "2" : re.compile('\\\"')}
Followed by:
url_state = "good"
for key, pattern in regex_bad_url_components.items():
    match = re.search(pattern, url)
    if (match):
        url_state = "bad"
if (url_state == "good"):
    # do stuff here ...
Now the obvious thought is to use regex "or" ("|"), i.e.:
re.compile('(\"\"|\\\")')
Which reduces the number of compares and whatnot, but makes it much harder to troubleshoot. With one expression per compare I can easily add a print statement like:
print "URL: ", url, " matched by key ", key
So is there some way to get the best of both worlds (i.e. a minimal number of compares) yet still be able to print out which regex is matching the URL? Or do I simply need to bite the bullet and have my slower but easier to troubleshoot code when debugging, and then squoosh all the regexes together into one line for production (which means one more step of programming and code maintenance, and possible problems)?
Update:
Good answer by Dave Webb, so the actual code for this would look like:
match = re.search(r'(?P<double_quotes>\"\")|(?P<slash_quote>\\\")', fullurl)
if match is None:
    # do stuff here ...
else:
    # optional for debugging
    print "url matched by", match.lastgroup
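The same named-group idea as a self-contained Python 3 sketch. The function name `check_url` and the example URLs are made up; the two group names are taken from the update above.

```python
import re

# One named group per "bad" pattern; the group name doubles as a debug label.
BAD_URL = re.compile(r'(?P<double_quotes>"")|(?P<slash_quote>\\")')

def check_url(url):
    """Return None for a clean URL, else the name of the group that matched."""
    match = BAD_URL.search(url)
    return None if match is None else match.lastgroup

print(check_url('http://example.com/page'))
print(check_url('http://example.com/a""b'))
print(check_url('http://example.com/a\\"b'))
```

Only one search runs per URL, and the returned name tells you which alternative fired, so the same code serves both debugging and production.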
"Squoosh" all the regexes into one line but put each in a named group using (?P<name>...), then use MatchObject.lastgroup to find which one matched.
