I want to write pattern-matching code that can exactly match between the following two strings.
x = "Apple iPhone 6(Silver, 16 GB)"
y = "Apple iPhone 6 64 GB GSM Mobile Phone (Silver)"
Approach 1:
tmp_body = " ".join("".join([" " if ch in string.punctuation else ch.lower() for ch in y]).split())
tmp_body_1 = " ".join("".join([" " if ch in string.punctuation else ch.lower() for ch in x]).split())
if tmp_body in tmp_body_1:
print "true"
In my problem, x will always be the base string and y will change.
Approach 2:
Fuzzy matching, but I was not getting good results from it.
Approach 3:
Using regex, which I don't know well. I am still figuring out ways to solve it with regex.
These are the things I have figured out so far (a rough sketch follows below):
Removal of special characters from both the base and incoming string
Matching the GB size and the colour
Splitting the 'GB' from the number for better matching
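Here is a minimal sketch of those steps (the helper names, the punctuation handling, and the colour list are illustrative assumptions of mine):

import re

def normalize(s):
    # lowercase, replace punctuation with spaces, collapse runs of whitespace
    return " ".join(re.sub(r"[^\w\s]", " ", s.lower()).split())

def extract_features(s):
    # pull the GB size and the colour out of a normalized string
    capacity = re.search(r"(\d+)\s*gb", s)
    colour = re.search(r"\b(silver|gold|black|white|grey)\b", s)
    return (capacity.group(1) if capacity else None,
            colour.group(1) if colour else None)

x = normalize("Apple iPhone 6(Silver, 16 GB)")
y = normalize("Apple iPhone 6 64 GB GSM Mobile Phone (Silver)")
print(extract_features(x))  # ('16', 'silver')
print(extract_features(y))  # ('64', 'silver')
print(extract_features(x) == extract_features(y))  # False: 16 GB vs 64 GB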
How about the following approach: split each string into words, lowercase each word, and store them in a set. The set for x must then be a subset of the set for y. For your example it will fail, as 16 does not match 64:
x = "Apple iPhone 6(Silver, 16 GB)"
y = "Apple iPhone 6 64 GB GSM Mobile Phone (Silver)"
import re

set_x = set([item.lower() for item in re.findall("([a-zA-Z0-9]+)", x)])
set_y = set([item.lower() for item in re.findall("([a-zA-Z0-9]+)", y)])
print set_x
print set_y
print set_x.issubset(set_y)
Giving the following results:
set(['apple', '16', 'gb', '6', 'silver', 'iphone'])
set(['apple', 'mobile', 'phone', '64', 'gb', '6', 'gsm', 'silver', 'iphone'])
False
If 64 is changed to 16 then you get:
set(['apple', '16', 'gb', '6', 'silver', 'iphone'])
set(['apple', '16', 'mobile', 'phone', 'gb', '6', 'gsm', 'silver', 'iphone'])
True
Looks like you're trying to find the longest common substring of two unknown strings here.
Find common substring between two strings
Regex only works when you have a known pattern to your strings. You could use LCS to derive a pattern that you could use to test additional strings, but I don't think that's what you want.
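If you do want LCS, here is a minimal sketch with difflib, using the two strings from the question:

from difflib import SequenceMatcher

x = "Apple iPhone 6(Silver, 16 GB)"
y = "Apple iPhone 6 64 GB GSM Mobile Phone (Silver)"

# find the longest block common to both strings
m = SequenceMatcher(None, x, y).find_longest_match(0, len(x), 0, len(y))
print(x[m.a:m.a + m.size])  # 'Apple iPhone 6'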
If you want to extract the capacity, model, and other information from these strings, you may want to use multiple patterns to find each piece of information. Some information may not be available. Your regular expressions will need to be flexible in order to handle wider input (it's hard for me to assume all variations given a sample size of 2).
capacity = re.search(r'(\d+)\s*GB', useragent)
model = re.search(r'Apple iPhone ([A-Za-z0-9]+)', useragent)
These patterns won't make much sense to you unless you read the Python re module documentation. Basically, for capacity, I'm searching for one or more digits, followed by zero or more whitespace characters, followed by GB. If I find a match, the result is a match object and I can get the capacity with match.group(1). It's a similar story for finding the iPhone version, though my pattern doesn't work for "6 Plus".
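For illustration, a quick run of those two searches (the variable name s is mine; the answer's useragent would hold the same value):

import re

s = "Apple iPhone 6 64 GB GSM Mobile Phone (Silver)"

capacity = re.search(r'(\d+)\s*GB', s)
model = re.search(r'Apple iPhone ([A-Za-z0-9]+)', s)

if capacity:
    print(capacity.group(1))  # 64
if model:
    print(model.group(1))     # 6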
Since you have no control over the generation of these strings, if this is a script that you plan on using three years from now, expect to be a slave to it, updating the regular expression patterns as new string formats appear. Hopefully this is a one-off number-crunching exercise that can be scrapped as soon as you've answered your question.
Related
I have a complex text where I am categorizing different keywords stored in a dictionary:
text = 'data-ls-static="1">Making Bio Implants, Drug Delivery and 3D Printing in Medicine,MEDICINE</h3>'
sector = {"med tech": ['Drug Delivery' '3D printing', 'medicine', 'medical technology', 'bio cell']}
This can successfully find my keywords and categorize them, with some limitations:
pattern = r'[a-zA-Z0-9]+'
[cat for cat in sector if any(x in re.findall(pattern,text) for x in sector[cat])]
The limitations that I cannot solve are:
For example, keywords like "Drug Delivery" that contain a space are not recognized and therefore not categorized.
I was not able to make the pattern case-insensitive, so words like MEDICINE are not recognized. I tried adding (?i) to the pattern but it doesn't work.
The categorized keywords go into a pandas DataFrame, but they are printed inside []. I tried looping over the result again to take them out, but they are still there.
Data to pandas df:
ind_list = []
for site in url_list:
    ind = [cat for cat in indication if any(x in re.findall(pattern, soup_string) for x in indication[cat])]
    ind_list.append(ind)
websites['Indication'] = ind_list
Current output:
Website Sector Sub-sector Therapeutical Area Focus URL status
0 url3.com [med tech] [] [] [] []
1 www.url1.com [med tech, services] [] [oncology, gastroenterology] [] []
2 www.url2.com [med tech, services] [] [orthopedy] [] []
In the output I get empty [] entries that I'd like to avoid.
Can you help me with these points?
Thanks!
Here are some hints on the problems that can readily be spotted:
Why can't it match keywords like "Drug Delivery" that contain a space? This is because the regex pattern r'[a-zA-Z0-9]+' does not match a space. You can change it to r'[a-zA-Z0-9 ]+' (a space added after the 9) if you want to match a space as well. However, if you want to support other kinds of whitespace (e.g. \t, \n), you will need to adjust this regex pattern further.
Why doesn't it support case-insensitive matching? Your code fragment any(x in re.findall(pattern, text) for x in sector[cat]) requires x to have the same upper/lower case both in the result of re.findall and in sector[cat]. This constraint cannot even be bypassed by setting flags=re.I in the re.findall() call. I suggest converting them all to the same case before checking, for example all to lower case before matching: any(x.lower() in re.findall(pattern, text.lower()) for x in sector[cat]). Here we added .lower() to both text and x.
With the above 2 changes, it should allow you to capture some categorized keywords.
Actually, for this particular case, you may not need regular expressions and re.findall at all. You may just check e.g. sector[cat][i].lower() in text.lower(). That is, change the list comprehension as follows:
[cat for cat in sector if any(x in text.lower() for x in [y.lower() for y in sector[cat]])]
Edit
Test Run with 2-word phrase:
text = 'drug delivery'
sector = {"med tech": ['Drug Delivery', '3D printing', 'medicine', 'medical technology', 'bio cell']}
[cat for cat in sector if any(x in text.lower() for x in [y.lower() for y in sector[cat]])]
Output: # Successfully got the categorizing keyword even with dictionary values of different upper/lower cases
['med tech']
text = 'Drug Store fast delivery'
[cat for cat in sector if any(x in text.lower() for x in [y.lower() for y in sector[cat]])]
Output: # Correctly doesn't match with extra words in between
[]
Can you try a different approach than regex? I would suggest difflib when you have two similar words to match.
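A minimal sketch of that idea (the sample words and the cutoff are my own, not from the question):

import difflib

# similarity score between two close spellings
print(difflib.SequenceMatcher(None, 'medicine', 'medicines').ratio())  # ~0.94

# pick dictionary entries close to a (possibly differently-spelled) keyword
words = ['implants', 'drug', 'delivery', 'printing', 'medicine']
print(difflib.get_close_matches('medicines', words, n=3, cutoff=0.6))  # ['medicine']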
findall is pretty wasteful here since you are repeatedly breaking up the string for each keyword.
If you want to test whether the keyword is in the string:
[cat for cat in sector if any(re.search(word, text, re.I) for word in sector[cat])]
# Output: ['med tech']
I am trying to parse pseudo-English scripts and want to convert them into another machine-readable language.
However, the scripts have been written by many people in the past, and each had their own style of writing.
Some examples would be:
On Device 1 Set word 45 and 46 to hex 331
On Device 1 set words 45 and 46 bits 3..7 to 280
on Device 1 set word 45 to oct 332
on device 1 set speed to 60kts Words 3-4 to hex 34
(there are many more different ways used in the source text)
The issue is that it's not always logical or consistent.
I have looked at regexes and matching certain words. This works out OK, but I then need to know the next word (e.g. in 'Word 24' I would match 'Word' and then try to figure out whether the next token is a number or not). In the case of 'Words' I need to look for the several words to set, as well as their values.
In example 1, it should produce Set word 45 to hex 331 and Set word 46 to hex 331, or if possible Set word 45 to hex 331 and word 46 to hex 331.
I tried using the findall method of re, but that would only give me the matched words, and then I have to find the next word (i.e. the value) manually.
Alternatively, I could split the string on spaces and process each word manually, and then do something like the following,
assuming the list sp is
sp = ['On', 'device1:', 'set', 'Word', '1', '', 'to', '88', 'and', 'word', '2', 'to', '2151']
for i in range(len(sp)):
    rew = re.search("[Ww]ord", sp[i])
    if rew:
        print("Found word, next val is ", sp[i + 1])
Is there a better way to do what I want? I looked a little into tokenizing, but I'm not sure that would work, as the language is not structured in the first place.
I suggest you develop a program that gradually explores the syntax that people have used to write the scripts.
E.g., each instruction in your examples seems to break down into a device-part and a settings-part. So you could try matching each line against the regex ^(.+) set (.+). If you find lines that don't match that pattern, print them out. Examine the output, find a general pattern that matches some of them, add a corresponding regex to your program (or modify an existing regex), and repeat. Proceed until you've recognized (in a very general way) every line in your input.
(Since capitalization appears to be inconsistent, you can either do case-insensitive matches, or convert each line to lowercase before you start processing it. More generally, you may find other 'normalizations' that simplify subsequent processing. E.g., if people were inconsistent about spaces, you can convert every run of whitespace characters into a single space.)
(If your input has typographical errors, e.g. someone wrote "ste" for "set", then you can either change the regex to allow for that (... (set|ste) ...), or go to (a copy of) the input file and just fix the typo.)
Then go back to the lines that matched ^(.+) set (.+), print out just the first group for each, and repeat the above process for just those substrings.
Then repeat the process for the second group in each "set" instruction. And so on, recursively.
Eventually, your program will be, in effect, a parser for the script language. At that point, you can start to add code to convert each recognized construct into the output language.
Depending on your experience with Python, you can find ways to make the code concise.
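As a rough sketch of that exploratory loop (the file name and the single starting regex are assumptions; the pattern list is meant to grow over time):

import re

# one regex per construct recognized so far; extend this list as new
# script styles turn up in the unmatched output
patterns = [re.compile(r'^(.+?) set (.+)$', re.IGNORECASE)]

with open('scripts.txt') as f:         # hypothetical input file
    for line in f:
        line = ' '.join(line.split())  # normalize runs of whitespace
        if not any(p.match(line) for p in patterns):
            print(line)                # unrecognized: study, generalize, repeat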
Depending on what you actually want from these strings, you could use a parser, e.g. parsimonious:
from parsimonious.nodes import NodeVisitor
from parsimonious.grammar import Grammar

grammar = Grammar(
    r"""
    command  = set operand to? number (operator number)* middle? to? numsys? number
    operand  = (~r"words?" / "speed") ws
    middle   = (~r"[Ww]ords" / "bits")+ ws number
    to       = ws "to" ws
    number   = ws ~r"[-\d.]+" "kts"? ws
    numsys   = ws ("oct" / "hex") ws
    operator = ws "and" ws
    set      = ~"[Ss]et" ws
    ws       = ~r"\s*"
    """
)

class HorribleStuff(NodeVisitor):
    def __init__(self):
        self.cmds = []

    def generic_visit(self, node, visited_children):
        pass

    def visit_operand(self, node, visited_children):
        self.cmds.append(('operand', node.text))

    def visit_number(self, node, visited_children):
        self.cmds.append(('number', node.text))

examples = ['Set word 45 and 46 to hex 331',
            'set words 45 and 46 bits 3..7 to 280',
            'set word 45 to oct 332',
            'set speed to 60kts Words 3-4 to hex 34']

for example in examples:
    tree = grammar.parse(example)
    hs = HorribleStuff()
    hs.visit(tree)
    print(hs.cmds)
This would yield
[('operand', 'word '), ('number', '45 '), ('number', '46 '), ('number', '331')]
[('operand', 'words '), ('number', '45 '), ('number', '46 '), ('number', '3..7 '), ('number', '280')]
[('operand', 'word '), ('number', '45 '), ('number', '332')]
[('operand', 'speed '), ('number', '60kts '), ('number', '3-4 '), ('number', '34')]
How to extract Apple Recipe, 3, pages, 29.4KB from the following string?
'\r\n\t\t\t\t\t\r\n\t\t\t\t\tApple Recipe\r\r\n\t\t\t\t\t\r\n\t\t\t\t\t\r\n
\t\t\t\t\t3\r\n\t\t\t\t\t\t\r\n\t\t\t\t\t\t\tpages\r\n
\t\t\t\t\t\t\t\r\n\t\t\t\t\t\t\r\n\t\t\t\t\t\r\n
\t\t\t\t\t\r\n\t\t\t\t\t\t29.4KB\r\n
\t\t\t\t\t\r\n\t\t\t\t\t\r\n\t\t\t\t'
I've tried re.compile('\w+') but can only get results like:
Apple
Recipe
29
.
4
KB
However, I want to get them together as they appear, not separately. For example, I want Apple Recipe as a single token, not as two separate tokens.
data = """\r\n\t\t\t\t\t\r\n\t\t\t\t\tApple Recipe\r\r\n\t\t\t\t\t\r\n\t\t\t\t\t\r\n
\t\t\t\t\t3\r\n\t\t\t\t\t\t\r\n\t\t\t\t\t\t\tpages\r\n
\t\t\t\t\t\t\t\r\n\t\t\t\t\t\t\r\n\t\t\t\t\t\r\n
\t\t\t\t\t\r\n\t\t\t\t\t\t29.4KB\r\n
\t\t\t\t\t\r\n\t\t\t\t\t\r\n\t\t\t\t"""
import re
g = re.findall(r'[^\r\n\t]+', data)
print(g)
Prints:
['Apple Recipe', '3', 'pages', '29.4KB']
The [^\r\n\t]+ pattern matches any maximal run of characters that contains no \r, \n, or \t characters.
txt = """\r\n\t\t\t\t\t\r\n\t\t\t\t\tApple Recipe\r\r\n\t\t\t\t\t\r\n\t\t\t\t\t\r\n
\t\t\t\t\t3\r\n\t\t\t\t\t\t\r\n\t\t\t\t\t\t\tpages\r\n
\t\t\t\t\t\t\t\r\n\t\t\t\t\t\t\r\n\t\t\t\t\t\r\n
\t\t\t\t\t\r\n\t\t\t\t\t\t29.4KB\r\n
\t\t\t\t\t\r\n\t\t\t\t\t\r\n\t\t\t\t"""
import re
output = re.findall(r'\w+[.\d]?\w*', txt)
print(output)
You will get the required output (the final \w* rather than \w+ lets the single-character token 3 match):
['Apple', 'Recipe', '3', 'pages', '29.4KB']
For a class I am tackling the Twitter sentiment analysis problem. I have looked at the other questions on the site and they don't help with my particular issue.
I am given a string that is one tweet with its letters changed so that they are all in lowercase. For example,
'after 23 years i still love this place. (# tel aviv kosher pizza) http://t.co/jklp0uj'
as well as a dictionary of words where the key is the word and the value is the value for the sentiment for that word. To be more specific, a key can be a single word (such as 'hello'), more than one word separated by a space (such as 'yellow hornet'), or a hyphenated compound word (such as '2-dimensional'), or a number (such as '365').
I need to find the sentiment of the tweet by adding the sentiments for every eligible word and dividing by the number of eligible words (by eligible word, I mean word that is in the dictionary). I'm not sure what's the best way to go about checking if a tweet has a word in the dictionary.
I tried using the "key in string" convention, looping through all the keys, but this was problematic because there are a lot of keys and because words within words would be counted (e.g. eradicate would count cat, ate, era, etc. as well).
I then tried using .split(' ') and looping through the elements of the resulting list, but I ran into problems because of punctuation and because some keys are two words.
Anyone have any ideas on how I can more suitably tackle this?
For example: using the example above, still : -0.625, love : 0.625, every other word is not in the dictionary. so this should return (-0.625 + 0.625)/2 = 0.
The whole point of dictionaries is that they are quick at looking things up:
for word in instring.split():
    if word in wordsdict:
        print word
You would probably do better at getting rid of punctuation, etc. (thank you, Soke) by using regular expressions rather than split, e.g.
for word in re.findall(r'\w+', instring):
    if wordsdict.get(word) is not None:
        print word
Of course you will have to have some maximum length of word grouping, possibly found with a single run through the dictionary, and then take your pairs, triples, etc. and check them as well.
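For instance, a sketch of that grouping idea (the dictionary contents and the sentence here are illustrative):

import re

wordsdict = {'still': -0.625, 'love': 0.625, 'tel aviv': 0.5}  # illustrative values
instring = 'after 23 years i still love this place. (# tel aviv kosher pizza)'

words = re.findall(r'\w+', instring)
# longest key, measured in words, found with a single run through the dictionary
max_len = max(len(k.split()) for k in wordsdict)

found = []
for n in range(1, max_len + 1):          # singles, then pairs, triples, ...
    for i in range(len(words) - n + 1):
        group = ' '.join(words[i:i + n])
        if group in wordsdict:
            found.append(group)
print(found)  # ['still', 'love', 'tel aviv']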
You can use nltk, which is very powerful for what you want to do; it can also be done with split:
>>> import string
>>> a= 'after 23 years i still love this place. (# tel aviv kosher pizza) http://t.co/jklp0uj'
>>> import nltk
>>> my_dict = {'still' : -0.625, 'love' : 0.625}
>>> words = nltk.word_tokenize(a)
>>> words
['after', '23', 'years', 'i', 'still', 'love', 'this', 'place.', '(', '#', 'tel', 'aviv', 'kosher', 'pizza', ')', 'http', ':', '//t.co/jklp0uj']
>>> sum(my_dict.get(x.strip(string.punctuation),0) for x in words)/2
0.0
using split:
>>> words = a.split()
>>> words
['after', '23', 'years', 'i', 'still', 'love', 'this', 'place.', '(#', 'tel', 'aviv', 'kosher', 'pizza)', 'http://t.co/jklp0uj']
>>> sum(my_dict.get(x.strip(string.punctuation),0) for x in words)/2
0.0
my_dict.get(key, default): get returns the value if the key is found in the dictionary, otherwise it returns default; in this case 0.
Check this example: you asked about place.
>>> import string
>>> import nltk
>>> my_dict = {'still' : -0.625, 'love' : 0.625,'place':1}
>>> a= 'after 23 years i still love this place. (# tel aviv kosher pizza) http://t.co/jklp0uj'
>>> words = nltk.word_tokenize(a)
>>> sum(my_dict.get(x.strip(string.punctuation),0) for x in words)/2
0.5
Going by the length of the dictionary keys might be one solution.
For example, you have the dict as:
Sentimentdict = {"habit":5, "bad habit":-1}
the sentence might be:
s1="He has good habit"
s2="He has bad habit"
s1 should get a good sentiment compared to s2. Now, you can do this:
score = 0
for w in sorted(Sentimentdict.keys(), key=len, reverse=True):  # longest keys first
    if w in s1:
        s1 = s1.replace(w, '')     # remove the matched phrase
        score += Sentimentdict[w]  # and do your sentiment calculation
Sorting longest-first ensures that "bad habit" is matched and removed before "habit" can match.
I'm just learning Python and am having a problem figuring out how to create the regex pattern for the following string:
"...', 'begin:32,12:1:2005-10-30 T 10:45:end', 'begin:33,13:2:2006-11-31 T 11:46:end', '... <div dir="ltr">begin:32,12:1:2005-10-30 T 10:45:end<br>begin:33,13:2:2006-11-31 T 11:46:end<br>..."
I'm trying to extract the data between the begin: and :end markers for n iterations without getting duplicate data. My current attempt is attached below.
for m in re.finditer('.begin:(.*),(.*):(.*):(.*:.*):end.', list_to_string(j), re.DOTALL):
    print m.group(1)
    print m.group(2)
    print m.group(3)
    print m.group(4)
the output is:
begin:32,12:1:2005-10-30 T 10:45:end<br>begin:33
13
2
2006-11-31 T 11:46
and I want it to be:
32
12
1
2005-10-30 T 10:45
33
13
2
2006-11-31 T 11:46
Thank you for any help.
.* is greedy, matching across your intended :end boundary. Replace all .*s with lazy .*?.
>>> s = """...', 'begin:32,12:1:2005-10-30 T 10:45:end', 'begin:33,13:2:2006-11-31 T 11:46:end', '... <div dir="ltr">begin:32,12:1:2005-10-30 T 10:45:end<br>begin:33,13:2:2006-11-31 T 11:46:end<br>..."""
>>> re.findall("begin:(.*?),(.*?):(.*?):(.*?:.*?):end", s)
[('32', '12', '1', '2005-10-30 T 10:45'), ('33', '13', '2', '2006-11-31 T 11:46'),
('32', '12', '1', '2005-10-30 T 10:45'), ('33', '13', '2', '2006-11-31 T 11:46')]
With a modified pattern, forcing single quotes to be present at the start/end of the match:
>>> re.findall("'begin:(.*?),(.*?):(.*?):(.*?:.*?):end'", s)
[('32', '12', '1', '2005-10-30 T 10:45'), ('33', '13', '2', '2006-11-31 T 11:46')]
You need to make the variable-sized parts of your pattern "non-greedy". That is, make them match the smallest possible string rather than the longest possible (which is the default).
Try the pattern '.begin:(.*?),(.*?):(.*?):(.*?:.*?):end.'.
Another option besides Blckknght's and Tim Pietzcker's is
re.findall("begin:([^,]*),([^:]*):([^:]*):([^:]*:[^:]*):end", s)
Instead of choosing non-greedy extensions, you use [^X] to mean "any character but X" for some X.
The advantage is that it's more rigid: there's no way to get the delimiter in the result, so
'begin:33,13:134:2:2006-11-31 T 11:46:end'
would not match, whereas it would for Blckknght and Tim Pietzcker's. For this reason, it's also probably faster on edge cases. This is probably unimportant in real-world circumstances.
The disadvantage is that it's more rigid, of course.
I suggest choosing whichever one makes more intuitive sense, since both methods work.
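For instance, here is that edge case run with both patterns (with re imported as in the snippets above):

>>> t = "begin:33,13:134:2:2006-11-31 T 11:46:end"
>>> re.findall("begin:([^,]*),([^:]*):([^:]*):([^:]*:[^:]*):end", t)
[]
>>> re.findall("begin:(.*?),(.*?):(.*?):(.*?:.*?):end", t)
[('33', '13', '134', '2:2006-11-31 T 11:46')]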