I do pattern matching in text using CLiPS pattern.search (Python 2.7).
I need to extract both phrases that correspond to 'VBN NP' and 'NP TO NP'.
I can do it separately and then join the results:
from pattern.en import parse,parsetree
from pattern.search import search
text="Published case-control studies have a lot of information about susceptibility to asthma."
sentenceTree = parsetree(text, relations=True, lemmata=True)
matches = []
for match in search("VBN NP",sentenceTree):
matches.append(match.string)
for match in search("NP TO NP",sentenceTree):
matches.append(match.string)
print matches
# Output: [u'Published case-control studies', u'susceptibility to asthma']
But I would like to join these into one search pattern. If I try this, I get no results at all:
matches = []
for match in search("VBN NP|NP TO NP",sentenceTree):
matches.append(match.string)
print matches
#Output: []
The official documentation gives no clues for this. I also tried '{VBN NP}|{NP TO NP}' and '[VBN NP]|[NP TO NP]', but without any luck.
The question is:
Is it possible to join search patterns in CLiPS pattern.search?
And if the answer is "yes", how do I do it?
This pattern worked for me: {VBN NP} *+ {NP TO NP}, along with the match() and group() methods:
>>> from pattern.search import match
>>> from pattern.en import parsetree
>>> t = parsetree('Published case-control studies have a lot of information about susceptibility to asthma.',relations= True)
>>> m = match('{VBN NP} *+ {NP TO NP}',t)
>>> m.group(0) #matches the complete pattern
[Word(u'Published/VBN'), Word(u'case-control/NN'), Word(u'studies/NNS'), Word(u'have/VBP'), Word(u'a/DT'), Word(u'lot/NN'), Word(u'of/IN'), Word(u'information/NN'), Word(u'about/IN'), Word(u'susceptibility/NN'), Word(u'to/TO'), Word(u'asthma/NN')]
>>> m.group(1) # matches the first group
[Word(u'Published/VBN'), Word(u'case-control/NN')]
>>> m.group(2) # matches the second group
[Word(u'susceptibility/NN'), Word(u'to/TO'), Word(u'asthma/NN')]
Finally, you can display the result as:
>>> matches=[]
>>> for i in range(2):
...     matches.append(m.group(i+1).string)
...
>>> matches
[u'Published case-control', u'susceptibility to asthma']
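If a longer text can contain this construction more than once, the same group pattern should also work with search(), which returns all matches rather than only the first. A minimal sketch, assuming search() accepts the same {...} group syntax as match():
from pattern.en import parsetree
from pattern.search import search

text = "Published case-control studies have a lot of information about susceptibility to asthma."
tree = parsetree(text, relations=True, lemmata=True)

matches = []
for m in search('{VBN NP} *+ {NP TO NP}', tree):
    # each Match exposes its groups just like the match() example above
    matches.append((m.group(1).string, m.group(2).string))
print matches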
I have a large list of strings and I want to check whether a string occurs in a larger string. The list contains strings of one word as well as strings of multiple words. To do so I have written the following code:
example_list = ['pain', 'chestpain', 'headache', 'sickness', 'morning sickness']
example_text = "The patient has kneepain as wel as a headache"
emptylist = []
for i in example_text.split():
    res = [ele for ele in example_list if ele in i]
    emptylist.append(res)
However, the problem here is that 'pain' is also added to emptylist, which it should not be, as I only want something from example_list to be added if it exactly matches a word or phrase in the text. I also tried using sets:
word_set = set(example_list)
phrase_set = set(example_text.split())
word_set.intersection(phrase_set)
This however chops up 'morning sickness' into 'morning' and 'sickness'. Does anyone know the correct way to tackle this problem?
Nice examples have already been provided in this post by other members.
I made the matching text a little more challenging, so that 'pain' occurs more than once. I also aimed for a little more information about where each match starts. I ended up with the following code.
I worked on the following sentence:
"The patient has not only kneepain but headache and arm pain, stomach pain and sickness"
import re
from collections import defaultdict
example_list = ['pain', 'chestpain', 'headache', 'sickness', 'morning sickness']
example_text = "The patient has not only kneepain but headache and arm pain, stomach pain and sickness"
TruthFalseDict = defaultdict(list)
for i in example_list:
    MatchedTruths = re.finditer(r'\b%s\b' % i, example_text)
    if MatchedTruths:
        for j in MatchedTruths:
            TruthFalseDict[i].append(j.start())
print(dict(TruthFalseDict))
The above gives me the following output.
{'pain': [55, 69], 'headache': [38], 'sickness': [78]}
Using PyParsing:
import pyparsing as pp
example_list = ['pain', 'chestpain', 'headache', 'sickness', 'morning sickness']
example_text = "The patient has kneepain as wel as a headache morning sickness"
list_of_matches = []
for word in example_list:
    rule = pp.OneOrMore(pp.Keyword(word))
    for t, s, e in rule.scanString(example_text):
        if t:
            list_of_matches.append(t[0])
print(list_of_matches)
Which yields:
['headache', 'sickness', 'morning sickness']
You should be able to use a regex with word boundaries:
>>> import re
>>> [word for word in example_list if re.search(r'\b{}\b'.format(word), example_text)]
['headache']
This will not match 'pain' in 'kneepain' since that does not begin with a word boundary. But it would properly match substrings that contained whitespace.
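If the terms may contain regex metacharacters, or you want multi-word phrases to win over their single-word parts, the same idea can be folded into one combined pattern. A sketch; re.escape and the longest-first ordering are my additions, not part of the original answer:
import re

example_list = ['pain', 'chestpain', 'headache', 'sickness', 'morning sickness']
example_text = "The patient has kneepain as wel as a headache morning sickness"

# Try longer phrases first so 'morning sickness' is preferred over 'sickness';
# re.escape guards against terms that contain regex metacharacters.
terms = sorted(example_list, key=len, reverse=True)
pattern = r'\b(?:' + '|'.join(re.escape(t) for t in terms) + r')\b'
print(re.findall(pattern, example_text))
# ['headache', 'morning sickness']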
Assume I have text like this:
<p>Joe likes <ul><li>pizza</li>, <li>burgers</li>, and <li>fries</li></ul></p>
I want to use a single regex to extract all of the text between the <li> tags using Python.
regexp = r"<p>.+?(<li>.+?</li>).+?</p>"
This only returns the first item in the list surrounded by the <li> tags:
<li>pizza</li>
Is there a way for me to grab all of the items between the <li> tags so my output would look like:
<li>pizza</li><li>burgers</li><li>fries</li>
This should work:
import re
source = '<p>Joe likes <ul><li>pizza</li>, <li>burgers</li>, and <li>fries</li></ul></p>'
res = ''.join(re.findall('<li>[^<]*</li>', source))
# <li>pizza</li><li>burgers</li><li>fries</li>
Assuming you have already extracted the example string you state, you can do:
import re
s = "<p>Joe likes <ul><li>pizza</li>, <li>burgers</li>, and <li>fries</li></ul></p>"
re.findall("<li>.+?</li>", s)
Output:
['<li>pizza</li>', '<li>burgers</li>', '<li>fries</li>']
Why do you need the <p> tags?
import re
source = '<p>Joe likes <ul><li>pizza</li>, <li>burgers</li>, and <li>fries</li></ul></p>'
m = re.findall('(<li>.+?</li>)',source)
print m
returns what you want.
Edit
If you only want text that is between <p> tags, you can do it in two steps:
import re
source = '<p>Joe likes <ul><li>pizza</li>, <li>burgers</li>, and <li>fries</li></ul></p> and also <li>coke</li>'
ss = re.findall('<p>(.+?)</p>',source)
for s in ss:
    m = re.findall('(<li>.+?</li>)', s)
    print m
Try this regex with re.findall()
To get the text: <li>([^<]*)</li>; to get the tags: <li>[^<]*</li>
>>> import re
>>> s = "<p>Joe likes <ul><li>pizza</li>, <li>burgers</li>, and <li>fries</li></ul></p>"
>>> text=re.findall("<li>([^<]*)</li>", s)
>>> tag=re.findall("<li>[^<]*</li>", s)
>>> text
['pizza', 'burgers', 'fries']
>>> tag
['<li>pizza</li>', '<li>burgers</li>', '<li>fries</li>']
>>>
Given a file like this:
# For more information about CC-CEDICT see:
# http://cc-cedict.org/wiki/
A A [A] /(slang) (Tw) to steal/
AA制 AA制 [A A zhi4] /to split the bill/to go Dutch/
AB制 AB制 [A B zhi4] /to split the bill (where the male counterpart foots the larger portion of the sum)/(theater) a system where two actors take turns in acting the main role, with one actor replacing the other if either is unavailable/
A咖 A咖 [A ka1] /class "A"/top grade/
A圈兒 A圈儿 [A quan1 r5] /at symbol, #/
A片 A片 [A pian4] /adult movie/pornography/
I want to build a JSON object that:
skips lines that start with #
breaks lines into 4 parts:
traditional character (spans from the start ^ until the next space)
simplified character (spans from the first space to the second)
pinyin (spans between the square brackets [...])
gloss (spans from the first / to the last /; note there are cases where there can be slashes within the gloss, e.g. /adult movie/pornography/)
I am currently doing it as such:
>>> for line in text.split('\n'):
...     if line.startswith('#'): continue
...     line = line.strip()
...     simple, _, line = line.partition(' ')
...     trad, _, line = line.partition(' ')
...     print simple, trad
...
A A
AA制 AA制
AB制 AB制
A咖 A咖
A圈兒 A圈儿
A片 A片
To get the [...], I had to do:
>>> import re
>>> line = "A片 A片 [A pian4] /adult movie/pornography/"
>>> simple, _, line = line.partition(' ')
>>> trad, _, line = line.partition(' ')
>>> re.findall(r'\[.*\]', line)[0].strip('[]')
'A pian4'
And to find the /.../, I had to do:
>>> line = "A片 A片 [A pian4] /adult movie/pornography/"
>>> re.findall(r'\/.*\/$', line)[0].strip('/')
'adult movie/pornography'
How do I use regex groups to catch all of them at once without doing multiple partitions/splits/findalls?
You could extract the info using a regular expression instead. This way, you can catch the blocks in groups and then handle them as desired:
import re
with open("myfile") as f:
data = f.read().split('\n')
for line in data:
if line.startswith('#'): continue
m = re.search(r"^([^ ]*) ([^ ]*) \[([^]]*)\] \/(.*)\/$", line)
if m:
print(m.groups())
That regular expression splits the string into the following groups:
^([^ ]*) ([^ ]*) \[([^]]*)\] \/(.*)\/$
 ^^^^^^^ ^^^^^^^   ^^^^^^^     ^^^^
   1)      2)        3)         4)
That is:
the first word.
the second word.
the text within [ and ].
the text from / up to the / before the end of the line.
It returns:
('A', 'A', 'A', '(slang) (Tw) to steal')
('AA制', 'AA制', 'A A zhi4', 'to split the bill/to go Dutch')
('AB制', 'AB制', 'A B zhi4', 'to split the bill (where the male counterpart foots the larger portion of the sum)/(theater) a system where two actors take turns in acting the main role, with one actor replacing the other if either is unavailable')
('A咖', 'A咖', 'A ka1', 'class "A"/top grade')
('A圈兒', 'A圈儿', 'A quan1 r5', 'at symbol, #')
('A片', 'A片', 'A pian4', 'adult movie/pornography')
p = re.compile(r"(\S+)\s+(\S+)\s+\[([^\]]*)\]\s+/(.*)/$")
m = p.match(line)
if m:
    simple, trad, pinyin, gloss = m.groups()
See https://docs.python.org/2/howto/regex.html#grouping for more details.
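To go from the captured groups to the JSON object the question asks for, the matches can be collected into dicts and serialized with the json module. A minimal sketch (the field names are my own; the field order follows the question's description, i.e. traditional character first, then simplified):
import json
import re

pattern = re.compile(r"^(\S+)\s+(\S+)\s+\[([^\]]*)\]\s+/(.*)/$")

entries = []
with open("myfile") as f:
    for line in f:
        if line.startswith('#'):
            continue
        m = pattern.match(line.strip())
        if m:
            trad, simp, pinyin, gloss = m.groups()
            entries.append({
                "traditional": trad,        # first field, per the question's description
                "simplified": simp,
                "pinyin": pinyin,
                "gloss": gloss.split('/'),  # keep the individual senses as a list
            })

print(json.dumps(entries))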
This might help:
preg = re.compile(r'^(?<!#)(\w+)\s(\w+)\s(\[.*?\])\s/(.+)/$',
                  re.MULTILINE | re.UNICODE)

with open('your_file') as f:
    for line in f:
        match = preg.match(line)
        if match:
            print(match.groups())
I created the following regex to match all four groups:
^(.*)\s(.*)\s(\[.*\])\s(\/.*\/)
This does assume that there is only one space between the groups; however, if you have more, you can just add a modifier.
Here is a sketch of how this works in Python with two of the lines provided in the question:
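import re

# two of the sample lines from the question
lines = [
    "A A [A] /(slang) (Tw) to steal/",
    "AA制 AA制 [A A zhi4] /to split the bill/to go Dutch/",
]

for line in lines:
    m = re.match(r"^(.*)\s(.*)\s(\[.*\])\s(\/.*\/)", line)
    if m:
        print(m.groups())

# Expected output (groups 3 and 4 keep their brackets and slashes):
# ('A', 'A', '[A]', '/(slang) (Tw) to steal/')
# ('AA制', 'AA制', '[A A zhi4]', '/to split the bill/to go Dutch/')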
I have the following text; each line has two phrases separated by "\t":
RoadTunnel RouteOfTransportation
LaunchPad Infrastructure
CyclingLeague SportsLeague
Territory PopulatedPlace
CurlingLeague SportsLeague
GatedCommunity PopulatedPlace
What I want is to add _ to separate the words; the results should be:
Road_Tunnel Route_Of_Transportation
Launch_Pad Infrastructure
Cycling_League Sports_League
Territory Populated_Place
Curling_League Sports_League
Gated_Community Populated_Place
There are no cases such as "ABTest" or "aBTest", but there are cases with three words together, such as "RouteOfTransportation". I tried several ways but did not succeed.
One of my tries is:
textProcessed = re.sub(r"([A-Z][a-z]+)(?=([A-Z][a-z]+))", r"\1_", text)
But it produces no result.
Use a regular expression and re.sub.
>>> import re
>>> s = '''LaunchPad Infrastructure
... CyclingLeague SportsLeague
... Territory PopulatedPlace
... CurlingLeague SportsLeague
... GatedCommunity PopulatedPlace'''
>>> subbed = re.sub('([A-Z][a-z]+)([A-Z])', r'\1_\2', s)
>>> print(subbed)
Launch_Pad Infrastructure
Cycling_League Sports_League
Territory Populated_Place
Curling_League Sports_League
Gated_Community Populated_Place
Edit: here's another one, since your test cases don't cover enough to be sure exactly what you want:
>>> re.sub('([a-zA-Z])([A-Z])([a-z])', r'\1_\2\3', 'ABThingThing')
'AB_Thing_Thing'
Combining re.findall and str.join:
>>> "_".join(re.findall(r"[A-Z]{1}[^A-Z]*", text))
Depending on your needs, a slightly different solution can be this:
import re
result = re.sub(r"([a-zA-Z])(?=[A-Z])", r"\1_", s)
It will insert a _ before any upper case letter that follows another letter (whether it is upper or lower case).
"TheRabbit IsBlue" => "The_Rabbit Is_Blue"
"ABThing ThingAB" => "A_B_Thing Thing_A_B"
It does not support special chars.
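For instance, reproducing the first example above in an interactive session:
>>> import re
>>> re.sub(r"([a-zA-Z])(?=[A-Z])", r"\1_", "TheRabbit IsBlue")
'The_Rabbit Is_Blue'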
I'm using Python to search for some words (also multi-token) in a description (string).
To do that I'm using a regex like this:
result = re.search(word, description, re.IGNORECASE)
if result:
    print("Found: " + result.group())
But what I need is to obtain the 2 words before and after the match. For example if I have something like this:
Parking here is horrible, this shop sucks.
"here is" is the word that I looking for. So after I matched it with my regex I need the 2 words (if exists) before and after the match.
In the example:
Parking here is horrible, this
"Parking" and horrible, this are the words that I need.
ATTTENTION
The description cab be very long and the pattern "here is" can appear multiple times?
How about string operations?
line = 'Parking here is horrible, this shop sucks.'
before, term, after = line.partition('here is')
before = before.rsplit(maxsplit=2)[-2:]
after = after.split(maxsplit=2)[:2]
Result:
>>> before
['Parking']
>>> after
['horrible,', 'this']
Try this regex: ((?:[a-z,]+\s+){0,2})here is\s+((?:[a-z,]+\s*){0,2})
with re.findall and re.IGNORECASE set
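For example, on the sample sentence (the captured groups keep their trailing spaces, so they may still need stripping):
>>> import re
>>> line = "Parking here is horrible, this shop sucks."
>>> re.findall(r"((?:[a-z,]+\s+){0,2})here is\s+((?:[a-z,]+\s*){0,2})", line, re.IGNORECASE)
[('Parking ', 'horrible, this ')]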
I would do it like this (edit: added anchors to cover most cases):
(\S+\s+|^)(\S+\s+|)here is(\s+\S+|)(\s+\S+|$)
Like this you will always have 4 groups (might have to be trimmed) with the following behavior:
If group 1 is empty, there was no word before (group 2 is empty too)
If group 2 is empty, there was only one word before (group 1)
If group 1 and 2 are not empty, they are the words before in order
If group 3 is empty, there was no word after
If group 4 is empty, there was only one word after
If group 3 and 4 are not empty, they are the words after in order
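For example, on the sample sentence group 2 comes back empty, which per the rules above means there was only one word before the match:
>>> import re
>>> line = "Parking here is horrible, this shop sucks."
>>> m = re.search(r"(\S+\s+|^)(\S+\s+|)here is(\s+\S+|)(\s+\S+|$)", line)
>>> m.groups()
('Parking ', '', ' horrible,', ' this')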
Based on your clarification, this becomes a bit more complicated. The solution below deals with scenarios where the searched pattern may in fact also be in the two preceding or two subsequent words.
line = "Parking here is horrible, here is great here is mediocre here is here is "
print line
pattern = "here is"
r = re.search(pattern, line, re.IGNORECASE)
output = []
if r:
    while line:
        before, match, line = line.partition(pattern)
        if match:
            if not output:
                before = before.split()[-2:]
            else:
                before = ' '.join([pattern, before]).split()[-2:]
            after = line.split()[:2]
            output.append((before, after))
print output
Output from my example would be:
[(['Parking'], ['horrible,', 'here']), (['is', 'horrible,'], ['great', 'here']), (['is', 'great'], ['mediocre', 'here']), (['is', 'mediocre'], ['here', 'is']), (['here', 'is'], [])]