I am trying to parse the result output from a natural language parser (Stanford parser).
Some of the results are as below:
dep(Company-1, rent-5')
conj_or(rent-5, share-10)
amod(information-12, personal-11)
prep_about(rent-5, you-14)
amod(companies-20, non-affiliated-19)
aux(provide-23, to-22)
xcomp(you-14, provide-23)
dobj(provide-23, products-24)
aux(requested-29, 've-28)
The results I am trying to get are:
['dep', 'Company', 'rent']
['conj_or', 'rent', 'share']
['amod', 'information', 'personal']
...
['amod', 'companies', 'non-affiliated']
...
['aux', 'requested', "'ve"]
First I tried to directly get these elements out, but failed.
Then I realized regex should be the right way forward.
However, I am totally unfamiliar with regex. With some exploration, I got:
m = re.search('(?<=())\w+', line)
m2 =re.search('(?<=-)\d', line)
and got stuck.
The first one correctly gets the first elements, e.g. 'dep', 'amod', 'conj_or', but I have not really figured out why it works...
The second line is meant to get the second elements, e.g. 'Company', 'rent', 'information', but I only get the number after the word. I cannot figure out how to look ahead rather than look behind...
BTW, I also cannot figure out how to deal with exceptions such as 'non-affiliated' and "'ve".
Could anyone give me some hints or help? It would be highly appreciated.
It is difficult to give an optimal answer without knowing the full range of possible outputs. However, here's a possible solution:
>>> [re.findall(r'[A-Za-z_\'-]+[^-\d\(\)\']', line) for line in s.split('\n')]
[['dep', 'Company', 'rent'],
['conj_or', 'rent', 'share'],
['amod', 'information', 'personal'],
['prep_about', 'rent', 'you'],
['amod', 'companies', 'non-affiliated'],
['aux', 'provide', 'to'],
['xcomp', 'you', 'provide'],
['dobj', 'provide', 'products'],
['aux', 'requested', "'ve"]]
It works by finding all runs of contiguous letters ([A-Za-z] covers the ranges from capital A to Z and from small a to z) or the characters "_", "'" and "-" in the same line.
Furthermore, it enforces the rule that the matched string must not end with any of a given list of characters ([^...] is the syntax for "any character except these"; replace "..." with the list of characters).
The character \ escapes characters like "(" or ")" that would otherwise be parsed by the regex engine as instructions.
Finally, s is the example string you gave in the question...
HTH!
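To try the pattern on its own, here is a minimal self-contained sketch. The sample string s is an assumption (four lines taken from the question, without the stray apostrophe), and the backslashes inside the character classes are dropped because they are not needed there.

```python
import re

# Assumption: a shortened copy of the parser output from the question.
s = """dep(Company-1, rent-5)
conj_or(rent-5, share-10)
amod(companies-20, non-affiliated-19)
aux(requested-29, 've-28)"""

# One or more letters, "_", "'" or "-", followed by one character that is
# not "-", a digit, a parenthesis or "'".
pattern = re.compile(r"[A-Za-z_'-]+[^-\d()']")

parsed = [pattern.findall(line) for line in s.split('\n')]
for row in parsed:
    print(row)
```

Note that the pattern needs at least two characters per token, so a one-letter word such as the "a" in a-1 would be skipped.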
Here is a pattern that does what you're looking for:
([\w'-]*)\(([\w'-]*)-\d*, ([\w'-]*)-\d*\)
The parentheses around [\w'-]* are capture groups (the apostrophe is in the class so that tokens such as 've also match), so you can access the data as:
ex = r"([\w'-]*)\(([\w'-]*)-\d*, ([\w'-]*)-\d*\)"
m = re.match(ex, line)
print(m.group(1), m.group(2), m.group(3))  # group(0) would be the whole match
Btw, I recommend the "Kodos" program, written in Python and PyQt, for learning and testing regular expressions. It's my favourite tool for testing regexes.
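A grouped pattern like this returns None from re.match when a line does not fit, so a guard is useful when looping over real parser output. The following is only a sketch under the assumption that every line has the shape relation(word-N, word-N); the names LINE_RE and parse_line are made up for this example.

```python
import re

# Assumption: relation(word-N, word-N). The optional ' after each index
# tolerates the stray apostrophe in the first sample line of the question.
LINE_RE = re.compile(r"(\w+)\((\S+)-\d+'?, (\S+)-\d+'?\)")

def parse_line(line):
    """Return [relation, first_word, second_word], or None if the line does not fit."""
    m = LINE_RE.fullmatch(line.strip())
    return list(m.groups()) if m else None

for line in ["dep(Company-1, rent-5')",
             "amod(companies-20, non-affiliated-19)",
             "aux(requested-29, 've-28)",
             "not a dependency line"]:
    print(parse_line(line))
```

Using \S+ for the words keeps hyphens and apostrophes inside a token, while the trailing -\d+ forces the split at the last hyphen before the index.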
If the results from the parser are as regular as suggested, regexes may not be necessary:
from pprint import pprint
source = """
dep(Company-1, rent-5')
conj_or(rent-5, share-10)
amod(information-12, personal-11)
prep_about(rent-5, you-14)
amod(companies-20, non-affiliated-19)
aux(provide-23, to-22)
xcomp(you-14, provide-23)
dobj(provide-23, products-24)
aux(requested-29, 've-28)
"""
items = []
for line in source.splitlines():
    head, sep, tail = line.partition('(')
    if head:
        item = [head]
        head, sep, tail = tail.strip('()').partition(', ')
        item.append(head.rpartition('-')[0])
        item.append(tail.rpartition('-')[0])
        items.append(item)
pprint(items)
Output:
[['dep', 'Company', 'rent'],
['conj_or', 'rent', 'share'],
['amod', 'information', 'personal'],
['prep_about', 'rent', 'you'],
['amod', 'companies', 'non-affiliated'],
['aux', 'provide', 'to'],
['xcomp', 'you', 'provide'],
['dobj', 'provide', 'products'],
['aux', 'requested', "'ve"]]
Related
So, I have this problem in which I have to create a list of lists that contains, for each line, every word with a length greater than 4. The challenge is to solve this with a one-liner.
text = '''My candle burns at both ends;
It will not last the night;
But ah, my foes, and oh, my friends—
It gives a lovely light!'''
So far I managed this: res = [i for ele in text.splitlines() for i in ele.split(' ') if len(i) > 4] but it returns ['candle', 'burns', 'ends;', 'night;', 'foes,', 'friends—', 'gives', 'lovely', 'light!'] instead of [['candle', 'burns', 'ends;'], ['night;'], ['foes,', 'friends—'], ['gives', 'lovely', 'light!']]
Any ideas? :D
In this case I would use regular expressions to find your results.
By using a regular expression inside a list comprehension like yours, the matches for each line automatically end up in their own list.
This particular search pattern looks for runs of 4 or more word characters (letters of either case, digits, or underscores).
import re
text = '''My candle burns at both ends;
It will not last the night;
But ah, my foes, and oh, my friends—
It gives a lovely light!'''
results = [re.findall(r'\w{4,}', line) for line in text.split('\n')]
print(results)
Output:
[['candle', 'burns', 'both', 'ends'], ['will', 'last', 'night'], ['foes', 'friends'], ['gives', 'lovely', 'light']]
If you wish to keep the special characters you might want to look into expanding the regular expression so it includes all characters except whitespace.
There are great tools to play around with if you look for "online regular expression tools" so you get some more feedback when trying to build your own patterns.
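As a sketch of the "all characters except whitespace" idea, \S{5,} keeps the punctuation and counts it toward the length, which appears to reproduce the output asked for in the question (length greater than 4). The variable names by_regex and by_split are just for this example.

```python
import re

text = '''My candle burns at both ends;
It will not last the night;
But ah, my foes, and oh, my friends—
It gives a lovely light!'''

# \S{5,} matches runs of five or more non-whitespace characters,
# so trailing punctuation counts toward the length and is kept.
by_regex = [re.findall(r'\S{5,}', line) for line in text.splitlines()]

# The same result without re: nest the comprehension so that each line
# gets its own inner list.
by_split = [[w for w in line.split() if len(w) > 4] for line in text.splitlines()]

print(by_regex)
```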
IIUC, this oneliner should work for you (without the use of additional packages):
[[w.strip(';,!—') for w in l.split() if len(w)>=4] for l in text.split('\n')]
Output:
[['candle', 'burns', 'both', 'ends'],
['will', 'last', 'night'],
['foes', 'friends'],
['gives', 'lovely', 'light']]
I have a list of issues like the one below. I would like to remove all special characters and numbers from it, and then tokenize it and remove stop words:
issue=[[hi iam !#going $%^ to uk&*(us \\r\\ntomorrow {morning} by
the_way two-three!~`` [problems]:are there;]
[happy"journey" (and) \\r\\n\\rbring 576 chachos?>]]
I have tried below code but I am not getting desired output:
import re
ab=re.sub('[^A-Za-z0-9]+', '', issue)
bc=re.split(r'\s', ab)
I would like to see output like below:
issue_output=[['hi','going','uk','us','tomorrow','morning',
'way','two','three','problems' ]
[ 'happy','journey','bring','chachos']]
There are two glaring problems with the code that you have posted. The first is that your input list issue is not formatted properly, which makes it impossible to parse. Depending on how you actually want it formatted, the answer to your question might change. In general, though, this leads to the second problem, which is that you are trying to call re.sub on a list. You want to do the substitution on the list's elements, and you can use a list comprehension for that:
issue_output = [re.sub(r'[^A-Za-z0-9]+', ' ', item) for item in issue]
Since there is no valid Python list provided in the question, I will assume the values in the list based on my best guess.
issue = [
['hi iam !#going $%^ to uk&*(us \\r\\ntomorrow {morning} by the_way two-three!~`` [problems]:are there;'],
['happy"journey" (and) \\r\\n\\rbring 576 chachos?>']
]
In this case, when you have a list of lists of strings, you need to adjust the list comprehension for that.
cleaned_issue = [[re.sub(r'[^A-Za-z0-9]+', ' ', item) for item in inner_list] for inner_list in issue]
This returns a list of lists with strings inside:
[['hi iam going to uk us r ntomorrow morning by the way two three problems are there '], ['happy journey and r n rbring 576 chachos ']]
If you want to have the separate words in that list, simply split() them after substitution.
tokenized_issue = [[re.sub(r'[^A-Za-z0-9]+', ' ', item).split() for item in inner_list][0] for inner_list in issue]
This gives the result of:
[['hi', 'iam', 'going', 'to', 'uk', 'us', 'r', 'ntomorrow', 'morning', 'by', 'the', 'way', 'two', 'three', 'problems', 'are', 'there'], ['happy', 'journey', 'and', 'r', 'n', 'rbring', '576', 'chachos']]
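The question also asks for numbers and stop words to be removed. A possible sketch follows, with two loud assumptions: the \r and \n in the input are real control characters rather than literal backslashes, and STOP_WORDS is a tiny hand-picked set for illustration only (a real project would more likely take its list from an NLP library).

```python
import re

# Assumption: a hand-picked stop-word set, for illustration only.
STOP_WORDS = {"iam", "to", "by", "the", "are", "there", "and"}

# Assumption: \r and \n here are real carriage returns and newlines.
issue = [
    ["hi iam !#going $%^ to uk&*(us \r\ntomorrow {morning} by the_way two-three!~`` [problems]:are there;"],
    ['happy"journey" (and) \r\n\rbring 576 chachos?>'],
]

def tokenize(text):
    # Letters only: digits, punctuation and whitespace all act as separators.
    words = re.findall(r"[A-Za-z]+", text.lower())
    return [w for w in words if w not in STOP_WORDS]

issue_output = [tokenize(" ".join(inner_list)) for inner_list in issue]
print(issue_output)
```

Using re.findall for the tokens avoids the separate substitute-then-split step, since anything that is not a letter simply ends the current token.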
When I try to use regular expression for finding strings in other strings, it does not work as expected. Here is an example:
import re
message = 'I really like beer, but my favourite beer is German beer.'
keywords = ['beer', 'german beer', 'german']
regex = re.compile("|".join(keywords))
regex.findall(message.lower())
Result:
['beer', 'beer', 'german beer']
But the expected result would be:
['beer', 'beer', 'german beer', 'german']
Another way to do that could be:
results = []
for k in keywords:
    regex = re.compile(k)
    for r in regex.findall(message.lower()):
        results.append(r)
['beer', 'beer', 'beer', 'german beer', 'german']
It works like I want, but I think it is not the best way to do that. Can somebody help me?
re.findall cannot find overlapping matches. If you want to use regular expressions you will have to create separate expressions and run them in a loop as in your second example.
Note that your second example can also be shortened to the following, though it's a matter of taste whether you find this more readable:
results = [r for k in keywords for r in re.findall(k, message.lower())]
Your specific example doesn't require the use of regular expressions. You should avoid using regular expressions if you just want to find fixed strings.
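For fixed strings, one regex-free possibility is str.count, sketched below with the message and keywords from the question. Each keyword is counted independently, so matches of different keywords may overlap each other.

```python
message = 'I really like beer, but my favourite beer is German beer.'
keywords = ['beer', 'german beer', 'german']

text = message.lower()
# str.count returns the number of non-overlapping occurrences of one keyword;
# repeating the keyword that many times mirrors what findall would return.
results = [k for k in keywords for _ in range(text.count(k))]
print(results)
```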
re.findall is described in http://docs.python.org/2/library/re.html
"Return all non-overlapping matches of pattern in string..."
Non-overlapping means that for "german beer" it will not find "german beer" AND "german", because those matches are overlapping.
My cleaner (for me) version of your last solution:
results = []
for key in keywords:
    results.extend(re.findall(key, message, re.IGNORECASE))
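If the keywords may contain characters that are special in a regex (such as "+" or "."), escaping them first is probably safer. A small sketch, where find_keywords is a made-up helper name:

```python
import re

def find_keywords(keywords, message):
    """Return every case-insensitive occurrence of each keyword, in keyword order."""
    results = []
    for key in keywords:
        # re.escape makes characters such as "+" or "." match literally.
        results.extend(re.findall(re.escape(key), message, re.IGNORECASE))
    return results

message = 'I really like beer, but my favourite beer is German beer.'
print(find_keywords(['beer', 'german beer', 'german'], message))
```

With re.IGNORECASE the matches keep the casing of the original message, so the last two results come back as 'German beer' and 'German'.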
Is there a way to use .title() to get the correct output from a title with apostrophes? For example:
"john's school".title() --> "John'S School"
How would I get the correct title here, "John's School" ?
If your titles do not contain several whitespace characters in a row (which would be collapsed), you can use string.capwords() instead:
>>> import string
>>> string.capwords("john's school")
"John's School"
EDIT: As Chris Morgan rightfully says below, you can alleviate the whitespace collapsing issue by specifying " " in the sep argument:
>>> string.capwords("john's school", " ")
"John's School"
This is difficult in the general case, because some single apostrophes are legitimately followed by an uppercase character, such as Irish names starting with "O'". string.capwords() will work in many cases, but ignores anything in quotes. string.capwords("john's principal says,'no'") will not return the result you may be expecting.
>>> from string import capwords
>>> capwords("John's School")
"John's School"
>>> capwords("john's principal says,'no'")
"John's Principal Says,'no'"
>>> capwords("John O'brien's School")
"John O'brien's School"
A more annoying issue is that title casing itself does not produce the proper results. For example, in American English usage, articles and prepositions are generally not capitalized in titles or headlines (Chicago Manual of Style).
>>> capwords("John clears school of spiders")
'John Clears School Of Spiders'
>>> "John clears school of spiders".title()
'John Clears School Of Spiders'
You can install the titlecase module (for example with pip install titlecase, or easy_install), which will be much more useful to you and does what you want without capwords's issues. There are still many edge cases, of course, but you'll get much further without worrying too much about a personally-written version.
>>> from titlecase import titlecase
>>> titlecase("John clears school of spiders")
'John Clears School of Spiders'
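For the apostrophe problem alone, a regex-based workaround along the lines of the one suggested in the Python documentation for str.title may be enough. The function name title_keep_apostrophes is made up for this sketch, and it only handles ASCII letters.

```python
import re

def title_keep_apostrophes(s):
    # A "word" is a run of letters, optionally followed by an apostrophe
    # and more letters; capitalize() uppercases only its first character
    # and lowercases the rest.
    return re.sub(r"[A-Za-z]+('[A-Za-z]+)?",
                  lambda mo: mo.group(0).capitalize(), s)

print(title_keep_apostrophes("john's school"))
```

Like capwords, this still lowercases the letter after an apostrophe, so a name such as O'Brien comes out as O'brien.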
I think that can be tricky with title().
Let's try something different:
def titlize(s):
    b = []
    for temp in s.split(' '):
        b.append(temp.capitalize())
    return ' '.join(b)
titlize("john's school")
# You get: John's School
Hope that helps!
Although the other answers are helpful, and more concise, you may run into some problems with them. For example, if there are new lines or tabs in your string. Also, hyphenated words (whether with regular or non-breaking hyphens) may be a problem in some instances, as well as words that begin with apostrophes. However, using regular expressions (using a function for the regular expression replacement argument) you can solve these problems:
import re
def title_capitalize(match):
    text=match.group()
    i=0
    new_text=""
    capitalized=False
    while i<len(text):
        if text[i] not in {"’", "'"} and capitalized==False:
            new_text+=text[i].upper()
            capitalized=True
        else:
            new_text+=text[i].lower()
        i+=1
    return new_text
def title(the_string):
    return re.sub(r"[\w'’‑-]+", title_capitalize, the_string)
s="here's an apostrophe es. this string has multiple spaces\nnew\n\nlines\nhyphenated words: and non-breaking spaces, and a non‑breaking hyphen, as well as 'ords that begin with ’strophies; it\teven\thas\t\ttabs."
print(title(s))
Anyway, you can edit this to compensate for any further problems, such as backticks and what-have-you, if needed.
If you're of the opinion that title casing should keep words such as prepositions, conjunctions and articles lowercase unless they're at the beginning or end of the title, you can try code such as this (but there are a few ambiguous words, such as when, that you'll have to figure out by context):
import re
lowers={'this', 'upon', 'altogether', 'whereunto', 'across', 'between', 'and', 'if', 'as', 'over', 'above', 'afore', 'inside', 'like', 'besides', 'on', 'atop', 'about', 'toward', 'by', 'these', 'for', 'into', 'beforehand', 'unlike', 'until', 'in', 'aft', 'onto', 'to', 'vs', 'amid', 'towards', 'afterwards', 'notwithstanding', 'unto', 'while', 'next', 'including', 'thru', 'a', 'down', 'after', 'with', 'afterward', 'or', 'those', 'but', 'whereas', 'versus', 'without', 'off', 'among', 'because', 'some', 'against', 'before', 'around', 'of', 'under', 'that', 'except', 'at', 'beneath', 'out', 'amongst', 'the', 'from', 'per', 'mid', 'behind', 'along', 'outside', 'beyond', 'up', 'past', 'through', 'beside', 'below', 'during'}
def title_capitalize(match, use_lowers=True):
    text=match.group()
    lower=text.lower()
    if lower in lowers and use_lowers==True:
        return lower
    else:
        i=0
        new_text=""
        capitalized=False
        while i<len(text):
            if text[i] not in {"’", "'"} and capitalized==False:
                new_text+=text[i].upper()
                capitalized=True
            else:
                new_text+=text[i].lower()
            i+=1
        return new_text
def title(the_string):
    first=re.sub(r"[\w'’‑-]+", title_capitalize, the_string)
    return re.sub(r"(^[\w'’‑-]+)|([\w'’‑-]+$)", lambda match : title_capitalize(match, use_lowers=False), first)
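The same idea can be sketched more compactly in a single pass. This is a variation under stated assumptions, not the code above: SMALL_WORDS is a deliberately tiny stand-in for the full lowers set, and cap_first_letter and title_case are names made up for the example. The first and last words are detected by their match positions instead of by a second substitution.

```python
import re

# Assumption: a short stand-in for the much longer set of lowercase words.
SMALL_WORDS = {"a", "an", "and", "of", "the", "to", "in", "on", "at", "for", "or", "but"}
WORD = re.compile(r"[\w'’‑-]+")

def cap_first_letter(word):
    # Uppercase the first character that is not an apostrophe, lowercase the rest.
    for i, ch in enumerate(word):
        if ch not in "'’":
            return word[:i].lower() + ch.upper() + word[i + 1:].lower()
    return word

def title_case(s):
    matches = list(WORD.finditer(s))
    if not matches:
        return s
    # Start offsets of the first and last word, which are always capitalized.
    edges = {matches[0].start(), matches[-1].start()}

    def repl(m):
        word = m.group()
        if word.lower() in SMALL_WORDS and m.start() not in edges:
            return word.lower()
        return cap_first_letter(word)

    return WORD.sub(repl, s)

print(title_case("the lord of the rings"))
```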
IMHO, the best answer is @Frédéric's. But if you already have your string split into words, and you know how string.capwords is implemented, then you can avoid an unneeded joining step:
def capwords(s, sep=None):
    return (sep or ' ').join(
        x.capitalize() for x in s.split(sep)
    )
As a result, you can just do this:
# here my_words == ['word1', 'word2', ...]
s = ' '.join(word.capitalize() for word in my_words)
If you have to cater for dashes then use:
import string
" ".join(
    string.capwords(word, sep="-")
    for word in string.capwords(
        "john's school at bel-red"
    ).split()
)
# "John's School At Bel-Red"
Here is what I have so far
text = "Hello world. It is a nice day today. Don't you think so?"
re.findall(r'\w{3,}\s{1,}\w{3,}',text)
#['Hello world', 'nice day', 'you think']
The desired output would be ['Hello world', 'nice day', 'day today', 'today Don't', 'Don't you', 'you think']
Can this be done with a simple regex pattern?
import itertools as it
import re
three_pat=re.compile(r'\w{3}')
text = "Hello world. It is a nice day today. Don't you think so?"
for key,group in it.groupby(text.split(),lambda x: bool(three_pat.match(x))):
    if key:
        group=list(group)
        for i in range(0,len(group)-1):
            print(' '.join(group[i:i+2]))
# Hello world.
# nice day
# day today.
# today. Don't
# Don't you
# you think
It is not clear to me what you want done with all punctuation. On the one hand, it looks like you want periods to be removed, but single quotation marks to be kept. It would be easy to implement the removal of periods, but before I do, would you clarify what you want to happen to all punctuation?
list(map(lambda x: x[0] + x[1], re.findall(r'(\w{3,}(?=(\s{1,}\w{3,})))',text)))
Maybe you can rewrite the lambda to be shorter (something like just '+').
And BTW ' is not part of \w or \s
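A self-contained sketch of the lookahead trick: the first word is consumed as usual, while the second word is captured inside the lookahead and therefore stays available as the first word of the next pair. The variable names are just for this example.

```python
import re

text = "Hello world. It is a nice day today. Don't you think so?"

# Group 1 is the first word (consumed); group 2 sits inside the lookahead,
# so the second word is captured without being consumed.
pairs = re.findall(r'(\w{3,})(?=(\s+\w{3,}))', text)
results = [first + rest for first, rest in pairs]
print(results)
```

Because neither \w nor \s matches "." or "'", pairs around "today." and "Don't" are still missing from the result.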
Something like this with additional checks for list boundaries should do:
>>> text = "Hello world. It is a nice day today. Don't you think so?"
>>> k = text.split()
>>> k
['Hello', 'world.', 'It', 'is', 'a', 'nice', 'day', 'today.', "Don't", 'you', 'think', 'so?']
>>> z = [x for x in k if len(x) > 2]
>>> z
['Hello', 'world.', 'nice', 'day', 'today.', "Don't", 'you', 'think', 'so?']
>>> [z[n]+ " " + z[n+1] for n in range(0, len(z)-1, 2)]
['Hello world.', 'nice day', "today. Don't", 'you think']
>>>
There are two problems with your approach:
Neither \w nor \s matches punctuation.
When you match a string with a regular expression using findall, that part of the string is consumed. Searching for the next match commences immediately after the end of the previous match. Because of this a word can't be included in two separate matches.
To solve the first issue you need to decide what you mean by a word. Regular expressions aren't good for this sort of parsing. You might want to look at a natural language parsing library instead.
But assuming that you can come up with a regular expression that works for your needs, to fix the second problem you can use a lookahead assertion to check the second word. This won't return the entire match as you want but you can at least find the first word in each word pair using this method.
re.findall(r'\w{3,}(?=\s{1,}\w{3,})',text)
                   ^^^            ^
                   lookahead assertion
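If a word is assumed to be any run of word characters and apostrophes (an assumption, since the question does not define it), one way around both problems is to tokenize first and then pair up neighbouring words:

```python
import re

text = "Hello world. It is a nice day today. Don't you think so?"

# Assumption: a "word" is a run of word characters and apostrophes.
words = re.findall(r"[\w']+", text)

# Pair each word with its right-hand neighbour and keep the pair only
# when both words have at least three characters.
pairs = [f"{a} {b}" for a, b in zip(words, words[1:])
         if len(a) >= 3 and len(b) >= 3]
print(pairs)
```

Since the pairing happens on the token list rather than inside the regex, a word can appear in two pairs without any lookahead.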