Regex OR statement in any order - Python

Python regular expression: I have a string that contains keywords, but sometimes the keywords don't exist and they are not in any particular order. I need help with the regular expression.
Keywords are:
Up-to-date
date added
date trained
These are the keywords I need to find amongst a number of other keywords; they may not all exist and can appear in any order.
What the string looks like:
<div>
<h2 class='someClass'>text</h2>
blah blah blah Up-to-date blah date added blah
</div>
What I've tried:
regex = re.compile('</h2>.*(Up\-to\-date|date\sadded|date\strained)*.*</div>')
regex = re.compile('</h2>.*(Up\-to\-date?)|(date\sadded?)|(date\strained?).*</div>')
re.findall(regex,string)
The outcome I'm looking for would be:
If all exist:
['Up-to-date','date added','date trained']
If only some exist:
['Up-to-date','','date trained']

Does it have to be a regex? If not, you could use find:
In [12]: sentence = 'hello world cat dog'
In [13]: words = ['cat', 'bear', 'dog']
In [15]: [w*(sentence.find(w)>=0) for w in words]
Out[15]: ['cat', '', 'dog']
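Applied to the keywords from the question (a rough sketch; the_str stands in for your HTML string):
keywords = ['Up-to-date', 'date added', 'date trained']
the_str = "<h2 class='someClass'>text</h2> blah blah blah Up-to-date blah date added blah"
# str.find returns -1 when absent, so the boolean keeps the keyword or yields ''
print [k * (the_str.find(k) >= 0) for k in keywords]
# ['Up-to-date', 'date added', '']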

This code does what you want, but it smells:
import re

def check(the_str):
    output_list = []
    u2d = re.compile('</h2>.*Up\-to\-date.*</div>')
    da = re.compile('</h2>.*date\sadded.*</div>')
    dt = re.compile('</h2>.*date\strained.*</div>')
    if re.match(u2d, the_str):
        output_list.append("Up-to-date")
    if re.match(da, the_str):
        output_list.append("date added")
    if re.match(dt, the_str):
        output_list.append("date trained")
    return output_list
the_str = "</h2>My super cool string with the date added and then some more text</div>"
print check(the_str)
the_str2 = "</h2>My super cool string date added with the date trained and then some more text</div>"
print check(the_str2)
the_str3 = "</h2>My super cool string date added with the date trained and then Up-to-date some more text</div>"
print check(the_str3)
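A less smelly variant along the same lines (a sketch that drives the checks from a list instead of duplicating them):
import re

def check(the_str):
    keywords = ['Up-to-date', 'date added', 'date trained']
    # re.escape handles the '-' safely; search instead of match
    # so the keyword can sit anywhere in the string
    return [k for k in keywords if re.search(re.escape(k), the_str)]

print check("</h2>My super cool string with the date added and then some more text</div>")
# ['date added']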

Related

Match Text Within Parenthesis Multiple Times

Assume I have text like this:
<p>Joe likes <ul><li>pizza</li>, <li>burgers</li>, and <li>fries</li></ul></p>
I want to use a single regex to extract all of the text between the <li></li> tags using Python.
regexp = <p>.+?(<li>.+?</li>).+?</p>
This only returns the first item in the list surrounded by the <li></li> tags:
<li>pizza</li>
Is there a way for me to grab all of the items between the <li></li> tags so my output would look like:
<li>pizza</li><li>burgers</li><li>fries</li>
This should work:
import re
source = '<p>Joe likes <ul><li>pizza</li>, <li>burgers</li>, and <li>fries</li></ul></p>'
res = ''.join(re.findall('<li>[^<]*</li>', source))
# <li>pizza</li><li>burgers</li><li>fries</li>
Assuming you have already extracted the example string you mention, you can do:
import re
s = "<p>Joe likes <ul><li>pizza</li>, <li>burgers</li>, and <li>fries</li></ul></p>"
re.findall("<li>.+?</li>", s)
Output:
['<li>pizza</li>', '<li>burgers</li>', '<li>fries</li>']
Why do you need the <p> tags?
import re
source = '<p>Joe likes <ul><li>pizza</li>, <li>burgers</li>, and <li>fries</li></ul></p>'
m = re.findall('(<li>.+?</li>)',source)
print m
returns what you want.
Edit
If you only want text that is between <p> tags you can do it in two steps :
import re
source = '<p>Joe likes <ul><li>pizza</li>, <li>burgers</li>, and <li>fries</li></ul></p> and also <li>coke</li>'
ss = re.findall('<p>(.+?)</p>',source)
for s in ss:
    m = re.findall('(<li>.+?</li>)', s)
    print m
Try this regex with re.findall()
To get the text: <li>([^<]*)</li>; to get the tags: <li>[^<]*</li>
>>> import re
>>> s = "<p>Joe likes <ul><li>pizza</li>, <li>burgers</li>, and <li>fries</li></ul></p>"
>>> text=re.findall("<li>([^<]*)</li>", s)
>>> tag=re.findall("<li>[^<]*</li>", s)
>>> text
['pizza', 'burgers', 'fries']
>>> tag
['<li>pizza</li>', '<li>burgers</li>', '<li>fries</li>']

python word grouping based on words before and after

I am trying to create groups of words. First I count all the words, then I establish the top 10 words by word count, and then I want to create 10 groups based on those top 10. Each group consists of all the words that appear before and after its top word.
I have survey results stored in a python pandas dataframe structured like this
Question_ID | Customer_ID | Answer
1           | 234         | Data is very important to use because ...
2           | 234         | We value data since we need it ...
I also saved the answers column as a string.
I am using the following code to find the 3 words before and after a given word (I actually had to create a string out of the Answer column):
answers_str = df.Answer.apply(str)
for value in answers_str:
    non_data = re.split('data|Data', value)
    terms_list = [term for term in non_data if len(term) > 0]  # skip empty terms
    substrs = [term.split()[0:3] for term in terms_list]  # slice and grab first three terms
    result = [' '.join(term) for term in substrs]  # combine the terms back into substrings
    print result
I have been creating the groups of words manually - but is there a way of doing it in Python?
So based on the example shown above the group with word counts would look like this:
group "data":
data : 2
important: 1
value: 1
need:1
then when it goes through the whole file, there would be another group:
group "analytics:
analyze: 5
report: 7
list: 10
visualize: 16
The idea would be to get rid of "we", "to", "is" as well - but I can do that manually if it's not possible.
Then I want to establish the 10 most-used words (by word count) and create 10 groups with the words that appear before and after those top 10 words.
We can use a regex for this. We'll be using this regular expression
((?:\b\w+?\b\s*){0,3})[dD]ata((?:\s*\b\w+?\b){0,3})
which you can test for yourself, to extract the three words before and after each occurrence of data.
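A minimal check of what the pattern captures, on a made-up sentence:
import re
pat = r'((?:\b\w+?\b\s*){0,3})[dD]ata((?:\s*\b\w+?\b){0,3})'
print re.findall(pat, 'We value data since we need it')
# [('We value ', ' since we need')]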
First, let's remove all the words we don't like from the strings.
import re
# If you're processing a lot of sentences, it's probably wise to preprocess
# the pattern, assuming that bad_words is the same for all sentences
def remove_words(sentence, bad_words):
    pat = r'(?:{})'.format(r'|'.join(bad_words))
    return re.sub(pat, '', sentence, flags=re.IGNORECASE)
Then we want to get the words that surround data in each line:
data_pat = r'((?:\b\w+?\b\s*){0,3})[dD]ata((?:\s*\b\w+?\b){0,3})'
res = re.findall(data_pat, sentence, flags=re.IGNORECASE)
gives us a list of tuples of strings. We want to get a list of those strings after they are split.
from itertools import chain
list_of_words = list(chain.from_iterable(map(str.split, chain.from_iterable(map(chain, chain(res))))))
That's not pretty, but it works. Basically, we pull the tuples out of the list, pull the strings out of each tuples, then split each string then pull all the strings out of the lists they end up in into one big list.
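If the chain gymnastics are hard to follow, an equivalent nested comprehension does the same flattening:
list_of_words = [word
                 for tup in res              # each match is a (before, after) tuple
                 for group in tup            # each captured string
                 for word in group.split()]  # the individual words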
Let's put this all together with your pandas code. pandas isn't my strongest area, so please don't assume that I haven't made some elementary mistake if you see something weird looking.
import re
from itertools import chain
from collections import Counter

def remove_words(sentence, bad_words):
    pat = r'(?:{})'.format(r'|'.join(bad_words))
    return re.sub(pat, '', sentence, flags=re.IGNORECASE)

bad_words = ['we', 'is', 'to']
sentence_list = df.Answer.apply(lambda x: remove_words(str(x), bad_words))
c = Counter()
data_pat = r'((?:\b\w+?\b\s*){0,3})data((?:\s*\b\w+?\b){0,3})'
for sentence in sentence_list:
    res = re.findall(data_pat, sentence, flags=re.IGNORECASE)
    words = chain.from_iterable(map(str.split, chain.from_iterable(map(chain, chain(res)))))
    c.update(words)
The nice thing about the regex we're using is that all of the complicated parts don't care about what word we're using. With a slight change, we can make a format string
base_pat = r'((?:\b\w+?\b\s*){{0,3}}){}((?:\s*\b\w+?\b){{0,3}})'
such that
base_pat.format('data') == data_pat
So, given some list key_words of words we want to collect information about:
import re
from itertools import chain
from collections import Counter

def remove_words(sentence, bad_words):
    pat = r'(?:{})'.format(r'|'.join(bad_words))
    return re.sub(pat, '', sentence, flags=re.IGNORECASE)

bad_words = ['we', 'is', 'to']
sentence_list = df.Answer.apply(lambda x: remove_words(str(x), bad_words))
key_words = ['data', 'analytics']
d = {}
base_pat = r'((?:\b\w+?\b\s*){{0,3}}){}((?:\s*\b\w+?\b){{0,3}})'
for keyword in key_words:
    key_pat = base_pat.format(keyword)
    c = Counter()
    for sentence in sentence_list:
        res = re.findall(key_pat, sentence, flags=re.IGNORECASE)
        words = chain.from_iterable(map(str.split, chain.from_iterable(map(chain, chain(res)))))
        c.update(words)
    d[keyword] = c
Now we have a dictionary d that maps each keyword, like data or analytics, to a Counter that maps words not on our blacklist to their counts in the vicinity of the associated keyword. Something like:
d = {'data': Counter({'important': 2,
                      'very': 3}),
     'analytics': Counter({'boring': 5,
                           'sleep': 3})}
As to how we get the top 10 words, that's basically the thing Counter is best at.
key_words, _ = zip(*Counter(w for sentence in sentence_list for w in sentence.split()).most_common(10))
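For example, on a toy list (a sketch; your real sentence_list comes from the dataframe, and tie order among equal counts is arbitrary):
from collections import Counter

sentence_list = ['data is important', 'we value data']
c = Counter(w for sentence in sentence_list for w in sentence.split())
key_words, _ = zip(*c.most_common(10))
print key_words
# ('data', 'is', 'important', 'we', 'value') -- 'data' first with count 2;
# the order of the count-1 words is arbitrary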

ordering text by the agreement with keywords

I am trying to order a number of short paragraphs by their agreement with a list of keywords, so as to present a user with the texts ordered by interest.
Let's assume I already have the list of keywords, hopefully reflecting the user's interests. I thought this was a fairly standard procedure and expected some Python package for it, but so far my Google search has not been very successful.
I can easily come up with a brute-force solution myself, but I was wondering whether somebody knows an efficient way to do this?
EDIT:
Ok here is an example:
keywords = ['cats', 'food', 'Miau']
text1 = 'This is text about dogs'
text2 = 'This is text about food'
text3 = 'This is text about cat food'
I need a procedure which leads to the order text3, text2, text1.
Thanks
This is the simplest thing I can think of:
import string

input_file = open('document.txt', 'r')
text = input_file.read()
table = string.maketrans("", "")
text = text.translate(table, string.punctuation)
wordlist = text.split()
agreement_cnt = 0
for word in list_of_keywords:
    agreement_cnt += wordlist.count(word)
got the removing punctuation bit from here: Best way to strip punctuation from a string in Python.
Something like this might be a good starting point:
>>> keywords = ['cats', 'food', 'Miau']
>>> text1 = 'This is a text about food fed to cats'
>>> matched_word_count = len(set(text1.split()).intersection(set(keywords)))
>>> print matched_word_count
2
If you want to correct for capitalization or capture word forms (i.e. 'cat' instead of 'cats'), there's obviously more to consider, though.
Taking the above and capturing match counts for a list of different strings, and then sorting the results to find the "best" match, should be relatively simple.
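A minimal sketch of that last step, reusing the example from the question (note that with exact word matching 'cat' does not count toward 'cats', so text2 and text3 tie here; stemming or word-form handling would be needed for the exact order the question asks for):
keywords = ['cats', 'food', 'Miau']
texts = ['This is text about dogs',
         'This is text about food',
         'This is text about cat food']

def score(text):
    # number of distinct keywords present in the text
    return len(set(text.split()).intersection(keywords))

for t in sorted(texts, key=score, reverse=True):
    print t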

Search in a string and obtain the 2 words before and after the match in Python

I'm using Python to search for some words (also multi-token) in a description (string).
To do that I'm using a regex like this:
result = re.search(word, description, re.IGNORECASE)
if result:
    print ("Found: " + result.group())
But what I need is to obtain the 2 words before and after the match. For example, if I have something like this:
Parking here is horrible, this shop sucks.
"here is" is the phrase that I am looking for. So after I match it with my regex I need the 2 words (if they exist) before and after the match.
In the example:
Parking here is horrible, this
"Parking" and "horrible, this" are the words that I need.
ATTENTION
The description can be very long and the pattern "here is" can appear multiple times.
How about string operations?
line = 'Parking here is horrible, this shop sucks.'
before, term, after = line.partition('here is')
before = before.rsplit(maxsplit=2)[-2:]
after = after.split(maxsplit=2)[:2]
Result:
>>> before
['Parking']
>>> after
['horrible,', 'this']
Try this regex: ((?:[a-z,]+\s+){0,2})here is\s+((?:[a-z,]+\s*){0,2})
with re.findall and re.IGNORECASE set
I would do it like this (edit: added anchors to cover most cases):
(\S+\s+|^)(\S+\s+|)here is(\s+\S+|)(\s+\S+|$)
Like this you will always have 4 groups (might have to be trimmed) with the following behavior:
If group 1 is empty, there was no word before (group 2 is empty too)
If group 2 is empty, there was only one word before (group 1)
If group 1 and 2 are not empty, they are the words before in order
If group 3 is empty, there was no word after
If group 4 is empty, there was only one word after
If group 3 and 4 are not empty, they are the words after in order
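A quick sketch of how those groups come out on the example sentence (whitespace is kept, hence the trimming note above):
import re
pat = r'(\S+\s+|^)(\S+\s+|)here is(\s+\S+|)(\s+\S+|$)'
m = re.search(pat, 'Parking here is horrible, this shop sucks.')
print m.groups()
# ('Parking ', '', ' horrible,', ' this') -- group 2 empty: only one word before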
Based on your clarification, this becomes a bit more complicated. The solution below deals with scenarios where the searched pattern may in fact also be in the two preceding or two subsequent words.
line = "Parking here is horrible, here is great here is mediocre here is here is "
print line
pattern = "here is"
r = re.search(pattern, line, re.IGNORECASE)
output = []
if r:
while line:
before, match, line = line.partition(pattern)
if match:
if not output:
before = before.split()[-2:]
else:
before = ' '.join([pattern, before]).split()[-2:]
after = line.split()[:2]
output.append((before, after))
print output
Output from my example would be:
[(['Parking'], ['horrible,', 'here']), (['is', 'horrible,'], ['great', 'here']), (['is', 'great'], ['mediocre', 'here']), (['is', 'mediocre'], ['here', 'is']), (['here', 'is'], [])]

How can I count phrases and use the phrases as headers in Python?

I have a file in which I am trying to obtain counts of phrases. There are about 100 phrases I need to count in certain lines of text. As a simple example, I have the following:
phrases = """hello
name
john doe
"""
text1 = 'id=1: hello my name is john doe. hello hello. how are you?'
text2 = 'id=2: I am good. My name is Jane. Nice to meet you John Doe'
header = ''
for phrase in phrases.splitlines():
    header = header + '|' + phrase
header = 'id' + header
I'd like to be able to have output which looks like this:
id|hello|name|john doe
1|3|1|1
2|0|1|1
I have the header down. I'm just not sure how to count each phrase and append the output.
Create a list of the headers
In [6]: p=phrases.strip().split('\n')
In [7]: p
Out[7]: ['hello', 'name', 'john doe']
Use a regex with word boundaries, i.e. \b, to get the number of occurrences while avoiding partial matches. The flag re.I makes the search case-insensitive.
In [11]: import re
In [14]: re.findall(r'\b%s\b' % p[0], text1)
Out[14]: ['hello', 'hello', 'hello']
In [15]: re.findall(r'\b%s\b' % p[0], text1, re.I)
Out[15]: ['hello', 'hello', 'hello']
In [16]: re.findall(r'\b%s\b' % p[1], text1, re.I)
Out[16]: ['name']
In [17]: re.findall(r'\b%s\b' % p[2], text1, re.I)
Out[17]: ['john doe']
Put a len() around that to get the number of matches found.
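For instance, continuing the session above:
In [18]: len(re.findall(r'\b%s\b' % p[0], text1, re.I))
Out[18]: 3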
You can count words in a string using .count()
>>> text1.lower().count('hello')
3
so this should work (aside from the mismatches mentioned in the comments below)
phrases = """hello
name
john doe
"""
text1 = 'id=1: hello my name is john doe. hello hello. how are you?'
text2 = 'id=2: I am good. My name is Jane. Nice to meet you John Doe'
texts = [text1, text2]
header = ''
for phrase in phrases.splitlines():
    header = header + '|' + phrase
header = 'id' + header
print header
for id, text in enumerate(texts, 1):  # the example ids start at 1, not 0
    textcount = [id]
    for phrase in header.split('|')[1:]:
        textcount.append(text.lower().count(phrase))
    print "|".join(map(str, textcount))
The above assumes you have a list of the texts in order of their ids, but if they all begin with 'id=n' you could do something like:
for text in texts:
    id = text[3]  # assumes the id is the 4th char
    textcount = [id]
While it doesn't answer your question (@askewchan and @Fredrik have done that), I thought I'd offer some advice about the rest of your approach:
You might be better served by defining your phrases in a list:
phrases = ['hello', 'name', 'john doe']
which then lets you skip the loop in creating the header:
header = 'id|' + '|'.join(phrases)
and you can leave out the .split ('|')[1:] part in askewchan's answer, for example, in favour of just for phrase in phrases:
phrases = """hello
name
john doe
"""
text1 = 'id=1: hello my name is john doe. hello hello. how are you?'
text2 = 'id=2: I am good. My name is Jane. Nice to meet you John Doe'
import re
import collections
txts = [text1, text2]
# note: phrases.split() splits on all whitespace, so 'john doe'
# becomes the two separate phrases 'john' and 'doe' (see the output below)
phrase_list = phrases.split()
print "id|%s" % "|".join([p for p in phrase_list])
for txt in txts:
    (tid, rest) = re.match("id=(\d):\s*(.*)", txt).groups()
    counter = collections.Counter(re.findall("\w+", rest))
    print "%s|%s" % (tid, "|".join([str(counter.get(p, 0)) for p in phrase_list]))
Gives:
id|hello|name|john|doe
1|3|1|1|1
2|0|1|0|0
