python regex finding all groups of words

Here is what I have so far
text = "Hello world. It is a nice day today. Don't you think so?"
re.findall(r'\w{3,}\s{1,}\w{3,}', text)
#['Hello world', 'nice day', 'you think']
The desired output would be ['Hello world', 'nice day', 'day today', "today Don't", "Don't you", 'you think']
Can this be done with a simple regex pattern?

import itertools as it
import re
three_pat=re.compile(r'\w{3}')
text = "Hello world. It is a nice day today. Don't you think so?"
for key, group in it.groupby(text.split(), lambda x: bool(three_pat.match(x))):
    if key:
        group = list(group)
        for i in range(0, len(group) - 1):
            print(' '.join(group[i:i + 2]))
# Hello world.
# nice day
# day today.
# today. Don't
# Don't you
# you think
It's not clear to me what you want done with the punctuation. On the one hand, it looks like you want periods removed but apostrophes kept. It would be easy to implement the removal of periods, but before I do, would you clarify what should happen to all punctuation?
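For what it's worth, if periods should go but apostrophes stay (my assumption about the desired handling), a minimal sketch of that variant strips trailing periods from each token before pairing:
import itertools as it
import re

three_pat = re.compile(r'\w{3}')
text = "Hello world. It is a nice day today. Don't you think so?"

for key, group in it.groupby(text.split(), lambda x: bool(three_pat.match(x))):
    if key:
        # drop trailing periods but keep apostrophes such as in "Don't"
        group = [word.rstrip('.') for word in group]
        for i in range(len(group) - 1):
            print(' '.join(group[i:i + 2]))
# Hello world / nice day / day today / today Don't / Don't you / you think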

map(lambda x: x[0] + x[1], re.findall(r'(\w{3,}(?=(\s{1,}\w{3,})))', text))
Maybe the lambda can be rewritten as something shorter.
And by the way, ' is not part of \w or \s, so a word like "Don't" will not match in full.
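A runnable Python 3 version of that idea (map returns an iterator in Python 3, and findall returns (word, lookahead) tuples here, so a list comprehension is simpler):
import re

text = "Hello world. It is a nice day today. Don't you think so?"
# each match is a (first_word, whitespace_plus_second_word) tuple
pairs = [a + b for a, b in re.findall(r'(\w{3,}(?=(\s{1,}\w{3,})))', text)]
print(pairs)
# ['Hello world', 'nice day', 'day today', 'you think']
# ("today Don't" and "Don't you" are missed because ' is not in \w)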

Something like this with additional checks for list boundaries should do:
>>> text = "Hello world. It is a nice day today. Don't you think so?"
>>> k = text.split()
>>> k
['Hello', 'world.', 'It', 'is', 'a', 'nice', 'day', 'today.', "Don't", 'you', 'think', 'so?']
>>> z = [x for x in k if len(x) > 2]
>>> z
['Hello', 'world.', 'nice', 'day', 'today.', "Don't", 'you', 'think', 'so?']
>>> [z[n]+ " " + z[n+1] for n in range(0, len(z)-1, 2)]
['Hello world.', 'nice day', "today. Don't", 'you think']

There are two problems with your approach:
Neither \w nor \s matches punctuation.
When you match a string with a regular expression using findall, that part of the string is consumed. Searching for the next match commences immediately after the end of the previous match. Because of this a word can't be included in two separate matches.
To solve the first issue you need to decide what you mean by a word. Regular expressions aren't good for this sort of parsing. You might want to look at a natural language parsing library instead.
But assuming that you can come up with a regular expression that works for your needs, to fix the second problem you can use a lookahead assertion to check the second word. This won't return the entire match as you want but you can at least find the first word in each word pair using this method.
re.findall(r'\w{3,}(?=\s{1,}\w{3,})', text)
                   ^^^            ^
                   lookahead assertion
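Capturing the second word inside the lookahead recovers the full pairs, since the lookahead consumes nothing; a minimal sketch (the apostrophe caveat still applies):
import re

text = "Hello world. It is a nice day today. Don't you think so?"
# group 2 sits inside the lookahead, so the second word of one pair
# can still start the next match
pairs = [' '.join(m) for m in re.findall(r'(\w{3,})(?=\s{1,}(\w{3,}))', text)]
print(pairs)
# ['Hello world', 'nice day', 'day today', 'you think']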

Related

Multiline text splitting

Sooo, I have this problem in which I have to create a list of lists that contain every word from each line with a length greater than 4. The challenge is to solve this with a one-liner.
text = '''My candle burns at both ends;
It will not last the night;
But ah, my foes, and oh, my friends—
It gives a lovely light!'''
So far I managed res = [i for ele in text.splitlines() for i in ele.split(' ') if len(i) > 4], but it returns ['candle', 'burns', 'ends;', 'night;', 'foes,', 'friends—', 'gives', 'lovely', 'light!'] instead of [['candle', 'burns', 'ends;'], ['night;'], ['foes,', 'friends—'], ['gives', 'lovely', 'light!']]
Any ideas? :D
In this case I would use a regular expression to find your results.
Doing a list comprehension over the lines, as you did, automatically places each line's matches into its own list.
This particular search pattern looks for any run of four or more word characters (letters, digits, or underscores).
import re
text = '''My candle burns at both ends;
It will not last the night;
But ah, my foes, and oh, my friends—
It gives a lovely light!'''
results = [re.findall(r'\w{4,}', line) for line in text.split('\n')]
print(results)
Output:
[['candle', 'burns', 'both', 'ends'], ['will', 'last', 'night'], ['foes', 'friends'], ['gives', 'lovely', 'light']]
If you wish to keep the special characters, you might want to look into expanding the regular expression so it includes all characters except whitespace.
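For instance, a sketch that swaps \w for \S (any non-whitespace character) so the trailing punctuation survives:
import re

text = '''My candle burns at both ends;
It will not last the night;
But ah, my foes, and oh, my friends—
It gives a lovely light!'''

results = [re.findall(r'\S{4,}', line) for line in text.splitlines()]
print(results)
# [['candle', 'burns', 'both', 'ends;'], ['will', 'last', 'night;'],
#  ['foes,', 'friends—'], ['gives', 'lovely', 'light!']]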
There are great tools to play around with if you look for "online regular expression tools" so you get some more feedback when trying to build your own patterns.
IIUC, this one-liner should work for you (without the use of additional packages):
[[w.strip(';,!—') for w in l.split() if len(w)>=4] for l in text.split('\n')]
Output:
[['candle', 'burns', 'both', 'ends'],
['will', 'last', 'night'],
['foes', 'friends'],
['gives', 'lovely', 'light']]
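If you would rather not hard-code the punctuation, string.punctuation can replace the literal ';,!—' (note it only covers ASCII, so the em dash still has to be added by hand):
import string

text = '''My candle burns at both ends;
It will not last the night;
But ah, my foes, and oh, my friends—
It gives a lovely light!'''

chars = string.punctuation + '—'  # string.punctuation is ASCII-only
print([[w.strip(chars) for w in l.split() if len(w) >= 4] for l in text.split('\n')])
# [['candle', 'burns', 'both', 'ends'], ['will', 'last', 'night'],
#  ['foes', 'friends'], ['gives', 'lovely', 'light']]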

Split text but include pattern in the first splitted part

This looks very obvious, but I couldn't find anything similar. I want to split some text and have the pattern of the split condition be part of the first split part.
some_text = "Hi there. It's a nice weather. Have a great day."
pattern = re.compile(r'\.')
splitted_text = pattern.split(some_text)
returns:
['Hi there', " It's a nice weather", ' Have a great day', '']
What I want is that it returns:
['Hi there.', " It's a nice weather.", ' Have a great day.']
btw: I am only interested in the re solution, not some nltk library that does it with other methods.
It would be simpler and more efficient to use re.findall instead of splitting in this case:
re.findall(r'[^.]*\.', some_text)
This returns:
['Hi there.', " It's a nice weather.", ' Have a great day.']
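One caveat: any text after the final period is silently dropped by that pattern. If that matters, a small variant (my sketch) lets a sentence also end at the end of the string:
import re

some_text = "Hi there. It's a nice weather. Have a great day"  # no final period
print(re.findall(r'[^.]+(?:\.|$)', some_text))
# ['Hi there.', " It's a nice weather.", ' Have a great day']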
You can use capture groups with re.split:
>>> re.split(r'([^.]+\.)', some_text)
['', 'Hi there.', '', " It's a nice weather.", '', ' Have a great day.', '']
If you want to also strip the leading spaces from the second two sentences, you can have \s* outside the capture group:
>>> re.split(r'([^.]+\.)\s*', some_text)
['', 'Hi there.', '', "It's a nice weather.", '', 'Have a great day.', '']
Or (with Python 3.7+, or with the regex module) use a zero-width lookbehind that will split immediately after a .:
>>> re.split(r'(?<=\.)', some_text)
['Hi there.', " It's a nice weather.", ' Have a great day.', '']
That will split the same even if there is no space after the ..
And you can filter the '' fields to remove the blank results from splitting:
>>> [field for field in re.split(r'([^.]+\.)', some_text) if field]
['Hi there.', " It's a nice weather.", ' Have a great day.']
You can split on the whitespace with a lookbehind to account for the period. Additionally, to account for the possibility of no whitespace, a lookahead can be used:
import re
some_text = "Hi there. It's a nice weather. Have a great day.It is a beautify day."
result = re.split(r'(?<=\.)\s|\.(?=[A-Z])', some_text)
print(result)
Output:
['Hi there.', "It's a nice weather.", 'Have a great day', 'It is a beautify day.']
re explanation:
(?<=\.) => positive lookbehind; a . must precede the position for the split to happen there.
\s => matches a whitespace character.
| => alternation; attempts to match the expression on its left first, then the one on its right.
\. => matches a literal period.
(?=[A-Z]) => lookahead; the latter period only matches when the next character is a capital letter.
If each sentence always ends with a ., it would be simpler and more efficient to use the str.split method instead of using any regular expression at all:
[s + '.' for s in some_text.split('.') if s]
This returns:
['Hi there.', " It's a nice weather.", ' Have a great day.']
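And if the leading spaces are unwanted, str.strip can be applied to each piece:
some_text = "Hi there. It's a nice weather. Have a great day."
print([s.strip() + '.' for s in some_text.split('.') if s.strip()])
# ['Hi there.', "It's a nice weather.", 'Have a great day.']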

In Python, how to check if words in a string are keys in a dictionary?

For a class I am tackling the Twitter sentiment analysis problem. I have looked at the other questions on the site and they don't help with my particular issue.
I am given a string that is one tweet with its letters changed so that they are all in lowercase. For example,
'after 23 years i still love this place. (# tel aviv kosher pizza) http://t.co/jklp0uj'
as well as a dictionary of words where the key is the word and the value is the sentiment for that word. To be more specific, a key can be a single word (such as 'hello'), more than one word separated by a space (such as 'yellow hornet'), a hyphenated compound word (such as '2-dimensional'), or a number (such as '365').
I need to find the sentiment of the tweet by adding the sentiments for every eligible word and dividing by the number of eligible words (by eligible word, I mean a word that is in the dictionary). I'm not sure of the best way to check whether a tweet contains a word from the dictionary.
I tried using the "key in string" approach, looping over all the keys, but this was problematic because there are a lot of keys, and words inside other words would be counted (e.g. 'eradicate' counts 'cat', 'ate', 'era', etc. as well).
I then tried using .split(' ') and looping through the elements of the resulting list, but I ran into problems because of punctuation and because some keys are two words.
Anyone have any ideas on how I can more suitably tackle this?
For example: using the example above, still : -0.625, love : 0.625, every other word is not in the dictionary. so this should return (-0.625 + 0.625)/2 = 0.
The whole point of dictionaries is that they are quick at looking things up:
for word in instring.split():
    if word in wordsdict:
        print(word)
You would probably do better at getting rid of punctuation (thank you, Soke) by using regular expressions rather than split, e.g.
for word in re.findall(r"[\w'-]+", instring):
    if wordsdict.get(word) is not None:
        print(word)
Of course you will have to pick some maximum length of word grouping, possibly found with a single run through the dictionary, and then take your pairs, triples, etc. and check them as well.
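A sketch of that n-gram idea (the dictionary below is a made-up example): find the longest key length in words with one pass over the dictionary, then slide a window of each size over the token list:
import re

tweet = "after 23 years i still love this place. (# tel aviv kosher pizza) http://t.co/jklp0uj"
wordsdict = {'still': -0.625, 'love': 0.625, 'tel aviv': 0.1}  # made-up sentiment values

tokens = re.findall(r"[\w'-]+", tweet)
# longest key, measured in words, found in a single pass over the dictionary
max_words = max(len(key.split()) for key in wordsdict)

scores = []
for n in range(1, max_words + 1):
    for i in range(len(tokens) - n + 1):
        phrase = ' '.join(tokens[i:i + n])
        if phrase in wordsdict:
            scores.append(wordsdict[phrase])

print(sum(scores) / len(scores) if scores else 0)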
You can use nltk, which is very powerful for what you want to do; it can also be done with split:
>>> import string
>>> a= 'after 23 years i still love this place. (# tel aviv kosher pizza) http://t.co/jklp0uj'
>>> import nltk
>>> my_dict = {'still' : -0.625, 'love' : 0.625}
>>> words = nltk.word_tokenize(a)
>>> words
['after', '23', 'years', 'i', 'still', 'love', 'this', 'place.', '(', '#', 'tel', 'aviv', 'kosher', 'pizza', ')', 'http', ':', '//t.co/jklp0uj']
>>> sum(my_dict.get(x.strip(string.punctuation),0) for x in words)/2
0.0
using split:
>>> words = a.split()
>>> words
['after', '23', 'years', 'i', 'still', 'love', 'this', 'place.', '(#', 'tel', 'aviv', 'kosher', 'pizza)', 'http://t.co/jklp0uj']
>>> sum(my_dict.get(x.strip(string.punctuation),0) for x in words)/2
0.0
my_dict.get(key, default): get returns the value if the key is found in the dictionary, otherwise it returns the default; in this case, 0.
Check this example, with 'place' added to the dictionary:
>>> import string
>>> my_dict = {'still' : -0.625, 'love' : 0.625,'place':1}
>>> a= 'after 23 years i still love this place. (# tel aviv kosher pizza) http://t.co/jklp0uj'
>>> words = nltk.word_tokenize(a)
>>> sum(my_dict.get(x.strip(string.punctuation),0) for x in words)/2
0.5
Going by the length of the dictionary keys might be one solution.
For example, you have the dict as:
Sentimentdict = {"habit":5, "bad habit":-1}
the sentence might be:
s1="He has good habit"
s2="He has bad habit"
s1 should get a better sentiment score than s2. Now, you can do this:
# check longer keys first so that "bad habit" matches before "habit"
for w in sorted(Sentimentdict.keys(), key=len, reverse=True):
    if w in s1:
        ...  # remove the matched phrase and do your sentiment calculation
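Fleshed out into a runnable sketch (the scoring and removal steps are my additions):
Sentimentdict = {"habit": 5, "bad habit": -1}
s2 = "He has bad habit"

score = 0
remaining = s2
for w in sorted(Sentimentdict, key=len, reverse=True):
    if w in remaining:
        score += Sentimentdict[w]
        # remove the matched phrase so "habit" is not counted again
        remaining = remaining.replace(w, '')

print(score)  # -1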

Should I be using regex in Python

I have a string like so:
'cathy is a singer on fridays'
and I want to be able to replace the fourth word with other verbs
so
'cathy is a dancer on fridays'
I assumed the right way to do this would be a regex that stops when it reaches the third whitespace, since regex supports groupings and * accepts any character, but I can't seem to get it working.
Any advice would be useful. I am new to Python, so please don't judge. Also, is regex appropriate for this, or should I use another method?
Thank you
No, Regex is not needed for this. See below:
>>> mystr = 'cathy is a singer on fridays'
>>> x = mystr.split()
>>> x
['cathy', 'is', 'a', 'singer', 'on', 'fridays']
>>> x[3] = "dancer"
>>> x
['cathy', 'is', 'a', 'dancer', 'on', 'fridays']
>>> " ".join(x)
'cathy is a dancer on fridays'
Or, more compact:
>>> mystr = 'cathy is a singer on fridays'
>>> x = mystr.split()
>>> " ".join(x[:3] + ["dancer"] + x[4:])
'cathy is a dancer on fridays'
>>>
The core principle here is the .split method of a string.
You can get what you want by splitting and joining the string after substituting the desired piece
stringlist = 'cathy is a singer on fridays'.split()
stringlist[3] = 'dancer'
print(' '.join(stringlist))
Here is a solution using backreferences and the sub function from re (see the re module documentation):
import re
msg = 'cathy is a singer on fridays'
print(re.sub(r'(\w+) (\w+) (\w+) (\w+)', r'\1 \2 \3 dancer', msg, count=1))
Output
cathy is a dancer on fridays
If you really just want to replace the fourth word, split/slice/join is easier:
mytext = 'cathy is a singer on fridays'
mysplit = mytext.split(' ')
' '.join(mysplit[:3] + ['dancer',] + mysplit[4:])
Regex can do much more complicated things, and there is a re.split (quick example below), and there might be a faster way to do it, but this is reasonable and readable.
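For reference, re.split works like str.split but takes a pattern, which helps when the separators are irregular:
import re

mytext = 'cathy  is a   singer on fridays'  # irregular spacing
print(re.split(r'\s+', mytext))
# ['cathy', 'is', 'a', 'singer', 'on', 'fridays']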
You can either split the string using split(' ') or use a tokenizer like nltk, which might also give you more functionality for this specific use case, such as part-of-speech analysis (see the sketch below). If you are trying to replace the word with random profession nouns, look for a word bank. Regex is overkill for what you need.
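As a sketch of the nltk route (this assumes the punkt and tagger models have been downloaded), part-of-speech tags can locate the noun to swap:
import nltk

# one-time setup: nltk.download('punkt'); nltk.download('averaged_perceptron_tagger')
tokens = nltk.word_tokenize('cathy is a singer on fridays')
print(nltk.pos_tag(tokens))
# e.g. [('cathy', 'NN'), ('is', 'VBZ'), ('a', 'DT'),
#       ('singer', 'NN'), ('on', 'IN'), ('fridays', 'NNS')]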
If you already know the position of the word you want to replace in the string, you could simply use:
def replace_word(sentence, new_word, position):
    sent_list = sentence.split()
    sent_list[position] = new_word
    return " ".join(sent_list)

Counting the number of unique words [duplicate]

I want to count unique words in a text, but I want to make sure that words followed by special characters aren't treated differently, and that the evaluation is case-insensitive.
Take this example
text = "There is one handsome boy. The boy has now grown up. He is no longer a boy now."
print(len(set(w.lower() for w in text.split())))
The result would be 16, but I expect it to return 14. The problem is that 'boy.' and 'boy' are evaluated differently, because of the punctuation.
import re
print(len(re.findall(r'\w+', text)))
Using a regular expression makes this very simple. All you need to keep in mind is to lowercase the text first, and then put the results into a set to remove duplicates.
print(len(set(re.findall(r'\w+', text.lower()))))
You can use a regex here:
In [65]: text = "There is one handsome boy. The boy has now grown up. He is no longer a boy now."
In [66]: import re
In [68]: set(m.group(0).lower() for m in re.finditer(r"\w+",text))
Out[68]:
set(['grown',
'boy',
'he',
'now',
'longer',
'no',
'is',
'there',
'up',
'one',
'a',
'the',
'has',
'handsome'])
I think you have the right idea in using the Python built-in set type.
It can be done if you first remove the punctuation by doing a replace:
text = "There is one handsome boy. The boy has now grown up. He is no longer a boy now."
punc_char = ",.?!'"
for letter in text:
    if letter == '"' or letter in punc_char:
        text = text.replace(letter, '')
text = set(text.split())
len(text)
That should work for you. If you need any other signs or punctuation marks filtered out, you can easily add them to punc_char.
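An equivalent but more direct way to delete the punctuation is str.translate; a sketch that also adds the lowercasing the question asked for:
import string

text = "There is one handsome boy. The boy has now grown up. He is no longer a boy now."
# the third argument of str.maketrans lists characters to delete
cleaned = text.translate(str.maketrans('', '', string.punctuation))
print(len(set(cleaned.lower().split())))  # 14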
First, you need to get a list of words. You can use a regex as eandersson suggested:
import re
words = re.findall(r'\w+', text)
Now, you want to get the number of unique entries. There are a couple of ways to do this. One way is to iterate through the words list and use a dictionary to keep track of the number of times you have seen each word:
cwords = {}
for word in words:
    try:
        cwords[word] += 1
    except KeyError:
        cwords[word] = 1
Finally, you can get the number of unique words with
len(cwords)
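The standard library's collections.Counter does the same bookkeeping in one line:
import re
from collections import Counter

text = "There is one handsome boy. The boy has now grown up. He is no longer a boy now."
cwords = Counter(re.findall(r'\w+', text.lower()))
print(len(cwords))  # 14 unique words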
