Regular expression that matches a few symbols but excludes others - Python

I have a paragraph, and I want to use a regular expression to extract all the words inside it.
a bdag agasg it's the cookies for dogs',don't you think so? the word 'wow' in english means.you hey b 097 dag final
I have tried several regexes with re.findall(regX, s) and found one that matches most words.
regX = "[ ,\.\?]?([a-z]+'?[a-z]?)[ ,\.\?]?"
['a', 'bdag', 'agasg', "it's", 'the', 'cookies', 'for', "dogs'", "don't", 'you', 'think', 'so', 'the', 'word', "wow'", 'in', 'english', 'means', 'you', 'hey', 'b', 'dag', 'final']
All are good except **wow'**.
I wonder if a regular expression can express the logic "the surrounding character can be a comma/space/period/etc. but can't be an apostrophe".
Can someone advise?

Try:
[ ,\.\?']?([a-z]*('\w)?)[\' ,\.\?]?
Added another group so you'll have to select only group 1.

I didn't fully understand what you wanted the output to be, but try this:
[ ,\.\?]?(["']?[a-z]+["']?[a-z]?)[ ,\.\?]?
Using this regex lets you get the ' and " within the text.
If this still isn't what you wanted, please let me know so I can update my answer.
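For reference, here is one sketch of a pattern that encodes the rule from the question directly: internal apostrophes and a trailing possessive apostrophe are kept, but when the word itself is opened by an apostrophe (a quote, as in 'wow'), the trailing apostrophe is dropped. The alternation order matters, since Python's re tries the left branch first:

```python
import re

s = ("a bdag agasg it's the cookies for dogs',don't you think so? "
     "the word 'wow' in english means.you hey b 097 dag final")

# Left branch: a word preceded by an apostrophe (an opening quote) keeps
# no trailing apostrophe; right branch: any other word may keep one.
pattern = r"(?<=')[a-z]+(?:'[a-z]+)*|[a-z]+(?:'[a-z]+)*'?"
print(re.findall(pattern, s))
```

This yields the same list as in the question, except that 'wow' comes out as plain wow while dogs' and don't keep their apostrophes.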

Python: Use regex to create a list of lists based on text sentences separated by ".", "?", or "!"

I have the following sample text that I already cleaned. Below is just a sample of it:
and can you by no drift of circumstance get from him why he puts on this confusion grating so harshly all his days of
quiet with turbulent and dangerous lunacy? he does confess he feels himself distracted. but from what cause he will by
no means speak. nor do we find him forward to be sounded but with a crafty madness keeps aloof when we would bring
him on to some confession of his true state. did he receive you well? most like a gentleman. but with much forcing of
his disposition. niggard of question. but of our demands most free in his reply.
I want to do the following:
create a list of lists named hamsplits, such that hamsplits[i] is a list of all the words in the i-th sentence of the text.
sentences should be stored in the order that they appear, and so should the words within each sentence
sentences end with '.', '?', and '!'
Desired output example:
hamsplits[0] == ['and', 'can', 'you', 'by', ..., 'dangerous', 'lunacy']
I tried the code below using just '.' as a test, but it doesn't return a list of lists:
hamsplits3 = hamsplits2.split('.')
Instead it returns this:
['\n\nand can you by no drift of circumstance get from him why he puts on this confusion grating so harshly all his days of \nquiet with turbulent and dangerous lunacy? he does confess he feels himself distracted', ' but from what cause he will by \nno means speak', ' nor do we find him forward to be sounded but with a crafty madness keeps aloof when we would bring \nhim on to some confession of his true state', ' did he receive you well? most like a gentleman', ' but with much forcing of \nhis disposition', ' niggard of question', ' but of our demands most free in his reply', " did you assay him? ... ]
What am I doing wrong? I don't want to use any imported packages other than re.
You can try findall:
import re
s = """and can you by no drift of circumstance get from him why he puts on this confusion grating so harshly all his days of
quiet with turbulent and dangerous lunacy? he does confess he feels himself distracted. but from what cause he will by
no means speak. nor do we find him forward to be sounded but with a crafty madness keeps aloof when we would bring
him on to some confession of his true state. did he receive you well? most like a gentleman. but with much forcing of
his disposition. niggard of question. but of our demands most free in his reply."""
hamsplits = [i.split() for i in re.findall(r'[^.?!]+', s)]  # split() with no arguments handles newlines and runs of spaces
print(hamsplits[0])
Output:
['and', 'can', 'you', 'by', 'no', 'drift', 'of', 'circumstance', 'get', 'from', 'him', 'why', 'he', 'puts', 'on', 'this', 'confusion', 'grating', 'so', 'harshly', 'all', 'his', 'days', 'of', 'quiet', 'with', 'turbulent', 'and', 'dangerous', 'lunacy']
You can try this:
import re
with open('input_text.txt') as file:
    hamsplits = [ele.split() for ele in re.split('[.?!]', file.read())]
print(hamsplits[0])
output:
['and', 'can', 'you', 'by', 'no', 'drift', 'of', 'circumstance', 'get', 'from', 'him', 'why', 'he', 'puts', 'on', 'this', 'confusion', 'grating', 'so', 'harshly', 'all', 'his', 'days', 'of', 'quiet', 'with', 'turbulent', 'and', 'dangerous', 'lunacy']
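One detail worth noting with the re.split approach: the final '.' leaves an empty string at the end of the split, which becomes an empty list after .split(). A small filter removes it. A sketch on an in-memory string (in the question, the text would come from hamsplits2 or the file):

```python
import re

s = "did he receive you well? most like a gentleman. niggard of question."
# Drop the empty/whitespace-only pieces that re.split leaves after the final '.'
hamsplits = [sent.split() for sent in re.split(r'[.?!]', s) if sent.strip()]
print(hamsplits[0])  # ['did', 'he', 'receive', 'you', 'well']
```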

How to check for words that are not immediately followed by a keyword, how about words not surrounded by the keyword?

I am trying to look for words that do not immediately come before 'the'.
I performed a positive lookbehind, (?<=the\W), to get the words that come after the keyword 'the'. However, I am unable to capture 'people' and 'that', since that logic does not apply to words that have no 'the' before or after them.
p = re.compile(r'(?<=the\W)\w+')
m = p.findall('the part of the fair that attracts the most people is the fireworks')
print(m)
The current output I am getting is:
['part', 'fair', 'most', 'fireworks']
Edit:
Thank you for all the help below. Using the suggestions in the comments, I managed to update my code.
p = re.compile(r"\b(?!the)(\w+)(\W\w+\Wthe)?")
m = p.findall('the part of the fair that attracts the most people is the fireworks')
This brings me closer to the output I need to get.
Updated Output:
[('part', ' of the'), ('fair', ''),
('that', ' attracts the'), ('most', ''),
('people', ' is the'), ('fireworks', '')]
I just need the strings ('part','fair','that','most','people','fireworks').
Any advice?
I am trying to look for words that do not immediately come before 'the'.
Note that the code below does not use re.
words = 'the part of the fair that attracts the most people is the fireworks'
words_list = words.split()
words_not_before_the = []
for idx, w in enumerate(words_list):
    if idx < len(words_list) - 1 and words_list[idx + 1] != 'the':
        words_not_before_the.append(w)
words_not_before_the.append(words_list[-1])
print(words_not_before_the)
output
['the', 'part', 'the', 'fair', 'that', 'the', 'most', 'people', 'the', 'fireworks']
using regex:
import re
m = re.sub(r'\b(\w+)\b the', 'the', 'the part of the fair that attracts the most people is the fireworks')
print([word for word in m.split(' ') if not word.isspace() and word])
output:
['the', 'part', 'the', 'fair', 'that', 'the', 'most', 'people', 'the', 'fireworks']
I am trying to look for words that do not immediately come before 'the'.
Try this:
import re
# The capture group (\w+) matches a word that is followed by another word, followed by the word "the"
p = re.compile(r'(\w+)\W\w+\Wthe')
m = p.findall('the part of the fair that attracts the most people is the fireworks')
print(m)
Output:
['part', 'that', 'people']
Try to spin it around: instead of finding the words that do not immediately precede 'the', eliminate all the words that do immediately precede 'the' (along with 'the' itself).
import re
test = "the part of the fair that attracts the most people is the fireworks"
pattern = r"\s\w*\sthe|the\s"
print(re.sub(pattern, "", test))
output: part fair that most people fireworks
I have finally solved the question. Thank you all!
p = re.compile(r"\b(?!the)(\w+)(?:\W\w+\Wthe)?")
m = p.findall('the part of the fair that attracts the most people is the fireworks')
print(m)
Added '?:' to make the second group non-capturing, so findall returns only group 1.
Output:
['part', 'fair', 'that', 'most', 'people', 'fireworks']
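For what it's worth, the condition "not immediately before 'the'" can also be stated directly with a negative lookahead, avoiding the optional consuming group. This is just an alternative sketch:

```python
import re

s = 'the part of the fair that attracts the most people is the fireworks'
# \b(?!the\b)    skips the word 'the' itself
# \b(?!\s+the\b) rejects words whose next word is 'the'
m = re.findall(r'\b(?!the\b)\w+\b(?!\s+the\b)', s)
print(m)  # ['part', 'fair', 'that', 'most', 'people', 'fireworks']
```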

Python: keep apostrophe with verbs

I would like to tokenize a list of sentences, but keep negated verbs as single words.
from nltk.tokenize import word_tokenize

t = """As aren't good. Bs are good"""
print(word_tokenize(t))
['As', 'are', "n't", 'good', '.', 'Bs', 'are', 'good']
I would like "aren't" to stay as a single token, separate from "are". With word_tokenize I get "n't" instead. The same happens with other negated forms (couldn't, didn't, etc.).
How can I do it?
Thanks in advance
If you want to extract individual words from a space-separated sentence, use Python's split() method.
t = "As aren't good. Bs are good"
print (t.split())
['As', "aren't", 'good.', 'Bs', 'are', 'good']
You can specify other delimiters in the split() method as well. For example, if you wanted to tokenize your string based on a full-stop, you could do something like this:
print (t.split("."))
["As aren't good", ' Bs are good']
Read the documentation here.
Use split from the re module: https://docs.python.org/2/library/re.html
import re
t = "As aren't good. Bs are good"
list(filter(None, re.split(r"[\s.]+", t)))
output:
['As', "aren't", 'good', 'Bs', 'are', 'good']
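If you want punctuation kept as separate tokens (as word_tokenize does) while leaving contractions intact, a findall-based tokenizer is another option. This sketch treats a word plus an optional 'xx contraction tail as one token:

```python
import re

t = "As aren't good. Bs are good"
# \w+(?:'\w+)?  a word, optionally glued to an 'xx contraction tail
# [^\w\s]       any single punctuation character as its own token
tokens = re.findall(r"\w+(?:'\w+)?|[^\w\s]", t)
print(tokens)  # ['As', "aren't", 'good', '.', 'Bs', 'are', 'good']
```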

How to write a Python program that reads from a text file and builds a dictionary which maps each word

I am having difficulties with writing a Python program that reads from a text file, and builds a dictionary which maps each word that appears in the file to a list of all the words that immediately follow that word in the file. The list of words can be in any order and should include duplicates.
For example,the key "and" might have the list ["then", "best", "after", ...] listing all the words which came after "and" in the text.
Any ideas would be a great help.
A couple of ideas:
Set up a collections.defaultdict for your output. This is a dictionary with a default value for keys that don't yet exist (in this case, as aelfric5578 suggests, an empty list);
Build a list of all the words in your file, in order; and
You can use zip(lst, lst[1:]) to create pairs of consecutive list elements.
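Putting those three ideas together might look like the sketch below; `text` is a stand-in for your file's contents, which you would get with something like `open('yourfile.txt').read()`:

```python
from collections import defaultdict

# Stand-in for the file contents; in practice: text = open('yourfile.txt').read()
text = "and then best and then after and best"
words = text.split()

follows = defaultdict(list)              # missing keys start as an empty list
for word, nxt in zip(words, words[1:]):  # consecutive pairs of words
    follows[word].append(nxt)

print(follows['and'])  # ['then', 'then', 'best']
```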
Welcome to stackoverflow.com.
Are you sure you need a dictionary?
It will take a lot of memory if the text is long, just to repeat the same data several times under several entries.
A function, on the other hand, will give you the desired list(s) on demand.
For example:
s = """In Newtonian physics, free fall is any motion
of a body where its weight is the only force acting
upon it. In the context of general relativity where
gravitation is reduced to a space-time curvature,
a body in free fall has no force acting on it and
it moves along a geodesic. The present article
concerns itself with free fall in the Newtonian domain."""
import re
def say_me(word, li=re.split(r'\s+', s)):
    for i, w in enumerate(li):
        if w == word:
            print('\n%s at index %d followed by\n%s' % (w, i, li[i+1:]))

say_me('free')
result
free at index 3 followed by
['fall', 'is', 'any', 'motion', 'of', 'a', 'body', 'where', 'its', 'weight', 'is', 'the', 'only', 'force', 'acting', 'upon', 'it.', 'In', 'the', 'context', 'of', 'general', 'relativity', 'where', 'gravitation', 'is', 'reduced', 'to', 'a', 'space-time', 'curvature,', 'a', 'body', 'in', 'free', 'fall', 'has', 'no', 'force', 'acting', 'on', 'it', 'and', 'it', 'moves', 'along', 'a', 'geodesic.', 'The', 'present', 'article', 'concerns', 'itself', 'with', 'free', 'fall', 'in', 'the', 'Newtonian', 'domain.']
free at index 38 followed by
['fall', 'has', 'no', 'force', 'acting', 'on', 'it', 'and', 'it', 'moves', 'along', 'a', 'geodesic.', 'The', 'present', 'article', 'concerns', 'itself', 'with', 'free', 'fall', 'in', 'the', 'Newtonian', 'domain.']
free at index 58 followed by
['fall', 'in', 'the', 'Newtonian', 'domain.']
The assignment li=re.split(r'\s+', s) binds the parameter li to the object re.split(r'\s+', s) passed as its default argument.
This binding happens only once: at the moment the interpreter reads the function definition and creates the function object. li is simply a parameter defined with a default argument.
Here is what I would do:
from collections import defaultdict
# My example line :
s = 'In the face of ambiguity refuse the temptation to guess'
# Previous string is quite easy to tokenize but in real world, you'll have to :
# Remove comma, dot, etc...
# Probably encode to ascii (unidecode 3rd party module can be helpful)
# You'll also probably want to normalize case
lst = s.lower().split(' ') # naive tokenizer
ddic = defaultdict(list)
for word1, word2 in zip(lst, lst[1:]):
    ddic[word1].append(word2)
# ddic contains what you want (but is a defaultdict)
# if you want to work with a "classical" dictionary, just cast it:
# (often it's not needed)
dic = dict(ddic)
Sorry if I seem to be stealing the commenters' ideas, but this is almost the same code I used in some of my projects (pre-computation for similar-document algorithms).

Python title() with apostrophes

Is there a way to use .title() to get the correct output from a title with apostrophes? For example:
"john's school".title() --> "John'S School"
How would I get the correct title here, "John's School" ?
If your titles do not contain several whitespace characters in a row (which would be collapsed), you can use string.capwords() instead:
>>> import string
>>> string.capwords("john's school")
"John's School"
EDIT: As Chris Morgan rightfully says below, you can alleviate the whitespace collapsing issue by specifying " " in the sep argument:
>>> string.capwords("john's school", " ")
"John's School"
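If you'd rather not depend on capwords's split/join behavior at all, the same effect can be sketched with re.sub, uppercasing only letters at the start of the string or after whitespace, so letters after apostrophes are left alone (the function name here is just for illustration):

```python
import re

def title_keep_apostrophes(s):
    # Uppercase a lowercase letter only at the start of the string or
    # after whitespace; the letter after an apostrophe is untouched.
    return re.sub(r"(?:^|(?<=\s))[a-z]", lambda m: m.group().upper(), s)

print(title_keep_apostrophes("john's school"))  # John's School
```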
This is difficult in the general case, because some single apostrophes are legitimately followed by an uppercase character, such as Irish names starting with "O'". string.capwords() will work in many cases, but ignores anything in quotes. string.capwords("john's principal says,'no'") will not return the result you may be expecting.
>>> capwords("John's School")
"John's School"
>>> capwords("john's principal says,'no'")
"John's Principal Says,'no'"
>>> capwords("John O'brien's School")
"John O'brien's School"
A more annoying issue is that title itself does not produce the proper results. For example, in American English usage, articles and prepositions are generally not capitalized in titles or headlines (Chicago Manual of Style).
>>> capwords("John clears school of spiders")
'John Clears School Of Spiders'
>>> "John clears school of spiders".title()
'John Clears School Of Spiders'
You can easy_install the titlecase module that will be much more useful to you, and does what you like, without capwords's issues. There are still many edge cases, of course, but you'll get much further without worrying too much about a personally-written version.
>>> titlecase("John clears school of spiders")
'John Clears School of Spiders'
I think that can be tricky with title().
Let's try something different:
def titlize(s):
    b = []
    for temp in s.split(' '):
        b.append(temp.capitalize())
    return ' '.join(b)

titlize("john's school")
# You get: John's School
Hope that helps!
Although the other answers are helpful, and more concise, you may run into some problems with them, for example if there are newlines or tabs in your string. Hyphenated words (whether with regular or non-breaking hyphens) may also be a problem in some instances, as can words that begin with apostrophes. However, using regular expressions (with a function as the replacement argument) you can solve these problems:
import re

def title_capitalize(match):
    text = match.group()
    i = 0
    new_text = ""
    capitalized = False
    while i < len(text):
        if text[i] not in {"’", "'"} and capitalized == False:
            new_text += text[i].upper()
            capitalized = True
        else:
            new_text += text[i].lower()
        i += 1
    return new_text

def title(the_string):
    return re.sub(r"[\w'’‑-]+", title_capitalize, the_string)

s = "here's an apostrophe es. this string has multiple spaces\nnew\n\nlines\nhyphenated words: and non-breaking   spaces, and a non‑breaking hyphen, as well as 'ords that begin with ’strophies; it\teven\thas\t\ttabs."
print(title(s))
Anyway, you can edit this to compensate for any further problems, such as backticks and what-have-you, if needed.
If you're of the opinion that title casing should keep words such as prepositions, conjunctions and articles lowercase unless they're at the beginning or end of the title, you can try something like this code (but there are a few ambiguous words that you'll have to figure out by context, such as when):
import re
lowers={'this', 'upon', 'altogether', 'whereunto', 'across', 'between', 'and', 'if', 'as', 'over', 'above', 'afore', 'inside', 'like', 'besides', 'on', 'atop', 'about', 'toward', 'by', 'these', 'for', 'into', 'beforehand', 'unlike', 'until', 'in', 'aft', 'onto', 'to', 'vs', 'amid', 'towards', 'afterwards', 'notwithstanding', 'unto', 'while', 'next', 'including', 'thru', 'a', 'down', 'after', 'with', 'afterward', 'or', 'those', 'but', 'whereas', 'versus', 'without', 'off', 'among', 'because', 'some', 'against', 'before', 'around', 'of', 'under', 'that', 'except', 'at', 'beneath', 'out', 'amongst', 'the', 'from', 'per', 'mid', 'behind', 'along', 'outside', 'beyond', 'up', 'past', 'through', 'beside', 'below', 'during'}
def title_capitalize(match, use_lowers=True):
    text = match.group()
    lower = text.lower()
    if lower in lowers and use_lowers == True:
        return lower
    else:
        i = 0
        new_text = ""
        capitalized = False
        while i < len(text):
            if text[i] not in {"’", "'"} and capitalized == False:
                new_text += text[i].upper()
                capitalized = True
            else:
                new_text += text[i].lower()
            i += 1
        return new_text

def title(the_string):
    first = re.sub(r"[\w'’‑-]+", title_capitalize, the_string)
    return re.sub(r"(^[\w'’‑-]+)|([\w'’‑-]+$)", lambda match: title_capitalize(match, use_lowers=False), first)
IMHO, the best answer is @Frédéric's. But if you already have your string separated into words, and you know how string.capwords is implemented, then you can avoid the unneeded splitting step:
def capwords(s, sep=None):
    return (sep or ' ').join(
        x.capitalize() for x in s.split(sep)
    )
As a result, you can just do this:
# here my_words == ['word1', 'word2', ...]
s = ' '.join(word.capitalize() for word in my_words)
If you have to cater for dashes then use:
import string
" ".join(
    string.capwords(word, sep="-")
    for word in string.capwords(
        "john's school at bel-red"
    ).split()
)
# "John's School At Bel-Red"
