Is there a way to use .title() to get the correct output from a title with apostrophes? For example:
"john's school".title() --> "John'S School"
How would I get the correct title here, "John's School"?
If your titles do not contain several whitespace characters in a row (which would be collapsed), you can use string.capwords() instead:
>>> import string
>>> string.capwords("john's school")
"John's School"
EDIT: As Chris Morgan rightly says below, you can avoid the whitespace-collapsing issue by passing " " as the sep argument:
>>> string.capwords("john's school", " ")
"John's School"
This is difficult in the general case, because some single apostrophes are legitimately followed by an uppercase character, such as Irish names starting with "O'". string.capwords() will work in many cases, but it only capitalizes the first letter of each whitespace-separated chunk, so anything stuck to quotes or other punctuation is left alone. string.capwords("john's principal says,'no'") will not return the result you may be expecting.
>>> capwords("John's School")
"John's School"
>>> capwords("john's principal says,'no'")
"John's Principal Says,'no'"
>>> capwords("John O'brien's School")
"John O'brien's School"
A more annoying issue is that title() itself does not produce proper title case. For example, in American English usage, articles and prepositions are generally not capitalized in titles or headlines (Chicago Manual of Style).
>>> capwords("John clears school of spiders")
'John Clears School Of Spiders'
>>> "John clears school of spiders".title()
'John Clears School Of Spiders'
You can install the titlecase module (originally via easy_install; pip install titlecase works today), which will be much more useful to you and does what you want without capwords's issues. There are still many edge cases, of course, but you'll get much further without worrying too much about a hand-rolled version.
>>> titlecase("John clears school of spiders")
'John Clears School of Spiders'
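(These examples assume the function has been imported with from titlecase import titlecase; the package on PyPI exposes the titlecase function at the top level.)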
I think that can be tricky with title().
Let's try something different:
def titlize(s):
    b = []
    for temp in s.split(' '):
        b.append(temp.capitalize())
    return ' '.join(b)

titlize("john's school")
# You get: "John's School"
Hope that helps!
Although the other answers are helpful, and more concise, you may run into some problems with them. For example, if there are new lines or tabs in your string. Also, hyphenated words (whether with regular or non-breaking hyphens) may be a problem in some instances, as well as words that begin with apostrophes. However, using regular expressions (using a function for the regular expression replacement argument) you can solve these problems:
import re

def title_capitalize(match):
    text = match.group()
    i = 0
    new_text = ""
    capitalized = False
    while i < len(text):
        if text[i] not in {"’", "'"} and capitalized == False:
            new_text += text[i].upper()
            capitalized = True
        else:
            new_text += text[i].lower()
        i += 1
    return new_text

def title(the_string):
    return re.sub(r"[\w'’‑-]+", title_capitalize, the_string)

s = "here's an apostrophe es. this string has multiple spaces\nnew\n\nlines\nhyphenated words: and non-breaking spaces, and a non‑breaking hyphen, as well as 'ords that begin with ’strophies; it\teven\thas\t\ttabs."
print(title(s))
Anyway, you can edit this to compensate for any further problems, such as backticks and what-have-you, if needed.
If you're of the opinion that title casing should keep prepositions, conjunctions, and articles lowercase unless they're at the beginning or end of the title, you can try code such as this (though there are a few ambiguous words that you'll have to resolve by context, such as when):
import re

lowers = {'this', 'upon', 'altogether', 'whereunto', 'across', 'between', 'and', 'if', 'as', 'over', 'above', 'afore', 'inside', 'like', 'besides', 'on', 'atop', 'about', 'toward', 'by', 'these', 'for', 'into', 'beforehand', 'unlike', 'until', 'in', 'aft', 'onto', 'to', 'vs', 'amid', 'towards', 'afterwards', 'notwithstanding', 'unto', 'while', 'next', 'including', 'thru', 'a', 'down', 'after', 'with', 'afterward', 'or', 'those', 'but', 'whereas', 'versus', 'without', 'off', 'among', 'because', 'some', 'against', 'before', 'around', 'of', 'under', 'that', 'except', 'at', 'beneath', 'out', 'amongst', 'the', 'from', 'per', 'mid', 'behind', 'along', 'outside', 'beyond', 'up', 'past', 'through', 'beside', 'below', 'during'}

def title_capitalize(match, use_lowers=True):
    text = match.group()
    lower = text.lower()
    if lower in lowers and use_lowers == True:
        return lower
    else:
        i = 0
        new_text = ""
        capitalized = False
        while i < len(text):
            if text[i] not in {"’", "'"} and capitalized == False:
                new_text += text[i].upper()
                capitalized = True
            else:
                new_text += text[i].lower()
            i += 1
        return new_text

def title(the_string):
    first = re.sub(r"[\w'’‑-]+", title_capitalize, the_string)
    return re.sub(r"(^[\w'’‑-]+)|([\w'’‑-]+$)", lambda match: title_capitalize(match, use_lowers=False), first)
IMHO, the best answer is @Frédéric's. But if you already have your string separated into words, and you know how string.capwords is implemented, then you can avoid an unneeded joining step:
def capwords(s, sep=None):
    return (sep or ' ').join(
        x.capitalize() for x in s.split(sep)
    )
As a result, you can just do this:
# here my_words == ['word1', 'word2', ...]
s = ' '.join(word.capitalize() for word in my_words)
If you have to cater for dashes then use:
import string
" ".join(
string.capwords(word, sep="-")
for word in string.capwords(
"john's school at bel-red"
).split()
)
# "John's School At Bel-Red"
Related
This is my first question on this site, so please forgive any formatting or language errors. This question is based on the book "Think Python" by Allen Downey. The exercise is to write a Python program that reads a book in text format and removes all the whitespace (such as spaces and tabs) as well as punctuation and other symbols. I tried many different ways to remove the punctuation, but it never removes the quotes and double quotes; they persistently stay. I'll copy-paste the last code I tried.
import string

def del_punctuation(item):
    '''
    This function deletes punctuation from a word.
    '''
    punctuation = string.punctuation
    for c in item:
        if c in punctuation:
            item = item.replace(c, '')
    return item

def break_into_words(filename):
    '''
    This function reads a file and breaks it into
    a list of used words in lower case.
    '''
    book = open(filename)
    words_list = []
    for line in book:
        for item in line.split():
            item = del_punctuation(item)
            item = item.lower()
            #print(item)
            words_list.append(item)
    return words_list

print(break_into_words('input.txt'))
I have not included the code to remove the whitespace, as that works perfectly; I have only included the code for removing punctuation. All the punctuation characters are removed except for the quotes and double quotes. Please help me find the bug in the code, or is it something to do with my IDE or interpreter?
Thanks in advance.
input.txt:
“Why, my dear, you must know, Mrs. Long says that Netherfield is
taken by a young man of large fortune from the north of England;
that he came down on Monday in a chaise and four to see the
place, and was so much delighted with it that he agreed with Mr.
Morris immediately; that he is to take possession before
Michaelmas, and some of his servants are to be in the house by
the end of next week.”
“What is his name?”
“Bingley.”
“Is he married or single?”
“Oh! single, my dear, to be sure! A single man of large fortune;
four or five thousand a year. What a fine thing for our girls!”
“How so? how can it affect them?”
“My dear Mr. Bennet,” replied his wife, “how can you be so
tiresome! You must know that I am thinking of his marrying one of
them.”
“Is that his design in settling here?”
The output I get is copied below:
['“why', 'my', 'dear', 'you', 'must', 'know', 'mrs', 'long', 'says', 'that', 'netherfield', 'is', 'taken', 'by', 'a', 'young', 'man', 'of', 'large', 'fortune', 'from', 'the', 'north', 'of', 'england', 'that', 'he', 'came', 'down', 'on', 'monday', 'in', 'a', 'chaise', 'and', 'four', 'to', 'see', 'the', 'place', 'and', 'was', 'so', 'much', 'delighted', 'with', 'it', 'that', 'he', 'agreed', 'with', 'mr', 'morris', 'immediately', 'that', 'he', 'is', 'to', 'take', 'possession', 'before', 'michaelmas', 'and', 'some', 'of', 'his', 'servants', 'are', 'to', 'be', 'in', 'the', 'house', 'by', 'the', 'end', 'of', 'next', 'week”', '“what', 'is', 'his', 'name”', '“bingley”', '“is', 'he', 'married', 'or', 'single”', '“oh', 'single', 'my', 'dear', 'to', 'be', 'sure', 'a', 'single', 'man', 'of', 'large', 'fortune', 'four', 'or', 'five', 'thousand', 'a', 'year', 'what', 'a', 'fine', 'thing', 'for', 'our', 'girls”', '“how', 'so', 'how', 'can', 'it', 'affect', 'them”', '“my', 'dear', 'mr', 'bennet”', 'replied', 'his', 'wife', '“how', 'can', 'you', 'be', 'so', 'tiresome', 'you', 'must', 'know', 'that', 'i', 'am', 'thinking', 'of', 'his', 'marrying', 'one', 'of', 'them”', '“is', 'that', 'his', 'design', 'in', 'settling', 'here”']
It has removed all the punctuation except for the double quotes and single quotes (there are single quotes in the input, I guess).
Thanks
Real texts may contain many tricky symbols: the en dash –, the em dash —, over ten kinds of quotes " ' ` ‘ ’ “ ” « » ‹ ›, et cetera.
It makes little sense to try to enumerate all the possible punctuation symbols. A common approach is to keep only the letters (and spaces) instead, and the easiest way to do that is a regular expression:
import re
text = '''“Why, my dear, you must know, Mrs. Long says that Netherfield is
taken by a young man of large fortune from the north of England;
that he came down on Monday in a chaise and four to see the
place, and was so much delighted with it that he agreed with Mr.
Morris immediately; that he is to take possession before
Michaelmas, and some of his servants are to be in the house by
the end of next week.”
“What is his name?”
“Bingley.”
“Is he married or single?”
“Oh! single, my dear, to be sure! A single man of large fortune;
four or five thousand a year. What a fine thing for our girls!”
“How so? how can it affect them?”
“My dear Mr. Bennet,” replied his wife, “how can you be so
tiresome! You must know that I am thinking of his marrying one of
them.”
“Is that his design in settling here?”'''
# remove everything except letters, spaces, \n and, for example, dashes
# (note that [^A-z] would also keep [ \ ] ^ _ and the backtick, which sit
# between the two cases in ASCII, so spell out both ranges)
text = re.sub(r"[^A-Za-z \n-]", "", text)
# split the text by spaces and \n
output = text.split()
print(output)
But actually the matter is more complicated than it looks at first glance. Is "I'm" two words? Probably so. What about "someone's"? Or "rock'n'roll"?
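If you do want to keep such in-word apostrophes, a findall sketch along these lines works (which apostrophe characters to allow is an assumption about your input):
import re

# keep letters, plus apostrophes that sit between letters, so "I'm" and
# "rock'n'roll" survive as single tokens while quote marks are dropped
words = re.findall(r"[A-Za-z]+(?:['’][A-Za-z]+)*", "I'm someone's rock'n'roll “friend”.")
print(words)  # ["I'm", "someone's", "rock'n'roll", 'friend']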
I think your text contains the character ” as double quotes instead of ". ” doesn't exist in string.punctuation, so you are not removing it. Maybe it is better to change your del_punctuation function a little:
def del_punctuation(item):
    '''
    This function deletes punctuation from a word.
    '''
    punctuation = string.punctuation
    for c in item:
        if c in punctuation:
            item = item.replace(c, '')
    item = item.replace('”', '')
    item = item.replace('“', '')
    return item
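Alternatively, str.translate can strip string.punctuation plus the curly quotes in one pass (a minimal sketch; exactly which extra quote characters appear is an assumption about your input):
import string

# build a translation table that deletes ASCII punctuation and the curly quotes
table = str.maketrans('', '', string.punctuation + '“”‘’')
print('“Why, my dear!”'.translate(table))  # Why my dear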
I have the following text that I have already cleaned; below is just a sample of it:
and can you by no drift of circumstance get from him why he puts on this confusion grating so harshly all his days of
quiet with turbulent and dangerous lunacy? he does confess he feels himself distracted. but from what cause he will by
no means speak. nor do we find him forward to be sounded but with a crafty madness keeps aloof when we would bring
him on to some confession of his true state. did he receive you well? most like a gentleman. but with much forcing of
his disposition. niggard of question. but of our demands most free in his reply.
I want to do the following:
create a list of lists named hamsplits, such that hamsplits[i] is a list of all the words in the i-th sentence of the text.
sentences should be stored in the order that they appear, and so should the words within each sentence
sentences end with '.', '?', and '!'
Desired output example:
hamsplits[0] == ['and', 'can', 'you', 'by', ..., 'dangerous', 'lunacy']
I tried the code below using just '.' as a test, but it doesn't return a list of lists:
hamsplits3 = hamsplits2.split('.')
Instead it returns this:
['\n\nand can you by no drift of circumstance get from him why he puts on this confusion grating so harshly all his days of \nquiet with turbulent and dangerous lunacy? he does confess he feels himself distracted', ' but from what cause he will by \nno means speak', ' nor do we find him forward to be sounded but with a crafty madness keeps aloof when we would bring \nhim on to some confession of his true state', ' did he receive you well? most like a gentleman', ' but with much forcing of \nhis disposition', ' niggard of question', ' but of our demands most free in his reply', " did you assay him? ... ]
What am I doing wrong? I don't want to use any imported packages outside of import re.
You can try findall:
import re
s = """and can you by no drift of circumstance get from him why he puts on this confusion grating so harshly all his days of
quiet with turbulent and dangerous lunacy? he does confess he feels himself distracted. but from what cause he will by
no means speak. nor do we find him forward to be sounded but with a crafty madness keeps aloof when we would bring
him on to some confession of his true state. did he receive you well? most like a gentleman. but with much forcing of
his disposition. niggard of question. but of our demands most free in his reply."""
# split() with no arguments handles spaces and newlines alike
hamsplits = [i.split() for i in re.findall(r'[^.?!]+', s)]
print(hamsplits[0])
Output:
['and', 'can', 'you', 'by', 'no', 'drift', 'of', 'circumstance', 'get', 'from', 'him', 'why', 'he', 'puts', 'on', 'this', 'confusion', 'grating', 'so', 'harshly', 'all', 'his', 'days', 'of', 'quiet', 'with', 'turbulent', 'and', 'dangerous', 'lunacy']
You can try this:
import re

with open('input_text.txt') as file:
    hamsplits = [ele.split() for ele in re.split('[.?!]', file.read())]

print(hamsplits[0])
output:
['and', 'can', 'you', 'by', 'no', 'drift', 'of', 'circumstance', 'get', 'from', 'him', 'why', 'he', 'puts', 'on', 'this', 'confusion', 'grating', 'so', 'harshly', 'all', 'his', 'days', 'of', 'quiet', 'with', 'turbulent', 'and', 'dangerous', 'lunacy']
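One caveat: re.split leaves an empty string after the final '.', so hamsplits ends with an empty list. Filtering out the blank pieces is a one-line change (a sketch):
hamsplits = [ele.split() for ele in re.split('[.?!]', file.read()) if ele.strip()]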
I am trying to look for words that do not immediately come before 'the'.
I performed a positive lookbehind, (?<=the\W), to get the words that come after the keyword 'the'. However, I am unable to capture 'people' and 'that', as the above logic does not apply to those cases.
In other words, I am unable to take care of the words that do not have the keyword 'the' around them (for example, 'that' and 'people' in the sentence).
p = re.compile(r'(?<=the\W)\w+')
m = p.findall('the part of the fair that attracts the most people is the fireworks')
print(m)
The current output I am getting is:
'part', 'fair', 'most', 'fireworks'
Edit:
Thank you for all the help below. Using the below suggestions in the comments, managed to update my code.
p = re.compile(r"\b(?!the)(\w+)(\W\w+\Wthe)?")
m = p.findall('the part of the fair that attracts the most people is the fireworks')
This brings me closer to the output I need to get.
Updated Output:
[('part', ' of the'), ('fair', ''),
('that', ' attracts the'), ('most', ''),
('people', ' is the'), ('fireworks', '')]
I just need the strings ('part','fair','that','most','people','fireworks').
Any advice?
I am trying to look for words that do not immediately come before 'the'.
Note that the code below does not use re.
words = 'the part of the fair that attracts the most people is the fireworks'
words_list = words.split()
words_not_before_the = []
for idx, w in enumerate(words_list):
    if idx < len(words_list) - 1 and words_list[idx + 1] != 'the':
        words_not_before_the.append(w)
words_not_before_the.append(words_list[-1])
print(words_not_before_the)
output
['the', 'part', 'the', 'fair', 'that', 'the', 'most', 'people', 'the', 'fireworks']
Using regex:
import re
m = re.sub(r'\b(\w+)\b the', 'the', 'the part of the fair that attracts the most people is the fireworks')
print([word for word in m.split(' ') if not word.isspace() and word])
output:
['the', 'part', 'the', 'fair', 'that', 'the', 'most', 'people', 'the', 'fireworks']
I am trying to look for words that do not immediately come before the.
Try this:
import re
# The capture group (\w+) matches a word that is followed by another word and then the word "the"
p = re.compile(r'(\w+)\W\w+\Wthe')
m = p.findall('the part of the fair that attracts the most people is the fireworks')
print(m)
Output:
['part', 'that', 'people']
Try to spin it around: instead of finding the words that do not immediately come before 'the', eliminate all the occurrences that do (along with 'the' itself):
import re
test = "the part of the fair that attracts the most people is the fireworks"
pattern = r"\s\w*\sthe|the\s"
print(re.sub(pattern, "", test))
output: part fair that most people fireworks
I have finally solved the question. Thank you all!
p = re.compile(r"\b(?!the)(\w+)(?:\W\w+\Wthe)?")
m = p.findall('the part of the fair that attracts the most people is the fireworks')
print(m)
Added a non-capturing group, '?:', to the trailing group, so findall returns only the single remaining capture group as a flat list of strings.
Output:
['part', 'fair', 'that', 'most', 'people', 'fireworks']
There is a paragraph, and I want to use a regular expression to extract all the words inside.
a bdag agasg it's the cookies for dogs',don't you think so? the word 'wow' in english means.you hey b 097 dag final
I have tried several regexes with re.findall(regX,str), and found one that can match most words.
regX = "[ ,\.\?]?([a-z]+'?[a-z]?)[ ,\.\?]?"
['a', 'bdag', 'agasg', "it's", 'the', 'cookies', 'for', "dogs'", "don't", 'you', 'think', 'so', 'the', 'word', "wow'", 'in', 'english', 'means', 'you', 'hey', 'b', 'dag', 'final']
All are good except wow', which keeps its trailing quote.
I wonder if a regular expression can express the logic: "this position can be a comma/space/period/etc., but can't be an apostrophe".
Can someone advise?
Try:
[ ,\.\?']?([a-z]*('\w)?)[\' ,\.\?]?
Added another group, so you'll have to select only group 1.
I didn't fully understand what you wanted the output to be, but try this:
[ ,\.\?]?(["-']?+[a-z]+["-']?[a-z]?)[ ,\.\?]?
Using this regex lets you keep the ' and " within the text.
If this still isn't what you wanted, please let me know so I can update my answer.
I am trying to parse strings in such a way as to separate out all word components, even those that have been contracted. For example the tokenization of "shouldn't" would be ["should", "n't"].
The nltk module does not seem to be up to the task, however, as:
"I wouldn't've done that."
tokenizes as:
['I', "wouldn't", "'ve", 'done', 'that', '.']
where the desired tokenization of "wouldn't've" was: ['would', "n't", "'ve"]
After examining common English contractions, I am trying to write a regex to do the job but I am having a hard time figuring out how to match "'ve" only once. For example, the following tokens can all terminate a contraction:
n't, 've, 'd, 'll, 's, 'm, 're
But the token "'ve" can also follow other contractions such as:
'd've, n't've, and (conceivably) 'll've
At the moment, I am trying to wrangle this regex:
\b[a-zA-Z]+(?:('d|'ll|n't)('ve)?)|('s|'m|'re|'ve)\b
However, this pattern also matches the badly formed:
"wouldn't've've"
It seems the problem is that the third apostrophe qualifies as a word boundary so that the final "'ve" token matches the whole regex.
I have been unable to think of a way to differentiate a word boundary from an apostrophe and, failing that, I am open to advice for alternative strategies.
Also, I am curious if there is any way to include the word boundary special character in a character class. According to the Python documentation, \b in a character class matches a backspace and there doesn't seem to be a way around this.
EDIT:
Here's the output:
>>> pattern = re.compile(r"\b[a-zA-Z]+(?:('d|'ll|n't)('ve)?)|('s|'m|'re|'ve)\b")
>>> matches = pattern.findall("She'll wish she hadn't've done that.")
>>> print matches
[("'ll", '', ''), ("n't", "'ve", ''), ('', '', "'ve")]
I can't figure out the third match. In particular, I just realized that if the third apostrophe were matching the leading \b, then I don't know what would be matching the character class [a-zA-Z]+.
You can use the following combined regex:
import re

# each alternative is parenthesized so that re.split keeps the matched piece
patterns_list = [r'\s', r"(n't)", r"('m)", r"('ll)", r"('ve)", r"('s)", r"('re)", r"('d)"]
pattern = re.compile('|'.join(patterns_list))
s = "I wouldn't've done that."
print([i for i in pattern.split(s) if i])
result :
['I', 'would', "n't", "'ve", 'done', 'that.']
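Note that the trailing 'that.' keeps its period. Adding a catch-all punctuation group at the end of the list splits it off too (a sketch):
import re

patterns_list = [r'\s', r"(n't)", r"('m)", r"('ll)", r"('ve)", r"('s)", r"('re)", r"('d)", r'([^\w\s])']
pattern = re.compile('|'.join(patterns_list))
print([i for i in pattern.split("I wouldn't've done that.") if i])
# ['I', 'would', "n't", "'ve", 'done', 'that', '.']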
(?<!['"\w])(['"])?([a-zA-Z]+(?:('d|'ll|n't)('ve)?|('s|'m|'re|'ve)))(?(1)\1|(?!\1))(?!['"\w])
EDIT: \2 is the match, \3 is the first group, \4 the second and \5 the third.
You can use this regex to tokenize the text:
(?:(?!.')\w)+|\w?'\w+|[^\s\w]
Usage:
>>> re.findall(r"(?:(?!.')\w)+|\w?'\w+|[^\s\w]", "I wouldn't've done that.")
['I', 'would', "n't", "'ve", 'done', 'that', '.']
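The pattern is easier to read spelled out with re.VERBOSE (the same regex, just annotated):
import re

pattern = re.compile(r"""
      (?:(?!.')\w)+   # a run of word chars, stopping before a char followed by '
    | \w?'\w+         # an apostrophe piece such as n't, 've, 'd
    | [^\s\w]         # any single punctuation character
""", re.VERBOSE)
print(pattern.findall("I wouldn't've done that."))
# ['I', 'would', "n't", "'ve", 'done', 'that', '.']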
>>> import nltk
>>> nltk.word_tokenize("I wouldn't've done that.")
['I', "wouldn't", "'ve", 'done', 'that', '.']
so:
>>> from itertools import chain
>>> [nltk.word_tokenize(i) for i in nltk.word_tokenize("I wouldn't've done that.")]
[['I'], ['would', "n't"], ["'ve"], ['done'], ['that'], ['.']]
>>> list(chain(*[nltk.word_tokenize(i) for i in nltk.word_tokenize("I wouldn't've done that.")]))
['I', 'would', "n't", "'ve", 'done', 'that', '.']
Here's a simple one:
text = ' ' + text.lower() + ' '
text = text.replace(" won't ", ' will not ').replace("n't ", ' not ') \
.replace("'s ", ' is ').replace("'m ", ' am ') \
.replace("'ll ", ' will ').replace("'d ", ' would ') \
.replace("'re ", ' are ').replace("'ve ", ' have ')