matching quoted strings and unquoted words - python

I am trying to write a regular expression that matches either strings surrounded by double quotes (") or words separated by spaces, and collects them in a list in Python.
I don't really understand the output of my code; can anybody give me a hint or explain what my regular expression is doing exactly?
Here is my code:
import re
regex = re.compile('(\"[^\"]*\")|( [^ ]* )')
test = '"hello world." here are some words. "and more"'
print(regex.split(test))
I expect an output like this:
['"hello world."', ' here ', ' are ', ' some ', ' words. ', '"and more"']
but I get the following:
['', '"hello world."', None, '', None, ' here ', 'are', None, ' some ', 'words.', None, ' "and ', 'more"']
Where do the empty strings and the Nones come from?
And why does it match "hello world." but not "and more"?
Thanks for your help, and a happy new year for those who celebrate it today!
EDIT:
To be precise: I don't need the surrounding spaces, but I do need the surrounding quotes. This output would be fine too:
['"hello world."', 'here', 'are', 'some', 'words.', '"and more"']
EDIT2:
I ended up using shlex.split() as @PadraicCunningham suggested, because it did exactly what I need and IMHO it is much more readable than regular expressions.
I am still keeping @TigerhawkT3's answer as the accepted one because it solves the problem the way I asked it (with regular expressions).

Include the quoted alternative first so it takes priority, and then fall back to runs of non-whitespace characters:
>>> s = '"hello world." here are some words. "and more"'
>>> re.findall(r'"[^"]*"|\S+', s)
['"hello world."', 'here', 'are', 'some', 'words.', '"and more"']
You can get the same result with a non-greedy repeating pattern instead of the character set negation:
>>> re.findall(r'".*?"|\S+', s)
['"hello world."', 'here', 'are', 'some', 'words.', '"and more"']

shlex.split with posix=False will do it for you:
import shlex
test = '"hello world." here are some words. "and more"'
print(shlex.split(test, posix=False))
['"hello world."', 'here', 'are', 'some', 'words.', '"and more"']
If you don't want the quotes, leave posix at its default of True:
print(shlex.split(test))
['hello world.', 'here', 'are', 'some', 'words.', 'and more']
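One thing worth knowing (my own example, not from the original answer): in posix mode, shlex also groups single-quoted text, which the regex approaches above do not attempt:
>>> shlex.split("say 'hi there' now")
['say', 'hi there', 'now']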

Looks like CSV, so use the appropriate tools:
import csv
lines = ['"hello world." here are some words. "and more"']
list(csv.reader(lines, delimiter=' ', quotechar='"'))
returns
[['hello world.', 'here', 'are', 'some', 'words.', 'and more']]
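Since csv.reader accepts any iterable of lines, the same works when the text comes from a file; a small sketch (io.StringIO stands in for a real file here):
import csv, io
buf = io.StringIO('"hello world." here are some words. "and more"')
print(list(csv.reader(buf, delimiter=' ', quotechar='"')))
# [['hello world.', 'here', 'are', 'some', 'words.', 'and more']]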

Related

python re.split(): how to keep some of the delimiters (instead of all of them, by using brackets)

For the sentence:
"I am very hungry, so mum brings me a cake!\n"
I want it split by delimiters, and I want all the delimiters except the space to be saved as well. So the expected output is:
"I" "am" "very" "hungry" "," "so" "mum" "brings" "me" "a" "cake" "!" "\n"
What I am currently doing is re.split(r'([!:''".,(\s+)\n])', text), which splits the whole sentence but also saves a lot of space characters that I don't want. I've also tried the regular expression \s|([!:''".,(\s+)\n]), which somehow gives me a lot of Nones.
search or findall might be more appropriate here than split:
import re
s = "I am very hungry, so mum brings me a !#$## cake!"
print(re.findall(r'[^\w\s]+|\w+', s))
# ['I', 'am', 'very', 'hungry', ',', 'so', 'mum', 'brings', 'me', 'a', '!#$##', 'cake', '!']
The pattern [^\w\s]+|\w+ means: a sequence of symbols which are neither alphanumeric nor whitespace, OR a sequence of alphanumerics (that is, a word).
That is because your regular expression contains a capture group; because of that group, the matches themselves are also included in the result. But this is likely what you want.
The only challenge is to filter out the Nones (and other values whose truthiness is False) in case there is no match; we can do this with:
def tokenize(text):
    return filter(None, re.split(r'[ ]+|([!:''".,\s\n])', text))
For your given sample text, this produces:
>>> list(tokenize("I am very hungry, so mum brings me a cake!\n"))
['I', 'am', 'very', 'hungry', ',', 'so', 'mum', 'brings', 'me', 'a', 'cake', '!', '\n']
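To see the effect of the capture group in isolation (my illustration, not part of the original answer): without a group the delimiters are dropped; with one they are kept in the result.
>>> re.split(r'[,!]', 'a,b!c')
['a', 'b', 'c']
>>> re.split(r'([,!])', 'a,b!c')
['a', ',', 'b', '!', 'c']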
One approach is to surround the special characters (,!.\n) with spaces and then split on the space:
import re
def tokenize(t, pattern="([,!.\n])"):
    return [e for e in re.sub(pattern, r" \1 ", t).split(' ') if e]
s = "I am very hungry, so mum brings me a cake!\n"
print(tokenize(s))
Output
['I', 'am', 'very', 'hungry', ',', 'so', 'mum', 'brings', 'me', 'a', 'cake', '!', '\n']

How to split a string into a list and combine two known tokens into one in Python?

For a given string like:
"Today is a bright sunny day in New York"
I want my list to be:
['Today','is','a','bright','sunny','day','in','New York']
Another example:
"This is a hello world program"
The list should be:
['This', 'is', 'a', 'hello world', 'program']
For every given string S, we have the entities E which need to be kept together. The first example had the entity E as "New", "York" and the second example had the entity as "hello", "world".
I have tried to get it done via regex, but I have been unsuccessful at splitting by spaces while merging the two entities.
Example:
regex = "(navy blue)|[a-zA-Z0-9]*"
match = re.findall(regex, "the sky looks navy blue.",re.IGNORECASE)
print match
Output:
['', '', '', '', '', '', 'navy blue', '', '']
Use re.findall instead of split, and supply the entity as an alternative before the character class that represents the strings to extract:
>>> s = "Today is a bright sunny day in New York"
>>> re.findall(r'New York|\w+', s)
['Today', 'is', 'a', 'bright', 'sunny', 'day', 'in', 'New York']
>>> s = "This is a hello world program"
>>> re.findall(r'hello world|\w+', s)
['This', 'is', 'a', 'hello world', 'program']
Change \w to an appropriate character class, for example [a-zA-Z].
For the additional sample added to the question:
>>> regex = r"navy blue|[a-z\d]+"
>>> re.findall(regex, "the sky looks navy blue.", re.IGNORECASE)
['the', 'sky', 'looks', 'navy blue']
Use r-strings to construct regex patterns, as a good practice.
Grouping is not needed here.
Use + instead of * so that at least one character has to be matched.
Since re.IGNORECASE is specified, either a-z or A-Z is enough in the character class; re.I can also be used as a shortcut.
\d is a shortcut for [0-9].
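If the entities come as a list rather than being hard-coded, a hedged generalization (the function name below is my own) is to build the alternation programmatically, escaping each entity and sorting longer ones first so they win over shorter prefixes:
import re

def tokenize_keeping_entities(s, entities):
    # Longest entities first, so e.g. "New York City" would beat "New York".
    alternation = '|'.join(re.escape(e) for e in sorted(entities, key=len, reverse=True))
    return re.findall(alternation + r'|\w+', s)

print(tokenize_keeping_entities("Today is a bright sunny day in New York", ["New York"]))
# ['Today', 'is', 'a', 'bright', 'sunny', 'day', 'in', 'New York']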
Try this:
text = "Today is a bright sunny day in New York"
new_list = list(map(str, text.split(" ")))
This should give you the following output: ['Today', 'is', 'a', 'bright', 'sunny', 'day', 'in', 'New', 'York']
Same for the next string:
hello = "This is a hello world program."
yet_another_list = list(map(str, hello.split(" ")))
This gives you ['This', 'is', 'a', 'hello', 'world', 'program.']
"this is hello word program".split(' ')
The split will automatically make a list. You can split using any string, word, or character.

Python regex: tokenizing English contractions

I am trying to parse strings in such a way as to separate out all word components, even those that have been contracted. For example the tokenization of "shouldn't" would be ["should", "n't"].
The nltk module, however, does not seem to be up to the task, as:
"I wouldn't've done that."
tokenizes as:
['I', "wouldn't", "'ve", 'done', 'that', '.']
where the desired tokenization of "wouldn't've" was: ['would', "n't", "'ve"]
After examining common English contractions, I am trying to write a regex to do the job but I am having a hard time figuring out how to match "'ve" only once. For example, the following tokens can all terminate a contraction:
n't, 've, 'd, 'll, 's, 'm, 're
But the token "'ve" can also follow other contractions such as:
'd've, n't've, and (conceivably) 'll've
At the moment, I am trying to wrangle this regex:
\b[a-zA-Z]+(?:('d|'ll|n't)('ve)?)|('s|'m|'re|'ve)\b
However, this pattern also matches the badly formed:
"wouldn't've've"
It seems the problem is that the third apostrophe qualifies as a word boundary so that the final "'ve" token matches the whole regex.
I have been unable to think of a way to differentiate a word boundary from an apostrophe and, failing that, I am open to advice for alternative strategies.
Also, I am curious if there is any way to include the word boundary special character in a character class. According to the Python documentation, \b in a character class matches a backspace and there doesn't seem to be a way around this.
EDIT:
Here's the output:
>>> pattern = re.compile(r"\b[a-zA-Z]+(?:('d|'ll|n't)('ve)?)|('s|'m|'re|'ve)\b")
>>> matches = pattern.findall("She'll wish she hadn't've done that.")
>>> print(matches)
[("'ll", '', ''), ("n't", "'ve", ''), ('', '', "'ve")]
I can't figure out the third match. In particular, I just realized that if the third apostrophe were matching the leading \b, then I don't know what would be matching the character class [a-zA-Z]+.
You can use the following complete regexes:
import re
patterns_list = [r'\s',r'(n\'t)',r'\'m',r'(\'ll)',r'(\'ve)',r'(\'s)',r'(\'re)',r'(\'d)']
pattern=re.compile('|'.join(patterns_list))
s="I wouldn't've done that."
print([i for i in pattern.split(s) if i])
Result:
['I', 'would', "n't", "'ve", 'done', 'that.']
(?<!['"\w])(['"])?([a-zA-Z]+(?:('d|'ll|n't)('ve)?|('s|'m|'re|'ve)))(?(1)\1|(?!\1))(?!['"\w])
EDIT: \2 is the match, \3 is the first group, \4 the second and \5 the third.
You can use this regex to tokenize the text:
(?:(?!.')\w)+|\w?'\w+|[^\s\w]
Usage:
>>> re.findall(r"(?:(?!.')\w)+|\w?'\w+|[^\s\w]", "I wouldn't've done that.")
['I', 'would', "n't", "'ve", 'done', 'that', '.']
>>> import nltk
>>> nltk.word_tokenize("I wouldn't've done that.")
['I', "wouldn't", "'ve", 'done', 'that', '.']
so:
>>> from itertools import chain
>>> [nltk.word_tokenize(i) for i in nltk.word_tokenize("I wouldn't've done that.")]
[['I'], ['would', "n't"], ["'ve"], ['done'], ['that'], ['.']]
>>> list(chain(*[nltk.word_tokenize(i) for i in nltk.word_tokenize("I wouldn't've done that.")]))
['I', 'would', "n't", "'ve", 'done', 'that', '.']
Here's a simple one:
text = ' ' + text.lower() + ' '
text = text.replace(" won't ", ' will not ').replace("n't ", ' not ') \
           .replace("'s ", ' is ').replace("'m ", ' am ') \
           .replace("'ll ", ' will ').replace("'d ", ' would ') \
           .replace("'re ", ' are ').replace("'ve ", ' have ')

Python title() with apostrophes

Is there a way to use .title() to get the correct output from a title with apostrophes? For example:
"john's school".title() --> "John'S School"
How would I get the correct title here, "John's School" ?
If your titles do not contain several whitespace characters in a row (which would be collapsed), you can use string.capwords() instead:
>>> import string
>>> string.capwords("john's school")
"John's School"
EDIT: As Chris Morgan rightfully says below, you can alleviate the whitespace collapsing issue by specifying " " in the sep argument:
>>> string.capwords("john's school", " ")
"John's School"
This is difficult in the general case, because some single apostrophes are legitimately followed by an uppercase character, such as Irish names starting with "O'". string.capwords() will work in many cases, but ignores anything in quotes. string.capwords("john's principal says,'no'") will not return the result you may be expecting.
>>> capwords("John's School")
"John's School"
>>> capwords("john's principal says,'no'")
"John's Principal Says,'no'"
>>> capwords("John O'brien's School")
"John O'brien's School"
A more annoying issue is that title itself does not produce the proper results either. For example, in American English usage, articles and prepositions are generally not capitalized in titles or headlines (Chicago Manual of Style).
>>> capwords("John clears school of spiders")
'John Clears School Of Spiders'
>>> "John clears school of spiders".title()
'John Clears School Of Spiders'
You can pip install the titlecase module, which will be much more useful to you and does what you want, without capwords's issues. There are still many edge cases, of course, but you'll get much further without worrying too much about a personally-written version.
>>> titlecase("John clears school of spiders")
'John Clears School of Spiders'
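For example, assuming the third-party package is installed (pip install titlecase), it also handles the apostrophe case from the question:
>>> from titlecase import titlecase
>>> titlecase("john's school")
"John's School"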
I think that can be tricky with title().
Let's try out something different:
def titlize(s):
    b = []
    for temp in s.split(' '):
        b.append(temp.capitalize())
    return ' '.join(b)

titlize("john's school")
# You get: John's School
Hope that helps!
Although the other answers are helpful and more concise, you may run into some problems with them, for example if there are newlines or tabs in your string. Also, hyphenated words (whether with regular or non-breaking hyphens) may be a problem in some instances, as well as words that begin with apostrophes. However, using regular expressions (with a function as the replacement argument) you can solve these problems:
import re

def title_capitalize(match):
    text = match.group()
    i = 0
    new_text = ""
    capitalized = False
    while i < len(text):
        if text[i] not in {"’", "'"} and capitalized == False:
            new_text += text[i].upper()
            capitalized = True
        else:
            new_text += text[i].lower()
        i += 1
    return new_text

def title(the_string):
    return re.sub(r"[\w'’‑-]+", title_capitalize, the_string)

s = "here's an apostrophe es. this string has multiple spaces\nnew\n\nlines\nhyphenated words: and non-breaking   spaces, and a non‑breaking hyphen, as well as 'ords that begin with ’strophies; it\teven\thas\t\ttabs."
print(title(s))
Anyway, you can edit this to compensate for any further problems, such as backticks and what-have-you, if needed.
If you're of the opinion that title casing should keep prepositions, conjunctions, and articles lowercase unless they're at the beginning or end of the title, you can try code such as this (though there are a few ambiguous words that you'll have to figure out by context, such as "when"):
import re

lowers={'this', 'upon', 'altogether', 'whereunto', 'across', 'between', 'and', 'if', 'as', 'over', 'above', 'afore', 'inside', 'like', 'besides', 'on', 'atop', 'about', 'toward', 'by', 'these', 'for', 'into', 'beforehand', 'unlike', 'until', 'in', 'aft', 'onto', 'to', 'vs', 'amid', 'towards', 'afterwards', 'notwithstanding', 'unto', 'while', 'next', 'including', 'thru', 'a', 'down', 'after', 'with', 'afterward', 'or', 'those', 'but', 'whereas', 'versus', 'without', 'off', 'among', 'because', 'some', 'against', 'before', 'around', 'of', 'under', 'that', 'except', 'at', 'beneath', 'out', 'amongst', 'the', 'from', 'per', 'mid', 'behind', 'along', 'outside', 'beyond', 'up', 'past', 'through', 'beside', 'below', 'during'}

def title_capitalize(match, use_lowers=True):
    text = match.group()
    lower = text.lower()
    if lower in lowers and use_lowers == True:
        return lower
    else:
        i = 0
        new_text = ""
        capitalized = False
        while i < len(text):
            if text[i] not in {"’", "'"} and capitalized == False:
                new_text += text[i].upper()
                capitalized = True
            else:
                new_text += text[i].lower()
            i += 1
        return new_text

def title(the_string):
    first = re.sub(r"[\w'’‑-]+", title_capitalize, the_string)
    return re.sub(r"(^[\w'’‑-]+)|([\w'’‑-]+$)", lambda match: title_capitalize(match, use_lowers=False), first)
IMHO, the best answer is @Frédéric's. But if you already have your string separated into words, and you know how string.capwords is implemented, then you can avoid the unneeded splitting step:
def capwords(s, sep=None):
    return (sep or ' ').join(
        x.capitalize() for x in s.split(sep)
    )
As a result, you can just do this:
# here my_words == ['word1', 'word2', ...]
s = ' '.join(word.capitalize() for word in my_words)
If you have to cater for dashes then use:
import string

" ".join(
    string.capwords(word, sep="-")
    for word in string.capwords(
        "john's school at bel-red"
    ).split()
)
# "John's School At Bel-Red"

RegEx Tokenizer to split a text into words, digits and punctuation marks

What I want to do is to split a text into its ultimate elements.
For example:
from nltk.tokenize import *
txt = "A sample sentences with digits like 2.119,99 or 2,99 are awesome."
regexp_tokenize(txt, pattern='(?:(?!\d)\w)+|\S+')
['A','sample','sentences','with','digits','like','2.199,99','or','2,99','are','awesome','.']
You can see that it works fine. My problem is: what happens if the number is at the end of the text?
txt = "Today it's 07.May 2011. Or 2.999."
regexp_tokenize(txt, pattern=r'(?:(?!\d)\w)+|\S+')
['Today', 'it', "'s", '07.May', '2011.', 'Or', '2.999.']
The result should be:
['Today', 'it', "'s", '07.May', '2011','.', 'Or', '2.999','.']
What do I have to do to get the result above?
I created a pattern that tries to include periods and commas occurring inside words and numbers. Hope this helps:
txt = "Today it's 07.May 2011. Or 2.999."
regexp_tokenize(txt, pattern=r'\w+(?:[.,]\w+)*|\S+')
['Today', 'it', "'s", '07.May', '2011', '.', 'Or', '2.999', '.']
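Since the repeating group is written as non-capturing ((?:...)), the same pattern can also be used directly with plain re.findall, which would otherwise return only the group contents rather than the full matches. A quick check (my own addition, not part of the original answer):
>>> import re
>>> re.findall(r'\w+(?:[.,]\w+)*|\S+', "Today it's 07.May 2011. Or 2.999.")
['Today', 'it', "'s", '07.May', '2011', '.', 'Or', '2.999', '.']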
