How do I split a sentence and store each word in a list? For example, given a string like "these are words", how do I get a list like ["these", "are", "words"]?
To split on other delimiters, see Split a string by a delimiter in python.
To split into individual characters, see How do I split a string into a list of characters?.
Given a string sentence, this stores each word in a list called words:
words = sentence.split()
To split the string text on any consecutive runs of whitespace:
words = text.split()
To split the string text on a custom delimiter such as ",":
words = text.split(",")
The words variable will be a list and contain the words from text split on the delimiter.
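As a rough illustration of the difference between the two calls, here is a small sketch. The sample string is made up for this example; note that splitting on "," keeps the whitespace around each piece, so a strip() pass may be wanted.

```python
text = "  apples,  pears ,plums  "

# No argument: split on runs of whitespace, ignoring leading/trailing blanks.
print(text.split())
# ['apples,', 'pears', ',plums']

# Explicit delimiter: split on every comma; surrounding spaces are kept.
parts = text.split(",")
print(parts)
# ['  apples', '  pears ', 'plums  ']

# Strip each piece if the padding is not wanted.
cleaned = [part.strip() for part in parts]
print(cleaned)
# ['apples', 'pears', 'plums']
```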
Use str.split():
Return a list of the words in the string, using sep as the delimiter
... If sep is not specified or is None, a different splitting algorithm is applied: runs of consecutive whitespace are regarded as a single separator, and the result will contain no empty strings at the start or end if the string has leading or trailing whitespace.
>>> line = "a sentence with a few words"
>>> line.split()
['a', 'sentence', 'with', 'a', 'few', 'words']
Depending on what you plan to do with your sentence-as-a-list, you may want to look at the Natural Language Toolkit (NLTK). It deals heavily with text processing and evaluation. You can also use it to solve your problem:
import nltk
words = nltk.word_tokenize(raw_sentence)
This has the added benefit of splitting out punctuation.
Example:
>>> import nltk
>>> s = "The fox's foot grazed the sleeping dog, waking it."
>>> words = nltk.word_tokenize(s)
>>> words
['The', 'fox', "'s", 'foot', 'grazed', 'the', 'sleeping', 'dog', ',',
'waking', 'it', '.']
This allows you to filter out any punctuation you don't want and use only words.
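One possible way to do that filtering is sketched below. To keep it runnable without NLTK installed, the token list is copied by hand from the output above rather than produced by word_tokenize; the name words_only is just illustrative.

```python
import string

# Tokens copied from the word_tokenize output shown above, so this
# sketch runs without NLTK installed.
tokens = ['The', 'fox', "'s", 'foot', 'grazed', 'the', 'sleeping',
          'dog', ',', 'waking', 'it', '.']

punctuation = set(string.punctuation)

# Keep a token unless it is a single punctuation character.
words_only = [tok for tok in tokens if tok not in punctuation]
print(words_only)
# ['The', 'fox', "'s", 'foot', 'grazed', 'the', 'sleeping', 'dog', 'waking', 'it']
```

The possessive "'s" survives because it is not a single punctuation character; whether to keep it depends on what you do with the words next.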
Please note that the other solutions using str.split() are better if you don't plan on doing any complex manipulation of the sentence.
[Edited]
How about this algorithm? Split text on whitespace, then trim punctuation. This carefully removes punctuation from the edge of words, without harming apostrophes inside words such as we're.
>>> text
"'Oh, you can't help that,' said the Cat: 'we're all mad here. I'm mad. You're mad.'"
>>> text.split()
["'Oh,", 'you', "can't", 'help', "that,'", 'said', 'the', 'Cat:', "'we're", 'all', 'mad', 'here.', "I'm", 'mad.', "You're", "mad.'"]
>>> import string
>>> [word.strip(string.punctuation) for word in text.split()]
['Oh', 'you', "can't", 'help', 'that', 'said', 'the', 'Cat', "we're", 'all', 'mad', 'here', "I'm", 'mad', "You're", 'mad']
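One caveat with this approach: a token made only of punctuation, such as a free-standing dash, is stripped down to an empty string. A minimal sketch that wraps the same idea in a function and drops those empties (split_words is a hypothetical name chosen for this example):

```python
import string

def split_words(text):
    """Split on whitespace, trim edge punctuation, drop empty results."""
    stripped = (word.strip(string.punctuation) for word in text.split())
    return [word for word in stripped if word]

print(split_words("Wait - what?! I can't..."))
# ['Wait', 'what', 'I', "can't"]
```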
I want my python function to split a sentence (input) and store each word in a list
The str.split() method does this: it takes a string and splits it into a list:
>>> the_string = "this is a sentence"
>>> words = the_string.split(" ")
>>> print(words)
['this', 'is', 'a', 'sentence']
>>> type(words)
<type 'list'> # or <class 'list'> in Python 3
If you want all the chars of a word/sentence in a list, do this:
print(list("word"))
# ['w', 'o', 'r', 'd']
print(list("some sentence"))
# ['s', 'o', 'm', 'e', ' ', 's', 'e', 'n', 't', 'e', 'n', 'c', 'e']
shlex has a .split() function. It differs from str.split() in that it does not preserve quotes and treats a quoted phrase as a single word:
>>> import shlex
>>> shlex.split("sudo echo 'foo && bar'")
['sudo', 'echo', 'foo && bar']
NB: it works well for Unix-like command line strings. It doesn't work for natural-language processing.
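To make the contrast concrete, the sketch below runs both splitters on the same command string, then shows why shlex is a poor fit for prose: an apostrophe is read as an opening quote that never closes. The prose sample is made up for illustration.

```python
import shlex

command = "sudo echo 'foo && bar'"

print(command.split())
# ['sudo', 'echo', "'foo", '&&', "bar'"]
print(shlex.split(command))
# ['sudo', 'echo', 'foo && bar']

# An apostrophe in ordinary prose looks like an unclosed quote to shlex.
try:
    shlex.split("you can't help that")
except ValueError as exc:
    print("shlex failed:", exc)
```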
Split the words without harming apostrophes inside words
See input_1 and input_2 below, which include words such as Moore's.
import re

def split_into_words(line):
    # A word is a run of word characters that may contain inner
    # apostrophes, or a single word character.
    word_regex_improved = r"(\w[\w']*\w|\w)"
    word_matcher = re.compile(word_regex_improved)
    return word_matcher.findall(line)
#Example 1
input_1 = "computational power (see Moore's law) and "
split_into_words(input_1)
# output
['computational', 'power', 'see', "Moore's", 'law', 'and']
#Example 2
input_2 = """Oh, you can't help that,' said the Cat: 'we're all mad here. I'm mad. You're mad."""
split_into_words(input_2)
#output
['Oh',
'you',
"can't",
'help',
'that',
'said',
'the',
'Cat',
"we're",
'all',
'mad',
'here',
"I'm",
'mad',
"You're",
'mad']
(Removed Code to stop Class mates copying)
Right now it will create a text file with positions of each word, so for example if I wrote
"Hello, my name is mika, Hello"
The positions in that list would be [1,2,3,4,5,6,2,1], and it will also list each word/punctuation but only once, so in this case it would be
['Hello', ',', 'my', 'name', 'is', 'mika']
The only thing now is to be able to get the words back into the original sentence using those positions in the list, which I can't seem to do.
I did try searching for other posts, but they all seemed to be from people wanting the positions of the words rather than wanting to put the words back into a sentence using the positions.
I also thought it could be started by doing this:
for i in range(len(readlines[1])):
but I honestly have no idea how to go around doing this.
Edit: This has now been solved by @Abhishek, thank you.
indices = [1,2,3,4,5,6,2,1]
namelst = ['Hello', ',', 'my', 'name', 'is', 'mika']
newstr = " ".join([namelst[x-1] for x in indices])
print (newstr)
output:
Hello , my name is mika , Hello
I agree there will be some extra spaces before the punctuation, but it will give you the complete sentence again.
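For completeness, here is a sketch of the whole round trip. The asker's own code was removed from the question, so the encode step below is only an assumption about how the word list and positions might have been produced (words via \w+, every other non-space character as its own token); the names encode and decode are hypothetical.

```python
import re

def encode(sentence):
    """Return (unique_tokens, positions) with 1-based positions."""
    # Words are runs of \w; any other non-space character is its own token.
    tokens = re.findall(r"\w+|[^\w\s]", sentence)
    unique = []
    positions = []
    for tok in tokens:
        if tok not in unique:
            unique.append(tok)
        positions.append(unique.index(tok) + 1)
    return unique, positions

def decode(unique, positions):
    """Rebuild the token sequence, separated by single spaces."""
    return " ".join(unique[p - 1] for p in positions)

unique, positions = encode("Hello, my name is mika, Hello")
print(unique)     # ['Hello', ',', 'my', 'name', 'is', 'mika']
print(positions)  # [1, 2, 3, 4, 5, 6, 2, 1]
print(decode(unique, positions))  # Hello , my name is mika , Hello
```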
Code (removes the space before punctuation):
positions = [1,2,3,4,5,6,2,1]
wordslist = ['Hello', ',', 'my', 'name', 'is', 'mika']
recreated = ''
for i in positions:
    w = wordslist[i-1]
    # Prefix a space unless the token is punctuation.
    if w not in ['(', ')', '?', ':', ';', ',', '.', '!', '/', '"', "'"]:
        w = ' ' + w
    recreated = (recreated + w).strip()
print(recreated)
Output:
C:\Users\dinesh_pundkar\Desktop>python c.py
Hello, my name is mika, Hello
C:\Users\dinesh_pundkar\Desktop>
You can use numpy to do this.
>>> import numpy as np
>>> indices = np.array([1,2,3,4,5,6,2,1])
>>> namelst = np.array(['Hello', ',', 'my', 'name', 'is', 'mika', ',', 'Hello'])
>>> ' '.join(namelst[indices-1])
'Hello , my name is mika , Hello'
I'm trying to find the most pythonic way to split a string like
"some words in a string"
into single words. string.split(' ') works OK, but it returns a bunch of empty entries in the list wherever there are consecutive spaces. Of course I could iterate over the list and remove them, but I was wondering if there is a better way?
Just use my_str.split() without ' '.
You can also limit how many splits are performed by specifying the second parameter (maxsplit):
>>> ' 1 2 3 4 '.split(None, 2)
['1', '2', '3 4 ']
>>> ' 1 2 3 4 '.split(None, 1)
['1', '2 3 4 ']
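In Python 3 the same limit can be passed by keyword, which reads a little more clearly. A small sketch, using a made-up log line as input:

```python
line = "2024-01-05 ERROR disk almost full"

# maxsplit=2 splits at most twice; the rest stays in the last item.
date, level, message = line.split(maxsplit=2)
print(message)
# disk almost full

# rsplit counts splits from the right-hand side instead.
print("a b c d".rsplit(None, 1))
# ['a b c', 'd']
```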
How about:
re.split(r'\s+',string)
\s matches any whitespace character, so \s+ matches a run of contiguous whitespace.
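One difference from str.split() worth knowing: re.split() returns empty strings when the input starts or ends with whitespace. The sketch below shows that, along with two ways around it (stripping first, or matching the non-whitespace runs with re.findall). The sample string is made up.

```python
import re

padded = "  some   words\tin a string \n"

print(re.split(r"\s+", padded))
# ['', 'some', 'words', 'in', 'a', 'string', '']

print(re.split(r"\s+", padded.strip()))
# ['some', 'words', 'in', 'a', 'string']

print(re.findall(r"\S+", padded))
# ['some', 'words', 'in', 'a', 'string']
```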
Use string.split() without an argument or re.split(r'\s+', string) instead:
>>> s = 'some words in a string with spaces'
>>> s.split()
['some', 'words', 'in', 'a', 'string', 'with', 'spaces']
>>> import re; re.split(r'\s+', s)
['some', 'words', 'in', 'a', 'string', 'with', 'spaces']
From the docs:
If sep is not specified or is None, a different splitting algorithm is applied: runs of consecutive whitespace are regarded as a single separator, and the result will contain no empty strings at the start or end if the string has leading or trailing whitespace. Consequently, splitting an empty string or a string consisting of just whitespace with a None separator returns [].
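The edge cases described in that passage can be checked directly. A short sketch comparing the no-argument form with an explicit " " separator:

```python
print("".split())           # []
print("   ".split())        # []
print("".split(" "))        # ['']
print("   ".split(" "))     # ['', '', '', '']
print(" a  b ".split(" "))  # ['', 'a', '', 'b', '']
```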
>>> a = "some words in a string"
>>> a.split(" ")
['some', 'words', 'in', 'a', 'string']
The separator is not included in the result, so I guess there is something more going on with your string; otherwise, it should work.
If your string contains runs of more than one whitespace character, just use split() without parameters:
>>> a = "some words in a string     "
>>> a.split()
['some', 'words', 'in', 'a', 'string']
>>> a.split(" ")
['some', 'words', 'in', 'a', 'string', '', '', '', '', '']
Otherwise, split(" ") splits on every single space, which produces the empty strings shown above.
The most Pythonic and correct way is to just not specify any delimiter:
"some words in a string".split()
# => ['some', 'words', 'in', 'a', 'string']
Also read:
How can I split by 1 or more occurrences of a delimiter in Python?
text = "".join([w and w+" " for w in text.split(" ")])
This collapses runs of spaces into single spaces (note that it leaves a trailing space after the last word).
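A common alternative, sketched here, is to split with no argument and join with a single space. Unlike the expression above it also collapses tabs and newlines and leaves no trailing space; the sample text is made up.

```python
text = "some   words  in a    string  "

# split() drops all whitespace runs; join puts single spaces back.
normalized = " ".join(text.split())
print(repr(normalized))
# 'some words in a string'
```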