Python: split punctuation but still include it

This is the list of strings that I have:
[
['It', 'was', 'the', 'best', 'of', 'times,'],
['it', 'was', 'the', 'worst', 'of', 'times']
]
I need to split the punctuation off 'times,' so that it becomes 'times', ','.
As another example, if I have 'Why?!?' I would need it to become 'Why', '?!?'.
import string
def punctuation(words):
    for word in words:
        if word contains string.punctuation:
            word.split()
I know that isn't valid Python at all, but that's what I want it to do.

You can use finditer even if the string is more complex.
>>> import re, string
>>> r = re.compile(r"(\w+)([" + re.escape(string.punctuation) + "]*)")
>>> s = 'Why?!?Why?*Why'
>>> [x.groups() for x in r.finditer(s)]
[('Why', '?!?'), ('Why', '?*'), ('Why', '')]
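If the goal is the flat word-plus-punctuation list from the original question, the same compiled pattern can be applied word by word. A minimal sketch (the variable names here are mine, not from the question):
>>> words = ['It', 'was', 'the', 'best', 'of', 'times,']
>>> result = []
>>> for word in words:
...     for stem, punc in r.findall(word):
...         result.append(stem)
...         if punc:
...             result.append(punc)
...
>>> result
['It', 'was', 'the', 'best', 'of', 'times', ',']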

You can use a regular expression, for example:
In [1]: import re
In [2]: re.findall(r'(\w+)(\W+)', 'times,')
Out[2]: [('times', ',')]
In [3]: re.findall(r'(\w+)(\W+)', 'why?!?')
Out[3]: [('why', '?!?')]

A generator solution without regex:
import string
from itertools import takewhile, dropwhile
def splitp(s):
    not_punc = lambda c: c in string.ascii_letters + "'"  # won't split "don't"
    for w in s:
        punc = ''.join(dropwhile(not_punc, w))
        if punc:
            yield ''.join(takewhile(not_punc, w))
            yield punc
        else:
            yield w

list(splitp(s))

Something like this? (Assumes punct is always at end)
def lcheck(word):
    for i, letter in enumerate(word):
        if not letter.isalpha():
            return [word[:i], word[i:]]
    return [word]

value = 'times,'
print lcheck(value)

Related

Splitting a string when one of the substrings in a list is found in Python [duplicate]

I think what I want to do is a fairly common task but I've found no reference on the web. I have text with punctuation, and I want a list of the words.
"Hey, you - what are you doing here!?"
should be
['hey', 'you', 'what', 'are', 'you', 'doing', 'here']
But Python's str.split() only takes a single separator, so after splitting on whitespace I still have the punctuation stuck to the words. Any ideas?
re.split()
re.split(pattern, string[, maxsplit=0])
Split string by the occurrences of pattern. If capturing parentheses are used in pattern, then the text of all groups in the pattern are also returned as part of the resulting list. If maxsplit is nonzero, at most maxsplit splits occur, and the remainder of the string is returned as the final element of the list. (Incompatibility note: in the original Python 1.5 release, maxsplit was ignored. This has been fixed in later releases.)
>>> re.split(r'\W+', 'Words, words, words.')
['Words', 'words', 'words', '']
>>> re.split(r'(\W+)', 'Words, words, words.')
['Words', ', ', 'words', ', ', 'words', '.', '']
>>> re.split(r'\W+', 'Words, words, words.', 1)
['Words', 'words, words.']
A case where regular expressions are justified:
import re
DATA = "Hey, you - what are you doing here!?"
print re.findall(r"[\w']+", DATA)
# Prints ['Hey', 'you', 'what', 'are', 'you', 'doing', 'here']
Another quick way to do this without a regexp is to replace the characters first, as below:
>>> 'a;bcd,ef g'.replace(';',' ').replace(',',' ').split()
['a', 'bcd', 'ef', 'g']
So many answers, yet I can't find any solution that does efficiently what the title of the question literally asks for (splitting on multiple possible separators; many answers instead split on anything that is not a word, which is different). So here is an answer to the question in the title, one that relies on Python's standard and efficient re module:
>>> import re # Will be splitting on: , <space> - ! ? :
>>> filter(None, re.split(r"[, \-!?:]+", "Hey, you - what are you doing here!?"))
['Hey', 'you', 'what', 'are', 'you', 'doing', 'here']
where:
the […] matches one of the separators listed inside,
the \- in the regular expression is here to prevent the special interpretation of - as a character range indicator (as in A-Z),
the + skips one or more delimiters (it could be omitted thanks to the filter(), but this would unnecessarily produce empty strings between matched single-character separators), and
filter(None, …) removes the empty strings possibly created by leading and trailing separators (since empty strings have a false boolean value).
This re.split() precisely "splits with multiple separators", as asked for in the question title.
This solution is furthermore immune to the problems with non-ASCII characters in words found in some other solutions (see the first comment to ghostdog74's answer).
The re module is much more efficient (in speed and concision) than doing Python loops and tests "by hand"!
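One caveat: in Python 3, filter() returns an iterator rather than a list, so wrap the call in list() to get the same output:
>>> import re
>>> list(filter(None, re.split(r"[, \-!?:]+", "Hey, you - what are you doing here!?")))
['Hey', 'you', 'what', 'are', 'you', 'doing', 'here']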
Another way, without regex
import string
punc = string.punctuation
thestring = "Hey, you - what are you doing here!?"
s = list(thestring)
''.join([o for o in s if o not in punc]).split()
Pro-Tip: Use string.translate for the fastest string operations Python has.
Some proof...
First, the slow way (sorry pprzemek):
>>> import timeit
>>> S = 'Hey, you - what are you doing here!?'
>>> def my_split(s, seps):
...     res = [s]
...     for sep in seps:
...         s, res = res, []
...         for seq in s:
...             res += seq.split(sep)
...     return res
...
>>> timeit.Timer('my_split(S, punctuation)', 'from __main__ import S,my_split; from string import punctuation').timeit()
54.65477919578552
Next, we use re.findall() (as given by the suggested answer). MUCH faster:
>>> timeit.Timer('findall(r"\w+", S)', 'from __main__ import S; from re import findall').timeit()
4.194725036621094
Finally, we use translate:
>>> from string import translate,maketrans,punctuation
>>> T = maketrans(punctuation, ' '*len(punctuation))
>>> timeit.Timer('translate(S, T).split()', 'from __main__ import S,T,translate').timeit()
1.2835021018981934
Explanation:
string.translate is implemented in C and, unlike many string manipulation routines written in Python, it does its work in a single pass with no intermediate copies. So it's about as fast as you can get for string substitution.
It's a bit awkward, though, as it needs a translation table in order to do this magic. You can make a translation table with the maketrans() convenience function. The objective here is to translate all unwanted characters to spaces: a one-for-one substitution done in that single pass. So this is fast!
Next, we use good old split(). By default, split() operates on all whitespace characters, grouping consecutive runs together for the split. The result is the list of words you want. And this approach is almost 4x faster than re.findall()!
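The timings above use the Python 2 string-module functions; on Python 3, string.translate and string.maketrans are gone, and the same idea is written with the str methods instead. A rough sketch of the equivalent (not a re-run benchmark):
import string

# Map every punctuation character to a space, then split on whitespace.
table = str.maketrans(string.punctuation, ' ' * len(string.punctuation))
words = 'Hey, you - what are you doing here!?'.translate(table).split()
print(words)  # ['Hey', 'you', 'what', 'are', 'you', 'doing', 'here']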
I had a similar dilemma and didn't want to use the re module.
def my_split(s, seps):
    res = [s]
    for sep in seps:
        s, res = res, []
        for seq in s:
            res += seq.split(sep)
    return res
print my_split('1111 2222 3333;4444,5555;6666', [' ', ';', ','])
['1111', '', '2222', '3333', '4444', '5555', '6666']
First, I want to agree with others that the regex or str.translate(...) based solutions are the most performant. For my use case the performance of this function wasn't significant, so I wanted to add ideas that I considered with that in mind.
My main goal was to generalize ideas from some of the other answers into one solution that could work for strings containing more than just regex words (i.e., blacklisting the explicit subset of punctuation characters vs whitelisting word characters).
Note that, in any approach, one might also consider using string.punctuation in place of a manually defined list.
Option 1 - re.sub
I was surprised to see no answer so far uses re.sub(...). I find it a simple and natural approach to this problem.
import re
my_str = "Hey, you - what are you doing here!?"
words = re.split(r'\s+', re.sub(r'[,\-!?]', ' ', my_str).strip())
In this solution, I nested the call to re.sub(...) inside re.split(...). If performance is critical, compiling the regex outside the call could be beneficial; for my use case the difference wasn't significant, so I prefer simplicity and readability.
Option 2 - str.replace
This is a few more lines, but it has the benefit of being expandable without having to check whether you need to escape a certain character in regex.
my_str = "Hey, you - what are you doing here!?"
replacements = (',', '-', '!', '?')
for r in replacements:
    my_str = my_str.replace(r, ' ')
words = my_str.split()
It would have been nice to be able to map the str.replace to the string instead, but I don't think it can be done with immutable strings, and while mapping against a list of characters would work, running every replacement against every character sounds excessive. (Edit: See next option for a functional example.)
Option 3 - functools.reduce
(In Python 2, reduce is available in global namespace without importing it from functools.)
import functools
my_str = "Hey, you - what are you doing here!?"
replacements = (',', '-', '!', '?')
my_str = functools.reduce(lambda s, sep: s.replace(sep, ' '), replacements, my_str)
words = my_str.split()
join = lambda x: sum(x,[]) # a.k.a. flatten1([[1],[2,3],[4]]) -> [1,2,3,4]
# ...alternatively...
join = lambda lists: [x for l in lists for x in l]
Then this becomes a three-liner:
fragments = [text]
for token in tokens:
    fragments = join(f.split(token) for f in fragments)
Explanation
This is what is known in Haskell as the list monad. The idea behind the monad is that once you are "in the monad" you "stay in the monad" until something takes you out. For example, say you map Python's range function over a list and concatenate the resulting lists: map(range, [3, 4, 1]), concatenated, gives [0, 1, 2, 0, 1, 2, 3, 0]. This map-then-concatenate operation is what Haskell calls concatMap (the list monad's bind). The idea here is that you've got an operation you're applying (splitting on a token), and whenever you do that, you join the result back into the list.
You can abstract this into a function with tokens=string.punctuation by default, as sketched after the list below.
Advantages of this approach:
This approach (unlike naive regex-based approaches) can work with arbitrary-length tokens (which regex can also do with more advanced syntax).
You are not restricted to mere tokens; you could have arbitrary logic in place of each token, for example one of the "tokens" could be a function which splits according to how nested parentheses are.
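Here is one possible packaging of the idea (a sketch under those assumptions; split_on_tokens is my name for it, not the answer's):
import string

def split_on_tokens(text, tokens=string.punctuation):
    # Repeatedly split every fragment on each token, flattening as we go.
    join = lambda lists: [x for l in lists for x in l]
    fragments = [text]
    for token in tokens:
        fragments = join(f.split(token) for f in fragments)
    return [f for f in fragments if f]  # drop empty fragments

# Include ' ' as a token to also split on spaces:
print(split_on_tokens("Hey, you - what are you doing here!?", string.punctuation + ' '))
# ['Hey', 'you', 'what', 'are', 'you', 'doing', 'here']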
I like re, but here is my solution without it:
from itertools import groupby
sep = ' ,-!?'
s = "Hey, you - what are you doing here!?"
print [''.join(g) for k, g in groupby(s, sep.__contains__) if not k]
sep.__contains__ is the method that the 'in' operator uses. Basically, it is the same as
lambda ch: ch in sep
but is more convenient here.
groupby takes our string and the function. It splits the string into groups using that function: whenever the function's value changes, a new group is generated. So sep.__contains__ is exactly what we need.
groupby returns a sequence of pairs, where pair[0] is the result of our function and pair[1] is the group. Using 'if not k', we filter out the groups of separators (because sep.__contains__ returns True on separators). That's all: we now have a sequence of groups, each of which is a word (a group is actually an iterable, so we use join to convert it to a string).
This solution is quite general, because it uses a function to separate the string (you can split on any condition you need). Also, it doesn't create intermediate strings/lists: you can remove the join and the expression becomes lazy, since each group is an iterator, as sketched below.
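For illustration, the lazy variant that last sentence alludes to might look like this (a sketch; splitter is my name for it):
from itertools import groupby

def splitter(s, sep=' ,-!?'):
    # Yield one group (an iterator of characters) per word; nothing is joined eagerly.
    for is_sep, group in groupby(s, sep.__contains__):
        if not is_sep:
            yield group

words = [''.join(g) for g in splitter("Hey, you - what are you doing here!?")]
print(words)  # ['Hey', 'you', 'what', 'are', 'you', 'doing', 'here']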
Use replace two times:
a = '11223FROM33344INTO33222FROM3344'
a.replace('FROM', ',,,').replace('INTO', ',,,').split(',,,')
results in:
['11223', '33344', '33222', '3344']
try this:
import re
phrase = "Hey, you - what are you doing here!?"
matches = re.findall(r'\w+', phrase)
print matches
this will print ['Hey', 'you', 'what', 'are', 'you', 'doing', 'here']
In Python 3, you can use the method from PY4E - Python for Everybody.
We can solve both these problems by using the string methods lower and translate (together with string.punctuation). translate is the most subtle of these. Here is the documentation for translate:
your_string.translate(your_string.maketrans(fromstr, tostr, deletestr))
Replace the characters in fromstr with the character in the same position in tostr and delete all characters that are in deletestr. The fromstr and tostr can be empty strings and the deletestr parameter can be omitted.
You can see the punctuation characters:
In [10]: import string
In [11]: string.punctuation
Out[11]: '!"#$%&\'()*+,-./:;<=>?@[\\]^_`{|}~'
For your example:
In [12]: your_str = "Hey, you - what are you doing here!?"
In [13]: line = your_str.translate(your_str.maketrans('', '', string.punctuation))
In [14]: line = line.lower()
In [15]: words = line.split()
In [16]: print(words)
['hey', 'you', 'what', 'are', 'you', 'doing', 'here']
For more information, see:
PY4E - Python for Everybody
str.translate
str.maketrans
Python String maketrans() Method
Instead of using the re module's re.split, you can achieve the same result with pandas' Series.str.split method.
First, create a Series from the above string and then apply the method to it.
import pandas as pd

thestring = pd.Series("Hey, you - what are you doing here!?")
thestring.str.split(pat=',|-')
The parameter pat takes the delimiters and returns the split string as an array. Here the two delimiters are passed using | (the regex alternation operator).
The output is as follows:
[Hey, you , what are you doing here!?]
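To get an ordinary Python list of words back out (rather than a Series holding one list), you could pass a regular-expression character class covering all the delimiters and index into the result; recent pandas versions also accept regex=True to make the intent explicit. A sketch, continuing from the Series above:
words = [w for w in thestring.str.split(r'[,\s\-!?]+')[0] if w]
print(words)  # ['Hey', 'you', 'what', 'are', 'you', 'doing', 'here']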
I'm re-acquainting myself with Python and needed the same thing.
The findall solution may be better, but I came up with this:
tokens = [x.strip() for x in data.split(',')]
Using maketrans and translate (Python 2), you can do it easily and neatly:
import string
specials = ',.!?:;"()<>[]#$=-/'
trans = string.maketrans(specials, ' '*len(specials))
body = body.translate(trans)
words = body.strip().split()
First of all, I don't think that your intention is to actually use punctuation as delimiters in the split functions. Your description suggests that you simply want to eliminate punctuation from the resultant strings.
I come across this pretty frequently, and my usual solution doesn't require re.
One-liner lambda function w/ list comprehension:
(requires import string):
split_without_punc = lambda text: [word.strip(string.punctuation)
                                   for word in text.split()
                                   if word.strip(string.punctuation) != '']
# Call function
split_without_punc("Hey, you -- what are you doing?!")
# returns ['Hey', 'you', 'what', 'are', 'you', 'doing']
Function (traditional)
As a traditional function, this is still only two lines with a list comprehension (in addition to import string):
def split_without_punctuation2(text):
    # Split on whitespace
    words = text.split()
    # Strip punctuation from each word
    return [word.strip(string.punctuation) for word in words
            if word.strip(string.punctuation) != '']
split_without_punctuation2("Hey, you -- what are you doing?!")
# returns ['Hey', 'you', 'what', 'are', 'you', 'doing']
It will also naturally leave contractions and hyphenated words intact. You can always use text.replace("-", " ") to turn hyphens into spaces before the split.
General Function w/o Lambda or List Comprehension
For a more general solution (where you can specify the characters to eliminate), and without a list comprehension, you get:
def split_without(text: str, ignore: str) -> list:
    # Split on whitespace
    split_string = text.split()
    # Strip any characters in the ignore string, and skip empty strings
    words = []
    for word in split_string:
        word = word.strip(ignore)
        if word != '':
            words.append(word)
    return words
# Situation-specific call to general function
import string
final_text = split_without("Hey, you - what are you doing?!", string.punctuation)
# returns ['Hey', 'you', 'what', 'are', 'you', 'doing']
Of course, you can always generalize the lambda function to any specified string of characters as well.
I had to come up with my own solution since everything I've tested so far failed at some point.
>>> import re
>>> def split_words(text):
...     rgx = re.compile(r"((?:(?<!'|\w)(?:\w-?'?)+(?<!-))|(?:(?<='|\w)(?:\w-?'?)+(?=')))")
...     return rgx.findall(text)
It seems to be working fine, at least for the examples below.
>>> split_words("The hill-tops gleam in morning's spring.")
['The', 'hill-tops', 'gleam', 'in', "morning's", 'spring']
>>> split_words("I'd say it's James' 'time'.")
["I'd", 'say', "it's", "James'", 'time']
>>> split_words("tic-tac-toe's tic-tac-toe'll tic-tac'tic-tac we'll--if tic-tac")
["tic-tac-toe's", "tic-tac-toe'll", "tic-tac'tic-tac", "we'll", 'if', 'tic-tac']
>>> split_words("google.com email#google.com split_words")
['google', 'com', 'email', 'google', 'com', 'split_words']
>>> split_words("Kurt Friedrich Gödel (/ˈɡɜːrdəl/;[2] German: [ˈkʊɐ̯t ˈɡøːdl̩] (listen);")
['Kurt', 'Friedrich', 'Gödel', 'ˈɡɜːrdəl', '2', 'German', 'ˈkʊɐ', 't', 'ˈɡøːdl', 'listen']
>>> split_words("April 28, 1906 – January 14, 1978) was an Austro-Hungarian-born Austrian...")
['April', '28', '1906', 'January', '14', '1978', 'was', 'an', 'Austro-Hungarian-born', 'Austrian']
Another way to achieve this is to use the Natural Language Toolkit (nltk).
import nltk
data= "Hey, you - what are you doing here!?"
word_tokens = nltk.tokenize.regexp_tokenize(data, r'\w+')
print word_tokens
This prints: ['Hey', 'you', 'what', 'are', 'you', 'doing', 'here']
The biggest drawback of this method is that you need to install the nltk package.
The benefits are that you can do a lot of fun stuff with the rest of the nltk package once you get your tokens.
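If you also want the punctuation kept as separate tokens (as in the first question above), nltk's general-purpose tokenizer does that out of the box. A sketch (it needs the 'punkt' tokenizer model downloaded once):
import nltk

# nltk.download('punkt')  # one-time model download
data = "Hey, you - what are you doing here!?"
print(nltk.word_tokenize(data))
# roughly: ['Hey', ',', 'you', '-', 'what', 'are', 'you', 'doing', 'here', '!', '?']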
I got the same problem as @ooboo and found this topic.
@ghostdog74 inspired me; maybe someone will find my solution useful.
str1='adj:sg:nom:m1.m2.m3:pos'
splitat=':.'
''.join([s if s not in splitat else ' ' for s in str1]).split()
If you don't want to split at spaces, put some other character in place of the space and split on that same character.
First of all, compile the pattern with re.compile() before performing any regex operation in a loop; the precompiled pattern works faster than re-parsing the pattern on every call.
So for your problem, first compile the pattern and then perform the action with it:
import re
DATA = "Hey, you - what are you doing here!?"
reg_tok = re.compile(r"[\w']+")
print reg_tok.findall(DATA)
Here is the answer with some explanation.
st = "Hey, you - what are you doing here!?"
# Replace every non-alphanumeric character with a space, then join.
new_string = ''.join([' ' if not x.isalnum() else x for x in st])
# output of new_string
'Hey you what are you doing here '
# str.split() will remove all the empty strings if a separator is not provided
new_list = new_string.split()
# output of new_list
['Hey', 'you', 'what', 'are', 'you', 'doing', 'here']
# we can join it to get a complete string without any non-alphanumeric characters
' '.join(new_list)
# output
'Hey you what are you doing here'
or in one line, we can do like this:
(''.join([' ' if not x.isalnum() else x for x in st])).split()
# output
['Hey', 'you', 'what', 'are', 'you', 'doing', 'here']
Create a function that takes as input two strings (the source string to be split and the splitlist string of delimiters) and outputs a list of split words:
def split_string(source, splitlist):
    output = []  # output list of cleaned words
    atsplit = True
    for char in source:
        if char in splitlist:
            atsplit = True
        else:
            if atsplit:
                output.append(char)  # append new word after split
                atsplit = False
            else:
                output[-1] = output[-1] + char  # continue copying characters until next split
    return output
I like pprzemek's solution because it does not assume that the delimiters are single characters and it doesn't try to leverage a regex (which would not work well if the number of separators got to be crazy long).
Here's a more readable version of the above solution for clarity:
def split_string_on_multiple_separators(input_string, separators):
    buffer = [input_string]
    for sep in separators:
        strings = buffer
        buffer = []  # reset the buffer
        for s in strings:
            buffer = buffer + s.split(sep)
    return buffer
Here is my go at a split with multiple delimiters:
def msplit(s, delims):
    w = ''
    for z in s:
        if z not in delims:
            w += z
        else:
            if len(w) > 0:
                yield w
            w = ''
    if len(w) > 0:
        yield w
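Since msplit is a generator, you consume it with list(). For example:
print(list(msplit("Hey, you - what are you doing here!?", " ,-!?")))
# ['Hey', 'you', 'what', 'are', 'you', 'doing', 'here']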
I think the following best suits your needs:
\W+ may be suitable for this case, but may not be suitable for other cases.
filter(None, re.compile(r'[ ,\-!?]').split("Hey, you - what are you doing here!?"))
Here's my take on it:
def split_string(source, splitlist):
    splits = frozenset(splitlist)
    l = []
    s1 = ""
    for c in source:
        if c in splits:
            if s1:
                l.append(s1)
            s1 = ""
        else:
            s1 = s1 + c
    if s1:
        l.append(s1)
    return l
>>> out = split_string("First Name,Last Name,Street Address,City,State,Zip Code", ",")
>>> print out
['First Name', 'Last Name', 'Street Address', 'City', 'State', 'Zip Code']
I like the replace() way the best. The following procedure changes all separators defined in the string splitlist to the first separator in splitlist and then splits the text on that one separator. It also handles the case where splitlist happens to be an empty string. It returns a list of words with no empty strings in it.
def split_string(text, splitlist):
    for sep in splitlist:
        text = text.replace(sep, splitlist[0])
    return filter(None, text.split(splitlist[0])) if splitlist else [text]
def get_words(s):
    l = []
    w = ''
    for c in s.lower():
        if c in '-!?,. ':
            if w != '':
                l.append(w)
            w = ''
        else:
            w = w + c
    if w != '':
        l.append(w)
    return l
Here is the usage:
>>> s = "Hey, you - what are you doing here!?"
>>> print get_words(s)
['hey', 'you', 'what', 'are', 'you', 'doing', 'here']
If you want a reversible operation (preserve the delimiters), you can use this function:
def tokenizeSentence_Reversible(sentence):
    setOfDelimiters = ['.', ' ', ',', '*', ';', '!']
    listOfTokens = [sentence]
    for delimiter in setOfDelimiters:
        newListOfTokens = []
        for token in listOfTokens:
            # Keep the delimiter in front of every fragment except the first.
            ll = [([delimiter, w] if i > 0 else [w]) for i, w in enumerate(token.split(delimiter))]
            tokens = [item for sublist in ll for item in sublist]  # flatten
            tokens = [t for t in tokens if t]  # remove empty tokens: ''
            newListOfTokens.extend(tokens)
        listOfTokens = newListOfTokens
    return listOfTokens
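A quick usage sketch (mine, not the answer's) showing what "reversible" means here: joining the tokens reproduces the original sentence exactly:
sentence = "Hey, you - what are you doing here!?"
tokens = tokenizeSentence_Reversible(sentence)
print(tokens)
# ['Hey', ',', ' ', 'you', ' ', '-', ' ', 'what', ' ', 'are', ' ', 'you', ' ', 'doing', ' ', 'here', '!', '?']
assert ''.join(tokens) == sentence  # the delimiters were preserved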

Get a string after a character in python

I want to get the string that comes after a '%' symbol; it should end as soon as any non-letter character (a digit or other symbol) appears.
for example:
string = 'Hi %how are %YOU786$ex doing'
It should be returned as a list:
['how', 'you']
I tried
string = text.split()
sample = []
for i in string:
    if '%' in i:
        sample.append(i[1:index].lower())
return sample
but I don't know how to get rid of 'you786$ex'.
EDIT: I don't want to import re
You can use a regular expression.
>>> import re
>>>
>>> s = 'Hi %how are %YOU786$ex doing'
>>> re.findall('%([a-z]+)', s.lower())
['how', 'you']
regex101 details
This can be most easily done with re.findall():
import re
re.findall(r'%([a-z]+)', string.lower())
This returns:
['how', 'you']
Or you can use str.split() and iterate over the characters:
sample = []
for token in string.lower().split('%')[1:]:
    word = ''
    for char in token:
        if char.isalpha():
            word += char
        else:
            break
    sample.append(word)
sample would become:
['how', 'you']
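Given the "no re" constraint, the inner loop can also be collapsed with itertools.takewhile; this is a compact sketch of the same idea, not part of the original answer:
from itertools import takewhile

string = 'Hi %how are %YOU786$ex doing'
sample = [''.join(takewhile(str.isalpha, chunk))
          for chunk in string.lower().split('%')[1:]]
print(sample)  # ['how', 'you']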
Use Regex (Regular Expressions).
First, create a Regex pattern for your task. You could use online tools to test it. See regex for your task: https://regex101.com/r/PMSvtK/1
Then just use this regex in Python:
import re
def parse_string(string):
    return re.findall("%([a-zA-Z]+)", string)
print(parse_string('Hi %how are %YOU786$ex doing'))
Output:
['how', 'YOU']

Python split string on space or sentence inside of parenthesis

I was wondering if it would be possible to split a string such as
string = 'hello world [Im nick][introduction]'
into an array such as
['hello', 'world', '[Im nick][introduction]']
It doesn't have to be efficient; I just want a way to split a sentence into its words unless they are inside brackets, in which case the bracketed part is not split.
I need this because I have a markdown file with sentences such as
- What is the weather in [San antonio, texas][location]
I need the "San antonio, texas" part to stay together as a full phrase inside the array. Would this be possible? The array would look like:
array = ['what', 'is', 'the', 'weather', 'in', '[San antonio, texas][location]']
Maybe this could work for you:
>>> s = 'What is the weather in [San antonio, texas][location]'
>>> i1 = s.index('[')
>>> i2 = s.index('[', i1 + 1)
>>> part_1 = s[:i1].split() # everything before the first bracket
>>> part_2 = [s[i1:i2], ] # first bracket pair
>>> part_3 = [s[i2:], ] # second bracket pair
>>> parts = part_1 + part_2 + part_3
>>> s
'What is the weather in [San antonio, texas][location]'
>>> parts
['What', 'is', 'the', 'weather', 'in', '[San antonio, texas]', '[location]']
It searches for the left brackets and uses that as a reference before splitting by spaces.
This assumes:
that there is no other text between the first closing bracket and the second opening bracket.
that there is nothing after the second closing bracket
Here is a more robust solution:
def do_split(s):
    parts = []
    while '[' in s:
        start = s.index('[')
        end = s.index(']', s.index(']') + 1) + 1  # find the second closing bracket
        parts.extend(s[:start].split())  # everything before the opening bracket
        parts.append(s[start:end])  # 2 pairs of brackets
        s = s[end:]  # remove the processed part of the string
    parts.extend(s.split())  # add the remainder
    return parts
This yields:
>>> do_split('What is the weather in [San antonio, texas][location] on [friday][date]?')
['What', 'is', 'the', 'weather', 'in', '[San antonio, texas][location]', 'on', '[friday][date]', '?']
Maybe this short snippet can help you. But note that this only works if everything you said holds true for all the entries in the file.
s = 'What is the weather in [San antonio, texas][location]'
s = s.split(' [')
s[1] = '[' + s[1] # add back the split character
mod = s[0] # store in a variable
mod = mod.split(' ') # split the first part on space
mod.append(s[1]) # attach back the right part
print(mod)
Outputs:
['What', 'is', 'the', 'weather', 'in', '[San antonio, texas][location]']
and for s = 'hello world [Im nick][introduction]'
['hello', 'world', '[Im nick][introduction]']
For a one-liner, use functional programming tools such as reduce from the functools module:
from functools import reduce
reduce(lambda x, y: x + ['[' + y] if y.endswith(']') else x + y.split(), s.split(' ['), [])
or, slightly shorter, using sum and map:
sum(map(lambda x: ['[' + x] if x.endswith(']') else x.split(), s.split(' [')), [])
This code below will work with your example. Hope it helps :)
I'm sure it can be better but now I have to go. Please enjoy.
string = 'hello world [Im nick][introduction]'
tokens = string.split(' ')
finall = []
for idx, elem in enumerate(tokens):
    currentelem = elem
    if currentelem[0] == '[' and currentelem[-1] != ']':
        currentelem += ' ' + tokens[(idx + 1) % len(tokens)]
        finall.append(currentelem)
    elif currentelem[0] != '[' and currentelem[-1] != ']':
        finall.append(currentelem)
print(finall)
Let me offer an alternative to the ones above:
import re
string = 'hello world [Im nick][introduction]'
re.findall(r'(\[.+\]|\w+)', string)
Produces:
['hello', 'world', '[Im nick][introduction]']
You can use a regex split with lookahead; note that it is simpler to filter out empty entries with filter or a list comprehension than to avoid them in the regex:
import re
s = 'sss sss bbb [zss sss][zsss ss] sss sss bbb [ss sss][sss ss]'
[x for x in re.split(r"(?=\[[^\]\[]+\])* ", s) if x]

How to split a string (containing no spaces) with multiple delimiters in Python? [duplicate]

Partition a string with more than 1 separator [duplicate]

I think what I want to do is a fairly common task but I've found no reference on the web. I have text with punctuation, and I want a list of the words.
"Hey, you - what are you doing here!?"
should be
['hey', 'you', 'what', 'are', 'you', 'doing', 'here']
But Python's str.split() only works with one argument, so I have all words with the punctuation after I split with whitespace. Any ideas?
re.split()
re.split(pattern, string[, maxsplit=0])
Split string by the occurrences of pattern. If capturing parentheses are used in pattern, then the text of all groups in the pattern are also returned as part of the resulting list. If maxsplit is nonzero, at most maxsplit splits occur, and the remainder of the string is returned as the final element of the list. (Incompatibility note: in the original Python 1.5 release, maxsplit was ignored. This has been fixed in later releases.)
>>> re.split('\W+', 'Words, words, words.')
['Words', 'words', 'words', '']
>>> re.split('(\W+)', 'Words, words, words.')
['Words', ', ', 'words', ', ', 'words', '.', '']
>>> re.split('\W+', 'Words, words, words.', 1)
['Words', 'words, words.']
A case where regular expressions are justified:
import re
DATA = "Hey, you - what are you doing here!?"
print re.findall(r"[\w']+", DATA)
# Prints ['Hey', 'you', 'what', 'are', 'you', 'doing', 'here']
Another quick way to do this without a regexp is to replace the characters first, as below:
>>> 'a;bcd,ef g'.replace(';',' ').replace(',',' ').split()
['a', 'bcd', 'ef', 'g']
So many answers, yet I can't find any solution that does efficiently what the title of the questions literally asks for (splitting on multiple possible separators—instead, many answers split on anything that is not a word, which is different). So here is an answer to the question in the title, that relies on Python's standard and efficient re module:
>>> import re # Will be splitting on: , <space> - ! ? :
>>> filter(None, re.split("[, \-!?:]+", "Hey, you - what are you doing here!?"))
['Hey', 'you', 'what', 'are', 'you', 'doing', 'here']
where:
the […] matches one of the separators listed inside,
the \- in the regular expression is here to prevent the special interpretation of - as a character range indicator (as in A-Z),
the + skips one or more delimiters (it could be omitted thanks to the filter(), but this would unnecessarily produce empty strings between matched single-character separators), and
filter(None, …) removes the empty strings possibly created by leading and trailing separators (since empty strings have a false boolean value).
This re.split() precisely "splits with multiple separators", as asked for in the question title.
This solution is furthermore immune to the problems with non-ASCII characters in words found in some other solutions (see the first comment to ghostdog74's answer).
The re module is much more efficient (in speed and concision) than doing Python loops and tests "by hand"!
Another way, without regex
import string
punc = string.punctuation
thestring = "Hey, you - what are you doing here!?"
s = list(thestring)
''.join([o for o in s if not o in punc]).split()
Pro-Tip: Use string.translate for the fastest string operations Python has.
Some proof...
First, the slow way (sorry pprzemek):
>>> import timeit
>>> S = 'Hey, you - what are you doing here!?'
>>> def my_split(s, seps):
... res = [s]
... for sep in seps:
... s, res = res, []
... for seq in s:
... res += seq.split(sep)
... return res
...
>>> timeit.Timer('my_split(S, punctuation)', 'from __main__ import S,my_split; from string import punctuation').timeit()
54.65477919578552
Next, we use re.findall() (as given by the suggested answer). MUCH faster:
>>> timeit.Timer('findall(r"\w+", S)', 'from __main__ import S; from re import findall').timeit()
4.194725036621094
Finally, we use translate:
>>> from string import translate,maketrans,punctuation
>>> T = maketrans(punctuation, ' '*len(punctuation))
>>> timeit.Timer('translate(S, T).split()', 'from __main__ import S,T,translate').timeit()
1.2835021018981934
Explanation:
string.translate is implemented in C and unlike many string manipulation functions in Python, string.translate does not produce a new string. So it's about as fast as you can get for string substitution.
It's a bit awkward, though, as it needs a translation table in order to do this magic. You can make a translation table with the maketrans() convenience function. The objective here is to translate all unwanted characters to spaces. A one-for-one substitute. Again, no new data is produced. So this is fast!
Next, we use good old split(). split() by default will operate on all whitespace characters, grouping them together for the split. The result will be the list of words that you want. And this approach is almost 4x faster than re.findall()!
I had a similar dilemma and didn't want to use 're' module.
def my_split(s, seps):
res = [s]
for sep in seps:
s, res = res, []
for seq in s:
res += seq.split(sep)
return res
print my_split('1111 2222 3333;4444,5555;6666', [' ', ';', ','])
['1111', '', '2222', '3333', '4444', '5555', '6666']
First, I want to agree with others that the regex or str.translate(...) based solutions are most performant. For my use case the performance of this function wasn't significant, so I wanted to add ideas that I considered with that criteria.
My main goal was to generalize ideas from some of the other answers into one solution that could work for strings containing more than just regex words (i.e., blacklisting the explicit subset of punctuation characters vs whitelisting word characters).
Note that, in any approach, one might also consider using string.punctuation in place of a manually defined list.
Option 1 - re.sub
I was surprised to see no answer so far uses re.sub(...). I find it a simple and natural approach to this problem.
import re
my_str = "Hey, you - what are you doing here!?"
words = re.split(r'\s+', re.sub(r'[,\-!?]', ' ', my_str).strip())
In this solution, I nested the call to re.sub(...) inside re.split(...) — but if performance is critical, compiling the regex outside could be beneficial — for my use case, the difference wasn't significant, so I prefer simplicity and readability.
Option 2 - str.replace
This is a few more lines, but it has the benefit of being expandable without having to check whether you need to escape a certain character in regex.
my_str = "Hey, you - what are you doing here!?"
replacements = (',', '-', '!', '?')
for r in replacements:
my_str = my_str.replace(r, ' ')
words = my_str.split()
It would have been nice to be able to map the str.replace to the string instead, but I don't think it can be done with immutable strings, and while mapping against a list of characters would work, running every replacement against every character sounds excessive. (Edit: See next option for a functional example.)
Option 3 - functools.reduce
(In Python 2, reduce is available in global namespace without importing it from functools.)
import functools
my_str = "Hey, you - what are you doing here!?"
replacements = (',', '-', '!', '?')
my_str = functools.reduce(lambda s, sep: s.replace(sep, ' '), replacements, my_str)
words = my_str.split()
join = lambda x: sum(x,[]) # a.k.a. flatten1([[1],[2,3],[4]]) -> [1,2,3,4]
# ...alternatively...
join = lambda lists: [x for l in lists for x in l]
Then this becomes a three-liner:
fragments = [text]
for token in tokens:
fragments = join(f.split(token) for f in fragments)
Explanation
This is what in Haskell is known as the List monad. The idea behind the monad is that once "in the monad" you "stay in the monad" until something takes you out. For example in Haskell, say you map the python range(n) -> [1,2,...,n] function over a List. If the result is a List, it will be append to the List in-place, so you'd get something like map(range, [3,4,1]) -> [0,1,2,0,1,2,3,0]. This is known as map-append (or mappend, or maybe something like that). The idea here is that you've got this operation you're applying (splitting on a token), and whenever you do that, you join the result into the list.
You can abstract this into a function and have tokens=string.punctuation by default.
Advantages of this approach:
This approach (unlike naive regex-based approaches) can work with arbitrary-length tokens (which regex can also do with more advanced syntax).
You are not restricted to mere tokens; you could have arbitrary logic in place of each token, for example one of the "tokens" could be a function which splits according to how nested parentheses are.
I like re, but here is my solution without it:
from itertools import groupby
sep = ' ,-!?'
s = "Hey, you - what are you doing here!?"
print [''.join(g) for k, g in groupby(s, sep.__contains__) if not k]
sep.__contains__ is a method used by 'in' operator. Basically it is the same as
lambda ch: ch in sep
but is more convenient here.
groupby takes our string and the function. It splits the string into groups using that function: whenever the function's value changes, a new group is started. So sep.__contains__ is exactly what we need.
groupby returns a sequence of pairs, where pair[0] is the result of our function and pair[1] is the group. Using 'if not k' we filter out the groups of separators (because sep.__contains__ returns True on separators). That's all: now we have a sequence of groups, each one a word (a group is actually an iterable, so we use join to convert it to a string).
This solution is quite general, because it uses a function to separate the string (you can split by any condition you need). Also, it doesn't create intermediate strings/lists (you can remove the join and the expression becomes lazy, since each group is an iterator).
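A generator variant of the same idea (the wrapper name is mine), which yields each word lazily:
from itertools import groupby

def split_when(s, is_sep):
    # Yield each maximal run of characters for which is_sep is False.
    for key, group in groupby(s, is_sep):
        if not key:
            yield ''.join(group)

sep = ' ,-!?'
list(split_when("Hey, you - what are you doing here!?", sep.__contains__))
# ['Hey', 'you', 'what', 'are', 'you', 'doing', 'here']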
Use replace two times:
a = '11223FROM33344INTO33222FROM3344'
a.replace('FROM', ',,,').replace('INTO', ',,,').split(',,,')
results in:
['11223', '33344', '33222', '3344']
try this:
import re
phrase = "Hey, you - what are you doing here!?"
matches = re.findall(r'\w+', phrase)
print matches
this will print ['Hey', 'you', 'what', 'are', 'you', 'doing', 'here']
In Python 3, you can use the method from PY4E - Python for Everybody.
We can solve both of these problems by using the string methods lower and translate together with the string.punctuation constant. translate is the most subtle of these. Here is the documentation for translate:
your_string.translate(your_string.maketrans(fromstr, tostr, deletestr))
Replace the characters in fromstr with the character in the same position in tostr and delete all characters that are in deletestr. The fromstr and tostr can be empty strings and the deletestr parameter can be omitted.
You can see the punctuation characters:
In [10]: import string
In [11]: string.punctuation
Out[11]: '!"#$%&\'()*+,-./:;<=>?@[\\]^_`{|}~'
For your example:
In [12]: your_str = "Hey, you - what are you doing here!?"
In [13]: line = your_str.translate(your_str.maketrans('', '', string.punctuation))
In [14]: line = line.lower()
In [15]: words = line.split()
In [16]: print(words)
['hey', 'you', 'what', 'are', 'you', 'doing', 'here']
For more information, you can refer:
PY4E - Python for Everybody
str.translate
str.maketrans
Python String maketrans() Method
Instead of using the re module function re.split, you can achieve the same result using the Series.str.split method of pandas.
First, create a series with the above string and then apply the method to the series.
thestring = pd.Series("Hey, you - what are you doing here!?")
thestring.str.split(pat = ',|-')
parameter pat takes the delimiters and returns the split string as an array. Here the two delimiters are passed using a | (or operator).
The output is as follows:
[Hey, you , what are you doing here!?]
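If you want all of the question's delimiters handled in one go, a regex character class works as pat as well. A sketch (a multi-character pat is interpreted as a regular expression; on pandas >= 1.4 you can also pass regex=True explicitly):
import pandas as pd

thestring = pd.Series(["Hey, you - what are you doing here!?"])
words = thestring.str.split(r'[,\-!?\s]+')
print(words.iloc[0])
# ['Hey', 'you', 'what', 'are', 'you', 'doing', 'here', '']
# (like re.split, a trailing delimiter leaves an empty string at the end)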
I'm re-acquainting myself with Python and needed the same thing.
The findall solution may be better, but I came up with this:
tokens = [x.strip() for x in data.split(',')]
using maketrans and translate you can do it easily and neatly
import string
specials = ',.!?:;"()<>[]#$=-/'
trans = string.maketrans(specials, ' '*len(specials))
body = body.translate(trans)
words = body.strip().split()
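That is the Python 2 API; in Python 3, string.maketrans is gone and the equivalent is the str.maketrans built-in:
# Python 3 equivalent of the snippet above
specials = ',.!?:;"()<>[]#$=-/'
trans = str.maketrans(specials, ' ' * len(specials))
body = "Hey, you - what are you doing here!?"
words = body.translate(trans).strip().split()
# ['Hey', 'you', 'what', 'are', 'you', 'doing', 'here']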
First of all, I don't think that your intention is to actually use punctuation as delimiters in the split functions. Your description suggests that you simply want to eliminate punctuation from the resultant strings.
I come across this pretty frequently, and my usual solution doesn't require re.
One-liner lambda function w/ list comprehension:
(requires import string):
split_without_punc = lambda text : [word.strip(string.punctuation) for word in
text.split() if word.strip(string.punctuation) != '']
# Call function
split_without_punc("Hey, you -- what are you doing?!")
# returns ['Hey', 'you', 'what', 'are', 'you', 'doing']
Function (traditional)
As a traditional function, this is still only two lines with a list comprehension (in addition to import string):
def split_without_punctuation2(text):
    # Split by whitespace
    words = text.split()
    # Strip punctuation from each word
    return [word.strip(string.punctuation) for word in words
            if word.strip(string.punctuation) != '']
split_without_punctuation2("Hey, you -- what are you doing?!")
# returns ['Hey', 'you', 'what', 'are', 'you', 'doing']
It will also naturally leave contractions and hyphenated words intact. You can always use text.replace("-", " ") to turn hyphens into spaces before the split.
General Function w/o Lambda or List Comprehension
For a more general solution (where you can specify the characters to eliminate), and without a list comprehension, you get:
def split_without(text: str, ignore: str) -> list:
    # Split by whitespace
    split_string = text.split()
    # Strip any characters in the ignore string, and ignore empty strings
    words = []
    for word in split_string:
        word = word.strip(ignore)
        if word != '':
            words.append(word)
    return words
# Situation-specific call to general function
import string
final_text = split_without("Hey, you - what are you doing?!", string.punctuation)
# returns ['Hey', 'you', 'what', 'are', 'you', 'doing']
Of course, you can always generalize the lambda function to any specified string of characters as well.
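For instance, a generalized lambda taking the characters to strip as a second argument (the name is mine):
split_without_chars = lambda text, chars: [
    word.strip(chars) for word in text.split() if word.strip(chars) != ''
]

split_without_chars("Hey, you -- what are you doing?!", ',-?!')
# ['Hey', 'you', 'what', 'are', 'you', 'doing']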
I had to come up with my own solution since everything I've tested so far failed at some point.
>>> import re
>>> def split_words(text):
...     rgx = re.compile(r"((?:(?<!'|\w)(?:\w-?'?)+(?<!-))|(?:(?<='|\w)(?:\w-?'?)+(?=')))")
...     return rgx.findall(text)
It seems to be working fine, at least for the examples below.
>>> split_words("The hill-tops gleam in morning's spring.")
['The', 'hill-tops', 'gleam', 'in', "morning's", 'spring']
>>> split_words("I'd say it's James' 'time'.")
["I'd", 'say', "it's", "James'", 'time']
>>> split_words("tic-tac-toe's tic-tac-toe'll tic-tac'tic-tac we'll--if tic-tac")
["tic-tac-toe's", "tic-tac-toe'll", "tic-tac'tic-tac", "we'll", 'if', 'tic-tac']
>>> split_words("google.com email#google.com split_words")
['google', 'com', 'email', 'google', 'com', 'split_words']
>>> split_words("Kurt Friedrich Gödel (/ˈɡɜːrdəl/;[2] German: [ˈkʊɐ̯t ˈɡøːdl̩] (listen);")
['Kurt', 'Friedrich', 'Gödel', 'ˈɡɜːrdəl', '2', 'German', 'ˈkʊɐ', 't', 'ˈɡøːdl', 'listen']
>>> split_words("April 28, 1906 – January 14, 1978) was an Austro-Hungarian-born Austrian...")
['April', '28', '1906', 'January', '14', '1978', 'was', 'an', 'Austro-Hungarian-born', 'Austrian']
Another way to achieve this is to use the Natural Language Tool Kit (nltk).
import nltk
data= "Hey, you - what are you doing here!?"
word_tokens = nltk.tokenize.regexp_tokenize(data, r'\w+')
print word_tokens
This prints: ['Hey', 'you', 'what', 'are', 'you', 'doing', 'here']
The biggest drawback of this method is that you need to install the nltk package.
The benefits are that you can do a lot of fun stuff with the rest of the nltk package once you get your tokens.
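For example, nltk's standard tokenizer keeps the punctuation as separate tokens instead of discarding it. A sketch (word_tokenize needs the 'punkt' tokenizer models downloaded once):
import nltk
# nltk.download('punkt')  # one-time download of the tokenizer models

data = "Hey, you - what are you doing here!?"
print(nltk.word_tokenize(data))
# punctuation comes back as its own tokens, e.g. ',', '-', '!'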
I had the same problem as @ooboo and found this topic.
@ghostdog74 inspired me; maybe someone will find my solution useful.
str1='adj:sg:nom:m1.m2.m3:pos'
splitat=':.'
''.join([ s if s not in splitat else ' ' for s in str1]).split()
If you don't want to split at spaces, substitute a different character for the space in the join expression and split on that same character.
First of all, always use re.compile() before performing any regex operation in a loop, because the compiled pattern is reused instead of being re-parsed on every call, which is faster.
So for your problem, first compile the pattern and then run it:
import re
DATA = "Hey, you - what are you doing here!?"
reg_tok = re.compile(r"[\w']+")
print reg_tok.findall(DATA)
Here is the answer with some explanation.
st = "Hey, you - what are you doing here!?"
# replace every non-alphanumeric character with a space, then join.
new_string = ''.join([' ' if not x.isalnum() else x for x in st])
# output of new_string
'Hey  you   what are you doing here  '
# str.split() will remove all the empty string if separator is not provided
new_list = new_string.split()
# output of new_list
['Hey', 'you', 'what', 'are', 'you', 'doing', 'here']
# we can join it to get a complete string without any non alpha-numeric character
' '.join(new_list)
# output
'Hey you what are you doing here'
or in one line, we can do like this:
(''.join([' ' if not x.isalnum() else x for x in st])).split()
# output
['Hey', 'you', 'what', 'are', 'you', 'doing', 'here']
updated answer
Create a function that takes as input two strings (the source string to be split and the splitlist string of delimiters) and outputs a list of split words:
def split_string(source, splitlist):
    output = []  # output list of cleaned words
    atsplit = True
    for char in source:
        if char in splitlist:
            atsplit = True
        else:
            if atsplit:
                output.append(char)  # append new word after split
                atsplit = False
            else:
                output[-1] = output[-1] + char  # continue copying characters until next split
    return output
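A quick usage check against the question's input:
print(split_string("Hey, you - what are you doing here!?", " ,-!?"))
# ['Hey', 'you', 'what', 'are', 'you', 'doing', 'here']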
I like pprzemek's solution because it does not assume that the delimiters are single characters and it doesn't try to leverage a regex (which would not work well if the number of separators got to be crazy long).
Here's a more readable version of the above solution for clarity:
def split_string_on_multiple_separators(input_string, separators):
    buffer = [input_string]
    for sep in separators:
        strings = buffer
        buffer = []  # reset the buffer
        for s in strings:
            buffer = buffer + s.split(sep)
    return buffer
Here is my go at a split with multiple delimiters:
def msplit(s, delims):  # renamed the parameter so it doesn't shadow the built-in str
    w = ''
    for z in s:
        if z not in delims:
            w += z
        else:
            if len(w) > 0:
                yield w
            w = ''
    if len(w) > 0:
        yield w
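Usage, collecting the generator into a list:
print(list(msplit("Hey, you - what are you doing here!?", " ,-!?")))
# ['Hey', 'you', 'what', 'are', 'you', 'doing', 'here']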
I think the following best suits your needs:
\W+ may be suitable for this case, but may not be suitable for other cases.
filter(None, re.compile(r'[ ,\-!?]').split("Hey, you - what are you doing here!?"))
Note that inside a character class [...] the | is a literal character, not an "or", so it should not be included in the class.
Here's my take on it:
def split_string(source, splitlist):
    splits = frozenset(splitlist)
    l = []
    s1 = ""
    for c in source:
        if c in splits:
            if s1:
                l.append(s1)
                s1 = ""
        else:
            s1 = s1 + c
    if s1:
        l.append(s1)
    return l
>>> out = split_string("First Name,Last Name,Street Address,City,State,Zip Code", ",")
>>> print out
['First Name', 'Last Name', 'Street Address', 'City', 'State', 'Zip Code']
I like the replace() way the best. The following procedure changes all separators defined in the string splitlist to the first separator in splitlist and then splits the text on that one separator. It also handles the case where splitlist happens to be an empty string. It returns a list of words, with no empty strings in it.
def split_string(text, splitlist):
    for sep in splitlist:
        text = text.replace(sep, splitlist[0])
    return filter(None, text.split(splitlist[0])) if splitlist else [text]
def get_words(s):
    l = []
    w = ''
    for c in s.lower():
        if c in '-!?,. ':
            if w != '':
                l.append(w)
                w = ''
        else:
            w = w + c
    if w != '':
        l.append(w)
    return l
Here is the usage:
>>> s = "Hey, you - what are you doing here!?"
>>> print get_words(s)
['hey', 'you', 'what', 'are', 'you', 'doing', 'here']
If you want a reversible operation (preserve the delimiters), you can use this function:
def tokenizeSentence_Reversible(sentence):
    setOfDelimiters = ['.', ' ', ',', '*', ';', '!']
    listOfTokens = [sentence]
    for delimiter in setOfDelimiters:
        newListOfTokens = []
        for ind, token in enumerate(listOfTokens):
            ll = [([delimiter, w] if ind > 0 else [w]) for ind, w in enumerate(token.split(delimiter))]
            listOfTokens = [item for sublist in ll for item in sublist]  # flattens.
            listOfTokens = list(filter(None, listOfTokens))  # removes empty tokens: ''
            newListOfTokens.extend(listOfTokens)
        listOfTokens = newListOfTokens
    return listOfTokens
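A quick round trip shows the reversibility: joining the tokens reproduces the original sentence exactly.
sentence = "Hey, you - what are you doing here!?"
tokens = tokenizeSentence_Reversible(sentence)
print(tokens)
print(''.join(tokens) == sentence)  # True: the delimiters are preserved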
