python split string on multiple delimiters without regex - python

I have a string that I need to split on multiple characters without using regular expressions. For example, I would need something like the following:
>>> string = "hello there[my]friend"
>>> string.split(' []')
['hello','there','my','friend']
Is there anything in Python like this?

If you need multiple delimiters, re.split is the way to go.
Without using a regex, it's not possible unless you write a custom function for it.
Here's such a function - it might or might not do what you want (consecutive delimiters cause empty elements):
>>> def multisplit(s, delims):
...     pos = 0
...     for i, c in enumerate(s):
...         if c in delims:
...             yield s[pos:i]
...             pos = i + 1
...     yield s[pos:]
...
>>> list(multisplit('hello there[my]friend', ' []'))
['hello', 'there', 'my', 'friend']
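If the empty strings produced by consecutive delimiters are unwanted, one option is to yield only non-empty pieces. This is a minimal sketch of that variation; the name multisplit_nonempty is hypothetical and not part of the answer above.

```python
def multisplit_nonempty(s, delims):
    """Split s on any character in delims, skipping empty pieces."""
    pos = 0
    for i, c in enumerate(s):
        if c in delims:
            # Only yield when there is at least one character since the last delimiter
            if i > pos:
                yield s[pos:i]
            pos = i + 1
    # Yield the tail only if it is non-empty
    if pos < len(s):
        yield s[pos:]


print(list(multisplit_nonempty('hello  there[[my]]friend', ' []')))
# ['hello', 'there', 'my', 'friend']
```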

Solution without regexp:
from itertools import groupby
sep = ' []'
s = 'hello there[my]friend'
print([''.join(g) for k, g in groupby(s, sep.__contains__) if not k])
I've just posted an explanation here https://stackoverflow.com/a/19211729/2468006

A recursive solution without use of regex. Uses only base python in contrast to the other answers.
def split_on_multiple_chars(string_to_split, set_of_chars_as_string):
    # Recursive splitting
    # Returns a list of strings
    s = string_to_split
    chars = set_of_chars_as_string
    # If no more characters to split on, return input
    if len(chars) == 0:
        return [s]
    # Split on the first of the delimiter characters
    ss = s.split(chars[0])
    # Recursive call without the first splitting character
    bb = []
    for e in ss:
        aa = split_on_multiple_chars(e, chars[1:])
        bb.extend(aa)
    return bb
Works very similarly to Python's regular str.split(...), but accepts several delimiters.
Example use:
print(split_on_multiple_chars('my"example_string.with:funny?delimiters', '_.:;'))
Output:
['my"example', 'string', 'with', 'funny?delimiters']

If you're not worried about long strings, you could force all delimiters to be the same using string.replace(). The following splits a string by both - and ,
x.replace('-', ',').split(',')
If you have many delimiters you could do the following:
def split(x, delimiters):
    for d in delimiters:
        x = x.replace(d, delimiters[0])
    return x.split(delimiters[0])
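For illustration, here is a self-contained sketch of the same replace-then-split idea with sample calls. The name split_by_replace is hypothetical, and the sketch assumes the delimiter string is non-empty. Like the generator-free approaches above, adjacent delimiters produce empty strings.

```python
def split_by_replace(text, delimiters):
    """Split text on every character in delimiters using only str methods."""
    first = delimiters[0]  # assumes delimiters is non-empty
    # Normalise every other delimiter to the first one, then split once
    for d in delimiters[1:]:
        text = text.replace(d, first)
    return text.split(first)


print(split_by_replace('a-b,c;d', '-,;'))  # ['a', 'b', 'c', 'd']
print(split_by_replace('a-,b', '-,'))      # ['a', '', 'b']
```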

re.split is the right tool here.
>>> string="hello there[my]friend"
>>> import re
>>> re.split('[] []', string)
['hello', 'there', 'my', 'friend']
In regex, [...] defines a character class. Any characters inside the brackets will match. The way I've spaced the brackets avoids needing to escape them, but the pattern [\[\] ] also works.
>>> re.split(r'[\[\] ]', string)
['hello', 'there', 'my', 'friend']
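When the delimiter characters are only known at run time, the character class can be built with re.escape instead of arranging or escaping the brackets by hand. This is a sketch under that assumption; split_on_chars is a hypothetical helper name, and it expects a non-empty delimiter string.

```python
import re


def split_on_chars(s, delims):
    """Split s on any single character contained in delims."""
    # re.escape backslash-escapes characters such as [ ] ^ - and \,
    # so the result is safe to place inside a character class.
    pattern = '[' + re.escape(delims) + ']'
    return re.split(pattern, s)


print(split_on_chars('hello there[my]friend', ' []'))
# ['hello', 'there', 'my', 'friend']
```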
The re.DEBUG flag to re.compile is also useful, as it prints out what the pattern will match:
>>> re.compile('[] []', re.DEBUG)
in
literal 93
literal 32
literal 91
<_sre.SRE_Pattern object at 0x16b0850>
(where 32, 91 and 93 are the ASCII values of the space character, [ and ] respectively)

Related

Match word if not followed or preceded by < or >

I am trying to not match words that are followed or preceded by an XML tag.
import re
strTest = "<random xml>hello this was successful price<random xml>"
for c in re.finditer(r'(?<![<>])(\b\w+\b)(?<!=[<>])(\W+)',strTest):
    c1 = c.group(1)
    c2 = c.group(2)
    if ('<' != c2[0]) and ('<' != c.group(1)[len(c.group(1))-1]):
        print(c1)
Result is:
xml
this
was
successful
xml
Wanted Result:
this
was
successful
I have been trying negative lookahead and negative lookbehind assertions. I'm not sure if this is the right approach, I would appreciate any help.
First, to answer your question directly:
I do it by examining each 'word' consisting of a sequence of characters containing (mainly) alphabetics or '<' or '>'. When the regex offers them to some_only I look for one of the latter two characters. If neither appears I print the 'word'.
>>> import re
>>> strTest = "<random xml>hello this was successful price<random xml>"
>>> def some_only(matchobj):
...     if '<' in matchobj.group() or '>' in matchobj.group():
...         pass
...     else:
...         print (matchobj.group())
...     pass
...
>>> ignore = re.sub(r'[<>\w]+', some_only, strTest)
this
was
successful
This works for your test string; however, as others have already mentioned, using a regex on xml will usually lead to many woes.
To use a more conventional approach I had to tidy away a couple of errors in that xml string, namely to change random xml to random_xml and to use a proper closing tag.
I prefer to use the lxml library.
>>> strTest = "<random_xml>hello this was successful price</random_xml>"
>>> from lxml import etree
>>> tree = etree.fromstring(strTest)
>>> tree.text
'hello this was successful price'
>>> tree.text.split(' ')
['hello', 'this', 'was', 'successful', 'price']
>>> tree.text.split(' ')[1:-1]
['this', 'was', 'successful']
I'll give it a shot. Since we are already doing more than just a regex, put it into a list and drop the first/last items:
import re
strTest = "<random xml>hello this was successful price<random xml>"
thelist = []
for c in re.finditer(r'(?<![<>])(\b\w+\b)(?<!=[<>])(\W+)',strTest):
    c1 = c.group(1)
    c2 = c.group(2)
    if ('<' != c2[0]) and ('<' != c.group(1)[len(c.group(1))-1]):
        thelist.append(c1)
thelist = thelist[1:-1]
print (thelist)
result:
['this', 'was', 'successful']
I would personally try to parse the XML instead, but since you have this code already up this slight modification could do the trick.
A simple way to do it with a list, assuming that a word directly preceded or followed by an XML tag is not separated from that tag by a space:
test = "<random xml>hello this was successful price<random xml>"
test = test.split()
new_test = []
for val in test:
    if "<" not in val and ">" not in val:
        new_test.append(val)
print(new_test)
The result will be:
['this', 'was', 'successful']
My solution:
I don't see the need to use regex at all, you could solve it in a one-line list comprehension:
words = [w for w in test.split() if "<" not in w and ">" not in w]

Python - defining string split delimiter?

How can I define the delimiter for splitting a string in the most efficient way, without needing many ifs?
I have strings that need to be split strictly into two-element lists. The problem is that those strings use different symbols as the split delimiter. For example:
'Hello: test1'. This one has split delimiter ': '. The other example would be:
'Hello - test1'. So this one would be ' - '. Also split delimiter could be ' -' or '- '. So if I know all variations of delimiters, how could I define them most efficiently?
First I did something like this:
strings = ['Hello - test', 'Hello- test', 'Hello -test']
for s in strings:
    delim = ' - '
    if len(s.split('- ', 1)) == 2:
        delim = '- '
    elif len(s.split(' -', 1)) == 2:
        delim = ' -'
    print(s.split(delim, 1)[1])
But then I got new strings that had other, unexpected delimiters. Doing it this way, I would have to add even more ifs to check for other delimiters such as ': '. So I wondered whether there is a better way to define them (it is no problem if I need to add new delimiters to some kind of list later on). Maybe regex would help, or some other tool?
Put all the delimiters inside the re.split pattern, separated by the alternation operator |, as shown below.
re.split(r': | - | -|- ', string)
Add maxsplit=1 if you want to split only once.
re.split(r': | - | -|- ', string, maxsplit=1)
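As a quick check of the alternation pattern, the sketch below applies it to the delimiter variants mentioned in the question. The sample strings and the name DELIMS are made up for illustration. Note that ' - ' is listed before ' -' so that the longer form wins when both could match.

```python
import re

# Longer alternatives come first so ' - ' is preferred over ' -'
DELIMS = re.compile(r': | - | -|- ')

samples = ['Hello: test1', 'Hello - test1', 'Hello -test1', 'Hello- test1']
results = [DELIMS.split(s, maxsplit=1) for s in samples]
for r in results:
    print(r)  # each line prints ['Hello', 'test1']

# With maxsplit=1, later delimiters stay in the second element
print(DELIMS.split('a - b - c', maxsplit=1))  # ['a', 'b - c']
```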
You can use the split function of the re module:
>>> import re
>>> strings = ['Hello1 - test1', 'Hello2- test2', 'Hello3 -test3', 'Hello4 :test4', 'Hello5 : test5']
>>> for s in strings:
...     re.split(" *[:-] *",s)
...
['Hello1', 'test1']
['Hello2', 'test2']
['Hello3', 'test3']
['Hello4', 'test4']
['Hello5', 'test5']
Put all the possible delimiter characters between the square brackets. The " *" on each side allows any number of spaces (including none) before or after the delimiter.
\s*[:-]\s*
You can split on this pattern with re.split(r"\s*[:-]\s*", string). See the demo:
https://regex101.com/r/nL5yL3/14
Use this if the delimiter can appear as ' - ', '- ' or ' -', possibly surrounded by several whitespace characters.
This isn't the best way, but if you want to avoid using re for some (or no) reason, this is what I would do:
>>> strings = ['Hello - test', 'Hello- test', 'Hello -test', 'Hello : test']
>>> delims = [':', '-'] # all possible delimiters; don't worry about spaces.
>>>
>>> for string in strings:
...     delim = next((d for d in delims if d in string), None) # finds the first delimiter in delims that's present in the string (if there is one)
...     if not delim:
...         continue # No delimiter! (I don't know how you want to handle this possibility; this code will simply skip the string altogether.)
...     print([s.strip() for s in string.split(delim, 1)]) # assuming you want them in list form.
['Hello', 'test']
['Hello', 'test']
['Hello', 'test']
['Hello', 'test']
This uses Python's native .split() to break the string at the delimiter, and then .strip() to trim the white space off the results, if there is any. I've used next to find the appropriate delimiter, but there are plenty of things you can swap that out with (especially if you like for blocks).
If you're certain that each string will contain at least one of the delimiters (preferably exactly one), then you can shave it down to this:
## with strings and delims defined...
>>> for string in strings:
...     delim = next(d for d in delims if d in string) # raises StopIteration at this line if there is no delimiter in the string.
...     print([s.strip() for s in string.split(delim, 1)])
I'm not sure if this is the most elegant solution, but it uses fewer if blocks, and you won't have to import anything to do it.

Removing many types of chars from a Python string

I have some string X and I wish to remove semicolons, periods, commas, colons, etc, all in one go. Is there a way to do this that doesn't require a big chain of .replace(somechar,"") calls?
You can use the translate method with a first argument of None:
string2 = string1.translate(None, ";.,:")
Alternatively, you can use the filter function:
string2 = filter(lambda x: x not in ";,.:", string1)
Note that both of these options only work for non-Unicode strings and only in Python 2.
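On Python 3, str.translate takes a single translation table rather than a deletechars argument. A minimal sketch of the equivalent call, using str.maketrans with its third argument (the characters to delete); the sample value of string1 is made up for illustration.

```python
# Characters to strip out in one pass
chars_to_remove = ";.,:"

# The third argument of str.maketrans lists characters mapped to None (deleted)
table = str.maketrans('', '', chars_to_remove)

string1 = "a;b.c,d:e"  # hypothetical sample input
string2 = string1.translate(table)
print(string2)  # abcde
```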
You can use re.sub to pattern match and replace. The following replaces h and i only with empty strings:
In [1]: s = 'byehibyehbyei'
In [2]: re.sub('[hi]', '', s)
Out[2]: 'byebyebye'
Don't forget to import re.
>>> import re
>>> foo = "asdf;:,*_-"
>>> re.sub('[;:,*_-]', '', foo)
'asdf'
[;:,*_-] - List of characters to be matched
'' - Replace match with nothing
Using the string foo.
For more information take a look at the re.sub(pattern, repl, string, count=0, flags=0) documentation.
Don't know about the speed, but here's another example without using re.
commas_and_stuff = ",+;:"
words = "words; and stuff!!!!"
cleaned_words = "".join(c for c in words if c not in commas_and_stuff)
Gives you:
'words and stuff!!!!'

Split string without non-characters

I'm trying to split a string that looks like this for example:
':foo [bar]'
Using str.split() on this of course returns [':foo','[bar]']
But how can I make it return just ['foo','bar'] containing only these characters?
I don't like regular expressions, but do like Python, so I'd probably write this as
>>> s = ':foo [bar]'
>>> ''.join(c for c in s if c.isalnum() or c.isspace())
'foo bar'
>>> ''.join(c for c in s if c.isalnum() or c.isspace()).split()
['foo', 'bar']
The ''.join idiom is a little strange, I admit, but you can almost read the rest in English: "join every character for the characters in s if the character is alphanumeric or the character is whitespace, and then split that".
Alternatively, if you know that the symbols you want to remove will always be on the outside and the word will still be separated by spaces, and you know what they are, you might try something like
>>> s = ':foo [bar]'
>>> s.split()
[':foo', '[bar]']
>>> [word.strip(':[]') for word in s.split()]
['foo', 'bar']
Do str.split() as normal, and then parse each element to remove the non-letters. Something like:
>>> my_string = ':foo [bar]'
>>> parts = [''.join(c for c in s if c.isalpha()) for s in my_string.split()]
>>> parts
['foo', 'bar']
You'll have to pass through the list [':foo','[bar]'] and strip out all non-letter characters, using regular expressions. Check Regex replace (in Python) - a simpler way? for examples and references to documentation.
You have to try regular expressions.
Use re.sub() to remove the :, [ and ] characters and then split the resulting string with whitespace as the delimiter.
>>> st = ':foo [bar]'
>>> import re
>>> new_st = re.sub(r'[\[\]:]','',st)
>>> new_st.split(' ')
['foo', 'bar']
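A different technique from the answers above, offered only as a sketch: rather than removing the unwanted characters and then splitting, re.findall can collect the runs of wanted characters directly. The character class below assumes that "characters" means ASCII letters and digits; adjust it if other characters should be kept.

```python
import re

st = ':foo [bar]'

# Collect every maximal run of ASCII letters/digits; everything else acts as a separator
words = re.findall(r'[A-Za-z0-9]+', st)
print(words)  # ['foo', 'bar']
```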

Regex for extraction in Python

I have a string like this:
"a word {{bla|123|456}} another {{bli|789|123}} some more text {{blu|789}} and more".
I would like to get this as an output:
(("bla", 123, 456), ("bli", 789, 123), ("blu", 789))
I haven't been able to find the proper python regex to achieve that.
>>> re.findall(r' {{(\w+)\|(\w+)(?:\|(\w+))?}} ', s)
[('bla', '123', '456'), ('bli', '789', '123'), ('blu', '789', '')]
If you still want numbers there, you'd need to iterate over the output and convert the numeric strings to integers with int.
You need a lot of escapes in your regular expression since {, } and | are special characters in them. A first step to extract the relevant parts of the string would be this:
regex = re.compile(r'\{\{(.*?)\|(.*?)(?:\|(.*?))?\}\}')
regex.findall(line)
For the example this gives:
[('bla', '123', '456'), ('bli', '789', '123'), ('blu', '789', '')]
Then you can continue with converting strings with digits into integers and removing empty strings like for the last match.
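The follow-up step described above could look like the sketch below. The helper name convert is hypothetical, and the sketch assumes that any group consisting only of digits should become an int and that empty groups should be dropped.

```python
import re

line = ("a word {{bla|123|456}} another {{bli|789|123}} "
        "some more text {{blu|789}} and more")

regex = re.compile(r'\{\{(.*?)\|(.*?)(?:\|(.*?))?\}\}')


def convert(groups):
    """Drop empty groups and turn all-digit strings into ints."""
    return tuple(int(g) if g.isdigit() else g for g in groups if g)


result = tuple(convert(m) for m in regex.findall(line))
print(result)
# (('bla', 123, 456), ('bli', 789, 123), ('blu', 789))
```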
[re.split(r'\|', i) for i in re.findall("{{(.*?)}}", s)]
Returns:
[['bla', '123', '456'], ['bli', '789', '123'], ['blu', '789']]
This method works regardless of the number of elements in the {{ }} blocks.
To get the exact output you wrote, you need a regex and a split:
import re
[p.split("|") for p in re.findall(r"\{\{([^}]*)\}\}", s)]
To get it with the numbers converted, do this:
toint = lambda x: int(x) if x.isdigit() else x
[[toint(x) for x in p.split("|")] for p in re.findall(r"\{\{([^}]*)\}\}", s)]
Assuming your actual format is {{[a-z]+|[0-9]+|[0-9]+}}, here's a complete program with conversion to ints.
import re
s = "a word {{bla|123|456}} another {{bli|789|123}} some more text {{blu|789}} and more"
result = []
for match in re.finditer('{{.*?}}', s):
    # Split on pipe (|) and filter out non-alphanumerics
    # (''.join keeps each part a string on Python 3, where filter() returns an iterator)
    parts = [''.join(filter(str.isalnum, part)) for part in match.group().split('|')]
    # Convert to int when possible
    for index, part in enumerate(parts):
        try:
            parts[index] = int(part)
        except ValueError:
            pass
    result.append(tuple(parts))
We might be able to get fancy and do everything in a single complicated regular expression, but that way lies madness. Let's do one regexp that grabs the groups, and then split the groups up. We could use a regexp to split the groups, but we can just use str.split(), so let's do that.
import re
pat_group = re.compile("{{([^}]*)}}")
def mixed_tuple(iterable):
    lst = []
    for x in iterable:
        try:
            lst.append(int(x))
        except ValueError:
            lst.append(x)
    return tuple(lst)
s = "a word {{bla|123|456}} another {{bli|789|123}} some more text {{blu|789}} and more"
lst_groups = re.findall(pat_group, s)
lst = [mixed_tuple(x.split("|")) for x in lst_groups]
In pat_group, "{{" just matches literal "{{". "(" starts a group. "[^}]" is a character class that matches any character except for "}", and '*' allows it to match zero or more such characters. ")" closes out the group and "}}" matches literal characters. Thus, we match the "{{...}}" patterns, and can extract everything between the curly braces as a group.
re.findall() returns a list of groups matched from the pattern.
Finally, a list comprehension splits each string and returns the result as a tuple.
Is pyparsing overkill for this? Maybe, but without too much suffering, it does deliver the desired output, without a thicket of backslashes to escape the '{', '|', or '}' characters. Plus, there's no need for post-parse conversions of integers and whatnot - the parse actions take care of this kind of stuff at parse time.
from pyparsing import Word, Suppress, alphas, alphanums, nums, delimitedList
LBRACE,RBRACE,VERT = map(Suppress,"{}|")
word = Word(alphas,alphanums)
integer = Word(nums)
integer.setParseAction(lambda t: int(t[0]))
patt = (LBRACE*2 + delimitedList(word|integer, VERT) + RBRACE*2)
patt.setParseAction(lambda toks:tuple(toks.asList()))
s = "a word {{bla|123|456}} another {{bli|789|123}} some more text {{blu|789}} and more"
print(tuple(p[0] for p in patt.searchString(s)))
Prints:
(('bla', 123, 456), ('bli', 789, 123), ('blu', 789))
