Split string without non-characters - python

I'm trying to split a string that looks like this for example:
':foo [bar]'
Using str.split() on this of course returns [':foo','[bar]']
But how can I make it return just ['foo','bar'] containing only these characters?

I don't like regular expressions, but do like Python, so I'd probably write this as
>>> s = ':foo [bar]'
>>> ''.join(c for c in s if c.isalnum() or c.isspace())
'foo bar'
>>> ''.join(c for c in s if c.isalnum() or c.isspace()).split()
['foo', 'bar']
The ''.join idiom is a little strange, I admit, but you can almost read the rest in English: "join every character for the characters in s if the character is alphanumeric or the character is whitespace, and then split that".
Alternatively, if you know that the symbols you want to remove will always be on the outside and the word will still be separated by spaces, and you know what they are, you might try something like
>>> s = ':foo [bar]'
>>> s.split()
[':foo', '[bar]']
>>> [word.strip(':[]') for word in s.split()]
['foo', 'bar']

Do str.split() as normal, and then parse each element to remove the non-letters. Something like:
>>> my_string = ':foo [bar]'
>>> parts = [''.join(c for c in s if c.isalpha()) for s in my_string.split()]
['foo', 'bar']

You'll have to pass through the list ['foo','[bar]'] and strip out all non-letter characters, using regular expressions. Check Regex replace (in Python) - a simpler way? for examples and references to documentation.

You have to try regular expressions.
Use re.sub() to replace :,[,] characters and than split your resultant string with white space as delimiter.
>>> st = ':foo [bar]'
>>> import re
>>> new_st = re.sub(r'[\[\]:]','',st)
>>> new_st.split(' ')
['foo', 'bar']

Related

String.split() after n characters

I can split a string like this:
string = 'ABC_elTE00001'
string = string.split('_elTE')[1]
print(string)
How do I automate this, so I don't have to pass '_elTE' to the function? Something like this:
string = 'ABC_elTE00001'
string = string.split('_' + 4 characters)[1]
print(string)
Use regex, regex has a re.split thing which is the same as str.split just you can split by a regex pattern, it's worth a look at the docs:
>>> import re
>>> string = 'ABC_elTE00001'
>>> re.split('_\w{4}', string)
['ABC', '00001']
>>>
The above example is using a regex pattern as you see.
split() on _ and take everything after the first four characters.
s = 'ABC_elTE00001'
# s.split('_')[1] gives elTE00001
# To get the string after 4 chars, we'd slice it [4:]
print(s.split('_')[1][4:])
OUTPUT:
00001
You can use Regular expression to automate the extraction that you want.
import re
string = 'ABC_elTE00001'
data = re.findall('.([0-9]*$)',string)
print(data)
This is a, quite horrible, version that exactly "translates" string.split('_' + 4 characters)[1]:
s = 'ABC_elTE00001'
s.split(s[s.find("_"):(s.find("_")+1)+4])[1]
>>> '00001'

How can I replace all occurrences of a substring using regex?

I have a string, s = 'sdfjoiweng%#$foo$fsoifjoi', and I would like to replace 'foo' with 'bar'.
I tried re.sub(r'\bfoo\b', 'bar', s) and re.sub(r'[foo]', 'bar', s), but it doesn't do anything. What am I doing wrong?
You can replace it directly:
>>> import re
>>> s = 'sdfjoiweng%#$foo$fsoifjoi'
>>> print(re.sub('foo','bar',s))
sdfjoiweng%#$bar$fsoifjoi
It will also work for more occurrences of foo like below:
>>> s = 'sdfjoiweng%#$foo$fsoifoojoi'
>>> print(re.sub('foo','bar',s))
sdfjoiweng%#$bar$fsoibarjoi
If you want to replace only the 1st occurrence of foo and not all the foo occurrences in the string then alecxe's answer does exactly that.
re.sub(r'\bfoo\b', 'bar', s)
Here, the \b defines the word boundaries - positions between a word character (\w) and a non-word character - exactly what you have matching for foo inside the sdfjoiweng%#$foo$fsoifjoi string. Works for me:
In [1]: import re
In [2]: s = 'sdfjoiweng%#$foo$fsoifjoi'
In [3]: re.sub(r'\bfoo\b', 'bar', s)
Out[3]: 'sdfjoiweng%#$bar$fsoifjoi'
You can use replace function directly instead of using regex.
>>> s = 'sdfjoiweng%#$foo$fsoifjoifoo'
>>>
>>> s.replace("foo","bar")
'sdfjoiweng%#$bar$fsoifjoibar'
>>>
>>>
To further add to the above, the code below shows you how to replace multiple words at once! I've used this to replace 165,000 words in 1 step!!
Note \b means no sub string matching..must be a whole word..if you remove it then it will make sub-string match.
import re
s = 'thisis a test'
re.sub('\bthis\b|test','',s)
This gives:
'thisis a '

Extracting alphanumeric substring from a string in python

i have a string in python
text = '(b)'
i want to extract the 'b'. I could strip the first and the last letter of the string but the reason i wont do that is because the text string may contain '(a)', (iii), 'i)', '(1' or '(2)'. Some times they contain no parenthesis at all. but they will always contain an alphanumeric values. But i equally want to retrieve the alphanumeric values there.
this feat will have to be accomplished in a one line code or block of code that returns justthe value as it will be used in an iteratively on a multiple situations
what is the best way to do that in python,
I don't think Regex is needed here. You can just strip off any parenthesis with str.strip:
>>> text = '(b)'
>>> text.strip('()')
'b'
>>> text = '(iii)'
>>> text.strip('()')
'iii'
>>> text = 'i)'
>>> text.strip('()')
'i'
>>> text = '(1'
>>> text.strip('()')
'1'
>>> text = '(2)'
>>> text.strip('()')
'2'
>>> text = 'a'
>>> text.strip('()')
'a'
>>>
Regarding #MikeMcKerns' comment, a more robust solution would be to pass string.punctuation to str.strip:
>>> from string import punctuation
>>> punctuation # Just to demonstrate
'!"#$%&\'()*+,-./:;<=>?#[\\]^_`{|}~'
>>>
>>> text = '*(ab2**)'
>>> text.strip(punctuation)
'ab2'
>>>
You could do this through python's re module,
>>> import re
>>> text = '(5a)'
>>> match = re.search(r'\(?([0-9A-Za-z]+)\)?', text)
>>> match.group(1)
'5a'
>>> text = '*(ab2**)'
>>> match = re.search(r'\(?([0-9A-Za-z]+)\)?', text)
>>> match.group(1)
'ab2'
Not fancy, but this is pretty generic
>>> import string
>>> ''.join(i for i in text if i in string.ascii_letters+'0123456789')
This works for all sorts of combinations of parenthesis in the middle of the string, and also if you have other non-alphanumeric characters (aside from the parenthesis) present.
re.match(r'\(?([a-zA-Z0-9]+)', text).group(1)
for your input provided by exmple it would be:
>>> a=['(a)', '(iii)', 'i)', '(1' , '(2)']
>>> [ re.match(r'\(?([a-zA-Z0-9]+)', text).group(1) for text in a ]
['a', 'iii', 'i', '1', '2']

Removing many types of chars from a Python string

I have some string X and I wish to remove semicolons, periods, commas, colons, etc, all in one go. Is there a way to do this that doesn't require a big chain of .replace(somechar,"") calls?
You can use the translate method with a first argument of None:
string2 = string1.translate(None, ";.,:")
Alternatively, you can use the filter function:
string2 = filter(lambda x: x not in ";,.:", string1)
Note that both of these options only work for non-Unicode strings and only in Python 2.
You can use re.sub to pattern match and replace. The following replaces h and i only with empty strings:
In [1]: s = 'byehibyehbyei'
In [1]: re.sub('[hi]', '', s)
Out[1]: 'byebyebye'
Don't forget to import re.
>>> import re
>>> foo = "asdf;:,*_-"
>>> re.sub('[;:,*_-]', '', foo)
'asdf'
[;:,*_-] - List of characters to be matched
'' - Replace match with nothing
Using the string foo.
For more information take a look at the re.sub(pattern, repl, string, count=0, flags=0) documentation.
Don't know about the speed, but here's another example without using re.
commas_and_stuff = ",+;:"
words = "words; and stuff!!!!"
cleaned_words = "".join(c for c in words if c not in commas_and_stuff)
Gives you:
'words and stuff!!!!'

python split string on multiple delimeters without regex

I have a string that I need to split on multiple characters without the use of regular expressions. for example, I would need something like the following:
>>>string="hello there[my]friend"
>>>string.split(' []')
['hello','there','my','friend']
is there anything in python like this?
If you need multiple delimiters, re.split is the way to go.
Without using a regex, it's not possible unless you write a custom function for it.
Here's such a function - it might or might not do what you want (consecutive delimiters cause empty elements):
>>> def multisplit(s, delims):
... pos = 0
... for i, c in enumerate(s):
... if c in delims:
... yield s[pos:i]
... pos = i + 1
... yield s[pos:]
...
>>> list(multisplit('hello there[my]friend', ' []'))
['hello', 'there', 'my', 'friend']
Solution without regexp:
from itertools import groupby
sep = ' []'
s = 'hello there[my]friend'
print [''.join(g) for k, g in groupby(s, sep.__contains__) if not k]
I've just posted an explanation here https://stackoverflow.com/a/19211729/2468006
A recursive solution without use of regex. Uses only base python in contrast to the other answers.
def split_on_multiple_chars(string_to_split, set_of_chars_as_string):
# Recursive splitting
# Returns a list of strings
s = string_to_split
chars = set_of_chars_as_string
# If no more characters to split on, return input
if len(chars) == 0:
return([s])
# Split on the first of the delimiter characters
ss = s.split(chars[0])
# Recursive call without the first splitting character
bb = []
for e in ss:
aa = split_on_multiple_chars(e, chars[1:])
bb.extend(aa)
return(bb)
Works very similarly to pythons regular string.split(...), but accepts several delimiters.
Example use:
print(split_on_multiple_chars('my"example_string.with:funny?delimiters', '_.:;'))
Output:
['my"example', 'string', 'with', 'funny?delimiters']
If you're not worried about long strings, you could force all delimiters to be the same using string.replace(). The following splits a string by both - and ,
x.replace('-', ',').split(',')
If you have many delimiters you could do the following:
def split(x, delimiters):
for d in delimiters:
x = x.replace(d, delimiters[0])
return x.split(delimiters[0])
re.split is the right tool here.
>>> string="hello there[my]friend"
>>> import re
>>> re.split('[] []', string)
['hello', 'there', 'my', 'friend']
In regex, [...] defines a character class. Any characters inside the brackets will match. The way I've spaced the brackets avoids needing to escape them, but the pattern [\[\] ] also works.
>>> re.split('[\[\] ]', string)
['hello', 'there', 'my', 'friend']
The re.DEBUG flag to re.compile is also useful, as it prints out what the pattern will match:
>>> re.compile('[] []', re.DEBUG)
in
literal 93
literal 32
literal 91
<_sre.SRE_Pattern object at 0x16b0850>
(Where 32, 91, 93, are the ascii values assigned to , [, ])

Categories