I know partition() exists, but it only takes a single separator; I'm trying to partition around several values.
For example, say I wanted to partition a string around its symbols:
input: "function():"
output: ["function", "(", ")", ":"]
I can't seem to find an efficient way to handle a variable number of separators.
You can use re.findall with an alternation pattern that matches either a word or a non-space character:
re.findall(r'\w+|\S', s)
so that given s = 'function():', this returns:
['function', '(', ')', ':']
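For example (using a slightly longer sample string, just for illustration):
import re

s = "function(): return foo + 3"
print(re.findall(r'\w+|\S', s))
# ['function', '(', ')', ':', 'return', 'foo', '+', '3']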
You could use re.split on \W with a (...) capture group to keep the delimiters, then remove the empty and whitespace-only parts.
>>> import re
>>> s = "function(): return foo + 3"
>>> [s for s in re.split(r"(\W)", s) if s.strip()]
['function', '(', ')', ':', 'return', 'foo', '+', '3']
Note that this splits at every single special character; if you want to keep certain groups of special characters together, e.g. == or <=, list those alternatives first in the pattern, separated by |.
>>> s = "function(): return foo + 3 == 42"
>>> [s for s in re.split(r"(\W)", s) if s.strip()]
['function', '(', ')', ':', 'return', 'foo', '+', '3', '=', '=', '42']
>>> [s for s in re.split(r"(==|!=|<=|\W)", s) if s.strip()]
['function', '(', ')', ':', 'return', 'foo', '+', '3', '==', '42']
Related
I'm trying to split strings every time I encounter a punctuation mark or a number, for example:
toSplit = 'I2eat!Apples22becauseilike?Them'
result = re.sub('[0123456789,.?:;~!@#$%^&*()]', ' \1',toSplit).split()
The desired output would be:
['I', '2', 'eat', '!', 'Apples', '22', 'becauseilike', '?', 'Them']
However, the code above (although it properly splits where it's supposed to) removes all the numbers and punctuation marks.
Any clarification would be greatly appreciated.
Use re.split with a capture group:
toSplit = 'I2eat!Apples22becauseilike?Them'
result = re.split('([0-9,.?:;~!@#$%^&*()])', toSplit)
result
Output:
['I', '2', 'eat', '!', 'Apples', '2', '', '2', 'becauseilike', '?', 'Them']
If you want runs of repeated numbers or punctuation to stay together as single tokens, add +:
result = re.split('([0-9,.?:;~!@#$%^&*()]+)', toSplit)
result
Output:
['I', '2', 'eat', '!', 'Apples', '22', 'becauseilike', '?', 'Them']
You can tokenize strings like yours into digits, letters, and other characters that are neither whitespace, letters nor digits, using
re.findall(r'\d+|(?:[^\w\s]|_)+|[^\W\d_]+', toSplit)
Here,
\d+ - one or more digits
(?:[^\w\s]|_)+ - one or more characters that are either _ or neither word nor whitespace characters
[^\W\d_]+ - one or more letters.
The matching approach is more flexible than splitting, as it also allows tokenizing more complex structures. Say you also want to tokenize decimal (float, double, ...) numbers; you just need to use \d+(?:\.\d+)? instead of \d+:
re.findall(r'\d+(?:\.\d+)?|(?:[^\w\s]|_)+|[^\W\d_]+', toSplit)
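For example, with a made-up sample input (the string below is just for illustration):
import re

sample = 'Apples cost 3.50 today!'  # hypothetical input, to show the decimal handling
print(re.findall(r'\d+(?:\.\d+)?|(?:[^\w\s]|_)+|[^\W\d_]+', sample))
# ['Apples', 'cost', '3.50', 'today', '!']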
Use re.split to split wherever a run of alphabetic characters is found:
>>> import re
>>> re.split(r'([A-Za-z]+)', toSplit)
['', 'I', '2', 'eat', '!', 'Apples', '22', 'becauseilike', '?', 'Them', '']
>>>
>>> ' '.join(re.split(r'([A-Za-z]+)', toSplit)).split()
['I', '2', 'eat', '!', 'Apples', '22', 'becauseilike', '?', 'Them']
I have a lot of python strings such as "A7*4", "Z3+8", "B6 / 11", and I want to split these strings so that they would be in a list, in the format ["A7", "*", "4"], ["B6", "/", "11"], etc. I have used a lot of different split methods but I think I need to just perform the split where there is a math symbol, such as /,*,+,-. I would also need to strip out the whitespace.
Currently I am using the code re.split(r'(\D)', "B6 / 11"), which is returning ['', 'B', '6', ' ', '', '/', '', ' ', '11']. Instead I want to get back ["B6", "/", "11"].
You should split on the character set [+-/*] after removing the whitespace from the string:
>>> import re
>>> def mysplit(mystr):
... return re.split("([+-/*])", mystr.replace(" ", ""))
...
>>> mysplit("A7*4")
['A7', '*', '4']
>>> mysplit("Z3+8")
['Z3', '+', '8']
>>> mysplit("B6 / 11")
['B6', '/', '11']
>>>
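One small caveat about the pattern above: inside a character class, +-/ is read as a range, so it also matches , and . (and matches - only through that range). If you'd rather be explicit, put the hyphen first so it is taken literally; a sketch of the same helper with that tweak:
import re

def mysplit(mystr):
    # '-' listed first so it is treated as a literal, not as part of a range
    return re.split(r"([-+*/])", mystr.replace(" ", ""))

print(mysplit("B6 / 11"))  # ['B6', '/', '11']
print(mysplit("A7-4"))     # ['A7', '-', '4']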
There is a way to solve this without regular expressions by using the Python tokenizer. I used a more complex formula to show the capabilities of this solution.
from io import StringIO
import tokenize
formula = "(A7*4) - (Z3+8) - ( B6 / 11)"
print([token[1] for token in tokenize.generate_tokens(StringIO(formula).readline) if token[1]])
Result:
['(', 'A7', '*', '4', ')', '-', '(', 'Z3', '+', '8', ')', '-', '(', 'B6', '/', '11', ')']
I want to split a string like this:
string = '[[he (∇((comesΦf→chem,'
based on spaces, punctuation marks and also Unicode characters. What I expect as output is the following:
out= ['[', '[', 'he',' ', '(','∇' , '(', '(', 'comes','Φ', 'f','→', 'chem',',']
I am using
re.findall(r"[\w\s\]+|[^\w\s]",String,re.unicode)
for this case, but it returned the following output:
output=['[', '[', 'he',' ', '(', '\xe2', '\x88', '\x87', '(', '(', 'comes\xce', '\xa6', 'f\xe2', '\x86', '\x92', 'chem',',']
Please tell me how I can solve this problem.
Without using regexes, and assuming words only contain ASCII characters:
from string import ascii_letters
from itertools import groupby
LETTERS = frozenset(ascii_letters)
def is_alpha(char):
    return char in LETTERS

def split_string(text):
    for key, tokens in groupby(text, key=is_alpha):
        if key:  # found letters, join them and yield a word
            yield ''.join(tokens)
        else:  # not letters, just yield the single tokens
            yield from tokens
Example result:
In [2]: list(split_string('[[he (∇((comesΦf→chem,'))
Out[2]: ['[', '[', 'he', ' ', '(', '∇', '(', '(', 'comes', 'Φ', 'f', '→', 'chem', ',']
If you are using a Python version older than 3.3, you can replace yield from tokens with:
for token in tokens: yield token
If you are on Python 2, keep in mind that split_string expects a unicode string.
Note that by modifying the is_alpha function you can define different kinds of grouping. For example, if you wanted to consider all Unicode letters as letters, you could use is_alpha = str.isalpha (or unicode.isalpha in Python 2):
In [3]: is_alpha = str.isalpha
In [4]: list(split_string('[[he (∇((comesΦf→chem,'))
Out[4]: ['[', '[', 'he', ' ', '(', '∇', '(', '(', 'comesΦf', '→', 'chem', ',']
Note that 'comesΦf', which was split before, is now kept together.
Hope it helps.
In [33]: string = '[[he (∇((comesΦf→chem,'
In [34]: re.split('\W+', string)
Out[34]: ['', 'he', 'comes', 'f', 'chem', '']
I have a string like this:
string = 'This is my text of 2013-02-11, & it contained characters like this! (Exceptional)'
These are the symbols I want to remove from my string:
!, @, #, %, ^, &, *, (, ), _, +, =, `, /
What I have tried is:
listofsymbols = ['!', '@', '#', '%', '^', '&', '*', '(', ')', '_', '+', '=', '`', '/']
exceptionals = set(chr(e) for e in listofsymbols)
string.translate(None,exceptionals)
The error is:
an integer is required
Please help me do this!
Try this
>>> my_str = 'This is my text of 2013-02-11, & it contained characters like this! (Exceptional)'
>>> my_str.translate(None, '!@#%^&*()_+=`/')
This is my text of 2013-02-11, it contained characters like this Exceptional
Also, please refrain from naming variables that are already built-in names or part of the standard library.
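If you are on Python 3, str.translate no longer takes a deletechars argument; a minimal sketch of the equivalent using str.maketrans, where the third argument lists the characters to delete:
my_str = 'This is my text of 2013-02-11, & it contained characters like this! (Exceptional)'
table = str.maketrans('', '', '!@#%^&*()_+=`/')  # third argument: characters to delete
print(my_str.translate(table))  # same result as the Python 2 version above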
How about this? I've also renamed string to s to avoid it getting mixed up with the built-in module string.
>>> s = 'This is my text of 2013-02-11, & it contained characters like this! (Exceptional)'
>>> listofsymbols = ['!', '@', '#', '%', '^', '&', '*', '(', ')', '_', '+', '=', '`', '/']
>>> print ''.join([i for i in s if i not in listofsymbols])
This is my text of 2013-02-11, it contained characters like this Exceptional
Another proposal, easily extendable to more complex filter criteria or other input data types:
from itertools import ifilter
def isValid(c): return c not in "!@#%^&*()_+=`/"
print "".join(ifilter(isValid, my_string))
I am a newbie to Python. I have an array of words, and each word has to be checked to see whether it contains any special characters or digits. If it does, I have to skip that word. How should I do this?
Does it have to be a regular expression? If not, you can use the isalpha() string method.
My reading of the problem is that you want to discard any words that contain non-alphabetical characters. Try the following:
>>> array = ['hello', 'hello2', '?hello', '?hello2']
>>> filtered = filter(str.isalpha, array)
>>> print filtered
['hello']
You could also write it as a list comprehension:
>>> filtered = [word for word in array if word.isalpha()]
>>> print filtered
['hello']
If there are only a few characters you want to exclude, use a blacklist; otherwise use a whitelist.
import string
abadword="""aaaa
bbbbb"""
words=["oneGoodWord", "a,bc",abadword, "xx\n",'123',"gone", "tab\ttab", "theEnd.","anotherGoodWord"]
bad=list(string.punctuation) #string.punctuation='!"#$%&\'()*+,-./:;<=>?@[\\]^_`{|}~'
bad+=['\n','\t','1'] #add some more characters you don't want
bad+=['one'] #this is redundant as in function skip set(word) becomes a set of word's characters. 'one' cannot match a character.
print bad #bad = ['!', '"', '#', '$', '%', '&', "'", '(', ')', '*', '+', ',', '-', '.', '/', ':', ';', '<', '=', '>', '?', '@', '[', '\\', ']', '^', '_', '`', '{', '|', '}', '~', '\n', '\t', '1', 'one']
bad=set(bad)
def skip(word):
    return len(set(word) & bad)==0  # word has no characters in common with the bad set
print "good words:"
print filter(skip,words) #prints ['oneGoodWord', 'gone', 'anotherGoodWord']
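This only shows the blacklist side; a minimal whitelist sketch along the same lines (keeping only words built entirely from characters you explicitly allow, here ASCII letters, and reusing the words list from above) might look like this:
import string

allowed = set(string.ascii_letters)  # whitelist: the only characters we accept

def keep(word):
    return set(word) <= allowed  # every character of word must be whitelisted

print(list(filter(keep, words)))  # ['oneGoodWord', 'gone', 'anotherGoodWord']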