How can I simplify this function? - python

Is there any way to simplify this function? Specifically, I'd like to rewrite it with fewer lines of indentation.
# split string (first argument) at location of separators (second argument, should be a string)
def split_string(text, separators):
    text = ' ' + text + ' '
    words = []
    word = ""
    for char in text:
        if char not in separators:
            word += char
        else:
            if word:
                words.append(word)
            word = ""
    if not words:
        words.append(text)
    return words

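Try using re.split, for example:
re.split('[%s]' % re.escape(separators), text)
The [] creates a regular expression character class to split on, and re.escape keeps separator characters such as ] or ^ from being read as regex syntax. Note that consecutive separators produce empty strings in the result.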
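A minimal sketch of the character-class approach, assuming separators is a non-empty string of single-character delimiters. The helper name split_on_any is made up for illustration; the trailing + and the filter are additions so that runs of separators do not leave empty strings behind.

```python
import re

def split_on_any(text, separators):
    """Split text on any character in separators, dropping empty pieces."""
    # re.escape guards characters such as ']' or '^' that are special inside [...]
    pattern = '[%s]+' % re.escape(separators)
    return [piece for piece in re.split(pattern, text) if piece]

print(split_on_any("foo.,.bar", ".,"))
# ['foo', 'bar']
```

An empty separators string would build an invalid pattern, so a caller would need to handle that case separately.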

Your code seems to produce
>>> split_string("foo.,.bar", ".,")
[' foo']
but your comment says
split_string("foo.,.bar", ".,") will return ["foo", "bar"]
Assuming the comment is what's intended, then I'd use itertools.groupby (I hate using regexes):
from itertools import groupby
def splitter(text, separators):
    grouped = groupby(text, lambda c: c in separators)
    return [''.join(g) for k,g in grouped if not k]
which gives
>>> splitter("foo.,.bar", ".,")
['foo', 'bar']
groupby returns an iterator over consecutive terms grouped by some function -- in this case, lambda c: c in separators -- of the terms.
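To make the grouping visible, here is a small sketch that prints the intermediate (key, run) pairs before anything is filtered out. The variable names are illustrative only.

```python
from itertools import groupby

text = "foo.,.bar"
separators = ".,"

# Each run of consecutive characters sharing the same key becomes one group.
runs = [(is_sep, ''.join(group))
        for is_sep, group in groupby(text, lambda c: c in separators)]
print(runs)
# [(False, 'foo'), (True, '.,.'), (False, 'bar')]
```

Keeping only the runs whose key is False gives the words.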

You should use the split() method. Taken from the official documentation:
str.split([sep[, maxsplit]])
Return a list of the words in the string, using sep as the delimiter string.
If maxsplit is given, at most maxsplit splits are done (thus, the list will have
at most maxsplit+1 elements). If maxsplit is not specified or -1, then there is no
limit on the number of splits (all possible splits are made).
If sep is given, consecutive delimiters are not grouped together and are deemed
to delimit empty strings (for example, '1,,2'.split(',') returns ['1', '', '2']).
The sep argument may consist of multiple characters (for example,
'1<>2<>3'.split('<>') returns ['1', '2', '3']). Splitting an empty string with a
specified separator returns [''].
If sep is not specified or is None, a different splitting algorithm is applied:
runs of consecutive whitespace are regarded as a single separator, and the result
will contain no empty strings at the start or end if the string has leading or
trailing whitespace. Consequently, splitting an empty string or a string consisting
of just whitespace with a None separator returns [].
For example, ' 1 2 3 '.split() returns ['1', '2', '3'], and
' 1 2 3 '.split(None, 1) returns ['1', '2 3 '].
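The documented cases can be checked directly. This is only a sketch that restates the behaviour quoted above; the variable names are arbitrary.

```python
# With an explicit separator, adjacent delimiters produce empty strings.
with_sep = '1,,2'.split(',')          # ['1', '', '2']
# The separator may be longer than one character.
multi_char = '1<>2<>3'.split('<>')    # ['1', '2', '3']
# maxsplit=1 performs at most one split, so at most two elements come back.
limited = 'a,b,c,d'.split(',', 1)     # ['a', 'b,c,d']
# With no separator, runs of whitespace collapse and the ends are trimmed.
no_sep = ' 1 2 3 '.split()            # ['1', '2', '3']
# Empty input: [''] with a separator, [] without one.
empty_with_sep = ''.split(',')        # ['']
blank_no_sep = '   '.split()          # []

print(with_sep, multi_char, limited, no_sep, empty_with_sep, blank_no_sep)
```

Note that str.split takes one separator string, not a set of alternative characters.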

You can do:
myString = "Some-text-here"
splitWords = myString.split("-")
The above code returns a list of the separated words. I used "-" as the delimiter; you may pass any delimiter you like. With no argument, split() separates on whitespace, like this:
myString = "Some text here"
splitWords = myString.split()

Related

Remove items containing only special characters from a list

From a PDF file I extract all the text as a string and convert it into a list by splitting on newlines, on runs of two or more whitespace characters, and on every dot (.).
Now, if a value in my list consists only of special characters, I want that value to be excluded.
import re
import PyPDF2

pdfFileObj = open('Python String.pdf', 'rb')
pdfReader = PyPDF2.PdfFileReader(pdfFileObj)
pageObj = pdfReader.getPage(0)
text = pageObj.extractText()
z = re.split(r"\n+|[.]|\s{2,}", text)
# drop the empty strings left behind by adjacent separators
while "" in z:
    z.remove("")
print(z)
My output is
['split()', 'method in Python split a string into a list of strings after breaking the', 'given string by the specified separator', 'Syntax', ':', 'str', 'split(separator, maxsplit)', 'Parameters', ':', 'separator', ':', 'This is a delimiter', ' The string splits at this specified separator', ' If is', 'no', 't provided then any white space is a separator', 'maxsplit', ':', 'It is a number, which tells us to split the string into maximum of provi', 'ded number of times', ' If it is not provided then the default is', '-', '1 that means there', 'is no limit', 'Returns', ':', 'Returns a list of s', 'trings after breaking the given string by the specifie', 'd separator']
Here are some values that contain only special characters and I want to remove those. Thanks
Remove those special characters before converting the text to a list.
Remove the while "" in z: z.remove("") loop and add the following line after the text variable is read:
text = re.sub('(a|b|c)', '', text)
In this example, my special characters are a, b and c.
Use a regular expression that tests if a string contains any letters or numbers.
import re
z = [x for x in z if re.search(r'[a-z\d]', x, flags=re.I)]
In the regexp, a-z matches letters, \d matches digits, so [a-z\d] matches any letter or digit (and the re.I flag makes it case-insensitive). So the list comprehension includes any elements of z that contain a letter or digit.
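A short sketch of that filter on a few entries shaped like the output in the question. The sample list is abbreviated and the names has_alnum and cleaned are illustrative.

```python
import re

# A few entries shaped like the output shown in the question.
z = ['Syntax', ':', 'str', 'split(separator, maxsplit)', '-', '1 that means there']

# Keep an entry only if it contains at least one letter a-z (any case) or a digit.
has_alnum = re.compile(r'[a-z\d]', flags=re.I)
cleaned = [x for x in z if has_alnum.search(x)]
print(cleaned)
# ['Syntax', 'str', 'split(separator, maxsplit)', '1 that means there']
```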

Replacing multiple chars in string with another character in Python

I have a list of strings. I want to check whether each string contains certain characters and, if it does, replace those characters with another character.
I have something like below:
invalid_chars = [' ', ',', ';', '{', '}', '(', ')', '\\n', '\\t', '=']
word = 'Ad{min > HR'
for c in list(word):
    if c in invalid_chars:
        word = word.replace(c, '_')
print(word)
>>> Ad_min_>_HR
I am trying to convert this into a function using a list comprehension, but I am getting strange characters...
def replace_chars(word, checklist, char_replace = '_'):
    return ''.join([word.replace(ch, char_replace) for ch in list(word) if ch in checklist])
print(replace_chars(word, invalid_chars))
>>> Ad_min > HRAd{min_>_HRAd{min_>_HR
Try this general pattern:
''.join([ch if ch not in invalid_chars else '_' for ch in word])
For the complete function:
def replace_chars(word, checklist, char_replace = '_'):
    return ''.join([ch if ch not in checklist else char_replace for ch in word])
Note: no need to wrap string word in a list(), it's already iterable.
This might be a good use for str.translate(). You can turn your invalid_chars into a translation table with str.maketrans() and apply wherever you need it:
invalid_chars = [' ', ',', ';', '{', '}', '(', ')', '\n', '\t', '=']
invalid_table = str.maketrans({k:'_' for k in invalid_chars})
word = 'Ad{min > HR'
word.translate(invalid_table)
Result:
'Ad_min_>_HR'
This is especially convenient if you need to apply the translation to several strings, and it is more efficient: you don't loop through the entire invalid_chars list for every character, as you would with if x in invalid_chars inside a loop.
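As a rough illustration of reusing one table, the sketch below builds it once and applies it to several strings. The words list is invented sample data.

```python
invalid_chars = [' ', ',', ';', '{', '}', '(', ')', '\n', '\t', '=']
# Build the table once; every key must be a single character.
invalid_table = str.maketrans({ch: '_' for ch in invalid_chars})

words = ['Ad{min > HR', 'a=b;c', 'plain']
cleaned = [w.translate(invalid_table) for w in words]
print(cleaned)
# ['Ad_min_>_HR', 'a_b_c', 'plain']
```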
This is easier with regex. You can search for a whole group of characters with a single substitution call. It should perform better too.
>>> import re
>>> re.sub(f"[{re.escape(''.join(invalid_chars))}]", "_", word)
'Ad_min_>_HR'
The code in the f-string builds a regex pattern that looks like this
>>> pattern = f"[{re.escape(''.join(invalid_chars))}]"
>>> print(repr(pattern))
'[\\ ,;\\{\\}\\(\\)\\\n\\\t=]'
>>> print(pattern)
[\ ,;\{\}\(\)\
\ =]
That is, a regex character set containing each of your invalid chars. (The backslash escaping ensures that none of them are interpreted as a regex control character, regardless of which characters you put in invalid_chars.) If you had specified them as a string in the first place, the ''.join() would not be required.
You can also compile the pattern (using re.compile()) if you need to re-use it on multiple words.
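A possible shape for the compiled variant, assuming the invalid characters are given as a single string. The name invalid_re and the sample words are illustrative.

```python
import re

# Given as one string, so no ''.join() is needed.
invalid_chars = ' ,;{}()\n\t='
# Compile once, then reuse the pattern object for every word.
invalid_re = re.compile(f"[{re.escape(invalid_chars)}]")

words = ['Ad{min > HR', 'key = value']
cleaned = [invalid_re.sub('_', w) for w in words]
print(cleaned)
# ['Ad_min_>_HR', 'key___value']
```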

Why does split() return more elements than split(" ") on same string?

I am using split() and split(" ") on the same string. But why is split(" ") returning fewer elements than split()? I want to know for which specific inputs this happens.
str.split with the None argument (or, no argument) splits on all whitespace characters, and this isn't limited to just the space you type in using your spacebar.
In [457]: text = 'this\nshould\rhelp\tyou\funderstand'
In [458]: text.split()
Out[458]: ['this', 'should', 'help', 'you', 'understand']
In [459]: text.split(' ')
Out[459]: ['this\nshould\rhelp\tyou\x0cunderstand']
A list of all whitespace characters that split(None) splits on can be found in the question "All the Whitespace Characters? Is it language independent?"
If you run the help command on the split() function you'll see this:
split(...) S.split([sep [,maxsplit]]) -> list of strings
Return a list of the words in the string S, using sep as the delimiter
string. If maxsplit is given, at most maxsplit splits are done. If sep
is not specified or is None, any whitespace string is a separator and
empty strings are removed from the result.
Therefore the difference between the two is that split() without a delimiter discards empty strings, while split() with an explicit delimiter keeps them.
The method str.split called without arguments has a somewhat different behaviour.
First it splits by any whitespace character.
'foo bar\nbaz\tmeh'.split() # ['foo', 'bar', 'baz', 'meh']
But it also removes the empty strings from the output list.
' foo bar '.split(' ') # ['', 'foo', 'bar', '']
' foo bar '.split() # ['foo', 'bar']
In Python, the split function splits on a specific string if one is given, otherwise on whitespace (and you can then access the result list by index as usual):
s = "Hello world! How are you?"
s.split()
Out[9]: ['Hello', 'world!', 'How', 'are', 'you?']
s.split("!")
Out[10]: ['Hello world', ' How are you?']
s.split("!")[0]
Out[11]: 'Hello world'
In my experience, most of the confusion comes from split()'s different treatment of whitespace.
A separator of ' ' versus None triggers different behavior in split(). According to the Python documentation:
If sep is not specified or is None, a different splitting algorithm is applied: runs of consecutive whitespace are regarded as a single separator, and the result will contain no empty strings at the start or end if the string has leading or trailing whitespace.
Below is an example in which the sample string has a trailing space ' ', the same whitespace character that is passed to the second split() call. The two calls therefore behave differently not because of a whitespace character mismatch, but because of how the method was designed to work. This is probably a convenience for common scenarios, but it can confuse people who expect split() to just split.
>>> sample = "a b "
>>> sample.split()
['a', 'b']
>>> sample.split(' ')
['a', 'b', '']
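The two behaviours can be compared side by side. This is a small sketch with invented sample strings covering trailing spaces, doubled spaces, and non-space whitespace.

```python
samples = ["a b ", " a  b", "a\tb\nc"]

# Map each sample to (result of split(), result of split(' ')).
results = {s: (s.split(), s.split(' ')) for s in samples}
for s, (no_arg, space_arg) in results.items():
    print(repr(s), no_arg, space_arg)
# 'a b ' ['a', 'b'] ['a', 'b', '']
# ' a  b' ['a', 'b'] ['', 'a', '', 'b']
# 'a\tb\nc' ['a', 'b', 'c'] ['a\tb\nc']
```

So split(' ') can return more elements (extra empty strings) or fewer (tabs and newlines are not separators), depending on the input.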

Python split with multiple delimiters not working

I have a string:
feature.append(freq_and_feature(text, freq))
I want a list containing each word of the string, like [feature, append, freq, and, feature, text, freq], where each word is a string, of course.
These strings are contained in a file called helper.txt, so I'm doing the following, as suggested by multiple SO posts, such as the accepted answer to "Python: Split string with multiple delimiters":
import re
with open("helper.txt", "r") as helper:
    for row in helper:
        print(re.split('\' .,()_', row))
However, I get the following, which is not what I want.
[' feature.append(freq_pain_feature(text, freq))\n']
re.split('\' .,()_', row)
This treats the whole pattern as one sequence to match: a quote, a space, any character (.), a comma, an empty group (()) and an underscore, in that order. You probably meant
re.split('[\' .,()_]', row)
re.split takes a regular expression as the first argument. To say "this OR that" in regular expressions, you can write a|b and it will match either a or b. If you wrote ab, it would only match a followed by b. Luckily, instead of writing '| |.|,|(|..., you can use [] to state that everything inside should be treated as "match one of these".
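A sketch of the character-class split applied to one line shaped like the question's input. The row value is a stand-in for a line read from helper.txt, and the filtering step is an addition, since adjacent delimiters leave empty strings in the result.

```python
import re

row = " feature.append(freq_and_feature(text, freq))\n"

# The character class matches any ONE of the listed characters.
pieces = re.split('[\' .,()_]', row)
# Adjacent delimiters leave empty strings behind, and the last piece is the
# newline from the file line, so strip each piece and drop the empty ones.
words = [p.strip() for p in pieces if p.strip()]
print(words)
# ['feature', 'append', 'freq', 'and', 'feature', 'text', 'freq']
```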
It seems you want to split a string with non-word or underscore characters. Use
import re
s = 'feature.append(freq_and_feature(text, freq))'
print([x for x in re.split(r'[\W_]+', s) if x])
# => ['feature', 'append', 'freq', 'and', 'feature', 'text', 'freq']
The [\W_]+ regex matches 1+ characters that are not word (\W = [^a-zA-Z0-9_]) or underscores.
You can get rid of the if x if you remove initial and trailing non-word characters from the input string, e.g. re.sub(r'^[\W_]+|[\W_]+$', '', s).
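The trimming variant might look like the sketch below. The names trimmed and words are illustrative.

```python
import re

s = 'feature.append(freq_and_feature(text, freq))'

# Strip leading and trailing non-word characters or underscores first...
trimmed = re.sub(r'^[\W_]+|[\W_]+$', '', s)
# ...so the split no longer produces an empty string at either end.
words = re.split(r'[\W_]+', trimmed)
print(words)
# ['feature', 'append', 'freq', 'and', 'feature', 'text', 'freq']
```

If the input contains no word characters at all, trimmed is empty and the split returns [''], so the filtered comprehension is the safer choice for arbitrary input.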
You can try this:
parts = re.split('[.(_,)]+', row, flags=re.IGNORECASE)
parts.pop()  # drop the trailing piece left after the final ))
print(parts)
This results in:
['feature', 'append', 'freq', 'and', 'feature', 'text', ' freq']
I think you are trying to split on the basis of non-word characters. It should be
re.split(r'[^A-Za-z0-9]+', s)
For ASCII input, [^A-Za-z0-9] can also be written as [\W_].
Python Code
s = 'feature.append(freq_and_feature(text, freq))'
print([x for x in re.split(r'[^A-Za-z0-9]+', s) if x])
Matching the words directly with findall also works:
p = re.compile(r'[^\W_]+')
test_str = "feature.append(freq_and_feature(text, freq))"
print(re.findall(p, test_str))

Splitting strings separated by multiple possible characters?

...note that values will be delimited by one or more space or TAB characters
How can I use the split() method if there are multiple separating characters of different types, as in this case?
By default, split can handle multiple types of whitespace. I'm not sure whether that's enough for what you need, but try it:
>>> s = "a \tb c\t\t\td"
>>> s.split()
['a', 'b', 'c', 'd']
It certainly works for multiple spaces and tabs mixed.
Split using regular expressions and not just one separator:
http://docs.python.org/2/library/re.html
I had the same problem with some strings separated by different whitespace chars, and used \s as shown in the Regular Expressions library specification.
\s matches any whitespace character, this is equivalent to the set [ \t\n\r\f\v].
You will need to import re as the regular expression handler:
import re
line = "something separated\t by \t\t\t different \t things"
workstr = re.sub(r'\s+', '\t', line)
So any whitespace character (\s) repeated one or more times (+) is replaced by a single tab (\t), which you can then process with split('\t'):
# workstr is now "something\tseparated\tby\tdifferent\tthings"
newline = workstr.split('\t')
# newline is now ['something', 'separated', 'by', 'different', 'things']
Do a text substitution first, then split.
For example, replace all tabs with spaces, then split on space.
You can use regular expressions first:
import re
re.sub(r'\s+', ' ', 'text with whitespace etc').split()
['text', 'with', 'whitespace', 'etc']
For whitespace delimiters, str.split() already does what you may want. From the Python Standard Library documentation:
str.split([sep[, maxsplit]])
If sep is not specified or is None, a different splitting algorithm is applied: runs of consecutive whitespace are regarded as a single separator, and the result will contain no empty strings at the start or end if the string has leading or trailing whitespace. Consequently, splitting an empty string or a string consisting of just whitespace with a None separator returns [].
For example, ' 1 2 3 '.split() returns ['1', '2', '3'], and ' 1 2 3 '.split(None, 1) returns ['1', '2 3 '].
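For values delimited by one or more space or TAB characters, a quick sketch comparing an explicit regex split with plain str.split(). The sample line is invented.

```python
import re

line = "  a \tb  c\t\t\td  "

# Splitting on runs of spaces/tabs leaves empty strings at the ends
# when the line has leading or trailing whitespace.
raw = re.split(r'[ \t]+', line)
# Stripping first avoids that.
via_regex = re.split(r'[ \t]+', line.strip())
# str.split() with no argument handles both steps at once.
via_split = line.split()

print(raw)        # ['', 'a', 'b', 'c', 'd', '']
print(via_regex)  # ['a', 'b', 'c', 'd']
print(via_split)  # ['a', 'b', 'c', 'd']
```

The regex form is mainly useful when only some whitespace characters should count as delimiters. Otherwise split() alone is enough.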
