Is SPLIT doing strange things? [duplicate] - python

What is the point of '/segment/segment/'.split('/') returning ['', 'segment', 'segment', '']?
Notice the empty elements. If you're splitting on a delimiter that happens to appear at the very beginning and at the very end of a string, what extra value does it give you to have the empty string returned from each end?

str.split complements str.join, so
"/".join(['', 'segment', 'segment', ''])
gets you back the original string.
If the empty strings were not there, the first and last '/' would be missing after the join().
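A quick round-trip check makes the symmetry concrete (plain Python, nothing beyond the standard behavior):

s = '/segment/segment/'
parts = s.split('/')          # ['', 'segment', 'segment', '']
assert '/'.join(parts) == s   # join reconstructs the original string exactly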

More generally, to remove empty strings returned in split() results, you may want to look at the filter function.
Example:
f = filter(None, '/segment/segment/'.split('/'))
s_all = list(f)
returns
['segment', 'segment']

There are two main points to consider here:
Expecting '/segment/segment/'.split('/') to equal ['segment', 'segment'] is reasonable, but it loses information. If split() worked that way and I told you that a.split('/') == ['segment', 'segment'], you couldn't tell me what a was.
What should 'a//b'.split('/') return: ['a', 'b'] or ['a', '', 'b']? That is, should split() merge adjacent delimiters? If it did, it would be very hard to parse data that's delimited by a character where some of the fields can be empty. I am fairly sure there are many people who do want the empty values in the result for that case!
In the end, it boils down to two things:
Consistency: if I have n delimiters in a, I get n+1 values back after the split().
It should be possible to do complex things, and easy to do simple things: if you want to ignore empty strings as a result of the split(), you can always do:
def mysplit(s, delim=None):
    return [x for x in s.split(delim) if x]
but if one doesn't want to ignore the empty values, one should be able to.
The language has to pick one definition of split()—there are too many different use cases to satisfy everyone's requirement as a default. I think that Python's choice is a good one, and is the most logical. (As an aside, one of the reasons I don't like C's strtok() is because it merges adjacent delimiters, making it extremely hard to do serious parsing/tokenization with it.)
There is one exception: a.split() without an argument squeezes consecutive white-space, but one can argue that this is the right thing to do in that case. If you don't want that behavior, you can always do a.split(' ').

I'm not sure what kind of answer you're looking for. You get four items because you have three delimiters. If you don't want the empty ones, just use:
'/segment/segment/'.strip('/').split('/')

Having x.split(y) always return a list of 1 + x.count(y) items is a precious regularity -- as @gnibbler's already pointed out, it makes split and join exact inverses of each other (as they obviously should be), it also precisely maps the semantics of all kinds of delimiter-joined records (such as csv file lines [[net of quoting issues]], lines from /etc/group in Unix, and so on), and it allows (as @Roman's answer mentioned) easy checks for (e.g.) absolute vs relative paths (in file paths and URLs), and so forth.
Another way to look at it is that you shouldn't wantonly toss information out of the window for no gain. What would be gained in making x.split(y) equivalent to x.strip(y).split(y)? Nothing, of course -- it's easy to use the second form when that's what you mean, but if the first form were arbitrarily deemed to mean the second one, you'd have a lot of work to do when you do want the first one (which is far from rare, as the previous paragraph points out).
But really, thinking in terms of mathematical regularity is the simplest and most general way you can teach yourself to design passable APIs. To take a different example, it's very important that for any valid x and y, x == x[:y] + x[y:] -- which immediately indicates why one extreme of a slicing should be excluded. The simpler the invariant assertion you can formulate, the likelier it is that the resulting semantics are what you need in real-life uses -- part of the mystical fact that maths is very useful in dealing with the universe.
Try formulating the invariant for a split dialect in which leading and trailing delimiters are special-cased... Counter-example: string methods such as isspace are not maximally simple -- x.isspace() is equivalent to x and all(c in string.whitespace for c in x) -- that silly leading "x and" is why you so often find yourself coding not x or x.isspace(), to get back the simplicity that should have been designed into the is... string methods (whereby an empty string "is" anything you want -- contrary to man-in-the-street horse-sense, maybe [[empty sets, like zero &c, have always confused most people;-)]], but fully conforming to obvious well-refined mathematical common-sense!-).
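As a quick demonstration, that x == x[:y] + x[y:] slicing invariant is easy to check by brute force:

x = 'abcdef'
for y in range(len(x) + 1):
    assert x == x[:y] + x[y:]   # holds at every split point, including 0 and len(x)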

Well, it lets you know there was a delimiter there. So, seeing 4 results lets you know you had 3 delimiters. This gives you the power to do whatever you want with this information, rather than having Python drop the empty elements, and then making you manually check for starting or ending delimiters if you need to know it.
Simple example: Say you want to check for absolute vs. relative filenames. This way you can do it all with the split, without also having to check what the first character of your filename is.
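A minimal sketch (the helper name is made up for illustration) showing the path check falling straight out of split:

def is_absolute(path):
    # '/usr/bin'.split('/') -> ['', 'usr', 'bin']; the leading '' marks
    # a delimiter at position 0, i.e. an absolute path
    return path.split('/')[0] == ''

assert is_absolute('/usr/bin')
assert not is_absolute('usr/bin')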

Consider this minimal example:
>>> '/'.split('/')
['', '']
split must give you what's before and after the delimiter '/', but there are no other characters. So it has to give you the empty string, which technically precedes and follows the '/', because '' + '/' + '' == '/'.

If you don't want empty strings to be returned by split, use it without args:
>>> " this is a sentence ".split()
['this', 'is', 'a', 'sentence']
>>> " this is a sentence ".split(" ")
['', '', 'this', '', '', 'is', '', 'a', 'sentence', '']

Use strip() before split() if you want to discard blank entries at the ends:
youroutput.strip().split('splitter')
Example:
yourstring = ' \nhey\njohn\nhow\n\nare\nyou'
yourstring.strip().split('\n')   # ['hey', 'john', 'how', '', 'are', 'you'] -- the interior blank line still appears as ''

Related

Why is there an empty string, and only one, created when I split a string by "?" [duplicate]


Indexing the wrong character for an expression

My program seems to be indexing the wrong character or not at all.
I wrote a basic calculator that allows expressions to be used. It works by having the user enter the expression, turning it into a list, indexing the first number at position 0, and then using try/except statements to index the second number and the operator. All of this is in a while loop that finishes when the user enters done at the prompt.
The program seems to work fine if I type the expression like "1+1", but if I add spaces, "1 + 1", it cannot index it, or it ends up indexing the operator if I do "1+1" followed by "1 + 1".
I have asked in a group chat before and someone told me to use tokenization instead of my method, but I want to understand why my program is not running properly before moving on to something else.
Here is my code:
https://hastebin.com/umabukotab.py
Thank you!
Strings are essentially sequences of characters. "1+1" contains three characters, whereas "1 + 1" contains five, because of the two added spaces. So when you index position 2 in the longer string, you land on the operator rather than on the second number.
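A minimal illustration (the variable names are made up for the example):

expr_nospace = "1+1"     # index 2 is '1', the second operand
expr_spaced = "1 + 1"    # index 2 is '+', the operator
print(expr_nospace[2], expr_spaced[2])   # prints: 1 +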
Parsing input is often not easy, and parsing arithmetic expressions in particular can get tricky quite quickly. Removing spaces from the input, as suggested by @Sethroph, is a viable solution, but it will only get you so far. If you suddenly need to support input like 1+2+3, it will still break.
Another solution would be to split your input on the operator. For example:
input = '1 + 2'
terms = input.split('+')         # ['1 ', ' 2'] -- note the spaces
terms = list(map(int, terms))    # [1, 2]; int() can handle leading/trailing whitespace
output = terms[0] + terms[1]     # 3
Still, although this can handle situations like 1 + 2 + 3, it will break when there are multiple different operators involved, or when there are parentheses (but that might be something you need not worry about, depending on how complex you want your calculator to be).
IMO, a better approach would indeed be to use tokenization. Personally, I'd use parser combinators, but that may be a bit overkill. For reference, here's an example calculator whose input is parsed using parsy, a parser combinator library for Python.
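To give a flavor of what tokenization buys you, here is a minimal sketch (not the parsy-based calculator linked above, just plain re) that splits an expression into numbers and operators regardless of spacing:

import re

# integers, the four operators, and parentheses; whitespace simply never matches
tokens = re.findall(r'\d+|[+\-*/()]', '1 + 2*3')
print(tokens)   # ['1', '+', '2', '*', '3']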
You could remove the spaces before processing the string by using replace().
Try adding in:
clean_input = hold_input.replace(" ", "")
just after you create hold_input.

How to remove or translate multiple strings from strings?

I have a long string like this:
'[("He tended to be helpful, enthusiastic, and encouraging, even to studentsthat didn\'t have very much innate talent.\\n",), (\'Great instructor\\n\',), (\'He could always say something nice and was always helpful.\\n\',), (\'He knew what he was doing.\\n\',), (\'Likes art\\n\',), (\'He enjoys the classwork.\\n\',), (\'Good discussion of ideas\\n\',), (\'Open-minded\\n\',), (\'We learned stuff without having to take notes, we just applied it to what we were doing; made it an interesting and fun class.\\n\',), (\'Very kind, gave good insight on assignments\\n\',), (\' Really pushed me in what I can do; expanded how I thought about art, the materials used, and how it was visually.\\n\',)
and I want to remove all [, (, ", \, \n from this string at once. I can do it one by one, but I always fail with '\n'. Is there an efficient way to remove or translate all these characters or blank-line symbols?
Since my sentences are not long, I do not want to use dictionary-based methods like in earlier questions.
Maybe you could use a regex to find all the characters that you want to replace:
import re

s = s.strip()
r = re.compile(r"[\[\]()\\\"',]")   # one character class covering every character to delete
s = re.sub(r, '', s)
print(s.replace("\\n", ""))
I had some trouble with the "\n", but removing it with replace() after the regex is easy.
If the string is a valid Python expression, then you can use literal_eval from the ast module to transform the string into tuples, and after that you can process every tuple:
from ast import literal_eval

' '.join(el[0].strip() for el in literal_eval(your_string))
If not, then you can use this:
import re

def get_part_string(your_string):
    for part in re.findall(r'\((.+?)\)', your_string):
        yield re.sub(r'[\"\'\\\\n]', '', part).strip(', ')

''.join(get_part_string(your_string))

Pyparsing delimited list only returns first element

Here is my code:
from pyparsing import Word, alphanums, delimitedList

l = "1.3E-2 2.5E+1"
parser = Word(alphanums + '+-.')
grammar = delimitedList(parser, delim='\t ')
print(grammar.parseString(l))
It returns :
['1.3E-2']
Obviously, I want both values, not just one. Any idea what is going on?
As @dawg explains, delimitedList is intended for cases where you have an expression with separating non-whitespace delimiters, typically commas. Pyparsing implicitly skips over whitespace, so in the pyparsing world, what you are really seeing is not a delimitedList, but OneOrMore(realnumber). Also, parseString internally calls str.expandtabs on the provided input string, unless you use the parseWithTabs=True argument. Expanding tabs to spaces helps preserve columnar alignment of data when it is in tabular form, and when I originally wrote pyparsing, this was a prevalent use case.
If you have control over this data, then you might want to use a different delimiter than <TAB>, perhaps commas or semicolons. If you are stuck with this format, but determined to use pyparsing, then use OneOrMore.
As you move forward, you will also want to be more precise about the expressions you define and the variable names that you use. The name "parser" is not very informative, and the pattern of Word(alphanums+'+-.') will match a lot of things besides valid real values in scientific notation. I understand if you are just trying to get anything working, this is a reasonable first cut, and you can come back and tune it once you get something going. If in fact you are going to be parsing real numbers, here is an expression that might be useful:
realnum = Regex(r'[+-]?\d+\.\d*([eE][+-]?\d+)?').setParseAction(lambda t: float(t[0]))
Then you can define your grammar as "OneOrMore(realnum)", which is also a lot more self-explanatory. And the parse action will convert your strings to floats at parse time, which will save you step later when actually working with the parsed values.
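Putting those pieces together, a quick sketch (same realnum expression as above):

from pyparsing import OneOrMore, Regex

realnum = Regex(r'[+-]?\d+\.\d*([eE][+-]?\d+)?').setParseAction(lambda t: float(t[0]))
grammar = OneOrMore(realnum)
print(grammar.parseString("1.3E-2\t2.5E+1").asList())   # [0.013, 25.0]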
Good luck!
Works if you switch to raw strings:
l = r"1.3E-2\t2.5E+1"
parser = Word(alphanums + '+-.')
grammar = delimitedList(parser, delim=r'\t')
print(grammar.parseString(l))
Prints:
['1.3E-2', '2.5E+1']
In general, delimitedList works with something like PDPDP where P is the parse target and D is the delimiter or delimiting sequence.
You have delim='\t '. That specifically is a delimiter of 1 tab followed by 1 space; it is not either tab or space.

Efficiently carry out multiple string replacements in Python

If I would like to carry out multiple string replacements, what is the most efficient way to carry this out?
An example of the kind of situation I have encountered in my travels is as follows:
>>> strings = ['a', 'list', 'of', 'strings']
>>> [s.replace('a', '')...replace('u', '') for s in strings if len(s) > 2]
['a', 'lst', 'of', 'strngs']
The specific example you give (deleting single characters) is perfect for the translate method of strings, as is substitution of single characters with single characters. If the input string is a Unicode one, then, as well as the two above kinds of "substitution", substitution of single characters with multiple character strings is also fine with the translate method (not if you need to work on byte strings, though).
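For example, deleting a set of single characters with translate looks like this in Python 3 (a minimal sketch):

# build a table mapping each vowel to None, i.e. deletion
table = str.maketrans('', '', 'aeiou')
print('strings'.translate(table))   # strngs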
If you need to replace substrings of multiple characters, then I would also recommend using a regular expression -- though not in the way @gnibbler's answer recommends; rather, I'd build the regex from r'onestring|another|yetanother|orthis' (join the substrings you want to replace with vertical bars -- be sure to also re.escape them if they contain special characters, of course) and write a simple substituting-function based on a dict.
I'm not going to offer a lot of code at this time since I don't know which of the two paragraphs applies to your actual needs, but (when I later come back home and check SO again;-) I'll be glad to edit to add a code example as necessary depending on your edits to your question (more useful than comments to this answer;-).
Edit: in a comment the OP says he wants a "more general" answer (without clarifying what that means) then in an edit of his Q he says he wants to study the "tradeoffs" between various snippets all of which use single-character substrings (and check presence thereof, rather than replacing as originally requested -- completely different semantics, of course).
Given this utter and complete confusion all I can say is that to "check tradeoffs" (performance-wise) I like to use python -mtimeit -s'setup things here' 'statements to check' (making sure the statements to check have no side effects to avoid distorting the time measurements, since timeit implicitly loops to provide accurate timing measurements).
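For instance, to compare two candidate approaches from the shell (the test string here is just a placeholder):

python -mtimeit -s "s='allazapollezipzapzippopzip'" "s.replace('zap','zop').replace('zip','zup')"
python -mtimeit -s "import re; s='allazapollezipzapzippopzip'" "re.sub('zap|zip', lambda m: {'zap':'zop','zip':'zup'}[m.group()], s)"

Neither statement has side effects, so timeit's implicit looping gives a fair comparison.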
A general answer (without any tradeoffs, and involving multiple-character substrings, so completely contrary to his Q's edit but consonant to his comments -- the two being entirely contradictory it is of course impossible to meet both):
import re

class Replacer(object):
    def __init__(self, **replacements):
        self.replacements = replacements
        self.locator = re.compile('|'.join(re.escape(s) for s in replacements))
    def _doreplace(self, mo):
        return self.replacements[mo.group()]
    def replace(self, s):
        return self.locator.sub(self._doreplace, s)
Example use:
r = Replacer(zap='zop', zip='zup')
print(r.replace('allazapollezipzapzippopzip'))
If some of the substrings to be replaced are Python keywords, they need to be passed in a tad less directly, e.g., the following:
r = Replacer(abc='xyz', def='yyt', ghi='zzq')
would fail because def is a keyword, so you need e.g.:
r = Replacer(abc='xyz', ghi='zzq', **{'def': 'yyt'})
or the like.
I find this a good use for a class (rather than procedural programming) because the RE to locate the substrings to replace, the dict expressing what to replace them with, and the method performing the replacement, really cry out to be "kept all together", and a class instance is just the right way to perform such a "keeping together" in Python. A closure factory would also work (since the replace method is really the only part of the instance that needs to be visible "outside") but in a possibly less-clear, harder to debug way:
def make_replacer(**replacements):
    locator = re.compile('|'.join(re.escape(s) for s in replacements))
    def _doreplace(mo):
        return replacements[mo.group()]
    def replace(s):
        return locator.sub(_doreplace, s)
    return replace

r = make_replacer(zap='zop', zip='zup')
print(r('allazapollezipzapzippopzip'))
The only real advantage might be a very modestly better performance (needs to be checked with timeit on "benchmark cases" considered significant and representative for the app using it) as the access to the "free variables" (replacements, locator, _doreplace) in this case might be minutely faster than access to the qualified names (self.replacements etc) in the normal, class-based approach (whether this is the case will depend on the Python implementation in use, whence the need to check with timeit on significant benchmarks!).
You may find that it is faster to create a regex and do all the replacements at once.
Also a good idea to move the replacement code out to a function so that you can memoize if you are likely to have duplicates in the list
>>> import re
>>> [re.sub('[aeiou]', '', s) if len(s) > 2 else s for s in strings]
['a', 'lst', 'of', 'strngs']
>>> def replacer(s, memo={}):
... if s not in memo:
... memo[s] = re.sub('[aeiou]','',s)
... return memo[s]
...
>>> [replacer(s) if len(s) > 2 else s for s in strings]
['a', 'lst', 'of', 'strngs']
