Python - use lists instead of strings?

From a Stack Overflow answer:
"Don't modify strings.
Work with them as lists; turn them into strings only when needed.
... code sample ...
Python strings are immutable (i.e. they can't be modified). There are a lot of reasons for this. Use lists until you have no choice, only then turn them into strings."
Is this considered best practice?
I find it a bit odd that Python has methods that return new modified strings (such as upper(), title(), replace() etc.) but doesn't have an insert method that returns a new string. Have I missed such a method?
Edit: I'm trying to rename files by inserting a character:
import os
for i in os.listdir('.'):
    i.insert(3, '_')
Which doesn't work due to immutability. Adding to the beginning of a string works fine though:
for i in os.listdir('.'):
    os.rename(i, 'some_random_string' + i)
Edit2: the solution:
>>> for i in os.listdir('.'):
...     os.rename(i, i[:4] + '_' + i[4:])
Slicing certainly is nice and solves my problem, but is there a logical explanation why there is no insert() method that returns a new string?
Thanks for the help.

If you want to insert at a particular spot, you can use slices and +. For example:
a = "hello"
b = a[:2] + '_S1M0N_' + a[2:]
then b will be equal to he_S1M0N_llo.

It's at least arguably a best practice if you are doing a very large number of modifications to a string. It is not a general purpose best practice. It's simply a useful technique for solving performance problems when doing heavy string manipulation.
My advice is, don't do it until performance becomes an issue.
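To make the advice concrete, here is a minimal sketch of the technique (the loop body is purely illustrative): collect the pieces in a list and join once at the end, instead of growing a string with += on every iteration.
# Potentially quadratic: each += may copy everything built so far
result = ''
for i in range(1000):
    result += str(i)

# Linear: the pieces are accumulated in a list and joined once
pieces = []
for i in range(1000):
    pieces.append(str(i))
result = ''.join(pieces)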

You can define a generic function that works on any sequence (strings, lists, tuples, etc.) using the slice syntax:
def insert(s, c, p):
    return s[:p] + c + s[p:]

>>> insert('FILE1', '_', 4)
'FILE_1'
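One caveat: for non-string sequences the inserted value c must itself be a sequence of the same type, since + does not concatenate mixed types:
>>> insert([1, 2, 3], ['x'], 1)
[1, 'x', 2, 3]
>>> insert((1, 2, 3), ('x',), 1)
(1, 'x', 2, 3)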

Related

How is the string.join(str_list, '') implemented under the hood in Python?

I know that concatenating two strings using the += operator makes a new copy of the old string and then concatenates the new string to that, resulting in quadratic time complexity.
This answer gives a nice time comparison between the += operation and string.join(str_list, ''). It looks like the join() method runs in linear time (correct me if I am wrong). Out of curiosity, I wanted to know how the string.join(str_list, '') method is implemented in Python since strings are immutable objects?
It's implemented in C, so Python-level string immutability isn't a constraint there. You can find the relevant source here: unicodeobject.c
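Because the C code can see all the pieces up front, it can compute the final size once, allocate a single buffer, and copy each piece into it, which is what makes join linear. A rough pure-Python sketch of that strategy (not the actual CPython code; the real implementation handles arbitrary Unicode, while this sketch assumes ASCII pieces):
def join_sketch(parts):
    # pass 1: compute the total length of the result
    total = sum(len(p) for p in parts)
    # pass 2: copy each piece into one pre-allocated buffer
    buf = bytearray(total)
    pos = 0
    for p in parts:
        buf[pos:pos + len(p)] = p.encode('ascii')  # sketch assumes ASCII pieces
        pos += len(p)
    return buf.decode('ascii')

>>> join_sketch(['1', '2', '3'])
'123'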

How to sum up a list of strings

Given a list of strings as such,
xs = ['1\n','2\n','3\n','4\n','5\n']
sum up the integers, return the sum as a string, and append it to the list, so that the returned list is
xs = ['1\n','2\n','3\n','4\n','5\n','Sum:15\n']
I understand the process of going through the list and iterating over it; I just don't understand how to get rid of the \n character so that I can use only the integers to find the sum.
def my_fun(x):
    return x + ["Sum:%s\n" % sum(map(int, x))]
This uses a generator expression:
>>> xs + ['Sum:{0}\n'.format(str(sum(int(s) for s in xs)))]
['1\n', '2\n', '3\n', '4\n', '5\n', 'Sum:15\n']
To answer your question a bit more directly (leaving out the iteration since you said that's not the problem):
Believe it or not, int actually ignores the trailing newline when you use it to parse:
>>> int('1\n')
1
Once you have an int, you can do arithmetic as normal.
This is a documented feature in Python 2 and 3:
Optionally, the literal can be preceded by + or - (with no space in between) and surrounded by whitespace.
-Python int documentation (emphasis mine)
If you're interested in more streamlined ways of doing the iteration, you can see Joran's answer and the comments on it, but if this is some kind of assignment, I wouldn't use them if I were you. It benefits you more to work through the problems yourself. You of course want to use the more advanced features for professional work.

Doing multiple, successive regex replacements in Python. Inefficient?

First off - my code works. It just runs slowly, and I'm wondering if I'm missing something that will make it more efficient. I'm parsing PDFs with Python (and yes, I know that this should be avoided if at all possible).
My problem is that I have to do several rather complex regex substitutions - and when I say substitution, I really mean deletion. I have done the ones that strip out the most data first so that the later expressions don't need to analyze too much text, but that's all I can think of to speed things up.
I'm pretty new to Python and regexes, so it's very conceivable this could be done better.
Thanks for reading.
regexPagePattern = r"(Wk)\d{1,2}.\d{2}(\d\.\d{1,2})"
regexCleanPattern = r"(\(continued\))?((II)\d\.\d{1,2}|\d\.\d{1,2}(II)|\d\.\d{1,2})"
regexStartPattern = r".*(II)(\s)?(INDEX OF CHARTS AFFECTED)"
regexEndPattern = r"(II.)\d{1,5}\((P|T)\).*"
contentRaw = re.sub(regexStartPattern, "", contentRaw)
contentRaw = re.sub(regexEndPattern, "", contentRaw)
contentRaw = re.sub(regexPagePattern, "", contentRaw)
contentRaw = re.sub(regexCleanPattern, "", contentRaw)
I'm not sure if you do this inside of a loop. If not, the following does not apply.
If you use a pattern multiple times you should compile it using re.compile(...). This way the pattern is only compiled once, which can give a noticeable speed increase. Minimal example:
>>> a="a b c d e f"
>>> re.sub(' ', '-', a)
'a-b-c-d-e-f'
>>> p=re.compile(' ')
>>> re.sub(p, '-', a)
'a-b-c-d-e-f'
Another idea: use re.split( ... ) instead of re.sub and operate on the array of resulting fragments of your data. I'm not entirely sure how it is implemented, but I think re.sub creates text fragments and merges them into one string in the end, which is expensive. After the last step you can join the array using "".join(fragments). Obviously, this method will not work if your patterns overlap somewhere.
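For the deletions in this question, the split idea would look roughly like this (a sketch; note that the question's patterns contain capturing groups, which re.split would keep in the fragment list, so they would first have to become non-capturing (?:...)):
# Sketch only: the pattern must contain no capturing groups
pattern = re.compile(r"(?:Wk)\d{1,2}.\d{2}(?:\d\.\d{1,2})")
fragments = pattern.split(contentRaw)
contentRaw = "".join(fragments)  # for pure deletion this equals pattern.sub("", contentRaw)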
It would be interesting to get timing information for your program before and after your changes.
Regexes are always the last choice when trying to decode strings. So if you see another possibility to solve your problem, use that.
That said, you could use re.compile to precompile your regex patterns:
regexPagePattern = re.compile(r"(Wk)\d{1,2}.\d{2}(\d\.\d{1,2})")
contentRaw = regexPagePattern.sub("", contentRaw)
That should speed things up a bit (a pretty nice bit ;) )
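Since all four substitutions delete text, another option is to combine them into a single alternation so the text is scanned once instead of four times. This is a sketch, not a drop-in replacement: one pass over the original text is not always equivalent to four sequential passes (an earlier deletion can create a match for a later pattern), so verify the output on your data. The branches are tried left to right, in the same order as the original sub calls.
combinedPattern = re.compile(
    r".*(II)(\s)?(INDEX OF CHARTS AFFECTED)"
    r"|(II.)\d{1,5}\((P|T)\).*"
    r"|(Wk)\d{1,2}.\d{2}(\d\.\d{1,2})"
    r"|(\(continued\))?((II)\d\.\d{1,2}|\d\.\d{1,2}(II)|\d\.\d{1,2})"
)
contentRaw = combinedPattern.sub("", contentRaw)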

Maintaining sequence type in set or list comprehension in python

If I have a function that can operate on both sets and lists and should return a modified form of the sequence, is there a way to preserve the sequence type but still use a comprehension? For example, in the following if I call stripcommonpathprefix with a set, it works but has the undesired side effect of converting the set to a list. Is it possible to maintain the type (while still using a comprehension) without having to directly check isinstance and then return the correct type based on that? If not, what would be the cleanest way to do this?
import os

def commonpathprefix(seq):
    return os.path.commonprefix(seq).rpartition(os.path.sep)[0] + os.path.sep

def stripcommonpathprefix(seq):
    prefix = commonpathprefix(seq)
    prefixlen = len(prefix)
    return prefix, [p[prefixlen:] for p in seq]
Thank you, and sorry if this is a basic question. I'm just starting to learn Python.
P.S. I'm using Python 3.2.2
There is no good way to preserve the type of the sequence. As you have guessed, if you really want to do this, you will have to convert the answer at the end to the type you want. It's quite likely that you don't need to do this, so you should think hard about it.
One shortcut that might help you if you do decide to convert: the types of the built-in sequences are also constructors that can create those sequences:
def strip_common_path_prefix(seq):
    # blah blah
    return prefix, type(seq)(result)
There is no common way to do this without type checking. Also for sets you can use a set comprehension: { p[prefixlen:] for p in seq }.
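Putting the two answers together, a generator expression can feed whichever constructor type(seq) yields, so no intermediate list is built (a sketch, reusing commonpathprefix from the question):
def strip_common_path_prefix(seq):
    prefix = commonpathprefix(seq)
    prefixlen = len(prefix)
    # type(seq) is list, set, tuple, ... and rebuilds the same kind of sequence
    return prefix, type(seq)(p[prefixlen:] for p in seq)
This works for list, set, frozenset, and tuple; it would not do the right thing for str, whose constructor just stringifies the generator object.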

Efficiently carry out multiple string replacements in Python

If I would like to carry out multiple string replacements, what is the most efficient way to carry this out?
An example of the kind of situation I have encountered in my travels is as follows:
>>> strings = ['a', 'list', 'of', 'strings']
>>> [s.replace('a', '')...replace('u', '') if len(s) > 2 else s for s in strings]
['a', 'lst', 'of', 'strngs']
The specific example you give (deleting single characters) is perfect for the translate method of strings, as is substitution of single characters with single characters. If the input string is a Unicode one, then, as well as the two above kinds of "substitution", substitution of single characters with multiple character strings is also fine with the translate method (not if you need to work on byte strings, though).
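For the vowel-stripping example in the question, the translate approach is spelled like this in Python 3 (Python 2 used string.maketrans and, for byte strings, a separate deletechars argument to translate):
>>> strings = ['a', 'list', 'of', 'strings']
>>> table = str.maketrans('', '', 'aeiou')  # third argument: characters to delete
>>> [s.translate(table) if len(s) > 2 else s for s in strings]
['a', 'lst', 'of', 'strngs']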
If you need to replace substrings of multiple characters, then I would also recommend using a regular expression -- though not in the way #gnibbler's answer recommends; rather, I'd build the regex from r'onestring|another|yetanother|orthis' (join the substrings you want to replace with vertical bars -- be sure to also re.escape them if they contain special characters, of course) and write a simple substituting-function based on a dict.
I'm not going to offer a lot of code at this time since I don't know which of the two paragraphs applies to your actual needs, but (when I later come back home and check SO again;-) I'll be glad to edit to add a code example as necessary depending on your edits to your question (more useful than comments to this answer;-).
Edit: in a comment the OP says he wants a "more general" answer (without clarifying what that means) then in an edit of his Q he says he wants to study the "tradeoffs" between various snippets all of which use single-character substrings (and check presence thereof, rather than replacing as originally requested -- completely different semantics, of course).
Given this utter and complete confusion all I can say is that to "check tradeoffs" (performance-wise) I like to use python -mtimeit -s'setup things here' 'statements to check' (making sure the statements to check have no side effects to avoid distorting the time measurements, since timeit implicitly loops to provide accurate timing measurements).
A general answer (without any tradeoffs, and involving multiple-character substrings, so completely contrary to his Q's edit but consonant to his comments -- the two being entirely contradictory it is of course impossible to meet both):
import re

class Replacer(object):
    def __init__(self, **replacements):
        self.replacements = replacements
        self.locator = re.compile('|'.join(re.escape(s) for s in replacements))

    def _doreplace(self, mo):
        return self.replacements[mo.group()]

    def replace(self, s):
        return self.locator.sub(self._doreplace, s)
Example use:
r = Replacer(zap='zop', zip='zup')
print r.replace('allazapollezipzapzippopzip')
If some of the substrings to be replaced are Python keywords, they need to be passed in a tad less directly, e.g., the following:
r = Replacer(abc='xyz', def='yyt', ghi='zzq')
would fail because def is a keyword, so you need e.g.:
r = Replacer(abc='xyz', ghi='zzq', **{'def': 'yyt'})
or the like.
I find this a good use for a class (rather than procedural programming) because the RE to locate the substrings to replace, the dict expressing what to replace them with, and the method performing the replacement, really cry out to be "kept all together", and a class instance is just the right way to perform such a "keeping together" in Python. A closure factory would also work (since the replace method is really the only part of the instance that needs to be visible "outside") but in a possibly less-clear, harder to debug way:
def make_replacer(**replacements):
    locator = re.compile('|'.join(re.escape(s) for s in replacements))
    def _doreplace(mo):
        return replacements[mo.group()]
    def replace(s):
        return locator.sub(_doreplace, s)
    return replace
r = make_replacer(zap='zop', zip='zup')
print r('allazapollezipzapzippopzip')
The only real advantage might be a very modestly better performance (needs to be checked with timeit on "benchmark cases" considered significant and representative for the app using it) as the access to the "free variables" (replacements, locator, _doreplace) in this case might be minutely faster than access to the qualified names (self.replacements etc) in the normal, class-based approach (whether this is the case will depend on the Python implementation in use, whence the need to check with timeit on significant benchmarks!).
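For instance, the comparison could be run like this (illustrative only: the input string and repeat count are arbitrary, and both factories are assumed to be defined in the script being timed):
import timeit

setup = '''
from __main__ import Replacer, make_replacer
r_class = Replacer(zap='zop', zip='zup')
r_closure = make_replacer(zap='zop', zip='zup')
s = 'allazapollezipzapzippopzip' * 100
'''
print timeit.timeit('r_class.replace(s)', setup=setup, number=10000)
print timeit.timeit('r_closure(s)', setup=setup, number=10000)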
You may find that it is faster to create a regex and do all the replacements at once.
It's also a good idea to move the replacement code out to a function so that you can memoize it if you are likely to have duplicates in the list.
>>> import re
>>> [re.sub('[aeiou]','',s) if len(s) > 2 else s for s in strings]
['a', 'lst', 'of', 'strngs']
>>> def replacer(s, memo={}):
...     if s not in memo:
...         memo[s] = re.sub('[aeiou]','',s)
...     return memo[s]
...
>>> [replacer(s) if len(s) > 2 else s for s in strings]
['a', 'lst', 'of', 'strngs']
