I am learning Python and trying to figure out an efficient way to tokenize a string of numbers separated by commas into a list. Well formed cases work as I expect, but less well formed cases not so much.
If I have this:
A = '1,2,3,4'
B = [int(x) for x in A.split(',')]
B results in [1, 2, 3, 4]
which is what I expect, but if the string is something more like
A = '1,,2,3,4,'
If I use the same list comprehension for B as above, I get an exception. I think I understand why (some of the "x" string values are not integers), but I suspect there is a way to parse this quite elegantly, so that tokenizing the string A works more like strtok(A,",\n\t") would when called iteratively in C.
To be clear about what I am asking: I am looking for an elegant/efficient/typical way in Python to have all of the following example strings:
A='1,,2,3,\n,4,\n'
A='1,2,3,4'
A=',1,2,3,4,\t\n'
A='\n\t,1,2,3,,4\n'
return with the same list of:
B=[1,2,3,4]
via some sort of compact expression.
How about this:
A = '1, 2,,3,4 '
B = [int(x) for x in A.split(',') if x.strip()]
x.strip() trims whitespace from the string, which will make it empty if the string is all whitespace. An empty string is "false" in a boolean context, so it's filtered by the if part of the list comprehension.
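As a quick check, the comprehension can be wrapped in a small helper and run against the four strings from the question. The name parse_ints is made up for this sketch. It relies on int() itself ignoring surrounding whitespace, so a piece such as '4\n' needs no extra strip before conversion.

```python
def parse_ints(text, sep=","):
    # Hypothetical helper: skip pieces that are empty or whitespace-only,
    # then let int() convert the rest (int() tolerates surrounding whitespace).
    return [int(piece) for piece in text.split(sep) if piece.strip()]


samples = ['1,,2,3,\n,4,\n', '1,2,3,4', ',1,2,3,4,\t\n', '\n\t,1,2,3,,4\n']
for sample in samples:
    print(parse_ints(sample))  # [1, 2, 3, 4] each time
```

It still raises ValueError on a piece that is neither blank nor an integer, which may or may not be what you want.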
Generally, I try to avoid regular expressions, but if you want to split on a bunch of different things, they work. Try this:
import re
result = [int(x) for x in filter(None, re.split(r'[,\n\t]', A))]
Mmm, functional goodness (with a bit of generator expression thrown in):
a = "1,2,,3,4,"
print(list(map(int, filter(None, (i.strip() for i in a.split(','))))))
For full functional joy:
a = "1,2,,3,4,"
# str.strip replaces string.strip, which no longer exists in Python 3
print(list(map(int, filter(None, map(str.strip, a.split(','))))))
For the sake of completeness, I will answer this seven year old question:
The C program that uses strtok:
#include <stdio.h>
#include <string.h>

int main()
{
    char myLine[]="This is;a-line,with pieces";
    char *p;
    for(p=strtok(myLine, " ;-,"); p != NULL; p=strtok(NULL, " ;-,"))
    {
        printf("piece=%s\n", p);
    }
}
can be accomplished in python with re.split as:
import re
myLine="This is;a-line,with pieces"
for p in re.split(r"[ ;\-,]", myLine):
    print("piece="+p)
This will work, and never raise an exception, if all the numbers are ints. The isdigit() call is false if there's a decimal point in the string.
>>> nums = ['1,,2,3,\n,4\n', '1,2,3,4', ',1,2,3,4,\t\n', '\n\t,1,2,3,,4\n']
>>> for n in nums:
...     [int(i.strip()) for i in n.split(',') if i.strip().isdigit()]
...
[1, 2, 3, 4]
[1, 2, 3, 4]
[1, 2, 3, 4]
[1, 2, 3, 4]
How about this?
>>> a = "1,2,,3,4,"
>>> list(map(int, filter(None, a.split(","))))
[1, 2, 3, 4]
filter will remove all false values (i.e. empty strings), which are then mapped to int.
EDIT: Just tested this against the above posted versions, and it seems to be significantly faster, 15% or so compared to the strip() one and more than twice as fast as the isdigit() one
Why accept inferior substitutes that cannot segfault your interpreter? With ctypes you can just call the real thing! :-)
# strtok in Python
from ctypes import c_char_p, cdll, create_string_buffer
try:
    libc = cdll.LoadLibrary('libc.so.6')
except OSError:
    libc = cdll.LoadLibrary('msvcrt.dll')
libc.strtok.restype = c_char_p
dat = create_string_buffer(b"1,,2,3,4")  # strtok writes into its input, so it needs a mutable buffer
sep = c_char_p(b",\n\t")
result = [libc.strtok(dat, sep)] + list(iter(lambda: libc.strtok(None, sep), None))
print(result)
Why not just wrap in a try except block which catches anything not an integer?
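A minimal sketch of that idea, assuming you want to silently skip anything that is not an integer (the function name is made up for illustration):

```python
def parse_ints_lenient(text, sep=","):
    # Hypothetical helper: try to convert every piece and skip the ones
    # int() rejects (empty strings, whitespace, non-numeric text).
    result = []
    for piece in text.split(sep):
        try:
            result.append(int(piece))
        except ValueError:
            continue
    return result


print(parse_ints_lenient('1,,2,3,\n,4,\n'))  # [1, 2, 3, 4]
print(parse_ints_lenient('1,x,2.5,3'))       # [1, 3]
```

The trade-off is that genuinely malformed input such as 'x' or '2.5' is dropped without any warning.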
I was desperately in need of a strtok equivalent in Python, so I wrote a simple one of my own:
def strtok(val,delim):
    token_list=[]
    token_list.append(val)
    for key in delim:
        nList=[]
        for token in token_list:
            subTokens = [ x for x in token.split(key) if x.strip()]
            nList= nList + subTokens
        token_list = nList
    return token_list
I'd guess regular expressions are the way to go: http://docs.python.org/library/re.html
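For what it's worth, a regex-based sketch could look like the following. The helper name strtok_like is invented here, and it assumes delims is a non-empty string of single-character delimiters, as with C's strtok. re.escape keeps characters such as "-" or "]" from being misread inside the character class.

```python
import re


def strtok_like(text, delims):
    # Hypothetical helper: treat every character in `delims` as a separator,
    # collapse runs of separators, and drop empty leading/trailing tokens.
    pattern = "[" + re.escape(delims) + "]+"
    return [token for token in re.split(pattern, text) if token]


print(strtok_like("This is;a-line,with pieces", " ;-,"))
# ['This', 'is', 'a', 'line', 'with', 'pieces']
print([int(t) for t in strtok_like("1,,2,3,\n,4,\n", ",\n\t")])
# [1, 2, 3, 4]
```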
I have a function that takes expressions and replaces the variables with every combination of the values I am using as inputs. The code below is tested and works. However, after reading through SO, I have seen people say that nested for loops are a bad idea, and I am unsure how to make this more efficient. Could somebody help? Thanks.
def replaceVar(expression):
    eval_list = list()
    a = [1, 8, 12, 13]
    b = [1, 2, 3, 4]
    c = [5, 9, 2, 7]
    for i in expression:
        first_eval = [i.replace("a", str(j)) for j in a]
        tmp = list()
        for k in first_eval:
            snd_eval = [k.replace("b", str(l)) for l in b]
            tmp2 = list()
            for m in snd_eval:
                trd_eval = [m.replace("c", str(n)) for n in c]
                tmp2.append(trd_eval)
            tmp.append(tmp2)
        eval_list.append(tmp)
    print(eval_list)
    return eval_list
print(replaceVar(['b-16+(c-(a+11))', 'a-(c-5)+a-b-10']))
Foreword
Nested loops are not a bad thing per se. They are only bad if they are used for problems for which a better algorithm has been found (better and worse in terms of efficiency relative to the input size). Sorting a list of integers, for example, is such a problem.
Analyzing the Problem
The size
In your case above you have three lists, all of size 4. This makes 4 * 4 * 4 = 64 possible combinations of them, if a comes always before b and b before c. So you need at least 64 iterations!
Your approach
In your approach we have 4 iterations for each possible value of a, 4 iterations for each possible value of b and the same for c. So we have 4 * 4 * 4 = 64 iterations in total. So in fact your solution is quite good!
As there is no faster way of listing all combinations, your way is also the best one.
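The count is easy to confirm. The short sketch below uses the three lists from the question (the snake_case variable names are mine) and simply counts the combinations:

```python
import itertools

values_of_a = [1, 8, 12, 13]
values_of_b = [1, 2, 3, 4]
values_of_c = [5, 9, 2, 7]

# Every (a, b, c) triple, with a varying slowest and c fastest.
combos = list(itertools.product(values_of_a, values_of_b, values_of_c))

print(len(combos))  # 64, i.e. 4 * 4 * 4
print(combos[0])    # (1, 1, 5)
```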
The style
Regarding style, you can improve your code with better variable names and by combining some of the for loops, e.g. like this:
def replaceVar(expressions):
    """
    Takes a list of expressions and returns a list of expressions with
    evaluated variables.
    """
    evaluatedExpressions = list()
    valuesOfA = [1, 8, 12, 13]
    valuesOfB = [1, 2, 3, 4]
    valuesOfC = [5, 9, 2, 7]
    for expression in expressions:
        for valueOfA in valuesOfA:
            for valueOfB in valuesOfB:
                for valueOfC in valuesOfC:
                    newExpression = expression.\
                        replace('a', str(valueOfA)).\
                        replace('b', str(valueOfB)).\
                        replace('c', str(valueOfC))
                    evaluatedExpressions.append(newExpression)
    print(evaluatedExpressions)
    return evaluatedExpressions
print(replaceVar(['b-16+(c-(a+11))', 'a-(c-5)+a-b-10']))
Notice, however, that the number of iterations remains the same!
Itertools
As Kevin noticed, you could also use itertools to generate the cartesian product. Internally it will do the same as what you did with the combined for loops:
import itertools
def replaceVar(expressions):
    """
    Takes a list of expressions and returns a list of expressions with
    evaluated variables.
    """
    evaluatedExpressions = list()
    valuesOfA = [1, 8, 12, 13]
    valuesOfB = [1, 2, 3, 4]
    valuesOfC = [5, 9, 2, 7]
    for expression in expressions:
        for values in itertools.product(valuesOfA, valuesOfB, valuesOfC):
            valueOfA = values[0]
            valueOfB = values[1]
            valueOfC = values[2]
            newExpression = expression.\
                replace('a', str(valueOfA)).\
                replace('b', str(valueOfB)).\
                replace('c', str(valueOfC))
            evaluatedExpressions.append(newExpression)
    print(evaluatedExpressions)
    return evaluatedExpressions
print(replaceVar(['b-16+(c-(a+11))', 'a-(c-5)+a-b-10']))
Here are some ideas:
Since your lists a, b and c are hardcoded, hardcode them as strings, so you don't have to convert every element to a string at each step.
Use list comprehensions; they are a little faster than a normal for loop with append.
Instead of .replace, use .format; it does all the replacements for you in a single step.
Use itertools.product to combine a, b and c.
With all that, I arrive at this:
import itertools
def replaceVar(expression):
    a = ['1', '8', '12', '13']
    b = ['1', '2', '3', '4']
    c = ['5', '9', '2', '7']
    # prepare the expressions so they can be used with format
    expression = [exp.replace('a','{0}').replace('b','{1}').replace('c','{2}')
                  for exp in expression]
    return [exp.format(*arg) for exp in expression for arg in itertools.product(a,b,c)]
The speed gain is marginal, but it is something: on my machine it goes from 148 milliseconds to 125.
The functionality is the same as in R.Q.'s version.
"The problem" with nested loops is basically just that the number of levels is hard coded. You wrote nesting for 3 variables. What if you only have 2? What if it jumps to 5? Then you need non-trivial surgery on the code. That's why itertools.product() is recommended.
Relatedly, all suggestions so far hard-code the number of replace() calls. Same "problem": if you don't have exactly 3 variables, the replacement code has to be modified.
Instead of doing that, think about a cleaner way to do the replacements. For example, suppose your input string were:
s = '{b}-16+({c}-({a}+11))'
instead of:
'b-16+(c-(a+11))'
That is, the variables to be replaced are enclosed in curly braces. Then Python can do all the substitutions "at once" for you:
>>> s.format(a=333, b=444, c=555)
'444-16+(555-(333+11))'
That hard-codes the names and number of names too, but the same thing can be accomplished with a dict:
>>> d = dict(zip(["a", "b", "c"], (333, 444, 555)))
>>> s.format(**d)
'444-16+(555-(333+11))'
Now nothing about the number of variables, or their names, is hard-coded in the format() call.
The tuple of values ((333, 444, 555)) is exactly the kind of thing itertools.product() returns. The list of variable names (["a", "b", "c"]) can be created just once at the top, or even passed in to the function.
You just need a bit of code to transform your input expressions to enclose the variable names in curly braces.
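One possible sketch of that transformation step, under the assumptions that variable names are whole words (so a regex word boundary finds them) and that the expressions contain no literal curly braces. The function name expand and its values argument are made up for this illustration:

```python
import itertools
import re


def expand(expression, values):
    # `values` maps each variable name to the list of values to substitute.
    names = list(values)
    # Wrap each whole-word variable name in braces: 'a+b' -> '{a}+{b}'.
    pattern = r"\b(" + "|".join(re.escape(name) for name in names) + r")\b"
    template = re.sub(pattern, r"{\1}", expression)
    # One formatted string per combination produced by itertools.product.
    return [
        template.format(**dict(zip(names, combo)))
        for combo in itertools.product(*(values[name] for name in names))
    ]


print(expand('a+b', {'a': [1, 2], 'b': [3, 4]}))
# ['1+3', '1+4', '2+3', '2+4']
```

Nothing in expand depends on how many variables there are or what they are called.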
So, your current structure addresses one of the inefficiencies that the solutions with itertools.product will not address. Your code is saving the intermediately substituted expressions and reusing them, rather than redoing these substitutions with each itertools.product tuple. This is good and I think your current code is efficient.
However, it is brittle and only works when substituting in exactly three variables. A dynamic programming approach can solve this issue. To do so, I'm going to slightly alter the input parameters. The function will use two inputs:
expressions - The expressions to be substituted into
replacement_map - A dictionary which provides the values to substitute for each variable
The dynamic programming function is given below:
def replace_variable(expressions, replacement_map):
    return [list(_replace_variable([e], replacement_map)) for e in expressions]
def _replace_variable(expressions, replacement_map):
    if not replacement_map:
        for e in expressions:
            yield e
    else:
        map_copy = replacement_map.copy()
        key, value_list = map_copy.popitem()
        for value in value_list:
            substituted = [e.replace(key, value) for e in expressions]
            for e in _replace_variable(substituted, map_copy):
                yield e
With the example usage:
expressions = ['a+b', 'a-b']
replacement_map = {
'a': ['1', '2'],
'b': ['3', '4']
}
print replace_variable(expressions, replacement_map)
# [['1+3', '1+4', '2+3', '2+4'], ['1-3', '1-4', '2-3', '2-4']]
Note that if you're using Python 3.X, you can use the yield from construct instead of looping over the results of the recursive call in _replace_variable. The function would then look like:
def _replace_variable(expressions, replacement_map):
    if not replacement_map:
        yield from expressions
    else:
        map_copy = replacement_map.copy()
        key, value_list = map_copy.popitem()
        for value in value_list:
            substituted = [e.replace(key, value) for e in expressions]
            yield from _replace_variable(substituted, map_copy)
I would like to convert the following string:
s = '1|2|a|b'
to
[1, 2, 'a', 'b']
Is it possible to do the conversion in one line?
Is it possible to do the conversion in one line?
YES, It is possible. But how?
Algorithm for the approach
Split the string into its constituent parts using str.split. The output of this is
>>> s = '1|2|a|b'
>>> s.split('|')
['1', '2', 'a', 'b']
Now we have got half the problem. Next we need to loop through the split string and then check if each of them is a string or an int. For this we use
A list comprehension, which is for the looping part
str.isdigit for finding if the element is an int or a str.
The list comprehension can be easily written as [i for i in s.split('|')]. But how do we add an if clause there? This is covered in One-line list comprehension: if-else variants. Now that we know which elements are ints and which are not, we can call the builtin int on the ones that are.
Hence the final code will look like
[int(i) if i.isdigit() else i for i in s.split('|')]
Now for a small demo,
>>> s = '1|2|a|b'
>>> [int(i) if i.isdigit() else i for i in s.split('|')]
[1, 2, 'a', 'b']
As we can see, the output is as expected.
Note that this approach is not suitable if there are many types to be converted.
You cannot do it for negative numbers or lots of mixed types in one line but you could use a function that would work for multiple types using ast.literal_eval:
from ast import literal_eval
def f(s, delim):
    for ele in s.split(delim):
        try:
            yield literal_eval(ele)
        except (ValueError, SyntaxError):
            # not a Python literal, so keep it as a plain string
            yield ele
s = '1|-2|a|b|3.4'
print(list(f(s,"|")))
[1, -2, 'a', 'b', 3.4]
Another way is to use the built-in map function:
>>> s='1|2|a|b'
>>> l = map(lambda x: int(x) if x.isdigit() else x, s.split('|'))
>>> l
[1, 2, 'a', 'b']
If Python3, then:
>>> s='1|2|a|b'
>>> l = list(map(lambda x: int(x) if x.isdigit() else x, s.split('|')))
>>> l
[1, 2, 'a', 'b']
Since map in Python 3 returns an iterator rather than a list, you must convert it with list().
It is possible to do arbitrarily many or complex conversions "in a single line" if you're allowed a helper function. Python does not natively have a "convert this string to the type that it should represent" function, because what it "should" represent is vague and may change from application to application.
import json
def convert(input):
    converters = [int, float, json.loads]
    for converter in converters:
        try:
            return converter(input)
        except (TypeError, ValueError):
            pass
    # here we assume if all converters failed, it's just a string
    return input
s = "1|2.3|a|[4,5]"
result = [convert(x) for x in s.split("|")]
If you have all kinds of data types(more than str and int), I believe this does the job.
s = '1|2|a|b|[1, 2, 3]|(1, 2, 3)'
print [eval(x) if not x.isalpha() else x for x in s.split("|")]
# [1, 2, 'a', 'b', [1, 2, 3], (1, 2, 3)]
This fails if there exists elements such as "b1"
Let's say I have a string
str1 = "TN 81 NZ 0025"
two = first2(str1)
print(two) # -> TN
How do I get the first two letters of this string? I need the first2 function for this.
It is as simple as string[:2]. A function can be easily written to do it, if you need.
Even this, is as simple as
def first2(s):
    return s[:2]
In general, you can get the characters of a string from i until j with string[i:j].
string[:2] is shorthand for string[0:2]. This works for lists as well.
Learn about Python's slice notation at the official tutorial
t = "your string"
Play with the first N characters of a string with
def firstN(s, n=2):
    return s[:n]
which is by default equivalent to
t[:2]
Here's what the simple function would look like:
def firstTwo(string):
    return string[:2]
In Python, strings behave like lists of characters, but they are not actually of type list; they are only list-like (i.e. they can be treated like a list). More formally, they are known as sequences (see http://docs.python.org/2/library/stdtypes.html#sequence-types-str-unicode-list-tuple-bytearray-buffer-xrange):
>>> a = 'foo bar'
>>> isinstance(a, list)
False
>>> isinstance(a, str)
True
Since strings are sequences, you can use slicing to access parts of them, written sequence[start_index:end_index]; see Explain Python's slice notation. For example:
>>> a = [1,2,3,4]
>>> a[0]
1 # first element, NOT a sequence.
>>> a[0:1]
[1] # a slice from first to second, a list, i.e. a sequence.
>>> a[0:2]
[1, 2]
>>> a[:2]
[1, 2]
>>> x = "foo bar"
>>> x[0:2]
'fo'
>>> x[:2]
'fo'
When a bound is omitted, slice notation takes the start position as 0 and the end position as len(sequence).
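These defaults are easy to check for yourself. The snippet below is only an illustration, reusing the sample string from the question:

```python
s = "TN 81 NZ 0025"

print(s[:2] == s[0:2])       # True: an omitted start means 0
print(s[3:] == s[3:len(s)])  # True: an omitted end means len(s)
print(s[:] == s)             # True: both omitted gives the whole string
print("T"[:2])               # 'T': a slice past the end is simply shorter
```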
In the olden C days a string was an array of characters; the whole issue of dynamic vs. static lists sounds like legend now. See Python List vs. Array - when to use?
Note that none of the previous examples raises an exception if your string is too short: slicing simply returns whatever characters are there.
Another approach is to use
'yourstring'.ljust(100)[:100].strip().
This will give you the first 100 chars.
You might get a shorter string if the last characters of your string are spaces.
For completeness: Instead of using def you could give a name to a lambda function:
first2 = lambda s: s[:2]
Used a loop to add a bunch of elements to a list with
mylist = []
for x in otherlist:
    mylist.append(x[0:5])
But instead of the expected result ['x1','x2',...], I got: [u'x1', u'x2',...]. Where did the u's come from and why? Also is there a better way to loop through the other list, inserting the first six characters of each element into a new list?
The u means unicode, you probably will not need to worry about it
mylist.extend(x[:5] for x in otherlist)
The u means unicode. It's Python's internal string representation (from version ... ?).
Most times you don't need to worry about it. (Until you do.)
The answers above already covered the "u" part: the string is a Unicode string. As for whether there's a better way to extract the first five characters (which is what x[0:5] returns) from the items in a list:
>>> a = ["abcdefgh", "012345678"]
>>> b = map(lambda n: n[0:5], a)
>>> for x in b:
...     print(x)
...
abcde
01234
So, map applies a function (lambda n: n[0:5]) to each element of a and returns a new list with the results of the function for every element. More precisely, in Python 3, it returns an iterator, so the function gets called only as many times as needed (i.e. if your list has 5000 items, but you only pull 10 from the result b, lambda n: n[0:5] gets called only 10 times). In Python2, you need to use itertools.imap instead.
>>> a = [1, 2, 3]
>>> def plusone(x):
...     print("called with {}".format(x))
...     return x + 1
...
>>> b = map(plusone, a)
>>> print("first item: {}".format(next(b)))
called with 1
first item: 2
Of course, you can apply the function "eagerly" to every element by calling list(b), which will give you a normal list with the function applied to each element on creation.
>>> b = map(plusone, a)
>>> list(b)
called with 1
called with 2
called with 3
[2, 3, 4]
Given a list
a = range(10)
You can slice it using statements such as
a[1]
a[2:4]
However, I want to do this based on a variable set elsewhere in the code. I can easily do this for the first one
i = 1
a[i]
But how do I do this for the other one? I've tried indexing with a list:
i = [2, 3, 4]
a[i]
But that doesn't work. I've also tried using a string:
i = "2:4"
a[i]
But that doesn't work either.
Is this possible?
that's what slice() is for:
a = range(10)
s = slice(2,4)
print a[s]
That's the same as using a[2:4].
Why does it have to be a single variable? Just use two variables:
i, j = 2, 4
a[i:j]
If it really needs to be a single variable you could use a tuple.
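For example, a tuple of bounds can be unpacked into slice() at the point of use. This is just a sketch, and the variable names are arbitrary:

```python
a = list(range(10))

bounds = (2, 4)           # start, stop
print(a[slice(*bounds)])  # [2, 3], same as a[2:4]

stepped = (None, None, 3)  # None stands for an omitted bound, as in a[::3]
print(a[slice(*stepped)])  # [0, 3, 6, 9]
```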
With the assignments below you are still using the same type of slicing operations you show, but now with variables for the values.
a = range(10)
i = 2
j = 4
then
print a[i:j]
[2, 3]
>>> a=range(10)
>>> i=[2,3,4]
>>> a[i[0]:i[-1]]
range(2, 4)
>>> list(a[i[0]:i[-1]])
[2, 3]
I ran across this recently, while looking up how to have the user mimic the usual slice syntax of a:b:c, ::c, etc. via arguments passed on the command line.
The argument is read as a string, and I'd rather not split on ':', pass that to slice(), etc. Besides, if the user passes a single integer i, the intended meaning is clearly a[i]. Nevertheless, slice(i) will default to slice(None,i,None), which isn't the desired result.
In any case, the most straightforward solution I could come up with was to read in the string as a variable st say, and then recover the desired list slice as eval(f"a[{st}]").
This uses the eval() builtin and an f-string where st is interpolated inside the braces. It handles precisely the usual colon-separated slicing syntax, since it just plugs in that colon-containing string as-is.
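If running eval() on user input is a concern, the split-on-':' route that this answer sets aside can still be kept fairly short. The sketch below is an alternative to the eval approach, not a restatement of it. The name parse_index is made up, and it assumes only integers and empty fields appear between the colons.

```python
def parse_index(spec):
    # Hypothetical helper: "5" -> 5, "2:4" -> slice(2, 4), "::3" -> slice(None, None, 3).
    if ":" not in spec:
        return int(spec)
    # An empty field means "bound omitted", which slice() spells as None.
    parts = [int(part) if part.strip() else None for part in spec.split(":")]
    return slice(*parts)


a = list(range(10))
print(a[parse_index("2:4")])  # [2, 3]
print(a[parse_index("::3")])  # [0, 3, 6, 9]
print(a[parse_index("5")])    # 5
```

A spec with more than two colons makes slice() raise TypeError, and non-numeric fields raise ValueError, so bad input fails loudly instead of being executed.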