Split string to various data types

I would like to convert the following string:
s = '1|2|a|b'
to
[1, 2, 'a', 'b']
Is it possible to do the conversion in one line?

Is it possible to do the conversion in one line?
Yes, it is possible. Here is how.
Algorithm for the approach
Split the string into its constituent parts using str.split. The output of this is
>>> s = '1|2|a|b'
>>> s.split('|')
['1', '2', 'a', 'b']
That solves half of the problem. Next we need to loop over the split parts and check whether each one represents an int or should stay a str. For this we use
A list comprehension, for the looping part
str.isdigit, to decide whether an element is an int or a str.
The plain list comprehension is [i for i in s.split('|')]. But how do we add an if clause there? This is covered in One-line list comprehension: if-else variants. Once we know which elements are digit strings, we can call the builtin int on them.
Hence the final code will look like
[int(i) if i.isdigit() else i for i in s.split('|')]
Now for a small demo,
>>> s = '1|2|a|b'
>>> [int(i) if i.isdigit() else i for i in s.split('|')]
[1, 2, 'a', 'b']
As we can see, the output is as expected.
Note that this approach is not suitable if there are many types to be converted.
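As a small illustration of that limitation (a sketch with a made-up input, not part of the original answer): str.isdigit returns False for a leading minus sign or a decimal point, so such tokens are left as strings.

```python
# Hypothetical input containing a negative number and a float.
s = '1|-2|a|3.4'

# Same comprehension as above: only pure digit strings are converted.
parts = [int(p) if p.isdigit() else p for p in s.split('|')]
print(parts)  # [1, '-2', 'a', '3.4']
```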

You cannot do it for negative numbers or lots of mixed types in one line, but you could use a function that works for multiple types using ast.literal_eval:
from ast import literal_eval
def f(s, delim):
    for ele in s.split(delim):
        try:
            yield literal_eval(ele)
        except (ValueError, SyntaxError):
            # not a Python literal (e.g. 'a' or 'a b'): keep the raw string
            yield ele
s = '1|-2|a|b|3.4'
print(list(f(s,"|")))
[1, -2, 'a', 'b', 3.4]

Another way is to use the built-in map function:
>>> s='1|2|a|b'
>>> l = map(lambda x: int(x) if x.isdigit() else x, s.split('|'))
>>> l
[1, 2, 'a', 'b']
In Python 3:
>>> s='1|2|a|b'
>>> l = list(map(lambda x: int(x) if x.isdigit() else x, s.split('|')))
>>> l
[1, 2, 'a', 'b']
In Python 3, map returns a lazy iterator rather than a list, so you must convert it with list().

It is possible to do arbitrarily many or complex conversions "in a single line" if you're allowed a helper function. Python does not natively have a "convert this string to the type that it should represent" function, because what it "should" represent is vague and may change from application to application.
import json
def convert(value):
    # converters are tried in order; the first one that succeeds wins
    converters = [int, float, json.loads]
    for converter in converters:
        try:
            return converter(value)
        except (TypeError, ValueError):
            pass
    # here we assume if all converters failed, it's just a string
    return value
s = "1|2.3|a|[4,5]"
result = [convert(x) for x in s.split("|")]
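For reference, here is a self-contained sketch of the same idea that prints the result and shows why the converter order matters. The converters keyword parameter is an addition made for this illustration, not part of the answer's code.

```python
import json

def convert(text, converters=(int, float, json.loads)):
    """Try each converter in order; fall back to the original string."""
    for converter in converters:
        try:
            return converter(text)
        except (TypeError, ValueError):
            # json.JSONDecodeError is a subclass of ValueError
            pass
    return text

result = [convert(x) for x in "1|2.3|a|[4,5]".split("|")]
print(result)  # [1, 2.3, 'a', [4, 5]]

# Order matters: trying float before int turns '1' into 1.0.
print(convert("1", converters=(float, int)))  # 1.0
```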

If you have all kinds of data types (more than str and int), I believe this does the job.
s = '1|2|a|b|[1, 2, 3]|(1, 2, 3)'
print([eval(x) if not x.isalpha() else x for x in s.split("|")])
# [1, 2, 'a', 'b', [1, 2, 3], (1, 2, 3)]
This fails for elements such as "b1": they are not purely alphabetic, so they are passed to eval, which raises a NameError.
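To illustrate the "b1" case and one possible way around it, here is a sketch that swaps eval for ast.literal_eval wrapped in try/except. This is a substitution, not the answer's original method, and the helper name parse_token is made up for the example. literal_eval only accepts Python literals, so it also avoids evaluating arbitrary expressions.

```python
from ast import literal_eval

def parse_token(token):
    """Return the literal value of token, or token itself if it is not a literal."""
    try:
        return literal_eval(token)
    except (ValueError, SyntaxError):
        return token

s = '1|2|a|b1|[1, 2, 3]|(1, 2, 3)'
parsed = [parse_token(x) for x in s.split('|')]
print(parsed)  # [1, 2, 'a', 'b1', [1, 2, 3], (1, 2, 3)]
```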

Related

Compressing function within comprehension

I am taking l=['1','2','3','rt4','rt5'] as input and I am converting it into l=[1,2,3,'rt4','rt5'] with the following code:
def RepresentsInt(s):
    try:
        int(s)
        return True
    except ValueError:
        return False
l=['1','2','3','rt4','rt5']
l=[int(l[i]) if RepresentsInt(l[i]) else l[i] for i in range(0,len(l))]
Can I improve above code using a comprehension?

You could change your RepresentsInt function to actually return the integer (if possible) which would make this much easier:
def RepresentsInt(s):
    try:
        return int(s)
    except ValueError:
        return s
Then the code to transform the list could be written as (using a for item in l loop is probably better than iterating over the indices):
>>> l = ['1','2','3','rt4','rt5']
>>> [RepresentsInt(item) for item in l]
[1, 2, 3, 'rt4', 'rt5']
Or, if you want a reusable pattern, you still need a helper function (I chose a decorator-like approach here) because you can't use try/except inside a comprehension:
def try_to_apply_func(func, exception):
    def newfunc(value):
        try:
            return func(value)
        except exception:
            return value
    return newfunc
>>> to_int_if_possible = try_to_apply_func(int, ValueError)
>>> [to_int_if_possible(item) for item in l]
[1, 2, 3, 'rt4', 'rt5']
>>> to_float_if_possible = try_to_apply_func(float, ValueError)
>>> [to_float_if_possible(item) for item in l]
[1.0, 2.0, 3.0, 'rt4', 'rt5']

It's really unclear what you want, but maybe something like :
>>> l=['1','2','3','rt4','rt5']
>>> l=[int(i) if i.isdigit() else i for i in l]
>>> l
[1, 2, 3, 'rt4', 'rt5']

You can use the following code to get your desired result.
l = ['1','2','3','rt4','rt5']
l = [int(each) if each.isdigit() else each for each in l]
print(l)

I don't believe (though I could ultimately be wrong) that there is a cleaner and neater way to achieve this. One solution would be to create an integer parsing lambda expression, such as the following, but I think your current solution is much neater and more robust.
>>> l = ['1','2','3','rt4','rt5']
>>> l = list(map(lambda s : (s.isdigit() or (s[0] == '-' and s[1:].isdigit())) and int(s) or s, l))
>>> l
[1, 2, 3, 'rt4', 'rt5']
This won't correctly catch strings such as '1.0' or ' 1', but it should just about do what you want in two lines.
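One more edge case is worth spelling out: the "cond and a or b" idiom falls through to b whenever a is falsy, and int('0') is 0. The sketch below (the function names are made up for illustration) compares it with a conditional expression, which does not have that problem. It also uses s[:1] instead of s[0] so that an empty string does not raise IndexError.

```python
def looks_like_int(s):
    # digits only, optionally preceded by a minus sign
    return s.isdigit() or (s[:1] == '-' and s[1:].isdigit())

def parse_and_or(s):
    # and/or idiom, as in the lambda above
    return looks_like_int(s) and int(s) or s

def parse_conditional(s):
    # conditional expression: the result does not depend on truthiness of int(s)
    return int(s) if looks_like_int(s) else s

print(parse_and_or('0'))        # '0' (still a string, because 0 is falsy)
print(parse_conditional('0'))   # 0
print(parse_conditional('-7'))  # -7
print(parse_conditional(''))    # '' (no IndexError)
```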

Get Python tuple in different format

I have a python tuple like so,
((1420455415000L, 2L), (1420545729000L, 3L), (1420653453000L, 2L))
I want to convert it into this format:
[[1420455415000, 2], [1420545729000, 3], [1420653453000, 2]]
Please note that I also want to remove the 'L' that is automatically removed when I convert this tuple to dict. I have converted the tuple of tuples to list using :
def listit(t):
    return list(map(listit, t)) if isinstance(t, (list, tuple)) else t
but the L still remains. That is a problem because I am sending the data to JavaScript.
How can I do this?

If you're passing the data to JavaScript, you can do this trivially with the json (JavaScript Object Notation) module:
>>> import json
>>> json.dumps(((1420455415000L, 2L), (1420545729000L, 3L), (1420653453000L, 2L)))
'[[1420455415000, 2], [1420545729000, 3], [1420653453000, 2]]'
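For readers on Python 3, where the separate long type and the L suffix no longer exist, a rough equivalent of the session above looks like this. It is a sketch using the same sample values, and it also shows that the JSON text round-trips to nested lists.

```python
import json

# Same sample data as in the question, written as plain Python 3 ints.
data = ((1420455415000, 2), (1420545729000, 3), (1420653453000, 2))

payload = json.dumps(data)   # tuples are serialized as JSON arrays
print(payload)               # [[1420455415000, 2], [1420545729000, 3], [1420653453000, 2]]

restored = json.loads(payload)
print(restored[0])           # [1420455415000, 2]
```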

To get the output in your question you could use
t = ((1420455415000L, 2L), (1420545729000L, 3L), (1420653453000L, 2L))
l = [map(int,x) for x in t]
The conversion from long to int would only work if the value was less than or equal to sys.maxint. Otherwise it will stay as a long. The conversion is not necessary though as the L is only really denoting the type and not the value.
If you are passing it to JavaScript, the conversion to JSON makes more sense.

The 'L' merely indicates the value's type, in this case a long integer. Whatever way you send the data, it will behave as an int.
That said, if you really don't want to see that 'L', you need to convert the value to a plain integer with int().

L denotes that the number is of type long. If you are 100% sure that the number is below the limit that int can handle (otherwise the result of int() stays a long, which happens for very large numbers), you can simply convert it with int(num). But please note that the L is just part of the internal representation: it does not show up when the number is converted to a string (or printed, for which it is internally converted to a string). It only shows up when using repr().
Example -
>>> i = 2L
>>> i
2L
>>> int(i)
2
>>> print i
2
>>> str(i)
'2'
>>> i
2L
In your case, to convert longs to int inside a list use -
>>> l = [1L , 2L , 3L]
>>> print l
[1L, 2L, 3L]
>>> l = map(int, l)
>>> l
[1, 2, 3]
>>> print l
[1, 2, 3]
If it's possible that the lists have sublists, use a recursive function such as -
def convertlist(l):
    if isinstance(l, (list, tuple)):
        return list(map(convertlist, l))
    elif isinstance(l, long):
        return int(l)
    else:
        return l
>>> l = [1L , 2L , [3L]]
>>> convertlist(l)
[1, 2, [3]]

Python: list.sort() query when list contains different element types

Greetings Pythonic world. Day 4 of learning Python 3.3 and I've come across a strange property of list.sort.
I created a list of five elements: four strings, with a number in the middle. Trying to get list.sort to work gave the expected error because of mixing types:
>>> list = ['b', 'a', 3, 'd', 'c']
>>> list.sort()
Traceback (innermost last):
File "<stdin>", line 1, in <module>
TypeError: unorderable types: int() < str()
>>> list
['b', 'a', 3, 'd', 'c']
The list is unchanged.
But then I moved the number to the end, used list.sort again, and got this:
>>> list = ['b', 'a', 'd', 'c', 3]
>>> list.sort()
Traceback (innermost last):
File "<stdin>", line 1, in <module>
TypeError: unorderable types: int() < str()
>>> list
['a', 'b', 'c', 'd', 3]
OK, an error. But the list has sorted itself, kicking the number to the end. I couldn't find any explanation for this on this site or in Langtangen. Is there some underlying reason for this behaviour? Would it be useful in some situation?

From the Python 3 docs:
This method sorts the list in place, using only < comparisons between
items. Exceptions are not suppressed - if any comparison operations
fail, the entire sort operation will fail (and the list will likely be
left in a partially modified state).
The docs don't guarantee any behavior in particular, but the elements will more than likely be left part-way sorted, in whatever order they were in when the exception occurred. This order can vary between implementations, or possibly (but unlikely) between two subsequent runs of the program.
If you want to try to sort the items without worrying about an unfortunate re-ordering, you can use the sorted builtin function, which will return a new list rather than modify the original.
>>> seq = ['b', 'a', 3, 'd', 'c']
>>> try:
...     seq = sorted(seq)  # if sorted fails, result won't be assigned
... except Exception:  # you may only want TypeError
...     pass
...
>>> seq
['b', 'a', 3, 'd', 'c'] # list unmodified
EDIT: To address everyone saying something like "once it sees two different types it raises an exception":
I know you are probably aware that this kind of statement is an oversimplification, but I think without being clear, it's going to cause confusion. As an obvious example, you could sort a list with a mix of int and float.
The following example consists of two classes A and B which support comparison with each other through their respective __lt__ methods. It shows a list mixed of these two types sorted with list.sort() and then printed in sorted order with no exceptions raised:
class A:
    def __init__(self, value):
        self.a = value
    def __lt__(self, other):
        if isinstance(other, B):
            return self.a < other.b
        else:
            return self.a < other.a
    def __repr__(self):
        return repr(self.a)
class B:
    def __init__(self, value):
        self.b = value
    def __lt__(self, other):
        if isinstance(other, A):
            return self.b < other.a
        else:
            return self.b < other.b
    def __repr__(self):
        return repr(self.b)
seq = [A(10), B(2), A(8), B(16), B(9)]
seq.sort()
print(seq)
The output of this is:
[2, 8, 9, 10, 16]
It's not vital that you understand every detail of this. It's just to illustrate that a list of mixed types can work with list.sort() if all the pieces are there.
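The int and float case mentioned earlier is the simplest built-in instance of this. A minimal sketch with made-up values:

```python
# int and float define < against each other, so a mixed list sorts fine.
mixed = [3, 1.5, 2, 0.25]
mixed.sort()
print(mixed)  # [0.25, 1.5, 2, 3]
```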

I am writing the answer below assuming that I know the data types in the list; it might not be efficient. My idea is to partition the given list into sublists based on data type, then sort each sublist and combine them.
items = ['b', 'a', 3, 'd', 'c']
strs = list(filter(lambda x: type(x) == str, items))
ints = list(filter(lambda x: type(x) == int, items))
output = sorted(strs) + sorted(ints)
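The same partition-then-sort idea can also be sketched as a single key function that groups by type name first, so values are only ever compared with values of the same type. This is an alternative formulation, not the answer's code. Note that it orders the groups alphabetically by type name, so ints come before strs here.

```python
items = ['b', 'a', 3, 'd', 'c', 1]

# Key is (type name, value): tuples compare the type name first and only
# compare the values themselves when the type names are equal.
output = sorted(items, key=lambda x: (type(x).__name__, x))
print(output)  # [1, 3, 'a', 'b', 'c', 'd']
```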

This is nothing uncommon. sort() simply does not check whether the list contains consistent data types; it just tries to sort. When the odd element is at the end, it is reached late, so the algorithm has already sorted part of the list before it hits the error.
And no, it is not useful, as it depends heavily on the implemented sort mechanism.

It depends on how the data needs to be sorted, but something like this can work (note that the result contains strings):
l = ['a',3,4,'b']
sorted([str(x) for x in l])
['3', '4', 'a', 'b']

I ran into the same problem recently and didn't want to cast everything to a string, so I did this. Hope it helps :)
items = ["a", 1, False, None, "b", (1,3), (1, 'a'),(1, [None, False]), True, 3, False]
# give each type a weight in order of first appearance
type_weights = {}
for element in items:
    if type(element) not in type_weights:
        type_weights[type(element)] = len(type_weights)
print(sorted(items, key=lambda element: (type_weights[type(element)], str(element))))
It should return something like this:
['a', 'b', 1, 3, False, False, True, None, (1, 'a'), (1, 3), (1, [None, False])]
It should work with any data type (including custom classes)

Why doesn't list have safe "get" method like dictionary?

Why doesn't list have a safe "get" method like dictionary?
>>> d = {'a':'b'}
>>> d['a']
'b'
>>> d['c']
KeyError: 'c'
>>> d.get('c', 'fail')
'fail'
>>> l = [1]
>>> l[10]
IndexError: list index out of range

Ultimately it probably doesn't have a safe .get method because a dict is an associative collection (values are associated with names) where it is inefficient to check if a key is present (and return its value) without throwing an exception, while it is trivial to avoid exceptions when accessing list elements (as the len function is very fast). The .get method allows you to query the value associated with a name, not directly access the 37th item in the dictionary (which would be more like what you're asking of your list).
Of course, you can easily implement this yourself:
def safe_list_get(l, idx, default):
    try:
        return l[idx]
    except IndexError:
        return default
You could even monkeypatch it onto the __builtins__.list constructor in __main__, but that would be a less pervasive change since most code doesn't use it. If you just wanted to use this with lists created by your own code you could simply subclass list and add the get method.
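A short usage sketch of such a helper follows. The default=None value for the last parameter is an addition for convenience, not part of the code above. Note that negative indices keep their usual list meaning and only fall back to the default when they are out of range too.

```python
def safe_list_get(lst, idx, default=None):
    """Return lst[idx], or default if idx is out of range."""
    try:
        return lst[idx]
    except IndexError:
        return default

print(safe_list_get([1], 10, 'fail'))        # fail
print(safe_list_get([1, 2, 3], 1))           # 2
print(safe_list_get([1, 2, 3], -1, 'fail'))  # 3 (negative indices still work)
print(safe_list_get([1, 2, 3], -4, 'fail'))  # fail
```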

This works if you want the first element, like my_list.get(0)
>>> my_list = [1,2,3]
>>> next(iter(my_list), 'fail')
1
>>> my_list = []
>>> next(iter(my_list), 'fail')
'fail'
I know it's not exactly what you asked for but it might help others.

Probably because it just didn't make much sense for list semantics. However, you can easily create your own by subclassing.
class safelist(list):
    def get(self, index, default=None):
        try:
            return self.__getitem__(index)
        except IndexError:
            return default
def _test():
    l = safelist(range(10))
    print(l.get(20, "oops"))
if __name__ == "__main__":
    _test()

Instead of .get, a conditional expression like this should be fine for lists. It is just a difference in usage.
>>> l = [1]
>>> l[10] if 10 < len(l) else 'fail'
'fail'

Credits to jose.angel.jimenez and Gus Bus.
For the "oneliner" fans…
If you want the first element of a list or if you want a default value if the list is empty try:
liste = ['a', 'b', 'c']
value = (liste[0:1] or ('default',))[0]
print(value)
returns a
and
liste = []
value = (liste[0:1] or ('default',))[0]
print(value)
returns default
Examples for other elements…
liste = ['a', 'b', 'c']
print(liste[0:1]) # returns ['a']
print(liste[1:2]) # returns ['b']
print(liste[2:3]) # returns ['c']
print(liste[3:4]) # returns []
With default fallback…
liste = ['a', 'b', 'c']
print((liste[0:1] or ('default',))[0]) # returns a
print((liste[1:2] or ('default',))[0]) # returns b
print((liste[2:3] or ('default',))[0]) # returns c
print((liste[3:4] or ('default',))[0]) # returns default
Possibly shorter:
liste = ['a', 'b', 'c']
value, = liste[:1] or ('default',)
print(value) # returns a
It looks like you need the comma before the equal sign, the equal sign and the latter parenthesis.
More general:
liste = ['a', 'b', 'c']
f = lambda l, x, d: l[x:x+1] and l[x] or d
print(f(liste, 0, 'default')) # returns a
print(f(liste, 1, 'default')) # returns b
print(f(liste, 2, 'default')) # returns c
print(f(liste, 3, 'default')) # returns default
Tested with Python 3.6.0 (v3.6.0:41df79263a11, Dec 22 2016, 17:23:13)
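One caveat with the "and ... or" form of the general one-liner: it returns the default whenever the stored element is itself falsy (0, '', None and so on). The sketch below, with made-up function names, contrasts it with a conditional expression that only checks whether the slice is empty.

```python
def get_and_or(l, x, d):
    # same logic as the lambda above
    return l[x:x+1] and l[x] or d

def get_conditional(l, x, d):
    # falls back to d only when index x does not exist
    return l[x] if l[x:x+1] else d

values = [0, '', 'c']
print(get_and_or(values, 0, 'default'))       # default (0 is falsy)
print(get_conditional(values, 0, 'default'))  # 0
print(get_conditional(values, 1, 'default'))  # prints an empty line
print(get_conditional(values, 3, 'default'))  # default
```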

Try this:
>>> i = 3
>>> a = [1, 2, 3, 4]
>>> next(iter(a[i:]), 'fail')
4
>>> next(iter(a[i + 1:]), 'fail')
'fail'

A reasonable thing you can do is to convert the list into a dict and then access it with the get method:
>>> my_list = ['a', 'b', 'c', 'd', 'e']
>>> my_dict = dict(enumerate(my_list))
>>> print(my_dict)
{0: 'a', 1: 'b', 2: 'c', 3: 'd', 4: 'e'}
>>> my_dict.get(2)
'c'
>>> my_dict.get(10, 'N/A')
'N/A'

So I did some more research into this, and it turns out there isn't anything specific for it. I got excited when I found list.index(value), which returns the index of a specified item, but there isn't anything for getting the value at a specific index with a fallback. So if you don't want to use the safe_list_get solution, which I think is pretty good, here are some one-line conditional expressions that can get the job done, depending on the scenario:
>>> x = [1, 2, 3]
>>> el = x[4] if len(x) > 4 else 'No'
>>> el
'No'
You can also use None instead of 'No', which makes more sense:
>>> x = [1, 2, 3]
>>> i = 2
>>> el_i = x[i] if len(x) > i else None
Also if you want to just get the first or last item in the list, this works
end_el = x[-1] if x else None
You can also make these into functions, but I still like the IndexError exception solution. I experimented with a dumbed-down version of the safe_list_get solution and made it a bit simpler (no default):
def list_get(l, i):
    try:
        return l[i]
    except IndexError:
        return None
Haven't benchmarked to see what is fastest.

Dictionaries are for look ups. It makes sense to ask if an entry exists or not. Lists are usually iterated. It isn't common to ask if L[10] exists but rather if the length of L is 11.

If you
want a one liner,
prefer not having try / except in your happy code path where you needn't, and
want the default value to be optional,
you can use this:
list_get = lambda l, x, d=None: d if not l[x:x+1] else l[x]
Usage looks like:
>>> list_get(['foo'], 4) == None
True
>>> list_get(['hootenanny'], 4, 'ho down!')
'ho down!'
>>> list_get([''], 0)
''

For small index values you can implement
my_list.get(index, default)
as
(my_list + [default] * (index + 1))[index]
If you know in advance what index is then this can be simplified, for example if you knew it was 1 then you could do
(my_list + [default, default])[index]
Because lists are forward packed the only fail case we need to worry about is running off the end of the list. This approach pads the end of the list with enough defaults to guarantee that index is covered.
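Wrapped in a function, the padding trick might look like the sketch below. The name padded_get is made up, and it assumes a non-negative index. It builds a temporary padded copy of the list on every call, which is why it is only reasonable for short lists and small indices.

```python
def padded_get(my_list, index, default=None):
    """Emulate my_list.get(index, default) for a non-negative index."""
    # Appending index + 1 defaults guarantees that position `index` exists.
    return (my_list + [default] * (index + 1))[index]

print(padded_get(['a', 'b'], 1, 'x'))  # b
print(padded_get(['a', 'b'], 5, 'x'))  # x
print(padded_get([], 0, 'x'))          # x
```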

This isn't an extremely general-purpose solution, but I had a case where I expected a list of length 3 to 5 (with a guarding if), and I was breaking out the values to named variables. A simple and concise way I found for this involved:
foo = (argv + [None, None])[3]
bar = (argv + [None, None])[4]
Now foo and bar are either the 4th and 5th values in the list, or None if there weren't that many values.

Your use case is basically only relevant for arrays and matrices of a fixed length, where you know how long they are beforehand. In that case you typically also create them beforehand, filling them with None or 0, so that in fact any index you will use already exists.
You could say this: I need .get() on dictionaries quite often. After ten years as a full time programmer I don't think I have ever needed it on a list. :)

How do I do what strtok() does in C, in Python?

I am learning Python and trying to figure out an efficient way to tokenize a string of numbers separated by commas into a list. Well formed cases work as I expect, but less well formed cases not so much.
If I have this:
A = '1,2,3,4'
B = [int(x) for x in A.split(',')]
B results in [1, 2, 3, 4]
which is what I expect, but if the string is something more like
A = '1,,2,3,4,'
if I'm using the same list comprehension expression for B as above, I get an exception. I think I understand why (because some of the "x" string values are not integers), but I'm thinking that there would be a way to parse this still quite elegantly such that tokenization of the string a works a bit more directly like strtok(A,",\n\t") would have done when called iteratively in C.
To be clear what I am asking; I am looking for an elegant/efficient/typical way in Python to have all of the following example cases of strings:
A='1,,2,3,\n,4,\n'
A='1,2,3,4'
A=',1,2,3,4,\t\n'
A='\n\t,1,2,3,,4\n'
return with the same list of:
B=[1,2,3,4]
via some sort of compact expression.

How about this:
A = '1, 2,,3,4 '
B = [int(x) for x in A.split(',') if x.strip()]
x.strip() trims whitespace from the string, which will make it empty if the string is all whitespace. An empty string is "false" in a boolean context, so it's filtered by the if part of the list comprehension.
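As a quick check, the same comprehension can be run against the four sample strings from the question. The wrapper name tokenize_ints is made up for this sketch. It works because int() itself tolerates surrounding whitespace such as '4\n'.

```python
def tokenize_ints(text):
    # Skip tokens that are empty or whitespace-only; int() ignores
    # leading and trailing whitespace in the remaining tokens.
    return [int(x) for x in text.split(',') if x.strip()]

cases = ['1,,2,3,\n,4,\n', '1,2,3,4', ',1,2,3,4,\t\n', '\n\t,1,2,3,,4\n']
for case in cases:
    print(tokenize_ints(case))  # [1, 2, 3, 4] each time
```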

Generally, I try to avoid regular expressions, but if you want to split on a bunch of different things, they work. Try this:
import re
result = [int(x) for x in filter(None, re.split(r'[,\n\t]', A))]

Mmm, functional goodness (with a bit of generator expression thrown in):
a = "1,2,,3,4,"
print(list(map(int, filter(None, (i.strip() for i in a.split(','))))))
For full functional joy:
a = "1,2,,3,4,"
print(list(map(int, filter(None, map(str.strip, a.split(','))))))

For the sake of completeness, I will answer this seven year old question:
The C program that uses strtok:
#include <stdio.h>
#include <string.h>
int main(void)
{
    char myLine[] = "This is;a-line,with pieces";
    char *p;
    for (p = strtok(myLine, " ;-,"); p != NULL; p = strtok(NULL, " ;-,"))
    {
        printf("piece=%s\n", p);
    }
    return 0;
}
can be accomplished in python with re.split as:
import re
myLine="This is;a-line,with pieces"
for p in re.split(r"[ ;\-,]", myLine):
    print("piece=" + p)
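One behavioral difference is worth noting: strtok skips runs of adjacent delimiters, while re.split returns an empty string between them. If strtok-like behavior is wanted, one option is to match the tokens instead of the separators with re.findall. The input line below is a made-up variation with doubled delimiters.

```python
import re

line = "This is;;a-line,,with pieces"

# re.split keeps an empty string between two adjacent delimiters.
print(re.split(r"[ ;\-,]", line))
# ['This', 'is', '', 'a', 'line', '', 'with', 'pieces']

# re.findall on "runs of non-delimiters" skips them, like strtok does.
tokens = re.findall(r"[^ ;\-,]+", line)
print(tokens)
# ['This', 'is', 'a', 'line', 'with', 'pieces']
```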

This will work, and never raise an exception, as long as all the numbers are ints: isdigit() is false for any token that is not purely digits, for example when there's a decimal point in the string.
>>> nums = ['1,,2,3,\n,4\n', '1,2,3,4', ',1,2,3,4,\t\n', '\n\t,1,2,3,,4\n']
>>> for n in nums:
...     [int(i.strip()) for i in n.split(',') if i.strip() and i.strip().isdigit()]
...
[1, 2, 3, 4]
[1, 2, 3, 4]
[1, 2, 3, 4]
[1, 2, 3, 4]

How about this?
>>> a = "1,2,,3,4,"
>>> map(int,filter(None,a.split(",")))
[1, 2, 3, 4]
filter will remove all false values (i.e. empty strings), which are then mapped to int.
EDIT: Just tested this against the above posted versions, and it seems to be significantly faster, 15% or so compared to the strip() one and more than twice as fast as the isdigit() one

Why accept inferior substitutes that cannot segfault your interpreter? With ctypes you can just call the real thing! :-)
# strtok in Python
from ctypes import c_char_p, cdll, create_string_buffer
try:
    libc = cdll.LoadLibrary('libc.so.6')
except OSError:
    libc = cdll.LoadLibrary('msvcrt.dll')
libc.strtok.restype = c_char_p
# strtok writes into its input, so hand it a mutable buffer of bytes
dat = create_string_buffer(b"1,,2,3,4")
sep = c_char_p(b",\n\t")
result = [libc.strtok(dat, sep)] + list(iter(lambda: libc.strtok(None, sep), None))
print(result)

Why not just wrap in a try except block which catches anything not an integer?
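A sketch of what that suggestion could look like follows. The function name parse_ints is made up. Since try/except cannot go inside a comprehension, it uses an explicit loop.

```python
def parse_ints(text, sep=','):
    """Collect every token that int() accepts and silently skip the rest."""
    result = []
    for token in text.split(sep):
        try:
            result.append(int(token))
        except ValueError:
            # empty, whitespace-only, or otherwise non-integer token
            pass
    return result

print(parse_ints('1,,2,3,\n,4,\n'))    # [1, 2, 3, 4]
print(parse_ints('\n\t,1,2,3,,4\n'))   # [1, 2, 3, 4]
```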

I was desperately in need of a strtok equivalent in Python, so I developed a simple one on my own:
def strtok(val, delim):
    token_list = []
    token_list.append(val)
    # split the current tokens again on each delimiter character in turn
    for key in delim:
        nList = []
        for token in token_list:
            subTokens = [x for x in token.split(key) if x.strip()]
            nList = nList + subTokens
        token_list = nList
    return token_list

I'd guess regular expressions are the way to go: http://docs.python.org/library/re.html
