Split string by comma, ignoring commas inside quoted strings - trying csv - Python

I have a string like this:
s = '1,2,"hello, there"'
And I want to turn it into a list:
[1,2,"hello, there"]
Normally I'd use split:
my_list = s.split(",")
However, that doesn't work if there's a comma in a string.
So, I've read that I need to use csv, but I don't really see how. I've tried:
from csv import reader
s = '1,2,"hello, there"'
ll = reader(s)
print ll
for row in ll:
    print row
Which writes:
<_csv.reader object at 0x020EBC70>
['1']
['', '']
['2']
['', '']
['hello, there']
I've also tried with
ll = reader(s, delimiter=',')

That happens because you provide the csv reader its input as a plain string, which it iterates over character by character. If you do not want to use a file or a StringIO object, just wrap your string in a list as shown below.
>>> import csv
>>> s = ['1,2,"hello, there"']
>>> ll = csv.reader(s, delimiter=',', quotechar='"', quoting=csv.QUOTE_MINIMAL)
>>> list(ll)
[['1', '2', 'hello, there']]

It sounds like you probably want to use the csv module. To use the reader on a string, you want a StringIO object.
As an example:
>>> import csv, StringIO
>>> print list(csv.reader(StringIO.StringIO(s)))
[['1', '2', 'hello, there']]
To clarify, csv.reader expects an iterable that yields one line at a time, not a plain string, so StringIO does the trick. However, if you're reading this csv from a file object (a typical use case), you can just as easily give the file object to the reader and it'll work the same way.
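As a side note, on Python 3 the same trick works with io.StringIO (a minimal sketch, not part of the original answer):

```python
import csv
import io

s = '1,2,"hello, there"'
# io.StringIO wraps the string in a file-like object the reader can iterate.
rows = list(csv.reader(io.StringIO(s)))
print(rows)  # [['1', '2', 'hello, there']]
```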

It's usually easier to reuse existing code than to reinvent the wheel... You just need to use the csv library properly. If you can't for some reason, you can always check out the source code and learn how the parsing is done there.
Example of parsing a single string into a list. Notice that the string is wrapped in a list.
>>> import csv
>>> s = '1,2,"hello, there"'
>>> list(csv.reader([s]))[0]
['1', '2', 'hello, there']

You can split first on the quote character, then split the even-indexed pieces (the ones outside the quotes) by commas:
import itertools

new_data = s.split('"')
for i in range(len(new_data)):
    if i % 2 == 1:  # odd indices are the quoted substrings; keep them whole
        new_data[i] = [new_data[i]]
    else:           # even indices sit outside the quotes; split them on commas
        new_data[i] = [x for x in new_data[i].split(",") if x]
data = list(itertools.chain(*new_data))
Which goes something like:
'1,2,"hello, there"'
['1,2,', 'hello, there', '']
[['1', '2'], ['hello, there'], []]
['1', '2', 'hello, there']
But it's probably better to use the csv library if that's what you're working with.

You could also use ast.literal_eval if you want to preserve the integers:
>>> from ast import literal_eval
>>> literal_eval('[{}]'.format('1,2,"hello, there"'))
[1, 2, 'hello, there']
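One caveat worth noting: ast.literal_eval only succeeds when every field is a valid Python literal; a bare, unquoted word makes it raise ValueError. A small sketch:

```python
from ast import literal_eval

# Works because every field is a valid Python literal.
parsed = literal_eval('[{}]'.format('1,2,"hello, there"'))
print(parsed)  # [1, 2, 'hello, there']

# An unquoted field is not a literal, so parsing fails.
try:
    literal_eval('[1,2,hello]')
    failed = False
except ValueError:
    failed = True
print(failed)  # True
```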


CSV delimiter doesn't work properly [Python]

import csv
base='eest1#mail.ru,username1\
test2#gmail.com,username2\
test3#gmail.com,username3\
test4#rambler.ru,username4\
test5#ya.ru,username5'
parsed=csv.reader(base, delimiter=',')
for p in parsed:
    print p
Returns:
['e']
['e']
['s']
['t']
['1']
['#']
['m']
['a']
['i']
['l']
['.']
['r']
['u']
['', '']
etc...
How can I get the data separated by commas, like this?
('test1#gmail.com', 'username1'),
('test2#gmail.com', 'username2'),
...
I think csv only works with file-like objects (or other iterables of lines). You can use StringIO in this case.
import csv
import StringIO
base='''eest1#mail.ru,username
test2#gmail.com,username2
test3#gmail.com,username3
test4#rambler.ru,username4
test5#ya.ru,username5'''
parsed=csv.reader(StringIO.StringIO(base), delimiter=',')
for p in parsed:
    print p
OUTPUT
['eest1#mail.ru', 'username']
['test2#gmail.com', 'username2']
['test3#gmail.com', 'username3']
['test4#rambler.ru', 'username4']
['test5#ya.ru', 'username5']
Also, your example string does not have newlines, so you would get
['eest1#mail.ru', 'usernametest2#gmail.com', 'username2test3#gmail.com', 'username3test4#rambler.ru', 'username4test5#ya.ru', 'username5']
You can use the ''' like I did, or change your base like
base='eest1#mail.ru,username\n\
test2#gmail.com,username2\n\
test3#gmail.com,username3\n\
test4#rambler.ru,username4\n\
test5#ya.ru,username5'
EDIT
According to the docs, the argument can be either a file-like object OR a list. So this works too:
parsed=csv.reader(base.splitlines(), delimiter=',')
Quoting official docs on csv module (emphasis mine):
csv.reader(csvfile, dialect='excel', **fmtparams)
Return a reader object which will iterate over lines in the given
csvfile. csvfile can be any object which supports the iterator
protocol and returns a string each time its __next__() method is
called — file objects and list objects are both suitable.
Strings support the iterator protocol, but iterating over a string yields its characters one by one, not the lines of a multi-line string.
>>> s = "abcdef"
>>> i = iter(s)
>>> next(i)
'a'
>>> next(i)
'b'
>>> next(i)
'c'
So the task is to create an iterator which yields lines, not characters, on each iteration. Unfortunately, your string literal is not a multiline string.
base='eest1#mail.ru,username1\
test2#gmail.com,username2\
test3#gmail.com,username3\
test4#rambler.ru,username4\
test5#ya.ru,username5'
is equivalent to:
base = 'eest1#mail.ru,username1test2#gmail.com,username2test3#gmail.com,username3test4#rambler.ru,username4test5#ya.ru,username5'
Essentially you do not have the information required to parse that string correctly. Try using a multiline string literal instead:
base='''eest1#mail.ru,username1
test2#gmail.com,username2
test3#gmail.com,username3
test4#rambler.ru,username4
test5#ya.ru,username5'''
After this change you may split your string on newline characters and everything should work fine:
parsed=csv.reader(base.splitlines(), delimiter=',')
for p in parsed:
    print(p)

Extract list from a string

I am extracting data from the Google Adwords Reporting API via Python. I can successfully pull the data and then hold it in a variable data.
data = get_report_data_from_google()
type(data)
str
Here is a sample:
data = 'ID,Labels,Date,Year\n3179799191,"[""SKWS"",""Exact""]",2016-05-16,2016\n3179461237,"[""SKWS"",""Broad""]",2016-05-16,2016\n3282565342,"[""SKWS"",""Broad""]",2016-05-16,2016\n'
I need to process this data more, and ultimately output a processed flat file (Google Adwords API can return a CSV, but I need to pre-process the data before loading it into a database.).
If I try to turn data into a csv object, and try to print each line, I get one character per line like:
c = csv.reader(data, delimiter=',')
for i in c:
    print(i)
['I']
['D']
['', '']
['L']
['a']
['b']
['e']
['l']
['s']
['', '']
['D']
['a']
['t']
['e']
So, my idea was to process each column of each line into a list, then add that to a csv object. Trying that:
for line in data.splitlines():
    print(line)
3179799191,"[""SKWS"",""Exact""]",2016-05-16,2016
What I actually find is that inside the str there is a list: "[""SKWS"",""Exact""]"
This value is a "label" (documentation)
This list is formatted a bit oddly - the value is full of doubled quote characters, so trying to use a quote char like " returns something like [ SKWS Exact ]. If I could get to [""SKWS"",""Exact""], that would be acceptable.
Is there a good way to extract a list object within a str? Is there a better way to process and output this data to a csv?
You need to split the string first. csv.reader expects something that provides a single line on each iteration, like a standard file object does. If you have a string with newlines in it, split it on the newline character with splitlines():
>>> import csv
>>> data = 'ID,Labels,Date,Year\n3179799191,"[""SKWS"",""Exact""]",2016-05-16,2016\n3179461237,"[""SKWS"",""Broad""]",2016-05-16,2016\n3282565342,"[""SKWS"",""Broad""]",2016-05-16,2016\n'
>>> c = csv.reader(data.splitlines(), delimiter=',')
>>> for line in c:
... print(line)
...
['ID', 'Labels', 'Date', 'Year']
['3179799191', '["SKWS","Exact"]', '2016-05-16', '2016']
['3179461237', '["SKWS","Broad"]', '2016-05-16', '2016']
['3282565342', '["SKWS","Broad"]', '2016-05-16', '2016']
This has to do with how csv.reader works.
According to the documentation:
csvfile can be any object which supports the iterator protocol and returns a string each time its next() method is called
The issue here is that if you pass a string, it supports the iterator protocol but returns a single character for each call to next(). The csv reader then considers each character a line.
You need to provide a list of lines, one for each line of your csv. For example:
c = csv.reader(data.split(), delimiter=',')
for i in c:
    print i
# ['ID', 'Labels', 'Date', 'Year']
# ['3179799191', '["SKWS","Exact"]', '2016-05-16', '2016']
# ['3179461237', '["SKWS","Broad"]', '2016-05-16', '2016']
# ['3282565342', '["SKWS","Broad"]', '2016-05-16', '2016']
Now, your list looks like a JSON list. You can use the json module to read it.
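Putting the two steps together, a sketch of reading the Labels column with csv and then json, using the sample data from the question:

```python
import csv
import json

data = 'ID,Labels,Date,Year\n3179799191,"[""SKWS"",""Exact""]",2016-05-16,2016\n'
rows = list(csv.reader(data.splitlines(), delimiter=','))
# The csv layer has already collapsed the doubled quotes,
# leaving a valid JSON list in the Labels column.
labels = json.loads(rows[1][1])
print(labels)  # ['SKWS', 'Exact']
```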

how to easily read Python built-in types from a file

I have a file which lists values of some Python built-in types: None, integers, and strings, with proper Python syntax, including escaping. For example, the file might look like this:
2
"""\\nfoo
bar
""" 'foo bar'
None
I then want to read that file into the array of the values. For the above example, the array would be:
[2, '\\nfoo\nbar\n', 'foo bar', None]
I can do this by carefully parsing and/or using split function.
Is there an easy way to do it?
I would recommend changing your file format. That said, what you have is parseable. It might get harder to parse if you have multi-token values like lists, but with only None, ints, and strings, you can tokenize the input with tokenize and parse it with something like ast.literal_eval:
import tokenize
import ast
values = []
with open('input_file') as f:
    for token_type, token_string, _, _, _ in tokenize.generate_tokens(f.readline):
        # Ignore newlines and the file-ending dummy token.
        if token_type in (tokenize.ENDMARKER, tokenize.NEWLINE, tokenize.NL):
            continue
        values.append(ast.literal_eval(token_string))
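The same tokenize-plus-literal_eval approach can be tried against an in-memory string via io.StringIO instead of a file; a simplified sketch with made-up sample values:

```python
import ast
import io
import tokenize

# Hypothetical sample input: an int, two strings on one line, and None.
text = '2\n"multi" \'foo bar\'\nNone\n'
values = []
for token_type, token_string, _, _, _ in tokenize.generate_tokens(io.StringIO(text).readline):
    # Ignore newlines and the file-ending dummy token.
    if token_type in (tokenize.ENDMARKER, tokenize.NEWLINE, tokenize.NL):
        continue
    values.append(ast.literal_eval(token_string))
print(values)  # [2, 'multi', 'foo bar', None]
```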
You can use ast.literal_eval:
>>> import ast
>>> ast.literal_eval('2')
2
>>> type(ast.literal_eval('2'))
<type 'int'>
>>> ast.literal_eval('[1,2,3]')
[1, 2, 3]
>>> type(ast.literal_eval('[1,2,3]'))
<type 'list'>
>>> ast.literal_eval('"a"')
'a'
>>> type(ast.literal_eval('"a"'))
<type 'str'>
This almost gets you there, but due to the way strings work, it ends up combining the two strings:
import ast
with open('tokens.txt') as in_file:
    current_string = ''
    tokens = []
    for line in in_file:
        current_string += line.strip()
        try:
            new_token = ast.literal_eval(current_string)
            tokens.append(new_token)
            current_string = ''
        except SyntaxError:
            print("Couldn't parse current line, combining with next")
tokens
Out[8]: [2, '\\nfoobarfoo bar', None]
The problem is that in Python, if you have two string literals sitting next to each other, they concatenate even if you don't use +, e.g.:
x = 'string1' 'string2'
x
Out[10]: 'string1string2'
I apologize for posting an answer to my own question, but it looks like what works is to replace unquoted whitespace (including newlines) with commas, put [] around the whole thing, and parse the result as a list literal.

python parse csv to lists

I have a csv file from which I want to parse the data into lists, so I am using the Python csv module to read it.
Basically the following:
import csv
fin = csv.reader(open(path,'rb'),delimiter=' ',quotechar='|')
print fin[0]
#gives the following
['"1239","2249.00","1","3","2011-02-20"']
#lets say i do the following
ele = str(fin[0])
ele = ele.strip().split(',')
print ele
#gives me following
['[\'"1239"', '"2249.00"', '"1"', '"3"', '"2011-02-20"\']']
now
ele[0] gives me the output ---> ['"1239"
How do I get rid of that ['?
In the end, what I want to do is get 1239 and convert it to an integer.
Any clues why this is happening?
Thanks
Edit:*Never mind.. resolved thanks to the first comment *
Change your delimiter to ',' and you will get a list of those values from the csv reader.
It's because you are converting a list to a string, there is no need to do this. Grab the first element of the list (in this case it is a string) and parse that:
>>> a = ['"1239","2249.00","1","3","2011-02-20"']
>>> a
['"1239","2249.00","1","3","2011-02-20"']
>>> a[0]
'"1239","2249.00","1","3","2011-02-20"'
>>> b = a[0].replace('"', '').split(',')
>>> b[-1]
'2011-02-20'
Of course, before calling the replace and split string methods you should check that the value is a string, or handle the exception if it isn't.
Also, Blahdiblah is correct: your delimiter is probably wrong.
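For completeness, a sketch of reading with the comma delimiter and converting the first field to an int, using io.StringIO as a stand-in for the real file:

```python
import csv
import io

# Stand-in for the file contents shown in the question.
fin = csv.reader(io.StringIO('"1239","2249.00","1","3","2011-02-20"'),
                 delimiter=',', quotechar='"')
row = next(fin)
print(row)          # ['1239', '2249.00', '1', '3', '2011-02-20']
print(int(row[0]))  # 1239
```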

Python strings / match case

I have a CSV file which has the following format:
id,case1,case2,case3
Here is a sample:
123,null,X,Y
342,X,X,Y
456,null,null,null
789,null,null,X
For each line I need to know which of the cases is not null. Is there an easy way to find out which case(s) are not null without splitting the string and going through each element?
This is what the result should look like:
123,case2:case3
342,case1:case2:case3
456:None
789:case3
You probably want to take a look at the csv module, which has readers and writers that will enable you to create transforms.
>>> from StringIO import StringIO
>>> from csv import DictReader
>>> fh = StringIO("""
... id,case1,case2,case3
...
... 123,null,X,Y
...
... 342,X,X,Y
...
... 456,null,null,null
...
... 789,null,null,X
... """.strip())
>>> dr = DictReader(fh)
>>> dr.next()
{'case1': 'null', 'case3': 'Y', 'case2': 'X', 'id': '123'}
At which point you can do something like:
>>> from csv import DictWriter
>>> out_fh = StringIO()
>>> writer = DictWriter(out_fh, fieldnames=dr.fieldnames)
>>> for mapping in dr:
...     writer.writerow(dict((k, v) for k, v in mapping.items() if v != 'null'))
...
The last bit is untested pseudocode; replace out_fh with the filehandle that you'd like to output to.
Any way you slice it, you are still going to have to go through the list. There are more and less elegant ways to do it. Depending on the Python version you are using, you can use list comprehensions.
ids=line.split(",")
print "%s:%s" % (ids[0], ":".join(["case%d" % x for x in range(1, len(ids)) if ids[x] != "null"]))
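Run against the sample lines from the question, that comprehension looks roughly like this (a sketch; it emits "None" when every case is null, and its output format differs slightly from the question's sample):

```python
lines = ["123,null,X,Y", "342,X,X,Y", "456,null,null,null", "789,null,null,X"]
results = []
for line in lines:
    ids = line.split(",")
    # Keep only the case columns (index 1 onward) that are not "null".
    cases = ":".join("case%d" % x for x in range(1, len(ids)) if ids[x] != "null")
    results.append("%s:%s" % (ids[0], cases or "None"))
print("\n".join(results))
# 123:case2:case3
# 342:case1:case2:case3
# 456:None
# 789:case3
```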
Why do you treat splitting as a problem? For performance reasons?
Literally you could avoid splitting with smart regexps, like:
\d+,null,\w+,\w+
\d+,\w+,null,\w+
...
but I find it a worse solution than reparsing the data into lists.
You could use the Python csv module, which comes with the standard installation of Python... It will not be much easier, though...
