CSV delimiter doesn't work properly [Python]

import csv
base='eest1#mail.ru,username1\
test2#gmail.com,username2\
test3#gmail.com,username3\
test4#rambler.ru,username4\
test5#ya.ru,username5'
parsed=csv.reader(base, delimiter=',')
for p in parsed:
    print p
Returns:
['e']
['e']
['s']
['t']
['1']
['#']
['m']
['a']
['i']
['l']
['.']
['r']
['u']
['', '']
etc...
How can I get the data separated by commas?
('test1#gmail.com', 'username1'),
('test2#gmail.com', 'username2'),
...

I think csv only works with file-like objects. You can use StringIO in this case.
import csv
import StringIO
base='''eest1#mail.ru,username
test2#gmail.com,username2
test3#gmail.com,username3
test4#rambler.ru,username4
test5#ya.ru,username5'''
parsed=csv.reader(StringIO.StringIO(base), delimiter=',')
for p in parsed:
    print p
OUTPUT
['eest1#mail.ru', 'username']
['test2#gmail.com', 'username2']
['test3#gmail.com', 'username3']
['test4#rambler.ru', 'username4']
['test5#ya.ru', 'username5']
Also, your example string does not have newlines, so you would get
['eest1#mail.ru', 'usernametest2#gmail.com', 'username2test3#gmail.com', 'username3test4#rambler.ru', 'username4test5#ya.ru', 'username5']
You can use the ''' like I did, or change your base like
base='eest1#mail.ru,username\n\
test2#gmail.com,username2\n\
test3#gmail.com,username3\n\
test4#rambler.ru,username4\n\
test5#ya.ru,username5'
EDIT
According to the docs, the argument can be either a file-like object or a list. So this works too:
parsed=csv.reader(base.splitlines(), delimiter=',')
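
On Python 3 the StringIO module imported above no longer exists; io.StringIO is the equivalent. A minimal sketch of the same idea, reusing a shortened copy of the sample data from the question:

```python
import csv
import io

base = """eest1#mail.ru,username1
test2#gmail.com,username2
test3#gmail.com,username3"""

# io.StringIO wraps the string in a file-like object that yields one line
# per iteration, which is what csv.reader needs.
rows = list(csv.reader(io.StringIO(base), delimiter=","))
print(rows)
```

If the text is already in memory, base.splitlines() does the same job without the extra wrapper.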

Quoting official docs on csv module (emphasis mine):
csv.reader(csvfile, dialect='excel', **fmtparams)
Return a reader object which will iterate over lines in the given
csvfile. csvfile can be any object which supports the iterator
protocol and returns a string each time its __next__() method is
called — file objects and list objects are both suitable.
A string supports the iterator protocol, but it yields the characters of the string one by one, not the lines of a multi-line string.
>>> s = "abcdef"
>>> i = iter(s)
>>> next(i)
'a'
>>> next(i)
'b'
>>> next(i)
'c'
So the task is to create an iterator that yields lines, not characters, on each iteration. Unfortunately, your string literal is not a multiline string.
base='eest1#mail.ru,username1\
test2#gmail.com,username2\
test3#gmail.com,username3\
test4#rambler.ru,username4\
test5#ya.ru,username5'
is equivalent to:
base = 'eest1#mail.ru,username1test2#gmail.com,username2test3#gmail.com,username3test4#rambler.ru,username4test5#ya.ru,username5'
Essentially, you do not have the information required to parse that string correctly. Try using a multiline string literal instead:
base='''eest1#mail.ru,username1
test2#gmail.com,username2
test3#gmail.com,username3
test4#rambler.ru,username4
test5#ya.ru,username5'''
After this change, you can split your string on newline characters and everything should work fine:
parsed=csv.reader(base.splitlines(), delimiter=',')
for p in parsed:
    print(p)
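
The question asked for tuples rather than lists. A small sketch of that last step, assuming the multiline form of base shown above (shortened here to two rows):

```python
import csv

base = '''eest1#mail.ru,username1
test2#gmail.com,username2'''

# csv.reader yields each row as a list; tuple() converts it to the shape
# the question asked for.
pairs = [tuple(row) for row in csv.reader(base.splitlines(), delimiter=',')]
print(pairs)
```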

Related

Extract list from a string

I am extracting data from the Google Adwords Reporting API via Python. I can successfully pull the data and then hold it in a variable data.
data = get_report_data_from_google()
type(data)
str
Here is a sample:
data = 'ID,Labels,Date,Year\n3179799191,"[""SKWS"",""Exact""]",2016-05-16,2016\n3179461237,"[""SKWS"",""Broad""]",2016-05-16,2016\n3282565342,"[""SKWS"",""Broad""]",2016-05-16,2016\n'
I need to process this data more, and ultimately output a processed flat file (Google Adwords API can return a CSV, but I need to pre-process the data before loading it into a database.).
If I try to turn data into a csv object, and try to print each line, I get one character per line like:
c = csv.reader(data, delimiter=',')
for i in c:
    print(i)
['I']
['D']
['', '']
['L']
['a']
['b']
['e']
['l']
['s']
['', '']
['D']
['a']
['t']
['e']
So, my idea was to process each column of each line into a list, then add that to a csv object. Trying that:
for line in data.splitlines():
    print(line)
3179799191,"[""SKWS"",""Exact""]",2016-05-16,2016
What I actually find is that inside of the str, there is a list: "[""SKWS"",""Exact""]"
This value is a "label" documentation
This list is formatted a bit oddly: it has numerous quotation marks in the value, so trying to use a quote char, like ", will return something like this: [ SKWS Exact ]. If I could get to [""SKWS"",""Exact""], that would be acceptable.
Is there a good way to extract a list object within a str? Is there a better way to process and output this data to a csv?

You need to split the string first. csv.reader expects something that provides a single line on each iteration, like a standard file object does. If you have a string with newlines in it, split it on the newline character with splitlines():
>>> import csv
>>> data = 'ID,Labels,Date,Year\n3179799191,"[""SKWS"",""Exact""]",2016-05-16,2016\n3179461237,"[""SKWS"",""Broad""]",2016-05-16,2016\n3282565342,"[""SKWS"",""Broad""]",2016-05-16,2016\n'
>>> c = csv.reader(data.splitlines(), delimiter=',')
>>> for line in c:
...     print(line)
...
['ID', 'Labels', 'Date', 'Year']
['3179799191', '["SKWS","Exact"]', '2016-05-16', '2016']
['3179461237', '["SKWS","Broad"]', '2016-05-16', '2016']
['3282565342', '["SKWS","Broad"]', '2016-05-16', '2016']

This has to do with how csv.reader works.
According to the documentation:
csvfile can be any object which supports the iterator protocol and returns a string each time its next() method is called
The issue here is that if you pass a string, it supports the iterator protocol, and returns a single character for each call to next. The csv reader will then consider each character as a line.
You need to provide a list of lines, one for each line of your CSV. For example:
c = csv.reader(data.splitlines(), delimiter=',')
for i in c:
    print(i)
# ['ID', 'Labels', 'Date', 'Year']
# ['3179799191', '["SKWS","Exact"]', '2016-05-16', '2016']
# ['3179461237', '["SKWS","Broad"]', '2016-05-16', '2016']
# ['3282565342', '["SKWS","Broad"]', '2016-05-16', '2016']
Now, your list looks like a JSON list. You can use the json module to read it.
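
A minimal sketch of that last step, using one row of the sample data from the question. csv.DictReader is used here only for convenience, so the column can be addressed by name; the variable names are illustrative:

```python
import csv
import json

data = 'ID,Labels,Date,Year\n3179799191,"[""SKWS"",""Exact""]",2016-05-16,2016\n'

records = []
for row in csv.DictReader(data.splitlines()):
    # After CSV unquoting, the field is the JSON text '["SKWS","Exact"]'.
    row["Labels"] = json.loads(row["Labels"])
    records.append(row)

print(records[0]["Labels"])
```

The csv module undoes the doubled quotes first, so what reaches json.loads is already valid JSON.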

Split string by comma, ignoring comma inside string. Am trying CSV

I have a string like this:
s = '1,2,"hello, there"'
And I want to turn it into a list:
[1,2,"hello, there"]
Normally I'd use split:
my_list = s.split(",")
However, that doesn't work if there's a comma in a string.
So, I've read that I need to use csv, but I don't really see how. I've tried:
from csv import reader
s = '1,2,"hello, there"'
ll = reader(s)
print ll
for row in ll:
    print row
Which writes:
<_csv.reader object at 0x020EBC70>
['1']
['', '']
['2']
['', '']
['hello, there']
I've also tried with
ll = reader(s, delimiter=',')

It is that way because you provide the csv reader input as a string. If you do not want to use a file or a StringIO object just wrap your string in a list as shown below.
>>> import csv
>>> s = ['1,2,"hello, there"']
>>> ll = csv.reader(s, delimiter=',', quotechar='"', quoting=csv.QUOTE_MINIMAL)
>>> list(ll)
[['1', '2', 'hello, there']]

It sounds like you probably want to use the csv module. To use the reader on a string, you want a StringIO object.
As an example:
>>> import csv, StringIO
>>> print list(csv.reader(StringIO.StringIO(s)))
[['1', '2', 'hello, there']]
To clarify, csv.reader expects an iterable that yields one line at a time (such as a file-like object), not a plain string, so StringIO does the trick. However, if you're reading this CSV from a file object (a typical use case), you can just as easily give the file object to the reader and it'll work the same way.

It's usually easier to re-use than to reinvent the bicycle... You just need to use the csv library properly. If you can't for some reason, you can always check out the source code and learn how the parsing is done there.
Example for parsing a single string into a list. Notice that the string is wrapped in a list.
>>> import csv
>>> s = '1,2,"hello, there"'
>>> list(csv.reader([s]))[0]
['1', '2', 'hello, there']
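
The question wanted the numbers back as integers, while csv.reader always returns strings. A possible sketch of the conversion step; parse_line and to_int_if_possible are hypothetical helper names, not part of the csv module:

```python
import csv

def to_int_if_possible(field):
    """Return int(field) when the text is an integer, else the text itself."""
    try:
        return int(field)
    except ValueError:
        return field

def parse_line(line):
    """Parse one CSV line and convert integer-looking fields to int."""
    fields = next(csv.reader([line]))
    return [to_int_if_possible(field) for field in fields]

print(parse_line('1,2,"hello, there"'))
```

Note that the reader discards the quoting, so a quoted "3" in the input would also come back as an integer with this approach.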

You can split first on the string delimiters (the double quotes), then split on commas at every even index (the pieces that were not inside quotes):
import itertools
new_data = s.split('"')
for i in range(len(new_data)):
    if i % 2 == 1:  # Odd indices were inside quotes: keep each as a one-item list
        new_data[i] = [new_data[i]]
    else:  # Even indices were outside quotes: split on commas, dropping empty pieces
        new_data[i] = [part for part in new_data[i].split(",") if part]
data = list(itertools.chain(*new_data))
Which goes something like:
'1,2,"hello, there"'
['1,2,', 'hello, there', '']
[['1', '2'], ['hello, there'], []]
['1', '2', 'hello, there']
But it's probably better to use the csv library if that's what you're working with.

You could also use ast.literal_eval if you want to preserve the integers:
>>> from ast import literal_eval
>>> literal_eval('[{}]'.format('1,2,"hello, there"'))
[1, 2, 'hello, there']

how to easily read Python built-in types from a file

I have a file which lists values of some Python built-in types: None, integers, and strings, with proper Python syntax, including escaping. For example, the file might look like this:
2
"""\\nfoo
bar
""" 'foo bar'
None
I then want to read that file into the array of the values. For the above example, the array would be:
[2, '\\nfoo\nbar\n', 'foo bar', None]
I can do this by carefully parsing and/or using split function.
Is there an easy way to do it?

I would recommend changing your file format. That said, what you have is parseable. It might get harder to parse if you have multi-token values like lists, but with only None, ints, and strings, you can tokenize the input with tokenize and parse it with something like ast.literal_eval:
import tokenize
import ast
values = []
with open('input_file') as f:
    for token_type, token_string, _, _, _ in tokenize.generate_tokens(f.readline):
        # Ignore newlines and the file-ending dummy token.
        if token_type in (tokenize.ENDMARKER, tokenize.NEWLINE, tokenize.NL):
            continue
        values.append(ast.literal_eval(token_string))

You can use ast.literal_eval:
>>> import ast
>>> ast.literal_eval('2')
2
>>> type(ast.literal_eval('2'))
<type 'int'>
>>> ast.literal_eval('[1,2,3]')
[1, 2, 3]
>>> type(ast.literal_eval('[1,2,3]'))
<type 'list'>
>>> ast.literal_eval('"a"')
'a'
>>> type(ast.literal_eval('"a"'))
<type 'str'>

This almost gets you there, but due to the way strings work, it ends up combining the two strings:
import ast
with open('tokens.txt') as in_file:
    current_string = ''
    tokens = []
    for line in in_file:
        current_string += line.strip()
        try:
            new_token = ast.literal_eval(current_string)
            tokens.append(new_token)
            current_string = ''
        except SyntaxError:
            print("Couldn't parse current line, combining with next")
tokens
Out[8]: [2, '\\nfoobarfoo bar', None]
The problem is that in Python, if you have two string literals sitting next to each other, they concatenate even if you don't use +, e.g.:
x = 'string1' 'string2'
x
Out[10]: 'string1string2'

I apologize for posting an answer to my own question, but what seems to work is to replace unquoted whitespace (including newlines) with commas, then put [] around the whole thing and evaluate it.
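
One way this idea could be sketched is to let the tokenize module decide which whitespace is outside a string, then join the tokens with commas. This assumes the file holds only non-negative integers, strings, and None, as in the question; read_values is a hypothetical helper name:

```python
import ast
import io
import tokenize

def read_values(text):
    """Join the tokens with commas, wrap them in [], and evaluate the result."""
    skip = {tokenize.NEWLINE, tokenize.NL, tokenize.ENDMARKER}
    tokens = [
        tok.string
        for tok in tokenize.generate_tokens(io.StringIO(text).readline)
        if tok.type not in skip
    ]
    # Assumption: every remaining token is a complete literal on its own.
    return ast.literal_eval("[" + ", ".join(tokens) + "]")

# The sample file contents from the question, as a raw string.
sample = r'''2
"""\\nfoo
bar
""" 'foo bar'
None
'''
print(read_values(sample))
```

Negative numbers would need extra handling, because the tokenizer reports the minus sign as a separate token.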

Python Regex: find all lines that start with '{' and end with '}'

I am receiving data over a socket, a bunch of JSON strings. However, I receive a set amount of bytes, so sometimes the last of my JSON strings is cut-off. I will typically get the following:
{"pitch":-30.778193,"yaw":-124.63285,"roll":-8.977466}
{"pitch":-30.856342,"yaw":-124.57556,"roll":-7.7220345}
{"pitch":-31.574106,"yaw":-124.65623,"roll":-7.911794}
{"pitch":-30.479567,"yaw":-124.24301,"roll":-8.730827}
{"pitch":-29.30239,"yaw":-123.97949,"roll":-8.134723}
{"pitch":-29.84712,"yaw":-124.584465,"roll":-8.588374}
{"pitch":-31.072054,"yaw":-124.707466,"roll":-8.877062}
{"pitch":-31.493435,"yaw":-124.75457,"roll":-9.019922}
{"pitch":-29.591925,"yaw":-124.960815,"roll":-9.379437}
{"pitch":-29.37105,"yaw":-125.14427,"roll":-9.642341}
{"pitch":-29.483717,"yaw":-125.16528,"roll":-9.687177}
{"pitch":-30.903332,"yaw":-124.603935,"roll":-9.423098}
{"pitch":-30.211857,"yaw":-124.471664,"roll":-9.116135}
{"pitch":-30.837414,"yaw":-125.18984,"roll":-9.824204}
{"pitch":-30.526165,"yaw":-124.85788,"roll":-9.158611}
{"pitch":-30.333513,"yaw":-123.68705,"roll":-7.9481263}
{"pitch":-30.903502,"yaw":-123.78847,"roll":-8.209373}
{"pitch":-31.194769,"yaw":-124.79708,"roll":-8.709783}
{"pitch":-30.816765,"yaw":-125
With Python, I would like to create a string array of the first 18 complete { data... } strings.
Here is what I have tried: cleanData = re.search('{.*}', data) but it seems like this is only giving me the very first { data... } entry. How can I get the full string array of complete { } sets?

To get all, you can use re.finditer or re.findall.
>>> re.findall(r'{.*}', s)
['{"pitch":-30.778193,"yaw":-124.63285,"roll":-8.977466}', '{"pitch":-30.856342,"yaw":-124.57556,"roll":-7.7220345}', '{"pitch":-31.574106,"yaw":-124.65623,"roll":-7.911794}', '{"pitch":-30.479567,"yaw":-124.24301,"roll":-8.730827}', '{"pitch":-29.30239,"yaw":-123.97949,"roll":-8.134723}', '{"pitch":-29.84712,"yaw":-124.584465,"roll":-8.588374}', '{"pitch":-31.072054,"yaw":-124.707466,"roll":-8.877062}', '{"pitch":-31.493435,"yaw":-124.75457,"roll":-9.019922}', '{"pitch":-29.591925,"yaw":-124.960815,"roll":-9.379437}', '{"pitch":-29.37105,"yaw":-125.14427,"roll":-9.642341}', '{"pitch":-29.483717,"yaw":-125.16528,"roll":-9.687177}', '{"pitch":-30.903332,"yaw":-124.603935,"roll":-9.423098}', '{"pitch":-30.211857,"yaw":-124.471664,"roll":-9.116135}', '{"pitch":-30.837414,"yaw":-125.18984,"roll":-9.824204}', '{"pitch":-30.526165,"yaw":-124.85788,"roll":-9.158611}', '{"pitch":-30.333513,"yaw":-123.68705,"roll":-7.9481263}', '{"pitch":-30.903502,"yaw":-123.78847,"roll":-8.209373}', '{"pitch":-31.194769,"yaw":-124.79708,"roll":-8.709783}']
>>>
OR
>>> [x.group() for x in re.finditer(r'{.*}', s)]
['{"pitch":-30.778193,"yaw":-124.63285,"roll":-8.977466}', '{"pitch":-30.856342,"yaw":-124.57556,"roll":-7.7220345}', '{"pitch":-31.574106,"yaw":-124.65623,"roll":-7.911794}', '{"pitch":-30.479567,"yaw":-124.24301,"roll":-8.730827}', '{"pitch":-29.30239,"yaw":-123.97949,"roll":-8.134723}', '{"pitch":-29.84712,"yaw":-124.584465,"roll":-8.588374}', '{"pitch":-31.072054,"yaw":-124.707466,"roll":-8.877062}', '{"pitch":-31.493435,"yaw":-124.75457,"roll":-9.019922}', '{"pitch":-29.591925,"yaw":-124.960815,"roll":-9.379437}', '{"pitch":-29.37105,"yaw":-125.14427,"roll":-9.642341}', '{"pitch":-29.483717,"yaw":-125.16528,"roll":-9.687177}', '{"pitch":-30.903332,"yaw":-124.603935,"roll":-9.423098}', '{"pitch":-30.211857,"yaw":-124.471664,"roll":-9.116135}', '{"pitch":-30.837414,"yaw":-125.18984,"roll":-9.824204}', '{"pitch":-30.526165,"yaw":-124.85788,"roll":-9.158611}', '{"pitch":-30.333513,"yaw":-123.68705,"roll":-7.9481263}', '{"pitch":-30.903502,"yaw":-123.78847,"roll":-8.209373}', '{"pitch":-31.194769,"yaw":-124.79708,"roll":-8.709783}']
>>>

You need re.findall() (or re.finditer)
>>> import re
>>> for r in re.findall(r'{.*}', data)[:18]:
...     print r
{"pitch":-30.778193,"yaw":-124.63285,"roll":-8.977466}
{"pitch":-30.856342,"yaw":-124.57556,"roll":-7.7220345}
{"pitch":-31.574106,"yaw":-124.65623,"roll":-7.911794}
{"pitch":-30.479567,"yaw":-124.24301,"roll":-8.730827}
{"pitch":-29.30239,"yaw":-123.97949,"roll":-8.134723}
{"pitch":-29.84712,"yaw":-124.584465,"roll":-8.588374}
{"pitch":-31.072054,"yaw":-124.707466,"roll":-8.877062}
{"pitch":-31.493435,"yaw":-124.75457,"roll":-9.019922}
{"pitch":-29.591925,"yaw":-124.960815,"roll":-9.379437}
{"pitch":-29.37105,"yaw":-125.14427,"roll":-9.642341}
{"pitch":-29.483717,"yaw":-125.16528,"roll":-9.687177}
{"pitch":-30.903332,"yaw":-124.603935,"roll":-9.423098}
{"pitch":-30.211857,"yaw":-124.471664,"roll":-9.116135}
{"pitch":-30.837414,"yaw":-125.18984,"roll":-9.824204}
{"pitch":-30.526165,"yaw":-124.85788,"roll":-9.158611}
{"pitch":-30.333513,"yaw":-123.68705,"roll":-7.9481263}
{"pitch":-30.903502,"yaw":-123.78847,"roll":-8.209373}
{"pitch":-31.194769,"yaw":-124.79708,"roll":-8.709783}

Extracting lines that start and end with a specific character can be done without any regex, use str.startswith and str.endswith methods when iterating through the lines in a file:
results = []
with open(filepath, 'r') as f:
    for line in f:
        if line.startswith('{') and line.rstrip('\n').endswith('}'):
            results.append(line.rstrip('\n'))
Note the .rstrip('\n') is used before .endswith to make sure the final newline does not interfere with the } check at the end of the string.
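
Since the data in the question arrives over a socket, the cut-off record can be kept and prepended to the next chunk instead of being thrown away. A minimal sketch, assuming every complete record ends with a newline; split_complete is a hypothetical helper, and the socket handling itself is left out:

```python
import json

def split_complete(buffer):
    """Return (parsed_records, leftover) for newline-delimited JSON text."""
    # Everything after the last newline is an incomplete record (or empty).
    *lines, leftover = buffer.split("\n")
    records = [json.loads(line) for line in lines if line.strip()]
    return records, leftover

chunk = '{"pitch":-30.7,"yaw":-124.6}\n{"pitch":-30.8,"yaw":-124.5}\n{"pitch":-30.8,"yaw":-125'
records, leftover = split_complete(chunk)
print(len(records), repr(leftover))
```

On the next read, the caller would pass leftover plus the new chunk to the same function.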

python parse csv to lists

I have a CSV file whose data I want to parse into lists.
So I am using the Python csv module to read it,
basically as follows:
import csv
fin = csv.reader(open(path,'rb'),delimiter=' ',quotechar='|')
print fin[0]
#gives the following
['"1239","2249.00","1","3","2011-02-20"']
#lets say i do the following
ele = str(fin[0])
ele = ele.strip().split(',')
print ele
#gives me following
['[\'"1239"', '"2249.00"', '"1"', '"3"', '"2011-02-20"\']']
Now
ele[0] gives me --> output---> ['"1239"
How do I get rid of that ['?
In the end, what I want is to get 1239 and convert it to an integer.
Any clues why this is happening?
Thanks
Edit: Never mind, resolved thanks to the first comment.

Change your delimiter to ',' and you will get a list of those values from the csv reader.
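
A short sketch of what that looks like, using the row from the question as an in-memory line rather than a file (the variable names are illustrative):

```python
import csv

lines = ['"1239","2249.00","1","3","2011-02-20"']

# With delimiter=',' and the default quotechar '"', the reader splits the
# fields and strips the surrounding quotes.
rows = list(csv.reader(lines, delimiter=','))
order_id = int(rows[0][0])
print(rows[0], order_id)
```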

It's because you are converting a list to a string; there is no need to do this. Grab the first element of the list (in this case it is a string) and parse that:
>>> a = ['"1239","2249.00","1","3","2011-02-20"']
>>> a
['"1239","2249.00","1","3","2011-02-20"']
>>> a[0]
'"1239","2249.00","1","3","2011-02-20"'
>>> b = a[0].replace('"', '').split(',')
>>> b[-1]
'2011-02-20'
Of course, before you apply the replace and split string methods, you should check that the value is a string, or handle the exception if it isn't.
Also, Blahdiblah is correct: your delimiter is probably wrong.
