Why does csv.DictReader skip empty lines?

It seems that csv.DictReader skips empty lines, even when restval is set. Using the following, empty lines in the input file are skipped:
import csv
import sys
CSV_FIELDS = ("field1", "field2", "field3")
for row in csv.DictReader(open("f"), fieldnames=CSV_FIELDS, restval=""):
    if not row or not row[CSV_FIELDS[0]]:
        sys.exit("never reached, why?")
Where file f is (with two empty lines between the two rows):
1,2,3
a,b,c

Inside the csv.DictReader class:
        # unlike the basic reader, we prefer not to return blanks,
        # because we will typically wind up with a dict full of None
        # values
        while row == []:
            row = self.reader.next()
So empty rows are skipped.
If you don't want to skip empty lines, you could instead use csv.reader.
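The difference between the two readers can be checked with a minimal sketch like the one below. It assumes Python 3 and uses an in-memory string in place of the file f; the variable names are only illustrative.

```python
import csv
import io

# Stand-in for file f: two data rows separated by two empty lines.
data = "1,2,3\n\n\na,b,c\n"
fields = ("field1", "field2", "field3")

# csv.reader returns an empty list for each empty line.
plain_rows = list(csv.reader(io.StringIO(data)))

# csv.DictReader drops those empty rows before building dicts.
dict_rows = list(csv.DictReader(io.StringIO(data), fieldnames=fields, restval=""))

print(plain_rows)      # includes [] for each empty line
print(len(dict_rows))  # only the non-empty rows remain
```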
Another option is to subclass csv.DictReader:
# Python 2
import csv
CSV_FIELDS = ("field1", "field2", "field3")
class MyDictReader(csv.DictReader):
    def next(self):
        if self.line_num == 0:
            # Used only for its side effect.
            self.fieldnames
        row = self.reader.next()
        self.line_num = self.reader.line_num
        d = dict(zip(self.fieldnames, row))
        lf = len(self.fieldnames)
        lr = len(row)
        if lf < lr:
            d[self.restkey] = row[lf:]
        elif lf > lr:
            for key in self.fieldnames[lr:]:
                d[key] = self.restval
        return d
for row in MyDictReader(open("f", 'rb'), fieldnames=CSV_FIELDS, restval=""):
    print(row)
yields
{'field2': '2', 'field3': '3', 'field1': '1'}
{'field2': '', 'field3': '', 'field1': ''}
{'field2': '', 'field3': '', 'field1': ''}
{'field2': 'b', 'field3': 'c', 'field1': 'a'}
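The subclass above targets Python 2. A rough Python 3 equivalent might look like the sketch below, which overrides `__next__` instead of `next`. The class name KeepBlankDictReader is made up for this example, and the code relies on the `reader`, `restkey` and `restval` attributes that CPython's DictReader sets, so treat it as an assumption-laden sketch rather than a supported API.

```python
import csv
import io

CSV_FIELDS = ("field1", "field2", "field3")


class KeepBlankDictReader(csv.DictReader):
    """Hypothetical name: like DictReader, but empty lines are not skipped."""

    def __next__(self):
        if self.line_num == 0:
            # Property access, used only for its side effect
            # (reads the header row when no fieldnames were given).
            self.fieldnames
        row = next(self.reader)
        self.line_num = self.reader.line_num
        # No "while row == []" loop here, so empty rows fall through.
        d = dict(zip(self.fieldnames, row))
        lf = len(self.fieldnames)
        lr = len(row)
        if lf < lr:
            d[self.restkey] = row[lf:]
        elif lf > lr:
            for key in self.fieldnames[lr:]:
                d[key] = self.restval
        return d


# In-memory stand-in for file f.
data = io.StringIO("1,2,3\n\n\na,b,c\n")
rows = list(KeepBlankDictReader(data, fieldnames=CSV_FIELDS, restval=""))
for row in rows:
    print(row)
```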

Unutbu already pointed out the reason this happens. A quick fix is to replace empty lines with ',' before passing them to DictReader; restval then takes care of the rest.
import csv
CSV_FIELDS = ("field1", "field2", "field3")
with open('test.csv') as f:
    lines = (',' if line.isspace() else line for line in f)
    for row in csv.DictReader(lines, fieldnames=CSV_FIELDS, restval=""):
        print(row)
#output
{'field2': '2', 'field3': '3', 'field1': '1'}
{'field2': '', 'field3': '', 'field1': ''}
{'field2': '', 'field3': '', 'field1': ''}
{'field2': 'b', 'field3': 'c', 'field1': 'a'}
Update:
The code above breaks when a quoted value spans several lines and contains empty lines, because those lines would be replaced as well. In that case you can use csv.reader like this:
RESTVAL = ''
with open('test.csv') as f:
    for row in csv.reader(f, quotechar='"'):
        if not row:
            # Don't use `dict.fromkeys` if RESTVAL is a mutable object
            # {k: RESTVAL for k in CSV_FIELDS}
            print(dict.fromkeys(CSV_FIELDS, RESTVAL))
        else:
            print({k: v if v else RESTVAL for k, v in zip(CSV_FIELDS, row)})
If file contains:
1,2,"
4"
a,b,c
then the output will be:
{'field2': '2', 'field3': '\n\n\n4', 'field1': '1'}
{'field2': '', 'field3': '', 'field1': ''}
{'field2': '', 'field3': '', 'field1': ''}
{'field2': 'b', 'field3': 'c', 'field1': 'a'}
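The same csv.reader idea can be wrapped in a small generator. This is only a sketch assuming Python 3: the helper name rows_as_dicts is made up, and unlike the snippet above it only pads short rows with restval and leaves empty strings that are already present untouched.

```python
import csv
import io


def rows_as_dicts(lines, fieldnames, restval=""):
    """Yield one dict per CSV row, including rows that came from empty lines."""
    for row in csv.reader(lines):
        # Pad short (or empty) rows so every field name gets a value.
        padded = row + [restval] * (len(fieldnames) - len(row))
        yield dict(zip(fieldnames, padded))


CSV_FIELDS = ("field1", "field2", "field3")
# A quoted multi-line value, then two empty lines, then a normal row.
data = io.StringIO('1,2,"\n\n\n4"\n\n\na,b,c\n')
rows = list(rows_as_dicts(data, CSV_FIELDS))
for row in rows:
    print(row)
```

Because the csv module does the line splitting itself, the empty lines inside the quoted value stay part of field3 and are not mistaken for empty rows.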

This is your file with commas added on the empty lines:
1,2,3
,,
,,
a,b,c
After adding the commas, DictReader returns the two empty rows as {'field2': '', 'field3': '', 'field1': ''}.
The restval argument only says which value to use when you have set the field names but a row has fewer values than fields: the missing fields get restval.
Here you set three fields and, with the commas, every row has three values. Note that restval is about missing columns, not missing lines.
Your lines were completely empty, so DictReader skipped them. Unless you add commas to show that a line holds empty values, DictReader will not return it.
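To make the role of restval concrete, here is a short sketch assuming Python 3 and an in-memory file. The placeholder value "MISSING" is chosen only so that the filled-in field is easy to spot.

```python
import csv
import io

fields = ("field1", "field2", "field3")
# Row 1 is one value short, row 2 has three empty values, row 3 is an empty line.
data = io.StringIO("1,2\n,,\n\n")

rows = list(csv.DictReader(data, fieldnames=fields, restval="MISSING"))
for row in rows:
    print(row)
```

Only the short first row receives restval. The ",," row already has three (empty) values, and the fully empty line is skipped before restval ever applies.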

Related

Convert tab delimited data into dictionary

I'm trying to convert an array with a dictionary to a flattened dictionary and export it to a JSON file. I have an initial tab-delimited file, and have tried multiple ways but not coming to the final result. If there is more than one row present then save these as arrays in the dictionary
Name file code file_location
TESTLIB1 443 123 location1
TESTLIB2 444 124 location2
Current Output:
{'library': 'TESTLIB2', 'file': '444', 'code': '124', 'file_location': 'location2'}
Desired Output if num_lines > 1:
{'library': ['TESTLIB1', 'TESTLIB2'], 'file': ['443', '444'], 'code': ['123', '124'], 'file_location': ['location1', 'location2']}
Code Snippet
data_dict = {}
with open('file.tmp') as input:
    reader = csv.DictReader(input, delimiter='\t')
    num_lines = sum(1 for line in open('write_object.tmp'))
    for row in reader:
        data_dict.update(row)
        if num_lines > 1:
            data_dict.update(row)
with open('output.json', 'w') as output:
    output.write(json.dumps(data_dict))
print(data_dict)

Create a list for each column and iterate over the rows, appending each value to its column's list:
import csv
import json
# read file
d = {}
with open('write_object.tmp') as f:
    reader = csv.reader(f, delimiter='\t')
    headers = next(reader)
    for head in headers:
        d[head] = []
    for row in reader:
        for i, head in enumerate(headers):
            d[head].append(row[i])
# save as json file
with open('output.json', 'w') as f:
    json.dump(d, f)
output:
{'Name': ['TESTLIB1', 'TESTLIB2'],
'file': ['443', '444'],
'code': ['123', '124'],
'file_location': ['location1', 'location2']}

import csv
from collections import defaultdict
data_dict = defaultdict(list)
with open('input-file') as inp:
    for row in csv.DictReader(inp, delimiter='\t'):
        for key, val in row.items():
            data_dict[key].append(val)
print(data_dict)
# output
{'Name': ['TESTLIB1', 'TESTLIB2'],
'file': ['443', '444'],
'code': ['123', '124'],
'file_location': ['location1', 'location2']}
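Both answers always produce lists, whereas the question asks for lists only when there is more than one data row. One possible way to cover that, sketched here under the assumption of Python 3, is to collapse single-element lists afterwards. The function name columns_from_tsv is made up for this example, and it reads from an in-memory string instead of the real file.

```python
import csv
import io
import json
from collections import defaultdict


def columns_from_tsv(lines):
    """Collect each column into a list; keep plain strings if there is one row."""
    columns = defaultdict(list)
    for row in csv.DictReader(lines, delimiter="\t"):
        for key, value in row.items():
            columns[key].append(value)
    # A single data row is returned as scalars rather than one-element lists.
    return {
        key: (values if len(values) > 1 else values[0])
        for key, values in columns.items()
    }


one_row = "Name\tfile\tcode\tfile_location\nTESTLIB1\t443\t123\tlocation1\n"
two_rows = one_row + "TESTLIB2\t444\t124\tlocation2\n"

single = columns_from_tsv(io.StringIO(one_row))
multiple = columns_from_tsv(io.StringIO(two_rows))
print(json.dumps(single))
print(json.dumps(multiple))
```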

Remove HTML tags and parse JSON array into key/value object

I'm working with a JSON array payload where I want to extract it into a separate object for processing downstream.
The payload is dynamic and can have multiple nested levels in the JSON array, but the first level will always have an id field which is the unique identifier.
[{'field1': [],
'field2': {'field2_1': string,
'field2_2': 'string',
'field2_3': 'string'},
'field3': '<html> strings <html>'},
'id':1}
{'field1': [],
'field2': {'field2_1': string,
'field2_2': 'string',
'field2_3': 'string'},
'field3': '<html> strings <html>'},
'id':2}
{'field1': [],
'field2': {'field2_1': string,
'field2_2': 'string',
'field2_3': 'string'},
'field3': '<html> strings <html>'},
'id':3},
{'field1': [],
'field2': {'field2_1': string,
'field2_2': 'string',
'field2_3': 'string'},
'field3': '<html> strings <html>'},
'id':4}]
The payload is not limited to this structure in terms of more fields or more nested fields with a different type of data. But the id field will always be attached to each of the objects in the payload. I want to create a dictionary(open to other suggestions for the data type) with the id field and everything else in that object as a cleaned-up string, without any of the brackets or HTML tags, etc.
The output should be(depending on data type) something like this:
{1: string string string strings,
2: string string string strings,
3: string string string strings,
4: string string string strings}
This is a very generic example. I'm having trouble navigating the JSON array with all the nesting and content and would just like to extract the id and the rest of the content in a clean manner. Any help is appreciated!

You can use BeautifulSoup to strip all tags from the string. For example:
from bs4 import BeautifulSoup
lst = [{'field1': [],
'field2': {'field2_1': 'string1',
'field2_2': 'string2',
'field2_3': 'string3'},
'field3': '<html> strings4 <html>',
'id':1},
{'field1': [],
'field2': {'field2_1': 'string1',
'field2_2': 'string2',
'field2_3': 'string3'},
'field3': '<html> strings4 <html>',
'id':2},
{'field1': [],
'field2': {'field2_1': 'string1',
'field2_2': 'string2',
'field2_3': 'string3'},
'field3': '<html> strings4 <html>',
'id':3},
{'field1': [],
'field2': {'field2_1': 'string1',
'field2_2': 'string2',
'field2_3': 'string3'},
'field3': '<html> strings4 <html>',
'id':4}]
def flatten(d):
    if isinstance(d, dict):
        for v in d.values():
            yield from flatten(v)
    elif isinstance(d, list):
        for v in d:
            yield from flatten(v)
    elif isinstance(d, str):
        yield d
out = {}
for d in lst:
    out[d['id']] = ' '.join(map(str.strip, BeautifulSoup(' '.join(flatten(d)), 'html.parser').find_all(text=True)))
print(out)
Prints:
{1: 'string1 string2 string3 strings4', 2: 'string1 string2 string3 strings4', 3: 'string1 string2 string3 strings4', 4: 'string1 string2 string3 strings4'}
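If installing BeautifulSoup is not an option, a similar result can probably be reached with the standard library's html.parser module. This is a swapped-in technique, not the answer's method: the names TextCollector and strip_tags are invented for the sketch, and it is only checked against simple markup like the sample.

```python
from html.parser import HTMLParser


class TextCollector(HTMLParser):
    """Hypothetical helper: collects text nodes and ignores all tags."""

    def __init__(self):
        super().__init__()
        self.chunks = []

    def handle_data(self, data):
        self.chunks.append(data)


def strip_tags(markup):
    parser = TextCollector()
    parser.feed(markup)
    parser.close()  # flush any text still buffered by the parser
    # Normalise whitespace so the pieces are separated by single spaces.
    return " ".join(" ".join(parser.chunks).split())


def flatten(value):
    """Yield every string found in nested dicts and lists."""
    if isinstance(value, dict):
        for v in value.values():
            yield from flatten(v)
    elif isinstance(value, list):
        for v in value:
            yield from flatten(v)
    elif isinstance(value, str):
        yield value


payload = [
    {"field1": [],
     "field2": {"field2_1": "string1", "field2_2": "string2"},
     "field3": "<html> strings4 <html>",
     "id": 1},
]
out = {item["id"]: strip_tags(" ".join(flatten(item))) for item in payload}
print(out)
```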

Loading values to nested dicts from a list

I have a CSV file which has a header like this:
cpus/0/compatible clocks/HSE/compatible ../frequency memories/flash/compatible ../address ../size [and so on...]
I'm able to parse that header into a nested dictionary which may look like this:
{'clocks': {'HSE': {'compatible': '[1]',
'frequency': '[2]'}},
'cpus': {'0': {'compatible': '[0]'}},
'memories': {'bkpsram': {'address': '[13]',
'compatible': '[12]',
'size': '[14]'},
'ccm': {'address': '[7]',
'compatible': '[6]',
'size': '[8]'},
'flash': {'address': '[4]',
'compatible': '[3]',
'size': '[5]'},
'sram': {'address': '[10]',
'compatible': '[9]',
'size': '[11]'}},
'pin-controller': {'GPIOA': {'enabled': '[16]'},
'GPIOB': {'enabled': '[17]'},
'GPIOC': {'enabled': '[18]'},
'GPIOD': {'enabled': '[19]'},
'GPIOE': {'enabled': '[20]'},
'GPIOF': {'enabled': '[21]'},
'GPIOG': {'enabled': '[22]'},
'GPIOH': {'enabled': '[23]'},
'GPIOI': {'enabled': '[24]'},
'GPIOJ': {'enabled': '[25]'},
'GPIOK': {'enabled': '[26]'},
'compatible': '[15]'}}
(it is a dict object, printed with pprint())
The values of keys which look like '[<number>]' reflect the index of column in the CSV file from which the data should be loaded.
As I mainly use C/C++ I would actually love to have pointers/references in Python, as then I would just put a pointer to a list element in each value and for each row I could modify list contents, but I think there's no way to obtain such behaviour easily in Python.
So now I plan to dump this dictionary into a string and perform following 3 modifications in a row:
replace { with {{,
replace } with }},
replace '[<number>]' with {<number>}.
After that I will be able to "load" the data with something like this ast.literal_eval(dictAsStr.format(*rowFromCsv)), but it seems like a waste of time to convert the whole dict to a string and then back to a dict...
Am I missing some other obvious solution here? The format of the CSV and the way I load the header is not fixed, I may alter that easily, but I would really like a solution which would not boil down to "visit each key recursively and load appropriate value from current row manually".
From the CSV file I load each row as a list of strings, for example:
['["ARM,Cortex-M4", "ARM,ARMv7-M"]',
'["ST,STM32-HSE", "fixed-clock"]',
'0',
'["on-chip-flash"]',
'0x8000000',
'131072',
'',
'',
'',
'["on-chip-ram"]',
'0x20000000',
'65536',
'',
'',
'',
'["ST,STM32-GPIOv2-pin-controller"]',
'False',
'False',
'False',
'',
'',
'',
'',
'False',
'',
'',
'']
Now I would like to insert the values from each loaded row (list of strings) into appropriate keys in the nested dictionary, so following with the examples above I would like to get:
{'clocks': {'HSE': {'compatible': '["ST,STM32-HSE", "fixed-clock"]',
'frequency': '0'}},
'cpus': {'0': {'compatible': '["ARM,Cortex-M4", "ARM,ARMv7-M"]'}},
'memories': {'bkpsram': {'address': '',
'compatible': '',
'size': ''},
'ccm': {'address': '',
'compatible': '',
'size': ''},
'flash': {'address': '0x8000000',
'compatible': '["on-chip-flash"]',
'size': '131072'},
'sram': {'address': '0x20000000',
'compatible': '["on-chip-ram"]',
'size': '65536'}},
'pin-controller': {'GPIOA': {'enabled': 'False'},
'GPIOB': {'enabled': 'False'},
'GPIOC': {'enabled': 'False'},
'GPIOD': {'enabled': ''},
'GPIOE': {'enabled': ''},
'GPIOF': {'enabled': ''},
'GPIOG': {'enabled': ''},
'GPIOH': {'enabled': 'False'},
'GPIOI': {'enabled': ''},
'GPIOJ': {'enabled': ''},
'GPIOK': {'enabled': ''},
'compatible': '["ST,STM32-GPIOv2-pin-controller"]'}}
For completeness, here are a few first lines from the CSV file I would like to load. The first column is not part of the dictionary presented above, as it is used for indexing.
chip,cpus/0/compatible,clocks/HSE/compatible,../frequency,memories/flash/compatible,../address,../size,memories/ccm/compatible,../address,../size,memories/sram/compatible,../address,../size,memories/bkpsram/compatible,../address,../size,pin-controller/compatible,pin-controller/GPIOA/enabled,pin-controller/GPIOB/enabled,pin-controller/GPIOC/enabled,pin-controller/GPIOD/enabled,pin-controller/GPIOE/enabled,pin-controller/GPIOF/enabled,pin-controller/GPIOG/enabled,pin-controller/GPIOH/enabled,pin-controller/GPIOI/enabled,pin-controller/GPIOJ/enabled,pin-controller/GPIOK/enabled
STM32F401CB,"[""ARM,Cortex-M4"", ""ARM,ARMv7-M""]","[""ST,STM32-HSE"", ""fixed-clock""]",0,"[""on-chip-flash""]",0x8000000,131072,,,,"[""on-chip-ram""]",0x20000000,65536,,,,"[""ST,STM32-GPIOv2-pin-controller""]",False,False,False,,,,,False,,,
STM32F401CC,"[""ARM,Cortex-M4"", ""ARM,ARMv7-M""]","[""ST,STM32-HSE"", ""fixed-clock""]",0,"[""on-chip-flash""]",0x8000000,262144,,,,"[""on-chip-ram""]",0x20000000,65536,,,,"[""ST,STM32-GPIOv2-pin-controller""]",False,False,False,,,,,False,,,
STM32F401CD,"[""ARM,Cortex-M4"", ""ARM,ARMv7-M""]","[""ST,STM32-HSE"", ""fixed-clock""]",0,"[""on-chip-flash""]",0x8000000,393216,,,,"[""on-chip-ram""]",0x20000000,98304,,,,"[""ST,STM32-GPIOv2-pin-controller""]",False,False,False,,,,,False,,,
The code used to parse the header:
import csv
with open("some-path-to-CSV-file") as csvFile:
    csvReader = csv.reader(csvFile)
    header = next(csvReader)
    previousKeyElements = header[1].split('/')
    dictionary = {}
    for index, key in enumerate(header[1:]):
        keyElements = key.split('/')
        i = 0
        while keyElements[i] == '..':
            i += 1
        keyElements[0:i] = previousKeyElements[0:-i]
        previousKeyElements = keyElements
        node = dictionary
        for keyElement in keyElements[:-1]:
            node = node.setdefault(keyElement, {})
        node[keyElements[-1]] = '[{}]'.format(index)

What about just using the actual row index (as an integer) as the value in the "parsed" header, i.e.:
{'clocks': {'HSE': {'compatible': 1,
'frequency': 2}},
# etc
Then use recursion on a copy of the parsed header to populate it from the row values:
import csv
import sys
import copy
import pprint
def parse_header(header):
    previousKeyElements = header[1].split('/')
    dictionary = {}
    for index, key in enumerate(header[1:]):
        keyElements = key.split('/')
        i = 0
        while keyElements[i] == '..':
            i += 1
        keyElements[0:i] = previousKeyElements[0:-i]
        previousKeyElements = keyElements
        node = dictionary
        for keyElement in keyElements[:-1]:
            node = node.setdefault(keyElement, {})
        node[keyElements[-1]] = index
    return dictionary
def _rparse(d, k, v, row):
    if isinstance(v, dict):
        for subk, subv in v.items():
            _rparse(v, subk, subv, row)
    elif isinstance(v, int):
        d[k] = row[v]
    else:
        raise ValueError("'v' should be either a dict or an int (got : %s(%s))" % (type(v), v))
def parse_row(header, row):
    struct = copy.deepcopy(header)
    for k, v in struct.items():
        _rparse(struct, k, v, row)
    return struct
def main(*args):
    path = args[0]
    with open(path) as f:
        reader = csv.reader(f)
        header = parse_header(next(reader))
        results = [parse_row(header, row[1:]) for row in reader]
    pprint.pprint(results)
if __name__ == "__main__":
    main(*sys.argv[1:])
Another solution (that might actually be faster) would be to build a reverse mapping with row indices as keys and the dict "path" as values, i.e.:
{0: ("cpus", "0", "compatible"),
1: ("clocks", "HSE", "compatible"),
2: ("clocks", "HSE", "frequency"),
# etc
}
and then:
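def parse_row(template, mapping, row):
    # 'template' is your parsed header dict,
    # 'mapping' is the reverse mapping shown above
    struct = copy.deepcopy(template)
    for index, path in mapping.items():
        # restart from the top of the structure for every path
        target = struct
        for key in path[:-1]:
            target = target[key]
        target[path[-1]] = row[index]
    return struct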
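The answer does not show how the reverse mapping is built. One way it might be done is sketched below, reusing the '..' handling from the question's header parser. The names build_paths and fill_row are made up for this sketch, and the header passed in is assumed to already exclude the leading chip column. This variant builds the nested dict from scratch with setdefault, so no template copy is needed.

```python
def build_paths(header):
    """Map each column index to its full key path, resolving leading '..' parts."""
    paths = {}
    previous = []
    for index, key in enumerate(header):
        parts = key.split('/')
        ups = 0
        while parts[ups] == '..':
            ups += 1
        if ups:
            # Replace the '..' parts with the matching prefix of the previous path.
            parts[:ups] = previous[:-ups]
        previous = parts
        paths[index] = tuple(parts)
    return paths


def fill_row(paths, row):
    """Build a nested dict for one row using the index -> path mapping."""
    result = {}
    for index, path in paths.items():
        node = result
        for key in path[:-1]:
            node = node.setdefault(key, {})
        node[path[-1]] = row[index]
    return result


header = ["cpus/0/compatible", "clocks/HSE/compatible", "../frequency"]
paths = build_paths(header)
parsed = fill_row(paths, ['["ARM,Cortex-M4"]', '["ST,STM32-HSE"]', "0"])
print(paths)
print(parsed)
```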
Oh and yes, as an added bonus, you may want to use ast.literal_eval() to turn your values into proper python types:
>>> import ast
>>> ast.literal_eval("False")
False
>>> ast.literal_eval('["on-chip-flash"]')
['on-chip-flash']
>>> ast.literal_eval('0x8000000')
134217728
>>> ast.literal_eval('["ARM,Cortex-M4", "ARM,ARMv7-M"]')
['ARM,Cortex-M4', 'ARM,ARMv7-M']
>>> ast.literal_eval("this should fail")
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "/usr/lib/python2.7/ast.py", line 49, in literal_eval
node_or_string = parse(node_or_string, mode='eval')
File "/usr/lib/python2.7/ast.py", line 37, in parse
return compile(source, filename, mode, PyCF_ONLY_AST)
File "<unknown>", line 1
this should fail
^
SyntaxError: invalid syntax
>>> def to_python(value):
...     try:
...         return ast.literal_eval(value)
...     except Exception as e:
...         return value
...
>>> to_python('["on-chip-flash"]')
['on-chip-flash']
>>> to_python('wtf')
'wtf'
>>>

creating a python dictionary from a list and comma separated line

Is there a more elegant way in Python to create a dictionary from a list and a comma-separated line besides a loop?
my_master_list = ["ABC", "DEF", "GHI"]
my_list = ["field1", "field2", "field3"]
my_line = "test1,test2,test3"
my_dict = {}
for x in my_master_list:
    my_dict[x] = {}
    line_parts = my_line.split(",")
    n = 0
    for y in my_list:
        my_dict[x][y] = line_parts[n]
        n += 1
print(my_dict)
# {'ABC': {'field2': 'test2', 'field3': 'test3', 'field1': 'test1'}, 'GHI': {'field2': 'test2', 'field3': 'test3', 'field1': 'test1'}, 'DEF': {'field2': 'test2', 'field3': 'test3', 'field1': 'test1'}}

You can use zip with a dictionary comprehension:
# construct the inner dictionary
d = dict(zip(my_list, my_line.split(",")))
# construct the outer dictionary, if you don't want to make copies, you can use
# {master_key: d ... } directly here just keep in mind they are referring to the same
# object in this way
{master_key: d.copy() for master_key in my_master_list}
#{'ABC': {'field1': 'test1', 'field2': 'test2', 'field3': 'test3'},
# 'DEF': {'field1': 'test1', 'field2': 'test2', 'field3': 'test3'},
# 'GHI': {'field1': 'test1', 'field2': 'test2', 'field3': 'test3'}}

d = {x:dict(zip(my_list, my_line.split(','))) for x in my_master_list}
       ^    ^            ^
       |    |            [1]--- creates a list from the string
       |    |
       |    [2]--- pairs the two lists into (key, value) tuples
       |
       [3]--- creates a dictionary from the tuples (key, value)
    ^
    |
    [4] The overall expression is a dictionary comprehension.
Read about dict comprehensions in PEP 274.

Python script reading from a csv file [duplicate]

"Type","Name","Description","Designation","First-term assessment","Second-term assessment","Total"
"Subject","Nick","D1234","F4321",10,19,29
"Unit","HTML","D1234-1","F4321",18,,
"Topic","Tags","First Term","F4321",18,,
"Subtopic","Review of representation of HTML",,,,,
All of the above are values from an Excel sheet that was converted to CSV.
As you can see, the header contains seven columns, while the amount of data in the rows below it varies.
I have this Python script to read these values; the script is below:
from django.db import transaction
import sys
import csv
import StringIO
file = sys.argv[1]
no_cols_flag=0
flag=0
header_arr=[]
print file
f = open(file, 'r')
while (f.readline() != ""):
    for i in [line.split(',') for line in open(file)]: # split on the separator
        print "==========================================================="
        row_flag=0
        row_d=""
        for j in i: # for each token in the split string
            row_flag=1
            print j
            if j:
                no_cols_flag=no_cols_flag+1
                data=j.strip()
                print j
    break
How can I modify the above script so that it reports which column header each value belongs to?
Thanks.

You're importing the csv module but never use it. Why?
If you do
import csv
reader = csv.reader(open(file, "rb"), dialect="excel") # Python 2.x
# Python 3: reader = csv.reader(open(file, newline=""), dialect="excel")
you get a reader object that will contain all you need; the first row will contain the headers, and the subsequent rows will contain the data in the corresponding places.
Even better might be (if I understand you correctly):
import csv
reader = csv.DictReader(open(file, "rb"), dialect="excel") # Python 2.x
# Python 3: reader = csv.DictReader(open(file, newline=""), dialect="excel")
This DictReader can be iterated over, returning a sequence of dicts that use the column header as keys and the following data as values, so
for row in reader:
print(row)
will output
{'Name': 'Nick', 'Designation': 'F4321', 'Type': 'Subject', 'Total': '29', 'First-term assessment': '10', 'Second-term assessment': '19', 'Description': 'D1234'}
{'Name': 'HTML', 'Designation': 'F4321', 'Type': 'Unit', 'Total': '', 'First-term assessment': '18', 'Second-term assessment': '', 'Description': 'D1234-1'}
{'Name': 'Tags', 'Designation': 'F4321', 'Type': 'Topic', 'Total': '', 'First-term assessment': '18', 'Second-term assessment': '', 'Description': 'First Term'}
{'Name': 'Review of representation of HTML', 'Designation': '', 'Type': 'Subtopic', 'Total': '', 'First-term assessment': '', 'Second-term assessment': '', 'Description': ''}
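For reference, a self-contained Python 3 sketch of the same idea is shown below. It reads the first rows of the sample from an in-memory string rather than from sys.argv[1], and prints only the cells that actually hold a value together with their column header.

```python
import csv
import io

sample = '''"Type","Name","Description","Designation","First-term assessment","Second-term assessment","Total"
"Subject","Nick","D1234","F4321",10,19,29
"Unit","HTML","D1234-1","F4321",18,,
'''

# newline="" is what the csv docs recommend for file objects passed to the reader.
reader = csv.DictReader(io.StringIO(sample, newline=""), dialect="excel")
rows = list(reader)
for row in rows:
    # Keep only the columns that are filled in for this row.
    filled = {header: value for header, value in row.items() if value}
    print(filled)
```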
