DictReader, No quotes, tabbed file - python

I have a csv file that looks like this:
Please note, there are no quotes, a tab (\t) is the delimiter, and there is a blank line between the header and the actual content.
Facility No Testing No Name Age
252 2351 Jackrabbit, Jazz 15
345 257 Aardvark, Ethel 41
I think I've tried nearly every possible combination of ideas and parameters:
f = open('/tmp/test', 'r')
csvFile = f.read()
reader = csv.DictReader(csvFile, delimiter='\t', quoting=csv.QUOTE_NONE)
print reader.fieldnames
the result of the print is:
['F']
How can I get this into something I can parse to put into a database?
Getting it into a dictionary would be helpful.

What is your csvFile? Is it a string representing your filename starting with 'F'?
csv.DictReader needs an opened file object, not a filename.
Try:
with open(csvFile, 'rb') as f:
    reader = csv.DictReader(f, delimiter='\t', quoting=csv.QUOTE_NONE)
    print reader.fieldnames
EDIT
If your csvFile is a string containing the whole data, you will have to convert it into a StringIO (because csv can access only file-like objects, not strings).
Try:
from cStringIO import StringIO
# csvFile = 'Facility No\tTesting No\tName\tAge\n\n252\t2351\tJackrabbit, Jazz\t15\n345\t257\tAardvark, Ethel\t41\n'
reader = csv.DictReader(StringIO(csvFile), delimiter='\t', quoting=csv.QUOTE_NONE)
print reader.fieldnames
Or, if your edited question opens and reads a file:
with open('/tmp/test', 'rb') as f:
    reader = csv.DictReader(f, delimiter='\t', quoting=csv.QUOTE_NONE)
    print reader.fieldnames
This works for me.
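Putting it together for the file described in the question, a minimal sketch (the path and the final print are my assumptions) would be:
import csv

with open('/tmp/test', 'rb') as f:
    reader = csv.DictReader(f, delimiter='\t', quoting=csv.QUOTE_NONE)
    rows = [row for row in reader]  # one dict per data line
print rows
Note that DictReader skips completely blank rows, so the empty line between the header and the data should not get in the way.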

This might work for you, at least as a start:
>>> import csv
>>> input = open('/tmp/csvtemp.csv')
>>> csvin = csv.reader(input, delimiter='\t')
>>> data = [row for row in csvin]
>>> header = data.pop(0)
>>> data.pop(0) # skip blank line
[]
>>> for row in data:
...     rowdict = dict(zip(header, row))
...     print rowdict
...
{'Age': '15', 'Testing No': '2351', 'Name': 'Jackrabbit, Jazz', 'Facility No': '252'}
{'Age': '41', 'Testing No': '257', 'Name': 'Aardvark, Ethel', 'Facility No': '345'}
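Since the end goal is a database, those row dicts can feed executemany directly. A sketch using sqlite3 (the table and column names here are made up; header and data are the variables from the session above):
import sqlite3

rowdicts = [dict(zip(header, row)) for row in data]
conn = sqlite3.connect('/tmp/test.db')
conn.execute('CREATE TABLE IF NOT EXISTS people '
             '(facility_no TEXT, testing_no TEXT, name TEXT, age TEXT)')
conn.executemany('INSERT INTO people VALUES (?, ?, ?, ?)',
                 [(d['Facility No'], d['Testing No'], d['Name'], d['Age'])
                  for d in rowdicts])
conn.commit()
conn.close()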

From the comments I understand that you get your data via urllib2. response is a file-like object; you could pass it directly to csv.DictReader:
response = urllib2.urlopen(URL)
reader = csv.DictReader(response, dialect=csv.excel_tab)
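Iterating that reader then yields one dict per line; a quick sketch (the URL is a placeholder):
import csv
import urllib2

URL = 'http://example.com/report.tsv'  # placeholder
response = urllib2.urlopen(URL)
reader = csv.DictReader(response, dialect=csv.excel_tab)
for row in reader:
    print row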

Adding a header and footer within a file

I am trying to attempt something that I have not done before in Python.
The code below collects data from my test database and puts it into a text file under my headers of 'Test1', 'Test2', 'Test3'. This is working fine.
What I am trying to attempt now is to add a header (on top of the current header) and a footer to the file.
Python code:
import csv

file = 'file.txt'
header_names = {'t1': 'Test1', 't2': 'Test2', 't3': 'Test3'}
with open(file, 'w', newline='') as f:
    w = csv.DictWriter(f, fieldnames=header_names.keys(), restval='', extrasaction='ignore')
    w.writerow(header_names)
    for doc in res['test']['test']:  # res comes from the test database query
        my_dict = doc['test']
        w.writerow(my_dict)
Current file output using the above code:
file.txt
Test1,Test2,Test3
Bob,john,Male
Cat,Long,female
Dog,Short,Male
Case,Fast,Male
Nice,who,Male
Ideal txt output:
{header}
Filename:file.txt
date:
{data}
Test1,Test2,Test3
Bob,john,Male
Cat,Long,female
Dog,Short,Male
Case,Fast,Male
Nice,who,Male
{Footer}
this file was generated by using python.
The {header}, {data} and {footer} markers are not needed within the file; they are just to make clear what is needed. I hope this makes sense.
Something like this:
import csv
from datetime import date

# prepare some sample data
data = [['Bob', 'John', 'Male'],
        ['Cat', 'Long', 'Female']]
fieldnames = ['test1', 'test2', 'test3']
data = [dict(zip(fieldnames, row)) for row in data]

# actual part that writes to a file
with open('spam.txt', 'w', newline='') as f:
    f.write('filename:spam.txt\n')
    f.write(f'date:{date.today().strftime("%Y%m%d")}\n\n')
    wrtr = csv.DictWriter(f, fieldnames=fieldnames)
    wrtr.writeheader()
    wrtr.writerows(data)
    f.write('\nwritten with python\n')
Output in the file:
filename:spam.txt
date:20190321

test1,test2,test3
Bob,John,Male
Cat,Long,Female

written with python
Now, all that said, do you really need to write a header and footer? It will just break a nicely formatted csv file and require extra effort later on when reading it.
Or, if you prefer: is the csv format really what best suits your needs? Maybe using json would be better...
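For instance, a minimal json sketch of the same sample data (the file name and key names are just placeholders) keeps the metadata and the rows in one self-describing structure:
import json
from datetime import date

data = [['Bob', 'John', 'Male'],
        ['Cat', 'Long', 'Female']]
fieldnames = ['test1', 'test2', 'test3']

payload = {'filename': 'spam.json',                        # header info
           'date': date.today().strftime('%Y%m%d'),
           'rows': [dict(zip(fieldnames, row)) for row in data],
           'generator': 'python'}                          # footer info

with open('spam.json', 'w') as f:
    json.dump(payload, f, indent=2)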
import csv
import datetime

vardate = datetime.datetime.now().strftime("%x")
file = 'file.txt'
header_names = {'t1': 'Test1', 't2': 'Test2', 't3': 'Test3'}
with open(file, 'w', newline='') as f:
    f.seek(0, 0)  # move cursor to the start of the file
    f.write("File Name: " + file + "\n")
    f.write("date: " + vardate + "\n")
    w = csv.DictWriter(f, fieldnames=header_names.keys(), restval='',
                       extrasaction='ignore')
    w.writerow(header_names)
    for doc in res['test']['test']:  # res comes from the test database query
        my_dict = doc['test']
        w.writerow(my_dict)
    f.seek(0, 2)  # move cursor to the end of the file
    f.write("This is generated using Python")

python csv, writing headers only once

So I have a program that creates a CSV from JSON.
First I load the json file:
f = open('Data.json')
data = json.load(f)
f.close()
Then I go through it looking for a specific keyword; if I find that keyword, I'll write everything related to it to a .csv file:
for item in data:
    if "light" in item:
        write_light_csv('light.csv', item)
This is my write_light_csv function:
def write_light_csv(filename, dic):
    with open(filename, 'a') as csvfile:
        headers = ['TimeStamp', 'light', 'Proximity']
        writer = csv.DictWriter(csvfile, delimiter=',', lineterminator='\n', fieldnames=headers)
        writer.writeheader()
        writer.writerow({'TimeStamp': dic['ts'], 'light': dic['light'], 'Proximity': dic['prox']})
I initially had wb+ as the mode, but that cleared everything each time the file was opened for writing. I replaced that with a, and now every time it writes, it adds a header. How do I make sure that the header is only written once?
You could check whether the file already exists, and if it does, skip writeheader(), since you're opening the file in append mode.
Something like this:
import os.path

file_exists = os.path.isfile(filename)
with open(filename, 'a') as csvfile:
    headers = ['TimeStamp', 'light', 'Proximity']
    writer = csv.DictWriter(csvfile, delimiter=',', lineterminator='\n', fieldnames=headers)
    if not file_exists:
        writer.writeheader()  # file doesn't exist yet, write a header
    writer.writerow({'TimeStamp': dic['ts'], 'light': dic['light'], 'Proximity': dic['prox']})
Just another way: in append mode the write position starts at the end of the file, so file.tell() is 0 only when the file is empty (or brand new):
with open(file_path, 'a') as file:
    w = csv.DictWriter(file, my_dict.keys())
    if file.tell() == 0:
        w.writeheader()
    w.writerow(my_dict)
You can check if the file is empty:
import csv
import os

headers = ['head1', 'head2']
for row in iterator:
    with open('file.csv', 'a') as f:
        file_is_empty = os.stat('file.csv').st_size == 0
        writer = csv.writer(f, lineterminator='\n')
        if file_is_empty:
            writer.writerow(headers)
        writer.writerow(row)
I would use some flag and run a check before writing headers! e.g.
flag = 0
def get_data(lst):
    global flag
    for i in lst:  # say, a list of urls
        respons = requests.get(i)
        respons = respons.content.encode('utf-8')
        respons = respons.replace('\\', '')
        print respons
        data = json.loads(respons)
        fl = codecs.open(r"C:\Users\TEST\Desktop\data1.txt", 'ab', encoding='utf-8')
        writer = csv.DictWriter(fl, data.keys())
        if flag == 0:
            writer.writeheader()
        writer.writerow(data)
        flag += 1
        print "You have written %s times" % (str(flag))
        fl.close()
get_data(urls)
Can you change the structure of your code and export the whole file at once?
def write_light_csv(filename, data):
    with open(filename, 'w') as csvfile:
        headers = ['TimeStamp', 'light', 'Proximity']
        writer = csv.DictWriter(csvfile, delimiter=',', lineterminator='\n', fieldnames=headers)
        writer.writeheader()
        for item in data:
            if "light" in item:
                writer.writerow({'TimeStamp': item['ts'], 'light': item['light'], 'Proximity': item['prox']})

write_light_csv('light.csv', data)
You can use the csv.Sniffer class to detect whether the file already starts with a header:
import csv

with open('my.csv', newline='') as csvfile:
    if csv.Sniffer().has_header(csvfile.read(1024)):
        pass  # header already present, skip writing it
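A fuller sketch of that idea (a sketch only: has_header is a heuristic and can guess wrong, it raises csv.Error on an empty or undecidable sample, and append_row with its arguments is made up for illustration):
import csv

def append_row(path, fieldnames, row):
    # let Sniffer inspect a sample of the existing file, if any
    try:
        with open(path, newline='') as f:
            has_header = csv.Sniffer().has_header(f.read(1024))
    except (OSError, csv.Error):  # missing, empty, or undecidable file
        has_header = False
    with open(path, 'a', newline='') as f:
        writer = csv.DictWriter(f, fieldnames=fieldnames)
        if not has_header:
            writer.writeheader()
        writer.writerow(row)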
While using Pandas (for storing DataFrame data to a CSV file): if you are using an index i to iterate over API calls that append data to the CSV file, just add this check before setting the header property:
if i > 0:
    dataset.to_csv('file_name.csv', index=False, mode='a', header=False)
else:
    dataset.to_csv('file_name.csv', index=False, mode='a', header=True)
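If no loop index is handy, the same idea can key off the file itself; a sketch, assuming dataset is a pandas DataFrame:
import os.path

def append_df(dataset, path='file_name.csv'):
    # write the header only when the file doesn't exist yet
    write_header = not os.path.isfile(path)
    dataset.to_csv(path, index=False, mode='a', header=write_header)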
Here's another example that only depends on Python's built-in csv package. This method checks that the header is what's expected, raising an error if it is not. It also handles the case where the file doesn't exist, or does exist but is empty, by writing the header. Hope this helps:
import csv
import os

def append_to_csv(path, fieldnames, rows):
    is_write_header = not os.path.exists(path) or _is_empty_file(path)
    if not is_write_header:
        _assert_field_names_match(path, fieldnames)
    _append_to_csv(path, fieldnames, rows, is_write_header)

def _is_empty_file(path):
    return os.stat(path).st_size == 0

def _assert_field_names_match(path, fieldnames):
    with open(path, 'r') as f:
        reader = csv.reader(f)
        header = next(reader)
    if header != fieldnames:
        raise ValueError(f'Incompatible header: expected {fieldnames}, '
                         f'but existing file has {header}')

def _append_to_csv(path, fieldnames, rows, is_write_header: bool):
    with open(path, 'a') as f:
        writer = csv.DictWriter(f, fieldnames=fieldnames)
        if is_write_header:
            writer.writeheader()
        writer.writerows(rows)
You can test this with the following code:
file_ = 'countries.csv'
fieldnames_ = ['name', 'area', 'country_code2', 'country_code3']
rows_ = [
    {'name': 'Albania', 'area': 28748, 'country_code2': 'AL', 'country_code3': 'ALB'},
    {'name': 'Algeria', 'area': 2381741, 'country_code2': 'DZ', 'country_code3': 'DZA'},
    {'name': 'American Samoa', 'area': 199, 'country_code2': 'AS', 'country_code3': 'ASM'}
]
append_to_csv(file_, fieldnames_, rows_)
If you run this once you get the following in countries.csv:
name,area,country_code2,country_code3
Albania,28748,AL,ALB
Algeria,2381741,DZ,DZA
American Samoa,199,AS,ASM
And if you run it twice you get the following (note, no second header):
name,area,country_code2,country_code3
Albania,28748,AL,ALB
Algeria,2381741,DZ,DZA
American Samoa,199,AS,ASM
Albania,28748,AL,ALB
Algeria,2381741,DZ,DZA
American Samoa,199,AS,ASM
If you then change the header in countries.csv and run the program again, you'll get a value error, like this:
ValueError: Incompatible header: expected ['name', 'area', 'country_code2', 'country_code3'], but existing file has ['not', 'right', 'fieldnames']

Leaking memory parsing TSV and writing CSV in Python

I'm writing a simple script in Python as a learning exercise. I have a TSV file I've downloaded from the Ohio Board of Elections, and I want to manipulate some of the data and write out a CSV file for import into another system.
My issue is that it's leaking memory like a sieve. On a single run of a 154MB TSV file it consumes 2GB of memory before I stop it.
The code is below; can someone please help me identify what I'm missing with Python?
import csv
import datetime
import re
def formatAddress(row):
    address = ''
    if str(row['RES_HOUSE']).strip():
        address += str(row['RES_HOUSE']).strip()
    if str(row['RES_FRAC']).strip():
        address += '-' + str(row['RES_FRAC']).strip()
    if str(row['RES STREET']).strip():
        address += ' ' + str(row['RES STREET']).strip()
    if str(row['RES_APT']).strip():
        address += ' APT ' + str(row['RES_APT']).strip()
    return address
vote_type_map = {
    'G': 'General',
    'P': 'Primary',
    'L': 'Special'
}
def formatRow(row, fieldnames):
    basic_dict = {
        'Voter ID': str(row['VOTER ID']).strip(),
        'Date Registered': str(row['REGISTERED']).strip(),
        'First Name': str(row['FIRSTNAME']).strip(),
        'Last Name': str(row['LASTNAME']).strip(),
        'Middle Initial': str(row['MIDDLE']).strip(),
        'Name Suffix': str(row['SUFFIX']).strip(),
        'Voter Status': str(row['STATUS']).strip(),
        'Current Party Affiliation': str(row['PARTY']).strip(),
        'Year Born': str(row['DATE OF BIRTH']).strip(),
        #'Voter Address': formatAddress(row),
        'Voter Address': formatAddress({'RES_HOUSE': row['RES_HOUSE'], 'RES_FRAC': row['RES_FRAC'], 'RES STREET': row['RES STREET'], 'RES_APT': row['RES_APT']}),
        'City': str(row['RES_CITY']).strip(),
        'State': str(row['RES_STATE']).strip(),
        'Zip Code': str(row['RES_ZIP']).strip(),
        'Precinct': str(row['PRECINCT']).strip(),
        'Precinct Split': str(row['PRECINCT SPLIT']).strip(),
        'State House District': str(row['HOUSE']).strip(),
        'State Senate District': str(row['SENATE']).strip(),
        'Federal Congressional District': str(row['CONGRESSIONAL']).strip(),
        'City or Village Code': str(row['CITY OR VILLAGE']).strip(),
        'Township': str(row['TOWNSHIP']).strip(),
        'School District': str(row['SCHOOL']).strip(),
        'Fire': str(row['FIRE']).strip(),
        'Police': str(row['POLICE']).strip(),
        'Park': str(row['PARK']).strip(),
        'Road': str(row['ROAD']).strip()
    }
    for field in fieldnames:
        m = re.search(r'(\d{2})(\d{4})-([GPL])', field)
        if m:
            vote_type = vote_type_map[m.group(3)] or 'Other'
            #print { 'k1': m.group(1), 'k2': m.group(2), 'k3': m.group(3)}
            d = datetime.date(year=int(m.group(2)), month=int(m.group(1)), day=1)
            csv_label = d.strftime('%B %Y') + ' ' + vote_type + ' Ballot Requested'
            d = None
            basic_dict[csv_label] = row[field]
        m = None
    return basic_dict
output_rows = []
output_fields = []
with open('data.tsv', 'r') as f:
    r = csv.DictReader(f, delimiter='\t')
    #f.seek(0)
    fieldnames = r.fieldnames
    for row in r:
        output_rows.append(formatRow(row, fieldnames))
f.close()
if output_rows:
    output_fields = sorted(output_rows[0].keys())
    with open('data_out.csv', 'wb') as f:
        w = csv.DictWriter(f, output_fields, quotechar='"')
        w.writeheader()
        for row in output_rows:
            w.writerow(row)
    f.close()
You are accumulating all the data into a huge list, output_rows. You need to process each row as you read it, instead of saving all of them into a memory-expensive Python list.
with open('data.tsv', 'rb') as fin, open('data_out.csv', 'wb') as fout:
    reader = csv.DictReader(fin, delimiter='\t')
    firstrow = next(reader)
    fieldnames = reader.fieldnames
    basic_dict = formatRow(firstrow, fieldnames)
    output_fields = sorted(basic_dict.keys())
    writer = csv.DictWriter(fout, output_fields, quotechar='"')
    writer.writeheader()
    writer.writerow(basic_dict)
    for row in reader:
        basic_dict = formatRow(row, fieldnames)
        writer.writerow(basic_dict)
You're not leaking any memory, you're just using a ton of memory.
You're turning each line of text into a dict of Python strings, which takes considerably more memory than a single string. For full details, see Why does my 100MB file take 1GB of memory?
The solution is to do this iteratively. You don't actually need the whole list, because you never refer back to any previous values. So:
with open('data.tsv', 'r') as fin, open('data_out.csv', 'w') as fout:
    r = csv.DictReader(fin, delimiter='\t')
    fieldnames = r.fieldnames
    first_row = formatRow(next(r), fieldnames)
    output_fields = sorted(first_row.keys())  # the output columns come from the formatted row
    w = csv.DictWriter(fout, output_fields, quotechar='"')
    w.writeheader()
    w.writerow(first_row)
    for row in r:
        w.writerow(formatRow(row, fieldnames))
Or, even more simply, replace the final loop with:
    w.writerows(formatRow(row, fieldnames) for row in r)
Of course this is slightly different from your original code in that it creates the output file even if the input file is empty. You can fix that pretty easily if it's important:
with open('data.tsv', 'r') as fin:
    r = csv.DictReader(fin, delimiter='\t')
    first_row = next(r, None)
    if first_row:
        with open('data_out.csv', 'wb') as fout:
            formatted = formatRow(first_row, r.fieldnames)
            output_fields = sorted(formatted.keys())
            w = csv.DictWriter(fout, output_fields, quotechar='"')
            w.writeheader()
            w.writerow(formatted)
            for row in r:
                w.writerow(formatRow(row, r.fieldnames))
Maybe it helps someone with a similar problem.
While reading a plain CSV file line by line and deciding by a field whether it should be saved in file A or file B, a memory overflow occurred and my kernel died. I therefore analyzed my memory usage, and this small change (1) roughly tripled the iteration speed and (2) fixed the memory problem.
This was my code, with the memory problem and a long runtime:
with open('input_file.csv', 'r') as input_file, open('file_A.csv', 'w') as file_A, open('file_B.csv', 'w') as file_B:
    input_csv = csv.reader(input_file)
    file_A_csv = csv.writer(file_A)
    file_B_csv = csv.writer(file_B)
    for row in input_csv:
        condition_row = row[1]
        if condition_row == 'condition':
            file_A_csv.writerow(row)
        else:
            file_B_csv.writerow(row)
BUT if you don't declare the variable (or more variables from the file you're reading) beforehand, like this:
with open('input_file.csv', 'r') as input_file, open('file_A.csv', 'w') as file_A, open('file_B.csv', 'w') as file_B:
    input_csv = csv.reader(input_file)
    file_A_csv = csv.writer(file_A)
    file_B_csv = csv.writer(file_B)
    for row in input_csv:
        if row[1] == 'condition':
            file_A_csv.writerow(row)
        else:
            file_B_csv.writerow(row)
I cannot explain why this is so, but after some tests I found it runs on average about 3 times as fast, and my RAM usage stays close to zero.

Appending dictionary results to csv in python

I am looking to append a dictionary I have to a CSV file where I already have a header line, and if a value doesn't exist I want to write '-999':
SDict = {'T1': 'A', 'T2': 'B', 'T4': 'D'}
where the CSV file has a header of
T1,T2,T3,T4,T5
7,8,9,10,11
and the expected results are
T1,T2,T3,T4,T5
7,8,9,10,11
A,B,-999,D,-999
I am trying to do so with the code:
import sys
import os
import csv

def GetFileHeader(Fpath):
    i = 10
    ResFile = open(Fpath, 'r+')
    HeaderDict = {}
    r = csv.reader(ResFile)
    HeaderList = r.next()
    for Header in HeaderList:
        HeaderDict[Header] = i + 1
    print HeaderDict
    ResFile.close()
    return HeaderDict

Fpath = r'Z:\temp\assaf\S2TTP\S2T_TP\modules\results\Y124\res.csv'
Header = GetFileHeader(Fpath)
with open(Fpath, 'rb') as fin:
    dr = csv.DictReader(fin, dialect='excel')
    print dr
    print dr.fieldnames
    # dr.fieldnames contains values from first row of `f`.
with open(Fpath, 'ab+') as fou:
    dw = csv.DictWriter(fou, dialect='excel', fieldnames=dr.fieldnames)
    fieldnames = dr.fieldnames
    for K in fieldnames:
        dw.writerow(Header[k])
I think you can simply do:
import csv

SDict = {'T1': 'A', 'T2': 'B', 'T4': 'D'}

with open('file.csv', 'r+b') as f:
    header = next(csv.reader(f))
    dict_writer = csv.DictWriter(f, header, -999)
    dict_writer.writerow(SDict)
This is assuming you're on Python 2.X. Also, be wary of files which don't end in a newline, or you could end up with a row like 7,8,9,10,11A,B,-999,D,-999.
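A variant of the same idea that avoids mixing reads and writes on one handle (a sketch, still Python 2): grab the header first, then reopen in append mode so the new row always lands at the end. The missing-trailing-newline caveat above still applies:
import csv

SDict = {'T1': 'A', 'T2': 'B', 'T4': 'D'}

with open('file.csv', 'rb') as f:
    header = next(csv.reader(f))
with open('file.csv', 'ab') as f:
    csv.DictWriter(f, header, -999).writerow(SDict)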

Attempting to merge three columns in CSV, updating original CSV

Some example data:
title1|title2|title3|title4|merge
test|data|here|and
test|data|343|AND
",3|data|343|and
My attempt at coding this:
import csv
import StringIO

storedoutput = StringIO.StringIO()
fields = ('title1', 'title2', 'title3', 'title4', 'merge')
with open('file.csv', 'rb') as input_csv:
    reader = csv.DictReader(input_csv, fields, delimiter='|')
    for counter, row in enumerate(reader):
        counter += 1
        #print row
        if counter != 1:
            for field in fields:
                if field == "merge":
                    row['merge'] = ("%s%s%s" % (row["title1"], row["title3"], row["title4"]))
        print row
        storedoutput.writelines(','.join(map(str, row)) + '\n')
contents = storedoutput.getvalue()
storedoutput.close()
print "".join(contents)
with open('file.csv', 'rb') as input_csv:
    input_csv = input_csv.read().strip()
output_csv = []
output_csv.append(contents.strip())
if "".join(output_csv) != input_csv:
    with open('file.csv', 'wb') as new_csv:
        new_csv.write("".join(output_csv))
Output should be
title1|title2|title3|title4|merge
test|data|here|and|testhereand
test|data|343|AND|test343AND
",3|data|343|and|",3343and
For your reference: upon running this code, the first print prints the rows as I would hope them to appear in the output csv. However, the second print prints the title row x times, where x is the number of rows.
Any input or corrections or working code would be appreciated.
I think we can make this a lot simpler. Dealing with the rogue " was a bit of a nuisance, I admit, because you have to work hard to tell Python you don't want to worry about it.
import csv
with open('file.csv', 'rb') as input_csv, open("new_file.csv", "wb") as output_csv:
    reader = csv.DictReader(input_csv, delimiter='|', quoting=csv.QUOTE_NONE)
    writer = csv.DictWriter(output_csv, reader.fieldnames, delimiter="|", quoting=csv.QUOTE_NONE, quotechar=None)
    merge_cols = "title1", "title3", "title4"
    writer.writeheader()
    for row in reader:
        row["merge"] = ''.join(row[col] for col in merge_cols)
        writer.writerow(row)
produces
$ cat new_file.csv
title1|title2|title3|title4|merge
test|data|here|and|testhereand
test|data|343|AND|test343AND
",3|data|343|and|",3343and
Note that even though you wanted the original file updated, I refused. Why? It's a bad idea, because then you can destroy your data while working on it.
How can I be so sure? Because that's exactly what I did when I first ran your code, and I know better. ;^)
That double quote in the last line is definitely messing up the csv.DictReader().
This works:
new_lines = []
with open('file.csv', 'rb') as f:
    # keep the header line as-is
    new_lines.append(f.next().strip())
    for line in f:
        # strip the newline and split the fields
        line = line.strip().split('|')
        # extract the field data you want
        title1, title3, title4 = line[0], line[2], line[3]
        # turn the field data into a string and append it to the rest
        line.append(''.join([title1, title3, title4]))
        # save the new line for later
        new_lines.append('|'.join(line))
with open('file.csv', 'w') as f:
    # make one long string and write it to the new file
    f.write('\n'.join(new_lines))
import csv
import StringIO

stored_output = StringIO.StringIO()
with open('file.csv', 'rb') as input_csv:
    reader = csv.DictReader(input_csv, delimiter='|', quoting=csv.QUOTE_NONE)
    writer = csv.DictWriter(stored_output, reader.fieldnames, delimiter="|", quoting=csv.QUOTE_NONE, quotechar=None)
    merge_cols = "title1", "title3", "title4"
    writer.writeheader()
    for row in reader:
        row["merge"] = ''.join(row[col] for col in merge_cols)
        writer.writerow(row)
contents = stored_output.getvalue()
stored_output.close()
print contents
with open('file.csv', 'rb') as input_csv:
    input_csv = input_csv.read().strip()
if input_csv != contents.strip():
    with open('file.csv', 'wb') as new_csv:
        new_csv.write("".join(contents))
