Parse csv to dict - python

I am trying to parse csv financial data from the web into a dict that I can navigate through by key.
My attempt with csv.DictReader is not working.
I have:
import csv
import urllib2
req = urllib2.Request('http://financials.morningstar.com/ajax/ReportProcess4CSV.html?&t=XNAS:BRCM&region=usa&culture=en-US&cur=USD&reportType=is&period=12&dataType=A&order=desc&columnYear=5&rounding=3&view=raw&r=886300&denominatorView=raw&number=3')
response = urllib2.urlopen(req)
response.next()
csvio = (csv.DictReader(response))
print csvio.fieldnames
print len(csvio.fieldnames)
Edited to reflect changes from answer below.
This almost gets me there, but I need to strip the leading 'Fiscal year...share data' before feeding it to DictReader. How best to do this? I have looked at converting to string and stripping lead chars with str.lstrip() as the docs say here with no luck.

To use a DictReader you need to either specify the field names, or have the field names as the first row of the csv data (i.e. a header row).
In the csv file that your code retrieves, the field names are in the second row of data, not the first. What I've done is to throw out the first line of data before passing the csv file to the DictReader constructor.
In response to your updated question:
Stripping the leading text from the header row probably isn't desirable, as that text acts as the field name for the first column of data. It is probably better to discard the first 2 rows of data and then supply the desired field names directly to the DictReader. I have updated the example below to reflect this.
import csv
import urllib2
req = urllib2.Request('http://financials.morningstar.com/ajax/ReportProcess4CSV.html?&t=XNAS:BRCM&region=usa&culture=en-US&cur=USD&reportType=is&period=12&dataType=A&order=desc&columnYear=5&rounding=3&view=raw&r=886300&denominatorView=raw&number=3')
response = urllib2.urlopen(req)
response.readline() # This reads (and discards) the first row of data which is not needed.
response.readline() # skip the header row; the field names are supplied explicitly below
myFieldnames = ["firstColName", "TTM", "2012", "2011", "2010", "2009", "2008"]
csvio = csv.DictReader(response, fieldnames=myFieldnames)
print csvio.fieldnames
for row in csvio:
    print row
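For readers on Python 3, the same idea (discard the rows that come before the data, then pass the field names explicitly) might look roughly like the sketch below. The sample text, the column names and the read_rows helper are made up for illustration and are not the actual Morningstar output; io.StringIO stands in for the HTTP response.

```python
import csv
import io

# Hypothetical stand-in for the downloaded text: a title line, a header
# line, then the data rows.
SAMPLE = (
    "EXAMPLE CORP INCOME STATEMENT\n"
    "Fiscal year ends in December. USD.,TTM,2012,2011\n"
    "Revenue,100,90,80\n"
    "Net income,10,9,8\n"
)


def read_rows(stream, fieldnames, skip=2):
    """Skip `skip` leading lines, then read the rest as dicts."""
    for _ in range(skip):
        next(stream, None)  # discard a line; do nothing if the stream is short
    return list(csv.DictReader(stream, fieldnames=fieldnames))


rows = read_rows(io.StringIO(SAMPLE), ["item", "TTM", "2012", "2011"])
by_item = {row["item"]: row for row in rows}  # navigate by the first column
print(by_item["Revenue"]["2012"])
```

Keying the rows by the first column is one way to get the "navigate by key" behaviour the question asks for; it assumes the values in that column are unique.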

Related

Download "csv-like" text data file and convert it to CSV in python

First question here so forgive any lapses in the etiquette.
I'm new to python. I have a small project I'm trying to accomplish both for practical reasons and as a learning experience, and maybe some people here can help me out. There's a proprietary system I regularly retrieve data from. Unfortunately it doesn't use standard CSV format. It uses a strange character to separate the data: a ‡. I need the data in CSV format in order to import it into another system. So what I need to do is take the data, replace the special character with a comma, and clean the data up by removing whitespace and other minor things like unrecognized characters, so that it's in the CSV form I need for the import.
I want to learn some python so I figured I'd write it in python. I'll be reading it from a webservice URL, but for now I just have some test data in the same format I'd receive.
In reality it will be tons of data per request but I can scale it when I understand how to retrieve and manipulate the data properly.
My code so far just trying to read and write two columns from the data:
import requests
import csv
r = requests.get ('https://www.dropbox.com/s/7uhheam5lqppzis/singlelineTest.csv?dl=0')
data = r.text
with open("testData.csv", "wb") as csvfile:
    f = csv.writer(csvfile)
    f.writerow(["PlayerID", "Partner"]) # add headers
    for elem in data:
        f.writerow([elem["PlayerID"], elem["Partner"]])
I'm getting this error.
  File "csvTest.py", line 14, in <module>
    f.writerow([elem["PlayerID"], elem["Partner"]])
TypeError: string indices must be integers
It's probably evident from that error that I don't know how to read the data properly, let alone manipulate it. I was able to pull back some JSON data and output it, so I know the basic structure works with standardized data.
Thanks in advance for any tips.
I'll continue to poke at it.
Sample data is at the dropbox link mentioned in the script.
https://www.dropbox.com/s/7uhheam5lqppzis/singlelineTest.csv?dl=0
There are multiple problems. First, the link is incorrect, since it returns the html. To get the raw file, use:
r = requests.get ('https://www.dropbox.com/s/7uhheam5lqppzis/singlelineTest.csv?dl=1')
Then, data is a string, so elem in data will iterate over all the characters of the string, which is not what you want.
Then, your data are unicode, not a byte string, so you need to encode them first (the code below uses UTF-8).
Here is your program, with some changes:
import requests
import csv
r = requests.get ('https://www.dropbox.com/s/7uhheam5lqppzis/singlelineTest.csv?dl=1')
data = str(r.text.encode('utf-8').replace("\xc2\x87", ",")).splitlines()
headers = data.pop(0).split(",")
pidx = headers.index('PlayerID')
partidx = headers.index('Partner')
with open("testData.csv", "wb") as csvfile:
    f = csv.writer(csvfile)
    f.writerow(["PlayerID", "Partner"]) # add headers
    # data.pop(0) above already removed the header line, so data[1:] also skips the first remaining line
    for line in data[1:]:
        words = line.split(',')
        f.writerow([words[pidx], words[partidx]])
Output:
PlayerID,Partner
1038005,EXT
254034,EXT
Use split:
lines = data.split('\n') # split your data to lines
headers = lines[0].split('‡')
player_index = headers.index('PlayerID')
partner_index = headers.index('Partner')
for line in lines[1:]: # skip the headers line
    words = line.split('‡') # split each line by the delimiter '‡'
    print words[player_index], words[partner_index]
For this to work, define the encoding of your python source code as UTF-8 by adding this line to the top of your file:
# -*- coding: utf-8 -*-
Read more about it in PEP 0263.
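As an alternative to splitting by hand, the csv module can be given the ‡ character as its delimiter. The following is only a sketch for Python 3 (where str is already unicode, so no coding line is needed); the sample text, the Name column and the convert helper are invented for illustration, and the real feed may contain a different separator code point than the one typed here.

```python
import csv
import io

# Hypothetical sample in the same shape as the question's data.
RAW = "PlayerID‡Name‡Partner\n1038005‡ Ann ‡EXT\n254034‡Bob‡EXT\n"


def convert(raw_text, wanted, delimiter="‡"):
    """Return CSV text containing only the `wanted` columns."""
    reader = csv.DictReader(io.StringIO(raw_text), delimiter=delimiter)
    out = io.StringIO()
    writer = csv.DictWriter(out, fieldnames=wanted, lineterminator="\n")
    writer.writeheader()
    for row in reader:
        # Keep only the wanted columns and strip stray whitespace;
        # a column missing from a short line becomes an empty string.
        writer.writerow({key: (row.get(key) or "").strip() for key in wanted})
    return out.getvalue()


csv_text = convert(RAW, ["PlayerID", "Partner"])
print(csv_text)
```

Letting csv.DictWriter produce the output means values that happen to contain commas or quotes are escaped correctly, which a plain str.replace would not do.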

Parse Specific Text File to CSV Format with Headers

I have a log file that is updated every few milliseconds; however, the information is currently saved with four (4) different delimiters. The log files contain millions of lines, so the chances of performing the action in Excel are nil.
What I have left to work on resembles:
Sequence=3433;Status=true;Report=223313;Profile=xxxx;
Sequence=0323;Status=true;Header=The;Report=43838;Profile=xxxx;
Sequence=5323;Status=true;Report=6541998;Profile=xxxx;
I would like these set to:
Sequence,Status,Report,Header,Profile
3433,true,Report=223313,,xxxx
0323,true,Report=43838,The,xxxx
5323,true,Report=6541998,,xxxx
That is, I need a header built from every name that is followed by an equals sign ("="). All of the other operations on the file are taken care of; the result will be used to perform a comparative check between files and to replace or append fields. As I am new to python, I only need help with this portion of the program I am writing.
Thank you all in advance!
You can try this.
First of all, I imported the csv module so that it handles the commas and quoting for me.
import csv
Then I made a function that takes a single line from your log file and returns a dictionary with one entry per field in the header. If the current line lacks a particular field from the header, that entry stays an empty string.
def convert_to_dict(line, header):
    d = {}
    for cell in header:
        d[cell] = ''
    row = line.strip().split(';')
    for cell in row:
        if cell:
            key, value = cell.split('=')
            d[key] = value
    return d
Since the header and the number of fields can vary between your files, I made a function extracting them. For this, I employed a set, a collection of unique elements, but also unordered. So I converted to a list and used the sorted function. Don't forget that seek(0) call, to rewind the file!
def extract_fields(logfile):
    fields = set()
    for line in logfile:
        row = line.strip().split(';')
        for cell in row:
            if cell:
                key, value = cell.split('=')
                fields.add(key)
    logfile.seek(0)
    return sorted(list(fields))
Lastly, I wrote the main piece of code, which opens both the log file for reading and the csv file for writing. It then extracts and writes the header, and writes each converted line.
if __name__ == '__main__':
    with open('report.log', 'r') as logfile:
        with open('report.csv', 'wb') as csvfile:
            csvwriter = csv.writer(csvfile)
            header = extract_fields(logfile)
            csvwriter.writerow(header)
            for line in logfile:
                d = convert_to_dict(line, header)
                csvwriter.writerow([d[cell] for cell in header])
These are the files I used as an example:
report.log
Sequence=3433;Status=true;Report=223313;Profile=xxxx;
Sequence=0323;Status=true;Header=The;Report=43838;Profile=xxxx;
Sequence=5323;Status=true;Report=6541998;Profile=xxxx;
report.csv
Header,Profile,Report,Sequence,Status
,xxxx,223313,3433,true
The,xxxx,43838,0323,true
,xxxx,6541998,5323,true
I hope it helps! :D
EDIT: I added support for different headers.
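If the columns should appear in the order the fields are first seen, rather than sorted alphabetically, a variation along these lines may help. It is a Python 3 sketch with hypothetical helper names (parse_line, log_to_csv), and it assumes every non-empty cell contains an "=". Unlike the two-pass version above, it holds all records in memory, which may not suit logs with millions of lines.

```python
import csv
import io

LOG = (
    "Sequence=3433;Status=true;Report=223313;Profile=xxxx;\n"
    "Sequence=0323;Status=true;Header=The;Report=43838;Profile=xxxx;\n"
)


def parse_line(line):
    """Turn 'a=1;b=2;' into {'a': '1', 'b': '2'}."""
    pairs = (cell.split("=", 1) for cell in line.strip().split(";") if cell)
    return {key: value for key, value in pairs}


def log_to_csv(log_text):
    records = [parse_line(line) for line in log_text.splitlines() if line.strip()]
    # Collect field names in first-seen order; dict keys keep insertion order.
    fields = list(dict.fromkeys(key for rec in records for key in rec))
    out = io.StringIO()
    # restval fills in fields that a given record does not have.
    writer = csv.DictWriter(out, fieldnames=fields, restval="", lineterminator="\n")
    writer.writeheader()
    writer.writerows(records)
    return out.getvalue()


result = log_to_csv(LOG)
print(result)
```

Here csv.DictWriter with restval="" does the job of pre-filling missing fields with empty strings.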

KeyError while executing Python Code

I am getting the KeyError below while running my python script, which imports data from one csv, modifies it and writes it to another csv.
Code snippet:
import csv
Ty = 'testy'
Tx = 'testx'
ifile = csv.DictReader(open('test.csv'))
cdata = [x for x in ifile]
for row in cdata:
    row['Test'] = row.pop(Ty)
Error seen while executing :
row['Test'] = row.pop(Ty)
KeyError: 'testy'
Any idea?
Thanks
Your csv probably doesn't have a header row, which is where the key names are normally defined, and you didn't supply the key names yourself. Without a header row, DictReader needs the fieldnames parameter so that it can map the column names (keys) to the values.
So you should do something like this to read your csv file:
ifile = csv.DictReader(open('test.csv'), fieldnames=['testx', 'testy'])
If you don't want to pass the fieldnames parameter, make sure you understand where a csv file defines its header. See the Wikipedia article:
The first record may be a "header", which contains column names in
each of the fields (there is no reliable way to tell whether a file
does this or not; however, it is uncommon to use characters other than
letters, digits, and underscores in such column names).
Year,Make,Model
1997,Ford,E350
2000,Mercury,Cougar
You can put 'testx' and 'testy' in the first line of your csv and not pass fieldnames to DictReader.
According to the error message, testy is missing from the first line of test.csv.
Try content like this in test.csv:
col_name1,col_name2,testy
a,b,c
c,d,e
Note that there should not be any spaces/tabs around the testy.
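One way to see which keys the rows will actually have is to print reader.fieldnames before popping anything. The sketch below is written for Python 3 and uses invented file contents; skipinitialspace=True is one option for ignoring spaces that follow a comma in the header, though it does not remove trailing spaces or tabs.

```python
import csv
import io

# Hypothetical file contents; note the stray space before "testy".
TEXT = "testx, testy\n1,2\n3,4\n"

reader = csv.DictReader(io.StringIO(TEXT), skipinitialspace=True)
print(reader.fieldnames)  # check which keys the rows will actually have

rows = []
for row in reader:
    if "testy" not in row:
        raise KeyError("no 'testy' column; found %r" % reader.fieldnames)
    row["Test"] = row.pop("testy")  # rename the key, as in the question
    rows.append(row)
```

Without skipinitialspace the second key would be ' testy' (with the leading space) and the pop would fail in the same way as in the question.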

Special case to grab the headers for a DictReader in Python

Normally the csv.DictReader will use the first line of a .csv file as the column headers, i.e. the keys to the dictionary:
If the fieldnames parameter is omitted, the values in the first row of the csvfile will be used as the fieldnames.
However, I am faced with something like this for my first line:
#Format: header1 header2 header3 ...etc.
The #Format: needs to be skipped, as it is not a column header. I could do something like:
column_headers = ['header1', 'header2', 'header3']
reader = csv.DictReader(my_file, delimiter='\t', fieldnames=column_headers)
But I would rather have the DictReader handle this, for two reasons:
There are a lot of columns
The column names may change over time, and this is a quarterly-run process.
Is there some way to have the DictReader still use the first line as the column headers, but skip that first #Format: word? Or really any word that starts with a # would probably suffice.
As DictReader wraps an open file, you could read the first line of the file, parse the headers from there (headers = my_file.readline().split(delimiter)[1:], or something like that), and then pass them to DictReader() as the fieldnames argument. The DictReader constructor does not reset the file, so you don't have to worry about it reading in the header list after you've parsed that.
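A sketch of that approach for Python 3 follows. The sample text and the open_reader helper are hypothetical, and it assumes the "#Format:" marker is separated from the first header by the same tab delimiter as the rest of the line. Stripping the line ending matters, because readline() keeps the trailing newline and it would otherwise end up in the last field name.

```python
import csv
import io

# Hypothetical tab-separated file whose first line starts with "#Format:".
TEXT = "#Format:\theader1\theader2\theader3\nA\tB\tC\nD\tE\tF\n"


def open_reader(stream, delimiter="\t"):
    """Build a DictReader from a header line that begins with a '#' token."""
    first = stream.readline().rstrip("\r\n").split(delimiter)
    # Drop any token starting with '#', such as "#Format:".
    names = [name for name in first if not name.startswith("#")]
    return csv.DictReader(stream, delimiter=delimiter, fieldnames=names)


reader = open_reader(io.StringIO(TEXT))
rows = list(reader)
print(reader.fieldnames)
```

Because the header names are taken from the file each time, new or renamed columns are picked up without editing the script.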

How to separate comma separated data from csv file?

I have opened a csv file and I want to store each of the comma-separated strings on a line in its own variable:
ex:: file :
name,sal,dept
tom,10000,it
o/p :: each string in string variable
I have a file which is already open, so I can not use "open" API, I have to use "csv.reader" which have to read one line at a time.
If the file open for reading is bound to a variable name, say fin, and assuming you're using Python 2.6 or later and you know the file's not empty (it has at least the row with headers):
import csv
rd = csv.reader(fin)
headers = next(rd)
for data in rd:
    ...process data and headers...
In Python 2.5, use headers = rd.next() instead of headers = next(rd).
These versions use a list of fields data, which is a completely general solution (i.e., you don't need to know in advance how many columns the file has: you'll access them as data[0], data[1], etc, and the current row has len(data) fields at each leg of the loop).
If you know the file has exactly three columns and prefer to use separate names for a variable per column, change the loop header to:
for name, sales, department in rd:
The field data as returned by the reader (just like the headers) are all strings. If you know for example that the second column is an int and want to treat it as such, start the loop with
for data in rd:
    data[1] = int(data[1])
or, if you're using the named-variables variant:
for name, sales, department in rd:
    sales = int(sales)
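Put together with the sample from the question, the named-variables variant could be sketched as follows. This is written for Python 3; io.StringIO merely stands in for the already-open file fin, and the records list is an illustrative addition.

```python
import csv
import io

fin = io.StringIO("name,sal,dept\ntom,10000,it\n")  # stands in for the open file

rd = csv.reader(fin)
headers = next(rd)  # first row holds the column names
records = []
for name, sal, dept in rd:
    # Every field arrives as a string; convert the salary explicitly.
    records.append({"name": name, "sal": int(sal), "dept": dept})

print(headers, records)
```

Unpacking into three names raises a ValueError on any row that does not have exactly three fields, so this form only suits files with a fixed layout.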
I don't know if I have properly understood your question. You may want to have a look at the split function described in the Python documentation anyway.
