Special case to grab the headers for a DictReader in Python - python

Normally the csv.DictReader will use the first line of a .csv file as the column headers, i.e. the keys to the dictionary:
If the fieldnames parameter is omitted, the values in the first row of the csvfile will be used as the fieldnames.
However, I am faced with something like this for my first line:
#Format: header1 header2 header3 ...etc.
The #Format: needs to be skipped, as it is not a column header. I could do something like:
column_headers = ['header1', 'header2', 'header3']
reader = csv.DictReader(my_file, delimiter='\t', fieldnames=column_headers)
But I would rather have the DictReader handle this, for two reasons:
There are a lot of columns.
The column names may change over time, and this is a quarterly-run process.
Is there some way to have the DictReader still use the first line as the column headers, but skip that first #Format: word? Or really any word that starts with a # would probably suffice.

As DictReader wraps an open file, you could read the first line of the file, parse the headers from there (headers = my_file.readline().split(delimiter)[1:], or something like that), and then pass them to DictReader() as the fieldnames argument. The DictReader constructor does not reset the file, so you don't have to worry about it reading in the header list after you've parsed that.
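A minimal sketch of that approach, assuming the `#Format:` marker is separated from the first header by the same tab delimiter as the other columns. The helper name `open_dict_reader` and the sample data are made up for illustration:

```python
import csv
import io


def open_dict_reader(my_file, delimiter="\t"):
    """Build a DictReader whose field names come from a '#'-prefixed first line."""
    first_line = my_file.readline().rstrip("\r\n")
    names = first_line.split(delimiter)
    # Drop a leading marker such as "#Format:" if one is present.
    if names and names[0].startswith("#"):
        names = names[1:]
    # The file position is already past the header line, so DictReader
    # starts with the first data row.
    return csv.DictReader(my_file, delimiter=delimiter, fieldnames=names)


# Hypothetical sample standing in for the real file.
sample = io.StringIO("#Format:\theader1\theader2\theader3\nA\tB\tC\n")
reader = open_dict_reader(sample)
rows = list(reader)
print(rows[0]["header2"])  # prints "B"
```

Because the names are read from the file each time, nothing needs to change when the column list changes between quarterly runs.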

Related

Python's CSVReader seems to be separating on periods

Interesting problem, I'm using python's CSVreader to read comma delimited data from a UTF-8 formatted CSV file. It appears the reader is truncating column names when it encounters a period.
For example, here is a sample of my column names.
time,b12.76org2101.xz,b12.75org2001.xz,b11.72ogg8090.xy
Here's how I'm reading this data
import csv
def parseCSV(inputData):
    file_to_open = inputData
    with open(file_to_open) as csv_file:
        csv_reader = csv.reader(csv_file, delimiter=',')
        headerLine = True
        line = []
        for row in csv_reader:
            pass  # column manipulation code here
And here's how CSVReader interprets those column names
time,76org2101,75org2001,72ogg8090
Here's the important bit: the code I shared is the first thing in the program that touches that CSV file. After the code has finished execution I can also verify that the CSV file itself is unchanged. The problem must lie with how CSVReader interprets periods, but I'm not sure what the fix is.
Here's another interesting find. Later in the program I use Pandas to read a list of identical names from a column in another file.
The data is formatted as follows
COLUMN_NAMES
b12.76org2101.xz,
b12.75org2001.xz,
b11.72ogg8090.xy,
Where COLUMN_NAMES is the CSV's header and the items below are rows.
You can see the code I use to read these values here.
data = pandas.read_csv(file_to_open)
Headers = data['COLUMN_NAMES'].tolist()
And this is how Pandas interprets those rows
76org2101
75org2001
72ogg8090
The Data is exactly the same, and we see exactly the same behavior! The column names with periods are truncated in exactly the same way.
So what's up? Because both Pandas and CSVReader have identical issues I'm tempted to think this is a python problem, but I'm not sure how to resolve it. Any ideas are appreciated!
EDIT: The issue was with my code, I was reading the wrong files which incidentally happened to have the same column names as my expected files, just without anything before or after the periods. What're the odds!
Using pd.__version__ '0.23.0' and python version 3.6.5, I get the expected results:
print(pd.read_csv('test.csv'))
COLUMN_NAMES
0 b12.76org2101.xz
1 b12.75org2001.xz
2 b11.72ogg8090.xy
headers = pd.read_csv('test.csv')['COLUMN_NAMES'].tolist()
print(headers)
['b12.76org2101.xz', 'b12.75org2001.xz', 'b11.72ogg8090.xy']
It also works if those values are columns:
pd.DataFrame(columns=headers).to_csv('test1.csv', index=None)
print(pd.read_csv('test1.csv'))
Empty DataFrame
Columns: [b12.76org2101.xz, b12.75org2001.xz, b11.72ogg8090.xy]
Index: []
Maybe try updating your version of python?

CSV file with comma delimiter and quotes, but not on every line

Having an issue with reading a csv file that delimits everything with commas, but the first one in the csv file does not contain quotes. Example:
Symbol,"Name","LastSale","MarketCap","IPOyear","Sector","industry","Summary Quote",
the code used to try and read this is as follows:
from ystockquote import *
import csv
with open('companylist.csv') as csvfile:
    readCSV = csv.reader(csvfile, delimiter=",", quotechar='"', quoting=csv.QUOTE_MINIMAL)
    for row in readCSV:
        print(row[0])
What I get is the following:
Symbol,"Name","LastSale","MarketCap","IPOyear","Sector","industry","Summary Quote",;
However, I just want to get all of the symbols from this list. Does anyone have an idea on how to do this?
edit
More data:
Symbol,"Name","LastSale","MarketCap","IPOyear","Sector","industry","Summary Quote",;
PIH,"1347 Property Insurance Holdings, Inc.","7.505","$45.23M","2014","Finance","Property-Casualty Insurers","http://www.nasdaq.com/symbol/pih",;
FLWS,"1-800 FLOWERS.COM, Inc.","9.59","$623.46M","1999","Consumer Services","Other Specialty Stores","http://www.nasdaq.com/symbol/flws",;
So my expected output would be:
Symbol
PIH
FLWS
This would happen if the csv.reader read each row of my file as a separate list, and within each of these lists all of the items (delimited by commas) were separate values (e.g. symbol would be the value of [0], "name" would be the value of [1], etc.).
I hope this clears up what I'm looking for
Found the easy way out:
Replaced all of the " characters with nothing in my csv file; this made it so that the csv.reader could read the csv file normally again.
If print(row[0]) is giving you a list, it might be because each row of your csv file is being read in as a list.
try print(row[0][0]) maybe?
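For what it's worth, a quick check with the rows exactly as pasted in the question suggests that csv.reader splits them as expected, quotes included. This is only a sketch using an in-memory copy of the pasted text. If the real file behaves differently, its contents on disk may not match what was pasted (that part is a guess):

```python
import csv
import io

# The rows as pasted in the question, including the trailing ",;".
pasted = (
    'Symbol,"Name","LastSale","MarketCap","IPOyear","Sector","industry","Summary Quote",;\n'
    'PIH,"1347 Property Insurance Holdings, Inc.","7.505","$45.23M","2014","Finance",'
    '"Property-Casualty Insurers","http://www.nasdaq.com/symbol/pih",;\n'
    'FLWS,"1-800 FLOWERS.COM, Inc.","9.59","$623.46M","1999","Consumer Services",'
    '"Other Specialty Stores","http://www.nasdaq.com/symbol/flws",;\n'
)

# The default dialect already uses ',' as delimiter and '"' as quote character.
symbols = [row[0] for row in csv.reader(io.StringIO(pasted)) if row]
print(symbols)  # prints ['Symbol', 'PIH', 'FLWS']
```

Note that the comma inside "1347 Property Insurance Holdings, Inc." stays within one field because of the quotes, which is lost if the quotes are stripped from the file.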

Parse Specific Text File to CSV Format with Headers

I have a log file that is updated every few milliseconds; however, the information is currently saved with four (4) different delimiters. The log files contain millions of lines, so the chances of performing the action in Excel are null.
What I have left to work on resembles:
Sequence=3433;Status=true;Report=223313;Profile=xxxx;
Sequence=0323;Status=true;Header=The;Report=43838;Profile=xxxx;
Sequence=5323;Status=true;Report=6541998;Profile=xxxx;
I would like these set to:
Sequence,Status,Report,Header,Profile
3433,true,Report=223313,,xxxx
0323,true,Report=43838,The,xxxx
5323,true,Report=6541998,,xxxx
Meaning that I would need a header created from every name that is followed by the equals sign (=). All of the other operations within the file are taken care of, and this will be used to perform a comparative check between files and replace or append fields. As I am new to Python, I only need assistance with this portion of the program I am writing.
Thank you all in advance!
You can try this.
First of all, I called the csv library to reduce the job of putting commas and quotes.
import csv
Then I made a function that takes a single line from your log file and outputs a dictionary with the fields passed in the header. If the current line hasn't a particular field from header, it will stay filled with an empty string.
def convert_to_dict(line, header):
    d = {}
    for cell in header:
        d[cell] = ''
    row = line.strip().split(';')
    for cell in row:
        if cell:
            key, value = cell.split('=')
            d[key] = value
    return d
Since the header and the number of fields can vary between your files, I made a function extracting them. For this, I employed a set, a collection of unique elements, but also unordered. So I converted to a list and used the sorted function. Don't forget that seek(0) call, to rewind the file!
def extract_fields(logfile):
    fields = set()
    for line in logfile:
        row = line.strip().split(';')
        for cell in row:
            if cell:
                key, value = cell.split('=')
                fields.add(key)
    logfile.seek(0)
    return sorted(list(fields))
Lastly, I made the main piece of code, in which I open both the log file to read and the csv file to write. Then it extracts and writes the header, and writes each converted line.
if __name__ == '__main__':
    with open('report.log', 'r') as logfile:
        # Python 3: open in text mode with newline=''; on Python 2 use 'wb' instead.
        with open('report.csv', 'w', newline='') as csvfile:
            csvwriter = csv.writer(csvfile)
            header = extract_fields(logfile)
            csvwriter.writerow(header)
            for line in logfile:
                d = convert_to_dict(line, header)
                csvwriter.writerow([d[cell] for cell in header])
These are the files I used as an example:
report.log
Sequence=3433;Status=true;Report=223313;Profile=xxxx;
Sequence=0323;Status=true;Header=The;Report=43838;Profile=xxxx;
Sequence=5323;Status=true;Report=6541998;Profile=xxxx;
report.csv
Header,Profile,Report,Sequence,Status
,xxxx,223313,3433,true
The,xxxx,43838,0323,true
,xxxx,6541998,5323,true
I hope it helps! :D
EDIT: I added support for different headers.
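As a possible variation (not the code above, just a sketch under my own assumptions), csv.DictWriter can fill in the missing fields itself through its restval argument. The names `parse_line` and `log_to_csv` are made up for illustration, every non-empty cell is assumed to contain an "=", and this version holds all records in memory, so for logs with millions of lines the two-pass approach above is probably the safer choice:

```python
import csv
import io


def parse_line(line):
    """Turn 'a=1;b=2;' into {'a': '1', 'b': '2'} (assumes every cell has an '=')."""
    cells = (cell for cell in line.strip().split(";") if cell)
    return dict(cell.split("=", 1) for cell in cells)


def log_to_csv(lines, out):
    """Write the parsed records to `out`, one column per key seen in the log."""
    records = [parse_line(line) for line in lines if line.strip()]
    # Collect field names in order of first appearance instead of sorting them.
    fieldnames = []
    for record in records:
        for key in record:
            if key not in fieldnames:
                fieldnames.append(key)
    # restval is written for any field a record does not have.
    writer = csv.DictWriter(out, fieldnames=fieldnames, restval="")
    writer.writeheader()
    writer.writerows(records)
    return fieldnames


log = io.StringIO(
    "Sequence=3433;Status=true;Report=223313;Profile=xxxx;\n"
    "Sequence=0323;Status=true;Header=The;Report=43838;Profile=xxxx;\n"
    "Sequence=5323;Status=true;Report=6541998;Profile=xxxx;\n"
)
out = io.StringIO()
log_to_csv(log, out)
print(out.getvalue())
```

With a real file, `out` would be opened with `open('report.csv', 'w', newline='')` in Python 3.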

KeyError while executing Python Code

I am getting the KeyError below while running my Python script, which imports data from one csv, modifies it, and writes it to another csv.
Code snippet:
import csv
Ty = 'testy'
Tx = 'testx'
ifile = csv.DictReader(open('test.csv'))
cdata = [x for x in ifile]
for row in cdata:
    row['Test'] = row.pop(Ty)
Error seen while executing :
row['Test'] = row.pop(Ty)
KeyError: 'testy'
Any idea?
Thanks
Your csv probably doesn't have a header row. DictReader takes its keys either from the fieldnames parameter or, if that is omitted, from the first row of the file, so without a header row the key 'testy' never exists.
So you should do something like this to read your csv file:
ifile = csv.DictReader(open('test.csv'), fieldnames=['testx', 'testy'])
If you don't want to pass the fieldnames parameter, it helps to understand where a csv file defines its header; see the Wikipedia article:
The first record may be a "header", which contains column names in
each of the fields (there is no reliable way to tell whether a file
does this or not; however, it is uncommon to use characters other than
letters, digits, and underscores in such column names).
Year,Make,Model
1997,Ford,E350
2000,Mercury,Cougar
Alternatively, you can put 'testy' and 'testx' in the first line of your csv and not pass fieldnames to DictReader.
According to the error message, testy is missing from the first line of test.csv.
Try content like this in test.csv:
col_name1,col_name2,testy
a,b,c
c,d,e
Note that there should not be any spaces/tabs around the testy.
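To check what DictReader actually sees, one option is to print its fieldnames and strip stray whitespace from them before using the rows. This is a rough sketch, and the sample contents below are invented to show a header with spaces around the names:

```python
import csv
import io

# Hypothetical file contents with stray spaces around the header names.
sample = io.StringIO("col_name1, col_name2 , testy \na,b,c\nc,d,e\n")

reader = csv.DictReader(sample)
print(reader.fieldnames)  # shows the keys exactly as read, spaces included

# Normalise the keys so that 'testy' matches regardless of padding.
reader.fieldnames = [name.strip() for name in reader.fieldnames]

rows = list(reader)
for row in rows:
    row["Test"] = row.pop("testy")  # no KeyError once the key is clean
```

If the printed fieldnames turn out to be a data row rather than column names, the file has no header and the names need to be passed in through fieldnames instead.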

Parse csv to dict

I am trying to parse csv financial data from the web into a dict that I can navigate through by key.
I am failing using csv.DictReader.
I have:
import csv
import urllib2
req = urllib2.Request('http://financials.morningstar.com/ajax/ReportProcess4CSV.html?&t=XNAS:BRCM&region=usa&culture=en-US&cur=USD&reportType=is&period=12&dataType=A&order=desc&columnYear=5&rounding=3&view=raw&r=886300&denominatorView=raw&number=3')
response = urllib2.urlopen(req)
response.next()
csvio = (csv.DictReader(response))
print csvio.fieldnames
print len(csvio.fieldnames)
Edited to reflect changes from answer below.
This almost gets me there, but I need to strip the leading 'Fiscal year...share data' before feeding it to DictReader. How best to do this? I have looked at converting to string and stripping lead chars with str.lstrip() as the docs say here with no luck.
To use a DictReader you need to either specify the field names, or the field names need to be the first row of the csv data (ie. a header row).
In the csv file that your code retrieves, the field names are in the second row of data, not the first. What I've done is to throw out the first line of data before passing the csv file to the DictReader constructor.
In response to your updated question:
Stripping the leading text from the header row probably isn't desirable as this is acting as the field name for the first column of data. Probably better to discard the first 2 rows of data and then supply the desired field names directly to the DictReader. I have updated the example below to reflect this.
import csv
import urllib2
req = urllib2.Request('http://financials.morningstar.com/ajax/ReportProcess4CSV.html?&t=XNAS:BRCM&region=usa&culture=en-US&cur=USD&reportType=is&period=12&dataType=A&order=desc&columnYear=5&rounding=3&view=raw&r=886300&denominatorView=raw&number=3')
response = urllib2.urlopen(req)
response.readline() # This reads (and discards) the first row of data which is not needed.
response.readline() # This skips the header row, since the field names are supplied explicitly below.
myFieldnames = ["firstColName", "TTM", "2012", "2011", "2010", "2009", "2008"]
csvio = csv.DictReader(response, fieldnames=myFieldnames)
print csvio.fieldnames
for row in csvio:
    print row
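The example above is Python 2. A rough Python 3 sketch of the same idea (discard the leading lines, then supply the field names) might look like the following. The sample text and column names are invented stand-ins for the downloaded report, not its real contents:

```python
import csv
import io

# Hypothetical stand-in for the download: a title line, a header line, then data.
sample = io.StringIO(
    "COMPANY INCOME STATEMENT\n"
    "Fiscal year ends in December,TTM,2012\n"
    "Revenue,100,90\n"
    "Net income,10,9\n"
)

# Discard the title line and the original header line.
next(sample)
next(sample)

# Supply the desired field names directly instead of using the file's header.
my_fieldnames = ["item", "TTM", "2012"]
reader = csv.DictReader(sample, fieldnames=my_fieldnames)
rows = list(reader)
for row in rows:
    print(row["item"], row["TTM"])
```

In Python 3 the object returned by urllib.request.urlopen yields bytes, so it would need to be decoded first, for example by wrapping it in io.TextIOWrapper, before being handed to DictReader.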
