I'm still pretty new to using Python to program from scratch, so as an exercise I thought I'd take a file that I process using SQL and try to duplicate the functionality using Python. It seems that I want to take my (compressed, zip) CSV file and create a dict of it (or maybe a dict of dicts?). When I use DictReader I get the 1st row as a key rather than each column as its own key. E.g.
import csv, sys, zipfile
sys.argv[0] = "/home/tom/Documents/REdata/AllListing1RES.zip"
zip_file = zipfile.ZipFile(sys.argv[0])
items_file = zip_file.open('AllListing1RES.txt', 'rU')
for row in csv.DictReader(items_file,dialect='excel'):
    pass
Yields:
>>> for key in row:
...     print 'key=%s, value=%s' % (key, row[key])
key=MLS_ACCT PARCEL_ID AREA COUNTY STREET_NUM STREET_NAME CITY ZIP STATUS PROP_TYPE LIST_PRICE LIST_DATE DOM DATE_MODIFIED BATHS_HALF BATHS_FULL BEDROOMS ACREAGE YEAR_BUILT YEAR_BUILT_DESC OWNER_NAME SOLD_DATE WITHDRAWN_DATE STATUS_DATE SUBDIVISION PENDING_DATE SOLD_PRICE,
value=492859 28-15-3-009-001.0000 200 JEFF 3828 ORLEANS RD MOUNTAIN BROOK 35243 A SFR 324900 3/3/2011 2 3/4/2011 12:04:11 AM 0 2 3 0 1968 EXIST SPARKS 3/3/2011 11:54:56 PM KNOLLWOOD
So what I'm looking for is a column for MLS_ACCT and a separate one for PARCEL_ID, etc., so I can then do things like average prices for all items that contain KNOLLWOOD in the SUBDIVISION field, with further subsetting by date range, sold date, etc.
I know well how to do it with SQL, but as I said, I'm trying to gain some Python skills here.
I have been reading for the last few days but have yet to find any very simple illustrations of this sort of use case. Pointers to such docs would be appreciated. I realize I could use in-memory SQLite, but again my desire is to learn the Python approach. I've read some on NumPy and SciPy and have Sage loaded, but still can't find useful illustrations, since those tools seem focused on arrays with only numbers as elements, and I have a lot of string matching to do as well as date-range calculations and comparisons.
Eventually I'll need to substitute values in the table (since I have dirty data); I do this now by having a "translate table" which contains all dirty variants and provides a "clean" answer for final use.
Are you sure that this is a file with comma-separated values? It seems like the lines are being delimited by tabs.
If this is correct, specify a tab delimiter in the DictReader constructor.
for row in csv.DictReader(items_file, dialect='excel', delimiter='\t'):
    for key in row:
        print 'key=%s, value=%s' % (key, row[key])
Source: http://docs.python.org/library/csv.html
Writing the operation in pure Python is certainly possible, but you'll have to choose your algorithms then. The row output you've posted above looks a whole lot like the parsing has gone wrong; in fact, it seems not to be a CSV at all. Is it a TSV? Try passing delimiter='\t' or dialect=csv.excel_tab to DictReader.
Once the reading is done right, DictReader should work for getting rows as dictionaries, a typical row-oriented structure. Oddly enough, this isn't normally the efficient way to handle queries like yours; having column lists instead makes searches a lot easier, since row orientation means you redo some lookup work for every row. Things like date matching require information that is certainly not present in a CSV, such as how dates are represented and which columns are dates.
An example of getting a column-oriented data structure (however, involving loading the whole file):
import csv
allrows=list(csv.reader(open('test.csv')))
# Extract the first row as keys for a columns dictionary
columns=dict([(x[0],x[1:]) for x in zip(*allrows)])
The intermediate steps of going to list and storing in a variable aren't necessary. The key is using zip (or its cousin itertools.izip) to transpose the table.
Then extracting column two from all rows with a certain criterion in column one:
# csv values are read as strings, so convert before comparing
matchingrows = [rownum for (rownum, value) in enumerate(columns['one']) if int(value) > 2]
print map(columns['two'].__getitem__, matchingrows)
When you do know the type of a column, it may make sense to parse it, using appropriate functions like datetime.datetime.strptime.
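For instance, a minimal sketch building on the columns dictionary above, using field names from the question (the %m/%d/%Y date format and the presence of a value in every row are assumptions based on the sample):
import datetime

# Parse a date column into datetime objects (format guessed from the sample row)
list_dates = [datetime.datetime.strptime(d, '%m/%d/%Y') for d in columns['LIST_DATE']]

# Average list price of the rows whose SUBDIVISION contains KNOLLWOOD
hits = [i for i, s in enumerate(columns['SUBDIVISION']) if 'KNOLLWOOD' in s]
prices = [float(columns['LIST_PRICE'][i]) for i in hits]
if prices:
    print(sum(prices) / len(prices))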
At first glance it seems like your input might not actually be CSV; maybe it is tab delimited instead. Check out the csv docs at python.org: you can create a Dialect and use that to change the delimiter.
import csv
csv.register_dialect('exceltab', delimiter='\t')
for row in csv.DictReader(items_file,dialect='exceltab'):
    pass
For a little background, this is the csv file that I'm starting with. (The data is nonsensical and only used for proof of concept.)
Jackson,Thompson,jackson.thompson#hotmail.com,test,
Luke,Wallace,luke.wallace#lycos.com,test,
David,Wright,david.wright#hotmail.com,test,
Nathaniel,Butler,nathaniel.butler#aol.com,test,
Eli,Simpson,noah.simpson#hotmail.com,test,
Eli,Mitchell,eli.mitchell#aol.com,,test2
Bob,Test,bob.test#aol.com,test,
What I am attempting to do with this csv on a larger scale is: if the first value in a row is duplicated, I need to take the data from the second entry, append it to the row with the first instance of that value, and remove the duplicate row. For example, in the data above "Eli" is represented twice; the first instance has "test" after the email value, while the second instance has no value there but instead has another value in the next index over.
I would want it to go from this:
Eli,Simpson,noah.simpson#hotmail.com,test,,
Eli,Mitchell,eli.mitchell#aol.com,,test2
To this:
Eli,Simpson,noah.simpson#hotmail.com,test,test2
I have been able to successfully import this csv into my code using what is below.
import csv
f = open(r'C:\Projects\Python\Test.csv', 'r')
csv_f = csv.reader(f)
test_list = []
for row in csv_f:
    test_list.append(row[0])
print(test_list)
At this point I was able to import my csv, and put the first names into my list. I'm not sure how to compare the indexes to make the changes I'm looking for. I'm a python rookie so any help/guidance would be greatly appreciated.
If you want to use pandas, you could use the pandas .drop_duplicates() method. An example would look something like this:
import pandas as pd
csv_f = pd.read_csv(r'C:\a file with addresses')
csv_f = csv_f.drop_duplicates(subset=['thing_to_drop'], keep='first')
See the pandas documentation: https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.drop_duplicates.html
I am a kind of a newbie in Python as well, but I would suggest using DictReader and looking at the csv file as a dictionary, meaning every row is a dictionary.
This way you can iterate through the names easily.
Second, I would suggest keeping a list of names already known to you as you iterate through the file, so you can check whether a name has been seen before. For example, after name_list.append("eli") you can later check if "eli" in name_list: and, if so, add the key/value to the first occurrence.
I don't know if this is best practice so don't roast me guys, but this is a simple and quick solution.
This will help you practice iterating through lists and dictionaries as well.
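A quick sketch of that idea (a rough illustration, not necessarily best practice; the path and column layout are taken from the question, and the first occurrence's non-empty values win):
import csv

merged = {}  # first name -> the merged row for that name
with open(r'C:\Projects\Python\Test.csv', 'r') as f:
    for row in csv.reader(f):
        name = row[0]
        if name not in merged:
            merged[name] = row
        else:
            first = merged[name]
            while len(first) < len(row):   # pad so the columns line up
                first.append('')
            for i, value in enumerate(row):
                if value and not first[i]:
                    first[i] = value       # fill gaps from the duplicate row

for row in merged.values():
    print(','.join(row))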
Here is a helpful link for reading about csv handling.
I have a csv file that is very big, containing a load of different people. Some of these people come up twice. Something like this:
Name,Colour,Date
John,Red,2017
Dave,Blue,2017
Tom,Blue,2017
Amy,Green,2017
John,Red,2016
Dave,Green,2016
Tom,Blue,2016
John,Green,2015
Dave,Green,2015
Tom,Blue,2015
Rebecca,Blue,2015
I want a csv file that contains only the most recent colour for each person. For example, for John, Dave, Tom and Amy I am only interested in the row for 2017. For Rebecca I will need the value from 2015.
The csv file is huge, containing over 10 million records (all people have a unique ID so repeated names don't matter). I've tried something along the lines of the following:
Open csv file
Read line 1.
If person is not in "seen" list, add to csv file 2
Add person to "Seen" list.
Read line 2...
The problem is the "seen" list gets massive and I run out of memory. The other issue is sometimes the dates are not in order so an old entry gets into the "seen" list and then the new entry won't overwrite it. This would be easy to solve if I could sort the data by descending date, but I'm struggling to sort it with the size of the file.
Any suggestions?
If the whole csv file can be stored in a list like:
csv_as_list = [
(unique_id, color, year),
…
]
then you can sort this list by:
import operator
# first sort by year descending
csv_as_list.sort(key=operator.itemgetter(2), reverse=True)
# then, since the Python sort is stable, by unique_id
csv_as_list.sort(key=operator.itemgetter(0))
and then you can:
from __future__ import print_function
import operator, itertools
for unique_id, group in itertools.groupby(csv_as_list, operator.itemgetter(0)):
    latest_color = next(group)[1]
    print(unique_id, latest_color)
(I just used print here, but you get the gist.)
If the csv file cannot be loaded in-memory as a list, you'll have to go through an intermediate step that uses disk (e.g. SQLite).
Open your csv file to read.
Read line by line, appending the user to final_list if their ID is not already found in there. If it is found, compare the year of your current row with the one stored in final_list; if the current row is more recent, just change the date of that user in final_list, along with the colour associated with it.
Only then, when your final_list is done, will you write a new csv file.
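A minimal sketch of that idea, using a dict keyed on the unique ID so each lookup is constant time (the column order id, colour, year and the file names are assumptions; it still needs one entry per unique ID to fit in memory, and it compares years as strings, which works for four-digit years):
import csv

latest = {}  # unique_id -> (colour, year) of the most recent row seen so far
with open('myfile.csv') as f:
    reader = csv.reader(f)
    header = next(reader)
    for unique_id, colour, year in reader:
        if unique_id not in latest or year > latest[unique_id][1]:
            latest[unique_id] = (colour, year)

with open('output.csv', 'w') as out:
    writer = csv.writer(out)
    writer.writerow(header)
    for unique_id, (colour, year) in latest.items():
        writer.writerow([unique_id, colour, year])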
If you want this task to be faster, you want to...
Optimize your loops.
Use standard python functions and/or libraries coded in C.
If this is still not optimized enough... learn C. Reading a csv file in C, parsing it with a separator, and iterating through an array is not hard, even in C.
I see two obvious ways to solve this that don't involve keeping huge amounts of data in memory:
Use a database instead of CSV files
Reorganise your CSV files to facilitate sorting.
Using a database is fairly straightforward. I expect you could even use the SQLite that comes with Python. This would be my preferred option, I think. To get the best performance, create an index of (person, date).
The second involves letting the first column of your CSV file be the person ID and the second column be the date. Then you could sort the CSV file from the commandline, i.e. sort myfile.csv. This will group all entries for a particular person together, and provided your date is in a proper format (e.g. YYYY-MM-DD), the entry of interest will be the last one. The Unix sort command is not known for its speed, but it's very robust.
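If you go the SQLite route, a rough sketch could look like the following (the column layout and file names are placeholders, and it assumes dates sort correctly as text, e.g. YYYY-MM-DD or plain years):
import csv, sqlite3

conn = sqlite3.connect('people.db')
conn.execute('CREATE TABLE IF NOT EXISTS records (person TEXT, colour TEXT, date TEXT)')
with open('myfile.csv') as f:
    reader = csv.reader(f)
    next(reader)  # skip the Name,Colour,Date header
    conn.executemany('INSERT INTO records VALUES (?, ?, ?)', reader)
conn.execute('CREATE INDEX IF NOT EXISTS idx_person_date ON records (person, date)')
conn.commit()

# With MAX() in the select, SQLite returns the colour from the row holding the latest date
for person, colour, date in conn.execute(
        'SELECT person, colour, MAX(date) FROM records GROUP BY person'):
    print(person, colour, date)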
I'm currently working on a script that will query data from a REST API, and write the resulting values to a CSV. The data set could potentially contain hundreds of thousands of records, but it returns the data in sets of 100 entries. My goal is to include every key from each entry in the CSV.
What I have so far (this is a simplified structure for the purposes of this question):
import csv
resp = client.get_list()
while resp.token:
    my_data = resp.data
    process_data(my_data)
    resp = client.get_list(resp.token)

def process_data(my_data):
    # This section should write my_data to a CSV file
    # I know I can use csv.DictWriter as part of the solution
    # Notice that in this example, "fieldnames" is undefined
    # Defining it is part of the question
    with open('output.csv', 'a') as output_file:
        writer = csv.DictWriter(output_file, fieldnames=fieldnames)
        for element in my_data:
            writer.writerow(element)
The problem: Each entry doesn't necessarily have the same keys. A later entry missing a key isn't that big of a deal. My problem is, for example, entry 364 introducing an entirely new key.
Options that I've considered:
Whenever I encounter a new key, read in the output CSV, append the new key to the header, and append a comma to each previous line. This leads to a TON of file I/O, which I'm hoping to avoid.
Rather than writing to a CSV, write the raw JSON to a file. Meanwhile, build up a list of all known keys as I iterate over the data. Once I've finished querying the API, iterate over the JSON files that I wrote, and write the CSV using the list that I built. This leads to 2 total iterations over the data, and feels unnecessarily complex.
Hard code the list of potential keys beforehand. This approach is impossible, for a number of reasons.
None of these solutions feel particularly elegant to me, which leads me to my question. Is there a better way for me to approach this problem? Am I overlooking something obvious?
Options 1 and 2 both seem reasonable.
Does the CSV need to be valid and readable while you're creating it? If not, you could do the append of missing columns in one pass after you've finished reading from the API (which would be like a combination of the two approaches). If you do this, you'll probably have to use the regular csv.writer in the first pass rather than csv.DictWriter, since your column definition will grow while you're writing.
One thing to bear in mind: if the overall file is expected to be large (e.g. won't fit into memory), then your solution probably needs to use a streaming approach, which is easy with CSV but fiddly with JSON. You might also want to look into alternative formats to JSON for the intermediate data (e.g. XML, BSON, etc.).
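A rough sketch of that combined approach, reusing the client calls from the question (the intermediate file name is a placeholder; restval fills in columns that a given entry is missing):
import csv, json

fieldnames = []  # grows as new keys are discovered

# Pass 1: stream the raw entries to disk while collecting every key seen
with open('raw_entries.jsonl', 'w') as raw:
    resp = client.get_list()
    while resp.token:
        for entry in resp.data:
            for key in entry:
                if key not in fieldnames:
                    fieldnames.append(key)
            raw.write(json.dumps(entry) + '\n')
        resp = client.get_list(resp.token)

# Pass 2: all columns are known now, so write the CSV in one go
with open('output.csv', 'w') as output_file:
    writer = csv.DictWriter(output_file, fieldnames=fieldnames, restval='')
    writer.writeheader()
    with open('raw_entries.jsonl') as raw:
        for line in raw:
            writer.writerow(json.loads(line))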
My preference would be for this to be in Python since I am working on learning more. If you can provide help in bash that would still be helpful, though.
I've looked around Stack Overflow and found some helpful things but not enough for me to finish this.
I have two CSV files with some shared fields. The data is not INT. I would like to join based on matching 3 specific fields and write it out to a new output.csv when all the processing is done.
sourceA.csv looks like this:
fieldname_1,fieldname_2,fieldname_3,fieldname_4,fieldname_5,fieldname_6,fieldname_7,fieldname_8,fieldname_9,fieldname_10,fieldname_11,fieldname_12,fieldname_13,fieldname_14,fieldname_15,fieldname_16
sourceB.csv looks like this:
fieldname_4,fieldname_5,fieldname_OTHER,fieldname_8,fieldname_16
As you can see, sourceB.csv has 4 field names that are also in sourceA.csv and one field name that does not. The data in fieldname_OTHER will need to replace the data in sourceA[fieldname_6].
The whole process should go like this:
Replace data in sourceA[fieldname_6] with data from sourceB[fieldname_OTHER] if all of the following criteria are met:
data in sourceA[fieldname_4]=sourceB[fieldname_4]
data in sourceA[fieldname_8]=sourceB[fieldname_8]
data in sourceA[fieldname_16]=sourceB[fieldname_16]
(The data in sourceB[fieldname_5] does not need to be evaluated.)
If the above criteria aren't met, just replace sourceA[fieldname_6] with the text ANY.
Write each processed line out to output.csv.
A sample of what I would like the output to be based on the input CSVs and processing outlined above:
dataA,dataB,dataC,dataD,dataE,dataOTHER,dataG,dataH,dataI,dataJ,dataK,dataL,dataM,dataN,dataO,dataP
I hope the details I've provided haven't made it more confusing than it needs to be. Thank you for all your help!
I'm not sure I'd bother with SQL for a one-off merger like this. It's straightforward in python.
Read in both files with the csv module, to get two lists. Index sourceA into a dictionary whose key is the tuple of fields that need to be matched. You can then loop over sourceB, find the matching row instantly, and merge into it from sourceB.
When you're done, you can just output the list you read from sourceA: the dict and the list point to the same values, which you've now updated.
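A rough sketch of that approach, using the field names from the question (it assumes the match key is unique within sourceA, keeps sourceA's column order, and writes ANY where no sourceB row matches):
import csv

with open('sourceA.csv') as fa:
    reader_a = csv.DictReader(fa)
    fields_a = reader_a.fieldnames
    rows_a = list(reader_a)

# Index sourceA by the tuple of fields that must match
index = {}
for row in rows_a:
    row['fieldname_6'] = 'ANY'  # default until a sourceB match is found
    index[(row['fieldname_4'], row['fieldname_8'], row['fieldname_16'])] = row

with open('sourceB.csv') as fb:
    for row_b in csv.DictReader(fb):
        key = (row_b['fieldname_4'], row_b['fieldname_8'], row_b['fieldname_16'])
        if key in index:
            index[key]['fieldname_6'] = row_b['fieldname_OTHER']

with open('output.csv', 'w') as out:
    writer = csv.DictWriter(out, fieldnames=fields_a)
    writer.writeheader()
    writer.writerows(rows_a)  # the dict and the list share the same row objects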
First off, full disclosure: This is going towards a uni assignment, so I don't want to receive code. :). I'm more looking for approaches; I'm very new to python, having read a book but not yet written any code.
The entire task is to import the contents of a CSV file, create a decision tree from the contents of the CSV file (using the ID3 algorithm), and then parse a second CSV file to run against the tree. There's a big (understandable) preference to have it capable of dealing with different CSV files (I asked if we were allowed to hard code the column names, mostly to eliminate it as a possibility, and the answer was no).
The CSV files are in a fairly standard format; the header row is marked with a # then the column names are displayed, and every row after that is a simple series of values. Example:
# Column1, Column2, Column3, Column4
Value01, Value02, Value03, Value04
Value11, Value12, Value13, Value14
At the moment, I'm trying to work out the first part: parsing the CSV. To make the decisions for the decision tree, a dictionary structure seems like it's going to be the most logical; so I was thinking of doing something along these lines:
Read in each line, character by character
If the character is not a comma or a space
Append character to temporary string
If the character is a comma
Append the temporary string to a list
Empty string
Once a line has been read
Create a dictionary using the header row as the key (somehow!)
Append that dictionary to a list
However, if I do things that way, I'm not sure how to make a mapping between the keys and the values. I'm also wondering whether there is some way to perform an action on every dictionary in a list, since I'll need to be doing things to the effect of "Everyone return their values for columns Column1 and Column4, so I can count up who has what!" - I assume that there is some mechanism, but I don't think I know how to do it.
Is a dictionary the best way to do it? Would I be better off doing things using some other data structure? If so, what?
Python has some pretty powerful language constructs builtin. You can read lines from a file like:
with open(name_of_file,"r") as file:
    for line in file:
        # process the line
You can use the string.split function to separate the line along commas, and you can use string.strip to eliminate intervening whitespace. Python has very powerful lists and dictionaries.
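For example, for a simple line with no quoted values:
line = "Value01, Value02, Value03, Value04"
fields = [part.strip() for part in line.split(',')]
# fields is now ['Value01', 'Value02', 'Value03', 'Value04']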
To create a list, you simply use empty brackets like [], while to create an empty dictionary you use {}:
mylist = []; # Creates an empty list
mydict = {}; # Creates an empty dictionary
You can insert into the list using the .append() function, while you can use indexing subscripts to insert into the dictionary. For example, you can use mylist.append(5) to add 5 to the list, while you can use mydict[key]=value to associate the key key with the value value. To test whether a key is present in the dictionary, you can use the in keyword. For example:
if key in mydict:
print "Present"
else:
print "Absent"
To iterate over the contents of a list or dictionary, you can simply use a for-loop as in:
for val in mylist:
    # do something with val

for key in mydict:
    # do something with key or with mydict[key]
Since, in many cases, it is necessary to have both the value and index when iterating over a list, there is also a builtin function called enumerate that saves you the trouble of counting indices yourself:
for idx, val in enumerate(mylist):
    # do something with val or with idx. Note that val=mylist[idx]
The code above is identical in function to:
idx=0
for val in mylist:
    # process val, idx
    idx += 1
You could also iterate over the indices if you so chose:
for idx in xrange(len(mylist)):
    # Do something with idx and possibly mylist[idx]
Also, you can get the number of elements in a list or the number of keys in a dictionary using len.
It is possible to perform an operation on each element of a dictionary or list via the use of list comprehension; however, I would recommend that you simply use for-loops to accomplish that task. But, as an example:
>>> list1 = range(10)
>>> list1
[0, 1, 2, 3, 4, 5, 6, 7, 8, 9]
>>> list2 = [2*x for x in list1]
>>> list2
[0, 2, 4, 6, 8, 10, 12, 14, 16, 18]
When you have the time, I suggest you read the Python tutorial to get some more in-depth knowledge.
Example using the csv module from docs.python.org:
import csv
reader = csv.reader(open("some.csv", "rb"))
for row in reader:
    print row
Instead of printing the rows, you could just save each row into a list, and then process it in the ID3 later:
database = []
for row in reader:
    database.append(row)
Short answer: don't waste time and mental energy (1) reimplementing the built-in csv module (2) reading the csv module's source (it's written in C) -- just USE it!
Look at csv.DictReader.
Example:
import csv
reader = csv.DictReader(open('my_file.csv', 'rb'))  # 'rb' = read binary
for d in reader:
    print d  # this will print out a dictionary with keys equal to the first row of the file
Take a look at the built-in CSV module. Though you probably can't just use it, you can take a sneak peek at the code...
If that's a no-no, your (pseudo)code looks perfectly fine, though you should make use of the str.split() function, reading the file line by line.
Parse the CSV correctly
I'd avoid using str.split() to parse the fields because str.split() will not recognize quoted values. And many real-world CSV files use quotes.
http://en.wikipedia.org/wiki/Comma-separated_values
Example record using quoted values:
1997,Ford,E350,"Super, luxurious truck"
If you use str.split(), you'll get a record like this with 5 fields:
('1997', 'Ford', 'E350', '"Super', ' luxurious truck"')
But what you really want are records like this with 4 fields:
('1997', 'Ford', 'E350', 'Super, luxurious truck')
Also, besides commas being in the data, you may have to deal with newlines "\r\n" or just "\n" in the data. For example:
1997,Ford,E350,"Super
luxurious truck"
1997,Ford,E250,"Ok? Truck"
So be careful using:
file = open('filename.csv', 'r')
for line in file:
    # problem here, "line" may contain partial data
Also, like John mentioned, the CSV standard says that, inside quotes, a doubled double quote turns into one quote.
1997,Ford,E350,"Super ""luxurious"" truck"
('1997', 'Ford', 'E350', 'Super "luxurious" truck')
So I'd suggest modifying your finite state machine like this (a rough sketch follows the list below):
Parse each character at a time.
Check to see if it's a quote, then set the state to "in quote"
If "in quote", store all the characters in the current field until there's another quote.
If "in quote", and there's another quote, store the quote character in the field data. (not the end, because a blank field shouldn't be `data,"",data` but instead `data,,data`)
If not "in quote", store the characters until you find a comma or newline.
If comma, save field and start a new field.
If newline, save field, save record, start a new record and a new field.
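Here is a rough sketch of such a state machine, as a teaching aid only (it assumes a doubled quote inside a quoted field stands for one literal quote, and it strips surrounding whitespace from each field):
def parse_csv_text(text):
    """Very small CSV parser: handles quoted fields, doubled quotes and
    newlines inside quotes. Whitespace around fields is stripped."""
    records, record, field = [], [], ''
    in_quote = False
    i = 0
    while i < len(text):
        ch = text[i]
        if in_quote:
            if ch == '"':
                if i + 1 < len(text) and text[i + 1] == '"':
                    field += '"'          # doubled quote -> one literal quote
                    i += 1                # skip the second quote of the pair
                else:
                    in_quote = False      # closing quote
            else:
                field += ch               # commas and newlines are plain data here
        else:
            if ch == '"':
                in_quote = True
            elif ch == ',':
                record.append(field.strip())
                field = ''
            elif ch == '\n':
                record.append(field.strip())
                records.append(record)
                record, field = [], ''
            elif ch != '\r':
                field += ch
        i += 1
    if field or record:                   # final record without a trailing newline
        record.append(field.strip())
        records.append(record)
    return records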
On a side note, interestingly, I've never seen a header commented out using # in a CSV. So to me, that would imply that you may have to look for commented lines in the data too. Using # to comment out a line in a CSV file is not standard.
Adding found fields into a record dictionary using header keys
Depending on memory requirements, if the CSV is small enough (maybe 10k to 100k records), using a dictionary is fine. Just store a list of all the column names so you can access the column name by index (or number). Then in the finite state machine, increment the column index when you find a comma, and reset to 0 when you find a newline.
So if your header is header = ['Column1', 'Column2'], then when you find a data character, add it like this:
record[header[column_index]] += character
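Combined with a parser like the sketch above (or the csv module), building record dictionaries keyed by the header might look like this; the lstrip handles the leading '#' the question's header uses, and Column1/Column4 are just the question's example names:
records = parse_csv_text(open('some.csv').read())
header = [name.lstrip('# ') for name in records[0]]   # e.g. ['Column1', 'Column2', ...]
rows = [dict(zip(header, rec)) for rec in records[1:]]

# "Everyone return their values for Column1 and Column4" then becomes:
pairs = [(row['Column1'], row['Column4']) for row in rows]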
I don't know too much about the builtin csv module that @Kaloyan Todorov talks about, but if you're reading comma-separated lines, then you can easily do this:
for line in file:
    columns = line.split(',')
    for column in columns:
        print column.strip()
This will print all the entries of each line without the leading and trailing whitespace.