python csv to dictionary columnwise - python

Is it possible to read data from a csv file into a dictionary, such that the first row of a column is the key and the remaining rows of that same column constitute the value as a list?
E.g. I have a csv file
strings, numbers, colors
string1, 1, blue
string2, 2, red
string3, 3, green
string4, 4, yellow
using
with open(file,'rU') as f:
    reader = csv.DictReader(f)
    for row in reader:
        print row
I obtain
{'color': 'blue', 'string': 'string1', 'number': '1'}
{'color': 'red', 'string': 'string2', 'number': '2'}
{'color': 'green', 'string': 'string3', 'number': '3'}
{'color': 'yellow', 'string': 'string4', 'number': '4'}
or using
with open(file,'rU') as f:
    reader = csv.reader(f)
    mydict = {rows[0]:rows[1:] for rows in reader}
    print(mydict)
I obtain the following dictionary
{'string3': ['3', 'green'], 'string4': ['4', 'yellow'], 'string2': ['2', 'red'], 'string': ['number', 'color'], 'string1': ['1', 'blue']}
However, I would like to obtain
{'strings': ['string1', 'string2', 'string3', 'string4'], 'numbers': [1, 2, 3,4], 'colors': ['red', 'blue', 'green', 'yellow']}

You need to parse the first row, create the columns, and then progress to the rest of the rows.
For example:
import csv
columns = []
with open(file, newline='') as f:  # the 'U' in 'rU' was removed in Python 3.11
    reader = csv.reader(f)
    for row in reader:
        if columns:
            for i, value in enumerate(row):
                columns[i].append(value)
        else:
            # first row
            columns = [[value] for value in row]
# you now have a column-major 2D array of your file.
as_dict = {c[0] : c[1:] for c in columns}
print(as_dict)
output:
{
' numbers': [' 1', ' 2', ' 3', ' 4'],
' colors ': [' blue', ' red', ' green', ' yellow'],
'strings': ['string1', 'string2', 'string3', 'string4']
}
(The stray spaces come from your input "file". Remove the spaces before/after the commas, or call value.strip() on each field if they are in your real input.)
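The same column-wise result can also be sketched by transposing the rows with zip and stripping each field on the way in. This is only an illustration under assumptions: the sample text is inlined through io.StringIO so the snippet runs on its own, and the helper name columns_from_csv is made up for this example.

```python
import csv
import io

# Sample data from the question, inlined so the sketch is self-contained.
SAMPLE = """strings, numbers, colors
string1, 1, blue
string2, 2, red
string3, 3, green
string4, 4, yellow
"""

def columns_from_csv(fileobj):
    """Return {header: [values...]} with whitespace stripped from every field."""
    reader = csv.reader(fileobj)
    rows = [[value.strip() for value in row] for row in reader if row]
    if not rows:
        return {}
    header, body = rows[0], rows[1:]
    # zip(*body) turns the list of rows into a sequence of columns.
    return {name: list(column) for name, column in zip(header, zip(*body))}

as_dict = columns_from_csv(io.StringIO(SAMPLE))
print(as_dict)
```

Every value is still a string here. If the numbers column should hold integers, one more line such as as_dict['numbers'] = [int(n) for n in as_dict['numbers']] would convert it.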

This is why we have the defaultdict
from collections import defaultdict
from csv import DictReader
columnwise_table = defaultdict(list)
with open(file, newline='') as f:  # the 'U' in 'rU' was removed in Python 3.11
    reader = DictReader(f)
    for row in reader:
        for col, dat in row.items():
            columnwise_table[col].append(dat)
print(columnwise_table)

Yes, it is possible. Try it this way:
import csv
from collections import defaultdict
D = defaultdict(list)
with open('filename.csv', newline='') as csvfile:
    reader = csv.DictReader(csvfile)  # DictReader uses the first row as dictionary keys
    for l in reader:  # each row is in the form {k1 : v1, ... kn : vn}
        for k, v in l.items():
            D[k].append(v)
...................
...................
Assuming filename.csv has some data like
strings,numbers,colors
string1,1,blue
string2,2,red
string3,3,green
string4,4,yellow
then D will result in
defaultdict(<class 'list'>,
{'numbers': ['1', '2', '3', '4'],
'strings': ['string1', 'string2', 'string3', 'string4'],
'colors': ['blue', 'red', 'green', 'yellow']})
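If a plain dict is preferred over a defaultdict, one possible variation is a dict comprehension over the reader's fieldnames. The sketch below assumes the file is small enough to hold all rows in memory, and it inlines a shortened sample through io.StringIO instead of opening filename.csv.

```python
import csv
import io

# Shortened sample data, inlined so the sketch runs without a file on disk.
SAMPLE = """strings,numbers,colors
string1,1,blue
string2,2,red
"""

reader = csv.DictReader(io.StringIO(SAMPLE))
rows = list(reader)  # materialise the rows; fine for small files
# reader.fieldnames holds the header row in file order.
table = {name: [row[name] for row in rows] for name in reader.fieldnames}
print(table)
```

Because the keys come from fieldnames, the columns keep the order they have in the header row.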

Related

Using csv.DictReader with a slice

I was wondering whether there was a way to read all columns in a row except the first one as ints using csv.DictReader, kind of like this:
filename = sys.argv[1]
database = []
with open(filename) as file:
    reader = csv.DictReader(file)
    for row in reader:
        row[1:] = int(row[1:])
        database.append(row)
I know this isn't a correct way to do this as it gives out the error of being unable to hash slices. I have a way to circumvent having to do this at all, but for future reference, I'm curious whether, using slices or not, I can selectively interact with columns in a row without hardcoding each one?
You can do it by using the keys() dictionary method to get a list of the keys in each dictionary, and then slicing that list to pick the columns to convert:
import csv
from pprint import pprint
import sys
filename = sys.argv[1]
database = []
with open(filename) as file:
    reader = csv.DictReader(file)
    for row in reader:
        for key in list(row.keys())[1:]:
            row[key] = int(row[key])
        database.append(row)
pprint(database)
Output:
[{'name': 'John', 'number1': 1, 'number2': 2, 'number3': 3, 'number4': 4},
{'name': 'Alex', 'number1': 4, 'number2': 3, 'number3': 2, 'number4': 1},
{'name': 'James', 'number1': 1, 'number2': 3, 'number3': 2, 'number4': 4}]
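A possible variation takes the slice from reader.fieldnames once, instead of rebuilding the key list for every row. This is a sketch only: the two-row sample is invented for illustration and is read through io.StringIO rather than from sys.argv[1].

```python
import csv
import io

# Invented sample in the same shape as the output above.
SAMPLE = """name,number1,number2
John,1,2
Alex,4,3
"""

database = []
reader = csv.DictReader(io.StringIO(SAMPLE))
numeric_columns = reader.fieldnames[1:]  # every header except the first
for row in reader:
    for key in numeric_columns:
        row[key] = int(row[key])
    database.append(row)
print(database)
```

Slicing the header list makes it explicit that "first column" means the first column of the file, independent of how each row dictionary orders its keys.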
Use this:
import csv
filename = 'test.csv'
database = []
with open(filename) as file:
    reader = csv.DictReader(file)
    for row in reader:
        new_d = {} # Create new dictionary to be appended into database
        for i, (k, v) in enumerate(row.items()): # Loop through items of the row (i = index, k = key, v = value)
            new_d[k] = int(v) if i > 0 else v # If i > 0, add int(v) to the dictionary, else add v
        database.append(new_d) # Append to database
print(database)
test.csv:
Letter,Num1,Num2
A,123,456
B,789,012
C,345,678
D,901,234
E,567,890
Output:
[{'Letter': 'A', 'Num1': 123, 'Num2': 456},
{'Letter': 'B', 'Num1': 789, 'Num2': 12},
{'Letter': 'C', 'Num1': 345, 'Num2': 678},
{'Letter': 'D', 'Num1': 901, 'Num2': 234},
{'Letter': 'E', 'Num1': 567, 'Num2': 890}]
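Note that int() drops leading zeros, which is why 012 appears as 12 in the output above. If some columns may hold non-numeric text, a tolerant converter is one option. The helper name to_int_if_possible and the extra Note column below are hypothetical, added only to illustrate the idea.

```python
def to_int_if_possible(text):
    """Return int(text) when text parses as an integer, else the original string."""
    try:
        return int(text)
    except ValueError:
        return text

# One row as csv.DictReader would produce it (all values are strings).
row = {'Letter': 'B', 'Num1': '789', 'Num2': '012', 'Note': 'n/a'}
converted = {key: to_int_if_possible(value) for key, value in row.items()}
print(converted)
```

This converts by content rather than by column position, so it would also turn a numeric-looking first column into an int. Whether that is wanted depends on the data.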

How can I loop through this dictionary instead of hardcoding the keys

So far, I have this code (from cs50/pset6/DNA):
import csv
from sys import argv
data_dict = {}
with open(argv[1]) as data_file:
    reader = csv.DictReader(data_file)
    for record in reader:
        # `record` is a dictionary of column-name & value
        name = record["name"]
        data = {
            "AGATC": record["AGATC"],
            "AATG": record["AATG"],
            "TATC": record["TATC"],
        }
        data_dict[name] = data
print(data_dict)
Output
{'Alice': {'AATG': '8', 'AGATC': '2', 'TATC': '3'},
'Bob': {'AATG': '1', 'AGATC': '4', 'TATC': '5'},
'Charlie': {'AATG': '2', 'AGATC': '3', 'TATC': '5'}}
Here is the csv file:
name,AGATC,AATG,TATC
Alice,2,8,3
Bob,4,1,5
Charlie,3,2,5
But my goal is to achieve the exact same thing, but instead of hardcoding the keys AATG, etc., and also because I'll use a much much bigger database that contains more values, I want to be able to loop through the data, instead of doing this:
data = {
    "AGATC": record["AGATC"],
    "AATG": record["AATG"],
    "TATC": record["TATC"],
}
Could you please help me? Thanks
You could also try using pandas.
Using your example data as .csv file:
pandas.read_csv('example.csv', index_col = 0).transpose().to_dict()
Outputs:
{'Alice': {'AGATC': 2, 'AATG': 8, 'TATC': 3},
'Bob': {'AGATC': 4, 'AATG': 1, 'TATC': 5},
'Charlie': {'AGATC': 3, 'AATG': 2, 'TATC': 5}}
index_col = 0 because you have names column which I set as index (so that later becomes top level keys in dictionary)
.transpose() so top level keys are names and not features (AGATC, AATG, etc.)
.to_dict() to transform pandas.DataFrame to python dictionary
You can simply use pandas:
import csv
from sys import argv
import pandas as pd
with open(argv[1]) as data_file:
    reader = csv.DictReader(data_file)
    df = pd.DataFrame(reader)
df = df.set_index('name') # set name column as index
data_dict = df.transpose().to_dict() # transpose to make dict with indexes
print(data_dict)
You can loop through a dictionary in Python simply enough like this:
for key in dictionary:
    print(key, dictionary[key])
You are on the right track using csv.DictReader.
import csv
from pprint import pprint
data_dict = {}
with open('fasta.csv', 'r') as f:
    reader = csv.DictReader(f)
    for record in reader:
        name = record.pop('name')
        data_dict[name] = record
pprint(data_dict)
Prints
{'Alice': {'AATG': '8', 'AGATC': '2', 'TATC': '3'},
'Bob': {'AATG': '1', 'AGATC': '4', 'TATC': '5'},
'Charlie': {'AATG': '2', 'AGATC': '3', 'TATC': '5'}}
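The counts above are still strings. If integer counts are needed (for example, to compare them against computed repeat counts), a dict comprehension over the remaining items is one way to convert them without naming any column. The sketch inlines the CSV from the question through io.StringIO, which is an assumption made to keep it self-contained.

```python
import csv
import io

# CSV content from the question, inlined so the sketch runs without a file.
SAMPLE = """name,AGATC,AATG,TATC
Alice,2,8,3
Bob,4,1,5
Charlie,3,2,5
"""

data_dict = {}
for record in csv.DictReader(io.StringIO(SAMPLE)):
    name = record.pop('name')  # remove the name so only the count columns remain
    data_dict[name] = {key: int(count) for key, count in record.items()}
print(data_dict)
```

Adding more columns to the CSV requires no change to this loop, since it never refers to a count column by name.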

Python unique values per column in csv file row

I have been crunching on this for a long time. Is there an easy way, using NumPy or pandas or by fixing my code, to get the unique values within each column of a row, where the values are separated by "|"?
I.e., the data:
"id","fname","lname","education","gradyear","attributes"
"1","john","smith","mit|harvard|ft|ft|ft","2003|207|212|212|212","qa|admin,co|master|NULL|NULL"
"2","john","doe","htw","2000","dev"
Output should be:
"id","fname","lname","education","gradyear","attributes"
"1","john","smith","mit|harvard|ft","2003|207|212","qa|admin,co|master|NULL"
"2","john","doe","htw","2000","dev"
My broken code:
import csv
import pprint
your_list = csv.reader(open('out.csv'))
your_list = list(your_list)
#pprint.pprint(your_list)
string = "|"
cols_no=6
for line in your_list:
    i=0
    for col in line:
        if i==cols_no:
            print "\n"
            i=0
        if string in col:
            values = col.split("|")
            myset = set(values)
            items = list()
            for item in myset:
                items.append(item)
            print items
        else:
            print col+",",
        i=i+1
It outputs:
id, fname, lname, education, gradyear, attributes, 1, john, smith, ['harvard', 'ft', 'mit']
['2003', '212', '207']
['qa', 'admin,co', 'NULL', 'master']
2, john, doe, htw, 2000, dev,
Thanks in advance!
numpy/pandas is a bit overkill for what you can achieve with csv.DictReader and csv.DictWriter with a collections.OrderedDict, e.g.:
import csv
from collections import OrderedDict
# If using Python 2.x, use `open('output.csv', 'wb')` instead and drop the newline argument
with open('input.csv', newline='') as fin, open('output.csv', 'w', newline='') as fout:
    csvin = csv.DictReader(fin)
    csvout = csv.DictWriter(fout, fieldnames=csvin.fieldnames, quoting=csv.QUOTE_ALL)
    csvout.writeheader()
    for row in csvin:
        for k, v in row.items():
            # OrderedDict.fromkeys drops repeats while keeping first-seen order
            row[k] = '|'.join(OrderedDict.fromkeys(v.split('|')))
        csvout.writerow(row)
Gives you:
"id","fname","lname","education","gradyear","attributes"
"1","john","smith","mit|harvard|ft","2003|207|212","qa|admin,co|master|NULL"
"2","john","doe","htw","2000","dev"
If you don't care about the order when you have many items separated with |, this will work:
lst = ["id","fname","lname","education","gradyear","attributes",
       "1","john","smith","mit|harvard|ft|ft|ft","2003|207|212|212|212","qa|admin,co|master|NULL|NULL",
       "2","john","doe","htw","2000","dev"]
def no_duplicate(string):
    return "|".join(set(string.split("|")))
result = list(map(no_duplicate, lst))  # list() is needed on Python 3, where map is lazy
print(result)
result:
['id', 'fname', 'lname', 'education', 'gradyear', 'attributes', '1', 'john', 'smith', 'ft|harvard|mit', '2003|207|212', 'NULL|admin,co|master|qa', '2', 'john', 'doe', 'htw', '2000', 'dev']
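If order does matter, a set is not suitable because it does not preserve insertion order. On Python 3.7 and later, where plain dicts keep insertion order, dict.fromkeys can stand in for OrderedDict.fromkeys. A small sketch, with the function name dedupe_pipe_separated chosen here purely for illustration:

```python
def dedupe_pipe_separated(field, sep='|'):
    """Drop repeated items from a separator-joined string, keeping first-seen order."""
    # dict keys are unique and (on Python 3.7+) keep insertion order.
    return sep.join(dict.fromkeys(field.split(sep)))

print(dedupe_pipe_separated('mit|harvard|ft|ft|ft'))
print(dedupe_pipe_separated('2003|207|212|212|212'))
print(dedupe_pipe_separated('htw'))
```

A field without any separator passes through unchanged, so the function can be applied to every cell of a row.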

How to create dict using csv file with first row as keys

I'd like to create a list of dictionaries by reading a large csv file, using the entries from the first row as keys. For example, test.csv:
Header1, Header2, Header3
A, 1, 10
B, 2, 20
C, 3, 30
The resulting dict would look like:
MyList = [{'Header1': 'A', 'Header2': 1, 'Header3': 10}, {'Header1': 'B', 'Header2': 2, 'Header3': 20}, {'Header1': 'C', 'Header2': 3, 'Header3': 30}]
I know how to read a file, and think maybe a defaultdict from collections might be a good way, but can't get the syntax right.
This is exactly what csv.DictReader was made for.
import csv
with open('data.csv') as f:
    reader = csv.DictReader(f)
    for row in reader:
        print(row)
For the data.csv containing:
Header1,Header2,Header3
A,1,10
B,2,20
C,3,30
It prints:
{'Header2': '1', 'Header3': '10', 'Header1': 'A'}
{'Header2': '2', 'Header3': '20', 'Header1': 'B'}
{'Header2': '3', 'Header3': '30', 'Header1': 'C'}
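To end up with the list the question asks for, rather than printing each row, the reader can be collected into a list. The sketch below also passes skipinitialspace=True, since the question's test.csv has a space after each comma. The sample is inlined through io.StringIO as an assumption to keep the snippet self-contained.

```python
import csv
import io

# test.csv from the question, including the spaces after the commas.
SAMPLE = """Header1, Header2, Header3
A, 1, 10
B, 2, 20
C, 3, 30
"""

# skipinitialspace=True ignores whitespace immediately following a delimiter.
reader = csv.DictReader(io.StringIO(SAMPLE), skipinitialspace=True)
my_list = [dict(row) for row in reader]
print(my_list)
```

All values are strings. For a truly large file, iterating over the reader row by row avoids holding every dictionary in memory at once.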

How to turn CSV data into dictionary [duplicate]

I have a CSV file which I am opening through this code:
open(file,"r")
When I read the file I get the output:
['hello', 'hi', 'bye']
['jelly', 'belly', 'heli']
['red', 'black', 'blue']
I want the output to be something like this:
{'hello':['jelly','red'], 'hi':['belly','black'], 'bye':['heli','blue']}
but I have no idea how.
You can use collections.defaultdict and csv.DictReader:
>>> import csv
>>> from collections import defaultdict
>>> with open('abc.csv') as f:
...     reader = csv.DictReader(f)
...     d = defaultdict(list)
...     for row in reader:
...         for k, v in row.items():
...             d[k].append(v)
...
>>> d
defaultdict(<type 'list'>,
{'hi': ['belly', 'black'],
'bye': ['heli', 'blue'],
'hello': ['jelly', 'red']})
csv = [
['hello', 'hi', 'bye'],
['jelly', 'belly', 'heli'],
['red', 'black', 'blue'],
]
csv = zip(*csv)
result = {}
for row in csv:
    result[row[0]] = row[1:]
yourHash = {}
with open(yourFile, 'r') as inFile:
    for line in inFile:
        line = line.rstrip().split(',')
        yourHash[line[0]] = line[1:]
This assumes that each key is unique to one line. If not, this would have to be modified to:
yourHash = {}
with open(yourFile, 'r') as inFile:
    for line in inFile:
        line = line.rstrip().split(',')
        if line[0] in yourHash:
            yourHash[line[0]] += line[1:]
        else:
            yourHash[line[0]] = line[1:]
Of course, you can use csv, but I figured that someone would definitely post that, so I gave an alternative way to do it. Good luck!
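The repeated-key branch above can be collapsed with dict.setdefault. The following is a sketch only, keyed on the first field of each line as in this answer. It uses csv.reader instead of a manual split, and the sample rows (with a deliberately repeated key) are invented for illustration.

```python
import csv
import io

# Invented sample in which the key 'hello' appears on two lines.
SAMPLE = """hello,jelly,red
hi,belly,black
hello,heli,blue
"""

your_hash = {}
for line in csv.reader(io.StringIO(SAMPLE)):
    if not line:
        continue  # skip blank lines
    # setdefault returns the existing list for a key, or stores and returns a new one.
    your_hash.setdefault(line[0], []).extend(line[1:])
print(your_hash)
```

A collections.defaultdict(list) would behave the same way; setdefault just keeps the result a plain dict.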
You can use csv, read the first line to get the header, create the number of lists corresponding to the header and then create the dict:
import csv
with open(ur_csv) as fin:
    reader=csv.reader(fin, quotechar="'", skipinitialspace=True)
    header=[[head] for head in next(reader)]
    for row in reader:
        for i, e in enumerate(row):
            header[i].append(e)
data={l[0]:l[1:] for l in header}
print(data)
# {'hi': ['belly', 'black'], 'bye': ['heli', 'blue'], 'hello': ['jelly', 'red']}
If you want something more terse, you can use Jon Clements' excellent solution:
with open(ur_csv) as fin:
    csvin = csv.reader(fin, quotechar="'", skipinitialspace=True)
    header = next(csvin, [])
    data=dict(zip(header, zip(*csvin)))
# {'bye': ('heli', 'blue'), 'hello': ('jelly', 'red'), 'hi': ('belly', 'black')}
But that will produce a dictionary of tuples if that matters...
And if your csv file is huge, you may want to rewrite this to generate a dictionary row by row (similar to DictReader):
import csv
def key_gen(fn):
    with open(fn) as fin:
        reader=csv.reader(fin, quotechar="'", skipinitialspace=True)
        header=next(reader, [])
        for row in reader:
            yield dict(zip(header, row))
for e in key_gen(ur_csv):
    print(e)
# {'hi': 'belly', 'bye': 'heli', 'hello': 'jelly'}
# {'hi': 'black', 'bye': 'blue', 'hello': 'red'} etc...
