I'd like to create a list of dictionaries reading from a large csv file that uses the entries from the first row as keys. for example, test.csv
Header1, Header2, Header3
A, 1, 10
B, 2, 20
C, 3, 30
The resulting dict would look like:
MyList = [{'Header1': A, 'Header2': 1, 'Header3': 10}, {'Header1': B, 'Header2': 2, 'Header3': 20}, {'Header1': C, 'Header2': 3, 'Header3': 30}]
I know how to read a file, and think maybe a defaultdict from collections might be a good way, but can't get the syntax right.
This is exactly what csv.DictReader was made for.
import csv

with open('data.csv', newline='') as f:
    reader = csv.DictReader(f)
    for row in reader:
        print(row)
For the data.csv containing:
Header1,Header2,Header3
A,1,10
B,2,20
C,3,30
It prints:
{'Header2': '1', 'Header3': '10', 'Header1': 'A'}
{'Header2': '2', 'Header3': '20', 'Header1': 'B'}
{'Header2': '3', 'Header3': '30', 'Header1': 'C'}
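If you want the list of dictionaries from the question rather than printed rows, the reader can be consumed into a list directly. A minimal sketch, with io.StringIO standing in for the open file:

```python
import csv
import io

# Inline stand-in for data.csv
csv_text = "Header1,Header2,Header3\nA,1,10\nB,2,20\nC,3,30\n"

with io.StringIO(csv_text) as f:
    # dict(row) also normalizes the OrderedDict rows that older Pythons return
    MyList = [dict(row) for row in csv.DictReader(f)]

print(MyList)
# [{'Header1': 'A', 'Header2': '1', 'Header3': '10'}, ...]
```

Note that the values come out as strings ('1', not 1); csv has no type information, so any numeric conversion has to be done explicitly.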
Related
I want to create a "dictionary of dictionaries" for each row of the following csv file
name,AGATC,AATG,TATC
Alice,2,8,3
Bob,4,1,5
Charlie,3,2,5
So the idea is, that mydict["Alice"] should be {'AGATC': 2, 'AATG': 8, 'TATC': 3} etc.
I really do not understand the .reader and .DictReader functions sufficiently. https://docs.python.org/3/library/csv.html#csv.DictReader
Because I am a newbie and cannot quite understand the docs. Do you have other 'easier' resources, that you can recommend?
First, I have to get the first column, i.e. names and put them as keys. How can I access that first column?
Second, I want to create a dictionary inside that name (as the value), with the keys being AGATC,AATG,TATC. Do you understand what I mean? Is that possible?
Edit, made progress:
# Open the CSV file and read its contents into memory.
with open(argv[1]) as csvfile:
    # Each row read from the csv file is returned as a list of strings.
    reader = list(csv.reader(csvfile))

# Establish dicts.
mydict = {}
for i in range(1, len(reader)):
    print(reader[i][0])
    mydict[reader[i][0]] = reader[i][1:]
print(mydict)
Out:
{'Alice': ['2', '8', '3'], 'Bob': ['4', '1', '5'], 'Charlie': ['3', '2', '5']}
But how to implement nested dictionaries as described above?
Edit #3:
# Open the CSV file and read its contents into memory.
with open(argv[1]) as csvfile:
    # Each row read from the csv file is returned as a list of strings.
    reader = list(csv.reader(csvfile))

# Establish dicts.
mydict = {}
for i in range(1, len(reader)):
    print(reader[i][0])
    mydict[reader[i][0]] = reader[i][1:]
print(mydict)
print(len(reader))

dictlist = [dict() for x in range(1, len(reader))]
for i in range(1, len(reader)):
    dictlist[i-1] = dict(zip(reader[0][1:], mydict[reader[i][0]]))
print(dictlist)
Out:
[{'AGATC': '2', 'AATG': '8', 'TATC': '3'}, {'AGATC': '4', 'AATG': '1', 'TATC': '5'}, {'AGATC': '3', 'AATG': '2', 'TATC': '5'}]
So I solved it for myself:)
The following code will give you what you've asked for in terms of dict structure.
import csv

with open('file.csv', newline='') as csvfile:
    mydict = {}
    reader = csv.DictReader(csvfile)
    # Iterate through each line of the csv file
    for row in reader:
        # Create the dictionary structure as desired.
        # This uses a dict comprehension: for each item in the row, keep the
        # key and the value, except when the key is 'name' (k != 'name')
        mydict[row['name']] = {k: v for k, v in row.items() if k != 'name'}
    print(mydict)
This will give you
{
'Alice': {'AGATC': '2', 'AATG': '8', 'TATC': '3'},
'Bob': {'AGATC': '4', 'AATG': '1', 'TATC': '5'},
'Charlie': {'AGATC': '3', 'AATG': '2', 'TATC': '5'}
}
There are plenty of videos and articles covering comprehensions on the net if you need more information on these.
So far, I have this code (from cs50/pset6/DNA):
import csv
from sys import argv

data_dict = {}
with open(argv[1]) as data_file:
    reader = csv.DictReader(data_file)
    for record in reader:
        # `record` is a dictionary of column-name & value
        name = record["name"]
        data = {
            "AGATC": record["AGATC"],
            "AATG": record["AATG"],
            "TATC": record["TATC"],
        }
        data_dict[name] = data
print(data_dict)
Output
{'Alice': {'AATG': '8', 'AGATC': '2', 'TATC': '3'},
'Bob': {'AATG': '1', 'AGATC': '4', 'TATC': '5'},
'Charlie': {'AATG': '2', 'AGATC': '3', 'TATC': '5'}}
Here is the csv file:
name,AGATC,AATG,TATC
Alice,2,8,3
Bob,4,1,5
Charlie,3,2,5
But my goal is to achieve exactly the same thing without hardcoding the keys AGATC, etc. Since I'll be using a much bigger database that contains more values, I want to loop through the data instead of doing this:
data = {
    "AGATC": record["AGATC"],
    "AATG": record["AATG"],
    "TATC": record["TATC"],
}
Could you please help me? Thanks
You could also try using pandas.
Using your example data as .csv file:
import pandas
pandas.read_csv('example.csv', index_col=0).transpose().to_dict()
Outputs:
{'Alice': {'AGATC': 2, 'AATG': 8, 'TATC': 3},
'Bob': {'AGATC': 4, 'AATG': 1, 'TATC': 5},
'Charlie': {'AGATC': 3, 'AATG': 2, 'TATC': 5}}
index_col=0 because you have a names column, which I set as the index (so the names later become the top-level keys in the dictionary)
.transpose() so the top-level keys are names and not features (AGATC, AATG, etc.)
.to_dict() to transform the pandas.DataFrame into a Python dictionary
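Putting those three steps together as a self-contained sketch, with io.StringIO standing in here for example.csv:

```python
import io
import pandas

# Inline stand-in for example.csv
csv_text = "name,AGATC,AATG,TATC\nAlice,2,8,3\nBob,4,1,5\nCharlie,3,2,5\n"

# index_col=0 makes names the index; transpose puts them on top; to_dict nests
result = pandas.read_csv(io.StringIO(csv_text), index_col=0).transpose().to_dict()
print(result)
```

A side benefit of pandas here: read_csv infers dtypes, so the counts come back as integers rather than the strings the csv module produces.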
You can simply use pandas:
import csv
import pandas as pd
from sys import argv

data_dict = {}
with open(argv[1]) as data_file:
    reader = csv.DictReader(data_file)
    df = pd.DataFrame(reader)
    df = df.set_index('name')  # set the name column as the index
    data_dict = df.transpose().to_dict()  # transpose so names become the top-level keys
print(data_dict)
You can loop through a dictionary in Python simply enough like this:
for key in dictionary:
    print(key, dictionary[key])
You are on the right track using csv.DictReader.
import csv
from pprint import pprint

data_dict = {}
with open('fasta.csv', 'r') as f:
    reader = csv.DictReader(f)
    for record in reader:
        name = record.pop('name')
        data_dict[name] = record
pprint(data_dict)
Prints
{'Alice': {'AATG': '8', 'AGATC': '2', 'TATC': '3'},
'Bob': {'AATG': '1', 'AGATC': '4', 'TATC': '5'},
'Charlie': {'AATG': '2', 'AGATC': '3', 'TATC': '5'}}
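One thing to keep in mind: the csv module always yields strings, so the counts above come out as '8' rather than 8. If you want the integer values shown in the question's example, a small comprehension can cast them. A sketch, assuming every non-name column is numeric:

```python
import csv
import io

# Inline stand-in for the csv file from the question
csv_text = "name,AGATC,AATG,TATC\nAlice,2,8,3\nBob,4,1,5\nCharlie,3,2,5\n"

data_dict = {}
for record in csv.DictReader(io.StringIO(csv_text)):
    name = record.pop('name')  # pop removes 'name' so only counts remain
    data_dict[name] = {k: int(v) for k, v in record.items()}  # cast counts to int

print(data_dict)
# {'Alice': {'AGATC': 2, 'AATG': 8, 'TATC': 3}, ...}
```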
I have an input list of the form:
d=[{'CLIENT': ['A','B','C']},{'ROW':['1','2','3']},{'KP':['ROM','MON','SUN']}]
I want the output to look like:
S=[{'CLIENT':'A','ROW':'1','KP':'ROM'},
{'CLIENT':'B','ROW':'2','KP':'MON'},
{'CLIENT':'C','ROW':'3','KP':'SUN'},]
How can I do this in Python?
The input dictionaries' keys may change, so I don't want to hardcode them in the code either.
With a little cheating by letting pandas do the work:
Setup:
from collections import ChainMap
import pandas as pd
d = [{'CLIENT': ['A','B','C']},{'ROW':['1','2','3']},{'KP':['ROM','MON','SUN']}]
Solution:
result = pd.DataFrame(dict(ChainMap(*d))).to_dict(orient='records')
Result:
[{'KP': 'ROM', 'ROW': '1', 'CLIENT': 'A'},
{'KP': 'MON', 'ROW': '2', 'CLIENT': 'B'},
{'KP': 'SUN', 'ROW': '3', 'CLIENT': 'C'}]
Manually it would look like this:
S = [{} for i in range(len(d))]
i = 0
for dct in d:  # renamed from `dict` to avoid shadowing the builtin
    for k, v in dct.items():  # always a single key per dict
        for value in v:
            S[i][k] = value
            i += 1
        i = 0
print(S)
Extract the key and the values from each dictionary, then zip() them together into new dicts:
data = [{'CLIENT': ['A', 'B', 'C']}, {'ROW': ['1', '2', '3']}, {'KP': ['ROM', 'MON', 'SUN']}]
new_keys = [list(d.keys())[0] for d in data]
new_values = zip(*[val for d in data for val in d.values()])
s = [dict(zip(new_keys, val)) for val in new_values]
print(s)
Output:
[{'CLIENT': 'A', 'ROW': '1', 'KP': 'ROM'},
{'CLIENT': 'B', 'ROW': '2', 'KP': 'MON'},
{'CLIENT': 'C', 'ROW': '3', 'KP': 'SUN'}]
This is another way of doing it, by making use of the builtin zip() function a couple of times, and the chain() function of the itertools module.
The idea is to use zip() first to group together the lists' items (('A', '1', 'ROM'), ('B', '2', 'MON'), ('C', '3', 'SUN')) as we desire, along with the keys of each dictionary ('CLIENT', 'ROW', 'KP').
Then, we can use a list comprehension, iterating over the just-created values list and zipping its content together with the keys tuple, to finally produce the dictionaries that will be stored in the s list.
from itertools import chain
d = [{'CLIENT': ['A','B','C']},{'ROW':['1','2','3']},{'KP':['ROM','MON','SUN']}]
keys, *values = zip(*[chain(dict_.keys(), *dict_.values()) for dict_ in d])
s = [dict(zip(keys, tuple_)) for tuple_ in values]
The content of s will be:
[
{'CLIENT': 'A', 'ROW': '1', 'KP': 'ROM'},
{'CLIENT': 'B', 'ROW': '2', 'KP': 'MON'},
{'CLIENT': 'C', 'ROW': '3', 'KP': 'SUN'}
]
keys = ['key1', 'key2', 'key3', 'key4']
list1 = ['a1', 'b3', 'c4', 'd2', 'h0', 'k1', 'p2', 'o3']
list2 = ['1', '2', '25', '23', '4', '5', '6', '210', '8', '02', '92', '320']
abc = dict(zip(keys[:4], [list1, list2]))
with open('myfilecsvs.csv', 'w') as f:
    [f.write('{0},{1}\n'.format(key, value)) for key, value in abc.items()]
This gives me all the keys in the first column and the values in the next column.
What I'm trying to achieve is all the keys in the first row (each key in its own column), with their values below, something like a transpose.
I'd be grateful for your assistance with this.
You can use join() and zip_longest() to do this.
",".join(abc.keys()) will produce the first row (the keys), like key1,key2. Then use zip_longest (izip_longest in Python 2.x) to aggregate the values element-wise, joining each group with "," and the rows with "\n" in the same way.
zip_longest
Make an iterator that aggregates elements from each of the iterables.
If the iterables are of uneven length, missing values are filled-in
with fillvalue.
from itertools import zip_longest

with open('myfilecsvs.csv', 'w') as f:
    rows = [",".join(abc.keys())]
    rows += [",".join(i) for i in zip_longest(*abc.values(), fillvalue='')]
    f.write("\n".join(rows))
Output:
key1,key2
a1,1
b3,2
...
,02
,92
,320
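The same transpose can also be done with csv.writer, which handles separators and row endings for you. A minimal sketch with a shortened abc, writing into an in-memory buffer instead of a file:

```python
import csv
import io
from itertools import zip_longest

# Shortened version of the abc dict from the question
abc = {'key1': ['a1', 'b3', 'c4'], 'key2': ['1', '2', '25', '23']}

buf = io.StringIO()
writer = csv.writer(buf)
writer.writerow(abc.keys())  # header row: the dict keys
for row in zip_longest(*abc.values(), fillvalue=''):
    writer.writerow(row)  # one value from each list per row; '' pads the shorter list

print(buf.getvalue())
```

For a real file, open it with open('myfilecsvs.csv', 'w', newline='') and pass that to csv.writer instead of the buffer.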
I've got a csv file in something of an entity-attribute-value format (i.e., my event_id is non-unique and repeats k times for the k associated attributes):
event_id, attribute_id, value
1, 1, a
1, 2, b
1, 3, c
2, 1, a
2, 2, b
2, 3, c
2, 4, d
Are there any handy tricks to transform a variable number of attributes (i.e., rows) into columns? The key here is that the output ought to be an m x n table of structured data, where m = max(k); filling in missing attributes with NULL would be optimal:
event_id, 1, 2, 3, 4
1, a, b, c, null
2, a, b, c, d
My plan was to (1) convert the csv to a JSON object that looks like this:
data = [{'value': 'a', 'id': '1', 'event_id': '1', 'attribute_id': '1'},
{'value': 'b', 'id': '2', 'event_id': '1', 'attribute_id': '2'},
{'value': 'a', 'id': '3', 'event_id': '2', 'attribute_id': '1'},
{'value': 'b', 'id': '4', 'event_id': '2', 'attribute_id': '2'},
{'value': 'c', 'id': '5', 'event_id': '2', 'attribute_id': '3'},
{'value': 'd', 'id': '6', 'event_id': '2', 'attribute_id': '4'}]
(2) extract unique event ids:
events = set()
for item in data:
    events.add(item['event_id'])
(3) create a list of lists, where each inner list is a list of the attributes for the corresponding parent event.
from itertools import groupby
# groupby assumes the data is already sorted by event_id, as it is here
attributes = [[k['value'] for k in j] for i, j in groupby(data, key=lambda x: x['event_id'])]
(4) create a dictionary that brings events and attributes together:
event_dict = dict(zip(events, attributes))
which looks like this:
{'1': ['a', 'b'], '2': ['a', 'b', 'c', 'd']}
I'm not sure how to get all inner lists to be the same length with NULL values populated where necessary. It seems like something that needs to be done in step (3). Also, creating n lists full of m NULL values had crossed my mind, then iterate through each list and populate the value using attribute_id as the list location; but that seems janky.
Your basic idea seems right, though I would implement it as follows:
import csv

events = {}  # we're going to keep track of the events we read in
with open('path/to/input') as infile:
    reader = csv.reader(infile, skipinitialspace=True)  # strip the spaces after commas
    next(reader)  # skip the header row
    for event, _att, val in reader:
        if event not in events:
            events[event] = []
        events[event].append(val)  # track all the values for this event

maxAtts = max(len(v) for v in events.values())  # the maximum number of attributes for any event

with open('path/to/output', 'w', newline='') as outfile:
    writer = csv.writer(outfile)
    writer.writerow(["event_id"] + list(range(1, maxAtts + 1)))  # write out the header row
    for k in sorted(events):  # look at the events in sorted order
        # write out the event id, all the values for that event,
        # and pad with "null" for any attributes without values
        writer.writerow([k] + events[k] + ['null'] * (maxAtts - len(events[k])))
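If pandas is available, this EAV-to-wide reshape is also close to a one-liner with pivot. A sketch under that assumption, using io.StringIO as a stand-in for the input file:

```python
import io
import pandas as pd

# Inline stand-in for the input csv from the question
csv_text = """event_id, attribute_id, value
1, 1, a
1, 2, b
1, 3, c
2, 1, a
2, 2, b
2, 3, c
2, 4, d
"""

# skipinitialspace strips the spaces that follow each comma
df = pd.read_csv(io.StringIO(csv_text), skipinitialspace=True)

# One row per event_id, one column per attribute_id; missing cells become NaN
wide = df.pivot(index='event_id', columns='attribute_id', values='value')
print(wide.fillna('null'))
```

pivot raises an error on duplicate (event_id, attribute_id) pairs, which doubles as a sanity check that each attribute really appears at most once per event.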