I've got a csv file in something of an entity-attribute-value format (i.e., my event_id is non-unique and repeats k times for the k associated attributes):
event_id, attribute_id, value
1, 1, a
1, 2, b
1, 3, c
2, 1, a
2, 2, b
2, 3, c
2, 4, d
Are there any handy tricks to transform a variable number of attributes (i.e., rows) into columns? The key here is that the output ought to be a structured m x n table, where the number of attribute columns equals max(k); filling missing attributes with NULL would be ideal:
event_id, 1, 2, 3, 4
1, a, b, c, null
2, a, b, c, d
My plan was to (1) convert the csv to a JSON object that looks like this:
data = [{'value': 'a', 'id': '1', 'event_id': '1', 'attribute_id': '1'},
{'value': 'b', 'id': '2', 'event_id': '1', 'attribute_id': '2'},
{'value': 'a', 'id': '3', 'event_id': '2', 'attribute_id': '1'},
{'value': 'b', 'id': '4', 'event_id': '2', 'attribute_id': '2'},
{'value': 'c', 'id': '5', 'event_id': '2', 'attribute_id': '3'},
{'value': 'd', 'id': '6', 'event_id': '2', 'attribute_id': '4'}]
(2) extract unique event ids:
events = set()
for item in data:
    events.add(item['event_id'])
(3) create a list of lists, where each inner list is a list of the attribute values for the corresponding parent event.
from itertools import groupby
# note: groupby only groups consecutive items, so data must already be sorted by event_id
attributes = [[k['value'] for k in j] for i, j in groupby(data, key=lambda x: x['event_id'])]
(4) create a dictionary that brings events and attributes together:
event_dict = dict(zip(events, attributes))
which looks like this:
{'1': ['a', 'b'], '2': ['a', 'b', 'c', 'd']}
I'm not sure how to get all the inner lists to be the same length, with NULL values populated where necessary. It seems like something that needs to be done in step (3). Creating n lists of m NULL values, then iterating through each list and filling in values using attribute_id as the list position, had also crossed my mind, but that seems janky.
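For reference, a rough sketch of that positional idea (assuming attribute_id is a 1-based integer and using the data list above) might look like this:
# pre-fill each event's row with None placeholders, then use attribute_id
# (assumed 1-based) as the column position
max_atts = max(int(item['attribute_id']) for item in data)

event_dict = {}
for item in data:
    row = event_dict.setdefault(item['event_id'], [None] * max_atts)
    row[int(item['attribute_id']) - 1] = item['value']

print(event_dict)
# {'1': ['a', 'b', None, None], '2': ['a', 'b', 'c', 'd']}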
Your basic idea seems right, though I would implement it as follows:
import csv

events = {}  # we're going to keep track of the events we read in
with open('path/to/input') as infile:
    reader = csv.reader(infile, skipinitialspace=True)
    next(reader)  # skip the header row
    for event, _att, val in reader:
        if event not in events:
            events[event] = []
        events[event].append(val)  # track all the values for this event

maxAtts = max(len(v) for v in events.values())  # the maximum number of attributes for any event

with open('path/to/output', 'w', newline='') as outfile:
    writer = csv.writer(outfile)
    writer.writerow(["event_id"] + list(range(1, maxAtts + 1)))  # write out the header row
    for k in sorted(events):  # look at the events in sorted order
        # write the event id, all the values for that event, and pad with "null"
        # for any attributes without values
        writer.writerow([k] + events[k] + ['null'] * (maxAtts - len(events[k])))
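If pandas is available, the same reshape can also be done with a pivot; this is only a sketch, assuming the header row shown in the question and the same placeholder paths:
import pandas as pd

df = pd.read_csv('path/to/input', skipinitialspace=True)
# pivot puts NaN where an event has no value for an attribute; fill those with "null"
wide = df.pivot(index='event_id', columns='attribute_id', values='value').fillna('null')
wide.to_csv('path/to/output')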
Related
I want to create a "dictionary of dictionaries" for each row of the following csv file
name,AGATC,AATG,TATC
Alice,2,8,3
Bob,4,1,5
Charlie,3,2,5
So the idea is that mydict["Alice"] should be {'AGATC': 2, 'AATG': 8, 'TATC': 3}, etc.
I don't really understand the .reader and .DictReader functions well enough: https://docs.python.org/3/library/csv.html#csv.DictReader
I am a newbie and cannot quite follow the docs. Do you have other, 'easier' resources that you can recommend?
First, I have to get the first column, i.e. the names, and use them as keys. How can I access that first column?
Second, I want to create a dictionary inside that name (as the value), with the keys being AGATC, AATG, TATC. Do you understand what I mean? Is that possible?
Edit, made progress:
import csv
from sys import argv

# Open the CSV file and read its contents into memory.
with open(argv[1]) as csvfile:
    reader = list(csv.reader(csvfile))
    # Each row read from the csv file is returned as a list of strings.

# Establish dicts.
mydict = {}
for i in range(1, len(reader)):  # skip the header row
    print(reader[i][0])
    mydict[reader[i][0]] = reader[i][1:]
print(mydict)
Out:
{'Alice': ['2', '8', '3'], 'Bob': ['4', '1', '5'], 'Charlie': ['3', '2', '5']}
But how to implement nested dictionaries as described above?
Edit #3:
import csv
from sys import argv

# Open the CSV file and read its contents into memory.
with open(argv[1]) as csvfile:
    reader = list(csv.reader(csvfile))
    # Each row read from the csv file is returned as a list of strings.

# Establish dicts.
mydict = {}
for i in range(1, len(reader)):  # skip the header row
    print(reader[i][0])
    mydict[reader[i][0]] = reader[i][1:]
print(mydict)

print(len(reader))
dictlist = [dict() for x in range(1, len(reader))]
for i in range(1, len(reader)):
    # pair the header row's key names with this row's values
    dictlist[i - 1] = dict(zip(reader[0][1:], mydict[reader[i][0]]))
print(dictlist)
Out:
[{'AGATC': '2', 'AATG': '8', 'TATC': '3'}, {'AGATC': '4', 'AATG': '1', 'TATC': '5'}, {'AGATC': '3', 'AATG': '2', 'TATC': '5'}]
So I solved it for myself:)
The following code will give you what you've asked for in terms of dict structure.
import csv

with open('file.csv', newline='') as csvfile:
    mydict = {}
    reader = csv.DictReader(csvfile)
    # Iterate through each line of the csv file
    for row in reader:
        # Build the desired dictionary structure with a dict comprehension:
        # for each item in the row, keep the key and the value, except when
        # the key is 'name' (k != 'name')
        mydict[row['name']] = {k: v for k, v in row.items() if k != 'name'}

print(mydict)
This will give you
{
'Alice': {'AGATC': '2', 'AATG': '8', 'TATC': '3'},
'Bob': {'AGATC': '4', 'AATG': '1', 'TATC': '5'},
'Charlie': {'AGATC': '3', 'AATG': '2', 'TATC': '5'}
}
There are plenty of videos and articles covering comprehensions on the net if you need more information on these.
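As a quick standalone illustration of the same filtering comprehension (using Alice's row from the csv above):
row = {'name': 'Alice', 'AGATC': '2', 'AATG': '8', 'TATC': '3'}
# keep every key/value pair except the 'name' one
without_name = {k: v for k, v in row.items() if k != 'name'}
print(without_name)  # {'AGATC': '2', 'AATG': '8', 'TATC': '3'}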
So far, I have this code (from cs50/pset6/DNA):
import csv
from sys import argv

data_dict = {}
with open(argv[1]) as data_file:
    reader = csv.DictReader(data_file)
    for record in reader:
        # `record` is a dictionary of column-name & value
        name = record["name"]
        data = {
            "AGATC": record["AGATC"],
            "AATG": record["AATG"],
            "TATC": record["TATC"],
        }
        data_dict[name] = data
print(data_dict)
Output
{'Alice': {'AATG': '8', 'AGATC': '2', 'TATC': '3'},
'Bob': {'AATG': '1', 'AGATC': '4', 'TATC': '5'},
'Charlie': {'AATG': '2', 'AGATC': '3', 'TATC': '5'}}
Here is the csv file:
name,AGATC,AATG,TATC
Alice,2,8,3
Bob,4,1,5
Charlie,3,2,5
My goal is to achieve the exact same thing, but without hardcoding the keys (AGATC, AATG, etc.), since I'll be using a much bigger database that contains more columns. I want to be able to loop through the data instead of doing this:
data = {
    "AGATC": record["AGATC"],
    "AATG": record["AATG"],
    "TATC": record["TATC"],
}
Could you please help me? Thanks
You could also try using pandas.
Using your example data as a .csv file:
import pandas
pandas.read_csv('example.csv', index_col=0).transpose().to_dict()
Outputs:
{'Alice': {'AGATC': 2, 'AATG': 8, 'TATC': 3},
'Bob': {'AGATC': 4, 'AATG': 1, 'TATC': 5},
'Charlie': {'AGATC': 3, 'AATG': 2, 'TATC': 5}}
index_col=0 because you have a names column, which I set as the index (so it later becomes the top-level keys of the dictionary)
.transpose() so the top-level keys are the names and not the features (AGATC, AATG, etc.)
.to_dict() to transform the pandas.DataFrame into a python dictionary
You can simply use pandas:
import csv
import pandas as pd
from sys import argv

data_dict = {}
with open(argv[1]) as data_file:
    reader = csv.DictReader(data_file)
    df = pd.DataFrame(reader)

df = df.set_index('name')             # set the name column as the index
data_dict = df.transpose().to_dict()  # transpose so the names become the top-level keys
print(data_dict)
You can loop through a dictionary in python simply enough like this:
for key in dictionary:
    print(key, dictionary[key])
You are on the right track using csv.DictReader.
import csv
from pprint import pprint

data_dict = {}
with open('fasta.csv', 'r') as f:
    reader = csv.DictReader(f)
    for record in reader:
        name = record.pop('name')
        data_dict[name] = record

pprint(data_dict)
Prints
{'Alice': {'AATG': '8', 'AGATC': '2', 'TATC': '3'},
'Bob': {'AATG': '1', 'AGATC': '4', 'TATC': '5'},
'Charlie': {'AATG': '2', 'AGATC': '3', 'TATC': '5'}}
I have an input list of the form:
d=[{'CLIENT': ['A','B','C']},{'ROW':['1','2','3']},{'KP':['ROM','MON','SUN']}]
I want the output to look like:
S=[{'CLIENT':'A','ROW':'1','KP':'ROM'},
{'CLIENT':'B','ROW':'2','KP':'MON'},
{'CLIENT':'C','ROW':'3','KP':'SUN'},]
How can I do this in Python?
The keys of the input dictionaries may change, so I don't want to hardcode them in the code either.
With a little cheating by letting pandas do the work:
Setup:
from collections import ChainMap
import pandas as pd
d = [{'CLIENT': ['A','B','C']},{'ROW':['1','2','3']},{'KP':['ROM','MON','SUN']}]
Solution:
result = pd.DataFrame(dict(ChainMap(*d))).to_dict(orient='records')
Result:
[{'KP': 'ROM', 'ROW': '1', 'CLIENT': 'A'},
{'KP': 'MON', 'ROW': '2', 'CLIENT': 'B'},
{'KP': 'SUN', 'ROW': '3', 'CLIENT': 'C'}]
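For clarity, dict(ChainMap(*d)) merely merges the single-key dicts from d into one dict before pandas builds the frame; a quick check with the setup above:
from collections import ChainMap

d = [{'CLIENT': ['A','B','C']}, {'ROW': ['1','2','3']}, {'KP': ['ROM','MON','SUN']}]
print(dict(ChainMap(*d)))
# {'KP': ['ROM', 'MON', 'SUN'], 'ROW': ['1', '2', '3'], 'CLIENT': ['A', 'B', 'C']}
# (key order may vary across Python versions)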
Manually it would look like this:
S = [{} for i in range(len(d))]   # one output dict per position (here 3)
i = 0
for entry in d:                   # use `entry`, not `dict`, to avoid shadowing the builtin
    for k, v in entry.items():    # always exactly one key per dict
        for value in v:
            S[i][k] = value
            i += 1
        i = 0
print(S)
Extract the key from each dictionary and the values, then zip() them together into a new dict():
data = [{'CLIENT': ['A', 'B', 'C']}, {'ROW': ['1', '2', '3']}, {'KP': ['ROM', 'MON', 'SUN']}]
new_keys = [list(d.keys())[0] for d in data]
new_values = zip(*[val for d in data for val in d.values()])
s = [dict(zip(new_keys, val)) for val in new_values]
print(s)
Output:
[{'CLIENT': 'A', 'ROW': '1', 'KP': 'ROM'},
{'CLIENT': 'B', 'ROW': '2', 'KP': 'MON'},
{'CLIENT': 'C', 'ROW': '3', 'KP': 'SUN'}]
This is another way of doing it, by making use of the builtin zip() function a couple of times, and the chain() function of the itertools module.
The idea is to first use zip() to group together the lists' items (('A', '1', 'ROM'), ('B', '2', 'MON'), ('C', '3', 'SUN')) as we desire, along with the keys of each dictionary (('CLIENT', 'ROW', 'KP')).
Then we can use a list comprehension, iterating over the just-created values list and zipping its content together with the keys tuple, to finally produce the dictionaries that will be stored within the s list.
from itertools import chain
d = [{'CLIENT': ['A','B','C']},{'ROW':['1','2','3']},{'KP':['ROM','MON','SUN']}]
keys, *values = zip(*[chain(dict_.keys(), *dict_.values()) for dict_ in d])
s = [dict(zip(keys, tuple_)) for tuple_ in values]
The content of s will be:
[
{'CLIENT': 'A', 'ROW': '1', 'KP': 'ROM'},
{'CLIENT': 'B', 'ROW': '2', 'KP': 'MON'},
{'CLIENT': 'C', 'ROW': '3', 'KP': 'SUN'}
]
While reading a csv file using csv.DictReader
I get
[{'id': 1, 'status1': '1', 'status2': '2', 'status3': '3' }]
How can I manipulate this while reading, or afterwards, to get:
[{'id': 1, 'status': ['1', '2', '3']}]
TLDR;
I want to group similar fields into a list.
And/or: how can I do this with pandas pd.read_csv() too?
Thanks in advance!
If it is certain that the only fields you want to group are those that end with digits, you can use a regex to identify them and append their corresponding values to a list:
import re

def compact_dict(d):
    compact = {}
    for key, value in d.items():
        # a group containing anything, followed by at least one digit
        m = re.match(r'(.*)\d+', key)
        if m:
            # the key that we use is the original one without the final digits
            compact.setdefault(m.group(1), []).append(value)
        else:
            # not part of a group of similar keys, so just store the key: value pair
            compact[key] = value
    return compact
data = [{'id': 1, 'status1': '1', 'status2': '2', 'status3': '3' }]
out = [compact_dict(d) for d in data]
print(out)
# [{'id': 1, 'status': ['1', '2', '3']}]
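The question also asks about pandas; a hedged sketch (assuming a hypothetical statuses.csv with the columns id, status1, status2, status3) would be to read the file normally and then collapse the matching columns per row:
import pandas as pd

# hypothetical file with columns: id, status1, status2, status3
df = pd.read_csv('statuses.csv', dtype=str)  # keep everything as strings

# columns whose names end in digits, e.g. status1, status2, status3
status_cols = df.filter(regex=r'^status\d+$').columns

records = [
    {'id': row['id'], 'status': [row[c] for c in status_cols]}
    for _, row in df.iterrows()
]
print(records)
# [{'id': '1', 'status': ['1', '2', '3']}]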
I have a list (tags) of integers. I want to map the list items to the value items of a dictionary (classes) and get the corresponding dictionary keys as output.
I am using:
h = classes.items()
for x in tags:
    for e in h:
        # print x, e,  # uncomment this line to make a diagnosis
        if x == e[1]:
            print e[0]
        else:
            print "No Match"
Classes is the dictionary.
Tags is the list whose items I want to map against classes. When I run this code, I get "No Match" printed 2616 times:
2616 = 8 (no. of tuples in classes) * 327 (no. of items in the tags list)
If I understand what you are trying to do, maybe this will help:
>>> tags
['0', '2', '1', '3', '4', '7', '2', '0', '1', '6', '3', '2', '8', '4', '1', '2', '0', '7', '5', '4', '1']
>>> classes
{'Tesla': 7, 'Nissan': 0, 'Honda': 5, 'Toyota': 6, 'Ford': 1, 'Mazda': 4, 'Ferrari': 2, 'Suzuki': 3}
tags is a list of strings, not integers - so let's convert it to a list of ints.
>>> tags = map(int, tags)
classes is a dictionary mapping car makes to ints, but we want to use the value as the lookup. We can invert the dictionary (swap keys and values)
>>> classes_inverse = {v: k for k, v in classes.items()}
Now this is what tags and classes_inverse look like
>>> tags
[0, 2, 1, 3, 4, 7, 2, 0, 1, 6, 3, 2, 8, 4, 1, 2, 0, 7, 5, 4, 1]
>>> classes_inverse
{0: 'Nissan', 1: 'Ford', 2: 'Ferrari', 3: 'Suzuki', 4: 'Mazda', 5: 'Honda', 6: 'Toyota', 7: 'Tesla'}
Now we can collect the values of the inverse dictionary for each item in the list.
>>> [classes_inverse.get(t, "No Match") for t in tags]
['Nissan', 'Ferrari', 'Ford', 'Suzuki', 'Mazda', 'Tesla', 'Ferrari', 'Nissan', 'Ford', 'Toyota', 'Suzuki', 'Ferrari', 'No Match', 'Mazda', 'Ford', 'Ferrari', 'Nissan', 'Tesla', 'Honda', 'Mazda', 'Ford']
For each tag, you iterate through all the keys and print whether it was a match or not, when you expect at most one hit. For example, if you have 10 items, then for each tag you'll print 1 hit and 9 misses.
Since you want to store this data, the easiest way is to invert the dictionary mapping, i.e. turn key -> value into value -> key. However, this assumes that all values are unique, which your example suggests they are.
def map_tags(tags, classes):
    tag_map = {value: key for key, value in classes.items()}
    return [tag_map.get(t, 'No match') for t in tags]
However, be careful: in your classes example the values are integers, while the tags are strings. The two need to match when you do the lookup. If the tags are meant to stay strings, then change
tag_map.get(t, 'No match')
to
tag_map.get(int(t), 'No match')
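For example, with the tags and classes shown earlier (converting the tags to ints first, so they match the classes values):
tags = ['0', '2', '1', '3', '4', '7', '2', '0', '1', '6', '3',
        '2', '8', '4', '1', '2', '0', '7', '5', '4', '1']
classes = {'Tesla': 7, 'Nissan': 0, 'Honda': 5, 'Toyota': 6,
           'Ford': 1, 'Mazda': 4, 'Ferrari': 2, 'Suzuki': 3}

print(map_tags([int(t) for t in tags], classes))
# ['Nissan', 'Ferrari', 'Ford', 'Suzuki', 'Mazda', 'Tesla', 'Ferrari', 'Nissan',
#  'Ford', 'Toyota', 'Suzuki', 'Ferrari', 'No match', 'Mazda', 'Ford', 'Ferrari',
#  'Nissan', 'Tesla', 'Honda', 'Mazda', 'Ford']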