Group certain columns and summing up another column from a CSV - python

I have data in a csv that needs to be parsed. It looks like:
Date,Tag,Amount
13/06/2018,ABC,6750000
13/06/2018,ABC,159800
24/05/2018,ABC,-1848920
16/05/2018,AB,-1829700
16/05/2018,AB,3600000
28/06/2018,A,15938000
16/05/2018,AB,3748998
28/06/2018,A,1035000
28/06/2018,A,1035000
14/06/2018,ABC,2122717
You can see each date has a tag and an amount next to it.
What I am trying to achieve is to group by date and tag (using the pair as the key) and sum up the amount.
Expected result:
Date,Tag,Amount
13/06/2018,ABC,5220680
16/05/2018,AB,5519298
28/06/2018,A,18008000
14/06/2018,ABC,2122717
The code I am using now is below, but it is not working.
from collections import defaultdict
import csv

d = defaultdict(int)
with open("file.csv") as f:
    for line in f:
        tokens = [t.strip() for t in line.split(",")]
        try:
            date = int(tokens[0])
            tag = int(tokens[1])
            amount = int(tokens[2])
        except ValueError:
            continue
        d[date] += amount
print d
Could someone show me how to achieve this without using pandas?

You should definitely use pandas. Unless you are required to code this yourself, you can just install the pandas module, import it (import pandas as pd), and solve this problem with two simple and intuitive lines of code:
>>> df = pd.read_csv('file.csv')
>>> df.groupby(['Date', 'Tag']).Amount.sum()
Date Tag
13/06/2018 ABC 6909800
14/06/2018 ABC 2122717
16/05/2018 AB 5519298
24/05/2018 ABC -1848920
28/06/2018 A 18008000
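If you also want that grouped result written back out as CSV, pandas can do it in one more line; pass as_index=False so Date and Tag come out as ordinary columns rather than a MultiIndex. A minimal sketch, using an in-memory copy of the sample data in place of file.csv so it is self-contained:

```python
import io

import pandas as pd

# In-memory stand-in for 'file.csv'; swap in pd.read_csv('file.csv') for the real file.
csv_text = """Date,Tag,Amount
13/06/2018,ABC,6750000
13/06/2018,ABC,159800
16/05/2018,AB,-1829700
16/05/2018,AB,3600000
"""
df = pd.read_csv(io.StringIO(csv_text))

# as_index=False keeps Date/Tag as regular columns instead of a MultiIndex.
out = df.groupby(['Date', 'Tag'], as_index=False)['Amount'].sum()
print(out.to_csv(index=False))
```

Replace the final print with out.to_csv('summed.csv', index=False) to write the file instead.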
If you really need to code this yourself, you can use a nested defaultdict to get two layers of grouping. Also, why do you cast the date and the tag to int? That makes no sense at all; just remove it.
from collections import defaultdict

d = defaultdict(lambda: defaultdict(int))
with open("file.csv") as f:
    for line in f:
        tokens = [t.strip() for t in line.split(",")]
        try:
            date = tokens[0]
            tag = tokens[1]
            amount = int(tokens[2])
        except ValueError:
            continue
        d[date][tag] += amount
The output is:
13/06/2018 ABC 6909800
24/05/2018 ABC -1848920
16/05/2018 AB 5519298
28/06/2018 A 18008000
14/06/2018 ABC 2122717
To output the result above, just iterate through the items:
for k, v in d.items():
    for k2, v2 in v.items():
        print(k, k2, v2)
To make your code even better, read the first line separately and then iterate from the second line to the end. That way, the try/except can be removed and you get simpler, cleaner code. But you can pick it up from here, right? ;)
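That header-skipping suggestion could look like this; a sketch that reads the sample data from an in-memory buffer (a stand-in for open("file.csv")) so that no try/except is needed:

```python
import io
from collections import defaultdict

# Stand-in for open("file.csv"); replace with a real file handle as needed.
f = io.StringIO("""Date,Tag,Amount
13/06/2018,ABC,6750000
13/06/2018,ABC,159800
16/05/2018,AB,3600000
""")

d = defaultdict(lambda: defaultdict(int))
next(f)  # consume the Date,Tag,Amount header line up front
for line in f:
    date, tag, amount = (t.strip() for t in line.split(","))
    d[date][tag] += int(amount)

for date, tags in d.items():
    for tag, total in tags.items():
        print(date, tag, total)
```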
To write the result to a file, simply:
s = '\n'.join(['{0} {1} {2}'.format(k, k2, v2) for k, v in d.items() for k2, v2 in v.items()])
with open('output.txt', 'w') as f:
    f.write(s)

This is one approach using a simple iteration.
Ex:
from collections import defaultdict
import csv

result = defaultdict(int)
with open(filename) as infile:
    reader = csv.reader(infile)
    header = next(reader)
    for line in reader:
        result[tuple(line[:2])] += int(line[2])

print(header)
for k, v in result.items():
    print(k[0], k[1], v)
Output:
14/06/2018 ABC 2122717
13/06/2018 ABC 6909800
28/06/2018 A 18008000
16/05/2018 AB 5519298
24/05/2018 ABC -1848920
To CSV:
with open(filename, "w", newline="") as outfile:  # text mode with newline="" for the csv module in Python 3
    writer = csv.writer(outfile)
    writer.writerow(header)
    for k, v in result.items():
        writer.writerow([k[0], k[1], v])

You can use itertools.groupby:
from itertools import groupby
import csv

header, *data = csv.reader(open('filename.csv'))
new_data = [[a, list(b)] for a, b in groupby(sorted(data, key=lambda x: x[:2]), key=lambda x: x[:2])]
results = [[*a, sum(int(c) for *_, c in b)] for a, b in new_data]
with open('calc_results.csv', 'w') as f:
    write = csv.writer(f)
    write.writerows([header, *results])
Output:
Date,Tag,Amount
13/06/2018,ABC,6909800
14/06/2018,ABC,2122717
16/05/2018,AB,5519298
24/05/2018,ABC,-1848920
28/06/2018,A,18008000

Related

Python convert child dicts with same key name to csv (DictWriter)

I have a list of JSON objects like the one below:
[
{"A":{"value":1}, "B":{"value":2}},
{"A":{"value":9}, "B":{"value":3}}
]
Which I want to turn to csv like so:
A.value,B.value
1,2
9,3
The issue is that I have nested keys which have the same name, value, but should end up in separate columns. I could not find an elegant solution to this anywhere yet. I would like to be able to do something like:
data = json.load(open(file, 'r'))
with open("output.csv", "w") as f:
    columns = ["A.value", "B.value"]
    cw = csv.DictWriter(f, columns)
    cw.writeheader()
    cw.writerows(data)
Which I know would work if I did not have any nested keys. I found other questions similar to this, but I don't think they apply to my situation.
As an extra challenge:
I'd rather keep a generic approach. Later I might have a list of jsons like:
[
{"A":{"value":1}, "B":{"value":2}, "key":90},
{"A":{"value":9}, "B":{"value":3}, "key":91}
]
Meaning not all keys I want to add to the CSV will have a nested value key!
Expected output:
A.value,B.value,key
1,2,90
9,3,91
Flattening the dicts worked. Since there is a list of dicts, the flattening has to be done for each dict:
import collections.abc

def flatten(d, parent_key='', sep='.'):
    items = []
    for k, v in d.items():
        new_key = parent_key + sep + k if parent_key else k
        if isinstance(v, collections.abc.MutableMapping):
            items.extend(flatten(v, new_key, sep=sep).items())
        else:
            items.append((new_key, v))
    return dict(items)

def flatten_json_list(json_list):
    flattened_list = []
    for d in json_list:
        flattened_list.append(flatten(d))
    return flattened_list
Then the code from the question works, writing the flattened list:
with open("out.csv", "w") as f:
    columns = ["A.value", "B.value", "key"]
    cw = csv.DictWriter(f, columns)
    cw.writeheader()
    cw.writerows(flatten_json_list(data))
This should do the job:
cw.writerows([{f'{key}.value': val['value'] for key, val in row.items()} for row in data])
or as a regular loop:
for row in data:
    cw.writerow({f'{key}.value': val['value'] for key, val in row.items()})
EDIT
import csv

data = [
    {"A": {"value": 1}, "B": {"value": 2}, "key": 90},
    {"A": {"value": 9}, "B": {"value": 3}, "key": 91}
]

def parse(row):
    for key, value in row.items():
        try:
            yield f'{key}.value', value['value']
        except TypeError:
            yield key, value

with open("output.csv", "w") as f:
    columns = ['A.value', 'B.value', 'key']
    cw = csv.DictWriter(f, columns)
    cw.writeheader()
    cw.writerows(dict(parse(row)) for row in data)
The only thing I don't like is the hardcoded headers.
jsonpath-ng can parse even such a nested JSON object very easily. It can be installed with the following command:
pip install --upgrade jsonpath-ng
Code:
from collections import defaultdict
import jsonpath_ng as jp
import pandas as pd
import re

jp.jsonpath.auto_id_field = 'json_path'

def json_to_df(json):
    expr = jp.parse(f'$..*.{jp.jsonpath.auto_id_field}')
    d = defaultdict(list)
    for m in expr.find(json):
        if not isinstance(m.datum.value, (dict, list)):
            d[re.sub(r'\[\d+]\.', '', m.value)].append(m.datum.value)
    return pd.DataFrame(d)

data = [{"A": {"value": 1}, "B": {"value": 2}, "key": 90},
        {"A": {"value": 9}, "B": {"value": 3}, "key": 91}]
df = json_to_df(data)
df.to_csv('output.csv', index=False)
Output:
key,A.value,B.value
90,1,2
91,9,3
Another, more complicated example:
data = [{"A": {"value": 1}, "B": {"value": 2, "C": {"value": 6}}, "key": 90},
        {"A": {"value": 9}, "B": {"value": 3, "C": {"value": 8}}, "key": 91}]
json_to_df(data).to_csv('output.csv', index=False)
key,A.value,B.value,B.C.value
90,1,2,6
91,9,3,8

How to convert CSV data into a dictionary using itertools.groupby

I have a text file, job.txt, which is below
job,salary
Developer,29000
Developer,28000
Tester,27000
Tester,26000
My code is
with open(r'C:\Users\job.txt') as f:
    file_content = f.readlines()

data = {}
for i, line in enumerate(file_content):
    if i == 0:
        continue
    job, salary = line.split(",")
    job = job.strip()
    salary = int(salary.strip())
    if job not in data:
        data[job] = []
    data[job].append(salary)
print("data =", data)
My expected result is below
data = {'Developer': [29000, 28000], 'Tester': [27000, 26000]}
How can I convert my code to use itertools.groupby?
Here is the code that will generate the dictionary you want.
from itertools import groupby

data = [
    ["Developer", 29000],
    ["Developer", 28000],
    ["Tester", 27000],
    ["Tester", 26000]
]

def keyfunc(e):
    return e[0]

unique_keys = {}
data = sorted(data, key=keyfunc)
for k, g in groupby(data, keyfunc):
    unique_keys[k] = [i[1] for i in g]

print(unique_keys)
# {'Developer': [29000, 28000], 'Tester': [27000, 26000]}
P.S: I would suggest using the csv module to read the file instead of doing it yourself.
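Following that P.S., here is a sketch of the same grouping using the csv module (an in-memory buffer stands in for the job.txt file handle so the example is self-contained; the sort before groupby is still required):

```python
import csv
import io
from itertools import groupby

# Stand-in for open('job.txt'); replace with a real file handle.
f = io.StringIO("job,salary\nDeveloper,29000\nDeveloper,28000\nTester,27000\nTester,26000\n")

reader = csv.reader(f)
next(reader)  # skip the job,salary header row
rows = sorted(reader, key=lambda r: r[0])  # groupby only groups adjacent rows

unique_keys = {job: [int(r[1]) for r in group]
               for job, group in groupby(rows, key=lambda r: r[0])}
print(unique_keys)  # {'Developer': [29000, 28000], 'Tester': [27000, 26000]}
```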
Try this if pandas is an option:
from collections import defaultdict
import pandas as pd

d = pd.read_csv('job.txt').to_numpy().tolist()
res = defaultdict(list)
for v, k in d:
    res[v].append(k)
d = dict(res)
print(d)
# {'Developer': [29000, 28000], 'Tester': [27000, 26000]}
You can only rely on groupby if your data is already chunked into categories.
from itertools import groupby

with open("job.txt") as f:
    rows = [x.split(",") for x in f.readlines()[1:]]

data = {
    k.strip(): [int(y[1]) for y in v]
    for k, v in groupby(rows, key=lambda x: x[0])
}
With that in mind, I think a defaultdict is more appropriate here. Ordering is automatically handled and it's just less clever. Additionally, there's no need to slurp the file into memory or sort it (if unordered). Use dict(data) at the end if you don't like the defaultdict subclass.
from collections import defaultdict

data = defaultdict(list)
with open("job.txt") as f:
    for i, line in enumerate(f):
        if i:
            job, salary = [x.strip() for x in line.split(",")]
            data[job].append(int(salary))
As mentioned in the accepted answer, do prefer the csv module if your actual data is at all more complicated than your example. CSV can be difficult to parse, and there's no reason to reinvent the wheel.

Making python dictionary from a text file with multiple keys

I have a text file named file.txt with some numbers like the following :
1 79 8.106E-08 2.052E-08 3.837E-08
1 80 -4.766E-09 9.003E-08 4.812E-07
1 90 4.914E-08 1.563E-07 5.193E-07
2 2 9.254E-07 5.166E-06 9.723E-06
2 3 1.366E-06 -5.184E-06 7.580E-06
2 4 2.966E-06 5.979E-07 9.702E-08
2 5 5.254E-07 0.166E-02 9.723E-06
3 23 1.366E-06 -5.184E-03 7.580E-06
3 24 3.244E-03 5.239E-04 9.002E-08
I want to build a Python dictionary where the first number in each row is the key, the second number is always ignored, and the last three numbers are the values. But keys cannot repeat in a dictionary, so when I run my code (attached at the end of the question), what I get is:
'1' : [ '90' '4.914E-08' '1.563E-07' '5.193E-07' ]
'2' : [ '5' '5.254E-07' '0.166E-02' '9.723E-06' ]
'3' : [ '24' '3.244E-03' '5.239E-04' '9.002E-08' ]
All the other numbers are removed, and only the last row is kept as the values. What I need is to have all the numbers against a key, say 1, to be appended in the dictionary. For example, what I need is :
'1' : ['8.106E-08' '2.052E-08' '3.837E-08' '-4.766E-09' '9.003E-08' '4.812E-07' '4.914E-08' '1.563E-07' '5.193E-07']
Is it possible to do this elegantly in Python? The code I have right now is the following:
diction = {}
with open("file.txt") as f:
    for line in f:
        pa = line.split()
        diction[pa[0]] = pa[1:]
or, equivalently, as a dict comprehension:
with open('file.txt') as f:
    diction = {pa[0]: pa[1:] for pa in map(str.split, f)}
You can use a defaultdict.
from collections import defaultdict

data = defaultdict(list)
with open("file.txt", "r") as f:
    for line in f:
        line = line.split()
        data[line[0]].extend(line[2:])
Try this:
from collections import defaultdict

diction = defaultdict(list)
with open("file.txt") as f:
    for line in f:
        key, _, *values = line.strip().split()
        diction[key].extend(values)
print(diction)
This is a solution for Python 3, because the statement a, *b = tuple1 is invalid in Python 2. Look at the solution by cha0site if you are using Python 2.
Make the value of each key in diction be a list and extend that list with each iteration. With your code as it is written now when you say diction[pa[0]] = pa[1:] you're overwriting the value in diction[pa[0]] each time the key appears, which describes the behavior you're seeing.
with open("file.txt") as f:
    for line in f:
        pa = line.split()
        try:
            diction[pa[0]].extend(pa[2:])  # skip the second column, per the question
        except KeyError:
            diction[pa[0]] = pa[2:]
In this code each value of diction will be a list. In each iteration if the key exists that list will be extended with new values from pa giving you a list of all the values for each key.
To do this in a very simple for loop:
return_dict = {}
with open('file.txt') as f:
    for item_list in map(str.split, f):
        if item_list[0] not in return_dict:
            return_dict[item_list[0]] = []
        return_dict[item_list[0]].extend(item_list[2:])
Or, if you wanted to use defaultdict in a one liner-ish:
from collections import defaultdict

with open('file.txt') as f:
    return_dict = defaultdict(list)
    [return_dict[item_list[0]].extend(item_list[2:]) for item_list in map(str.split, f)]

Converting text file to dictionary in python

So let's say I want to convert the following to a dictionary, where the first column is keys and the second column is values.
http://pastebin.com/29bXkYhd
The following code works for this (assume romEdges.txt is the name of the file):
f = open('romEdges.txt')
dic = {}
for l in f:
    k, v = l.split()
    if k in dic:
        dic[k].extend(v)
    else:
        dic[k] = [v]
f.close()
OK
But why doesn't the code work for this file?
http://pastebin.com/Za0McsAM
If anyone can tell me the correct code for the 2nd text file to work as well I would appreciate it.
Thanks in advance.
You should use append instead of extend:
from collections import defaultdict

d = defaultdict(list)
with open("romEdges.txt") as fin:
    for line in fin:
        k, v = line.strip().split()
        d[k].append(v)
print(d)
or use sets to prevent duplicates:
d = defaultdict(set)
with open("romEdges.txt") as fin:
    for line in fin:
        k, v = line.strip().split()
        d[k].add(v)
print(d)
If you want to append the data to a dictionary, then you can use update. Please use the following code:
f = open('your file name')
dic = {}
for l in f:
    k, v = l.split()
    if k in dic:
        dic.update({k: v})
    else:
        dic[k] = [v]
print(dic)
f.close()
output:
{'0100464': ['0100360'], '0100317': ['0100039'], '0100405': ['0100181'], '0100545': ['0100212'], '0100008': ['0000459'], '0100073': ['0100072'], '0100044': ['0100426'], '0100062': ['0100033'], '0100061': ['0000461'], '0100066': ['0100067'], '0100067': ['0100164'], '0100064': ['0100353'], '0100080': ['0100468'], '0100566': ['0100356'], '0100048': ['0100066'], '0100005': ['0100448'], '0100007': ['0100008'], '0100318': ['0100319'], '0100045': ['0100046'], '0100238': ['0100150'], '0100040': ['0100244'], '0100024': ['0100394'], '0100025': ['0100026'], '0100022': ['0100419'], '0100009': ['0100010'], '0100020': ['0100021'], '0100313': ['0100350'], '0100297': ['0100381'], '0100490': ['0100484'], '0100049': ['0100336'], '0100075': ['0100076'], '0100074': ['0100075'], '0100077': ['0000195'], '0100071': ['0100072'], '0100265': ['0000202'], '0100266': ['0000201'], '0100035': ['0100226'], '0100079': ['0100348'], '0100050': ['0100058'], '0100017': ['0100369'], '0100030': ['0100465'], '0100033': ['0100322'], '0100058': ['0100056'], '0100013': ['0100326'], '0100036': ['0100463'], '0100321': ['0100320'], '0100323': ['0100503'], '0100003': ['0100004'], '0100056': ['0100489'], '0100055': ['0100033'], '0100053': ['0100495'], '0100286': ['0100461'], '0100285': ['0100196'], '0100482': ['0100483']}

Python dictionary created from CSV file should merge the value (integer) whenever the key repeats

I have a file named report_data.csv that contains the following:
user,score
a,10
b,15
c,10
a,10
a,5
b,10
I am creating a dictionary from this file using this code:
import csv

with open('report_data.csv') as f:
    f.readline()  # Skip over the column titles
    mydict = dict(csv.reader(f, delimiter=','))
After running this code mydict is:
mydict = {'a':5,'b':10,'c':10}
But I want it to be:
mydict = {'a':25,'b':25,'c':10}
In other words, whenever a key that already exists in mydict is encountered while reading a line of the file, the new value in mydict associated with that key should be the sum of the old value and the integer that appears on that line of the file. How can I do this?
The most straightforward way is to use defaultdict or Counter from the collections module.
from collections import Counter

summary = Counter()
with open('report_data.csv') as f:
    f.readline()
    for line in f:
        lbl, n = line.split(",")
        n = int(n)
        summary[lbl] = summary[lbl] + n
One of the most useful features of the Counter class is the most_common() method, which is absent from plain dictionaries and from defaultdict.
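For example, a small sketch of most_common() on totals like the ones built above:

```python
from collections import Counter

# Totals as they would come out of the summing loop above.
summary = Counter({'a': 25, 'b': 25, 'c': 10})

# most_common() returns (key, count) pairs ordered from highest count down;
# ties keep insertion order. most_common(n) returns just the top n.
print(summary.most_common())   # [('a', 25), ('b', 25), ('c', 10)]
print(summary.most_common(1))  # [('a', 25)]
```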
This should work for you:
import csv

with open('report_data.csv') as f:
    f.readline()
    mydict = {}
    for line in csv.reader(f, delimiter=','):
        mydict[line[0]] = mydict.get(line[0], 0) + int(line[1])
Try this:
import csv

mydict = {}
with open('report_data.csv') as f:
    f.readline()
    x = csv.reader(f, delimiter=',')
    for x1 in x:
        if mydict.get(x1[0]):
            mydict[x1[0]] += int(x1[1])
        else:
            mydict[x1[0]] = int(x1[1])
print(mydict)
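For comparison, the same accumulation with a defaultdict(int) avoids both the .get call and the if/else; a sketch, with an in-memory buffer standing in for open('report_data.csv') so it is self-contained:

```python
import csv
import io
from collections import defaultdict

# Stand-in for open('report_data.csv'); replace with a real file handle.
f = io.StringIO("user,score\na,10\nb,15\nc,10\na,10\na,5\nb,10\n")

reader = csv.reader(f)
next(reader)  # skip the user,score header
mydict = defaultdict(int)
for user, score in reader:
    mydict[user] += int(score)  # missing keys start at 0 automatically

print(dict(mydict))  # {'a': 25, 'b': 25, 'c': 10}
```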
