How to convert CSV data into a dictionary using itertools.groupby - python

I have a text file, job.txt, which is below
job,salary
Developer,29000
Developer,28000
Tester,27000
Tester,26000
My code is
with open(r'C:\Users\job.txt') as f:
    file_content = f.readlines()

data = {}
for i, line in enumerate(file_content):
    if i == 0:
        continue
    job, salary = line.split(",")
    job = job.strip()
    salary = int(salary.strip())
    if job not in data:
        data[job] = []
    data[job].append(salary)
print("data =", data)
My expected result is below
data = {'Developer': [29000, 28000], 'Tester': [27000, 26000]}
How can I convert my code to use itertools.groupby?

Here is the code that will generate the dictionary you wanted.
from itertools import groupby

data = [
    ["Developer", 29000],
    ["Developer", 28000],
    ["Tester", 27000],
    ["Tester", 26000],
]

def keyfunc(e):
    return e[0]

unique_keys = {}
data = sorted(data, key=keyfunc)  # groupby only groups consecutive items
for k, g in groupby(data, keyfunc):
    unique_keys[k] = [i[1] for i in g]

print(unique_keys)
# {'Developer': [29000, 28000], 'Tester': [27000, 26000]}
P.S.: I would suggest using the csv module to read the file instead of parsing it yourself.
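Combining both suggestions, a csv-based groupby version might look like this (a sketch; it recreates the sample file first so the snippet is self-contained):

```python
import csv
from itertools import groupby
from operator import itemgetter

# recreate the sample job.txt so the snippet runs on its own
with open("job.txt", "w") as f:
    f.write("job,salary\nDeveloper,29000\nDeveloper,28000\nTester,27000\nTester,26000\n")

with open("job.txt", newline="") as f:
    reader = csv.reader(f)
    next(reader)                              # skip the header row
    rows = sorted(reader, key=itemgetter(0))  # groupby needs sorted input

data = {job: [int(salary) for _, salary in grp]
        for job, grp in groupby(rows, key=itemgetter(0))}
print(data)  # {'Developer': [29000, 28000], 'Tester': [27000, 26000]}
```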

Try this if pandas is an option:
from collections import defaultdict

import pandas as pd

d = pd.read_csv('job.txt').to_numpy().tolist()
res = defaultdict(list)
for v, k in d:
    res[v].append(k)
d = dict(res)
print(d)
# {'Developer': [29000, 28000], 'Tester': [27000, 26000]}
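If pandas is already in play, the intermediate defaultdict isn't strictly needed; a single groupby gets the same result (a sketch, with the file contents inlined as a DataFrame so it stands alone):

```python
import pandas as pd

# inline stand-in for job.txt
df = pd.DataFrame({"job": ["Developer", "Developer", "Tester", "Tester"],
                   "salary": [29000, 28000, 27000, 26000]})

# collect each group's salaries into a list, then convert to a plain dict
d = df.groupby("job", sort=False)["salary"].apply(list).to_dict()
print(d)  # {'Developer': [29000, 28000], 'Tester': [27000, 26000]}
```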

You can only rely on groupby if your data is already chunked into categories.
from itertools import groupby

with open("job.txt") as f:
    rows = [x.split(",") for x in f.readlines()[1:]]

data = {
    k.strip(): [int(y[1]) for y in v]
    for k, v in groupby(rows, key=lambda x: x[0])
}
With that in mind, I think a defaultdict is more appropriate here. Ordering is automatically handled and it's just less clever. Additionally, there's no need to slurp the file into memory or sort it (if unordered). Use dict(data) at the end if you don't like the defaultdict subclass.
from collections import defaultdict

data = defaultdict(list)
with open("job.txt") as f:
    for i, line in enumerate(f):
        if i:  # skip the header row
            job, salary = [x.strip() for x in line.split(",")]
            data[job].append(int(salary))
As mentioned in the accepted answer, do prefer the csv module if your actual data is at all more complicated than your example. CSV can be tricky to parse correctly, and there's no reason to reinvent the wheel.
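A csv-based version of the same defaultdict loop might look like this (a sketch; csv.DictReader consumes the header row for you, and an in-memory StringIO stands in for job.txt):

```python
import csv
import io
from collections import defaultdict

# in-memory stand-in for job.txt
f = io.StringIO("job,salary\nDeveloper,29000\nDeveloper,28000\nTester,27000\nTester,26000\n")

data = defaultdict(list)
for row in csv.DictReader(f):  # each row is a dict keyed by the header
    data[row["job"]].append(int(row["salary"]))
print(dict(data))  # {'Developer': [29000, 28000], 'Tester': [27000, 26000]}
```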

Related

Python convert child dicts with same key name to csv (DictWriter)

I have a list of json file like below:
[
    {"A": {"value": 1}, "B": {"value": 2}},
    {"A": {"value": 9}, "B": {"value": 3}}
]
Which I want to turn to csv like so:
A.value,B.value
1,2
9,3
The issue is that I have nested keys which have the same name : value but should be in a separate column. I could not find an elegant solution to this anywhere yet. I would like to be able to do something like:
data = json.load(open(file, 'r'))
with open("output.csv", "w") as f:
    columns = ["A.value", "B.value"]
    cw = csv.DictWriter(f, columns)
    cw.writeheader()
    cw.writerows(data)
Which I know would work if I did not have any nested keys. I found other questions similar to this but I don't think this applies to my situation.
As an extra challenge:
I'd rather keep a generic approach. Later I might have a list of jsons like:
[
    {"A": {"value": 1}, "B": {"value": 2}, "key": 90},
    {"A": {"value": 9}, "B": {"value": 3}, "key": 91}
]
Meaning not all keys I want to add to csv will have a nested value key!
Expected output:
A.value,B.value,key
1,2,90
9,3,91
Flattening the dicts worked. Since there is a list of dicts, the flattening has to be done for each dict:
import collections.abc

def flatten(d, parent_key='', sep='.'):
    items = []
    for k, v in d.items():
        new_key = parent_key + sep + k if parent_key else k
        if isinstance(v, collections.abc.MutableMapping):
            items.extend(flatten(v, new_key, sep=sep).items())
        else:
            items.append((new_key, v))
    return dict(items)

def flatten_json_list(json_list):  # avoid shadowing the built-in "list"
    flattened_list = []
    for d in json_list:
        flattened_list.append(flatten(d))
    return flattened_list
Then the code should work as implemented in the question:
with open("out.csv", "w") as f:
    columns = ["A.value", "B.value", "key"]
    cw = csv.DictWriter(f, columns)
    cw.writeheader()
    cw.writerows(flatten_json_list(data))
This should do the job:
cw.writerows([{f'{key}.value': val['value'] for key, val in row.items()} for row in data])
or as regular loop:
for row in data:
    cw.writerow({f'{key}.value': val['value'] for key, val in row.items()})
EDIT
import csv

data = [
    {"A": {"value": 1}, "B": {"value": 2}, "key": 90},
    {"A": {"value": 9}, "B": {"value": 3}, "key": 91}
]

def parse(row):
    for key, value in row.items():
        try:
            yield f'{key}.value', value['value']
        except TypeError:  # value is not a dict, e.g. "key": 90
            yield key, value

with open("output.csv", "w") as f:
    columns = ['A.value', 'B.value', 'key']
    cw = csv.DictWriter(f, columns)
    cw.writeheader()
    cw.writerows(dict(parse(row)) for row in data)
The only thing I don't like is the hardcoded headers
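One way around the hardcoded headers is to derive the columns from the first parsed row (a sketch; it assumes every row shares the same keys, and writes to an in-memory buffer for illustration):

```python
import csv
import io

data = [
    {"A": {"value": 1}, "B": {"value": 2}, "key": 90},
    {"A": {"value": 9}, "B": {"value": 3}, "key": 91},
]

def parse(row):
    for key, value in row.items():
        try:
            yield f'{key}.value', value['value']
        except TypeError:  # plain scalar, keep the key as-is
            yield key, value

rows = [dict(parse(row)) for row in data]
out = io.StringIO()
cw = csv.DictWriter(out, fieldnames=list(rows[0]))  # columns from the first row
cw.writeheader()
cw.writerows(rows)
print(out.getvalue())
```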
jsonpath-ng can parse even such a nested json object very easily. It can be installed by the following command:
pip install --upgrade jsonpath-ng
Code:
from collections import defaultdict
import re

import jsonpath_ng as jp
import pandas as pd

jp.jsonpath.auto_id_field = 'json_path'

def json_to_df(json):
    expr = jp.parse(f'$..*.{jp.jsonpath.auto_id_field}')
    d = defaultdict(list)
    for m in expr.find(json):
        if not isinstance(m.datum.value, (dict, list)):
            d[re.sub(r'\[\d+]\.', '', m.value)].append(m.datum.value)
    return pd.DataFrame(d)

data = [{"A": {"value": 1}, "B": {"value": 2}, "key": 90},
        {"A": {"value": 9}, "B": {"value": 3}, "key": 91}]
df = json_to_df(data)
df.to_csv('output.csv', index=False)
Output (output.csv):
key,A.value,B.value
90,1,2
91,9,3
Another complicated example:
data = [{"A": {"value": 1}, "B": {"value": 2, "C": {"value": 6}}, "key": 90},
        {"A": {"value": 9}, "B": {"value": 3, "C": {"value": 8}}, "key": 91}]
json_to_df(data).to_csv('output.csv', index=False)
key,A.value,B.value,B.C.value
90,1,2,6
91,9,3,8

Print out dictionary from file

My file contains lines like:
E;Z;X;Y
I tried:
from collections import defaultdict

dl = defaultdict(list)
for line in file:
    line = line.strip().split(';')
    for x in line:
        dl[line[0]].append(line[1:4])
dl = dict(dl)
print(votep)
It prints out too many results. I have an __init__ that reads the file.
What can I change to make it work?
The csv module could be really handy here; just use a semicolon as your delimiter and a simple dict comprehension will suffice:
import csv

with open('filename.txt') as file:
    reader = csv.reader(file, delimiter=';')
    votep = {k: vals for k, *vals in reader}
print(votep)
Without using csv you can just use str.split:
with open('filename.txt') as file:
    votep = {k: vals for k, *vals in (s.strip().split(';') for s in file)}
print(votep)
Further simplified, without the comprehension, this would look as follows:
votep = {}
for line in file:
    key, *vals = line.strip().split(';')
    votep[key] = vals
And FYI, key, *vals = line.strip().split(';') is just multiple variable assignment coupled with iterable unpacking. The star just means: after the first value is assigned to key, whatever is left in the iterable goes into vals.
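A quick illustration of that unpacking, using one of the sample lines:

```python
# the first field becomes the key, the rest go into vals as a list
key, *vals = "E;Z;X;Y".split(";")
print(key)   # E
print(vals)  # ['Z', 'X', 'Y']
```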
If you read the file into a list, a simple function can iterate over it and build the dictionary you expect:
a = [
    'A;X;Y;Z',
    'B;Y;Z;X',
    'C;Y;Z;X',
    'D;Z;X;Y',
    'E;Z;X;Y',
]

def vp(a):
    dl = {}
    for i in a:
        split_keys = i.split(';')
        dl[split_keys[0]] = split_keys[1:]
    print(dl)

Group certain columns and summing up another column from a CSV

I have data in a csv that needs to be parsed. It looks like:
Date,Tag,Amount
13/06/2018,ABC,6750000
13/06/2018,ABC,159800
24/05/2018,ABC,-1848920
16/05/2018,AB,-1829700
16/05/2018,AB,3600000
28/06/2018,A,15938000
16/05/2018,AB,3748998
28/06/2018,A,1035000
28/06/2018,A,1035000
14/06/2018,ABC,2122717
You can see each date has a tag and an amount next to it. What I am trying to achieve is to group by date and tag (making them the key) and sum up the amounts.
expected result
Date,Tag,Amount
13/06/2018,ABC,5220680
16/05/2018,AB,5519298
28/06/2018,A,18008000
14/06/2018,ABC,2122717
The code I am using now is below, but it is not working.
from collections import defaultdict
import csv

d = defaultdict(int)
with open("file.csv") as f:
    for line in f:
        tokens = [t.strip() for t in line.split(",")]
        try:
            date = int(tokens[0])
            tag = int(tokens[1])
            amount = int(tokens[2])
        except ValueError:
            continue
        d[date] += amount
print(d)
Could someone show me how to achieve this without using pandas?
You should definitely use pandas. Unless you are required to code this yourself, you can just install the pandas module, import it (import pandas as pd), and solve this problem with two simple and intuitive lines of code:
>>> df = pd.read_csv('file.csv')
>>> df.groupby(['Date', 'Tag']).Amount.sum()
Date        Tag
13/06/2018  ABC     6909800
14/06/2018  ABC     2122717
16/05/2018  AB      5519298
24/05/2018  ABC    -1848920
28/06/2018  A      18008000
If you really need to code this yourself, you can use a nested defaultdict to get two layers of grouping. Also, there is no reason to cast the date and the tag to int; just remove that.
d = defaultdict(lambda: defaultdict(int))
with open("file.csv") as f:
    for line in f:
        tokens = [t.strip() for t in line.split(",")]
        try:
            date = tokens[0]
            tag = tokens[1]
            amount = int(tokens[2])
        except ValueError:
            continue
        d[date][tag] += amount
The output is:
13/06/2018 ABC 6909800
24/05/2018 ABC -1848920
16/05/2018 AB 5519298
28/06/2018 A 18008000
14/06/2018 ABC 2122717
To output the result above, just iterate through the items:
for k, v in d.items():
    for k2, v2 in v.items():
        print(k, k2, v2)
To make your code even better, read the header line first and then iterate from the second line to the end. That way the try/except can be removed and you get simpler, cleaner code. But you can take it from here, right? ;)
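That header-skipping suggestion could be sketched like this (with an in-memory stand-in for the first few rows of file.csv; next(f) consumes the header line, so the try/except disappears):

```python
import io
from collections import defaultdict

# in-memory stand-in for the first few rows of file.csv
f = io.StringIO("Date,Tag,Amount\n13/06/2018,ABC,6750000\n13/06/2018,ABC,159800\n")

d = defaultdict(lambda: defaultdict(int))
next(f)  # skip the header line
for line in f:
    date, tag, amount = (t.strip() for t in line.split(","))
    d[date][tag] += int(amount)
print({k: dict(v) for k, v in d.items()})  # {'13/06/2018': {'ABC': 6909800}}
```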
To write the result to a text file, simply:
s = '\n'.join(['{0} {1} {2}'.format(k, k2, v2) for k, v in d.items() for k2, v2 in v.items()])
with open('output.txt', 'w') as f:
    f.write(s)
This is one approach using a simple iteration.
Ex:
from collections import defaultdict
import csv

result = defaultdict(int)
with open(filename) as infile:
    reader = csv.reader(infile)
    header = next(reader)  # skip the header row
    for line in reader:
        result[tuple(line[:2])] += int(line[2])

print(header)
for k, v in result.items():
    print(k[0], k[1], v)
Output:
14/06/2018 ABC 2122717
13/06/2018 ABC 6909800
28/06/2018 A 18008000
16/05/2018 AB 5519298
24/05/2018 ABC -1848920
To write it back to a CSV:
with open(filename, "w", newline="") as outfile:
    writer = csv.writer(outfile)
    writer.writerow(header)
    for k, v in result.items():
        writer.writerow([k[0], k[1], v])
You can use itertools.groupby:
from itertools import groupby
import csv

header, *data = csv.reader(open('filename.csv'))
# sort first: groupby only groups consecutive rows with equal keys
new_data = [[a, list(b)]
            for a, b in groupby(sorted(data, key=lambda x: x[:2]), key=lambda x: x[:2])]
results = [[*a, sum(int(c) for *_, c in b)] for a, b in new_data]

with open('calc_results.csv', 'w') as f:
    write = csv.writer(f)
    write.writerows([header, *results])
Output:
Date,Tag,Amount
13/06/2018,ABC,6909800
14/06/2018,ABC,2122717
16/05/2018,AB,5519298
24/05/2018,ABC,-1848920
28/06/2018,A,18008000

editing a text file in python and making a new one

I have a text file like this:
>ENST00000511961.1|ENSG00000013561.13|OTTHUMG00000129660.5|OTTHUMT00000370661.3|RNF14-003|RNF14|278
MSSEDREAQEDELLALASIYDGDEFRKAESVQGGETRIYLDLPQNFKIFVSGNSNECLQNSGFEYTICFLPPLVLNFELPPDYPSSSPPSFTLSGKWLSPTQLSALCKHLDNLWEEHRGSVVLFAWMQFLKEETLAYLNIVSPFELKIGSQKKVQRRTAQASPNTELDFGGAAGSDVDQEEIVDERAVQDVESLSNLIQEILDFDQAQQIKCFNSKLFLCSICFCEKLGSECMYFLECRHVYCKACLKDYFEIQIRDGQVQCLNCPEPKCPSVATPGQ
>ENST00000506822.1|ENSG00000013561.13|OTTHUMG00000129660.5|OTTHUMT00000370662.1|RNF14-004|GAPDH|132
MSSEDREAQEDELLALASIYDGDEFRKAESVQGGETRIYLDLPQNFKIFVSGNSNECLQNSGFEYTICFLPPLVLNFELPPDYPSSSPPSFTLSGKWLSPTQLSALCKHLDNLWEEHRGSVVLFAWMQFLKE
>ENST00000513019.1|ENSG00000013561.13|OTTHUMG00000129660.5|OTTHUMT00000370663.1|RNF14-005|ACTB|99
MSSEDREAQEDELLALASIYDGDEFRKAESVQGGETRIYLDLPQNFKIFVSGNSNECLQNSGFEYTICFLPPLVLNFELPPDYPSSSPPSFTLSGKWLS
>ENST00000356143.1|ENSG00000013561.13|OTTHUMG00000129660.5|-|RNF14-202|HELLE|474
MSSEDREAQEDELLALASIYDGDEFRKAESVQGGETRIYLDLPQNFKIFVSGNSNECLQNSGFEYTICFLPPLVLNFELPPDYPSSSPPSFTLSGKWLSPTQLSALCKHLDNLWEEHRGSVVLFAWMQFLKEETLAYLNIVSPFELKIGSQKKVQRRTAQASPNTELDFGGAAGSDVDQEEIVDERAVQDVESLSNLIQEILDFDQAQQIKCFNSKLFLCSICFCEKLGSECMYFLECRHVYCKACLKDYFEIQIRDGQVQCLNCPEPKCPSVATPGQVKELVEAELFARYDRLLLQSSLDLMADVVYCPRPCCQLPVMQEPGCTMGICSSCNFAFCTLCRLTYHGVSPCKVTAEKLMDLRNEYLQADEANKRLLDQRYGKRVIQKAL
I want to build a list in Python of the 6th pipe-separated field of the lines that start with ">".
To do so, I first build a dictionary, whose keys should then give me the list I want, like this:
from itertools import groupby

with open('infile.txt') as f:
    groups = groupby(f, key=lambda x: not x.startswith(">"))
    d = {}
    for k, v in groups:
        if not k:
            key, val = list(v)[0].rstrip(), "".join(map(str.rstrip, next(groups)[1], ""))
            d[key] = val

k = d.keys()
res = [el[5:] for s in k for el in s.split("|")]
but it returns all the fields of the lines that start with ">".
Do you know how to fix it?
here is expected output:
["RNF14", "GAPDH", "ACTB", "HELLE"]
This should help, using a simple iteration, str.startswith, and str.split.
Demo:
res = []
with open(filename, "r") as infile:
    for line in infile:
        if line.startswith(">"):
            val = line.split("|")
            res.append(val[5])
print(res)
Output:
['RNF14', 'GAPDH', 'ACTB', 'HELLE']
In your code, replace
res = [el[5:] for s in k for el in s.split("|")]
with
res = [s.split("|")[5] for s in k]  # should work
A solution close to yours, using filter and map instead of groupby:
with open('infile.txt') as f:
    lines = f.readlines()
groups = filter(lambda x: x.startswith(">"), lines)
res = list(map(lambda x: x.split('|')[5], groups))

Converting text file to dictionary in python

So let's say I want to convert the following to a dictionary where the 1st column holds the keys and the 2nd column the values.
http://pastebin.com/29bXkYhd
The following code works for this (assume romEdges.txt is the name of the file):
f = open('romEdges.txt')
dic = {}
for l in f:
    k, v = l.split()
    if k in dic:
        dic[k].extend(v)
    else:
        dic[k] = [v]
f.close()
OK
But why doesn't the code work for this file?
http://pastebin.com/Za0McsAM
If anyone can tell me the correct code for the 2nd text file to work as well I would appreciate it.
Thanks in advance.
You should use append instead of extend:
from collections import defaultdict

d = defaultdict(list)
with open("romEdges.txt") as fin:
    for line in fin:
        k, v = line.strip().split()
        d[k].append(v)
print(d)
or using sets to prevent duplicates:
from collections import defaultdict

d = defaultdict(set)
with open("romEdges.txt") as fin:
    for line in fin:
        k, v = line.strip().split()
        d[k].add(v)
print(d)
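For completeness, plain dict.setdefault gives the same behaviour as defaultdict(list) without an import (a sketch using a few key/value pairs in the style of the question's data):

```python
pairs = [("0100464", "0100360"), ("0100464", "0100361"), ("0100317", "0100039")]

dic = {}
for k, v in pairs:
    dic.setdefault(k, []).append(v)  # creates the list the first time k is seen
print(dic)  # {'0100464': ['0100360', '0100361'], '0100317': ['0100039']}
```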
If you want to add the data to a dictionary, you can use update. Please use the following code:
f = open('your file name')
dic = {}
for l in f:
    k, v = l.split()
    if k in dic:
        dic.update({k: [v]})
    else:
        dic[k] = [v]
print(dic)
f.close()
output:
{'0100464': ['0100360'], '0100317': ['0100039'], '0100405': ['0100181'], '0100545': ['0100212'], '0100008': ['0000459'], '0100073': ['0100072'], '0100044': ['0100426'], '0100062': ['0100033'], '0100061': ['0000461'], '0100066': ['0100067'], '0100067': ['0100164'], '0100064': ['0100353'], '0100080': ['0100468'], '0100566': ['0100356'], '0100048': ['0100066'], '0100005': ['0100448'], '0100007': ['0100008'], '0100318': ['0100319'], '0100045': ['0100046'], '0100238': ['0100150'], '0100040': ['0100244'], '0100024': ['0100394'], '0100025': ['0100026'], '0100022': ['0100419'], '0100009': ['0100010'], '0100020': ['0100021'], '0100313': ['0100350'], '0100297': ['0100381'], '0100490': ['0100484'], '0100049': ['0100336'], '0100075': ['0100076'], '0100074': ['0100075'], '0100077': ['0000195'], '0100071': ['0100072'], '0100265': ['0000202'], '0100266': ['0000201'], '0100035': ['0100226'], '0100079': ['0100348'], '0100050': ['0100058'], '0100017': ['0100369'], '0100030': ['0100465'], '0100033': ['0100322'], '0100058': ['0100056'], '0100013': ['0100326'], '0100036': ['0100463'], '0100321': ['0100320'], '0100323': ['0100503'], '0100003': ['0100004'], '0100056': ['0100489'], '0100055': ['0100033'], '0100053': ['0100495'], '0100286': ['0100461'], '0100285': ['0100196'], '0100482': ['0100483']}
