Grouping data in Python

I have the following (space delimited) input:
2012-10-05 PETER 6
2012-10-05 PETER 4
2012-10-06 PETER 60
2012-10-06 TOM 10
2012-10-08 SOMNATH 80
And I would like to achieve the following pipe-delimited output:
(where the columns are [DATE AND NAME, NUM ENTRIES, SUM OF LAST COL])
2012-10-05 PETER|2|10
2012-10-06 PETER|1|60
2012-10-06 TOM|1|10
2012-10-08 SOMNATH|1|80
This is my code so far:
s = open("output.txt","r")
fn = s.readlines()
d = {}
for line in fn:
    parts = line.split()
    if parts[0] in d:
        d[parts[0]][1] += int(parts[2])
        d[parts[0]][2] += 1
    else:
        d[parts[0]] = [parts[1], int(parts[2]), 1]
for date in sorted(d):
    print "%s %s|%d|%d" % (date, d[date][0], d[date][2], d[date][1])
I am getting the output as:
2012-10-06 PETER|2|70
instead of
2012-10-06 PETER|1|60
and TOM isn't showing in the list.
What do I need to do to correct my code?

import collections

d = collections.defaultdict(list)
with open('output.txt', 'r') as f:
    for line in f:
        date, name, val = line.split()
        d[date, name].append(int(val))
for (date, name), vals in sorted(d.items()):
    print '%s %s|%d|%d' % (date, name, len(vals), sum(vals))

<3 itertools
import itertools

with open('output.txt', 'r') as f:
    splitlines = (line.split() for line in f if line.strip())
    for (date, name), bits in itertools.groupby(splitlines, key=lambda bits: bits[:2]):
        total = 0
        count = 0
        for _, _, val in bits:
            total += int(val)
            count += 1
        print '%s %s|%d|%d' % (date, name, count, total)
If you don't want to use groupby (either it is unavailable, or your input data isn't guaranteed to be sorted), here's a conventional solution (which is effectively just a fixed version of your code):
d = {}
with open('output.txt', 'r') as f:
    for line in f:
        date, name, val = line.split()
        key = (date, name)
        if key not in d:
            d[key] = [0, 0]
        d[key][0] += int(val)
        d[key][1] += 1
for key in sorted(d):
    date, name = key
    total, count = d[key]
    print '%s %s|%d|%d' % (date, name, count, total)
Note that we use (date, name) as the key instead of just using date; keying on date alone is what caused TOM's 2012-10-06 row to be added into PETER's entry and TOM to disappear from the output.
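For reference, running any of these three versions on the sample input above produces exactly the output asked for:
2012-10-05 PETER|2|10
2012-10-06 PETER|1|60
2012-10-06 TOM|1|10
2012-10-08 SOMNATH|1|80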

Related

Python CSV sum value if they have same ID/name

I want to sum all values that have the same name/ID in a CSV file.
Right now I am only looking for IDs with the name 'company'.
csv file format:
company A, 100
company B, 200
company A, 300
The end result I am looking for is:
company A, 400
company B, 200
total: 600
My code so far:
import csv

name = ''
num = ''
total = 0
with open('xx.csv', 'r', newline='') as csvfile:
    reader = csv.reader(csvfile)
    next(csvfile)
    for a in reader:
        if a[0].__contains__('company'):
            name = (a[0])
            num = (a[1])
            total += float(a[1])
            print(str(name) + ', ' + str(num))
print('total: ' + str(total))
First, CSV files typically use commas, and the delimiter for csv.reader must be a single character, so I suggest updating your file to properly use commas.
Secondly, to aggregate the companies, you need to store them as you iterate the file. The easiest way is to use a dictionary.
Only after you've aggregated everything should you loop a second time over the aggregated values and print the final total.
import csv
from collections import defaultdict

totals = defaultdict(int)
total = 0
with open('companies.csv') as csvfile:
    reader = csv.reader(csvfile, delimiter=',')
    # next(csvfile)  # shown file has no header
    for row in reader:
        if not row[0].startswith('company'):
            continue
        name, value = row
        totals[name] += float(value)

total = 0
for name, value in totals.items():
    print(f'{name},{value}')
    total += value
print(f'total: {total}')
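For reference, with the three sample rows above this prints the values as floats (because of float(value)); switch to int(value) if you want whole numbers:
company A,400.0
company B,200.0
total: 600.0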
You don't necessarily need the csv module here. Just read every line, split it from the right (rsplit), and fill a dictionary like below:
d = {}
with open('your_file.csv') as f:
    # next(f) - If header needs to be skipped
    for line in f:
        name, value = line.rsplit(',', maxsplit=1)
        d[name] = d.get(name, 0) + int(value)
for k, v in d.items():
    print(f"{k}, {v}")
print(f"total: {sum(d.values())}")
output:
company A, 400
company B, 200
total: 600
In order not to iterate through the dictionary's values again to calculate the total (the sum(d.values()) expression), you can add to total while you are printing the items:
d = {}
with open('new.csv') as f:
    for line in f:
        name, value = line.rsplit(',', maxsplit=1)
        d[name] = d.get(name, 0) + int(value)

total = 0
for k, v in d.items():
    total += v
    print(f"{k}, {v}")
print(f"total: {total}")

Compare two files with input from third file and write the biggest count to the fourth file

I have three files:
file:1
mango
banana
orange
file:2 -> the count is a string, because when I wrote to file:2, write() only let me write strings.
mango 2
banana 3
file:3 -> the count is a string, because when I wrote to file:3, write() only let me write strings.
banana 4
orange 3
I want to take each entry in file:1 and check it against file:2 & file:3. If it is present, I want to take the entry with the biggest count and write it to file:4.
Expected output in file:4
mango 2
banana 4
orange 3
I tried writing file:2 and file:3 to dictionaries and doing a dictionary compare, but I am getting lost with too many open() files.
I am new to Python. Not being able to write an integer to a file with write() threw me off.
Appreciate your help/hint.
The following produces file4 from file1, file2, and file3:
def load_file(filepath):
    " Loads a file as a dictionary "
    with open(filepath, 'r') as f:
        return dict(line.rstrip().split() for line in f)

# Get keys
with open('file1.txt') as file1:
    keys = [line.rstrip() for line in file1]

# Produce output (file4)
with open('file4.txt', 'w') as file_out:
    dic1 = load_file('file2.txt')
    dic2 = load_file('file3.txt')
    for k in keys:
        v1 = int(dic1.get(k, 0))  # convert dict counts to int
        v2 = int(dic2.get(k, 0))  # (default to 0 if not present)
        v = max(v1, v2)
        if v > 0:  # only write if count > 0
            file_out.write(f"{k} {v}\n")
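With the three sample files from the question, file4.txt ends up containing exactly the expected output:
mango 2
banana 4
orange 3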
This works for as many files as you want for input.
values = {}

def func(file):
    number_of_lines = file.readlines()
    for line in number_of_lines:
        elements = line.split()
        if (elements[0] in values):
            # keep the largest count seen so far for this word
            if (int(elements[1]) > int(values[elements[0]])):
                values[elements[0]] = elements[1]
        else:
            values[elements[0]] = 0
    file.close()

f = open("1.txt", "r")
func(f)
f = open("2.txt", "r")
func(f)
f = open("3.txt", "r")
func(f)

f = open("4.txt", "w+")
for key, val in values.items():
    print (key, " ", val)
    to_write = key + " " + str(val) + "\n"  # str() so a never-updated 0 doesn't break the concatenation
    f.write(to_write)
f.close()

Python Writing a Dictionary to Text File isn't producing correct output

I have a dictionary as such -
d = {"Dave":("Male", "96"), "Alice":("Female", "98")}
I want to write it to a text file in such a format -
Dave
Male
96
Alice
Female
98
This is my current code -
d = {"Dave":("Male", "96"), "Alice":("Female", "98")}
with open("dictionary.txt", 'w') as f:
    for key, value in d.items():
        f.write('%s \n %s \n' % (key, value))
It is, however, producing the following output in the text file:
Dave
('Male', '96')
Alice
('Female', '98')
How can I adjust this?
Please help! Thanks.
When you convert a tuple to a str using formatting, you get the representation of the tuple, which is (roughly; there are actually two methods, __str__ and __repr__) what Python prints when you print the item in the console.
To get the elements without the tuple decorations, you have to unpack the tuple. One option (using format):
for key, value in d.items():
    f.write("{}\n{}\n{}\n".format(key, *value))
* unpacks value into its 2 elements; format does the rest.
An even more compact way is to multiply the format string by 3 (less copy/paste):
for key, value in d.items():
    f.write(("{}\n"*3).format(key, *value))
I used a for x in range loop that iterates over every value in each key:
d = {"Dave":("Male", "96"), "Alice":("Female", "98")}
with open("dictionary.txt", 'w') as f:
    for key, value in d.items():
        f.write('%s\n' % key)
        for x in range(0, 2):
            f.write('%s\n' % value[x])
The following works in Python 3.6:
d = {"Dave":("Male", "96"), "Alice":("Female", "98")}
with open('dictionary.txt', mode='w') as f:
    for name, (sex, age) in d.items():
        f.write(f'{name}\n{sex}\n{age}\n')
You can unpack the tuple at the top of the for loop. Additionally, in Python 3.6, you can use the f-string mechanism to directly interpolate variable values into strings.
d = {"Dave":("Male", "96"), "Alice":("Female", "98")}
with open("dictionary.txt", 'w') as f:
    for key in d.keys():
        f.write('%s \n' % (key))
        for v in d[key]:
            f.write('%s \n' % (v))

Sort Average In A file

I have a file with 3 scores for each person. Each person has their own row. I want to use these scores and get the average of all 3 of them. Their scores are separated by tabs, and I want the results in descending order of average. For example:
tam 10 6 11
tom 3 7 3
tim 5 4 6
these people would come out with an average of:
tam 9
tim 5
tom 4
I want these to print to the Python shell, but not be saved to the file.
with open("file.txt") as file1:
    d = {}
    count = 0
    for line in file1:
        column = line.split()
        names = column[0]
        average = (int(column[1].strip()) + int(column[2].strip()) + int(column[3].strip())) / 3
        count = 0
        while count < 3:
            d.setdefault(names, []).append(average)
            count = count + 1
    for names, v in sorted(d.items()):
        averages = (sum(v) / 3)
        print(names, average)
        averageslist = []
        averageslist.append(averages)
My code only finds the first person's average and outputs it for all of them. I also want the output in descending order of averages.
You can use the following code, which parses your file into a list of (name, average) tuples and prints every entry of the list sorted by average:
import operator

with open("file.txt") as f:
    data = []
    for line in f:
        parts = line.split()
        name = parts[0]
        vals = parts[1:]
        avg = sum(int(x) for x in vals) / len(vals)
        data.append((name, avg))
for person in sorted(data, key=operator.itemgetter(1), reverse=True):
    print("{} {}".format(*person))
You are almost correct. You are already calculating the average in the first step, so there is no need for sum(v)/3 again. Try this:
with open("file.txt") as file1:
    d = {}
    count = 0
    for line in file1:
        column = line.split()
        names = column[0]
        average = (int(column[1].strip()) + int(column[2].strip()) + int(column[3].strip())) / 3
        d[names] = average
    for names, v in sorted(d.items(), key=lambda x: x[1], reverse=True):  # increasing order ==> sorted(d.items(), key=lambda x: x[1])
        print(names, v)
#output
('tam', 9)
('tim', 5)
('tom', 4)
To sort by name
for names, v in sorted(d.items()):
    print(names, v)
#output
('tam', 9)
('tim', 5)
('tom', 4)
The issue is this:
averages = (sum(v)/3)
print(names,average)
Notice that on the first line you are computing averages (with an s at the end) and on the next line you are printing average (without an s).
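A minimal sketch of the corrected final loop, keeping everything else in the question's code as it is:
for names, v in sorted(d.items()):
    averages = sum(v) / 3
    print(names, averages)  # print the value that was just computed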
Try This:
from operator import itemgetter

with open("file.txt") as file1:
    d = {}
    count = 0
    for line in file1:
        column = line.split()
        names = column[0]
        average = (int(column[1].strip()) + int(column[2].strip()) + int(column[3].strip())) / 3
        count = 0
        d.setdefault(names, []).append(average)
    for names, v in sorted(d.items(), key=itemgetter(1), reverse=True):
        print(names, v)

Delete and save duplicate in another file

In test.txt:
1 a
2 b
3 c
4 a
5 d
6 c
I want to remove duplicates and save the rest in test2.txt:
2 b
5 d
I tried to start with the code below.
file1 = open('../test.txt').read().split('\n')
#file2 = open('../test2.txt', "w")
word = set()
for line in file1:
    if line:
        sline = line.split('\t')
        if sline[1] not in word:
            print sline[0], sline[1]
            word.add(sline[1])
#file2.close()
The result from the code is:
1 a
2 b
3 c
5 d
Any suggestion?
You can use collections.OrderedDict here:
>>> from collections import OrderedDict
>>> with open('abc') as f:
...     dic = OrderedDict()
...     for line in f:
...         v, k = line.split()
...         dic.setdefault(k, []).append(v)
Now dic looks like:
OrderedDict([('a', ['1', '4']), ('b', ['2']), ('c', ['3', '6']), ('d', ['5'])])
Now we only need those keys which have only 1 item in the list.
>>> for k, v in dic.iteritems():
...     if len(v) == 1:
...         print v[0], k
...
2 b
5 d
What you're doing is just making sure every second item (the letter) gets printed out only once, which obviously is not what you say you want.
You must split your code into two halves: one part that reads the file and gathers statistics about letter counts, and one part that prints only those with a count == 1.
Converting your original code (I just made it a little simpler):
file1 = open('../test.txt')
words = {}
for line in file1:
    if line:
        line_num, letter = line.rstrip().split('\t')
        if letter not in words:
            words[letter] = [1, line_num]
        else:
            words[letter][0] += 1
for letter, (count, line_num) in words.iteritems():
    if count == 1:
        print line_num, letter
I tried to keep it as similar to your style as possible:
file1 = open('../test.txt').read().split('\n')
word = set()
test = []
duplicate = []
sin_duple = []
num_lines = 0
num_duplicates = 0
for line in file1:
    if line:
        sline = line.split(' ')
        test.append(" ".join([sline[0], sline[1]]))
        if (sline[1] not in word):
            word.add(sline[1])
            num_lines = num_lines + 1
        else:
            sin_duple.append(sline[1])
            duplicate.append(" ".join([sline[0], sline[1]]))
            num_lines = num_lines + 1
            num_duplicates = num_duplicates + 1
for i in range(0, num_lines + 1):
    for item in test:
        for j in range(0, num_duplicates):
            #print((str(i) + " " + str(sin_duple[j])))
            if item == (str(i) + " " + str(sin_duple[j])):
                test.remove(item)
file2 = open("../test2.txt", 'w')
for item in test:
    file2.write("%s\n" % item)
file2.close()
How about some Pandas
import pandas as pd

a = pd.read_csv("test_remove_dupl.txt", sep=",")
# keep=False drops every duplicated row entirely instead of keeping the first occurrence
b = a.drop_duplicates(subset="a", keep=False)
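For the question's actual file layout (space-delimited test.txt with no header), a fuller sketch could look like the following; the column labels num and letter are made up for illustration:
import pandas as pd

# read the two whitespace-separated columns; the names are arbitrary labels
df = pd.read_csv("test.txt", sep=r"\s+", header=None, names=["num", "letter"])
# keep=False removes every row whose letter appears more than once
unique = df.drop_duplicates(subset="letter", keep=False)
unique.to_csv("test2.txt", sep=" ", header=False, index=False)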
