Group items in a list and calculate sums - python

I have a list with weekly figures and need to obtain the grouped totals by month.
The following code does the job, but there should be a more Pythonic way of doing it using the standard library.
The drawback of the code below is that the list needs to be in sorted order.
#Test data (not sorted)
sum_weekly=[('2020/01/05', 59), ('2020/01/19', 88), ('2020/01/26', 95), ('2020/02/02', 89),
('2020/02/09', 113), ('2020/02/16', 90), ('2020/02/23', 68), ('2020/03/01', 74), ('2020/03/08', 85),
('2020/04/19', 6), ('2020/04/26', 5), ('2020/05/03', 14),
('2020/05/10', 5), ('2020/05/17', 20), ('2020/05/24', 28),('2020/03/15', 56), ('2020/03/29', 5), ('2020/04/12', 2),]
month = sum_weekly[0][0].split('/')[1]
count = 0
out = []
for item in sum_weekly:
    m_sel = item[0].split('/')[1]
    if m_sel != month:
        # month changed: store the finished total and start a new one
        out.append((month, count))
        count = item[1]
    else:
        count += item[1]
    month = m_sel
out.append((month, count))
# monthly sums for sorted input: ('01', 242), ('02', 360), ('03', 220), ('04', 13), ('05', 67)
print(out)

You could use defaultdict to store the result instead of a list. The keys of the dictionary would be the months and you can simply add the values with the same month (key).
Possible implementation:
# Test Data
from collections import defaultdict
sum_weekly = [('2020/01/05', 59), ('2020/01/19', 88), ('2020/01/26', 95), ('2020/02/02', 89),
('2020/02/09', 113), ('2020/02/16', 90), ('2020/02/23', 68), ('2020/03/01', 74), ('2020/03/08', 85),
('2020/03/15', 56), ('2020/03/29', 5), ('2020/04/12', 2), ('2020/04/19', 6), ('2020/04/26', 5),
('2020/05/03', 14),
('2020/05/10', 5), ('2020/05/17', 20), ('2020/05/24', 28)]
results = defaultdict(int)
for date, count in sum_weekly:  # unpacking makes the loop clearer
    month = date.split('/')[1]
    # because results is a defaultdict(int), a missing key is
    # created automatically and initialized to zero
    results[month] += count
print(results)
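If the result is needed in the same list-of-tuples shape as the question's out list, the dictionary items can be sorted by key afterwards. A minimal sketch, assuming all dates fall within a single year so that the two-digit month is a sufficient key; the names weekly and monthly_totals are illustrative and not part of the answer above:

```python
from collections import defaultdict

# Shortened, deliberately unsorted sample in the question's format.
weekly = [('2020/02/02', 89), ('2020/01/05', 59), ('2020/03/01', 74),
          ('2020/01/19', 88), ('2020/02/09', 113)]


def monthly_totals(rows):
    """Sum the counts per month; the input order does not matter."""
    totals = defaultdict(int)
    for date, count in rows:
        totals[date.split('/')[1]] += count
    # Sorting the items orders them by the month string: '01', '02', ...
    return sorted(totals.items())


print(monthly_totals(weekly))  # [('01', 147), ('02', 202), ('03', 74)]
```

If the data spans several years, grouping on the first seven characters of the date (year and month) instead of the month alone would keep the years apart.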

You can use itertools.groupby (part of the standard library). Under the hood it does pretty much what you did: it groups together consecutive elements for which the key function gives the same output, which is why the input is sorted first. It can look like the following:
import itertools
def select_month(item):
    return item[0].split('/')[1]
def get_value(item):
    return item[1]
result = [(month, sum(map(get_value, group)))
          for month, group in itertools.groupby(sorted(sum_weekly), select_month)]
print(result)
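The sorted() call matters here because groupby only merges adjacent items with equal keys. A small sketch of the difference, using made-up three-row data and the illustrative helper names month_of and sum_by_month:

```python
from itertools import groupby
from operator import itemgetter


def month_of(item):
    """Extract the two-digit month from a ('YYYY/MM/DD', value) tuple."""
    return item[0].split('/')[1]


def sum_by_month(rows):
    """Sum values per run of consecutive rows that share a month."""
    return [(month, sum(map(itemgetter(1), group)))
            for month, group in groupby(rows, key=month_of)]


rows = [('2020/01/05', 1), ('2020/02/02', 2), ('2020/01/19', 3)]

unsorted_result = sum_by_month(rows)
sorted_result = sum_by_month(sorted(rows))

print(unsorted_result)  # [('01', 1), ('02', 2), ('01', 3)]
print(sorted_result)    # [('01', 4), ('02', 2)]
```

Without sorting, January shows up twice because its rows are not next to each other.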

Terse, but maybe not that pythonic:
import calendar, collections, datetime, functools
{calendar.month_name[i]: val for i, val in functools.reduce(lambda a, b: a + b, [collections.Counter({datetime.datetime.strptime(time, '%Y/%m/%d').month: val}) for time, val in sum_weekly]).items()}

A method using PySpark:
from pyspark import SparkContext
sc = SparkContext()
l = sc.parallelize(sum_weekly)
r = l.map(lambda x: (x[0].split("/")[1], x[1])).reduceByKey(lambda p, q: (p + q)).collect()
print(r) #[('04', 13), ('02', 360), ('01', 242), ('03', 220), ('05', 67)]

You can accomplish this with a Pandas dataframe. First, you isolate the month, and then use groupby.sum().
import pandas as pd
sum_weekly=[('2020/01/05', 59), ('2020/01/19', 88), ('2020/01/26', 95), ('2020/02/02', 89), ('2020/02/09', 113), ('2020/02/16', 90), ('2020/02/23', 68), ('2020/03/01', 74), ('2020/03/08', 85), ('2020/04/19', 6), ('2020/04/26', 5), ('2020/05/03', 14), ('2020/05/10', 5), ('2020/05/17', 20), ('2020/05/24', 28),('2020/03/15', 56), ('2020/03/29', 5), ('2020/04/12', 2)]
df= pd.DataFrame(sum_weekly)
df.columns=['Date','Sum']
df['Month'] = df['Date'].str.split('/').str[1]
print(df.groupby('Month').sum())

Related

Sorting list of tuples (Python)

I am new in python and for practice reasons, I am trying to solve the following task:
Given a list of tuples
A = [(17, 8), (17, 12), (7, 2), (9, 15), (9, 17), (1, 4), (3, 9), (12, 14)]
My goal is to sort the list in O(n log n) time in descending order by the first element of each tuple and, if the first elements of two tuples are the same, in descending order by the second element. --> x <= x', y <= y'
So, I want to get the result:
A = [(17, 12), (17, 8), (12, 14), (9, 17), (9, 15), (7, 2), (3, 9), (1, 4)]
I have tried using the following code:
A.sort(reverse = True, key=lambda x: x[0] )
But it just sorts according to the first element, and I do not know whether it runs in n log n time.
Could you please help me with that?
Thank you!
To sort based on both values, remove the key function:
>>> A.sort(reverse = True)
>>> A
[(17, 12), (17, 8), (12, 14), (9, 17), (9, 15), (7, 2), (3, 9), (1, 4)]
And yes, Python sorts in O(n log n), for more, check out Timsort on Wikipedia, the algorithm Python uses for sorting.
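This works because tuples compare element by element. A short sketch using the list from the question; the mixed-direction variant is a hypothetical extension that was not asked for, and negating the key only works because the elements are numbers:

```python
A = [(17, 8), (17, 12), (7, 2), (9, 15), (9, 17), (1, 4), (3, 9), (12, 14)]

# Tuples compare on the first element, then on the second, so
# reverse=True gives descending order on both.
both_desc = sorted(A, reverse=True)

# Hypothetical variation: first element descending, second ascending.
mixed = sorted(A, key=lambda t: (-t[0], t[1]))

print(both_desc)
print(mixed)
```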
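Python's built-in sorted() function will also work. It takes a key function (rather than a comparator, as in C++); a tuple key sorts by the first element and then by the second:
sorted(A, key=lambda x: (x[0], x[1]), reverse=True)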
Tuples are naturally sorted by the first element, then the second:
A.sort(reverse = True)
Output as requested.
There you go:
A = [(17, 8), (17, 12), (7, 2), (9, 15), (9, 17), (1, 4), (3, 9), (12, 14)]
reverse_sorted = []
for item in reversed(sorted(A)):
    reverse_sorted.append(item)
print(reverse_sorted)
To add to the others' excellent answers: searching Google for "python list sort time complexity" will give you what you are looking for regarding the time complexity of Python's sort (which others have already described above).
In addition, if you need to move away from Python's built-in sort function, there are many sorting algorithms with the desired time complexity that you can implement yourself. That would be a good training exercise if you're new to programming in general.

Getting MultiSelectField selected items in views.py with query

I am writing a method in views.py that should return a list of Artist objects by country and genre.
I have this code in models.py:
GENRES = ((1, 'Alternative'),
(2, 'Blues'),
(3, 'Classical'),
(4, 'Country'),
(5, 'Disco'),
(6, 'Drum and Bass'),
(7, 'Dubstep'),
(8, 'EDM'),
(9, 'Electronic'),
(10, 'Experimental'),
(11, 'Folk'),
(12, 'Funk'),
(13, 'Garage'),
(14, 'Grime'),
(15, 'Hardstyle'),
(16, 'Heavy Metal'),
(17, 'Hip Hop'),
(18, 'House'),
(19, 'Indie'),
(20, 'Jazz'),
(21, 'Multi-Genre'),
(22, 'Pop'),
(23, 'Punk'),
(24, 'R&B'),
(25, 'Reggae'),
(26, 'Rock'),
(27, 'Ska'),
(28, 'Soul'),
(29, 'Techno'),
(30, 'Trance'),
(31, 'Urban'),
(32, 'World'))
genres = MultiSelectField(choices=GENRES, null=True)
country = CountryField(default="Poland")
Currently I am looping through all objects and searching for the ones I need in Python:
countries_selection = request.POST.get('countriesSelection')
genres_selection = request.POST.get('genresSelection')
results = []
artists = Artist.objects.all()
for artist in artists:
    if artist.country.name == countries_selection:
        if artist.genres:
            for genre in artist.genres.__str__().split(","):
                if genres_selection in genre:
                    results.append(artist)
        else:
            results.append(artist)
Obviously this is not a good approach. I want to get the same results with a query. I tried:
Artist.objects.filter(genres__contains = 'Rock')
But it does not return anything because only the keys are saved in the database, not the values. For example, this query works, but I will not have the key in my views.py function:
Artist.objects.filter(genres__contains = '26')
That's not a bad approach; it's essentially the best way to do what you are trying to do if you insist on using a multi-select field. Otherwise, I would suggest using a ManyToManyField in this type of situation, which allows for much easier referencing. (FYI: MultiSelectField just saves the selected choices, e.g. ('25', 'genre'), as a comma-separated string. That's why your method is best in this case.)
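Since the question notes that filtering on the stored key works but the view only has the label, one option is to look the key up from GENRES first. A minimal sketch in plain Python, assuming a shortened copy of the GENRES tuple; GENRE_KEYS and genre_key are illustrative names, and the query in the final comment is the one from the question, not something verified here:

```python
# Shortened copy of the GENRES choices from the question.
GENRES = ((24, 'R&B'), (25, 'Reggae'), (26, 'Rock'), (27, 'Ska'))

# Reverse mapping from display label to stored key.
GENRE_KEYS = {label: key for key, label in GENRES}


def genre_key(label):
    """Return the stored key for a genre label as a string, or None."""
    key = GENRE_KEYS.get(label)
    return None if key is None else str(key)


print(genre_key('Rock'))  # 26
# The key could then be used in the query from the question, e.g.
# Artist.objects.filter(genres__contains=genre_key('Rock'))
```

Because the field is stored as a comma-separated string, a substring match on a short key such as '2' would presumably also match '12' or '26', so the matches may still need a second check in Python.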

How to filter dictionary by value? [duplicate]

I have a dictionary in the format "site_name": (site_id, frequency):
d=[{'fpdownload2.macromedia.com': (1, 88),
'laposte.net': (2, 23),
'www.laposte.net': (3, 119),
'www.google.com': (4, 5441),
'match.rtbidder.net': (5, 84),
'x2.vindicosuite.com': (6, 37),
'rp.gwallet.com': (7, 88)}]
Is there a smart way to filter dictionary d by value so that I keep only those entries where the frequency is less than 100? For example:
d=[{'fpdownload2.macromedia.com': (1, 88),
'laposte.net': (2, 23),
'match.rtbidder.net': (5, 84),
'x2.vindicosuite.com': (6, 37),
'rp.gwallet.com': (7, 88)}]
I don't want to use loops, just looking for smart and efficient solution...
You can use a dictionary comprehension with unpacking for a more Pythonic result:
d=[{'fpdownload2.macromedia.com': (1, 88),
'laposte.net': (2, 23),
'www.laposte.net': (3, 119),
'www.google.com': (4, 5441),
'match.rtbidder.net': (5, 84),
'x2.vindicosuite.com': (6, 37),
'rp.gwallet.com': (7, 88)}]
new_data = [{a:(b, c) for a, (b, c) in d[0].items() if c < 100}]
Output:
[{'laposte.net': (2, 23), 'fpdownload2.macromedia.com': (1, 88), 'match.rtbidder.net': (5, 84), 'x2.vindicosuite.com': (6, 37), 'rp.gwallet.com': (7, 88)}]
You can use a dictionary comprehension to do the filtering:
d = {
'fpdownload2.macromedia.com': (1, 88),
'laposte.net': (2, 23),
'www.laposte.net': (3, 119),
'www.google.com': (4, 5441),
'match.rtbidder.net': (5, 84),
'x2.vindicosuite.com': (6, 37),
'rp.gwallet.com': (7, 88),
}
d_filtered = {
k: v
for k, v in d.items()
if v[1] < 100
}
What you want is a dictionary comprehension. I'll show it with a different example:
d = {'spam': 120, 'eggs': 20, 'ham': 37, 'cheese': 101}
d = {key: value for key, value in d.items() if value >= 100}
If you don't already understand list comprehensions, this probably looks like magic that you won't be able to maintain and debug, so I'll show you how to break it out into an explicit loop statement that you should be able to understand easily:
new_d = {}
for key, value in d.items():
    if value >= 100:
        new_d[key] = value
If you can't figure out how to turn that back into the comprehension, just use the statement version until you learn a bit more; it's a bit more verbose, but better to have code you can think through in your head.
Your problem is slightly more complicated, because the values aren't just numbers but tuples of two numbers (so you want to filter on value[1], not value), and because you have a list of one dict rather than just a dict (so you may need to do this for each dict in the list). And of course my filter test isn't the same as yours. But hopefully you can figure it out from here.
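For reference, one way those pieces could fit together for the structure in the question (a list of dicts whose values are (site_id, frequency) tuples). This is a sketch with shortened data; filter_by_frequency and its limit parameter are illustrative names:

```python
d = [{'laposte.net': (2, 23),
      'www.laposte.net': (3, 119),
      'www.google.com': (4, 5441),
      'rp.gwallet.com': (7, 88)}]


def filter_by_frequency(dicts, limit):
    """Keep only the entries whose frequency (value[1]) is below limit."""
    return [
        {site: value for site, value in one_dict.items() if value[1] < limit}
        for one_dict in dicts
    ]


filtered = filter_by_frequency(d, 100)
print(filtered)
# [{'laposte.net': (2, 23), 'rp.gwallet.com': (7, 88)}]
```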

Pyspark: merging values in a nested list

I have a pair-RDD with the structure:
[(key, [(timestring, value)])]
Example:
[("key1", [("20161101", 23), ("20161101", 41), ("20161102", 66),...]),
("key2", [("20161101", 86), ("20161101", 9), ("20161102", 11),...])
...]
I want to process the list for each key, grouping by timestring and calculating the mean of all values for identical timestrings. So the above example would become:
[("key1", [("20161101", 32), ..]),
("key2", [("20161101", 47.5),...])
...]
I am struggling to find a solution that uses only PySpark methods in one step. Is that possible at all, or do I need some intermediate steps?
You can define a function:
from itertools import groupby
import numpy as np
def mapper(xs):
    return [(k, np.mean([v[1] for v in vs])) for k, vs in groupby(sorted(xs), lambda x: x[0])]
And apply it with mapValues:
rdd = sc.parallelize([
("key1", [("20161101", 23), ("20161101", 41), ("20161102", 66)]),
("key2", [("20161101", 86), ("20161101", 9), ("20161102", 11)])
])
rdd.mapValues(mapper)
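Since mapValues is a lazy transformation, an action such as collect() is needed to actually see the result. The per-key logic can also be checked locally without Spark. The following is a NumPy-free sketch of the same idea; mean_by_timestring is an illustrative name, and it could presumably be passed to mapValues in place of mapper:

```python
from itertools import groupby


def mean_by_timestring(pairs):
    """Group (timestring, value) pairs by timestring and average the values."""
    result = []
    # groupby only merges adjacent items, so sort by timestring first.
    for stamp, group in groupby(sorted(pairs), key=lambda pair: pair[0]):
        values = [value for _, value in group]
        result.append((stamp, sum(values) / len(values)))
    return result


key1 = mean_by_timestring([("20161101", 23), ("20161101", 41), ("20161102", 66)])
key2 = mean_by_timestring([("20161101", 86), ("20161101", 9), ("20161102", 11)])
print(key1)  # [('20161101', 32.0), ('20161102', 66.0)]
print(key2)  # [('20161101', 47.5), ('20161102', 11.0)]
```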

How can I find the average of each similar entry in a list of tuples?

I have this list of tuples
[('Jem', 10), ('Sam', 10), ('Sam', 2), ('Jem', 9), ('Jem', 10)]
How do I find the average of the numbers coupled with each name, i.e. the average of all the numbers stored in a tuple with Jem, and then output them? In this example, the output would be:
Jem 9.66666666667
Sam 6
There are a couple of ways to do this. One is easy, one is pretty.
Easy:
Use a dictionary! It's easy to build a for loop that goes through your tuples and appends the second element to a dictionary, keyed on the first element.
d = {}
tuples = [('Jem', 10), ('Sam', 10), ('Sam', 2), ('Jem', 9), ('Jem', 10)]
for key, val in tuples:  # unpack directly instead of shadowing the built-in tuple
    d.setdefault(key, []).append(val)
Once it's in a dictionary, you can do:
for name, values in d.items():
    print("{name} {avg}".format(name=name, avg=sum(values)/len(values)))
Pretty:
Use itertools.groupby. This only works if your data is sorted by the key you want to group by (in this case, t[0] for each t in tuples) so it's not ideal in this case, but it's a nice way to highlight the function.
from itertools import groupby
tuples = [('Jem', 10), ('Sam', 10), ('Sam', 2), ('Jem', 9), ('Jem', 10)]
tuples.sort(key=lambda tup: tup[0])
# tuples is now [('Jem', 10), ('Jem', 9), ('Jem', 10), ('Sam', 10), ('Sam', 2)]
groups = groupby(tuples, lambda tup: tup[0])
This builds a structure that looks kind of like:
[('Jem', [('Jem', 10), ('Jem', 9), ('Jem', 10)]),
('Sam', [('Sam', 10), ('Sam', 2)])]
We can use that to build our names and averages:
for groupname, grouptuples in groups:
    values = [t[1] for t in grouptuples]
    print("{name} {avg}".format(name=groupname, avg=sum(values)/len(values)))
Seems like a straight-forward case for collections.defaultdict
from collections import defaultdict
l = [('Jem', 10), ('Sam', 10), ('Sam', 2), ('Jem', 9), ('Jem', 10)]
d = defaultdict(list)
for key, value in l:
d[key].append(value)
Then calculating the mean
from numpy import mean
for key in d:
    print(key, mean(d[key]))
Output
Jem 9.66666666667
Sam 6.0
You can also use list comprehensions:
l = [('Jem', 10), ('Sam', 10), ('Sam', 2), ('Jem', 9), ('Jem', 10)]
def avg(l):
    return sum(l)/len(l)
# compare names with ==, not is: identity of equal strings is not guaranteed
result = [(n, avg([v[1] for v in l if v[0] == n])) for n in set([n[0] for n in l])]
# result is [('Jem', 9.666666666666666), ('Sam', 6.0)]
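The list comprehension above rescans the whole list once per name. For larger inputs, a single pass that keeps a running sum and count per name avoids that. A minimal sketch, assuming Python 3 (so / is true division); averages and scores are illustrative names:

```python
def averages(pairs):
    """Return {name: mean of its values}, computed in one pass over pairs."""
    sums, counts = {}, {}
    for name, value in pairs:
        sums[name] = sums.get(name, 0) + value
        counts[name] = counts.get(name, 0) + 1
    return {name: sums[name] / counts[name] for name in sums}


scores = [('Jem', 10), ('Sam', 10), ('Sam', 2), ('Jem', 9), ('Jem', 10)]
result = averages(scores)
for name, avg in result.items():
    print(name, avg)
```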
