I have two variable-length lists extracted from an Excel file. One has the wagon numbers and the other the wagon weights, something like this:
wagon_list = [1234567, 2345678, 3456789, 4567890]
weight_list = [1.1, 2.2, 3.3, 4.4]
Sometimes wagon_list will contain a duplicate number. In that case I need to sum the wagon weights and remove the duplicate from both lists:
wagon_list = [1234567, 2345678, 2345678, 4567890]
weight_list = [1.1, 2.2, 3.3, 4.4]
should become:
wagon_list = [1234567, 2345678, 4567890]
weight_list = [1.1, 5.5, 4.4]
My first option was to pop items and sum them while iterating with a for loop. It didn't work because (after some research) you can't change a list you're iterating over.
So I moved to the second option, using an auxiliary list. It doesn't work when it hits the last index. Even after some tweaking of my code, I can't find a solution.
I can see it would have further problems if the last three elements were to be added.
counter_3 = 0
for i in wagon_list:
    if i == wagon_list[-1]: #last entry, simply appends to the new list. This comes first because the next option returns error if running the last entry as i
        new_wagon_list.append(wagon_list[counter_3])
        new_weight_list.append(weight_list[counter_3])
        counter_3 +=2
    elif i != wagon_list[(counter_3 + 1)]: #if they are different, appends.
        new_wagon_list.append(wagon_list[counter_3])
        new_weight_list.append(weight_list[counter_3])
        counter_3 += 1
    elif i == wagon_list[(counter_3 + 1)]: #if equal to next item, appends the wagon and sums the weights
        new_wagon_list.append(wagon_list[counter_3])
        new_weight_list.append(weight_list[counter_3] + weight_list[counter_3 + 1])
This should return:
wagon_list = [1234567, 2345678, 4567890]
weight_list = [1.1, 5.5, 4.4]
But returns
wagon_list = [1234567, 2345678, 3456789, 3456789, 3456789]
weight_list = [1.1, 2.2, 7.7, 7.7, 3.3]
Here is a simple way, using defaultdict (hence the result is correct even if wagon_list is unordered). You could also use groupby but then you have to sort both lists so that duplicate wagons are consecutive.
This solution requires a single pass through the lists, and doesn't change the order of the lists. It just removes duplicate wagons and adds their weight.
from collections import defaultdict
def group_weights(wagon_list, weight_list):
    ww = defaultdict(float)
    for wagon, weight in zip(wagon_list, weight_list):
        ww[wagon] += weight
    return list(ww), list(ww.values())
Example
# set up MRE
wagon_list = [1234567, 2345678, 2345678, 4567890]
weight_list = [1.1, 2.2, 3.3, 4.4]
new_wagon_list, new_weight_list = group_weights(wagon_list, weight_list)
>>> new_wagon_list
[1234567, 2345678, 4567890]
>>> new_weight_list
[1.1, 5.5, 4.4]
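To illustrate the claim that the result stays correct when duplicates are not adjacent, here is a minimal sketch. The input values are made up for the illustration, and the function is repeated only so the snippet runs on its own.

```python
from collections import defaultdict

def group_weights(wagon_list, weight_list):
    """Sum weights per wagon, keeping the first-seen wagon order."""
    ww = defaultdict(float)
    for wagon, weight in zip(wagon_list, weight_list):
        ww[wagon] += weight
    return list(ww), list(ww.values())

# Hypothetical input: the duplicate wagon 2345678 is not adjacent.
wagons = [2345678, 1234567, 4567890, 2345678]
weights = [2.0, 1.0, 4.0, 3.0]

new_wagons, new_weights = group_weights(wagons, weights)
print(new_wagons)   # [2345678, 1234567, 4567890]
print(new_weights)  # [5.0, 1.0, 4.0]
```

The output order follows the first occurrence of each wagon, because dictionaries preserve insertion order in Python 3.7 and later.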
Addendum
If you'd like to avoid defaultdict altogether, you can also simply do this (same result as above):
ww = {}
for k, v in zip(wagon_list, weight_list):
    ww[k] = ww.get(k, 0) + v
new_wagon_list, new_weight_list = map(list, zip(*ww.items()))
Explanation
A quick review of some of the tools and syntax used above:
zip(*iterables) "Make an iterator that aggregates elements from each of the iterables." So e.g.:
for x, y in zip(wagon_list, weight_list):
    print(f'x={x}, y={y}')
# prints out
x=1234567, y=1.1
x=2345678, y=2.2
x=2345678, y=3.3
x=4567890, y=4.4
dict.get(key[, default]) "Return the value for key if key is in the dictionary, else default." In other words, with ww[k] = ww.get(k, 0) + v, we are saying: add v to ww[k], but if it doesn't exist yet, then use 0 as a starting point.
The last bit (new_wagon_list, new_weight_list = map(list, zip(*ww.items()))) uses the idiom that "zip() in conjunction with the * operator can be used to unzip a list" (or, in this case, an iterator of (key, value) tuples obtained from dict.items()). Without the map(list, ...), the two variables would hold tuples. Assuming you want to stick with lists, we apply list() to each tuple before assigning to new_wagon_list and new_weight_list, respectively.
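The unzip idiom can be tried in isolation. The sketch below uses a hand-written list of pairs standing in for ww.items(), and also shows one caveat that may matter for variable-length input: with no pairs at all, there is nothing to unpack into two names.

```python
# Stand-in for ww.items(): (wagon, total weight) pairs.
pairs = [(1234567, 1.1), (2345678, 5.5), (4567890, 4.4)]

# zip(*pairs) regroups the tuples column by column.
wagons_t, weights_t = zip(*pairs)
print(wagons_t)   # (1234567, 2345678, 4567890)
print(weights_t)  # (1.1, 5.5, 4.4)

# map(list, ...) converts each resulting tuple to a list.
wagons_l, weights_l = map(list, zip(*pairs))
print(wagons_l)   # [1234567, 2345678, 4567890]

# Caveat: with empty input, zip(*[]) yields nothing, so unpacking fails.
try:
    a, b = zip(*[])
except ValueError as exc:
    print("empty input:", exc)
```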
Modifying a list that you're iterating over doesn't work out well. I'd zip the two lists together and use itertools.groupby:
>>> from itertools import groupby
>>> wagon_list = [1234567, 2345678, 2345678, 4567890]
>>> weight_list = [1.1, 2.2, 3.3, 4.4]
>>> wagon_list, weight_list = map(list, zip(*(
... (wagon, sum(weight for _, weight in group))
... for wagon, group in groupby(sorted(
... zip(wagon_list, weight_list)
... ), key=lambda t: t[0])
... )))
>>> wagon_list
[1234567, 2345678, 4567890]
>>> weight_list
[1.1, 5.5, 4.4]
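The sorted() call matters here because groupby only merges neighbouring items with equal keys. A small sketch with made-up, unsorted input shows the difference; totals is a hypothetical helper name, not part of itertools.

```python
from itertools import groupby

# Hypothetical unsorted input: wagon 2345678 appears twice, not adjacent.
pairs = [(2345678, 2.0), (1234567, 1.0), (2345678, 3.0)]

def totals(items):
    """Sum the weights of each run of consecutive equal wagon numbers."""
    return [(wagon, sum(w for _, w in grp))
            for wagon, grp in groupby(items, key=lambda t: t[0])]

print(totals(pairs))          # [(2345678, 2.0), (1234567, 1.0), (2345678, 3.0)]
print(totals(sorted(pairs)))  # [(1234567, 1.0), (2345678, 5.0)]
```

Note that sorting also reorders the result by wagon number, so the original order of the lists is not kept.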
Use a dictionary to combine the values:
In [1]: wagon_list = [1234567, 2345678, 2345678, 4567890]
...: weight_list = [1.1, 2.2, 3.3, 4.4]
Out[1]: [1.1, 2.2, 3.3, 4.4]
In [2]: together = {}
Out[2]: {}
In [3]: for k, v in zip(wagon_list, weight_list):
...:     together[k] = together.setdefault(k, 0) + v
...:
In [4]: together
Out[4]: {1234567: 1.1, 2345678: 5.5, 4567890: 4.4}
In [6]: new_wagon_list = list(together.keys())
Out[6]: [1234567, 2345678, 4567890]
In [7]: new_weight_list = list(together.values())
Out[7]: [1.1, 5.5, 4.4]
Here is a version with no fluff, frills, dependencies, or mystery. Either an index for the current wagon is found, which lets us pinpoint the weight index to modify, or no index is found and we append both of the new values.
Your entire problem revolves around "Does this already exist?". With a list, we can answer that question with index. index raises a ValueError if the value is not found, so we wrap the call in try and treat except as the else branch.
def wagon_filter(wagons:list, weights:list) -> tuple:
    #pre-zip and clear so we can reuse the references
    data = zip(wagons, weights)
    wagons, weights = [], []
    #reassign
    for W, w in data:
        try: #(W)agon exists? modify its (w)eight index
            i = wagons.index(W)
            weights[i] += w
        except ValueError: #else append new (W)agon and (w)eight
            wagons.append(W)
            weights.append(w)
    return wagons, weights
usage:
#data
wagons = [1234567, 2345678, 2345678, 4567890]
weights = [1.1, 2.2, 3.3, 4.4]
#print filter results
print(*wagon_filter(wagons, weights), sep='\n')
#[1234567, 2345678, 4567890]
#[1.1, 5.5, 4.4]
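One thing to keep in mind with the index approach is that list.index scans the list on every call, so the work grows roughly quadratically with the number of distinct wagons. If that ever becomes a concern, a possible variant (a sketch only; wagon_filter_indexed and position are names invented here) remembers each wagon's output position in a dict:

```python
def wagon_filter_indexed(wagons, weights):
    """Hypothetical variant: same output, dict lookup instead of list.index."""
    position = {}                     # wagon number -> index in the output lists
    out_wagons, out_weights = [], []
    for wagon, weight in zip(wagons, weights):
        if wagon in position:         # wagon seen before: add to its weight
            out_weights[position[wagon]] += weight
        else:                         # new wagon: record where it lands
            position[wagon] = len(out_wagons)
            out_wagons.append(wagon)
            out_weights.append(weight)
    return out_wagons, out_weights

print(*wagon_filter_indexed([1234567, 2345678, 2345678, 4567890],
                            [1.0, 2.0, 3.0, 4.0]), sep='\n')
# [1234567, 2345678, 4567890]
# [1.0, 5.0, 4.0]
```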
So, I'm sure similar questions have been asked before but I couldn't find quite what I need.
I have a program that outputs a 2D array like the one below:
arr = [[0.2, 3], [0.3, "End"], ...]
There may be more or less elements, but each is a 2-element array, where the first value is a float and the second can be a float or a string.
Both of those values may repeat. In each of those arrays, the second element takes on only a few possible values.
What I want to do is sum the first elements' value within the arrays that have the same value of the second element and output a similar array that does not have those duplicated values.
For example:
input = [[0.4, 1.5], [0.1, 1.5], [0.8, "End"], [0.05, "End"], [0.2, 3.5], [0.2, 3.5]]
output = [[0.5, 1.5], [0.4, 3.5], [0.85, "End"]]
I'd appreciate if the output array was sorted by this second element (floats ascending, strings at the end), although it's not necessary.
EDIT: Thanks for both answers; I've decided to use the one by Chris, because the code was more comprehensible to me, although groupby seems like a function designed to solve this very problem, so I'll try to read up on that, too.
UPDATE: The values of floats were always positive, by nature of the task at hand, so I used negative values to stop the usage of any strings - now I have a few if statements that check for those "encoded" negative values and replace them with strings again just before they're printed out, so sorting is now easier.
You could use a dictionary to accumulate the sum of the first value in the list keyed by the second item.
To get the string items at the end of the list, the sort key can map them to positive infinity, float('inf').
input_ = [[0.4, 1.5], [0.1, 1.5], [0.8, "End"], [0.05, "End"], [0.2, 3.5], [0.2, 3.5]]
d = dict()
for pair in input_:
    d[pair[1]] = d.get(pair[1], 0) + pair[0]
L = []
for k, v in d.items():
    L.append([v,k])
L.sort(key=lambda x: x[1] if isinstance(x[1], float) else float('inf'))
print(L)
This prints:
[[0.5, 1.5], [0.4, 3.5], [0.8500000000000001, 'End']]
You can try to play with itertools.groupby:
import itertools
a = [[0.4, 1.5], [0.1, 1.5], [0.8, "End"], [0.05, "End"], [0.2, 3.5], [0.2, 3.5]]
out = [[sum(elt[0] for elt in val), key] for key, val in itertools.groupby(a, key=lambda elt: elt[1])]
# [[0.5, 1.5], [0.8500000000000001, 'End'], [0.4, 3.5]]
Explanation:
Group the 2D list by the second element of each sublist using itertools.groupby and its key parameter. The lambda key=lambda elt: elt[1] selects that second element. Note that groupby only merges consecutive items with equal keys, which is the case for this input:
for key, val in itertools.groupby(a, key=lambda elt: elt[1]):
    print(key, val)
# 1.5 <itertools._grouper object at 0x0000026AD1F6E160>
# End <itertools._grouper object at 0x0000026AD2104EF0>
# 3.5 <itertools._grouper object at 0x0000026AD1F6E160>
For each group, compute the sum using the built-in function sum:
for key, val in itertools.groupby(a, key=lambda elt: elt[1]):
    print(sum([elt[0]for elt in val]))
# 0.5
# 0.8500000000000001
# 0.4
Compute the desired output:
out = []
for key, val in itertools.groupby(a, key=lambda elt: elt[1]):
    out.append([sum([elt[0]for elt in val]), key])
print(out)
# [[0.5, 1.5], [0.8500000000000001, 'End'], [0.4, 3.5]]
You also asked about sorting by the second value, but it contains both strings and numbers. Python 3 cannot order a number against a string (the comparison raises a TypeError), so the values must first be made comparable, for example through a sort key.
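As a rough illustration of the comparability problem and one possible way around it, the sketch below sorts with a tuple key. The rows are made-up sample values, and the tuple key is just one option among several, not something taken from the answers above.

```python
rows = [[0.85, "End"], [0.4, 3.5], [0.5, 1.5]]

# Sorting directly on the second element mixes float and str comparisons.
try:
    sorted(rows, key=lambda r: r[1])
except TypeError as exc:
    print("cannot sort directly:", exc)

# Tuple key: False sorts before True, so numbers come first. The second
# tuple item is only compared between values of the same kind.
ordered = sorted(rows, key=lambda r: (isinstance(r[1], str), r[1]))
print(ordered)  # [[0.5, 1.5], [0.4, 3.5], [0.85, 'End']]
```

Unlike mapping every string to infinity, this key also orders several different strings among themselves.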
I have the following set of values stored in a list.
[-1.7683218, 0.22206295, -0.28429198, 5.925369, -3.952484, -3.0728238, 0.09690776, -0.31914753, 3.9695702, 26.934353, 1.4882066, 1.8194668, -0.5614318, 1.2354431, -0.09714768, -0.15579335, -0.059994906, 1.0105655, -23.25607, 31.982368, -0.09390785, 0.17786688, 0.36164832, -4.673975, 13.495866, -3.57134, 0.5583399, -1.801314, 2.4207468, 2.0513844, -3.429592, -9.599998, 23.412394, -3.963623, 6.930485, 2.5186272, 0.6805691, -1.1615586, -0.915736, -2.6307302, -14.409785, 0.6327307, 10.512744, -0.09292421, -0.61977243, 0.35928893, -1.3844814, 8.098062, -0.8270248, 0.47219157, 0.089366496, 0.9056338, 1.5297629, 3.3246832, -0.9748858, 36.62332, -1.0525678, -0.87139374, 6.7600174, 36.210625, -0.25728267, 14.568578, 0.87466383, -4.2237897, -5.4309, 19.762472, 0.8426512, -0.7807278, 0.03435099, 12.787761, -4.9308186, -1.4322343, 0.49790275, -12.979129, 0.18121482, -0.81953144, -1.5393608, 17.757078, 3.5726204, -11.319154, -0.002896044, -1.8806648, 0.30027565, -2.6210017, 16.230186, -2.2566936, 37.37506, -2.7738526, -0.91440165, -3.652771, 1.8378688, -0.25519317, 0.5222581, 0.2189773, 23.825306, 0.3779062, 2.6709516, 0.84001434, -0.41394734, -0.600579, -3.1629875, 0.2880843, -3.9132822, 5.674796, -0.5569526, 0.30253112, -4.4269695, 4.5206604, -0.8477638, 0.0032483074, -2.2814171, 0.5524869, -1.4271426, -0.24263692, 1.0095457, -3.187037, -1.6656531, 1.4805393, 0.064992905, -4.8124804, -0.07194552, -0.28692132, -0.19502515, 0.010771384, -32.744797, 1.2642047, 6.3942785, -1.2971659, 29.70087, 0.19707158, -2.734262, 2.8497686, -1.710305, -1.3836008, 22.758884, -1.8488939, 4.1740856, 0.26019523, -8.814447, -3.937495, 0.22731477, -0.7874651, 17.22002, -7.89242, -0.5795766, 3.3960745, 1.0440702, 0.5483718, 1.2849183, -0.63732344, -40.38428, -4.25527, 3.034935, 0.25527972, -0.81940174, -7.0720696, 1.7420169, 14.904871, -1.5399592, 0.20110837, 0.1902977, 2.5790472, -28.560707, 0.09560776, -0.973604, 0.6214314, -5.1268454, -0.9104073, 33.082394, 0.23800176, -9.696023, 12.288443, 
-16.52249, -7.6811, -21.928356, 25.690449, -0.6803232, -1.4738222, -1.831514, 0.00013296002, -3.1330614, 3.6067219, -3.0617614, -6.334016, -24.856865, -6.0669985, 2.8829474, 0.76423097, -0.21836776, -2.3173273, -2.092735, -0.19577695, 4.2984896, 0.029742926, 1.0902604, -0.28707412, -0.1671038, -0.4607489, -15.966867, -1.7149612, -1.3445716, 1.400264, 4.906401, -6.314724, -0.92188597, -0.14341217, -6.819194, 1.2750683, 21.634096, 0.5503013, 5.2122655, -0.096101895, -0.69029164, 2.6239898, -26.33101, -3.7901835, 10.026649, 1.0661886, 0.8891293, 34.24628, -0.9036363, -4.4846773, -30.846636, -5.8609247, -0.018534392, 4.657759e-06, 16.96108, 10.725708, -0.3170653, -3.2331817, 0.73887914, 0.69840825, 0.9043666, 1.0727708, 1.6571997, -0.70257163, 2.4863558, 0.07501343, -35.059708, 0.72496796, -3.0723267, -3.2004805, -0.9447444, 0.56954986, 2.6018164, -0.49256825, 22.71359, 0.45523545, -2.1936522, 4.008838, 0.62327665, 10.315046, 1.4006382, 1.1290226, 1.2660133, -8.46607]
I want to create 100 more lists that are similar to this one but contain randomly chosen values between the lowest and highest values of the original list. Let's consider a smaller example to better understand the problem: a list with lowest value -1 and highest value 7.2.
original list : [0.5, 0.8, 1.1, 2.5, 7.2, -1]
random list 1 : [0.5, 0.2, 1.4, 4.5, 6.2, -0.5]
random list 2 : [5.3, 0.3, 0.7, 2.3, 4.2, -0.1]
....
random list 100 : [0.5, 0.9, 1.1, 2.1, 6.5, -1]
The key is that not all values have to change (although in some cases they all can, as in list 2 for example). Is there a straightforward way to accomplish this in Python?
The code below prints the output you need. First find the max and min of the original list, then use random.uniform() from the random library to draw each value.
import random
original_list = [0.5, 0.8, 1.1, 2.5, 7.2, -1]
max_number = max(original_list)
min_number = min(original_list)
# because you need 100 more lists
for i in range(100):
    random_list = []
    for j in range(len(original_list)):
        random_list.append(round(random.uniform(min_number,max_number),1))
    print('random list '+str(i+1)+' ', end='')
    print(random_list)
smallest = min(original_list)
largest = max(original_list)
newlist1 = [random.uniform(smallest, largest) for _ in range(len(original_list))]
newlist2 = [random.uniform(smallest, largest) for _ in range(len(original_list))]
# and so on
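The question also mentions that not every value has to change. One way to read that, sketched below under assumptions, is to redraw each value only with some probability and otherwise keep the original. The helper name perturbed_copy and the change_prob parameter are invented for this illustration.

```python
import random

def perturbed_copy(original, change_prob=0.5):
    """Hypothetical helper: redraw each value with probability change_prob."""
    lo, hi = min(original), max(original)
    return [random.uniform(lo, hi) if random.random() < change_prob else x
            for x in original]

original_list = [0.5, 0.8, 1.1, 2.5, 7.2, -1]

# Build all 100 lists in one comprehension instead of 100 separate names.
random_lists = [perturbed_copy(original_list) for _ in range(100)]
print(len(random_lists), len(random_lists[0]))  # 100 6
```

With change_prob=1 every value is redrawn, which matches the fully random lists of the other answers.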
Using list comprehension and numpy.random.uniform:
import numpy as np
orig = [-1.7683218, 0.22206295, -0.28429198, 5.925369, -3.952484, -3.0728238, 0.09690776, -0.31914753, 3.9695702, 26.934353, 1.4882066, 1.8194668, -0.5614318, 1.2354431, -0.09714768, -0.15579335, -0.059994906, 1.0105655, -23.25607, 31.982368, -0.09390785, 0.17786688, 0.36164832, -4.673975, 13.495866, -3.57134, 0.5583399, -1.801314, 2.4207468, 2.0513844, -3.429592, -9.599998, 23.412394, -3.963623, 6.930485, 2.5186272, 0.6805691, -1.1615586, -0.915736, -2.6307302, -14.409785, 0.6327307, 10.512744, -0.09292421, -0.61977243, 0.35928893, -1.3844814, 8.098062, -0.8270248, 0.47219157, 0.089366496, 0.9056338, 1.5297629, 3.3246832, -0.9748858, 36.62332, -1.0525678, -0.87139374, 6.7600174, 36.210625, -0.25728267, 14.568578, 0.87466383, -4.2237897, -5.4309, 19.762472, 0.8426512, -0.7807278, 0.03435099, 12.787761, -4.9308186, -1.4322343, 0.49790275, -12.979129, 0.18121482, -0.81953144, -1.5393608, 17.757078, 3.5726204, -11.319154, -0.002896044, -1.8806648, 0.30027565, -2.6210017, 16.230186, -2.2566936, 37.37506, -2.7738526, -0.91440165, -3.652771, 1.8378688, -0.25519317, 0.5222581, 0.2189773, 23.825306, 0.3779062, 2.6709516, 0.84001434, -0.41394734, -0.600579, -3.1629875, 0.2880843, -3.9132822, 5.674796, -0.5569526, 0.30253112, -4.4269695, 4.5206604, -0.8477638, 0.0032483074, -2.2814171, 0.5524869, -1.4271426, -0.24263692, 1.0095457, -3.187037, -1.6656531, 1.4805393, 0.064992905, -4.8124804, -0.07194552, -0.28692132, -0.19502515, 0.010771384, -32.744797, 1.2642047, 6.3942785, -1.2971659, 29.70087, 0.19707158, -2.734262, 2.8497686, -1.710305, -1.3836008, 22.758884, -1.8488939, 4.1740856, 0.26019523, -8.814447, -3.937495, 0.22731477, -0.7874651, 17.22002, -7.89242, -0.5795766, 3.3960745, 1.0440702, 0.5483718, 1.2849183, -0.63732344, -40.38428, -4.25527, 3.034935, 0.25527972, -0.81940174, -7.0720696, 1.7420169, 14.904871, -1.5399592, 0.20110837, 0.1902977, 2.5790472, -28.560707, 0.09560776, -0.973604, 0.6214314, -5.1268454, -0.9104073, 33.082394, 0.23800176, -9.696023, 12.288443, 
-16.52249, -7.6811, -21.928356, 25.690449, -0.6803232, -1.4738222, -1.831514, 0.00013296002, -3.1330614, 3.6067219, -3.0617614, -6.334016, -24.856865, -6.0669985, 2.8829474, 0.76423097, -0.21836776, -2.3173273, -2.092735, -0.19577695, 4.2984896, 0.029742926, 1.0902604, -0.28707412, -0.1671038, -0.4607489, -15.966867, -1.7149612, -1.3445716, 1.400264, 4.906401, -6.314724, -0.92188597, -0.14341217, -6.819194, 1.2750683, 21.634096, 0.5503013, 5.2122655, -0.096101895, -0.69029164, 2.6239898, -26.33101, -3.7901835, 10.026649, 1.0661886, 0.8891293, 34.24628, -0.9036363, -4.4846773, -30.846636, -5.8609247, -0.018534392, 4.657759e-06, 16.96108, 10.725708, -0.3170653, -3.2331817, 0.73887914, 0.69840825, 0.9043666, 1.0727708, 1.6571997, -0.70257163, 2.4863558, 0.07501343, -35.059708, 0.72496796, -3.0723267, -3.2004805, -0.9447444, 0.56954986, 2.6018164, -0.49256825, 22.71359, 0.45523545, -2.1936522, 4.008838, 0.62327665, 10.315046, 1.4006382, 1.1290226, 1.2660133, -8.46607]
a = min(orig)
b = max(orig)
n = len(orig)
res = [np.random.uniform(a, b, n).tolist() for _ in range(100)]
This gives res, a list of 100 lists (each of size len(orig)) of numbers uniformly distributed over [min(orig), max(orig)).
Let's say we want to find the 2 items which have their value the closest to 10:
A = {'abc': 12.3, 'def': 17.3, 'dsfsf': 18, 'ppp': 3.2, "jlkljkjlk": 9.23}
It works with:
def nearest(D, centre, k=10):
    return sorted([[d, D[d], abs(D[d] - centre)] for d in D], key=lambda e: e[2])[:k]
print(nearest(A, centre=10, k=2))
[['jlkljkjlk', 9.23, 0.7699999999999996], ['abc', 12.3, 2.3000000000000007]]
But is there a Python built-in way to do this and/or a more optimized version when the dict has a much larger size (hundreds of thousands of items)?
If you do not mind using Pandas:
import pandas as pd
closest = (pd.Series(A) - 10).abs().sort_values()[:2]
#jlkljkjlk 0.77
#abc 2.30
closest.to_dict()
#{'jlkljkjlk': 0.7699999999999996, 'abc': 2.3000000000000007}
You could use heapq.nsmallest():
from heapq import nsmallest
A = {'abc': 12.3, 'def': 17.3, 'dsfsf': 18, 'ppp': 3.2, 'jlkljkjlk': 9.23}
def nearest(D, centre, k=10):
    return [[x, D[x], abs(D[x] - centre)] for x in nsmallest(k, D, key=lambda x: abs(D[x] - centre))]
print(nearest(A, centre=10, k=2))
# [['jlkljkjlk', 9.23, 0.7699999999999996], ['abc', 12.3, 2.3000000000000007]]
As for time complexity, this runs in O(n log(k)) time instead of the O(n log(n)) of the solution based on sorting the dictionary.
Given you need to perform a lookup quite often, we can make this an O(log n) algorithm, by first storing the data in a sorted list:
from operator import itemgetter
ks = sorted(A.items(), key=itemgetter(1))
vs = list(map(itemgetter(1), ks))
Then for each lookup we can use bisect.bisect_left to determine the insertion point of the value. We then check the values surrounding that point to find the closest ones, and return the corresponding keys.
from bisect import bisect_left
from operator import itemgetter
def closests(v):
    idx = bisect_left(vs, v)
    # look at up to two items on each side of the insertion point
    i, j = max(0, idx-2), min(idx+2, len(ks))
    part = ks[i:j]
    return sorted([[*pi, abs(pi[-1]-v)] for pi in part], key=itemgetter(-1))[:2]
The above might not look like an improvement, but here we always evaluate at most four elements in the sorted(..), and bisect_left evaluates a logarithmic number of elements.
For example:
>>> closests(1)
[['ppp', 3.2, 2.2], ['jlkljkjlk', 9.23, 8.23]]
>>> closests(3.2)
[['ppp', 3.2, 0.0], ['jlkljkjlk', 9.23, 6.03]]
>>> closests(5)
[['ppp', 3.2, 1.7999999999999998], ['jlkljkjlk', 9.23, 4.23]]
>>> closests(9.22)
[['jlkljkjlk', 9.23, 0.009999999999999787], ['abc', 12.3, 3.08]]
>>> closests(9.24)
[['jlkljkjlk', 9.23, 0.009999999999999787], ['abc', 12.3, 3.0600000000000005]]
The "loading" phase thus takes O(n log n) (with n the number of elements). Then if we generalize the above method to fetch k elements (by increasing the slice), it would take O(log n + k log k) to perform a lookup.
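The generalisation to k elements mentioned above could look roughly like the following sketch. The function name k_closest is made up here, and the data setup is repeated so the snippet runs on its own. It relies on the observation that each of the k nearest values lies within k positions of the insertion point.

```python
from bisect import bisect_left
from operator import itemgetter

A = {'abc': 12.3, 'def': 17.3, 'dsfsf': 18, 'ppp': 3.2, 'jlkljkjlk': 9.23}
ks = sorted(A.items(), key=itemgetter(1))   # (key, value) pairs ordered by value
vs = [value for _, value in ks]             # values only, for bisect

def k_closest(v, k=2):
    """Hypothetical generalisation: the k entries whose values are nearest to v."""
    idx = bisect_left(vs, v)
    # Widen the slice to k items on each side of the insertion point.
    lo, hi = max(0, idx - k), min(len(ks), idx + k)
    window = ks[lo:hi]
    ranked = sorted(([key, val, abs(val - v)] for key, val in window),
                    key=itemgetter(2))
    return ranked[:k]

print([key for key, _, _ in k_closest(10, k=2)])  # ['jlkljkjlk', 'abc']
```

The window holds at most 2k items, so sorting it stays within the O(log n + k log k) lookup bound stated above.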