Construct a superset from pandas groupby operation result - python

name_region
bahia [10, 11, 12, 1, 2, 3, 4]
distrito_federal [9, 10, 11, 12, 1, 2, 3, 4]
goias [9, 10, 11, 12, 1, 2, 3, 4]
maranhao [10, 11, 12, 1, 2, 3, 4]
mato_grosso [9, 10, 11, 12, 1, 2, 3, 4]
mato_grosso_do_sul [8, 9, 10, 11, 12, 1, 2, 3]
I have the pandas Series above, obtained from a groupby operation. The second column represents months of the year. How do I construct a superset of months, i.e. [8, 9, 10, 11, 12, 1, 2, 3, 4], since that represents all months present in the dataset?
NOTE: I do want to preserve order.

You can use the itertools recipe unique_everseen (which preserves order), assuming the data sits in a DataFrame column called 'months', like so:
>>> list(unique_everseen(z for x, y in df.iterrows() for z in y['months']))
[10, 11, 12, 1, 2, 3, 4, 9, 8]
Definition of unique_everseen:
import itertools as it

def unique_everseen(iterable, key=None):
    "List unique elements, preserving order. Remember all elements ever seen."
    # unique_everseen('AAAABBBCCDAABBB') --> A B C D
    # unique_everseen('ABBCcAD', str.lower) --> A B C D
    seen = set()
    seen_add = seen.add
    if key is None:
        for element in it.filterfalse(seen.__contains__, iterable):
            seen_add(element)
            yield element
    else:
        for element in iterable:
            k = key(element)
            if k not in seen:
                seen_add(k)
                yield element
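As a side note (not part of the original answer): on pandas 0.25 or later, and assuming the data is a Series s of month lists as shown in the question, explode gives the same first-seen ordering more directly:
# s is the Series of month lists shown above (hypothetical name)
# explode() flattens the lists into one long Series, and unique()
# keeps the order of first appearance
superset = list(s.explode().unique())
# [10, 11, 12, 1, 2, 3, 4, 9, 8]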

I seem to have misinterpreted the data structure in the question, but as it might be useful for similar cases, I will keep this answer here for future reference.
You can use numpy's unique function.
import pandas as pd
import numpy as np
df = pd.DataFrame({"x": [1,3,5], "y": [3,4,5]})
print(np.unique(df))  # prints [1 3 4 5]

I don't know if there is a way to do this more cleanly in pandas, so if anyone else knows, please answer... Looking at the types, this seems like a time for folding over that column.
I didn't see a fold operation in pandas, so maybe just a for loop that accumulates, i.e.:
all_months = []
for _, row in df.iterrows():  # iterrows yields (index, row) pairs
    months = row['months']
    all_months += [e for e in months if e not in all_months]
On second thought, I would use a set instead of the list comprehension:
all_months = set()
for _, row in df.iterrows():
    all_months = all_months.union(row['months'])
(Note that a set loses the month ordering; if order matters, stick with the unique_everseen approach above.)
Hmm, just saw the other answer; haven't tested it, but it looks better! Choose that one :). Posting this just in case it helps someone...

Related

Merge lists in a dataframe column if they share a common value

What I need:
I have a dataframe where the elements of a column are lists. There are no duplications of elements in a list. For example, a dataframe like the following:
import pandas as pd
>>> d = {'col1': [[1, 2, 4, 8], [15, 16, 17], [18, 3], [2, 19], [10, 4]]}
>>> df = pd.DataFrame(data=d)
col1
0 [1, 2, 4, 8]
1 [15, 16, 17]
2 [18, 3]
3 [2, 19]
4 [10, 4]
I would like to obtain a dataframe where, if at least one number contained in the list at row i is also contained in the list at row j, the two lists are merged (without duplicates). The values could also be shared by more than two lists; in that case, I want all lists that share at least one value to be merged.
col1
0 [1, 2, 4, 8, 19, 10]
1 [15, 16, 17]
2 [18, 3]
Neither the order of the rows of the output dataframe nor the order of the values inside a list matters.
What I tried:
I have found this answer, which shows how to tell whether at least one item of a list is contained in another list, e.g.
>>> not set([1, 2, 4, 8]).isdisjoint([2, 19])
True
It returns True, since 2 is contained in both lists.
I have also found this useful answer that shows how to compare each row of a dataframe with each other. The answer applies a custom function to each row of the dataframe using a lambda.
df.apply(lambda row: func(row['col1']), axis=1)
However, I'm not sure how to put these two things together, i.e. how to create the func method. Also, I don't know if this approach is even feasible, since the resulting dataframe will probably have fewer rows than the original one.
Thanks!
You can use networkx and graphs for that:
import networkx as nx
G = nx.Graph([edge for nodes in df['col1'] for edge in zip(nodes, nodes[1:])])
result = pd.Series(nx.connected_components(G))
This basically treats every number as a node; whenever two numbers are in the same list, you connect them. Finally, you find the connected components.
Output:
0 {1, 2, 4, 8, 10, 19}
1 {16, 17, 15}
2 {18, 3}
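If you want lists rather than sets, matching the expected output in the question, one option (an illustrative follow-up, not part of the original answer) is to sort each component:
result = pd.Series(nx.connected_components(G)).apply(sorted)
# 0    [1, 2, 4, 8, 10, 19]
# 1    [15, 16, 17]
# 2    [3, 18]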
This is not straightforward. Merging lists has many pitfalls.
One solid approach is to use a specialized library, for example networkx, and take a graph approach: generate successive edges and find the connected components.
You can thus:
generate successive edges with add_edges_from
find the connected_components
craft a dictionary and map the first item of each list
groupby and merge the lists (you could use the connected components directly but I'm giving a pandas solution in case you have more columns to handle)
import networkx as nx

G = nx.Graph()
for l in df['col1']:
    G.add_edges_from(zip(l, l[1:]))

groups = {k: v for v, l in enumerate(nx.connected_components(G)) for k in l}
# {1: 0, 2: 0, 4: 0, 8: 0, 10: 0, 19: 0, 16: 1, 17: 1, 15: 1, 18: 2, 3: 2}

out = (df.groupby(df['col1'].str[0].map(groups), as_index=False)
         .agg(lambda x: sorted(set().union(*x)))
      )
output:
col1
0 [1, 2, 4, 8, 10, 19]
1 [15, 16, 17]
2 [3, 18]
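For reference, the same grouping can also be done without a graph library, using a small union-find (disjoint-set) structure. A minimal sketch, with merge_lists as a hypothetical helper name:
def merge_lists(lists):
    # Union-find over all values: each value starts as its own parent.
    parent = {}

    def find(x):
        # Follow parent pointers to the root, halving paths as we go.
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    for l in lists:
        for v in l:
            parent.setdefault(v, v)
        for a, b in zip(l, l[1:]):
            # Link consecutive values of the same list under one root.
            parent[find(a)] = find(b)

    # Collect the values belonging to each root.
    groups = {}
    for v in parent:
        groups.setdefault(find(v), []).append(v)
    return list(groups.values())

merge_lists(df['col1'])
# e.g. [[1, 2, 4, 8, 19, 10], [15, 16, 17], [18, 3]]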
This seems more like a Python problem than a pandas one, so here's one attempt that checks every later list against the current one and merges (and removes) it if they intersect:
vals = d["col1"]
# while there are at least 1 more list after to process...
i = 0
while i < len(vals) - 1:
current = set(vals[i])
# for the next lists...
j = i + 1
while j < len(vals):
# any intersection?
# then update the current and delete the other
other = vals[j]
if current.intersection(other):
current.update(other)
del vals[j]
else:
# no intersection, so keep going for next lists
j += 1
# put back the updated current back, and move on
vals[i] = current
i += 1
at the end, vals is
In [108]: vals
Out[108]: [{1, 2, 4, 8, 10, 19}, {15, 16, 17}, {3, 18}]
In [109]: pd.Series(map(list, vals))
Out[109]:
0 [1, 2, 19, 4, 8, 10]
1 [16, 17, 15]
2 [18, 3]
dtype: object
If you don't want vals (and thus d["col1"]) modified, you can chain a .copy() onto it.
To add to mozway's answer: it wasn't clear from the question, but I also had rows with single-valued lists. These values aren't added to the graph when calling add_edges_from(zip(l, l[1:])), since l[1:] is empty. I solved it by adding a single node to the graph whenever l[1:] is empty. I leave the solution here in case anyone needs it.
import networkx as nx
import pandas as pd

d = {'col1': [[1, 2, 4, 8], [15, 16, 17], [18, 3], [2, 19], [10, 4], [9]]}
df = pd.DataFrame(data=d)

G = nx.Graph()
for l in df['col1']:
    if len(l[1:]) == 0:
        G.add_node(l[0])
    else:
        G.add_edges_from(zip(l, l[1:]))

groups = {k: v for v, l in enumerate(nx.connected_components(G)) for k in l}
out = (df.groupby(df['col1'].str[0].map(groups), as_index=False)
         .agg(lambda x: sorted(set().union(*x))))
Result:
col1
0 [1, 2, 4, 8, 10, 19]
1 [15, 16, 17]
2 [3, 18]
3 [9]

Changing list to dataframe in dictionary

I am writing a dictionary that has to separate a dataframe into multiple smaller dataframes, based on items that are repeated in the list calvo_massflow. If an item hasn't been seen yet, a new list is made in the dictionary. In the second for loop, the row at index i of the df dataframe is added to one of the dictionary lists, whenever the key (l) and e are the same.
This is what I currently got:
import numpy as np
import matplotlib.pyplot as plt
import pandas as pd
import seaborn as sns
from scipy.stats import linregress
from scipy.optimize import curve_fit
calvo_massflow = [1, 2, 1, 2, 2, 1, 1]
df = pd.DataFrame({"a": [1, 2, 3, 4, 11, 2, 4, 6, 7, 3],
                   "b": [5, 6, 7, 8, 10, 44, 23, 267, 4, 66]})

dic = {}
massflows = []
for i, e in enumerate(calvo_massflow):
    if e not in massflows:
        massflows.append(e)
        dic[e] = []
    for l in dic:
        if e == l:
            dic[e].append(pd.DataFrame([df.iloc[i]]))
The problem with the output is that each index is a separate dataframe in the dictionary. I would like to have all the dataframes combined. I tried doing something with pd.concat, but I didn't figure it out. Moreover, the values in the dictionary (if that's what you call them) are lists, and I'd prefer them to be dataframes. However, if I change my lists to dataframes like I've done here:
dic3 = {}
massflows = []
for i, e in enumerate(calvo_massflow):
    if e not in massflows:
        massflows.append(e)
        dic3[e] = pd.DataFrame([])
    for l in dic3:
        if e == l:
            dic3[e].append(df.iloc[i])
I can't seem to add dataframes to the dataframes made by the dictionary.
My ideal scenario would be a dictionary with two dataframes, one with the key 1 and one with the key 2. Each of those dataframes contains all the relevant rows of the dataframe df, instead of the current situation with a separate dataframe for each index. Preferably the dataframes aren't wrapped in lists like they are now, but that won't be a disaster.
Let me know if you guys can help me out or need more context!
IIUC you want to select the rows of df up to the length of calvo_massflow, group by calvo_massflow and convert to dict. This might look like this:
calvo_massflow = [1, 2, 1, 2, 2, 1, 1]
df = pd.DataFrame({"a":[1, 2, 3, 4, 11, 2, 4, 6, 7, 3],
"b":[5, 6, 7, 8, 10, 44, 23, 267, 4, 66]})
dic = dict(iter(df.iloc[:len(calvo_massflow)]
                  .groupby(calvo_massflow)))
print(dic)
resulting in a dictionary with keys 1 and 2 containing two filtered DataFrames:
{1: a b
0 1 5
2 3 7
5 2 44
6 4 23,
2: a b
1 2 6
3 4 8
4 11 10}
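Each value in dic is then an ordinary DataFrame, so the usual operations apply; an illustrative example (not part of the original answer):
sub = dic[1]            # rows whose calvo_massflow value is 1
print(sub['b'].sum())   # 5 + 7 + 44 + 23 = 79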

Comparisons between an arbitrary number of lists of arbitrary length Python

Given an arbitrary number of lists of integers of arbitrary length, I would like to group the integers into new lists based on a given distance threshold.
Input:
l1 = [1, 3]
l2 = [2, 4, 6, 10]
l3 = [12, 13, 15]
threshold = 2
Output:
[1, 2, 3, 4, 6] # group 1
[10, 12, 13, 15] # group 2
The elements of the groups act as a growing chain so first we have
abs(l1[0] - l2[0]) < threshold #true
so l1[0] and l2[0] are in group 1, and then the next check could be
abs(group[-1] - l1[1]) < threshold #true
so now l1[1] is added to group 1
Is there a clever way to do this without first grouping l1 and l2 and then grouping l3 with that output?
Based on the way that you asked the question, it sounds like you just want a basic python solution for utility, so I'll give you a simple solution.
Instead of treating the lists as all separate entities, it's easiest to just utilize a big cluster of non-duplicate numbers. You can exploit the set property of only containing unique values to go ahead and cluster all of the lists together:
# Throw the contents of all lists into a set to deduplicate, then sort into a list
elems = sorted({*l1, *l2, *l3})
# elems = [1, 2, 3, 4, 6, 10, 12, 13, 15]
If you had a list of lists that you wanted to perform this on:
lists = [l1, l2, l3]
elems = []
for l in lists:
    elems.extend(l)
elems = sorted(set(elems))
# elems = [1, 2, 3, 4, 6, 10, 12, 13, 15]
If you want to keep duplicates, skip the set in either variant:
elems = sorted([*l1, *l2, *l3])
# or, for the list-of-lists version:
elems = sorted(elems)
From there, you can just do the separation iteratively. Specifically:
Go through the elements one-by-one. If the next element is validly spaced, add it to the list you're building on.
When an invalidly-spaced element is encountered, create a new list containing that element, and start appending to the new list instead.
This can be done as follows (note that the -1'th index refers to the last element):
out = [[elems[0]]]
thresh = 2
for el in elems[1:]:
    if el - out[-1][-1] <= thresh:
        out[-1].append(el)
    else:
        out.append([el])
# out = [[1, 2, 3, 4, 6], [10, 12, 13, 15]]
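If numpy is available, the same split can be written in a vectorized form; a short sketch, assuming elems is the sorted, deduplicated array from above:
import numpy as np

elems = np.array([1, 2, 3, 4, 6, 10, 12, 13, 15])
thresh = 2
# indices where the gap to the next element exceeds the threshold
breaks = np.where(np.diff(elems) > thresh)[0] + 1
out = np.split(elems, breaks)
# out = [array([1, 2, 3, 4, 6]), array([10, 12, 13, 15])]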

Splitting arrays depending on unique values in an array

I currently have two arrays, one of which has several repeated values and another with unique values.
Eg array 1 : a = [1, 1, 2, 2, 3, 3]
Eg array 2 : b = [10, 11, 12, 13, 14, 15]
I am developing code in Python that looks at the first array, finds the elements that are all the same, and remembers their indices. New arrays are then created containing the elements of array b at those indices.
E.g.: as array 'a' has three unique values, at positions 1,2... 3,4... 5,6, three new arrays would be created, each containing the elements of array b at positions 1,2... 3,4... 5,6 respectively. Thus, the result would be three new arrays:
b1 = [10, 11]
b2 = [12, 13]
b3 = [14, 15]
I have managed to develop some code; however, it only works when there are exactly three unique values in array 'a'. In case there are more or fewer unique values in array 'a', the code has to be modified by hand.
import itertools
import numpy as np
import matplotlib.tri as tri
import sys

a = [1, 1, 2, 2, 3, 3]
b = [10, 11, 12, 13, 14, 15]
b_1 = []
b_2 = []
b_3 = []
unique = []

for vals in a:
    if vals not in unique:
        unique.append(vals)

if len(unique) != 3:
    sys.exit("More than 3 'a' values - check dimension")

for j in range(0, len(a)):
    if a[j] == unique[0]:
        b_1.append(b[j])
    elif a[j] == unique[1]:
        b_2.append(b[j])
    elif a[j] == unique[2]:
        b_3.append(b[j])
    else:
        sys.exit("More than 3 'a' values - check dimension")

print(b_1)
print(b_2)
print(b_3)
I was wondering if there is perhaps a more elegant way to perform this task such that the code is able to cope with an n number of unique values.
Well, given that you are also using numpy, here's one way using np.unique. You can set return_index=True to get the indices of the first occurrence of each unique value, and use them to split the array b with np.split:
a = np.array([1, 1, 2, 2, 3, 3])
b = np.array([10, 11, 12, 13, 14, 15])
u, s = np.unique(a, return_index=True)
np.split(b, s[1:])
Output
[array([10, 11]), array([12, 13]), array([14, 15])]
You can use the function groupby():
from itertools import groupby
from operator import itemgetter
a = [1, 1, 2, 2, 3, 3]
b = [10, 11, 12, 13, 14, 15]
[[i[1] for i in g] for _, g in groupby(zip(a, b), key=itemgetter(0))]
# [[10, 11], [12, 13], [14, 15]]
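Note that groupby() relies on equal values in a being adjacent, and the np.unique split assumes a is sorted. If neither holds, a plain dictionary pass also works; a minimal sketch:
from collections import defaultdict

a = [1, 1, 2, 2, 3, 3]
b = [10, 11, 12, 13, 14, 15]

# collect the elements of b under their corresponding key in a
groups = defaultdict(list)
for key, val in zip(a, b):
    groups[key].append(val)

print(list(groups.values()))
# [[10, 11], [12, 13], [14, 15]]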

How to add numbers in your list, incrementally, while also being sorted from lowest to highest value?

I'm trying to write code that, firstly, orders numbers from lowest to highest (e.g. 1, 3, 2, 4, 5 to 1, 2, 3, 4, 5). Secondly, I would like to incrementally add the numbers in the list, e.g.:
1
3
6
10
15
I've already tried using the sum function and then the sorted function, but I was wondering if I can combine them neatly in one piece of code to get everything worked out.
Addition = [1, 13, 166, 3, 80, 6, 40]
print(sorted(Addition))
I was able to get the numbers sorted horizontally, but I wasn't able to get the numbers added vertically.
Apparently, you need a cumulative sum. You can code a simple one with a plain loop, yielding the running totals as you go:
def cumulative_add(array):
    total = 0
    for item in array:
        total += item
        yield total

>>> list(cumulative_add([1, 2, 3, 4, 5]))
[1, 3, 6, 10, 15]
Depending on your goals, you may also wish to use a library, such as pandas, that has cumulative sum already written for you.
For example,
>>> import pandas as pd
>>> s = pd.Series([1, 2, 3, 4, 5])
>>> s.cumsum()
0 1
1 3
2 6
3 10
4 15
You can use itertools.accumulate with sorted:
import itertools
mylist = [1, 2, 3, 4, 5]
result = list(itertools.accumulate(sorted(mylist)))
# result: [1, 3, 6, 10, 15]
The default function is operator.add, but you can customize it. For example, you can compute a running product instead of a running sum if you need it:
import itertools
import operator
mylist = [1, 2, 3, 4, 5]
result = list(itertools.accumulate(sorted(mylist), operator.mul))
# result: [1, 2, 6, 24, 120]
