What I need:
I have a dataframe where the elements of a column are lists. There are no duplicate elements within a list. For example, a dataframe like the following:
import pandas as pd
d = {'col1': [[1, 2, 4, 8], [15, 16, 17], [18, 3], [2, 19], [10, 4]]}
df = pd.DataFrame(data=d)
col1
0 [1, 2, 4, 8]
1 [15, 16, 17]
2 [18, 3]
3 [2, 19]
4 [10, 4]
I would like to obtain a dataframe where, if at least one number contained in the list at row i is also contained in the list at row j, the two lists are merged (without duplicates). A value could also be shared by more than two lists; in that case I want all lists that share at least one value to be merged.
col1
0 [1, 2, 4, 8, 19, 10]
1 [15, 16, 17]
2 [18, 3]
Neither the order of the rows of the output dataframe nor the order of the values inside a list is important.
What I tried:
I have found this answer, which shows how to tell if at least one item in a list is contained in another list, e.g.
>>> not set([1, 2, 4, 8]).isdisjoint([2, 19])
True
It returns True, since 2 is contained in both lists.
I have also found this useful answer that shows how to compare each row of a dataframe with each other. The answer applies a custom function to each row of the dataframe using a lambda.
df.apply(lambda row: func(row['col1']), axis=1)
However, I'm not sure how to put these two things together or how to create the func method. I also don't know if this approach is even feasible, since the resulting dataframe will probably have fewer rows than the original one.
Thanks!
You can use networkx and graphs for that:
import networkx as nx
G = nx.Graph([edge for nodes in df['col1'] for edge in zip(nodes, nodes[1:])])
result = pd.Series(nx.connected_components(G))
This basically treats every number as a node; whenever two numbers are in the same list, you connect them. Finally, you find the connected components.
Output:
0 {1, 2, 4, 8, 10, 19}
1 {16, 17, 15}
2 {18, 3}
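If you need a dataframe of lists rather than a Series of sets, a minimal follow-up step could be (my own addition, assuming the result Series from above):
df_out = pd.DataFrame({'col1': result.map(sorted)})  # each set becomes a sorted list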
This is not straightforward. Merging lists has many pitfalls.
One solid approach is to use a specialized library, for example networkx, and take a graph approach: generate successive edges and find the connected components.
Here is your graph (figure omitted):
You can thus:
generate successive edges with add_edges_from
find the connected_components
craft a dictionary and map the first item of each list
groupby and merge the lists (you could use the connected components directly but I'm giving a pandas solution in case you have more columns to handle)
import networkx as nx
G = nx.Graph()
for l in df['col1']:
    G.add_edges_from(zip(l, l[1:]))
groups = {k:v for v,l in enumerate(nx.connected_components(G)) for k in l}
# {1: 0, 2: 0, 4: 0, 8: 0, 10: 0, 19: 0, 16: 1, 17: 1, 15: 1, 18: 2, 3: 2}
out = (df.groupby(df['col1'].str[0].map(groups), as_index=False)
         .agg(lambda x: sorted(set().union(*x)))
       )
Output:
col1
0 [1, 2, 4, 8, 10, 19]
1 [15, 16, 17]
2 [3, 18]
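As a side note (my own illustration, not part of the original answer), df['col1'].str[0] extracts the first element of each list, and mapping it through groups assigns each row to its connected component:
df['col1'].str[0]              # 0 -> 1, 1 -> 15, 2 -> 18, 3 -> 2, 4 -> 10
df['col1'].str[0].map(groups)  # 0 -> 0, 1 -> 1,  2 -> 2,  3 -> 0, 4 -> 0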
This seems more like a Python problem than a pandas one, so here's one attempt that checks every subsequent list and merges (and removes) it if it intersects:
vals = d["col1"]
# while there are at least 1 more list after to process...
i = 0
while i < len(vals) - 1:
current = set(vals[i])
# for the next lists...
j = i + 1
while j < len(vals):
# any intersection?
# then update the current and delete the other
other = vals[j]
if current.intersection(other):
current.update(other)
del vals[j]
else:
# no intersection, so keep going for next lists
j += 1
# put back the updated current back, and move on
vals[i] = current
i += 1
At the end, vals is:
In [108]: vals
Out[108]: [{1, 2, 4, 8, 10, 19}, {15, 16, 17}, {3, 18}]
In [109]: pd.Series(map(list, vals))
Out[109]:
0 [1, 2, 19, 4, 8, 10]
1 [16, 17, 15]
2 [18, 3]
dtype: object
If you don't want vals modified, you can chain .copy() when taking it from d.
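For example (a minimal sketch of that copy; a shallow copy is enough here, because the loop replaces whole entries of vals rather than mutating the inner lists):
vals = d["col1"].copy()  # leave d["col1"] untouched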
To add to mozway's answer: it wasn't clear from the question, but I also had rows with single-element lists. These values aren't added to the graph when calling add_edges_from(zip(l, l[1:])), since l[1:] is empty. I solved it by adding a single node to the graph when encountering an empty l[1:]. I leave the solution here in case anyone needs it.
import networkx as nx
import pandas as pd
d = {'col1': [[1, 2, 4, 8], [15, 16, 17], [18, 3], [2, 19], [10, 4], [9]]}
df = pd.DataFrame(data=d)
G = nx.Graph()
for l in df['col1']:
    if len(l[1:]) == 0:
        G.add_node(l[0])
    else:
        G.add_edges_from(zip(l, l[1:]))
groups = {k: v for v, l in enumerate(nx.connected_components(G)) for k in l}
out = (df.groupby(df['col1'].str[0].map(groups), as_index=False)
         .agg(lambda x: sorted(set().union(*x))))
Result:
col1
0 [1, 2, 4, 8, 10, 19]
1 [15, 16, 17]
2 [3, 18]
3 [9]
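As an aside (my own variant, not part of the original answer), the special case can also be avoided by always registering the nodes before the edges:
G = nx.Graph()
for l in df['col1']:
    G.add_nodes_from(l)              # single-element lists become isolated nodes
    G.add_edges_from(zip(l, l[1:]))  # no-op when the list has only one element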
I want to group the pairs (couples) in a 2D array to see the families:
import numpy as np
rij = [[11, 2], [15, 6], [7, 8], [3, 6], [9, 2], [2, 3], [2, 3]]
rij = np.sort(rij, axis=1)    # sort inside each pair
rij = np.unique(rij, axis=0)  # remove duplicates
After this code I get this:
[[ 2 3]
[ 2 9]
[ 2 11]
[ 3 6]
[ 6 15]
[ 7 8]
[ 7 20]]
This is where I get stuck: I need to loop through and see if the number already exists.
Expected result (the family) would be:
[2, 3, 6, 9, 11, 15]
[7, 8, 20]
Nice to have would be that I could add the degree, family in 2nd degree.
[2, 3, 9, 11]
[6, 15]
[7, 8, 20]
family in 3rd degree.
[2, 3, 6, 9, 11, 15]
[7, 8, 20]
family in last degree. (same as previous in this example)
[2, 3, 6, 9, 11, 15]
[7, 8, 20]
We can solve this using scipy's sparse matrix and graph modules. Your rij is a list of edges from which we can build an adjacency matrix, i.e. a matrix whose entry (i, j) is 1 if nodes i and j are connected and 0 if not. From this, we can compute other properties.
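As a tiny illustration (my own toy example, not the data from the question), the adjacency matrix of three nodes where only nodes 0 and 1 are connected looks like this:
import numpy as np
adj = np.array([[0, 1, 0],
                [1, 0, 0],
                [0, 0, 0]], dtype=bool)  # adj[i, j] is True iff nodes i and j are connected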
Let's apply this to your problem. We start by cleaning up your input. As @Ali_Sh noted, there is an inconsistency in your example: the first list of rij has different elements than the sorted and unique array below it. I ignore the first line and start with the sorted, unique version.
import numpy as np
pairings = ((2, 3), (2, 9), (2, 11), (3, 6), (6, 15), (7, 8), (7, 20))
pairings = np.array(pairings)
The IDs are not consecutive. This will waste resources further down so let's compress our range. The index will be the graph node. The value at the index is the original ID in pairings. We can use this as a lookup table. For the inverse mapping I use a simple dictionary.
node_to_id = np.unique(np.sort(np.ravel(pairings)))
id_to_node = {id_: node for node, id_ in enumerate(node_to_id)}
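For the pairings above this gives the following lookup tables (shown here just for orientation):
# node_to_id -> array([ 2,  3,  6,  7,  8,  9, 11, 15, 20])
# id_to_node -> {2: 0, 3: 1, 6: 2, 7: 3, 8: 4, 9: 5, 11: 6, 15: 7, 20: 8}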
Now we build a sparse adjacency matrix. A node i is connected to node j if matrix[i, j] is true. Since our "family" relationship is undirected (if i is related to j, then j is always related to i), we build a symmetric matrix.
Scipy claims that directed graph algorithms with symmetric matrices are faster. So this allows us to do just that.
The graph algorithms need CSR format (compressed sparse row). We start with DOK format (dictionary of keys) and convert afterwards because it is easier to build. Since our input is sorted, LIL format (list of lists) may be faster but DOK has better worst-case performance in case we don't sort beforehand.
from scipy import sparse
n_nodes = len(node_to_id)
dok_mat = sparse.dok_matrix((n_nodes, n_nodes), dtype=bool)
for left, right in pairings:
    row, col = id_to_node[left], id_to_node[right]
    dok_mat[row, col] = True
    dok_mat[col, row] = True  # undirected graph
csr_mat = dok_mat.tocsr()
del dok_mat
Connected components gives us our families. For each row in the matrix, we get an integer label that marks its component.
import collections
from scipy.sparse import csgraph
_, components = csgraph.connected_components(csr_mat)
families = collections.defaultdict(list)
for id_, component in zip(node_to_id, components):
    families[component].append(id_)
print("families", list(families.values()))
The shortest path gives the number of hops, i.e. the distance in relationship. Unrelated nodes have infinite distance.
shortest_paths = csgraph.shortest_path(csr_mat)
maxdist = 2.
for id_, row in zip(node_to_id, shortest_paths):
    immediate_family = node_to_id[row <= maxdist]
    print(id_, immediate_family)
The output will be
families [[2, 3, 6, 9, 11, 15], [7, 8, 20]]
2 [ 2 3 6 9 11]
3 [ 2 3 6 9 11 15]
6 [ 2 3 6 15]
7 [ 7 8 20]
8 [ 7 8 20]
9 [ 2 3 9 11]
11 [ 2 3 9 11]
15 [ 3 6 15]
20 [ 7 8 20]
My data frame contains purchases. A buyer (buyer_id) can buy several items (item_id).
I split the data with splitter() and put it into a DOK matrix with generate_matrix(). Then I pass this data to the method get_train_samples() to get my x_train, x_test, y_train and y_test.
How can I compress this code?
And how can I combine generate_matrix() and get_train_samples() and feed them into a 'real' one-hot encoding matrix?
Dataframe:
d = {'purchaseid': [0, 0, 0, 1, 2, 2, 3, 3, 3, 4, 4, 4, 5, 5, 5, 6, 6, 6, 7, 7, 8, 9, 9, 9, 9],
'itemid': [ 3, 8, 2, 10, 3, 10, 4, 12, 3, 12, 3, 4, 8, 6, 3, 0, 5, 12, 9, 9, 13, 1, 7, 11, 11]}
df = pd.DataFrame(data=d)
purchaseid itemid
0 0 3
1 0 8
2 0 2
3 1 10
4 2 3
Code:
import random
import numpy as np
import pandas as pd
import scipy.sparse as sp
PERCENTAGE_SPLIT = 20
NUM_NEGATIVES = 4
def splitter(df):
    df_ = pd.DataFrame()
    sum_purchase = df['purchaseid'].nunique()
    amount = round((sum_purchase / 100) * PERCENTAGE_SPLIT)
    random_list = random.sample(df['purchaseid'].unique().tolist(), amount)
    df_ = df.loc[df['purchaseid'].isin(random_list)]
    df_reduced = df.loc[~df['purchaseid'].isin(random_list)]
    return [df_reduced, df_]

def generate_matrix(df_main, dataframe, name):
    mat = sp.dok_matrix((df_main.shape[0], len(df_main['itemid'].unique())), dtype=np.float32)
    for purchaseid, itemid in zip(dataframe['purchaseid'], dataframe['itemid']):
        mat[purchaseid, itemid] = 1.0
    return mat
dfs = splitter(df)
df_tr = dfs[0].copy(deep=True)
df_val = dfs[1].copy(deep=True)
train_mat = generate_matrix(df, df_tr, 'train')
val_mat = generate_matrix(df, df_val, 'val')
def get_train_samples(train_mat, num_negatives):
    user_input, item_input, labels = [], [], []
    num_user, num_item = train_mat.shape
    for (u, i) in train_mat.keys():
        user_input.append(u)
        item_input.append(i)
        labels.append(1)
        # negative instances
        for t in range(num_negatives):
            j = np.random.randint(num_item)
            while (u, j) in train_mat.keys():
                j = np.random.randint(num_item)
            user_input.append(u)
            item_input.append(j)
            labels.append(0)
    return user_input, item_input, labels
num_users, num_items = train_mat.shape
model = get_model(num_users, num_items, ...)
user_input, item_input, labels = get_train_samples(train_mat, NUM_NEGATIVES)
val_user_input, val_item_input, val_labels = get_train_samples(val_mat, NUM_NEGATIVES)
What I need
user_input
item_input
labels
val_user_input
val_item_input
val_labels
num_users
It is pretty vague what kind of one-hot encoding matrix you are looking for. From get_train_samples, it seems you don't really need the sparse matrix for model training at the end of the day. Besides, I'm not sure how you are going to one-hot encode observations with three variables (user_id, item_id, purchased or not).
As for the problem of combining generate_matrix with get_train_samples, it is pretty simple:
def generate_matrix(df_main, df, num_negatives):
    n_samples, n_classes = df_main.shape[0], df_main['itemid'].nunique()
    mat = sp.dok_matrix((n_samples, n_classes), dtype=np.float32)
    user_input, item_input, labels = [], [], []
    for purchaseid, itemid in zip(df['purchaseid'], df['itemid']):
        mat[purchaseid, itemid] = 1.0
        # the data with label 0 in OP's original code
        fake_items = np.random.choice(a=np.setdiff1d(range(n_classes), itemid), size=num_negatives)
        # mark the fake items with -1.0
        mat[np.repeat(purchaseid, num_negatives), fake_items] = -1.0
        # the three lists
        user_input.extend([purchaseid] * (num_negatives + 1))
        item_input.append(itemid); item_input.extend(fake_items.tolist())
        labels.append(1.0); labels.extend(np.zeros(num_negatives).tolist())
    return mat, user_input, item_input, labels
As you can see, in my generate_matrix the fake samples (items not purchased by a user) are encoded with -1.0 in the sparse matrix. Besides, I use a pretty compact expression, fake_items = np.random.choice(a=np.setdiff1d(range(n_classes), itemid), size=num_negatives), to generate the fake itemids, as compared to the while loop in your code.
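A tiny standalone illustration of that expression (my own toy numbers, not from the answer): it samples negative item ids from all classes except the purchased one, with replacement by default:
import numpy as np
n_classes, itemid, num_negatives = 13, 3, 4
fake_items = np.random.choice(a=np.setdiff1d(range(n_classes), itemid), size=num_negatives)
# e.g. array([7, 0, 10, 7]) -- duplicates are possible because replace=True by default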
With this function, you can run
import random
import numpy as np
import pandas as pd
import scipy.sparse as sp
d = {'purchaseid': [0, 0, 0, 1, 2, 2, 3, 3, 3, 4, 4, 4, 5, 5, 5, 6, 6, 6, 7, 7, 8, 9, 9, 9, 9],
'itemid': [ 3, 8, 2, 10, 3, 10, 4, 12, 3, 12, 3, 4, 8, 6, 3, 0, 5, 12, 9, 9, 13, 1, 7, 11, 11]}
df = pd.DataFrame(data=d)
PERCENTAGE_SPLIT = 20
NUM_NEGATIVES = 4
def splitter(df):
    df_ = pd.DataFrame()
    sum_purchase = df['purchaseid'].nunique()
    amount = round(sum_purchase * (PERCENTAGE_SPLIT / 100))
    random_list = random.sample(df['purchaseid'].unique().tolist(), amount)
    df_ = df.loc[df['purchaseid'].isin(random_list)]
    df_reduced = df.loc[~df['purchaseid'].isin(random_list)]
    return [df_reduced, df_]
dfs = splitter(df)
df_tr = dfs[0].copy(deep=True)
df_val = dfs[1].copy(deep=True)
train_mat,user_input_t,item_id_t,labels_t = generate_matrix(df, df_tr, NUM_NEGATIVES)
val_mat,user_input_v,item_id_v,labels_v = generate_matrix(df, df_val, NUM_NEGATIVES)
You will see by running this code that the length of train_mat.keys() can differ from the length of user_input_t. This is because the same item can be chosen multiple times in fake_items = np.random.choice(a=np.setdiff1d(range(n_classes), itemid), size=num_negatives). If you want to keep the two lengths the same, you need to pass replace=False to that np.random.choice call.
Consider the question:
The grid is:
[ [3, 0, 8, 4],
[2, 4, 5, 7],
[9, 2, 6, 3],
[0, 3, 1, 0] ]
The max viewed from top (i.e. max across columns) is: [9, 4, 8, 7]
The max viewed from left (i.e. max across rows) is: [8, 7, 9, 3]
I know how to define a grid in Python:
maximums = [[0 for x in range(len(grid[0]))] for x in range(len(grid))]
Getting maximum across rows looks easy:
max_top = [max(x) for x in grid]
But how to get maximum across columns?
Further, I need to find a way to do so in linear space, O(M+N), where MxN is the size of the matrix.
Use zip:
result = [max(i) for i in zip(*grid)]
In Python, * is not a pointer; rather, it is used for unpacking an iterable into a call's arguments, or for specifying that a function can receive a variable number of positional arguments. For instance:
def f(*args):
    print(args)

f(434, 424, "val", 233, "another val")
Output:
(434, 424, 'val', 233, 'another val')
Or, given an iterable, each item can be unpacked into its corresponding function argument:
def f(*args):
    print(args)

f(*["val", "val3", 23, 23])

Output:
('val', 'val3', 23, 23)
zip "transposes" a listing of data i.e each row becomes a column, and vice versa.
You could use numpy:
import numpy as np
x = np.array([ [3, 0, 8, 4],
[2, 4, 5, 7],
[9, 2, 6, 3],
[0, 3, 1, 0] ])
print(x.max(axis=0))
Output:
[9 4 8 7]
You said that you need to do this in O(m+n) space (not using numpy), so here's a solution that doesn't recreate the matrix:
col_max = list(x[0])  # copy of the first row, so the grid itself is not modified
for row in x:
    for j, val in enumerate(row):
        if val > col_max[j]:
            col_max[j] = val
print(col_max)
Output:
[9, 4, 8, 7]
I figured a shortcut too:
transpose the matrix and then just take maximum over rows:
grid_transposed = [[grid[j][i] for j in range(len(grid))] for i in range(len(grid[0]))]
max_left = [max(x) for x in grid_transposed]
But then again this takes O(M*N) space, since I have to build a transposed copy of the matrix.
I don't want to use numpy, as external libraries are not allowed in my assignments.
The easiest way is to use numpy's array max:
array.max(0)
Something like this works both ways and is quite easy to read:
# 1.
maxLR, maxTB = [], []
maxlr, maxtb = 0, 0

# max across rows
for i, x in enumerate(grid):
    maxlr = 0
    for j, y in enumerate(grid[0]):
        maxlr = max(maxlr, grid[i][j])
    maxLR.append(maxlr)

# max across columns
for j, y in enumerate(grid[0]):
    maxtb = 0
    for i, x in enumerate(grid):
        maxtb = max(maxtb, grid[i][j])
    maxTB.append(maxtb)

# 2.
row_maxes = [max(row) for row in grid]
col_maxes = [max(col) for col in zip(*grid)]
What is the best way to get a random sample of the elements of a groupby? As I understand it, a groupby is just an iterable over groups.
The standard way I would do this for an iterable, if I wanted to select N = 200 elements, is:
rand = random.sample(data, N)
If you attempt the above where data is a groupby object, the elements of the resulting list are tuples for some reason.
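(A minimal illustration of what happens there, assuming a df with a some_key column as in the snippet below: iterating a GroupBy yields (key, group) pairs, so sampling from it returns tuples.)
grouped = df.groupby('some_key')
key, group = next(iter(grouped))  # each element is a (key, sub-DataFrame) pair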
I found the below example for randomly selecting the elements of a single-key groupby; however, this does not work with a multi-key groupby. From How to access pandas groupby dataframe by key:
# create groupby object
grouped = df.groupby('some_key')

# pick N dataframes and grab their indices
sampled_df_i = random.sample(grouped.indices, N)

# grab the groups using the groupby object 'get_group' method
df_list = map(lambda df_i: grouped.get_group(df_i), sampled_df_i)

# optionally - turn it all back into a single dataframe object
sampled_df = pd.concat(df_list, axis=0, join='outer')
You can take a random sample of the unique values from df.some_key.unique(), use that to slice the df, and finally groupby on the result:
In [337]:
df = pd.DataFrame({'some_key': [0,1,2,3,0,1,2,3,0,1,2,3],
'val': [1,2,3,4,1,5,1,5,1,6,7,8]})
In [338]:
print df[df.some_key.isin(random.sample(df.some_key.unique(),2))].groupby('some_key').mean()
val
some_key
0 1.000000
2 3.666667
If there are more than one groupby keys:
In [358]:
df = pd.DataFrame({'some_key1':[0,1,2,3,0,1,2,3,0,1,2,3],
'some_key2':[0,0,0,0,1,1,1,1,2,2,2,2],
'val': [1,2,3,4,1,5,1,5,1,6,7,8]})
In [359]:
gby = df.groupby(['some_key1', 'some_key2'])
In [360]:
print gby.mean().ix[random.sample(gby.indices.keys(),2)]
val
some_key1 some_key2
1 1 5
3 2 8
But if you are just going to get the values of each group, you don't even need to groupby; a MultiIndex will do:
In [372]:
idx = random.sample(set(pd.MultiIndex.from_product((df.some_key1, df.some_key2)).tolist()),
2)
print df.set_index(['some_key1', 'some_key2']).ix[idx]
val
some_key1 some_key2
2 0 3
3 1 5
I feel like lower-level numpy operations are cleaner:
import pandas as pd
import numpy as np
df = pd.DataFrame(
{
"some_key": [0, 1, 2, 3, 0, 1, 2, 3, 0, 1, 2, 3],
"val": [1, 2, 3, 4, 1, 5, 1, 5, 1, 6, 7, 8],
}
)
ids = df["some_key"].unique()
ids = np.random.choice(ids, size=2, replace=False)
ids
# > array([3, 2])
df.loc[df["some_key"].isin(ids)]
# > some_key val
# 2 2 3
# 3 3 4
# 6 2 1
# 7 3 5
# 10 2 7
# 11 3 8
Although this question was asked and answered long ago, I think the following is cleaner:
import pandas as pd
df = pd.DataFrame(
{
"some_key1": [0, 1, 2, 3, 0, 1, 2, 3, 0, 1, 2, 3],
"some_key2": [0, 0, 0, 0, 1, 1, 1, 1, 2, 2, 2, 2],
"val": [1, 2, 3, 4, 1, 5, 1, 5, 1, 6, 7, 8]
}
)
# Set the number of samples by group
n_samples_by_group = 1
samples_by_group = df \
.groupby(by=["some_key1", "some_key2"]) \
.sample(n_samples_by_group)
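A small usage note on this (my own addition): DataFrameGroupBy.sample was added in pandas 1.1, and it also accepts random_state= for reproducible draws and frac= to sample a fraction of each group instead of a fixed count:
# one reproducible sample per group (requires pandas >= 1.1)
samples_by_group = df.groupby(by=["some_key1", "some_key2"]).sample(n_samples_by_group, random_state=0)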