Get all IDs of repeated rows in a dataframe in Pandas

Suppose we have a dataframe df that has duplicated rows. I want to store, for each unique row, the list of integer positions (the IDs) where it appears in the dataframe.
Let me show an example:
import numpy as np
import pandas as pd
np.random.seed(0)
m = ['a','b']
M = ['X','Y']
n = np.arange(3)
size = 10
df = pd.DataFrame({'m': np.random.choice(m, size=size, replace=True),
                   'M': np.random.choice(M, size=size, replace=True),
                   'n': np.random.choice(n, size=size, replace=True)})
This generates the following dataframe:
m M n
0 a Y 2
1 b X 2
2 b X 0
3 a Y 1
4 b X 1
5 b X 1
6 b X 1
7 b X 0
8 b X 1
9 b Y 0
I believe I want to do something like df.groupby(df.columns.tolist()).size(), but instead of getting the number of appearances, I want to get the positions where they appear. So, in this case, the desired output would be (in a dictionary form for example):
output = {('a','Y',1): [3],
          ('a','Y',2): [0],
          ('b','X',0): [2,7],
          ('b','X',1): [4,5,6,8],
          ('b','X',2): [1],
          ('b','Y',0): [9]}
How can I do this? The idea is to do it as efficiently as possible, because the dataframe can have several columns and many thousands (or even a few million) of rows.

You can use the groups attribute of the groupby object:
df.groupby(list(df)).groups
Out[176]:
{('a', 'Y', 1): Int64Index([3], dtype='int64'),
('a', 'Y', 2): Int64Index([0], dtype='int64'),
('b', 'X', 0): Int64Index([2, 7], dtype='int64'),
('b', 'X', 1): Int64Index([4, 5, 6, 8], dtype='int64'),
('b', 'X', 2): Int64Index([1], dtype='int64'),
('b', 'Y', 0): Int64Index([9], dtype='int64')}
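If you specifically want plain Python lists (as in the desired output dictionary), a minimal sketch of the conversion could look like this:
groups = df.groupby(list(df)).groups
output = {key: list(idx) for key, idx in groups.items()}
# {('a', 'Y', 1): [3], ('a', 'Y', 2): [0], ('b', 'X', 0): [2, 7], ...}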

Related

How to create multiple rows of a data frame based on some original values

I am a Python newbie and have a question.
As a simple example, I have three variables:
a = 3
b = 10
c = 1
I'd like to create a data frame with three columns ('a', 'b', and 'c') with:
each column +/- a certain constant from the original value AND also >0 and <=10.
If the constant is 1 then:
the possible values of 'a' will be 2, 3, 4
the possible values of 'b' will be 9, 10
the possible values of 'c' will be 1, 2
The final data frame will consist of all possible combination of a, b and c.
Do you know any Python code to do so?
Here is some code to start with.
import pandas as pd
data = [[3 , 10, 1]]
df1 = pd.DataFrame(data, columns=['a', 'b', 'c'])
You may use itertools.product for this.
Create 3 separate lists with the accepted values. This can be done with a function that returns the list of possible values for a given input.
def list_of_values(n):
    # allowed values are > 0 and <= 10, so clip the +/- 1 range at the edges
    if 1 < n < 10:
        return [n - 1, n, n + 1]
    elif n == 1:
        return [1, 2]
    elif n == 10:
        return [9, 10]
    return []
So you will have the following:
a = [2, 3, 4]
b = [9, 10]
c = [1,2]
Next, do the following:
from itertools import product
l = product(a, b, c)
data = list(l)
pd.DataFrame(data, columns=['a', 'b', 'c'])
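Putting the pieces together (a minimal sketch, assuming the list_of_values helper above and the original values a = 3, b = 10, c = 1; result is just an illustrative name):
from itertools import product
import pandas as pd

a, b, c = 3, 10, 1
combos = product(list_of_values(a), list_of_values(b), list_of_values(c))
result = pd.DataFrame(list(combos), columns=['a', 'b', 'c'])
# 12 rows: every combination of [2, 3, 4] x [9, 10] x [1, 2]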

Dynamically updating the condition while indexing and selecting data in pandas Dataframe

While selecting data from a dataframe I need to do in such a way that the condition changes according to the length of the input list. This is my current code. The first element in the list is the name of the column and the second element is the value of the column.
import pandas as pd
list_1 = [('a', 2), ('b', 5)]
list_2 = [('a', 1), ('b', 2), ('c', 3)]
data = pd.DataFrame([[1, 2, 3], [1, 2, 3], [1, 1, 1], [2, 5, 6]], columns=['a', 'b', 'c'])
def select_data(l, dataset):
    df = None
    i = len(l)
    if i == 2:
        df = dataset[(dataset[l[0][0]] == l[0][1]) & (dataset[l[1][0]] == l[1][1])]
    if i == 3:
        df = dataset[(dataset[l[0][0]] == l[0][1]) & (dataset[l[1][0]] == l[1][1]) & (dataset[l[2][0]] == l[2][1])]
    return df
print(select_data(list_1, data))
print(select_data(list_2, data))
There must be a cleaner way of doing this.
A short filtering approach with DataFrame.loc[<filtering_mask>, :]:
In [150]: data
Out[150]:
a b c
0 1 2 3
1 1 2 3
2 1 1 1
3 2 5 6
In [151]: def select_data(filter_lst, df):
...: d = dict(filter_lst)
...: res = df.loc[(df[list(d.keys())] == pd.Series(d)).all(axis=1)]
...: return res
...:
...:
In [152]: select_data(list_1, data)
Out[152]:
a b c
3 2 5 6
In [153]: select_data(list_2, data)
Out[153]:
a b c
0 1 2 3
1 1 2 3
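The trick here is that comparing df[list(d.keys())] against pd.Series(d) aligns the values by column name, and .all(axis=1) keeps only the rows where every condition holds, so the same code works for any number of (column, value) pairs. A minimal sketch of the mask construction, using the data frame from the question:
d = dict([('a', 1), ('b', 2), ('c', 3)])
mask = (data[list(d.keys())] == pd.Series(d)).all(axis=1)
data.loc[mask]  # rows 0 and 1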
I think you want to create a dynamic condition. To do so, you can write the condition as a string and then evaluate it with the eval function; you also need to know the dataframe's name.
Here is the code, assuming all the keys are present in the given dataset:
import numpy as np
import pandas as pd
input1 = [('a', 1), ('b', 2)]
# input2 = [('a', 1), ('b', 2), ('c', 3)]  # Same output as for input1 for the below dataset
dataset = pd.DataFrame([[1, 2, 3], [1, 2, 3], [1, 1, 1], [2, 5, 6]],
                       columns=['a', 'b', 'c'])
def create_query_condition(list_tuple, df_name):
    """Create a string condition from the keys and values in list_tuple.

    Arguments:
    :param list_tuple: 2D array: 1st column is key, 2nd column is value
    :param df_name: dataframe name
    """
    my_array = np.array(list_tuple)
    # Get keys - values
    keys = my_array[:, 0]
    values = my_array[:, 1]
    # Create string query
    query = ' & '.join(['({0}.{1} == {2})'.format(df_name, k, v)
                        for k, v in zip(keys, values)])
    return query
query = create_query_condition(input1, 'dataset')
print(query)
# (dataset.a == 1) & (dataset.b == 2)
print(dataset[eval(query)])
# a b c
# 0 1 2 3
# 1 1 2 3
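As a side note (not part of the original answer), if you build the string with bare column names instead of the dataset.<column> form, you can pass it to DataFrame.query instead of using eval; a minimal sketch:
query = ' & '.join('({} == {})'.format(k, v) for k, v in input1)
print(query)                 # (a == 1) & (b == 2)
print(dataset.query(query))  # same two rows as above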

Efficiently find index of DataFrame values in array

I have a DataFrame that resembles:
x y z
--------------
0 A 10
0 D 13
1 X 20
...
and I have two sorted arrays for every possible value for x and y:
x_values = [0, 1, ...]
y_values = ['a', ..., 'A', ..., 'D', ..., 'X', ...]
so I wrote a function:
def lookup(record, lookup_list, lookup_attr):
    return np.searchsorted(lookup_list, getattr(record, lookup_attr))
and then call:
df_x_indicies = df.apply(lambda r: lookup(r, x_values, 'x'), axis=1)
df_y_indicies = df.apply(lambda r: lookup(r, y_values, 'y'), axis=1)
# df_x_indicies: [0, 0, 1, ...]
# df_y_indicies: [26, ...]
but is there a more performant way to do this? And possibly for multiple columns at once, so that a DataFrame is returned rather than a Series?
I tried:
np.where(np.in1d(x_values, df.x))[0]
but this removes duplicate values and that is not desired.
You can convert your index arrays to pd.Index objects to make lookup fast(er).
u, v = map(pd.Index, [x_values, y_values])
pd.DataFrame({'x': u.get_indexer(df.x), 'y': v.get_indexer(df.y)})
x y
0 0 1
1 0 2
2 1 3
Where,
x_values
# [0, 1]
y_values
# ['a', 'A', 'D', 'X']
As to your requirement of having this work for multiple columns, you will have to iterate over each one. Here's a version of the code above that should generalise to N columns and indices.
val_list = [x_values, y_values]  # [x_values, y_values, z_values, ...]
idx_list = map(pd.Index, val_list)

pd.DataFrame({
    f'{c}': idx.get_indexer(df[c]) for idx, c in zip(idx_list, df)})
x y
0 0 1
1 0 2
2 1 3
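Note that Index.get_indexer returns -1 for any value that is not present in the index, which is a handy sanity check when the lookup arrays might be incomplete, e.g.:
pd.Index(y_values).get_indexer(['A', 'missing'])
# array([ 1, -1])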
Update: using a Series with .loc; you can also try reindex.
pd.Series(range(len(x_values)),index=x_values).loc[df.x].tolist()
Out[33]: [0, 0, 1]
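The reindex variant mentioned above would be something like this (a sketch, same mapping Series):
pd.Series(range(len(x_values)), index=x_values).reindex(df.x).tolist()
# [0, 0, 1]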

How to add rows for all missing values of one multi-index's level?

Suppose that I have the following dataframe df, indexed by a 3-level multi-index:
In [52]: df
Out[52]:
C
L0 L1 L2
0 w P 1
y P 2
R 3
1 x Q 4
R 5
z S 6
Code to create the DataFrame:
# note: newer pandas versions name the labels argument 'codes'
idx = pd.MultiIndex(levels=[[0, 1], ['w', 'x', 'y', 'z'], ['P', 'Q', 'R', 'S']],
                    labels=[[0, 0, 0, 1, 1, 1], [0, 2, 2, 1, 1, 3], [0, 0, 2, 1, 2, 3]],
                    names=['L0', 'L1', 'L2'])
df = pd.DataFrame({'C': [1, 2, 3, 4, 5, 6]}, index=idx)
The possible values for the L2 level are 'P', 'Q', 'R', and 'S', but some of these values are missing for particular combinations of values for the remaining levels. For example, the combination (L0=0, L1='w', L2='Q') is not present in df.
I would like to add enough rows to df so that, for each combination of values for the levels other than L2, there is exactly one row for each of the L2 level's possible values. For the added rows, the value of the C column should be 0.
IOW, I want to expand df so that it looks like this:
C
L0 L1 L2
0 w P 1
Q 0
R 0
S 0
y P 2
Q 0
R 3
S 0
1 x P 0
Q 4
R 5
S 0
z P 0
Q 0
R 0
S 6
REQUIREMENTS:
the operation should leave the types of the columns unchanged;
the operation should add the smallest number of rows needed to complete only the specified level (L2)
Is there a simple way to perform this expansion?
Assuming L2 initially contains all the possible values you need, you can use the unstack/stack trick:
df.unstack('L2', fill_value=0).stack(level=1)
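A quick sanity check (a sketch): fill_value=0 gives the added rows C = 0 and, since no NaN is ever introduced, the integer dtype of C is preserved, satisfying the first requirement.
res = df.unstack('L2', fill_value=0).stack(level=1)
print(res.dtypes)  # C    int64
print(len(res))    # 16 rows: 4 existing (L0, L1) pairs x 4 L2 values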

Merge two tables based on intersection in Python

Suppose I have two tables A and B where
A is
AAA  5   3
BBB  6   1
DDD  10  2
CCC  2   4
and B is
AAA  5  2
DDD  2  1
BBB  5  4
I have to merge them in such a way that the new table looks like this:
AAA  5.00  5
BBB  5.20  5
CCC  2.00  4
DDD  7.33  3
For the first elements that A and B have in common, I take the weighted average of the middle elements of the two rows, weighted by their last elements. For example:
A and B have 'AAA' in common, so I'll compute the middle element using (5 * 3 + 5 * 2) / (3 + 2) = 5. Hence the first row of the third table becomes 'AAA', 5, 3 + 2 = 5.
I know it can be done by iterating over all the elements if I use lists, but is there a faster way to do this?
edit from comments: I'm also searching for a simpler way using pandas.DataFrame
The simplest pure Python solution would be to use a dictionary-like data structure, with keys being your labels and values being pairs (value = quantity * weight, quantity):
from collections import defaultdict

A = [
    ('A', 5, 3),
    ('B', 6, 1),
    ('D', 10, 2),
    ('C', 2, 4),
]
B = [
    ('A', 5, 5),
    ('D', 2, 1),
    ('B', 5, 4),
]
# we need to calculate (value, quantity) for each label:
a = {key: [weight * quantity, quantity] for key, weight, quantity in A}
b = {key: [weight * quantity, quantity] for key, weight, quantity in B}
# defaultdict is a dictionary-like structure, able to create
# a new item if a key is not found, in our case a new [0, 0] list:
merged = defaultdict(lambda: [0, 0])
# let's sum values and quantities per label
for key, pair in list(a.items()) + list(b.items()):
    # add value and quantity respectively
    value, quantity = map(sum, zip(merged[key], pair))
    merged[key] = [value, quantity]
# now let's calculate the means
for key, (value, quantity) in merged.items():
    mean = value / quantity
    merged[key] = [mean, quantity]
for item in merged.items():
    print(item)
And it is even simpler using pandas:
import pandas as pd
# first let's create the dataframes
colnames = 'label weight quantity'.split()
A = pd.DataFrame.from_records([
    ('A', 5, 3),
    ('B', 6, 1),
    ('D', 10, 2),
    ('C', 2, 4),
], columns=colnames)
B = pd.DataFrame.from_records([
    ('A', 5, 2),
    ('D', 2, 1),
    ('B', 5, 4),
], columns=colnames)
# we can just concatenate those DataFrames and do the calculation:
df = pd.concat([A, B])
df['value'] = df.weight * df.quantity
# sum each group with the same label
df = df.groupby('label').sum()
del df['weight']  # its sum is meaningless and we don't need it
# and calculate the means:
df['mean'] = df.value / df.quantity
print(df)
print(df[['mean', 'quantity']])
# mean quantity
# label
# A 5.000000 5
# B 5.200000 5
# C 2.000000 4
# D 7.333333 3
You can do better than this, but here is a pandas solution
In [1]: import pandas as pd
In [2]: import numpy as np
In [3]: df1 = pd.DataFrame({'AAA':np.array([5,3]),'BBB':np.array([6,1]),
.....: 'DDD':np.array([10,2]),'CCC':np.array([2,4])})
In [4]: df2 = pd.DataFrame({'AAA':np.array([5,2]),'DDD':np.array([2,1]),
.....: 'BBB':np.array([5,4])})
In [5]: df = pd.concat([df1,df2])
In [6]: df.transpose()
0 1 0 1
AAA 5 3 5 2
BBB 6 1 5 4
CCC 2 4 NaN NaN
DDD 10 2 2 1
In [7]: vals = np.nan_to_num(df.values)
In [8]: _mean = (vals[0,:]*vals[1,:]+vals[2,:]*vals[3,:])/(vals[1,:]+vals[3,:])
In [9]: _sum = (vals[1,:]+vals[3,:])
In [10]: result = pd.DataFrame(columns = df.columns,data = [_mean,_sum], index=['mean','sum'])
In [11]: result.transpose()
mean sum
AAA 5.000000 5
BBB 5.200000 5
CCC 2.000000 4
DDD 7.333333 3
It probably isn't the most elegant solution, but gets the job done.
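For completeness (not from either answer), the same numbers can also be obtained from the long label/weight/quantity layout of the first pandas snippet with a groupby and numpy's weighted average; a sketch:
import numpy as np
import pandas as pd

colnames = ['label', 'weight', 'quantity']
A = pd.DataFrame([('A', 5, 3), ('B', 6, 1), ('D', 10, 2), ('C', 2, 4)], columns=colnames)
B = pd.DataFrame([('A', 5, 2), ('D', 2, 1), ('B', 5, 4)], columns=colnames)

both = pd.concat([A, B])
result = both.groupby('label').apply(
    lambda g: pd.Series({'mean': np.average(g.weight, weights=g.quantity),
                         'quantity': g.quantity.sum()}))
print(result)
#            mean  quantity
# label
# A      5.000000       5.0
# B      5.200000       5.0
# C      2.000000       4.0
# D      7.333333       3.0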
