Merge two tables based on intersection in Python

Suppose I have two tables A and B, where

A is

AAA  5   3
BBB  6   1
DDD  10  2
CCC  2   4

and B is

AAA  5  2
DDD  2  1
BBB  5  4

I have to merge them in such a way that the new table looks like this:

AAA  5.000000  5
BBB  5.200000  5
CCC  2.000000  4
DDD  7.333333  3

For the common first elements in A and B, I'm taking the weighted average of the middle elements of the rows having the common first element. For example: A and B have 'AAA' in common, so I compute the middle element as (5 * 3 + 5 * 2) / (3 + 2) = 5 and the last element as 3 + 2 = 5. Hence the first row of the merged table becomes 'AAA', 5, 5.
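As a quick sanity check of that rule in plain Python (a minimal sketch of the arithmetic above; the variable names are mine):
weight_a, qty_a = 5, 3  # 'AAA' row in table A
weight_b, qty_b = 5, 2  # 'AAA' row in table B
mean = (weight_a * qty_a + weight_b * qty_b) / (qty_a + qty_b)
print(mean, qty_a + qty_b)  # 5.0 5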
I know it can be done by iterating over all the elements if I use lists, but is there a faster way to do this?
Edit from comments: I'm also looking for a simpler way using pandas.DataFrame.

The simplest pure-Python solution would be to use a dictionary-like data structure, with keys being your labels and values being [value, quantity] pairs, where value = weight * quantity:
from collections import defaultdict

A = [
    ('A', 5, 3),
    ('B', 6, 1),
    ('D', 10, 2),
    ('C', 2, 4),
]
B = [
    ('A', 5, 2),
    ('D', 2, 1),
    ('B', 5, 4),
]
# we need to calculate (value, quantity) for each label:
a = {key: [weight * quantity, quantity] for key, weight, quantity in A}
b = {key: [weight * quantity, quantity] for key, weight, quantity in B}
# defaultdict is a dictionary-like structure, but able to create
# a new item if the key is not found, in our case a new [0, 0] list:
merged = defaultdict(lambda: [0, 0])
# let's sum the values and quantities, respectively:
for key, pair in list(a.items()) + list(b.items()):
    value, quantity = map(sum, zip(merged[key], pair))
    merged[key] = [value, quantity]
# now let's calculate the means:
for key, (value, quantity) in merged.items():
    mean = value / quantity
    merged[key] = [mean, quantity]
for item in merged.items():
    print(item)
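On Python 3.7+, where dicts preserve insertion order, this should print:
('A', [5.0, 5])
('B', [5.2, 5])
('D', [7.333333333333333, 3])
('C', [2.0, 4])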
It is even simpler using pandas:
import pandas as pd

# first let's create the DataFrames
colnames = 'label weight quantity'.split()
A = pd.DataFrame.from_records([
    ('A', 5, 3),
    ('B', 6, 1),
    ('D', 10, 2),
    ('C', 2, 4),
], columns=colnames)
B = pd.DataFrame.from_records([
    ('A', 5, 2),
    ('D', 2, 1),
    ('B', 5, 4),
], columns=colnames)
# we can just concatenate those DataFrames and do the calculation:
df = pd.concat([A, B])
df['value'] = df.weight * df.quantity
# sum each group with the same label
df = df.groupby('label').sum()
del df['weight']  # the summed weights are meaningless and we don't need them
# and calculate the means:
df['mean'] = df.value / df.quantity
print(df[['mean', 'quantity']])
#            mean  quantity
# label
# A      5.000000         5
# B      5.200000         5
# C      2.000000         4
# D      7.333333         3
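As a side note, the same result can be computed without the helper value column by applying numpy's weighted average per group (a sketch, reusing the A and B frames from above):

import numpy as np

df = pd.concat([A, B])
result = df.groupby('label').apply(
    lambda g: pd.Series({
        'mean': np.average(g.weight, weights=g.quantity),  # weighted mean
        'quantity': g.quantity.sum(),
    })
)
print(result)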

You can do better than this, but here is a pandas solution:
In [1]: import pandas as pd
In [2]: import numpy as np
In [3]: df1 = pd.DataFrame({'AAA':np.array([5,3]),'BBB':np.array([6,1]),
.....: 'DDD':np.array([10,2]),'CCC':np.array([2,4])})
In [4]: df2 = pd.DataFrame({'AAA':np.array([5,2]),'DDD':np.array([2,1]),
.....: 'BBB':np.array([5,4])})
In [5]: df = pd.concat([df1,df2])
In [6]: df.transpose()
      0  1    0    1
AAA   5  3    5    2
BBB   6  1    5    4
CCC   2  4  NaN  NaN
DDD  10  2    2    1
In [7]: vals = np.nan_to_num(df.values)
In [8]: _mean = (vals[0,:]*vals[1,:]+vals[2,:]*vals[3,:])/(vals[1,:]+vals[3,:])
In [9]: _sum = (vals[1,:]+vals[3,:])
In [10]: result = pd.DataFrame(columns = df.columns,data = [_mean,_sum], index=['mean','sum'])
In [11]: result.transpose()
          mean  sum
AAA   5.000000    5
BBB   5.200000    5
CCC   2.000000    4
DDD   7.333333    3
It probably isn't the most elegant solution, but it gets the job done.
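A merge-based variant of the same idea, using the long (label, weight, quantity) layout from the first answer, might look like this (a sketch; the outer join keeps labels such as 'CCC' that appear in only one table):

import pandas as pd

A = pd.DataFrame({'label': ['AAA', 'BBB', 'DDD', 'CCC'],
                  'weight': [5, 6, 10, 2],
                  'quantity': [3, 1, 2, 4]})
B = pd.DataFrame({'label': ['AAA', 'DDD', 'BBB'],
                  'weight': [5, 2, 5],
                  'quantity': [2, 1, 4]})

# outer merge, then treat missing rows as zero weight and quantity
m = A.merge(B, on='label', how='outer', suffixes=('_a', '_b')).fillna(0)
m['quantity'] = m.quantity_a + m.quantity_b
m['mean'] = (m.weight_a * m.quantity_a + m.weight_b * m.quantity_b) / m['quantity']
print(m[['label', 'mean', 'quantity']])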

Related

Remove list items in a column with more than one combination in a pandas DataFrame

I have obtained a DataFrame of compounds, given below, where BaSiO3 has two possibilities. Now I want to remove one of them.
(ABO3)        (A_OS, B_OS)
['CaFeO3']    [3, 2]
['BaSiO3']    [1, 5], [4, 2]
['BaGeO3']    [4, 2]
The desired output is:
(ABO3)        (A_OS, B_OS)
['CaFeO3']    [3, 2]
['BaSiO3']    [1, 5]
['BaGeO3']    [4, 2]
I have tried to create this dataframe as follows:
from mendeleev import element  # assumption: element() and .oxistates come from the mendeleev package

res = ['CeTmO3', 'CeLuO3', 'CeRhO3']
list1 = [['Ce', 'Tm', 'O'], ['Ce', 'Lu', 'O'], ['Ce', 'Rh', 'O']]  # to extract the properties of elements
final = []
for i in range(len(list1)):
    A = element(list1[i][0]).oxistates  # only calling the oxidation
    B = element(list1[i][1]).oxistates  # states of A and B
    final.append((A, B))
# print(final)
solutions = []
for x in range(len(final)):
    # keep the (A, B) oxidation-state pairs that sum to 6
    E = [(x1, x2) for x1 in final[x][0] for x2 in final[x][1] if x1 + x2 == 6]
    solutions.append(E)
# print(solutions)
list2 = [(x, y) for x, y in zip(res, solutions)]
list2
and then
import pandas as pd
import numpy as np

dataPd1 = pd.DataFrame(res)
data = np.array(solutions, dtype=object)  # dtype=object because the inner lists have different lengths
dataPd2 = pd.DataFrame(data=data)
df_os = pd.concat([dataPd1, dataPd2], axis=1)
df_os.columns = ["ABO3", "Oxi_State"]
df_os
Now, in the above code, 'CeTmO3' and 'CeRhO3' have two possibilities each. I want to keep only one and delete the other. The current output is:
     ABO3         Oxi_State
0  CeTmO3  [(4, 2), (3, 3)]
1  CeLuO3          [(3, 3)]
2  CeRhO3  [(4, 2), (3, 3)]
The desired one is:
     ABO3  Oxi_State
0  CeTmO3   [(4, 2)]
1  CeLuO3   [(3, 3)]
2  CeRhO3   [(4, 2)]
Please note that, in the original data set, a few compounds have more than two possibilities. For example:
BrNO3 [(7, -1), (5, 1), (3, 3), (1, 5)]
The simplest approach, assuming the desired combination is always the first one listed, given a df of the form:
     ABO3         Oxi_State
0  CeTmO3  [(4, 2), (3, 3)]
1  CeLuO3          [(3, 3)]
2  CeRhO3  [(4, 2), (3, 3)]
is to keep only the first element of each list:
df['Oxi_State'] = [[x[0]] for x in df['Oxi_State'].to_list()]
This produces:
     ABO3  Oxi_State
0  CeTmO3   [(4, 2)]
1  CeLuO3   [(3, 3)]
2  CeRhO3   [(4, 2)]
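Alternatively, pandas' .str accessor slices list-valued columns element-wise, so the same "keep the first entry" idea can be written without a Python-level comprehension (a one-line sketch under the same assumption):

df['Oxi_State'] = df['Oxi_State'].str[:1]  # slice each list down to its first element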

Dynamically updating the condition while indexing and selecting data in a pandas DataFrame

While selecting data from a dataframe, I need the condition to change according to the length of the input list. This is my current code. The first element of each tuple is the name of a column and the second element is the required value of that column.
import pandas as pd

list_1 = [('a', 2), ('b', 5)]
list_2 = [('a', 1), ('b', 2), ('c', 3)]
data = pd.DataFrame([[1, 2, 3], [1, 2, 3], [1, 1, 1], [2, 5, 6]], columns=['a', 'b', 'c'])

def select_data(l, dataset):
    df = None
    i = len(l)
    if i == 2:
        df = dataset[(dataset[l[0][0]] == l[0][1]) & (dataset[l[1][0]] == l[1][1])]
    if i == 3:
        df = dataset[(dataset[l[0][0]] == l[0][1]) & (dataset[l[1][0]] == l[1][1]) & (dataset[l[2][0]] == l[2][1])]
    return df

print(select_data(list_1, data))
print(select_data(list_2, data))
There must be a cleaner way of doing this.
A short filtering approach with the DataFrame.loc[<filtering_mask>, :] idiom:
In [150]: data
Out[150]:
   a  b  c
0  1  2  3
1  1  2  3
2  1  1  1
3  2  5  6

In [151]: def select_data(filter_lst, df):
     ...:     d = dict(filter_lst)
     ...:     res = df.loc[(df[list(d.keys())] == pd.Series(d)).all(axis=1)]
     ...:     return res
     ...:

In [152]: select_data(list_1, data)
Out[152]:
   a  b  c
3  2  5  6

In [153]: select_data(list_2, data)
Out[153]:
   a  b  c
0  1  2  3
1  1  2  3
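As a follow-up note, the same selection can also be expressed as an inner merge against a one-row frame built from the list (a sketch; unlike the .loc version, merge resets the row index):

def select_data(filter_lst, df):
    # inner join on the shared columns keeps only the matching rows
    return df.merge(pd.DataFrame([dict(filter_lst)]))

print(select_data(list_1, data))  # the row with a == 2 and b == 5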
I think you want to create a dynamic condition. To do that, you can write the condition as a string and then evaluate it with the eval function; you also need to know the dataset's name.
Here is the code, assuming all the keys are columns of the given dataset:
import numpy as np
import pandas as pd

input1 = [('a', 1), ('b', 2)]
# input2 = [('a', 1), ('b', 2), ('c', 3)]  # same output as for input1 with the dataset below
dataset = pd.DataFrame([[1, 2, 3], [1, 2, 3], [1, 1, 1], [2, 5, 6]],
                       columns=['a', 'b', 'c'])

def create_query_condition(list_tuple, df_name):
    """Create a string condition from the keys and values in list_tuple.

    Arguments:
    :param list_tuple: 2D array: 1st column is key, 2nd column is value
    :param df_name: dataframe name
    """
    my_array = np.array(list_tuple)
    # get keys - values
    keys = my_array[:, 0]
    values = my_array[:, 1]
    # create the string query
    query = ' & '.join(['({0}.{1} == {2})'.format(df_name, k, v)
                        for k, v in zip(keys, values)])
    return query

query = create_query_condition(input1, 'dataset')
print(query)
# (dataset.a == 1) & (dataset.b == 2)
print(dataset[eval(query)])
#    a  b  c
# 0  1  2  3
# 1  1  2  3
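If you would rather not call eval directly, DataFrame.query evaluates a very similar expression string in the frame's own namespace, so the dataset name is not needed (a sketch, assuming the column names are valid Python identifiers):

query = ' & '.join('{0} == {1}'.format(k, v) for k, v in input1)
print(dataset.query(query))  # same two rows as above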

Pandas - Filter dataframe using list of tuples

I need to create a function that filters a dataframe using a list of tuples - taking as arguments a dataframe and a tuple list, as follows:
tuplelist=[('A', 5, 10), ('B', 0, 4),('C', 10, 11)]
What is the proper way to do this?
I have tried the following:
def multcolfilter(data_frame, tuplelist):
    def apply_single_cond(df_0, cond):
        df_1 = df_0[(df_0[cond[0]] > cond[1]) & (df_0[cond[0]] < cond[2])]
        return df_1
    for x in range(len(tuplelist)-1):
        df = apply_single_cond(apply_single_cond(data_frame, tuplelist[x-1]), tuplelist[x])
    return df
Example dataframe and tuplelist:
df = pd.DataFrame({'A':range(1,10), 'B':range(1,10), 'C':range(1,10)})
tuplelist=[('A', 2, 10), ('B', 0, 4),('C', 3, 5)]
Instead of working with tuples, create a dictionary from them:
filters = {x[0]:x[1:] for x in tuplelist}
print(filters)
{'A': (5, 10), 'B': (0, 4), 'C': (10, 11)}
You can use pd.cut to bin the values of the dataframe's columns:
import numpy as np

rows = np.asarray([~pd.cut(df[i], filters[i], retbins=False, include_lowest=True).isnull()
                   for i in filters.keys()]).all(axis=0)
Use rows as a boolean indexer of df:
df[rows]
   A  B  C
2  3  3  3
3  4  4  4
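A note on semantics: include_lowest=True makes pd.cut treat both bounds as inclusive, while the original attempt used strict inequalities. Series.between makes that choice explicit (a sketch; the string form inclusive='neither' assumes pandas >= 1.3):

import numpy as np

# one boolean Series per (column, low, high) condition, AND-ed together
mask = np.logical_and.reduce(
    [df[col].between(lo, hi, inclusive='neither') for col, lo, hi in tuplelist]
)
print(df[mask])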

Get all IDs of repeated rows in dataframe in Pandas

Suppose we have a dataframe df that has duplicated rows. I want to store the IDs of the unique rows, so that each unique row has an associated list of integers: the positions where it appears in the dataframe.
Let me show an example:
import numpy as np
import pandas as pd
np.random.seed(0)
m = ['a','b']
M = ['X','Y']
n = np.arange(3)
size = 10
df = pd.DataFrame({'m': np.random.choice(m, size=size, replace=True),
                   'M': np.random.choice(M, size=size, replace=True),
                   'n': np.random.choice(n, size=size, replace=True)})
This generates the following dataframe:
   m  M  n
0  a  Y  2
1  b  X  2
2  b  X  0
3  a  Y  1
4  b  X  1
5  b  X  1
6  b  X  1
7  b  X  0
8  b  X  1
9  b  Y  0
I believe I want to do something like df.groupby(df.columns.tolist()).size(), but instead of getting the number of appearances, I want to get the positions where they appear. So, in this case, the desired output would be (in a dictionary form for example):
output = {('a', 'Y', 1): [3],
          ('a', 'Y', 2): [0],
          ('b', 'X', 0): [2, 7],
          ('b', 'X', 1): [4, 5, 6, 8],
          ('b', 'X', 2): [1],
          ('b', 'Y', 0): [9]}
How can I do this? The idea is to do it as efficiently as possible, because the dataframe can have several columns and many thousands (or even a few million) of rows.
You can get this from the groups attribute of a groupby:
df.groupby(list(df)).groups
Out[176]:
{('a', 'Y', 1): Int64Index([3], dtype='int64'),
 ('a', 'Y', 2): Int64Index([0], dtype='int64'),
 ('b', 'X', 0): Int64Index([2, 7], dtype='int64'),
 ('b', 'X', 1): Int64Index([4, 5, 6, 8], dtype='int64'),
 ('b', 'X', 2): Int64Index([1], dtype='int64'),
 ('b', 'Y', 0): Int64Index([9], dtype='int64')}
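If you need plain integer lists, matching the desired output exactly, the Index objects can be converted afterwards (a small follow-up sketch):

output = {key: list(idx) for key, idx in df.groupby(list(df)).groups.items()}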

Print a dictionary by rows

I have a dictionary in this format:
d = {'type 1':[1,2,3],'type 2':['a','b','c']}
It's a symmetric dictionary, in the sense that for each key I'll always have the same number of elements.
Is there a way to loop over it as if it were rows?
for row in d.rows():
    print(row)
So that I get the output:
[1]: 1, a
[2]: 2, b
[3]: 3, c
You can zip the .values(), but the order is not guaranteed unless you are using a collections.OrderedDict or Python 3.7+ (where plain dicts preserve insertion order):
for row in zip(*d.values()):
    print(row)
e.g. when I run this, I get
('a', 1)
('b', 2)
('c', 3)
but I could have just as easily gotten:
(1, 'a')
(2, 'b')
(3, 'c')
(and I might get it on future runs if hash randomization is enabled).
If you want a specified order and have a finite set of keys that you know up front, you can just zip those items directly:
zip(d['type 1'], d['type 2'])
If you want your rows ordered alphabetically by key, you might consider (the output shown is for a four-key version of d that also has 'type 3': ['d','e','f'] and 'type 4': [4,5,6]):
list(zip(*(v for k, v in sorted(d.items()))))
# [(1, 'a', 'd', 4), (2, 'b', 'e', 5), (3, 'c', 'f', 6)]
If your dataset is huge, consider a lazy zip (itertools.izip on Python 2; zip is already lazy on Python 3) or pandas.
import pandas as pd

df = pd.DataFrame(d)  # again using the four-key version of d
print(df.to_string(index=False))
# type 1 type 2 type 3 type 4
#      1      a      d      4
#      2      b      e      5
#      3      c      f      6
print(df.to_csv(index=False, header=False))
# 1,a,d,4
# 2,b,e,5
# 3,c,f,6
Even if you pass the dictionary to OrderedDict, you will not recover the original order, because a plain dict literal has already lost it (on Python versions before 3.7). So a tuple of tuples can be a good option here. See the code and output below.
Code (Python 3):
import collections

t = (('type 1', [1, 2, 3]),
     ('type 2', ['a', 'b', 'c']),
     ('type 4', [4, 5, 6]),
     ('type 3', ['d', 'e', 'f']))
d = collections.OrderedDict(t)

# collect the value lists in key (insertion) order
values_list = [value for key, value in d.items()]
values_list_length = len(values_list)
single_list_length = len(values_list[0])

# print row by row: the i-th element of every value list
for i in range(single_list_length):
    for j in range(values_list_length):
        print(values_list[j][i], ' ', end='')
    print('')
Output:
1 a 4 d
2 b 5 e
3 c 6 f
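For what it's worth, the same row-major printout can be produced with much less bookkeeping by zipping the ordered values (a sketch using the OrderedDict d from above):

for row in zip(*d.values()):
    print(*row)
# 1 a 4 d
# 2 b 5 e
# 3 c 6 f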
