DataFrame (assume all values are categorical):
df = pd.DataFrame(
    {"a": [1, 2, 3, 4, 5],
     "b": [2, 1, 3, 4, 5],
     "c": [1, 3, 4, 2, 5]},
    index=[1, 2, 3, 4, 5])
I want to find what percentage of overlap is present between the different columns:
check_a_b = df.a == df.b
check_b_c = df.b == df.c
check_a_c = df.a == df.c
print(np.sum(check_a_b)/len(check_a_b)) # 0.6
print(np.sum(check_b_c)/len(check_b_c)) # 0.2
print(np.sum(check_a_c)/len(check_a_c)) # 0.4
Final output required as a matrix / DataFrame (triangular matrix):
      a     b     c
a           0.6   0.4
b                 0.2
c
Now I want to implement this for 15 columns in an automated way, for data with more than 100K rows.
What would be an optimized way to do this?
Dropping down to numpy is usually efficient. Only return to pandas when you have the result.
import numpy as np
import pandas as pd
from itertools import combinations

df = pd.DataFrame({"a": [1, 2, 3, 4, 5],
                   "b": [2, 1, 3, 4, 5],
                   "c": [1, 3, 4, 2, 5]},
                  index=[1, 2, 3, 4, 5])

a = df.values

# fraction of equal values for each unordered pair of columns (upper triangle only)
d = {(i, j): np.mean(a[:, i] == a[:, j]) for i, j in combinations(range(a.shape[1]), 2)}

# scatter the pairwise fractions into an upper-triangular matrix
res = np.zeros((a.shape[1], a.shape[1]))
c, vals = list(map(list, zip(*d.keys()))), list(d.values())
res[c[0], c[1]] = vals

res_df = pd.DataFrame(res, columns=df.columns, index=df.columns)
# a b c
# a 0.0 0.6 0.4
# b 0.0 0.0 0.2
# c 0.0 0.0 0.0
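For 15 columns the pairwise loop above is cheap, but as a sketch of an alternative (my own addition, not part of the answer above) you can also compute the whole matrix in a single broadcast; the intermediate boolean array is rows x cols x cols, which is manageable for 100K x 15:
import numpy as np
import pandas as pd

a = df.values
# (rows, cols, 1) == (rows, 1, cols) -> (rows, cols, cols) booleans, averaged over rows
overlap = (a[:, :, None] == a[:, None, :]).mean(axis=0)
# keep only the strict upper triangle to match the requested output
overlap = np.triu(overlap, k=1)
res_df = pd.DataFrame(overlap, index=df.columns, columns=df.columns)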
One way you can do this is as follows:
import pandas as pd
from itertools import combinations

df = pd.DataFrame({"a": [1, 2, 3, 4, 5],
                   "b": [2, 1, 3, 4, 5],
                   "c": [1, 3, 4, 2, 5]},
                  index=[1, 2, 3, 4, 5])

df_out = pd.DataFrame()
for i in combinations(df.columns, 2):
    # fraction of matching rows for this pair, stored as a 1x1 frame
    s = pd.DataFrame((df[i[0]] == df[i[1]]).mean(), index=[i[0]], columns=[i[1]])
    df_out = pd.concat([df_out, s])

df_out.sum(level=0).reindex(df.columns).reindex(df.columns, axis=1).fillna(0)
Output:
a b c
a 0.0 0.6 0.4
b 0.0 0.0 0.2
c 0.0 0.0 0.0
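Note that sum(level=0) has since been deprecated and removed from pandas; on recent versions (assuming pandas >= 2.0) the equivalent groupby form is:
df_out.groupby(level=0).sum().reindex(df.columns).reindex(df.columns, axis=1).fillna(0)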
There is one way:
Yourdf = pd.DataFrame(columns=df.columns, index=df.columns)
Yourdf = (Yourdf.stack(dropna=False)
                .to_frame()
                .apply(lambda x: (df[x.name[0]] == df[x.name[1]]).sum() / len(df), axis=1)
                .unstack())
# keep only the strict upper triangle (np.bool is deprecated; use the builtin bool)
Yourdf = Yourdf.where(np.triu(np.ones(Yourdf.shape), 1).astype(bool))
Yourdf
Out[169]:
a b c
a NaN 0.6 0.4
b NaN NaN 0.2
c NaN NaN NaN
Update (as mentioned by Scott): change sum/len to mean:
Yourdf = (Yourdf.stack(dropna=False)
                .to_frame()
                .apply(lambda x: (df[x.name[0]] == df[x.name[1]]).mean(), axis=1)
                .unstack())
Related
df = pd.DataFrame({f'Diff (a - b)': c['a'] - c['b'],
                   'Diff in %': (c['a'] - c['b']) * 100 / c['a']})
If some value in c['a'] is 0, it is not valid to divide by it.
The code doesn't fail overall, but it outputs inf in these cases.
How can I avoid this and put 0 instead of inf in these cases (when c['a'] == 0)?
You can replace -np.inf with 0 using the replace method:
import numpy as np
import pandas as pd

a = [0, 1, 2]
b = [4, 5, 6]
c = pd.DataFrame({'a': a, 'b': b})
df = pd.DataFrame({'col21': (c['a'] - c['b']) * 100 / c['a']})
df = df.replace({-np.inf: 0})
print(df)
# Output
col21
0 0.0
1 -400.0
2 -200.0
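If positive infinities can also occur, or if you prefer not to produce inf in the first place, here is a sketch of two variants (my own suggestion, not part of the answer above):
import numpy as np
import pandas as pd

c = pd.DataFrame({'a': [0, 1, 2], 'b': [4, 5, 6]})

# replace both signs of infinity after the division
diff_pct = ((c['a'] - c['b']) * 100 / c['a']).replace([np.inf, -np.inf], 0)

# or mask the zero denominator up front so no inf is ever produced
diff_pct = ((c['a'] - c['b']) * 100).div(c['a'].where(c['a'] != 0)).fillna(0)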
I have the following Python code:
import numpy as np
import itertools as it
ref_list = [0, 1, 2]
p = it.permutations(ref_list)
transpose_list = tuple(p)
#print('transpose_list', transpose_list)
na = nb = nc = 2
A = np.zeros((na,nb,nc))
n = 1
for la in range(na):
    for lb in range(nb):
        for lc in range(nc):
            A[la,lb,lc] = n
            n = n + 1
factor_list = [(i+1)*0.0 for i in range(6)]
factor_list[0] = 0.1
factor_list[1] = 0.2
factor_list[2] = 0.3
factor_list[3] = 0.4
sum_A = np.zeros((na,nb,nc))
for m, t in enumerate(transpose_list):
    if abs(factor_list[m]) < 1.e-3:
        continue
    factor_list[m] * np.transpose(A, transpose_list[m])
    print('inter', m, t, factor_list[m], np.transpose(A, transpose_list[m])[0,0,1])
B = np.transpose(A, (0, 2, 1))
C = np.transpose(A, (1, 2, 0))
for la in range(na):
    for lb in range(nb):
        for lc in range(nc):
            print(la,lb,lc,'A',A[la,lb,lc],'B',B[la,lb,lc],'C',C[la,lb,lc])
The result is
inter 0 (0, 1, 2) 0.1 2.0
inter 1 (0, 2, 1) 0.2 3.0
inter 2 (1, 0, 2) 0.3 2.0
inter 3 (1, 2, 0) 0.4 5.0
0 0 0 A 1.0 B 1.0 C 1.0
0 0 1 A 2.0 B 3.0 C 5.0
0 1 0 A 3.0 B 2.0 C 2.0
0 1 1 A 4.0 B 4.0 C 6.0
1 0 0 A 5.0 B 5.0 C 3.0
1 0 1 A 6.0 B 7.0 C 7.0
1 1 0 A 7.0 B 6.0 C 4.0
1 1 1 A 8.0 B 8.0 C 8.0
My question is, why do inter 1 and inter 3 get 3.0 and 5.0? The objective is to obtain P(A)[0,0,1].
For inter 1 the permutation is (0, 2, 1); I thought applying (0,2,1) to [0,0,1] gives [0,1,0].
For inter 3 the permutation is (1, 2, 0); I thought applying (1,2,0) to [0,0,1] gives [0,1,0].
So the values should be the same, but the outputs are not (3.0 and 5.0). Apparently I misunderstood np.transpose. What is the correct way to understand what happens inside np.transpose?
More specifically, from How does numpy.transpose work for this example?, Anand S Kumar's answer
I tried to think from both (0, 2, 1) and (1, 2, 0), both lead to
(0,0,0) -> (0,0,0)
(0,0,1) -> (0,1,0)
I guess it is related to the inverse of the permutation, but I am not sure why.
A more direct way of making your A:
In [29]: A = np.arange(1,9).reshape(2,2,2)
In [30]: A
Out[30]:
array([[[1, 2],
[3, 4]],
[[5, 6],
[7, 8]]])
The transposes:
In [31]: B = np.transpose(A, (0, 2, 1))
...: C = np.transpose(A, (1, 2, 0))
In [32]: B
Out[32]:
array([[[1, 3],
[2, 4]],
[[5, 7],
[6, 8]]])
In [33]: C
Out[33]:
array([[[1, 5],
[2, 6]],
[[3, 7],
[4, 8]]])
two of the cases:
In [35]: A[0,0,1], B[0,1,0],C[0,1,0]
Out[35]: (2, 2, 2)
In [36]: A[1,0,0], B[1,0,0], C[0,0,1]
Out[36]: (5, 5, 5)
It's easy to match A and B by just swapping the last 2 indices. It's tempting to just swap the 1st and 3rd for C, but that's wrong. When the 1st is moved to the end, the others shift over without changing their order:
In [38]: for la in range(na):
...: for lb in range(nb):
...: for lc in range(nc):
...: print(la,lb,lc,'A',A[la,lb,lc],'B',B[la,lc,lb],'C',C[lb,lc,la])
0 0 0 A 1 B 1 C 1
0 0 1 A 2 B 2 C 2
0 1 0 A 3 B 3 C 3
0 1 1 A 4 B 4 C 4
1 0 0 A 5 B 5 C 5
1 0 1 A 6 B 6 C 6
1 1 0 A 7 B 7 C 7
1 1 1 A 8 B 8 C 8
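To spell out the general rule (a sketch of my own check, not part of the answer above): np.transpose(A, axes)[idx] equals A indexed by idx permuted with the inverse of axes, which is exactly the inverse-permutation idea guessed at in the question:
import numpy as np

A = np.arange(1, 9).reshape(2, 2, 2)
for axes in [(0, 2, 1), (1, 2, 0)]:
    T = np.transpose(A, axes)
    inv = np.argsort(axes)                       # inverse permutation of axes
    for idx in np.ndindex(T.shape):
        # T[i0, i1, i2] == A[idx reordered by the inverse permutation]
        assert T[idx] == A[tuple(idx[k] for k in inv)]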
I have the following code to create a df from a list comprehension within a loop. However, the output is not what I want.
I would like to create a new column for each group in the list; in this example, 3 groups implies 3 columns.
Input:
t = [x * .001 for x in range(2)]
l = [[10, 2, 40], [20, 4, 80], [30, 6, 160]]
tmp = pd.DataFrame([], dtype=object)
for i in range(len(l)):
    l1 = [l[i][1]*l[i][0]*l[i][2]*t[j] for j in range(len(t))]
    tmp = tmp.append(l1, ignore_index=False)
Output:
tmp=
0
0 0.0
1 0.8
0 0.0
1 6.4
0 0.0
1 28.8
Desired Output:
0.0 0.0 0.0
0.8 6.4 28.8
How can I get the above desired output?
I believe you can create lists and then call the DataFrame constructor to improve performance:
t = [x * .001 for x in range(2)]
l = [[10, 2, 40], [20, 4, 80], [30, 6, 160]]
tmp = []
for i in range(len(l)):
    l1 = [l[i][1]*l[i][0]*l[i][2]*t[j] for j in range(len(t))]
    print(l1)
    tmp.append(l1)
df = pd.DataFrame(tmp, dtype=object).T
print(df)
0 1 2
0 0 0 0
1 0.8 6.4 28.8
If you need to use DataFrame.append:
t = [x * .001 for x in range(2)]
l = [[10, 2, 40], [20, 4, 80], [30, 6, 160]]
tmp = pd.DataFrame([], dtype=object)
for i in range(len(l)):
    l1 = [l[i][1]*l[i][0]*l[i][2]*t[j] for j in range(len(t))]
    print(l1)
    tmp = tmp.append([l1])
df = tmp.T
df.columns = range(len(df.columns))
print(df)
0 1 2
0 0.0 0.0 0.0
1 0.8 6.4 28.8
You can use concat instead of append:
tmp = pd.DataFrame()
for i in range(len(l)):
    l1 = [l[i][1]*l[i][0]*l[i][2]*t[j] for j in range(len(t))]
    l1 = pd.DataFrame(l1)
    tmp = pd.concat([tmp, l1], axis=1)
If you want to make your code a little cleaner and more readable, I suggest using a double list comprehension in combination with the numpy.prod and numpy.array functions.
import pandas as pd
import numpy as np
t = [x * .001 for x in range(2)]
l = [[10, 2, 40], [20, 4, 80], [30, 6, 160]]
tmp = pd.DataFrame(
    np.array(
        [
            np.prod(np.array(i)) * j
            for j in t
            for i in l
        ]
    ).reshape(len(t), len(l))
)
The result looks like this:
>>> print(tmp)
0 1 2
0 0.0 0.0 0.0
1 0.8 6.4 28.8
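As an aside (my own sketch, not part of the answers above), the same table is just the outer product of the scale factors with the per-group products, so numpy.outer builds it in one line:
import numpy as np
import pandas as pd

t = [x * .001 for x in range(2)]
l = [[10, 2, 40], [20, 4, 80], [30, 6, 160]]

# rows correspond to the t values, columns to the per-group products
tmp = pd.DataFrame(np.outer(t, [np.prod(x) for x in l]))
print(tmp)
#      0    1     2
# 0  0.0  0.0   0.0
# 1  0.8  6.4  28.8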
I have a time-series DataFrame with columns [timestamp, Digital_Data].
Could you guide me on how to remove all rows where Digital_Data is np.nan for more than three consecutive occurrences? A data sample is below.
Sorry, I am not sure how to add a table here; it turns into an image when I copy and paste it from Excel.
Sample Data
There MUST be a pythonic way to solve this, or even a solution provided by pandas itself, and I encourage you to search! But just in case you urgently need a solution, here is how I would solve it:
1. Example
x = [1, 2, np.nan, np.nan, np.nan, np.nan, 2, 1, np.nan, np.nan, 3]
y = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11]
df = pd.DataFrame({'x': x, 'y': y})
output is
x y
0 1.0 1
1 2.0 2
2 NaN 3
3 NaN 4
4 NaN 5
5 NaN 6
6 2.0 7
7 1.0 8
8 NaN 9
9 NaN 10
10 3.0 11
2. Get NaN indices
ind = df[df.x.isna()].index.tolist()
3. Get the blocks of adjacent NaN indices
I create an empty holder inds_to_delete and fill it with blocks of adjacent indices. I check adjacency by checking whether element i is 1 more than element i-1:
# empty holder for the blocks of adjacent indices
inds_to_delete = []
# first element by default in temp
temp = [ind[0]]
for i in range(1, len(ind)):
    try:
        assert ind[i] == ind[i-1] + 1
        # if condition holds, append to temp
        temp.append(ind[i])
    except AssertionError:
        # if condition doesn't hold, we have a break, append temp to holder
        inds_to_delete.append(temp)
        # restart temp for the next block
        temp = [ind[i]]
# last block of the series also appended to the holder
inds_to_delete.append(temp)
output of inds_to_delete
[[2, 3, 4, 5], [8, 9]]
4. Keep blocks with length more than 2 and flatten them
inds_to_delete = [i for i in inds_to_delete if len(i)>2]
>>> [[2, 3, 4, 5]]
inds_to_delete = [i for j in inds_to_delete for i in j]
>>> [2, 3, 4, 5]
if inds_to_delete is [[1, 2, 3], [6, 7, 8]] then final line makes it: [1, 2, 3, 6, 7, 8]
5. Drop from dataframe
df.drop(inds_to_delete, inplace=True)
output is
x y
0 1.0 1
1 2.0 2
6 2.0 7
7 1.0 8
8 NaN 9
9 NaN 10
10 3.0 11
(maybe this solution can be awarded by SO as the most unpythonic solution)
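For what it's worth, here is a more pandas-idiomatic sketch (my own suggestion, using the same example data): label each consecutive run with a cumulative sum over the change points, then drop the rows belonging to NaN runs longer than the threshold:
import numpy as np
import pandas as pd

x = [1, 2, np.nan, np.nan, np.nan, np.nan, 2, 1, np.nan, np.nan, 3]
y = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11]
df = pd.DataFrame({'x': x, 'y': y})

is_na = df['x'].isna()
run_id = (is_na != is_na.shift()).cumsum()          # id of each consecutive run
run_len = is_na.groupby(run_id).transform('size')   # length of the run each row belongs to
result = df[~(is_na & (run_len > 2))]               # same threshold as above: drop NaN runs longer than 2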
Thanks Alireza, and as you said, I hope there will be a pythonic way to solve this.
I have temporarily fixed it with the code below, assuming the threshold for NaN removal is more than 15:
df = pd.DataFrame(list(zip(x, y)), columns=['TimeStamp', 'FHR']).set_index('TimeStamp', drop=True)
df = df.resample('S').mean()
TimeStampToRemove = []
fhrtoremove = []
df1 = df
for i, row in enumerate(df.values):
    fhr = df['FHR'][i]
    if np.isnan(fhr):
        TimeStampToRemove.append(df.index[i])
        fhrtoremove.append(fhr)
    else:
        if len(TimeStampToRemove) > 15:
            df1toRemove = pd.DataFrame(list(zip(TimeStampToRemove, fhrtoremove)), columns=['TimeStamp', 'FHR']).set_index('TimeStamp', drop=True)
            TimeStampToRemove.clear()
            fhrtoremove.clear()
            df1 = df1.drop(df1toRemove.index.tolist())
if len(TimeStampToRemove) > 0:
    df1toRemove = pd.DataFrame(list(zip(TimeStampToRemove, fhrtoremove)), columns=['TimeStamp', 'FHR']).set_index('TimeStamp', drop=True)
    df1 = df1.drop(df1toRemove.index.tolist())
    TimeStampToRemove.clear()
    fhrtoremove.clear()
How can I aggregate to get the average of b within each group a, while excluding the current row (the target result is in column c)?
a b c
1 1 0.5 # (avg of 0 & 1, excluding 1)
1 1 0.5 # (avg of 0 & 1, excluding 1)
1 0 1 # (avg of 1 & 1, excluding 0)
2 1 0.5 # (avg of 0 & 1, excluding 1)
2 0 1 # (avg of 1 & 1, excluding 0)
2 1 0.5 # (avg of 0 & 1, excluding 1)
3 1 0.5 # (avg of 0 & 1, excluding 1)
3 0 1 # (avg of 1 & 1, excluding 0)
3 1 0.5 # (avg of 0 & 1, excluding 1)
Data dump:
import pandas as pd
data = pd.DataFrame([[1, 1, 0.5], [1, 1, 0.5], [1, 0, 1], [2, 1, 0.5], [2, 0, 1],
[2, 1, 0.5], [3, 1, 0.5], [3, 0, 1], [3, 1, 0.5]],
columns=['a', 'b', 'c'])
Suppose a group has values x_1, ..., x_n.
The average of the entire group would be
m = (x_1 + ... + x_n)/n
The sum of the group without x_i would be
(m*n - x_i)
The average of the group without x_i would be
(m*n - x_i)/(n-1)
Therefore, you could compute the desired column of values with
import pandas as pd
df = pd.DataFrame([[1, 1, 0.5], [1, 1, 0.5], [1, 0, 1], [2, 1, 0.5], [2, 0, 1],
[2, 1, 0.5], [3, 1, 0.5], [3, 0, 1], [3, 1, 0.5]],
columns=['a', 'b', 'c'])
grouped = df.groupby(['a'])
n = grouped['b'].transform('count')
mean = grouped['b'].transform('mean')
df['result'] = (mean*n - df['b'])/(n-1)
which yields
In [32]: df
Out[32]:
a b c result
0 1 1 0.5 0.5
1 1 1 0.5 0.5
2 1 0 1.0 1.0
3 2 1 0.5 0.5
4 2 0 1.0 1.0
5 2 1 0.5 0.5
6 3 1 0.5 0.5
7 3 0 1.0 1.0
8 3 1 0.5 0.5
In [33]: assert df['result'].equals(df['c'])
Per the comments below, in the OP's actual use case, the DataFrame's a column
contains strings:
def make_random_str_array(letters, strlen, size):
    return (np.random.choice(list(letters), size*strlen)
              .view('|S{}'.format(strlen)))

N = 3*10**6
df = pd.DataFrame({'a': make_random_str_array(letters='ABCD', strlen=10, size=N),
                   'b': np.random.randint(10, size=N)})
so that there are about a million unique values in df['a'] out of 3 million
total:
In [87]: uniq, key = np.unique(df['a'], return_inverse=True)
In [88]: len(uniq)
Out[88]: 988337
In [89]: len(df)
Out[89]: 3000000
In this case the calculation above requires (on my machine) about 11 seconds:
In [86]: %%timeit
   ....: grouped = df.groupby(['a'])
   ....: n = grouped['b'].transform('count')
   ....: mean = grouped['b'].transform('mean')
   ....: df['result'] = (mean*n - df['b'])/(n-1)
   ....:
1 loops, best of 3: 10.5 s per loop
Pandas converts string-valued columns to object dtype. But we could convert the DataFrame column to a NumPy array with a fixed-width string dtype, and then group according to those values.
Here is a benchmark showing that if we convert the Series with object dtype to a NumPy array with a fixed-width string dtype, the calculation requires less than 2 seconds:
In [97]: %%timeit
   ....: grouped = df.groupby(df['a'].values.astype('|S4'))
   ....: n = grouped['b'].transform('count')
   ....: mean = grouped['b'].transform('mean')
   ....: df['result'] = (mean*n - df['b'])/(n-1)
   ....:
1 loops, best of 3: 1.39 s per loop
Beware that you need to know the maximum length of the strings in df['a'] to choose the appropriate fixed-width dtype. In the example above, all the strings have length 4, so |S4 works. If you use |Sn for some integer n and n is smaller than the longest string, then those strings will be silently truncated, with no error or warning. This could potentially lead to grouping values which should not be grouped together. Thus, the onus is on you to choose the correct fixed-width dtype.
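A quick illustration of that pitfall (my own toy example with hypothetical keys): two distinct keys collapse into one when the dtype is too narrow:
import numpy as np

keys = np.array(['groupA1', 'groupA2'], dtype=object)
print(keys.astype('|S6'))   # [b'groupA' b'groupA'] -- both keys truncate to the same value
print(keys.astype('|S7'))   # [b'groupA1' b'groupA2'] -- wide enough, the keys stay distinct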
You could use
dtype = '|S{}'.format(df['a'].str.len().max())
grouped = df.groupby(df['a'].values.astype(dtype))
to ensure the conversion uses the correct dtype.
You can calculate the statistics manually by iterating group by group:
# Set up input
import pandas as pd
df = pd.DataFrame([
    [1, 1, 0.5], [1, 1, 0.5], [1, 0, 1],
    [2, 1, 0.5], [2, 0, 1], [2, 1, 0.5],
    [3, 1, 0.5], [3, 0, 1], [3, 1, 0.5]
], columns=['a', 'b', 'c'])
df
a b c
0 1 1 0.5
1 1 1 0.5
2 1 0 1.0
3 2 1 0.5
4 2 0 1.0
5 2 1 0.5
6 3 1 0.5
7 3 0 1.0
8 3 1 0.5
# Perform grouping, excluding the current row
results = []
grouped = df.groupby(['a'])
for key, group in grouped:
    for idx, row in group.iterrows():
        # The group excluding the current row
        group_other = group.drop(idx)
        avg = group_other['b'].mean()
        results.append(row.tolist() + [avg])
# Compare our results with what is expected
results_df = pd.DataFrame(
results, columns=['a', 'b', 'c', 'c_new']
)
results_df
a b c c_new
0 1 1 0.5 0.5
1 1 1 0.5 0.5
2 1 0 1.0 1.0
3 2 1 0.5 0.5
4 2 0 1.0 1.0
5 2 1 0.5 0.5
6 3 1 0.5 0.5
7 3 0 1.0 1.0
8 3 1 0.5 0.5
This way you can use any statistic you want.
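For example (my own illustration), the same loop with the median instead of the mean:
for key, group in df.groupby(['a']):
    for idx, row in group.iterrows():
        group_other = group.drop(idx)           # the group without the current row
        stat = group_other['b'].median()        # any statistic can be plugged in here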