I am working on my undergraduate thesis and I have a question about part of my code. I am trying to calculate the number of times each column pair in a df has the same value. The columns are all binary (0, 1).
The input has this format:
df = pd.DataFrame({"col1": [1, 0, 1], "col2": [0, 0, 1]})
For instance, the number of times col1 and col2 had the same value in the snippet above is 2.
This is the code I have so far:
bl1 = []
bl2 = []
overlap = []
for i in df.iterrows():
    for j in range(len(df.columns)):
        for k in range(j):
            a = df.iloc[j]
            b = df.iloc[k]
            comparison_column = np.where(a == b)
            bl1.append(df.columns[j])
            bl2.append(df.columns[k])
            overlap.append(len(comparison_column[0]))
After combining the lists into a pd.DataFrame, the output looks like this:
Base Learner 1 Base Learner 2 overlap
col1 col2 2
I know that the code does not work because I did a count in Excel and got different results for the overlap count. I suspect that the loops fail to sum the number of times the pair was found across the df.iterrows() part, but I do not know how to fix it. Please give me any suggestions you can. Thanks.
Let's try the magic matrix multiplication:
(df.T @ df) + ((1 - df.T) @ (1 - df))
Output:
Rule21 Rule22 Rule23 Rule24
Rule21 5 5 4 2
Rule22 5 5 4 2
Rule23 4 4 5 3
Rule24 2 2 3 5
Explanation: df.T @ df counts the rows in which the corresponding cells of both columns are 1. Similarly, (1 - df.T) @ (1 - df) counts the rows in which both columns are 0. Their sum is the number of rows on which each pair of columns agrees.
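For instance, a minimal sketch on the small two-column frame from the question; the triu/stack step at the end is my addition (not part of the answer above), just one way to pull the pairwise counts into the OP's "Base Learner 1 / Base Learner 2 / overlap" layout:
import numpy as np
import pandas as pd

df = pd.DataFrame({"col1": [1, 0, 1], "col2": [0, 0, 1]})

# Rows where both columns are 1, plus rows where both are 0
same = (df.T @ df) + ((1 - df.T) @ (1 - df))
print(same)
#       col1  col2
# col1     3     2
# col2     2     3

# Keep only the upper triangle (each unordered pair once) and stack to long format
mask = np.triu(np.ones(same.shape, dtype=bool), k=1)
overlap = same.where(mask).stack().dropna().rename("overlap")
print(overlap)
# col1  col2    2.0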
I want to create a new dataframe with the values grouped by each column header.
This is the dataset I'm working with: one row per competitorname and binary (0/1) feature columns such as chocolate, fruity, bar and hard.
I essentially want a new dataframe which sums the occurrences of 1 and 0 for each feature (chocolate, fruity, etc.).
I tried this code with groupby and size:
chocolate = data.groupby(["chocolate"]).size()
bar = data.groupby(["bar"]).size()
hard = data.groupby(["hard"]).size()
display(chocolate, bar, hard)
but this only gives me separate counts for each feature, not a single dataframe.
This is the end result I want: one dataframe with a column per feature and one row each for the counts of 0 and 1.
You could try the following:
res = (
    data
    .drop(columns="competitorname")
    .melt().value_counts()
    .unstack()
    .fillna(0).astype("int").T
)
Eliminate the columns that aren't relevant (I've only seen competitorname, but there could be more).
.melt the dataframe. The result has 2 columns, one with the original column names, and another with the respective 0/1 values.
Now .value_counts gives you a series that essentially contains what you are looking for.
Then you just have to .unstack the first index level (column names) and transpose the dataframe.
Example:
data = pd.DataFrame({
    "competitorname": ["A", "B", "C"],
    "chocolate": [1, 0, 0], "bar": [1, 0, 1], "hard": [1, 1, 1]
})
competitorname chocolate bar hard
0 A 1 1 1
1 B 0 0 1
2 C 0 1 1
Result:
variable bar chocolate hard
value
0 1 2 0
1 2 1 3
Alternative with .pivot_table:
res = (
    data
    .drop(columns="competitorname")
    .melt().value_counts().to_frame()
    .pivot_table(index="value", columns="variable", fill_value=0)
    .droplevel(0, axis=1)
)
PS: Please don't post images; provide a little example (like here) that encapsulates your problem.
I have a pandas dataframe that looks something like this:
df = pd.DataFrame(np.array([[1, 1, 0], [5, 1, 4], [7, 8, 9]]), columns=['a', 'b', 'c'])
a b c
0 1 1 0
1 5 1 4
2 7 8 9
I want to find the first column in which the majority of elements in that column are equal to 1.0.
I currently have the following code, which works, but in practice my dataframes usually have thousands of columns, and this code sits in a performance-critical part of my application, so I wanted to know if there is a way to do this faster.
for col in df.columns:
    amount_votes = len(df[df[col] == 1.0])
    if amount_votes > len(df) / 2:
        return col
In this case, the code should return 'b', since that is the first column in which the majority of elements are equal to 1.0
Try:
print((df.eq(1).sum() > len(df) // 2).idxmax())
Prints:
b
Find columns with more than half of values equal to 1.0
cols = df.eq(1.0).sum().gt(len(df)/2)
Get first one:
cols[cols].head(1)
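Putting the two steps together, a small usage sketch based on the snippet above (note it raises an IndexError if no column qualifies):
import numpy as np
import pandas as pd

df = pd.DataFrame(np.array([[1, 1, 0], [5, 1, 4], [7, 8, 9]]), columns=['a', 'b', 'c'])

# True for columns where more than half of the values equal 1.0
cols = df.eq(1.0).sum().gt(len(df) / 2)

# Label of the first qualifying column (IndexError if none qualifies)
first_col = cols[cols].index[0]
print(first_col)  # b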
I'm trying to translate a technical analysis operator from another, proprietary language to Python using dataframes, but I got stuck on a problem that seems rather simple, yet I can't solve it the pandas way. To simplify the problem, let's take this dataframe as an example:
d = {'value1': [0,1,2,3], 'value2': [4,5,6,7]}
df = pd.DataFrame(data=d)
which results in the following dataframe:
   value1  value2
0       0       4
1       1       5
2       2       6
3       3       7
What I want to achieve is an additional result column, which in pseudocode I would compute in the following way:
value1 = [0, 1, 2, 3]
value2 = [4, 5, 6, 7]
result = []
for i in range(len(value1)):
    calculation = value1[i] * value2[i]
    lookback = value1[i]
    for j in range(lookback):
        calculation -= value2[j]
    result.append(calculation)
How would I tackle this in a dataframe context? The only similar approach I found in the documentation is https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.apply.html, but it doesn't mention interacting with or manipulating the series contained in other columns/rows.
df['result'] = df.value1 * df.value2 - (df.value2.cumsum() - df.value2)
df
Output
value1 value2 result
0 0 4 0
1 1 5 1
2 2 6 3
3 3 7 6
Explanation
We compute the cumulative sum of value2 and subtract the current value2 from it, which gives the sum of all previous value2 entries; this is then subtracted from the product of value1 and value2.
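A quick breakdown of the intermediate values, simply re-running the one-liner step by step on the example data:
import pandas as pd

df = pd.DataFrame({'value1': [0, 1, 2, 3], 'value2': [4, 5, 6, 7]})

# Sum of all *previous* value2 entries for each row
prev_sum = df.value2.cumsum() - df.value2      # [0, 4, 9, 15]

# result[i] = value1[i] * value2[i] - sum(value2[0:i]); here value1[i] == i
df['result'] = df.value1 * df.value2 - prev_sum
print(df.result.tolist())                      # [0, 1, 3, 6]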
This solution should work even if the first column value1 contains arbitrary integers rather than increasing integers from 0, and it follows the pseudocode provided by the OP.
You should just ensure that every value in value1 is a valid integer for the dataframe (that is, no integer greater than the number of rows, which the pseudocode also requires).
import pandas as pd
d = {'value1': [0, 1, 2, 3], 'value2': [4, 5, 6, 7]}
df = pd.DataFrame(data=d)
csum2 = df["value2"].cumsum()
# sum2 holds the sum of the first `lookback` entries of value2 (lookback = value1 of the row)
df["sum2"] = [csum2[k - 1] if k > 0 else 0 for k in df["value1"]]
df["result"] = df["value1"] * df["value2"] - df["sum2"]
df.drop("sum2", axis=1, inplace=True)
To explain: I store in an additional column sum2 the result of the inner loop of the pseudocode (for j in range(lookback):), so that I can then perform the main operation to get the result column.
At the end df is:
value1 value2 result
0 0 4 0
1 1 5 1
2 2 6 3
3 3 7 6
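As a quick sanity check, here is a sketch with a made-up, non-monotonic value1 (assumed data, not from the question) comparing the vectorised version above against a direct translation of the OP's pseudocode:
import pandas as pd

# Made-up input where value1 is not simply the row index
df = pd.DataFrame({"value1": [2, 0, 3, 1], "value2": [4, 5, 6, 7]})

# Vectorised version (as above)
csum2 = df["value2"].cumsum()
sum2 = [csum2[k - 1] if k > 0 else 0 for k in df["value1"]]
vectorised = (df["value1"] * df["value2"] - sum2).tolist()

# Direct translation of the OP's pseudocode
expected = []
for i in range(len(df)):
    calc = df["value1"][i] * df["value2"][i]
    for j in range(df["value1"][i]):
        calc -= df["value2"][j]
    expected.append(calc)

print(vectorised)              # [-1, 0, 3, 3]
print(vectorised == expected)  # True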
With this df as the base I want the following output: everything should be aggregated by column 0, all strings from column 1 should be kept, and the numbers from column 2 should be summed whenever the strings in column 1 have the same name.
With the following code I could aggregate the strings, but without summing the numbers:
df2 = df1.groupby([0]).agg(lambda x: ','.join(set(x))).reset_index()
df2
Avoid an arbitrary number of columns
Your desired output suggests you want an arbitrary number of columns, dependent on the number of values in column 1 for each group in column 0. This is anti-pandas, which is strongly geared towards an arbitrary number of rows; hence series-wise operations are preferred.
So you can just use groupby + sum to store all the information you require.
df = pd.DataFrame({0: ['2008-04_E.pdf']*3,
                   1: ['Mat1', 'Mat2', 'Mat2'],
                   2: [3, 1, 1]})
df_sum = df.groupby([0, 1]).sum().reset_index()
print(df_sum)
0 1 2
0 2008-04_E.pdf Mat1 3
1 2008-04_E.pdf Mat2 2
But if you insist...
If you insist on your unusual requirement, you can achieve it as follows via df_sum calculated as above.
key = df_sum.groupby(0)[1].cumcount().add(1).map('Key{}'.format)
res = df_sum.set_index([0, key]).unstack().reset_index()
res.columns = res.columns.droplevel(0)
print(res)
Key1 Key2 Key1 Key2
0 2008-04_E.pdf Mat1 Mat2 3 2
This seems like a 2-step process. It also requires that each group in column 0 has the same number of unique elements in column 1. First groupby the columns you want grouped:
df_grouped = df.groupby([0,1]).sum().reset_index()
Then reshape to the form you want:
def group_to_row(group):
    group = group.sort_values(1)
    output = []
    for i, row in group[[1, 2]].iterrows():
        output += row.tolist()
    return pd.DataFrame(data=[output])
df_output = df_grouped.groupby(0).apply(group_to_row).reset_index()
This is untested, but it is also quite a non-standard shape, so unfortunately I don't think there is a built-in pandas function for it.
I have a huge dataframe which has values and blanks/NA's in it. I want to remove the blanks from the dataframe and move the next values up in the column. Consider below sample dataframe.
import pandas as pd
import numpy as np
df = pd.DataFrame(np.random.randn(5,4))
df.iloc[1, 2] = np.nan
df.iloc[0, 1] = np.nan
df.iloc[2, 1] = np.nan
df.iloc[2, 0] = np.nan
df
0 1 2 3
0 1.857476 NaN -0.462941 -0.600606
1 0.000267 -0.540645 NaN 0.492480
2 NaN NaN -0.803889 0.527973
3 0.566922 0.036393 -1.584926 2.278294
4 -0.243182 -0.221294 1.403478 1.574097
I want my output to be as below
0 1 2 3
0 1.857476 -0.540645 -0.462941 -0.600606
1 0.000267 0.036393 -0.803889 0.492480
2 0.566922 -0.221294 -1.584926 0.527973
3 -0.243182 1.403478 2.278294
4 1.574097
I want the NaN to be removed and the next value to move up. df.shift was not helpful. I tried with multiple loops and if statements and achieved the desired result, but is there a better way to get it done?
You can use apply with dropna:
np.random.seed(100)
df = pd.DataFrame(np.random.randn(5,4))
df.iloc[1, 2] = np.nan
df.iloc[0, 1] = np.nan
df.iloc[2, 1] = np.nan
df.iloc[2, 0] = np.nan
print (df)
0 1 2 3
0 -1.749765 NaN 1.153036 -0.252436
1 0.981321 0.514219 NaN -1.070043
2 NaN NaN -0.458027 0.435163
3 -0.583595 0.816847 0.672721 -0.104411
4 -0.531280 1.029733 -0.438136 -1.118318
df1 = df.apply(lambda x: pd.Series(x.dropna().values))
print (df1)
0 1 2 3
0 -1.749765 0.514219 1.153036 -0.252436
1 0.981321 0.816847 -0.458027 -1.070043
2 -0.583595 1.029733 0.672721 0.435163
3 -0.531280 NaN -0.438136 -0.104411
4 NaN NaN NaN -1.118318
And then, if you need to replace the remaining NaN with empty strings, be aware that this creates mixed values (strings together with numerics), so some functions can break:
df1 = df.apply(lambda x: pd.Series(x.dropna().values)).fillna('')
print (df1)
0 1 2 3
0 -1.74977 0.514219 1.15304 -0.252436
1 0.981321 0.816847 -0.458027 -1.070043
2 -0.583595 1.02973 0.672721 0.435163
3 -0.53128 -0.438136 -0.104411
4 -1.118318
A numpy approach
The idea is to sort each column by np.isnan so that the NaNs are put last. I use kind='mergesort' to preserve the order within the non-NaN values. Finally, I slice the array and reassign it, and follow this up with a fillna.
v = df.values
i = np.arange(v.shape[1])
a = np.isnan(v).argsort(0, kind='mergesort')
v[:] = v[a, i]
print(df.fillna(''))
0 1 2 3
0 1.85748 -0.540645 -0.462941 -0.600606
1 0.000267 0.036393 -0.803889 0.492480
2 0.566922 -0.221294 -1.58493 0.527973
3 -0.243182 1.40348 2.278294
4 1.574097
If you didn't want to alter the dataframe in place
v = df.values
i = np.arange(v.shape[1])
a = np.isnan(v).argsort(0, kind='mergesort')
pd.DataFrame(v[a, i], df.index, df.columns).fillna('')
The point of this is to leverage numpy's speed.
Naive time test (timing plot not reproduced here).
Adding on to piRSquared's solution:
This shifts all the values to the left instead of up.
If not all values are numbers, use pd.isnull.
v = df.values
a = [[n]*v.shape[1] for n in range(v.shape[0])]
b = pd.isnull(v).argsort(axis=1, kind = 'mergesort')
# a is a matrix used to reference the row index,
# b is a matrix used to reference the column index
# taking an entry from a and the respective entry from b (Same index),
# we have a position that references an entry in v
v[a, b]
A bit of explanation:
a is a list of length v.shape[0], and it looks something like this:
[[0, 0, 0, 0],
[1, 1, 1, 1],
[2, 2, 2, 2],
[3, 3, 3, 3],
[4, 4, 4, 4],
...
What happens here is that v is m x n, and I have made both a and b m x n as well. We pair up every entry (i, j) of a with the corresponding entry of b and use them as row and column indices into v: the result at position (i, j) is the element of v at row a[i][j] and column b[i][j]. So if a and b both looked like the matrix above, then v[a, b] would return a matrix whose first row contains n copies of v[0][0], whose second row contains n copies of v[1][1], and so on.
In piRSquared's solution, i is a 1-D array rather than a matrix, so it is broadcast across v.shape[0] rows, i.e. reused once for every row. Similarly, we could have done:
a = [[n] for n in range(v.shape[0])]
# which looks like
# [[0],[1],[2],[3]...]
# since we are trying to indicate the row indices of the matrix v as opposed to
# [0, 1, 2, 3, ...] which refers to column indices
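A tiny, self-contained demonstration of that broadcasting point, using a made-up 3x2 array:
import numpy as np

v = np.array([[10, 11],
              [20, 21],
              [30, 31]])

a = [[0, 0], [1, 1], [2, 2]]   # full matrix of row indices
b = [[0, 1], [0, 1], [0, 1]]   # full matrix of column indices
i = np.arange(v.shape[1])      # 1-D column indices, broadcast across every row

# Indexing with the full matrices and with the broadcast 1-D index picks the same elements
print(np.array_equal(v[a, b], v[np.array(a), i]))  # True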
Let me know if anything is unclear,
Thanks :)
As a pandas beginner I wasn't immediately able to follow the reasoning behind @jezrael's
df.apply(lambda x: pd.Series(x.dropna().values))
but I figured out that it works by resetting the index of each column. df.apply (by default) works column by column, treating each column as a Series. Using .dropna() removes the NaNs but does not change the index of the remaining numbers, so when the column is put back into the dataframe the numbers return to their original positions (their indices are unchanged) and the gaps are filled with NaN again, recreating the original dataframe and achieving nothing.
By resetting the index of the column, in this case by changing the series to an array (using .values) and back to a series (using pd.Series), only the empty spaces after all the numbers (i.e. at the bottom of the column) are filled with NaN. The same can be accomplished by
df.apply(lambda x: x.dropna().reset_index(drop = True))
drop=True for reset_index keeps the old index from becoming a new column.
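For completeness, a quick check (reusing the seeded example from @jezrael's answer) that the two spellings produce the same frame:
import numpy as np
import pandas as pd

np.random.seed(100)
df = pd.DataFrame(np.random.randn(5, 4))
df.iloc[1, 2] = np.nan
df.iloc[0, 1] = np.nan
df.iloc[2, 1] = np.nan
df.iloc[2, 0] = np.nan

via_values = df.apply(lambda x: pd.Series(x.dropna().values))
via_reset = df.apply(lambda x: x.dropna().reset_index(drop=True))
print(via_values.equals(via_reset))  # True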
I would have posted this as a comment on @jezrael's answer but my rep isn't high enough!