I have a DataFrame that looks like this:
Symbols Count
A 3
A 1
A 2
A 4
B 1
B 3
B 9
C 2
C 1
C 3
What I want to do using pandas is to identify duplicate values in the "Count" column and count how many times each pair of Symbols intersects on those duplicates.
By this I mean: if a Count value appears under two different Symbols, those two Symbols are counted as having one intersection between them, because they share that Count value.
Something like this:
Symbol Symbol Number of Intersections
A B 2
B A 2
C A 3
.....
I'm sure there is a Pythonic pandas way of doing this, but it's not coming to mind.
Let's use merge to do a self-merge, then query and groupby:
df_selfmerge = df.merge(df, on='Count', how="inner").query('Symbols_x != Symbols_y')

(df_selfmerge.groupby(['Symbols_x', 'Symbols_y'])['Count']
             .count()
             .reset_index()
             .rename(columns={'Symbols_x': 'Symbol',
                              'Symbols_y': 'Symbol',
                              'Count': 'Number of Intersections'}))
EDIT: Using size() is safer, just in case there are NaN values:
(df_selfmerge.groupby(['Symbols_x', 'Symbols_y'])['Count']
             .size()
             .reset_index(name='Number of Intersections')
             .rename(columns={'Symbols_x': 'Symbol',
                              'Symbols_y': 'Symbol'}))
Output:
Symbol Symbol Number of Intersections
0 A B 2
1 A C 3
2 B A 2
3 B C 2
4 C A 3
5 C B 2
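For reference, here is a self-contained sketch of the whole pipeline using the sample data from the question (how df is constructed below is only an assumption about how the original data is stored):
import pandas as pd

# Sample data from the question
df = pd.DataFrame({'Symbols': ['A', 'A', 'A', 'A', 'B', 'B', 'B', 'C', 'C', 'C'],
                   'Count':   [3, 1, 2, 4, 1, 3, 9, 2, 1, 3]})

# Pair each row with every other row sharing the same Count, drop self-pairs
df_selfmerge = df.merge(df, on='Count', how='inner').query('Symbols_x != Symbols_y')

# Count the pairs per (Symbol, Symbol) combination
result = (df_selfmerge.groupby(['Symbols_x', 'Symbols_y'])
                      .size()
                      .reset_index(name='Number of Intersections')
                      .rename(columns={'Symbols_x': 'Symbol', 'Symbols_y': 'Symbol'}))
print(result)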
I have a dataset of 100 rows. I want to split them into chunks of 4 and then perform operations on each chunk, i.e., first perform an operation on the first four rows, then on the next four rows, and so on.
Note: the rows are independent of each other.
I don't know how to do it. Can somebody please help me? I would be extremely thankful.
I will divide the df into chunks of 2 rows (a simple example) and make a list of dfs.
Example
df = pd.DataFrame(list('ABCDE'), columns=['value'])
df
value
0 A
1 B
2 C
3 D
4 E
Code
Build a grouper for grouping:
grouper = pd.Series(range(0, len(df))) // 2
grouper
0 0
1 0
2 1
3 1
4 2
dtype: int64
Divide into a list:
g = df.groupby(grouper)
dfs = [g.get_group(x) for x in g.groups]
Result (dfs):
[ value
0 A
1 B,
value
2 C
3 D,
value
4 E]
Check
dfs[0]
output:
value
0 A
1 B
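Applied to the original 100-row question, the same idea works with // 4. The per-chunk operation below (a sum) is only a placeholder assumption, since the question does not say which operation is needed:
import pandas as pd
import numpy as np

df = pd.DataFrame({'value': np.arange(100)})   # stand-in for the real 100-row data

grouper = pd.Series(range(len(df))) // 4       # 0,0,0,0,1,1,1,1,2,...
g = df.groupby(grouper)

chunk_results = g.sum()                        # apply an operation to each 4-row chunk
dfs = [g.get_group(x) for x in g.groups]       # or keep the chunks as separate DataFrames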
I have a dataframe with rows that I'd like to shuffle continuously until the value in column B is not identical across any two consecutive rows:
initial dataframe:
A | B
_______
a 1
b 1
c 2
d 3
e 3
Possible outcome:
A | B
_______
b 1
c 2
e 3
a 1
d 3
I made a function scramble meant to do this but I am having trouble passing the newly scrambled dataframe back into the function to test for matching B values:
def scramble(x):
    curr_B = 'nothing'
    for index, row in x.iterrows():
        next_B = row['B']
        if str(next_B) == str(curr_B):
            x = x.sample(frac=1)
            curr_B = next_B
        curr_B = next_B
    return x

df = scramble(df)
I suspect the function is finding the matching values in the next row, but I can't get it to shuffle continuously until there are no two sequential rows with the same B value.
Printing the output yields a dataframe that still shows consecutive rows with the same value in B.
If your goal is to eliminate consecutive duplicates, you can just use groupby and cumcount, then reindex your DataFrame:
df.loc[df.groupby('B').cumcount().sort_values().index]
A B
0 a 1
2 c 2
3 d 3
1 b 1
4 e 3
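To see why this works, look at the intermediate cumcount: it numbers each occurrence of a given B value, so sorting by it interleaves the duplicates:
df.groupby('B').cumcount()

0    0
1    1
2    0
3    0
4    1
dtype: int64
Sorting those values gives the row order 0, 2, 3, 1, 4, which is exactly the reindexing shown above.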
If you actually want randomness, then you can group on cumcount and call shuffle. This should eliminate consecutive dupes to some degree (NOT GUARANTEED) while preserving randomness and still avoiding slow iteration. Here's an example:
np.random.seed(0)
(df.groupby(df.groupby('B').cumcount(), group_keys=False)
   .apply(lambda x: x.sample(frac=1))
   .reset_index(drop=True))
A B
0 d 3
1 a 1
2 c 2
3 b 1
4 e 3
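If you really do need the "keep shuffling until valid" behaviour from the question, a brute-force sketch is to reshuffle the whole frame and check consecutive values with shift. This helper is an assumption (the column name and retry limit are made up), and it can retry many times when duplicates dominate:
def shuffle_until_no_consecutive(df, col='B', max_tries=1000):
    # Reshuffle the whole frame until no two consecutive rows share a value in `col`
    for _ in range(max_tries):
        shuffled = df.sample(frac=1).reset_index(drop=True)
        if not (shuffled[col] == shuffled[col].shift()).any():
            return shuffled
    raise ValueError('no valid ordering found within max_tries')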
I have two dataframes with observations on rows and features (or group membership) on columns, e.g.:
> data_df
a b c
A 1 2 1
B 0 1 3
C 0 0 1
D 2 1 1
E 1 1 1
> mask_df
g1 g2
A 0 1
B 1 0
C 1 0
D 1 0
E 0 1
I want to group and aggregate (by sum) the values in the first dataframe (data_df) conditional on the binary values (masks) in the second dataframe (mask_df). The result should be the following (groups x features):
> aggr_df
a b c
g1 2 2 5
g2 2 3 2
Is there a way in pandas to group the first dataframe (data_df) using the masks contained in a second dataframe (mask_df) in a single command?
You can do this cheaply with dot and groupby:
data_df.groupby(mask_df.dot(mask_df.columns)).sum()
a b c
g1 2 2 5
g2 2 3 2
Where,
mask_df.dot(mask_df.columns)
A g2
B g1
C g1
D g1
E g2
dtype: object
Which works well assuming each row always has exactly one column set to 1.
Notice that this will work even in the case that observations in the first dataframe (data_df) belong to multiple masks in the second dataframe (mask_df).
> pd.concat({x:data_df.mul(mask_df[x],0).sum() for x in mask_df}).unstack()
a b c
g1 2 2 5
g2 2 3 2
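For example, if mask_df were changed (hypothetically) so that A belongs to both groups:
g1 g2
A 1 1
B 1 0
C 1 0
D 1 0
E 0 1
then A's row of data_df is counted in both g1 and g2, and the same expression gives:
a b c
g1 3 4 6
g2 2 3 2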
The best way to do this is to combine the dataframes. You can join them on the index first: df_merge = data_df.merge(mask_df, left_index=True, right_index=True). Then you can just use df_merge for your grouping operations.
I decided to write another answer since:
coldspeed's answer works only with one-hot encodings
W-B's answer cannot be easily parallelized since it runs on dict comprehension
In my case I noticed that I could achieve the same result just by using a dot product of mask_df with data_df:
> mask_df.T.dot(data_df)
In the special case of getting the average instead of the sum, this is achievable by scaling mask_df by the number of ones in each group:
> mask_df.T.dot(data_df).div(mask_df.sum(), axis=0)
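For the sample frames above, this reproduces the aggregation from the question (shown here as a quick check):
> mask_df.T.dot(data_df)
a b c
g1 2 2 5
g2 2 3 2
The averaged version then divides each group's row by its size (3 for g1, 2 for g2).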
Here's a way using a list comprehension:
pd.DataFrame([(data_df.T * mask_df[i]).sum(axis=1) for i in mask_df.columns],
             index=mask_df.columns)
a b c
g1 2 2 5
g2 2 3 2
This is my table:
A B C E
0 1 1 5 4
1 1 1 1 1
2 3 3 8 2
Now, I want to group all rows by columns A and B. Column C should be summed, and for column E I want to use the value from the row where C is max.
I did the first part of grouping A and B and summing C. I did this with:
df = df.groupby(['A', 'B'])['C'].sum()
But at this point, I am not sure how to tell that column E should take the value where C is max.
The end result should look like this:
A B C E
0 1 1 6 4
1 3 3 8 2
Can somebody help me with this last piece?
Thanks!
Using groupby with agg after sorting by C.
In general, if you are applying different functions to different columns, DataFrameGroupBy.agg allows you to pass a dictionary specifying which operation is applied to each column:
df.sort_values('C').groupby(['A', 'B'], sort=False).agg({'C': 'sum', 'E': 'last'})
C E
A B
1 1 6 4
3 3 8 2
By sorting by column C first, and not sorting as part of groupby, we can select the last value of E per group, which will align with the maximum value of C for each group.
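If you want A and B back as ordinary columns, matching the desired end result, just append reset_index():
df.sort_values('C').groupby(['A', 'B'], sort=False).agg({'C': 'sum', 'E': 'last'}).reset_index()
   A  B  C  E
0  1  1  6  4
1  3  3  8  2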
I have two DataFrames, one looks something like this:
df1:
x y Counts
a b 1
a c 3
b c 2
c d 1
The other one has the list of unique values from the first two columns as both its index and its columns:
df2
a b c d
a
b
c
d
What I would like to do is fill in the second DataFrame with values from the first one, wherever the index/column intersection corresponds to a row of the first DataFrame, e.g.:
a b c d
a 0 1 3 0
b 1 0 2 0
c 3 2 0 1
d 0 0 1 0
When I try to use two for loops with a double if-condition, it makes the computer hang (given that the real DataFrame contains more than 1000 rows).
The piece of code I am trying to use (and which is apparently too 'heavy' for the computer to handle):
for i in df2.index:
    for j in df2.columns:
        if (i == df1.x.any() and j == df1.y.any()):
            df2.loc[i, j] = df1.Counts
Important to note: the list of unique values (i.e., the index and columns of the second DataFrame) can be longer than the number of rows in the first DataFrame; in my example they happen to coincide.
If it is of any relevance, the first dataframe basically represents combinations of words (first and second columns) and their occurrences in a text. The occurrences are essentially edge weights.
So I am trying to build a matrix in order to plot a graph via igraph. I chose to first create a DataFrame and then pass its values to igraph as an array. As far as I understand, python-igraph cannot plot a graph from a dataframe, only from a numpy array.
I have tried some of the solutions suggested for similar issues, but nothing has worked so far.
Any suggestions to improve my question are warmly welcomed (it's my first question here).
You can do something like this:
import pandas as pd
#df = pd.read_clipboard()   # the question's first dataframe (df1)
#df2 = df.copy()
df3 = df2.pivot(index='x', columns='y', values='Counts')
print(df3)
print()

new = sorted(set(df3.columns.tolist() + df3.index.tolist()))
df3 = df3.reindex(new, columns=new).fillna(0).applymap(int)
print(df3)
output:
y b c d
x
a 1.0 3.0 NaN
b NaN 2.0 NaN
c NaN NaN 1.0
y a b c d
x
a 0 1 3 0
b 0 0 2 0
c 0 0 0 1
d 0 0 0 0
Stack df2 and fillna with df1:
import numpy as np
import pandas as pd

idx = pd.Index(np.unique(df1[['x', 'y']]))
df2 = pd.DataFrame(index=idx, columns=idx)

df2.stack(dropna=False).fillna(df1.set_index(['x', 'y']).Counts) \
    .unstack().fillna(0).astype(int)
a b c d
a 0 1 3 0
b 0 0 2 0
c 0 0 0 1
d 0 0 0 0