Dataframe with intersection among rows in pandas dataframe? - python

I have a challenge in a pandas dataframe.
Basically, I have 2 columns. In the first one, I have 3 different classes and in the second a list of students that are enrolled in the subject. The example is as follow:
df = pd.DataFrame({'Class': ['1A', '2B', '2C'],
'Students': [['Alice', 'Philips', 'John'],
['Philips', 'John', 'Anna', 'William'],
['Arthur', 'Alice', 'Anna', 'William']]
})
I would like to have a second dataframe with the number of students that are presented in more thant one class. In other words, the intersection between the classes, as follow
result= pd.DataFrame({'Comparison': ['1A-2B','1A-2C', '2B-2C'],
'Intersection size': [2, 1, 2]})
Thank you for your help and attention!

import pandas as pd
import itertools
df = pd.DataFrame({'Class': ['1A', '2B', '2C'],
'Students': [['Alice', 'Philips', 'John'],
['Philips', 'John', 'Anna', 'William'],
['Arthur', 'Alice', 'Anna', 'William']]
})
combination: to generate a combination of all colunms. use itertools.
col=list(itertools.combinations(df.Class,2))
> col Out[68]:
> [('1A','2B'), ('1A', '2C'), ('2B', '2C')]
explode: to form a structured dataframe
df1=df.explode('Students')
write a for
d={}
for c in col:
tmp=df1[(df1['Class']==c[0]) | (df1['Class']==c[1])]
count=len(tmp)-tmp.Students.nunique()
d[str(c[0])+'-'+str(c[1])]=count
The dictionary d has what you want:
d
Out[71]: {'1A-2B': 2, '1A-2C': 1, '2B-2C': 2}

You can try the following:
import pandas as pd
df = pd.DataFrame({
'Class': ['1A', '2B', '2C'],
'Students': [['Alice', 'Philips', 'John'],
['Philips', 'John', 'Anna', 'William'],
['Arthur', 'Alice', 'Anna', 'William']]
})
ddf = df.explode("Students")
ddf = pd.crosstab(ddf["Students"], ddf["Class"])
A = ddf.values
result = pd.DataFrame(A.T # A, index=ddf.columns, columns=ddf.columns)
print(result)
It gives:
Class 1A 2B 2C
Class
1A 3 2 1
2B 2 4 2
2C 1 2 4
The intersection of every row and column gives the number of students taking both classes. Diagonal entries give numbers of students in each individual class.
If you want to get a dataframe listing only combinations of different classes with non-zero intersection values, then the following should work:
import pandas as pd
import numpy as np
df = pd.DataFrame({
'Class': ['1A', '2B', '2C'],
'Students': [['Alice', 'Philips', 'John'],
['Philips', 'John', 'Anna', 'William'],
['Arthur', 'Alice', 'Anna', 'William']]
})
ddf = df.explode("Students")
ddf = pd.crosstab(ddf["Students"], ddf["Class"])
A = ddf.values
result = pd.DataFrame(np.tril(A.T # A, k=-1).T,
index=ddf.columns,
columns=ddf.columns).stack()
result.index = result.index.map(lambda x: f"{x[0]}-{x[1]}")
result[result > 0]
It gives:
1A-2B 2.0
1A-2C 1.0
2B-2C 2.0

Related

How to create new columns using data from the column inside the DataFrame?

I have the next DataFrame:
a = [{'id': 1, 'name': 'AAA', 'pn':"[{'code_1': 'green'}, {'code_2': 'link'}, {'code_3': '10'}]"}, {'id': 2, 'name': 'BBB', 'pn': "[{'code_1': 'blue'}, {'code_2': 'link'}, {'code_3': '15'}]"}]
df = pd.DataFrame(a)
print(df)
I need to create two new columns and for example new column df['color'] = ..., new column df['link'] = ... and it should be like this, thanks:
You could do this :
import ast
df['color'] = df.pn.map(lambda x: ast.literal_eval(x)[0])
df['link'] = df.pn.map(lambda x: ast.literal_eval(x)[1])
ast.literal_eval is used to convert the string into an actual python list and then we select the element we want :)
Edit : In the case that the keys are out of orders :
import ast
df['pn'] = df.pn.map(lambda x: ast.literal_eval(x))
df['color'] = df.pn.map(lambda x: [el for el in x if list(el.keys())[0] == "code_1"])
df['link'] = df.pn.map(lambda x: [el for el in x if list(el.keys())[0] == "code_2"])
First, you transform the values in the pn column in python lists.
Second, you go through the new column and find the dictionary with the right key !
Have a nice day,
Gabriel
Here's a way to do what you're asking:
df2 = ( df.assign(
pn=df.pn.apply(eval)).explode('pn').pn.astype(str)
.str.extract("'([^']*)'[^']*'([^']*)'") )
df2[0] = df2[0].map({'code_1':'color', 'code_2':'link', 'code_3':'code_3'})
df2 = df2.pivot(columns=[0])
df2.columns = df2.columns.droplevel(None)
df2 = ( df2.drop(columns='code_3').assign(
color = df2.color.map(lambda x: "{" + f"'code_1': '{x}'" + "}"),
link = df2.link.map(lambda x: "{" + f"'code_2': '{x}'" + "}")) )
df2 = pd.concat([df[['id','name']], df2], axis = 1)
Output:
id name color link
0 1 AAA {'code_1': 'green'} {'code_2': 'link'}
1 2 BBB {'code_1': 'blue'} {'code_2': 'link'}
My original answer did not take into account the order of the list being different.
I am assuming that the keys in the dictionaries are consistent, i.e. code_1, code_2 etc.
In order to take into acouunt a different ordering of the list you could do this.
list_ = df["pn"].apply(lambda x: x.strip("[]").split(", "))
df["color"] = [dict_ for mapping in list_ for dict_ in mapping if "code_1" in dict_]
df["code"] = [dict_ for mapping in list_ for dict_ in mapping if "code_2" in dict_]
df.loc[:, ["id", "name", "color", "code"]]
If you create a new dataframe with the items out of order
a = [
{'id': 1, 'name': 'AAA', 'pn':"[{'code_1': 'green'}, {'code_2': 'link'}, {'code_3': '10'}]"},
{'id': 2, 'name': 'BBB', 'pn': "[{'code_2': 'link'}, {'code_1': 'blue'}, {'code_3': '15'}]"},
{'id': 3, 'name': 'CCC', 'pn': "[{'code_3': '15'}, {'code_2': 'link'}, {'code_1': 'red'}]"},
]
df = pd.DataFrame(a)
Which would look like
id name pn
0 1 AAA [{'code_1': 'green'}, {'code_2': 'link'}, {'code_3': '10'}]
1 2 BBB [{'code_2': 'link'}, {'code_1': 'blue'}, {'code_3': '15'}]
2 3 CCC [{'code_3': '15'}, {'code_2': 'link'}, {'code_1': 'red'}]
You would get this output:
id name color code
0 1 AAA {'code_1': 'green'} {'code_2': 'link'}
1 2 BBB {'code_1': 'blue'} {'code_2': 'link'}
2 3 CCC {'code_1': 'red'} {'code_2': 'link'}

Apply a function on two pandas tables

I have the following two tables:
>>> df1 = pd.DataFrame(data={'1': ['john', '10', 'john'],
... '2': ['mike', '30', 'ana'],
... '3': ['ana', '20', 'mike'],
... '4': ['eve', 'eve', 'eve'],
... '5': ['10', np.NaN, '10'],
... '6': [np.NaN, np.NaN, '20']},
... index=pd.Series(['ind1', 'ind2', 'ind3'], name='index'))
>>> df1
1 2 3 4 5 6
index
ind1 john mike ana eve 10 NaN
ind2 10 30 20 eve NaN NaN
ind3 john ana mike eve 10 20
df2 = pd.DataFrame(data={'first_n': [4, 4, 3]},
index=pd.Series(['ind1', 'ind2', 'ind3'], name='index'))
>>> df2
first_n
index
ind1 4
ind2 4
ind3 3
I also have the following function that reverses a list and gets the first n non-NA elements:
def get_rev_first_n(row, top_n):
rev_row = [x for x in row[::-1] if x == x]
return rev_row[:top_n]
>>> get_rev_first_n(['john', 'mike', 'ana', 'eve', '10', np.NaN], 4)
['10', 'eve', 'ana', 'mike']
How would I apply this function to the two tables so that it takes in both df1 and df2 and outputs either a list or columns?
df=pd.concat([df1,df2],axis=1)
df.apply(get_rev_first_n,args=[4]) #send args as top_in
axis=0 is run along rows means runs on each column which is the default you don't have to specify it
args=[4] will be passed to second argument of get_rev_first_n
You can try apply with lambda on each row of the data frame, I just concatenate the two df's using concat and applied your method to each row of the resulted dataframe.
Full Code:
import pandas as pd
import numpy as np
def get_rev_first_n(row, top_n):
rev_row = [x for x in row[::-1] if x == x]
return rev_row[1:top_n]
df1 = pd.DataFrame(data={'1': ['john', '10', 'john'],
'2': ['mike', '30', 'ana'],
'3': ['ana', '20', 'mike'],
'4': ['eve', 'eve', 'eve'],
'5': ['10', np.NaN, '10'],
'6': [np.NaN, np.NaN, '20']},
index=pd.Series(['ind1', 'ind2', 'ind3'], name='index'))
df2 = pd.DataFrame(data={'first_n': [4, 4, 3]},
index=pd.Series(['ind1', 'ind2', 'ind3'], name='index'))
df3 = pd.concat([df1, df2.reindex(df1.index)], axis=1)
df = df3.apply(lambda row : get_rev_first_n(row, row['first_n']), axis = 1)
print(df)
Output:
index
ind1 [10, eve, ana]
ind2 [eve, 20, 30]
ind3 [20, 10]
dtype: object

How to loop through a column and compare each value to a list

I have a column in a dataset. I need to compare each value from that column to a list. After comparison, if it satisfies a condition, the value of another column should change.
for example,
List- james, michael, clara
According to the code, if a name in col A is in the list, col B should be 1, else 0.
How to solve this in python
Change B column where value A is in List
Using the loc operator you can easily select the rows where the item in the A column is in your List, and change the B column of these rows.
df.loc[(df["A"].isin(List)), "B"] = 1
Use np.fillna to fill empty cells with zeros.
df.fillna(0, inplace=True)
Full Code
names = ['james', 'randy', 'mona', 'lisa', 'paul', 'clara']
List = ["james", "michael", "clara"]
df = pd.DataFrame(data=names, columns=['A'])
df["B"] = np.nan
df.loc[(df["A"].isin(List)), "B"] = 1
df.fillna(0, inplace=True)
This would be a good time to use np.where()
import pandas as pd
import numpy as np
name_list = ['James', 'Sally', 'Sarah', 'John']
df = pd.DataFrame({
'Names' : ['James', 'Roberts', 'Stephen', 'Hannah', 'John', 'Sally']
})
df['ColumnB'] = np.where(df['Names'].isin(name_list), 1, 0)
df

Pandas: Find unique values from columns to calculate probabilities

I have a large dataset (circa. 200,000 rows x 30 columns) as a CSV. I need to use pandas to pre-process this data. I have included a dummy dataset below to help visualise the problem.
data = {'Batsman':['Tom', 'Nick', 'Nick', 'Nick', 'Tom', 'Nick', 'Nick', 'Pete', 'Pete'],
'Outcome':[1, 0, 1, 'Wide', 'Out', 4, 1, 2, 0],
'Bowler':['Bill', 'Bill', 'Bill', 'Bill', 'Bill', 'Bill', 'Bill', 'Ben', 'Ben']
}
df = pd.DataFrame(data)
df
The goal is to have individual columns that show the probability of each outcome for a batsman & bowler. By way of an example from the dummy dataset, Tom would have a 50% chance of an outcome of '1' or 'Out'
This is calculated by:
Batsman column - The total number of rows with batsman 'X';
Outcome column - The total number of outcomes with 'X';
Point 2. / Point 1. to determine the probability of each outcome;
Repeat the above to determine the Bowler probabilities
The final dataframe from the dummy data should look similar to:
data = {'Batsman':['Tom', 'Nick', 'Nick', 'Nick', 'Tom', 'Nick', 'Nick', 'Pete', 'Pete'],
'zero_prob_bat':[0,0.4,0.4,0.4,0,0.4,0.4,0.5,0.5],
'one_prob_bat':[0.5,0.4,0.4,0.4,0.5,0.4,0.4,0,0],
'two_prob_bat':[0,0,0,0,0,0,0,0.5,0.5],
'three_prob_bat':[0,0,0,0,0,0,0,0,0],
'four_prob_bat':[0,0.2,0.2,0.2,0,0.2,0.2,0,0],
'six_prob_bat':[0,0,0,0,0,0,0,0,0],
'out_prob_bat':[0.5,0,0,0,0.5,0,0,0,0],
'Bowler':['Bill', 'Bill', 'Bill', 'Bill', 'Bill', 'Bill', 'Bill', 'Ben', 'Ben'],
'zero_prob_bowl':[0.1428,0.1428,0.1428,0.1428,0.1428,0.1428,0.1428,0.5,0.5],
'one_prob_bowl':[0.4285,0.4285,0.4285,0.4285,0.4285,0.4285,0.4285,0,0],
'two_prob_bowl':[0,0,0,0,0,0,0,0.5,0.5],
'three_prob_bowl':[0,0,0,0,0,0,0,0,0],
'four_prob_bowl':[0.1428,0.1428,0.1428,0.1428,0.1428,0.1428,0.1428,0,0],
'six_prob_bowl':[0.1428,0.1428,0.1428,0.1428,0.1428,0.1428,0.1428,0,0],
'out_prob_bowl':[0.1428,0.1428,0.1428,0.1428,0.1428,0.1428,0.1428,0,0],
'Outcome':[1, 0, 1, 'Wide', 'Out', 4, 1, 2, 0]
}
One issue is that with my original dataset there are over 600 unique names. I could manually .groupby each unique name in the batsman/bowler columns, but this is not a scaleable solution as new names will continually be added.
I am tempted to:-
.count the number of instances of each unique name for batsman/bowler;
.count the number of different outcomes for each unique batsman/bowler;
Perform a lookup to match the probability next to each batsman/bowler;
However, I am cautious about implementing a lookup function as detailed in the answer here due to my dataset size which will continuously grow. In the past this has also created numerous issues when I have worked with excel/CSVs so I do not want to fall into any similar traps.
If someone could explain how they would go about solving this problem, so that I have something to aim towards, then it would be much appreciated.
Not sure how much this scales with your actual dataset, but I find it hard to think of a better solution than using groupby on the "Batsman" column and then value_counts on the grouped "Outcome" column. Example:
data = {'Batsman':['Tom', 'Nick', 'Nick', 'Nick', 'Tom', 'Nick', 'Nick', 'Pete', 'Pete'],
'Outcome':[1, 0, 1, 'Wide', 'Out', 4, 1, 2, 0],
'Bowler':['Bill', 'Bill', 'Bill', 'Bill', 'Bill', 'Bill', 'Bill', 'Ben', 'Ben']
}
grouped_data = df.groupby('Batsman')['Outcome'].value_counts(normalize=True)
print(grouped_data)
Output:
Batsman Outcome
Nick 1 0.4
0 0.2
4 0.2
Wide 0.2
Pete 0 0.5
2 0.5
Tom 1 0.5
Out 0.5
Name: Outcome, dtype: float64
Note that we did not need to groupby over each unique name manually, since groupby already does that for us.
The same logic can be applied to the "Bowler" column by simply replacing the "Batsman" string in the groupby call.
I think this answers your question...
import pandas as pd
import numpy as np
data = {'Batsman':['Tom', 'Nick', 'Nick', 'Nick', 'Tom', 'Nick', 'Nick', 'Pete', 'Pete'],
'Outcome':[1, 0, 1, 'Wide', 'Out', 4, 1, 2, 0],
'Bowler':['Bill', 'Bill', 'Bill', 'Bill', 'Bill', 'Bill', 'Bill', 'Ben', 'Ben']
}
df = pd.DataFrame(data)
display(df)
batsman = df['Batsman'].unique()
bowler = df['Bowler'].unique()
print(sorted(batsman))
print(sorted(bowler))
final_df = pd.DataFrame()
for man in batsman:
df1 = df[df['Batsman'] == man]
count_man = len(df1)
outcome = df['Outcome'].unique()
count_outcome = len(outcome)
batsman_prob = np.array(count_man/count_outcome)
batsman_df = pd.DataFrame(data=[batman_prob], columns=[man], index=['Batsman'])
final_df = pd.concat([final_df, batsman_df, ], axis=1)
for man in bowler:
df1 = df[df['Bowler'] == man]
count_man = len(df1)
outcome = df['Outcome'].unique()
count_outcome = len(outcome)
bowler_prob = np.array(count_man/count_outcome)
bowler_df = pd.DataFrame(data=[bowler_prob], columns=[man], index=['Bowler'])
final_df = pd.concat([final_df, bowler_df, ], axis=1)
display(final_df)
Here is the output:
Tom Nick Pete Bill Ben
Batsman 0.333333 0.333333 0.333333 NaN NaN
Bowler NaN NaN NaN 1.166667 0.333333

Finding the most frequent occurrences of pairs in a list of lists

I've a dataset that denotes the list of authors of many technical reports. Each report can be authored by one or multiple people:
a = [
['John', 'Mark', 'Jennifer'],
['John'],
['Joe', 'Mark'],
['John', 'Anna', 'Jennifer'],
['Jennifer', 'John', 'Mark']
]
I've to find the most frequent pairs, that is, people that had most collaborations in the past:
['John', 'Jennifer'] - 3 times
['John', 'Mark'] - 2 times
['Mark', 'Jennifer'] - 2 times
etc...
How to do this in Python?
Use a collections.Counter dict with itertools.combinations:
from collections import Counter
from itertools import combinations
d = Counter()
for sub in a:
if len(a) < 2:
continue
sub.sort()
for comb in combinations(sub,2):
d[comb] += 1
print(d.most_common())
[(('Jennifer', 'John'), 3), (('John', 'Mark'), 2), (('Jennifer', 'Mark'), 2), (('Anna', 'John'), 1), (('Joe', 'Mark'), 1), (('Anna', 'Jennifer'), 1)]
most_common() will return the pairings in order of most common to least, of you want the first n most common just pass n d.most_common(n)
import collections
import itertools
a = [
['John', 'Mark', 'Jennifer'],
['John'],
['Joe', 'Mark'],
['John', 'Anna', 'Jennifer'],
['Jennifer', 'John', 'Mark']
]
counts = collections.defaultdict(int)
for collab in a:
collab.sort()
for pair in itertools.combinations(collab, 2):
counts[pair] += 1
for pair, freq in counts.items():
print(pair, freq)
Output:
('John', 'Mark') 2
('Jennifer', 'Mark') 2
('Anna', 'John') 1
('Jennifer', 'John') 3
('Anna', 'Jennifer') 1
('Joe', 'Mark') 1
You can use a set comprehension to create a set of all numbers then use a list comprehension to count the occurrence of the pair names in your sub list :
>>> from itertools import combinations as comb
>>> all_nam={j for i in a for j in i}
>>> [[(i,j),sum({i,j}.issubset(t) for t in a)] for i,j in comb(all_nam,2)]
[[('Jennifer', 'John'), 3],
[('Jennifer', 'Joe'), 0],
[('Jennifer', 'Anna'), 1],
[('Jennifer', 'Mark'), 2],
[('John', 'Joe'), 0],
[('John', 'Anna'), 1],
[('John', 'Mark'), 2],
[('Joe', 'Anna'), 0],
[('Joe', 'Mark'), 1],
[('Anna', 'Mark'), 0]]

Categories