find most frequent pairs in a dataframe - python

Suppose I have a two-column dataframe where the first column is the ID of a meeting and the second is the ID of one of the participants in that meeting. Like this:
meeting_id,person_id
meeting0,person1234
meeting0,person4321
meeting0,person5555
meeting1,person4321
meeting1,person9999
# ... ~1 million rows
I want to find each person's top 15 co-participants. E.g., I want to know which 15 people most frequently participate in meetings with Brad.
As an intermediate step I wrote a script that takes the original dataframe and makes a person-to-person dataframe, like this:
person1234,person4321
person1234,person5555
person4321,person5555
person4321,person9999
...
But I'm not sure this intermediate step is necessary. Also, it's taking forever to run (by my estimate it should take weeks!). Here's the monstrosity:
import pandas as pd
links = []
lic = pd.read_csv('meetings.csv', sep=';', names=['meeting_id', 'person_id'], dtype={'meeting_id': str, 'person_id': str})
grouped = lic.groupby('person_id')
for i, group in enumerate(grouped):
    print(i, 'of', len(grouped))
    person_id = group[0].strip()
    if len(person_id) == 14:
        meetings = set(group[1]['meeting_id'])
        for meeting in meetings:
            lic_sub = lic[lic['meeting_id'] == meeting]
            people = set(lic_sub['person_id'])
            for person in people:
                if person != person_id:
                    tup = (person_id, person)
                    links.append(tup)
df = pd.DataFrame(links)
df.to_csv('links.csv', index=False)
Any ideas?

Here is one way: self-merge on meeting_id, then sort the two person columns so each pair has a canonical order, and drop the duplicates:
import numpy as np
s = df.merge(df, on='meeting_id')
s[['person_id_x', 'person_id_y']] = np.sort(s[['person_id_x', 'person_id_y']].values, 1)
s = s.query('person_id_x != person_id_y').drop_duplicates()
s
meeting_id person_id_x person_id_y
1 meeting0 person1234 person4321
2 meeting0 person1234 person5555
5 meeting0 person4321 person5555
10 meeting1 person4321 person9999
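From there, the actual goal (each person's top 15 co-participants) is a matter of counting how often each pair appears and keeping the 15 largest counts per person. A minimal sketch, assuming the pair dataframe s built above; the person / co_participant column names are just illustrative:
import pandas as pd
# Count how many meetings each unordered pair shares
pair_counts = (s.groupby(['person_id_x', 'person_id_y'])
                .size()
                .reset_index(name='n_meetings'))
# List each pair in both directions so every person shows up in the first column
both = pd.concat([
    pair_counts.rename(columns={'person_id_x': 'person', 'person_id_y': 'co_participant'}),
    pair_counts.rename(columns={'person_id_y': 'person', 'person_id_x': 'co_participant'}),
])
# For each person, keep their 15 most frequent co-participants
top15 = (both.sort_values('n_meetings', ascending=False)
             .groupby('person')
             .head(15))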

Related

How to use pandas to check for list of values from a csv spread sheet while filtering out certain keywords?

Hey guys, this is my first post. I am planning on building an anime recommendation engine using Python. I ran into a problem: I made a list called genre_list which stores the genres that I want to filter from the huge data spreadsheet I was given. I am using the Pandas library, and it has an isin() function to check whether the values of a list are included in the datasheet and filter on that. I am using the function, but it's not able to detect "Action" in the datasheet although it is there. I have a feeling there's something wrong with the data types and I probably have to work around it somehow, but I'm not sure how.
I downloaded my csv file from this link for anyone interested!
https://www.kaggle.com/datasets/marlesson/myanimelist-dataset-animes-profiles-reviews?resource=download
import pandas as pd
df = pd.read_csv('animes.csv')
genre = True
genre_list = []
while genre:
    genre_input = input("What genres would you like to watch?, input \"done\" when done listing!\n")
    if genre_input == "done":
        genre = False
    else:
        genre_list.append(genre_input)
print(genre_list)
df_genre = df[df["genre"].isin(genre_list)]
# df_genre = df["genre"]
print(df_genre)
Output (screenshot): https://i.stack.imgur.com/XZzcc.png
You want to check whether ANY value in your user input list is in each of the list values in the "genre" column. The isin function checks whether your input in its entirety matches a cell value, which is not what you want here. Change that line to this:
df_genre = df[df['genre'].apply(lambda x: any([i in x for i in genre_list]))]
Let me know if you need any more help.
import pandas as pd
df = pd.read_csv('animes.csv')
genre = True
genre_list = []
while genre:
    genre_input = input("What genres would you like to watch?, input \"done\" when done listing!\n")
    if genre_input == "done":
        genre = False
    else:
        genre_list.append(genre_input)
# List of all cells and their genre put into a list
col_list = df["genre"].values.tolist()
temp_list = []
# Each val in the list is compared with the genre_list to see if there is a match
for index, val in enumerate(col_list):
    if all(x in val for x in genre_list):
        # If there is a match, the UID of that cell is added to temp_list
        temp_list.append(df['uid'].iloc[index])
print(temp_list)
# This checks if the UID is contained in the temp_list of UIDs that have these genres
df_genre = df["uid"].isin(temp_list)
new_df = df.loc[df_genre, "title"]
# Prints all Anime with the specified genres
print(new_df)
This is another approach I took, and it works as well. Thanks for all the help :D
To make a selection from a dataframe, you can write this:
df_genre = df.loc[df['genre'].isin(genre_list)]
I've downloaded the file animes.csv from Kaggle and read it into a dataframe. What I found is that the column genre actually contains strings (of lists), not lists. So one way to get what you want would be:
...
m = df["genre"].str.contains(r"'(?:" + "|".join(genre_list) + r")'")
df_genre = df[m]
Result for genre_list = ["Action"]:
uid ... link
3 5114 ... https://myanimelist.net/anime/5114/Fullmetal_A...
4 31758 ... https://myanimelist.net/anime/31758/Kizumonoga...
5 37510 ... https://myanimelist.net/anime/37510/Mob_Psycho...
7 38000 ... https://myanimelist.net/anime/38000/Kimetsu_no...
9 2904 ... https://myanimelist.net/anime/2904/Code_Geass_...
... ... ... ...
19301 10350 ... https://myanimelist.net/anime/10350/Hakuouki_S...
19303 1293 ... https://myanimelist.net/anime/1293/Urusei_Yatsura
19304 150 ... https://myanimelist.net/anime/150/Blood_
19305 4177 ... https://myanimelist.net/anime/4177/Bounen_no_X...
19309 450 ... https://myanimelist.net/anime/450/InuYasha_Mov...
[4215 rows x 12 columns]
If you want to transform the values of the genre column for some reason into lists, then you could do either
df["genre"] = df["genre"].str[1:-1].str.replace("'", "").str.split(r",\s*")
or
df["genre"] = df["genre"].map(eval)
Afterwards
df_genre = df[~df["genre"].map(set(genre_list).isdisjoint)]
would give you the filtered dataframe.
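As a quick check of that last filter, here is a minimal sketch on made-up data (the titles and genre lists below are hypothetical):
import pandas as pd
df = pd.DataFrame({
    "title": ["Show A", "Show B", "Show C"],
    "genre": [["Action", "Drama"], ["Comedy"], ["Action", "Comedy"]],
})
genre_list = ["Action"]
# Keep rows whose genre list shares at least one element with genre_list
df_genre = df[~df["genre"].map(set(genre_list).isdisjoint)]
print(df_genre)  # Show A and Show C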

How to write a function that runs that function on 20 different csv files in Python?

I have a directory of 19 csv files each containing a list of student registration numbers and their names. There are two separate files named quiz1 and quiz2 and both these contain information about all students who took these quizzes with their names and total marks obtained. Marks obtained in each have to be segregated into various columns, along with a column 'noofpresent' that shows their attendance for that particular quiz.
My task is to parse through all these files and create a dataframe that looks like the one shown in the image (it shows 5 batches out of a total of 19).
While I have filled up the relevant fields for Batch4, as shown in the image, I realized that repeating that process for 18 files is insane.
How can I write a program or a function that does all these operations for the remaining 18 batches for both quizzes? I just need an idea of how to go ahead with the automation logic for the remaining 18 files.
An example for batch 9 (say):
This is the code I need to replicate for each of the 19 batches:
import pandas as pd
import numpy as np
spath = 'd:\\a2\\studentlist.csv'
q1path = 'd:\\a2\\quiz\\quiz1.csv'
q2path = 'd:\\a2\\quiz\\quiz2.csv'
b1path = 'd:\\a2\\batchwiselist\\1.csv'
b9path = 'd:\\a2\\batchwiselist\\9.csv'
tpath = 'd:\\a2\\testcasestudent.txt'
# the final dataframe that needs to be created and filled up eventually
idx = pd.MultiIndex.from_product([['batch1', 'batch2', 'batch3', 'batch4', 'batch9'], ['quiz1', 'quiz2']])
cols=['noofpresent', 'lesserthan50', 'between50and60', 'between60and70', 'between70and80', 'greaterthan80']
statdf = pd.DataFrame('-', idx, cols)
# ============BATCH 9===================]
# ----------- QUIZ 1 -----------]
# Master list of students in Batch 9
b9 = pd.read_csv(b9path, usecols=['studentName', 'admissionNumber'])
b9.rename(columns={'studentName' : 'Firstname'}, inplace=True)
# To match column from quiz1.csv to batch9.csv to for merger
# Master list of all who attended Quiz1
q1 = pd.read_csv(q1path, usecols = ['Firstname', 'Grade/10.00', 'State'], na_values = ['-', 'In progress', np.NaN])
q1.dropna(inplace=True)
q1['Grade/10.00'] = q1['Grade/10.00'] * 10
# Multiplying the grades by 10 to mark against 100 instead of 10
# Merge batch9 list of names to list of quiz1 on their firstname column
q1b9 = pd.merge(b9, q1)
q1b9 = q1.loc[q1['Firstname'].isin(b9.Firstname)] # checking if the name exits in either lists
q1b9.reset_index(inplace=True)
#print(q1b9)
lt50 = q1b9.loc[(q1b9['Grade/10.00'] < 50)]
#findout list of students whose grades are lesser than 50
out9q1 = (lt50['Grade/10.00'].count())
# print(out9q1) to just get the count of number of students who got <50 quiz1 from batch9
# Similar process for quiz2 below for batch9.
# -------------------- QUIZ 2 ------------------]
# Master list of all who attended Quiz2
q2 = pd.read_csv(q2path, usecols = ['Firstname', 'Grade/10.00', 'State'], na_values = ['-', 'In progress', np.NaN])
q2.dropna(inplace=True)
q2['Grade/10.00'] = q2['Grade/10.00'] * 10
# Merge B1 to Q2
q2b9 = pd.merge(b9, q2)
q2b9 = q2.loc[q2['Firstname'].isin(b9.Firstname)]
q2b9.reset_index(inplace=True)
q2b9.loc[(q2b9['Grade/10.00'] <= 50)].count()
lt50 = q2b9.loc[(q2b9['Grade/10.00'] < 50)]
out9q2 = (lt50['Grade/10.00'].count())
# print(out9q2)
The above code computes for all students who obtained less than 50 in either quizzes. I have done similar for batch4. I need to replicate this so that one function can do so for all the available remaining (17-18)batches.
In the code below I generate each batch's csv path, load the batches one by one, do all the processing, and save the resulting dataframes in a list of lists like [[batch1_q1_result, batch1_q2_result], [batch2_q1_result, batch2_q2_result] ...]
import numpy as np
import pandas as pd
def doAll(baseBatchPath, numberOfBatches):
    batchResultListAll = []  # this will store the resulting dataframes
    spath = 'd:\\a2\\studentlist.csv'
    q1path = 'd:\\a2\\quiz\\quiz1.csv'
    q2path = 'd:\\a2\\quiz\\quiz2.csv'
    tpath = 'd:\\a2\\testcasestudent.txt'
    # the final dataframe that needs to be created and filled up eventually
    idx = pd.MultiIndex.from_product([['batch1', 'batch2', 'batch3', 'batch4', 'batch9'], ['quiz1', 'quiz2']])
    cols = ['noofpresent', 'lesserthan50', 'between50and60', 'between60and70', 'between70and80', 'greaterthan80']
    statdf = pd.DataFrame('-', idx, cols)
    # Master list of all who attended Quiz1
    q1 = pd.read_csv(q1path, usecols=['Firstname', 'Grade/10.00', 'State'], na_values=['-', 'In progress', np.NaN])
    q1.dropna(inplace=True)
    q1['Grade/10.00'] = q1['Grade/10.00'] * 10
    # Master list of all who attended Quiz2
    q2 = pd.read_csv(q2path, usecols=['Firstname', 'Grade/10.00', 'State'], na_values=['-', 'In progress', np.NaN])
    q2.dropna(inplace=True)
    q2['Grade/10.00'] = q2['Grade/10.00'] * 10
    # generate each batch file path and do the rest of the work
    for batchId in range(numberOfBatches):
        batchCsvPath = baseBatchPath + str(batchId + 1) + ".csv"
        # Master list of students in this batch
        batch = pd.read_csv(batchCsvPath, usecols=['studentName', 'admissionNumber'])
        batch.rename(columns={'studentName': 'Firstname'}, inplace=True)
        # Merge each batch's list of names with the quiz1 list on the Firstname column
        q1batch = pd.merge(batch, q1)
        q1batch = q1.loc[q1['Firstname'].isin(batch.Firstname)]  # checking if the name exists in either list
        q1batch.reset_index(inplace=True)
        # print(q1batch)
        lt50 = q1batch.loc[q1batch['Grade/10.00'] < 50]
        # find the students whose grades are less than 50
        outBatchq1 = lt50['Grade/10.00'].count()
        # print(outBatchq1) to get the count of students who got <50 in quiz1 from this batch
        # do the same for quiz 2
        # Merge each batch with Q2
        q2batch = pd.merge(batch, q2)
        q2batch = q2.loc[q2['Firstname'].isin(batch.Firstname)]
        q2batch.reset_index(inplace=True)
        q2batch.loc[q2batch['Grade/10.00'] <= 50].count()
        lt50 = q2batch.loc[q2batch['Grade/10.00'] < 50]
        outBatchq2 = lt50['Grade/10.00'].count()
        # print(outBatchq2)
        # finally save the resulting DFs for later use
        batchResultListAll.append([q1batch, q2batch])
    return batchResultListAll
# call the function using the base path and the number of batch csv files
results = doAll("d:\\a2\\batchwiselist\\", 18)
Make a list object which contains all the CSV file paths and then use a for loop to iterate through them. Obviously you will have to tweak your code where you have hard-coded the csv file so that it uses the now-dynamic file path.
Something like this:
csv_files = ['file1.csv', 'file2.csv']
for file in csv_files:
    # (YOUR CODE GOES HERE)
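A minimal sketch of that idea, assuming the batch files all live under d:\a2\batchwiselist\ as in the question; the per-file work is left as a comment since it is just the batch code from above:
import glob
import pandas as pd
# Build the list of batch csv paths instead of hard-coding each one
csv_files = glob.glob('d:\\a2\\batchwiselist\\*.csv')
for path in csv_files:
    batch = pd.read_csv(path, usecols=['studentName', 'admissionNumber'])
    batch.rename(columns={'studentName': 'Firstname'}, inplace=True)
    # ... run the same quiz1/quiz2 computations on `batch` here ...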

Create random groupings from list

I need to take a list of over 500 people and place them into groups of 15. The groups should be randomized so that we don't end up with groups where everyone's last name begins with "B", for example. But I also need to balance the groups of 15 for gender parity as close as possible. The list is in a 'students.csv' file with this structure:
Last, First, ID, Sport, Gender, INT
James, Frank, f99087, FOOT, m, I
Smith, Sally, f88329, SOC, f,
Cranston, Bill, f64928, ,m,
I was looking for some kind of solution in pandas, but I have limited coding knowledge. The code I've got so far just explores the data a bit.
import pandas as pd
data = pd.read_csv('students.csv', index_col='ID')
print(data)
print(data.Gender.value_counts())
First thing I would do is filter into two lists, one for each gender:
# iterate over the dataframe rows as records
males = [d for d in data.itertuples() if d.Gender == 'm']
females = [d for d in data.itertuples() if d.Gender == 'f']
Next, shuffle the orders of the lists, to make it easier to select "randomly" while actually not having to choose random indices:
import random
random.shuffle(males)
random.shuffle(females)
then, choose elements, while trying to stay more-or-less in line with the gender ratio:
import math
# establish number of groups, and size of each group
GROUP_SIZE = 15
GROUP_NUM = math.ceil(len(data) / GROUP_SIZE)
# make an empty list of groups to add each group to
groups = []
while len(groups) < GROUP_NUM and (len(males) > 0 and len(females) > 0):
    # calculate the proper gender ratio, to balance this group
    num_males = round(len(males) / len(data) * GROUP_SIZE)
    num_females = GROUP_SIZE - num_males
    # select that many people from the previously-shuffled lists
    males_in_this_group = [males.pop(0) for n in range(num_males) if len(males) > 0]
    females_in_this_group = [females.pop(0) for n in range(num_females) if len(females) > 0]
    # put those two subsets together, shuffle to make it feel more random, and add this group
    this_group = males_in_this_group + females_in_this_group
    random.shuffle(this_group)
    groups.append(this_group)
This will ensure that the gender ratio in each group is as true to the original sample as possible. The last group will, of course, be smaller than the others, and will contain "whatever's left" from the other groups.
An approach using pandas: groups of 15 members, with the remainder in the very last group. The gender ratio stays roughly the same, to the accuracy the pandas randomizer allows.
import pandas as pd
df = pd.read_csv('1.csv', skipinitialspace=True)  # 1.csv contains sample data from the question
# shuffle data / pandas way
df = df.sample(frac=1).reset_index(drop=True)
# group size
SIZE = 15
# create column with group number
df['group'] = df.index // SIZE
# list of groups, groups[0] is the dataframe with the first group's members
groups = [df[df['group'] == num] for num in range(df['group'].max() + 1)]
Save dataframe to file:
# one csv-file
df.to_csv('2.csv')
# many csv-files
for num, group_df in enumerate(groups, 1):
    group_df.to_csv('group_{}.csv'.format(num))
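The pandas shuffle above randomizes the order but does not actively balance gender. One way to get closer to gender parity, sketched below under the assumption that the file and columns match the question (students.csv with a Gender column): shuffle, stable-sort the shuffled frame by Gender, then deal members out to groups round-robin so each group receives a roughly even slice of both genders.
import pandas as pd
df = pd.read_csv('students.csv', skipinitialspace=True)
SIZE = 15
n_groups = -(-len(df) // SIZE)  # ceiling division
# Shuffle, then stable-sort by gender (the shuffled order is kept within each gender)
df = df.sample(frac=1).reset_index(drop=True)
df = df.sort_values('Gender', kind='stable').reset_index(drop=True)
# Deal members round-robin across groups: 0, 1, ..., n_groups-1, 0, 1, ...
df['group'] = df.index % n_groups
groups = [g for _, g in df.groupby('group')]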

Pandas - Change value in column based on its relationship with another column

I am working with the sklearn.datasets.fetch_20newsgroups() dataset. Here, there are some documents that belong to more than one news group. I want to treat those documents as two different entities that each belong to one news group. To do this, I've brought the document IDs and group names into a dataframe.
import os
import pandas as pd
import sklearn
from sklearn import datasets
data = datasets.fetch_20newsgroups()
filepaths = data.filenames.astype(str)
keys = []
for path in filepaths:
    keys.append(os.path.split(path)[1])
groups = pd.DataFrame(keys, columns=['Document_ID'])
groups['Group'] = data.target
groups.head()
>> Document_ID Group
0 102994 7
1 51861 4
2 51879 4
3 38242 1
4 60880 14
print (len(groups))
>>11314
print (len(groups['Document_ID'].drop_duplicates()))
>>9840
print (len(groups['Group'].drop_duplicates()))
>>20
For each Document_ID, I want to change its value if it has more than one Group number assigned. Example,
groups[groups['Document_ID']=='76139']
>> Document_ID Group
5392 76139 6
5680 76139 17
I want this to become:
>> Document_ID Group
5392 76139 6
5680 12345 17
Here, 12345 is a random new ID that is not already in keys list.
How can I do this?
You can find all the rows that contain a duplicate Document_ID after the first with the duplicated method. Then create a list of new IDs beginning with one more than the max ID. Use the loc indexing operator to overwrite the duplicate keys with the new IDs.
groups['Document_ID'] = groups['Document_ID'].astype(int)
dupes = groups.Document_ID.duplicated(keep='first')
max_id = groups.Document_ID.max() + 1
new_id = range(max_id, max_id + dupes.sum())
groups.loc[dupes, 'Document_ID'] = new_id
Test case
groups.loc[[5392,5680]]
Document_ID Group
5392 76139 6
5680 179489 17
Ensure that no duplicates remain.
groups.Document_ID.duplicated(keep='first').any()
False
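The question mentions wanting a random replacement ID rather than a sequential one. A minimal sketch of that variant, reusing the dupes mask from above; the 10000–999999 candidate range is an arbitrary assumption:
import numpy as np
rng = np.random.default_rng()
existing = set(groups['Document_ID'])
# Draw random candidate IDs, skipping any that already exist
new_ids = []
while len(new_ids) < dupes.sum():
    candidate = int(rng.integers(10_000, 1_000_000))
    if candidate not in existing:
        existing.add(candidate)
        new_ids.append(candidate)
groups.loc[dupes, 'Document_ID'] = new_ids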
Kinda Hacky, but why not!
data = {"Document_ID": [102994,51861,51879,38242,60880,76139,76139],
"Group": [7,1,3,4,4,6,17],
}
groups = pd.DataFrame(data)
groupDict ={}
tempLst=[]
#Create a list of unique ID's
DocList = groups['Document_ID'].unique()
DocList.tolist()
#Build a dictionary and push all group ids to the correct doc id
DocDict = {}
for x in DocList:
DocDict[x] = []
for index, row in groups.iterrows():
DocDict[row['Document_ID']].append(row['Group'])
#For all doc Id's with multip entries create a new id with the group id as a decimal point.
groups['DupID'] = groups['Document_ID'].apply(lambda x: len(DocDict[x]))
groups["Document_ID"] = np.where(groups['DupID'] > 1, groups["Document_ID"] + groups["Group"]/10,groups["Document_ID"])
Hope that helps...

Create a new column in a dataframe with increment number based on another column

Consider the below pandas DataFrame:
import pandas as pd
from pandas import Timestamp
df = pd.DataFrame({
    'day': [Timestamp('2017-03-27'),
            Timestamp('2017-03-27'),
            Timestamp('2017-04-01'),
            Timestamp('2017-04-03'),
            Timestamp('2017-04-06'),
            Timestamp('2017-04-07'),
            Timestamp('2017-04-11'),
            Timestamp('2017-05-01'),
            Timestamp('2017-05-01')],
    'act_id': ['916298883',
               '916806776',
               '923496071',
               '926539428',
               '930641527',
               '931935227',
               '937765185',
               '966163233',
               '966417205']
})
As you may see, there are 9 unique ids distributed in 7 days.
I am looking for a way to add two new columns.
The first column:
An increment number for each new day. For example, 1 for '2017-03-27' (the same number for the same day), 2 for '2017-04-01', 3 for '2017-04-03', etc.
The second column:
An increment number for each new act_id per day. For example, 1 for '916298883', 2 for '916806776' (which is linked to the same day, '2017-03-27'), 1 for '923496071', 1 for '926539428', etc.
The final table should look like this
I have already tried to build the first column with apply and a function but it doesn't work as it should.
# Create helper function to give an index number to a new column
counter = 1
def giveFlag(x):
    global counter
    index = counter
    counter += 1
    return index
And then:
# Create day flagger column
df_helper['day_no'] = df_helper['day'].apply(lambda x: giveFlag(x))
try this:
days = list(set(df['day']))
days.sort()
day_no = list()
iter_no = list()
for index, day in enumerate(days):
    counter = 1
    for dfday in df['day']:
        if dfday == day:
            iter_no.append(counter)
            day_no.append(index + 1)
            counter += 1
df['day_no'] = pd.Series(day_no).values
df['iter_no'] = pd.Series(iter_no).values
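For reference, here is a more idiomatic pandas sketch of the same result, assuming the df from the question is sorted by day: groupby('day').ngroup() numbers each distinct day and groupby('day').cumcount() numbers the rows within each day.
df = df.sort_values('day')
# Same number for every row on the same day: 1, 1, 2, 3, ...
df['day_no'] = df.groupby('day').ngroup() + 1
# Restarts at 1 for each new day: 1, 2, 1, 1, ...
df['iter_no'] = df.groupby('day').cumcount() + 1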
