Create random groupings from list - python

I need to take a list of over 500 people and place them into groups of 15. The groups should be randomized so that we don't end up with groups where everyone's last name begins with "B", for example. But I also need to balance the groups of 15 for gender parity as closely as possible. The list is in a 'students.csv' file with this structure:
Last, First, ID, Sport, Gender, INT
James, Frank, f99087, FOOT, m, I
Smith, Sally, f88329, SOC, f,
Cranston, Bill, f64928, ,m,
I was looking for some kind of solution in pandas, but I have limited coding knowledge. The code I've got so far just explores the data a bit.
import pandas as pd
data = pd.read_csv('students.csv', index_col='ID')
print(data)
print(data.Gender.value_counts())

The first thing I would do is filter into two lists, one for each gender (iterating over a DataFrame yields column names, so iterate its rows instead):
import math
import random
# build plain lists of rows, one per gender
males = [row for _, row in data.iterrows() if row.Gender == 'm']
females = [row for _, row in data.iterrows() if row.Gender == 'f']
Next, shuffle the orders of the lists, to make it easier to select "randomly" while actually not having to choose random indices:
random.shuffle(males)
random.shuffle(females)
Then choose elements, while trying to stay more or less in line with the gender ratio:
# establish number of groups, and size of each group
GROUP_SIZE = 15
GROUP_NUM = math.ceil(len(data) / GROUP_SIZE)
# make an empty list of groups to add each group to
groups = []
while len(groups) < GROUP_NUM and (len(males) > 0 or len(females) > 0):
    # calculate the gender ratio among the people still unassigned, to balance this group
    num_males = round(len(males) / (len(males) + len(females)) * GROUP_SIZE)
    num_females = GROUP_SIZE - num_males
    # select that many people from the previously-shuffled lists
    males_in_this_group = [males.pop(0) for n in range(num_males) if len(males) > 0]
    females_in_this_group = [females.pop(0) for n in range(num_females) if len(females) > 0]
    # put those two subsets together, shuffle to make it feel more random, and add this group
    this_group = males_in_this_group + females_in_this_group
    random.shuffle(this_group)
    groups.append(this_group)
This will ensure that the gender ratio in each group is as true to the original sample as possible. The last group will, of course, be smaller than the others, and will contain "whatever's left" from the other groups.
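To sanity-check the balance, you could print each group's size and gender split (a minimal sketch, assuming each group member is a row with a Gender field, as above):
# count genders per group
for idx, group in enumerate(groups, 1):
    n_m = sum(1 for person in group if person.Gender == 'm')
    print(f"group {idx}: {len(group)} people, {n_m} m / {len(group) - n_m} f")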

An approach using pandas: groups of 15 members, with the rest in the very last group. The gender ratio stays roughly the same, to the accuracy the pandas randomizer allows.
import pandas as pd
df = pd.read_csv('1.csv', skipinitialspace=True) # 1.csv contains sample data from the question
# shuffle data / pandas way
df = df.sample(frac=1).reset_index(drop=True)
# group size
SIZE = 15
# create column with group number
df['group'] = df.index // SIZE
# list of groups, groups[0] is dataframe with the first group members
groups = [
    df[df['group'] == num]
    for num in range(df['group'].max() + 1)]
Save dataframe to file:
# one csv-file
df.to_csv('2.csv')
# many csv-files
for num, group_df in enumerate(groups, 1):
    group_df.to_csv('group_{}.csv'.format(num))
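If you also want each group from this pandas approach to track the overall gender ratio, here is a minimal sketch (not part of the original answer; it reuses df and SIZE from above and deals members out round-robin after blocking the shuffle by gender):
import math
n_groups = math.ceil(len(df) / SIZE)
df = df.sample(frac=1).reset_index(drop=True)                             # shuffle everyone
df = df.sort_values('Gender', kind='mergesort').reset_index(drop=True)    # mergesort is stable: genders blocked, still shuffled within gender
df['group'] = df.index % n_groups                                          # deal members out round-robin across groups
Group sizes then differ by at most one instead of leaving a single small leftover group.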

Related

loop over pandas column for wmd similarity

I have two dataframes, both with two columns. I want to use wmd to find the closest match for each entity in column source_label to the entities in column target_label. However, at the end I would like to have a DataFrame with all 4 columns matched up for the entities.
df1
,source_Label,source_uri
'neuronal ceroid lipofuscinosis 8',"http://purl.obolibrary.org/obo/DOID_0110723"
'autosomal dominant distal hereditary motor neuronopathy',"http://purl.obolibrary.org/obo/DOID_0111198"
df2
,target_label,target_uri
'neuronal ceroid ',"http://purl.obolibrary.org/obo/DOID_0110748"
'autosomal dominanthereditary',"http://purl.obolibrary.org/obo/DOID_0111110"
Expected result
,source_label, target_label, source_uri, target_uri, wmd score
'neuronal ceroid lipofuscinosis 8', 'neuronal ceroid ', "http://purl.obolibrary.org/obo/DOID_0110723", "http://purl.obolibrary.org/obo/DOID_0110748", 0.98
'autosomal dominant distal hereditary motor neuronopathy', 'autosomal dominanthereditary', "http://purl.obolibrary.org/obo/DOID_0111198", "http://purl.obolibrary.org/obo/DOID_0111110", 0.65
The dataframe is so big that I am looking for some faster way to iterate over both label columns. So far I tried this:
list_distances = []
temp = []
def preprocess(sentence):
    return [w for w in sentence.lower().split()]
entity = df1['source_label']
target = df2['target_label']
for i in tqdm(entity):
    for j in target:
        wmd_distance = model.wmdistance(preprocess(i), preprocess(j))
        temp.append(wmd_distance)
    list_distances.append(min(temp))
# print("list_distances", list_distances)
WMD_Dataframe = pd.DataFrame({'source_label': pd.Series(entity),
                              'target_label': pd.Series(target),
                              'source_uri': df1['source_uri'],
                              'target_uri': df2['target_uri'],
                              'wmd_Score': pd.Series(list_distances)}).sort_values(by=['wmd_Score'])
WMD_Dataframe = WMD_Dataframe.reset_index()
First of all, this code is not working well, as the other two columns come straight from the dataframes and do not keep each entity paired with its own uri. How can one make it faster, since the entities number in the millions? Thanks in advance.
A quick fix:
closest_neighbour_index_df2 = []
def preprocess(sentence):
    return [w for w in sentence.lower().split()]
for i in tqdm(entity):
    temp = []
    for j in target:
        wmd_distance = model.wmdistance(preprocess(i), preprocess(j))
        temp.append(wmd_distance)
    # maybe assert to make sure its always right
    closest_neighbour_index_df2.append(np.argmin(np.array(temp)))
    # return argmin to return index rather than the value.
# Add the indices from df2 to df1
df1['closest_neighbour'] = closest_neighbour_index_df2
# add information to respective row from df2 using the closest_neighbour column
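To finish that last step, a minimal sketch (assuming df2 keeps its default 0..n-1 integer index, so the stored positions double as index labels):
# attach the matched df2 row to each df1 row via the stored position
result = df1.merge(df2, left_on='closest_neighbour', right_index=True, how='left')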

Maximum Consecutive Ones/Trues per year that also considers the boundaries (Start-of-year and End-of-year)

Title says most of it, i.e. find the maximum consecutive Ones/1s (or Trues) for each year, and if the consecutive 1s at the end of a year continue into the following year, merge them together.
I have tried to implement this, but it seems a bit of a 'hack', and I'm wondering if there is a better way to do it.
Reproducible Example Code:
# Modules needed
import pandas as pd
import numpy as np
# Example Input array of Ones and Zeroes with a datetime-index (Original data is time-series)
InputArray = pd.Series([0,1,0,1,1,1,1,1,1,1,1,1,0,1,1,1])
InputArray.index = (pd.date_range('2000-12-22', '2001-01-06'))
boolean_array = InputArray == 1 #convert to boolean
# Wanted Output
# Year MaxConsecutive-Ones
# 2000 9
# 2001 3
Below is my initial code to achieve the wanted output.
# function to get max consecutive for a particular array. i.e. will be done for each year below (groupby)
def GetMaxConsecutive(boolean_array):
    distinct = boolean_array.ne(boolean_array.shift()).cumsum() # associate each trues/false to a number
    distinct = distinct[boolean_array] # only consider trues from the distinct values
    consect = distinct.value_counts().max() # find the number of consecutives of distincts values then find the maximum value
    return consect
# Find the maximum consecutive 'Trues' for each year.
MaxConsecutive = boolean_array.groupby(lambda x: x.year).apply(GetMaxConsecutive)
print(MaxConsecutive)
# Year MaxConsecutive-Ones
# 2000 7
# 2001 3
However, the output above is still not what we want because the groupby cuts the data at each year boundary.
So the code below tries to 'fix' this by computing the MaxConsecutive-Ones at the boundaries (i.e. current_year-01-01 and previous_year-12-31), and if the MaxConsecutive-Ones at the boundaries is larger than the original MaxConsecutive-Ones from the output above, we replace it.
# First) we acquire all start_of_year and end_of_year data
start_of_year = boolean_array.loc[(boolean_array.index.month==1) & (boolean_array.index.day==1)]
end_of_year = boolean_array.loc[(boolean_array.index.month==12) & (boolean_array.index.day==31)]
# Second) we mask above start_of_year and end_of_year data: to only have elements that are "True"
start_of_year = start_of_year[start_of_year]
end_of_year = end_of_year[end_of_year]
# Third) Change index to only contain years (rather than datetime index)
# Also for "start_of_year" array include -1 to the years when setting the index.
# So we can match end_of_year to start_of_year arrays!
start_of_year = pd.Series(start_of_year)
start_of_year.index = start_of_year.index.year - 1
end_of_year = pd.Series(end_of_year)
end_of_year.index = end_of_year.index.year
# Combine index-years that are 'matched'
matched_years = pd.concat([end_of_year, start_of_year], axis = 1)
matched_years = matched_years.dropna()
matched_years = matched_years.index
# Copy the original results, so they can be replaced where the boundary count is larger
Modify_MaxConsecutive = MaxConsecutive.copy()
# Finally) Compute the consecutive 1s/trues at the boundaries
# for each matched years
for year in matched_years:
    # Compute the amount of consecutive 1s/trues at the start-of-year
    start = boolean_array.loc[boolean_array.index.year == (year + 1)]
    distinct = start.ne(start.shift()).cumsum() # associate each consecutive trues/false to a number
    distinct_masked = distinct[start] # only consider trues from the distinct values i.e. remove elements within "distinct" that are "False" within the boolean array.
    count_distincts = distinct_masked.value_counts() # the index of this array is the associated distinct_value and its actual value/element is the amount of consecutives.
    start_consecutive = count_distincts.loc[distinct_masked.min()] # Find the number of consecutives at the start of year (or where distinct_masked is minimum)
    # Compute the amount of consecutive 1s/trues at the previous-end-of-year
    end = boolean_array.loc[boolean_array.index.year == year]
    distinct = end.ne(end.shift()).cumsum() # associate each trues/false to a number
    distinct_masked = distinct[end] # only consider trues from the distinct values i.e. remove elements within "distinct" that are "False" within the boolean array.
    count_distincts = distinct_masked.value_counts() # the index of this array is the associated distinct_value and its actual value/element is the amount of consecutives.
    end_consecutive = count_distincts.loc[distinct_masked.max()] # Find the number of consecutives at the end of year (or where distinct_masked is maximum)
    # Merge/add the consecutives at the boundaries (start-of-year and previous-end-of-year)
    ConsecutiveAtBoundaries = start_consecutive + end_consecutive
    # Now we modify the original MaxConsecutive if ConsecutiveAtBoundaries is larger
    if Modify_MaxConsecutive.loc[year] < ConsecutiveAtBoundaries:
        Modify_MaxConsecutive.loc[year] = ConsecutiveAtBoundaries
# Wanted Output is achieved!
print(Modify_MaxConsecutive)
# Year MaxConsecutive-Ones
# 2000 9
# 2001 3
Now I've got the time. Here is my solution:
# Modules needed
import pandas as pd
import numpy as np
input_array = pd.Series([0,1,0,1,1,1,1,1,1,1,1,1,0,1,1,1], dtype=bool)
input_dates = pd.date_range('2000-12-22', '2001-01-06')
df = pd.DataFrame({"input": input_array, "dates": input_dates})
streak_starts = df.index[~df.input.shift(1, fill_value=False) & df.input]      # positions where a run of Trues begins
streak_ends = df.index[~df.input.shift(-1, fill_value=False) & df.input] + 1   # one past the position where each run ends
streak_lengths = streak_ends - streak_starts
streak_df = df.iloc[streak_starts].copy()                                      # one row per streak, dated by its first day
streak_df["streak_length"] = streak_lengths
longest_streak_per_year = streak_df.groupby(streak_df.dates.dt.year).streak_length.max()
Output:
dates
2000 9
2001 3
Name: streak_length, dtype: int64
Not sure if this is the most efficient, but it's one solution:
arr = pd.Series([0,1,0,1,1,1,1,1,1,1,1,1,0,1,1,1])
arr.index = (pd.date_range('2000-12-22', '2001-01-06'))
arr = arr.astype(bool)
df = arr.reset_index() # convert to df
df['adj_year'] = df['index'].dt.year # adj_year will be adjusted for streaks
mask = (df[0].eq(True)) & (df[0].shift().eq(True))
df.loc[mask, 'adj_year'] = np.NaN # we mark streaks as NaN and fill from above
df.adj_year = df.adj_year.fillna(method='ffill').astype('int')
df.groupby('adj_year').apply(lambda x: ((x[0] == x[0].shift()).cumsum() + 1).max())
# find max streak for each adjusted year
Output:
adj_year
2000 9
2001 3
dtype: int64
Note:
By convention, variable names in Python (except for classes) are lower case, so arr as opposed to InputArray
1 and 0 are equivalent to True and False, so you can convert them to boolean without the explicit comparison
cumsum is zero-indexed (as is usual in Python) so we add 1
This solution doesn't answer the question exactly, so it will not be the final answer; i.e. a run of trues that crosses a year boundary is credited to both the current year and the following year:
boolean_array = pd.Series([0,1,0,1,1,1,1,1,1,1,1,1,0,1,1,1]).astype(bool)
boolean_array.index = (pd.date_range('2000-12-22', '2001-01-06'))
distinct = boolean_array.ne(boolean_array.shift()).cumsum()
distinct_masked = distinct[boolean_array]
streak_sum = distinct_masked.value_counts()
streak_sum_series = pd.Series(streak_sum.loc[distinct_masked].values, index = distinct_masked.index.copy())
max_consect = streak_sum_series.groupby(lambda x: x.year).max()
Output:
max_consect
2000 9
2001 9
dtype: int64

compare a long list with strings in dataframe and on the basis of matching populate the dataframe in Python

I have an AMiner dataset from which I have to extract only the computer science terms, so there is list1 against which I have to compare my dataset for this task.
https://www.aminer.org/oag2019
list1 = ['document types', 'surveys and overviews', 'reference works', 'general conference proceedings', 'biographies', 'general literature', 'computing standards, rfcs and guidelines', 'cross-computing tools and techniques',......]
The total count of list1 is 2112 computer science terms from ACM.
The dataframe column to which I have to compare list1 (string comparison) looks like this:
df_train14year['keywords'].head()
0 "nmr spectroscopy","mass spectrometry","nanost...
1 "plk1","cationic dialkyl histidine","crystal s...
2 "case-control","child","fuel","hydrocarbons","...
3 "Ca2+ handling","CaMKII","cardiomyocyte","cont...
4
Name: keywords, dtype: object
Each of these lists in the dataframe has at most 10 and at least 3 keywords, and there are millions of records in the dataframe.
So I have to compare each row's keywords with the original list1, and if more than 3 words match between the two lists, populate a dataframe with those rows; substring matches may also be needed.
How can I do this task in an efficient way in Python? What I've done so far uses a for loop comparing each keyword to the whole list, with three nested loops, so it is inefficient.
# for i in range(5):
#     df.loc[i] = ['<some value for first>','<some value for second>','<some value for third>']
count = 0
i = 0
for index, row in df_train14year.iterrows():
    # print("index",index)
    i = i + 1
    # if(i==50):
    #     break
    for outr in row['keywords'].split(","):
        #print(count)
        if (count>1):
            # print("found1")
            count = 0
            break
        for inr in computerList:
            # outr= outr.replace("[","") # i skip these lines because i applied the pre-processing on data to remove the [] and "
            # outr= outr.replace("]","")
            outr = outr.replace('"',"")
            #print("outr",outr,"inr",inr)
            if outr in inr:
                count = count+1
            if (count>10):
                #print("outr",outr,"inr",inr)
                # print("found2")
                # df12.loc[i] = [index,row['keywords']]
                #df12.insert(index,"keywords",row['keywords'])
                df14_4_match = df14_4_match.append({'abstract': row['abstract'],'keywords': row['keywords'],'title': row['title'],'year': row['year']}, ignore_index=True)
                break
    # else:
    #     print('not found')
The keywords rows in the dataframe look like ["nmr spectroscopy","mass spectrometry","nanos"].
Preprocessing:
dfk['list_keywords']=[[x for x in j.split('[')[1].split(']')[0].split('"')[1:-1] if x not in[',',' , ',', ']] for j in dfk['keywords']]
After converting the original list to a set:
dfk['set_keywords']=dfk['list_keywords'].map(lambda x: set(x))
We compare the intersection of the keyword set and computerList (list1 mentioned in the question) to get the number of keywords matched:
dfk['set_keywords']=dfk['set_keywords'].map(lambda x:x.intersection(proceseedComputerList))
Get the length with this:
dfk['len_keywords']=dfk['set_keywords'].map(lambda x:len(x))
Sort in ascending order:
dfk = dfk.sort_values('len_keywords')
dfk.head()
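Then, to keep only the rows the question actually asks for (more than 3 matching keywords), a short follow-up sketch using the len_keywords column built above:
# keep rows where more than 3 keywords intersect with list1
df_matched = dfk[dfk['len_keywords'] > 3]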

How to iterate and sample from each category of a dataframe?

I'm working on a script that calculates a sample size, and then extracts samples from each category in a dataframe evenly. I want to re-use this code for various dataframes with different categories, but I'm having trouble figuring out the for loop to do this:
df2 = df.loc[(df['Track Item']=='Y')]
categories = df2['Category'].unique()
categories_total = len(categories)
total_rows = len(df2.axes[0])
ss = (2.58**2)*(0.5)*(1-0.5)/.04**2
ss2 = ss / categories_total
ss3 = round(ss2)
one = df.loc[(df['Category']=='HOUSEHOLD FANS')].sample(ss3)
two = df.loc[(df['Category']=='HUMIDIFIERS')].sample(ss3)
three = df.loc[(df['Category']=='HOME WATER FILTERS')].sample(ss3)
four = df.loc[(df['Category']=='CAMPING & HIKING WATER FILTERS')].sample(ss3)
five = df.loc[(df['Category']=='THERMOMETERS')].sample(ss3)
six = df.loc[(df['Category']=='AIR PURIFIERS')].sample(ss3)
seven = df.loc[(df['Category']=='DETECTORS')].sample(ss3)
eight = df.loc[(df['Category']=='AIR CONDITIONERS')].sample(ss3)
nine = df.loc[(df['Category']=='AROMATHERAPY')].sample(ss3)
ten = df.loc[(df['Category']=='AIR HEATING')].sample(ss3)
eleven = df.loc[(df['Category']=='HOUSEHOLD FANS')].sample(ss3)
I need to loop through each category, taking a sample from each one evenly. Any idea how I can accomplish this task?
How about a groupby with sample instead:
df.groupby('Category').apply(lambda x: x.sample(ss3))
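As a usage note (this depends on your pandas version and is not part of the answer above): pandas 1.1+ can sample directly on the groupby, and group_keys=False keeps the apply result as a flat dataframe:
sample = df.groupby('Category').sample(n=ss3)                                       # pandas >= 1.1
sample = df.groupby('Category', group_keys=False).apply(lambda x: x.sample(ss3))    # same idea via apply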

find most frequent pairs in a dataframe

Suppose I have a two-column dataframe where the first column is the ID of a meeting and the second is the ID of one of the participants in that meeting. Like this:
meeting_id,person_id
meeting0,person1234
meeting0,person4321
meeting0,person5555
meeting1,person4321
meeting1,person9999
# ... ~1 million rows
I want to find each person's top 15 co-participants. Eg.: I want to know which 15 people most frequently participate in meetings with Brad.
As an intermediate step I wrote a script that takes the original dataframe and makes a person-to-person dataframe, like this:
person1234,person4321
person1234,person5555
person4321,person5555
person4321,person9999
...
But I'm not sure this intermediate step is necessary. Also, it's taking forever to run (by my estimate it should take weeks!). Here's the monstrosity:
import pandas as pd
links = []
lic = pd.read_csv('meetings.csv', sep = ';', names = ['meeting_id', 'person_id'], dtype = {'meeting_id': str, 'person_id': str})
grouped = lic.groupby('person_id')
for i, group in enumerate(grouped):
    print(i, 'of', len(grouped))
    person_id = group[0].strip()
    if len(person_id) == 14:
        meetings = set(group[1]['meeting_id'])
        for meeting in meetings:
            lic_sub = lic[lic['meeting_id'] == meeting]
            people = set(lic_sub['person_id'])
            for person in people:
                if person != person_id:
                    tup = (person_id, person)
                    links.append(tup)
df = pd.DataFrame(links)
df.to_csv('links.csv', index = False)
Any ideas?
So here is one way: merge the dataframe with itself on meeting_id, then sort the pair columns.
import numpy as np
s=df.merge(df,on='meeting_id')
s[['person_id_x','person_id_y']]=np.sort(s[['person_id_x','person_id_y']].values,1)
s=s.query('person_id_x!=person_id_y').drop_duplicates()
s
meeting_id person_id_x person_id_y
1 meeting0 person1234 person4321
2 meeting0 person1234 person5555
5 meeting0 person4321 person5555
10 meeting1 person4321 person9999
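To go from these unordered pairs to each person's 15 most frequent co-participants, one possible follow-up sketch (counting both directions of every pair):
# count shared meetings per ordered (person, co-participant) pair, then take the top 15 per person
both = pd.concat([
    s[['person_id_x', 'person_id_y']],
    s.rename(columns={'person_id_x': 'person_id_y', 'person_id_y': 'person_id_x'})[['person_id_x', 'person_id_y']],
])
counts = both.groupby(['person_id_x', 'person_id_y']).size().reset_index(name='meetings_together')
top15 = (counts.sort_values('meetings_together', ascending=False)
               .groupby('person_id_x').head(15))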
