loop over pandas column for wmd similarity - python

I have two dataframes, both with two columns. I want to use WMD (Word Mover's Distance) to find the closest match for each entity in the column source_label among the entities in the column target_label. At the end I would like a DataFrame with all four columns, with each source entity paired with its closest target entity.
df1
,source_Label,source_uri
'neuronal ceroid lipofuscinosis 8',"http://purl.obolibrary.org/obo/DOID_0110723"
'autosomal dominant distal hereditary motor neuronopathy',"http://purl.obolibrary.org/obo/DOID_0111198"
df2
,target_label,target_uri
'neuronal ceroid ',"http://purl.obolibrary.org/obo/DOID_0110748"
'autosomal dominanthereditary',"http://purl.obolibrary.org/obo/DOID_0111110"
Expected result
,source_label, target_label, source_uri, target_uri, wmd score
'neuronal ceroid lipofuscinosis 8', 'neuronal ceroid ', "http://purl.obolibrary.org/obo/DOID_0110723", "http://purl.obolibrary.org/obo/DOID_0110748", 0.98
'autosomal dominant distal hereditary motor neuronopathy', 'autosomal dominanthereditary', "http://purl.obolibrary.org/obo/DOID_0111198", "http://purl.obolibrary.org/obo/DOID_0111110", 0.65
The dataframes are so big that I am looking for a faster way to iterate over both label columns. So far I tried this:
import pandas as pd
from tqdm import tqdm

# model: a pre-trained word-embedding model (e.g. gensim KeyedVectors) providing wmdistance()

list_distances = []
temp = []

def preprocess(sentence):
    return [w for w in sentence.lower().split()]

entity = df1['source_label']
target = df2['target_label']

for i in tqdm(entity):
    for j in target:
        wmd_distance = model.wmdistance(preprocess(i), preprocess(j))
        temp.append(wmd_distance)
    list_distances.append(min(temp))
# print("list_distances", list_distances)

WMD_Dataframe = pd.DataFrame({'source_label': pd.Series(entity),
                              'target_label': pd.Series(target),
                              'source_uri': df1['source_uri'],
                              'target_uri': df2['target_uri'],
                              'wmd_Score': pd.Series(list_distances)}).sort_values(by=['wmd_Score'])
WMD_Dataframe = WMD_Dataframe.reset_index()
First of all, this code is not working well: the other two columns come straight from the dataframes and do not keep each entity paired with its own URI. And how can one make it faster, given that the entities number in the millions? Thanks in advance.

A quick fix:
import numpy as np
from tqdm import tqdm

closest_neighbour_index_df2 = []

def preprocess(sentence):
    return [w for w in sentence.lower().split()]

for i in tqdm(entity):
    temp = []
    for j in target:
        wmd_distance = model.wmdistance(preprocess(i), preprocess(j))
        temp.append(wmd_distance)
    # use argmin to record the index of the closest match rather than the distance itself
    # (maybe assert to make sure it is always right)
    closest_neighbour_index_df2.append(np.argmin(np.array(temp)))

# Add the indices from df2 to df1
df1['closest_neighbour'] = closest_neighbour_index_df2
# add information to the respective row from df2 using the closest_neighbour column
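To finish the step described in the last comment, here is a minimal sketch (not from the original answer). It assumes df1 and df2 have default integer indices and that the minimum distance for each source entity has also been collected into a hypothetical list closest_scores (e.g. closest_scores.append(min(temp)) inside the loop above):

# sketch only: closest_scores is a hypothetical list of the min distances, see lead-in
matched = df2.iloc[df1['closest_neighbour']][['target_label', 'target_uri']].reset_index(drop=True)
WMD_Dataframe = pd.concat([df1.reset_index(drop=True), matched], axis=1)
WMD_Dataframe['wmd_score'] = closest_scores
WMD_Dataframe = WMD_Dataframe.sort_values(by='wmd_score').reset_index(drop=True)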

Related

Openpyxl and Binary Search

The problem: I have two spreadsheets. Spreadsheet 1 has about 20,000 rows. Spreadsheet 2 has nearly 1 million rows. When a value from a row in spreadsheet 1 matches a value from a row in spreadsheet 2, the entire row from spreadsheet 2 is written to Excel. The problem isn't too difficult, but with such a large number of rows the run time is incredibly long.
Book 1 Example:
|Key |Value |
|------|------------------|
|397241|587727227839578000|
An example of book 2:
|ID                |a  |b |c   |
|------------------|---|--|----|
|587727227839578000|393|24|0.43|
My current solution is:
import openpyxl

g1 = openpyxl.load_workbook('path/to/sheet/sheet1.xlsx', read_only=True)
grid1 = g1.active
grid1_rows = list(grid1.rows)
g2 = openpyxl.load_workbook('path/to/sheet2/sheet2.xlsx', read_only=True)
grid2 = g2.active
grid2_rows = list(grid2.rows)
output_file = open('output.csv', 'w')  # destination for the matched rows (file name assumed)

for row in grid1_rows:
    value1 = int(row[1].value)
    print(value1)
    for row2 in grid2_rows:
        value2 = int(row2[0].value)
        if value1 == value2:
            new_Name = int(row[0].value)
            print("match")
            output_file.write(str(new_Name))
            output_file.write(",")
            output_file.write(",".join(str(c.value) for c in row2[1:]))
            output_file.write("\n")
This solution works, but again the runtime is absurd. Ideally I'd like to take value1 (which comes from the first sheet), then perform a binary search for that value on the other sheet, and, just like in my current solution, copy the entire row to a new file if it matches.
If there's an even faster method to do this I'm all ears. I'm not the greatest at Python, so any help is appreciated.
Thanks.
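For reference, the binary-search idea mentioned in the question could look roughly like this (a sketch, not from the original post; it assumes the sheets have no header row, as in the question's code, and reuses output_file from it):

import bisect

# sort sheet 2 once by its ID column, then binary-search each value from sheet 1
grid2_sorted = sorted(grid2_rows, key=lambda r: int(r[0].value))
keys = [int(r[0].value) for r in grid2_sorted]

for row in grid1_rows:
    value1 = int(row[1].value)
    pos = bisect.bisect_left(keys, value1)
    if pos < len(keys) and keys[pos] == value1:
        row2 = grid2_sorted[pos]
        output_file.write(str(int(row[0].value)) + "," +
                          ",".join(str(c.value) for c in row2[1:]) + "\n")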
You are getting your butt kicked here because you are using an inappropriate data structure, which requires you to use the nested loop.
The example below uses sets to match indices from the first sheet to those in the second sheet. This assumes there are no duplicates on either sheet; duplicates would seem weird given your problem description. Once we make sets of the indices from both sheets, all we need to do is intersect the two sets to find the ones that are on sheet 2.
Then we have the matches, but we can do better. If we put the second sheet's row data into a dictionary with the indices as the keys, then we can hold onto the row data while we do the match, rather than having to go hunting for the matching rows after intersecting the sets.
I've also put in an enumeration, which may or may not be needed to identify which rows in the spreadsheet are the ones of interest. Probably not needed.
This should execute in the blink of an eye after things are loaded. If you start to have memory issues, you may want to construct the dictionary directly at the start rather than building both the list and the dictionary.
Book 1: (sample spreadsheet shown as an image in the original answer)
Book 2: (sample spreadsheet shown as an image in the original answer; it has an ID column plus a name and a qty column, as used in the code below)
Code:
import openpyxl

g1 = openpyxl.load_workbook('Book1.xlsx', read_only=True)
grid1 = g1.active
grid1_rows = list(grid1.rows)[1:]  # exclude the header

g2 = openpyxl.load_workbook('Book2.xlsx', read_only=True)
grid2 = g2.active
grid2_rows = list(grid2.rows)[1:]  # exclude the header

# make a set of the values in Book 1 that we want to search for...
search_items = {int(t[0].value) for t in grid1_rows}
# print(search_items)

# make a dictionary (key-value pairing) for the items in the 2nd book, and
# include an enumeration so we can capture the row number
lookup_dict = {int(t[0].value): (idx, t) for idx, t in enumerate(grid2_rows, start=1)}
# print(lookup_dict)

# now let's intersect the set of search items and the key values to get the keys of the matches...
keys = search_items & lookup_dict.keys()
# print(keys)

for key in keys:
    idx = lookup_dict.get(key)[0]       # the row index, if needed
    row_data = lookup_dict.get(key)[1]  # the row data
    print(f'row {idx} matched value {key} and has data:')
    print(f'    name: {row_data[1].value:10s} \t qty: {int(row_data[2].value)}')
Output:
row 3 matched value 202 and has data:
name: steak qty: 3
row 1 matched value 455 and has data:
name: dogfood qty: 10
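As a small illustration of the memory note above (a sketch, not part of the original answer), the dictionary can be built straight from openpyxl's row iterator, storing plain cell values instead of keeping grid2_rows as a list:

lookup_dict = {
    int(row[0].value): (idx, [c.value for c in row])
    for idx, row in enumerate(grid2.iter_rows(min_row=2), start=1)  # min_row=2 skips the header
}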

Check if a string is present in multiple lists

I am trying to categorize a dataset based on the strings that contain the names of the different objects in the dataset.
The dataset is composed of 3 columns, df['Name'], df['Category'] and df['Sub_Category']; the Category and Sub_Category columns are empty.
For each row I would like to check, against several lists of words, whether the name of the object contains at least one word from one of the lists. Based on this first check I would like to assign a value to the Category column. If the name matches words from 2 different lists, I would like to assign both values to the object in the Category column.
Moreover, I would like to be able to identify which word was matched in which list, in order to assign a value to the Sub_Category column.
Until now I have been able to do it with only one list, but I am not able to identify which word was matched, and the code takes very long to run.
Here is my code (where I added an example of names found in my dataset as df['Name']):
import pandas as pd
import numpy as np

df = pd.DataFrame()
df['Name'] = ['vitrine murale vintage', 'commode ancienne', 'lustre antique', 'solex', 'sculpture médievale', 'jante voiture', 'lit et matelas', 'turbine moteur']
furniture_check = ['canape', 'chaise', 'buffet', 'table', 'commode', 'lit']
vehicle_check = ['solex', 'voiture', 'moto', 'scooter']
art_check = ['tableau', 'scuplture', 'tapisserie']

for idx, row in df.iterrows():
    for c in furniture_check:
        if c in row['Name']:
            df.loc[idx, 'Category'] = 'Meubles'
Any help would be appreciated
Here is an approach that expands lists, merges them and re-combines them.
df = pd.DataFrame({"name":['vitrine murale vintage','commode ancienne', 'lustre antique', 'solex', 'sculpture médievale', 'jante voiture', 'lit et matelas', 'turbine moteur']})
furniture_check = ['canape', 'chaise', 'buffet','table','commode','lit']
vehicle_check = ['solex','voiture','moto','scooter']
art_check = ['tableau','scuplture', 'tapisserie']
# put categories into a dataframe
dfcat = pd.DataFrame([{"category":"furniture","values":furniture_check},
{"category":"vechile","values":vehicle_check},
{"category":"art","values":art_check}])
# turn apace delimited "name" column into a list
dfcatlist = (df.assign(name=df["name"].apply(lambda x: x.split(" ")))
# explode list so it can be used as join. reset_index() to keep a copy of index of original DF
.explode("name").reset_index()
# merge exploded names on both side
.merge(dfcat.explode("values"), left_on="name", right_on="values")
# where there are multiple categoryies, make it a list
.groupby("index", as_index=False).agg({"category":lambda s: list(s)})
# but original index back...
.set_index("index")
)
# simple join and have names and list of associated categories
df.join(dfcatlist)
|   | name                   | category      |
|---|------------------------|---------------|
| 0 | vitrine murale vintage | nan           |
| 1 | commode ancienne       | ['furniture'] |
| 2 | lustre antique         | nan           |
| 3 | solex                  | ['vechile']   |
| 4 | sculpture médievale    | nan           |
| 5 | jante voiture          | ['vechile']   |
| 6 | lit et matelas         | ['furniture'] |
| 7 | turbine moteur         | nan           |
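The question also asks which word was matched, for the Sub_Category column. A possible extension of the same approach (a sketch, not in the original answer; the sub_category column name is an assumption) is to aggregate the matched words as well:

dfcatlist = (df.assign(name=df["name"].apply(lambda x: x.split(" ")))
             .explode("name").reset_index()
             .merge(dfcat.explode("values"), left_on="name", right_on="values")
             # keep both the category and the matched word ("values") per original row
             .groupby("index", as_index=False).agg({"category": list, "values": list})
             .set_index("index")
             .rename(columns={"values": "sub_category"}))
df.join(dfcatlist)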

compare a long list with strings in dataframe and on the basis of matching populate the dataframe in Python

I've aminor dataset I've to extract only computer science terms out of it so there is list1 to which i've to compare my dataset for this task.
https://www.aminer.org/oag2019
list1 = ['document types', 'surveys and overviews', 'reference works', 'general conference proceedings', 'biographies', 'general literature', 'computing standards, rfcs and guidelines', 'cross-computing tools and techniques',......]
total count of list1 is 2112 computer science terms from ACM.
data frame to which I've to compare(string comparison) list1 in a data frame column in the form of
df_train14year['keywords'].head()
0 "nmr spectroscopy","mass spectrometry","nanost...
1 "plk1","cationic dialkyl histidine","crystal s...
2 "case-control","child","fuel","hydrocarbons","...
3 "Ca2+ handling","CaMKII","cardiomyocyte","cont...
4
Name: keywords, dtype: object
Each of these lists in the dataframe has at most 10 keywords and at least 3, and there are millions of records in the dataframe.
So I have to compare each record's keywords with list1, and if more than 3 words match between the two lists, populate a dataframe with those values; substring matches may also be needed.
How can I do this task in an efficient way in Python? What I've done is loop over each keyword and compare it to the whole list, and there are three nested loops in it, so it is inefficient.
# for i in range(5):
#     df.loc[i] = ['<some value for first>', '<some value for second>', '<some value for third>']
count = 0
i = 0
for index, row in df_train14year.iterrows():
    # print("index", index)
    i = i + 1
    # if (i == 50):
    #     break
    for outr in row['keywords'].split(","):
        # print(count)
        if (count > 1):
            # print("found1")
            count = 0
            break
        for inr in computerList:
            # the two replace() lines below are skipped because pre-processing already removed the [ ] characters
            # outr = outr.replace("[", "")
            # outr = outr.replace("]", "")
            outr = outr.replace('"', "")
            # print("outr", outr, "inr", inr)
            if outr in inr:
                count = count + 1
            if (count > 10):
                # print("outr", outr, "inr", inr)
                # print("found2")
                # df12.loc[i] = [index, row['keywords']]
                # df12.insert(index, "keywords", row['keywords'])
                df14_4_match = df14_4_match.append({'abstract': row['abstract'], 'keywords': row['keywords'], 'title': row['title'], 'year': row['year']}, ignore_index=True)
                break
            # else:
            #     print('not found')
The rows in the data frame look like keywords = ["nmr spectroscopy","mass spectrometry","nanos"].
Preprocessing:
dfk['list_keywords'] = [[x for x in j.split('[')[1].split(']')[0].split('"')[1:-1] if x not in [',', ' , ', ', ']] for j in dfk['keywords']]
Convert each keyword list to a set:
dfk['set_keywords'] = dfk['list_keywords'].map(lambda x: set(x))
Take the intersection with the computer-science list (list1 from the question, here proceseedComputerList) to get the keywords that matched:
dfk['set_keywords'] = dfk['set_keywords'].map(lambda x: x.intersection(proceseedComputerList))
Get the number of matches:
dfk['len_keywords'] = dfk['set_keywords'].map(lambda x: len(x))
Sort in ascending order and inspect:
dfk.head()
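As a final step (a sketch, not in the original answer), the rows the question asks for can be pulled out by keeping only those with at least 3 matches; the threshold and column names follow the snippets above:

# keep rows where at least 3 keywords matched list1
df_matched = dfk[dfk['len_keywords'] >= 3]
df_matched.head()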

find most frequent pairs in a dataframe

Suppose I have a two-column dataframe where the first column is the ID of a meeting and the second is the ID of one of the participants in that meeting. Like this:
meeting_id,person_id
meeting0,person1234
meeting0,person4321
meeting0,person5555
meeting1,person4321
meeting1,person9999
# ... ~1 million rows
I want to find each person's top 15 co-participants, e.g. I want to know which 15 people most frequently participate in meetings with Brad.
As an intermediate step I wrote a script that takes the original dataframe and makes a person-to-person dataframe, like this:
person1234,person4321
person1234,person5555
person4321,person5555
person4321,person9999
...
But I'm not sure this intermediate step is necessary. Also, it's taking forever to run (by my estimate it should take weeks!). Here's the monstrosity:
import pandas as pd

links = []
lic = pd.read_csv('meetings.csv', sep=';', names=['meeting_id', 'person_id'], dtype={'meeting_id': str, 'person_id': str})
grouped = lic.groupby('person_id')

for i, group in enumerate(grouped):
    print(i, 'of', len(grouped))
    person_id = group[0].strip()
    if len(person_id) == 14:
        meetings = set(group[1]['meeting_id'])
        for meeting in meetings:
            lic_sub = lic[lic['meeting_id'] == meeting]
            people = set(lic_sub['person_id'])
            for person in people:
                if person != person_id:
                    tup = (person_id, person)
                    links.append(tup)

df = pd.DataFrame(links)
df.to_csv('links.csv', index=False)
Any ideas?
So here is one way: merge the frame with itself on meeting_id, then sort the two person columns so duplicate pairs line up.
import numpy as np

s = df.merge(df, on='meeting_id')
s[['person_id_x', 'person_id_y']] = np.sort(s[['person_id_x', 'person_id_y']].values, 1)
s = s.query('person_id_x != person_id_y').drop_duplicates()
s
meeting_id person_id_x person_id_y
1 meeting0 person1234 person4321
2 meeting0 person1234 person5555
5 meeting0 person4321 person5555
10 meeting1 person4321 person9999
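To get from these pairs to each person's top 15 co-participants (a sketch, not part of the original answer; it keeps both directions of each pair so that every person appears in the person_id_x column):

pairs = df.merge(df, on='meeting_id')
pairs = pairs[pairs['person_id_x'] != pairs['person_id_y']]
# count how often each ordered pair co-occurs across meetings
counts = (pairs.groupby(['person_id_x', 'person_id_y'])
               .size()
               .reset_index(name='n_meetings'))
# keep the 15 most frequent partners per person
top15 = (counts.sort_values('n_meetings', ascending=False)
               .groupby('person_id_x')
               .head(15))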

Create random groupings from list

I need to take a list of over 500 people and place them into groups of 15. The groups should be randomized so that we don't end up with groups where everyone's last name begins with "B", for example. But I also need to balance the groups of 15 for gender parity as close as possible. The list is in a 'students.csv' file with this structure:
Last, First, ID, Sport, Gender, INT
James, Frank, f99087, FOOT, m, I
Smith, Sally, f88329, SOC, f,
Cranston, Bill, f64928, ,m,
I was looking for some kind of solution in pandas, but I have limited coding knowledge. The code I've got so far just explores the data a bit.
import pandas as pd
data = pd.read_csv('students.csv', index_col='ID')
print(data)
print(data.Gender.value_counts())
First thing I would do is filter into two lists, one for each gender:
import random
import math

# filter the DataFrame rows by gender and keep them as lists of records
males = data[data.Gender == 'm'].to_dict('records')
females = data[data.Gender == 'f'].to_dict('records')
Next, shuffle the orders of the lists, to make it easier to select "randomly" while actually not having to choose random indices:
random.shuffle(males)
random.shuffle(females)
then, choose elements, while trying to stay more-or-less in line with the gender ratio:
# establish the number of groups and the size of each group
GROUP_SIZE = 15
GROUP_NUM = math.ceil(len(data) / GROUP_SIZE)

# make an empty list of groups to add each group to
groups = []

while len(groups) < GROUP_NUM and (len(males) > 0 or len(females) > 0):
    # calculate the gender ratio needed to balance this group
    num_males = round(len(males) / (len(males) + len(females)) * GROUP_SIZE)
    num_females = GROUP_SIZE - num_males
    # select that many people from the previously shuffled lists
    males_in_this_group = [males.pop(0) for n in range(num_males) if len(males) > 0]
    females_in_this_group = [females.pop(0) for n in range(num_females) if len(females) > 0]
    # put those two subsets together, shuffle to make it feel more random, and add this group
    this_group = males_in_this_group + females_in_this_group
    random.shuffle(this_group)
    groups.append(this_group)
This will ensure that the gender ratio in each group is as true to the original sample as possible. The last group will, of course, be smaller than the others, and will contain "whatever's left" from the other groups.
An approach using pandas: groups of 15 members, with the remainder in the very last group. The gender ratio stays roughly the same, to the accuracy that the pandas shuffle allows.
import pandas as pd
df = pd.read_csv('1.csv', skipinitialspace=True) # 1.csv contains sample data from the question
# shuffle data / pandas way
df = df.sample(frac=1).reset_index(drop=True)
# group size
SIZE = 15
# create column with group number
df['group'] = df.index // SIZE
# list of groups; groups[0] is a dataframe with the first group's members
groups = [
    df[df['group'] == num]
    for num in range(df['group'].max() + 1)]
Save dataframe to file:
# one csv-file
df.to_csv('2.csv')
# many csv-files
for num, group_df in enumerate(groups, 1):
    group_df.to_csv('group_{}.csv'.format(num))
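If tighter gender parity per group is wanted than a plain shuffle gives, one pandas-only variant (a sketch, assuming the same students.csv layout as in the question) is to deal each gender out round-robin across the groups after shuffling:

import math
import pandas as pd

GROUP_SIZE = 15
df = pd.read_csv('students.csv', skipinitialspace=True)

# shuffle first, then give each gender its own 0,1,2,... counter and take it modulo
# the number of groups, so every group receives an even share of each gender
df = df.sample(frac=1).reset_index(drop=True)
n_groups = math.ceil(len(df) / GROUP_SIZE)
df['group'] = df.groupby('Gender').cumcount() % n_groups

groups = [g for _, g in df.groupby('group')]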
