I have a list of co-authors:
ten_author_pairs = [('creutzig', 'gao'),
('creutzig', 'linshaw'),
('gao', 'linshaw'),
('jing', 'zhang'),
('jing', 'liu'),
('zhang', 'liu'),
('jing', 'xu'),
('briant', 'einav'),
('chen', 'gao'),
('chen', 'jing')]
From here I can generate a list of negative examples, i.e. author pairs which are unconnected, using the following code:
#generating negative examples -
from itertools import combinations
elements = list(set([e for l in ten_author_pairs for e in l])) # find all unique elements
complete_list = list(combinations(elements, 2)) # generate all possible combinations
#convert to sets to negate the order
set1 = [set(l) for l in ten_author_pairs]
complete_set = [set(l) for l in complete_list]
# find sets in `complete_set` but not in `set1`
ten_unconnected = [list(l) for l in complete_set if l not in set1]
print(len(ten_author_pairs))
print(len(ten_unconnected))
Next, I want to implement a link prediction problem for which I want to obtain a dataframe as follows:
author-pair jaccard Resource_Allocation Adamic_Adar Preferential cn_soundarajan_hopcroft within_inter_cluster link
creutzig-linshaw 0.25 0.25 0.25 0.25 0.25 0.25 1
Using the networkx documentation I can calculate these scores and get lists as output, but I am not able to put them together into a table as shown above.
For the positive examples (the list mentioned above), I can generate a dataframe using:
df = pd.DataFrame(ten_author_pairs, columns=['u1', 'u2'])
and then make a graph with:
G = nx.from_pandas_edgelist(df, 'u1', 'u2', create_using=nx.Graph())
After which say for jaccard index I can apply:
nx.jaccard_coefficient(G)
which returns an iterator of (u, v, score) tuples, one per node pair.
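For illustration, a minimal sketch of one way to materialize that iterator into a table (assuming the graph G built above; jaccard_df is just an illustrative name):
import pandas as pd
# each yielded tuple is (u, v, score)
jaccard_df = pd.DataFrame(
    list(nx.jaccard_coefficient(G)), columns=["u1", "u2", "jaccard"]
)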
The 'link' column is generated with the logic: 1 for co-author pairs and 0 for pairs in the negative examples.
But, I need all the respective scores as a table as mentioned.
Can anyone please help me with how to construct the above dataframe?
(The scores mentioned are just for example purposes, to indicate the kind of table I need.)
Oh -- it has been a good two years, but I just stumbled upon this... In case I understood you correctly, building on your basis:
from itertools import combinations
import pandas as pd
import networkx as nx
elements = list(set([e for l in ten_author_pairs for e in l]))
complete_list = list(combinations(elements, 2))
set1 = [set(l) for l in ten_author_pairs]
df = pd.DataFrame(set1, columns=["u1", "u2"])
G = nx.from_pandas_edgelist(df, "u1", "u2", create_using=nx.Graph())
Then defining the list of generators
list_generators = [
nx.jaccard_coefficient,
nx.resource_allocation_index,
nx.adamic_adar_index,
nx.preferential_attachment,
]
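The loop below relies on a helper get_df_network that the answer does not show; a minimal sketch of what it could look like, assuming each score frame is indexed by the node pair and carries a single column named after the generator:
def get_df_network(generator, graph):
    # each networkx link-prediction generator yields (u, v, score) tuples
    records = [(u, v, score) for u, v, score in generator(graph)]
    return pd.DataFrame(
        records, columns=["node_0", "node_1", generator.__name__]
    ).set_index(["node_0", "node_1"])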
Building the score dataframe:
dfx = pd.DataFrame()
for item_generator in list_generators:
    if dfx.shape[0]:
        dfx = dfx.merge(
            right=get_df_network(generator=item_generator, graph=G),
            left_index=True,
            right_index=True,
        )
    else:
        dfx = get_df_network(generator=item_generator, graph=G)
And finally merging in the link dataframe
df_link = (
    pd.DataFrame(set1, columns=["node_0", "node_1"])
    .set_index(["node_0", "node_1"])
    .assign(link=[1] * len(set1))
)
dfx.merge(df_link, left_index=True, right_index=True, how="outer").fillna(0)
could do the job?
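The two community-based scores from the desired table (cn_soundarajan_hopcroft and within_inter_cluster) are not in list_generators because they need a 'community' node attribute first; a hedged sketch of how they could be added before running the score loop (here with a single placeholder community, purely for illustration; a real analysis would use an actual community detection step):
for node in G.nodes():
    G.nodes[node]["community"] = 0  # placeholder community id

list_generators += [nx.cn_soundarajan_hopcroft, nx.within_inter_cluster]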
Related
I have 2 DataFrames of (df1) 35k and (df2) 76k rows, where I need to check whether the elements of df1["col1"] exist among the sub-elements of df2["col2"]. The code seems to work fine on the sample dataset below, but the runtime takes forever on the original one. Here is the for-loop code I used on the sample dataset:
import pandas as pd
post_token_list = [['wXrL3TbK'], ['wXmTQKw1'], ['wXvnlWej'], ['wXvXBjKp']]
tokens_list = [['wXv3qoPQ', 'wXvT7ylu', 'wXvnIJuH', 'wXvXH7vy', 'wXvDXSS1', 'wXvjVE1F', 'wXvPV6z1', 'wXvHF1uw',
'wXvH1q03', 'wXvnTlcr', 'wXvDEG9U', 'wXLfZtO6', 'wXvLDDDl', 'wXvHTgjk', 'wXvHDDr8', 'wXvPBLbu',
'wXvvxXHI', 'wXvPBFge', 'wXvLxSii', 'wXvDhk2h', 'wXv3Alan', 'wXvvQuKy', 'wXvvQ6LO', 'wXpHNjw9'],
['wXYr2lVk', 'wXXj7iDP', 'wXXXIsQr', 'wXQbXKz6', 'wXN3tMp1', 'wXMfZV5N', 'wXvnlWej', 'wXSDyEaW',
'wXQ7mM78', 'wXMPvojh', 'wXMjo-8G', 'wXLfZtO6', 'wXN3tMp1'],
['wXr_jZmX', 'wXr7D0AM', 'wXrzjhxL', 'wXrfjQNe', 'wXrnihqT', 'wXrjyqm5', 'wXr3CD4h', 'wXrnSZsy',
'wXrTieP7', 'wXLfZtO6', 'wXgHVwkc', 'wXdvewsV', 'wXrfxZeg', 'wXrLB7Zo', 'wXprtX71', 'wXrHhjtO',
'wXrzwKBt', 'wXqz-RlY', 'wXq_fp7F', 'wXq7Po7n', 'wXq7fC73', 'wXqzvRSW', 'wXqf_PQ3', 'wXML2vCd'],
['wXv3aQrv', 'wXvn6ONM', 'wXvfaG0M', 'wXvf6LIr', 'wXvjJBg_', 'wXvL6M-0', 'wXv7p2cd', 'wXv3poSs',
'wXvz5kUz', 'wXvrZz0_', 'wXv_YVCb', 'wXLfZtO6', 'wXvX5Hgi', 'wXvz3Ptg', 'wXvHJUU-', 'wXvr4fB7',
'wXvnlWej', 'wXv_YUrK', 'wXv7Id05', 'wXv7IYOV', 'wXvfYfLo', 'wXv7Y3AV', 'wXvT4_pE', 'wXvPovRt'],
['wXoDui-2', 'wXoT9yTg', 'wXmTQKw1', 'wXormLxu', 'wXMX-NNQ', 'wXo7kUfB', 'wXon0rt_', 'wXozT-3V',
'wXnvYjEc', 'wXnTn9D6', 'wXnLH7Cz', 'wXn_2HV_', 'wXnPGou9', 'wXnPVSNo', 'wXuG0sl3', 'wXnjAs7X',
'wXm38mLv', 'wXmnj5Oh', 'wXmfjQ2h', 'wXm_wXuD', 'wXlPOUmy', 'wXcfHkmx', 'wXQ_62cx', 'wXUD3qyx']]
df1 = pd.DataFrame({"col1": post_token_list})
df2 = pd.DataFrame({"col2": tokens_list})
query_bounce = []
def query_bounce_checker(dataset_clicked, dataset_loaded, col1, col2):
    for i in dataset_clicked[col1]:
        for j in i:
            [query_bounce.append(k) for k in dataset_loaded[col2] if j in k]
    return query_bounce
query_bounce_checker(df1, df2, "col1", "col2")
i, j, and k are used to access and compare the elements and sub-elements of the two respective columns.
Speed is a contributing factor for me, and the function written here is not fast enough for a dataset of this size.
If this is actually what you want, this should be pretty fast.
import numpy as np
np.intersect1d(np.hstack(df1.col1),np.hstack(df2.col2))
Output
array(['wXmTQKw1', 'wXvnlWej'], dtype='<U8')
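For context, np.hstack flattens the list-valued columns into flat 1-D arrays before the intersection; a tiny illustration of the intermediate steps:
import numpy as np

flat1 = np.hstack(df1.col1)            # all tokens from col1 in one flat array
flat2 = np.hstack(df2.col2)            # all tokens from col2 in one flat array
shared = np.intersect1d(flat1, flat2)  # sorted unique tokens present in both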
I am not sure if it is what you want. If you just want to check which values in df1 also exist in df2, you can transform two dataframes into arrays and use np.in1d() to do so.
Try this:
array1 = np.array((','.join(df1['col1'].apply(lambda x: ','.join(x)))).split(','))
array2 = np.array((','.join(df2['col2'].apply(lambda x: ','.join(x)))).split(','))
print(array1[np.in1d(array1,array2)])
Output:
['wXmTQKw1' 'wXvnlWej']
I have a dataframe that looks like the following, but with many rows:
import pandas as pd
data = {'intent': ['order_food', 'order_food','order_taxi','order_call','order_call','order_call','order_taxi'],
'Sent': ['i need hamburger','she wants sushi','i need a cab','call me at 6','she called me','order call','i would like a new taxi' ],
'key_words': [['need','hamburger'], ['want','sushi'],['need','cab'],['call','6'],['call'],['order','call'],['new','taxi']]}
df = pd.DataFrame (data, columns = ['intent','Sent','key_words'])
I have calculated the jaccard similarity using the code below (not my solution):
def lexical_overlap(doc1, doc2):
    words_doc1 = set(doc1)
    words_doc2 = set(doc2)
    intersection = words_doc1.intersection(words_doc2)
    return intersection
and modified the code given by @Amit Amola to compare overlapping words between every possible pair of rows, and created a dataframe out of it:
overlapping_word_list = []
for val in list(combinations(range(len(data_new)), 2)):
    overlapping_word_list.append(f"the shared keywords between {data_new.iloc[val[0],0]} and {data_new.iloc[val[1],0]} sentences are: {lexical_overlap(data_new.iloc[val[0],1],data_new.iloc[val[1],1])}")
# creating an overlap dataframe
banking_overlapping_words_per_sent = pd.DataFrame(overlapping_word_list, columns=['overlapping_list'])
@gold_cy's answer helped me, and I made some changes to it to get the output I like:
for intent in df.intent.unique():
    # loc returns a DataFrame but we need just the column
    rows = df.loc[df.intent == intent, ['intent', 'key_words', 'Sent']].values.tolist()
    combos = combinations(rows, 2)
    for combo in combos:
        x, y = rows  # this unpacking of rows (not combo) is what raises the error below
        overlap = lexical_overlap(x[1], y[1])
        print(f"Overlap of intent ({x[0]}) for ({x[2]}) and ({y[2]}) is {overlap}")
The issue is that when there are more instances of the same intent, I run into the error:
ValueError: too many values to unpack (expected 2)
and I do not know how to handle that for the many more examples I have in my dataset.
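For what it's worth, the immediate cause of the ValueError is that the inner loop unpacks rows (all rows for the intent) instead of the two-element combo produced by combinations; a minimal fix for that inner loop would be:
    for combo in combos:
        x, y = combo  # unpack the pair, not the full rows list
        overlap = lexical_overlap(x[1], y[1])
        print(f"Overlap of intent ({x[0]}) for ({x[2]}) and ({y[2]}) is {overlap}")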
Do you want this?
from itertools import combinations
from operator import itemgetter

items_to_consider = []
for item in list(combinations(zip(df.Sent.values, map(set, df.key_words.values)), 2)):
    keywords = list(map(itemgetter(1), item))
    intersect = keywords[0].intersection(keywords[1])
    if len(intersect) > 0:
        str_list = list(map(itemgetter(0), item))
        str_list.append(intersect)
        items_to_consider.append(str_list)

for i in items_to_consider:
    for item in i[2]:
        if item in i[0] and item in i[1]:
            print(f"Overlap of intent (order_food) for ({i[0]}) and ({i[1]}) is {item}")
I have a machine learning problem where I am calculating bigram Jaccard similarity of a pandas dataframe text column with values of a dictionary. Currently I am storing them as a list and then converting them to columns. This is proving to be very slow in production. Is there a more efficient way to do it?
These are the steps I currently follow:
For each key in dict:
1. Get bigrams for the pandas column and the dict[key]
2. Calculate Jaccard similarity
3. Append to an empty list
4. Store the list in the dataframe
5. Convert the list to columns
from itertools import tee, islice
import numpy as np

def count_ngrams(lst, n):
    tlst = lst
    while True:
        a, b = tee(tlst)
        l = tuple(islice(a, n))
        if len(l) == n:
            yield l
            next(b)
            tlst = b
        else:
            break

def n_gram_jaccard_similarity(str1, str2, n):
    a = set(count_ngrams(str1.split(), n))
    b = set(count_ngrams(str2.split(), n))
    intersection = a.intersection(b)
    union = a.union(b)
    try:
        return len(intersection) / float(len(union))
    except ZeroDivisionError:
        return np.nan

def jc_list(sample_dict, row, n):
    sim_list = []
    for key in sample_dict:
        sim_list.append(n_gram_jaccard_similarity(sample_dict[key], row["text"], n))
    return str(sim_list)
I use the above functions to build the bigram Jaccard similarity features as follows:
df["bigram_jaccard_similarity"]=df.apply(lambda row: jc_list(sample_dict,row,2),axis=1)
df["bigram_jaccard_similarity"] = df["bigram_jaccard_similarity"].map(lambda x:[float(i) for i in [a for a in [s.replace(',','').replace(']', '').replace('[','') for s in x.split()] if a!='']])
df[[i for i in sample_dict]] = pd.DataFrame(df["bigram_jaccard_similarity"].values.tolist(), index= df.index)
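As a side note, the string round-trip in the last two steps can be avoided if jc_list returns the list directly; a minimal alternative sketch reusing the same names (not the original code):
def jc_list(sample_dict, row, n):
    # return the similarities as a real list instead of its string representation
    return [
        n_gram_jaccard_similarity(sample_dict[key], row["text"], n)
        for key in sample_dict
    ]

scores = df.apply(lambda row: jc_list(sample_dict, row, 2), axis=1)
df[list(sample_dict)] = pd.DataFrame(scores.tolist(), index=df.index)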
Sample input:
df = pd.DataFrame(columns=["id","text"],index=None)
df.loc[0] = ["1","this is a sample text"]
import collections
sample_dict = collections.defaultdict()
sample_dict["r1"] = "this is sample 1"
sample_dict["r2"] = "is sample"
sample_dict["r3"] = "sample text 2"
Expected output: one new column per dictionary key (r1, r2, r3), each holding the bigram Jaccard similarity between the row's text and that key's value.
So, this is more difficult than I thought, due to some broadcasting issues with sparse matrices. Additionally, in the short period of time I was not able to fully vectorize it.
I added an additional text row to the frame:
df = pd.DataFrame(columns=["id","text"],index=None)
df.loc[0] = ["1","this is a sample text"]
df.loc[1] = ["2","this is a second sample text"]
import collections
sample_dict = collections.defaultdict()
sample_dict["r1"] = "this is sample 1"
sample_dict["r2"] = "is sample"
sample_dict["r3"] = "sample text 2"
We will use the following modules/functions/classes:
from sklearn.feature_extraction.text import CountVectorizer
from scipy.sparse import csr_matrix
import numpy as np
and define a CountVectorizer to create character-based n-grams:
ngram_vectorizer = CountVectorizer(ngram_range=(2, 2), analyzer="char")
Feel free to choose the n-grams you need. I'd advise taking an existing tokenizer and n-gram creator; you should find plenty of those. The CountVectorizer can also be tweaked extensively (e.g. convert to lowercase, get rid of whitespace, etc.).
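For word-level bigrams, closer to the word-based n-grams in the question, the vectorizer could instead be configured like this (an assumption; the rest of the approach stays the same):
ngram_vectorizer = CountVectorizer(ngram_range=(2, 2), analyzer="word", lowercase=True)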
We concatenate all the data:
all_data = np.concatenate((df.text.to_numpy(),np.array(list(sample_dict.values()))))
We do this because our vectorizer needs a common indexing scheme for all the tokens that appear.
Now let's fit the Count vectorizer and transform the data accordingly:
ngrammed = ngram_vectorizer.fit_transform(all_data) >0
ngrammed is now a sparse boolean matrix indicating which tokens appear in each row, rather than the counts as before. You can inspect ngram_vectorizer to find the mapping from tokens to column ids.
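As a quick, optional sanity check, the fitted vocabulary and the matrix shape can be inspected like this:
print(ngram_vectorizer.vocabulary_)  # maps each character bigram to its column index
print(ngrammed.shape)                # (text rows + dict entries, number of distinct bigrams)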
Next we want to compare every n-grammed entry from the sample dict against every row of our n-grammed text data. We need some magic here:
texts = ngrammed[:len(df)]
samples = ngrammed[len(df):]
text_rows = len(df)

jaccard_similarities = []
for key, ngram_sample in zip(sample_dict.keys(), samples):
    repeated_row_matrix = (csr_matrix(np.ones([text_rows, 1])) * ngram_sample).astype(bool)
    support = texts.maximum(repeated_row_matrix)
    intersection = texts.multiply(repeated_row_matrix).todense()
    jaccard_similarities.append(pd.Series((intersection.sum(axis=1) / support.sum(axis=1)).A1, name=key))
support is the boolean matrix that represents the union of the n-grams of the two comparands; intersection is only True where a token is present in both. Note that .A1 returns the underlying base array of the matrix object as a flat 1-D array.
Now
pd.concat(jaccard_similarities, axis=1)
gives
r1 r2 r3
0 0.631579 0.444444 0.500000
1 0.480000 0.333333 0.384615
You can also concat it to df and obtain:
pd.concat([df, pd.concat(jaccard_similarities, axis=1)], axis=1)
id text r1 r2 r3
0 1 this is a sample text 0.631579 0.444444 0.500000
1 2 this is a second sample text 0.480000 0.333333 0.384615
I have a function which I'm trying to apply in parallel and within that function I call another function that I think would benefit from being executed in parallel. The goal is to take in multiple years of crop yields for each field and combine all of them into one pandas dataframe. I have the function I use for finding the closest point in each dataframe, but it is quite intensive and takes some time. I'm looking to speed it up.
I've tried creating a pool and using map_async on the inner function. I've also tried doing the same with the loop for the outer function. The latter is the only thing I've gotten to work the way I intended it to. I can use this, but I know there has to be a way to make it faster. Check out the code below:
import pandas as pd
from multiprocessing import Pool
from geopy import distance  # assumed, given the use of distance.great_circle(...).feet

return_columns = []
return_columns_cb = lambda x: return_columns.append(x)

def getnearestpoint(gdA, gdB, retcol):
    dist = lambda point1, point2: distance.great_circle(point1, point2).feet

    def find_closest(point):
        distances = gdB.apply(
            lambda row: dist(point, (row["Longitude"], row["Latitude"])), axis=1
        )
        return (gdB.loc[distances.idxmin(), retcol], distances.min())

    append_retcol = gdA.apply(
        lambda row: find_closest((row["Longitude"], row["Latitude"])), axis=1
    )
    return append_retcol
def combine_yield(field):
    # field is a list of the files for the field I'm working with
    # lots of pre-processing
    # dfs in this case is a list of the dataframes for the current field
    # mdf is the dataframe with the most points which I popped from this list
    p = Pool()
    for i in range(0, len(dfs)):
        p.apply_async(getnearestpoint, args=(mdf, dfs[i], dfs[i].columns[-1]), callback=return_columns_cb)
    for col in return_columns:
        mdf = mdf.append(col)
    '''I unzip my points back to longitude and latitude here in the final
    dataframe so I can write to csv without tuples'''
    mdf[["Longitude", "Latitude"]] = pd.DataFrame(
        mdf["Point"].tolist(), index=mdf.index
    )
    return mdf
def multiprocess_combine_yield():
    '''do stuff to get dictionary below with each field name as key and values
    as all the files for that field'''
    yield_by_field = {'C01': ('files...'), ...}
    # The farm I'm working on has 30 fields and the loop below is too slow
    for k, v in yield_by_field.items():
        combine_yield(v)
I guess what I need help on is this: I envision something like using a pool to imap or apply_async each tuple of files in the dictionary. Then, within the combine_yield function applied to that tuple of files, I want to be able to run the distance function in parallel. That function bogs the program down because it calculates the distance between every point in each of the dataframes for each year of yield. The files average around 1200 data points, and then you multiply all of that by 30 fields, so I need something better. Maybe the efficiency improvement lies in finding a better way to pull in the closest point. I still need something that gives me the value from gdB, and the distance, because of what I do later on when selecting which rows to use from the 'mdf' dataframe.
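A hedged sketch of the outer-level parallelism described above, assuming combine_yield and yield_by_field as defined earlier (the answer that follows ends up replacing the inner distance computation entirely):
from multiprocessing import Pool

if __name__ == "__main__":
    with Pool() as pool:
        # one task per field; each call combines all years of yield for that field
        combined = pool.map(combine_yield, list(yield_by_field.values()))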
Thanks to @ALollz's comment, I figured this out. I went back to my getnearestpoint function and, instead of doing a bunch of Series.apply calls, I now use cKDTree from scipy.spatial to find the closest point, and then a vectorized haversine distance to calculate the true distances for each of these matched points. Much, much quicker. Here are the basics of the code:
import re
from itertools import zip_longest

import numpy as np
import pandas as pd
from scipy.spatial import cKDTree

def getnearestpoint(gdA, gdB, retcol):
    gdA_coordinates = np.array(
        list(zip(gdA.loc[:, "Longitude"], gdA.loc[:, "Latitude"]))
    )
    gdB_coordinates = np.array(
        list(zip(gdB.loc[:, "Longitude"], gdB.loc[:, "Latitude"]))
    )
    tree = cKDTree(data=gdB_coordinates)
    distances, indices = tree.query(gdA_coordinates, k=1)
    # These column names are done as so due to formatting of my 'retcols'
    df = pd.DataFrame.from_dict(
        {
            f"Longitude_{retcol[:4]}": gdB.loc[indices, "Longitude"].values,
            f"Latitude_{retcol[:4]}": gdB.loc[indices, "Latitude"].values,
            retcol: gdB.loc[indices, retcol].values,
        }
    )
    gdA = pd.merge(left=gdA, right=df, left_on=gdA.index, right_on=df.index)
    gdA.drop(columns="key_0", inplace=True)
    return gdA
def combine_yield(field):
    # same preprocessing as before
    for i in range(0, len(dfs)):
        mdf = getnearestpoint(mdf, dfs[i], dfs[i].columns[-1])
    # converted to radians so both sides of the haversine formula use the same units
    main_coords = np.deg2rad(np.array(list(zip(mdf.Longitude, mdf.Latitude))))
    lat_main = main_coords[:, 1]
    longitude_main = main_coords[:, 0]
    longitude_cols = [
        c for c in mdf.columns for m in [re.search(r"Longitude_B\d{4}", c)] if m
    ]
    latitude_cols = [
        c for c in mdf.columns for m in [re.search(r"Latitude_B\d{4}", c)] if m
    ]
    year_coords = list(zip_longest(longitude_cols, latitude_cols, fillvalue=np.nan))
    for i in year_coords:
        year = re.search(r"\d{4}", i[0]).group(0)
        year_coords = np.array(list(zip(mdf.loc[:, i[0]], mdf.loc[:, i[1]])))
        year_coords = np.deg2rad(year_coords)
        lat_year = year_coords[:, 1]
        longitude_year = year_coords[:, 0]
        diff_lat = lat_main - lat_year
        diff_lng = longitude_main - longitude_year
        d = (
            np.sin(diff_lat / 2) ** 2
            + np.cos(lat_main) * np.cos(lat_year) * np.sin(diff_lng / 2) ** 2
        )
        # 2.0902e7 is roughly the Earth's radius in feet
        mdf[f"{year} Distance"] = 2 * (2.0902 * 10 ** 7) * np.arcsin(np.sqrt(d))
    return mdf
Then I'll just do Pool.map(combine_yield, (v for k,v in yield_by_field.items()))
This has made a substantial difference. Hope it helps anyone else in a similar predicament.
I have a large dataset stored as a pandas Panel. I would like to count the occurrence of values < 1.0 on the minor_axis for each item in the panel. What I have so far:
#%% Creating the first DataFrame
import numpy as np
import pandas as pd

dates1 = pd.date_range('2014-10-19', '2014-10-20', freq='H')
df1 = pd.DataFrame(index=dates1)
n1 = len(dates1)
df1.loc[:, 'a'] = np.random.uniform(3, 10, n1)
df1.loc[:, 'b'] = np.random.uniform(0.9, 1.2, n1)

#%% Creating the second DataFrame
dates2 = pd.date_range('2014-10-18', '2014-10-20', freq='H')
df2 = pd.DataFrame(index=dates2)
n2 = len(dates2)
df2.loc[:, 'a'] = np.random.uniform(3, 10, n2)
df2.loc[:, 'b'] = np.random.uniform(0.9, 1.2, n2)

#%% Creating the panel from both DataFrames
dictionary = {}
dictionary['First_dataset'] = df1
dictionary['Second dataset'] = df2
P = pd.Panel.from_dict(dictionary)

#%% I want to count the number of values < 1.0 for all datasets in the panel
## Only for minor axis b, not minor axis a, stored separately for each dataset
for dataset in P:
    P.loc[dataset, :, 'b']  # I need to count the number of values < 1.0 in this pandas Series
To count all the "b" values < 1.0, I would first isolate b in its own DataFrame by swapping the minor axis and the items.
In [43]: b = P.swapaxes("minor","items").b
In [44]: b.where(b<1.0).stack().count()
Out[44]: 30
Thanks for thinking with me guys, but I managed to figure out a surprisingly easy solution after many hours of attempting. I thought I should share it in case someone else is looking for a similar solution.
for dataset in P:
    abc = P.loc[dataset, :, 'b']
    abc_low = sum(i < 1.0 for i in abc)
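A note for future readers: pd.Panel has since been deprecated and removed from pandas, but the same per-dataset count can be computed directly from the dictionary of DataFrames built above; a minimal sketch:
# number of 'b' values below 1.0, stored separately per dataset
counts = {name: int((frame['b'] < 1.0).sum()) for name, frame in dictionary.items()}
print(counts)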