How to automatically identify citations of the same paper? - python

Consider 3 ways to cite the same paper:
cite1 = "Yoshua Bengio, Réjean Ducharme, Pascal Vincent and Christian Jauvin, A Neural Probabilistic Language Model (2003), in: Journal of Machine Learning Research, 3(1137--1155)"
cite2 = "Yoshua Bengio, Réjean Ducharme, Pascal Vincent, Christian Jauvin. (2003) A Neural Probabilistic Language Model"
cite3 = "Bengio Y, Ducharme R, Vincent P, Jauvin C. (2003) A Neural Probabilistic Language Model"
A simple way of automatically identifying citations of the same paper is to compute the similarity of those citations with the difflib module in the Python Standard Library:
from difflib import SequenceMatcher as smatch
def similar(x, y): return smatch(None, x.strip(), y.strip()).ratio()
similar(cite1, cite2) # 0.721
similar(cite1, cite3) # 0.553
similar(cite2, cite3) # 0.802
Unfortunately, the similarity metric ranges from 0.553 to 0.802, so it's not clear what threshold should be set. If the threshold is too low, citations of different papers could be mistaken for the same paper. But if the threshold is too high, we would miss some genuine matches.
Are there better solutions?

It is important to consider what makes a citation unique.
Based on your example, it appears that the combination of authors, the title of the article, and the year it was published constitutes a unique citation.
This means you can parse the names and compare how close they are (because the third example lists the names differently). Parse the title, and it should match 100%. Parse the year, and it should also match 100%.
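For instance, assuming the citation has already been split into author, title, and year fields (the dicts below are filled in by hand for illustration, not produced by a real parser), the comparison could look roughly like this:
from difflib import SequenceMatcher

def same_paper(a, b):
    # Title and year must match exactly; author strings are compared fuzzily
    # because the formats differ ("Yoshua Bengio" vs "Bengio Y").
    if a["title"].lower() != b["title"].lower():
        return False
    if a["year"] != b["year"]:
        return False
    author_sim = SequenceMatcher(None, a["authors"].lower(), b["authors"].lower()).ratio()
    return author_sim > 0.5  # threshold only has to absorb name-format variation

# hypothetical pre-parsed records for cite1 and cite3
c1 = {"authors": "Yoshua Bengio, Réjean Ducharme, Pascal Vincent, Christian Jauvin",
      "title": "A Neural Probabilistic Language Model", "year": "2003"}
c3 = {"authors": "Bengio Y, Ducharme R, Vincent P, Jauvin C",
      "title": "A Neural Probabilistic Language Model", "year": "2003"}
print(same_paper(c1, c3))  # True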

Setting aside neural networks and NLP, which would be a rather complicated approach, I would approach this problem by preprocessing the data.
A few things you can do:
- Create short names: Yoshua Bengio => Bengio Y
- Normalize the names: Réjean Ducharme -> rejean ducharme (a sketch of these first two steps follows this list)
- Extract the author part of the string, the title part, and the "leftovers". Calculate a similarity for each of the parts and average the results.
- Extract the year of publication and make it a three-variable problem.
- Use additional metadata if available (paper field, citation index, etc.).
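A rough sketch of the short-name and normalization steps (the naive splitting assumes a simple "First Last" name order; multi-part surnames or initials-only inputs would need extra handling):
import unicodedata

def short_name(full_name):
    # Collapse "Yoshua Bengio" into "bengio y" (surname plus first initial).
    parts = full_name.split()
    return f"{parts[-1]} {parts[0][0]}".lower()

def normalize(name):
    # Lowercase and strip accents: "Réjean Ducharme" -> "rejean ducharme".
    decomposed = unicodedata.normalize("NFKD", name)
    return "".join(c for c in decomposed if not unicodedata.combining(c)).lower()

print(short_name("Yoshua Bengio"))   # bengio y
print(normalize("Réjean Ducharme"))  # rejean ducharme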
The above approach works if your problem is limited to these three bibliography types.
If you have large variations among the bibliographies (e.g. applying this to the entire Springer/IEEE database), you should look into machine learning approaches.
While I can't suggest the right model off the top of my head, I remember this paper being close to your problem.
Among other approaches, if you have a large bibliography dataset, you can attempt semi-supervised approaches like word2vec/node2vec or k-means and see if the resulting similarity score is accurate enough for you.
A word of advice:
- In some cases very similar paper names come from the same research teams, or short names are identical while the long ones differ: W. Xu can be either Wang Xu or Wei Xu, yet both are transcribed as Xu W.
- In other cases the same author appears under different names: Réjean Ducharme and Rejean Ducharme.
- Paper titles can have variations: Conference of awesome discoveries and Awesome discoveries, conference of.


Understanding results of word2vec gensim for finding substitutes [closed]

I have implemented the word2vec model on transaction data (link) of a single category.
My goal is to find substitutable items from the data.
The model is giving results, but I want to make sure that it gives results based on customers' historical data (considering context) and not just based on content (semantic data). The idea is similar to a recommendation system.
I have implemented this using the gensim library, where I passed the data (products) in the form of a list of lists.
E.g.
[['BLUE BELL ICE CREAM GOLD RIM', 'TILLAMK CHOC CHIP CK DOUGH IC'],
 ['TALENTI SICILIAN PISTACHIO GEL', 'TALENTI BLK RASP CHOC CHIP GEL'],
 ['BREYERS HOME MADE VAN ICE CREAM',
  'BREYERS HOME MADE VAN ICE CREAM',
  'BREYERS COFF ICE CREAM']]
Here, each sublist is the past one year of purchase history of a single customer.
import numpy as np
from gensim.models import Word2Vec

# train word2vec model
model = Word2Vec(window=5, sg=0,
                 alpha=0.03, min_alpha=0.0007,
                 seed=14)
model.build_vocab(purchases_train, progress_per=200)
model.train(purchases_train, total_examples=model.corpus_count,
            epochs=10, report_delay=1)

# extract all vectors
X = []
words = list(model.wv.index_to_key)
for word in words:
    x = model.wv.get_vector(word)
    X.append(x)
Y = np.array(X)
Y.shape
def similar_products(v, n=3):
    # extract most similar products for the input vector
    ms = model.wv.similar_by_vector(v, topn=n + 1)[1:]
    # extract name and similarity score of the similar products
    new_ms = []
    for j in ms:
        pair = (products_dict[j[0]][0], j[1])
        new_ms.append(pair)
    return new_ms
similar_products(model.wv['BLUE BELL ICE CREAM GOLD RIM'])
Results:
[('BLUE BELL ICE CREAM BROWN RIM', 0.7322707772254944),
('BLUE BELL ICE CREAM LIGHT', 0.4575043022632599),
('BLUE BELL ICE CREAM NSA', 0.3731085956096649)]
To get an intuitive understanding of word2vec and how its results are obtained, I created a dummy dataset where I wanted to find substitutes of 'FOODCLUB VAN IC PAIL'.
If two products appear in the same basket multiple times, then they are substitutes.
Looking at the data, the first substitute should be 'FOODCLUB CHOC CHIP IC PAIL', but the results I obtained are:
[('FOODCLUB NEAPOLITAN IC PAIL', 0.042492810636758804),
('FOODCLUB COOKIES CREAM ICE CREAM', -0.04012278839945793),
('FOODCLUB NEW YORK VAN IC PAIL', -0.040678512305021286)]
Can anyone help me understand the intuitive working of the word2vec model in gensim? Will each product be treated as a word and each customer list as a sentence?
Why are my results so absurd on the dummy dataset? How can I improve them?
Which hyperparameters play a significant role with respect to this model? Is negative sampling required?
You may not get a very good intuitive understanding of usual word2vec behavior using these sorts of product-baskets as training data. The algorithm was originally developed for natural-language texts, where texts are runs of tokens whose frequencies, & co-occurrences, follow certain indicative patterns.
People certainly do use word2vec on runs-of-tokens that aren't natural language - like product baskets, or logs-of-actions, etc – but to the extent such tokens have very-different patterns, it's possible extra preprocessing or tuning will be necessary, or useful results will be harder to get.
As just a few ways customer-purchases might be different from real language, depending on what your "pseudo-texts" actually represent:
the ordering within a text might be an artifact of how you created the data-dump rather than anything meaningful
the nearest-neighbors to each token within the window may or may not be significant, compared to more distant tokens
customer ordering patterns might in general not be as reflective of shades-of-relationships as words-in-natural-language text
So it's not automatic that word2vec will give interesting results here, for recommendations.
That's especially the case with small datasets, or tiny dummy datasets. Word2vec requires lots of varied data to pack elements into interesting relative positions in a high-dimensional space. Even small demos usually have a vocabulary (count of unique tokens) of tens-of-thousands, with training texts that provide varied usage examples of every token dozens of times.
Without that, the model never learns anything interesting/generalizable. That's especially the case if trying to create a many-dimensions model (say the default vector_size=100) with a tiny vocabulary (just dozens of unique tokens) and few usage examples per token. And it only gets worse if tokens appear fewer than the default min_count=5 times – then they're ignored entirely. So don't expect anything interesting to come from your dummy data, at all.
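If you still want to experiment on a tiny basket dataset anyway, a hedged starting point is to shrink the model to the data rather than keeping the defaults; the parameter values below are illustrative guesses, not tuned recommendations:
from gensim.models import Word2Vec

# two baskets from the question, just as stand-in training data
purchases_train = [
    ['BLUE BELL ICE CREAM GOLD RIM', 'TILLAMK CHOC CHIP CK DOUGH IC'],
    ['TALENTI SICILIAN PISTACHIO GEL', 'TALENTI BLK RASP CHOC CHIP GEL'],
]

model = Word2Vec(
    sentences=purchases_train,
    vector_size=10,   # far fewer dimensions than the default 100, to match a tiny vocabulary
    window=10,        # basket order may be arbitrary, so use a wide window
    min_count=1,      # keep products that occur fewer than the default 5 times
    sg=1,             # skip-gram; with this little data CBOW vs skip-gram is a judgment call
    epochs=100,       # many passes to compensate for very few examples
    seed=14,
)
print(model.wv.most_similar('BLUE BELL ICE CREAM GOLD RIM', topn=2))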
If you want to develop an intuition, I'd try some tutorials & other goals with real natural language text 1st, with a variety of datasets & parameters, to get a sense of what has what kind of effects on result usefulness – & only after that try to adapt word2vec to other data.
Negative-sampling is the default, & works well with typical datasets, especially as they grow large (where negative-sampling suffers less of a performance hit than hierarchical-softmax with large vocabularies). But a toggle between those two modes is unlikely to cause giant changes in quality unless there are other problems.
Sufficient data, of the right kind, is the key – & then tweaking parameters may nudge end-result usefulness in a better direction, or shift it to be better for certain purposes.
But more specific parameter tips are only possible with clearer goals, once some baseline is working.

Ways of obtaining a similarity metric between two full text documents?

So imagine I have three text documents, for example (say, 3 randomly generated texts).
Document 1:
"Whole every miles as tiled at seven or. Wished he entire esteem mr oh by. Possible bed you pleasure civility boy elegance ham. He prevent request by if in pleased. Picture too and concern has was comfort. Ten difficult resembled eagerness nor. Same park bore on be...."
Document 2:
"Style too own civil out along. Perfectly offending attempted add arranging age gentleman concluded. Get who uncommonly our expression ten increasing considered occasional travelling. Ever read tell year give may men call its. Piqued son turned fat income played end wicket..."
If I want to obtain in Python (using libraries) a metric of how similar these 2 documents are to a third one (in other words, which of the 2 documents is more similar to a third one), what would be the best way to proceed?
Edit: I have seen other questions answered by comparing individual sentences to other sentences, but I am not interested in that; I want to compare one full text (consisting of related sentences) against another full text and obtain a number (which, for example, may be bigger than the one obtained with a different document that is less similar to the target one).
There is no simple answer to this question, as similarity measures will perform better or worse depending on the particular task you want to perform.
Having said that, you do have a couple of options for comparing blocks of text. This post compares and ranks several different ways of computing sentence similarity, which you can then aggregate to perform full-document similarity. How to aggregate this will also depend on your particular task. A simple, but often well-performing, approach is to average the sentence similarities of the 2 (or more) documents.
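One hedged way to do that aggregation (naive sentence splitting on periods, TF-IDF cosine similarity as the sentence measure, and averaging each sentence's best match; any of the measures ranked in that post could be substituted):
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

def doc_similarity(doc_a, doc_b):
    # Naive sentence split; a proper sentence tokenizer would be better.
    sents_a = [s for s in doc_a.split(".") if s.strip()]
    sents_b = [s for s in doc_b.split(".") if s.strip()]
    tfidf = TfidfVectorizer().fit(sents_a + sents_b)
    sims = cosine_similarity(tfidf.transform(sents_a), tfidf.transform(sents_b))
    # For each sentence in A, take its closest sentence in B, then average.
    return sims.max(axis=1).mean()

print(doc_similarity("Whole every miles as tiled at seven or. Wished he entire esteem mr oh by.",
                     "Style too own civil out along. Perfectly offending attempted add arranging age."))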
Other useful links for this topic include:
Introduction to Information Retrieval (free book)
Doc2Vec (from gensim, for paragraph embeddings, which is probably very suitable for your case)
You could try the Simphile NLP text similarity library (disclosure: I'm the author). It offers several language-agnostic methods: JaccardSimilarity, CompressionSimilarity, EuclidianSimilarity. Each has its advantages, but all work well on full-document comparison:
Install:
pip install simphile
This example shows Jaccard, but is exactly the same with Euclidian or Compression:
from simphile import jaccard_similarity
text_a = "I love dogs"
text_b = "I love cats"
print(f"Jaccard Similarity: {jaccard_similarity(text_a, text_b)}")

Feature extraction

Suppose I have been given data sets with headers :
id, query, product_title, product_description, brand, color, relevance.
Only id and relevance are in numeric format, while all the others consist of words and numbers. Relevance is the relevancy or ranking of a product with respect to a given query. For example: query = "abc" and product_title = "product_x" --> relevance = "2.3"
In the training set, all these fields are filled, but in the test set relevance is not given and I have to predict it using some machine learning algorithm. I am having trouble determining which features I should use in such a problem. For example, I could use TF-IDF here. What other features can I obtain from such data sets?
Moreover, if you can refer me to any book/resources specifically on the topic of 'feature extraction', that would be great. I always feel troubled in this phase. Thanks in advance.
I think there is no book that will give you the answers you need, as feature extraction is the phase that relates directly to the problem being solved and the available data; the only tip you will find is to create features that describe the data you have. In the past I worked on a problem similar to yours, and some features I used were:
Number of query words in product title.
Number of query words in product description.
n-gram counts
tf-idf
Cosine similarity
All this after some preprocessing, like converting all text to upper (or lower) case, stemming, and standard dictionary normalization. A minimal sketch of two of these features follows below.
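(The field names here follow the question; the preprocessing is deliberately minimal, so treat this as a sketch rather than a ready-made feature set.)
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

def query_overlap(query, title):
    # Fraction of query words that appear in the product title.
    q = set(query.lower().split())
    t = set(title.lower().split())
    return len(q & t) / len(q) if q else 0.0

def tfidf_cosine(query, description):
    # Cosine similarity between TF-IDF vectors of the query and the description.
    vecs = TfidfVectorizer().fit_transform([query.lower(), description.lower()])
    return cosine_similarity(vecs[0], vecs[1])[0, 0]

print(query_overlap("abc widget", "product_x abc deluxe widget"))  # 1.0
print(tfidf_cosine("abc widget", "the abc widget in blue"))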
Again, this depends on the problem and the data, and you will not find a direct answer; it's like posting the question "I need to develop a product selling system, how do I do it? Is there any book?". You will find books on programming and software engineering, but you will not find a book on developing your specific system; you'll have to use general knowledge and creativity to craft your solution.

Finding relationships among words in text

In text, sometimes words tend to point to the same object.
For example: John is an actor, his father Abraham was a doctor
So here his points to John, and if we have the question Who is John's father? or What is John's father's occupation?, we should be able to answer it, but I don't know how to achieve this.
Using lexical analysis and sentence parsing, we can get the VP, NP, N, etc. from the sentence. This could help: https://pypi.python.org/pypi/pylinkgrammar
Latent semantic analysis (LSA) and probabilistic latent semantic analysis (PLSA) provide relations and can be used to analyze two-mode and co-occurrence data, but it's not clear how they can be used here.
This is more of a semantic and syntactic analysis problem.
Any suggestion or reference for this would be much appreciated.
What you describe is called coreference resolution for the former problem (what does his refer to? John!) and relation extraction for the latter (that is, job(John, actor), job(Abraham, doctor), and father(John, Abraham)).
There are tons of studies on these subjects. Hopefully, the ACL Anthology is here to help:
coreference resolution
relation extraction
There's a specific NLTK-dependent library that I think fits your case perfectly: https://code.google.com/p/nltk-drt/
This PDF explains in detail how it works: https://code.google.com/p/nltk-drt/downloads/detail?name=NLTK-DRT.pdf

Defining the context of a word - Python

I think this is an interesting question, at least for me.
I have a list of words, let's say:
photo, free, search, image, css3, css, tutorials, webdesign, tutorial, google, china, censorship, politics, internet
and I have a list of contexts:
Programming
World news
Technology
Web Design
I need to try and match words with the appropriate context/contexts if possible.
Maybe discovering word relationships in some way.
Any ideas?
Help would be much appreciated!
This sounds like it's more of a categorization/ontology problem than NLP. Try WordNet for a standard ontology.
I don't see any real NLP in your stated problem, but if you do need some semantic analysis or a parser try NLTK.
Where do these words come from? Do they come from real texts? If they do, then it is a classic data mining problem. What you need to do is turn your set of documents into a matrix where the rows represent the documents the words came from and the columns represent the words in those documents.
For example, if you have two documents like this:
D1: Need to find meaning.
D2: Need to separate Apples from oranges
your matrix will look like this:

      Need  to  find  meaning  Apples  Oranges  Separate  From
D1:    1    1    1      1        0       0         0       0
D2:    1    1    0      0        1       1         1       1

This is called a term-by-document matrix.
Having collected these statistics, you can use algorithms like K-Means to group similar documents together. Since you already know how many concepts you have, your task should be somewhat easier. K-Means is a very slow algorithm, so you can try to optimize it using techniques such as SVD.
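A hedged sketch of that pipeline with scikit-learn; the two extra sentences are made up just to give K-Means something to cluster, and TruncatedSVD stands in for the SVD step:
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import TruncatedSVD
from sklearn.cluster import KMeans

docs = [
    "Need to find meaning.",                 # D1 from above
    "Need to separate Apples from oranges",  # D2 from above
    "Apples and oranges need washing",       # invented toy documents, only so the
    "Finding meaning takes time",            # two clusters have more than one member
]

# term-by-document matrix: rows are documents, columns are words
X = CountVectorizer().fit_transform(docs)

# optional dimensionality reduction before clustering
X_reduced = TruncatedSVD(n_components=2).fit_transform(X)

# n_clusters would be the number of contexts/concepts you expect
labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(X_reduced)
print(labels)  # documents with the same label end up in the same group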
I just found this a couple days ago: ConceptNet
It's a commonsense ontology, so it might not be as specific as you would like, but it has a python API and you can download their entire database (currently around 1GB decompressed). Just keep in mind their licensing restrictions.
If you read the papers that were published by the team that developed it, you may get some ideas on how to relate your words to concepts/contexts.
The answer to your question obviously depends on the target taxonomy you are trying to map your terms into. Once you have decided on this, you need to figure out how fine-grained the concepts should be. WordNet, as suggested in other responses, will give you synsets, i.e. sets of terms which are more or less synonymous, but which you will have to map to concepts like 'Web Design' or 'World News' by some other mechanism, since these are not encoded in WordNet. If you're aiming at a very broad semantic categorization, you could use WordNet's higher-level concept nodes, which differentiate, e.g. (going up the hierarchy), humans from animals, animate beings from plants, substances from solids, concrete from abstract things, etc.
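As a rough illustration of walking up that hierarchy with NLTK's WordNet interface (note that a term like css3 simply has no WordNet entry, which is exactly the mapping gap described above):
import nltk
nltk.download("wordnet", quiet=True)
from nltk.corpus import wordnet as wn

def broad_categories(word):
    # Return the hypernym path (root down to the sense) for the word's first sense, if any.
    synsets = wn.synsets(word)
    if not synsets:
        return []  # e.g. 'css3' is not covered by WordNet
    return [s.name() for s in synsets[0].hypernym_paths()[0]]

print(broad_categories("photo"))       # starts at entity.n.01 and narrows down
print(broad_categories("censorship"))
print(broad_categories("css3"))        # []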
Another kind-of-taxonomy which may be quite useful to you is the Wikipedia category system. This is not just a spontaneous idea I just came up with, but there has been a lot of work on deriving real ontologies from Wikipedia categories. Take a look at the Java Wikipedia Library - the idea would be to find a wikipedia article for the term in question (e.g. 'css3'), extract the categories this article belongs to, and pick the best ones with respect to some criterion (i.e. 'programming', 'technology', and 'web-development'). Depending on what you're trying to do this last step (choosing the best of several given categories) may or may not be difficult.
See here for a list of other ontologies / knowledge bases you could use.
