I am trying to automate the task of searching for opportunities (tenders) in 40+ websites, for a company. The opportunities are usually displayed in table format. They have a title, date published, and a clickable link that takes you to a detailed description of what the opportunity is.
One website example is:
http://www.eib.org/en/about/procurement/index.htm
The goal would be to retrieve the new opportunities that are posted every day and that fit specific criteria. So I need to look for specific keywords within the opportunities' titles. These keywords are the fields and regions in which the company has previous experience.
My question is: after I extract these tables, with the tenders' titles, into a dataframe, how do I search for the right opportunities and sort them by relevance (given a list of keywords)? Do I use NLP in this case and turn the words in the titles into binary vectors (0s and 1s)? Or are there other, simpler methods I should be looking at?
Thanks in advance!
To sort the tenders by relevance, you first need to define what relevance means.
In this case you could count the number of occurrences of your keywords in the tender, and that would be your relevance score. You can then keep only the tenders that contain at least one keyword.
This is a first try; you can improve it by adding keywords, or by assigning a higher score when a keyword appears in the title rather than in the detailed description...
The task you might be trying to solve here is information retrieval: ranking documents (the tenders) by their relevance to a query (your keywords).
You can then use weighting schemes like TF-IDF, BM25, etc. But it depends on your needs; maybe counting the keywords is more than enough!
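A minimal sketch of the keyword-counting approach on a pandas dataframe of scraped tenders (the dataframe contents, column names and keyword list below are made up for illustration):

import pandas as pd

# Hypothetical dataframe of scraped tenders; column names are assumptions.
tenders = pd.DataFrame({
    "title": ["Road rehabilitation works in Kenya",
              "IT consultancy services",
              "Water supply project in Kenya"],
})

# Example keyword list: the fields and regions the company has experience in.
keywords = ["kenya", "water", "road"]

def relevance(title):
    # Count keyword occurrences in the title, case-insensitively.
    return sum(title.lower().count(kw) for kw in keywords)

tenders["relevance"] = tenders["title"].apply(relevance)

# Keep only tenders that contain at least one keyword, sorted by score.
shortlist = tenders[tenders["relevance"] > 0].sort_values("relevance", ascending=False)
print(shortlist[["title", "relevance"]])

The same scoring function can be swapped out later for a TF-IDF or BM25 score without changing the rest of the pipeline.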
I'm implementing a search engine, and so far I am done with the parts for web crawling, storing the results in the index, and retrieving results for the search keywords entered by the user. However, I would like the search results to be more specific. Let's say I'm searching for "Shoe shops in Hyderabad". Is there any NLP library in Python that can process the text and assign higher weights to important words, in this case "shoes" and "Hyderabad"?
Thanks.
I don't think one approach is going to solve the entire problem here. Your question is broad, and it will take multiple steps to get the best results. Here is how I would approach the problem:
Create an N-gram analyser with Lucene and query it. Lucene also allows phrase queries; "Shoe shops in Hyderabad" is a good fit for that.
Use cosine similarity to treat "Shoe shops in Hyderabad" and "Footwear shops in Hyderabad" similarly.
Also think of a linguistic angle. Simple POS tagging and a role-based rule engine can help you get much smarter results for queries like "Shoe shops in Hyderabad" or "Shoes under 500 bucks", where a very finite word set of "in", "under", "on", etc. can be assigned rules on location, comparison, and so on. This point assumes you are looking at the English language; you will have to build this layer separately for each language.
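A rough sketch of the POS-tagging point, assuming NLTK and its default English tokenizer/tagger (my choice for illustration, not a requirement):

import nltk

# One-time downloads for the default NLTK English models:
# nltk.download("punkt"); nltk.download("averaged_perceptron_tagger")

query = "Shoe shops in Hyderabad"
tagged = nltk.pos_tag(nltk.word_tokenize(query))

# Keep nouns/proper nouns as the "important" words to weight higher,
# and treat short function words (in, under, on, ...) as rule triggers.
important = [word for word, tag in tagged if tag.startswith("NN")]
triggers = [word for word, tag in tagged if tag == "IN"]

print(tagged)      # e.g. roughly [('Shoe', 'NN'), ('shops', 'NNS'), ('in', 'IN'), ('Hyderabad', 'NNP')]
print(important)
print(triggers)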
Hope this helps.
I think the question is good (I was looking for something similar last week) and of course, as other people mention, the question is too broad. But I think you can approach it using an information retrieval system. I can recommend the Lemur project and specifically Indri; it includes a lot of customization features for queries, so it is possible to weight using n-grams, TF-IDF (as the previous two answers suggest) or just use your own criteria. If you want to use Indri, check this tutorial/introduction; something about weighting is at page 56.
Good luck!
Suppose I have been given data sets with these headers:
id, query, product_title, product_description, brand, color, relevance.
Only id and relevance are numeric, while all the others consist of words and numbers. Relevance is the relevancy or ranking of a product with respect to a given query. For example: query = "abc" and product_title = "product_x" --> relevance = "2.3"
In the training set all these fields are filled, but in the test set relevance is not given and I have to find it using some machine learning algorithms. I am having trouble determining which features I should use in such a problem. For example, I think I should use TF-IDF here. What other features can I obtain from such data sets?
Moreover, if you can refer me to any book/resources specifically on the topic of 'feature extraction', that would be great. I always feel troubled in this phase. Thanks in advance.
I think there is no book that will give you the answers you need, as feature extraction is the phase that relates directly to the problem being solved and the existing data; the only tip you will find is to create features that describe the data you have. In the past I worked on a problem similar to yours, and some of the features I used were:
Number of query words in product title.
Number of query words in product description.
n-gram counts
TF-IDF
Cosine similarity
All this after some preprocessing like converting all text to upper (or lower) case, stemming, and standard dictionary normalization.
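As an illustration, the first two features in the list above only take a few lines of Python; this sketch assumes the query, title and description have already been lower-cased and stemmed as described:

def overlap_features(query, title, description):
    # Assumes the inputs are already lower-cased / stemmed.
    q = set(query.split())
    t = set(title.split())
    d = set(description.split())
    return {
        "query_words_in_title": len(q & t),
        "query_words_in_description": len(q & d),
        "title_coverage": len(q & t) / max(len(q), 1),  # fraction of the query covered by the title
    }

print(overlap_features("red running shoe",
                       "lightweight red running shoe for men",
                       "a durable shoe designed for daily running"))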
Again, this depends on the problem and the data, and you will not find a direct answer. It's like posting the question: "I need to develop a product selling system, how do I do it? Is there any book?" You will find books on programming and software engineering, but you will not find a book on developing your specific system; you'll have to use general knowledge and creativity to craft your solution.
I have 300k+ HTML documents from which I want to extract postal addresses. The data has different structures, so regex won't work.
I have done a heap of reading on NLP and NLTK for Python; however, I am still struggling with where to start on this.
Is this approach called Part-of-Speech Tagging or Chunking / Partial Parsing? I can't find any document on how to actually TAG a page so I can train a model on it, or even what I should be training.
So my questions:
What is this approach called?
How can I tag some documents to train from?
Qn: Which NLP task is closely related to this task?
Ans: The task of detecting postal addresses can be viewed as a Named-Entity Recognition (NER) task. But I suggest viewing the task as simply sequence labelling on the HTML (i.e. your input data) and then performing some standard machine learning classification.
Qn: How can I tag some documents to use as training data?
Ans: What you can do is:
Label each word or each line as Begin, Inside or Outside (B/I/O)
Choose a supervised classification method
Decide what the features are (here's a hint: feature selection)
Build the model (basically just run the classification software with the configured features)
Voila, the output should give you B, I and O labels; then just delete all the instances labelled O and you will be left with the lines/words that are addresses
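A minimal sketch of that recipe using scikit-learn, classifying whole lines rather than words; the features, the tiny training set and the choice of logistic regression are all assumptions for illustration:

import re
from sklearn.feature_extraction import DictVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

def line_features(line):
    # Very simple hand-crafted features for one line of extracted text.
    return {
        "has_zip": bool(re.search(r"\b\d{5}(-\d{4})?\b", line)),
        "has_digit": any(c.isdigit() for c in line),
        "has_street_word": bool(re.search(r"\b(St|Ave|Road|Rd|Blvd|Suite)\b", line, re.I)),
        "num_tokens": len(line.split()),
    }

# Tiny illustrative training set: each line hand-labelled B(egin), I(nside) or O(utside).
train_lines = ["Contact us today!", "123 Main St", "Springfield, IL 62704", "Thanks for visiting"]
train_labels = ["O", "B", "I", "O"]

model = make_pipeline(DictVectorizer(), LogisticRegression(max_iter=1000))
model.fit([line_features(l) for l in train_lines], train_labels)

# Predict on a new line; keep only lines not labelled O.
print(model.predict([line_features("456 Oak Ave, Suite 2")]))

In practice a sequence model that looks at neighbouring lines (e.g. a CRF) tends to work better than classifying each line independently, but the overall recipe is the same.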
Apple calls their software that does this "Data Detectors" (be careful, it's patented -- they won an injunction against HTC Android phones over this). More generally, I think this application is called Information Extraction.
Strip the text out of the HTML page (unless there is a way from the HTML to identify the address text, such as a div with a particular class), then build a set of rules that match the address formats used.
If there are postal addresses in several countries then the formats may be markedly different but within a country, the format is the same (with some tweaks) or it is not valid.
Within the US, for example, addresses are 3 or 4 lines (including the person). There is usually a zip code (5 digits optionally followed by four more). Other countries have postal codes in different formats.
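For example, a rough US-specific rule could look for lines ending in a two-letter state abbreviation plus a ZIP code; this sketch is deliberately incomplete (it doesn't validate the state code):

import re

# Rough rule: a line ending in "<something>, XX 12345" or "..., XX 12345-6789".
us_address_line = re.compile(r",\s*[A-Z]{2}\s+\d{5}(-\d{4})?\s*$")

for line in ["Cleveland, OH 44114", "Call us at 555-0100", "Portland, OR 97201-3450"]:
    print(line, "->", bool(us_address_line.search(line)))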
Unless your goal is 100% accuracy on all addresses then you probably should aim for extracting as many addresses as you can within the budget for the task.
It doesn't seem like a task for NLP unless you want to use Named Entity identification to find cities, countries etc.
Your task is called information extraction, but that's a very, very broad concept. Luckily your task is more limited (street addresses), but you don't give a lot of information:
What countries are the addresses in? An address in Tokyo looks very different from one in Cleveland. Your odds of succeeding are much better if you're interested in addresses from a limited number of countries; you can develop a solution for each of them. If we're talking about a very limited number, you could code a recognizer manually.
What kind of webpages are we talking about? Are they a random collection, or can you group them into a limited number of websites and formats? Where do the addresses appear? (I.e., are there any contextual clues you can use to zero in on them?)
I'll take a worst-case scenario for question 2: the pages are completely disparate and the address could be anywhere. I'm not sure what the state of the art is, but I'd approach it as a chunking problem.
To get any kind of decent results, you need a training set. At a minimum, a large collection of addresses from the same locations and in the same style (informal, incomplete, complete) as the addresses you'll be extracting. Then you can try to coax decent performance out of a chunker (probably with post-processing).
PS. I wouldn't just discard the HTML markup. It contains information about document structure which could be useful. I'd add structural markup (paragraphs, emphasis, headings, lists, displays) before you throw out the HTML tags.
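A small sketch of that idea with BeautifulSoup (the library choice is mine): extract the text with a newline separator so block boundaries survive instead of flattening the page into one long string.

from bs4 import BeautifulSoup

html = "<html><body><h1>Contact</h1><p>123 Main St</p><p>Springfield, IL 62704</p></body></html>"
soup = BeautifulSoup(html, "html.parser")

# separator="\n" keeps each block element on its own line, so line-based
# address detection (or chunking) still sees the document structure.
text = soup.get_text(separator="\n", strip=True)
print(text.splitlines())   # ['Contact', '123 Main St', 'Springfield, IL 62704']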
I am a newbie in Python and have been trying my hand at different problems which introduce me to different modules and functionalities (I find it a good way of learning).
I have googled around a lot but haven't found anything close to a solution to the problem.
I have a large data set of Facebook posts from various groups that use Facebook as a medium to broadcast information.
I want to make groups out of these posts, putting together the ones that are content-wise the same.
For example, one of the posts is "xyz.com is selling free domains. Go register at xyz.com"
and another is "Everyone needs to register again at xyz.com. Due to server failure, all data has been lost."
These are similar, as they both ask you to go to the group's website and register.
P.S.: Just a clarification: if either of the links had been abc.com, the posts wouldn't have been similar.
Priority goes to the source and then to the action (the action here being registering).
Is there a simple way to do it in Python? (a module, maybe?)
I know it requires some sort of clustering algorithm (correct me if I am wrong); my question is: can Python make this job easier for me somehow? Some module or anything?
Any help is much appreciated!
Assuming you have a function called geturls that takes a string and returns a list of urls contained within, I would do it like this:
from collections import defaultdict

# Group posts by the URLs they mention: posts that share a URL end up together.
groups = defaultdict(list)
for post in facebook_posts:
    for url in geturls(post):
        groups[url].append(post)
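For completeness, here is a rough sketch of what such a geturls helper could look like; the regex is an assumption and will miss or over-match some URLs:

import re

def geturls(text):
    # Very rough URL/domain extractor, good enough for posts like the examples above.
    return re.findall(r"(?:https?://)?(?:www\.)?([a-z0-9-]+\.[a-z]{2,})", text.lower())

print(geturls("Everyone needs to register again at xyz.com."))   # ['xyz.com']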
That greatly depends on your definition of "content-wise the same". A straightforward approach is to use a so-called Term Frequency - Inverse Document Frequency (TF-IDF) model.
Simply put: make a long list of all words in all your posts, filter out stop-words (articles, determiners, etc.), and for each document (= post) count how often each term occurs, multiplying that count by the importance of the term (its inverse document frequency, calculated as the log of the ratio of the total number of documents to the number of documents in which the term occurs). This way, words which are very rare are more important than common words.
You end up with a huge table in which every document (still, we're talking about group posts here) is represented by a (very sparse) vector of terms. Now you have a metric for comparing documents. As your documents are very short, only a few terms will have significantly high weights, so similar documents might be the ones where the same term achieves the highest score (i.e. the highest component of the document vectors is the same), or maybe where the Euclidean distance between the three highest values is below some parameter. That sounds very complicated, but (of course) there's a module for that.
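The module could be, for example, scikit-learn. A minimal sketch of the whole idea, using the two posts from the question plus one unrelated post (the similarity threshold is arbitrary):

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

posts = [
    "xyz.com is selling free domains. Go register at xyz.com",
    "Everyone needs to register again at xyz.com. Due to server failure, all data has been lost.",
    "Check out the photos from our last meetup!",
]

vectors = TfidfVectorizer(stop_words="english").fit_transform(posts)
similarity = cosine_similarity(vectors)

# Naive grouping: any pair of posts above a similarity threshold is reported as "same content".
threshold = 0.2   # arbitrary; tune on real data
for i in range(len(posts)):
    for j in range(i + 1, len(posts)):
        if similarity[i, j] >= threshold:
            print(f"posts {i} and {j} look similar (score {similarity[i, j]:.2f})")

On real data you would likely run a proper clustering algorithm on top of the similarity matrix rather than pairwise thresholding.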
I've been searching for an open-source solution to suggest a category given a question or text.
For example, "Who is Lady Gaga?" would probably return 'Entertainment', 'Music', or 'Celebrity'.
"How many strike outs are there in baseball?" would give me 'Baseball' or 'Sport'.
The categorization doesn't have to be perfect but should be somewhat close.
Also is there anywhere I can get a list of popular categories?
This is a document classification problem - your "document" is simply the query or text.
You'll first need to decide what the list of possible categories is. "Who is Lady Gaga?" could be Entertainment, Celebrity, Questions-In-English, Biography, People, etc. Next you'll apply a decision framework to assign a score for each category to the text. The highest score is its category - as long as it's above a noise threshold and there isn't a second-place category that's too close to differentiate. Decision frameworks can include approaches like a Bayesian network or a set of custom rules.
Some open source projects that implement classifiers include:
Classifier4J
Matlab Classification Toolbox
POPFile (for email)
OpenNLP Maximum Entropy Package
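If you'd rather stay in Python, a scikit-learn Naive Bayes classifier is one simple way to implement the scoring idea above; this is only a sketch with a made-up, far-too-small training set:

from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

# Toy training data; in practice you need many labelled examples per category.
texts = [
    "concert album singer pop star tour",
    "movie actor film celebrity premiere",
    "baseball pitcher strike out home run inning",
    "football league match goal score",
]
labels = ["Music", "Entertainment", "Baseball", "Sport"]

model = make_pipeline(CountVectorizer(), MultinomialNB())
model.fit(texts, labels)

print(model.predict(["who is the pop singer behind this album?"]))
print(model.predict_proba(["how many strike outs are there in baseball?"]))  # score per category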
Screen scrape Wolfram Alpha.
http://www.wolframalpha.com/input/?i=Who+is+lady+gaga
http://www.wolframalpha.com/input/?i=What+is+baseball
You can probably get a good list of categories from dmoz.
Not much of an answer, but perhaps this categorized dictionary would help:
http://www.provalisresearch.com/wordstat/WordNet.html
I imagine you could extract the uncommon words from the string, look them up in the categorized dictionary, and return the categories that get the most matches on your terms. It'll be tricky to deal with pop culture references like "Lady Gaga", though...maybe you could do a Google search and analyze the results of that.
Others have done quite a bit of work on your behalf, so I'd suggest just using something like the OpenCalais API. There's a Python wrapper for the API at http://code.google.com/p/python-calais/.
"Who is Lady Gaga?" seems to be too short a piece of text for them to give a decent response. However, if you take the trouble to do a two-step process and grab the first paragraph from Wikipedia for Lady Gaga, then supply that to the OpenCalais API, you get very good results.
You can check it out quickly by just cutting and pasting the first paragraph from Wikipedia into the OpenCalais viewer. The result is a classification into the topic "Entertainment Culture" with a 100% confidence estimate.
Similarly, the baseball example returns "sports" as the topic with further social tags of "recreation", "baseball" etc.
Edit: Here's another thought prompted by Calais' use of social tags: sending the Wikipedia URL for Lady Gaga to the delicious API with
curl -k https://user:password@api.del.icio.us/v1/posts/suggest?url=http://en.wikipedia.org/wiki/Lady_gaga
returns
<?xml version="1.0" encoding="UTF-8"?>
<suggest>
<recommended>music</recommended>
<recommended>wikipedia</recommended>
<recommended>wiki</recommended>
<recommended>people</recommended>
<recommended>bio</recommended>
<recommended>cool</recommended>
<recommended>facts</recommended>
<popular>music</popular>
<popular>gaga</popular>
<popular>ladygaga</popular>
<popular>wikipedia</popular>
<popular>lady</popular>
etc. Should be easy enough to ignore the wikipedia/wiki type entries.
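Filtering those tags is straightforward; a small sketch using the (truncated) response above, closed off so it parses as XML:

import xml.etree.ElementTree as ET

# The response from the curl call above, shortened and closed off for parsing.
response = """<?xml version="1.0" encoding="UTF-8"?>
<suggest>
  <recommended>music</recommended>
  <recommended>wikipedia</recommended>
  <recommended>wiki</recommended>
  <recommended>people</recommended>
  <recommended>bio</recommended>
  <popular>music</popular>
  <popular>gaga</popular>
  <popular>ladygaga</popular>
  <popular>wikipedia</popular>
</suggest>"""

root = ET.fromstring(response)
tags = [el.text for el in root if el.tag in ("recommended", "popular")]

# Drop the generic wikipedia/wiki style entries; the rest are candidate categories.
categories = [t for t in tags if t not in ("wikipedia", "wiki")]
print(categories)   # ['music', 'people', 'bio', 'music', 'gaga', 'ladygaga']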