I need to fetch the basic profile data (the complete HTML page) of a LinkedIn profile. I tried Python packages such as BeautifulSoup, but I get "access denied".
I have generated the API tokens for LinkedIn, but I am not sure how to incorporate them into the code.
Basically, I want to automate the scraping process by just providing the company name.
Please help. Thanks!
Beautiful Soup is a web-scraping library. Typically, people use it to parse data out of public websites or websites that don't have APIs. For example, you could use it to scrape the top 10 Google Search results.
Unlike web scrapers, an API lets you retrieve data from behind non-public websites. Furthermore, it returns the data in an easily readable XML or JSON format, so you don't have to "scrape" an HTML file for the specific data you care about.
To make an API call to LinkedIn, you need to use a Python HTTP request library. See this Stack Overflow post for examples.
Take a look at Step 4 of the LinkedIn API documentation. It shows a sample HTTP GET call.
GET /v1/people/~ HTTP/1.1
Host: api.linkedin.com
Connection: Keep-Alive
Authorization: Bearer AQXdSP_W41_UPs5ioT_t8HESyODB4FqbkJ8LrV_5mff4gPODzOYR
Note that you also need to send an "Authorization" header along with the HTTP GET call. This is where your token goes. You're probably getting "access denied" right now because you didn't set this header in your request.
Here's an example of how you would add that header to a request with the requests library.
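This is a minimal sketch; the token is the placeholder from the sample call above, and you would substitute your own.

import requests

ACCESS_TOKEN = 'AQXdSP_W41_UPs5ioT_t8HESyODB4FqbkJ8LrV_5mff4gPODzOYR'  # placeholder token from the docs

# the Authorization header carries the bearer token
headers = {'Authorization': 'Bearer ' + ACCESS_TOKEN}

response = requests.get('https://api.linkedin.com/v1/people/~', headers=headers)
response.raise_for_status()
print(response.text)  # XML or JSON body; parse it as described below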
And that should be it. When you make that request, it should return an XML or JSON response containing the data you want. You can then use an XML or JSON parser to pull out the specific fields you care about.
Related
I'm doing a personal project where I am trying to scrape HTML tables from a financial-data website using Python. I am able to successfully use the requests package to access public websites and extract information (using BeautifulSoup4 afterwards for processing); the code I am using is shown below:
import requests

# access the website (example_header stands in for the real request headers)
url = 'https://financial-data-url.ezproxy1.library.uniname.edu.com/path/to/financial/data'
headers = example_header
page = requests.get(url, headers=headers)
However, accessing the website normally requires logging in to my university's library database through an EZproxy server (as shown in the example URL). When I request the URL of the financial-data page after getting access through the library database, it returns what seems to be the university library's EZproxy page, where I need to click "login" before being directed to the financial-data page.
Is there some credential provision that I may be missing in the request function, or potentially a different way of passing the proxy server to the URL so that the request does not end up on the proxy server login page?
I found that the fastest and most effective workaround for this problem is to use the Selenium browser-automation package (https://selenium-python.readthedocs.io/).
Selenium makes it very easy to replicate a login, as well as navigation within the browser, just as a person would. IMO, its simplicity may far outweigh the benefits of calling the web page directly, depending on the use case (it's not the right tool when speed and efficiency are the primary goals, but if those aren't major constraints it works quite well).
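A minimal sketch of that approach, assuming a standard form login; the proxy login URL and the element IDs are placeholders for whatever the real EZproxy page uses.

from selenium import webdriver
from selenium.webdriver.common.by import By

driver = webdriver.Chrome()  # or webdriver.Firefox()

# log in through the EZproxy page first (URL and field IDs are placeholders)
driver.get('https://ezproxy1.library.uniname.edu.com/login')
driver.find_element(By.ID, 'username').send_keys('my_username')
driver.find_element(By.ID, 'password').send_keys('my_password')
driver.find_element(By.ID, 'login-button').click()

# once authenticated, load the data page and hand the HTML to BeautifulSoup as before
driver.get('https://financial-data-url.ezproxy1.library.uniname.edu.com/path/to/financial/data')
html = driver.page_source
driver.quit()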
When I access a specific webpage, it sends a specific POST request, the response to which I want to parse. How do I make my script, receiving only the webpage's URL, parse that specific request's response?
(Ideally, in Python.)
So, I've found out that the 'seleniumwire' library for Python is one way to access requests made by a browser when loading a page.
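A rough sketch of how that can look; the page URL and the substring used to pick out the POST request of interest are placeholders.

from seleniumwire import webdriver  # pip install selenium-wire

driver = webdriver.Chrome()
driver.get('https://example.com/page-that-triggers-the-post')  # placeholder URL

# iterate over every request the browser made while loading the page
for request in driver.requests:
    if request.method == 'POST' and request.response and 'api/endpoint' in request.url:
        print(request.url)
        print(request.response.body[:200])  # raw bytes of the response to parse

driver.quit()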
When scraping simple sites, a simple GET or POST request with a few parameters is enough to get the data from the site.
However, when scraping an OAuth-secured site, in addition to the URL-specific parameters it's necessary to send OAuth parameters (oauth_consumer_key, oauth_signature, oauth_nonce, oauth_signature_method, oauth_timestamp, oauth_token).
How could Scrapy be configured so that scrapy.Request automatically includes those parameters? (Obviously after supplying the API keys, etc.)
If that can't be done, is it possible to extract those parameters from a request created from another library (like oauth2) and put them into a scrapy request?
So I've scraped websites before, but this time I am stumped. I am attempting to search for a person on Biography.com and retrieve his/her biography. But whenever I search the site using urllib2 and query the URL http://www.biography.com/search/, I get a blank page with no data in it.
When I look into the source generated in the browser by clicking View Source, I still do not see any data. When I use Chrome's developer tools, I find some data but still no links leading to the biography.
I have tried changing the User Agent, adding referrers, using cookies in Python but to no avail. If someone could help me out with this task it would be really helpful.
I am planning to use this text for my NLP project and worst case, I'll have to manually copy-paste the text. But I hope it doesn't come to that.
Chrome/Chromium's Developer Tools (or Firebug) is definitely your friend here. I can see that the initial search on Biography's site is made via a call to a Google API, e.g.
https://www.googleapis.com/customsearch/v1?q=Barack%20Obama&key=AIzaSyCMGfdDaSfjqv5zYoS0mTJnOT3e9MURWkU&cx=011223861749738482324%3Aijiqp2ioyxw&num=8&callback=angular.callbacks._0
The search term I used is in the q= part of the query string: q=Barack%20Obama.
This returns JSON, inside of which there is a key named link whose value is the URL of the article of interest:
"link": "http://www.biography.com/people/barack-obama-12782369"
Visiting that page shows me that this is generated by a request to:
http://api.saymedia-content.com/:apiproxy-anon/content-sites/cs01a33b78d5c5860e/content-customs/#published/#by-custom-type/ContentPerson/#by-slug/barack-obama-12782369
which returns JSON containing HTML.
So, replacing the last part of the link barack-obama-12782369 with the relevant info for the person of interest in the saymedia-content link may well pull out what you want.
To implement (see the sketch after the caveat below):
1. Use urllib2 (or requests) to do the search via their Google API call, using urllib2.urlopen(url) or requests.get(url). Replace Barack%20Obama with a URL-escaped search string, e.g. Bill%20Clinton.
2. Parse the JSON using Python's json module to extract the string that gives you the http://www.biography.com/people link. From this, pull out the trailing slug (barack-obama-12782369 above).
3. Use urllib2 or requests to make a saymedia-content API request, replacing barack-obama-12782369 after #by-slug/ with whatever you extracted in step 2; i.e. do another urllib2.urlopen on that URL.
4. Parse the JSON from the response of this second request to extract the content you want.
(Caveat: This is provided that there are no session-based strings in those two API calls that might expire.)
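A rough sketch of steps 1 and 2 using requests and Python 3's urllib.parse (the question uses urllib2, but the flow is the same); the items/link key names match the Custom Search response described above, and the key/cx values copied from the example URL may well stop working.

import urllib.parse
import requests

# step 1: run the same Google Custom Search call the site makes
search_term = 'Bill Clinton'
search_url = (
    'https://www.googleapis.com/customsearch/v1?q=' + urllib.parse.quote(search_term) +
    '&key=AIzaSyCMGfdDaSfjqv5zYoS0mTJnOT3e9MURWkU'
    '&cx=011223861749738482324%3Aijiqp2ioyxw&num=8'
)
results = requests.get(search_url).json()

# step 2: pull out the biography.com/people link and keep its trailing slug
link = next(item['link'] for item in results.get('items', [])
            if '/people/' in item['link'])
slug = link.rstrip('/').split('/')[-1]  # e.g. 'barack-obama-12782369'
print(slug)

# steps 3 and 4: repeat the same requests.get(...).json() pattern against the
# saymedia-content URL above, substituting slug after #by-slug/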
Alternatively, you can use Selenium to visit the website, do the search and then extract the content.
You will most likely need to manually copy and paste, as biography.com is a completely JavaScript-based site, so it can't be scraped with traditional methods.
You can discover an API URL with HttpFox (a Firefox add-on), e.g. http://www.biography.com/.api/item/search?config=published&query=marx
This returns JSON that you can process, searching for /people/ to retrieve the biography links.
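For instance, a small sketch of that idea; the exact JSON structure isn't shown here, so the regex search over the raw response text is just one structure-agnostic way to pull out the /people/ links.

import re
import requests

# hit the search API discovered with HttpFox
resp = requests.get('http://www.biography.com/.api/item/search',
                    params={'config': 'published', 'query': 'marx'})

# pull any /people/ biography links out of the JSON text
links = set(re.findall(r'https?://www\.biography\.com/people/[\w-]+', resp.text))
print(links)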
Or you can use a screen scraper like Selenium.
I have to retrieve some text from a website called morningstar.com. To access that data I have to log in. Once I log in and provide the URL of the web page, I get the HTML of a normal (not logged-in) user, so I am not able to access that information. Any solutions?
BeautifulSoup is for parsing HTML once you've already fetched it. You can fetch the HTML using any standard URL-fetching library. I prefer curl but, since you tagged your post python, the built-in urllib2 also works well.
If you're saying that after logging in the response HTML is the same as for those who are not logged in, I'd guess that your login is failing for some reason. If you are using urllib2, are you making sure to store the cookie properly after your first login and then passing this cookie to urllib2 when you send the request for the data?
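Something along these lines, assuming a standard form login and Python 2's urllib2 to match the question; the login URL and form field names are placeholders for whatever morningstar.com actually uses.

import cookielib
import urllib
import urllib2

# a cookie jar shared by every request made through this opener
cj = cookielib.CookieJar()
opener = urllib2.build_opener(urllib2.HTTPCookieProcessor(cj))

# first request: log in (URL and field names are placeholders)
login_data = urllib.urlencode({'username': 'my_user', 'password': 'my_pass'})
opener.open('http://www.morningstar.com/login', login_data)

# second request: the session cookie stored above is sent automatically
html = opener.open('http://www.morningstar.com/some/protected/page').read()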
It would help if you posted the code you are using to make the two requests (the initial login, and the attempt to fetch the data).