I am a newbie to Python and web scraping.
I am trying to extract information about the test components of clinical diagnostic tests from this link: https://labtestsonline.org/tests-index
The tests index has a list of names of test components for various clinical tests. Clicking on each of those names takes you to another page containing details about the individual test component. From this page I would like to extract the part which has the common questions,
and finally put together a data frame containing the names of the test components in one column and each question from the common questions as the rest of the columns (as shown below):
Names | how_its_used | when_it_is_ordered | what_does_test_result_mean
So far I have only managed to get the names of the test components.
import requests
import pandas as pd
from bs4 import BeautifulSoup

url = 'https://labtestsonline.org/tests-index'
response = requests.get(url)
soup = BeautifulSoup(response.content, 'lxml')
# print(soup.prettify())  # uncomment to inspect the page structure

# Get the names of the test components from the index
names_list = []
for i in soup.select("a[hreflang*=en]"):
    names_list.append(i.text)

# Convert the list to a dataframe
names = pd.DataFrame({'col': names_list})
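To get the rest of the way, you could follow each link and pull the question sections out of the detail pages, building one row per test. Here is a rough sketch of that next step; the h2 selector and the idea that each answer sits in the heading's next sibling are guesses, so inspect the real detail pages in your browser's developer tools and adjust:

from urllib.parse import urljoin

rows = []
for a in soup.select("a[hreflang*=en]"):
    href = a.get('href')
    if not href:
        continue
    detail_url = urljoin(url, href)
    detail_soup = BeautifulSoup(requests.get(detail_url).content, 'lxml')

    row = {'Names': a.text}
    # Hypothetical markup: each common question is an <h2>, with the
    # answer in the following sibling tag. Verify against the real page.
    for heading in detail_soup.select("h2"):
        question = heading.get_text(strip=True)
        answer = heading.find_next_sibling()
        if answer is not None:
            row[question] = answer.get_text(strip=True)
    rows.append(row)

df = pd.DataFrame(rows)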
I suggest that you take a look at the open source web scraping library Scrapy. It will help you with many of the concerns that you might run into when scraping websites, such as:
Following the links on each page.
Scraping data from pages that match a particular pattern, e.g. you might only want to scrape the /detail pages, while from the other pages you just scrape links to crawl.
Extracting data with lxml and CSS selectors.
Concurrency, allowing you to crawl multiple pages at the same time, which will greatly speed up your scraper.
It's very easy to get going, and there are a lot of resources out there on how to build simple to advanced web scrapers using the Scrapy library.
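For a flavour of what that looks like, here is a minimal spider sketch covering the points above; the URL, the "/detail" pattern, and the selectors are placeholders, not the real markup of any particular site:

import scrapy

class TestsSpider(scrapy.Spider):
    name = "tests"
    start_urls = ["https://example.com/tests-index"]  # placeholder URL

    def parse(self, response):
        # Follow every link on the index page; Scrapy fetches them concurrently.
        for href in response.css("a::attr(href)").getall():
            if "/detail" in href:  # only scrape pages matching a pattern
                yield response.follow(href, callback=self.parse_detail)

    def parse_detail(self, response):
        # Extract data with CSS selectors on the detail page.
        yield {
            "name": response.css("h1::text").get(),
            "url": response.url,
        }

You can run a standalone spider like this with scrapy runspider spider.py -o items.json, without setting up a full Scrapy project.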
I'm new to web scrapers and I prefer to use Python. Does anyone have any ideas for the easiest way to scrape job descriptions and put them into an Excel file? Which scraper would you use?
It depends. For a dynamic website, Selenium is great. Selenium is a tool that automates web actions. Beautiful Soup is another option. Beautiful Soup doesn't automate website actions; it will just scrape website data. In my opinion, Beautiful Soup is easier to learn: one basic introduction will be all you need. As for the Excel file, there are several libraries you could use; that is more of a preference.
However, for your project I would go with Beautiful Soup.
As for the process of learning, YouTube is a great place to find tutorials; there are several for both. It's also really easy to find help with issues with either one on here.
To give you a hint as to the general structure of your program, I would suggest something like this (a rough sketch follows the steps):
First step: open an Excel file; this file will remain open the whole time.
Second step: the web scraper locates the HTML tag containing the job descriptions.
Third step: use a for loop to cycle through each job description within this tag.
Fourth step: for each tag, retrieve the data and send it to the Excel sheet.
Fifth step: once you're done, close the Excel file.
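Here is a minimal sketch of those five steps using Beautiful Soup and openpyxl; the URL and the job-description class are made up, so swap in the real site and its actual tags:

import requests
from bs4 import BeautifulSoup
from openpyxl import Workbook

# First step: create the workbook; it stays in memory until we save it.
wb = Workbook()
ws = wb.active
ws.append(["Job description"])  # header row

url = "https://example.com/jobs"  # placeholder URL
soup = BeautifulSoup(requests.get(url).text, "html.parser")

# Second and third steps: locate the tag and loop over each job description.
# The class name here is hypothetical; inspect the real page to find it.
for tag in soup.find_all("div", class_="job-description"):
    # Fourth step: retrieve the data and send it to the sheet.
    ws.append([tag.get_text(strip=True)])

# Fifth step: save and close the file.
wb.save("jobs.xlsx")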
Libraries I personally use: here
This is generally the boilerplate code most people probably use to start web scraping:
import requests
import re
from bs4 import BeautifulSoup
from pprint import pprint
from os.path import dirname, join

# Directory the script lives in, handy for writing output files next to it
current_dir = dirname(__file__)
print(current_dir)

url_loop = "https://test.com"  # requests needs the scheme, not just "test.com"
r = requests.get(url_loop)

# A string you can look for in r.text to detect a failed page
error = "The page cannot be displayed because an internal server error has occurred."

soup = BeautifulSoup(r.text, 'html.parser')
requests is how you send HTTP requests.
bs4 is how you parse the page and extract specific info from it, such as all h1 tags.
pprint just formats the result nicely.
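For example, continuing from the boilerplate above, pulling out all h1 tags and pretty-printing them might look like this:

# Extract the text of every <h1> on the page and pretty-print the list
h1_texts = [h1.get_text(strip=True) for h1 in soup.find_all("h1")]
pprint(h1_texts)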
As for using the collected data in excel: Here
Good luck!
When I try to parse https://www.forbes.com/ for learning purposes and run the code, it only parses one page, I mean, the home page.
How can I parse an entire website, I mean, all the pages of a site?
My attempted code is given below:
import re
import pandas as pd
from bs4 import BeautifulSoup
from urllib.request import urlopen

html_page = urlopen("http://www.bdjobs.com/")
soup = BeautifulSoup(html_page, "html.parser")

# Collect every absolute link on the page
links = []
for link in soup.find_all('a', attrs={'href': re.compile("^http")}):
    links.append(link.get('href'))

# Export to a csv file
df = pd.DataFrame(links)
df.to_csv('link.csv')
# print(df)
Can you please tell me how I can parse entire websites, not just one page?
You have a couple of alternatives; it depends on what you want to achieve.
Write your own crawler
Similar to what you are trying to do in your code snippet: fetch a page from the website, identify all the interesting links on this page (using XPath, regular expressions, ...) and iterate until you have visited the whole domain.
This is probably most suitable for learning the basics of crawling, or to get some information quickly as a one-off task.
You'll have to be careful about a couple of things, such as not visiting the same link twice and limiting the domain(s) to avoid wandering off to other websites, etc.
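A bare-bones version of such a crawler might look like this: breadth-first, with a visited set and a domain check, and error handling kept minimal on purpose:

from collections import deque
from urllib.parse import urljoin, urlparse
from urllib.request import urlopen
from bs4 import BeautifulSoup

start_url = "http://www.bdjobs.com/"
domain = urlparse(start_url).netloc

visited = set()
queue = deque([start_url])

while queue:
    url = queue.popleft()
    if url in visited:
        continue  # don't visit the same link twice
    visited.add(url)
    try:
        soup = BeautifulSoup(urlopen(url), "html.parser")
    except Exception:
        continue  # skip pages that fail to load
    print(url)
    for a in soup.find_all("a", href=True):
        link = urljoin(url, a["href"])
        # Limit crawling to the starting domain
        if urlparse(link).netloc == domain and link not in visited:
            queue.append(link)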
Use a web scraping framework
If you are looking to perform some serious scraping, for a production application or at a large scale, consider using a framework such as Scrapy.
It solves a lot of common problems for you, and it is a great way to learn advanced web scraping techniques, by reading the documentation and diving into the code.
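As a taste, Scrapy's CrawlSpider can express the whole "visit every page on the domain" idea in a few lines. This is only a sketch with a placeholder domain and selector:

import scrapy
from scrapy.spiders import CrawlSpider, Rule
from scrapy.linkextractors import LinkExtractor

class SiteSpider(CrawlSpider):
    name = "site"
    allowed_domains = ["example.com"]  # placeholder; keeps the crawl on one domain
    start_urls = ["http://example.com/"]

    # Follow every in-domain link and call parse_page on each page found
    rules = (Rule(LinkExtractor(), callback="parse_page", follow=True),)

    def parse_page(self, response):
        yield {"url": response.url, "title": response.css("title::text").get()}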
So I am trying to learn scraping and was wondering how to get multiple web pages of info. I was trying it on http://www.cfbstats.com/2014/player/index.html. I want to retrieve all the teams, then go into each team's link, which shows the roster, and then retrieve each player's info and, within their personal link, their stats.
What I have so far is:
import requests
from bs4 import BeautifulSoup
from urllib.parse import urljoin

base_url = "http://www.cfbstats.com/2014/player/index.html"
r = requests.get(base_url)
soup = BeautifulSoup(r.content, "html.parser")

links = soup.find_all("a")
for link in links:
    college = link.text
    # The attribute is "href", not "http"; urljoin handles relative links
    collegeurl = urljoin(base_url, link.get("href"))
    c = requests.get(collegeurl)
    campbells = BeautifulSoup(c.content, "html.parser")
Then I am lost from there. I know I have to use a nested for loop in there, but I don't want certain links, such as terms and conditions and social networks.
I'm just trying to get the player info and then their stats, which are linked from their name.
You have to somehow filter the links and limit your for loop to the ones that correspond to teams. Then, you need to do the same to get the links to players. Using Chrome's "Developer tools" (or your browser's equivalent), I suggest that you (right-click) inspect one of the links that are of interest to you, then try to find something that distinguishes it from the other links that are not of interest. For instance, you'll find the following about the CFBstats page:
All team links are inside <div class="conference">. Furthermore, they all contain the substring "/team/" in the href. So, you can either select links contained in such a div (with XPath or CSS selectors), or filter the ones with such a substring in the href, or both.
On team pages, player links are in <td class="player-name">.
These two should suffice. If not, you get the gist. Web crawling is an experimental science...
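Put together, the filtering described above might look like this; the two selectors come from inspecting the page as described, the rest is boilerplate you should adapt:

import requests
from bs4 import BeautifulSoup
from urllib.parse import urljoin

base_url = "http://www.cfbstats.com/2014/player/index.html"
soup = BeautifulSoup(requests.get(base_url).content, "html.parser")

# Team links live inside <div class="conference"> and contain "/team/"
for conference in soup.find_all("div", class_="conference"):
    for team_link in conference.find_all("a", href=True):
        if "/team/" not in team_link["href"]:
            continue
        team_url = urljoin(base_url, team_link["href"])
        team_soup = BeautifulSoup(requests.get(team_url).content, "html.parser")

        # On team pages, player links are in <td class="player-name">
        for cell in team_soup.find_all("td", class_="player-name"):
            player_link = cell.find("a", href=True)
            if player_link is not None:
                player_url = urljoin(team_url, player_link["href"])
                print(team_link.text, player_link.text, player_url)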
I'm not familiar with BeautifulSoup, but you can certainly use regular expressions to retrieve the data you want.
How would I scrape a domain to find all web pages and content?
For example: www.example.com, www.example.com/index.html, www.example.com/about/index.html and so on...
I would like to do this in Python, preferably with Beautiful Soup, if possible.
You can't. Not only can pages be dynamically generated based on backend database data and search queries or other input that your program supplies to the website, but there is a nearly infinite list of possible pages, and the only way to know which ones exist is to test and see.
The closest you can get is to scrape a website based on the hyperlinks between pages in the page content itself.
You could use the Python library newspaper.
Install using sudo pip3 install newspaper3k
You can scrape all the articles on a particular website.
import newspaper

url = "http://www.example.com"
built_page = newspaper.build(url)

print("%d articles in %s\n\n" % (built_page.size(), url))
for article in built_page.articles:
    print(article.url)
From there you can use the Article object API to get all sorts of information from the page, including the raw HTML.
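For instance, downloading and parsing one of those articles might look like this (download(), parse(), title and html are part of newspaper's Article API):

# Take the first article found on the site and parse it
article = built_page.articles[0]
article.download()  # fetch the raw HTML
article.parse()     # populate title, text, authors, etc.

print(article.title)
print(article.html[:200])  # the raw HTML is available too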
I would like to use ScraperWiki and Python to build a scraper that will scrape large amounts of information off different sites. I am wondering if it is possible to point to a single URL and then scrape the data off each of the links within that site.
For example: a site would contain information about different projects, each within its own individual link. I don't need a list of those links but the actual data contained within them.
The scraper would be looking for the same attributes on each of the links.
Does anyone know how or if I could go about doing this?
Thanks!
Check out BeautifulSoup with urllib2.
http://www.crummy.com/software/BeautifulSoup/
A (very) rough example link scraper would look like this:
from bs4 import BeautifulSoup
import urllib2  # Python 2; in Python 3, use urllib.request instead

url = "http://www.example.com"
c = urllib2.urlopen(url)
contents = c.read()
soup = BeautifulSoup(contents, "html.parser")
links = soup.find_all('a')
Then just write a for loop to do that many times over and you're set!
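Continuing the sketch above, that loop could simply feed each discovered href back into the same fetch-and-parse routine:

# Visit each extracted link and scrape it the same way
for link in links:
    href = link.get('href')
    if href and href.startswith('http'):
        page = urllib2.urlopen(href).read()
        page_soup = BeautifulSoup(page, "html.parser")
        print(page_soup.title)  # ...extract whatever attributes you need here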