So I am trying to learn scraping and was wondering how to get info from multiple webpages. I was practicing on http://www.cfbstats.com/2014/player/index.html . I want to retrieve all the teams, then follow each team's link (which shows the roster), retrieve each player's info, and then follow each player's personal link to get their stats.
what I have so far is:
import requests
from bs4 import BeautifulSoup
from urllib.parse import urljoin

base = "http://www.cfbstats.com/2014/player/index.html"
r = requests.get(base)
soup = BeautifulSoup(r.content, "html.parser")
links = soup.find_all("a")
for link in links:
    college = link.text
    collegeurl = urljoin(base, link.get("href"))  # hrefs on the page are relative
    c = requests.get(collegeurl)
    campbells = BeautifulSoup(c.content, "html.parser")
Then I am lost from there. I know I have to do a nested for loop, but I don't want to follow certain links, such as terms and conditions or social networks.
I'm just trying to get the player info and then their stats, which are linked from their names.
You have to somehow filter the links and limit your for loop to the ones that correspond to teams. Then, you need to do the same to get the links to players. Using Chrome's "Developer tools" (or your browser's equivalent), I suggest that you (right-click) inspect one of the links that are of interest to you, then try to find something that distinguishes it from other links that are not of interest. For instance, you'll find out about the CFBstats page:
All team links are inside <div class="conference">. Furthermore, they all contain the substring "/team/" in their href. So you can either select links contained in such a div (via an XPath or CSS selector), or filter for that substring, or both.
On team pages, player links are in <td class="player-name">.
These two should suffice. If not, you get the gist. Web crawling is an experimental science...
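Putting those two observations together, the filtering could look like the sketch below. The HTML snippet is hypothetical stand-in markup mirroring the structure described above, not the actual cfbstats page:

```python
from bs4 import BeautifulSoup

# Hypothetical markup mimicking the cfbstats structure described above
html = """
<div class="conference">
  <a href="/2014/team/1/index.html">Air Force</a>
  <a href="/2014/team/2/index.html">Akron</a>
</div>
<a href="/terms.html">Terms and Conditions</a>
"""

soup = BeautifulSoup(html, "html.parser")

# Keep only links inside <div class="conference"> whose href contains "/team/"
team_links = [a["href"] for a in soup.select('div.conference a[href*="/team/"]')]
print(team_links)  # ['/2014/team/1/index.html', '/2014/team/2/index.html']
```

The same pattern (a CSS selector combining the parent class and an href substring) would apply to the player links in `<td class="player-name">` on the team pages.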
I'm not familiar with BeautifulSoup, but you can certainly use regular expressions to retrieve the data you want.
I am trying to crawl a COVID-19 statistics website which has a bunch of links to pages with the statistics for different countries. The links all share a class name ('mt_a') that makes them easy to access using CSS selectors. There is no continuity between the countries, so if you are on the page for one of them, there is no link to the next country. I am a complete beginner with Scrapy, and I'm not sure what I should do if my goal is to scrape all of the (roughly 200) links listed on the root page for the same few pieces of information. Any guidance on what I should be trying to do would be appreciated.
The link I'm trying to scrape: https://www.worldometers.info/coronavirus/
(scroll down to see country links)
What I would do is create two spiders. One would parse the home page and extract the country-page links from the href attributes of the anchor tags, e.g. href="country/us/", and then build full URLs from these relative links so that you get a proper URL like https://www.worldometers.info/coronavirus/country/us/.
Then the second spider is given the list of all country urls and then goes on to crawl all individual pages and extract information from those.
For example, you get a list of urls from the first spider:
urls = ['https://www.worldometers.info/coronavirus/country/us/',
'https://www.worldometers.info/coronavirus/country/russia/']
Then in the second spider you give that list to the start_urls attribute.
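The step of turning the relative hrefs into full URLs can be done with urljoin from the standard library; a minimal sketch using the relative paths from the example above:

```python
from urllib.parse import urljoin

base = "https://www.worldometers.info/coronavirus/"
relative_links = ["country/us/", "country/russia/"]  # as found in the href attributes

start_urls = [urljoin(base, rel) for rel in relative_links]
print(start_urls)
# ['https://www.worldometers.info/coronavirus/country/us/',
#  'https://www.worldometers.info/coronavirus/country/russia/']
```

The resulting list is exactly what you would hand to the second spider's start_urls attribute.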
I think others have already answered the question, but here is the page for Link extractors.
A newbie to Python here. I want to extract info from multiple websites (e.g. 100+) found via a Google search page. I just want to extract the key info, e.g. text in <h1>, <h2>, <b> or <li> HTML tags. But I don't want to extract entire paragraphs (<p>).
I know how to gather a list of website URLs from that google search; and I know how to web scrape individual website after looking at the page's HTML. I use the Request and BeautifulSoup for these tasks.
However, I want to know how I can extract key info from all of these (100+!) websites without having to look at their HTML one by one. Is there a way to automatically find out which HTML tags a website uses to emphasize key messages? e.g. some websites may use <h1>, while others may use <b>, or something else...
All I can think of is to come up with a list of possible "emphasis-typed" HTML tags and then just use BeautifulSoup.find_all() to do a wide-scale extraction. But surely there must be an easier way?
It would seem that you first need to learn how to write loops and functions. Every website is completely different, and scraping a website to extract useful information is daunting. I'm a newb myself, but if I had to extract info from headers like you do, this is what I would do (this is just concept code, but I hope you'll find it useful):
from urllib.request import urlopen
from bs4 import BeautifulSoup

def get_key_info(article_url):
    html = urlopen('http://en.web.com{}'.format(article_url))
    bs = BeautifulSoup(html, 'html.parser')
    # grab the text of the common "emphasis" tags in a single pass
    return [tag.get_text(strip=True)
            for tag in bs.find_all(['h1', 'h2', 'b', 'li'])]
I've scraped a few web articles using BeautifulSoup. After scraping them I'd like to find out which country each article is talking about. My current method is this:
- Extract the raw text from the article.
- Keep a list of all 195 countries.
- Using BeautifulSoup's find_all(), check how many occurrences there are.
from urllib.request import urlopen
from bs4 import BeautifulSoup

def find_country(url_string):
    html = urlopen(url_string)
    bsObj = BeautifulSoup(html, "html.parser")
    countryList = bsObj.find_all("p", string="UK")
    print(len(countryList))
I tried this on a page such as this one: https://www.bbc.co.uk/news/uk-politics-52701843 and didn't get the correct result.
However, I read online that you can specify which parent/child elements the information should be obtained from, i.e. I only want to count the occurrences of "UK" within the article body of the news page. I was wondering how I would implement this, so that find_all('p', string='UK') would find the correct number of occurrences of the keyword "UK" in the article.
Thanks for any help, highly appreciated.
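One thing worth noting: find_all("p", string="UK") only matches paragraphs whose entire text is exactly "UK", which is why the count comes out wrong. A sketch of the counting step using a regular expression over the extracted article text instead (the sample text and the country list here are made up for illustration):

```python
import re

# Illustrative subset of the 195-country list
countries = ["UK", "France", "Germany"]

def count_country_mentions(text, countries):
    # \b word boundaries stop "UK" matching inside longer words
    return {c: len(re.findall(r"\b{}\b".format(re.escape(c)), text))
            for c in countries}

article_text = "The UK went into lockdown in March. France followed, and the UK extended it."
print(count_country_mentions(article_text, countries))
# {'UK': 2, 'France': 1, 'Germany': 0}
```

To restrict counting to the article body, you would first narrow the soup to the relevant container and take its get_text() before counting.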
I am a newbie to python and web scraping.
I am trying to extract information about test components of clinical diagnostic tests from this link. https://labtestsonline.org/tests-index
The tests index has a list of names of test components for various clinical tests. Clicking on each of those names takes you to another page containing details about the individual test component. From this page I would like to extract the part which has the common questions,
and finally put together a data frame containing the names of the test components in one column and each question from the common questions as the rest of the columns (as shown below).
Names how_its_used when_it_is_ordered what_does_test_result_mean
So far I have only managed to get the names of the test components.
import requests
import pandas as pd
from bs4 import BeautifulSoup

url = 'https://labtestsonline.org/tests-index'
response = requests.get(url)
soup = BeautifulSoup(response.content, 'lxml')
# print(soup.prettify())  # uncomment to inspect the page structure

l = []  # get the names of the test components from the index
for i in soup.select("a[hreflang*=en]"):
    l.append(i.text)

names = pd.DataFrame({'col': l})  # convert the above list to a dataframe
I suggest that you take a look at the open source web scraping library Scrapy. It will help you with many of the concerns that you might run in to when scraping websites such as:
Following the links on each page.
Scraping data from pages that match a particular pattern, e.g. you might only want to scrape the /detail page, while the other pages just scrape links to crawl.
Extracting data with lxml and CSS selectors.
Concurrency, allowing you to crawl multiple pages at the same time which will greatly speed up your scraper.
It's very easy to get going, and there are a lot of resources out there on how to build anything from simple to advanced web scrapers using the Scrapy library.
I have been trying to use BS4 to scrape from this web page. I cannot find the data I want (player names in the table, i.e. "Claiborne, Morris").
When I use:
soup = BeautifulSoup(r.content, "html.parser")
PlayerName = soup.find_all("table")
print(PlayerName)
None of the player's names are even in the output, it is only showing a different table.
When I use:
soup = BeautifulSoup(r.content, 'html.parser')
texts = soup.findAll(text=True)
print(texts)
I can see them.
Any advice on how to dig in and get player names?
The table you're looking for is dynamically filled by JavaScript when the page is rendered. When you retrieve the page using e.g. requests, it only retrieves the original, unmodified page. This means that some elements that you see in your browser will be missing.
The fact that you can find the player names with your second snippet of code is because they are contained in the page's JavaScript source, as JSON. However, you won't be able to retrieve them with BeautifulSoup, as it won't parse the JavaScript.
The best option is to use something like Selenium, which mimics a browser as closely as possible and will execute JavaScript code, thereby rendering the same page content as you would see in your own browser.
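Since the names sit in the page's JavaScript source as JSON, a lighter-weight alternative to a full browser is sometimes to pull that JSON out of the raw page source with a regular expression and parse it directly. A sketch against hypothetical page source (the variable name playerData and the markup are made up; the real page's structure would need to be inspected first):

```python
import json
import re

# Hypothetical page source with the roster embedded as JSON in a script tag
page_source = """
<script>
var playerData = {"players": [{"name": "Claiborne, Morris"}, {"name": "Smith, Alex"}]};
</script>
"""

# Capture the JSON object assigned to the variable, then parse it
match = re.search(r"var playerData = (\{.*?\});", page_source, re.DOTALL)
data = json.loads(match.group(1))
names = [p["name"] for p in data["players"]]
print(names)  # ['Claiborne, Morris', 'Smith, Alex']
```

This avoids the overhead of Selenium, but it is brittle: if the site renames the variable or changes how the data is embedded, the regex has to be updated.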