I'm trying to write my first ever scraper and I'm facing a problem. All of the tutorials I've watched of course mention tags as the way to catch the part you want to scrape, and they mention something like the code below, which is actually my code so far. I'm trying to scrape the title, date, and country of each story:
import requests
import csv
from bs4 import BeautifulSoup
from itertools import zip_longest
result = requests.get("https://www.cdc.gov/globalhealth/healthprotection/stories-from-the-field/stories-by-country.html?Sort=Date%3A%3Adesc")
source = result.content
soup = BeautifulSoup(source,"lxml")
--------------------------NOW COMES MY PROBLEM------------------------------------------
When I start looking to scrape the title, it sits inside a tag like this: <span _ngcontent-c0>CDC Vietnam uses Technology Innovations to Improve COVID-19 Response</span>
When I try the code I learned:
title = soup.find_all("span__ngcontent-c0",{"class": ##I don't know what goes here!})
of course it doesn't work. I have searched and found that this _ngcontent-c0 is actually Angular, but I don't know how to scrape it! Any help?
This page needs JavaScript to render the content you want to scrape.
It calls an API to get that content, so you can just request the API directly.
You need to do something like this:
import requests
result = requests.get(
"https://www.cdc.gov/globalhealth/healthprotection/stories-from-the-field/dghp-stories-country.json")
for item in result.json()["items"]:
    print("Title: " + item["Title"])
    print("Date: " + item["Date"][0:10])
    print("Country: " + ','.join(item["Country"]))
    print()
OUTPUT:
Title: System Strengthening – A One Health Approach
Date: 2016-12-12
Country: Kenya,Multiple
Title: Early Warning Alert and Response Network Put the Brakes on Deadly Diseases
Date: 2016-12-12
Country: Somalia,Syria
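Since your original code already imports csv, here is a minimal sketch of writing those same fields to a file (assuming the same items/Title/Date/Country keys shown above):
import csv
import requests

result = requests.get(
    "https://www.cdc.gov/globalhealth/healthprotection/stories-from-the-field/dghp-stories-country.json")

with open("stories.csv", "w", newline="", encoding="utf-8") as f:
    writer = csv.writer(f)
    writer.writerow(["Title", "Date", "Country"])  # header row
    for item in result.json()["items"]:
        writer.writerow([item["Title"], item["Date"][0:10], ",".join(item["Country"])])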
I hope I have been able to help you.
I'm trying to extract the title of a product from amazon.com, using the id of the span which contains the title.
This is what I wrote:
import requests
from bs4 import BeautifulSoup
url = 'https://www.amazon.com/Acer-SB220Q-Ultra-Thin-Frame-Monitor/dp/B07CVL2D2S/ref=lp_16225007011_1_7'
res = requests.get(url)
soup = BeautifulSoup(res.content, 'html.parser')
title = soup.find(id='productTitle').get_text()
print(title)
And I keep getting either None or an empty list, or I can't extract anything and it gives me an AttributeError saying that the object I used doesn't have an attribute get_text, which raised another question: how do I get the text of this simple span?
I'd really appreciate it if someone could figure this out and help me.
Thanks in advance.
Problem
Running your code and checking the res value, you would get a 503 error. This means that the service is unavailable (HTTP status 503).
Solution
Following up, using this SO post, it seems that adding headers={"User-Agent": "Defined"} to the get request does work.
res = requests.get(url, headers={"User-Agent": "Defined"})
Will return a 200 (OK) response.
The Twist
Amazon actually checks for web scrapers, and even though you will get a page back, printing the result (print(soup)) will likely show you the following:
<body>
<!--
To discuss automated access to Amazon data please contact api-services-support@amazon.com.
For information about migrating to our APIs refer to our Marketplace APIs at https://developer.amazonservices.com/ref=rm_c_sv, or our Product Advertising API at https://affiliate-program.amazon.com/gp/advertising/api/detail/main.html/ref=rm_c_ac for advertising use cases.
-->
...
<h4>Enter the characters you see below</h4>
<p class="a-last">Sorry, we just need to make sure you're not a robot. For best results, please make sure your browser is accepting cookies.</p>
</div>
</div>
The counter
But you can use selenium to simulate a human. A minimal working example for me was the following:
import selenium.webdriver
url = 'http://www.amazon.com/Acer-SB220Q-Ultra-Thin-Frame-Monitor/dp/B07CVL2D2S/ref=lp_16225007011_1_7'
driver = selenium.webdriver.Firefox()
driver.get(url)
title = driver.find_element_by_id('productTitle').text
print(title)
Which prints out
Acer SB220Q bi 21.5 Inches Full HD (1920 x 1080) IPS Ultra-Thin Zero Frame Monitor (HDMI & VGA Port), Black
A small caveat when using selenium is that it is much slower than the requests library. Also, a new window will pop up showing the page, but luckily we can do something about that by using a headless driver.
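For example, a minimal headless setup for the script above (a sketch, assuming Selenium 3.8+ where Firefox accepts an options argument):
import selenium.webdriver
from selenium.webdriver.firefox.options import Options

options = Options()
options.headless = True  # run Firefox without opening a visible window

driver = selenium.webdriver.Firefox(options=options)
driver.get('http://www.amazon.com/Acer-SB220Q-Ultra-Thin-Frame-Monitor/dp/B07CVL2D2S/ref=lp_16225007011_1_7')
print(driver.find_element_by_id('productTitle').text)
driver.quit()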
So I'm trying to scrape census data from a website that changes dynamically when a county is selected from the drop down menu. It looks like this:
<select id="cat_id_select_GEO" onchange="changeHeaderSelection('GEO');">
  <option value="0500000US01001" selected="selected">Autauga County, Alabama</option>
</select>
So from the research I've done, it sounds like I need to make some sort of GET request (selenium?), but I am completely lost on how to do this. I know how to get the data I want once I've made the county selection, but I've never had to scrape something where the website changes dynamically (i.e. the url doesn't change).
I understand that some may find this to be a simple question, but I've read numerous other similar questions and would greatly benefit from someone walking me through an example, and/or directing me to a solid guide.
This is what I've been messing around with so far. I can see it kind of works at selecting the values, but it spits out this error: Message: stale element reference: element is not attached to the page document
(Session info: chrome=74.0.3729.169)
for index, row in StateURLs.iterrows():
    url = row['URL']
    state = row['STATE']
    driver = webdriver.Chrome(executable_path=r'C:\chromedriver.exe')
    driver.get(url)
    select_county = Select(driver.find_element_by_id('cat_id_select_GEO'))
    options = select_county.options
    for index in range(0, len(options) - 1):
        select_county.select_by_index(index)
I would also love help on how to then convert these webpages to BeautifulSoup, so I can scrape each page after the selection is made.
The main landing page makes GET requests with a query string. These return a JSON string containing the info that is first shown when you submit your query, including further URLs that are listed on the results page.
import requests
search_term = 'searchTerm: Autauga County, Alabama'
search_term = search_term.replace(' ','+')
r = requests.get('https://factfinder.census.gov/rest/communityFactsNav/nav?N=0&_t=1558559559868&log=t&searchTerm=term ' + search_term + ',Alabama&src=').json()
I can generate the correct url to use in the browser, which returns all that data as json, but I can't seem to configure requests so it works. Perhaps someone else can pick this up and work it out. I will look again tomorrow.
r = requests.get('https://factfinder.census.gov/rest/communityFactsNav/nav?N=0&_t=1558559559868&log=t&searchTerm=term ' + search_term + ',Alabama&src=', allow_redirects= True).json()
url = 'https://factfinder.census.gov' + r['CFMetaData']['measuresAndLinks']['links']['2017 American Community Survey'][0]['url']
code = url.split('/')[-2]
url = 'https://factfinder.census.gov/tablerestful/tableServices/renderProductData?renderForMap=f&renderForChart=f&pid=ACS_17_5YR_{}&prodToReplace=ACS_16_5YR_{}&log=t&_ts=576607332612'.format(code, code)
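As a side note on the Selenium route you started: the stale element error happens because the page re-renders after each selection, so the previously found <select> element is thrown away. A sketch that re-locates it on every pass and hands the rendered page to BeautifulSoup (assuming the page updates in place):
from bs4 import BeautifulSoup
from selenium import webdriver
from selenium.webdriver.support.ui import Select

driver = webdriver.Chrome(executable_path=r'C:\chromedriver.exe')
driver.get(url)  # one of your StateURLs

n_options = len(Select(driver.find_element_by_id('cat_id_select_GEO')).options)
for i in range(n_options):
    # Re-find the <select> each iteration so we never hold a stale reference
    Select(driver.find_element_by_id('cat_id_select_GEO')).select_by_index(i)
    # You may need an explicit wait here for the new data to load
    soup = BeautifulSoup(driver.page_source, 'html.parser')
    # scrape soup as usual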
I want to retrieve different categories from a news website. I am using BeautifulSoup to get the titles of articles from the right side. How can I loop over the various categories available on the left side of the website? I just started learning this kind of code, so I'm far from understanding how it works. Any help would be appreciated. This is the website I am working on: http://query.nytimes.com/search/sitesearch/#/*/
Below is my code which returns the headlines of various articles from the right side:
import json
from bs4 import BeautifulSoup
import urllib
from urllib2 import urlopen
from urllib2 import HTTPError
from urllib2 import URLError
import requests
resp = urlopen("https://query.nytimes.com/svc/add/v1/sitesearch.json")
content = resp.read()
j = json.loads(content)
articles = j['response']['docs']
headlines = [ article['headline']['main'] for article in articles ]
for article in articles:
    print article['headline']['main']
If I understood you correctly, you can get those articles by changing the API query like this:
import requests
data_range = ['24hours', '7days', '30days', '365days']
news_feed = {}
with requests.Session() as s:
    for rng in data_range:
        news_feed[rng] = s.get('http://query.nytimes.com/svc/add/v1/sitesearch.json?begin_date={}ago&facet=true'.format(rng)).json()
And access the values like this:
print(news_feed) #or print(news_feed['30days'])
EDIT
To query additional pages, you may try this:
import requests
data_range = ['7days']
news_feed = {}
news_list = []
page = 1
with requests.Session() as s:
    for rng in data_range:
        while page < 20:  # this is limited to 120
            news_list.append(s.get('http://query.nytimes.com/svc/add/v1/sitesearch.json?begin_date={}ago&page={}&facet=true'.format(rng, page)).json())
            page += 1
        news_feed[rng] = news_list

for new in news_feed['7days']:
    print(new)
First of all, instead of using urllib + json to parse the JSON response, you can use the requests module and its built-in .json() function.
Example:
import requests
r = requests.get("https://query.nytimes.com/svc/add/v1/sitesearch.json")
json_data = r.json()
# rest of the code is same
Now, to scrape the Date Range tabs, first, go to Developer Tools > Network > XHR. Then, click on any of the tabs. For example, if you click on the Past 24 Hours tab, you'll see an AJAX request made to this URL:
http://query.nytimes.com/svc/add/v1/sitesearch.json?begin_date=24hoursago&facet=true
If you click on Past 7 Days, you'll see this URL:
http://query.nytimes.com/svc/add/v1/sitesearch.json?begin_date=7daysago&facet=true
In general, you can format these URLs using this:
url = "http://query.nytimes.com/svc/add/v1/sitesearch.json?begin_date={}&facet=true"
past_24_hours = url.format('24hoursago')
r = requests.get(past_24_hours)
data = r.json()
This will get you all the NEWS items in the JSON object data.
For example, you can get the NEWS titles like this:
for item in data['response']['docs']:
    print(item['headline']['main'])
Output:
Austrian Lawmakers Vote to Hinder Smoking Ban in Restaurants and Bars
Soccer-Argentine World Cup Winner Houseman Dies Aged 64
Response to UK Spy Attack Not Expected at EU Summit: French Source
Florida Man Reunites With Pet Cat Lost 14 Years Ago
Citigroup Puts Restrictions on Gun Sales
EU Exemptions From U.S. Steel Tariffs 'Possible but Not Certain': French Source
Trump Initiates Trade Action Against China
Trump’s Trade Threats Put China’s Leader on the Spot
Poland Plans Concessions in Judicial Reforms to Ease EU Concerns: Lawmaker
Florida Bridge Collapse Victim's Family Latest to Sue
I am trying to web-scrape information (points scored, tackles made, time played, position, etc.) about Top 14 rugby players from a website.
For each player I get info from this page:
http://www.lnr.fr/rugby-top-14/joueurs/nicholas-abendanon
For each player, I can get info for the 2015-2016 season easily, but I also need info for the 2014-2015 season.
Problem is, when I open the corresponding link (http://www.lnr.fr/rugby-top-14/joueurs/nicholas-abendanon#season=14535), the source code is the same and the info my program scrapes is the 2015-2016 data.
I can't seem to find a way to get the info for previous seasons even though it appears on the webpage.
Does anyone know how to solve this?
Here is my code for the player I gave as an example.
import requests
from bs4 import BeautifulSoup

dic = {}
url_player = 'http://www.lnr.fr/rugby-top-14/joueurs/nicholas-abendanon'
page = requests.get(url_player)
html = page.content
parsed_html = BeautifulSoup(html, 'lxml')
body = parsed_html.body
saison14_15 = body.find('a', attrs={'data-title': 'Saison 2014-2015'})
link = saison14_15['href']
url_season = 'http://www.lnr.fr/rugby-top-14/joueurs/nicholas-abendanon' + link
page_season = requests.get(url_season)
html_season = page_season.content
parsed_html_season = BeautifulSoup(html_season, 'lxml')
body_season = parsed_html_season.body
dic['nom'] = body_season.find('h1', attrs={'id': 'page-title'}).text
dic[body_season.find('span', attrs={'class': 'title'}).text] = body_season.find('span', attrs={'class': 'text'}).text
info1 = body_season.find('ul', attrs={'class': 'infos-list small-bold'})
try:
    for item in info1.findAll('li'):
        dic[item.find('span', attrs={'class': 'title'}).text] = item.find('span', attrs={'class': 'text'}).text
    info2 = body_season.find('ul', attrs={'class': 'fluid-block-grid-3 team-stats'})
    if info2 is not None:
        for item in info2.findAll('li'):
            dic[item.find('span', attrs={'class': 'title'}).text] = item.find('span', attrs={'class': 'text'}).text
    info3 = body_season.find('ul', attrs={'class': 'number-list small-block-grid-2'})
    if info3 is not None:
        for item in info3.findAll('li'):
            dic[item.find('span', attrs={'class': 'title'}).text] = item.find('span', attrs={'class': 'text'}).text
except:
    pass
When you choose the 2014-2015 season, the page makes an AJAX request to
http://www.lnr.fr/ajax_player_stats_detail?player=33249&compet_type=1&=undefined&season=14535&_filter_current_tab_id=panel-filter-season&ajax-target-selector=%23player_stats_detail_block
If you then switch back to 2015-2016, it makes an AJAX request to
http://www.lnr.fr/ajax_player_stats_detail?player=33249&compet_type=1&=undefined&season=18505&_filter_current_tab_id=panel-filter-season&ajax-target-selector=%23player_stats_detail_block
Each request returns a chunk of HTML which gets inserted into the page.
If you can figure out the parameters needed for player and season, I suggest you request the data directly (without loading the parent page at all).
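A minimal sketch of that direct request (the URL and player/season values are taken from the AJAX calls above; I'm assuming the stray &=undefined parameter can be dropped and that the returned fragment parses like any other HTML):
import requests
from bs4 import BeautifulSoup

params = {
    'player': '33249',       # Nicholas Abendanon
    'compet_type': '1',
    'season': '14535',       # 2014-2015; use 18505 for 2015-2016
    '_filter_current_tab_id': 'panel-filter-season',
    'ajax-target-selector': '#player_stats_detail_block',
}
r = requests.get('http://www.lnr.fr/ajax_player_stats_detail', params=params)

# The response is an HTML fragment, so it can be parsed like a normal page
fragment = BeautifulSoup(r.content, 'lxml')
for span in fragment.find_all('span', attrs={'class': 'title'}):
    print(span.text)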
I'm learning web scraping and I've been trying to write a program that extracts information from Steam's website as an exercise.
I want to write a program that just visits the page of each top 10 best-selling game and extracts something, but my program just gets redirected to the age check page when it tries to visit M-rated games.
My program looks something like this:
from urllib.request import urlopen
from bs4 import BeautifulSoup

front_page = urlopen('http://store.steampowered.com/').read()
bs = BeautifulSoup(front_page, 'html.parser')
top_sellers = bs.select('#tab_topsellers_content a.tab_item_overlay')
for item in top_sellers:
    game_page = urlopen(item.get('href'))
    bs = BeautifulSoup(game_page.read(), 'html.parser')
    # Now I'm on the age check page :(
I don't know how to get past the age check. I've tried filling it out by sending a POST request like this:
post_params = urlencode({'ageDay': '1', 'ageMonth': 'January', 'ageYear': '1988', 'snr': '1_agecheck_agecheck__age-gate'}).encode('utf-8')
page = urlopen(agecheckurl, post_params)
But it doesn't work; I'm still on the age check page. Can anyone help me out here: how can I get past it?
Okay, it seems like Steam uses cookies to save the age check result. It's using birthtime.
Since I don't know how to set cookies using urllib, here is an example using requests:
import requests
cookies = {'birthtime': '568022401'}
r = requests.get('http://store.steampowered.com/', cookies=cookies)
Now there is no age check.
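A sketch combining that cookie with your original loop (selectors taken from your snippet, with requests in place of urllib):
import requests
from bs4 import BeautifulSoup

cookies = {'birthtime': '568022401'}

front_page = requests.get('http://store.steampowered.com/', cookies=cookies)
bs = BeautifulSoup(front_page.content, 'html.parser')

for item in bs.select('#tab_topsellers_content a.tab_item_overlay'):
    game_page = requests.get(item.get('href'), cookies=cookies)
    game_soup = BeautifulSoup(game_page.content, 'html.parser')
    # The cookie is sent on every request, so no age check page here
    print(game_page.url)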
I like to use Selenium Webdriver for form input, since it's an easy solution for clicks and keystrokes. You can look at the docs or check out the examples here, under "Filling out and Submitting Forms":
https://automatetheboringstuff.com/chapter11/
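For this age gate, a Selenium sketch might look like the following (hypothetical: I'm assuming the form controls use the same names as the POST parameters in your attempt, and the id of the "View Page" button is a guess):
from selenium import webdriver
from selenium.webdriver.support.ui import Select

driver = webdriver.Firefox()
driver.get(agecheckurl)  # the age check URL from your attempt

# Assumed field names, mirroring your POST parameters
Select(driver.find_element_by_name('ageDay')).select_by_visible_text('1')
Select(driver.find_element_by_name('ageMonth')).select_by_visible_text('January')
Select(driver.find_element_by_name('ageYear')).select_by_visible_text('1988')
driver.find_element_by_id('view_product_page_btn').click()  # hypothetical button id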