I'm learning web scraping and I've been trying to write a program that extracts information from Steam's website as an exercise.
I want the program to visit the page of each of the top 10 best-selling games and extract something, but it just gets redirected to the age check page when it tries to visit M-rated games.
My program looks something like this:
from urllib.request import urlopen
from bs4 import BeautifulSoup

front_page = urlopen('http://store.steampowered.com/').read()
bs = BeautifulSoup(front_page, 'html.parser')
top_sellers = bs.select('#tab_topsellers_content a.tab_item_overlay')
for item in top_sellers:
    game_page = urlopen(item.get('href'))
    bs = BeautifulSoup(game_page.read(), 'html.parser')
    # Now I'm on the age check page :(
I don't know how to get past the age check. I've tried filling it out by sending a POST request like this:
from urllib.parse import urlencode
from urllib.request import urlopen

post_params = urlencode({'ageDay': '1', 'ageMonth': 'January', 'ageYear': '1988', 'snr': '1_agecheck_agecheck__age-gate'}).encode('utf-8')
page = urlopen(agecheckurl, post_params)  # agecheckurl is the age check page I get redirected to
But it doesn't work; I'm still on the age check page. Can anyone help me out here, how can I get past it?
Okay, it seems Steam uses cookies to save the age check result, in a cookie called birthtime.
Since I don't know how to set cookies with urllib, here is an example using requests:
import requests
cookies = {'birthtime': '568022401'}  # Unix timestamp for a date in 1988, i.e. well over 18
r = requests.get('http://store.steampowered.com/', cookies=cookies)
Now there is no age check.
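For completeness, the same cookie can also be sent with urllib by setting the Cookie header by hand (a minimal sketch of the same idea):

from urllib.request import Request, urlopen

req = Request('http://store.steampowered.com/')
req.add_header('Cookie', 'birthtime=568022401')  # same timestamp as in the requests example
page = urlopen(req).read()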
I like to use Selenium WebDriver for form input, since it's an easy solution for clicks and keystrokes. You can look at the docs or check out the examples here, under "Filling out and Submitting Forms":
https://automatetheboringstuff.com/chapter11/
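A rough sketch of that approach for the age gate. The element IDs below are assumptions based on the form field names in the POST attempt above ('ageDay'/'ageMonth'/'ageYear'), so verify them in your browser's inspector:

from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import Select

driver = webdriver.Firefox()
game_url = 'http://store.steampowered.com/'  # replace with an M-rated game URL from the loop above
driver.get(game_url)
# Assumed ID: the age-gate dropdowns appear to match the POST field names.
Select(driver.find_element(By.ID, 'ageYear')).select_by_visible_text('1988')
# The submit button's ID is also an assumption; confirm it in the inspector.
driver.find_element(By.ID, 'view_product_page_btn').click()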
I'm trying to write my first ever scraper and I'm facing a problem. All of the tutorials I've watched use tags to pinpoint the part you want to scrape, with code something like mine below. This is my code so far; I'm trying to scrape the title, date, and country of each story:
import requests
import csv
from bs4 import BeautifulSoup
from itertools import zip_longest
result = requests.get("https://www.cdc.gov/globalhealth/healthprotection/stories-from-the-
field/stories-by-country.html?Sort=Date%3A%3Adesc")
source = result.content
soup = BeautifulSoup(source,"lxml")
--------------------------NOW COMES MY PROBLEM------------------------------------------
When I start looking to scrape the title, it appears inside a tag like this: <span _ngcontent-c0>CDC Vietnam uses Technology Innovations to Improve COVID-19 Response</span>
When I try the code I learned:
title = soup.find_all("span__ngcontent-c0",{"class": ##I don't know what goes here!})
of course it doesn't work. I have searched and found that this _ngcontent-c0 is actually Angular, but I don't know how to scrape it! Any help?
This site needs JavaScript to render the content you want to scrape. It calls an API to fetch that content, so just request the API directly (you can spot calls like this in the Network tab of your browser's developer tools).
You need to do something like this:
import requests
result = requests.get(
"https://www.cdc.gov/globalhealth/healthprotection/stories-from-the-field/dghp-stories-country.json")
for item in result.json()["items"]:
    print("Title: " + item["Title"])
    print("Date: " + item["Date"][0:10])
    print("Country: " + ','.join(item["Country"]))
    print()
OUTPUT:
Title: System Strengthening – A One Health Approach
Date: 2016-12-12
Country: Kenya,Multiple
Title: Early Warning Alert and Response Network Put the Brakes on Deadly Diseases
Date: 2016-12-12
Country: Somalia,Syria
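Since your own snippet already imports csv, here is a sketch of writing the same fields to a CSV file (it assumes the same JSON endpoint and field names as above):

import csv
import requests

result = requests.get(
    "https://www.cdc.gov/globalhealth/healthprotection/stories-from-the-field/dghp-stories-country.json")
with open("stories.csv", "w", newline="", encoding="utf-8") as f:
    writer = csv.writer(f)
    writer.writerow(["Title", "Date", "Country"])  # header row
    for item in result.json()["items"]:
        writer.writerow([item["Title"], item["Date"][0:10], ','.join(item["Country"])])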
I hope I have been able to help you.
I am trying to web scrape some data from the website - https://boardgamegeek.com/browse/boardgame/page/1
After I have obtained the names of the games and their scores, I would also like to open each game's page and find out how many players it needs. But when I go into each of the games, the URL has a unique number.
For example: when I click on the first game, Gloomhaven, it opens the page https://boardgamegeek.com/boardgame/174430/gloomhaven (the unique number here is 174430).
import random as r
import requests

while True:
    random_no = r.randint(1000, 300000)
    url2 = "https://boardgamegeek.com/boardgame/" + str(random_no) + "/" + name[0]  # name[0] is a game name scraped earlier
    page2 = requests.get(url2)
    if page2.status_code == 200:
        print("this is it!")
        break
So I generated random numbers, plugged them into the URL, and read the responses. However, even a wrong number gives a successful response but does not open the correct page.
What is this unique number? How can I find out about it? Or can I use an alternative way to get the information I need?
Thanks in advance.
Try this: instead of guessing the number, scrape each game's link (which already contains its unique number) from the browse page:
import requests
import bs4

soup = bs4.BeautifulSoup(
    requests.get('https://boardgamegeek.com/browse/boardgame/page/1').content,
    'html.parser')
table = soup.find('table', {'id': 'collectionitems'})
urls = ['https://boardgamegeek.com' + a['href'] for a in table.find_all('a', {'class': 'primary'})]
print(urls)
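The unique number is just BoardGameGeek's internal ID for the game. Once you have the URLs, you can pull the ID out of each one and ask BGG's XML API for the player counts; a sketch (it assumes the xmlapi2 'thing' endpoint and that lxml is installed for the 'xml' parser):

import re
import requests
import bs4

url = 'https://boardgamegeek.com/boardgame/174430/gloomhaven'
game_id = re.search(r'/boardgame/(\d+)/', url).group(1)  # -> '174430'
xml = bs4.BeautifulSoup(
    requests.get(f'https://boardgamegeek.com/xmlapi2/thing?id={game_id}').content,
    'xml')  # the 'xml' parser requires lxml
print(xml.find('minplayers')['value'], 'to', xml.find('maxplayers')['value'], 'players')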
So, recently I've been trying to get some marks from a results website (http://tnresults.nic.in/rgnfs.htm) for my school results. My friend challenged me to get his marks, and I only know his DOB, not his register number. How do I make a Python program that solves this by trying register numbers from a predefined range?
I tried using requests, but it doesn't let me enter the register number and DOB.
It creates a POST request with the following format after pushing the Submit button:
https://dge3.tn.nic.in/plusone/plusoneapi/marks/{registration number}/{DOB}
Sample (with 112231 as the registration number and 01-01-2000 as the DOB):
https://dge3.tn.nic.in/plusone/plusoneapi/marks/112231/01-01-2000
You can then iterate over different registration numbers with a predefined array.
Note: it has to be a POST request, not a regular GET request.
You probably have to do something like the following:
import requests
from bs4 import BeautifulSoup
DOB = '01-01-2000'
REGISTRATION_NUMBERS = ['1','2']
for reg_number in REGISTRATION_NUMBERS:
    result = requests.post(f"https://dge3.tn.nic.in/plusone/plusoneapi/marks/{reg_number}/{DOB}")
    content = result.content
    print(content)
    ## BeautifulSoup logic
I don't know if that request provides the information you need; I don't have a valid registration number combined with the correct date of birth, so I cannot really test it...
Update 2019-07-09:
Since you said the page is not working anymore and the website changed, I took a look.
It seems that some things have changed: you now have to make a POST request to http://tnresults.nic.in/rgnfs.asp. The fields 'regno', 'dob' and 'B1' (optional?) should be sent as x-www-form-urlencoded.
Since that returns an 'Access Denied', you should also set the 'Referer' header to 'http://tnresults.nic.in/rgnfs.htm'. So:
import requests
from bs4 import BeautifulSoup
DOB = '23-10-2002'
REGISTRATION_NUMBERS = ['5709360']
headers = requests.utils.default_headers()
headers.update({'Referer': 'http://tnresults.nic.in/rgnfs.htm'})
for reg_number in REGISTRATION_NUMBERS:
    post_data = {'regno': reg_number, 'dob': DOB}
    result = requests.post("http://tnresults.nic.in/rgnfs.asp", data=post_data, headers=headers)
    content = result.content
    print(content)
    ## BeautifulSoup logic
Tested it myself successfully, now that you've provided a valid DOB and registration number.
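If you want a true predefined range rather than a hardcoded list, you can generate the register numbers; the bounds below are placeholders, not real numbers:

# Hypothetical bounds; pick a range that brackets the register numbers at the school.
REGISTRATION_NUMBERS = [str(n) for n in range(5709000, 5709400)]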
I'm getting stuck on a web scraping project. I would like to scrape the following website, including the date of each review, but I get 'January 1970' for all of the dates.
https://fairygodboss.com/company-reviews/ebay-inc
Here is my code:
import requests
from bs4 import BeautifulSoup

page_link = 'https://fairygodboss.com/company-reviews/ebay-inc'  # work/life balance reviews for eBay
page_response = requests.get(page_link, verify=False, headers={'User-Agent': randomUserAgents()})  # randomUserAgents() is my own helper
soup = BeautifulSoup(page_response.content, 'html.parser')
soup.find_all(class_='textColor6 w-700 p-b-10')
Many thanks!
I believe your problem is that, when you make your request, you are not logged in. When a user is not logged in, all the dates appear as January 1970, until you are redirected to a login page. You will first have to log in.
This can be a tricky problem, but there is a Python library called twill that may work for you: http://twill.idyll.org
Alternatively, you could use something like the Mechanize library, which twill is based on.
This StackOverflow question should help you out:
How to scrape a website that requires login first with Python
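If you'd rather stay with requests, here is a minimal sketch using a Session; the login URL and form field names are assumptions, so inspect the site's actual login form for the real ones:

import requests
from bs4 import BeautifulSoup

session = requests.Session()
# Hypothetical endpoint and field names; check the real login form in your browser first.
session.post('https://fairygodboss.com/login',
             data={'email': 'you@example.com', 'password': 'your-password'})
# The session now carries the login cookies, so the dates should render correctly.
page = session.get('https://fairygodboss.com/company-reviews/ebay-inc')
soup = BeautifulSoup(page.content, 'html.parser')
print(soup.find_all(class_='textColor6 w-700 p-b-10'))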
I have been trying to learn a bit of Python, and I tried to create a small program that asks the user for a subreddit and then prints all the front page headlines and links to the articles. Here is the code:
import requests
from bs4 import BeautifulSoup
subreddit = input('Type the subreddit you want to see : ')
link_visit = f'https://www.reddit.com/r/{subreddit}/'
print(link_visit)
base_url = link_visit
r = requests.get(base_url)
soup = BeautifulSoup(r.text, 'html.parser')
for article in soup.find_all('div', class_='top-matter'):
    headline = article.find('p', class_='title')
    print('HeadLine : ', headline.text)
    a = headline.find('a', href=True)
    link = a['href'].split('/domain')
    print('Link : ', link[0])
My problem is that sometimes it prints the desired result; other times it does nothing, only asking the user for the subreddit and printing the link to said subreddit.
Can someone explain why this is happening?
Your request is being rejected by reddit in order to conserve their resources.
When you detect the failing case, print out the HTML. I think you'll see something like this:
<h1>whoa there, pardner!</h1>
<p>we're sorry, but you appear to be a bot and we've seen too many requests
from you lately. we enforce a hard speed limit on requests that appear to come
from bots to prevent abuse.</p>
<p>if you are not a bot but are spoofing one via your browser's user agent
string: please change your user agent string to avoid seeing this message
again.</p>
<p>please wait 3 second(s) and try again.</p>
<p>as a reminder to developers, we recommend that clients make no
more than <a href="http://github.com/reddit/reddit/wiki/API">one
request every two seconds</a> to avoid seeing this message.</p>
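A sketch of the fix the message suggests: set a descriptive User-Agent (so you are not mistaken for a default-UA bot) and pace your requests:

import time
import requests
from bs4 import BeautifulSoup

headers = {'User-Agent': 'my-headline-scraper/0.1 (learning project)'}  # descriptive, non-default UA
r = requests.get('https://www.reddit.com/r/python/', headers=headers)
soup = BeautifulSoup(r.text, 'html.parser')
time.sleep(2)  # reddit asks for no more than one request every two seconds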