Web scraping using Beautiful Soup (Python)

I am trying to web scrape some data from the website - https://boardgamegeek.com/browse/boardgame/page/1
After I have obtained the names of the games and their scores, I would also like to open each game's page and find out how many players are needed. But when I go into each of the games, the URL has a unique number.
For example: when I click on the first game, Gloomhaven, it opens the page https://boardgamegeek.com/boardgame/**174430**/gloomhaven (the unique number is marked in bold).
random_no = r.randint(1000, 300000)
url2 = "https://boardgamegeek.com/boardgame/" + str(random_no) + "/" + name[0]
page2 = requests.get(url2)
if page2.status_code == 200:
    print("this is it!")
    break
So I generated a random number, plugged it into the URL, and read the response. However, even a wrong number gives a 200 response; it just does not open the correct page.
What is this unique number? How can I get it, or is there an alternative way to get the information I need?
Thanks in advance.

Try this; the browse page already contains the link (and the unique number) for each game:
import requests
import bs4
s = bs4.BeautifulSoup(requests.get(
    url='https://boardgamegeek.com/browse/boardgame/page/1',
).content, 'html.parser').find('table', {'id': 'collectionitems'})
urls = ['https://boardgamegeek.com' + x['href'] for x in s.find_all('a', {'class': 'primary'})]
print(urls)
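Each of those URLs already contains the game's unique number as the second path segment, so there is no need to guess it. As a rough follow-up sketch (assuming BGG's public XML API endpoint /xmlapi2/thing, which reports player counts), you could then do something like:
import re
import requests

for url in urls[:5]:
    game_id = url.split('/')[4]  # e.g. .../boardgame/174430/gloomhaven -> '174430'
    # Assumed endpoint: the XML API returns <minplayers .../> and <maxplayers .../> for an id
    xml = requests.get('https://boardgamegeek.com/xmlapi2/thing?id=' + game_id).text
    min_p = re.search(r'<minplayers\s+value="(\d+)"', xml)
    max_p = re.search(r'<maxplayers\s+value="(\d+)"', xml)
    print(url, min_p.group(1) if min_p else '?', max_p.group(1) if max_p else '?')
Parsing the XML with a regex is only for brevity here; bs4 or xml.etree would be more robust.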

Related

Web scraping with BeautifulSoup only scrapes the first page

I am trying to scrape some data from the WebMD message board. Initially I constructed a loop to get the page numbers for each category and stored them in a dataframe. When I run the loop I do get the proper number of posts for each subcategory, but only for the first page. Any ideas what might be going wrong?
lists2 = []
df1 = pd.DataFrame(columns=['page'], data=page_links)
for j in range(len(df1)):
    pages = (df1.page.iloc[j])
    print(pages)
    req1 = urllib.request.Request(pages, headers=headers)
    resp1 = urllib.request.urlopen(req1)
    soup1 = bs.BeautifulSoup(resp1, 'lxml')
    for body_links in soup1.find_all('div', class_="thread-detail"):
        body = body_links.a.get('href')
        lists2.append(body)
The print call shows the proper page, but the loop then seems to collect the post links only from the first page. Also, when I copy and paste the link for any page besides the first one, the browser momentarily loads the first page and then goes to the proper page number. I tried adding time.sleep(1), but that does not work. Another thing I tried was adding headers={'Cookie': 'PHPSESSID=notimportant'}.
Replace this line:
pages = (df1.page.iloc[j])
With this:
pages = df1.iloc[j, 0]
You will now iterate through the values of your DataFrame.
If page_links is a list with URLs like
page_links = ["http://...", "http://...", "http://...", ]
then you could use it directly
for url in page_links:
    req1 = urllib.request.Request(url, headers=headers)
If you need it in a DataFrame, then
for url in df1['page']:
    req1 = urllib.request.Request(url, headers=headers)
But if your current code displays all URLs and you still get results only for one page, then the problem is not in the DataFrame but in the HTML and find_all.
It seems only the first page has <div class="thread-detail">, so find_all can't find it on the other pages and nothing gets added to the list. You should check it again; for the other pages you may need different arguments in find_all. But without URLs to these pages we can't check it and we can't help more.
It could also be another common problem: the page may use JavaScript to add these elements, but BeautifulSoup can't run JavaScript, and then you would need [Selenium](https://selenium-python.readthedocs.io/) to control a web browser which can run JavaScript. You could turn off JavaScript in your browser and open the URLs to check whether you can still see the elements on the page and in the HTML in DevTools in Chrome/Firefox.
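If it does turn out to be JavaScript, a minimal Selenium sketch (assuming a webdriver such as geckodriver is installed) could replace the urllib part of your loop:
from selenium import webdriver
from bs4 import BeautifulSoup

driver = webdriver.Firefox()
lists2 = []
for url in page_links:
    driver.get(url)  # the browser executes any JavaScript on the page
    soup1 = BeautifulSoup(driver.page_source, 'lxml')
    for body_links in soup1.find_all('div', class_="thread-detail"):
        lists2.append(body_links.a.get('href'))
driver.quit()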
As for PHPSESSID, with requests you could use a Session to get fresh cookies (including PHPSESSID) from the server and add them automatically to the other requests:
import requests

s = requests.Session()

# get any page to get fresh cookies from the server
r = s.get('http://your-domain/main-page.html')

# the session now sends those cookies automatically
for url in page_links:
    r = s.get(url)

When I take html from a website using urllib2, the inner html is empty. Anyone know why?

I am working on a project and one of the steps includes getting a random word which I will use later. When I try to grab the random word, it gives me '<span id="result"></span>' but as you can see, there is no word inside.
Code:
import urllib2
from bs4 import BeautifulSoup
quote_page = 'http://watchout4snakes.com/wo4snakes/Random/RandomWord'
page = urllib2.urlopen(quote_page)
soup = BeautifulSoup(page, 'html.parser')
name_box = soup.find("span", {"id": "result"})
print name_box
name = name_box.text.strip()
print name
I am thinking that maybe it might need to wait for a word to appear, but I'm not sure how to do that.
This word is added to the page using JavaScript. We can verify this by looking at the actual HTML that is returned in the request and comparing it with what we see in the web browser's DOM inspector. There are two options:
1. Use a library capable of executing JavaScript and giving you the resulting HTML
2. Try a different approach that doesn't require JavaScript support
For 1, we can use something like requests_html. This would look like:
from requests_html import HTMLSession
url = 'http://watchout4snakes.com/wo4snakes/Random/RandomWord'
session = HTMLSession()
r = session.get(url)
# Some sleep required since the default of 0.2 isn't long enough.
r.html.render(sleep=0.5)
print(r.html.find('#result', first=True).text)
For 2, if we look at the network requests that the page is making, we can see that it retrieves random words by making a POST request to http://watchout4snakes.com/wo4snakes/Random/RandomWord. Making a direct request with a library like requests (recommended in the standard library documentation) looks like:
import requests
url = 'http://watchout4snakes.com/wo4snakes/Random/RandomWord'
print(requests.post(url).text)
So the way the site works is that it sends you the page with no word in the span box, and fills it in later through JavaScript; that's why you get a span box with nothing inside.
However, since you're trying to get the word, I'd suggest a different method: rather than scraping the word off the page, you can simply send a POST request to http://watchout4snakes.com/wo4snakes/Random/RandomWord with no body and receive the word in the response.
You're using Python 2, but in Python 3 (just to show that this works) you can do:
>>> import requests
>>> r = requests.post('http://watchout4snakes.com/wo4snakes/Random/RandomWord')
>>> print(r.text)
doom
You can do something similar using urllib in Python 2 as well.
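For reference, a rough urllib2 equivalent in Python 2 (passing a data argument, even an empty string, turns the request into a POST):
import urllib2
req = urllib2.Request('http://watchout4snakes.com/wo4snakes/Random/RandomWord', data='')
print urllib2.urlopen(req).read()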

Python Request Module - Displaying More Results

I'm currently working on a learning project for web scraping.
I've picked my site:
https://www.game.co.uk/en/m/games/best-selling-games/best-selling-xbox-one-games/?merchname=MobileTopNav-_-XboxOne_Games-_-BestSellers#Page0
On this page, there is a button at the bottom that loads the next 10 products; without clicking this button the next batch of products is not displayed, yet the URL does not change when the button is clicked.
I wanted to ask how I can solve this using the requests module.
My code is below:
import requests
from bs4 import BeautifulSoup

r = requests.get("https://www.game.co.uk/en/m/games/best-selling-games/best-selling-xbox-one-games/?merchname=MobileTopNav-_-XboxOne_Games-_-BestSellers")
c = r.content
soup = BeautifulSoup(c, "html.parser")
all = soup.find_all("div", {"class": "product"})
for item in all:
    print(item.find({"h2": "productInfo"}).text.replace('\h2', '').replace(" ", ""))
    print(item.find("span", {"class": "condition"}).text + " " + item.find("span", {"class": "value"}).text)
    try:
        print(item.find_all("span", {"class": "condition"})[1].text + " " + item.find_all("span", {"class": "value"})[1].text)
    except:
        print("No Preowned")
    print(" ")
Try this code to get all the items available on that page. You can use Chrome's dev tools to find this URL, which takes a page-number parameter you can increment.
from bs4 import BeautifulSoup
import requests

page_link = "https://www.game.co.uk/en/m/games/best-selling-games/best-selling-xbox-one-games/?merchname=MobileTopNav-_-XboxOne_Games-_-BestSellers&pageNumber={}&pageMode=true"

page_no = 0
while True:
    page_no += 1
    res = requests.get(page_link.format(page_no))
    soup = BeautifulSoup(res.text, 'lxml')
    container = soup.select(".productInfo h2")
    if len(container) <= 1:
        break
    for content in container:
        print(content.text)
Output of the last few titles:
ARK Survival Evolved
Kingdom Come Deliverance Special Edition
Halo 5 Guardians
Sonic Forces
The Elder Scrolls Online: Summerset - Digital
You need to use a tool that supports JavaScript/jQuery execution, e.g. Selenium (you can still feed the rendered page into BeautifulSoup afterwards).
The problem you're facing is that the content you are trying to access is created dynamically via JavaScript when the mentioned button is clicked.
When you request the page, the additional HTML elements you want to read are not yet created, so BeautifulSoup can't find them.
Using Selenium you can click buttons, fill out forms and much more. You can also wait for the server to create the content you want to access.
The documentation of Selenium should be self-explanatory.
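A rough sketch of that approach (the driver setup and the button selector below are assumptions and would need to be checked in DevTools):
import time
from selenium import webdriver
from selenium.webdriver.common.by import By
from bs4 import BeautifulSoup

driver = webdriver.Chrome()
driver.get("https://www.game.co.uk/en/m/games/best-selling-games/best-selling-xbox-one-games/?merchname=MobileTopNav-_-XboxOne_Games-_-BestSellers")
for _ in range(5):  # click the "show more" button a few times
    try:
        driver.find_element(By.CSS_SELECTOR, ".showMore").click()  # hypothetical selector
        time.sleep(1)  # give the extra products time to load
    except Exception:
        break  # button gone or not clickable: stop
soup = BeautifulSoup(driver.page_source, "html.parser")
for item in soup.find_all("div", {"class": "product"}):
    print(item.find("h2").text.strip())
driver.quit()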

Python Beautiful Soup - Getting past Steam's age check

I'm learning web scraping and I've been trying to write a program that extracts information from Steam's website as an exercise.
I want to write a program that just visits the page of each top 10 best selling game and extracts something, but my program just gets redirected to the age check page when it tries to visit M rated games.
My program looks something like this:
front_page = urlopen('http://store.steampowered.com/').read()
bs = BeautifulSoup(front_page, 'html.parser')
top_sellers = bs.select('#tab_topsellers_content a.tab_item_overlay')
for item in top_sellers:
    game_page = urlopen(item.get('href'))
    bs = BeautifulSoup(game_page.read(), 'html.parser')
    # Now I'm on the age check page :(
I don't know how to get past the age check, I've tried filling out the age check by sending a POST request to it like this:
post_params = urlencode({'ageDay': '1', 'ageMonth': 'January', 'ageYear': '1988', 'snr': '1_agecheck_agecheck__age-gate'}).encode('utf-8')
page = urlopen(agecheckurl, post_params)
But it doesn't work; I'm still on the age check page. Can anyone help me out here: how can I get past it?
Okay, it seems Steam uses a cookie named birthtime to save the age check result.
Since I don't know how to set cookies with urllib, here is an example using requests:
import requests
cookies = {'birthtime': '568022401'}
r = requests.get('http://store.steampowered.com/', cookies=cookies)
Now there is no age check.
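If you'd rather stay with urlopen, a minimal sketch (using Python 3's urllib.request here) is to set the Cookie header by hand:
from urllib.request import Request, urlopen

req = Request('http://store.steampowered.com/', headers={'Cookie': 'birthtime=568022401'})
front_page = urlopen(req).read()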
I like to use Selenium WebDriver for form input, since it's an easy solution for clicks and keystrokes. You can look at the docs or check out the examples here, on "Filling out and Submitting Forms".
https://automatetheboringstuff.com/chapter11/

Scraping website only provides a partial or random data into CSV

I am trying to extract a list of golf course names and addresses from the Garmin website using the script below.
import csv
import requests
from bs4 import BeautifulSoup

courses_list = []
for i in range(893):  # 893
    url = "http://sites.garmin.com/clsearch/courses?browse=1&country=US&lang=en&per_page={}".format(i*20)
    r = requests.get(url)
    soup = BeautifulSoup(r.content)
    g_data2 = soup.find_all("div", {"class": "result"})
    for item in g_data2:
        try:
            name = item.contents[3].find_all("div", {"class": "name"})[0].text
            print name
        except:
            name = ''
        try:
            address = item.contents[3].find_all("div", {"class": "location"})[0].text
        except:
            address = ''
        course = [name, address]
        courses_list.append(course)

with open('PGA_Garmin2.csv', 'a') as file:
    writer = csv.writer(file)
    for row in courses_list:
        writer.writerow([s.encode("utf-8") for s in row])
After running the script, I do not get the full data that I need; it produces seemingly random values rather than a complete data set. I need to extract information from 893 pages and get a list of at least 18,000 courses, but after running this script I only get 122. How do I fix this script so that it produces a CSV with the complete set of golf courses from the Garmin website? I already adjusted the page numbers to reflect the paging on the Garmin website, where the per_page parameter increases in steps of 20.
Just taking a guess here, but try checking r.status_code and confirming that it's 200 for every request. It's possible that you're not actually getting every page.
Stab in the dark.
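A quick way to test that guess, before worrying about the parsing, is a stripped-down loop that only reports non-200 responses and pauses between requests (the 0.5 s delay is an arbitrary, polite value):
import time
import requests

for i in range(893):
    url = "http://sites.garmin.com/clsearch/courses?browse=1&country=US&lang=en&per_page={}".format(i * 20)
    r = requests.get(url)
    if r.status_code != 200:
        print("page {}: HTTP {}".format(i, r.status_code))
    time.sleep(0.5)  # small pause so the server is less likely to throttle or block the crawl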
