I have been trying to learn a bit of Python, and I wrote a small program that asks the user for a subreddit and then prints all the front-page headlines along with links to the articles. Here is the code:
import requests
from bs4 import BeautifulSoup
subreddit = input('Type the subreddit you want to see : ')
link_visit = f'https://www.reddit.com/r/{subreddit}/'
print(link_visit)
base_url = link_visit
r = requests.get(base_url)
soup = BeautifulSoup(r.text, 'html.parser')
for article in soup.find_all('div', class_='top-matter'):
    headline = article.find('p', class_='title')
    print('HeadLine : ', headline.text)
    a = headline.find('a', href=True)
    link = a['href'].split('/domain')
    print('Link : ', link[0])
My problem is that sometimes it prints the desired result, but other times it does nothing: it only asks the user for the subreddit and prints the link to said subreddit.
Can someone explain why this is happening?
Your request is being rejected by reddit in order to conserve their resources.
When you detect the failing case, print out the HTML. I think you'll see something like this:
<h1>whoa there, pardner!</h1>
<p>we're sorry, but you appear to be a bot and we've seen too many requests
from you lately. we enforce a hard speed limit on requests that appear to come
from bots to prevent abuse.</p>
<p>if you are not a bot but are spoofing one via your browser's user agent
string: please change your user agent string to avoid seeing this message
again.</p>
<p>please wait 3 second(s) and try again.</p>
<p>as a reminder to developers, we recommend that clients make no
more than <a href="http://github.com/reddit/reddit/wiki/API">one
request every two seconds</a> to avoid seeing this message.</p>
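If you want to avoid the failing case in the first place, a minimal sketch (assuming reddit's old HTML front page and the "whoa there, pardner" marker shown above) is to send a descriptive User-Agent and back off before retrying:

import time
import requests
from bs4 import BeautifulSoup

headers = {'User-Agent': 'learning-bs4-script/0.1'}  # any descriptive, non-default string
r = requests.get(link_visit, headers=headers)

# If the throttling page shows up, wait and retry once
if 'whoa there, pardner' in r.text.lower():
    time.sleep(2)  # reddit asks for at most one request every two seconds
    r = requests.get(link_visit, headers=headers)

soup = BeautifulSoup(r.text, 'html.parser')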
I am trying to scrape some data from the website https://boardgamegeek.com/browse/boardgame/page/1.
After I have obtained the names of the games and their scores, I would also like to open each game's page and find out how many players are needed for each game. But when I go into each of the games, the URL has a unique number.
For example: when I click on the first game, Gloomhaven, it opens the page https://boardgamegeek.com/boardgame/174430/gloomhaven (the unique number is 174430).
import random as r

while True:
    random_no = r.randint(1000, 300000)
    url2 = "https://boardgamegeek.com/boardgame/" + str(random_no) + "/" + name[0]  # name[0] is the game name scraped earlier
    page2 = requests.get(url2)
    if page2.status_code == 200:
        print("this is it!")
        break
So I generated a random number, plugged it into the URL and read the response. However, even a wrong number gives a 200 response, but it does not open the correct page.
What is this unique number? How can I get it? Or can I use an alternative to get the information I need?
Thanks in advance.
Try this:
import requests
import bs4

soup = bs4.BeautifulSoup(
    requests.get('https://boardgamegeek.com/browse/boardgame/page/1').content,
    'html.parser')
table = soup.find('table', {'id': 'collectionitems'})
urls = ['https://boardgamegeek.com' + a['href'] for a in table.find_all('a', {'class': 'primary'})]
print(urls)
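If you also want the player counts, a sketch (assuming BGG's public XML API at /xmlapi2/thing is still available) is to pull the unique number out of each of those URLs — it is BGG's internal game id — and query the API instead of guessing random numbers:

import re
import requests
import xml.etree.ElementTree as ET

for u in urls[:5]:  # the urls list built above
    game_id = re.search(r'/boardgame/(\d+)/', u).group(1)  # the unique number is BGG's internal game id
    xml = requests.get('https://boardgamegeek.com/xmlapi2/thing', params={'id': game_id}).content
    item = ET.fromstring(xml).find('item')
    print(u, item.find('minplayers').get('value'), '-', item.find('maxplayers').get('value'), 'players')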
I'm trying to scrape data from this review site. It first goes through the first page, checks if there's a 2nd page, and then goes to it too. The problem is when getting to the 2nd page: the page takes time to update and I still get the first page's data instead of the 2nd.
For example, if you go here, you will see how it takes time to load the page 2 data.
I tried to put in a timeout or sleep, but it didn't work. I'd prefer a solution with minimal package/browser dependencies (like webdriver.PhantomJS()) as I need to run this code in my employer's environment and I'm not sure if I can use it. Thank you!!
from urllib.request import Request, urlopen
from time import sleep
from socket import timeout
from bs4 import BeautifulSoup

headers = {'User-Agent': 'Mozilla/5.0'}
req = Request(softwareadvice, headers=headers)  # softwareadvice holds the review-page URL
web_byte = urlopen(req, timeout=10).read()
webpage = web_byte.decode('utf-8')
parsed_html = BeautifulSoup(webpage, features="lxml")

true = parsed_html.find('div', {'class': ['Grid-cell--1of12 pagination-arrows pagination-arrows-right']})
while true:
    true = parsed_html.find('div', {'class': ['Grid-cell--1of12 pagination-arrows pagination-arrows-right']})
    if not true:
        break
    req = Request(softwareadvice + '?review.page=2', headers=headers)
    sleep(10)
    webpage = urlopen(req, timeout=10)
    sleep(10)
    webpage = webpage.read().decode('utf-8')
    parsed_html = BeautifulSoup(webpage, features="lxml")
The reviews are loaded from an external source via an Ajax request. You can use this example to load them:
import re
import json
import requests
from bs4 import BeautifulSoup
url = "https://www.softwareadvice.com/sms-marketing/twilio-profile/reviews/"
api_url = "https://pkvwzofxkc.execute-api.us-east-1.amazonaws.com/production/reviews"
params = {
    "q": "s*|-s*",
    "facet.gdm_industry_id": '{"sort":"bucket","size":200}',
    "fq": "(and product_id: '{}' listed:1)",
    "q.options": '{"fields":["pros^5","cons^5","advice^5","review^5","review_title^5","vendor_response^5"]}',
    "size": "50",
    "start": "50",
    "sort": "completeness_score desc,date_submitted desc",
}

# get product id
soup = BeautifulSoup(requests.get(url).content, "html.parser")
a = soup.select_one('a[href^="https://reviews.softwareadvice.com/new/"]')
id_ = int("".join(re.findall(r"\d+", a["href"])))
params["fq"] = params["fq"].format(id_)

for start in range(0, 3):  # <-- increase the number of pages here
    params["start"] = 50 * start

    data = requests.get(api_url, params=params).json()

    # uncomment this to print all data:
    # print(json.dumps(data, indent=4))

    # print some data:
    for h in data["hits"]["hit"]:
        if "review" in h["fields"]:
            print(h["fields"]["review"])
            print("-" * 80)
Prints:
After 2 years using Twilio services, mainly phone and messages, I can say I am so happy I found this solution to handle my communications. It is so flexible, Although it has been a little bit complicated sometimes to self-learn about online phoning systems it saved me from a lot of hassles I wanted to avoid. The best benefit you get is the ultra efficient support service
--------------------------------------------------------------------------------
An amazingly well built product -- we rarely if ever had reliability issues -- the Twilio Functions were an especially useful post-purchase feature discovery -- so much so that we still use that even though we don't do any texting. We also sometimes use FracTEL, since they beat Twilio on pricing 3:1 for 1-800 texts *and* had MMS 1-800 support long before Twilio.
--------------------------------------------------------------------------------
I absolutely love using Twilio, have had zero issues in using the SIP and text messaging on the platform.
--------------------------------------------------------------------------------
Authy by Twilio is a run-of-the-mill 2FA app. There's nothing special about it. It works when you're not switching your hardware.
--------------------------------------------------------------------------------
We've had great experience with Twilio. Our users sign up for text notification and we use Twilio to deliver them information. That experience has been well-received by customers. There's more to Twilio than that but texting is what we use it for. The system barely ever goes down and always shows us accurate information of our usage.
--------------------------------------------------------------------------------
...and so on.
I have been scraping many types of websites, and I think that in the world of scraping there are roughly two types of websites.
The first is "URL-based" websites (i.e. you send a request with a URL and the server responds with HTML tags from which elements can be directly extracted), and the second is "JavaScript-rendered" websites (i.e. the response you get is only JavaScript, and you can only see the HTML tags after it has run).
In the former case, you can freely navigate through the website with bs4. But in the latter case, you cannot always rely on URLs alone.
The site you are going to scrape is built with Angular.js, which is based on client-side rendering. So, the response you get is the JavaScript code, not HTML tags with page content in it. You have to run the code to get the content.
About the code you introduced:
req = Request(softwareadvice, headers={'User-Agent': 'Mozilla/5.0'})
web_byte = urlopen(req, timeout=10).read() # response is javascript, not page content you want...
webpage = web_byte.decode('utf-8')
All you can get is the JavaScript code that must be run to produce the HTML elements. That is why you get the same page (response) every time.
So, what to do? Is there any way to run JavaScript within bs4? I don't think there is an appropriate way to do this. You can use Selenium for this: you can literally wait until the page fully loads, click buttons and anchors, and grab the page content at any time.
A headless browser in Selenium might also work, which means you don't have to see the controlled browser opening on your computer.
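As a rough sketch (assuming chromedriver is available; the CSS selector for the next-page arrow is a guess based on the class names in your code), a headless browser with explicit waits could return the rendered page-2 HTML:

from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC

options = webdriver.ChromeOptions()
options.add_argument('--headless')            # no visible browser window
driver = webdriver.Chrome(options=options)

driver.get(softwareadvice)                    # the review-page URL from your script
wait = WebDriverWait(driver, 15)

# wait until the next-page arrow is clickable, then click it
arrow = wait.until(EC.element_to_be_clickable(
    (By.CSS_SELECTOR, '.pagination-arrows-right a')))
arrow.click()

# give the Ajax-loaded reviews time to replace the old ones, then grab the HTML
wait.until(EC.staleness_of(arrow))
page2_html = driver.page_source
driver.quit()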
Here are some links that might be of help to you.
scrape html generated by javascript with python
https://sadesmith.com/2018/06/15/blog/scraping-client-side-rendered-data-with-python-and-selenium
Thanks for reading.
I am currently working on creating a Telegram bot.
I now want to add the command /drop Sorties, but I need bs4 to scrape a table from this page.
The bot should answer something like:
Rifle Riven Mod Rare (6.79%)
Ayatan Anasa Sculpture Uncommon (28.00%)
4000 Endo Uncommon (12.10%)
etc etc etc..
I need to define something in the code that looks ONLY for the user's input on that page, and replies with the next table it finds on that page.
Example HTML from the link provided above:
<h3 id="sortieRewards">Sorties:</h3>
<table><tbody><tr><th colspan="2">Sortie</th></tr><tr><td>Rifle Riven Mod</td><td>Rare (6.79%)</td></tr><tr><td>Ayatan Anasa Sculpture</td><td>Uncommon (28.00%)</td></tr><tr><td>4000 Endo</td><td>Uncommon (12.10%)</td>
The bot should reply with the content of the table even if the input from the user is Sortie and not Sorties:
soup = BeautifulSoup(page, 'lxml')
sorties_header = soup.find('h3', {'id': 'sortieRewards'})
sorties_table = sorties_header.find_next('table')

# First row is the header. We need to skip it.
for sortie in sorties_table.find_all('tr')[1:]:
    data = sortie.find_all('td')
    item = data[0].text
    drop_rate = data[1].text
    print(item, drop_rate)
The output is
Rifle Riven Mod Rare (6.79%)
Ayatan Anasa Sculpture Uncommon (28.00%)
4000 Endo Uncommon (12.10%)
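To make the lookup tolerant of Sortie vs Sorties, a sketch like this helper (it just does a case-insensitive substring match against the <h3> headers, nothing Telegram-specific) could be called with whatever the user typed:

def find_drop_table(soup, query):
    # find the first <h3> whose text contains the user's query, ignoring case
    header = soup.find('h3', string=lambda t: t and query.lower() in t.lower())
    if header is None:
        return None
    lines = []
    for row in header.find_next('table').find_all('tr')[1:]:  # skip the header row
        cells = row.find_all('td')
        if len(cells) >= 2:
            lines.append(cells[0].text + ' ' + cells[1].text)
    return '\n'.join(lines)

# e.g. the /drop handler could reply with:
# find_drop_table(soup, 'Sortie')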
Here is the URL:
"https://www.gumtree.com/p/sofas/dfs-couches.-two-3-seaters.-one-teal-and-one-green.-pink-storage-footrest.-less-than-2-years-old.-/1265932994"
Login details:
username: life#tech69.com
password: shiva#123
When opening the page with the above credentials, we can get info like
Contact details
0770228XXXX
However, adding ?srn=true at the end of the URL gives the following info
(https://www.gumtree.com/p/sofas/dfs-couches.-two-3-seaters.-one-teal-and-one-green.-pink-storage-footrest.-less-than-2-years-old.-/1265932994?srn=true)
Contact details
07702287887
The code I've used is below:
import requests
from bs4 import BeautifulSoup
s = requests.session()
login_data = dict(email='life#tech69.com', password='shiva#123')
s.post('https://my.gumtree.com/login', data=login_data)
r = s.get('https://www.gumtree.com/p/sofas/dfs-couches.-two-3-seaters.-one-teal-and-one-green.-pink-storage-footrest.-less-than-2-years-old.-/1265932994?srn=true')
soup = BeautifulSoup(r.content, 'lxml')
y = soup.find('strong' , 'txt-large txt-emphasis form-row-label').text
print str(y)
However, the above Python code is still giving the partial info:
0770228XXXX
How can I fetch the full info using Python?
That site is protected by reCAPTCHA, a technology that is specifically designed to prevent automated logins.
So the line s.post('https://my.gumtree.com/login', data=login_data)
results in a CAPTCHA challenge rather than a logged-in session,
so when you try to go to the other URL you are not actually logged in, and it will not reveal the number...
There may be ways to circumvent this, but I'm not sure of any offhand...
I'm learning web scraping, and as an exercise I've been trying to write a program that extracts information from Steam's website.
I want to write a program that visits the page of each of the top 10 best-selling games and extracts something, but my program just gets redirected to the age check page when it tries to visit M-rated games.
My program looks something like this:
from urllib.request import urlopen
from bs4 import BeautifulSoup

front_page = urlopen('http://store.steampowered.com/').read()
bs = BeautifulSoup(front_page, 'html.parser')
top_sellers = bs.select('#tab_topsellers_content a.tab_item_overlay')

for item in top_sellers:
    game_page = urlopen(item.get('href'))
    bs = BeautifulSoup(game_page.read(), 'html.parser')
    # Now I'm on the age check page :(
I don't know how to get past the age check. I've tried filling it out by sending a POST request like this:
from urllib.parse import urlencode

post_params = urlencode({'ageDay': '1', 'ageMonth': 'January', 'ageYear': '1988', 'snr': '1_agecheck_agecheck__age-gate'}).encode('utf-8')
page = urlopen(agecheckurl, post_params)
But it doesn't work, I'm still on the age check page. Can anyone help me out here, how can I get past it?
Okay, it seems like Steam uses cookies to save the age check result; the key is birthtime.
Since I don't know how to set cookies using urllib, here is an example using requests:
import requests
cookies = {'birthtime': '568022401'}
r = requests.get('http://store.steampowered.com/', cookies=cookies)
Now there is no age check.
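If you would rather stick with urllib, passing the cookie as a raw header should also work (untested sketch):

from urllib.request import Request, urlopen

req = Request('http://store.steampowered.com/',
              headers={'Cookie': 'birthtime=568022401'})
front_page = urlopen(req).read()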
I like to use Selenium WebDriver for form input, since it's an easy solution for clicks and keystrokes. You can look at the docs or check out the examples here, under "Filling out and Submitting Forms":
https://automatetheboringstuff.com/chapter11/
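For the Steam age gate specifically, a sketch along these lines might work; the ageYear field name comes from the POST parameters in the question, while the game URL and the id of the view-page button are assumptions you would need to verify against the live page:

from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import Select

driver = webdriver.Chrome()
driver.get(game_url)  # game_url: an M-rated game page that triggers the age check (placeholder)

# 'ageYear' matches the form field from the question; pick a year that makes you over 18
Select(driver.find_element(By.ID, 'ageYear')).select_by_visible_text('1988')

# the button id below is an assumption, adjust it to whatever the page actually uses
driver.find_element(By.ID, 'view_product_page_btn').click()

game_html = driver.page_source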