Scraping from Yahoo Finance with BeautifulSoup results in status code 404 - Python

I am trying to scrape some data from Yahoo Finance using BeautifulSoup, but I've run into a problem. I am trying to run the following code:
import xlwings as xw
import requests
import bs4 as bs
r = requests.get('https://finance.yahoo.com/quote/DKK=X?p=DKK=X&.tsrc=fin-srch')
soup = bs.BeautifulSoup(r.content,'lxml',from_encoding='utf-8')
However, when inspecting my output from "soup", I get the following status code in the <body> section:
<body>
<!-- status code : 404 -->
<!-- Not Found on Server -->
I've run the exact same piece of code on another trading pair on Yahoo Finance with no problem whatsoever.
Could anyone tell me what I am doing wrong?
Thanks in advance!

You need to inject a User-Agent header to get a 200 response.
#import xlwings as xw
import requests
import bs4 as bs
headers={'User-Agent':'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/98.0.4758.102 Safari/537.36'}
r = requests.get('https://finance.yahoo.com/quote/DKK=X?p=DKK=X&.tsrc=fin-srch',headers=headers)
print(r)
soup = bs.BeautifulSoup(r.content,'lxml')
Output:
<Response [200]>
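With a 200 response you can start pulling values out of the soup. Below is a minimal follow-up sketch; the <fin-streamer> tag and its data-field attribute are assumptions about Yahoo's current markup, so verify the selector in the browser's inspector before relying on it.
import requests
import bs4 as bs
headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/98.0.4758.102 Safari/537.36'}
r = requests.get('https://finance.yahoo.com/quote/DKK=X?p=DKK=X&.tsrc=fin-srch', headers=headers)
soup = bs.BeautifulSoup(r.content, 'lxml')
print(soup.title.text)  # page title, e.g. the pair name
# quote values are assumed to live in <fin-streamer> elements
price_tag = soup.find('fin-streamer', {'data-field': 'regularMarketPrice'})
if price_tag is not None:
    print(price_tag.text)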

Related

Web scraping with BeautifulSoup returns no text even though it is in the HTML

I'm new to web scraping and to BeautifulSoup. I need help, as I don't understand why my code returns no text when the text is visible in the inspect view on the website.
Here is my simple code:
from bs4 import BeautifulSoup
import requests
source = requests.get("https://www.nummerplade.net/nummerplade/Dd97487.html")
soup = BeautifulSoup(source.text,"html.parser")
name = soup.find("span",id="debitorer_name1")
print(name)
The output of running my code is:
<span id="debitorer_name1"></span>
When I inspect the HTML on the website I can see the desired name I want to extract, but not when running my script. Can anyone help me solve this issue?
Thanks!
If you reload the site, you can see that the data in the right-hand pane takes a moment to appear: it is loaded dynamically via a separate request, so it will not be visible in the soup of the initial HTML.
How to find the URL that returns the dynamic data:
Go to the Network tab, reload the site, and search for the value you want in the left-hand panel; that shows you the request URL that returned it.
Then go to Headers and copy the user-agent and referer into your request headers. The endpoint returns the data as JSON, and you can extract whatever data you want from it.
import requests
headers={"user-agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/102.0.0.0 Safari/537.36", "referer": "https://www.nummerplade.net/"}
res=requests.get("https://data3.nummerplade.net/bilbogen2.php?stelnr=salza2bt3nh162519",headers=headers)
Output:
'Sebastian Carl Schwabe'
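The exact shape of the JSON is something to inspect yourself; a rough sketch of how to do that (the key names in the comment are hypothetical, not the real ones):
import requests
headers = {"user-agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/102.0.0.0 Safari/537.36", "referer": "https://www.nummerplade.net/"}
res = requests.get("https://data3.nummerplade.net/bilbogen2.php?stelnr=salza2bt3nh162519", headers=headers)
data = res.json()   # parse the JSON body
print(data)         # inspect the structure first
# then drill down to the field you need, e.g. data[0]['name'] (hypothetical key)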

BeautifulSoup and MechanicalSoup won't read website

I am working with BeautifulSoup and also trying MechanicalSoup. I have gotten them to load other websites, but when I request this website it takes a long time and then never really gets the page. Any ideas would be super helpful.
Here is the BeautifulSoup code that I am writing:
import urllib3
from bs4 import BeautifulSoup as soup
url = 'https://www.apartments.com/apartments/saratoga-springs-ut/1-bedrooms/?bb=hy89sjv-mN24znkgE'
http = urllib3.PoolManager()
r = http.request('GET', url)
Here is the Mechanicalsoup code:
import mechanicalsoup
browser = mechanicalsoup.Browser()
url = 'https://www.apartments.com/apartments/saratoga-springs-ut/1-bedrooms/'
page = browser.get(url)
page
What I am trying to do is gather data on different cities and apartments, so the URL will change to 2-bedrooms and then 3-bedrooms, then it will move to a different city and do the same thing there, so I really need this part to work.
Any help would be appreciated.
You see the same thing if you use curl or wget to fetch the page. My guess is they are using browser detection to try to prevent people from stealing their copyrighted information, as you are attempting to do. You can search for the User-Agent header to see how to pretend to be another browser.
import requests
from bs4 import BeautifulSoup as soup
headers = requests.utils.default_headers()
headers.update({
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/90.0.4430.93 Safari/537.36'
})
url = 'https://www.apartments.com/apartments/saratoga-springs-ut/1-bedrooms/'
r = requests.get(url, headers=headers)
rContent = soup(r.content, 'lxml')
rContent
Just as Tim said, I needed to add headers to my code so that the request wasn't read as coming from a bot.
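For the MechanicalSoup attempt the same idea applies: you can set the User-Agent when constructing the browser. A minimal sketch, reusing the Chrome UA string from the answer above:
import mechanicalsoup
ua = 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/90.0.4430.93 Safari/537.36'
browser = mechanicalsoup.StatefulBrowser(user_agent=ua)  # StatefulBrowser accepts a custom user agent
page = browser.get('https://www.apartments.com/apartments/saratoga-springs-ut/1-bedrooms/')
print(page.status_code)  # page is a requests Response; page.soup holds the parsed HTML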

I tried to do web scraping with Python and the output is empty. Did I do something wrong?

This is the code that I used. You can see that I looked up the result_page class and tried to print it, and the output is []. I'm trying to learn web scraping: I want to scrape the names of the orphanages and copy them into a CSV file, but I couldn't get past this first step. The "result_page" class does exist on the page.
import os
os.system('cls')
import pandas as pd
import requests
from bs4 import BeautifulSoup
page = requests.get('https://www.justdial.com/Chennai/Orphanages/nct-10344906')
soup = BeautifulSoup(page.content,'html.parser')
MainContent = soup.find_all(class_="result_page")
print(MainContent)
If you are getting an empty result even though the page really does contain result_page, it indicates that your request has been blocked.
[UPDATE1]
I tried this to check whether the page can be accessed at all:
import requests
from bs4 import BeautifulSoup as bs
r = requests.get('https://www.justdial.com/Chennai/Orphanages/nct-10344906')
soup = bs(r.content,'html.parser')
print(soup)
and the output:
<html><head>
<title>Access Denied</title>
</head><body>
<h1>Access Denied</h1>
You don't have permission to access "http://www.justdial.com/Chennai/Orphanages/nct-10344906" on this server.<p>
Reference #18.95a0de52.1603091762.1ae82063
</p></body>
</html>
[UPDATE2]
Finally unblocked.
Code:
import requests
from bs4 import BeautifulSoup as bs
headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/86.0.4240.75 Safari/537.36'}
r = requests.get('https://www.justdial.com/Chennai/Orphanages/nct-10344906',headers=headers).text
soup = bs(r,'html.parser')
soup = soup.find("div",{"class":"result_page"})
print(soup)
Note: If you still get errors, make sure you are using the correct user-agent.
Open the site in your browser, press F12, go to the Network tab, refresh the page, click a request, and copy the user-agent from its headers.
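Once the page is no longer blocked, writing the orphanage names to a CSV is straightforward. A rough sketch follows; the store_name class used to locate each name is only a guess and needs to be checked against Justdial's actual markup in the inspector.
import csv
import requests
from bs4 import BeautifulSoup as bs
headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/86.0.4240.75 Safari/537.36'}
r = requests.get('https://www.justdial.com/Chennai/Orphanages/nct-10344906', headers=headers).text
soup = bs(r, 'html.parser')
# 'store_name' is a hypothetical class name -- verify the real one in the browser
names = [tag.get_text(strip=True) for tag in soup.find_all(class_='store_name')]
with open('orphanages.csv', 'w', newline='', encoding='utf-8') as f:
    writer = csv.writer(f)
    writer.writerow(['name'])
    for name in names:
        writer.writerow([name])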

How to avoid a 403 error using BeautifulSoup and headers?

I am using the combination of requests and BeautifulSoup to develop a web-scraping program in Python.
Unfortunately, I get a 403 error (even when using a header).
Here is my code:
from bs4 import BeautifulSoup
from requests import get
headers_m = ({'User-Agent':
'Mozilla/5.0 (Windows NT 6.1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/41.0.2228.0 Safari/537.36'})
sapo_m = "https://www.idealista.it/vendita-case/milano-milano/"
response_m = get(sapo_m, headers=headers_m)
This is not a general Python question. The site blocks such straightforward scraping attempts; you need to find a set of headers (specific to this site) that will pass its validation.
Simply use Chrome as the User-Agent.
import requests
from bs4 import BeautifulSoup
soup = BeautifulSoup(requests.get("https://...", headers={"User-Agent": "Chrome"}).content, 'html.parser')
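If a single User-Agent is not enough for a given site, a fuller browser-like header set sometimes helps. This is only a sketch of the general idea; which headers the site actually checks has to be found by experimenting, for example by copying the headers a real browser sends from the Network tab, and some sites may still block requests regardless of headers.
import requests
from bs4 import BeautifulSoup
headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/90.0.4430.93 Safari/537.36',
    'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8',
    'Accept-Language': 'it-IT,it;q=0.9,en;q=0.8',
    'Referer': 'https://www.idealista.it/',
}
response = requests.get('https://www.idealista.it/vendita-case/milano-milano/', headers=headers)
print(response.status_code)  # check whether the block is gone before parsing
soup = BeautifulSoup(response.content, 'html.parser')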

Web scraping using Python sometimes fetches results, sometimes returns HTTP 429

I am trying to scrape Reddit pages for the videos. I am using Python and Beautiful Soup to do the job. The following code sometimes returns the result and sometimes does not when I rerun it. I'm not sure where I'm going wrong. Can someone help? I'm a newbie to Python, so please bear with me.
import requests
from bs4 import BeautifulSoup
page = requests.get('https://www.reddit.com/r/FortNiteBR/comments/afjbbp/just_trying_to_revive_my_buddy_and_then_he_got/')
soup = BeautifulSoup(page.text, 'html.parser')
source_tags = soup.find_all('source')
print(source_tags)
If you do print(page) right after your page = requests.get('https:/.........'), you'll see you get a successful <Response [200]>.
But if you run it again quickly, you'll get <Response [429]>.
"The HTTP 429 Too Many Requests response status code indicates the user has sent too many requests in a given amount of time ("rate limiting")." Source here
Additionally, if you look at the HTML source, you'd see:
<h1>whoa there, pardner!</h1>
<p>we're sorry, but you appear to be a bot and we've seen too many requests
from you lately. we enforce a hard speed limit on requests that appear to come
from bots to prevent abuse.</p>
<p>if you are not a bot but are spoofing one via your browser's user agent
string: please change your user agent string to avoid seeing this message
again.</p>
<p>please wait 6 second(s) and try again.</p>
<p>as a reminder to developers, we recommend that clients make no
more than <a href="http://github.com/reddit/reddit/wiki/API">one
request every two seconds</a> to avoid seeing this message.</p>
To add headers and avoid the 429, add:
headers = {"User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/70.0.3538.110 Safari/537.36"}
page = requests.get('https://www.reddit.com/r/FortNiteBR/comments/afjbbp/just_trying_to_revive_my_buddy_and_then_he_got/', headers=headers)
Full code:
import requests
from bs4 import BeautifulSoup
headers = {"User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/70.0.3538.110 Safari/537.36"}
page = requests.get('https://www.reddit.com/r/FortNiteBR/comments/afjbbp/just_trying_to_revive_my_buddy_and_then_he_got/', headers=headers)
print (page)
soup = BeautifulSoup(page.text, 'html.parser')
source_tags = soup.find_all('source')
print(source_tags)
Output:
<Response [200]>
[<source src="https://v.redd.it/et9so1j0z6a21/HLSPlaylist.m3u8" type="application/vnd.apple.mpegURL"/>]
I've had no issues rerunning it multiple times after waiting a second or two.
I have tried the code below and it works for me on every request; I added a timeout of 30 seconds.
import requests
from bs4 import BeautifulSoup
page = requests.get('https://www.reddit.com/r/FortNiteBR/comments/afjbbp/just_trying_to_revive_my_buddy_and_then_he_got/', timeout=30)
if page.status_code == 200:
    soup = BeautifulSoup(page.text, 'lxml')
    source_tags = soup.find_all('source')
    print(source_tags)
else:
    print(page.status_code, page)
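Since the 429 is rate limiting, another option is to combine the User-Agent header with a simple wait-and-retry loop, in line with Reddit's hint of no more than one request every two seconds. A minimal sketch:
import time
import requests
from bs4 import BeautifulSoup
headers = {"User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/70.0.3538.110 Safari/537.36"}
url = 'https://www.reddit.com/r/FortNiteBR/comments/afjbbp/just_trying_to_revive_my_buddy_and_then_he_got/'
for attempt in range(5):
    page = requests.get(url, headers=headers, timeout=30)
    if page.status_code == 200:
        soup = BeautifulSoup(page.text, 'html.parser')
        print(soup.find_all('source'))
        break
    time.sleep(2 * (attempt + 1))  # back off when rate limited, then retry
else:
    print('giving up, last status:', page.status_code)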
