I was writing a Python script to grab the lyrics of a song from azlyrics using the requests module. This is the script I wrote:
import requests, re
from bs4 import BeautifulSoup as bs
url = "http://search.azlyrics.com/search.php"
payload = {'q' : 'shape of you'}
r = requests.get(url, params = payload)
soup = bs(r.text,"html.parser")
try:
    link = soup.find('a', {'href':re.compile('http://www.azlyrics.com/lyrics/edsheeran/shapeofyou.html')})['href']
    link = link.replace('http', 'https')
    print(link)
    raw_data = requests.get(link)
except Exception as e:
    print(e)
but I got an exception stating:
Max retries exceeded with url: /lyrics/edsheeran/shapeofyou.html (Caused by NewConnectionError('<requests.packages.urllib3.connection.VerifiedHTTPSConnection object at 0x7fbda00b37f0>: Failed to establish a new connection: [Errno 111] Connection refused',))
I read on the internet that I am probably sending too many requests, so I made the script sleep for some time:
import requests, re
from bs4 import BeautifulSoup as bs
from time import sleep
url = "http://search.azlyrics.com/search.php"
payload = {'q' : 'shape of you'}
r = requests.get(url, params = payload)
soup = bs(r.text,"html.parser")
try:
    link = soup.find('a', {'href':re.compile('http://www.azlyrics.com/lyrics/edsheeran/shapeofyou.html')})['href']
    link = link.replace('http', 'https')
    sleep(60)
    print(link)
    raw_data = requests.get(link)
except Exception as e:
    print(e)
but no luck!
So I tried the same with urllib.request:
import requests, re
from bs4 import BeautifulSoup as bs
from time import sleep
from urllib.request import urlopen
url = "http://search.azlyrics.com/search.php"
payload = {'q' : 'shape of you'}
r = requests.get(url, params = payload)
soup = bs(r.text,"html.parser")
try:
    link = soup.find('a', {'href':re.compile('http://www.azlyrics.com/lyrics/edsheeran/shapeofyou.html')})['href']
    link = link.replace('http', 'https')
    sleep(60)
    print(link)
    raw_data = urlopen(link).read()
except Exception as e:
    print(e)
but then I got a different exception stating:
<urlopen error [Errno 111] Connection refused>
Can anyone tell me what's wrong with it and how I can fix it?
Try it in your web browser; when you try to visit http://www.azlyrics.com/lyrics/edsheeran/shapeofyou.html it'll work fine, but when you try to visit https://www.azlyrics.com/lyrics/edsheeran/shapeofyou.html it won't work.
So remove your link = link.replace('http', 'https') line and try again.
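For reference, here is a minimal sketch of the corrected flow, keeping the link on plain http as suggested above (whether azlyrics still serves these pages without additional headers is not guaranteed):

import requests, re
from bs4 import BeautifulSoup as bs

url = "http://search.azlyrics.com/search.php"
r = requests.get(url, params={'q': 'shape of you'})
soup = bs(r.text, "html.parser")

# pick the result link as-is, without rewriting the scheme to https
link = soup.find('a', {'href': re.compile('azlyrics.com/lyrics/edsheeran/shapeofyou')})['href']
raw_data = requests.get(link)
print(raw_data.status_code)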
# requirements
import pandas as pd
from urllib.request import Request, urlopen
from fake_useragent import UserAgent
from bs4 import BeautifulSoup
ua = UserAgent()
ua.ie
req = Request(df["URL"][0], headers={"User-Agent" : ua.ie})
html = urlopen(req).read()
soup_tmp = BeautifulSoup(html, "html.parser")
soup_tmp.find("p", "addy")  # soup_tmp.select_one(".addy")
URLError: <urlopen error [Errno 8] nodename nor servname provided, or not known>
I'm a student studying Python in VS Code.
I don't know what I'm missing.
df["URL"][0] by itself works.
Can anybody help me?
Update: I solved it!
import requests
req = requests.get(df["URL"][49], headers={'user-agent': ua.ie})
soup_tmp = BeautifulSoup(req.content, 'html.parser')
soup_tmp.select_one('.addy')
It works!
Obviously, the problem is df["URL"][0] in the line:
req = Request(df["URL"][0], headers={"User-Agent" : ua.ie})
At the same time, you didn't provide the URL you used. I tested with Google and it worked fine:
url='https://www.google.com'
req = Request(url, headers={"User-Agent" : ua.ie})
You need to check whether the URL you are using is correct; the problem is not in the code itself.
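If you want to confirm the value before building the request, a quick sanity check along these lines can help (a sketch; df comes from the question and is assumed to hold full URLs):

from urllib.parse import urlparse

candidate = df["URL"][0]
print(repr(candidate))  # make sure it is a full URL, not a relative path or NaN

parts = urlparse(str(candidate))
if not parts.scheme or not parts.netloc:
    print("Not a complete URL:", candidate)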
This is the code I wrote in Python for opening a URL.
from urllib.request import urlopen
from urllib.error import HTTPError
from bs4 import BeautifulSoup
import time
import requests
from random import randint
import urllib.parse
class AmazonReviews():
    def __init__(self):
        self.headers = {'User-Agent' : 'Mozilla/5.0'}

    def open_url(self, url):
        values = {}
        data = urllib.parse.urlencode(values).encode('utf-8')
        req = urllib.request.Request(url, data, self.headers)
        response = urllib.request.urlopen(req)
        html = response.read()
        return html

    def fetch_reviews(self, all_reviews_link):
        try:
            url = "https://www.amazon.in" + all_reviews_link
            print(url)
            html = self.open_url(url)
        except HTTPError as e:
            print(e)

review = AmazonReviews()
review.fetch_reviews('/gp/profile/amzn1.account.AFBWOEM2CWLC7ZRQ7WK6FQYXH6AA/ref=cm_cr_arp_d_gw_btm?ie=UTF8')
I am passing the URL this way because in the main project it is scraped from an href attribute, which gives a relative path.
If there is any method to get the absolute URL, please suggest it.
Output -
https://www.amazon.in/gp/profile/amzn1.account.AFBWOEM2CWLC7ZRQ7WK6FQYXH6AA/ref=cm_cr_arp_d_gw_btm?ie=UTF8
HTTP Error 404: NotFound
Link of the code
https://onlinegdb.com/SyFPXzWVI
Use Selenium instead:
from selenium import webdriver
import os
browser = webdriver.Chrome(executable_path=os.path.abspath(os.getcwd()) + "/chromedriver")
link = "https://www.amazon.in/gp/profile/amzn1.account.AFBWOEM2CWLC7ZRQ7WK6FQYXH6AA/ref=cm_cr_arp_d_gw_btm?ie=UTF8"
browser.get(link)
name = browser.find_element_by_xpath('//*[@id="customer-profile-name-header"]/div[2]/span').text
Output:
Dheeraj Malhotra
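As for the question about turning the scraped relative href into an absolute URL, urllib.parse.urljoin handles that. A minimal sketch (base and href taken from the question):

from urllib.parse import urljoin

base = "https://www.amazon.in"
href = "/gp/profile/amzn1.account.AFBWOEM2CWLC7ZRQ7WK6FQYXH6AA/ref=cm_cr_arp_d_gw_btm?ie=UTF8"

absolute = urljoin(base, href)
print(absolute)  # https://www.amazon.in/gp/profile/...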
I get a "Connection Error" while scraping data. It works fine for a while and then raises the error; how can I overcome it?
import requests
from bs4 import BeautifulSoup
url = "https://www.example.com"
for page in range(0, 951, 50):
    new_url = url + str(page) + "&pagingSize=50"
    r = requests.get(new_url)
    source = BeautifulSoup(r.content, "html.parser")
    content = source.select('tr.searchResultsItem:not(.nativeAd, .classicNativeAd)')
    print(content)
When I get this error, I want the script to wait for a while and continue where it left off.
Error:
ConnectionError: ('Connection aborted.', OSError("(10054, 'WSAECONNRESET')"))
You can work around connection resets (and other networking problems) by implementing retries. Basically, you can tell requests to retry automatically when a problem occurs.
Here's how you can do it:
import requests
from requests.adapters import HTTPAdapter
from requests.packages.urllib3.util.retry import Retry
session = requests.Session()
# in case of error, retry at most 3 times, waiting
# at least half a second between each retry
retry = Retry(total=3, backoff_factor=0.5)
adapter = HTTPAdapter(max_retries=retry)
session.mount('http://', adapter)
session.mount('https://', adapter)
Then, instead of:
r = requests.get(new_url)
you can use:
r = session.get(new_url)
See also the documentation for Retry for a full overview of the scenarios it supports.
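Putting it together with the loop from the question (a sketch; url and BeautifulSoup are taken from the snippet above, and the one-second pause is just an example of being polite between pages):

import time

for page in range(0, 951, 50):
    new_url = url + str(page) + "&pagingSize=50"
    r = session.get(new_url)  # retried automatically on connection problems
    source = BeautifulSoup(r.content, "html.parser")
    content = source.select('tr.searchResultsItem:not(.nativeAd, .classicNativeAd)')
    print(content)
    time.sleep(1)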
I was not able to fetch a URL from biblegateway.com; it shows this error:
urllib2.URLError: <urlopen error [Errno 1] _ssl.c:510: error:14077410:SSL routines:SSL23_GET_SERVER_HELLO:sslv3 alert handshake failure>
Please don't mark this as a duplicate; I went through the sites linked as duplicates and didn't understand how they apply here.
Here is my code:
import urllib2
url = 'https://www.biblegateway.com/passage/?search=Deuteronomy+1&version=NIV'
response = urllib2.urlopen(url)
html = response.read()
print html
Here is a good reference for fetching URLs.
In python 3 you can have:
from urllib.request import urlopen
URL = 'https://www.biblegateway.com/passage/?search=Deuteronomy+1&version=NIV'
f = urlopen(URL)
myfile = f.read()
print(myfile)
Not sure it clears the SSL problem though. Maybe there are some clues here.
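If the handshake failure persists on Python 3, one thing worth trying (a sketch, not guaranteed to help if the underlying OpenSSL build is outdated) is passing an explicit TLS context to urlopen:

import ssl
from urllib.request import urlopen

url = 'https://www.biblegateway.com/passage/?search=Deuteronomy+1&version=NIV'
ctx = ssl.create_default_context()  # system CAs and modern TLS settings
with urlopen(url, context=ctx) as f:
    print(f.read()[:200])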
I am trying to get the HTML source using urllib.request.urlopen.
I don't know why my code does not stop.
from urllib.request import urlopen
url = "https://google.com"
try:
    page = urlopen(url)
    f = open("google.html", "wb")
    f.write(page.read())
except:
    print("Error")
But with a timeout, it received the source.
from urllib.request import urlopen
url = "https://google.com"
try:
    page = urlopen(url, timeout=0.1)
    f = open("google.html", "wb")
    f.write(page.read())
except:
    print("Error")