So I'm building a Python script to scrape some data (World Cup scores) from a URL using Requests and BeautifulSoup4. While I'm testing my code I'm making more requests than the website would like, which periodically results in this error:
requests.exceptions.ConnectionError: Max retries exceeded with url
I don't actually need to keep calling the page; surely I only need to call it once, save the returned data locally, and feed that into Beautiful Soup. Surely I'm not the first to do this, so is there another way? This is probably trivial, but I'm pretty new to this. Thanks.
Here's what I'm working with:
import requests
from bs4 import BeautifulSoup
url = "https://www.telegraph.co.uk/world-cup/2018/06/26/world-cup-2018-fixtures-complete-schedule-match-results-far/"
response = requests.get(url)
html = response.content
soup = BeautifulSoup(html, "html.parser")
Store the HTML in a file once:
response = requests.get(url)
with open('cache.html', 'wb') as f:
    f.write(response.content)
Then, next time, simply load it from the file:
with open('cache.html', 'rb') as f:
    soup = BeautifulSoup(f.read(), 'html.parser')
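If you want a single script that handles both cases, a minimal sketch (the cache.html filename is just an example) could check whether the cached copy already exists and only hit the site when it does not:
import os

import requests
from bs4 import BeautifulSoup

url = "https://www.telegraph.co.uk/world-cup/2018/06/26/world-cup-2018-fixtures-complete-schedule-match-results-far/"
cache_file = 'cache.html'  # example filename

if not os.path.exists(cache_file):
    # only hit the website when there is no local copy yet
    response = requests.get(url)
    with open(cache_file, 'wb') as f:
        f.write(response.content)

# from here on, always parse the local copy
with open(cache_file, 'rb') as f:
    soup = BeautifulSoup(f.read(), 'html.parser')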
You can also try waiting 1 or 2 seconds if the error appears:
import time

import requests
from bs4 import BeautifulSoup

url = "https://www.telegraph.co.uk/world-cup/2018/06/26/world-cup-2018-fixtures-complete-schedule-match-results-far/"

while True:
    try:
        response = requests.get(url)
        html = response.content
        soup = BeautifulSoup(html, "html.parser")
        break
    except requests.exceptions.ConnectionError:
        print("Connection refused by the server..")
        print("Let me sleep for 2 seconds")
        time.sleep(2)
        print("Continue...")
        continue
I couldn't test it, so it may not work exactly like this.
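Another option, since the error is "Max retries exceeded", is to let requests handle the retries and delays for you via urllib3's Retry. This is a minimal sketch; the retry count and backoff_factor are just example values:
import requests
from requests.adapters import HTTPAdapter
from urllib3.util.retry import Retry

url = "https://www.telegraph.co.uk/world-cup/2018/06/26/world-cup-2018-fixtures-complete-schedule-match-results-far/"

session = requests.Session()
# retry up to 3 times, waiting progressively longer between attempts
retries = Retry(total=3, backoff_factor=1, status_forcelist=[429, 500, 502, 503])
session.mount("https://", HTTPAdapter(max_retries=retries))

response = session.get(url)
Even with retries, caching the page locally as shown above is still the kinder option while you are iterating on your parsing code.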
Related
I'm new to web scraping and I have tried multiple sites, but the results aren't the same. These are the basic lines of code I always use:
from bs4 import BeautifulSoup
import requests
url = 'https://edition.cnn.com/world'
response = requests.get(url)
soup = BeautifulSoup(response.text, "html.parser")
data = soup.find_all(class_= 'cd__content')
print(data)
for i in data:
    print(i.getText())
If I replace the URL with https://www.cnn.com/, the result I receive is empty. They both have the same class 'cd__content'. Can somebody explain to me why?
I am doing a lot of work with Beautiful Soup. However, my supervisor does not want me doing the work "in real time" from the web. Instead, he wants me to download all the text from a webpage and then work on it later. He wants to avoid repeated hits on a website.
Here is my code:
import requests
from bs4 import BeautifulSoup
url = 'https://scholar.google.com/citations?user=XpmZBggAAAAJ'
page = requests.get(url)
soup = BeautifulSoup(page.text, 'lxml')
I am unsure whether I should save "page" as a file and then import that into Beautiful Soup, or whether I should save "soup" as a file to open later. I also do not know how to save this as a file in a way that can be accessed as if it were "live" from the internet. I know almost nothing about Python, so I need the absolute easiest and simplest process for this.
So saving soup would be... tough, and out of my experience (read more about the pickling process if interested). You can save the page as follows:
page = requests.get(url)
with open('path/to/saving.html', 'wb+') as f:
    f.write(page.content)
Then later, when you want to do analysis on it:
with open('path/to/saving.html', 'rb') as f:
    soup = BeautifulSoup(f.read(), 'lxml')
Something like that, anyway.
The following code iterates over url_list and saves all the responses into the list all_pages, which is then stored in the responses.pickle file.
import pickle
import requests
from bs4 import BeautifulSoup
all_pages = []
for url in url_list:  # url_list is assumed to already hold the URLs you want to fetch
    all_pages.append(requests.get(url))

with open("responses.pickle", "wb") as f:
    pickle.dump(all_pages, f)
Then later on, you can load this data, "soupify" each response and do whatever you need with it.
with open("responses.pickle", "rb") as f:
all_pages = pickle.load(f)
for page in all_pages:
soup = BeautifulSoup(page.text, 'lxml')
# do stuff
Working with our request:
url = 'https://scholar.google.com/citations?user=XpmZBggAAAAJ'
page = requests.get(url)
soup = BeautifulSoup(page.text, 'lxml')
You can also write the prettified soup out to a file:
with open("path/page.html", "w", encoding="utf-8") as f:
    f.write(soup.prettify())
I am using requests-HTML and BeautifulSoup to scrape a website; the code is below. The weird thing is that I can sometimes get the text from the page using print(soup.get_text()), but I get some random codes when using print(soup) - see the image attached.
from requests_html import HTMLSession
from bs4 import BeautifulSoup as bs

session = HTMLSession()
r = session.get(url)  # url is assumed to be defined
soup = bs(r.content, "html.parser")
print(soup.get_text())
#print(soup)
The program returned this when I tried to look at the soup:
I think the site is JavaScript-protected. Well, try this; it might help:
import requests
from bs4 import BeautifulSoup

r = requests.get(url)  # url is assumed to be defined
print(r.text)
# if you want the whole content you can slice the response stored in r,
# or just do it with bs4
soup = BeautifulSoup(r.text, "html.parser")
print(soup.text)
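Since the question already uses requests-HTML, another option is to let it render the page's JavaScript before parsing. This is a minimal sketch, assuming the missing content is filled in by JavaScript (note that render() downloads Chromium the first time it runs):
from requests_html import HTMLSession
from bs4 import BeautifulSoup

session = HTMLSession()
r = session.get(url)  # url is assumed to be defined
r.html.render()  # execute the page's JavaScript
soup = BeautifulSoup(r.html.html, "html.parser")
print(soup.get_text())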
Previously, I posted a question about how to get the data from an AJAX website, linked here: Scraping AJAX e-commerce site using python
I understand a bit about how to get the response by using Chrome's F12 Network tab and doing some coding with Python to display the data, but I can't find the specific API URL for it. The JSON data is not coming from a URL like on the previous website; instead it sits in the page itself, visible in Chrome's Inspect Element (F12).
My real question is: how do I get ONLY the JSON data, using BeautifulSoup or anything related to it? Once I can extract just the JSON data from the application/ld+json script, I will convert it into JSON that Python can recognize so that I can display the products in table form.
One more problem: after I run the code several times, the JSON data goes missing. I think the website is blocking my IP address. How do I solve this problem?
Here is the website link:
https://www.lazada.com.my/catalog/?_keyori=ss&from=input&page=1&q=h370m&sort=priceasc
Here is my code
from bs4 import BeautifulSoup
import requests

page_link = 'https://www.lazada.com.my/catalog/?_keyori=ss&from=input&page=1&q=h370m&sort=priceasc'
page_response = requests.get(page_link, timeout=5)
page_content = BeautifulSoup(page_response.content, "html.parser")
print(page_content)
You can just use the find method, pointing it at your <script> tag with the attr type=application/json.
Then you can use the json package to load that value into a dict.
Here is a code sample:
from bs4 import BeautifulSoup as soup
import requests
import json
page_link = 'https://www.lazada.com.my/catalog/?_keyori=ss&from=input&page=1&q=h370m&sort=priceasc'
page_response = requests.get(page_link, timeout=5)
page_content = soup(page_response.text, "html.parser")
json_tag = page_content.find('script',{'type':'application/json'})
json_text = json_tag.get_text()
json_dict = json.loads(json_text)
print(json_dict)
EDIT: My bad, I didn't see you were searching for the type=application/ld+json attr.
As there seem to be several <script> tags with this attr, you can simply use the find_all method:
from bs4 import BeautifulSoup as soup
import requests
import json
page_link = 'https://www.lazada.com.my/catalog/?_keyori=ss&from=input&page=1&q=h370m&sort=priceasc'
page_response = requests.get(page_link, timeout=5)
page_content = soup(page_response.text, "html.parser")
json_tags = page_content.find_all('script',{'type':'application/ld+json'})
for jtag in json_tags:
    json_text = jtag.get_text()
    json_dict = json.loads(json_text)
    print(json_dict)
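If you are not sure which of those dicts actually holds the product data, it can help to pretty-print each one while exploring; a small sketch:
for jtag in json_tags:
    json_dict = json.loads(jtag.get_text())
    # pretty-print so the structure is easier to read and the fields you need stand out
    print(json.dumps(json_dict, indent=2, ensure_ascii=False))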
You will have to parse the data out of the HTML manually from your soup, as some websites restrict their JSON API from other parties.
You can find out more details here in the documentation:
https://www.crummy.com/software/BeautifulSoup/bs4/doc/
Try:
import requests

response = requests.get(url)  # url is assumed to point at an endpoint that returns JSON
data = response.json()
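Note that response.json() only works when the server actually returns JSON; on an HTML page like the Lazada catalog above it will raise an error. A minimal sketch of guarding against that:
import requests

response = requests.get(url)  # url is assumed to be defined
try:
    data = response.json()
except ValueError:
    # the body was not JSON; fall back to parsing the HTML instead
    data = None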
When I inspect the elements in my browser, I can obviously see the exact web content. But when I try to run the script below, I cannot see some of the web page details. In the web page I see there are "#document" elements, and those are missing when I run the script. How can I see the details of the #document elements, or extract them with the script?
from bs4 import BeautifulSoup
import requests
response = requests.get('http://123.123.123.123/')
soup = BeautifulSoup(response.content, 'html.parser')
print(soup.prettify())
You need to make additional requests to get the frame page contents as well:
from urllib.parse import urljoin

from bs4 import BeautifulSoup
import requests

BASE_URL = 'http://123.123.123.123/'

with requests.Session() as session:
    response = session.get(BASE_URL)
    soup = BeautifulSoup(response.content, 'html.parser')

    for frame in soup.select("frameset frame"):
        frame_url = urljoin(BASE_URL, frame["src"])
        response = session.get(frame_url)
        frame_soup = BeautifulSoup(response.content, 'html.parser')
        print(frame_soup.prettify())
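If the page embeds content with <iframe> tags instead of a <frameset> (iframes also show up as #document nodes in the browser inspector), a similar loop should work inside the same session block; a sketch, assuming each iframe has a src attribute:
    for iframe in soup.find_all("iframe", src=True):
        iframe_url = urljoin(BASE_URL, iframe["src"])
        response = session.get(iframe_url)
        iframe_soup = BeautifulSoup(response.content, 'html.parser')
        print(iframe_soup.prettify())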