Save HTML to a file to work with later using Beautiful Soup - Python

I am doing a lot of work with Beautiful Soup. However, my supervisor does not want me doing the work "in real time" from the web. Instead, he wants me to download all the text from a webpage and then work on it later. He wants to avoid repeated hits on a website.
Here is my code:
import requests
from bs4 import BeautifulSoup
url = 'https://scholar.google.com/citations?user=XpmZBggAAAAJ'
page = requests.get(url)
soup = BeautifulSoup(page.text, 'lxml')
I am unsure whether I should save "page" as a file and then import that into Beautiful Soup, or whether I should save "soup" as a file to open later. I also do not know how to save this as a file in a way that can be accessed as if it were "live" from the internet. I know almost nothing about Python, so I need the absolute easiest and simplest process for this.

So saving soup would be... tough, and out of my experience (read more about the pickling process if interested). You can save the page as follows:
page = requests.get(url)
with open('path/to/saving.html', 'wb+') as f:
    f.write(page.content)
Then later, when you want to do analysis on it:
with open('path/to/saving.html', 'rb') as f:
    soup = BeautifulSoup(f.read(), 'lxml')
Something like that, anyway.
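If you want this to happen automatically, here is a minimal fetch-once sketch; the cache path and the get_soup name are just placeholders:
import os
import requests
from bs4 import BeautifulSoup

CACHE_PATH = 'path/to/saving.html'  # placeholder path

def get_soup(url):
    # Download the page only if no cached copy exists yet.
    if not os.path.exists(CACHE_PATH):
        page = requests.get(url)
        with open(CACHE_PATH, 'wb') as f:
            f.write(page.content)
    # Always parse from the local copy, so repeat runs never hit the site.
    with open(CACHE_PATH, 'rb') as f:
        return BeautifulSoup(f.read(), 'lxml')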

The following code iterates over url_list and saves all the responses into the list all_pages, which is stored in the responses.pickle file.
import pickle
import requests
from bs4 import BeautifulSoup
all_pages = []
for url in url_list:
    all_pages.append(requests.get(url))
with open("responses.pickle", "wb") as f:
    pickle.dump(all_pages, f)
Then later on, you can load this data, "soupify" each response and do whatever you need with it.
with open("responses.pickle", "rb") as f:
    all_pages = pickle.load(f)
for page in all_pages:
    soup = BeautifulSoup(page.text, 'lxml')
    # do stuff
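If you would rather not pickle whole Response objects, a variant that stores only the HTML text keyed by URL works the same way (url_list as above):
import pickle
import requests

# Keep just the HTML strings; these pickle trivially.
html_by_url = {url: requests.get(url).text for url in url_list}
with open("responses.pickle", "wb") as f:
    pickle.dump(html_by_url, f)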

Working with our request:
url = 'https://scholar.google.com/citations?user=XpmZBggAAAAJ'
page = requests.get(url)
soup = BeautifulSoup(page.text, 'lxml')
you can also use this:
f = open("path/page.html", "w")
f.write(page.prettify())
f.close()
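Note that prettify() re-indents the markup, so the saved file is not byte-for-byte what the server sent. If that matters, write the raw response instead:
with open("path/page.html", "wb") as f:
    f.write(page.content)  # raw bytes exactly as served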

Related

Extracting link from soup python

I'm trying to make an app that gets the source links on Bandcamp, but I'm kind of stuck. Is there a way to get the source link with BeautifulSoup?
The link I'm trying to get: https://vine.bandcamp.com/album/another-light
The data is within the <script> tags in JSON format, so use BeautifulSoup to get the 'script'. The data you are after is in the data-tralbum attribute. Once you get that, have json read it in, then just iterate through the JSON structure:
from bs4 import BeautifulSoup
import requests
import json
url = 'https://vine.bandcamp.com/album/another-light'
response = requests.get(url)
soup = BeautifulSoup(response.text, 'html.parser')
script = str(soup.find_all('script')[4]['data-tralbum'])
jsonData = json.loads(script)
trackinfo = jsonData['trackinfo']
links = []
for each in trackinfo:
    links.append(each['file']['mp3-128'])
Output:
print(links)
['https://t4.bcbits.com/stream/efbba461835eff472bd04a2f9e9910a9/mp3-128/1761020287?p=0&ts=1638288735&t=8ae6343808036ab513cd5436ea009e5d0de784e4&token=1638288735_9139d56ec86f2d44b83a11f3eed8caf7075d6039', 'https://t4.bcbits.com/stream/3e5ef92e6d83e853958ed01955c95f5f/mp3-128/1256475880?p=0&ts=1638288735&t=745a6c701cf1c5772489da5467f6cae5d3622818&token=1638288735_7e86a32c635ba92e0b8320ef56a457d988286cff', 'https://t4.bcbits.com/stream/bbb49d4a72cb80feaf759ec7890abbb6/mp3-128/3439518541?p=0&ts=1638288735&t=dcc7ef7d1d7823e227339fb3243385089478ebe7&token=1638288735_5db36a29c58ea038828d7b34b67e13bd80597dd8', 'https://t4.bcbits.com/stream/8c8a69959337f6f4809f6491c2822b45/mp3-128/1330130896?p=0&ts=1638288735&t=d108dac84dfaac901a546c5fcf5064240cca376b&token=1638288735_8d9151aa82e7a00042025f924660dd3a093c2f74', 'https://t4.bcbits.com/stream/4d4253633405f204d7b1c101379a73be/mp3-128/2478242466?p=0&ts=1638288735&t=a8cd539d0ce8ff417f9b69740070870ed9a182a5&token=1638288735_ad8b5e93c8ffef6623615ce82a6754678fa67b67', 'https://t4.bcbits.com/stream/6c4feee38e289aea76080e9ddc997fa5/mp3-128/2243532902?p=0&ts=1638288735&t=83417c3aba0cef0f969f93bac5165e582f24a588&token=1638288735_c1d9d43b4e10cc6d02c822de90eda3a52c382df2', 'https://t4.bcbits.com/stream/a24dc5dad7b619d47b006e26084ff38f/mp3-128/3054008347?p=0&ts=1638288735&t=4563c326a272c9f5b8462fef1d082e46fac7f605&token=1638288735_55978e7edbe0410ff745913224b8740becad59d5', 'https://t4.bcbits.com/stream/6221790d7f55d3b1f006bd5fac5458fe/mp3-128/1500140939?p=0&ts=1638288735&t=9ecc210c53af05f4034ee00cd1a96a043312a4a7&token=1638288735_0f2faba41da8952f841669513d04bdaaae35a629', 'https://t4.bcbits.com/stream/030506909569626a0d2d7d182b61c691/mp3-128/1707615013?p=0&ts=1638288735&t=c8dcbb2c491789928f5cb6ef8b755df999cb58b8&token=1638288735_b278ba825129ae1b5588b47d5cda345ef2db4e58', 'https://t4.bcbits.com/stream/d1ae0cbc281fc81ddd91f3a3e3d80973/mp3-128/2808772965?p=0&ts=1638288735&t=1080ff51fc40bb5b7afb3a2460f3209cbda549e3&token=1638288735_c93249c847acba5cf23521fa745e05b426a5ba05', 'https://t4.bcbits.com/stream/1b9d50f8210bdc3cf4d2e33986f319ae/mp3-128/2751220220?p=0&ts=1638288735&t=9f24f06dfc5c8a06f24f28664438a6f1a75a038c&token=1638288735_f3a98a20b3c344dc5a37a602a41572d5fe8539c1', 'https://t4.bcbits.com/stream/203cd15629ba03e3249f850d5e1ac42e/mp3-128/4188265472?p=0&ts=1638288735&t=4b4bc2f2194c63a1d3b957e3dd6046bd764c272a&token=1638288735_53a70e7d83ce8c2800baeaf92a5c19db4e146e3f', 'https://t4.bcbits.com/stream/c63b5c9ca090b233e675974c7e7ee4b2/mp3-128/258670123?p=0&ts=1638288735&t=a81ae9dc33dea2b2660d13dbbec93dbcb06e6b63&token=1638288735_446d0ae442cbbadbceb342fe4f7b69d0fbab2928', 'https://t4.bcbits.com/stream/2e824d3c643658c8e9e24b548bc8cb0b/mp3-128/2332945345?p=0&ts=1638288735&t=5bdf0264b9ffe4616d920c55f5081744bf0822d4&token=1638288735_872191bb67a3438ef0fd1ce7e8a9e5ca09e6c37e']
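One note: indexing find_all('script')[4] breaks if Bandcamp reorders its scripts. A sketch that locates the tag by the attribute itself instead, reusing soup and json from the snippet above:
script = soup.find('script', {'data-tralbum': True})  # match on attribute presence
jsonData = json.loads(script['data-tralbum'])
trackinfo = jsonData['trackinfo']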

How do I filter tags with class in Python and BeautifulSoup?

I'm trying to scrape images from a site using beautifulsoup HTML parser.
There are 2 kinds of image tags for each image on the site. One is for the thumbnail and the other is the bigger size image that only appears after I click on the thumbnail and expand. The bigger size tag contains a class="expanded-image" attribute.
I'm trying to parse through the HTML and get the "src" attribute of the expanded image which contains the source for the image.
When I try to execute my code, nothing happens. It just says the process finished without scraping any image. But when I don't try to filter the code and just give tag as an argument, it downloads all the thumbnails.
Here's my code:
import webbrowser, requests, os
from bs4 import BeautifulSoup
def getdata(url):
    r = requests.get(url)
    return r.text
htmldata = getdata('https://boards.4chan.org/a/thread/30814')
soup = BeautifulSoup(htmldata, 'html.parser')
list = []
for i in soup.find_all("img",{"class":"expanded-thumb"}):
    list.append(i['src'].replace("//","https://"))
def download(url, pathname):
    if not os.path.isdir(pathname):
        os.makedirs(pathname)
    filename = os.path.join(pathname, url.split("/")[-1])
    response = requests.get(url, stream=True)
    with open(filename, "wb") as f:
        f.write(response.content)
for a in list:
    download(a,"file")
You might be running into a problem using list as a variable name; it's a built-in type in Python. Start with this (replacing TEST_4CHAN_URL with whatever thread you want):
import requests
from bs4 import BeautifulSoup
TEST_4CHAN_URL = "https://boards.4chan.org/a/thread/<INSERT_THREAD_ID_HERE>"
def getdata(url):
    r = requests.get(url)
    return r.text
htmldata = getdata(TEST_4CHAN_URL)
soup = BeautifulSoup(htmldata, "html.parser")
src_list = []
for i in soup.find_all("a", {"class":"fileThumb"}):
    src_list.append(i['href'].replace("//", "https://"))
print(src_list)
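From there, the download helper from the question can consume src_list unchanged; a minimal sketch, where the "downloads" folder name is arbitrary:
import os

def download(url, pathname):
    # Create the target folder once, then save the file under its URL basename.
    if not os.path.isdir(pathname):
        os.makedirs(pathname)
    filename = os.path.join(pathname, url.split("/")[-1])
    response = requests.get(url)
    with open(filename, "wb") as f:
        f.write(response.content)

for src in src_list:
    download(src, "downloads")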

How to get video src using BeautifulSoup in Python

I am trying to find downloadable video links on a website. For example, I am working with URLs like this one: https://www.loc.gov/item/2015669100/. You can see that there is an m3u8 video link under the mejs__mediaelement div tag.
However my code is not printing anything. Meaning that it's not finding the Video urls for the website.
My code is below
from bs4 import BeautifulSoup
from urllib.request import Request, urlopen
with open('pages2crawl.txt', 'r') as inFile:
    lines = [line.rstrip() for line in inFile]
for page in lines:
    req = Request(page, headers={'User-Agent': 'Mozilla/5.0'})
    soup = BeautifulSoup(urlopen(req).read(), 'html.parser')
    pages = soup.findAll('div', attrs={'class' : 'mejs__mediaelement'})
    for e in pages:
        video = e.find("video").get("src")
        if video.endswith("m3u8"):
            print(video)
If you just want to make a simple script, it would probably be easier to use regex.
import re, requests

url = "https://www.loc.gov/item/2015669100/"  # the example page from the question
s = requests.Session()  # start the session
data = s.get(url)  # HTTP GET request to download the page
data = data.text  # get the raw text
vidlinks = re.findall("src='(.*?).m3u8'/>", data)  # find all between the two parts in the data
print(vidlinks[0] + ".m3u8")  # print the full link with extension
You can use CSS selector source[type="application/x-mpegURL"] to extract MPEG link (or source[type="video/mp4"] to extract mp4 link):
import requests
from bs4 import BeautifulSoup
url = "https://www.loc.gov/item/2015669100/"
soup = BeautifulSoup(requests.get(url).content, "html.parser")
link_mpeg = soup.select_one('source[type="application/x-mpegURL"]')["src"]
link_mp4 = soup.select_one('source[type="video/mp4"]')["src"]
print(link_mpeg)
print(link_mp4)
Prints:
https://tile.loc.gov/streaming-services/iiif/service:afc:afc2010039:afc2010039_crhp0001:afc2010039_crhp0001_mv04/full/full/0/full/default.m3u8
https://tile.loc.gov/storage-services/service/afc/afc2010039/afc2010039_crhp0001/afc2010039_crhp0001_mv04.mp4
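Applied to the pages2crawl.txt loop from the question, the selector approach would look something like this sketch:
import requests
from bs4 import BeautifulSoup

with open('pages2crawl.txt', 'r') as inFile:
    lines = [line.rstrip() for line in inFile]

for page in lines:
    soup = BeautifulSoup(requests.get(page).content, "html.parser")
    source = soup.select_one('source[type="application/x-mpegURL"]')
    if source is not None:  # some pages may have no streaming source
        print(source["src"])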

requests.get(url) in Python behaving differently when used in a loop

I'm new to Python programming and trying to scrape every link available in my Urls.txt file.
The code I wrote is:
import requests
from bs4 import BeautifulSoup
from fake_useragent import UserAgent
user_agent = UserAgent()
fp = open("Urls.txt", "r")
values = fp.readlines()
fin = open("soup.html", "a")
for link in values:
    print(link)
    page = requests.get(link, headers={"user-agent": user_agent.chrome})
    html = page.content
    soup = BeautifulSoup(html, "html.parser")
    fin.write(str(soup))
The code works absolutely fine when the links are provided directly as strings instead of read from the file, but run as-is the output differs.
The string you read from the file most likely has a trailing line break. To remove it, use link.strip("\n") before passing the URL to requests.get().
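Putting it together, a sketch of the corrected loop (same file names as in the question):
import requests
from bs4 import BeautifulSoup
from fake_useragent import UserAgent

user_agent = UserAgent()
with open("Urls.txt", "r") as fp, open("soup.html", "a") as fin:
    for link in fp:
        link = link.strip()  # drop the trailing newline before requesting
        page = requests.get(link, headers={"user-agent": user_agent.chrome})
        soup = BeautifulSoup(page.content, "html.parser")
        fin.write(str(soup))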

Save requests.get() response locally for use in Beautiful Soup

So I'm building a Python script to scrape some data (World Cup scores) from a url using Requests and BeautifulSoup4 and while I'm testing my code I'm making more requests than the website would like, resulting in this error periodically:
requests.exceptions.ConnectionError: Max retries exceeded with url
I don't actually need to keep calling the page; surely I only need to call it once, save the returned data locally, and feed it into Beautiful Soup. Surely I'm not the first to do this; is there another way? This is probably trivial but I'm pretty new to this - thanks.
Here's what I'm working with:
import requests
from bs4 import BeautifulSoup
url = "https://www.telegraph.co.uk/world-cup/2018/06/26/world-cup-2018-fixtures-complete-schedule-match-results-far/"
response = requests.get(url)
html = response.content
soup = BeautifulSoup(html, "html.parser")
Store the HTML in a file once:
response = requests.get(url)
with open('cache.html', 'wb') as f:
    f.write(response.content)
Then, next time, simply load it from the file:
with open('cache.html', 'rb') as f:
    soup = BeautifulSoup(f.read(), 'html.parser')
You can try waiting a second or two when the error appears:
import time
import requests
from bs4 import BeautifulSoup

url = "https://www.telegraph.co.uk/world-cup/2018/06/26/world-cup-2018-fixtures-complete-schedule-match-results-far/"
try:
    response = requests.get(url)
    html = response.content
    soup = BeautifulSoup(html, "html.parser")
except requests.exceptions.ConnectionError:
    print("Connection refused by the server..")
    print("Let me sleep for 2 seconds")
    time.sleep(2)
    print("Continue...")
I couldn't test it, so maybe it will not work like this.
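A retry loop is the more usual shape for this; a sketch, with the three attempts being an arbitrary choice:
import time
import requests
from bs4 import BeautifulSoup

url = "https://www.telegraph.co.uk/world-cup/2018/06/26/world-cup-2018-fixtures-complete-schedule-match-results-far/"
soup = None
for attempt in range(3):
    try:
        response = requests.get(url)
        soup = BeautifulSoup(response.content, "html.parser")
        break  # success, stop retrying
    except requests.exceptions.ConnectionError:
        print("Connection refused, sleeping 2 seconds...")
        time.sleep(2)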
