BeautifulSoup: Get the HTML Code of Modal Footer - python

I'm new to web scraping in Python and I'm trying to scrape all of the .htm document links from an SEC EDGAR full-text search. I can see the link in the modal footer, but BeautifulSoup won't parse the href element containing it.
Is there an easy solution to parse the links of the documents?
import requests
from bs4 import BeautifulSoup

url = 'https://www.sec.gov/edgar/search/#/q=ex10&category=custom&forms=10-K%252C10-Q%252C8-K'
source_code = requests.get(url)
plain_text = source_code.text
soup = BeautifulSoup(plain_text, 'html.parser')
for a in soup.find_all(id="open-file"):
    print(a)

That data is loaded dynamically using JavaScript. There is a lot of information about scraping this kind of page (see one of many examples here); in this case, the following should get you there:
import requests
import json
headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:93.0) Gecko/20100101 Firefox/93.0',
    'Accept': 'application/json, text/javascript, */*; q=0.01',
}
data = '{"q":"ex10","category":"custom","forms":["10-K","10-Q","8-K"],"startdt":"2020-10-08","enddt":"2021-10-08"}'
# obviously, you need to change "startdt" and "enddt" as necessary
response = requests.post('https://efts.sec.gov/LATEST/search-index', headers=headers, data=data)
The response is in json format. Your urls are hidden in there:
data = json.loads(response.text)
hits = data['hits']['hits']
for hit in hits:
    cik = hit['_source']['ciks'][0]
    file_data = hit['_id'].split(":")
    filing = file_data[0].replace('-', '')
    file_name = file_data[1]
    url = f'https://www.sec.gov/Archives/edgar/data/{cik}/{filing}/{file_name}'
    print(url)
Output:
https://www.sec.gov/Archives/edgar/data/0001372183/000158069520000415/ex10-5.htm
https://www.sec.gov/Archives/edgar/data/0001372183/000138713120009670/ex10-5.htm
https://www.sec.gov/Archives/edgar/data/0001540615/000154061520000006/ex10.htm
https://www.sec.gov/Archives/edgar/data/0001552189/000165495421004948/ex10-1.htm
etc.
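If you also want to download each filing, a minimal sketch could look like the following, reusing the same headers (SEC asks that automated clients identify themselves via the User-Agent header) and assuming you collect the printed URLs into a list first:
import requests

# assumes `urls` holds the filing URLs built in the loop above
# and `headers` is the same dict used for the search request
for url in urls:
    resp = requests.get(url, headers=headers)
    resp.raise_for_status()
    file_name = url.rsplit('/', 1)[-1]
    with open(file_name, 'w', encoding='utf-8') as f:
        f.write(resp.text)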

Extract 10K filings url for a company using CIK number python

I am working on a project to find the latest 10-K filing URL for a company using its CIK number. Please find the code below:
import requests
from bs4 import BeautifulSoup
# CIK number for Apple is 0001166559
cik_number = "0001166559"
url = f"https://www.sec.gov/cgi-bin/browse-edgar?action=getcompany&CIK={cik_number}&type=10-K&dateb=&owner=exclude&count=40"
response = requests.get(url)
soup = BeautifulSoup(response.text, 'html.parser')
# Find the link to the latest 10-K filing
link = soup.find('a', {'id': 'documentsbutton'})
filing_url = link['href']
print(filing_url)
I am getting an HTTP 403 error. Please help me.
Thanks
I was able to get a 200 response by reusing your snippet. You may have forgotten to add the headers:
import requests
from bs4 import BeautifulSoup
# CIK number for Apple is 0001166559
cik_number = "0001166559"
url = f'https://www.sec.gov/cgi-bin/browse-edgar?action=getcompany&CIK={cik_number}&type=10-K&dateb=&owner=exclude&count=40'
# add this
headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/56.0.2924.76 Safari/537.36',
    'Upgrade-Insecure-Requests': '1',
    'DNT': '1',
    'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8',
    'Accept-Language': 'en-US,en;q=0.5',
    'Accept-Encoding': 'gzip, deflate',
}
response = requests.get(url, headers=headers)
print(response)
soup = BeautifulSoup(response.text, 'html.parser')
print(soup)
NOTE:
You can read more on why we need to add a User-Agent to our headers here. Basically, what you need to do is make sure the request looks like it's coming from a browser, so just add the extra headers parameter as shown above.
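As a follow-up, the href on the "Documents" button is a relative path, so once the request succeeds you would typically join it with the site root before requesting the filing index; a minimal sketch continuing from the soup above:
link = soup.find('a', {'id': 'documentsbutton'})
if link:
    # the href is relative, e.g. /Archives/edgar/data/...
    filing_url = 'https://www.sec.gov' + link['href']
    print(filing_url)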

Can't scrape listing links from a webpage using the requests module

I'm trying to scrape the listings for the search Oxford, Oxfordshire from this webpage using the requests module. This is how the input box looks before I click the search button.
I've defined an accurate selector to locate the listings, but the script fails to grab any data.
import requests
from pprint import pprint
from bs4 import BeautifulSoup
link = 'https://www.zoopla.co.uk/search/'
headers = {
    'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,image/avif,image/webp,image/apng,*/*;q=0.8,application/signed-exchange;v=b3;q=0.9',
    'Accept-Encoding': 'gzip, deflate, br',
    'Accept-Language': 'en-US,en;q=0.9,bn;q=0.8',
    'Referer': 'https://www.zoopla.co.uk/for-sale/',
    'X-Requested-With': 'XMLHttpRequest',
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/109.0.0.0 Safari/537.36',
}
params = {
    'view_type': 'list',
    'section': 'for-sale',
    'q': 'Oxford, Oxfordshire',
    'geo_autocomplete_identifier': 'oxford',
    'search_source': 'home'
}
res = requests.get(link, params=params, headers=headers)
soup = BeautifulSoup(res.text, "html5lib")
for item in soup.select("[id^='listing'] a[href^='/for-sale/details/']:has(h2[data-testid='listing-title'])"):
    print(item.get("href"))
EDIT:
If I try something like the following, the script seems to work flawlessly. The main problem is that I had to use hardcoded cookies within the headers, which expire within a few minutes.
import json
from pprint import pprint
from bs4 import BeautifulSoup
import cloudscraper
base = 'https://www.zoopla.co.uk{}'
link = 'https://www.zoopla.co.uk/for-sale/'
url = 'https://www.zoopla.co.uk/for-sale/property/oxford/'
headers = {
'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/109.0.0.0 Safari/537.36',
'cookie': 'ajs_anonymous_id=caa7072ed7f64911a51dda2b525a3ca3; zooplapsid=cafe5156dd8f4cdda14e748c9270f623; base_device_id=68f7b6b7-27b8-429e-af66-366a4b64bac4; g_state={"i_p":1675619616576,"i_l":2}; zid=31173482e60549da9ccc1632e52a264c; zooplasid=31173482e60549da9ccc1632e52a264c; base_session_start_page=https://www.zoopla.co.uk/; base_request=https://www.zoopla.co.uk/; base_session_id=2315eaf2-6d59-4075-aeaa-6288af3efef7; base_session_count=8; forced_features={}; forced_experiments={}; active_session=anon; _gid=GA1.3.821027055.1675853830; __cf_bm=6.bEGFdT2vYz3G3iO7swuTFwSfhyzA0DvGoCjB6KvVg-1675853990-0-AQqWHydhL+/hqq8KRqOpCKDNtd6E96qjLgyOF77S8f7DpqCbMFoxAycD8ahQd7FOShSq0oHD//gpDj095eQPdtccDyZ0qu6GvxiSpjNP0+D7sblJP1e3Mlmxw5YroG3O4OuJHgBco3zThrx2SRyVDfx7M1zNlwi/1OVfww/u2wfb5DCW+gGz1b18zEvpNRszYQ==; cookie_consents={"schemaVersion":4,"content":{"brand":1,"consents":[{"apiVersion":1,"stored":false,"date":"Wed, 08 Feb 2023 10:59:02 GMT","categories":[{"id":1,"consentGiven":true},{"id":3,"consentGiven":false},{"id":4,"consentGiven":false}]}]}}; _ga=GA1.3.1980576228.1675275335; _ga_HMGEC3FKSZ=GS1.1.1675853830.7.1.1675853977.0.0.0'
}
params = {
'q': 'Oxford, Oxfordshire',
'search_source': 'home',
'pn': 1
}
scraper = cloudscraper.create_scraper()
res = scraper.get(url,params=params,headers=headers)
print(res.status_code)
soup = BeautifulSoup(res.text,"lxml")
container = soup.select_one("script[id='__NEXT_DATA__']").contents[0]
items = json.loads(container)['props']['pageProps']['initialProps']['regularListingsFormatted']
for item in items:
    print(item['address'], base.format(item['listingUris']['detail']))
How can I get content from that site without using hardcoded cookies within the headers?
The following code example works smoothly without adding the headers and params parameters. The website's data isn't loaded dynamically, meaning you can grab the required data from the static HTML DOM, but the main hindrance is that the site uses Cloudflare protection. To get around that restriction you can use either cloudscraper instead of the requests module, or selenium. Here I use cloudscraper and it works fine.
Script:
import pandas as pd
from bs4 import BeautifulSoup
import cloudscraper
scraper = cloudscraper.create_scraper()
kw = ['Oxford', 'Oxfordshire']
data = []
for k in kw:
    for page in range(1, 3):
        url = f"https://www.zoopla.co.uk/for-sale/property/oxford/?search_source=home&q={k}&pn={page}"
        res = scraper.get(url)  # renamed from `page` to avoid clobbering the loop variable
        # print(res)
        soup = BeautifulSoup(res.content, "html.parser")
        for card in soup.select('[data-testid="regular-listings"] [id^="listing"]'):
            link = "https://www.zoopla.co.uk" + card.a.get("href")
            print(link)
            # data.append({'link': link})
# df = pd.DataFrame(data)
# print(df)
Output:
https://www.zoopla.co.uk/for-sale/details/63903233/?search_identifier=4bf3e1bf22483c835fd89a5f17e16e2d
https://www.zoopla.co.uk/for-sale/details/63898182/?search_identifier=4bf3e1bf22483c835fd89a5f17e16e2d
https://www.zoopla.co.uk/for-sale/details/63898168/?search_identifier=4bf3e1bf22483c835fd89a5f17e16e2d
https://www.zoopla.co.uk/for-sale/details/63898177/?search_identifier=4bf3e1bf22483c835fd89a5f17e16e2d
https://www.zoopla.co.uk/for-sale/details/63897930/?search_identifier=4bf3e1bf22483c835fd89a5f17e16e2d
https://www.zoopla.co.uk/for-sale/details/63897571/?search_identifier=4bf3e1bf22483c835fd89a5f17e16e2d
https://www.zoopla.co.uk/for-sale/details/63896910/?search_identifier=4bf3e1bf22483c835fd89a5f17e16e2d
https://www.zoopla.co.uk/for-sale/details/63896858/?search_identifier=4bf3e1bf22483c835fd89a5f17e16e2d
https://www.zoopla.co.uk/for-sale/details/63896815/?search_identifier=4bf3e1bf22483c835fd89a5f17e16e2d
https://www.zoopla.co.uk/for-sale/details/63893187/?search_identifier=4bf3e1bf22483c835fd89a5f17e16e2d
https://www.zoopla.co.uk/for-sale/details/47501048/?search_identifier=4bf3e1bf22483c835fd89a5f17e16e2d
https://www.zoopla.co.uk/for-sale/details/63891727/?search_identifier=4bf3e1bf22483c835fd89a5f17e16e2d
https://www.zoopla.co.uk/for-sale/details/63890876/?search_identifier=4bf3e1bf22483c835fd89a5f17e16e2d
https://www.zoopla.co.uk/for-sale/details/63889459/?search_identifier=4bf3e1bf22483c835fd89a5f17e16e2d
https://www.zoopla.co.uk/for-sale/details/63888298/?search_identifier=4bf3e1bf22483c835fd89a5f17e16e2d
https://www.zoopla.co.uk/for-sale/details/63887586/?search_identifier=4bf3e1bf22483c835fd89a5f17e16e2d
https://www.zoopla.co.uk/for-sale/details/63887525/?search_identifier=4bf3e1bf22483c835fd89a5f17e16e2d
https://www.zoopla.co.uk/for-sale/details/59469692/?search_identifier=4bf3e1bf22483c835fd89a5f17e16e2d
https://www.zoopla.co.uk/for-sale/details/63882084/?search_identifier=4bf3e1bf22483c835fd89a5f17e16e2d
https://www.zoopla.co.uk/for-sale/details/63878480/?search_identifier=cbe92a4f0868061e26dff87f97442c6a
https://www.zoopla.co.uk/for-sale/details/63877980/?search_identifier=cbe92a4f0868061e26dff87f97442c6a
... so on
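If you prefer the __NEXT_DATA__ approach from the question's edit, it also appears to work with cloudscraper alone, i.e. without the hardcoded cookie header; a minimal sketch combining the two (same JSON path as in the question):
import json
from bs4 import BeautifulSoup
import cloudscraper

base = 'https://www.zoopla.co.uk{}'
url = 'https://www.zoopla.co.uk/for-sale/property/oxford/'
params = {'q': 'Oxford, Oxfordshire', 'search_source': 'home', 'pn': 1}

scraper = cloudscraper.create_scraper()
res = scraper.get(url, params=params)
soup = BeautifulSoup(res.text, 'lxml')

# the page embeds its listing data as JSON inside the __NEXT_DATA__ script tag
container = soup.select_one("script[id='__NEXT_DATA__']").contents[0]
items = json.loads(container)['props']['pageProps']['initialProps']['regularListingsFormatted']
for item in items:
    print(item['address'], base.format(item['listingUris']['detail']))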
You could just set a browser-like User-Agent and read the contents with a simple urllib request:
import urllib.request

# URL for 'Oxford, Oxfordshire'
url = 'https://www.zoopla.co.uk/for-sale/property/oxford/?q=Oxford%2C%20Oxfordshire&search_source=home'
result = urllib.request.Request(url, headers={'User-Agent': 'Mozilla/5.0'})
webpage = urllib.request.urlopen(result).read()
print(webpage)
This also works just fine. The only thing is that you will have to write a couple of lines of code to extract exactly what you want from each listing yourself, or make the class field dynamic if necessary.
import urllib.request
from bs4 import BeautifulSoup
url = 'https://www.zoopla.co.uk/for-sale/property/oxford/?q=Oxford%2C%20Oxfordshire&search_source=home'
result = urllib.request.Request(url, headers={'User-Agent': 'Mozilla/5.0'})
webpage = urllib.request.urlopen(result).read()
soup = BeautifulSoup(webpage, "html.parser")
webpage_listings = soup.find_all("div", class_="f0xnzq0")
if webpage_listings:
    for item in webpage_listings:
        print(item)
else:
    print("Empty list")

HTMLs not found by BeautifulSoup

I'm trying to write a program that downloads the most upvoted picture from a subreddit, but for some reason BeautifulSoup does not find all the links on the website. I know I could try other methods, but I'm curious why it isn't finding all the links every time.
Here is the code as well.
from PIL import Image
import requests
from bs4 import BeautifulSoup
url = 'https://www.reddit.com/r/wallpaper/top/'
result = requests.get(url)
soup = BeautifulSoup(result.text,'html.parser')
for link in soup.find_all('a'):
    print(link.get('href'))
The site is loaded with JavaScript, and bs4 will not be able to render JavaScript; therefore, I've located the data within a script tag.
import requests
import re
import json
headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:74.0) Gecko/20100101 Firefox/74.0'
}

def main(url):
    r = requests.get(url, headers=headers)
    match = re.search(r"window.___r = ({.+})", r.text).group(1)
    data = json.loads(match)
    # print(data.keys())
    # humanreadable = json.dumps(data, indent=4)

main("https://www.reddit.com/r/wallpaper/top/")
Shorter version:
match = re.finditer(r'permalink":"(.+?)"', r.text)
for item in match:
    print(item.group(1))
Output:
https://www.reddit.com/r/wallpaper/comments/fv9ubr/khyber_pakhtunkhwa_pakistan_balakot_1920x1024/
https://www.reddit.com/user/wsopgame/comments/fvbxom/join_the_official_wsop_online_poker_game_and/
https://www.reddit.com/user/wsopgame/comments/fvbxom/join_the_official_wsop_online_poker_game_and/?instanceId=t3_p%3DgAAAAABeiiTtw4FM0zBerf9DDiq5tmonjJbAwzQb_UwA-VHlw2J8zUxw-y6Doa6j-jPP0qt05lRZfyReQwnLH9pN6wdSBBvqhgxgRS3uKyKCRvkk6WNwns5wpad0ijMgHwqVnZSGMT0KWP4WB15zBNkb3j96ifm23pT4uACb6cpNVh-TE05GiTtDnD9UUMir02Z7hOr0x4f_wLJEIplafXRp2yiAFPh5VzH_4VSsPx9zV7v3IJwN5ctYLfIcdCW5Z3W-z3bbOVUCU2HqqRAoh0XEj0LrgdicMexa9fzPbtWOshfx3kIazwFhYXoSowPBRZUquSs9zEaQwP1B-wg951edNb7RSjYTrDpQ75zsMfIkasKvAOH-V58%3D
https://www.reddit.com/r/wallpaper/comments/fv6wew/lone_road_in_nowhere_arizona_1920x1080/
https://www.reddit.com/r/wallpaper/comments/fvaqaa/the_hobbit_house_1920_x_1080/
https://www.reddit.com/r/wallpaper/comments/fvcs4j/something_i_made_in_illustrator_5120_2880/
https://www.reddit.com/r/wallpaper/comments/fv09u2/bath_time_in_rocky_mountain_national_park_1280x720/
https://www.reddit.com/r/wallpaper/comments/fuyomz/up_is_still_my_favorite_film_grandpa_carl_cams/
https://www.reddit.com/r/wallpaper/comments/fvagex/beautiful_and_colorful_nature_wallpaper_1920x1080/
https://www.reddit.com/r/wallpaper/comments/fv3nnn/maroon_bells_co_photo_credit_to/
https://www.reddit.com/r/wallpaper/comments/fuyg0z/volcano_lightening_19201080/
https://www.reddit.com/r/wallpaper/comments/fvgohk/doctor_strange1920x1080/
https://www.reddit.com/user/redditads/comments/ezogdp/reach_your_audience_on_reddit/
https://www.reddit.com/user/redditads/comments/ezogdp/reach_your_audience_on_reddit/?instanceId=t3_p%3DgAAAAABeiiTt9isPY03zwoimtzcC7w3uLzUDCuoD5cU6ekeEYt48cRAqoMsc1ZDBJ6OeK1U3Bs2Zo1ZSWzdQ4DOux21vGvWzJkxNWQ14XzDWag_GlrE-t_4rpFA_73kW94xGUQchsXL7f4VkbbHIyn8SMlUlTtt3j3lJCViwINOQgIF3p5N8Q4ri-swtJC-JyEUYa4dJazlZ9xLYyOHSvMkiR3k9lDx0NEKqpqfbQ9__f3xLUzgS4yF4OngMDFUVFa5nyH3I32mkP3KezXLxOR6H8CSGI_jqRA4dBV-AnHLuzPlgENRpfaMhWJ04vTEOjmG4sm4xs65OZCumqNstzlDEvR7ryFwL6LeH02a9E3czck5jfKY7HXQ%3D
https://www.reddit.com/r/wallpaper/comments/fuzjza/ghost_cloud_1280x720/
https://www.reddit.com/r/wallpaper/comments/fvg88o/park_autumn_tress_wallpaper_1920x1080/
https://www.reddit.com/r/wallpaper/comments/fv47r8/audi_quattro_s1_3840x2160_fh4/
https://www.reddit.com/r/wallpaper/comments/fuybjs/spacecrafts_1920_x_1080/
https://www.reddit.com/r/wallpaper/comments/fv043i/dragonfly_1280x720/
https://www.reddit.com/r/wallpaper/comments/fv06ud/muskrat_swim_1280x720/
https://www.reddit.com/r/wallpaper/comments/fvdafk/natural_beauty_1920x1080/
https://www.reddit.com/r/wallpaper/comments/fvbnuc/cigar_man_19201080/
https://www.reddit.com/r/wallpaper/comments/fvcww4/thunder_road_3840_x_2160/
https://www.reddit.com/user/redditads/comments/7w17su/interested_in_gaining_a_new_perspective_on_things/
https://www.reddit.com/user/redditads/comments/7w17su/interested_in_gaining_a_new_perspective_on_things/?instanceId=t3_p%3DgAAAAABeiiTtxVzGp9KwvtRNa1pOVCgz2IBkTGRxqdyXk4WTsjAkWS9wzyDVF_1aSOz36HqHOVrngfj3z_9O1cAkzz-0fwhxyJ_8jePT3F88mrveLChf_YRIbAtxb-Ln_OaeeXUnyrFVl-OPN7cqXvtgh3LoymBx3doL-bEVnECOWkcSXvUIwpMn-flVZ5uNcGL1nKEiszUcORqq1oQ32BnrmWHomrDb3Q%3D%3D
https://www.reddit.com/r/wallpaper/comments/fv3xqs/social_distancing_log_1920x1080/
https://www.reddit.com/r/wallpaper/comments/fvbcpl/neon_city_wallpaper_19201080/
https://www.reddit.com/r/wallpaper/comments/fvbhdb/sunrise_wallpaper_19201080/
https://www.reddit.com/r/wallpaper/comments/fv2eno/second_heavy_bike_in_ghost_recon_breakpoint/
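For the original goal (the most upvoted picture), another option worth knowing is Reddit's JSON endpoint: appending .json to a listing URL returns the post data directly, so no HTML or regex parsing is needed. A minimal sketch, assuming the top post's url field points at the image file:
import requests

headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:74.0) Gecko/20100101 Firefox/74.0'
}
r = requests.get('https://www.reddit.com/r/wallpaper/top/.json?limit=25', headers=headers)
posts = r.json()['data']['children']
# pick the post with the highest score
top = max(posts, key=lambda p: p['data']['ups'])
print(top['data']['title'], top['data']['url'])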

Can't parse a Google search result page using BeautifulSoup

I'm parsing webpages using BeautifulSoup from bs4 in Python. When I inspected the elements of a Google search page, the first division had class = 'r', so I wrote this code:
import requests
site = requests.get('<url>')
from bs4 import BeautifulSoup
page = BeautifulSoup(site.content, 'html.parser')
results = page.find_all('div', class_="r")
print(results)
But the command prompt returned just []
What could've gone wrong and how to correct it?
EDIT 1: I edited my code accordingly by adding the dictionary for headers, yet the result is the same [].
Here's the new code:
import requests
headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 6.1; Win64; x64; rv:47.0) Gecko/20100101 Firefox/47.0'
}
site = requests.get('<url>', headers = headers)
from bs4 import BeautifulSoup
page = BeautifulSoup(site.content, 'html.parser')
results = page.find_all('div', class_="r")
print(results)
NOTE: When I tell it to print the entire page there's no problem, and when I take list(page.children), it works fine.
Some websites require a User-Agent header to be set to block fake requests from non-browsers. Fortunately, there's a way to pass headers to the request, like so:
# Define a dictionary of http request headers
headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 6.1; Win64; x64; rv:47.0) Gecko/20100101 Firefox/47.0'
}
# Pass in the headers as a parameterized argument
requests.get(url, headers=headers)
Note: List of user agents can be found here
>>> give_me_everything = soup.find_all('div', class_='yuRUbf')
Prints a bunch of stuff.
>>> give_me_everything_v2 = soup.select('.yuRUbf')
Prints a bunch of stuff.
Note that you can't do something like this:
>>> give_me_everything = soup.find_all('div', class_='yuRUbf').text
AttributeError: You're probably treating a list of elements like a single element.
>>> for all in soup.find_all('div', class_='yuRUbf'):
...     print(all.text)
Prints a bunch of stuff.
Code:
from bs4 import BeautifulSoup
import requests
headers = {
    'User-agent':
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) "
    "Chrome/70.0.3538.102 Safari/537.36 Edge/18.19582"
}
html = requests.get('https://www.google.com/search?q="narendra modi" "scams" "frauds" "corruption" "modi" -lalit -nirav', headers=headers)
soup = BeautifulSoup(html.text, 'html.parser')
give_me_everything = soup.find_all('div', class_='yuRUbf')
print(give_me_everything)
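To extract the title and link from each result rather than printing the whole ResultSet, a minimal sketch (assuming each yuRUbf block wraps an <a> containing an <h3> title, which is how the results were structured at the time of writing):
for result in soup.find_all('div', class_='yuRUbf'):
    title = result.find('h3').text
    link = result.a['href']
    print(f'{title}\n{link}\n')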
Alternatively, you can do the same thing using the Google Search Engine Results API from SerpApi. It's a paid API with a free trial of 5,000 searches.
The main difference is that you don't have to come up with a different solution when something isn't working, and thus don't have to maintain the parser.
Code to integrate:
from serpapi import GoogleSearch
params = {
    "api_key": "YOUR_API_KEY",
    "engine": "google",
    "q": '"narendra modi" "scams" "frauds" "corruption" "modi" -lalit -nirav',
}
search = GoogleSearch(params)
results = search.get_dict()
for result in results['organic_results']:
    title = result['title']
    link = result['link']
    displayed_link = result['displayed_link']
    print(f'{title}\n{link}\n{displayed_link}\n')
----------
Opposition Corners Modi Govt On Jay Shah Issue, Rafael ...
https://www.outlookindia.com/website/story/no-confidence-vote-opposition-corners-modi-govt-on-jay-shah-issue-rafael-deals-c/313790
https://www.outlookindia.com
Modi, Rahul and Kejriwal describe one another as frauds ...
https://www.business-standard.com/article/politics/modi-rahul-and-kejriwal-describe-one-another-as-frauds-114022400019_1.html
https://www.business-standard.com
...
Disclaimer: I work for SerpApi.

Python Screen Scraping Forbes.com

I'm writing a Python program to extract and store metadata from interesting online tech articles: "og:title", "og:description", "og:image", "og:url", and "og:site_name".
This is the code I'm using...
import urllib3
from bs4 import BeautifulSoup

# Setup Headers
headers = {}
headers['Accept'] = "text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8"
headers['Accept-Charset'] = 'ISO-8859-1,utf-8;q=0.7,*;q=0.3'
headers['Accept-Encoding'] = 'none'
headers['Accept-Language'] = "en-US,en;q=0.8"
headers['Connection'] = 'keep-alive'
headers['User-Agent'] = "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/41.0.2227.0 Safari/537.36"
# Create the Request
http = urllib3.PoolManager()
# Create the Response
response = http.request('GET', url, headers=headers)
# BeautifulSoup - Construct
soup = BeautifulSoup(response.data, 'html.parser')
# Scrape <meta property="og:title" content=" x x x ">
title = ""
for tag in soup.find_all("meta"):
    if tag.get("property", None) == "og:title":
        if len(tag.get("content", "")) > len(title):
            title = tag.get("content", "")
The program runs fine on all but one site. On "forbes.com", I can't get to the articles using Python:
url = https://www.forbes.com/consent/?toURL=https://www.forbes.com/sites/shermanlee/2018/07/31/privacy-revolution-how-blockchain-is-reshaping-our-economy/#72c3b4e21086
I can't bypass this consent page, which seems to be the "Cookie Consent Manager" solution from TrustArc. In a browser, you basically provide your consent once... and on each subsequent visit, you're able to access the articles.
If I reference the "toURL" url:
https://www.forbes.com/sites/shermanlee/2018/07/31/privacy-revolution-how-blockchain-is-reshaping-our-economy/#72c3b4e21086
to bypass the "https://www.forbes.com/consent/" page, I'm redirected back to it.
I've tried to see if there is a cookie I could set in the header, but couldn't find the magic key.
Can anyone help me?
There is a required cookie notice_gdpr_prefs that needs to be sent to view the data:
import requests
from bs4 import BeautifulSoup
src = requests.get(
"https://www.forbes.com/sites/shermanlee/2018/07/31/privacy-revolution-how-blockchain-is-reshaping-our-economy/",
headers= {
"cookie": "notice_gdpr_prefs"
})
soup = BeautifulSoup(src.content, 'html.parser')
title = soup.find("meta", property="og:title")
print(title["content"])
