I would like to create the exact same table as the one shown in the following webpage: https://247sports.com/college/penn-state/Season/2022-Football/Commits/
I am currently using Selenium and Beautiful Soup in a Google Colab notebook, because I get forbidden errors when calling read_html directly. I have just started to get some output, but I only want to grab the text, not the surrounding markup.
Here is my code so far...
from kora.selenium import wd
from bs4 import BeautifulSoup
import pandas as pd
import time
import datetime as dt
import os
import re
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
url = 'https://247sports.com/college/penn-state/Season/2022-Football/Commits/'
wd.get(url)
time.sleep(5)
soup = BeautifulSoup(wd.page_source)
school=soup.find_all('span', class_='meta')
name=soup.find_all('div', class_='recruit')
position = soup.find_all('div', class_="position")
height_weight = soup.find_all('div', class_="metrics")
rating = soup.find_all('span', class_='score')
nat_rank = soup.find_all('a', class_='natrank')
state_rank = soup.find_all('a', class_='sttrank')
pos_rank = soup.find_all('a', class_='posrank')
status = soup.find_all('p', class_='commit-date withDate')
status
...and here is my output...
[<p class="commit-date withDate"> Commit 7/25/2020 </p>,
<p class="commit-date withDate"> Commit 9/4/2020 </p>,
<p class="commit-date withDate"> Commit 1/1/2021 </p>,
<p class="commit-date withDate"> Commit 3/8/2021 </p>,
<p class="commit-date withDate"> Commit 10/29/2020 </p>,
<p class="commit-date withDate"> Commit 7/28/2020 </p>,
<p class="commit-date withDate"> Commit 9/8/2020 </p>,
<p class="commit-date withDate"> Commit 8/3/2020 </p>,
<p class="commit-date withDate"> Commit 5/1/2021 </p>]
Any assistance on this is greatly appreciated.
There's no need to use Selenium. To get a response from the website you need to specify the HTTP User-Agent header; otherwise the website thinks you're a bot and will block you.
To create a DataFrame see this sample:
import pandas as pd
import requests
from bs4 import BeautifulSoup
url = "https://247sports.com/college/penn-state/Season/2022-Football/Commits/"
# Add the `user-agent` otherwise we will get blocked when sending the request
headers = {
"user-agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/90.0.4430.93 Safari/537.36"
}
response = requests.get(url, headers=headers).content
soup = BeautifulSoup(response, "html.parser")
data = []
for tag in soup.find_all("li", class_="ri-page__list-item")[1:]: # `[1:]` Since the first result is a table header
school = tag.find_next("span", class_="meta").text
name = tag.find_next("a", class_="ri-page__name-link").text
position = tag.find_next("div", class_="position").text
height_weight = tag.find_next("div", class_="metrics").text
rating = tag.find_next("span", class_="score").text
nat_rank = tag.find_next("a", class_="natrank").text
state_rank = tag.find_next("a", class_="sttrank").text
pos_rank = tag.find_next("a", class_="posrank").text
status = tag.find_next("p", class_="commit-date withDate").text
data.append(
{
"school": school,
"name": name,
"position": position,
"height_weight": height_weight,
"rating": rating,
"nat_rank": nat_rank,
"state_rank": state_rank,
"pos_rank": pos_rank,
"status": status,
}
)
df = pd.DataFrame(data)
print(df.to_string())
Output:
school name position height_weight rating nat_rank state_rank pos_rank status
0 Westerville South (Westerville, OH) Kaden Saunders WR 5-10 / 172 0.9509 116 5 16 Commit 7/25/2020
1 IMG Academy (Bradenton, FL) Drew Shelton OT 6-5 / 290 0.9468 130 17 14 Commit 9/4/2020
2 Central Dauphin East (Harrisburg, PA) Mehki Flowers WR 6-1 / 190 0.9461 131 4 18 Commit 1/1/2021
3 Medina (Medina, OH) Drew Allar PRO 6-5 / 220 0.9435 138 6 8 Commit 3/8/2021
4 Manheim Township (Lancaster, PA) Anthony Ivey WR 6-0 / 190 0.9249 190 6 26 Commit 10/29/2020
5 King (Milwaukee, WI) Jerry Cross TE 6-6 / 218 0.9153 218 4 8 Commit 7/28/2020
6 Northeast (Philadelphia, PA) Ken Talley WDE 6-3 / 230 0.9069 253 9 13 Commit 9/8/2020
7 Central York (York, PA) Beau Pribula DUAL 6-2 / 215 0.8891 370 12 9 Commit 8/3/2020
8 The Williston Northampton School (Easthampton, MA) Maleek McNeil OT 6-8 / 340 0.8593 705 8 64 Commit 5/1/2021
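If you also want to keep the finished table as a file, a small optional follow-up (the filename is just an example):
# Save the scraped table so it can be reused outside the notebook
df.to_csv("psu_2022_commits.csv", index=False)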
I am trying to collect a list of hrefs from the Netflix careers site: https://jobs.netflix.com/search. Each job listing on this site has an anchor and a class: <a class=css-2y5mtm essqqm81>. To be thorough here, the entire anchor is:
<a class="css-2y5mtm essqqm81" role="link" href="/jobs/244837014" aria-label="Manager, Written Communications"\>\
<span tabindex="-1" class="css-1vbg17 essqqm80"\>\<h4 class="css-hl3xbb e1rpdjew0"\>Manager, Written Communications\</h4\>\</span\>\</a\>
Again, the information of interest here is the hrefs of the form href="/jobs/244837014". However, when I perform the standard BS commands to read the HTML:
html_page = urllib.request.urlopen("https://jobs.netflix.com/search")
soup = BeautifulSoup(html_page)
I don't see any of the hrefs that I'm interested in inside of soup.
Running the following loop does not show the hrefs of interest:
for link in soup.findAll('a'):
print(link.get('href'))
What am I doing wrong?
That information is fed into the page dynamically via XHR calls. You need to scrape the API endpoint to get the job info. The following code will give you a DataFrame with all jobs currently listed by Netflix:
import requests
from bs4 import BeautifulSoup as bs
import pandas as pd
from tqdm import tqdm ## if Jupyter: from tqdm.notebook import tqdm
pd.set_option('display.max_columns', None)
pd.set_option('display.max_colwidth', None)
headers = {
'referer': 'https://jobs.netflix.com/search',
'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/105.0.0.0 Safari/537.36'
}
big_df = pd.DataFrame()
s = requests.Session()
s.headers.update(headers)
for x in tqdm(range(1, 20)):
url = f'https://jobs.netflix.com/api/search?page={x}'
r = s.get(url)
df = pd.json_normalize(r.json()['records']['postings'])
big_df = pd.concat([big_df, df], axis=0, ignore_index=True)
print(big_df[['text', 'team', 'external_id', 'updated_at', 'created_at', 'location', 'organization' ]])
Result:
100%
19/19 [00:29<00:00, 1.42s/it]
text team external_id updated_at created_at location organization
0 Events Manager - SEA [Publicity] 244936062 2022-11-23T07:20:16+00:00 2022-11-23T04:47:29Z Bangkok, Thailand [Marketing and PR]
1 Manager, Written Communications [Publicity] 244837014 2022-11-23T07:20:16+00:00 2022-11-22T17:30:06Z Los Angeles, California [Marketing and Publicity]
2 Manager, Creative Marketing - Korea [Marketing] 244740829 2022-11-23T07:20:16+00:00 2022-11-22T07:39:56Z Seoul, South Korea [Marketing and PR]
3 Administrative Assistant - Philippines [Netflix Technology Services] 244683946 2022-11-23T07:20:16+00:00 2022-11-22T01:26:08Z Manila, Philippines [Corporate Functions]
4 Associate, Studio FP&A - APAC [Finance] 244680097 2022-11-23T07:20:16+00:00 2022-11-22T01:01:17Z Seoul, South Korea [Corporate Functions]
... ... ... ... ... ... ... ...
365 Software Engineer (L4/L5) - Content Engineering [Core Engineering, Studio Technologies] 77239837 2022-11-23T07:20:31+00:00 2021-04-22T07:46:29Z Mexico City, Mexico [Product]
366 Distributed Systems Engineer (L5) - Data Platform [Data Platform] 201740355 2022-11-23T07:20:31+00:00 2021-03-12T22:18:57Z Remote, United States [Product]
367 Senior Research Scientist, Computer Graphics / Computer Vision / Machine Learning [Data Science and Engineering] 227665988 2022-11-23T07:20:31+00:00 2021-02-04T18:54:10Z Los Gatos, California [Product]
368 Counsel, Content - Japan [Legal and Public Policy] 228338138 2022-11-23T07:20:31+00:00 2020-11-12T03:08:04Z Tokyo, Japan [Corporate Functions]
369 Associate, FP&A [Financial Planning and Analysis] 46317422 2022-11-23T07:20:31+00:00 2017-12-26T19:38:32Z Los Angeles, California [Corporate Functions]
370 rows × 7 columns
For each job, the url would be https://jobs.netflix.com/jobs/{external_id}
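If you also want that URL as a column, a small sketch (assuming external_id comes through as a plain string or number, as in the output above):
# Build the full job URL from external_id for every row
big_df['url'] = 'https://jobs.netflix.com/jobs/' + big_df['external_id'].astype(str)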
Hi everyone, I would like to scrape N.01, but to no avail. I'm a self-learning newbie, so any help solving this problem would be appreciated :)
This is the HTML part that I am interested in:
<div class="justify-between font-semibold">
<div id="14001" class="tab-dun-item grid grid-cols-2 justify-between hover:bg-gray-300 font-semibold rounded-lg p-2 mx-2 mt-2 cursor-pointer bg-gray-300" onclick="dunSelected(this.id)">
<span id="kod-dun" class="">N.01</span>
<span id="nama-dun" class="">BULOH KASAP</span>
</div>
</div>
This is my code:
from bs4 import BeautifulSoup
import requests
html_text = requests.get('https://mysprsemak.spr.gov.my/semakan/keputusan/pru-dun').text
soup = BeautifulSoup(html_text,'lxml')
dun = soup.find('div', class_='justify-between font-semibold')
a = dun.find('span', attrs={'id': 'kod-dun'})
b = dun.find('span', attrs={'class': ''})
print(a)
print(b)
The result is this:
<span id="kod-dun"><!--Kod DUN--></span>
<span id="kod-dun"><!--Kod DUN--></span>
As stated, the best way is the API. And as stated, it's made a little more complicated in that you need to provide a token and a few other parameters. The good news is, you can grab that info from the site.
Note: you'll need to pip install choice to implement the user-input option I provide:
import requests
from bs4 import BeautifulSoup
import re
import json
import choice
import pandas as pd
s = requests.Session()
url = 'https://mysprsemak.spr.gov.my/semakan/keputusan/pru-dun'
headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/96.0.4664.110 Safari/537.36'}
response = s.get(url, headers=headers).text
cookies = s.cookies.get_dict()
cookieStr = ''
for k, v in cookies.items():
cookieStr += f'{k}={v};'
headers.update({'Cookie':cookieStr})
soup = BeautifulSoup(response, 'html.parser')
options = soup.find_all('option')[1:]
optionsDict = {each.text: [each['value'],each['data-negeri']] for each in options}
getInput = choice.Menu(list(optionsDict.keys())).ask()
jsonStr = re.search('var formData = ({.*});', response, flags=re.DOTALL).group(1).split(',')[0] +'}'
token = json.loads(jsonStr)['_token']
postURL = 'https://mysprsemak.spr.gov.my/semakan/keputusan/senaraiDunPruDun'
payload = {
'_token':token,
'PilihanrayaId':optionsDict[getInput][0],
'RefNegeriId':optionsDict[getInput][1]}
jsonData = s.post(postURL, headers=headers, data=payload).json()
df = pd.json_normalize(jsonData['dun'])
Output:
Make a choice:
0: PRU DEWAN NEGERI JOHOR KE-15
1: PRU DUN SARAWAK KE-12 (2021)
2: PRU DUN MELAKA KE-15 (2021)
3: PRU DUN SABAH KALI KE-16 (2020)
4: PRU DUN SARAWAK KE-11 (2016)
Enter number or name; return for next page
? 0
Dataframe:
print(df)
BahagianPilihanRayaId ... AppendKod
0 956EC2DB-2952-414E-835D-831611009C27 ... 14001
1 956EC2DB-EAD5-4656-B9A5-3DB0AA626BC1 ... 14002
2 956EC2DC-18ED-4711-BCD1-4D2CFB473EE1 ... 14103
3 956EC2DB-5E73-47FB-9F23-4A7F6EE26EDC ... 14104
4 956EC2DB-DEC7-45F9-9F60-9B398D678D2F ... 14205
5 956EC2DC-5776-4062-8E43-E8F18454DAC1 ... 14206
6 956EC2DC-B0FD-45D3-90A2-F1F8C4CA31B2 ... 14307
7 956EC2DC-D419-4801-A11D-65C04F7CF7FC ... 14308
8 956EC2DB-F683-4300-B9A0-7679C676DCCF ... 14409
9 956EC2DA-827A-42DF-8D34-3842D49113FE ... 14410
10 956EC2DB-CE81-4EB4-A6A4-D003EF5F845F ... 14411
11 956EC2DC-0DF7-45A9-B57D-28F252EB4349 ... 14512
12 956EC2DA-9C70-4DB1-A67F-BB5D0DC7AB49 ... 14513
13 956EC2DC-01E6-4D8E-A9A0-930EE420CE04 ... 14514
14 956EC2DC-630F-4EBF-9C1B-8F93FFAEDA96 ... 14615
15 956EC2DA-D04A-4E9E-BBA6-F5487168BB4C ... 14616
16 956EC2DB-6B1F-4EFC-9D19-16A84738E680 ... 14717
17 956EC2DB-35E7-4C14-AA60-F0B79B6E7CB2 ... 14718
18 956EC2DB-927C-4705-A347-F0D81D5D3BE4 ... 14819
19 956EC2DA-EB67-4359-9FC1-EF749DD6E296 ... 14820
20 956EC2DC-A4FC-42F8-9FB8-7669D120EC5F ... 14921
21 956EC2DA-C381-47C8-9014-25EFCD8B064C ... 14922
22 956EC2DC-EC01-4D68-8171-8BE026CCBF83 ... 15023
23 956EC2DC-6E20-4EB1-9C47-88D9B72E5889 ... 15024
24 956EC2DB-523C-4998-AC02-33F602E07305 ... 15025
25 956EC2DC-C7DC-47EC-BDD8-73A1440788BE ... 15126
26 956EC2DB-84E2-4BB6-97FD-C996983A76B1 ... 15127
27 956EC2DA-59D1-420C-8155-80F2156FB446 ... 15228
28 956EC2DA-70E6-40D8-A3A3-E8E4FBC45E40 ... 15229
29 956EC2DC-4799-4624-9C07-18D1DAC6B0F7 ... 15330
30 956EC2DD-20B8-4F28-90B9-1519628EF751 ... 15331
31 956EC2DA-4BFD-4909-8D93-8A2207FD247B ... 15432
32 956EC2DA-B4A3-49BD-8CC0-BDCE81B7071E ... 15433
33 956EC2DA-A8EE-4244-9052-9D14ECDD1A8C ... 15534
34 956EC2DC-3B33-4627-BFB5-DAF87988909A ... 15535
35 956EC2DC-3009-44FA-874D-C575599210DC ... 15636
36 956EC2DC-BC61-4202-962A-66415635FDDE ... 15637
37 956EC2DC-24AE-41BC-AD3E-7D48D8E752D3 ... 15738
38 956EC2DB-02E6-439A-9E97-B18F65EB9432 ... 15739
39 956EC2DC-928E-4837-A049-18BFD698A8B2 ... 15840
40 956EC2DB-782A-4E8E-ADE3-07A9AC786F9E ... 15841
41 956EC2DC-8620-4CD0-8225-60DB217E5EB1 ... 15942
42 956EC2DA-F787-4965-A099-E29794CBA951 ... 15943
43 956EC2DB-BF64-4B83-9240-49701039EA2A ... 16044
44 956EC2DC-7A25-4870-9276-598C7041CD27 ... 16045
45 956EC2DB-1238-45A6-A935-469825659736 ... 16146
46 956EC2DD-06B6-42E1-9BB8-FC300EF40688 ... 16147
47 956EC2DC-DFD2-45A8-8B15-5C831DD0E14C ... 16248
48 956EC2DA-DE6C-4A14-93CA-BB757A5085F7 ... 16249
49 956EC2DA-901D-465F-B795-7CF2D807E394 ... 16350
50 956EC2DB-42DB-4949-B4DE-025EAAB779B9 ... 16351
51 956EC2DB-A09B-4A7A-A122-15208F9D0155 ... 16352
52 956EC2DA-6599-4283-8DC2-1496100717E3 ... 16453
53 956EC2DB-1DD6-4659-AF97-D448477A546C ... 16454
54 956EC2DB-AF15-4069-86A4-035F2E3A7ED4 ... 16555
55 956EC2DC-F814-4B4C-A65F-79B75AD25C62 ... 16556
[56 rows x 4 columns]
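If all you need are the N.01-style codes and names rather than the whole DataFrame, here is a minimal sketch using the same jsonData response; the field names KodBahagianPilihanRaya and NamaBahagianPilihanRaya are assumptions taken from the site's own JavaScript, not guaranteed:
# Print "N.<code> <name>" for each constituency in the API response
# (field names assumed from the page's JavaScript)
for rec in jsonData['dun']:
    print(f"N.{rec['KodBahagianPilihanRaya']} {rec['NamaBahagianPilihanRaya']}")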
You cannot do this directly with Beautiful Soup, as the N.01 value is not contained in the HTML you get back from https://mysprsemak.spr.gov.my/semakan/keputusan/pru-dun. The N.01 value is downloaded using an AJAX request, and then inserted into the DOM using JavaScript.
$.ajax({
type: "POST",
url: 'https://mysprsemak.spr.gov.my/semakan/keputusan/senaraiDunPruDun',
dataType: "json",
data: formData,
success: function (data) {
if(data.data == true) {
$("#parlimenDunSection").removeClass('hidden');
$("#tab-dun-web").empty();
$.each(data['dun'], function (key, value) {
//Keluarkan Dun
$("#tab-dun-web").append('<div class="justify-between font-semibold"><div id="'+value.AppendKod+'" class="tab-dun-item grid grid-cols-2 justify-between hover:bg-gray-300 font-semibold rounded-lg p-2 mx-2 mt-2 cursor-pointer" onclick="dunSelected(this.id)"><span id="kod-dun">N.'+value.KodBahagianPilihanRaya+'</span><span id="nama-dun">'+value.NamaBahagianPilihanRaya+'</span></div></div>');
});
}
},
});
Note that the HTML added using the append method contains the following snippet:
<span id="kod-dun">N.'+value.KodBahagianPilihanRaya+'</span>
There are two basic routes you can take to scrape this:
Use the API at https://mysprsemak.spr.gov.my/semakan/keputusan/senaraiDunPruDun directly. This is the simplest way in theory, but is made more complicated due to the API requiring XSRF tokens before it will return the data to you. See e.g. this post for how to explore the API using the developer tools in your browser.
Use Selenium, possibly with Selenium IDE, to control an actual browser using code. This has the advantage that the browser will run the JavaScript for you, so you can fetch the value from the main page after the JavaScript has finished running. However, this approach takes a bit more work to set up, and the resulting scripts tend to be slower and more error-prone. (You need to script the browser interactions necessary to get the data, such as clicking buttons and opening menus, and these scripts are prone to synchronisation problems where the script tries to interact with the browser before the browser is ready. Also, these scripts tend to break more easily if the web page changes.)
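For the second route, a rough Selenium sketch (assuming Selenium 4 with Chrome available, and that a span with id "kod-dun" appears once the relevant election has been selected; the selection steps themselves are left as placeholders):
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC

driver = webdriver.Chrome()
driver.get('https://mysprsemak.spr.gov.my/semakan/keputusan/pru-dun')
# ...click/select whatever triggers the AJAX call that fills in the DUN list...
# Wait until the JavaScript has inserted the span, then read its text
kod = WebDriverWait(driver, 30).until(
    EC.presence_of_element_located((By.ID, 'kod-dun'))
)
print(kod.text)  # e.g. "N.01"
driver.quit()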
Good time of the day!
While working on a scraping project, I have faced some issues.
Currently I am working on a draft:
from selenium import webdriver
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
from selenium.common.exceptions import TimeoutException
from selenium.webdriver.common.by import By
import requests
from bs4 import BeautifulSoup
import time
import random
#driver Path
PATH = "C:\Program Files (x86)\chromedriver"
BASE_URL = "https://www.immoweb.be/en/classified/house/for-sale/ottignies/1340/9308167"
driver = webdriver.Chrome(PATH)
driver.implicitly_wait(30)
driver.get(BASE_URL)
time.sleep(random.uniform(3.0, 5.0))
btn = driver.find_elements_by_xpath('//*[@id="uc-btn-accept-banner"]')[0]
btn.click()
r = requests.get(BASE_URL)
soup = BeautifulSoup(r.content, "html.parser")
def reader(url):
ls = list()
ImmoWebCode = url.find(class_ ="classified__information--immoweb-code").text.strip()
Price = url.find("p", class_="classified__price").find("span",class_="sr-only").text.strip()
Locality = url.find(class_="classified__information--address-row").find("span").text.strip()
HouseType = url.find(class_="classified__title").text.strip()
LivingArea = url.find("th",text="Living area").find_next(class_="classified-table__data").next_element.strip()
RoomsNumber = url.find("th",text="Bedrooms").find_next(class_="classified-table__data").next_element.strip()
Kitchen = url.find("th",text="Kitchen type").find_next(class_="classified-table__data").next_element.strip()
TerraceOrientation = url.find("th",text="Terrace orientation").find_next(class_="classified-table__data").next_element.strip()
TerraceArea = url.find("th",text="Terrace").find_next(class_="classified-table__data").next_element.strip()
Furnished = url.find("th",text="Furnished").find_next(class_="classified-table__data").next_element.strip()
ls.append(Furnished)
OpenFire = url.find("th", text="How many fireplaces?").find_next(class_="classified-table__data").next_element.strip()
GardenOrientation = url.find("th", text="Garden orientation").find_next(class_="classified-table__data").next_element.strip()
ls.append(GardenOrientation)
GardenArea = url.find("th",text="Garden surface").find_next(class_="classified-table__data").next_element.strip()
PlotSurface = url.find("th",text="Surface of the plot").find_next(class_="classified-table__data").next_element.strip()
ls.append(PlotSurface)
FacadeNumber = url.find("th",text="Number of frontages").find_next(class_="classified-table__data").next_element.strip()
SwimmingPoool = url.find("th",text="Swimming pool").find_next(class_="classified-table__data").next_element.strip()
StateOfTheBuilding = url.find("th",text="Building condition").find_next(class_="classified-table__data").next_element.strip()
return ls
print(reader(soup))
I start facing issues when the code reaches "Locality": I get Exception has occurred: AttributeError: 'NoneType' object has no attribute 'find', even though the element in question is clearly present in the HTML. I am adamant that it is a syntax issue, but I cannot put my finger on it.
This brings me to my second question:
Since this code will run on multiple pages, and those pages might not contain all of the requested elements, how can I insert a None value when an element is missing?
Thank you very much in advance!
Source Code:
<div class="classified__header-secondary-info classified__informations"><p class="classified__information--property">
3 bedrooms
<span aria-hidden="true">|</span>
199
<span class="abbreviation"><span aria-hidden="true">
m² </span> <span class="sr-only">
square meters </span></span></p> <div class="classified__information--financial"><!----> <!----> <span class="mortgage-banner__text">Request your mortgage loan</span></div> <div class="classified__information--address"><p><span class="classified__information--address-row"><span>
1340
</span> <span aria-hidden="true">—</span> <span>
Ottignies
</span>
|
</span> <button class="button button--text button--size-small classified__information--address-button">
Ask for the exact address
</button></p></div> <div class="classified__information--immoweb-code">
Immoweb code : 9308167
</div></div>
Isn't that data all within the <table> tags of the site? You can just use pandas:
import requests
import pandas as pd
url = 'https://www.immoweb.be/en/classified/house/for-sale/ottignies/1340/9308167'
headers= {'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/90.0.4430.93 Safari/537.36'}
response = requests.get(url,headers=headers)
dfs = pd.read_html(response.text)
df = pd.concat(dfs).dropna().reset_index(drop=True)
df = df.pivot(index=None, columns=0,values=1).bfill().iloc[[0],:]
Output:
print(df.to_string())
0 Address As built plan Available as of Available date Basement Bathrooms Bedrooms CO₂ emission Construction year Double glazing Elevator Energy class External reference Furnished Garden Garden orientation Gas, water & electricity Heating type Investment property Kitchen type Land is facing street Living area Living room surface Number of frontages Outdoor parking spaces Price Primary energy consumption Reference number of the EPC report Shower rooms Surface of the plot Toilets Website Yearly theoretical total energy consumption
0 Grand' Route 69 A 1435 - Corbais No To be defined December 31 2022 - 12:00 AM Yes 1 3 Not specified 2020 Yes Yes Not specified 8566 - 4443 No Yes East Yes Gas No Installed No 199 m² square meters 48 m² square meters 2 1 € 410,000 410000 € Not specified Not specified 1 150 m² square meters 3 http://www.gilmont.be Not specified
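To address the second question above (placing None when a page lacks a field), one option with this pandas approach is to reindex onto a fixed column list, so missing fields come through as NaN and can then be converted to None; the column names below are only illustrative:
# Force a fixed set of columns so pages that lack a field still yield a value
wanted = ['Price', 'Bedrooms', 'Living area', 'Kitchen type', 'Furnished',
          'Garden orientation', 'Surface of the plot', 'Number of frontages',
          'Swimming pool', 'Building condition']
row = df.reindex(columns=wanted).iloc[0]
# Replace NaN with None and turn the row into a plain dict
record = row.where(row.notna(), None).to_dict()
print(record)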
I am trying to access the elements in the Ingredients list of the following website: https://www.jamieoliver.com/recipes/pasta-recipes/gennaro-s-classic-spaghetti-carbonara/
<div class="col-md-12 ingredient-wrapper">
<ul class="ingred-list ">
<li>
3 large free-range egg yolks
</li>
<li>
40 g Parmesan cheese, plus extra to serve
</li>
<li>
1 x 150 g piece of higher-welfare pancetta
</li>
<li>
200g dried spaghetti
</li>
<li>
1 clove of garlic
</li>
<li>
extra virgin olive oil
</li>
</ul>
</div>
I first tried just using requests and Beautiful Soup, but my code didn't find the list elements. I then tried using Selenium and it still didn't work. My code is below:
from selenium import webdriver
from bs4 import BeautifulSoup
import pandas as pd
url = "https://www.jamieoliver.com/recipes/pasta-recipes/cracker-ravioli/"
driver = webdriver.Chrome()
driver.get(url)
html = driver.page_source
soup = BeautifulSoup(html, 'html.parser')
for ultag in soup.findAll('div', {'class': "col-md-12 ingredient-wrapper"}):
# for ultag in soup.findAll('ul', {'class': 'ingred_list '}):
for litag in ultag.findALL('li'):
print(litag.text)
To get the ingredients list, you can use this example:
import requests
from bs4 import BeautifulSoup
url = 'https://www.jamieoliver.com/recipes/pasta-recipes/gennaro-s-classic-spaghetti-carbonara/'
headers = {'User-Agent': 'Mozilla/5.0 (X11; Ubuntu; Linux x86_64; rv:80.0) Gecko/20100101 Firefox/80.0'}
soup = BeautifulSoup(requests.get(url, headers=headers).content, 'html.parser')
for li in soup.select('.ingred-list li'):
print(' '.join(li.text.split()))
Prints:
3 large free-range egg yolks
40 g Parmesan cheese , plus extra to serve
1 x 150 g piece of higher-welfare pancetta
200 g dried spaghetti
1 clove of garlic
extra virgin olive oil
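If you want the ingredients as a Python list (for example, to put them in a DataFrame later), a small follow-up using the same selector:
# Collect the cleaned ingredient strings into a list
ingredients = [' '.join(li.text.split()) for li in soup.select('.ingred-list li')]
print(ingredients)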
I am trying to extract information from the latest SEC EDGAR Schedule 13 form filings.
Link of the filing as an example:
1) Saba Capital_27-Dec-2019_SC13
The information I am trying to extract (and the parts of the filing containing it):
1) Names of reporting persons: Saba Capital Management, L.P.
<p style="margin-bottom: 0pt;">NAME OF REPORTING PERSON</p>
<p style="margin-top: 0pt; margin-left: 18pt;">Saba Capital Management GP, LLC<br><br/>
2) Name of issuer : WESTERN ASSET HIGH INCOME FUND II INC
<p style="text-align: center;"><b><font size="5"><u>WESTERN ASSET HIGH INCOME FUND II INC.</u></font><u><br/></u>(Name of Issuer)</b>
3) CUSIP Number: 95766J102 (managed to get)
<p style="text-align: center;"><b><u>95766J102<br/></u>(CUSIP Number)</b>
4) Percentage of class represented by amount : 11.3% (managed to get)
<p style="margin-bottom: 0pt;">PERCENT OF CLASS REPRESENTED BY AMOUNT IN ROW (11)</p>
<p style="margin-top: 0pt; margin-left: 18pt;">11.3%<br><br/>
5) Date of Event Which requires filing of this statement: December 24, 2019
<p style="text-align: center;"><b><u>December 24, 2019<br/></u>(Date of Event Which Requires Filing of This Statement)</b>
My code so far:
import requests
import re
from bs4 import BeautifulSoup
page = requests.get('https://www.sec.gov/Archives/edgar/data/1058239/000106299319004848/formsc13da.htm')
soup = BeautifulSoup(page.text, 'xml')
## get CUSIP number
CUSIP = re.findall(r'[0-9]{3}[a-zA-Z0-9]{2}[a-zA-Z0-9*##]{3}[0-9]', soup.text)
### get %
regex = r"(?<=PERCENT OF CLASS|Percent of class)(.*)(?=%)"
percent = re.findall(r'\d+.\d+', re.search(regex, soup.text, re.DOTALL).group().split('%')[0])
How can I extract these 5 pieces of information from the filing? Thanks in advance.
Try the following code to get all the values, using find() and the CSS selector method select_one():
import requests
import re
from bs4 import BeautifulSoup
page = requests.get('https://www.sec.gov/Archives/edgar/data/1058239/000106299319004848/formsc13da.htm')
soup = BeautifulSoup(page.text, 'lxml')
NameReportingperson=soup.find('p', text=re.compile('NAME OF REPORTING PERSON')).find_next('p').text.strip()
print(NameReportingperson)
NameOftheIssuer=soup.select_one('p:nth-child(7) > b u').text.strip()
print(NameOftheIssuer)
CUSIP=soup.select_one("p:nth-child(9) > b > u").text.strip()
print(CUSIP)
percentage=soup.find('p', text=re.compile('PERCENT OF CLASS REPRESENTED BY AMOUNT IN ROW')).find_next('p').text.strip()
print(percentage)
Dateof=soup.select_one("p:nth-child(11) > b > u").text.strip()
print(Dateof)
Output:
Saba Capital Management, L.P.
WESTERN ASSET HIGH INCOME FUND II INC.
95766J102
11.3%
December 24, 2019
UPDATED
If you don't want to rely on element position, then try the version below.
import requests
import re
from bs4 import BeautifulSoup
page = requests.get('https://www.sec.gov/Archives/edgar/data/1058239/000106299319004848/formsc13da.htm')
soup = BeautifulSoup(page.text, 'lxml')
NameReportingperson=soup.find('p', text=re.compile('NAME OF REPORTING PERSON')).find_next('p').text.strip()
print(NameReportingperson)
NameOftheIssuer=soup.select_one('p:contains(Issuer)').find_next('u').text.strip()
print(NameOftheIssuer)
CUSIP=soup.select_one('p:contains(CUSIP)').find_next('u').text.strip()
print(CUSIP)
percentage=soup.find('p', text=re.compile('PERCENT OF CLASS REPRESENTED BY AMOUNT IN ROW')).find_next('p').text.strip()
print(percentage)
Dateof=soup.select_one('p:contains(Event)').find_next('u').text.strip()
print(Dateof)
Output:
Saba Capital Management, L.P.
WESTERN ASSET HIGH INCOME FUND II INC.
95766J102
11.3%
December 24, 2019
Update 2:
import requests
import re
from bs4 import BeautifulSoup
page = requests.get('https://www.sec.gov/Archives/edgar/data/1058239/000106299319004848/formsc13da.htm')
soup = BeautifulSoup(page.text, 'lxml')
NameReportingperson=soup.find('p', text=re.compile('NAME OF REPORTING PERSON')).find_next('p').text.strip()
print(NameReportingperson)
NameOftheIssuer=soup.select_one('p:nth-of-type(7) > b u').text.strip()
print(NameOftheIssuer)
CUSIP=soup.select_one("p:nth-of-type(9) > b > u").text.strip()
print(CUSIP)
percentage=soup.find('p', text=re.compile('PERCENT OF CLASS REPRESENTED BY AMOUNT IN ROW')).find_next('p').text.strip()
print(percentage)
Dateof=soup.select_one("p:nth-of-type(11) > b > u").text.strip()
print(Dateof)
Using lxml, it should work this way:
import requests
import lxml.html
url = 'https://www.sec.gov/Archives/edgar/data/1058239/000106299319004848/formsc13da.htm'
source = requests.get(url)
doc = lxml.html.fromstring(source.text)
name = doc.xpath('//p[text()="NAME OF REPORTING PERSON"]/following-sibling::p/text()')[0]
issuer = doc.xpath('//p[contains(text(),"(Name of Issuer)")]//u/text()')[0]
cusip = doc.xpath('//p[contains(text(),"(CUSIP Number)")]//u/text()')[0]
perc = doc.xpath('//p[contains(text(),"PERCENT OF CLASS REPRESENTED")]/following-sibling::p/text()')[0]
event = doc.xpath('//p[contains(text(),"(Date of Event Which Requires")]//u/text()')[0]
Output:
Saba Capital Management, L.P.
WESTERN ASSET HIGH INCOME FUND II INC.
95766J102
11.3%
December 24, 2019