Web scrape a SPAN using Python and BeautifulSoup

Hi everyone, I would like to scrape the value N.01, but to no avail. I'm a self-learning newbie, so any help solving this problem would be appreciated :)
This is the HTML part that I am interested in:
<div class="justify-between font-semibold">
<div id="14001" class="tab-dun-item grid grid-cols-2 justify-between hover:bg-gray-300 font-semibold rounded-lg p-2 mx-2 mt-2 cursor-pointer bg-gray-300" onclick="dunSelected(this.id)">
<span id="kod-dun" class="">N.01</span>
<span id="nama-dun" class="">BULOH KASAP</span>
</div>
</div>
This is my code:
from bs4 import BeautifulSoup
import requests
html_text = requests.get('https://mysprsemak.spr.gov.my/semakan/keputusan/pru-dun').text
soup = BeautifulSoup(html_text,'lxml')
dun = soup.find('div', class_='justify-between font-semibold')
a = dun.find('span', attrs={'id': 'kod-dun'})
b = dun.find('span', attrs={'class': ''})
print(a)
print(b)
The result is this:
<span id="kod-dun"><!--Kod DUN--></span>
<span id="kod-dun"><!--Kod DUN--></span>

As stated in the other answer, the best way is to use the API directly. And as noted, it's made a little more complicated in that you need to provide a token and a few other parameters. The good news is that you can grab that info from the site itself.
Note: you'll need to pip install choice to implement the user-input option I provide:
import requests
from bs4 import BeautifulSoup
import re
import json
import choice
import pandas as pd
s = requests.Session()
url = 'https://mysprsemak.spr.gov.my/semakan/keputusan/pru-dun'
headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/96.0.4664.110 Safari/537.36'}
response = s.get(url, headers=headers).text
cookies = s.cookies.get_dict()
cookieStr = ''
for k, v in cookies.items():
    cookieStr += f'{k}={v};'
headers.update({'Cookie':cookieStr})
soup = BeautifulSoup(response, 'html.parser')
options = soup.find_all('option')[1:]
optionsDict = {each.text: [each['value'],each['data-negeri']] for each in options}
getInput = choice.Menu(list(optionsDict.keys())).ask()
jsonStr = re.search('var formData = ({.*});', response, flags=re.DOTALL).group(1).split(',')[0] +'}'
token = json.loads(jsonStr)['_token']
postURL = 'https://mysprsemak.spr.gov.my/semakan/keputusan/senaraiDunPruDun'
payload = {
    '_token': token,
    'PilihanrayaId': optionsDict[getInput][0],
    'RefNegeriId': optionsDict[getInput][1]}
jsonData = s.post(postURL, headers=headers, data=payload).json()
df = pd.json_normalize(jsonData['dun'])
Output:
Make a choice:
0: PRU DEWAN NEGERI JOHOR KE-15
1: PRU DUN SARAWAK KE-12 (2021)
2: PRU DUN MELAKA KE-15 (2021)
3: PRU DUN SABAH KALI KE-16 (2020)
4: PRU DUN SARAWAK KE-11 (2016)
Enter number or name; return for next page
? 0
Dataframe:
print(df)
BahagianPilihanRayaId ... AppendKod
0 956EC2DB-2952-414E-835D-831611009C27 ... 14001
1 956EC2DB-EAD5-4656-B9A5-3DB0AA626BC1 ... 14002
2 956EC2DC-18ED-4711-BCD1-4D2CFB473EE1 ... 14103
3 956EC2DB-5E73-47FB-9F23-4A7F6EE26EDC ... 14104
4 956EC2DB-DEC7-45F9-9F60-9B398D678D2F ... 14205
5 956EC2DC-5776-4062-8E43-E8F18454DAC1 ... 14206
6 956EC2DC-B0FD-45D3-90A2-F1F8C4CA31B2 ... 14307
7 956EC2DC-D419-4801-A11D-65C04F7CF7FC ... 14308
8 956EC2DB-F683-4300-B9A0-7679C676DCCF ... 14409
9 956EC2DA-827A-42DF-8D34-3842D49113FE ... 14410
10 956EC2DB-CE81-4EB4-A6A4-D003EF5F845F ... 14411
11 956EC2DC-0DF7-45A9-B57D-28F252EB4349 ... 14512
12 956EC2DA-9C70-4DB1-A67F-BB5D0DC7AB49 ... 14513
13 956EC2DC-01E6-4D8E-A9A0-930EE420CE04 ... 14514
14 956EC2DC-630F-4EBF-9C1B-8F93FFAEDA96 ... 14615
15 956EC2DA-D04A-4E9E-BBA6-F5487168BB4C ... 14616
16 956EC2DB-6B1F-4EFC-9D19-16A84738E680 ... 14717
17 956EC2DB-35E7-4C14-AA60-F0B79B6E7CB2 ... 14718
18 956EC2DB-927C-4705-A347-F0D81D5D3BE4 ... 14819
19 956EC2DA-EB67-4359-9FC1-EF749DD6E296 ... 14820
20 956EC2DC-A4FC-42F8-9FB8-7669D120EC5F ... 14921
21 956EC2DA-C381-47C8-9014-25EFCD8B064C ... 14922
22 956EC2DC-EC01-4D68-8171-8BE026CCBF83 ... 15023
23 956EC2DC-6E20-4EB1-9C47-88D9B72E5889 ... 15024
24 956EC2DB-523C-4998-AC02-33F602E07305 ... 15025
25 956EC2DC-C7DC-47EC-BDD8-73A1440788BE ... 15126
26 956EC2DB-84E2-4BB6-97FD-C996983A76B1 ... 15127
27 956EC2DA-59D1-420C-8155-80F2156FB446 ... 15228
28 956EC2DA-70E6-40D8-A3A3-E8E4FBC45E40 ... 15229
29 956EC2DC-4799-4624-9C07-18D1DAC6B0F7 ... 15330
30 956EC2DD-20B8-4F28-90B9-1519628EF751 ... 15331
31 956EC2DA-4BFD-4909-8D93-8A2207FD247B ... 15432
32 956EC2DA-B4A3-49BD-8CC0-BDCE81B7071E ... 15433
33 956EC2DA-A8EE-4244-9052-9D14ECDD1A8C ... 15534
34 956EC2DC-3B33-4627-BFB5-DAF87988909A ... 15535
35 956EC2DC-3009-44FA-874D-C575599210DC ... 15636
36 956EC2DC-BC61-4202-962A-66415635FDDE ... 15637
37 956EC2DC-24AE-41BC-AD3E-7D48D8E752D3 ... 15738
38 956EC2DB-02E6-439A-9E97-B18F65EB9432 ... 15739
39 956EC2DC-928E-4837-A049-18BFD698A8B2 ... 15840
40 956EC2DB-782A-4E8E-ADE3-07A9AC786F9E ... 15841
41 956EC2DC-8620-4CD0-8225-60DB217E5EB1 ... 15942
42 956EC2DA-F787-4965-A099-E29794CBA951 ... 15943
43 956EC2DB-BF64-4B83-9240-49701039EA2A ... 16044
44 956EC2DC-7A25-4870-9276-598C7041CD27 ... 16045
45 956EC2DB-1238-45A6-A935-469825659736 ... 16146
46 956EC2DD-06B6-42E1-9BB8-FC300EF40688 ... 16147
47 956EC2DC-DFD2-45A8-8B15-5C831DD0E14C ... 16248
48 956EC2DA-DE6C-4A14-93CA-BB757A5085F7 ... 16249
49 956EC2DA-901D-465F-B795-7CF2D807E394 ... 16350
50 956EC2DB-42DB-4949-B4DE-025EAAB779B9 ... 16351
51 956EC2DB-A09B-4A7A-A122-15208F9D0155 ... 16352
52 956EC2DA-6599-4283-8DC2-1496100717E3 ... 16453
53 956EC2DB-1DD6-4659-AF97-D448477A546C ... 16454
54 956EC2DB-AF15-4069-86A4-035F2E3A7ED4 ... 16555
55 956EC2DC-F814-4B4C-A65F-79B75AD25C62 ... 16556
[56 rows x 4 columns]
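If you specifically want the N.01 / BULOH KASAP style pairs from the question, the relevant JSON fields should be the ones the page's own JavaScript uses (see the AJAX snippet in the next answer). The column names below are assumed from that script rather than verified:
df['kod_dun'] = 'N.' + df['KodBahagianPilihanRaya'].astype(str)  # assumed column name
print(df[['kod_dun', 'NamaBahagianPilihanRaya']].head())
# expected to show pairs such as N.01  BULOH KASAP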

You cannot do this directly with Beautiful Soup, as the N.01 value is not contained in the HTML you get back from https://mysprsemak.spr.gov.my/semakan/keputusan/pru-dun. The N.01 value is downloaded using an AJAX request, and then inserted into the DOM using JavaScript.
$.ajax({
    type: "POST",
    url: 'https://mysprsemak.spr.gov.my/semakan/keputusan/senaraiDunPruDun',
    dataType: "json",
    data: formData,
    success: function (data) {
        if(data.data == true) {
            $("#parlimenDunSection").removeClass('hidden');
            $("#tab-dun-web").empty();
            $.each(data['dun'], function (key, value) {
                //Keluarkan Dun
                $("#tab-dun-web").append('<div class="justify-between font-semibold"><div id="'+value.AppendKod+'" class="tab-dun-item grid grid-cols-2 justify-between hover:bg-gray-300 font-semibold rounded-lg p-2 mx-2 mt-2 cursor-pointer" onclick="dunSelected(this.id)"><span id="kod-dun">N.'+value.KodBahagianPilihanRaya+'</span><span id="nama-dun">'+value.NamaBahagianPilihanRaya+'</span></div></div>');
            });
        }
    },
});
Note that the HTML added using the append method contains the following snippet:
<span id="kod-dun">N.'+value.KodBahagianPilihanRaya+'</span>
There are two basic routes you can take to scrape this:
Use the API at https://mysprsemak.spr.gov.my/semakan/keputusan/senaraiDunPruDun directly. This is the simplest way in theory, but is made more complicated due to the API requiring XSRF tokens before it will return the data to you. See e.g. this post for how to explore the API using the developer tools in your browser.
Use Selenium, possibly with Selenium IDE, to control an actual browser using code. This has the advantage that the browser will run the JavaScript for you, so you can fetch the value from the main page after the JavaScript has finished running. However, this approach takes a bit more work to set up, and the resulting scripts tend to be slower and more error-prone. (You need to script the browser interactions necessary to get the data, such as clicking buttons and opening menus, and these scripts are prone to synchronisation problems where the script tries to interact with the browser before the browser is ready. Also, these scripts tend to break more easily if the web page changes.)
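For completeness, here is a minimal, untested sketch of the Selenium route. The wait condition and the kod-dun ID are assumptions based on the HTML snippet in the question; in practice you would also need to script whatever menu selections trigger the AJAX call before the DUN list appears.
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC

driver = webdriver.Chrome()  # assumes a chromedriver is available
try:
    driver.get('https://mysprsemak.spr.gov.my/semakan/keputusan/pru-dun')

    # Script the clicks/menu selections that trigger the AJAX call here (not shown).

    # Wait until the JavaScript has inserted the span, then read it.
    kod = WebDriverWait(driver, 15).until(
        EC.presence_of_element_located((By.ID, 'kod-dun'))
    )
    print(kod.text)  # e.g. "N.01"
finally:
    driver.quit()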


Using beautifulsoup4, avoid an AttributeError on a non-existing element

Good time of the day!
While working on a scraping project, I have faced some issues.
Currently I am working on a draft:
from selenium import webdriver
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
from selenium.common.exceptions import TimeoutException
from selenium.webdriver.common.by import By
import requests
from bs4 import BeautifulSoup
import time
import random
#driver Path
PATH = "C:\Program Files (x86)\chromedriver"
BASE_URL = "https://www.immoweb.be/en/classified/house/for-sale/ottignies/1340/9308167"
driver = webdriver.Chrome(PATH)
driver.implicitly_wait(30)
driver.get(BASE_URL)
time.sleep(random.uniform(3.0, 5.0))
btn = driver.find_elements_by_xpath('//*[@id="uc-btn-accept-banner"]')[0]
btn.click()
r = requests.get(BASE_URL)
soup = BeautifulSoup(r.content, "html.parser")
def reader(url):
    ls = list()
    ImmoWebCode = url.find(class_="classified__information--immoweb-code").text.strip()
    Price = url.find("p", class_="classified__price").find("span", class_="sr-only").text.strip()
    Locality = url.find(class_="classified__information--address-row").find("span").text.strip()
    HouseType = url.find(class_="classified__title").text.strip()
    LivingArea = url.find("th", text="Living area").find_next(class_="classified-table__data").next_element.strip()
    RoomsNumber = url.find("th", text="Bedrooms").find_next(class_="classified-table__data").next_element.strip()
    Kitchen = url.find("th", text="Kitchen type").find_next(class_="classified-table__data").next_element.strip()
    TerraceOrientation = url.find("th", text="Terrace orientation").find_next(class_="classified-table__data").next_element.strip()
    TerraceArea = url.find("th", text="Terrace").find_next(class_="classified-table__data").next_element.strip()
    Furnished = url.find("th", text="Furnished").find_next(class_="classified-table__data").next_element.strip()
    ls.append(Furnished)
    OpenFire = url.find("th", text="How many fireplaces?").find_next(class_="classified-table__data").next_element.strip()
    GardenOrientation = url.find("th", text="Garden orientation").find_next(class_="classified-table__data").next_element.strip()
    ls.append(GardenOrientation)
    GardenArea = url.find("th", text="Garden surface").find_next(class_="classified-table__data").next_element.strip()
    PlotSurface = url.find("th", text="Surface of the plot").find_next(class_="classified-table__data").next_element.strip()
    ls.append(PlotSurface)
    FacadeNumber = url.find("th", text="Number of frontages").find_next(class_="classified-table__data").next_element.strip()
    SwimmingPoool = url.find("th", text="Swimming pool").find_next(class_="classified-table__data").next_element.strip()
    StateOfTheBuilding = url.find("th", text="Building condition").find_next(class_="classified-table__data").next_element.strip()
    return ls
print(reader(soup))
I start facing issues when the code reaches "Locality": I receive Exception has occurred: AttributeError: 'NoneType' object has no attribute 'find', even though the mentioned element is clearly present in the HTML code. I am adamant that it is a syntax issue, but I can't put my finger on it.
This brings me to my second question:
Since this code will be running on multiple pages, and those pages might not contain all of the requested elements, how can I insert a None value when an element is missing?
Thank you very much in advance!
Source Code:
<div class="classified__header-secondary-info classified__informations"><p class="classified__information--property">
3 bedrooms
<span aria-hidden="true">|</span>
199
<span class="abbreviation"><span aria-hidden="true">
m² </span> <span class="sr-only">
square meters </span></span></p> <div class="classified__information--financial"><!----> <!----> <span class="mortgage-banner__text">Request your mortgage loan</span></div> <div class="classified__information--address"><p><span class="classified__information--address-row"><span>
1340
</span> <span aria-hidden="true">—</span> <span>
Ottignies
</span>
|
</span> <button class="button button--text button--size-small classified__information--address-button">
Ask for the exact address
</button></p></div> <div class="classified__information--immoweb-code">
Immoweb code : 9308167
</div></div>
Isn't that data all within the <table> tags of the site? You can just use pandas:
import requests
import pandas as pd
url = 'https://www.immoweb.be/en/classified/house/for-sale/ottignies/1340/9308167'
headers= {'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/90.0.4430.93 Safari/537.36'}
response = requests.get(url,headers=headers)
dfs = pd.read_html(response.text)
df = pd.concat(dfs).dropna().reset_index(drop=True)
df = df.pivot(index=None, columns=0,values=1).bfill().iloc[[0],:]
Output:
print(df.to_string())
0 Address As built plan Available as of Available date Basement Bathrooms Bedrooms CO₂ emission Construction year Double glazing Elevator Energy class External reference Furnished Garden Garden orientation Gas, water & electricity Heating type Investment property Kitchen type Land is facing street Living area Living room surface Number of frontages Outdoor parking spaces Price Primary energy consumption Reference number of the EPC report Shower rooms Surface of the plot Toilets Website Yearly theoretical total energy consumption
0 Grand' Route 69 A 1435 - Corbais No To be defined December 31 2022 - 12:00 AM Yes 1 3 Not specified 2020 Yes Yes Not specified 8566 - 4443 No Yes East Yes Gas No Installed No 199 m² square meters 48 m² square meters 2 1 € 410,000 410000 € Not specified Not specified 1 150 m² square meters 3 http://www.gilmont.be Not specified
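As for the second part of the question (pages that might not contain every element): a small helper that returns None instead of raising an AttributeError is a common pattern. A minimal sketch, reusing the th/classified-table__data lookups from the question's reader() function (selector names are taken from the question, not re-verified against the site):
def safe_text(tag):
    """Return the stripped text of a tag, or None if the tag wasn't found."""
    return tag.text.strip() if tag is not None else None

def safe_table_value(soup, label):
    """Look up a <th> by its label and return the adjacent value, or None."""
    th = soup.find("th", text=label)
    if th is None:
        return None
    cell = th.find_next(class_="classified-table__data")
    return cell.get_text(strip=True) if cell is not None else None

# Example usage with the reader-style lookups from the question:
# living_area = safe_table_value(soup, "Living area")   # None if the row is absent
# price = safe_text(soup.find("p", class_="classified__price"))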

Grab table from football recruiting website

I would like to create the exact same table as the one shown in the following webpage: https://247sports.com/college/penn-state/Season/2022-Football/Commits/
I am currently using Selenium and Beautiful Soup to start making it happen on a Google Colab notebook because I am getting forbidden errors when performing a "read_html" command. I have just started to get some output, but I only want to grab the text and not the external stuff surrounding it.
Here is my code so far...
from kora.selenium import wd
from bs4 import BeautifulSoup
import pandas as pd
import time
import datetime as dt
import os
import re
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
url = 'https://247sports.com/college/penn-state/Season/2022-Football/Commits/'
wd.get(url)
time.sleep(5)
soup = BeautifulSoup(wd.page_source)
school=soup.find_all('span', class_='meta')
name=soup.find_all('div', class_='recruit')
position = soup.find_all('div', class_="position")
height_weight = soup.find_all('div', class_="metrics")
rating = soup.find_all('span', class_='score')
nat_rank = soup.find_all('a', class_='natrank')
state_rank = soup.find_all('a', class_='sttrank')
pos_rank = soup.find_all('a', class_='posrank')
status = soup.find_all('p', class_='commit-date withDate')
status
...and here is my output...
[<p class="commit-date withDate"> Commit 7/25/2020 </p>,
<p class="commit-date withDate"> Commit 9/4/2020 </p>,
<p class="commit-date withDate"> Commit 1/1/2021 </p>,
<p class="commit-date withDate"> Commit 3/8/2021 </p>,
<p class="commit-date withDate"> Commit 10/29/2020 </p>,
<p class="commit-date withDate"> Commit 7/28/2020 </p>,
<p class="commit-date withDate"> Commit 9/8/2020 </p>,
<p class="commit-date withDate"> Commit 8/3/2020 </p>,
<p class="commit-date withDate"> Commit 5/1/2021 </p>]
Any assistance on this is greatly appreciated.
There's no need to use Selenium. To get a response from the website you need to specify the HTTP User-Agent header; otherwise, the website thinks you're a bot and will block you.
To create a DataFrame see this sample:
import pandas as pd
import requests
from bs4 import BeautifulSoup
url = "https://247sports.com/college/penn-state/Season/2022-Football/Commits/"
# Add the `user-agent` otherwise we will get blocked when sending the request
headers = {
    "user-agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/90.0.4430.93 Safari/537.36"
}
response = requests.get(url, headers=headers).content
soup = BeautifulSoup(response, "html.parser")
data = []
for tag in soup.find_all("li", class_="ri-page__list-item")[1:]:  # `[1:]` since the first result is a table header
    school = tag.find_next("span", class_="meta").text
    name = tag.find_next("a", class_="ri-page__name-link").text
    position = tag.find_next("div", class_="position").text
    height_weight = tag.find_next("div", class_="metrics").text
    rating = tag.find_next("span", class_="score").text
    nat_rank = tag.find_next("a", class_="natrank").text
    state_rank = tag.find_next("a", class_="sttrank").text
    pos_rank = tag.find_next("a", class_="posrank").text
    status = tag.find_next("p", class_="commit-date withDate").text
    data.append(
        {
            "school": school,
            "name": name,
            "position": position,
            "height_weight": height_weight,
            "rating": rating,
            "nat_rank": nat_rank,
            "state_rank": state_rank,
            "pos_rank": pos_rank,
            "status": status,
        }
    )
df = pd.DataFrame(data)
print(df.to_string())
Output:
school name position height_weight rating nat_rank state_rank pos_rank status
0 Westerville South (Westerville, OH) Kaden Saunders WR 5-10 / 172 0.9509 116 5 16 Commit 7/25/2020
1 IMG Academy (Bradenton, FL) Drew Shelton OT 6-5 / 290 0.9468 130 17 14 Commit 9/4/2020
2 Central Dauphin East (Harrisburg, PA) Mehki Flowers WR 6-1 / 190 0.9461 131 4 18 Commit 1/1/2021
3 Medina (Medina, OH) Drew Allar PRO 6-5 / 220 0.9435 138 6 8 Commit 3/8/2021
4 Manheim Township (Lancaster, PA) Anthony Ivey WR 6-0 / 190 0.9249 190 6 26 Commit 10/29/2020
5 King (Milwaukee, WI) Jerry Cross TE 6-6 / 218 0.9153 218 4 8 Commit 7/28/2020
6 Northeast (Philadelphia, PA) Ken Talley WDE 6-3 / 230 0.9069 253 9 13 Commit 9/8/2020
7 Central York (York, PA) Beau Pribula DUAL 6-2 / 215 0.8891 370 12 9 Commit 8/3/2020
8 The Williston Northampton School (Easthampton, MA) Maleek McNeil OT 6-8 / 340 0.8593 705 8 64 Commit 5/1/2021

beautifulsoup select method returns traceback

Well, I'm still learning the BeautifulSoup module, and I'm replicating this from the book Automate the Boring Stuff with Python. I tried replicating the "get Amazon price" script, but I get a traceback on the .select() method.
The error: TypeError: 'NoneType' object is not callable
I'm getting devastated by this error, as I couldn't find much about it.
import bs4
import requests
header = {'User-Agent': "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/75.0.3770.100 Safari/537.36"}
def site(url):
    x = requests.get(url, headers=header)
    x.raise_for_status()
    soup = bs4.BeautifulSoup(x.text, "html.parser")
    p = soup.Select('#buyNewSection > a > h5 > div > div.a-column.a-span8.a-text-right.a-span-last > div > span.a-size-medium.a-color-price.offer-price.a-text-normal')
    abc = p[0].text.strip()
    return abc
price = site('https://www.amazon.com/Automate-Boring-Stuff-Python-Programming/dp/1593275994')
print('price is' + str(price))
It should return a list containing the price, but I'm stuck with this error.
If you use soup.select as opposed to soup.Select, your code does work; it just returns an empty list. The reason can be seen if we inspect the attribute you are calling:
help(soup.Select)
Out[1]:
Help on NoneType object:
class NoneType(object)
| Methods defined here:
|
| __bool__(self, /)
| self != 0
|
| __repr__(self, /)
| Return repr(self).
|
| ----------------------------------------------------------------------
| Static methods defined here:
|
| __new__(*args, **kwargs) from builtins.type
| Create and return a new object. See help(type) for accurate signature.
Compared to:
help(soup.select)
Out[2]:
Help on method select in module bs4.element:
select(selector, namespaces=None, limit=None, **kwargs) method of bs4.BeautifulSoup instance
Perform a CSS selection operation on the current element.
This uses the SoupSieve library.
:param selector: A string containing a CSS selector.
:param namespaces: A dictionary mapping namespace prefixes
used in the CSS selector to namespace URIs. By default,
Beautiful Soup will use the prefixes it encountered while
parsing the document.
:param limit: After finding this number of results, stop looking.
:param kwargs: Any extra arguments you'd like to pass in to
soupsieve.select().
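As far as I can tell, the underlying reason is that Beautiful Soup treats unknown attribute access on a tag as a shortcut for find() with that name; there is no <Select> tag in the document, so soup.Select is None, and calling None gives the TypeError. A quick sketch to illustrate (not specific to the Amazon page):
from bs4 import BeautifulSoup

soup = BeautifulSoup("<p>hello</p>", "html.parser")

# Attribute access is a find() shortcut: soup.p is the first <p> tag.
print(soup.p)          # <p>hello</p>

# An attribute that matches no tag name just gives you None...
print(soup.Select)     # None
print(soup.nosuchtag)  # None

# ...so calling it reproduces the error from the question:
# soup.Select('#buyNewSection')  # TypeError: 'NoneType' object is not callable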
That said, it seems that the page structure is actually different from the one you are targeting: the <a> tag is missing.
<div id="buyNewSection" class="rbbHeader dp-accordion-row">
<h5>
<div class="a-row">
<div class="a-column a-span4 a-text-left a-nowrap">
<span class="a-text-bold">Buy New</span>
</div>
<div class="a-column a-span8 a-text-right a-span-last">
<div class="inlineBlock-display">
<span class="a-letter-space"></span>
<span class="a-size-medium a-color-price offer-price a-text-normal">$16.83</span>
</div>
</div>
</div>
</h5>
</div>
So this should work:
p = soup.select('#buyNewSection > h5 > div > div.a-column.a-span8.a-text-right.a-span-last > div.inlineBlock-display > span.a-size-medium.a-color-price.offer-price.a-text-normal')
abc = p[0].text.strip()
abc
Out[2]:
'$16.83'
Additionally, you could consider using a more granular approach that lets you debug your code better. For instance:
buySection = soup.find('div', attrs={'id': 'buyNewSection'})
buySpan = buySection.find('span', attrs={'class': 'a-size-medium a-color-price offer-price a-text-normal'})
buySpan.text
Out[1]:
'$16.83'

How to get unlabeled data by BeautifulSoup Python

<div class="content-cell mdl-cell mdl-cell--6-col mdl-typography--body-1">
Contents
<a href="https://www.google.com/search?q=Google+what+time+is+it">
Google what time is it
</a>
<br/>
It's 4:38.
<br/>
2018. 2. 5. 5:38:41 PM
</div>
In the code above, I want to extract the answer ("It's 4:38") and the timestamp. For the question, I used
for link in soup.find_all('a'): Questions.append(link.text);
but I couldn't do the same with the answers and timestamp. How do I resolve this issue?
You can see that the text you want is a descendant of the div element. You can also see that it immediately follows the first br element that is such a descendant. Then one way to find it is simply to iterate through the descendants of the div looking for that br. When you see that take the next item.
Here's how it will play out.
>>> import bs4
>>> soup = bs4.BeautifulSoup(open('sumin.htm').read(), 'lxml')
>>> div = soup.find('div')
>>> for element in div.descendants:
...     element.name, element
...
(None, '\n Contents\n ')
('a', <a href="https://www.google.com/search?q=Google+what+time+is+it">
Google what time is it
</a>)
(None, '\n Google what time is it\n ')
(None, '\n')
('br', <br/>)
(None, "\n It's 4:38.\n ")
('br', <br/>)
(None, '\n 2018. 2. 5. 5:38:41 PM\n ')
Notice that elements such as br have the name property but navigable strings do not (this property is None).
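Putting that observation to work, here is a small sketch that walks the descendants and keeps the string that follows each <br> (based only on the snippet in the question; sumin.htm is the file used in the session above):
import bs4

soup = bs4.BeautifulSoup(open('sumin.htm').read(), 'lxml')
div = soup.find('div')

# Collect the navigable strings that immediately follow each <br>.
after_br = []
take_next = False
for element in div.descendants:
    if take_next and element.name is None:   # a navigable string
        after_br.append(element.strip())
        take_next = False
    elif element.name == 'br':
        take_next = True

answer, timestamp = after_br
print(answer)     # It's 4:38.
print(timestamp)  # 2018. 2. 5. 5:38:41 PM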
How to get unlabeled data
Actually, it is not unlabeled. The text you want is located inside the <div> tag. To get that text you can check if the text is NavigableString.
If you check the type of each item in the tag's .contents:
Contents #<class 'bs4.element.NavigableString'>
<a href="https://www.google.com/search?q=Google+what+time+is+it">
Google what time is it
</a> #<class 'bs4.element.Tag'>
#<class 'bs4.element.NavigableString'>
<br/> #<class 'bs4.element.Tag'>
It's 4:38. #<class 'bs4.element.NavigableString'>
<br/> #<class 'bs4.element.Tag'>
2018. 2. 5. 5:38:41 PM #<class 'bs4.element.NavigableString'>
Code:
>>> from bs4 import BeautifulSoup, NavigableString
>>> html = '''<div class="content-cell mdl-cell mdl-cell--6-col mdl-typography--body-1">
... Contents
... <a href="https://www.google.com/search?q=Google+what+time+is+it">
... Google what time is it
... </a>
... <br/>
... It's 4:38.
... <br/>
... 2018. 2. 5. 5:38:41 PM
... </div>'''
>>> soup = BeautifulSoup(html, 'lxml')
>>> div = soup.find('div', class_='content-cell mdl-cell mdl-cell--6-col mdl-typography--body-1')
>>> contents = [x.strip() for x in div.contents if isinstance(x, NavigableString)]
>>> contents
['Contents', '', "It's 4:38.", '2018. 2. 5. 5:38:41 PM']
From this, you can understand what NavigableString is. Now, to get the answer and the timestamp you can simply join the last two elements of the list.
>>> ' '.join(contents[-2:])
"It's 4:38. 2018. 2. 5. 5:38:41 PM"
text = '<div class="content-cell mdl-cell mdl-cell--6-col mdl-typography--body-1">\n Contents\n \n Google what time is it\n \n <br/>\n It\'s 4:38.\n <br/>\n 2018. 2. 5. 5:38:41 PM\n </div>'
s = BeautifulSoup(text,"lxml")
>>> s.find("br").findNext("br").next
'\n 2018. 2. 5. 5:38:41 PM\n '
>>> s.find("br").next
"\n It's 4:38.\n "
Use select_one() in combo with SelectorGadget Chrome extension to grab CSS selectors by clicking on the desired element in your browser:
soup.select_one('.YwPhnf').text
# 09:06
Or you can use stripped_strings, but it's not as pretty as using CSS selectors:
html = '''
<div class="content-cell mdl-cell mdl-cell--6-col mdl-typography--body-1">
Contents
<a href="https://www.google.com/search?q=Google+what+time+is+it">
Google what time is it
</a>
<br/>
It's 4:38.
<br/>
2018. 2. 5. 5:38:41 PM
</div>
'''
soup = BeautifulSoup(html, 'lxml')
# returns a generator object
current_time = list(soup.select_one('.content-cell').stripped_strings)[2]
print(current_time)
# It's 4:38.
Code:
from bs4 import BeautifulSoup
import requests, lxml
headers = {
    'User-agent':
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/70.0.3538.102 Safari/537.36 Edge/18.19582"
}
params = {
    "q": "what time it is",  # query
    "gl": "us",              # country to search from
    "hl": "en"               # language
}
html = requests.get("https://www.google.com/search", headers=headers, params=params)
soup = BeautifulSoup(html.text, 'lxml')
current_time = soup.select_one('.YwPhnf').text
current_date = soup.select_one('.KfQeJ:nth-child(1)').text
print(f'{current_time}\n{current_date}')
# 2:11 AM
# September 11, 2021
Alternatively, you can achieve the same thing by using Google Direct Answer Box API from SerpApi. It's a paid API with a free plan.
The difference in your case is that you only need to get the data you want from the structured JSON, rather than figuring out how to extract things and maintaining the parser over time when something stops working because of changes in the HTML.
Code to integrate:
from serpapi import GoogleSearch
params = {
    "api_key": "YOUR_API_KEY",
    "engine": "google",
    "q": "what time it is",
    "gl": "us",
    "hl": "en"
}
search = GoogleSearch(params)
results = search.get_dict()
print(results['answer_box']['result'])
# 2:07 AM
Disclaimer, I work for SerpApi.

Extracting specific information from fetched HTML code using python

I'm a relative newbie in Python. I need some advice for a bioinformatics project. It's about converting certain enzyme IDs to others.
What I've already done, and what works, is fetching the HTML code for a list of IDs from the Rhea database:
url2 = "http://www.rhea-db.org/reaction?id=16952"
f_xml2 = open("xml_tempfile2.txt", "w")

fetch2 = pycurl.Curl()
fetch2.setopt(fetch2.URL, url2)
fetch2.setopt(fetch.WRITEDATA, f_xml2)
fetch2.perform()
fetch2.close
So the HTML code is saved to a temporary txt file (I know, possibly not the most elegant way to do stuff, but it works for me ;).
Now what I am interested in is the following part from the HTML:
<p>
<h3>Same participants, different directions</h3>
<div>
<span>RHEA:16949</span>
<span class="icon-question">myo-inositol + NAD(+) <?> scyllo-inosose + H(+) + NADH</span>
</div><div>
<span>RHEA:16950</span>
<span class="icon-arrow-right">myo-inositol + NAD(+) => scyllo-inosose + H(+) + NADH</span>
</div><div>
<span>RHEA:16951</span>
<span class="icon-arrow-left-1">scyllo-inosose + H(+) + NADH => myo-inositol + NAD(+)</span>
</div>
</p>
I want to go through the code until the class "icon-arrow-right" is reached (this expression is unique in the HTML). Then I want to extract the information of "RHEA:XXXXXX" from the line above. So in this example, I want to end up with 16950.
Is there a simple way to do this? I've already experimented with HTMLparser but couldn't get it to work in a way that it looks for a certain class and then gives me the ID from the line above.
Thank you very much in advance!
You can use an HTML parser like BeautifulSoup to do this:
>>> from bs4 import BeautifulSoup
>>> html = """ <p>
... <h3>Same participants, different directions</h3>
... <div>
... <span>RHEA:16949</span>
... <span class="icon-question">myo-inositol + NAD(+) <?> scyllo-inosose + H(+) + NADH</span>
... </div><div>
... <span>RHEA:16950</span>
... <span class="icon-arrow-right">myo-inositol + NAD(+) => scyllo-inosose + H(+) + NADH</span>
... </div><div>
... <span>RHEA:16951</span>
... <span class="icon-arrow-left-1">scyllo-inosose + H(+) + NADH => myo-inositol + NAD(+)</span>
... </div>
... </p>"""
>>> soup = BeautifulSoup(html, 'html.parser')
>>> soup.find('span', class_='icon-arrow-right').find_previous_sibling().get_text()
'RHEA:16950'
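Since the question asked for just 16950, you can split off the RHEA: prefix from that result, continuing with the same soup object:
>>> soup.find('span', class_='icon-arrow-right').find_previous_sibling().get_text().split(':')[1]
'16950'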
