BeautifulSoup Python get content - python

I don't usually play with BeautifulSoup in Python, so I am struggling to find the value 8.133,00 that corresponds to the Ibex 35 on this web page: https://es.investing.com/indices/indices-futures
So far I am getting all the info of the page, but I can't filter to get that value:
from urllib.request import Request, urlopen
from bs4 import BeautifulSoup

site = 'https://es.investing.com/indices/indices-futures'
hardware = {'User-Agent': 'Mozilla/5.0 (X11; Linux x86_64; rv:106.0) Gecko/20100101 Firefox/106.0'}
request = Request(site, headers=hardware)
page = urlopen(request)
soup = BeautifulSoup(page, 'html.parser')
print(soup)
I'd appreciate a hand getting that value.
Regards

Here is a way of getting that bit of information: a dataframe with all the info in the table containing IBEX 35, DAX, and so on. You can then slice that dataframe as you wish (see the short slicing sketch after the output below).
import pandas as pd
from bs4 import BeautifulSoup as bs
import cloudscraper
scraper = cloudscraper.create_scraper(disableCloudflareV1=True)
pd.set_option('display.max_columns', None)
pd.set_option('display.max_colwidth', None)
url = 'https://es.investing.com/indices/indices-futures'
r = scraper.get(url)  # cloudscraper supplies its own browser-like headers, so no separate headers dict is needed
soup = bs(r.text, 'html.parser')
table = soup.select_one('table[class="datatable_table__D_jso quotes-box_table__nndS2 datatable_table--mobile-basic__W2ilt"]')
df = pd.read_html(str(table))[0]
print(df)
Result in terminal:
0 1 2 3 4
0 IBEX 35derived 8.098,10 -3510 -0,43% NaN
1 US 500derived 3.991,90 355 +0,90% NaN
2 US Tech 100derived 11.802,20 1962 +1,69% NaN
3 Dow Jones 33.747,86 3249 +0,10% NaN
4 DAXderived 14.224,86 7877 +0,56% NaN
5 Índice dólarderived 106255 -1837 -1,70% NaN
6 Índice euroderived 11404 89 +0,79% NaN
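For example, to pull out just the IBEX 35 price from that dataframe, a minimal slicing sketch (the column numbers and the 'IBEX 35' label are taken from the output above and may change with the site's markup):
ibex_row = df[df[0].str.contains('IBEX 35', na=False)]
print(ibex_row.iloc[0, 1])  # the price column, e.g. '8.098,10' when this output was captured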
See https://pypi.org/project/cloudscraper/

Related

Pagination not showing up in parsed content (BeautifulSoup)

I am new to Python programming and I have a problem with pagination while using Beautiful Soup. All of the parsed content shows up except the pagination contents (I had highlighted the missing lines in a screenshot).
Website link: https://www.yellowpages.lk/Medical.php
from bs4 import BeautifulSoup
import requests
import time
import pandas as pd
from lxml import html
url = "https://www.yellowpages.lk/Medical.php"
result = requests.get(url)
time.sleep(5)
doc = BeautifulSoup(result.content, "lxml")
time.sleep(5)
Table = doc.find('table',{'id':'MedicalFacility'}).find('tbody').find_all('tr')
Page = doc.select('.col-lg-10')
C_List = []
D_List = []
N_List = []
A_List = []
T_List = []
W_List = []
V_List = []
M_List = []
print(doc.prettify())
print(Page)
while True:
    for i in range(0,25):
        Sort = Table[i]
        Category = Sort.find_all('td')[0].get_text().strip()
        C_List.insert(i,Category)
        District = Sort.find_all('td')[1].get_text().strip()
        D_List.insert(i,District)
        Name = Sort.find_all('td')[2].get_text().strip()
        N_List.insert(i,Name)
        Address = Sort.find_all('td')[3].get_text().strip()
        A_List.insert(i,Address)
        Telephone = Sort.find_all('td')[4].get_text().strip()
        T_List.insert(i,Telephone)
        Whatsapp = Sort.find_all('td')[5].get_text().strip()
        W_List.insert(i,Whatsapp)
        Viber = Sort.find_all('td')[6].get_text().strip()
        V_List.insert(i,Viber)
        MoH_Division = Sort.find_all('td')[7].get_text().strip()
        M_List.insert(i,MoH_Division)
I tried using .find() with the class and .select('.class') to see if the pagination contents show up, but so far nothing has worked.
The pagination is more or less superfluous on that page: all of the data is already loaded, and JavaScript only generates the pagination for display purposes, so Requests will get the full data anyway.
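You can check that for yourself before doing any parsing: request the page once and count the rows in the table, reusing the id from the question (a quick sketch):
import requests
from bs4 import BeautifulSoup

headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/105.0.0.0 Safari/537.36'}
doc = BeautifulSoup(requests.get('https://www.yellowpages.lk/Medical.php', headers=headers).text, 'html.parser')
rows = doc.find('table', {'id': 'MedicalFacility'}).find('tbody').find_all('tr')
print(len(rows))  # full row count in one response, not just the 25 rows of a displayed page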
Here is one way of getting that information in full:
import requests
from bs4 import BeautifulSoup as bs
import pandas as pd
pd.set_option('display.max_columns', None)
pd.set_option('display.max_colwidth', None)
headers = {
'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/105.0.0.0 Safari/537.36'
}
url = 'https://www.yellowpages.lk/Medical.php'
r = requests.get(url, headers=headers)
soup = bs(r.text, 'html.parser')
table = soup.select_one('table[id="MedicalFacility"]')
df = pd.read_html(str(table))[0]
print(df)
Result in terminal:
Category District Name Address Telephone WhatsApp Viber MoH Division
0 Pharmacy Gampaha A & B Pharmacy 171 Negambo Road Veyangoda 0778081515 9.477808e+10 9.477808e+10 Aththanagalla
1 Pharmacy Trincomalee A A Pharmacy 350 Main Street Kanthale 0755576998 9.475558e+10 9.475558e+10 Kanthale
2 Pharmacy Colombo A Baur & Co Pvt Ltd 55 Grandpass Rd Col 14 0768200100 9.476820e+10 9.476820e+10 CMC
3 Pharmacy Colombo A Colombo Pharmacy Ug 93 97 Peoples Park Colombo 11 0773771446 9.477377e+10 NaN CMC
4 Pharmacy Trincomalee A R Pharmacy Main Street Kinniya-3 0771413838 9.477500e+10 9.477500e+10 Kinniya
... ... ... ... ... ... ... ... ...
1968 Pharmacy Ampara Zam Zam Pharmacy Main Street Akkaraipattu 0672277698 9.477756e+10 9.477756e+10 Akkaraipattu
1969 Pharmacy Batticaloa Zattra Pharmacy Jummah Mosque Rd Oddamawadi-1 0766689060 9.476669e+10 NaN Oddamavady
1970 Pharmacy Puttalam Zeenath Pharmacy Norochcholei 0728431622 NaN NaN Kalpitiya
1971 Pharmacy Puttalam Zidha Pharmacy Norochcholei 0773271222 NaN NaN Kalpitiya
1972 Pharmacy Gampaha Zoomcare Pharmacy & Grocery 182/B/1 Rathdoluwa Seeduwa 0768378112 NaN NaN Seeduwa
1973 rows × 8 columns
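Note that pandas has parsed the WhatsApp and Viber columns as floats in the output above; if you would rather keep them as plain strings, read_html accepts a converters mapping (a small sketch, assuming the column labels match the table header exactly):
df = pd.read_html(str(table), converters={'WhatsApp': str, 'Viber': str})[0]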
See the pandas documentation (https://pandas.pydata.org/pandas-docs/stable/index.html), the BeautifulSoup documentation (https://beautiful-soup-4.readthedocs.io/en/latest/index.html) and, lastly, the Requests documentation (https://requests.readthedocs.io/en/latest/).
If you are using pandas, a couple of lines of code are enough to put the entire table into a dataframe.
All you need is the pandas.read_html() function, as follows:
Code:
import pandas as pd
df = pd.read_html("https://www.yellowpages.lk/Medical.php")[0]
print(df)
Output: essentially the same dataframe of 1973 rows × 8 columns shown in the answer above.

How to extract all h2 texts from some URLs and store to CSV?

I need to extract all h2 text from some links, and I tried it using BeautifulSoup, but it didn't work.
I also want to output the results to a CSV file.
# -*- coding: utf-8 -*-
from bs4 import BeautifulSoup
import requests
import csv
r01 = requests.get("https://www.seikatsu110.jp/library/vermin/vr_termite/23274/")
r02 = requests.get("https://yuko-navi.com/termite-control-subsidies")
soup_content01 = BeautifulSoup(r01.content, "html.parser")
soup_content02 = BeautifulSoup(r02.content, "html.parser")
alltxt01 = soup_content01.get_text()
alltxt02 = soup_content02.get_text()
with open('h2.csv', 'w+',newline='',encoding='utf-8') as f:
    n = 0
    for subheading01 in soup_content01.find_all('h2'):
        sh01 = subheading01.get_text()
        writer = csv.writer(f, lineterminator='\n')
        writer.writerow([n, sh01])
        n += 1
    for subheading02 in soup_content02.find_all('h2'):
        sh02 = subheading02.get_text()
        writer = csv.writer(f, lineterminator='\n')
        writer.writerow([n, sh01, sh02])
        n += 1
    pass
expected csv output is as below:
sh01,sh02
シロアリ駆除に適用される補助金や保険は?,1章 シロアリ駆除工事に補助金はない!
シロアリ駆除の費用を補助金なしで抑える方法,2章 確定申告時に「雑損控除」申請がおすすめ
シロアリ駆除の費用ってどれくらいかかる?,3章 「雑損控除」として負担してもらえる金額
要件を満たせば加入できるシロアリ専門の保険がある?,4章 「雑損控除」の申請方法
シロアリには5年保証がある!,5章 損したくないなら信頼できる業者を選ぼう!
まとめ,まとめ
この記事の監修者 ナカザワ氏について,
この記事の監修者 ナカザワ氏について,
シロアリ駆除のおすすめ記事,
関連記事カテゴリ一覧,
シロアリ駆除の記事アクセスランキング,
シロアリ駆除の最新記事,
カテゴリ別記事⼀覧,
シロアリ駆除の業者を地域から探す,
関連カテゴリから業者を探す,
シロアリ駆除業者ブログ,
サービスカテゴリ,
生活110番とは,
加盟希望・ログイン,
Could somebody please tell me what's wrong with this code?
Just as an addition to the itertools approach of @Barry the Platipus, which is great: pandas is also my favorite, and there is an alternative way using a native dict comprehension.
Iterate over your URLs and create a dict that holds a number (or the URL) as key and a list of heading texts as value. This can easily be transformed into a DataFrame and exported to CSV:
d = {}
for e,url in enumerate(urls,1):
    soup = BeautifulSoup(requests.get(url).content)
    d[f'sh{e}'] = [h.get_text() for h in soup.find_all('h2')]

pd.DataFrame(dict([ (k,pd.Series(v)) for k,v in d.items()]))#.to_csv('yourfile.csv', index = False)
Example
from bs4 import BeautifulSoup
import requests
import pandas as pd
urls = ['https://www.seikatsu110.jp/library/vermin/vr_termite/23274/','https://yuko-navi.com/termite-control-subsidies']
d = {}
for e,url in enumerate(urls,1):
    soup = BeautifulSoup(requests.get(url).content)
    d[f'sh{e}'] = [h.get_text() for h in soup.find_all('h2')]

pd.DataFrame(dict([ (k,pd.Series(v)) for k,v in d.items()]))#.to_csv('yourfile.csv', index = False)
Output
sh1 | sh2
シロアリ駆除に適用される補助金や保険は? | 1章 シロアリ駆除工事に補助金はない!
シロアリ駆除の費用を補助金なしで抑える方法 | 2章 確定申告時に「雑損控除」申請がおすすめ
シロアリ駆除の費用ってどれくらいかかる? | 3章 「雑損控除」として負担してもらえる金額
要件を満たせば加入できるシロアリ専門の保険がある? | 4章 「雑損控除」の申請方法
シロアリには5年保証がある! | 5章 損したくないなら信頼できる業者を選ぼう!
まとめ | まとめ
この記事の監修者 ナカザワ氏について | nan
この記事の監修者 ナカザワ氏について | nan
シロアリ駆除のおすすめ記事 | nan
関連記事カテゴリ一覧 | nan
シロアリ駆除の記事アクセスランキング | nan
シロアリ駆除の最新記事 | nan
カテゴリ別記事⼀覧 | nan
シロアリ駆除の業者を地域から探す | nan
関連カテゴリから業者を探す | nan
シロアリ駆除業者ブログ | nan
サービスカテゴリ | nan
生活110番とは | nan
加盟希望・ログイン | nan
This is one way to reach your goal as stated:
import requests
from bs4 import BeautifulSoup as bs
import pandas as pd
from itertools import zip_longest
headers = {
'User-Agent': 'Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/104.0.5112.79 Safari/537.36'
}
r01 = requests.get("https://www.seikatsu110.jp/library/vermin/vr_termite/23274/", headers=headers)
r02 = requests.get("https://yuko-navi.com/termite-control-subsidies", headers=headers)
first_url_headings = [x.get_text(strip=True) for x in bs(r01.text, 'html.parser').select('h2')]
second_url_headings = [x.get_text(strip=True) for x in bs(r02.text, 'html.parser').select('h2')]
df_list = list(zip_longest(first_url_headings, second_url_headings))
df = pd.DataFrame(df_list, columns = ['First site', 'Second site'])
df.to_csv('termites_stuffs.csv')
print(df)
Result in terminal (also saved as a csv file):
    First site | Second site
0   シロアリ駆除に適用される補助金や保険は? | 1章 シロアリ駆除工事に補助金はない!
1   シロアリ駆除の費用を補助金なしで抑える方法 | 2章 確定申告時に「雑損控除」申請がおすすめ
2   シロアリ駆除の費用ってどれくらいかかる? | 3章 「雑損控除」として負担してもらえる金額
3   要件を満たせば加入できるシロアリ専門の保険がある? | 4章 「雑損控除」の申請方法
4   シロアリには5年保証がある! | 5章 損したくないなら信頼できる業者を選ぼう!
5   まとめ | まとめ
6   この記事の監修者 ナカザワ氏について |
7   この記事の監修者 ナカザワ氏について |
8   シロアリ駆除のおすすめ記事 |
9   関連記事カテゴリ一覧 |
10  シロアリ駆除の記事アクセスランキング |
11  シロアリ駆除の最新記事 |
12  カテゴリ別記事⼀覧 |
13  シロアリ駆除の業者を地域から探す |
14  関連カテゴリから業者を探す |
15  シロアリ駆除業者ブログ |
16  サービスカテゴリ |
17  生活110番とは |
18  加盟希望・ログイン |
Documentation for BeautifulSoup: https://beautiful-soup-4.readthedocs.io/en/latest/index.html
Also for Requests: https://requests.readthedocs.io/en/latest/
And for pandas: https://pandas.pydata.org/pandas-docs/stable/index.html

How to scrape website while iterate on multiple pages

Trying to scrape this website using python beautifulsoup:
https://www.leandjaya.com/katalog
I am having some challenges navigating the multiple pages of the website and scraping them using Python. This website has 11 pages, and I am curious to know the best option to achieve this, such as using a for loop and breaking the loop if a page doesn't exist.
This is my initial code. I have set a big number, 50, but this doesn't seem like a good option.
page = 1
while page != 50:
    url=f"https://www.leandjaya.com/katalog/ss/1/{page}/"
    main = requests.get(url)
    pmain = BeautifulSoup(main.text,'lxml')
    page = page + 1
Sample output:
https://www.leandjaya.com/katalog/ss/1/1/
https://www.leandjaya.com/katalog/ss/1/2/
https://www.leandjaya.com/katalog/ss/1/3/
https://www.leandjaya.com/katalog/ss/1/<49>/
This is one way to extract that info and display it in a dataframe, based on an unknown number of pages with data:
import requests
import pandas as pd
from bs4 import BeautifulSoup as bs
cars_list = []
headers = {
'User-Agent': 'Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/104.0.5112.79 Safari/537.36'
}
s = requests.Session()
s.headers.update(headers)
counter = 1
while True:
    try:
        print('page:', counter)
        url = f'https://www.leandjaya.com/katalog/ss/1/{counter}/'
        r = s.get(url)
        soup = bs(r.text, 'html.parser')
        cars_cards = soup.select('div.item')
        if len(cars_cards) < 1:
            print('all done, no cars left')
            break
        for car in cars_cards:
            car_name = car.select_one('div.item-title').get_text(strip=True)
            car_price = car.select_one('div.item-price').get_text(strip=True)
            cars_list.append((car_name, car_price))
        counter = counter + 1
    except Exception as e:
        print('all done')
        break
df = pd.DataFrame(cars_list, columns = ['Car', 'Price'])
print(df)
Result:
page: 1
page: 2
page: 3
page: 4
page: 5
page: 6
page: 7
page: 8
page: 9
page: 10
page: 11
page: 12
all done, no cars left
Car Price
0 HONDA CRV 4X2 2.0 AT 2001 DP20jt
1 DUJUAL XPANDER 1.5 GLS 2018 MANUAL DP53jt
2 NISSAN JUKE 1.5 CVT 2011 MATIC DP33jt
3 Mitsubishi Xpander 1.5 Exceed Manual 2018 DP50jt
4 BMW X1 2.0 AT SDRIVE 2011 DP55jt
... ... ...
146 Daihatsu Sigra 1.2 R AT DP130jt
147 Daihatsu Xenia Xi 2010 DP85jt
148 Suzuki Mega Carry Pick Up 1.5 DP90jt
149 Honda Mobilio Tipe E Prestige DP150jt
150 Honda Freed Tipe S Rp. 170jtRp. 165jt
151 rows × 2 columns
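If you then want the down payment as a number, here is a hedged sketch based on the 'DP...jt' pattern visible in the Price column above (rows with a different format, like the 'Rp. ...' one, will simply come out as NaN):
df['DP_jt'] = df['Price'].str.extract(r'DP(\d+)jt', expand=False).astype(float)  # e.g. 'DP20jt' -> 20.0
print(df.sort_values('DP_jt').head())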
The relevant documentations for the packages used above can be found at:
https://beautiful-soup-4.readthedocs.io/en/latest/index.html
https://requests.readthedocs.io/en/latest/
https://pandas.pydata.org/pandas-docs/stable/index.html

Web scraping multiple pages in python

So I'm trying to web scrape a website that has around 500 pages of used cars, and each page has around 22 cars. I managed to extract the first 22 cars from the first page, but how can I make my code iterate through all the pages so I can get all the cars? (I'm a beginner, so sorry if my code is not well structured.)
from bs4 import BeautifulSoup
import requests
import pandas as pd
import numpy as np
website = 'https://ksa.yallamotor.com/used-cars/search'
headers = {
'User-Agent' : 'Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:103.0) Gecko/20100101 Firefox/103.0'
}
response = requests.get(website, headers=headers)
links = []
car_name = []
model_year = []
cars = []
soup = BeautifulSoup(response.text, 'lxml')
cars = soup.find_all('div', class_='singleSearchCard m24t p12 bg-w border-gray border8')
for c in cars:
    l = "https://ksa.yallamotor.com/" + c.find('a', class_='black-link')['href']
    links.append(l)

for i in range(0,22):
    url = links[i]
    session_object = requests.Session()
    result = session_object.get(url, headers=headers)
    soup = BeautifulSoup(result.text, 'lxml')
    name = soup.find('h1', class_="font24")
    car_name.append(name.text)
    y = soup.find_all('div', class_="font14 text-center font-b m2t")[0]
    model_year.append(y.text)
The website is under Cloudflare protection, so you would need something like cloudscraper (pip install cloudscraper). The following code will get you your data (you can further analyse each car, get the details you need, etc.):
import cloudscraper
from bs4 import BeautifulSoup
scraper = cloudscraper.create_scraper()
for x in range(1, 501):
    r = scraper.get(f'https://ksa.yallamotor.com/used-cars/search?page={x}&sort=updated_desc')
    soup = BeautifulSoup(r.text, 'html.parser')
    cars = soup.select('.singleSearchCard')
    for car in cars:
        url = car.select_one('a.black-link')
        print(url.get_text(strip=True), url['href'])
Result printed in terminal:
Used BMW 7 Series 730Li 2018 /used-cars/bmw/7-series/2018/used-bmw-7-series-2018-jeddah-1294758
Used Infiniti QX80 5.6L Luxe (8 Seats) 2020 /used-cars/infiniti/qx80/2020/used-infiniti-qx80-2020-jeddah-1295458
Used Chevrolet Suburban 5.3L LS 2WD 2018 /used-cars/chevrolet/suburban/2018/used-chevrolet-suburban-2018-jeddah-1302084
Used Chevrolet Silverado 2016 /used-cars/chevrolet/silverado/2016/used-chevrolet-silverado-2016-jeddah-1297430
Used GMC Yukon 5.3L SLE (2WD) 2018 /used-cars/gmc/yukon/2018/used-gmc-yukon-2018-jeddah-1304469
Used GMC Yukon 5.3L SLE (2WD) 2018 /used-cars/gmc/yukon/2018/used-gmc-yukon-2018-jeddah-1304481
Used Chevrolet Impala 3.6L LS 2018 /used-cars/chevrolet/impala/2018/used-chevrolet-impala-2018-jeddah-1297427
Used Infiniti Q70 3.7L Luxe 2019 /used-cars/infiniti/q70/2019/used-infiniti-q70-2019-jeddah-1295235
Used Chevrolet Tahoe LS 2WD 2018 /used-cars/chevrolet/tahoe/2018/used-chevrolet-tahoe-2018-jeddah-1305486
Used Mercedes-Benz 450 SEL 2018 /used-cars/mercedes-benz/450-sel/2018/used-mercedes-benz-450-sel-2018-jeddah-1295830
[...]
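If you also want the per-car details that the question's own loop collected (name and model year), you can reuse the same scraper on each link; a rough sketch that relies on the selectors from the question (h1.font24 and the 'font14 text-center font-b m2t' div), which may of course change on the site:
# continue inside the `for car in cars:` loop above
href = car.select_one('a.black-link')['href']
detail_soup = BeautifulSoup(scraper.get('https://ksa.yallamotor.com' + href).text, 'html.parser')
name = detail_soup.find('h1', class_='font24')
year = detail_soup.find('div', class_='font14 text-center font-b m2t')
if name and year:
    print(name.get_text(strip=True), year.get_text(strip=True))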

Trying to scrape a table from a website with <div tags

I am trying to scrape this table https://momentranks.com/topshot/account/mariodustice?limit=250
I have tried this:
import requests
from bs4 import BeautifulSoup
url = 'https://momentranks.com/topshot/account/mariodustice?limit=250'
page = requests.get(url)
soup = BeautifulSoup(page.content, 'lxml')
table = soup.find_all('table', attrs={'class':'Table_tr__1JI4P'})
But it returns an empty list. Can someone give advice on how to approach this?
Selenium is a bit overkill when there is an available api. Just get the data directly:
import requests
import pandas as pd
url = 'https://momentranks.com/api/account/details'
rows = []
page = 0
while True:
    payload = {
        'filters': {'page': '%s' %page, 'limit': "250", 'type': "moments"},
        'flowAddress': "f64f1763e61e4087"}
    jsonData = requests.post(url, json=payload).json()
    data = jsonData['data']
    rows += data
    print('%s of %s' %(len(rows), jsonData['totalCount']))
    if len(rows) == jsonData['totalCount']:
        break
    page += 1

df = pd.DataFrame(rows)
Output:
print(df)
_id flowId ... challenges priceFloor
0 619d2f82fda908ecbe74b607 24001245 ... NaN NaN
1 61ba30837c1f070eadc0f8e4 25651781 ... NaN NaN
2 618d87b290209c5a51128516 21958292 ... NaN NaN
3 61aea763fda908ecbe9e8fbf 25201655 ... NaN NaN
4 60c38188e245f89daf7c4383 15153366 ... NaN NaN
... ... ... ... ...
1787 61d0a2c37c1f070ead6b10a8 27014524 ... NaN NaN
1788 61d0a2c37c1f070ead6b10a8 27025557 ... NaN NaN
1789 61e9fafcd8acfcf57792dc5d 28711771 ... NaN NaN
1790 61ef40fcd8acfcf577273709 28723650 ... NaN NaN
1791 616a6dcb14bfee6c9aba30f9 18394076 ... NaN NaN
[1792 rows x 40 columns]
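Each row is a nested JSON object (the last answer below, for example, reads ['moment']['playerName'] and ['MRvalue']), so if you want the nested fields as flat columns, a small optional sketch with pandas.json_normalize (the dotted column names are assumptions based on those fields):
flat = pd.json_normalize(rows)  # nested keys become dotted column names, e.g. 'moment.playerName'
print(flat[['moment.playerName', 'MRvalue']].head())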
The data is loaded into the page by JS code, so you can't use requests alone; however, you can use Selenium.
Keep in mind that Selenium's driver.get doesn't wait for the page to completely load, which means you need to wait yourself.
Here is something to get you started with Selenium:
import time
from bs4 import BeautifulSoup
from selenium import webdriver

driver = webdriver.Chrome()  # or whichever browser driver you have set up
url = 'https://momentranks.com/topshot/account/mariodustice?limit=250'
driver.get(url)
time.sleep(5)  # edit this delay (in seconds) depending on your case
soup = BeautifulSoup(driver.page_source, 'lxml')
table = soup.find_all('table', attrs={'class':'Table_tr__1JI4P'})
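Instead of a fixed time.sleep, an explicit wait is usually more reliable; here is a sketch using Selenium's WebDriverWait (the CSS selector is only a guess and should point at something that appears once the table has rendered):
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC

# wait up to 15 seconds for at least one table row to show up before parsing
WebDriverWait(driver, 15).until(EC.presence_of_element_located((By.CSS_SELECTOR, 'table tbody tr')))
soup = BeautifulSoup(driver.page_source, 'lxml')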
The source HTML you see in your browser is rendered using JavaScript. When you use requests this does not happen, which is why your script is not working. If you print the HTML that is returned, it will not contain the information you want.
All of the information is, though, available via the API which your browser calls to build the page. You will need to take a detailed look at the JSON data structure that is returned to decide which information you wish to extract.
The following example shows how to get a list of the names and MRvalue of each player:
import requests
from bs4 import BeautifulSoup
import json
s = requests.Session()
headers = {'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_9_3) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/35.0.1916.47 Safari/537.36'}
url = 'https://momentranks.com/topshot/account/mariodustice?limit=250'
req_main = s.get(url, headers=headers)
soup = BeautifulSoup(req_main.content, 'lxml')
data = soup.find('script', id='__NEXT_DATA__')
json_data = json.loads(data.string)
account = json_data['props']['pageProps']['account']['flowAddress']
post_data = {"flowAddress" : account,"filters" : {"page" : 0, "limit":"250", "type":"moments"}}
req_json = s.post('https://momentranks.com/api/account/details', headers=headers, data=post_data)
player_data = req_json.json()
for player in player_data['data']:
    name = player['moment']['playerName']
    mrvalue = player['MRvalue']
    print(f"{name:30} ${mrvalue:.02f}")
Giving you output starting:
Scottie Barnes $672.38
Cade Cunningham $549.00
Josh Giddey $527.11
Franz Wagner $439.26
Trae Young $429.51
A'ja Wilson $387.07
Ja Morant $386.00
The flowAddress is needed from the first page request to allow the API to be used correctly. This happens to be embedded in a <script> section at the bottom of the HTML.
All of this was worked out by using the browser's network tools to watch how the actual webpage made requests to the server to build its page.
