I am new to Python programming and I have a problem with pagination while using Beautiful Soup. All the parsed content shows up except the pagination contents (in a screenshot of the page I highlighted the rows that do not show up).
The website is https://www.yellowpages.lk/Medical.php.
from bs4 import BeautifulSoup
import requests
import time
import pandas as pd
from lxml import html
url = "https://www.yellowpages.lk/Medical.php"
result = requests.get(url)
time.sleep(5)
doc = BeautifulSoup(result.content, "lxml")
time.sleep(5)
Table = doc.find('table',{'id':'MedicalFacility'}).find('tbody').find_all('tr')
Page = doc.select('.col-lg-10')
C_List = []
D_List = []
N_List = []
A_List = []
T_List = []
W_List = []
V_List = []
M_List = []
print(doc.prettify())
print(Page)
while True:
    for i in range(0, 25):
        Sort = Table[i]
        Category = Sort.find_all('td')[0].get_text().strip()
        C_List.insert(i, Category)
        District = Sort.find_all('td')[1].get_text().strip()
        D_List.insert(i, District)
        Name = Sort.find_all('td')[2].get_text().strip()
        N_List.insert(i, Name)
        Address = Sort.find_all('td')[3].get_text().strip()
        A_List.insert(i, Address)
        Telephone = Sort.find_all('td')[4].get_text().strip()
        T_List.insert(i, Telephone)
        Whatsapp = Sort.find_all('td')[5].get_text().strip()
        W_List.insert(i, Whatsapp)
        Viber = Sort.find_all('td')[6].get_text().strip()
        V_List.insert(i, Viber)
        MoH_Division = Sort.find_all('td')[7].get_text().strip()
        M_List.insert(i, MoH_Division)
I tried using .find() with a class and .select('.class') to see if the pagination contents show up, but so far nothing has worked.
The pagination is more or less superfluous on that page: the full data is loaded anyway, and JavaScript generates the pagination just for display purposes, so requests will get the complete data regardless.
Here is one way of getting that information in full:
import requests
from bs4 import BeautifulSoup as bs
import pandas as pd
pd.set_option('display.max_columns', None)
pd.set_option('display.max_colwidth', None)
headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/105.0.0.0 Safari/537.36'
}
url = 'https://www.yellowpages.lk/Medical.php'
r = requests.get(url, headers=headers)
soup = bs(r.text, 'html.parser')
table = soup.select_one('table[id="MedicalFacility"]')
df = pd.read_html(str(table))[0]
print(df)
Result in terminal:
Category District Name Address Telephone WhatsApp Viber MoH Division
0 Pharmacy Gampaha A & B Pharmacy 171 Negambo Road Veyangoda 0778081515 9.477808e+10 9.477808e+10 Aththanagalla
1 Pharmacy Trincomalee A A Pharmacy 350 Main Street Kanthale 0755576998 9.475558e+10 9.475558e+10 Kanthale
2 Pharmacy Colombo A Baur & Co Pvt Ltd 55 Grandpass Rd Col 14 0768200100 9.476820e+10 9.476820e+10 CMC
3 Pharmacy Colombo A Colombo Pharmacy Ug 93 97 Peoples Park Colombo 11 0773771446 9.477377e+10 NaN CMC
4 Pharmacy Trincomalee A R Pharmacy Main Street Kinniya-3 0771413838 9.477500e+10 9.477500e+10 Kinniya
... ... ... ... ... ... ... ... ...
1968 Pharmacy Ampara Zam Zam Pharmacy Main Street Akkaraipattu 0672277698 9.477756e+10 9.477756e+10 Akkaraipattu
1969 Pharmacy Batticaloa Zattra Pharmacy Jummah Mosque Rd Oddamawadi-1 0766689060 9.476669e+10 NaN Oddamavady
1970 Pharmacy Puttalam Zeenath Pharmacy Norochcholei 0728431622 NaN NaN Kalpitiya
1971 Pharmacy Puttalam Zidha Pharmacy Norochcholei 0773271222 NaN NaN Kalpitiya
1972 Pharmacy Gampaha Zoomcare Pharmacy & Grocery 182/B/1 Rathdoluwa Seeduwa 0768378112 NaN NaN Seeduwa
1973 rows × 8 columns
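One note: on newer pandas versions (2.1 and later), passing a literal HTML string to read_html() triggers a deprecation warning, so you may want to wrap the markup in a StringIO object:
from io import StringIO
df = pd.read_html(StringIO(str(table)))[0]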
See the pandas documentation, the BeautifulSoup documentation, and lastly the Requests documentation.
If you are using pandas, a couple of lines of code will put the entire table into a dataframe.
All you need is the pandas.read_html() function, as follows:
Code:
import pandas as pd
df = pd.read_html("https://www.yellowpages.lk/Medical.php")[0]
print(df)
Output: the same 1973-row dataframe shown in the answer above.
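If you then want the separate lists from the question's original code, they can be pulled straight out of the dataframe, using the column names shown in the output above, e.g.:
C_List = df['Category'].tolist()
D_List = df['District'].tolist()
N_List = df['Name'].tolist()
A_List = df['Address'].tolist()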
Related
I don't usually play with BeautifulSoup in Python, so I am struggling to find the value 8.133,00 that matches the Ibex 35 on the web page: https://es.investing.com/indices/indices-futures
So far I am getting all the info on the page, but I can't filter it to get that value:
from urllib.request import Request, urlopen
from bs4 import BeautifulSoup

site = 'https://es.investing.com/indices/indices-futures'
hardware = {'User-Agent': 'Mozilla/5.0 (X11; Linux x86_64; rv:106.0) Gecko/20100101 Firefox/106.0'}
request = Request(site, headers=hardware)
page = urlopen(request)
soup = BeautifulSoup(page, 'html.parser')
print(soup)
I appreciate a hand to get that value.
Regards
Here is a way of getting that bit of information: a dataframe with all the info from the table containing IBEX 35, DAX, and so on, which you can then slice as you wish.
import pandas as pd
from bs4 import BeautifulSoup as bs
import cloudscraper
scraper = cloudscraper.create_scraper(disableCloudflareV1=True)
pd.set_option('display.max_columns', None)
pd.set_option('display.max_colwidth', None)
url = 'https://es.investing.com/indices/indices-futures'
r = scraper.get(url)  # cloudscraper supplies its own browser-like headers
soup = bs(r.text, 'html.parser')
table = soup.select_one('table[class="datatable_table__D_jso quotes-box_table__nndS2 datatable_table--mobile-basic__W2ilt"]')
df = pd.read_html(str(table))[0]
print(df)
Result in terminal:
0 1 2 3 4
0 IBEX 35derived 8.098,10 -3510 -0,43% NaN
1 US 500derived 3.991,90 355 +0,90% NaN
2 US Tech 100derived 11.802,20 1962 +1,69% NaN
3 Dow Jones 33.747,86 3249 +0,10% NaN
4 DAXderived 14.224,86 7877 +0,56% NaN
5 Índice dólarderived 106255 -1837 -1,70% NaN
6 Índice euroderived 11404 89 +0,79% NaN
See https://pypi.org/project/cloudscraper/
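Since the answer leaves the slicing to you, here is one way to pull the IBEX 35 quote out of that dataframe — a sketch based on the column layout in the printed result, where column 0 holds the index name and column 1 holds the value:
ibex_value = df.loc[df[0].str.contains('IBEX 35'), 1].iloc[0]
print(ibex_value)  # e.g. '8.098,10' at the time the table above was captured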
So I'm trying to web scrape a website that has around 500 pages of used cars, and each page has around 22 cars. I managed to extract the first 22 cars from the first page, but how can I make my code iterate through all the pages so I can get all the cars? (I'm a beginner, so sorry if my code is not well structured.)
from bs4 import BeautifulSoup
import requests
import pandas as pd
import numpy as np

website = 'https://ksa.yallamotor.com/used-cars/search'
headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:103.0) Gecko/20100101 Firefox/103.0'
}
response = requests.get(website, headers=headers)

links = []
car_name = []
model_year = []
cars = []

soup = BeautifulSoup(response.text, 'lxml')
cars = soup.find_all('div', class_='singleSearchCard m24t p12 bg-w border-gray border8')
for c in cars:
    l = "https://ksa.yallamotor.com/" + c.find('a', class_='black-link')['href']
    links.append(l)

for i in range(0, 22):
    url = links[i]
    session_object = requests.Session()
    result = session_object.get(url, headers=headers)
    soup = BeautifulSoup(result.text, 'lxml')
    name = soup.find('h1', class_="font24")
    car_name.append(name.text)
    y = soup.find_all('div', class_="font14 text-center font-b m2t")[0]
    model_year.append(y.text)
The website is under Cloudflare protection, so you would need something like cloudscraper (pip install cloudscraper). The following code will get you your data (you can then further analyse each car, get the details you need, etc.):
import cloudscraper
from bs4 import BeautifulSoup

scraper = cloudscraper.create_scraper()

for x in range(1, 501):
    r = scraper.get(f'https://ksa.yallamotor.com/used-cars/search?page={x}&sort=updated_desc')
    soup = BeautifulSoup(r.text, 'html.parser')
    cars = soup.select('.singleSearchCard')
    for car in cars:
        url = car.select_one('a.black-link')
        print(url.get_text(strip=True), url['href'])
Result printed in terminal:
Used BMW 7 Series 730Li 2018 /used-cars/bmw/7-series/2018/used-bmw-7-series-2018-jeddah-1294758
Used Infiniti QX80 5.6L Luxe (8 Seats) 2020 /used-cars/infiniti/qx80/2020/used-infiniti-qx80-2020-jeddah-1295458
Used Chevrolet Suburban 5.3L LS 2WD 2018 /used-cars/chevrolet/suburban/2018/used-chevrolet-suburban-2018-jeddah-1302084
Used Chevrolet Silverado 2016 /used-cars/chevrolet/silverado/2016/used-chevrolet-silverado-2016-jeddah-1297430
Used GMC Yukon 5.3L SLE (2WD) 2018 /used-cars/gmc/yukon/2018/used-gmc-yukon-2018-jeddah-1304469
Used GMC Yukon 5.3L SLE (2WD) 2018 /used-cars/gmc/yukon/2018/used-gmc-yukon-2018-jeddah-1304481
Used Chevrolet Impala 3.6L LS 2018 /used-cars/chevrolet/impala/2018/used-chevrolet-impala-2018-jeddah-1297427
Used Infiniti Q70 3.7L Luxe 2019 /used-cars/infiniti/q70/2019/used-infiniti-q70-2019-jeddah-1295235
Used Chevrolet Tahoe LS 2WD 2018 /used-cars/chevrolet/tahoe/2018/used-chevrolet-tahoe-2018-jeddah-1305486
Used Mercedes-Benz 450 SEL 2018 /used-cars/mercedes-benz/450-sel/2018/used-mercedes-benz-450-sel-2018-jeddah-1295830
[...]
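As a follow-up to "you can further analyse each car": here is a sketch of how the details the question was after (name and model year) could be pulled from each car's page, reusing the selectors from the question's own code (the class names may need verifying):
import cloudscraper
from bs4 import BeautifulSoup

scraper = cloudscraper.create_scraper()
details = []

for x in range(1, 3):  # a couple of pages as a demo; extend to 501 as in the answer above
    r = scraper.get(f'https://ksa.yallamotor.com/used-cars/search?page={x}&sort=updated_desc')
    soup = BeautifulSoup(r.text, 'html.parser')
    for card in soup.select('.singleSearchCard'):
        link = 'https://ksa.yallamotor.com' + card.select_one('a.black-link')['href']
        car_page = BeautifulSoup(scraper.get(link).text, 'html.parser')
        name = car_page.find('h1', class_='font24')  # selector taken from the question's code
        year = car_page.find('div', class_='font14 text-center font-b m2t')  # likewise
        details.append({
            'name': name.get_text(strip=True) if name else None,
            'year': year.get_text(strip=True) if year else None,
            'url': link,
        })

print(details[:5])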
I want to crawl maritime news from Fleetmon.com, including the detail pages, and save it to a text file. I tried BeautifulSoup in Python but it does not work properly.
import requests
from bs4 import BeautifulSoup
import pandas as pd

baseurl = 'https://www.fleetmon.com/maritime-news/'
headers = {'User-Agent': 'Mozilla/5.0'}

newslinks = []  # put all item in this array
for x in range(1):  # set page range
    response = requests.get(
        f'https://www.fleetmon.com/maritime-news/?page={x}')  # url of next page
    soup = BeautifulSoup(response.content, 'html.parser')
    newslist = soup.find_all('article')
    # loop to get all href from ul
    for item in newslist:
        for link in item.find_all('a', href=True):
            newslinks.append(link['href'])
newslinks = list(set(newslinks))
print(newslinks)

# news details pages
newsdata = []
for link in newslinks:
    print(link)
    response = requests.get(link, headers=headers)
    soup = BeautifulSoup(response.content, 'html.parser')
    shipName = soup.find('div', {'class': 'uk-article-story'}).text.strip()
    fieldsets = soup.find_all('article')
    row = {'Ship Name': shipName}
    for fieldset in fieldsets:
        dts = fieldset.find_all('h1')
        for dt in dts:
            row.update({dt.text.strip(): dt.find_next('p').text.strip()})
    newsdata.append(row)

# text or csv
df = pd.DataFrame(newsdata)
df.to_csv(r'C:\Users\Usuario\Desktop\news.csv', index=False, header=True)
print(df)
Help me to improve my code to get all the data in text form.
Also, is it possible to crawl the data and save it to CSV like this:
Column 1: News_title: value
Column 2: category: accidents
Column 3: publish_date_time: June 28, 2022 at 13:31
Column 4: news: full news here
Go to the details page (here I use req2 to go to the details page). I've made the pagination using a for loop and the range function, so you can increase or decrease the page numbers in no time.
P.S.: If you click on any title link you can see the details page; all the required data items are scraped from the details pages.
import pandas as pd
import requests
from bs4 import BeautifulSoup

headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/102.0.0.0 Safari/537.36'}
url = 'https://www.fleetmon.com/maritime-news/?page={page}'

data = []
for page in range(1, 11):
    req = requests.get(url.format(page=page), headers=headers)
    soup = BeautifulSoup(req.text, 'lxml')
    for link in soup.select('.news-headline h2 a'):
        link = 'https://www.fleetmon.com' + link.get('href')
        req2 = requests.get(link, headers=headers)
        soup2 = BeautifulSoup(req2.text, 'lxml')

        title = soup2.find('h1', class_="uk-article-title margin-t-0").text
        cat = soup2.select_one('p.uk-article-meta span a strong').text
        date = soup2.select_one('[class="uk-text-nowrap"]:nth-child(3)').text
        details = soup2.select_one('.uk-article-story').get_text(strip=True)

        data.append({
            'title': title,
            'category': cat,
            'date': date,
            'details_news': details
        })

df = pd.DataFrame(data)  # .to_csv('news.csv', index=False)
print(df)
Output:
0   Cruise ship NORWEGIAN SUN hit iceberg, damaged... ... Cruise ship NORWEGIAN SUN hit an iceberg size ...
1 Yang Ming and HMM Were Accused of Collusion to... ... YM WARRANTY by ship spotter phduck2kYM WARRANT...
2 Fire in bulk carrier cargo hold, Florida ... At around 2350 LT Jun 26 firefighters responde...
3 Chlorine gas tank fell on Chinese cargo ship, ... ... Tank with 25 tons of chlorine gas fell onto ca...
4 Heavy vehicle fell onto cargo deck during offl... ... Heavy machinery vehicle (probably mobile crane...
..  ...  ...  ...
195 Yara Plans 15 Ammonia Bunkering Terminals in S... ... VIKING ENERGY by ship spotter PattayaVIKING EN...
196 World’s Largest Electric Cruise Ship Sets Sail... ... ©Wuxi Saisiyi Electric Technology,©Wuxi Saisiy...
197 The Supply Chain Crisis Brewing at Israeli Ports ... Port Haifa in FleetMon ExplorerPort Haifa in F...
198 CDC Drops Its “Cruise Ship Travel Health Notic... ... AIDADIVA by ship spotter Becks93AIDADIVA by sh...
199 Scorpio Tankers Take the Path of Shipboard Car... ... CORONA UTILITY by ship spotter canonbenqCORONA...
[200 rows x 4 columns]
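To save the result as the question asked, uncomment the to_csv call on the DataFrame line above, or write it out explicitly:
df.to_csv('news.csv', index=False)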
Here's the link for scraping: https://stockanalysis.com/stocks/
I'm trying to get all the rows of the table (6000+ rows), but I only get the first 500 results. I guess it has to do with the setting for how many rows to display.
I have tried almost everything I can think of. I'm also a beginner in web scraping.
My code:
# Importing libraries
import numpy as np  # numerical computing library
import pandas as pd  # panel data library
import requests  # http requests library
from bs4 import BeautifulSoup

url = 'https://stockanalysis.com/stocks/'
headers = {'User-Agent': ' user agent: Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/96.0.4664.55 Safari/537.36'}
r = requests.get(url, headers)
soup = BeautifulSoup(r.text, 'html')
league_table = soup.find('table', class_='symbol-table index')

col_df = ['Symbol', 'Company_name', 'Industry', 'Market_Cap']
for team in league_table.find_all('tbody'):
    # i = 1
    rows = team.find_all('tr')
    df = pd.DataFrame(np.zeros([len(rows), len(col_df)]))
    df.columns = col_df
    for i, row in enumerate(rows):
        s_symbol = row.find_all('td')[0].text
        s_company_name = row.find_all('td')[1].text
        s_industry = row.find_all('td')[2].text
        s_market_cap = row.find_all('td')[3].text
        df.iloc[i] = [s_symbol, s_company_name, s_industry, s_market_cap]

len(df)  # should > 6000
What should I do?
Take a look at the bottom of the HTML and you will see this:
<script id="__NEXT_DATA__" type="application/json">
Try using bs4 to find this tag and load the data from inside it; I think this is everything you need.
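A minimal sketch of that idea — find the tag with bs4, parse its contents with the json module, and inspect the structure to locate the stock list:
import json
import requests
from bs4 import BeautifulSoup

r = requests.get('https://stockanalysis.com/stocks/')
soup = BeautifulSoup(r.text, 'html.parser')
next_data = json.loads(soup.find('script', id='__NEXT_DATA__').string)
print(next_data.keys())  # inspect the top-level keys to find where the stock data lives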
As stated, it's in the <script> tags. Pull it and read it in.
import requests
from bs4 import BeautifulSoup
import json
import re
import pandas as pd
url = 'https://stockanalysis.com/stocks/'
response = requests.get(url)
soup = BeautifulSoup(response.text, 'html.parser')
jsonStr = str(soup.find('script', {'id':'__NEXT_DATA__'}))
jsonStr = re.search('({.*})', jsonStr).group(0)
jsonData = json.loads(jsonStr)
df = pd.DataFrame(jsonData['props']['pageProps']['stocks'])
Output:
print(df)
s ... i
0 A ... Life Sciences Tools & Services
1 AA ... Metals & Mining
2 AAC ... Blank Check / SPAC
3 AACG ... Diversified Consumer Services
4 AACI ... Blank Check / SPAC
... ... ...
6033 ZWS ... Utilities-Regulated Water
6034 ZY ... Chemicals
6035 ZYME ... Biotechnology
6036 ZYNE ... Pharmaceuticals
6037 ZYXI ... Health Care Equipment & Supplies
[6038 rows x 4 columns]
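The JSON uses abbreviated keys (only 's' and 'i' are visible in the output above), so if you want the column names from the question, inspect one record and rename; the mapping below is a guess to verify against what you actually see:
print(jsonData['props']['pageProps']['stocks'][0])  # check the abbreviated keys first
# hypothetical mapping -- confirm the 'n' and 'm' keys before relying on them
df = df.rename(columns={'s': 'Symbol', 'n': 'Company_name', 'i': 'Industry', 'm': 'Market_Cap'})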
I want to get all the products on this page:
nike.com.br/snkrs#estoque
My Python code is this:
import requests
from bs4 import BeautifulSoup as bs4

produtos = []

def aviso():
    print("Started!")
    request = requests.get("https://www.nike.com.br/snkrs#estoque")
    soup = bs4(request.text, "html.parser")
    links = soup.find_all("a", class_="btn", text="Comprar")
    links_filtred = list(set(links))
    for link in links_filtred:
        if(produto not in produtos):
            request = requests.get(f"{link['href']}")
            soup = bs4(request.text, "html.parser")
            produto = soup.find("div", class_="nome-preco-produto").get_text()
            if(code_formated == ""):
                code_formated = "\u200b"
            print(f"Nome: {produto} Link: {link['href']}\n")
            produtos.append(link["href"])

aviso()
Guys, this code gets the products from the page, but not all of them. I suspect that the content is dynamic, but how can I get them all with requests and BeautifulSoup? I don't want to use Selenium or an automation library, and I don't want to have to change my code a lot because it's almost done. How do I do that?
Do not keep calling requests.get if you are dealing with the same host: use a requests.Session() instead, so the underlying TCP connection is reused (see the Session objects section of the Requests documentation).
import requests
from bs4 import BeautifulSoup
import pandas as pd


def main(url):
    allin = []
    with requests.Session() as req:
        for page in range(1, 6):
            params = {
                'p': page,
                'demanda': 'true'
            }
            r = req.get(url, params=params)
            soup = BeautifulSoup(r.text, 'lxml')
            goal = [(x.find_next('h2').get_text(strip=True, separator=" "), x['href'])
                    for x in soup.select('.aspect-radio-box')]
            allin.extend(goal)
    df = pd.DataFrame(allin, columns=['Title', 'Url'])
    print(df)


main('https://www.nike.com.br/Snkrs/Feed')
Output:
Title Url
0 Dunk High x Fragment design Black https://www.nike.com.br/dunk-high-x-fragment-d...
1 Dunk Low Infantil (16-26) City Market https://www.nike.com.br/dunk-low-infantil-16-2...
2 ISPA Flow 2020 Desert Sand https://www.nike.com.br/ispa-flow-2020-153-169...
3 ISPA Flow 2020 Pure Platinum https://www.nike.com.br/ispa-flow-2020-153-169...
4 Nike iSPA Men's Lightweight Packable Jacket https://www.nike.com.br/nike-ispa-153-169-211-...
.. ... ...
115 Air Jordan 1 Mid Hyper Royal https://www.nike.com.br/air-jordan-1-mid-153-1...
116 Dunk High Orange Blaze https://www.nike.com.br/dunk-high-153-169-211-...
117 Air Jordan 5 Stealth https://www.nike.com.br/air-jordan-5-153-169-2...
118 Air Jordan 3 Midnight Navy https://www.nike.com.br/air-jordan-3-153-169-2...
119 Air Max 90 Bacon https://www.nike.com.br/air-max-90-153-169-211...
[120 rows x 2 columns]
To get the data you can send a request to:
https://www.nike.com.br/Snkrs/Estoque?p=<PAGE>&demanda=true
where <PAGE> is a page number between 1 and 5 passed to the p= parameter in the URL.
For example, to print the links, you can try:
import requests
from bs4 import BeautifulSoup

url = "https://www.nike.com.br/Snkrs/Estoque?p={page}&demanda=true"

for page in range(1, 6):
    response = requests.get(url.format(page=page))
    soup = BeautifulSoup(response.content, "html.parser")
    print(soup.find_all("a", class_="btn", text="Comprar"))
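From there, collecting just the product links into a list (like the produtos list in the original code) is a small change — a sketch:
import requests
from bs4 import BeautifulSoup

url = "https://www.nike.com.br/Snkrs/Estoque?p={page}&demanda=true"
produtos = []

for page in range(1, 6):
    response = requests.get(url.format(page=page))
    soup = BeautifulSoup(response.content, "html.parser")
    for link in soup.find_all("a", class_="btn", text="Comprar"):
        if link.get("href") and link["href"] not in produtos:
            produtos.append(link["href"])

print(len(produtos), "product links collected")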