I'm trying to extract data from a site and then create a DataFrame out of it, but the program doesn't work properly. I'm new to web scraping — I hope someone can help me find the problem.
from urllib.request import urlopen
from bs4 import BeautifulSoup
import pandas as pd  # BUG FIX: pd.DataFrame was used without importing pandas

url = 'https://www.imdb.com/chart/top/?sort=rk,asc&mode=simple&page=1'
page = urlopen(url)
soup = BeautifulSoup(page, 'html.parser')

# One <tr> per chart entry inside the table body.
film_in = soup.find('tbody').findAll('tr')


def remove_parentheses(string):
    """Strip literal '(' and ')' characters from *string*."""
    return string.replace("(", "").replace(")", "")


imdb = []
# BUG FIX: the original loop iterated `films` but kept parsing the fixed
# variable `film` (row 0), so every dictionary held the same film.
for film in film_in:
    # BUG FIX: do not match the anchor by a hard-coded `title` attribute
    # (that attribute text is unique to the first film); take the link
    # inside each row's title column instead.
    titre = film.find("td", {'class': 'titleColumn'}).find('a').text
    rang = film.find("td", {'class': 'ratingColumn imdbRating'}).find('strong').text
    année = remove_parentheses(film.find("span", {'class': 'secondaryInfo'}).text)
    imdb.append({'film': titre,
                 'rang': rang,
                 'année': année
                 })

df_imdb = pd.DataFrame(imdb)
print(df_imdb)
I'm trying to extract data from a site and then create a DataFrame out of it, but the program doesn't work properly. I need to solve it using urllib — is there a way? Thanks in advance.
I'm new to web scraping.
You can try the next example:
from bs4 import BeautifulSoup
from urllib.request import urlopen
import requests
import pandas as pd

url = 'https://www.imdb.com/chart/top/?sort=rk,asc&mode=simple&page=1'
# soup = BeautifulSoup(requests.get(url).text,'html.parser')  # requests works just as well
page = urlopen(url)
soup = BeautifulSoup(page, 'html.parser')

imdb = []
# Every <tr> of the chart table; [1:] skips the header row.
film_in = soup.select('table[class="chart full-width"] tr')
# FIX: the loop body below had lost its indentation in the paste.
for film in film_in[1:]:
    titre = film.select_one('.titleColumn a').get_text(strip=True)
    rang = film.select_one('[class="ratingColumn imdbRating"] > strong').text
    année = film.find("span", {'class': 'secondaryInfo'}).get_text(strip=True)
    imdb.append({'titre': titre,
                 'rang': rang,
                 'année': année
                 })

df_imdb = pd.DataFrame(imdb)
print(df_imdb)
Output:
titre rang année
0 The Shawshank Redemption 9.2 (1994)
1 The Godfather 9.2 (1972)
2 The Dark Knight 9.0 (2008)
3 The Godfather Part II 9.0 (1974)
4 12 Angry Men 9.0 (1957)
.. ... ... ...
245 Dersu Uzala 8.0 (1975)
246 Aladdin 8.0 (1992)
247 The Help 8.0 (2011)
248 The Iron Giant 8.0 (1999)
249 Gandhi 8.0 (1982)
[250 rows x 3 columns]
Related
I am new to python programming and I have a problem with pagination while using beautiful soup. all the parsed content show up except the pagination contents. image of content not showing up I have highlighted the lines which does not show up.
Website link.
from bs4 import BeautifulSoup
import requests
import time
import pandas as pd
from lxml import html

url = "https://www.yellowpages.lk/Medical.php"
result = requests.get(url)
time.sleep(5)
doc = BeautifulSoup(result.content, "lxml")
time.sleep(5)

# One <tr> per medical facility. The page's pagination is drawn by
# JavaScript for display only; the full table arrives in this response.
Table = doc.find('table', {'id': 'MedicalFacility'}).find('tbody').find_all('tr')
Page = doc.select('.col-lg-10')

C_List = []  # Category
D_List = []  # District
N_List = []  # Name
A_List = []  # Address
T_List = []  # Telephone
W_List = []  # WhatsApp
V_List = []  # Viber
M_List = []  # MoH Division

print(doc.prettify())
print(Page)

# BUG FIX: the original wrapped this loop in `while True:` with no break,
# re-scraping the same 25 rows forever; a single pass is all that is needed.
# (The loop body had also lost its indentation in the paste.)
for i in range(0, 25):
    cells = Table[i].find_all('td')  # hoisted: one find_all per row instead of eight
    C_List.insert(i, cells[0].get_text().strip())
    D_List.insert(i, cells[1].get_text().strip())
    N_List.insert(i, cells[2].get_text().strip())
    A_List.insert(i, cells[3].get_text().strip())
    T_List.insert(i, cells[4].get_text().strip())
    W_List.insert(i, cells[5].get_text().strip())
    V_List.insert(i, cells[6].get_text().strip())
    M_List.insert(i, cells[7].get_text().strip())
I tried using .find() with a class and .select('.class') to see if the pagination contents show up, but so far nothing has worked.
The pagination is more or less superfluous in that page: the data is loaded anyway, and Javascript is generating pagination just for display purposes: Requests will get full data anyway.
Here is one way of getting that information in full:
import requests
from bs4 import BeautifulSoup as bs
import pandas as pd
from io import StringIO

pd.set_option('display.max_columns', None)
pd.set_option('display.max_colwidth', None)

# A browser User-Agent avoids being served a bot-blocked response.
headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/105.0.0.0 Safari/537.36'
}
url = 'https://www.yellowpages.lk/Medical.php'
r = requests.get(url, headers=headers)
soup = bs(r.text, 'html.parser')
table = soup.select_one('table[id="MedicalFacility"]')
# FIX: pandas deprecated passing a literal HTML string to read_html;
# wrap the markup in a file-like StringIO object instead.
df = pd.read_html(StringIO(str(table)))[0]
print(df)
Result in terminal:
Category District Name Address Telephone WhatsApp Viber MoH Division
0 Pharmacy Gampaha A & B Pharmacy 171 Negambo Road Veyangoda 0778081515 9.477808e+10 9.477808e+10 Aththanagalla
1 Pharmacy Trincomalee A A Pharmacy 350 Main Street Kanthale 0755576998 9.475558e+10 9.475558e+10 Kanthale
2 Pharmacy Colombo A Baur & Co Pvt Ltd 55 Grandpass Rd Col 14 0768200100 9.476820e+10 9.476820e+10 CMC
3 Pharmacy Colombo A Colombo Pharmacy Ug 93 97 Peoples Park Colombo 11 0773771446 9.477377e+10 NaN CMC
4 Pharmacy Trincomalee A R Pharmacy Main Street Kinniya-3 0771413838 9.477500e+10 9.477500e+10 Kinniya
... ... ... ... ... ... ... ... ...
1968 Pharmacy Ampara Zam Zam Pharmacy Main Street Akkaraipattu 0672277698 9.477756e+10 9.477756e+10 Akkaraipattu
1969 Pharmacy Batticaloa Zattra Pharmacy Jummah Mosque Rd Oddamawadi-1 0766689060 9.476669e+10 NaN Oddamavady
1970 Pharmacy Puttalam Zeenath Pharmacy Norochcholei 0728431622 NaN NaN Kalpitiya
1971 Pharmacy Puttalam Zidha Pharmacy Norochcholei 0773271222 NaN NaN Kalpitiya
1972 Pharmacy Gampaha Zoomcare Pharmacy & Grocery 182/B/1 Rathdoluwa Seeduwa 0768378112 NaN NaN Seeduwa
1973 rows × 8 columns
See pandas documentation here. Also BeautifulSoup documentation, and lastly, Requests documentation.
If you are using pandas, all you need is just a couple of lines of code to put the entire table into a dataframe.
All you need is pandas.read_html() function as follows:
Code:
import pandas as pd
# read_html fetches the URL and returns one DataFrame per <table> on the
# page; the medical-facility table is the first one ([0]).
df = pd.read_html("https://www.yellowpages.lk/Medical.php")[0]
print(df)
Output:
I am trying to scrape Company name, Postcode, phone number and web address from:
https://www.matki.co.uk/matki-dealers/ Finding it difficult as the information is only retrieved upon clicking the region on the page. If anyone could help it would be much appreciated. Very new to both Python and especially scraping!
# Jupyter/Colab shell commands (not Python): install the HTML parser.
# NOTE(review): `urllib3` is a different package from the stdlib
# `urllib.request` used below, so the second install is unnecessary here.
!pip install beautifulsoup4
!pip install urllib3
from bs4 import BeautifulSoup
from urllib.request import urlopen
# Fetch the dealers page and parse the UTF-8-decoded HTML.
url = "https://www.matki.co.uk/matki-dealers/"
page = urlopen(url)
html = page.read().decode("utf-8")
soup = BeautifulSoup(html, "html.parser")
I guess this is what you wanted to do: (you can put the result after in a file or a database, or even parse it and use it directly)
import requests
from bs4 import BeautifulSoup

URL = "https://www.matki.co.uk/matki-dealers/"
page = requests.get(URL)

# parse HTML
soup = BeautifulSoup(page.content, "html.parser")

# extract the HTML results: every dealer is an <article> inside the
# region container
results = soup.find(class_="dealer-region")
company_elements = results.find_all("article")

# Loop through the results and extract the wanted information.
# (FIX: the loop body below had lost its indentation in the paste.)
for company_element in company_elements:
    # some cleanup before printing: drop the call-to-action link text
    company_info = company_element.getText(separator=u', ').replace('Find out more »', '')
    print(company_info)
Output:
ESP Bathrooms & Interiors, Queens Retail Park, Queens Street, Preston, PR1 4HZ, 01772 200400, www.espbathrooms.co.uk
Paul Scarr & Son Ltd, Supreme Centre, Haws Hill, Lancaster Road A6, Carnforth, LA5 9DG, 01524 733788,
Stonebridge Interiors, 19 Main Street, Ponteland, NE20 9NH, 01661 520251, www.stonebridgeinteriors.com
Bathe Distinctive Bathrooms, 55 Pottery Road, Wigan, WN3 5AA, www.bathe-showroom.co.uk
Draw A Bath Ltd, 68 Telegraph Road, Heswall, Wirral, CH60 7SG, 0151 342 7100, www.drawabath.co.uk
Acaelia Home Design, Unit 4 Fence Avenue Industrial Estate, Macclesfield, Cheshire, SK10 1LT, 01625 464955, www.acaeliahomedesign.co.uk
...
import requests as r
from bs4 import BeautifulSoup as bs

# Fetch the coffee-maker category page and parse it.
resp = r.get("https://www.consumerreports.org/cro/coffee-makers.htm")
page = bs(resp.content)

# Drill into the first product card only — find() stops at the first match,
# which is why this prints a single product name.
container = page.find('div', class_="row product-type-container")
first_card = container.find('div', class_="product-type-item col-xs-4")
first_name = first_card.find('div', class_="product-type-info-container").h3.text
print(first_name)
I am scraping all the product names and details, but I can only scrape one product at a time. How can I scrape them all?
To get titles of all products in all categories you can use next example:
import requests
from bs4 import BeautifulSoup


def get_products(url):
    """Return the product titles listed on one category page at *url*."""
    soup = BeautifulSoup(requests.get(url).content, "html.parser")
    out = []
    for title in soup.select(".crux-component-title"):
        out.append(title.get_text(strip=True))
    return out


url = "https://www.consumerreports.org/cro/coffee-makers.htm"
soup = BeautifulSoup(requests.get(url).content, "html.parser")

all_data = []
# FIX: the function and loop bodies below had lost their indentation in
# the paste. Follow every category link and collect its product titles.
for category_link in soup.select("h3.crux-product-title a"):
    u = "https://www.consumerreports.org" + category_link["href"]
    print("Getting {}".format(u))
    all_data.extend(get_products(u))

for i, title in enumerate(all_data, 1):
    print("{:<5} {}".format(i, title))
Prints:
1 Bella 14755 with Brew Strength Selector
2 Bella Pro Series 90061
3 Betty Crocker 12-cup Stainless Steel BC-2809CB
4 Black+Decker 12-cup Programmable CM1331S
5 Black+Decker 12-Cup Thermal Programmable CM2046S
6 Black+Decker CM2036S 12-cup Thermal
7 Black+Decker CM4000S
8 Black+Decker DLX1050B
9 Black+Decker Even Stream CM2035B
10 Black+Decker Honeycomb Collection CM1251W
11 Black+Decker Programmable CM1331BS (Walmart Exclusive)
12 Bonavita BV1901TS 8-Cup One-Touch
13 Braun Brew Sense KF7150BK
14 Braun BrewSense 12-cup Programmable KF7150
15 Braun BrewSense 12-cup Programmable KF7000BK
...and so on.
Why is that: find(..) returns only first object which matches your criteria.
Solution: Try using find_all(..) method.
I want to get all the products on this page:
nike.com.br/snkrs#estoque
My python code is this:
# NOTE(review): this snippet was pasted with its indentation stripped; the
# bodies of aviso() and the for-loop must be re-indented before it can run.
produtos = []
def aviso():
print("Started!")
request = requests.get("https://www.nike.com.br/snkrs#estoque")
soup = bs4(request.text, "html.parser")
# `text="Comprar"` keeps only anchors whose text is exactly "Comprar"
links = soup.find_all("a", class_="btn", text="Comprar")
links_filtred = list(set(links))
for link in links_filtred:
# NOTE(review): `produto` is read here before it is ever assigned, so the
# first iteration raises NameError; the membership test was probably
# meant to use link["href"] against `produtos` (the list appended below).
if(produto not in produtos):
request = requests.get(f"{link['href']}")
soup = bs4(request.text, "html.parser")
produto = soup.find("div", class_="nome-preco-produto").get_text()
# NOTE(review): `code_formated` is never assigned before this comparison —
# another NameError waiting to happen.
if(code_formated == ""):
code_formated = "\u200b"
print(f"Nome: {produto} Link: {link['href']}\n")
produtos.append(link["href"])
aviso()
This code gets the products from the page, but not all of them. I suspect the content is loaded dynamically — how can I get them all using requests and BeautifulSoup? I don't want to use Selenium or an automation library, and I'd prefer not to change my code much since it's almost done.
DO NOT USE requests.get if you are dealing with the same HOST.
Reason: read-that
import requests
from bs4 import BeautifulSoup
import pandas as pd


def main(url):
    """Scrape pages 1-5 of the Snkrs feed through one pooled Session and
    print the (title, url) pairs as a DataFrame.

    (FIX: the function body below had lost its indentation in the paste.)
    """
    allin = []
    # A Session reuses the TCP connection for every request to the host.
    with requests.Session() as req:
        for page in range(1, 6):
            params = {
                'p': page,
                'demanda': 'true'
            }
            r = req.get(url, params=params)
            soup = BeautifulSoup(r.text, 'lxml')
            goal = [(x.find_next('h2').get_text(strip=True, separator=" "), x['href'])
                    for x in soup.select('.aspect-radio-box')]
            allin.extend(goal)
    df = pd.DataFrame(allin, columns=['Title', 'Url'])
    print(df)


main('https://www.nike.com.br/Snkrs/Feed')
Output:
Title Url
0 Dunk High x Fragment design Black https://www.nike.com.br/dunk-high-x-fragment-d...
1 Dunk Low Infantil (16-26) City Market https://www.nike.com.br/dunk-low-infantil-16-2...
2 ISPA Flow 2020 Desert Sand https://www.nike.com.br/ispa-flow-2020-153-169...
3 ISPA Flow 2020 Pure Platinum https://www.nike.com.br/ispa-flow-2020-153-169...
4 Nike iSPA Men's Lightweight Packable Jacket https://www.nike.com.br/nike-ispa-153-169-211-...
.. ... ...
115 Air Jordan 1 Mid Hyper Royal https://www.nike.com.br/air-jordan-1-mid-153-1...
116 Dunk High Orange Blaze https://www.nike.com.br/dunk-high-153-169-211-...
117 Air Jordan 5 Stealth https://www.nike.com.br/air-jordan-5-153-169-2...
118 Air Jordan 3 Midnight Navy https://www.nike.com.br/air-jordan-3-153-169-2...
119 Air Max 90 Bacon https://www.nike.com.br/air-max-90-153-169-211...
[120 rows x 2 columns]
To get the data you can send a request to:
https://www.nike.com.br/Snkrs/Estoque?p=<PAGE>&demanda=true
where providing a page number between 1-5 to p= in the URL.
For example, to print the links, you can try:
import requests
from bs4 import BeautifulSoup

url = "https://www.nike.com.br/Snkrs/Estoque?p={page}&demanda=true"

# Pages 1-5 hold the whole stock feed; request each one in turn.
# (FIX: the loop body below had lost its indentation in the paste.)
for page in range(1, 6):
    response = requests.get(url.format(page=page))
    soup = BeautifulSoup(response.content, "html.parser")
    # NOTE(review): bs4 has deprecated the `text=` keyword in favour of `string=`.
    print(soup.find_all("a", class_="btn", text="Comprar"))
I'm trying to web scrape this URL = https://www.ventanillaunicaenfermeria.es/BuscarColegiados.php.
I need to gather the values of "N°cole." column and "Nombre Colegiado" column.
I'm using BeautifulSoup, but I only get the values of the "Nº cole." column. How can I fix that?
Thanks!
This is my code:
from requests import get
from bs4 import BeautifulSoup
import pandas as pd
import numpy as np

# BUG FIX: the original called `requests.get(...)` although only the name
# `get` was imported, raising NameError. Call the imported `get` directly.
page = get('https://www.ventanillaunicaenfermeria.es/BuscarColegiados.php')
soup = BeautifulSoup(page.text, 'html.parser')

# Each <span class="colColegiado"> holds one value of the "Nº cole."
# column (the first match is the column header itself).
data = soup.find_all("span", {'class': 'colColegiado'})
numero_col = []
# (FIX: the loop body below had lost its indentation in the paste.)
for i in data:
    data_num = i.text.strip()
    numero_col.append(data_num)
numero_col  # bare expression: displays the list in a notebook, no-op in a script
['Nº cole.',
'6478',
'13107',
'7341',
'12110',
'5625',
'4877',
'4700',
'9126',
'8444',
'13120',
'5023',
'12235',
'7747',
'17701',
'17391',
'17944',
'17772',
'7230',
'11729',
'17275']
You're currently fetching the values from the wrong html elements - it should be from all <p>s with the resalto class.
import requests
from bs4 import BeautifulSoup

page = requests.get('https://www.ventanillaunicaenfermeria.es/BuscarColegiados.php')
soup = BeautifulSoup(page.text, 'html.parser')

# Number and name live together in each <p class="resalto">: the first
# child is the number's <span>, the second is the name text node.
data = soup.find_all("p", {'class': 'resalto'})
schools = []
# (FIX: the loop body below had lost its indentation in the paste.)
for result in data:
    data_num = result.contents[0].text.strip()
    data_name = str(result.contents[1])
    schools.append((data_num, data_name))
print(schools)
Instead of selecting all p at once, you can loop through the paragraphs in the table only. The following code takes page number and saves the table to a csv file.
import requests
from bs4 import BeautifulSoup
import pandas as pd

pageno = 1
res = requests.get(f'https://www.ventanillaunicaenfermeria.es/BuscarColegiados.php?nombre=&ap=&colegio=&col=&nif=&pagina={pageno}')
soup = BeautifulSoup(res.text, "html.parser")

# The <h4> header holds both column titles: its <span> is the first
# column; stripping that span's text from the whole header leaves the second.
header = soup.find("div", {"id": "contactaForm"}).find("h4")
cols = [header.find("span").get_text(), header.get_text().replace(header.find("span").get_text(), "")]

data = []
# Keep only the paragraphs that form table rows: those with no class at
# all, or with class "resalto".
# (FIX: the loop body below had lost its indentation in the paste.)
for p in soup.find("div", {"id": "contactaForm"}).find_all("p"):
    if len(p['class']) == 0 or p['class'][0] == "resalto":
        child = list(p.children)
        data.append([child[0].get_text(strip=True), child[1]])

df = pd.DataFrame(data, columns=cols)
df.to_csv("data.csv", index=False)
print(df)
Output:
Nº cole. Nombre colegiado
0 6478 GUADALUPE LAZARO LAZARO
1 13107 JOSE MARIA PIÑA MANZANO
2 7341 HEIKE ELFRIEDE BIRKHOLZ
3 12110 ESTHER TIZON ROLDAN
4 5625 MARIA DOLORES TOMAS GARCIA-VAQUERO
5 4877 MARIA CARMEN CASADO LLAVONA
6 4700 MANUEL GUILABERT ORTEGA-VILLAIZAN
7 9126 MARIA ESPERANZA ASENSIO ALMAZAN
8 8444 CONCEPCION VIALARD RODRIGUEZ
9 13120 NURIA VILLAESCUSA SANCHEZ
10 5023 ARTURO BONET BLANCO
11 12235 ALFONSO JIMENEZ LOPEZ
12 7747 JACOBUS PETRUS SINNIGE
13 17701 ANIA BRAVO FIGUEREDO
14 17391 LUSINE DAMIRCHYAN
15 17944 ISALKOU DJIL MERHBA
16 17772 CARLA DENISSE FIGUEROA PIEDRA
17 7230 MARIA ISABEL VISO CABAÑERO
18 11729 PILAR GARCIA SALAZAR
19 17275 MARIA LOURDES MALLEN LLUIS