I have a piece of Python code that logs me into a website. I am trying to extract the data from a particular table, but I'm getting errors and, after searching online, I'm still not sure how to resolve them.
Here is my code, written in my f.py file:
import mechanize
from bs4 import BeautifulSoup
import cookielib
import requests
cj = cookielib.CookieJar()
br = mechanize.Browser()
br.set_cookiejar(cj)
br.open("http://kingmedia.tv/home")
br.select_form(nr=0)
br.form['vb_login_username'] = 'abcde'
br.form['vb_login_password'] = '12345'
br.submit()
a = br.response().read()
url = br.open("http://kingmedia.tv/home/forumdisplay.php?f=2").read()
print (url)
soup = BeautifulSoup(requests.get(url).text, 'lxml')
for table in soup.select('table#tborder tr')[1:]:
    cell = table.select_one('td').get_text(strip=True)
    print(cell)
print (url) gives me the HTML of the page, shown below, from which I want to extract the table data. The table I am interested in is the one with class="tborder".
Update: 07/05/2021
Using soup = BeautifulSoup(content, 'lxml') as suggested by @Code-Apprentice, I am able to get the desired data. However, I am struggling to extract all of it.
I need this table; its source code from the link is the following:
<td class="alt1" width="100%"><div><font size="2"><div>
Live: EPL - Leicester v Newcastle (CH3): 05/07/21 to 05/07/21
</div><div>
Live: EPL - Liverpool v Southampton (CH3): 05/08/21 to 05/08/21
</div><div>
Live: UFC PreLims (CH2): 05/08/21 to 05/08/21
</div><div>
Live: UFC - Sandhagen v Dillashaw (CH2): 05/08/21 to 05/09/21
</div><div>
Live: EPL - Man City v Chelsea (CH3): 05/08/21 to 05/08/21
</div><div>
Live: La Liga - Barcelona v Atletico Madrid (CH6)(beIn): 05/08/21 to 05/08/21
</div><div>
Live: EPL - Leeds v Tottenham (CH3): 05/08/21 to 05/08/21
</div><div>
Live: F1 Qualifying (CH2): 05/08/21 to 05/08/21
</div><div>
Live: EPL - Sheff Utd v Crystal Palace (CH3): 05/08/21 to 05/08/21
</div></font><br>View More Detailed Calendar HERE</div></td>
url = br.open("http://kingmedia.tv/home/forumdisplay.php?f=2").read()
print (url)
soup = BeautifulSoup(requests.get(url).text, 'lxml')
This looks very suspicious. You are reading the content of one HTTP response and then using it as the URL for another request. Instead, just parse the content of the first request with Beautiful Soup:
content = br.open("http://kingmedia.tv/home/forumdisplay.php?f=2").read()
soup = BeautifulSoup(content, 'lxml')
First, I renamed url to content to reflect what the variable actually represents. Second, I use content directly when creating the BeautifulSoup object.
Disclaimer: this still might not be exactly correct, but it should get you headed in the right direction.
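Once the cell has been parsed, each inner div holds one listing line with a regular shape, so the lines can be split into fields. The following is a minimal sketch, assuming the lines have already been extracted as plain strings (for example with get_text(strip=True) on each inner div). The parse_event helper and its field names are hypothetical, and the regular expression is inferred only from the sample rows shown in the question.

```python
import re

# Pattern inferred from the sample rows in the question (an assumption):
# "Live: <title> (CH<n>)[(<extra>)]: <start> to <end>"
EVENT_RE = re.compile(
    r"^Live:\s*(?P<title>.+?)\s*"
    r"\((?P<channel>CH\d+)\)(?:\([^)]*\))?:\s*"
    r"(?P<start>\d{2}/\d{2}/\d{2})\s+to\s+(?P<end>\d{2}/\d{2}/\d{2})$"
)


def parse_event(line):
    """Return a dict of fields for one listing line, or None if it does not match."""
    match = EVENT_RE.match(line.strip())
    return match.groupdict() if match else None


# Lines copied from the HTML in the question.
lines = [
    "Live: EPL - Leicester v Newcastle (CH3): 05/07/21 to 05/07/21",
    "Live: UFC - Sandhagen v Dillashaw (CH2): 05/08/21 to 05/09/21",
    "Live: La Liga - Barcelona v Atletico Madrid (CH6)(beIn): 05/08/21 to 05/08/21",
]
events = [parse_event(line) for line in lines]
for event in events:
    print(event)
```

The dates are kept as strings because the question does not say whether they are month-first or day-first.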
Related
I am trying to scrape the company name, postcode, phone number and web address from:
https://www.matki.co.uk/matki-dealers/ I am finding it difficult because the information is only retrieved when you click a region on the page. If anyone could help, it would be much appreciated. I am very new to Python, and especially to scraping!
!pip install beautifulsoup4
!pip install urllib3
from bs4 import BeautifulSoup
from urllib.request import urlopen
url = "https://www.matki.co.uk/matki-dealers/"
page = urlopen(url)
html = page.read().decode("utf-8")
soup = BeautifulSoup(html, "html.parser")
I guess this is what you wanted to do (afterwards you can store the result in a file or a database, or parse it further and use it directly):
import requests
from bs4 import BeautifulSoup
URL = "https://www.matki.co.uk/matki-dealers/"
page = requests.get(URL)
# parse HTML
soup = BeautifulSoup(page.content, "html.parser")
# extract the HTML results
results = soup.find(class_="dealer-region")
company_elements = results.find_all("article")
# Loop through the results and extract the wanted information
for company_element in company_elements:
    # some cleanup before printing the info:
    company_info = company_element.getText(separator=u', ').replace('Find out more »', '')
    # the results ..
    print(company_info)
Output:
ESP Bathrooms & Interiors, Queens Retail Park, Queens Street, Preston, PR1 4HZ, 01772 200400, www.espbathrooms.co.uk
Paul Scarr & Son Ltd, Supreme Centre, Haws Hill, Lancaster Road A6, Carnforth, LA5 9DG, 01524 733788,
Stonebridge Interiors, 19 Main Street, Ponteland, NE20 9NH, 01661 520251, www.stonebridgeinteriors.com
Bathe Distinctive Bathrooms, 55 Pottery Road, Wigan, WN3 5AA, www.bathe-showroom.co.uk
Draw A Bath Ltd, 68 Telegraph Road, Heswall, Wirral, CH60 7SG, 0151 342 7100, www.drawabath.co.uk
Acaelia Home Design, Unit 4 Fence Avenue Industrial Estate, Macclesfield, Cheshire, SK10 1LT, 01625 464955, www.acaeliahomedesign.co.uk
...
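Since the question asks for the company name, postcode, phone number and web address as separate values, each printed line can be split further. This is a rough sketch, assuming the lines look like the sample output above. The split_company_info helper and both regular expressions are hypothetical heuristics, checked only against those sample lines, and a company name that itself contains a comma would break the split.

```python
import re

# Heuristic patterns (assumptions, checked only against the sample output).
POSTCODE_RE = re.compile(r"[A-Z]{1,2}\d[A-Z\d]?\s*\d[A-Z]{2}")
PHONE_RE = re.compile(r"0[\d ]{9,12}")


def split_company_info(company_info):
    """Split one comma-joined dealer line into name, postcode, phone and website."""
    parts = [p.strip() for p in company_info.split(",") if p.strip()]
    record = {"name": parts[0], "postcode": None, "phone": None, "website": None}
    for part in parts[1:]:
        if POSTCODE_RE.fullmatch(part):
            record["postcode"] = part
        elif PHONE_RE.fullmatch(part):
            record["phone"] = part
        elif part.startswith("www."):
            record["website"] = part
    return record


sample = [
    "ESP Bathrooms & Interiors, Queens Retail Park, Queens Street, Preston, PR1 4HZ, 01772 200400, www.espbathrooms.co.uk",
    "Paul Scarr & Son Ltd, Supreme Centre, Haws Hill, Lancaster Road A6, Carnforth, LA5 9DG, 01524 733788,",
    "Bathe Distinctive Bathrooms, 55 Pottery Road, Wigan, WN3 5AA, www.bathe-showroom.co.uk",
]
records = [split_company_info(line) for line in sample]
for record in records:
    print(record)
```

Fields missing from a line (such as the phone number in the third sample) are left as None rather than guessed.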
I am trying to scrape a website with multiple brackets. My plan is to have 3 variables (oem, model, leadtime) to generate the desired output. However, I cannot figure out how to scrape this webpage into 3 variables. As I am new to Python and BeautifulSoup, I would highly appreciate your feedback.
Desired output with 3 variables and the command:
print(oem, model, leadtime)
Renault, Mégane E-Tech, 12 Monate
Nissan, Ariya, 6 Monate
...
Volvo, XC90, 10-12 Monate
Output as of now:
Renault Mégane E-Tech12 Monate
Nissan Ariya6 Monate
Peugeot e-2086-7 Monate
KIA Sportage5-6 Monate6-7 Monate (Hybrid)
Jeep Compass3-5 Monate3-5 Monate (Hybrid)
VW Taigo3-6 Monate
...
XC9010-12 Monate
Code as of now:
from bs4 import BeautifulSoup
import requests
# Inputs/URLs to scrape:
URL = ('https://www.carwow.de/neuwagen-lieferzeiten#gref')
(response := requests.get(URL)).raise_for_status()
soup = BeautifulSoup(response.text, 'lxml')
overview = soup.find()
for card in overview.find_all('tbody'):
    for model2 in card.find_all('tr'):
        model = model2.text.replace('Angebote vergleichen', '')
        # oem? --> this needs to be defined
        # leadtime? --> this needs to be defined
        print(model)
The brand name is inside an h3 tag. You can get its parent container with .find_all("div", {"class": "expandable-content-container"}):
from bs4 import BeautifulSoup
import requests
# Inputs/URLs to scrape:
URL = ('https://www.carwow.de/neuwagen-lieferzeiten#gref')
(response := requests.get(URL)).raise_for_status()
soup = BeautifulSoup(response.text, 'lxml')
overview = soup.find()
for el in overview.find_all("div", {"class": "expandable-content-container"}):
    header = el.find("h3").text.strip()
    if not header.startswith("Top 10") and not header.endswith("?"):
        for row in el.find_all("tr")[1:]:
            model_monate = ", ".join(
                list(map(lambda x: x.text, row.find_all("td")[:-1]))
            )
            print(f"{el.find('h3').text.strip()}, {model_monate}")
        print("----")
The parts of the car model info that you're trying to scrape are actually stored in separate td tags, which means you can access them by index to get the corresponding info. Try the code below.
import requests
from bs4 import BeautifulSoup
response = requests.get("https://www.carwow.de/neuwagen-lieferzeiten#gref").text
soup = BeautifulSoup(response, 'html.parser')
for tbody in soup.select('tbody'):
    # select the rows explicitly so that text nodes between tags are skipped
    for tr in tbody.select('tr'):
        brand = tr.select('td > a')[0].get('href').split('/')[3].capitalize()
        model = tr.select('td > a')[0].get('href').split('/')[4].capitalize()
        monate = tr.select('td')[1].getText(strip=True)
        print(f'{brand}, {model}, {monate}')
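The code above takes the brand and model from the link's href by splitting on "/" and reading indices 3 and 4, which only works when the href is an absolute URL. As a hedged sketch of a slightly more tolerant variant, the path can be parsed with urllib.parse so that absolute and relative hrefs behave the same. The brand_and_model helper and the example hrefs are hypothetical; the assumption is that the path has the form /<brand>/<model>.

```python
from urllib.parse import urlsplit


def brand_and_model(href):
    """Return (brand, model) from a link path of the assumed form /<brand>/<model>."""
    segments = [s for s in urlsplit(href).path.split("/") if s]
    if len(segments) < 2:
        return None  # not a model link
    return segments[0].capitalize(), segments[1].capitalize()


# Hypothetical hrefs, used only to illustrate the two forms.
print(brand_and_model("https://www.carwow.de/renault/megane-e-tech"))
print(brand_and_model("/nissan/ariya"))
```

Note that capitalize() only upper-cases the first character and lower-cases the rest, so a slug such as megane-e-tech becomes Megane-e-tech rather than the display name shown on the page.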
I want to get all the products on this page:
nike.com.br/snkrs#estoque
My Python code is this:
import requests
from bs4 import BeautifulSoup as bs4
produtos = []
def aviso():
    print("Started!")
    request = requests.get("https://www.nike.com.br/snkrs#estoque")
    soup = bs4(request.text, "html.parser")
    links = soup.find_all("a", class_="btn", text="Comprar")
    links_filtred = list(set(links))
    for link in links_filtred:
        # skip product links that were already processed
        if link["href"] not in produtos:
            request = requests.get(f"{link['href']}")
            soup = bs4(request.text, "html.parser")
            produto = soup.find("div", class_="nome-preco-produto").get_text()
            # code_formated is not defined anywhere in this snippet:
            # if(code_formated == ""):
            #     code_formated = "\u200b"
            print(f"Nome: {produto} Link: {link['href']}\n")
            produtos.append(link["href"])
aviso()
This code gets products from the page, but not all of them. I suspect that the content is dynamic, but how can I get them all with requests and BeautifulSoup? I don't want to use Selenium or another automation library, and I don't want to change my code much because it's almost done.
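Do not call requests.get repeatedly when you are dealing with the same host.
Reason: a requests.Session reuses the underlying connection for requests to the same host, instead of opening a new one each time.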
import requests
from bs4 import BeautifulSoup
import pandas as pd
def main(url):
    allin = []
    with requests.Session() as req:
        for page in range(1, 6):
            params = {
                'p': page,
                'demanda': 'true'
            }
            r = req.get(url, params=params)
            soup = BeautifulSoup(r.text, 'lxml')
            goal = [(x.find_next('h2').get_text(strip=True, separator=" "), x['href'])
                    for x in soup.select('.aspect-radio-box')]
            allin.extend(goal)
    df = pd.DataFrame(allin, columns=['Title', 'Url'])
    print(df)
main('https://www.nike.com.br/Snkrs/Feed')
Output:
     Title                                        Url
0    Dunk High x Fragment design Black            https://www.nike.com.br/dunk-high-x-fragment-d...
1    Dunk Low Infantil (16-26) City Market        https://www.nike.com.br/dunk-low-infantil-16-2...
2    ISPA Flow 2020 Desert Sand                   https://www.nike.com.br/ispa-flow-2020-153-169...
3    ISPA Flow 2020 Pure Platinum                 https://www.nike.com.br/ispa-flow-2020-153-169...
4    Nike iSPA Men's Lightweight Packable Jacket  https://www.nike.com.br/nike-ispa-153-169-211-...
..   ...                                          ...
115  Air Jordan 1 Mid Hyper Royal                 https://www.nike.com.br/air-jordan-1-mid-153-1...
116  Dunk High Orange Blaze                       https://www.nike.com.br/dunk-high-153-169-211-...
117  Air Jordan 5 Stealth                         https://www.nike.com.br/air-jordan-5-153-169-2...
118  Air Jordan 3 Midnight Navy                   https://www.nike.com.br/air-jordan-3-153-169-2...
119  Air Max 90 Bacon                             https://www.nike.com.br/air-max-90-153-169-211...
[120 rows x 2 columns]
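The question removes duplicate links with set(), which discards the page order. When rows are collected across several pages, duplicates can instead be dropped while keeping the first occurrence of each URL. A small sketch, assuming rows are (title, url) tuples like the ones in allin; the dedupe_by_url helper and the example.com URLs are hypothetical placeholders.

```python
def dedupe_by_url(rows):
    """Keep the first (title, url) pair seen for each URL, preserving order."""
    first_seen = {}
    for title, url in rows:
        first_seen.setdefault(url, title)  # later duplicates are ignored
    return [(title, url) for url, title in first_seen.items()]


# Placeholder rows; the URLs are made up for the example.
rows = [
    ("Dunk High Orange Blaze", "https://example.com/dunk-high"),
    ("Air Max 90 Bacon", "https://example.com/air-max-90"),
    ("Dunk High Orange Blaze", "https://example.com/dunk-high"),
]
unique_rows = dedupe_by_url(rows)
print(unique_rows)
```

This relies on dictionaries preserving insertion order, which is guaranteed from Python 3.7 onwards.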
To get the data you can send a request to:
https://www.nike.com.br/Snkrs/Estoque?p=<PAGE>&demanda=true
replacing <PAGE> with a page number between 1 and 5.
For example, to print the links, you can try:
import requests
from bs4 import BeautifulSoup
url = "https://www.nike.com.br/Snkrs/Estoque?p={page}&demanda=true"
for page in range(1, 6):
    response = requests.get(url.format(page=page))
    soup = BeautifulSoup(response.content, "html.parser")
    print(soup.find_all("a", class_="btn", text="Comprar"))
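Both answers send the same two query parameters, p and demanda: one passes them as a params dictionary and the other formats them into the URL string. As an illustration only, the sketch below builds the page URLs from a dictionary with the standard library, without sending any request. The page_url helper is a hypothetical name; the base URL and parameters are the ones given above.

```python
from urllib.parse import urlencode

BASE_URL = "https://www.nike.com.br/Snkrs/Estoque"


def page_url(page):
    """Build the listing URL for one page from a dict of query parameters."""
    return f"{BASE_URL}?{urlencode({'p': page, 'demanda': 'true'})}"


urls = [page_url(page) for page in range(1, 6)]
for url in urls:
    print(url)
```

Keeping the parameters in a dictionary makes it easier to add or change one later than editing a format string.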
I need help understanding how to paginate and loop over 5 different pages that share the same URL (http://www.chartsinfrance.net/charts/albums.php,p2), where only the last part of the URL changes with the page number.
I can scrape the data from the first page, but I don't understand how to get the other URLs and scrape all the data in one loop, so that I get all 250 songs in one execution of the script!
import requests
from bs4 import BeautifulSoup
req = requests.get('http://www.chartsinfrance.net/charts/albums.php')
soup = BeautifulSoup(req.text, "html.parser")
charts = soup.select('.c1_td1')
Auteurs=[]
Titre=[]
Rang=[]
Evolution=[]
for chart in charts:
    Rang = chart.select_one('.c1_td2').get_text()
    Auteurs = chart.select_one('.c1_td5 a').get_text()
    Evolution = chart.select_one('.c1_td3').get_text()
    Titre = chart.select_one('.c1_td5 .noir11').get_text()
    print('--------')
    print(Auteurs)
    print(Titre)
    print(Rang)
    print(Evolution)
You can put your code in a while loop in which you load the soup, get the information about the songs, and then select the link to the next page.
If the link to the next page exists, load the new soup and continue the loop.
If not, break out of the loop.
For example:
import requests
from bs4 import BeautifulSoup
url = 'http://www.chartsinfrance.net/charts/albums.php'
while True:
    soup = BeautifulSoup(requests.get(url).content, 'html.parser')
    charts = soup.select('.c1_td1')
    Auteurs=[]
    Titre=[]
    Rang=[]
    Evolution=[]
    for chart in charts:
        Rang = chart.select_one('.c1_td2').get_text()
        Auteurs = chart.select_one('.c1_td5 a').get_text()
        Evolution = chart.select_one('.c1_td3').get_text()
        Titre = chart.select_one('.c1_td5 .noir11').get_text()
        print('--------')
        print(Auteurs)
        print(Titre)
        print(Rang)
        print(Evolution)
    next_link = soup.select_one('a:contains("→ Suite du classement")')
    if next_link:
        url = 'http://www.chartsinfrance.net' + next_link['href']
    else:
        break
Prints:
--------
Lady Gaga
Chromatica
1
Entrée
--------
Johnny Hallyday
Johnny 69
2
Entrée
--------
...
--------
Bof
Pulp Fiction
248
-115
--------
Trois Cafés Gourmands
Un air de Live
249
-30
--------
Various Artists
Salut Les Copains 60 Ans
250
Entrée
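The same follow-the-next-link loop can collect the rows into one list instead of printing them, which gives all pages' data in a single structure at the end. The sketch below shows only the control flow and runs offline: FAKE_PAGES and fetch are hypothetical stand-ins for downloading and parsing a page, and the hrefs in them are assumed to match the ",p2" pattern from the question. urljoin is used so that relative and absolute hrefs are both handled.

```python
from urllib.parse import urljoin

START_URL = "http://www.chartsinfrance.net/charts/albums.php"

# Stand-in for the real site: URL -> (rows on that page, href of the next link or None).
FAKE_PAGES = {
    START_URL: (
        [("1", "Lady Gaga", "Chromatica"), ("2", "Johnny Hallyday", "Johnny 69")],
        "/charts/albums.php,p2",
    ),
    START_URL + ",p2": (
        [("250", "Various Artists", "Salut Les Copains 60 Ans")],
        None,
    ),
}


def fetch(url):
    """Hypothetical fetcher; a real one would download and parse the page."""
    return FAKE_PAGES[url]


def crawl(start_url):
    """Follow next-page links from start_url and collect the rows of every page."""
    rows, url, seen = [], start_url, set()
    while url and url not in seen:  # the seen set guards against link cycles
        seen.add(url)
        page_rows, next_href = fetch(url)
        rows.extend(page_rows)
        url = urljoin(url, next_href) if next_href else None
    return rows


all_rows = crawl(START_URL)
print(len(all_rows), all_rows[-1])
```

Appending tuples to one list also avoids the issue in the original loop, where the lists Auteurs, Titre, Rang and Evolution are overwritten by strings on each iteration.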
I have this:
from bs4 import BeautifulSoup
import requests
page = requests.get("https://www.marca.com/futbol/primera/equipos.html")
soup = BeautifulSoup(page.content, 'html.parser')
equipos = soup.findAll('li', attrs={'id':'nombreEquipo'})
aux = []
for equipo in equipos:
aux.append(equipo)
If I do print(aux[0]), I get this:
,
Villarreal
Entrenador:
Javier Calleja
Jugadores:
1 Sergio Asenjo
13 Andrés Fernández
25 Mariano Barbosa
...
My problem is that I want to take this tag:
<h2 class="cintillo">Villarreal</h2>
And this tag:
1 Sergio Asenjo
And put them into a database.
How can I do that?
Thanks
You can extract the first <h2 class="cintillo"> element from equipo like this:
h2 = str(equipo.find('h2', {'class':'cintillo'}))
If you only want the text inside the element (without any tags), use:
h2 = equipo.find('h2', {'class':'cintillo'}).text
And you can extract all the <span class="dorsal-jugador"> elements from equipo like this:
jugadores = equipo.find_all('span', {'class':'dorsal-jugador'})
Then append h2 and jugadores to a multi-dimensional list.
Full code:
from bs4 import BeautifulSoup
import requests
page = requests.get("https://www.marca.com/futbol/primera/equipos.html")
soup = BeautifulSoup(page.content, 'html.parser')
equipos = soup.findAll('li', attrs={'id':'nombreEquipo'})
aux = []
for equipo in equipos:
    h2 = equipo.find('h2', {'class':'cintillo'}).text
    jugadores = equipo.find_all('span', {'class':'dorsal-jugador'})
    aux.append([h2,[j.text for j in jugadores]])
# format list for printing
print('\n\n'.join(['--'+i[0]+'--\n' + '\n'.join(i[1]) for i in aux]))
Output sample:
--Alavés--
Fernando Pacheco
Antonio Sivera
Álex Domínguez
Carlos Vigaray
...
Demo: https://repl.it/@glhr/55550385
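The question also asks how to put the result into a database. As one possible approach, not something the answer itself covers, the nested list can be flattened into (team, player) rows and stored with the standard library's sqlite3 module. The table name jugadores and its column names are hypothetical, and aux here is a stand-in holding the sample output above.

```python
import sqlite3

# Stand-in for the aux list built above: [team_name, [player, ...]]
aux = [["Alavés", ["Fernando Pacheco", "Antonio Sivera", "Álex Domínguez", "Carlos Vigaray"]]]

conn = sqlite3.connect(":memory:")  # pass a file path instead to persist the data
conn.execute("CREATE TABLE jugadores (equipo TEXT NOT NULL, jugador TEXT NOT NULL)")

# One row per player, each carrying its team name.
rows = [(team, player) for team, players in aux for player in players]
conn.executemany("INSERT INTO jugadores (equipo, jugador) VALUES (?, ?)", rows)
conn.commit()

stored = conn.execute("SELECT equipo, jugador FROM jugadores ORDER BY rowid").fetchall()
print(stored)
```

The ? placeholders let sqlite3 handle quoting, so names containing apostrophes or accented characters are inserted safely.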
You could create a dictionary with team names as keys and, as values, a dictionary mapping the coach (entrenador) to the list of players:
import requests
from bs4 import BeautifulSoup as bs
r = requests.get('https://www.marca.com/futbol/primera/equipos.html')
soup = bs(r.content, 'lxml')
teams = {}
for team in soup.select('[id=nombreEquipo]'):
    team_name = team.select_one('.cintillo').text
    entrenador = team.select_one('dd').text
    players = [item.text for item in team.select('.dorsal-jugador')]
    teams[team_name] = {entrenador : players}
print(teams)
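In the structure above the coach's name is itself a dictionary key, so reading it back means iterating over the inner dictionary. A possible variation, shown as a sketch only, uses fixed keys for each team, which also serializes cleanly to JSON. The key names "entrenador" and "players" are hypothetical choices, and the values are stand-ins taken from the question's sample output.

```python
import json

# Stand-in values; in the loop above these come from the parsed page.
team_name = "Villarreal"
entrenador = "Javier Calleja"
players = ["Sergio Asenjo", "Andrés Fernández", "Mariano Barbosa"]

teams = {}
teams[team_name] = {"entrenador": entrenador, "players": players}

# ensure_ascii=False keeps accented names readable in the output
serialized = json.dumps(teams, ensure_ascii=False, indent=2)
print(serialized)
```

With this shape, teams["Villarreal"]["entrenador"] returns the coach directly.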