Scraping a web site - Python

I have this:
from bs4 import BeautifulSoup
import requests
page = requests.get("https://www.marca.com/futbol/primera/equipos.html")
soup = BeautifulSoup(page.content, 'html.parser')
equipos = soup.findAll('li', attrs={'id':'nombreEquipo'})
aux = []
for equipo in equipos:
    aux.append(equipo)
If I do print(aux[0]) I get this:
Villarreal
Entrenador:
Javier Calleja
Jugadores:
1 Sergio Asenjo
13 Andrés Fernández
25 Mariano Barbosa
...
My problem is that I want to take this tag:
<h2 class="cintillo">Villarreal</h2>
and this tag:
1 Sergio Asenjo
and put them into a database.
How can I extract them?
Thanks

You can extract the first <h2 class="cintillo"> element from equipo like this:
h2 = str(equipo.find('h2', {'class':'cintillo'}))
If you only want the inner text (without any tags), use:
h2 = equipo.find('h2', {'class':'cintillo'}).text
And you can extract all the <span class="dorsal-jugador"> elements from equipo like this:
jugadores = equipo.find_all('span', {'class':'dorsal-jugador'})
Then append h2 and jugadores to a multi-dimensional list.
Full code:
from bs4 import BeautifulSoup
import requests
page = requests.get("https://www.marca.com/futbol/primera/equipos.html")
soup = BeautifulSoup(page.content, 'html.parser')
equipos = soup.findAll('li', attrs={'id':'nombreEquipo'})
aux = []
for equipo in equipos:
    h2 = equipo.find('h2', {'class':'cintillo'}).text
    jugadores = equipo.find_all('span', {'class':'dorsal-jugador'})
    aux.append([h2, [j.text for j in jugadores]])

# format the list for printing
print('\n\n'.join(['--' + i[0] + '--\n' + '\n'.join(i[1]) for i in aux]))
Output sample:
--Alavés--
Fernando Pacheco
Antonio Sivera
Álex Domínguez
Carlos Vigaray
...
Demo: https://repl.it/#glhr/55550385

You could create a dictionary with the team names as keys; each value maps the entrenador (coach) to the list of players:
import requests
from bs4 import BeautifulSoup as bs
r = requests.get('https://www.marca.com/futbol/primera/equipos.html')
soup = bs(r.content, 'lxml')
teams = {}
for team in soup.select('[id=nombreEquipo]'):
    team_name = team.select_one('.cintillo').text
    entrenador = team.select_one('dd').text
    players = [item.text for item in team.select('.dorsal-jugador')]
    teams[team_name] = {entrenador: players}
print(teams)
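Since the question's end goal is a database, here is a minimal sketch using sqlite3 from the standard library (the file name, table and column names are my own invention, not from the question):
import sqlite3

# Hypothetical schema; adjust the table/column names to your database.
conn = sqlite3.connect('equipos.db')
conn.execute('CREATE TABLE IF NOT EXISTS jugadores (equipo TEXT, jugador TEXT)')

# aux as built in the first answer: [[team_name, [player, ...]], ...]
for equipo, jugadores in aux:
    conn.executemany(
        'INSERT INTO jugadores (equipo, jugador) VALUES (?, ?)',
        [(equipo, j) for j in jugadores]
    )
conn.commit()
conn.close()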

Related

How to web-scrape multiple items in a <tr> bracket and split them into 3 variables with Python?

I am trying to scrape a website with multiple brackets. My plan is to have 3 variables (oem, model, leadtime) to generate the desired output. However, I cannot figure out how to scrape this webpage in 3 variables. Given I am new to python and BeautifulSoup, I highly appreciate your feedback.
Desired output with 3 variables and the command:
print(oem, model, leadtime)
Renault, Mégane E-Tech, 12 Monate
Nissan, Ariya, 6 Monate
...
Volvo, XC90, 10-12 Monate
Output as of now:
Renault Mégane E-Tech12 Monate
Nissan Ariya6 Monate
Peugeot e-2086-7 Monate
KIA Sportage5-6 Monate6-7 Monate (Hybrid)
Jeep Compass3-5 Monate3-5 Monate (Hybrid)
VW Taigo3-6 Monate
...
XC9010-12 Monate
Code as of now:
from bs4 import BeautifulSoup
import requests
#Inputs/URLs to scrape:
URL = ('https://www.carwow.de/neuwagen-lieferzeiten#gref')
(response := requests.get(URL)).raise_for_status()
soup = BeautifulSoup(response.text, 'lxml')
overview = soup.find()
for card in overview.find_all('tbody'):
    for model2 in card.find_all('tr'):
        model = model2.text.replace('Angebote vergleichen', '')
        # oem? --> this needs to be defined
        # leadtime? --> this needs to be defined
        print(model)
The brand name is inside an h3 tag. You can get the parent container with this approach: .find_all("div", {"class": "expandable-content-container"})
from bs4 import BeautifulSoup
import requests
#Inputs/URLs to scrape:
URL = ('https://www.carwow.de/neuwagen-lieferzeiten#gref')
(response := requests.get(URL)).raise_for_status()
soup = BeautifulSoup(response.text, 'lxml')
overview = soup.find()
for el in overview.find_all("div", {"class": "expandable-content-container"}):
    header = el.find("h3").text.strip()
    if not header.startswith("Top 10") and not header.endswith("?"):
        for row in el.find_all("tr")[1:]:
            model_monate = ", ".join(
                list(map(lambda x: x.text, row.find_all("td")[:-1]))
            )
            print(f"{el.find('h3').text.strip()}, {model_monate}")
        print("----")
The parts of the car-model info that you're trying to scrape are actually stored in separate td tags, which means you can just access them by index to get the corresponding info. Try the code below.
import requests
from bs4 import BeautifulSoup
response = requests.get("https://www.carwow.de/neuwagen-lieferzeiten#gref").text
soup = BeautifulSoup(response, 'html.parser')
for tbody in soup.select('tbody'):
    for tr in tbody.select('tr'):
        brand = tr.select('td > a')[0].get('href').split('/')[3].capitalize()
        model = tr.select('td > a')[0].get('href').split('/')[4].capitalize()
        monate = tr.select('td')[1].getText(strip=True)
        print(f'{brand}, {model}, {monate}')
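If you need the three values as data rather than printed text, here is a small sketch of a variation (my own, not from the answer) that collects them into a list of tuples; skipping rows without a model link is an assumption about the page's header rows:
rows = []
for tbody in soup.select('tbody'):
    for tr in tbody.select('tr'):
        links = tr.select('td > a')
        if not links:  # skip header/spacer rows without a model link
            continue
        parts = links[0].get('href').split('/')
        # brand and model come from the link URL, lead time from the 2nd cell
        rows.append((parts[3].capitalize(), parts[4].capitalize(),
                     tr.select('td')[1].getText(strip=True)))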

How to list out all the h2, h3, and p tags then create a dataframe to store them

I was given a website to scrape all of the key items from.
But the output I got covers only one item using BeautifulSoup4. So I wonder if I need to use something like soup.find_all to extract all of the key items into a list from the website.
from urllib.request import urlopen
from bs4 import BeautifulSoup
import re
url = 'https://realpython.github.io/fake-jobs/'
html = urlopen(url)
soup = BeautifulSoup(html, 'html.parser')
column= soup.find(class_ = re.compile('columns is-multiline'))
print(column.prettify())
position = column.h2.text
company = column.h3.text
city_state= column.find_all('p')[-2].text
print (position, company, city_state)
Thank you.
Try this:
import pandas as pd
from urllib.request import urlopen
from bs4 import BeautifulSoup

url = 'https://realpython.github.io/fake-jobs/'
html = urlopen(url)
soup = BeautifulSoup(html, 'html.parser')
positions = [pos.text for pos in soup.find_all('h2')]
companies = [com.text for com in soup.find_all('h3')]
city_state0 = []
city_state1 = []
for p in soup.find_all('p', {'class' : 'location'}):
    city_state0.append(p.text.split(',')[0].strip())
    city_state1.append(p.text.split(',')[1].strip())
df = pd.DataFrame({
    'city_state1': city_state0,
    'city_state2': city_state1,
    'companies' : companies,
    'positions' : positions
})
print(df)
You need to use find_all to get all the elements like so. find only gets the first element.
titles = soup.find_all('h2', class_='title is-5')
companies = soup.find_all('h3', class_='subtitle is-6 company')
locations = soup.find_all('p', class_='location')
# loop over locations and extract the city and state
for location in locations:
    city = location.text.split(', ')[0]
    state = location.text.split(', ')[1]
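To get from there to the dataframe the question title asks for, a minimal sketch (assuming the titles, companies and locations lists from above, with pandas imported as pd):
import pandas as pd

df = pd.DataFrame({
    'positions': [t.text for t in titles],
    'companies': [c.text for c in companies],
    'city': [l.text.split(', ')[0].strip() for l in locations],
    'state': [l.text.split(', ')[1].strip() for l in locations],
})
print(df)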

How to get a list of members using BeautifulSoup

I tried to do this
import requests
from bs4 import BeautifulSoup

# browser is an existing Selenium WebDriver session
URL = str(browser.current_url)
page = requests.get(URL)
soup = BeautifulSoup(page.content, "html.parser")
imena = soup.findAll('a', class_='text-headline')
imena
Assuming the URL is for the Members tab of the Strava club Россия 2021, i.e. https://www.strava.com/clubs/236545/members, the following should work to get all of the members across the 193 pages (you really should be using the Strava API, though...):
import time
from bs4 import BeautifulSoup
from selenium import webdriver
from webdriver_manager.chrome import ChromeDriverManager
BASE_URL = "https://www.strava.com/clubs/236545/members?page="
driver = webdriver.Chrome(ChromeDriverManager().install())
driver.get(f"{BASE_URL}{1}")
driver.find_element_by_id("email").send_keys("<your-email>")
driver.find_element_by_id("password").send_keys("<your-password>")
driver.find_element_by_id("login-button").click()
time.sleep(1)
html = driver.page_source
soup = BeautifulSoup(html, "html.parser")
num_pages = int(soup.find("ul", "pagination").find_all("li")[-2].text)
# Ignore the admins shown on each page
athletes = soup.find_all("ul", {"class": "list-athletes"})[1]
members = [
    avatar.attrs['title']
    for avatar in athletes.find_all("div", {"class": "avatar"})
    if 'title' in avatar.attrs
]
for page in range(2, num_pages + 1):
    time.sleep(1)
    driver.get(f"{BASE_URL}{page}")
    html = driver.page_source
    soup = BeautifulSoup(html, "html.parser")
    athletes = soup.find_all("ul", {"class": "list-athletes"})[1]
    for avatar in athletes.find_all("div", {"class": "avatar"}):
        if 'title' in avatar.attrs:
            members.append(avatar.attrs['title'])

# Print the first 10 members
print('\n'.join(m.strip() for m in members[:10]))
driver.close()
Output (first 10 members):
- Victor Koldaev - ♥LCHF Runners♥
Antonio Raposo ®️
Vadim Issin
"DuSenna🇧🇷 Vá com Garra e a Felicidade te Agarra 😉
#MIX MIX
#RunВасяRun ...
$ерЖ 🇷🇺 КЛИМoff
'Luis Fernando Osorio' MTB
( CE )Faisal ALShammary "حائل $الشرقية "
(# Monique #) bermudez
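As an aside, since the answer recommends the Strava API: a minimal sketch of fetching club members through the v3 API instead of scraping (assuming you have an OAuth access token; the token placeholder and page size are my own choices):
import requests

ACCESS_TOKEN = "<your-access-token>"  # hypothetical; obtain via Strava's OAuth flow
CLUB_ID = 236545

members = []
page = 1
while True:
    r = requests.get(
        f"https://www.strava.com/api/v3/clubs/{CLUB_ID}/members",
        headers={"Authorization": f"Bearer {ACCESS_TOKEN}"},
        params={"page": page, "per_page": 200},
    )
    r.raise_for_status()
    batch = r.json()
    if not batch:  # an empty page means we've read them all
        break
    members.extend(f"{m['firstname']} {m['lastname']}" for m in batch)
    page += 1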
You need to first get the div elements with the class "text-headline" and then loop through each of them to get the anchor links.
URL = str(browser.current_url)
page = requests.get(URL)
soup = BeautifulSoup(page.content, "html.parser")
members = soup.findAll('div', {'class': 'text-headline'})
for member in members:
    name = member.find("a")
    print(name.get_text())

How can I fix this Python code to get all of the names of the people in the table I am referencing in the code?

I am trying to get Python to give me the names of the State Senators from Alabama on Ballotpedia. However, the code I put together is only giving me the title I requested from the URL, but I am not getting any names. Here is my current Python code:
import requests
from bs4 import BeautifulSoup
import pandas as pd
list = ['https://ballotpedia.org/Alabama_State_Senate']
temp_dict = {}
for page in list:
    r = requests.get(page)
    soup = BeautifulSoup(r.content, 'html.parser')
    temp_dict[page.split('/')[-1]] = [item.text for item in soup.select("table.bptable gray sortable tablesorter jquery-tablesorter a")]

df = pd.DataFrame.from_dict(temp_dict, orient='index').transpose()
I believe my error is in this line:
temp_dict[page.split('/')[-1]] = [item.text for item in soup.select("table.bptable gray sortable tablesorter jquery-tablesorter a")]
Thank you.
This seems to work for me:
import requests
from bs4 import BeautifulSoup
url = "https://ballotpedia.org/Alabama_State_Senate"
response = requests.get(url)
response.raise_for_status()
soup = BeautifulSoup(response.content, "html.parser")
for row in soup.find(id="officeholder-table").select("tr:not([colspan])"):
    name = row.select_one("td:nth-of-type(2)").text
    print(name)
Another way to get all info:
holders = soup.select("#officeholder-table")
targets = holders[0].select('tr')
for target in targets:
    print(target.text)
Output:
Alabama State Senate District 20
Linda Coleman-Madison
Democratic
2006
etc.
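For what it's worth, the original select likely fails for two reasons: in a CSS selector, spaces mean descendant elements, so multiple classes on one element must be chained with dots, and classes added client-side by JavaScript (presumably jquery-tablesorter here, an assumption based on how tablesorter works) never appear in the HTML that requests downloads. A minimal sketch of the corrected line, selecting only on the stable class:
# Dots (not spaces) chain classes on a single element; select on the
# server-rendered class only, since the rest may be added by JavaScript.
temp_dict[page.split('/')[-1]] = [item.text for item in soup.select("table.bptable a")]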

Python BeautifulSoup - get values from p

from bs4 import BeautifulSoup

html = '''<p class="product-new-price">96<sup>33</sup> <span class="tether-target tether-enabled tether-element-attached-top tether-element-attached-left tether-target-attached-top tether-target-attached-right">Lei</span>
</p>'''
soup = BeautifulSoup(html, 'html.parser')
sup_elem = soup.find("sup").string # 33 - it works
How do I get the "96" before the sup element?
You can grab the previousSibling of the sup tag:
from bs4 import BeautifulSoup
html = '''<p class="product-new-price">96<sup>33</sup> <span class="tether-target tether-enabled tether-element-attached-top tether-element-attached-left tether-target-attached-top tether-target-attached-right">Lei</span>
</p>'''
soup = BeautifulSoup(html, 'html.parser')
elem1 = soup.find("sup").previousSibling
elem2 = soup.find("sup").text # 33 - it works
print ('.'.join([elem1, elem2]))
Output:
96.33
You can use the children attribute. It returns all of the children of the p tag (96 will be its first child).
from bs4 import BeautifulSoup

html = '''<p class="product-new-price">96<sup>33</sup> <span class="tether-target tether-enabled tether-element-attached-top tether-element-attached-left tether-target-attached-top tether-target-attached-right">Lei</span>
</p>'''
soup = BeautifulSoup(html, 'html.parser')
elem = list(soup.find("p").children)[0]  # the 0th element of the list will be 96
sup_elem = soup.find("sup").string
result = elem + '.' + sup_elem  # 96.33
Use select instead.
from bs4 import BeautifulSoup
html = '''<p class="product-new-price">96<sup>33</sup> <span class="tether-target tether-enabled tether-element-attached-top tether-element-attached-left tether-target-attached-top tether-target-attached-right">Lei</span>
</p>'''
soup = BeautifulSoup(html, 'html.parser')
print(soup.select_one('.product-new-price').text.strip().replace('Lei',''))
There is no "." in the source, but you can always divide by 100:
print(int(soup.select_one('.product-new-price').text.strip().replace('Lei',''))/100)
