I need to extract from the URL all the Nacimientos (births) years for yesterday, today, and tomorrow. I tried to extract all the <li> elements, but when a <div> appears the extraction stops at the <div>; I also tried next_sibling and that didn't work either.
# Target page
url = "https://es.m.wikipedia.org/wiki/9_de_julio"
## Current count of articles in Spanish ##
# Request the target URL
wikipedia2 = requests.get(url)
# If the status code is OK
if wikipedia2.status_code == 200:
    nacimientos2 = soup(wikipedia2.text, "lxml")
else:
    print("The page responded with an error", wikipedia2.status_code)
filtro = nacimientos2.find("section", id="mf-section-2")
anios = filtro.find('ul').find_all('li')
lista2 = []
for data in anios:
    lista2.append(data.text[:4])
lista2
The main issue is that your selection targets the first <ul> and all its <li> only; since you are already working on a specific <section>, you can simply adjust the selection to skip the <ul>.
As one line, with a list comprehension and CSS selectors:
yearList = [e.text[:4] for e in nacimientos2.select('section#mf-section-2 li')]
Or, based on your code, using anios = filtro.find_all('li'):
filtro = nacimientos2.find("section", id="mf-section-2")
anios = filtro.find_all('li')
lista2 = []
for data in anios:
    lista2.append(data.text[:4])
lista2
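Putting the fix together, a minimal self-contained sketch (assuming the mobile Wikipedia markup still places the Nacimientos list under section#mf-section-2):

import requests
from bs4 import BeautifulSoup

url = "https://es.m.wikipedia.org/wiki/9_de_julio"
resp = requests.get(url)
resp.raise_for_status()  # fail fast on a non-2xx response

doc = BeautifulSoup(resp.text, "lxml")
# One selector reaches every <li> in the section, across all its <ul> blocks
years = [li.text[:4] for li in doc.select("section#mf-section-2 li")]
print(years)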
I am currently trying to scrape some information from the following link:
http://www2.congreso.gob.pe/Sicr/TraDocEstProc/CLProLey2001.nsf/ee3e4953228bd84705256dcd008385e7/4ec9c3be3fc593e2052571c40071de75?OpenDocument
I would like to scrape some of the information in the table using BeautifulSoup in Python. Ideally I would like to scrape the "Grupo Parlamentario," "Título," "Sumilla," and "Autores" fields from the table as separate items.
So far I've developed the following code using BeautifulSoup:
from bs4 import BeautifulSoup
import requests
import pandas as pd
url = 'http://www2.congreso.gob.pe/Sicr/TraDocEstProc/CLProLey2001.nsf/ee3e4953228bd84705256dcd008385e7/4ec9c3be3fc593e2052571c40071de75?OpenDocument'
page = requests.get(url)
soup = BeautifulSoup(page.text, 'html.parser')
table = soup.find('table', {'bordercolor' : '#6583A0'})
contents = []
summary = []
authors = []
contents.append(table.findAll('font'))
authors.append(table.findAll('a'))
What I'm struggling with is that the code to scrape the authors only scrapes the first author in the list; ideally I need to scrape all of them. This seems odd to me because, looking at the HTML of the page, every author in the list is wrapped in an <a href=...> tag, so I would expect table.findAll('a') to grab all of the authors.
Finally, I'm just dumping the rest of the very messy HTML (title, summary, parliamentary group) into one long string under contents. I'm new to HTML and web scraping, but is there a way to pull these items out and store them individually (i.e., just the title in one object, just the summary in another, and so on)? I'm having a tough time identifying unique tags for this in the page's code. Or is this something I should just clean and parse after scraping?
To get the authors you can use:
soup.find('input', {'name': 'NomCongre'})['value']
output:
'Santa María Calderón Luis,Alva Castro Luis,Armas Vela Carlos,Cabanillas Bustamante Mercedes,Carrasco Távara José,De la Mata Fernández Judith,De La Puente Haya Elvira,Del Castillo Gálvez Jorge,Delgado Nuñez Del Arco José,Gasco Bravo Luis,Gonzales Posada Eyzaguirre Luis,León Flores Rosa Marina,Noriega Toledo Víctor,Pastor Valdivieso Aurelio,Peralta Cruz Jonhy,Zumaeta Flores César'
To scrape Grupo Parlamentario:
table.find_all('td', {'width': 446})[1].text
output:
'Célula Parlamentaria Aprista'
To scrape Título:
table.find_all('td', {'width': 446})[2].text
output:
'IGV/SELECTIVO:D.L.821/LEY INTERPRETATIVA '
To scrape Sumilla:
table.find_all('td', {'width': 446})[3].text
output:
' Propone la aprobación de una Ley Interpretativa del Texto Original del Numeral 1 del Apéndice II del Decreto Legislativo N°821,modificatorio del Texto Vigente del Numeral 1 del Apéndice II del Texto Único Ordenado de la Ley del Impuesto General a las Ventas y Selectivo al Consumo,aprobado por Decreto Supremo N°054-99-EF. '
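A consolidated sketch along the same lines, collecting the four fields into one dict and splitting the comma-joined author string into separate items (the td width=446 indices are taken from the snippets above and may shift on other documents):

import requests
from bs4 import BeautifulSoup

url = ('http://www2.congreso.gob.pe/Sicr/TraDocEstProc/CLProLey2001.nsf/'
       'ee3e4953228bd84705256dcd008385e7/4ec9c3be3fc593e2052571c40071de75?OpenDocument')
soup = BeautifulSoup(requests.get(url).text, 'html.parser')
table = soup.find('table', {'bordercolor': '#6583A0'})

cells = table.find_all('td', {'width': 446})
record = {
    'grupo_parlamentario': cells[1].text.strip(),
    'titulo': cells[2].text.strip(),
    'sumilla': cells[3].text.strip(),
    # The hidden input holds every author in one comma-separated string
    'autores': soup.find('input', {'name': 'NomCongre'})['value'].split(','),
}
print(record)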
I have the following HTML to scrape:
<ul class="item-features">
<li>
<strong>Graphic Type:</strong> Dedicated Card
</li>
<li>
<strong>Resolution:</strong> 3840 x 2160
</li>
<li>
<strong>Weight:</strong> 4.40 lbs.
</li>
<li>
<strong>Color:</strong> Black
</li>
</ul>
I would like to write every single tag inside the <ul> to a .csv file: Graphic Type, Resolution, Weight, etc., each in a different column.
I've tried the following in Python:
import bs4
from urllib.request import urlopen as req
from bs4 import BeautifulSoup as soup

url = 'https://www.newegg.com/Laptops-Notebooks/SubCategory/ID-32?Tid=6740'
Client = req(url)
pagina = Client.read()
Client.close()
pagina_soup = soup(pagina, "html.parser")
productes = pagina_soup.findAll("div", {"class": "item-container"})
producte = productes[0]
features = producte.findAll("ul", {"class": "item-features"})
features[0].text
And it displays all the features, but just in one single column of the .csv:
'\nGraphic Type: Dedicated CardResolution: 3840 x 2160Weight: 4.40 lbs.Color: Black\nModel #: AERO 15 OLED SA-7US5020SH\nItem #: N82E16834233268\nReturn Policy: Standard Return Policy\n'
I don't know how to export them one by one. Please see my whole Python code:
import bs4
from urllib.request import urlopen as req
from bs4 import BeautifulSoup as soup

# Link to the page we will scrape
url = 'https://www.newegg.com/Laptops-Notebooks/SubCategory/ID-32?Tid=6740'
# Open a connection to the web page
Client = req(url)
# Offloads the content of the page into a variable
pagina = Client.read()
# Closes the client
Client.close()
# HTML parser
pagina_soup = soup(pagina, "html.parser")
# Grabs each product
productes = pagina_soup.findAll("div", {"class": "item-container"})
# Open a .csv file
filename = "ordinadors.csv"
f = open(filename, "w")
# Headers of my .csv file
headers = "Marca; Producte; PreuActual; PreuAnterior; Rebaixa; CostEnvio\n"
# Write the header
f.write(headers)
# Loop over all the products
for producte in productes:
    # Get the product brand
    marca_productes = producte.findAll("div", {"class": "item-info"})
    marca = marca_productes[0].div.a.img["title"]
    # Get the product name
    name = producte.a.img["title"]
    # Current price
    actual_productes = producte.findAll("li", {"class": "price-current"})
    preuActual = actual_productes[0].strong.text
    # Previous price
    try:
        preuAbans = producte.find("li", class_="price-was").next_element.strip()
    except AttributeError:
        print("Not found")
        preuAbans = ""  # default so the CSV row can still be written
    # Get the shipping costs
    costos_productes = producte.findAll("li", {"class": "price-ship"})
    # It is a list, so take the first element and clean it up
    costos = costos_productes[0].text.strip()
    # Writing the file
    f.write(marca + ";" + name.replace(",", " ") + ";" + preuActual + ";"
            + preuAbans + ";" + costos + "\n")
f.close()
# Each <li> holds "<strong>Key:</strong> value"; the key is the <strong> text
keys = [x.find('strong').text for x in pagina_soup.find_all('li')]
# and the value is the text node right after the <strong> tag
values = [x.find('strong').next_sibling.strip() for x in pagina_soup.find_all('li')]
print(keys)
print(values)
output:
['Graphic Type:', 'Resolution:', 'Weight:', 'Color:']
['Dedicated Card', '3840 x 2160', '4.40 lbs.', 'Black']
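To go from those two lists to separate .csv columns, one option is csv.DictWriter; a sketch assuming the keys and values lists scraped above (the features.csv filename is just an example):

import csv

# Pair each feature name with its value, dropping the trailing ':'
row = {k.rstrip(':'): v for k, v in zip(keys, values)}

with open('features.csv', 'w', newline='') as f:
    writer = csv.DictWriter(f, fieldnames=row.keys())
    writer.writeheader()   # one column per feature name
    writer.writerow(row)   # one row for this product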
I'm using XPath, and I want to scrape from this URL: https://www.le-dictionnaire.com/definition/tout
I'm using this code, but it brings back spaces, newlines, and the li tags from the ul:
def parse(self, response):
    print("processing:" + response.url)
    #Extract data using css selectors
    #product_name=response.css('.product::text').extract()
    #price_range=response.css('.value::text').extract()
    #Extract data using xpath
    title = response.xpath("//b/text()").extract()
    genre1 = response.xpath("(//span/text())[2]").extract()
    def1 = response.xpath("((//*[self::ul])[1])").extract()
    genre2 = response.xpath("(//span/text())[3]").extract()
    def2 = response.xpath("((//*[self::ul])[2])").extract()

    row_data = zip(title, genre1, def1, genre2, def2)
    #Making extracted data row wise
    for item in row_data:
        #create a dictionary to store the scraped info
        scraped_info = {
            #key:value
            'page': response.url,
            'title': item[0],  #item[0] means product in the list and so on, index tells what value to assign
            'genere1': item[1],
            'def1': item[2],
            'genere2': item[3],
            'def2': item[4],
        }
        #yield or give the scraped info to scrapy
        yield scraped_info
When I add text() to the expressions
def1 = response.xpath("((//*[self::ul])[1]/text())").extract()
def2 = response.xpath("((//*[self::ul])[2]/text())").extract()
it scrapes only blank spaces.
It happens because the text you want is not a direct child of the <ul> tag, and /text() only returns the text of direct children. You need the text from grandchildren of the <ul> tag, which is the text you want to scrape. For this purpose you can use //text() instead of /text(), or narrow down the XPath expression, like:
"//*[@class='defbox'][n]//ul/li/a/text()"
This gives a cleaner list output, and you can also join it into a clean string:
>>> def1 = response.xpath("//*[@class='defbox'][1]//ul/li/a/text()").getall()
>>> ' '.join(def1)
'Qui comprend l’intégrité, l’entièreté, la totalité d’une chose considérée par rapport au nombre, à l’étendue ou à l’intensité de l’énergie.\n\nS’emploie devant un nom précédé ou non d’un article, d’un démonstratif ou d’un possessif. S’emploie aussi devant un nom propre. S’emploie également devant ceci, cela, ce que, ce qui, ceux qui et celles qui. S’emploie aussi comme attribut après le verbe.'
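Wired back into the parse() callback from the question, this could look like the sketch below (the defbox positions come from the answer above; joining with spaces is one choice among several):

def parse(self, response):
    # //text() reaches the grandchild text nodes that /text() misses
    def1 = response.xpath("//*[@class='defbox'][1]//ul/li/a/text()").getall()
    def2 = response.xpath("//*[@class='defbox'][2]//ul/li/a/text()").getall()
    yield {
        'page': response.url,
        'title': response.xpath("//b/text()").get(),
        'def1': ' '.join(t.strip() for t in def1),
        'def2': ' '.join(t.strip() for t in def2),
    }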
I am trying to scrape a phrase/author from the body of a URL. I can scrape the phrases but I don't know how to find the author and print it together with the phrase. Can you help me?
import urllib.request
from bs4 import BeautifulSoup
page_url = "https://www.pensador.com/frases/"
page = urllib.request.urlopen(page_url)
soup = BeautifulSoup(page, "html.parser")
for frase in soup.find_all("p", attrs={'class': 'frase fr'}):
print(frase.text + '\n')
# author = soup.find_all("span", attrs={'class': 'autor'})
# print(author.text)
# this is the author that I need, for each phrase the right author
You can get to the parent of the p.frase.fr tag, which is a div, and then get the author by selecting span.autor under that div:
In [1268]: for phrase in soup.select('p.frase.fr'):
      ...:     author = phrase.parent.select_one('span.autor')
      ...:     print(author.text.strip(), ': ', phrase.text.strip())
      ...:
Roberto Shinyashiki : Tudo o que um sonho precisa para ser realizado é alguém que acredite que ele possa ser realizado.
Paulo Coelho : Imagine uma nova história para sua vida e acredite nela.
Carlos Drummond de Andrade : Ser feliz sem motivo é a mais autêntica forma de felicidade.
...
...
Here I'm using a CSS selector via phrase.parent.select_one('span.autor'); you can obviously use find here instead:
phrase.parent.find('span', attrs={'class': 'autor'})
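A complete sketch with a guard for phrases whose parent div has no author span (the 'Unknown' fallback is an assumption):

import urllib.request
from bs4 import BeautifulSoup

page = urllib.request.urlopen("https://www.pensador.com/frases/")
soup = BeautifulSoup(page, "html.parser")

for phrase in soup.select('p.frase.fr'):
    author = phrase.parent.select_one('span.autor')
    # select_one returns None when no author span is present
    name = author.text.strip() if author else 'Unknown'
    print(name, ':', phrase.text.strip())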
This is a subquestion of this one: Python associate urls's ids and url's titles in lists
I have this HTML script:
<a href="http://pluzz.francetv.fr/videos/monte_le_son_live_,101973832.html"
class="ss-titre">Monte le son</a>
<div class="rs-cell-details">
<a href="http://pluzz.francetv.fr/videos/monte_le_son_live_,101973832.html"
class="ss-titre">"Rubin_Steiner"</a>
<a href="http://pluzz.francetv.fr/videos/fare_maohi_,102103928.html"
class="ss-titre">Fare maohi</a>
How can I get this result with BeautifulSoup:
list_titre = [['Monte le son', 'Rubin_Steiner'], ['Fare maohi']] #one sublist by id
I tried this:
f = urllib.urlopen(url)
page = f.read()
f.close()
soup = BeautifulSoup(page)
show = []
list_titre = []
list_url = []
for link in soup.findAll('a'):
    lien = link.get('href')
    if lien == None:
        lien = ""
    if "http://pluzz.francetv.fr/videos/" in lien:
        titre = link.text.strip()
        if "Voir cette vidéo" in titre:
            titre = ""
        if "Lire la vidéo" in titre:
            titre = ""
        list_titre.append(titre)
        list_url.append(lien)
My result is:
list_titre = ['Monte le son', 'Rubin_Steiner', 'Fare maohi']
list_url = ['http://pluzz.francetv.fr/videos/monte_le_son_live_,101973832.html', 'http://pluzz.francetv.fr/videos/fare_maohi_,102103928.html']
But "titre" is not sorted by id.
Search for your links with a CSS selector to limit hits to just qualifying URLs.
Collect the links in a dictionary by URL; that way you can then process the information by sorting the dictionary keys:
from bs4 import BeautifulSoup

links = {}
soup = BeautifulSoup(page)
for link in soup.select('a[href^="http://pluzz.francetv.fr/videos/"]'):
    title = link.get_text().strip()
    if title and title not in (u'Voir cette vidéo', u'Lire la vidéo'):
        url = link['href']
        links.setdefault(url, []).append(title)
The dict.setdefault() call sets an empty list for urls not yet encountered; this produces a dictionary with the URLs as keys, and the titles as a list of values per URL.
Demo:
>>> page = '''\
... <a href="http://pluzz.francetv.fr/videos/monte_le_son_live_,101973832.html"
... class="ss-titre">Monte le son</a>
... <div class="rs-cell-details">
... <a href="http://pluzz.francetv.fr/videos/monte_le_son_live_,101973832.html"
... class="ss-titre">"Rubin_Steiner"</a>
... <a href="http://pluzz.francetv.fr/videos/fare_maohi_,102103928.html"
... class="ss-titre">Fare maohi</a>
... '''
>>> links = {}
>>> soup = BeautifulSoup(page)
>>> for link in soup.select('a[href^="http://pluzz.francetv.fr/videos/"]'):
...     title = link.get_text().strip()
...     if title and title not in (u'Voir cette vidéo', u'Lire la vidéo'):
...         url = link['href']
...         links.setdefault(url, []).append(title)
...
>>> from pprint import pprint
>>> pprint(links)
{'http://pluzz.francetv.fr/videos/ce_soir_ou_jamais_,101506826.html': [u'Ce soir (ou jamais !)',
u'"Qui est propri\xe9taire de quoi ? La propri\xe9t\xe9 mise \xe0 mal dans tous les domaines"'],
'http://pluzz.francetv.fr/videos/clip_locaux_,102890631.html': [u'Clips'],
'http://pluzz.francetv.fr/videos/fare_maohi_,102152859.html': [u'Fare maohi'],
'http://pluzz.francetv.fr/videos/fare_maohi_,102292937.html': [u'Fare maohi'],
'http://pluzz.francetv.fr/videos/fare_maohi_,102365651.html': [u'Fare maohi'],
'http://pluzz.francetv.fr/videos/inspecteur_barnaby_,101972045.html': [u'Inspecteur Barnaby',
u'"La musique en h\xe9ritage"'],
'http://pluzz.francetv.fr/videos/le_lab_o_saison3_,101215383.html': [u'Le Lab.\xd4',
u'"Episode 22"',
u'Saison 3'],
'http://pluzz.francetv.fr/videos/monsieur_madame_saison1_,101970319.html': [u'Les Monsieur Madame',
u'"Musique"',
u'Saison 1'],
'http://pluzz.francetv.fr/videos/monte_le_son_live_,101973832.html': [u'Monte le son !',
u'"Rubin Steiner"'],
'http://pluzz.francetv.fr/videos/music_explorer_saison1_,101215382.html': [u'Music Explorer : les chasseurs de sons',
u'"Episode 3/6"',
u'Saison 1'],
'http://pluzz.francetv.fr/videos/retour_a_goree_,101641108.html': [u'Retour \xe0 Gor\xe9e'],
'http://pluzz.francetv.fr/videos/singe_mi_singe_moi_,101507102.html': [u'Singe mi singe moi',
u'"Le chat"'],
'http://pluzz.francetv.fr/videos/singe_mi_singe_moi_,101777072.html': [u'Singe mi singe moi',
u'"L\'autruche"'],
'http://pluzz.francetv.fr/videos/toute_nouvelle_tendance_,102472310.html': [u'T.N.T'],
'http://pluzz.francetv.fr/videos/toute_nouvelle_tendance_,102472336.html': [u'T.N.T'],
'http://pluzz.francetv.fr/videos/toute_nouvelle_tendance_,102721018.html': [u'T.N.T'],
'http://pluzz.francetv.fr/videos/toute_nouvelle_tendance_,103216774.html': [u'T.N.T.'],
'http://pluzz.francetv.fr/videos/toute_nouvelle_tendance_,103216788.html': [u'T.N.T'],
'http://pluzz.francetv.fr/videos/via_cultura_,101959892.html': [u'Via cultura',
u'"L\'Ochju, le Mauvais oeil"']}