I'm using XPath and I want to scrape this URL: https://www.le-dictionnaire.com/definition/tout
I'm using the code below, but it returns spaces, newlines and the li tags from the ul:
def parse(self, response):
    print("processing: " + response.url)
    #Extract data using css selectors
    #product_name=response.css('.product::text').extract()
    #price_range=response.css('.value::text').extract()
    #Extract data using xpath
    title = response.xpath("//b/text()").extract()
    genre1 = response.xpath("(//span/text())[2]").extract()
    def1 = response.xpath("((//*[self::ul])[1])").extract()
    genre2 = response.xpath("(//span/text())[3]").extract()
    def2 = response.xpath("((//*[self::ul])[2])").extract()
    row_data = zip(title, genre1, def1, genre2, def2)
    #Make the extracted data row-wise
    for item in row_data:
        #create a dictionary to store the scraped info
        scraped_info = {
            #key:value
            'page': response.url,
            'title': item[0],  #item[0] is the first value in the row, and so on; the index tells which value to assign
            'genere1': item[1],
            'def1': item[2],
            'genere2': item[3],
            'def2': item[4],
        }
        #yield the scraped info to scrapy
        yield scraped_info
When I add text() to the expressions:
def1 = response.xpath("((//*[self::ul])[1]/text())").extract()
def2 = response.xpath("((//*[self::ul])[2]/text())").extract()
it scrapes only blank spaces.
This happens because the text you want is not a direct child of the <ul> tag, and /text() only returns the text of direct children. The text you want lives in grandchildren of the <ul>. To get it you can use //text() instead of /text(), or narrow down the XPath expression, for example:
"//*[@class='defbox'][n]//ul/li/a/text()"
This gives you a cleaner list output, and you can also join it into a clean string:
>>> def1 = response.xpath("//*[@class='defbox'][1]//ul/li/a/text()").getall()
>>> ' '.join(def1)
'Qui comprend l’intégrité, l’entièreté, la totalité d’une chose considérée par rapport au nombre, à l’étendue ou à l’intensité de l’énergie.\n\nS’emploie devant un nom précédé ou non d’un article, d’un démonstratif ou d’un possessif. S’emploie aussi devant un nom propre. S’emploie également devant ceci, cela, ce que, ce qui, ceux qui et celles qui. S’emploie aussi comme attribut après le verbe.'
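As a minimal sketch (not tested against the live page), here is how the cleaned-up selectors could be plugged back into the parse method, with the whitespace collapsed before yielding; the defbox indices come from the answer above, everything else is just an assumption about how you want the item shaped:

def parse(self, response):
    # helper: join the <a> texts and collapse newlines/extra spaces (assumption: a plain string per definition is wanted)
    def clean(parts):
        return ' '.join(' '.join(parts).split())

    yield {
        'page': response.url,
        'title': response.xpath("//b/text()").get(),
        'genere1': response.xpath("(//span/text())[2]").get(),
        'def1': clean(response.xpath("//*[@class='defbox'][1]//ul/li/a/text()").getall()),
        'genere2': response.xpath("(//span/text())[3]").get(),
        'def2': clean(response.xpath("//*[@class='defbox'][2]//ul/li/a/text()").getall()),
    }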
I need to extract from the URL all the Nacimientos (births) and their years for yesterday, today and tomorrow. I try to extract all the <li> elements, but when a <div> appears it only extracts up to that <div>; I tried next_sibling and it didn't work either.
import requests
from bs4 import BeautifulSoup as soup

# Target page
url = "https://es.m.wikipedia.org/wiki/9_de_julio"
## Current count of articles in Spanish. ##
# Get a response from the target URL
wikipedia2 = requests.get(url)
# If the status code is OK
if wikipedia2.status_code == 200:
    nacimientos2 = soup(wikipedia2.text, "lxml")
else:
    print("The page responded with an error", wikipedia2.status_code)

filtro = nacimientos2.find("section", id="mf-section-2")
anios = filtro.find('ul').find_all('li')
lista2 = []
for data in anios:
    lista2.append(data.text[:4])
lista2
The main issue is that your selection is focused on the first <ul> and only its <li> elements. You can simply adjust the selection and skip the <ul>, because you are already working on a specific <section>.
As one line, with a list comprehension and CSS selectors:
yearList = [e.text[:4] for e in soup.select('section#mf-section-2 li')]
Or, based on your code, with anios = filtro.find_all('li'):
filtro = nacimientos2.find("section", id="mf-section-2")
anios = filtro.find_all('li')
lista2 = []
for data in anios:
    lista2.append(data.text[:4])
lista2
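For completeness, a self-contained sketch of the one-liner approach; it assumes the mobile page keeps the mf-section-2 id for the Nacimientos section, as in your code:

import requests
from bs4 import BeautifulSoup

url = "https://es.m.wikipedia.org/wiki/9_de_julio"
resp = requests.get(url)
resp.raise_for_status()  # stop early on a non-200 response
page = BeautifulSoup(resp.text, "lxml")

# every <li> inside the Nacimientos section, regardless of intervening <div>s
year_list = [e.text[:4] for e in page.select('section#mf-section-2 li')]
print(year_list)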
I'm trying to extract tags from an XML file using regular expressions (the re module) in Python. I need to extract the nodes that start with the tag "<PE" and their corresponding unit IDs, which are the nodes above each "<PE" tag. The file can be seen here.
When I use the code below, I don't get the correct "<unit ID" tags, that is, the ones that correspond to each "<PE" tag. For example, in my output the content extracted from the "<PE" tag paired with "<Unit ID=250" is actually under "<Unit ID=149" in the original file. Besides, the code skips some "<Unit ID" tags. Does anyone see where the error is in my code?
import re

t = open('ALICE.per1_replaced.txt', 'r')
t = t.read()
unitid = re.findall('<unit.*?"pe">', t, re.DOTALL)
PE = re.findall("<PE.*?</PE>", t, re.DOTALL)
a = zip(unitid, PE)
tp = tuple(a)
w = open('Tags.txt', 'w')
for x, j in tp:
    a = x + '\n' + j + '\n'
    w.write(a)
w.close()
I've tried this version as well but I had the same problems:
with open('ALICE.per1_replaced.txt', 'r') as t:
    contents = t.read()
unitid = re.findall('<unit.*?"pe">', contents, re.DOTALL)
PE = re.findall('<PE.*?</PE>', contents, re.DOTALL)
with open('PEtagsper1.txt', 'w') as fi:
    for i, p in zip(unitid, PE):
        fi.write("{}\n{}\n".format(i, p))
My desired output is a file with the "<unit id=" tags, each followed by the content of the tag that starts with "<PE" and ends with "</PE>", as below:
<unit id="16" status="FINISHED" type="pe">
<PE producer="A1.ALICE_GG"><html>
<head>
</head>
<body>
Eu vou me atrasar!' (quando ela voltou a pensar sobre isso mais trade,
ocorreu-lhe que deveria ter achado isso curioso, mas na hora tudo pareceu
bastante natural); mas quando o Coelho de fato tirou um relógio do bolso
do colete e olhou-o, e então se apressou, Alice pôs-se de pé, pois lhe
ocorreu que nunca antes vira um coelho com um colete, ou com um relógio de
bolso pra tirar, e queimando de curiosidade, ela atravessou o campo atrás
dele correndo e, felizmente, chegou justo a tempo de vê-lo entrar dentro
de uma grande toca de coelho sob a cerca.
</body>
</html></PE>
You seem to have multiple <PE> tags under each <unit> tag (e.g. for unit 3), so the zip doesn't pair them up correctly. As @Error_2646 noted in the comments, an XML parser or Beautiful Soup would work better for this job.
But if for whatever reason you want to stick to regex, you can fix this by running a regex on the list of strings returned by the first regex. Example code that worked on the small part of the input I downloaded:
units = re.findall('<unit.*?</unit>', t, re.DOTALL)
unitList = []
for unit in units:
    # first get your unit regex
    unitid = re.findall('<unit.*?"pe">', unit, re.DOTALL)  # same as the one you use
    # there should only be one within each unit
    assert (len(unitid) == 1)
    # now find all PEs for this unit
    PE = re.findall("<PE.*?</PE>", unit, re.DOTALL)  # same as the one you use
    # combine results
    output = unitid[0] + "\n"
    for pe in PE:
        output += pe + "\n"
    unitList.append(output)

for x in unitList:
    print(x)
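If you want the result in a file rather than printed, as in your original scripts, here is a small sketch of the last step (the file name is the one from your second attempt; the blank line between blocks is an assumption about the format you want):

# write the combined unit/PE blocks to disk, one blank line between blocks
with open('PEtagsper1.txt', 'w') as fi:
    for block in unitList:
        fi.write(block + "\n")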
I am currently trying to scrape some information from the following link:
http://www2.congreso.gob.pe/Sicr/TraDocEstProc/CLProLey2001.nsf/ee3e4953228bd84705256dcd008385e7/4ec9c3be3fc593e2052571c40071de75?OpenDocument
I would like to scrape some of the information in the table using BeautifulSoup in Python. Ideally I would like to scrape the "Grupo Parlamentario," "Título," "Sumilla," and "Autores" fields from the table as separate items.
So far I've developed the following code using BeautifulSoup:
from bs4 import BeautifulSoup
import requests
import pandas as pd
url = 'http://www2.congreso.gob.pe/Sicr/TraDocEstProc/CLProLey2001.nsf/ee3e4953228bd84705256dcd008385e7/4ec9c3be3fc593e2052571c40071de75?OpenDocument'
page = requests.get(url)
soup = BeautifulSoup(page.text, 'html.parser')
table = soup.find('table', {'bordercolor' : '#6583A0'})
contents = []
summary = []
authors = []
contents.append(table.findAll('font'))
authors.append(table.findAll('a'))
What I'm struggling with is that the code to scrape the authors only scrapes the first author in the list, and ideally I need all of them. This seems odd to me because, looking at the HTML of the page, every author in the list is wrapped in an '<a href= >' tag, so I would expect table.findAll('a') to grab all of the authors.
Finally, I'm just dumping the rest of the very messy HTML (title, summary, parliamentary group) into one long string under contents. I'm not sure if I'm missing something; I'm fairly new to HTML and web scraping, but is there a way to pull these items out and store them individually (i.e. just the title in one object, just the summary in another, etc.)? I'm having a tough time identifying unique tags in the page's code to do this. Or is this something I should just clean and parse after scraping?
To get the authors you can use:
soup.find('input', {'name': 'NomCongre'})['value']
output:
'Santa María Calderón Luis,Alva Castro Luis,Armas Vela Carlos,Cabanillas Bustamante Mercedes,Carrasco Távara José,De la Mata Fernández Judith,De La Puente Haya Elvira,Del Castillo Gálvez Jorge,Delgado Nuñez Del Arco José,Gasco Bravo Luis,Gonzales Posada Eyzaguirre Luis,León Flores Rosa Marina,Noriega Toledo Víctor,Pastor Valdivieso Aurelio,Peralta Cruz Jonhy,Zumaeta Flores César'
To scrape the Grupo Parlamentario:
table.find_all('td', {'width': 446})[1].text
output:
'Célula Parlamentaria Aprista'
To scrape the Título:
table.find_all('td', {'width': 446})[2].text
output:
'IGV/SELECTIVO:D.L.821/LEY INTERPRETATIVA '
To scrape the Sumilla:
table.find_all('td', {'width': 446})[3].text
output:
' Propone la aprobación de una Ley Interpretativa del Texto Original del Numeral 1 del Apéndice II del Decreto Legislativo N°821,modificatorio del Texto Vigente del Numeral 1 del Apéndice II del Texto Único Ordenado de la Ley del Impuesto General a las Ventas y Selectivo al Consumo,aprobado por Decreto Supremo N°054-99-EF. '
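Since you already import pandas and want the fields as separate items, here is a minimal sketch that collects the four values into one dictionary using the selectors from this answer (the key names and the DataFrame step are just illustrative assumptions):

fields = table.find_all('td', {'width': 446})

record = {
    'grupo_parlamentario': fields[1].text.strip(),
    'titulo': fields[2].text.strip(),
    'sumilla': fields[3].text.strip(),
    'autores': soup.find('input', {'name': 'NomCongre'})['value'].split(','),
}

df = pd.DataFrame([record])  # one row with the four columns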
I am trying to scrape a phrase/author from the body of a URL. I can scrape the phrases but I don't know how to find the author and print it together with the phrase. Can you help me?
import urllib.request
from bs4 import BeautifulSoup
page_url = "https://www.pensador.com/frases/"
page = urllib.request.urlopen(page_url)
soup = BeautifulSoup(page, "html.parser")
for frase in soup.find_all("p", attrs={'class': 'frase fr'}):
    print(frase.text + '\n')
    # author = soup.find_all("span", attrs={'class': 'autor'})
    # print(author.text)
    # this is the author that I need, for each phrase the right author
You can go up to the parent of the p.frase.fr tag, which is a div, and get the author by selecting span.autor descending from that div:
In [1268]: for phrase in soup.select('p.frase.fr'):
...: author = phrase.parent.select_one('span.autor')
...: print(author.text.strip(), ': ', phrase.text.strip())
...:
Roberto Shinyashiki : Tudo o que um sonho precisa para ser realizado é alguém que acredite que ele possa ser realizado.
Paulo Coelho : Imagine uma nova história para sua vida e acredite nela.
Carlos Drummond de Andrade : Ser feliz sem motivo é a mais autêntica forma de felicidade.
...
...
Here I'm using a CSS selector via phrase.parent.select_one('span.autor'); you can obviously use find instead:
phrase.parent.find('span', attrs={'class': 'autor'})
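If you want to keep the pairs around instead of just printing them, here is a minimal sketch that builds a list of dictionaries from the same selection (the field names are just illustrative):

quotes = []
for phrase in soup.select('p.frase.fr'):
    author = phrase.parent.select_one('span.autor')
    quotes.append({
        'phrase': phrase.text.strip(),
        'author': author.text.strip() if author else None,  # some entries may have no author span
    })

print(quotes[:3])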
I know this question has been asked before, but I don't think it has been asked for this specific situation. If it has, feel free to point me to it.
I have an HTML file organized this way (you can view the original here):
<h5 id="foo1">Title 1</h5>
<table class="foo2">
<tbody>
<tr>
<td>
<h3 class="foo3">SomeName1</h3>
<img src="Somesource" alt="SomeName2" title="SomeTitle"><br>
<p class="textcode">
Some precious text here
</p>
</td>
...
</table>
I would like to extract the name, the image and the text contained in the <p> of each table cell, for each h5 separately, meaning I would like to save each of these items in a folder named after the h5 it belongs to.
I tried this:
# coding: utf-8
import os
import re
from bs4 import BeautifulSoup as bs

os.chdir("WorkingDirectory")
# Open the HTML file and load its contents into the variable of the same name
with open("TheGoodPath.htm", "r") as html:
    html = bs(html, 'html.parser')
# Select the headers, keep only the first six, and create the folders
h5 = html.find_all("h5", limit=6)
for h in h5:
    # Create the folders named after the headers
    chemin = u"../Résulat/"
    nom = str(h.contents[0].string)
    os.makedirs(chemin + nom, exist_ok=True)
    # Select the sibling table located right after the header
    table = h.find_next_sibling(name='table')
    for t in table:
        # Select the headers containing the document titles
        h3 = t.find_all("h3")
        for k in h3:
            titre = str(k.string)
            # Create the directories named after the figures
            os.makedirs(chemin + nom + titre, exist_ok=True)
            os.fdopen(titre.tex)
            # Get the image located in the sibling tag right after the previous header
            img = k.find_next_sibling("img")
            chimg = img.img['src']
            os.fdopen(img.img['title'])
            # Get the TikZ code located in the sibling tag right after the previous header
            tikz = k.find_next_sibling('p')
            # Extract the TikZ code contained in the tag just retrieved
            code = tikz.get_text()
            # Define then write the preamble and the code needed to produce the image saved above
            preambule = r"%PREAMBULE \n \usepackage{pgfplots} \n \usepackage{tikz} \n \usepackage[european resistor, european voltage, european current]{circuitikz} \n \usetikzlibrary{arrows,shapes,positioning} \n \usetikzlibrary{decorations.markings,decorations.pathmorphing, decorations.pathreplacing} \n \usetikzlibrary{calc,patterns,shapes.geometric} \n %FIN PREAMBULE"
            with open(chemin + nom + titre, 'w') as result:
                result.write(preambule + code)
But it raises AttributeError: 'NavigableString' object has no attribute 'find_next_element' on line 21, h3 = t.find_all("h3").
This seems to be what you want. There only seems to be one table after each h5, so don't iterate over it; just use find_next and work with the table it returns:
from bs4 import BeautifulSoup
import requests

cont = requests.get("http://www.physagreg.fr/schemas-figures-physique-svg-tikz.php").text
soup = BeautifulSoup(cont)
h5s = soup.find_all("h5", limit=6)
for h5 in h5s:
    # find first table after
    table = h5.find_next("table")
    # find all h3 elements in that table
    for h3 in table.select("h3"):
        print(h3.text)
        img = h3.find_next("img")
        print(img["src"])
        print(img["title"])
        print(img.find_next("p").text)
        print()
Which gives you output like:
repere-plan.svg
\begin{tikzpicture}[scale=1]
\draw (0,0) --++ (1,1) --++ (3,0) --++ (-1,-1) --++ (-3,0);
\draw [thick] [->] (2,0.5) --++(0,2) node [right] {z};
%thick : gras ; very thick : très gras ; ultra thick : hyper gras
\draw (2,0.5) node [left] {O};
\draw [thick] [->] (2,0.5) --++(-1,-1) node [left] {x};
\draw [thick] [->] (2,0.5) --++(2,0) node [below] {y};
\end{tikzpicture}
Lignes de champ et équipotentielles
images/cours-licence/em3/ligne-champ-equipot.svg
ligne-champ-equipot.svg
\begin{tikzpicture}[scale=0.8]
\draw[->] (-2,0) -- (2,0);
\draw[->] (0,-2) -- (0,2);
\draw node [red] at (-2,1.25) {\scriptsize{Lignes de champ}};
\draw node [blue] at (2,-1.25) {\scriptsize{Equipotentielles}};
\draw[color=red,domain=-3.14:3.14,samples=200,smooth] plot (canvas polar cs:angle=\x r,radius={3*sin(\x r)*3*sin(\x r)*5});
%r = angle en radian
%domain permet de définir le domaine dans lequel la fonction sera tracée
%samples=200 permet d'augmenter le nombre de points pour le tracé
%smooth améliore également la qualité de la trace
\draw[color=red,domain=-3.14:3.14,samples=200,smooth] plot (canvas polar cs:angle=\x r,radius={2*sin(\x r)*2*sin(\x r)*5});
\draw[color=blue,domain=-pi:pi,samples=200,smooth] plot (canvas polar cs:angle=\x r,radius={3*sqrt(abs(cos(\x r)))*15});
\draw[color=blue,domain=-pi:pi,samples=200,smooth] plot (canvas polar cs:angle=\x r,radius={2*sqrt(abs(cos(\x r)))*15});
\end{tikzpicture}
Fonction arctangente
images/schemas/math/arctan.svg
arctan.svg
\begin{tikzpicture}[scale=0.8]
\draw[very thin,color=gray] (-pi,pi) grid (-pi,pi);
\draw[->] (-pi,0) -- (pi,0) node[right] {$x$};
\draw[->] (0,-2) -- (0,2);
\draw[color=red,domain=-pi:pi,samples=150] plot ({\x},{rad(atan(\x))} )node[right,red] {$\arctan(x)$};
\draw[color=blue,domain=-pi:pi] plot ({\x},{rad(-atan(\x))} )node[right,blue] {$-\arctan(x)$};
%Le rad() est une autre façon de dire que l'argument est en radian
\end{tikzpicture}
To write all the .svg's to disk:
from bs4 import BeautifulSoup
import requests
from urllib.parse import urljoin
from os import path

cont = requests.get("http://www.physagreg.fr/schemas-figures-physique-svg-tikz.php").text
soup = BeautifulSoup(cont)
base_url = "http://www.physagreg.fr/"
h5s = soup.find_all("h5", limit=6)
for h5 in h5s:
    # find first table after
    table = h5.find_next("table")
    # find all h3 elements in that table
    for h3 in table.select("h3"):
        print(h3.text)
        img = h3.find_next("img")
        src, title = img["src"], img["title"]
        # join base url and image url
        img_url = urljoin(base_url, src)
        # open file using title as file name, in binary mode since we write raw bytes
        with open(title, "wb") as f:
            # request the img url and write the content
            f.write(requests.get(img_url).content)
Which will give you arctan.svg, courbe-Epeff.svg and all the rest of the images on the page.
It looks like (judging by the for t in table loop) you meant to find multiple "table" elements. Use find_next_siblings() instead of find_next_sibling():
table = h.find_next_siblings(name='table')
for t in table:
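For context, a minimal sketch of how that change would sit in the question's loop; because of the name='table' filter, find_next_siblings() returns only matching <table> Tags, so t.find_all("h3") no longer hits a NavigableString (the inner print is just illustrative):

for h in h5:
    # a list of the <table> Tags that follow this header
    tables = h.find_next_siblings(name='table')
    for t in tables:
        # t is now a Tag, so find_all works as expected
        for k in t.find_all("h3"):
            print(k.string)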