Scroll down in an iframe element with Selenium Python

I have a problem when I want to scroll. I attach the code showing where I locate and switch to the frame, and how I currently try scrolling:
...
driver.find_element(By.XPATH, '//*[@id="nivel4_11_5_3_1_2"]').click()  # "Consultar factura y nota" menu entry
WebDriverWait(driver, 60).until(EC.frame_to_be_available_and_switch_to_it('iframeApplication'))  # switch to the invoice-query iframe
for z in categories_1:
    WebDriverWait(driver, 10).until(EC.element_to_be_clickable((By.XPATH, '/html/body/div[1]/table/tbody/tr/td/div/div/form/table/tbody/tr/td/table/tbody/tr/td/table[2]/tbody/tr[1]/td[3]/div/div[2]/input')))
    driver.find_element(By.XPATH, '/html/body/div[1]/table/tbody/tr/td/div/div/form/table/tbody/tr/td/table/tbody/tr/td/table[2]/tbody/tr[1]/td[3]/div/div[2]/input').send_keys(periodoactual[0])  # fill in the start date
    driver.find_element(By.XPATH, '/html/body/div[1]/table/tbody/tr/td/div/div/form/table/tbody/tr/td/table/tbody/tr/td/table[2]/tbody/tr[2]/td[3]/div/div[2]/input').send_keys(periodoactual[1])  # fill in the end date
    driver.find_element(By.XPATH, '/html/body/div[1]/table/tbody/tr/td/div/div/form/table/tbody/tr/td/table/tbody/tr/td/table[2]/tbody/tr[3]/td[3]/div/div[3]/input[1]').clear()  # clear the query-type field
    driver.find_element(By.XPATH, '/html/body/div[1]/table/tbody/tr/td/div/div/form/table/tbody/tr/td/table/tbody/tr/td/table[2]/tbody/tr[3]/td[3]/div/div[3]/input[1]').send_keys('FE Recibidas')  # fill in the query type
    driver.find_element(By.XPATH, '/html/body/div[1]/table/tbody/tr/td/div/div/form/table/tbody/tr/td/table/tbody/tr/td/table[3]/tbody/tr[1]/td[1]/span/span/span/span[3]').click()  # click "buscar comprobantes" (search receipts)
    sleep(5)
driver.switch_to.default_content()
WebDriverWait(driver,20).until(EC.element_to_be_clickable((By.ID,'iframeApplication')))
marco= driver.find_element(By.ID,'iframeApplication')
driver.execute_script('arguments[0].scrollIntoView({block: "center"})', marco)
print(len(driver.find_elements(By.LINK_TEXT,'Descargar Factura (XML)')))
I want to highlight that the switch works correctly, since I can retrieve information and click within the frame; what I want is to scroll down.
The problem is the following: for the iframe to load all of its information I need to scroll, so that the print of the element count returns 64 and not 25. That's why I want to scroll down inside the iframe.

No need to switch.
Can you try the following code:
marco = driver.find_element(By.XPATH, '//*[@id="dojox_grid__View_1"]/div')
driver.execute_script('arguments[0].scrollIntoView({block: "center"})', marco)
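If a single scrollIntoView does not make the grid load the remaining rows, a loop that keeps scrolling the grid container until the link count stops growing may help. This is only a sketch: it reuses the dojox grid XPath suggested above and the link text from the question, both of which may need adjusting to the actual page.
from time import sleep

# Sketch: scroll the dojox grid inside the iframe until no new rows appear.
WebDriverWait(driver, 60).until(EC.frame_to_be_available_and_switch_to_it('iframeApplication'))
scroller = driver.find_element(By.XPATH, '//*[@id="dojox_grid__View_1"]/div')
previous_count = -1
while True:
    links = driver.find_elements(By.LINK_TEXT, 'Descargar Factura (XML)')
    if len(links) == previous_count:
        break  # no new rows appeared since the last scroll
    previous_count = len(links)
    # scroll the container to its bottom so the grid fetches more rows
    driver.execute_script('arguments[0].scrollTop = arguments[0].scrollHeight', scroller)
    sleep(2)  # give the grid time to render the new rows
print(previous_count)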

How to extract with BeautifulSoup, problems with div

I need to extract from the URL all the Nacimientos (births): the years, for yesterday, today and tomorrow. I tried to extract all the <li>, but when a <div> appears it only extracts up to the <div>; I tried next_sibling and it didn't work either.
import requests
from bs4 import BeautifulSoup as soup

# Target page
url = "https://es.m.wikipedia.org/wiki/9_de_julio"
## Current count of articles in Spanish. ##
# Request the target URL
wikipedia2 = requests.get(url)
# If the status code is OK
if wikipedia2.status_code == 200:
    nacimientos2 = soup(wikipedia2.text, "lxml")
else:
    print("La página respondió con error", wikipedia2.status_code)
filtro = nacimientos2.find("section", id="mf-section-2")
anios = filtro.find('ul').find_all('li')
lista2 = []
for data in anios:
    lista2.append(data.text[:4])
lista2
The main issue is that the focus of your selection is the first <ul> and all of its <li>. You can simply adjust the selection to skip the <ul>, because you are already working on a specific <section>.
As one line, with a list comprehension and CSS selectors:
yearList = [e.text[:4] for e in soup.select('section#mf-section-2 li')]
or, based on your code, with anios = filtro.find_all('li'):
filtro = nacimientos2.find("section", id="mf-section-2")
anios = filtro.find_all('li')
lista2 = []
for data in anios:
    lista2.append(data.text[:4])
lista2
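Putting it together as a self-contained script (a sketch that reuses the question's URL; html.parser is used here so it runs without lxml installed):
import requests
from bs4 import BeautifulSoup

url = "https://es.m.wikipedia.org/wiki/9_de_julio"
respuesta = requests.get(url)
respuesta.raise_for_status()  # fail loudly on a non-200 response

pagina = BeautifulSoup(respuesta.text, "html.parser")
# All <li> inside the births section, including those after nested <div>s
anios = [li.text[:4] for li in pagina.select("section#mf-section-2 li")]
print(anios)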

Can't get text without tag using Selenium Python

First of all, I'll show the code that I'm having a problem with, in order to better explain myself.
<div class="archivos"> ... </div>
<br>
<br>
<br>
<br>
THIS IS THE TEXT THAT I WANT TO CHECK
<div class="archivos"> ... </div>
...
I'm using Selenium in Python.
So, this is a piece of the HTML that I'm working with. My objective is: inside the div with class="archivos" there's a link that I want to click, but first I need to analyze the text above it to decide whether or not to click the link.
The problem is that the text has no tag, and I can't seem to find a way to copy it so I can search it for the information I want. The text changes every time, so I need to locate the possible texts preceding every class="archivos".
So far I've tried a lot of ways to find it, mainly using XPath, trying to get to the previous element of the div. I haven't come up with anything that works yet, as I'm not very experienced with Selenium and XPath.
I've found this https://chercher.tech/python/relative-xpath-selenium-python, which helped me try some XPaths, and several answers here on SO, but to no avail.
I've read somewhere that I can use Javascript code from Python using Selenium to get it, but I don't know Javascript and don't know how to do it. Maybe somebody understands what I'm talking about.
This is the webpage if it helps: http://www.boa.aragon.es/cgi-bin/EBOA/BRSCGI?CMD=VERLST&DOCS=1-200&BASE=BOLE&SEC=FIRMA&SEPARADOR=&PUBL=20200901
Thanks in advance for the help, and I'll provide any further information if it's needed.
Here is an example of how to extract the preceding text with BeautifulSoup. I loaded the page with the requests module, but you can feed BeautifulSoup the HTML source from Selenium:
import requests
from bs4 import BeautifulSoup
url = 'http://www.boa.aragon.es/cgi-bin/EBOA/BRSCGI?CMD=VERLST&DOCS=1-200&BASE=BOLE&SEC=FIRMA&SEPARADOR=&PUBL=20200901'
soup = BeautifulSoup(requests.get(url).content, 'html.parser')
for t in soup.select('.archivos'):
    previous_text = t.find_previous(text=True).strip()
    link = t.a['href']
    print(previous_text)
    print('http://www.boa.aragon.es' + link)
    print('-' * 80)
Prints:
ORDEN HAP/804/2020, de 17 de agosto, por la que se modifica la Relación de Puestos de Trabajo de los Departamentos de Industria, Competitividad y Desarrollo Empresarial y de Economía, Planificación y Empleo.
http://www.boa.aragon.es/cgi-bin/EBOA/BRSCGI?CMD=VERDOC&BASE=BOLE&PIECE=BOLE&DOCS=1-22&DOCR=1&SEC=FIRMA&RNG=200&SEPARADOR=&&PUBL=20200901
--------------------------------------------------------------------------------
ORDEN HAP/805/2020, de 17 de agosto, por la que se modifica la Relación de Puestos de Trabajo del Departamento de Agricultura, Ganadería y Medio Ambiente.
http://www.boa.aragon.es/cgi-bin/EBOA/BRSCGI?CMD=VERDOC&BASE=BOLE&PIECE=BOLE&DOCS=1-22&DOCR=2&SEC=FIRMA&RNG=200&SEPARADOR=&&PUBL=20200901
--------------------------------------------------------------------------------
ORDEN HAP/806/2020, de 17 de agosto, por la que se modifica la Relación de Puestos de Trabajo del Organismo Autónomo Instituto Aragonés de Servicios Sociales.
http://www.boa.aragon.es/cgi-bin/EBOA/BRSCGI?CMD=VERDOC&BASE=BOLE&PIECE=BOLE&DOCS=1-22&DOCR=3&SEC=FIRMA&RNG=200&SEPARADOR=&&PUBL=20200901
--------------------------------------------------------------------------------
ORDEN ECD/807/2020, de 24 de agosto, por la que se aprueba el expediente relativo al procedimiento selectivo de acceso al Cuerpo de Catedráticos de Música y Artes Escénicas.
http://www.boa.aragon.es/cgi-bin/EBOA/BRSCGI?CMD=VERDOC&BASE=BOLE&PIECE=BOLE&DOCS=1-22&DOCR=4&SEC=FIRMA&RNG=200&SEPARADOR=&&PUBL=20200901
--------------------------------------------------------------------------------
RESOLUCIÓN de 28 de julio de 2020, de la Dirección General de Justicia, por la que se convocan a concurso de traslado plazas vacantes entre funcionarios de los Cuerpos y Escalas de Gestión Procesal y Administrativa, Tramitación Procesal y
Administrativa y Auxilio Judicial de la Administración de Justicia.
http://www.boa.aragon.es/cgi-bin/EBOA/BRSCGI?CMD=VERDOC&BASE=BOLE&PIECE=BOLE&DOCS=1-22&DOCR=5&SEC=FIRMA&RNG=200&SEPARADOR=&&PUBL=20200901
--------------------------------------------------------------------------------
...and so on.
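Since the question also asked about running JavaScript from Selenium, here is a minimal sketch of that route, assuming driver is already on the page: it walks backwards from each div.archivos to the nearest non-empty text node.
# Sketch: collect the text node immediately preceding each div.archivos.
texts = driver.execute_script("""
    return Array.from(document.querySelectorAll('div.archivos')).map(function (d) {
        var n = d.previousSibling;
        // skip <br> elements and whitespace-only nodes
        while (n && (n.nodeType !== Node.TEXT_NODE || !n.textContent.trim())) {
            n = n.previousSibling;
        }
        return n ? n.textContent.trim() : '';
    });
""")
for t in texts:
    print(t)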

Python Encode Ã³ as ó

I have a string like
'La empresa de capitales mixtos que opera el predio de residuos,
Ceamse, aclarÃ³ este martes que la responsabilidad del
desentendimiento con los recicladores informales que provocÃ³ un
nuevo bloqueo y hace peligrar la recolecciÃ³n'
and I need this:
'La empresa de capitales mixtos que opera el predio de residuos,
Ceamse, aclaró este martes que la responsabilidad del
desentendimiento con los recicladores informales que provocó un
nuevo bloqueo y hace peligrar la recolección'
How can I do this with Python?
Thanks!
You need to fix your web-scraping script!
It looks like La Capital sends proper HTTP headers and HTML head information, and the content is UTF-8 encoded. Your script needs to handle that, and everything will work fine.
I know from experience that requests.get and BeautifulSoup 4 both handle Unicode well, so just debug your script and see where it goes wrong. Check the raw input, check whether you need the page's .content or .text, and fix it accordingly.
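If the damage is already baked into a string, the classic mojibake round trip usually repairs it: the UTF-8 bytes were wrongly decoded as Latin-1, so reverse that step. A sketch (the La Capital URL is a hypothetical stand-in, not from the original post):
# Repair an already-garbled string by reversing the wrong decode.
broken = 'aclarÃ³ este martes'
fixed = broken.encode('latin-1').decode('utf-8')
print(fixed)  # aclaró este martes

# Better: avoid the damage at the source when scraping.
import requests
resp = requests.get('https://www.lacapital.com.ar/')  # hypothetical URL
resp.encoding = 'utf-8'  # force the declared page encoding
text = resp.text         # now decoded correctly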

Want to scrape only the text inside <ul>, without spaces and tags

I'm using XPath. I want to scrape from this URL: https://www.le-dictionnaire.com/definition/tout
I'm using this code, but it brings back spaces, new lines and the <li> tags from the <ul>:
def parse(self, response):
    print("processing: " + response.url)
    #Extract data using css selectors
    #product_name=response.css('.product::text').extract()
    #price_range=response.css('.value::text').extract()
    #Extract data using xpath
    title = response.xpath("//b/text()").extract()
    genre1 = response.xpath("(//span/text())[2]").extract()
    def1 = response.xpath("((//*[self::ul])[1])").extract()
    genre2 = response.xpath("(//span/text())[3]").extract()
    def2 = response.xpath("((//*[self::ul])[2])").extract()
    row_data = zip(title, genre1, def1, genre2, def2)
    #Making extracted data row wise
    for item in row_data:
        #create a dictionary to store the scraped info
        scraped_info = {
            #key:value
            'page': response.url,
            'title': item[0],  #item[0] is the first value in the row and so on; the index tells which value to assign
            'genere1': item[1],
            'def1': item[2],
            'genere2': item[3],
            'def2': item[4],
        }
        #yield the scraped info to scrapy
        yield scraped_info
When I add text() to the expressions:
def1 = response.xpath("((//*[self::ul])[1]/text())").extract()
def2 = response.xpath("((//*[self::ul])[2]/text())").extract()
it scrapes only blank spaces.
That happens because the text you want is not a direct child of the <ul> tag, so /text() returns only the text of direct children. You need the text from grandchildren of the <ul> tag, which is the text you want to scrape. For this purpose you can use //text() instead of /text(), or narrow down the XPath expression like:
"//*[#class='defbox'][n]//ul/li/a/text()"
By doing this you get a cleaner list output, and you can also make a clean string of it:
>>> def1 = response.xpath("//*[#class='defbox'][1]//ul/li/a/text()").getall()
>>> ' '.join(def1)
'Qui comprend l’intégrité, l’entièreté, la totalité d’une chose considérée par rapport au nombre, à l’étendue ou à l’intensité de l’énergie.\n\nS’emploie devant un nom précédé ou non d’un article, d’un dé
monstratif ou d’un possessif. S’emploie aussi devant un nom propre. S’emploie également devant ceci, cela, ce que, ce qui, ceux qui et celles qui. S’emploie aussi comme attribut après le verbe.'
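Applied to the spider above, only the def1/def2 lines need to change. A sketch (the [1]/[2] indices mirror the original expressions; each result is wrapped in a one-element list so the existing zip() still yields a single row):
# Inside parse(): take the nested link text and join it into one string.
def1 = [' '.join(response.xpath("//*[@class='defbox'][1]//ul/li/a/text()").getall())]
def2 = [' '.join(response.xpath("//*[@class='defbox'][2]//ul/li/a/text()").getall())]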

Parsing a tag in HTML

I know this question has been asked before, but I think not in this specific situation; if it has, feel free to point me to it.
I have an HTML file structured this way (you can view the original here):
<h5 id="foo1">Title 1</h5>
<table class="foo2">
<tbody>
<tr>
<td>
<h3 class="foo3">SomeName1</h3>
<img src="Somesource" alt="SomeName2" title="SomeTitle"><br>
<p class="textcode">
Some precious text here
</p>
</td>
...
</table>
I would like to extract the name, the image, and the text contained in the <p> of each table cell, for each h5 separately, meaning I would like to save each of these items in a separate folder named after the corresponding h5.
I tried this :
# coding: utf-8
import os
import re
from bs4 import BeautifulSoup as bs

os.chdir("WorkingDirectory")
# Select the HTML and parse its contents into the eponymous variable
with open("TheGoodPath.htm", "r") as html:
    html = bs(html, 'html.parser')
# Select the headers, restrict the results to the first six, and create the folders
h5 = html.find_all("h5", limit=6)
for h in h5:
    # Create the files named after the headers
    chemin = u"../Résulat/"
    nom = str(h.contents[0].string)
    os.makedirs(chemin + nom, exist_ok=True)
    # Select the sibling table located just after the header
    table = h.find_next_sibling(name='table')
    for t in table:
        # Select the headers containing the document titles
        h3 = t.find_all("h3")
        for k in h3:
            titre = str(k.string)
            # Create the directories named after the figures
            os.makedirs(chemin + nom + titre, exist_ok=True)
            os.fdopen(titre.tex)
            # Get the image from the sibling tag just after the previous header
            img = k.find_next_sibling("img")
            chimg = img.img['src']
            os.fdopen(img.img['title'])
            # Get the TikZ code from the sibling tag just after the previous header
            tikz = k.find_next_sibling('p')
            # Extract the TikZ code contained in the tag retrieved above
            code = tikz.get_text()
            # Define, then write, the preamble and the code needed to produce the image saved above
            preambule = r"%PREAMBULE \n \usepackage{pgfplots} \n \usepackage{tikz} \n \usepackage[european resistor, european voltage, european current]{circuitikz} \n \usetikzlibrary{arrows,shapes,positioning} \n \usetikzlibrary{decorations.markings,decorations.pathmorphing, decorations.pathreplacing} \n \usetikzlibrary{calc,patterns,shapes.geometric} \n %FIN PREAMBULE"
            with open(chemin + nom + titre, 'w') as result:
                result.write(preambule + code)
But it prints AttributeError: 'NavigableString' object has no attribute 'find_next_element' for h3 = t.find_all("h3"), line 21
This seems to be what you want. There only seems to be one table between each h5, so don't iterate over it; just use find_next and work with the table returned:
from bs4 import BeautifulSoup
import requests

cont = requests.get("http://www.physagreg.fr/schemas-figures-physique-svg-tikz.php").text
soup = BeautifulSoup(cont)
h5s = soup.find_all("h5", limit=6)
for h5 in h5s:
    # find first table after
    table = h5.find_next("table")
    # find all h3 elements in that table
    for h3 in table.select("h3"):
        print(h3.text)
        img = h3.find_next("img")
        print(img["src"])
        print(img["title"])
        print(img.find_next("p").text)
        print()
Which gives you output like:
repere-plan.svg
\begin{tikzpicture}[scale=1]
\draw (0,0) --++ (1,1) --++ (3,0) --++ (-1,-1) --++ (-3,0);
\draw [thick] [->] (2,0.5) --++(0,2) node [right] {z};
%thick : gras ; very thick : très gras ; ultra thick : hyper gras
\draw (2,0.5) node [left] {O};
\draw [thick] [->] (2,0.5) --++(-1,-1) node [left] {x};
\draw [thick] [->] (2,0.5) --++(2,0) node [below] {y};
\end{tikzpicture}
Lignes de champ et équipotentielles
images/cours-licence/em3/ligne-champ-equipot.svg
ligne-champ-equipot.svg
\begin{tikzpicture}[scale=0.8]
\draw[->] (-2,0) -- (2,0);
\draw[->] (0,-2) -- (0,2);
\draw node [red] at (-2,1.25) {\scriptsize{Lignes de champ}};
\draw node [blue] at (2,-1.25) {\scriptsize{Equipotentielles}};
\draw[color=red,domain=-3.14:3.14,samples=200,smooth] plot (canvas polar cs:angle=\x r,radius={3*sin(\x r)*3*sin(\x r)*5});
%r = angle en radian
%domain permet de définir le domaine dans lequel la fonction sera tracée
%samples=200 permet d'augmenter le nombre de points pour le tracé
%smooth améliore également la qualité de la trace
\draw[color=red,domain=-3.14:3.14,samples=200,smooth] plot (canvas polar cs:angle=\x r,radius={2*sin(\x r)*2*sin(\x r)*5});
\draw[color=blue,domain=-pi:pi,samples=200,smooth] plot (canvas polar cs:angle=\x r,radius={3*sqrt(abs(cos(\x r)))*15});
\draw[color=blue,domain=-pi:pi,samples=200,smooth] plot (canvas polar cs:angle=\x r,radius={2*sqrt(abs(cos(\x r)))*15});
\end{tikzpicture}
Fonction arctangente
images/schemas/math/arctan.svg
arctan.svg
\begin{tikzpicture}[scale=0.8]
\draw[very thin,color=gray] (-pi,pi) grid (-pi,pi);
\draw[->] (-pi,0) -- (pi,0) node[right] {$x$};
\draw[->] (0,-2) -- (0,2);
\draw[color=red,domain=-pi:pi,samples=150] plot ({\x},{rad(atan(\x))} )node[right,red] {$\arctan(x)$};
\draw[color=blue,domain=-pi:pi] plot ({\x},{rad(-atan(\x))} )node[right,blue] {$-\arctan(x)$};
%Le rad() est une autre façon de dire que l'argument est en radian
\end{tikzpicture}
To write all the .svg's to disk:
from bs4 import BeautifulSoup
import requests
from urllib.parse import urljoin  # Python 3 location; it was urlparse in Python 2
from os import path

cont = requests.get("http://www.physagreg.fr/schemas-figures-physique-svg-tikz.php").text
soup = BeautifulSoup(cont)
base_url = "http://www.physagreg.fr/"
h5s = soup.find_all("h5", limit=6)
for h5 in h5s:
    # find first table after
    table = h5.find_next("table")
    # find all h3 elements in that table
    for h3 in table.select("h3"):
        print(h3.text)
        img = h3.find_next("img")
        src, title = img["src"], img["title"]
        # join base url and image url
        img_url = urljoin(base_url, src)
        # open file using title as file name; binary mode, since .content is bytes
        with open(title, "wb") as f:
            # request the img url and write the content
            f.write(requests.get(img_url).content)
Which will give you arctan.svg, courbe-Epeff.svg, and all the rest of the images on the page.
It looks like (judging by the for t in table loop) you meant to find multiple "table" elements; as written, table is a single Tag, so for t in table iterates over its children, including NavigableStrings, which is what triggers the AttributeError. Use find_next_siblings() instead of find_next_sibling():
table = h.find_next_siblings(name='table')
for t in table:
