I'm trying to parse a webpage that is essentially a plain text document wrapped in HTML. I tried using BeautifulSoup to pull out the text and build a list, but I wasn't able to.
<body>
<pre>
--------------------
BDMEP - INMET
--------------------
Estação : PONTA PORA - MS (OMM: 83702)
Latitude (graus) : -22.55
Longitude (graus) : -55.71
Altitude (metros): 650.00
Estação Operante
Inicio de operação: 24/11/1941
Periodo solicitado dos dados: 01/01/2015 a 17/11/2016
Os dados listados abaixo são os que encontram-se digitados no BDMEP
Hora em UTC
--------------------
Obs.: Os dados aparecem separados por ; (ponto e vírgula) no formato txt.
Para o formato planilha XLS,
siga as instruções
--------------------
Estacao;Data;Hora;Precipitacao;TempMaxima;TempMinima;Insolacao;Evaporacao Piche;Temp Comp Media;Umidade Relativa Media;Velocidade do Vento Media;
83702;01/01/2015;0000;;;;;;;73.5;3.333333;
83702;06/01/2016;1200;5;;;;;;;;
83702;07/01/2016;0000;;;;;;;76.25;2.40072;
83702;01/02/2016;1200;15.2;;;;;;;;
</pre>
</body>
I'm interested in:
Estacao;Data;Hora;Precipitacao;TempMaxima;TempMinima;Insolacao;Evaporacao Piche;Temp Comp Media;Umidade Relativa Media;Velocidade do Vento Media;
83702;01/01/2015;0000;;;;;;;73.5;3.333333;
83702;06/01/2016;1200;5;;;;;;;;
83702;07/01/2016;0000;;;;;;;76.25;2.40072;
83702;01/02/2016;1200;15.2;;;;;;;;
Ideally I'd like to construct a DataFrame and save it as a CSV.
So far I tried stuff like:
soup = BeautifulSoup(a.content, 'html.parser')
soup = soup.find_all('pre')
text = []
for i in soup:
    print(i)
    text.append(i)
But it has not done the trick. It makes it all one entry in the list.
BeautifulSoup is useful for HTML tags, but here you mostly have plain text, so use string functions like split('\n') and slicing [start_row:end_row].
Your HTML text:
content = '''<body>
<pre>
--------------------
BDMEP - INMET
--------------------
Estação : PONTA PORA - MS (OMM: 83702)
Latitude (graus) : -22.55
Longitude (graus) : -55.71
Altitude (metros): 650.00
Estação Operante
Inicio de operação: 24/11/1941
Periodo solicitado dos dados: 01/01/2015 a 17/11/2016
Os dados listados abaixo são os que encontram-se digitados no BDMEP
Hora em UTC
--------------------
Obs.: Os dados aparecem separados por ; (ponto e vírgula) no formato txt.
Para o formato planilha XLS,
siga as instruções
--------------------
Estacao;Data;Hora;Precipitacao;TempMaxima;TempMinima;Insolacao;Evaporacao Piche;Temp Comp Media;Umidade Relativa Media;Velocidade do Vento Media;
83702;01/01/2015;0000;;;;;;;73.5;3.333333;
83702;06/01/2016;1200;5;;;;;;;;
83702;07/01/2016;0000;;;;;;;76.25;2.40072;
83702;01/02/2016;1200;15.2;;;;;;;;
</pre>
</body>'''
and
from bs4 import BeautifulSoup
soup = BeautifulSoup(content, 'html.parser')
text = soup.find('pre').text
lines = text.split('\n')
print(lines[-6:-1])
or in one line
print(content.split('\n')[-7:-2])
If the table has more rows, you can search for the last -------------------- to find the start of the table:
last = content.rfind(' --------------------')
lines = content[last:].split('\n')
print(lines[1:-2])
And now you can split lines into columns using split(';') to create data for pandas :)
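For example, a minimal sketch of that manual split, reusing lines from the snippet above (each row ends with ';', which leaves an empty trailing field that is dropped here):
import pandas as pd

table_lines = lines[1:-2]                      # header row + data rows
header = table_lines[0].split(';')[:-1]        # drop the empty name left by the trailing ';'
rows = [line.split(';')[:-1] for line in table_lines[1:]]
df = pd.DataFrame(rows, columns=header)
print(df)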
Or use io.StringIO to create a file-like object in memory and use pd.read_csv():
import pandas as pd
import io
last = content.rfind(' --------------------')
lines = content[last:].split('\n')[1:-2]
# create one string with table
text = '\n'.join(lines)
# create file-like object with text
fileobject = io.StringIO(text)
# use file-like object with read_csv()
df = pd.read_csv(fileobject, delimiter=';')
print(df)
or
import pandas as pd
import io
start = content.rfind(' --------------------')
start += len(' --------------------')
end = content.rfind(' </pre>')
text = content[start:end]
fileobject = io.StringIO(text)
df = pd.read_csv(fileobject, delimiter=';')
print(df)
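Since the goal is a CSV file, the resulting DataFrame (from either variant) can then be written out with to_csv; the filename here is just an example:
# write the parsed table to disk
df.to_csv('inmet.csv', sep=';', index=False)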
You can also do this job with re.
in:
import re
re.findall(r'\w+;.+\n', string=html)
out:
['Estacao;Data;Hora;Precipitacao;TempMaxima;TempMinima;Insolacao;Evaporacao Piche;Temp Comp Media;Umidade Relativa Media;Velocidade do Vento Media;\n',
'83702;01/01/2015;0000;;;;;;;73.5;3.333333;\n',
'83702;06/01/2016;1200;5;;;;;;;;\n',
'83702;07/01/2016;0000;;;;;;;76.25;2.40072;\n',
'83702;01/02/2016;1200;15.2;;;;;;;;\n']
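Those matches can then be fed straight into pandas to build the DataFrame and CSV the question asks for; a sketch, assuming html holds the page source (the output filename is arbitrary):
import io
import re
import pandas as pd

matches = re.findall(r'\w+;.+\n', string=html)  # header row plus data rows, as above
df = pd.read_csv(io.StringIO(''.join(matches)), delimiter=';')
df.to_csv('inmet.csv', sep=';', index=False)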
I'm new to Python. I scraped a site to get data such as season, teams, and position.
I want to save the data to a CSV.
The problem is that all the data gets listed on one line.
I would like to have a result like this:
Below my code:
import pdb
import re
import os
import json
import pandas as pd

'''Structure the data into a spreadsheet'''
Base = []
Saison = []
Position = []
Equipe = []
for ele in os.listdir('Data/Saison'):
    with open('Data/Saison/' + ele, 'r', encoding='utf8') as output:
        contenu = output.read()
    saison = ele.replace('.html', '')
    Saison.append(saison)
    pattern = '<td class="left first__left strong">(.{1,8})</td>'
    position = re.findall(pattern, contenu)
    Position.append(position)
    pattern = '<a class="list-team-entry" href="/fr/basketball/equipe/(.{1,4})'
    id_equipe = re.findall(pattern, contenu)
    pattern = '<a class="list-team-entry" href="/fr/basketball/equipe/(.{1,30})" title="(.{1,20})">'
    ens = re.findall(pattern, contenu)
    for e in ens:
        equipe = e[1]
        Equipe.append(equipe)
    pattern = '<td class="left highlight">(.{1,8})</td>'
    pourcent_victoire = re.findall(pattern, contenu)
Base.append([Saison, Position, Equipe])

'''Save to CSV'''
df = pd.DataFrame(Base, columns=['saison', 'position', 'equipe'])
df.to_csv('DataFinal/Base.csv', sep='|', encoding='utf8', index=False)
Thanks for your help.
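For what it's worth, the usual cure for this "everything ends up on one line" symptom is to build one record per row inside the loop and create the DataFrame from that list. A sketch reusing the patterns and paths from the code above, assuming the position and team lists are parallel for each season:
rows = []
for ele in os.listdir('Data/Saison'):
    with open('Data/Saison/' + ele, 'r', encoding='utf8') as output:
        contenu = output.read()
    saison = ele.replace('.html', '')
    positions = re.findall('<td class="left first__left strong">(.{1,8})</td>', contenu)
    equipes = [e[1] for e in re.findall(
        '<a class="list-team-entry" href="/fr/basketball/equipe/(.{1,30})" title="(.{1,20})">',
        contenu)]
    # one dict per (season, position, team) record
    for position, equipe in zip(positions, equipes):
        rows.append({'saison': saison, 'position': position, 'equipe': equipe})

df = pd.DataFrame(rows)
df.to_csv('DataFinal/Base.csv', sep='|', encoding='utf8', index=False)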
I need to extract from the URL all the Nacimientos years, for yesterday, today and tomorrow. I tried to extract all the <li> elements, but when a <div> appears it only extracts up to the <div>; I tried next_sibling and it didn't work either.
import requests
from bs4 import BeautifulSoup as soup

# Target page
url = "https://es.m.wikipedia.org/wiki/9_de_julio"

## Current number of articles in Spanish. ##
# Request the target URL
wikipedia2 = requests.get(url)
# If the status code is OK!
if wikipedia2.status_code == 200:
    nacimientos2 = soup(wikipedia2.text, "lxml")
else:
    print("The page responded with an error", wikipedia2.status_code)

filtro = nacimientos2.find("section", id="mf-section-2")
anios = filtro.find('ul').find_all('li')
lista2 = []
for data in anios:
    lista2.append(data.text[:4])
lista2
The main issue is that your selection focuses on the first <ul> and all its <li> elements. You can simply adjust the selection and skip the <ul>, because you are already working on a specific <section>.
As one line with a list comprehension and CSS selectors:
yearList = [e.text[:4] for e in soup.select('section#mf-section-2 li')]
or, based on your code, with anios = filtro.find_all('li'):
filtro= nacimientos2.find("section", id="mf-section-2")
anios= filtro.find_all('li')
lista2 = []
for data in anios:
    lista2.append(data.text[:4])
lista2
I am currently trying to scrape some information from the following link:
http://www2.congreso.gob.pe/Sicr/TraDocEstProc/CLProLey2001.nsf/ee3e4953228bd84705256dcd008385e7/4ec9c3be3fc593e2052571c40071de75?OpenDocument
I would like to scrape some of the information in the table using BeautifulSoup in Python. Ideally I would like to scrape the "Grupo Parlamentario," "Título," "Sumilla," and "Autores" fields from the table as separate items.
So far I've developed the following code using BeautifulSoup:
from bs4 import BeautifulSoup
import requests
import pandas as pd
url = 'http://www2.congreso.gob.pe/Sicr/TraDocEstProc/CLProLey2001.nsf/ee3e4953228bd84705256dcd008385e7/4ec9c3be3fc593e2052571c40071de75?OpenDocument'
page = requests.get(url)
soup = BeautifulSoup(page.text, 'html.parser')
table = soup.find('table', {'bordercolor' : '#6583A0'})
contents = []
summary = []
authors = []
contents.append(table.findAll('font'))
authors.append(table.findAll('a'))
What I'm struggling with is that the code to scrape the authors only scrapes the first author in the list, while ideally I need all of them. This seems odd to me because, looking at the HTML code of the webpage, every author in the list is wrapped in an <a href=...> tag, so I would expect table.findAll('a') to grab all of the authors.
Finally, I'm sort of just dumping the rest of the very messy HTML (title, summary, parliamentary group) into one long string under contents. I'm not sure if I'm missing something; I'm fairly new to HTML and web scraping, but is there a way to pull these items out and store them individually (i.e., just the title in one object, just the summary in another, etc.)? I'm having a tough time identifying unique tags in the page's code to do this. Or is this something I should just clean and parse after scraping?
To get the authors you can use:
soup.find('input', {'name': 'NomCongre'})['value']
output:
'Santa María Calderón Luis,Alva Castro Luis,Armas Vela Carlos,Cabanillas Bustamante Mercedes,Carrasco Távara José,De la Mata Fernández Judith,De La Puente Haya Elvira,Del Castillo Gálvez Jorge,Delgado Nuñez Del Arco José,Gasco Bravo Luis,Gonzales Posada Eyzaguirre Luis,León Flores Rosa Marina,Noriega Toledo Víctor,Pastor Valdivieso Aurelio,Peralta Cruz Jonhy,Zumaeta Flores César'
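Since you want the authors as separate items, that comma-separated string can then be split into a list:
# one author name per list entry
autores = soup.find('input', {'name': 'NomCongre'})['value'].split(',')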
To scrape Grupo Parlamentario:
table.find_all('td', {'width': 446})[1].text
output:
'Célula Parlamentaria Aprista'
To scrape Título:
table.find_all('td', {'width': 446})[2].text
output:
'IGV/SELECTIVO:D.L.821/LEY INTERPRETATIVA '
To scrape Sumilla:
table.find_all('td', {'width': 446})[3].text
output:
' Propone la aprobación de una Ley Interpretativa del Texto Original del Numeral 1 del Apéndice II del Decreto Legislativo N°821,modificatorio del Texto Vigente del Numeral 1 del Apéndice II del Texto Único Ordenado de la Ley del Impuesto General a las Ventas y Selectivo al Consumo,aprobado por Decreto Supremo N°054-99-EF. '
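Putting it together, the four fields can be stored individually, for instance in a dict and a one-row DataFrame (pandas is already imported in the question); a sketch based on the selectors above, with an arbitrary output filename:
celdas = table.find_all('td', {'width': 446})
registro = {
    'Grupo Parlamentario': celdas[1].text.strip(),
    'Titulo': celdas[2].text.strip(),
    'Sumilla': celdas[3].text.strip(),
    'Autores': soup.find('input', {'name': 'NomCongre'})['value'],
}
df = pd.DataFrame([registro])
df.to_csv('proyecto_ley.csv', index=False)  # example filename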
I have the following HTML code to scrape:
<ul class="item-features">
<li>
<strong>Graphic Type:</strong> Dedicated Card
</li>
<li>
<strong>Resolution:</strong> 3840 x 2160
</li>
<li>
<strong>Weight:</strong> 4.40 lbs.
</li>
<li>
<strong>Color:</strong> Black
</li>
</ul>
I would like to write all the single tags inside the list to a .csv file: Graphic Type, Resolution, Weight, etc., each in a different column.
I've tried the following in Python:
import bs4
from urllib.request import urlopen as req
from bs4 import BeautifulSoup as soup
url ='https://www.newegg.com/Laptops-Notebooks/SubCategory/ID-32?Tid=6740'
Client = req(url)
pagina = Client.read()
Client.close()
pagina_soup=soup(pagina,"html.parser")
productes = pagina_soup.findAll("div", {"class": "item-container"})
producte = productes[0]
features = producte.findAll("ul",{"class":"item-features"})
features[0].text
And it displays all the features but just in one single column of the .csv.
'\nGraphic Type: Dedicated CardResolution: 3840 x 2160Weight: 4.40 lbs.Color: Black\nModel #: AERO 15 OLED SA-7US5020SH\nItem #: N82E16834233268\nReturn Policy: Standard Return Policy\n'
I don't know how to export them one by one. Please see my whole Python code:
import bs4
from urllib.request import urlopen as req
from bs4 import BeautifulSoup as soup

# Link to the page we will scrape
url = 'https://www.newegg.com/Laptops-Notebooks/SubCategory/ID-32?Tid=6740'
# Open a connection to the web page
Client = req(url)
# Offloads the content of the page into a variable
pagina = Client.read()
# Closes the client
Client.close()
# html parser
pagina_soup = soup(pagina, "html.parser")
# grabs each product
productes = pagina_soup.findAll("div", {"class": "item-container"})
# Open a .csv file
filename = "ordinadors.csv"
f = open(filename, "w")
# Headers of my .csv file
headers = "Marca; Producte; PreuActual; PreuAnterior; Rebaixa; CostEnvio\n"
# Write the header
f.write(headers)
# Loop over all the products
for producte in productes:
    # Get the product brand
    marca_productes = producte.findAll("div", {"class": "item-info"})
    marca = marca_productes[0].div.a.img["title"]
    # Get the product name
    name = producte.a.img["title"]
    # Current price
    actual_productes = producte.findAll("li", {"class": "price-current"})
    preuActual = actual_productes[0].strong.text
    # Previous price
    try:
        preuAbans = producte.find("li", class_="price-was").next_element.strip()
    except:
        print("Not found")
    # Get the shipping costs
    costos_productes = producte.findAll("li", {"class": "price-ship"})
    # Since it is a list, take the first element and clean it
    costos = costos_productes[0].text.strip()
    # Writing the file
    f.write(marca + ";" + name.replace(",", " ") + ";" + preuActual + ";"
            + preuAbans + ";" + costos + "\n")
f.close()
keys = [x.find().text for x in pagina_soup.find_all('li')]
values = [x.find('strong').next_sibling.strip() for x in pagina_soup.find_all('li')]
print(keys)
print(values)
out:
Out[6]: ['Graphic Type:', 'Resolution:', 'Weight:', 'Color:']
Out[7]: ['Dedicated Card', '3840 x 2160', '4.40 lbs.', 'Black']
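To get those into separate columns of a .csv, one option is to write the keys as the header row and the values as the data row, e.g. with the csv module (the filename is just an example):
import csv

with open('features.csv', 'w', newline='') as out:
    writer = csv.writer(out)
    writer.writerow(keys)    # e.g. ['Graphic Type:', 'Resolution:', 'Weight:', 'Color:']
    writer.writerow(values)  # e.g. ['Dedicated Card', '3840 x 2160', '4.40 lbs.', 'Black']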
Does anyone know if I can optimize the code below, especially the for part?
I have several files in .htm format, and I read them to generate one large .txt file, but it's taking too long. Is there any way to optimize this code?
Below is the code:
##### Importing libraries
from bs4 import BeautifulSoup
import urllib.request
import os

##### Reading the files in the folder and saving the names in arquivos
os.chdir('C:\\Users\\US365NR\\Desktop\\PROJETO OI\\PEDIDOS_DEBORA\\RAZOES\\PARTE_2')
arquivos = os.listdir()

##### Creating and opening a unified txt document.
filename = 'UNIFICADO.txt'
file = open(filename, 'w')

##### Looping to read every file listed in arquivos.
for name in arquivos:
    nfLink = 'file:///C:/Users/US365NR/Desktop/PROJETO%20OI/PEDIDOS_DEBORA/RAZOES//PARTE_2//' + name
    print('WORKING ON FILE:')
    print(name)

    ##### Reading the htm file with BeautifulSoup
    c = urllib.request.urlopen(nfLink)
    soup = c.read()
    soup = BeautifulSoup(soup)
    print('FINISHED READING WITH BEAUTIFUL SOUP')

    ##### To keep track of what is happening
    N_LINHAS = 0
    LINHAS = []
    N_TABLE = 0
    TABELAS = []

    tables = soup.findAll('table')  ##### Finding all the tables
    N_TABLE = len(tables)
    for table in tables:  ##### For each table, read its rows
        rows = table.findAll('tr')[1:]
        N_LINHAS += len(rows)
        for tr in rows:  ##### Finding the columns
            cols = tr.findAll('td')
            for i in range(0, len(cols)):  ##### Saving the information to the txt file
                a = cols[i].text.replace('--*', '').replace('\n', '') + '|'
                file.write(a)
            file.write('\n')  ##### Next row
    LINHAS.append(N_LINHAS)
    TABELAS.append(N_TABLE)

    ##### Control prints
    print('TOTAL ROWS', LINHAS)
    print('TOTAL TABLES', TABELAS)
    print('END OF WORK ON FILE:')
    print(name)
    print('\n')
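One direction for speeding this up, as a sketch only: open the local .htm files directly instead of fetching them through urllib, parse with the faster lxml parser (assuming lxml is installed), and write one string per row instead of one write per cell. The encoding/errors arguments below are assumptions:
from bs4 import BeautifulSoup
import os

os.chdir('C:\\Users\\US365NR\\Desktop\\PROJETO OI\\PEDIDOS_DEBORA\\RAZOES\\PARTE_2')

with open('UNIFICADO.txt', 'w') as out:
    for name in os.listdir():
        # read the local file directly; no urllib round-trip needed
        with open(name, encoding='utf-8', errors='ignore') as htm:
            soup = BeautifulSoup(htm.read(), 'lxml')  # lxml is usually faster than html.parser
        for table in soup.find_all('table'):
            for tr in table.find_all('tr')[1:]:
                cells = [td.text.replace('--*', '').replace('\n', '') for td in tr.find_all('td')]
                out.write('|'.join(cells) + '|\n')  # one write per row instead of per cell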