I'm trying to write a small web scraper in python, and I think I've run into an encoding issue. I'm trying to scrape http://www.resident-music.com/tickets (specifically the table on the page) - a row might look something like this -
<tr>
<td style="width:64.9%;height:11px;">
<p><strong>the great escape 2017 local early bird tickets, selling fast</strong></p>
</td>
<td style="width:13.1%;height:11px;">
<p><strong>18<sup>th</sup>&ndash; 20<sup>th</sup> may</strong></p>
</td>
<td style="width:15.42%;height:11px;">
<p><strong>various</strong></p>
</td>
<td style="width:6.58%;height:11px;">
<p><strong>&pound;55.00</strong></p>
</td>
</tr>
I'm essentially trying to replace the &pound;55.00 with £55, and any other 'non-text' nasties.
I've tried a few of the different encoding options you can use with beautifulsoup and urllib2 - to no avail; I think I'm just doing it all wrong.
Thanks
You want to unescape the html which you can do using html.unescape in python3:
In [14]: from html import unescape
In [15]: h = """<tr>
....: <td style="width:64.9%;height:11px;">
....: <p><strong>the great escape 2017 local early bird tickets, selling fast</strong></p>
....: </td>
....: <td style="width:13.1%;height:11px;">
....: <p><strong>18<sup>th</sup>&ndash; 20<sup>th</sup> may</strong></p>
....: </td>
....: <td style="width:15.42%;height:11px;">
....: <p><strong>various</strong></p>
....: </td>
....: <td style="width:6.58%;height:11px;">
....: <p><strong>&pound;55.00</strong></p>
....: </td>
....: </tr>"""
In [16]:
In [16]: print(unescape(h))
<tr>
<td style="width:64.9%;height:11px;">
<p><strong>the great escape 2017 local early bird tickets, selling fast</strong></p>
</td>
<td style="width:13.1%;height:11px;">
<p><strong>18<sup>th</sup>– 20<sup>th</sup> may</strong></p>
</td>
<td style="width:15.42%;height:11px;">
<p><strong>various</strong></p>
</td>
<td style="width:6.58%;height:11px;">
<p><strong>£55.00</strong></p>
</td>
</tr>
For python2 use:
In [6]: from HTMLParser import HTMLParser
In [7]: unescape = HTMLParser().unescape
In [8]: print(unescape(h))
<tr>
<td style="width:64.9%;height:11px;">
<p><strong>the great escape 2017 local early bird tickets, selling fast</strong></p>
</td>
<td style="width:13.1%;height:11px;">
<p><strong>18<sup>th</sup>– 20<sup>th</sup> may</strong></p>
</td>
<td style="width:15.42%;height:11px;">
<p><strong>various</strong></p>
</td>
<td style="width:6.58%;height:11px;">
<p><strong>£55.00</strong></p>
</td>
</tr>
You can see both correctly unescape all entities, not just the pound sign.
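If the end goal is also to turn the price into £55 with no decimals, one option (a minimal sketch, assuming requests and bs4 are available) is to let bs4 parse the page, since it decodes entities like &pound; to real characters as it reads the text, and then strip the trailing ".00" with a small regex:
import re
import requests
from bs4 import BeautifulSoup

resp = requests.get('http://www.resident-music.com/tickets')
# bs4 decodes entities such as &pound; to real characters while parsing
soup = BeautifulSoup(resp.text, 'html.parser')

for row in soup.find_all('tr'):
    cells = [td.get_text(strip=True) for td in row.find_all('td')]
    # drop a trailing ".00" from any price cell, e.g. '£55.00' -> '£55'
    cells = [re.sub(r'(£\d+)\.00$', r'\1', cell) for cell in cells]
    print(cells)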
I used requests for this but hopefully you can do that using urllib2 also. So here is the code:
#!/usr/bin/env python
# -*- coding: utf-8 -*-
import requests
from BeautifulSoup import BeautifulSoup
soup = BeautifulSoup(requests.get('your_url').text)
chart = soup.findAll(name='tr')
print str(chart).replace('&pound;',unichr(163)) #replace '&pound;' with '£'
Now you should get the expected output!
Sample output:
...
<strong>£71.50</strong></p>
...
Anyway, you can do the parsing in many ways; what was interesting here is print str(chart).replace('&pound;',unichr(163)), which was quite challenging :)
Update
If you want to unescape more than one character (or even just one, like dashes, pounds, etc.), it would be easier and more efficient for you to use a parser, as in Padraic's answer. As you will also read in the comments, parsers handle other encoding issues as well.
from bs4 import BeautifulSoup
import numpy as np
import requests
from selenium import webdriver
from nltk.tokenize import sent_tokenize,word_tokenize
html = webdriver.Firefox(executable_path=r'D:\geckodriver.exe')
html.get("https://www.tsa.gov/coronavirus/passenger-throughput")
def TSA_travel_numbers(html):
    print('NASEEF')
    soup = BeautifulSoup(html,'lxml')
    print('naseef2')
    for i,rows in enumerate(soup.find_all('tr',class_='view-content')):
        print('naseef3')
        for texts in soup.find('td',header = 'view-field-2021-throughput-table-column'):
            print('naseef4')
            number = texts.text
            if number is None:
                continue
            print('Naseef')

TSA_travel_numbers(html.page_source)
As you can see, NASEEF and naseef2 get printed to the console, but not naseef3 and naseef4, and there is no error from this code; it runs fine. I don't know what is happening here - can anyone please point out what is really going on?
In other words, it is not going inside the for loops specified in that function.
Please help me, sorry for taking your time, and thanks in advance!
Your page does not contain <tr> tags with a class of view-content, so find_all is correctly returning no results. If you remove the class restriction, you get many results:
>>> soup.find_all('tr', limit=2)
[<tr>
<th class="views-align-center views-field views-field-field-today-date views-align-center" id="view-field-today-date-table-column" scope="col">Date</th>
<th class="views-align-center views-field views-field-field-2021-throughput views-align-center" id="view-field-2021-throughput-table-column" scope="col">2021 Traveler Throughput </th>
<th class="views-align-center views-field views-field-field-2020-throughput views-align-center" id="view-field-2020-throughput-table-column" scope="col">2020 Traveler Throughput </th>
<th class="views-align-center views-field views-field-field-2019-throughput views-align-center" id="view-field-2019-throughput-table-column" scope="col">2019 Traveler Throughput </th>
</tr>, <tr>
<td class="views-field views-field-field-today-date views-align-center" headers="view-field-today-date-table-column">5/9/2021 </td>
<td class="views-field views-field-field-2021-throughput views-align-center" headers="view-field-2021-throughput-table-column">1,707,805 </td>
<td class="views-field views-field-field-2020-throughput views-align-center" headers="view-field-2020-throughput-table-column">200,815 </td>
<td class="views-field views-field-field-2019-throughput views-align-center" headers="view-field-2019-throughput-table-column">2,419,114 </td>
</tr>]
Once you change that, the inner loop is looking for <td> tags with a header attribute of view-field-2021-throughput-table-column. There are no tags with that attribute in the page either, but there are tags whose headers attribute has that value.
This line is also wrong:
number = texts.text
...because texts is a NavigableString and does not have the text attribute.
Additionally, the word naseef is not really clear as to what it means, so it's better to replace that with more descriptive strings. Finally, you don't really need the Selenium connection or the tokenizer, so for the purposes of this example we can leave those out. The resulting code looks like this:
from bs4 import BeautifulSoup
import numpy as np
import requests
html = requests.get("https://www.tsa.gov/coronavirus/passenger-throughput").text
def TSA_travel_numbers(html):
    print('Entering parsing function')
    soup = BeautifulSoup(html,'lxml')
    print('Parsed HTML to soup')
    for i,rows in enumerate(soup.find_all('tr')):
        print('Found <tr> tag number', i)
        for texts in soup.find('td',headers = 'view-field-2021-throughput-table-column'):
            print('found <td> tag with headers')
            number = texts
            if number is None:
                continue
            print('Value is', number)

TSA_travel_numbers(html)
Its output looks like:
Entering parsing function
Parsed HTML to soup
Found <tr> tag number 0
found <td> tag with headers
Value is 1,707,805
Found <tr> tag number 1
found <td> tag with headers
Value is 1,707,805
Found <tr> tag number 2
found <td> tag with headers
...
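Note that the inner soup.find(...) above always searches the whole document, which is why the same 1,707,805 shows up for every row. If you want each row's own 2021 figure, a small variation (a sketch under the same assumptions about the page structure) is to search inside the row instead:
import requests
from bs4 import BeautifulSoup

html = requests.get("https://www.tsa.gov/coronavirus/passenger-throughput").text
soup = BeautifulSoup(html, 'lxml')

for i, row in enumerate(soup.find_all('tr')):
    # look up the 2021 cell inside this particular row only
    cell = row.find('td', headers='view-field-2021-throughput-table-column')
    if cell is None:
        continue  # header rows use <th>, so they have no matching <td>
    print('Row', i, 'value is', cell.get_text(strip=True))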
I have the following snippets of HTML which are part of a much larger HTML page:
<tr >
<th class="left">
<span tooltip haspopup="true" class="tip" title="A type of fruit">Oranges</span>:
</th>
<td class="reduce">
Seven
</td>
</tr>
<tr >
<th class="left">
Apples
</th>
<td>
Three
</td>
</tr>
When I execute the code:
soup.find_all(string='Oranges')
I get:
['Oranges']
Which is perfect.
However when I execute the code:
soup.find_all(string='Apples')
I get:
[]
Why isn't this working? I have a feeling it's to do with the whitespace and newlines etc. around the 'Apples' bit of the HTML code, however I can't work out how to catch it. I've tried the below, which have been fruitless.
soup.find_all(string='\n Apples\n ')
soup.find_all(string=' Apples ')
Would appreciate your help! Thanks.
P.s. I don't think it's important but ultimately I'm using a "findParent().fetchNextSiblings()[0].text.strip()" or similar to get the 'Seven' and 'Three' - which works in the former case but not in the latter.
Try:
import re
...
soup.find_all(text = re.compile(r"Apples", re.IGNORECASE))
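The exact-string search fails because the text node inside that <th> includes the surrounding newlines and indentation, so string='Apples' never matches it exactly, while the regex only needs a partial match. An alternative without a regex (a small sketch, assuming the cell text just needs whitespace stripped) is to pass a function, which also makes the follow-up sibling lookup straightforward:
from bs4 import BeautifulSoup

html = """
<tr>
  <th class="left">
    Apples
  </th>
  <td>
    Three
  </td>
</tr>
"""

soup = BeautifulSoup(html, 'html.parser')

# match any string that equals "Apples" once surrounding whitespace is stripped
apples = soup.find_all(string=lambda s: s.strip() == 'Apples')
print(apples)  # ['\n    Apples\n  ']

# walk to the sibling <td> to get 'Three', as in the P.s. above
value = apples[0].find_parent('th').find_next_sibling('td').get_text(strip=True)
print(value)  # Three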
I am trying to learn beautifulsoup to scrape HTML and have a difficult challenge. The HTML I am trying to scrape is not well formatted, and with my lack of knowledge of beautifulsoup I am kind of stuck.
The HTML I am trying to scrape is below:
<tr class="unique">
<td>S.N.</td>
<td>Traded Companies</td>
<td class="alnright">No. Of Transaction</td>
<td class="alnright">Max Price</td>
<td class="alnright">Min Price</td>
<td class="alnright">Closing Price</td>
<td class="alnright">Traded Shares</td>
<td class="alnright">Amount</td>
<td class="alnright">Previous Closing</td>
<td class="alnright">Difference Rs.</td>
</tr>
<tr>
<td>1</td>
<td>Agriculture Development Bank Limited</td>
<td class="alnright">47</td>
<td class="alnright">437.00</td>
<td class="alnright">426.00</td>
<td class="alnright">435.00</td>
<td class="alnright">9725.00</td>
<td class="alnright">4204614.00</td>
<td class="alnright">431.00</td>
<td class="alnright">4.00
<img src="http://www.nepalstock.com/images/increase.gif">
</td>
</tr>
So the outcome I want to get is the string "Agriculture Development Bank Limited".
Thanks in advance for your help !
I could help you more precisely if you described in general what you are looking for. Here is code that will satisfy this particular need.
from bs4 import BeautifulSoup
html_doc = """
Your HTML code
"""
soup = BeautifulSoup(html_doc, 'html.parser')
soup.find("td", text="Agriculture Development Bank Limited").text
I am beginning to learn python and would like to try to use BeautifulSoup to extract the elements in the below html.
This html is taken from a voice recording system that logs the time and date in local time, UTC, call duration, called number, name, calling number, name, etc
There are usually hundreds of these entries.
What I am attempting to do is extract the elements and print them in one line to a comma delimited format in order to compare with call detail records from call manager. This will help to verify that all calls were recorded and not missed.
I believe BeautifulSoup is the right tool to do this.
Could someone point me in the right direction?
<tbody>
<tr class="formRowLight">
<td class="formRowLight" >24/10/16<br>16:24:47</td>
<td class="formRowLight" >24/10/16 07:24:47</td>
<td class="formRowLight" >00:45</td>
<td class="formRowLight" >31301</td>
<td class="formRowLight" >Joe Smith</td>
<td class="formRowLight" >31111</td>
<td class="formRowLight" >Jane Doe</td>
<td class="formRowLight" >N/A</td>
<td class="formRowLight" >1432875648934</td>
<td align="center" class"formRowLight"> </td>
<tr class="formRowLight">
<td class="formRowLight" >24/10/16<br>17:33:02</td>
<td class="formRowLight" >24/10/16 08:33:02</td>
<td class="formRowLight" >00:58</td>
<td class="formRowLight" >35664</td>
<td class="formRowLight" >Billy Bob</td>
<td class="formRowLight" >227045665</td>
<td class="formRowLight" >James Dean</td>
<td class="formRowLight" >N/A</td>
<td class="formRowLight" >9934959586849</td>
<td align="center" class"formRowLight"> </td>
</tr>
</tbody>
pandas.read_html() would make things much easier - it would convert your tabular data from the HTML table into a dataframe which, if needed, you can later dump into CSV.
Here is a sample code to get you started:
import pandas as pd
data = """
<table>
<thead>
<tr>
<th>Date</th>
<th>Name</th>
<th>ID</th>
</tr>
</thead>
<tbody>
<tr class="formRowLight">
<td class="formRowLight">24/10/16<br>16:24:47</td>
<td class="formRowLight">Joe Smith</td>
<td class="formRowLight">1432875648934</td>
</tr>
<tr class="formRowLight">
<td class="formRowLight">24/10/16<br>17:33:02</td>
<td class="formRowLight">Billy Bob</td>
<td class="formRowLight">9934959586849</td>
</tr>
</tbody>
</table>"""
df = pd.read_html(data)[0]
print(df.to_csv(index=False))
Prints:
Date,Name,ID
24/10/1616:24:47,Joe Smith,1432875648934
24/10/1617:33:02,Billy Bob,9934959586849
FYI, read_html() actually uses BeautifulSoup to parse HTML under-the-hood.
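If the log lives in a file and you want a CSV on disk rather than printed output, the same approach extends directly (a sketch; the filenames are just examples, and it assumes the file contains the full <table> markup):
import pandas as pd

# read_html returns one dataframe per <table> found in the markup
with open("call_log.html") as f:        # example input filename
    tables = pd.read_html(f.read())

tables[0].to_csv("calls.csv", index=False)  # example output filename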
import BeautifulSoup
import urllib2

request = urllib2.Request('your_url')  # placeholder URL
response = urllib2.urlopen(request)
soup = BeautifulSoup.BeautifulSoup(response)

mylist = []
div = soup.findAll('tr', {"class": "formRowLight"})  # all rows of the call log table
for line in div:
    # first <td> of each row
    text = line.findNext('td', {"class": "formRowLight"}).text
    mylist.append(text)

print mylist
But you need to edit this code a little to prevent any duplicated content.
Yes, BeautifulSoup is a good tool to reach for in this problem. Something to get you started would be as follows:
from bs4 import BeautifulSoup

with open("my_log.html") as log_file:
    html = log_file.read()

soup = BeautifulSoup(html)
# normally you specify a parser too, e.g. `(html, 'lxml')`;
# without specifying a parser, it will warn you and select one automatically

table_rows = soup.find_all("tr")  # get list of all <tr> tags
for row in table_rows:
    table_cells = row.find_all("td")  # get list of all <td> tags in row
    joined_text = ",".join(cell.get_text() for cell in table_cells)
    print(joined_text)
However, pandas's read_html may make this a bit more seamless, as mentioned in another answer to this question. Arguably pandas may be a better hammer to hit this nail with, but learning to use BeautifulSoup for this will also give you the skills to scrape all kinds of HTML in the future.
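One caveat with joining cells by hand: if a cell's text ever contains a comma, the output becomes ambiguous. A small variation of the BeautifulSoup loop above (a sketch, using the same my_log.html assumption) lets the csv module handle the quoting:
import csv
import sys

from bs4 import BeautifulSoup

with open("my_log.html") as log_file:
    soup = BeautifulSoup(log_file.read(), "html.parser")

writer = csv.writer(sys.stdout)  # or point this at an open file instead
for row in soup.find_all("tr"):
    # one properly quoted CSV record per table row
    writer.writerow(cell.get_text(strip=True) for cell in row.find_all("td"))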
First get a list of HTML strings. To do that, follow Convert BeautifulSoup4 HTML Table to a list of lists, iterating over each Tag element.
Then perform the following operation on it; this will fetch all the element values you desire:
for element in html_list:
    output = soup.select(element)[0].text
    print("%s ," % output)
This will give you what you desire.
Hope that helps!
I have a few known formats in an HTML page, and I need to parse the content of the tags:
<TR>
<TD align=center>Reissue of:</TD>
<TD align=center> **VALUES_TO_FIND** </TD>
<TD> </TD>
</TR>
<TR>
<TD align=center> </TD>
</TR>
Basically I thought I could concatenate the HTML with a regular expression that will match anything inside the spot I'm looking for.
I know that the text before and after VALUES_TO_FIND will always be the same. How can I find it using RE? (I'm dealing with several cases, and the format can repeat in several places on the page.)
This is what you are looking for:
import re
s="""
<TR>
<TD align=center>Reissue of:</TD>
<TD align=center> **VALUES_TO_FIND** </TD>
<TD> </TD>
</TR>
"""
p="""
<TR>
<TD align=center>Reissue of:</TD>
<TD align=center>(.*)</TD>
<TD> </TD>
</TR>
"""
m=re.search(p, s)
print m.group(1)
Don't use regular expressions to parse HTML (it's not a regular language).
There are many threads on the topic at stackoverflow.
I recommend you use BeautifulSoup, Pattern, and similar modules.
There are many better options for getting data out of HTML than regular expressions. Try Scrapy, for example.
HTML isn't a regular language, so using regular expressions to work with it is difficult.
BeautifulSoup is a nice parser, here's an example how to use it:
from BeautifulSoup import BeautifulSoup
html = u'''
<TR>
<TD align=center>Reissue of:</TD>
<TD align=center> **VALUES_TO_FIND** </TD>
<TD> </TD>
</TR>
<TR>
<TD align=center> </TD>
</TR>'''
bs = BeautifulSoup(html)
print [td.contents for td in bs.findAll('td')]
output:
[[u'Reissue of:'], [u' **VALUES_TO_FIND** '], [u' '], [u' ']]
You know what to do from here. :)
Install with pip install BeautifulSoup. Here are the docs:
http://www.crummy.com/software/BeautifulSoup/bs3/documentation.html
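If what you ultimately need is just the value that follows the "Reissue of:" cell, here is a short sketch with the newer bs4 package (assuming the value cell always comes right after the label cell, as in the snippet above):
from bs4 import BeautifulSoup

html = '''
<TR>
<TD align=center>Reissue of:</TD>
<TD align=center> **VALUES_TO_FIND** </TD>
<TD> </TD>
</TR>
'''

soup = BeautifulSoup(html, 'html.parser')

# find the label cell, then read the <td> that follows it
label = soup.find('td', string='Reissue of:')
value = label.find_next_sibling('td').get_text(strip=True)
print(value)  # **VALUES_TO_FIND**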
This regular expression will do:
re.findall(r'<TR>\s+<TD.+?</TD>\s+<TD align=center>(.*?)</TD>',html,re.DOTALL)
But I recommend using a parser.