I'm using scrapy to scrape data from this website: http://www.nuforc.org/webreports/ndxevent.html
I need to separate dates from counts of UFO sightings (yes, exciting!).
Here is an example of what I'm scraping:
<TR VALIGN=TOP>
<TD><FONT style=FONT-SIZE:11pt FACE="Calibri" COLOR=#000000><A HREF= ndxe201303.html>03/2013</A></TD>
<TD ALIGN=RIGHT><FONT style=FONT-SIZE:11pt FACE="Calibri" COLOR=#000000>108</TD>
So in this example date = 03/2013, count = 108
The dates are not a problem, since I can just do
hxs.select('//tbody//td//font//a//text()').extract()
to get the text within the "a" tag.
But is there a way to get the text from a td element that has the attribute ALIGN=RIGHT?
I have looked at the docs and the selectors, but I'm confused. Would it be something like
hxs.select('//tbody[contains(td, "ALIGN")]').extract()
?
This selects the text from every <td> that has the attribute ALIGN="RIGHT":
hxs.select('//tbody//td[@ALIGN="RIGHT"]//text()').extract()
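To keep each date paired with its count, one option is to walk the table row by row and apply the attribute test inside each row. The sketch below is only an illustration: it uses the standard library's ElementTree instead of Scrapy, it runs on a tidied, well-formed copy of the markup, and the second row is made up. HTML parsers such as lxml typically lowercase attribute names, so the predicate is written with `align` here.

```python
import xml.etree.ElementTree as ET

# Tidied copy of the markup from the question; the second row is invented for illustration.
snippet = """
<table>
  <tr valign="TOP">
    <td><font><a href="ndxe201303.html">03/2013</a></font></td>
    <td align="RIGHT"><font>108</font></td>
  </tr>
  <tr valign="TOP">
    <td><font><a href="ndxe201302.html">02/2013</a></font></td>
    <td align="RIGHT"><font>95</font></td>
  </tr>
</table>
"""

root = ET.fromstring(snippet)

pairs = []
for row in root.iter("tr"):
    # The date is the text of the link inside the row.
    date = row.findtext(".//a")
    # The count lives in the cell whose align attribute is RIGHT.
    count_cell = row.find("td[@align='RIGHT']")
    if date is None or count_cell is None:
        continue  # skip header or malformed rows
    count = "".join(count_cell.itertext()).strip()
    pairs.append((date, int(count)))

print(pairs)
```

Working per row avoids having to zip two separately extracted lists, which can drift out of step if one row lacks a cell.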
I am dealing with HTML table data consisting of two fields: First, a field that holds a hyperlinked text string, and second, one that holds a date string. I need the two to be extracted and remain associated.
I am catching the rows in the following fashion (found from another SO question):
pg = s.get(url).text # s = requests Session object
soup = BeautifulSoup(pg, 'html.parser')
files = [[
[td for td in tr.find_all('td')]
for tr in table.find_all('tr')]
for table in soup.find_all('table')]
Iterating over files[0] yields rows that have dynamic classes because the HTML was published from Microsoft Excel, so I can't depend on class names. But the location of the elements is stable. The rows look like this:
[<td class="excel auto tag" height="16" style="height:12.0pt;border-top:none"><a href="subfolder/north_america-latest.shp.zip"><span style='font-size:9.0pt;font-family:"Courier New", monospace;mso-font-charset:0'>north_america-latest.shp.zip</span></a></td>, <td class="another auto tag" style="border-top:none;border-left:none">2023-01-01</td>]
Broken up, for easier reading:
[
<td class="excel auto tag" height="16" style="height:12.0pt;border-top:none">
<a href="subfolder/north_america-latest.shp.zip">
<span style='font-size:9.0pt;font-family:"Courier New", monospace;mso-font-charset:0'>
north_america-latest.shp.zip
</span>
</a>
</td>,
<td class="another auto tag" style="border-top:none;border-left:none">
2023-01-01
</td>
]
Using the .get_text() method with td I can get the string literal of the link, as well as the date in one go, but once I have the td object, how do I go about obtaining the following three elements?
"subfolder/north_america-latest.shp.zip" # the link
"north_america-latest.shp.zip" # the name
"2023-01-01" # the date
Assuming that what you call 'row' is actually a string, here is how you would get those bits of information:
from bs4 import BeautifulSoup as bs
my_list = '''[<td class="excel auto tag" height="16" style="height:12.0pt;border-top:none"><a href="subfolder/north_america-latest.shp.zip"><span style='font-size:9.0pt;font-family:"Courier New", monospace;mso-font-charset:0'>north_america-latest.shp.zip</span></a></td>, <td class="another auto tag" style="border-top:none;border-left:none">2023-01-01</td>]'''
soup = bs(my_list, 'html.parser')
link = soup.select_one('td a').get('href')
text = soup.select_one('td a').get_text()
date = soup.select('td')[1].get_text()
print('link:', link)
print('text:', text)
print('date:', date)
Result in terminal:
link: subfolder/north_america-latest.shp.zip
text: north_america-latest.shp.zip
date: 2023-01-01
I'm not particularly convinced this is the actual key to your conundrum: surely there is a better way of getting the information you're after than that list comprehension you're using. As stated in the comments, without the actual page HTML, truly debugging this is next to impossible.
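If the rows are still BeautifulSoup tag objects rather than a string, the three values can be read from the cells by position, which matches the remark that the element positions are stable. This is a minimal sketch under that assumption; the surrounding table markup is a simplified stand-in for the real page, and the dictionary keys are names chosen for illustration.

```python
from bs4 import BeautifulSoup

# Simplified stand-in for the Excel-generated page (styles omitted).
html = """
<table>
  <tr>
    <td class="excel auto tag"><a href="subfolder/north_america-latest.shp.zip"><span>north_america-latest.shp.zip</span></a></td>
    <td class="another auto tag">2023-01-01</td>
  </tr>
</table>
"""

soup = BeautifulSoup(html, "html.parser")

records = []
for tr in soup.find_all("tr"):
    cells = tr.find_all("td")
    if len(cells) < 2:
        continue  # not a data row
    anchor = cells[0].find("a")
    if anchor is None:
        continue  # first cell has no link
    records.append({
        "link": anchor.get("href"),            # href of the hyperlink
        "name": anchor.get_text(strip=True),   # visible link text
        "date": cells[1].get_text(strip=True), # text of the second cell
    })

print(records)
```

Each record keeps the link, name, and date of one row together, so the association between them is never lost.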
I want to extract, say, value="THE TEXT IWANNA EXTRACT ;0" from the HTML code below. More precisely, I want to extract the strings inside the value attribute of the input in every td class="regu". I can't seem to find a way to do it: I have extracted the names of the people, but I can't extract the string inside the value attribute. Any help is much appreciated. Thank you. I've been stuck for about 24 hours already. I'm open to using other libraries, as long as I can extract it.
<table class="dbtable" border="0" width="100%">
<tbody><tr>
<td class="tableheader" align="center" width="1%"><b>#</b></td>
<td class="tableheader" align="center" width="60%"><b>User Name</b></td>
<td class="tableheader" align="center"><b>User Type</b></td>
</tr><tr bgcolor="#ffffff">
<td class="regu"><input name="chkStud" value="THE TEXT IWANNA EXTRACT ;0" type="checkbox"></td>
<td class="regu">NAME OF STUDENT HERE </td>
<td class="regu"> Student</td>
</tr><tr bgcolor="#ffffff">
<td class="regu"><input name="chkStud" value="PLEASE EXTRACT ME HERE, IM DYING TO GET OUT;0" type="checkbox"></td>
<td class="regu">FOO BAR FOO BAR</td>
<td class="regu"> Student</td>
</tr>
</tbody></table>
Here is the python code
#!/usr/bin/python
from selenium import webdriver
from selenium.webdriver.common.keys import Keys
from bs4 import BeautifulSoup
import logging
driver = webdriver.Firefox()
driver.get("http://somewebsite/iwannascrape/login.php") #page requires a login T_T
assert "Student" in driver.title
elem = driver.find_element_by_name("txtUser")
elem.clear()
elem.send_keys("YOU_SIR_NAME") #login creds. please dont mind :D
elem2 = driver.find_element_by_name("txtPwd")
elem2.clear()
elem2.send_keys("PASSSWORDHERE")
elem2.send_keys(Keys.RETURN)
driver.get("http://somewebsite/iwannascrape/afterlogin/illhere")
# using this to extract only the table with class='dbtable' so its easier to manipulate :)
table_clas = driver.find_element_by_xpath("//*[@class='dbtable']")
source_code = table_clas.get_attribute("outerHTML") #this holds the table and its children.
print(source_code)
for i in range(10): # spacing for readability
    print("\n")
print(table_clas.text) #this prints out the names.
Once you locate the desired element, use the get_attribute() method. Note that dbtable is a class, not an id, so the selector starts with a dot:
elm = driver.find_element_by_css_selector(".dbtable input[name=chkStud]")
print(elm.get_attribute("value"))
table_clas = driver.find_element_by_xpath("//*[@class='dbtable']")
# select the desired elements to thin down the HTML;
# the leading "." keeps the search inside the table
td = table_clas.find_elements_by_xpath(".//*[@name='chkStud']")
# finally, hunt down the element you want specifically.
# find_elements or find_element:
# if you use find_elements, it returns a list you can iterate over,
# like
for things in td:
    print(things.get_attribute("value"))
This prints:
THE TEXT IWANNA EXTRACT ;0
PLEASE EXTRACT ME HERE, IM DYING TO GET OUT;0
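Since the table's HTML is already available as a string (source_code in the question), the value attributes can also be read without further Selenium calls. Here is a minimal sketch using only the standard library's html.parser. The class name CheckboxValues is hypothetical, and the markup is a trimmed copy of the question's table.

```python
from html.parser import HTMLParser


class CheckboxValues(HTMLParser):
    """Collects the value attribute of every <input name="chkStud">."""

    def __init__(self):
        super().__init__()
        self.values = []

    def handle_starttag(self, tag, attrs):
        attributes = dict(attrs)
        if tag == "input" and attributes.get("name") == "chkStud":
            self.values.append(attributes.get("value"))


# Trimmed copy of the table from the question; in the real script this
# string would come from table_clas.get_attribute("outerHTML").
source_code = """
<table class="dbtable"><tbody>
<tr><td class="regu"><input name="chkStud" value="THE TEXT IWANNA EXTRACT ;0" type="checkbox"></td>
<td class="regu">NAME OF STUDENT HERE </td></tr>
<tr><td class="regu"><input name="chkStud" value="PLEASE EXTRACT ME HERE, IM DYING TO GET OUT;0" type="checkbox"></td>
<td class="regu">FOO BAR FOO BAR</td></tr>
</tbody></table>
"""

parser = CheckboxValues()
parser.feed(source_code)
values = parser.values
print(values)
```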
I'm trying to scrape movie information from the info box on Wikipedia using BeautifulSoup. I'm having trouble scraping movie budgets, as below.
For example, I want to scrape the '$25 million' budget value from the info box. How can I get the budget value, given that neither the th nor the td tags are unique? (See example HTML).
Say I have tag = soup.find('th') with the value
<th scope="row" style="white-space:nowrap;padding-right:0.65em;">Budget</th> - How can I get the value of '$25 million' from tag?
I thought I could do something like tag.td or tag.text but neither of these are working for me.
Do I have to loop over all tags and check if their text is equal to 'Budget', and if so get the following cell?
Example HTML Code:
<tr>
<th scope="row" style="white-space:nowrap;padding-right:0.65em;">Budget</th>
<td style="line-height:1.3em;">$25 million<sup id="cite_ref-2" class="reference">[2]</sup></td>
</tr>
<tr>
<th scope="row" style="white-space:nowrap;padding-right:0.65em;">Box office</th>
<td style="line-height:1.3em;">$65.7 million<sup id="cite_ref-BOM_3-0" class="reference">[3]</sup></td>
</tr>
You can first find the th node whose text is Budget, then find its next sibling td and get the text from that node:
soup.find("th", text="Budget").find_next_sibling("td").get_text()
# u'$25 million[2]'
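The same sibling lookup can be applied to every labelled row at once, dropping the footnote markers so that only the amount remains. This is a sketch, assuming the info box rows all follow the th/td pattern shown in the example HTML; the dictionary name info is chosen for illustration.

```python
from bs4 import BeautifulSoup

html = """
<table>
<tr>
<th scope="row" style="white-space:nowrap;padding-right:0.65em;">Budget</th>
<td style="line-height:1.3em;">$25 million<sup id="cite_ref-2" class="reference">[2]</sup></td>
</tr>
<tr>
<th scope="row" style="white-space:nowrap;padding-right:0.65em;">Box office</th>
<td style="line-height:1.3em;">$65.7 million<sup id="cite_ref-BOM_3-0" class="reference">[3]</sup></td>
</tr>
</table>
"""

soup = BeautifulSoup(html, "html.parser")

info = {}
for th in soup.find_all("th", scope="row"):
    td = th.find_next_sibling("td")
    if td is None:
        continue  # header without a value cell
    # Remove citation markers such as [2] before reading the text.
    for sup in td.find_all("sup"):
        sup.decompose()
    info[th.get_text(strip=True)] = td.get_text(strip=True)

print(info["Budget"])
```

Note that decompose() modifies the parsed tree, so the citation markers are gone from soup afterwards.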
To get every amount in the <td> tags, you can use
tags = soup.findAll('td')
and then
for tag in tags:
    print(tag.get_text()) # To get the text, e.g. '$25 million[2]'
What you need is the find_all() method in BeautifulSoup.
For example:
tdTags = soup.find_all('td',{'class':'reference'})
This finds all 'td' tags whose class is 'reference'.
You can find whatever td tags you want, as long as you can identify an attribute that is unique to the td tags you expect.
Then you can use a for loop to get the content, as @Bijoy said.
The other possible way might be:
split_text = soup.get_text().split('\n')
# The next index from Budget is cost
split_text[split_text.index('Budget')+1]
So I am trying to scrape an html using BeautifulSoup, but I am having problems finding a tag id using Python 3.4. I know what the tag ("tr") is, but the id is constantly changing and I would like to save the id when it changes. For example:
<div class="thisclass">
<table id="thistable">
<tbody>
<tr id="what i want">
<td class="someinfo"></td>
</tr>
</tbody>
</table>
</div>
I can find the div tag and the table, and I know the tr tag is there, but I want to extract the text next to id, without knowing what the text is going to say.
so far I have this code:
soup = BeautifulSoup(url.read())
divTag = soup.find_all("table",id ="thistable")
i = 0
for i in divTag:
    trtag = soup.find("tr", id)
    print(trtag)
    i = i+1
if anyone could help me solve this problem I would appreciate it.
You can use a css selector:
print([element.get('id') for element in soup.select('table#thistable tr[id]')])
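The same lookup can be written with find() and find_all() instead of a CSS selector, if that reads more easily. In BeautifulSoup, passing id=True matches any tag that has an id attribute, whatever its value. The markup below is a cleaned-up version of the sketch in the question.

```python
from bs4 import BeautifulSoup

html = """
<div class="thisclass">
<table id="thistable">
<tbody>
<tr id="what i want">
<td class="someinfo"></td>
</tr>
</tbody>
</table>
</div>
"""

soup = BeautifulSoup(html, "html.parser")

# Locate the table by its fixed id, then collect the id of every row in it.
table = soup.find("table", id="thistable")
row_ids = [tr["id"] for tr in table.find_all("tr", id=True)] if table else []

print(row_ids)
```

The collected ids can then be stored and compared against the values from a previous run to detect a change.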
I am trying to extract IP addresses from an autogenerated HTML table using the BeautifulSoup library, and I'm having a little trouble.
The HTML is structured like so:
<html>
<body>
<table class="mainTable">
<thead>
<tr>
<th>IP</th>
<th>Country</th>
</tr>
</thead>
<tbody>
<tr>
<td><a href="hello.html">127.0.0.1<a></td>
<td><img src="uk.gif" />uk</td>
</tr>
<tr>
<td><a href="hello.html">192.168.0.1<a></td>
<td><img src="uk.gif" />us</td>
</tr>
<tr>
<td><a href="hello.html">255.255.255.0<a></td>
<td><img src="uk.gif" />br</td>
</tr>
</tbody>
</table>
The small piece of code below extracts the text from both td cells in each row, but I only need the IP data, not the IP and country data:
from bs4 import BeautifulSoup
soup = BeautifulSoup(open("data.htm"))
table = soup.find('table', {'class': 'mainTable'})
for row in table.findAll("a"):
    print(row.text)
This outputs:
127.0.0.1
uk
192.168.0.1
us
255.255.255.0
br
What I need is the text of the IP table.tbody.tr.td.a elements and not the country table.tbody.tr.td.img.a elements.
Are there any experienced users of BeautifulSoup who would have any inkling of how to do this selection and extraction?
Thanks.
This gives you the right list:
>>> pred = lambda tag: tag.parent.find('img') is None
>>> list(filter(pred, soup.find('tbody').find_all('a')))
[127.0.0.1<a></a>, <a></a>, 192.168.0.1<a></a>, <a></a>, 255.255.255.0<a></a>, <a></a>]
Just apply .text to the elements of this list.
There are multiple empty <a></a> tags in above list because the <a> tags in the html are not closed properly. To get rid of them, you may use
pred = lambda tag: tag.parent.find('img') is None and tag.text
and ultimately:
>>> [tag.text for tag in filter(pred, soup.find('tbody').find_all('a'))]
['127.0.0.1', '192.168.0.1', '255.255.255.0']
You can use a small regular expression to extract the IP addresses. BeautifulSoup combined with regular expressions works well for scraping.
import re
ip_pat = re.compile(r"^\d{1,3}\.\d{1,3}\.\d{1,3}\.\d{1,3}$")
for row in table.findAll("a"):
    if ip_pat.match(row.text):
        print(row.text)
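One caveat with the pattern above is that \d{1,3} also accepts out-of-range octets such as 999. If that matters, the standard library's ipaddress module can do the validation instead. In this sketch the helper name is_ipv4 is hypothetical, and the candidate strings stand in for the row.text values, with one invalid address added for illustration.

```python
import ipaddress


def is_ipv4(text):
    """Return True if text is a valid dotted-quad IPv4 address."""
    try:
        ipaddress.IPv4Address(text.strip())
    except ValueError:
        return False
    return True


# Stand-ins for the row.text values; "999.1.1.1" is added to show the range check.
candidates = ["127.0.0.1", "uk", "192.168.0.1", "us", "255.255.255.0", "br", "999.1.1.1"]
ips = [text for text in candidates if is_ipv4(text)]

print(ips)
```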
Search just the first <td> of each row in tbody:
# html should contain the page content:
[row.find('td').getText() for row in bs4.BeautifulSoup(html).find('tbody').find_all('tr')]
or, maybe more readable:
rows = bs4.BeautifulSoup(html).find('tbody').find_all('tr')
iplist = [row.find('td').getText() for row in rows]
First note that the HTML is not well-formed. It is not closing the a tags. There are two <a> tags started here:
<a href="hello.html">127.0.0.1<a>
If you print table you'll see BeautifulSoup is parsing the HTML as:
<td>
127.0.0.1<a></a>
</td>
...
Each a is followed by an empty a.
Given the presence of those extra <a> tags, if you want every third <a> tag, then
for row in table.findAll("a")[::3]:
    print(row.get_text())
suffices:
127.0.0.1
192.168.0.1
255.255.255.0
On the other hand, if the occurrence of <a> tags is not so regular and you only want those <a> tags with no previous sibling (such as, but not limited to, <img>), then
for row in table.findAll("a"):
    sibling = row.findPreviousSibling()
    if sibling is None:
        print(row.get_text())
would work.
If you have lxml, the criteria can be expressed more succinctly using XPath:
import lxml.html as LH
doc = LH.parse("data.htm")
ips = doc.xpath('//table[@class="mainTable"]//td/a[not(preceding-sibling::img)]/text()')
print(ips)
The XPath used above has the following meaning:
//table                          select all <table> tags
[@class="mainTable"]             that have a class="mainTable" attribute
//                               from these tags select descendants
td/a                             which are td tags with a child <a> tag
[not(preceding-sibling::img)]    such that it does not have a preceding sibling <img> tag
/text()                          return the text of the <a> tag
It does take a little time to learn XPath, but once you learn it you may never want to use BeautifulSoup again.
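When the column position is reliable, a positional predicate is another way to express the selection in XPath: td[1] picks the first cell of each row. The sketch below assumes lxml is installed and, unlike the page in the question, uses markup in which the <a> tags are properly closed, so the parser does not have to repair anything.

```python
import lxml.html as LH

# Well-formed variant of the table from the question (closing </a> tags added).
html = """
<html><body>
<table class="mainTable">
<thead><tr><th>IP</th><th>Country</th></tr></thead>
<tbody>
<tr><td><a href="hello.html">127.0.0.1</a></td><td><img src="uk.gif" />uk</td></tr>
<tr><td><a href="hello.html">192.168.0.1</a></td><td><img src="uk.gif" />us</td></tr>
<tr><td><a href="hello.html">255.255.255.0</a></td><td><img src="uk.gif" />br</td></tr>
</tbody>
</table>
</body></html>
"""

doc = LH.fromstring(html)

# td[1] is the first <td> child of each row; take the text of its link.
ips = doc.xpath('//table[@class="mainTable"]//tr/td[1]/a/text()')

print(ips)
```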