So I am using Scrapy to scrape books off a website.
I have the crawler working and it crawls fine, but when it comes to cleaning the HTML using XPath selectors, it is not quite working out. Since it is a book website, there are about 131 books on each page, and their XPaths come out like this.
For example getting the title of the books -
1st Book ---> /html/body/div/div[3]/div/div/div[2]/div/ul/li/a/span
2nd Book ---> /html/body/div/div[3]/div/div/div[2]/div/ul/li[2]/a/span
3rd Book ---> /html/body/div/div[3]/div/div/div[2]/div/ul/li[3]/a/span
The li[] index increases with each book. I am not sure how to turn this into a loop so that it catches all the titles. I have to do this for images and author names too, but I think it will be similar; I just need to get this initial one done.
Thanks for your help in advance.
There are different ways to do this.
The best way to select multiple nodes is on the basis of id or class, e.g.:
sel.xpath("//div[@id='id']")
Or you can select by position in a loop (note that XPath positions start at 1, not 0):
for i in range(1, upto_num_of_divs + 1):
    divs = sel.xpath("//div[%s]" % i)
Or you can select a whole range of positions in a single query:
divs = sel.xpath("//div[position() >= 1 and position() <= %s]" % upto_num_of_divs)
Here is an example of how you can parse your example HTML:
lis = hxs.select('//div/div[3]/div/div/div[2]/div/ul/li')
for li in lis:
    book_el = li.select('a/span/text()')
Often enough you can do something like //div[@class="final-price"]//span to get the list of all the spans in one XPath query. The exact expression depends on your HTML; this is just to give you an idea.
Otherwise the code above should do the trick.
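To make the loop idea concrete, here is a runnable sketch using lxml on a toy HTML fragment shaped like the paths in the question (the fragment and its book titles are invented): one relative query returns every li, so no per-book index arithmetic is needed.

```python
from lxml import html

# Toy fragment mimicking the question's ul/li/a/span structure (invented data).
fragment = """
<div>
  <ul>
    <li><a href="/b1"><span>Book One</span></a></li>
    <li><a href="/b2"><span>Book Two</span></a></li>
    <li><a href="/b3"><span>Book Three</span></a></li>
  </ul>
</div>
"""

doc = html.fromstring(fragment)

# One query selects all <li> nodes; each title is read relative to its <li>.
titles = [li.xpath("a/span/text()")[0] for li in doc.xpath("//ul/li")]
print(titles)  # ['Book One', 'Book Two', 'Book Three']
```

In Scrapy the same pattern applies: select the li nodes once, then query each one with a relative expression like a/span/text().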
Related
I am trying to scrape column names (player, cost, sel., form, pts) from the page below:
https://fantasy.premierleague.com/a/statistics/total_points
However, I am failing to do so.
Before I go further, let me show you what I have done.
from lxml import html
import requests
page = 'https://fantasy.premierleague.com/a/statistics/total_points'
#Take site and structure html
page = requests.get(page)
tree = html.fromstring(page.content)
#Using the page's CSS classes, extract the column-name cells
Location = tree.cssselect('.ism-thead-bold tr .ism-table--el-stats__name')
When I do this, Location should be a list that contains a string "Player".
However, it returns an empty list which means cssselect did not capture anything.
Though each column name has a different th class, I used one of them (ism-table--el-stats__name) for this specific trial, just to keep it simple.
Once this problem is fixed, I want to use a regex, since every class has a different suffix after the two underscores.
If anyone can help me with these two tasks, I would really appreciate it.
Thank you, guys.
I have a page with an item price, as shown in the attached image. I want to extract this price as 64.99, and I want to ask what the XPath would be to get this number, as I'm using Selenium WebDriver to find this price.
I have tried a lot of permutations of XPaths, but the problem is that this page has a lot of such products, so it is difficult to find a unique XPath for that price, e.g.:
//li[@class = 'price-current'] (gives 13 results on the page)
//*[@id = 'landingpage-price' and @class = 'price-current'] (gives no results)
Any help will be appreciated. Thanks
Since you mentioned there are a lot of such products, the question you are asking is the wrong one. You need to work out how to get to the product that you are interested in, and then find its price; you are trying to find the price directly.
Now, the issue in the below XPath
//*[@id = 'landingpage-price' and @class = 'price-current'] (gives no results)
is that you are putting both the id and the class condition on the same container element, rather than searching inside landingpage-price. I would suggest doing this with CSS, but I will show both XPath and CSS:
XPath
elem = driver.find_element_by_xpath("//div[@id = 'landingpage-price']//li[@class = 'price-current']")
print (elem.text.replace("$",""))
CSS
elem = driver.find_element_by_css_selector("#landingpage-price .price-current")
print (elem.text.replace("$",""))
Your XPath would break if the developers add more classes to the price, so using CSS is better, and it works as well. As you can see in the image below, it uniquely identifies the element.
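A toy lxml demonstration of the point above (the inline HTML is invented): matching on the class alone is ambiguous, but scoping the query inside the id container makes the match unique.

```python
from lxml import html

# Invented fragment: several elements share the price class, only one sits
# inside the landingpage-price container.
doc = html.fromstring("""
<div>
  <span class="price-current">$9.99</span>
  <div id="landingpage-price"><ul>
    <li class="price-current">$64.99</li>
  </ul></div>
  <span class="price-current">$5.49</span>
</div>
""")

# The class alone matches several nodes.
all_matches = doc.xpath("//*[@class='price-current']")
print(len(all_matches))  # 3

# Scoped inside the id, the match is unique.
scoped = doc.xpath("//div[@id='landingpage-price']//*[@class='price-current']")
print(scoped[0].text.replace("$", ""))  # 64.99
```

The same scoping is exactly what the `#landingpage-price .price-current` CSS selector expresses.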
I am trying to extract the most recent headlines from the following news site:
http://news.sina.com.cn/hotnews/
from selenium import webdriver
from selenium.webdriver.common.action_chains import ActionChains
import time

#save ids of relevant buttons that need to be clicked on the site
buttons_ids = ['Tab21' , 'Tab22', 'Tab32']
#save ids of relevant subsections
con_ids = ['Con11']
#start webdriver, go to site, hover over buttons
driver = webdriver.Chrome()
driver.get("http://news.sina.com.cn/hotnews/")
time.sleep(3)
for button_id in buttons_ids:
    button = driver.find_element_by_id(button_id)
    ActionChains(driver).move_to_element(button).perform()
Then I iterate through each section that I am interested in, and within each section through all the headlines, which are rows in an HTML table. However, on every iteration it returns the same first element:
for con_id in con_ids:
    for news_id in range(2,10):
        print(news_id)
        headline = driver.find_element_by_xpath("//div[@id='"+con_id+"']/table/tbody/tr["+str(news_id)+"]")
        text = headline.find_element_by_xpath("//td[2]/a")
        print(text.get_attribute("innerText"))
        print(text.get_attribute("href"))
        com_no = headline.find_element_by_xpath("//td[3]/a")
        print(com_no.get_attribute("innerText"))
I also tried the following approach by essentially saving the table as a list and then iterating through the rows:
for con_id in con_ids:
    table = driver.find_elements_by_xpath("//div[@id='"+con_id+"']/table/tbody/tr")
    for headline in table:
        text = headline.find_element_by_xpath("//td[2]/a")
        print(text.get_attribute("innerText"))
        print(text.get_attribute("href"))
        com_no = headline.find_element_by_xpath("//td[3]/a")
        print(com_no.get_attribute("innerText"))
In the second case I get exactly the number of headlines in the section, so it apparently picks up the number of rows correctly. However, it still only returns the first row on all iterations. Where am I going wrong? I know a similar question has been asked here: Selenium Python iterate over a table of rows it is stopping at the first row, but I am still unable to figure out my mistake.
In XPath, queries that begin with // will search relative to the document root; so even though you're calling find_element_by_xpath() on the correct container element, you're breaking out of that scope, thereby performing the same global search and yielding the same result every time.
To constrain your query to descendants of the current element, begin your query with .//, e.g.:
text = headline.find_element_by_xpath(".//td[2]/a")
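The difference is easy to demonstrate with lxml on a toy table (the rows below are invented): a query starting with // ignores the element it is called on and restarts from the document root, while .// stays inside that element.

```python
from lxml import etree

# Invented two-row table: each row's second cell holds a distinct link.
doc = etree.fromstring("""
<table>
  <tr><td>row 1</td><td><a href="/1">first</a></td></tr>
  <tr><td>row 2</td><td><a href="/2">second</a></td></tr>
</table>
""")

for row in doc.xpath("//tr"):
    # '//' restarts from the document root: always the same first match.
    absolute = row.xpath("//td[2]/a")[0].text
    # './/' searches only inside this row.
    relative = row.xpath(".//td[2]/a")[0].text
    print(absolute, relative)
```

The absolute query prints "first" for both rows, which is exactly the repeated-first-row symptom in the question; the relative query prints "first" and then "second".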
try this:
for con_id in con_ids:
    for news_id in range(2,10):
        print(news_id)
        print("(//div[@id='"+con_id+"']/table/tbody/tr)["+str(news_id)+"]")
        headline = driver.find_element_by_xpath("(//div[@id='"+con_id+"']/table/tbody/tr)["+str(news_id)+"]")
        value = headline.find_element_by_xpath(".//td[2]/a")
        print(value.get_attribute("innerText").encode('utf-8'))
I am able to get the headlines with the above code.
I was able to solve it by specifying the entire XPath in one go like this:
headline = driver.find_element_by_xpath("//*[@id='"+con_id+"']/table/tbody/tr["+str(news_id)+"]/td[2]/a")
print(headline.get_attribute("innerText"))
print(headline.get_attribute("href"))
rather than splitting it into two parts.
My only explanation for why it repeatedly printed the first row is that there is some weird JavaScript at work that doesn't let you iterate properly when splitting the request, or that my first version had a syntax error I am not aware of.
If anyone has a better explanation, I'd be glad to hear it!
I am trying to access table values which can be found here - https://www.sec.gov/Archives/edgar/data/320193/000119312515356351/d17062d10k.htm
Specifically, I am trying to access the Net sales figure for 2015 (i.e. 233,715), which can be found on page 39 of the 10-K form (see image).
Here is my code...
from lxml import html
import requests
SEC_page = requests.get('https://www.sec.gov/Archives/edgar/data/320193/000119312515356351/d17062d10k.htm')
SEC_tree = html.fromstring(SEC_page.content)
Description = SEC_tree.xpath('//html/body/document/type/sequence/filename/description/text()')
Sales_2015 = SEC_tree.xpath('//html/body/document/type/sequence/filename/description/text/table[48]/tbody/tr[4]/td[4]/font/text()')
print Description
print Sales_2015
We can see that 'Description' prints, i.e. ['FORM 10-K\n', '\n'].
However, 'Sales_2015' comes back empty, i.e. [].
What am I doing wrong?
It's quite hard to debug and find the problem in your expression because you use an absolute XPath; you should avoid absolute XPaths. Note that you refer to table[48]! 48, Carl! You'd better use a relative XPath, as it's more flexible, reliable and readable:
(//p[contains(., "CONSOLIDATED STATEMENTS OF OPERATIONS")]/following::td[contains(.,"Net sales")]/following-sibling::td[@align="right"]//text())[1]
Here we first find the table header with the text "CONSOLIDATED STATEMENTS OF OPERATIONS", then find the following table cell containing "Net sales", and grab the first right-aligned number in the same row, which is 233,715.
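To illustrate how those axes chain together, here is the same expression run with lxml against a tiny invented fragment (real 10-K markup is far messier; the cell values here are merely illustrative):

```python
from lxml import etree

# Invented miniature of the statements table: a header paragraph followed by
# a table whose rows have a label cell and right-aligned number cells.
doc = etree.fromstring("""
<div>
  <p>CONSOLIDATED STATEMENTS OF OPERATIONS</p>
  <table>
    <tr><td>Net sales</td><td align="right">233,715</td><td align="right">182,795</td></tr>
    <tr><td>Cost of sales</td><td align="right">140,089</td><td align="right">112,258</td></tr>
  </table>
</div>
""")

expr = ('(//p[contains(., "CONSOLIDATED STATEMENTS OF OPERATIONS")]'
        '/following::td[contains(., "Net sales")]'
        '/following-sibling::td[@align="right"]//text())[1]')
print(doc.xpath(expr))  # ['233,715']
```

following:: walks forward through the whole document from the header, while following-sibling:: stays within the matched row; the trailing [1] keeps only the first right-aligned cell.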
I am trying to scrape data from boxofficemojo.com and I have everything set up correctly. However, I am receiving a logical error that I cannot figure out. Essentially, I want to take the top 100 movies and write the data to a CSV file.
I am currently using html from this site for testing (Other years are the same): http://boxofficemojo.com/yearly/chart/?yr=2014&p=.htm
There's a lot of code however this is the main part that I am struggling with. The code block looks like this:
def grab_yearly_data(self,page,year):
    # page is the url that was downloaded, year in this case is 2014.
    rank_pattern=r'<td align="center"><font size="2">([0-9,]*?)</font>'
    mov_title_pattern=r'(.htm">[A-Z])*?</a></font></b></td>'
    #mov_title_pattern=r'.htm">*?</a></font></b></td>'  # Testing
    self.rank= [g for g in re.findall(rank_pattern,page)]
    self.mov_title=[g for g in re.findall(mov_title_pattern,page)]
self.rank works perfectly. However, self.mov_title does not store the data correctly. I am supposed to receive a list of 102 elements with the movie titles, but instead I receive 102 empty strings: ''. The rest of the program will be pretty simple once I figure out what I am doing wrong; I just can't find the answer to my question online. I've tried changing mov_title_pattern plenty of times, and I receive either nothing or 102 empty strings. Please help, I really want to move forward with my project.
Just don't attempt to parse HTML with regex. It will save you time and, most importantly, hair, and it will make your life easier.
Here is a solution using BeautifulSoup HTML parser:
from bs4 import BeautifulSoup
import requests
url = 'http://boxofficemojo.com/yearly/chart/?yr=2014&p=.htm'
response = requests.get(url)
soup = BeautifulSoup(response.content)
for row in soup.select('div#body td[colspan="3"] > table[border="0"] tr')[1:-3]:
    cells = row.find_all('td')
    if len(cells) < 2:
        continue
    rank = cells[0].text
    title = cells[1].text
    print rank, title
Prints:
1 Guardians of the Galaxy
2 The Hunger Games: Mockingjay - Part 1
3 Captain America: The Winter Soldier
4 The LEGO Movie
...
98 Transcendence
99 The Theory of Everything
100 As Above/So Below
The expression inside the select() call is a CSS selector, a convenient and powerful way of locating elements. But since the elements on this particular page are not conveniently mapped with ids or marked with classes, we have to rely on attributes like colspan or border. The [1:-3] slice is here to eliminate the header and the total rows.
For this page, to get to the table you can rely on the chart element and get its next table sibling:
for row in soup.find('div', id='chart_container').find_next_sibling('table').find_all('tr')[1:-3]:
    ...
mov_title_pattern=r'.htm">([A-Za-z0-9 ]*)</a></font></b></td>'
Try this. It should work for your case. See the demo:
https://www.regex101.com/r/fG5pZ8/6
Your regex does not make much sense. It matches the group (.htm">[A-Z]) as few times as possible, which is usually zero times, yielding an empty capture.
Moreover, with a very general regular expression like that, there is no guarantee that it only matches on the result rows. The generated page contains a lot of other places where you could expect to find .htm"> followed by something.
More generally, I would advocate an approach where you craft a regular expression which precisely identifies each generated result row, and extracts from that all the values you want. In other words, try something like
re.findall('stuff (rank) stuff (title) stuff stuff stuff')
(where I have left it as an exercise to devise a precise regular expression with proper HTML fragments where I have the stuff placeholders)
and extract both the "rank" group and the "title" group out of each matched row.
Granted, scraping is always brittle business. If you make your regex really tight, chances are it will stop working if the site changes some details in its layout. If you make it too relaxed, it will sometimes return the wrong things.
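As a toy illustration of that row-anchored approach (the HTML fragment below is invented, not the real Box Office Mojo markup): one pattern matches an entire result row's markup and captures the rank and the title together.

```python
import re

# Invented fragment shaped like the rank/title cells discussed above.
page = '''
<td align="center"><font size="2">1</font></td>
<td><b><font size="2"><a href="/movies/?id=guardians.htm">Guardians of the Galaxy</a></font></b></td>
<td align="center"><font size="2">2</font></td>
<td><b><font size="2"><a href="/movies/?id=lego.htm">The LEGO Movie</a></font></b></td>
'''

# One regex per row: anchor on the surrounding markup, capture rank and title.
row_pattern = (r'<td align="center"><font size="2">(\d+)</font></td>\s*'
               r'<td><b><font size="2"><a href="[^"]*">([^<]+)</a></font></b></td>')

for rank, title in re.findall(row_pattern, page):
    print(rank, title)
```

Because the pattern requires the full rank-then-title sequence, stray occurrences of .htm"> elsewhere on the page cannot produce false matches, which addresses the over-generality problem described above.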