I am trying to access table values which can be found here - https://www.sec.gov/Archives/edgar/data/320193/000119312515356351/d17062d10k.htm
Specifically, I am trying to access the Net sales figure for 2015 (i.e. 233,715), which can be found on page 39 of the 10-K form.
Here is my code...
from lxml import html
import requests
SEC_page = requests.get('https://www.sec.gov/Archives/edgar/data/320193/000119312515356351/d17062d10k.htm')
SEC_tree = html.fromstring(SEC_page.content)
Description = SEC_tree.xpath('//html/body/document/type/sequence/filename/description/text()')
Sales_2015 = SEC_tree.xpath('//html/body/document/type/sequence/filename/description/text/table[48]/tbody/tr[4]/td[4]/font/text()')
print Description
print Sales_2015
'Description' prints as expected, i.e. ['FORM 10-K\n', '\n'].
However, 'Sales_2015' comes back empty, i.e. [].
What am I doing wrong?
It's quite hard to debug an expression like this and find the problem, because you use an absolute XPath. You should avoid absolute XPaths. Note that you refer to table[48]! 48, Carl! You'd better use a relative XPath, as it's more flexible, reliable and readable:
(//p[contains(., "CONSOLIDATED STATEMENTS OF OPERATIONS")]/following::td[contains(.,"Net sales")]/following-sibling::td[@align="right"]//text())[1]
Here we first find the table header with the text "CONSOLIDATED STATEMENTS OF OPERATIONS", then find the following table cell containing "Net sales", and finally grab the first number in the same row, which is 233,715.
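To see the expression in action without hitting the SEC site, here is a minimal sketch against a made-up snippet. The SAMPLE markup is an assumption that only imitates the shape of the filing (a heading paragraph followed by a table with right-aligned number cells); the second number in the row is a placeholder, not real data.

```python
from lxml import html

# Hypothetical, simplified stand-in for the filing's markup (NOT the real page).
# Only 233,715 comes from the question; 000,000 is a placeholder value.
SAMPLE = """
<html><body>
<p>CONSOLIDATED STATEMENTS OF OPERATIONS</p>
<table>
  <tr>
    <td>Net sales</td><td>$</td>
    <td align="right">233,715</td>
    <td align="right">000,000</td>
  </tr>
</table>
</body></html>
"""

tree = html.fromstring(SAMPLE)

# Same relative XPath as above: anchor on the heading text, then on the
# "Net sales" label cell, then take the first right-aligned cell's text.
expression = (
    '(//p[contains(., "CONSOLIDATED STATEMENTS OF OPERATIONS")]'
    '/following::td[contains(., "Net sales")]'
    '/following-sibling::td[@align="right"]//text())[1]'
)

matches = tree.xpath(expression)
# xpath() returns a list; guard against an empty result instead of indexing blindly.
net_sales_2015 = matches[0].strip() if matches else None
print(net_sales_2015)
```

Because the expression anchors on visible text rather than on the table's position, it should keep working if tables are added or removed earlier in the document.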
Related
While crawling the website, I found that some of the text I want to pull has no class name or id that separates it from the rest of the page. The selector path I used with soup.select doesn't work consistently across repeated runs. As an example, I want to take the data below, but I don't know how to do it.
ex.
Just a guess: if you can get the table and you know the row, you can do the following. Use findAll to get all the rows as a list, then use slice syntax to access your element:
row = your_table_result.findAll('tr')[5::6]
EDITED AFTER QUESTION UPDATE
You can solve your problem in different ways, but first grab the table:
table = soup.find("table",{"class":"auflistung"})
Way #1 - You know the position where the information is stored
(be aware that the structure of the table can change or differ between pages)
rows = table.findAll('td')
name = rows[0].text.strip()
position = rows[6].text.strip()
Way #2 - You know the heading of the information
(works great because there is only one column)
name = table.find("th", text="Anavatandaki isim:").find_next_sibling("td").text.strip()
position = table.find("th", text="Mevki:").find_next_sibling("td").text.strip()
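Here is a small self-contained sketch of Way #2. The SAMPLE table and the value_for helper are made up for illustration; only the class name "auflistung" and the two heading texts come from the snippets above. It uses string= rather than text=, since newer BeautifulSoup versions prefer that keyword, and it returns None instead of raising when a heading is missing.

```python
from bs4 import BeautifulSoup

# Hypothetical markup that imitates a one-column "heading: value" table.
SAMPLE = """
<table class="auflistung">
  <tr><th>Anavatandaki isim:</th><td> Example Name </td></tr>
  <tr><th>Mevki:</th><td> Example Position </td></tr>
</table>
"""

soup = BeautifulSoup(SAMPLE, "html.parser")
table = soup.find("table", {"class": "auflistung"})


def value_for(table, heading):
    """Return the text of the <td> next to the <th> with this exact heading, or None."""
    th = table.find("th", string=heading)
    if th is None:
        return None
    td = th.find_next_sibling("td")
    return td.get_text(strip=True) if td is not None else None


name = value_for(table, "Anavatandaki isim:")
position = value_for(table, "Mevki:")
missing = value_for(table, "No such heading:")
print(name, position, missing)
```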
I'm having some issues in crawling this website search:
https://www.simplyhired.com/search?q=data+engineer&l=United+States&pn=1&job=ZMzeXt6JW0jMuZc6H-3Af3sqOGzeQMLj7X5mnXXv9ZteeAoGm6oDdg
I'm trying to extract these elements from the SimplyHired job search for Data Engineer in the US:
But when I try to locate any of them with XPath using the Selector module, I get different results in a different order.
Also, the outputs don't line up with each other (for example, the index of a job name from the job-name XPath doesn't match the index of its location from the location XPath).
Here is my code:
from scrapy import Selector
import requests
response = requests.get('https://www.simplyhired.com/search?q=data+engineer&l=united+states&mi=exact&sb=dd&pn=1&job=X1yGOt2Y8QTJm0tYqyptbgV9Pu19ge0GkVZK7Im5WbXm-zUr-QMM-A').content
sel=Selector(text=response)
#job name
sel.xpath('//main[@id="job-list"]/div/article[contains(@class,"SerpJob")]/div/div[@class="jobposting-title-container"]/h2/a/text()').extract()
#company
sel.xpath('//main[@id="job-list"]/div/article/div/h3[@class="jobposting-subtitle"]/span[@class="JobPosting-labelWithIcon jobposting-company"]/text()').extract()
#location
sel.xpath('//main[@id="job-list"]//div/article/div/h3[@class="jobposting-subtitle"]/span[@class="JobPosting-labelWithIcon jobposting-location"]/span/span/text()').extract()
#salary estimates
sel.xpath('//main[@id="job-list"]//div/article/div/div[@class="SerpJob-metaInfo"]//div[@class="SerpJob-metaInfoLeft"]/span/text()[2]').extract()
I'm not quite sure whether you're trying to use Scrapy or requests. Looks like you're wanting to use requests but with xpath selectors.
For websites like this, it's best to look at each individual job advert as a 'card'. You want to loop over each card with the XPATH selectors that you need to get the data you want.
Code Example
card = sel.xpath('//div[@class="SerpJob-jobCard card"]')
for a in card:
    title = a.xpath('.//a[@class="card-link"]/text()').get()
    company = a.xpath('.//span[@class="JobPosting-labelWithIcon jobposting-company"]/text()').get()
    salary = a.xpath('.//span[@class="jobposting-salary"]/text()').get()
    location = a.xpath('.//span[@class="jobposting-location"]/text()').get()
Explanation
You want to search each card with relative XPath selectors. The leading .// restricts the search to the chunk of HTML inside the current card.
Prefer get() over extract(). get() returns the first match as a string, or None if nothing matches, which is what we want when looping over each card. extract() called on a list of selectors always returns a list, even when there is only one match, which is often not what you want. Because extract() is ambiguous to read, use getall() when you do want multiple values: it is explicit and always returns a list.
Additional Information
If you find you're not getting the correct data in the right format, always check whether content is being added to the page by JavaScript. Turn off JavaScript in your browser and refresh the page. On this particular site, none of the data you require is loaded by JavaScript, which makes it much easier to scrape.
I am trying to scrape column names (player, cost, sel., form, pts) from the page below:
https://fantasy.premierleague.com/a/statistics/total_points
However, I am failing to do so.
Before I go further, let me show you what I have done.
from lxml import html
import requests
page = 'https://fantasy.premierleague.com/a/statistics/total_points'
#Take site and structure html
page = requests.get(page)
tree = html.fromstring(page.content)
#Using the page's CSS classes, extract all links pointing to a team
Location = tree.cssselect('.ism-thead-bold tr .ism-table--el-stats__name')
When I do this, Location should be a list that contains a string "Player".
However, it returns an empty list which means cssselect did not capture anything.
Though each column name has a different 'th class', I used one of them (ism-table--el-stats__name) for this specific trial just to make it simple.
When this problem is fixed, I want to use a regex, since every class has a different suffix after the two underscores.
If anyone can help me with these two tasks, I would really appreciate it!
Thank you, guys.
How can I get an element at this specific location:
The XPath is:
//*[@id="id316"]/span[2]
I got this path from the Google Chrome browser. I basically want to retrieve the number at this specific location with the following statement:
zimmer = response.xpath('//*[@id="id316"]/span[2]').extract()
However, I'm not getting anything but an empty string. I found out that the id value is different for each element in the list I'm interested in. Is there a way to write this expression so that it works regardless of the id number?
Use the corresponding label and get the following sibling element containing the value:
//span[. = 'Zimmer']/following-sibling::span/text()
As a bonus, the locator is also more readable.
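For illustration, here is a minimal sketch using lxml on invented markup. The SAMPLE snippet is an assumption about the page's shape (a label span followed by a value span, with ids that differ per row), not the real HTML.

```python
from lxml import html

# Hypothetical markup: the ids change from row to row, the labels do not.
SAMPLE = """
<div>
  <div id="id316"><span>Zimmer</span><span>3</span></div>
  <div id="id317"><span>Etage</span><span>2</span></div>
</div>
"""

tree = html.fromstring(SAMPLE)

# Anchor on the label text instead of the generated id.
zimmer = tree.xpath("//span[. = 'Zimmer']/following-sibling::span/text()")
etage = tree.xpath("//span[. = 'Etage']/following-sibling::span/text()")
print(zimmer, etage)
```

The same expressions can be passed to response.xpath(...) in Scrapy, followed by .get() to obtain a single string.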
So I am using Scrapy to scrape the books off a website.
I have the crawler working and it crawls fine, but when it comes to extracting data from the HTML with XPath selectors, it is not working out right. Since it is a book website, I have almost 131 books on each page, and their XPaths look like this:
For example getting the title of the books -
1st Book --- > /html/body/div/div[3]/div/div/div[2]/div/ul/li/a/span
2nd Book ---> /html/body/div/div[3]/div/div/div[2]/div/ul/li[2]/a/span
3rd book ---> /html/body/div/div[3]/div/div/div[2]/div/ul/li[3]/a/span
The li[] index increases with each book. I am not sure how to put this into a loop so that it catches all the titles. I have to do this for images and author names too, but I think it will be similar. I just need to get this initial one done.
Thanks for your help in advance.
There are different ways to do this.
The best way to select multiple nodes is to select on the basis of ids or classes,
e.g.:
sel.xpath("//div[@id='id']")
You can select by index like this (XPath positions start at 1, not 0):
for i in range(1, upto_num_of_divs):
    nodes = sel.xpath("//div[%s]" % i)
Or you can select the same range of positions with a single expression:
nodes = sel.xpath("//div[position() >= 1 and position() < %d]" % upto_num_of_divs)
Here is an example of how you can parse your example HTML:
# .select() is the legacy Scrapy selector API; newer versions use .xpath()
lis = sel.xpath('//div/div[3]/div/div/div[2]/div/ul/li')
for li in lis:
    book_el = li.xpath('a/span/text()')
Often enough you can do something like //div[@class="final-price"]//span to get the list of all the spans in one XPath. The exact expression depends on your HTML; this is just to give you an idea.
Otherwise the code above should do the trick.