Unable to scrape text from website using Scrapy - Python
I am new here and I am trying to scrape the list of nearest stations and distances from this link: https://www.onthemarket.com/details/10405122/ I have been stuck on this for a day; any help would be appreciated.
I have tried
response.xpath('//div[@class="tab-content"]/span')
response.xpath('//section//span[#class="poi-name"]')
response.xpath('//section[#class="poi"]/div//text()').extract()
Nothing seems to work.
If you are able to get it, please do explain why my attempts failed; that would be much appreciated.
The data is not in the downloaded HTML:
<ol class="tab-list"></ol><div class="tab-content"></div>
It probably receives the data in another call. Try not to rush into writing the scraper; invest some time in understanding how this particular UI works. I would also suggest downloading the data via curl or scrapy shell "your_url", since that way the page is not rendered by a browser, which can trick you into thinking the data is in the HTML (as it did here).
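As a quick sanity check, here is a minimal sketch that confirms the station data is absent from the raw HTML. The commented-out endpoint at the end is hypothetical; you would need to find the real XHR URL in your browser's Network tab:

    import requests

    url = 'https://www.onthemarket.com/details/10405122/'
    html = requests.get(url, headers={'User-Agent': 'Mozilla/5.0'}).text

    # Your selectors find nothing because these containers arrive empty:
    print('poi-name' in html)                     # likely False
    print('<div class="tab-content">' in html)    # empty container is present

    # Once you spot the XHR request in the Network tab, fetch it directly.
    # This URL is a placeholder -- substitute the real endpoint you find:
    # data = requests.get('https://www.onthemarket.com/some-api/stations/10405122/').json()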
Related
How to scrape a website and all its directories from the one link?
Sorry if this is not a valid question; I personally feel it borders on the edge. Assuming the website involved has given full permission, how could I download the ENTIRE contents (HTML) of that website using a Python data scraper? By entire contents I mean not only the current page you are on, but any other directory that branches off of that main website. E.g., using the link https://www.dogs.com, could I pull info from https://www.dogs.com/about-us and any other directory attached to "https://www.dogs.com/"? (I have no idea if dogs.com is a real website or not; it's just an example.) I have already made a scraper that will pull info from a single given link (nothing further than that), but I want to improve it so I don't have to supply heaps of links. I understand I can use an API, but if this is possible I would rather do it this way. Cheers!
While there is Scrapy to do this professionally, you can use requests to fetch the URL data and bs4 (BeautifulSoup) to parse the HTML and look into it. It's also easier for a beginner, I guess. Whichever way you go, you need a starting point; then you just follow the links in the page, and then the links within those pages. You might need to check whether a URL links to another website or is still within the targeted one. Find the pages one by one and scrape them, as in the sketch below.
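A minimal sketch of that idea, assuming requests and bs4 are installed; the start URL is just an example:

    from collections import deque
    from urllib.parse import urljoin, urlparse

    import requests
    from bs4 import BeautifulSoup

    start = 'https://www.dogs.com/'   # example starting point
    domain = urlparse(start).netloc

    seen, queue = {start}, deque([start])
    while queue:
        url = queue.popleft()
        html = requests.get(url).text
        # ... scrape whatever you need from `html` here ...
        for a in BeautifulSoup(html, 'html.parser').find_all('a', href=True):
            link = urljoin(url, a['href'])
            # Stay on the target site and avoid revisiting pages
            if urlparse(link).netloc == domain and link not in seen:
                seen.add(link)
                queue.append(link)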
Scraping webpage generated by javascript
I have a problem getting JavaScript content into HTML to use it for scraping. I used multiple methods such as PhantomJS or Python's Qt library, and they all get most of the content in nicely, but the problem is that there are JavaScript buttons inside the page (see the screenshot). When I load this page from a script, these buttons won't default to any value, so I am getting back 0 for all SELL/NEUTRAL/BUY values below. Is there a way to set these values when you load the page from a script? An example page with all the values is: https://www.tradingview.com/symbols/NEBLBTC/technicals/ Any help would be greatly appreciated.
If you are trying to achieve this with Scrapy or with a derivative of cURL or urllib, I am afraid you can't. Python has other external packages, such as Selenium, that allow you to interact with the JavaScript of the page, but the problem with Selenium is that it is too slow. If you want something closer to Scrapy, you could check how the site works (as far as I can see, it works through AJAX or websockets) and fetch the info you want through urllib, like you would do with an API. Please let me know if this makes sense or if I misunderstood your question.
I used Selenium, which was perfect for this job; it is indeed slow but fits my purpose. I also used the Selenium Firefox plugin to generate the Python script, as it was very challenging to find where exactly in the code the button I had to press was.
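For reference, a minimal Selenium sketch of that approach, assuming geckodriver is installed; the CSS selectors are hypothetical placeholders for whatever the Firefox plugin records:

    from selenium import webdriver
    from selenium.webdriver.common.by import By
    from selenium.webdriver.support.ui import WebDriverWait
    from selenium.webdriver.support import expected_conditions as EC

    driver = webdriver.Firefox()
    driver.get('https://www.tradingview.com/symbols/NEBLBTC/technicals/')

    wait = WebDriverWait(driver, 15)
    # Hypothetical selector -- record the real one with the Selenium IDE plugin
    button = wait.until(EC.element_to_be_clickable((By.CSS_SELECTOR, 'button.some-interval')))
    button.click()

    # Again hypothetical: read the rendered SELL/NEUTRAL/BUY counters
    for el in driver.find_elements(By.CSS_SELECTOR, '.counter'):
        print(el.text)

    driver.quit()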
Python- Downloading a file from a webpage by clicking on a link
I've looked around the internet for a solution to this, but none have really seemed applicable here. I'm writing a Python program to predict the next day's stock price using historical data. I don't need all the historical data since inception that Yahoo Finance provides, only the last 60 days or so. The NASDAQ website provides just the right amount of historical data, and I wanted to use that website. What I want to do is go to a particular stock's profile on NASDAQ, for example www.nasdaq.com/symbol/amd/historical, and click on the "Download this File in Excel Format" link at the very bottom. I inspected the page's HTML to see if there was an actual link I could just use with urllib to get the file, but all I got was: <a id="lnkDownLoad" href="javascript:getQuotes(true);"> Download this file in Excel Format </a> No direct link. So my question is: how can I write a Python script that goes to a given stock's NASDAQ page, clicks on the "Download this file in Excel Format" link, and actually downloads the file? Most solutions online require you to know the URL where the file is stored, but in this case I don't have access to that. So how do I go about doing this?
Using Chrome, go to View > Developer > Developer Tools. In this developer tools UI, change to the Network tab. Navigate to the place where you would need to click, and click the clear (🚫) icon to wipe all recent activity. Then click the link and see whether any requests were made to the server. If there were, click one and see if you can reverse engineer the API of its endpoint. Please be aware that this may be against the website's Terms of Service!
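Once you have found the request behind the link, you can usually replay it directly. A sketch under that assumption; the endpoint URL and parameters here are made up and must be replaced with whatever the Network tab actually shows:

    import requests

    # Hypothetical endpoint -- copy the real URL, query string, and headers
    # from the request you observed in the Network tab.
    url = 'http://www.nasdaq.com/some/excel/export?symbol=amd&days=60'
    headers = {
        'User-Agent': 'Mozilla/5.0',
        'Referer': 'http://www.nasdaq.com/symbol/amd/historical',
    }

    resp = requests.get(url, headers=headers)
    resp.raise_for_status()

    # Save the spreadsheet the server returns
    with open('amd_historical.xls', 'wb') as f:
        f.write(resp.content)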
It appears that BeautifulSoup might be the easiest way to do this. I've made a cursory check that the results of the following script are the same as those that appear on the page. You would just have to write the results to a file rather than print them. However, the columns are ordered differently.

    import requests
    from bs4 import BeautifulSoup

    URL = 'http://www.nasdaq.com/symbol/amd/historical'

    # Fetch the page and locate the historical-quotes table
    page = requests.get(URL).text
    soup = BeautifulSoup(page, 'lxml')
    tableDiv = soup.find_all('div', id="historicalContainer")
    tableRows = tableDiv[0].findAll('tr')

    # Skip the two header rows, then emit one CSV line per trading day
    for tableRow in tableRows[2:]:
        row = tuple(tableRow.getText().split())
        print('"%s",%s,%s,%s,%s,"%s"' % row)

Output:

    "03/24/2017",14.16,14.18,13.54,13.7,"50,022,400"
    "03/23/2017",13.96,14.115,13.77,13.79,"44,402,540"
    "03/22/2017",13.7,14.145,13.55,14.1,"61,120,500"
    "03/21/2017",14.4,14.49,13.78,13.82,"72,373,080"
    "03/20/2017",13.68,14.5,13.54,14.4,"91,009,110"
    "03/17/2017",13.62,13.74,13.36,13.49,"224,761,700"
    "03/16/2017",13.79,13.88,13.65,13.65,"44,356,700"
    "03/15/2017",14.03,14.06,13.62,13.98,"55,070,770"
    "03/14/2017",14,14.15,13.6401,14.1,"52,355,490"
    "03/13/2017",14.475,14.68,14.18,14.28,"72,917,550"
    "03/10/2017",13.5,13.93,13.45,13.91,"62,426,240"
    "03/09/2017",13.45,13.45,13.11,13.33,"45,122,590"
    "03/08/2017",13.25,13.55,13.1,13.22,"71,231,410"
    "03/07/2017",13.07,13.37,12.79,13.05,"76,518,390"
    "03/06/2017",13,13.34,12.38,13.04,"117,044,000"
    "03/03/2017",13.55,13.58,12.79,13.03,"163,489,100"
    "03/02/2017",14.59,14.78,13.87,13.9,"103,970,100"
    "03/01/2017",15.08,15.09,14.52,14.96,"73,311,380"
    "02/28/2017",15.45,15.55,14.35,14.46,"141,638,700"
    "02/27/2017",14.27,15.35,14.27,15.2,"95,126,330"
    "02/24/2017",14,14.32,13.86,14.12,"46,130,900"
    "02/23/2017",14.2,14.45,13.82,14.32,"79,900,450"
    "02/22/2017",14.3,14.5,14.04,14.28,"71,394,390"
    "02/21/2017",13.41,14.1,13.4,14,"66,250,920"
    "02/17/2017",12.79,13.14,12.6,13.13,"40,831,730"
    "02/16/2017",13.25,13.35,12.84,12.97,"52,403,840"
    "02/15/2017",13.2,13.44,13.15,13.3,"33,655,580"
    "02/14/2017",13.43,13.49,13.19,13.26,"40,436,710"
    "02/13/2017",13.7,13.95,13.38,13.49,"57,231,080"
    "02/10/2017",13.86,13.86,13.25,13.58,"54,522,240"
    "02/09/2017",13.78,13.89,13.4,13.42,"72,826,820"
    "02/08/2017",13.21,13.75,13.08,13.56,"75,894,880"
    "02/07/2017",14.05,14.27,13.06,13.29,"158,507,200"
    "02/06/2017",12.46,13.7,12.38,13.63,"139,921,700"
    "02/03/2017",12.37,12.5,12.04,12.24,"59,981,710"
    "02/02/2017",11.98,12.66,11.95,12.28,"116,246,800"
    "02/01/2017",10.9,12.14,10.81,12.06,"165,784,500"
    "01/31/2017",10.6,10.67,10.22,10.37,"51,993,490"
    "01/30/2017",10.62,10.68,10.3,10.61,"37,648,430"
    "01/27/2017",10.6,10.73,10.52,10.67,"32,563,480"
    "01/26/2017",10.35,10.66,10.3,10.52,"35,779,140"
    "01/25/2017",10.74,10.975,10.15,10.35,"61,800,440"
    "01/24/2017",9.95,10.49,9.95,10.44,"43,858,900"
    "01/23/2017",9.68,10.06,9.68,9.91,"27,848,180"
    "01/20/2017",9.88,9.96,9.67,9.75,"27,936,610"
    "01/19/2017",9.92,10.25,9.75,9.77,"46,087,250"
    "01/18/2017",9.54,10.1,9.42,9.88,"51,705,580"
    "01/17/2017",10.17,10.23,9.78,9.82,"70,388,000"
    "01/13/2017",10.79,10.87,10.56,10.58,"38,344,340"
    "01/12/2017",10.98,11.0376,10.33,10.76,"75,178,900"
    "01/11/2017",11.39,11.41,11.15,11.2,"39,337,330"
    "01/10/2017",11.55,11.63,11.33,11.44,"29,122,540"
    "01/09/2017",11.37,11.64,11.31,11.49,"37,215,840"
    "01/06/2017",11.29,11.49,11.11,11.32,"34,437,560"
    "01/05/2017",11.43,11.69,11.23,11.24,"38,777,380"
    "01/04/2017",11.45,11.5204,11.235,11.43,"40,742,680"
    "01/03/2017",11.42,11.65,11.02,11.43,"55,114,820"
    "12/30/2016",11.7,11.78,11.25,11.34,"44,033,460"
    "12/29/2016",11.24,11.62,11.01,11.59,"50,180,310"
    "12/28/2016",12.28,12.42,11.46,11.55,"71,072,640"
    "12/27/2016",11.65,12.08,11.6,12.07,"44,168,130"

The script escapes dates and thousands-separated numbers.
Dig a little deeper and find out what the JS function getQuotes() does; you should get a good clue from that. If it all seems too complicated, you can always use Selenium, which is used to simulate a browser. However, it is much slower than making native network calls. See the official Selenium documentation.
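If you do go the Selenium route, you can click the link by its id (lnkDownLoad, known from the page source quoted in the question) and let the browser handle the download. A rough sketch; the download-directory preferences are standard Firefox settings, but the MIME type the site serves is an assumption:

    from selenium import webdriver

    # Configure Firefox to save the file without a download prompt
    profile = webdriver.FirefoxProfile()
    profile.set_preference('browser.download.folderList', 2)   # use a custom dir
    profile.set_preference('browser.download.dir', '/tmp/quotes')
    profile.set_preference('browser.helperApps.neverAsk.saveToDisk',
                           'application/vnd.ms-excel')          # assumed MIME type

    driver = webdriver.Firefox(firefox_profile=profile)
    driver.get('http://www.nasdaq.com/symbol/amd/historical')

    # The link's id comes from the HTML quoted in the question
    driver.find_element_by_id('lnkDownLoad').click()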
Parsing from a website -- source code does not contain the info I need
I'm a little new to web crawlers and such, though I've been programming for a year already, so please bear with me as I try to explain my problem here. I'm parsing info from Yahoo! News, and I've managed to get most of what I want, but there's a little portion that has stumped me. For example: http://news.yahoo.com/record-nm-blaze-test-forest-management-225730172.html I want to get the numbers beside the thumbs-up and thumbs-down icons in the comments. When I use "Inspect Element" in my Chrome browser, I can clearly see the thing I have to look for: an em tag under the div class 'ugccmt-rate'. However, I'm not able to find this in my Python program. In trying to track down the root of the problem, I viewed the source of the page, and it seems that this tag is not there. Do you guys know how I should approach this problem? Does this have something to do with the JavaScript on the page that displays the info only after it runs? I'd appreciate some pointers in the right direction. Thanks.
The page is being generated via JavaScript. Check if there is a mobile version of the website first. If not, check for any APIs or RSS/Atom feeds. If there's nothing else, you'll either have to manually figure out what the JavaScript is loading and from where, or use Selenium to automate a browser that renders the JavaScript for you for parsing.
Using the Web Console in Firefox you can pretty easily see what requests the page is actually making as it runs its scripts, and figure out what URI returns the data you want. Then you can request that URI directly in your Python script and tease the data out of it. It is probably in a format that Python already has a library to parse, such as JSON. Yahoo! may have some stuff on their server side to try to prevent you from accessing these data files in a script, such as checking the browser (user-agent header), cookies, or referrer. These can all be faked with enough perseverance, but you should take their existence as a sign that you should tread lightly. (They may also limit the number of requests you can make in a given time period, which is impossible to get around.)
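For instance, once the Web Console reveals the comment-rating request, it might be replayed like this; the endpoint and field names are invented placeholders, and the headers show how the browser identity and referrer can be faked:

    import requests

    # Placeholder endpoint -- use the URI you saw in the Web Console
    url = 'http://news.yahoo.com/some/comments/api'
    headers = {
        'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; rv:45.0) Firefox/45.0',
        'Referer': 'http://news.yahoo.com/record-nm-blaze-test-forest-management-225730172.html',
    }

    resp = requests.get(url, headers=headers)
    data = resp.json()   # the payload is often JSON

    # The exact shape of `data` depends on the real endpoint
    for comment in data.get('comments', []):
        print(comment.get('thumbsUp'), comment.get('thumbsDown'))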
transferring real time data from a website in python
I am programming in Python. I would like to extract real-time data from a webpage without refreshing it: http://www.fxstreet.com/rates-charts/currency-rates/ I think the real-time part of the page is written in AJAX, but I am not quite sure. I thought about driving an actual browser from the program, but I do not really know/like that approach. Is there another way to do it? I would like to fill a dictionary in my program (or even a SQL database) with the latest numbers every second. Please help me in Python, thanks!
To get the data, you'll need to look through the JavaScript and HTML source to find the URL it hits to fetch the data it displays. Then you can request that URL with urllib or your favorite Python library and parse the response. It may also be easier if you use a plugin like Firebug that lets you watch the AJAX requests.
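Putting that together, a sketch of a once-per-second polling loop; the endpoint and the JSON field names are assumptions standing in for whatever the AJAX inspection turns up:

    import time
    import requests

    # Hypothetical endpoint discovered by watching the AJAX traffic
    RATES_URL = 'http://www.fxstreet.com/some/rates/feed'

    rates = {}   # latest quote per currency pair

    while True:
        payload = requests.get(RATES_URL).json()
        # Assumed payload shape: [{'pair': 'EUR/USD', 'price': 1.1234}, ...]
        for quote in payload:
            rates[quote['pair']] = quote['price']
        print(rates)
        time.sleep(1)   # poll once per second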