I am new here and I am trying to scrape the nearest-stations-and-distances list from this link: https://www.onthemarket.com/details/10405122/ I have been stuck on this for a day; any help would be appreciated.
I have tried
response.xpath('//div[@class="tab-content"]/span')
response.xpath('//section//span[#class="poi-name"]')
response.xpath('//section[#class="poi"]/div//text()').extract()
Nothing seems to work.
If you are able to get it, please also explain why my attempts failed; that would be much appreciated.
The data is not in the downloaded html:
<ol class="tab-list"></ol><div class="tab-content"></div>
The page probably receives the data in a separate call. Don't rush into writing the scraper; invest some time in understanding how this particular UI works. I would also suggest downloading the page via curl or scrapy shell "your_url", since then it is not fetched by a browser, which renders the page with JavaScript and can trick you, as it did here.
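If you want to try that route, here is a minimal sketch: open the page with your browser's developer tools, watch the network tab for the XHR that fills the tab-content div, and replay that request directly. The endpoint URL and JSON shape below are invented placeholders, not the site's actual API; substitute whatever you actually see in the network tab:

import requests

# Placeholder endpoint -- use the real XHR URL from the network tab.
url = "https://www.onthemarket.com/details/10405122/stations"  # hypothetical
headers = {
    "User-Agent": "Mozilla/5.0",            # look like a browser
    "X-Requested-With": "XMLHttpRequest",   # some endpoints check for this
}

resp = requests.get(url, headers=headers)
resp.raise_for_status()

# Assumed response shape: a list of objects with "name" and "distance" keys.
for station in resp.json():
    print(station["name"], station["distance"])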
I'm using Python 2.7 (on Windows 7). I'm just trying to read a webpage using the urllib module and write it to a file. Below is my code.
import urllib

# Read the page as a list of lines.
html = urllib.urlopen("http://www.sciencedirect.com/science/article/pii/S027252311730076X").readlines()
print len(html)

# Use a raw string so backslashes in the Windows path are not
# treated as escape sequences (e.g. \t would become a tab).
g = open(r"D:\path\to\output\output.html", 'w')
for i in html:
    g.write(i)
g.close()
But when I compared the page source of the link above in the browser (right click -> View page source) with my output HTML file, they are different. A lot of information is missing from my output.html file. Why is that, and how can I get the original page source? I need it because I have to write more code to extract some specific info from this page.
Thanks for your help in advance.
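One likely cause, offered as a hedged sketch rather than a confirmed diagnosis: some servers return reduced content to clients that don't send browser-like headers, and any content built by JavaScript after page load will never appear in a plain urllib fetch. Sending a User-Agent header is a cheap first check (Python 2 syntax to match the question):

import urllib2

url = "http://www.sciencedirect.com/science/article/pii/S027252311730076X"

# Pretend to be a browser; some servers vary their response on this.
req = urllib2.Request(url, headers={"User-Agent": "Mozilla/5.0"})
html = urllib2.urlopen(req).read()

with open(r"D:\path\to\output\output.html", "w") as g:
    g.write(html)

# If the file is still missing content, that content is most likely
# rendered by JavaScript, and a real browser engine (e.g. Selenium)
# is needed to capture it.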
How can I save a webpage, including its content, so that it is viewable offline, using urllib in Python? Currently I am using the following code:
import urllib.request
from selenium import webdriver

driver = webdriver.Chrome()
driver.get("http://www.yahoo.com")
urllib.request.urlretrieve("http://www.yahoo.com", "C:\\Users\\karanjuneja\\Downloads\\kj\\yahoo.mhtml")
This runs and stores an .mhtml version of the webpage in the folder, but when you open the file you only see the source code, not the page as it appears online. Do we need to make changes to the code?
Also, is there an alternate way of saving the webpage in MHTML format with all the content as it appears online, and not just the source? Any suggestions?
Thanks Karan
I guess this page might help you:
Create an MHTML archive
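As a hedged alternative sketch: recent Selenium versions can ask Chrome itself for a real MHTML snapshot through the DevTools protocol (Page.captureSnapshot). This assumes Selenium 4 with a Chromium-based driver and replaces the urllib approach from the question entirely:

from selenium import webdriver

driver = webdriver.Chrome()
driver.get("http://www.yahoo.com")

# Ask Chrome to serialize the fully rendered page as MHTML.
snapshot = driver.execute_cdp_cmd("Page.captureSnapshot", {"format": "mhtml"})

with open(r"C:\Users\karanjuneja\Downloads\kj\yahoo.mhtml", "w", encoding="utf-8") as f:
    f.write(snapshot["data"])

driver.quit()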
I already asked a previous question, but it was posted under VBA tags etc. So I'll try again with proper tags and a better title, since I have hopefully gained a bit of knowledge now.
The problem:
I need to find ~1000 dates in a database of plant variety data, which is probably behind a login, so here is a screenshot. Now I could of course fill out the search form ~1000 times, but there must be a smarter way to do this. If it were a plain HTML site I would know what to do and have VBA just pull in the results, but I have been reading all morning about these JavaScript pages and AJAX libraries and it is above my level, so hopefully someone can help me out a bit. I also used Firebug to see what is going on when I press search:
The parameters below are the same as in the last screenshot, reproduced as text to make them easier to read and copy (names truncated in the screenshot are expanded using the encoded request further down).
f.cc.facet.limit = -1
f.cc.facet.mincount = 1
f.end_date.facet.date.end = 2030-01-01T00:00:00Z
f.end_date.facet.date.gap = +5YEARS
f.end_date.facet.date.other = all
f.end_date.facet.date.start = 1945-01-01T00:00:00Z
f.end_type.facet.limit = 20
f.end_type.facet.mincount = 1
f.grant_start_date.facet.date.end = NOW/YEAR
f.grant_start_date.facet.date.gap = +5YEARS
f.grant_start_date.facet.date.other = all
f.grant_start_date.facet.date.start = 1900-01-01T00:00:00Z
f.status.facet.limit = 20
f.status.facet.mincount = 1
f.type.facet.limit = 20
f.type.facet.mincount = 1
facet = true
facet.date = grant_start_date
facet.date = end_date
facet.field = cc
facet.field = type
facet.field = status
facet.field = end_type
fl = uc,cc,type,latin_name,common_name,common_name_en,common_name_others,app_num,app_date,grant_start_date,den_info,den_final,id
hl = true
hl.fl = cc,latin_name,den_info,den_final
hl.fragsize = 5000
hl.requireFieldMatch = false
json.nl = map
q = cc:IT AND latin_name:(Zea Mays) AND den_info:Antilles
qi = 3-9BgbCWwYBd7aIWPU1/onjQ==
rows = 25
sort = uc asc,score desc
start = 0
type = upov
wt = json
Source (the URL-encoded query string):
fl=uc%2Ccc%2Ctype%2Clatin_name%2Ccommon_name%2Ccommon_name_en%2Ccommon_name_others%2Capp_num%2Capp_date%2Cgrant_start_date%2Cden_info%2Cden_final%2Cid&hl=true&hl.fragsize=5000&hl.requireFieldMatch=false&json.nl=map&wt=json&type=upov&sort=uc%20asc%2Cscore%20desc&rows=25&start=0&qi=3-9BgbCWwYBd7aIWPU1%2FonjQ%3D%3D&hl.fl=cc%2Clatin_name%2Cden_info%2Cden_final&q=cc%3AIT%20AND%20latin_name%3A(Zea%20Mays)%20AND%20den_info%3AAntilles&facet=true&f.cc.facet.limit=-1&f.cc.facet.mincount=1&f.type.facet.limit=20&f.type.facet.mincount=1&f.status.facet.limit=20&f.status.facet.mincount=1&f.end_type.facet.limit=20&f.end_type.facet.mincount=1&f.grant_start_date.facet.date.start=1900-01-01T00%3A00%3A00Z&f.grant_start_date.facet.date.end=NOW%2FYEAR&f.grant_start_date.facet.date.gap=%2B5YEARS&f.grant_start_date.facet.date.other=all&f.end_date.facet.date.start=1945-01-01T00%3A00%3A00Z&f.end_date.facet.date.end=2030-01-01T00%3A00%3A00Z&f.end_date.facet.date.gap=%2B5YEARS&f.end_date.facet.date.other=all&facet.field=cc&facet.field=type&facet.field=status&facet.field=end_type&facet.date=grant_start_date&facet.date=end_date
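If you want to decode that string back into the readable name/value pairs listed above, Python's standard library can do it; a small sketch (Python 3):

from urllib.parse import parse_qsl

encoded = "fl=uc%2Ccc%2Ctype...&wt=json"  # paste the full string from above

# parse_qsl decodes %-escapes and keeps repeated keys
# (facet.field appears four times, facet.date twice).
for name, value in parse_qsl(encoded):
    print(name, "=", value)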
And this is what the response looks like, at least according to Firebug:
{"response":{"start":0,"docs":[{"id":"6751513","grant_start_date":"1999-02-04T22:59:59Z","den_final":"Antilles","app_num":"005642_A 005642","latin_name":"Zea mays L.","common_name_others":["MAIS"],"uc":"ZEAAA_MAY","type":"NLI","app_date":"1997-01-10T22:59:59Z","cc":"IT"}],"numFound":1},"qi":"3-9BgbCWwYBd7aIWPU1/onjQ==","facet_counts":{"facet_queries":{},"facet_ranges":{},"facet_dates":{"end_date":{"after":0,"start":"1945-01-01T00:00:00Z","before":0,"2010-01-01T00:00:00Z":1,"between":1,"end":"2030-01-01T00:00:00Z","gap":"+5YEARS"},"grant_start_date":{"after":0,"1995-01-01T00:00:00Z":1,"start":"1900-01-01T00:00:00Z","before":0,"between":1,"end":"2015-01-01T00:00:00Z","gap":"+5YEARS"}},"facet_intervals":{},"facet_fields":{"status":{"approved":1},"end_type":{"ter":1},"type":{"nli":1},"cc":{"it":1}}},"sv":"bswa1.wipo.int","lastUpdated":1435987857572,"highlighting":{"6751513":{"den_final":["Antilles<\/em>"],"latin_name":["Zea<\/em> mays<\/em> L."],"cc":["IT<\/em>"]}}}
Edit:
It uses the GET method and XMLHttpRequest, as can be seen from this screenshot:
I already found out how to run Python from Excel VBA here in this topic.
I also downloaded Beautiful Soup, but Python is not my kind of language, so any help would be greatly appreciated.
Image referred to in a comment on Will's answer
1) Use Excel to store your search parameters.
2) Run a few manual searches to find out which parameters you need to change on each request.
3) Invoke an HTTP GET request to the URL you found in Firebug/Fiddler (the URL the page calls when you click "search" manually). See urllib3: https://urllib3.readthedocs.org/en/latest/
4) Look at jsonpickle to help you deal with the JSON response, saving (serializing) it to a file.
5) Reading and writing data involves IO libraries. Google is your friend. (It is possibly easier to save your Excel file as a CSV and then just read the CSV file for your search parameters.)
6) Download PyCharm for your Python development - it's really good.
A sketch of steps 1-4 follows below. Hope this helps.
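A minimal sketch of that flow, assuming a params.csv with one search term per row and the select.jsp endpoint that the OP later found with Fiddler; the real endpoint may also require the qi token and the other captured parameters, so treat this as a template rather than a working scraper:

import csv
import json
import urllib3

http = urllib3.PoolManager()
url = "https://www3.wipo.int/pluto/user/jsp/select.jsp"

# params.csv is assumed to hold one den_info search term per row.
with open("params.csv") as f:
    terms = [row[0] for row in csv.reader(f)]

results = []
for term in terms:
    # Only the query changes per request; the remaining parameters from
    # the captured call would be added as further fields here.
    fields = {"wt": "json", "rows": "25", "start": "0",
              "q": "cc:IT AND latin_name:(Zea Mays) AND den_info:%s" % term}
    r = http.request("GET", url, fields=fields)
    results.append(json.loads(r.data.decode("utf-8")))

# Serialize all responses to one file for later processing.
with open("results.json", "w") as f:
    json.dump(results, f, indent=2)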
I finally figured it out. I don't need to use Python; I can just use a URL and then import the content into Excel. I found out with Fiddler that the URL should become https://www3.wipo.int/pluto/user/jsp/select.jsp? with the query string from my question appended after it.
The rest of my solution can be found in another question of mine. It uses no Python, only VBA, which tells IE to open the website and copies its content.
Alright, so the issue is this: I visit a site to download a file I want, but the site doesn't host the file itself; it uses Dropbox instead. As soon as you click download, you are redirected to a blank page where Dropbox pops up in a small window and lets you download the file. Note that there is no login, so I can point Python straight at the page where the Dropbox window pops up, but it won't download the file.
import urllib
url = 'https://thewebsitedownload.com'
filename = 'filetobedownloaded.exe'
urllib.urlretrieve(url, filename)
That's the code I used to use, and it worked like a charm for direct downloads. But when I try it on the site with the Dropbox popup, it just downloads the site's HTML (from what I can tell) instead of the actual file.
I am still relatively new to Python and coding in general, but I am loving it so far; this is just the first brick wall I have hit without finding any similar resolutions.
Thanks in advance! Sample code helps so much; that's how I have been learning so far.
Use BeautifulSoup to parse the HTML you get; you can then extract the href link to the file. There are a lot of BeautifulSoup tutorials on the web, so I think you'll find it fairly easy to work out how to get the link in your specific situation.
First, download the HTML with the code you already have, but without the filename:
import urllib
import re
from bs4 import BeautifulSoup

url = 'https://thewebsitedownload.com'

# Fetch the page that contains the Dropbox link.
text = urllib.urlopen(url).read()
soup = BeautifulSoup(text, "html.parser")

# Grab the first href that points at dropbox.
link = soup.find_all(href=re.compile("dropbox"))[0]['href']
print link

# Note: for Dropbox share links you may need to replace a trailing
# '?dl=0' with '?dl=1' to get the file itself rather than the preview page.
filename = 'filetobedownloaded.exe'
urllib.urlretrieve(link, filename)
I put this together from the docs and haven't tested it, but I think you get the idea.