Links in Google App Engine are prepended with the site URL - python

I've had a weird issue that's been stumping me for days, and I really need to get it working tonight. I wrote an app in Python on Google App Engine (I'm assuming this is relevant to the issue), and whenever I include a link with the <a> tag, the link on the live site is prepended with my own site's URL.
For example, if I placed a link in the home page html to, say, YouTube, like so:
<a href="www.youtube.com">Clicky here</a>
...then on the live website, it'll be a link to www.mysitedomain.com/www.youtube.com
Needless to say, I get a 404 every time. I hope this is a simple issue to resolve, I'm really on a time crunch tonight. Thank you for any and all help!

Put http:// before your link, otherwise it is taken as a relative link.
<a href="http://www.youtube.com">Clicky here</a>
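To see why the scheme matters, here is a small sketch using urljoin from Python's standard library, which resolves a link against a base URL in the same way a browser resolves an href. The base URL is just the placeholder domain from the question.

```python
from urllib.parse import urljoin

# Placeholder for the page the link appears on.
base = "http://www.mysitedomain.com/"

# No scheme: the value is treated as a path relative to the current page.
relative = urljoin(base, "www.youtube.com")

# With a scheme: the value is treated as an absolute URL.
absolute = urljoin(base, "http://www.youtube.com")

print(relative)
print(absolute)
```

The first result is the broken link from the question; the second points at YouTube as intended.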

Related

How can I personalize a URL path in Django?

I am trying to build a website which renders some books and their pages. I want to make it possible to access a page like this:
path('/<str:book_pk>-<int:book_page>/', views.TestClass.as_view(), name='book-test')
I want users to reach a page with a simple URL such as mysite.com/5-12/, which takes them to book 5, page 12. The problem is that when I follow a link to this page from the website itself, using href, the real path becomes:
mysite.com/%2F5-13/
If I type mysite.com/5-13/ into the browser, I get a 404 Page Not Found, because the real path is mysite.com/%2F5-13/. This is pretty obvious, but my question is:
How can I stick to my initial path, and make the page accessible via mysite.com/5-13/? For some reason, the Django URL pattern adds an extra %2F at the beginning. Can somebody explain why, and how to solve this issue?
I much appreciate your time and effort! Thank you so much!
You don't have to include / at the beginning of the route, simply:
path('<str:book_pk>-<int:book_page>/', views.TestClass.as_view(), name='book-test')
The site root already supplies the first /, so the extra leading / from the route is encoded as %2F when Django builds the URL (read the full list here)
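As a rough illustration of the encoding itself (this uses only Python's standard library and does not reproduce Django's URL reversing), %2F is simply the percent-encoded form of /:

```python
from urllib.parse import quote, unquote

# With no characters marked as safe, "/" is percent-encoded.
encoded = quote("/", safe="")

# Decoding the path from the question shows the doubled slash behind it.
decoded = unquote("/%2F5-13/")

print(encoded)
print(decoded)
```

So mysite.com/%2F5-13/ is really a path with two leading slashes, which is why it does not match what you type by hand.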

How to scrape a website and all its directories from the one link?

Sorry if this is not a valid question; I personally feel it kind of borders on the edge.
Assuming the website involved has given full permission:
How could I download the ENTIRE contents (HTML) of that website using a Python data scraper? By entire contents I mean not only the current page you are on, but any other directory that branches off of that main website. E.g.
Using the link:
https://www.dogs.com
could I pull info from:
https://www.dogs.com/about-us
and any other directory attached to the "https://www.dogs.com/"
(I have no idea if dogs.com is a real website or not, it's just an example)
I have already made a scraper that will pull info from a certain link (nothing further than that), but I want to improve it so I don't have to keep heaps of links. I understand I can use an API, but if this is possible I would rather do it this way. Cheers!
While there is Scrapy to do it professionally, you can use requests to get the URL data and bs4 to parse the HTML and look into it. It's also easier to do for a beginner, I guess.
Whichever way you go, you need a starting point; then you just follow the links in that page, and then the links within those pages.
You will need to check whether each URL still points at the targeted website or links to another one. Find the pages one by one and scrape them.
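The approach above can be sketched roughly as follows. To keep the example dependency-free it uses the standard library's urllib and html.parser rather than requests and bs4, so treat it as an illustration of the idea, not a finished scraper. The names LinkParser, fetch_html and crawl, and the max_pages limit, are made up for this sketch.

```python
from collections import deque
from html.parser import HTMLParser
from urllib.parse import urldefrag, urljoin, urlparse
from urllib.request import urlopen


class LinkParser(HTMLParser):
    """Collect the href value of every <a> tag in a page."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)


def fetch_html(url):
    """Download one page (requests.get(url).text would work here too)."""
    with urlopen(url) as response:
        return response.read().decode("utf-8", errors="replace")


def crawl(start_url, fetch=fetch_html, max_pages=50):
    """Breadth-first crawl that never leaves the start URL's host."""
    host = urlparse(start_url).netloc
    queue = deque([start_url])
    seen = {start_url}
    pages = {}
    while queue and len(pages) < max_pages:
        url = queue.popleft()
        try:
            html = fetch(url)
        except Exception:
            continue  # skip pages that fail to load
        pages[url] = html
        parser = LinkParser()
        parser.feed(html)
        for href in parser.links:
            # Resolve relative links and drop any #fragment.
            link, _ = urldefrag(urljoin(url, href))
            # Only follow links that stay on the target website.
            if urlparse(link).netloc == host and link not in seen:
                seen.add(link)
                queue.append(link)
    return pages
```

Calling crawl("https://www.dogs.com") would return a dict mapping each visited URL to its HTML. The max_pages cap is there so a large site does not get hammered by accident.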

How to read a HTML page that takes some time to load? [duplicate]

I am trying to scrape a web site using Python and Beautiful Soup. I found that on some sites, the image links, although visible in the browser, cannot be seen in the source code. However, using Chrome Inspect or Fiddler, we can see the corresponding code.
What I see in the source code is:
<div id="cntnt"></div>
But in Chrome Inspect, I can see a whole bunch of HTML/CSS code generated within this div. Is there a way to load the generated content within Python as well? I am using the regular urllib in Python and I am able to get the source, but without the generated part.
I am not a web developer, hence I am not able to express the behaviour in better terms. Please feel free to ask for clarification if my question seems vague!
You need a JavaScript engine to parse and run the JavaScript code inside the page.
There are a bunch of headless browsers that can help you:
http://code.google.com/p/spynner/
http://phantomjs.org/
http://zombie.labnotes.org/
http://github.com/ryanpetrello/python-zombie
http://jeanphix.me/Ghost.py/
http://webscraping.com/blog/Scraping-JavaScript-webpages-with-webkit/
The content of the website may be generated after load via JavaScript. To obtain the generated content via Python, refer to this answer
A regular scraper gets just the HTML document. To get any content generated by JavaScript logic, you need a headless browser that will also build the DOM and load and run the scripts like a regular browser would. The Wikipedia article and some other pages on the Net have lists of those and their capabilities.
Keep in mind when choosing that some formerly major headless browsers are now abandoned.
TRY THIS FIRST!
Perhaps the data technically could be in the JavaScript itself and all this JavaScript engine business is needed. (Some GREAT links here!)
But from experience, my first guess is that the JS is pulling the data in via an Ajax request. If you can get your program to simulate that, you'll probably get everything you need handed right to you without any tedious parsing/executing/scraping involved!
It will take a little detective work though. I suggest turning on your network traffic logger (such as "Web Developer Toolbar" in Firefox) and then visiting the site. Focus your attention on any/all XMLHttpRequests. The data you need should be found somewhere in one of these responses, probably in the middle of some JSON text.
Now, see if you can re-create that request and get the data directly. (NOTE: You may have to set the User-Agent of your request so the server thinks you're a "real" web browser.)
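Re-creating such a request might look roughly like this, using only the standard library. The helper names build_request and fetch_json and the "Mozilla/5.0" User-Agent string are placeholders made up for this sketch, and it assumes the endpoint you found returns JSON.

```python
import json
from urllib.request import Request, urlopen


def build_request(url, user_agent="Mozilla/5.0"):
    """Build a request that identifies itself with a browser-like User-Agent."""
    return Request(url, headers={"User-Agent": user_agent})


def fetch_json(url):
    """Call the endpoint the page's JavaScript uses and decode its JSON body."""
    with urlopen(build_request(url)) as response:
        return json.loads(response.read().decode("utf-8"))
```

You would pass fetch_json the exact URL you saw in the network log, then pick the fields you need out of the resulting dict or list.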

Google Analytics API - Table ID

I am trying to access our site's Web usage statistics through Google Analytics API. I downloaded the Python code from here
http://code.google.com/p/gdata-python-client/
Under the samples/analytics folder, there is data_feed_demo.py. I ran it, but this code seems to want a table ID, and the docs do not make clear where this comes from. On the Web, some suggest using the profile ID, others say to look at some URL from the GA admin pages. I tried various sections of such URLs from the GA tool, but the code was not able to get data. Any ideas?
There was an answer here
https://groups.google.com/forum/?fromgroups#!topic/google-analytics-data-export-api/SdprtYcBLP4
When I logged in to GA, the URL of the main page was something like
https://www.google.com/analytics/web/?et=#dashboard...a[xxx]w[xxx]p[xxx]/
I took [xxx] out of p[xxx] and gave it to the sample script as ga:[xxx]. This worked. Funny thing is, I remember taking p values out of the URL before, but I guess I was not on the main page. Anyhow, this is the answer.
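If you want to do that extraction in code, a minimal sketch could look like this. It assumes the URL contains a segment of the form a[digits]w[digits]p[digits] as described above; the function name table_id_from_url and the numbers in the example URL are made up.

```python
import re


def table_id_from_url(url):
    """Turn the p[xxx] part of a GA dashboard URL into a ga:[xxx] table ID."""
    match = re.search(r"a\d+w\d+p(\d+)", url)
    if match is None:
        raise ValueError("no a...w...p... segment found in URL")
    return "ga:" + match.group(1)


# Hypothetical URL with made-up account, web property and profile numbers.
example = "https://www.google.com/analytics/web/?et=#dashboard/a11111w22222p33333/"
print(table_id_from_url(example))
```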
https://developers.google.com/analytics/solutions/articles/hello-analytics-api
This is the best way to get started with the GA API.

Parsing from a website -- source code does not contain the info I need

I'm a little new to web crawlers and such, though I've been programming for a year already. So please bear with me as I try to explain my problem here.
I'm parsing info from Yahoo! News, and I've managed to get most of what I want, but there's a little portion that has stumped me.
For example: http://news.yahoo.com/record-nm-blaze-test-forest-management-225730172.html
I want to get the numbers beside the thumbs up and thumbs down icons in the comments. When I use "Inspect Element" in my Chrome browser, I can clearly see the things that I have to look for - namely, an em tag under the div class 'ugccmt-rate'. However, I'm not able to find this in my python program. In trying to track down the root of the problem, I clicked to view source of the page, and it seems that this tag is not there. Do you guys know how I should approach this problem? Does this have something to do with the javascript on the page that displays the info only after it runs? I'd appreciate some pointers in the right direction.
Thanks.
The page is being generated via JavaScript.
Check if there is a mobile version of the website first. If not, check for any APIs or RSS/Atom feeds. If there's nothing else, you'll either have to manually figure out what the JavaScript is loading and from where, or use Selenium to automate a browser that renders the JavaScript for you for parsing.
Using the Web Console in Firefox you can pretty easily see what requests the page is actually making as it runs its scripts, and figure out what URI returns the data you want. Then you can request that URI directly in your Python script and tease the data out of it. It is probably in a format that Python already has a library to parse, such as JSON.
Yahoo! may have some stuff on their server side to try to prevent you from accessing these data files in a script, such as checking the browser (user-agent header), cookies, or referrer. These can all be faked with enough perseverance, but you should take their existence as a sign that you should tread lightly. (They may also limit the number of requests you can make in a given time period, which is impossible to get around.)
