Reading Google Trends time series data with Python

I'm trying to read the following URL into Python:
http://www.google.com/trends/fetchComponent?q=nepal&cid=TIMESERIES_GRAPH_0&export=3
with the code:
import urllib2

trend_url = 'http://www.google.com/trends/fetchComponent?q=nepal&cid=TIMESERIES_GRAPH_0&export=3'
response = urllib2.urlopen(trend_url)
the_page = response.read()
The resulting value of the_page, for reasons that I don't understand, is an error page.
UPDATE: I think that the problem is related to some authentication issue: when I try to open the link in the browser's incognito window, it also returns an error page.

use requests
import requests
a = requests.get('http://www.google.com/trends/fetchComponent?q=nepal&cid=TIMESERIES_GRAPH_0&export=3')
a.text
u'// Data table response\ngoogle.visualization.Query.setResponse({"version":" ....
I tested your example and it works.
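If the request does succeed, the response is JavaScript rather than plain JSON. A minimal sketch for unwrapping it, assuming the wrapper is exactly google.visualization.Query.setResponse(...) and that the payload parses as JSON (it may not if it contains values like new Date(...)):

import json
import requests

trend_url = ('http://www.google.com/trends/fetchComponent'
             '?q=nepal&cid=TIMESERIES_GRAPH_0&export=3')
text = requests.get(trend_url).text

# Keep only the argument passed to setResponse(...)
start = text.index('(') + 1     # first '(' after setResponse
end = text.rindex(')')          # matching final ')'
data = json.loads(text[start:end])

print(data.keys())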

I know this is kind of late, but I think Google does this to protect their data. You have to build a scraper that goes to the interface, enters the word you want, and lets the page generate the URL itself. That is not the same as requesting the generated URL directly.
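A rough sketch of that interface-driven approach using Selenium; the start URL and the search-box name 'q' are assumptions and will likely need adjusting against the live page:

from selenium import webdriver
from selenium.webdriver.common.keys import Keys

# Drive a real browser so the page itself issues the data request,
# instead of hitting the fetchComponent URL directly.
driver = webdriver.Firefox()
driver.get('https://www.google.com/trends/')

# 'q' is a guess at the search box name; inspect the page to confirm.
search_box = driver.find_element_by_name('q')
search_box.send_keys('nepal')
search_box.send_keys(Keys.RETURN)

# The rendered page should now contain the chart data.
print(driver.page_source[:500])
driver.quit()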

Related

How do you get the username of a roblox account using Python Requests?

Hope you are all doing well. This question is a bit more random than others I have asked. I am making a bot that extracts the username of every one of the first 600,000,000 accounts on the platform Roblox and loads them into a list.
This is my problem: I am using requests to get to the account page, but I can't figure out how to extract the username from that page. I have tried using headers and inspect element, but they don't work. If anyone has suggestions on how to do this, please help. Also, I am extraordinarily bad at network programming, so I may have made a noob mistake somewhere. Code is attached below.
import requests
users = []
for i in range(1, 600000001):
    r = requests.get("https://web.roblox.com/users/{i}/profile".format(i=i))
    print(r.status_code)
    if r.status_code == 404:
        users.append('Deleted')
        continue
    print(r.headers.get('username'))
Before working on the scraping, you should know that you have some errors in the code:
First of all, in the 4th line, if you want to use .format to insert values into a string you only need the {} placeholder, so you should write:
r = requests.get("https://web.roblox.com/users/{}/profile".format(i))
And later you should remove continue from your code.
But before doing anything, check that the link works: copy it, paste it into your browser, replace the i placeholder with an actual number and see if the page loads.
If it works you can go on with the code; if not, you have to find another link to access the page you want.
Finally, to get at the elements of the page, start from the raw response: print(r.content).
You will see a long blob of data, but don't be afraid of it:
search for the value that interests you and see what it is called. If the response turns out to be JSON, you can then read that value with
`<name_of_variable> = r.json()['<name_of_the_value>']`
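Putting that advice together, a minimal sketch might look like this; reading the display name from the page title is an assumption (inspect the profile HTML to find where the username actually lives), and fetching 600,000,000 pages one by one will be extremely slow:

import requests
from bs4 import BeautifulSoup

users = []
for i in range(1, 11):  # small range for testing; the real job is 600,000,000 pages
    r = requests.get("https://web.roblox.com/users/{}/profile".format(i))
    if r.status_code == 404:
        users.append('Deleted')
    else:
        soup = BeautifulSoup(r.content, 'html.parser')
        # Assumption: the profile page <title> contains the display name.
        users.append(soup.title.string if soup.title else 'Unknown')

print(users)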

Trying to Log into vrv using requests, but results usually come in semi blank

I'm just trying to log into VRV and get the list of shows from the Crunchyroll page so I can open the site later, but when I try to get back the parsed website after logging in, there's a lot of info missing, like titles and images, and it's incomplete. This is the code I have so far. Obviously my email and password aren't 'email' and 'password'; I just changed them to post it here.
import requests
import pyperclip as p

def enterVrv():
    s = requests.Session()
    dataL = {'email': 'email', 'password': 'password'}
    s.post('https://static.vrv.co/vrvweb/build/common.6fb25c4cff650ac4e6ae.js', data=dataL)
    crunchy = s.get('https://vrv.co/crunchyroll/browse')
    p.copy(str(crunchy.content))
    exit(0)
I've tried posting to the normal 'https://vrv.co' site, I tried the 'https://vrv.co.signin' link, and I tried the link you currently see in the code, which I got from the Network pane in the developer tools. After running the code I would take the copied HTML and load it in a web browser to see if it pulls up correctly, but it always comes in incomplete.
It looks like your problem is that you're trying to get data from a web page that is loaded dynamically. Indeed, if you navigate to https://vrv.co/crunchyroll/browse in your browser you'll likely notice a delay between the page loading and the Crunchyroll titles being displayed.
It also looks like VRV does not expose an API for you to access this data programmatically.
To get around this you could try accessing the page via a web automation tool such as Selenium and scraping the data that way; a rough sketch follows below. As for just making a basic request to the site, though, you're probably out of luck.
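A sketch of that Selenium route, assuming the sign-in page is at https://vrv.co/signin and the form fields are named email and password (both need to be checked against the real page):

from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC

driver = webdriver.Chrome()
driver.get('https://vrv.co/signin')

# Field names are assumptions; inspect the sign-in form to confirm them.
driver.find_element_by_name('email').send_keys('email')
driver.find_element_by_name('password').send_keys('password')
driver.find_element_by_name('password').submit()

driver.get('https://vrv.co/crunchyroll/browse')

# Wait for the dynamically loaded content to appear before grabbing the HTML.
WebDriverWait(driver, 15).until(
    EC.presence_of_element_located((By.TAG_NAME, 'img')))
html = driver.page_source
print(len(html))
driver.quit()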

Deferred Downloading using Python Requests Library

I am trying to fetch some information from Workflowy using Python Requests Library. Basically I am trying to programmatically get the content under this URL: https://workflowy.com/s/XCL9FCaH1b
The problem is that Workflowy goes through a 'loading phase' before the actual content is displayed when I visit this website, so I end up getting the content of the 'loading' page when I make the request. Basically I need a way to defer getting the content so I can bypass the loading phase.
It seemed like the Requests library is talking about this problem here: http://www.python-requests.org/en/latest/user/advanced/#body-content-workflow but I couldn't get this example to work for my purposes.
Here is the super simple block of code that ends up getting the 'loading page':
import requests
path = "https://workflowy.com/s/XCL9FCaH1b"
r = requests.get(path, stream=True)
print(r.content)
Note that I don't have to use Requests; I just picked it up because it looked like it might offer a solution to my problem. I'm also currently using Python 2.7.
Thanks a lot for your time!

403 'Access Denied' Error when opening web page with urllib2 in Python

I'm trying to get definitions of words using Google and urllib2 by opening this URL, https://www.google.com/search?q=define+<something>, and parsing the source for the definition. However, when I try to access the page I get a 403 error, supposedly to prevent data mining in this sort of fashion. I'm fairly sure it wouldn't be wise to try to bypass that, so I'm wondering if there's an alternative for accessing data from Google's servers, or a data dump I should be using.
Edit: Here is the extent of the code I'm using to access the URL:
import urllib2 as ulib

url = "https://www.google.com/search?q=define+" + word
try:
    source = ulib.urlopen(url)
except ulib.HTTPError, e:
    print e.fp.read()
We would need to see your code for confirmation, but your question was probably answered here. In a nutshell, you need to define your user agent.
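For example, something along these lines; the User-Agent string here is just an example of a browser-like value, and word is the search term from your code:

import urllib2

url = "https://www.google.com/search?q=define+" + word
# Send a browser-like User-Agent; the default urllib2 agent is what triggers the 403.
req = urllib2.Request(url, headers={
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64)'})
try:
    source = urllib2.urlopen(req).read()
except urllib2.HTTPError, e:
    print e.fp.read()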

Using mechanize to login to a webpage

This is my first experience programming with Python and I'm trying to log in to this
webpage. After searching around I found that many people suggested using mechanize. Just to be sure I set things up correctly before getting to the code, I downloaded the mechanize zip from the website and put my Python script in the unzipped mechanize folder.
I have this code so far using different examples I've found:
import mechanize
theurl = 'http://voyager.umeres.maine.edu/Login'
mech = mechanize.Browser()
mech.open(theurl)
mech.select_form(nr=0)
mech["userid"] = "MYUSERNAME"
mech["password"] = "MYPASSWORD"
results = mech.submit().read()
f = file('test.html', 'w')
f.write(results)
f.close()
From looking at the source of the webpage I believe userid/password are the correct names for the form fields. When I run the script in IDLE I get a bunch of errors, including a timeout error and a robot error. The full traceback:
I'm not exactly sure what I should expect even if the code works. The login is for my school email, which has class folders as well. My end goal is that once I log into my account I want to parse some folders for information and store it in a file that can later be converted into JSON or an RSS feed, but that's much further down the road, once I have a much better understanding of Python; I'm just trying to give a clearer idea of what I want to accomplish.
The problem is that mechanize respects robots.txt by default.
You must turn that off.
Solution:
mech = mechanize.Browser()
# needs to be set before you call open
mech.set_handle_robots(False)
Edit: it appears that the site is using some additional POST values
that are generated via JavaScript. These may be a pain to recreate yourself; check the source of the page to see what's going on.
Actual POST values being sent:
challenge: [a14b1f67-11edcc01]
charset: UTF-8
login: Login
origurl: /Login/
password:
savedpw: 0
sha1: 3f77d1e8c2ab0470ef8005a85f5f9c0d7aeedba6
userid: sdsads
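If you still want to try reproducing those values from Python, a rough sketch follows; it assumes the sha1 field is simply a SHA-1 hash of the password and that the hidden controls already exist in the form, both of which have to be verified against the page's own JavaScript:

import hashlib
import mechanize

mech = mechanize.Browser()
mech.set_handle_robots(False)   # must be set before open()
mech.open('http://voyager.umeres.maine.edu/Login')

mech.select_form(nr=0)
mech['userid'] = 'MYUSERNAME'
mech['password'] = 'MYPASSWORD'

# Assumption: the JavaScript fills 'sha1' with a SHA-1 of the password.
# Hidden controls are usually read-only in mechanize, so unlock it first.
sha1 = mech.form.find_control('sha1')
sha1.readonly = False
mech['sha1'] = hashlib.sha1('MYPASSWORD').hexdigest()

results = mech.submit().read()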
