Noob here. Let's say I want to download a .mp3 file from a website like youtube.com or hypem.com. How do I go about it? I know how to open a webpage (with requests) and how to parse it (with Beautiful Soup), but after those steps I really don't know what to do. How do you find the SOURCE of the file?
Let's take for example this script: https://github.com/fzakaria/HypeScript/blob/master/hypeme.py
I understand most of it except this part:
serve_url = "http://hypem.com/serve/source/{}/{}".format(id, key)
request = urllib2.Request(serve_url, "" , {'Content-Type': 'application/json'})
request.add_header('cookie', cookie)
response = urllib2.urlopen(request)
song_data_json = response.read()
response.close()
song_data = json.loads(song_data_json)
url = song_data[u"url"]
First, how did he find out that this URL would serve the song?
"http://hypem.com/serve/source/{}/{}".format(id, key)
Then there is this line, which I have no idea what it is for:
request = urllib2.Request(serve_url, "" , {'Content-Type': 'application/json'})
So my question is: where do you find the link or information needed to download a file when it isn't meant to be downloaded (e.g. YouTube)? How do you find the SOURCE of the file?
To answer your first question: web scraping involves a lot of reverse engineering. I'm guessing whoever wrote the script studied the site they were scraping and figured out what the URLs for the songs look like.
As for your second question: a Request object is being built before opening the URL in order to add custom headers (Content-Type) to the request.
General, unasked-for advice: have a look at the requests library. It is MUCH simpler to use than urllib2. The above code written with requests would become:
import requests
serve_url = "http://hypem.com/serve/source/{}/{}".format(id, key)
# cookies is a simple key/value dictionary
response = requests.get(serve_url, headers={'Content-Type': 'application/json'}, cookies=cookies)
song_data = response.json()
url = song_data[u"url"]
Much cleaner and simpler to understand IMHO.
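For completeness, the original urllib2 pattern maps onto Python 3's urllib.request (urllib2 was folded into it). A minimal sketch, with a placeholder URL and cookie value:

```python
import urllib.request

# Build a Request so custom headers can be attached before opening the URL;
# the URL and cookie value here are placeholders, not real endpoints
req = urllib.request.Request("http://example.com/serve/source/123/abc",
                             data=b"",
                             headers={"Content-Type": "application/json"})
req.add_header("Cookie", "AUTH=placeholder")

# response = urllib.request.urlopen(req)  # would perform the actual request
print(req.get_header("Content-type"))  # application/json
```

Passing a (possibly empty) `data` body is what turns the request into a POST, exactly as in the urllib2 version above.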
Related
Here is my code:
response = requests.get('URL HERE', headers=header, params=param, cookies=cookie)
print(response.content)
I also tried this:
print(response.text)
But both return this:
'<script>window.location="URL HERE";</script>'
All I want is to get the html of the page.
Any ideas?
EDIT:
I don't know how I did it, but I got the header and the cookie again from the website, plugged them in, and it worked. I think it might be because I use Firefox and I had an update, so maybe the header changed, IDK.
That is the HTML of the whole page. Unfortunately, the page seems to require JavaScript, which your approach does not support. I don't know much about JS, but it seems to be a function that redirects you to a different page.
You could use an approach based on Selenium if the website only works when JS is enabled.
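If the response really is just that one-line redirect script, a lighter-weight alternative is to pull the target URL out with a regular expression and request it directly. A sketch, with a made-up URL standing in for the real response body:

```python
import re

# Stand-in for response.text; the URL here is made up
body = '<script>window.location="https://example.com/real-page";</script>'

# Extract the URL assigned to window.location, then fetch it yourself
match = re.search(r'window\.location="([^"]+)"', body)
if match:
    redirect_url = match.group(1)
    # page = requests.get(redirect_url)  # follow the JS redirect manually
    print(redirect_url)  # https://example.com/real-page
```

This only works for this exact single-assignment pattern; anything more dynamic needs a real JS engine like Selenium.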
The website you used redirected me when I went to the link, so when you fetch the URL, make sure to allow redirects.
Do this:
response = requests.get('URL HERE', headers=header, params=param, cookies=cookie, allow_redirects=True)
I'm trying to make a web scraper using Python. The website has a login form, though, and I've been trying to log in for a few days with no result. The code looks like this:
session_requests = requests.Session()
r = session_requests.get(login_url, headers=dict(referer=login_url))
print(r.content)
tree = html.fromstring(r.text)
authenticity_token = list(set(tree.xpath('//input[@name="_csrf_token"]/@value')))[0]
payload = {"_csrf_token": authenticity_token, "_username": "-username-", "_password": "-password-",}
r = session_requests.post(login_url, data=payload, headers=dict(referer=login_url))
print(r.content)
You can see I print r.content both before and after posting to the login page. In theory I should get different outputs (the second should be the content of the actual page after logging in), but unfortunately I get exactly the same output.
Here's a screenshot of what the login page requires to log in:
Also, I know for sure that the _csrf_token is correct because I have tested it a few times, so no doubts about that part.
Another thing that might be useful: I don't think I really need to include the headers, because the output is exactly the same with or without them (I include them just in case). Thanks in advance.
Edit: the URL is https://nuvola.madisoft.it/login
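For what it's worth, the token-extraction step can be sanity-checked offline against a saved snippet of the form. The HTML below is a made-up stand-in for the real login page:

```python
from lxml import html

# Made-up stand-in for r.text; the real form has more fields
page = '<form><input name="_csrf_token" value="abc123"/></form>'

tree = html.fromstring(page)
# Same XPath pattern as in the question, with @ (not #) for attributes
token = tree.xpath('//input[@name="_csrf_token"]/@value')[0]
print(token)  # abc123
```

If this works on a saved copy of the real page but the login still fails, the problem is likely in the POST itself (missing fields, wrong action URL, or cookies) rather than in the token extraction.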
I am trying to use the requests library in Python to post the text content of a text file to a website, submit the text for analysis on that website, and pull the results back into Python. I have read through a number of responses here and on other websites, but have not yet figured out how to correctly adapt the code to a new website.
I'm familiar with Beautiful Soup, so pulling in webpage content and removing HTML isn't an issue; it's submitting the data that I don't understand.
My code currently is:
import requests
fileName = "texttoAnalyze.txt"
fileHandle = open(fileName, 'rU')
url_text = fileHandle.read()
url = "http://www.webpagefx.com/tools/read-able/"
payload = {'value':url_text}
r = requests.post(url, payload)
print r.text
This code comes back with the HTML of the website, but it hasn't recognized that I'm trying to submit a form.
Any help is appreciated. Thanks so much.
You need to send the same request the website is sending; you can usually capture it with web debugging tools (like the Chrome/Firefox developer tools).
In this case the url the request is being sent to is: http://www.webpagefx.com/tools/read-able/check.php
With the following params: tab=Test+by+Direct+Link&directInput=SOME_RANDOM_TEXT
So your code should look like this:
url = "http://www.webpagefx.com/tools/read-able/check.php"
payload = {'directInput':url_text, 'tab': 'Test by Direct Link'}
r = requests.post(url, data=payload)
print r.text
Good luck!
There are two post parameters, tab and directInput:
import requests
post = "http://www.webpagefx.com/tools/read-able/check.php"
with open("in.txt") as f:
    data = {"tab": "Test by Direct Link",
            "directInput": f.read()}
    r = requests.post(post, data=data)
    print(r.content)
I have another question about posts.
This post should be almost identical to the one referenced in the Stack Overflow question 'Using request.post to post multipart form data via python not working', but for some reason I can't get it to work. The website is http://www.camp.bicnirrh.res.in/predict/. I want to post a file that is already in FASTA format to this website and select the 'SVM' option, using requests in Python. This is based on what @NorthCat gave me previously, which worked like a charm:
import requests

session = requests.session()
file = {'file': open('Bishop/newdenovo2.txt', 'r').read()}
url = 'http://www.camp.bicnirrh.res.in/predict/hii.php'
payload = {"algo[]": "svm"}
response = session.post(url, files=file, data=payload)
print(response.text)
Since it's not working, I assumed the payload was the problem. I've been playing with the payload, but I can't get any of these to work.
payload = {'S1':str(data), 'filename':'', 'algo[]':'svm'} # where I tried just reading the file in, called 'data'
payload = {'svm':'svm'} # not actually in the form, but I tried this too
payload = {'S1': '', 'algo[]':'svm', 'B1': 'Submit'}
None of these payloads resulted in data.
Any help is appreciated. Thanks so much!
You need to set the file post variable name to "userfile", i.e.
file={'userfile':(open('Bishop/newdenovo2.txt','r').read())}
Note that the read() is unnecessary, but it doesn't prevent the file upload from succeeding. Here is some code that should work for you:
import requests
session = requests.session()
response = session.post('http://www.camp.bicnirrh.res.in/predict/hii.php',
                        files={'userfile': ('fasta.txt', open('fasta.txt'), 'text/plain')},
                        data={'algo[]': 'svm'})
response.text contains the HTML results; save it to a file and view it in your browser, or parse it with something like Beautiful Soup and extract the results.
In the request I've specified a MIME type of "text/plain" for the file. This is not necessary, but it serves as documentation and might help the receiving server.
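As a sketch of that last parsing step (the markup below is a made-up stand-in; the real CAMP results page will differ):

```python
from bs4 import BeautifulSoup

# Made-up stand-in for response.text; the real results page differs
html_doc = '<html><body><table><tr><td>SVM</td><td>0.87</td></tr></table></body></html>'

soup = BeautifulSoup(html_doc, "html.parser")
# Pull the text out of every table cell
cells = [td.get_text() for td in soup.find_all("td")]
print(cells)  # ['SVM', '0.87']
```

Save the real response.text once, open it in a browser to find which table or div holds the scores, then adjust the find_all call accordingly.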
The content of my fasta.txt file is:
>24.6jsd2.Tut
GGTGTTGATCATGGCTCAGGACAAACGCTGGCGGCGTGCTTAATACATGCAAGTCGAACGGGCTACCTTCGGGTAGCTAGTGGCGGACGGGTGAGTAACACGTAGGTTTTCTGCCCAATAGTGGGGAATAACAGCTCGAAAGAGTTGCTAATACCGCATAAGCTCTCTTGCGTGGGCAGGAGAGGAAACCCCAGGAGCAATTCTGGGGGCTATAGGAGGAGCCTGCGGCGGATTAGCTAGATGGTGGGGTAAAGGCCTACCATGGCGACGATCCGTAGCTGGTCTGAGAGGACGGCCAGCCACACTGGGACTGAGACACGGCCCAGACTCCTACGGGAGGCAGCAGTAAGGAATATTCCACAATGGCCGAAAGCGTGATGGAGCGAAACCGCGTGCGGGAGGAAGCCTTTCGGGGTGTAAACCGCTTTTAGGGGAGATGAAACGCCACCGTAAGGTGGCTAAGACAGTACCCCCTGAATAAGCATCGGCTAACTACGTGCCAGCAGCCGCGGTAATACGTAGGATGCAAGCGTTGTCCGGATTTACTGGGCGTAAAGCGCGCGCAGGCGGCAGGTTAAGTAAGGTGTGAAATCTCCCTGCTCAACGGGGAGGGTGCACTCCAGACTGACCAGCTAGAGGACGGTAGAGGGTGGTGGAATTGCTGGTGTAGCGGTGAAATGCGTAGAGATCAGCAGGAACACCCGTGGCGAAGGCGGCCACCTGGGCCGTACCTGACGCTGAGGCGCGAAGGCTAGGGGAGCGAACGGGATTAGATACCCCGGTAGTCCTAGCAGTAAACGATGTCCACTAGGTGTGGGGGGTTGTTGACCCCTTCCGTGCCGAAGCCAACGCATTAAGTGGACCGCCTGGGGAGTACGGTCGCAAGACTAAAACTCAAAGGAATTGACGGGGACCCGCACAAGCAGCGGAGCGTGTGGTTTAATTCGATGCGACGCGAAGAACCTTACCTGGGCTTGACATGCTATCGCAACACCCTGAAAGGGGTGCCTCCTTCGGGACGGTAGCACAGATGCTGCATGGCTGTCGTCAGCTCGTGTCGTGAGATGTTGGGTTAAGTCCCGCAACGAGCGCAACCCCTGTCCTTAGTTGTATATCTAAGGAGACTGCCGGAGACAAACCGGAGGAAGGTGGGGATGACGTCAAGTCAGCATGGCTCTTACGTCCAGGGCTACACATACGCTACAATGGCCGTTACAGTGAGATGCCACACCGCGAGGTGGAGCAGATCTCCAAAGGCGGCCTCAGTTCAGATTGCACTCTGCAACCCGAGTGCATGAAGTCGGAGTTGCTAGTAACCGCGTGTCAGCATAGCGCGGTGAATATGTTCCCGGGTCTTGTACACACCGCCCGTCACGTCATGGGAGCCGGCAACACTTCGAGTCCGTGAGCTAACCCCCCCTTTCGAGGGTGTGGGAGGCAGCGGCCGAGGGTGGGGCTGGTGACTGGGACGAAGTCGTAACAAGGT
I'm a Transifex user and I need to retrieve my dashboard page with the list of all the projects of my organization,
that is, the page I see when I log in: https://www.transifex.com/organization/(my_organization_name)/dashboard
I can access Transifex API with this code:
import urllib.request as url

usr = 'myusername'
pwd = 'mypassword'

def GetUrl(Tx_url):
    auth_handler = url.HTTPBasicAuthHandler()
    auth_handler.add_password(realm='Transifex API',
                              uri=Tx_url,
                              user=usr,
                              passwd=pwd)
    opener = url.build_opener(auth_handler)
    url.install_opener(opener)
    f = url.urlopen(Tx_url)
    return f.read().decode("utf-8")
Everything is OK, but there's no API call to get all the projects of my organization.
The only way is to get that page's HTML and parse it, but with this code I get the login page instead.
This works ok with google.com, but I get an error with www.transifex.com or www.transifex.com/organization/(my_organization_name)/dashboard
Python, HTTPS GET with basic authentication
I'm new to Python; I need some code for Python 3 using only the standard library.
Thanks for any help.
The call to
/projects/
returns your projects along with all the public projects that you have access to (like you said). You can restrict the results to the ones you need by modifying the call to something like:
https://www.transifex.com/api/2/projects/?start=1&end=6
Doing so the number of projects returned will be restricted.
For now, if you don't have many projects, maybe it would be more convenient for you to use this call:
/project/project_slug
and fetch each one separately.
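A sketch of that per-project approach, with made-up slugs; the URL pattern follows the API base used elsewhere in this thread, and the actual network call is left commented out:

```python
AUTH = ('yourusername', 'yourpassword')  # placeholder credentials
slugs = ['project-one', 'project-two']   # made-up project slugs

# Build one API URL per project; each would be fetched with
# requests.get(u, auth=AUTH) in a real run
urls = ['https://www.transifex.com/api/2/project/{}/'.format(s) for s in slugs]
for u in urls:
    print(u)
```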
Transifex comes with an API, and you can use it to fetch all the projects you have.
I think what you need is this GET request on projects. It returns a list of (slug, name, description, source_language_code) entries, in JSON format, for all the projects you have access to.
Since you are familiar with python, you could use the requests library to perform the same actions in a much easier and more readable way.
You will just need to do something like that:
import requests

AUTH = ('yourusername', 'yourpassword')
url = 'https://www.transifex.com/api/2/projects/'
headers = {'Content-type': 'application/json'}
response = requests.get(url, headers=headers, auth=AUTH)
I hope I've helped.
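To pull the slugs out of that response, something like this would do (the projects list below is a made-up stand-in for response.json()):

```python
# Made-up stand-in for response.json(); the real API returns one dict per project
projects = [
    {"slug": "my-project", "name": "My Project",
     "description": "Example", "source_language_code": "en"},
]

slugs = [p["slug"] for p in projects]
print(slugs)  # ['my-project']
```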