How to download YouTube videos using a Python script [closed]

Closed 11 years ago. This question is off-topic and is not currently accepting answers.
I need to download videos from YouTube using a Python script, but I am unable to get the URL of the video from the YouTube page.
For example, given the url: http://www.youtube.com/watch?v=5qcmCUsw4EQ&feature=g-all-u&context=G2633db8FAAAAAAAAAAA
I need to download the video as an FLV or in any other format, and I need to be able to download it in multiple qualities.
I tried several scripts like youtube-dl and quvi, but they all give errors and don't work. Please help; it shall be deeply appreciated.

You need to parse the flashvars variable of the <embed> tag that contains the video. These change around, so some experimentation may be required to find the current variable names. Roughly speaking, you'll want to use a library like mechanize to grab the HTML of the page and BeautifulSoup to parse it and extract the flashvars field of the <embed> element. Then look around at the variables to figure out which one contains the video URL.
e.g.,
import mechanize
import urllib2
import BeautifulSoup

br = mechanize.Browser()
# Browser options
br.set_handle_equiv(True)
br.set_handle_redirect(True)
br.set_handle_referer(True)
br.set_handle_robots(False)
# Follows refresh 0 but doesn't hang on refresh > 0
br.set_handle_refresh(mechanize._http.HTTPRefreshProcessor(), max_time=1)
# User-Agent (this is cheating, ok?)
br.addheaders = [('User-agent', 'Mozilla/5.0 (X11; U; Linux i686; en-US; rv:1.9.0.1) Gecko/2008071615 Fedora/3.0.1-1.fc9 Firefox/3.0.1')]
# YOUTUBE_URL and vidId are assumed to be defined by the surrounding function
br.open('%s?v=%s' % (YOUTUBE_URL, vidId))
soup = BeautifulSoup.BeautifulSoup(br.response().read())
flashVars = urllib2.urlparse.parse_qs(soup.find('embed').get('flashvars'))
# Return the video source URL from the first entry of the stream map
return flashVars['fmt_stream_map'][0].split('|')[1]
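Once you have extracted a stream URL, saving it to an .flv file is straightforward. A minimal sketch, assuming video_url holds the URL returned above:
import urllib2

# video_url is assumed to hold the stream URL extracted above
stream = urllib2.urlopen(video_url)
with open('video.flv', 'wb') as f:
    # Read in chunks so large videos don't have to fit in memory
    while True:
        chunk = stream.read(64 * 1024)
        if not chunk:
            break
        f.write(chunk)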

Related

Why does HTML from a requests response deviate from dev tools? [closed]

Closed 1 year ago. This question needs details or clarity and is not currently accepting answers.
I am trying to scrape the Houzz website.
In browser dev tools it shows HTML content, but when I scrape it with BeautifulSoup, it returns something else together with some of the HTML. I do not have much knowledge of this.
A small part of what I get is as follows:
</div><style data-styled="true" data-styled-version="5.2.1">.fzynIk.fzynIk{box-sizing:border-box;margin:0;overflow:hidden;}/*!sc*/
.eiQuKK.eiQuKK{box-sizing:border-box;margin:0;margin-bottom:4px;}/*!sc*/
.chJVzi.chJVzi{box-sizing:border-box;margin:0;margin-left:8px;}/*!sc*/
.kCIqph.kCIqph{box-sizing:border-box;margin:0;padding-top:32px;padding-bottom:32px;border-top:1px solid;border-color:#E6E6E6;}/*!sc*/
.dIRCmF.dIRCmF{box-sizing:border-box;margin:0;display:-webkit-box;display:-webkit-flex;display:-ms-flexbox;display:flex;-webkit-box-pack:justify;-webkit-justify-content:space-between;-ms-flex-pack:justify;justify-content:space-between;-webkit-align-items:center;-webkit-box-align:center;-ms-flex-align:center;align-items:center;margin-bottom:16px;}/*!sc*/
.kmAORk.kmAORk{box-sizing:border-box;margin:0;margin-bottom:24px;}/*!sc*/
.bPERLb.bPERLb{box-sizing:border-box;margin:0;margin-bottom:-8px;}/*!sc*/
What should I do with this? Isn't this achievable with BeautifulSoup?
Developer Tools operate on a live browser DOM, so what you see when inspecting the page is not the original HTML but a modified one, after the browser has applied some cleanup and executed JavaScript code.
Requests does not execute JavaScript, so the content can deviate, but you can still scrape it. Just take a deeper look into your soup.
Example (project titles)
from bs4 import BeautifulSoup
import requests

url_news = "https://www.houzz.com.au/professionals/home-builders/turrell-building-pty-ltd-pfvwau-pf~1099128087"
headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/62.0.3202.94 Safari/537.36'}
response = requests.get(url_news, headers=headers)
soup = BeautifulSoup(response.content, "html.parser")
# Project titles are rendered as <h3> elements inside the #projects section
[title.text for title in soup.select('#projects h3')]
Output
['Major Renovation & Master Wing',
'"The Italian Village" Private Residence',
'Country Classic',
'Residential Resort',
'Resort Style Extension, Stone and Timber',
'Old Northern Rd Estate']
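If you are not sure where the data you want lives, one way to take that deeper look is to save the parsed HTML to disk and search it in an editor; a quick sketch, reusing the soup object from above:
# Save the prettified HTML so you can search it for the data you want
with open('houzz.html', 'w', encoding='utf-8') as f:
    f.write(soup.prettify())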

scrape google resultstats with python [closed]

Closed 7 years ago. This question needs debugging details and is not currently accepting answers.
I would like to get the estimated results number from Google for a keyword. I'm using Python 3.3 and trying to accomplish this task with BeautifulSoup and urllib.request. This is my simple code so far:
def numResults():
    try:
        page_google = '''http://www.google.de/#output=search&sclient=psy-ab&q=pokerbonus&oq=pokerbonus&gs_l=hp.3..0i10l2j0i10i30l2.16503.18949.0.20819.10.9.0.1.1.0.413.2110.2-6j1j1.8.0....0...1c.1.19.psy-ab.FEBvxrgi0KU&pbx=1&bav=on.2,or.r_qf.&bvm=bv.48705608,d.Yms&'''
        req_google = Request(page_google)
        req_google.add_header('User Agent', 'Mozilla/5.0 (Windows NT 6.1; WOW64; rv:15.0) Gecko/20120427 Firefox/15.0a1')
        html_google = urlopen(req_google).read()
        soup = BeautifulSoup(html_google)
        scounttext = soup.find('div', id='resultStats')
    except URLError as e:
        print(e)
    return scounttext
My problem is that my soup variable is somehow encoded and I can't get any information out of it, so I get back None because soup.find doesn't work.
What am I doing wrong, and how can I extract the wanted resultStats?
Many thanks!
If you haven't solved this problem yet, it looks like the reason BeautifulSoup is failing to find anything is that resultStats never appears in the soup - your Request(page_google) is only returning JavaScript, not any search results that the JavaScript is dynamically loading in. You can verify this by adding a
print(soup)
command to your code and you will see that the resultStats div doesn't appear.
The following code:
from urllib2 import Request, urlopen
import urllib
from bs4 import BeautifulSoup
query = 'pokerbonus'
url = "http://www.google.de/search?q=%s" % urllib.quote_plus(query)
req_google = Request(url)
req_google.add_header('User-Agent', 'Mozilla/5.0 (Windows; U; Windows NT 5.1; en-GB; rv:1.9.0.3) Gecko/2008092417 Firefox/3.0.3')
html_google = urlopen(req_google).read()
soup = BeautifulSoup(html_google)
scounttext = soup.find('div', id='resultStats')
print(scounttext)
Will print
<div class="sd" id="resultStats">Ungefähr 1.060.000 Ergebnisse</div>
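To turn that markup into an integer, you can strip out everything but the digits; a small sketch, assuming scounttext holds the div found above:
import re

# "Ungefähr 1.060.000 Ergebnisse" -> 1060000 (drops the German thousands separators)
digits = re.sub(r'[^\d]', '', scounttext.get_text())
print(int(digits))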
Lastly, using a tool like Selenium WebDriver might be a better way to go about solving this, as Google does not allow bots to scrape search results.

Python Mechanize for form with options populated by Javascript [duplicate]

This question already has an answer here: Python mechanize javascript (1 answer). Closed 8 years ago.
I'm trying to use mechanize to grab prices for New York's Metro-North Railroad from this site: http://as0.mta.info/mnr/fares/choosestation.cfm
The problem is that when you select the first option, the site uses JavaScript to populate your list of possible destinations. I have written equivalent code in Python, but I can't seem to get it all working. Here's what I have so far:
import mechanize
import cookielib
from bs4 import BeautifulSoup
br = mechanize.Browser()
br.set_handle_robots(False)
br.addheaders = [('User-agent', 'Mozilla/5.0 (X11; U; Linux i686; en-US; rv:1.9.0.1) Gecko/2008071615 Fedora/3.0.1-1.fc9 Firefox/3.0.1')]
br.open("http://as0.mta.info/mnr/fares/choosestation.cfm")
br.select_form(name="form1")
br.form.set_all_readonly(False)
origin_control = br.form.find_control("orig_stat", type="select")
origin_control_list = origin_control.items
origin_control.value = [origin_control.items[0].name]
destination_control_list = reFillList(0, origin_control_list)
destination_control = br.form.find_control("dest_stat", type="select")
destination_control.items = destination_control_list
destination_control.value = [destination_control.items[0].name]
response = br.submit()
response_text = response.read()
print response_text
I know I didn't give you the code for the reFillList() method, because it's long, but assume it correctly creates a list of mechanize.option objects. Python doesn't complain about anything, but on submit I get the HTML for this alert:
"Fare information for travel between two lines is not available on-line. Please contact our Customer Information Center at 511 and ask to speak to a representative for further information."
Am I missing something here? Thanks for all the help!
It can't really be done without trying to make sense of the crazy logic in that JavaScript function. I suggest a JavaScript solution or a full browser like Selenium, as sketched below.
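A minimal Selenium sketch, assuming the two <select> elements keep the orig_stat and dest_stat names from the mechanize form above, and that a Firefox driver is installed:
from selenium import webdriver
from selenium.webdriver.support.ui import Select
import time

driver = webdriver.Firefox()
driver.get('http://as0.mta.info/mnr/fares/choosestation.cfm')

# Pick an origin; the page's JavaScript then fills in the destination list
Select(driver.find_element_by_name('orig_stat')).select_by_index(1)
time.sleep(1)  # crude wait for the destination options to be populated
Select(driver.find_element_by_name('dest_stat')).select_by_index(1)

driver.find_element_by_name('form1').submit()
print(driver.page_source)
driver.quit()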

Python mechanize SelectControl is empty when it should have values

I'm trying to automate the download of some data from a web form, using Python's mechanize module.
The URL is here: http://www.hiv.lanl.gov/components/sequence/HIV/search/search.html
I need to fill out the Sequence Length, Subtype and Genomic Region. I've got the Sequence Length and Genomic Region figured out, but I'm having trouble with selecting the Subtype: when I load the form I get an empty SelectControl, and mechanize won't let me select anything.
This code should load the website:
import mechanize
import cookielib
# Browser
br = mechanize.Browser()
# Cookie Jar
cj = cookielib.LWPCookieJar()
br.set_cookiejar(cj)
# Browser options
br.set_handle_equiv(True)
br.set_handle_gzip(True)
br.set_handle_redirect(True)
br.set_handle_referer(True)
br.set_handle_robots(False)
# Follows refresh 0 but doesn't hang on refresh > 0
br.set_handle_refresh(mechanize._http.HTTPRefreshProcessor(), max_time=1)
br.addheaders = [('User-agent', 'Mozilla/5.0 (X11; U; Linux i686; en-US; rv:1.9.0.1) Gecko/2008071615 Fedora/3.0.1-1.fc9 Firefox/3.0.1')]
br.open('http://www.hiv.lanl.gov/content/sequence/HIV/mainpage.html')
br.follow_link(text = 'Search Interface')
br.select_form(nr = 1)
Any help would be greatly appreciated.
-Will
EDIT:
I tried to use BeautifulSoup to re-parse the HTML (as per this SO question) but no luck there.
New Edit:
Below is an excerpt of the mechanize form.
<search POST http://www.hiv.lanl.gov/components/sequence/HIV/search/search.comp multipart/form-data
<TextControl(value SequenceAccessions SA_GenBankAccession 1=)>
<SelectControl(master=[Any, *HIV-1, HIV-2, SIV, SHIV, syntheticwholeDNA, NULL])>
<TextControl(value SEQ_SAMple SSAM_common_name 1=)>
<SelectControl(slave=[])>
<TextControl(value SequenceEntry SE_sequenceLength 4=)>
<CheckboxControl(sample_year_exact=[*1])>
<TextControl(value SEQ_SAMple SSAM_Sample_year 11=)>
<TextControl(value SEQ_SAMple SSAM_Sample_country 1=)>
<CheckboxControl(recombinants=[Recombinants])>
For some reason the slave control is not populated with the possible choices.
The problem is that the Subtype <select> options are populated by JavaScript, which mechanize does not execute. The JavaScript code runs when the page loads, populating the slave options with the HIV-1 list:
addLoadEvent(fillList(document.forms['search'].slave, lists['HIV-1']));
However, the mapping of Virus -> Subtype is stored in the JavaScript file search.js. You may need to store this mapping in your Python script and manually set the slave form value.
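A rough sketch of manually adding an option to the empty slave control and then selecting it; the 'B' subtype value here is an assumption, and the real values would come from the mapping in search.js:
# Continuing from br.select_form(nr = 1) above.
# Creating a mechanize Item attaches it to the control's item list.
slave_control = br.form.find_control('slave')
mechanize.Item(slave_control, {'contents': 'B', 'value': 'B'})
slave_control.value = ['B']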

Python: urlopen not downloading the entire site

Greetings,
I have done:
import urllib
site = urllib.urlopen('http://www.weather.com/weather/today/Temple+TX+76504')
site_data = site.read()
site.close()
but it doesn't compare to viewing the source when loaded in Firefox.
I suspected the user agent and did this:
class AppURLopener(urllib.FancyURLopener):
    version = "Mozilla/5.0 (X11; U; Linux i686; zh-CN; rv:1.9.2.8) Gecko/20100722 Ubuntu/10.04 (lucid) Firefox/3.6.8"

urllib._urlopener = AppURLopener()
and downloaded it, but it still doesn't download the whole website.
Can someone please help me do user agent switching, if that is the likely culprit?
Thanks,
Narnie
It's more likely that there is an iframe in the code or that JavaScript is modifying the DOM. If there's an iframe, you'll have to parse the page to get the URL for the iframe, or just do it manually if it's a one-off. If it's JavaScript, I hear that selenium-rc is good, but I have no first-hand experience with it.
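For the iframe case, a minimal sketch (BeautifulSoup 3, matching the Python 2 urllib code above) that lists the frame URLs you would need to fetch separately:
import urllib
import urlparse
from BeautifulSoup import BeautifulSoup

base = 'http://www.weather.com/weather/today/Temple+TX+76504'
soup = BeautifulSoup(urllib.urlopen(base).read())
# Each iframe's document has to be fetched on its own; urlopen only
# returns the outer page
for frame in soup.findAll('iframe'):
    if frame.get('src'):
        print urlparse.urljoin(base, frame.get('src'))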
A downloaded page displayed locally may look different for several reasons, e.g. relative links (which can be fixed by adding something like <base href="http://www.weather.com/today/"> into the page head element) or non-functional AJAX requests (see Ways to circumvent the same-origin policy).
