Python Netflix Query
I am trying to get the name of a TV show (episode/season) or movie from a Netflix URL. Is there a way to do this using requests and urllib? I guess I'll need an API key and secret for that.
This is what I'm trying to do.
e.g. I have this URL for Z Nation.
url = "https://www.netflix.com/gb/title/80008434"
url_data = urlparse.urlparse(url)
query = urlparse.parse_qs(url_data.query)
id = query["v"][0]
id should give me 80008434
netflixurl = ''
r = requests.get(netflixurl)
js = r.json()
item = js[""]
item should give me Z Nation Season 3 (or whatever season/episode the URL points to).
I am using the pyflix2 API wrapper ('NetflixAPIV2').
How should I go about this? Please help!
EDIT: I use this for YouTube. Is there something similar for Netflix?
import lxml
from lxml import etree
import urllib
youtube = etree.HTML(urllib.urlopen("https://www.youtube.com/watch?v=L93-7vRfxNs").read())
video_title = youtube.xpath("//span[@id='eow-title']/@title")
song = ''.join(video_title)
Result : Daft Punk - Aerodynamic
Sadly, Netflix has discontinued the use of its public API and is not accepting any new developers.
You can look into the Netflix Roulette API, an unofficial API that lets you run queries against Netflix. You can use it in conjunction with urllib or requests to get the results you need.
Apart from that, you can fall back on general web scraping with BeautifulSoup and requests, but doing it that way is not recommended, as it would consume an immense amount of bandwidth to scrape all the directories.
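If all you need is the display name for a single title URL (as in the question), a minimal scraping sketch might look like the following. It assumes the public title page exposes the name in its og:title or <title> tag, which Netflix can change or block at any time, so treat it as an illustration rather than a reliable solution.
# Rough sketch: read the show/movie name from a single public title page.
# Assumes an og:title/<title> tag is present -- markup may change without notice.
import requests
from bs4 import BeautifulSoup

def netflix_title(url):
    headers = {"User-Agent": "Mozilla/5.0"}  # bare requests are often rejected
    page = requests.get(url, headers=headers, timeout=10)
    page.raise_for_status()
    soup = BeautifulSoup(page.text, "html.parser")
    og = soup.find("meta", property="og:title")
    if og and og.get("content"):
        return og["content"]
    return soup.title.string if soup.title else None

print(netflix_title("https://www.netflix.com/gb/title/80008434"))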
There is an API developed by uNoGS that you can subscribe to. The downside is that, although there is a free tier, you have to submit your credit card details, and if you go over 100 requests a month you will be charged. Needless to say, alarm bells rang.
Therefore, I'm looking into building my own and in the very early stages.
Having seen some of the replies, I just thought I'd throw it out there that the robots.txt file lists the /browse subdirectory as allowed.
Normally, websites such as this stipulate that crawling is only permitted for reputable search engines. There is no such clause here, so as far as the legality discussed so far goes, it appears that scraping the browse section is both legal and ethical. That said, even though no 'Crawl-delay' is stipulated, I would ethically suggest imposing one yourself if you do succeed in getting requests working.
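For what it's worth, a minimal sketch of checking robots.txt programmatically and imposing your own delay, assuming the rules stay as described above (Python 3's urllib.robotparser shown):
# Sketch: confirm /browse is allowed and throttle yourself, since no
# Crawl-delay is declared. Rules and paths may change at any time.
import time
import urllib.robotparser  # 'robotparser' module in Python 2

rp = urllib.robotparser.RobotFileParser()
rp.set_url("https://www.netflix.com/robots.txt")
rp.read()

if rp.can_fetch("*", "https://www.netflix.com/browse"):
    # fetch pages here, sleeping between requests as a courtesy
    time.sleep(2)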
I wrote some code for exactly this, because a lot of the existing projects targeted the USA or other regions and didn't translate into results that worked for my Netflix.
It uses Selenium, but it shouldn't be hard to download and understand the code that I wrote.
https://github.com/Eglis05/netflix-selenium
You can have a look at it and report anything you don't like. :)
Related
How can I convert a scraping script into a web service?
I want to build an API that accepts a string and returns HTML code. Here is my scraping code that I want to expose as a web service:
from selenium import webdriver
import bs4
import requests
import time

url = "https://www.pnrconverter.com/"
browser = webdriver.Firefox()
browser.get(url)

string = "3 PS 232 M 03FEB 7 JFKKBP HK2 1230A 420P 03FEB E PS/JPIX8U"

button = browser.find_element_by_xpath("//textarea[@class='dataInputChild']")
button.send_keys(string)  # accept string
button.submit()
time.sleep(5)

soup = bs4.BeautifulSoup(browser.page_source, 'html.parser')
html = soup.find('div', class_="main-content")  # returns html
print(html)
Can anyone tell me the best possible way to wrap up my code as an API/web service?
There's no best possible solution in general, because a solution has to fit the problem and the available resources.

Right now it seems like you're trying to wrap someone else's website. If that's the problem you're actually trying to solve, and you want to give credit, you should probably just forward people to their site: have your site return a 302 Redirect with their URL in the Location field of the header.

If what you're trying to do is get the response from this one sample check you have hardcoded, and make that result available, I would suggest you put it in a static file behind nginx.

If what you're trying to do is use their backend to turn itineraries you have into responses you can return, you can do that by using their backend API once it becomes available. Read the documentation, use the requests library to hit the API endpoint that you want, get the JSON result back, and format it to your liking.

If you're trying to duplicate their site by making yourself a man-in-the-middle, that may be illegal and you should reconsider what you're doing.

For hosting purposes, you need to figure out how often your API will be hit. You can probably start on Heroku or something similar fairly easily, and scale up if you need to. You'll probably want WebObj or Flask or something similar sitting at the website where you intend to host this application. You can use those to process what I presume will be a simple request into the string you wish to hit their API with.
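To make the Flask suggestion concrete, a minimal sketch might look like this; parse_pnr() is a hypothetical placeholder for whatever backend call (their future API, or your own parsing code) actually produces the result:
# Sketch only: a tiny Flask app wrapping the conversion logic.
# parse_pnr() is a hypothetical stand-in for the real backend call.
from flask import Flask, request, jsonify

app = Flask(__name__)

def parse_pnr(pnr_string):
    # placeholder -- call the real parsing/backend code here
    return {"input": pnr_string, "result": "not implemented"}

@app.route("/convert", methods=["POST"])
def convert():
    pnr_string = request.form.get("pnr", "")
    return jsonify(parse_pnr(pnr_string))

if __name__ == "__main__":
    app.run(port=5000)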
I am the owner of PNR Converter, so I can shed some light on your attempt to scrape content from our site. Unfortunately, scraping from PNR Converter is not recommended. We are developing an API which looks like it would suit your needs, and it should be ready in the not too distant future. If you contact us through the site, we would be happy to work with you should you wish to use PNR Converter legitimately.

PNR Converter gets at least one complete update per year, and as such we change all the code on a regular basis. We also monitor all requests to our site, and we will block any requests which are deemed improper usage. Our filter has already picked up your IP address (ends in 250.144) as potential misuse.

Like I said, should you wish to work with us at PNR Converter legitimately and not scrape our content, we would be happy to do so! Please keep checking https://www.pnrconverter.com/api-introduction for information relating to our API. We are releasing a backend upgrade this weekend, which will have a different HTML structure and dynamically named elements, which will cause a serious issue for web scrapers!
Python - Downloading a file from a webpage by clicking on a link
I've looked around the internet for a solution to this, but none have really seemed applicable here. I'm writing a Python program to predict the next day's stock price using historical data. I don't need all the historical data since inception, as Yahoo Finance provides, but only the last 60 days or so. The NASDAQ website provides just the right amount of historical data, and I wanted to use that website.

What I want to do is go to a particular stock's profile on NASDAQ, for example www.nasdaq.com/symbol/amd/historical, and click on the "Download this File in Excel Format" link at the very bottom. I inspected the page's HTML to see if there was an actual link I could just use with urllib to get the file, but all I got was:
<a id="lnkDownLoad" href="javascript:getQuotes(true);"> Download this file in Excel Format </a>
No link. So my question is, how can I write a Python script that goes to a given stock's NASDAQ page, clicks on the "Download this file in Excel Format" link, and actually downloads the file? Most solutions online require you to know the URL where the file is stored, but in this case I don't have access to that. So how do I go about doing this?
Using Chrome, go to View > Developer > Developer Tools.
In this new developer tools UI, change to the Network tab.
Navigate to the place where you would need to click, and click the ⃠ symbol to clear all recent activity.
Click the link, and see if there were any requests made to the server.
If there were, click the request and see if you can reverse engineer the API of its endpoint.
Please be aware that this may be against the website's Terms of Service!
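If the Network tab does reveal such a request, a sketch of replaying it with requests might look like the following; the endpoint URL, parameters and headers are purely hypothetical placeholders for whatever your browser actually sends:
# Hypothetical sketch: replay an endpoint discovered in the Network tab.
# The URL, parameters and headers are placeholders -- substitute the exact
# request your browser's developer tools show.
import requests

endpoint = "https://example.com/api/historical"  # placeholder URL
params = {"symbol": "AMD", "days": 60}           # placeholder parameters
headers = {"User-Agent": "Mozilla/5.0"}

resp = requests.get(endpoint, params=params, headers=headers, timeout=10)
resp.raise_for_status()
with open("amd_historical.csv", "wb") as f:
    f.write(resp.content)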
It appears that BeautifulSoup might be the easiest way to do this. I've made a cursory check that the results of the following script are the same as those that appear on the page. You would just have to write the results to a file, rather than print them. However, the columns are ordered differently.
import requests
from bs4 import BeautifulSoup

URL = 'http://www.nasdaq.com/symbol/amd/historical'
page = requests.get(URL).text
soup = BeautifulSoup(page, 'lxml')
tableDiv = soup.find_all('div', id="historicalContainer")
tableRows = tableDiv[0].findAll('tr')

for tableRow in tableRows[2:]:
    row = tuple(tableRow.getText().split())
    print('"%s",%s,%s,%s,%s,"%s"' % row)
Output:
"03/24/2017",14.16,14.18,13.54,13.7,"50,022,400"
"03/23/2017",13.96,14.115,13.77,13.79,"44,402,540"
"03/22/2017",13.7,14.145,13.55,14.1,"61,120,500"
"03/21/2017",14.4,14.49,13.78,13.82,"72,373,080"
"03/20/2017",13.68,14.5,13.54,14.4,"91,009,110"
"03/17/2017",13.62,13.74,13.36,13.49,"224,761,700"
"03/16/2017",13.79,13.88,13.65,13.65,"44,356,700"
"03/15/2017",14.03,14.06,13.62,13.98,"55,070,770"
"03/14/2017",14,14.15,13.6401,14.1,"52,355,490"
"03/13/2017",14.475,14.68,14.18,14.28,"72,917,550"
"03/10/2017",13.5,13.93,13.45,13.91,"62,426,240"
"03/09/2017",13.45,13.45,13.11,13.33,"45,122,590"
"03/08/2017",13.25,13.55,13.1,13.22,"71,231,410"
"03/07/2017",13.07,13.37,12.79,13.05,"76,518,390"
"03/06/2017",13,13.34,12.38,13.04,"117,044,000"
"03/03/2017",13.55,13.58,12.79,13.03,"163,489,100"
"03/02/2017",14.59,14.78,13.87,13.9,"103,970,100"
"03/01/2017",15.08,15.09,14.52,14.96,"73,311,380"
"02/28/2017",15.45,15.55,14.35,14.46,"141,638,700"
"02/27/2017",14.27,15.35,14.27,15.2,"95,126,330"
"02/24/2017",14,14.32,13.86,14.12,"46,130,900"
"02/23/2017",14.2,14.45,13.82,14.32,"79,900,450"
"02/22/2017",14.3,14.5,14.04,14.28,"71,394,390"
"02/21/2017",13.41,14.1,13.4,14,"66,250,920"
"02/17/2017",12.79,13.14,12.6,13.13,"40,831,730"
"02/16/2017",13.25,13.35,12.84,12.97,"52,403,840"
"02/15/2017",13.2,13.44,13.15,13.3,"33,655,580"
"02/14/2017",13.43,13.49,13.19,13.26,"40,436,710"
"02/13/2017",13.7,13.95,13.38,13.49,"57,231,080"
"02/10/2017",13.86,13.86,13.25,13.58,"54,522,240"
"02/09/2017",13.78,13.89,13.4,13.42,"72,826,820"
"02/08/2017",13.21,13.75,13.08,13.56,"75,894,880"
"02/07/2017",14.05,14.27,13.06,13.29,"158,507,200"
"02/06/2017",12.46,13.7,12.38,13.63,"139,921,700"
"02/03/2017",12.37,12.5,12.04,12.24,"59,981,710"
"02/02/2017",11.98,12.66,11.95,12.28,"116,246,800"
"02/01/2017",10.9,12.14,10.81,12.06,"165,784,500"
"01/31/2017",10.6,10.67,10.22,10.37,"51,993,490"
"01/30/2017",10.62,10.68,10.3,10.61,"37,648,430"
"01/27/2017",10.6,10.73,10.52,10.67,"32,563,480"
"01/26/2017",10.35,10.66,10.3,10.52,"35,779,140"
"01/25/2017",10.74,10.975,10.15,10.35,"61,800,440"
"01/24/2017",9.95,10.49,9.95,10.44,"43,858,900"
"01/23/2017",9.68,10.06,9.68,9.91,"27,848,180"
"01/20/2017",9.88,9.96,9.67,9.75,"27,936,610"
"01/19/2017",9.92,10.25,9.75,9.77,"46,087,250"
"01/18/2017",9.54,10.1,9.42,9.88,"51,705,580"
"01/17/2017",10.17,10.23,9.78,9.82,"70,388,000"
"01/13/2017",10.79,10.87,10.56,10.58,"38,344,340"
"01/12/2017",10.98,11.0376,10.33,10.76,"75,178,900"
"01/11/2017",11.39,11.41,11.15,11.2,"39,337,330"
"01/10/2017",11.55,11.63,11.33,11.44,"29,122,540"
"01/09/2017",11.37,11.64,11.31,11.49,"37,215,840"
"01/06/2017",11.29,11.49,11.11,11.32,"34,437,560"
"01/05/2017",11.43,11.69,11.23,11.24,"38,777,380"
"01/04/2017",11.45,11.5204,11.235,11.43,"40,742,680"
"01/03/2017",11.42,11.65,11.02,11.43,"55,114,820"
"12/30/2016",11.7,11.78,11.25,11.34,"44,033,460"
"12/29/2016",11.24,11.62,11.01,11.59,"50,180,310"
"12/28/2016",12.28,12.42,11.46,11.55,"71,072,640"
"12/27/2016",11.65,12.08,11.6,12.07,"44,168,130"
The script quotes the dates and the thousands-separated numbers so that the commas inside them don't break the CSV fields.
Dig a little deeper and find out what the JS function getQuotes() does; you should get a good clue from that. If it all seems too complicated, you can always use Selenium, which simulates the browser, although it is much slower than making native network calls. You can find the official documentation here.
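A rough Selenium sketch of that click-to-download route, using the lnkDownLoad element shown in the question; the Firefox download preferences are illustrative, and the older find_element_by_* API (as used elsewhere on this page) is assumed:
# Sketch: drive a real browser so the javascript:getQuotes(true) handler runs.
# The element ID comes from the question; the profile preferences for silent
# downloads are illustrative and may need adjusting.
import time
from selenium import webdriver

profile = webdriver.FirefoxProfile()
profile.set_preference("browser.download.folderList", 2)      # use a custom dir
profile.set_preference("browser.download.dir", "/tmp/quotes")
profile.set_preference("browser.helperApps.neverAsk.saveToDisk",
                       "application/vnd.ms-excel,text/csv")

browser = webdriver.Firefox(firefox_profile=profile)
browser.get("http://www.nasdaq.com/symbol/amd/historical")
browser.find_element_by_id("lnkDownLoad").click()
time.sleep(10)  # crude wait for the download to finish
browser.quit()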
Getting specific images from a page
I am pretty new with BeautifulSoup. I am trying to print image links from http://www.bing.com/images?q=owl:
redditFile = urllib2.urlopen("http://www.bing.com/images?q=owl")
redditHtml = redditFile.read()
redditFile.close()

soup = BeautifulSoup(redditHtml)
productDivs = soup.findAll('div', attrs={'class' : 'dg_u'})
for div in productDivs:
    print div.find('a')['t1']     # works fine
    print div.find('img')['src']  # this raises KeyError: 'src'
But this gives only the title, not the image source. Is there anything wrong?
Edit: I have edited my source, but I still could not get the image URL.
Bing uses some techniques to block automated scrapers. I tried printing div.find('img') and found that they send the source in an attribute named src2, so the following should work:
div.find('img')['src2']
This is working for me. Hope it helps.
If you open up the browser developer tools, you'll see that there is an additional async XHR request issued to the http://www.bing.com/images/async endpoint which contains the image search results. That leads to the 3 main options you have:
1. Simulate that XHR request in your code (see the sketch below). You might want to use something more suitable for humans than urllib2; see the requests module. This is the so-called "low-level" approach, going down to the bare metal and the web-site-specific implementation, which makes this option unreliable, difficult, "heavy", error-prone and fragile.
2. Automate a real browser using selenium and stay at the high level. In other words, you don't care how the results are retrieved, what requests are made, or what JavaScript needs to be executed; you just wait for the search results to appear and extract them.
3. Use the Bing Search API (this should probably be option #1).
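A sketch of the first option, replaying the async request with requests; the endpoint comes from the answer above, but the query parameters and attribute names are assumptions — copy the exact request your own browser makes from the Network tab:
# Sketch of option 1: replay the async XHR with requests.
# The parameter names below are assumptions; mirror the real request
# shown in your browser's Network tab.
import requests
from bs4 import BeautifulSoup

params = {"q": "owl", "first": 0, "count": 35}   # assumed parameter names
headers = {"User-Agent": "Mozilla/5.0"}

resp = requests.get("http://www.bing.com/images/async",
                    params=params, headers=headers, timeout=10)
resp.raise_for_status()

soup = BeautifulSoup(resp.text, "html.parser")
for img in soup.find_all("img"):
    src = img.get("src2") or img.get("src")  # src2 fallback, per the answer above
    if src:
        print(src)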
Searching for books with the Amazon Product Advertising API - Python
tl;dr: I am using the Amazon Product Advertising API with Python. How can I do a keyword search for a book and get XML results that contain TITLE, ISBN, and PRICE for each entry?

Verbose version:

I am working in Python on a web site that allows the user to search for textbooks from different sites such as eBay and Amazon. Basically, I need to obtain simple information such as titles, ISBNs, and prices for each item from a set of search results from one of those sites. Then, I can store and format that information as needed in my application (e.g., displaying HTML).

In eBay's case, getting the info I needed wasn't too hard. I used urllib2 to make a request based on a sample I found. All I needed was a special security key to add to the URL:
def ebaySearch(keywords):  # keywords is a list of strings, e.g. ['moby', 'dick']
    # findItemsAdvanced allows category filter -- 267 is books
    # Of course, I replaced my security appname in the example below
    url = "http://svcs.ebay.com/services/search/FindingService/v1?OPERATION-NAME=findItemsAdvanced&SERVICE-NAME=FindingService&SERVICE-VERSION=1.0.0&SECURITY-APPNAME=[MY-APPNAME]&RESPONSE-DATA-FORMAT=XML&REST-PAYLOAD&categoryId=267&keywords="

    # Complete the url...
    numKeywords = len(keywords)
    for k in range(0, numKeywords-1):
        url += keywords[k]
        url += "%20"

    # There should not be %20 after the last keyword
    url += keywords[numKeywords-1]

    request = urllib2.Request(url)
    response = urllib2.urlopen(request)  # file-like thing (due to library conversion)
    xml_response = response.read()
    ...
...Then I parsed this with minidom.

In Amazon's case, it doesn't seem to be so easy. I thought I would start out by just looking for an easy wrapper, but their developer site doesn't seem to provide a Python wrapper for what I am interested in (the Product Advertising API). One that I have tried, python-amazon-product-api 0.2.5 from https://pypi.python.org/pypi/python-amazon-product-api/, has been giving me some installation issues that may not be worth the time to look into (but maybe I'm just exasperated..). I also looked around and found pyaws and pyecs, but these seem to use deprecated authentication mechanisms.

I then figured I would just try to construct the URLs from scratch as I did for eBay. But Amazon requires a time stamp in the URLs, which I suppose I could programmatically construct (perhaps something like these folks, who go the whole 9 yards with the signature: https://forums.aws.amazon.com/thread.jspa?threadID=10048).

Even if that worked (which I doubt, given the amount of frustration the logistics have caused so far), the bottom line is that I want the name, price, and ISBN for the books that I search for. I was able to generate a sample URL with the tutorial on the API website, and then see the XML result, which indeed contained titles and ISBNs. But no prices! Gah! After some desperate Google searching, a slight modification to the URL (adding &ResponseGroup=Offers and &MerchantID=All) did the trick, but then there were no titles. (I guess yet another question I would have, then, is where can I find an index of the possible ResponseGroup parameters?)

Overall, as you can see, I really just don't have a solid methodology for this. Is the construct-a-URL approach a decent way to go, or will it be more trouble than it is worth? Perhaps the tl;dr at the top is a better representation of the overall question.
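As a hedged aside on the ResponseGroup question: to the best of my understanding, the Product Advertising API accepts several comma-separated ResponseGroup values (e.g. ItemAttributes plus Offers), and the full list is in the ItemSearch documentation. A rough sketch of assembling such a request URL (signing omitted, so this is not yet a valid call) might look like:
# Sketch: assemble ItemSearch parameters with urlencode instead of manual
# string concatenation. The comma-separated ResponseGroup combination is my
# understanding of the API -- verify against the ItemSearch docs. Timestamp
# and Signature are omitted, so Amazon will not accept this URL as-is.
from urllib.parse import urlencode  # in Python 2: from urllib import urlencode

params = {
    "Service": "AWSECommerceService",
    "Operation": "ItemSearch",
    "SearchIndex": "Books",
    "Keywords": "moby dick",
    "ResponseGroup": "ItemAttributes,Offers",  # titles/ISBNs plus prices
    "AWSAccessKeyId": "[MY-ACCESS-KEY]",
    "AssociateTag": "[MY-ASSOC-TAG]",
}
url = "http://webservices.amazon.com/onca/xml?" + urlencode(params)
print(url)  # still needs Timestamp and Signature before it will be accepted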
Another way could be amazon-simple-product-api:
from amazon.api import AmazonAPI

amazon = AmazonAPI(ACCESS_KEY, SECRET, ASSOC)
results = amazon.search(Keywords="book name", SearchIndex="Books")
for item in results:
    print item.title, item.isbn, item.price_and_currency
To install, just clone from GitHub and run sudo python setup.py install. Hope this helps!
If you have installation issues with python-amazon-product-api, send the details to the mailing list and you will be helped.
Scraping AJAX - Using Python
I'm trying to scrape a page on YouTube with Python that has a lot of AJAX in it, so I'd have to call the JavaScript each time to get the info, but I'm not really sure how to go about it. I'm using the urllib2 module to open URLs. Any help would be appreciated.
YouTube (and everything else Google makes) has EXTENSIVE APIs already in place for giving you access to just about any and all data you could possibly want. Take a look at the YouTube Data API for more information. I use urllib to make the API requests and ElementTree to parse the returned XML.
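To illustrate the urllib + ElementTree pattern the answer describes, here is a sketch against the gdata-era feed URL from the same period; that endpoint has since been retired, so treat this as an illustration of the approach rather than a call that will still work (Python 2's urllib shown, matching the question):
# Sketch of the urllib + ElementTree approach. The gdata v2 feed URL is
# from the era of this answer and has since been retired -- illustration only.
import urllib  # Python 2; use urllib.request in Python 3
import xml.etree.ElementTree as ET

ATOM = "{http://www.w3.org/2005/Atom}"

feed_url = "https://gdata.youtube.com/feeds/api/videos?q=daft+punk&v=2&max-results=5"
xml_data = urllib.urlopen(feed_url).read()

root = ET.fromstring(xml_data)
for entry in root.findall(ATOM + "entry"):
    print(entry.find(ATOM + "title").text)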
The main problem is that you're violating the TOS (terms of service) of the YouTube site. YouTube engineers and lawyers will do their professional best to track you down and make an example of you if you persist. If you're happy with that prospect, then on your head be it -- technically, your best bets are python-spidermonkey and selenium. I wanted to put the technical hints on record in case anybody in the future has needs like the ones your question's title indicates, without the legal issues you clearly have if you continue in this particular endeavor.
Here is how I would do it: install Firebug on Firefox, then turn on the Net panel in Firebug and click on the desired link on YouTube. Now see what happens and which pages are requested. Find the ones that are responsible for the AJAX part of the page. Now you can use urllib or Mechanize to fetch those links (see the sketch below). If you CAN pull the same content this way, then you have what you are looking for -- just parse the content. If you CAN'T pull the content this way, that suggests the requested page might be checking user login credentials, session info, or other header fields such as HTTP_REFERER, etc. Then you might want to look at something more extensive like scrapy, etc. I would suggest that you always follow the simple path first. Good luck and happy "responsible" scraping! :)
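A sketch of the "fetch the discovered AJAX URL directly" step; the URL below is a hypothetical placeholder for whatever Firebug's Net panel actually shows, and the headers just illustrate the kind of fields such endpoints often check (Python 2's urllib2 shown, matching the question):
# Sketch: replay the AJAX URL found in Firebug's Net panel.
# The URL is a hypothetical placeholder; Referer/User-Agent illustrate
# headers the endpoint might expect.
import urllib2

ajax_url = "https://www.youtube.com/placeholder_ajax_endpoint"  # placeholder
req = urllib2.Request(ajax_url, headers={
    "User-Agent": "Mozilla/5.0",
    "Referer": "https://www.youtube.com/",
})
content = urllib2.urlopen(req).read()
print(content[:500])  # inspect whether the desired data is present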
As suggested, you should use the YouTube API to access the data made available legitimately. Regarding the general question of scraping AJAX, you might want to consider the scrapy framework. It provides extensive support for crawling and scraping web sites and can be combined with tools such as python-spidermonkey to follow JavaScript links.
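A minimal scrapy spider sketch, assuming you already have a page or feed URL that returns the data you want; the spider name, start URL and CSS selector are illustrative placeholders, not a YouTube-specific implementation:
# Minimal scrapy sketch. Name, start URL and selector are placeholders.
import scrapy

class VideoTitleSpider(scrapy.Spider):
    name = "video_titles"
    start_urls = ["https://example.com/some/listing/page"]  # placeholder

    def parse(self, response):
        # extract text from elements matching an illustrative selector
        for title in response.css("a.video-title::text").extract():
            yield {"title": title.strip()}
You would run it with something like scrapy runspider spider.py -o titles.json and adjust the selector to the actual markup you are targeting.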
You could sniff the network traffic with something like Wireshark and then replay the HTTP calls via a scraping framework that is robust enough to deal with AJAX, such as Scrapy.