import requests
from sys import exit

url = 'https://www.videohere.com'  # requests needs a full URL, including the scheme

while True:
    try:
        requests.get(url)
    except KeyboardInterrupt:
        exit(0)
OK, so I'm trying to use requests to view the video on this page, basically just staying on the page as if it were being watched. I have no clue how to go about doing this, so if anyone could give any suggestions or show a little proof of concept, please do! Thanks, guys.
What do you mean by using requests.get() to view a video?
If you would like the progress bar (I don't know exactly what it is named; hope you can understand) to move along, you just need to POST some data to the server. You don't need to actually "get" the video.
For example, the Chinese video-sharing website BILIBILI uses a heartbeat to record where you were. This helps if you close the webpage and open it again: it will notify you that you can resume from the second you were watching.
On different websites, the method might be different. On YouTube, the key is encrypted, so it might be a little bit challenging to make your crawler pretend to be a real person watching the video.
For the site you mentioned as mixed.com, you can find the following in Developer Tools:
To give you a brief idea of posting data on this site, the URL is:
https://browser.events.data.microsoft.com/OneCollector/1.0/?cors=true&content-type=application/x-json-stream&client-id=NO_AUTH&client-version=1DS-Web-JS-2.0.0-rc3&apikey=1918f1672f934c5ca6b2669551351de2-b61c1f8e-4440-4e53-8e38-b46e4c88a7e5-7038&upload-time=1569343709069&w=0
Once you are on someone's page, you can see that your browser keeps posting data to this target URL every few seconds. Personally speaking, I think it is a tool to record the number of users watching in the room.
The https://browser.events.data.microsoft.com/OneCollector/1.0/ part will be the same for everyone, but the rest of the query string should be generated from the contents of your request headers.
What's more, the apikey and the payload in the following picture tell me that it is really tough to make a crawler pretend to be a real user.
Most of it can be fetched easily from the website, but it requires care to avoid mistakes.
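To make the pattern concrete, here is a minimal sketch of a heartbeat loop with requests. Everything in it is illustrative: the endpoint is the base URL quoted above, but the query string, event name, and payload fields are placeholders that must be copied from your own browser's Network tab, and signed values like the apikey cannot simply be invented.

import json
import time

import requests

# Hypothetical heartbeat loop. The real query string, headers, and payload
# must be copied from the browser's Network tab; signed keys cannot be faked.
HEARTBEAT_URL = "https://browser.events.data.microsoft.com/OneCollector/1.0/"

session = requests.Session()
session.headers.update({
    "User-Agent": "Mozilla/5.0",                  # mimic a real browser
    "Content-Type": "application/x-json-stream",
})

while True:
    event = {
        "name": "heartbeat",                      # made-up event name
        "time": int(time.time() * 1000),          # timestamp in milliseconds
    }
    response = session.post(HEARTBEAT_URL, data=json.dumps(event))
    print(response.status_code)
    time.sleep(10)                                # the real page posts every few seconds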
Related
I am trying to build a comment scraper for YouTube comments. I already tried to build a scraper with Selenium by opening a headless browser, scrolling down, opening comment responses, and then extracting the data. But a headless browser is too slow, and the scrolling also seems to be unreliable, because the number of scraped comments does not match the comment count given for each video. Maybe I did something wrong, but that is irrelevant to me right now: one of the main reasons I would like to find another way is time: it is too slow.
I know there have been a lot of questions about scraping YouTube comments on Stack Overflow, but almost every answer I found suggested using some kind of headless browser, i.e. Selenium, to do the job. I don't like that, for the reasons mentioned above. Also, some references I found online suggest refraining from using Selenium for web scraping and instead trying reverse engineering, which, as far as I understand, means emulating the Ajax calls and getting the data directly, ideally as JSON.
When scrolling down a YouTube video (here's an example), the comments are loaded dynamically. If I have a look at the XHR activity, I get a lot of information:
Looking at the XHR activity, it seems like I can find all the information I need to make a request for the JSON data with the comments. But I actually struggle to construct the right request in order to obtain the JSON file with the comments.
I read some tutorials online, which mostly offer simple examples that are easy to replicate. But none of them really helped me with my endeavour.
Can someone give me a hint and show me how to post the request in order to get the JSON file with the comments in Python? I am using the requests library.
I know there is a YouTube API which can do the job, but I wanted to find out whether there is a way of doing the same thing without the API. I think it should be possible.
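For reference, the general reverse-engineering pattern looks roughly like the sketch below. Every specific value in it is a placeholder: the endpoint URL, headers, and continuation token have to be copied from the exact XHR request you observe in the network tab, and YouTube's internal parameters are undocumented and change over time.

import requests

# All values below are placeholders; copy the real URL, headers, and body
# from the XHR request visible in the browser's network tab.
xhr_url = "https://www.youtube.com/PLACEHOLDER_COMMENTS_ENDPOINT"

headers = {
    "User-Agent": "Mozilla/5.0",
    "Content-Type": "application/json",
}

payload = {"continuation": "PLACEHOLDER_TOKEN"}  # pagination token from the page

response = requests.post(xhr_url, headers=headers, json=payload)
response.raise_for_status()
data = response.json()

# The key path to the comment text depends on the actual response structure;
# start by inspecting the top-level keys.
print(list(data.keys()))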
I want to build an API that accepts a string and returns HTML code.
Here is my scraping code that I want to expose as a web service.
Code
from selenium import webdriver
import bs4
import time

url = "https://www.pnrconverter.com/"
browser = webdriver.Firefox()
browser.get(url)

# The PNR segment spans two lines, so a triple-quoted string is needed.
string = """3 PS 232 M 03FEB 7 JFKKBP HK2 1230A 420P 03FEB E
PS/JPIX8U"""

# XPath attribute tests use @, not #.
textarea = browser.find_element_by_xpath("//textarea[@class='dataInputChild']")
textarea.send_keys(string)  # accept string
textarea.submit()
time.sleep(5)

soup = bs4.BeautifulSoup(browser.page_source, 'html.parser')
html = soup.find('div', class_="main-content")  # returns html
print(html)
Can anyone tell me the best possible way to wrap up my code as an API/web service?
There's no best possible solution in general, because a solution has to fit the problem and the available resources.
Right now it seems like you're trying to wrap someone else's website. If that's the problem you're actually trying to solve, and you want to give credit, you should probably just forward people to their site: have your site return a 302 Redirect with their URL in the Location field of the response header.
If what you're trying to do is get the response for this one sample check you have hardcoded, and make that result available, I would suggest you put it in a static file behind nginx.
If what you're trying to do is use their backend to turn itineraries you have into responses you can return, you can do that by using their backend API, once that becomes available. Read the documentation, use the requests library to hit the API endpoint that you want, and get the JSON result back, and format it to your desires.
If you're trying to duplicate their site by making yourself a man-in-the-middle, that may be illegal and you should reconsider what you're doing.
For hosting purposes, you need to figure out how often your API will be hit. You can probably start on Heroku or something similar fairly easily, and scale up if you need to. You'll probably want WebObj or Flask or something similar sitting at the website where you intend to host this application. You can use those to process what I presume will be a simple request into the string you wish to hit their API with.
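As a concrete illustration of the Flask option, here is a minimal sketch. The route name and request shape are arbitrary, and parse_pnr is a hypothetical stub standing in for whatever legitimate backend call (ideally the site's official API) you end up using.

from flask import Flask, jsonify, request

app = Flask(__name__)

def parse_pnr(text):
    # Hypothetical stub: replace with a call to the real backend,
    # ideally the site's official API once it is available.
    return "<div>not implemented</div>"

@app.route("/convert", methods=["POST"])
def convert():
    pnr_text = request.get_json().get("pnr", "")   # expects {"pnr": "..."}
    return jsonify({"html": parse_pnr(pnr_text)})

if __name__ == "__main__":
    app.run(port=5000)

A client would then POST its itinerary string to /convert and get back JSON containing the rendered HTML.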
I am the owner of PNR Converter, so I can shed some light on your attempt to scrape content from our site. Unfortunately scraping from PNR Converter is not recommended. We are developing an API which looks like it would suit your needs, and should be ready in the not too distant future. If you contact us through the site we would be happy to work with you should you wish to use PNR Converter legitimately. PNR Converter gets at least one complete update per year and as such we change all the code on a regular basis. We also monitor all requests to our site, and we will block any requests which are deemed as improper usage. Our filter has already picked up your IP address (ends in 250.144) as potential misuse.
Like I said, should you wish to work with us at PNR Converter legitimately and not scrape our content, we would be happy to do so! Please keep checking https://www.pnrconverter.com/api-introduction for information relating to our API.
We are releasing a backend upgrade this weekend, which will have a different HTML structure, and dynamically named elements which will cause a serious issue for web scrapers!
My younger brother, who still lives in China, is a fan of Michael Phelps and wants to see his Twitter posts. Since they can't access Twitter behind the GFW, and setting up a VPN is too hard for my mom, I want to write something that grabs the tweets and sends them to my mom's email.
I use Python as my main language and am familiar with tweepy / requests / scrapy.
I have tried or thought about three ways of doing this:
Use the Twitter API and grab the user_timeline. However, this method loses all graphical data and returns a bunch of links that are only useful after proper rendering.
Do web scraping and save the HTML content, then send the HTML file as an attachment. However, this method still loses some graphical content and is not that user-friendly for someone in her 40s. In addition, it will be kind of hard to tell how many tweets I have scraped and whether there are any updates.
Wrap the HTML content in the email and rely on HTML rendering within the email. I haven't worked with this before, so I am not exactly sure how it's going to work out.
I am aware that "what's the best way to do this" kinda question is always downvoted on SO but I do believe this problem is particular enough to engage meaningful Q&As. Any suggestion will be appreciated.
Have you thought of using Selenium and taking screenshots of the browser window? Taking a screenshot with Selenium is as easy as:

from selenium import webdriver

browser = webdriver.Firefox()
browser.get('https://twitter.com')  # Selenium needs a full URL with a scheme
browser.get_screenshot_as_file('twitter_screenshot.png')
You'd have to figure out a way to automate both watching for new tweets and running the Selenium script when a new tweet is found; a rough sketch of such a loop follows the docs link below. However, in terms of preserving graphical content, taking screenshots with Selenium would be simple to implement.
Docs: http://selenium-python.readthedocs.io/api.html#selenium.webdriver.remote.webdriver.WebDriver.get_screenshot_as_file
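Here is a rough sketch of that loop, assuming a Firefox WebDriver. The profile URL, email addresses, and SMTP server are placeholders, and actual new-tweet detection (e.g. via the Twitter API) is left out; this version simply mails a fresh screenshot once a day.

import smtplib
import time
from email.message import EmailMessage

from selenium import webdriver

PROFILE_URL = "https://twitter.com/MichaelPhelps"   # placeholder handle

def take_screenshot(path="latest_tweets.png"):
    browser = webdriver.Firefox()
    try:
        browser.get(PROFILE_URL)
        time.sleep(5)                 # crude wait for the timeline to render
        browser.get_screenshot_as_file(path)
    finally:
        browser.quit()
    return path

def email_screenshot(path):
    msg = EmailMessage()
    msg["Subject"] = "New tweets"
    msg["From"] = "you@example.com"   # placeholder addresses and server
    msg["To"] = "mom@example.com"
    with open(path, "rb") as f:
        msg.add_attachment(f.read(), maintype="image", subtype="png",
                           filename=path)
    with smtplib.SMTP("smtp.example.com") as server:
        server.send_message(msg)

while True:
    email_screenshot(take_screenshot())
    time.sleep(24 * 60 * 60)          # once a day; adjust as needed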
I'm a little new to web crawlers and such, though I've been programming for a year already. So please bear with me as I try to explain my problem here.
I'm parsing info from Yahoo! News, and I've managed to get most of what I want, but there's a little portion that has stumped me.
For example: http://news.yahoo.com/record-nm-blaze-test-forest-management-225730172.html
I want to get the numbers beside the thumbs-up and thumbs-down icons in the comments. When I use "Inspect Element" in my Chrome browser, I can clearly see the things I have to look for, namely an em tag under the div class 'ugccmt-rate'. However, I'm not able to find this in my Python program. In trying to track down the root of the problem, I clicked to view the source of the page, and it seems that this tag is not there. Do you guys know how I should approach this problem? Does this have something to do with the JavaScript on the page that displays the info only after it runs? I'd appreciate some pointers in the right direction.
Thanks.
The page is being generated via JavaScript.
Check if there is a mobile version of the website first. If not, check for any APIs or RSS/Atom feeds. If there's nothing else, you'll either have to manually figure out what the JavaScript is loading and from where, or use Selenium to automate a browser that renders the JavaScript for you for parsing.
Using the Web Console in Firefox you can pretty easily see what requests the page is actually making as it runs its scripts, and figure out what URI returns the data you want. Then you can request that URI directly in your Python script and tease the data out of it. It is probably in a format that Python already has a library to parse, such as JSON.
Yahoo! may have some stuff on their server side to try to prevent you from accessing these data files in a script, such as checking the browser (user-agent header), cookies, or referrer. These can all be faked with enough perseverance, but you should take their existence as a sign that you should tread lightly. (They may also limit the number of requests you can make in a given time period, which is impossible to get around.)
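To make the approach concrete, replaying such a request with requests might look like the sketch below. The data URI is a placeholder for whatever endpoint the Web Console reveals, and the headers are examples of faking the checks mentioned above.

import requests

# Placeholder URI: the real one comes from watching the page's requests
# in the browser's network console.
data_uri = "https://news.yahoo.com/PLACEHOLDER_COMMENTS_ENDPOINT"

headers = {
    # Examples of faking the checks the server might perform.
    "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64)",
    "Referer": "http://news.yahoo.com/record-nm-blaze-test-forest-management-225730172.html",
}

response = requests.get(data_uri, headers=headers)
response.raise_for_status()
data = response.json()   # such endpoints usually return JSON
print(data)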
I'm trying to read in info that is constantly changing from a website.
For example, say I wanted to read in the artist name that is playing on an online radio site.
I can grab the current artist's name, but when the song changes, the HTML updates itself, and I've already opened the file via:
f = urllib.urlopen("SITE")
So I can't see the updated artist name for the new song.
Can I keep closing and opening the URL in a while(1) loop to get the updated HTML code or is there a better way to do this? Thanks!
You'll have to periodically re-download the website. Don't do it constantly because that will be too hard on the server.
This is because HTTP, by nature, is not a streaming protocol. Once you connect to the server, it expects you to send it an HTTP request, and it then sends back an HTTP response containing the page. If your connection is keep-alive (the default as of HTTP/1.1), you can send the same request again over it and get the page up to date.
What would I recommend? Depending on your needs, fetch the page every n seconds and extract the data you need; a sketch of that loop is below. If the site provides an API, you can possibly capitalize on that. Also, if it's your own site, you might be able to implement comet-style Ajax over HTTP and get a true stream.
Also note that if it's someone else's page, the site may use Ajax via JavaScript to keep itself up to date; this means there are other requests causing the updates, and you may need to dissect the website to figure out which requests you need to make to get the data.
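A minimal fixed-interval polling sketch, assuming a placeholder URL and a hypothetical element that holds the artist name:

import time

import requests
from bs4 import BeautifulSoup

url = "https://example-radio.com/now-playing"      # placeholder URL

previous_artist = None
while True:
    page = requests.get(url).text                  # re-download each pass
    soup = BeautifulSoup(page, "html.parser")
    tag = soup.find("span", class_="artist")       # hypothetical element
    artist = tag.get_text(strip=True) if tag else None
    if artist and artist != previous_artist:
        print("Now playing:", artist)
        previous_artist = artist
    time.sleep(30)                                 # be gentle on the server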
If you use urllib2, you can read the response headers when you make the request. If the server sends back a "304 Not Modified" status, the content hasn't changed; for that to happen, you need to send a conditional header such as If-Modified-Since along with your request.
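A sketch of that conditional-GET pattern, written here with the requests library (the same header logic applies to urllib2); the URL is a placeholder:

import time

import requests

url = "https://example-radio.com/now-playing"   # placeholder URL

last_modified = None
while True:
    headers = {}
    if last_modified:
        headers["If-Modified-Since"] = last_modified
    response = requests.get(url, headers=headers)
    if response.status_code == 304:
        pass                                    # unchanged since last fetch
    else:
        last_modified = response.headers.get("Last-Modified")
        # ...parse response.text for the artist name here...
    time.sleep(30)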
Yes, this is the correct approach. To see changes on the web, you have to send a new query each time; live Ajax sites do exactly the same internally.
Some sites provide an additional API, including long polling. Look for documentation on the site or ask their developers whether one exists.
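If the site does offer long polling, the client side can be as simple as the sketch below. The endpoint is hypothetical; the point is that the server holds each request open until new data is available or its timeout elapses.

import requests

# Hypothetical long-polling endpoint.
url = "https://example-radio.com/api/now-playing/poll"

while True:
    try:
        response = requests.get(url, timeout=60)   # allow the server to wait
        if response.status_code == 200:
            print(response.json())                 # the update, when it comes
    except requests.exceptions.Timeout:
        continue                                   # no update; poll again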