I started learning web scraping with Python. Currently, I would like to download a video of the Japanese Diet. (https://www.shugiintv.go.jp/jp/index.php?ex=VL&deli_id=40124&media_type=)
The video seems to use a mechanism where playlist.m3u8 references chunklist.m3u8, and the ts files listed in chunklist.m3u8 are then fetched in order.
I want to download the contents of the playlist.m3u8 URL first, then fetch chunklist.m3u8, download the ts files in order, and concatenate them.
However, when I tried to download playlist.m3u8, it didn't produce the text I expected.
The sample URL for playlist.m3u8 is here:
http://hlsvod.shugiintv.go.jp/vod/_definst_/amlst:2011/2011-1207-0900-12/playlist.m3u8
code:
import requests
url = "http://hlsvod.shugiintv.go.jp/vod/_definst_/amlst:2011/2011-1207-0900-12/playlist.m3u8"
res = requests.get(url)
print(res.text)
expected text:
#EXTM3U
#EXT-X-VERSION:3
#EXT-X-STREAM-INF:BANDWIDTH=564000,NAME="500k",RESOLUTION=640x360
chunklist_w60346572_b564000_t64NTAwaw==.m3u8
actual text:
<html><head><title>Wowza Streaming Engine 4 Perpetual Bundle Unlimited Edition 4.7.7 build20181108145350</title></head><body>Wowza Streaming Engine 4 Perpetual Bundle Unlimited Edition 4.7.7 build20181108145350</body></html>
I think there is a problem with the colon in the URL, but I don't have a clear solution. I would like to know how to avoid URL issues and successfully download the text in playlist.m3u8. Thanks.
Version:
Python 3.7.9
requests 2.25.1
Something is wrong with your url:
>>> url = "http://hlsvod.shugiintv.go.jp/vod/_definst_/amlst:2011/2011-1207-0900-12/playlist.m3u8"
>>> res = requests.get(url)
>>> res.request.url
'https://hlsvod.shugiintv.go.jp/vod/_definst_/amlst:2011/2011-1207-0900-12/playlist.m3u8%20'
See the "%20" in the end?
I am not really sure how you got it wrong, but copy-paste this should work:
url = 'https://hlsvod.shugiintv.go.jp/vod/_definst_/amlst:2011/2011-1207-0900-12/playlist.m3u8'
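Once playlist.m3u8 comes back correctly, the rest of the flow you describe (fetch chunklist.m3u8, then the ts segments, then concatenate) can be sketched roughly like this. It assumes the chunklist and segment URIs are relative to the same base path, which matches the expected playlist text above; treat it as a starting point, not a polished downloader:
import requests
base = "https://hlsvod.shugiintv.go.jp/vod/_definst_/amlst:2011/2011-1207-0900-12/"
# 1) playlist.m3u8 lists the chunklist for each bitrate
playlist = requests.get(base + "playlist.m3u8").text
chunklist_name = next(line.strip() for line in playlist.splitlines()
                      if line.strip().endswith(".m3u8"))
# 2) chunklist.m3u8 lists the .ts segments in playback order
chunklist = requests.get(base + chunklist_name).text
segments = [line.strip() for line in chunklist.splitlines()
            if line.strip() and not line.startswith("#")]
# 3) download each segment and append it to a single output file
with open("output.ts", "wb") as out:
    for seg in segments:
        out.write(requests.get(base + seg).content)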
I need to create an iOS app that is basically a mobile GitHub repository browser. A user passes their input and they get the search results for the repositories.
I have tried approaching it from the basic HTTP point of view, but to no avail (check out this question):
How to print my http query results in Xcode?
And now, with support from @HedgeHog, I have found a piece of code which does exactly what the backend of my app should do; but it is written in Python:
import requests, sys, webbrowser, bs4
print('Your GitHub repository search query:')
userInput = input()
results = requests.get('https://github.com/search?q=' + userInput + '&type=repositories'
                       + ' '.join(sys.argv[1:]))
results.raise_for_status()
soup = bs4.BeautifulSoup(results.text, 'html.parser')
linkList = ['https://github.com/'+a['href'] for a in soup.select('.repo-list-item .f4 a[href]')]
As I have also learned, PythonKit doesn't support iOS apps.
Is there any way that I could put a *.py file in the Xcode project so that Xcode could run it as a function (pass an input from the Swift code to the Python file, then receive an output back in the Swift code)?
Say I have a random API, for example api.example.com. When you hit it, it picks a random image and sends back JSON for it, like {"url": "api.example.com/img1.png"}. After parsing the JSON, how can I download the image and save it in some folder, but skip the download if it has already been downloaded (i.e. the image name is already taken)?
Edit: here is the code I have so far.
import urllib.request
import requests
url = "https://nekos.life/api/v2/img/neko"
response = requests.get(url)
response.raise_for_status()
jsonResponse = response.json()
urll = jsonResponse["url"]
urllib.request.urlretrieve(urll, "neko.png")
As said in this article, I think [os.path][1] can do the job pretty well.
Just try to use
os.path.exists(photo_path)
and that should be it.
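Combining the code from the question with that check, a rough sketch (the images folder and deriving the file name from the image URL are assumptions on my part, not something stated in the question):
import os
import requests
url = "https://nekos.life/api/v2/img/neko"
response = requests.get(url)
response.raise_for_status()
image_url = response.json()["url"]
# assumed layout: save everything under ./images, named after the file in the URL
save_path = os.path.join("images", os.path.basename(image_url))
os.makedirs("images", exist_ok=True)
if not os.path.exists(save_path):  # skip the download if the file is already there
    img = requests.get(image_url)
    img.raise_for_status()
    with open(save_path, "wb") as f:
        f.write(img.content)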
[1]: https://linuxize.com/post/python-check-if-file-exists/
I'd like to download the entire revision history of a single article on Wikipedia, but am running into a roadblock.
It is very easy to download an entire Wikipedia article, or to grab pieces of its history using the Special:Export URL parameters:
curl -d "" 'https://en.wikipedia.org/w/index.php?title=Special:Export&pages=Stack_Overflow&limit=1000&offset=1' -o "StackOverflow.xml"
And of course I can download the entire site including all versions of every article from here, but that's many terabytes and way more data than I need.
Is there a pre-built method for doing this? (Seems like there must be.)
The example above only gets information about the revisions, not the actual contents themselves. Here's a short Python script that downloads the full content and metadata history of a page into individual JSON files:
import mwclient
import json
import time
site = mwclient.Site('en.wikipedia.org')
page = site.pages['Wikipedia']
for i, (info, content) in enumerate(zip(page.revisions(), page.revisions(prop='content'))):
    info['timestamp'] = time.strftime("%Y-%m-%dT%H:%M:%S", info['timestamp'])
    print(i, info['timestamp'])
    open("%s.json" % info['timestamp'], "w").write(json.dumps(
        {'info': info,
         'content': content}, indent=4))
Wandering around aimlessly looking for clues to another question I have myself — my way of saying I know nothing substantial about this topic! — I just came upon this a moment after reading your question: http://mwclient.readthedocs.io/en/latest/reference/page.html. Have a look for the revisions method.
EDIT: I also see http://mwclient.readthedocs.io/en/latest/user/page-ops.html#listing-page-revisions.
Sample code using the mwclient module:
#!/usr/bin/env python3
import logging, mwclient, pickle, os
from mwclient import Site
from mwclient.page import Page
logging.root.setLevel(logging.DEBUG)
logging.debug('getting page...')
env_page = os.getenv("MEDIAWIKI_PAGE")
page_name = env_page is not None and env_page or 'Stack Overflow'
page_name = Page.normalize_title(page_name)
site = Site('en.wikipedia.org')  # https by default. change w/`scheme=`
page = site.pages[page_name]
logging.debug('extracting revisions (may take a really long time, depending on the page)...')
revisions = []
for i, revision in enumerate(page.revisions()):
    revisions.append(revision)
logging.debug('saving to file...')
with open('{}Revisions.mediawiki.pkl'.format(page_name), 'wb+') as f:
    pickle.dump(revisions, f, protocol=0)  # protocol allows backwards compatibility between machines
I'm trying to manipulate a dynamic JSON from this site:
http://esaj.tjsc.jus.br/cposgtj/imagemCaptcha.do
It has 3 elements: imagem, a base64 image; labelValorCaptcha, just a message; and uuidCaptcha, a value to pass as a parameter to play a sound at the link below:
http://esaj.tjsc.jus.br/cposgtj/somCaptcha.do?timestamp=1455996420264&uuidCaptcha=sajcaptcha_e7b072e1fce5493cbdc46c9e4738ab8a
When I open the first site in a browser and paste the uuidCaptcha after the equals sign in the second link ("...uuidCaptcha="), the sound plays normally. I wrote some simple code to catch these elements:
import urllib, json
url = "http://esaj.tjsc.jus.br/cposgtj/imagemCaptcha.do"
response = urllib.urlopen(url)
data = json.loads(response.read())
urlSound = "http://esaj.tjsc.jus.br/cposgtj/somCaptcha.do?timestamp=1455996420264&uuidCaptcha="
print urlSound + data['uuidCaptcha']
But I don't know what's happening; the uuidCaptcha value caught this way doesn't work and just opens an error page.
Does anyone know why?
Thanks!
It works for me.
$ cat a.py
#!/usr/bin/env python
# encoding: utf-8
import urllib, json
url = "http://esaj.tjsc.jus.br/cposgtj/imagemCaptcha.do"
response = urllib.urlopen(url)
data = json.loads(response.read())
urlSound = "http://esaj.tjsc.jus.br/cposgtj/somCaptcha.do?timestamp=1455996420264&uuidCaptcha="
print urlSound + data['uuidCaptcha']
$ python a.py
http://esaj.tjsc.jus.br/cposgtj/somCaptcha.do?timestamp=1455996420264&uuidCaptcha=sajcaptcha_efc8d4bc3bdb428eab8370c4e04ab42c
As @Charlie Harding said, the best way is to download the page and get the JSON values from it, because this JSON is dynamic and only exists for an open web session.
More info here.
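For completeness, here is the same idea as a rough Python 3 sketch with requests; the Session is my assumption, the point is just that the uuidCaptcha has to come from a fresh request and be reused right away:
import requests
with requests.Session() as s:
    # fetch the dynamic JSON first; it contains the current uuidCaptcha
    data = s.get("http://esaj.tjsc.jus.br/cposgtj/imagemCaptcha.do").json()
    uuid = data["uuidCaptcha"]
    # request the sound with the uuid that was just issued
    sound = s.get("http://esaj.tjsc.jus.br/cposgtj/somCaptcha.do",
                  params={"timestamp": "1455996420264", "uuidCaptcha": uuid})
    print(sound.status_code, sound.headers.get("Content-Type"))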
I have checked the Google Search APIs and it seems that they have not released any API for searching images. So, I was wondering if there is a Python script/library through which I can automate the "search by image" feature.
This was annoying enough to figure out that I thought I'd throw a comment on the first python-related stackoverflow result for "script google image search". The most annoying part of all this is setting up your proper application and custom search engine (CSE) in Google's web UI, but once you have your api key and CSE, define them in your environment and do something like:
#!/usr/bin/env python
# save top 10 google image search results to current directory
# https://developers.google.com/custom-search/json-api/v1/using_rest
import requests
import os
import sys
import re
import shutil
url = 'https://www.googleapis.com/customsearch/v1?key={}&cx={}&searchType=image&q={}'
apiKey = os.environ['GOOGLE_IMAGE_APIKEY']
cx = os.environ['GOOGLE_CSE_ID']
q = sys.argv[1]
i = 1
for result in requests.get(url.format(apiKey, cx, q)).json()['items']:
    link = result['link']
    image = requests.get(link, stream=True)
    if image.status_code == 200:
        m = re.search(r'[^\.]+$', link)
        filename = './{}-{}.{}'.format(q, i, m.group())
        with open(filename, 'wb') as f:
            image.raw.decode_content = True
            shutil.copyfileobj(image.raw, f)
        i += 1
There is no API available, but you can parse the page and imitate the browser; I don't know how much data you need to parse, because Google may limit or block access.
You can imitate the browser by simply using urllib and setting the correct headers. If you think parsing complex web pages from Python may be difficult, you can use a headless browser like PhantomJS instead; inside a browser it is trivial to get the correct elements using JavaScript/DOM.
Note: before trying any of this, check Google's TOS.
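As a rough illustration of the urllib-plus-headers idea (the User-Agent value and the query URL here are assumptions, and Google may still throttle or block scripted requests):
import urllib.parse
import urllib.request
query = "red pandas"
url = "https://www.google.com/search?tbm=isch&q=" + urllib.parse.quote(query)
req = urllib.request.Request(url, headers={"User-Agent": "Mozilla/5.0"})  # look like a regular browser
with urllib.request.urlopen(req) as resp:
    html = resp.read().decode("utf-8", errors="replace")
print(len(html), "bytes of HTML to parse for image URLs")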
You can try this:
https://developers.google.com/image-search/v1/jsondevguide#json_snippets_python
It's deprecated, but seems to work.