I just wrote a script that scrapes all YouTube links on a given page (this one: https://coreyms.com/) and opens them in a browser (this has no particular use besides learning how to do it).
from requests_html import HTMLSession
import re
import webbrowser

url = input("website url to launch all embedded videos: ")
if url == '':
    url = 'https://coreyms.com/'
sess = HTMLSession()
r = sess.get(url)
whole_file = r.text
# non-greedy match so one ID is captured per embed URL
pattern = re.compile(r'https://www\.youtube\.com/embed/(.+?)[?"]')
video_ids = pattern.findall(whole_file)
for video_id in video_ids:
    webbrowser.open('https://www.youtube.com/watch?v=' + video_id)
The problem is that it only works with that specific website (because I used its HTML source to figure out what my script must search for).
I managed to make the script work with this one as well:
https://www.udiscovermusic.com/stories/best-heavy-metal-songs/
But with another one:
https://www.musicgrotto.com/best-classic-rock-songs/
it does not work. I think this is not the right approach.
How can I generalise my script to make it work with any website?
Is there a common thread I could use?
I don't know much about HTML and web development (that's exactly why I want to do this).
Thanks
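For what it's worth, a more general pattern than a site-specific regex is to look for any `<iframe>` whose `src` points at youtube.com/embed, since that is how most sites embed YouTube players. A minimal sketch against a made-up HTML snippet (the snippet is invented for illustration; a real page would be fetched with requests first):

```python
from bs4 import BeautifulSoup

# Made-up snippet standing in for a downloaded page.
html = '''
<div class="post">
  <iframe src="https://www.youtube.com/embed/abc123?rel=0"></iframe>
  <iframe src="https://player.vimeo.com/video/999"></iframe>
  <iframe src="https://www.youtube.com/embed/xyz789"></iframe>
</div>
'''

soup = BeautifulSoup(html, 'html.parser')
# Keep only iframes that embed YouTube; ignore other players (e.g. Vimeo).
video_urls = [
    frame['src']
    for frame in soup.find_all('iframe', src=True)
    if 'youtube.com/embed/' in frame['src']
]
print(video_urls)
# Each of these could then be passed to webbrowser.open().
```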
Related
New to Python. I'm trying to find a way to get the URL of the first picture I get from a Google search for some string. For example, if I type "dog" I would like to get the first picture URL for dog. I don't care which one, just some URL from Google Image search.
Is it possible? What is the easiest way to do it? I saw from previous threads many ways to extract/download the image, but I just need the URL and it doesn't matter which one.
This should work; simply replace the word to get images of anything.
Make sure you have requests and BeautifulSoup installed; if not, run this command:
pip install requests beautifulsoup4
import requests
from bs4 import BeautifulSoup
word = 'dog'
url = 'https://www.google.com/search?q={0}&tbm=isch'.format(word)
content = requests.get(url).content
soup = BeautifulSoup(content, 'lxml')
images = soup.find_all('img')
for image in images:
    print(image.get('src'))
I don't know about Google, but I do know an easy way to do this with Bing. There's a PyPI module called bing-image-urls (https://pypi.org/project/bing-image-urls), and it will do the job nicely. It's pretty easy to use. Just install it with:
pip install bing-image-urls
Then, in your Python script, use this code:
from bing_image_urls import bing_image_urls
url = bing_image_urls("dog", limit=1)[0]
print(url)
Just replace "dog" in this example with whatever you want.
Hopefully this answers your question
Thanks!
I am trying to use BeautifulSoup to get all the images on a site with a certain class. My issue is that when I run the code just to see if it can find each image, it only gets images 1-5. I think the issue is the HTML, since images 6 onward are located in a nested div, but find_all should be able to find all the img tags with the same class.
import requests, os, bs4, sys, webbrowser
url = 'https://mangapanda.onl/chapter/'
os.makedirs('manga', exist_ok=True)
comic = sys.argv[1:]
aComic = '-'.join(sys.argv[1:])
issue = input('which issue do you want?')
aIssue = '/chapter-' + issue
aComic = aComic + '_110' + aIssue
comicUrl = url + aComic
res = requests.get(comicUrl)
res.raise_for_status()
soup = bs4.BeautifulSoup(res.text, 'html.parser')
comicElem = soup.find_all(class_="PB0mN")
if comicElem == []:
    print('nothing in the list')
else:
    print('There are ' + str(len(comicElem)) + ' on this page')
    for i in range(len(comicElem)):
        comicPage = comicElem[i].get('src')
        print(str(comicPage) + '\n')
Is there something I am missing when it comes to using BeautifulSoup that could have helped me solve this? Is it the HTML that is causing this problem? Was there a better way I could have diagnosed this problem myself that would have been in my realm of capability?
(Side note: I am currently going through the book "Automate the Boring Stuff with Python"; it is where I got the idea for this mini project, and it gives a decent idea of where my skill level is with Python.)
Lastly, I am using BeautifulSoup to learn more about it. If possible I would like to solve this issue using BeautifulSoup, but I will research other options for parsing HTML if I need to.
Using Firefox Quantum 59.0.2
Using Python 3
PS: if you know of other questions that may have already answered this problem, feel free to just link me to them. I really wanted to figure out the answer through someone else's questions, but it seems my issue is pretty unique.
The problem is that some of the images are added to the DOM via JavaScript after the page is loaded. So
res = requests.get(comicUrl)
gets the HTML before any modifications are made by JavaScript. This is why
soup = bs4.BeautifulSoup(res.text, 'html.parser')
comicElem = soup.find_all(class_="PB0mN")
len(comicElem) # = 5
only finds 5 images.
If you want to get all the images that are loaded, you cannot use the requests library alone. Here is an example using Selenium:
from selenium import webdriver
browser = webdriver.Chrome('/Users/glenn/Downloads/chromedriver')
comicUrl = "https://mangapanda.onl/chapter/naruto_107/chapter-700.5"
browser.get(comicUrl)
images = browser.find_elements_by_class_name("PB0mN")
for image in images:
print(image.get_attribute('src'))
len(images) # = 18 images
See this post for additional resources for scraping javascript pages:
Web-scraping JavaScript page with Python
Regarding how to tell whether the HTML is being modified by JavaScript:
I don't have any hard rules, but here are some investigative steps you can carry out:
As you observed, finding only 5 images with requests while seeing more images on the page is the first clue that the DOM is being changed after it is loaded.
A second step: under the browser's Developer Tools -> Scripts you can see there are several JavaScript files associated with the page. Note that not all JavaScript modifies the DOM, so the presence of these scripts does not necessarily mean they are modifying it.
For further verification that the DOM is being modified after the page is loaded:
Copy the HTML from Developer Tools -> View Page Source into an HTML formatter tool like http://htmlformatter.com, format it, and look at the line count. View Page Source is the HTML sent by the server, without any modifications.
Then copy the HTML from Developer Tools -> Elements (be sure to get the whole thing, from <html> to </html>) into the same formatter, format it, and look at the line count. The Elements tab shows the complete, modified DOM.
If the line counts are significantly different, then you know the DOM is being modified after it is loaded.
Comparing line counts for "https://mangapanda.onl/chapter/naruto_107/chapter-700.5" shows 479 lines for the source HTML and 3245 lines for the complete DOM, so you know something is modifying the DOM after the page is loaded.
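The same check can be sketched programmatically: parse the server-sent HTML and the rendered DOM (e.g. saved from the Elements tab) and compare element counts. The two HTML strings below are invented stand-ins for those two snapshots:

```python
from bs4 import BeautifulSoup

# Stand-ins: what the server sent vs. the DOM after scripts ran.
raw_html = '<html><body><img class="PB0mN" src="1.jpg"></body></html>'
rendered_html = ('<html><body>'
                 '<img class="PB0mN" src="1.jpg">'
                 '<img class="PB0mN" src="2.jpg">'
                 '<img class="PB0mN" src="3.jpg">'
                 '</body></html>')

# Count the target elements in each snapshot.
raw_count = len(BeautifulSoup(raw_html, 'html.parser').find_all(class_='PB0mN'))
rendered_count = len(BeautifulSoup(rendered_html, 'html.parser').find_all(class_='PB0mN'))
print(raw_count, rendered_count)

if rendered_count > raw_count:
    print('JavaScript is adding elements after the page loads')
```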
I am trying to capture all the visible content of a page as text. Let's take that one as an example.
If I store the page source, I won't be capturing the comments section, because it's loaded using JavaScript.
Is there a way to take HTML snapshots with selenium webdriver?
(Preferably expressed using the python wrapper)
Regardless of whether or not the HTML of the page is generated using JavaScript, you will still be able to capture it using driver.page_source.
I imagine the reason you haven't been able to capture the source of the comments section in your example is that it's contained in an iframe. In order to capture the HTML source for content within a frame/iframe, you'll need to first switch focus to that particular frame and then call driver.page_source.
This code will take a screenshot of the entire page:
from selenium import webdriver
driver = webdriver.Firefox()
driver.get('https://dukescript.com/best/practices/2015/11/23/dynamic-templates.html')
driver.save_screenshot('screenshot.png')
driver.quit()
However, if you just want a screenshot of a specific element, you could use this:
import math
from base64 import b64decode

from selenium.webdriver import ActionChains
from selenium.webdriver.remote.webelement import WebElement
from wand.image import Image

def get_element_screenshot(element: WebElement) -> bytes:
    driver = element._parent
    ActionChains(driver).move_to_element(element).perform()  # scroll the element into view
    src_base64 = driver.get_screenshot_as_base64()
    scr_png = b64decode(src_base64)
    scr_img = Image(blob=scr_png)
    x = element.location["x"]
    y = element.location["y"]
    w = element.size["width"]
    h = element.size["height"]
    scr_img.crop(
        left=math.floor(x),
        top=math.floor(y),
        width=math.ceil(w),
        height=math.ceil(h))
    return scr_img.make_blob()
Where the WebElement is the element you're chasing. Of course, this method requires you to import b64decode from base64 and Image from wand.image to handle the cropping.
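If Wand isn't available, Pillow can do the same crop. Here is a sketch with a synthetic image standing in for the screenshot bytes, and made-up element geometry (x, y, w, h are placeholders for what element.location and element.size would return):

```python
import io
import math
from PIL import Image

# Synthetic stand-in for driver.get_screenshot_as_png().
screenshot = Image.new('RGB', (800, 600), 'white')
buf = io.BytesIO()
screenshot.save(buf, format='PNG')
scr_png = buf.getvalue()

# Hypothetical element geometry (element.location / element.size).
x, y, w, h = 100.0, 50.0, 200.0, 150.0

# Pillow's crop box is (left, upper, right, lower).
img = Image.open(io.BytesIO(scr_png))
cropped = img.crop((math.floor(x), math.floor(y),
                    math.ceil(x + w), math.ceil(y + h)))
print(cropped.size)  # (200, 150)
```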
I am trying to get an image from the Google Maps APIs, more precisely from the staticmap API.
The problem is that with other Google Maps APIs you can choose whether you want the response in JSON or XML, but with staticmap (which returns an image) it seems you can't.
So I don't know how to handle the image provided by the URL, since I don't know how it is encoded.
This is what I'm trying to do:
import requests
url = ("https://maps.googleapis.com/maps/api/staticmap?size=400x400&path=weight:3%7Ccolor:orange%7Cenc:polyline_data")
response = requests.get(url)
print(response.json())
Given that the response is probably not JSON, it raises the following error:
ValueError: Expecting value: line 1 column 1 (char 0)
I hope you've got some advice about how to turn the response into something usable.
ummmm... ok, you are thinking too much.
staticmap (which returns an image)
Yes, you are right, and since the response is itself an image, you can put the URL straight into <img src="here"/>:
The following is a demo of it, using the example from the documentation.
<img src="https://maps.googleapis.com/maps/api/staticmap?size=400x400&path=weight:3%7Ccolor:orange%7Cenc:_fisIp~u%7CU}%7Ca#pytA_~b#hhCyhS~hResU%7C%7Cx#oig#rwg#amUfbjA}f[roaAynd#%7CvXxiAt{ZwdUfbjAewYrqGchH~vXkqnAria#c_o#inc#k{g#i`]o%7CF}vXaj\h`]ovs#?yi_#rcAgtO%7Cj_AyaJren#nzQrst#zuYh`]v%7CGbldEuzd#%7C%7Cx#spD%7CtrAzwP%7Cd_#yiB~vXmlWhdPez\_{Km_`#~re#ew^rcAeu_#zhyByjPrst#ttGren#aeNhoFemKrvdAuvVidPwbVr~j#or#f_z#ftHr{ZlwBrvdAmtHrmT{rOt{Zz}E%7Cc%7C#o%7CLpn~AgfRpxqBfoVz_iAocAhrVjr#rh~#jzKhjp#``NrfQpcHrb^k%7CDh_z#nwB%7Ckb#a{R%7Cyh#uyZ%7CllByuZpzw#wbd#rh~#%7C%7CFhqs#teTztrAupHhyY}t]huf#e%7CFria#o}GfezAkdW%7C}[ocMt_Neq#ren#e~Ika#pgE%7Ci%7CAfiQ%7C`l#uoJrvdAgq#fppAsjGhg`#%7ChQpg{Ai_V%7C%7Cx#mkHhyYsdP%7CxeA~gF%7C}[mv`#t_NitSfjp#c}Mhg`#sbChyYq}e#rwg#atFff}#ghN~zKybk#fl}A}cPftcAite#tmT__Lha#u~DrfQi}MhkSqyWivIumCria#ciO_tHifm#fl}A{rc#fbjAqvg#rrqAcjCf%7Ci#mqJtb^s%7C#fbjA{wDfs`BmvEfqs#umWt_Nwn^pen#qiBr`xAcvMr{Zidg#dtjDkbM%7Cd_#"/>
I was able to solve the problem; this is the code:
import requests
url = ("https://maps.googleapis.com/maps/api/staticmap?size=400x400&path=weight:3%7Ccolor:orange%7Cenc:polyline_data")
r = requests.get(url)
image = r.content  # raw bytes of the returned image
with open("map.png", "wb") as file:  # this creates a usable .png file
    file.write(image)
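One note on that solution: raw image bytes come from `response.content` (the public attribute; `_content` is internal to requests). Since a static map comes back as a PNG, checking the 8-byte PNG signature is a cheap sanity check before writing the file. A sketch using stand-in bytes instead of a live request:

```python
# Stand-in for response.content from a successful staticmap request.
# Real PNG files always begin with this 8-byte signature.
image_bytes = b'\x89PNG\r\n\x1a\n' + b'\x00' * 16

is_png = image_bytes.startswith(b'\x89PNG\r\n\x1a\n')
print(is_png)

if is_png:
    with open('map.png', 'wb') as f:
        f.write(image_bytes)
```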
I'm very new to Python and I'm trying to create a tool that automates downloading images off Google.
So far, I have the following code:
import urllib
def google_image(x):
search = x.split()
search = '%20'.join(map(str, search))
url = 'http://ajax.googleapis.com/ajax/services/search/images?v=1.0&q=%s&safe=off' % search
But I'm not sure where to continue or if I'm even on the right track. Can someone please help?
See the Scrapy documentation for the images pipeline:
ITEM_PIPELINES = {'scrapy.contrib.pipeline.images.ImagesPipeline': 1}
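Note that the `scrapy.contrib` path is from older Scrapy versions; in current releases the pipeline lives at `scrapy.pipelines.images.ImagesPipeline`, and it also needs a storage location to be set. A minimal settings fragment (the directory name is illustrative):

```python
# settings.py (Scrapy project configuration)
ITEM_PIPELINES = {'scrapy.pipelines.images.ImagesPipeline': 1}
IMAGES_STORE = 'downloaded_images'  # illustrative: directory where images are saved
```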