This is my code to get the image URLs from a web page.
For some web pages it works very well, but for others it doesn't work.
This is my code:
#!/usr/bin/python
import urllib2
import re
#bufOne = urllib2.urlopen(r"http://vgirl.weibo.com/5show/user.php?fid=17262", timeout=4).read()
bufTwo = urllib2.urlopen(r"http://541626.com/pages/38307", timeout=4).read()
#jpgRule = re.findall(r'http://[\w/]*?jpg', bufOne, re.IGNORECASE)
jpgRule = re.findall(r'http://[\w/]*?jpg', bufTwo, re.IGNORECASE)
print jpgRule
bufOne works well, but bufTwo doesn't. How should I write the rule so that bufTwo works as well?
Don't use regex to parse HTML. Rather use Beautiful Soup to find all img tags and then get the src attributes.
from BeautifulSoup import BeautifulSoup
#...
soup = BeautifulSoup(bufTwo)
imgTags = soup.findAll('img')
img = [tag['src'] for tag in imgTags]
Building on ddk's answer, here is an easier way of getting all the images.
Use Beautiful Soup like this:
from BeautifulSoup import BeautifulSoup
all_imgs = soup.findAll("img", { "src" : re.compile(r'http://[\w/]*?jpg') })
That will already give you a list with all the images you want.
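For readers who want to try the idea without installing Beautiful Soup, reading the src of each img tag (instead of pattern-matching the raw page) can be sketched with the standard library's html.parser. This is a swapped-in technique for Python 3, not the code from the answers above; the class name ImgSrcCollector and the sample HTML are made up for illustration.

```python
from html.parser import HTMLParser


class ImgSrcCollector(HTMLParser):
    """Collect the src attribute of every <img> tag (hypothetical helper)."""

    def __init__(self):
        super().__init__()
        self.sources = []

    def handle_starttag(self, tag, attrs):
        # attrs is a list of (name, value) pairs; tag names arrive lower-cased
        if tag == "img":
            src = dict(attrs).get("src")
            if src:
                self.sources.append(src)


# Made-up page fragment; note the dots and hyphens that [\w/] cannot match
sample_html = """
<div>
  <img src="http://img.example.com/photos/a-1.jpg">
  <img src="http://img.example.com/photos/b_2.JPG" alt="b">
  <img src="http://img.example.com/icons/logo.png">
  <img alt="no source">
</div>
"""

collector = ImgSrcCollector()
collector.feed(sample_html)

# Filter by extension after parsing, ignoring case
jpg_urls = [s for s in collector.sources if s.lower().endswith(".jpg")]
print(jpg_urls)
```

Filtering on the parsed attribute value means the character class in the original pattern no longer decides which URLs are found.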
Related
I want to download videos from a website.
Here is my code.
Every time I run this code, it returns a blank file.
Here is live code: https://colab.research.google.com/drive/19NDLYHI2n9rG6KeBCiv9vKXdwb5JL9Nb?usp=sharing
from bs4 import BeautifulSoup
import requests
url = requests.get("https://www.mxtakatak.com/xt0.3a7ed6f84ded3c0f678638602b48bb1b840bea7edb3700d62cebcf7a400d4279/video/20000kCCF0")
page = url.content
soup = BeautifulSoup(page, "html.parser")
#print(soup.prettify())
result = soup.find_all('video', class_="video-player")
print(result)
Using regex:
import requests
import re
response = requests.get("....../video/20000kCCF0")
videoId = '20000kCCF0'
videos = re.findall(r'https://[^"]+' + videoId + '[^"]+mp4', response.text)
print(videos)
You always get a blank return because soup.find_all() doesn't find anything.
Maybe you should check the url.content you receive by hand and then decide what to look for with find_all()
EDIT: After digging a bit around I found out how to get the content_url_orig:
from bs4 import BeautifulSoup
import requests
import json
url = requests.get("https://www.mxtakatak.com/xt0.3a7ed6f84ded3c0f678638602b48bb1b840bea7edb3700d62cebcf7a400d4279/video/20000kCCF0")
page = url.content
soup = BeautifulSoup(page, "html.parser")
result = str(soup.find_all('script')[1]) #looking for script tag inside the html-file
result = result.split('window._state = ')[1].split("</script>']")[0].split('\n')[0]
#separate the JSON from the rest of the script string (found by inspecting the page source)
result = json.loads(result)
#navigating in the json to get the video-url
entity = list(result['entities'].items())[0][1]
download_url = entity['content_url_orig']
print(download_url)
Funny sidenote: If I read the JSON correctly you can find all videos with download-URLs the creator uploaded :)
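The chain of split() calls can also be written as one regular expression followed by json.loads. The sketch below runs on a made-up page source so that it works offline; only the names window._state, entities and content_url_orig come from the answer, and the real page layout may differ.

```python
import json
import re

# Made-up page source; only the names window._state, entities and
# content_url_orig come from the answer above, the values are invented.
sample_html = """
<html><body>
<script>window.other = 1;</script>
<script>window._state = {"entities": {"20000kCCF0": {"content_url_orig": "https://media.example.com/v/20000kCCF0.mp4"}}}
window.done = true;</script>
</body></html>
"""

# Capture the JSON object that follows "window._state = " up to the end of that line
match = re.search(r"window\._state = (\{.*\})\s*$", sample_html, re.MULTILINE)
if match is None:
    raise ValueError("state object not found; the page layout may have changed")
state = json.loads(match.group(1))

# Take the first entity, as the answer does, and read its download URL
first_entity = next(iter(state["entities"].values()))
download_url = first_entity["content_url_orig"]
print(download_url)
```

Raising an explicit error when the pattern is missing makes a layout change easier to notice than an IndexError from split().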
I am scraping this link : https://www.americanexpress.com/in/credit-cards/all-cards/?sourcecode=A0000FCRAA&cpid=100370494&dsparms=dc_pcrid_408453063287_kword_american%20express%20credit%20card_match_e&gclid=Cj0KCQiApY6BBhCsARIsAOI_GjaRsrXTdkvQeJWvKzFy_9BhDeBe2L2N668733FSHTHm96wrPGxkv7YaAl6qEALw_wcB&gclsrc=aw.ds
to get the image URLs:
from urllib.request import urlopen
from bs4 import BeautifulSoup
import json
AMEXurl = ['https://www.americanexpress.com/in/credit-cards/all-cards/?sourcecode=A0000FCRAA&cpid=100370494&dsparms=dc_pcrid_408453063287_kword_american%20express%20credit%20card_match_e&gclid=Cj0KCQiApY6BBhCsARIsAOI_GjaRsrXTdkvQeJWvKzFy_9BhDeBe2L2N668733FSHTHm96wrPGxkv7YaAl6qEALw_wcB&gclsrc=aw.ds']
identity = ['filmstrip_container']
html_1 = urlopen(AMEXurl[0])
soup_1 = BeautifulSoup(html_1,'lxml')
address = soup_1.find('div',attrs={"class" : identity[0]})
for x in address.find_all('div', class_ = 'filmstrip-imgContainer'):
    print(x.find('div').get('img'))
But I am getting the following output:
None
None
None
None
None
None
None
The following is an image of the HTML code from which I am trying to get the image URLs:
This is the section of the page from which I'd like to get the URLs.
I'd like to know what changes need to be made to the code so that I get all the image URLs.
They are dynamically loaded from a script tag. You can easily regex them from the .text of the response. The regex below specifically matches the 7 images you say you want to retrieve and show in the picture.
import requests, re
r = requests.get('https://www.americanexpress.com/in/credit-cards/all-cards/?sourcecode=A0000FCRAA&cpid=100370494&dsparms=dc_pcrid_408453063287_kword_american%20express%20credit%20card_match_e&gclid=Cj0KCQiApY6BBhCsARIsAOI_GjaRsrXTdkvQeJWvKzFy_9BhDeBe2L2N668733FSHTHm96wrPGxkv7YaAl6qEALw_wcB&gclsrc=aw.ds').text
p = re.compile(r'imgurl":"(.*?)"')
links = p.findall(r)
print(links)
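Regex explanation: the pattern matches the literal text imgurl":" and then lazily captures everything up to the next double quote.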
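A small offline demonstration of that pattern, run against a made-up fragment rather than the real page (the JSON-like text and the example.com URLs are invented):

```python
import re

# Made-up fragment standing in for the script text embedded in the page
sample_text = (
    '{"cards":[{"name":"A","imgurl":"https://img.example.com/a.png"},'
    '{"name":"B","imgurl":"https://img.example.com/b.png","rank":2}]}'
)

# Literal imgurl":" followed by a lazy capture up to the next double quote
pattern = re.compile(r'imgurl":"(.*?)"')
links = pattern.findall(sample_text)
print(links)
```

The lazy quantifier matters here: a greedy (.*) would run on to the last double quote in the text instead of stopping at the end of each URL.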
If you decide to go with the more expensive Selenium, you could match with:
links = [i.get_attribute('src') for i in driver.find_elements('css selector', '.filmstrip-imgContainer img')]
Try this
import urllib
from urllib.request import urlopen
from bs4 import BeautifulSoup
import json
import requests
import re
AMEXurl = ['https://www.americanexpress.com/in/credit-cards/all-cards/?sourcecode=A0000FCRAA&cpid=100370494&dsparms=dc_pcrid_408453063287_kword_american%20express%20credit%20card_match_e&gclid=Cj0KCQiApY6BBhCsARIsAOI_GjaRsrXTdkvQeJWvKzFy_9BhDeBe2L2N668733FSHTHm96wrPGxkv7YaAl6qEALw_wcB&gclsrc=aw.ds']
identity = ['filmstrip_container']
r = requests.get(AMEXurl[0])
html_1 = urlopen(AMEXurl[0])
soup_1 = BeautifulSoup(r.content,'lxml')
Extracting all images:
images = soup_1.find_all('img', src=True)
for img in images:
    print(img['src'])
The image tag that displays the Platinum Card PNG file:
platinum_card_image = soup_1.find('img', src=re.compile(r'Platinum_Card\.png$'))
print(platinum_card_image.get('src'))
All image tags that display SVG files:
svg_images = soup_1.find_all('img', src=re.compile(r'\.svg$'))
for img in svg_images:
    print(img.get('src'))
Edit
images_7 = soup_1.select('script')[8].string.split('__REDUX_STATE__ = ')
data = images_7[1]
for d in json.loads(data)["modelData"]['componentFeaturedCards']['cards']:
    print(d['imgurl'])
I'm creating a python program that collects images from this website by Google
The images on the website change after a certain number of seconds, and the image url also changes with time. This change is handled by a script on the website. I have no idea how to get the image links from it.
I tried using BeautifulSoup and the requests library to get the image links from the site's html code:
import requests
from bs4 import BeautifulSoup
url = 'https://clients3.google.com/cast/chromecast/home'
html = requests.get(url).text
soup = BeautifulSoup(html, 'html.parser')
tags = soup('img')
for tag in tags:
    print(tag)
But the code returns '{{backgroundUrl}}' in the image src ("ng-src").
For example:
<img class="S9aygc-AHe6Kc" id="picture-background" image-error-handler="" image-index="0" ng-if="backgroundUrl" ng-src="{{backgroundUrl}}"/>
How can I get the image links from a dynamically changing site? Can BeautifulSoup handle this? If not what library will do the job?
import requests
import re
def main(url):
    r = requests.get(url)
    match = re.search(r"(lh4\.googl.+?mv)", r.text).group(1)
    match = match.replace("\\", "").replace("u003d", "=")
    print(match)

main("https://clients3.google.com/cast/chromecast/home")
Just a minor addition to the answer by αԋɱҽԃ αмєяιcαη (ahmed american) in case anyone is wondering
The subdomain (lhx) in lhx.google.com is also dynamic. As a result, the link can be lh3 or lh4 et cetera.
This code fixes the problem:
import requests
import re
r = requests.get("https://clients3.google.com/cast/chromecast/home").text
match = re.search(r"(lh.\.googl.+?mv)", r).group(1)
match = match.replace('\\', '').replace("u003d", "=")
print(match)
The major difference is that the lh4 in the code by ahmed american has been replaced with "lh." so that all images can be collected no matter the url.
EDIT: This line does not work:
match = match.replace('\\', '').replace("u003d", "=")
Replace with:
match = match.replace("\\", "")
match = match.replace("u003d", "=")
None of the provided answers worked for me. Issues may be related to using an older version of python and/or the source page changing some things around.
Also, this will return all matches instead of only the first match.
Tested in Python 3.9.6.
import requests
import re
url = 'https://clients3.google.com/cast/chromecast/home'
r = requests.get(url)
for match in re.finditer(r"(ccp-lh\..+?mv)", r.text, re.S):
    image_link = 'https://%s' % (match.group(1).replace("\\", "").replace("u003d", "="))
    print(image_link)
I am new to web scraping and this is one of my first web scraping projects. I can't find the right selector for my soup.select("").
I want to get the "data-phone" value (see the picture below), but it sits inside a div with a class and then inside an <a href> tag, which makes it a little complicated for me.
I searched online and found that I have to use soup.find_all, but that was not very helpful. Can anyone help me or give me a quick tip? Thank you!
My code:
import webbrowser, requests, bs4, os
url = "https://www.pagesjaunes.ca/search/si/1/electricien/Montreal+QC"
res = requests.get(url)
res.raise_for_status()
soup = bs4.BeautifulSoup(res.text)
result = soup.find('a', {'class', 'mlr__item__cta jsMlrMenu'})
Phone = result['data-phone']
print(Phone)
I think one of the simplest ways is to use soup.select, which accepts normal CSS selectors.
https://www.crummy.com/software/BeautifulSoup/bs4/doc/#css-selectors
soup.select('a.mlr__item_cta.jsMlrMenu')
This should return the entire list of anchors from which you can pick the data attribute.
Note I just tried it in the terminal:
from bs4 import BeautifulSoup
import requests
url = 'https://en.wikipedia.org/wiki/Web_scraping'
r = requests.get(url)
soup = BeautifulSoup(r.text)
result = soup.select('a.mw-jump-link') # or any other selector
print(result)
print(result[0].get("href"))
You will have to loop over the result of soup.select and just collect the data-phone value from the attribute.
UPDATE
Ok I have searched in the DOM myself, and here is how I managed to retrieve all the phone data:
anchores = soup.select('a[data-phone]')
for a in anchores:
    print(a.get('data-phone'))
It also works with only the attribute selector, like this: soup.select('[data-phone]')
Surprisingly, this version with classes also works for me:
for a in soup.select('a.mlr__item__cta.jsMlrMenu'):
    print(a.get('data-phone'))
There is no surprise, we just had a typo in our first selector...
Find the difference :)
GOOD: a.mlr__item__cta.jsMlrMenu
BAD : a.mlr__item_cta.jsMlrMenu
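The effect of that one missing underscore can be reproduced offline. The sketch below uses the standard library's html.parser rather than soup.select (a swapped-in technique for illustration), and the PhoneCollector class, the collect helper and the sample markup are all made up; only the class names come from the question.

```python
from html.parser import HTMLParser


class PhoneCollector(HTMLParser):
    """Collect data-phone from <a> tags carrying all required classes."""

    def __init__(self, required_classes):
        super().__init__()
        self.required = set(required_classes)
        self.phones = []

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        attributes = dict(attrs)
        # class is a whitespace-separated list of tokens, so compare as sets
        classes = set((attributes.get("class") or "").split())
        if self.required <= classes and attributes.get("data-phone"):
            self.phones.append(attributes["data-phone"])


# Made-up markup using the class names from the question
sample_html = """
<a class="mlr__item__cta jsMlrMenu" data-phone="514-555-0101" href="#">Phone</a>
<a class="mlr__item__cta" href="/website">Website</a>
<a class="jsMlrMenu mlr__item__cta extra" data-phone="514-555-0102">Phone</a>
"""


def collect(required_classes):
    parser = PhoneCollector(required_classes)
    parser.feed(sample_html)
    return parser.phones


good = collect(["mlr__item__cta", "jsMlrMenu"])  # double underscore before "cta"
bad = collect(["mlr__item_cta", "jsMlrMenu"])    # single underscore: no match
print(good)
print(bad)
```

An empty result from a class-based selector is therefore worth checking character by character against the page source before suspecting the library.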
Does anyone know how to use BeautifulSoup in Python?
I have a search engine with a list of different URLs.
I want to get only the HTML tag that contains a video embed URL, and then get the link.
Example:
import BeautifulSoup
html = '''https://archive.org/details/20070519_detroit2'''
#or this.. html = '''http://www.kumby.com/avatar-the-last-airbender-book-3-chapter-5/'''
#or this... html = '''https://www.youtube.com/watch?v=fI3zBtE_S_k'''
soup = BeautifulSoup.BeautifulSoup(html)
What should I do next to get the HTML tag of the video, the object, or the exact link of the video?
I need to put it in my iframe. I will integrate the Python into my PHP: Python gets the video link and outputs it, and then I echo it into my iframe.
You need to get the HTML of the page, not just the URL.
Use the built-in urllib library like this:
import urllib
from bs4 import BeautifulSoup as BS
url = '''https://archive.org/details/20070519_detroit2'''
#open and read page
page = urllib.urlopen(url)
html = page.read()
#create BeautifulSoup parse-able "soup"
soup = BS(html)
#get the src attribute from the video tag
video = soup.find("video").get("src")
Also, with the site you are using, I noticed that to get the embed link you just change details in the link to embed, so it looks like this:
https://archive.org/embed/20070519_detroit2
So if you want to do it for multiple URLs without having to parse, just do something like this:
url = '''https://archive.org/details/20070519_detroit2'''
spl = url.split('/')
spl[3] = 'embed'
embed = "/".join(spl)
print embed
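A Python 3 variant of the same substitution, sketched with urllib.parse so that the path segment is addressed by name rather than by its position in the split URL. The function name to_embed_url is made up, and the assumption that every item page follows the /details/<id> layout comes only from the example above.

```python
from urllib.parse import urlsplit, urlunsplit


def to_embed_url(details_url):
    """Swap the first path segment 'details' for 'embed' (hypothetical helper)."""
    parts = urlsplit(details_url)
    segments = parts.path.split("/")
    # segments[0] is the empty string before the leading slash
    if len(segments) < 2 or segments[1] != "details":
        raise ValueError("not a details URL: %s" % details_url)
    segments[1] = "embed"
    return urlunsplit(parts._replace(path="/".join(segments)))


embed = to_embed_url("https://archive.org/details/20070519_detroit2")
print(embed)
```

Checking the segment before replacing it avoids silently rewriting URLs that do not have the expected shape.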
EDIT
To get the embed link for the other links you provided in your edit, look through the HTML of the page you are parsing until you find the link, then get the tag it is in, and then the attribute.
For
'''http://www.kumby.com/avatar-the-last-airbender-book-3-chapter-5/'''
just do
soup.find("iframe").get("src")
The iframe because the link is in the iframe tag, and the .get("src") because the link is the src attribute.
You can try the next one because you should learn how to do it if you want to be able to do it in the future :)
Good luck!
You can't parse a URL. BeautifulSoup is used to parse an html page. Retrieve the page first:
import urllib2
data = urllib2.urlopen("https://archive.org/details/20070519_detroit2")
html = data.read()
Then you can use find, and then take the src attribute:
soup = BeautifulSoup(html)
video = soup.find('video')
src = video['src']
This is a one-liner to get all the downloadable MP4 files on that page, in case you need it.
import bs4, urllib2
url = 'https://archive.org/details/20070519_detroit2'
soup = bs4.BeautifulSoup(urllib2.urlopen(url))
links = [a['href'] for a in soup.find_all(lambda tag: tag.name == "a" and '.mp4' in tag.get('href', ''))]
print links
Here is the output:
['/download/20070519_detroit2/20070519_detroit_jungleearth.mp4',
'/download/20070519_detroit2/20070519_detroit_sweetkissofdeath.mp4',
'/download/20070519_detroit2/20070519_detroit_goodman.mp4',
...
'/download/20070519_detroit2/20070519_detroit_wilson_512kb.mp4']
These are relative links; you can join them with the domain to get absolute URLs.
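One way to do the joining in Python 3 is urllib.parse.urljoin, sketched here with two of the paths from the output above (the variable names are made up):

```python
from urllib.parse import urljoin

page_url = "https://archive.org/details/20070519_detroit2"
relative_links = [
    "/download/20070519_detroit2/20070519_detroit_jungleearth.mp4",
    "/download/20070519_detroit2/20070519_detroit_goodman.mp4",
]

# A link starting with "/" replaces the whole path of the page URL
absolute_links = [urljoin(page_url, link) for link in relative_links]
print(absolute_links)
```

urljoin also leaves links that are already absolute untouched, so the same line works for pages that mix both kinds.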