Wix doesn't work with BeautifulSoup - python

Why doesn't BeautifulSoup manage to download information from Wix? I'm trying to use BeautifulSoup to download images from my website. The same code works fine on other sites (example of the code actually working), but on Wix it does not...
Is there anything I can change in my site's settings in order for it to work?
EDIT: CODE
from bs4 import BeautifulSoup
import urllib2
import shutil
import requests
from urlparse import urljoin
import time

def make_soup(url):
    req = urllib2.Request(url, headers={'User-Agent': "Magic Browser"})
    html = urllib2.urlopen(req)
    return BeautifulSoup(html, 'html.parser')

def get_images(url):
    soup = make_soup(url)
    images = [img for img in soup.findAll('img')]
    print (str(len(images)) + " images found.")
    print 'Downloading images to current working directory.'
    image_links = [each.get('src') for each in images]
    for each in image_links:
        try:
            filename = each.strip().split('/')[-1].strip()
            src = urljoin(url, each)
            print 'Getting: ' + filename
            response = requests.get(src, stream=True)
            # delay to avoid corrupted previews
            time.sleep(1)
            with open(filename, 'wb') as out_file:
                shutil.copyfileobj(response.raw, out_file)
        except:
            print '  An error occurred. Continuing.'
    print 'Done.'

def main():
    url = HIDDEN ADDRESS
    get_images(url)

if __name__ == '__main__':
    main()

BeautifulSoup can only parse HTML. Wix sites are generated by JavaScript that runs when you load the page. When you request the page's HTML via urllib, you don't get the rendered HTML, you just get the base HTML with the scripts that build the rendered page. To get around this, you'd need something like Selenium or a headless Chrome browser to render the site via its JavaScript, then take the rendered HTML and feed it to BeautifulSoup.
Here's an example of the body of a wix site, which you can see has no content other than a single div that gets populated via javascript.
...
<body>
<div id="SITE_CONTAINER"></div>
</body>
...
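Here's a minimal sketch of that approach, assuming Chrome and Selenium are installed; the URL is a placeholder for your own Wix site:

from bs4 import BeautifulSoup
from selenium import webdriver
import time

options = webdriver.ChromeOptions()
options.add_argument("--headless")            # render without opening a browser window
driver = webdriver.Chrome(options=options)

driver.get("https://your-wix-site.example")   # placeholder URL
time.sleep(3)                                 # give the scripts a moment to populate the page
html = driver.page_source                     # the JavaScript-rendered HTML
driver.quit()

soup = BeautifulSoup(html, "html.parser")
print(len(soup.find_all("img")), "images found after rendering.")

Once you have the rendered soup, the rest of the original download loop should work unchanged.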

For anyone out there trying to download images from a Wix website, I managed to figure out a simple workaround.
Add an HTML Code frame to your page and, in its code, link the img srcs of the pictures on your site. When you run BeautifulSoup against the HTML frame's URL, all of the images linked in that code will be downloaded!

Related

How do I filter tags with class in Python and BeautifulSoup?

I'm trying to scrape images from a site using beautifulsoup HTML parser.
There are 2 kinds of image tags for each image on the site. One is for the thumbnail and the other is the bigger size image that only appears after I click on the thumbnail and expand. The bigger size tag contains a class="expanded-image" attribute.
I'm trying to parse through the HTML and get the "src" attribute of the expanded image which contains the source for the image.
When I try to execute my code, nothing happens; it just says the process finished without scraping any images. But when I don't filter and just pass the tag as an argument, it downloads all the thumbnails.
Here's my code:
import webbrowser, requests, os
from bs4 import BeautifulSoup

def getdata(url):
    r = requests.get(url)
    return r.text

htmldata = getdata('https://boards.4chan.org/a/thread/30814')
soup = BeautifulSoup(htmldata, 'html.parser')

list = []
for i in soup.find_all("img", {"class": "expanded-thumb"}):
    list.append(i['src'].replace("//", "https://"))

def download(url, pathname):
    if not os.path.isdir(pathname):
        os.makedirs(pathname)
    filename = os.path.join(pathname, url.split("/")[-1])
    response = requests.get(url, stream=True)
    with open(filename, "wb") as f:
        f.write(response.content)

for a in list:
    download(a, "file")
You might be running into a problem using "list" as a variable name; it's a built-in type in Python. Start with this (replacing TEST_4CHAN_URL with whatever thread you want), incorporating my suggestion from the comment above.
import requests
from bs4 import BeautifulSoup

TEST_4CHAN_URL = "https://boards.4chan.org/a/thread/<INSERT_THREAD_ID_HERE>"

def getdata(url):
    r = requests.get(url)
    return r.text

htmldata = getdata(TEST_4CHAN_URL)
soup = BeautifulSoup(htmldata, "html.parser")

src_list = []
for i in soup.find_all("a", {"class": "fileThumb"}):
    src_list.append(i['href'].replace("//", "https://"))

print(src_list)
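To actually save the files once you have src_list, a minimal follow-up sketch (continuing from the snippet above, and assuming each URL points directly at an image file; the output folder name is a placeholder):

import os

out_dir = "file"                                  # placeholder output folder
os.makedirs(out_dir, exist_ok=True)

for src in src_list:
    filename = os.path.join(out_dir, src.split("/")[-1])
    resp = requests.get(src)
    with open(filename, "wb") as f:
        f.write(resp.content)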

Web scraping for downloading images from NHTSA website (CIREN crash cases)

I am trying to download some images from NHTSA Crash Viewer (CIREN cases). An example of the case https://crashviewer.nhtsa.dot.gov/nass-CIREN/CaseForm.aspx?xsl=main.xsl&CaseID=99817
If I try to download a Front crash image, no file is downloaded. I am using the beautifulsoup4 and requests libraries. This code works for other websites.
The link of images are in the following format: https://crashviewer.nhtsa.dot.gov/nass-CIREN/GetBinary.aspx?Image&ImageID=555004572&CaseID=555003071&Version=0
I have also tried the previous answers from SO, but none of the solutions work. Error obtained:
No response from server
Code used for web scraping
from bs4 import *
import requests as rq
import os

r2 = rq.get("https://crashviewer.nhtsa.dot.gov/nass-CIREN/GetBinary.aspx?Image&ImageID=555004572&CaseID=555003071&Version=0")
soup2 = BeautifulSoup(r2.text, "html.parser")

links = []
x = soup2.select('img[src^="https://crashviewer.nhtsa.dot.gov"]')
for img in x:
    links.append(img['src'])

os.mkdir('ciren_photos')

i = 1
for index, img_link in enumerate(links):
    if i <= 200:
        img_data = rq.get(img_link).content
        with open("ciren_photos\\" + str(index + 1) + '.jpg', 'wb+') as f:
            f.write(img_data)
        i += 1
    else:
        f.close()
        break
This is a task that would require Selenium, but luckily there is a shortcut. On the top of the page there is a "Text and Images Only" link that goes to a page like this one: https://crashviewer.nhtsa.dot.gov/nass-CIREN/CaseForm.aspx?ViewText&CaseID=99817&xsl=textonly.xsl&websrc=true that contains all the images and text content in one page. You can select that link with soup.find('a', text='Text and Images Only').
That link and the image links are relative (links to the same site are usually relative links), so you'll have to use urljoin() to get the full urls.
from bs4 import BeautifulSoup
import requests as rq
from urllib.parse import urljoin

url = 'https://crashviewer.nhtsa.dot.gov/nass-CIREN/CaseForm.aspx?xsl=main.xsl&CaseID=99817'

with rq.session() as s:
    r = s.get(url)
    soup = BeautifulSoup(r.text, "html.parser")

    url = urljoin(url, soup.find('a', text='Text and Images Only')['href'])
    r = s.get(url)
    soup = BeautifulSoup(r.text, "html.parser")

    links = [urljoin(url, i['src']) for i in soup.select('img[src^="GetBinary.aspx"]')]
    for link in links:
        content = s.get(link).content
        # write `content` to file
So, the site doesn't return valid pictures unless the request has valid cookies. There are two ways to get the cookies: either reuse cookies from a previous request or use a Session object. It's best to use a Session because it also handles the TCP connection and other parameters.
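To fill in the final "write content to file" step above, here is a hedged sketch that runs inside the same session block; naming each file after its ImageID query parameter is my own choice, not part of the original answer:

import os
from urllib.parse import urlparse, parse_qs

os.makedirs('ciren_photos', exist_ok=True)
for index, link in enumerate(links):
    content = s.get(link).content
    # name the file after the ImageID query parameter, falling back to the loop index
    image_id = parse_qs(urlparse(link).query).get('ImageID', [str(index)])[0]
    with open(os.path.join('ciren_photos', image_id + '.jpg'), 'wb') as f:
        f.write(content)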

Links from BeautifulSoup without href or <a>

I am trying to create a bot that scrapes all the image links from a site and stores them somewhere else, so I can download the images afterwards.
from selenium import webdriver
import time
from bs4 import BeautifulSoup as bs
import requests

url = 'https://www.artstation.com/artwork?sorting=trending'
page = requests.get(url)

driver = webdriver.Chrome()
driver.get(url)
time.sleep(3)

soup = bs(driver.page_source, 'html.parser')
gallery = soup.find_all(class_="image-src")
data = gallery[0]

for x in range(len(gallery)):
    print("TAG:", sep="\n")
    print(gallery[x], sep="\n")

if page.status_code == 200:
    print("Request OK")
This returns all the link tags I wanted, but I can't find a way to strip the HTML and copy only the links into a new list. Here is an example of the tag I get:
<div class="image-src" image-src="https://cdnb.artstation.com/p/assets/images/images/012/269/255/20180810092820/smaller_square/vince-rizzi-batman-n52-p1-a.jpg?1533911301" ng-if="::!project.hide_as_adult"></div>
So, how do I get only the links within the gallery[] list?
What I want to do after that is take these links and change the /smaller_square/ part of the path to /large/, which is the one that has the high-resolution image.
The page loads its data through AJAX, so through the network inspector we can see where the call is made. This snippet will obtain all the image links found on page 1, sorted by trending:
import requests
import json

url = 'https://www.artstation.com/projects.json?page=1&sorting=trending'
page = requests.get(url)
json_data = json.loads(page.text)

for data in json_data['data']:
    print(data['cover']['medium_image_url'])
Prints:
https://cdna.artstation.com/p/assets/images/images/012/272/796/medium/ben-zhang-brigitte-hero-concept.jpg?1533921480
https://cdna.artstation.com/p/assets/covers/images/012/279/572/medium/ham-sung-choul-braveking-140823-1-3-s3-mini.jpg?1533959982
https://cdnb.artstation.com/p/assets/covers/images/012/275/963/medium/michael-vicente-orb-gem-thumb.jpg?1533933774
https://cdnb.artstation.com/p/assets/images/images/012/275/635/medium/michael-kutsche-piglet-by-michael-kutsche.jpg?1533932387
https://cdna.artstation.com/p/assets/images/images/012/273/384/medium/ben-zhang-unnamed.jpg?1533923353
https://cdnb.artstation.com/p/assets/covers/images/012/273/083/medium/michael-vicente-orb-guardian-thumb.jpg?1533922229
... and so on.
If you print the variable json_data, you will see other information the page sends (like the icon image URL, total_count, data about the author, etc.).
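If you also want the high-resolution versions mentioned in the question, a hedged follow-up is to rewrite the path segment of each URL; this assumes ArtStation serves a /large/ rendition at the same path, as the question describes:

large_links = [data['cover']['medium_image_url'].replace('/medium/', '/large/')
               for data in json_data['data']]
print(large_links[:3])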
You can access the attributes using key-value.
Ex:
from bs4 import BeautifulSoup
s = '''<div class="image-src" image-src="https://cdnb.artstation.com/p/assets/images/images/012/269/255/20180810092820/smaller_square/vince-rizzi-batman-n52-p1-a.jpg?1533911301" ng-if="::!project.hide_as_adult"></div>'''
soup = BeautifulSoup(s, "html.parser")
print(soup.find("div", class_="image-src")["image-src"])
#or
print(soup.find("div", class_="image-src").attrs['image-src'])
Output:
https://cdnb.artstation.com/p/assets/images/images/012/269/255/20180810092820/smaller_square/vince-rizzi-batman-n52-p1-a.jpg?1533911301
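Applied to the gallery list from the question, the same idea collects just the URLs; the /smaller_square/ to /large/ swap assumes that rendition exists, as the question states:

links = [div["image-src"] for div in gallery]
large_links = [link.replace("/smaller_square/", "/large/") for link in links]
print(large_links)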

Downloading pdfs with python?

I'm writing a script that uses regex to find PDF links on a page and then download those links. The script runs and names the files properly in my personal directory, however it is not downloading the full PDF file. The PDFs being pulled are only 19 KB, a corrupted PDF, when they should be approximately 15 MB.
import urllib, urllib2, re

url = 'http://www.website.com/Products'
destination = 'C:/Users/working/'

website = urllib2.urlopen(url)
html = website.read()
links = re.findall('.PDF">.*_geo.PDF', html)

for item in links:
    DL = item[6:]
    DL_PATH = url + '/' + DL
    SV_PATH = destination + DL
    urllib.urlretrieve(DL_PATH, SV_PATH)
The url variable links to a page with links to all the PDFs. When you click on a PDF link it takes you to 'www.website.com/Products/NorthCarolina.pdf', which displays the PDF in the browser. I'm not sure if, because of this, I should be using a different Python method or module.
You could try something like this:
import requests

links = ['link.pdf']

for link in links:
    book_name = link.split('/')[-1]
    with open(book_name, 'wb') as book:
        a = requests.get(link, stream=True)
        for block in a.iter_content(512):
            if not block:
                break
            book.write(block)
You can also use your HTML knowledge (for parsing) and the BeautifulSoup library to find all the PDF files on a webpage and then download them all together.
from urllib.request import urlopen
from bs4 import BeautifulSoup as bs
html = urlopen(my_url).read()
html_page = bs(html, features="lxml")
After parsing, you can search for <a> tags since all hyperlinks have these tags. Once you have all the <a> tags, you can further narrow them down by checking if they end with the pdf extension or not. Here's a full explanation for it: https://medium.com/the-innovation/notesdownloader-use-web-scraping-to-download-all-pdfs-with-python-511ea9f55e48
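As a hedged sketch of that approach (the page URL and output naming are placeholders, and the links are assumed to be ordinary hrefs ending in .pdf):

from urllib.request import urlopen, urlretrieve
from urllib.parse import urljoin
from bs4 import BeautifulSoup as bs

my_url = "http://www.website.com/Products"    # placeholder URL
html_page = bs(urlopen(my_url).read(), features="lxml")

for a in html_page.find_all("a", href=True):
    href = a["href"]
    if href.lower().endswith(".pdf"):
        full_url = urljoin(my_url, href)       # resolve relative links against the page URL
        urlretrieve(full_url, full_url.split("/")[-1])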

Parsing txt file, to web scrape an image from each link on each line, with python

I'm trying to open a txt file with an http link on each line, and then have Python go to each link, find a specific image, and print out a direct link to that image, for EACH page listed in the txt file.
But, I have no idea what I'm doing. (started python a few days ago)
Here's my current code, which does not work...
from urllib2 import urlopen
import re
from bs4 import BeautifulSoup
txt = open('links.txt').read().splitlines()
page = urlopen(txt)
html = page.read()
image_links = re.findall("src.\"(\S*?media.tumblr\S*?tumblr_\S*?jpg)", html)
print image_links
Update 1:
Ok, here's what I need a little more specifically. I have a script which prints out a lot of links into a txt file, each link on its own line, i.e.
http://link.com/1
http://link.com/2
etc
etc
What I'm trying to accomplish at the moment is something that opens that text file containing those links, runs the regex I already posted against each one (link.com/1, etc.), and then prints the image links it finds into another text file, which should look something like
http://link.com/1/image.jpg
http://link.com/2/image.jpg
etc.
Then after that, I don't need any help, as I already have a python script which will download the images, from that txt file.
Update 2: Basically, what I need is this script.
from urllib2 import urlopen
import re
from bs4 import BeautifulSoup
url = 'http://staff.tumblr.com'
page = urlopen(url)
html = page.read()
image_links = re.findall("src.\"(\S*?media.tumblr\S*?tumblr_\S*?jpg)", html)
print image_links
but instead of looking for a specific URL in the url variable, it will crawl all the URLs in a text file I specify, then print out the results.
I would suggest you use a Scrapy spider.
Here is an example
from scrapy import log
from scrapy.item import Item
from scrapy.http import Request
from scrapy.contrib.spiders import XMLFeedSpider

def NextURL():
    # yield the URLs from the file one at a time
    with open("URLFilename") as f:
        for line in f:
            yield line.strip()

class YourScrapingSpider(XMLFeedSpider):

    name = "imagespider"
    allowed_domains = []
    url = NextURL()
    start_urls = []

    def start_requests(self):
        start_url = self.url.next()
        request = Request(start_url, dont_filter=True)
        yield request

    def parse(self, response, node):
        scraped_item = Item()
        yield scraped_item

        next_url = self.url.next()
        yield Request(next_url)
I am creating a spider which will read the URLs from the file, make the requests and download the images.
For this we have to use ImagesPipeline.
It will be difficult in the starting stage, but I would suggest you learn about Scrapy. Scrapy is a web crawling framework in Python.
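As a rough sketch of the ImagesPipeline setup (the settings keys and item fields follow Scrapy's documented image pipeline; in older Scrapy versions the pipeline class lives under scrapy.contrib.pipeline.images instead, and the storage path is a placeholder):

# settings.py
ITEM_PIPELINES = {'scrapy.pipelines.images.ImagesPipeline': 1}
IMAGES_STORE = '/path/to/downloaded/images'   # placeholder storage directory

# items.py
import scrapy

class ImageItem(scrapy.Item):
    # ImagesPipeline downloads every URL placed in image_urls
    image_urls = scrapy.Field()
    images = scrapy.Field()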
Update:
import re
import sys
import urllib
import urlparse
from BeautifulSoup import BeautifulSoup

class MyOpener(urllib.FancyURLopener):
    version = 'Mozilla/5.0 (Windows; U; Windows NT 6.1; en-US; rv:1.9.2.15) Gecko/20110303 Firefox/3.6.15'

def process(url):
    myopener = MyOpener()
    #page = urllib.urlopen(url)
    page = myopener.open(url)

    text = page.read()
    page.close()

    soup = BeautifulSoup(text)
    print(soup)

    for tag in soup.findAll('img'):
        print (tag)

# process(url)

def main():
    url = "https://www.organicfacts.net/health-benefits/fruit/health-benefits-of-grapes.html"
    process(url)

if __name__ == "__main__":
    main()
Output:
<img src="https://www.organicfacts.net/wp-content/uploads/wordpress-popular-posts/1430-35x35.jpg" title="Coconut Oil for Skin" alt="Coconut Oil for Skin" width="35" height="35" class="wpp-thumbnail wpp_cached_thumb wpp_featured" />
<img src="https://www.organicfacts.net/wp-content/uploads/wordpress-popular-posts/1427-35x35.jpg" title="Coconut Oil for Hair" alt="Coconut Oil for Hair" width="35" height="35" class="wpp-thumbnail wpp_cached_thumb wpp_featured" />
<img src="https://www.organicfacts.net/wp-content/uploads/wordpress-popular-posts/335-35x35.jpg" title="Health Benefits of Cranberry Juice" alt="Health Benefits of Cranberry Juice" width="35" height="35" class="wpp-thumbnail wpp_cached_thumb wpp_featured" />
<img src="https://www.organicfacts.net/wp-content/uploads/wordpress-popular-posts/59-35x35.jpg"
Update 2:
with open(the_filename, 'w') as f:
    for s in image_links:
        f.write(s + '\n')
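Putting the updates together, a hedged end-to-end sketch that reads links.txt, applies the regex from the question to each page, and writes every match to a second file (the output filename is a placeholder):

from urllib2 import urlopen
import re

with open('links.txt') as f:
    urls = f.read().splitlines()

with open('image_links.txt', 'w') as out:      # placeholder output file
    for url in urls:
        html = urlopen(url).read()
        image_links = re.findall("src.\"(\S*?media.tumblr\S*?tumblr_\S*?jpg)", html)
        for link in image_links:
            out.write(link + '\n')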
