How to Get image when dynamic link comes from a website - python

I want to get the full-resolution image displayed on this website:
http://oiswww.eumetsat.org/IPPS/html/MSG/PRODUCTS/MPE/FULLRESOLUTION/index.htm
The image gets a new dynamic link every time it is updated, which causes a problem if we want to download it each time.
Do you have any tricks in Python to systematically download the full-resolution image?
Thanks all.

You can use BeautifulSoup, lxml, or a Python regular expression to parse the HTML and get the correct link; there should be an XPath to it.

From the source code of the html:
array_nom_imagen[0]="wwCzemwbmWlTk"
array_nom_imagen[1]="CtXqGo6wG8hVz"
array_nom_imagen[2]="8UFuyfrkbcd0b"
...
...
array_nom_imagen[138]="fFoSqmGjl6zhJ"
array_nom_imagen[139]="S5QefAKEdpWQf"
array_nom_imagen[140]="vCcabHqeoVgdv"
and
function loadimages(i_image) {
    array_imagen[i_image] = new Image()
    array_imagen[i_image].src = "IMAGESDisplay/"+array_nom_imagen[i_image]
    imageurl[i_image]="IMAGESDisplay/"+array_nom_imagen[i_image]
    loaded_images[i_image]="TRUE"
}
So only 141 pictures are available.
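A minimal sketch of how those names could be turned into download URLs with a regular expression, assuming the page source keeps the `array_nom_imagen[i]="..."` layout quoted above. The function name `image_urls` and the `sample` string are illustrative, and the source does not say which index holds the newest image, so the sketch simply returns all of them in index order.

```python
import re
from urllib.parse import urljoin

PAGE_URL = "http://oiswww.eumetsat.org/IPPS/html/MSG/PRODUCTS/MPE/FULLRESOLUTION/index.htm"

# Matches lines such as: array_nom_imagen[0]="wwCzemwbmWlTk"
NAME_PATTERN = re.compile(r'array_nom_imagen\[(\d+)\]\s*=\s*"([^"]+)"')


def image_urls(html_text, page_url=PAGE_URL):
    """Return absolute image URLs, ordered by their array index."""
    names = {int(index): name for index, name in NAME_PATTERN.findall(html_text)}
    # The page's JavaScript builds each link as "IMAGESDisplay/" + name,
    # relative to the page itself.
    return [urljoin(page_url, "IMAGESDisplay/" + names[i]) for i in sorted(names)]


# Stand-in for the downloaded page source (two entries copied from above).
sample = '''
array_nom_imagen[0]="wwCzemwbmWlTk"
array_nom_imagen[1]="CtXqGo6wG8hVz"
'''
for url in image_urls(sample):
    print(url)
```

In real use, `html_text` would be the text of the index page fetched on each run (for example with `urllib.request.urlopen`), and each returned URL could then be downloaded.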

Related

How do I extract text in the right order from PDF using PyPDF2?

I am currently working on a project to extract the contents of a PDF. The code runs smoothly and I am able to extract the text, but the extracted text is not in the right order. The order is all over the place: it does not go from top to bottom, which is really confusing.
I looked up online but there was very little help on how to order the text extraction. Most tutorials came up with the same result. For reference, this is the PDF that I am currently testing it on (page 5): https://www.pidm.gov.my/PIDM/files/13/134b5c79-5319-4199-ac68-99f62aca6047.pdf
import PyPDF2
with open('pdftest2.pdf', 'rb') as pdfTest:
    reader = PyPDF2.PdfFileReader(pdfTest)
    page5 = reader.getPage(4)
    text = page5.extractText()
    print(text)
The extracted text would always start with the footer of the page and then go its way from bottom to top. I noticed in the next page it would start from top to bottom but only for a few certain sentences. Then it would extract text from a different position of the page instead of continuing from where it left off.
All of the text does get extracted but the order of which it is extracted is all over the place. Is there any solution for this problem?
I had to deal with a similar problem, and it turned out that the pdfplumber module worked better than PyPDF. I guess it depends on the document itself, so you should try it.
Otherwise, another answer to your problem would be to treat the PDF pages as images with the pdf2image module and extract the text from them using pytesseract. However, it might not be a perfect method, as the pdf2image function convert_from_path can take quite a long time to run.
I'll drop some code here in case you are interested.
First of all, make sure you install all the necessary dependencies as well as Tesseract and ImageMagick. You can find installation information on their websites. If you are working on Windows, there's a good Medium article here.
To convert PDFs to images using pdf2image:
Don't forget to add your poppler path if you are working on Windows. It should look something like this: r'C:\<your_path>\poppler-21.02.0\Library\bin'
import os
from pdf2image import convert_from_path

def pdftoimg(fic, output_folder, poppler_path):
    # Render every page of the PDF; pdf2image also writes intermediate .ppm files to output_folder
    pages = convert_from_path(fic, dpi=500, output_folder=output_folder, thread_count=9, poppler_path=poppler_path)
    # Save each page as page_0.jpg, page_1.jpg, ...
    for image_counter, page in enumerate(pages):
        filename = "page_" + str(image_counter) + ".jpg"
        page.save(os.path.join(output_folder, filename), 'JPEG')
    # Remove the intermediate .ppm files
    for i in os.listdir(output_folder):
        if i.endswith('.ppm'):
            os.remove(os.path.join(output_folder, i))
To extract text from the image:
Your tesseract path is going to be something like this: r'C:\Program Files\Tesseract-OCR\tesseract.exe'
import pytesseract
from PIL import Image

def imgtotext(img, tesseract_path):
    # Point pytesseract at the Tesseract executable
    pytesseract.pytesseract.tesseract_cmd = tesseract_path
    # Recognize the text in the image as a string using pytesseract
    text = pytesseract.image_to_string(Image.open(img))
    # Rejoin words that were hyphenated across line breaks
    text = text.replace('-\n', '')
    return text
I recently started using PyMuPDF. Its licensing is a little confusing, but some of its methods can sort the text in natural reading order (left to right, top to bottom). Something like page.get_text("words", sort=True) is all it takes.
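To illustrate what "top to bottom, left to right" ordering means, here is a rough plain-Python sketch, not PyMuPDF's actual implementation. It assumes each word is a tuple beginning with (x0, y0, x1, y1, text) and that y grows downward on the page. The helper name `words_in_reading_order` and the `line_tolerance` parameter are hypothetical.

```python
def words_in_reading_order(words, line_tolerance=3.0):
    """Return word texts sorted top-to-bottom, then left-to-right.

    words: iterable of tuples starting with (x0, y0, x1, y1, text).
    line_tolerance: words whose top edges differ by at most this much
    from the first word of a line are treated as part of that line.
    """
    lines = []
    # Walk the words from the top of the page downward.
    for word in sorted(words, key=lambda w: w[1]):
        if lines and abs(word[1] - lines[-1][0][1]) <= line_tolerance:
            lines[-1].append(word)   # same visual line
        else:
            lines.append([word])     # start a new line
    # Within each line, order the words by their left edge.
    return [w[4] for line in lines for w in sorted(line, key=lambda w: w[0])]


# Made-up coordinates: the footer comes first in extraction order.
extracted = [
    (300, 700, 340, 712, "footer"),
    (120, 50.5, 160, 62, "world"),
    (72, 50, 110, 62, "Hello"),
    (72, 70, 100, 82, "next"),
]
print(" ".join(words_in_reading_order(extracted)))
```

Multi-column layouts need more than this, since a simple top-to-bottom sort interleaves the columns.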

Getting a URL of some picture from Google search

New to Python. I'm trying to find a way to get the URL of the first picture returned by a Google search for some string. For example, if I type "dog" I would like to get the URL of the first picture for dog. I don't care which one, just some URL from Google image search.
Is it possible? What is the easiest way to do it? I have seen many ways in previous threads to extract or download the image, but I just need the URL and it doesn't matter which one.
This should work; simply replace the word to get images of anything.
Make sure you have requests and BeautifulSoup installed; if not, run this command:
pip install requests beautifulsoup4
import requests
from bs4 import BeautifulSoup
word = 'dog'
url = 'https://www.google.com/search?q={0}&tbm=isch'.format(word)
content = requests.get(url).content
# 'html.parser' ships with Python, so the command above installs everything needed
soup = BeautifulSoup(content, 'html.parser')
images = soup.find_all('img')
for image in images:
    print(image.get('src'))
I don't know about Google, but I do know an easy way to do this with Bing. There's a PyPI module called bing-image-urls (https://pypi.org/project/bing-image-urls) that will do the job nicely. It's pretty easy to use. Just install it with:
pip install bing-image-urls
Then, in your python script, have this code:
from bing_image_urls import bing_image_urls
url = bing_image_urls("dog", limit=1)[0]
print(url)
Just replace "dog" in this example with whatever you want.
Hopefully this answers your question
Thanks!

Beautiful Soup can not find all image tags in html (stops exactly at 5)

I am trying to use BeautifulSoup to get all the images on a site that have a certain class. My issue is that when I run the code just to see whether it can find each image, it only gets images 1-5. I think the issue is the HTML, since images 6 onward are located in a nested div, but find_all should be able to find every img with the same class.
import requests, os, bs4, sys, webbrowser
url = 'https://mangapanda.onl/chapter/'
os.makedirs('manga', exist_ok=True)
comic = sys.argv[1:]
aComic = '-'.join(sys.argv[1:])
issue = input('which issue do you want?')
aIssue = ('/chapter-' + issue)
aComic = (aComic + '_110' + aIssue)
comicUrl = (url + aComic)
res = requests.get(comicUrl)
res.raise_for_status()
soup = bs4.BeautifulSoup(res.text, 'html.parser')
comicElem = soup.find_all(class_="PB0mN")
if comicElem == []:
    print('nothing in the list')
else:
    print('There are ' + str(len(comicElem)) + ' on this page')
    for i in range(len(comicElem)):
        comicPage = comicElem[i].get('src')
        print(str(comicPage) + '\n')
Is there something I am missing about BeautifulSoup that could have helped me solve this? Is it the HTML that is causing the problem? Was there a better way I could have diagnosed this myself that would have been within my capability? (Side note: I am currently going through the book "Automate the Boring Stuff with Python". It is where I got the idea for this mini project, and it gives a decent idea of my skill level with Python.) Lastly, I am using BeautifulSoup to learn more about it. If possible I would like to solve this issue using BeautifulSoup, but I will research other options for parsing HTML if I need to.
Using Firefox Quantum 59.0.2
Using Python 3
PS: if you know of other questions that may already have answered this problem, feel free to just link me to them. I really wanted to figure out the answer from someone else's question, but my issue seems to be pretty unique.
The problem is that some of the images are added to the DOM by JavaScript after the page is loaded. So
res = requests.get(comicUrl)
gets the HTML as sent by the server, before any modifications are made by JavaScript. This is why
soup = bs4.BeautifulSoup(res.text, 'html.parser')
comicElem = soup.find_all(class_="PB0mN")
len(comicElem) # = 5
only finds 5 images.
If you want to get all the images that are loaded, requests alone is not enough because it does not run JavaScript. Here is an example using selenium:
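from selenium import webdriver
from selenium.webdriver.chrome.service import Service
from selenium.webdriver.common.by import By

# Selenium 4 API; older releases used webdriver.Chrome(path) and find_elements_by_class_name
browser = webdriver.Chrome(service=Service('/Users/glenn/Downloads/chromedriver'))
comicUrl = "https://mangapanda.onl/chapter/naruto_107/chapter-700.5"
browser.get(comicUrl)
images = browser.find_elements(By.CLASS_NAME, "PB0mN")
for image in images:
    print(image.get_attribute('src'))
len(images) # = 18 images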
See this post for additional resources for scraping javascript pages:
Web-scraping JavaScript page with Python
How can you tell whether the HTML is being modified by JavaScript?
I don't have any hard rules, but these are some investigative steps you can carry out:
Finding only 5 images with requests while seeing more images on the page, as you observed, is the first clue that the DOM is being changed after it is loaded.
A second step: using the browser's Developer Tools -> Scripts, you can see that there are several JavaScript files associated with the page. Not all JavaScript modifies the DOM, so the presence of these scripts does not necessarily mean the DOM is being changed.
For further verification that the DOM is being modified after the page is loaded:
Copy the HTML from Developer Tools -> View Page Source into an HTML formatter tool like http://htmlformatter.com, format it, and look at the line count. View Page Source shows the HTML sent by the server without any modifications.
Then copy the HTML from Developer Tools -> Elements (be sure to get the whole thing from <html>...</html>), paste it into the same kind of HTML formatter tool, format it, and look at the line count. The Elements HTML is the complete, modified DOM.
If the line counts are significantly different, then you know the DOM is being modified after it is loaded.
Comparing line counts for "https://mangapanda.onl/chapter/naruto_107/chapter-700.5" shows 479 lines for the source HTML and 3245 lines for the complete DOM, so something is modifying the DOM after the page is loaded.
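The same comparison can be sketched in code by counting the tags that carry the class of interest in two saved copies of the page: one from View Page Source and one from the Elements panel. This uses only the standard library. The names `ClassCounter` and `count_class` and the two short HTML strings are made up for illustration.

```python
from html.parser import HTMLParser


class ClassCounter(HTMLParser):
    """Counts start tags whose class attribute contains a given class name."""

    def __init__(self, class_name):
        super().__init__()
        self.class_name = class_name
        self.count = 0

    def handle_starttag(self, tag, attrs):
        # attrs is a list of (name, value) pairs; value may be None.
        classes = (dict(attrs).get("class") or "").split()
        if self.class_name in classes:
            self.count += 1


def count_class(html_text, class_name):
    parser = ClassCounter(class_name)
    parser.feed(html_text)
    parser.close()
    return parser.count


# Tiny stand-ins for the two saved copies of the page.
source_html = '<div><img class="PB0mN" src="1.jpg"></div>'
rendered_html = ('<div><img class="PB0mN" src="1.jpg">'
                 '<div><img class="PB0mN" src="2.jpg">'
                 '<img class="PB0mN lazy" src="3.jpg"></div></div>')

print(count_class(source_html, "PB0mN"), count_class(rendered_html, "PB0mN"))
```

A higher count in the rendered copy than in the source copy points to elements being added after load.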

Display or save a List of URL images

Python is known to be an easy and powerful language. I have a List, literally, of URL images,
>>> for i in images: print i
http://upload.wikimedia.org/wikipedia/commons/8/86/Influenza_virus_research.jpg
http://upload.wikimedia.org/wikipedia/commons/f/f8/Wiktionary-logo-en.svg
http://upload.wikimedia.org/wikipedia/en/e/e7/Cscr-featured.svg
http://upload.wikimedia.org/wikipedia/commons/f/fa/Wikiquote-logo.svg
http://upload.wikimedia.org/wikipedia/commons/4/4c/Wikisource-logo.svg
http://upload.wikimedia.org/wikipedia/commons/1/1b/Wikiversity-logo-en.svg
http://upload.wikimedia.org/wikipedia/commons/1/1b/Wikiversity-logo-en.svg
I wonder if there's some library (or snippet of code) in python to easily display a list of URL images in a browser, or maybe save them in a folder.
import urllib.request
# Python 3; in Python 2 this function was urllib.urlretrieve
urllib.request.urlretrieve("http://8020.photos.jpgmag.com/3670771_314453_2ee7120da5_m.jpg", "my.jpg")
"my.jpg" is the path where the file is saved. It can also be something like "/home/user/pics/my.jpg".
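Extending that idea to a whole list, the following is a sketch under a few assumptions: the file name is taken from the last path component of each URL, repeated URLs (the list in the question contains one) are handled once, and the helper names `local_name`, `save_all`, and `gallery_html` are illustrative. Some servers may refuse the default urllib user agent, which this sketch does not handle.

```python
import html
import os
from urllib.parse import urlsplit
from urllib.request import urlretrieve


def local_name(url):
    """Use the last path component of the URL as the file name."""
    return os.path.basename(urlsplit(url).path)


def save_all(urls, folder):
    """Download every distinct URL into folder (performs network requests)."""
    os.makedirs(folder, exist_ok=True)
    for url in dict.fromkeys(urls):  # keeps order, drops repeats
        urlretrieve(url, os.path.join(folder, local_name(url)))


def gallery_html(urls):
    """Build a minimal HTML page that shows each distinct image once."""
    tags = ['<img src="{}">'.format(html.escape(u, quote=True))
            for u in dict.fromkeys(urls)]
    return "<!DOCTYPE html>\n<html><body>\n" + "\n".join(tags) + "\n</body></html>\n"


images = [
    "http://upload.wikimedia.org/wikipedia/commons/f/fa/Wikiquote-logo.svg",
    "http://upload.wikimedia.org/wikipedia/commons/1/1b/Wikiversity-logo-en.svg",
    "http://upload.wikimedia.org/wikipedia/commons/1/1b/Wikiversity-logo-en.svg",
]
print(gallery_html(images))
```

To view the result in a browser, the string returned by `gallery_html` can be written to a file such as gallery.html and opened with the standard-library call `webbrowser.open`. Calling `save_all(images, "pics")` would store the files in a local folder instead.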

Python: How to make Reportlab move to next page in PDF output

I'm using the open source version Reportlab with Python on Windows. My code loops through multiple PNG files & combines them to form a single PDF. Each PNG is stretched to the full LETTER spec (8.5x11).
Problem is, all the images saved to output.pdf are sandwiched on top of each other and only the last image added is visible. Is there something that I need to add between each drawImage() to offset to a new page? Here's a simple linear view of what I'm doing -
WIDTH,HEIGHT = LETTER
canv = canvas.Canvas('output.pdf',pagesize=LETTER)
canv.setPageCompression(0)
page = Image.open('one.png')
canv.drawImage(ImageReader(page),0,0,WIDTH,HEIGHT)
page = Image.open('two.png')
canv.drawImage(ImageReader(page),0,0,WIDTH,HEIGHT)
page = Image.open('three.png')
canv.drawImage(ImageReader(page),0,0,WIDTH,HEIGHT)
canv.save()
[Follow up of the post's comment]
Use canv.showPage() after you use canv.drawImage(...) each time.
( http://www.reportlab.com/apis/reportlab/dev/pdfgen.html#reportlab.pdfgen.canvas.Canvas.showPage )
Follow the source documentation (for that matter, whatever tool you are using, you should dig into its documentation on the project's website):
http://www.reportlab.com/apis/reportlab/dev/pdfgen.html
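A sketch of the loop form of that advice. The helper `draw_one_image_per_page` is hypothetical and accepts any object that has the `drawImage`, `showPage`, and `save` methods of ReportLab's Canvas. `RecordingCanvas` is only a stand-in, so the call order can be seen without ReportLab installed.

```python
def draw_one_image_per_page(canv, images, width, height):
    """Draw each image full-page, closing the page after every image."""
    for image in images:
        canv.drawImage(image, 0, 0, width, height)
        canv.showPage()  # finish this page; the next drawImage starts a new one
    canv.save()


class RecordingCanvas:
    """Hypothetical stand-in that records calls instead of writing a PDF."""

    def __init__(self):
        self.calls = []

    def drawImage(self, image, x, y, width, height):
        self.calls.append("drawImage " + image)

    def showPage(self):
        self.calls.append("showPage")

    def save(self):
        self.calls.append("save")


demo = RecordingCanvas()
draw_one_image_per_page(demo, ["one.png", "two.png"], 612, 792)
print(demo.calls)
```

With ReportLab itself, the first argument would be the canvas from the question, `canvas.Canvas('output.pdf', pagesize=LETTER)`, with WIDTH and HEIGHT passed as the page size. The images can be file names or ImageReader objects as in the original code.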
