I can't set image thumbnails for mp3 files using eyed3 module in Python.
I try next script:
import eyed3
from eyed3.id3.frames import ImageFrame
th = 'url_to_my_pic'
file = 'to_mp3_pleer/file.mp3'
audiofile = eyed3.load(file)
audiofile.initTag()
audiofile.tag.frames = ImageFrame(image_url=th)
audiofile.tag.save()
But this do nothing with thumbnails in my file.
In google no information about settings thumbnails using eyed3. How can I set it?
After several hours learning of eyeD3, googling and experimenting with file cover, I think, I have a solution for you.
You need to follow these rules:
use ID3v2.3 (not v2.4 as by default in eyeD3);
add right description for cover image (word cover);
pass image as binary;
I'll give you an example of code, which works fine on my Windows 10 (should works on other platforms as well):
import eyed3
import urllib.request
audiofile = eyed3.load("D:\\tmp\\tmp_mp3\\track_example.mp3")
audiofile.initTag(version=(2, 3, 0)) # version is important
# Other data for demonstration purpose only (from docs)
audiofile.tag.artist = "Token Entry"
audiofile.tag.album = "Free For All Comp LP"
audiofile.tag.album_artist = "Various Artists"
audiofile.tag.title = "The Edge"
# Read image from local file (for demonstration and future readers)
with open("D:\\tmp\\tmp_covers\\cover_2021-03-13.jpg", "rb") as image_file:
imagedata = image_file.read()
audiofile.tag.images.set(3, imagedata, "image/jpeg", u"cover")
audiofile.tag.save()
# Get image from the Internet
response = urllib.request.urlopen("https://example.com/your-picture-here.jpg")
imagedata = response.read()
audiofile.tag.images.set(3, imagedata, "image/jpeg", u"cover")
audiofile.tag.save()
Credits: My code is based on several pages: 1, 2, 3
Related
So I am able to post Docx files to WordPress using WP REST-API using mammoth docx package in Python
I am able to upload an image to WordPress.
But when there are images in the docx file they are not uploading on the WordPress media section.
Any input on this?
I am using python for this.
Here is the code for Docx to HTML conversion
with open(file_path, "rb") as docx_file:
# html = mammoth.extract_raw_text(docx_file)
result = mammoth.convert_to_html(docx_file, convert_image=mammoth.images.img_element(convert_image))
html = result.value # The generated HTML
kindly do note that I am able to see images in the actual published post but they have a weird source image URL & are not appearing in the WordPress media section.
Weird image source URL like
data:image/jpeg;base64,/9j/4AAQSkZJRgABAQAAAQABAAD/2wBDAAEBAQEBAQEBAQEBAQECAgMCAgICAgQDAwIDBQQFBQUEBAQFBgcGBQUHBgQEBgkGBwgICAgIBQYJCgkICgcICAj/2wBDAQEBAQICAgQCAgQIBQQFCAgICAgICAgICAgICAgICAgICAgICAgICAgICAgICAgICAgICAgICAgICAgICAgICAj/wAARCAUABQADASIAAhEBAxEB/8QAHwAAAQMFAQEBAAAAAAAAAAAAAAUGBwMECAkKAgsB/8QAhxAAAQIEBAMEBQYHCAUOFggXAQIDAAQFEQYHEiETMUEIIlFhCRQ & so on
Also Huge thanks to Contributors for the Python to WordPress repo
The mammoth cli has a function that extracts images, saves them to a directory and inserts the file names in the img tags in the html code. If you don't want to use mammoth in command line you could use this code:
import os
from mammoth.cli import ImageWriter, _write_output
output_dir = './output'
filename = 'filename.docx'
with open(filename, "rb") as docx_fileobj:
convert_image = mammoth.images.img_element(ImageWriter(output_dir))
output_filename = "{0}.html".format(os.path.basename(filename).rpartition(".")[0])
output_path = os.path.join(output_dir, output_filename)
result = mammoth.convert(
docx_fileobj,
convert_image=convert_image,
output_format='html',
)
_write_output(output_path, result.value)
Note that you would still need to change the img links as you'll be uploading the images to Wordpress, but this solves your mapping issue. You might also want to change the ImageWriter class to save the images to something else than tiff.
i am not a programmer but i want to do that. I want to download images from multiple urls ( urllist.txt ) the url not have the image in it so need to recognise the image > 400kb and have 20 sec delay beetween the download so the site not lock me out
thnx in advance
Stack Overflow is for stuff like error handling, but I thought I could help. This worked for me:
downloader.py
import requests
import random
def download_imgs(file):
'''
Downloads images based
on the URLs given in `file`.
'''
with open(file, 'r') as url_file:
data = url_file.read().strip().split('\n') # Read the URLs in the file
for url in data:
img = requests.get(url.strip()) # Open the link
with open(str(random.uniform(1, 10000)), 'wb') as write_img:
# Random module to generate a random name for the image
write_img.write(img.content)
# Saved the image
return True
download_imgs('~/Desktop/urllist.txt')
urllist.txt
https://lh3.googleusercontent.com/a-/AOh14GhJAxUW_Gcq2xzMqe3_tc3eLV6e9-sMTqDWuRY7=s88-c-k-c0x00ffffff-no-rj-mo
https://i.ytimg.com/vi/m4jmapVMaQA/hqdefault.jpg?sqp=-oaymwEZCPYBEIoBSFXyq4qpAwsIARUAAIhCGAFwAQ==&rs=AOn4CLBqRJKwS9ZzMwnUZvmkXrAw5EzH5w
Even though the images don't have file extensions (Ex: PNG or PING), this program seems to work fine for me.
I would like to extract text from scanned PDFs.
My "test" code is as follows:
from pdf2image import convert_from_path
from pytesseract import image_to_string
from PIL import Image
converted_scan = convert_from_path('test.pdf', 500)
for i in converted_scan:
i.save('scan_image.png', 'png')
text = image_to_string(Image.open('scan_image.png'))
with open('scan_text_output.txt', 'w') as outfile:
outfile.write(text.replace('\n\n', '\n'))
I would like to know if there is a way to extract the content of the image directly from the object converted_scan, without saving the scan as a new "physical" image file on the disk?
Basically, I would like to skip this part:
for i in converted_scan:
i.save('scan_image.png', 'png')
I have a few thousands scans to extract text from. Although all the generated new image files are not particularly heavy, it's not negligible and I find it a bit overkill.
EDIT
Here's a slightly different, more compact approach than Colonder's answer, based on this post. For .pdf files with many pages, it might be worth adding a progress bar to each loop using e.g. the tqdm module.
from wand.image import Image as w_img
from PIL import Image as p_img
import pyocr.builders
import regex, pyocr, io
infile = 'my_file.pdf'
tool = pyocr.get_available_tools()[0]
tool = tools[0]
req_image = []
txt = ''
# to convert pdf to img and extract text
with w_img(filename = infile, resolution = 200) as scan:
image_png = scan.convert('png')
for i in image_png.sequence:
img_page = w_img(image = i)
req_image.append(img_page.make_blob('png'))
for i in req_image:
content = tool.image_to_string(
p_img.open(io.BytesIO(i)),
lang = tool.get_available_languages()[0],
builder = pyocr.builders.TextBuilder()
)
txt += content
# to save the output as a .txt file
with open(infile[:-4] + '.txt', 'w') as outfile:
full_txt = regex.sub(r'\n+', '\n', txt)
outfile.write(full_txt)
UPDATE MAY 2021
I realized that although pdf2image is simply calling a subprocess, one doesn't have to save images to subsequently OCR them. What you can do is just simply (you can use pytesseract as OCR library as well)
from pdf2image import convert_from_path
for img in convert_from_path("some_pdf.pdf", 300):
txt = tool.image_to_string(img,
lang=lang,
builder=pyocr.builders.TextBuilder())
EDIT: you can also try and use pdftotext library
pdf2image is a simple wrapper around pdftoppm and pdftocairo. It internally does nothing more but calls subprocess. This script should do what you want, but you need a wand library as well as pyocr (I think this is a matter of preference, so feel free to use any library for text extraction you want).
from PIL import Image as Pimage, ImageDraw
from wand.image import Image as Wimage
import sys
import numpy as np
from io import BytesIO
import pyocr
import pyocr.builders
def _convert_pdf2jpg(in_file_path: str, resolution: int=300) -> Pimage:
"""
Convert PDF file to JPG
:param in_file_path: path of pdf file to convert
:param resolution: resolution with which to read the PDF file
:return: PIL Image
"""
with Wimage(filename=in_file_path, resolution=resolution).convert("jpg") as all_pages:
for page in all_pages.sequence:
with Wimage(page) as single_page_image:
# transform wand image to bytes in order to transform it into PIL image
yield Pimage.open(BytesIO(bytearray(single_page_image.make_blob(format="jpeg"))))
tools = pyocr.get_available_tools()
if len(tools) == 0:
print("No OCR tool found")
sys.exit(1)
# The tools are returned in the recommended order of usage
tool = tools[0]
print("Will use tool '%s'" % (tool.get_name()))
# Ex: Will use tool 'libtesseract'
langs = tool.get_available_languages()
print("Available languages: %s" % ", ".join(langs))
lang = langs[0]
print("Will use lang '%s'" % (lang))
# Ex: Will use lang 'fra'
# Note that languages are NOT sorted in any way. Please refer
# to the system locale settings for the default language
# to use.
for img in _convert_pdf2jpg("some_pdf.pdf"):
txt = tool.image_to_string(img,
lang=lang,
builder=pyocr.builders.TextBuilder())
I'm currently trying to work with Google's python vision library. But I'm currently stuck on how to read images from the web. So far I've got this here down below. My issue is that the contents always seem to be empty and when I check using PyCharm, it says that it only contains b''.
How can I open this image so I can use it for Google's library?
from google.cloud import vision
from google.cloud.vision import types
from urllib import request
import io
client = vision.ImageAnnotatorClient.from_service_account_json('cred.json')
url = "https://cdn.getyourguide.com/img/location_img-59-1969619245-148.jpg"
img = request.urlopen(url)
with io.open('location_img-59-1969619245-148.jpg', 'rb') as fhand:
content = fhand.read()
image = types.Image(content=content)
response = client.label_detection(image=image)
labels = response.label_annotations
print('Labels:')
for label in labels:
print(label.description)
Do you try get image via requests library?
import requests
r = requests.get("https://cdn.getyourguide.com/img/location_img-59-1969619245-148.jpg")
o = open("location_img.jpg", "wb")
o.write(r.content)
o.close()
I am downloading a list of images (all .jpg) from the web using this python script:
__author__ = 'alessio'
import urllib.request
fname = "inputs/skyscraper_light.txt"
with open(fname) as f:
content = f.readlines()
for link in content:
try:
link_fname = link.split('/')[-1]
urllib.request.urlretrieve(link, "outputs_new/" + link_fname)
print("saved without errors " + link_fname)
except:
pass
In OSX preview I see the images just fine, but I can't open them with any image editor (for example Photoshop says "Could not complete your request because Photoshop does not recognize this type of file."), and when i try to attach them to a word document, the files are not even showed as picture files in the dialog for browsing for image.
What am i doing wrong?
As J.F. Sebastian suggested me in the comments, the issue was related to the newline in the filename.
To make my script work, you need to replace
link_fname = link.split('/')[-1]
with
link_fname = link.strip().split('/')[-1]