I normally use wget to download an image or two from a web page. From the command prompt I run something like: wget 'webpage-url' -P 'directory where I want to save it'. How do I automate this in Perl and Python? That is, which command lets me run it as if I had typed it at the command prompt? In Python there are so many similar-looking modules (subprocess, os, etc.) that I am quite confused.
In Perl, the easiest way is to use LWP::Simple:
use LWP::Simple qw(getstore);
getstore('http://www.example.com', '/path/to/saved/file.ext');
In Python, you can call wget through subprocess:
import subprocess
subprocess.call(["wget", "www.example.com", "-P", "/dir/to/save"])
If you want to read URL and process the response:
import urllib2
response = urllib2.urlopen('http://example.com/')
html = response.read()
For how to extract images from the HTML, see the existing answers here on SO.
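If you would rather not depend on wget at all, the download can be done with the standard library alone. This is a minimal sketch for Python 3 (where urllib2 became urllib.request); the function name download and the index.html fallback are my own choices for illustration, not something from the answers above.

```python
import os
import shutil
import urllib.request
from urllib.parse import urlparse


def download(url, dest_dir):
    """Save the resource at `url` into `dest_dir`, roughly like `wget url -P dest_dir`."""
    os.makedirs(dest_dir, exist_ok=True)
    # Derive a file name from the URL path; fall back to index.html if the path is empty.
    name = os.path.basename(urlparse(url).path) or "index.html"
    target = os.path.join(dest_dir, name)
    # Stream the response body straight into the target file.
    with urllib.request.urlopen(url) as response, open(target, "wb") as out:
        shutil.copyfileobj(response, out)
    return target
```

Calling download('http://example.com/picture.jpg', '/dir/to/save') would then write /dir/to/save/picture.jpg.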
In Perl you can also use qx(yourcommandhere), which runs an external program.
So, in your example, qx(wget 'webpage-url' -P '/home/myWebPages/') is enough.
But, as s0me0ne said, using LWP::Simple is better.
If you have a list of URLs in a file, you can use this code:
my $fh; # filehandle
open $fh, "<", "fileWithUrls.txt" or die "can't open file with urls: $!";
my @urls = <$fh>; # read all URLs, one per line of the file
close $fh;
chomp @urls; # strip trailing newlines so they don't cut the shell command short
my $wget = '/path/to/wget.exe';
for my $url (@urls) {
    qx($wget '$url' -P '/home/myWebPages/');
}
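The same loop over a file of URLs can be sketched in Python with subprocess. The helper names read_urls and fetch_all are hypothetical; the file name and target directory are the ones used in the Perl example.

```python
import subprocess


def read_urls(path):
    """Return the non-empty lines of `path`, stripped of surrounding whitespace."""
    with open(path) as fh:
        return [line.strip() for line in fh if line.strip()]


def fetch_all(url_file, dest_dir, wget="wget"):
    """Run wget once per URL listed in `url_file`, saving into `dest_dir`."""
    for url in read_urls(url_file):
        # Passing a list (no shell) avoids quoting problems with characters
        # such as '&' or '?' that often appear in URLs.
        subprocess.call([wget, url, "-P", dest_dir])
```

For example: fetch_all('fileWithUrls.txt', '/home/myWebPages/').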
So, I'm writing a basic Python script that uses youtube-dl to download a high-quality thumbnail from a video. With command-line youtube-dl, you can run "youtube-dl --list-thumbnails [LINK]" and it will output a list of links to the thumbnail images at different qualities. Usually the highest-resolution one has 'maxresdefault' in its link. I want to be able to download this image from the command line with wget. This is the code I have so far. I'm not familiar with regex, but according to regexr.com, it should match the link containing 'maxresdefault'.
import subprocess
import sys
import re
youtubeoutput = subprocess.call(['youtube-dl', '--list-thumbnails', 'https://www.youtube.com/watch?v=t2U2mUtTnzY'], shell=True, stdout=subprocess.PIPE)
print(str(youtubeoutput))
imgurl = re.search("/maxresdefault/g", str(youtubeoutput)).group(0)
print(imgurl)
subprocess.run('wget', str(imgurl))
I put the print statements in there to see what the outputs were. When I run the code, youtube-dl doesn't seem to recognize that a link was passed: youtube-dl: error: You must provide at least one url. Since there are no links in the output, re.search returns None and I get an error. I don't know why youtube-dl won't recognize the link. I'm not even sure it recognizes --list-thumbnails. Could anyone help?
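You've asked subprocess to use a shell (shell=True). With a list argument on POSIX, only the first item is run as the command and the remaining items are passed to the shell itself, which is why youtube-dl never sees the URL. With shell=True you would usually pass the entire command to call as a single string, like so: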
youtubeoutput = subprocess.call("youtube-dl --list-thumbnails https://www.youtube.com/watch?v=t2U2mUtTnzY", shell=True, stdout=subprocess.PIPE)
But really, you may not need a shell. Try something like:
youtubeoutput = subprocess.check_output(['youtube-dl', '--list-thumbnails', 'https://www.youtube.com/watch?v=t2U2mUtTnzY'])
Note that call does not actually return the program's standard output; check_output does.
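Two further details in the question's code are worth a sketch. The pattern "/maxresdefault/g" is JavaScript-style notation; in Python the slashes and the g would be matched literally, so re.search takes the bare pattern instead. Also, subprocess.run expects the command and its arguments in one list. The function names below are hypothetical, and the exact layout of youtube-dl's output is an assumption: the sketch only relies on the URL being a whitespace-delimited token.

```python
import re
import subprocess


def find_maxres_url(text):
    """Return the first URL in `text` that contains 'maxresdefault', or None."""
    match = re.search(r"https?://\S*maxresdefault\S*", text)
    return match.group(0) if match else None


def list_thumbnails(video_url):
    """Run youtube-dl without a shell and return its standard output as text."""
    output = subprocess.check_output(["youtube-dl", "--list-thumbnails", video_url])
    return output.decode()  # check_output returns bytes


def download_with_wget(url):
    """Fetch `url` with wget; the command and its argument go in one list."""
    subprocess.run(["wget", url], check=True)
```

These could be combined as url = find_maxres_url(list_thumbnails(link)), followed by download_with_wget(url) when url is not None.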
pdftotext looks like it only takes the pdf file name or the path to it. The docs aren't extremely helpful (https://www.cyberciti.biz/faq/converter-pdf-files-to-text-format-command/) (https://linux.die.net/man/1/pdftotext)
Is there a way to send the binary contents directly into this?
Let's say I'm grabbing a URL that links directly to a PDF. I grab the response from that URL using Python requests,
response = requests.get(somePdfUrl)
I grab the binary,
pdfBinary = response.content
And I want to be able to send it into this function and run it using subprocess but normally it would be like this:
def textExtract(pdfBinary):
    text = subprocess.run(['pdftotext', '/path/to/file.pdf'],
                          stdout=PIPE, stderr=PIPE)
This might be impossible because of a limitation of the package, but is there a way to pass pdfBinary into this call? I don't want to have to save the PDF file every time and then hand its path to the subprocess.
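Yes, it's possible. Passing '-' as the input file makes pdftotext read from stdin, and '-' as the output file makes it write to stdout:
import requests
from subprocess import Popen, PIPE
command = ['pdftotext', '-layout', '-', '-']
p = Popen(command, stdout=PIPE, stdin=PIPE)
r = requests.get(url)  # url is the address of the PDF
output = p.communicate(input=r.content)[0].decode()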
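The same idea can be wrapped in a function using subprocess.run, which accepts the bytes through its input parameter. This is a sketch only: the function names are hypothetical, and it assumes, as the answer above does, that the installed pdftotext accepts '-' for both input and output.

```python
import subprocess


def run_with_stdin(command, data):
    """Feed `data` (bytes) to the stdin of `command` and return its stdout (bytes)."""
    result = subprocess.run(
        command,
        input=data,
        stdout=subprocess.PIPE,
        stderr=subprocess.PIPE,
        check=True,  # raise CalledProcessError on a non-zero exit status
    )
    return result.stdout


def pdf_bytes_to_text(pdf_bytes):
    """Convert in-memory PDF bytes to text without writing a temporary file."""
    raw = run_with_stdin(["pdftotext", "-layout", "-", "-"], pdf_bytes)
    return raw.decode("utf-8", errors="replace")
```

With the question's variables this would be called as text = pdf_bytes_to_text(response.content).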
I solved this by using a modified, dockerized version of this utility. Say you have a
Dockerfile:
FROM minidocks/poppler
CMD cp /dev/stdin myfile.pdf && pdftohtml -noframes -stdout myfile.pdf
Build the Docker image:
$ docker build . -t pdftohtml
Then you can do something like
$ cat some-pdf.pdf | docker run -i pdftohtml
This can be called from your Python script. You wouldn't need the cat, of course; you'd feed your binary data to the process's stdin directly.
Background
I'm working on a bash script to pull serial numbers and part numbers from all the devices in a server rack. My goal is to be able to run a single script (inventory.sh) and walk away while it generates text files containing the information I need. I'm using bash for maximum compatibility; the RHEL 6.7 systems do have Perl and Python installed, but with minimal libraries. So far I haven't had to use anything other than bash, but I'm not against calling a Perl or Python script from my bash script.
My Problem
I need to retrieve the Serial Numbers and Part numbers from the drives in a Dot Hill Systems AssuredSAN 3824, as well as the Serial numbers from the equipment inside. The only way I have found to get all the information I need is to connect over SSH and run the following three commands dumping the output to a local file:
show controllers
show frus
show disks
Limitations:
I don't have "sshpass" installed, and would prefer not to install it.
The Controller is not capable of storing SSH keys (no option in custom shell).
The Controller also cannot write or transfer local files.
The Rack does NOT have access to the Internet.
I looked at paramiko, but while Python is installed I do not have pip.
I also cannot use CPAN.
For what it's worth, the output comes back in XML format. (I've already written the code to parse it in bash.)
Right now I think my best option would be to have a library for Python or Perl in the folder with my other scripts, and write a script to dump the commands' output to files that I can parse with my bash script. Which language is easier to just provide a library in a file? I'm looking for a library that is as small and simple as possible to use. I just need a way to get the output of those commands to XML files. Right now I am just using ssh 3 times in my script and having to enter the password each time.
Have a look at SNMP. There is a reasonable chance that you can use SNMP tools to remotely extract the information you need. The manufacturer should be able to provide you with the MIBs.
I ended up contacting the manufacturer and asking my question. They said that the system isn't set up for connecting without a password, and their SNMP is very basic and won't provide the information I need. They said to connect to the system with FTP and use "get logs" to download an archive of the configuration and logs. Not exactly ideal, as it takes 4 minutes just to run that one command, but it seems to be my only option. Below is the script I wrote to retrieve the file automatically by adding the login credentials to the .netrc file. This works on RHEL 6.7:
#!/bin/bash
#Retrieve the logs and configuration from a Dot Hill Systems AssuredSAN 3824 automatically.
#Modify "LINE" and "HOST" to fit your configuration.
LINE='machine <IP> login manage password <password>'
HOST='<IP>'
AUTOLOGIN="/root/.netrc"
FILE='logfiles.zip'
#Check for and verify the autologin file
if [ -f $AUTOLOGIN ]; then
    printf "Found auto-login file, checking for proper entry... \r"
    READLINE=`cat $AUTOLOGIN | grep "$LINE"`
    #Append the line to the end of .netrc if file exists but not the line.
    if [ "$LINE" != "$READLINE" ]; then
        printf "Proper entry not found, creating it... \r"
        echo "$LINE" >> "$AUTOLOGIN"
    else
        printf "Proper entry found... \r"
    fi
#Create the Autologin file if it doesn't exist
else
    printf "Auto-Login file does not exist, creating it and setting permissions...\r"
    echo "$LINE" > "$AUTOLOGIN"
    chmod 600 "$AUTOLOGIN"
fi
#Start getting the information from the controller. (This takes a VERY long time)
printf "Retrieving Storage Controller data, this will take awhile... \r"
ftp $HOST << SCRIPT
get logs $FILE
SCRIPT
exit 0
This gave me a bunch of files in the zip, but all I needed was the "store_....logs" file. It was about 500,000 lines long: the first portion is the entire configuration in XML format, then the configuration in text format, followed by the logs from the system. I parsed the file and stripped off the logs at the end, which cut the file down to 15,000 lines. From there I divided it into two files (config.xml and config.txt). I then pulled the XML output of the 3 commands that I needed and wrote it to the 3 files my previously written script searches for. Now my inventory script pulls in everything it needs, albeit pretty slowly due to waiting 4 minutes for the system to generate the zip file. I hope this helps someone in the future.
Edit:
Waiting 4 minutes for the system to compile was taking too long. So I ended up using paramiko and python scripts to dump output from the commands to files that my other code can parse. It accepts the IP of the Controller as a parameter. Here is the script for those interested. Thank you again for all the help.
#!/usr/bin/env python
#Saves output of "show disks" from the storage Controller to an XML file.
import paramiko
import sys
import re
import xmltodict
IP = sys.argv[1]
USERNAME = "manage"
PASSWORD = "password"
FILENAME = "./logfiles/disks.xml"
cmd = "show disks"
client = paramiko.SSHClient()
client.set_missing_host_key_policy(paramiko.AutoAddPolicy())
try:
    client.connect(IP,username=USERNAME,password=PASSWORD)
    stdin, stdout, stderr = client.exec_command(cmd)
except Exception as e:
    sys.exit(1)
data = ""
for line in stdout:
    if re.search('#', line):
        pass
    else:
        data += line
client.close()
f = open(FILENAME, 'w+')
f.write(data)
f.close()
sys.exit(0)
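Since the question needs the output of three commands, the script above could be generalized to run all of them over one connection. This is only a sketch: the output file names and the function names are hypothetical, and it assumes the controller accepts several exec_command calls on the same session, which I have not verified on this hardware.

```python
import os

# The three commands named in the question, mapped to hypothetical output file names.
COMMANDS = {
    "show controllers": "controllers.xml",
    "show frus": "frus.xml",
    "show disks": "disks.xml",
}


def strip_comment_lines(lines):
    """Drop every line containing '#', as the script above does, and join the rest."""
    return "".join(line for line in lines if "#" not in line)


def dump_all(ip, username, password, out_dir="./logfiles"):
    """Run each command in COMMANDS over a single SSH connection and save its output."""
    import paramiko  # imported here so strip_comment_lines works without paramiko

    os.makedirs(out_dir, exist_ok=True)
    client = paramiko.SSHClient()
    client.set_missing_host_key_policy(paramiko.AutoAddPolicy())
    client.connect(ip, username=username, password=password)
    try:
        for cmd, filename in COMMANDS.items():
            _stdin, stdout, _stderr = client.exec_command(cmd)
            with open(os.path.join(out_dir, filename), "w") as f:
                f.write(strip_comment_lines(stdout))
    finally:
        client.close()
```

It would be called once per controller, for example dump_all(sys.argv[1], "manage", "password").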
I have a PDF file and I want to replace some text in it and generate a new PDF. How can I do that in Python?
I have tried reportlab, but reportlab does not have any function to search for text and replace it. What other module can I use?
You can try the Aspose.PDF Cloud SDK for Python. Aspose.PDF Cloud is a REST API for PDF processing. It is a paid API, and its free plan provides 50 credits per month.
I'm a developer evangelist at Aspose.
import os
import asposepdfcloud
from asposepdfcloud.apis.pdf_api import PdfApi
# Get App key and App SID from https://cloud.aspose.com
pdf_api_client = asposepdfcloud.api_client.ApiClient(
    app_key='xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx',
    app_sid='xxxxxxx-xxxx-xxxx-xxxx-xxxxxxxxxxxxxx')
pdf_api = PdfApi(pdf_api_client)
filename = '02_pages.pdf'
remote_name = '02_pages.pdf'
copied_file= '02_pages_new.pdf'
#upload PDF file to storage
pdf_api.upload_file(remote_name,filename)
#copy the uploaded PDF file to a new name in storage
pdf_api.copy_file(remote_name,copied_file)
#Replace Text
text_replace = asposepdfcloud.models.TextReplace(old_value='origami',new_value='polygami',regex='true')
text_replace_list = asposepdfcloud.models.TextReplaceListRequest(text_replaces=[text_replace])
response = pdf_api.post_document_text_replace(copied_file, text_replace_list)
print(response)
Have a look in THIS thread for one of the many ways to read text from a PDF. Then you'll need to create a new pdf, as they will, as far as I know, not retrieve any formatting for you.
The CAM::PDF Perl Library can output text that's not too hard to parse (it seems to fairly randomly split lines of text). I couldn't be bothered to learn too much Perl, so I wrote these really basic Perl command line scripts, one that reads a single page pdf to a text file perl read.pl pdfIn.pdf textOut.txt and one that writes the text (that you can modify in the meantime) to a pdf perl write.pl pdfIn.pdf textIn.txt pdfOut.pdf.
#!/usr/bin/perl
use Module::Load;
load "CAM::PDF";
$pdfIn = $ARGV[0];
$textOut = $ARGV[1];
$pdf = CAM::PDF->new($pdfIn);
$page = $pdf->getPageContent(1);
open(my $fh, '>', $textOut) or die "cannot open file $textOut: $!";
print $fh $page;
close $fh;
exit;
and
#!/usr/bin/perl
use Module::Load;
load "CAM::PDF";
$pdfIn = $ARGV[0];
$textIn = $ARGV[1];
$pdfOut = $ARGV[2];
$pdf = CAM::PDF->new($pdfIn);
my $page;
open(my $fh, '<', $textIn) or die "cannot open file $textIn: $!";
{
    local $/; # slurp mode: read the whole file at once
    $page = <$fh>;
}
close($fh);
$pdf->setPageContent(1, $page);
$pdf->cleanoutput($pdfOut);
exit;
You can call these from Python, before and after doing some regex work on the text file that the first script outputs.
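A rough sketch of that Python glue follows. It assumes the two Perl scripts are saved as read.pl and write.pl in the current directory, as in the command lines above; the function names and the temporary file name page.txt are hypothetical. Because the page content may split text into fragments, a pattern can fail to match text that looks contiguous on the page.

```python
import re
import subprocess


def edit_text_file(path, pattern, replacement):
    """Apply a regex substitution to the file at `path` in place; return the match count."""
    # latin-1 maps every byte to one character, so the content round-trips unchanged.
    with open(path, encoding="latin-1", newline="") as f:
        text = f.read()
    new_text, count = re.subn(pattern, replacement, text)
    with open(path, "w", encoding="latin-1", newline="") as f:
        f.write(new_text)
    return count


def replace_pdf_text(pdf_in, pdf_out, pattern, replacement, tmp_txt="page.txt"):
    """Dump page 1 with read.pl, edit the dump, and write a new PDF with write.pl."""
    subprocess.check_call(["perl", "read.pl", pdf_in, tmp_txt])
    count = edit_text_file(tmp_txt, pattern, replacement)
    subprocess.check_call(["perl", "write.pl", pdf_in, tmp_txt, pdf_out])
    return count
```

For example: replace_pdf_text('pdfIn.pdf', 'pdfOut.pdf', 'origami', 'polygami').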
If you're completely new to Perl (like I was), you need to make sure that Perl and CPAN are installed, then run sudo cpan and, at its prompt, enter install "CAM::PDF"; this will install the required modules.
Also, I realise that I should probably be using stdout etc., but I was in a hurry :-)
Also also, any idea what format CAM::PDF outputs? Is there any documentation for it?
I am using Python scripts to load data into a database through a bulk loader.
The input to the loader is stdin. I have been unable to find the correct syntax to call the Unix-based bulk loader and pass it the contents of a Python list to be loaded.
I have been reading about Popen and PIPE, but they have not been behaving as I expect.
The Python list contains database records to be bulk-loaded. In Linux it would look similar to this:
echo "this is the string being written to the DB" | sql -c "COPY table FROM stdin"
What would be the correct way to replace the echo statement with a Python list in this command?
I do not have sample code for this process, as I have only been experimenting with the features of Popen and PIPE using some very simple syntax, without obtaining the desired result.
Any help would be very much appreciated.
Thanks
If your data is short and simple, you could preformat the entire list and pass it in one go with subprocess, like this:
import subprocess
data = ["list", "of", "stuff"]
proc = subprocess.Popen(["sql", "-c", "COPY table FROM stdin"], stdin=subprocess.PIPE)
proc.communicate("\n".join(data))
If the data is too big to preformat like this, then you can write to the stdin pipe directly. Be careful if you also need to read from stdout/stderr: writing and reading the pipes by hand can deadlock once one of the pipe buffers fills up.
for line in data:
    print >>proc.stdin, line
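The print >>proc.stdin form is Python 2 syntax. Under Python 3 (3.7 or later for the text argument) the streaming variant could look like the sketch below; the function name feed_lines is hypothetical, and the loader command is the one from the question.

```python
import subprocess


def feed_lines(command, lines):
    """Stream `lines` to the stdin of `command`, one record per line; return its exit code."""
    # text=True makes proc.stdin accept str instead of bytes.
    proc = subprocess.Popen(command, stdin=subprocess.PIPE, text=True)
    try:
        for line in lines:
            proc.stdin.write(line + "\n")
    finally:
        proc.stdin.close()  # closing stdin tells the loader there is no more input
    return proc.wait()
```

For the bulk loader this would be feed_lines(["sql", "-c", "COPY table FROM stdin"], data). Because the child's stdout and stderr are not redirected to pipes here, the deadlock risk that comes with reading them by hand does not arise.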