Batch file downloading using Perl or any other language

I have pretty good knowledge of JS, HTML, CSS, C, C++ and C#. There is a website that offers question papers for us school students, but to download them we have to visit every page, which is tedious. There are about 150 files. So... ;)
The download links always look like this:
http://www.example.com/content/download_content.php?content_id=#
where # is a number.
So I wondered whether JavaScript, Perl, Python or any other language could download the files and save them locally automatically. I don't need much for now, just the basic code. I'll learn the language and then build on it myself. So please help me out, pals.

That's how I usually do such things in bash:
for i in `seq 1 1000` ; do wget "http://www.example.com/content/download_content.php?content_id=$i" -O $i.html ; done
UPDATE: Since the URLs point to more than one file type, you could use the file command to identify the type of each downloaded file and adjust the extension accordingly:
for i in `seq 1 1000`
do
    wget "http://www.example.com/content/download_content.php?content_id=$i" -O "$i.out"
    mime=`file --brief --mime-type "$i.out"`
    if [ "$mime" == "application/pdf" ]
    then
        mv "$i.out" "$i.pdf"
    elif [ "$mime" == "application/vnd.ms-office" ]
    then
        mv "$i.out" "$i.doc"
    fi
done
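Since the question also mentions Python, the same idea can be sketched there without the file command by looking at the first bytes of each download. This is only a rough sketch: the helper names guess_extension and rename_by_content and the small signature table are assumptions made for illustration, and the table covers just the two types handled in the loop above.

```python
from pathlib import Path

# Leading "magic" bytes for the two types the shell loop distinguishes.
# Illustrative only: the OLE2 signature is shared by legacy .doc, .xls and .ppt.
MAGIC_EXTENSIONS = [
    (b"%PDF-", ".pdf"),
    (b"\xd0\xcf\x11\xe0\xa1\xb1\x1a\xe1", ".doc"),
]


def guess_extension(head: bytes, default: str = ".out") -> str:
    """Return an extension based on the leading bytes of a file."""
    for magic, ext in MAGIC_EXTENSIONS:
        if head.startswith(magic):
            return ext
    return default


def rename_by_content(path: Path) -> Path:
    """Rename e.g. 1.out to 1.pdf if its content looks like a PDF."""
    with path.open("rb") as fh:
        head = fh.read(16)
    target = path.with_suffix(guess_extension(head))
    if target != path:
        path.rename(target)
    return target
```

Files whose type is not recognised keep the .out extension, as in the shell version.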

This will do it in shell script using the wget program, dumping them all into the current directory:
#!/bin/sh
i=1
while [ $i -le 150 ]; do
    wget -O "$i.out" "http://www.example.com/content/download_content.php?content_id=$i"
    # no spaces around '=' in a shell assignment
    i=$((i + 1))
done

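How about using curl instead:
curl -o "#1.out" "http://www.example.com/content/download_content.php?content_id=[1-150]"
The URL is quoted so the shell leaves the [1-150] range to curl, and #1 in the output name is replaced by the current number. This should work on most Linux distros; if curl isn't installed, you can download it from http://curl.haxx.se/ or install it with 'apt-get install curl'.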
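For a Python starting point, something along these lines might do, using only the standard library. It is a minimal sketch under assumptions: the function names build_url, fetch and download_all are made up here, the ID range 1 to 150 comes from the question, and every file is saved as <id>.out because the real type isn't known in advance.

```python
import urllib.request
from pathlib import Path

# URL pattern from the question; {} is replaced by the content id.
BASE_URL = "http://www.example.com/content/download_content.php?content_id={}"


def build_url(content_id: int) -> str:
    return BASE_URL.format(content_id)


def fetch(url: str) -> bytes:
    """Download one URL and return the raw body."""
    with urllib.request.urlopen(url) as response:
        return response.read()


def download_all(ids, dest_dir=".", fetcher=fetch):
    """Save each id as <id>.out in dest_dir; skip ids that fail."""
    dest = Path(dest_dir)
    dest.mkdir(parents=True, exist_ok=True)
    saved = []
    for content_id in ids:
        try:
            data = fetcher(build_url(content_id))
        except OSError as err:  # urllib's URLError and HTTPError derive from OSError
            print(f"skipping {content_id}: {err}")
            continue
        out = dest / f"{content_id}.out"
        out.write_bytes(data)
        saved.append(out)
    return saved
```

Calling download_all(range(1, 151)) would then try all 150 links and drop the results into the current directory. The fetcher parameter is only there so the loop can be tried out without touching the network.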

Related

unzip operation taking several hours

I am using the following shell script to loop over 90 zip files & unarchive them on a Linux box hosted with Hostinger (Shared web hosting)
#!/bin/bash
SOURCE_DIR="<path_to_archives>"
cd "${SOURCE_DIR}"
for f in *.zip
do
    # unzip -oqq "$f" -d "${f%.zip}" &
    python3 scripts/extract_archives.py "${f}" &
done
wait
The python script being called by the above shell script is below -
import shutil
import sys

source_path = "<path to source dir>"

def extract_files(in_file):
    # Unpack the archive into a folder named after the file (text before the first dot)
    shutil.unpack_archive(source_path + in_file, source_path + in_file.split('.')[0])
    print('Extracted : ', in_file)

extract_files(sys.argv[1].strip())
Irrespective of whether I use the built-in unzip command or the Python script, it takes about 2.5 hours to unzip all the files. Unarchiving all the zip files results in 90 folders with 170,000 files overall. I would have thought anywhere between 15 and 20 minutes would be a reasonable timeframe.
I've tried a few variations. I tried tarring the folders instead of zipping them, thinking that un-tarring might be faster than unzipping. I also used the tar command on the source server to transfer the files over ssh and untar them on the fly, something like this:
time tar zcf - . | ssh -p <port> user@host "tar xzf - -C <dest dir>"
Nothing is helping. I am open to using any other programming language, such as Perl or Go, if necessary to speed things up.
Can someone please help me solve this performance problem?

Thank you everyone for your answers. As you indicated, this was to do with throttling on the servers in a hosted environment
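For anyone hitting something similar, the shell loop in the question starts all 90 extractions at once, which a throttled shared host may handle badly. A rough sketch of capping the number of simultaneous extractions in Python follows. It will not lift any limit imposed by the host. The names extract_one and extract_all and the default of 4 workers are assumptions for illustration.

```python
import zipfile
from concurrent.futures import ThreadPoolExecutor
from pathlib import Path


def extract_one(archive: Path) -> Path:
    """Extract archive.zip into a sibling folder named 'archive'."""
    target = archive.with_suffix("")
    with zipfile.ZipFile(archive) as zf:
        zf.extractall(target)
    return target


def extract_all(source_dir, max_workers=4):
    """Extract every .zip in source_dir, at most max_workers at a time."""
    archives = sorted(Path(source_dir).glob("*.zip"))
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        return list(pool.map(extract_one, archives))
```

Trying a few values of max_workers and timing each run would show whether the host slows down under parallel load.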

wget set file name in batch download

I am trying to download files using an input file (a.txt) which has URLs using the following commands
wget -i a.txt
URLs are like
https://domian.com/abc?api=123&xyz=323&title=newFile12
https://domian.com/abc?api=1243&xyz=3223&title=newFile13
I want to set the name of each downloaded file from the title parameter in its URL (for example, the file from the first URL above should be named newFile12), but I can't find a way to do it.
To get it done, I would have to write a Python script (similar to this answer: https://stackoverflow.com/a/28313383/10549469) and run the downloads one by one. Is there any other way?

You can create a script on the fly and pipe it to bash. It is a bit slower than wget -i but sets the file names from the title parameter:
sed "s/\(.*title=\(.*\)\)/wget -O '\2' '\1'/" a.txt
When you are satisfied with the results, you can pipe it to bash:
sed "s/\(.*title=\(.*\)\)/wget -O '\2' '\1'/" a.txt | bash
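If the URLs ever stop ending in title=..., the sed pattern above no longer matches cleanly. As a hedged alternative, the same commands can be generated in Python by parsing the query string properly. The function names filename_from_url and wget_commands and the fallback name "download" are assumptions for illustration.

```python
import shlex
from urllib.parse import parse_qs, urlsplit


def filename_from_url(url, param="title", default="download"):
    """Return the value of the given query parameter, or a fallback name."""
    values = parse_qs(urlsplit(url).query).get(param)
    return values[0] if values else default


def wget_commands(urls):
    """Build one 'wget -O <title> <url>' command per URL, shell-quoted."""
    return [
        f"wget -O {shlex.quote(filename_from_url(u))} {shlex.quote(u)}"
        for u in urls
    ]


urls = [
    "https://domian.com/abc?api=123&xyz=323&title=newFile12",
    "https://domian.com/abc?api=1243&xyz=3223&title=newFile13",
]
for command in wget_commands(urls):
    print(command)
```

As with the sed version, the printed commands can be checked first and then piped to bash.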

Have a look at wget --content-disposition or for loop with wget -O <outputfile name> <url>
Following command downloads the file with filename as provided by server (vim-readonly-1.1.tar.gz) instead of download_script.php?src_id=27233.
wget --content-disposition https://www.vim.org/scripts/download_script.php?src_id=27233

CMD in Windows 7 does not execute command (Python Django)

OK people, the technical solution at this link for pysec includes some code that you must type at the command prompt (I think, because it has a dollar sign in front):
$ cd ~/path/to/pysec && python -c "import sqlite3; sqlite3.connect('edgar.db')"
$ mv ./local-settings-example.py ./local-settings.py
$ mkdir ./pysec/data
However, whenever I go to C:\Python27\pysec-master, which is where the pysec files are stored (according to the instructions), and type these commands exactly as I see them, I get a message that the system cannot find the path specified.
Like this
C:\Python27\pysec-master>cd ~/path/to/pysec && python -c
cmd response --> The system cannot find the path specified.
C:\Python27\pysec-master>cd ~/path/to/pysec && python -c "import sqlite3; sqlite3.connect('edgar.db')"
cmd response --> The system cannot find the path specified.
C:\Python27\pysec-master>mv ./local-settings-example.py ./local-settings.py
cmd response --> 'mv' is not recognized as an internal or external command, operable program or batch file.
C:\Python27\pysec-master>mkdir ./pysec/data
cmd response --> The syntax of the command is incorrect.
What seems to be the problem? Don't you have to type these commands in the cmd since they have a dollar sign?

THE ANSWER TO THIS QUESTION CAME FROM THE COMMENTS UNDER THE QUESTION, BY USER Stephan.
I decided to put it all together in one place:
cd ~..., mv ./ and mkdir ./ look more like Unix syntax than Windows cmd. cd and mkdir work on both platforms, but with different syntax. The cmd version of mv is move. (ANSWER)
Also, /path/to/pysec tells you that you should put in the path to pysec, not the literal string "\path\to\pysec". (ANSWER)
Can we transform these commands to Windows syntax? (QUESTION)
Should we put the path to pysec like this: C:\Python27\pysec-master, I mean the full or absolute path, as it is called? Because in that tutorial I can see that the example is trimmed to cd ~/path/to/pysec. (QUESTION)
The tilde (~) has a special meaning in Unix: it expands to the current user's home directory. The CMD command would be: cd /d "c:\Python27\pysec-master" (in CMD use \, in Unix it's /). Instead of mv use move. (ANSWER)
Only the third command, mkdir ./pysec/data, does not seem to work. I think there must be something different for Windows. (QUESTION)
mkdir .\pysec\data ... You remember? "In CMD use \, in Unix it's /". (ANSWER)
THANK YOU FOR THE SUPPORT
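Since Python is already installed here, another way to sidestep the \ versus / difference is to do the three steps from Python itself. The sketch below assumes Python 3 (the question uses Python 2.7, where pathlib is not in the standard library), and the function name prepare_pysec is made up for illustration.

```python
import shutil
import sqlite3
from pathlib import Path, PurePosixPath, PureWindowsPath

# The same relative path rendered with each platform's separator.
print(PurePosixPath("pysec", "data"))    # pysec/data
print(PureWindowsPath("pysec", "data"))  # pysec\data


def prepare_pysec(project_dir):
    """Portable equivalent of the three shell commands from the tutorial."""
    project = Path(project_dir)
    # python -c "import sqlite3; sqlite3.connect('edgar.db')"
    sqlite3.connect(str(project / "edgar.db")).close()
    # mv ./local-settings-example.py ./local-settings.py
    shutil.move(str(project / "local-settings-example.py"),
                str(project / "local-settings.py"))
    # mkdir ./pysec/data
    (project / "pysec" / "data").mkdir(parents=True, exist_ok=True)
```

Called as prepare_pysec(r"C:\Python27\pysec-master"), this would behave the same on Windows and Unix because pathlib picks the right separator.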

Have the same README both in Markdown and reStructuredText

I have a project hosted on GitHub. For this I have written my README using the Markdown syntax in order to have it nicely formatted on GitHub.
As my project is in Python, I also plan to upload it to PyPI. The syntax used for READMEs on PyPI is reStructuredText.
I would like to avoid having to handle two READMEs containing roughly the same content, so I searched for a Markdown to RST (or the other way around) translator, but couldn't find any.
The other solution I see is to perform a Markdown-to-HTML and then an HTML-to-RST translation. I found some resources for this, so I guess it should be possible.
Would you have any idea that could fit better with what I want to do?

I would recommend Pandoc, the "swiss-army knife for converting files from one markup format into another" (check out the diagram of supported conversions at the bottom of the page, it is quite impressive). Pandoc allows markdown to reStructuredText translation directly. There is also an online editor here which lets you try it out, so you could simply use the online editor to convert your README files.

As @Chris suggested, you can use Pandoc to convert Markdown to RST. This can be automated using the pypandoc module and some magic in setup.py:
from setuptools import setup
try:
    # older pypandoc releases exposed this function as pypandoc.convert
    from pypandoc import convert_file
    read_md = lambda f: convert_file(f, 'rst')
except ImportError:
    print("warning: pypandoc module not found, could not convert Markdown to RST")
    read_md = lambda f: open(f, 'r').read()
setup(
    # name, version, ...
    long_description=read_md('README.md'),
    install_requires=[]
)
This will automatically convert README.md to RST for the long description used on PyPI. When pypandoc is not available, it just reads README.md without the conversion, so that others are not forced to install pypandoc when they only want to build the module, not upload it to PyPI.
So you can write in Markdown as usual and not care about the RST mess anymore. ;)

2019 Update
The PyPI Warehouse now supports rendering Markdown as well! You just need to update your package configuration and add the long_description_content_type='text/markdown' to it. e.g.:
setup(
    name='an_example_package',
    # other arguments omitted
    long_description=long_description,
    long_description_content_type='text/markdown'
)
Therefore, there is no need to keep the README in two formats any longer.
You can find more information about it in the documentation.
Old answer:
The Markup library used by GitHub supports reStructuredText. This means you can write a README.rst file.
They even support syntax specific color highlighting using the code and code-block directives (Example)

PyPI now supports Markdown for long descriptions!
In setup.py, set long_description to a Markdown string, add long_description_content_type="text/markdown" and make sure you're using recent tooling (setuptools 38.6.0+, twine 1.11+).
See Dustin Ingram's blog post for more details.
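The two settings mentioned above can be sketched as a small helper that reads the README and returns the matching setup() arguments. The helper name readme_kwargs is hypothetical; only long_description and long_description_content_type are actual setup() keywords.

```python
from pathlib import Path


def readme_kwargs(readme="README.md"):
    """Return the setup() keyword arguments for a Markdown long description."""
    return {
        "long_description": Path(readme).read_text(encoding="utf-8"),
        "long_description_content_type": "text/markdown",
    }
```

In setup.py this would be used as setup(name=..., **readme_kwargs()), with the other arguments unchanged.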

You might also be interested in the fact that it is possible to write in a common subset so that your document comes out the same way when rendered as markdown or rendered as reStructuredText: https://gist.github.com/dupuy/1855764 ☺

For my requirements I didn't want to install Pandoc on my computer, so I used docverter. Docverter is a document conversion server with an HTTP interface that uses Pandoc for this.
import requests

# Upload README.md and ask the server to convert Markdown to RST
with open('README.md', 'rb') as readme:
    r = requests.post(url='http://c.docverter.com/convert',
                      data={'to': 'rst', 'from': 'markdown'},
                      files={'input_files[]': readme})
if r.ok:
    print(r.text)

I ran into this problem and solved it with the two following bash scripts.
Note that I have LaTeX bundled into my Markdown.
#!/usr/bin/env bash
if [ $# -lt 1 ]; then
    echo "$0 file.md"
    exit
fi
filename=$(basename "$1")
extension="${filename##*.}"
filename="${filename%.*}"
if [ "$extension" = "md" ]; then
    rst=".rst"
    pandoc "$1" -o "$filename$rst"
fi
It's also useful to convert to HTML. md2html:
#!/usr/bin/env bash
if [ $# -lt 1 ]; then
    echo "$0 file.md <style.css>"
    exit
fi
filename=$(basename "$1")
extension="${filename##*.}"
filename="${filename%.*}"
if [ "$extension" = "md" ]; then
    html=".html"
    # Note: -S was removed in pandoc 2.0; there, use -f markdown+smart instead
    if [ -z "$2" ]; then
        # if no css
        pandoc -s -S --mathjax --highlight-style pygments "$1" -o "$filename$html"
    else
        pandoc -s -S --mathjax --highlight-style pygments -c "$2" "$1" -o "$filename$html"
    fi
fi
I hope that helps.

Using the pandoc tool suggested by others I created a md2rst utility to create the rst files. Even though this solution means you have both an md and an rst it seemed to be the least invasive and would allow for whatever future markdown support is added. I prefer it over altering setup.py and maybe you would as well:
#!/usr/bin/env python
'''
Recursively and destructively creates a .rst file for all Markdown
files in the target directory and below.
Created to deal with PyPa without changing anything in setup based on
the idea that getting proper Markdown support later is worth waiting
for rather than forcing a pandoc dependency in sample packages and such.
Vote for
(https://bitbucket.org/pypa/pypi/issue/148/support-markdown-for-readmes)
'''
import sys, os, re

markdown_sufs = ('.md', '.markdown', '.mkd')
markdown_regx = r'\.(md|markdown|mkd)$'

target = '.'
if len(sys.argv) >= 2:
    target = sys.argv[1]

# Collect every Markdown file below the target directory
md_files = []
for root, dirnames, filenames in os.walk(target):
    for name in filenames:
        if name.endswith(markdown_sufs):
            md_files.append(os.path.join(root, name))

# Convert each one with pandoc, writing <name>.rst next to it
for md in md_files:
    bare = re.sub(markdown_regx, '', md)
    cmd = 'pandoc --from=markdown --to=rst "{}" -o "{}.rst"'
    print(cmd.format(md, bare))
    os.system(cmd.format(md, bare))

Simplest way to run Sphinx on one python file

We have a Sphinx configuration that'll generate a slew of HTML documents for our whole codebase. Sometimes, I'm working on one file and I just would like to see the HTML output from that file to make sure I got the syntax right without running the whole suite.
I looked for the simplest command I could run in a terminal to run sphinx on this one file and I'm sure the info's out there but I didn't see it.

Sphinx processes reST files (not Python files directly). Those files may contain references to Python modules (when you use autodoc). My experience is that if only a single Python module has been modified since the last complete output build, Sphinx does not regenerate everything; only the reST file that "pulls in" that particular Python module is processed. There is a message saying updating environment: 0 added, 1 changed, 0 removed.
To explicitly process a single reST file, specify it as an argument to sphinx-build:
sphinx-build -b html -d _build/doctrees . _build/html your_filename.rst
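If you want to trigger that single-file build from a script or an editor hook, the same command line can be wrapped in Python. This is only a convenience sketch: the function names and the default directories ("." and "_build") mirror the command above and are otherwise assumptions.

```python
import subprocess


def sphinx_single_file_cmd(filename, source_dir=".", build_dir="_build"):
    """Build the sphinx-build argument list for one reST file."""
    return [
        "sphinx-build", "-b", "html",
        "-d", f"{build_dir}/doctrees",
        source_dir, f"{build_dir}/html",
        filename,
    ]


def build_single_file(filename, **kwargs):
    """Run sphinx-build for one file and return its exit code."""
    return subprocess.run(sphinx_single_file_cmd(filename, **kwargs)).returncode
```

build_single_file("your_filename.rst") would then run the command shown above, provided sphinx-build is on the PATH.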

This is done in two steps:
1. Generate an rst file from the Python module with sphinx-apidoc.
2. Generate HTML from the rst file with sphinx-build.
This script does the work. Call it from the directory that contains the module and pass it the module's file name:
#!/bin/bash
# Generate HTML documentation for a single Python module
PACKAGE=${PWD##*/}
MODULE="$1"
MODULE_NAME=${MODULE%.py}
mkdir -p .tmpdocs
rm -rf .tmpdocs/*
# The paths after "$PWD" are exclusions for sphinx-apidoc:
# first all directories, then all other modules
# (apidoc crashes if __init__.py is excluded).
# Comments must stay outside the continued command, or they cut it short.
sphinx-apidoc \
    -f -e --module-first --no-toc -o .tmpdocs "$PWD" \
    $(find "$PWD" -maxdepth 1 -mindepth 1 -type d) \
    $(find "$PWD" -maxdepth 1 -regextype posix-egrep \
        ! -regex ".*/$MODULE|.*/__init__.py" -type f)
rm ".tmpdocs/$PACKAGE.rst"
# build crashes if index.rst does not exist
touch .tmpdocs/index.rst
sphinx-build -b html -c /path/to/your/conf.py/ \
    -d .tmpdocs .tmpdocs .tmpdocs .tmpdocs/*.rst
echo "**** HTML-documentation for $MODULE is available in .tmpdocs/$PACKAGE.$MODULE_NAME.html"
