I want to check 100,000+ URLs on a Linux server (or a web server running Apache).
The links point to OTA zip packages for my Android device.
Among them only one link is valid; the rest return a 404 error.
How can I check all of the links in the shortest possible time?
The URLs have this structure:
http://link.com/updateOTA_1.zip
http://link.com/updateOTA_2.zip
http://link.com/updateOTA_999999999.zip
What I tried:
I wrote the script below (http://pastebin.com/KVxnzttA), but it is really slow. I also increased the number of parallel requests up to 500, and then my server crashed :[
#!/bin/bash
for a in {1487054155500..1487055000000}
do
    if [ $((a%50)) = 0 ]
    then
        curl -s -I http://link.com/updateOTA_$((a)).zip | head -n1 &
        curl -s -I http://link.com/updateOTA_$((a+1)).zip | head -n1 &
        curl -s -I http://link.com/updateOTA_$((a+2)).zip | head -n1 &
        curl -s -I http://link.com/updateOTA_$((a+3)).zip | head -n1 &
        curl -s -I http://link.com/updateOTA_$((a+4)).zip | head -n1 &
        ...
        curl -s -I http://link.com/updateOTA_$((a+49)).zip | head -n1 &
        curl -s -I http://link.com/updateOTA_$((a+50)).zip | head -n1
        wait
        echo "$((a))"
    fi
done
I tried aria2, but the maximum number of connections per server in aria2 is 16, so that failed as well.
I tried some online tools, but they restrict you to 100 URLs.
Running curl 100,000+ times is going to be slow. Instead, write batches of URLs to a single instance of curl to reduce the overhead of starting curl.
# This loop doesn't require pre-generating a list of a million integers
for ((a=1487054155500; a<=1487055000000; a+=50)); do
    for ((k=0; k<50; k++)); do
        printf 'url = %s\n' "http://link.com/updateOTA_$((a+k)).zip"
    done | curl -s -I -K - -w 'result: %{http_code} %{url_effective}\n' | grep -F 'result:' > batch-$a.txt
done
The -w option is used to produce output associating each URL with its result, should you want that.
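Once the batches have finished, the single valid URL can be pulled from the result files by filtering out the 404s; a minimal sketch, assuming the batch-$a.txt files produced above:

# Every line is a "result: <code> <url>" record; with only one valid link,
# dropping the 404s should leave a single line.
grep -hv 'result: 404' batch-*.txt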
However, I found a solution using aria2c.
It is now scanning about 7,000 URLs per minute.
Thanks to all.
aria2c -i url -s16 -x16 --max-concurrent-downloads=1000
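For reference, the -i option expects a plain text file with one URL per line. A sketch of generating that list for the same numeric range as the script above (the file name "url" matches the command):

# Build the "url" input file that aria2c reads via -i (one URL per line).
seq 1487054155500 1487055000000 | sed 's|.*|http://link.com/updateOTA_&.zip|' > url

Since only the status check matters here, aria2c's --dry-run option can also be added so the one file that does exist is not actually downloaded.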
I am stuck converting my script, which uses ssh to activate nodes, to pbsdsh. I am using Ray for node communication. My script with ssh is:
#!/bin/bash
#PBS -N Experiment_1
#PBS -l select=2:ncpus=24:mpiprocs=24
#PBS -P CSCIxxxx
#PBS -q normal
#PBS -l walltime=01:30:00
#PBS -m abe
#PBS -M xxxxx@gmail.com
ln -s $PWD $PBS_O_WORKDIR/$PBS_JOBID
cd $PBS_O_WORKDIR
jobnodes=`uniq -c ${PBS_NODEFILE} | awk -F. '{print $1 }' | awk '{print $2}' | paste -s -d " "`
thishost=`uname -n | awk -F. '{print $1.}'`
thishostip=`hostname -i`
rayport=6379
thishostNport="${thishostip}:${rayport}"
echo "Allocate Nodes = <$jobnodes>"
export thishostNport
echo "set up ray cluster..."
for n in `echo ${jobnodes}`
do
    if [[ ${n} == "${thishost}" ]]
    then
        echo "first allocate node - use as headnode ..."
        module load chpc/python/anaconda/3-2019.10
        source /apps/chpc/chem/anaconda3-2019.10/etc/profile.d/conda.sh
        conda activate /home/mnasir/env1
        ray start --head
        sleep 5
    else
        ssh ${n} $PBS_O_WORKDIR/startWorkerNode.pbs ${thishostNport}
        sleep 10
    fi
done
python -u example_trainer.py
rm $PBS_O_WORKDIR/$PBS_JOBID
#
where startWorkerNode.pbs is:
#!/bin/bash -l
source $HOME/.bashrc
cd $PBS_O_WORKDIR
param1=$1
destnode=`uname -n`
echo "destnode is = [$destnode]"
module load chpc/python/anaconda/3-2019.10
source /apps/chpc/chem/anaconda3-2019.10/etc/profile.d/conda.sh
conda activate /home/mnasir/poet
ray start --address="${param1}" --redis-password='5241590000000000'
and the example_trainer.py is:
from collections import Counter
import os
import socket
import sys
import time
import ray
num_cpus = int(sys.argv[1])
ray.init(address=os.environ["thishostNport"])
print("Nodes in the Ray cluster:")
print(ray.nodes()) # This should print all N nodes we are trying to access
@ray.remote
def f():
    time.sleep(1)
    return socket.gethostbyname(socket.gethostname()) + "--" + str(socket.gethostname())

# The following takes one second (assuming that
# ray was able to access all of the allocated nodes).
for i in range(60):
    start = time.time()
    ip_addresses = ray.get([f.remote() for _ in range(num_cpus)])
    print("GOT IPs", ip_addresses)
    print(Counter(ip_addresses))
    end = time.time()
    print(end - start)
This works perfectly and communicates across all nodes, but when I try to change the ssh command to pbsdsh it returns:
pbsdsh: task 0x00000000 exit status 254
pbsdsh: task 0x00000001 exit status 254
That is with mpiprocs=1; if it is set to 24, the message is repeated 48 times.
To the best of my knowledge, Ray needs a head node to which the worker nodes connect, hence the for loop and the if statement inside it.
I have tried directly replacing ssh with pbsdsh in the script, with and without identifying the nodes. I have also moved pbsdsh out of the loop and tried a whole lot of possible combinations.
I have followed these questions but could not get my code to communicate across the nodes:
PBS/TORQUE: how do I submit a parallel job on multiple nodes?
How to execute a script on every allocated node with PBS
Handle multiple nodes in one pbs job
I believe there is something fairly small that I am not managing to implement. Your help and guidance will be highly appreciated!
There are a few main things that needed to change to solve this problem:
Use #PBS -l select=2:ncpus=24:mpiprocs=1 as the selector line; specifically, change mpiprocs from 24 to 1 so that pbsdsh launches only one process per node instead of 24.
Inside jobscript.sh, in the else branch, use pbsdsh -n $J -- $PBS_O_WORKDIR/startWorkerNode.pbs ${thishostNport} & to run pbsdsh on a single node, in the background. J is kept as a node index and is incremented on each iteration of the for loop, so ray start runs exactly once on each worker node.
Inside startWorkerNode.pbs, add this code at the end
# Here, sleep for the duration of the job, so ray does not stop
WALLTIME=$(qstat -f $PBS_JOBID | sed -rn 's/.*Resource_List.walltime = (.*)/\1/p')
SECONDS=`echo $WALLTIME | awk -F: '{ print ($1 * 3600) + ($2 * 60) + $3 }'`
echo "SLEEPING FOR $SECONDS s"
sleep $SECONDS
This ensures that ray start does not exit as soon as the pbsdsh command returns and is kept alive for the duration of the job. The & in the previous point is also necessary here: since startWorkerNode.pbs now sleeps for the whole walltime, pbsdsh would otherwise never return.
Here are the files for reference:
startWorkerNode.pbs
#!/bin/bash -l
source $HOME/.bashrc
cd $PBS_O_WORKDIR
param1=$1
destnode=`uname -n`
echo "destnode is = [$destnode]"
module load chpc/python/anaconda/3-2019.10
source /apps/chpc/chem/anaconda3-2019.10/etc/profile.d/conda.sh
conda activate /home/mnasir/poet
ray start --address="${param1}" --redis-password='5241590000000000'
# Here, sleep for the duration of the job, so ray does not stop
WALLTIME=$(qstat -f $PBS_JOBID | sed -rn 's/.*Resource_List.walltime = (.*)/\1/p')
SECONDS=`echo $WALLTIME | awk -F: '{ print ($1 * 3600) + ($2 * 60) + $3 }'`
echo "SLEEPING FOR $SECONDS s"
sleep $SECONDS
jobscript.sh
#!/bin/bash
#PBS -N Experiment_1
#PBS -l select=2:ncpus=24:mpiprocs=1
#PBS -P CSCIxxxx
#PBS -q normal
#PBS -l walltime=01:30:00
#PBS -m abe
#PBS -M xxxxx@gmail.com
ln -s $PWD $PBS_O_WORKDIR/$PBS_JOBID
cd $PBS_O_WORKDIR
jobnodes=`uniq -c ${PBS_NODEFILE} | awk -F. '{print $1 }' | awk '{print $2}' | paste -s -d " "`
thishost=`uname -n | awk -F. '{print $1.}'`
thishostip=`hostname -i`
rayport=6379
thishostNport="${thishostip}:${rayport}"
echo "Allocate Nodes = <$jobnodes>"
export thishostNport
echo "set up ray cluster..."
J=0
for n in `echo ${jobnodes}`
do
    if [[ ${n} == "${thishost}" ]]
    then
        echo "first allocate node - use as headnode ..."
        module load chpc/python/anaconda/3-2019.10
        source /apps/chpc/chem/anaconda3-2019.10/etc/profile.d/conda.sh
        conda activate /home/mnasir/env1
        ray start --head
        sleep 5
    else
        # Run pbsdsh on the J'th node, and do it in the background.
        pbsdsh -n $J -- $PBS_O_WORKDIR/startWorkerNode.pbs ${thishostNport} &
        sleep 10
    fi
    J=$((J+1))
done
python -u example_trainer.py 48
rm $PBS_O_WORKDIR/$PBS_JOBID
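For completeness, the job is then submitted as usual (assuming the job script above is saved as jobscript.sh):

qsub jobscript.sh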
The output of my first command, "bcftools query -l {input.invcf} | head -n 1", is the name of the first individual in the VCF file (i.e. IND1). I want to use that output in the -sn IND1 option of GATK SelectVariants. How can I integrate the first command into the Snakemake rule so that its output is used by the next one?
rule selectvar:
    input:
        invcf="{family}_my.vcf"
    params:
        ind= ???
        ref="ref.fasta"
    output:
        out="{family}.dn.vcf"
    shell:
        """
        bcftools query -l {input.invcf} | head -n 1 > {params.ind}
        gatk --java-options "-Xms2G -Xmx2g -XX:ParallelGCThreads=2" SelectVariants -R {params.ref} -V {input.invcf} -sn {params.ind} -O {output.out}
        """
There are several options, but the easiest one is to store the results into a temporary bash variable:
rule selectvar:
    ...
    shell:
        """
        myparam=$(bcftools query -l {input.invcf} | head -n 1)
        gatk -sn "$myparam" ...
        """
As noted by @dariober, if one modifies pipefail behaviour, there could be unexpected results; see the example in this answer.
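For illustration (this is not from the original answer): Snakemake typically runs shell commands in bash strict mode, and under set -o pipefail a pipeline is reported as failed if any stage fails, including bcftools being killed by SIGPIPE when head -n 1 exits early. The same effect can be reproduced with any command that keeps writing after head has closed the pipe:

# With pipefail, the pipeline's status comes from the failed stage:
# `yes` is killed by SIGPIPE once `head` exits.
set -o pipefail
yes | head -n 1
echo "pipeline exit status: $?"   # typically 141 (128 + SIGPIPE), not 0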
When I have to do these things I prefer to use run instead of shell, and then shell out only at the end.
The reason is that this makes it possible for Snakemake to lint the run statement, and to exit early if something goes wrong instead of following through with a broken shell command.
rule selectvar:
    input:
        invcf="{family}_my.vcf"
    params:
        ref="ref.fasta",
        gatk_opts='--java-options "-Xms2G -Xmx2g -XX:ParallelGCThreads=2" SelectVariants'
    output:
        out="{family}.dn.vcf"
    run:
        opts = "{params.gatk_opts} -R {params.ref} -V {input.invcf} -O {output.out}"
        # read=True makes shell() return the command's output (depending on the
        # Snakemake version it may come back as bytes, hence the decode handling)
        raw = shell("bcftools query -l {input.invcf} | head -n 1", read=True)
        sn_parameter = (raw.decode() if isinstance(raw, bytes) else raw).strip()
        # we could add a sanity check here if necessary, before shelling out
        shell("gatk " + opts + " -sn " + sn_parameter)
I think I found a solution:
rule selectvar:
    input:
        invcf="{family}_my.vcf"
    params:
        ref="ref.fasta"
    output:
        out="{family}.dn.vcf"
    shell:
        """
        gatk --java-options "-Xms2G -Xmx2g -XX:ParallelGCThreads=2" SelectVariants -R {params.ref} -V {input.invcf} -sn `bcftools query -l {input.invcf} | head -n 1` -O {output.out}
        """
I filtered the issues that have no subtasks with Python:
#!/usr/bin/python
import sys
import json

sys.stdout = open('output.txt', 'wt')

datapath = sys.argv[1]
data = json.load(open(datapath))

for issue in data['issues']:
    if len(issue['fields']['subtasks']) == 0:
        print(issue['key'])
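For reference, the script takes the path of the exported JSON as its only argument and writes the matching keys to output.txt; assuming it is saved as filter_issues.py (a name chosen here for illustration), it is run as:

python filter_issues.py exported_issues.json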
The tasks without subtasks are stored in output.txt (and this works fine):
TECH-729
TECH-124
Now I have a different issue: it seems the value of the $p variable isn't passed to curl (it is able to log in to JIRA but does not create the subtasks):
while read -r p; do
echo $p
curl -D- -u user:pass -X POST --data "{\"fields\":{\"project\":{\"key\":\"TECH\"},\"parent\":{\"key\":\"$p\"},\"summary\":\"TestChargenNr\",\"description\":\"some description\",\"issuetype\":{\"name\":\"Sub-task\"},\"customfield_10107\":{\"id\":\"10400\"}}}" -H "Content-Type:application/json" https://jira.companyu.com/rest/api/latest/issue/
done <output.txt
The echo output is as it should be:
TECH-731
TECH-729
(so curl should run twice, once for each value in the output).
But curl just logs in without creating subtasks; when I hardcode a key instead of $p, curl executes twice for the same project ID.
I really don't know why, but this code worked. Thanks, everyone:
for project in `cat output.txt`; do
echo $project
curl -D- -u user:pass -X POST --data "{\"fields\":{\"project\":{\"key\":\"TECH\"},\"parent\":{\"key\":\"$project\"},\"summary\":\"TestChargenNr\",\"description\":\"some description\",\"issuetype\":{\"name\":\"Sub-task\"},\"customfield_10107\":{\"id\":\"10400\"}}}" -H "Content-Type:application/json" https://jira.company.com/rest/api/latest/issue/
done
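Not part of the original solution, but a slightly more robust variant builds the JSON payload with jq instead of a hand-escaped string, which avoids quoting mistakes in the request body. A sketch, assuming jq is installed:

for project in `cat output.txt`; do
    # jq --arg handles all JSON escaping for the issue key.
    payload=$(jq -n --arg key "$project" \
        '{fields: {project: {key: "TECH"}, parent: {key: $key},
          summary: "TestChargenNr", description: "some description",
          issuetype: {name: "Sub-task"}, customfield_10107: {id: "10400"}}}')
    curl -D- -u user:pass -X POST --data "$payload" \
        -H "Content-Type: application/json" https://jira.company.com/rest/api/latest/issue/
done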
Going through the answers at superuser, I'm trying to modify this to listen for multiple strings and echo custom messages such as 'Your server started successfully', etc.
I'm also trying to attach it to another command, i.e. pip:
wait_str() {
local file="$1"; shift
local search_term="Successfully installed"; shift
local search_term2='Exception'
local wait_time="${1:-5m}"; shift # 5 minutes as default timeout
(timeout $wait_time tail -F -n0 "$file" &) | grep -q "$search_term" && echo 'Custom success message' && return 0 || || grep -q "$search_term2" && echo 'Custom success message' && return 0
echo "Timeout of $wait_time reached. Unable to find '$search_term' or '$search_term2' in '$file'"
return 1
}
The usage I have in mind is:
pip install -r requirements.txt > /var/log/pip/dump.log && wait_str /var/log/pip/dump.log
To clarify, I'd like to get wait_str to stop tailing when pip exits, whether successfully or not.
The following is a general answer; tail can be replaced by any command that produces a stream of lines.
If different strings need different actions, use the following:
tail -f /var/log/pip/dump.log | awk '/condition-1/ {action for condition-1} /condition-2/ {action for condition-2} .....'
If multiple conditions need the same action, separate them with the OR operator:
tail -f /var/log/pip/dump.log | awk '/condition-1/ || /condition-2/ || /condition-n/ {take this action}'
Based on the comments: a single awk can do this.
tail -f /path/to/file |awk '/Exception/{ print "Worked"} /compiler/{ print "worked"}'
or
tail -f /path/to/file | awk '/Exception/||/compiler/{ print "worked"}'
Or, to exit as soon as a match is found:
tail -f logfile |awk '/Exception/||/compiler/{ print "worked";exit}'
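Addressing the pip part of the question (this is not from the original answer): if the watcher should end exactly when pip ends, the intermediate log file and tail can be skipped by piping pip's output straight into awk, with tee keeping a copy of the log if it is still wanted. A sketch:

pip install -r requirements.txt 2>&1 \
  | tee /var/log/pip/dump.log \
  | awk '/Successfully installed/ {print "Custom success message"}
         /Exception/              {print "Custom failure message"}'

When pip exits its output stream closes, so tee and awk terminate on their own.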
I'm working on an application that will eventually graph the gpg signature connections between a predefined set of email addresses. I need it to programmatically collect the public keys from a key server. I have a working model that uses the --search-keys option of gpg. However, when run with the --batch flag, I get the error "gpg: Sorry, we are in batchmode - can't get input". When I run without the --batch flag, gpg expects input.
I'm hoping there is some flag to gpg that I've missed. Alternatively, a library (preferably python) that will interact with a key server would do.
Use
gpg --batch --keyserver hkp://pool.sks-keyservers.net --search-keys ...
and parse the output to get key IDs.
After that
gpg --batch --keyserver hkp://pool.sks-keyservers.net --recv-keys key-id key-id ..
should work
GnuPG does not perform very well anyway when you import very large portions of the web of trust, especially during the import phase.
I'd go for setting up a local keyserver, just dumping all the keys in there (less than 10GB of download size in 2014) and directly querying your own, local keyserver.
Hockeypuck is rather easy to set up and especially to query, as it stores the data in a PostgreSQL database.
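For illustration (not part of the original answer), once a local keyserver is up the queries look the same, just pointed at localhost; this assumes Hockeypuck is listening on the standard HKP port 11371:

gpg --batch --keyserver hkp://localhost:11371 --search-keys someone@example.com
gpg --batch --keyserver hkp://localhost:11371 --recv-keys KEYID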
Use --recv-keys to get the keys without prompting.
In the case of an hkps server, the following would work:
gpg --keyserver hkps://***HKPSDOMAIN*** --recv-keys \
$(curl -s "https://***HKPSDOMAIN***/?op=index&options=mr&search=***SEARCHSTRING***"\
|grep pub|awk -F ":" '{print $2}')
We can store the standard output and error of the gpg --search-keys commands in variables by specifying 2>&1, then work on those variables. For example, get the public key IDs for the *.amazon.com email addresses:
pubkeyids=$(gpg --batch --keyserver hkp://keyserver.ubuntu.com --search-keys amazon.com 2>&1 | grep -Po '\d+\s*bit\s*\S+\s*key\s*[^,]+' | cut -d' ' -f5)
The regular expression is fully explained on regex101.com. We can automate searching for keys, parsing that output for their IDs, and adding them to the keyring using bash. As an illustration, I created the following GitHub gist to host the code below.
Example address list example.csv:
First Name,Last Name,Email Address
Hi,Bye,hi@bye.com
Yes,No,yes@no.com
Why,Not,why@not.com
Then we can pass the csv path to a bash script which will add all keys belonging to the email addresses in the csv:
$ getPubKeysFromCSV.sh ~/example.csv
Here is an implementation of the above idea, getPubKeysFromCSV.sh:
#!/bin/bash
# CSV of email address
csv=$1
# Get headers from CSV
headers=$(head -1 "$csv")
# Find the column number of the email address
emailCol=$(echo "$headers" | tr ',' '\n' | grep -n "Email Address" | cut -d':' -f1)
# Content of the CSV at emailCol column, skipping the first line (read into an array)
mapfile -t emailAddrs < <(tail -n +2 "$csv" | cut -d',' -f"$emailCol")
gpgListPatrn='(?<entropy>\d+)\s*bit\s*(?<algo>\S+)\s*key\s*(?<pubkeyid>[^,]+)'
# Loop through the array and get the public keys
for email in "${emailAddrs[@]}"
do
    # Get the public key ids for the email address by matching the regex gpgListPatrn
    pubkeyids=$(gpg --batch --keyserver hkp://keyserver.ubuntu.com --search-keys "$email" 2>&1 | grep -Po "$gpgListPatrn" | cut -d' ' -f5)
    # For each public key id, get the public key
    for pubkeyid in $pubkeyids
    do
        # Add the public key to the local keyring
        recvr=$(gpg --keyserver hkp://keyserver.ubuntu.com --recv-keys "$pubkeyid" 2>&1)
        # Check exit code to see if the key was added
        if [ $? -eq 0 ]; then
            # If the public key is added, do some extra work with it
            # [do stuff]
            :
        fi
    done
done
If we wanted, we could make getPubKeysFromCSV.sh more complex by verifying a file signature in the body of the loop, after successfully adding the public key. In addition to the CSV path, we will pass the signature path and file path as arguments two and three respectively:
$ getPubKeysFromCSV.sh ~/example.csv ./example.file.sig ./example.file
Here is the updated script difference as a diff:
--- original.sh
+++ updated.sh
@@ -1,6 +1,12 @@
# CSV of email address
csv=$1
+# signature file
+sig=$2
+
+# file to verify
+file=$3
+
# Get headers from CSV
headers=$(head -1 $csv)
@@ -22,5 +28,17 @@
recvr=$(gpg --keyserver hkp://keyserver.ubuntu.com --recv-keys "$pubkeyid" 2>&1)
# Check exit code to see if the key was added
+ if [ $? -eq 0 ]; then
+ verify=$(gpg --batch --verify $sig $file 2>&1)
+ # If the signature is verified, announce it was verified
+ # else, print error not verified and exit
+ if [[ $verify =~ "gpg: Good signature from" ]]; then
+ echo "$file was verified by $email using $pubkeyid"
+ else
+ printf '%s\n' "$file was unable to be verified" >&2
+ exit 1
+ fi
+ fi
done
done