Scrapyd deploy project on a server with dynamic ip - python

I want to deploy my Scrapy project to a server whose IP is not listed in the scrapy.cfg file, because the IP can change and I want to automate the deployment. I tried passing the server's IP directly to the deploy command, but that did not work. Any suggestions?

First, you should consider assigning a domain name to the server, so you can always reach it regardless of its dynamic IP. A dynamic DNS service such as DynDNS comes in handy here.
Second, you may not be able to do that, because you don't have access to the server or for some other reason. In that case, I suggest mimicking the above behavior with your system's hosts file. As the Wikipedia article describes it:
The hosts file is a computer file used by an operating system to map hostnames to IP addresses.
For example, let's say you set your url to remotemachine in your scrapy.cfg. You can write a script that updates the hosts file with the latest IP address, and run it before deploying your spider. This approach has the benefit of a system-wide effect, so if you are deploying multiple spiders, or using the same server for some other purpose, you don't have to update multiple configuration files.
This script could look something like this:
import fileinput
import sys


def update_hosts(hostname, ip):
    # Pick the hosts file location for the current platform.
    if 'linux' in sys.platform:
        hosts_path = '/etc/hosts'
    else:
        hosts_path = r'c:\windows\system32\drivers\etc\hosts'
    # With inplace=True, whatever is printed replaces the current line in the file.
    for line in fileinput.input(hosts_path, inplace=True):
        if hostname in line:
            # A hosts entry lists the IP address first, then the hostname.
            print("{0}\t{1}".format(ip, hostname))
        else:
            print(line.strip())


if __name__ == '__main__':
    hostname = sys.argv[1]
    ip = sys.argv[2]
    update_hosts(hostname, ip)
    print("Done!")
Of course, you should do additional argument checks, etc.; this is just a quick example.
You can then run it prior to deploying, like this:
python updatehosts.py remotemachine <remote_ip_here>
If you want to take it a step further and add this functionality as a simple argument to scrapyd-deploy, you can edit your scrapyd-deploy file (it's just a Python script) to add the extra parameter and update the hosts file from within. I'm not sure this is the best thing to do, though: keeping this step separate and explicit is probably the better choice.

This is not something you can solve on the scrapyd side.
According to the source code of scrapyd-deploy, it requires the url to be defined in the [deploy] section of the scrapy.cfg.
One of the possible workarounds could be having a placeholder in scrapy.cfg which you would replace with a real IP address of the target server, before starting scrapyd-deploy.
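The placeholder workaround could be scripted roughly as follows. This is only a minimal sketch under some assumptions: a template file (hypothetically named scrapy.cfg.template) whose [deploy] section contains a line such as url = http://__SERVER_IP__:6800/, and an IPv4 target. The placeholder token, the file names, and the helpers render_cfg and write_cfg are illustrative and not part of scrapyd.

```python
import ipaddress

PLACEHOLDER = "__SERVER_IP__"  # hypothetical token used in the template


def render_cfg(template_text, ip):
    """Return the template text with the placeholder replaced by a validated IP."""
    ipaddress.IPv4Address(ip)  # raises ValueError if ip is not a valid IPv4 address
    if PLACEHOLDER not in template_text:
        raise ValueError("template contains no %s placeholder" % PLACEHOLDER)
    return template_text.replace(PLACEHOLDER, ip)


def write_cfg(template_path, output_path, ip):
    """Read the template, fill in the IP, and write the real scrapy.cfg."""
    with open(template_path) as src:
        rendered = render_cfg(src.read(), ip)
    with open(output_path, "w") as dst:
        dst.write(rendered)
```

Calling write_cfg('scrapy.cfg.template', 'scrapy.cfg', new_ip) right before scrapyd-deploy would keep the template free of any fixed address.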

Related

Learning Python fast: how can I protect some private connections from being exposed

Hi I'm new to the community and new to Python, experienced but rusty on other high level languages, so my question is simple.
I made a simple script to connect to a private ftp server, and retrieve daily information from it.
from ftplib import FTP

# Open ftp connection
# Connect to server to retrieve inventory
def FTPconnection(file_name):
    ftp = FTP('ftp.serveriuse.com')
    ftp.login('mylogin', 'password')
    # List the files in the current directory
    print("Current File List:")
    file = ftp.dir()
    print(file)
    # Get the latest csv file from server
    # ftp.cwd("/pub")
    gfile = open(file_name, "wb")
    ftp.retrbinary('RETR ' + file_name, gfile.write)
    gfile.close()
    ftp.quit()

FTPconnection('test1.csv')
FTPconnection('test2.csv')
That's the whole script: it passes my credentials and then calls the function FTPconnection on the two files I'm retrieving.
My other script, which processes them, has an import statement, since I tried to use this script as a module. All the import does is connect to the FTP server and fetch the information.
import ftpconnect as ftpc
This is in the other Python script, which does the processing.
It works, but I want to improve it, so I need some guidance on best practices. In Spyder 4.1.5 I get a 'Module ftpconnect called but unused' warning, so I am probably missing something here. I'm developing on macOS using Anaconda and Python 3.8.5.
I'm trying to build an app to automate some tasks, but I couldn't find anything about modules that guided me toward better code; everything simply says that you import whatever .py file name you used and that it will be considered a module.
My final question is: how do you normally protect private information (FTP credentials) from being exposed? This is not about protecting my code, only the credentials.
There are a few options for storing passwords and other secrets that a Python program needs to use, particularly a program that needs to run in the background where it can't just ask the user to type in the password.
Problems to avoid:
Checking the password in to source control where other developers or even the public can see it.
Other users on the same server reading the password from a configuration file or source code.
Having the password in a source file where others can see it over your shoulder while you are editing it.
Option 1: SSH
This isn't always an option, but it's probably the best. Your private key is never transmitted over the network; SSH just runs mathematical calculations to prove that you have the right key.
In order to make it work, you need the following:
The database or whatever you are accessing needs to be accessible by SSH. Try searching for "SSH" plus whatever service you are accessing. For example, "ssh postgresql". If this isn't a feature on your database, move on to the next option.
Create an account to run the service that will make calls to the database, and generate an SSH key.
Either add the public key to the service you're going to call, or create a local account on that server, and install the public key there.
Option 2: Environment Variables
This one is the simplest, so it might be a good place to start. It's described well in the Twelve Factor App. The basic idea is that your source code just pulls the password or other secrets from environment variables, and then you configure those environment variables on each system where you run the program. It might also be a nice touch if you use default values that will work for most developers. You have to balance that against making your software "secure by default".
Here's an example that pulls the server, user name, and password from environment variables.
import os
server = os.getenv('MY_APP_DB_SERVER', 'localhost')
user = os.getenv('MY_APP_DB_USER', 'myapp')
password = os.getenv('MY_APP_DB_PASSWORD', '')
db_connect(server, user, password)
Look up how to set environment variables in your operating system, and consider running the service under its own account. That way you don't have sensitive data in environment variables when you run programs in your own account. When you do set up those environment variables, take extra care that other users can't read them. Check file permissions, for example. Of course any users with root permission will be able to read them, but that can't be helped. If you're using systemd, look at the service unit, and be careful to use EnvironmentFile instead of Environment for any secrets. Environment values can be viewed by any user with systemctl show.
Option 3: Configuration Files
This is very similar to the environment variables, but you read the secrets from a text file. I still find the environment variables more flexible for things like deployment tools and continuous integration servers. If you decide to use a configuration file, Python supports several formats in the standard library, like JSON, INI, netrc, and XML. You can also find external packages like PyYAML and TOML. Personally, I find JSON and YAML the simplest to use, and YAML allows comments.
Three things to consider with configuration files:
Where is the file? Maybe a default location like ~/.my_app, and a command-line option to use a different location.
Make sure other users can't read the file.
Obviously, don't commit the configuration file to source control. You might want to commit a template that users can copy to their home directory.
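The first two considerations can be combined in a small loader. The following is a sketch, not a definitive implementation: it assumes a JSON file, POSIX-style permission bits, and the ~/.my_app default location mentioned above. The function name load_config is hypothetical.

```python
import json
import os
import stat

DEFAULT_PATH = os.path.expanduser("~/.my_app")  # default location from the text


def load_config(path=DEFAULT_PATH):
    """Load secrets from a JSON file, refusing files that group or others can access."""
    mode = os.stat(path).st_mode
    if mode & (stat.S_IRWXG | stat.S_IRWXO):
        raise PermissionError(
            "%s is accessible by group/others; consider chmod 600" % path
        )
    with open(path) as handle:
        return json.load(handle)
```

A command-line option for a different location would simply pass another path to load_config.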
Option 4: Python Module
Some projects just put their secrets right into a Python module.
# settings.py
db_server = 'dbhost1'
db_user = 'my_app'
db_password = 'correcthorsebatterystaple'
Then import that module to get the values.
# my_app.py
from settings import db_server, db_user, db_password
db_connect(db_server, db_user, db_password)
One project that uses this technique is Django. Obviously, you shouldn't commit settings.py to source control, although you might want to commit a file called settings_template.py that users can copy and modify.
I see a few problems with this technique:
Developers might accidentally commit the file to source control. Adding it to .gitignore reduces that risk.
Some of your code is not under source control. If you're disciplined and only put strings and numbers in here, that won't be a problem. If you start writing logging filter classes in here, stop!
If your project already uses this technique, it's easy to transition to environment variables. Just move all the setting values to environment variables, and change the Python module to read from those environment variables.
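Such a transition might look like the sketch below. The variable names reuse the MY_APP_* names from the environment variable example, and the choice to give the password no default is an assumption made here for illustration. The helpers _required and load_settings are hypothetical.

```python
import os


def _required(name, environ):
    """Return a mandatory setting, failing early if it is not configured."""
    try:
        return environ[name]
    except KeyError:
        raise RuntimeError("environment variable %s is not set" % name) from None


def load_settings(environ=os.environ):
    """Build the settings from the environment; only the password has no default."""
    return {
        "db_server": environ.get("MY_APP_DB_SERVER", "localhost"),
        "db_user": environ.get("MY_APP_DB_USER", "myapp"),
        "db_password": _required("MY_APP_DB_PASSWORD", environ),
    }
```

Inside settings.py, assigning the returned values to module-level names (for example db_server = load_settings()["db_server"]) would keep existing imports working.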

Best way to copy multiple files to multiple remote servers

I'm developing a service which has to copy multiple files from a central node to remote servers.
The problem is that each time the service is executed, there are new servers and new files to dispatch to them. That is, in each execution I know which files have to be copied to which server and into which directory.
Obviously, this information changes constantly, so I would like to automate the task. I looked at solutions with Ansible, FTP, and SCP from Python.
With Ansible, I find it very difficult to automate every scp task in each execution.
SCP is OK, but I need to build each SCP command in Python to launch it.
FTP is too much for this problem, because there are not many files to dispatch to a single server.
Is there a better solution than the ones I have thought of?
If you send the same file (or files) to several destinations (that can be organized as sets), you could benefit from tools such as dsh or parallel-scp.
Whether this makes sense depends on your use case.
Parallel-SSH Documentation
from __future__ import print_function
from pssh.pssh_client import ParallelSSHClient

hosts = ['myhost1', 'myhost2']
client = ParallelSSHClient(hosts)
output = client.run_command('uname')
for host, host_output in output.items():
    for line in host_output.stdout:
        print(line)
How about rsync?
Example: rsync -rave
Where source or destination could be:
user@IP:/path/to/dest
It transfers incrementally, and you can run it from cron or trigger it from a small script whenever anything changes.
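Since the set of servers, files, and directories changes on every run, the copy commands can be generated from data instead of written by hand. The sketch below assumes the per-run plan is available as (host, local_path, remote_dir) tuples and that rsync over SSH with key-based login is set up. The user name "deploy" and both function names are placeholders.

```python
import subprocess
from collections import defaultdict


def build_rsync_commands(plan, user="deploy"):
    """Group a per-run plan into one rsync invocation per (host, remote dir).

    plan is an iterable of (host, local_path, remote_dir) tuples.
    """
    grouped = defaultdict(list)
    for host, local_path, remote_dir in plan:
        grouped[(host, remote_dir)].append(local_path)
    commands = []
    for (host, remote_dir), files in sorted(grouped.items()):
        dest = "{0}@{1}:{2}/".format(user, host, remote_dir.rstrip("/"))
        # -a preserves attributes, -z compresses, -e ssh selects the transport
        commands.append(["rsync", "-az", "-e", "ssh"] + files + [dest])
    return commands


def dispatch(plan, user="deploy"):
    """Run each command, raising CalledProcessError on the first failure."""
    for command in build_rsync_commands(plan, user):
        subprocess.run(command, check=True)
```

Passing argument lists rather than shell strings avoids quoting problems with unusual file names.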

How to use the get method with multiple hosts in fabric v2?

I have a fabfile that runs several tasks on the hosts. This results in the creation of a file result.txt on each of the hosts.
Now I want to get all these files locally. This is what I tried:
from invoke import task

@task
def getresult(ctx):
    ctx.get('result.txt')
I run with:
fab -H host1 host2 host3 getresult
In the end, I have only one file result.txt in my local machine (it seems to be the copy from the last host of the command line). I would like to get all the files.
Is there a way to do this with Fabric v2? I did not find anything in the documentation. It seems that this was possible in Fabric v1, but I am not sure about v2.
In Fabric v2, the get API signature is:
get(remote, local=None, preserve_mode=True)
So when you don't specify the name with which it has to be stored locally, it uses the same name it has in the remote location. Hence it is being overwritten for each host it executes on, and you're ending up with the last one.
One way to fix this would be to specify the local file name and add a random suffix or prefix to it. Something like this:
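from invoke import task
import random

@task
def getresult(ctx):
    # Parenthesize the arithmetic: % and * are evaluated left to right,
    # so without the parentheses the formatted string itself would be repeated.
    ctx.get('result.txt', 'result%s.txt' % (random.random() * 100))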
This way, each time it executes, it stores the file with a unique name.
You can even add the host name to the file, if you can find a way to use it inside the method.
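Adding the host name could be sketched as follows, assuming the object passed to the task is a fabric Connection, which exposes the target as its host attribute. The helper names local_name_for and fetch_result are made up for illustration.

```python
import os


def local_name_for(host, remote_path):
    """Build a per-host local file name, e.g. result.txt -> result-host1.txt."""
    base, ext = os.path.splitext(os.path.basename(remote_path))
    return "{0}-{1}{2}".format(base, host, ext)


def fetch_result(conn, remote_path="result.txt"):
    """Download remote_path from conn into a file named after conn.host."""
    # conn is assumed to be a fabric Connection (it provides .host and .get()).
    local = local_name_for(conn.host, remote_path)
    conn.get(remote_path, local)
    return local
```

Inside the task, calling fetch_result(ctx) in place of ctx.get('result.txt') would then leave one file per host, and the names are predictable rather than random.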

Simple development http-proxy for multiple source servers

So far I have developed with different web app servers (Tornado, Django, ...) and keep running into the same problem:
I want a simple web proxy (reverse proxy) that lets me combine different sources from other web servers (static files, dynamic content from an app server, or other content) into one set of served files. In other words, the browser should see them as if they came from one source.
I know that I can do this with nginx, but I am looking for an even simpler tool for development. I want something that can be started from the command line and does not need to run as root. Changing the configuration (the routing of requests) should be as simple as possible.
In development, I just want to be able to mash up different sources. For example: something runs on my production server that I don't want to copy, but I want to combine it with static files on a different server and with a new application on my development system.
The speed of the proxy is not the issue, just flexibility and speed of development!
A Python or other scripting solution would be preferred. I also found a big list of Python proxies, but after scanning it I found them all lacking. Most of them connect to only one destination server and offer no way to use multiple servers (the proxy has to decide which one to use by analyzing the local URL).
I am surprised that nobody else seems to have this need ...
You do not need to start nginx as root as long as you do not let it serve on port 80. If you want it to run on port 80 as a normal user, use setcap. In combination with a script that converts between an nginx configuration file and a route specification for your reverse proxy, this should give you the most reliable solution.
If you want something simpler/smaller, it should be pretty straight-forward to write a script using Python's BaseHTTPServer and urllib. Here's an example that only implements GET, you'd have to extend it at least to POST and add some exception handling:
#!/usr/bin/env python
# encoding: utf-8
# Python 2 module names; in Python 3 these live in http.server,
# socketserver and urllib.request.
import BaseHTTPServer
import SocketServer
import urllib
import re

FORWARD_LIST = {
    '/google/(.*)': r'http://www.google.com/%s',
    '/so/(.*)': r'http://www.stackoverflow.com/%s',
}


# ThreadingMixIn must come first so its process_request overrides the base one.
class HTTPServer(SocketServer.ThreadingMixIn, BaseHTTPServer.HTTPServer):
    pass


class ProxyHandler(BaseHTTPServer.BaseHTTPRequestHandler):
    def do_GET(self):
        # Find the first route whose pattern matches the requested path.
        for pattern, url in FORWARD_LIST.items():
            match = re.search(pattern, self.path)
            if match:
                url = url % match.groups()
                break
        else:
            self.send_error(404)
            return
        dataobj = urllib.urlopen(url)
        data = dataobj.read()
        self.send_response(200)
        self.send_header("Content-Length", len(data))
        for key, value in dataobj.info().items():
            # Content-Length is already set above; skip the upstream copy.
            if key.lower() == 'content-length':
                continue
            self.send_header(key, value)
        self.end_headers()
        self.wfile.write(data)


HTTPServer(("", 1234), ProxyHandler).serve_forever()
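On Python 3, where BaseHTTPServer and urllib.urlopen no longer exist under those names, the same idea might look roughly like the sketch below. It keeps the routing table from the example above, still handles only GET without error handling, and forwards only the Content-Type header. The resolve and main functions are additions made for illustration, and ThreadingHTTPServer requires Python 3.7 or later.

```python
import re
import urllib.request
from http.server import BaseHTTPRequestHandler, ThreadingHTTPServer

# Same routing table as above: local URL pattern -> upstream URL template
FORWARD_LIST = {
    r"/google/(.*)": "http://www.google.com/%s",
    r"/so/(.*)": "http://www.stackoverflow.com/%s",
}


def resolve(path, routes=FORWARD_LIST):
    """Return the upstream URL for a request path, or None if no route matches."""
    for pattern, template in routes.items():
        match = re.search(pattern, path)
        if match:
            return template % match.groups()
    return None


class ProxyHandler(BaseHTTPRequestHandler):
    def do_GET(self):
        url = resolve(self.path)
        if url is None:
            self.send_error(404)
            return
        with urllib.request.urlopen(url) as upstream:
            data = upstream.read()
            content_type = upstream.headers.get("Content-Type")
        self.send_response(200)
        if content_type:
            self.send_header("Content-Type", content_type)
        self.send_header("Content-Length", str(len(data)))
        self.end_headers()
        self.wfile.write(data)


def main(port=1234):
    """Serve until interrupted; port 1234 matches the example above."""
    ThreadingHTTPServer(("", port), ProxyHandler).serve_forever()
```

Calling main() starts the proxy. Keeping the routing in resolve makes it possible to check a route without starting a server.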
Your use case should be covered by:
https://mitmproxy.org/doc/features/reverseproxy.html
There is now a proxy that covers my needs (and more) -- very lightweight and very good:
Devd

Django and root processes

In my Django project I need to be able to check whether a host on the LAN is up using an ICMP ping. I found this SO question which answers how to ping something in Python and this SO question which links to resources explaining how to use the sudoers file.
The Setting
A Device model stores an IP address for a host on the LAN, and after adding a new Device instance to the DB (via a custom view, not the admin) I envisage checking to see if the device responds to a ping using an AJAX call to an API which exposes the capability.
The Problem
However, the docstring of a library suggested in the first SO question says: "Note that ICMP messages can only be sent from processes running as root."
I don't want to run Django as the root user, since it is bad practice. However, this part of the process (sending an ICMP ping) needs to run as root. If a Django view sends a ping packet to test the liveness of a host, then Django itself has to run as root, since that is the process invoking the ping.
Solutions
These are the solutions I can think of, and my question is are there any better ways to only execute select parts of a Django project as root, other than these:
Run Django as root (please no!)
Put a "ping request" in a queue that another process (run as root) can periodically check and fulfil. Maybe something like celery.
Is there not a simpler way?
I want something like a "Django run as root" library, is this possible?
Absolutely no way, do not run the Django code as root!
I would run a daemon as root (written in Python, why not) and then IPC between the Django instance and your daemon. As long as you're sure to validate the content and properly handle it (e.g. use subprocess.call with an array etc) and only pass in data (not commands to execute) it should be fine.
Here is an example client and server, using web.py
Server: http://gist.github.com/788639
Client: http://gist.github.com/788658
You'll need to install webpy.org but it's worth having around anyway. If you can hard-wire the IP (or hostname) into the server and remove the argument, all the better.
What's your OS here? You might be able to write a little program that does what you want given a parameter, and stick that in the sudoers file, and give your django user permission to run it as root.
/etc/sudoers
I don't know what kind of system you're on, but on any box I've encountered, one does not have to be root to run the command-line ping program (it has the suid bit set, so it becomes root as necessary). So you could just invoke that. It's a bit more overhead, but probably negligible compared to network latency.
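Invoking the command-line ping from a view could be sketched like this. It assumes a Unix-like ping that accepts -c for the packet count and an IPv4 address stored on the Device. The function names are hypothetical. The address is validated and passed as a list element, so only data ever reaches the command.

```python
import ipaddress
import subprocess


def build_ping_command(address):
    """Validate the address and return the argument list for one echo request."""
    ip = ipaddress.IPv4Address(address)  # ValueError on anything but an IPv4 address
    return ["ping", "-c", "1", str(ip)]


def host_is_up(address, timeout=5):
    """Return True if the system ping exits with status 0 within the timeout."""
    command = build_ping_command(address)
    try:
        result = subprocess.run(
            command,
            stdout=subprocess.DEVNULL,
            stderr=subprocess.DEVNULL,
            timeout=timeout,
        )
    except subprocess.TimeoutExpired:
        return False
    return result.returncode == 0
```

Because the ping binary handles the privileged part itself, the Django process can stay unprivileged.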
