How do you specify output log file using structlog? - python

I feel like this should be super simple but I cannot figure out how to specify the path for the logfile when using structlog. The documentation states that you can use traditional logging alongside structlog so I tried this:
logger = structlog.getLogger(__name__)
logging.basicConfig(filename=logfile_path, level=logging.ERROR)
logger.error("TEST")
The log file gets created but of course "TEST" doesn't show up inside it. It's just blank.

For structlog log entries to appear in that file, you have to tell structlog to use stdlib logging for output. You can find three different approaches in the docs, depending on your other needs.

I was able to get an example working by following the docs to log to both stdout and a file.
import logging.config
import structlog

timestamper = structlog.processors.TimeStamper(fmt="iso")

logging.config.dictConfig({
    "version": 1,
    "disable_existing_loggers": False,
    "formatters": {
        # ProcessorFormatter turns the event dict handed over by structlog
        # into the final log line for both handlers
        "plain": {
            "()": structlog.stdlib.ProcessorFormatter,
            "processor": structlog.dev.ConsoleRenderer(colors=False),
            "foreign_pre_chain": [timestamper],
        },
    },
    "handlers": {
        "default": {
            "level": "DEBUG",
            "class": "logging.StreamHandler",
            "formatter": "plain",
        },
        "file": {
            "level": "DEBUG",
            "class": "logging.handlers.WatchedFileHandler",
            "filename": "test.log",
            "formatter": "plain",
        },
    },
    "loggers": {
        "": {
            "handlers": ["default", "file"],
            "level": "DEBUG",
            "propagate": True,
        },
    }
})

structlog.configure(
    processors=[
        structlog.stdlib.add_log_level,
        structlog.stdlib.PositionalArgumentsFormatter(),
        timestamper,
        structlog.processors.StackInfoRenderer(),
        structlog.processors.format_exc_info,
        # hand the event dict over to the stdlib ProcessorFormatter defined above
        structlog.stdlib.ProcessorFormatter.wrap_for_formatter,
    ],
    logger_factory=structlog.stdlib.LoggerFactory(),
    wrapper_class=structlog.stdlib.BoundLogger,
    cache_logger_on_first_use=True,
)

structlog.get_logger("test").info("hello")
If you just wanted to log to a file, you could use the snippet hynek suggested:
logging.basicConfig(filename='test.log', encoding='utf-8', level=logging.DEBUG)
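For completeness, here is a minimal file-only sketch (the processor chain is my own assumption of a sensible default, not the only correct one): basicConfig opens the file, and structlog's stdlib LoggerFactory routes every entry through it.
import logging
import structlog

# stdlib owns the file; structlog hands it one rendered line per log entry
logging.basicConfig(filename="test.log", level=logging.DEBUG, format="%(message)s")

structlog.configure(
    processors=[
        structlog.processors.TimeStamper(fmt="iso"),
        structlog.stdlib.add_log_level,
        structlog.processors.JSONRenderer(),  # or KeyValueRenderer() for key=value lines
    ],
    logger_factory=structlog.stdlib.LoggerFactory(),
    wrapper_class=structlog.stdlib.BoundLogger,
    cache_logger_on_first_use=True,
)

structlog.get_logger(__name__).error("TEST")  # now ends up in test.log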

Related

Content Management-, Shop-System and Framework recognition with Tensorflow/Keras

I am very new to the world of machine learning and up until now I've only done tutorials and example projects on this topic. I'm currently working on my first actual project and (hopefully apt) implementation of machine learning for my purpose.
The purpose is fairly simple: I want to automatically detect the name of a known content management system (WordPress, TYPO3, Joomla!, Drupal, ...), shop system (Shopware5, Shopware6, Magento2, ...) or framework (Symfony, Laravel, ...) using only a list of files and directories from the document root of one of these systems, with a max depth of 3 directories.
I have nearly 150k installations of 22 different systems, and I know which system each one is. Sadly, my tool set on the web servers where those installations live is limited. With that information, my thought process was as follows:
If I get a file list of each installation's document root plus 2 directory levels recursively (with Unix tools, mostly find) and trim the output by filtering out known big directories like cache, media storage, libraries, ..., I should get a list of 30 to 80 files and directories that should be iconic for the system at hand.
If I then use Python to parse this list as JSON, I should be able to compile nearly 150k example file-list JSONs with the corresponding software as a label. With this I should be able to train a TF model.
Afterwards, the model should be able to tell me what software is in use when I give it only a JSON of the directory list of the document root of an unknown system (which is one of the 22 known systems), created in the same way as the training data.
My focus is not the performance of gathering the training data or of the training itself. I only want a fast way to later identify a single system based on its directory and file list.
Now my problem is that I somewhat lack the experience to know whether this is going to work in the first place, whether the JSON needs a certain format to work with TF or Keras, or whether this is even the most efficient way to do it.
Any advice, hints, links to comparable project documentation or helpful docs in general are very welcome!
EDIT: A small addition to make it a little more transparent.
PHP-based content management and shop systems have a vaguely similar file structure, but every project does its own thing, so the biggest similarities are a file that is served by the Apache web server, a configuration file containing the database credentials, and directories for themes, frontend, backend, libraries/plugins and so on.
Naming conventions, order and hierarchy vary between the different systems.
So let's use an example.
Magento1: My initial thought, as described above, was to get a list of files with a simple find command with a maxdepth of 3 that also marks whether each entry is a file (";f") or a directory (";d"), and to use grep afterwards to filter out the more redundant stuff like cache contents, user uploads and so on.
From this I get a list like this:
./index.php;f
./get.php;f
./cron.php;f
./api.php;f
./shell;d
./shell/log.php;f
./shell/compiler.php;f
./shell/abstract.php;f
./shell/.htaccess;f
./shell/indexer.php;f
./mage;f
./skin;d
./skin/frontend;d
./skin/frontend/rwd;d
./skin/frontend/base;d
./skin/frontend/default;d
./skin/adminhtml;d
./skin/adminhtml/default;d
./install;d
./install/default;d
./install.php;f
./app;d
./app/code;d
./app/code/core;d
./app/code/community;d
./app/Mage.php;f
./app/.htaccess;f
...
Parsed as JSON, I imagined it as something like this:
[
  {
    "name": "index.php",
    "type": "file"
  },
  {
    "name": "get.php",
    "type": "file"
  },
  {
    "name": "cron.php",
    "type": "file"
  },
  {
    "name": "api.php",
    "type": "file"
  },
  {
    "name": "shell",
    "type": "directory",
    "children": [
      {
        "name": "log.php",
        "type": "file"
      },
      {
        "name": "compiler.php",
        "type": "file"
      },
      {
        "name": "abstact.php",
        "type": "file"
      },
      {
        "name": ".htaccess",
        "type": "file"
      },
      {
        "name": "indexer.php",
        "type": "file"
      }
    ]
  },
  {
    "name": "mage",
    "type": "file"
  },
  {
    "name": "skin",
    "type": "directory",
    "children": [
      {
        "name": "frontend",
        "type": "directory",
        "children": [
          {
            "name": "rwd",
            "type": "directory"
          },
          {
            "name": "base",
            "type": "directory"
          },
          {
            "name": "default",
            "type": "directory"
          }
        ]
      },
      {
        "name": "adminhtml",
        "type": "directory",
        "children": [
          {
            "name": "default",
            "type": "directory"
          }
        ]
      },
      {
        "name": "install",
        "type": "directory",
        "children": [
          {
            "name": "default",
            "type": "directory"
          }
        ]
      }
    ]
  },
  {
    "name": "install.php",
    "type": "file"
  },
  {
    "name": "app",
    "type": "directory",
    "children": [
      {
        "name": "code",
        "type": "directory",
        "children": [
          {
            "name": "core",
            "type": "directory"
          },
          {
            "name": "community",
            "type": "directory"
          }
        ]
      },
      {
        "name": "Mage.php",
        "type": "file"
      },
      {
        "name": ".htaccess",
        "type": "file"
      },
      ...
If I minify this, I get JSON in a form something like this:
[{"name":"index.php","type":"file"},{"name":"get.php","type":"file"},{"name":"cron.php","type":"file"},{"name":"api.php","type":"file"},{"name":"shell","type":"directory","children":[{"name":"log.php","type":"file"},{"name":"compiler.php","type":"file"},{"name":"abstact.php","type":"file"},{"name":".htaccess","type":"file"},{"name":"indexer.php","type":"file"}]},{"name":"mage","type":"file"},{"name":"skin","type":"directory","children":[{"name":"frontend","type":"directory","children":[{"name":"rwd","type":"directory"},{"name":"base","type":"directory"},{"name":"default","type":"directory"}]},{"name":"adminhtml","type":"directory","children":[{"name":"default","type":"directory"}]},{"name":"install","type":"directory","children":[{"name":"default","type":"directory"}]}]},{"name":"install.php","type":"file"},{"name":"app","type":"directory","children":[{"name":"code","type":"directory","children":[{"name":"core","type":"directory"},{"name":"community","type":"directory"}]},{"name":"Mage.php","type":"file"},{"name":".htaccess","type":"file"}...
If I pour 150k of these into TensorFlow, together with a second list of correlating labels for the 22 systems these file systems represent, will it be able to tell me which label a previously unseen JSON list of files and directories belongs to? Will it be able to identify a different Magento installation as Magento1, for example?
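As a rough sketch of how such file lists are often fed to Keras (my own assumption: flatten each installation into its set of relative paths and multi-hot encode them against a fixed path vocabulary, rather than keeping the nested JSON; installs, labels, build_dataset and build_model are hypothetical names):
import numpy as np
import tensorflow as tf

def build_dataset(installs, labels):
    # installs: list of path lists (["index.php", "app/Mage.php", ...]); labels: list of system names
    vocab = sorted({p for paths in installs for p in paths})      # fixed path vocabulary
    path_index = {p: i for i, p in enumerate(vocab)}
    systems = sorted(set(labels))
    label_index = {s: i for i, s in enumerate(systems)}

    x = np.zeros((len(installs), len(vocab)), dtype="float32")    # multi-hot path encoding
    for row, paths in enumerate(installs):
        for p in paths:
            if p in path_index:
                x[row, path_index[p]] = 1.0
    y = np.array([label_index[s] for s in labels])
    return x, y, vocab, systems

def build_model(n_paths, n_systems):
    model = tf.keras.Sequential([
        tf.keras.layers.Input(shape=(n_paths,)),
        tf.keras.layers.Dense(128, activation="relu"),
        tf.keras.layers.Dense(n_systems, activation="softmax"),
    ])
    model.compile(optimizer="adam",
                  loss="sparse_categorical_crossentropy",
                  metrics=["accuracy"])
    return model

# usage (hypothetical):
# x, y, vocab, systems = build_dataset(installs, labels)
# model = build_model(len(vocab), len(systems))
# model.fit(x, y, validation_split=0.1, epochs=5)
Whether the nested JSON or a flat path set works better is an empirical question, but the flat multi-hot form is the lowest-friction way to get started with TF/Keras.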

Preventing libtorrent creating torrents with the attr attribute

I'm trying to make a Python app that creates torrent files from other files. I need this app to always create the same infohash to any given file (including files with the same content but different name), so I create a symlink with the name "blob" before creating the torrent, and it works. The only problem I'm facing is that if the file has executable permissions, it writes the "attr" key in the bencoded torrent, changing its infohash.
I found this discussion by looking for this attribute: https://libtorrent-discuss.narkive.com/bXlfCX2K/libtorrent-extra-elements-in-the-torrent-files-array
The code I'm using is this:
import libtorrent as lt
import os
os.symlink(file,"blob")
fs = lt.file_storage()
lt.add_files(fs,"blob")
t = lt.create_torrent(fs)
lt.set_piece_hashes(t,".")
os.remove("blob")
t.add_tracker("udp://tracker.openbittorrent.com:6969/announce",0)
t.set_creator('libtorrent')
torrent = lt.bencode(t.generate())
with open("file.torrent", "wb") as f:
f.write(torrent)
The bencoded torrent in json format:
{
  "announce": "udp://tracker.openbittorrent.com:6969/announce",
  "announce-list": [
    [
      "udp://tracker.openbittorrent.com:6969/announce"
    ]
  ],
  "created by": "libtorrent",
  "creation date": 1650389319,
  "info": {
    "attr": "x",
    "file tree": {
      "blob": {
        "": {
          "attr": "x",
          "length": 3317054,
          "pieces root": "<hex>...</hex>"
        }
      }
    },
    "length": 3317054,
    "meta version": 2,
    "name": "blob",
    "piece length": 32768,
    "pieces": "<hex>...</hex>"
  },
  "piece layers": {
    "<hex>...</hex>": "<hex>...</hex>"
  }
}
Is there any way to tell libtorrent to ignore this key attribute when creating the torrent to ensure that it always makes the same infohash?
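One hedged workaround (this is not a libtorrent option; it assumes temporarily clearing the executable bit on the source file is acceptable) is to normalize the permissions before hashing and restore them afterwards:
import os
import stat

import libtorrent as lt

def make_torrent_without_attr(file):
    orig_mode = os.stat(file).st_mode
    # clear the exec bits so libtorrent has no reason to emit "attr": "x"
    os.chmod(file, orig_mode & ~(stat.S_IXUSR | stat.S_IXGRP | stat.S_IXOTH))
    try:
        os.symlink(file, "blob")
        fs = lt.file_storage()
        lt.add_files(fs, "blob")
        t = lt.create_torrent(fs)
        lt.set_piece_hashes(t, ".")
        return lt.bencode(t.generate())
    finally:
        os.remove("blob")
        os.chmod(file, orig_mode)   # put the original permissions back
The tracker and creator fields from the original snippet can still be added before generate(); they live outside the info dictionary, so they don't affect the infohash.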

Combining Python trace information and logging

I'm trying to write a highly modular Python logging system (using the logging module) and include information from the trace module in the log message.
For example, I want to be able to write a line of code like:
my_logger.log_message(MyLogFilter, "this is a message")
and have it include the trace of where the "log_message" call was made, instead of the actual logger call itself.
I almost have the following code working except for the fact that the trace information is from the logging.debug() call rather than the my_logger.log_message() one.
import logging


class MyLogFilter(logging.Filter):
    def __init__(self):
        super().__init__()
        self.extra = {"error_code": 999}
        self.level = "debug"

    def filter(self, record):
        for key in self.extra.keys():
            setattr(record, key, self.extra[key])
        return True


class myLogger(object):
    def __init__(self):
        fid = logging.FileHandler("test.log")
        formatter = logging.Formatter('%(pathname)s:%(lineno)i, %(error_code)i, %(message)s')
        fid.setFormatter(formatter)
        self.my_logger = logging.getLogger(name="test")
        self.my_logger.setLevel(logging.DEBUG)
        self.my_logger.addHandler(fid)

    def log_message(self, lfilter, message):
        xfilter = lfilter()
        self.my_logger.addFilter(xfilter)
        log_funct = getattr(self.my_logger, xfilter.level)
        log_funct(message)


if __name__ == "__main__":
    logger = myLogger()
    logger.log_message(MyLogFilter, "debugging")
This is a lot of trouble to go through in order to make a simple logging.debug call, but in reality I will have many different versions of MyLogFilter at different logging levels, containing different values of the "error_code" attribute, and I'm trying to make the log_message() call as short and sweet as possible because it will be repeated numerous times.
I would appreciate any information about how to do what I want to, or if I'm completely off on the wrong track and if that's the case, what I should be doing instead.
I would like to stick to the internal python modules of "logging" and "trace" if that's possible instead of using any external solutions.
or if I'm completely off on the wrong track and if that's the case, what I should be doing instead.
My strong suggestion is that you view logging as a solved problem and avoid reinventing the wheel.
If you need more than the standard library's logging module provides, you probably want something like structlog (pip install structlog).
Structlog will give you:
data binding
cloud native structured logging
pipelines
...and more
It will handle most local and cloud use cases.
Below is one common configuration that outputs colorized logging to stdout and plain logging to a .log file, and can be extended further to log to e.g. AWS CloudWatch.
Notice the included processor StackInfoRenderer -- it adds stack information to every logging call that passes a truthy value for stack_info (this also exists in stdlib's logging, by the way). If you only want stack info for exceptions, you'd pass something like exc_info=True in those logging calls.
main.py
from structlog import get_logger
from logging_config import configure_local_logging

configure_local_logging()

logger = get_logger()
logger.info("Some random info")
logger.debug("Debugging info with stack", stack_info=True)

try:
    assert 'foo' == 'bar'
except Exception as e:
    logger.error("Error info with an exc", exc_info=e)
logging_config.py
import logging
import logging.config

import structlog


def configure_local_logging(filename=__name__):
    """Provides a structlog colorized console and file renderer for logging in eg ING tickets"""
    timestamper = structlog.processors.TimeStamper(fmt="%Y-%m-%d %H:%M:%S")
    pre_chain = [
        structlog.stdlib.add_log_level,
        timestamper,
    ]

    logging.config.dictConfig({
        "version": 1,
        "disable_existing_loggers": False,
        "formatters": {
            "plain": {
                "()": structlog.stdlib.ProcessorFormatter,
                "processor": structlog.dev.ConsoleRenderer(colors=False),
                "foreign_pre_chain": pre_chain,
            },
            "colored": {
                "()": structlog.stdlib.ProcessorFormatter,
                "processor": structlog.dev.ConsoleRenderer(colors=True),
                "foreign_pre_chain": pre_chain,
            },
        },
        "handlers": {
            "default": {
                "level": "DEBUG",
                "class": "logging.StreamHandler",
                "formatter": "colored",
            },
            "file": {
                "level": "DEBUG",
                "class": "logging.handlers.WatchedFileHandler",
                "filename": filename + ".log",
                "formatter": "plain",
            },
        },
        "loggers": {
            "": {
                "handlers": ["default", "file"],
                "level": "DEBUG",
                "propagate": True,
            },
        }
    })
    structlog.configure_once(
        processors=[
            structlog.stdlib.add_log_level,
            structlog.stdlib.PositionalArgumentsFormatter(),
            timestamper,
            structlog.processors.StackInfoRenderer(),
            structlog.processors.format_exc_info,
            structlog.stdlib.ProcessorFormatter.wrap_for_formatter,
        ],
        context_class=dict,
        logger_factory=structlog.stdlib.LoggerFactory(),
        wrapper_class=structlog.stdlib.BoundLogger,
        cache_logger_on_first_use=True,
    )
Structlog can do quite a bit more than this. I suggest you check it out.
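Since data binding is on the list above, a tiny illustration (the request_id value here is made up):
import structlog

log = structlog.get_logger().bind(request_id="abc123")  # bound context travels with the logger
log.info("user logged in", user="alice")                # both keys appear in the rendered line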
It turns out the missing piece to the puzzle is using the "traceback" module rather than the "trace" one. It's simple enough to parse the output of traceback to pull out the source filename and line number of the ".log_message()" call.
If my logging needs become any more complicated, then I'll definitely look into structlog. Thank you for that information, as I'd never heard about it before.
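For reference, a minimal sketch of that traceback approach (the helper name and the stack index of -3 are my own; the right index depends on how many wrapper calls sit between the real caller and this helper):
import traceback

def caller_info():
    # extract_stack() ends with this helper, so step back past it and past log_message()
    frame = traceback.extract_stack()[-3]
    return frame.filename, frame.lineno
On Python 3.8+ a simpler alternative is usually to pass stacklevel=2 (or higher) to the logging call itself, which keeps %(pathname)s and %(lineno)d pointing at the real caller without any parsing.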

how to create multiple log files in a log folder when it exceeds the maxsize

I have a program that writes logging information to a log file. I have now created a folder named LogFolder and keep the log file in that folder, but I want a new file with a different name to be created every time the log exceeds the maxBytes size.
My log file is written in JSON format; if you know how to do this for the normal format as well, that would also help.
My logging.json file is:
{
  "version": 1,
  "disable_existing_loggers": false,
  "formatters": {
    "json": {
      "format": "%(asctime)s - %(name)s - %(levelname)s - %(message)s",
      "()": "pythonjsonlogger.jsonlogger.JsonFormatter"
    }
  },
  "handlers": {
    "console": {
      "class": "logging.StreamHandler",
      "level": "DEBUG",
      "formatter": "json",
      "stream": "ext://sys.stdout"
    },
    "file_handler": {
      "class": "logging.handlers.RotatingFileHandler",
      "level": "DEBUG",
      "formatter": "json",
      "filename": "..\\LogFolder\\Data.log",
      "mode": "a",
      "maxBytes": 25600,
      "encoding": "utf8"
    }
  },
  "root": {
    "level": "DEBUG",
    "handlers": ["console", "file_handler"]
  }
}
This is how I call it in my python file main.py:
import logging.config
import json
fp = open('logging.json')
logging.config.dictConfig(json.load(fp))
logging.getLogger("requests").setLevel(logging.WARNING)
logger = logging.getLogger(__name__)
logger.removeHandler(default_handler)
fp.close()
Everything here works fine. I just want a new log file with a different name to be created in LogFolder when the current one exceeds the maxBytes size. Please help me figure out how to do it.
Thanks in advance.
Have a look at RotatingFileHandler
EDIT: expanding the answer as recommended in the comments.
The RotatingFileHandler class supports rotation of disk log files. At instantiation you can supply two optional arguments: maxBytes (default 0) and backupCount (default 0).
You can use the maxBytes and backupCount values to allow the file to rollover at a predetermined size. When the size is about to be exceeded, the file is closed and a new file is silently opened for output. Rollover occurs whenever the current log file is nearly maxBytes in length; but if either of maxBytes or backupCount is zero, rollover never occurs, so you generally want to set backupCount to at least 1, and have a non-zero maxBytes. When backupCount is non-zero, the system will save old log files by appending the extensions ‘.1’, ‘.2’ etc., to the filename. For example, with a backupCount of 5 and a base file name of app.log, you would get app.log, app.log.1, app.log.2, up to app.log.5. The file being written to is always app.log. When this file is filled, it is closed and renamed to app.log.1, and if files app.log.1, app.log.2, etc. exist, then they are renamed to app.log.2, app.log.3 etc. respectively.
There is also TimedRotatingFileHandler, which rotates log files based on time intervals instead of size.
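Applied to the question's config, that boils down to adding a backupCount key to the existing handler; the value 5 below is just a placeholder:
"file_handler": {
  "class": "logging.handlers.RotatingFileHandler",
  "level": "DEBUG",
  "formatter": "json",
  "filename": "..\\LogFolder\\Data.log",
  "mode": "a",
  "maxBytes": 25600,
  "backupCount": 5,
  "encoding": "utf8"
}
With that in place, the handler keeps Data.log plus Data.log.1 through Data.log.5 inside LogFolder and rolls over whenever the 25600-byte limit is about to be exceeded.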

Python json log creation and processing

Using Python logging, I create a log file in JSON format. Each line is valid JSON and the whole log is just many lines of JSON. It basically logs data from some sensors. My code also controls some settings of those sensors.
I create loggers as per example below:
import logging
import logging.handlers
from pythonjsonlogger import jsonlogger
formatter = jsonlogger.JsonFormatter('%(asctime)s %(name)s %(message)s')
log = logging.getLogger('MSG')
log.setLevel(logging.INFO)
fh = logging.handlers.RotatingFileHandler(
    filename='C:\\TEMP\\LOG\\test.log', maxBytes=2097152, backupCount=5)
fh.setFormatter(formatter)
log.addHandler(fh)
res = logging.getLogger('RES')
res.setLevel(logging.INFO)
res.addHandler(fh)
stg = logging.getLogger('SET')
stg.setLevel(logging.INFO)
stg.addHandler(fh)
The log file created by my code and the above loggers looks more or less like this:
{"asctime": "2016-05-13 11:25:32,154", "name": "SET", "message": "", "VAR": "Vdd", "VAL": "1"}
{"asctime": "2016-05-13 11:25:32,155", "name": "MSG", "message": "writting new setting successful"}
{"asctime": "2016-05-13 11:25:32,155", "name": "RES", "message": "", "VAR": "TEMP", "VAL": "23"}
{"asctime": "2016-05-13 11:25:32,157", "name": "RES", "message": "", "VAR": "LUX", "VAL": "150"}
{"asctime": "2016-05-13 11:25:32,159", "name": "SET", "message": "", "VAR": "Vdd", "VAL": "2"}
{"asctime": "2016-05-13 11:25:32,164", "name": "MSG", "message": "writting new setting successful"}
{"asctime": "2016-05-13 11:25:32,166", "name": "RES", "message": "", "VAR": "TEMP", "VAL": "25"}
{"asctime": "2016-05-13 11:25:32,171", "name": "RES", "message": "", "VAR": "LUX", "VAL": "170", "extra": "OV detected"}
{"asctime": "2016-05-13 11:25:32,177", "name": "SET", "message": "", "VAR": "Vdd", "VAL": "3"}
{"asctime": "2016-05-13 11:25:32,178", "name": "MSG", "message": "writting new setting successful"}
{"asctime": "2016-05-13 11:25:32,178", "name": "RES", "message": "", "VAR": "TEMP", "VAL": "28"}
{"asctime": "2016-05-13 11:25:32,178", "name": "RES", "message": "", "VAR": "LUX", "VAL": "190"}
Now my first question: can I be sure that the lines in the log file will be written in the same order they are executed in the Python code? (This was not always true when I was simply doing "print" to the console.) If not, how do I ensure that? I tried scanning the continuity of the timestamps, but I noticed that up to four lines can share the same timestamp.
As you can see, I set one parameter via logger('SET') (Vdd = 1 .. 3) and read a series of measurement results via logger('RES') (TEMP, LUX). So my second question: what is the best way to parse this log file if I want to create a CSV file or just a stacked plot of (TEMP, LUX) vs Vdd? (Note that there can be a variable number of keywords in each log line, there can be other log messages, so the log lines have to be filtered, and the log file can be large.)
I would want the procedure to be as generic and flexible as possible, as the parameters I set can be nested:
SET A
SET B
READ C
READ D
SET B
READ C
READ D
SET A
SET B
READ C
READ D
SET B
READ C
READ D
I was using the simple code below, but is there a more efficient way to do it?
import json

data = []
with open('file') as f:
    for line in f:
        # lots of strange line filtering and keyword selection here
        # in order to build the Temp vs Vdd table, etc...
Can I be sure that the lines in the log file will be written in the same order they are executed in Python code? (this was not always true when I was simple doing "print" to the console)
If you use multiprocessing / multithreading that can happen. You need to sort out synchronisation between the processes and flushing of the buffers in that case, but the topic is too large to give any specific explanation here. If you're using only one process, no reordering is possible. If you think something was reordered, it's most likely a bug in your application.
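If multiple processes are involved, one common stdlib pattern (a sketch only; how it gets wired into the sensor code is left open) is to funnel every record through a single QueueListener so that one process owns the file:
import logging
import logging.handlers
import multiprocessing

queue = multiprocessing.Queue(-1)

# owning process: a single listener writes everything to the rotating file
file_handler = logging.handlers.RotatingFileHandler(
    filename='C:\\TEMP\\LOG\\test.log', maxBytes=2097152, backupCount=5)
listener = logging.handlers.QueueListener(queue, file_handler)
listener.start()

# every producer process: loggers only put records on the shared queue
logging.getLogger().addHandler(logging.handlers.QueueHandler(queue))

# ... log as usual, then call listener.stop() at shutdown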
I was using simple code below, but is there a more efficient way to do it?
You can't make that more efficient really. It already does all the buffering you need behind the scenes. But that code shouldn't be complicated. Writing it as a few generators doing the filtering / collection of data should be quite simple.
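As a concrete sketch of that generator style (the field names come from the sample log above; attaching each RES line to the most recent SET values is my own assumption about the desired grouping):
import json

def records(path):
    # one dict per line, silently skipping anything that is not valid JSON
    with open(path) as f:
        for line in f:
            try:
                yield json.loads(line)
            except ValueError:
                continue

def rows(recs):
    # attach every RES measurement to the most recent SET values seen so far
    settings = {}
    for rec in recs:
        if rec.get("name") == "SET":
            settings[rec["VAR"]] = rec["VAL"]
        elif rec.get("name") == "RES":
            yield {**settings, rec["VAR"]: rec["VAL"]}

# e.g. build a Vdd / TEMP / LUX table:
# table = list(rows(records('C:\\TEMP\\LOG\\test.log')))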
