Python JSON log creation and processing

Using Python logging, I create a log file in JSON format. Each line is valid JSON and the whole log is just many lines of JSON. It is basically logging data from some sensors. My code also controls some settings of those sensors.
I create the loggers as in the example below:
import logging
import logging.handlers
from pythonjsonlogger import jsonlogger
formatter = jsonlogger.JsonFormatter('%(asctime)s %(name)s %(message)s')
log = logging.getLogger('MSG')
log.setLevel(logging.INFO)
fh = logging.handlers.RotatingFileHandler(
    filename='C:\\TEMP\\LOG\\test.log', maxBytes=2097152, backupCount=5)
fh.setFormatter(formatter)
log.addHandler(fh)
res = logging.getLogger('RES')
res.setLevel(logging.INFO)
res.addHandler(fh)
stg = logging.getLogger('SET')
stg.setLevel(logging.INFO)
stg.addHandler(fh)
The log file created by my code and the above loggers looks more or less like this:
{"asctime": "2016-05-13 11:25:32,154", "name": "SET", "message": "", "VAR": "Vdd", "VAL": "1"}
{"asctime": "2016-05-13 11:25:32,155", "name": "MSG", "message": "writting new setting successful"}
{"asctime": "2016-05-13 11:25:32,155", "name": "RES", "message": "", "VAR": "TEMP", "VAL": "23"}
{"asctime": "2016-05-13 11:25:32,157", "name": "RES", "message": "", "VAR": "LUX", "VAL": "150"}
{"asctime": "2016-05-13 11:25:32,159", "name": "SET", "message": "", "VAR": "Vdd", "VAL": "2"}
{"asctime": "2016-05-13 11:25:32,164", "name": "MSG", "message": "writting new setting successful"}
{"asctime": "2016-05-13 11:25:32,166", "name": "RES", "message": "", "VAR": "TEMP", "VAL": "25"}
{"asctime": "2016-05-13 11:25:32,171", "name": "RES", "message": "", "VAR": "LUX", "VAL": "170", "extra": "OV detected"}
{"asctime": "2016-05-13 11:25:32,177", "name": "SET", "message": "", "VAR": "Vdd", "VAL": "3"}
{"asctime": "2016-05-13 11:25:32,178", "name": "MSG", "message": "writting new setting successful"}
{"asctime": "2016-05-13 11:25:32,178", "name": "RES", "message": "", "VAR": "TEMP", "VAL": "28"}
{"asctime": "2016-05-13 11:25:32,178", "name": "RES", "message": "", "VAR": "LUX", "VAL": "190"}
Now my first question: can I be sure that the lines in the log file will be written in the same order they are executed in the Python code? (This was not always true when I was simply doing "print" to the console.) If not, how do I ensure that? I was trying to check the continuity of the timestamps, but I noticed that even up to four lines can share the same timestamp.
As you can see, I am setting one parameter with logger('SET') (Vdd = 1 .. 3) and reading a series of measurement results with logger('RES') (Temp, Lux). So my second question: what is the best way to parse this log file if I want to create a CSV file or just a stacked plot of (Temp, Lux) vs Vdd? (Note that there can be a variable number of keywords in each log line, there can be other log messages so the lines have to be filtered, and the log file can be large.)
I would want the procedure to be as generic and flexible as possible, as the parameters I set can be nested:
SET A
SET B
READ C
READ D
SET B
READ C
READ D
SET A
SET B
READ C
READ D
SET B
READ C
READ D
I was using the simple code below, but is there a more efficient way to do it?
import json

data = []
with open('file') as f:
    for line in f:
        record = json.loads(line)
        # lots of strange line filtering and keyword selection here
        # in order to build the Temp vs Vdd table, etc...
        data.append(record)

Can I be sure that the lines in the log file will be written in the same order they are executed in the Python code? (this was not always true when I was simply doing "print" to the console)
If you use multiprocessing / multithreading, that can happen. In that case you need to sort out synchronisation between the processes and flushing of the buffers, but the topic is too large to give a specific explanation here. If you're using only one process, no reordering is possible. If you think something was reordered, it's most likely a bug in your application.
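If you do end up logging from multiple threads or processes, one common pattern is to route every record through a single queue so that one listener does the actual file writing. Below is a minimal sketch of that idea, reusing the handler setup from the question; it assumes Python 3.2+ for QueueHandler/QueueListener, and for true multiprocessing the queue would need to be a multiprocessing.Queue shared with the workers.
import logging
import logging.handlers
import queue

from pythonjsonlogger import jsonlogger

# One queue collects records from every logger/thread; a single listener
# drains it and writes to the rotating file, preserving arrival order.
log_queue = queue.Queue(-1)
queue_handler = logging.handlers.QueueHandler(log_queue)

formatter = jsonlogger.JsonFormatter('%(asctime)s %(name)s %(message)s')
fh = logging.handlers.RotatingFileHandler(
    filename='C:\\TEMP\\LOG\\test.log', maxBytes=2097152, backupCount=5)
fh.setFormatter(formatter)

listener = logging.handlers.QueueListener(log_queue, fh)
listener.start()

for name in ('MSG', 'RES', 'SET'):
    lg = logging.getLogger(name)
    lg.setLevel(logging.INFO)
    lg.addHandler(queue_handler)   # loggers write to the queue, not to the file directly

# ... run the application ...
# listener.stop()                  # flushes any remaining records at shutdown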
I was using the simple code below, but is there a more efficient way to do it?
You can't really make that much more efficient: iterating over the file already does all the buffering you need behind the scenes. That code shouldn't be complicated either; writing it as a few generators doing the filtering and collection of the data should be quite simple.
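A minimal sketch of that generator approach, based on the sample log in the question. The output file name, the column names and the rule "each SET starts a new row" are assumptions; with nested SET parameters you would carry the full dict of current settings forward instead.
import csv
import json

def records(path):
    """Yield one dict per JSON log line."""
    with open(path) as f:
        for line in f:
            yield json.loads(line)

def rows(recs):
    """Merge each SET record with the RES readings that follow it."""
    row = None
    for rec in recs:
        if rec.get("name") == "SET":
            if row:
                yield row
            row = {rec["VAR"]: rec["VAL"]}       # e.g. {"Vdd": "1"}
        elif rec.get("name") == "RES" and row is not None:
            row[rec["VAR"]] = rec["VAL"]         # add TEMP / LUX readings
        # other records (e.g. MSG lines) are simply skipped
    if row:
        yield row

with open("table.csv", "w", newline="") as out:
    writer = csv.DictWriter(out, fieldnames=["Vdd", "TEMP", "LUX"], extrasaction="ignore")
    writer.writeheader()
    writer.writerows(rows(records("C:\\TEMP\\LOG\\test.log")))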

Related

Content Management-, Shop-System and Framework recognition with Tensorflow/Keras

I am very new to the world of machine learning and up until now I've only done tutorials and example projects on this topic. I'm currently working on my first actual project and (hopefully apt) implementation of machine learning for my purpose.
The purpose is fairly simple: I want to automatically detect the name of a known content management system (WordPress, TYPO3, Joomla!, Drupal, ...), shop system (Shopware5, Shopware6, Magento2, ...) or framework (Symfony, Laravel, ...) using only a list of files and directories from the document root of one of these systems, with a maximum depth of 3 directories.
I have nearly 150k installations of 22 different systems for which I know what system they are. Sadly, my tool set on the web servers where those installations live is limited. With this information, my thought process was like this:
If I get a file list of each installation's document root plus 2 directory levels recursively (with Unix tools, mostly find) and optimize the output by filtering out known big directories like cache, media storage, libraries, ..., I should get a list of 30 to 80 files and directories that should be characteristic of the system at hand.
If I then use Python to parse this list as JSON, I should be able to compile a set of nearly 150k example file-list JSONs with their corresponding software as a label. With this I should be able to train a TF model.
Afterwards the model should be able to tell me what software is in use when I only give it a JSON directory list of the document root of an unknown installation (of one of the 22 known systems), created in the same way as the training data.
My focus is not the performance of gathering the training data or of the training itself. I only want a fast way to later identify a single system based on its directory and file list.
Now my problem is that I rather lack the experience to know whether this is going to work in the first place, whether the JSON needs a certain format to work with TF or Keras, or whether this is even the most efficient way to do it.
Any advice, hint, link to comparable project documentation or helpful docs in general is very welcome!
EDIT: A small addition to make things a little more transparent.
PHP-based content management and shop systems have a vaguely similar file structure, but every project does its own thing, so the biggest similarities are a file that is served to the Apache web server, a configuration file containing the database credentials, and directories for themes, frontend, backend, libraries/plugins and so on.
Naming conventions, order and hierarchy vary between the different systems.
So let's use an example.
Magento1: My initial thought, as described above, was to get a list of files with a simple find command with a maxdepth of 3 that also outputs whether each entry is a file (";f") or a directory (";d"), and to use grep afterwards to filter out the more redundant stuff like cache contents, user uploads and so on.
From this I get a list like this:
./index.php;f
./get.php;f
./cron.php;f
./api.php;f
./shell;d
./shell/log.php;f
./shell/compiler.php;f
./shell/abstract.php;f
./shell/.htaccess;f
./shell/indexer.php;f
./mage;f
./skin;d
./skin/frontend;d
./skin/frontend/rwd;d
./skin/frontend/base;d
./skin/frontend/default;d
./skin/adminhtml;d
./skin/adminhtml/default;d
./install;d
./install/default;d
./install.php;f
./app;d
./app/code;d
./app/code/core;d
./app/code/community;d
./app/Mage.php;f
./app/.htaccess;f
...
Parsed as JSON, I imagined it as something like this:
[
  {
    "name": "index.php",
    "type": "file"
  },
  {
    "name": "get.php",
    "type": "file"
  },
  {
    "name": "cron.php",
    "type": "file"
  },
  {
    "name": "api.php",
    "type": "file"
  },
  {
    "name": "shell",
    "type": "directory",
    "children": [
      {
        "name": "log.php",
        "type": "file"
      },
      {
        "name": "compiler.php",
        "type": "file"
      },
      {
        "name": "abstact.php",
        "type": "file"
      },
      {
        "name": ".htaccess",
        "type": "file"
      },
      {
        "name": "indexer.php",
        "type": "file"
      }
    ]
  },
  {
    "name": "mage",
    "type": "file"
  },
  {
    "name": "skin",
    "type": "directory",
    "children": [
      {
        "name": "frontend",
        "type": "directory",
        "children": [
          {
            "name": "rwd",
            "type": "directory"
          },
          {
            "name": "base",
            "type": "directory"
          },
          {
            "name": "default",
            "type": "directory"
          }
        ]
      },
      {
        "name": "adminhtml",
        "type": "directory",
        "children": [
          {
            "name": "default",
            "type": "directory"
          }
        ]
      },
      {
        "name": "install",
        "type": "directory",
        "children": [
          {
            "name": "default",
            "type": "directory"
          }
        ]
      }
    ]
  },
  {
    "name": "install.php",
    "type": "file"
  },
  {
    "name": "app",
    "type": "directory",
    "children": [
      {
        "name": "code",
        "type": "directory",
        "children": [
          {
            "name": "core",
            "type": "directory"
          },
          {
            "name": "community",
            "type": "directory"
          }
        ]
      },
      {
        "name": "Mage.php",
        "type": "file"
      },
      {
        "name": ".htaccess",
        "type": "file"
      },
      ...
If I minify this, I have JSON in the form of something like this:
[{"name":"index.php","type":"file"},{"name":"get.php","type":"file"},{"name":"cron.php","type":"file"},{"name":"api.php","type":"file"},{"name":"shell","type":"directory","children":[{"name":"log.php","type":"file"},{"name":"compiler.php","type":"file"},{"name":"abstact.php","type":"file"},{"name":".htaccess","type":"file"},{"name":"indexer.php","type":"file"}]},{"name":"mage","type":"file"},{"name":"skin","type":"directory","children":[{"name":"frontend","type":"directory","children":[{"name":"rwd","type":"directory"},{"name":"base","type":"directory"},{"name":"default","type":"directory"}]},{"name":"adminhtml","type":"directory","children":[{"name":"default","type":"directory"}]},{"name":"install","type":"directory","children":[{"name":"default","type":"directory"}]}]},{"name":"install.php","type":"file"},{"name":"app","type":"directory","children":[{"name":"code","type":"directory","children":[{"name":"core","type":"directory"},{"name":"community","type":"directory"}]},{"name":"Mage.php","type":"file"},{"name":".htaccess","type":"file"}...
If I pour 150k of these into TensorFlow, with a second list of the correlating labels for the 22 systems these file systems represent, will it be able to tell me what label a previously unseen JSON list of files and directories has? Will it be able to identify a different Magento installation as Magento1, for example?
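Whether it generalises is hard to promise, but as a rough sketch of the kind of model this data could feed: here each installation's paths are flattened into one whitespace-joined string and turned into a multi-hot bag-of-paths vector. The flattening, token limit and layer sizes are all assumptions of mine, not recommendations from anywhere in this post, and TextVectorization with output_mode="multi_hot" needs a reasonably recent TensorFlow 2.x.
import tensorflow as tf

NUM_CLASSES = 22

# texts:  list of strings, one per installation, e.g. "index.php shell/log.php app/Mage.php ..."
# labels: integer class ids 0..21 for the 22 known systems
def build_model(texts):
    vectorizer = tf.keras.layers.TextVectorization(
        max_tokens=20000, output_mode="multi_hot", split="whitespace")
    vectorizer.adapt(tf.constant(texts))          # learn the vocabulary of path tokens
    model = tf.keras.Sequential([
        vectorizer,                               # string in -> multi-hot vector out
        tf.keras.layers.Dense(256, activation="relu"),
        tf.keras.layers.Dropout(0.3),
        tf.keras.layers.Dense(NUM_CLASSES, activation="softmax"),
    ])
    model.compile(optimizer="adam",
                  loss="sparse_categorical_crossentropy",
                  metrics=["accuracy"])
    return model

# model = build_model(texts)
# model.fit(tf.constant(texts), tf.constant(labels), validation_split=0.1, epochs=5)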

How do you specify output log file using structlog?

I feel like this should be super simple but I cannot figure out how to specify the path for the logfile when using structlog. The documentation states that you can use traditional logging alongside structlog so I tried this:
logger = structlog.getLogger(__name__)
logging.basicConfig(filename=logfile_path, level=logging.ERROR)
logger.error("TEST")
The log file gets created but of course "TEST" doesn't show up inside it. It's just blank.
For structlog log entries to appear in that file, you have to tell structlog to use stdlib logging for output. You can find three different approaches in the docs, depending on your other needs.
I was able to get an example working by following the docs to log to both stdout and a file.
import logging.config
import structlog

timestamper = structlog.processors.TimeStamper(fmt="iso")

logging.config.dictConfig({
    "version": 1,
    "disable_existing_loggers": False,
    "handlers": {
        "default": {
            "level": "DEBUG",
            "class": "logging.StreamHandler",
        },
        "file": {
            "level": "DEBUG",
            "class": "logging.handlers.WatchedFileHandler",
            "filename": "test.log",
        },
    },
    "loggers": {
        "": {
            "handlers": ["default", "file"],
            "level": "DEBUG",
            "propagate": True,
        },
    }
})

structlog.configure(
    processors=[
        structlog.stdlib.add_log_level,
        structlog.stdlib.PositionalArgumentsFormatter(),
        timestamper,
        structlog.processors.StackInfoRenderer(),
        structlog.processors.format_exc_info,
        structlog.stdlib.ProcessorFormatter.wrap_for_formatter,
    ],
    logger_factory=structlog.stdlib.LoggerFactory(),
    wrapper_class=structlog.stdlib.BoundLogger,
    cache_logger_on_first_use=True,
)

structlog.get_logger("test").info("hello")
If you just wanted to log to a file, you could use the snippet hynek suggested.
logging.basicConfig(filename='test.log', encoding='utf-8', level=logging.DEBUG)

Add stdout of subprocess to JSON report if test case fails

I'm investigating methods of adding to the JSON report generated by either pytest-json or pytest-json-report: I'm not hung up on either plugin. So far, I've done the bulk of my evaluation using pytest-json. So, for example, the JSON object has this for a test case
{
  "name": "fixture_test.py::test_failure1",
  "duration": 0.0012421607971191406,
  "run_index": 2,
  "setup": {
    "name": "setup",
    "duration": 0.00011181831359863281,
    "outcome": "passed"
  },
  "call": {
    "name": "call",
    "duration": 0.0008759498596191406,
    "outcome": "failed",
    "longrepr": "def test_failure1():\n> assert 3 == 4, \"3 always equals 3\"\nE AssertionError: 3 always equals 3\nE assert 3 == 4\n\nfixture_test.py:19: AssertionError"
  },
  "teardown": {
    "name": "teardown",
    "duration": 0.00014257431030273438,
    "outcome": "passed"
  },
  "outcome": "failed"
}
This is from experiments I'm trying. In practice, some of the test cases are done by spawning a sub-process via Popen, and the assert is that a certain string appears in the stdout. In the event that the test case fails, I need to add a key/value to the call dictionary which contains the stdout of that subprocess. I have tried in vain thus far to find the correct fixture or apparatus to accomplish this. It seems that pytest_exception_interact may be the way to go, but drilling into the JSON structure has thus far eluded me. All I need to do is add/modify the JSON structure at the point of an error. It seems that pytest_runtest_call is too heavy-handed.
Alternatively, is there a means of altering the value of longrepr in the above? I've been unable to find the correct way of doing either of these and it's time to ask.
As it would appear, the pytest-json project is rather defunct. The developer/owner of pytest-json-report has this to say (under Related Tools):
pytest-json has some great features but appears to be unmaintained. I borrowed some ideas and test cases from there.
The pytest-json-report project handles exactly the case that I'm requiring: capturing stdout from a subprocess and putting it into the JSON report. A crude example of doing so follows:
import subprocess as sp
import re
import sys

import pytest


def specialAssertHandler(output, assertMessage):
    # Because pytest automatically captures stdout/stderr, printing is all that's needed:
    # when the report is generated, this text ends up in a field named "stdout".
    print(output)
    return assertMessage


def test_subProcessStdoutCapture():
    # NOTE: if your version of Python 3 is sufficiently recent, add text=True as well
    proc = sp.Popen(['find', '.', '-name', '*.json'], stdout=sp.PIPE)
    # On Python 3, proc.stdout.read() returns bytes, so decode it; on Python 2
    # (e.g. 2.7.15) it is already a string.
    if sys.version_info[0] >= 3:
        output = proc.stdout.read().decode()
    else:
        output = proc.stdout.read()
    m = re.search('some string', output)
    assert m is not None, specialAssertHandler(output, "did not find 'some string' in output")
With the above, using pytest-json-report, the full output of the subprocess is captured by the infrastructure and placed into the aforementioned report. An excerpt showing this is below:
{
  "nodeid": "expirment_test.py::test_stdout",
  "lineno": 25,
  "outcome": "failed",
  "keywords": [
    "PyTest",
    "test_stdout",
    "expirment_test.py"
  ],
  "setup": {
    "duration": 0.0002694129943847656,
    "outcome": "passed"
  },
  "call": {
    "duration": 0.02718186378479004,
    "outcome": "failed",
    "crash": {
      "path": "/home/afalanga/devel/PyTest/expirment_test.py",
      "lineno": 32,
      "message": "AssertionError: Expected to find always\nassert None is not None"
    },
    "traceback": [
      {
        "path": "expirment_test.py",
        "lineno": 32,
        "message": "AssertionError"
      }
    ],
    "stdout": "./.report.json\n./report.json\n./report1.json\n./report2.json\n./simple_test.json\n./testing_addition.json\n\n",
    "longrepr": "..."
  },
  "teardown": {
    "duration": 0.0004875659942626953,
    "outcome": "passed"
  }
}
The field longrepr holds the full text of the test case, but in the interest of brevity it is shown here as an ellipsis. In the field crash, the value of assertMessage from my example is placed. This shows that it is possible to place such messages into the report at the point of occurrence instead of in post-processing.
I think it may be possible to "cleverly" handle this using the hook I referenced in my original question, pytest_exception_interact. If I find that it is, I'll update this answer with a demonstration.
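For reference, a rough sketch of what that hook-based approach might look like in conftest.py. The attribute name and section title here are hypothetical, and whether pytest-json-report copies extra report.sections entries into its JSON output is something that would still need to be verified.
# conftest.py
def pytest_exception_interact(node, call, report):
    # Runs when a test raises; only act on failures from the test call itself.
    if getattr(report, "when", None) == "call" and report.failed:
        extra = getattr(node, "_subprocess_output", None)   # hypothetical: set by the test
        if extra is not None:
            # report.sections is pytest's standard place for attaching extra text blocks.
            report.sections.append(("subprocess stdout", extra))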

How do I resolve a single reference with Python jsonschema RefResolver

I am writing Python code to validate a .csv file using a JSON schema and the jsonschema Python module. I have a clinical manifest schema that looks like this:
{
  "$schema": "http://json-schema.org/draft-07/schema#",
  "$id": "http://example.com/veoibd_schema.json",
  "title": "clinical data manifest schema",
  "description": "Validates clinical data manifests",
  "type": "object",
  "properties": {
    "individualID": {
      "type": "string",
      "pattern": "^CENTER-"
    },
    "medicationAtDx": {
      "$ref": "https://raw.githubusercontent.com/not-my-username/validation_schemas/reference_definitions/clinicalData.json#/definitions/medicationAtDx"
    }
  },
  "required": [
    "individualID",
    "medicationAtDx"
  ]
}
The schema referenced by the $ref looks like this:
{
  "$schema": "http://json-schema.org/draft-07/schema#",
  "$id": "http://example.com/clinicalData.json",
  "definitions": {
    "ageDxYears": {
      "description": "Age in years at diagnosis",
      "type": "number",
      "minimum": 0,
      "maximum": 90
    },
    "ageOnset": {
      "description": "Age in years of first symptoms",
      "type": "number",
      "exclusiveMinimum": 0
    },
    "medicationAtDx": {
      "description": "Medication prescribed at diagnosis",
      "type": "string"
    }
  }
}
(Note that both schemas are quite a bit larger and have been edited for brevity.)
I need to be able to figure out the "type" of "medicationAtDx" and am trying to work out how to use jsonschema.RefResolver to dereference it, but I am a little lost in the terminology used in the documentation and can't find a good example that explains what the parameters are and what it returns "in small words", i.e. something that a beginning JSON Schema user would easily understand.
I created a RefResolver from the clinical manifest schema:
import jsonschema
testref = jsonschema.RefResolver.from_schema(clin_manifest_schema)
I fed it the url in the "$ref":
meddx_url = "https://raw.githubusercontent.com/not-my-username/validation_schemas/reference_definitions/clinicalData.json#/definitions/medicationAtDx"
testref.resolve_remote(meddx_url)["definitions"].keys()
What I was expecting to get back was:
dict_keys(['medicationAtDx'])
What I actually got back was:
dict_keys(['ageDxYears', 'ageOnset', 'medicationAtDx'])
Is this the expected behavior? If not, how can I narrow it down to just the definition for "medicationAtDx"? I can traverse the whole dictionary to get what I want if I have to, but I'd rather have it return just the reference I need.
Thanks in advance!
ETA: per Relequestual's comment below, I took a couple of passes with resolve_fragment as follows:
ref_doc = meddx_url.split("#")[0]
ref_frag = meddx_url.split("#")[1]
testref.resolve_fragment(ref_doc, ref_frag)
This gives me "TypeError: string indices must be integers" and "RefResolutionError: Unresolvable JSON pointer". I tried tweaking the parameters in different ways (adding the "#" back into the fragment, removing the leading slash, etc.) and got the same results. Relequestual's explanation of a fragment was very helpful, but apparently I'm still not understanding the exact parameters that resolve_fragment is expecting.
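For anyone comparing approaches, a hedged sketch of the two-step resolution the (since-deprecated) RefResolver API appears to expect: resolve_fragment takes the already-fetched document plus the pointer fragment, while resolver.resolve() does both steps in one call. This assumes the $ref URL is actually fetchable, which the placeholder URL in this question is not.
import jsonschema

resolver = jsonschema.RefResolver.from_schema(clin_manifest_schema)
meddx_url = ("https://raw.githubusercontent.com/not-my-username/validation_schemas/"
             "reference_definitions/clinicalData.json#/definitions/medicationAtDx")

doc_url, fragment = meddx_url.split("#", 1)
remote_doc = resolver.resolve_remote(doc_url)                 # whole clinicalData.json as a dict
definition = resolver.resolve_fragment(remote_doc, fragment)  # just the medicationAtDx node
print(definition["type"])                                     # expected: "string"

# Equivalent one-liner: resolve() fetches the document and walks the fragment.
# url, definition = resolver.resolve(meddx_url)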

Python: Access a dictionary that is located inside of a text file

I am working on a quick and dirty script to get Chromium's bookmarks and turn them into a pipe menu for Openbox. Chromium stores its bookmarks in a file called Bookmarks that stores information in dictionary form like this:
{
  "checksum": "99999999999999999999",
  "roots": {
    "bookmark_bar": {
      "children": [ {
        "date_added": "9999999999999999999",
        "id": "9",
        "name": "Facebook",
        "type": "url",
        "url": "http://www.facebook.com/"
      }, {
        "date_added": "999999999999",
        "id": "9",
        "name": "Twitter",
        "type": "url",
        "url": "http://twitter.com/"
How would I open the dictionary in this file in Python and assign it to a variable? I know you open a file with open(), but I don't really know where to go from there. In the end, I want to be able to access the info in the dictionary from a variable, something like bookmarks['roots']['bookmark_bar']['children'][0]['name'], and have it return 'Facebook'.
Do you know if this is a JSON file? If so, Python provides a json library.
JSON can be used as a data serialization/interchange format. It's nice because it's cross-platform. Importing this like you ask is fairly easy; here's an example from the docs:
>>> import json
>>> json.loads('["foo", {"bar":["baz", null, 1.0, 2]}]')
[u'foo', {u'bar': [u'baz', None, 1.0, 2]}]
So in your case it would look something like:
import json

with open('file.txt') as f:
    text = f.read()

bookmarks = json.loads(text)
print(bookmarks['roots']['bookmark_bar']['children'][0]['name'])
JSON is definitely the "right" way to do this, but for a quick-and-dirty script eval() might suffice:
with open('file.txt') as f:
    bookmarks = eval(f.read())
