I am very new to the world of machine learning, and up until now I've only done tutorials and example projects on the topic. I'm currently working on my first actual project and (hopefully apt) application of machine learning for my purpose.
The purpose is fairly simple: I want to automatically detect the name of a known content management system (WordPress, TYPO3, Joomla!, Drupal, ...), shop system (Shopware 5, Shopware 6, Magento 2, ...) or framework (Symfony, Laravel, ...) using only a list of files and directories from the document root of one of these systems, with a max depth of 3 directory levels.
I have nearly 150k installations of 22 different systems, and I know which system each of them is. Sadly, my toolset on the web servers where those installations live is limited. With this information, my thought process went like this:
If I get a file list of each installation's document root plus 2 directory levels recursively (with unix tools, mostly find) and trim the output by filtering out known big directories like caches, media storage, libraries and so on, I should get a list of 30 to 80 files and directories that should be iconic for the system at hand.
If I then use Python to parse this list as JSON, I should be able to compile nearly 150k example file-list JSONs, each tagged with its corresponding software. With this I should be able to train a TF model.
Afterwards, the model should be able to tell me which software is in use when I give it only a JSON of the directory list of the document root of an unknown system (that is one of the 22 known systems), created in the same way as the training data.
My focus is not the performance of gathering the training data, nor the performance of the training itself. I only want a fast way to later identify a single system based on its directory and file list.
Now my problem is that I somewhat lack the experience to know whether this is going to work in the first place, whether the JSON needs a certain format to work with TF or Keras, and whether this is even the most efficient way to do it.
Any advice, hint, or link to comparable project documentation or helpful docs in general is very welcome!
EDIT: A small addition to make it a bit more transparent.
PHP-based content management and shop systems have vaguely similar file structures, but each one does what it wants, so the biggest similarities are: a file that is served to the Apache web server, a configuration file containing the database credentials, and directories for themes, frontend, backend, libraries/plugins and so on.
Naming conventions, order and hierarchy vary between the different systems.
So let's use an example.
Magento 1: My initial thought, as described above, was to get a list of files with a simple find command with a maxdepth of 3 that also outputs whether each entry is a file (";f") or a directory (";d"), and to use grep afterwards to filter out the more redundant stuff like cache contents, user uploads and so on.
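Since the later steps happen in Python anyway, the same listing could also be produced there instead of with find and grep. A minimal sketch (the SKIP names and the document root path are placeholders, not my real filter list):

import os

SKIP = {"var", "media", "vendor"}  # placeholder names for the noisy directories

def listing(root, max_depth=3):
    """Yield lines like './app/code;d' or './index.php;f', mimicking the find output."""
    for dirpath, dirnames, filenames in os.walk(root):
        rel = os.path.relpath(dirpath, root)
        depth = 0 if rel == "." else rel.count(os.sep) + 1
        prefix = "." if rel == "." else f"./{rel}"
        keep = [d for d in dirnames if d not in SKIP]
        for d in keep:
            yield f"{prefix}/{d};d"
        for f in filenames:
            yield f"{prefix}/{f};f"
        # descend only while the children would still be within max_depth
        dirnames[:] = keep if depth < max_depth - 1 else []

for line in listing("/var/www/html"):  # hypothetical document root
    print(line)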
From this I get a list like this:
./index.php;f
./get.php;f
./cron.php;f
./api.php;f
./shell;d
./shell/log.php;f
./shell/compiler.php;f
./shell/abstract.php;f
./shell/.htaccess;f
./shell/indexer.php;f
./mage;f
./skin;d
./skin/frontend;d
./skin/frontend/rwd;d
./skin/frontend/base;d
./skin/frontend/default;d
./skin/adminhtml;d
./skin/adminhtml/default;d
./install;d
./install/default;d
./install.php;f
./app;d
./app/code;d
./app/code/core;d
./app/code/community;d
./app/Mage.php;f
./app/.htaccess;f
...
Parsed as JSON, I imagined it as something like this:
[
  {
    "name": "index.php",
    "type": "file"
  },
  {
    "name": "get.php",
    "type": "file"
  },
  {
    "name": "cron.php",
    "type": "file"
  },
  {
    "name": "api.php",
    "type": "file"
  },
  {
    "name": "shell",
    "type": "directory",
    "children": [
      {
        "name": "log.php",
        "type": "file"
      },
      {
        "name": "compiler.php",
        "type": "file"
      },
      {
        "name": "abstract.php",
        "type": "file"
      },
      {
        "name": ".htaccess",
        "type": "file"
      },
      {
        "name": "indexer.php",
        "type": "file"
      }
    ]
  },
  {
    "name": "mage",
    "type": "file"
  },
  {
    "name": "skin",
    "type": "directory",
    "children": [
      {
        "name": "frontend",
        "type": "directory",
        "children": [
          {
            "name": "rwd",
            "type": "directory"
          },
          {
            "name": "base",
            "type": "directory"
          },
          {
            "name": "default",
            "type": "directory"
          }
        ]
      },
      {
        "name": "adminhtml",
        "type": "directory",
        "children": [
          {
            "name": "default",
            "type": "directory"
          }
        ]
      },
      {
        "name": "install",
        "type": "directory",
        "children": [
          {
            "name": "default",
            "type": "directory"
          }
        ]
      }
    ]
  },
  {
    "name": "install.php",
    "type": "file"
  },
  {
    "name": "app",
    "type": "directory",
    "children": [
      {
        "name": "code",
        "type": "directory",
        "children": [
          {
            "name": "core",
            "type": "directory"
          },
          {
            "name": "community",
            "type": "directory"
          }
        ]
      },
      {
        "name": "Mage.php",
        "type": "file"
      },
      {
        "name": ".htaccess",
        "type": "file"
      },
      ...
If I minify this, I get a JSON in the form of something like this:
[{"name":"index.php","type":"file"},{"name":"get.php","type":"file"},{"name":"cron.php","type":"file"},{"name":"api.php","type":"file"},{"name":"shell","type":"directory","children":[{"name":"log.php","type":"file"},{"name":"compiler.php","type":"file"},{"name":"abstract.php","type":"file"},{"name":".htaccess","type":"file"},{"name":"indexer.php","type":"file"}]},{"name":"mage","type":"file"},{"name":"skin","type":"directory","children":[{"name":"frontend","type":"directory","children":[{"name":"rwd","type":"directory"},{"name":"base","type":"directory"},{"name":"default","type":"directory"}]},{"name":"adminhtml","type":"directory","children":[{"name":"default","type":"directory"}]},{"name":"install","type":"directory","children":[{"name":"default","type":"directory"}]}]},{"name":"install.php","type":"file"},{"name":"app","type":"directory","children":[{"name":"code","type":"directory","children":[{"name":"core","type":"directory"},{"name":"community","type":"directory"}]},{"name":"Mage.php","type":"file"},{"name":".htaccess","type":"file"}...
If I pour 150k of these into TensorFlow, together with a second list of correlating labels for the 22 systems these filesystems represent, will it be able to tell me which label a previously unseen JSON list of files and directories has? Will it be able to identify a different Magento installation as Magento1, for example?
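To make that concrete, here is the rough pipeline I am imagining, as a sketch only: flatten each tree into full path strings, multi-hot encode them, and train a small classifier. The toy data, layer sizes and label ids below are made up:

import json
import tensorflow as tf

def flatten(entries, prefix=""):
    # turn the nested tree into flat path strings like "app/code/core"
    paths = []
    for e in entries:
        path = prefix + e["name"]
        paths.append(path)
        paths.extend(flatten(e.get("children", []), path + "/"))
    return paths

# toy stand-ins for the ~150k collected lists and their labels
trees = [
    '[{"name":"app","type":"directory","children":[{"name":"Mage.php","type":"file"}]}]',
    '[{"name":"wp-admin","type":"directory"},{"name":"wp-load.php","type":"file"}]',
]
samples = tf.constant([" ".join(flatten(json.loads(t))) for t in trees])
labels = tf.constant([0, 1])  # e.g. 0 = Magento1, 1 = WordPress

vectorize = tf.keras.layers.TextVectorization(
    max_tokens=20000, standardize=None, split="whitespace", output_mode="multi_hot")
vectorize.adapt(samples)

model = tf.keras.Sequential([
    vectorize,
    tf.keras.layers.Dense(128, activation="relu"),
    tf.keras.layers.Dense(22, activation="softmax"),  # one output per known system
])
model.compile(optimizer="adam", loss="sparse_categorical_crossentropy", metrics=["accuracy"])
model.fit(samples, labels, epochs=5)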
Is it possible to have field names that do not conform to Python variable naming rules? To elaborate, is it possible to have the field name "Job Title" instead of "job_title" in the export file? While this may not be useful in JSON or XML exports, such functionality might be useful when exporting in CSV format, for instance if I need to use this data to import into another system that is already configured to accept CSVs with certain field names.
I tried reading the Item Pipelines documentation, but it appears to cover what happens after "an item has been scraped by a spider", not the field names themselves (I could be totally wrong though).
Any help in this direction would be really helpful!
I would suggest you use a third-party lib called scrapy-jsonschema. With that you can define your Items like this:
from scrapy_jsonschema.item import JsonSchemaItem

class MyItem(JsonSchemaItem):
    jsonschema = {
        "$schema": "http://json-schema.org/draft-04/schema#",
        "title": "MyItem",
        "description": "My Item with spaces",
        "type": "object",
        "properties": {
            "id": {
                "description": "The unique identifier for the employee",
                "type": "integer"
            },
            "name": {
                "description": "Name of the employee",
                "type": "string"
            },
            "job title": {
                "description": "The title of employee's job.",
                "type": "string"
            }
        },
        "required": ["id", "name", "job title"]
    }
And populate it like this:
item = MyItem()
item['job title'] = 'Boss'
You can read more about it here.
This solution addresses the Item definition as you asked, but you can achieve similar results without defining an Item. For example, you could just scrape the data into a dict and yield it back to Scrapy:
yield {
    "id": response.xpath('...').get(),
    "name": response.xpath('...').get(),
    "job title": response.xpath('...').get(),
}
Running scrapy crawl myspider -o file.csv would then scrape into a CSV, and the columns will have the names you chose.
You could also have the spider (or its pipeline) write the CSV directly, etc. There are several ways to do it without an Item definition.
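For the pipeline route, a minimal sketch of what that could look like (the class and file names are mine; it would be enabled via ITEM_PIPELINES in settings.py):

import csv

class CsvExportPipeline:
    # writes every scraped item to a CSV whose headers the target system expects

    def open_spider(self, spider):
        self.file = open("employees.csv", "w", newline="")
        self.writer = csv.DictWriter(self.file, fieldnames=["id", "name", "Job Title"])
        self.writer.writeheader()

    def close_spider(self, spider):
        self.file.close()

    def process_item(self, item, spider):
        self.writer.writerow({
            "id": item.get("id"),
            "name": item.get("name"),
            "Job Title": item.get("job title"),  # remap the scraped key to the CSV header
        })
        return item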
I want to use the pandas.read_json() function to import data into a pandas DataFrame, and I'm going to use table for the orient parameter in order to be able to provide data type information. In this case the input JSON has a schema property which can be used to specify input metadata, like this:
{
  "schema": {
    "fields": [
      {
        "name": "col1",
        "type": "integer"
      },
      {
        "name": "col2",
        "type": "string"
      },
      {
        "name": "col3",
        "type": "integer"
      }
    ],
    "primaryKey": [
      "col1"
    ]
  },
  "data": [...]
}
However, the pandas documentation (section "Encoding with Table Schema") does not elaborate on what kind of type specifications I can use. The "schema" property name suggests that maybe the type specs from JSON Schema are supported? Can anyone confirm this, or otherwise provide info on the supported type specs?
The table schema that applies for orient='table' in both pandas.read_json() and DataFrame.to_json() is documented here.
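A quick way to see which type names pandas itself emits (and accepts back) is to round-trip a frame; a small sketch with column names matching the example above:

import io
import pandas as pd

df = pd.DataFrame(
    {"col1": [1, 2], "col2": ["a", "b"], "col3": [10, 20]}
).set_index("col1")

payload = df.to_json(orient="table")             # produces a "schema" block like the one above
restored = pd.read_json(io.StringIO(payload), orient="table")
print(restored.dtypes)                           # "integer"/"string" map back to int64/object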
I have this JSON:
{
  "app_name": "my_app",
  "version": {
    "1.0": {
      "path": "/my_app/1.0"
    },
    "2.0": {
      "path": "/my_app/2.0"
    }
  }
}
Is it somehow possible to reference the key app_name and the keys under version so that I don't have to repeat "my_app" and the version numbering?
I was thinking of something along these lines (code totally made up):
{
  "#app_name": "my_app",
  "version": {
    "1.0": {
      "path": "/{{$app_name}}/{{key[-1]}}"
    },
    "2.0": {
      "path": "/{{$app_name}}/{{key[-1]}}"
    }
  }
}
Or is this something that could instead be handled better using YAML?
In the end, I intend to read this data into a Python dictionary.
No, JSON does not have references. (The functionality requested here, with string expansion, would open the parser up to memory-exhaustion attacks; by not supporting such functionality, JSON avoids that vulnerability.)
If you want such functionality, you need to implement it yourself.
Not in pure JSON, but you could perform string substitution after you parse the JSON.
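For instance, a minimal post-parse substitution in Python (the {app_name}/{version} placeholder syntax here is invented, not anything standard):

import json

raw = '''
{
  "app_name": "my_app",
  "version": {
    "1.0": {"path": "/{app_name}/{version}"},
    "2.0": {"path": "/{app_name}/{version}"}
  }
}
'''

data = json.loads(raw)
for ver, entry in data["version"].items():
    # expand the placeholders using the surrounding keys
    entry["path"] = entry["path"].format(app_name=data["app_name"], version=ver)

print(data["version"]["2.0"]["path"])  # /my_app/2.0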
Is there a Python library for converting a JSON schema to a Python class definition, similar to jsonschema2pojo (https://github.com/joelittlejohn/jsonschema2pojo) for Java?
So far the closest thing I've been able to find is warlock, which advertises this workflow:
Build your schema
>>> schema = {
...     'name': 'Country',
...     'properties': {
...         'name': {'type': 'string'},
...         'abbreviation': {'type': 'string'},
...     },
...     'additionalProperties': False,
... }
Create a model
>>> import warlock
>>> Country = warlock.model_factory(schema)
Create an object using your model
>>> sweden = Country(name='Sweden', abbreviation='SE')
However, it's not quite that easy. The objects that warlock produces lack much in the way of introspectable goodies. And if it supports nested dicts at initialization, I was unable to figure out how to make them work.
To give a little background, the problem I was working on was how to take Chrome's JSONSchema API and produce a tree of request generators and response handlers. Warlock doesn't seem too far off the mark; the only downside is that metaclasses in Python can't really be turned into 'code'.
Other useful modules to look for:
jsonschema - (which Warlock is built on top of)
valideer - similar to jsonschema but with a worse name.
bunch - An interesting structure builder that's halfway between a dotdict and construct
If you end up finding a good one-stop solution for this, please follow up on your question - I'd love to find one. I pored through GitHub, PyPI, Google Code, SourceForge, etc., and just couldn't find anything really sexy.
For lack of any pre-made solutions, I'll probably cobble together something with Warlock myself. So if I beat you to it, I'll update my answer. :p
python-jsonschema-objects is an alternative to warlock, built on top of jsonschema.
python-jsonschema-objects provides an automatic class-based binding to JSON schemas for use in Python.
Usage:
Sample JSON schema:
schema = '''{
    "title": "Example Schema",
    "type": "object",
    "properties": {
        "firstName": {
            "type": "string"
        },
        "lastName": {
            "type": "string"
        },
        "age": {
            "description": "Age in years",
            "type": "integer",
            "minimum": 0
        },
        "dogs": {
            "type": "array",
            "items": {"type": "string"},
            "maxItems": 4
        },
        "gender": {
            "type": "string",
            "enum": ["male", "female"]
        },
        "deceased": {
            "enum": ["yes", "no", 1, 0, "true", "false"]
        }
    },
    "required": ["firstName", "lastName"]
}'''
Converting the schema object to a class:
>>> import python_jsonschema_objects as pjs
>>> import json
>>> schema = json.loads(schema)
>>> builder = pjs.ObjectBuilder(schema)
>>> ns = builder.build_classes()
>>> Person = ns.ExampleSchema
>>> james = Person(firstName="James", lastName="Bond")
>>> james.lastName
u'Bond'
>>> james
<example_schema lastName=Bond age=None firstName=James>
Validation:
>>> james.age = -2
python_jsonschema_objects.validators.ValidationError: -2 was less or equal to than 0
The problem is that it still uses Draft 4 validation while jsonschema has moved past Draft 4; I filed an issue on the repo regarding this.
Unless you are using an old version of jsonschema, the above package will work as shown.
I just created this small project to generate code classes from a JSON schema. Even though it deals with Python, I think it can be useful when working on business projects:
pip install jsonschema2popo
Running the following command will generate a Python module containing the JSON-schema-defined classes (it uses Jinja2 templating):
jsonschema2popo -o /path/to/output_file.py /path/to/json_schema.json
More info at: https://github.com/frx08/jsonschema2popo