Preventing libtorrent from creating torrents with the attr attribute - python

I'm trying to make a Python app that creates torrent files from other files. I need this app to always produce the same infohash for any given file content (including files with the same content but different names), so I create a symlink named "blob" before creating the torrent, and that works. The only problem I'm facing is that if the file has executable permissions, libtorrent writes the "attr" key into the bencoded torrent, changing its infohash.
I found this discussion by looking for this attribute: https://libtorrent-discuss.narkive.com/bXlfCX2K/libtorrent-extra-elements-in-the-torrent-files-array
The code I'm using is this:
import libtorrent as lt
import os

os.symlink(file, "blob")
fs = lt.file_storage()
lt.add_files(fs, "blob")
t = lt.create_torrent(fs)
lt.set_piece_hashes(t, ".")
os.remove("blob")
t.add_tracker("udp://tracker.openbittorrent.com:6969/announce", 0)
t.set_creator('libtorrent')
torrent = lt.bencode(t.generate())
with open("file.torrent", "wb") as f:
    f.write(torrent)
The bencoded torrent, shown in JSON form:
{
  "announce": "udp://tracker.openbittorrent.com:6969/announce",
  "announce-list": [
    [
      "udp://tracker.openbittorrent.com:6969/announce"
    ]
  ],
  "created by": "libtorrent",
  "creation date": 1650389319,
  "info": {
    "attr": "x",
    "file tree": {
      "blob": {
        "": {
          "attr": "x",
          "length": 3317054,
          "pieces root": "<hex>...</hex>"
        }
      }
    },
    "length": 3317054,
    "meta version": 2,
    "name": "blob",
    "piece length": 32768,
    "pieces": "<hex>...</hex>"
  },
  "piece layers": {
    "<hex>...</hex>": "<hex>...</hex>"
  }
}
Is there any way to tell libtorrent to omit this attr key when creating the torrent, to ensure it always produces the same infohash?
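One workaround, if the "attr" key is driven purely by the file's execute bits (which the linked discussion suggests): temporarily clear the execute bits before building the torrent and restore them afterwards. A minimal stdlib sketch; strip_exec and restore_mode are my own helper names, not libtorrent API:

```python
import os
import stat

def strip_exec(path):
    """Clear the execute bits so libtorrent has no reason to emit 'attr': 'x'.
    Returns the original mode so it can be restored afterwards."""
    mode = os.stat(path).st_mode
    os.chmod(path, mode & ~(stat.S_IXUSR | stat.S_IXGRP | stat.S_IXOTH))
    return mode

def restore_mode(path, mode):
    """Put the original permissions back after the torrent is generated."""
    os.chmod(path, mode)
```

Call strip_exec(file) before lt.add_files and restore_mode(file, mode) after t.generate(); since the symlink points at the original file, chmod through either name affects the same inode.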


Content Management System, Shop System and Framework recognition with TensorFlow/Keras

I am very new to the world of machine learning, and up until now I've only done tutorials and example projects on this topic. I'm currently working on my first real project and (hopefully apt) implementation of machine learning for my purpose.
The purpose is fairly simple: I want to automatically detect the name of a known content management system (WordPress, TYPO3, Joomla!, Drupal, ...), shop system (Shopware 5, Shopware 6, Magento 2, ...) or framework (Symfony, Laravel, ...) using only a list of files and directories from the document root of one of these systems, with a max depth of 3 directories.
I have nearly 150k installations of 22 different systems for which I know what system they are. Sadly, my toolset on the web servers where those installations live is limited. With this information, my thought process was:
If I get a file list of each installation's document root plus two directory levels recursively (with Unix tools, mostly find) and optimize the output by filtering out known big directories like cache, media storage, libraries, ..., I should get a list of 30 to 80 files and directories that should be iconic for the system at hand.
If I then use Python to parse this list as JSON, I should be able to compile nearly 150k example file-list JSONs, each tagged with its corresponding software. With this I should be able to train a TF model.
Afterwards, the model should be able to tell me what software is in use when I give it only a JSON of the directory list of the document root of an unknown system (one of the 22 known systems), created in the same way as the training data.
My focus is not the performance of gathering the training data or of the training itself. I only want a fast way to later identify a single system based on the directory and file list.
Now my problem is that I lack the experience to know whether this will work in the first place, whether the JSON needs a certain format to work with TF or Keras, and whether this is even the most efficient way to do it.
Any advice, hints, links to comparable project documentation, or helpful docs in general are very welcome!
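On the format question: TF/Keras ultimately wants fixed-length numeric tensors, not nested JSON, so one common approach is to flatten each installation to its set of paths and multi-hot encode it against a vocabulary of "iconic" paths collected from the training data. A sketch under that assumption (the vocabulary and function name are illustrative, not any library's API):

```python
def multi_hot(vocab, observed_paths):
    """Turn the set of paths observed in one installation into a
    fixed-length 0/1 vector over a vocabulary fixed at training time."""
    present = set(observed_paths)
    return [1.0 if path in present else 0.0 for path in vocab]

# Example: three-entry vocabulary, two paths observed.
vocab = ["./app/Mage.php", "./wp-login.php", "./index.php"]
vector = multi_hot(vocab, ["./index.php", "./app/Mage.php"])
# vector == [1.0, 0.0, 1.0]
```

A stack of Dense layers with a 22-way softmax on top of such vectors is a reasonable baseline before trying anything that models the tree structure itself.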
EDIT: A small addition to make it a little more transparent.
PHP-based content management and shop systems have a vaguely similar file structure, but each does what it wants, so the biggest similarities are a file that is served to the Apache web server, a configuration file containing the database credentials, and directories for themes, frontend, backend, libraries/plugins, and so on.
Naming conventions, order, and hierarchy vary between the different systems.
So let's use an example.
Magento 1: my initial thought, as described above, was to get a list of files with a simple find command with a max depth of 3 that also outputs whether each entry is a file (";f") or a directory (";d"), then use grep to filter out the more redundant stuff like cache contents, user uploads, and so on.
From this I get a list like this:
./index.php;f
./get.php;f
./cron.php;f
./api.php;f
./shell;d
./shell/log.php;f
./shell/compiler.php;f
./shell/abstract.php;f
./shell/.htaccess;f
./shell/indexer.php;f
./mage;f
./skin;d
./skin/frontend;d
./skin/frontend/rwd;d
./skin/frontend/base;d
./skin/frontend/default;d
./skin/adminhtml;d
./skin/adminhtml/default;d
./install;d
./install/default;d
./install.php;f
./app;d
./app/code;d
./app/code/core;d
./app/code/community;d
./app/Mage.php;f
./app/.htaccess;f
...
Parsed as JSON, I imagined it as something like this:
[
  {
    "name": "index.php",
    "type": "file"
  },
  {
    "name": "get.php",
    "type": "file"
  },
  {
    "name": "cron.php",
    "type": "file"
  },
  {
    "name": "api.php",
    "type": "file"
  },
  {
    "name": "shell",
    "type": "directory",
    "children": [
      {
        "name": "log.php",
        "type": "file"
      },
      {
        "name": "compiler.php",
        "type": "file"
      },
      {
        "name": "abstract.php",
        "type": "file"
      },
      {
        "name": ".htaccess",
        "type": "file"
      },
      {
        "name": "indexer.php",
        "type": "file"
      }
    ]
  },
  {
    "name": "mage",
    "type": "file"
  },
  {
    "name": "skin",
    "type": "directory",
    "children": [
      {
        "name": "frontend",
        "type": "directory",
        "children": [
          {
            "name": "rwd",
            "type": "directory"
          },
          {
            "name": "base",
            "type": "directory"
          },
          {
            "name": "default",
            "type": "directory"
          }
        ]
      },
      {
        "name": "adminhtml",
        "type": "directory",
        "children": [
          {
            "name": "default",
            "type": "directory"
          }
        ]
      },
      {
        "name": "install",
        "type": "directory",
        "children": [
          {
            "name": "default",
            "type": "directory"
          }
        ]
      }
    ]
  },
  {
    "name": "install.php",
    "type": "file"
  },
  {
    "name": "app",
    "type": "directory",
    "children": [
      {
        "name": "code",
        "type": "directory",
        "children": [
          {
            "name": "core",
            "type": "directory"
          },
          {
            "name": "community",
            "type": "directory"
          }
        ]
      },
      {
        "name": "Mage.php",
        "type": "file"
      },
      {
        "name": ".htaccess",
        "type": "file"
      },
      ...
If I minify this, I have a JSON in the form of something like this:
[{"name":"index.php","type":"file"},{"name":"get.php","type":"file"},{"name":"cron.php","type":"file"},{"name":"api.php","type":"file"},{"name":"shell","type":"directory","children":[{"name":"log.php","type":"file"},{"name":"compiler.php","type":"file"},{"name":"abstract.php","type":"file"},{"name":".htaccess","type":"file"},{"name":"indexer.php","type":"file"}]},{"name":"mage","type":"file"},{"name":"skin","type":"directory","children":[{"name":"frontend","type":"directory","children":[{"name":"rwd","type":"directory"},{"name":"base","type":"directory"},{"name":"default","type":"directory"}]},{"name":"adminhtml","type":"directory","children":[{"name":"default","type":"directory"}]},{"name":"install","type":"directory","children":[{"name":"default","type":"directory"}]}]},{"name":"install.php","type":"file"},{"name":"app","type":"directory","children":[{"name":"code","type":"directory","children":[{"name":"core","type":"directory"},{"name":"community","type":"directory"}]},{"name":"Mage.php","type":"file"},{"name":".htaccess","type":"file"}...
If I pour 150k of these into TensorFlow, with a second list of correlating labels for the 22 systems these file systems represent, will it be able to tell me what label a previously unseen JSON list of files and directories has? Will it be able to identify a different Magento installation as Magento 1, for example?
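The find-output-to-JSON step described above can be sketched in a few lines. This assumes find prints each directory before its contents (which it does by default) and the ";f"/";d" suffix format from the question:

```python
def parse_listing(lines):
    """Build a nested name/type/children tree from find-style output
    where each line looks like './path/to/entry;f' or './path;d'."""
    root = []
    index = {(): root}  # parent path tuple -> that parent's children list
    for line in lines:
        path, _, kind = line.strip().partition(";")
        parts = tuple(p for p in path.split("/") if p not in ("", "."))
        node = {"name": parts[-1],
                "type": "file" if kind == "f" else "directory"}
        if kind == "d":
            node["children"] = []
            index[parts] = node["children"]
        index[parts[:-1]].append(node)  # attach to the parent directory
    return root
```

json.dumps(parse_listing(lines)) then yields exactly the minified form shown above.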

Extraction of data from PDF

I am developing a project as part of an internship at a research institute.
For that I have to feed my SQL database with data from PDF files like the one below:
https://www.mairie-orly.fr/content/download/11449/88884/file/Menu+du+mois+de+janvier+2021.pdf
I currently convert my PDF files to JSON via the pdftotext API; this is the best JSON output I have found so far. Here is the result:
{
  "document": {
    "page": {
      "#index": "0",
      "row": [
        {
          "column": [
            {
              "text": ""
            },
            {
              "text": ""
            },
            {
              "text": {
                "#fontName": "Arial",
                "#fontSize": "10.0",
                "#fontStyle": "Bold",
                "#x": "271.10",
                "#y": "101.78",
                "#width": "29.40",
                "#height": "9.72",
                "#text": "LUNDI"
              }
            },
            {
              "text": {
                "#fontName": "Arial",
                "#fontSize": "10.0",
                "#fontStyle": "Bold",
                "#x": "476.09",
                "#y": "101.78",
                "#width": "31.26",
                "#height": "9.72",
                "#text": "MARDI"
              }
            }
etc....etc...etc...
I need to feed my database with a menu and a corresponding date.
Maybe my conversion is not adapted to the problem?
How can I retrieve a menu and the corresponding date? Which path should I follow?
Thanks
If you want to extract the data from the PDF, you can try making a request with urllib and retrieving the text with Apache Tika. With urllib you can save the file locally and then get the text from it like this:
import urllib.request
from tika import parser
url = 'https://www.mairie-orly.fr/content/download/11449/88884/file/Menu+du+mois+de+janvier+2021.pdf'
urllib.request.urlretrieve(url, './a.pdf')
file_data = parser.from_file('./a.pdf')
text = file_data['content']
#print for debug
print(text)
For Tika you need to install Java (latest version) and pip install tika.
Hope it helps!
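Once you have the text, the "feed my SQL database" step can be a simple parameterized insert. A sketch with SQLite from the standard library; the table name, columns, and the example (date, menu) pair are made up for illustration:

```python
import sqlite3

conn = sqlite3.connect(":memory:")  # swap for your real database connection
conn.execute("CREATE TABLE IF NOT EXISTS menus (menu_date TEXT, menu TEXT)")

# One (date, menu) pair as it might be extracted from the PDF text.
conn.execute("INSERT INTO menus VALUES (?, ?)",
             ("2021-01-04", "Potage, poisson, fruit"))
conn.commit()
```

Using ? placeholders rather than string formatting keeps the menu text from breaking the SQL (and avoids injection issues).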

How do I resolve a single reference with Python jsonschema RefResolver

I am writing Python code to validate a .csv file using a JSON schema and the jsonschema Python module. I have a clinical manifest schema that looks like this:
{
  "$schema": "http://json-schema.org/draft-07/schema#",
  "$id": "http://example.com/veoibd_schema.json",
  "title": "clinical data manifest schema",
  "description": "Validates clinical data manifests",
  "type": "object",
  "properties": {
    "individualID": {
      "type": "string",
      "pattern": "^CENTER-"
    },
    "medicationAtDx": {
      "$ref": "https://raw.githubusercontent.com/not-my-username/validation_schemas/reference_definitions/clinicalData.json#/definitions/medicationAtDx"
    }
  },
  "required": [
    "individualID",
    "medicationAtDx"
  ]
}
The schema referenced by the $ref looks like this:
{
  "$schema": "http://json-schema.org/draft-07/schema#",
  "$id": "http://example.com/clinicalData.json",
  "definitions": {
    "ageDxYears": {
      "description": "Age in years at diagnosis",
      "type": "number",
      "minimum": 0,
      "maximum": 90
    },
    "ageOnset": {
      "description": "Age in years of first symptoms",
      "type": "number",
      "exclusiveMinimum": 0
    },
    "medicationAtDx": {
      "description": "Medication prescribed at diagnosis",
      "type": "string"
    }
  }
}
(Note that both schemas are quite a bit larger and have been edited for brevity.)
I need to be able to figure out the "type" of "medicationAtDx" and am trying to figure out how to use jsonschema.RefResolver to de-reference it, but I am a little lost in the terminology used in the documentation and can't find a good example that explains, in small words, what the parameters are and what it returns, i.e. something that a beginning JSON schema user would easily understand.
I created a RefResolver from the clinical manifest schema:
import jsonschema
testref = jsonschema.RefResolver.from_schema(clin_manifest_schema)
I fed it the url in the "$ref":
meddx_url = "https://raw.githubusercontent.com/not-my-username/validation_schemas/reference_definitions/clinicalData.json#/definitions/medicationAtDx"
testref.resolve_remote(meddx_url)["definitions"].keys()
What I was expecting to get back was:
dict_keys(['medicationAtDx'])
What I actually got back was:
dict_keys(['ageDxYears', 'ageOnset', 'medicationAtDx'])
Is this the expected behavior? If not, how can I narrow it down to just the definition for "medicationAtDx"? I can traverse the whole dictionary to get what I want if I have to, but I'd rather have it return just the reference I need.
Thanks in advance!
ETA: per Relequestual's comment below, I took a couple of passes with resolve_fragment as follows:
ref_doc = meddx_url.split("#")[0]
ref_frag = meddx_url.split("#")[1]
testref.resolve_fragment(ref_doc, ref_frag)
This gives me "TypeError: string indices must be integers" and "RefResolutionError: Unresolvable JSON pointer". I tried tweaking the parameters in different ways (adding the "#" back into the fragment, removing the leading slash, etc.) and got the same results. Relequestual's explanation of a fragment was very helpful, but apparently I'm still not understanding the exact parameters that resolve_fragment is expecting.
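For what it's worth, the TypeError suggests resolve_fragment was handed the URL string as its first argument, where it expects the already-fetched document (a dict). Fragment resolution itself is just a walk down a JSON pointer; a stdlib-only sketch of that walk, using an abbreviated copy of the clinicalData schema from above:

```python
def resolve_pointer(document, fragment):
    """Walk a JSON pointer fragment like '/definitions/medicationAtDx'
    through an already-decoded JSON document."""
    node = document
    for part in fragment.lstrip("/").split("/"):
        node = node[part]  # descend one level per pointer segment
    return node

# Abbreviated stand-in for the referenced clinicalData.json document.
clinical = {"definitions": {"medicationAtDx": {"type": "string"}}}
resolve_pointer(clinical, "/definitions/medicationAtDx")
```

With jsonschema, the equivalent would be to fetch the document first (e.g. via resolve_remote on the part of the URL before the "#") and pass that resulting dict, not the URL string, to resolve_fragment.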

How to insert variable value into JSON string for use with PDAL

I'm trying to use the Python extension for PDAL to read in a .laz file.
To do so, I'm using the simple pipeline structure shown here: https://gis.stackexchange.com/questions/303334/accessing-raw-data-from-laz-file-in-python-with-open-source-software. It would be useful for me, however, to insert the value contained in a variable into the "filename" field. I've tried the following, where fullFileName is a str variable containing the full path of the file, but I am getting an error that no such file exists. I assume my JSON syntax is slightly off; can anyone help?
pipeline="""{
    "pipeline": [
        {
            "type": "readers.las",
            "filename": "{fullFileName}"
        }
    ]
}"""
You can follow this code:
import json
import pdal

file = "D:/Lidar data/input.laz"
pipeline = {
    "pipeline": [
        {
            "type": "readers.las",
            "filename": file
        },
        {
            "type": "filters.sort",
            "dimension": "Z"
        }
    ]
}
r = pdal.Pipeline(json.dumps(pipeline))
r.validate()
points = r.execute()
print(points)
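As for why the original triple-quoted version failed: a plain string literal never substitutes {fullFileName}, so PDAL was literally asked to open a file named "{fullFileName}". If you prefer keeping the pipeline as a string, an f-string works, with the JSON braces doubled so they survive formatting (hypothetical path shown):

```python
import json

fullFileName = "D:/Lidar data/input.laz"  # example path
pipeline = f"""{{
    "pipeline": [
        {{
            "type": "readers.las",
            "filename": "{fullFileName}"
        }}
    ]
}}"""

# The substituted string is valid JSON with the real path in place.
assert json.loads(pipeline)["pipeline"][0]["filename"] == fullFileName
```

Building a dict and calling json.dumps, as in the answer above, avoids the brace-escaping entirely and is generally less error-prone.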

Python: Access a dictionary that is located inside of a text file

I am working on a quick and dirty script to get Chromium's bookmarks and turn them into a pipe menu for Openbox. Chromium stores its bookmarks in a file called Bookmarks that holds information in a dictionary form like this:
{
  "checksum": "99999999999999999999",
  "roots": {
    "bookmark_bar": {
      "children": [ {
        "date_added": "9999999999999999999",
        "id": "9",
        "name": "Facebook",
        "type": "url",
        "url": "http://www.facebook.com/"
      }, {
        "date_added": "999999999999",
        "id": "9",
        "name": "Twitter",
        "type": "url",
        "url": "http://twitter.com/"
How would I open this dictionary in this file in Python and assign it to a variable? I know you open a file with open(), but I don't really know where to go from there. In the end, I want to be able to access the info in the dictionary from a variable like bookmarks["roots"]["bookmark_bar"]["children"][0]["name"] and have it return 'Facebook'.
Do you know if this is a JSON file? If so, Python provides a json library.
JSON can be used as a data serialization/interchange format. It's nice because it's cross-platform. Importing it as you ask is fairly easy; an example from the docs:
>>> import json
>>> json.loads('["foo", {"bar":["baz", null, 1.0, 2]}]')
['foo', {'bar': ['baz', None, 1.0, 2]}]
So in your case it would look something like:
import json
with open("file.txt") as f:
    text = f.read()
bookmarks = json.loads(text)
print(bookmarks["roots"]["bookmark_bar"]["children"][0]["name"])
JSON is definitely the "right" way to do this, but for a quick-and-dirty script eval() might suffice (be aware that eval() executes arbitrary code, so only use it on files you trust):
with open('file.txt') as f:
    bookmarks = eval(f.read())
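To get from the parsed dictionary to pipe-menu entries, a small recursive walk over the "children" lists is enough. A sketch against the structure shown above, with an inline test fixture standing in for the real Bookmarks file:

```python
def collect_urls(node, out):
    """Recursively collect (name, url) pairs from a Chromium bookmarks node."""
    if node.get("type") == "url":
        out.append((node["name"], node["url"]))
    for child in node.get("children", []):
        collect_urls(child, out)

# Stand-in for bookmarks["roots"]["bookmark_bar"] from the real file.
bookmark_bar = {
    "children": [
        {"name": "Facebook", "type": "url", "url": "http://www.facebook.com/"},
        {"name": "Twitter", "type": "url", "url": "http://twitter.com/"},
    ]
}
entries = []
collect_urls(bookmark_bar, entries)
# entries == [('Facebook', 'http://www.facebook.com/'), ('Twitter', 'http://twitter.com/')]
```

Each (name, url) pair can then be emitted as one item element of the Openbox pipe-menu XML.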
