I have an idea for a pet project about secure file writing and file disguise. The question is: how can I work with hidden USB-device partitions in Python 3.x?
I mean:
CRUD of hidden partitions on a chosen device (where "U" means increasing/decreasing the size).
Identifying the requested partition among all hidden partitions (i.e., the partition my app created earlier).
Using the partition for file I/O.
I know about the PyUSB library, but I have found hardly any references to partitions in it. If you know anything about this topic, I'll be glad to hear it.
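To illustrate what I mean by "working with partitions", here is a minimal sketch that only lists partitions by parsing the device's MBR directly with struct. The device path /dev/sdb is a placeholder, the "hidden" type IDs are the conventional MBR ones, and this assumes an MBR layout (not GPT) plus root privileges; actually creating or resizing partitions would go beyond this.

```python
import struct

DEVICE = "/dev/sdb"  # placeholder: the target USB device
HIDDEN_TYPES = {0x16, 0x17, 0x1b, 0x1c}  # conventional "hidden" MBR partition type IDs

with open(DEVICE, "rb") as dev:
    mbr = dev.read(512)

for i in range(4):
    entry = mbr[446 + 16 * i: 446 + 16 * (i + 1)]
    # layout: status (1), CHS start (3), type (1), CHS end (3), LBA start (4), sector count (4)
    status, part_type, lba_start, num_sectors = struct.unpack("<B3xB3xII", entry)
    if part_type == 0:
        continue  # empty slot
    print(f"partition {i}: type=0x{part_type:02x} start={lba_start} "
          f"sectors={num_sectors} hidden={part_type in HIDDEN_TYPES}")
```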
Context
As part of my studies, I am creating a bot capable of detecting scam messages, in Python 3. One of the problems I am facing is the detection of fraudulent websites.
Currently, I have a list of domain names saved in a CSV file, containing both known domains considered safe (discord.com, google.com, etc.) and known fraudulent domains (free-nitro.ru, etc.).
To share this list between my personal computer and my server, I regularly "deploy" it over FTP. But since my bot also uses GitHub and a MySQL database, I'm looking for a better system to synchronize this list of domain names without allowing anyone else to access it.
I feel like I'm looking for a miracle solution that doesn't exist, but I don't want to overestimate my knowledge so I'm coming to you for advice, thanks in advance!
My considered solutions:
Put the domain names in a MySQL table
Advantages: no public access, live synchronization
Disadvantages: my scam detection script should be able to work offline
Hash the domain names before putting them on git
Advantages: no public access, easy to do, supports equality comparison
Disadvantages: does not support similarity comparison, which is an important part of the program
Hash domain names with locality-sensitive hashing
Advantages: no easy public access, supports equality and similarity comparison
Disadvantages: the similarities are less precise than with the plaintext domains, and it is impossible to hash a new string on the server side without knowing at least the seed of the random generator, so the public-access problem comes back
My opinion
It seems to me that the last solution, with LSH, is the one that causes the fewest problems. But it is far from satisfying me, and I hope to find better.
For the LSH algorithm, I have reproduced it here (from this notebook). I get similarity coefficients between 10% and 40% lower than those obtained with the current plaintext method.
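For readers unfamiliar with the idea, here is a rough, generic MinHash-style sketch of what I mean (not the exact code from the notebook): domains are shingled into character trigrams and hashed with seeded hash functions, so only the signatures need to be published, but anyone who knows the seed can hash new strings, which is exactly the access problem described above.

```python
import hashlib

def shingles(domain, k=3):
    """Character k-grams of a domain name."""
    return {domain[i:i + k] for i in range(len(domain) - k + 1)}

def minhash_signature(domain, num_hashes=64, seed=42):
    """For each seeded hash function, keep the minimum hash over all shingles."""
    grams = shingles(domain)
    sig = []
    for h in range(num_hashes):
        salt = f"{seed}:{h}:".encode()
        sig.append(min(int.from_bytes(hashlib.sha1(salt + g.encode()).digest()[:8], "big")
                       for g in grams))
    return sig

def similarity(sig_a, sig_b):
    """Estimated Jaccard similarity: fraction of matching signature positions."""
    return sum(a == b for a, b in zip(sig_a, sig_b)) / len(sig_a)

# Example with made-up domains:
print(similarity(minhash_signature("discord.com"), minhash_signature("disc0rd.com")))
print(similarity(minhash_signature("discord.com"), minhash_signature("google.com")))
```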
EDIT: for clarification purposes, maybe my intentions weren't clear enough (I'm sorry, English is not my native language and I'm bad at explaining things lol). The database or GitHub are just convenient ways to share info between my different bot instances. I could have one running locally on my PC, one on my VPS, another one God knows where... and this is why I don't want FTP or any kind of synchronization process involving an IP and/or a fixed destination folder. Ideally I'd like to be able to take my program at any time, download it wherever I want (with git clone) and just run it.
Please tell me if this isn’t clear enough, thanks :)
In the end, I think I'll use yet another solution. I'm thinking of using the MySQL database to store the domain names, but only using it in my script for synchronization, while keeping a local CSV copy.
In short, the workflow I'm imagining:
I edit my SQL table when I want to add/remove items to it
When the bot is launched, the script connects to the DB and retrieves all the information from the table
Once the information is retrieved, it saves it in a CSV file and finishes running the rest of the script
If at launch no internet connection is available, the synchronization to the DB is not done and only the CSV file is used.
This way I get the advantages of no public access, automatic synchronization, and access even offline after the first start, and I keep the support for similarity comparison since no hashing is done.
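A minimal sketch of that startup synchronization (the table and column names, and the use of mysql-connector, are just assumptions for illustration):

```python
import csv

CSV_PATH = "domains.csv"  # local cache, never pushed anywhere public

def sync_from_db():
    """Try to refresh the local CSV from MySQL; fall back to the existing CSV when offline."""
    try:
        import mysql.connector  # pip install mysql-connector-python
        conn = mysql.connector.connect(host="...", user="...", password="...", database="...")
        cur = conn.cursor()
        cur.execute("SELECT domain, is_scam FROM domains")  # hypothetical table/columns
        rows = cur.fetchall()
        conn.close()
        with open(CSV_PATH, "w", newline="") as f:
            csv.writer(f).writerows(rows)
    except Exception:
        pass  # offline or DB unreachable: keep using the existing local CSV

def load_domains():
    with open(CSV_PATH, newline="") as f:
        return [(domain, is_scam) for domain, is_scam in csv.reader(f)]

sync_from_db()
domains = load_domains()
```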
If you think you can improve my idea, I'm interested!
I have an interesting problem that is not novel but seems to be generally unsolved, at least in the sources I could find online. The issue is basically this:
I want to persist the states of my RNN between calls/invocations to it via the TF Serving API.
I have found quite a few people online talking about this problem, such as in the following links:
https://github.com/tensorflow/serving/issues/724
https://groups.google.com/a/tensorflow.org/forum/#!topic/discuss/00tipdqxRZk
Tensorflow Serving - Stateful LSTM
Unfortunately, there seems to be some discussion of the problem, and of a built-in fix being added to TF (but not available yet), yet no actual guides/examples for how to work around this problem.
The closest thing to a guide for how to do this that I have been able to find is the following, from the groups.google link:
"As background for future readers, there are two broad cases to consider. The first is where you want to run an entire sequence computation, with multiple steps, each with state to preserve for the next step. This case is addressed with dynamic unrolling."
I am not really sure how to go about implementing this at all, though. I don't really want to share my code here, just because it is quite long for an SO post, and I don't expect anyone to read through it all and tell me what to do.
Just any general tips would be awesome.
I have two files right now that I am using to deploy and use my RNN with the Serving API. They are called rnn_save_model.py and rnn_client.py. rnn_save_model just creates the actual model in code and then uses SavedModelBuilder to save it to a file. rnn_client then passes some parameters to the model once I have it loaded via the following command:
tensorflow_model_server --port=9000 --model_name=rnn --model_base_path=/tmp/rnn_model/
I am just not even sure where to add code to get the model to load a state stored to a file or something, because the model itself is "created" once in rnn_save_model, and I do not see how to cleanly pass states in via the rnn_client file so that they get added to the graph. In its current state that file just loads a model into memory and interfaces with it, rather than actually editing the model, which is what seems to be needed to load previous state from a file.
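Based on the "dynamic unrolling" suggestion quoted above, the only workaround I can picture looks roughly like the sketch below (TF 1.x style, to match SavedModelBuilder; the sizes and names are placeholders): expose the LSTM state as explicit inputs and outputs of the served signature, so the client keeps the state between calls and feeds it back in. Is this the intended approach?

```python
import tensorflow as tf  # TF 1.x, to match SavedModelBuilder / tensorflow_model_server

NUM_UNITS = 128  # placeholder size

# Inputs: the sequence batch plus the previous LSTM state (zeros on the first call).
inputs = tf.placeholder(tf.float32, [None, None, 10], name="inputs")  # [batch, time, features]
c_in = tf.placeholder(tf.float32, [None, NUM_UNITS], name="c_in")
h_in = tf.placeholder(tf.float32, [None, NUM_UNITS], name="h_in")

cell = tf.nn.rnn_cell.LSTMCell(NUM_UNITS)
initial_state = tf.nn.rnn_cell.LSTMStateTuple(c_in, h_in)

# Dynamic unrolling over the sequence; final_state is what the client must persist.
outputs, final_state = tf.nn.dynamic_rnn(cell, inputs, initial_state=initial_state)

# In rnn_save_model.py, the state tensors would go into the serving signature alongside
# the outputs, e.g. with tf.saved_model.signature_def_utils.predict_signature_def(
#     inputs={"inputs": inputs, "c_in": c_in, "h_in": h_in},
#     outputs={"outputs": outputs, "c_out": final_state.c, "h_out": final_state.h})
# rnn_client then stores c_out/h_out after each request and sends them back as c_in/h_in.
```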
I really appreciate any help!
My company has slightly more than 300 vehicle-based Windows CE 5.0 mobile devices that all share the same software and the same usage model: Direct Store Delivery during the day, then a Tcom at the home base every night. There is an unknown event (or events) that results in a device freaking out and rebooting itself in the middle of the day. The frequency of this issue is ~10 times per week across the fleet of computers, all of which reboot daily, 6 days a week. The math is 300*6 = 1800 boots per week (at least), and 10/1800 ≈ 0.5%. I realize that number is very low, but it is more than my boss wants to have.
My challenge is to find a way to scan through several thousand logfile.txt files and try to find some sort of pattern. I KNOW there is a pattern here somewhere. I've got a couple of ideas of where to start, but I wanted to throw this out to the community and see what suggestions you all might have.
A bit of background on this issue. The application starts a new log file at each boot. In an orderly (control) log file, you see the app start up, do its thing all day, and then begin a shutdown process in a somewhat orderly fashion 8-10 hours later. In a problem log file, you see the device start up and then the log ends, without any shutdown sequence at all, in less than 8 hours. It then starts a new log file which shares the same date as the logfile1.old that it made in the rename process. The application we have was home-grown by Windows developers who are no longer with the company. Even better, no one currently knows who has the source.
I'm aware of the various CE tools that can be used to detect memory leaks (DevHealth, retail messages, etc.) and we are investigating that route as well; however, I'm convinced that there is a pattern to be found that I'm just not smart enough to find. There has to be a way to do this using Perl or Python that I'm just not seeing. Here are two ideas I have.
Idea 1 – Look for trends in word usage.
Create an array of every unique word used in the entire log file and output a count of each word. Once I had a count of the words being used, I could run some stats on them and look for the non-normal events. Perhaps the word "purple" is used 500 times in a 1000-line control log (there might be some math there?) and only 4 times in a 500-line problem log? Perhaps there is a unique word that is only seen in the problem files. Maybe I could get a reverse "word cloud"?
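For example, a sketch of what I mean with collections.Counter (the directory names are placeholders); comparing control vs. problem logs would then just be a matter of diffing the two counters:

```python
import collections
import glob
import re

def word_counts(path):
    """Count every unique word in one log file."""
    counts = collections.Counter()
    with open(path, errors="replace") as f:
        for line in f:
            counts.update(re.findall(r"[A-Za-z_]+", line))
    return counts

control = sum((word_counts(p) for p in glob.glob("control_logs/*.txt")), collections.Counter())
problem = sum((word_counts(p) for p in glob.glob("problem_logs/*.txt")), collections.Counter())

# Words heavily over-represented in problem logs relative to control logs.
for word in problem:
    ratio = (problem[word] + 1) / (control[word] + 1)  # +1 smoothing for words absent from one set
    if ratio > 10:
        print(word, problem[word], control[word])
```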
Idea 2 – Categorize lines by entry type and then look for trends in the sequence of entry types?
The log files already have a predictable schema that looks like this: Level|date|time|system|source|message
I'm 99% sure there is a visible pattern here that I just can't find. All of the logs got turned up to "super duper verbose", so there is a boatload of fluff (25 log entries per second, 40k lines per file) that makes this even more challenging. If there isn't a unique word, then this approach almost has to be the one that works. How do I do this?
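Again just to make the idea concrete, a sketch that splits each line on that schema and tallies the (level, source) of the last parseable entry in each file, on the theory that a crash log ends mid-stream rather than in the shutdown sequence (directory names are placeholders):

```python
import collections
import glob

def parse(line):
    """Split one log line on the Level|date|time|system|source|message schema."""
    parts = line.rstrip("\n").split("|", 5)
    if len(parts) == 6:
        level, date, time, system, source, message = parts
        return level, source
    return None

def last_entry_types(pattern):
    """For each file, record the (level, source) of its final parseable line."""
    endings = collections.Counter()
    for path in glob.glob(pattern):
        last = None
        with open(path, errors="replace") as f:
            for line in f:
                parsed = parse(line)
                if parsed:
                    last = parsed
        if last:
            endings[last] += 1
    return endings

print("control endings:", last_entry_types("control_logs/*.txt").most_common(5))
print("problem endings:", last_entry_types("problem_logs/*.txt").most_common(5))
```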
Item 3 – Hire a Windows CE platform developer
Yes, we are going down that path as well, but I KNOW there is a pattern I'm missing. They will use the tools that I don't have (or make the tools that we need) to figure out what's up. I suspect that there might be a memory leak, radio event or other event that the platform tools will surely show.
Item 4 – Something I’m not even thinking of that you have used.
There have got to be tools out there that do this that aren't as prestigious as a well-executed Python script, and I'm willing to go down that path; I just don't know what those tools are.
Oh yeah, I can't post log files to the web, so don't ask. The users are promising to report trends when they see them, but I'm not exactly hopeful on that front. All I need to find is either a pattern in the logs or steps to duplicate the issue.
So there you have it. What tools or techniques can I use to even start on this?
I was wondering if you'd looked at the ELK stack? It's an acronym for Elasticsearch, Logstash and Kibana, and it fits your use case closely; it's often used for analysis of large numbers of log files.
Elasticsearch and Kibana give you a UI that lets you interactively explore and chart data for trends. It's very powerful and quite straightforward to set up on a Linux platform, and there's a Windows version too. (It took me a day or two of setup, but you get a lot of functional power from it.) The software is free to download and use. You could use this in a style similar to ideas 1/2.
https://www.elastic.co/webinars/introduction-elk-stack
http://logz.io/learn/complete-guide-elk-stack/
On the question of Python / idea 4 (which ELK could be considered part of): I haven't done this for log files, but I have used regex to search for and extract text patterns from documents using Python. That may also help you find patterns if you have some leads on the sorts of patterns you are looking for.
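For example, a trivial sketch along those lines (the keywords and paths are made up, just to show the shape of it):

```python
import glob
import re

pattern = re.compile(r"(low batt|radio|reset|exception)", re.IGNORECASE)  # made-up keywords

for path in glob.glob("logs/*.txt"):
    with open(path, errors="replace") as f:
        for lineno, line in enumerate(f, 1):
            if pattern.search(line):
                print(f"{path}:{lineno}: {line.rstrip()}")
```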
Just a couple of thoughts; hope they help.
There is no input data at all in this problem, so this answer will be basically pure theory: a little collection of ideas you could consider.
To analyze patterns out of a bunch of logs, you could definitely create some graphs displaying relevant data, which could help to narrow down the problem; Python is really good for this kind of task.
You could also transform/insert the logs into a database; that way you'd be able to query the relevant suspicious events much faster and even compare all your logs at scale.
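For instance, a sketch using SQLite and the Level|date|time|system|source|message schema from the question (the paths and the example query are placeholders):

```python
import glob
import sqlite3

conn = sqlite3.connect("logs.db")
conn.execute("""CREATE TABLE IF NOT EXISTS entries
                (file TEXT, level TEXT, date TEXT, time TEXT,
                 system TEXT, source TEXT, message TEXT)""")

for path in glob.glob("logs/*.txt"):
    with open(path, errors="replace") as f:
        # split into at most 6 fields; extra pipes stay inside the message column
        rows = [(path, *line.rstrip("\n").split("|", 5))
                for line in f if line.count("|") >= 5]
    conn.executemany("INSERT INTO entries VALUES (?,?,?,?,?,?,?)", rows)
conn.commit()

# Example: compare entry counts per source across all files.
for row in conn.execute("""SELECT source, COUNT(*) FROM entries
                           GROUP BY source ORDER BY COUNT(*) DESC LIMIT 10"""):
    print(row)
```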
A simpler approach could be just focusing on a single log showing the crash: instead of wasting a lot of effort or resources trying to find some kind of generic pattern, start by reading through one log in order to catch suspicious "events" which could produce the crash.
My favourite approach for these types of tricky problems is different from the previous ones: instead of focusing on analyzing or even parsing the logs, I'd just try to reproduce the bug(s) in a deterministic way locally (you don't even need to have the source code). Sometimes it's really difficult to replicate the production environment in your dev environment, but it is definitely time well invested. All the effort you put into this process will help you not only to solve these bugs but to improve your software much faster. Remember, the more times you're able to iterate, the better.
Another approach could be coding a little script which would allow you to replay the logs that crashed; not sure whether that'll be easy in your environment, though. Usually this strategy works quite well with production software using web services, where there are a lot of tuples of data requests and data retrievals.
In any case, without seeing the type of data in your logs I can't be more specific or give much more concrete detail.
I need to perform searches in quite large files. The search operations need random access (think of binary search), and I will mmap the files for ease of use and performance. The search algorithm takes the page size into account so that whenever I need to access some memory area, I will try to make the most of it. Due to this there are several parameters to tune. I would like to find the parameters which give me the least number of reads from the block device.
I can do this with pen and paper, but the theoretical work carries only so far. The practical environment with a lot happening and different page caches is more complex. There are several processes accessing the files, and certain pages may usually be available in the file system page cache due to other activity on the files. (I assume the OS is aware of these when using mmap.)
In order to see the actual performance of my search algorithms in terms of number of blocks read from the block device, I would need to know the number of page misses occurring during my mmap accesses.
A dream-come-true solution would be one that would tell me which pages of the memory area are already in the cache. A very good solution would be a function which tells me whether a given page is in real memory. This would both enable me to tune the parameters and possibly even be part of my algorithm ("if this page is in real memory, we'll extract some information out of it; if it isn't, then we'll read another page").
The system will run on Linux (3-series kernel), so if there is no OS-agnostic answer, Linux-specific answers are acceptable. The benchmark will be written in python, but if the interfaces exist only in C, then I'll live with that.
Example
Let us have a file of fixed-length records carrying a sorted identifier and some data. We want to extract the data between some starting and ending position (as defined by the identifiers). The trivial solution is to use binary search to find the start position and then return everything until the end is reached.
However, the situation changes somewhat if we need to take caching into account. Then direct memory accesses are essentially free, but page misses are expensive. A simple solution is to use binary search to find any position within the range. Then the file can be traversed backwards until the start position is reached. Then the file is traversed forwards until the end is reached. This sounds quite stupid, but it ensures that once a single point within the range is found, no extra pages need to be loaded.
So, the essential thing is to find a single position within the range. Binary search is a good candidate, but if we know that, for example, the three last or three first pages of the file are usually in the page cache anyway, we should use that information as well. If we knew which of the pages are in the cache, the search algorithm could be made much better, but even a posteriori knowledge of whether we hit or missed helps.
(The actual problem is a bit more complicated than that, but maybe this illustrates the need.)
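To make the example concrete, here is a sketch of that scan on top of mmap (the record layout, key packing and file name are assumptions): binary search lands anywhere inside the [lo, hi] key range, then the scan walks backwards to the first record in range and forwards to the last, so no page outside the range, other than the ones probed by the search, gets touched.

```python
import mmap
import struct

RECORD = struct.Struct("<Q56s")   # hypothetical fixed-length record: 8-byte key + 56 bytes of data
REC_SIZE = RECORD.size

def key_at(mm, i):
    return RECORD.unpack_from(mm, i * REC_SIZE)[0]

def range_scan(path, lo, hi):
    """Yield records whose key is in [lo, hi], touching as few extra pages as possible."""
    with open(path, "rb") as f, mmap.mmap(f.fileno(), 0, prot=mmap.PROT_READ) as mm:
        n = len(mm) // REC_SIZE
        # 1. Binary search for *any* index whose key falls inside the range.
        left, right, mid = 0, n - 1, None
        while left <= right:
            m = (left + right) // 2
            k = key_at(mm, m)
            if k < lo:
                left = m + 1
            elif k > hi:
                right = m - 1
            else:
                mid = m
                break
        if mid is None:
            return
        # 2. Walk backwards to the first record in range...
        start = mid
        while start > 0 and key_at(mm, start - 1) >= lo:
            start -= 1
        # 3. ...then forwards, yielding until the end of the range.
        i = start
        while i < n and key_at(mm, i) <= hi:
            yield RECORD.unpack_from(mm, i * REC_SIZE)
            i += 1
```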
Partial solution:
As JimB tells in his answer, there is no such API in Linux. That leaves us with more generic profiling tools (such as python's cProfile or perf stat in Linux).
The challenge with my code is that I know most of the time will be spent on the memory accesses which end up being cache misses. Identifying them is very easy, as they are the only points where the code may block. In the code I have something like b = a[i], and this will either be very fast or very slow depending on i.
Of course, seeing the total number of cache misses during the run of the process may help with some optimizations, but I would really like to know whether the rest of the system creates a situation where, e.g., the first or last pages of the file are in the cache most of the time anyway.
So, I will implement timing of the critical memory accesses (the ones that may miss the cache). As almost everything running in the system is I/O-limited (not CPU-limited), it is unlikely that context switches will spoil my timing too often. This is not an ideal solution, but it seems to be the least bad one.
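The timing itself is nothing fancy; something along these lines (a sketch; the threshold separating a cached access from one that had to hit the block device is an assumption that needs tuning per machine):

```python
import time

CACHE_MISS_THRESHOLD = 100e-6  # assumption: accesses slower than ~100 µs probably hit the disk

def timed_read(buf, offset):
    """Read one byte from an mmap'ed buffer and classify the access as hit or miss."""
    t0 = time.perf_counter()
    value = buf[offset]
    elapsed = time.perf_counter() - t0
    return value, elapsed > CACHE_MISS_THRESHOLD

# e.g. inside the search loop:
# b, missed = timed_read(a, i)
# misses += missed
```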
This is really something that needs to be handled outside of your program. The virtual memory layer is handled by the kernel, and the details are not exposed to the process itself. You can profile your program within its process, and estimate what's happening based on the timing of your function calls, but to really see what's going on you need to use OS-specific profiling tools.
Linux has a great tool for this: perf. The perf stat command may be all you need to get an overview of how your program is executing.
Originally, I had to deal with just 1.5[TB] of data. Since I just needed fast write/reads (without any SQL), I designed my own flat binary file format (implemented using python) and easily (and happily) saved my data and manipulated it on one machine. Of course, for backup purposes, I added 2 machines to be used as exact mirrors (using rsync).
Presently, my needs are growing, and there's a need to build a solution that would successfully scale up to 20[TB] (and even more) of data. I would be happy to continue using my flat file format for storage. It is fast, reliable and gives me everything I need.
The thing I am concerned about is replication, data consistency, etc. across the network (obviously, the data will have to be distributed -- not all of it can be stored on one machine).
Are there any ready-made solutions (Linux / Python based) that would allow me to keep using my file format for storage, yet would handle the other components that NoSQL solutions normally provide (data consistency / availability / easy replication)?
Basically, all I want to make sure of is that my binary files are consistent throughout my network. I am using a network of 60 Core Duo machines (each with 1GB RAM and a 1.5TB disk).
Approach: Distributed MapReduce in Python with the Disco Project
This seems like a good way of approaching your problem. I have used the Disco Project for similar problems.
You can distribute your files among n machines (processes) and implement the map and reduce functions that fit your logic.
The Disco Project tutorial describes exactly how to implement a solution for problems like yours. You'll be impressed by how little code you need to write, and you can definitely keep the format of your binary files.
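For a flavour of how little code that is, here is the word-count-shaped skeleton from the Disco tutorial, adapted as a sketch; the map/reduce bodies and the input tag are placeholders you would replace with logic that reads your binary record format:

```python
from disco.core import Job, result_iterator

def fun_map(line, params):
    # Replace with logic that parses one record of your binary format.
    for word in line.split():
        yield word, 1

def fun_reduce(iter, params):
    from disco.util import kvgroup
    for key, counts in kvgroup(sorted(iter)):
        yield key, sum(counts)

if __name__ == '__main__':
    # Inputs are DDFS tags or URLs of the files distributed across the cluster nodes.
    job = Job().run(input=["tag://data:mybinaryfiles"],  # placeholder input
                    map=fun_map,
                    reduce=fun_reduce)
    for key, total in result_iterator(job.wait(show=True)):
        print(key, total)
```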
Another similar option is to use Amazon's Elastic MapReduce
Perhaps some of the commentary on the Kivaloo system developed for Tarsnap will help you decide what's most appropriate: http://www.daemonology.net/blog/2011-03-28-kivaloo-data-store.html
Without knowing more about your application (size/type of records, frequency of reading/writing) or custom format it's hard to say more.