I need to perform searches in quite large files. The search operations need random access (think of binary search), and I will mmap the files for ease of use and performance. The search algorithm takes the page size into account so that whenever I need to access some memory area, I will try to make the most of it. Due to this there are several parameters to tune. I would like to find the parameters which give me the least number of reads from the block device.
I can do this with pen and paper, but the theoretical work goes only so far. The practical environment, with a lot happening and several layers of caching, is more complex. There are several processes accessing the files, and certain pages may usually be available in the file system page cache due to other activity on the files. (I assume the OS is aware of these when using mmap.)
In order to see the actual performance of my search algorithms in terms of number of blocks read from the block device, I would need to know the number of page misses occurring during my mmap accesses.
A dream-come-true solution would be one that tells me which pages of the memory area are already in the cache. A very good solution would be a function which tells me whether a given page is in real memory. This would both enable me to tune the parameters and possibly even be part of my algorithm ("if this page is in real memory, we'll extract some information out of it; if it isn't, we'll read another page").
The system will run on Linux (3-series kernel), so if there is no OS-agnostic answer, Linux-specific answers are acceptable. The benchmark will be written in Python, but if the interfaces exist only in C, then I'll live with that.
Example
Let us have a file of fixed-length records, each carrying a sorted identifier and some data. We want to extract the data between some starting and ending position (as defined by the identifiers). The trivial solution is to use binary search to find the start position and then return everything until the end is reached.
However, the situation changes somewhat if we need to take caching into account. Then direct memory accesses are essentially free but page misses are expensive. A simple solution is to use binary search to find any position within the range. Then the file can be traversed backwards until the start position is reached, and then forwards until the end is reached. This sounds quite stupid, but it ensures that once a single point within the range is found, no extra pages need to be loaded.
So, the essential thing is to find a single position within the range. Binary search is a good candidate, but if we know that, for example, the three last or three first pages of the file are usually in the page cache anyway, we should use that information as well. If we knew which of the pages are in the cache, the search algorithm could be made much better, but even a posteriori knowledge of whether we hit or missed helps.
(The actual problem is a bit more complicated than that, but maybe this illustrates the need.)
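The "find any position in the range, then walk outwards" idea from the example could be sketched as follows. This is only an illustration: the record layout (`RECORD`, a 4-byte identifier followed by a 4-byte payload) and the function names are assumptions, and a `bytes` object stands in for the mmapped file (the same indexing works on an `mmap` object).

```python
import struct

# Hypothetical record layout: 4-byte little-endian id, 4-byte payload.
RECORD = struct.Struct("<II")


def record_id(buf, index):
    """Return the identifier of the record at position `index`."""
    return RECORD.unpack_from(buf, index * RECORD.size)[0]


def find_any_in_range(buf, lo_id, hi_id):
    """Binary search for any record whose id lies in [lo_id, hi_id]."""
    lo, hi = 0, len(buf) // RECORD.size - 1
    while lo <= hi:
        mid = (lo + hi) // 2
        rid = record_id(buf, mid)
        if rid < lo_id:
            lo = mid + 1
        elif rid > hi_id:
            hi = mid - 1
        else:
            return mid  # inside the range: stop searching
    return None


def extract_range(buf, lo_id, hi_id):
    """Return (first_index, last_index) of the records in the id range."""
    hit = find_any_in_range(buf, lo_id, hi_id)
    if hit is None:
        return None
    count = len(buf) // RECORD.size
    first = hit
    while first > 0 and record_id(buf, first - 1) >= lo_id:
        first -= 1  # walk backwards through adjacent records
    last = hit
    while last + 1 < count and record_id(buf, last + 1) <= hi_id:
        last += 1   # walk forwards through adjacent records
    return first, last


# Ten records with ids 0, 10, ..., 90.
data = b"".join(RECORD.pack(i * 10, i) for i in range(10))
```

The outward walk only touches records adjacent to the hit, so on a real mapping it stays on pages that have to be read for the result anyway.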
Partial solution:
As JimB says in his answer, there is no such API in Linux. That leaves us with more generic profiling tools (such as Python's cProfile or perf stat in Linux).
The challenge with my code is that I know most of the time will be spent in the memory accesses which end up being cache misses. These are easy to identify, as they are the only points where the code may block. In the code I have something like b = a[i], and this will either be very fast or very slow depending on i.
Of course, seeing the total number of cache misses during the running time of the process may help with some optimizations, but I would really like to know whether the rest of the system creates a situation where, e.g., the first or last pages of the file are in the cache most of the time anyway.
So, I will implement timing of the critical memory accesses (ones that may miss the cache). As almost everything running in the system is I/O-limited (not CPU limited), it is unlikely that a context switch would too often spoil my timing. This is not an ideal solution, but it seems to be the least bad one.
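A minimal sketch of such timing, assuming one byte is touched per page and `time.perf_counter` is precise enough to separate cached from uncached accesses. The helper name `time_page_touches` and the temporary demo file are illustrative only; on a freshly written file every page will almost certainly already be cached.

```python
import mmap
import os
import tempfile
import time

PAGE = mmap.PAGESIZE


def time_page_touches(mm, page_indexes):
    """Touch one byte in each listed page; return {page: seconds}."""
    timings = {}
    for page in page_indexes:
        start = time.perf_counter()
        _ = mm[page * PAGE]  # may block if the page has to come from disk
        timings[page] = time.perf_counter() - start
    return timings


# Demo on a small temporary file of eight pages.
with tempfile.NamedTemporaryFile(delete=False) as tmp:
    tmp.write(b"\0" * (PAGE * 8))
    path = tmp.name

with open(path, "rb") as f:
    mm = mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ)
    timings = time_page_touches(mm, range(8))
    mm.close()
os.remove(path)
```

Accesses that take orders of magnitude longer than the rest are the likely misses; the threshold has to be found empirically on the target machine.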
This is really something that needs to be handled outside of your program. The virtual memory layer is handled by the kernel, and the details are not exposed to the process itself. You can profile your program within its process and estimate what's happening based on the timing of your function calls, but to really see what's going on you need to use OS-specific profiling tools.
Linux has a great tool for this: perf. The perf stat command may be all you need to get an overview of how your program is executing.
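As an in-process complement (not something the answer above proposes), Python's Unix-only `resource` module reports the process's page fault counters; `ru_majflt` counts faults that required I/O. A rough sketch, with `count_faults` being a hypothetical helper name:

```python
import resource


def fault_counts():
    """Return (minor, major) page fault counts for this process so far."""
    usage = resource.getrusage(resource.RUSAGE_SELF)
    return usage.ru_minflt, usage.ru_majflt


def count_faults(func, *args):
    """Run func(*args); return (result, minor_faults, major_faults)."""
    min0, maj0 = fault_counts()
    result = func(*args)
    min1, maj1 = fault_counts()
    return result, min1 - min0, maj1 - maj0
```

The counters are per process, so faults caused by other threads during the call are included as well.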
Related
This may be impossible, but I am just wondering if there are any tools to help detect non-deterministic behaviour when I run a Python script. Some fancy options in a debugger perhaps? I guess I am imagining that theoretically it might be possible to compare the stack instruction-by-instruction or something between two subsequent runs of the same code, and thus pick out where any divergence begins.
I realise that a lot is going on under the hood though, so that this might be far too difficult to ask of a debugger or any tool...
Essentially my problem is that I have a test failing occasionally, almost certainly because somewhere the code relies accidentally on the ordering of output from iterating over a dictionary, or some such thing where the ordering isn't actually guaranteed. I'd just like a tool to help me locate the source of this sort of problem :).
The following question is similar, but there was not much suggestion of how to really deal with this in an automated or general way: Testing for non-deterministic behavior of python function
I'm not aware of a way to do this automatically, but what I would recommend doing is starting a debugger when a test fails, then running it automatically (overnight?) until you get a failure. You can then examine the variables and see if anything stands out. If you're using pytest, running with the --pdb flag will start a debugger on failure.
You might also consider using Hypothesis to run generative test cases.
You might also consider running the tests over and over, collecting the output of each run (success or failure). When you have a representative sample, compare the two, particularly the ordering of tests that were run!
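If the suspected cause is ordering that depends on string hashing, the repeated runs can be made systematic by varying `PYTHONHASHSEED`. The sketch below is an assumption about the cause, not a general detector: `SNIPPET` stands in for the code under test, and a set is used because dicts preserve insertion order from Python 3.7 on, while set iteration order for strings still depends on the hash seed.

```python
import os
import subprocess
import sys

# Stand-in for the code under test: prints a set of strings in iteration order.
SNIPPET = "print(','.join({'apple', 'banana', 'cherry', 'date', 'elder', 'fig'}))"


def run_with_seed(seed):
    """Run SNIPPET in a fresh interpreter with a fixed hash seed."""
    env = dict(os.environ, PYTHONHASHSEED=str(seed))
    completed = subprocess.run(
        [sys.executable, "-c", SNIPPET],
        env=env, capture_output=True, text=True, check=True,
    )
    return completed.stdout.strip()


def outputs_by_seed(seeds):
    """Map each seed to the output it produced."""
    return {seed: run_with_seed(seed) for seed in seeds}
```

A seed that reproduces the failure makes the test deterministic again, at which point an ordinary debugger session becomes practical.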
I'm fairly certain this is fundamentally impossible with our current understanding of computation and automata theory. I could be wrong though.
Anyone more directly knowledgeable (rigorous background) feel free to pipe in, most of what follows is self-taught and heavily based on professional observations over the last decade doing Systems Engineering/SRE/automation.
Modern day computers are an implementation of Automata Theory and Computation Theory. At the signal level, they require certain properties to do work.
Signal Step, Determinism and Time Invariance are a few of those required properties.
Deterministic behavior, and deterministic properties rely on there being a unique solution, or a 1:1 mapping of directed nodes (instructions & data context) on the state graph from the current state to the next state. There is only one unique path. Most of this is implemented and hidden away at low level abstractions (i.e. the Signal, Firmware, and kernel/shell level).
Non-deterministic behavior is the absence of the determinism signal property. This can be introduced into any interconnected system due to a large range of issues (i.e. hardware failures, strong EM fields, Cosmic Rays, and even poor programming between interfaces).
Any time determinism breaks, computers are unable to do useful work, although the scope may be limited depending on where it happens. Usually the failure is either caught as an error and the program or shell halts, or the program continues running indefinitely or produces bogus data. Which one you get depends on the class of problem it turns into and on the fundamental limits on the types of problems Turing machines (i.e. computers) can solve.
Please bear in mind, I am not a Computer Science major, nor do I hold a degree in Computer Engineering or a related IT field. I'm self-taught, no degree.
Most of this explanation has been driven by doing years of automation, segmenting problem domains, design and seeking a more generalized solution to many of the issues I ran into, mostly to come to a better usage of my time (hence this non-rigorous explanation).
The class of non-deterministic behavior is the most costly type of error I've run into, because this behavior is the absence of the expected. There isn't a test for non-determinism as a set or group. You can infer it from the absence of properties which you can test (at least interactively).
Normal computer behavior is emergent from the previous mentioned required signals and systems properties, and we see problems when they don't always hold true and we can't quickly validate the system for non-determinism due to its nature.
Interestingly, testing for the presence of those properties interactively is a useful shortcut: if the properties are not present, the problem falls into this class of troubles, which human beings can solve but computers cannot. It can only effectively be done by humans, because automating it runs into the halting problem and other more theoretical aspects which I didn't bother understanding during my independent studies.
Unfortunately, knowing how to test for these properties often requires a knowledgeable view of the systems and architecture being tested, spanning most abstraction layers (depending on where the problem originates).
More formal or rigorous material may describe this in terms of NFAs versus DFAs (nondeterministic versus deterministic finite automata), with a more complex vocabulary.
The differences being basically the presence of that 1:1 state map/path or its absence that define determinism.
Where most people trip up with this property, with regards to programming is between interfaces where the interface fails to preserve data and this property by extension, such as accidentally using the empty or NULL state of an output field to mean more than one thing that gets passed to another program.
A theoretical view of a shell program running a series of piped commands might look like this:
DFA->OutInterface ->DFA->OutInterface->NFA->crash/meaningless/unexpected data/infinite loop, etc depending on the code that comes after the NFA, the behavior varies unpredictably in indeterminable ways. (OutInterface being pipe at the shell '|' )
For an actual example in the wild, ldd on recent versions of Linux had two such errors that injected non-determinism into the pipe. Identifying linked dependencies for an arbitrary binary, for use with a build system, was not possible using ldd because of this issue.
More specifically, the in-memory structures, and then also the flattening of the output fields in a non-deterministic way that varies across different binaries.
Most of the material mentioned above is normally covered in an undergraduate (BS) compiler design course. One can also find it in the dragon compiler book, which is what I used instead. It does require a decent background in math fundamentals (i.e. Abstract Algebra/Linear Algebra) to grok the basis and examples. The signal properties are best described in Oppenheim's Signals and Systems.
Without knowing how to test that certain system properties hold true, you can easily waste months of labor trying to document and/or narrow the issue down. All you really have in those non-deterministic cases is a guess-and-check strategy, which becomes very expensive, especially if you don't realize it's an underlying systems property issue.
My company has slightly more than 300 vehicle-based Windows CE 5.0 mobile devices that all share the same software and usage model of Direct Store Delivery during the day, then doing a Tcom at the home base every night. There is an unknown event (or events) that results in the device freaking out and rebooting itself in the middle of the day. Frequency of this issue is ~10 times per week across the fleet of computers that all reboot daily, 6 days a week. The math is 300*6 = 1800 boots per week (at least), and 10/1800 ≈ 0.56%. I realize that number is very low, but it is more than my boss wants to have.
My challenge is to find a way to scan through several thousand logfile.txt files and try to find some sort of pattern. I KNOW there is a pattern here somewhere. I’ve got a couple of ideas of where to start, but I wanted to throw this out to the community and see what suggestions you all might have.
A bit of background on this issue. The application starts a new log file at each boot. In an orderly (control) log file, you see the app start up, do its thing all day, and then start a shutdown process in a somewhat orderly fashion 8-10 hours later. In a problem log file, you see the device start up and then the log ends without any shutdown sequence at all, in less than 8 hours. It then starts a new log file which shares the same date as the logfile1.old that it made in the rename process. The application that we have was home-grown by Windows developers who are no longer with the company. Even better, nobody currently knows who has the source.
I’m aware of the various CE tools that can be used to detect memory leaks (DevHealth, retail messages, etc..) and we are investigating that route as well, however I’m convinced that there is a pattern to be found, that I’m just not smart enough to find. There has to be a way to do this using Perl or Python that I’m just not seeing. Here are two ideas I have.
Idea 1 – Look for trends in word usage.
Create an array of every unique word used in the entire log file and output a count of each word. Once I had a count of the words that were being used, I could run some stats on them and look for the non-normal events. Perhaps the word “purple” is being used 500 times in a 1000 line log file ( there might be some math there?) on a control and only 4 times on a 500 line problem log? Perhaps there is a unique word that is only seen in the problem files. Maybe I could get a reverse “word cloud”?
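Idea 1 might be sketched roughly like this, assuming each log is available as a list of lines and that "word" means a run of letters or underscores. The function names and the tiny sample logs in the usage note are made up for illustration.

```python
import re
from collections import Counter

WORD = re.compile(r"[A-Za-z_]+")


def word_counts(lines):
    """Count every word (case-insensitive) in one log."""
    counts = Counter()
    for line in lines:
        counts.update(w.lower() for w in WORD.findall(line))
    return counts


def words_only_in_problem(control_logs, problem_logs):
    """Words seen in at least one problem log but in no control log."""
    control = Counter()
    for log in control_logs:
        control.update(word_counts(log))
    problem = Counter()
    for log in problem_logs:
        problem.update(word_counts(log))
    return {w: n for w, n in problem.items() if w not in control}
```

For example, `words_only_in_problem([["INFO|app start"]], [["WARN|radio reset"]])` returns the words that never occur in the control set. Comparing relative frequencies (count divided by line count) would be the next step for words that appear in both sets.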
Idea 2 – categorize lines into entry-type and then look for trends in the sequence of type of entry type?
The logfiles already have a predictable schema that looks like this = Level|date|time|system|source|message
I’m 99% sure there is a visible pattern here that I just can’t find. All of the logs got turned up to “super duper verbose” so there is a boatload of fluff (25 logs p/sec , 40k lines per file) that makes this even more challenging. If there isn’t a unique word, then this has almost got to be true. How do I do this?
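One way to start on Idea 2, assuming the pipe-separated schema above, is to reduce each line to an entry type and compare the last few types before each problem log ends. The choice of (level, system, source) as the "type" and the names below are assumptions for illustration.

```python
from collections import Counter

FIELDS = ("level", "date", "time", "system", "source", "message")


def parse_line(line):
    """Split one log line into the six schema fields, or None if malformed."""
    # maxsplit keeps any '|' inside the message intact
    parts = line.rstrip("\n").split("|", len(FIELDS) - 1)
    if len(parts) != len(FIELDS):
        return None
    return dict(zip(FIELDS, parts))


def tail_signature(lines, n=3):
    """Entry types of the last n well-formed lines of a log."""
    entries = [e for e in map(parse_line, lines) if e is not None]
    return tuple((e["level"], e["system"], e["source"]) for e in entries[-n:])


def common_tails(logs, n=3):
    """Count how often each tail signature occurs across many logs."""
    return Counter(tail_signature(log, n) for log in logs)
```

If the crashing logs share a tail signature that the control logs never show, that is the pattern to chase.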
Item 3 – Hire a windows CE platform developer
Yes, we are going down that path as well, but I KNOW there is a pattern I’m missing. They will use the tools that I don’t have (or make the tools that we need) to figure out what’s up. I suspect that there might be a memory leak, radio event or other event that I’m sure platform tools will show.
Item 4 – Something I’m not even thinking of that you have used.
There have got to be tools out there that do this that aren’t as prestigious as a well-executed python script, and I’m willing to go down that path, I just don’t know what those tools are.
Oh yeah, I can’t post log files to the web, so don’t ask. The users are promising to report trends when they see them, but I’m not exactly hopeful on that front. All I need to find is either a pattern in the logs, or steps to duplicate the problem.
So there you have it. What tools or techniques can I use to even start on this?
I was wondering if you'd looked at the ELK stack? It's an acronym for Elasticsearch, Logstash and Kibana, and it fits your use case closely; it's often used for analysis of large numbers of log files.
Elasticsearch and Kibana give you a UI that lets you interactively explore and chart data for trends. It is very powerful and quite straightforward to set up on a Linux platform, and there's a Windows version too. (It took me a day or two of setup, but you get a lot of functional power from it.) The software is free to download and use. You could use this in a style similar to idea 1 / 2.
https://www.elastic.co/webinars/introduction-elk-stack
http://logz.io/learn/complete-guide-elk-stack/
On the question of Python / idea 4 (which ELK could be considered part of): I haven't done this for log files, but I have used regular expressions to search for and extract text patterns from documents using Python. That may also help you find patterns if you have some leads on the sorts of patterns you are looking for.
Just a couple of thoughts; hope they help.
There is no input data at all to this problem so this answer will be basically pure theory, a little collection of ideas you could consider.
To analyze patterns in a large number of logs, you could create some graphs displaying relevant data, which could help to narrow down the problem. Python is really very good for this kind of task.
You could also transform/insert the logs into databases, that way you'd be able to query the relevant suspicious events much faster and even compare massively all your logs.
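A minimal sketch of the database idea using Python's built-in `sqlite3` module. The table name, the column layout (taken from the schema given in the question, plus a file name) and the helper names are assumptions.

```python
import sqlite3


def load_logs(conn, rows):
    """Insert (file, level, date, time, system, source, message) tuples."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS log ("
        "file TEXT, level TEXT, date TEXT, time TEXT, "
        "system TEXT, source TEXT, message TEXT)"
    )
    conn.executemany("INSERT INTO log VALUES (?, ?, ?, ?, ?, ?, ?)", rows)
    conn.commit()


def sources_per_level(conn, level):
    """Return (source, count) pairs for one log level, most frequent first."""
    cur = conn.execute(
        "SELECT source, COUNT(*) FROM log WHERE level = ? "
        "GROUP BY source ORDER BY COUNT(*) DESC, source",
        (level,),
    )
    return cur.fetchall()
```

Once all logs are loaded, comparing the same query between files that ended in a crash and files that did not becomes a matter of adding a `WHERE file IN (...)` clause.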
A simpler approach could be to focus on a single log showing the crash. Instead of wasting a lot of effort or resources trying to find some kind of generic pattern, start by reading through one log in order to catch suspicious "events" which could produce the crash.
My favourite approach for this type of tricky problem is different from the previous ones: instead of focusing on analyzing or even parsing the logs, I'd just try to reproduce the bug(s) in a deterministic way locally (you don't even need to have the source code). Sometimes it's really difficult to replicate the production environment in your dev environment, but it is definitely time well invested. All the effort you put into this process will help you not only to solve these bugs but also to improve your software much faster. Remember, the more times you're able to iterate, the better.
Another approach could just be coding a little script which would allow you to replay logs which crashed, not sure if that'll be easy in your environment though. Usually this strategy works quite well with production software using web-services where there will be a lot of tuples with data-requests and data-retrieves.
In any case, without seeing the type of data in your logs I can't be more specific or give more concrete details.
Due to several edits, this question might have become a bit incoherent. I apologize.
I'm currently writing a Python server. It will never see more than 4 active users, but I'm a computer science student, so I'm planning for it anyway.
Currently, I'm about to implement a function to save a backup of the current state of all relevant variables into CSV files. Of those I currently have 10, and they will never be really big, but... well, computer science student and so on.
So, I am currently thinking about two things:
When to run a backup?
What kind of backup?
When to run:
I can either run a backup every time a variable changes, which has the advantage of always having the current state in the backup, or something like once every minute, which has the advantage of not rewriting the file hundreds of times per minute if the server gets busy, but will create a lot of useless rewrites of the same data if I don't implement detection of which variables have changed since the last backup.
Directly related to that is the question what kind of backup I should do.
I can either do a full backup of all variables (Which is pointless if I'm running a backup every time a variable changes, but might be good if I'm running a backup every X minutes), or a full backup of a single variable (Which would be better if I'm backing up each time the variables change, but would involve either multiple backup functions or a smart detection of the variable that is currently backed up), or I can try some sort of delta-backup on the files (Which would probably involve reading the current file and rewriting it with the changes, so it's probably pretty stupid, unless there is a trick for this in Python I don't know about).
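The "detect which variables changed" option can be as small as a dirty set, which then works equally well for periodic and on-change backups. The class below is a hypothetical sketch; it assumes each variable is a list of rows written to its own CSV file.

```python
import csv
import os


class BackedUpState:
    """Holds named tables and remembers which ones changed since the last backup."""

    def __init__(self, directory):
        self.directory = directory
        self.tables = {}    # name -> list of rows
        self.dirty = set()  # names changed since the last backup

    def set(self, name, rows):
        self.tables[name] = rows
        self.dirty.add(name)

    def backup(self):
        """Rewrite only the changed tables; return the names written."""
        written = sorted(self.dirty)
        for name in written:
            path = os.path.join(self.directory, name + ".csv")
            with open(path, "w", newline="") as f:
                csv.writer(f).writerows(self.tables[name])
        self.dirty.clear()
        return written
```

Calling `backup()` from a timer gives the "every X minutes" variant without useless rewrites; calling it at the end of `set()` gives the "on change" variant.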
I cannot use shelves because I want the data to be portable between different programming languages (Java, for example, probably cannot open Python shelves), and I cannot use MySQL for different reasons, mainly that the machine that will run the server has no MySQL support and I don't want to use an external MySQL server, since I want the server to keep running when the internet connection drops.
I am also aware of the fact that there are several ways to do this with preimplemented functions of Python and/or other software (SQLite, for example). I am just a big fan of building this stuff myself, not because I like to reinvent the wheel, but because I like to know how the things I use work. I'm building this server partly just for learning Python, and although knowing how to use SQLite is something useful, I also enjoy doing the "dirty work" myself.
In my usage scenario of possibly a few requests per day I am tending towards the "backup on change" idea, but that would quickly fall apart if, for some reason, the server gets really, really busy.
So, my question basically boils down to this: Which backup method would be the most useful in this scenario, and have I possibly missed another backup strategy? How do you decide on which strategy to use in your applications?
Please note that I raise this question mostly out of a general curiosity for backup strategies and the thoughts behind them, and not because of problems in this special case.
Use sqlite. You're asking about building persistent storage using csv files, and about how to update the files as things change. What you're asking for is a lightweight, portable relational (as in, table based) database. Sqlite is perfect for this situation.
Python has had sqlite support in the standard library since version 2.5 with the sqlite3 module. Since a sqlite database is implemented as a single file, it's simple to move them across machines, and Java has a number of different ways to interact with sqlite.
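As a rough illustration of how little code this takes with the `sqlite3` module, here is a key/value store sketch. The table name `state` and the helper names are assumptions; values are stored as text.

```python
import sqlite3


def open_store(path=":memory:"):
    """Open (or create) the database file and make sure the table exists."""
    conn = sqlite3.connect(path)
    conn.execute(
        "CREATE TABLE IF NOT EXISTS state (key TEXT PRIMARY KEY, value TEXT)"
    )
    return conn


def put(conn, key, value):
    """Insert or overwrite one value inside a transaction."""
    with conn:  # commits on success, rolls back on an exception
        conn.execute(
            "INSERT OR REPLACE INTO state (key, value) VALUES (?, ?)",
            (key, value),
        )


def get(conn, key, default=None):
    """Return the stored value for key, or default if it is missing."""
    row = conn.execute(
        "SELECT value FROM state WHERE key = ?", (key,)
    ).fetchone()
    return row[0] if row else default
```

Every `put` is a small transaction, so "backup on change" comes for free and only the changed row is written.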
I'm all for doing things for the sake of learning, but if you really want to learn about data persistence, I wouldn't marry yourself to the idea of a "csv database". I would start by looking at the wikipedia page for Persistence. What you're thinking about is basically a "System Image" for your data. The Wikipedia article describes some of the same shortcomings of this approach that you've mentioned:
"State changes made to a system after its last image was saved are lost in the case of a system failure or shutdown. Saving an image for every single change would be too time-consuming for most systems"
Rather than trying to update your state wholesale at every change, I think you'd be better off looking at some other form of persistence. For example, some sort of journal could work well. This makes it simple to just append any change to the end of a log-file, or some similar construct.
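A journal along those lines could look like the sketch below: each change is appended as one JSON line, and the current state is rebuilt by replaying the file. The JSON-lines format and the function names are assumptions, chosen because the format is readable from other languages.

```python
import json
import os


def append_change(path, key, value):
    """Append one change record to the journal and force it to disk."""
    with open(path, "a", encoding="utf-8") as f:
        f.write(json.dumps({"key": key, "value": value}) + "\n")
        f.flush()
        os.fsync(f.fileno())


def replay(path):
    """Rebuild the current state by applying every change in order."""
    state = {}
    if not os.path.exists(path):
        return state
    with open(path, encoding="utf-8") as f:
        for line in f:
            change = json.loads(line)
            state[change["key"]] = change["value"]
    return state
```

The journal grows without bound, so a real implementation would occasionally write a full snapshot and start a fresh journal.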
However, if you end up with many concurrent users, with processes running on multiple threads, you'll run into concerns about whether your changes are atomic, or whether they conflict with one another. While operating systems generally have some ways of dealing with locking files for edits, you're opening up a can of worms trying to learn how that works and interacts with your system. At this point you're back to needing a database.
So sure, play around with a couple different approaches. But as soon as you're looking to just get it working in a clear and consistent manner, go with sqlite.
If your data is in CSV files, why not use a revision control system on those files? E.g. git would be pretty fast and give excellent history. The repository would be wholly contained in the directory where the files reside, so it's pretty easy to handle. You could also replicate that repository to other machines or directories easily.
I need to compare values that I get by executing WMI commands (using Python) with values from inside a DB. Is it best to compare them without storing them in separate files, or is storing and then comparing the only possible way?
Can someone please point me in the right direction? Also, where should I look to learn more about this?
If in doubt, go for the simplest solution. In this case, compare them in memory.
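An in-memory comparison might be no more than this, assuming both sides have already been loaded into dictionaries keyed by the same names (the property names in the usage note are made up):

```python
def diff_values(wmi_values, db_values):
    """Return {key: (wmi_value, db_value)} for every key that differs."""
    keys = wmi_values.keys() | db_values.keys()
    return {
        key: (wmi_values.get(key), db_values.get(key))
        for key in sorted(keys)
        if wmi_values.get(key) != db_values.get(key)
    }
```

For example, `diff_values({"cpu": "i7", "ram": 8}, {"cpu": "i7", "ram": 16})` returns `{"ram": (8, 16)}`; a key missing on one side shows up with `None` for that side.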
If you want to be ultra-reliable (i.e. survive crashes of your application or a power outage) or cache values for long times (i.e. it's a requirement to continue working even when the database is down), you may consider files. Be warned though: with anything but an extremely careful implementation (you should have lots of try..except..finallys and at least one call each to flock and fsync), storing in files tends to be less reliable. So unless you're interested in consistency research and willing to put in a few weeks, go for a simple Python comparison.
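To give a feel for the care involved, here is a sketch of one crash-safe write. It uses fsync as mentioned above, but swaps flock for a different technique (write to a temporary file, then rename it over the target), so it does not protect against concurrent writers. The function name is hypothetical.

```python
import os
import tempfile


def atomic_write(path, text):
    """Replace `path` with `text` so readers see the old or the new file, never half."""
    directory = os.path.dirname(os.path.abspath(path))
    # The temporary file must be on the same filesystem for the rename to be atomic.
    fd, tmp_path = tempfile.mkstemp(dir=directory)
    try:
        with os.fdopen(fd, "w", encoding="utf-8") as f:
            f.write(text)
            f.flush()
            os.fsync(f.fileno())  # push the data to disk before renaming
        os.replace(tmp_path, path)
    except BaseException:
        if os.path.exists(tmp_path):
            os.remove(tmp_path)
        raise
```

Even this leaves out details such as syncing the containing directory, which is part of why the simple in-memory comparison is the better default.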
I am reading a CSV file into a list of lists in Python. It is around 100 MB right now. In a couple of years that file will grow to 2-5 GB. I am doing lots of log calculations on the data. The 100 MB file takes the script around 1 minute to process. After the script does a lot of fiddling with the data, it creates URLs that point to Google Charts and then downloads the charts locally.
Can I continue to use Python on a 2 GB file, or should I move the data into a database?
I don't know exactly what you are doing, but a database will just change how the data is stored. In fact it might take longer, since most reasonable databases may have constraints put on columns and additional processing for the checks. In many cases having the whole file local and going through it doing calculations is going to be more efficient than querying and writing back to the database (subject to disk speeds, network and database contention, etc.). But in some cases the database may speed things up, especially because with indexing it is easy to get subsets of the data.
You mentioned logs, so before you go database crazy I have the following ideas for you to check out. I'm not sure whether you have to keep going through every log since the beginning of time to download charts and expect the total to grow to 2 GB, or whether you are eventually expecting 2 GB of traffic per day/week.
ARCHIVING -- you can archive old logs, say every few months. Copy the production logs to an archive location and clear the live logs out. This will keep the file size reasonable. If you are wasting time accessing the file to find the small piece you need then this will solve your issue.
You might want to consider converting to Java or C. Especially on loops and calculations you might see a factor of 30 or more speedup. This will probably reduce the time immediately. But over time as data creeps up, some day this will slow down as well. if you have no bound on the amount of data, eventually even hand optimized Assembly by the world's greatest programmer will be too slow. But it might give you 10x the time...
You also may want to think about figuring out the bottleneck (is it disk access, is it cpu time) and based on that figuring out a scheme to do this task in parallel. If it is processing, look into multi-threading (and eventually multiple computers), if it is disk access consider splitting the file among multiple machines...It really depends on your situation. But I suspect archiving might eliminate the need here.
As was suggested, if you are doing the same calculations over and over again, then just store them. Whether you use a database or a file this will give you a huge speedup.
If you are downloading stuff and that is a bottleneck, look into conditional gets using the if modified request. Then only download changed items. If you are just processing new charts then ignore this suggestion.
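A conditional GET can be sketched with the standard library as follows. The server has to support `If-Modified-Since` for this to save anything, and the function names are hypothetical; `urllib` reports a 304 response as an `HTTPError`.

```python
import urllib.error
import urllib.request
from email.utils import formatdate


def conditional_request(url, last_fetch_epoch):
    """Build a GET request that asks only for content newer than the last fetch."""
    req = urllib.request.Request(url)
    req.add_header("If-Modified-Since", formatdate(last_fetch_epoch, usegmt=True))
    return req


def fetch_if_changed(url, last_fetch_epoch):
    """Return the body if the resource changed, or None on 304 Not Modified."""
    try:
        with urllib.request.urlopen(conditional_request(url, last_fetch_epoch)) as resp:
            return resp.read()
    except urllib.error.HTTPError as err:
        if err.code == 304:
            return None
        raise
```

The timestamp of the last successful download has to be stored somewhere between runs, for example next to the downloaded chart.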
Oh and if you are sequentially reading a giant log file, looking for a specific place in the log line by line, just make another file storing the last file location you worked with and then do a seek each run.
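That bookkeeping could look like the sketch below, assuming the log is only ever appended to. The offset file format (a single integer) and the function name are illustrative.

```python
import os


def read_new_lines(log_path, offset_path):
    """Return the lines appended since the previous call and remember the new position."""
    offset = 0
    if os.path.exists(offset_path):
        with open(offset_path) as f:
            offset = int(f.read().strip() or 0)
    with open(log_path, "rb") as log:
        log.seek(offset)           # skip everything handled in earlier runs
        lines = log.readlines()
        new_offset = log.tell()
    with open(offset_path, "w") as f:
        f.write(str(new_offset))
    return [line.decode("utf-8", "replace").rstrip("\n") for line in lines]
```

If the log is ever rotated or truncated, the stored offset has to be reset, for example by checking it against the current file size.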
Before an entire database, you may want to think of SQLite.
Finally, a "couple of years" seems like a long time in programmer time. Even if it is just 2, a lot can change. Maybe your department/division will be laid off. Maybe you and your boss will have moved on. Maybe the system will be replaced by something else. Maybe there will no longer be a need for what you are doing. If it was 6 months I'd say fix it. But for a couple of years, in most cases, I'd say just use the solution you have now, and once it gets too slow then look to do something else. You could make a comment in the code with your thoughts on the issue and even send an e-mail to your boss so he knows about it as well. But as long as it works and will continue doing so for a reasonable amount of time, I would consider it "done" for now. No matter what solution you pick, if data grows unbounded you will need to reconsider it: adding more machines, more disk space, new algorithms/systems/developments. Solving it for a "couple of years" is probably pretty good.
If you need to go through all lines each time you perform the "fiddling", it wouldn't really make much difference, assuming the actual "fiddling" is what's eating your cycles.
Perhaps you could store the results of your calculations somehow, then a database would probably be nice. Also, databases have methods for ensuring data integrity and stuff like that, so a database is often a great place for storing large sets of data (duh! ;)).
I'd only put it into a relational database if:
The data is actually relational and expressing it that way helps shrink the size of the data set by normalizing it.
You can take advantage of triggers and stored procedures to offload some of the calculations that your Python code is performing now.
You can take advantage of queries to only perform calculations on data that's changed, cutting down on the amount of work done by Python.
If none of those things is true, I don't see much difference between a database and a file. Both ultimately have to be stored on the file system.
If Python has to process all of it, and getting it into memory means loading an entire data set, then there's no difference between a database and a flat file.
2 GB of data in memory could mean page swapping and thrashing by your application. I would be careful and get some data before I blamed the problem on the file. Accessing the data from a database won't by itself solve a paging problem.
If your data's flat, I see less advantage in a database, unless "flat" == "highly denormalized".
I'd recommend some profiling to see what's consuming CPU and memory before I made a change. You're guessing about the root cause right now. Better to get some data so you know where the time is being spent.
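For the CPU side, the standard library's `cProfile` is enough to get started. In the sketch below, `fiddle` is a made-up stand-in for the real calculation and `profile_call` is a hypothetical helper that returns the report as text.

```python
import cProfile
import io
import math
import pstats


def fiddle(rows):
    """Stand-in for the real work: take the log of every value."""
    return [[math.log(x) for x in row] for row in rows]


def profile_call(func, *args, top=5):
    """Run func(*args) under cProfile; return (result, report text)."""
    profiler = cProfile.Profile()
    result = profiler.runcall(func, *args)
    out = io.StringIO()
    stats = pstats.Stats(profiler, stream=out)
    stats.sort_stats("cumulative").print_stats(top)
    return result, out.getvalue()
```

If the report shows most of the time in CSV parsing rather than in the calculations, the storage format is worth revisiting; if it is in the calculations, a database will not help.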
I always reach for a database for larger datasets.
A database gives me some stuff for "free"; that is, I don't have to code it.
searching
sorting
indexing
language-independent connections
Something like SQLite might be the answer for you.
Also, you should investigate the "nosql" databases; it sounds like your problem might fit well into one of them.
At 2 gigs, you may start running up against speed issues. I work with model simulations that read hundreds of CSV files, and it takes about an hour to go through 3 iterations, or about 20 minutes per loop.
This is a matter of personal preference, but I would go with something like PostgreSQL because it combines the convenience of Python with the capacity of a SQL-driven relational database. I encountered the same issue a couple of years ago when my Access db was corrupting itself and crashing on a daily basis. It was either MySQL or Postgres, and I chose Postgres because of its Python friendliness. That's not to say MySQL would not work with Python, because it does, which is why I say it's personal preference.
Hope that helps with your decision-making!