Data analysis of log files – How to find a pattern? - python

My company has slightly more than 300 vehicle-based Windows CE 5.0 mobile devices that all share the same software and usage model of Direct Store Delivery during the day, then doing a Tcom at the home base every night. There is an unknown event (or events) that results in the device freaking out and rebooting itself in the middle of the day. The frequency of this issue is ~10 times per week across the fleet of computers, which all reboot daily, 6 days a week. The math is 300 * 6 = 1800 boots per week (at least), and 10/1800 ≈ 0.56%. I realize that number is very low, but it is more than my boss wants to have.
My challenge is to find a way to scan through several thousand logfile.txt files and try to find some sort of pattern. I KNOW there is a pattern here somewhere. I've got a couple of ideas of where to start, but I wanted to throw this out to the community and see what suggestions you all might have.
A bit of background on this issue: the application starts a new log file at each boot. In an orderly (control) log file, you see the app start up, do its thing all day, and then start a shutdown process in a somewhat orderly fashion 8-10 hours later. In a problem log file, you see the device start up and then the log ends, without any shutdown sequence at all, in less than 8 hours. It then starts a new log file, which shares the same date as the logfile1.old that it made in the rename process. The application we have was home-grown by Windows developers who are no longer with the company. Even better, nobody currently knows who has the source.
I'm aware of the various CE tools that can be used to detect memory leaks (DevHealth, retail messages, etc.) and we are investigating that route as well; however, I'm convinced that there is a pattern to be found that I'm just not smart enough to find. There has to be a way to do this using Perl or Python that I'm just not seeing. Here are two ideas I have.
Idea 1 – Look for trends in word usage.
Create an array of every unique word used in the entire log file and output a count of each word. Once I had a count of the words being used, I could run some stats on them and look for the non-normal events. Perhaps the word "purple" is used 500 times in a 1000-line control log (there might be some math there?) and only 4 times in a 500-line problem log. Perhaps there is a unique word that is only seen in the problem files. Maybe I could get a reverse "word cloud"?
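For illustration, a minimal sketch of Idea 1 in Python, assuming the log files have been sorted into two placeholder folders, control_logs/ and problem_logs/ (the folder names and the 10x threshold are just assumptions):

```python
import glob
import re
from collections import Counter

def word_counts(path):
    """Count every lowercased word in one log file."""
    counts = Counter()
    with open(path, errors="ignore") as f:
        for line in f:
            counts.update(re.findall(r"[a-z]+", line.lower()))
    return counts

control = Counter()
for path in glob.glob("control_logs/*.txt"):
    control += word_counts(path)

problem = Counter()
for path in glob.glob("problem_logs/*.txt"):
    problem += word_counts(path)

# Words that only ever appear in problem logs are the most interesting.
print(sorted(set(problem) - set(control)))

# For shared words, compare relative frequencies between the two groups.
total_c = sum(control.values()) or 1
total_p = sum(problem.values()) or 1
for word in set(control) & set(problem):
    ratio = (problem[word] / total_p) / (control[word] / total_c)
    if ratio > 10:  # arbitrary threshold: 10x more common in problem logs
        print(word, round(ratio, 1))
```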
Idea 2 – Categorize lines by entry type and then look for trends in the sequence of entry types.
The log files already have a predictable schema that looks like this: Level|date|time|system|source|message
I'm 99% sure there is a visible pattern here that I just can't find. All of the logs got turned up to "super duper verbose", so there is a boatload of fluff (25 log lines per second, 40k lines per file) that makes this even more challenging. If there isn't a unique word, then a pattern in the sequence of entry types almost has to be there. How do I do this?
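A rough sketch of Idea 2, parsing the Level|date|time|system|source|message schema and comparing the tail of a problem log against a control log (the file names and the 500-line tail are placeholders):

```python
from collections import Counter

def entry_types(path):
    """Categorize each line by (level, system, source) using the
    Level|date|time|system|source|message schema."""
    types = []
    with open(path, errors="ignore") as f:
        for line in f:
            parts = line.rstrip("\n").split("|", 5)
            if len(parts) == 6:
                level, _date, _time, system, source, _message = parts
                types.append((level, system, source))
    return types

def tail_profile(path, n=500):
    """Count entry types in the last n lines; in a problem log the
    crash should be somewhere near the end."""
    return Counter(entry_types(path)[-n:])

# Compare the tail of a known-bad log against a known-good one
# (both file names are placeholders).
bad = tail_profile("problem_logfile1.old")
good = tail_profile("control_logfile.txt")
for entry_type, count in (bad - good).most_common(20):
    print(count, entry_type)
```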
Item 3 – Hire a Windows CE platform developer
Yes, we are going down that path as well, but I KNOW there is a pattern I'm missing. They will use the tools that I don't have (or make the tools that we need) to figure out what's up. I suspect that there might be a memory leak, radio event, or other event that the platform tools will show.
Item 4 – Something I’m not even thinking of that you have used.
There have got to be tools out there that do this that aren't as prestigious as a well-executed Python script, and I'm willing to go down that path; I just don't know what those tools are.
Oh yeah, I can't post log files to the web, so don't ask. The users are promising to report trends when they see them, but I'm not exactly hopeful on that front. All I need to find is either a pattern in the logs or steps to duplicate the problem.
So there you have it. What tools or techniques can I use to even start on this?

I was wondering if you'd looked at the ELK stack? It's an acronym for Elasticsearch, Logstash and Kibana, and it fits your use case closely; it's often used for analysis of large numbers of log files.
Elasticsearch and Kibana give you a UI that lets you interactively explore and chart data for trends. It's very powerful and quite straightforward to set up on a Linux platform, and there's a Windows version too. (It took me a day or two of setup, but you get a lot of functional power from it.) The software is free to download and use. You could use this in a style similar to ideas 1 and 2.
https://www.elastic.co/webinars/introduction-elk-stack
http://logz.io/learn/complete-guide-elk-stack/
On the question of Python / idea 4 (which ELK could be considered part of): I haven't done this for log files, but I have used regular expressions to search for and extract text patterns from documents using Python. That may also help you find patterns if you have some leads on the sorts of patterns you are looking for.
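For example, a minimal regex sketch in Python; the pattern here is made up, and you'd substitute whatever strings you suspect (radio resets, low-memory warnings, and so on):

```python
import re

# Hypothetical pattern: any line mentioning a reset, low memory, or the radio.
suspect = re.compile(r"reset|low memory|radio", re.IGNORECASE)

with open("logfile.txt", errors="ignore") as f:
    for lineno, line in enumerate(f, 1):
        if suspect.search(line):
            print(lineno, line.rstrip())
```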
Just a couple of thoughts; hope they help.

There is no input data at all for this problem, so this answer will be basically pure theory: a little collection of ideas you could consider.
To analyze patterns across a bunch of logs, you could definitely create some graphs displaying relevant data, which could help narrow down the problem; Python is really very good for this kind of task.
You could also transform/insert the logs into databases; that way you'd be able to query the relevant suspicious events much faster and even compare all your logs in bulk.
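As a rough illustration of the database idea, here is a sketch that bulk-loads the pipe-delimited lines from the question into SQLite so every log can be queried at once (the folder layout and table name are assumptions):

```python
import glob
import sqlite3

conn = sqlite3.connect("logs.db")
conn.execute("""CREATE TABLE IF NOT EXISTS entries
                (logfile TEXT, level TEXT, date TEXT, time TEXT,
                 system TEXT, source TEXT, message TEXT)""")

for path in glob.glob("logs/*.txt"):
    with open(path, errors="ignore") as f:
        rows = [[path] + line.rstrip("\n").split("|", 5)
                for line in f if line.count("|") >= 5]
    conn.executemany("INSERT INTO entries VALUES (?,?,?,?,?,?,?)", rows)
conn.commit()

# Example query: which sources produce the most entries across all logs?
for row in conn.execute("""SELECT source, COUNT(*) AS n FROM entries
                           GROUP BY source ORDER BY n DESC LIMIT 10"""):
    print(row)
```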
A simpler approach could be just focusing on a single log showing the crash: instead of wasting a lot of effort or resources trying to find some kind of generic pattern, start by reading through one log in order to catch suspicious "events" which could produce the crash.
My favourite approach for these types of tricky problems is different from the previous ones: instead of focusing on analyzing or even parsing the logs, I'd just try to reproduce the bug(s) in a deterministic way locally (you don't even need to have the source code). Sometimes it's really difficult to replicate the production environment in your dev environment, but it is definitely time well invested. All the effort you put into this process will help you not only to solve these bugs but also to improve your software much faster. Remember, the more times you're able to iterate, the better.
Another approach could just be coding a little script which would allow you to replay the logs which crashed; I'm not sure if that'll be easy in your environment, though. Usually this strategy works quite well with production software using web services, where there are lots of request/response pairs to replay.
In any case, without seeing the type of data in your logs, I can't be more specific or give more concrete details.

Related

What is the best way to track / record the current programming project you work on

I have had this problem for a long time and I want to know how it's done in real / big company projects.
Suppose I have a project to build a website. I divide the project into subtasks and work through them.
But, you know, suppose I have task1 in hand, like exporting the page to PDF. I spend 3 days on it, come across various problems and many Stack Overflow questions, and in the end I solve it.
Now, 4 months later, someone tells me that there is some error in the code.
By then I have mostly (60%) forgotten how I did it and why I did it that way. I document the code, but I can't write the whole story of it in the code.
Then I have to spend a lot of time in the code to find what the problem was that made me add this line, etc.
I want to know: is there any way that I can log the steps taken in completing the project?
So that I can see how I ended up with the code, what errors I got, what questions I asked on SO, and so on.
How do people do it in real projects? Which software do they use?
I know that in our project management software, JIRA, we have tasks, but that does not cover the steps I took to solve those tasks.
What is the best way, so that when I look back at my 2-year-old project, I know how I solved a particular task?
If you are already using JIRA consider integrating it with your SCM.
When committing your changes to SCM refer to your JIRA issue number in comments. Like the following:
PORTAL-778 fixed the alignment issue with PDF exports
JIRA periodically connects to your SCM and parses the comments. You can easily find out changes made for a particular issue.
Please see the following link for more information
Integrating JIRA with Subversion
Every time you revisit code, make a list of the information you are not finding. Then the next time you create code, make sure that information is present. It can be in comments, Wiki, bugs or even text notes in a separate file. Make the notes useful for other people, so private notebooks aren't a good idea except for personal notes.

Persistent database state strategies

Due to several edits, this question might have become a bit incoherent. I apologize.
I'm currently writing a Python server. It will never see more than 4 active users, but I'm a computer science student, so I'm planning for it anyway.
Currently, I'm about to implement a function to save a backup of the current state of all relevant variables into CSV files. Of those I currently have 10, and they will never be really big, but... well, computer science student and so on.
So, I am currently thinking about two things:
When to run a backup?
What kind of backup?
When to run:
I can either run a backup every time a variable changes, which has the advantage of always having the current state in the backup, or something like once every minute, which has the advantage of not rewriting the file hundreds of times per minute if the server gets busy, but will create a lot of useless rewrites of the same data if I don't implement detection of which variables have changed since the last backup.
Directly related to that is the question what kind of backup I should do.
I can either do a full backup of all variables (which is pointless if I'm running a backup every time a variable changes, but might be good if I'm running a backup every X minutes), or a full backup of a single variable (which would be better if I'm backing up each time a variable changes, but would involve either multiple backup functions or smart detection of which variable is currently being backed up), or I can try some sort of delta backup on the files (which would probably involve reading the current file and rewriting it with the changes, so it's probably pretty stupid, unless there is a trick for this in Python I don't know about).
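For what it's worth, a minimal sketch of the middle ground between those two options: periodic backups that only rewrite variables flagged as changed (the class name, interval, and one-CSV-per-variable layout are all placeholders):

```python
import csv
import threading

class BackupManager:
    """Periodic backup that only rewrites variables flagged as changed.
    The 60-second interval and one-CSV-per-variable layout are placeholders."""

    def __init__(self, interval=60):
        self.interval = interval
        self.variables = {}            # name -> list of rows to persist
        self.dirty = set()             # names changed since the last backup
        self.lock = threading.Lock()

    def update(self, name, rows):
        with self.lock:
            self.variables[name] = rows
            self.dirty.add(name)

    def backup(self):
        with self.lock:
            to_write, self.dirty = self.dirty, set()
            snapshot = {name: list(self.variables[name]) for name in to_write}
        for name, rows in snapshot.items():
            with open(name + ".csv", "w", newline="") as f:
                csv.writer(f).writerows(rows)
        # Re-arm the timer so backups keep running every `interval` seconds.
        threading.Timer(self.interval, self.backup).start()

# Usage: call manager.backup() once at startup to start the cycle,
# and manager.update("scores", rows) whenever a variable changes.
```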
I cannot use shelves because I want the data to be portable between different programming languages (Java, for example, probably cannot open Python shelves), and I cannot use MySQL for different reasons, mainly that the machine that will run the server has no MySQL support, and I don't want to use an external MySQL server, since I want the server to keep running when the internet connection drops.
I am also aware of the fact that there are several ways to do this with pre-implemented functions of Python and/or other software (SQLite, for example). I am just a big fan of building this stuff myself, not because I like to reinvent the wheel, but because I like to know how the things I use work. I'm building this server partly just to learn Python, and although knowing how to use SQLite is something useful, I also enjoy doing the "dirty work" myself.
In my usage scenario of possibly a few requests per day I am tending towards the "backup on change" idea, but that would quickly fall apart if, for some reason, the server gets really, really busy.
So, my question basically boils down to this: Which backup method would be the most useful in this scenario, and have I possibly missed another backup strategy? How do you decide on which strategy to use in your applications?
Please note that I raise this question mostly out of a general curiosity for backup strategies and the thoughts behind them, and not because of problems in this special case.
Use SQLite. You're asking about building persistent storage using CSV files, and about how to update the files as things change. What you're asking for is a lightweight, portable, relational (as in, table-based) database. SQLite is perfect for this situation.
Python has had SQLite support in the standard library since version 2.5 with the sqlite3 module. Since a SQLite database is implemented as a single file, it's simple to move one across machines, and Java has a number of different ways to interact with SQLite.
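A minimal sketch of what that looks like with the standard-library sqlite3 module (the table and column names are made up for illustration):

```python
import sqlite3

conn = sqlite3.connect("state.db")   # one portable file on disk
conn.execute("""CREATE TABLE IF NOT EXISTS settings
                (key TEXT PRIMARY KEY, value TEXT)""")

# INSERT OR REPLACE keeps only the latest value for each key.
conn.execute("INSERT OR REPLACE INTO settings VALUES (?, ?)",
             ("motd", "hello"))
conn.commit()

for key, value in conn.execute("SELECT key, value FROM settings"):
    print(key, value)

conn.close()
```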
I'm all for doing things for the sake of learning, but if you really want to learn about data persistence, I wouldn't get married to the idea of a "CSV database". I would start by looking at the Wikipedia page for Persistence. What you're thinking about is basically a "system image" of your data. The Wikipedia article describes some of the same shortcomings of this approach that you've mentioned:
State changes made to a system after its last image was saved are lost in the case of a system failure or shutdown. Saving an image for every single change would be too time-consuming for most systems.
Rather than trying to update your state wholesale at every change, I think you'd be better off looking at some other form of persistence. For example, some sort of journal could work well. This makes it simple to just append any change to the end of a log-file, or some similar construct.
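A minimal sketch of that journaling idea, appending one JSON line per change and replaying the file on startup (the file name and entry fields are assumptions):

```python
import json
import time

JOURNAL = "journal.log"

def record_change(name, value):
    """Append one change as a JSON line; appends are cheap and keep
    a full history of every state change."""
    with open(JOURNAL, "a") as f:
        f.write(json.dumps({"ts": time.time(), "name": name, "value": value}) + "\n")

def replay():
    """Rebuild the current state by replaying the journal from the start."""
    state = {}
    try:
        with open(JOURNAL) as f:
            for line in f:
                entry = json.loads(line)
                state[entry["name"]] = entry["value"]
    except FileNotFoundError:
        pass
    return state
```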
However, if you end up with many concurrent users, with processes running on multiple threads, you'll run into concerns about whether or not your changes are atomic, or whether they conflict with one another. While operating systems generally have some ways of dealing with locking files for edits, you're opening up a can of worms trying to learn how that works and interacts with your system. At this point you're back to needing a database.
So sure, play around with a couple of different approaches. But as soon as you're looking to just get it working in a clear and consistent manner, go with SQLite.
If your data is in CSV files, why not use a revision control system on those files? E.g. git would be pretty fast and give excellent history. The repository would be wholly contained in the directory where the files reside, so it's pretty easy to handle. You could also replicate that repository to other machines or directories easily.

Selecting the most fluent text from a set of possibilities via grammar checking (Python)

Some background
I am a literature student at New College of Florida, currently working on an overly ambitious creative project. The project is geared towards the algorithmic generation of poetry. It's written in Python. My Python knowledge and Natural Language Processing knowledge come only from teaching myself things through the internet. I've been working with this stuff for about a year, so I'm not helpless, but at various points I've had trouble moving forward in this project. Currently, I am entering the final phases of development, and have hit a little roadblock.
I need to implement some form of grammatical normalization, so that the output doesn't come out as unconjugated/uninflected caveman-speak. About a month ago some friendly folks on SO gave me some advice on how I might solve this issue, basically by using an n-gram language modeller -- but I'm looking for yet other solutions, as it seems that NLTK's NgramModeler is not fit for my needs. (The possibilities of POS tagging were also mentioned, but my text may be too fragmentary and strange for such an implementation to come easy, given my amateurness.)
Perhaps I need something like AtD, but hopefully less complex
I think I need something that works like After the Deadline or Queequeg, but neither of these seems exactly right. Queequeg is probably not a good fit -- it was written in 2003 for Unix and I can't get it working on Windows for the life of me (I have tried everything). But I like that all it checks for is proper verb conjugation and number agreement.
On the other hand, AtD is much more rigorous, offering more capabilities than I need. But I can't seem to get the Python bindings for it working. (I get 502 errors from the AtD server, which I'm sure are easy to fix, but my application is going to be online, and I'd rather avoid depending on another server. I can't afford to run an AtD server myself, because the number of "services" my application is going to require of my web host is already threatening to cause problems in getting this application hosted cheaply.)
Things I'd like to avoid
Building n-gram language models myself doesn't seem right for the task. My application throws a lot of unknown vocabulary, skewing all the results. (Unless I use a corpus that's so large that it runs way too slowly for my application -- the application needs to be pretty snappy.)
Strict grammar checking isn't right for the task either. The grammar doesn't need to be perfect, and the sentences don't have to be any more sensible than the kind of English-like gibberish that you can generate using n-grams. Even if it's gibberish, I just need to enforce verb conjugation and number agreement, and do things like remove extra articles.
In fact, I don't even need any kind of suggestions for corrections. I think all I need is for something to tally up how many errors seem to occur in each sentence in a group of possible sentences, so I can sort by their score and pick the one with the least grammatical issues.
A simple solution? Scoring fluency by detecting obvious errors
If a script exists that takes care of all this, I'd be overjoyed (I haven't found one yet). I can write code for what I can't find, of course; I'm looking for advice on how to optimize my approach.
Let's say we have a tiny bit of text already laid out:
existing_text = "The old river"
Now let's say my script needs to figure out which inflection of the verb "to bear" could come next. I'm open to suggestions about this routine. But I need help mostly with step #2, rating fluency by tallying grammatical errors:
Use the Verb Conjugation methods in NodeBox Linguistics to come up with all conjugations of this verb; ['bear', 'bears', 'bearing', 'bore', 'borne'].
Iterate over the possibilities, (shallowly) checking the grammar of the string resulting from existing_text + " " + possibility ("The old river bear", "The old river bears", etc). Tally the error count for each construction. In this case the only construction to raise an error, seemingly, would be "The old river bear".
Wrapping up should be easy... Of the possibilities with the lowest error count, select randomly.
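A sketch of what steps 2 and 3 might look like, with the grammar checker left as a parameter since the question of which checker to use is still open (the lambda in the example is a dummy stand-in, not a real checker):

```python
import random

def most_fluent(existing_text, possibilities, count_errors):
    """Pick randomly among the candidates tied for the fewest detected
    errors. count_errors is whatever checker gets chosen in the end
    (AtD, a link-grammar binding, etc.): it takes a sentence and
    returns an integer error count."""
    scored = [(count_errors(existing_text + " " + p), p) for p in possibilities]
    best = min(score for score, _ in scored)
    return random.choice([p for score, p in scored if score == best])

# Example with a dummy checker that only flags the known-bad construction:
print(most_fluent("The old river",
                  ["bear", "bears", "bearing", "bore", "borne"],
                  lambda s: 1 if s.endswith("river bear") else 0))
```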
Very cool project, first of all.
I found a Java grammar checker. I've never used it, but the docs claim it can run as a server. Both Java and listening on a port should be supported basically anywhere.
I'm just getting into NLP with a CS background so I wouldn't mind going into more detail to help you integrate whatever you decide on using. Feel free to ask for more detail.
Another approach would be to use what is called an overgenerate-and-rank approach. In the first step, you have your poetry generator produce multiple candidate generations. Then you use a service like Amazon's Mechanical Turk to collect human judgments of fluency. I would actually suggest collecting simultaneous judgments for a number of sentences generated from the same seed conditions. Lastly, you extract features from the generated sentences (presumably using some form of syntactic parser) to train a model to rate or classify question quality. You could even throw in the heuristics listed above.
Michael Heilman uses this approach for question generation. For more details, read these papers:
Good Question! Statistical Ranking for Question Generation and
Rating Computer-Generated Questions with Mechanical Turk.
The pylinkgrammar link provided above is a bit out of date. It points to version 0.1.9, and the code samples for that version no longer work. If you go down this path, be sure to use the latest version which can be found at:
https://pypi.python.org/pypi/pylinkgrammar

Data recognition, parsing, filtering, and transformation -- GUI?

Looking for a non-cloud-based open source app for doing data transformation; though for a killer (and I mean killer) app built just for data transformations, I might be willing to spend up to $1000.
I've looked at Perl, Kapow Katalyst, Pentaho Kettle, and more.
Perl, Python, and Ruby are clearly languages, but I've been unable to find any frameworks/DSLs just for processing data; meaning they're really not great development environments for this, in that there are no built-in GUIs for building regexes or input/output (CSV, XML, JDBC, REST, etc.), and no debugger for testing rows and rows of data. They're not bad either, just not what I'm looking for, which is a GUI built for complex data transformations; that said, I'd love it if the GUI/app file were in a scripting language, and NOT just stored in some non-human-readable XML/ASCII file.
Kapow Katalyst is made for accessing data via HTTP (HTML, CSS, RSS, JavaScript, etc.). It's got a nice GUI for transforming unstructured text, but that's not its core value offering, and it is way, way too expensive. It does an okay job of traversing document namespace paths; I'm guessing it's just XPath on the back end, since the syntax appears to be the same.
Pentaho Kettle has a nice GUI for input/output with most common data stores, and its own take on handling data processing, which is okay and has only a small learning curve. Kettle's debugger is OK, in that the data is easy to see, but the errors and exceptions are not threaded with the output, and there is no way to really debug an issue; meaning you can't reload the output/error/exception, though you are able to view the system feedback. All that said, Kettle data transformation is _______ well, let's just say it left me feeling like I must be missing something, because I was completely puzzled by "if it's not possible, just write the transformation in JavaScript"; umm, what?
So, any suggestions? Do realize that I haven't really spec'd out any transformations, but figure if you really use a product for data munging, I'd like to know about it; even Excel, I guess.
In general, though, I'm currently looking for a product that's able to handle 1,000-100,000 rows with 10-100 columns. It'd be super cool if it could profile data sets, which is a feature Kettle sort of has, but not done super well. I'd also like built-in unit testing, meaning I'm able to build out control sets of data and run changes against the control set. Then I'd like to be able to selectively filter out rows and columns as I build out the transformation without altering the build; for example, I run a data set through a transformation, filter the results, and on the next run those sets are automatically blocked at the first "logical" occurrence, which in turn would mean less data to look at and a reduced runtime per each enhanced iteration. What would be crazy nice is if, as I'm filtering out rows/columns, the app tracked those (and the output was filtered out) and unit tested/highlighted any changes. If I made a change that would affect the application logs and their ability to track the unit tests by my "breaking a branch", it'd give me a warning and let me dump the stored data branch, and/or track the primary keys for differences in the next generation of output, or even attempt to match them using fuzzy logic. And yes, I know this is a pipe dream, but hey, figured I'd ask, just in case there's something out there I've just never seen.
Feel free to comment, I'd be happy to answer any questions, or offer additional info.
Google Refine?
Talend will need more than 5 minutes of your time, perhaps closer to about an hour, to begin to wire up basic transformations and to fulfill your requirement of keeping transformations under version control as well. You described a pipeline process that can be done easily in Talend once you know how, where you have multiple inputs and outputs in a project as the same raw data goes through various transformations and filtering until it arrives as the final output you desired. Then you can schedule your jobs to repeat the process over similar data. Go back and spend more time with Talend, and you'll succeed in what you need, I'm sure.
I also happen to be one of the committers of Google Refine, and I also use Talend in my daily work. I actually sometimes model my transformations for Talend first in Google Refine. (Sometimes I even use Refine to perform cleanup on borked ETL transforms themselves! LOL) I can tell you that my experience with Talend played a small part in a few of the features of Google Refine. For instance, both Talend and Google Refine have the concept of an expression editor for your transformations (Talend goes down to the Java language for this if need be).
Google Refine will never be an ETL tool, in the sense that we have not designed it to compete in that space where ETL is typically used for large data warehouse back-end processing & transformations. However, we designed Google Refine to complement existing ETL tools like Talend by allowing easy live previewing to make informed decisions about your transformations and cleanup, and if your data isn't incredibly huge, then you might opt to perform what you need within Refine itself.
I'm not sure exactly what kind of data or exactly what kind of transformations you're trying to do, but if it's primarily mathematical transformation, perhaps you can try FreeMat, Octave, or SciLab. If it's more data-warehouse-style munging, try open source ETL tools like Clover, Talend, JasperETL Community Edition, or Jitterbit.

How close is Python to being able to wrap it in a workbook type skin?

With my luck this question will be closed too quickly. I see a tremendous possibility for a Python application that basically is like a workbook. Imagine, if you will, that instead of writing code you select from a menu of choices. For example, the File menu would have an Open command that lets the user navigate to a file, a directory of files, or a webpage, even a list of web pages, and specify those as the things that will be the base for the next actions.
Then you have a find menu. The menu would allow easy access to the various parsing tools, regular expression and string tools so you can specify the thing you want to find within the files.
Another menu item could allow you to create queries to interact with database objects.
I could go on and on. As the language becomes higher level, these types of features become easier to implement. There is a tremendous advantage to developing something like this. How much time is spent reinventing the wheel for mundane tasks? Programmers have functions that they have built to do many mundane tasks, but what about democratizing the power offered by a tool like Python?
I have people in my office all of the time asking how to solve problems that seem intractable to them, but when I show them how, with a few lines of code, their problem is solvable (except for the edge cases), they become amazed. I deflect their gratitude with the observation that it is not really that hard, except for being able to construct the right Google search to identify the right package or library to solve the problem. There is nothing amazing about my ability to use lxml and sets to pull all bolded sections from a collection of, say, 12,000 documents and compare, across time and across unique identifiers in the collection, how those bolded sections have evolved/changed or converged. The amazing piece is that someone wrote the libraries to do these things.
What is the advantage to the community of something like this? Imagine, if you would, an interface that looks like a workbook but interacts with an app store. So if you want to pull something from an HTML file, you go to the app store and buy a plug-in that handles the work. If the workbook is built robustly enough, it could be licensed to a machine, and the 'apps' would be tied to a particular workbook.
Just imagine the creativity that could be unleashed by users if they could get over the feeling that access to this power is difficult. You guys may not see this but I see Python being so close to being able to port to something like a workbook framework. Weren't the early spreadsheet programs nothing more than a frame around some Fortran libraries that had been ported to C?
Comments? Or is there such an application that I have not found?
There are Python applications that are based on generating code -- probably the most amazing one is Resolver One, which focuses on spreadsheets (and hinges on IronPython). With that exception, however, interacting based on the UI paradigm you have in mind (pick one of this, one of that, etc.) tends to be pretty limited in the gamut of choices it offers to let the user generate the exact application they need -- there's just so much more you can say by writing even a little script than what you can say by point-and-grunt.
That being said, Python would surely be a great choice both to implement such an app and as the language to generate... if and when you have a UI sketch that looks like it can actually allow non-programmers to specify a large-enough spectrum of apps in a broad-enough domain!-). Spreadsheets have proven themselves in this sense, but I don't know of other niches or approaches that have actually done so -- do you?
Your idea kinda reminded me of something I stumbled across months ago: http://www.ailab.si/orange/
Is your concept very similar to Microsoft Access? Generally programmers tend not to write such programs because they produce such horrible code that the authors themselves would never want to use their program.
