Monitoring MongoDB in the cluster - Python

I'm trying to monitor and analyze the results of a sharded MongoDB instance in the cluster. There's a nice monitoring tool provided by MongoDB - MMS. However, I need to analyze and plot CPU/disk IO and shard load graphs on my own. The question: is it possible to get data from MMS (i.e. timestamps, opcounters, CPU utilization) as CSV or in some other form that could be loaded into R/Python?

You can build your own tool, although I highly doubt it will be better than MMS. As Asya suggested, you can use db.serverStatus() to read some of the data. You can check here for more commands and tools for collecting data.
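For the db.serverStatus() route, a minimal sketch with pymongo that appends a few counters to a CSV for later loading into R/pandas could look like this (the host, file name, and fields such as opcounters and localTime can differ between MongoDB versions, so treat them as assumptions to verify against your server):

# Sketch: pull a few counters from db.serverStatus() via pymongo and append
# them to a CSV. Field names can vary by MongoDB version; verify "opcounters"
# and "localTime" against your server's output.
import csv
import os
from pymongo import MongoClient

client = MongoClient("localhost", 27017)   # point this at your mongos/mongod
status = client.admin.command("serverStatus")

row = {"time": status["localTime"].isoformat()}
row.update(status.get("opcounters", {}))

write_header = not os.path.exists("serverstatus.csv")
with open("serverstatus.csv", "a", newline="") as fh:
    writer = csv.DictWriter(fh, fieldnames=sorted(row))
    if write_header:
        writer.writeheader()
    writer.writerow(row)

Run it from cron at whatever sampling interval you want and you end up with a CSV time series you can load directly into R or pandas.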
You can also do a quick-and-dirty test with some other parameters from the mongostat command. The fields it outputs are slightly different from what you put in brackets, but they are easy enough to work with. All you need to do is redirect the output of this command to a text file.
On Windows you would do this with mongostat > stats.txt, and if I remember correctly the same redirection works on Linux. Then just parse the file with R/Python and plot whatever you want.
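Once you have the redirected output, a rough parsing and plotting sketch might look like this (mongostat's column layout differs between versions, so the header detection and the "insert" column used below are assumptions to adapt to your output):

# Sketch: parse a redirected `mongostat > stats.txt` capture and plot one column.
# mongostat's columns differ between versions, so the header handling and the
# "insert" column name here are assumptions; adjust them to your output.
import matplotlib.pyplot as plt

rows = []
with open("stats.txt") as fh:
    header = None
    for line in fh:
        parts = line.split()
        if not parts:
            continue
        if parts[0] == "insert":          # header line repeats periodically
            header = parts
            continue
        if header:
            rows.append(dict(zip(header, parts)))

# Strip mongostat's '*' markers and plot inserts per sample.
inserts = [int(r["insert"].lstrip("*")) for r in rows if "insert" in r]
plt.plot(inserts)
plt.xlabel("sample")
plt.ylabel("inserts/s")
plt.show()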

Related

How can I transfer Data twice a day from an FTP Server to Python and then to a Website

I have climate data that is sent to an FTP server and I want to make it visible on our website. New data comes in twice a day. Beforehand the data needs to be prepared, and I also want to visualize it, preferably with Python. I want to learn the best way to do this but have no idea where to start. I am not looking for a definitive solution since I know this is a big project. I just need some suggestions on how to start or which tools might help me.
Like you said, this is a project in itself, but I can give you some directions.
I would use an FTP library to pull the data into Python for processing.
Check https://docs.python.org/3/library/ftplib.html
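A minimal pull with ftplib might look like this (host, credentials, and file name are placeholders for this sketch; you would schedule it twice a day with cron or similar):

from ftplib import FTP

# All connection details below are placeholders for this sketch.
HOST = "ftp.example.com"
REMOTE_FILE = "climate_latest.csv"

ftp = FTP(HOST)
ftp.login(user="anonymous", passwd="guest@example.com")
with open(REMOTE_FILE, "wb") as fh:
    ftp.retrbinary("RETR " + REMOTE_FILE, fh.write)
ftp.quit()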
For visualization, that really depends on the data you have. But generally, after processing the data, save it to a file (or a database); when the backend receives a request for that data, it reads the file (or database) and you visualize it with JavaScript. Something like d3.js would work.
You could also use a dedicated visualization tool, though I don't know one well enough to recommend (Power BI, perhaps?).
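To sketch the backend half of that in Python (Flask here is just one choice, and the file path is a placeholder for whatever your processing step writes):

# Minimal backend sketch, assuming Flask and that the processing step has
# already written its results to processed/latest.json. All names are placeholders.
import json
from flask import Flask, jsonify

app = Flask(__name__)

@app.route("/api/climate")
def climate_data():
    with open("processed/latest.json") as fh:
        return jsonify(json.load(fh))

if __name__ == "__main__":
    app.run()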

Using TotalOrderPartitioner in Hadoop streaming

I'm using Python with Hadoop streaming for a project, and I need functionality similar to what TotalOrderPartitioner and InputSampler provide in Hadoop; that is, I need to sample the data first and create a partition file, then use that partition file in the mapper to decide which K-V pair goes to which reducer. I need to do this in Hadoop 1.0.4.
I could only find Hadoop streaming examples with KeyFieldBasedPartitioner and custom partitioners, which use the -partitioner option on the command line to tell Hadoop which partitioner to use. The examples I found using TotalOrderPartitioner and InputSampler are all in Java, and they need to use InputSampler's writePartitionFile() and the DistributedCache class to do the job. So I am wondering: is it possible to use TotalOrderPartitioner with Hadoop streaming? If it is possible, how should I organize my code to use it? If it is not, is it practical to implement the total-order partitioner in Python first and then use it?
Did not try, but taking the example with KeyFieldBasedPartitioner and simply replacing:
-partitioner org.apache.hadoop.mapred.lib.KeyFieldBasedPartitioner
with
-partitioner org.apache.hadoop.mapreduce.lib.partition.TotalOrderPartitioner
should work.
One possible way to use TotalOrderPartitioner in Hadoop streaming is to recode a small part of it to get the pathname of its partition file from an environment variable, then compile it, define that environment variable on your systems, and pass its name to the streaming job with the -cmdenv option (documented at https://hadoop.apache.org/docs/stable/hadoop-streaming/HadoopStreaming.html#Streaming_Command_Options).
Source code for TotalOrderPartitioner is available at TotalOrderPartitioner.java. In it, getPartitionFile() is defined on two lines starting at line 143, and its second line shows that if it is not given an argument it uses DEFAULT_PATH as the partition file name. DEFAULT_PATH is defined on line 54 as "_partition.lst", and line 83 has a comment saying it is assumed to be in the DistributedCache. Based on that, without modifying getPartitionFile(), it should be possible to use _partition.lst as the partition filename as long as it is in the DistributedCache.
That leaves the issue of running an InputSampler to write content to the partition file. I think that's best done by running an already coded Java MapReduce job that uses TotalOrderPartitioner, at least to get an example of the output of InputSampler to determine its format. If the example job can be altered to process the type of data you want then you could use it to create a partition file usable for your purposes. A couple of coded MapReduce jobs using TotalOrderPartitioner are TotalOrderSorting.java and TotalSortMapReduce.java.
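If you decide to re-implement the sampling step in Python instead, the core idea behind InputSampler is easy to sketch: sample keys from the input, sort them, and keep N-1 split points for N reducers. Note that this only produces a plain-text list of boundaries; as far as I recall, the _partition.lst that InputSampler writes is a SequenceFile of keys, so you would still need the Java side (or the example jobs above) to produce that exact format.

# Sketch of the sampling idea behind InputSampler: take a random sample of keys,
# sort it, and emit num_reducers - 1 split points. The input/output formats here
# (plain text, tab-separated key first) are assumptions, not Hadoop's own.
import random
import sys

def sample_split_points(lines, num_reducers, sample_size=10000):
    keys = [line.split("\t", 1)[0] for line in lines]
    sample = random.sample(keys, min(sample_size, len(keys)))
    sample.sort()
    step = len(sample) / num_reducers
    return [sample[int(step * i)] for i in range(1, num_reducers)]

if __name__ == "__main__":
    num_reducers = int(sys.argv[1])
    for boundary in sample_split_points(sys.stdin.readlines(), num_reducers):
        print(boundary)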
Alternatively, at twittomatic there is a simple custom IntervalPartitioner.java in which the partition file pathname is hardcoded as /partitions.lst, and the sorter directory supplies a script, sample.sh, that builds partition.lst using hadoop, a live Twitter feed and sample.py. It should be fairly easy to adapt this system to your needs, starting by replacing the Twitter feed with a sample of your data.

Hadoop: Process image files in Python code

I'm working on a side project where we want to process images in a Hadoop MapReduce program (for eventual deployment to Amazon's Elastic MapReduce). The input to the process will be a list of all the files, each with a little extra data attached (the lat/long position of the bottom-left corner; these are aerial photos).
The actual processing needs to take place in Python code so we can leverage the Python Imaging Library. All the Python streaming examples I can find use stdin and process text input. Can I send image data to Python through stdin? If so, how?
I wrote a Mapper class in Java that takes the list of files and saves the names, the extra data, and the binary contents to a sequence file. I was thinking maybe I need to write a custom Java mapper that takes in the sequence file and pipes it to Python. Is that the right approach? If so, what should the Java code that pipes the images out, and the Python code that reads them in, look like?
In case it's not obvious, I'm not terribly familiar with Java OR Python, so it's also possible I'm just biting off way more than I can chew with this as my introduction to both languages...
There are a few possible approaches that I can see:
Use both the extra data and the file contents as input to your Python program. The tricky part here will be the encoding. I frankly have no idea how streaming works with raw binary content, and I'm assuming the basic answer is "not well." The main issue is that the stdin/stdout communication between processes is very text-based, relying on delimiting input with tabs and newlines and things like that. You would need to worry about the encoding of the image data, and probably have some sort of pre-processing step, or a custom InputFormat, so that you could represent the image as text (see the sketch after these options).
Use only the extra data and the file location as input to your Python program. Then the program can independently read the actual image data from the file. The hiccup here is making sure that the file is available to the Python script. Remember this is a distributed environment, so the files would have to be in HDFS or somewhere similar, and I don't know if there are good libraries for reading files from HDFS in Python.
Do the Java-Python interaction yourself. Write a Java mapper that uses the Runtime class to start the Python process itself. This way you get full control over exactly how the two worlds communicate, but obviously it's more code and a bit more involved.
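To make the first option a bit more concrete, here is a rough sketch of a streaming mapper, assuming a pre-processing step of your own has already turned each record into a single line of the form lat<TAB>lon<TAB>base64-encoded image bytes (that record layout is my assumption, not something Hadoop provides):

#!/usr/bin/env python
# Streaming mapper sketch. Each stdin line is assumed to be
# "lat<TAB>lon<TAB>base64-encoded image bytes"; that layout is an assumption
# about an upstream encoding step, not a Hadoop convention.
import base64
import io
import sys

from PIL import Image

for line in sys.stdin:
    try:
        lat, lon, b64 = line.rstrip("\n").split("\t", 2)
        img = Image.open(io.BytesIO(base64.b64decode(b64)))
    except (ValueError, IOError):
        continue  # skip malformed records
    width, height = img.size
    # Emit whatever your processing produces; here just the image dimensions.
    print("%s,%s\t%dx%d" % (lat, lon, width, height))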

Trying to automate the FPGA build process in Xilinx using Python scripts

I want to automate the entire process of creating ngs, bit and mcs files in Xilinx, and have these files automatically be associated with certain folders in the SVN repository. What I need to know is whether there is a log file created in the back end of the Xilinx GUI which records all the commands I run, e.g. open project, load file, synthesize, etc.
The other part that I have not been able to find is a log file that records the entire process of synthesis, map, place and route, and generate programming file, and especially records any errors that the tool encountered during these processes.
If any of you can point me to such files, if they exist, that would be great. I haven't gotten much out of my search, but maybe I didn't look hard enough.
Thanks!
Well, it is definitely a nice project idea, but a good amount of work. There's always a reason why an IDE was built. A simple search yields the "Command Line Tools User Guide" for various versions of Xilinx ISE, like the one for 14.3: 380 pages covering
Overview and list of features
Input and output files
Command line syntax and options
Report and message information
ISE is a GUI for various command-line executables; most of them are located in the subfolder 14.5/ISE_DS/ISE/bin/lin/ (in this case: Linux executables for version 14.5) of your ISE installation root. You can review the current parameters for each action by right-clicking the item in the process tree and selecting "Process properties".
On the Python side, consider using the subprocess module:
The subprocess module allows you to spawn new processes, connect to their input/output/error pipes, and obtain their return codes.
Is this the entry point you were looking for?
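For example, a sketch of driving one step from Python and keeping its log (the xst call and its arguments here are placeholders; look up the exact invocation your flow needs in the Command Line Tools User Guide):

# Sketch: run one Xilinx command-line step from Python and keep its output.
# The tool name and arguments are placeholders; check the Command Line Tools
# User Guide for the exact invocation your flow needs.
import subprocess

ISE_BIN = "/opt/Xilinx/14.5/ISE_DS/ISE/bin/lin"   # adjust to your installation

result = subprocess.run([ISE_BIN + "/xst", "-ifn", "design.xst"],
                        capture_output=True, text=True)

with open("xst_run.log", "w") as log:
    log.write(result.stdout)
    log.write(result.stderr)

if result.returncode != 0:
    raise RuntimeError("xst failed, see xst_run.log")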
As phineas said, what you are trying to do is quite an undertaking.
I've been there, done that, and there are countless challenges along the way. For example, if you want to move generated files to specific folders, how do you classify those files in order to figure out which files are which? I've created a project called X-MimeTypes that attempts to classify the files, but you then need a tool to parse the EDA MIME type database and use that to determine which files are which.
However, there is hope, so to answer the two main questions you've pointed out:
To be able to automatically move generated files to predetermined paths: from what you are saying, it seems like you want to do this to make the versioning process easier? There is already a tool that does this for you, based on "design structures" that you create and that can be shared within a team. The tool is called Scineric Workspace, so check it out. It also has built-in Git and SVN support, which ignores things according to the design structure and in most cases filters out everything generated by vendor tools without you having to worry about it.
You are looking for a log file that shows all commands that were run: as phineas said, you can check out the Command Line Tools User Guides for ISE, but be aware that the commands have changed again in Vivado. The log file of each process also usually states the exact command, with its parameters, that was called; this should be close to the top of the report. A single log file that contains everything does not exist. Again, Scineric Workspace supports invoking flows from the major vendors (ISE, Vivado, Quartus) and produces one log file for all processes together, while still allowing each process to also create its own log file. Errors, warnings, etc. are also marked properly in this big report. Scineric has a Tcl shell mode as well, so your Python tool can run it in the background and parse the complete log file it creates.
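If you end up parsing such a report from Python, a simple pass that pulls out error and warning lines can go a long way (the ERROR:/WARNING: prefixes below are assumptions about the report format; adjust the pattern to whatever your flow emits):

# Sketch: scan a tool report for error/warning lines. The "ERROR:"/"WARNING:"
# prefixes are assumptions about the log format; adjust to what your flow emits.
import re
from collections import Counter

def summarize_log(path):
    counts = Counter()
    issues = []
    with open(path, errors="replace") as fh:
        for line in fh:
            m = re.match(r"\s*(ERROR|WARNING):", line)
            if m:
                counts[m.group(1)] += 1
                issues.append(line.rstrip())
    return counts, issues

if __name__ == "__main__":
    counts, issues = summarize_log("build.log")
    print(dict(counts))
    for line in issues:
        print(line)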
If you have more questions on the above, I will be happy to help.
Hope this helps,
Jaco

CLI git log statistics

I'm being faced with the task of generating statistics about the history of a Git project, and I need to produce some specific numbers and representations for various metrics - things like commits per author, commits-over-time/date histograms, that sort of thing.
The trouble is that I need all this data generated in a format that can be dealt with via a script or similar - the output has to be text, and if I can get the numbers into a Python (or similar) script, so much the better.
My question is this: are there any existing frameworks or projects that will provide such an interface? I've seen GitStats, and it does a lot of what I want, but then it dumps the results into an HTML structure instead of just providing textual or programmatic representations back to me. Are there (for example) Python bindings for a Git log parser, or even a Git statistics generator that returns a big text dump of data?
I realize it's a very specific need, and I'm willing to do some serious coding to get the precise format I want, but I'd like to think there's a starting point out there somewhere. Ideas?
How about using XML logs instead? Then you can parse the XML in Python relatively easily and build your stats.
See this answer for how to get an XML log from git.
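Even without XML, git log with a custom --pretty=format string is already easy to consume from Python; for example, commits per author (the %an|%ad format and the pipe separator are just convenient choices for this sketch):

# Sketch: commits per author straight from `git log`. The "%an|%ad" format and
# the "|" separator are just convenient choices for parsing, nothing git-specific.
import subprocess
from collections import Counter

out = subprocess.check_output(
    ["git", "log", "--pretty=format:%an|%ad", "--date=short"],
    text=True
)

per_author = Counter(line.split("|", 1)[0] for line in out.splitlines() if line)
for author, count in per_author.most_common():
    print("%6d  %s" % (count, author))

The same --pretty output also gives you the dates you would need for a commits-over-time histogram.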
