The project that I am working on is a bit confidential, but I will try to explain my issues and be as clear as possible because I need your opinion.
Project:
I was asked to set up a local ELK environment and to use Python scripts to communicate with the stack (ELK): store data, retrieve it, analyse it, and visualise it in Kibana; finally, there is decision making based on that data (AI). So as you can see, it is a data engineering project with some AI for the decision-making process. The issues that I am facing are:
I don't know how to use Python to communicate with the stack; I haven't found resources about it
Since the data is confidential, how can I ensure a high level of security?
How many instances should I use?
I am lost because I am new to ELK and my team is not dev-oriented
I am new to ELK, so please any advice would be really helpful!
I don't know how to use Python to communicate with the stack; I haven't found resources about it
To learn how to interact with your stack, use the official Python client library.
You can install it with pip3 install elasticsearch, and the following links contain a wealth of tutorials on almost anything you would need to do.
https://kb.objectrocket.com/category/elasticsearch?filter=python
Suggest you start with these two:
https://kb.objectrocket.com/elasticsearch/how-to-parse-lines-in-a-text-file-and-index-as-elasticsearch-documents-using-python-641
https://kb.objectrocket.com/elasticsearch/how-to-query-elasticsearch-documents-in-python-268
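To give a feel for the client, here is a minimal sketch; the host URL, credentials, index name and fields are placeholders for illustration, not part of the original answer, and the keyword arguments follow the 8.x client:

# Minimal sketch: connect, index one document, then query it back.
from elasticsearch import Elasticsearch

es = Elasticsearch(
    "https://localhost:9200",
    basic_auth=("elastic", "changeme"),   # placeholder credentials
    verify_certs=False,                   # only acceptable for a local sandbox
)

# Store a document
es.index(index="flights", document={"origin": "AMS", "delay_minutes": 12})

# Retrieve documents matching a simple query
resp = es.search(index="flights", query={"match": {"origin": "AMS"}})
for hit in resp["hits"]["hits"]:
    print(hit["_source"])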
Since the data is confidential, how can I ensure a high level of security?
You can mask sensitive fields or restrict index access:
https://www.elastic.co/guide/en/elasticsearch/reference/current/authorization.html
https://nl.devoteam.com/expert-view/field-level-security-and-data-masking-in-elasticsearch/
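As a rough illustration of restricting index access, the sketch below creates a read-only role with field-level security through the client's security API; the role name, index pattern and granted fields are made up, and it assumes a cluster with security enabled and a user allowed to manage roles:

# Hypothetical example: a role that can only read selected fields of one index pattern.
from elasticsearch import Elasticsearch

es = Elasticsearch("https://localhost:9200", basic_auth=("elastic", "changeme"), verify_certs=False)

es.security.put_role(
    name="flights_reader",                                        # made-up role name
    indices=[{
        "names": ["flights-*"],                                   # made-up index pattern
        "privileges": ["read"],
        "field_security": {"grant": ["origin", "delay_minutes"]}, # hide all other fields
    }],
)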
How many instances should I use?
I am lost because I am new to ELK and my team is not dev-oriented
I suggest you start with one Elasticsearch node; if you're on AWS, use a t3a.large or equivalent and run Elasticsearch, Kibana, and Logstash all on the same machine.
For setting it up: https://www.elastic.co/guide/en/elastic-stack-get-started/current/get-started-stack-docker.html#run-docker-secure
If you want to use Python as your integration tool with Elasticsearch, you can use the Elasticsearch Python client.
The other option is to use Python to produce your results and save them to a log file, or insert them into a database, and then let Logstash pick up your data.
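For the log-file route, a minimal sketch could look like the following; the log path and fields are placeholders, and it assumes Logstash is configured with a file input and a JSON codec pointing at the same path:

# Append results as JSON lines so a Logstash file input can tail them.
import json
from datetime import datetime, timezone

def append_result(result: dict, path: str = "/var/log/myapp/results.log") -> None:
    record = {**result, "@timestamp": datetime.now(timezone.utc).isoformat()}
    with open(path, "a", encoding="utf-8") as fh:
        fh.write(json.dumps(record) + "\n")

append_result({"origin": "AMS", "delay_minutes": 12})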
For security, the Elastic Stack has good built-in features, from API authorization and user authentication to cluster security; see Secure the Elastic Stack in the documentation.
I just use one instance, but feel free to separate Kibana, Elasticsearch, and Logstash (if you use it) onto different machines, or use Docker to separate them.
Based on my experience, if you are going to load a lot of data in a short time, it is wise to separate them so the processes don't interfere with each other.
Hi, is there anyone who can help me integrate BIRT reports with a Django project? Or any suggestion for connecting third-party reporting tools with Django, like Crystal Reports or Crystal Clear Report.
Some of the 3rd-party Crystal Reports viewers listed here provide a full command-line API, so your Python code can preview/export/print reports via subprocess.call().
The resulting process can span anything between an interactive Crystal Reports viewer session (the user can log in, set/change parameters, print, export) and automated (no user interaction) report printing/exporting.
While this would simplify your code, it would restrict deployment to Windows.
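As a rough illustration only: the executable path and flags below are invented placeholders, since the actual command line depends entirely on which viewer you pick; substitute its documented options.

# Hypothetical call to a third-party viewer's command-line API (all paths/flags are placeholders).
import subprocess

subprocess.call([
    r"C:\Program Files\SomeViewer\viewer.exe",   # placeholder executable
    "/report", r"C:\reports\sales.rpt",          # placeholder option: report to open
    "/export", r"C:\out\sales.pdf",              # placeholder option: export target
])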
For prototyping, or if you don't mind the performance, you can call BIRT from the command line.
For example, download the POJO runtime and use the script genReport.bat (IIRC) to generate a report to a file (e.g. in PDF format). You can specify the output options and the report parameters on the command line.
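Driven from Python, that could look roughly like the sketch below; the script name and flags vary by BIRT version and are assumptions here, so check the runtime's genReport help output for the exact options:

# Sketch: run the BIRT command-line report generator from Python (flags are assumed).
import subprocess

subprocess.call([
    "./genReport.sh",                # genReport.bat on Windows
    "-f", "pdf",                     # output format (assumed flag)
    "-o", "/tmp/sales.pdf",          # output file (assumed flag)
    "-p", "startDate=2024-01-01",    # report parameter (assumed flag)
    "reports/sales.rptdesign",
])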
However, the BIRT startup is heavy overhead (several seconds).
For achieving reasonable performance, it is much better to perform this only once.
To achieve this goal, there are at least two possible ways:
You can use the BIRT viewer servlet (which is included as a WAR file with the POJO runtime). So you start the servlet with a web server, then you use HTTP requests to generate reports.
This looks technically old-fashioned (e.g. no JSON requests), but it should work. However, I have never used this approach.
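If you try it, the request from Python could be as simple as the sketch below; the URL pattern follows the classic BIRT viewer ("run" servlet with __report and __format parameters), so verify it against your deployment before relying on it:

# Sketch: ask the viewer servlet to render a report and save the PDF (URL and paths are assumptions).
import requests

resp = requests.get(
    "http://localhost:8080/birt/run",
    params={"__report": "sales.rptdesign", "__format": "pdf", "startDate": "2024-01-01"},
    timeout=120,
)
with open("sales.pdf", "wb") as fh:
    fh.write(resp.content)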
The other option is to write your own BIRT server.
In our product, we followed this approach.
You can take the viewer servlet as a template for seeing how this could work.
The basic idea is:
You start one (or possibly more than one) Java process.
The Java process initializes the BIRT runtime (this is what takes some seconds).
After that, the Java process listens for requests somehow (we used a plain socket listener, but of course you could use HTTP or some REST server framework as well).
A request would contain the following information:
which module to run
which output format
report parameters (specific to the module)
possibly other data/metadata, e.g. for authentication
This would create a RunAndRenderTask or separate RunTask and RenderTasks.
Depending on your reports, you might consider returning the resulting output (e.g. PDF) directly as a response, or using an asynchronous approach.
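Since the wire protocol of such a server is entirely up to you, here is just one possible shape of a client request, made up to mirror the fields listed above (host, port, and field names are all placeholders):

# Hypothetical client for a home-grown BIRT server speaking JSON over a plain socket.
import json
import socket

request = {
    "module": "sales.rptdesign",                # which module to run
    "format": "pdf",                            # which output format
    "parameters": {"startDate": "2024-01-01"},  # report parameters
    "auth": {"token": "..."},                   # possibly other data/metadata
}

with socket.create_connection(("birt-server.local", 9999)) as sock:
    sock.sendall(json.dumps(request).encode("utf-8") + b"\n")
    # A small report could be streamed straight back; big ones are better handled
    # asynchronously with a job id, as noted above.
    pdf_bytes = sock.makefile("rb").read()

with open("sales.pdf", "wb") as fh:
    fh.write(pdf_bytes)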
Note that BIRT will happily create several reports at the same time - multi-threading is no problem (except for the initialization), given enough RAM.
Be warned, however, that you will need at least a few days to build a POC for this "create your own server" approach, and probably some weeks for production quality.
So if you just want to build something fast to see whether BIRT is the right tool for you, start with the command-line approach, then the servlet approach, and only then, if you find that the servlet approach is not quite good enough, go the "create your own server" way.
It's a pity that currently there doesn't seem to exist an open-source, production-quality, modern BIRT REST service.
That would make a really good contribution to the BIRT open-source project... (https://github.com/eclipse/birt)
I am working on a project where I have been using Python to make API calls to our organization's various technologies to get data, which I then push to Power BI to track metrics over time relating to IT Security.
My boss wants to see info added from Exchange Online Protection such as malware detected in emails, spam blocks etc., essentially replicating some of the email and collaboration reports you'd see in M365 defender > reports > email and collaboration (security.microsoft.com/emailandcollabreport).
I have tried the Defender API and MS Graph API, read through a ton of documentation, and can't seem to find anywhere to pull this info from. Has anyone done something similar, or know where this data can be pulled from?
Thanks in advance.
You can try the Microsoft Graph Security API, which lets you retrieve alerts, information protection data, and the secure score. Also refer to the alerts section of the documentation, which lists the providers currently supported by the Microsoft Graph Security API.
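For reference, pulling alerts from Python with an app-only (client credentials) token looks roughly like this; the tenant id, client id and secret are placeholders, and as the question notes, this API may still not surface the specific EOP email reports:

# Sketch: acquire a client-credentials token, then list alerts from the Graph Security API.
import requests

tenant = "<TENANT_ID>"
token = requests.post(
    f"https://login.microsoftonline.com/{tenant}/oauth2/v2.0/token",
    data={
        "client_id": "<CLIENT_ID>",
        "client_secret": "<CLIENT_SECRET>",
        "grant_type": "client_credentials",
        "scope": "https://graph.microsoft.com/.default",
    },
).json()["access_token"]

alerts = requests.get(
    "https://graph.microsoft.com/v1.0/security/alerts",
    headers={"Authorization": f"Bearer {token}"},
).json()
for alert in alerts.get("value", []):
    print(alert.get("title"), alert.get("severity"))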
In case anyone else runs into this, this is the solution I ended up using (hacky as it may be):
The only way to extract the pertinent info seems to be through PowerShell; you need the modules ExchangeOnlineManagement and PSWSMan, so those will need to be installed.
You need to add an app to your Azure instance with at least the Global Reader role (or something custom), and generate and upload a self-signed certificate to the app.
I then ran the following lines as a ps1 script:
# Connect to Exchange Online with the app registration and certificate
Connect-ExchangeOnline -CertificateFilePath "<PATH>" -AppID "<APPID>" -Organization "<ORG>.onmicrosoft.com" -CertificatePassword (ConvertTo-SecureString -String '<PASSWORD>' -AsPlainText -Force)
# Pull the mail flow status report for the last 30 days
$dte = (Get-Date).AddDays(-30)
Get-MailflowStatusReport -StartDate $dte -EndDate (Get-Date)
# Clean up the session
Disconnect-ExchangeOnline
I used Python to call the PowerShell script, then extracted the info I needed from the output and pushed it to Power BI.
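The Python side can be as simple as the sketch below; the script path and the pwsh executable name are assumptions, and the output parsing is left as a stub since it depends on the report format:

# Sketch: run the PowerShell script and capture its output for further parsing.
import subprocess

result = subprocess.run(
    ["pwsh", "-File", "./get_mailflow_report.ps1"],   # placeholder script path
    capture_output=True, text=True, check=True,
)
for line in result.stdout.splitlines():
    print(line)   # extract the fields you need here before pushing them to Power BI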
I'm sure there is a more secure and efficient way to do this but I was able to accomplish the task this way.
I'm trying to get a list of registered virtual machines - by their name - in my vCenter. The problem is that I have many VMs (~5K), and I am doing it a lot of times (O(1000)/hour).
The SDKs I'm using cause a lot of traffic (1-2 MB/request):
pysphere: asks for all VMs and filters on the client side.
pyVmomi: needs recursion to list all VMs (I saw SI.content.searchIndex.FindByDnsName in reboot_vm.py, but my machines' DNS configuration is not correct).
Looking into the SOAP documentation didn't help (I got to RetrievePropertiesEx.objectSet, but it doesn't seem to filter anything), and the new REST API (v6.5) didn't help either (since I need to get each VM's "datastore path", and all I can get is the name).
Have you tried using a property collector, as in pyVmomi.vmodl.query.PropertyCollector.FilterSpec?
The pyVmomi Community Samples contain examples of using this API, such as
https://github.com/vmware/pyvmomi-community-samples/blob/master/samples/filter_vms.py.
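In the spirit of that sample, a trimmed-down sketch could look like the following; the host and credentials are placeholders, and it fetches only the "name" property of each VM through a container view plus a property collector, which keeps each response small:

# Sketch: list VM names via a container view and a property collector (placeholder values).
import ssl
from pyVim.connect import SmartConnect, Disconnect
from pyVmomi import vim, vmodl

ctx = ssl._create_unverified_context()   # only for labs with self-signed certificates
si = SmartConnect(host="vcenter.example.com", user="user", pwd="password", sslContext=ctx)
content = si.RetrieveContent()

view = content.viewManager.CreateContainerView(content.rootFolder, [vim.VirtualMachine], True)

traversal = vmodl.query.PropertyCollector.TraversalSpec(
    name="traverseEntities", path="view", skip=False, type=vim.view.ContainerView)
obj_spec = vmodl.query.PropertyCollector.ObjectSpec(obj=view, skip=True, selectSet=[traversal])
prop_spec = vmodl.query.PropertyCollector.PropertySpec(
    type=vim.VirtualMachine, pathSet=["name"], all=False)   # ask only for the name
filter_spec = vmodl.query.PropertyCollector.FilterSpec(objectSet=[obj_spec], propSet=[prop_spec])

result = content.propertyCollector.RetrieveContents([filter_spec])
names = [prop.val for obj in result for prop in obj.propSet]
print(names)

view.Destroy()
Disconnect(si)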
vAPI is based on REST; moreover, it makes network requests every time, so it will be slow. Another way is to use the VCDB, where all VMs are stored (there is such a database, but I cannot help you more here); see https://pubs.vmware.com/vsphere-4-esx-vcenter/index.jsp?topic=/com.vmware.vsphere.installclassic.doc_41/install/prep_db/c_preparing_the_vmware_infrastructure_databases.html or https://kb.vmware.com/selfservice/microsites/search.do?language=en_US&cmd=displayKC&externalId=1025914
Wondering about durable architectures for distributed Python applications.
This question I asked before should provide a little guidance about the sort of application it is. We would like to have the ability to have several code servers and several database servers, and ideally some method of deployment that is manageable and not too much of a pain.
The question I mentioned provides an answer that I like, but I wonder how it could be made more durable, or if doing so requires using other technologies. In particular:
I would have my frontend endpoints be the WSGI (because you already have that written) and write the backend to be distributed via messages. Then you would have a pool of backend nodes that would pull messages off of the Celery queue and complete the required work. It would look sort of like:
Apache -> WSGI Containers -> Celery Message Queue -> Celery Workers.
The apache nodes would be behind a load balancer of some kind. This would be a fairly simple architecture to scale and is, if done correctly, fairly reliable. Code for failure in a system like this and you will be fine.
What is the best way to make durable applications? Any suggestions on how to either "code for failure" or design it differently so that we don't necessarily have to? If you think Python might not be suited for this, that is also a valid solution.
Well, to continue from the previous answer I gave:
In my projects I code for failure, because I use AWS for a lot of my hosting needs.
I have implemented database backends that make sure the database region is accessible and, if it is not, choose another region from a specified list. This happens transparently to the rest of the system on that node. So, if the east-1a region goes down, I have a few other regions that I also host in that it will fail over to, such as the west coast. I keep track of in-flight database transactions and send them over to the west coast, dumping them to a file so I can import them into the old database region once it becomes available.
My front-end servers sit behind an Elastic Load Balancer that is distributed across multiple regions, which allows for durable recovery if a region fails. But it cannot be relied upon, so I am looking into solutions such as running HAProxy and switching my DNS in case my ELB goes down. This is a work in progress and I cannot give specifics on my own solutions.
To make your data processing durable, look into Celery and store the data in a distributed MongoDB setup to keep your results safe. Using a durable data store for your results allows you to get them back in the event of a node crash. It comes at the cost of some performance, but it shouldn't be too bad if you only rely on soft real-time constraints.
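A bare-bones version of that setup might look like the sketch below; the broker URL, MongoDB URL and task body are placeholders, and it assumes the MongoDB result backend (pymongo) is installed:

# Sketch: Celery task whose results are persisted in MongoDB; the WSGI layer only enqueues.
from celery import Celery

app = Celery(
    "backend",
    broker="amqp://guest@localhost//",                    # placeholder broker
    backend="mongodb://localhost:27017/celery_results",   # durable result store
)

@app.task
def process_job(payload: dict) -> dict:
    # the heavy lifting runs on worker nodes, not in the WSGI container
    return {"status": "done", "items": len(payload)}

# Inside a WSGI view you would only enqueue and return immediately:
# process_job.delay({"user_id": 42})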
http://www.mnxsolutions.com/amazon/designing-for-failure-with-amazon-web-services.html
The above article talks mostly about AWS, but the ideas apply to any system where you need high availability and durability. Just remember that downtime is OK as long as you minimize it for a subset of users.
I am an experienced Python developer starting to work on web service
backend system. The system feeds data (constantly) from the web to a
MySQL database. This data is later displayed by a frontend side (there
is no connection between the frontend and the backend). The backend
system constantly downloads flight information from the web (some of
the data is fetched via APIs, and some by downloading and parsing
text / xls files). I already have a script that downloads the data,
parses it, and inserts it to the MySQL db - all in a big loop. The
frontend side is just a bunch of php pages that properly display the
data by querying the MySQL server.
It is crucial that this web service be robust, strong and reliable.
Therefore, I have been looking into the proper ways to design it, and came across the following parts to comprise my system:
1) django as a framework (for HTTP connections and for using Piston)
2) Piston as an API provider (this is great because then my front-end can use the API instead of actually running queries)
3) SQLAlchemy as the DB layer (I don't like the little control you get when using django ORM, I want to be able to run a more complex DB framework)
4) Apache with mod_wsgi to run everything
5) And finally, Celery (or django-cron) to actually run my infinite loop that pulls the data off the web (hopefully in some sort of organized task format). This is the part I am least sure of, and any pointers are appreciated.
This all sounds great. I used django before to write websites (aka request handlers that return data). However, other than using Celery or django-cron, I can't really see how it fits the role of a constant data-feeding backend.
I just wanted to run this by you guys to hear your ideas / comments. Any input you have / pointers to documentation and/or other libraries would be greatly greatly appreciated!
If you are about to use SQLAlchemy, I would refrain from using Django: Django is fine if you are using the whole stack, but as you are about to rip the models layer out, I do not see much value in using it, and I would take a look at another option (perhaps Pylons or plain old CherryPy would do).
Even more so if the front ends will not run queries, but only talk to the API provider.
As for robustness, I am happier starting separate FastCGI processes under supervise and using a more lightweight web server (lighttpd / nginx), but that's a matter of taste.
For the "infinite loop" part, it depends on what behavior you want: if there is a problem with the source, would you just like to skip the step or repeat it multiple times when source is back up?
Periodic Tasks might be good for former, while cron that would just spawn scraping tasks is better for latter.
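Either behavior maps naturally onto Celery; the sketch below shows a periodic scrape with automatic retries, where the schedule, broker URL and task names are examples and the two helper functions are stand-ins for your existing download/parse and MySQL-insert code:

# Sketch: periodic scraping task with retries (all names and settings are examples).
from celery import Celery

app = Celery("feeder", broker="amqp://guest@localhost//")   # placeholder broker
app.conf.beat_schedule = {
    # re-run the scrape every 5 minutes via celery beat
    "pull-flight-data": {"task": "feeder.pull_flight_data", "schedule": 300.0},
}

def download_and_parse():
    # stand-in for your existing download/parse loop body
    return []

def insert_into_mysql(rows):
    # stand-in for your existing MySQL insert code
    pass

@app.task(name="feeder.pull_flight_data",
          autoretry_for=(IOError,), retry_backoff=True, max_retries=5)
def pull_flight_data():
    insert_into_mysql(download_and_parse())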