Configure custom storage paths for Elasticsearch - Python

I am a beginner with Elasticsearch and I was looking for a way to change the configuration that determines where it stores index data.
I am storing large files of different formats that arrive daily on my system, and I don't want to keep two copies of the data: the original file and the copy inserted into Elasticsearch.
What I have done so far is write a Python script that uses the Apache Tika parsing library and the Elasticsearch Python API to parse each file, extract the contents, metadata, etc., and store them in Elasticsearch. Now I would like to define a custom storage path, so that instead of Elasticsearch saving the data under nodes/indices/foo, it would use something like nodes/index/date-document-added/document-extension/documents-of-same-extension.
Is this in any way possible?
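For reference, the parse-and-index step looks roughly like this (a minimal sketch; the index name, host URL, and field names are placeholders, and it assumes the tika and elasticsearch-py 8.x packages):

```python
import os
from datetime import date

from elasticsearch import Elasticsearch
from tika import parser  # Python wrapper around Apache Tika

es = Elasticsearch("http://localhost:9200")  # placeholder host


def index_file(path, index="files"):
    # Tika returns a dict with "content" and "metadata" keys.
    parsed = parser.from_file(path)
    doc = {
        "filename": os.path.basename(path),
        "extension": os.path.splitext(path)[1].lstrip("."),
        "date_added": date.today().isoformat(),
        "content": parsed.get("content"),
        "metadata": parsed.get("metadata"),
    }
    es.index(index=index, document=doc)
```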

Related

Is it possible to use Django to parse locally stored files (csv for example) without uploading them?

I would like to develop a web app that parses locally stored data and lets users create a sorted Excel file.
It would be amazing if I could somehow avoid uploading the files. Some users are worried about their data, and the files can get really big, so I would have to implement async processing and so on...
Is something like that possible?

Upload XML Files in Django, Parse XML & Compare with existing models dataset

Can someone suggest a better approach for the use case below?
1. Upload an XML file.
2. Scan the XML file for specific tags.
3. Store the required data. In which format? (I thought of building a JSON dump?)
4. I have existing data in different models for different components.
5. How can I compare the data from step 3 with the Django models and produce some output? (A sort of data comparison.)
Note: the JSON dump I get at step 3 is a full dump of the required data, while the data at step 4 refers to small chunks that have to be combined and compared against the recently uploaded file's JSON dump.
I would define a model where you can store the uploaded file, plus a form for it
(https://docs.djangoproject.com/en/3.2/topics/http/file-uploads/#handling-uploaded-files-with-a-model).
Use either lxml's etree or generateDS to scan the XML files (https://www.davekuhlman.org/generateDS.html).
For storage you can use a JSON dump, or a PickleField in which you store the object built from the XML file if you use generateDS.
Store the data in the database and write a Django model for it. Try to make it as granular as possible, so that when you import a new XML file you can compare it and perhaps only store the differences as pickled objects. A sketch of the first two steps is below.
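A hypothetical sketch of steps 1 and 2, assuming lxml; the model, form, and tag names are illustrative, not from the question:

```python
from django import forms
from django.db import models
from lxml import etree


class XmlUpload(models.Model):
    # Step 1: store the uploaded XML file via a model (see the Django
    # docs link above).
    file = models.FileField(upload_to="xml_uploads/")
    uploaded_at = models.DateTimeField(auto_now_add=True)


class XmlUploadForm(forms.ModelForm):
    class Meta:
        model = XmlUpload
        fields = ["file"]


def scan_components(upload, tag="component"):
    # Step 2: scan the file for specific tags and collect them into a
    # JSON-ready dump (step 3). The "component" tag name is an assumption.
    tree = etree.parse(upload.file.path)
    return [
        {"tag": el.tag, "attrs": dict(el.attrib), "text": el.text}
        for el in tree.iter(tag)
    ]
```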
Hope that helps a bit.

Best practice in updating Django web app content when deployed on Heroku

I'm finally at the stage of deploying my Django web app on Heroku. The web app retrieves and performs financial analysis on a subset of public companies. Most of the content of the web app is stored in .csv and .xlsx files. There's a master .xlsx file that I need to open and manually update on a daily basis. Then I run scripts which, based on the info in that .xlsx file, retrieve financial data, news, etc., and store the data in .csv files. My Django views and HTML templates then refer to those files (and not to a SQL/Postgres database). This is my setup in a nutshell.
Now I'm wondering what is the best way to make this run smoothly in production.
Should I store these .xlsx and .csv files on AWS S3 and have Django access and update them from there? I assume that this way I can easily open and edit the master .xlsx file at any time. Is that a good idea, and do I run into any performance or security issues?
Or is it better to convert all this data into a Postgres database on Heroku? Is there a good guideline for how I can technically do that? In that scenario, wouldn't it be much more challenging for me to edit the data in the master .xlsx file?
Are there any better ways you would suggest to handle this?
I'd highly appreciate any advice on this matter.
You need to perform a trade-off between ease of use (accessing/updating the source XLSX file) and maintainability (storing the data safely and efficiently).
Option #1 is more convenient if you need to quickly open and change the file using your Excel/Numbers application. On the other hand, your application needs to access physical files to perform its logic and render the views.
Some time ago I created a repository, Heroku Files, to present some options for using external files.
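For option #1, a minimal sketch of reading the master file straight from S3 with pandas (assuming the s3fs and openpyxl packages are installed; the bucket and key are placeholders):

```python
import pandas as pd

# pandas hands s3:// URLs to s3fs; AWS credentials are picked up from
# the usual environment variables or ~/.aws configuration.
master = pd.read_excel("s3://my-bucket/master.xlsx")  # placeholder bucket/key
print(master.head())
```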
Option #2 is typically better from a design point of view: the data is organised in the database and can be queried and manipulated more efficiently. The challenge in this case is that you need a way to view/edit the data, and this normally requires more development (creating new screens, etc.).
Involving a database is normally the preferred approach, as you can scale up to large datasets without problems (which is not the case with files).
On the other hand, if the XLSX file stays small and you only need simple (quick) updates, your current architecture can work.
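For option #2, one possible way to move a CSV into Heroku Postgres with pandas and SQLAlchemy (the table name and file path are placeholders):

```python
import os

import pandas as pd
from sqlalchemy import create_engine

# Heroku exposes the database as the DATABASE_URL config var; its legacy
# "postgres://" scheme has to be rewritten to "postgresql://" for
# SQLAlchemy 1.4+.
url = os.environ["DATABASE_URL"].replace("postgres://", "postgresql://", 1)
engine = create_engine(url)

df = pd.read_csv("data/financials.csv")  # placeholder path
df.to_sql("financials", engine, if_exists="replace", index=False)
```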

Access data in .realm file from Python

Purpose:
I want to create a web page within a Django app where support staff can upload a .realm file and have the web application pull the user's data and figure out what information in the .realm file is missing from the site.
Question:
Is there a way to open, read and/or manipulate .realm files with Python? If not, what are my options for converting them to something else, like SQLite? Would I need to create some way for the support staff to convert the file before they upload it?
Is there a way to open, read and/or manipulate .realm files with Python?
Realm does not currently have a Python SDK.
If not, what are my options for converting them to something else, like SQLite?
To access data from a Realm file on the server side of a web application, your best bet at present would be to use Realm's Node.js SDK. Alternatively, you could use a client-side app built with one of Realm's other SDKs (Objective-C, Swift, Android, .NET, etc.) to extract the data in question and convert it to a format that your web application can consume.

Storing pandas DataFrames in SQLAlchemy models

I'm building a flask application that allows users to upload CSV files (with varying columns), preview uploaded files, generate summary statistics, perform complex transformations/aggregations (sometimes via Celery jobs), and then export the modified data. The uploaded file is being read into a pandas DataFrame, which allows me to elegantly handle most of the complicated data work.
I'd like these DataFrames along with associated metadata (time uploaded, ID of user uploading the file, etc.) to persist and be available for multiple users to pass around to various views. However, I'm not sure how best to incorporate the data into my SQLAlchemy models (I'm using PostgreSQL on the backend).
Three approaches I've considered:
1. Cramming the DataFrame into a PickleType and storing it directly in the DB. This seems to be the most straightforward solution, but means I'll be sticking large binary objects into the database.
2. Pickling the DataFrame, writing it to the filesystem, and storing the path as a string in the model. This keeps the database small but adds some complexity when backing up the database and when allowing users to do things like delete previously uploaded files.
3. Converting the DataFrame to JSON (DataFrame.to_json()) and storing it as a json type (which maps to PostgreSQL's json type). This adds the overhead of parsing JSON each time the DataFrame is accessed, but it also allows the data to be manipulated directly via PostgreSQL's JSON operators.
Given the advantages and drawbacks of each (including those I'm unaware of), is there a preferred way to incorporate pandas DataFrames into SQLAlchemy models?
Go with the JSON and PostgreSQL solution. I am on a pandas project that started with pickles on the file system, loading the data into a class object for processing with pandas. However, as the data became large, we moved to SQLAlchemy / SQLite3. Now we are finding that working with SQLAlchemy / PostgreSQL is even better. I think our next step will be JSON.
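A minimal sketch of the JSON approach, assuming plain SQLAlchemy on PostgreSQL; the model, column names, and connection string are illustrative:

```python
import pandas as pd
from sqlalchemy import Column, DateTime, Integer, create_engine, func
from sqlalchemy.dialects.postgresql import JSONB
from sqlalchemy.orm import Session, declarative_base

Base = declarative_base()


class Dataset(Base):
    __tablename__ = "datasets"

    id = Column(Integer, primary_key=True)
    uploader_id = Column(Integer)
    uploaded_at = Column(DateTime, server_default=func.now())
    frame = Column(JSONB)  # list-of-records dump of the DataFrame

    def to_frame(self):
        # Rebuild the DataFrame; note that the records orient drops the
        # original index and exact dtypes.
        return pd.DataFrame.from_records(self.frame)


engine = create_engine("postgresql://localhost/app")  # placeholder DSN
Base.metadata.create_all(engine)

with Session(engine) as session:
    df = pd.read_csv("upload.csv")  # placeholder upload
    session.add(Dataset(uploader_id=1, frame=df.to_dict(orient="records")))
    session.commit()
```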
Have fun! Pandas rocks!
