Upload XML Files in Django, Parse XML & Compare with existing models dataset - python

Can someone suggest a better approach for the use case below?
1. Upload an XML file.
2. Scan the XML file for specific tags.
3. Store the required data. In which format? (I thought of building a JSON dump.)
4. I have data in different models for different components.
5. How can I compare the data I have at step 3 with the Django models and produce some output? (A sort of data comparison.)
Note: the JSON dump I get at step 3 is a full dump of the required data, while the data at step 4 refers to small chunks that have to be combined and compared against the JSON dump of the most recently uploaded file.

I would define a model where you can store the uploaded file, plus a form for the upload.
(https://docs.djangoproject.com/en/3.2/topics/http/file-uploads/#handling-uploaded-files-with-a-model)
Use either lxml.etree or generateDS to scan the XML files. (https://www.davekuhlman.org/generateDS.html)
To store the result you can use a JSON dump, or a PickleField holding the object that generateDS builds from the XML file.
Store the data in the database and write a Django model for it. Try to make it as granular as possible so that when you import a new XML file you can compare against it, and perhaps only store the differences as pickled objects; a rough sketch of the upload and parsing steps follows.
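Here is a minimal sketch of the upload model, form, and parsing step, assuming lxml; the names UploadedXML, UploadedXMLForm, REQUIRED_TAGS, and xml_to_dict are illustrative, not from the original question:

# models.py -- hypothetical names, adjust to your project
from django.db import models

class UploadedXML(models.Model):
    file = models.FileField(upload_to='xml_uploads/')
    uploaded_at = models.DateTimeField(auto_now_add=True)
    parsed = models.JSONField(null=True, blank=True)  # the "JSON dump" from step 3

# forms.py
from django import forms

class UploadedXMLForm(forms.ModelForm):
    class Meta:
        model = UploadedXML
        fields = ['file']

# parsing helper using lxml; the tag names are only examples
from lxml import etree

REQUIRED_TAGS = {'component', 'version'}

def xml_to_dict(path):
    tree = etree.parse(path)
    data = {}
    for element in tree.getroot().iter():
        if not isinstance(element.tag, str):  # skip comments / processing instructions
            continue
        tag = etree.QName(element).localname
        if tag in REQUIRED_TAGS:
            data.setdefault(tag, []).append((element.text or '').strip())
    return data

The returned dict can be saved into the parsed field and then compared, tag by tag, against the data already stored in your component models.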
Hope that helps a bit.

Related

Pyspark: Read in only certain fields from nested json data

I'm trying to create a Spark job that can read in thousands of JSON files, perform some actions, and then write out to files on S3 again.
It is taking a long time and I keep running out of memory. I know that Spark tries to infer the schema if one isn't given. The obvious thing to do would be to supply the schema when reading in. However, the schema changes from file to file depending on many factors that aren't important. There are about 100 'core' columns that are in all files and these are the only ones I want.
Is it possible to write a partial schema that only reads the specific fields I want into spark using pyspark?
First, it is recommended to use JSON Lines (jsonl) files, where each line contains a single JSON record. In general you can read just a specific set of fields from a large JSON document, but that should not be treated as Spark's job. Instead, have an initial method that converts each JSON input into an object of a serializable datatype, and feed those objects into your Spark pipeline.
Passing the full schema is not an appropriate design here and only makes the problem worse. Instead, define a single method that extracts the specific fields after loading the data from the files. For how to pull particular fields out of a JSON string in Python, see: How to extract specific fields and values from a JSON with python?
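A minimal sketch of that approach, assuming JSON Lines input where the core fields are present in every record; the CORE_COLUMNS names and the S3 paths are placeholders:

import json
from pyspark.sql import Row, SparkSession

spark = SparkSession.builder.appName("core-fields").getOrCreate()

CORE_COLUMNS = ["id", "timestamp", "event_type"]  # stand-in for the ~100 core columns

def keep_core_fields(line):
    # parse one JSON Lines record and keep only the core fields
    record = json.loads(line)
    return Row(**{col: record.get(col) for col in CORE_COLUMNS})

lines = spark.sparkContext.textFile("s3://bucket/input/*.jsonl")  # placeholder path
df = spark.createDataFrame(lines.map(keep_core_fields))           # schema inferred from the small, fixed set of fields
df.write.mode("overwrite").json("s3://bucket/output/")            # placeholder path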

I want to parse out a json with only the information that I want

Are there any tools for parsing the .json file format?
You cannot selectively parse JSON. That makes no sense: how could you tell where the section you want is without parsing the whole file? So JSON usually has to be read as a whole, but you can then selectively process the loaded JSON by your own criteria. It is sufficient to iterate over the activities node, check the value of action_type, and act accordingly. You need to write that code yourself; there is no makePortal() magic in the language to do it for you.
Load the full json data and filter it after loading
import json

with open('data.json') as f:  # close the file handle cleanly
    data = json.load(f)
# keep only the activities whose composite_id marks a stage change
activities = [act for act in data['activities'] if '_STAGE_CHANGE' in act['composite_id']]

configure custom storing paths for elasticsearch

I am a beginner with Elasticsearch and I am looking for a way to change the configuration that controls how it stores an index's data.
I am storing large files of different formats that arrive in my system on a daily basis, and I don't want to keep two copies of the data: the original file and the file after it has been inserted into Elasticsearch.
What I have done so far is write a Python script that uses the Apache Tika parsing library and the Elasticsearch Python API to parse each file, extract the contents, metadata, etc., and store those in Elasticsearch. What I would now like to do is define a custom storage path, so that instead of the default layout of nodes/indicies/foo the data is stored under, for example, nodes/index/date-document-added/document-extention/documents-of-same-extention.
Is this in any way possible?
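For reference, the parse-and-index step described above might look roughly like this with the tika and elasticsearch Python packages; the index name, field names, and file paths are placeholders:

from tika import parser                        # Apache Tika Python bindings
from elasticsearch import Elasticsearch

es = Elasticsearch("http://localhost:9200")    # assumed local cluster

def index_file(path):
    parsed = parser.from_file(path)            # dict with 'content' and 'metadata' keys
    doc = {
        "content": parsed.get("content"),
        "metadata": parsed.get("metadata"),
        "source_path": path,
    }
    es.index(index="documents", document=doc)  # use body=doc with older client versions

index_file("/data/incoming/report.pdf")        # placeholder file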

Storing pandas DataFrames in SQLAlchemy models

I'm building a flask application that allows users to upload CSV files (with varying columns), preview uploaded files, generate summary statistics, perform complex transformations/aggregations (sometimes via Celery jobs), and then export the modified data. The uploaded file is being read into a pandas DataFrame, which allows me to elegantly handle most of the complicated data work.
I'd like these DataFrames along with associated metadata (time uploaded, ID of user uploading the file, etc.) to persist and be available for multiple users to pass around to various views. However, I'm not sure how best to incorporate the data into my SQLAlchemy models (I'm using PostgreSQL on the backend).
Three approaches I've considered:
Cramming the DataFrame into a PickleType and storing it directly in the DB. This seems to be the most straightforward solution, but means I'll be sticking large binary objects into the database.
Pickling the DataFrame, writing it to the filesystem, and storing the path as a string in the model. This keeps the database small but adds some complexity when backing up the database and allowing users to do things like delete previously uploaded files.
Converting the DataFrame to JSON (DataFrame.to_json()) and storing it as a json type (maps to PostgreSQL's json type). This adds the overhead of parsing JSON each time the DataFrame is accessed, but it also allows the data to be manipulated directly via PostgreSQL JSON operators.
Given the advantages and drawbacks of each (including those I'm unaware of), is there a preferred way to incorporate pandas DataFrames into SQLAlchemy models?
Go with the JSON and PostgreSQL solution. I am on a pandas project that started with pickles on the file system, loading the data into a class object for the data processing with pandas. However, as the data grew, we experimented with SQLAlchemy / SQLite3. Now we are finding that working with SQLAlchemy / PostgreSQL is even better. I think our next step will be JSON.
Have fun! Pandas rocks!
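A minimal sketch of the third option, using a plain SQLAlchemy model with a PostgreSQL JSON column; the Dataset model, its columns, and the connection string are illustrative:

import json

import pandas as pd
from sqlalchemy import Column, DateTime, Integer, create_engine, func
from sqlalchemy.dialects.postgresql import JSON
from sqlalchemy.orm import Session, declarative_base

Base = declarative_base()

class Dataset(Base):                              # hypothetical model name
    __tablename__ = "datasets"
    id = Column(Integer, primary_key=True)
    user_id = Column(Integer)                     # ID of the uploading user
    uploaded_at = Column(DateTime, server_default=func.now())
    frame = Column(JSON)                          # DataFrame serialized via to_json()

engine = create_engine("postgresql://localhost/app")  # assumed connection string
Base.metadata.create_all(engine)

df = pd.read_csv("upload.csv")                    # the uploaded file
payload = json.loads(df.to_json(orient="split"))  # store a real JSON object, not a string

with Session(engine) as session:
    session.add(Dataset(user_id=1, frame=payload))
    session.commit()

# later: rebuild the DataFrame from the stored JSON
with Session(engine) as session:
    stored = session.query(Dataset).first().frame
    restored = pd.DataFrame(data=stored["data"], index=stored["index"], columns=stored["columns"])

Because the column holds an actual JSON object rather than an escaped string, PostgreSQL's JSON operators can be applied to the frame column directly, which is the advantage described in option 3.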

OpenERP 6.1 module with upload file field

I want to create an OpenERP 6.1 module with a binary file upload field as one of the fields in the view.
The file will be stored in the database as binary data, but before it is stored I need to parse the file and save its data as part of another module I have created.
I don't know how to specify a field for uploading files in a view XML file, nor how to run the upload process. Can somebody help me with this? Some code snippets or advice on how to do it would be appreciated.
Take a look at the way the attachments module works, particularly the binary data column. You should also look at the screen definition.
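As a rough sketch of that pattern under the old OpenERP 6.1 osv API; the model name my.document and its fields are made up for illustration:

# Python side: a binary column plus a filename column, osv-style API
from osv import osv, fields
import base64

class my_document(osv.osv):
    _name = 'my.document'                       # hypothetical model name
    _columns = {
        'name': fields.char('Name', size=64),
        'datas': fields.binary('File'),         # uploaded file arrives base64-encoded
        'datas_fname': fields.char('Filename', size=256),
    }

    def create(self, cr, uid, vals, context=None):
        if vals.get('datas'):
            content = base64.b64decode(vals['datas'])
            # parse 'content' here and write the results into your other module's model
        return super(my_document, self).create(cr, uid, vals, context=context)

my_document()

# In the form view XML the field is declared roughly like:
#   <field name="datas" filename="datas_fname"/>
#   <field name="datas_fname" invisible="1"/>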
