I am working on a project where real-time data is stored in MongoDB. I need to analyze that data and calculate new parameters from the ones already present in the database.
The analysis may vary from basic statistics to predicting a value from previous ones.
I came across two approaches: MongoDB queries using pymongo, or an ETL tool.
Please help me find the best approach for this.
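For reference, here is a minimal sketch of what the pymongo route could look like, computing basic per-group statistics inside MongoDB with the aggregation framework. The connection string, database, collection, and field names are placeholders, not taken from the question.

# Hedged sketch: basic statistics computed by MongoDB's aggregation
# framework via pymongo. "telemetry", "readings", "sensor_id" and "value"
# are placeholder names.
from pymongo import MongoClient

client = MongoClient("mongodb://localhost:27017")
readings = client["telemetry"]["readings"]

pipeline = [
    {"$group": {
        "_id": "$sensor_id",
        "avg_value": {"$avg": "$value"},
        "min_value": {"$min": "$value"},
        "max_value": {"$max": "$value"},
        "count": {"$sum": 1},
    }}
]
for doc in readings.aggregate(pipeline):
    print(doc)

Whether this beats an ETL tool depends mostly on how complex the transformations get and whether they still fit comfortably into an aggregation pipeline.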
Disclaimer: I am still pretty new to Django and am no veteran.
I am in the midst of building the "next generation" of a software package I built 10 years ago. The original software was built using CodeIgniter and the LAMP stack. The current software still works great, but it's just time to move on; the tech is now old. I have been looking at Django to write the new software in, but I have concerns about using the ORM and the models file getting out of control.
So here's my situation: each client must have their own database. No exceptions, due to data confidentiality and contracts. Each database mainly stores weather forecast data. There is a skeleton database that is currently used to set up a client. This skeleton does have tables that are common across all clients. What I am concerned about are the forecast data tables I have to create dynamically. Each forecast table is unique and different, with the exception of the first four columns, which are used for referencing/indexing and letting you know when the data was added. The rest of the columns are forecast values in a real/float datatype. There could be anything from 12 forecast data columns to over 365. Between all clients, there are hundreds of different/unique forecast tables.
I am trying to wrap my head around how I can use the ORM without having hundreds of models in models.py. Even if I made a subdirectory and then a "models.py" for each client, I'd still have tons of model classes to deal with.
I have been reading up on how the ORM works for Django, but I haven't found anything (yet) out there that helps with my kind of situation. It's not the norm.
Without getting any more long-winded about this: should I skip the ORM because of all these complexities, or is there some stable way to deal with this besides going with SQL queries and stored procedures to get some performance gains?
Things to note: I did thorough benchmarking between MySQL and Postgres and will be using Postgres for the new project. I did test the option of using an array column vs. having a column for each forecast value in Postgres, hoping this would help with the potential modeling bloat issue. To my surprise, having a column for each forecast value provided faster querying than storing everything in an array column. So array storage is not a viable option for my data.
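One way to keep this manageable without hand-writing hundreds of classes is Django's ability to build model classes at runtime with type(). Below is a minimal sketch, not a drop-in solution: the table name, column names, app label, and module path are all placeholders, and managed = False is assumed so Django never tries to migrate the pre-existing forecast tables.

# Hedged sketch: one factory that builds an unmanaged Django model for any
# forecast table sharing the four common reference columns. All names used
# here are made up for illustration.
from django.db import models

def make_forecast_model(table_name, value_columns, app_label="forecasts"):
    """Return a Django model class mapped onto an existing forecast table."""
    attrs = {
        "__module__": "forecasts.models",   # placeholder module path
        "Meta": type("Meta", (), {
            "db_table": table_name,
            "managed": False,               # tables already exist per client
            "app_label": app_label,
        }),
        # the four common referencing/indexing columns (names assumed)
        "ref_id": models.AutoField(primary_key=True),
        "location_id": models.IntegerField(),
        "forecast_time": models.DateTimeField(),
        "created_at": models.DateTimeField(),
    }
    # one FloatField per forecast value column (12 to 365+ of them)
    for col in value_columns:
        attrs[col] = models.FloatField(null=True)
    return type(table_name.title().replace("_", ""), (models.Model,), attrs)

# usage: a hypothetical 24-column hourly forecast table
HourlyTemp = make_forecast_model("hourly_temp", [f"h{i:02d}" for i in range(24)])

Combined with a database router (since each client has its own database), this keeps models.py down to the shared skeleton tables plus one factory.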
I'm new to big data. I learned that HDFS is for storing more structured data and HBase is for storing unstructured data. I have a REST API from which I need to get the data and load it into the data warehouse (HDFS/HBase). The data is in JSON format. So which one would be better to load the data into: HDFS or HBase? Also, can you please direct me to a tutorial on how to do this? I came across this tutorial on streaming data, but I'm not sure if it will fit my use case.
It would be of great help if you could guide me to a particular resource/technology to solve this issue.
There are several questions you have to think about:
Do you want to work with batch files or streaming? It depends on the rate at which your REST API will be requested.
For the storage there is not just HDFS and HBase; you have a lot of other solutions such as Cassandra, MongoDB, and Neo4j. It all depends on the way you want to use the data (random access vs. full scan, updates with versioning vs. writing new lines, concurrent access). For example, HBase is good for random access, Neo4j for graph storage, and so on. If you are receiving JSON files, MongoDB can be a good choice as it stores objects as documents (see the sketch after this answer).
What is the size of your data?
Here is a good article on questions to think about when you start a big data project: documentation
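If the MongoDB option fits, a minimal sketch of pulling JSON from the REST API and storing it could look like the following; the endpoint URL and the database/collection names are placeholders, not from the question.

# Hedged sketch: fetch JSON from a REST endpoint and store the documents
# in MongoDB. URL, database and collection names are placeholders.
import requests
from pymongo import MongoClient

response = requests.get("https://example.com/api/records")
response.raise_for_status()
records = response.json()           # assumed to be a list of JSON objects

client = MongoClient("mongodb://localhost:27017")
collection = client["warehouse"]["raw_records"]

if records:                         # insert_many needs a non-empty list
    collection.insert_many(records)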
Let's say I have a class User which inherits from the Document class (I am using Mongoengine). Now, I want to retrieve all users signed up after some timestamp. Here is the method I am using:
@classmethod
def get_users(cls, start_timestamp):
    return cls.objects(ts__gte=start_timestamp)
1000 documents are returned in 3 seconds. This is extremely slow. I have done similar queries in SQL in a couple of milliseconds. I am new to MongoDB and NoSQL in general, so I guess I am doing something terribly wrong.
I suspect the retrieval is slow because it is done in several batches. I read somewhere that for PyMongo the batch size is 101, but I do not know if that is the same for Mongoengine.
Can I change the batch size so I can get all documents at once? I will know approximately how much data will be retrieved in total.
Any other suggestions are very welcome.
Thank you!
As you suggest, there is no way this query should take 3 seconds to run. However, the issue is not going to be the performance of the pymongo driver. Some things to consider (a rough sketch follows this list):
Make sure that the ts field is included in the indexes for the user collection.
Mongoengine does some aggressive de-referencing, so if the 1000 returned user documents have one or more ReferenceFields, each of those results in additional queries. There are ways to avoid this.
Mongoengine provides a direct interface to the pymongo method for the MongoDB aggregation framework; this is by far the most efficient way to query MongoDB.
MongoDB recently released an official Python ODM, pymodm, in part to provide better default performance than Mongoengine.
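A rough sketch of the first three points, assuming a User document with a ts DateTimeField; field names other than ts are illustrative.

# Hedged sketch: index ts, avoid de-referencing, and push aggregation work
# into MongoDB itself. Field names other than ts are placeholders.
from mongoengine import DateTimeField, Document, StringField

class User(Document):
    name = StringField()
    ts = DateTimeField()
    meta = {"indexes": ["ts"]}      # 1. keep ts indexed

def get_users(start_timestamp):
    # 2. skip automatic de-referencing and fetch only the fields you need
    return User.objects(ts__gte=start_timestamp).no_dereference().only("name", "ts")

def signups_per_day(start_timestamp):
    # 3. let MongoDB do the counting via the raw pymongo collection that
    #    mongoengine exposes
    pipeline = [
        {"$match": {"ts": {"$gte": start_timestamp}}},
        {"$group": {"_id": {"$dateToString": {"format": "%Y-%m-%d", "date": "$ts"}},
                    "signups": {"$sum": 1}}},
    ]
    return list(User._get_collection().aggregate(pipeline))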
I'm completely new to managing data using databases, so I hope my question is not too stupid, but I did not find anything related using the title keywords...
I want to set up a SQL database to store computation results; these are performed using a Python library. My idea was to use a Python ORM like SQLAlchemy or peewee to store the results in a database.
However, the computations are done by several people on many different machines, including some that are not directly connected to the internet: it is therefore impossible to simply use one common database.
What would be useful to me would be a way of saving the data in the ORM's format to be able to read it again directly once I transfer the data to a machine where the main database can be accessed.
To summarize, I want to do:
On the 1st machine: Python data -> ORM object -> ORM.fileformat
After transfer on a connected machine: ORM.fileformat -> ORM object -> SQL database
Would anyone know if existing ORMs offer that kind of feature?
Is there a reason why some of the machines cannot be connected to the internet?
If you really can't connect them, what I would do is set up a database and the Python app on each machine where data is collected/generated. Have each machine use the app to store into its own local database, and then later you can create a dump of each database from each machine and import those results into one database.
Not the ideal solution but it will work.
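A rough sketch of that setup with SQLAlchemy (1.4+ assumed); the Result model, file names, and connection URLs are placeholders, not from the question.

# Hedged sketch of "local database per machine, merge later" with SQLAlchemy.
# The Result model and both connection URLs are placeholders.
from sqlalchemy import Column, Float, Integer, String, create_engine
from sqlalchemy.orm import Session, declarative_base

Base = declarative_base()

class Result(Base):
    __tablename__ = "results"
    id = Column(Integer, primary_key=True)
    machine = Column(String)
    value = Column(Float)

# On each disconnected machine: write to a local SQLite file.
local_engine = create_engine("sqlite:///results_machine_a.db")
Base.metadata.create_all(local_engine)
with Session(local_engine) as session:
    session.add(Result(machine="machine_a", value=42.0))
    session.commit()

# Later, on the connected machine: read the transferred file and copy the
# rows into the central database, letting it assign fresh primary keys.
central_engine = create_engine("postgresql://user:password@dbhost/results")
Base.metadata.create_all(central_engine)
with Session(local_engine) as src, Session(central_engine) as dst:
    for row in src.query(Result):
        dst.add(Result(machine=row.machine, value=row.value))
    dst.commit()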
OK, thanks to MAhsan's and Padraic's answers I was able to figure out how this can be done: the CSV format is indeed easy to use for import/export from a database.
Here are examples for SQLAlchemy (import 1, import 2, and export) and peewee.
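For completeness, a short sketch of the CSV round-trip with the standard csv module and SQLAlchemy Core (1.4+ assumed); the table name, its columns, the file names, and the connection URLs are placeholders.

# Hedged sketch: export rows from the local SQLite file to CSV, then import
# the transferred CSV into the central database. All names are placeholders.
import csv
from sqlalchemy import MetaData, Table, create_engine, select

# export on the disconnected machine
local = create_engine("sqlite:///results_machine_a.db")
results = Table("results", MetaData(), autoload_with=local)
with local.connect() as conn, open("results.csv", "w", newline="") as f:
    writer = csv.writer(f)
    writer.writerow(["machine", "value"])
    for machine, value in conn.execute(select(results.c.machine, results.c.value)):
        writer.writerow([machine, value])

# import on the connected machine
central = create_engine("postgresql://user:password@dbhost/results")
central_results = Table("results", MetaData(), autoload_with=central)
with central.begin() as conn, open("results.csv", newline="") as f:
    rows = [{"machine": r["machine"], "value": float(r["value"])}
            for r in csv.DictReader(f)]
    if rows:
        conn.execute(central_results.insert(), rows)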
New to Pandas & SQL. Haven't found an answer specific to this config, and not sure if standard SQL wisdom applies when introducing pandas to the mix.
Doing a school project that involves ~300 GB of data in ~6 GB .csv chunks.
School advised syncing data via Dropbox, but this seemed impractical for a 4-person team.
So the current solution is an AWS EC2 & RDS instance (MySQL, I think it'll be, with 1 table).
What I wanted to confirm before we start setting it up:
If multiple users are working with (and occasionally modifying) the data, can this arrangement manage conflicts? e.g., if user A uses pandas to construct a dataframe from a query, are the records in that query frozen if user B tries to work with them?
My assumption is that the data in the frame are in memory, and the records in the SQL database are free to be modified by others until the dataframe is written back to the db, but I'm hoping that either I'm wrong or there's a simple solution here (like a random sample query for each user or something).
A pandas DataFrame object does not interact directly with the db. Once you read it in, it sits in memory locally. You would have to use a method like DataFrame.to_sql to write your changes back to the MySQL DB. For more information on reading and writing to SQL tables, see the pandas documentation here.
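A minimal sketch of that read/modify/write cycle; the RDS hostname, credentials, table, and column names are placeholders.

# Hedged sketch of the read/modify/write cycle with pandas and SQLAlchemy.
# Connection details, table and column names are placeholders.
import pandas as pd
from sqlalchemy import create_engine

engine = create_engine("mysql+pymysql://user:password@your-rds-host/projectdb")

# the query result is copied into local memory; nothing stays locked in MySQL
df = pd.read_sql("SELECT * FROM measurements LIMIT 100000", engine)

# changes made here are invisible to other users...
df["value_scaled"] = df["value"] * 2

# ...until the frame is explicitly written back (here to a separate table,
# which avoids clobbering rows someone else may have modified meanwhile)
df.to_sql("measurements_scaled", engine, if_exists="replace", index=False)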