Replacing a large number of documents from a specific date in MongoDB - Python

I'm storing business/statistical data by date in different collections.
Many thousands of rows are inserted every day.
In some cases my application fetches or generates information covering, say, the last 20 days with new values, so I need to update the old information in MongoDB with the new values for those dates.
The first option I thought of is to remove all rows from 20 days ago until now by filtering on date, and then insert the new data with insertMany().
The problem with this is that the number of rows is huge and the operation blocks the database, which sometimes makes my worker process die (it's a Python Celery task).
The second option I thought of is to split the incoming data into chunks per date (using pandas DataFrames) and perform a removal followed by an insert for each date, iterating day by day until today. It's the same approach, just in smaller chunks; roughly what I have in mind is sketched below.
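A sketch of that chunked approach, assuming pymongo and that my DataFrame df has a "date" column matching the documents' date field (database and collection names are placeholders):

from pymongo import MongoClient

client = MongoClient("mongodb://localhost:27017")
coll = client.mydb.daily_stats

# df: the incoming pandas DataFrame with a "date" column (assumed to exist).
for date, chunk in df.groupby("date"):
    # Replace one day at a time so no single operation holds the database for long.
    coll.delete_many({"date": date})
    coll.insert_many(chunk.to_dict("records"))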
Is the last option a good idea?
Is there any better approach for this type of problem?
Thanks a lot

Related

How to group a specific number of records into subsets to be processed in Spark?

I'd like to start by saying that I'm completely new to Spark so this may be a trivial question but I appreciate any feedback.
What would I like to do?
I want to process data in the following way:
1. Read the data using the Spark Structured Streaming component.
2. Compute statistics (e.g. mean, median) over 10 consecutive records in each column (6 columns), and over all 10 records across all columns.
3. Write this data into a SQL table.
What is the issue?
My problem mainly concerns points 2 and 3. I do not know how to group the records into subsets (chunks) consisting of 10 records and process them per column and for all 10 records simultaneously. Do you have any suggestions?
The second question is about writing to a SQL table, because I don't know exactly which Spark functionality to use for this. Could you give me any hints?
If any more details are needed, feel free to let me know; I'd be glad to add them.
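For illustration, a rough, untested sketch of what points 2 and 3 could look like in plain (non-streaming) PySpark, assuming a DataFrame with six numeric columns c1..c6 and an ordering column ts; all names, the Parquet source, and the JDBC connection details are placeholders:

from pyspark.sql import SparkSession, Window, functions as F

spark = SparkSession.builder.appName("chunk-stats").getOrCreate()
df = spark.read.parquet("input.parquet")  # stand-in for the real source

# Assign every 10 consecutive records (ordered by ts) to one chunk.
w = Window.orderBy("ts")
chunked = df.withColumn("chunk", F.floor((F.row_number().over(w) - 1) / 10))

cols = ["c1", "c2", "c3", "c4", "c5", "c6"]
stats = chunked.groupBy("chunk").agg(
    *[F.mean(c).alias(f"{c}_mean") for c in cols],
    *[F.expr(f"percentile_approx({c}, 0.5)").alias(f"{c}_median") for c in cols],
)

# Write the per-chunk statistics to a SQL table over JDBC.
stats.write.jdbc(
    url="jdbc:postgresql://localhost:5432/mydb",
    table="chunk_stats",
    mode="append",
    properties={"user": "user", "password": "secret"},
)

In Structured Streaming the same aggregation would typically go inside a foreachBatch sink, since JDBC is not a built-in streaming sink.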

How to continuously save a pandas DataFrame to a file?

I have a Python program that is controlling some machines and stores some data. The data is produced at a rate of about 20 rows per second (and about 10 columns or so). The whole run of this program can be as long as one week, as a result there is a large dataframe.
What are safe and correct ways to store this data? By safe I mean that if something fails on day 6, I will still have all the data from days 1→6. By correct I mean not re-writing the whole DataFrame to the file on every loop iteration.
My current solution is a CSV file to which I just write each row manually. This solution is both safe and correct, but the problem is that CSV does not preserve data types and also takes up more space. So I would like to know whether there is a binary alternative. I like the Feather format since it is really fast, but it does not allow appending rows.
I can think of two easy options:
store chunks of data (e.g. every 30 seconds or whatever suits your use case) into separate files; you can then postprocess them back into a single dataframe.
store each row in a SQL database as it comes in. SQLite will likely be a good start, but I'd probably go straight for PostgreSQL; that's what databases are meant for, after all. A minimal sketch of this option follows below.
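A minimal sketch of the second option, assuming each new row arrives as a dict; the file, table, and column names are made up, and pandas hands the insert to the standard-library sqlite3 connection:

import sqlite3
import pandas as pd

conn = sqlite3.connect("machine_data.sqlite")

def store_row(row: dict) -> None:
    # Append a single row; pandas commits after the insert, so a crash
    # on day 6 loses at most the row currently being written.
    pd.DataFrame([row]).to_sql("measurements", conn, if_exists="append", index=False)

store_row({"timestamp": "2024-01-01 00:00:00", "sensor_1": 1.23, "sensor_2": 4.56})

Per-row inserts at ~20 rows per second are well within SQLite's comfort zone; switching to PostgreSQL later mostly means passing an SQLAlchemy engine instead of the sqlite3 connection.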

Should I create one large indexed frame or many groups in an HDFStore?

I have a daily time series of ~1.5 million rows per day, a 4-dimensional index, and 2 columns. Thus far I've put all this into one DataFrame and shoved it into a single group in an HDFStore. The problem now is that continuously appending to this very large frame has become extremely slow, and I'm wondering whether I should just create one group per day and whether this would speed up appends as well as reads. Thanks a lot for the help!
The docs say that you can have 16384 children in one group. With one day per group, that gives you more than 44 years of data, and you could even increase this number if necessary. There is a warning, though, that a larger number could have unwanted performance and storage impacts.
I worked with a file with 15,000+ groups in the root and it worked out nicely. I think the one-group-per-day approach is better when you need to access one day at a time later. Searching for something across all days could be much slower, though. You need to try this out.
Depending on your use case, you could also create one group per year, one subgroup per month, and one table per day of the month. This might be helpful if somebody wants to have a look at the data in a graphical tool such as ViTables.
On the other hand, this might complicate some of your processing steps later on.
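For illustration, a small sketch of the year/month/day layout, assuming daily DataFrames and pandas' HDFStore; the key naming convention is made up (keys starting with a letter keep PyTables' natural naming happy):

import pandas as pd

def append_day(store_path: str, day: pd.Timestamp, day_df: pd.DataFrame) -> None:
    # e.g. /y2024/m01/d15 -> group per year, subgroup per month, table per day.
    key = f"/y{day.year}/m{day.month:02d}/d{day.day:02d}"
    with pd.HDFStore(store_path) as store:
        # format="table" keeps the node appendable and queryable later.
        store.append(key, day_df, format="table")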

Table updates using daily data from other tables (Postgres/Python)

I have a database and a CSV file that gets updated once a day. I managed to update my table1 from this file by creating a separate log file with a record of the last insert.
Now I have to create a new table, table2, where I keep calculations derived from table1.
My issue is that those calculations are based on the previous 10, 20 and 90 rows of table1.
The question is: how can I efficiently update table2 from the data in table1 on a daily basis? I don't want to redo the calculations from the beginning of the table every day, since that would be very time-consuming.
Thanks for your help!
The answer is "as well as one could possibly expect."
Without seeing your tables, data, and queries, and the specs of your machine, it is hard to be too specific. In general, however, an update performs three basic steps. This is a bit of an oversimplification, but it lets you estimate performance.
First it selects the necessary data. Then it marks the updated rows as deleted, and finally it inserts new rows with the new data into the table. Your limit is usually the data selection: as long as you can efficiently run the SELECT query that finds the data you want, the update should perform relatively well.
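As for avoiding a full recompute, a rough sketch of one incremental approach, assuming psycopg2, one row per day in table1 with columns day and value, and a 90-row rolling average as a stand-in calculation (all names are placeholders):

import datetime
import psycopg2

SQL = """
INSERT INTO table2 (day, avg_90)
SELECT day, avg_90
FROM (
    SELECT day,
           AVG(value) OVER (ORDER BY day
                            ROWS BETWEEN 89 PRECEDING AND CURRENT ROW) AS avg_90
    FROM table1
    -- read only the new rows plus the lookback they need
    WHERE day > %(last_day)s - 90
) t
WHERE day > %(last_day)s;
"""

# Only rows newer than the last processed day are inserted; the inner WHERE
# limits the scan to the new rows plus their 90-row lookback instead of
# recomputing from the start of the table.
with psycopg2.connect("dbname=mydb") as conn, conn.cursor() as cur:
    cur.execute(SQL, {"last_day": datetime.date(2024, 1, 15)})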

Efficiently determining if a business is open or not based on store hours

Given a time (e.g. currently 4:24pm on Tuesday), I'd like to be able to select all businesses that are currently open out of a set of businesses.
I have the open and close times for every business for every day of the week.
Let's assume a business can open/close only on the 00, 15, 30 and 45 minute marks of each hour.
I'm assuming the same schedule each week.
I am most interested in being able to quickly look up a set of businesses that is open at a certain time, not the space requirements of the data.
Mind you, some may open at 11pm one day and close at 1am the next day.
Holidays don't matter - I will handle these separately.
What's the most efficient way to store these open/close times such that with a single time/day-of-week tuple I can speedily figure out which businesses are open?
I am using Python, SOLR and mysql. I'd like to be able to do the querying in SOLR. But frankly, I'm open to any suggestions and alternatives.
If you are willing to look at just a single week at a time, you can canonicalize all opening/closing times to be a number of minutes since the start of the week, say Sunday 0 hrs. For each store, you create a number of tuples of the form [startTime, endTime, storeId]. (For hours that span Sunday midnight, you'd have to create two tuples: one going to the end of the week, one starting at the beginning of the week.) This set of tuples would be indexed (say, with a tree you would pre-process) on both startTime and endTime.
The tuples shouldn't be that large: there are only ~10k minutes in a week, which fits in 2 bytes. This structure would sit gracefully inside a MySQL table with appropriate indexes, and would be very resilient to constant insertions and deletions of records as information changed. Your query would simply be "select storeId where startTime <= time and endTime >= time", where time is the canonicalized minutes since midnight on Sunday.
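A small sketch of that canonicalization, assuming the week starts Sunday 00:00, days are numbered Sunday = 0, and times come in as "HH:MM" strings; the splitting handles hours that span Sunday midnight:

WEEK_MINUTES = 7 * 24 * 60

def to_week_minutes(day_of_week: int, hhmm: str) -> int:
    hours, minutes = map(int, hhmm.split(":"))
    return day_of_week * 24 * 60 + hours * 60 + minutes

def session_tuples(store_id, open_day, open_time, close_day, close_time):
    start = to_week_minutes(open_day, open_time)
    end = to_week_minutes(close_day, close_time)
    if start <= end:
        return [(start, end, store_id)]
    # Spans Sunday midnight: split into two tuples.
    return [(start, WEEK_MINUTES - 1, store_id), (0, end, store_id)]

# Saturday 23:00 until Sunday 01:00 becomes two rows:
print(session_tuples(42, 6, "23:00", 0, "01:00"))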
If information doesn't change very often, and you want lookups to be very fast, you could solve every possible query up front and cache the results. For instance, there are only 672 quarter-hour periods in a week. With a list of businesses, each of which has a list of opening and closing times like in Brandon Rhodes's solution, you could simply iterate through every 15-minute period in a week, figure out who's open, then store the answer in a lookup table or in-memory list.
The bitmap field mentioned by another respondent would be incredibly efficient, but gets messy if you want to handle half-hour or quarter-hour times, since you have to increase the number of bits and redesign the field each time you encounter a new resolution you have to match.
I would instead try storing the values as datetimes inside a list:
openclosings = [ open1, close1, open2, close2, ... ]
Then, I would use Python's "bisect_right()" function in its built-in "bisect" module to find, in fast O(log n) time, where in that list your query time "fits". Then, look at the index that is returned. If it is an even number (0, 2, 4...) then the time lies between one of the "closed" times and the next "open" time, so the shop is closed then. If, instead, the bisection index is an odd number (1, 3, 5...) then the time has landed between an opening and a closing time, and the shop is open.
Not as fast as bitmaps, but you don't have to worry about resolution, and I can't think of another O(log n) solution that's as elegant.
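For concreteness, a runnable version of that bisect idea; the opening and closing datetimes below are made-up sample data for a single shop:

from bisect import bisect_right
from datetime import datetime

openclosings = [
    datetime(2024, 1, 1, 9, 0), datetime(2024, 1, 1, 17, 0),   # Monday 9-17
    datetime(2024, 1, 2, 9, 0), datetime(2024, 1, 2, 17, 0),   # Tuesday 9-17
]

def is_open(when: datetime) -> bool:
    # An odd insertion index means `when` falls between an open and a close.
    return bisect_right(openclosings, when) % 2 == 1

print(is_open(datetime(2024, 1, 1, 12, 30)))   # True
print(is_open(datetime(2024, 1, 1, 18, 0)))    # False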
You say you're using SOLR, don't care about storage, and want the lookups to be fast. Then instead of storing open/close tuples, index an entry for every open block of time at the level of granularity you need (15 mins). For the encoding itself, you could use just cumulative hours:minutes.
For example, a store open from 4-5 pm on Monday, would have indexed values added for [40:00, 40:15, 40:30, 40:45]. A query at 4:24 pm on Monday would be normalized to 40:15, and therefore match that store document.
This may seem inefficient at first glance, but it's a relatively small constant penalty in indexing speed and space, and it makes the searches as fast as possible.
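A tiny sketch of generating those indexed blocks, assuming hours are counted cumulatively from Sunday 00:00 (so Monday 4 pm is hour 40) and a 15-minute granularity; the encoding itself is just an illustration:

def open_blocks(start_hour: int, start_min: int, end_hour: int, end_min: int):
    # Emit one "H:MM" value per 15-minute block the store is open.
    start = start_hour * 60 + start_min
    end = end_hour * 60 + end_min
    return [f"{m // 60}:{m % 60:02d}" for m in range(start, end, 15)]

print(open_blocks(40, 0, 41, 0))   # ['40:00', '40:15', '40:30', '40:45']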
Sorry I don't have an easy answer, but I can tell you that as the manager of a development team at a company in the late 90's we were tasked with solving this very problem and it was HARD.
It's not the weekly hours that are tough; those can be handled with a relatively small bitmask (168 bits = 1 per hour of the week). The trick is the businesses that are closed every other Tuesday.
Starting with a bitmask and then moving on to an exceptions field is the best solution I've ever seen.
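For reference, a tiny sketch of that hour-of-week bitmask, assuming the week starts Monday 00:00 (day 0 = Monday); the exceptions field for things like alternating Tuesdays is not shown:

def set_open(mask: int, day: int, hour: int) -> int:
    return mask | (1 << (day * 24 + hour))

def is_open(mask: int, day: int, hour: int) -> bool:
    return bool((mask >> (day * 24 + hour)) & 1)

mask = 0
for hour in range(9, 17):        # open Monday 09:00-17:00
    mask = set_open(mask, 0, hour)

print(is_open(mask, 0, 12))      # True  (Monday noon)
print(is_open(mask, 1, 12))      # False (Tuesday noon)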
In your Solr index, instead of indexing each business as one document with hours, index every "retail session" for every business during the course of a week.
For example if Joe's coffee is open Mon-Sat 6am-9pm and closed on Sunday, you would index six distinct documents, each with two indexed fields, "open" and "close". If your units are 15 minute intervals, then the values can range from 0 to 7*24*4. Assuming you have a unique ID for each business, store this in each document so you can map the sessions to businesses.
Then you can simply do a range search in Solr:
open:[* TO N] AND close:[N+1 TO *]
where N is the index of the 15-minute interval that the current time falls into. For example, if it's 10:10AM on Wednesday, your query would be:
open:[* TO 112] AND close:[113 TO *]
aka "find a session that starts at or before 10:00am Wed and ends at or after 10:15am Wed"
If you want to include other criteria in your search, such as location or products, you will need to index this with each session document as well. This is a bit redundant, but if your index is not huge, it shouldn't be a problem.
If you can control your data well, I see a simple solution, similar to #Sebastian's. Follow the advice of creating the tuples, except create them in the form [time=startTime, storeId] and [time=endTime, storeId], then sort these in a list. To find out if a store is open, simply do a query like:
select storeId
from table
where time <= '#1'
group by storeId
having count(storeId) % 2 = 1
To optimize this, you could build a lookup table that, for each time t, stores the stores that are open at t and the openings/closings between t and t+1 (for whatever granularity of t you choose).
However, this has the drawback of being harder to maintain (overlapping openings/closings need to be merged into a longer open-close period).
Have you looked at how many unique open/close time combinations there are? If there are not that many, make a reference table of the unique combinations and store the index of the appropriate entry against each business. Then you only have to search the reference table and then find the businesses with those indices.
