I searched around and couldn't really find any information on this. Basically I have a database "A" and a database "B". What I want to do is create a Python script (that will likely run as a cron job) that will collect data from database "A" via SQL, perform an action on it, and then insert that data into database "B".
I have written it using functions, something along the lines of:
Function 1 gets the date the script was last run
Function 2 gets the data from database "A" based on function 1
Functions 3-5 perform the needed actions
Function 6 inserts the data into database "B"
My question is, it was mentioned to me that I should use a Class to do this rather than just functions. The only problem is, I am honestly a bit hazy on Classes and when to use them.
Would a Class be better for this? Or is writing this out as functions that feed into each other better? If I were to use a Class, could you tell me what it would look like?
Would a Class be better for this?
Probably not.
Classes are useful when you have multiple, stateful instances that have shared methods. Nothing in your problem description matches those criteria.
There's nothing wrong with having a script with a handful of functions to perform simple data transfers (extract, transform, store).
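If it helps, a minimal sketch of the function-based version could look something like this (sqlite3 is only a stand-in driver, and the table, column, and file names are all made up):

import sqlite3  # stand-in driver; swap in whatever your databases actually use

def get_last_run(state_path="last_run.txt"):
    # Hypothetical: the previous run's timestamp, kept in a small state file.
    with open(state_path) as f:
        return f.read().strip()

def extract(conn_a, since):
    # Pull the new rows from database "A".
    cur = conn_a.execute(
        "SELECT id, value FROM source_table WHERE updated_at > ?", (since,))
    return cur.fetchall()

def transform(rows):
    # Stand-in for whatever your functions 3-5 actually do.
    return [(row_id, value * 2) for row_id, value in rows]

def load(conn_b, rows):
    # Insert the transformed rows into database "B".
    conn_b.executemany(
        "INSERT INTO target_table (id, value) VALUES (?, ?)", rows)
    conn_b.commit()

def main():
    conn_a = sqlite3.connect("a.db")
    conn_b = sqlite3.connect("b.db")
    rows = extract(conn_a, get_last_run())
    load(conn_b, transform(rows))

if __name__ == "__main__":
    main()

Each function stays small and testable, and the cron job just runs main().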
I am currently using the nextval() function for the PostgreSQL insert id, but now I want to implement this function in Python to generate the id, which will later be inserted into PostgreSQL. Any idea what the pseudocode of this function would be, or how to write it in Python?
Thanks in Advance
Look at nextval_internal in src/backend/commands/sequence.c.
Essentially, you lock the object against concurrent access, then increment or decrement according to the sequence definitions, unlock the object and return the new value.
The hard part is to persist this modification efficiently, so that you can be sure that you don't deal out duplicate values if the application happens to crash.
Databases are particularly good at persisting this efficiently and reliably, so you are best off using database sequences rather than trying to re-invent the wheel.
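If you do end up needing the value on the Python side, the safe approach is still to ask PostgreSQL's own sequence for it rather than re-implementing the logic. A rough sketch with psycopg2 (the connection string, table, and sequence names are assumptions):

import psycopg2

conn = psycopg2.connect("dbname=mydb user=myuser")
with conn, conn.cursor() as cur:
    # PostgreSQL hands out the id and takes care of locking and persistence.
    cur.execute("SELECT nextval('my_table_id_seq')")
    new_id = cur.fetchone()[0]
    cur.execute("INSERT INTO my_table (id, name) VALUES (%s, %s)",
                (new_id, "example"))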
I have written a module in Python that reads a couple of tables from a database using the pd.read_sql method, performs some operations on the data, and writes the results back to the same database using the pd.to_sql method.
Now, I need to write unit tests for operations involved in the above mentioned module. As an example, one of the tests would check if the dataframe obtained from the database is empty, another one would check if the data types are correct etc. For such tests, how do I create sample data that reflects these errors (such as empty data frame, incorrect data type)? For other modules that do not read/write from a database, I created a single sample data file (in CSV), read the data, make necessary manipulations and test different functions. For the module related to database operations, how do I (and more importantly where do I) create sample data?
I was hoping to make a local data file (as I did for testing other modules) and then read it using the read_sql method, but that does not seem possible. Creating a local database using PostgreSQL etc. might be possible, but such tests cannot be deployed to clients without requiring them to create the same local databases.
Am I thinking of the problem correctly or missing something?
Thank you
You're thinking about the problem in the right way. Unit tests should not rely on the existence of a database, as it makes them slower, more difficult to set up, and more fragile.
There are (at least) three approaches to the challenge you're describing:
The first, and probably the best one in your case, is to leave read_sql and to_sql out of the tested code. Your code should consist of a 'core' function that accepts a data frame and produces another data frame. You can unit-test this core function using local CSV files, or whatever other data you prefer. In production, you'll have another, very simple, function that just reads the data using read_sql, passes it to the 'core' function, gets the result, and writes it using to_sql. You won't be unit-testing this wrapper function - but it's a really simple function and you should be fine.
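A rough sketch of that split (the column and table names are made up):

import pandas as pd

def transform(df):
    # The 'core' logic: DataFrame in, DataFrame out. Easy to unit-test
    # with a CSV fixture or a hand-built DataFrame.
    out = df.dropna()
    out["total"] = out["price"] * out["quantity"]  # placeholder columns
    return out

def run(connection):
    # Thin, untested wrapper around the database I/O.
    df = pd.read_sql("SELECT * FROM input_table", connection)
    transform(df).to_sql("output_table", connection,
                         if_exists="replace", index=False)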
Use SQLite. The tested function gets a database connection string. In prod, that would be a 'real' database; during your tests, it'll be a lightweight SQLite database that you can keep in your source control or create as part of the test.
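For example (assuming pytest's tmp_path fixture; the table and columns are again invented):

import sqlite3
import pandas as pd

def test_reads_input_table(tmp_path):
    con = sqlite3.connect(tmp_path / "test.db")
    # Seed the throwaway database with known data.
    pd.DataFrame({"price": [1.0, 2.0], "quantity": [3, 4]}).to_sql(
        "input_table", con, index=False)
    df = pd.read_sql("SELECT * FROM input_table", con)
    assert not df.empty
    assert df["quantity"].dtype.kind == "i"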
The last option, and the most sophisticated one, is to monkey-patch read_sql and to_sql in your test. I think it's overkill in this case, but here's how one can do it:
import pandas as pd

def my_func(sql, con):
    # Stand-in for pd.read_sql: ignore the arguments and return canned data.
    print("I'm here!")
    return pd.DataFrame({"dummy": [1, 2, 3]})

pd.read_sql = my_func
pd.read_sql("select something ...", "dummy_con")
I'm trying to export all the data connected to a User instance to a CSV file. In order to do so, I need to get it from the DB first. Using something like
data = SomeModel.objects.filter(owner=user)
on every possible model seems to be very inefficient, so I want to use prefetch_related(). My question is, is there any way to prefetch all the different models' instances with a FK pointing at my User, at once?
Actually, you don't need to "prefetch everything" in order to create a CSV file – or, anything else – and you really don't want to. Python's CSV support is of course designed to work "row by row," and that's what you want to do here: in a loop, read one row at a time from the database and write it one row at a time to the file.
Remember that Django is lazy. Functions like filter() specify what the filtration is going to be, but things really don't start happening until you start to iterate over the actual collection. That's when Django will build the query, submit it to the SQL engine, and start retrieving the data that's returned ... one row at a time.
Let the SQL engine, Python and the operating system take care of "efficiency." They're really good at that sort of thing.
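As a rough sketch, assuming the SomeModel from the question and some made-up field names, the row-by-row version looks roughly like this:

import csv

def export_user_data(user, path):
    # SomeModel is the question's model; values_list() keeps only the columns
    # we need, and iterator() streams rows instead of loading the whole queryset.
    rows = SomeModel.objects.filter(owner=user).values_list(
        "id", "created", "amount").iterator()
    with open(path, "w", newline="") as f:
        writer = csv.writer(f)
        writer.writerow(["id", "created", "amount"])
        for row in rows:
            writer.writerow(row)

Repeat for each model with a FK to User, or collect the querysets in a list and loop over them.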
I'm looking for ideas on how to improve a report that takes up to 30 minutes to process on the server. I'm currently working with Django and MySQL, but if there is a solution that requires changing the language or the SQL database, I'm open to it.
The report I'm talking about reads multiple Excel files and inserts all the rows from those files into a table (the report table) with somewhere between 12K and 15K records; the table has around 50 columns. This part doesn't take that much time.
Once I have all the records in the report table, I start applying multiple phases of business logic, so I end up with something like this:
def create_report():
    business_logic_1()
    business_logic_2()
    business_logic_3()
    business_logic_4()
Each business_logic_X function does something very similar: it starts with a ReportModel.objects.all(), then applies multiple calculations like checking dates, quantities, etc., and updates the records. Since it's a 12K-record table, this quickly adds time to the complete report.
The reason I'm running the functions separately instead of doing all the processing in one pass is that the logic from the first function needs to be completed before the logic in the next functions can work (e.g. the first function finds all related records and applies the same status to all of them).
The first thing I know could be optimized is somehow caching the objects.all() result instead of calling it in each function, but I'm not sure how to pass it to the next function without saving the records first.
I already optimized the report a bit by using update_fields on the save() calls in those functions, and that saved a bit of time.
My question is, is there a better approach to this kind of problem? Is Django/MySQL the right stack for this?
What takes time is the business logic that you're doing in Django, because it results in several round trips between the database and the application.
It sounds like there are several tables involved, so I would suggest that you write your query in raw SQL and only pull the results into the application once you have them, if you need them at all.
The ORM has a raw() method that you can use, or you could drop down to an even lower level and interface with your database directly.
Unless I see more of what you do, I can't give any more specific advice.
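Not knowing the real schema, a sketch of what 'drop down to raw SQL' could look like (the table, column, and status names are invented; ReportModel is from the question):

from django.db import connection

# Map a raw query onto model instances when you still want ORM objects:
reports = ReportModel.objects.raw(
    "SELECT * FROM report_table WHERE quantity > %s", [0])

# Or let MySQL apply a whole business-logic pass in one statement:
with connection.cursor() as cursor:
    cursor.execute(
        "UPDATE report_table SET status = 'late' WHERE due_date < CURDATE()")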
I have a question about updating field in GAE database. My problem looks like this:
class A(db.Model):
    a = db.StringProperty()
and I added a bool field:
class A(db.Model):
    a = db.StringProperty()
    b = db.BooleanProperty(default=False)
Now my problem is that I'd like every instance of the model to have b == False.
To update it I could of course drag the entities out of the datastore and put them back, but there are already 700k of them and I really don't know how to do that efficiently. I can't take them out all at once because I get soft memory exceeded errors. If I try to do it in little chunks, it costs me many DB read operations. Do you have any idea how else I could update my datastore?
Cheers
I agree with #ShayErlichmen. However, if you really want to update every entity, the easiest way is to use the MapReduce library:
http://code.google.com/p/appengine-mapreduce/
It's not as easy as it sounds, because the documentation sucks, but this is the getting started point:
http://code.google.com/p/appengine-mapreduce/wiki/GettingStartedInPython
You just write a function foo() that checks the value of each entity passed to it and, if necessary, updates your Boolean and writes the entity back.
The library will grab batches of entities and send each batch to a separate task. Each task will run in a loop calling your function foo(). Note that the batches run in parallel, so it may launch a few instances in parallel, but it tends to be quick.
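A sketch of what such a mapper could look like, following the pattern in that getting-started guide (I'm assuming the library's operation module and the model A from the question):

from mapreduce import operation as op

def process(entity):
    # Called once per entity; only rewrite entities whose flag isn't already set.
    if not entity.b:
        entity.b = False
        # Yielding a Put lets the library batch the datastore writes.
        yield op.db.Put(entity)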
Your new attribute can be in one of three states: None, False and True. Just treat None as False in your code and you won't have to do the update.
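For example, a tiny read helper along these lines means the 700k existing entities never need to be rewritten:

def effective_b(entity):
    # None (old entities) and False are treated the same way.
    return bool(entity.b)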