My Python project involves an externally provided database: a text file of approximately 100K lines.
This file will be updated daily.
Should I load it into an SQL database, and deal with the diff daily? Or is there an effective way to "query" this text file?
ADDITIONAL INFO:
Each "entry", or line, contains three fields - any one of which can be used as an index.
The update is in the form of the entire database; I would have to generate a diff manually.
The queries are just looking up records and displaying the text.
Querying the database will be a fundamental task of the application.
How often will the data be queried? At one extreme, if it's only once per day, a sequential search might well be more efficient than maintaining a database or an index.
For more frequent queries against a daily update, you could build and maintain your own index for more efficient lookups. Most likely, though, it would be worth the negligible (if any) sacrifice in speed to use an SQL database (or another database, depending on your needs) in return for simpler and more maintainable code.
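A minimal sketch of that SQLite route, assuming the three fields are tab-separated; the file name, table, and column names are placeholders:

    import sqlite3

    def load_snapshot(path, db="snapshot.db"):
        """Rebuild the lookup database from the daily text file (assumed tab-separated)."""
        conn = sqlite3.connect(db)
        conn.execute("DROP TABLE IF EXISTS records")
        conn.execute("CREATE TABLE records (field1 TEXT, field2 TEXT, field3 TEXT)")
        with open(path, encoding="utf-8") as f:
            rows = (line.rstrip("\n").split("\t") for line in f)
            conn.executemany("INSERT INTO records VALUES (?, ?, ?)", rows)
        # One index per field, since any of the three can be used for lookups.
        for col in ("field1", "field2", "field3"):
            conn.execute(f"CREATE INDEX idx_{col} ON records ({col})")
        conn.commit()
        return conn

    # Example lookup by any of the three fields:
    # conn = load_snapshot("daily_dump.txt")
    # row = conn.execute("SELECT * FROM records WHERE field2 = ?", ("some_key",)).fetchone()

Since the daily update is a full dump, rebuilding the whole database each day is probably cheap enough at 100K rows that no diff is needed at all.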
What I've done before is create SQLite databases from txt files which were created from database extracts, one SQLite DB for each day.
One can query across the SQLite DBs to check values, etc., and create additional tables of data.
I added an additional column of data that was the SHA1 of the text line so that I could easily identify lines that were different.
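A rough sketch of that hashing approach, assuming one SQLite file per daily extract (table and column names are invented):

    import hashlib
    import sqlite3

    def load_day(txt_path, db_path):
        """Load one day's extract into its own SQLite file, storing a SHA1 per line."""
        conn = sqlite3.connect(db_path)
        conn.execute("CREATE TABLE IF NOT EXISTS extract (line TEXT, sha1 TEXT)")
        with open(txt_path, encoding="utf-8") as f:
            rows = ((line.rstrip("\n"),
                     hashlib.sha1(line.rstrip("\n").encode("utf-8")).hexdigest())
                    for line in f)
            conn.executemany("INSERT INTO extract VALUES (?, ?)", rows)
        conn.commit()
        conn.close()

    def changed_lines(today_db, yesterday_db):
        """Return lines in today's extract whose hash doesn't appear in yesterday's."""
        conn = sqlite3.connect(today_db)
        conn.execute("ATTACH DATABASE ? AS prev", (yesterday_db,))
        return conn.execute(
            "SELECT line FROM extract "
            "WHERE sha1 NOT IN (SELECT sha1 FROM prev.extract)"
        ).fetchall()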
It worked in my situation and hopefully may form the barest sniff of an acorn of an idea for you.
Just a logic question really... I have a script that takes rows of data from a CSV, parses the cell values to make the data uniform, and checks the database that a key/primary value does not already exist, so as to prevent duplicates. At the moment, the first 10-15k entries commit to the DB fairly quickly, but then it really starts slowing down as there are more entries in the DB to check against for duplicates; by the time there are 100k rows in the DB, the commit speed is about 1/sec... argh.
So my question: is it (Pythonically) more efficient to extract and parse the data separately from the DB commit procedure (maybe in a class-based script, or could I add multiprocessing to the CSV parsing or the DB commits)? And is there a quicker method to check the database for duplicates if I am only cross-referencing one table and one value?
Much appreciated
Kuda
If the first 10-15k entries worked fine, the issue is probably with the database query. Do you have a suitable index, and is that index actually used by the database? You can use an EXPLAIN statement to see what the database is doing and whether it's actually using the index for the particular query issued by Django.
If the table starts empty, it might also help to run ANALYZE TABLE after the first few thousand rows; the query optimiser might have stale statistics from when the table was empty. To test this hypothesis, you can connect to the database while the script is running, once it starts to slow down, and run ANALYZE TABLE manually. If it immediately speeds up, the problem was indeed stale statistics.
As for optimising the database commits themselves, that probably isn't the issue in your case (since the first 10k rows perform fine), but one aspect is round-trips: every query has to go to the database and bring the results back. This is especially noticeable if the database is across a network. If you need to speed that up, Django has a bulk_create() method to insert many rows at once. However, if you do that, you'll only get an error for the whole batch of rows if you try to insert duplicates forbidden by the database indexes; you'll then have to find the particular row causing the error using other code.
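To illustrate the batching idea (not code from the answer itself): with a database-level unique constraint on the key column, the per-row duplicate check can be dropped and the inserts batched. The Record model and field names below are placeholders, and ignore_conflicts needs Django 2.2+.

    from myapp.models import Record  # hypothetical model with a unique "key" field

    def import_rows(parsed_rows, batch_size=1000):
        """Insert rows in batches; rows whose key already exists are skipped by the DB."""
        batch = []
        for row in parsed_rows:
            batch.append(Record(key=row["key"], value=row["value"]))
            if len(batch) >= batch_size:
                Record.objects.bulk_create(batch, ignore_conflicts=True)
                batch = []
        if batch:
            Record.objects.bulk_create(batch, ignore_conflicts=True)

    # To see whether the duplicate check is using an index (Django 2.1+):
    # print(Record.objects.filter(key="some-key").explain())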
I have financial statement data on thousands of different companies. Some of the companies have data only for 2019, but for some I have decade-long data. Each company's financial statement has its own table, structured as follows (the first row is the column headers):
lineitem    2019    2018    2017
2           1000     800     600
3206         700     300    -200
56            50     100     100
200         1200      90     700
This structure is preferred over a flatter lineitem-year-amount layout, since one query gives me the correct structure of the output for a financial statement table. lineitem is a foreign key linking to the primary key of a mapping table with over 10,000 records; 3206 can, for example, mean "Debt to credit institutions". I also have a companyIndex table which holds the company ID, company name, and table name.

I am able to get the data into the database and make queries using sqlite3 in Python, but advanced queries are somewhat of a challenge at times, not to mention that they can take a lot of time and not be very readable. I like the potential of using an ORM in Django or SQLAlchemy. The ORM in SQLAlchemy seems to want me to know the name of the table I am about to create and how many columns to create, but I don't know that, since I have a script that parses a CSV data dump which includes the company ID and financial statement data for however many years the company has operated. Also, one year later I will have to update the tables with one additional year of data.
I have been watching and reading tutorials on Django and SQLAlchemy, but have not been able to try things out much in practice due to this initial problem, which is a prerequisite for succeeding in my project. I have googled and googled, and checked Stack Overflow for a solution, but not found any solved questions (which is really surprising, since I always find the solution on here).
So how can I insert the data using Django/SQLAlchemy, given the structure I plan to have it fit into? How can I have the selected table(s) (based on company ID or company name) be object(s) in the ORM just like any other object, allowing me to select the data I want at the granularity level I want?
Ideally there is a solution to this in Django, but since I haven't found anything I suspect there is not or that how I have structured the database is insanity.
You cannot find a solution because there is none.
You are mixing the input data format with the table schema.
You establish an initial database table schema and then add data as rows to the tables.
You never touch the database table columns again, unless you decide that the schema has to be altered to support different, usually additional, functionality in the application; for example, because at a certain point in the application's lifetime new attributes become required for the data. You don't change it because there is more data, which simply translates to new data rows in one or more tables.
So first you decide about a proper schema for database tables, based on the data records you will be reading or importing from somewhere.
Then you make sure the database is normalized up to 3rd normal form.
You really have to understand this. I haven't read it all, just skimmed over it, but I assume it is correct. This is fundamental database knowledge you cannot escape. After learning it properly, and with practice, it becomes second nature and you will apply the rules without even noticing.
Then your problems will vanish, and you can do what you want with whatever relational database or ORM you want to use.
The only remaining problem is that input data needs validation, and sometimes it is not given to us in the proper form. So the program, or an initial import procedure, or further data import operations, may need to massage the data a bit before writing proper data rows into the existing tables.
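Purely as an illustration of the normalized direction (the answer doesn't prescribe a concrete schema), the financial data from the question could be laid out as Django models roughly like this; all names are invented:

    from django.db import models

    class Company(models.Model):
        name = models.CharField(max_length=200)

    class LineItem(models.Model):
        # The ~10,000-row mapping table; e.g. code 3206 = "Debt to credit institutions".
        code = models.IntegerField(unique=True)
        description = models.CharField(max_length=200)

    class StatementValue(models.Model):
        # One row per company / line item / year; a new year is just more rows.
        company = models.ForeignKey(Company, on_delete=models.CASCADE)
        line_item = models.ForeignKey(LineItem, on_delete=models.CASCADE)
        year = models.IntegerField()
        amount = models.DecimalField(max_digits=18, decimal_places=2)

        class Meta:
            unique_together = [("company", "line_item", "year")]

With this shape the yearly update only inserts rows, and the statement-style layout (years as columns) can be produced at query time, for example with a pivot in pandas or a small loop in Python.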
New to Pandas & SQL. Haven't found an answer specific to this config, and not sure if standard SQL wisdom applies when introducing pandas to the mix.
Doing a school project that involves ~300 GB of data in ~6 GB .csv chunks.
School advised syncing the data via Dropbox, but this seemed impractical for a 4-person team.
So, the current solution is an AWS EC2 & RDS instance (MySQL, I think it'll be, with 1 table).
What I wanted to confirm before we start setting it up:
If multiple users are working with (and occasionally modifying) the data, can this arrangement manage conflicts? e.g., if user A uses pandas to construct a dataframe from a query, are the records in that query frozen if user B tries to work with them?
My assumption is that the data in the frame are in memory, and the records in the SQL database are free to be modified by others until the dataframe is written back to the db, but I'm hoping that either I'm wrong or there's a simple solution here (like a random sample query for each user or something).
A pandas DataFrame object does not interact directly with the DB. Once you read it in, it sits in memory locally. You would have to use a method like DataFrame.to_sql to write your changes back to the MySQL DB. For more information on reading and writing to SQL tables, see the pandas documentation.
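A minimal sketch of that round trip, assuming a SQLAlchemy engine; the connection string and table names are placeholders:

    import pandas as pd
    from sqlalchemy import create_engine

    # Hypothetical connection string for the shared MySQL RDS instance.
    engine = create_engine("mysql+pymysql://user:password@rds-host:3306/schooldb")

    # Read a slice of the table into local memory; nothing is locked in the database.
    df = pd.read_sql("SELECT * FROM measurements LIMIT 100000", engine)

    # ... work on df locally with pandas ...

    # Write results back; other users see nothing until this runs.
    df.to_sql("measurements_processed", engine, if_exists="replace", index=False)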
My app reads dozens of SQLite databases ranging in size from 1MB to 100MB with a simple table structure like,
"create table dictionary(id INTEGER PRIMARY KEY, topics, definition)"
On startup the app reads all the databases to extract the "topics" column data. With the DB configured as above, extracting the topics data is a lengthy process. It seems the whole DB file is being read just to access the relatively small "topics" column.
If I add another table with just the topics data,
"create table dictionary(id INTEGER PRIMARY KEY, topics)"
access is quicker. However, if I make another DB file with just the topics column, access is much quicker. I don't really want to use two files to access a single DB.
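For reference, a sketch of the single-file variant described above, keeping a small topics-only table alongside the full table in the same database (names follow the question; whether this avoids the long reads will depend on how SQLite lays the pages out in the file):

    import sqlite3

    conn = sqlite3.connect("dictionary.db")
    # The full table, as in the question.
    conn.execute("CREATE TABLE IF NOT EXISTS dictionary"
                 "(id INTEGER PRIMARY KEY, topics, definition)")
    # A small companion table holding only the topics, in the same file.
    conn.execute("CREATE TABLE IF NOT EXISTS topics"
                 "(id INTEGER PRIMARY KEY, topics)")
    conn.execute("INSERT OR REPLACE INTO topics (id, topics) "
                 "SELECT id, topics FROM dictionary")
    conn.commit()

    # Startup then only touches the small table:
    topics = conn.execute("SELECT topics FROM topics").fetchall()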
My questions,
Does table creation order affect the speed of access?
Is there a way to read only the part of the file that has the queried table/column, thus saving time and hard drive grinding?
Is it possible to read a small table/column from a huge file nearly as quickly as reading the small table alone from a small file? And if so, how?
Thanks,
I have created a database in PostgreSQL, let's call it testdb.
I have a generic set of tables inside this database, xxx_table_one, xxx_table_two and xxx_table_three.
Now, I have Python code where I want to dynamically create and remove "sets" of these 3 tables in my database, with a unique identifier in the table name distinguishing different "sets" from each other, e.g.
Set 1
testdb.aaa_table_one
testdb.aaa_table_two
testdb.aaa_table_three
Set 2
testdb.bbb_table_one
testdb.bbb_table_two
testdb.bbb_table_three
The reason I want to do it this way is to keep multiple LARGE data collections of related data separate from each other. I need to regularly overwrite individual data collections, and it's easy if I can just drop a collection's tables and recreate a completely new set. Also, I should mention that the different data collections fit the same schema, so instead of separating them into different tables I could save all the data collections in one set of tables and use an identifier column to distinguish them.
I want to know a few things:
Does PostgreSQL limit the number of tables per database?
What is the effect on performance, if any, of having a large number of tables in 1 database?
What is the effect on performance of saving the data collections in different sets of tables compared to saving them all in the same set? E.g. I guess I would need to write more queries if I want to query multiple data collections at once when the data is spread across tables, as compared to just one set of tables.
PostgreSQL doesn't have many limits; your hardware is much more limited, and that's where you'll encounter most problems. http://www.postgresql.org/about/
You can have 2^32 tables in a single database, just over 4 billion.
PostgreSQL doesn't impose a direct limit on this; your OS does (it depends on the maximum directory size).
This may depend on your OS as well. Some filesystems get slower with large directories.
PostgreSQL won't be able to optimize queries if they're spread across different tables, so using fewer tables (or a single table) should be more efficient.
If your data were not related, I think your tables could live in different schemas, and then you would use SET search_path TO schema1, public, for example; this way you wouldn't have to dynamically generate table names in your queries. I am planning to try this structure on a large database which stores logs and other tracking information.
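A small sketch of that schema-per-collection idea with psycopg2 (the DSN, schema, and table names are placeholders):

    import psycopg2

    conn = psycopg2.connect("dbname=testdb user=postgres")  # hypothetical DSN
    cur = conn.cursor()

    # One schema per data collection, with the same table names inside each.
    cur.execute("CREATE SCHEMA IF NOT EXISTS aaa")
    cur.execute("CREATE TABLE IF NOT EXISTS aaa.table_one "
                "(id serial PRIMARY KEY, payload text)")

    # Point unqualified table names at one collection for this session.
    cur.execute("SET search_path TO aaa, public")
    cur.execute("SELECT count(*) FROM table_one")  # resolves to aaa.table_one
    print(cur.fetchone())

    # Dropping a whole collection is a single statement.
    cur.execute("DROP SCHEMA aaa CASCADE")
    conn.commit()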
You can also change your tablespace if your OS has a limit or suffers from large directory sizes.