I am trying to create an SQLite database for an app in Python.
So far my database contains one table, and it already contains 500 entries, I think; I say this because that is what the built-in database tool in PyCharm tells me. I then created another table to hold the rest of the data. I assumed the data did not go into that table because it wasn't showing in the database tool, so I created another database to insert the rest of the data.
When I tried to delete some of the data from the first database, it was deleted but replaced with some of the data I previously thought hadn't been entered in the first place due to a 500-row limit. I did this in PyCharm, and all along it had thrown no exceptions. The driver it used was the Xerial driver.
What am I doing wrong, and how can I put all the data in one table? The final table is going to have a little over 1000 entries.
SQLite has a theoretical maximum row count of 2^64 rows:
The theoretical maximum number of rows in a table is 2^64 (18446744073709551616 or about 1.8e+19). This limit is unreachable since the maximum database size of 140 terabytes will be reached first. A 140-terabyte database can hold no more than approximately 1e+13 rows, and then only if there are no indices and if each row contains very little data.
PyCharm displays database results in pages of a fixed size; use the paging controls (the left and right arrow buttons in the result page toolbar) to page through the results.
You can adjust the page size in your settings; see IDE settings -> Database. I strongly suspect that the default is set to 500.
A more reliable way to count your current rows is to query the database:
SELECT COUNT(*) FROM <name_of_table>
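If you want to check this from Python rather than from the IDE, here is a minimal sketch using the standard sqlite3 module; the database path and table name are placeholders for yours:

import sqlite3

# adjust the path and table name to match your database
conn = sqlite3.connect("app.db")
row_count = conn.execute("SELECT COUNT(*) FROM my_table").fetchone()[0]
print(row_count)
conn.close()

If this prints more than 500, the rows are all there and you were only seeing the first result page in PyCharm.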
Related
I have a very large (and growing) table of URLs, and I want to query the table to check whether an item exists and return that item so I can edit it, or else choose to add a new item. The code below works but runs very, very slowly, and given the volume of queries I need to perform (several thousand per hour) it is creating some issues. I haven't been able to find a better solution than the one below. I have a good sense of what is happening - it is loading the entire table every time - but there must be a faster way here.
from sqlalchemy.orm import sessionmaker

# engine and the mapped Link model (with id and URL columns) are defined elsewhere
Session = sessionmaker(bind=engine)
formatted_url = "%{}%".format(url)
matching_url = None
with Session.begin() as session:
    matching_url = session.query(Link.id).filter(Link.URL.like(formatted_url)).yield_per(200).first()
This works great if the URL exists and is recent, but especially if the URL isn't in the database at all, the process takes as long as one minute.
You are effectively doing a select id from link where url like '%formatted_url%' limit 1;
This needs a full table scan in the database.
If you are lucky, the row is still in memory or cache.
If not, or if the row does not exist at all, the database will need that full table scan.
If you are using Postgres on Cloud SQL, this question will help you remediate the problem: PostgreSQL: Full Text Search - How to search partial words?
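For example, if substring matching with LIKE '%...%' really is what you need on Postgres, a trigram index can let the planner avoid the full table scan. A rough sketch with SQLAlchemy, assuming the table is called link and the column url (adjust both to your schema):

from sqlalchemy import create_engine, text

engine = create_engine("postgresql+psycopg2://user:password@host/dbname")  # placeholder URL
with engine.begin() as conn:
    # pg_trgm provides trigram matching so LIKE '%term%' can use an index
    conn.execute(text("CREATE EXTENSION IF NOT EXISTS pg_trgm"))
    conn.execute(text("CREATE INDEX IF NOT EXISTS ix_link_url_trgm ON link USING gin (url gin_trgm_ops)"))

The existing Link.URL.like(...) filter can stay as it is; whether the new index is actually used can be checked with EXPLAIN.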
I have an old MyISAM table where, when I submit a count query, the table gets locked. If I run the same query on the same table converted to InnoDB, the query executes fast. The problem is that the old MyISAM table is still used in production and is under heavy load, while the new one is not.
Now we come to my problem and question. When I run EXPLAIN on the queries executed against both tables, I get results that confuse me.
Here is the query that I am executing against both tables:
SELECT COUNT(*)
FROM table
WHERE vrsta_dokumenta = 3
AND dostupnost = 0
Here is the explain from the old MyISAM table:
id  select_type  table      type  possible_keys        key                  key_len  ref    rows    Extra
1   SIMPLE       old_table  ref   idx_vrsta_dokumenta  idx_vrsta_dokumenta  1        const  564253  Using where
And here is the explain from the new InnoDB table:
id  select_type  table      type  possible_keys        key                  key_len  ref    rows    Extra
1   SIMPLE       new_table  ref   idx_vrsta_dokumenta  idx_vrsta_dokumenta  1        const  611905  Using where
As you can see, the rows count for the new table is higher than for the old one.
So, in the case that a higher number is bad, does this mean that the query on the new table will be slower once it is fully in use?
In case a higher number is good, then maybe that is the reason why the new table is faster, and MyISAM gets locked after some time of execution.
Anyway, what is correct? What does this rows count mean?
EDIT: the old table has twice as many columns as the new one, since the old one has been split into 2 tables.
black-room-boy:
So, in the case that a higher number is bad, does this mean that the query on the new table will be slower once it is fully in use?
The MySQL manual says this about the rows column in EXPLAIN:
The rows column indicates the number of rows MySQL believes it must examine to execute the query.
For InnoDB tables, this number is an estimate, and may not always be exact.
So, a higher number is not bad; it's just an estimate based upon the table's metadata.
black-room-boy:
In case a higher number is good, then maybe that is the reason why the new table is faster, and MyISAM gets locked after some time of execution.
A higher number is not good either. MyISAM does not get locked because of this number.
Manual:
MySQL uses table-level locking for MyISAM, allowing only one session to update those tables at a time, making them more suitable for read-only, read-mostly, or single-user applications.
... Table updates are given higher priority than table retrievals... If you have many updates for a table, SELECT statements wait until there are no more updates.
If your table is frequently updated, it gets locked by INSERT, UPDATE, and DELETE (a.k.a. DML) statements, and this blocks your SELECT query.
The rows count tells you how many rows MySQL expects it will have to inspect in order to obtain the result for your query. This is where indexes help, and where a number called index cardinality plays a very important role. Indexes help MySQL cut down on inspecting rows, and the fewer rows it has to inspect, the faster it is. Since there are many rows that satisfy your condition of vrsta_dokumenta = 3 AND dostupnost = 0, MySQL simply has to go through them, find them, and increment the counter.

For MyISAM that means MySQL has to read the data from disk, which is expensive because disk access is very slow. For InnoDB it's extremely quick because InnoDB can keep its working data set in memory. In other words, MyISAM reads from disk while InnoDB reads from memory. There are other optimizations available, but the general rule is that InnoDB will be quicker than MyISAM, even as the table grows.
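As an illustration of how an index cuts down the rows inspected (this is not from the original question, and the index name and connection details are made up), a composite index covering both WHERE columns lets InnoDB answer the count from the index alone instead of touching the row data:

import mysql.connector

# placeholder connection parameters
conn = mysql.connector.connect(user="app", password="secret", database="mydb")
cur = conn.cursor()
# composite index on both columns used in the WHERE clause; the name is made up
cur.execute("CREATE INDEX idx_vrsta_dostupnost ON new_table (vrsta_dokumenta, dostupnost)")
cur.execute("SELECT COUNT(*) FROM new_table WHERE vrsta_dokumenta = 3 AND dostupnost = 0")
print(cur.fetchone()[0])
conn.close()

Running EXPLAIN on the count afterwards should show the new index in the key column and "Using index" in Extra.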
Just a logic question really... I have a script that takes rows of data from a CSV, parses the cell values to make the data uniform, and checks the database that a key/primary value does not already exist, so as to prevent duplicates! At the moment, the first 10-15k entries commit to the DB fairly quickly, but then it really starts slowing down as there are more entries in the DB to check against for duplicates... by the time there are 100k rows in the DB, the commit speed is about 1/sec, argh...
So my question: is it (Pythonically) more efficient to extract and parse the data separately from the DB commit procedure (maybe in a class-based script, or could I add multiprocessing to the CSV parsing or the DB commit)? And is there a quicker method to check the database for duplicates if I am only cross-referencing 1 table and 1 value?
Much appreciated
Kuda
If the first 10-15k entries worked fine, the issue is probably with the database query. Do you have a suitable index, and is that index used by the database? You can use an EXPLAIN statement to see what the database is doing and whether it actually uses the index for the particular query that Django generates.
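For example, from a Django shell you can look at the query the ORM generates and ask the database for its plan; the model and field names here are placeholders for yours:

from myapp.models import Entry  # placeholder app and model

qs = Entry.objects.filter(url="https://example.com/some-page")  # the duplicate check
print(qs.query)      # the SQL Django will send
print(qs.explain())  # the database's execution plan (Django 2.1+)

If the plan shows a full table scan rather than an index lookup on the key column, that explains the slowdown as the table grows.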
If the table starts empty, it might also help to run ANALYZE TABLE after the first few thousand rows; the query optimiser might have stale statistics from when the table was empty. To test this hypothesis, you can connect to the database while the script is running, when it starts to slow down, and run ANALYZE TABLE manually. If it immediately speeds up, the problem was indeed stale statistics.
As for optimisation of database commits themselves, it probably isn't an issue in your case (since the first 10k rows perform fine), but one aspect is the round-trips; for every query, it has to go to the database and get the results back. This is especially noticeable if the database is across a network. If you need to speed that up, Django has a bulk_create() method to insert many rows at once. However, if you do that, you'll only get an error for the whole batch of rows if you try to insert duplicates forbidden by the database indexes; you'll then have to find the particular row causing the error using other code.
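One way to avoid that whole-batch error problem is to check a batch of keys against the database in a single query first, and then bulk-insert only the remainder. A rough sketch, assuming a model with a unique url field (the model and field names are placeholders, not from the question):

from myapp.models import Entry  # placeholder model with a unique "url" field

def insert_batch(rows):
    """rows is a list of dicts parsed from the CSV, each with a "url" key."""
    urls = [row["url"] for row in rows]
    # one round-trip to find which keys already exist
    existing = set(Entry.objects.filter(url__in=urls).values_list("url", flat=True))
    new_objects = [Entry(**row) for row in rows if row["url"] not in existing]
    # one round-trip to insert everything that is not a duplicate;
    # ignore_conflicts needs a real unique constraint in the database to be meaningful
    Entry.objects.bulk_create(new_objects, ignore_conflicts=True)

With a proper index on the key column, both queries stay fast regardless of how many rows are already in the table.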
I have a specific problem where I have to query different databases to show results in a dashboard. The tables that I have to query in those databases are exactly the same. The number of databases can be at most 50 and at minimum 5.
NOTE: I can't put all the data in same database.
I am using Postgres and Django. I am not able to work out how to query those databases to get the data. I also need to filter, aggregate, and sort the data and show 10-100 results based on the search query params.
APPROACH that I have in mind
Loop through all the databases, fetch the data based on the search params, and order it by created date. After that, take 10-100 results as per the search params.
I am not able to understand what the correct approach should be and how it should be done, considering speed and reliability.
I am open to using any other database for temporary storage, or to any other ideas.
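A minimal sketch of the loop-and-merge approach described above, using Django's ability to route a queryset to a specific database with .using(); the model name, field names, and database aliases are assumptions, not from the question:

from itertools import chain
from myapp.models import Document  # placeholder model present in every database

def search_all(databases, search_params, limit=100):
    """databases: list of aliases from settings.DATABASES, e.g. ["shard1", "shard2"]."""
    partial_results = []
    for alias in databases:
        qs = (Document.objects.using(alias)
              .filter(**search_params)
              .order_by("-created_date")[:limit])  # never pull more than we can show
        partial_results.append(list(qs))
    # merge the per-database results and keep the newest `limit` rows overall
    merged = sorted(chain.from_iterable(partial_results),
                    key=lambda d: d.created_date, reverse=True)
    return merged[:limit]

Since each database returns at most `limit` rows, the merge step stays small even with 50 databases.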
I want to select one record at a time from a MySQL table. I found a similar post on SO here -
How to select records one by one without repeating
However, in my case, the table size is not fixed; data is continuously being added to the table, and I want to select one record at a time from this table. Also, I'm using Python to connect to the MySQL database and do processing on each record. Any pointers?
P.S.: The table is very large, hence I cannot compute the number of records in the table every time.
This functionality isn't built into SQL.
If you have a dense, incrementing, indexed column on the table, then you just do this:
i = 0
while True:
    # your usual select here, but with 'AND MyIndexColumn = %d' % (i,)
    # appended to the WHERE clause
    i += 1
With some databases, there's such a column built in, whether you want it or not, usually called either "RowID" or "Row ID"; if not, you can always add such a column explicitly.
If the values can be skipped (and usually they can, if for no other reason than that someone can delete a row from the middle), you have to be able to handle a select that returns 0 rows, of course.
The only issue is that there's no way to tell when you're done. But that's inherent in your initial problem statement: data is being continuously added, and you don't know the size of it in advance, so how could you? If you need to know when the producer is done, it has to somehow (possibly through another table in the database) give you the highest rowid it created.
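A sketch of that idea in Python, remembering the last id processed and asking for the next one greater than it, which also copes with gaps left by deleted rows; the connection details, table, column names, and process() function are placeholders:

import time
import mysql.connector

# placeholders: adjust connection details, table and column names
conn = mysql.connector.connect(user="app", password="secret", database="mydb")
cur = conn.cursor()

last_id = 0
while True:
    cur.execute(
        "SELECT id, payload FROM items WHERE id > %s ORDER BY id LIMIT 1",
        (last_id,),
    )
    row = cur.fetchone()
    if row is None:
        time.sleep(1)   # nothing new yet; the producer may still be adding rows
        continue
    last_id = row[0]
    process(row)        # your per-record processing goes here (placeholder)

The sleep-and-retry when no row comes back is one simple way to deal with not knowing when the producer is done.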