I have a question about a problem that I'm sure is pretty common, but I still couldn't find a satisfying answer.
Suppose I have a huge original SQLite database. Now I want to create an empty database that is initialized with all the tables, columns, and primary/foreign keys, but has no rows in the tables. Basically, I want to create an empty database using the original one as a template.
A "DELETE FROM {table_name}" query for every table in the original database won't do it, because the resulting empty database still ends up nearly as large as the original. As I understand it, this is because SQLite only marks the deleted pages as free rather than shrinking the file (you would have to run VACUUM for that). Also, I don't know exactly what else is stored in the original SQLite file; I believe it may keep other logs/garbage that I don't need, so I would prefer to create an empty database from scratch if possible. FWIW, I eventually need to do this inside a Python script with the sqlite3 library.
Would anyone please provide an example of how this can be done? A set of SQLite queries or a Python script that would do the job?
You can use the SQLite command-line shell:
sqlite3 old.db .schema | sqlite3 new.db
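Since you eventually need this from Python, the same idea works with the sqlite3 module alone, by replaying the CREATE statements stored in sqlite_master. A minimal sketch (the file names are placeholders; foreign keys survive because the original CREATE statements are copied verbatim):

```python
import sqlite3

def copy_schema(src_path, dst_path):
    """Create dst_path with the same tables, indexes, and triggers as
    src_path, but with no rows."""
    src = sqlite3.connect(src_path)
    dst = sqlite3.connect(dst_path)
    # sqlite_master holds the original CREATE statements; skip internal
    # objects such as sqlite_autoindex/sqlite_sequence (their sql is NULL
    # or they are maintained automatically by SQLite).
    for (sql,) in src.execute(
        "SELECT sql FROM sqlite_master "
        "WHERE sql IS NOT NULL AND name NOT LIKE 'sqlite_%'"
    ):
        dst.execute(sql)
    dst.commit()
    src.close()
    dst.close()
```

Usage would simply be `copy_schema("old.db", "new.db")`; since the new file is created from scratch, it contains none of the original's free pages.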
So I have this database (3.1 GB total), but most of the size is due to one specific table containing a lot of console output text from some test runs. The table itself is 2.7 GB, and I was wondering if there could be another solution for this table so that the database would get a lot smaller? It's getting a bit annoying to back up the database, or even copy it to a playground, because this table makes it so big.
The Table is this one
Would it be better to delete this table and store all the LogTextData (LONGTEXT) in PDFs instead of the database? (Then I can't back up this data, though...)
Does anyone have an idea how to make this table smaller, or another solution? I'm open to suggestions.
The console log data gets imported into the database by Python scripts, so I have full access to other Python solutions, if there are any.
You could try enabling either Storage-Engine Independent Column Compression or InnoDB page compression. Both provide ways to have a smaller on-disk database, which is especially useful for large text fields.
Since there's only one table with one particular field that's taking up space, trying out individual column compression seems like the easiest first step.
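For example (the table name `TestRunLogs` is made up, and the exact syntax depends on your server: the `COMPRESSED` column attribute is MariaDB-specific, while page compression needs MySQL 5.7+/MariaDB with `innodb_file_per_table` and a filesystem that supports hole punching):

```sql
-- MariaDB column compression: compress only the one huge column
ALTER TABLE TestRunLogs MODIFY LogTextData LONGTEXT COMPRESSED;

-- Alternative: InnoDB page compression for the whole table
ALTER TABLE TestRunLogs COMPRESSION='zlib';
OPTIMIZE TABLE TestRunLogs;  -- rebuild so existing pages get compressed
```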
In my opinion, you should store the paths of the log files instead of the complete logs in the database. Using those paths, you can access the files anytime you want.
It will decrease the size of the database too.
Your new table would look like this,
LogID, BuildID, JenkinsJobName, LogTextData.
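A minimal sketch of that approach, using sqlite3 so it runs standalone (the thread is about MySQL, where you would use your client's `%s` placeholders instead); the `logs` table and `LogPath` column are illustrative names replacing the raw LogTextData blob:

```python
import os
import sqlite3

def store_log(conn, log_dir, log_id, build_id, job_name, log_text):
    """Write the console output to a plain file and keep only its path
    in the database row."""
    os.makedirs(log_dir, exist_ok=True)
    path = os.path.join(log_dir, "{}.log".format(log_id))
    with open(path, "w") as f:
        f.write(log_text)
    conn.execute(
        "INSERT INTO logs (LogID, BuildID, JenkinsJobName, LogPath) "
        "VALUES (?, ?, ?, ?)",
        (log_id, build_id, job_name, path),
    )
    conn.commit()
    return path
```

Note the trade-off the question already mentions: the log files now live outside the database, so your backup procedure has to cover both the database and the log directory.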
I have 10,000 dataframes (which can all be transformed into JSONs). Each dataframe has 5,000 rows. So eventually it's quite a lot of data that I would like to insert into my AWS RDS database.
I want to insert them into my database, but I find the process using PyMySQL a bit too slow, as I iterate through every single row and insert it.
First question: is there a way to insert a whole dataframe into a table straight away? I've tried using the `to_sql` method from pandas, but it doesn't seem to work, as I am using Python 3.6.
Second question, should I use NoSQL instead of RDS? What would be the best way to structure my (big) data?
Many thanks
from sqlalchemy import create_engine
engine = create_engine("mysql://......rds.amazonaws.com")
con = engine.connect()
my_df.to_sql(name='Scores', con=con, if_exists='append')
The table "Scores" already exists, and I would like to put all of my dataframes into this specific table. Or is there a better way to organise my data?
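`to_sql` can also batch the INSERTs instead of issuing one round trip per row. A sketch below uses an in-memory SQLite engine purely so it runs without an RDS instance (for the real thing, keep the `mysql://...` URL from above); `method="multi"` and `chunksize` need a reasonably recent pandas (0.24+):

```python
import pandas as pd
from sqlalchemy import create_engine

# In-memory SQLite stands in for the RDS MySQL engine in this sketch.
engine = create_engine("sqlite://")

# Tiny placeholder dataframe; yours would be one of the 10,000.
my_df = pd.DataFrame({"player": ["a", "b", "c"], "score": [10, 20, 30]})

# chunksize batches rows per round trip; method="multi" packs many rows
# into a single INSERT statement, which is much faster than row-by-row.
my_df.to_sql(
    name="Scores",
    con=engine,
    if_exists="append",
    index=False,
    chunksize=1000,
    method="multi",
)
```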
It seems like you're either missing the package or the package is installed in a different directory. Use a file manager to look for the missing library libmysqlclient.21.dylib and then copy it to the folder containing /Users/anaconda3/lib/python3.6/site-packages/MySQLdb/_mysql.cpython-36m-darwin.so.
My best guess is it's in either your lib or MySQLdb directory. You may also be able to find it in a virtual environment that you have set up.
All of this is in the context of an ETL process. I have a git repository full of SQL files. I need to put all those SQL files (once pulled) into a SQL table with 2 columns, name and query, so that I can later access each file using a SQL query instead of loading it from the file path. How can I do this? I am free to use any tool I want, but I only know Python and Pentaho.
Maybe the assumption that this method would take less time than simply reading the pulled file from the hard drive is wrong. In that case, let me know.
You can first define the table you're interested in using something along the lines of (you did not mention the database you are using):
CREATE TABLE queries (
name TEXT PRIMARY KEY,
query TEXT
);
After creating the table, you can use os.walk to iterate through the files in your repository and insert both the contents (e.g. file.read()) and the name of each file into the table you created previously.
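A sketch of that loop, assuming SQLite for concreteness since the question doesn't say which database is in use (INSERT OR REPLACE is SQLite-specific; adapt the upsert for your server):

```python
import os
import sqlite3

def load_queries(repo_dir, conn):
    """Walk repo_dir and upsert every .sql file into the queries table:
    file name -> name column, file contents -> query column."""
    for root, _dirs, files in os.walk(repo_dir):
        for fname in files:
            if not fname.endswith(".sql"):
                continue
            with open(os.path.join(root, fname)) as f:
                contents = f.read()
            # INSERT OR REPLACE keeps the table in sync on repeated pulls.
            conn.execute(
                "INSERT OR REPLACE INTO queries (name, query) VALUES (?, ?)",
                (fname, contents),
            )
    conn.commit()
```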
It sounds like you're trying to solve a different problem though. It seems like you're interested in speeding up some process, because you asked about whether accessing queries using a table would be faster than opening a file on disk. To investigate that (separate!) question further, see this.
I would recommend that you profile the existing process you are trying to speed up using profiling tools. After that, you can see whether IO is your bottleneck. Otherwise, you may do all of this work without any benefit.
As a side note, if you are looking up queries in this way, it may indicate that you need to rearchitect your application. Please consider that possibility as well.
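The profiling step can be done entirely with the standard library; `existing_process` below is a stand-in for the real ETL step you want to measure:

```python
import cProfile
import io
import pstats

def existing_process():
    # Stand-in for the code you want to speed up.
    return sum(i * i for i in range(100000))

profiler = cProfile.Profile()
profiler.enable()
existing_process()
profiler.disable()

out = io.StringIO()
# Sort by cumulative time to see where the process actually spends time;
# heavy built-in IO calls showing up here would confirm an IO bottleneck.
pstats.Stats(profiler, stream=out).sort_stats("cumulative").print_stats(5)
print(out.getvalue())
```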
I am new to Python, so this might have wrong syntax as well. I want to bulk-update my SQL database. I have already created a column in the database with null values. My aim is to compute a moving average in Python and then use those values to update the database. But the problem is that the moving averages I have computed are in a data table format where the columns are time blocks and the rows are dates, while in the database dates and time blocks are separate columns.
MA30 = pd.rolling_mean(df, 30)
cur.execute(""" UPDATE feeder_ndmc_copy
SET time_block = CASE date, (MA30)""")
db.commit()
This is the error I am getting.
self.errorhandler(self, exc, value)
I have seen a lot of other questions and answers, but there is no example of how to use the result of a Python computation to update the database. Any suggestions?
Ok, so it is very hard to give you a complete answer with what little information your question contains, but I'll try my best to explain how I would deal with this.
The easiest way is probably to automatically write a separate UPDATE query for each row you want to update. If I'm not mistaken, this will be relatively efficient on the database side, but it will produce some overhead in your Python program. I'm not a database guy, but since you didn't mention performance in your question, I will assume that any solution that works will do for now.
I will be using sqlalchemy to handle interactions with the database. Take care that if you want to copy my code, you will need to install sqlalchemy and import the module in your code.
First, we will need to create a sqlalchemy engine. I will assume that you use a local database, if not you will need to edit this part.
engine = sqlalchemy.create_engine('mysql://localhost/yourdatabase')
Now let's create a string containing all our queries (I don't know the names of the columns you want to update, so I'll use placeholders; I also don't know the format of your time index, so I'll have to guess):
queries = ''
for index, value in MA30.iterrows():
    queries += "UPDATE feeder_ndmc_copy SET column_name = {} WHERE index_column_name = '{}';\n".format(value, index.strftime('%Y-%m-%d %H:%M:%S'))
You will need to adapt this part heavily to conform with your requirements. I simply can't do any better without you supplying the proper schema of your database.
With the list of queries complete, we proceed to put the data into SQL:
with engine.begin() as connection:
    connection.execute(queries)
Edit:
Obviously, my solution does not deal with cases where your pandas operations create data points for timestamps that do not exist in MySQL, etc. You need to keep that in mind. If that is a problem, you will need to use queries of the form
INSERT INTO table (id,Col1,Col2) VALUES (1,1,1),(2,2,3),(3,9,3),(4,10,12)
ON DUPLICATE KEY UPDATE Col1=VALUES(Col1),Col2=VALUES(Col2);
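Driven from Python, that upsert can be a single parameterized executemany call, which also avoids building SQL by string formatting. The sketch below uses SQLite's equivalent ON CONFLICT syntax (SQLite 3.24+) so it runs standalone; with MySQL you would keep the ON DUPLICATE KEY UPDATE form above and `%s` placeholders:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE t (id INTEGER PRIMARY KEY, Col1 REAL, Col2 REAL)")

# The values from the SQL example above, as parameter tuples.
rows = [(1, 1, 1), (2, 2, 3), (3, 9, 3), (4, 10, 12)]

# One parameterized statement executed per row; SQLite's upsert clause
# plays the role of MySQL's ON DUPLICATE KEY UPDATE.
conn.executemany(
    "INSERT INTO t (id, Col1, Col2) VALUES (?, ?, ?) "
    "ON CONFLICT(id) DO UPDATE SET Col1=excluded.Col1, Col2=excluded.Col2",
    rows,
)
conn.commit()
```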
I am pretty new to Python, so I'd like to ask you for some advice about the right strategy.
I have a text file with fixed positions for the data, like this.
It can have more than 10000 rows. At the end the database (SQL) table should look like this. File & Table
The important column is nr. 42. It defines the kind of data in this row
(2 → Title, 3 → Text, 6 → Amount and Price). So the data comes from different rows.
QUESTIONS:
Reading the data: since there are always more than 4 rows containing the data, should I process them line by line, sending each SQL statement as soon as it is complete? Or should I read all the lines into a list of lists and then iterate over those lists? Or read all the lines into one list and iterate over that?
Would it be better to convert the data into CSV or JSON instead of preparing SQL statements, and then use the database software to import it into the db? (Or use a NoSQL DB?)
I hope I made my problems clear, if not, I will try.....
Every advice is really appreciated.
The problem is pretty simple, so perhaps you are overthinking it a bit. My suggestion is to use the simplest solution: read a line, parse it, prepare an SQL statement and execute it. If the database is around 10000 records, anything will work; e.g. SQLite would do just fine. The data is already in the form of a table, so translating it to a relational database like SQLite or MySQL is a pretty obvious and straightforward choice. If you need a different type of organization in your data, then you can look at other types of databases: just don't do it only because it is "fashionable".
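A minimal sketch of that read-parse-insert loop with SQLite. Since the sample file isn't reproduced here, the slice positions and the payload layout are made up; only "position 42 holds the record type (2=Title, 3=Text, 6=Amount and Price)" comes from the question:

```python
import sqlite3

def parse_line(line):
    """Split one fixed-width line into (kind, payload).  The positions
    here are hypothetical; adjust the slices to your real layout."""
    kind = line[41]               # column nr. 42 is zero-based index 41
    payload = line[:41].rstrip()  # assume the data sits before it
    return kind, payload

def load(path, conn):
    conn.execute(
        "CREATE TABLE IF NOT EXISTS records (kind TEXT, payload TEXT)"
    )
    with open(path) as f:
        for line in f:  # line by line: no need to hold 10000+ rows in RAM
            line = line.rstrip("\n")
            if len(line) < 42:
                continue  # skip lines too short to carry a type code
            conn.execute(
                "INSERT INTO records (kind, payload) VALUES (?, ?)",
                parse_line(line),
            )
    conn.commit()
```

Wrapping the whole loop in one transaction (as the single final `commit()` does here) keeps even the 10000-row case fast in SQLite.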