I am new to Python, so this might have wrong syntax as well. I want to bulk update my SQL database. I have already created a column in the database with null values. My aim is to compute the moving average in Python and then use those values to update the database. The problem is that the moving averages I have computed are in a data table format where the columns are time blocks and the rows are dates, but in the database dates and time blocks are separate columns.
MA30 = pd.rolling_mean(df, 30)
cur.execute(""" UPDATE feeder_ndmc_copy
SET time_block = CASE date, (MA30)""")
db.commit()
This is the error I am getting.
self.errorhandler(self, exc, value)
I have seen a lot of other questions and answers, but there is no example of how to use the result of a Python computation to update the database. Any suggestions?
Ok, so it is very hard to give you a complete answer with what little information your question contains, but I'll try my best to explain how I would deal with this.
The easiest way is probably to automatically write a separate UPDATE query for each row you want to update. If I'm not mistaken, this will be relatively efficient on the database side of things, but it will produce some overhead in your Python program. I'm not a database guy, but since you didn't mention performance requirements in your question, I will assume that any solution that works will do for now.
I will be using sqlalchemy to handle interactions with the database. Take care that if you want to copy my code, you will need to install sqlalchemy and import the module in your code.
First, we will need to create a sqlalchemy engine. I will assume that you use a local database; if not, you will need to edit this part.
import sqlalchemy

engine = sqlalchemy.create_engine('mysql://localhost/yourdatabase')
Now let's build a list containing all our queries (I don't know the names of the columns you want to update, so I'll use placeholders; I also do not know the format of your time index, so I'll have to guess):
queries = []
for index, value in MA30.iterrows():
    # the date has to be quoted in the SQL statement, and strftime needs a string argument
    queries.append("UPDATE feeder_ndmc_copy SET column_name = {} WHERE index_column_name = '{}';".format(value, index.strftime('%Y-%m-%d %H:%M:%S')))
You will need to adapt this part heavily to conform with your requirements. I simply can't do any better without you supplying the proper schema of your database.
With the list of queries complete, we proceed to put the data into SQL:
with engine.begin() as connection:  # begin() checks out a connection and wraps everything in a transaction
    for query in queries:
        connection.execute(query)
Edit:
Obviously my solution does not deal in any way with cases where your pandas operations create data points for timestamps that do not exist in MySQL, etc. You need to keep that in mind. If that is a problem, you will need to use queries of the form
INSERT INTO table (id,Col1,Col2) VALUES (1,1,1),(2,2,3),(3,9,3),(4,10,12)
ON DUPLICATE KEY UPDATE Col1=VALUES(Col1),Col2=VALUES(Col2);
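For completeness, a rough sketch of how that upsert variant could be driven from the MA30 frame is shown below. The column names date, time_block and ma30 are guesses, and I am assuming a unique key on (date, time_block); adapt everything to your actual schema.

import pandas as pd
import sqlalchemy

engine = sqlalchemy.create_engine('mysql://localhost/yourdatabase')

upsert = sqlalchemy.text(
    "INSERT INTO feeder_ndmc_copy (date, time_block, ma30) "
    "VALUES (:date, :time_block, :ma30) "
    "ON DUPLICATE KEY UPDATE ma30 = VALUES(ma30)"
)

# Reshape the wide moving-average table (dates as rows, time blocks as columns)
# into one parameter dict per (date, time_block) pair, skipping the NaNs that
# the rolling window produces for the first 29 dates.
rows = [
    {'date': date, 'time_block': block, 'ma30': float(val)}
    for date, row in MA30.iterrows()
    for block, val in row.items()
    if pd.notnull(val)
]

with engine.begin() as connection:
    connection.execute(upsert, rows)  # sent as a single executemany call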
I have a question about a problem that I'm sure comes up pretty often, but I still couldn't find any satisfying answers.
Suppose I have a huge original SQLite database. Now, I want to create an empty database that is initialized with all the tables, columns, and primary/foreign keys, but has no rows in the tables. Basically, I want to create an empty database using the original one as a template.
The "DELETE from {table_name}" query for every table in the initial database won't do it because the resulting empty database ends up having a very large size, just like the original one. As I understand it, this is because all the logs about the deleted rows are still being stored. Also, I don't know exactly what else is being stored in the original SQLite file, I believe it may store some other logs/garbage things that I don't need, so I would prefer just creating an empty one from scratch if it is possible. FWIW, I eventually need to do it inside a Python script with sqlite3 library.
Would anyone please provide me with an example of how this can be done? A set of SQLite queries or a python script that would do the job?
You can use the Command Line Shell For SQLite:
sqlite3 old.db .schema | sqlite3 new.db
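If you eventually need to do this from inside a Python script with the sqlite3 module, a rough equivalent (with old.db and new.db as placeholder file names) is to read the CREATE statements from sqlite_master and replay them into the new file:

import sqlite3

src = sqlite3.connect('old.db')
dst = sqlite3.connect('new.db')

# sqlite_master holds the CREATE statements for tables, indexes, triggers and views;
# internal objects (sqlite_autoindex, sqlite_sequence, ...) are skipped.
schema = src.execute(
    "SELECT sql FROM sqlite_master "
    "WHERE sql IS NOT NULL AND name NOT LIKE 'sqlite_%'"
).fetchall()

for (statement,) in schema:
    dst.execute(statement)

dst.commit()
src.close()
dst.close()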
I have 10,000 dataframes (which can all be transformed into JSONs). Each dataframe has 5,000 rows. So, eventually it's quite a lot of data that I would like to insert to my AWS RDS databases.
I want to insert them into my databases but I find the process using PyMySQL a bit too slow as I iterate through every single row and insert them.
First question: is there a way to insert a whole dataframe into a table straight away? I've tried using the "to_sql" method from pandas, but it doesn't seem to work, as I am using Python 3.6.
Second question, should I use NoSQL instead of RDS? What would be the best way to structure my (big) data?
Many thanks
from sqlalchemy import create_engine
engine = create_engine("mysql://......rds.amazonaws.com")
con = engine.connect()
my_df.to_sql(name='Scores', con=con, if_exists='append')
The table "Scores" is already existing and I would like to put all of my databases into this specific table. Or is there a better way to organise my data?
It seems like you're either missing the package or the package is installed in a different directory. Use a file manager to look for the missing library libmysqlclient.21.dylib and then copy it to the directory containing the module that is trying to load it, /Users/anaconda3/lib/python3.6/site-packages/MySQLdb/_mysql.cpython-36m-darwin.so (i.e. the MySQLdb folder).
My best guess is it's in either your lib or MySQLdb directory. You may also be able to find it in a virtual environment that you have set up.
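Once the driver import is sorted out, a batched insert along the following lines should be much faster than inserting row by row. The connection string, credentials, and list_of_dataframes are placeholders, and method='multi' needs a reasonably recent pandas version; treat this as a sketch to adapt, not a drop-in solution.

import pandas as pd
from sqlalchemy import create_engine

# Placeholder connection string -- fill in your RDS endpoint, user, password and schema.
engine = create_engine("mysql+pymysql://user:password@your-instance.rds.amazonaws.com/yourdatabase")

for df in list_of_dataframes:  # your 10,000 dataframes, however you hold them
    df.to_sql(
        name='Scores',
        con=engine,
        if_exists='append',
        index=False,      # do not write the dataframe index as an extra column
        chunksize=1000,   # send rows in batches of 1000
        method='multi',   # one multi-row INSERT per batch (pandas >= 0.24)
    )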
Everything explained here is in the context of an ETL process. I have a git repository full of SQL files. I need to put all those SQL files (once pulled) into a SQL table with two columns, name and query, so that I can later access each file with a SQL query instead of loading it from its file path. How can I do this? I am free to use whatever tool I want, but I only know Python and Pentaho.
Maybe the assumption that this method would take less time than simply reading the pulled file from the hard drive is wrong. In that case, let me know.
You can first define the table you're interested in using something along the lines of (you did not mention the database you are using):
CREATE TABLE queries (
name TEXT PRIMARY KEY,
query TEXT
);
After creating the table, you can perhaps use os.walk to iterate through the files in your repository and insert both the contents of each file (e.g. file.read()) and its name into the table you created previously.
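A minimal sketch of that, assuming SQLite and placeholder paths (the question did not say which database is in use), might look like this:

import os
import sqlite3  # stand-in database; swap in the driver for whatever you actually use

REPO_PATH = 'path/to/your/repo'  # placeholder

conn = sqlite3.connect('queries.db')
conn.execute("CREATE TABLE IF NOT EXISTS queries (name TEXT PRIMARY KEY, query TEXT)")

for root, _dirs, files in os.walk(REPO_PATH):
    for filename in files:
        if filename.endswith('.sql'):
            with open(os.path.join(root, filename)) as f:
                contents = f.read()
            # store the file name and its full text; re-running replaces existing rows
            conn.execute(
                "INSERT OR REPLACE INTO queries (name, query) VALUES (?, ?)",
                (filename, contents),
            )

conn.commit()
conn.close()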
It sounds like you're trying to solve a different problem though. It seems like you're interested in speeding up some process, because you asked about whether accessing queries using a table would be faster than opening a file on disk. To investigate that (separate!) question further, see this.
I would recommend that you profile the existing process you are trying to speed up using profiling tools. After that, you can see whether IO is your bottleneck. Otherwise, you may do all of this work without any benefit.
As a side note, if you are looking up queries in this way, it may indicate that you need to rearchitect your application. Please consider that possibility as well.
I have some sample data, like:
data = {'tag':'ball','color':'red'}
I want to insert it into my collection, but if the collection already has the same document, I do not want to insert it again.
I can do it in Python like this:
if not collection.find_one(data):
    collection.insert(data)
Or, I can do it with update:
collection.update(data,data,upsert=True)
But my question is: does the 'update' write to the database every time?
Both methods search for the duplicate data in the same way, but with the first one, the write only happens when the document does not exist.
With the second one, does it mean that if the data does not exist, a document is inserted, and if the data exists, it is updated? Is the database written to in every case?
So, which method is better, and why?
I recently came across a situation where I only wanted to write to the database if a record didn't exist. In this case, your first method is better.
With the second method, the database is written to in every case. However, if you want certain fields not to be updated, you can try $setOnInsert, as explained here: https://docs.mongodb.com/manual/reference/operator/update/setOnInsert/
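A minimal sketch of that $setOnInsert variant with the newer update_one API (the local connection and the database/collection names are placeholder assumptions):

from pymongo import MongoClient

client = MongoClient()  # assumes a MongoDB instance on localhost
collection = client.test_database.test_collection  # placeholder names

data = {'tag': 'ball', 'color': 'red'}

# The fields are only written when the upsert actually inserts a new document;
# an existing matching document is left untouched.
result = collection.update_one(data, {'$setOnInsert': data}, upsert=True)
print(result.matched_count, result.upserted_id)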
I am pretty new to Python, so I would like to ask you for some advice about the right strategy.
I have a text file with the data at fixed positions, as shown in the linked example. It can have more than 10,000 rows. In the end, the database (SQL) table should look like the linked example (File & Table).
The important column is no. 42; it defines the kind of data in the row (2 -> Titel, 3 -> Text, 6 -> Amount and Price). So the data for one record comes from several different rows.
QUESTIONS:
Reading the data: since there are always more than 4 rows containing the data for one record, should I process the file line by line and send each SQL statement as soon as it is complete? Or should I read all the lines into a list of lists and then iterate over those lists? Or read all the lines into one list and iterate over that?
Would it be better to convert the data into CSV or JSON instead of preparing SQL statements, and then use the database software to import it into the DB? (Or use a NoSQL DB?)
I hope I made my problems clear, if not, I will try.....
Any advice is really appreciated.
The problem is pretty simple, so perhaps you are overthinking it a bit. My suggestion is to use the simplest solution: read a line, parse it, prepare an SQL statement and execute it. If the database is around 10,000 records, anything will work; e.g. SQLite would do just fine. The data is already in the form of a table, so translating it to a relational database like SQLite or MySQL is a pretty obvious and straightforward choice. If you need a different type of organization for your data, then you can look at other types of databases, but don't do it only because it is "fashionable".
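A minimal sketch of that line-by-line approach, assuming SQLite, a made-up target table, and made-up field positions (only column 42 and the 2/3/6 type codes come from the question; I also assume a type-2 row starts a new record):

import sqlite3

conn = sqlite3.connect('records.db')  # placeholder database file
conn.execute(
    "CREATE TABLE IF NOT EXISTS records (title TEXT, text TEXT, amount TEXT, price TEXT)"
)

def flush(record):
    # write one assembled record to the database
    if record:
        conn.execute(
            "INSERT INTO records (title, text, amount, price) VALUES (?, ?, ?, ?)",
            (record.get('title'), record.get('text'),
             record.get('amount'), record.get('price')),
        )

current = {}  # fields collected so far for the record being read

with open('input.txt') as f:  # placeholder input file
    for line in f:
        kind = line[41]            # column 42 (0-based index 41) tags the row type
        if kind == '2':            # Titel row: assumed to start a new record
            flush(current)
            current = {'title': line[0:40].strip()}   # made-up positions
        elif kind == '3':
            current['text'] = line[0:40].strip()
        elif kind == '6':
            current['amount'] = line[0:20].strip()
            current['price'] = line[20:40].strip()

flush(current)  # do not forget the last record
conn.commit()
conn.close()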