I'm in the process of learning the pandas library. My task is to download a table from a website, transform it, and load it into a database - in this case MS Access. I pull the data into a DataFrame.
My problem is that one of the columns of the table (a price column) contains the value '-'. Looking for information on how to deal with it, I found three main possibilities:
1. Replace '-' with 0 using replace. This doesn't meet my expectations, because '-' means missing data, not a value equal to 0.
2. Replace '-' with an empty string. This won't work either, because the column is supposed to end up with a float data type.
3. Replace '-' with NaN using .replace('-', np.nan). This comes closest to solving my problem, but after loading the data into Access through pyodbc, the replaced records have the value '1,#QNAN'. I take it that this is how Access represents NaN, but the problem shows up when I try to take the column average with SQL:
SELECT AVG(nameColumn) FROM nameTable
the query returns an 'Overflow' message.
Does anyone have an idea what to do with the '-' values? Is there any way to make the numeric field simply empty (NULL) after loading?
EDIT - more code:
import pyodbc

conn = pyodbc.connect(r'Driver={Microsoft Access Driver (*.mdb, *.accdb)};DBQ=PathToDB;')
cursor = conn.cursor()
for index, row in df.iterrows():
    cursor.execute("INSERT INTO tableName (col1, col2, col3) VALUES (?, ?, ?)",
                   row['col1'], row['col2'], row['col3'])
conn.commit()
cursor.close()
conn.close()
EDIT 2 - more code:
import pandas as pd

d = {'col1': [1, 2, '-'], 'col2': [5, '-', 3]}
dfstack = pd.DataFrame(data=d)
dfstack.head()

dfstack = dfstack.replace("-", None)
dfstack.head()
Maybe you could replace '-' with Python's None keyword? I'm not sure how pyodbc works, but SQL's AVG function ignores NULL values, and pyodbc might convert None to NULL.
https://www.sqlservertutorial.net/sql-server-aggregate-functions/sql-server-avg/
You need to replace the '-' with None, which pyodbc converts to NULL on insert. Note that dfstack.replace("-", None) does not do what you might expect: when value is None, pandas (at least in older versions) treats the call as a forward-fill rather than a substitution, which is why where() is used instead:
dfstack = dfstack.where(dfstack != '-', None)
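Putting the pieces together, a minimal sketch of the whole cleanup (building on the snippet from EDIT 2; pd.to_numeric is one way to get float columns, and the final where() swaps NaN for None so pyodbc inserts NULL):

import pandas as pd

d = {'col1': [1, 2, '-'], 'col2': [5, '-', 3]}
dfstack = pd.DataFrame(data=d)

# Coerce to numbers: '-' becomes NaN and both columns get a float dtype.
dfstack = dfstack.apply(pd.to_numeric, errors='coerce')

# Swap NaN for None just before inserting; pyodbc passes None as SQL NULL,
# so Access stores an empty field instead of '1,#QNAN' and AVG() works.
dfstack = dfstack.astype(object).where(dfstack.notna(), None)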
I'm brand new to Python and to updating tables with SQL. I would like to ask how to update a certain group of values in a single column using SQL. Please see the example below:
id
123
999991234
235
789
200
999993456
I need to add the missing prefix '99999' to the records that don't have it. The id column has an integer data type by default. I've tried the SQL statement below, but I get a conflict between data types, which is why I tried a CAST:
UPDATE tablename
SET id = CONCAT('99999', CAST(id AS STRING))
WHERE id NOT LIKE '99999%';
To be able to use the LIKE operator and the CONCAT() function, the operands should be STRING (or BYTES). In this case, you need a cast in the WHERE clause condition as well as in the value assigned in the SET statement.
Using your sample data, I ran this update script:
UPDATE mydataset.my_table
SET id = CAST(CONCAT('99999', CAST(id AS STRING)) AS INTEGER)
WHERE CAST(id AS STRING) NOT LIKE '99999%';
Result: the rows were updated successfully, and every id in the table ended up carrying the '99999' prefix.
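As a quick sanity check outside the database, the same transformation can be reproduced in Python:

ids = [123, 999991234, 235, 789, 200, 999993456]
updated = [i if str(i).startswith('99999') else int('99999' + str(i))
           for i in ids]
print(updated)  # [99999123, 999991234, 99999235, 99999789, 99999200, 999993456]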
I am writing large amounts of data to a SQLite database, using a temporary DataFrame to find unique values.
This SQL takes forever in cn.execute(sql):
if upload_to_db:
    print(f'########################################WRITING TO TEMP TABLE: {symbol} #######################################################################')
    master_df.to_sql(name='tempTable', con=engine, if_exists='replace')
    with engine.begin() as cn:
        sql = """INSERT INTO instrumentsHistory (datetime, instrumentSymbol, observation, observationColName)
                 SELECT t.datetime, t.instrumentSymbol, t.observation, t.observationColName
                 FROM tempTable t
                 WHERE NOT EXISTS
                     (SELECT 1 FROM instrumentsHistory f
                      WHERE t.datetime = f.datetime
                        AND t.instrumentSymbol = f.instrumentSymbol
                        AND t.observation = f.observation
                        AND t.observationColName = f.observationColName)"""
        print(f'##############################################WRITING TO FINAL TABLE: {symbol} #################################################################')
        cn.execute(sql)
Running this takes forever to write to the database. Can someone help me understand how to speed it up?
Edit 1:
How many rows, roughly? About 15,000 at a time. Basically it pulls data into a pandas DataFrame, makes some transformations, and then writes the result to a SQLite database. There are probably 600 different instruments, each with around 15,000 rows, so roughly 9M rows ultimately. Give or take a million...
Depending on your SQL database, you could try something like INSERT IGNORE (MySQL), INSERT OR IGNORE (SQLite), or MERGE (e.g. on Oracle), which performs the insert only if it would not violate a primary key or unique constraint. This assumes that such a constraint exists on the four columns you are checking.
In the absence of MERGE, you could try adding the following index to the instrumentsHistory table:
CREATE INDEX idx ON instrumentsHistory
    (datetime, instrumentSymbol, observation, observationColName);
This index allows a rapid lookup of each incoming record from tempTable, and so should speed up the insert process.
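Since the database here is SQLite, a sketch combining both suggestions (a UNIQUE index plus INSERT OR IGNORE; the table and column names are taken from the question, and plain sqlite3 is assumed instead of SQLAlchemy):

import sqlite3

conn = sqlite3.connect('instruments.db')  # hypothetical database file

# A UNIQUE index both speeds up the duplicate check and lets SQLite
# enforce it; INSERT OR IGNORE then silently skips duplicate rows.
conn.execute("""CREATE UNIQUE INDEX IF NOT EXISTS idx_hist
                ON instrumentsHistory
                (datetime, instrumentSymbol, observation, observationColName)""")
conn.execute("""INSERT OR IGNORE INTO instrumentsHistory
                (datetime, instrumentSymbol, observation, observationColName)
                SELECT datetime, instrumentSymbol, observation, observationColName
                FROM tempTable""")
conn.commit()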
This subquery
WHERE NOT EXISTS
(SELECT 1 FROM instrumentsHistory f
WHERE t.datetime = f.datetime
AND t.instrumentSymbol = f.instrumentSymbol
AND t.observation = f.observation
AND t.observationColName = f.observationColName)
has to check every row in the table, matching on four columns, until a match is found. In the worst case there is no match and a full table scan must be completed, so the performance of the query deteriorates as the table grows.
The solution, as mentioned in Tim's answer, is to create an index over the four columns so that the db can quickly determine whether a match exists.
I have two columns in a database table (sent_time, accept_time) that each contain timestamps, as well as a field that can take two different values (ref_col). I would like my query to produce a new column (result_col) that checks the value of ref_col and copies sent_time if the value is 1 and accept_time if the value is 2.
I am using pandas to query the database in Python, if that has any bearing on the answer.
Just use a CASE expression:
SELECT sent_time,
       accept_time,
       ref_col,
       CASE WHEN ref_col = 1 THEN sent_time
            ELSE accept_time
       END AS result_col
FROM Your_Table
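Since the query is run from pandas anyway, the same logic can also be applied on the DataFrame after fetching (the sample values below are made up):

import numpy as np
import pandas as pd

df = pd.DataFrame({
    'sent_time':   pd.to_datetime(['2024-01-01 10:00', '2024-01-01 11:00']),
    'accept_time': pd.to_datetime(['2024-01-01 10:05', '2024-01-01 11:07']),
    'ref_col':     [1, 2],
})

# Equivalent of: CASE WHEN ref_col = 1 THEN sent_time ELSE accept_time END
df['result_col'] = np.where(df['ref_col'] == 1, df['sent_time'], df['accept_time'])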
When you say "I have 2 columns in a database", what you actually mean is that you have 2 columns in a table, right?
In SQL for PostgreSQL it would be something like:
select (case when ref_col = 1 then sent_time else accept_time end) as result_col
from mytable
I don't know how close that is to the SQL standard, but I'd assume it's not far off; CASE expressions are in fact standard SQL.
I have some data that contains NULLs, floats and the occasional NaN. I'm trying to insert this data into a MySQL database using Python and MySQLdb.
Here's the insert statement:
for row in zip(currents, voltages):
    row = [id] + list(row)
    sql_insert = ('INSERT INTO result(id, current, voltage) '
                  'VALUES (%s, %s, %s)')
    cursor.execute(sql_insert, row)
This is the table:
CREATE TABLE langmuir_result (result_id INT auto_increment,
id INT,
current FLOAT,
voltage FLOAT,
PRIMARY KEY (result_id));
When I try to insert NaN into the table I get this error:
_mysql_exceptions.DataError: (1265, "Data truncated for column 'current' at row 1")
I want to insert the NaN values into the database as a float or a number, not as a string or NULL. I've tried making the column type FLOAT and DECIMAL, but I get the same error. How can I do this without making it a string or NULL? Is that possible?
No, it's not possible to store a NaN value in a FLOAT column in MySQL; the only allowed values are NULL or a number. You could work around it with a sentinel value that you don't otherwise use (a negative number, or a very large/small value).
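For example, with pandas (the sentinel -9999 is just an assumption; pick any value outside your real measurement range):

import pandas as pd

df = pd.DataFrame({'current': [0.5, float('nan'), 1.2]})
# Replace NaN with an agreed-upon sentinel before inserting.
df['current'] = df['current'].fillna(-9999)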
Try converting your NaN values to None; the driver sends None as NULL, and in the UI you'll see NULL.
You can use the pandas isna function:

import pandas as pd

# pd.isna() recognises both NaN and None; append None so the driver
# sends SQL NULL for missing values.
if pd.isna(row[column]):
    values_data.append(None)
else:
    values_data.append(row[column])
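Combined with the insert loop from the question, a sketch (only the NaN handling changes; currents, voltages, id and cursor are assumed to exist as in the question):

import pandas as pd

for row in zip(currents, voltages):
    # Send None for NaN so the driver writes NULL instead of an
    # unparseable 'nan' literal.
    params = [None if pd.isna(v) else v for v in (id,) + row]
    cursor.execute(
        'INSERT INTO result(id, current, voltage) VALUES (%s, %s, %s)',
        params)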
I have a table in an SQLite3 database (accessed with Python), a tweet table (TwTbl), that has some values in the column geo_id. Most of the values in this column are NULL/None. I want to replace/update all NULLs in the geo_id column of TwTbl with the number 999. I am not sure about the syntax. I am trying the following query, but I get an error ("No such column: None"):
c.execute("update TwTbl SET geo_id = 999 where geo_id = None").fetchall()
I even tried using NULL instead of None; that did not raise an error, but it did not update anything either.
Any help will be appreciated.
As an answer, so that you can accept it if you're inclined.
You need IS NULL instead of = NULL. NULL is a special value that's indeterminate: it is neither equal nor unequal to anything in most database implementations, so = NULL never matches any row.
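For completeness, a minimal sketch of the corrected call (note that UPDATE returns no rows, so the .fetchall() is unnecessary; the database file name is hypothetical):

import sqlite3

conn = sqlite3.connect('tweets.db')
c = conn.cursor()

# IS NULL, not = None or = NULL: NULL never compares equal to anything.
c.execute("UPDATE TwTbl SET geo_id = 999 WHERE geo_id IS NULL")
conn.commit()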