I would like to write the results of a query into another table in BigQuery, partitioned by date, but I couldn't find how to do it. I use Python and the google-cloud-bigquery library. I want to create the table using standard SQL, but I get an error.
Error : google.api_core.exceptions.BadRequest: 400 GET https://bigquery.googleapis.com/bigquery/v2/projects/astute-baton-272707/queries/f4b9dadb-1390-4260-bb0e-fb525aff662c?maxResults=0&location=US: The number of columns in the column definition list does not match the number of columns produced by the query at [2:72]
Please let me know if there is another solution. The next stage of the project will be inserting into this table day by day.
I may have been doing it wrong from the beginning. I am not sure.
Thank You.
from google.cloud import bigquery

client = bigquery.Client()
sql = """
CREATE OR REPLACE TABLE zzzzz.xxxxx.yyyyy (visitStartTime_ts INT64,date TIMESTAMP,hitsTime_ts INT64,appId STRING,fullVisitorId STRING,cUserId STRING,eventCategory STRING,eventLabel STRING,player_type STRING,PLAY_SESSION_ID STRING,CHANNEL_ID STRING,CONTENT_EPG_ID STRING,OFF_SET STRING)
PARTITION BY date
OPTIONS (
description="weather stations with precipitation, partitioned by day"
) AS
select
FORMAT_TIMESTAMP("%Y-%m-%d %H:%M:%S", TIMESTAMP_SECONDS(SAFE_CAST(visitStartTime AS INT64)), "Turkey") AS visitStartTime_ts,
date
,FORMAT_TIMESTAMP("%Y-%m-%d %H:%M:%S", TIMESTAMP_SECONDS(SAFE_CAST(visitStartTime+(h.time/1000) AS INT64)), "Turkey") AS hitsTime_ts
,h.appInfo.appId as appId
,fullVisitorId
,(SELECT value FROM h.customDimensions where index=1) as cUserId
,h.eventInfo.eventCategory as eventCategory
,h.eventInfo.eventAction as eventAction
,h.eventInfo.eventLabel as eventLabel
,REPLACE(SPLIT(h.eventInfo.eventCategory,'/{')[OFFSET(1)],'}','') as player_type
,SPLIT(h.eventInfo.eventLabel,'|')[OFFSET(0)] as PLAY_SESSION_ID
,SPLIT(h.eventInfo.eventLabel,'|')[OFFSET(1)] as CHANNEL_ID
,SPLIT(h.eventInfo.eventLabel,'|')[OFFSET(2)] as CONTENT_EPG_ID
,SPLIT(h.eventInfo.eventLabel,'|')[OFFSET(3)] as OFF_SET
FROM `zzzzz.yyyyyy.xxxxxx*` a,
UNNEST(hits) AS h
where
1=1
and SPLIT(SPLIT(h.eventInfo.eventCategory,'/{')[OFFSET(0)],'/')[OFFSET(0)] like 'player'
and _TABLE_SUFFIX BETWEEN FORMAT_DATE('%Y%m%d',DATE_SUB(CURRENT_DATE(), INTERVAL 1 DAY))
AND FORMAT_DATE('%Y%m%d',DATE_SUB(CURRENT_DATE(), INTERVAL 1 DAY))
AND (BYTE_LENGTH(h.eventInfo.eventCategory) - BYTE_LENGTH(REPLACE(h.eventInfo.eventCategory,'/{','')))/2 + 1 = 2
AND h.eventInfo.eventAction='heartBeat'
"""
job = client.query(sql)  # API request.
job.result()  # Waits for the query to finish.
print('Query results loaded to table zzzzz.xxxxx.yyyyy')
A quick solution for the problem presented here: when you create a table with AS SELECT, you don't need to declare its schema, because the schema comes from the query. Right now there is a conflict between the query and the declared schema: the SELECT produces 14 columns (it also selects eventAction), while the column definition list declares only 13, which is exactly what the error message complains about. So remove one of the two.
Instead of starting the query with:
CREATE OR REPLACE TABLE zzzzz.xxxxx.yyyyy (visitStartTime_ts INT64,date TIMESTAMP,hitsTime_ts INT64,appId STRING,fullVisitorId STRING,cUserId STRING,eventCategory STRING,eventLabel STRING,player_type STRING,PLAY_SESSION_ID STRING,CHANNEL_ID STRING,CONTENT_EPG_ID STRING,OFF_SET STRING)
PARTITION BY date
Start the query with:
CREATE OR REPLACE TABLE zzzzz.xxxxx.yyyyy
PARTITION BY date
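If it helps, this is roughly how the corrected statement runs from the Python snippet in the question. It is only a sketch using the placeholder table IDs from above; the "..." stands for the unchanged SELECT, and the schema now comes entirely from it:
from google.cloud import bigquery

client = bigquery.Client()

sql = """
CREATE OR REPLACE TABLE zzzzz.xxxxx.yyyyy
PARTITION BY date
OPTIONS (description="weather stations with precipitation, partitioned by day")
AS
SELECT
  ...
"""

job = client.query(sql)  # API request: runs the DDL together with the SELECT
job.result()             # waits for the job to finish

# Optional sanity check: the destination should now be partitioned on `date`.
print(client.get_table("zzzzz.xxxxx.yyyyy").time_partitioning)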
I am trying to use the petl library to build an ETL process that copies data between two tables. The destination table has a unique slug field, so I wrote my script to identify duplicate slugs and convert them by appending the row's ID to the slug value.
table = etl.fromdb(source_con, 'SELECT * FROM user')
# get whatever remains as duplicates
duplicates = etl.duplicates(table, 'slug')
for dup in [i for i in duplicates.values('id')]:
    table = etl.convert(
        table,
        'slug',
        lambda v, row: '{}-{}'.format(slugify_unicode(v), str(row.id).encode('hex')),
        where=lambda row: row.id == dup,
        pass_row=True
    )
The above did not work as expected; the table object still contains the duplicate values after the loop.
Anyone can advise?
Thanks
I am trying to get the last record from my TinyDB, so I want to make a query that looks like this:
"SELECT * FROM table ORDER BY id DESC LIMIT 1"
which should give me the last row, though I can't figure out how to do it with TinyDB.
If you want to sort the db by a time field and take the most recent entry, for example:
od = sorted(db.all(), key=lambda k: k['time'])
print(od[-1])
According to the documentation, the following retrieves the final element in the db via its doc_id in TinyDB 4.7.0:
el = db.all()[-1]
record = db.get(doc_id=el.doc_id)
See the "Using Document IDs" section of the TinyDB documentation.
How about:
table = db.table('table_name')
table.get(doc_id=len(table))
See how doc_id works here.
Using a Query and doing an update:
from tinydb import TinyDB, Query

with TinyDB('db.json') as db:
    my_table = db.table('a_table_name')
    my_query = Query()
    first_of_table_for_this_query = my_table.search(my_query.some_field == some_value)[0]
    last_of_table_for_this_query = my_table.search(my_query.some_field == some_value)[-1]

    # example: update only the first inserted and last inserted matching documents
    my_table.update({'some_field': some_value + 42}, doc_ids=[
        first_of_table_for_this_query.doc_id,
        last_of_table_for_this_query.doc_id
    ])
I have one database with two tables, both of which have a column called barcode. The aim is to retrieve barcodes from one table and look up the rows in the other table where extra information for that barcode is stored, then save the combined data in a DataFrame. The problem is that when I insert the retrieved data into the DataFrame from the second query, only the last entry is stored:
import mysql.connector
import pandas as pd
cnx = mysql.connector.connect(user=user, password=password, host=host, database=database)
query_barcode = ("SELECT barcode FROM barcode_store")
cursor = cnx.cursor()
cursor.execute(query_barcode)
data_barcode = cursor.fetchall()
Up to this point everything works smoothly; here is the part with the problem:
query_info = ("SELECT product_code FROM product_info WHERE barcode=%s")
for each_barcode in data_barcode:
    cursor.execute(query_info % each_barcode)
    pro_info = pd.DataFrame(cursor.fetchall())
pro_info contains only the information for the last matching barcode, while I want to retrieve the information for every barcode in data_barcode.
That's because you are overwriting the existing pro_info with new data on each loop iteration. You should rather do something like:
query_info = ("SELECT product_code FROM product_info")
cursor.execute(query_info)
pro_info = pd.DataFrame(cursor.fetchall())
Making so many SELECTs is redundant, since you can get all records in one SELECT and insert them into your DataFrame at once.
Edit: however, if you need the WHERE clause to fetch only specific products, you need to collect the records in a list before inserting them into the DataFrame. Your code would then look like:
pro_list = []
query_info = ("SELECT product_code FROM product_info WHERE barcode=%s")
for each_barcode in data_barcode:
    # let the connector substitute the parameter instead of %-formatting the SQL string
    cursor.execute(query_info, each_barcode)
    pro_list.append(cursor.fetchone())
pro_info = pd.DataFrame(pro_list)
Cheers!
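If the per-barcode lookup only exists to pull the matching product rows, another option is to let MySQL do the matching with a join and read everything in one round trip. A minimal sketch, assuming both tables live in the same database and can be joined on barcode (cnx is the connection opened in the question):
import pandas as pd

query = """
    SELECT b.barcode, p.product_code
    FROM barcode_store AS b
    JOIN product_info AS p ON p.barcode = b.barcode
"""
cursor = cnx.cursor()
cursor.execute(query)
# one round trip, then build the DataFrame from the joined rows
pro_info = pd.DataFrame(cursor.fetchall(), columns=['barcode', 'product_code'])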
I'd like to append to an existing table, using pandas df.to_sql() function.
I set if_exists='append', but my table has primary keys.
I'd like to do the equivalent of insert ignore when trying to append to the existing table, so I would avoid a duplicate entry error.
Is this possible with pandas, or do I need to write an explicit query?
There is unfortunately no option to specify "INSERT IGNORE". This is how I got around that limitation, inserting only the rows that were not duplicates into the database (the DataFrame is named df):
from sqlalchemy.exc import IntegrityError

for i in range(len(df)):
    try:
        df.iloc[i:i+1].to_sql(name="Table_Name", if_exists='append', con=Engine)
    except IntegrityError:
        pass  # or any other action
You can do this with the method parameter of to_sql:
from sqlalchemy.dialects.mysql import insert
def insert_on_duplicate(table, conn, keys, data_iter):
    insert_stmt = insert(table.table).values(list(data_iter))
    on_duplicate_key_stmt = insert_stmt.on_duplicate_key_update(insert_stmt.inserted)
    conn.execute(on_duplicate_key_stmt)
df.to_sql('trades', dbConnection, if_exists='append', chunksize=4096, method=insert_on_duplicate)
For older versions of SQLAlchemy, you need to pass a dict to on_duplicate_key_update, i.e. on_duplicate_key_stmt = insert_stmt.on_duplicate_key_update(dict(insert_stmt.inserted)).
Please note that if_exists='append' relates to whether the table already exists and what to do if it does not; if_exists says nothing about the contents of the table.
see the doc here: https://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.to_sql.html
if_exists : {‘fail’, ‘replace’, ‘append’}, default ‘fail’
fail: If table exists, do nothing.
replace: If table exists, drop it, recreate it, and insert data.
append: If table exists, insert data. Create if does not exist.
Pandas has no option for it currently, but here is the Github issue. If you need this feature too, just upvote for it.
The for-loop method above slows things down significantly. There is a method parameter you can pass to pandas.DataFrame.to_sql to customize the SQL used for the insert:
https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.to_sql.html#pandas.DataFrame.to_sql
The code below should work for Postgres and do nothing if there is a conflict with the primary key "unique_code". Change the insert dialect to match your database.
def insert_do_nothing_on_conflicts(sqltable, conn, keys, data_iter):
    """
    Execute SQL statement inserting data

    Parameters
    ----------
    sqltable : pandas.io.sql.SQLTable
    conn : sqlalchemy.engine.Engine or sqlalchemy.engine.Connection
    keys : list of str
        Column names
    data_iter : Iterable that iterates the values to be inserted
    """
    from sqlalchemy.dialects.postgresql import insert
    from sqlalchemy import table, column

    columns = []
    for c in keys:
        columns.append(column(c))

    if sqltable.schema:
        table_name = '{}.{}'.format(sqltable.schema, sqltable.name)
    else:
        table_name = sqltable.name

    mytable = table(table_name, *columns)

    insert_stmt = insert(mytable).values(list(data_iter))
    do_nothing_stmt = insert_stmt.on_conflict_do_nothing(index_elements=['unique_code'])
    conn.execute(do_nothing_stmt)
df.to_sql('mytable', con=sql_engine, if_exists='append', method=insert_do_nothing_on_conflicts)
Pandas doesn't support editing the actual SQL syntax of the .to_sql method, so you might be out of luck. There are some experimental programmatic workarounds (say, read the DataFrame into a SQLAlchemy object with CALCHIPAN and use SQLAlchemy for the transaction), but you may be better served by writing your DataFrame to a CSV and loading it with an explicit MySQL function.
CALCHIPAN repo: https://bitbucket.org/zzzeek/calchipan/
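For the CSV route, something along these lines should work with MySQL. This is only a sketch, with placeholder credentials, file path and table name, df being the DataFrame from the question, and it assumes local_infile is enabled on both the client and the server; the IGNORE keyword makes MySQL skip rows that would violate the primary key instead of raising an error:
import mysql.connector

# dump the frame without index/header so the columns line up with the table definition
df.to_csv('/tmp/table_name.csv', index=False, header=False)

cnx = mysql.connector.connect(user='...', password='...', host='...',
                              database='...', allow_local_infile=True)
cursor = cnx.cursor()
cursor.execute("""
    LOAD DATA LOCAL INFILE '/tmp/table_name.csv'
    IGNORE INTO TABLE table_name
    FIELDS TERMINATED BY ','
""")
cnx.commit()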
I had trouble where I was still getting the IntegrityError
...strange but I just took the above and worked it backwards:
for i, row in df.iterrows():
    sql = "SELECT * FROM `Table_Name` WHERE `key` = '{}'".format(row.Key)
    found = pd.read_sql(sql, con=Engine)
    if len(found) == 0:
        df.iloc[i:i+1].to_sql(name="Table_Name", if_exists='append', con=Engine)
In my case, I was trying to insert new data into an empty table, but some of the rows were duplicated, almost the same issue as here. I thought about fetching the existing data, merging it with the new data and carrying on, but that is not optimal and may only work for small data sets, not huge tables.
As pandas does not provide any handling for this situation right now, I looked for a suitable workaround and made my own. I am not sure whether it will work for you, but I decided to take control of my data rather than trust to luck, so what I did is remove the duplicates before calling .to_sql; that way, if an error happens, I know more about my data and what is going on:
import pandas as pd

def write_to_table(table_name, data):
    df = pd.DataFrame(data)
    # Sort by price so that drop_duplicates keeps only the lowest-priced row per id_key
    df = df.sort_values('price')
    df.drop_duplicates(subset=['id_key'], keep='first', inplace=True)
    # `engine` is a SQLAlchemy engine created elsewhere
    df.to_sql(table_name, engine, index=False, if_exists='append', schema='public')
So in my case, I wanted to keep only the lowest price for each row (by the way, I was passing a list of dicts as data), and for that I sorted first. That is not strictly necessary, but it is an example of what I mean by controlling the data you want to keep.
I hope this helps someone in almost the same situation as mine.
When you use SQL Server, you get a SQL error when you insert a duplicate value into a table that has a primary key constraint. You can fix it by defining the primary key with IGNORE_DUP_KEY = ON:
CREATE TABLE [dbo].[DeleteMe](
    [id] [uniqueidentifier] NOT NULL,
    [Value] [varchar](max) NULL,
    CONSTRAINT [PK_DeleteMe]
        PRIMARY KEY ([id] ASC)
        WITH (IGNORE_DUP_KEY = ON));  <-- add
Taken from https://dba.stackexchange.com/a/111771.
Now your df.to_sql() should work again.
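If the table already exists and you would rather not recreate it, rebuilding the primary key index with the option switched on has the same effect. A small sketch, assuming a SQLAlchemy engine for SQL Server (placeholder connection string) and the index name PK_DeleteMe from the example above:
from sqlalchemy import create_engine, text

engine = create_engine("mssql+pyodbc://user:password@my_dsn")  # placeholder connection string

with engine.begin() as conn:
    # rebuild the existing primary key index so duplicate-key rows are silently discarded
    conn.execute(text(
        "ALTER INDEX [PK_DeleteMe] ON [dbo].[DeleteMe] "
        "REBUILD WITH (IGNORE_DUP_KEY = ON)"
    ))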
The solutions by Jayen and Huy Tran helped me a lot, but they didn't work straight out of the box. The problem I faced with Jayen's code is that it requires the DataFrame columns to be exactly the same as those of the database table. This was not true in my case, as there were some DataFrame columns that I wouldn't write to the database.
I modified the solution so that it considers the column names.
from sqlalchemy.dialects.mysql import insert
import itertools
def insertWithConflicts(sqltable, conn, keys, data_iter):
    """
    Execute SQL statement inserting data, whilst taking care of conflicts
    Used to handle duplicate key errors during database population

    This is my modification of the code snippet
    from https://stackoverflow.com/questions/30337394/pandas-to-sql-fails-on-duplicate-primary-key
    The help page from https://docs.sqlalchemy.org/en/14/core/dml.html#sqlalchemy.sql.expression.Insert.values
    proved useful.

    Parameters
    ----------
    sqltable : pandas.io.sql.SQLTable
    conn : sqlalchemy.engine.Engine or sqlalchemy.engine.Connection
    keys : list of str
        Column names
    data_iter : Iterable that iterates the values to be inserted. It is a zip object.
        Its length is equal to the chunksize passed in df.to_sql()
    """
    vals = [dict(zip(z[0], z[1])) for z in zip(itertools.cycle([keys]), data_iter)]
    insertStmt = insert(sqltable.table).values(vals)
    doNothingStmt = insertStmt.on_duplicate_key_update(dict(insertStmt.inserted))
    conn.execute(doNothingStmt)
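Usage is the same as for the other method-based answers, for example (table name and engine here are just placeholders):
df.to_sql('my_table', con=engine, if_exists='append', chunksize=4096, method=insertWithConflicts)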
I faced the same issue and adopted the solution provided by Huy Tran for a while, until my tables started to have schemas.
I had to improve his answer a bit, and this is the final result:
# imports assumed: Postgres dialect, as in Huy Tran's answer
from sqlalchemy import table, column
from sqlalchemy.dialects.postgresql import insert

def do_nothing_on_conflicts(sql_table, conn, keys, data_iter):
    """
    Execute SQL statement inserting data

    Parameters
    ----------
    sql_table : pandas.io.sql.SQLTable
    conn : sqlalchemy.engine.Engine or sqlalchemy.engine.Connection
    keys : list of str
        Column names
    data_iter : Iterable that iterates the values to be inserted
    """
    columns = []
    for c in keys:
        columns.append(column(c))

    if sql_table.schema:
        my_table = table(sql_table.name, *columns, schema=sql_table.schema)
        # table_name = '{}.{}'.format(sql_table.schema, sql_table.name)
    else:
        my_table = table(sql_table.name, *columns)
        # table_name = sql_table.name
    # my_table = table(table_name, *columns)

    insert_stmt = insert(my_table).values(list(data_iter))
    do_nothing_stmt = insert_stmt.on_conflict_do_nothing()
    conn.execute(do_nothing_stmt)
How to use it:
history.to_sql('history', schema=schema, con=engine, method=do_nothing_on_conflicts)
The idea is the same as Nfern's, but this uses a recursive function that splits the df in half on each iteration to skip the row or rows causing the integrity violation.
def insert(df):
    try:
        # inserting into backup table
        df.to_sql("table", con=engine, if_exists='append', index=False, schema='schema')
    except:
        rows = df.shape[0]
        if rows > 1:
            df1 = df.iloc[:int(rows/2), :]
            df2 = df.iloc[int(rows/2):, :]
            insert(df1)
            insert(df2)
        else:
            print(f"{df} not inserted. Integrity violation, duplicate primary key/s")