Running parameterized queries - python

Quite new to this Google BigQuery SQL thing, so please bear with me. I'm trying to build a Google Standard SQL parameterized query. The following sample ran successfully in the Google BigQuery web UI.
#standardSQL
WITH time AS
(
SELECT TIMESTAMP_MILLIS(timestamp) AS trans_time,
inputs.input_pubkey_base58 AS input_key,
outputs.output_pubkey_base58 AS output_key,
outputs.output_satoshis AS satoshis,
transaction_id AS trans_id
FROM `bigquery-public-data.bitcoin_blockchain.transactions`
JOIN UNNEST (inputs) AS inputs
JOIN UNNEST (outputs) AS outputs
WHERE inputs.input_pubkey_base58 = '1XPTgDRhN8RFnzniWCddobD9iKZatrvH4'
OR outputs.output_pubkey_base58 = '1XPTgDRhN8RFnzniWCddobD9iKZatrvH4'
)
SELECT input_key, output_key, satoshis, trans_id,
EXTRACT(DATE FROM trans_time) AS date
FROM time
WHERE trans_time >= '2010-05-21' AND trans_time <= '2010-05-23' AND satoshis >= 1000000000000
--ORDER BY date
Sample extracted from here as a side note.
This gives 131 rows (table sample screenshot omitted).
What I would like to be able to do is use ScalarQueryParameter, so I could programmatically pass in some variables along the way. Like this:
myquery = """
#standardSQL
WITH time AS
(
SELECT TIMESTAMP_MILLIS(timestamp) AS trans_time,
inputs.input_pubkey_base58 AS input_key,
outputs.output_pubkey_base58 AS output_key,
outputs.output_satoshis AS satoshis,
transaction_id AS trans_id
FROM `bigquery-public-data.bitcoin_blockchain.transactions`
JOIN UNNEST (inputs) AS inputs
JOIN UNNEST (outputs) AS outputs
WHERE inputs.input_pubkey_base58 = @pubkey
OR outputs.output_pubkey_base58 = @pubkey
)
SELECT input_key, output_key, satoshis, trans_id,
EXTRACT(DATE FROM trans_time) AS date
FROM time
WHERE trans_time >= @mdate AND trans_time <= @tdate AND satoshis >= 1000000000000
--ORDER BY date
"""
varInitDate = '2010-05-21'
varEndDate = '2010-05-23'
pubkey = '1XPTgDRhN8RFnzniWCddobD9iKZatrvH4'
query_params = [
    bigquery.ScalarQueryParameter('mdate', 'STRING', varInitDate),
    bigquery.ScalarQueryParameter('tdate', 'STRING', varEndDate),
    bigquery.ScalarQueryParameter('pubkey', 'STRING', pubkey)
]
job_config = bigquery.QueryJobConfig()
job_config.query_parameters = query_params
query_job = client.query(myquery, job_config=job_config)
Nevertheless, I'm facing the following error:
<google.cloud.bigquery.table.RowIterator object at 0x7fa098be85f8>
Traceback...
TypeError: 'RowIterator' object is not callable
Can someone please enlighten me on how I can achieve this?
P.S. - '1XPTgDRhN8RFnzniWCddobD9iKZatrvH4' is Laszlo's 10,000-bitcoin pizza exchange (1000000000000 satoshis).

So... the problem was with these lines of code, which didn't work as expected (oddly, they had worked with queries that didn't have parameterized vars). query_job.result() returns a RowIterator, and the extra parentheses in results() try to call that object, which is exactly what raises the TypeError above.
results = query_job.result()
df = results().to_dataframe()
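For reference, here is a minimal corrected version of those two lines (a small sketch; it assumes pandas is installed alongside google-cloud-bigquery so that RowIterator.to_dataframe() is available):
results = query_job.result()  # waits for the job to complete and returns a RowIterator
df = results.to_dataframe()   # no extra parentheses: call to_dataframe() on the iterator itself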
And the actual code... Remember to replace with your own login credentials for this to work.
import datetime, time
from google.cloud import bigquery
from google.oauth2 import service_account
import pandas as pd
#login
credentials = service_account.Credentials.from_service_account_file('your.json')
project_id = 'your-named-project'
client = bigquery.Client(credentials=credentials, project=project_id)
#The query
q_input = """
#standardSQL
WITH time AS
(
SELECT TIMESTAMP_MILLIS(timestamp) AS trans_time,
inputs.input_pubkey_base58 AS input_key,
outputs.output_pubkey_base58 AS output_key,
outputs.output_satoshis AS satoshis,
transaction_id AS trans_id
FROM `bigquery-public-data.bitcoin_blockchain.transactions`
JOIN UNNEST (inputs) AS inputs
JOIN UNNEST (outputs) AS outputs
WHERE inputs.input_pubkey_base58 = @pubkey
OR outputs.output_pubkey_base58 = @pubkey
)
SELECT input_key, output_key, satoshis, trans_id,
EXTRACT(DATE FROM trans_time) AS date
FROM time
WHERE trans_time >= @mdate AND trans_time <= @tdate AND satoshis >= @satoshis
--ORDER BY date
"""
#The desired purpose
def runQueryTransaction(varInitDate, varEndDate, pubkey, satoshis):
    global df
    query_params = [
        bigquery.ScalarQueryParameter('mdate', 'STRING', varInitDate),
        bigquery.ScalarQueryParameter('tdate', 'STRING', varEndDate),
        bigquery.ScalarQueryParameter('pubkey', 'STRING', pubkey),
        bigquery.ScalarQueryParameter('satoshis', 'INT64', satoshis),
    ]
    job_config = bigquery.QueryJobConfig()
    job_config.query_parameters = query_params
    query_job = client.query(q_input, job_config=job_config)  # API request - starts the query
    results = query_job.result()  # Waits for job to complete.
    df = pd.DataFrame(columns=['input_key', 'output_key', 'satoshis', 'trans_id', 'date'])
    for row in results:
        df.loc[len(df)] = [row.input_key, row.output_key, row.satoshis, row.trans_id, row.date]
        #print("{} : {} : {} : {} : {}".format(row.input_key, row.output_key, row.satoshis, row.trans_id, row.date))
    return df
#runQueryTransaction(InitialDate,EndDate,WalletPublicKey,Satoshis)
runQueryTransaction('2010-05-21','2010-05-23','1XPTgDRhN8RFnzniWCddobD9iKZatrvH4',1000000000000)
Cheers

Related

Unable to process large amount of data using for loop

I am downloading 2 years' worth of OHLC data for 10k symbols and writing it to a database. When I try to pull the entire list it crashes (but it doesn't if I only download 20% of it):
import config
from alpaca_trade_api.rest import REST, TimeFrame
import sqlite3
import pandas as pd
import datetime
from dateutil.relativedelta import relativedelta
start_date = (datetime.datetime.now() - relativedelta(years=2)).date()
start_date = pd.Timestamp(start_date, tz='America/New_York').isoformat()
end_date = pd.Timestamp(datetime.datetime.now(), tz='America/New_York').isoformat()
conn = sqlite3.connect('allStockData.db')
api = REST(config.api_key_id, config.api_secret, base_url=config.base_url)
origin_symbols = pd.read_sql_query("SELECT symbol, name from stock", conn)
df = origin_symbols
df_dict = df.to_dict('records')
startTime = datetime.datetime.now()
api = REST(config.api_key_id, config.api_secret, base_url=config.base_url)
temp_data = []
for key in df_dict:
    symbol = key['symbol']
    print(f"downloading ${symbol}")
    # stock_id = key['id']
    barsets = api.get_bars_iter(symbol, TimeFrame.Day, start_date, end_date)
    barsets = list(barsets)
    for index, bar in enumerate(barsets):
        bars = pd.DataFrame({'date': bar.t.date(), 'symbol': symbol, 'open': bar.o, 'high': bar.h, 'low': bar.l, 'close': bar.c, 'volume': bar.v, 'vwap': bar.vw}, index=[0])
        temp_data.append(bars)
print("loop complete")
data = pd.concat(temp_data)
# write df back to sql, replacing the previous table
data.to_sql('daily_ohlc_init', if_exists='replace', con=conn, index=True)
endTime = datetime.datetime.now()
print(f'time elapsed to pull data was {endTime - startTime}')
To make it work I add this line after df_dict to limit symbols downloaded:
df_dict = df_dict[0:2000]
This allows me to write to the database, but I need the entire dictionary (about 10k symbols). How do I write to the database without it crashing?
Since you mentioned that you are able to make it work for 2000 records of df_dict at a time, a possible simple approach could be:
api = REST(config.api_key_id, config.api_secret, base_url=config.base_url)
num_records = len(df_dict)
chunk_size = 2000
num_passes = num_records // chunk_size + int(num_records % chunk_size != 0)
for i in range(num_passes):
    start = i * chunk_size
    end = min((i + 1) * chunk_size, num_records)
    df_chunk = df_dict[start:end]
    temp_data = []
    for key in df_chunk:
        symbol = key['symbol']
        print(f"downloading ${symbol}")
        barsets = api.get_bars_iter(symbol, TimeFrame.Day, start_date, end_date)
        barsets = list(barsets)
        for index, bar in enumerate(barsets):
            bars = [bar.t.date(), symbol, bar.o, bar.h, bar.l, bar.c, bar.v, bar.vw]
            temp_data.append(bars)
    # should be a bit more efficient to create a dataframe just once
    columns = ['date', 'symbol', 'open', 'high', 'low', 'close', 'volume', 'vwap']
    data = pd.DataFrame(temp_data, columns=columns)
    # should delete previous table when writing first chunk, then start appending from next passes through df_dict
    data.to_sql('daily_ohlc_init', if_exists='replace' if i == 0 else 'append', con=conn, index=True)
    print(f"Internal loop finished processing records {start} to {end} out of {num_records}.")
endTime = datetime.datetime.now()
print(f'time elapsed to pull data was {endTime - startTime}')
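As a quick sanity check after the run, you could count the rows that actually landed in the table (a small sketch reusing the conn object and table name from the code above):
import pandas as pd

# count the rows written across all chunks
row_count = pd.read_sql_query("SELECT COUNT(*) AS n FROM daily_ohlc_init", conn)
print(row_count)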

How to use pandas INPUT function to get a list of customers

I have created some code to get users of my platform based on 2 things:
choiceTitle: search for a specific word contained in the title of an Ad that users of my platform have looked at. For example, the Ad is "We are offering free Gin" and I want to get the word 'Gin'
PrimaryTagPreviousChoice: the Ad has a "Food and Drink" tag
I can get those users who are interested in Gin and Food and Drink with:
(df2['choiceTitle'].str.contains("(?i)Gin")) & (df2['PrimaryTagPreviousChoice'].str.contains("(?i)Food and Drink"))
What I'd like to do is create a function with all my code inside (i.e. the SQL query, the rename operation, the sort_values operation, etc.) and then use the input() function, so that when I run my code Python asks me 2 questions:
choiceTitle? ... Gin
PrimaryTagPreviousChoice? ...Food and Drink.
I enter the 2 options and it gives me the users interested in, let's say, Gin and Food and Drink.
How can I do it?
MY CODE:
df = pd.read_sql_query(""" select etc..... """, con)
df1 = pd.read_sql_query(""" select etc..... """, con)
df1['user_id'] = df1['user_id'].apply(str)
df2 = pd.merge(df, df1, left_on='user_id', right_on='user_id', how='left')
tag = df2[
    (df2['choiceTitle'].str.contains("(?i)Gin")) &
    (df2['PrimaryTagPreviousChoice'].str.contains("(?i)Food and Drink"))
]
dw = tag[['user', 'title', 'user_category', 'email', 'last_login',
          'PrimaryTagPreviousChoice', 'choiceTitle'
          ]].drop_duplicates()
dw = dw.sort_values(['last_login'], ascending=[False])
dw = dw[dw.last_login > dt.datetime.now() - pd.to_timedelta("30day")]
dw = dw.rename({'user': 'user full name', 'title': 'user title'},
               axis='columns')
dw.drop_duplicates(subset="Email", keep='first', inplace=True)
Adding a function in Python is simple. Just use the def keyword to declare the function and put your existing code under it (indented). Put the parameters in the parentheses.
Here is the updated code:
def GetUsers(title, tag):
    df = pd.read_sql_query(""" select etc..... """, con)
    df1 = pd.read_sql_query(""" select etc..... """, con)
    df1['user_id'] = df1['user_id'].apply(str)
    df2 = pd.merge(df, df1, left_on='user_id', right_on='user_id', how='left')
    tag = df2[
        (df2['choiceTitle'].str.contains("(?i)" + title)) &
        (df2['PrimaryTagPreviousChoice'].str.contains("(?i)" + tag))]
    dw = tag[['user', 'title', 'user_category', 'email', 'last_login',
              'PrimaryTagPreviousChoice', 'choiceTitle'
              ]].drop_duplicates()
    dw = dw.sort_values(['last_login'], ascending=[False])
    dw = dw[dw.last_login > dt.datetime.now() - pd.to_timedelta("30day")]
    dw = dw.rename({'user': 'user full name', 'title': 'user title'},
                   axis='columns')
    dw.drop_duplicates(subset="Email", keep='first', inplace=True)
    return dw  # send back to print statement
# get input from user
inpTitle = input ("choiceTitle? ")
inpTag = input ("PrimaryTagPreviousChoice? ")
# run function
result = GetUsers (inpTitle, inpTag)
print(result)
Try this. Save your input() values as variables and use str.format() (or string concatenation) to build your mask.
choiceTitle = input('choiceTitle?')
PrimaryTagPreviousChoice = input('PrimaryTagPreviousChoice?')
mask = df2[(df2['choiceTitle'].str.contains("(?i){0}".format(choiceTitle))) &
           (df2['PrimaryTagPreviousChoice'].str.contains("(?i){0}".format(PrimaryTagPreviousChoice)))]
dw = mask[['user', 'title', 'user_category', 'email', 'last_login',
           'PrimaryTagPreviousChoice', 'choiceTitle'
           ]].drop_duplicates()
....

Python 3.7 KeyError

I'd like to retrieve information from NewsApi and ran into an issue. Enclosed is the code:
from NewsApi import NewsApi
import pandas as pd
import os
import datetime as dt
from datetime import date
def CreateDF(JsonArray, columns):
    dfData = pd.DataFrame()
    for item in JsonArray:
        itemStruct = {}
        for cunColumn in columns:
            itemStruct[cunColumn] = item[cunColumn]
        # dfData = dfData.append(itemStruct, ignore_index=True)
        # dfData = dfData.append({'id': item['id'], 'name': item['name'], 'description': item['description']},
        #                        ignore_index=True)
    # return dfData
    return itemStruct


def main():
    # access_token_NewsAPI.txt must contain your personal access token
    with open("access_token_NewsAPI.txt", "r") as f:
        myKey = f.read()[:-1]
    #myKey = 'a847cee6cc254d8495632f83d5c77d39'
    api = NewsApi(myKey)
    # get sources of news
    # columns = ['id', 'name', 'description']
    # rst_source = api.GetSources()
    # df = CreateDF(rst_source['sources'], columns)
    # df.to_csv('source_list.csv')
    #
    #
    # # get news for specific country
    # rst_country = api.GetHeadlines()
    # columns = ['author', 'publishedAt', 'title', 'description','content', 'url']
    # df = CreateDF(rst_country['articles'], columns)
    # df.to_csv('Headlines_country.csv')

    # get news for specific symbol
    symbol = "coronavirus"
    sources = 'bbc.co.uk'
    columns = ['author', 'publishedAt', 'title', 'description', 'content', 'source']
    limit = 500  # maximum requests per day
    i = 1
    startDate = dt.datetime(2020, 3, 1, 8)
    # startDate = dt.datetime(2020, 3, 1)
    df = pd.DataFrame({'author': [], 'publishedAt': [], 'title': [], 'description': [], 'content': [], 'source': []})
    while i < limit:
        endDate = startDate + dt.timedelta(hours=2)
        rst_symbol = api.GetEverything(symbol, 'en', startDate, endDate, sources)
        rst = CreateDF(rst_symbol['articles'], columns)
        df = df.append(rst, ignore_index=True)
        # DF.join(df.set_index('publishedAt'), on='publishedAt')
        startDate = endDate
        i += 1
    df.to_csv('Headlines_symbol.csv')


main()
I got the following error:
rst = CreateDF(rst_symbol['articles'], columns)
KeyError: 'articles'
In this line:
rst = CreateDF(rst_symbol['articles'], columns)
I think there is some problem regarding the key not being found or defined - does anyone have an idea how to fix that? I'm thankful for every hint!
MAiniak
EDIT:
I found the solution after I tried a few of your hints. Apparently, the error occurred when the NewsAPI API key ran into a request limit. This happened every time, until I changed the limit = 500 to limit = 20. For some reason, there is no error with a new API Key and reduced limit.
Thanks for your help guys!
Probably 'articles' is not one of the keys in your rst_symbol object.
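You can guard against that before indexing into the response. The sketch below assumes the response follows the usual News API convention of a 'status' field ('ok' or 'error') plus a 'message' on errors (which is what you would see when the request limit is hit); adjust it to whatever your NewsApi wrapper actually returns:
rst_symbol = api.GetEverything(symbol, 'en', startDate, endDate, sources)
if 'articles' not in rst_symbol:
    # failed or rate-limited requests typically explain themselves here
    print("No 'articles' in response:", rst_symbol.get('status'), rst_symbol.get('message'))
else:
    rst = CreateDF(rst_symbol['articles'], columns)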
The python documentation [2] [3] doesn't mention any method named NewsApi() or GetEverything(), but rather NewsApiClient() and get_everything(), i.e.:
from newsapi import NewsApiClient

# Init
newsapi = NewsApiClient(api_key='xxx')

# /v2/top-headlines
top_headlines = newsapi.get_top_headlines(q='bitcoin',
                                          sources='bbc-news,the-verge',
                                          category='business',
                                          language='en',
                                          country='us')

# /v2/everything
all_articles = newsapi.get_everything(q='bitcoin',
                                      sources='bbc-news,the-verge',
                                      domains='bbc.co.uk,techcrunch.com',
                                      from_param='2017-12-01',
                                      to='2017-12-12',
                                      language='en',
                                      sort_by='relevancy',
                                      page=2)

# /v2/sources
sources = newsapi.get_sources()
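From there, building your DataFrame of articles is straightforward. A minimal sketch (it assumes each entry in all_articles['articles'] is a dict carrying the usual News API fields; anything missing simply becomes NaN):
import pandas as pd

columns = ['author', 'publishedAt', 'title', 'description', 'content', 'source']
articles = all_articles.get('articles', [])
# keep only the columns of interest
df = pd.DataFrame(articles).reindex(columns=columns)
print(df.head())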

pass an array of values into bigquery query with pandas

After some processing I get the following array:
users = array([u'5451709866311680', u'4660301072957440', u'6370791394377728',
u'5121933955825664', u'4778500988862464', u'5841867648270336',
u'4751430816628736', u'4869137213947904', u'5152642703556608',
u'6531810976595968', u'4824167228637184', u'6058117842337792',
u'5969360933879808', u'4764494160986112', u'5443041280131072',
u'4846257587617792', u'5409371420884992', u'6197117949313024',
u'6643644022915072', u'5060273861820416'], dtype=object)
And then I would like to query these users in another table in BigQuery, but I'm having issues.
query = """
SELECT *
FROM games
WHERE user_id IN %users
"""
segment = pd.io.gbq.read_gbq(query, project_id='shared', dialect='standard')
Anyone knows how to proceed?
Thank you
Probably you are having issues in your query and not in pandas. In order for this query to work, you'd have to do something like:
query = """
SELECT *
FROM crozzles.games
WHERE user_id IN UNNEST(['user1', 'user2', 'user3'])
"""
If you do not UNNEST your array then BigQuery cannot look for its inner values.
One thing you could do then is something like:
query = """
SELECT *
FROM crozzles.games
WHERE user_id IN UNNEST(%s)
""" %(map(str, users))
Should result in:
query = """SELECT *
FROM crozzles.games
WHERE user_id IN UNNEST(['5451709866311680', '4660301072957440', '6370791394377728', '5121933955825664', '4778500988862464', '5841867648270336', '4751430816628736', '4869137213947904', '5152642703556608', '6531810976595968', '4824167228637184', '6058117842337792', '5969360933879808', '4764494160986112', '5443041280131072', '4846257587617792', '5409371420884992', '6197117949313024', '6643644022915072', '5060273861820416'])
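Alternatively, if you'd rather not build the SQL string yourself, the google-cloud-bigquery client accepts named array parameters. A rough sketch (it assumes a crozzles.games table as above and an already authenticated client):
from google.cloud import bigquery

client = bigquery.Client()
query = """
SELECT *
FROM crozzles.games
WHERE user_id IN UNNEST(@users)
"""
job_config = bigquery.QueryJobConfig()
job_config.query_parameters = [
    bigquery.ArrayQueryParameter('users', 'STRING', [str(u) for u in users])
]
segment = client.query(query, job_config=job_config).result().to_dataframe()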
Here is one possibility using the open dataset bigquery-public-data.github_repos:
from numpy import array
import pandas as pd

PROJECT_ID = 'choose-your-project-id'

input_array = array(['JavaScript', 'Python', 'R'], dtype=object)

query = """
SELECT lang.name, COUNT(*) AS count
FROM `bigquery-public-data.github_repos.languages`, UNNEST(language) AS lang
WHERE lang.name IN UNNEST(@lang_names)
GROUP BY 1
ORDER BY 2 DESC;
"""

query_config = {
    'query': {
        'parameterMode': 'NAMED',
        'queryParameters': [
            {
                'name': 'lang_names',
                'parameterType': {'type': 'ARRAY',
                                  'arrayType': {'type': 'STRING'}},
                'parameterValue': {'arrayValues': [{'value': i} for i in input_array]}
            }
        ]
    }
}

result = pd.io.gbq.read_gbq(query, project_id=PROJECT_ID, dialect='standard',
                            configuration=query_config)

print(result.to_string())
Now this results in:
name count
0 JavaScript 1109499
1 Python 551257
2 R 29572
References:
https://cloud.google.com/bigquery/docs/reference/rest/v2/jobs/query#QueryRequest
https://cloud.google.com/bigquery/docs/reference/rest/v2/QueryParameter

Group by column to get array results in Postgresql

I have a table called moviegenre which looks like:
moviegenre:
- movie (FK movie.id)
- genre (FK genre.id)
I have a query (ORM generated) which returns all movie.imdb_id and genre.id pairs which have genre.id's in common with a given movie.imdb_id.
SELECT "movie"."imdb_id",
"moviegenre"."genre_id"
FROM "moviegenre"
INNER JOIN "movie"
ON ( "moviegenre"."movie_id" = "movie"."id" )
WHERE ( "movie"."imdb_id" IN (SELECT U0."imdb_id"
FROM "movie" U0
INNER JOIN "moviegenre" U1
ON ( U0."id" = U1."movie_id" )
WHERE ( U0."last_ingested_on" IS NOT NULL
AND NOT ( U0."imdb_id" IN
( 'tt0169547' ) )
AND NOT ( U0."imdb_id" IN
( 'tt0169547' ) )
AND U1."genre_id" IN ( 2, 10 ) ))
AND "moviegenre"."genre_id" IN ( 2, 10 ) )
The problem is that I'll get results in the format:
[
    ('imdbid22', 'genreid1'),
    ('imdbid22', 'genreid2'),
    ('imdbid44', 'genreid1'),
    ('imdbid55', 'genreid8'),
]
Is there a way, within the query itself, that I can group all of the genre ids into a list under the movie.imdb_id's? I'd like to do the grouping in the query, like below.
Currently I'm doing it in my web app code (Python), which is extremely slow when 50k+ rows are returned.
[
    ('imdbid22', ['genreid1', 'genreid2']),
    ('imdbid44', 'genreid1'),
    ('imdbid55', 'genreid8'),
]
thanks in advance!
edit:
here's the python code which runs against the current results
results_list = []
for item in movies_and_genres:
    genres_in_common = len(set([
        i['genre__id'] for i in movies_and_genres
        if i['movie__imdb_id'] == item['movie__imdb_id']
    ]))
    imdb_id = item['movie__imdb_id']
    if genres_in_common >= min_in_comon:
        result_item = {
            'movie.imdb_id': imdb_id,
            'count': genres_in_common
        }
        if result_item not in results_list:
            results_list.append(result_item)
return results_list
select m.imdb_id, array_agg(g.genre_id) as genre_id
from
moviegenre g
inner join
movie m on g.movie_id = m.id
where
m.last_ingested_on is not null
and not m.imdb_id in ('tt0169547')
and not m.imdb_id in ('tt0169547')
and g.genre_id in (2, 10)
group by m.imdb_id
array_agg will create an array of all the genre_ids of a certain imdb_id:
http://www.postgresql.org/docs/current/interactive/functions-aggregate.html#FUNCTIONS-AGGREGATE-TABLE
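On the Python side, Postgres arrays come back as ordinary lists, so no extra grouping loop is needed. A rough sketch with psycopg2 (the connection string is a placeholder, not from the original app):
import psycopg2

conn = psycopg2.connect("dbname=moviedb")  # hypothetical connection string
with conn.cursor() as cur:
    cur.execute("""
        select m.imdb_id, array_agg(g.genre_id) as genre_ids
        from moviegenre g
        inner join movie m on g.movie_id = m.id
        where m.last_ingested_on is not null
          and g.genre_id in (2, 10)
        group by m.imdb_id
    """)
    for imdb_id, genre_ids in cur.fetchall():
        # genre_ids is already a Python list, e.g. [2, 10]
        print(imdb_id, genre_ids)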
I hope this Python code will be fast enough:
movielist = [
    ('imdbid22', 'genreid1'),
    ('imdbid22', 'genreid2'),
    ('imdbid44', 'genreid1'),
    ('imdbid55', 'genreid8'),
]
movie_dict = {}
for items in movielist:
    if items[0] not in movie_dict:
        movie_dict[items[0]] = [items[1]]
    else:
        movie_dict[items[0]].append(items[1])
print(movie_dict)
Output:
{'imdbid44': ['genreid1'], 'imdbid55': ['genreid8'], 'imdbid22': ['genreid1', 'genreid2']}
If you just need the movie name and a count, change the SELECT list in the original query and add a GROUP BY; you will get the answer without any Python code:
SELECT "movie"."imdb_id", COUNT("moviegenre"."genre_id")
GROUP BY "movie"."imdb_id"
