After some processing I get the following array:
users = array([u'5451709866311680', u'4660301072957440', u'6370791394377728',
u'5121933955825664', u'4778500988862464', u'5841867648270336',
u'4751430816628736', u'4869137213947904', u'5152642703556608',
u'6531810976595968', u'4824167228637184', u'6058117842337792',
u'5969360933879808', u'4764494160986112', u'5443041280131072',
u'4846257587617792', u'5409371420884992', u'6197117949313024',
u'6643644022915072', u'5060273861820416'], dtype=object)
And then I would like to query these users in another table in BigQuery, but I'm having issues:
query = """
SELECT *
FROM games
WHERE user_id IN %users
"""
segment = pd.io.gbq.read_gbq(query, project_id='shared', dialect='standard')
Does anyone know how to proceed?
Thank you
You are probably having issues in your query, not in pandas. For this query to work, you'd have to do something like:
query = """
SELECT *
FROM crozzles.games
WHERE user_id IN UNNEST(['user1', 'user2', 'user3'])
"""
If you do not UNNEST your array, BigQuery cannot look for its inner values.
One thing you could do then is something like:
query = """
SELECT *
FROM crozzles.games
WHERE user_id IN UNNEST(%s)
""" %(map(str, users))
Should result in:
query = """SELECT *
FROM crozzles.games
WHERE user_id IN UNNEST(['5451709866311680', '4660301072957440', '6370791394377728', '5121933955825664', '4778500988862464', '5841867648270336', '4751430816628736', '4869137213947904', '5152642703556608', '6531810976595968', '4824167228637184', '6058117842337792', '5969360933879808', '4764494160986112', '5443041280131072', '4846257587617792', '5409371420884992', '6197117949313024', '6643644022915072', '5060273861820416'])
"""
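As a hedged sketch of that interpolation step (note that in Python 3 `map` returns an iterator, and the values need quoting; `crozzles.games` is just the example table name from above):

```python
from numpy import array

# stand-in for the users array above (truncated to two ids)
users = array([u'5451709866311680', u'4660301072957440'], dtype=object)

# build a quoted, comma-separated list and splice it into the query text
user_list = "', '".join(map(str, users))
query = """
SELECT *
FROM crozzles.games
WHERE user_id IN UNNEST(['%s'])
""" % user_list
print(query)
```

Query parameters (shown further below) are generally safer than string interpolation when the values come from user input.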
Here is one possibility using the open dataset bigquery-public-data.github_repos:
from numpy import array
import pandas as pd

PROJECT_ID = 'choose-your-project-id'

input_array = array(['JavaScript', 'Python', 'R'], dtype=object)

query = """
SELECT lang.name, COUNT(*) AS count
FROM `bigquery-public-data.github_repos.languages`, UNNEST(language) AS lang
WHERE lang.name IN UNNEST(@lang_names)
GROUP BY 1
ORDER BY 2 DESC;
"""

query_config = {
    'query': {
        'parameterMode': 'NAMED',
        'queryParameters': [
            {
                'name': 'lang_names',
                'parameterType': {'type': 'ARRAY',
                                  'arrayType': {'type': 'STRING'}},
                'parameterValue': {'arrayValues': [{'value': i} for i in input_array]}
            }
        ]
    }
}

result = pd.io.gbq.read_gbq(query, project_id=PROJECT_ID, dialect='standard',
                            configuration=query_config)
print(result.to_string())
Now this results in:
name count
0 JavaScript 1109499
1 Python 551257
2 R 29572
References:
https://cloud.google.com/bigquery/docs/reference/rest/v2/jobs/query#QueryRequest
https://cloud.google.com/bigquery/docs/reference/rest/v2/QueryParameter
I have a SQL query which retrieves coin names and submits an order for each coin.
However, it only submits an order for one coin and fails to loop through the rest; I'm not sure why that's happening.
import sys
**
import pandas as pd

postgreSQL_select_Query = "SELECT base,quote FROM instrument_static where exchange='ftx'"
cursor.execute(postgreSQL_select_Query)
row = [y for y in cursor.fetchall()]
for i in row:
    base = i[0]
    quote = i[1]
    portfolioItems = [
        {
            'exchange': 'ftx',
            'base': base,
            'quote': quote,
            'amount': 0.01,
        },
    ]

def init():
    username = us
    password = passwordVal
    initialise(clientId, clientSecret, us, password)

if __name__ == "__main__":
    init()
    result = construct_portfolio_with_params(us, portname, portfolioItems)
    print(result)
You need to initialize portfolioItems prior to the loop, and then you can add to it. Try replacing this snippet of code:
...
row = [y for y in cursor.fetchall()]
portfolioItems = []
for i in row:
    base = i[0]
    quote = i[1]
    portfolioItems.append(
        {
            'exchange': 'ftx',
            'base': base,
            'quote': quote,
            'amount': 0.01,
        }
    )
...
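A minimal, self-contained sketch of the fixed pattern, with hypothetical rows standing in for `cursor.fetchall()`:

```python
rows = [('BTC', 'USD'), ('ETH', 'USD')]  # hypothetical stand-in for cursor.fetchall()

portfolioItems = []  # initialized once, before the loop
for base, quote in rows:
    portfolioItems.append({
        'exchange': 'ftx',
        'base': base,
        'quote': quote,
        'amount': 0.01,
    })

print(len(portfolioItems))  # 2: one item per row, not just the last
```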
I have created a code to get users of my platform based on 2 things:
choiceTitle: search for a specific word contained in the title of an Ad that users of my platform have looked at. For eg, the Ad is "We are offering free Gin" and I want to get the word 'Gin'
PrimaryTagPreviousChoice: the Ad has a "Food and Drink" tag
I can get those users who are interested in Gin and Food and Drink with:
(df2['choiceTitle'].str.contains("(?i)Gin")) & (df2['PrimaryTagPreviousChoice'].str.contains("(?i)Food and Drink"))
What I'd like to do is create a function with all my code inside (i.e. the SQL query, the rename operation, the sort_values operation, etc.) and then use the input() function, so that when I run my code Python will ask me 2 questions:
choiceTitle? ... Gin
PrimaryTagPreviousChoice? ...Food and Drink.
I enter the 2 options and it gives me the users interested in, let's say, Gin and Food and Drink.
How can I do it?
MY CODE:
df = pd.read_sql_query(""" select etc..... """, con)
df1 = pd.read_sql_query(""" select etc..... """, con)
df1['user_id'] = df1['user_id'].apply(str)
df2 = pd.merge(df, df1, left_on='user_id', right_on='user_id', how='left')
tag = df2[
(df2['choiceTitle'].str.contains("(?i)Gin")) &
(df2['PrimaryTagPreviousChoice'].str.contains("(?i)Food and Drink"))
]
dw = tag[['user', 'title', 'user_category', 'email', 'last_login',
'PrimaryTagPreviousChoice', 'choiceTitle'
]].drop_duplicates()
dw = dw.sort_values(['last_login'], ascending=[False])
dw = dw[dw.last_login > dt.datetime.now() - pd.to_timedelta("30day")]
dw = dw.rename({'user': 'user full name', 'title': 'user title'}
, axis='columns')
dw.drop_duplicates(subset ="Email",
keep = 'first', inplace = True)
Adding a function in Python is simple. Just use the def keyword to declare the function, indent your existing code under it, and put the parameters in the parentheses.
Here is the updated code:
def GetUsers(title, tag):
    df = pd.read_sql_query(""" select etc..... """, con)
    df1 = pd.read_sql_query(""" select etc..... """, con)
    df1['user_id'] = df1['user_id'].apply(str)
    df2 = pd.merge(df, df1, left_on='user_id', right_on='user_id', how='left')
    matches = df2[
        (df2['choiceTitle'].str.contains("(?i)" + title)) &
        (df2['PrimaryTagPreviousChoice'].str.contains("(?i)" + tag))]
    dw = matches[['user', 'title', 'user_category', 'email', 'last_login',
                  'PrimaryTagPreviousChoice', 'choiceTitle'
                  ]].drop_duplicates()
    dw = dw.sort_values(['last_login'], ascending=[False])
    dw = dw[dw.last_login > dt.datetime.now() - pd.to_timedelta("30day")]
    dw = dw.rename({'user': 'user full name', 'title': 'user title'},
                   axis='columns')
    dw.drop_duplicates(subset="email", keep='first', inplace=True)
    return dw  # send back to print statement

# get input from user
inpTitle = input("choiceTitle? ")
inpTag = input("PrimaryTagPreviousChoice? ")

# run function
result = GetUsers(inpTitle, inpTag)
print(result)
Try this. Save your input() values as variables and use str.format to build your mask:
choiceTitle = input('choiceTitle? ')
PrimaryTagPreviousChoice = input('PrimaryTagPreviousChoice? ')
mask = df2[(df2['choiceTitle'].str.contains("(?i){0}".format(choiceTitle))) &
           (df2['PrimaryTagPreviousChoice'].str.contains("(?i){0}".format(PrimaryTagPreviousChoice)))]
dw = mask[['user', 'title', 'user_category', 'email', 'last_login',
'PrimaryTagPreviousChoice', 'choiceTitle'
]].drop_duplicates()
....
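A runnable sketch of that masking step with a tiny hypothetical frame in place of df2 (the `re.escape` call is an added safeguard in case the typed value contains regex metacharacters):

```python
import re
import pandas as pd

# hypothetical stand-in for df2
df2 = pd.DataFrame({
    'choiceTitle': ['We are offering free Gin', 'Buy one beer'],
    'PrimaryTagPreviousChoice': ['Food and Drink', 'Sports'],
})

choiceTitle = 'Gin'                          # value you would read from input()
PrimaryTagPreviousChoice = 'Food and Drink'  # value you would read from input()

mask = df2[(df2['choiceTitle'].str.contains("(?i)" + re.escape(choiceTitle))) &
           (df2['PrimaryTagPreviousChoice'].str.contains("(?i)" + re.escape(PrimaryTagPreviousChoice)))]
print(len(mask))  # 1: only the Gin / Food and Drink row matches
```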
I have a python script that is requesting data from the database through the code below:
from datetime import datetime
import re
global dateTimeObj
dateTimeObj = datetime.now()
path = db.collection(u'users').document(u'a#a.com')
doc_ref = path.collection(u'feedback').order_by(u'time_stamp').stream()
for doc in doc_ref:
    a = u'{} => {}'.format(doc.id, doc.to_dict())
    print(a)
output:
feedback_1 => {'feedback_sub_item': 'feedback sub item', 'feedback': 'feedback message', 'record_id': '111', 'cycle_id': 'normal', 'rating': 3.5, 'time_stamp': '02/28/2020 16:15:58'}
feedback_2 => {'feedback_sub_item': 'feedback sub item', 'feedback': 'feedback message', 'record_id': '112', 'cycle_id': 'normal', 'rating': 4, 'time_stamp': '02/28/2020 16:16:52'}
My question is: how can I extract the parameters from the last feedback, based on the time_stamp field?
My desired output is:
Feedback_number = feedback_2
feedback_sub_item = feedback sub item
feedback = feedback message
record_id = 112
cycle_id = normal
rating = 4
time_stamp = 02/28/2020 16:16:52
Thanks
To extract the dictionary, you can use the split function and then parse the dictionary literal (ast.literal_eval is a safer alternative to eval here). Then you can iterate through the dictionary with a 'for' loop to display the values it contains.
import ast

# a is your last feedback
new_list = a.split(" => ")
new_dict = ast.literal_eval(new_list[1])
print("feedback_number = ", new_list[0])
for key in new_dict:
    print(key, " = ", new_dict[key])
With this example you obtain:
feedback_number = feedback_2
feedback_sub_item = feedback sub item
feedback = feedback message
record_id = 112
cycle_id = normal
rating = 4
time_stamp = 02/28/2020 16:16:52
I hope that answers your question.
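If the stream order is not guaranteed, here is a hedged sketch of picking the latest entry by parsing `time_stamp` yourself (assuming the `MM/DD/YYYY HH:MM:SS` format shown above; the lines are hypothetical stand-ins):

```python
import ast
from datetime import datetime

# hypothetical feedback lines in the "name => dict" format shown above
lines = [
    "feedback_1 => {'record_id': '111', 'rating': 3.5, 'time_stamp': '02/28/2020 16:15:58'}",
    "feedback_2 => {'record_id': '112', 'rating': 4, 'time_stamp': '02/28/2020 16:16:52'}",
]

def parse(line):
    name, payload = line.split(" => ", 1)
    return name, ast.literal_eval(payload)

# pick the entry whose time_stamp parses as the latest datetime
name, fields = max((parse(l) for l in lines),
                   key=lambda nf: datetime.strptime(nf[1]['time_stamp'], '%m/%d/%Y %H:%M:%S'))
print(name)                  # feedback_2
print(fields['record_id'])   # 112
```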
Quite new to this Google BigQuery SQL thing, so please bear with me. I'm trying to build a Google standard SQL parameterized query. The following sample was used and ran successfully in the Google BigQuery web UI.
#standardSQL
WITH time AS
(
SELECT TIMESTAMP_MILLIS(timestamp) AS trans_time,
inputs.input_pubkey_base58 AS input_key,
outputs.output_pubkey_base58 AS output_key,
outputs.output_satoshis AS satoshis,
transaction_id AS trans_id
FROM `bigquery-public-data.bitcoin_blockchain.transactions`
JOIN UNNEST (inputs) AS inputs
JOIN UNNEST (outputs) AS outputs
WHERE inputs.input_pubkey_base58 = '1XPTgDRhN8RFnzniWCddobD9iKZatrvH4'
OR outputs.output_pubkey_base58 = '1XPTgDRhN8RFnzniWCddobD9iKZatrvH4'
)
SELECT input_key, output_key, satoshis, trans_id,
EXTRACT(DATE FROM trans_time) AS date
FROM time
WHERE trans_time >= '2010-05-21' AND trans_time <= '2010-05-23' AND satoshis >= 1000000000000
--ORDER BY date
Sample extracted from here as a side note.
This gives 131 rows:
Table sample
What I would like to be able to do is use ScalarQueryParameter, so I could programmatically use some vars along the way. Like this:
myquery = """
#standardSQL
WITH time AS
(
SELECT TIMESTAMP_MILLIS(timestamp) AS trans_time,
inputs.input_pubkey_base58 AS input_key,
outputs.output_pubkey_base58 AS output_key,
outputs.output_satoshis AS satoshis,
transaction_id AS trans_id
FROM `bigquery-public-data.bitcoin_blockchain.transactions`
JOIN UNNEST (inputs) AS inputs
JOIN UNNEST (outputs) AS outputs
WHERE inputs.input_pubkey_base58 = @pubkey
OR outputs.output_pubkey_base58 = @pubkey
)
SELECT input_key, output_key, satoshis, trans_id,
EXTRACT(DATE FROM trans_time) AS date
FROM time
WHERE trans_time >= @mdate AND trans_time <= @tdate AND satoshis >= 1000000000000
--ORDER BY date
"""
varInitDate = '2010-05-21'
varEndDate = '2010-05-23'
pubkey = '1XPTgDRhN8RFnzniWCddobD9iKZatrvH4'
query_params = [
bigquery.ScalarQueryParameter('mdate', 'STRING', varInitDate),
bigquery.ScalarQueryParameter('tdate', 'STRING', varEndDate),
bigquery.ScalarQueryParameter('pubkey', 'STRING', pubkey)
]
job_config = bigquery.QueryJobConfig()
job_config.query_parameters = query_params
query_job = client.query(myquery,job_config=job_config)
Nevertheless, I'm facing the following error:
<google.cloud.bigquery.table.RowIterator object at 0x7fa098be85f8>
Traceback...
TypeError: 'RowIterator' object is not callable
Can someone please enlighten me on how I can achieve the mentioned purpose?
P.S - '1XPTgDRhN8RFnzniWCddobD9iKZatrvH4' is the Laszlo’s Pizza 10.000 bitcoin exchange (1000000000000 satoshis).
So ... the problem was with this line of code: `query_job.result()` returns a `RowIterator`, which is not callable, so calling `results()` raises the `TypeError` above. Dropping the parentheses fixes it:
results = query_job.result()
df = results.to_dataframe()
And the actual code... Remember to replace with your own login credentials for this to work.
import datetime, time
from google.cloud import bigquery
from google.oauth2 import service_account
import pandas as pd
#login
credentials = service_account.Credentials.from_service_account_file('your.json')
project_id = 'your-named-project'
client = bigquery.Client(credentials= credentials,project=project_id)
#The query
q_input = """
#standardSQL
WITH time AS
(
SELECT TIMESTAMP_MILLIS(timestamp) AS trans_time,
inputs.input_pubkey_base58 AS input_key,
outputs.output_pubkey_base58 AS output_key,
outputs.output_satoshis AS satoshis,
transaction_id AS trans_id
FROM `bigquery-public-data.bitcoin_blockchain.transactions`
JOIN UNNEST (inputs) AS inputs
JOIN UNNEST (outputs) AS outputs
WHERE inputs.input_pubkey_base58 = @pubkey
OR outputs.output_pubkey_base58 = @pubkey
)
SELECT input_key, output_key, satoshis, trans_id,
EXTRACT(DATE FROM trans_time) AS date
FROM time
WHERE trans_time >= @mdate AND trans_time <= @tdate AND satoshis >= @satoshis
--ORDER BY date
"""
#The desired purpose
def runQueryTransaction(varInitDate, varEndDate, pubkey, satoshis):
    global df
    query_params = [
        bigquery.ScalarQueryParameter('mdate', 'STRING', varInitDate),
        bigquery.ScalarQueryParameter('tdate', 'STRING', varEndDate),
        bigquery.ScalarQueryParameter('pubkey', 'STRING', pubkey),
        bigquery.ScalarQueryParameter('satoshis', 'INT64', satoshis),
    ]
    job_config = bigquery.QueryJobConfig()
    job_config.query_parameters = query_params
    query_job = client.query(q_input, job_config=job_config)  # API request - starts the query
    results = query_job.result()  # Waits for job to complete.
    df = pd.DataFrame(columns=['input_key', 'output_key', 'satoshis', 'trans_id', 'date'])
    for row in results:
        df.loc[len(df)] = [row.input_key, row.output_key, row.satoshis, row.trans_id, row.date]
        #print("{} : {} : {} : {} : {}".format(row.input_key, row.output_key, row.satoshis, row.trans_id, row.date))
    return df

#runQueryTransaction(InitialDate,EndDate,WalletPublicKey,Satoshis)
runQueryTransaction('2010-05-21', '2010-05-23', '1XPTgDRhN8RFnzniWCddobD9iKZatrvH4', 1000000000000)
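As a side note, appending with `df.loc[len(df)]` grows the frame one row at a time, which gets slow for large results; a sketch of building it in one call (hypothetical rows stand in for the query output):

```python
import pandas as pd

# hypothetical rows standing in for the query results
rows = [
    ('key_a', 'key_b', 1000000000000, 'trans_1', '2010-05-21'),
    ('key_c', 'key_d', 1500000000000, 'trans_2', '2010-05-22'),
]

# build the frame in one call instead of appending row by row
df = pd.DataFrame(rows, columns=['input_key', 'output_key', 'satoshis', 'trans_id', 'date'])
print(len(df))  # 2
```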
Cheers
I have a table called moviegenre which looks like:
moviegenre:
- movie (FK movie.id)
- genre (FK genre.id)
I have a query (ORM-generated) which returns all movie.imdb_id's and genre.id's that have genre.id's in common with a given movie.imdb_id.
SELECT "movie"."imdb_id",
"moviegenre"."genre_id"
FROM "moviegenre"
INNER JOIN "movie"
ON ( "moviegenre"."movie_id" = "movie"."id" )
WHERE ( "movie"."imdb_id" IN (SELECT U0."imdb_id"
FROM "movie" U0
INNER JOIN "moviegenre" U1
ON ( U0."id" = U1."movie_id" )
WHERE ( U0."last_ingested_on" IS NOT NULL
AND NOT ( U0."imdb_id" IN
( 'tt0169547' ) )
AND NOT ( U0."imdb_id" IN
( 'tt0169547' ) )
AND U1."genre_id" IN ( 2, 10 ) ))
AND "moviegenre"."genre_id" IN ( 2, 10 ) )
The problem is that I'll get results in the format:
[
    ('imdbid22', 'genreid1'),
    ('imdbid22', 'genreid2'),
    ('imdbid44', 'genreid1'),
    ('imdbid55', 'genreid8'),
]
Is there a way, within the query itself, that I can group all of the genre ids into a list under the movie.imdb_id's? I'd like to do the grouping in the query.
Currently I'm doing it in my web app code (Python), which is extremely slow when 50k+ rows are returned.
[
    ('imdbid22', ['genreid1', 'genreid2']),
    ('imdbid44', 'genreid1'),
    ('imdbid55', 'genreid8'),
]
thanks in advance!
edit:
here's the python code which runs against the current results
results_list = []
for item in movies_and_genres:
    genres_in_common = len(set([
        i['genre__id'] for i in movies_and_genres
        if i['movie__imdb_id'] == item['movie__imdb_id']
    ]))
    imdb_id = item['movie__imdb_id']
    if genres_in_common >= min_in_comon:
        result_item = {
            'movie.imdb_id': imdb_id,
            'count': genres_in_common
        }
        if result_item not in results_list:
            results_list.append(result_item)
return results_list
select m.imdb_id, array_agg(g.genre_id) as genre_id
from
moviegenre g
inner join
movie m on g.movie_id = m.id
where
m.last_ingested_on is not null
and not m.imdb_id in ('tt0169547')
and not m.imdb_id in ('tt0169547')
and g.genre_id in (2, 10)
group by m.imdb_id
array_agg will create an array of all the genre_ids of a certain imdb_id:
http://www.postgresql.org/docs/current/interactive/functions-aggregate.html#FUNCTIONS-AGGREGATE-TABLE
I hope the Python code will be fast enough:
movielist = [
    ('imdbid22', 'genreid1'),
    ('imdbid22', 'genreid2'),
    ('imdbid44', 'genreid1'),
    ('imdbid55', 'genreid8'),
]

genre_map = {}
for items in movielist:
    if items[0] not in genre_map:
        genre_map[items[0]] = [items[1]]
    else:
        genre_map[items[0]].append(items[1])
print(genre_map)
Output:
{'imdbid44': ['genreid1'], 'imdbid55': ['genreid8'], 'imdbid22': ['genreid1', 'genreid2']}
If you just need (movie name, count), change this in the original query and you will get the answer without any Python code:
SELECT "movie"."imdb_id", count("moviegenre"."genre_id")
group by "movie"."imdb_id"