Data not matching when fetching from Google Analytics API (Python)

I'm writing a script to pull data from the Google Analytics Reporting API v4. The script runs fine. However, when I validate the results by comparing the GA interface with the data I fetch, I can see some discrepancies. The numbers aren't wildly different, but I don't understand why they aren't the same.
Just to mention that I'm using a dynamic segment in my script with the exact same condition as the segment I have in my GA view.
The segment just filters spam traffic by only including sessions where session duration > 1 sec.
Here is the structure I'm pulling:
body={
    "reportRequests": [
        {
            "viewId": view_id,
            "dimensions": [
                {"name": "ga:date"},
                {"name": "ga:sourceMedium"},
                {"name": "ga:campaign"},
                {"name": "ga:adContent"},
                {"name": "ga:channelGrouping"},
                {"name": "ga:segment"}
            ],
            "dateRanges": [
                {
                    "startDate": "2018-12-16",
                    "endDate": "2018-12-20"
                }
            ],
            "metrics": [{"expression": "ga:sessions", "alias": "sessions"}],
            "segments": [
                {
                    "dynamicSegment": {
                        "name": "sessions_no_spam",
                        "userSegment": {
                            "segmentFilters": [
                                {
                                    "simpleSegment": {
                                        "orFiltersForSegment": {
                                            "segmentFilterClauses": [
                                                {
                                                    "metricFilter": {
                                                        "metricName": "ga:sessionDuration",
                                                        "operator": "GREATER_THAN",
                                                        "comparisonValue": "1"
                                                    }
                                                }
                                            ]
                                        }
                                    }
                                }
                            ]
                        }
                    }
                }
            ]
        }
    ]
}).execute()
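(For context, the trailing ).execute() above is the tail of the v4 client's batchGet call. A minimal sketch of that wrapper, assuming the service object is built with google-api-python-client and an already-authorized credentials object:)

from googleapiclient.discovery import build

# `credentials` is assumed to be an already-authorized credentials object.
analytics = build('analyticsreporting', 'v4', credentials=credentials)

response = analytics.reports().batchGet(
    body={...}  # the reportRequests structure shown above
).execute()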
I'm not sure whether the answer to my question is more conceptual than technical, but just in case, I'm also including the function where I load the results into my database:
def print_results(no_spam_traffic):
    connection = psycopg2.connect(database='web_insights_data', user='XXXX', password='XXXXX', host='XXX', port='XXXXX')
    cursor = connection.cursor()
    for report in no_spam_traffic.get('reports', []):
        for row in report.get('data', {}).get('rows', []):
            gadate = row['dimensions'][0]
            gadate = gadate[0:4] + '/' + gadate[4:6] + '/' + gadate[6:8]
            gasourcemedium = row['dimensions'][1]
            gacampaign = row['dimensions'][2]
            gaadcontent = row['dimensions'][3]
            gachannel = row['dimensions'][4]
            gasessions = row['metrics'][0]['values'][0]
            cursor.execute("SELECT * from GA_no_spam_traffic where gadate = %s AND sourcemedium = %s AND campaign = %s AND adcontent = %s", (str(gadate), str(gasourcemedium), str(gacampaign), str(gaadcontent)))
            if len(cursor.fetchall()) > 0:  # update old entries
                cursor.execute("UPDATE GA_no_spam_traffic set sessions = %s where gadate = %s AND sourcemedium = %s AND campaign = %s AND adcontent = %s", (str(gasessions), str(gadate), str(gasourcemedium), str(gacampaign), str(gaadcontent)))
                connection.commit()
            else:  # insert new rows
                cursor.execute("INSERT INTO GA_no_spam_traffic (gadate,sourcemedium,campaign,adcontent,channel,sessions) VALUES (%s,%s,%s,%s,%s,%s)", (gadate, gasourcemedium, gacampaign, gaadcontent, gachannel, gasessions))
                connection.commit()
    connection.close()
Any ideas what the issue might be?
Thanks!!

I managed to improve it. It's still not exact, but the remaining discrepancy is acceptable. The problem was the page size: I was only getting the first page of results, so I increased the pageSize parameter.
Here's the link to the pagination section from a google guide: https://developers.google.com/analytics/devguides/reporting/core/v4/migration#pagination
Thanks
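For anyone hitting the same thing, a rough sketch of where pageSize and pageToken fit in a v4 request (the 10000 value is illustrative, the loop reuses the analytics service object from the sketch above, and the API's default page size is 1,000 rows):

request_body = {
    "reportRequests": [{
        "viewId": view_id,
        "dateRanges": [{"startDate": "2018-12-16", "endDate": "2018-12-20"}],
        "dimensions": [{"name": "ga:date"}],
        "metrics": [{"expression": "ga:sessions"}],
        "pageSize": 10000  # default is 1000 rows per page
    }]
}

rows = []
while True:
    response = analytics.reports().batchGet(body=request_body).execute()
    report = response['reports'][0]
    rows.extend(report.get('data', {}).get('rows', []))
    next_page = report.get('nextPageToken')
    if not next_page:
        break
    # ask for the next page on the following iteration
    request_body['reportRequests'][0]['pageToken'] = next_page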

Related

Python GraphQL query issue

I'm using Python to make requests to the Pipefy GraphQL API.
I've already read the documentation and searched the Pipefy forum, but
I can't figure out what is wrong with the query below:
pipeId = '171258'
query = """
{
    "query": "{allCards(pipeId: %s, first: 30, after: 'WyIxLjAiLCI1ODAuMCIsMzI0OTU0NF0'){pageInfo{endCursor hasNextPage}edges{node{id title}}}}"
}
""" % (pipeId)
The query worked pretty well until I added the after parameter.
I already tried variations like:
after: "WyIxLjAiLCI1ODAuMCIsMzI0OTU0NF0"
after: \"WyIxLjAiLCI1ODAuMCIsMzI0OTU0NF0\"
after: \n"WyIxLjAiLCI1ODAuMCIsMzI0OTU0NF0\n"
I know the issue is related to escaping, because the API returns messages like this:
'{"errors":[{"locations":[{"column":45,"line":1}],"message":"token recognition error at: \'\'\'"},{"locations":[{"column":77,"line":1}],"message":"token recognition error at: \'\'\'"}]}\n'
(this message is returned when the request is made with after: 'WyIxLjAiLCI1ODAuMCIsMzI0OTU0NF0')
Any help here would be immensely helpful!
Thanks
I had the same problem as you today (and saw your post on Pipefy's support page). I got in contact with Pipefy's developers, but they weren't helpful at all.
I solved it by escaping the query correctly.
Try like this:
query = '{"query": "{ allCards(pipeId: %s, first: 30, after: \\"WyIxLjAiLCI1ODAuMCIsMzI0OTU0NF0\\"){ pageInfo{endCursor hasNextPage } edges { node { id title } } } }"}'
Use single quotes to define the string, and double backslashes before the double quotes that wrap the cursor value.
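As a side note, if you'd rather not escape by hand, here's a sketch of the same payload built with the standard json module; json.dumps inserts the escaping for you (the pipe id and cursor are the ones from the question):

import json

# json.dumps escapes the inner double quotes, so the GraphQL document
# can be written as an ordinary Python string.
cursor = "WyIxLjAiLCI1ODAuMCIsMzI0OTU0NF0"
graphql = '{ allCards(pipeId: %s, first: 30, after: "%s") { pageInfo { endCursor hasNextPage } edges { node { id title } } } }' % (171258, cursor)
payload = json.dumps({"query": graphql})  # ready to POST with Content-Type: application/json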
With the code snippet below you can call the function get_card_list, passing the authentication token (as a string) and the pipe_id (as an integer), and retrieve the whole card list of your pipe.
The get_card_list function will keep calling request_card_list until hasNextPage is False, updating the cursor on each call.
import json
import requests
from datetime import datetime

# date_format is assumed to be defined elsewhere (a strptime format string for Pipefy timestamps)

# Function responsible for getting cards from a pipe using Pipefy's GraphQL API
def request_card_list(auth_token, pipe_id, hasNextPage=False, endCursor=""):
    url = "https://api.pipefy.com/graphql"
    headers = {
        'Content-Type': 'application/json',
        'Authorization': 'Bearer %s' % auth_token
    }
    if not hasNextPage:
        payload = '{"query": "{ allCards(pipeId: %i, first: 50) { edges { node { id title phases_history { phase { name } firstTimeIn lastTimeOut } } cursor } pageInfo { endCursor hasNextPage } } }"}' % pipe_id
    else:
        payload = '{"query": "{ allCards(pipeId: %i, first: 50, after: \\"%s\\") { edges { node { id title phases_history { phase { name } firstTimeIn lastTimeOut } } cursor } pageInfo { endCursor hasNextPage } } }"}' % (pipe_id, endCursor)
    response = requests.request("POST", url, data=payload, headers=headers)
    response_body = response.text
    response_body_dict = json.loads(response_body)
    response_dict_list = response_body_dict['data']['allCards']['edges']

    card_list = []
    for d in response_dict_list:
        for h in d['node']['phases_history']:
            h['firstTimeIn'] = datetime.strptime(h['firstTimeIn'], date_format)
            if h['lastTimeOut']:
                h['lastTimeOut'] = datetime.strptime(h['lastTimeOut'], date_format)
        card_list.append(d['node'])

    return_list = [card_list, response_body_dict['data']['allCards']['pageInfo']['hasNextPage'], response_body_dict['data']['allCards']['pageInfo']['endCursor']]
    return return_list

# Function responsible for getting all cards from a pipe using Pipefy's GraphQL API and pagination
def get_card_list(auth_token, pipe_id):
    card_list = []
    response = request_card_list(auth_token, pipe_id)
    card_list = card_list + response[0]
    while response[1]:
        response = request_card_list(auth_token, pipe_id, response[1], response[2])
        card_list = card_list + response[0]
    return card_list
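A quick usage sketch of the two functions above (the token is a placeholder; the pipe id is the one from the question):

# Placeholder token for illustration only; the pipe id comes from the question above.
cards = get_card_list("YOUR_PIPEFY_API_TOKEN", 171258)
print(len(cards), "cards retrieved")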
Thanks to Lodi's answer, I was able to take the next step:
how to pass the "after" parameter to the query from a variable.
As it was quite tricky, I decided to share it here for those facing the same challenge.
end_cursor = 'WyIxLjAiLCI2NTcuMCIsNDgwNDA2OV0'
end_cursor = "\\" + "\"" + end_cursor + "\\" + "\""
# desired output: end_cursor = '\"WyIxLjAiLCI2NTcuMCIsNDgwNDA2OV0\"'

query = """
{
    "query": "{allCards(pipeId: %s, first: 50, after: %s){pageInfo{endCursor hasNextPage}edges{node{id title}}}}"
}
""" % (pipeId, end_cursor)

Connect to Redshift from Lambda and fetch some records using Python

Is anyone able to successfully connect to Redshift from Lambda?
I want to fetch some records from a Redshift table and feed them to my bot (AWS Lex).
Please suggest: this code works outside Lambda; how do I make it work inside Lambda?
import psycopg2

con = psycopg2.connect(dbname='qa', host='name',
                       port='5439', user='dwuser', password='1234567')
cur = con.cursor()
cur.execute("SELECT * FROM pk.fact limit 4;")
for result in cur:
    print(result)
cur.close()
con.close()
Here is a Node.js Lambda that works for connecting to Redshift and pulling data from it.
exports.handler = function(event, context, callback) {
    var response = {
        status: "SUCCESS",
        errors: [],
        response: {},
        verbose: {}
    };

    var client = new pg.Client(connectionString);
    client.connect(function(err) {
        if (err) {
            callback('Could not connect to RedShift ' + JSON.stringify(err));
        } else {
            client.query(sql.Sql, function(err, result) {
                client.end();
                if (err) {
                    callback('Error Cleaning up Redshift' + err);
                } else {
                    callback(null, ' Good ' + JSON.stringify(result));
                }
            });
        }
    });
};
Hope it helps.
You need to fetch the records first.
results = cur.fetchall()
for result in results:
    ...
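For the Lambda part of the question, here's a rough sketch of the same psycopg2 code wrapped in a Python handler. It assumes a Lambda-compatible psycopg2 build is bundled in the deployment package (or a layer), that the function can actually reach the cluster (VPC/networking), and that the credentials from the question are supplied via environment variables; the variable names below are made up for illustration:

import os
import psycopg2

def lambda_handler(event, context):
    # Connection details are assumed to come from (hypothetical) environment variables.
    con = psycopg2.connect(
        dbname=os.environ['REDSHIFT_DB'],
        host=os.environ['REDSHIFT_HOST'],
        port=os.environ.get('REDSHIFT_PORT', '5439'),
        user=os.environ['REDSHIFT_USER'],
        password=os.environ['REDSHIFT_PASSWORD'],
    )
    try:
        cur = con.cursor()
        cur.execute("SELECT * FROM pk.fact limit 4;")
        rows = cur.fetchall()
        cur.close()
    finally:
        con.close()
    # Hand the rows back to the caller (e.g. the Lex fulfillment logic).
    return {"row_count": len(rows), "rows": [list(r) for r in rows]}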

elasticsearch scrolling using python client

When scrolling in Elasticsearch, it is important to provide the latest scroll_id with each scroll request:
The initial search request and each subsequent scroll request returns
a new scroll_id — only the most recent scroll_id should be used.
The following example (taken from here) puzzles me. First, the scrolling initialization:
rs = es.search(index=['tweets-2014-04-12', 'tweets-2014-04-13'],
               scroll='10s',
               search_type='scan',
               size=100,
               preference='_primary_first',
               body={
                   "fields": ["created_at", "entities.urls.expanded_url", "user.id_str"],
                   "query": {
                       "wildcard": {"entities.urls.expanded_url": "*.ru"}
                   }
               }
               )
sid = rs['_scroll_id']
and then the looping:
tweets = []
while (1):
    try:
        rs = es.scroll(scroll_id=sid, scroll='10s')
        tweets += rs['hits']['hits']
    except:
        break
It works, but I don't see where sid is updated... I believe that it happens internally, in the python client; but I don't understand how it works...
This is an old question, but for some reason it came up first when I searched for "elasticsearch python scroll". The Python client provides a helper method that does all the work for you. It is a generator function that returns each document to you while managing the underlying scroll IDs.
https://elasticsearch-py.readthedocs.io/en/master/helpers.html#scan
Here is an example of usage:
from elasticsearch import Elasticsearch
from elasticsearch.helpers import scan

query = {
    "query": {"match_all": {}}
}

es = Elasticsearch(...)
for hit in scan(es, index="my-index", query=query):
    print(hit["_source"]["field"])
Using Python requests:
import requests
import json

elastic_url = 'http://localhost:9200/my_index/_search?scroll=1m'
scroll_api_url = 'http://localhost:9200/_search/scroll'
headers = {'Content-Type': 'application/json'}

payload = {
    "size": 100,
    "sort": ["_doc"],
    "query": {
        "match": {
            "title": "elasticsearch"
        }
    }
}

r1 = requests.request(
    "POST",
    elastic_url,
    data=json.dumps(payload),
    headers=headers
)

# first batch of data
try:
    res_json = r1.json()
    data = res_json['hits']['hits']
    _scroll_id = res_json['_scroll_id']
except KeyError:
    data = []
    _scroll_id = None
    print('Error: Elastic Search: %s' % str(r1.json()))

while data:
    print(data)
    # scroll to get the next batch of data
    scroll_payload = json.dumps({
        'scroll': '1m',
        'scroll_id': _scroll_id
    })
    scroll_res = requests.request(
        "POST", scroll_api_url,
        data=scroll_payload,
        headers=headers
    )
    try:
        res_json = scroll_res.json()
        data = res_json['hits']['hits']
        _scroll_id = res_json['_scroll_id']
    except KeyError:
        data = []
        _scroll_id = None
        err_msg = 'Error: Elastic Search Scroll: %s'
        print(err_msg % str(scroll_res.json()))
Reference: https://www.elastic.co/guide/en/elasticsearch/reference/current/search-request-scroll.html#search-request-scroll
In fact the code has a bug in it - in order to use the scroll feature correctly you are supposed to use the new scroll_id returned with each new call in the next call to scroll(), not reuse the first one:
Important
The initial search request and each subsequent scroll request returns
a new scroll_id — only the most recent scroll_id should be used.
http://www.elasticsearch.org/guide/en/elasticsearch/reference/current/search-request-scroll.html
It's working because Elasticsearch does not always change the scroll_id between calls, and for smaller result sets it can return the same scroll_id as was originally returned for some time. This discussion from last year is between two other users seeing the same issue, with the same scroll_id being returned for a while:
http://elasticsearch-users.115913.n3.nabble.com/Distributing-query-results-using-scrolling-td4036726.html
So while your code is working for a smaller result set it's not correct - you need to capture the scroll_id returned in each new call to scroll() and use that for the next call.
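In other words, the loop from the question should carry the scroll_id forward on every iteration. A minimal sketch along those lines (the match_all body is just a placeholder, and the old search_type='scan' parameter is left out):

tweets = []
rs = es.search(index=['tweets-2014-04-12', 'tweets-2014-04-13'],
               scroll='10s', size=100,
               body={"query": {"match_all": {}}})  # placeholder query
sid = rs['_scroll_id']
tweets += rs['hits']['hits']  # the first page comes back with the search itself
while True:
    rs = es.scroll(scroll_id=sid, scroll='10s')
    hits = rs['hits']['hits']
    if not hits:
        break
    tweets += hits
    sid = rs['_scroll_id']  # always use the most recent scroll_id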
self._elkUrl = "http://Hostname:9200/logstash-*/_search?scroll=1m"
self._scrollUrl = "http://Hostname:9200/_search/scroll"

def GetDataFromELK(self):
    """
    Function to get the data from ELK through the scrolling mechanism
    """
    # implementing scroll and retrieving data from ELK to get more than 100,000 records in one search
    # ref: https://www.elastic.co/guide/en/elasticsearch/reference/6.8/search-request-scroll.html
    try:
        dataFrame = pd.DataFrame()
        if self._elkUrl is None:
            raise ValueError("_elkUrl is missing")
        if self._username is None:
            raise ValueError("_username for elk is missing")
        if self._password is None:
            raise ValueError("_password for elk is missing")
        response = requests.post(self._elkUrl, json=self.body, auth=(self._username, self._password))
        response = response.json()
        if response is None:
            raise ValueError("response is missing")
        sid = response['_scroll_id']
        hits = response['hits']
        total = hits["total"]
        if total is None:
            raise ValueError("total hits from ELK is none")
        total_val = int(total['value'])
        url = self._scrollUrl
        if url is None:
            raise ValueError("scroll url is missing")
        # start scrolling
        while total_val > 0:
            # keep the search context alive for 2m
            scroll = '2m'
            scroll_query = {"scroll": scroll, "scroll_id": sid}
            response1 = requests.post(url, json=scroll_query, auth=(self._username, self._password))
            response1 = response1.json()
            # The result from the above request includes a scroll_id, which should be passed to the scroll API in order to retrieve the next batch of results
            sid = response1['_scroll_id']
            hits = response1['hits']
            data = response1['hits']['hits']
            if len(data) > 0:
                cleanDataFrame = self.DataClean(data)
                dataFrame = dataFrame.append(cleanDataFrame)
            total_val = len(response1['hits']['hits'])
            num = len(dataFrame)
            print('Total records received from ELK =', num)
        return dataFrame
    except Exception as e:
        logging.error('Error while getting the data from elk', exc_info=e)
        sys.exit()
from elasticsearch import Elasticsearch

elasticsearch_user_name = 'es_username'
elasticsearch_user_password = 'es_password'
es_index = "es_index"

es = Elasticsearch(["127.0.0.1:9200"],
                   http_auth=(elasticsearch_user_name, elasticsearch_user_password))

query = {
    "query": {
        "bool": {
            "must": [
                {
                    "range": {
                        "es_datetime": {
                            "gte": "2021-06-21T09:00:00.356Z",
                            "lte": "2021-06-21T09:01:00.356Z",
                            "format": "strict_date_optional_time"
                        }
                    }
                }
            ]
        }
    },
    "fields": [
        "*"
    ],
    "_source": False,
    "size": 2000,
}

resp = es.search(index=es_index, body=query, scroll="1m")
old_scroll_id = resp['_scroll_id']
results = resp['hits']['hits']

while len(results):
    for i, r in enumerate(results):
        # do something with the data
        pass
    result = es.scroll(
        scroll_id=old_scroll_id,
        scroll='1m'  # length of time to keep the search context alive
    )
    # check if there's a new scroll ID
    if old_scroll_id != result['_scroll_id']:
        print("NEW SCROLL ID:", result['_scroll_id'])
    # keep track of the most recent scroll_id
    old_scroll_id = result['_scroll_id']
    results = result['hits']['hits']

Querying ElasticSearch with Python Requests not working fine

I'm trying to do full-text search on a MongoDB database with the Elasticsearch engine, but I ran into a problem: no matter what search term I provide (or whether I use query1 or query2), the engine always returns the same results. I think the problem is in the way I make the requests, but I don't know how to solve it.
Here is the code:
def search(search_term):
    query1 = {
        "fuzzy": {
            "art_text": {
                "value": search_term,
                "boost": 1.0,
                "min_similarity": 0.5,
                "prefix_length": 0
            }
        },
        "filter": {
            "range": {
                "published": {
                    "from": "20130409T000000",
                    "to": "20130410T235959"
                }
            }
        }
    }
    query2 = {
        "match_phrase": {"art_text": search_term}
    }
    es_query = json.dumps(query1)
    uri = 'http://localhost:9200/newsidx/_search'
    r = requests.get(uri, params=es_query)
    results = json.loads(r.text)
    data = [res['_source']['api_id'] for res in results['hits']['hits']]
    print "results: %d" % len(data)
    pprint(data)
The params parameter is not for data being sent. If you're trying to send data to the server you should specifically be using the data parameter. If you're trying to send query parameters, then you shouldn't be JSON-encoding them and just give it to params as a dict.
I suspect your first request should be the following:
r = requests.get(uri, data=es_query)
And before someone downvotes me, yes the HTTP/1.1 spec allows data to be sent with GET requests and yes requests does support it.
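For illustration, the same call with the query sent as the request body; the second form uses requests' json= keyword, which serializes the dict and sets the Content-Type header for you (query1 and the URI are the ones from the question):

import json
import requests

uri = 'http://localhost:9200/newsidx/_search'

# Send the query as the request body rather than as URL parameters.
r = requests.get(uri, data=json.dumps(query1),
                 headers={'Content-Type': 'application/json'})

# Equivalent, letting requests handle serialization and the header.
r = requests.get(uri, json=query1)

results = r.json()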
search = {'query': {'match': {'test_id': 13}}, 'sort': {'date_utc': {'order': 'desc'}}}
data = requests.get('http://localhost:9200/newsidx/test/_search?&pretty', params=search)
print(data.json())
http://docs.python-requests.org/en/latest/user/quickstart/

Freebase + GoogleAPI query returns errors when query results are over a certain amount

I'm trying to query all US counties and their geolocation (longitude + latitude) on Freebase. I've noticed that sometimes the query will work, but on other tries it returns the following: <"HttpError 503 when requesting...returned "Backend Error">.
I've tried changing the query result limits, and what I've found is that the limit at which my query breaks down varies; sometimes it works when "limit":2900, and sometimes it returns the above-mentioned error at "limit":1200.
Here's the code I've written so far:
from itertools import islice
from apiclient import discovery
from apiclient import model
import json
import matplotlib.pyplot as plt
from CREDENTIALS import FREEBASE_KEY
from pandas import DataFrame, Series

DEVELOPER_KEY = FREEBASE_KEY
model.JsonModel.alt_param = ""
freebase = discovery.build('freebase', 'v1', developerKey=DEVELOPER_KEY)

query_json = """
[{
    "id": null,
    "name": null,
    "/location/us_county/fips_6_4_code": [],
    "/location/location/geolocation": {
        "latitude": null,
        "longitude": null
    },
    "limit": 3050
}]""".replace("\n", " ")

query = json.loads(query_json)
response = json.loads(freebase.mqlread(query=json.dumps(query)).execute())

results = list()
for result in islice(response['result'], None):
    results.append({'id': result['id'],
                    'name': result['name'],
                    'latitude': float(result['/location/location/geolocation']['latitude']),
                    'longitude': float(result['/location/location/geolocation']['longitude']),
                    'fips': result['/location/us_county/fips_6_4_code'],
                    })

states = DataFrame(results)
plt.scatter(states["longitude"], states["latitude"])
It doesn't seem like a quota issue, and others have noted a similar issue on the Freebase mailing list: http://lists.freebase.com/pipermail/freebase-discuss/2011-December/007710.html
But this was for another type of data, so it seems like their solution isn't applicable to what I'm working on.
[EDIT]
I used a cursor to iterate through the data, and it worked fine. Here's the final code I used:
from itertools import islice
from apiclient import discovery
from apiclient import model
import json
import matplotlib.pyplot as plt
from CREDENTIALS import FREEBASE_KEY
from pandas import DataFrame, Series

DEVELOPER_KEY = FREEBASE_KEY
model.JsonModel.alt_param = ""
freebase = discovery.build('freebase', 'v1', developerKey=DEVELOPER_KEY)

query = [{
    "id": None,
    "name": None,
    "type": "/location/us_county",
    "/location/location/geolocation": {
        "latitude": None,
        "longitude": None
    }
}]

results = []
count = 0

def do_query(cursor=""):
    response = json.loads(freebase.mqlread(query=json.dumps(query), cursor=cursor).execute())
    for result in islice(response['result'], None):
        results.append({'id': result['id'],
                        'name': result['name'],
                        'latitude': result['/location/location/geolocation']['latitude'],
                        'longitude': result['/location/location/geolocation']['longitude'],
                        })
    return response.get("cursor")

cursor = do_query()
while cursor:
    cursor = do_query(cursor)
    # Check how many iterations this loop has gone through.
    # print(count)
    count += 1

# Plug the results into a pandas DataFrame and plot.
states = DataFrame(results)
plt.scatter(states["longitude"], states["latitude"])
It's a relatively simple query, but to put it in perspective, the default limit is 100, which is a LOT lower than what you're asking for. I'd suggest using a lower limit and a cursor to page through the results (and filing a bug report, because it shouldn't be returning a generic "Backend Error" but some kind of MQL-specific error).
Here's some sample code to show you how to iterate through the results with a cursor:
cursor = ''
while cursor != False:
    response = json.loads(freebase.mqlread(query=json.dumps(query), cursor=cursor).execute())
    for county in response['result']:
        print(county['name'])
    cursor = response['cursor']
Just leave the limit clause out of your query and it will iterate through the whole list of counties in batches of 100 results.
