Python: performance issue with MySQL and Google Play subscription

I am trying to extract the purchase_state from Google Play, following the steps below:
import base64
import requests
import smtplib
from collections import OrderedDict
import mysql.connector
from mysql.connector import errorcode
......
Query the database, which returns thousands of rows containing a purchase_id field from my table.
For each row returned, extract the purchase_id and query Google Play with it. For example, if the previous step returns 1000 rows, Google is queried 1000 times (refresh token + query).
Add a purchase status field from Google Play to a new dictionary, alongside some other fields grabbed from the MySQL query.
The last step is looping over my dict as follows and preparing the desired report.
After edit:
def build_dic_from_db(data, access_token):
    dic = {}
    for row in data:
        product_id = row['product_id']
        purchase_id = row['purchase_id']
        status = check_purchase_status(access_token, product_id, purchase_id)
        cnt = 1
        if row['user'] not in dic:
            dic[row['user']] = {'id': row['user_id'],
                                'country': row['country_name'],
                                'reg_ts': row['user_registration_timestamp'],
                                'last_active_ts': row['user_last_active_action_timestamp'],
                                'total_credits': row['user_credits'],
                                'total_call_sec_this_month': row['outgoing_call_seconds_this_month'],
                                'user_status': row['user_status'],
                                'mobile': row['user_mobile_phone_number_num'],
                                'plus': row['user_assigned_msisdn_num'],
                                row['product_id']: {'tAttemp': cnt, 'tCancel': status}}
        else:
            if row['product_id'] not in dic[row['user']]:
                dic[row['user']][row['product_id']] = {'tAttemp': cnt, 'tCancel': status}
            else:
                dic[row['user']][row['product_id']]['tCancel'] += status
                dic[row['user']][row['product_id']]['tAttemp'] += cnt
    return dic
The problem is that my code runs slowly (Total execution time: 448.7483880519867), and I am wondering if there is a way to improve my script. Any suggestions?

I hope I'm right about this, but the bottleneck seems to be the connection to the Play Store. Doing it sequentially will take a long time, whereas the server can handle a huge number of requests at a time. So here's a way to process your jobs with executors (concurrent.futures is in the standard library on Python 3; on Python 2 you need the futures backport package).
In that example, you'll be able to send 100 requests at the same time.
from concurrent import futures

EXECUTORS = futures.ThreadPoolExecutor(max_workers=100)

jobs = dict()
for row in data:
    product_id = row['product_id']
    purchase_id = row['purchase_id']
    job = EXECUTORS.submit(check_purchase_status,
                           access_token, product_id, purchase_id)
    jobs[job] = row

for job in futures.as_completed(jobs.keys()):
    # Here, collect your results and do something useful with them :)
    status = job.result()
    # Make the connection with the current row
    row = jobs[job]
    # Now you have both the status and the row
And by the way, try to use temporary variables; otherwise you're constantly accessing your dictionary with the same keys, which is bad for both performance and readability of your code.
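For example, here is a rough sketch of that advice applied to the loop from the question (same field names as above; behaviour is intended to be equivalent, but treat it as illustrative rather than tested):
def build_dic_from_db(data, access_token):
    dic = {}
    for row in data:
        user = row['user']                 # read each key once
        product_id = row['product_id']
        status = check_purchase_status(access_token, product_id, row['purchase_id'])

        # setdefault() creates the entry only on the first encounter,
        # so the if/else branching above collapses into two lookups.
        user_entry = dic.setdefault(user, {'id': row['user_id'],
                                           'country': row['country_name'],
                                           # ... the other per-user fields from the question ...
                                           })
        product_entry = user_entry.setdefault(product_id, {'tAttemp': 0, 'tCancel': 0})
        product_entry['tAttemp'] += 1
        product_entry['tCancel'] += status
    return dic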

Related

Batch create campaigns via Facebook ads API with Python?

I'm trying to build an API tool for creating 100+ campaigns at a time, but so far I keep running into timeout errors. I have a feeling it's because I'm not doing this as a batch/async request, but I can't seem to find straightforward instructions specifically for batch creating campaigns in Python. Any help would be GREATLY appreciated!
I have all the campaign details prepped and ready to go in a Google sheet, which my script then reads (using pygsheets) and attempts to create the campaigns. Here's what it looks like so far:
from facebookads.adobjects.campaign import Campaign
from facebookads.adobjects.adaccount import AdAccount
from facebookads.api import FacebookAdsApi
from facebookads.exceptions import FacebookRequestError
import time
import pygsheets
FacebookAdsApi.init(access_token=xxx)
gc = pygsheets.authorize(service_file='xxx/client_secret.json')
sheet = gc.open('Campaign Prep')
tab1 = sheet.worksheet_by_title('Input')
tab2 = sheet.worksheet_by_title('Output')
# gets range size, offsetting it by 1 to account for the range starting on row 2
row_range = len(tab1.get_values('A1', 'A', returnas='matrix', majdim='ROWS', include_empty=False))+1
# finds first empty row in the output sheet
start_row = len(tab2.get_values('A1', 'A', returnas='matrix', majdim='ROWS', include_empty=False))
def create_campaigns(row):
    campaign = Campaign(parent_id=row[6])
    campaign.update({
        Campaign.Field.name: row[7],
        Campaign.Field.objective: row[9],
        Campaign.Field.buying_type: row[10],
    })
    c = campaign.remote_create(params={'status': Campaign.Status.active})
    camp_name = c['name']
    camp_id = 'cg:' + c['id']
    return camp_name, camp_id

r = start_row
# there's a header so I have the range starting at 2
for x in range(2, int(row_range)):
    r += 1
    row = tab1.get_row(x)
    camp_name, camp_id = create_campaigns(row)
    # pastes the generated campaign ID, campaign name and account id back into the sheet
    tab2.update_cells('A' + str(r) + ':C' + str(r),
                      [[camp_id, camp_name, row[6].rsplit('_', 1)[1]]])
I've tried wrapping this in a try/except block that does time.sleep(5) on a FacebookRequestError and then keeps trying, but I'm still running into timeout errors every 5-10 rows it loops through. When it doesn't time out it does work; I guess I just need to figure out a way to make this handle big batches of campaigns more efficiently.
Any thoughts? I'm new to the Facebook API and I'm still a relative newb at Python, but I find this stuff so much fun! If anyone has any advice for how this script could be better (as well as general Python advice), I'd love to hear it! :)
Can you post the actual error message?
It sounds like what you are describing is that you hit the rate limits after making a certain number of calls. If that is so, time.sleep(5) won't be enough. The rate score decays over time and is reset after 5 minutes (https://developers.facebook.com/docs/marketing-api/api-rate-limiting). In that case I would suggest sleeping between each call instead. However, a better option would be to upgrade your API access level. If you hit the rate limits this fast, I assume you are on the Developer level. Try upgrading first to Basic and then Standard and you should not have these problems (https://developers.facebook.com/docs/marketing-api/access).
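To illustrate the "sleep between calls" idea, here is a rough retry sketch around the create_campaigns function from the question; the retry count and the starting delay are arbitrary placeholders you would tune to your actual rate limit:
import time
from facebookads.exceptions import FacebookRequestError

def create_campaigns_with_backoff(row, max_retries=5, initial_delay=30):
    """Retry create_campaigns() with an increasing pause when Facebook rate-limits us."""
    delay = initial_delay  # seconds; the rate score decays over roughly 5 minutes
    for attempt in range(max_retries):
        try:
            return create_campaigns(row)   # the function from the question
        except FacebookRequestError:
            time.sleep(delay)
            delay *= 2                     # back off harder on each failure
    raise RuntimeError('Still rate limited after %d retries' % max_retries)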
Also, as you mention, utilizing Facebook's batch request API could be a good idea. https://developers.facebook.com/docs/marketing-api/asyncrequests/v2.11
Here is a thread with examples of the Batch API working with the Python SDK: https://github.com/facebook/facebook-python-ads-sdk/issues/116
I'm pasting the code snippet (copied from the last link that @reaktard posted), credit to GitHub user @williardx. It helped me a lot in my development.
# ----------------------------------------------------------------------------
# Helper functions

def generate_batches(iterable, batch_size_limit):
    # This function can be found in examples/batch_utils.py
    batch = []
    for item in iterable:
        if len(batch) == batch_size_limit:
            yield batch
            batch = []
        batch.append(item)
    if len(batch):
        yield batch

def success_callback(response):
    batch_body_responses.append(response.body())

def error_callback(response):
    # Error handling here
    pass

# ----------------------------------------------------------------------------

batches = []
batch_body_responses = []
api = FacebookAdsApi.init(your_app_id, your_app_secret, your_access_token)

for ad_set_list in generate_batches(ad_sets, batch_limit):
    next_batch = api.new_batch()
    requests = [ad_set.get_insights(pending=True) for ad_set in ad_set_list]
    for req in requests:
        next_batch.add_request(req, success_callback, error_callback)
    batches.append(next_batch)

for batch_request in batches:
    batch_request.execute()
    time.sleep(5)

print batch_body_responses

web2py Scheduled task to recreate (reset) database

I am dealing with a cron job that produces a text file with 9000 lines of device names.
The job recreates the file every day with an updated list from a network crawler in our domain.
What I was running into is that, with the following worker running my import into the database, the db.[name].id kept growing, using the method below:
scheduler.py
# -*- coding: utf-8 -*-
from gluon.scheduler import Scheduler

def demo1():
    db(db.asdf.id > 0).delete()
    db.commit()
    with open('c:\(project)\devices.list') as f:
        content = f.readlines()
        for line in content:
            db.asdf.insert(asdf=line)
    db.commit()

mysched = Scheduler(db, tasks=dict(demo1=demo1))
default.py (initial kickoff)
#auth.requires_membership('!Group-IS_MASTER')
def rgroup():
    mysched.queue_task('demo1', start_time=request.now, stop_time=None,
                       prevent_drift=True, repeats=0, period=86400)
    return 'you are member of a group!'
So the next time the job kicked off, it would start at db.[name].id = 9001, and every day the ID number would grow by 9000 or so, depending on what the crawler returned. It just looked sloppy, and I didn't want to run into issues years down the road with database limitations I don't know about.
(I'm a DB newb (I know, I don't know stuff))
SOOOOOOO.....
This is what I came up with, and I don't know whether it is best practice or not. An issue I ran into when using db.[name].drop() in the same function that creates the entries is that the table no longer existed and my job status went to 'FAILED'. So I define the table in the job; see below:
scheduler.py
from gluon.scheduler import Scheduler

def demo1():
    db.asdf.drop()      # <===== Kill db.asdf
    db.commit()         # <===== Commit kill
    db.define_table('asdf', Field('asdf'), auth.signature)  # <==== Phoenix rebirth!!!
    with open('c:\(project)\devices.list') as f:
        content = f.readlines()
        for line in content:
            db.asdf.insert(asdf=line)
    db.commit()         # <=========== Magic

mysched = Scheduler(db, tasks=dict(demo1=demo1))
Regarding the "Phoenix rebirth" line in the comments of the code above: is that the best way to achieve my goal?
It starts my IDs back at 1, which is what I want, but is that how I should be going about it?
Thanks!
P.S. Forgive the Windows directory structure in my example; my current non-prod sandbox is my Windows workstation. :(
Why wouldn't you check whether the line is already present prior to inserting its corresponding record?
...
with open('c:\(project)\devices.list') as f:
    content = f.readlines()
    for line in content:
        # distinguishing t_ for tables and f_ for fields
        db_matching_entries = db(db.t_asdf.f_asdf == line).select()
        if len(db_matching_entries) == 0:
            db.t_asdf.insert(f_asdf=line)
        else:
            # here you could update your record, just in case ;-)
            pass
    db.commit()  # <=========== Magic
I've got a similar process that takes a few seconds to complete with 2k-3k entries. Yours should not take longer than half a minute.
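On the original question about resetting the id counter: instead of dropping and redefining the table inside the task, the DAL's truncate() empties the table while keeping its definition. Whether the counter restarts depends on the backend (MySQL's TRUNCATE TABLE resets AUTO_INCREMENT; check how your adapter behaves), so treat this as a sketch:
from gluon.scheduler import Scheduler

def demo1():
    db.asdf.truncate()   # empty the table, keep its definition
    db.commit()
    with open('c:\(project)\devices.list') as f:
        for line in f.readlines():
            db.asdf.insert(asdf=line)
    db.commit()

mysched = Scheduler(db, tasks=dict(demo1=demo1))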

Python Facebook API - cursor pagination

My question involves learning how to retrieve my entire list of friends using Facebook's Python API. The current result returns an object with a limited number of friends and a link to the 'next' page. How do I use this to fetch the next set of friends? (Please post links to possible duplicates.) Any help would be much appreciated. In general, I need to learn about the pagination involved in using the API.
import facebook
import json
ACCESS_TOKEN = "my_token"
g = facebook.GraphAPI(ACCESS_TOKEN)
print json.dumps(g.get_connections("me","friends"),indent=1)
Sadly, the documentation of pagination has been an open issue for almost 2 years. You should be able to paginate like this (based on this example), using requests:
import facebook
import requests

ACCESS_TOKEN = "my_token"
graph = facebook.GraphAPI(ACCESS_TOKEN)
friends = graph.get_connections("me", "friends")
allfriends = []

# Wrap this block in a while loop so we can keep paginating requests until
# finished.
while True:
    try:
        for friend in friends['data']:
            allfriends.append(friend['name'].encode('utf-8'))
        # Attempt to make a request to the next page of data, if it exists.
        friends = requests.get(friends['paging']['next']).json()
    except KeyError:
        # When there are no more pages (['paging']['next']), break from the
        # loop and end the script.
        break

print allfriends
Update: There's a new generator method available which implements the above behavior and can be used to iterate over all friends like this:
for friend in graph.get_all_connections("me", "friends"):
    # Do something with this friend.
In the meantime, while searching for an answer, I found that this is a much better approach:
import facebook

access_token = ""
graph = facebook.GraphAPI(access_token=access_token)
totalFriends = []
friends = graph.get_connections("me", "/friends&summary=1")

while 'paging' in friends:
    for i in friends['data']:
        totalFriends.append(i['id'])
    friends = graph.get_connections("me", "/friends&summary=1&after=" + friends['paging']['cursors']['after'])
At the end you will get one response where data is empty and there is no 'paging' key, so at that point the loop breaks and all the data has been stored.
I couldn't find this anywhere. These answers seem super complicated; there's no way I would even use an SDK if I had to do stuff like that, when paging from a simple POST is so easy to begin with. However:
FacebookAdsApi.init(my_app_id, my_app_secret, my_access_token)
my_account = AdAccount('act_23423423423423423')

# In the below, I set the limit to the max rows, 250.
# Also, more importantly, paging: the SDK has a really sneaky way of doing this.
# Enclose the request in list(); the results end up the same, but this makes the
# script keep requesting new objects until there are no more.
# I tested this example and compared it to the Graph API, and as of right now,
# 1/22 9:47 AM, I get 81 from Graph and 81 here.
fields = ['name']
params = {'limit': 250}
ads = list(my_account.get_ads(
    fields=fields,
    params=params,
))
Trick from the docs: "NOTE: We wrap the return value of get_ad_accounts with list() because get_ad_accounts returns an EdgeIterator object (located in facebook_business.adobjects) and we want to get the full list right away instead of having the iterator lazily loading accounts."
https://github.com/facebook/facebook-python-business-sdk
In this example you offset/paginate by one at a time. I think my while loop is simple, since it only checks whether the pagination key "next" is None; if it doesn't exist, we have finished looping, and you will have your results in a list.
In this example I am just searching for all the people called "jacob".
import requests
import facebook

token = access_token = "your token goes here"
fb = facebook.GraphAPI(access_token=token)

limit = 1
offset = 0
data = {"q": "jacob",
        "type": "user",
        "fields": "id",
        "limit": limit,
        "offset": offset}

req = fb.request('/search', args=data, method='GET')

users = []
for item in req['data']:
    users.append(item["id"])

pag = req['paging']
while pag.get("next") is not None:
    offset += limit
    data["offset"] = offset
    req = fb.request('/search', args=data, method='GET')
    for item in req['data']:
        users.append(item["id"])
    pag = req.get('paging')

print users

Improve Speed of Python Script: Multithreading or Multiple Instances?

I have a Python script that I'd like to run every day, and I'd prefer that it only takes 1-2 hours to run. It's currently set up to hit 4 different APIs for a given URL, capture the results, and then save the data into a PostgreSQL database. The problem is I have over 160,000 URLs to go through, and the script ends up taking a really long time -- I ran some preliminary tests, and in its current form it would take over 36 hours to go through every URL. So my question boils down to: should I optimize my script to run multiple threads at the same time, or should I scale out the number of servers I'm using? Obviously the second approach will be more costly, so I'd prefer to have multiple threads running on the same instance.
I'm using a library I created (SocialAnalytics) which provides methods to hit the different API endpoints and parse the results. Here's how I have my script configured:
import psycopg2
from socialanalytics import pinterest
from socialanalytics import facebook
from socialanalytics import twitter
from socialanalytics import google_plus
from time import strftime, sleep
conn = psycopg2.connect("dbname='***' user='***' host='***' password='***'")
cur = conn.cursor()
# Select all URLs
cur.execute("SELECT * FROM urls;")
urls = cur.fetchall()
for url in urls:
    # Pinterest
    try:
        p = pinterest.getPins(url[2])
    except:
        p = {'pin_count': 0}
    # Facebook
    try:
        f = facebook.getObject(url[2])
    except:
        f = {'comment_count': 0, 'like_count': 0, 'share_count': 0}
    # Twitter
    try:
        t = twitter.getShares(url[2])
    except:
        t = {'share_count': 0}
    # Google
    try:
        g = google_plus.getPlusOnes(url[2])
    except:
        g = {'plus_count': 0}
    # Save results
    try:
        now = strftime("%Y-%m-%d %H:%M:%S")
        cur.execute("INSERT INTO social_stats (fetched_at, pinterest_pins, facebook_likes, facebook_shares, facebook_comments, twitter_shares, google_plus_ones) VALUES (%s, %s, %s, %s, %s, %s, %s);",
                    (now, p['pin_count'], f['like_count'], f['share_count'], f['comment_count'], t['share_count'], g['plus_count']))
        conn.commit()
    except:
        conn.rollback()
You can see that each call to the API is using the Requests library, which is a synchronous, blocking affair. After some preliminary research I discovered Treq, which is an API on top of Twisted. The asynchronous, non-blocking nature of Twisted seems like a good candidate for improving my approach, but I've never worked with it and I'm not sure how exactly (and if) it'll help me achieve my goal.
Any guidance is much appreciated!
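One relatively small change, before reaching for Twisted, would be to keep the current synchronous calls but run them across a thread pool with concurrent.futures, so the four API lookups for many URLs overlap. This is only a sketch: it assumes the socialanalytics functions are safe to call from multiple threads, reuses the imports and connection from the script above, and keeps the database writes in the main thread so the commit logic stays as it is.
from concurrent.futures import ThreadPoolExecutor, as_completed

def fetch_stats(url):
    """Hit the four APIs for one URL, with the same fallbacks as the original loop."""
    try:
        p = pinterest.getPins(url[2])
    except Exception:
        p = {'pin_count': 0}
    try:
        f = facebook.getObject(url[2])
    except Exception:
        f = {'comment_count': 0, 'like_count': 0, 'share_count': 0}
    try:
        t = twitter.getShares(url[2])
    except Exception:
        t = {'share_count': 0}
    try:
        g = google_plus.getPlusOnes(url[2])
    except Exception:
        g = {'plus_count': 0}
    return p, f, t, g

with ThreadPoolExecutor(max_workers=20) as pool:   # worker count is a guess; tune it
    futures = [pool.submit(fetch_stats, url) for url in urls]
    for future in as_completed(futures):
        p, f, t, g = future.result()
        # Same INSERT as the original script, kept in this single thread
        # so the psycopg2 cursor is never shared.
        now = strftime("%Y-%m-%d %H:%M:%S")
        cur.execute("INSERT INTO social_stats (fetched_at, pinterest_pins, facebook_likes, facebook_shares, facebook_comments, twitter_shares, google_plus_ones) VALUES (%s, %s, %s, %s, %s, %s, %s);",
                    (now, p['pin_count'], f['like_count'], f['share_count'], f['comment_count'], t['share_count'], g['plus_count']))
        conn.commit()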
First, you should measure the time your script spends on every step. You may discover something interesting :)
Second, you can split your URLs into chunks:
chunk_size = len(urls) // cpu_core_count  # don't forget about the remainder of the division
After these steps you can use multiprocessing to process every chunk in parallel. Here is an example for you:
import multiprocessing as mp

p = mp.Pool(5)

# first solution
for urls_chunk in urls:  # urls = [(url1...url6), (url7...url12), ...]
    res = p.map(get_social_stat, urls_chunk)
    for record in res:
        save_to_db(record)

# or, more simply
res = p.map(get_social_stat, urls)
for record in res:
    save_to_db(record)
gevent can also help you, because it can reduce the time spent processing a sequence of synchronous blocking requests.
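For completeness, a minimal gevent sketch of the same idea, assuming get_social_stat and save_to_db exist as in the multiprocessing example above and that the HTTP calls go through a library monkey-patching can make cooperative (requests/urllib):
from gevent import monkey
monkey.patch_all()          # in real code, patch before the other imports
from gevent.pool import Pool

pool = Pool(50)             # at most 50 URLs in flight at once
for record in pool.imap_unordered(get_social_stat, urls):
    save_to_db(record)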

Python MySQL queue: run code/queries in parallel, but commit separately

Program 1 inserts some jobs into a table job_table.
Program 2 needs to:
1. get the job from the table
2. handle the job
-> this needs to be multi-threaded (because each job involves urllib waiting time, which should run in parallel)
3. insert the results into my_other_table, committing the result
Are there any good (standard?) ways to implement this? The issue is that committing inside one thread also commits the other threads' work.
I was able to pick the records from the MySQL table and put them in a queue, and later get them from the queue, but I was not able to insert them into a new MySQL table.
Here I am able to pick up only the new records whenever they land in the table.
Hope this may help you.
If there are any mistakes, please assist me.
from threading import Thread
import time
import Queue
import csv
import random
import pandas as pd
import pymysql.cursors
from sqlalchemy import create_engine
import logging

queue = Queue.Queue(1000)
logging.basicConfig(level=logging.DEBUG, format='(%(threadName)-9s) %(message)s', )
conn = pymysql.connect(conn-details)
cursor = conn.cursor()

class ProducerThread(Thread):
    def run(self):
        global queue
        cursor.execute("SELECT ID FROM multi ORDER BY ID LIMIT 1")
        min_id = cursor.fetchall()
        min_id1 = list(min_id[0])
        while True:
            cursor.execute("SELECT ID FROM multi ORDER BY ID desc LIMIT 1")
            max_id = cursor.fetchall()
            max_id1 = list(max_id[0])
            sql = "select * from multi where ID between '{}' and '{}'".format(min_id1[0], max_id1[0])
            cursor.execute(sql)
            data = cursor.fetchall()
            min_id1[0] = max_id1[0] + 1
            for row in data:
                num = row
                queue.put(num)  # acquire(); wait()
                logging.debug('Putting ' + str(num) + ' : ' + str(queue.qsize()) + ' items in queue')

class ConsumerThread(Thread):
    def run(self):
        global queue
        while True:
            num = queue.get()
            print num
            logging.debug('Getting ' + str(num) + ' : ' + str(queue.qsize()) + ' items in queue')
            # The part below is where I'm stuck:
            sql1 = """insert into multi_out(ID,clientname) values ('%s','%s')""", num[0], num[1]
            print sql1
            # cursor.execute(sql1, num)
            cursor.execute("""insert into multi_out(ID,clientname) values ('%s','%s')""", (num[0], num[1]))
            # conn.commit()
            # conn.close()

def main():
    ProducerThread().start()
    num_of_consumers = 20
    for i in range(num_of_consumers):
        ConsumerThread().start()

main()
What probably happens is that you are sharing the MySQL connection between the threads. Try creating a new MySQL connection inside each thread.
For program 2, look at http://www.celeryproject.org/ :)
This is a common task when doing some sort of web crawling. I have implemented a single thread which grabs a job, waits for the HTTP response, then writes the response to a database table.
The problem I have come across with my method is that you need to lock the table you are grabbing jobs from, and mark jobs as in progress or complete, so that multiple threads do not try to grab the same task.
I just used threading.Thread in Python and overrode the run method.
Use one database connection per thread (some DB libraries in Python are not thread-safe).
If you have X threads running, each periodically reading from the jobs table, then MySQL will handle the concurrency for you.
Or, if you need even more assurance, you can always lock the jobs table yourself before reading the next available entry. That way you can be 100% sure that a single job will only be processed once.
As @Martin said, keep connections separate for all threads. They can use the same credentials.
So in short:
Program one -> insert into the jobs table
Program two -> create a write lock on the jobs table, so no one else can read from it
Program two -> read the next available job
Program two -> unlock the table
Do everything else as usual; MySQL will handle the concurrency.
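Putting those points together, here is a rough sketch of what one consumer thread could look like. The connection settings, the column names (status, payload, job_id, result), and handle_job() are all placeholders; the point is that each thread has its own pymysql connection, so its commits are independent of the others (on InnoDB you could also use SELECT ... FOR UPDATE inside a transaction instead of LOCK TABLES):
import threading
import pymysql

def worker():
    # One connection per thread: commits here do not affect other threads.
    conn = pymysql.connect(host='localhost', user='user', password='secret',
                           db='mydb', autocommit=False)
    cur = conn.cursor()
    while True:
        # Claim the next job under a table lock so two threads never grab the same row.
        cur.execute("LOCK TABLES job_table WRITE")
        cur.execute("SELECT id, payload FROM job_table WHERE status = 'new' "
                    "ORDER BY id LIMIT 1")
        row = cur.fetchone()
        if row:
            cur.execute("UPDATE job_table SET status = 'in_progress' WHERE id = %s",
                        (row[0],))
        cur.execute("UNLOCK TABLES")
        conn.commit()
        if not row:
            break
        result = handle_job(row[1])   # the slow urllib part; runs in parallel across threads
        cur.execute("INSERT INTO my_other_table (job_id, result) VALUES (%s, %s)",
                    (row[0], result))
        conn.commit()                 # commits only this thread's work
    conn.close()

threads = [threading.Thread(target=worker) for _ in range(10)]
for t in threads:
    t.start()
for t in threads:
    t.join()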
