I need to shorten multiple URLs using the Google URL Shortener API. Since each shortening call is independent of the others, I decided to use the multiprocessing library in Python.
raw_data is a DataFrame that contains my long URLs. Api_Key_List1 is a list of API keys from multiple Google accounts (so I don't hit the daily API usage limit on a single key).
Here is the underlying code.
import pandas as pd
import time
import json
import math
import requests
from multiprocessing import Pool

# raw_data is an existing DataFrame holding the long URLs (loaded elsewhere)
raw_data["Shortened"] = "xx"
Api_Key_List1 = ['Key1',
                 'Key2',
                 'Key3']
raw_data = raw_data[0:3]
LongUrl = raw_data['SMS Click traker'].tolist()
# Pair each URL with an API key so helper() can unpack them as arguments
args = [[LongUrl[index], Api_Key_List1[index]] for index, value in enumerate(LongUrl)]
print("xxxx")

def goo_shorten_url(url, key):
    post_url = 'https://www.googleapis.com/urlshortener/v1/url?key=%s' % key
    payload = {'longUrl': url}
    headers = {'content-type': 'application/json'}
    time.sleep(2)
    r = requests.post(post_url, data=json.dumps(payload), headers=headers)
    return json.loads(r.text)["id"]

def helper(args):
    return goo_shorten_url(*args)

if __name__ == '__main__':
    p = Pool(processes=3)
    data = p.map(helper, args)
    print("main")
    raw_data["Shortened"] = data
    p.close()
    print(raw_data)
Using this code I can successfully shorten multiple URLs in one go, which saves a lot of time. Here are some questions I am having difficulty finding answers to:
What is the optimal number of subprocesses to run? (I am using a simple quad-core, 8 GB RAM laptop running Windows 7.)
Sometimes, due to some issue, the API does not return a shortened URL and the code hangs forever. This was also my experience when I was not using multiprocessing. What happens to the process in such a case?
Assuming a subprocess gets stuck partway through a long set of URLs (70,000), how do I kill it while ensuring the code safely returns all the remaining shortened URLs, correctly matched to the rows of the DataFrame?
How can I identify the URLs that were not shortened? (One possible pattern is sketched after this question.)
I am a newbie to Python programming, so bear with me. This will help me better understand the inner workings of multiprocessing.
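One way to approach questions 2-4 is to submit each task with apply_async and collect the results with a per-task timeout, recording a sentinel for any URL that fails or times out so the results still line up with the DataFrame rows. This is only a minimal sketch, assuming goo_shorten_url and args are defined as above; the 30-second timeout and the "FAILED" sentinel are arbitrary choices, not part of the original code, and a hung worker is only actually killed by terminate().

from multiprocessing import Pool, TimeoutError

if __name__ == '__main__':
    p = Pool(processes=3)
    # Submit every URL up front; keep the AsyncResult objects in the same order as args.
    async_results = [p.apply_async(goo_shorten_url, arg) for arg in args]
    data = []
    for res in async_results:
        try:
            # Wait at most 30 seconds per URL instead of forever.
            data.append(res.get(timeout=30))
        except TimeoutError:
            data.append("FAILED")   # the API call hung
        except Exception:
            data.append("FAILED")   # bad response, missing "id", etc.
    p.terminate()                   # forcibly stop any worker that is still stuck
    p.join()
    raw_data["Shortened"] = data
    # The URLs that were not shortened are simply the rows marked "FAILED".
    not_shortened = raw_data[raw_data["Shortened"] == "FAILED"]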
Related
I have a script that uses a loop to append a numerical ID as a token to an API endpoint, as in the following example.
The API response is then appended to a list and transformed into a dataframe upon which I can perform transformations as required.
import pandas as pd
import json
import requests

tokens_list = ['6819100647','4444297007','3136364694','631018055','6275646907','2835497899'
,'5584897496','1271294249','5292810357','9101862265','220608001','1546117949','3672368984'
,'8893786688','8681209578','134973384','2947578755','6271362195','1070349738','747443183'...]

outputs_list = []
for i in tokens_list:
    url = "https://my_api_endpoint/" + str(i)
    response = requests.get(url, params=None, headers=None)
    response_json = json.loads(response.text)
    response_json_data = pd.json_normalize(response_json['data'])
    outputs_list.append(response_json_data)

outputs_df = pd.concat(outputs_list)
This works as intended for many hundreds of tokens/loop iterations; however, there may potentially be 50k+ tokens, hence 50k+ loop iterations/calls.
How might I improve the script so that I can make tens of thousands of sequential requests without facing timeout issues or other such problems?
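One hedged way to make this more robust at 50k+ calls (a sketch, not a drop-in fix: "my_api_endpoint" is the placeholder host from the question, and the retry counts, timeout, and worker count are arbitrary) is to reuse a single requests.Session with automatic retries, give every call an explicit timeout, and spread the calls over a modest thread pool:

import pandas as pd
import requests
from concurrent.futures import ThreadPoolExecutor
from requests.adapters import HTTPAdapter
from urllib3.util.retry import Retry

# One Session reuses TCP connections; Retry handles transient failures.
session = requests.Session()
retries = Retry(total=3, backoff_factor=1, status_forcelist=[429, 500, 502, 503, 504])
session.mount("https://", HTTPAdapter(max_retries=retries))

def fetch(token):
    # "my_api_endpoint" is the placeholder host from the question.
    resp = session.get(f"https://my_api_endpoint/{token}", timeout=30)
    resp.raise_for_status()
    return pd.json_normalize(resp.json()['data'])

# A modest pool keeps tens of thousands of calls moving without opening 50k sockets at once.
with ThreadPoolExecutor(max_workers=20) as pool:
    outputs_list = list(pool.map(fetch, tokens_list))

outputs_df = pd.concat(outputs_list, ignore_index=True)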
So I'm working on a simple project, but apparently I'm stuck at the first step. Basically, I'm requesting .json files from a public GitHub repository: seven different files which I aim to download and convert into seven differently named DataFrames.
I tried to use this nested loop to create seven different CSV files; the only problem is that it gives me seven differently named CSV files with the same content (the one from the last URL).
I think it has something to do with the way I store the data from the JSON output in the variable "data".
How could I solve this problem?
import pandas as pd
import datetime
import re, json, requests  # this is needed to import the data from the github repository

naz_l_url = 'https://raw.githubusercontent.com/pcm-dpc/COVID-19/master/dati-json/dpc-covid19-ita-andamento-nazionale-latest.json'
naz_url = 'https://raw.githubusercontent.com/pcm-dpc/COVID-19/master/dati-json/dpc-covid19-ita-andamento-nazionale.json'
reg_l_url = 'https://raw.githubusercontent.com/pcm-dpc/COVID-19/master/dati-json/dpc-covid19-ita-regioni-latest.json'
reg_url = 'https://raw.githubusercontent.com/pcm-dpc/COVID-19/master/dati-json/dpc-covid19-ita-regioni.json'
prov_l_url = 'https://raw.githubusercontent.com/pcm-dpc/COVID-19/master/dati-json/dpc-covid19-ita-province-latest.json'
prov_url = 'https://raw.githubusercontent.com/pcm-dpc/COVID-19/master/dati-json/dpc-covid19-ita-province.json'
news_url = 'https://raw.githubusercontent.com/pcm-dpc/COVID-19/master/dati-json/dpc-covid19-ita-note.json'

list_of_url = [naz_l_url, naz_url, reg_l_url, reg_url, prov_url, prov_l_url, news_url]
csv_names = ['01','02','03','04','05','06','07']

for i in list_of_url:
    resp = requests.get(i)
    data = pd.read_json(resp.text, convert_dates=True)
    for x in csv_names:
        data.to_csv(f"{x}_df.csv")
I want to try two different ways: one where the loop gives me CSV files, and another where the loop gives me pandas DataFrames. But first I need to solve the problem of the loop giving me the same output.
The problem is that you are iterating over the full list of names for each URL you download. Note how for x in csv_names is inside the for i in list_of_url loop.
Where the problem comes from
Python uses indentation levels to determine when you are inside or outside a loop (where other languages might use curly braces, begin/end, or do/end). I'd recommend you brush up on this topic; for example, see Concept of Indentation in Python, or the official documentation on compound statements.
Proposed solution
I'd recommend you replace the naming of the files, and do something like this instead:
import pandas as pd
import datetime
import re, json, requests #this is needed to import the data from the github repository
from urllib.parse import urlparse
naz_l_url = 'https://raw.githubusercontent.com/pcm-dpc/COVID-19/master/dati-json/dpc-covid19-ita-andamento-nazionale-latest.json'
naz_url = 'https://raw.githubusercontent.com/pcm-dpc/COVID-19/master/dati-json/dpc-covid19-ita-andamento-nazionale.json'
reg_l_url = 'https://raw.githubusercontent.com/pcm-dpc/COVID-19/master/dati-json/dpc-covid19-ita-regioni-latest.json'
reg_url = 'https://raw.githubusercontent.com/pcm-dpc/COVID-19/master/dati-json/dpc-covid19-ita-regioni.json'
prov_l_url = 'https://raw.githubusercontent.com/pcm-dpc/COVID-19/master/dati-json/dpc-covid19-ita-province-latest.json'
prov_url = 'https://raw.githubusercontent.com/pcm-dpc/COVID-19/master/dati-json/dpc-covid19-ita-province.json'
news_url = 'https://raw.githubusercontent.com/pcm-dpc/COVID-19/master/dati-json/dpc-covid19-ita-note.json'
list_of_url= [naz_l_url,naz_url, reg_l_url,reg_url,prov_url,prov_l_url,news_url]
csv_names = ['01','02','03','04','05','06','07']
for url in list_of_url:
    resp = requests.get(url)
    data = pd.read_json(resp.text, convert_dates=True)
    # here is where you DON'T want to have a nested `for` loop
    file_name = urlparse(url).path.split('/')[-1].replace('json', 'csv')
    data.to_csv(file_name)
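Alternatively, if you'd rather keep your original csv_names list, a small variant (not part of the solution above, just a sketch) is to pair each URL with its name via zip, so each download is written exactly once:

for url, name in zip(list_of_url, csv_names):
    resp = requests.get(url)
    data = pd.read_json(resp.text, convert_dates=True)
    data.to_csv(f"{name}_df.csv")   # one file per URL, e.g. 01_df.csv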
I am trying to use the BALANCED ShardingStrategy to get more than one stream, and the Python multiprocessing library to read the streams in parallel.
However, when reading the streams in parallel, the same row count and data are returned. If I understand correctly, no data is assigned to a stream before it starts reading and is finalized, so two parallel streams try to read the same data and, as a result, part of the data is never read.
With the LIQUID strategy we can read all the data from one stream, which cannot be split.
According to the documentation it is possible to read multiple streams in parallel with the BALANCED one. However, I cannot figure out how to read in parallel and how to assign different data to each stream.
I have the following toy code:
import pandas as pd
from google.cloud import bigquery_storage_v1beta1
import os
import google.auth
from multiprocessing import Pool
import multiprocessing

os.environ["GOOGLE_APPLICATION_CREDENTIALS"] = 'key.json'
credentials, your_project_id = google.auth.default(scopes=["https://www.googleapis.com/auth/cloud-platform"])
bq_storage_client = bigquery_storage_v1beta1.BigQueryStorageClient(credentials=credentials)

table_ref = bigquery_storage_v1beta1.types.TableReference()
table_ref.project_id = "bigquery-public-data"
table_ref.dataset_id = "ethereum_blockchain"
table_ref.table_id = "contracts"
parent = "projects/{}".format(your_project_id)

session = bq_storage_client.create_read_session(
    table_ref,
    parent,
    format_=bigquery_storage_v1beta1.enums.DataFormat.ARROW,
    sharding_strategy=(bigquery_storage_v1beta1.enums.ShardingStrategy.BALANCED),
)

def read_rows(stream_position, session=session):
    reader = bq_storage_client.read_rows(
        bigquery_storage_v1beta1.types.StreamPosition(stream=session.streams[stream_position]),
        timeout=100000).to_arrow(session).to_pandas()
    return reader

if __name__ == '__main__':
    p = Pool(2)
    output = p.map(read_rows, [i for i in range(0, 2)])
    print(output)
I need help getting multiple streams to be read in parallel.
Perhaps there is a way to assign data to a stream before the reading starts. Any code examples, explanations, or tips would be appreciated.
I apologize for the partial answer, but it didn't fit in a comment.
LIQUID or BALANCED just affect how data is allocated to streams, not the fact that data arrives in multiple streams (see here).
When I ran a variant of your code with this read_rows function, I saw different data for the first row of both streams, so I was unable to replicate your problem of seeing the same data on this dataset with either sharding strategy.
def read_rows(stream_position, session=session):
    reader = bq_storage_client.read_rows(
        bigquery_storage_v1beta1.types.StreamPosition(stream=session.streams[stream_position]), timeout=100000)
    for row in reader.rows(session):
        print(row)
        break
I was running this code on a Linux compute engine instance.
I do worry that the output you are asking for in the map call is otherwise going to be quite large, however.
I'm iterating over M dataframes, each containing a column with N URLs. For each URL, I extract paragraph text, then conduct standard cleaning for textual analysis before calculating "sentiment" scores.
Is it more efficient for me to:
Continue as it is (compute scores in the URL for-loop itself)
Extract all of the text from URLs first, and then separately iterate over the list / column of text ?
Or does it not make any difference?
I'm currently running the calculations within the loop itself. Each DF has about 15,000-20,000 URLs, so it's taking an insane amount of time too!
# DFs are stored on a website
# I extract links to each .csv file and store it as a list in "df_links"

for link in df_links:
    cleaned_articles = []
    df = pd.read_csv(link, sep="\t", header=None)
    # Conduct df cleaning
    # URLs for articles to scrape are stored in 1 column, which I iterate over as...
    for url in df['article_url']:
        response = requests.get(url)
        soup = BeautifulSoup(response.text, 'html.parser')
        para_text = [text.get_text() for text in soup.findAll('p')]
        text = " ".join(para_text)
        words = text.split()
        if len(words) > 500:
            # Conduct Text Cleaning & Scores Computations
            # Cleaned text stored as a variable "clean_text"
            cleaned_articles.append(clean_text)
    df['article_text'] = cleaned_articles
    df.to_csv('file_name.csv')
To answer the question: it shouldn't make much of a difference whether you download all the data first and then apply the analysis, or do both inside the loop. You'd just be rearranging the order of a set of tasks that would effectively take the same time.
The only difference may be if the text corpora are rather large, in which case read/write time to disk starts to play a part, so it could be a little faster to run the analytics all in memory. But this still isn't going to really solve your problem.
May I be so bold as to reinterpret your question as: "My analysis is taking too long, help me speed it up!"
This sounds like a perfect use case for multiprocessing! Since this sounds like a data science project, you'll need to pip install multiprocess if you're using an IPython notebook (like Jupyter), or use import multiprocessing if you're using a Python script. This is because of the way Python passes information between processes; don't worry though, the APIs of multiprocess and multiprocessing are identical!
A basic and easy way to speed up your analysis is to take the body of your for loop and put it in a function. That function can then be passed to a multiprocessing map, which can spawn multiple processes and run the analysis on several URLs at once:
from multiprocess import Pool
import numpy as np
import os
import pandas as pd

num_cpus = os.cpu_count()

def analytics_function(*args):
    # Your full function, including fetching the data, goes here; it accepts an array of links
    return something

df_links_split = np.array_split(df_links, num_cpus * 2)  # I normally just use 2 as a rule of thumb
pool = Pool(num_cpus * 2)  # Start a pool with num_cpus * 2 processes
list_of_returned = pool.map(analytics_function, df_links_split)
This will spin up a load of processes and utilise your full CPU. You won't be able to do much else on your computer, and you'll need to keep your resource monitor open to check you're not maxing out your memory and slowing down or crashing the process. But it should significantly speed up your analysis, roughly by a factor of num_cpus * 2!
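As a small follow-up (an assumption on my part, since the body of analytics_function isn't shown): if it returns a pandas DataFrame per chunk of links, the per-process results in list_of_returned can be stitched back together afterwards:

combined_df = pd.concat(list_of_returned, ignore_index=True)   # assumes each element is a DataFrame
combined_df.to_csv('combined_results.csv', index=False)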
Extracting all of the texts and then processing them, versus extracting one text and processing it before extracting the next, won't make any difference.
Doing ABABAB takes as much time as doing AAABBB.
You might however be interested in using threads or asynchronous requests to fetch all of the data in parallel.
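As a rough sketch of that threaded-fetch idea (assuming a df['article_url'] column as in the question; the 20-worker count and the timeout are arbitrary choices):

from concurrent.futures import ThreadPoolExecutor
import requests
from bs4 import BeautifulSoup

def fetch_text(url):
    # Download one article and return its paragraph text.
    response = requests.get(url, timeout=30)
    soup = BeautifulSoup(response.text, 'html.parser')
    return " ".join(p.get_text() for p in soup.findAll('p'))

# Fetch all articles concurrently; results come back in the same order as the URLs.
with ThreadPoolExecutor(max_workers=20) as executor:
    texts = list(executor.map(fetch_text, df['article_url']))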
I am data mining a website using Beautiful Soup. The first page is Scoutmob's map, where I grab each city, open up the page, and grab the URL of each deal in that city.
Currently I'm not using threads and everything is processed serially. For all ~500 deals (from all cities), my program currently takes about 400 seconds.
For practice, I wanted to modify my code to use threading. I have read up on some tutorials and examples of how to create queues in Python, but I don't want to create 500 threads to process 500 URLs.
Instead, I want to create about 20 worker threads to process all the URLs. Can someone show me an example of how 20 threads can process 500 URLs from a queue?
I want each worker to grab an unprocessed URL from the queue, data mine it, and then, once finished, work on another unprocessed URL. Each worker should exit only when there are no more URLs in the queue.
By the way, while each worker is data mining, it also writes the content to a database. So there might be issues with threading and the database, but that is a question for another day :-).
Thanks in advance!
For your example, creating worker queues is probably overkill. You might have better luck grabbing the RSS feed published for each of the pages rather than trying to parse the HTML, which is slower. I slapped together the quick little script below, which parses everything in a total of ~13 seconds: ~8 seconds to grab the cities and ~5 seconds to parse all the RSS feeds.
In today's run it grabs 310 total deals from 13 cities (there are 20 cities listed in total, but 7 of them are marked as "coming soon").
#!/usr/bin/env python
from lxml import etree, html
from urlparse import urljoin
import time
t = time.time()
base = 'http://scoutmob.com/'
main = html.parse(base)
cities = [x.split('?')[0] for x in main.xpath("//a[starts-with(@class, 'cities-')]/@href")]
urls = [urljoin(base, x + '/today') for x in cities]
docs = [html.parse(url) for url in urls]
feeds = [doc.xpath("//link[@rel='alternate']/@href")[0] for doc in docs]
# filter out the "coming soon" feeds
feeds = [x for x in feeds if x != 'http://feeds.feedburner.com/scoutmob']
print time.time() - t
print len(cities), cities
print len(feeds), feeds
t = time.time()
items = [etree.parse(x).xpath("//item") for x in feeds]
print time.time() - t
count = sum(map(len, items))
print count
Yields this output:
7.79690480232
20 ['/atlanta', '/new-york', '/san-francisco', '/washington-dc', '/charlotte', '/miami', '/philadelphia', '/houston', '/minneapolis', '/phoenix', '/san-diego', '/nashville', '/austin', '/boston', '/chicago', '/dallas', '/denver', '/los-angeles', '/seattle', '/portland']
13 ['http://feeds.feedburner.com/scoutmob/atl', 'http://feeds.feedburner.com/scoutmob/nyc', 'http://feeds.feedburner.com/scoutmob/sf', 'http://scoutmob.com/washington-dc.rss', 'http://scoutmob.com/nashville.rss', 'http://scoutmob.com/austin.rss', 'http://scoutmob.com/boston.rss', 'http://scoutmob.com/chicago.rss', 'http://scoutmob.com/dallas.rss', 'http://scoutmob.com/denver.rss', 'http://scoutmob.com/los-angeles.rss', 'http://scoutmob.com/seattle.rss', 'http://scoutmob.com/portland.rss']
4.76977992058
310
Just implement it. You've pretty much talked yourself through the answer right there:
I want each worker to grab an unprocessed URL from the queue, and data mine, then once finished, work on another unprocessed URL. Each worker only exit when there is no more URLs in the queue.
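For completeness, here is a minimal Python 3 sketch of that worker-queue pattern (process_url and deal_urls are placeholders for the actual scraping code and the ~500 deal URLs gathered earlier, which are not shown in the question):

import threading
import queue

NUM_WORKERS = 20

def process_url(url):
    # Placeholder for the actual data mining / database write.
    print("processing", url)

def worker(q):
    while True:
        try:
            url = q.get_nowait()   # workers exit once the queue is empty
        except queue.Empty:
            return
        try:
            process_url(url)
        finally:
            q.task_done()

url_queue = queue.Queue()
for url in deal_urls:              # deal_urls: the ~500 deal URLs gathered earlier
    url_queue.put(url)

threads = [threading.Thread(target=worker, args=(url_queue,)) for _ in range(NUM_WORKERS)]
for t in threads:
    t.start()
for t in threads:
    t.join()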