As part of a project I have to check 1,000,000 URLs, each of which has been randomly set up to be either valid or invalid. I have written the code and it works, but I was wondering whether there is a way to make it better, i.e. faster/more efficient.
I don't know much about this world of efficiency, but I have heard the word multithreading thrown around. Would that help, and how do I do that?
import requests
# url = "http://*******/(number from 1 to 1,000,000)"
available_numbers = []
for i in range(1, 1000001):  # numbers 1 to 1,000,000 inclusive
    url = f"http://*************/{i}"
    data = requests.get(url)
    if data.status_code == 200:
        available_numbers.append(i)
print(available_numbers)
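Since the question asks about multithreading: this job is network-bound rather than CPU-bound, so running requests concurrently is usually the big win. Below is a minimal sketch using the standard library's concurrent.futures thread pool; the URL pattern and range are taken from the code above, and the worker count is only an assumption to tune:

import requests
from concurrent.futures import ThreadPoolExecutor

def check(i):
    # return i if the URL answers with HTTP 200, otherwise None
    url = f"http://*************/{i}"
    if requests.get(url).status_code == 200:
        return i
    return None

with ThreadPoolExecutor(max_workers=50) as pool:  # pool size is a guess; tune it
    results = pool.map(check, range(1, 1000001))

available_numbers = [i for i in results if i is not None]
print(available_numbers)

Reusing a requests.Session per thread (or switching to an async HTTP library) would cut connection overhead further, but even the plain thread pool should be dramatically faster than the sequential loop.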
I have some code inside an app that is slowing me down way too much, and it's a simple 'get' request...
This portion of the code just finds the location of the PDF on the internet and then extracts it. I thought it was the extraction process that was taking so long, but after some testing I believe it's the 'get' request. I am passing a variable into the URL because there are many different PDFs the user can indirectly select. I have tried kivy's UrlRequest, but I honestly can't get my head around getting a result from it; I have heard it is faster, though. I have another two 'post' sessions in different functions that work 10 times faster than this one, so I'm not sure what the issue is...
The rest of my program is working just fine; it's just this which sometimes adds upwards of 20-25 seconds to load times (which is unreasonable).
I will include a working extract of the problem below for you to try.
I have found that its first attempt at an "airport_loc" is the slowest; please try swapping out the airport_loc variable with some of these examples:
"YPAD"
"YMLT"
"YPPH"
What can I do differently here to speed it up or simply make it more efficient?
import requests
from html2text import html2text
import re
s = requests.session()
page = s.get('https://www.airservicesaustralia.com/aip/pending/dap/AeroProcChartsTOC.htm')
text = html2text(page.text)
airport_loc = "YSSY"
finding_airport = (re.search(r'.%s.' % re.escape(airport_loc), text)).group()
ap_id_loc = int(text.index(finding_airport)) + 6
ap_id_onward = text[ap_id_loc:]
next_loc = re.search(r'[(]Y\w\w\w[)]', ap_id_onward)
next_loc_stop = text.index(next_loc.group())
ap_id_to_nxt_ap = text[ap_id_loc:next_loc_stop]
needed_text = (html2text(ap_id_to_nxt_ap))
airport_id_less_Y = airport_loc[1:]
app_1 = re.search(r'%sGN.*' % re.escape(airport_id_less_Y), needed_text)
app_2 = re.search(r'%sII.*' % re.escape(airport_id_less_Y), needed_text)
try:
    if app_2.group():
        line_of_chart = app_2.group()
except AttributeError:  # app_2 is None when the 'II' pattern is not found
    if app_1.group():
        line_of_chart = app_1.group()
chart_title = (re.search(r'\w\w\w\w\w\d\d[-]\d*[_][\d\w]*[.]pdf', line_of_chart)).group()
# getting exact pdf now
chart_PDF = ('https://www.airservicesaustralia.com/aip/pending/dap/' + chart_title)
retrieve = s.get(chart_PDF)
content = retrieve.content
print(content)
# from here on is working fine.
I haven't included the code that follows this because I don't think it's really relevant.
Please help me speed this thing up :(
It still takes 3 seconds for me with just your code, so the latency probably comes from the server.
To make the request a little faster, I tried editing the HTTP adapter like this:
s.mount('http://', requests.adapters.HTTPAdapter(max_retries=0))
retrieve = s.get(chart_PDF)
It shows a little improvement (3 sec -> 2 sec), but disabling retries carries a risk of failure.
Using asyncio or another async HTTP library is a better way.
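For example, a rough sketch of the async route with aiohttp (this is not part of the original answer; it assumes you already have a list of chart URLs built the same way as chart_PDF above):

import asyncio
import aiohttp

async def fetch(session, url):
    # download one URL and return its raw bytes
    async with session.get(url) as resp:
        return await resp.read()

async def fetch_all(urls):
    # run all downloads concurrently over one connection pool
    async with aiohttp.ClientSession() as session:
        return await asyncio.gather(*(fetch(session, u) for u in urls))

# e.g. contents = asyncio.run(fetch_all([chart_PDF]))

This only pays off when there are several URLs to fetch at once; for a single PDF the latency is mostly the server's.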
I'm new to Python and having some trouble with an API scrape I'm attempting. What I want to do is pull a list of book titles using this code:
import requests
import json

r = requests.get('https://api.dp.la/v2/items?q=magic+AND+wizard&api_key=09a0efa145eaa3c80f6acf7c3b14b588')
data = json.loads(r.text)
for doc in data["docs"]:
    for title in doc["sourceResource"]["title"]:
        print(title)
This works to pull the titles, but most (not all) of them are output as one character per line. I've tried adding .splitlines(), but that doesn't fix the problem. Any advice would be appreciated!
The problem is that you have two types of title in the response: some are plain strings, like "Germain the wizard", and others are arrays of strings, like ['Joe Strong, the boy wizard : or, The mysteries of magic exposed /']. Iterating over a plain string yields one character at a time, which is why most titles print one character per line. It seems that in this particular case all the lists have length one, but I'd guess that won't always be so. To illustrate what you might need to do, I added a join here instead of just taking title[0].
import requests
import json
r = requests.get('https://api.dp.la/v2/items?q=magic+AND+wizard&api_key=09a0efa145eaa3c80f6acf7c3b14b588')
data = json.loads(r.text)
for doc in data["docs"]:
    title = doc["sourceResource"]["title"]
    if isinstance(title, list):
        print(" ".join(title))
    else:
        print(title)
In my opinion that should never happen: an API should return predictable types, otherwise things get messy on the consumer's side.
I am trying to iterate through some JSON data, but the information is spread across several pages. I don't have a problem working through the first page, but the program just skips over the next set. The weird thing is that it executes fine in debug mode. I'm guessing it's a timing issue around the json.loads calls, but I tried putting sleep timers around that code and the issue persisted.
url = apipath + query + apikey
response = requests.get(url)
data = json.loads(response.text)
for x in data["results"]:
    nameList.append(x["name"])
    latList.append(x["geometry"]["location"]["lat"])
    lonList.append(x["geometry"]["location"]["lng"])
pagetoken = "pagetoken=" + data["next_page_token"]
url = apipath + pagetoken + apikey
response = requests.get(url)
data = json.loads(response.text)
for x in data["results"]:
    nameList.append(x["name"])
    latList.append(x["geometry"]["location"]["lat"])
    lonList.append(x["geometry"]["location"]["lng"])
I would venture to guess that data["results"] is coming back empty (or missing) on the second request, so the second for loop has nothing to iterate over. Have you tried putting a print above the loop? Try print(data["results"]) before entering it to ensure the data you want actually exists. If that shows nothing, then try just print(data) and see what the program is really receiving.
Well, it did end up being a timing issue. I placed a 2-second delay before the second request and it now loads the data just fine. I guess the second page just wasn't ready yet.
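For reference, a minimal sketch of that fix in the question's code (the 2-second figure is the asker's; some APIs simply need a short delay before a freshly issued next_page_token becomes usable):

import time

time.sleep(2)  # give the service a moment before using next_page_token
pagetoken = "pagetoken=" + data["next_page_token"]
url = apipath + pagetoken + apikey
response = requests.get(url)
data = json.loads(response.text)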
I'm stuck in a conundrum of optimization versus the nature of the program. I have code that extracts info from an API and inserts it directly into a MongoDB database. The code I've posted operates on only 4 pages of the API and works rather quickly. However, the final program needs to work reasonably well on 40 pages, and as of now it seems to stop after 5. To be clear, it says it has completed, but it has only collected from 5. To ensure the right information is placed in the right collection (the collections are named from the extracted data itself, not manually), the code is built on a series of nested for loops that are quite slow and pretty hideous to behold. I've been whacking at this for a while and I'm having trouble coming up with any other way to do it that gathers the information accurately and puts it in the right place. Again, I'm looking to reduce the number of nested loops. My API key is blocked, so this code will not run as posted. The API is NCBO's BioPortal and you can look at it here: http://data.bioontology.org/
Thanks!
import urllib2
import json
import ast
from pymongo import MongoClient
from datetime import datetime
REST_URL = "http://data.bioontology.org"
API_KEY = "********"
client=MongoClient()
db=client.db
print "Accessed database."
def get_json(url):
    opener = urllib2.build_opener()
    opener.addheaders = [('Authorization', 'apikey token=' + API_KEY)]
    return json.loads(opener.open(url).read())
# Get all ontologies from the REST service and parse the JSON
all_ontologies = get_json(REST_URL+"/ontologies")
selected_ontologies= ['MERA','OGROUP','GCO','OCHV']
onts_acronyms=[]
page=None
acronym= None
for ontology in all_ontologies:
    if ontology["acronym"] in selected_ontologies:
        onts_acronyms.append(ast.literal_eval(json.dumps(ontology["acronym"])))  # cleans names and removes whitespace using the ast package

for acronym in onts_acronyms:
    page = get_json(REST_URL + "/ontologies/" + acronym + "/classes")
    next_page = page
    while next_page:
        next_page = page["links"]["nextPage"]
        for ont_class in page["collection"]:
            result = db[acronym].insert({ont_class["prefLabel"]:
                                         {"definition": ont_class["definition"], "synonyms": ont_class["synonym"]}},
                                        check_keys=False)
        if next_page:
            page = get_json(next_page)
print "DB Built."
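For what it's worth, the paging pattern in that code can be pulled out into one small helper that follows links.nextPage until it runs out, which removes a level of nesting. A sketch under the same assumptions as the question (Python 3 with requests instead of urllib2, REST_URL and API_KEY as above):

import requests

def iter_pages(url, api_key):
    # yield each page of results, following the nextPage link until it is missing
    headers = {"Authorization": "apikey token=" + api_key}
    while url:
        page = requests.get(url, headers=headers).json()
        yield page
        url = page.get("links", {}).get("nextPage")

# e.g. for page in iter_pages(REST_URL + "/ontologies/" + acronym + "/classes", API_KEY):
#          for ont_class in page["collection"]:
#              ...insert into db[acronym] as before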
I want to know how much time (in seconds) I should need to fetch my whole Facebook wall as JSON from the Graph API.
It currently takes about 190 seconds to get all of my wall's posts (maybe 2,000 posts across 131 pages of JSON).
The Python code follows; it does nothing but read through the posts.
Is there any problem in my code, and how can I cut the response time?
import urllib
import json
import time

accessToken = "Secret"
requestURL = "https://graph.facebook.com/me/feed?access_token=" + accessToken
beforeSec = time.time() * 1000
pages = 1
while 1:
    read = urllib.urlopen(requestURL).read()
    read = json.loads(read)
    data = read["data"]
    for i in range(0, len(data)):
        pass  # just walking over the posts; no processing
    try:
        requestURL = read["paging"]["next"]
        pages += 1
    except KeyError:
        break
afterSec = time.time() * 1000
print afterSec - beforeSec
It depends, of course, on how big the user's wall is... They have released a new batch function: http://developers.facebook.com/docs/reference/api/batch/
Maybe you can use that?
Your code is synchronous, so you download the pages one by one, which is very slow; you could download several pages in parallel instead.
Greenlets are the current favourite for this kind of parallel work in Python, so give gevent a try.
Of course, this only helps if you can get the next page's URL before downloading the entire previous page, so see whether you can find the next paging link quickly.
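As a rough illustration of that idea, here is a gevent sketch (not from the original answer, and it assumes the paging URLs can be worked out up front, which is exactly the caveat above):

import gevent
from gevent import monkey
monkey.patch_all()  # make blocking sockets cooperative before importing requests
import requests

def fetch(url):
    # download and parse one page of the feed
    return requests.get(url).json()

page_urls = []  # hypothetical: the feed paging URLs, if they can be computed ahead of time
jobs = [gevent.spawn(fetch, u) for u in page_urls]
gevent.joinall(jobs)
pages = [job.value for job in jobs]

If the next URL is only known after the previous page arrives (cursor-style paging), fetching the feed in parallel won't help, and batching or a larger page size is the more realistic win.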