While loop page iteration for Meetup API not working - python

I'm trying to iterate through the pages of this Meetup API but I am receiving an error:
url = 'https://api.meetup.com/2/groups?offset=1&format=json&category_id=34&photo-host=public&page=100&radius=200.0&fields=&order=id&desc=false&sig_id=243750775&sig=768bcf78d9c73937fcf2f5d41fe6070424f8d0e3'
while url:
    data = requests.get(url).json()
    url2 = data['meta'].get('next')
    data2 = pd.io.json.json_normalize(data['results'])
    print(data2)
However, when I write it as:
while url:
    data = requests.get(url).json()
    print(data)
    url2 = data['meta'].get('next')
    data2 = pd.io.json.json_normalize(data['results'])
It comes out as a list that keeps printing itself, but I don't know whether it's looping through the same page or moving on to the next one.
I also need to use ["offset"] += 1 somewhere, but I don't know where to place it.

There is also a page parameter that you can use in your API call.
page = 1
url = '<base_url>&page=%d'
while page < 590:
    new_url = url % page
    # fetch new_url and do your magic
    ....
    page += 1
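If you prefer to keep the while loop and follow the next URL that the question's code already reads from data['meta'], a minimal sketch could look like this (it assumes, as the question's code does, that the /2/groups response exposes the next page URL under meta['next']; untested against the live API):

import requests
import pandas as pd

url = 'https://api.meetup.com/2/groups?offset=0&format=json&category_id=34&photo-host=public&page=100&radius=200.0&order=id&desc=false'
frames = []
while url:
    data = requests.get(url).json()
    frames.append(pd.io.json.json_normalize(data['results']))
    url = data['meta'].get('next')  # falsy once there are no more pages, which ends the loop

df = pd.concat(frames, ignore_index=True)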

Related

Python Get Request All Pages Movie list

The snippet below does not return the values of Page, Total pages, and data.
The function "getMovieTitles" does not return a value either.
import request
import json
def getMovieTitles(substr):
    titles = []
    url = "https://jsonmock.hackerrank.com/api/movies/search/?Title={}'.format(substr)"
    data = requests.get(url)
    print(data)
    response = json.loads(data.content.decode('utf-8'))
    print(data.content)
    for page in range(0, response['total_pages']):
        page_response = requests.get("https://jsonmock.hackerrank.com/api/movies/search/?Title={}}&page={}".format(substr, page + 1))
        page_content = json.loads(page_response.content.decode('utf-8'))
        print('page_content', page_content, 'type(page_content)', type(page_content))
        for item in range(0, len(page_content['data'])):
            titles.append(str(page_content['data'][item]['Title']))
    titles.sort()
    return titles

print(getMovieTitles('Superman'))
You're not formatting the url string correctly.
url = "https://jsonmock.hackerrank.com/api/movies/search/?Title={}'.format(substr)"
format() is a string method, but you've put the call inside the URL string itself. Instead do:
url = "https://jsonmock.hackerrank.com/api/movies/search/?Title={}".format(substr)
First, fix the import:
import requests
The next problem is in your string formatting: a ' where a " should be, so .format(substr) ends up inside the string literal. It should be
url = "https://jsonmock.hackerrank.com/api/movies/search/?Title={}".format(substr)
There is also one } too many in the paged request; it should be
page_response = requests.get("https://jsonmock.hackerrank.com/api/movies/search/?Title={}&page={}".format(substr, page + 1))
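Putting both fixes together, the function could look like this (a sketch that uses the endpoint and field names exactly as they appear in the question):

import json
import requests

def getMovieTitles(substr):
    titles = []
    url = "https://jsonmock.hackerrank.com/api/movies/search/?Title={}".format(substr)
    response = json.loads(requests.get(url).content.decode('utf-8'))
    for page in range(0, response['total_pages']):
        page_response = requests.get(
            "https://jsonmock.hackerrank.com/api/movies/search/?Title={}&page={}".format(substr, page + 1))
        page_content = json.loads(page_response.content.decode('utf-8'))
        for item in page_content['data']:
            titles.append(str(item['Title']))
    titles.sort()
    return titles

print(getMovieTitles('Superman'))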

Scraping with BeautifulSoup only gets me 33 responses from an infinitely scrolling page. How do I increase the number of responses?

The website link:
https://collegedunia.com/management/human-resources-management-colleges
The code:
import requests
from bs4 import BeautifulSoup

r = requests.get("https://collegedunia.com/management/human-resources-management-colleges")
c = r.content
soup = BeautifulSoup(c, "html.parser")
all = soup.find_all("div", {"class": "jsx-765939686 col-4 mb-4 automate_client_img_snippet"})
l = []
for divParent in all:
    item = divParent.find("div", {"class": "jsx-765939686 listing-block text-uppercase bg-white position-relative"})
    d = {}
    d["Name"] = item.find("div", {"class": "jsx-765939686 top-block position-relative overflow-hidden"}).find("div", {"class": "jsx-765939686 clg-name-address"}).find("h3").text
    d["Rating"] = item.find("div", {"class": "jsx-765939686 bottom-block w-100 position-relative"}).find("ul").find_all("li")[-1].find("a").find("span").text
    d["Location"] = item.find("div", {"class": "jsx-765939686 clg-head d-flex"}).find("span").find("span", {"class": "mr-1"}).text
    l.append(d)

import pandas
df = pandas.DataFrame(l)
df.to_excel("Output.xlsx")
The page keeps adding colleges as you scroll down. I don't know if I can get all the data, but is there a way to at least increase the number of responses I get? There are a total of 2506 entries, as can be seen on the website.
Looking at the network requests for that page, you can see the data is fetched by an AJAX request that uses base64-encoded parameters. You can follow the code below to fetch the data and parse it into your desired format.
Code:
import json
import pandas
import requests
import base64

collegedata = []
count = 0
while True:
    datadict = {"url": "management/human-resources-management-colleges", "stream": "13",
                "sub_stream_id": "607", "page": count}
    data = base64.urlsafe_b64encode(json.dumps(datadict).encode()).decode()
    params = {
        "data": data
    }
    response = requests.get('https://collegedunia.com/web-api/listing', params=params).json()
    if response["hasNext"]:
        for i in response["colleges"]:
            d = {}
            d["Name"] = i["college_name"]
            d["Rating"] = i["rating"]
            d["Location"] = i["college_city"] + ", " + i["state"]
            collegedata.append(d)
            print(d)
    else:
        break
    count += 1

df = pandas.DataFrame(collegedata)
df.to_excel("Output.xlsx", index=False)
Let me know if you have any questions :)
When you analyse the website via the Network tab in Chrome, you can see that it makes XHR calls in the background.
The endpoint to which it sends an HTTP GET request is as follows:
https://collegedunia.com/web-api/listing?data=eyJ1cmwiOiJtYW5hZ2VtZW50L2h1bWFuLXJlc291cmNlcy1tYW5hZ2VtZW50LWNvbGxlZ2VzIiwic3RyZWFtIjoiMTMiLCJzdWJfc3RyZWFtX2lkIjoiNjA3IiwicGFnZSI6M30=
When you send a GET request via the requests module, you get a JSON response back.
import requests
url = "https://collegedunia.com/web-api/listing?data=eyJ1cmwiOiJtYW5hZ2VtZW50L2h1bWFuLXJlc291cmNlcy1tYW5hZ2VtZW50LWNvbGxlZ2VzIiwic3RyZWFtIjoiMTMiLCJzdWJfc3RyZWFtX2lkIjoiNjA3IiwicGFnZSI6M30="
res = requests.get(url)
print(res.json())
But you need all the data, not only page 1. The data sent in the request is base64 encoded, i.e. if you decode the data parameter of the GET request, you can see the following:
{"url":"management/human-resources-management-colleges","stream":"13","sub_stream_id":"607","page":3}
Now change the page number, sub_stream_id, stream, etc. accordingly and get the complete data from the website.
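For example, decoding the data parameter from the URL above (a quick sketch just to show where that JSON comes from):

import base64

data_param = "eyJ1cmwiOiJtYW5hZ2VtZW50L2h1bWFuLXJlc291cmNlcy1tYW5hZ2VtZW50LWNvbGxlZ2VzIiwic3RyZWFtIjoiMTMiLCJzdWJfc3RyZWFtX2lkIjoiNjA3IiwicGFnZSI6M30="
print(base64.urlsafe_b64decode(data_param).decode())
# -> {"url":"management/human-resources-management-colleges","stream":"13","sub_stream_id":"607","page":3}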

Nested while loop for API json collection

I'm requesting 590 pages from the Meetup API. I've iterated with a while loop to get the page URLs. Now that I have the pages, I need to request each of them and format the responses correctly in Python so they can go into a Pandas DataFrame.
This is how it looks when you do it for one URL:
url = ('https://api.meetup.com/2/groups?offset=1&format=json&category_id=34&photo-host=public&page=100&radius=200.0&fields=&order=id&desc=false&sig_id=243750775&sig=768bcf78d9c73937fcf2f5d41fe6070424f8d0e3')
r = requests.get(url).json()
data = pd.io.json.json_normalize(r['results'])
But because I have so many pages I want to do this automatically and iterate through them all.
That's how nested while loops came to mind and this is what I tried:
urls = 0
offset = 0
url = 'https://api.meetup.com/2/groups?offset=%d&format=json&category_id=34&photo-host=public&page=100&radius=200.0&fields=&order=id&desc=false&sig_id=243750775&sig=768bcf78d9c73937fcf2f5d41fe6070424f8d0e3'
r = requests.get(urls%d = 'https://api.meetup.com/2/groups?offset=%d&format=json&category_id=34&photo-host=public&page=100&radius=200.0&fields=&order=id&desc=false&sig_id=243750775&sig=768bcf78d9c73937fcf2f5d41fe6070424f8d0e3').json()
while urlx < 591:
    new_url = r % urls % offset
    print(new_url)
    offset += 1
However, it isn't working and I'm receiving many errors including this one:
SyntaxError: keyword can't be an expression
Not sure what you're trying to do, and the code has lots of issues.
But if you just want to loop through 0 to 591 and fetch URLs, then here's the code:
import requests
import pandas as pd

dfs = []
base_url = 'https://api.meetup.com/2/groups?offset=%d&format=json&category_id=34&photo-host=public&page=100&radius=200.0&fields=&order=id&desc=false&sig_id=243750775&sig=768bcf78d9c73937fcf2f5d41fe6070424f8d0e3'
for i in range(0, 592):
    url = base_url % i
    r = requests.get(url).json()
    print("Fetching URL: %s\n" % url)
    # do something with r here
    # here I'll append it to a list of dfs
    dfs.append(pd.io.json.json_normalize(r['results']))
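If you want a single DataFrame at the end rather than a list, one option (not part of the answer above) is to concatenate the collected frames:

df = pd.concat(dfs, ignore_index=True)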

Why does my program only output the last page of a multiple page scraping operation?

I am trying to scrape multiple pages using BeautifulSoup, but I am only getting the last page's results as output. Please suggest the right approach. Below is my code.
# For every page
for page in range(0, 8):
    # Make a get request
    response = get('http://nationalacademyhr.org/fellowsdirectory?page=0%2C{}' + format(page))
    # Pause the loop
    sleep(randint(8, 15))
    # Monitor the requests
    requests += 1
    elapsed_time = time() - start_time
    print('Request:{}; Frequency: {} requests/s'.format(requests, requests/elapsed_time))
    clear_output(wait=True)
    html_soup = BeautifulSoup(response.text, 'html.parser')
    all_table_info = html_soup.find('table', class_="views-table cols-4")
    for name in all_table_info.find_all('div', class_="views-field views-field-view"):
        names.append(name.text.replace("\n", " ") if name.text else None)
    for organization in all_table_info.find_all('td', class_="views-field views-field-field-employer"):
        orgs.append(organization.text.strip() if organization.text else None)
    for year in all_table_info.find_all('td', class_="views-field views-field-view-2"):
        Years.append(year.text.strip() if year.text else None)

df = pd.DataFrame({'Name': names, 'Org': orgs, 'year': Years})
print(df)
There is a typo: a plus instead of a dot. You need
'http://nati...ge=0%2C{}'.format(page)
but you wrote
'http://nati...ge=0%2C{}' + format(page)
URLs that still contain the braces before the page number all end up at the same page.
EDIT:
If I was not clear, you need just change the line
response = get('http://nationalacademyhr.org/fellowsdirectory?page=0%2C{}' + format(page))
to
response = get('http://nationalacademyhr.org/fellowsdirectory?page=0%2C{}'.format(page))
In the first case the resulting URL also contains the literal substring '{}', which causes the problem.
Note: there are 9 pages on the site, identified by page=0,0 through to page=0,8, so your loop should use range(9). Or, even better, load the first page and then get the URL of the next page from the next link; iterate over all the pages by following the next link until there is no next link on the page.
Further to xhancar's answer which identifies the problem, a better way is to avoid string operations when building URLs, and instead let requests construct the URL query string for you:
for page in range(9):
    params = {'page': '0,{}'.format(page)}
    response = get('http://nationalacademyhr.org/fellowsdirectory', params=params)
The params parameter is passed to requests.get() which adds the values to the URL query string. The query parameters will be properly encoded, e.g. the , replaced with %2C.
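For the note above about following the next link instead of hard-coding the page count, a rough sketch could look like this (the 'pager-next' class is an assumption about the site's markup, not something confirmed in the thread):

from urllib.parse import urljoin
import requests
from bs4 import BeautifulSoup

url = 'http://nationalacademyhr.org/fellowsdirectory'
while url:
    response = requests.get(url)
    soup = BeautifulSoup(response.text, 'html.parser')
    # ... scrape the table on this page here ...
    next_link = soup.find('li', class_='pager-next')  # assumed pager markup, adjust to the real HTML
    url = urljoin(url, next_link.a['href']) if next_link and next_link.a else None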

Google news crawler flip pages

Continuing previous work to crawl all news results for a query and return the title and URL, I am refining the crawler to get all results from all pages of Google News. The current code seems to return only the first page of Google News search results. I would be grateful to know how to get the results from all pages. Many thanks!
my codes below:
import requests
from bs4 import BeautifulSoup
import time
import datetime
from random import randint
import numpy as np
import pandas as pd

query2Google = input("What do you want from Google News?\n")

def QGN(query2Google):
    s = '"'+query2Google+'"'  # Keywords for query
    s = s.replace(" ", "+")
    date = str(datetime.datetime.now().date())  # timestamp
    filename = query2Google+"_"+date+"_"+'SearchNews.csv'  # csv filename
    f = open(filename, "wb")
    url = "http://www.google.com.sg/search?q="+s+"&tbm=nws&tbs=qdr:y"  # URL for query of news results within one year and sort by date
    #htmlpage = urllib2.urlopen(url).read()
    time.sleep(randint(0, 2))  # waiting
    htmlpage = requests.get(url)
    print("Status code: " + str(htmlpage.status_code))
    soup = BeautifulSoup(htmlpage.text, 'lxml')
    df = []
    for result_table in soup.findAll("div", {"class": "g"}):
        a_click = result_table.find("a")
        #print ("-----Title----\n" + str(a_click.renderContents()))  # Title
        #print ("----URL----\n" + str(a_click.get("href")))  # URL
        #print ("----Brief----\n" + str(result_table.find("div", {"class": "st"}).renderContents()))  # Brief
        #print ("Done")
        df = np.append(df, [str(a_click.renderContents()).strip("b'"), str(a_click.get("href")).strip('/url?q='), str(result_table.find("div", {"class": "st"}).renderContents()).strip("b'")])
    df = np.reshape(df, (-1, 3))
    df1 = pd.DataFrame(df, columns=['Title', 'URL', 'Brief'])
    print("Search Crawl Done!")
    df1.to_csv(filename, index=False, encoding='utf-8')
    f.close()
    return

QGN(query2Google)
There used to be an AJAX API, but it's no longer available.
Still, you can modify your script with a for loop if you want a fixed number of pages, or a while loop if you want all pages.
Example :
url = "http://www.google.com.sg/search?q="+s+"&tbm=nws&tbs=qdr:y&start="
pages = 10 # the number of pages you want to crawl #
for next in range(0, pages*10, 10) :
page = url + str(next)
time.sleep(randint(1, 5)) # you may need longer than that #
htmlpage = requests.get(page) # you should add User-Agent and Referer #
print("Status code: " + str(htmlpage.status_code))
if htmlpage.status_code != 200 :
break # something went wrong #
soup = BeautifulSoup(htmlpage.text, 'lxml')
... process response here ...
next_page = soup.find('td', { 'class':'b', 'style':'text-align:left' })
if next_page is None or next_page.a is None :
break # there are no more pages #
Keep in mind that Google doesn't like bots; you might get banned.
You could add 'User-Agent' and 'Referer' headers to simulate a web browser, and use time.sleep(random.uniform(2, 6)) to simulate a human... or use Selenium.
You can also add &num=25 to the end of your query to get a page with that many results; in this example you'd get 25 Google results per page.
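A minimal sketch of the headers suggestion above (the User-Agent and Referer values are example strings only, not something the answer specifies):

import random
import time
import requests

headers = {
    "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64)",  # example value only
    "Referer": "https://www.google.com/",
}
page = "http://www.google.com.sg/search?q=%22test%22&tbm=nws&tbs=qdr:y&start=0"
htmlpage = requests.get(page, headers=headers)
print(htmlpage.status_code)
time.sleep(random.uniform(2, 6))  # pause like a human before requesting the next page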
