Good afternoon, everyone,
I have an API that I want to consume, but the information is spread across more than one page.
In the example below I'm querying page 1, but there are 50+ pages. How can I adapt this to fetch all pages automatically?
I then want to save the result in a variable so I can build a dataframe.
import json, requests

link = "https://some.com.br/api/v1/integration/customers.json"
headers = {'iliot-company-token': '3r5s$ddfdassss'}
Parameters = {"page": 1}

clients2 = requests.get(link, headers=headers, json=Parameters)
lista_clients2 = clients2.json()
print(lista_clients2)
Assuming you have exactly 50 pages, numbered 1-50, this approach might work: define your request as a function and then use map to go through all the pages. This assumes all of the pages use the same headers, API token, etc.
import json, requests

link = "https://some.com.br/api/v1/integration/customers.json"
headers = {'iliot-company-token': '3r5s$ddfdassss'}

def get_data(page):
    # Fetch a single page and return the parsed JSON
    Parameters = {"page": page}
    clients2 = requests.get(link, headers=headers, json=Parameters)
    return clients2.json()

#Get it all at once
all_page_data = list(map(get_data, range(1, 51)))

#If you want to make a dataframe
import pandas as pd
df = pd.DataFrame(all_page_data)

#You can also split out json-formatted data, if it's in a single column
full_df = pd.json_normalize(df[0])
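If the total number of pages isn't known up front, a small variation is to keep requesting pages until the API stops returning data. This is only a sketch, reusing link, headers and pd from the snippet above, and it assumes the endpoint returns an empty body or empty list once the pages run out; adjust the stop condition to whatever your API actually sends back:

def get_all_pages():
    all_rows = []
    page = 1
    while True:
        resp = requests.get(link, headers=headers, json={"page": page})
        data = resp.json()
        if not data:  # assumption: an empty response means there are no more pages
            break
        # keep list responses flat, wrap single objects
        all_rows.extend(data if isinstance(data, list) else [data])
        page += 1
    return all_rows

all_page_data = get_all_pages()
df = pd.DataFrame(all_page_data)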
I have the following code to pull data from a website that has multiple pages and then turn the HTML tables into a dataframe. However, the code below only keeps the last table of the HTML data, so I don't get the full result.
At the bottom of the script I have s = Scraper(urlsList[0]), which accesses the list urlsList defined above it. What I can't figure out is how to write what is essentially a loop that creates a dataframe for each entry in the list (0, 1, 2, 3, etc.), so that when the following code is run:
s = Scraper(urlsList[n]) #n being the url page number
df
a separate dataframe is produced for each n.
At the moment I have a for loop which goes through the page numbers in the URL one by one.
Unfortunately I can't share the real URL as it requires authentication, but I've made one up to show how the code functions.
import io
import urllib3
import pandas as pd
from requests_kerberos import OPTIONAL, HTTPKerberosAuth

a = get_mwinit_cookie()

pageCount = 6
urlsList = []
urls = "https://example-url.com/ABCD/customer.currentPage={}&end"

for x in range(pageCount)[1:]:
    urlsList.append(urls.format(x))

def Scraper(url):
    urllib3.disable_warnings()
    with requests_retry_session() as req:
        resp = req.get(url,
                       timeout=30,
                       verify=False,
                       allow_redirects=True,
                       auth=HTTPAuth(mutual_authentication=OPTIONAL),
                       cookies=a)
    global df
    #data = resp.text
    data = pd.read_html(resp.text, flavor=None, header=0, index_col=0)
    df = pd.concat(data, sort=False)
    #df = data
    print(df)

s = Scraper(urlsList[0])
df
You need to return something from your scraper function. Then you can collect the data from the pages in a list of DataFrames and use pd.concat() on that list.
def Scraper(url):
    urllib3.disable_warnings()
    with requests_retry_session() as req:
        resp = req.get(url,
                       timeout=30,
                       verify=False,
                       allow_redirects=True,
                       auth=HTTPAuth(mutual_authentication=OPTIONAL),
                       cookies=a)
    return pd.read_html(resp.text, flavor=None, header=0, index_col=0)[0]

pages_data = []
for url in urlsList:
    pages_data.append(Scraper(url))

df = pd.concat(pages_data, sort=False)
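And since the original goal was a separate dataframe for each n, the returning Scraper can just as easily fill a dictionary keyed by page number instead of one concatenated frame, for example:

# One DataFrame per page, keyed by its position in urlsList
dfs_by_page = {n: Scraper(url) for n, url in enumerate(urlsList, start=1)}
dfs_by_page[1]  # the DataFrame for the first page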
I wrote code that takes an identifier number, makes a request to a specific API, and returns the data related to that identifier. The code runs through a dataframe, takes the identifier numbers (approximately 600), retrieves the corresponding information, and converts it into a pandas dataframe. Finally, all the dataframes are concatenated into one. The code runs super slow. Is there any way to make it faster? I'm not confident in Python and would appreciate it if you could share a solution.
Code:
import json
import requests
import pandas as pd

file = dataframe
list_df = []

for index, row in file.iterrows():
    url = "https://some_api/?eni_number=" + str(row['id']) + "&apikey=some_apikey"
    payload = {}
    headers = {
        'Accept': 'application/json'
    }
    response = requests.request("GET", url, headers=headers, data=payload)
    a = json.loads(response.text)
    df_index = pd.json_normalize(a, 'vessels')
    df_index['eni_number'] = row['id']
    list_df.append(df_index)
    #print(df_index)

total = pd.concat(list_df)
It seems the bottleneck here is that HTTP requests are executed synchronously, one after the other. So most of the time is wasted waiting for responses from the server.
You may have better results by using an asynchronous approach, for example using grequests to execute all HTTP requests in parallel:
import json
import grequests
import pandas as pd

ids = dataframe["id"].to_list()
urls = [f"https://some_api/?eni_number={id}&apikey=some_apikey" for id in ids]

payload = {}
headers = {'Accept': 'application/json'}
reqs = (grequests.get(url, headers=headers, data=payload) for url in urls)
responses = grequests.map(reqs)  # execute all requests in parallel

list_df = []
for id, response in zip(ids, responses):
    a = json.loads(response.text)
    df_index = pd.json_normalize(a, 'vessels')
    df_index['eni_number'] = id
    list_df.append(df_index)

total = pd.concat(list_df)
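If installing grequests is awkward (it relies on gevent monkey-patching), a thread pool with plain requests gives a comparable speedup. A minimal sketch, assuming the same dataframe and the same made-up API URL:

import requests
import pandas as pd
from concurrent.futures import ThreadPoolExecutor

ids = dataframe["id"].to_list()

def fetch(id):
    # Fetch one identifier and return its normalized DataFrame
    url = f"https://some_api/?eni_number={id}&apikey=some_apikey"
    resp = requests.get(url, headers={'Accept': 'application/json'})
    df_index = pd.json_normalize(resp.json(), 'vessels')
    df_index['eni_number'] = id
    return df_index

# Run up to 20 requests at a time instead of one after the other
with ThreadPoolExecutor(max_workers=20) as pool:
    list_df = list(pool.map(fetch, ids))

total = pd.concat(list_df)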
The website link:
https://collegedunia.com/management/human-resources-management-colleges
The code:
import requests
from bs4 import BeautifulSoup

r = requests.get("https://collegedunia.com/management/human-resources-management-colleges")
c = r.content
soup = BeautifulSoup(c, "html.parser")

all = soup.find_all("div", {"class": "jsx-765939686 col-4 mb-4 automate_client_img_snippet"})

l = []
for divParent in all:
    item = divParent.find("div", {"class": "jsx-765939686 listing-block text-uppercase bg-white position-relative"})
    d = {}
    d["Name"] = item.find("div", {"class": "jsx-765939686 top-block position-relative overflow-hidden"}).find("div", {"class": "jsx-765939686 clg-name-address"}).find("h3").text
    d["Rating"] = item.find("div", {"class": "jsx-765939686 bottom-block w-100 position-relative"}).find("ul").find_all("li")[-1].find("a").find("span").text
    d["Location"] = item.find("div", {"class": "jsx-765939686 clg-head d-flex"}).find("span").find("span", {"class": "mr-1"}).text
    l.append(d)

import pandas
df = pandas.DataFrame(l)
df.to_excel("Output.xlsx")
The page keeps adding colleges as you scroll down. I don't know if I can get all the data, but is there a way to at least increase the number of results I get? There are a total of 2506 entries, as can be seen on the website.
Looking at your question: in the network tab you can see that the data is fetched via an AJAX request, and the site uses base64-encoded params to fetch it. You can follow the code below to get the data and parse it into your desired format.
Code:
import json
import pandas
import requests
import base64

collegedata = []
count = 0

while True:
    datadict = {"url": "management/human-resources-management-colleges", "stream": "13", "sub_stream_id": "607",
                "page": count}
    data = base64.urlsafe_b64encode(json.dumps(datadict).encode()).decode()
    params = {
        "data": data
    }
    response = requests.get('https://collegedunia.com/web-api/listing', params=params).json()
    if response["hasNext"]:
        for i in response["colleges"]:
            d = {}
            d["Name"] = i["college_name"]
            d["Rating"] = i["rating"]
            d["Location"] = i["college_city"] + ", " + i["state"]
            collegedata.append(d)
            print(d)
    else:
        break
    count += 1

df = pandas.DataFrame(collegedata)
df.to_excel("Output.xlsx", index=False)
Let me know if you have any questions :)
When you analyse the website via the network tab in Chrome, you can see that it makes XHR calls in the background.
The endpoint it sends an HTTP GET request to is the following:
https://collegedunia.com/web-api/listing?data=eyJ1cmwiOiJtYW5hZ2VtZW50L2h1bWFuLXJlc291cmNlcy1tYW5hZ2VtZW50LWNvbGxlZ2VzIiwic3RyZWFtIjoiMTMiLCJzdWJfc3RyZWFtX2lkIjoiNjA3IiwicGFnZSI6M30=
When you send a GET via the requests module, you get a JSON response back.
import requests
url = "https://collegedunia.com/web-api/listing?data=eyJ1cmwiOiJtYW5hZ2VtZW50L2h1bWFuLXJlc291cmNlcy1tYW5hZ2VtZW50LWNvbGxlZ2VzIiwic3RyZWFtIjoiMTMiLCJzdWJfc3RyZWFtX2lkIjoiNjA3IiwicGFnZSI6M30="
res = requests.get(url)
print(res.json())
But you need all the data, not only page 1. The data sent in the request is base64 encoded, i.e. if you decode the data parameter of the GET request, you see the following:
{"url":"management/human-resources-management-colleges","stream":"13","sub_stream_id":"607","page":3}
Now change the page number, sub_stream_id, stream, etc. accordingly to get the complete data from the website.
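Putting that together, here is a short sketch of how the data parameter could be rebuilt for an arbitrary page, re-encoding the decoded payload shown above:

import json
import base64
import requests

def listing_params(page):
    # Re-encode the decoded payload shown above, with a different page number
    payload = {"url": "management/human-resources-management-colleges",
               "stream": "13", "sub_stream_id": "607", "page": page}
    return {"data": base64.urlsafe_b64encode(json.dumps(payload).encode()).decode()}

res = requests.get("https://collegedunia.com/web-api/listing", params=listing_params(5))
print(res.json())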
When I apply filters on the website before webscraping, it gives me the following URL: https://www.marktplaats.nl/l/auto-s/p/2/#f:10898,10882
However, when I use it in my script to retrieve the href of each advertisement, it returns results from https://www.marktplaats.nl/l/auto-s/p/2, completely ignoring my two filters (namely #f:10898,10882).
Can you please advise me what the problem is?
import requests
import bs4
import pandas as pd

frames = []
for pagenumber in range(0, 2):
    url = 'https://www.marktplaats.nl/l/auto-s/p/'
    add_url = '/#f:10898,10882'
    txt = requests.get(url + str(pagenumber) + add_url)
    soup = bs4.BeautifulSoup(txt.text, 'html.parser')
    soup_table = soup.find('ul', 'mp-Listings mp-Listings--list-view')
    for car in soup_table.findAll('li'):
        link = car.find('a')
        sub_url = 'https://www.marktplaats.nl/' + link.get('href')
        sub_soup = requests.get(sub_url)
        soup1 = bs4.BeautifulSoup(sub_soup.text, 'html.parser')
I would suggest that you use their API instead, which seems to be open. (The part after # in your URL is a fragment; the browser handles it client-side and requests never sends it to the server, which is why your filters are ignored.)
If you open the link below you will see the same listings you are searching for, with the appropriate filters applied and no need to parse HTML (try a JSON formatter, since the raw response looks like a wall of text). You can also modify it easily in requests just by changing the query parameters.
https://www.marktplaats.nl/lrp/api/search?attributesById[]=10898&attributesById[]=10882&l1CategoryId=91&limit=30&offset=0
In code it would look something like this:
import requests

def getcars():
    url = 'https://www.marktplaats.nl/lrp/api/search'
    # A dict can't hold the same key twice, so pass both attribute ids as a list
    querystring = {
        'attributesById[]': [10898, 10882],
        'l1CategoryId': 91,
        'limit': 30,
        'offset': 0
    }
    headers = {
    }
    response = requests.get(url, headers=headers, params=querystring)
    x = response.json()
    return x

cars = getcars()
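To collect more than the first 30 results, the same request can be repeated with a growing offset until no more listings come back, and the rows fed into a dataframe. A sketch, with the caveat that the key holding the results ('listings' below) is an assumption about the response shape:

import requests
import pandas as pd

def get_all_cars(page_size=30):
    url = 'https://www.marktplaats.nl/lrp/api/search'
    rows = []
    offset = 0
    while True:
        querystring = {
            'attributesById[]': [10898, 10882],
            'l1CategoryId': 91,
            'limit': page_size,
            'offset': offset
        }
        data = requests.get(url, params=querystring).json()
        listings = data.get('listings', [])  # assumption: results live under this key
        if not listings:
            break
        rows.extend(listings)
        offset += page_size
    return pd.DataFrame(rows)

df = get_all_cars()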
I'm requesting 590 pages from the Meetup API. I've iterated with a while loop to build the page URLs. Now that I have the pages, I need to request them and format them correctly in Python so I can place them into a Pandas dataframe.
This is how it looks when you do it for one URL:
url = ('https://api.meetup.com/2/groups?offset=1&format=json&category_id=34&photo-host=public&page=100&radius=200.0&fields=&order=id&desc=false&sig_id=243750775&sig=768bcf78d9c73937fcf2f5d41fe6070424f8d0e3')
r = requests.get(url).json()
data = pd.io.json.json_normalize(r['results'])
But because I have so many pages I want to do this automatically and iterate through them all.
That's how nested while loops came to mind and this is what I tried:
urls = 0
offset = 0
url = 'https://api.meetup.com/2/groups?offset=%d&format=json&category_id=34&photo-host=public&page=100&radius=200.0&fields=&order=id&desc=false&sig_id=243750775&sig=768bcf78d9c73937fcf2f5d41fe6070424f8d0e3'
r = requests.get(urls%d = 'https://api.meetup.com/2/groups?offset=%d&format=json&category_id=34&photo-host=public&page=100&radius=200.0&fields=&order=id&desc=false&sig_id=243750775&sig=768bcf78d9c73937fcf2f5d41fe6070424f8d0e3').json()
while urlx < 591:
new_url = r % urls % offset
print(new_url)
offset += 1
However, it isn't working and I'm receiving many errors including this one:
SyntaxError: keyword can't be an expression
Not sure what you're trying to do, and the code has lots of issues.
But if you just want to loop through 0 to 591 and fetch URLs, then here's the code:
import requests
import pandas as pd

dfs = []
base_url = 'https://api.meetup.com/2/groups?offset=%d&format=json&category_id=34&photo-host=public&page=100&radius=200.0&fields=&order=id&desc=false&sig_id=243750775&sig=768bcf78d9c73937fcf2f5d41fe6070424f8d0e3'

for i in range(0, 592):
    url = base_url % i
    r = requests.get(url).json()
    print("Fetching URL: %s\n" % url)
    # do something with r here
    # here I'll append it to a list of dfs
    dfs.append(pd.io.json.json_normalize(r['results']))
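And since the goal is a single Pandas dataframe, the per-page frames collected in dfs can be combined at the end, for example:

# Combine all per-page results into one dataframe
df = pd.concat(dfs, ignore_index=True)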