I have the following code, which scrapes information from a website. I want to rerun it every 10 seconds to refresh the data, and to format the output as a table that also shows the average of the top 5 values. How should I go about doing this?
import json
import requests
url = 'https://otc-api-hk.eiijo.cn/v1/data/trade-market?coinId=2&currency=3&tradeType=sell&blockType=general'
data = requests.get(url).json()
# uncomment this to print all data:
# print(json.dumps(data, indent=4))
print('USDT SGD')
print('----')
for d in data['data']:
    print('{:<30}{}'.format(d['userName'], d['price']))
url = 'https://otc-api.hbg.com/v1/data/trade-market?coinId=1&currency=3&tradeType=sell&blockType=general'
data = requests.get(url).json()
print('BTC SGD')
print('----')
for d in data['data']:
    print('{:<30}{}'.format(d['userName'], d['price']))
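Convert it into a pandas DataFrame, calculate the average using nlargest, and print the data and the average:
import pandas as pd
df = pd.DataFrame(data['data'])
# nlargest needs numeric values; the cast covers the case where the API returns prices as strings
df = df[['userName', 'price']].astype({'price': float})
top_5_avg = df.nlargest(5, 'price')['price'].mean()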
print(df)
print(f'The average of top 5 is {top_5_avg}')
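The snippet above covers the table and the average, but not the 10-second refresh. One possible way to add it is a plain loop with time.sleep. The following is a minimal sketch using only the standard library (heapq.nlargest stands in for the pandas nlargest call). The names top_n_average, format_table, and poll are hypothetical, and fetch stands for any function that returns the list found under data['data']. It assumes each record has the userName and price keys used in the question.

```python
import heapq
import time


def top_n_average(records, n=5):
    """Average of the n largest prices (fewer if there are fewer records)."""
    prices = [float(r["price"]) for r in records]  # assumption: price may arrive as a string
    top = heapq.nlargest(n, prices)
    return sum(top) / len(top) if top else None


def format_table(title, records, n=5):
    """Render the records as fixed-width text rows plus the top-n average."""
    lines = [title, "-" * 40]
    for r in records:
        lines.append("{:<30}{:>10}".format(r["userName"], r["price"]))
    lines.append("Average of top {}: {}".format(n, top_n_average(records, n)))
    return "\n".join(lines)


def poll(fetch, title, interval=10, rounds=None, sleep=time.sleep):
    """Call fetch() every `interval` seconds; rounds=None repeats until interrupted."""
    done = 0
    while rounds is None or done < rounds:
        print(format_table(title, fetch()))
        done += 1
        if rounds is None or done < rounds:
            sleep(interval)
```

With the question's code, a call could look like poll(lambda: requests.get(url).json()['data'], 'USDT SGD'). Note that the interval is measured between the end of one fetch and the start of the next, so the real period is slightly longer than 10 seconds.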
In a CSV file I have a column with 150k ID values, like below. I am trying to iterate through them and call the API with each value. The API has a request limit of 5000/min.
OBJEKT_ID
id1
id2
id3
...
I then want to put the requested data (CLASS) into a new csv-file together with the corresponding ID in another column. Like below.
OBJEKT_ID,CLASS
id1,X
id2,Y
id3,Z
...,...
However, I am only able to create one row of data (like below) in the new csv-file before I get an error message.
OBJEKT_ID,CLASS
id1,X
The error message is "index 1 is out of bounds for axis 0 with size 1". Why is this the case?
Here's the code:
object_df = pandas.read_csv("CSV_FILE.csv")
for index, row in object_df.iterrows():
    response = requests.get(
        f"url/{row[index]}",
        headers=headers)
    data = response.json()
    result = data["features"][0]["properties"]["agande"][0]["agare"]["analyser"]
    print(result)
    df = pandas.DataFrame()
    df['OBJEKT_ID'] = [row[index]]
    df['CLASS'] = [result]
    df.to_csv("collected_data.csv", index=False)
    time.sleep(0.0125)
Corrections for the following errors:
row is a one-element pandas Series, so row[index] (used in the URL and in df['OBJEKT_ID'] = [row[index]]) goes out of bounds as soon as index reaches 1, i.e. on the second iteration
You're creating a new DataFrame on each iteration, i.e. df = pandas.DataFrame()
You're overwriting the CSV file on each iteration, i.e. df.to_csv("collected_data.csv", index=False)
Code
import time
import pandas
import requests
object_df = pandas.read_csv("CSV_FILE.csv")
# Will hold the output data
output = {'OBJEKT_ID': [],
          'CLASS': []}
for index, row in object_df.iterrows():
    object_id = row['OBJEKT_ID']  # look the ID up by column name, not by the row index
    response = requests.get(
        f"url/{object_id}",
        headers=headers)  # note: headers not shown in posted code
    data = response.json()
    result = data["features"][0]["properties"]["agande"][0]["agare"]["analyser"]
    print(result)
    output['OBJEKT_ID'].append(object_id)  # ID column
    output['CLASS'].append(result)  # class column
    time.sleep(0.0125)  # rate limiting (note: another option is to use a rate limiting module
                        # such as https://pypi.org/project/ratelimit/)
# Create DataFrame
df = pandas.DataFrame(output)
# Write to csv
df.to_csv("collected_data.csv", index=False)
Alternative Method
Use rate limiter and apply function
import pandas
import requests
from ratelimit import limits, sleep_and_retry
@sleep_and_retry  # wait for the next window instead of raising RateLimitException
@limits(calls=4900, period=60)  # limits to 4900 calls per minute (backoff from 5000 max)
def call_api(object_id):  # function to process requests
    response = requests.get(
        f"url/{object_id}",
        headers=headers)  # note: headers not shown in posted code
    if response.status_code != 200:
        raise Exception('API response: {}'.format(response.status_code))
    data = response.json()
    return data["features"][0]["properties"]["agande"][0]["agare"]["analyser"]
# Create dataframe from CSV file
object_df = pandas.read_csv("CSV_FILE.csv")
# Add class column (calling api on each row)
object_df['CLASS'] = object_df['OBJEKT_ID'].apply(call_api)
# Write to csv
object_df.to_csv("collected_data.csv", index=False)
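Both versions above write the CSV only once, at the very end. With 150k IDs at 5000 requests per minute the run takes at least 30 minutes, so it may be worth writing each row as soon as it arrives. The following is a minimal sketch of that idea with the standard csv module, not a drop-in replacement: collect_to_csv is a hypothetical name, and fetch_class stands for whatever function performs the request and returns the CLASS value. Recording failures as an "ERROR: ..." string is an illustrative choice, not something the API prescribes.

```python
import csv
import time


def collect_to_csv(ids, fetch_class, out_path, delay=0.0125, sleep=time.sleep):
    """Write one (OBJEKT_ID, CLASS) row per id as soon as it is fetched."""
    written = 0
    with open(out_path, "w", newline="", encoding="utf-8") as handle:
        writer = csv.writer(handle)
        writer.writerow(["OBJEKT_ID", "CLASS"])
        for object_id in ids:
            try:
                result = fetch_class(object_id)
            except Exception as exc:  # keep going; record the failure in place of the class
                result = "ERROR: {}".format(exc)
            writer.writerow([object_id, result])
            handle.flush()  # rows already fetched survive a later crash
            written += 1
            sleep(delay)  # same pacing as in the question: 0.0125 s is at most 4800 calls/min
    return written
```

It could be called as collect_to_csv(object_df['OBJEKT_ID'], call_api, "collected_data.csv"), reusing a request function such as call_api from the alternative method.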
I am using the html link below to read the table in the page:
http://a810-bisweb.nyc.gov/bisweb/ActionsByLocationServlet?requestid=1&allbin=2016664
The last part of the link (allbin) is an ID. Changing it gives access to different tables and records, while the rest of the link stays the same. I have about 1000 more IDs like this. How can I use those IDs to read the different tables and join them together?
Desired output:
ID       Number       Type          FileDate
2016664  NB 14581-26  New Building  12/21/2020
4257909  NB 1481-29   New Building  3/6/2021
4138920  NB 481-29    New Building  9/4/2020
List of other IDs to use:
['4257909', '4138920', '4533715']
This was my attempt. I can read a single table with this code:
import requests
import pandas as pd
url = 'http://a810-bisweb.nyc.gov/bisweb/ActionsByLocationServlet?requestid=1&allbin=2016664'
html = requests.get(url).content
df_list = pd.read_html(html,header=0)
df = df_list[3]
df
To get all pages for a list of IDs, you can use the following example:
import requests
import pandas as pd
from io import StringIO
url = "http://a810-bisweb.nyc.gov/bisweb/ActionsByLocationServlet?requestid=1&allbin={}&allcount={}"
def get_info(ID, page=1):
    out = []
    while True:
        try:
            print("ID: {} Page: {}".format(ID, page))
            t = requests.get(url.format(ID, page), timeout=1).text
            df = pd.read_html(StringIO(t))[3].loc[1:, :]
            if len(df) == 0:
                break
            df.columns = ["NUMBER", "NUMBER", "TYPE", "FILE DATE"]
            df["ID"] = ID
            out.append(df)
            page += 25
        except requests.exceptions.ReadTimeout:
            print("Timeout...")
            continue
    return out
list_of_ids = [2016664, 4257909, 4138920, 4533715]
dfs = []
for ID in list_of_ids:
    dfs.extend(get_info(ID))
df = pd.concat(dfs)
print(df)
df.to_csv("data.csv", index=None)
Prints:
            NUMBER  NUMBER                      TYPE   FILE DATE       ID
1    ALT 1469-1890     NaN                ALTERATION  00/00/0000  2016664
2    ALT 1313-1874     NaN                ALTERATION  00/00/0000  2016664
3      BN 332-1938     NaN           BUILDING NOTICE  00/00/0000  2016664
4      BN 636-1916     NaN           BUILDING NOTICE  00/00/0000  2016664
5  CO NB 1295-1923   (PDF)  CERTIFICATE OF OCCUPANCY  00/00/0000  2016664
...
And saves the result to data.csv.
The code below will extract all the tables in a web page
import pandas as pd
url = 'http://a810-bisweb.nyc.gov/bisweb/ActionsByLocationServlet?requestid=1&allbin=2016664'
df_list = pd.read_html(url)  # returns a list of DataFrames, one per table on the page
print(len(df_list))  # print the number of DataFrames
for df in df_list:  # loop through the list to print all tables
    print(df)
I'm using Python to do some data cleaning/task automation, but am having a hard time reading in data through an API with multiple conditions. My data is as follows:
url = "https://data.cityofnewyork.us/resource/erm2-nwe9.json?descriptor='Social Distancing' or descriptor='Face Covering Violation' or descriptor='Business not in compliance'"
r = requests.get(url)
x = r.json()
df = pd.DataFrame(x)
When I pull it, it only provides me with data where the descriptor is 'Social Distancing'. Any tips on how to change this so that it filters for all of the needed data?
Make three requests and merge their responses:
def get_data(descriptor):  # renamed from `filter` to avoid shadowing the builtin
    url = "https://data.cityofnewyork.us/resource/erm2-nwe9.json?descriptor='{0}'".format(descriptor)
    r = requests.get(url)
    return pd.DataFrame(r.json())
df = pd.concat([
    get_data('Social Distancing'),
    get_data('Face Covering Violation'),
    get_data('Business not in compliance')
])
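As an alternative to three requests, the Socrata API behind this dataset accepts a SoQL $where parameter, which should allow all three descriptors to be matched in a single call. The following is a minimal sketch under that assumption, using only the standard library. build_url and fetch_rows are hypothetical names, the limit of 1000 is only an illustrative value, and escaping a single quote by doubling it is assumed to follow the usual SQL convention.

```python
import json
import urllib.parse
import urllib.request

BASE_URL = "https://data.cityofnewyork.us/resource/erm2-nwe9.json"


def build_url(descriptors, limit=1000):
    """Build one URL whose $where clause matches any of the descriptors."""
    # Assumption: single quotes inside a value are escaped by doubling them.
    quoted = ", ".join("'{}'".format(d.replace("'", "''")) for d in descriptors)
    params = {"$where": "descriptor in ({})".format(quoted), "$limit": limit}
    return BASE_URL + "?" + urllib.parse.urlencode(params)


def fetch_rows(descriptors, limit=1000):
    """Return the decoded JSON rows (a list of dicts) for the combined filter."""
    with urllib.request.urlopen(build_url(descriptors, limit), timeout=30) as resp:
        return json.load(resp)
```

The rows could then be loaded with pd.DataFrame(fetch_rows(['Social Distancing', 'Face Covering Violation', 'Business not in compliance'])). If the dataset holds more matching rows than the limit, the result is truncated, so the limit would need raising or the request paging.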
I am having difficulty increasing the number of requests I can make per second with the Google Maps Geocoder. I am using a paid account (at $.50/1000 requests), so according to the Google Geocoder API I should be able to make up to 50 requests per second.
I have a list of 15k addresses that I am trying to get GPS coordinates for. I am storing them in a pandas DataFrame and looping over them. To make sure this wasn't due to slow looping, I tested how fast it loops over all 15k, and it only took 1.5 seconds. But I was only able to make fewer than 1 request per second. I realized this might be due to my slow internet connection, so I fired up a Windows Google Cloud VM with a fast connection. That sped things up to about 1.5 requests per second, which is still far slower than theoretically possible.
I thought this might be due to using the Python library geocoder, so I tried making the request directly with Python requests, but this didn't speed things up either.
Does this have something to do with the fact that I'm not using a server? I would think this wouldn't matter since I'm using a Google Cloud VM. Also, I believe this has nothing to do with multithreading, since the loop itself already runs extremely fast on 1 core. Thanks in advance for any thoughts.
import geocoder
import pandas as pd
import time
import requests
startTime = time.time()
#Read File Name with all transactions up to October 4th
input_filename = "C:/Users/username/Downloads/transaction-export 10-04-2017.csv"
df = pd.read_csv(input_filename, header=0, error_bad_lines=False)
#Only look at customer addresses
df = df['Customer Address']
#Drop duplicates and NAs
df = df.drop_duplicates(keep='first')
df = df.dropna()
#convert the Series of addresses to a list
addresses = df.tolist()
#Google Api Key
api_key = 'my_api_key'
#create empty array
address_gps = []
#google api address
url = 'https://maps.googleapis.com/maps/api/geocode/json'
#For each address return its geocoded latlng coordinates
for i, val in enumerate(addresses):
    ''' Direct way to make call without geocoder
    params = {'sensor': 'false', 'address': val, 'key': api_key}
    r = requests.get(url, params=params)
    results = r.json()['results']
    location = results[0]['geometry']['location']
    print(location['lat'], location['lng'])
    '''
    endTime = time.time()
    g = geocoder.google(val, key=api_key, exactly_one=True)
    print("Address,", val, "Number,", i, "Total,", len(addresses), "Time,", endTime-startTime)
    if g.ok:
        address_gps.append(g.latlng)
        print(g.latlng)
    else:
        address_gps.append(0)
        print("Error")
    #save every 100 iterations
    if i%100==0:
        # save as csv
        df1 = pd.DataFrame({'Address GPS': address_gps})
        df1.to_csv('C:/Users/username/Downloads/AllCustomerAddressAsGPS.csv')
# save as csv
df1 = pd.DataFrame({'Address GPS': address_gps})
df1.to_csv('C:/Users/username/Downloads/AllCustomerAddressAsGPS.csv')
One way to increase the speed of this is to maintain the requests session with Google, rather than creating a new session with every request. This is suggested in the geocoder documentation.
Your modified code will then be:
import requests
#Google Api Key
api_key = 'my_api_key'
#create empty array
address_gps = []
#google api address
url = 'https://maps.googleapis.com/maps/api/geocode/json'
#For each address return its geocoded latlng coordinates
with requests.Session() as session:
    for i, val in enumerate(addresses):
        ''' Direct way to make call without geocoder
        params = {'sensor': 'false', 'address': val, 'key': api_key}
        r = session.get(url, params=params)
        results = r.json()['results']
        location = results[0]['geometry']['location']
        print(location['lat'], location['lng'])
        '''
        endTime = time.time()
        # reuse the open session instead of creating a new connection per request
        g = geocoder.google(val, key=api_key, exactly_one=True, session=session)
        print("Address,", val, "Number,", i, "Total,", len(addresses), "Time,", endTime-startTime)
        if g.ok:
            address_gps.append(g.latlng)
            print(g.latlng)
        else:
            address_gps.append(0)
            print("Error")
        #save every 100 iterations
        if i%100==0:
            # save as csv
            df1 = pd.DataFrame({'Address GPS': address_gps})
            df1.to_csv('C:/Users/username/Downloads/AllCustomerAddressAsGPS.csv')
# save as csv
df1 = pd.DataFrame({'Address GPS': address_gps})
df1.to_csv('C:/Users/username/Downloads/AllCustomerAddressAsGPS.csv')
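Reusing the session removes the connection setup cost, but each request still waits for its response before the next one starts. Since that wait is network latency rather than CPU work, issuing several requests at once with a thread pool may raise throughput further. The following is a minimal sketch of that idea, not a tuned implementation: geocode_all is a hypothetical name, geocode_one stands for any function that takes an address and returns its coordinates, and max_workers=10 is an arbitrary starting value that would need to stay within the API's per-second quota.

```python
from concurrent.futures import ThreadPoolExecutor


def geocode_all(addresses, geocode_one, max_workers=10):
    """Run geocode_one over addresses concurrently; results keep input order."""
    def safe(address):
        try:
            return geocode_one(address)
        except Exception:
            return None  # placeholder for a failed lookup, similar to the 0 used above

    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        return list(pool.map(safe, addresses))
```

For example, geocode_one could wrap the existing call as lambda a: geocoder.google(a, key=api_key).latlng. Whether a single requests.Session can safely be shared between threads is not something this sketch relies on.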
With this code
import pandas as pd
import requests
link = "http://sp.kaola.com/api/category/goods?pageNo=1&pageSize=20&search=%7B%0A%20%20%22sortType%22%20%3A%20%7B%0A%20%20%20%20%22type%22%20%3A%200%0A%20%20%7D%2C%0A%20%20%22isNavigation%22%20%3A%20%220%22%2C%0A%20%20%22filterTypeList%22%20%3A%20%5B%0A%20%20%20%20%7B%0A%20%20%20%20%20%20%22id%22%20%3A%20%5B%0A%20%20%20%20%20%20%20%204055%0A%20%20%20%20%20%20%5D%2C%0A%20%20%20%20%20%20%22type%22%20%3A%201%2C%0A%20%20%20%20%20%20%22category%22%20%3A%20%7B%0A%20%20%20%20%20%20%20%20%22parentCategoryId%22%20%3A%200%2C%0A%20%20%20%20%20%20%20%20%22categoryId%22%20%3A%204055%0A%20%20%20%20%20%20%7D%0A%20%20%20%20%7D%0A%20%20%5D%2C%0A%20%20%22source%22%20%3A%201%2C%0A%20%20%22noStoreCount%22%20%3A%200%2C%0A%20%20%22isActivity%22%20%3A%200%2C%0A%20%20%22storeCount%22%20%3A%2060%0A%7D"
df = requests.get(link).json()
print(df)
I can get the response for the URL I'm querying.
But how can I get the data when the URL's GET argument pageNo becomes 3, 4, and so on?
I want to get the responses from all the pages in one request. Is this possible?
Each page gives me 20 results. How can I get all of them?
Update:
I use this method to clean the JSON:
from pandas.io.json import json_normalize
df1 = df['body']
df_final = json_normalize(df1['result'],'goodsList')
How can I get all the responses into a single DataFrame?
Getting all the results in one response doesn't seem possible. The page size is controlled by the server, so only the people who run the website can change it.
What you can do is loop through the pages of the search result and combine them. The response includes a hasMore field that tells you whether there are more search results. This gives something like this:
import requests
link = "http://sp.kaola.com/api/category/goods?pageSize=20&search=%7B%0A%20%20%22sortType%22%20%3A%20%7B%0A%20%20%20%20%22type%22%20%3A%200%0A%20%20%7D%2C%0A%20%20%22isNavigation%22%20%3A%20%220%22%2C%0A%20%20%22filterTypeList%22%20%3A%20%5B%0A%20%20%20%20%7B%0A%20%20%20%20%20%20%22id%22%20%3A%20%5B%0A%20%20%20%20%20%20%20%204055%0A%20%20%20%20%20%20%5D%2C%0A%20%20%20%20%20%20%22type%22%20%3A%201%2C%0A%20%20%20%20%20%20%22category%22%20%3A%20%7B%0A%20%20%20%20%20%20%20%20%22parentCategoryId%22%20%3A%200%2C%0A%20%20%20%20%20%20%20%20%22categoryId%22%20%3A%204055%0A%20%20%20%20%20%20%7D%0A%20%20%20%20%7D%0A%20%20%5D%2C%0A%20%20%22source%22%20%3A%201%2C%0A%20%20%22noStoreCount%22%20%3A%200%2C%0A%20%20%22isActivity%22%20%3A%200%2C%0A%20%20%22storeCount%22%20%3A%2060%0A%7D"
max_pages = 100
data = {}
for page_no in range(1, max_pages + 1):  # the question's URL starts at pageNo=1
    try:
        req = requests.get(link + "&pageNo=" + str(page_no))
    except requests.ConnectionError:
        break  # Stop loop if the request could not be made.
    df = req.json()
    # Here, add whatever data you want to save from df to data
    # (do this before the check below so the last page is kept too)
    if df["body"]["result"]["hasMore"] == 0:
        break  # Page says it has no more results
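To end up with a single DataFrame, one option is to gather the goodsList entries of every page into one list first and build the DataFrame once at the end. The following is a minimal sketch of that step. collect_goods is a hypothetical name, fetch_page stands for any function that takes a page number and returns the decoded JSON of that page, and the body, result, goodsList, and hasMore keys are the ones used in the question and in the loop above.

```python
def collect_goods(fetch_page, max_pages=100, first_page=1):
    """Gather goodsList entries from every page until hasMore is falsy."""
    rows = []
    for page_no in range(first_page, first_page + max_pages):
        payload = fetch_page(page_no)
        result = payload["body"]["result"]
        rows.extend(result.get("goodsList", []))  # keep this page before checking hasMore
        if not result.get("hasMore"):
            break  # the server reports no further pages
    return rows
```

With the link from the question this could be used as rows = collect_goods(lambda n: requests.get(link + "&pageNo=" + str(n)).json()), followed by df_final = pd.json_normalize(rows) to get one row per item.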