Webscraping data from a JSON source, why do I get only 1 row? - python

I'm trying to get some information from a webshop with Python. I tried this:
import requests
import pandas as pd
from datetime import datetime

def proba():
    my_url = requests.get('https://www.telekom.hu/shop/categoryresults/?N=10994&contractType=list_price&instock_products=1&Ns=sku.sortingPrice%7C0%7C%7Cproduct.displayName%7C0&No=0&Nrpp=9&paymentType=FULL')
    data = my_url.json()
    results = []
    products = data['MainContent'][0]['contents'][0]['productList']['products']
    for product in products:
        name = product['productModel']['displayName']
        try:
            priceGross = product['priceInfo']['priceItemSale']['gross']
        except:
            priceGross = product['priceInfo']['priceItemToBase']['gross']
        url = product['productModel']['url']
    results.append([name, priceGross, url])
    df = pd.DataFrame(results, columns=['Name', 'Price', 'Url'])
    # print(df)  ## print df
    df.to_csv(r'/usr/src/Python-2.7.13/test.csv', sep=',', encoding='utf-8-sig', index=False)

while True:
    mytime = datetime.now().strftime("%H:%M:%S")
    while mytime < "23:59:59":
        print mytime
        proba()
        mytime = datetime.now().strftime("%H:%M:%S")
In this webshop there are 9 items, but I see only 1 row in the CSV file.

Not entirely sure what you intend as the end result. Are you wanting to update an existing file, or get the data and write it all out in one go? An example of the latter is shown below, where I add each new dataframe to an overall dataframe and use a return statement in the function to provide each new dataframe.
import requests
from datetime import datetime
import pandas as pd

def proba():
    my_url = requests.get('https://www.telekom.hu/shop/categoryresults/?N=10994&contractType=list_price&instock_products=1&Ns=sku.sortingPrice%7C0%7C%7Cproduct.displayName%7C0&No=0&Nrpp=9&paymentType=FULL')
    data = my_url.json()
    results = []
    products = data['MainContent'][0]['contents'][0]['productList']['products']
    for product in products:
        name = product['productModel']['displayName']
        try:
            priceGross = product['priceInfo']['priceItemSale']['gross']
        except:
            priceGross = product['priceInfo']['priceItemToBase']['gross']
        url = product['productModel']['url']
        results.append([name, priceGross, url])
    df = pd.DataFrame(results, columns=['Name', 'Price', 'Url'])
    return df

headers = ['Name', 'Price', 'Url']
df = pd.DataFrame(columns=headers)

while True:
    mytime = datetime.now().strftime("%H:%M:%S")
    while mytime < "23:59:59":
        print(mytime)
        dfCurrent = proba()
        mytime = datetime.now().strftime("%H:%M:%S")
        df = pd.concat([df, dfCurrent])
    df.to_csv(r"C:\Users\User\Desktop\test.csv", encoding='utf-8')

Related

Create merged df based on the url list [pandas]

I was able to extract the data from the url_query URL, but I would also like to get the data from the URLs built from the query['ids'] column of the dataframe. Below is the current logic:
url = 'https://instancename.some-platform.com/api/now/table/data?display_value=true&'
team = 'query=group_name=123456789'
url_query = url+team
The dataframe query:

                                 ids
0   aaabbb1cccdddeee4ffggghhhhh5iijj
1   aa1bbb2cccdddeee5ffggghhhhh6iijj
issue_list = []
for issue in query['ids']:
    issue_list.append(f'https://instancename.some-platform.com/api/now/table/data?display_value=true&?display_value=true&query=group_name&sys_id={issue}')

response = requests.get(url_query, headers=headers, auth=auth, proxies=proxies)
data = response.json()

def api_response(k):
    dct = dict(
        event_id = k['number'],
        created_time = k['created'],
        status = k['status'],
        created_by = k['raised_by'],
        short_desc = k['short_description'],
        group = k['team']
    )
    return dct

raw_data = []
for k in data['result']:
    rec = api_response(k)
    raw_data.append(rec)

df = pd.DataFrame.from_records(raw_data)
The df built from the url_query response contains what I need, but I would also like to add the data from the URLs in issue_list to that existing df. I don't know how to pass issue_list to the request. I tried response = requests.get(issue_list, headers=headers, auth=auth, proxies=proxies), but I got an invalid schema error.
You can create a list of DataFrames by requesting each URL q instead of url_query, and then join them together with concat:
dfs = []
for issue in query['ids']:
    q = f'https://instancename.some-platform.com/api/now/table/data?display_value=true&?display_value=true&query=group_name&sys_id={issue}'
    response = requests.get(q, headers=headers, auth=auth, proxies=proxies)
    data = response.json()
    raw_data = [api_response(k) for k in data['result']]
    df = pd.DataFrame.from_records(raw_data)
    dfs.append(df)

df = pd.concat(dfs, ignore_index=True)

Store result from URL into Pandas Data frame

I am new to Pandas and Python. I have a requirement where I am passing 100 postcodes to a URL in a for loop and trying to extract the latitude and longitude for each postcode passed. I need to save the result in a data frame. Below is the code I am using.
query_cust = "select custMasterID,Full_Name,POSTCODE from DMON.BANK_CUSTOMERS"
df_cust = pd.read_sql(query_cust, con=con_str)
df_cust["URL"] = "https://api.getthedata.com/postcode/" + df_cust['POSTCODE'].str.replace(" ", "")
for column in df_cust["URL"]:
    # print(column)
    response = requests.get(column)
    response_text = response.text
    # df = json.loads(response_text)['data']
    parse_json = json.loads(response_text)
    df_cust["Lat"] = pd.json_normalize(parse_json['data']['latitude'])
    df_cust["Long"] = parse_json['data']['longitude']
print(df_cust)
Below is the error that comes up when I try running it:
df_cust["Lat"] = pd.json_normalize(parse_json['data']['latitude'])
in _json_normalize
raise NotImplementedError
NotImplementedError
You don't need to use json_normalize to get what you need from the response data. Just iterate through each row of the dataframe and update the values:
import pandas as pd
import json
import requests

pd.options.display.max_columns = None
pd.options.display.max_rows = None

df_cust = pd.DataFrame(columns=['POSTCODE'])

# Just appending some data
df_cust = df_cust.append({'POSTCODE': 'SW1A-1AA'}, ignore_index=True)
df_cust = df_cust.append({'POSTCODE': 'WC2B-4AB'}, ignore_index=True)
df_cust = df_cust.append({'POSTCODE': 'ASDF-QWE'}, ignore_index=True)  # Wrong postal code

for i, row in df_cust.iterrows():
    df_cust.at[i, 'URL'] = 'https://api.getthedata.com/postcode/' + row['POSTCODE'].replace('-', '')
    response = requests.get(df_cust.loc[i, 'URL'])
    parse_json = json.loads(response.text)
    if 'data' in parse_json:
        if 'latitude' in parse_json['data']:
            df_cust.at[i, 'LAT'] = parse_json['data']['latitude']
        else:
            df_cust.at[i, 'LAT'] = None
        if 'longitude' in parse_json['data']:
            df_cust.at[i, 'LON'] = parse_json['data']['longitude']
        else:
            df_cust.at[i, 'LON'] = None
    else:
        df_cust.at[i, 'LAT'] = None
        df_cust.at[i, 'LON'] = None

print(df_cust)
Output:
   POSTCODE                                          URL        LAT       LON
0  SW1A-1AA  https://api.getthedata.com/postcode/SW1A1AA  51.501009 -0.141588
1  WC2B-4AB  https://api.getthedata.com/postcode/WC2B4AB  51.514206 -0.119893
2  ASDF-QWE  https://api.getthedata.com/postcode/ASDFQWE       None      None
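For reference, the NotImplementedError in the original code comes from passing a scalar (the latitude value) to pd.json_normalize, which only accepts a dict or a list of dicts. If you do want to use json_normalize, a minimal sketch (assuming the same getthedata.com response shape as above) would normalize the whole 'data' object instead:

# Sketch only: json_normalize wants a dict or list of dicts, not a scalar field.
import requests
import pandas as pd

response = requests.get('https://api.getthedata.com/postcode/SW1A1AA')
parse_json = response.json()

flat = pd.json_normalize(parse_json['data'])   # one-row DataFrame with all returned fields
print(flat[['latitude', 'longitude']])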

How to filter an API Search result in Python?

I am using the Edamam recipe API and have been trying to filter the response by only saving recipes whose number of calories is below the max input by the user. I keep getting an error. This is the code:
import requests
import pandas as pd

def recipe_search(ingredient):
    app_id = ''
    app_key = ''
    result = requests.get('https://api.edamam.com/search?q={}&app_id={}&app_key={}'.format(ingredient, app_id, app_key))
    data = result.json()
    return data['hits']

def run():
    ingredient = input('Enter an ingredient: ')
    max_no_of_calories = float(input('Enter the max amount of calories desired in recipe: '))
    data_label = []
    data_uri = []
    data_calories = []
    results = recipe_search(ingredient)
    for result in results:
        recipe = result['recipe']
        result['calories'] < max_no_of_calories
        data_label.append(recipe['label'])
        data_uri.append(recipe['uri'])
        data_calories.append(recipe['calories'])
    data = {'Label': data_label,
            'URL': data_uri,
            'No of Calories': data_calories
            }
    df = pd.DataFrame(data, columns=['Label', 'URL'])
    df.to_csv(r'C:\Users\name\Documents/cfg-python/export_dataframe.csv',
              index=False, header=True)

run()

df2 = pd.read_csv(r'C:\Users\name\Documents/cfg-python/export_dataframe.csv')
sorted_df = df2.sort_values(by=["calories"], ascending=True)
sorted_df.to_csv(r'C:\Users\name\Documents/cfg-python/export_dataframe.csv', index=False)
This is the error:
if result['calories'] < max_no_of_calories:
KeyError: 'calories'
Is anyone able to help? How could I rewrite this code so that it only keeps recipes with fewer calories than max_no_of_calories, which is input by the user?
You made a typo in your for loop:
for result in results:
    recipe = result['recipe']
    if recipe["calories"] < max_no_of_calories:
        print(recipe["calories"])
This will get rid of your KeyError.
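Building on that fix, a fuller sketch (assuming the same Edamam response shape and the recipe_search() function from the question, with a valid app_id and app_key) applies the filter while collecting rows and keeps the calories column in the CSV, so the later sort has a column to sort on:

# Sketch only: reuses recipe_search() from the question above.
import pandas as pd

def run():
    ingredient = input('Enter an ingredient: ')
    max_no_of_calories = float(input('Enter the max amount of calories desired in recipe: '))

    rows = []
    for result in recipe_search(ingredient):
        recipe = result['recipe']
        if recipe['calories'] < max_no_of_calories:   # keep only recipes under the limit
            rows.append({'Label': recipe['label'],
                         'URL': recipe['uri'],
                         'No of Calories': recipe['calories']})

    df = pd.DataFrame(rows, columns=['Label', 'URL', 'No of Calories'])
    df = df.sort_values(by='No of Calories', ascending=True)
    df.to_csv('export_dataframe.csv', index=False)

run()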

Add every scraped item to csv row pandas

I have a Selenium project that scrapes a website and loops to get inner class text. I want to save every scraped text from this loop to a new CSV row in a file located next to the .py file, and accept new columns if added in the future.
How do I do that?
This is what I tried:
prodTitle = driver.find_elements_by_xpath("//*[contains(@class,'itemTitle')]")
for pTitle in prodTitle:
    itemName = pTitle
    pd = pd.dataframe(pTitle.text)
    pd.to_csv('data.csv', pd)
    print(pTitle.text)
but it adds the last item only.
You can add the data in the same loop and then save the whole dataframe, like this:
prodTitle = driver.find_elements_by_xpath("//*[contains(@class,'itemTitle')]")

df = pd.DataFrame(columns=['Title'])

for (idx, pTitle) in enumerate(prodTitle):
    itemName = pTitle
    df.loc[idx, 'Title'] = pTitle.text
    print(pTitle.text)

df.to_csv('data.csv')
EDIT: to add more data it is convenient to set the columns before the loop, like this:
cols = ['Title', 'Col_0', 'Col_1', 'Col_N']
df = pd.DataFrame(columns=cols)
and then inside the loop:
...
df.loc[idx, 'Title'] = title
df.loc[idx, 'Col_0'] = data_0
df.loc[idx, 'Col_1'] = data_1
df.loc[idx, 'Col_N'] = data_N
...
EDIT (because I found another way):
You can create a list with all the data and then pass it to a DataFrame:
prodTitle = driver.find_elements_by_xpath("//*[contains(@class,'itemTitle')]")

data = []
for pTitle in prodTitle:
    itemName = pTitle
    data.append([pTitle.text, pTitle.data_0, pTitle.data_1, ...])

columns = ['Title', 'Col_0', 'Col_1', ...]
df = pd.DataFrame(data=data, columns=columns)
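As a side note, Selenium 4 deprecates and later removes the find_elements_by_xpath helpers in favour of find_elements with a By locator. A minimal sketch of the same list-then-DataFrame approach on that newer API (the URL below is a placeholder; the 'itemTitle' class is taken from the question) looks like this:

# Sketch for Selenium 4+: find_elements_by_* is replaced by find_elements(By.XPATH, ...).
from selenium import webdriver
from selenium.webdriver.common.by import By
import pandas as pd

driver = webdriver.Chrome()
driver.get('https://example.com')  # placeholder URL

prodTitle = driver.find_elements(By.XPATH, "//*[contains(@class,'itemTitle')]")
data = [[el.text] for el in prodTitle]   # one row per scraped title

df = pd.DataFrame(data, columns=['Title'])
df.to_csv('data.csv', index=False)
driver.quit()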

Python 3.7 KeyError

I'd like to retrieve information from NewsApi and ran into an issue. Enclosed is the code:
from NewsApi import NewsApi
import pandas as pd
import os
import datetime as dt
from datetime import date

def CreateDF(JsonArray, columns):
    dfData = pd.DataFrame()
    for item in JsonArray:
        itemStruct = {}
        for cunColumn in columns:
            itemStruct[cunColumn] = item[cunColumn]
        # dfData = dfData.append(itemStruct, ignore_index=True)
        # dfData = dfData.append({'id': item['id'], 'name': item['name'], 'description': item['description']},
        #                        ignore_index=True)
        # return dfData
    return itemStruct

def main():
    # access_token_NewsAPI.txt must contain your personal access token
    with open("access_token_NewsAPI.txt", "r") as f:
        myKey = f.read()[:-1]
    # myKey = 'a847cee6cc254d8495632f83d5c77d39'
    api = NewsApi(myKey)

    # get sources of news
    # columns = ['id', 'name', 'description']
    # rst_source = api.GetSources()
    # df = CreateDF(rst_source['sources'], columns)
    # df.to_csv('source_list.csv')
    #
    # # get news for specific country
    # rst_country = api.GetHeadlines()
    # columns = ['author', 'publishedAt', 'title', 'description', 'content', 'url']
    # df = CreateDF(rst_country['articles'], columns)
    # df.to_csv('Headlines_country.csv')

    # get news for specific symbol
    symbol = "coronavirus"
    sources = 'bbc.co.uk'
    columns = ['author', 'publishedAt', 'title', 'description', 'content', 'source']
    limit = 500  # maximum requests per day
    i = 1
    startDate = dt.datetime(2020, 3, 1, 8)
    # startDate = dt.datetime(2020, 3, 1)
    df = pd.DataFrame({'author': [], 'publishedAt': [], 'title': [], 'description': [], 'content': [], 'source': []})

    while i < limit:
        endDate = startDate + dt.timedelta(hours=2)
        rst_symbol = api.GetEverything(symbol, 'en', startDate, endDate, sources)
        rst = CreateDF(rst_symbol['articles'], columns)
        df = df.append(rst, ignore_index=True)
        # DF.join(df.set_index('publishedAt'), on='publishedAt')
        startDate = endDate
        i += 1

    df.to_csv('Headlines_symbol.csv')

main()
I got the following error:
rst = CreateDF(rst_symbol['articles'], columns)
KeyError: 'articles'
In this line:
rst = CreateDF(rst_symbol['articles'], columns)
I think there is some problem regarding the key not being found or defined - does anyone have an idea how to fix that? I'm thankful for every hint!
MAiniak
EDIT:
I found the solution after I tried a few of your hints. Apparently, the error occurred whenever the NewsAPI key ran into its request limit. This happened every time until I changed limit = 500 to limit = 20. For some reason, there is no error with a new API key and the reduced limit.
Thanks for your help guys!
Probably 'articles' is not one of the keys in your rst_symbol object.
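A quick way to confirm this is to check the response before building the DataFrame. Below is a sketch only, assuming rst_symbol is the parsed JSON dict returned by the API call; NewsAPI responses carry a 'status' field, and on errors (for example when the rate limit is exceeded) there is no 'articles' key, which would explain the KeyError:

# Sketch only: assumes rst_symbol is the parsed JSON response from api.GetEverything().
rst_symbol = api.GetEverything(symbol, 'en', startDate, endDate, sources)

if rst_symbol.get('status') == 'ok' and 'articles' in rst_symbol:
    rst = CreateDF(rst_symbol['articles'], columns)
    df = df.append(rst, ignore_index=True)
else:
    print('Unexpected response:', rst_symbol)  # inspect the error message instead of crashing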
The Python documentation [2] [3] doesn't mention any method named NewsApi() or GetEverything(), but rather NewsApiClient() and get_everything(), i.e.:
from newsapi import NewsApiClient

# Init
newsapi = NewsApiClient(api_key='xxx')

# /v2/top-headlines
top_headlines = newsapi.get_top_headlines(q='bitcoin',
                                          sources='bbc-news,the-verge',
                                          category='business',
                                          language='en',
                                          country='us')

# /v2/everything
all_articles = newsapi.get_everything(q='bitcoin',
                                      sources='bbc-news,the-verge',
                                      domains='bbc.co.uk,techcrunch.com',
                                      from_param='2017-12-01',
                                      to='2017-12-12',
                                      language='en',
                                      sort_by='relevancy',
                                      page=2)

# /v2/sources
sources = newsapi.get_sources()
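To get from that response to the CSV the question is after, a minimal sketch (assuming the same column names used in the question and the all_articles result from the call above) builds the DataFrame directly from the returned 'articles' list:

# Sketch only: assumes all_articles comes from the get_everything() call above.
import pandas as pd

columns = ['author', 'publishedAt', 'title', 'description', 'content', 'source']

if all_articles.get('status') == 'ok':
    df = pd.DataFrame(all_articles['articles'], columns=columns)
    df.to_csv('Headlines_symbol.csv', index=False)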
