Why does Beautiful Soup not find a value when it exists - Python
I'm trying to scrape data from Yahoo Finance with Beautiful Soup. One field is a span tag with an attribute "data-reactid"="42", representing the previous close value of the stock. If I run the following commands, the find returns None. Why is that?
Code below:
from bs4 import BeautifulSoup
from urllib.request import urlopen
url = 'http://finance.yahoo.com/q/op?s=spy+Options'
page = urlopen(url)
soup = BeautifulSoup(page, "html.parser")
soup.find("span", attrs={"data-reactid":"42"})
Try:
soup.find_all("span", attrs={"data-reactid": "42"})
See the attrs documentation for more examples.
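For context, find() with attrs behaves as expected when the attribute really is present in the HTML you downloaded. A minimal, self-contained sketch (hypothetical markup, not the Yahoo page):

```python
from bs4 import BeautifulSoup

# Hypothetical snippet standing in for a downloaded page.
html = '<div><span data-reactid="42">295.86</span></div>'
soup = BeautifulSoup(html, "html.parser")

# The attribute exists in this markup, so find() returns the tag.
tag = soup.find("span", attrs={"data-reactid": "42"})
print(tag.text)   # -> 295.86

# A lookup for an attribute value that is absent returns None.
print(soup.find("span", attrs={"data-reactid": "99"}))   # -> None
```

So a None result means the attribute was not present in the HTML the server actually sent, not that find() is broken.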
EDIT:
Since the page is rendered using ReactJS, the data you are trying to access is not available when you make the request; this is why you always get None.
I suggest using something like yfinance.
See this for more information.
When you open your browser and point it to http://finance.yahoo.com/q/op?s=spy+Options, a few XHR calls are issued by the browser. One of those calls returns a data structure with the field 'previousClose'. It may be the field you are looking for. See the code below.
import requests
import pprint

r = requests.get('https://query1.finance.yahoo.com/v7/finance/spark?symbols=%5EGSPC&range=1d&interval=5m&indicators=close&includeTimestamps=false&includePrePost=false&corsDomain=finance.yahoo.com&.tsrc=finance')
if r.status_code == 200:
    pprint.pprint(r.json())
Output:
{'spark': {'error': None,
'result': [{'response': [{'indicators': {'quote': [{'close': [2982.4,
2981.64,
2982.96,
2982.85,
2978.84,
2977.04,
2974.02,
2974.34,
2973.85,
2974.75,
2975.68,
2978.29,
2977.26,
2978.91,
2980.48,
2983.23,
2982.07,
2984.48,
2984.14,
2984.07,
2984.66,
2981.88,
2983.19,
2983.86,
2983.79,
2967.47,
2968.61,
2971.59,
2970.77,
2975.5,
2971.69,
2972.09,
2973.98,
2968.89,
2969.1,
2970.09,
2968.25,
2969.2,
2966.84,
2963.03,
2962.99,
2958.07,
2959.89,
2963.97,
2962.86,
2960.46,
2958.78,
2961.0,
2959.69,
2959.73,
2961.58,
2958.23,
2959.21,
2960.67,
2958.38,
2955.76,
2956.29,
2955.62,
2954.33,
2954.44,
2952.78,
2951.81,
2951.3,
2948.71,
2946.59,
2948.26,
2950.32,
2948.2,
2948.35,
2953.54,
2955.45,
2952.13,
2955.97,
2956.97,
2957.38,
2958.28,
2961.44,
2962.13]}]},
'meta': {'chartPreviousClose': 2977.62,
'currency': 'USD',
'currentTradingPeriod': {'post': {'end': 1569628800,
'gmtoffset': -14400,
'start': 1569614400,
'timezone': 'EDT'},
'pre': {'end': 1569591000,
'gmtoffset': -14400,
'start': 1569571200,
'timezone': 'EDT'},
'regular': {'end': 1569614400,
'gmtoffset': -14400,
'start': 1569591000,
'timezone': 'EDT'}},
'dataGranularity': '5m',
'exchangeName': 'SNP',
'exchangeTimezoneName': 'America/New_York',
'firstTradeDate': -1325602800,
'gmtoffset': -14400,
'instrumentType': 'INDEX',
'previousClose': 2977.62,
'priceHint': 2,
'range': '1d',
'regularMarketPrice': 2961.79,
'regularMarketTime': 1569618019,
'scale': 3,
'symbol': '^GSPC',
'timezone': 'EDT',
'tradingPeriods': [[{'end': 1569614400,
'gmtoffset': -14400,
'start': 1569591000,
'timezone': 'EDT'}]],
'validRanges': ['1d',
'5d',
'1mo',
'3mo',
'6mo',
'1y',
'2y',
'5y',
'10y',
'ytd',
'max']},
'timestamp': [1569591000,
1569591300,
1569591600,
1569591900,
1569592200,
1569592500,
1569592800,
1569593100,
1569593400,
1569593700,
1569594000,
1569594300,
1569594600,
1569594900,
1569595200,
1569595500,
1569595800,
1569596100,
1569596400,
1569596700,
1569597000,
1569597300,
1569597600,
1569597900,
1569598200,
1569598500,
1569598800,
1569599100,
1569599400,
1569599700,
1569600000,
1569600300,
1569600600,
1569600900,
1569601200,
1569601500,
1569601800,
1569602100,
1569602400,
1569602700,
1569603000,
1569603300,
1569603600,
1569603900,
1569604200,
1569604500,
1569604800,
1569605100,
1569605400,
1569605700,
1569606000,
1569606300,
1569606600,
1569606900,
1569607200,
1569607500,
1569607800,
1569608100,
1569608400,
1569608700,
1569609000,
1569609300,
1569609600,
1569609900,
1569610200,
1569610500,
1569610800,
1569611100,
1569611400,
1569611700,
1569612000,
1569612300,
1569612600,
1569612900,
1569613200,
1569613500,
1569613800,
1569614100]}],
'symbol': '^GSPC'}]}}
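If all you need from that response is previousClose, you can index into the nested structure. A sketch against a trimmed-down copy of the payload (same nesting as the real response shown above):

```python
# Trimmed-down stand-in for the spark response printed above.
data = {'spark': {'error': None,
                  'result': [{'response': [{'meta': {'previousClose': 2977.62,
                                                     'symbol': '^GSPC'}}],
                              'symbol': '^GSPC'}]}}

# Walk spark -> result[0] -> response[0] -> meta -> previousClose.
meta = data['spark']['result'][0]['response'][0]['meta']
print(meta['previousClose'])   # -> 2977.62
```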
Related
Python iterate through JSON - dict object is not callable
I want to parse flight data from the Aviationstack API. For this example, I format the URL to get data for flights departing from Marseille airport (France) with the parameter dep_icao=LFML (cf. documentation):

# get data for flights from Marseille airport
url = 'http://api.aviationstack.com/v1/flights?access_key=MYAPIKEY&dep_icao=LFML'
req = requests.get(url)
response = req.json()

This is the response (shortened):

{'pagination': {'limit': 100, 'offset': 0, 'count': 100, 'total': 377},
 'data': [{'flight_date': '2022-12-24',
           'flight_status': 'active',
           'departure': {'airport': 'Marseille Provence Airport', 'timezone': 'Europe/Paris', 'iata': 'MRS', 'icao': 'LFML', 'terminal': '1A', 'gate': None, 'delay': 10, 'scheduled': '2022-12-24T09:00:00+00:00', 'estimated': '2022-12-24T09:00:00+00:00', 'actual': '2022-12-24T09:09:00+00:00', 'estimated_runway': '2022-12-24T09:09:00+00:00', 'actual_runway': '2022-12-24T09:09:00+00:00'},
           'arrival': {'airport': 'El Prat De Llobregat', 'timezone': 'Europe/Madrid', 'iata': 'BCN', 'icao': 'LEBL', 'terminal': '1', 'gate': None, 'baggage': '06', 'delay': None, 'scheduled': '2022-12-24T10:05:00+00:00', 'estimated': '2022-12-24T10:05:00+00:00', 'actual': None, 'estimated_runway': None, 'actual_runway': None},
           'airline': {'name': 'Qatar Airways', 'iata': 'QR', 'icao': 'QTR'},
           'flight': {'number': '3721', 'iata': 'QR3721', 'icao': 'QTR3721', 'codeshared': {'airline_name': 'vueling', 'airline_iata': 'vy', 'airline_icao': 'vlg', 'flight_number': '1509', 'flight_iata': 'vy1509', 'flight_icao': 'vlg1509'}},
           'aircraft': None,
           'live': None},
          [...] ]}

Then I want to iterate through this JSON response. I can get the pagination info fine, but I'm only able to get info from the first element in data (I will loop through all results later). My problem is that I only want the iata item nested in flight. The way I'm trying to get it returns a 'dict' object is not callable error.
# iterate through JSON response
pagination_data = response.get("pagination")
flight_data = response.get("data")[0]
total_results = pagination_data.get("total")
flight = flight_data.get('flight')('iata')
pprint(f"total results : {total_results}")
pprint(f"flight : {flight}")
The get method returns a dictionary, which is not a function and therefore cannot be called. It should be:
flight = flight_data.get('flight').get('iata')
Alternatives (but without the option of returning None on a missing key):
flight = flight_data.get('flight')['iata']
flight = flight_data['flight']['iata']
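A runnable sketch of the difference, using a hypothetical flight_data dict shaped like the response above:

```python
flight_data = {'flight': {'number': '3721', 'iata': 'QR3721'}}

# Chained .get(): the second .get() is a method of the inner dict.
print(flight_data.get('flight').get('iata'))       # -> QR3721

# flight_data.get('flight')('iata') would instead try to *call* the
# inner dict, raising TypeError: 'dict' object is not callable.

# Defensive variant: a default {} avoids AttributeError if 'flight'
# is missing; the result is then None instead of a crash.
print({}.get('flight', {}).get('iata'))            # -> None
```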
How to scrape li tags under a ul?
I want to fetch all the li tags under the ul tag with id="demoFour" from https://www.parliament.lk/en/members-of-parliament/directory-of-members/?cletter=A. Below is the code:

print(soup.find('ul', id='demoFour'))

But the output which is displayed is:

<ul id="demoFour"></ul>
Content is served dynamically based on the data of an additional XHR request, so you have to call that instead. You can inspect this by taking a look at the XHR tab in your browser's dev tools.

Example

Instead of appending only the obvious fields to a list of dicts, you could also iterate over all detail pages while requesting them.

import requests, string

data = []
for letter in string.ascii_uppercase:
    result = requests.post(f'https://www.parliament.lk/members-of-parliament/directory-of-members/index2.php?option=com_members&task=all&tmpl=component&letter={letter}&wordfilter=&search_district=')
    for e in result.json():
        #result = requests.get(f"https://www.parliament.lk/en/members-of-parliament/directory-of-members/viewMember/{e['mem_intranet_id']}")
        data.append({
            'url': f"https://www.parliament.lk/en/members-of-parliament/directory-of-members/viewMember/{e['mem_intranet_id']}",
            'id': e['mem_intranet_id'],
            'name': e['member_sname_eng']
        })
data

Output

[{'url': 'https://www.parliament.lk/en/members-of-parliament/directory-of-members/viewMember/3266', 'id': '3266', 'name': 'A. Aravindh Kumar'},
 {'url': 'https://www.parliament.lk/en/members-of-parliament/directory-of-members/viewMember/50', 'id': '50', 'name': 'Abdul Haleem'},
 {'url': 'https://www.parliament.lk/en/members-of-parliament/directory-of-members/viewMember/3325', 'id': '3325', 'name': 'Ajith Rajapakse'},
 {'url': 'https://www.parliament.lk/en/members-of-parliament/directory-of-members/viewMember/3296', 'id': '3296', 'name': 'Akila Ellawala'},
 {'url': 'https://www.parliament.lk/en/members-of-parliament/directory-of-members/viewMember/3355', 'id': '3355', 'name': 'Ali Sabri Raheem'}, ...]
I am trying to webscrape from Zomato, however it returns an output of "None" and an AttributeError
Whenever I try to extract the data, it returns an output of "None". I am not sure if it is the code (I followed the rules of using bs4) or if the website is just different to scrape.

My code:

import requests
import bs4 as bs

url = 'https://www.zomato.com/jakarta/pondok-indah-restaurants'
req = requests.get(url, headers={'User-Agent': 'Mozilla/5.0'})
html = req.text
soup = bs.BeautifulSoup(html, "html.parser")

listings = soup.find('div', class_='sc-gAmQfK fKxEbD')
rest_name = listings.find('h4', class_='sc-1hp8d8a-0 sc-eTyWNx gKsZcT').text
## Output: AttributeError: 'NoneType' object has no attribute 'find'

print(listings)
## returns None

Here is the inspected tag of the website in which I try to get the h4 class showing the restaurant's name.
What happens?

Classes are generated dynamically and may differ from your inspection via developer tools, so you won't find what you are looking for.

How to fix?

It would be a better approach to select your targets via tag or id if available, because these are more static than CSS classes.

listings = soup.select('a:has(h4)')

Example

Iterating over the listings and scraping several pieces of information:

import requests
import bs4 as bs

url = 'https://www.zomato.com/jakarta/pondok-indah-restaurants'
req = requests.get(url, headers={'User-Agent': 'Mozilla/5.0'})
html = req.text
soup = bs.BeautifulSoup(html, "html.parser")

data = []
for item in soup.select('a:has(h4)'):
    data.append({
        'title': item.h4.text,
        'url': item['href'],
        'etc': '...'
    })
print(data)

Output

[{'title': 'Radio Dalam Diner', 'url': '/jakarta/radio-dalam-diner-pondok-indah/info', 'etc': '...'}, {'title': 'Aneka Bubur 786', 'url': '/jakarta/aneka-bubur-786-pondok-indah/info', 'etc': '...'}, {'title': "McDonald's", 'url': '/jakarta/mcdonalds-pondok-indah/info', 'etc': '...'}, {'title': 'KOPIKOBOY', 'url': '/jakarta/kopikoboy-pondok-indah/info', 'etc': '...'}, {'title': 'Kopitelu', 'url': '/jakarta/kopitelu-pondok-indah/info', 'etc': '...'}, {'title': 'KFC', 'url': '/jakarta/kfc-pondok-indah/info', 'etc': '...'}, {'title': 'HokBen Delivery', 'url': '/jakarta/hokben-delivery-pondok-indah/info', 'etc': '...'}, {'title': 'PHD', 'url': '/jakarta/phd-pondok-indah/info', 'etc': '...'}, {'title': 'Casa De Jose', 'url': '/jakarta/casa-de-jose-pondok-indah/info', 'etc': '...'}]
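As an offline illustration of the a:has(h4) selector, here is a sketch against static markup standing in for the Zomato listing (hypothetical HTML; requires bs4 with soupsieve for :has() support):

```python
import bs4 as bs

# Static stand-in for the listing page; generated class names omitted.
html = '''
<a href="/jakarta/radio-dalam-diner-pondok-indah/info"><h4>Radio Dalam Diner</h4></a>
<a href="/jakarta/some-banner"><span>no h4 inside</span></a>
'''
soup = bs.BeautifulSoup(html, "html.parser")

# Only anchors that contain an <h4> child match the selector.
for item in soup.select('a:has(h4)'):
    print(item.h4.text, item['href'])
```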
Sending python requests and handling JSON lists
I am sending requests to a crypto network for data on accounts. You get information back, but I haven't encountered lists being sent in JSON until now. I want to parse certain information, but am having trouble because the JSON is a list and is not as easy to parse as normal JSON data.

import requests
import json

url = 'https://s1.ripple.com:51234/'
payload = {
    "method": "account_objects",
    "params": [
        {
            "account": "r9cZA1mLK5R5Am25ArfXFmqgNwjZgnfk59",
            "ledger_index": "validated",
            "type": "state",
            "deletion_blockers_only": False,
            "limit": 10
        }
    ]
}

response = requests.post(url, data=json.dumps(payload))
print(response.text)

data = response.text
parsed = json.loads(data)
price = parsed['result']
price = price['account_objects']
for Balance in price:
    print(Balance)

You will receive all the tokens the account holds and their values. I can not figure out how to parse this correctly and receive the correct one. This particular test account has a lot of tokens, so I will only show the first token's info.

RESULT

{'Balance': {'currency': 'ASP', 'issuer': 'rrrrrrrrrrrrrrrrrrrrBZbvji', 'value': '0'}, 'Flags': 65536, 'HighLimit': {'currency': 'ASP', 'issuer': 'r9cZA1mLK5R5Am25ArfXFmqgNwjZgnfk59', 'value': '0'}, 'HighNode': '0', 'LedgerEntryType': 'RippleState', 'LowLimit': {'currency': 'ASP', 'issuer': 'r3vi7mWxru9rJCxETCyA1CHvzL96eZWx5z', 'value': '10'}, 'LowNode': '0', 'PreviousTxnID': 'BF7555B0F018E3C5E2A3FF9437A1A5092F32903BE246202F988181B9CED0D862', 'PreviousTxnLgrSeq': 1438879, 'index': '2243B0B630EA6F7330B654EFA53E27A7609D9484E535AB11B7F946DF3D247CE9'}

I want to get the first bit of info, here:

{'Balance': {'currency': 'ASP', 'issuer': 'rrrrrrrrrrrrrrrrrrrrBZbvji', 'value': '0'}}

Specifically 'value' and the number. I have tried to parse 'Balance', but since it is inside a list it is not as straightforward.
You're mixing up lists and dictionaries. In order to access a dictionary by key, you index with the key, as such:

for Balance in price:
    print(Balance['Balance'])

Yields the following results:

{'currency': 'CHF', 'issuer': 'rrrrrrrrrrrrrrrrrrrrBZbvji', 'value': '-0.3488146605801446'}
{'currency': 'BTC', 'issuer': 'rrrrrrrrrrrrrrrrrrrrBZbvji', 'value': '0'}
{'currency': 'USD', 'issuer': 'rrrrrrrrrrrrrrrrrrrrBZbvji', 'value': '-11.68225001668339'}

If you only want to extract the value, you simply dive one level deeper:

for Balance in price:
    print(Balance['Balance']['value'])

Which yields:

-0.3488146605801446
0
-11.68225001668339
I assume that under price['account_objects'] you have a list of dictionaries, and that each dictionary has a key like 'Balance': {'currency': 'ASP', 'issuer': 'rrrrrrrrrrrrrrrrrrrrBZbvji', 'value': '0'}. If so, why don't you iterate over the list and then access each dictionary:

account_objects = price['account_objects']
for account_object in account_objects:
    print(account_object['Balance'])
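Both answers can be checked offline. A sketch with a hypothetical two-token account_objects list shaped like the RESULT above:

```python
# Hypothetical stand-in for parsed['result']['account_objects'].
account_objects = [
    {'Balance': {'currency': 'ASP',
                 'issuer': 'rrrrrrrrrrrrrrrrrrrrBZbvji',
                 'value': '0'},
     'LedgerEntryType': 'RippleState'},
    {'Balance': {'currency': 'BTC',
                 'issuer': 'rrrrrrrrrrrrrrrrrrrrBZbvji',
                 'value': '0.5'},
     'LedgerEntryType': 'RippleState'},
]

# Each element is a dict; 'Balance' is itself a dict, so index twice.
for obj in account_objects:
    print(obj['Balance']['currency'], obj['Balance']['value'])
```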
Python 3 requests module and urllib.request module both retrieve incomplete JSON
I'm doing some scraping and looking at pages like this one (https://www.endomondo.com/rest/v1/users/20261627/workouts/526622897), but I have not been able to fully retrieve the JSON content. I have tried both of the following sets of code, but each returns an incomplete JSON object:

url = 'https://www.endomondo.com/rest/v1/users/%s/workouts/%s' % (string_use_user, string_use_workout)
print(url)
response = urlopen(url)
reader = codecs.getreader("utf-8")
print(reader(response))
jsonresponse = json.load(reader(response))
print(jsonresponse)

and similarly, using the requests library instead of urllib also fails to retrieve the full JSON:

url = 'https://www.endomondo.com/rest/v1/users/%s/workouts/%s' % (string_use_user, string_use_workout)
print("using this url %s" % url)
r = requests.get(url)
print(r.json())
jsonresponse = r.json()

In both cases I get about 1/4 of the JSON. For example, in this case (https://www.endomondo.com/rest/v1/users/20261627/workouts/526622897) I received:

{'feed_id': 281475471235835, 'id': 526622897, 'duration': 4082.0, 'local_start_time': '2015-05-21T09:30:45.000+02:00', 'calories': 1073.0, 'tagged_users': [], 'altitude_max': 69.9523, 'sport': 0, 'distance': 11.115419387817383, 'altitude_min': 14.9908, 'include_in_stats': True, 'hydration': 0.545339, 'start_time': '2015-05-21T07:30:45.000Z', 'ascent': 137.162, 'is_live': False, 'pb_count': 2, 'playlist': [], 'is_peptalk_allowed': False, 'weather': {'wind_speed': 11, 'temperature': 12, 'wind_direction': 13, 'type': 3, 'humidity': 81}, 'speed_max': 24.8596, 'author': {'name': 'gfdgfd', 'id': 20261627, 'last_name': 'gdsgsk', 'gender': 0, 'expand': 'abs', 'picture': {'url': 'https://www.endomondo.com/resources/gfx/picture/18511427/thumbnail.jpg'}, 'first_name': 'gdsgds', 'viewer_friendship': 1, 'is_premium': False}, 'sharing': [{'share_time': '2015-05-21T08:45:19.000Z', 'type': 0, 'share_id': 1635690786663532}], 'show_map': 0, 'pictures': [], 'hashtags': [], 'descent': 150.621, 'speed_avg': 9.80291763746756, 'expand': 'full', 'show_workout': 0, 'points': {'expand': 'ref', 'id': 2199549878449}}

I am not receiving the long arrays within the data. I am also not even recovering all of the non-array data. I ran the original page through a JSON validator, and it's fine. Similarly, I ran the JSON I do receive through a validator, and it's also fine - it doesn't show any signs of missing things unless I compare it with the original. I would appreciate any advice about how to troubleshoot this. Thanks.
Looks like this API is doing some User-Agent sniffing and only sending the complete content to what it considers to be actual web browsers. Once you set a User-Agent header with the UA string of a common browser, you get the full response:

>>> UA = 'Mozilla/5.0 (X11; Linux x86_64; rv:36.0) Gecko/20100101 Firefox/36.0'
>>> url = 'https://www.endomondo.com/rest/v1/users/20261627/workouts/526622897'
>>> r = requests.get(url, headers={'User-Agent': UA})
>>>
>>> print(len(r.content))
96412

See the requests docs for more details on setting custom headers.
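If you are on the urllib side instead of requests, the header goes on the Request object. A sketch that only builds the request (no network call) to show where the header lives; note urllib normalizes the stored header name to 'User-agent':

```python
import urllib.request

UA = 'Mozilla/5.0 (X11; Linux x86_64; rv:36.0) Gecko/20100101 Firefox/36.0'
url = 'https://www.endomondo.com/rest/v1/users/20261627/workouts/526622897'

# Attach the header at construction time; urlopen(req) would send it.
req = urllib.request.Request(url, headers={'User-Agent': UA})
print(req.get_header('User-agent'))   # the UA string set above
```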