Why does Beautiful Soup not find a value when it exists - Python
I'm trying to scrape data from Yahoo Finance with Beautiful Soup. One field is a span tag with an attribute "data-reactid"="42", representing the previous close value of the stock. If I run the following commands, the find returns None. Why is that?
Code below:
from bs4 import BeautifulSoup
from urllib.request import urlopen
url = 'http://finance.yahoo.com/q/op?s=spy+Options'
page = urlopen(url)
soup = BeautifulSoup(page, "html.parser")
soup.find("span", attrs={"data-reactid":"42"})
Try:
soup.find_all("span", attrs={"data-reactid": "42"})
See the attrs documentation for more examples.
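For context, find() with attrs behaves as expected when the attribute really is present in the HTML you downloaded. A minimal, self-contained sketch (hypothetical markup, not the Yahoo page):

```python
from bs4 import BeautifulSoup

# Hypothetical snippet standing in for a downloaded page.
html = '<div><span data-reactid="42">295.86</span></div>'
soup = BeautifulSoup(html, "html.parser")

# The attribute exists in this markup, so find() returns the tag.
tag = soup.find("span", attrs={"data-reactid": "42"})
print(tag.text)   # -> 295.86

# A lookup for an attribute value that is absent returns None.
print(soup.find("span", attrs={"data-reactid": "99"}))   # -> None
```

So a None result means the attribute was not present in the HTML the server actually sent, not that find() is broken.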
EDIT:
Since the page is rendered using ReactJS, the data you are trying to access is not available when you make the request; this is why you always get None.
I suggest using something like yfinance.
See this for more information.
When you open your browser and point it to http://finance.yahoo.com/q/op?s=spy+Options, a few XHR calls are issued by the browser. One of those calls returns a data structure with the field 'previousClose'. It may be the field you are looking for. See the code below.
import requests
import pprint

r = requests.get('https://query1.finance.yahoo.com/v7/finance/spark?symbols=%5EGSPC&range=1d&interval=5m&indicators=close&includeTimestamps=false&includePrePost=false&corsDomain=finance.yahoo.com&.tsrc=finance')
if r.status_code == 200:
    pprint.pprint(r.json())
Output:
{'spark': {'error': None,
'result': [{'response': [{'indicators': {'quote': [{'close': [2982.4,
2981.64,
2982.96,
2982.85,
2978.84,
2977.04,
2974.02,
2974.34,
2973.85,
2974.75,
2975.68,
2978.29,
2977.26,
2978.91,
2980.48,
2983.23,
2982.07,
2984.48,
2984.14,
2984.07,
2984.66,
2981.88,
2983.19,
2983.86,
2983.79,
2967.47,
2968.61,
2971.59,
2970.77,
2975.5,
2971.69,
2972.09,
2973.98,
2968.89,
2969.1,
2970.09,
2968.25,
2969.2,
2966.84,
2963.03,
2962.99,
2958.07,
2959.89,
2963.97,
2962.86,
2960.46,
2958.78,
2961.0,
2959.69,
2959.73,
2961.58,
2958.23,
2959.21,
2960.67,
2958.38,
2955.76,
2956.29,
2955.62,
2954.33,
2954.44,
2952.78,
2951.81,
2951.3,
2948.71,
2946.59,
2948.26,
2950.32,
2948.2,
2948.35,
2953.54,
2955.45,
2952.13,
2955.97,
2956.97,
2957.38,
2958.28,
2961.44,
2962.13]}]},
'meta': {'chartPreviousClose': 2977.62,
'currency': 'USD',
'currentTradingPeriod': {'post': {'end': 1569628800,
'gmtoffset': -14400,
'start': 1569614400,
'timezone': 'EDT'},
'pre': {'end': 1569591000,
'gmtoffset': -14400,
'start': 1569571200,
'timezone': 'EDT'},
'regular': {'end': 1569614400,
'gmtoffset': -14400,
'start': 1569591000,
'timezone': 'EDT'}},
'dataGranularity': '5m',
'exchangeName': 'SNP',
'exchangeTimezoneName': 'America/New_York',
'firstTradeDate': -1325602800,
'gmtoffset': -14400,
'instrumentType': 'INDEX',
'previousClose': 2977.62,
'priceHint': 2,
'range': '1d',
'regularMarketPrice': 2961.79,
'regularMarketTime': 1569618019,
'scale': 3,
'symbol': '^GSPC',
'timezone': 'EDT',
'tradingPeriods': [[{'end': 1569614400,
'gmtoffset': -14400,
'start': 1569591000,
'timezone': 'EDT'}]],
'validRanges': ['1d',
'5d',
'1mo',
'3mo',
'6mo',
'1y',
'2y',
'5y',
'10y',
'ytd',
'max']},
'timestamp': [1569591000,
1569591300,
1569591600,
1569591900,
1569592200,
1569592500,
1569592800,
1569593100,
1569593400,
1569593700,
1569594000,
1569594300,
1569594600,
1569594900,
1569595200,
1569595500,
1569595800,
1569596100,
1569596400,
1569596700,
1569597000,
1569597300,
1569597600,
1569597900,
1569598200,
1569598500,
1569598800,
1569599100,
1569599400,
1569599700,
1569600000,
1569600300,
1569600600,
1569600900,
1569601200,
1569601500,
1569601800,
1569602100,
1569602400,
1569602700,
1569603000,
1569603300,
1569603600,
1569603900,
1569604200,
1569604500,
1569604800,
1569605100,
1569605400,
1569605700,
1569606000,
1569606300,
1569606600,
1569606900,
1569607200,
1569607500,
1569607800,
1569608100,
1569608400,
1569608700,
1569609000,
1569609300,
1569609600,
1569609900,
1569610200,
1569610500,
1569610800,
1569611100,
1569611400,
1569611700,
1569612000,
1569612300,
1569612600,
1569612900,
1569613200,
1569613500,
1569613800,
1569614100]}],
'symbol': '^GSPC'}]}}
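If all you need from that response is previousClose, you can index into the nested structure. A sketch against a trimmed-down copy of the payload (same nesting as the real response shown above):

```python
# Trimmed-down stand-in for the spark response printed above.
data = {'spark': {'error': None,
                  'result': [{'response': [{'meta': {'previousClose': 2977.62,
                                                     'symbol': '^GSPC'}}],
                              'symbol': '^GSPC'}]}}

# Walk spark -> result[0] -> response[0] -> meta -> previousClose.
meta = data['spark']['result'][0]['response'][0]['meta']
print(meta['previousClose'])   # -> 2977.62
```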
Related
Python iterate through JSON - dict object is not callable
I want to parse flight data from the Aviationstack API. For this example, I format the URL to get data for flights departing from Marseille airport (France) with the parameter dep_icao=LFML (cf. documentation):

# get data for flights from Marseille airport
url = 'http://api.aviationstack.com/v1/flights?access_key=MYAPIKEY&dep_icao=LFML'
req = requests.get(url)
response = req.json()

This is the response (shortened):

{'pagination': {'limit': 100, 'offset': 0, 'count': 100, 'total': 377},
 'data': [{'flight_date': '2022-12-24',
           'flight_status': 'active',
           'departure': {'airport': 'Marseille Provence Airport', 'timezone': 'Europe/Paris', 'iata': 'MRS', 'icao': 'LFML', 'terminal': '1A', 'gate': None, 'delay': 10, 'scheduled': '2022-12-24T09:00:00+00:00', 'estimated': '2022-12-24T09:00:00+00:00', 'actual': '2022-12-24T09:09:00+00:00', 'estimated_runway': '2022-12-24T09:09:00+00:00', 'actual_runway': '2022-12-24T09:09:00+00:00'},
           'arrival': {'airport': 'El Prat De Llobregat', 'timezone': 'Europe/Madrid', 'iata': 'BCN', 'icao': 'LEBL', 'terminal': '1', 'gate': None, 'baggage': '06', 'delay': None, 'scheduled': '2022-12-24T10:05:00+00:00', 'estimated': '2022-12-24T10:05:00+00:00', 'actual': None, 'estimated_runway': None, 'actual_runway': None},
           'airline': {'name': 'Qatar Airways', 'iata': 'QR', 'icao': 'QTR'},
           'flight': {'number': '3721', 'iata': 'QR3721', 'icao': 'QTR3721', 'codeshared': {'airline_name': 'vueling', 'airline_iata': 'vy', 'airline_icao': 'vlg', 'flight_number': '1509', 'flight_iata': 'vy1509', 'flight_icao': 'vlg1509'}},
           'aircraft': None,
           'live': None},
          [...] ]}

Then I want to iterate through this JSON response. I can get the pagination info fine, but I'm only able to get info from the first element in data (I will loop through all results later). My problem is that I only want the iata item nested in flight. The way I'm trying to get it returns a 'dict' object is not callable error.
# iterate through JSON response
pagination_data = response.get("pagination")
flight_data = response.get("data")[0]
total_results = pagination_data.get("total")
flight = flight_data.get('flight')('iata')
pprint(f"total results : {total_results}")
pprint(f"flight : {flight}")
The get method returns a dictionary, which is not a function and therefore cannot be called. It should be:
flight = flight_data.get('flight').get('iata')
Alternatives (but without the option of returning None on a missing key):
flight = flight_data.get('flight')['iata']
flight = flight_data['flight']['iata']
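A runnable sketch of the difference, using a hypothetical flight_data dict shaped like the response above:

```python
flight_data = {'flight': {'number': '3721', 'iata': 'QR3721'}}

# Chained .get(): the second .get() is a method of the inner dict.
print(flight_data.get('flight').get('iata'))       # -> QR3721

# flight_data.get('flight')('iata') would instead try to *call* the
# inner dict, raising TypeError: 'dict' object is not callable.

# Defensive variant: a default {} avoids AttributeError if 'flight'
# is missing; the result is then None instead of a crash.
print({}.get('flight', {}).get('iata'))            # -> None
```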
How to scrape li tags under a ul?
I want to fetch all the li tags under the ul tag with id="demoFour" from https://www.parliament.lk/en/members-of-parliament/directory-of-members/?cletter=A. Below is the code:

print(soup.find('ul', id='demoFour'))

But the output which is displayed is:

<ul id="demoFour"></ul>
Content is served dynamically based on the data of an additional XHR request, so you have to call that instead. You can inspect this by taking a look at the XHR tab in your browser's dev tools.

Example

Instead of appending only the obvious fields to a list of dicts, you could also iterate over all detail pages while requesting them.

import requests, string

data = []
for letter in string.ascii_uppercase:
    result = requests.post(f'https://www.parliament.lk/members-of-parliament/directory-of-members/index2.php?option=com_members&task=all&tmpl=component&letter={letter}&wordfilter=&search_district=')
    for e in result.json():
        #result = requests.get(f"https://www.parliament.lk/en/members-of-parliament/directory-of-members/viewMember/{e['mem_intranet_id']}")
        data.append({
            'url': f"https://www.parliament.lk/en/members-of-parliament/directory-of-members/viewMember/{e['mem_intranet_id']}",
            'id': e['mem_intranet_id'],
            'name': e['member_sname_eng']
        })
data

Output

[{'url': 'https://www.parliament.lk/en/members-of-parliament/directory-of-members/viewMember/3266', 'id': '3266', 'name': 'A. Aravindh Kumar'},
 {'url': 'https://www.parliament.lk/en/members-of-parliament/directory-of-members/viewMember/50', 'id': '50', 'name': 'Abdul Haleem'},
 {'url': 'https://www.parliament.lk/en/members-of-parliament/directory-of-members/viewMember/3325', 'id': '3325', 'name': 'Ajith Rajapakse'},
 {'url': 'https://www.parliament.lk/en/members-of-parliament/directory-of-members/viewMember/3296', 'id': '3296', 'name': 'Akila Ellawala'},
 {'url': 'https://www.parliament.lk/en/members-of-parliament/directory-of-members/viewMember/3355', 'id': '3355', 'name': 'Ali Sabri Raheem'}, ...]
I am trying to webscrape from Zomato, however it returns an output of "None" and an AttributeError
Whenever I try to extract the data, it returns an output of "None". I am not sure if it is the code (I followed the rules of using bs4) or if the website is just different to scrape.

My code:

import requests
import bs4 as bs

url = 'https://www.zomato.com/jakarta/pondok-indah-restaurants'
req = requests.get(url, headers={'User-Agent': 'Mozilla/5.0'})
html = req.text
soup = bs.BeautifulSoup(html, "html.parser")

listings = soup.find('div', class_='sc-gAmQfK fKxEbD')
rest_name = listings.find('h4', class_='sc-1hp8d8a-0 sc-eTyWNx gKsZcT').text
## Output: AttributeError: 'NoneType' object has no attribute 'find'

print(listings)
## returns None

Here is the inspected tag of the website in which I try to get the h4 class showing the restaurant's name.
What happens?

Classes are generated dynamically and may differ from your inspection via developer tools, so you won't find what you are looking for.

How to fix?

It would be a better approach to select your targets via tag or id if available, because these are more static than CSS classes.

listings = soup.select('a:has(h4)')

Example

Iterating over the listings and scraping several pieces of information:

import requests
import bs4 as bs

url = 'https://www.zomato.com/jakarta/pondok-indah-restaurants'
req = requests.get(url, headers={'User-Agent': 'Mozilla/5.0'})
html = req.text
soup = bs.BeautifulSoup(html, "html.parser")

data = []
for item in soup.select('a:has(h4)'):
    data.append({
        'title': item.h4.text,
        'url': item['href'],
        'etc': '...'
    })
print(data)

Output

[{'title': 'Radio Dalam Diner', 'url': '/jakarta/radio-dalam-diner-pondok-indah/info', 'etc': '...'}, {'title': 'Aneka Bubur 786', 'url': '/jakarta/aneka-bubur-786-pondok-indah/info', 'etc': '...'}, {'title': "McDonald's", 'url': '/jakarta/mcdonalds-pondok-indah/info', 'etc': '...'}, {'title': 'KOPIKOBOY', 'url': '/jakarta/kopikoboy-pondok-indah/info', 'etc': '...'}, {'title': 'Kopitelu', 'url': '/jakarta/kopitelu-pondok-indah/info', 'etc': '...'}, {'title': 'KFC', 'url': '/jakarta/kfc-pondok-indah/info', 'etc': '...'}, {'title': 'HokBen Delivery', 'url': '/jakarta/hokben-delivery-pondok-indah/info', 'etc': '...'}, {'title': 'PHD', 'url': '/jakarta/phd-pondok-indah/info', 'etc': '...'}, {'title': 'Casa De Jose', 'url': '/jakarta/casa-de-jose-pondok-indah/info', 'etc': '...'}]
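As an offline illustration of the a:has(h4) selector, here is a sketch against static markup standing in for the Zomato listing (hypothetical HTML; requires bs4 with soupsieve for :has() support):

```python
import bs4 as bs

# Static stand-in for the listing page; generated class names omitted.
html = '''
<a href="/jakarta/radio-dalam-diner-pondok-indah/info"><h4>Radio Dalam Diner</h4></a>
<a href="/jakarta/some-banner"><span>no h4 inside</span></a>
'''
soup = bs.BeautifulSoup(html, "html.parser")

# Only anchors that contain an <h4> child match the selector.
for item in soup.select('a:has(h4)'):
    print(item.h4.text, item['href'])
```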
Sending python requests and handling JSON lists
I am sending requests to a crypto network for data on accounts. You get information back, but I haven't encountered lists being sent in JSON until now. I want to parse certain information, but am having trouble because the JSON is a list and is not as easy to parse as normal JSON data.

import requests
import json

url = 'https://s1.ripple.com:51234/'
payload = {
    "method": "account_objects",
    "params": [
        {
            "account": "r9cZA1mLK5R5Am25ArfXFmqgNwjZgnfk59",
            "ledger_index": "validated",
            "type": "state",
            "deletion_blockers_only": False,
            "limit": 10
        }
    ]
}

response = requests.post(url, data=json.dumps(payload))
print(response.text)

data = response.text
parsed = json.loads(data)
price = parsed['result']
price = price['account_objects']
for Balance in price:
    print(Balance)

You will receive all the tokens the account holds and their values. I can not figure out how to parse this correctly and receive the correct one. This particular test account has a lot of tokens, so I will only show the first token's info.

RESULT

{'Balance': {'currency': 'ASP', 'issuer': 'rrrrrrrrrrrrrrrrrrrrBZbvji', 'value': '0'}, 'Flags': 65536, 'HighLimit': {'currency': 'ASP', 'issuer': 'r9cZA1mLK5R5Am25ArfXFmqgNwjZgnfk59', 'value': '0'}, 'HighNode': '0', 'LedgerEntryType': 'RippleState', 'LowLimit': {'currency': 'ASP', 'issuer': 'r3vi7mWxru9rJCxETCyA1CHvzL96eZWx5z', 'value': '10'}, 'LowNode': '0', 'PreviousTxnID': 'BF7555B0F018E3C5E2A3FF9437A1A5092F32903BE246202F988181B9CED0D862', 'PreviousTxnLgrSeq': 1438879, 'index': '2243B0B630EA6F7330B654EFA53E27A7609D9484E535AB11B7F946DF3D247CE9'}

I want to get the first bit of info, here:

{'Balance': {'currency': 'ASP', 'issuer': 'rrrrrrrrrrrrrrrrrrrrBZbvji', 'value': '0'}}

Specifically 'value' and the number. I have tried to parse 'Balance', but since it is inside a list it is not as straightforward.
You're mixing up lists and dictionaries. In order to access a dictionary by key, you index with the key, as such:

for Balance in price:
    print(Balance['Balance'])

Yields the following results:

{'currency': 'CHF', 'issuer': 'rrrrrrrrrrrrrrrrrrrrBZbvji', 'value': '-0.3488146605801446'}
{'currency': 'BTC', 'issuer': 'rrrrrrrrrrrrrrrrrrrrBZbvji', 'value': '0'}
{'currency': 'USD', 'issuer': 'rrrrrrrrrrrrrrrrrrrrBZbvji', 'value': '-11.68225001668339'}

If you only want to extract the value, you simply dive one level deeper:

for Balance in price:
    print(Balance['Balance']['value'])

Which yields:

-0.3488146605801446
0
-11.68225001668339
I assume that under price['account_objects'] you have a list of dictionaries, and that each dictionary has a key like 'Balance': {'currency': 'ASP', 'issuer': 'rrrrrrrrrrrrrrrrrrrrBZbvji', 'value': '0'}. If so, why don't you iterate over the list and then access each dictionary:

account_objects = price['account_objects']
for account_object in account_objects:
    print(account_object['Balance'])
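Both answers can be checked offline. A sketch with a hypothetical two-token account_objects list shaped like the RESULT above:

```python
# Hypothetical stand-in for parsed['result']['account_objects'].
account_objects = [
    {'Balance': {'currency': 'ASP',
                 'issuer': 'rrrrrrrrrrrrrrrrrrrrBZbvji',
                 'value': '0'},
     'LedgerEntryType': 'RippleState'},
    {'Balance': {'currency': 'BTC',
                 'issuer': 'rrrrrrrrrrrrrrrrrrrrBZbvji',
                 'value': '0.5'},
     'LedgerEntryType': 'RippleState'},
]

# Each element is a dict; 'Balance' is itself a dict, so index twice.
for obj in account_objects:
    print(obj['Balance']['currency'], obj['Balance']['value'])
```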
Python 3 requests module and urllib.request module both retrieve incomplete JSON
I'm doing some scraping and looking at pages like this one (https://www.endomondo.com/rest/v1/users/20261627/workouts/526622897), but I have not been able to fully retrieve the JSON content. I have tried both of the following sets of code, but each returns an incomplete JSON object:

url = 'https://www.endomondo.com/rest/v1/users/%s/workouts/%s' % (string_use_user, string_use_workout)
print(url)
response = urlopen(url)
reader = codecs.getreader("utf-8")
print(reader(response))
jsonresponse = json.load(reader(response))
print(jsonresponse)

and similarly, using the requests library instead of urllib also fails to retrieve the full JSON:

url = 'https://www.endomondo.com/rest/v1/users/%s/workouts/%s' % (string_use_user, string_use_workout)
print("using this url %s" % url)
r = requests.get(url)
print(r.json())
jsonresponse = r.json()

In both cases I get about 1/4 of the JSON. For example, in this case (https://www.endomondo.com/rest/v1/users/20261627/workouts/526622897) I received:

{'feed_id': 281475471235835, 'id': 526622897, 'duration': 4082.0, 'local_start_time': '2015-05-21T09:30:45.000+02:00', 'calories': 1073.0, 'tagged_users': [], 'altitude_max': 69.9523, 'sport': 0, 'distance': 11.115419387817383, 'altitude_min': 14.9908, 'include_in_stats': True, 'hydration': 0.545339, 'start_time': '2015-05-21T07:30:45.000Z', 'ascent': 137.162, 'is_live': False, 'pb_count': 2, 'playlist': [], 'is_peptalk_allowed': False, 'weather': {'wind_speed': 11, 'temperature': 12, 'wind_direction': 13, 'type': 3, 'humidity': 81}, 'speed_max': 24.8596, 'author': {'name': 'gfdgfd', 'id': 20261627, 'last_name': 'gdsgsk', 'gender': 0, 'expand': 'abs', 'picture': {'url': 'https://www.endomondo.com/resources/gfx/picture/18511427/thumbnail.jpg'}, 'first_name': 'gdsgds', 'viewer_friendship': 1, 'is_premium': False}, 'sharing': [{'share_time': '2015-05-21T08:45:19.000Z', 'type': 0, 'share_id': 1635690786663532}], 'show_map': 0, 'pictures': [], 'hashtags': [], 'descent': 150.621, 'speed_avg': 9.80291763746756, 'expand': 'full', 'show_workout': 0, 'points': {'expand': 'ref', 'id': 2199549878449}}

I am not receiving the long arrays within the data. I am also not even recovering all of the non-array data. I ran the original page through a JSON validator, and it's fine. Similarly, I ran the JSON I do receive through a validator, and it's also fine - it doesn't show any signs of missing things unless I compare it with the original. I would appreciate any advice about how to troubleshoot this. Thanks.
Looks like this API is doing some User-Agent sniffing and only sending the complete content to what it considers to be actual web browsers. Once you set a User-Agent header with the UA string of a common browser, you get the full response:

>>> UA = 'Mozilla/5.0 (X11; Linux x86_64; rv:36.0) Gecko/20100101 Firefox/36.0'
>>> url = 'https://www.endomondo.com/rest/v1/users/20261627/workouts/526622897'
>>> r = requests.get(url, headers={'User-Agent': UA})
>>>
>>> print(len(r.content))
96412

See the requests docs for more details on setting custom headers.
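If you are on the urllib side instead of requests, the header goes on the Request object. A sketch that only builds the request (no network call) to show where the header lives; note urllib normalizes the stored header name to 'User-agent':

```python
import urllib.request

UA = 'Mozilla/5.0 (X11; Linux x86_64; rv:36.0) Gecko/20100101 Firefox/36.0'
url = 'https://www.endomondo.com/rest/v1/users/20261627/workouts/526622897'

# Attach the header at construction time; urlopen(req) would send it.
req = urllib.request.Request(url, headers={'User-Agent': UA})
print(req.get_header('User-agent'))   # the UA string set above
```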