I am performing sentiment analysis on a URL. To get the content from the URL, I wrote this code:
import numpy as np
import requests
import pandas as pd
from bs4 import BeautifulSoup
url = target_url
headers = {"User-Agent": "Mozilla/5.0 (X11; Linux x86_64; rv:60.0) Gecko/20100101 Firefox/60.0"}
data = requests.get(url, headers=headers)
soup = BeautifulSoup(data.content,'html.parser')
content=soup.findAll(attrs={'class':'td-post-content'})
content=content[0].text.replace('\n'," ")
but this keeps saying that my list index is out of range.
I tried changing the list to a dictionary, but it still gives me an index-out-of-range error.
What do I have to do to solve this?
Probably content is empty, so content[0] raises that error. Check print(len(content)) to see whether it has any items, and check that soup.find_all() is matching properly.
import numpy as np
import requests
import pandas as pd
from bs4 import BeautifulSoup
url = target_url
headers = {"User-Agent": "Mozilla/5.0 (X11; Linux x86_64; rv:60.0) Gecko/20100101 Firefox/60.0"}
data = requests.get(url, headers=headers)
soup = BeautifulSoup(data.content, 'html.parser')
content = soup.find_all(class_="td-post-content")
content = content[0].text.replace('\n', " ")
If the above code doesn't work, share the URL of the target page so a more accurate answer can be provided.
I'm new to web scraping. I was trying to make a script that gets data from a balance sheet (here's the site: https://www.sec.gov/ix?doc=/Archives/edgar/data/320193/000032019320000010/a10-qq1202012282019.htm). The problem is getting the data: when I look at the source code in my browser, I'm able to find the tag and the correct value, but once I write a script with bs4, I don't get anything.
I'm trying to get information from the balance sheet: Products, Services, Cost of sales... and the data contained in table 1. (I'm sorry, but I can't post the image. Anyway, it's the first table you see scrolling down.)
Here's my code.
from bs4 import BeautifulSoup
from urllib.request import Request, urlopen
url = "https://www.sec.gov/ix?doc=/Archives/edgar/data/320193/000032019320000010/a10-qq1202012282019.htm"
req = Request(url, headers={'User-Agent': 'Mozilla/5.0'})
read_data = urlopen(req).read()
soup_data = BeautifulSoup(read_data, "lxml")
names = soup_data.find_all("td")
for name in names:
    print(name)
Thanks for your time.
Try this URL instead (the direct document, without the /ix viewer), and include the headers to get the data:
import requests
from bs4 import BeautifulSoup
url = "https://www.sec.gov/Archives/edgar/data/320193/000032019320000010/a10-qq1202012282019.htm"
headers = {"User-agent":"Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/77.0.3865.120 Safari/537.36"}
req = requests.get(url, headers=headers)
soup_data = BeautifulSoup(req.text, "lxml")
You will be able to find the data you need.
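For example, a sketch that continues from the soup_data above and prints the text of every non-empty table cell:
# Walk all td elements and print their stripped text
for td in soup_data.find_all("td"):
    text = td.get_text(strip=True)
    if text:
        print(text)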
So here's my script:
import requests
import urllib
import json
url = 'https://www.homedepot.com/p/ZLINE-Kitchen-and-Bath-36-DuraSnow-Stainless-Steel-Range-Hood-with-Hand-Hammered-Copper-Shell-8654HH-36-8654HH-36/311287560'
response = json.loads(requests.get(url).text)
print(response["offers"])
and after grabbing the page source of https://www.homedepot.com/p/ZLINE-Kitchen-and-Bath-36-DuraSnow-Stainless-Steel-Range-Hood-with-Hand-Hammered-Copper-Shell-8654HH-36-8654HH-36/311287560
I want to grab this data
"offers":{"#type":"Offer","url":"https://www.homedepot.com/p/ZLINE-Kitchen-and-Bath-36-DuraSnow-Stainless-Steel-Range-Hood-with-Hand-Hammered-Copper-Shell-8654HH-36-8654HH-36/311287560","priceCurrency":"USD","price":1449.95,"priceValidUntil":"4/7/2021","availability":"https://schema.org/InStock"}
More specifically, price and priceValidUntil
From some googling I think this would be the way to do it, but since there's so much data within the webpage, my script takes a ton of time to run.
Is there a more efficient way of getting this JSON data, and am I grabbing it correctly?
You can use this example of how to load the JSON data from the HTML page:
import json
import requests
from bs4 import BeautifulSoup
headers = {
"User-Agent": "Mozilla/5.0 (X11; Ubuntu; Linux x86_64; rv:87.0) Gecko/20100101 Firefox/87.0"
}
url = "https://www.homedepot.com/p/ZLINE-Kitchen-and-Bath-36-DuraSnow-Stainless-Steel-Range-Hood-with-Hand-Hammered-Copper-Shell-8654HH-36-8654HH-36/311287560"
soup = BeautifulSoup(requests.get(url, headers=headers).content, "html.parser")
data = json.loads(
    soup.select_one('script[type="application/ld+json"]').contents[0]
)
# uncomment this to print all data:
# print(json.dumps(data, indent=4))
print("Price:", data["offers"]["price"])
print("Price valid until:", data["offers"]["priceValidUntil"])
Prints:
Price: 1449.95
Price valid until: 4/8/2021
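If a page carries several ld+json blocks, a defensive variant (a sketch, reusing the soup above) scans them all for the one with an "offers" key:
# Try every ld+json script; skip blocks that are empty or not valid JSON
for script in soup.select('script[type="application/ld+json"]'):
    try:
        block = json.loads(script.contents[0])
    except (ValueError, IndexError):
        continue
    if isinstance(block, dict) and "offers" in block:
        print(block["offers"].get("price"), block["offers"].get("priceValidUntil"))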
I'm a new Python programmer and was trying to scrape some key metrics on Reuters, but I can't do it properly.
Here's my code:
import requests
from bs4 import BeautifulSoup
import lxml
import pandas as pd
url = 'https://www.reuters.com/companies/AAPL.OQ/key-metrics'
page = requests.get(url)
if page.status_code == requests.codes.ok:
    bs = BeautifulSoup(page.text, 'lxml')
    list_all_keys = bs.findAll('tr', class_='data')
    key_name = bs.find('th', class_='MarketsTable-label-_JI6s').find('div', class_='TextLabel__text-label___3oCVw TextLabel__gray___1V4fk TextLabel__regular___2X0ym MarketsTable-label-_JI6s')
    for key in key_name:
        beta = key.find('Beta')
        print(beta)
Beta is the metric I want, but this gives me "-1" as the answer. I want the value in the 'span' related to the name.
What should I do?
I have changed your code a bit: I tried to simplify the class selection and used .text, which lets you extract the text inside the tag.
I have assumed that th and td are the only cells in the row, as in the URL you asked about (which lets us drop the class selection).
Note: I've used a for loop; you could use a filter instead (see the sketch after the code).
import requests
from bs4 import BeautifulSoup
url = 'https://www.reuters.com/companies/AAPL.OQ/key-metrics'
page = requests.get(url)
if page.status_code == requests.codes.ok:
    bs = BeautifulSoup(page.text, 'lxml')
    list_all_keys = bs.findAll('tr', class_='data')
    for key in list_all_keys:
        title = key.find("th").text
        if title == "Beta":
            print(key.find("td").text)
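As noted, the loop can also be written as a filter; a minimal sketch using a generator expression over the same list_all_keys (it assumes every row has a th, as above):
beta_row = next((row for row in list_all_keys if row.find("th").text == "Beta"), None)
if beta_row is not None:
    print(beta_row.find("td").text)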
A few ways you can do it:
1) You can find the element with string "Beta" in the html, then grab the next <td> element.
2) Use pandas' read_html() to get the table, then query/pull out what you specifically want from the table.
3) I prefer this option as it gets you the raw, un-rendered data: just get the json response from the API.
All solutions are below:
Solution 1:
import requests
from bs4 import BeautifulSoup
import re
url = 'https://www.reuters.com/companies/AAPL.OQ/key-metrics'
page = requests.get(url)
if page.status_code == requests.codes.ok:
    bs = BeautifulSoup(page.text, 'html.parser')
    beta = bs('th', text=re.compile(r'Beta'))[0].find_next('td').text
    print(beta)
Solution 2:
import pandas as pd
url = 'https://www.reuters.com/companies/AAPL.OQ/key-metrics'
df = pd.read_html(url)[0]
print(df[df[0] == 'Beta'].iloc[0, 1])
Solution 3:
import requests
from pandas.io.json import json_normalize
url = 'https://www.reuters.com/companies/api/getFetchCompanyKeyMetrics/AAPL.OQ'
headers = {'user-agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/80.0.3987.149 Safari/537.36'}
jsonData = requests.get(url, headers=headers).json()
print(jsonData['market_data']['beta'])
df = json_normalize(jsonData['market_data'])
df.to_csv('file.csv', index=False)
# Or to excel file
#df.to_excel('file.xls', index=False)
Output, in order of the solutions provided:
1.14
1.13
1.13326
To get the JSON response:
Open the page's developer tools (Inspect) and look at the panel on the right. Go to Network -> XHR. Look through the requests on the left and see if the data you are after is in there (you might need to reload the page, then click around to find it).
If you find it, go to Headers to get the Request URL that you'll use to get that response.
Once you have the response, to convert it to a table and output it to Excel, see the code for Solution 3.
I'm trying to write a program that downloads the most upvoted picture from a subreddit, but for some reason BeautifulSoup does not find all the links on the website. I know I could try other methods, but I'm curious why it isn't finding all the links every time.
Here is the code as well.
from PIL import Image
import requests
from bs4 import BeautifulSoup
url = 'https://www.reddit.com/r/wallpaper/top/'
result = requests.get(url)
soup = BeautifulSoup(result.text, 'html.parser')
for link in soup.find_all('a'):
    print(link.get('href'))
The site is loaded with JavaScript, and bs4 cannot render JavaScript; therefore I've located the data within a script tag instead.
import requests
import re
import json
headers = {
'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:74.0) Gecko/20100101 Firefox/74.0'
}
def main(url):
    r = requests.get(url, headers=headers)
    match = re.search(r"window.___r = ({.+})", r.text).group(1)
    data = json.loads(match)
    # print(data.keys())
    # humanreadable = json.dumps(data, indent=4)
main("https://www.reddit.com/r/wallpaper/top/")
Shorter version:
match = re.finditer(r'permalink":"(.+?)"', r.text)
for item in match:
    print(item.group(1))
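For reference, a self-contained version of the shorter approach (same regex and headers as above, wrapped in its own request) might look like this:
import re
import requests

headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:74.0) Gecko/20100101 Firefox/74.0'
}
r = requests.get("https://www.reddit.com/r/wallpaper/top/", headers=headers)
# Each post's permalink appears in the embedded JSON state as "permalink":"..."
for item in re.finditer(r'permalink":"(.+?)"', r.text):
    print(item.group(1))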
Output:
https://www.reddit.com/r/wallpaper/comments/fv9ubr/khyber_pakhtunkhwa_pakistan_balakot_1920x1024/
https://www.reddit.com/user/wsopgame/comments/fvbxom/join_the_official_wsop_online_poker_game_and/
https://www.reddit.com/user/wsopgame/comments/fvbxom/join_the_official_wsop_online_poker_game_and/?instanceId=t3_p%3DgAAAAABeiiTtw4FM0zBerf9DDiq5tmonjJbAwzQb_UwA-VHlw2J8zUxw-y6Doa6j-jPP0qt05lRZfyReQwnLH9pN6wdSBBvqhgxgRS3uKyKCRvkk6WNwns5wpad0ijMgHwqVnZSGMT0KWP4WB15zBNkb3j96ifm23pT4uACb6cpNVh-TE05GiTtDnD9UUMir02Z7hOr0x4f_wLJEIplafXRp2yiAFPh5VzH_4VSsPx9zV7v3IJwN5ctYLfIcdCW5Z3W-z3bbOVUCU2HqqRAoh0XEj0LrgdicMexa9fzPbtWOshfx3kIazwFhYXoSowPBRZUquSs9zEaQwP1B-wg951edNb7RSjYTrDpQ75zsMfIkasKvAOH-V58%3D
https://www.reddit.com/r/wallpaper/comments/fv6wew/lone_road_in_nowhere_arizona_1920x1080/
https://www.reddit.com/r/wallpaper/comments/fvaqaa/the_hobbit_house_1920_x_1080/
https://www.reddit.com/r/wallpaper/comments/fvcs4j/something_i_made_in_illustrator_5120_2880/
https://www.reddit.com/r/wallpaper/comments/fv09u2/bath_time_in_rocky_mountain_national_park_1280x720/
https://www.reddit.com/r/wallpaper/comments/fuyomz/up_is_still_my_favorite_film_grandpa_carl_cams/
https://www.reddit.com/r/wallpaper/comments/fvagex/beautiful_and_colorful_nature_wallpaper_1920x1080/
https://www.reddit.com/r/wallpaper/comments/fv3nnn/maroon_bells_co_photo_credit_to/
https://www.reddit.com/r/wallpaper/comments/fuyg0z/volcano_lightening_19201080/
https://www.reddit.com/r/wallpaper/comments/fvgohk/doctor_strange1920x1080/
https://www.reddit.com/user/redditads/comments/ezogdp/reach_your_audience_on_reddit/
https://www.reddit.com/user/redditads/comments/ezogdp/reach_your_audience_on_reddit/?instanceId=t3_p%3DgAAAAABeiiTt9isPY03zwoimtzcC7w3uLzUDCuoD5cU6ekeEYt48cRAqoMsc1ZDBJ6OeK1U3Bs2Zo1ZSWzdQ4DOux21vGvWzJkxNWQ14XzDWag_GlrE-t_4rpFA_73kW94xGUQchsXL7f4VkbbHIyn8SMlUlTtt3j3lJCViwINOQgIF3p5N8Q4ri-swtJC-JyEUYa4dJazlZ9xLYyOHSvMkiR3k9lDx0NEKqpqfbQ9__f3xLUzgS4yF4OngMDFUVFa5nyH3I32mkP3KezXLxOR6H8CSGI_jqRA4dBV-AnHLuzPlgENRpfaMhWJ04vTEOjmG4sm4xs65OZCumqNstzlDEvR7ryFwL6LeH02a9E3czck5jfKY7HXQ%3D
https://www.reddit.com/r/wallpaper/comments/fuzjza/ghost_cloud_1280x720/
https://www.reddit.com/r/wallpaper/comments/fvg88o/park_autumn_tress_wallpaper_1920x1080/
https://www.reddit.com/r/wallpaper/comments/fv47r8/audi_quattro_s1_3840x2160_fh4/
https://www.reddit.com/r/wallpaper/comments/fuybjs/spacecrafts_1920_x_1080/
https://www.reddit.com/r/wallpaper/comments/fv043i/dragonfly_1280x720/
https://www.reddit.com/r/wallpaper/comments/fv06ud/muskrat_swim_1280x720/
https://www.reddit.com/r/wallpaper/comments/fvdafk/natural_beauty_1920x1080/
https://www.reddit.com/r/wallpaper/comments/fvbnuc/cigar_man_19201080/
https://www.reddit.com/r/wallpaper/comments/fvcww4/thunder_road_3840_x_2160/
https://www.reddit.com/user/redditads/comments/7w17su/interested_in_gaining_a_new_perspective_on_things/
https://www.reddit.com/user/redditads/comments/7w17su/interested_in_gaining_a_new_perspective_on_things/?instanceId=t3_p%3DgAAAAABeiiTtxVzGp9KwvtRNa1pOVCgz2IBkTGRxqdyXk4WTsjAkWS9wzyDVF_1aSOz36HqHOVrngfj3z_9O1cAkzz-0fwhxyJ_8jePT3F88mrveLChf_YRIbAtxb-Ln_OaeeXUnyrFVl-OPN7cqXvtgh3LoymBx3doL-bEVnECOWkcSXvUIwpMn-flVZ5uNcGL1nKEiszUcORqq1oQ32BnrmWHomrDb3Q%3D%3D
https://www.reddit.com/r/wallpaper/comments/fv3xqs/social_distancing_log_1920x1080/
https://www.reddit.com/r/wallpaper/comments/fvbcpl/neon_city_wallpaper_19201080/
https://www.reddit.com/r/wallpaper/comments/fvbhdb/sunrise_wallpaper_19201080/
https://www.reddit.com/r/wallpaper/comments/fv2eno/second_heavy_bike_in_ghost_recon_breakpoint/
I want to extract all the available items in the équipements section, but I can only get the first four items, and then I get '+ Plus'.
import urllib2
from bs4 import BeautifulSoup
import re
import requests
headers = {'User-Agent':'Mozilla/5.0 (Windows; U; Windows NT 6.1; en-US; rv:1.9.1.6) Gecko/20091201 Firefox/3.5.6'}
url = 'https://www.airbnb.fr/rooms/8261637?s=bAMrFL5A'
req = urllib2.Request(url = url, headers = headers)
html = urllib2.urlopen(req)
bsobj = BeautifulSoup(html.read(),'lxml')
b = bsobj.findAll("div",{"class": "row amenities"})
For the result of b, it does not return the full list inside the tag.
And the last item is '+ Plus', which looks like the following:
<span data-reactid=".mjeft4n4sg.0.0.0.0.1.8.1.0.0.$1.1.0.0">+ Plus</span></strong></a></div></div></div></div></div>]
This is because the data is filled in by React after the page loads, so if you download the page via requests you can't see the data.
Instead, you have to use a Selenium web driver: open the page and let it process all the JavaScript. Then you can get access to all the data you expect.
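A minimal Selenium sketch along those lines, assuming chromedriver is installed and on your PATH (the class name comes from the question; the fixed sleep is a crude assumption, and the '+ Plus' button may still need to be clicked to expand the full list):
import time
from bs4 import BeautifulSoup
from selenium import webdriver

driver = webdriver.Chrome()
driver.get('https://www.airbnb.fr/rooms/8261637?s=bAMrFL5A')
time.sleep(5)  # crude wait for React to finish rendering
soup = BeautifulSoup(driver.page_source, 'lxml')
driver.quit()
for div in soup.findAll("div", {"class": "row amenities"}):
    print(div.get_text(" ", strip=True))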