web scraping python list index out of range - python

I'm trying to use Python to web scrape a ranking list from kworb.
Here is my code:
import pandas as pd
import requests
from bs4 import BeautifulSoup
df = pd.read_html('https://kworb.net/spotify/country/hk_weekly.html', attrs={'id':'spotifyweekly'})[0]
df[['Artist','Song']]=df['Artist and Title'].str.split(' - ', n=1, expand=True)
df[['Pos','Artist','Song']].to_excel('yourfile.xlsx', index = False)
soup = BeautifulSoup(requests.get("https://kworb.net/spotify/country/hk_weekly.html").content, 'html.parser')
data = []
for e in soup.select('#spotifyweekly tr:has(td)'):
    c = list(e.stripped_strings)
    data.append({
        'Pos': c[0],
        'Artist': c[2],
        'Song': c[4]
    })
pd.DataFrame(data).to_excel('yourfile.xlsx', index = False)
and I get this error:
Traceback (most recent call last):
File "C:\Users\lohub\OneDrive\desktop\scrape.py", line 17, in <module>
'Song':c[4]
IndexError: list index out of range

Be aware - you only need one of the two approaches in your code, pandas or BeautifulSoup, and the first one (pandas) already delivers a proper result.
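For reference, a pandas-only version is essentially just the first three lines of your script; a minimal sketch using the same URL and split logic as above:
import pandas as pd
# read the #spotifyweekly table straight into a DataFrame
df = pd.read_html('https://kworb.net/spotify/country/hk_weekly.html', attrs={'id': 'spotifyweekly'})[0]
# split "Artist and Title" into two columns on the first " - "
df[['Artist', 'Song']] = df['Artist and Title'].str.split(' - ', n=1, expand=True)
df[['Pos', 'Artist', 'Song']].to_excel('yourfile.xlsx', index=False)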
As mentioned earlier, there are not enough items to select by index, which is probably caused by a specific row having fewer values than the rest:
<tr><td>61</td>
<td></td>
<td class="text mp"><div>Jason Chan - 你瞞我瞞</div></td>
<td></td>
<td></td><td></td>
<td></td>
<td></td>
<td></td></tr>
If you print c for that row, you get the following, so the last valid index is 3:
['61', 'Jason Chan', '-', '你瞞我瞞']
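If you only wanted to stop the crash in your original loop, a minimal guard would be the following sketch (it simply skips rows that don't have enough values, so their data is lost):
for e in soup.select('#spotifyweekly tr:has(td)'):
    c = list(e.stripped_strings)
    if len(c) < 5:  # malformed rows like the one above only have 4 values
        continue
    data.append({'Pos': c[0], 'Artist': c[2], 'Song': c[4]})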
An alternative that handles this issue without dropping any rows would be:
import requests
from bs4 import BeautifulSoup
import pandas as pd
soup = BeautifulSoup(requests.get("https://kworb.net/spotify/country/hk_weekly.html").content, 'html.parser')
data = []
for e in soup.select('#spotifyweekly tr:has(td)'):
    data.append({
        'Pos': e.td.text,
        'Artist': e.a.text,
        'Song': e.a.find_next_sibling('a').text
    })
pd.DataFrame(data).to_excel('yourfile.xlsx', index = False)

Related

can't convert scraped string to float in python

So basically, I want to scrape a table from this article and find the difference between the 1980 column and the 2018 column. To do that, I'm trying to convert the scraped data from a tag to a string, and then to a float, but I get an error when I try to convert it to a float.
import pandas as pd
from bs4 import BeautifulSoup
from urllib.request import urlopen
import requests
html = urlopen("https://en.wikipedia.org/wiki/List_of_countries_by_carbon_dioxide_emissions_per_capita")
soup = BeautifulSoup(html, "html.parser")
for tr in soup.select('table:nth-of-type(2) tr:has(td)'):
    nation = tr.td.a.text
    eighty = tr.find_all("td")[3]
    eighty_x = eighty.text
    eighty_y = float(eighty_x)
    eighteen = tr.find_all("td")[14]
    eighteen_x = eighteen.text
    eighteen_y = float(eighteen_x)
    selection = (nation, eighty_x, eighteen_x.strip())
    print(selection)
What I get is this:
('Afghanistan', '0.2', '0.3')
('Albania', '1.7', '1.6')
('Algeria', '3.0', '3.9')
---------------------------------------------------------------------------
ValueError Traceback (most recent call last)
Input In [39], in <cell line: 9>()
11 eighty = tr.find_all("td")[3]
12 eighty_x = eighty.text
---> 13 eighty_y = float(eighty_x)
14 eighteen = tr.find_all("td")[14]
15 eighteen_x = eighteen.text
ValueError: could not convert string to float: '..'
The table that you're scraping, located here: https://en.wikipedia.org/wiki/List_of_countries_by_carbon_dioxide_emissions_per_capita, contains ".." in some cells, so that's the issue. You then need to decide how to handle that case. Maybe store it as a -1 in your data to signal the missing value in the table? Up to you.
Also, I think in your selection you want eighty_y and eighteen_y instead of the x's, right? Otherwise, why are you converting x to y (as in eighty_x becomes eighty_y)?
Assuming this is fine, then this will do it:
import pandas as pd
from bs4 import BeautifulSoup
from urllib.request import urlopen
import requests
html = urlopen("https://en.wikipedia.org/wiki/List_of_countries_by_carbon_dioxide_emissions_per_capita")
soup = BeautifulSoup(html, "html.parser")
for tr in soup.select('table:nth-of-type(2) tr:has(td)'):
    if tr is None or tr.td is None or tr.td.a is None:
        continue
    nation = tr.td.a.text
    eighty = tr.find_all("td")[3]
    eighty_x = eighty.text
    eighty_y = -1 if ".." in eighty_x else float(eighty_x)
    eighteen = tr.find_all("td")[14]
    eighteen_x = eighteen.text
    eighteen_y = -1 if ".." in eighteen_x else float(eighteen_x)
    selection = (nation, eighty_y, eighteen_y)
    print(selection)
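If you would rather keep missing cells as None (or NaN) instead of -1, a small helper along these lines keeps the conversion logic in one place (the function name and the comma handling are my own additions, adjust as needed):
def to_float(cell_text, missing=None):
    # convert a scraped cell to float; return `missing` for '..' placeholders
    cell_text = cell_text.strip()
    if cell_text in ('', '..'):
        return missing
    return float(cell_text.replace(',', ''))  # also tolerate thousands separators

eighty_y = to_float(eighty.text)
eighteen_y = to_float(eighteen.text)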

Python - List out of range error in web scraping

I have been running this Python code and it gives me an error saying
---------------------------------------------------------------------------
IndexError Traceback (most recent call last)
<ipython-input-9-6ff1d459c8bd> in <module>
6 soup = BeautifulSoup(data, 'html5lib')
7 df = pd.DataFrame(columns=["Name", "Sector", "Price", "Price/Earnings", "Dividend_Yield", "Earnings/Share", "52_Week_Low", "52_Week_High", "Market_Cap", "EBITDA"])
----> 8 for row in soup.find_all('tbody')[1].find_all('tr'):
9 col = row.find_all("td")
10 Name = col[0].text
IndexError: list index out of range
The code I have used to do the Python web scraping is:
import pandas as pd
import requests
from bs4 import BeautifulSoup
url = "https://www.kaggle.com/priteshraj10/sp-500-companies"
data = requests.get(url).text
soup = BeautifulSoup(data, 'html5lib')
df = pd.DataFrame(columns=["Name", "Sector", "Price", "Price/Earnings", "Dividend_Yield", "Earnings/Share", "52_Week_Low", "52_Week_High", "Market_Cap", "EBITDA"])
for row in soup.find_all('tbody')[1].find_all('tr'):
    col = row.find_all("td")
    Name = col[0].text
    Sector = col[1].text
    Price = col[2].text
    Price_Earnings = col[3].text
    Dividend_Yield = col[4].text
    Earnings_Share = col[5].text
    Week_Low = col[6].text
    Week_High = col[7].text
    Market_Cap = col[8].text
    EBITDA = col[9].text
    df = df.append({"Name":Name,"Sector":Sector,"Price":Price,"Price_Earnings":Price_Earnings,"Dividend_Yield":Dividend_Yield,"Earnings_Share":Earnings_Share,"Week_Low":Week_Low,"Week_High":Week_High,"Market_Cap":Market_Cap,"EBITDA":EBITDA}, ignore_index=True)
Can you help me with this?
If you print the variable soup, you will see that the returned HTML does not contain the information you want, probably because the site has a block to prevent web scraping.
In this line of code:
for row in soup.find_all('tbody')[1]
soup.find_all('tbody') returns a list, and indexing it with [1] expects the list to have at least two items (Python list indexing starts from 0), which is not the case here.
What you could do is print that list:
print(soup.find_all('tbody'))
to see what you are trying to access at index position 1, and why it is not there.
Additionally, if you want to check its length:
print(len(soup.find_all('tbody')))
It will be smaller than 2, hence the error.
I'd also recommend using a debugger instead of print statements to find out what's going on in your code.
The issue seems to be that the website you are trying to scrape probably changed its HTML at some point.
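As a defensive sketch (this only avoids the crash, it does not make the blocked data appear), you can check how many tbody elements actually came back before indexing into the list:
tbodies = soup.find_all('tbody')
print(len(tbodies))  # likely 0 or 1 here, which is why tbodies[1] raises IndexError
if len(tbodies) > 1:
    for row in tbodies[1].find_all('tr'):
        col = row.find_all('td')
        # ... process the row as in your original loop
else:
    print('Expected table not found - the page is probably blocked or rendered with JavaScript.')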

How to create dictionary from a table using beautifulsoup?

I am trying to retrieve data from a table via beautifulsoup, but somehow my (beginner) syntax is wrong:
from bs4 import BeautifulSoup
import requests
main_url = "https://www.apo-in.de/product/acc-akut-600-brausetabletten.24170.html"
req = requests.get(main_url)
soup = BeautifulSoup(req.text, "html.parser")
title = soup.find("div", id = "accordionContent5e95581b6e244")
results = {}
for row in title.findAll('tr'):
    aux = row.findAll('td')
    results[aux[0].string] = aux[1].string
print(results)
This is the relevant HTML:
<div id="accordionContent5e95581b6e244" class="panel-collapse collapse in">
<div class="panel-body">
<table class="table" width="100%">
<tbody>
<tr>
<th width="170">PZN</th>
<td>00520917</td>
</tr>
<tr>
<th width="170">Anbieter</th>
<td>Hexal AG</td>
</tr>
My goal is to build a dictionary from the th/td cells.
How can this be done with BeautifulSoup?
I would suggest using pandas to store the data in a DataFrame and then converting it into a dictionary.
import pandas as pd
from bs4 import BeautifulSoup
import requests
main_url = "https://www.apo-in.de/product/acc-akut-600-brausetabletten.24170.html"
req = requests.get(main_url)
soup = BeautifulSoup(req.text, "html.parser")
table = soup.select_one(".panel-body > table")
df = pd.read_html(str(table))[0]
print(df.set_index(0).to_dict('dict'))
Output:
{1: {'Rezeptpflichtig': 'nein', 'Anbieter': 'Hexal AG', 'PZN': '00520917', 'Darreichungsform': 'Brausetabletten', 'Wirksubstanz': 'Acetylcystein', 'Monopräparat': 'ja', 'Packungsgröße': '40\xa0St', 'Apothekenpflichtig': 'ja', 'Produktname': 'ACC akut 600mg Hustenlöser'}}
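If you want a flat {header: value} dictionary rather than the nested one above, a small follow-up on the same DataFrame should do it (the integer column labels 0 and 1 come from read_html, as the output above shows):
print(df.set_index(0)[1].to_dict())
# e.g. {'PZN': '00520917', 'Anbieter': 'Hexal AG', ...}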
First mistake: you are using an id, which varies if you want to scrape more pages.
Second mistake: aux = row.findAll('td') returns a list with only one item, because it does not take the th tags into consideration, which means aux[1].string will raise an exception.
Here is the code:
from bs4 import BeautifulSoup
import requests
main_url = "https://www.apo-in.de/product/acc-akut-600-brausetabletten.24170.html"
req = requests.get(main_url)
soup = BeautifulSoup(req.text, "html.parser")
title = soup.find("div", class_="panel-collapse collapse in")
results = {}
for row in title.findAll('tr'):
    key = row.find('th')
    value = row.find('td')
    results[key.text] = value.text.strip()
print(results)
Output:
{'PZN': '00520917', 'Anbieter': 'Hexal AG', 'Packungsgröße': '40\xa0St', 'Produktname': 'ACC akut 600mg Hustenlöser', 'Darreichungsform': 'Brausetabletten', 'Monopräparat': 'ja', 'Wirksubstanz': 'Acetylcystein', 'Rezeptpflichtig': 'nein', 'Apothekenpflichtig': 'ja'}

Getting a certain element out of the website table

I've been trying to get just one value from a table on a website. I've been following a tutorial, but I am currently stuck. My goal is to extract the name of a country from the table along with that country's number of total cases, and print it on the screen. For example:
China: 80,761 Total cases
I'm using Python 3.7.
This is my code so far:
import requests
from bs4 import BeautifulSoup
url='https://www.worldometers.info/coronavirus/'
response = requests.get(url)
soup = BeautifulSoup(response.content, 'html.parser')
table = soup.findAll('table',{'id':'main_table_countries'})
If you have <table> tags, just go with pandas' .read_html(). It uses BeautifulSoup under the hood, and then you can slice and dice the DataFrame as you please:
import pandas as pd
url='https://www.worldometers.info/coronavirus/'
df = pd.read_html(url)[0]
print (df.iloc[:,:2])
Doing it with BeautifulSoup is straightforward: first grab the <table> tag; within the <table> tag get all the <tr> tags (rows); then iterate through each row to get all its <td> tags (the data). The values you want are at index positions 0 and 1, so just print those out.
import requests
from bs4 import BeautifulSoup
url='https://www.worldometers.info/coronavirus/'
response = requests.get(url)
soup = BeautifulSoup(response.content, 'html.parser')
table = soup.find('table',{'id':'main_table_countries'})
rows = table.find_all('tr')
for row in rows:
    data = row.find_all('td')
    if data != []:
        print(data[0].text, data[1].text)
ADDITIONAL:
import pandas as pd
country = 'China'
url='https://www.worldometers.info/coronavirus/'
df = pd.read_html(url)[0]
print (df[df['Country,Other'] == country].iloc[:,:2])
OR
import requests
from bs4 import BeautifulSoup
import re
country = 'China'
url='https://www.worldometers.info/coronavirus/'
response = requests.get(url)
soup = BeautifulSoup(response.content, 'html.parser')
table = soup.find('table',{'id':'main_table_countries'})
rows = table.find('a', text=re.compile(country))
for row in rows:
    data = row.parent.parent.parent.find_all('td')[1].text
    print(row, data)
You can get the target info this way:
for t in table[0].find_all('tr'):
    target = t.find_all('td')
    if len(target) > 0:
        print(target[0].text, target[1].text)
Output:
China 80,761
Italy 9,172
Iran 8,042
etc.
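If you want the output in the exact "China: 80,761 Total cases" format from the question, a small hedged follow-up to the pandas approach above could look like this (assuming, as in the slices above, that column 0 is the country name and column 1 the total-case count):
import pandas as pd

country = 'China'
df = pd.read_html('https://www.worldometers.info/coronavirus/')[0]
row = df[df['Country,Other'] == country].iloc[0]  # first matching row as a Series
print(f'{row.iloc[0]}: {row.iloc[1]} Total cases')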

How can I loop through all <th> tags within my script for web scraping?

As of now, I'm only getting ['1'] as the output of my current code below. I want to grab 1-54 from the Rk column of the Team Batting table on the website https://www.baseball-reference.com/teams/NYY/2019.shtml.
How would I go about modifying colNum so it can print the 1-54 in the Rk column? I'm pointing out the colNum line because I feel the issue lies there but I could be wrong.
import pandas as pd
import requests
from bs4 import BeautifulSoup
page = requests.get('https://www.baseball-reference.com/teams/NYY/2019.shtml')
soup = BeautifulSoup(page.content, 'html.parser') # parse as HTML page, this is the source code of the page
week = soup.find(class_='table_outer_container')
items = week.find("thead").get_text() # grabs table headers
th = week.find("th").get_text() # grabs Rk only.
tbody = week.find("tbody")
tr = tbody.find("tr")
thtwo = tr.find("th").get_text()
colNum = [thtwo for thtwo in thtwo]
print(colNum)
Your mistake was in the last few lines, as you suspected. If I understood right, you want a list of all the values in the "Rk" column. In order to get all the rows, you have to use the find_all() function. I tweaked your code a little bit to get the text of the first field in each row:
import pandas as pd
import requests
from bs4 import BeautifulSoup
page = requests.get('https://www.baseball-reference.com/teams/NYY/2019.shtml')
soup = BeautifulSoup(page.content, 'html.parser')  # parse as HTML page, this is the source code of the page
week = soup.find(class_='table_outer_container')
items = week.find("thead").get_text()
th = week.find("th").get_text()
tbody = week.find("tbody")
tr = tbody.find_all("tr")
colnum = [row.find("th").get_text() for row in tr]
print(colnum)
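If you then want the ranks as integers rather than strings, an optional follow-up would be the line below (the isdigit() check simply skips any non-numeric values, such as repeated header rows, should the table body contain them):
rk_numbers = [int(x) for x in colnum if x.isdigit()]
print(rk_numbers)  # e.g. [1, 2, 3, ..., 54]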
