I've been trying to scrape the titles of news articles, but I encounter an IndexError when running the following code. The problem is only in the last line.
import requests
from bs4 import BeautifulSoup
URL = 'https://www.ndtv.com/coronavirus?pfrom=home-mainnavgation'
r1 = requests.get(URL)
coverpage = r1.content
soup1 = BeautifulSoup(coverpage, 'html5lib')
coverpage_news = soup1.find_all('h3', class_='item-title')
coverpage_news[4].get_text()
This is the error:
IndexError Traceback (most recent call last)
<ipython-input-10-f7f1f6fab81c> in <module>
6 soup1 = BeautifulSoup(coverpage, 'html5lib')
7 coverpage_news = soup1.find_all('h3', class_='item-title')
----> 8 coverpage_news[4].get_text()
IndexError: list index out of range
Use soup1.select() to search for nested elements matching a CSS selector:
coverpage_news = soup1.select("h3 a.item-title")
This will find an a element with class="item-title" that's a descendant of an h3 element.
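For example, a minimal sketch that guards against the IndexError, assuming the page still serves these class names:
coverpage_news = soup1.select("h3 a.item-title")
if len(coverpage_news) > 4:  # guard against an empty or short result list
    print(coverpage_news[4].get_text())
else:
    print("Only", len(coverpage_news), "matches found")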
Try changing:
coverpage_news = soup1.find_all('h3', class_='item-title')
to
coverpage_news = soup1.find_all('h3', class_='list-txt')
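If you're not sure which class name the live page actually serves, a quick check (a hedged sketch) is to print the classes of the first few h3 tags:
for h3 in soup1.find_all('h3')[:10]:
    print(h3.get('class'), h3.get_text(strip=True)[:60])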
Adapting Barmar's helpful answer a little to show all titles:
coverpage_news = soup1.select("h3 a.item-title")
for link in coverpage_news:
    print(link.text)
Output:
US Covid Infections Cross 9 Million, Record 1-Day Spike Of 94,000 Cases
Johnson & Johnson Plans To Test COVID-19 Vaccine On Youngsters Soon
Global Stock Markets Decline As Coronavirus Infection Rate Weighs
Cristiano Ronaldo Recovers From Coronavirus
Reliance's July-September Profit Falls 15% As Covid Slams Oil Business
"Likely To Know By December If We'll Have Covid Vaccine": Top US Expert
With No Local Case In A Record 200 Days, This Country Is World's Envy
Delhi Blames Pollution For Covid Spike At High-Level Health Ministry Meet
Delhi Covid Cases Above 5,000 For 3rd Straight Day, Spike In ICU Patients
2 Million Indians Returned From Abroad Under Vande Bharat Mission: Centre
Existing Lockdown Restrictions Extended In Maharashtra Till November 30
Can TB Vaccine Protect Elderly From Covid?
Is The Covid-19 Situation Worsening In Delhi?
What's The Truth Behind India's Falling Covid Numbers?
"Slight Laxity Can Lead To Spike": AIIMS Director As India Sees Drop In Covid Cases
I am attempting to scrape some data off this website. There are 30 entries per page, and I'm currently trying to scrape information from within each of the links on each page. Here is my code:
import pandas as pd
from selenium import webdriver
from selenium.webdriver.common.keys import Keys
import time
driver = webdriver.Chrome()
print(driver.title)
driver.get("https://www.businesslist.com.ng/category/farming")
time.sleep(1)
select = driver.find_element_by_id("listings")
page_entries = [i.find_element_by_tag_name("a").get_attribute("href")
                for i in select.find_elements_by_tag_name("h4")]
columns = {"ESTABLISHMENT YEAR": [], "EMPLOYEES": [], "COMPANY MANAGER": [],
           "VAT REGISTRATION": [], "REGISTRATION CODE": []}
for i in page_entries:
    print(i)
    driver.get(i)
    listify_subentries = [i.text.strip().replace("\n", "") for i in
                          driver.find_elements_by_class_name("info")][:11]
Everything runs fine up to here. The problem is likely in the section below.
for i in listify_subentries:
    for q in columns.keys():
        if q in i:
            item = i.replace(q, "")
            print(item)
            columns[q].append(item)
        else:
            columns[q].append("None given")
            print("None given")
Here's a picture of the layout for one entry. Sorry I can't yet embed images.
I'm trying to scrape some of the information under the "Working Hours" box (i.e. establishment year, company manager etc) from every business's page. You can find the exact information under the columns variable.
Because not all pages have the same amount of information under the "Working Hours" box (here is one with more details underneath it), I tried using dictionaries + text manipulation to look up the available sub-entries and obtain the relevant information to their right. That is to say, obtain the name of the company manager, the year of establishment, and so on; and if a page did not have this, it would simply be tagged as "None given" under the relevant sub-entry.
The idea is to collate all this information and export it to a dataframe later on. Inputting "None given" when a page is lacking a particular sub-entry allows me to preserve the integrity of the data structure so that the entries are sure to align.
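In other words, what I'm aiming for per business page is roughly this (a sketch of the intent rather than my actual code): exactly one value appended per key, whether it was found or not.
for q in columns.keys():
    # find the first sub-entry string that contains this label, if any
    match = next((s for s in listify_subentries if q in s), None)
    if match:
        columns[q].append(match.replace(q, "").strip())
    else:
        columns[q].append("None given")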
However, when I run the code the output I receive is completely off.
Here is the outer view of the columns dictionary once the code has run.
And if I click on the 'COMPANY MANAGER' section, you can see that there are multiple instances of it saying "None given" before it gives the name of the company manager on the page. This is repeated for every other sub-entry as you'll see if you run the code and scroll down. I'm not sure what went wrong, but it seems that the size of the list has been inflated by a factor of 10, with extra "None given"s littered here and there. The size of each list should be 30, but now it's 330.
I would greatly appreciate any assistance with this. Thank you.
You can use the following example to iterate over all enterprises on that page and save the various info into a dataframe:
import requests
import pandas as pd
from bs4 import BeautifulSoup
url = "https://www.businesslist.com.ng/category/farming"
soup = BeautifulSoup(requests.get(url).content, "html.parser")
all_data = []
for a in soup.select("h4 > a"):
    u = "https://www.businesslist.com.ng" + a["href"]
    print(u)
    data = {"URL": u}
    s = BeautifulSoup(requests.get(u).content, "html.parser")
    for info in s.select("div.info:has(.label)"):
        label = info.select_one(".label")
        label.extract()
        value = info.get_text(strip=True, separator=" ")
        data[label.get_text(strip=True)] = value
    all_data.append(data)
df = pd.DataFrame(all_data)
print(df)
df.to_csv("data.csv", index=None)
Prints:
URL Company name Address Phone Number Mobile phone Working hours Establishment year Employees Company manager Share this listing Location map Description Products & Services Listed in categories Keywords Website Registration code VAT registration E-mail Fax
0 https://www.businesslist.com.ng/company/198846/macmed-integrated-farms Macmed Integrated Farms 1, Gani Street, Ijegun Imore. Satellite Town. Lagos , Lagos , Lagos , Nigeria View Map 08033316905 09092245349 Monday: 8am-5pm Tuesday: 8am-5pm Wednesday: 8am-5pm Thursday: 8am-5pm Friday: 8am-5pm Saturday: 8am-5pm Sunday: 10am-4pm 2013 1-5 Engr. Marcus Awoh Show Map Expand Map 1, Gani Street, Ijegun Imore. Satellite Town. Lagos , Lagos , Lagos , Nigeria Get Directions Macmed Integrated Farms is into Poultry, Fish Farming (eggs, meat,day old chicks,fingerlings and grow-out) and animal Husbandry and sales of Farmlands land and facilities We also provide feasibility studies and business planning for all kind of businesses. Day old chicks WE are receiving large quantity of Day old Pullets, Broilers and cockerel in December 2016.\nInterested buyers are invited. PRICE: N 100 - N 350 Investors/ Partners We Macmed Integrated Farms a subsidiary of Macmed Cafe Limited RC (621444) are into poultry farming situated at Iponsinyi, behind (Nigerian National Petroleum Marketing Company)NNPMC at Mosimi, along... Commercial Hatchery Macmed Integrated Farms is setting up a Hatchery for chicken and other birds. We have 2 nos of fully automatic incubator imported from China with combined capacity of 1,500 eggs per setting.\nPlease book in advance.\nMarcus Awoh.\nfarm Operations Manager. PRICE: N100 - N250 Business Services Business Services / Consultants Business Services / Small Business Business Services / Small Business / Business Plans Business Services / Animal Shelters Manufacturing & Industry Manufacturing & Industry / Farming Manufacturing & Industry / Farming / Poultry Housing Suppliers Catfish Day old chicks Farming FINGERLINGS Fishery grow out and aquaculture Meat Poultry eggs spent Pol MORE +4 NaN NaN NaN NaN NaN
...
And saves data.csv (screenshot from LibreOffice):
I want to retrieve the tables on the following website and store them in a pandas dataframe: https://www.acf.hhs.gov/orr/resource/ffy-2012-13-state-of-colorado-orr-funded-programs
However, the third table on the page returns an empty dataframe with all the table's data stored in tuples as the column headers:
Empty DataFrame
Columns: [(Service Providers, State of Colorado), (Cuban - Haitian Program, $0), (Refugee Preventive Health Program, $150,000.00), (Refugee School Impact, $450,000), (Services to Older Refugees Program, $0), (Targeted Assistance - Discretionary, $0), (Total FY, $600,000)]
Index: []
Is there a way to "flatten" the tuple headers into header + values, then append this to a dataframe made up of all four tables? My code is below -- it has worked on other similar pages but keeps breaking because of this table's formatting. Thanks!
import requests
import pandas as pd
from bs4 import BeautifulSoup

funds_df = pd.DataFrame()
url = 'https://www.acf.hhs.gov/programs/orr/resource/ffy-2011-12-state-of-colorado-orr-funded-programs'
page = requests.get(url)
soup = BeautifulSoup(page.text, 'html.parser')
year = url.split('ffy-')[1].split('-orr')[0]
tables = page.content
df_list = pd.read_html(tables)
for df in df_list:
    df['URL'] = url
    df['YEAR'] = year
    funds_df = funds_df.append(df)
For this site, there's no need for BeautifulSoup or requests: pandas.read_html creates a list of DataFrames, one for each <table> at the URL.
import pandas as pd
url = 'https://www.acf.hhs.gov/orr/resource/ffy-2012-13-state-of-colorado-orr-funded-programs'
# read the url
dfl = pd.read_html(url)
# see each dataframe in the list; there are 4 in this case
for i, d in enumerate(dfl):
    print(i)
    display(d)  # display works in Jupyter; otherwise use print
    print('\n')
dfl[0]
Service Providers Cash and Medical Assistance* Refugee Social Services Program Targeted Assistance Program TOTAL
0 State of Colorado $7,140,000 $1,896,854 $503,424 $9,540,278
dfl[1]
WF-CMA 2 RSS TAG-F CMA Mandatory 3 TOTAL
0 $3,309,953 $1,896,854 $503,424 $7,140,000 $9,540,278
dfl[2]
Service Providers Refugee School Impact Targeted Assistance - Discretionary Services to Older Refugees Program Refugee Preventive Health Program Cuban - Haitian Program Total
0 State of Colorado $430,000 $0 $100,000 $150,000 $0 $680,000
dfl[3]
Volag Affiliate Name Projected ORR MG Funding Director
0 CWS Ecumenical Refugee & Immigration Services $127,600 Ferdi Mevlani 1600 Downing St., Suite 400 Denver, CO 80218 303-860-0128
1 ECDC ECDC African Community Center $308,000 Jennifer Guddiche 5250 Leetsdale Drive Denver, CO 80246 303-399-4500
2 EMM Ecumenical Refugee Services $191,400 Ferdi Mevlani 1600 Downing St., Suite 400 Denver, CO 80218 303-860-0128
3 LIRS Lutheran Family Services Rocky Mountains $121,000 Floyd Preston 132 E Las Animas Colorado Springs, CO 80903 719-314-0223
4 LIRS Lutheran Family Services Rocky Mountains $365,200 James Horan 1600 Downing Street, Suite 600 Denver, CO 80218 303-980-5400
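If you ever do need to "flatten" a table whose single data row was swallowed into tuple column headers, as in the question, a minimal sketch (assuming every column label is a (header, value) pair; broken_df is a placeholder name for the misparsed table):
pairs = list(broken_df.columns)  # e.g. ('Service Providers', 'State of Colorado'), ...
flat = pd.DataFrame([[value for _, value in pairs]],
                    columns=[header for header, _ in pairs])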
I have been trying to scrape data from https://www.premierleague.com/players to get team rosters for premier league clubs for the past 10 years.
The following is the code I am using. In this particular example se=17 specifies season 2008/09 and cl=12 is for Manchester United.
import requests
import pandas as pd

url = 'https://www.premierleague.com/players?se=17&cl=12'
r = requests.get(url)
d = pd.read_html(r.text)
d[0]
Despite the URL showing the correct data in the browser, the table I get is the one for the current season, 2019/20. I have tried multiple combinations of the URL and still I am not able to scrape the table I want.
Can someone help?
I prefer to use BeautifulSoup to navigate the DOM. This works.
from bs4 import BeautifulSoup
import requests
import pandas as pd

resp = requests.get("https://www.premierleague.com/players", params={"se": 17, "cl": 12})
soup = BeautifulSoup(resp.content.decode(), "html.parser")
html = soup.find("div", {"class": "table playerIndex"}).find("table")
df = pd.read_html(str(html))[0]
sample output
Player Position Nationality
Rolando Aarons Midfielder England
Tammy Abraham Forward England
Che Adams Forward England
Dennis Adeniran Midfielder England
Adrián Goalkeeper Spain
Adrien Silva Midfielder Portugal
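To build the ten years of rosters the question asks for, a hedged follow-up sketch; it assumes consecutive se codes map to consecutive seasons (se=17 being 2008/09, per the question), which is worth verifying against the site's season dropdown:
# reuses the imports from the snippet above
frames = []
for se in range(17, 27):  # assumed season codes, se=17 .. se=26
    resp = requests.get("https://www.premierleague.com/players",
                        params={"se": se, "cl": 12})
    soup = BeautifulSoup(resp.content.decode(), "html.parser")
    table = soup.find("div", {"class": "table playerIndex"}).find("table")
    frames.append(pd.read_html(str(table))[0].assign(season_code=se))
rosters = pd.concat(frames, ignore_index=True)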
I have an output using BeautifulSoup.
I need to convert the output from type bs4.element.Tag to a list and export the list into a DataFrame column named COLUMN_A.
I want my output to stop at the 14th element (the last three h2 are useless)
My code:
import requests
from bs4 import BeautifulSoup
url = 'https://www.planetware.com/tourist-attractions-/oslo-n-osl-oslo.htm'
url_get = requests.get(url)
soup = BeautifulSoup(url_get.content, 'html.parser')
attraction_place = soup.find_all('h2', class_="sitename")
for attraction in attraction_place:
    print(attraction.text)
    type(attraction)
Output:
1 Vigeland Sculpture Park
2 Akershus Fortress
3 Viking Ship Museum
4 The National Museum
5 Munch Museum
6 Royal Palace
7 The Museum of Cultural History
8 Fram Museum
9 Holmenkollen Ski Jump and Museum
10 Oslo Cathedral
11 City Hall (Rådhuset)
12 Aker Brygge
13 Natural History Museum & Botanical Gardens
14 Oslo Opera House and Annual Music Festivals
Where to Stay in Oslo for Sightseeing
Tips and Tours: How to Make the Most of Your Visit to Oslo
More Related Articles on PlanetWare.com
I expect a list like:
attraction=[Vigeland Sculpture Park, Akershus Fortress, ......]
Thank you very much in advance.
A nice, easy way is to take the alt attribute of the photos. This gets clean text output and only the 14 items wanted, without any need for slicing or indexing.
from bs4 import BeautifulSoup
import requests

r = requests.get('https://www.planetware.com/tourist-attractions-/oslo-n-osl-oslo.htm')
soup = BeautifulSoup(r.content, 'lxml')
attractions = [item['alt'] for item in soup.select('.photo [alt]')]
print(attractions)
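To get that list into the DataFrame column named COLUMN_A that the question asks for, a minimal follow-up sketch:
import pandas as pd

df = pd.DataFrame({"COLUMN_A": attractions})
print(df.head())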
new = []
count = 1
for attraction in attraction_place:
    if count < 15:  # keep only the first 14 attractions
        text = attraction.text
        new.append(text)
        count += 1
You can use a slice.
for attraction in attraction_place[:14]:
    print(attraction.text)
    type(attraction)
I want to import articles from as many sources around the world as possible, starting from a certain date.
import requests
import pandas as pd

url = ('https://newsapi.org/v2/top-headlines?'
       'country=us&'
       'apiKey=de9e19b7547e44c4983ad761c104278f')
response = requests.get(url)
response_dataframe = pd.DataFrame(response.json())
articles = {article for article in response_dataframe['articles'] if article['publishedAt'] >= '2019-01-04T11:30:00Z'}
print(articles)
But I get:
---------------------------------------------------------------------------
TypeError Traceback (most recent call last)
<ipython-input-84-0f21f2f50907> in <module>
2 response_dataframe['articles'][1]['publishedAt']
3
----> 4 articles = {article for article in response_dataframe['articles'] if article['publishedAt'] >= '2018-01-04T11:30:00Z'}
5 print(articles)
<ipython-input-84-0f21f2f50907> in <setcomp>(.0)
2 response_dataframe['articles'][1]['publishedAt']
3
----> 4 articles = {article for article in response_dataframe['articles'] if article['publishedAt'] >= '2018-01-04T11:30:00Z'}
5 print(articles)
TypeError: unhashable type: 'dict'
So how can I select a range of articles by filtering on these keys?
The expected output is a dataframe sorting the articles by day and by newspaper.
The New York Times The Washington Post The Financial Times
2007-01-01 . What Sticks from '06. Somalia Orders Islamis... New Ebola Vaccine Gives 100 Percent Protecti...
2007-01-02 . Heart Health: Vitamin Does Not Prevent Death... Flurry of Settlements Over Toxic Mortgages M...
2007-01-03 . Google Answer to Filling Jobs Is an Algorith... Jason Miller Backs Out of White House Commun...
2007-01-04 . Helping Make the Shift From Combat to Commer... Wielding Claims of ‘Fake News,’ Conservative...
2007-01-05 . Rise in Ethanol Raises Concerns About Corn a... When One Party Has the Governor’s Mansion an
...
My Python version is 3.6.6
You are filtering dictionaries and then trying to put them in a set. Your expected outcome does not require you to de-duplicate anything, so the easiest way to avoid the error is to use a list comprehension instead; just swap out the {...} curly braces for square brackets:
articles = [article for article in response_dataframe['articles'] if article['publishedAt'] >= '2019-01-04T11:30:00Z']
However, if you are going to put the data into a dataframe for processing, you would be much better off using the pandas.io.json.json_normalize() function; it can produce a dataframe for you from the list-and-dictionaries structure typically loaded from a JSON source.
Start by loading just the article data you want into a dataframe; you can filter and re-arrange from there. The following code loads all the data into a single dataframe with a new date column, derived from the publishedAt information:
import pandas as pd
from pandas.io.json import json_normalize
df = json_normalize(response.json(), 'articles')
# make the datetime column a native type, and add a date-only column
df['publishedAt'] = pd.to_datetime(df['publishedAt'])
df['date'] = df['publishedAt'].dt.date
# move source dictionary into separate columns rather than dictionaries
source_columns = df['source'].apply(pd.Series).add_prefix('source_')
df = pd.concat([df.drop(['source'], axis=1), source_columns], axis=1)
That gives you a dataframe with all the article information, as a complete dataframe with native types, with the columns author, content, description, publishedAt, date, title, url, urlToImage and the source_id and source_name columns from the source mapping.
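With publishedAt now a native, timezone-aware datetime column, the cut-off filter from the question can be applied locally, for example:
# keep only articles published at or after the question's cut-off time
recent = df[df['publishedAt'] >= pd.Timestamp('2019-01-04T11:30:00Z')]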
I note that the API already allows you to filter by date; I'd rely on that instead of filtering locally, as you save time and bandwidth by having the API give you a smaller dataset. The API also lets you apply sorting, again a good idea.
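A hedged sketch of what that server-side filtering could look like; it assumes the /v2/everything endpoint (top-headlines does not accept a date range, as far as I know) and the from/sortBy parameter names from the NewsAPI documentation:
import requests

params = {
    "q": "technology",               # /v2/everything requires a search term; placeholder choice
    "from": "2019-01-04T11:30:00Z",  # only articles published at or after this time
    "sortBy": "publishedAt",         # let the API sort, newest first
    "apiKey": "YOUR_API_KEY",        # placeholder
}
response = requests.get("https://newsapi.org/v2/everything", params=params)
articles = response.json().get("articles", [])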
To group the rows by date and source name, you would have to pivot the dataframe; the dates should be the index, the columns the name of the source, and the titles as the values:
df.pivot(index='date', columns='source_name', values='title')
This fails however, because this format does not have space for more than one title per source per day:
ValueError: Index contains duplicate entries, cannot reshape
In the JSON data served to me, there are multiple CNN and Fox News articles just for today.
You could aggregate multiple titles into lists:
pd.pivot_table(df,
               index='date', columns='source_name', values='title',
               aggfunc=list)
For the default 20 results for 'today' this gives me:
>>> pd.pivot_table(
... df, index='date', columns='source_name', values='title',
... aggfunc=list
... )
source_name Bbc.com ... Youtube.com
date ...
2019-01-05 [Paul Whelan: Russia rules out prisoner swap f... ... [Bears Buzz: Eagles at Bears - Wildcard Round ...
[1 rows x 18 columns]
Personally, I'd just keep the dataframe limited to dates, titles and source names, with a date index:
>>> df[['date', 'source_name', 'title']].set_index('date').sort_values(['date', 'source_name'])
source_name title
date
2019-01-05 Bbc.com Paul Whelan: Russia rules out prisoner swap fo...
2019-01-05 Bloomberg Russia Says FBI Arrested Russian Citizen on Pa...
2019-01-05 CNN Pay raises frozen for Pence, Cabinet members u...
2019-01-05 CNN 16 big questions on Robert Mueller's Russia in...
2019-01-05 Colts.com news What They're Saying: Colts/Texans, Wild C...
2019-01-05 Engadget Pandora iOS update adds offline playback for A...
2019-01-05 Espn.com Roger Federer wins Hopman Cup with Switzerland...
2019-01-05 Fox News Japanese 'Tuna King' pays record $3M for prize...
2019-01-05 Fox News Knicks' Turkish star Enes Kanter to skip Londo...
2019-01-05 Latimes.com Flu toll mounts in California, with 42 deaths ...
2019-01-05 NBC News After the fire: Blazes pose hidden threat to t...
2019-01-05 Newser.com After Backlash, Ellen Not Ditching Support for...
2019-01-05 Npr.org Three Dead After Fight Escalates Into Shooting...
2019-01-05 Reuters French 'yellow vests' rail against unrepentant...
2019-01-05 The Hill Trump: 'I don’t care' that most federal employ...
2019-01-05 The Huffington Post 5 Children Dead After Church Van Crashes On Wa...
2019-01-05 The Verge Apple seeks to end bent iPad Pro controversy w...
2019-01-05 Thisisinsider.com Kanye West surprised Kim Kardashian with a $14...
2019-01-05 USA Today See 'Mean Girls' co-stars Lindsay Lohan and Jo...
2019-01-05 Youtube.com Bears Buzz: Eagles at Bears - Wildcard Round -...
The above is sorted by date and by source, so multiple titles from the same source are grouped together.