BeautifulSoup elements output to list - python

I have an output using BeautifulSoup.
I need to convert the output from type bs4.element.Tag to a list, and export that list into a DataFrame column named COLUMN_A.
I want the output to stop at the 14th element (the last three h2 headings are useless).
My code:
import requests
from bs4 import BeautifulSoup
url = 'https://www.planetware.com/tourist-attractions-/oslo-n-osl-oslo.htm'
url_get = requests.get(url)
soup = BeautifulSoup(url_get.content, 'html.parser')
attraction_place = soup.find_all('h2', class_="sitename")
for attraction in attraction_place:
    print(attraction.text)
    type(attraction)
Output:
1 Vigeland Sculpture Park
2 Akershus Fortress
3 Viking Ship Museum
4 The National Museum
5 Munch Museum
6 Royal Palace
7 The Museum of Cultural History
8 Fram Museum
9 Holmenkollen Ski Jump and Museum
10 Oslo Cathedral
11 City Hall (Rådhuset)
12 Aker Brygge
13 Natural History Museum & Botanical Gardens
14 Oslo Opera House and Annual Music Festivals
Where to Stay in Oslo for Sightseeing
Tips and Tours: How to Make the Most of Your Visit to Oslo
More Related Articles on PlanetWare.com
I expect a list like:
attraction=[Vigeland Sculpture Park, Akershus Fortress, ......]
Thank you very much in advance.

A nice, easy way is to take the alt attribute of the photos. This gets clean text output, and exactly 14 items, without any need for slicing/indexing.
from bs4 import BeautifulSoup
import requests
r = requests.get('https://www.planetware.com/tourist-attractions-/oslo-n-osl-oslo.htm')
soup = BeautifulSoup(r.content, 'lxml')
attractions = [item['alt'] for item in soup.select('.photo [alt]')]
print(attractions)
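To land this in a DataFrame column named COLUMN_A, as the question asks, a minimal sketch assuming pandas is installed:
import pandas as pd

# one-column DataFrame; the column name COLUMN_A comes from the question
df = pd.DataFrame({'COLUMN_A': attractions})
print(df.head())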

new = []
count = 0
for attraction in attraction_place:
    if count == 14:  # stop after the 14th heading
        break
    new.append(attraction.text)
    count += 1

You can use a slice.
for attraction in attraction_place[:14]:
    print(attraction.text)
    type(attraction)
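A list comprehension over the same slice gives the list from the question directly:
# text of the first 14 headings only
attraction = [a.text for a in attraction_place[:14]]
print(attraction)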

Related

How to scrape the specific text from kworb and extract it as an excel file?

I'm trying to scrape the positions, the artists and the songs from a ranking list on kworb. For example: https://kworb.net/spotify/country/us_weekly.html
I used the following script:
import requests
from bs4 import BeautifulSoup
response = requests.get("https://kworb.net/spotify/country/us_weekly.html")
soup = BeautifulSoup(response.content, 'html.parser')
print(soup.get_text())
And here is the output:
ITUNES
WORLDWIDE
ARTISTS
CHARTS
DON'T PRAY
RADIO
SPOTIFY
YOUTUBE
TRENDING
HOME
CountriesArtistsListenersCities
Spotify Weekly Chart - United States - 2023/02/09 | Totals
PosP+Artist and TitleWksPk(x?)StreamsStreams+Total
1
+1
SZA - Kill Bill
9
1(x5)
15,560,813
+247,052
148,792,089
2
-1
Miley Cyrus - Flowers
4
1(x3)
13,934,413
-4,506,662
75,009,251
3
+20
Morgan Wallen - Last Night
2
3(x1)
11,560,741
+6,984,649
16,136,833
...
How do I get just the positions, the artists and the songs separately, and store them in an Excel file first?
expected output:
Pos Artist Songs
1 SZA Kill Bill
2 Miley Cyrus Flowers
3 Morgan Wallen Last Night
...
Best practice for scraping tables is pandas.read_html(), which uses BeautifulSoup under the hood for you.
import pandas as pd

# find the table by id and select the first index from the list of dfs
df = pd.read_html('https://kworb.net/spotify/country/us_weekly.html', attrs={'id': 'spotifyweekly'})[0]
# split the column by its delimiter and create the expected columns
df[['Artist', 'Song']] = df['Artist and Title'].str.split(' - ', n=1, expand=True)
# pick your columns and export to excel
df[['Pos', 'Artist', 'Song']].to_excel('yourfile.xlsx', index=False)
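Note: pandas.read_html() needs an HTML parser installed, e.g. lxml or html5lib, or it will raise an ImportError.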
Alternative based on direct approach:
import requests
from bs4 import BeautifulSoup
import pandas as pd
soup = BeautifulSoup(requests.get("https://kworb.net/spotify/country/us_weekly.html").content, 'html.parser')
data = []
for e in soup.select('#spotifyweekly tr:has(td)'):
    data.append({
        'Pos': e.td.text,
        'Artist': e.a.text,
        'Song': e.a.find_next_sibling('a').text
    })
pd.DataFrame(data).to_excel('yourfile.xlsx', index=False)
Output:
Pos  Artist         Song
1    SZA            Kill Bill
2    Miley Cyrus    Flowers
3    Morgan Wallen  Last Night
4    Metro Boomin   Creepin'
5    Lil Uzi Vert   Just Wanna Rock
6    Drake          Rich Flex
7    Metro Boomin   Superhero (Heroes & Villains) [with Future & Chris Brown]
8    Sam Smith      Unholy
...

How to Scrape NBA starting lineups and create a Pandas DataFrame?

I am having trouble parsing the HTML for the NBA starting lineups and would love some help if possible.
Here is my code so far:
import requests
from bs4 import BeautifulSoup
url = "https://www.rotowire.com/basketball/nba-lineups.php"
soup = BeautifulSoup(requests.get(url).text, "html.parser")
lineups = soup.find_all(class_='lineup__player')
print(lineups)
I am looking for the following data:
Player
Team
Position
I was hoping to scrape the data and then create a Pandas Dataframe from the output.
Here is an example of my desired output:
Player Team Position
Dennis Schroder BOS PG
Robert Langford BOS SG
Jayson Tatum BOS SF
Jabari Parker BOS PF
Grant Williams BOS C
Player Team Position
Kyle Lowry MIA PG
Duncan Robinson MIA SG
Jimmy Butler MIA SF
P.J.Tucker MIA PF
Bam Adebayo MIA C
... ... ...
I was able to find the Player data but was unable to parse it successfully. I can see the Player data located inside the title attribute.
Any tips on how to complete this project will be greatly appreciated. Thank you in advance for any help that you may offer.
I am just looking for the 5 starting players, no need to add the bench players. I'm also not sure if there is a way to add a space between each team, like in my output above.
Here is an example of the current output that I would like to parse:
[<li class="lineup__player is-pct-play-100" title="Very Likely To Play">
<div class="lineup__pos">PG</div>
D. Schroder
</li>, <li class="lineup__player is-pct-play-100" title="Very Likely To Play">
<div class="lineup__pos">SG</div>
<a href="/basketball/player.php?id=4762" title="Romeo Langford">R.
You're on the right track. Here's one way to do it.
import requests
import pandas
from bs4 import BeautifulSoup

url = "https://www.rotowire.com/basketball/nba-lineups.php"
soup = BeautifulSoup(requests.get(url).text, "html.parser")
lineups = soup.find_all(class_='is-pct-play-100')
positions = [x.find('div').text for x in lineups]
names = [x.find('a')['title'] for x in lineups]
# each lineup__abbr team label covers five starters
teams = sum([[x.text] * 5 for x in soup.find_all(class_='lineup__abbr')], [])
df = pandas.DataFrame(zip(names, teams, positions), columns=['Player', 'Team', 'Position'])
print(df)
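To print a blank line between teams, as in the desired output, one option is to group the frame by team:
# print each team's five starters as its own block, separated by blank lines
for team, block in df.groupby('Team', sort=False):
    print(block.to_string(index=False))
    print()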

Index Error while webscraping news websites

I've been trying to scrape the titles of news articles, but I encounter an IndexError when running the following code. The problem is only in the last line.
import requests
from bs4 import BeautifulSoup
URL= 'https://www.ndtv.com/coronavirus?pfrom=home-mainnavgation'
r1 = requests.get(URL)
coverpage = r1.content
soup1 = BeautifulSoup(coverpage, 'html5lib')
coverpage_news = soup1.find_all('h3', class_='item-title')
coverpage_news[4].get_text()
This is the error:
IndexError Traceback (most recent call last)
<ipython-input-10-f7f1f6fab81c> in <module>
6 soup1 = BeautifulSoup(coverpage, 'html5lib')
7 coverpage_news = soup1.find_all('h3', class_='item-title')
----> 8 coverpage_news[4].get_text()
IndexError: list index out of range
Use soup1.select() to search for nested elements matching a CSS selector:
coverpage_news = soup1.select("h3 a.item-title")
This will find an a element with class="item-title" that's a descendant of an h3 element.
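If the page can return fewer matches than expected, a small guard avoids the IndexError, e.g.:
coverpage_news = soup1.select("h3 a.item-title")
# only index into the list when enough results came back
if len(coverpage_news) > 4:
    print(coverpage_news[4].get_text())
else:
    print(f"only {len(coverpage_news)} titles found")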
Try to change:
coverpage_news = soup1.find_all('h3', class_='item-title')
to
coverpage_news = soup1.find_all('h3', class_='list-txt')
Changing @Barmar's helpful answer a little bit to show all titles:
coverpage_news = soup1.select("h3 a.item-title")
for link in coverpage_news:
    print(link.text)
Output:
US Covid Infections Cross 9 Million, Record 1-Day Spike Of 94,000 Cases
Johnson & Johnson Plans To Test COVID-19 Vaccine On Youngsters Soon
Global Stock Markets Decline As Coronavirus Infection Rate Weighs
Cristiano Ronaldo Recovers From Coronavirus
Reliance's July-September Profit Falls 15% As Covid Slams Oil Business
"Likely To Know By December If We'll Have Covid Vaccine": Top US Expert
With No Local Case In A Record 200 Days, This Country Is World's Envy
Delhi Blames Pollution For Covid Spike At High-Level Health Ministry Meet
Delhi Covid Cases Above 5,000 For 3rd Straight Day, Spike In ICU Patients
2 Million Indians Returned From Abroad Under Vande Bharat Mission: Centre
Existing Lockdown Restrictions Extended In Maharashtra Till November 30
Can TB Vaccine Protect Elderly From Covid?
Is The Covid-19 Situation Worsening In Delhi?
What's The Truth Behind India's Falling Covid Numbers?
"Slight Laxity Can Lead To Spike": AIIMS Director As India Sees Drop In Covid Cases

How do I scrape https://www.premierleague.com/players for information about team rosters for the last 10 years?

I have been trying to scrape data from https://www.premierleague.com/players to get team rosters for premier league clubs for the past 10 years.
The following is the code I am using. In this particular example se=17 specifies season 2008/09 and cl=12 is for Manchester United.
url= 'https://www.premierleague.com/players?se=17&cl=12'
r=requests.get(url)
d= pd.read_html(r.text)
d[0]
In spite of the URL providing the correct data on the page, the table I get is the one for the current season, 2019/20. I have tried multiple combinations of the URL and still cannot scrape the right table.
Can someone help?
I prefer to use BeautifulSoup to navigate the DOM. This works.
from bs4 import BeautifulSoup
import requests
import pandas as pd

resp = requests.get("https://www.premierleague.com/players", params={"se": 17, "cl": 12})
soup = BeautifulSoup(resp.content.decode(), "html.parser")
html = soup.find("div", {"class": "table playerIndex"}).find("table")
df = pd.read_html(str(html))[0]
Sample output:
Player Position Nationality
Rolando Aarons Midfielder England
Tammy Abraham Forward England
Che Adams Forward England
Dennis Adeniran Midfielder England
Adrián Goalkeeper Spain
Adrien Silva Midfielder Portugal
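To cover the last 10 years, the same request can be looped over the se parameter; a sketch assuming, per the question, that se=17 is 2008/09 and that consecutive se values map to consecutive seasons:
from bs4 import BeautifulSoup
import requests
import pandas as pd

frames = []
for se in range(17, 27):  # assumption: consecutive se values are consecutive seasons
    resp = requests.get("https://www.premierleague.com/players", params={"se": se, "cl": 12})
    soup = BeautifulSoup(resp.content.decode(), "html.parser")
    table = soup.find("div", {"class": "table playerIndex"}).find("table")
    frames.append(pd.read_html(str(table))[0].assign(se=se))

rosters = pd.concat(frames, ignore_index=True)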

How to return result to a table or csv type of format from html

I'm trying to set up a scraper for a betting site that I can run during the NFL season to get the odds into Excel/a DB, but as I am very new to Python and bs4, I'm running into trouble.
I'm using Python 3.7.4 with BS4
import requests
from bs4 import BeautifulSoup
result2 = requests.get("https://www.betfair.com/sport/american-football/nfl-kampe/green-bay-packers-chicago-bears/29202049")
src2 = result2.content
soup = BeautifulSoup(src2, 'lxml')
for item in soup.find_all('div', {'class': 'minimarketview-content'}):
    print(item.text)
I would like output to be csv like this:
"Green Bay Packers", "2.3", "Chicago Bears", "1.55"
"Green Bay Packers", "1.7","+3,5", "Chicago Bears", "2.0","-3.5"
Current result (with big line breaks):
Green Bay Packers
2.3
Chicago Bears
1.55
Green Bay Packers
1.7
+3,5
etc
I can't test the code below because the site is blocked by the firewall on the public wifi I'm on, but instead of printing the items, put them into a list. Then convert that list to a DataFrame/table. Something like:
Note: there's still work to be done to clean it up, but this gets you going.
import requests
from bs4 import BeautifulSoup
import pandas as pd
result2 = requests.get("https://www.betfair.com/sport/american-football/nfl-kampe/green-bay-packers-chicago-bears/29202049")
src2 = result2.content
soup = BeautifulSoup(src2, 'lxml')
data = []
for item in soup.find_all('div', {'class': 'minimarketview-content'}):
    temp_data = [alpha for alpha in item.text.split('\n') if alpha != '']
    data.append(temp_data)
df = pd.DataFrame(data)
print(df)
df.to_csv('file.csv')
Output:
print (df.to_string())
0 1 2 3 4 5 6 7
0 Green Bay Packers 11/8 Chicago Bears 8/13 None None None None
1 Green Bay Packers 3/4 +3.5 Chicago Bears 11/10 -3.5 None None
2 Current Points: Over 20/23 +46 Under 19/20 +46 None
3 Green Bay Packers by 1-13 Pts 2/1 Green Bay Packers 14+ 5/1 Chicago Bears by 1-13 Pts 6/4 Chicago Bears 14+ 10/3
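To get the quoted, comma-separated style from the expected output, pandas can quote every field on export:
import csv

# QUOTE_ALL wraps every field in double quotes, matching the desired CSV style
df.to_csv('file.csv', index=False, header=False, quoting=csv.QUOTE_ALL)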
I think you can just split the text on the newline characters and write each block as a CSV row:
import csv

with open('filename.csv', 'a') as csv_file:
    writer = csv.writer(csv_file)
    for item in soup.find_all('div', {'class': 'minimarketview-content'}):
        # split on newlines so each value lands in its own CSV column
        fields = [f for f in item.text.split('\n') if f]
        writer.writerow(fields)
Edit:
Added save to .csv file.
