Removing new line characters in web scrape - python

I'm trying to scrape baseball lineup data but would only like to return the player names. However, as of right now, it is giving me the position, a newline character, the name, another newline character, and then the batting side. For example, I want
'D. Fletcher'
but instead I get
'LF\nD. Fletcher\nR'
Additionally, it is giving me all players on the page. It would be preferable to group them by team, which maybe requires a dictionary setup of some sort, but I am not sure what that code would look like.
I've tried using the strip function, but that only removes leading and trailing whitespace, not characters in the middle. I've also tried researching how to get just the title information from the anchor tag, but have not figured out how to do that.
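To illustrate the point about strip: it only trims the ends of a string, so the inner newlines survive, though splitting on them does isolate the name (a brittle positional workaround):
raw = 'LF\nD. Fletcher\nR'
print(raw.strip())         # still 'LF\nD. Fletcher\nR' - the inner newlines remain
print(raw.split('\n')[1])  # 'D. Fletcher'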
from bs4 import BeautifulSoup
import requests
url = 'https://www.rotowire.com/baseball/daily_lineups.htm'
r = requests.get(url)
soup = BeautifulSoup(r.text, "html.parser")
players = soup.find_all('li', {'class': 'lineup__player'})
####for link in players.find('a'):
##### print (link.string)
awayPlayers = [player.text.strip() for player in players]
print(awayPlayers)

You should only get the .text for the a tag, not the whole li:
awayPlayers = [player.find('a').text.strip() for player in players]
That would result in something like the following:
['L. Martin', 'Jose Ramirez', 'J. Luplow', 'C. Santana', ...
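If any lineup item happens to lack an anchor, find returns None and .text raises an AttributeError. A defensive variant, in case the page produces such items (an assumption, not something the original answer needed):
awayPlayers = [player.find('a').text.strip() for player in players if player.find('a')]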

Say you wanted to build that dict with team names and players: you could do something like the following. I don't know if you want the highlighted players, e.g. Trevor Bauer, so I have added variables to hold them in case they are needed.
Ad boxes and tools boxes are excluded via the :not pseudo-class, which is passed a list of classes to ignore.
from bs4 import BeautifulSoup as bs
import requests

r = requests.get('https://www.rotowire.com/baseball/daily-lineups.php')
soup = bs(r.content, 'lxml')
team_dict = {}
teams = [item.text for item in soup.select('.lineup__abbr')]  # 26 team abbreviations on the page
matches = {}
i = 0
for teambox in soup.select('.lineups > div:not(.is-ad, .is-tools)'):
    team_visit = teams[i]
    team_home = teams[i + 1]
    highlights = teambox.select('.lineup__player-highlight-name a')
    visit_highlight = highlights[0].text
    home_highlight = highlights[1].text
    match = team_visit + ' v ' + team_home
    visitors = [item['title'] for item in teambox.select('.is-visit .lineup__player [title]')]
    home = [item['title'] for item in teambox.select('.is-home .lineup__player [title]')]
    matches[match] = {'visitor': [{team_visit: visitors}],
                      'home': [{team_home: home}]}
    i += 2  # each game box consumes two entries from teams (visitor and home)
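To eyeball the resulting structure, pretty-printing works well (a small usage sketch; json is used here purely for readable output):
import json
print(json.dumps(matches, indent=2))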

I think you were almost there, you just needed to tweak it a little bit:
awayPlayers = [player.find('a').text for player in players]
This list comprehension grabs the anchor from each list item and then pulls the text from it, so you get just a list of the names:
['L. Martin',
'Jose Ramirez',
'J. Luplow'...]

You have to find the a tag and get its title attribute; see the code below.
awayPlayers = [player.find('a').get('title') for player in players]
print(awayPlayers)
Output is:
['Leonys Martin', 'Jose Ramirez', 'Jordan Luplow', 'Carlos Santana',

Related

Why is for looping not looping?

I'm new to programming and cannot figure out why this won't loop. It prints and converts the first item exactly how I want, but stops after the first iteration.
from bs4 import BeautifulSoup
import requests
import re
import json

url = 'http://books.toscrape.com/'
page = requests.get(url)
html = BeautifulSoup(page.content, 'html.parser')
section = html.find_all('ol', class_='row')
for books in section:
    # Title Element
    header_element = books.find("article", class_='product_pod')
    title_element = header_element.img
    title = title_element['alt']
    # Price Element
    price_element = books.find(class_='price_color')
    price_str = str(price_element.text)
    price = price_str[1:]
    # Create JSON
    final_results_json = {"Title": title, "Price": price}
    final_result = json.dumps(final_results_json, sort_keys=True, indent=1)
    print(title)
    print(price)
    print()
    print(final_result)
First, clarify what you are looking for: probably you wish to print the title, price and final_result for every book scraped from books.toscrape.com. The code is working as written, though the expectation is different. Notice that you are finding all the "ol" tags with class name "row", and there is just one such element on the page; thus section has only one element and the for loop iterates just once.
How to debug it?
Check the type of section: type(section)
Print section to see what it contains
Write some print statements in the for loop to understand what happens when
It isn't hard to debug this one.
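Applied to this code, those checks might look like the following (a quick sketch):
print(type(section))  # <class 'bs4.element.ResultSet'>
print(len(section))   # 1, because only one <ol class="row"> exists on the page
for books in section:
    print(len(books.find_all('article')))  # how many product_pod articles sit inside that one <ol>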
You need to change:
section = html.find_all('li', class_='col-xs-6 col-sm-4 col-md-3 col-lg-3')
There is only one <ol> in that document.
I think you want
for book in section[0].find_all('li'):
ol means ordered list, of which there is one in this case; there are many li (list item) elements inside that ol. A corrected version of the loop is sketched below.
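Putting that together, the corrected loop might look like this (a sketch reusing the imports, html and json objects from the question):
for book in html.find_all('ol', class_='row')[0].find_all('li'):
    header_element = book.find("article", class_='product_pod')
    title = header_element.img['alt']
    price = book.find(class_='price_color').text[1:]  # drop the currency symbol
    print(json.dumps({"Title": title, "Price": price}, sort_keys=True, indent=1))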

Web scraping with bs4 python: How to display football matchups

I'm a beginner to Python and am trying to create a program that will scrape the football/soccer schedule from skysports.com and send it via SMS to my phone through Twilio. I've excluded the SMS code because I have that figured out, so here's the web scraping code I am stuck on so far:
import requests
from bs4 import BeautifulSoup
from collections import defaultdict  # needed for the defaultdict below

URL = "https://www.skysports.com/football-fixtures"
page = requests.get(URL)
results = BeautifulSoup(page.content, "html.parser")
d = defaultdict(list)
comp = results.find('h5', {"class": "fixres__header3"})
team1 = results.find('span', {"class": "matches__item-col matches__participant matches__participant--side1"})
date = results.find('span', {"class": "matches__date"})
team2 = results.find('span', {"class": "matches__item-col matches__participant matches__participant--side2"})
for ind in range(len(d)):
    d['comp'].append(comp[ind].text)
    d['team1'].append(team1[ind].text)
    d['date'].append(date[ind].text)
    d['team2'].append(team2[ind].text)
The code below should do the trick for you:
from bs4 import BeautifulSoup
import requests

a = requests.get('https://www.skysports.com/football-fixtures')
soup = BeautifulSoup(a.text, features="html.parser")
teams = []
for date in soup.find_all(class_="fixres__header2"):  # searching in that date
    for i in soup.find_all(class_="swap-text--bp30")[1:]:  # skips the first one because that's a heading
        teams.append(i.text)
date = soup.find(class_="fixres__header2").text
print(date)
teams = [i.strip('\n') for i in teams]
for x in range(0, len(teams), 2):
    print(teams[x] + " vs " + teams[x + 1])
Let me further explain what I have done:
All the football teams have this class name: swap-text--bp30.
So we can use find_all to extract all the elements with that class.
Once we have our results we can put them into an array, teams = [], then append to it in a for loop, teams.append(i.text); .text strips away the html.
Then we can get rid of the "\n" in the array by stripping each string, and print out the strings in the array two by two.
This should give you your final output: one "TeamA vs TeamB" line per fixture.
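As an aside, pairing consecutive entries can also be done with zip instead of manual index arithmetic (an alternative sketch, not part of the original answer):
for t1, t2 in zip(teams[::2], teams[1::2]):
    print(t1 + " vs " + t2)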
EDIT: To scrape the titles of the leagues we do pretty much the same:
league = []
for date in soup.find_all(class_="fixres__header2"):  # searching in that date
    for i in soup.find_all(class_="fixres__header3"):  # the league headings
        league.append(i.text)
Strip the array and create another one:
league = [i.strip('\n') for i in league]
final = []
Then add this final bit of code, which essentially just prints each league and then the teams two by two:
for x in range(0, len(teams), 2):
    final.append(teams[x] + " vs " + teams[x + 1])
for i in league:
    print(i)
for i in final:
    print(i)

python - How would I scrape this website for specific data that's constantly changing/being updated?

the website is:
https://pokemongo.gamepress.gg/best-attackers-type
my code is as follows, for now:
from bs4 import BeautifulSoup
import requests
import re

site = 'https://pokemongo.gamepress.gg/best-attackers-type'
headers = {'User-Agent': 'Mozilla/5.0'}  # the original snippet referenced headers without defining it; a minimal placeholder
page_data = requests.get(site, headers=headers)
soup = BeautifulSoup(page_data.text, 'html.parser')
check_gamepress = soup.body.findAll(text=re.compile("Strength"))
print(check_gamepress)
However, I want to scrape certain specific data, and I'm having trouble doing it.
For example, how would I scrape the portion that shows the following for the best Bug type:
"Good typing and lightning-fast attacks. Though cool-looking, Scizor is somewhat fragile."
This information could obviously be updated, as it has been in the past when a better Pokemon comes out for that type. So, how would I scrape this data, which will probably be updated in the future, without having to make code changes when that occurs?
In advance, thank you for reading!
This particular site is a bit tough because of how the HTML is organized. The relevant tags containing the information don't really have many distinguishing features, so we have to get a little clever. To complicate matters, the divs that contain the information across the whole page are siblings. We'll also have to make up for this web-design travesty with some ingenuity.
I did notice a pattern that is (almost entirely) consistent throughout the page. Each 'type' and underlying section are broken into 3 divs:
A div containing the type and pokemon, for example Dark Type: Tyranitar.
A div containing the 'specialty' and moves.
A div containing the 'ratings' and commentary.
The basic idea that follows here is that we can begin to organize this markup chaos through a procedure that loosely goes like this:
Identify each of the type title divs
For each of those divs, get the other two divs by accessing its siblings
Parse the information out of each of those divs
With this in mind, I produced a working solution. The meat of the code consists of five functions: one to find each type section, one to extract the sibling divs, and three to parse the information out of each of those divs.
import re
import json
import requests
from pprint import pprint
from bs4 import BeautifulSoup

def type_section(tag):
    """Find the tags that have the move type and pokemon name"""
    pattern = r"[A-z]{3,} Type: [A-z]{3,}"
    # if all these things are true, it should be the right tag
    return all((tag.name == 'div',
                len(tag.get('class', '')) == 1,
                'field__item' in tag.get('class', []),
                re.findall(pattern, tag.text),
                ))

def parse_type_pokemon(tag):
    """Parse out the move type and pokemon from the tag text"""
    s = tag.text.strip()
    poke_type, pokemon = s.split(' Type: ')
    return {'type': poke_type, 'pokemon': pokemon}

def parse_speciality(tag):
    """Parse the tag containing the speciality and moves"""
    table = tag.find('table')
    rows = table.find_all('tr')
    speciality_row, fast_row, charge_row = rows
    speciality_types = []
    for anchor in speciality_row.find_all('a'):
        # Each type 'badge' has a href with the type name at the end
        href = anchor.get('href')
        speciality_types.append(href.split('#')[-1])
    fast_move = fast_row.find('td').text
    charge_move = charge_row.find('td').text
    return {'speciality': speciality_types,
            'fast_move': fast_move,
            'charge_move': charge_move}

def parse_rating(tag):
    """Parse the tag containing categorical ratings and commentary"""
    table = tag.find('table')
    category_tags = table.find_all('th')
    strength_tag, meta_tag, future_tag = category_tags
    str_rating = strength_tag.parent.find('td').text.strip()
    meta_rating = meta_tag.parent.find('td').text.strip()
    future_rating = future_tag.parent.find('td').text.strip()
    blurb_tags = table.find_all('td', {'colspan': '2'})
    if blurb_tags:
        # `if` to accommodate fire section bug
        str_blurb_tag, meta_blurb_tag, future_blurb_tag = blurb_tags
        str_blurb = str_blurb_tag.text.strip()
        meta_blurb = meta_blurb_tag.text.strip()
        future_blurb = future_blurb_tag.text.strip()
    else:
        str_blurb = meta_blurb = future_blurb = None
    return {'strength': {
                'rating': str_rating,
                'commentary': str_blurb},
            'meta': {
                'rating': meta_rating,
                'commentary': meta_blurb},
            'future': {
                'rating': future_rating,
                'commentary': future_blurb}
            }

def extract_divs(tag):
    """
    Get the divs containing the moves/ratings
    determined based on sibling position from the type tag
    """
    _, speciality_div, _, rating_div, *_ = tag.next_siblings
    return speciality_div, rating_div

def main():
    """All together now"""
    url = 'https://pokemongo.gamepress.gg/best-attackers-type'
    response = requests.get(url)
    soup = BeautifulSoup(response.text, 'lxml')
    types = {}
    for type_tag in soup.find_all(type_section):
        type_info = {}
        type_info.update(parse_type_pokemon(type_tag))
        speciality_div, rating_div = extract_divs(type_tag)
        type_info.update(parse_speciality(speciality_div))
        type_info.update(parse_rating(rating_div))
        type_ = type_info.get('type')
        types[type_] = type_info
    pprint(types)  # We did it
    with open('pokemon.json', 'w') as outfile:
        json.dump(types, outfile)

if __name__ == '__main__':
    main()
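Once the script has run, the saved file can be loaded back for a quick sanity check (a small usage sketch; the 'Bug' key assumes that type was present in the scraped page):
import json

with open('pokemon.json') as infile:
    types = json.load(infile)
print(types['Bug'])  # e.g. the Scizor entry mentioned in the question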
There is, for now, one small wrench in the whole thing. Remember when I said this pattern was almost entirely consistent? Well, the Fire type is an odd-ball here, because they included two pokemon for that type, so the Fire type results are not correct. I or some brave person may come up with a way to deal with that. Or maybe they'll decide on one fire pokemon in the future.
This code, the resulting json (prettified), and an archive of HTML response used can be found in this gist.

Findall to div tag using beautiful soup yields blank return

<div class="columns small-5 medium-4 cell header">Ref No.</div>
<div class="columns small-7 medium-8 cell">110B60329</div>
Website is https://www.saa.gov.uk/search/?SEARCHED=1&ST=&SEARCH_TERM=city+of+edinburgh%2C+BOSWALL+PARKWAY%2C+EDINBURGH&ASSESSOR_ID=&SEARCH_TABLE=valuation_roll_cpsplit&DISPLAY_COUNT=10&TYPE_FLAG=CP&ORDER_BY=PROPERTY_ADDRESS&H_ORDER_BY=SET+DESC&DRILL_SEARCH_TERM=BOSWALL+PARKWAY%2C+EDINBURGH&DD_TOWN=EDINBURGH&DD_STREET=BOSWALL+PARKWAY&UARN=110B60329&PPRN=000000000001745&ASSESSOR_IDX=10&DISPLAY_MODE=FULL#results
I would like to run a loop and return '110B60329'. I have run Beautiful Soup and done a find_all('div'), then defined the two different tags as head and data based on their class. I then iterated through the 'head' tags, hoping it would return the info in the div tag I have defined as data.
Python returns a blank (the cmd prompt just reprinted the file path).
Would anyone kindly know how I might fix this? My full code is below.....thanks
import requests
from bs4 import BeautifulSoup as soup
import csv

url = 'https://www.saa.gov.uk/search/?SEARCHED=1&ST=&SEARCH_TERM=city+of+edinburgh%2C+BOSWALL+PARKWAY%2C+EDINBURGH&ASSESSOR_ID=&SEARCH_TABLE=valuation_roll_cpsplit&DISPLAY_COUNT=10&TYPE_FLAG=CP&ORDER_BY=PROPERTY_ADDRESS&H_ORDER_BY=SET+DESC&DRILL_SEARCH_TERM=BOSWALL+PARKWAY%2C+EDINBURGH&DD_TOWN=EDINBURGH&DD_STREET=BOSWALL+PARKWAY&UARN=110B60329&PPRN=000000000001745&ASSESSOR_IDX=10&DISPLAY_MODE=FULL#results'
baseurl = 'https://www.saa.gov.uk'
session = requests.session()
response = session.get(url)
# content of search page in soup
html = soup(response.content, "lxml")
properties_col = html.find_all('div')
for col in properties_col:
    ref = 'n/a'
    des = 'n/a'
    head = col.find_all("div", {"class": "columns small-5 medium-4 cell header"})
    data = col.find_all("div", {"class": "columns small-7 medium-8 cell"})
    for i, elem in enumerate(head):
        # for i in range(elems):
        if head[i].text == "Ref No.":
            ref = data[i].text
            print(ref)
You can do this in two ways.
1) If you are sure that the website you are scraping won't change its content, you can find all divs by that class and get the content by providing an index.
2) Find all left-side divs (the titles), and if one of them matches what you want, get the next sibling to get the text.
Example:
import requests
from bs4 import BeautifulSoup as soup

url = 'https://www.saa.gov.uk/search/?SEARCHED=1&ST=&SEARCH_TERM=city+of+edinburgh%2C+BOSWALL+PARKWAY%2C+EDINBURGH&ASSESSOR_ID=&SEARCH_TABLE=valuation_roll_cpsplit&DISPLAY_COUNT=10&TYPE_FLAG=CP&ORDER_BY=PROPERTY_ADDRESS&H_ORDER_BY=SET+DESC&DRILL_SEARCH_TERM=BOSWALL+PARKWAY%2C+EDINBURGH&DD_TOWN=EDINBURGH&DD_STREET=BOSWALL+PARKWAY&UARN=110B60329&PPRN=000000000001745&ASSESSOR_IDX=10&DISPLAY_MODE=FULL#results'
baseurl = 'https://www.saa.gov.uk'
session = requests.session()
response = session.get(url)
# content of search page in soup
html = soup(response.content, "lxml")

# Method 1
LeftBlockData = html.find_all("div", class_="columns small-7 medium-8 cell")
Reference = LeftBlockData[0].get_text().strip()
Description = LeftBlockData[2].get_text().strip()
print(Reference)
print(Description)

# Method 2
for column in html.find_all("div", class_="columns small-5 medium-4 cell header"):
    RightColumn = column.next_sibling.next_sibling.get_text().strip()
    if "Ref No." in column.get_text().strip():
        print(RightColumn)
    if "Description" in column.get_text().strip():
        print(RightColumn)
The prints will output (in order):
110B60329
STORE
110B60329
STORE
Your problem is that you are trying to match node text that contains a lot of tabs and newlines against a non-spaced string.
For example, your head[i].text variable contains 'Ref No.' surrounded by whitespace, so if you compare it with 'Ref No.' it'll give a false result. Stripping it will solve this.
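In other words, the fix in the question's loop can be as small as stripping before comparing:
if head[i].text.strip() == "Ref No.":
    ref = data[i].text.strip()
    print(ref)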
import requests
from bs4 import BeautifulSoup
r = requests.get("https://www.saa.gov.uk/search/?SEARCHED=1&ST=&SEARCH_TERM=city+of+edinburgh%2C+BOSWALL+PARKWAY%2C+EDINBURGH&ASSESSOR_ID=&SEARCH_TABLE=valuation_roll_cpsplit&DISPLAY_COUNT=10&TYPE_FLAG=CP&ORDER_BY=PROPERTY_ADDRESS&H_ORDER_BY=SET+DESC&DRILL_SEARCH_TERM=BOSWALL+PARKWAY%2C+EDINBURGH&DD_TOWN=EDINBURGH&DD_STREET=BOSWALL+PARKWAY&UARN=110B60329&PPRN=000000000001745&ASSESSOR_IDX=10&DISPLAY_MODE=FULL#results")
soup = BeautifulSoup(r.text, 'lxml')
for row in soup.find_all(class_='table-row'):
    print(row.get_text(strip=True, separator='|').split('|'))
out:
['Ref No.', '110B60329']
['Office', 'LOTHIAN VJB']
['Description', 'STORE']
['Property Address', '29 BOSWALL PARKWAY', 'EDINBURGH', 'EH5 2BR']
['Proprietor', 'SCOTTISH MIDLAND CO-OP SOCIETY LTD.']
['Tenant', 'PROPRIETOR']
['Occupier']
['Net Annual Value', '£1,750']
['Marker']
['Rateable Value', '£1,750']
['Effective Date', '01-APR-10']
['Other Appeal', 'NO']
['Reval Appeal', 'NO']
get_text() is a very powerful tool: you can strip the whitespace and put a separator in the text.
You can use this method to get clean data and filter it.
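A tiny standalone demonstration of that behavior (hypothetical markup, just to show what strip and separator do):
from bs4 import BeautifulSoup

snippet = '<div>  Ref No.\n  <span>110B60329</span>\n</div>'
row = BeautifulSoup(snippet, 'html.parser')
print(row.get_text(strip=True, separator='|').split('|'))  # ['Ref No.', '110B60329']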

Get URL, Title and Description only if Title or Description contains %string%

I have a textfile containing the URLs of some RSS feeds. I would like to find out which of these URLs has a title or description (or any other tag) containing certain strings (a list of words).
As of now, I am able to get the URL, title and headline (and so on). Not really sure how to proceed though. I guess I would check the tags with regex. If I checked a URL's title and found a word match, how would I then retrieve the URL again? The URL needs to be connected to the tags, like in a .csv. Bit confused here. Maybe someone can point me in the right direction?
My path so far:
import requests
from bs4 import BeautifulSoup
rssfeed = open('input.txt')
rss_source = rssfeed.read()
rss_sources = rss_source.split()
i=0
while i < len(rss_sources):
    get_rss = requests.get(rss_sources[i])
    rss_soup = BeautifulSoup(get_rss.text, 'html.parser')
    rss_urls = rss_soup.find_all('link')
    i = i + 1
for url in rss_urls:
    rss_all_urls = url.text
    open_urls = requests.get(rss_all_urls)
    target_urls_soup = BeautifulSoup(open_urls.text, 'html.parser')
    urls_titles = target_urls_soup.title
    urls_headlines = target_urls_soup.h1
    print(rss_all_urls, urls_titles, urls_headlines)
So you want to have an array of URLs.
That array should contain only the URLs that meet a condition:
- the title of that URL matches one of the strings contained in another array
So first you need your arrays:
titlesToMatch = ['title1', 'title2', 'title3']
finalArrayWithURLs = []
Then, when you have your rss_all_urls, urls_titles and urls_headlines for a URL, you include in finalArrayWithURLs just those that match one of the titles in titlesToMatch:
for url in rss_urls:
    rss_all_urls = url.text
    open_urls = requests.get(rss_all_urls)
    target_urls_soup = BeautifulSoup(open_urls.text, 'html.parser')
    urls_titles = target_urls_soup.title
    urls_headlines = target_urls_soup.h1
    if urls_titles and any(item in urls_titles.text for item in titlesToMatch):
        finalArrayWithURLs.append(rss_all_urls)  # Python lists use append, not push
So after that, finalArrayWithURLs will contain just those URLs whose title matches one of the titles in your titlesToMatch array.
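Since the question asks for the URL to stay connected to its tags, like in a .csv, the matches can then be written out with the csv module (a sketch; the column names are illustrative, not from the question):
import csv

matched_rows = []  # fill inside the loop above, e.g.:
# matched_rows.append((rss_all_urls, urls_titles.text, urls_headlines.text if urls_headlines else ''))

with open('matches.csv', 'w', newline='') as f:
    writer = csv.writer(f)
    writer.writerow(['url', 'title', 'headline'])
    writer.writerows(matched_rows)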
