Scraping multiple pages into list with beautifulsoup - python

I wrote a scraper using BeautifulSoup 4 in Python that iterates through multiple pages of cryptocurrency values and returns the opening, highest, and closing values. The scraping part works fine, but I can't get it to save all of the currencies into my lists; only the last one gets added.
Can anyone help me out on how to save all of them? I've done hours of searching and can't seem to find a relevant answer. The code is as follows:
no_space = name_15.str.replace('\s+', '-')

#lists out the pages to scrape
for n in no_space:
    page = 'https://coinmarketcap.com/currencies/' + n + '/historical-data/'
    http = lib.PoolManager()
    response = http.request('GET', page)
    soup = BeautifulSoup(response.data, "lxml")
    main_table = soup.find('tbody')

    date = []
    open_p = []
    high_p = []
    low_p = []
    close_p = []
    table = []
    for row in main_table.find_all('td'):
        table_pull = row.find_all_previous('td')  #other find methods aren't returning what I need, but this works just fine
        table = [p.text.strip() for p in table_pull]

    date = table[208:1:-7]
    open_p = table[207:1:-7]
    high_p = table[206:1:-7]
    low_p = table[205:1:-7]
    close_p = table[204:0:-7]

    df = pd.DataFrame(date, columns=['Date'])
    df['Open'] = list(map(float, open_p))
    df['High'] = list(map(float, high_p))
    df['Low'] = list(map(float, low_p))
    df['Close'] = list(map(float, close_p))
    print(df)

Simply put, it looks like you are accessing all 'td' elements and then attempting to access the previous elements of that list, which is unnecessary. Also, as @hoefling pointed out, you are continuously overwriting your variables inside your loop, which is why you only end up with the last element (in other words, only the last iteration of your loop sets the value of those variables; all previous ones are overwritten). Apologies, I cannot test this out currently due to firewalls on my machine. Try the following:
no_space = name_15.str.replace('\s+', '-')

#lists out the pages to scrape
for n in no_space:
    page = 'https://coinmarketcap.com/currencies/' + n + '/historical-data/'
    http = lib.PoolManager()
    response = http.request('GET', page)
    soup = BeautifulSoup(response.data, "lxml")
    main_table = soup.find('tbody')

    table = [p.text.strip() for p in main_table.find_all('td')]

    #You will need to re-think these indices here to get the info you want
    date = table[208:1:-7]
    open_p = table[207:1:-7]
    high_p = table[206:1:-7]
    low_p = table[205:1:-7]
    close_p = table[204:0:-7]

    df = pd.DataFrame(date, columns=['Date'])
    df['Open'] = list(map(float, open_p))
    df['High'] = list(map(float, high_p))
    df['Low'] = list(map(float, low_p))
    df['Close'] = list(map(float, close_p))
    print(df)
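If the end goal is one object holding every currency rather than a separate printout per page, a further option (just a sketch, assuming lib is urllib3 as in the snippets above, that no_space is the same iterable of URL slugs, and that the slice indices still match the page layout) is to collect one DataFrame per coin and concatenate them at the end:

import urllib3 as lib
import pandas as pd
from bs4 import BeautifulSoup

all_frames = []                     # accumulate one DataFrame per currency instead of overwriting
for n in no_space:
    page = 'https://coinmarketcap.com/currencies/' + n + '/historical-data/'
    http = lib.PoolManager()
    response = http.request('GET', page)
    soup = BeautifulSoup(response.data, "lxml")
    table = [p.text.strip() for p in soup.find('tbody').find_all('td')]

    df = pd.DataFrame(table[208:1:-7], columns=['Date'])
    df['Open'] = list(map(float, table[207:1:-7]))
    df['High'] = list(map(float, table[206:1:-7]))
    df['Low'] = list(map(float, table[205:1:-7]))
    df['Close'] = list(map(float, table[204:0:-7]))
    df['Currency'] = n              # remember which coin this frame belongs to
    all_frames.append(df)

result = pd.concat(all_frames, ignore_index=True)   # one DataFrame covering all currencies
print(result)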

Related

Extracting Hyperlinks from Basketball Reference Column (On Pages With Multiple Tables) to New Column

I have recently been working on a scraper for NBA MVPs from Basketball Reference and hope to incorporate the embedded player-page hyperlinks into a new column in the same row as the player. I have done the same when scraping other pages, but unfortunately, due to the many tables and their indeterminate order, my prior method returns many random links from around the page. Below is my code; the 'Playerlinks' column is the one in question. While these links are acceptably formatted, they are simply not the correct ones, as stated above. The table itself is perfectly fine; only this column is the problem.
import requests
from bs4 import BeautifulSoup
import pandas as pd

url = 'https://www.basketball-reference.com/awards/awards_2022.html'
html = requests.get(url).text.replace('<!--', '').replace('-->', '')
soup = BeautifulSoup(html, "html.parser")
tabs = soup.select('table[id*="mvp"]')

for tab in tabs:
    cols, players = [], []
    for s in tab.select('thead tr:nth-child(2) th'):
        cols.append(s.text)
    for j in (tab.select('tbody tr, tfoot tr')):
        player = [dat.text for dat in j.select('td,th')]
        players.append(player)

    links = []
    for link in soup.findAll('table')[1].findAll('a'):
        url = link.get('href')
        links.append(url)

    my_list = links
    substring = '/friv/'
    new_list = [item for item in my_list if not item.startswith(substring)]
    for item in my_list.copy():
        if substring in item:
            my_list.remove(item)

    max_length = max(len(player) for player in players)  # widest row, used for the padding below
    players_plus = [player + [""] * (max_length - len(player)) for player in players]
    df = pd.DataFrame(players_plus, columns=cols)
    df['Playerlinks'] = my_list
    print(df.to_markdown())
So, essentially I am asking whether anyone is aware of a method to scrape just these hyperlinks (12 in the given example) and put them into an ordered list (to be put into a column), or of any better approach. My expected output for this link in particular would be a first-row value of "/players/j/jokicni01.html", a second of "/players/e/embiijo01.html", and so on, corresponding with their respective players. I have tried many methods using ids, find_alls, and others, but unfortunately my HTML knowledge is very limited and I am starting to go in circles. Thank you in advance for any help you can provide.
Does something like this work for you? You can extract the profile links while you are iterating through the player rows.
for tab in tabs:
    cols, players = [], []
    for s in tab.select('thead tr:nth-child(2) th'):
        cols.append(s.text)
    for j in (tab.select('tbody tr, tfoot tr')):
        player = [dat.text for dat in j.select('td,th')]
        # get player link
        player.append(j.find('a')['href'])
        players.append(player)
    df = pd.DataFrame(players, columns=cols + ["Playerlinks"])
    print(df)
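One caveat, based on an assumption about the table contents rather than something verified here: if any selected row (for example a footer row) has no a tag, j.find('a') returns None and indexing it raises a TypeError. A defensive variant of that inner loop could be:

    for j in tab.select('tbody tr, tfoot tr'):
        player = [dat.text for dat in j.select('td,th')]
        a = j.find('a')                        # may be None for rows without a hyperlink
        player.append(a['href'] if a else "")  # fall back to an empty string in that case
        players.append(player)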

Noticing a warning to limit scraped results with BeautifulSoup in Python

I am trying to scrape sales data for recently sold items from eBay with BeautifulSoup in Python, and it works very well with the following code, which finds all prices and dates of sold items.
price = []
try:
    p = soup.find_all('span', class_='POSITIVE')
except:
    p = 'nan'

for x in p:
    x = str(x)
    x = x.replace(' ', '"')
    x = x.split('"')
    if '>Sold' in x:
        continue
    else:
        price.append(x)
Now I am running into a problem, though. For this URL (https://www.ebay.com/sch/i.html?_from=R40&_trksid=p2334524.m570.l1313&_nkw=babe+ruth+1933+goudey+149+psa+%281.5%29&_sacat=0&LH_TitleDesc=0&_osacat=0&_odkw=babe+ruth+1933+goudey+149+psa+1.5&LH_Complete=1&rt=nc&LH_Sold=1), eBay sometimes suggests other search results if there are not enough for the specific search query.
Because of that, my code finds not only the correct prices but also those of the suggested results below the warning. I tried to find out where the warning message is located and delete every listing found after it, but I cannot figure it out. I also thought I could check the prices one by one, but even then I cannot figure out how to detect when the warning appears.
Is there any other way you guys can think of to solve this?
I am aware that this is really specific
You can scrape the number of results shown on the page and loop over that range.
The code will be something like:
results = soup.find...
#You have to make the variable an int, so strip out everything extra
results = int(results)

for i in range(1, results):
    price[i] = str(price[i])
    price[i] = price[i].replace(' ', '"')
    price[i] = price[i].split()
    if '>Sold' in price[i]:
        continue
    else:
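A fuller sketch of that idea, with the assumptions flagged: the selector used for the result count (srp-controls__count-heading) is a guess that needs checking against the actual eBay markup, and the original POSITIVE-span logic is kept. The count is parsed first, the "Sold <date>" spans are skipped as before, and the price list is then cut down to the first N genuine results:

price = []
# the selector below is an assumption; inspect the eBay page to find the element
# that holds the "N results" text and adjust accordingly
count_tag = soup.find('h1', class_='srp-controls__count-heading')
results = int(count_tag.get_text(strip=True).split()[0].replace(',', ''))

for x in soup.find_all('span', class_='POSITIVE'):
    text = x.get_text(strip=True)
    if text.startswith('Sold'):       # skip the "Sold <date>" spans, as in the original code
        continue
    price.append(text)

price = price[:results]               # drop prices that belong to the suggested results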

How to append dictionaries to a list without overwriting data

I'm having a bit of trouble with a function I'm trying to write. What it is supposed to do is 1) go to a particular URL and get a list of financial sectors stored in a particular div; 2) visit each sector's respective page and get 3 particular pieces of information from there; 3) put the gathered collection into a dictionary; and 4) append that dictionary to another dictionary.
The desired output is a dictionary containing a list of dictionaries for all the sectors.
Here is my function:
def fsr():
    fidelity_sector_report = dict()
    url = "https://eresearch.fidelity.com/eresearch/goto/markets_sectors/landing.jhtml"
    import requests
    from bs4 import BeautifulSoup
    # scrape the url page and locate links for each sector
    try:
        response = requests.get(url)
        if not response.status_code == 200:
            return 'Main page error'
        page = BeautifulSoup(response.content, "lxml")
        sectors = page.find_all('a', class_="heading1")
        for sector in sectors:
            link = 'https://eresearch.fidelity.com/' + sector['href']
            name = sector.text
            sect = dict()
            lst = []
            # scrape target pages for required information
            try:
                details = requests.get(link)
                if not details.status_code == 200:
                    return 'Details page error'
                details_soup = BeautifulSoup(details.content, 'lxml')
                fundamentals = details_soup.find('div', class_='sec-fundamentals')
                values = dict()
                # locate required values by addressing <tr> text and put them in a dictionary
                values['Enterprise Value'] = fundamentals.select_one('th:contains("Enterprise Value") + td').text.strip()
                values['Return on Equity (TTM)'] = fundamentals.select_one('th:contains("Return on Equity (TTM)") + td').text.strip()
                values['Dividend Yield'] = fundamentals.select_one('th:contains("Dividend Yield") + td').text.strip()
                # add values to the sector dictionary
                sect[name] = values
                # add the dictionary to the list
                lst.append(dict(sect))
                # form a dictionary using the list
                fidelity_sector_report['results'] = lst
            except:
                return 'Something is wrong with details request'
        return fidelity_sector_report
    except:
        return "Something is horribly wrong"
As far as I can tell, it performs the main task wonderfully, and the problem appears at the stage of appending the formed dictionary to the list: instead of adding a new piece, it gets overwritten completely. I figured that out by putting print(lst) right after the fidelity_sector_report['results'] = lst line.
What should I change so that the list (and, correspondingly, the dictionary) gets formed as planned?
You should move the lst = [] outside of the sectors loop.
The problem appears because, for each sector, you reset lst and then append the current sector's data to an empty list.
The following code causes the value of fidelity_sector_report['results'] to be replaced with lst.
fidelity_sector_report['results'] = lst
I presume you would want to access the respective values using a key. You can add the following line below fidelity_sector_report = dict() to initialize a dictionary:
fidelity_sector_report['results'] = {}
Then, create a key for each sector using the sector name and set the value with your values dictionary by replacing fidelity_sector_report['results'] = lst with:
fidelity_sector_report['results'][name] = dict(values)
You can access the data by using the relevant keys, e.g. fidelity_sector_report['results']['Financials']['Dividend Yield'] for the dividend yield of the Financials sector.
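Putting both suggestions together, a condensed sketch of the corrected structure might look like the following (the original error handling is left out for brevity, and the selectors are kept exactly as in the question):

import requests
from bs4 import BeautifulSoup

def fsr():
    fidelity_sector_report = {'results': {}}    # initialise the results dict once, before the loop
    url = "https://eresearch.fidelity.com/eresearch/goto/markets_sectors/landing.jhtml"
    page = BeautifulSoup(requests.get(url).content, "lxml")
    for sector in page.find_all('a', class_="heading1"):
        link = 'https://eresearch.fidelity.com/' + sector['href']
        name = sector.text
        details_soup = BeautifulSoup(requests.get(link).content, 'lxml')
        fundamentals = details_soup.find('div', class_='sec-fundamentals')
        values = {}
        for label in ('Enterprise Value', 'Return on Equity (TTM)', 'Dividend Yield'):
            cell = fundamentals.select_one('th:contains("%s") + td' % label)
            values[label] = cell.text.strip() if cell else None
        fidelity_sector_report['results'][name] = values   # keyed by sector name, never overwritten
    return fidelity_sector_report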

Web scraping Loop python issue

I am a Python newbie and wondering if someone could highlight where I am going wrong with the following web-scraping script.
I am trying to loop through the list of matches to pull a cumulative value (metric) for each match.
My problem is that it returns the exact same value each time.
I've tried to add notes to explain each of my points; any help is appreciated.
#use Selenium & Beautiful Soup
from selenium import webdriver
import time
from bs4 import BeautifulSoup

#define URL/driver
my_url = "https://www.bet365.com/#/IP/"
driver = webdriver.Edge()
driver.get(my_url)

#allow a sleep of 10 seconds
time.sleep(10)

#parse the page
pSource = driver.page_source
soup = BeautifulSoup(pSource, "html.parser")

#containers tag - per match
containers = soup.findAll("div", {"class": "ipn-TeamStack "})
for container in containers:
    #Total Match Shots
    cumul_match_shots = 0
    match = container.find_all('div')
    for data in soup.findAll('div', {'class': 'ml1-SoccerStatsBar '}):
        for result in data.find_all('span'):
            a = result.text
            if len(a) > 0:
                cumul_match_shots += int(a)

    #print out values
    print(match)
    print(cumul_match_shots)

#close the webpage
driver.close()
I think you need to change the indentation (and move it up a little) of print(cumul_match_shots); in its current state it will always print the value from the last iteration of the for loop.
I am also not sure you are resetting the value to 0 in the right place. Currently it looks like it will be a cumulative value across ALL matches.
As for match, it should be fine, as you do not modify it in the for loops.
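If the intent is one total per match, the inner loops need to search within each match's own element rather than the whole soup. A sketch of that idea (it assumes the stats bar is nested inside the container being iterated, which needs checking against the actual markup; on the real page you may need to walk up to a common parent element instead):

for container in containers:
    cumul_match_shots = 0
    # search within this match's container rather than the whole page,
    # otherwise every match accumulates the totals from all matches
    for data in container.find_all('div', {'class': 'ml1-SoccerStatsBar '}):
        for result in data.find_all('span'):
            text = result.text
            if text.isdigit():          # only add numeric values
                cumul_match_shots += int(text)
    print(cumul_match_shots)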

Python when checking if scraped element exists in list

I keep getting an error when using an if/else statement in Python. I want my script to check whether an index exists: if it does, run one piece of code; if not, run another. I get the error ValueError: 'Named Administrator' is not in list.
import requests
from bs4 import BeautifulSoup

url_3 = 'https://www.brightscope.com/form-5500/basic-info/107299/Orthopedic-Institute-Of-Pennsylvania/15801790/Orthopedic-Institute-Of-Pennsylvania-401k-Profit-Sharing-Plan/'
page = requests.get(url_3)
soup = BeautifulSoup(page.text, 'html.parser')
divs = [e.get_text() for e in soup.findAll('span')]

if divs.index('Named Administrator'):
    index = divs.index('Named Administrator')
    contact = divs[index + 1]
else:
    contact = '-'
Rather than doing index, do a __contains__ test:
if 'Named Administrator' in divs:
and move forward only if 'Named Administrator' actually exists in the divs list, so you won't get the ValueError.
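Applied to the snippet above, the check would read:

if 'Named Administrator' in divs:
    index = divs.index('Named Administrator')
    contact = divs[index + 1]
else:
    contact = '-'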
Another consideration is that membership tests on lists have O(N) time complexity, so if you are doing this for a large list, it is probably better to use a set instead:
{e.get_text() for e in soup.findAll('span')}
but as sets are unordered, you won't be able to use indexing.
So either think of an approach that would work on sets as well, i.e. one that does not need to get the next value by indexing,
or use a set for the membership test and the list for getting the next value. The cost here might be higher or lower depending on your actual context, and you can only find that out by profiling:
divs_list = [e.get_text() for e in soup.findAll('span')]
divs_set = set(divs_list)

if 'Named Administrator' in divs_set:
    index = divs_list.index('Named Administrator')
    contact = divs_list[index + 1]
