Trying to use bs4 to get some info from Wikipedia - python

I started to learn Python this year as a New Year's resolution ;P I encountered some problems when self-learning web scraping. This may be a dumb question, but I hope someone can point out the problems in my code.
Thanks in advance!
I want to scrape the Wikipedia list of Nobel Memorial Prize laureates in Economics: https://en.wikipedia.org/wiki/List_of_Nobel_Memorial_Prize_laureates_in_Economics
# I first get the whole table
import requests
from bs4 import BeautifulSoup

url = 'https://en.wikipedia.org/wiki/List_of_Nobel_Memorial_Prize_laureates_in_Economics'
soup = BeautifulSoup(requests.get(url).text, 'html.parser')
wiki_table = soup.find('table', {'class': 'wikitable'})
print(wiki_table)
# And grab the td information
name_list = wiki_table('td')
print(name_list)
type(name_list)      # bs4.element.ResultSet
type(name_list[0:])  # list
# My goal is to separate each laureate's name from the other descriptions, i.e. countries, years... My plan is to first get lists containing the names and then clean out the other unwanted strings.
# I tried to loop over both the bs4 type and the list type
laureates = []
for a in name_list:
    laureates.append(a.find_all('a'))
print(laureates)
# I looped over a here because the html is like `<td><a href="/wiki/Ragnar_Frisch">Ragnar Frisch</a></td>`. I thought the name is inside the `<a>` tag (or did I interpret it wrongly?)

The simplest way (in this case) is just to load the table into a pandas DataFrame and then extract whatever items you need using the usual pandas methods. So
import pandas as pd
url = "https://en.wikipedia.org/wiki/List_of_Nobel_Memorial_Prize_laureates_in_Economics"
pd.read_html(url)
will output a list of DataFrames, one for each table on that page.
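For instance, a minimal sketch of pulling just the names (assuming the laureates table is the first one read_html returns and that its name column is headed "Laureate"; check the columns attribute if the page layout differs):

import pandas as pd

url = "https://en.wikipedia.org/wiki/List_of_Nobel_Memorial_Prize_laureates_in_Economics"
tables = pd.read_html(url)    # one DataFrame per <table> on the page
laureates = tables[0]         # assumed: the laureates table comes first
print(laureates["Laureate"])  # assumed column name; inspect laureates.columns to confirm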

Related

Can't read html table with pd.read_html

On this page: https://www.basketball-reference.com/teams/MIA/2022.html
I want to read the Shooting table. I use this code:
import pandas as pd
url="https://www.basketball-reference.com/teams/MIA/2022.html"
pd.read_html(url,match="Shooting")
But it says: ValueError: No tables found matching pattern 'Shooting'.
If I try pd.read_html(url, match="Roster") or pd.read_html(url, match="Totals"), it finds those tables.
It's the second table that you want to read. You can simply do:
import pandas as pd
url="https://www.basketball-reference.com/teams/MIA/2022.html"
pd.read_html(url)[1]
I've discovered that the HTML commented out inside each div#all_* is identical to the actual scoring tables' content. So it looks like the tables are somehow generated from the comments using JavaScript after the page loads; apparently it's some kind of scraping protection. (Screenshots of the commented-out markup for the Shooting section omitted.)
Well, the only solution I see for now is to first load the whole HTML of the page, then strip the comment markers from the response text with replace, and finally get the table you want using pandas:
import requests
import pandas as pd
url = "https://www.basketball-reference.com/teams/MIA/2022.html"
req = requests.get(url)
req = req.text.replace('<!--', '')
# req = req.replace('-->', '') # not necessary in this case
pd.read_html(req, match="Shooting")
Since the HTML no longer contains comment markers, I recommend getting the tables by index.
For Shooting - Regular Season tab:
pd.read_html(req)[15]
and for Shooting - Playoffs tab:
pd.read_html(req)[16]
pd.read_html() isn't finding all the table tags; only 7 are being returned: Roster, Per Game, Totals, Advanced, and 3 others. Shooting is not among them, so pd.read_html(url, match="Shooting") is going to give you an error.
import pandas as pd
url = 'https://www.basketball-reference.com/teams/MIA/2022.html'
x = pd.read_html(url)
print(len(x)) #7
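Given the explanation above that the missing tables live inside HTML comments, a hedged alternative sketch is to parse the comment nodes directly with BeautifulSoup instead of string-replacing the whole page (the id="shooting" check is an assumption about the commented-out markup; adjust it to whatever id the hidden table actually carries):

import requests
import pandas as pd
from io import StringIO
from bs4 import BeautifulSoup, Comment

url = "https://www.basketball-reference.com/teams/MIA/2022.html"
soup = BeautifulSoup(requests.get(url).text, "lxml")

# walk every comment node and parse the one holding the hidden table
for comment in soup.find_all(string=lambda text: isinstance(text, Comment)):
    if 'id="shooting"' in comment:
        shooting = pd.read_html(StringIO(str(comment)))[0]
        break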

Find_all function returning multiple strings that are separated but not delimited by a character

Background: I am pretty new to Python and decided to practice by making a web scraper for https://www.marketwatch.com/tools/markets/stocks/a-z, which would let me pull the company name, ticker, origin, and sector. I could then use this in another scraper to combine it with more complex information. The page has two indexing methods: one to select the first letter of the company name (at the top of the page) and another for the number of pages within that letter index (at the bottom of the page). Both tags have class="pagination", but when I scrape on that criterion I get two separate strings, not a delimited list separated by a comma.
Does anyone know how to get the strings as a list, or individually? I really only care about the second one.
from bs4 import BeautifulSoup
import requests

# fetch the source code of the website as text
source = 'https://www.marketwatch.com/tools/markets/stocks/a-z/x'
page = requests.get(source).text
soup = BeautifulSoup(page, 'lxml')

for tags in soup.find_all('ul', class_='pagination'):
    tags_text = tags.text
    print(tags_text)
Which returns:
0-9ABCDEFGHIJKLMNOPQRSTUVWX (current)YZOther
«123»
When I try to split on /n:
tags_text = tags.text.split('/n')
print(tags_text)
The return is:
['\n0-9ABCDEFGHIJKLMNOPQRSTUVWX (current)YZOther']
['«123»']
Neither seems to form a list. I have found many ways to get the first string, but I really only need the second.
Also, please note that I am using the X index as my current tab. If you build it from scratch, you might have more numbers in the second string, and the word (current) might be in a different place in the first string.
THANK YOU!!!!
Edit:
Cleaned old, commented-out code from the source, and realized I did not show the result of trying to call the second element despite the lack of a comma in the split example:
tags_text = tags.text.split('/n')[1]
print(tags_text)
Returns:
File, "C:\.....", line 22, in <module>
tags_text = tags.text.split('/n')[1]
IndexError: list index out of range
Never mind, I was using print() when I should have been using return, .append(), or another call to actually do something with the value...
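For the record, a minimal sketch of getting the page numbers directly, without splitting joined text (assuming the page still serves two ul elements with class pagination, the second being the page bar at the bottom):

from bs4 import BeautifulSoup
import requests

source = 'https://www.marketwatch.com/tools/markets/stocks/a-z/x'
soup = BeautifulSoup(requests.get(source).text, 'lxml')

# find_all returns a list, so the bottom bar is simply the second element
bottom_bar = soup.find_all('ul', class_='pagination')[1]
# read each <li> individually instead of splitting the concatenated text
pages = [li.get_text(strip=True) for li in bottom_bar.find_all('li')]
print(pages)  # e.g. ['«', '1', '2', '3', '»']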

Using Pandas match='string' parameter in pd.read_html does not recognize the string in a table, even though there is a table containing 'string'

This question has two parts:
Part One
I'm trying to scrape some data from the SEC's website using pandas pd.read_html function. I only want one specific table, which has the text "Principal Position" in the table. I wrote a script (see below) to pull this data, but the problem is that it only works some of the time. Sometimes it seems to completely ignore tables that contain this text.
For instance, in the script below, I attempt to pull the table containing the words "Principal Position" for each of three companies - Microsoft, Amazon, and Tesla. Although all three companies have a table containing the words "Principal Position", only two of the tables (MSFT and TSLA) are scraped. The third (AMZN) is skipped in the try-except block because the text is not found. I cannot figure out what I'm doing wrong that is causing the AMZN table to be skipped.
Any help would be greatly appreciated!
Part Two
I'm also trying to figure out how to make the table's headers start with whatever row contains the words "Principal Position." Sometimes this phrase is in the second row, sometimes the third, etc. I can't figure out how to set the header parameter in pd.read_html to be dynamic so that it changes based on whichever row contains the words "Principal Position."
Ideally I would also like to get rid of the extra columns that are inserted into the table (i.e., the columns that are all 'NaN' values).
I know I'm asking a ton but thought I'd throw it out there to see if anyone knows how to do this (I'm stumped). Again, greatly appreciate any help!
My code (which skips AMZN table but does scrape the MSFT and TSLA tables)
import pandas as pd
import html5lib

CIK_list = {'MSFT': 'https://www.sec.gov/Archives/edgar/data/789019/000119312519268531/d791036ddef14a.htm',
            'AMZN': 'https://www.sec.gov/Archives/edgar/data/1018724/000119312520108422/d897711ddef14a.htm',
            'TSLA': 'https://www.sec.gov/Archives/edgar/data/1318605/000156459020027321/tsla-def14a_20200707.htm'}

for ticker, link in CIK_list.items():
    try:
        df_list = pd.read_html(link, match='Principal Position')
        df_list = pd.DataFrame(df_list[0])
        df_list.to_csv(f'Z:/Python/{ticker}.csv')
    except ValueError:
        pass
EDITED POST: To add a bit of detail, the error I am receiving is as follows:
ValueError: No tables found matching pattern 'Principal Position'
However, if you look at the link to the AMZN filing, you can text search for "Principal Position" and it comes up. Could it be that somehow Pandas is not waiting for the page to fully load before executing read_html?
Try:
df_list = pd.read_html(link, match=r'Principal\s+Position')
Because looking at the HTML, there appears to be more than a single whitespace character between "Principal" and "Position" on the AMZN page. The \s+ regex, which matches one or more whitespace characters, will capture this table as well.
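For Part Two, a hedged sketch (using the AMZN link from the question) that finds whichever row mentions "Principal Position", promotes it to the header, and drops the all-NaN spacer columns. It assumes read_html parses the table with default integer row and column labels; note also that SEC pages may require a User-Agent header in newer setups.

import pandas as pd

link = 'https://www.sec.gov/Archives/edgar/data/1018724/000119312520108422/d897711ddef14a.htm'
df = pd.read_html(link, match=r'Principal\s+Position')[0]

# locate the first row containing "Principal Position" (any amount of whitespace)
mask = df.apply(lambda row: row.astype(str)
                .str.contains(r'Principal\s+Position', regex=True).any(), axis=1)
header_row = mask.idxmax()

df.columns = df.iloc[header_row]      # promote that row to the header
df = df.iloc[header_row + 1:]         # keep only the rows below it
df = df.dropna(axis=1, how='all')     # drop the all-NaN columns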

Is there an efficient way to optimize this web-scraping script?

Here I have a web-scraping script that uses the requests and BeautifulSoup modules to extract movie names and ratings from https://www.imdb.com/chart/top/. I also extract a short description of each movie, from the link in each row's td.posterColumn a tag, to form a third column. To do so, I had to create a secondary soup object for every row and pull the summary text from it. The method works and I'm able to form a table, but the runtime is too long, which is understandable since a new soup object is created for each row. Could anyone suggest a faster, more efficient way to perform this operation? Also, how do I make all the rows and columns appear in their entirety in the DataFrame output? Thanks!
import pandas as pd
from bs4 import BeautifulSoup
import requests
import time

start_time = time.time()
response = requests.get("https://www.imdb.com/chart/top/")
soup = BeautifulSoup(response.content, "lxml")
body = soup.select("tbody.lister-list")[0]

titles = []
ratings = []
summ = []
for row in body.select("tr"):
    title = row.select("td.titleColumn a")[0].get_text().strip()
    titles.append(title)
    rating = row.select("td.ratingColumn.imdbRating")[0].get_text().strip()
    ratings.append(rating)
    innerlink = row.select("td.posterColumn a")[0]["href"]
    link = "https://imdb.com" + innerlink
    # a second request and soup object for every row - this is the slow part
    response2 = requests.get(link).content
    soup2 = BeautifulSoup(response2, "lxml")
    summary = soup2.select("div.summary_text")[0].get_text().strip()
    summ.append(summary)

df = pd.DataFrame({"Title": titles, "IMDB Rating": ratings, "Movie Summary": summ})
df.to_csv("imdbmovies.csv")
end_time = time.time()
finish = end_time - start_time
print("Runtime is {f:1.4f} secs".format(f=finish))
print(df)
(Screenshot of the pandas DataFrame output omitted.)
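One common speed-up, sketched below under the assumption that the selectors from the question still match IMDb's markup: reuse a single requests.Session and fetch the 250 summary pages concurrently, since the network wait dominates the runtime.

import requests
import pandas as pd
from bs4 import BeautifulSoup
from concurrent.futures import ThreadPoolExecutor

session = requests.Session()  # reuses TCP connections across requests

def fetch_summary(link):
    soup = BeautifulSoup(session.get(link).content, "lxml")
    node = soup.select("div.summary_text")
    return node[0].get_text().strip() if node else ""

top = BeautifulSoup(session.get("https://www.imdb.com/chart/top/").content, "lxml")
rows = top.select("tbody.lister-list tr")
titles = [r.select("td.titleColumn a")[0].get_text(strip=True) for r in rows]
ratings = [r.select("td.ratingColumn.imdbRating")[0].get_text(strip=True) for r in rows]
links = ["https://imdb.com" + r.select("td.posterColumn a")[0]["href"] for r in rows]

# run the per-movie requests in parallel instead of one at a time
with ThreadPoolExecutor(max_workers=16) as pool:
    summaries = list(pool.map(fetch_summary, links))

df = pd.DataFrame({"Title": titles, "IMDB Rating": ratings, "Movie Summary": summaries})

As for showing every row and column of the result, pandas' display limits can be lifted with pd.set_option('display.max_rows', None) and pd.set_option('display.max_columns', None).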

Python Web Scraping with lxml

I am trying to scrape column names (player, cost, sel., form, pts) from the page below:
https://fantasy.premierleague.com/a/statistics/total_points
However, I am failing to do so.
Before I go further, let me show you what I have done.
from lxml import html
import requests
page = 'https://fantasy.premierleague.com/a/statistics/total_points'
#Take site and structure html
page = requests.get(page)
tree = html.fromstring(page.content)
#Using the page's CSS classes, extract all links pointing to a team
Location = tree.cssselect('.ism-thead-bold tr .ism-table--el-stats__name')
When I do this, Location should be a list that contains a string "Player".
However, it returns an empty list which means cssselect did not capture anything.
Though each column name has a different th class, I used one of them (ism-table--el-stats__name) for this specific trial just to keep it simple.
Once this problem is fixed, I want to use regex, since every class has a different suffix after the two underscores.
If anyone can help me with these two tasks, I would really appreciate it!
Thank you guys.
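A small diagnostic sketch for the first task: check whether the header cells are present in the fetched HTML at all. If the list below prints empty, the table is most likely rendered client-side by JavaScript, and requests alone will never see it (a browser-driven tool such as Selenium would be needed). The attribute-substring selector also covers the varying suffixes without needing regex:

from lxml import html
import requests

page = requests.get('https://fantasy.premierleague.com/a/statistics/total_points')
tree = html.fromstring(page.content)

# substring match on the class attribute covers every __suffix variant
headers = tree.cssselect('th[class*="ism-table--el-stats__"]')
print([th.text_content().strip() for th in headers])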
