Is there an efficient way to optimize this web-scraping script? - python

Here I have a web-scraping script that uses the "requests" and "BeautifulSoup" modules to extract the movie names and ratings from https://www.imdb.com/chart/top/. I also extracted a short description of each movie from the link provided in "td.posterColumn a" for each row, to form a third column. To do that, I had to create a secondary soup object and extract the summary text from it for each row. Even though the method works and I'm able to form the table, the runtime is far too long, which is understandable since a new soup object is created on every iteration of the row loop. Could anyone suggest a faster, more efficient way to perform this operation? Also, how do I make all the rows and columns appear in their entirety in the DataFrame output? Thanks!
import pandas as pd
from bs4 import BeautifulSoup
import requests
import time

start_time = time.time()
response = requests.get("https://www.imdb.com/chart/top/")
soup = BeautifulSoup(response.content, "lxml")
body = soup.select("tbody.lister-list")[0]

titles = []
ratings = []
summ = []
for row in body.select("tr"):
    title = row.select("td.titleColumn a")[0].get_text().strip()
    titles.append(title)
    rating = row.select("td.ratingColumn.imdbRating")[0].get_text().strip()
    ratings.append(rating)
    # follow the per-movie link and build a second soup just for the summary
    innerlink = row.select("td.posterColumn a")[0]["href"]
    link = "https://imdb.com" + innerlink
    response2 = requests.get(link).content
    soup2 = BeautifulSoup(response2, "lxml")
    summary = soup2.select("div.summary_text")[0].get_text().strip()
    summ.append(summary)

df = pd.DataFrame({"Title": titles, "IMDB Rating": ratings, "Movie Summary": summ})
df.to_csv("imdbmovies.csv")
end_time = time.time()
finish = end_time - start_time
print("Runtime is {f:1.4f} secs".format(f=finish))
print(df)
Pandas DataFrame output: (the original post showed a screenshot of the truncated table)
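A sketch of one common fix (not from the original thread): the bottleneck is the 250 detail-page requests being made one at a time, not the soup objects themselves, so fetch those pages concurrently with concurrent.futures. The selectors are copied from the question and assumed to still match IMDB's markup; the pd.set_option lines address the second question about truncated output.

from concurrent.futures import ThreadPoolExecutor

import pandas as pd
import requests
from bs4 import BeautifulSoup

def fetch_summary(link):
    # one detail page per call; selector copied from the question
    soup = BeautifulSoup(requests.get(link).content, "lxml")
    return soup.select("div.summary_text")[0].get_text().strip()

response = requests.get("https://www.imdb.com/chart/top/")
soup = BeautifulSoup(response.content, "lxml")
rows = soup.select("tbody.lister-list tr")

titles = [r.select("td.titleColumn a")[0].get_text().strip() for r in rows]
ratings = [r.select("td.ratingColumn.imdbRating")[0].get_text().strip() for r in rows]
links = ["https://imdb.com" + r.select("td.posterColumn a")[0]["href"] for r in rows]

# up to 16 detail pages in flight at once; map() preserves order
with ThreadPoolExecutor(max_workers=16) as pool:
    summaries = list(pool.map(fetch_summary, links))

df = pd.DataFrame({"Title": titles, "IMDB Rating": ratings, "Movie Summary": summaries})

# stop pandas truncating rows and columns when printing
pd.set_option("display.max_rows", None)
pd.set_option("display.max_columns", None)
pd.set_option("display.max_colwidth", None)
print(df)

The same structure also works with asyncio/aiohttp if you need more throughput, but a thread pool is usually enough for a few hundred pages.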

Related

I use python/requests to get the HTML label information, want to store the result into pandas data.frame

I use python/requests to get HTML label information and want to store the results in the DataFrame value_table, but the line value_table.append(name.get_text().strip(), ignore_index=True) below doesn't work. Can anyone help? Thanks!
import pandas as pd
import requests
from bs4 import BeautifulSoup

value_table = pd.DataFrame(columns=['value'])
url = 'https://www.ebay.com/itm/394079766930'
req = requests.get(url)
soup = BeautifulSoup(req.content, 'lxml')
sku = soup.find('div', 'u-flL iti-act-num itm-num-txt').get_text(strip=True)
price = soup.find('span', {'itemprop': 'price'}).get_text(strip=True)
div = soup.find('div', {'id': 'viTabs_0_is'})
divs = div.findAll('span', {'class': 'ux-textspans'})
for name in divs:
    print(name.get_text().strip() + ' ')
    value_table.append(name.get_text().strip(), ignore_index=True)
value_table['sku',:] = sku
value_table['price':] = price
You're mostly on track; the only real modifications needed are on the pandas side.
Pandas reads a list in as a column. But you can use a linear-algebra trick, transposing with .T, so the list that comes in as a column ends up as a row, as intended. Because of that, value_table=pd.DataFrame(columns=['value']) isn't needed. From there it's just a few lines.
So, keep everything above the for-loop except value_table=pd.DataFrame(columns=['value']), and replace everything from the for-loop down with this:
value_table = []                            # collect the raw strings in a plain list
for name in divs:
    value_table.append(name.get_text().strip())
values_table = pd.DataFrame(value_table).T  # transpose: one row, many columns
values_table['sku'] = sku
values_table['price'] = price
That will give you a single-row DataFrame of the values (the original answer showed a screenshot of the result).
For a future iteration, you might want to consider whether a dict suits your needs better; for instance:
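A sketch, assuming the ux-textspans come through as alternating label/value pairs (verify against the actual markup before relying on this):

# hypothetical: assumes the spans alternate label, value, label, value...
texts = [s.get_text(strip=True) for s in divs]
record = dict(zip(texts[0::2], texts[1::2]))  # pair each label with its value
record['sku'] = sku
record['price'] = price
values_table = pd.DataFrame([record])         # one row with labelled columns

With labelled columns, each value stays searchable by its name instead of by position.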
EDIT: I noticed in the comments you said: "currently i only want to strore them into one variable as string"
That's as simple as values_table['stringed']=str(value_table), but it isn't particularly readable, nor easily searchable.

Can't read html table with pd.read_html

on this link: https://www.basketball-reference.com/teams/MIA/2022.html
I want to read the Shooting table on that page (shown in a screenshot in the original post).
I use this code:
import pandas as pd
url="https://www.basketball-reference.com/teams/MIA/2022.html"
pd.read_html(url,match="Shooting")
But it says: ValueError: No tables found matching pattern 'Shooting'.
If I try pd.read_html(url,match="Roster") or pd.read_html(url,match="Totals"), it finds those tables.
It's the second table that you want to read. You can simply do:
import pandas as pd
url="https://www.basketball-reference.com/teams/MIA/2022.html"
pd.read_html(url)[1]
I've discovered that the HTML commented out inside each div#all_* element is identical to the actual scoring tables' content. So it looks like the tables are somehow generated from the comments using JavaScript after the page loads. Obviously it's some kind of scraping protection.
(The original answer included screenshots of the commented-out markup for the Shooting section.)
Well, the only solution I see for now is to first load the whole HTML of the page, then strip the comment markers from the response text with replace (deleting the special HTML comment character sequences), and finally get the table you want using pandas:
import requests
import pandas as pd
url = "https://www.basketball-reference.com/teams/MIA/2022.html"
req = requests.get(url)
req = req.text.replace('<!--', '')
# req = req.replace('-->', '') # not necessary in this case
pd.read_html(req, match="Shooting")
Since the HTML no longer contains the comment markers, I recommend getting the tables by index.
For Shooting - Regular Season tab:
pd.read_html(req)[15]
and for Shooting - Playoffs tab:
pd.read_html(req)[16]
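An alternative sketch (not from the original answers): rather than stripping comment markers by hand, pull the commented blocks out with BeautifulSoup's Comment type and feed each one to pandas.

import requests
import pandas as pd
from bs4 import BeautifulSoup, Comment

url = "https://www.basketball-reference.com/teams/MIA/2022.html"
html = requests.get(url).text
soup = BeautifulSoup(html, "lxml")

# every HTML comment node in the page
comments = soup.find_all(string=lambda node: isinstance(node, Comment))

hidden_tables = []
for c in comments:
    if "<table" in c:                        # only comments that hide a table
        hidden_tables.extend(pd.read_html(str(c)))

print(len(hidden_tables))                    # match by index or by inspection

This keeps the visible and hidden tables separate, so the indices don't shift if the site changes which tables it comments out.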
pd.read_html() isn't finding all the table tags; only 7 are being returned:
Roster, Per Game, Totals, Advanced and 3 others. Shooting is not among them, so pd.read_html(url,match="Shooting") is going to give you an error.
import pandas as pd
url = 'https://www.basketball-reference.com/teams/MIA/2022.html'
x = pd.read_html(url)
print(len(x)) #7

Scraping from a dropdown menu using beautifulsoup

I am trying to scrape a list of dates from: https://ca.finance.yahoo.com/quote/AAPL/options
The dates are located within a dropdown menu right above the option chain. I've scraped text from this website before, but this part uses 'select' and 'option' tags. How would I adjust my code to gather this type of text? I have used many variations of the code below to try to scrape the text, but am having no luck.
Thank you very much.
import requests
from bs4 import BeautifulSoup

datesLink = 'https://ca.finance.yahoo.com/quote/AAPL/options'
datesPage = requests.get(datesLink)
datesSoup = BeautifulSoup(datesPage.text, 'lxml')
datesQuote = datesSoup.find('div', {'class': 'Cf Pt(18px)controls'}).find('option').text
The reason you can't seem to extract this dropdown list is that the list is generated dynamically; the easiest way to confirm this is to save your HTML content to a file and give it a manual look in a text editor.
You CAN, however, parse those dates out of the script source code, which is in the same HTML file, using some ugly regex. For example, this seems to work:
import re
import requests
from datetime import datetime, timezone

content = requests.get('https://ca.finance.yahoo.com/quote/AAPL/options').content.decode()
# the expiration timestamps live in the embedded OptionContractsStore JSON
match = re.search(r'"OptionContractsStore".*?"expirationDates".*?\[(.*?)\]', content)
dates = [datetime.fromtimestamp(int(x), tz=timezone.utc) for x in match.group(1).split(',')]
for d in dates:
    print(d.strftime('%Y-%m-%d'))
It should be obvious that parsing stuff in such a nasty way isn't fool-proof and is likely to break sooner rather than later. But the same can be said about any kind of web scraping.
You can simply read the HTML directly into pandas:
import pandas as pd
URI = 'https://ca.finance.yahoo.com/quote/AAPL/options'
df = pd.read_html(URI)[0] #[1] depending on the table you wish for

Try to use bs4 get some infos from wikipedia

I started to learn Python this year as a New Year's resolution ;P I've encountered some problems while self-learning web scraping. This may be a dumb question, but I hope someone can point out the problems in my code.
Thanks in advance!
I want to web-scraping from Wikipedia Nobel Economic Prize https://en.wikipedia.org/wiki/List_of_Nobel_Memorial_Prize_laureates_in_Economics
# setup assumed from earlier in the post (not shown in the original)
import requests
from bs4 import BeautifulSoup

url = 'https://en.wikipedia.org/wiki/List_of_Nobel_Memorial_Prize_laureates_in_Economics'
soup = BeautifulSoup(requests.get(url).content, 'lxml')

# I first get the whole table
wiki_table = soup.find('table', {'class': 'wikitable'})
print(wiki_table)

# And grab the td information
name_list = wiki_table('td')
print(name_list)
type(name_list)      # bs4.element.ResultSet
type(name_list[0:])  # list

# My goal is to separate each laureate's name from the other descriptions,
# i.e. countries, years... My plan is to first get lists containing people's
# names and then clean out the other unwanted strings.
# I tried to loop over both the bs4 type and the list type
laurates = []
for a in name_list:
    laurates.append(name_list.find_all(class='a'))
print(laurates)
# I looped over a here because the html is like `<a href="...">Ragnar Frisch</a>`.
# I thought the name comes with the a tag (or did I interpret it wrongly?)
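For the record, that loop has two concrete bugs: class is a reserved word in Python, so BeautifulSoup spells the keyword filter class_ (and here you would want to find a tags by name, not elements with class "a"), and find_all has to be called on each element rather than on the whole ResultSet. A corrected sketch, assuming the soup object from above:

# call find_all('a') on each cell, not on the ResultSet, and collect the text
laureates = []
for cell in wiki_table.find_all('td'):
    for a in cell.find_all('a'):
        laureates.append(a.get_text(strip=True))
print(laureates)

Note this still sweeps up every link in the table (countries, universities, footnotes), which is why the answer below is the cleaner route.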
The simplest way (in this case) is just to load the table into a pandas DataFrame and then extract whatever items you need from it using the usual pandas methods. So
import pandas as pd
url = "https://en.wikipedia.org/wiki/List_of_Nobel_Memorial_Prize_laureates_in_Economics"
pd.read_html(url)
will output the table on that page.
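From there, ordinary indexing pulls out whichever column holds the names; a sketch (the column label below is an assumption; inspect df.columns against the live page first):

import pandas as pd

url = "https://en.wikipedia.org/wiki/List_of_Nobel_Memorial_Prize_laureates_in_Economics"
df = pd.read_html(url)[0]   # the first wikitable on the page
print(df.columns)           # check the real column labels first

# hypothetical: if the table has a "Laureate" column, this yields the names
# names = df["Laureate"].tolist()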

Isolating data from dynamic table with beautifulSoup

I'm trying to extract data from a table (1), which has a couple of filter options. I'm using BeautifulSoup and got to this page with requests. An extract of the code:
from bs4 import BeautifulSoup
tt = Contact_page.content # webpage with table
soup = BeautifulSoup(tt)
R_tables = soup.find('div', {'class': 'responsive-table'})
Using find_all("tr") and find_all("th") returns empty sets. Using R_tables.findChildren only goes down to "formrow", which then has no children; from formrow down to my tr/th tags, I can't get access through BS4.
R_tables corresponds to table 3. The XPath for this data is:
//*[@id="kronos_body"]/div[3]/div[2]/div[3]/script/text()
How can I get the information in each row? soup.find("r") and soup.find("f") also return empty sets.
Pardon me in advance if this post is sloppy, as it's my first. I'll link the most similar thread in a comment, since I can't include more than 2 links.
EDIT 1: Apparently BS doesn't recognize any JavaScript apart from variables (correct me if I'm wrong, I'm still relatively new). Are there any other modules that can help me out? I was pointed to Ghost and Selenium, but I won't be using Selenium.
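Since Selenium is out, one module worth a look is requests-html, which renders the page's JavaScript in headless Chromium. A sketch under the assumption that the rendered page exposes the responsive-table markup from the snippet above (the URL is a placeholder, since the original one wasn't shared):

# pip install requests-html
from requests_html import HTMLSession

session = HTMLSession()
resp = session.get('https://example.com/your-kronos-page')  # placeholder URL
resp.html.render()  # downloads Chromium on first use, then executes the page's JS

# selector taken from the question's snippet
table_div = resp.html.find('div.responsive-table', first=True)
if table_div is not None:
    for row in table_div.find('tr'):
        print(row.text)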
