How to scrape specific text from specific table elements - python

I am trying to scrape specific text from specific table elements on an Amazon product page.
URL_1 has all elements - https://www.amazon.com/dp/B008Q5LXIE/
URL_2 has only 'Sales Rank' - https://www.amazon.com/dp/B001V9X26S
URL_1:
The "Product Details" table has 9 items and I am only interested in 'Product Dimensions', 'Shipping Weight', Item Model Number, and all 'Seller's Rank'
I am not able to parse out the text on these items as some are in one block of code, where others are not.
I am using beautifulsoup and I have run a text.strip() on the table and got everything but very messy. I have tried soup.find('li') and text.strip() to find individual elements but with seller rank, it returns all 3 ranks jumbled in one return. I have also tried regex to clean text but it won't work for the 4 different seller ranks. I have had success using the Try, Except, Pass method for scraping and would have each of these in that format
A bad example of the code used, I was trying to get sales rank past the </b>
element in the HTML
#Sales Rank
sales_rank = 'NOT'
try:
    sr = soup.find('li', attrs={'id': 'SalesRank'})
    sales_rank = sr.find('/b').text.strip()
except:
    pass
I expect to be able to scrape the listed elements into a dictionary. I would like to see the results as
dimensions = 6x4x4
weight = 4.8 ounces
Item_No = IT-DER0-IQDU
R1_NO = 2,036
R1_CAT = Health & Household
R2_NO = 5
R2_CAT = Joint & Muscle Pain Relief Medications
R3_NO = 3
R3_CAT = Naproxen Sodium
R4_NO = 6
R4_CAT = Migraine Relief
my_dict = {'dimensions':'dimensions', 'weight':'weight', 'Item_No':'Item_No',
           'R1_NO':R1_NO, 'R1_CAT':'R1_CAT', 'R2_NO':R2_NO, 'R2_CAT':'R2_CAT',
           'R3_NO':R3_NO, 'R3_CAT':'R3_CAT', 'R4_CAT':'R4_CAT'}
URL_2:
The only element of interest on this page is 'Sales Rank'; 'Product Dimensions', 'Shipping Weight' and 'Item model number' are not present. However, I would like a return similar to that of URL_1, where each missing element gets a value of 'NA'. In other words, the same results as URL_1, only with 'NA' given when an element is not present. I have had success accomplishing this by setting a default value prior to the try/except statement, e.g. shipping_weight = 'NA', then running try/except/pass, so I get 'NA' and my dictionary is not empty.
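For reference, the default-value pattern described above looks roughly like this (just a sketch; the 'ShippingWeight' id and the variable name are placeholders, not the real page markup):

# set a default first so the dictionary is never missing the key
shipping_weight = 'NA'
try:
    sw = soup.find('li', attrs={'id': 'ShippingWeight'})  # placeholder id
    shipping_weight = sw.text.strip()
except AttributeError:  # find() returned None because the element is absent
    pass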

You could use stripped_strings and :contains with bs4 4.7.1. This feels like a lot of jiggery-pokery to get the desired output format; I'm sure someone with more Python experience could reduce this and improve its efficiency. The merging-dicts syntax is taken from @aaronhall.
import requests
from bs4 import BeautifulSoup as bs
import re

links = ['https://www.amazon.com/Professional-Dental-Guard-Remoldable-Customizable/dp/B07L4YHBQ4', 'https://www.amazon.com/dp/B0040ODFK4/?tag=stackoverfl08-20']

for link in links:
    r = requests.get(link, headers={'User-Agent': 'Mozilla/5.0'})
    soup = bs(r.content, 'lxml')
    fields = ['Product Dimensions', 'Shipping Weight', 'Item model number', 'Amazon Best Sellers Rank']
    temp_dict = {}

    for field in fields:
        element = soup.select_one('li:contains("' + field + '")')
        if element is None:
            temp_dict[field] = 'N/A'
        else:
            if field == 'Amazon Best Sellers Rank':
                # first stripped string is the label, second is "#rank in category"
                item = [re.sub(r'#|\(', '', string).strip() for string in element.stripped_strings][1].split(' in ')
                temp_dict[field] = item
            else:
                item = [string for string in element.stripped_strings][1]
                temp_dict[field] = item.replace('(', '').strip()

    ranks = soup.select('.zg_hrsr_rank')
    ladders = soup.select('.zg_hrsr_ladder')

    if ranks:
        cat_nos = [item.text.split('#')[1] for item in ranks]
    else:
        cat_nos = ['N/A']

    if ladders:
        cats = [item.text.split('\xa0')[1] for item in ladders]
    else:
        cats = ['N/A']

    rankings = dict(zip(cat_nos, cats))

    map_dict = {
        'Product Dimensions': 'dimensions',
        'Shipping Weight': 'weight',
        'Item model number': 'Item_No',
        'Amazon Best Sellers Rank': ['R1_NO', 'R1_CAT']
    }

    final_dict = {}

    for k, v in temp_dict.items():
        if k == 'Amazon Best Sellers Rank' and v != 'N/A':
            item = dict(zip(map_dict[k], v))
            final_dict = {**final_dict, **item}
        elif k == 'Amazon Best Sellers Rank' and v == 'N/A':
            item = dict(zip(map_dict[k], [v, v]))
            final_dict = {**final_dict, **item}
        else:
            final_dict[map_dict[k]] = v

    for k, v in enumerate(rankings):
        # the remaining category ranks start at R2 because R1 came from the Best Sellers Rank line
        prefix = 'R' + str(k + 2) + '_'
        final_dict[prefix + 'NO'] = v
        final_dict[prefix + 'CAT'] = rankings[v]

    print(final_dict)

Related

Webscraping different URLs - limit

I have coded a web scraper for Auto Trader, but for some reason, when iterating through URLs, I can only ever get a maximum length of 1300 for my dataframe. There are 13 results per page, so is there some sort of significance to a limit of 100 pages, or am I just doing something wrong? Any help would be greatly appreciated :)
I've attached my code below
# Import required libraries
from requests import get
from bs4 import BeautifulSoup
import pandas as pd

# List of urls
path = 'https://www.autotrader.co.uk/car-search?advertClassification=standard&postcode=RH104JJ&make=&price-from=500&price-to=100000&onesearchad=Used&onesearchad=Nearly%20New&onesearchad=New&advertising-location=at_cars&is-quick-search=TRUE&page='
urls = []
for i in range(1, 500):
    url = path + str(i)
    urls.append(url)

# Lists to store the scraped data in
makes = []
prices = []
ratings = []
dates = []
types = []
miles = []
litres = []
bhps = []
transmissions = []
fuels = []
owners = []
attributes = [makes, ratings, dates, types, miles, litres, bhps, transmissions, fuels, owners]

# Iterate through urls
sum = 0
for url in urls:
    sum += 1
    if sum % 10 == 0:
        print(sum)
    # Attempt to connect to the url
    try:
        response = get(url)
    except:
        print('oops')
    html_soup = BeautifulSoup(response.text, 'html.parser')
    # Get a list of individual cars and iterate through it
    car_containers = html_soup.find_all('li', class_='search-page__result')
    for container in car_containers:
        try:
            container.find("div", {"class": "js-tooltip"}).find("div", {"class": "pi-indicator js-tooltip-trigger"}).text
            rating = container.find("div", {"class": "js-tooltip"}).find("div", {"class": "pi-indicator js-tooltip-trigger"}).text.strip()
        except:
            rating = ''
        ratings.append(rating)
        make = container.h2.text.strip().title().split(' ')[0]
        makes.append(make)
        price = container.find("div", {"class": "vehicle-price"}).text[1:]
        prices.append(price)
        specs = container.find("ul", {"class": "listing-key-specs"}).find_all("li", recursive=True)
        for spec in specs:
            if spec.text.split(' ')[0].isdigit() and len(spec.text.split(' ')[0]) == 4:
                date = spec.text.split(' ')[0]
                dates.append(date)
            if 'mile' in str(spec):
                mile = spec.text.split(' ')[0]
                miles.append(mile)
            if 'l' in str(spec).lower() and str(spec.text)[:-1].replace('.', '').isnumeric() and not spec.text.split(' ')[0].isdigit():
                litre = spec.text[:-1]
                litres.append(litre)
            if any(x in str(spec).lower() for x in ['automatic', 'manual']):
                transmission = spec.text
                transmissions.append(transmission)
            if any(x in str(spec).lower() for x in ['bhp', 'ps']):
                bhp = spec.text
                bhps.append(bhp)
            if any(x in str(spec).lower() for x in ['petrol', 'diesel']):
                fuel = spec.text
                fuels.append(fuel)
            if 'owner' in str(spec):
                owner = spec.text
                owners.append(owner.split(' ')[0])
            typelist = ['hatchback', 'saloon', 'convertible', 'coupe', 'suv', 'mpv', 'estate', 'limousine', 'pickup']
            if any(x in str(spec).lower() for x in typelist):
                typ = spec.text
                types.append(typ)
        # Filling in empty spaces
        for attribute in attributes:
            if len(attribute) < len(prices):
                attribute.append('')

# Creating a dataframe from the lists
df = ({'makes': makes,
       'Price': prices,
       'Rating': ratings,
       'Year': dates,
       'Type': types,
       'Miles': miles,
       'Litres': litres,
       'BHP': bhps,
       'Transmission': transmissions,
       'Fuel': fuels,
       'Owners': owners
       })
df = pd.DataFrame(df)
Maybe just use a url shortener if the length of the url is too long

BeautifulSoup4 scraping: Pandas "arrays must all be same length" when exporting data to csv

I'm using BeautifulSoup4 to scrape info from a website and using Pandas to export the data to a csv file. There are 5 columns of data represented by 5 lists in a dictionary. However, since the website doesn't have complete data for all 5 categories, some lists have fewer items than others. So when I try to export the data, pandas gives me
ValueError: arrays must all be same length.
What is the best way to handle this situation? To be specific, the lists with fewer items are "authors" and "pages". Thanks in advance!
Code:
import requests as r
from bs4 import BeautifulSoup as soup
import pandas

# make a list of all web pages' urls
webpages = []
for i in range(15):
    root_url = 'https://cross-currents.berkeley.edu/archives?author=&title=&type=All&issue=All&region=All&page=' + str(i)
    webpages.append(root_url)
print(webpages)

# start looping through all pages
titles = []
journals = []
authors = []
pages = []
dates = []
issues = []
for item in webpages:
    headers = {'User-Agent': 'Mozilla/5.0'}
    data = r.get(item, headers=headers)
    page_soup = soup(data.text, 'html.parser')
    # find targeted info and put it into lists to be exported to a csv file via pandas
    title_list = [title.text for title in page_soup.find_all('div', {'class': 'field field-name-node-title'})]
    titles += [el.replace('\n', '') for el in title_list]
    journal_list = [journal.text for journal in page_soup.find_all('em')]
    journals += [el.replace('\n', '') for el in journal_list]
    author_list = [author.text for author in page_soup.find_all('div', {'class': 'field field--name-field-citation-authors field--type-string field--label-hidden field__item'})]
    authors += [el.replace('\n', '') for el in author_list]
    pages_list = [pages.text for pages in page_soup.find_all('div', {'class': 'field field--name-field-citation-pages field--type-string field--label-hidden field__item'})]
    pages += [el.replace('\n', '') for el in pages_list]
    date_list = [date.text for date in page_soup.find_all('div', {'class': 'field field--name-field-date field--type-datetime field--label-hidden field__item'})]
    dates += [el.replace('\n', '') for el in date_list]
    issue_list = [issue.text for issue in page_soup.find_all('div', {'class': 'field field--name-field-issue-number field--type-integer field--label-hidden field__item'})]
    issues += [el.replace('\n', '') for el in issue_list]

# export to csv file via pandas
dataset = {'Title': titles, 'Author': authors, 'Journal': journals, 'Date': dates, 'Issue': issues, 'Pages': pages}
df = pandas.DataFrame(dataset)
df.index.name = 'ArticleID'
df.to_csv('example45.csv', encoding="utf-8")
If you are sure that, for example, the length of the titles is always correct, you could do something like this:
title_list = [title.text for title in page_soup.find_all('div', {'class': 'field field-name-node-title'})]
titles_to_add = [el.replace('\n', '') for el in title_list]
titles += titles_to_add
...
author_list = [author.text for author in page_soup.find_all('div', {'class': 'field field--name-field-citation-authors field--type-string field--label-hidden field__item'})]
authors_to_add = [el.replace('\n', '') for el in author_list]
if len(authors_to_add) < len(titles_to_add):
    while len(authors_to_add) < len(titles_to_add):
        authors_to_add += " "
authors += authors_to_add
pages_list = [pages.text for pages in page_soup.find_all('div', {'class': 'field field--name-field-citation-pages field--type-string field--label-hidden field__item'})]
pages_to_add = [el.replace('\n', '') for el in pages_list]
if len(pages_to_add) < len(titles_to_add):
    while len(pages_to_add) < len(titles_to_add):
        pages_to_add += " "
pages += pages_to_add
However, this will just add elements to the columns so that they have the correct length and you can create a dataframe, but the authors and pages will not end up in the correct rows. You will have to change your algorithm a bit to achieve your final goal: it would be better to iterate through all the rows on the page and get the title etc. per row, like this:
rows = page_soup.find_all('div', {'class': 'views-row'})
for row in rows:
    title_list = [title.text for title in row.find_all('div', {'class': 'field field-name-node-title'})]
    ...
Then you need to check whether a title, author, etc. exists for the row (len(title_list) > 0) and, if not, append None or something else to the specific list. Then everything should line up correctly in your df.
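A minimal sketch of that per-row approach, reusing the class names from your code (only titles and authors shown; the other fields follow the same pattern):

rows = page_soup.find_all('div', {'class': 'views-row'})
for row in rows:
    # each list gets exactly one entry per row, so they always stay the same length
    title_div = row.find('div', {'class': 'field field-name-node-title'})
    author_div = row.find('div', {'class': 'field field--name-field-citation-authors field--type-string field--label-hidden field__item'})
    titles.append(title_div.text.replace('\n', '') if title_div else None)
    authors.append(author_div.text.replace('\n', '') if author_div else None)
    # ...repeat for journals, pages, dates and issues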
You could make a dataframe out of just the first list (df = pandas.DataFrame({'Title': titles})), and then add the others:
dataset = {'Author': authors, 'Journal': journals, 'Date': dates, 'Issue': issues, 'Pages': pages}
df2 = pandas.DataFrame(dataset)
df_final = pandas.concat([df, df2], axis=1)
This will give you blanks (or NaN) where you have missing data.
The trouble with this, as with @WurzelseppQX's answer, is that the data may not be aligned, which would make it pretty useless. So the best approach is perhaps to change your code in such a way that you always append something to each list for each run through the loop, making it 0 or blank if there's nothing there.

Create a loop by iterating a string throughout a code

I have the following code:
from urllib.request import urlopen
from bs4 import BeautifulSoup
import pandas as pd
import requests
from requests import get

date = []
tourney_round = []
result = []
winner_odds = []
loser_odds = []
surface = []
players_and_tourney = []

response = get('http://www.tennisexplorer.com/player/humbert-e2553/?annual=all')
page_html = BeautifulSoup(response.text, 'html.parser')
results2018_containers = page_html.find_all('div', id='matches-2018-1-data')

for container in results2018_containers:
    played_date_2018 = results2018_containers[0].findAll('td', class_='first time')
    for i in played_date_2018:
        date.append(i.text)
string_2018 = '2018'
date = [x + string_2018 for x in date]

for container in results2018_containers:
    rounds_2018 = results2018_containers[0].findAll('td', class_='round')
    for i in rounds_2018:
        tourney_round.append(i.text)

for container in results2018_containers:
    results_2018 = results2018_containers[0].findAll('td', class_='tl')
    for i in results_2018:
        result.append(i.text)

for container in results2018_containers:
    surfaces_2018 = results2018_containers[0].findAll('td', class_='s-color')
    for i in surfaces_2018:
        surface.append(i.find('span')['title'])

for container in results2018_containers:
    odds_2018 = results2018_containers[0].findAll('td', class_='course')
    winner_odds_2018 = odds_2018[0:][::2]
    for i in winner_odds_2018:
        winner_odds.append(i.text)
    loser_odds_2018 = odds_2018[1:][::2]
    for i in loser_odds_2018:
        loser_odds.append(i.text)

for container in results2018_containers:
    namesandtourney_2018 = results2018_containers[0].findAll('td', class_='t-name')
    for i in namesandtourney_2018:
        players_and_tourney.append(i.text)

from itertools import chain, groupby, repeat

chainer = chain.from_iterable

def condition(x):
    return x.startswith('\xa0')

elements = [list(j) for i, j in groupby(players_and_tourney, key=condition) if not i]
# create list of headers
headers = [next(j) for i, j in groupby(players_and_tourney, key=condition) if i]
# chain list of lists, and use repeat for headers
initial_df_2018 = pd.DataFrame({'Date': date,
                                'Surface': surface,
                                'Players': list(chainer(elements)),
                                'Tournament': list(chainer(repeat(i, j) for i, j in zip(headers, map(len, elements)))),
                                'Round': tourney_round,
                                'Result': result,
                                'Winner Odds': winner_odds,
                                'Loser Odds': loser_odds})
initial_df_2018['Winner'], initial_df_2018['Loser'] = initial_df_2018['Players'].str.split(' - ', 1).str
del initial_df_2018['Players']
initial_df_2018 = initial_df_2018[['Date', 'Surface', 'Tournament', 'Winner', 'Loser', 'Result', 'Winner Odds', 'Loser Odds']]
I want to create a loop that runs the code for every year starting from 2005. So basically, running a loop that replaces 2018 throughout the code with each year between 2005 and 2018. If possible, the code would run first for the year 2018, then 2017, and so on down to 2005.
Edit: I added the code that I used to pull data for the year 2018, but I want a loop that will pull data for all the years that can be found on the page.
If I understood you correctly, you want to repeat the request you made for 2018 for all years between 2005 and 2018.
What I did was loop over your code for the years in that range, replacing the id each time and adding all the data to a dictionary keyed by year.
response = get('http://www.example.com')
page_html = BeautifulSoup(response.text, 'html.parser')
date_dict = {}
for year in range(2019, 1, -1):
    date = []
    string_id = "played-{}-data".format(year)
    results_containers = page_html.find_all('div', id=string_id)
    if not results_containers:
        # find_all returns an empty list (not None) when nothing matches
        continue
    for container in results_containers:
        played_date = results_containers[0].findAll('td', class_='plays')
        for i in played_date:
            date.append(i.text)
    if year not in date_dict:
        date_dict[year] = []
    date_dict[year] += date
You can store the year as an integer but still use it in a string.
for year in range(2018, 2004, -1):
    print(f"Happy New Year {year}")
Other ways to include a number in a string are "Happy New Year {}".format(year) or "it is now " + str(year) + " more text".
Also, I don't think you do, but if someone finds this and really wants to "iterate a string", Caesar ciphers are a good place to look.
There's no problem looping that, but you need to define how you want your results. I used a dictionary here, and I've turned your code into a function that I can call with variables:
def get_data(year):
    date = []
    response = get('http://www.example.com')
    page_html = BeautifulSoup(response.text, 'html.parser')
    results_containers = page_html.find_all('div', id='played-{}-data'.format(year))
    for container in results_containers:
        played_date = results_containers[0].findAll('td', class_='plays')
        for i in played_date:
            date.append(i.text)
    return date
Now all I have to do is create a range of possible years and call the function each time. This can be done as simply as:
all_data = {year: get_data(year) for year in range(2018, 2004, -1)}
Just use a for loop over a range. Something like:
date = []
response = get('http://www.example.com')
page_html = BeautifulSoup(response.text, 'html.parser')
for year in range(2018, 2004, -1):
    year_id = 'played-{}-data'.format(year)
    results_containers = page_html.find_all('div', id=year_id)
    ...

Extracting Data from http://www.ign.com/tv/reviews

I am trying to extract review information from the page and build a dictionary of reviews, keyed by title, holding each review's link, score and date.
My code, which currently is not working, is appended below:
from bs4 import BeautifulSoup
from urllib.request import urlopen
import requests

r = requests.get('http://www.ign.com/tv/reviews')
c = r.text
# print(c)
soup = BeautifulSoup(c, 'html.parser')

x = soup.find_all('div', class_='item-title')
for item in x:
    print(item)
    print('--------------------------------------------------')

lobbying = {}
for element in x:
    lobbying[element.a.get_text()] = {}
# print(lobbying)  # This is a dictionary object
for key, value in lobbying.items():
    print(key, value)

for element in x:
    lobbying[element.a.get_text()]["link"] = element.a["href"]
for key, value in lobbying.items():
    print(key, value, sep='\n', end='\n\n')
This is for first finding the date and score and then inserting what we find into the dictionary:
f = soup.find_all('div', class_='itemList-item')
reviewItems = {}
for item in f:
    score = item.find("span", class_="scoreBox-scorePhrase").getText()
    date = item.find_all("div", class_="grid_3")[1].getText().strip()
    lobbying[element.a.get_text()]["score"] = score
    lobbying[element.a.get_text()]["date"] = date
for key, value in lobbying.items():
    print(key, value)
I was able to get the info you needed from the items with the following:
f = soup.find_all('div', class_='itemList-item')
reviewItems = {}
for item in f:
    review = {}
    review["score"] = item.find("span", class_="scoreBox-scorePhrase").getText()
    review["date"] = item.find_all("div", class_="grid_3")[1].getText().strip()
    review["link"] = item.find("a", class_="scoreBox-link")["href"]
    reviewItems[item.find("div", class_="item-title").getText().strip()] = review
for key, value in reviewItems.items():
    print(key, value)
If the class you are using is too generic (like the grid_3), try finding a more specific one. In this case it is scoreBox, or itemList-item.
You said you want to get the date from the URL, but I think you meant that you want to get it from the div that has it. It seems that in this particular case you can just take the second element of the grid_3 elements found for each itemList-item.
Anyway, this will print out lines like the following for the 25 items:
Ash vs Evil Dead - Home Again {'link': 'http://www.ign.com/articles/2016/12/04/ash-vs-evil-dead-home-again-review', 'date': 'December 5, 2016', 'score': 'Amazing'}

Extract data from a site using python

I am making a program that will extract the data from http://www.gujarat.ngosindia.com/
I wrote the following code:
def split_line(text):
    words = text.split()
    i = 0
    details = ''
    while (words[i] != 'Contact') and (i < len(words)):
        i = i + 1
        if words[i] == 'Contact:':
            break
    while (words[i] != 'Purpose') and (i < len(words)):
        if words[i] == 'Purpose:':
            break
        details = details + words[i] + ' '
        i = i + 1
    print(details)

def get_ngo_detail(ngo_url):
    html = urlopen(ngo_url).read()
    soup = BeautifulSoup(html)
    table = soup.find('table', {'class': 'border3'})
    td = soup.find('td', {'class': 'border'})
    split_line(td.text)

def get_ngo_names(gujrat_url):
    html = urlopen(gujrat_url).read()
    soup = BeautifulSoup(html)
    for link in soup.findAll('div', {'id': 'mainbox'}):
        for text in link.find_all('a'):
            print(text.get_text())
            ngo_link = 'http://www.gujarat.ngosindia.com/' + text.get('href')
            get_ngo_detail(ngo_link)
            # NGO_name = text2.get_text()

a = get_ngo_names(BASE_URL)
print(a)
But when I run this script I only get the names of the NGOs and the contact person.
I also want the email, telephone number, website and purpose for each NGO.
Your split_line could be improved. Imagine you have this text:
s = """Add: 3rd Floor Khemha House
Drive in Road, Opp Drive in Cinema
Ahmedabad - 380 054
Gujarat
Tel: 91-79-7457611 , 79-7450378
Email: a.mitra1@lse.ac.uk
Website: http://www.aavishkaar.org
Contact: Angha Mitra
Purpose: Economics and Finance, Micro-enterprises
Aim/Objective/Mission: To provide timely financing, management support and professional expertise ..."""
Now we can turn this into lines using s.split("\n") (split on each new line), giving a list where each item is a line:
lines = s.split("\n")
lines == ['Add: 3rd Floor Khemha House',
'Drive in Road, Opp Drive in Cinema',
...]
We can define a list of the elements we want to extract, and a dictionary to hold the results:
targets = ["Contact", "Purpose", "Email"]
results = {}
And work through each line, capturing the information we want:
for line in lines:
    l = line.split(":")
    if l[0] in targets:
        results[l[0]] = l[1]
This gives me:
results == {'Contact': ' Angha Mitra',
            'Purpose': ' Economics and Finance, Micro-enterprises',
            'Email': ' a.mitra1@lse.ac.uk'}
Try to split the contents of the NGO site better; you can split on a regular expression (with re.split) instead of a plain string.
e.g. "[Contact]+[Email]+[telephone number]+[website]+[purpose]+[contact person]
My regular expression could be wrong but this is the direction you should head in.
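For example, a rough sketch with re.split, applied to the td.text block from get_ngo_detail (the field labels are just the ones visible in the sample text above):

import re

fields = ['Add', 'Tel', 'Email', 'Website', 'Contact', 'Purpose', 'Aim/Objective/Mission']
# split on any "Label:" and keep the labels thanks to the capturing group
pattern = '(' + '|'.join(re.escape(f) for f in fields) + '):'
parts = re.split(pattern, td.text)
# parts looks like ['', 'Add', ' 3rd Floor ...', 'Tel', ' 91-79-...', ...]
details = dict(zip(parts[1::2], (p.strip() for p in parts[2::2])))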
