Extracting Data from http://www.ign.com/tv/reviews - python

I am trying to achieve the result shown further below. My code, which currently does not work, is appended:
from bs4 import BeautifulSoup
import requests

r = requests.get('http://www.ign.com/tv/reviews')
c = r.text
# print(c)
soup = BeautifulSoup(c, 'html.parser')

x = soup.find_all('div', class_='item-title')
for item in x:
    print(item)
    print('--------------------------------------------------')

lobbying = {}
for element in x:
    lobbying[element.a.get_text()] = {}
# print(lobbying)  # This is a dictionary object
for key, value in lobbying.items():
    print(key, value)

for element in x:
    lobbying[element.a.get_text()]["link"] = element.a["href"]
for key, value in lobbying.items():
    print(key, value, sep='\n', end='\n\n')
This is for first finding the date and the score, and then inserting what we find into the dictionary:
f = soup.find_all('div', class_='itemList-item')
reviewItems = {}
for item in f:
    score = item.find("span", class_="scoreBox-scorePhrase").getText()
    date = item.find_all("div", class_="grid_3")[1].getText().strip()
    lobbying[element.a.get_text()]["score"] = score
    lobbying[element.a.get_text()]["date"] = date
for key, value in lobbying.items():
    print(key, value)

I was able to get the info you need for the items with the following:
f = soup.find_all('div', class_='itemList-item')
reviewItems = {}
for item in f:
    review = {}
    review["score"] = item.find("span", class_="scoreBox-scorePhrase").getText()
    review["date"] = item.find_all("div", class_="grid_3")[1].getText().strip()
    review["link"] = item.find("a", class_="scoreBox-link")["href"]
    reviewItems[item.find("div", class_="item-title").getText().strip()] = review

for key, value in reviewItems.items():
    print(key, value)
If the class you are using is too generic (like grid_3), try finding a more specific one; in this case that is scoreBox or itemList-item.
You said you want to get the date from the URL, but I think you meant that you want to get it from the div that contains it. It seems that in this particular case you can just take the second of the grid_3 elements found for each itemList-item.
Anyway, this will print out the following for each of the 25 items:
Ash vs Evil Dead - Home Again {'link': 'http://www.ign.com/articles/2016/12/04/ash-vs-evil-dead-home-again-review', 'date': 'December 5, 2016', 'score': 'Amazing'}
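The "take the second grid_3" trick relies on find_all returning matches in document order, so a positional index works even with a generic class. A minimal sketch with made-up HTML (the real IGN markup may differ):

```python
from bs4 import BeautifulSoup

# Tiny inline HTML standing in for one review item; the structure is assumed
html = """
<div class="itemList-item">
  <div class="grid_3">Amazing</div>
  <div class="grid_3">December 5, 2016</div>
</div>
"""
soup = BeautifulSoup(html, "html.parser")
item = soup.find("div", class_="itemList-item")
# find_all returns a list in document order, so index [1] is the second match
date = item.find_all("div", class_="grid_3")[1].get_text().strip()
print(date)  # December 5, 2016
```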


Web scraping with beautifulsoup: find and replace missing nodes with None

I'm using the following code to scrape web items with BeautifulSoup:
item_id = []
items = soup.find_all('div', class_='item-id')
for one_item in items:
    list_item = one_item.text
    item_id.append(list_item)
However, some items are missing, and when I run the code I only get the list of the items that are available. How can I get the entire list, with the missing ones listed as "None"?
import requests
from bs4 import BeautifulSoup as bsoup

site_source = requests.get("https://search.bvsalud.org/global-literature-on-novel-coronavirus-2019-ncov/?output=site&lang=en&from=0&sort=&format=summary&count=100&fb=&page=1&skfp=&index=tw&q=%28%22rapid+test%22+OR+%22rapid+diagnostic+test%22%29+AND+sensitivity+AND+specificity").content
soup = bsoup(site_source, "html.parser")
item_list = soup.find_all('div', class_='textArt')

result_list = []
for item in item_list:
    result = item.find('div', class_='reference')
    if result is None:
        result_list.append('None')
    else:
        result_list.append(result.text)

for result in result_list:
    print(result)

How to scrape specific text from specific table elements

I am trying to scrape specific text from specific table elements on an Amazon product page.
URL_1 has all elements - https://www.amazon.com/dp/B008Q5LXIE/
URL_2 has only 'Sales Rank' - https://www.amazon.com/dp/B001V9X26S
URL_1:
The "Product Details" table has 9 items, and I am only interested in 'Product Dimensions', 'Shipping Weight', 'Item model number', and all the 'Sellers Rank' entries.
I am not able to parse out the text of these items, as some are in one block of code and others are not.
I am using BeautifulSoup. I have run a text.strip() on the table and got everything, but it is very messy. I have tried soup.find('li') and text.strip() to find individual elements, but with seller rank it returns all 3 ranks jumbled in one return. I have also tried regex to clean the text, but it won't work for the 4 different seller ranks. I have had success using the try/except/pass method for scraping and would have each of these in that format.
A bad example of the code I used; I was trying to get the sales rank past the </b> element in the HTML:
# Sales Rank
sales_rank = 'NOT'
try:
    sr = soup.find('li', attrs={'id': 'SalesRank'})
    sales_rank = sr.find('/b').text.strip()
except:
    pass
I expect to be able to scrape the listed elements into a dictionary. I would like to see the results as:
dimensions = 6x4x4
weight = 4.8 ounces
Item_No = IT-DER0-IQDU
R1_NO = 2,036
R1_CAT = Health & Household
R2_NO = 5
R2_CAT = Joint & Muscle Pain Relief Medications
R3_NO = 3
R3_CAT = Naproxen Sodium
R4_NO = 6
R4_CAT = Migraine Relief
my_dict = {'dimensions':'dimensions','weight':'weight','Item_No':'Item_No', 'R1_NO':R1_NO,'R1_CAT':'R1_CAT','R2_NO':R2_NO,'R2_CAT':'R2_CAT','R3_NO':R3_NO,'R3_CAT':'R3_CAT','R4_CAT':'R4_CAT'}
URL_2:
The only element of interest on this page is 'Sales Rank'; 'Product Dimensions', 'Shipping Weight', and 'Item model number' are not present. However, I would like a return similar to that of URL_1, where the missing elements have a value of 'NA'. In other words: the same results as URL_1, only with 'NA' given when an element is not present. I have had success accomplishing this by setting a value prior to the try/except statement, e.g. shipping_weight = 'NA', then running try/except: pass, so I get 'NA' and my dictionary is not empty.
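That preset-default pattern can be sketched without any scraping at all; the `extract` helper, the `details` dict, and its keys below are hypothetical stand-ins for parsed table rows:

```python
def extract(details, key):
    value = 'NA'  # preset the default so a missing field still yields an entry
    try:
        value = details[key].strip()
    except KeyError:
        pass  # field absent on this page; keep the 'NA' default
    return value

# Hypothetical parsed table: only one of the two fields is present
details = {'Shipping Weight': ' 4.8 ounces '}
print(extract(details, 'Shipping Weight'))      # 4.8 ounces
print(extract(details, 'Product Dimensions'))   # NA
```

The same idea applies unchanged when the lookup is a soup.find(...) that may raise AttributeError instead of a dict access raising KeyError.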
You could use stripped_strings and the :contains selector with bs4 4.7.1. This feels like a lot of jiggery-pokery to get the desired output format; I'm sure someone with more Python experience could reduce this and improve its efficiency. The merging-dicts syntax is taken from #aaronhall.
import requests
from bs4 import BeautifulSoup as bs
import re

links = ['https://www.amazon.com/Professional-Dental-Guard-Remoldable-Customizable/dp/B07L4YHBQ4',
         'https://www.amazon.com/dp/B0040ODFK4/?tag=stackoverfl08-20']

for link in links:
    r = requests.get(link, headers={'User-Agent': 'Mozilla/5.0'})
    soup = bs(r.content, 'lxml')
    fields = ['Product Dimensions', 'Shipping Weight', 'Item model number', 'Amazon Best Sellers Rank']
    temp_dict = {}
    for field in fields:
        element = soup.select_one('li:contains("' + field + '")')
        if element is None:
            temp_dict[field] = 'N/A'
        elif field == 'Amazon Best Sellers Rank':
            item = [re.sub(r'#|\(', '', string).strip() for string in element.stripped_strings][1].split(' in ')
            temp_dict[field] = item
        else:
            item = [string for string in element.stripped_strings][1]
            temp_dict[field] = item.replace('(', '').strip()

    ranks = soup.select('.zg_hrsr_rank')
    ladders = soup.select('.zg_hrsr_ladder')
    if ranks:
        cat_nos = [item.text.split('#')[1] for item in ranks]
    else:
        cat_nos = ['N/A']
    if ladders:
        cats = [item.text.split('\xa0')[1] for item in ladders]
    else:
        cats = ['N/A']
    rankings = dict(zip(cat_nos, cats))

    map_dict = {
        'Product Dimensions': 'dimensions',
        'Shipping Weight': 'weight',
        'Item model number': 'Item_No',
        'Amazon Best Sellers Rank': ['R1_NO', 'R1_CAT']
    }
    final_dict = {}
    for k, v in temp_dict.items():
        if k == 'Amazon Best Sellers Rank' and v != 'N/A':
            item = dict(zip(map_dict[k], v))
            final_dict = {**final_dict, **item}
        elif k == 'Amazon Best Sellers Rank' and v == 'N/A':
            item = dict(zip(map_dict[k], [v, v]))
            final_dict = {**final_dict, **item}
        else:
            final_dict[map_dict[k]] = v

    for k, v in enumerate(rankings):
        # k is the 0-based index, v is the category-number key
        prefix = 'R' + str(k + 2) + '_'
        final_dict[prefix + 'NO'] = v
        final_dict[prefix + 'CAT'] = rankings[v]
    print(final_dict)

Create a loop by iterating a string throughout a code

I have the following code:
from urllib.request import urlopen
from bs4 import BeautifulSoup
import pandas as pd
import requests
from requests import get

date = []
tourney_round = []
result = []
winner_odds = []
loser_odds = []
surface = []
players_and_tourney = []

response = get('http://www.tennisexplorer.com/player/humbert-e2553/?annual=all')
page_html = BeautifulSoup(response.text, 'html.parser')
results2018_containers = page_html.find_all('div', id='matches-2018-1-data')

for container in results2018_containers:
    played_date_2018 = results2018_containers[0].findAll('td', class_='first time')
    for i in played_date_2018:
        date.append(i.text)
string_2018 = '2018'
date = [x + string_2018 for x in date]

for container in results2018_containers:
    rounds_2018 = results2018_containers[0].findAll('td', class_='round')
    for i in rounds_2018:
        tourney_round.append(i.text)

for container in results2018_containers:
    results_2018 = results2018_containers[0].findAll('td', class_='tl')
    for i in results_2018:
        result.append(i.text)

for container in results2018_containers:
    surfaces_2018 = results2018_containers[0].findAll('td', class_='s-color')
    for i in surfaces_2018:
        surface.append(i.find('span')['title'])

for container in results2018_containers:
    odds_2018 = results2018_containers[0].findAll('td', class_='course')
    winner_odds_2018 = odds_2018[0:][::2]
    for i in winner_odds_2018:
        winner_odds.append(i.text)
    loser_odds_2018 = odds_2018[1:][::2]
    for i in loser_odds_2018:
        loser_odds.append(i.text)

for container in results2018_containers:
    namesandtourney_2018 = results2018_containers[0].findAll('td', class_='t-name')
    for i in namesandtourney_2018:
        players_and_tourney.append(i.text)

from itertools import chain, groupby, repeat
chainer = chain.from_iterable

def condition(x):
    return x.startswith('\xa0')

elements = [list(j) for i, j in groupby(players_and_tourney, key=condition) if not i]
# create list of headers
headers = [next(j) for i, j in groupby(players_and_tourney, key=condition) if i]

# chain list of lists, and use repeat for headers
initial_df_2018 = pd.DataFrame({'Date': date,
                                'Surface': surface,
                                'Players': list(chainer(elements)),
                                'Tournament': list(chainer(repeat(i, j) for i, j in
                                                           zip(headers, map(len, elements)))),
                                'Round': tourney_round,
                                'Result': result,
                                'Winner Odds': winner_odds,
                                'Loser Odds': loser_odds})

initial_df_2018['Winner'], initial_df_2018['Loser'] = initial_df_2018['Players'].str.split(' - ', 1).str
del initial_df_2018['Players']
initial_df_2018 = initial_df_2018[['Date', 'Surface', 'Tournament', 'Winner', 'Loser', 'Result', 'Winner Odds', 'Loser Odds']]
I want to create a loop that runs this code for every year starting from 2005: basically, replacing 2018 throughout the code with each year between 2005 and 2018. If possible, the code would run first for 2018, then 2017, and so on down to 2005.
Edit: I added the code that I used to pull data for the year 2018, but I want a loop that will pull data for all the years that can be found on the page.
If I understood you correctly, you want to run the request you wrote for 2018 for all years between 2005 and 2018.
What I did was loop over your code for the years in that range, replacing the id each time and collecting the data for each year.
response = get('http://www.example.com')
page_html = BeautifulSoup(response.text, 'html.parser')
date_dict = {}
for year in range(2019, 1, -1):
    date = []
    string_id = "played-{}-data".format(year)
    results_containers = page_html.find_all('div', id=string_id)
    if not results_containers:  # find_all returns an empty list (never None) when nothing matches
        continue
    for container in results_containers:
        played_date = results_containers[0].findAll('td', class_='plays')
        for i in played_date:
            date.append(i.text)
    if year not in date_dict:
        date_dict[year] = []
    date_dict[year] += date
You can store the year as an integer but still use it in a string.
for year in range(2018, 2004, -1):
    print(f"Happy New Year {year}")
Other ways to include a number in a string are "Happy New Year {}".format(year) or "it is now " + str(year) + " more text".
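As a quick sanity check, all three forms produce the same id string (using the `played-…-data` pattern from the answer above):

```python
year = 2018
by_fstring = f"played-{year}-data"                 # f-string (Python 3.6+)
by_format = "played-{}-data".format(year)          # str.format
by_concat = "played-" + str(year) + "-data"        # plain concatenation
assert by_fstring == by_format == by_concat == "played-2018-data"
print(by_fstring)  # played-2018-data
```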
Also, I don't think you do, but if someone finds this and really wants to "iterate a string" caesar ciphers are a good place to look.
There's no problem looping over that, but you need to define how you want your results. I used a dictionary here, and I've turned your code into a function that I can call with variables:
def get_data(year):
    date = []
    response = get('http://www.example.com')
    page_html = BeautifulSoup(response.text, 'html.parser')
    results_containers = page_html.find_all('div', id='played-{}-data'.format(year))
    for container in results_containers:
        played_date = results_containers[0].findAll('td', class_='plays')
        for i in played_date:
            date.append(i.text)
    return date
Now all I have to do is create a range of possible years and call the function each time; this can be done as simply as:
all_data = {year: get_data(year) for year in range(2018, 2004, -1)}
Just use a for loop over a range. Something like:
date = []
response = get('http://www.example.com')
page_html = BeautifulSoup(response.text, 'html.parser')
for year in range(2018, 2004, -1):
    year_id = 'played-{}-data'.format(year)
    results_containers = page_html.find_all('div', id=year_id)
    ...

How to go through all items and then save them in a dictionary key

I want to automatically load a code from a website.
I have a list with some names and want to go through every item: take the first item, make a request, open the website, copy the code/number from the HTML (text in a span), then save this result in a dictionary, and so on for all items.
I read all lines from a csv and save them into a list.
After this I make a request to load the HTML from the website, search for the company, and read the numbers from the span.
My code:
with open(test_f, 'r') as file:
    rows = csv.reader(file, delimiter=',', quotechar='"')
    data = [data for data in rows]
    print(data)

url_part1 = "http://www.monetas.ch/htm/651/de/Firmen-Suchresultate.htm?Firmensuche="
url_enter_company = [data for data in rows]
url_last_part = "&CompanySearchSubmit=1"

firma_noga = []
for data in firma_noga:
    search_noga = url_part1 + url_enter_company + url_last_part
    r = requests.get(search_noga)
    soup = BeautifulSoup(r.content, 'html.parser')
    lii = soup.find_all("span")
    # print all numbers that are in a span
    numbers = [d.text for d in lii]
    print("NOGA Codes: ")
I want to get the result in a dictionary, where the key is the company name (the item in the list) and the value is the number I read from the span:
dict = {"firma1": "620100", "firma2": "262000, 465101"}
Can someone help me? I am new to web scraping and Python and don't know what I am doing wrong.
Split your string with a regex and handle each part depending on whether it is a number or not:
import re

for partial in re.split('([0-9]+)', myString):
    try:
        print(int(partial))
    except:
        print(partial + ' is not a number')
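For instance, splitting with a capturing group keeps the matched digits in the result list (the input string here is made up):

```python
import re

# A capturing group in the pattern makes re.split keep the matches themselves
parts = re.split('([0-9]+)', 'Code 620100 found')
print(parts)  # ['Code ', '620100', ' found']
```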
EDIT:
Well, myString is expected to be a string. To get the text content of your spans as a string, you should be able to use .text, something like this:
spans = soup.find_all('span')
for span in spans:
    myString = span.text
    for partial in re.split('([0-9]+)', myString):
        try:
            print(int(partial))
        except:
            print(partial + ' is not a number')
Abstracting from my requirements in the comments, I think something like this should work for you:
firma_noga = ['firma1', 'firma2', 'firma3']  # NOT empty, unlike in your code!
res_dict = {}
for data in firma_noga:
    search_noga = url_part1 + data + url_last_part  # build the URL from the current company name
    r = requests.get(search_noga)
    soup = BeautifulSoup(r.content, 'html.parser')
    lii = soup.find_all("span")
    for l in lii:
        if data not in res_dict:
            res_dict[data] = [l]
        else:
            res_dict[data].append(l)
Obviously this will only work if firma_noga is not empty, unlike in your code; all the rest of your parsing logic should be valid as well.
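To get the exact output format asked for (one comma-separated string per company), the collected span texts can be joined afterwards; the company names and numbers below are made-up placeholders:

```python
# Hypothetical data: span numbers already collected per company
spans_by_company = {
    'firma1': ['620100'],
    'firma2': ['262000', '465101'],
}
# Join each company's numbers into a single comma-separated string
results = {company: ", ".join(numbers) for company, numbers in spans_by_company.items()}
print(results)  # {'firma1': '620100', 'firma2': '262000, 465101'}
```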

Merge Python arrays in a loop?

I'm currently getting an output of A,A,B,B instead of A,B,A,B.
I really want to associate the values of each table header with each table data element (like a dictionary).
import requests
from bs4 import BeautifulSoup

courseCode = "IFB104"
page = requests.get("https://www.qut.edu.au/study/unit?unitCode=" + courseCode)
soup = BeautifulSoup(page.content, 'html.parser')
table = soup.find_all(class_='table assessment-item')
numOfTables = 0
tableDataArray = []
for tbl in table:
    numOfTables = numOfTables + 1
    tableDataArray += [tbl.find_all('th'), tbl.find_all('td')]
If I understood correctly, you need to use a dict instead of a list:
import requests
from bs4 import BeautifulSoup

courseCode = "IFB104"
page = requests.get("https://www.qut.edu.au/study/unit?unitCode=" + courseCode)
soup = BeautifulSoup(page.content, 'html.parser')
table = soup.find_all(class_='table assessment-item')
numOfTables = 0
tableFormatted1 = []
tableFormatted2 = {}
for tbl in table:
    numOfTables = numOfTables + 1
    keys = tbl.find_all('th')
    values = tbl.find_all('td')
    new_data = dict(zip(keys, values))
    # Method 1: one dictionary per table
    tableFormatted1.append(new_data)
    # Method 2: one dictionary with a list of values per header
    for k, v in new_data.items():
        if k in tableFormatted2:
            tableFormatted2[k].append(v)
        else:
            tableFormatted2[k] = [v]

print('List of dictionaries')
print(tableFormatted1)
print('')
print('Dictionary with list')
print(tableFormatted2)
Edited:
Each iteration over tbl would otherwise overwrite the data from the previous one, so it is necessary to change the structure. I've provided two methods.
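The core of both methods is pairing the th and td lists positionally with zip; a minimal stdlib-only illustration (the header/value strings are made up):

```python
# Made-up header/value strings standing in for th/td text from one table
ths = ['Weight', 'Due date']
tds = ['40%', 'Week 5']
# zip pairs elements by position, giving A,B pairs rather than A,A,B,B
paired = dict(zip(ths, tds))
print(paired)  # {'Weight': '40%', 'Due date': 'Week 5'}
```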
