Facing issues while web scraping with Python

I am trying to extract reviews from Glassdoor. However, I am running into issues. My code is below:
import requests
from bs4 import BeautifulSoup

headers = requests.utils.default_headers()
headers.update({
    'User-Agent': 'Mozilla/5.0 (X11; Ubuntu; Linux x86_64; rv:52.0) Gecko/20100101 Firefox/52.0',
})
url = requests.get("https://www.glassdoor.co.in/Reviews/The-Wonderful-Company-Reviews-E1005987.htm?sort.sortType=RD&sort.ascending=false&countryRedirect=true", headers=headers)
urlContent = BeautifulSoup(url.content, "lxml")
print(urlContent)

review = urlContent.find_all('a', class_='reviewLink')
review
title = []
for i in range(0, len(review)):
    title.append(review[i].get_text())
title

rating = urlContent.find_all('div', class_='v2__EIReviewsRatingsStylesV2__ratingNum v2__EIReviewsRatingsStylesV2__small')
score = []
for i in range(0, len(rating)):
    score.append(rating[i].get_text())

rev_pros = urlContent.find_all("span", {"data-test": "pros"})
pros = []
for i in range(0, len(rev_pros)):
    pros.append(rev_pros[i].get_text())
pros

rev_cons = urlContent.find_all("span", {"data-test": "cons"})
cons = []
for i in range(0, len(rev_cons)):
    cons.append(rev_cons[i].get_text())
cons

advse = urlContent.find_all("span", {"data-test": "advice-management"})
advse
advise = []
for i in range(0, len(advse)):
    advise.append(advse[i].get_text())
advise

location = urlContent.find_all('span', class_='authorLocation')
location
job_location = []
for i in range(0, len(location)):
    job_location.append(location[i].get_text())
job_location

import pandas as pd

df = pd.DataFrame()
df['Review Title'] = title
df['Overall Score'] = score
df['Pros'] = pros
df['Cons'] = cons
df['Jobs_Location'] = job_location
df['Advise to Mgmt'] = advise
Here I am facing two challenges:
1. I am unable to extract anything for 'advse' (used for the 'Advise to Mgmt' column).
2. I get an error when I use 'Job Location' as a column in the data frame (ValueError: Length of values does not match length of index). My finding for this error was that there are ten rows for the other columns, but fewer rows for 'Job Location' because the location is not disclosed in some reviews.
Can anybody help me with this? Thanks in advance.

A better approach would be to find a <div> that encloses each of the reviews and then extract all the information needed from it before moving to the next. This would make it easier to deal with the case where information is missing in some reviews.
For example:
import requests
from bs4 import BeautifulSoup
import pandas as pd

headers = requests.utils.default_headers()
headers.update({
    'User-Agent': 'Mozilla/5.0 (X11; Ubuntu; Linux x86_64; rv:52.0) Gecko/20100101 Firefox/52.0',
})
url = requests.get("https://www.glassdoor.co.in/Reviews/The-Wonderful-Company-Reviews-E1005987.htm?sort.sortType=RD&sort.ascending=false&countryRedirect=true", headers=headers)
urlContent = BeautifulSoup(url.content, "lxml")

get_text = lambda x: x.get_text(strip=True) if x else ""

entries = []

for entry in urlContent.find_all('div', class_='row mt'):
    review = entry.find('a', class_="reviewLink")
    rating = entry.find('div', class_='v2__EIReviewsRatingsStylesV2__ratingNum v2__EIReviewsRatingsStylesV2__small')
    rev_pros = entry.find("span", {"data-test": "pros"})
    rev_cons = entry.find("span", {"data-test": "cons"})
    location = entry.find('span', class_='authorLocation')
    advice = entry.find("span", {"data-test": "advice-management"})

    entries.append([
        get_text(review),
        get_text(rating),
        get_text(rev_pros),
        get_text(rev_cons),
        get_text(location),
        get_text(advice)
    ])

columns = ['Review Title', 'Overall Score', 'Pros', 'Cons', 'Jobs_Location', 'Advise to Mgmt']
df = pd.DataFrame(entries, columns=columns)
print(df)
The get_text() function ensures that if nothing was returned (i.e. None) then an empty string is returned.
You will need to improve your logic for extracting the advice. The information for the whole page is held inside <script> tags, one of which holds the JSON data. The advice text is not moved into the HTML until a user clicks on it, so it needs to be extracted from the JSON. If this approach is used, it also makes sense to extract all of the other information directly from the JSON.
To do this, locate all the <script> tags and determine which contains the reviews. Convert the JSON into a Python data structure (using the JSON library). Now locate the reviews, for example:
import requests
from bs4 import BeautifulSoup
import json
import pandas as pd

headers = requests.utils.default_headers()
headers.update({
    'User-Agent': 'Mozilla/5.0 (X11; Ubuntu; Linux x86_64; rv:52.0) Gecko/20100101 Firefox/52.0',
})
url = requests.get("https://www.glassdoor.co.in/Reviews/The-Wonderful-Company-Reviews-E1005987.htm?sort.sortType=RD&sort.ascending=false&countryRedirect=true", headers=headers)
urlContent = BeautifulSoup(url.content, "lxml")

entries = []

for script in urlContent.find_all('script'):
    text = script.text

    if "appCache" in text:
        # extract the JSON from the script tag
        data = json.loads(text[text.find('{'): text.rfind('}') + 1])

        # Go through all keys in the dictionary and pick those containing reviews
        for key, value in data['apolloState'].items():
            if ".reviews." in key and "links" not in key:
                location = value['location']
                city = location['id'] if location else None

                entries.append([
                    value['summary'],
                    value['ratingOverall'],
                    value['pros'],
                    value['cons'],
                    city,
                    value['advice']
                ])

columns = ['Review Title', 'Overall Score', 'Pros', 'Cons', 'Jobs_Location', 'Advise to Mgmt']
df = pd.DataFrame(entries, columns=columns)
print(df)
This would give you a dataframe as follows:
Review Title Overall Score Pros Cons Jobs_Location Advise to Mgmt
0 Upper management n... 3 Great benefits, lo... Career advancement... City:1146821 Listen to your emp...
1 Sales 2 Good atmosphere lo... Drive was very far... None None
2 As an organization... 2 Free water and goo... Not a lot of diver... None None
3 Great place to grow 4 If your direct man... Owners are heavily... City:1146821 None
4 Great Company 5 Great leadership, ... To grow and move u... City:1146821 None
5 Lots of opportunit... 5 This is a fast pac... There's a sense of... City:1146821 Continue listening...
6 Interesting work i... 3 Working with great... High workload and ... None None
7 Wonderful 5 This company care... The drive, but we ... City:1146577 Continue growing y...
8 Horrendous 1 The pay was fairly... Culture of abuse a... City:1146821 Upper management l...
9 Upper Leadership a... 1 Strong Company, fu... You don't have a f... City:1146577 You get rid of fol...
It would help if you added print(data) to see the whole structure of the data being returned. The only issue with this approach is that a further lookup would be needed to convert the city ID into an actual location; that information is also contained in the JSON.
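For example, a minimal sketch of that lookup, assuming the city entries appear in data['apolloState'] under keys like 'City:1146821' with a 'name' field (verify the actual structure with print(data) first):

# Hedged sketch: map city IDs to names using the same apolloState dictionary.
# The "City:<id>" key pattern and the 'name' field are assumptions; confirm
# them by printing data['apolloState'] before relying on this.
city_names = {
    key: value.get('name')
    for key, value in data['apolloState'].items()
    if isinstance(value, dict) and key.startswith('City:')
}
df['Jobs_Location'] = df['Jobs_Location'].map(city_names).fillna(df['Jobs_Location'])
print(df)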

Related

Web scrape using Python - Execution takes too long

I am trying to webscrape the "Active Positions" table from the following website:
https://www.nasdaq.com/market-activity/stocks/aapl/institutional-holdings
My code is below:
from bs4 import BeautifulSoup
import requests
html_text = requests.get('https://www.nasdaq.com/market-activity/stocks/aapl/institutional-holdings')
soup = BeautifulSoup(html_text, 'lxml')
job1 = soup.find('div', classs_ = 'dialog-off-canvas-main-canvas')
job2 = job1.find('div', class_ = 'page with-primary-nav hide-more-videos')
job3 = job2.find('div', class_ = 'page__main')
job4 = job3.find('div', class_ = 'page__content')
job5 = job4.find('div', class_ = 'quote-subdetail__content quote-subdetail__content--new')
job6 = job5.findAll('div', class_ = 'layout layout--2-col-large')
job7 = job6.find('div', class_ = 'institutional-holdings institutional-holdings--paginated')
job8 = job7.find('div', class_ = 'institutional-holdings__section institutional-holdings__section--active-positions')
job9 = job8.find('div', class_ = 'institutional-holdings__table-container')
job10 = job9.find('table', class_ = 'institutional-holdings__table')
job11 = job10.find('tbody', class_ = 'institutional-holdings__body')
job12 = job11.findAll('tr', class_ = 'institutional-holdings__row').text
print(job12)
I have chosen to include nearly every class path to attempt to speed up the execution, as including only a couple took up to 10 minutes before I decided to interrupt it. However, I still get the same long execution with no output. Is there something wrong with my code? Or can I improve this by doing something I haven't thought of? Thanks.
Data is being hydrated into the page via JavaScript XHR calls. Here is a way of getting the Active Positions data by scraping the API endpoint directly:
import requests
import pandas as pd

url = 'https://api.nasdaq.com/api/company/AAPL/institutional-holdings?limit=15&type=TOTAL&sortColumn=marketValue&sortOrder=DESC'

headers = {
    'accept': 'application/json, text/plain, */*',
    'origin': 'https://www.nasdaq.com',
    'User-Agent': 'Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/104.0.5112.79 Safari/537.36'
}

r = requests.get(url, headers=headers)

df = pd.json_normalize(r.json()['data']['activePositions']['rows'])

print(df)
Result in terminal:
positions holders shares
0 Increased Positions 1,780 239,170,203
1 Decreased Positions 2,339 209,017,331
2 Held Positions 283 8,965,339,255
3 Total Institutional Shares 4,402 9,413,526,789
In case you want to scrape the big 4,402 Institutional Holders table, there are ways for that too.
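For instance, here is a minimal sketch of one such way. The larger limit value and the 'institutionalHolders' key are assumptions, so inspect r.json()['data'].keys() first to confirm how the holders table is actually exposed:

# Hedged sketch: request more rows from the same endpoint and normalise the
# holders table. The key name 'institutionalHolders' is an assumption; print
# the response keys to verify it before using this.
holders_url = ('https://api.nasdaq.com/api/company/AAPL/institutional-holdings'
               '?limit=4500&type=TOTAL&sortColumn=marketValue&sortOrder=DESC')
r2 = requests.get(holders_url, headers=headers)
payload = r2.json().get('data') or {}
holders = payload.get('institutionalHolders') or {}
holders_df = pd.json_normalize(holders.get('rows') or [])
print(holders_df.head())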
EDIT: Here is how you can save the data to a json file:
df.to_json('active_positions.json')
Although it might make more sense to save it as tabular data (csv):
df.to_csv('active_positions.csv')
Pandas docs: https://pandas.pydata.org/docs/

How can I scrape data beyond the page limit on Zillow?

I created a script to scrape Zillow data and it works fine. The only problem I have is that it's limited to 20 pages, even though there are many more results. Is there a way to get around this page limitation and scrape all the data?
I also wanted to know if there is a general solution to this problem, since I encounter it on practically every site that I want to scrape.
Thank you
from bs4 import BeautifulSoup
import requests
import lxml
import json
import pandas as pd

headers = {
    "User-Agent": "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/102.0.0.0 Safari/537.36",
    "Accept-Language": "en-US,en;q=0.9"
}

search_link = 'https://www.zillow.com/homes/Florida--/'
response = requests.get(url=search_link, headers=headers)

pages_number = 19

def OnePage():
    soup = BeautifulSoup(response.text, 'lxml')
    data = json.loads(
        soup.select_one("script[data-zrr-shared-data-key]")
        .contents[0]
        .strip("!<>-")
    )
    all_data = data['cat1']['searchResults']['listResults']

    result = []
    for i in range(len(all_data)):
        property_link = all_data[i]['detailUrl']
        property_response = requests.get(url=property_link, headers=headers)
        property_page_source = BeautifulSoup(property_response.text, 'lxml')
        property_data_all = json.loads(json.loads(property_page_source.find('script', {'id': 'hdpApolloPreloadedData'}).get_text())['apiCache'])
        zp_id = str(json.loads(property_page_source.find('script', {'id': 'hdpApolloPreloadedData'}).get_text())['zpid'])
        property_data = property_data_all['ForSaleShopperPlatformFullRenderQuery{"zpid":'+zp_id+',"contactFormRenderParameter":{"zpid":'+zp_id+',"platform":"desktop","isDoubleScroll":true}}']["property"]

        # one dict per listing
        home_info = {}
        home_info["Broker Name"] = property_data['attributionInfo']['brokerName']
        home_info["Broker Phone"] = property_data['attributionInfo']['brokerPhoneNumber']
        result.append(home_info)

    return result

data = pd.DataFrame()
all_page_property_info = []

for page in range(pages_number):
    property_info_one_page = OnePage()
    search_link = 'https://www.zillow.com/homes/Florida--/'+str(page+2)+'_p'
    response = requests.get(url=search_link, headers=headers)
    all_page_property_info = all_page_property_info + property_info_one_page

data = pd.DataFrame(all_page_property_info)
data.to_csv(f"/Users//Downloads/Zillow Search Result.csv", index=False)
Actually, you can't grab most of the data from Zillow using bs4 alone, because it is loaded dynamically by JavaScript and bs4 can't render JavaScript; only 6 to 8 data items are static. However, all of the data is present in a script tag, inside an HTML comment, as JSON. You can pull the required data from there by following the next example.
This way you can extract all the items; pulling the rest of the data items is just a matter of adding the fields you need (see the sketch after the output below).
Zillow is a well-known and fairly sophisticated website, so we should respect its terms and conditions.
Example:
import requests
import re
import json
import pandas as pd

url = 'https://www.zillow.com/fl/{page}_p/?searchQueryState=%7B%22usersSearchTerm%22%3A%22FL%22%2C%22mapBounds%22%3A%7B%22west%22%3A-94.21964006249998%2C%22east%22%3A-80.68448381249998%2C%22south%22%3A22.702203494269085%2C%22north%22%3A32.23788425255877%7D%2C%22regionSelection%22%3A%5B%7B%22regionId%22%3A14%2C%22regionType%22%3A2%7D%5D%2C%22isMapVisible%22%3Afalse%2C%22filterState%22%3A%7B%22sort%22%3A%7B%22value%22%3A%22days%22%7D%2C%22ah%22%3A%7B%22value%22%3Atrue%7D%7D%2C%22isListVisible%22%3Atrue%2C%22mapZoom%22%3A6%2C%22pagination%22%3A%7B%22currentPage%22%3A2%7D%7D'

lst = []
for page in range(1, 21):
    r = requests.get(url.format(page=page), headers={'User-Agent': 'Mozilla/5.0'})
    data = json.loads(re.search(r'!--(\{"queryState".*?)-->', r.text).group(1))

    for item in data['cat1']['searchResults']['listResults']:
        price = item['price']
        lst.append({'price': price})

df = pd.DataFrame(lst)
df.to_csv('out.csv', index=False)
print(df)
Output:
price
0 $354,900
1 $164,900
2 $155,000
3 $475,000
4 $245,000
.. ...
795 $295,000
796 $10,000
797 $385,000
798 $1,785,000
799 $1,550,000
[800 rows x 1 columns]
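To pull the rest of the data items, extend the inner loop with the fields you need. A sketch is below; apart from detailUrl (which appears in the question's own code), the key names are assumptions, so print one item's keys first to confirm them:

# Hedged sketch: collect more fields per listing. 'address', 'beds' and 'baths'
# are assumed key names; 'detailUrl' is taken from the question's code.
for item in data['cat1']['searchResults']['listResults']:
    lst.append({
        'price': item.get('price'),
        'address': item.get('address'),
        'beds': item.get('beds'),
        'baths': item.get('baths'),
        'detail_url': item.get('detailUrl'),
    })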

Extracting the required information for a Script tag of scraped webpage using BeautifulSoup

I'm a webscraping novice and I am looking for pointers of what to do next, or potentially a working solution, to scrape the following webpage: https://www.capology.com/club/leicester/salaries/2019-2020/
I would like to extract the following for each row (player) of the table:
Player Name i.e. Jamie Vardy
Weekly Gross Base Salary (in GBP) i.e. £140,000
Annual Gross Base Salary (in GBP) i.e. £7,280,000
Position i.e. F
Age i.e. 33
Country England
The following code creates the 'soup' for the JavaScript table of information I want:
import requests
from bs4 import BeautifulSoup
import json
headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:75.0) Gecko/20100101 Firefox/75.0'}
url = 'https://www.capology.com/club/leicester/salaries/2019-2020/'
r = requests.get(url)
soup = BeautifulSoup(r.content, 'html.parser')
script = soup.find_all('script')[11].string # 11th script tag in the webpage
I can see that the 'soup' assigned to the script variable has all the information I need; however, I am struggling to extract the information I need into a pandas DataFrame.
I would subsequently like to set this up for pagination, to scrape each team in the 'Big 5' European Leagues (Premier League, Serie A, La Liga, Bundesliga, and Ligue 1) for the 17-18, 18-19, 19-20, and 20-21 (current) seasons. However, that's the final-stage solution and I am happy to go away and try to do that myself if it's a time-consuming request.
A working solution would be fantastic but just some pointers so that I can go away and learn this stuff myself as efficiently as possible would be great.
Thanks very much!
This is a task that is best suited for a tool like selenium, as the site uses the script to populate the page with the table after it loads, and it is not trivial to parse the values from the script source:
from selenium import webdriver
from bs4 import BeautifulSoup as soup
import urllib.parse, collections, re

d = webdriver.Chrome('/path/to/chromedriver')
d.get((url:='https://www.capology.com/club/leicester/salaries/2019-2020/'))

league_teams = d.execute_script("""
    var results = [];
    for (var i of Array.from(document.querySelectorAll('li.green-subheader + li')).slice(0, 5)){
        results.push({league:i.querySelector('.league-title').textContent,
            teams:Array.from(i.querySelectorAll('select:nth-of-type(1).team-menu option')).map(x => [x.getAttribute('value'), x.textContent]).slice(1),
            years:Array.from(i.querySelectorAll('select:nth-of-type(2).team-menu option')).map(x => [x.getAttribute('value'), x.textContent]).slice(2)})
    }
    return results;
""")

vals = collections.defaultdict(dict)
for i in league_teams:
    for y, full_year in [[re.sub('\d{4}\-\d{4}', '2020-2021', i['years'][0][0]), '2020-21'], *i['years']][:4]:
        for t, team in i['teams']:
            d.get(urllib.parse.urljoin(url, t) + (y1:=re.findall('/\d{4}\-\d{4}/', y)[0][1:]))
            hvals = [x.get_text(strip=True) for x in soup(d.page_source, 'html.parser').select('#table thead tr:nth-of-type(3) th')]
            tvals = soup(d.page_source, 'html.parser').select('#table tbody tr')
            full_table = [dict(zip(hvals, [j.get_text(strip=True) for j in k.select('td')])) for k in tvals]
            if team not in vals[i['league']]:
                vals[i['league']][team] = {full_year: None}
            vals[i['league']][team][full_year] = full_table
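Since the original goal was a pandas DataFrame, the nested result can be flattened once the loop finishes. A minimal sketch, assuming vals has been populated exactly as above:

import pandas as pd

# Flatten the nested {league: {team: {season: [row dicts]}}} structure into one table.
records = []
for league, teams in vals.items():
    for team, seasons in teams.items():
        for season, rows in seasons.items():
            for row in (rows or []):
                records.append({'League': league, 'Team': team, 'Season': season, **row})

df = pd.DataFrame(records)
print(df.head())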

Unable to parse html in Python

Why am I not able to parse this web page, which is in HTML, into a CSV?
url = 'c:/x/x/x/xyz.html'  # html (home page of www.cloudtango.org) data is stored on a local drive
with open(url, 'r', encoding='utf-8') as f:
    html_string = f.read()

soup = bs4.BeautifulSoup('html_string.parser')

data1 = html_string.find_all('td', {'class': 'company'})

full = []
for each in data1:
    comp = each.find('img')['alt']
    desc = each.find_next('td').text
    dd = {'company': comp, 'description': desc}
    full.append(dd)
Error:
AttributeError: 'str' object has no attribute 'find_all'
html_string is of type str, so it doesn't have a .find_all() method; that method belongs to the BeautifulSoup object. Note also that the soup was built from the literal string 'html_string.parser' rather than from html_string itself.
To get the information from the specified URL, you can use the following example:
import requests
import pandas as pd
from bs4 import BeautifulSoup

url = "https://www.cloudtango.org/"
headers = {
    "User-Agent": "Mozilla/5.0 (X11; Ubuntu; Linux x86_64; rv:89.0) Gecko/20100101 Firefox/89.0"
}

soup = BeautifulSoup(requests.get(url, headers=headers).content, "html.parser")

data1 = soup.find_all("td", {"class": "company"})

full = []
for each in data1:
    comp = each.find("img")["alt"]
    desc = each.find_next("td").text
    dd = {"company": comp, "description": desc}
    full.append(dd)

print(pd.DataFrame(full))
Prints:
company description
0 BlackPoint IT Services BlackPoint’s comprehensive range of Managed IT Services is designed to help you improve IT quality, efficiency and reliability -and save you up to 50% on IT cost. Providing IT solutions for more …
1 ICC Managed Services The ICC Group is a global and independent IT solutions company, providing a comprehensive, customer focused service to the SME, enterprise and public sector markets. \r\n\r\nICC deliver a full …
2 First Focus First Focus is Australia’s best managed service provider for medium sized organisations. With tens of thousands of end users supported across hundreds of customers, First Focus has the experience …
...and so on.
EDIT: To read from local file:
import pandas as pd
from bs4 import BeautifulSoup

with open('your_file.html', 'r') as f_in:
    soup = BeautifulSoup(f_in.read(), "html.parser")

data1 = soup.find_all("td", {"class": "company"})

full = []
for each in data1:
    comp = each.find("img")["alt"]
    desc = each.find_next("td").text
    dd = {"company": comp, "description": desc}
    full.append(dd)

print(pd.DataFrame(full))
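Since the original goal was a CSV rather than just a printed DataFrame, the assembled rows can be written out in either case (the filename is arbitrary):

# Write the scraped rows to a CSV file.
pd.DataFrame(full).to_csv('cloudtango_companies.csv', index=False)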

Unsure how to web-scrape a specific value that could be in several different places

So I've been working on a web-scraping program and have been having some difficulties with one of the last bits.
There is this website that shows records of in-game fights like so:
Example 1: https://zkillboard.com/kill/44998359/
Example 2: https://zkillboard.com/kill/44917133/
I am trying to always scrape the full information of the player who scored the killing blow. That means their name, their corporation name, and their alliance name.
The information for the above examples are:
Example 1: Name = Happosait, Corp. = Arctic Light Inc., Alliance = Arctic Light
Example 2: Name = Lord Veninal, Corp. = Sniggerdly, Alliance = Pandemic Legion
While the "Final Blow" is always listed in the top right with the name, the name does not have the corporation and alliance with it as well. The full information is always listed below in the right-hand column, "## Involved", but their location in that column depends on how much damage they did in the fight, so it is not always on top, or anywhere specific for that matter.
So while I can get their names with:
kbPilotName = soup.find_all('td', style="text-align: center;")[0].find_all('a', href=re.compile('/character/'))[0].img.get('alt')
How can I get the rest of their information?
There is a textarea element containing all the data you are looking for. It's all in one text, but it's structured. You can choose a different way to parse it, but here is an example using regex:
import re
from bs4 import BeautifulSoup
import requests

url = 'https://zkillboard.com/kill/44998359/'
pattern = re.compile(r"(?s)Name: (.*?)Security: (.*?)Corp: (.*?)Alliance: (.*?)")

with requests.Session() as session:
    session.headers = {'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_10_2) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/40.0.2214.115 Safari/537.36'}
    response = session.get(url)

    soup = BeautifulSoup(response.content, "html.parser")
    data = soup.select('form.form textarea#eft')[0].text

    for name, security, corp, alliance in pattern.findall(data):
        print(name.strip())
Prints:
Happosait (laid the final blow)
Baneken
Perkel
Tibor Vherok
Kheo Dons
Kayakka
Lina Ectelion
Jay Burner
Zalamus
Draacan Ferox
Luwanii
Jousen Momaki
Varcuntis Morannear
Grimm K-Man
Wob'Niar
Godfrey Silvarna
Quintus Corvus
Shadow Altair
Sieren
Isha Vir
Argyrosdraco
Jack None
Strixi
Alternative solution (parsing "involved" page):
from bs4 import BeautifulSoup
import requests

url = 'https://zkillboard.com/kill/44998359/'
involved_url = 'https://zkillboard.com/kill/44998359/involved/'

with requests.Session() as session:
    session.headers = {'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_10_2) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/40.0.2214.115 Safari/537.36'}
    session.get(url)

    response = session.get(involved_url)
    soup = BeautifulSoup(response.content, "html.parser")

    for row in soup.select('table.table tr.attacker'):
        name, corp, alliance = row.select('td.pilot > a')
        print(name.text, corp.text, alliance.text)
