I am trying to check the availability of dates/times for an exam using Python mechanize, and to send someone an email if a particular date/time becomes available in the result (result page screenshot attached).
import mechanize
from BeautifulSoup import BeautifulSoup
URL = "http://secure.dre.ca.gov/PublicASP/CurrentExams.asp"
br = mechanize.Browser()
response = br.open(URL)
# there are some errors in doctype and hence filtering the page content a bit
response.set_data(response.get_data()[200:])
br.set_response(response)
br.select_form(name="entry_form")
# select Oakland for the 1st set of checkboxes
for i in range(0, len(br.find_control(type="checkbox", name="cb_examSites").items)):
    if i == 2:
        br.find_control(type="checkbox", name="cb_examSites").items[i].selected = True
# select salesperson for the 2nd set of checkboxes
for i in range(0, len(br.find_control(type="checkbox", name="cb_examTypes").items)):
    if i == 1:
        br.find_control(type="checkbox", name="cb_examTypes").items[i].selected = True
response = br.submit()
print response.read()
I am able to get the response, but for some reason the data within my table is missing.
Here are the buttons from the initial HTML page:
<input type="submit" value="Get Exam List" name="B1">
<input type="button" value="Clear" name="B2" onclick="clear_entries()">
<input type="hidden" name="action" value="GO">
Here is the part of the output (submit response) where the actual data should be:
<table summary="California Exams Scheduling" class="General_list" width="100%" cellspacing="0"> <EVERYTHING IN BETWEEN IS MISSING HERE>
</table>
All the data within the table is missing. I have provided a screenshot of the table element from the Chrome browser.
Can someone please tell me what could be wrong?
Can someone also please tell me how to get the date/time out of the response (assuming I have to use BeautifulSoup)? It presumably has to be something along these lines. I am trying to find out whether a particular date I have in mind (say March 8th) shows up in the response with a Begin Time of 1:30 pm (screenshot attached).
soup = BeautifulSoup(response.read())
print soup.find(name="table")
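Once the submit response actually contains the table rows (which is the real problem above), a rough sketch of that date/time check could look like the following. The column order (Date, Begin Time, Location, Scheduled, Capacity) is taken from the header text quoted later in this thread, and the "3/8" and "1:30" strings are just example matches; adjust them to whatever format the site uses.
results_table = soup.find("table", {"class": "General_list"})
if results_table is not None:
    for row in results_table.findAll("tr"):
        cells = ["".join(td.findAll(text=True)).strip() for td in row.findAll("td")]
        # cells[0] ~ Date, cells[1] ~ Begin Time (assumed column order)
        if len(cells) >= 2 and "3/8" in cells[0] and "1:30" in cells[1]:
            print("Found the slot: %s at %s" % (cells[0], cells[1]))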
Update: it looks like my issue might be related to this question, and I am trying my options. I tried the below as per one of the answers, but I cannot see any tr elements in the data (though I can see them in the page source when I check it manually).
soup.findAll('table')[0].findAll('tr')
Update: modified this to use Selenium; I will try to take it further at some point soon.
from selenium import webdriver
from selenium.common.exceptions import NoSuchElementException
from selenium.webdriver.common.keys import Keys
from bs4 import BeautifulSoup
import urllib3
myURL = "http://secure.dre.ca.gov/PublicASP/CurrentExams.asp"
browser = webdriver.Firefox() # Get local session of firefox
browser.get(myURL) # Load page
element = browser.find_element_by_id("Checkbox5")
element.click()
element = browser.find_element_by_id("Checkbox13")
element.click()
element = browser.find_element_by_name("B1")
element.click()
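A possible continuation of that Selenium update (just a sketch): give the results page a moment to load, then hand the rendered HTML to BeautifulSoup and apply the same table check as sketched earlier.
import time
time.sleep(5)  # crude wait; Selenium's WebDriverWait would be more robust
soup = BeautifulSoup(browser.page_source, "html.parser")
results_table = soup.find("table", {"class": "General_list"})
print(results_table)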
5 years later, maybe this can help someone. I took your problem as a training exercise and completed it using the Requests package. (I use Python 3.9.)
The code below is in two parts: the request that retrieves the data injected into the table after the POST, and the parsing of the response.
## the request part
import requests as rq
from bs4 import BeautifulSoup as bs
url = "https://secure.dre.ca.gov/PublicASP/CurrentExams.asp"
headers = {"User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:88.0) Gecko/20100101 Firefox/88.0"}
params = {
    "cb_examSites": [
        "'Fresno'",
        "'Los+Angeles'",
        "'SF/Oakland'",
        "'Sacramento'",
        "'San+Diego'"
    ],
    "cb_examTypes": [
        "'Broker'",
        "'Salesperson'"
    ],
    "B1": "Get+Exam+List",
    "action": "GO"
}
s = rq.Session()
r = s.get(url, headers=headers)
s.headers.update({"Cookie": "%s=%s" % (r.cookies.keys()[0], r.cookies.values()[0])})
r2 = s.post(url=url, data=params)
soup = bs(r2.content, "lxml")  # contains the data you want
Parsing the response (there are a lot of ways to do it; mine is maybe a bit stuffy):
from bs4 import NavigableString, Tag
table = soup.find_all("table", class_="General_list")[0]
titles = [el.text for el in table.find_all("strong")]
def beetweenBr(soupx):
    # collect the text fragments that sit between consecutive <br> tags
    final_str = []
    for br in soupx.findAll('br'):
        next_s = br.nextSibling
        if not (next_s and isinstance(next_s, NavigableString)):
            continue
        next2_s = next_s.nextSibling
        if next2_s and isinstance(next2_s, Tag) and next2_s.name == 'br':
            text = str(next_s).strip()
            if text:
                final_str.append(next_s.strip())
    return "\n".join(final_str)
d = {}
trs = table.find_all("tr")
for tr in trs:
    tr_text = tr.text
    if tr_text in titles:
        curr_title = tr_text
        splitx = curr_title.split(" - ")
        area, job = splitx[0].split(" ")[0], splitx[1].split(" ")[0]
        if job not in d.keys():
            d[job] = {}
        if area not in d[job].keys():
            d[job][area] = []
        continue
    if (tr_text not in titles) and (tr_text != "DateBegin TimeLocationScheduledCapacity"):
        tds = tr.find_all("td")
        sub = []
        for itd, td in enumerate(tds):
            if itd == 2:
                sub.append(beetweenBr(td))
            else:
                sub.append(td.text)
        d[job][area].append(sub)
"d" contain data u want. I didn't go as far as sending an email yet.
Related
I am trying to scrape this website to get the reviews, but I am facing an issue:
The page loads only 50 reviews.
To load more, you have to click "Show More Reviews", and I don't know how to get all the data, as there is no page link; "Show more Reviews" also doesn't have a URL to explore, since the address remains the same.
url = "https://www.capterra.com/p/134048/HiMama-Preschool-Child-Care-App/#reviews"
import requests
from bs4 import BeautifulSoup
import json
import pandas as pd
a = []
url = requests.get(url)
html = url.text
soup = BeautifulSoup(html, "html.parser")
table = soup.findAll("div", {"class":"review-comments"})
#print(table)
for x in table:
    a.append(x.text)
df = pd.DataFrame(a)
df.to_csv("review.csv", sep='\t')
I know this is not pretty code, but I am just trying to get the review text first.
Kindly help, as I am a little new to this.
Looking at the website, the "Show more reviews" button makes an AJAX call that returns the additional info; all you have to do is find its link and send a GET request to it (which I've done with some simple regex):
import requests
import re
from bs4 import BeautifulSoup
headers = {
    "user-agent": "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) snap Chromium/74.0.3729.169 Chrome/74.0.3729.169 Safari/537.36"
}
url = "https://www.capterra.com/p/134048/HiMama-Preschool-Child-Care-App/#reviews"
Data = []
# Each page is equivalent to 50 comments:
MaximumCommentPages = 3
with requests.Session() as session:
    info = session.get(url)
    # Get product ID, needed for getting more comments
    productID = re.search(r'"product_id":(\w*)', info.text).group(1)
    # Extract info from main data
    soup = BeautifulSoup(info.content, "html.parser")
    table = soup.findAll("div", {"class": "review-comments"})
    for x in table:
        Data.append(x)
    # Number of pages to get:
    # Get additional data:
    params = {
        "page": "",
        "product_id": productID
    }
    while MaximumCommentPages > 1:  # number 1 because one of them was the main page data which we already extracted!
        MaximumCommentPages -= 1
        params["page"] = str(MaximumCommentPages)
        additionalInfo = session.get("https://www.capterra.com/gdm_reviews", params=params)
        print(additionalInfo.url)
        # print(additionalInfo.text)
        # Extract info for additional info:
        soup = BeautifulSoup(additionalInfo.content, "html.parser")
        table = soup.findAll("div", {"class": "review-comments"})
        for x in table:
            Data.append(x)
# Extract data the old fashioned way:
counter = 1
with open('review.csv', 'w') as f:
    for one in Data:
        f.write(str(counter))
        f.write(one.text)
        f.write('\n')
        counter += 1
Notice how I'm using a session to preserve cookies for the ajax call.
Edit 1: You can reload the webpage multiple times and call the ajax again to get even more data.
Edit 2: Save data using your own method.
Edit 3: Changed some things; it now gets any number of pages for you and saves to a file with good ol' open().
On this page, there is a series of tables, and I'm trying to get specific data from an unnamed table and unnamed cells. I used Copy Selector from the element inspector in Chrome to find the CSS selector. When I ask Python to print that specific CSS selector, I'm getting 'NoneType' object is not callable.
Specifically, on this page I am trying to get the number "198" from the table at #general-info, article:nth-child(4), table:nth-child(2).
The CSS Selector path is :
"html body div#program-details section#general-info article.grid-50 table tbody tr td"
and this comes up using the Copy Selector
#general-info > article:nth-child(4) > table:nth-child(2) > tbody > tr > td:nth-child(2)
Most of the code is accessing the site and bypassing the EULA. Skip to the bottom for the code I'm having problems with.
import mechanize
import requests
import urllib2
import urllib
import csv
from BeautifulSoup import BeautifulSoup
br = mechanize.Browser()
br.set_handle_robots(False)
br.addheaders = [("User-agent","Mozilla/5.0 (X11; U; Linux i686; en-US; rv:1.9.2.13) Gecko/20101206 Ubuntu/10.10 (maverick) Firefox/3.6.13")]
sign_in = br.open('https://login.ama-assn.org/account/login') #the login url
br.select_form(name = "go") #Alternatively you may use this instead of the above line if your form has name attribute available.
br["username"] = "wasabinoodlz" #the key "username" is the variable that takes the username/email value
br["password"] = "Bongshop10" #the key "password" is the variable that takes the password value
logged_in = br.submit() #submitting the login credentials
logincheck = logged_in.read() #reading the page body that is redirected after successful login
#print (logincheck) #printing the body of the redirected url after login
# EULA agreement stuff
cont = br.open('https://freida.ama-assn.org/Freida/eula.do').read()
cont1 = br.open('https://freida.ama-assn.org/Freida/eulaSubmit.do').read()
# Begin request for page data
req = br.open('https://freida.ama-assn.org/Freida/user/programDetails.do?pgmNumber=1205712369').read()
#Da Soups!
soup = BeautifulSoup(req)
#print soup.prettify() # use this to read html.prettify()
for score in soup.select('#general-info > article:nth-child(4) > table:nth-child(2) > tbody > tr > td:nth-child(2)'):
    print score.string
You need to initialize BeautifulSoup using the html5lib parser:
soup = BeautifulSoup(req, 'html5lib')
Also, BeautifulSoup only implements the nth-of-type pseudo-selector, so use that in place of nth-child:
data = soup.select(
'#general-info > '
'article:nth-of-type(4) > '
'table:nth-of-type(2) > '
'tbody > '
'tr > '
'td:nth-of-type(2)'
)
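Assuming bs4 (which the html5lib parser above implies), printing the matched cell's text (e.g. the 198 you are after) could then be:
for td in data:
    print(td.get_text(strip=True))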
In the code below, I fill a form, then submit it on a website. Then I scrape the resulting data and write it to a CSV file (all of this works very well). But there is a link on that result page with the text 'Later'; please, how can I click this link? I am using mechanize. I have checked a similar question: this, but it doesn't quite answer my question.
# import needed libraries
from mechanize import Browser
from datetime import datetime
from bs4 import BeautifulSoup
import csv
br = Browser()
# Ignore robots.txt
br.set_handle_robots(False)
# Google demands a user-agent that isn't a robot
br.addheaders = [('User-agent', 'Chrome')]
# Retrieve the Google home page, saving the response
br.open('http://fahrplan.sbb.ch/bin/query.exe/en')
# Enter the text input (This section should be automated to read multiple text input as shown in the question)
br.select_form(nr=6)
br.form["REQ0JourneyStopsS0G"] = 'Eisenstadt' # Origin train station (From)
br.form["REQ0JourneyStopsZ0G"] ='sarajevo' # Destination train station (To)
br.form["REQ0JourneyTime"] = '5:30' # Search Time
br.form["date"] = '18.01.17' # Search Date
# Get the search results
br.submit()
# get the response from mechanize Browser
soup = BeautifulSoup(br.response().read(), 'html.parser', from_encoding="utf-8")
trs = soup.select('table.hfs_overview tr')
# scrape the contents of the table to csv (This is not complete as I cannot write the duration column to the csv)
with open('out.csv', 'w') as f:
    for tr in trs:
        locations = tr.select('td.location')
        if len(locations) > 0:
            location = locations[0].contents[0].strip()
            prefix = tr.select('td.prefix')[0].contents[0].strip()
            time = tr.select('td.time')[0].contents[0].strip()
            #print tr.select('td.duration').contents[0].strip()
            durations = tr.select('td.duration')
            #print durations
            if len(durations) == 0:
                duration = ''
                #print("oops! There aren't any durations.")
            else:
                duration = durations[0].contents[0].strip()
            f.write("{},{},{}, {}\n".format(location.encode('utf-8'), prefix, time, duration))
The HTML with the Later link looks like
<a accesskey="l" class="hafas-browse-link" href="http://fahrplan.sbb.ch/bin/query.exe/en?ld=std2.a&seqnr=1&ident=kv.047469247.1487285405&REQ0HafasScrollDir=1" id="hfs_linkLater" title="Search for later connections">Later</a>
You can find the url using:
In [22]: soup.find('a', text='Later')['href']
Out[22]: u'http://fahrplan.sbb.ch/bin/query.exe/en?ld=std2.a&seqnr=1&ident=kv.047469247.1487285405&REQ0HafasScrollDir=1'
To make the browser go to that link call br.open:
In [21]: br.open(soup.find('a', text='Later')['href'])
Out[21]: <response_seek_wrapper at 0x7f346a5da320 whose wrapped object = <closeable_response at 0x7f3469bee830 whose fp = <socket._fileobject object at 0x7f34697f26d0>>>
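If you then want to keep paging, one possible pattern (a rough sketch, not tested against the site) is to re-parse after each step and follow the Later link until it is no longer present:
while True:
    soup = BeautifulSoup(br.response().read(), 'html.parser', from_encoding="utf-8")
    # ... scrape the current results table here, as in the code above ...
    later = soup.find('a', text='Later')
    if later is None:
        break
    br.open(later['href'])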
I wish to write to a CSV file a list of all authors, with their URLs, who class themselves under a specific tag on Google Scholar. For example, if we were to take 'security', I would want this output:
author url
Howon Kim https://scholar.google.pl/citations?user=YUoJP-oAAAAJ&hl=pl
Adrian Perrig https://scholar.google.pl/citations?user=n-Oret4AAAAJ&hl=pl
... ...
I have written this code which prints each author's name
# -*- coding: utf-8 -*-
import urllib.request
import csv
from bs4 import BeautifulSoup
url = "http://scholar.google.pl/citations?view_op=search_authors&hl=pl&mauthors=label:security"
page = urllib.request.urlopen(url)
soup = BeautifulSoup(page, 'lxml')
mydivs = soup.findAll("h3", { "class" : "gsc_1usr_name"})
outputFile = open('sample.csv', 'w', newline='')
outputWriter = csv.writer(outputFile)
for each in mydivs:
    for anchor in each.find_all('a'):
        print(anchor.text)
However, this only does it for the first page. Instead, I would like to go through every page. How can I do this?
I'm not writing the code for you, but I'll give you an outline of how you can do it.
Look at the bottom of the page. See the next button? Search for it; the containing div has an id of gsc_authors_bottom_pag, which should be easy to find. I'd do this with Selenium: find the next button (the right one) and click it, wait for the page to load, scrape, repeat. Handle edge cases (out of pages, etc.). A rough sketch follows below.
If the after_author=* bit didn't change in the URL, you could just increment the URL's start parameter, but unless you want to try to crack that code (unlikely), just click the next button.
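A rough sketch of that outline with Selenium. The pagination container id comes from the outline above and the gsc_1usr_name class comes from the question's code; treating the right-hand button in that container as the next button, and assuming it is disabled on the last page, are my own assumptions.
from selenium import webdriver
from selenium.webdriver.common.by import By
import time
driver = webdriver.Firefox()
driver.get("http://scholar.google.pl/citations?view_op=search_authors&hl=pl&mauthors=label:security")
while True:
    # scrape the authors on the current page
    for a in driver.find_elements(By.CSS_SELECTOR, "h3.gsc_1usr_name a"):
        print(a.text, a.get_attribute("href"))
    # the "next" (right) button sits inside the pagination container
    buttons = driver.find_elements(By.CSS_SELECTOR, "#gsc_authors_bottom_pag button")
    if not buttons or not buttons[-1].is_enabled():
        break  # out of pages
    buttons[-1].click()
    time.sleep(2)  # crude wait for the next page to load
driver.quit()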
This page uses <button> instead of <a> for the link to the next/previous page.
The button for the next page has aria-label="Następna".
There are two buttons for the next page, but you can use either of them.
The button has JavaScript code to redirect to the new page,
window.location=url_to_next_page
but it is simple text, so you can use slicing to get only the URL:
import urllib.request
from bs4 import BeautifulSoup
url = "http://scholar.google.pl/citations?view_op=search_authors&hl=pl&mauthors=label:security"
while True:
    page = urllib.request.urlopen(url)
    soup = BeautifulSoup(page, 'lxml')
    # ... do something on page ...
    # find buttons to next page
    buttons = soup.findAll("button", {"aria-label": "Następna"})
    # exit if no buttons
    if not buttons:
        break
    on_click = buttons[0].get('onclick')
    print('javascript:', on_click)
    # add `domain` and remove `window.location='` and `'` at the end
    url = 'http://scholar.google.pl' + on_click[17:-1]
    # converting some codes to chars
    url = url.encode('utf-8').decode('unicode_escape')
    print('url:', url)
BTW: if you speak Polish, you can visit Python Poland or Python: pierwsze kroki on Facebook.
Since furas has already answered how to loop through all pages, this is a complementary answer to his. The script below scrapes much more than your question asks for and writes it to a .csv file.
Code and example in online IDE:
from bs4 import BeautifulSoup
import requests, lxml, os, csv
headers = {
    'User-agent':
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/70.0.3538.102 Safari/537.36 Edge/18.19582"
}
def get_profiles_to_csv():
    html = requests.get('http://scholar.google.pl/citations?view_op=search_authors&hl=pl&mauthors=label:security', headers=headers).text
    soup = BeautifulSoup(html, 'lxml')
    # creating CSV file
    with open('awesome_file.csv', mode='w') as csv_file:
        # defining column names
        fieldnames = ['Author', 'URL']
        # defining .csv writer
        # https://docs.python.org/3/library/csv.html#csv.DictWriter
        writer = csv.DictWriter(csv_file, fieldnames=fieldnames)
        # writing (creating) columns
        writer.writeheader()
        # collecting scraped data
        author_data = []
        # selecting the container where all the data is located
        for result in soup.select('.gs_ai_chpr'):
            name = result.select_one('.gs_ai_name a').text
            link = result.select_one('.gs_ai_name a')['href']
            # https://stackoverflow.com/a/6633693/15164646
            # id = link
            # id_identifer = 'user='
            # before_keyword, keyword, after_keyword = id.partition(id_identifer)
            # author_id = after_keyword
            # affiliations = result.select_one('.gs_ai_aff').text
            # email = result.select_one('.gs_ai_eml').text
            # try:
            #     interests = result.select_one('.gs_ai_one_int').text
            # except:
            #     interests = None
            # "Cited by 107390" = getting the text string -> splitting by a space -> ['Cited', 'by', '21180'] and taking index [2], which is the number.
            # cited_by = result.select_one('.gs_ai_cby').text.split(' ')[2]
            # because we have a csv.DictWriter(), we convert to the required format;
            # dict() keys should be exactly the same as fieldnames, otherwise it will throw an error
            author_data.append({
                'Author': name,
                'URL': f'https://scholar.google.com{link}',
            })
        # iterating over the scraped author data and writing it to the .csv
        for data in author_data:
            writer.writerow(data)
            # print(f'{name}\nhttps://scholar.google.com{link}\n{author_id}\n{affiliations}\n{email}\n{interests}\n{cited_by}\n')
get_profiles_to_csv()
# output from created csv:
'''
Author,URL
Johnson Thomas,https://scholar.google.com/citations?hl=pl&user=eKLr0EgAAAAJ
Martin Abadi,https://scholar.google.com/citations?hl=pl&user=vWTI60AAAAAJ
Adrian Perrig,https://scholar.google.com/citations?hl=pl&user=n-Oret4AAAAJ
Vern Paxson,https://scholar.google.com/citations?hl=pl&user=HvwPRJ0AAAAJ
Frans Kaashoek,https://scholar.google.com/citations?hl=pl&user=YCoLskoAAAAJ
Mihir Bellare,https://scholar.google.com/citations?hl=pl&user=2pW1g5IAAAAJ
Matei Zaharia,https://scholar.google.com/citations?hl=pl&user=I1EvjZsAAAAJ
John A. Clark,https://scholar.google.com/citations?hl=pl&user=xu3n6owAAAAJ
Helen J. Wang,https://scholar.google.com/citations?hl=pl&user=qhu-DxwAAAAJ
Zhu Han,https://scholar.google.com/citations?hl=pl&user=ty7wIXoAAAAJ
'''
Alternatively, you can do the same thing using Google Scholar Profiles API from SerpApi. It's a paid API with a free plan.
Code to integrate:
from serpapi import GoogleSearch
from urllib.parse import urlsplit, parse_qsl
import csv, os
def get_profiles_to_csv():
    with open('awesome_serpapi_file_pagination.csv', mode='w') as csv_file:
        fieldnames = ['Author', 'URL']
        writer = csv.DictWriter(csv_file, fieldnames=fieldnames)
        writer.writeheader()
        params = {
            "api_key": os.getenv("API_KEY"),
            "engine": "google_scholar_profiles",
            "mauthors": "label:security"
        }
        search = GoogleSearch(params)
        while True:
            results = search.get_dict()
            try:
                for result in results['profiles']:
                    name = result['name']
                    link = result['link']
                    writer.writerow({'Author': name, 'URL': link})
            except:
                print('Done')
                break
            # stop when there is no next page to paginate to
            if ('pagination' not in results) or ('next' not in results['pagination']):
                break
            search.params_dict.update(dict(parse_qsl(urlsplit(results["pagination"]["next"]).query)))
get_profiles_to_csv()
# part of the output from created csv:
'''
Author,URL
Johnson Thomas,https://scholar.google.com/citations?hl=en&user=eKLr0EgAAAAJ
Martin Abadi,https://scholar.google.com/citations?hl=en&user=vWTI60AAAAAJ
Adrian Perrig,https://scholar.google.com/citations?hl=en&user=n-Oret4AAAAJ
Vern Paxson,https://scholar.google.com/citations?hl=en&user=HvwPRJ0AAAAJ
Frans Kaashoek,https://scholar.google.com/citations?hl=en&user=YCoLskoAAAAJ
Mihir Bellare,https://scholar.google.com/citations?hl=en&user=2pW1g5IAAAAJ
Matei Zaharia,https://scholar.google.com/citations?hl=en&user=I1EvjZsAAAAJ
John A. Clark,https://scholar.google.com/citations?hl=en&user=xu3n6owAAAAJ
Helen J. Wang,https://scholar.google.com/citations?hl=en&user=qhu-DxwAAAAJ
Zhu Han,https://scholar.google.com/citations?hl=en&user=ty7wIXoAAAAJ
'''
Disclaimer: I work for SerpApi.
I'm doing a web scrape of a website with 122 different pages and 10 entries per page. The code breaks on random pages, on random entries, each time it is run. I can run the code on a URL once and it works, while other times it does not.
import requests
from bs4 import BeautifulSoup
def get_soup(url):
    soup = BeautifulSoup(requests.get(url).content, 'html.parser')
    return soup
def from_soup(soup, myCellsList):
    cellsList = soup.find_all('li', {'class': 'product clearfix'})
    for i in range(len(cellsList)):
        ottdDict = {}
        ottdDict['Name'] = cellsList[i].h3.text.strip()
This is only a piece of my code, but this is where the error is occurring. The problem is that when I use this code, the h3 tag does not always appear in each item in cellsList. This results in a NoneType error when the last line of the code is run. However, the h3 tag is always there in the HTML when I inspect the webpage.
(Screenshots comparing cellsList output with the page HTML, for the initial and a subsequent soup request, were attached.)
What could be causing these differences and how can I avoid this problem? I was able to run the code successfully for a time, and it seems to have all of a sudden stopped working. The code is able to scrape some pages without problem but it randomly does not register the h3 tags on random entries on random pages.
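As a stop-gap while you debug, a simple guard inside the loop above avoids the NoneType crash by skipping (and reporting) the entries whose h3 did not come through; it does not fix the underlying inconsistency.
h3 = cellsList[i].h3
if h3 is not None:
    ottdDict['Name'] = h3.text.strip()
else:
    # inspect cellsList[i] here to see what actually came back
    print("h3 missing for entry", i)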
There are slight discrepancies in the html for various elements as you progress through the site pages; the best way to get the name is actually to select the outer div and extract the text from the anchor.
This will get all the info from each product and put it into dicts where the keys are 'Tissue', 'Cell', etc., and the values are the related descriptions:
import requests
from bs4 import BeautifulSoup
from time import sleep
def from_soup(url):
    with requests.Session() as s:
        s.headers.update({
            "User-Agent": "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/52.0.2743.82 Safari/537.36"})
        # id for next page anchor.
        id_ = "#layoutcontent_2_middlecontent_0_threecolumncontent_0_content_ctl00_rptCenterColumn_dcpCenterColumn_0_ctl00_0_productRecords_0_bottomPaging_0_liNextPage_0"
        soup = BeautifulSoup(s.get(url).content)
        for li in soup.select("ul.product-list li.product.clearfix"):
            name = li.select_one("div.product-header.clearfix a").text.strip()
            d = {"name": name}
            for div in li.select("div.search-item"):
                k = div.strong.text
                d[k.rstrip(":")] = " ".join(div.text.replace(k, "", 1).split())
            yield d
        # get anchor for next page and loop until no longer there.
        nxt = soup.select_one(id_)
        # loop until no more next page.
        while nxt:
            # sleep between requests
            sleep(.5)
            resp = s.get(nxt.a["href"])
            soup = BeautifulSoup(resp.content)
            for li in soup.select("ul.product-list li.product.clearfix"):
                name = li.select_one("div.product-header.clearfix a").text.strip()
                d = {"name": name}
                for div in li.select("div.search-item"):
                    k = div.strong.text
                    d[k.rstrip(":")] = " ".join(div.text.replace(k, "", 1).split())
                yield d
            # look for the next-page anchor on the page we just fetched
            nxt = soup.select_one(id_)
After running:
for ind, h in enumerate(from_soup(
        "https://www.lgcstandards-atcc.org/Products/Cells_and_Microorganisms/Cell_Lines/Human/Alphanumeric.aspx?geo_country=gb")):
    print(ind, h)
You will see 1211 dicts with all the data.