I am attempting to get data from ITC TradeMap (I have selected the page at random, so don't give that too much thought) using Requests, then clean it (not done yet) and export it with Pandas. However, I am having difficulty retrieving the full datasets.
import pandas as pd
import requests as rq
#Pandas Settings
pd.set_option("display.max_rows", 999)
pd.set_option("display.max_columns", 999)
#Request Settings
url = 'https://www.trademap.org/Country_SelProductCountry_TS.aspx?nvpm=1%7c643%7c%7c%7c%7c36%7c%7c%7c2%7c1%7c1%7c2%7c2%7c1%7c2%7c1%7c1%7c1'
headers = {
    "User-Agent": "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/51.0.2704.103 Safari/537.36"
}
payload = {
    'ctl00$PageContent$GridViewPanelControl$HiddenField_Current_OutputMode': 'T',
    'ctl00$PageContent$GridViewPanelControl$HiddenField_Current_PageSizeTab': '300',
    'ctl00$PageContent$GridViewPanelControl$HiddenField_Current_TS_NumTimePeriod': '10',
    'ctl00$PageContent$GridViewPanelControl$HiddenField_Current_TS_ReferencePeriod': '2019'
}
#Output Settings
output_file = 'ITC_Test.xlsx'
#Work
request = rq.post(url, verify=False, headers=headers, data=payload)
table = pd.read_html(request.content)
table[8].to_excel(output_file)
print(table[8])
So far I am in the testing stage of this and solving issues as they arise (e.g. if requested without verify=False, it throws a server-side ssl.SSLCertVerificationError: [SSL: CERTIFICATE_VERIFY_FAILED] certificate verify failed: unable to get local issuer certificate (_ssl.c:997)), but that's beside the point.
The real problem I am facing is that while most of the data needed for successful queries is contained in the URL itself (and I will simply loop through it when the time comes), the view settings are not, and without them I am limited to retrieving only 25 rows and 5 columns of data (top 25 trade partners over the last 5 years).
Those settings are located in dropdowns which seem to be fed into an aspnetForm, so I have tried to use the data parameter of post to feed it those values:
payload = {
    'ctl00$PageContent$GridViewPanelControl$HiddenField_Current_OutputMode': 'T',
    'ctl00$PageContent$GridViewPanelControl$HiddenField_Current_PageSizeTab': '300',
    'ctl00$PageContent$GridViewPanelControl$HiddenField_Current_TS_NumTimePeriod': '10',
    'ctl00$PageContent$GridViewPanelControl$HiddenField_Current_TS_ReferencePeriod': '2019'
}
However, the output does not seem to be affected: it still returns only 25 rows and 5 columns of data instead of the 300 rows and 10 columns I would expect.
I have seen some questions here that seemed similar and tried to implement those ideas; however, most likely because I haven't worked with these libraries and my knowledge of Python in general is rather basic, I was unable to resolve the issue, so any help would be much appreciated.
Thanks!
I found 3 problems:
it has to use Session() so that cookies are sent,
it has to send all form values in the payload, so first I GET the page to collect all values from the <input> and <select> elements,
it sends the new values in different variables: you used the variables that describe the current state, but the page expects them in the variables with DropDownList in the name:
payload['ctl00$PageContent$GridViewPanelControl$DropDownList_NumTimePeriod'] = '20'
payload['ctl00$PageContent$GridViewPanelControl$DropDownList_PageSize'] = '300'
EDIT:
I saved the payload generated by the code and the payload sent by the browser, and used the program Meld to compare the two files and see the differences.
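For example, to produce a file you can diff in Meld against the form data copied from the browser's developer tools, something like this sketch (payload is the dict built by the code below):
with open('payload_from_code.txt', 'w') as f:
    for key in sorted(payload):                    # sort so both files line up in the diff
        f.write('{}={}\n'.format(key, payload[key]))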
I had to correct the code that gets values from <select>, because it needs to look for the <option> with the selected attribute.
For some addresses it needed some values set manually, because normally JavaScript adds them.
And it needed to skip DropDownList_Product_Group.
Full working code:
import pandas as pd
import requests as rq
from bs4 import BeautifulSoup as BS
# hide SSL warnings
from requests.packages.urllib3.exceptions import InsecureRequestWarning
rq.packages.urllib3.disable_warnings(InsecureRequestWarning)
# request settings
#url = 'https://www.trademap.org/Country_SelProduct_TS.aspx?nvpm=1%7c%7c%7c%7c%7c88%7c%7c%7c2%7c1%7c1%7c1%7c2%7c1%7c2%7c1%7c1%7c1'
url = 'https://www.trademap.org/Country_SelProductCountry_TS.aspx?nvpm=1%7c616%7c%7c%7c%7cTOTAL%7c%7c%7c2%7c1%7c1%7c1%7c2%7c1%7c2%7c1%7c%7c1'
print('url:', url.replace('%7c', '|'))
headers = {
    "User-Agent": "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/51.0.2704.103 Safari/537.36"
}
s = rq.Session()
# --- GET ---
print('sending GET ...')
response = s.get(url, verify=False, headers=headers)
soup = BS(response.content, 'html.parser')
form = soup.find('form') #.find('div', {'id': 'div_nav_combo'})
payload = {}

# --- inputs ---
inputs = form.find_all('input')
for item in inputs:
    name = item.get('name', '')
    value = item.get('value', '')
    if name:  # and name != 'pg_goal' and 'button' not in name.lower():
        payload[name] = value

# --- selects (need the <option> marked `selected`) ---
selects = form.find_all('select')
for item in selects:
    name = item.get('name', '')
    if name:
        value = item.find('option', {'selected': True}) or ""
        if value:
            value = value['value']
        payload[name] = value

# --- textareas (not needed here, left for reference) ---
#textareas = form.find_all('textarea')
#for item in textareas:
#    name = item.get('name', '')
#    value = item.get('value', '')
# --- POST ---
print('sending POST ...')
#payload['ctl00$PageContent$GridViewPanelControl$DropDownList_NumTimePeriod'] = '20'
#payload['__EVENTTARGET'] = 'ctl00$PageContent$GridViewPanelControl$DropDownList_NumTimePeriod'
payload['ctl00$PageContent$GridViewPanelControl$DropDownList_PageSize'] = '300'
#payload['__EVENTTARGET'] = 'ctl00$PageContent$GridViewPanelControl$DropDownList_PageSize'
#payload['ctl00$MenuControl$DDL_Language'] = 'en'
# added by JavaScript
payload['ctl00$NavigationControl$DropDownList_Country_Group'] = '-2'
payload['ctl00$NavigationControl$DropDownList_Partner'] = '-2'
payload['ctl00$NavigationControl$DropDownList_Partner_Group'] = '-2'
# has to be removed for `PageSize` to work (at least for some addresses)
del payload['ctl00$NavigationControl$DropDownList_Product_Group']
response = s.post(url, verify=False, headers=headers, data=payload)
#print(response.content[:1000])
# --- rest ---
# pandas settings
pd.set_option("display.max_rows", 999)
pd.set_option("display.max_columns", 999)
# output settings
output_file = 'ITC_Test.xlsx'
all_tables = pd.read_html(response.content)
table = all_tables[8]
table.to_excel(output_file)
print('len(table):', len(table))
#print(table)
Result: table with ~230 rows and ~20 columns
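If you go on to the cleaning step mentioned in the question, a minimal sketch of what that could look like before exporting (the exact column names depend on what read_html returns, so adjust as needed):
table = all_tables[8]
table = table.dropna(axis=1, how='all')                     # drop columns that are entirely empty
table.columns = [str(c).strip() for c in table.columns]     # tidy up header text
table.to_excel(output_file, index=False)                    # skip the numeric index column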
Related
New to screen scraping here and this is my first time posting on Stack Overflow. Apologies in advance for any formatting errors in this post. I am attempting to extract data from multiple pages with the URL:
https://www.landwatch.com/Michigan_land_for_sale/West_Central_Region/Page-' + str(page)
For instance, page 1 is:
https://www.landwatch.com/Michigan_land_for_sale/West_Central_Region/Page-1
Page 2:
https://www.landwatch.com/Michigan_land_for_sale/West_Central_Region/Page-2
and so on...
My script runs without errors. However, my Pandas-exported CSV contains only 1 row with the first extracted value. At the time of this posting, the first value is:
14.01 Acres  Vestaburg, Montcalm County, MI$275,000
My intent is to create a spreadsheet with hundreds of rows that pull the property description from the URLs.
Here is my code:
import requests
from requests import get
from bs4 import BeautifulSoup
headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 6.1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/41.0.2228.0 Safari/537.36'
}
n_pages = 0
desc = []
for page in range(1, 900):
    n_pages += 1
    sapo_url = 'https://www.landwatch.com/Michigan_land_for_sale/West_Central_Region/Page-' + str(page)
    r = get(sapo_url, headers=headers)
    page_html = BeautifulSoup(r.text, 'html.parser')
    house_containers = page_html.find_all('div', class_="propName")
    if house_containers != []:
        for container in house_containers:
            desc = container.getText(strip=True)
    else:
        break
print('you scraped {} pages containing {} Properties'.format(n_pages, len(desc)))
import pandas as pd
df = pd.DataFrame({'description': [desc]})
df.to_csv('test4.csv', encoding = 'utf-8')
I suspect the problem is with the line reading desc = container.getText(strip=True) and I have tried changing it, but I keep getting errors when running.
Any help is appreciated.
I believe the mistake is in the line:
desc = container.getText(strip=True)
Every time the loop runs, the value in desc is replaced rather than added to. To append items to the list, do:
desc.append(container.getText(strip=True))
Also, since it is already a list, you can remove the brackets from the DataFrame creation like so:
df = pd.DataFrame({'description': desc})
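Putting both changes together, the relevant part of the loop from the question becomes (a sketch reusing the imports, headers and get() call from your code):
desc = []
for page in range(1, 900):
    sapo_url = 'https://www.landwatch.com/Michigan_land_for_sale/West_Central_Region/Page-' + str(page)
    r = get(sapo_url, headers=headers)
    page_html = BeautifulSoup(r.text, 'html.parser')
    house_containers = page_html.find_all('div', class_="propName")
    if house_containers:
        for container in house_containers:
            desc.append(container.getText(strip=True))  # accumulate instead of overwrite
    else:
        break

df = pd.DataFrame({'description': desc})
df.to_csv('test4.csv', encoding='utf-8')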
The cause is that no data is being added inside the loop, so only the final value is saved. For testing purposes this code only goes up to page 2, so adjust the range as needed.
import requests
from requests import get
from bs4 import BeautifulSoup
import pandas as pd
headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 6.1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/41.0.2228.0 Safari/537.36'
}
n_pages = 0
desc = []
all_data = pd.DataFrame(index=[], columns=['description'])
for page in range(1, 3):
    n_pages += 1
    sapo_url = 'https://www.landwatch.com/Michigan_land_for_sale/West_Central_Region/Page-' + str(page)
    r = get(sapo_url, headers=headers)
    page_html = BeautifulSoup(r.text, 'html.parser')
    house_containers = page_html.find_all('div', class_="propName")
    if house_containers != []:
        for container in house_containers:
            desc = container.getText(strip=True)
            df = pd.DataFrame({'description': [desc]})
            all_data = pd.concat([all_data, df], ignore_index=True)
    else:
        break
all_data.to_csv('test4.csv', encoding = 'utf-8')
print('you scraped {} pages containing {} Properties'.format(n_pages, len(all_data)))
I am trying to scrape this website to get the reviews, but I am facing an issue:
The page loads only 50 reviews.
To load more you have to click "Show More Reviews", and I don't know how to get all the data, as there is no page link; "Show More Reviews" doesn't have a URL to explore either, since the address stays the same.
url = "https://www.capterra.com/p/134048/HiMama-Preschool-Child-Care-App/#reviews"
import requests
from bs4 import BeautifulSoup
import json
import pandas as pd
a = []
url = requests.get(url)
html = url.text
soup = BeautifulSoup(html, "html.parser")
table = soup.findAll("div", {"class":"review-comments"})
#print(table)
for x in table:
    a.append(x.text)
df = pd.DataFrame(a)
df.to_csv("review.csv", sep='\t')
I know this is not pretty code, but I am just trying to get the review text first.
Kindly help, as I am a little new to this.
Looking at the website, the "Show more reviews" button makes an AJAX call that returns the additional info; all you have to do is find its link and send a GET request to it (which I've done with some simple regex):
import requests
import re
from bs4 import BeautifulSoup
headers = {
    "user-agent": "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) snap Chromium/74.0.3729.169 Chrome/74.0.3729.169 Safari/537.36"
}
url = "https://www.capterra.com/p/134048/HiMama-Preschool-Child-Care-App/#reviews"
Data = []
#Each page equivalant to 50 comments:
MaximumCommentPages = 3
with requests.Session() as session:
    info = session.get(url)
    # Get product ID, needed for getting more comments
    productID = re.search(r'"product_id":(\w*)', info.text).group(1)
    # Extract info from the main page
    soup = BeautifulSoup(info.content, "html.parser")
    table = soup.findAll("div", {"class": "review-comments"})
    for x in table:
        Data.append(x)
    # Get additional data:
    params = {
        "page": "",
        "product_id": productID
    }
    while MaximumCommentPages > 1:  # down to 1 because the main page data was already extracted!
        MaximumCommentPages -= 1
        params["page"] = str(MaximumCommentPages)
        additionalInfo = session.get("https://www.capterra.com/gdm_reviews", params=params)
        print(additionalInfo.url)
        #print(additionalInfo.text)
        # Extract info from the additional pages:
        soup = BeautifulSoup(additionalInfo.content, "html.parser")
        table = soup.findAll("div", {"class": "review-comments"})
        for x in table:
            Data.append(x)

# Extract data the old fashioned way:
counter = 1
with open('review.csv', 'w') as f:
    for one in Data:
        f.write(str(counter))
        f.write(one.text)
        f.write('\n')
        counter += 1
Notice how I'm using a session to preserve cookies for the ajax call.
Edit 1: You can reload the webpage multiple times and call the ajax again to get even more data.
Edit 2: Save data using your own method.
Edit 3: Changed some things; it now gets any number of pages for you and saves to a file with good ol' open().
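If you prefer the pandas export from your original snippet, the same Data list can be written out like this (a sketch, assuming pandas is installed as in your code):
import pandas as pd

df = pd.DataFrame({'review': [one.text for one in Data]})   # one row per review element
df.to_csv('review.csv', index=False, encoding='utf-8')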
I am a beginner in Python and I am trying to access the following data using Python:
1) https://www.nseindia.com/corporates/corporateHome.html, click on 'Corporate Announcements' under 'Corporate Information' on the left pane.
2) Entering the company symbol (KSCL for example) and selecting the announcement period
3) Click on any individual row subject to get additional detail
The first two steps translate to the URL below: 'https://www.nseindia.com/corporates/corpInfo/equities/getAnnouncements.jsp?period=More%20than%203%20Months&symbol=kscl&industry=&subject='. This works fine in my Python requests code.
However, I am not able to replicate the third step: the request is successful, but I am not getting the data. Following is the code that I am using. I am stuck, please help.
I compared all the request headers sent when I tried this from the browser with what I am sending from Python, and they match. I tried sending the cookie too, but that didn't work; I think the cookie might not be required, since the website still works in the browser after disabling cookies. I am running this on Python 3.5.
import requests as rq
from requests.utils import requote_uri
from requests_html import HTMLSession
import demjson as dj
from urllib.parse import quote
class BuyBack:
    def start(self):
        # Define headers used across all requests
        self.req_headers = {'user-agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_9_5) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/67.0.3396.87 Safari/537.36'}
        self.req_headers['Accept'] = '*/*'
        self.req_headers['Accept-Encoding'] = 'gzip, deflate, br'
        self.getAllSymbols()

    def readAnnouncement(self, pyAnnouncement):
        # This is done using requests_html
        symbol = pyAnnouncement['sym']
        desc = pyAnnouncement['desc']
        tstamp = pyAnnouncement['date']
        seqId = pyAnnouncement['seqId']
        payload = {'symbol': symbol, 'desc': desc, 'tstamp': tstamp, 'seqId': seqId}
        quote_payload = {}
        params_string = '?'
        # formats as required, with '%20' for spaces
        for (k, v) in payload.items():
            quote_payload[quote(k)] = quote(v)
            params_string += quote(k)
            params_string += '='
            params_string += quote(v)
            params_string += '&'
        params_string = params_string[:-1]
        announDetail_Url = 'https://nseindia.com/corporates/corpInfo/equities/AnnouncementDetail.jsp'
        self.req_headers['Referer'] = 'https://www.nseindia.com/corporates/corpInfo/equities/Announcements.html'
        self.req_headers['X-Requested-With'] = 'XMLHttpRequest'
        self.req_headers['Host'] = 'www.nseindia.com'
        annReqUrl = announDetail_Url + params_string
        session = HTMLSession()
        r = session.get(annReqUrl, headers=self.req_headers)
        print(r.url)
        # I am not getting the proper data in the response
        print(r.content)
        print(r.request.headers)

    def getAllSymbols(self):
        # To get the list of symbols to run the rest of the process; for now just run with one
        symbol = 'KSCL'
        self.getAnnouncements(symbol)

    def getAnnouncements(self, symbol):
        # To get a list of all announcements so far in the last few months
        # This is done with requests and demjson because the request returns a JS object
        # Open request to get everything
        payload = {'symbol': symbol, 'Industry': '', 'ExDt': '', 'subject': ''}
        corporateActions_url = 'https://www.nseindia.com/corporates/corpInfo/equities/getAnnouncements.jsp'
        r = rq.get(corporateActions_url, headers=self.req_headers, params=payload)
        for line in r.iter_lines():
            lineAscii = line.decode("ascii")
            if len(lineAscii) > 5:
                pyAnnouncements = dj.decode(lineAscii)
        # Tried setting the cookie but no use
        #cookie = r.headers['Set-Cookie']
        #self.req_headers['Cookie'] = cookie
        # read from the announcements
        if pyAnnouncements['success']:
            #for x in pyAnnouncements['rows']:
            for i in range(0, 1):
                self.readAnnouncement(pyAnnouncements['rows'][i])

BuyBack_inst = BuyBack()
BuyBack_inst.start()
When I try this flow from the browser, the second call's response has an href link to a PDF, but I am not getting that href link in my Python response.
The following works for me to get all PDF hrefs given a symbol and announcement period:
import demjson
import requests
from bs4 import BeautifulSoup
symbol = 'KSCL'
s = requests.Session()
r = s.get("https://www.nseindia.com/corporates/corpInfo/equities/getAnnouncements.jsp"
          f"?period=Last%201%20Month&symbol={symbol}&industry=&subject=")

for ann in demjson.decode(r.text.strip())['rows']:
    url = (
        "https://www.nseindia.com/corporates/corpInfo/equities/AnnouncementDetail.jsp?"
        f"symbol={ann['sym']}"
        f"&desc={ann['desc']}"
        f"&tstamp={int(ann['date']) // 100}"
        f"&seqId={ann['seqId']}"
    )
    soup = BeautifulSoup(s.get(url).content, 'html.parser')
    print(soup.select_one('.t1 a[href$=".pdf"]')['href'])
Result:
/corporate/KSCL_20122018134432_Outcome_046.pdf
/corporate/KSCL_20122018133033_Outcome_043.pdf
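If you also want to download each PDF, the hrefs above are site-relative, so inside the loop you could replace the print with something like this (a sketch, untested against every announcement):
pdf_href = soup.select_one('.t1 a[href$=".pdf"]')['href']
pdf_url = "https://www.nseindia.com" + pdf_href
with open(pdf_href.rsplit('/', 1)[-1], 'wb') as f:    # save under the file's own name
    f.write(s.get(pdf_url).content)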
I was hoping someone could help me with urllib posting. My goal for this program is to post an IP address and obtain its relative location. I know there are many APIs and such out there, but my school isn't too keen on having any of their computers modified in any way (this is for a comp sci class). As of right now, the code below gets me my own location, as my computer's IP is already detected by the website (I'm guessing from a header?), but what I'd like to do is input an IP and have its location returned. ipStr is just the IP string (in this case it's Time Warner Cable's IP in NYC). I tried setting values and submitting the data, but no matter what I set the values to, it just returns my own computer's location. Any ideas?
import re
import urllib.parse
import urllib.request

ipStr = "72.229.28.185"
url = "https://www.iplocation.net/"
values = {'value': ipStr}
headers = {}
headers['User-Agent'] = 'Mozilla/5.0 (Windows NT 6.1; WOW64; rv:40.0) Gecko/20100101 Firefox/40.1'
data = urllib.parse.urlencode(values)
data = data.encode('utf-8')
req = urllib.request.Request(url, data=data, headers=headers)
resp = urllib.request.urlopen(req)
page = str(resp.read())
npattern = "Google Map for"
nfound = re.search(npattern, page)
ns = nfound.start()
ne = nfound.end()
location = ""
while page[ne:ne + 1] != "(":
    location += page[ne:ne + 1]
    ne += 1
You just need to change the parameter name from value to query, for example:
values = {'query': ipStr}
If you look at the name of the input field on the page (https://www.iplocation.net/) you'll see the field's name is query.
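For completeness, here is the request from the question with only that field name changed (a sketch using the same urllib calls; the imports are added so it runs on its own):
import urllib.parse
import urllib.request

ipStr = "72.229.28.185"
url = "https://www.iplocation.net/"
headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 6.1; WOW64; rv:40.0) Gecko/20100101 Firefox/40.1'}
data = urllib.parse.urlencode({'query': ipStr}).encode('utf-8')   # the field the form actually posts
req = urllib.request.Request(url, data=data, headers=headers)
page = urllib.request.urlopen(req).read().decode('utf-8', errors='replace')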
I am trying to programmatically send a list of genes to the well-known website DAVID (http://david.abcc.ncifcrf.gov/summary.jsp) for functional annotation. There are two other ways, the API service (http://david.abcc.ncifcrf.gov/content.jsp?file=DAVID_API.html) and the web service (http://david.abcc.ncifcrf.gov/content.jsp?file=WS.html), but the former has stricter query limitations and the latter doesn't accept my ID type (http://david.abcc.ncifcrf.gov/forum/viewtopic.php?f=14&t=885), so the only choice seems to be a program that posts the form, parses the resulting page and extracts the download link. Using the Firefox plugin 'httpFox' to monitor the transmission, I gave it a try with the following script:
import urllib
import urllib2
import requests as rq
import time
_n = 1
url0 = 'http://david.abcc.ncifcrf.gov'
url = 'http://david.abcc.ncifcrf.gov/summary.jsp'
user_agent = 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10.6; rv:30.0) Gecko/20100101 Firefox/30.0'
def get_cookie(session_id):  # prepare 'Cookie' in the headers for the post
    domain_hash = '260267544'    # according to what's been sent by firefox
    random_uid = '1113731634'    # according to what's been sent by firefox
    global _t0
    init_time = _t0
    global _t
    prev_time = _t
    _t = int(time.time())
    curr_time = _t
    global _n
    _n += 1
    session_count = _n
    campaign_count = 1
    utma = '.'.join(str(x) for x in (domain_hash, random_uid, init_time, prev_time, curr_time, session_count))
    utmz = '.'.join(str(x) for x in (domain_hash, init_time, session_count, campaign_count, 'utmcsr=(direct)|utmccn=(direct)|utmcmd=(none)'))
    cookie = '; '.join(str(x) for x in ('__utma=' + utma, '__utmz=' + utmz, 'JSESSIONID=' + session_id))
    return cookie
# first get the session ID
_t = int(time.time())
_t0 = _t
headers = {'User-Agent' : user_agent}
r = rq.get(url, headers = headers)
session_id = r.cookies['JSESSIONID']
cookie = get_cookie(session_id)
# get the gene list
gene = []
fh = open('list.txt', 'r')
for line in fh:
    gene.append(line.rstrip('\n'))
fh.close()
# then post the form
headers = {  # all below is according to what's been sent by firefox
    'Host': 'david.abcc.ncifcrf.gov',
    'User-Agent': user_agent,
    'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8',
    'Accept-Language': 'en-US,en;q=0.5',
    'Accept-Encoding': 'gzip, deflate',
    'Referer': url,
    'Cookie': cookie,
    'Connection': 'keep-alive',
    # 'Content-Type' : 'multipart/form-data; boundary=---------------------------17914945481928137296675300642',
    # 'Content-Length' : '3581'
}
data = {  # all below is according to what's been sent by firefox
    'idType': 'OFFICIAL_GENE_SYMBOL',
    'uploadType': 'list',
    'multiList': 'false',
    'Mode': 'paste',
    'useIndex': 'null',
    'usePopIndex': 'null',
    'demoIndex': 'null',
    'ids': '\n'.join(gene),
    'removeIndex': 'null',
    'renameIndex': 'null',
    'renamePopIndex': 'null',
    'newName': 'null',
    'combineIndex': 'null',
    'selectedSpecies': 'null',
    'SESSIONID': session_id[-12:],  # the last 12 characters of 'JSESSIONID', matching what firefox sends
    'uploadHTML': 'null',
    'managerHTML': 'null',
    'sublist': '',
    'rowids': '',
    'convertedListName': 'null',
    'convertedPopName': 'null',
    'pasteBox': '\n'.join(gene),
    'fileBrowser': '',
    'Identifier': 'OFFICIAL_GENE_SYMBOL',
    'rbUploadType': 'list'
}
r = rq.post(url = url, data = data, headers = headers)
if r.status_code == 200:
    fh = open("python.html", 'w')
    fh.write(r.text)
    fh.close()
However, the page retrieved by my code is 272 KB, definitely different from the content returned by httpFox, which is 428 KB. I compared the headers and the form sent by my script and by Firefox; the only differences seem to be in
the cookie fields __utma and __utmz, but these are related to Google Analytics, so it sounds like they shouldn't really matter, and
the fields 'Content-Type' and 'Content-Length' in the second header, which I commented out. Due to the suggestion in Is Python requests doing something wrong here, or is my POST request lacking something?, it appears unnecessary to specify them manually. However, even after commenting them out, it doesn't work.
Above is the basic situation, and I would appreciate it if someone could help figure out specifically where the problem is. Besides, I've seen some other advice, e.g. trying the browser emulator 'mechanize'. But I am more curious about the reason, i.e. is there something wrong with my program and if so how do I correct it, or are these modules simply not sufficient for the task? Thanks a lot.
My list to post is:
Apba3
Apoa1bp
Dexi
Dhps
Dnpep
Eral1
Gcsh
Git1
Grtp1
Guk1
Ifrd2
Lsm3
Map2k1ip1
Med31
Mettl11a
Mrpl2
mrpl24
Mrpl30
Mrpl46
Ndufaf3
Nr1h2
Obfc2b
Parp3
Pigt
Pop5
Ppt2
Ptpmt1
RGD1304567
RGD1306215
RGD1309708
Rras
My browser post procedure is:
in Firefox, open http://david.abcc.ncifcrf.gov/summary.jsp
in the left panel (shown by default), input the above gene list in the box "Step 1: Enter Gene List A: Paste a list"
click the drop-down button and select "OFFICIAL_GENE_SYMBOL" in "Step 2: Select Identifier"
check the radio button "Gene List" in "Step 3: List Type"
click "Submit List" in "Step 4: Submit List"
The browser then returns a new page with a pop-up window prompting the user to select the species and background, which is the content tracked by httpFox in this post and also what I am trying to capture with my script.
Use Selenium:
from selenium import webdriver
from time import sleep
driver = webdriver.Firefox()
driver.get('http://david.abcc.ncifcrf.gov/summary.jsp')
sleep(0.1)
query = """Apba3
Apoa1bp
Dexi
Dhps
Dnpep
Eral1
Gcsh
Git1
Grtp1
Guk1
Ifrd2
Lsm3
Map2k1ip1
Med31
Mettl11a
Mrpl2
mrpl24
Mrpl30
Mrpl46
Ndufaf3
Nr1h2
Obfc2b
Parp3
Pigt
Pop5
Ppt2
Ptpmt1
RGD1304567
RGD1306215
RGD1309708
Rras"""
listBox = driver.find_element_by_id("LISTBox")
listBox.send_keys(query)
IDT = driver.find_element_by_id("IDT")
IDT.send_keys("O")
radioCheck = driver.find_element_by_name("rbUploadType")
radioCheck.click()
submitButton = driver.find_element_by_name("B52")
submitButton.click()
sleep(0.1)
alert = driver.switch_to_alert()
alert.accept()
sleep(0.1)
html = driver.page_source
The variable "html" contains the page source.
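From there you can parse it with the tools from the question, for example (a sketch; which elements to pull out depends on which DAVID page you land on):
from bs4 import BeautifulSoup

soup = BeautifulSoup(html, 'html.parser')
# e.g. collect every link on the returned page, then pick out the download link you need
links = [a.get('href') for a in soup.find_all('a', href=True)]
print(links)
driver.quit()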