I am not very used to BeautifulSoup yet (even though it is very useful). My question is: if I have a website like this
https://bigd.big.ac.cn/dogsdv2/pages/modules/indsnp/indsnp_search.jsp
and I want to get the results of entering P2RY12 into the "gene name" input box, what do I need to do?
Also, in general, if I want to get a search result from a certain website, what do I need to do?
If you open the Firefox/Chrome developer tools, you can observe where the page makes its requests. When you type P2RY12 into the search box and click the submit button, the page makes a POST request to http://bigd.big.ac.cn/dogsdv2/indsnp/searchIndSNPSingle.action.
In general, you need to know the URL and parameters sent to the URL to get any information back.
This example grabs some information from the first page of results:
import requests
from bs4 import BeautifulSoup
url = 'http://bigd.big.ac.cn/dogsdv2/indsnp/searchIndSNPSingle.action'
data = {
    'totalCount': -1,
    'searchForm.chrom': 0,
    'searchForm.start': '',
    'searchForm.end': '',
    'searchForm.rsid': '',
    'searchForm.popu': 0,
    'searchForm.geneid': '',
    'searchForm.genename': 'P2RY12',
    'searchForm.goterm': '',
    'searchForm.gokeyword': '',
    'searchForm.limitFlag': 1,
    'searchForm.numlimit': 1000
}
headers = {
    'Referer': 'https://bigd.big.ac.cn/dogsdv2/pages/modules/indsnp/indsnp_search.jsp',
}
soup = BeautifulSoup(requests.post(url, data=data, headers=headers).text, 'html.parser')
# The third column of each results row holds the SNP ID, position, gene locations and genotype
for td in soup.select('table.table7 tr > td:nth-child(3)'):
    a = td.select_one('a')
    print('SNP ID:', a.get_text(strip=True))
    t1 = a.find_next_sibling('br').find_next_sibling(text=True)
    print('Position:', t1.strip())
    print('Location:', ', '.join(l.get_text(strip=True) for l in t1.find_next_siblings('a')))
    print('Genotype:', a.find_next_siblings('br')[2].find_next_sibling(text=True).strip())
    print('-' * 80)
Prints:
SNP ID: cfa19627795
Position: Chr23:45904511
Location: ENSCAFG00000008485, ENSCAFG00000008531, ENSCAFG00000008534
Genotype: G
--------------------------------------------------------------------------------
SNP ID: cfa19627797
Position: Chr23:45904579
Location: ENSCAFG00000008485, ENSCAFG00000008531, ENSCAFG00000008534
Genotype: C
--------------------------------------------------------------------------------
SNP ID: cfa19627803
Position: Chr23:45904842
Location: ENSCAFG00000008485, ENSCAFG00000008531, ENSCAFG00000008534
Genotype: C
--------------------------------------------------------------------------------
...and so on.
I am trying to mimic the following browser actions via python's requests:
Land on https://www.bundesanzeiger.de/pub/en/to_nlp_start
Click "More search options"
Click checkbox "Also find historicised data" (corresponds to POST param: isHistorical: true)
Click button "Search net short positions"
Click button "Als CSV herunterladen" to download csv file
This is the code I have to simulate this:
import requests
import re
s = requests.Session()
r = s.get("https://www.bundesanzeiger.de/pub/en/to_nlp_start", verify=False, allow_redirects=True)
matches = re.search(
    r'form class="search-form" id=".*" method="post" action="\.(?P<appendtxt>.*)"',
    r.text
)
request_url = f"https://www.bundesanzeiger.de/pub/en{matches.group('appendtxt')}"
sr = s.post(request_url, data={'isHistorical': 'true', 'nlp-search-button': 'Search net short positions'}, allow_redirects=True)
However, even though sr gives me a status_code 200, it's really an error when I check sr.url, which shows https://www.bundesanzeiger.de/pub/en/error-404?9
Digging a bit deeper, I noticed that request_url above resolves to something like
https://www.bundesanzeiger.de/pub/en/nlp;wwwsid=EFEB15CD4ADC8932A91BA88B561A50E9.web07-pub?0-1.-nlp~filter~form~panel-form
but when I check the request url in Chrome, it's actually
https://www.bundesanzeiger.de/pub/en/nlp?87-1.-nlp~filter~form~panel-form
The 87 here seems to change, suggesting it's some session ID, but when I'm doing this using requests it doesn't appear to resolve properly.
Any idea what I'm missing here?
You can try this script to download the CSV file:
import requests
from bs4 import BeautifulSoup
url = 'https://www.bundesanzeiger.de/pub/en/to_nlp_start'
data = {
    'fulltext': '',
    'positionsinhaber': '',
    'ermittent': '',
    'isin': '',
    'positionVon': '',
    'positionBis': '',
    'datumVon': '',
    'datumBis': '',
    'isHistorical': 'true',
    'nlp-search-button': 'Search+net+short+positions'
}
headers = {
    'Referer': 'https://www.bundesanzeiger.de/'
}
with requests.Session() as s:
    soup = BeautifulSoup(s.get(url).content, 'html.parser')
    # find the search form's action attribute and build the absolute POST url
    action = soup.find('form', action=lambda t: 'nlp~filter~form~panel-for' in t)['action']
    u = 'https://www.bundesanzeiger.de/pub/en' + action.strip('.')
    soup = BeautifulSoup(s.post(u, data=data, headers=headers).content, 'html.parser')
    # follow the "Download as CSV" link from the results page
    a = soup.select_one('a[title="Download as CSV"]')['href']
    a = 'https://www.bundesanzeiger.de/pub/en' + a.strip('.')
    print(s.get(a, headers=headers).content.decode('utf-8-sig'))
Prints:
"Positionsinhaber","Emittent","ISIN","Position","Datum"
"Citadel Advisors LLC","LEONI AG","DE0005408884","0,62","2020-08-21"
"AQR Capital Management, LLC","Evotec SE","DE0005664809","1,10","2020-08-21"
"BlackRock Investment Management (UK) Limited","thyssenkrupp AG","DE0007500001","1,50","2020-08-21"
"BlackRock Investment Management (UK) Limited","Deutsche Lufthansa Aktiengesellschaft","DE0008232125","0,75","2020-08-21"
"Citadel Europe LLP","TAG Immobilien AG","DE0008303504","0,70","2020-08-21"
"Davidson Kempner European Partners, LLP","TAG Immobilien AG","DE0008303504","0,36","2020-08-21"
"Maplelane Capital, LLC","VARTA AKTIENGESELLSCHAFT","DE000A0TGJ55","1,15","2020-08-21"
...and so on.
If you check https://www.bundesanzeiger.de/robots.txt, this website does not like to be indexed, and it could be denying access to the default user agent used by bots. This might help: Python requests vs. robots.txt.
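If that is the case, you can try sending a browser-like User-Agent header with your requests. A minimal sketch (the header value here is only an example):
import requests

# requests identifies itself as "python-requests/x.y.z" by default; some sites block that
headers = {'User-Agent': 'Mozilla/5.0'}
r = requests.get('https://www.bundesanzeiger.de/pub/en/to_nlp_start', headers=headers)
print(r.status_code)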
<script type="text/javascript">
'sku': 'T3246B5',
'Name': 'TAS BLACKY',
'Price': '111930',
'categories': 'Tas,Wanita,Sling Bags,Di bawah Rp 200.000',
'brand': '',
'visibility': '4',
'instock': "1",
'stock': "73.0000"
</script>
I want to scrape the text between 'stock': " and .0000", so the desirable result is 73.
What I used to know is to do something like this:
for url2 in urls2:
    req2 = Request(url2, headers={'User-Agent': 'Chrome/39.0.2171.95'})
    html2 = uReq(req2).read()
    page_soup2 = soup(html2, "html.parser")
    # Grab text
    stock = page_soup2.findAll("p", {"class": "stock"})
    stocks = stock[0].text
I used something like this in my previous code. It worked before the site changed its code.
But now there is more than one ("script", {"type": "text/javascript"}) on the page I want to scrape, so I don't know how to find the right ("script", {"type": "text/javascript"}).
I also don't know how to get the specific text between the markers before and after it.
I have googled it all day but can't find the solution. Please help.
I found that the strings 'stock': " and .0000" are unique in the entire page: there is only one 'stock': and only one .0000".
So I think they could mark the location of the text I want to scrape.
Please help, thank you for your kindness.
I also apologize for my lack of English; I am actually unfamiliar with programming. I'm just trying to learn from Google, but I can't find the answer. Thank you for your understanding.
The URL: view-source:https://www.sophieparis.com/blacky-bag.html
Since you are sure 'stock' only shows up in the script tag you want, you can pull out the text of the script that contains 'stock'. Once you have that, it's a matter of trimming off the excess and changing the single quotes to double quotes to get it into valid JSON format, and then simply reading it in using json.loads().
import requests
from bs4 import BeautifulSoup
import json
url2 = 'https://www.sophieparis.com/blacky-bag.html'
req2 = requests.get(url2, headers={'User-Agent': 'Chrome/39.0.2171.95'})
page_soup2 = BeautifulSoup(req2.text, "html.parser")
scripts = page_soup2.find_all('script')
for script in scripts:
    if 'stock' in script.text:
        jsonStr = script.text

jsonStr = jsonStr.split('productObject = ')[-1].strip()
jsonStr = jsonStr.rsplit('}',1)[0].strip() + '}'
jsonData = json.loads(jsonStr.replace("'",'"'))
print (jsonData['stock'].split('.')[0])
Output:
71
You could also do this without the loop and just grab the script that has the string stock in it using 1 line:
jsonStr = page_soup2.find('script', text=re.compile(r'stock')).text
Full code would look something like:
import requests
from bs4 import BeautifulSoup
import json
import re
url2 = 'https://www.sophieparis.com/blacky-bag.html'
req2 = requests.get(url2, headers={'User-Agent': 'Chrome/39.0.2171.95'})
page_soup2 = BeautifulSoup(req2.text, "html.parser")
jsonStr = page_soup2.find('script', text=re.compile(r'stock')).text
jsonStr = jsonStr.split('productObject = ')[-1].strip()
jsonStr = jsonStr.rsplit('}',1)[0].strip() + '}'
jsonData = json.loads(jsonStr.replace("'",'"'))
print (jsonData['stock'].split('.')[0])
I would write a regex that targets the javascript dictionary variable that houses the values of interest. You can apply this direct to response.text with no need for bs4.
The dictionary variable is called productObject, and you want the non-empty dictionary, which is the second occurrence of productObject = {..}, i.e. not the one that has 'var ' preceding it. You can use a negative lookbehind to specify this requirement.
Use hjson to handle property names enclosed in single quotes.
Python:
import requests, re, hjson
r = requests.get('https://www.sophieparis.com/blacky-bag.html')
p = re.compile(r'(?<!var\s)productObject = ([\s\S]*?})')
data = hjson.loads(p.findall(r.text)[0])
print(data)
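If you only want the integer stock figure asked about in the question, you can then index the parsed dict (a small usage note; data holds the keys shown in the quoted script):
print(data['stock'].split('.')[0])  # '73' for the snippet quoted in the question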
If you want to provide me with the webpage you wish to scrape the data from, I'll see if I can fix the code to pull the information.
I want to scrape data from the website https://xlnindia.gov.in/frm_G_Cold_S_Query.aspx. I have to select the State as Delhi and the District as Adarsh Nagar (4), click on the Search button, and scrape all the information.
So far I have tried the code given below:
import requests
from bs4 import BeautifulSoup
An 'HTTPS 443 SSL' error was coming up, which I resolved using verify=False:
resp = requests.get('https://xlnindia.gov.in/frm_G_Cold_S_Query.aspx',verify=False)
soup = BeautifulSoup(resp.text,"lxml")
dictinfo = {i['name']:i.get('value','') for i in soup.select('input[name]')}
dictinfo['ddlState']='Delhi'
dictinfo['ddldistrict']='Adarsh Nagar (4)'
dictinfo['__EVENTTARGET']='btnSearch'
dictinfo = {k:(None,str(v)) for k,v in dictinfo.items()}
r=requests.post('https://xlnindia.gov.in/frm_G_Cold_S_Query.aspx',verify=False,files=dictinfo)
r returns <Response [500]>, and the parsed response (soup2) shows the following error:
Invalid postback or callback argument. Event validation is enabled using <pages enableEventValidation="true"/> in configuration or <%@ Page EnableEventValidation="true" %> in a page. For security purposes, this feature verifies that arguments to postback or callback events originate from the server control that originally rendered them. If the data is valid and expected, use the ClientScriptManager.RegisterForEventValidation method in order to register the postback or callback data for validation.
Can someone please help me scrape it or get it done?
(I can only use the requests and BeautifulSoup libraries; no Selenium, Mechanize, etc.)
Try the script below to get the tabular results that are populated after choosing the two dropdown items you stated above. It turns out that you have to make two subsequent POST requests to populate the results.
import requests
from bs4 import BeautifulSoup
import urllib3
urllib3.disable_warnings(urllib3.exceptions.InsecureRequestWarning)
url = 'https://xlnindia.gov.in/frm_G_Cold_S_Query.aspx'
with requests.Session() as s:
    s.headers['User-Agent'] = 'Mozilla/5.0'
    resp = s.get(url, verify=False)
    soup = BeautifulSoup(resp.text, "lxml")
    # first POST: submit the hidden ASP.NET fields along with the chosen State
    dictinfo = {i['name']: i.get('value', '') for i in soup.select('input[name]')}
    dictinfo['ddlState'] = 'DL'
    res = s.post(url, data=dictinfo, verify=False)
    soup_obj = BeautifulSoup(res.text, "lxml")
    # second POST: resubmit the refreshed hidden fields along with the chosen District
    payload = {i['name']: i.get('value', '') for i in soup_obj.select('input[name]')}
    payload['ddldistrict'] = 'ADN'
    r = s.post(url, data=payload, verify=False)
    sauce = BeautifulSoup(r.text, "lxml")
    for items in sauce.select("#dgDisplay tr"):
        data = [item.get_text(strip=True) for item in items.select("td")]
        print(data)
Output you may see in the console:
['Firm Name', 'City', 'Licences', 'Reg. Pharmacists / Comp. Person']
['A ONE MEDICOS', 'DELHI-251/1, GALI NO.1, KH, NO, 739/251/1, NEAR HIMACHAL BHAWAN,SARAI PIPAL THALA, VILLAGE AZAD PUR,', 'R - 2', 'virender kumar, DPH, [22295-17/10/2013]']
['AAROGYAM', 'DELHI-PVT. SHOP NO. 1, GF, 121,VILLAGE BHAROLA', 'R - 2', 'avinesh bhadoriya, DPH, [27033-]']
['ABCO INDIA', 'DELHI-SHOP NO-452/22,BHUSHAN BHAWAN RING ROAD,FLYOVER AZAD PUR', 'W - 2', 'sanjay dubey , SSC, [C-P-03/01/1997]']
['ADARSH MEDICOS', 'DELHI-NORTHERN SIDE B-107, GALI NO. 1,,MAJLIS PARK, VILLAGE BHAROLA,', 'R - 2', 'dilip kumar, BPH, [28036-11/01/2018]']
<div class="columns small-5 medium-4 cell header">Ref No.</div>
<div class="columns small-7 medium-8 cell">110B60329</div>
Website is https://www.saa.gov.uk/search/?SEARCHED=1&ST=&SEARCH_TERM=city+of+edinburgh%2C+BOSWALL+PARKWAY%2C+EDINBURGH&ASSESSOR_ID=&SEARCH_TABLE=valuation_roll_cpsplit&DISPLAY_COUNT=10&TYPE_FLAG=CP&ORDER_BY=PROPERTY_ADDRESS&H_ORDER_BY=SET+DESC&DRILL_SEARCH_TERM=BOSWALL+PARKWAY%2C+EDINBURGH&DD_TOWN=EDINBURGH&DD_STREET=BOSWALL+PARKWAY&UARN=110B60329&PPRN=000000000001745&ASSESSOR_IDX=10&DISPLAY_MODE=FULL#results
I would like to run a loop and return '110B60329'. I ran BeautifulSoup and did a find_all('div'); I then defined the two different tags as head and data based on their class. I then iterated through the 'head' tags, hoping it would return the info in the div tag I defined as data.
Python returns a blank (the cmd prompt just reprints the file path).
Would anyone kindly know how I might fix this? My full code is below, thanks.
import requests
from bs4 import BeautifulSoup as soup
import csv
url = 'https://www.saa.gov.uk/search/?SEARCHED=1&ST=&SEARCH_TERM=city+of+edinburgh%2C+BOSWALL+PARKWAY%2C+EDINBURGH&ASSESSOR_ID=&SEARCH_TABLE=valuation_roll_cpsplit&DISPLAY_COUNT=10&TYPE_FLAG=CP&ORDER_BY=PROPERTY_ADDRESS&H_ORDER_BY=SET+DESC&DRILL_SEARCH_TERM=BOSWALL+PARKWAY%2C+EDINBURGH&DD_TOWN=EDINBURGH&DD_STREET=BOSWALL+PARKWAY&UARN=110B60329&PPRN=000000000001745&ASSESSOR_IDX=10&DISPLAY_MODE=FULL#results'
baseurl = 'https://www.saa.gov.uk'
session = requests.session()
response = session.get(url)
# content of search page in soup
html= soup(response.content,"lxml")
properties_col = html.find_all('div')
for col in properties_col:
    ref = 'n/a'
    des = 'n/a'
    head = col.find_all("div",{"class": "columns small-5 medium-4 cell header"})
    data = col.find_all("div",{"class":"columns small-7 medium-8 cell"})
    for i,elem in enumerate(head):
        #for i in range(elems):
        if head [i].text == "Ref No.":
            ref = data[i].text
            print ref
You can do this in two ways.
1) If you are sure that the website you are scraping won't change its content, you can find all the divs by that class and get the content by providing an index.
2) Find all the left-side divs (the titles) and, if one of them matches what you want, get the next sibling to get the text.
Example:
import requests
from bs4 import BeautifulSoup as soup
url = 'https://www.saa.gov.uk/search/?SEARCHED=1&ST=&SEARCH_TERM=city+of+edinburgh%2C+BOSWALL+PARKWAY%2C+EDINBURGH&ASSESSOR_ID=&SEARCH_TABLE=valuation_roll_cpsplit&DISPLAY_COUNT=10&TYPE_FLAG=CP&ORDER_BY=PROPERTY_ADDRESS&H_ORDER_BY=SET+DESC&DRILL_SEARCH_TERM=BOSWALL+PARKWAY%2C+EDINBURGH&DD_TOWN=EDINBURGH&DD_STREET=BOSWALL+PARKWAY&UARN=110B60329&PPRN=000000000001745&ASSESSOR_IDX=10&DISPLAY_MODE=FULL#results'
baseurl = 'https://www.saa.gov.uk'
session = requests.session()
response = session.get(url)
# content of search page in soup
html = soup(response.content,"lxml")
#Method 1
LeftBlockData = html.find_all("div", class_="columns small-7 medium-8 cell")
Reference = LeftBlockData[0].get_text().strip()
Description = LeftBlockData[2].get_text().strip()
print(Reference)
print(Description)
#Method 2
for column in html.find_all("div", class_="columns small-5 medium-4 cell header"):
    RightColumn = column.next_sibling.next_sibling.get_text().strip()
    if "Ref No." in column.get_text().strip():
        print (RightColumn)
    if "Description" in column.get_text().strip():
        print (RightColumn)
The prints will output (in order):
110B60329
STORE
110B60329
STORE
Your problem is that you are trying to match node text that contains a lot of tabs against a non-spaced string.
For example, your head[i].text variable contains something like '\n\t\tRef No.\n\t', so if you compare it with 'Ref No.' it will give a false result. Stripping it will solve the problem.
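For instance, here is a minimal sketch (not from the original code) of that stripped comparison, assuming the header and data divs appear in matching order on the page:
import requests
from bs4 import BeautifulSoup

url = 'https://www.saa.gov.uk/search/?SEARCHED=1&ST=&SEARCH_TERM=city+of+edinburgh%2C+BOSWALL+PARKWAY%2C+EDINBURGH&ASSESSOR_ID=&SEARCH_TABLE=valuation_roll_cpsplit&DISPLAY_COUNT=10&TYPE_FLAG=CP&ORDER_BY=PROPERTY_ADDRESS&H_ORDER_BY=SET+DESC&DRILL_SEARCH_TERM=BOSWALL+PARKWAY%2C+EDINBURGH&DD_TOWN=EDINBURGH&DD_STREET=BOSWALL+PARKWAY&UARN=110B60329&PPRN=000000000001745&ASSESSOR_IDX=10&DISPLAY_MODE=FULL#results'
html = BeautifulSoup(requests.get(url).content, 'lxml')

heads = html.find_all("div", class_="columns small-5 medium-4 cell header")
datas = html.find_all("div", class_="columns small-7 medium-8 cell")
for head, data in zip(heads, datas):
    # strip() removes the surrounding tabs/newlines so the comparison can succeed
    if head.get_text().strip() == "Ref No.":
        print(data.get_text().strip())  # 110B60329
Alternatively, the snippet below avoids index matching entirely by reading each table-row with get_text() and a separator: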
import requests
from bs4 import BeautifulSoup
r = requests.get("https://www.saa.gov.uk/search/?SEARCHED=1&ST=&SEARCH_TERM=city+of+edinburgh%2C+BOSWALL+PARKWAY%2C+EDINBURGH&ASSESSOR_ID=&SEARCH_TABLE=valuation_roll_cpsplit&DISPLAY_COUNT=10&TYPE_FLAG=CP&ORDER_BY=PROPERTY_ADDRESS&H_ORDER_BY=SET+DESC&DRILL_SEARCH_TERM=BOSWALL+PARKWAY%2C+EDINBURGH&DD_TOWN=EDINBURGH&DD_STREET=BOSWALL+PARKWAY&UARN=110B60329&PPRN=000000000001745&ASSESSOR_IDX=10&DISPLAY_MODE=FULL#results")
soup = BeautifulSoup(r.text, 'lxml')
for row in soup.find_all(class_='table-row'):
    print(row.get_text(strip=True, separator='|').split('|'))
Output:
['Ref No.', '110B60329']
['Office', 'LOTHIAN VJB']
['Description', 'STORE']
['Property Address', '29 BOSWALL PARKWAY', 'EDINBURGH', 'EH5 2BR']
['Proprietor', 'SCOTTISH MIDLAND CO-OP SOCIETY LTD.']
['Tenant', 'PROPRIETOR']
['Occupier']
['Net Annual Value', '£1,750']
['Marker']
['Rateable Value', '£1,750']
['Effective Date', '01-APR-10']
['Other Appeal', 'NO']
['Reval Appeal', 'NO']
get_text() is a very powerful tool: you can strip the whitespace and put a separator in the text.
You can use this method to get clean data and filter it.
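As a small usage example (building on the soup object from the snippet above), the rows can be loaded into a dict so that individual fields are easy to look up:
details = {}
for row in soup.find_all(class_='table-row'):
    parts = row.get_text(strip=True, separator='|').split('|')
    if parts and parts[0]:
        # first cell is the label, the rest are the values
        details[parts[0]] = parts[1:]

print(details.get('Ref No.'))      # ['110B60329'] based on the output above
print(details.get('Description'))  # ['STORE']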
Hi all, I am new to Python. Please help me with this requirement.
http://www.example.com/ratings/ratings-rationales.jsp?date=true&result=true
In this link, I have to choose a date first, and then the rating company will list its publications as links. Now I want to search for a link that contains a word of my interest, say "stable". I have tried the following using Python 3.4.2:
from bs4 import BeautifulSoup
from urllib.parse import urljoin
import requests
url = "http://www.example.com/ratings/ratings-rationales.jsp?date=true&result=true"
r = requests.get(url)
soup = BeautifulSoup(r.content)
example_links = lambda tag: getattr(tag, 'name', None) == 'a' and 'stable' in tag.get_text().lower() and 'href' in tag.attrs
results = soup.find_all(example_links)
result_links = [urljoin(url, tag['href']) for tag in results]
print (result_links)
This is not printing anything. I am seeing the result below:
>>>
[]
Obviously I am not giving a date as input.
1. How do I input the from and to dates as today's date? (Obviously to check periodically for updates of the links containing a word of interest, which will be a question for a later time.)
For example, after giving from date: 31-12-2014 and to date: 31-12-2014 as inputs, the output I need is the matching hyperlink.
Any suggestion will be very useful. Thanks in advance.
Here is the updated code; I am still not able to get the result. >>> [] is the output.
from datetime import datetime
from bs4 import BeautifulSoup
from urllib.parse import urljoin
import requests
#Getting the current date
today = datetime.today()
#For the sake of brevity some parameters are missing on the payload
payload = {
    'selArchive': 1,
    'selDay': 31,
    'selMonth': 12,
    'selYear': 2014,
    'selDay1': 31,
    'selMonth1': 12,
    'selYear1': 2014,
    'selSector': '',
    'selIndustry': '',
    'selCompany': ''
}
example_url = "http://www.example.com/"
r = requests.post(example_url, data=payload)
rg = requests.get(example_url)
soup = BeautifulSoup(rg.content)
example_links = lambda tag: getattr(tag, 'name', None) == 'a' and 'stable' in tag.get_text().lower() and 'href' in tag.attrs
results = soup.find_all(example_links)
result_links = [urljoin(example_url, tag['href']) for tag in results]
print (result_links)
You should be doing a POST instead of a GET for this particular site (see this link on how to form a POST request with parameters).
Check this example:
from datetime import datetime
from urllib.parse import urljoin
from bs4 import BeautifulSoup
import requests
#Getting the current date
today = datetime.today()
#Here I'm only passing from and to dates (current date) and the industry parameter
payload = {
    'selDay': 31,
    'selMonth': 12,
    'selYear': 2014,
    'selDay1': 31,
    'selMonth1': 12,
    'selYear1': 2014,
    'selIndustry': '',
    'txtPhrase': '',
    'txtInclude': '',
    'txtExclude': '',
    'selSubServices': 'ALL',
    'selServices': 'all',
    'maxresults': 10,
    'pageno': 1,
    'srchInSrchCol': '01',
    'sortOptions': 'date',
    'isSrchInSrch': '01',
    'txtShowQuery': '01',
    'tSearch': 'Find a Rating',
    'txtSearch': '',
    'selArchive': 1,
    'selSector': 148,
    'selCompany': '',
    'x': 40,
    'y': 11,
}
crisil_url = "http://www.crisil.com/ratings/ratings-rationales.jsp?result=true&Sector=true"
r = requests.post(crisil_url, data=payload)
soup = BeautifulSoup(r.content)
crisil_links = lambda tag: getattr(tag, 'name', None) == 'a' and 'stable' in tag.get_text().lower() and 'href' in tag.attrs
results = soup.find_all(crisil_links)
result_links = [urljoin(crisil_url, tag['href']) for tag in results]
print (result_links)
You will need to check the ids of the industries you are filtering, so be sure to check them via Inspect Element, selecting the industries select box in the browser.
After that, you will get the response and do the parsing via BeautifulSoup, as you are doing now.
Checking periodically:
To check this periodically, you should consider cron (crontab) if using Linux/Unix, or a Scheduled Task if using Windows.
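For example, a crontab entry like the following (the script path is just a placeholder) would run the check every day at 09:00:
0 9 * * * /usr/bin/python3 /home/user/check_ratings.py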