I am trying to scrape a file from this site:
https://data.gov.in/catalog/complete-towns-directory-indiastatedistrictsub-district-level-census-2011
I am looking to download the Excel sheet with the complete directory of towns of TRIPURA, the first one in the grid list.
My code is:
import requests
from bs4 import BeautifulSoup

URL = 'https://data.gov.in/catalog/complete-towns-directory-indiastatedistrictsub-district-level-census-2011'
with requests.Session() as session:
    session.headers = {'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_10_2) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/40.0.2214.115 Safari/537.36'}
    response = session.get(URL)
    soup = BeautifulSoup(response.content, 'html.parser')
soup
The element corresponding to the file is given below. How do I actually download that particular Excel file? The link opens another window where a purpose and an email address have to be given. It would be great if you could provide a solution.
<div class="view-content">
<div class="views-row views-row-1 views-row-odd views-row-first ogpl-grid-list">
<div class="views-field views-field-title"> <span class="field-content"><span class="title-content">Complete Town Directory by India/State/District/Sub-District Level, Census 2011 - TRIPURA</span></span> </div>
<div class="views-field views-field-field-short-name confirmation-popup-177303 download-confirmation-box file-container excel"> <div class="field-content"><a class="177303 data-extension excel" href="https://data.gov.in/resources/complete-town-directory-indiastatedistrictsub-district-level-census-2011-tripura" target="_blank" title="excel (Open in new window)">excel</a></div> </div>
<div class="views-field views-field-dms-allowed-operations-3 visual-access"> <span class="field-content">Visual Access: NA</span> </div>
<div class="views-field views-field-field-granularity"> <span class="views-label views-label-field-granularity">Granularity: </span> <div class="field-content">Decadal</div> </div>
<div class="views-field views-field-nothing-1 download-file"> <span class="field-content"><span class="download-filesize">File Size: 44.5 KB</span></span> </div>
<div class="views-field views-field-field-file-download-count"> <span class="field-content download-counts"> Download: 529</span> </div>
<div class="views-field views-field-field-reference-url"> <span class="views-label views-label-field-reference-url">Reference URL: </span> <div class="field-content">http://www.censusindia.gov.in/2011census...</div> </div>
<div class="views-field views-field-dms-allowed-operations-1 vote_request_data_api"> <span class="field-content"><a class="api-link" href="https://data.gov.in/resources/complete-town-directory-indiastatedistrictsub-district-level-census-2011-tripura/api" title="View API">Data API</a></span> </div>
<div class="views-field views-field-field-note"> <span class="views-label views-label-field-note">Note: </span> <div class="field-content ogpl-more">NA</div> </div>
<div class="views-field views-field-dms-allowed-operations confirmationpopup-177303 data-export-cont"> <span class="views-label views-label-dms-allowed-operations">EXPORT IN: </span> <span class="field-content"><ul></ul></span> </div> </div>
When you click on the excel link, it opens the following page:
https://data.gov.in/node/ID/download
It seems that the ID is the first class name of the link, e.g. t.find('a')['class'][0]. There may be a more concise way to get the ID, but using the class name works as is.
The page https://data.gov.in/node/ID/download then redirects to the final URL of the file.
The following gathers all the URLs in a list:
import requests
from bs4 import BeautifulSoup

URL = 'https://data.gov.in/catalog/complete-towns-directory-indiastatedistrictsub-district-level-census-2011'
src = requests.get(URL)
soup = BeautifulSoup(src.content, 'html.parser')
node_list = [
    t.find('a')['class'][0]
    for t in soup.findAll("div", { "class" : "excel" })
]
url_list = []
for url in node_list:
    node = requests.get("https://data.gov.in/node/{0}/download".format(url))
    soup = BeautifulSoup(node.content, 'html.parser')
    content = soup.find_all("meta")[1]["content"].split("=")[1]
    url_list.append(content)
print(url_list)
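The ID lookup can be tried offline against the anchor markup from the question. This is only a sketch: it assumes the node ID is the one all-digit class on the link, which holds for the sample shown but is not confirmed for every row, and it picks that class by content rather than by position.

```python
from bs4 import BeautifulSoup

# Anchor copied from the question's markup (title attribute dropped for brevity).
SAMPLE_HTML = """
<div class="views-field views-field-field-short-name confirmation-popup-177303 download-confirmation-box file-container excel">
  <div class="field-content"><a class="177303 data-extension excel" href="https://data.gov.in/resources/complete-town-directory-indiastatedistrictsub-district-level-census-2011-tripura" target="_blank">excel</a></div>
</div>
"""

soup = BeautifulSoup(SAMPLE_HTML, "html.parser")
download_urls = []
for div in soup.find_all("div", class_="excel"):
    link = div.find("a")
    if link is None:
        continue
    # Pick the first all-digit class instead of relying on its position.
    node_id = next((c for c in link.get("class", []) if c.isdigit()), None)
    if node_id is not None:
        download_urls.append("https://data.gov.in/node/{0}/download".format(node_id))

print(download_urls)
```

Selecting by content avoids an IndexError or a wrong ID if the site ever reorders the class names.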
Complete code that downloads the files using the default filename (using this post):
import requests
from bs4 import BeautifulSoup
import urllib.request
import urllib.parse
import shutil
import os

def download(url, fileName=None):
    def getFileName(url, openUrl):
        if 'Content-Disposition' in openUrl.info():
            # If the response has Content-Disposition, try to get filename from it
            cd = dict(map(
                lambda x: x.strip().split('=') if '=' in x else (x.strip(),''),
                openUrl.info()['Content-Disposition'].split(';')))
            if 'filename' in cd:
                filename = cd['filename'].strip("\"'")
                if filename: return filename
        # if no filename was found above, parse it out of the final URL.
        return os.path.basename(urllib.parse.urlsplit(openUrl.url)[2])

    # urllib.request and urllib.parse are the Python 3 names for urllib2 and urlparse
    r = urllib.request.urlopen(urllib.request.Request(url))
    try:
        fileName = fileName or getFileName(url, r)
        with open(fileName, 'wb') as f:
            shutil.copyfileobj(r, f)
    finally:
        r.close()

URL = 'https://data.gov.in/catalog/complete-towns-directory-indiastatedistrictsub-district-level-census-2011'
src = requests.get(URL)
soup = BeautifulSoup(src.content, 'html.parser')
node_list = [
    t.find('a')['class'][0]
    for t in soup.findAll("div", { "class" : "excel" })
]
url_list = []
for url in node_list:
    node = requests.get("https://data.gov.in/node/{0}/download".format(url))
    soup = BeautifulSoup(node.content, 'html.parser')
    content = soup.find_all("meta")[1]["content"].split("=")[1]
    url_list.append(content)
    print("download : " + content)
    download(content)
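The code above takes the second meta tag and splits its content on "=", which suggests the redirect page uses a meta refresh tag. A slightly more defensive sketch of that step follows. The sample page and its file URL are hypothetical, and refresh_target is an illustrative helper name, not part of the original answer.

```python
from bs4 import BeautifulSoup

# Hypothetical redirect page: the file URL below is made up for illustration.
SAMPLE_HTML = """
<html><head>
<meta charset="utf-8">
<meta http-equiv="refresh" content="0; url=https://example.org/files/towns.xls?dl=1">
</head><body></body></html>
"""

def refresh_target(html):
    """Return the URL from a <meta http-equiv="refresh"> tag, or None if absent."""
    soup = BeautifulSoup(html, "html.parser")
    for meta in soup.find_all("meta"):
        if meta.get("http-equiv", "").lower() == "refresh":
            # Split on the first "=" only, so "=" inside the URL is preserved.
            _, _, target = meta.get("content", "").partition("=")
            return target.strip() or None
    return None

target = refresh_target(SAMPLE_HTML)
print(target)
```

Looking the tag up by its http-equiv attribute does not depend on the order of the meta tags, and splitting once keeps query strings such as ?dl=1 intact.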
I am new to Python and have never done HTML, so any help would be appreciated.
I need to extract two numbers, '1062' and '348', from a website's HTML (as shown by Inspect Element).
This is my code:
page = requests.get("https://www.traderscockpit.com/?pageView=live-nse-advance-decline-ratio-chart")
soup = BeautifulSoup(page.content, 'html.parser')
Adv = soup.select_one ('.col-sm-6 .advDec:nth-child(1)').text[10:]
Dec = soup.select_two ('.col-sm-6 .advDec:nth-child(2)').text[10:]
The website element looks like below:
<div class="nifty-header-shade1 col-xs-12 col-sm-6 col-md-3">
<div class="row">
<div class="col-sm-12">
<h4>Stocks</h4>
</div>
<div class="col-sm-6">
<p class="advDec">Advanced: 1062</p>
</div>
<div class="col-sm-6">
<p class="advDec">Declined: 348</p>
</div>
</div>
</div>
Using my code, I am able to extract the first number (1062) but not the second one (348). Can you please help?
Assuming the pattern is always the same, you can select the elements by their text and take each one's next_sibling:
adv = soup.select_one('a:-soup-contains("Advanced:")').next_sibling.strip()
dec = soup.select_one('a:-soup-contains("Declined:")').next_sibling.strip()
Example
import requests
from bs4 import BeautifulSoup
page = requests.get("https://www.traderscockpit.com/?pageView=live-nse-advance-decline-ratio-chart")
soup = BeautifulSoup(page.content, "html.parser")
adv = soup.select_one('a:-soup-contains("Advanced:")').next_sibling.strip()
dec = soup.select_one('a:-soup-contains("Declined:")').next_sibling.strip()
print(adv, dec)
If there are always two elements, the simplest way is probably to unpack the list of selected elements:
import requests
from bs4 import BeautifulSoup
page = requests.get("https://www.traderscockpit.com/?pageView=live-nse-advance-decline-ratio-chart")
soup = BeautifulSoup(page.content, "html.parser")
adv, dec = [elm.next_sibling.strip() for elm in soup.select(".advDec a")]
print("Advanced:", adv)
print("Declined:", dec)
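Both answers above rely on an a tag inside each .advDec paragraph. For the markup exactly as pasted in the question, where the label and number share one text node, a sketch like the following may be enough. It parses a copy of the question's HTML rather than the live page, and parse_adv_dec is an illustrative name.

```python
from bs4 import BeautifulSoup

# Markup copied from the question; the live page may differ.
SAMPLE_HTML = """
<div class="row">
  <div class="col-sm-6">
    <p class="advDec">Advanced: 1062</p>
  </div>
  <div class="col-sm-6">
    <p class="advDec">Declined: 348</p>
  </div>
</div>
"""

def parse_adv_dec(html):
    """Map each label ('Advanced', 'Declined') to its integer count."""
    soup = BeautifulSoup(html, "html.parser")
    counts = {}
    for p in soup.select("p.advDec"):
        # Text looks like "Advanced: 1062"; split on the first colon only.
        label, _, value = p.get_text(strip=True).partition(":")
        counts[label.strip()] = int(value.strip())
    return counts

counts = parse_adv_dec(SAMPLE_HTML)
print(counts["Advanced"], counts["Declined"])
```

Splitting on the colon avoids the fixed [10:] slice from the question, which only works while both labels have the same length.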
I'm trying to scrape the PDF links from the drop-down on this website. I want to scrape just the Guideline Values (CVC) drop-down. The following is the code I used, but it did not succeed:
import requests
from bs4 import BeautifulSoup

req_ses = requests.Session()
igr_get_base_response = req_ses.get("https://igr.karnataka.gov.in/english#")
soup = BeautifulSoup(igr_get_base_response.text)

def matches_block(tag):
    return matches_dropdown(tag) and tag.find(matches_text) != None

def matches_dropdown(tag):
    return tag.name == 'li' and tag.has_attr('class') and 'dropdown-toggle' in tag['class']

def matches_text(tag):
    return tag.name == 'a' and tag.get_text()

for li in soup.find_all(matches_block):
    for ul in li.find_all('ul', class_='dropdown-toggle'):
        for a in ul.find_all('a'):
            if a.has_attr('href'):
                print(a['href'])
Any suggestion would be a great help!
Edit: Adding part of the HTML below:
<div class="collapse navbar-collapse">
<ul class="nav navbar-nav">
<li class="">
<i class="fa fa-home"> </i>
</li>
<li>
<a class="dropdown-toggle" data-toggle="dropdown" title="RTI Act">RTI Act <b class="caret"></b></a>
<ul class="dropdown-menu multi-level">
<!-- <li> -->
<li class="">
<a href=" https://igr.karnataka.gov.in/page/RTI+Act/Yadagiri+./en " title="Yadagiri .">Yadagiri .
</a>
</li>
<!-- </li> -->
<!-- <li>
I have tried to get the links of all the PDF files that you need.
I selected the <a> tags whose href matches a pattern (see patt in the code). This pattern is common to all the PDF files that you need.
The links list then holds all the links to the PDF files.
from bs4 import BeautifulSoup
import requests

url = 'https://igr.karnataka.gov.in/english#'
resp = requests.get(url)
soup = BeautifulSoup(resp.text, 'html.parser')
a = soup.find('a', attrs= {'title': 'Guidelines Value (CVC)'})
# calling a tag is shorthand for find_all(), so this lists every tag under the parent
lst = a.parent()
links = []
patt = 'https://igr.karnataka.gov.in/storage/pdf-files/Guidelines Value/'
for i in lst:
    temp = i.find('a')
    if temp:
        if patt in temp['href']:
            links.append(temp['href'].strip())
First I find the ul tag that contains all the data. Then I call find_all on it for the a tags whose attrs include target: _blank. Those are the tags whose href points to a .pdf file, so only the PDF links are extracted.
from bs4 import BeautifulSoup
import requests

res = requests.get("https://igr.karnataka.gov.in/english#")
soup = BeautifulSoup(res.text, "lxml")
ul_tag = soup.find("ul", class_="nav navbar-nav")
a_tag = ul_tag.find_all("a", attrs={"target": "_blank"})
for i in a_tag:
    print(i.get_text(strip=True))
    print(i.get("href").strip())
Output:
SRO Chikkaballapur
https://igr.karnataka.gov.in/storage/pdf-files/Guidelines Value/chikkaballapur sro.pdf
SRO Gudibande
https://igr.karnataka.gov.in/storage/pdf-files/Guidelines Value/gudibande sro.pdf
SRO Shidlaghatta
https://igr.karnataka.gov.in/storage/pdf-files/Guidelines Value/shidlagatta sro.pdf
SRO Bagepalli
....
So, I used the following approach to complete the above-mentioned part:
def make_sqlite_dict_from_parsed_row(district_value, sro_value, pdf_file_link):
    sqlite_dict = {
        "district_value": district_value,
        "sro_value": sro_value,
        "pdf_file_link": pdf_file_link.strip().replace(' ', '%20'),
        "status": "PENDING"
    }
    sqlite_dict['hsh'] = get_hash(sqlite_dict, IGR_SQLITE_HSH_TUP)
    return sqlite_dict

li_element_list = home_response_soup.find_all('li', {'class': 'dropdown-submenu'})
parsed_row_list = []
for ele in li_element_list:
    district_value = ele.find('a', {'class': 'dropdown-toggle'}).get_text().strip()
    sro_pdf_a_tags = ele.find_all('a', attrs={'target': '_blank'})
    if len(sro_pdf_a_tags) >= 1:
        for sro_a_tag in sro_pdf_a_tags:
            sqlite_dict = make_sqlite_dict_from_parsed_row(
                district_value,
                sro_a_tag.get_text(strip=True),
                sro_a_tag.get('href')
            )
            parsed_row_list.append(sqlite_dict)
    else:
        print("District: ", district_value, "'s pdf is corrupted")
This gives a proper PDF link, the SRO name and the district name.
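The replace(' ', '%20') step above only handles spaces. As a possible alternative, urllib.parse.quote from the standard library percent-encodes every unsafe character. The sketch below assumes the scraped href is not already encoded, and encode_pdf_link is an illustrative name.

```python
from urllib.parse import quote

def encode_pdf_link(link):
    """Percent-encode a scraped href, keeping ':' and '/' so the URL structure survives."""
    return quote(link.strip(), safe=":/")

# The hrefs on the page carry stray whitespace and unencoded spaces.
raw_link = " https://igr.karnataka.gov.in/storage/pdf-files/Guidelines Value/gudibande sro.pdf "
encoded_link = encode_pdf_link(raw_link)
print(encoded_link)
```

If a link already contains %20, quote would encode the percent sign again, so this is only suitable for raw, unencoded hrefs.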
Hi, I would like to get the information inside the <del> and <ins> tags below, but I could not find a solution. Does anyone have an idea how to scrape this, and is there a way to get that information?
This is my Python code:
import requests
import json
from bs4 import BeautifulSoup

header = {'User-Agent': 'Mozilla/5.0 (Windows; U; Windows NT 5.1; en-US; rv:1.9.0.7) Gecko/2009021910 Firefox/3.0.7'}
base_url = "https://www.n11.com/super-firsatlar"
r = requests.get(base_url, headers=header)
if r.status_code == 200:
    soup = BeautifulSoup(r.text, 'html.parser')
    books = soup.find_all('li', attrs={"class": "column"})
    result = []
    for book in books:
        title = book.find('h3').text
        link = base_url + book.find('a')['href']
        picture = base_url + book.find('img')['src']
        price = book.find('p', {'class': 'del'})
        single = {'title': title, 'link': link, 'picture': picture, 'price': price}
        result.append(single)
    with open('book.json', 'w', encoding='utf-8') as f:
        json.dump(result, f, indent=4, ensure_ascii=False)
else:
    print(r.status_code)
And this is my HTML page:
<li class="column">
<script type="text/javascript">
var customTextOptionMap = {};
</script>
<div id="p-457010862" class="columnContent ">
<div class="pro">
<a href="https://www.n11.com/urun/oppo-a73-128-gb-oppo-turkiye-garantili-1599155?magaza=gelecekbizde"
title="Oppo A73 128 GB (Oppo Türkiye Garantili)" class="plink" data-id="457010862">
<img data-original="https://n11scdn1.akamaized.net/a1/215/elektronik/cep-telefonu/oppo-a73-128-gb-oppo-turkiye-garantili__1298508275589871.jpg"
width="215" height="215"
src="https://n11scdn1.akamaized.net/a1/215/elektronik/cep-telefonu/oppo-a73-128-gb-oppo-turkiye-garantili__1298508275589871.jpg"
alt="Oppo A73 128 GB (Oppo Türkiye Garantili)" class="lazy" style="">
<h3 class="productName ">
Oppo A73 128 GB (Oppo Türkiye Garantili)</h3>
<span class="loading"></span>
</a>
</div>
<div class="proDetail">
<a href="https://www.n11.com/urun/oppo-a73-128-gb-oppo-turkiye-garantili-1599155?magaza=gelecekbizde"
class="oldPrice" title="Oppo A73 128 GB (Oppo Türkiye Garantili)">
<del>2.999, 00 TL</del>
</a> <a href="https://www.n11.com/urun/oppo-a73-128-gb-oppo-turkiye-garantili-1599155?magaza=gelecekbizde"
class="newPrice" title="Oppo A73 128 GB (Oppo Türkiye Garantili)">
<ins>2.899, 00<span content="TRY">TL</span></ins>
</a>
<div class="discount discountS">
<div>
<span class="percent">%</span>
<span class="ratio">3</span>
</div>
</div>
<span class="textImg freeShipping"></span>
<p class="catalogView-hover-separate"></p>
<div class="moreOpt">
<a title="Oppo A73 128 GB (Oppo Türkiye Garantili)" class="textImg moreOptBtn"
href="https://www.n11.com/urun/oppo-a73-128-gb-oppo-turkiye-garantili-1599155?magaza=gelecekbizde"></a>
</div>
</div>
</div>
</li>
Unless I am misunderstanding your question, it should be as simple as:
del_data = soup.find_all("del")
ins_data = soup.find_all("ins")
Is this not what you're trying to achieve? If not, please clarify your question.
del and ins are not class names but tags. You can simply find them with soup.find_all('del'):
price = book.find_all('del')
for p in price:
    print(p.text)
gives
2.999,00 TL
189,90 TL
8.308,44 TL
499,90 TL
6.999,00 TL
99,00 TL
18,00 TL
499,00 TL
169,99 TL
1.499,90 TL
3.010,00 TL
2.099,90 TL
......
which is what you want, I guess. You have to access the text attribute here. So the element is located; how you want to serialize it is a different question.
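To connect this back to the JSON output in the question, here is a minimal sketch that stores the text of the del and ins tags for each product. It parses a trimmed copy of the question's markup instead of the live page, and tag_text and the old_price/new_price keys are illustrative names.

```python
import json
from bs4 import BeautifulSoup

# Trimmed from the markup in the question; the live page may differ.
SAMPLE_HTML = """
<li class="column">
  <div class="pro">
    <a href="https://www.n11.com/urun/oppo-a73-128-gb-oppo-turkiye-garantili-1599155?magaza=gelecekbizde" class="plink">
      <h3 class="productName ">
        Oppo A73 128 GB (Oppo Türkiye Garantili)</h3>
    </a>
  </div>
  <div class="proDetail">
    <a class="oldPrice"><del>2.999, 00 TL</del></a>
    <a class="newPrice"><ins>2.899, 00<span content="TRY">TL</span></ins></a>
  </div>
</li>
"""

def tag_text(parent, name):
    """Return the text of the first <name> tag under parent, or None if absent."""
    tag = parent.find(name)
    return tag.get_text(" ", strip=True) if tag else None

soup = BeautifulSoup(SAMPLE_HTML, "html.parser")
result = []
for item in soup.find_all("li", attrs={"class": "column"}):
    result.append({
        "title": tag_text(item, "h3"),
        "old_price": tag_text(item, "del"),
        "new_price": tag_text(item, "ins"),
    })

print(json.dumps(result, indent=4, ensure_ascii=False))
```

Storing plain strings rather than Tag objects keeps the result serializable by json.dump, and the None fallback covers products that have no discounted price.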
I want to fetch the number 121 from the code below, but the soup object that I am getting does not show the number.
Link to my Image
[<div class="open_pln" id="pln_1">
<ul>
<li>
<div class="box_check_txt">
<input id="cp1" name="cp1" onclick="change_plan(2,102,2);" type="checkbox"/>
<label for="cp1"><span class="green"></span></label>
</div>
</li>
<li id="li_open"><span>Desk</span> <br/></li>
<li> </li>
</ul>
</div>]
The number 121 for open offices is not inside the HTML code but in the JavaScript. You can use a regex to extract it:
import re
import requests
url ='https://www.coworker.com/search/los-angeles/ca/united-states'
htmlpage = requests.get(url).text
open_offices = re.findall(r'var openOffices\s*=\s*(\d+)', htmlpage)[0]
private_offices = re.findall(r'var privateOffices\s*=\s*(\d+)', htmlpage)[0]
print('Open offices: {}'.format(open_offices))
print('Private offices: {}'.format(private_offices))
Prints:
Open offices: 121
Private offices: 40
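The same regex idea can be checked offline. The sketch below runs against a hypothetical script fragment, since the real page's layout may differ, and uses re.search so that a missing variable yields None instead of the IndexError that [0] on an empty findall result would raise. js_int is an illustrative name.

```python
import re

# Hypothetical page fragment; the real script block may be laid out differently.
SAMPLE_PAGE = """
<script>
    var openOffices = 121;
    var privateOffices = 40;
</script>
"""

def js_int(name, page):
    """Return the integer assigned to `var <name>` in page, or None if not found."""
    match = re.search(r"var\s+" + re.escape(name) + r"\s*=\s*(\d+)", page)
    return int(match.group(1)) if match else None

open_offices = js_int("openOffices", SAMPLE_PAGE)
private_offices = js_int("privateOffices", SAMPLE_PAGE)
missing = js_int("sharedOffices", SAMPLE_PAGE)  # made-up name, not on the page
print(open_offices, private_offices, missing)
```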
Without re module:
import requests
from bs4 import BeautifulSoup
url ='https://www.coworker.com/search/los-angeles/ca/united-states'
res = requests.get(url)
soup = BeautifulSoup(res.text,"lxml")
searchstr = "var openOffices = "
script = soup.select_one(f"script:contains('{searchstr}')").text
print(script.split(searchstr)[1].split(";")[0])
Output:
121
You have to find all the li tags using soup, like this:
attribute = "li"
all_links = soup.find_all(attribute)
for link in all_links:
    print(link.text.strip())
How can I get the hrefs from the Remixed From part on the left-hand side of the following page?
webpage = 'http://www.thingiverse.com/thing:1492411'
I tried something like:
response = requests.get('http://www.thingiverse.com/thing:1492411')
soup = BeautifulSoup(response.text, "lxml")
remm = soup.find_all("a", class_="thing-stamp")
for rems in remm:
    adress = rems.find("href")
    print(adress)
Of course, it didn't work.
EDIT: HTML tag looks like this:
<a class="thing-stamp" href="/thing:1430125">
<div class="stamp-content">
<img alt="" class="render" src="https://cdn.thingiverse.com/renders/57/ea/1b/cc/c2/1105df2596f30c4df6dcf12ca5800547_thumb_small.JPG"/> <div>Anet A8 button guide by Simhopp</div>
</div>
</a>
<a class="thing-stamp" href="/thing:1430727">
<div class="stamp-content">
<img alt="" class="render" src="https://cdn.thingiverse.com/renders/93/a9/c8/45/da/05fe99b5215b04ec33d22ea70266ac72_thumb_small.JPG"/> <div>Frame brace for Anet A8 by Simhopp</div>
</div>
</a>
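Try this. href is an attribute of the tag, not a child element, so read it with get() rather than find():
remm = soup.find_all("a", class_="thing-stamp")
for rems in remm:
    adress = rems.get('href')
    print(adress)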
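The hrefs in the question's markup are relative (for example /thing:1430125). If absolute links are needed, a sketch along these lines could be used. It parses a trimmed copy of the question's HTML, and the third anchor, which has no href, is a made-up case added to show the href=True filter.

```python
from urllib.parse import urljoin
from bs4 import BeautifulSoup

BASE_URL = "http://www.thingiverse.com/thing:1492411"

# First two anchors are trimmed from the question; the third is a made-up
# case with no href, included only to show that it is skipped.
SAMPLE_HTML = """
<a class="thing-stamp" href="/thing:1430125"><div>Anet A8 button guide by Simhopp</div></a>
<a class="thing-stamp" href="/thing:1430727"><div>Frame brace for Anet A8 by Simhopp</div></a>
<a class="thing-stamp"><div>stamp without a link</div></a>
"""

soup = BeautifulSoup(SAMPLE_HTML, "html.parser")
links = [
    urljoin(BASE_URL, a["href"])  # resolve the relative href against the page URL
    for a in soup.find_all("a", class_="thing-stamp", href=True)
]
print(links)
```

Passing href=True to find_all keeps only anchors that actually have the attribute, so indexing a["href"] cannot raise a KeyError.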