I am new to Beautiful Soup. I need to get data from an HTML file.
<div class="ques_ans_block">
  <div class="question">
    <p>is this correct ?</p>
  </div>
  <div>
    <p class="answer"></p>
    <div class="moreinfo" style="display: block;">
      <p class="answer"></p>
      <p class="answer"></p>
    </div>
  </div>
</div>
The condition is that the "moreinfo" div may be present or absent, so I need to find the question and answer inner text (including the answer from "moreinfo", if present) for each ques_ans_block.
This will give output as JSON containing the Question, Answer and FaqId.
import bs4
import json
import codecs

arrayList = []
bsp = bs4.BeautifulSoup(open('input.html'), 'html.parser')
ques_ans_blocks = bsp.find_all("div", {"class": "ques_ans_block"})
count = 1
for i in ques_ans_blocks:
    data = {}
    s = ""
    q = i.select('.question')
    for a in q:
        s += a.text + "\n"
    for a in q:
        a.extract()  # remove the question so its text is not repeated in the answer
    data["Question"] = s
    v = ""
    for a in i.select('p'):
        v += a.text + "\n"
    for a in i.select('li'):
        v += a.text + "\n"
    data["Answer"] = v
    data["FaqId"] = count
    arrayList.append(data)
    count = count + 1

with codecs.open('output.json', 'wt', 'utf-8') as outfile:
    json.dump(arrayList, outfile, indent=4)
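A more compact sketch of the same idea (the sample HTML and variable names below are my own, made up to match the class names in the question), using get_text and extracting the question before collecting the answer paragraphs:

```python
from bs4 import BeautifulSoup

# hypothetical sample using the class names from the question
html = """<div class="ques_ans_block">
  <div class="question"><p>is this correct ?</p></div>
  <div>
    <p class="answer">Yes.</p>
    <div class="moreinfo"><p class="answer">More detail.</p></div>
  </div>
</div>"""
soup = BeautifulSoup(html, 'html.parser')

faqs = []
for faq_id, block in enumerate(soup.find_all('div', class_='ques_ans_block'), start=1):
    question = block.find('div', class_='question')
    q_text = question.get_text(strip=True)
    question.extract()  # remove it so the answer pass below skips the question <p>
    a_text = '\n'.join(p.get_text(strip=True) for p in block.find_all('p'))
    faqs.append({'Question': q_text, 'Answer': a_text, 'FaqId': faq_id})
print(faqs)
```

Because the "moreinfo" div is searched through block.find_all('p') anyway, it is picked up automatically when present and silently skipped when absent.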
I'm trying to scrape the PDF links from the drop-down menu of this website. I want to scrape just the Guideline Values (CVC) drop-down. Following is the code I used, but it did not succeed:
import requests
from bs4 import BeautifulSoup

req_ses = requests.Session()
igr_get_base_response = req_ses.get("https://igr.karnataka.gov.in/english#")
soup = BeautifulSoup(igr_get_base_response.text, 'html.parser')

def matches_block(tag):
    return matches_dropdown(tag) and tag.find(matches_text) != None

def matches_dropdown(tag):
    return tag.name == 'li' and tag.has_attr('class') and 'dropdown-toggle' in tag['class']

def matches_text(tag):
    return tag.name == 'a' and tag.get_text()

for li in soup.find_all(matches_block):
    for ul in li.find_all('ul', class_='dropdown-toggle'):
        for a in ul.find_all('a'):
            if a.has_attr('href'):
                print(a['href'])
Any suggestion would be a great help!
Edit: adding part of the HTML below:
<div class="collapse navbar-collapse">
<ul class="nav navbar-nav">
<li class="">
<i class="fa fa-home"> </i>
</li>
<li>
<a class="dropdown-toggle" data-toggle="dropdown" title="RTI Act">RTI Act <b class="caret"></b></a>
<ul class="dropdown-menu multi-level">
<!-- <li> -->
<li class="">
<a href=" https://igr.karnataka.gov.in/page/RTI+Act/Yadagiri+./en " title="Yadagiri .">Yadagiri .
</a>
</li>
<!-- </li> -->
<!-- <li>
I have tried to get the links of all the PDF files that you need.
I selected the <a> tags whose href matches a common pattern; see patt in the code. This pattern is shared by all the PDF files you need.
After running this, all the links to the PDF files are in the links list.
from bs4 import BeautifulSoup
import requests

url = 'https://igr.karnataka.gov.in/english#'
resp = requests.get(url)
soup = BeautifulSoup(resp.text, 'html.parser')

a = soup.find('a', attrs={'title': 'Guidelines Value (CVC)'})
lst = a.parent()  # calling a tag is a shortcut for find_all() on its children
links = []
patt = 'https://igr.karnataka.gov.in/storage/pdf-files/Guidelines Value/'
for i in lst:
    temp = i.find('a')
    if temp:
        if patt in temp['href']:
            links.append(temp['href'].strip())
I first find the ul tag in which all the data is available, then call find_all on it for a tags with target="_blank", since those hold the .pdf hrefs; from these we can extract only the .pdf links.
from bs4 import BeautifulSoup
import requests

res = requests.get("https://igr.karnataka.gov.in/english#")
soup = BeautifulSoup(res.text, "lxml")
ul_tag = soup.find("ul", class_="nav navbar-nav")
a_tag = ul_tag.find_all("a", attrs={"target": "_blank"})
for i in a_tag:
    print(i.get_text(strip=True))
    print(i.get("href").strip())
Output:
SRO Chikkaballapur
https://igr.karnataka.gov.in/storage/pdf-files/Guidelines Value/chikkaballapur sro.pdf
SRO Gudibande
https://igr.karnataka.gov.in/storage/pdf-files/Guidelines Value/gudibande sro.pdf
SRO Shidlaghatta
https://igr.karnataka.gov.in/storage/pdf-files/Guidelines Value/shidlagatta sro.pdf
SRO Bagepalli
....
So, I used the following approach to complete the above-mentioned part:
def make_sqlite_dict_from_parsed_row(district_value, sro_value, pdf_file_link):
    sqlite_dict = {
        "district_value": district_value,
        "sro_value": sro_value,
        "pdf_file_link": pdf_file_link.strip().replace(' ', '%20'),
        "status": "PENDING"
    }
    sqlite_dict['hsh'] = get_hash(sqlite_dict, IGR_SQLITE_HSH_TUP)
    return sqlite_dict

li_element_list = home_response_soup.find_all('li', {'class': 'dropdown-submenu'})
parsed_row_list = []
for ele in li_element_list:
    district_value = ele.find('a', {'class': 'dropdown-toggle'}).get_text().strip()
    sro_pdf_a_tags = ele.find_all('a', attrs={'target': '_blank'})
    if len(sro_pdf_a_tags) >= 1:
        for sro_a_tag in sro_pdf_a_tags:
            sqlite_dict = make_sqlite_dict_from_parsed_row(
                district_value,
                sro_a_tag.get_text(strip=True),
                sro_a_tag.get('href')
            )
            parsed_row_list.append(sqlite_dict)
    else:
        print("District: ", district_value, "'s pdf is corrupted")
This will give a proper pdf_link, sro_name and district_name for every row.
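As a side note, the manual .replace(' ', '%20') can be done more robustly with the standard library's urllib.parse.quote, which also handles any other characters that need percent-encoding (the sample link below is taken from the output shown earlier):

```python
from urllib.parse import quote

link = "https://igr.karnataka.gov.in/storage/pdf-files/Guidelines Value/chikkaballapur sro.pdf"
# keep ':' and '/' intact, percent-encode the spaces (and anything else unsafe)
safe_link = quote(link, safe=':/')
print(safe_link)
```

This produces a link that can be passed directly to requests.get for downloading the PDF.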
This is the data structure:
<div class = 'xxx' id = 'yyy'>
<div class id = 'zzz' class = 'kkk'>
<script type = 'bbb'>
// noise word
</script>
<strong class = '111'>...</strong>
<span class = '222'>...</span>
<br>
### target data
<br>
### target data2
<br>
### target data 3
▶ 장도리 | 그림마당 보기<br/>▶ 경향신문 바로가기▶ 경향신문 구독신청하기<br/><br/>©경향신문(www.khan.co.kr), 무단전재 및 재배포 금지<br/><br/>
<!-- // 본문 내용 -->
</div>
... etc
I want to get only the target data, and I am using bs4 to extract text from the HTML.
Below is my source code to get the text:
soup.find_all('div',{'id':'yyy'})
but it returns too much noise data.
How can I get only the target data using bs4 with Selenium?
This is the target URL, and I want to get the text in the article body.
Below is my parsing source code:
def crwaling_article(self, url):
    """
    Parameters
    ----------
    url : str
        Article URL to get data.

    Returns
    -------
    None.
    """
    chrome_driver = webdriver.Chrome('D:/바탕 화면/인턴/python/crawling_software/crwaler/news_crwaling/chromedriver.exe')
    chrome_driver.get(url)
    html = chrome_driver.page_source
    soup = BeautifulSoup(html, 'html.parser')
    title = soup.find('div', {'class': 'article_info'}).find('h3', {'id': 'articleTitle'}).get_text()
    date = soup.find('div', {'class': 'article_info'}).find('span', {'class': 't11'}).get_text()
    article = soup.find_all('div', {'id': 'articleBodyContents'})
    chrome_driver.quit()
    self.set_title(title)
    self.set_date(date)
I think removing all sub-tags will work. Have a look:
data = """<div class='xxx' id = 'yyy'>
<div class id = 'zzz' class='kkk'>
<script type = 'bbb'>
// noise word
</script>
<strong class = '111'>...</strong>
<span class = '222'>...</span>
<br>
### target data
<br>
### target data2
<br>
### target data 3
▶ 장도리 | 그림마당 보기<br/>▶ 경향신문 바로가기▶ 경향신문 구독신청하기<br/><br/>©경향신문(www.khan.co.kr), 무단전재 및 재배포 금지<br/><br/>
<!-- // 본문 내용 -->
</div>
</div>
"""
from bs4 import BeautifulSoup
if __name__ == '__main__':
    soup = BeautifulSoup(data, 'html.parser')
    for tag in soup.find('div', {'id': 'yyy'}).find_all():  # walk all sub tags
        if not tag.find_all():  # leaf tag with no sub tags of its own
            tag.decompose()
    print(soup.text.strip())
output:
### target data
### target data2
### target data 3
©경향신문(www.khan.co.kr), 무단전재 및 재배포 금지
It's hard to come up with a clean solution because of the incomplete HTML, but this can give you an idea: you can .extract() the unnecessary tags and leave only the desired data:
from bs4 import BeautifulSoup
txt = '''<div class = 'xxx' id = 'yyy'>
<div class id = 'zzz' class = 'kkk'>
<script type = 'bbb'>
// noise word
</script>
<strong class = '111'>...</strong>
<span class = '222'>...</span>
<br>
### target data
<br>
### target data2
<br>
### target data 3
▶ 장도리 | 그림마당 보기<br/>▶ 경향신문 바로가기▶ 경향신문 구독신청하기<br/><br/>©경향신문(www.khan.co.kr), 무단전재 및 재배포 금지<br/><br/>
<!-- // 본문 내용 -->
</div>'''
soup = BeautifulSoup(txt, 'html.parser')
div = soup.select_one('#yyy')

# remove all non-necessary tags
div.select_one('script').extract()
div.find('strong', class_='111').extract()
div.find('span', class_='222').extract()
for t in list(div.a.next_siblings):  # snapshot the siblings, then remove all tags after <a>
    t.extract()
div.a.extract()  # remove <a> itself
print(div.prettify())
Prints:
<div class="xxx" id="yyy">
<div class="kkk" id="zzz">
<br/>
### target data
<br/>
### target data2
<br/>
### target data 3
</div>
</div>
You can use multiple CSS Selectors to scrape the correct data:
from time import sleep
from selenium import webdriver
from bs4 import BeautifulSoup

URL = "https://news.naver.com/main/read.nhn?mode=LS2D&mid=shm&sid1=100&sid2=264&oid=016&aid=0001737526"
driver = webdriver.Chrome()
driver.get(URL)
sleep(5)
soup = BeautifulSoup(driver.page_source, "html.parser")
for tags1, tags2 in zip(
    soup.select(
        "#articleBodyContents > br:nth-child(5), br:nth-child(7), br:nth-child(9)"
    ),
    soup.select("#articleBodyContents > span:nth-child(10)"),
):
    print(tags1.next)
    print(tags2.next.next)
Output:
[헤럴드경제=강문규 기자] 문재인 대통령은 14일 라임·옵티머스 사건과 관련해 청와대 참... and so on.
I'm trying to select the second div tag with the info class name, but with no success using bs4's find_next. How do you go about selecting the text inside the second div tag that shares the class name?
[<div class="info">
<a href="/clubs/12/Manchester-United/overview">
Manchester United<span class="playerClub badge-20 t1"></span>
</a>
</div>
<div class="info">Defender</div>
<div class="info">
<a href="/clubs/12/Manchester-United/overview">
Manchester United<span class="playerClub badge-20 t1"></span>
</a>
</div>
<div class="info">Defender</div>]
Here is what I have tried
from bs4 import BeautifulSoup as soup
import requests

players_url = ['http://www.premierleague.com//players/13559/Axel-Tuanzebe/stats']

# this is the dict where we store all information:
players = {}
for url in players_url:
    player_page = requests.get(url)
    cont = soup(player_page.content, 'lxml')
    data = dict(
        (k.contents[0].strip(), v.get_text(strip=True))
        for k, v in zip(
            cont.select('.topStat span.stat, .normalStat span.stat'),
            cont.select('.topStat span.stat > span, .normalStat span.stat > span'),
        )
    )
    club = {"Club": cont.find('div', attrs={'class': 'info'}).get_text(strip=True)}
    position = {"Position": cont.find_next('div', attrs={'class': 'info'})}
    players[cont.select_one('.playerDetails .name').get_text(strip=True)] = data
    print(position)
You can try the following:
club_ele = cont.find('div', attrs={'class': 'info'})
club = {"Club": club_ele.get_text(strip=True)}
position = {"Position": club_ele.find_next('div', attrs={'class': 'info'}).get_text(strip=True)}
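An alternative sketch, using a minimal stand-in for the HTML shown in the question: since all four divs share the class, find_all returns them in document order, so you can simply index the second one instead of chaining find_next:

```python
from bs4 import BeautifulSoup

# stand-in for the fragment from the question
html = """<div class="info"><a href="/clubs/12/Manchester-United/overview">Manchester United</a></div>
<div class="info">Defender</div>"""
cont = BeautifulSoup(html, 'html.parser')

infos = cont.find_all('div', class_='info')
club = {"Club": infos[0].get_text(strip=True)}
position = {"Position": infos[1].get_text(strip=True)}
print(club, position)
```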
I am working on scraping the data from a website using BeautifulSoup. For whatever reason, I cannot seem to find a way to get the text between span elements to print. Here is what I am running.
data = """ <div class="grouping">
<div class="a1 left" style="width:20px;">Text</div>
<div class="a2 left" style="width:30px;"><span
id="target_0">Data1</span>
</div>
<div class="a3 left" style="width:45px;"><span id="div_target_0">Data2
</span></div>
<div class="a4 left" style="width:32px;"><span id="reg_target_0">Data3
</span></div>
</div>
"""
My ultimate goal is to be able to print a list ["Text", "Data1", "Data2"] for each entry. But right now I am having trouble getting Python and urllib to produce any text between the spans. Here is what I am running:
import urllib
from bs4 import BeautifulSoup

url = 'http://target.com'
html = urllib.urlopen(url).read()
soup = BeautifulSoup(html, "lxml")

Search_List = [0, 4, 5]  # list of Target IDs to scrape
for i in Search_List:
    h = str(i)
    root = 'target_' + h
    taggr = soup.find("span", {"id": root})
    print taggr, ", ", taggr.text
When I use urllib it produces this:
<span id="target_0"></span>,
<span id="target_4"></span>,
<span id="target_5"></span>,
However, I also downloaded the html file, and when I parse the downloaded file it produces this output (the one that I want):
<span id="target_0">Data1</span>, Data1
<span id="target_4">Data1</span>, Data1
<span id="target_5">Data1</span>, Data1
Can anyone explain to me why urllib doesn't produce the desired outcome?
Use this code:
...
soup = BeautifulSoup(html, 'html.parser')
your_data = list()
for line in soup.findAll('span', attrs={'id': 'target_0'}):
    your_data.append(line.text)
...
Similarly, add all the class attributes you need to extract data from, then write the your_data list to a CSV file. I hope this helps; if it doesn't work out, let me know.
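For the CSV part, a minimal sketch (the file name and the sample span IDs below are illustrative, not from the question):

```python
import csv
from bs4 import BeautifulSoup

# stand-in for the page fragment, using span IDs like those in the question
html = '''<div><span id="target_0">Data1</span>
<span id="div_target_0">Data2</span></div>'''
soup = BeautifulSoup(html, 'html.parser')

your_data = [span.text for span in soup.find_all('span')]
with open('output.csv', 'w', newline='') as f:
    csv.writer(f).writerow(your_data)
```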
You can use the following approach to create your lists based on the source HTML you have shown:
from bs4 import BeautifulSoup
data = """
<div class="grouping">
<div class="a1 left" style="width:20px;">Text0</div>
<div class="a2 left" style="width:30px;"><span id="target_0">Data1</span></div>
<div class="a3 left" style="width:45px;"><span id="div_target_0">Data2</span></div>
<div class="a4 left" style="width:32px;"><span id="reg_target_0">Data3</span></div>
</div>
<div class="grouping">
<div class="a1 left" style="width:20px;">Text2</div>
<div class="a2 left" style="width:30px;"><span id="target_2">Data1</span></div>
<div class="a3 left" style="width:45px;"><span id="div_target_0">Data2</span></div>
<div class="a4 left" style="width:32px;"><span id="reg_target_0">Data3</span></div>
</div>
<div class="grouping">
<div class="a1 left" style="width:20px;">Text4</div>
<div class="a2 left" style="width:30px;"><span id="target_4">Data1</span></div>
<div class="a3 left" style="width:45px;"><span id="div_target_0">Data2</span></div>
<div class="a4 left" style="width:32px;"><span id="reg_target_0">Data3</span></div>
</div>
"""
soup = BeautifulSoup(data, "lxml")

search_ids = [0, 4, 5]  # list of Target IDs to scrape
for i in search_ids:
    span = soup.find("span", id='target_{}'.format(i))
    if span:
        grouping = span.parent.parent
        print list(grouping.stripped_strings)[:-1]  # [:-1] to remove "Data3"
The example has been slightly modified to show it finding IDs 0 and 4. This would display the following output:
[u'Text0', u'Data1', u'Data2']
[u'Text4', u'Data1', u'Data2']
Note: if the HTML you get back from your URL is different from what you see when viewing the source in your browser (i.e. the data you want is missing completely), then you will need a solution such as Selenium to drive a real browser and extract the rendered HTML. In that case the data is probably being generated locally via JavaScript, and urllib does not have a JavaScript processor.
I have a soup in Python like this:
<p>
<span style="text-decoration: underline; color: #3366ff;">
Title:
</span>
Info
</p>
<p>
<span style="color: #3366ff;">
<span style="text-decoration: underline;">
Title2:
</span>
</span>
Info2
</p>
I'd like to get it to look like this:
<p>
Title:
Info
</p>
<p>
Title2:
Info2
</p>
Is there a way to do this with bs4?
You'll want to use BeautifulSoup's unwrap() for this.
import bs4

soup1 = bs4.BeautifulSoup(htm1, 'html.parser')
for match in soup1.findAll('span'):
    match.unwrap()
print soup1
You can also use replace_with to remove span tags:
from bs4 import BeautifulSoup
soup = BeautifulSoup(html, 'html.parser')
for span_tag in soup.findAll('span'):
    span_tag.replace_with('')
print(soup)
I wrote this function, in case it helps:
def deleteBalise(string):
    for i in range(2):
        # identify the first '<'
        rankBegin = 0
        for carac in string:
            if carac == '<':
                break
            rankBegin += 1
        # identify the first '>'
        rankEnd = 0
        for carac in string:
            if carac == '>':
                break
            rankEnd += 1
        stringToReplace = string[rankBegin:rankEnd + 1]
        string = string.replace(stringToReplace, '')
    return string
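For comparison, a regex-based sketch of the same idea that strips every tag rather than just the first two. This is still a naive approach (it will mangle text that legitimately contains '<'), but it is shorter and handles any number of tags:

```python
import re

def strip_tags(s):
    # remove anything between '<' and the next '>'
    return re.sub(r'<[^>]*>', '', s)

print(strip_tags('<span class="a">hello</span>'))  # hello
```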