BeautifulSoup crawler parsing - python

I'm trying to make a crawler using bs4,
and this is the target page I'm going to crawl data from:
http://sports.news.naver.com/wfootball/news/index.nhn?page=1
And this is the data I want to get:
<html>~~
<head>...</head>
<body>
<some several layer...>
<div class="news_list" id="_newsList">
<h3 class="blind"> ~ </h3>
<ul>
<li>
<div class="text">
<a href="~~~~">
<span>"**targetData**"</span>
</li>
<li>
same structure
</li>
<li>...</li>
several <li>...</li>
</ul>
</div>
</layers...>
</body>
</html>
And this is my code:
#-*- coding: utf-8 -*-
import urllib
from bs4 import BeautifulSoup
targettrl = 'http://sports.news.naver.com/wfootball/news/index.nhn?page=1'
soup = BeautifulSoup(urllib.urlopen(targettrl).read(), 'html.parser')
print(soup.find_all("div", {"class":"news_list"}))
And the result:
[]
What should I do?

The content you are after is loaded dynamically with JavaScript, and hence is not found in the page source inside the <div> tag you were searching for. But the data is available in the page source inside a <script> tag, in the form of JSON.
You can scrape it using this:
import re
import json
import requests

r = requests.get('http://sports.news.naver.com/wfootball/news/index.nhn?page=1')
# The article list is embedded in the page's JavaScript as "newsListModel: {...}"
script = re.findall('newsListModel:(.*})', r.text)[0]
data = json.loads(script)

for item in data['list']:
    print(item['title'])
    url = 'http://sports.news.naver.com/wfootball/news/read.nhn?oid={}&aid={}'.format(item['oid'], item['aid'])
    print(url)
Output:
‘로마 영웅’ 제코, “첼시 거절? 돈이 중요하지 않아”
http://sports.news.naver.com/wfootball/news/read.nhn?oid=139&aid=0002089601
손흥민, 맨시티전 선발 전망...'시즌 19호' 기대(英 매체)
http://sports.news.naver.com/wfootball/news/read.nhn?oid=139&aid=0002089600
드디어 정해진 UCL 4강 진출팀, 4강 대진은 과연?
http://sports.news.naver.com/wfootball/news/read.nhn?oid=047&aid=0002185662
극적 승리 후 ‘유물급 분수대’에 뛰어든 로마 구단주 벌금형
http://sports.news.naver.com/wfootball/news/read.nhn?oid=081&aid=0002907313
'토너먼트의 남자' 호날두, 팀 득점 6할 책임지다
http://sports.news.naver.com/wfootball/news/read.nhn?oid=216&aid=0000094120
맨유 스카우트 파견...타깃은 기성용 동료 수비수 모슨
http://sports.news.naver.com/wfootball/news/read.nhn?oid=139&aid=0002089598
이번에는 '타점', 늘 새로운 호날두의 키워드
http://sports.news.naver.com/wfootball/news/read.nhn?oid=413&aid=0000064740
승무패 14회차, 토트넘 홈에서 맨시티 이길 것
http://sports.news.naver.com/wfootball/news/read.nhn?oid=396&aid=0000477162
3부까지 추락했던 울버햄튼, 6년 만에 EPL 승격 눈앞
http://sports.news.naver.com/wfootball/news/read.nhn?oid=413&aid=0000064739
메시 밀어낸 호날두…역대 챔스리그 한 시즌 최다골 1∼3위 독식(종합)
http://sports.news.naver.com/wfootball/news/read.nhn?oid=001&aid=0010020684
‘에어 호날두’ 있기에···레알 8년 연속 챔스 4강
http://sports.news.naver.com/wfootball/news/read.nhn?oid=011&aid=0003268539
[UEL] '황희찬 복귀' 잘츠부르크, 역전 드라마 가능할까?
http://sports.news.naver.com/wfootball/news/read.nhn?oid=436&aid=0000028419
[UCL 포커스] 호날두 환상골도, 만주키치 저력도, 부폰에 묻혔다
http://sports.news.naver.com/wfootball/news/read.nhn?oid=139&aid=0002089597
[UCL 핫피플] ‘120골+11G 연속골’ 호날두, 역사는 진행형
http://sports.news.naver.com/wfootball/news/read.nhn?oid=139&aid=0002089596
[SPO 이슈] ‘재점화’ 케인vs살라, 득점왕 경쟁 포인트3
http://sports.news.naver.com/wfootball/news/read.nhn?oid=477&aid=0000118254
UCL 8강 키워드: 공은 둥글다…'3점 차'여도 안심할 수 없었으니까
http://sports.news.naver.com/wfootball/news/read.nhn?oid=477&aid=0000118253
"과르디올라, 시즌 종료 전 1년 계약 연장한다" (英 미러)
http://sports.news.naver.com/wfootball/news/read.nhn?oid=529&aid=0000022390
케이토토 승무패 14회차 투표율 중간집계
http://sports.news.naver.com/wfootball/news/read.nhn?oid=382&aid=0000638196
‘월드컵 스카우팅 리포트 2018’ 발간
http://sports.news.naver.com/wfootball/news/read.nhn?oid=005&aid=0001088317
레알 마드리드, 천신만고 끝에 챔피언스리그 4강
http://sports.news.naver.com/wfootball/news/read.nhn?oid=052&aid=0001134496
Here, item is a dictionary with the following format:
{'aid': '0002185662',
 'datetime': '2018.04.12 14:09',
 'officeName': '오마이뉴스',
 'oid': '047',
 'sectionName': '챔스·유로파',
 'subContent': '최고의 명경기 예약, 추첨은 4월 13일 오후 7시[오마이뉴스 이윤파 기자]이변이 가득했던 17-18 UEFA 챔피언스리그 8강이 모두 끝나고 4강 진출 팀...',
 'thumbnail': 'http://imgnews.naver.net/image/thumb154/047/2018/04/12/2185662.jpg',
 'title': '드디어 정해진 UCL 4강 진출팀, 4강 대진은 과연?',
 'totalCount': 0,
 'type': 'PHOTO',
 'url': None}
So, you can access anything you want using this. Just replace ['title'] with whatever you want. Everything inside each of the <li> tags is available in the dictionaries.
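For instance, a small sketch that also prints the timestamp and publisher for each article (keys taken from the item format shown above):
for item in data['list']:
    print('{} | {} | {}'.format(item['datetime'], item['officeName'], item['title']))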

This will work.
import requests
from bs4 import BeautifulSoup
targettrl = 'http://sports.news.naver.com/wfootball/news/index.nhn?page=1'
soup = BeautifulSoup(requests.get(targettrl).content, 'html.parser')
print(soup.find_all("div", class_="news_list"))

Try
soup = BeautifulSoup(urllib.urlopen(targettrl).read(), 'lxml')
instead of
soup = BeautifulSoup(urllib.urlopen(targettrl).read(), 'html.parser')
I tried it with your code and @KeyurPotdar's approach, and it worked for me.
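Note that lxml is a third-party parser and may not be installed everywhere. A minimal sketch that falls back to the stdlib parser when it is missing (assuming the same urllib/targettrl setup as above):
from bs4 import BeautifulSoup

try:
    import lxml  # noqa: F401
    parser = 'lxml'
except ImportError:
    parser = 'html.parser'  # stdlib fallback

soup = BeautifulSoup(urllib.urlopen(targettrl).read(), parser)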

Related

scrape a sub attribute? with bs4 in python

I'm trying to scrape the IDs on a website, but I can't figure out how to specify the entry I want to work with. The most I could narrow it down to is a specific class, but I'm not sure how to target the number under the 'id' key inside the 'data-preview' attribute. Here's what I've narrowed the variable soup down to:
<li class="Li FnPreviewItem" data-preview='{ "type" : "animation", "id" : "288857982", "staticUrl" : "www.website.com/image.png", }'>
<div class="Li Inner FnImage">
<span class="Image" style="background-image:url(www.website.com/image.png);"></span>
</div>
<div class="ImgPreview FnPreviewImage MdNonDisp">
<span class="Image FnPreview" style="background-image:url(www.website.com/image.png);">
</span></div>
</li>
Here is the relevant snippet of what I have so far:
from pathlib import Path
from bs4 import BeautifulSoup
import requests
import re
url = "www.website.com/image.png"
r = requests.get(url)
soup = BeautifulSoup(r.content, 'html.parser')
elsoupo = soup.find(attrs={"class": "a fancy title for this class"})
print(elsoupo)
I just started working with Python, so hopefully I'm wording this so it makes some sense.
I tried to narrow it down with a second attribute that could match any number, but I just got None back:
elsoupoNum = elsoupo.find(attrs={"id":"^[-+]?[0-9]+$"})
print(elsoupoNum)
data-preview is an attribute of the <li> element with an (ill-formed) JSON string as its value; note the trailing comma before the closing brace. I corrected it here for simplicity; you may want to handle this in your real data.
code
from bs4 import BeautifulSoup
import json
html_doc = '''
<li class="Li FnPreviewItem" data-preview='{ "type" : "animation", "id" : "288857982", "staticUrl" : "www.website.com/image.png" }'>
<div class="Li Inner FnImage">
<span class="Image" style="background-image:url(www.website.com/image.png);"></span>
</div>
<div class="ImgPreview FnPreviewImage MdNonDisp">
<span class="Image FnPreview" style="background-image:url(www.website.com/image.png);">
</span></div>
</li>
'''
soup = BeautifulSoup(html_doc, 'html.parser')
li = soup.select_one('li[data-preview]')
data = li.attrs['data-preview']
print(data)
j = json.loads(data)
print(j['id'])
output
{ "type" : "animation", "id" : "288857982", "staticUrl" : "www.website.com/image.png" }
288857982
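If you need this for every preview item on a full page rather than this single snippet, a sketch along these lines (assuming soup is built from the whole page, and that some entries may carry the ill-formed JSON mentioned above):
ids = []
for li in soup.select('li.FnPreviewItem[data-preview]'):
    try:
        ids.append(json.loads(li['data-preview'])['id'])
    except json.JSONDecodeError:
        # skip entries whose embedded JSON is ill-formed (e.g. a trailing comma)
        continue
print(ids)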

Web scrape second number between tags

I am new to Python and have never worked with HTML, so any help would be appreciated.
I need to extract two numbers, '1062' and '348', from a website's inspect element.
This is my code:
page = requests.get("https://www.traderscockpit.com/?pageView=live-nse-advance-decline-ratio-chart")
soup = BeautifulSoup(page.content, 'html.parser')
Adv = soup.select_one ('.col-sm-6 .advDec:nth-child(1)').text[10:]
Dec = soup.select_two ('.col-sm-6 .advDec:nth-child(2)').text[10:]
The website element looks like below:
<div class="nifty-header-shade1 col-xs-12 col-sm-6 col-md-3">
<div class="row">
<div class="col-sm-12">
<h4>Stocks</h4>
</div>
<div class="col-sm-6">
<p class="advDec">Advanced: 1062</p>
</div>
<div class="col-sm-6">
<p class="advDec">Declined: 348</p>
</div>
</div>
</div>
Using my code, I am able to extract the first number (1062), but I am unable to extract the second number (348). Can you please help?
Assuming the pattern is always the same, you can select your elements by text and get the next_sibling:
adv = soup.select_one('a:-soup-contains("Advanced:")').next_sibling.strip()
dec = soup.select_one('a:-soup-contains("Declined:")').next_sibling.strip()
Example
import requests
from bs4 import BeautifulSoup
page = requests.get("https://www.traderscockpit.com/?pageView=live-nse-advance-decline-ratio-chart")
soup = BeautifulSoup(page.content, "html.parser")
adv = soup.select_one('a:-soup-contains("Advanced:")').next_sibling.strip()
dec = soup.select_one('a:-soup-contains("Declined:")').next_sibling.strip()
print(adv, dec)
If there are always 2 elements, then the simplest way would probably be to unpack the list of selected elements.
import requests
from bs4 import BeautifulSoup
page = requests.get("https://www.traderscockpit.com/?pageView=live-nse-advance-decline-ratio-chart")
soup = BeautifulSoup(page.content, "html.parser")
adv, dec = [elm.next_sibling.strip() for elm in soup.select(".advDec a")]
print("Advanced:", adv)
print("Declined", dec)

How to parse the drop down list and get the all the links for the pdf using Beautiful Soup in Python?

I'm trying to scrape the PDF links from the drop-down on this website. I want to scrape just the Guideline Values (CVC) drop-down. Following is the code that I used, but it did not succeed:
import requests
from bs4 import BeautifulSoup
req_ses = requests.Session()
igr_get_base_response = req_ses.get("https://igr.karnataka.gov.in/english#")
soup = BeautifulSoup(igr_get_base_response.text)
def matches_block(tag):
    return matches_dropdown(tag) and tag.find(matches_text) != None

def matches_dropdown(tag):
    return tag.name == 'li' and tag.has_attr('class') and 'dropdown-toggle' in tag['class']

def matches_text(tag):
    return tag.name == 'a' and tag.get_text()

for li in soup.find_all(matches_block):
    for ul in li.find_all('ul', class_='dropdown-toggle'):
        for a in ul.find_all('a'):
            if a.has_attr('href'):
                print(a['href'])
Any suggestion would be a great help!
Edit: Adding part of HTML below:
<div class="collapse navbar-collapse">
<ul class="nav navbar-nav">
<li class="">
<i class="fa fa-home"> </i>
</li>
<li>
<a class="dropdown-toggle" data-toggle="dropdown" title="RTI Act">RTI Act <b class="caret"></b></a>
<ul class="dropdown-menu multi-level">
<!-- <li> -->
<li class="">
<a href=" https://igr.karnataka.gov.in/page/RTI+Act/Yadagiri+./en " title="Yadagiri .">Yadagiri .
</a>
</li>
<!-- </li> -->
<!-- <li>
I have tried to get the links of all the PDF files that you need. I selected the <a> tags whose href matches the pattern - see patt in the code - which is common to all the PDF files you need. After the loop runs, you have all the links to the PDF files in the links list.
from bs4 import BeautifulSoup
import requests
url = 'https://igr.karnataka.gov.in/english#'
resp = requests.get(url)
soup = BeautifulSoup(resp.text, 'html.parser')
a = soup.find('a', attrs={'title': 'Guidelines Value (CVC)'})
lst = a.parent()
links = []
patt = 'https://igr.karnataka.gov.in/storage/pdf-files/Guidelines Value/'
for i in lst:
    temp = i.find('a')
    if temp:
        if patt in temp['href']:
            links.append(temp['href'].strip())
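A quick sanity check on the collected links might look like this (a sketch; links comes from the loop above):
print(len(links), 'PDF links found')
for link in links[:3]:
    print(link)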
I first found the ul_tag in which all the data is available, then used the find_all method on it for <a> tags having the attribute target="_blank", since those contain the .pdf hrefs; from these we can extract only the .pdf links.
from bs4 import BeautifulSoup
import requests
res = requests.get("https://igr.karnataka.gov.in/english#")
soup = BeautifulSoup(res.text, "lxml")
ul_tag = soup.find("ul", class_="nav navbar-nav")
a_tag = ul_tag.find_all("a", attrs={"target": "_blank"})
for i in a_tag:
    print(i.get_text(strip=True))
    print(i.get("href").strip())
Output:
SRO Chikkaballapur
https://igr.karnataka.gov.in/storage/pdf-files/Guidelines Value/chikkaballapur sro.pdf
SRO Gudibande
https://igr.karnataka.gov.in/storage/pdf-files/Guidelines Value/gudibande sro.pdf
SRO Shidlaghatta
https://igr.karnataka.gov.in/storage/pdf-files/Guidelines Value/shidlagatta sro.pdf
SRO Bagepalli
....
So, I used the following approach to complete the above-mentioned part:
def make_sqlite_dict_from_parsed_row(district_value, sro_value, pdf_file_link):
    sqlite_dict = {
        "district_value": district_value,
        "sro_value": sro_value,
        "pdf_file_link": pdf_file_link.strip().replace(' ', '%20'),
        "status": "PENDING"
    }
    sqlite_dict['hsh'] = get_hash(sqlite_dict, IGR_SQLITE_HSH_TUP)
    return sqlite_dict

li_element_list = home_response_soup.find_all('li', {'class': 'dropdown-submenu'})
parsed_row_list = []

for ele in li_element_list:
    district_value = ele.find('a', {'class': 'dropdown-toggle'}).get_text().strip()
    sro_pdf_a_tags = ele.find_all('a', attrs={'target': '_blank'})

    if len(sro_pdf_a_tags) >= 1:
        for sro_a_tag in sro_pdf_a_tags:
            sqlite_dict = make_sqlite_dict_from_parsed_row(
                district_value,
                sro_a_tag.get_text(strip=True),
                sro_a_tag.get('href')
            )
            parsed_row_list.append(sqlite_dict)
    else:
        print("District: ", district_value, "'s pdf is corrupted")
This will give a proper pdf_file_link, sro_value and district_value for each row.
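A quick check of the parsed rows could then look like this (a sketch; parsed_row_list comes from the loop above):
print(len(parsed_row_list), 'rows parsed')
for row in parsed_row_list[:2]:
    print(row['district_value'], row['sro_value'], row['pdf_file_link'])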

How do I get this text residing within the tags using Beautiful Soup?

I want to fetch the number 121 from the page, but the soup object that I am getting is not showing the number. This is what I get:
[<div class="open_pln" id="pln_1">
<ul>
<li>
<div class="box_check_txt">
<input id="cp1" name="cp1" onclick="change_plan(2,102,2);" type="checkbox"/>
<label for="cp1"><span class="green"></span></label>
</div>
</li>
<li id="li_open"><span>Desk</span> <br/></li>
<li> </li>
</ul>
</div>]
The number 121 for open offices is not inside the HTML code, but in the JavaScript. You can use a regex to extract it:
import re
import requests
url ='https://www.coworker.com/search/los-angeles/ca/united-states'
htmlpage = requests.get(url).text
open_offices = re.findall(r'var openOffices\s*=\s*(\d+)', htmlpage)[0]
private_offices = re.findall(r'var privateOffices\s*=\s*(\d+)', htmlpage)[0]
print('Open offices: {}'.format(open_offices))
print('Private offices: {}'.format(private_offices))
Prints:
Open offices: 121
Private offices: 40
Without the re module:
import requests
from bs4 import BeautifulSoup
url ='https://www.coworker.com/search/los-angeles/ca/united-states'
res = requests.get(url)
soup = BeautifulSoup(res.text,"lxml")
searchstr = "var openOffices = "
script = soup.select_one(f"script:contains('{searchstr}')").text
print(script.split(searchstr)[1].split(";")[0])
Output:
121
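The same split trick generalizes to both counts if you want them as integers (a sketch, assuming the page keeps the var openOffices = ...; and var privateOffices = ...; pattern):
counts = {}
for name in ('openOffices', 'privateOffices'):
    marker = 'var {} = '.format(name)
    script = soup.select_one("script:contains('{}')".format(marker)).text
    counts[name] = int(script.split(marker)[1].split(';')[0])
print(counts)  # e.g. {'openOffices': 121, 'privateOffices': 40}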
You have to find all the li tags using soup, like this:
attribute = "li"
all_links = soup.find_all(attribute)
for link in all_links:
    print(link.text.strip())

How to loop thru a div class to get access to the li class within?

I'm scraping a page and found that with my XPath and regex methods I can't seem to get to a set of values that are within a div class.
I have tried the method stated on this page:
How to get all the li tag within div tag
and then the current logic shown below that is within my file.
#PRODUCT ATTRIBUTES (STYLE, SKU, BRAND) need to figure out how to loop thru a class and pull out the 2 list tags
prodattr = re.compile(r'<div class=\"pdp-desc-attr spec-prod-attr\">([^<]+)</div>', re.IGNORECASE)
prodattrmatches = re.findall(prodattr, html)
for m in prodattrmatches:
    m = re.compile(r'<li class=\"last last-item\">([^<]+)</li>', re.IGNORECASE)
    stymatches = re.findall(m, html)
#STYLE
sty = re.compile(r'<li class=\"last last-item\">([^<]+)</li>', re.IGNORECASE)
stymatches = re.findall(sty, html)
#BRAND
brd = re.compile(r'<li class=\"first first-item\">([^<]+)</li>', re.IGNORECASE)
brdmatches = re.findall(brd, html)
The above is the current code that is NOT working; everything comes back empty. For the purpose of my testing I'm merely writing the data, if any, out with the print command so I can see it on the console.
itmDetails2 = dets['sku'] + "," + dets['description'] + "," + dets['price'] + "," + dets['brand']
and within the console this is what I get, which is what I expect; the generic messages are just placeholders until I get this logic figured out:
SKUE GOES HERE,adidas Women's Essentials Tricot Track Jacket,34.97, BRAND GOES HERE
<div class="pdp-desc-attr spec-prod-attr">
<ul class="prod-attr-list">
<li class="first first-item">Brand: adidas</li>
<li>Country of Origin: Imported</li>
<li class="last last-item">Style: F18AAW400D</li>
</ul>
</div>
Do not use regex to parse HTML.
There are better and safer ways to do this.
Take a look at this code, which uses Parsel and BeautifulSoup to extract the <li> tags from your sample HTML:
from parsel import Selector
from bs4 import BeautifulSoup
html = ('<div class="pdp-desc-attr spec-prod-attr">'
'<ul class="prod-attr-list">'
'<li class="first first-item">Brand: adidas</li>'
'<li>Country of Origin: Imported</li>'
'<li class="last last-item">Style: F18AAW400D</li>'
'</ul>'
'</div>')
# Using parsel
sel = Selector(text=html)
for li in sel.xpath('//li'):
    print(li.xpath('./text()').get())

# Using BeautifulSoup
soup = BeautifulSoup(html, "html.parser")
for li in soup.find_all('li'):
    print(li.text)
Output:
Brand: adidas
Country of Origin: Imported
Style: F18AAW400D
Brand: adidas
Country of Origin: Imported
Style: F18AAW400D
I would use an HTML parser and look for the class of the ul. Using bs4 4.7.1:
from bs4 import BeautifulSoup as bs
html = '''
<div class="pdp-desc-attr spec-prod-attr">
<ul class="prod-attr-list">
<li class="first first-item">Brand: adidas</li>
<li>Country of Origin: Imported</li>
<li class="last last-item">Style: F18AAW400D</li>
</ul>
</div>
'''
soup = bs(html, 'lxml')
for item in soup.select('.prod-attr-list:has(> li)'):
    print([sub_item.text for sub_item in item.select('li')])
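If you only need the specific fields from the question (brand and style), a sketch using the class names from the sample HTML:
brand = soup.select_one('.prod-attr-list li.first-item')
style = soup.select_one('.prod-attr-list li.last-item')
if brand and style:
    print(brand.text.split(':', 1)[1].strip())  # adidas
    print(style.text.split(':', 1)[1].strip())  # F18AAW400D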
