I am trying to extract information from a repeating set of rows containing many embedded <div>s. I am writing a scraper to pull various elements from this page, but for some reason I can't find a way to get to the tag with the class that contains the information for each row, and I am not able to isolate the sections I need to extract the information from. For reference, here is a sample of one row:
<div id="dTeamEventResults" class="col-md-12 team-event-results"><div>
<div class="row team-event-result team-result">
<div class="col-md-12 main-info">
<div class="row">
<div class="col-md-7 event-name">
<dl>
<dt>Team Number:</dt>
<dd>11733</dd>
<dt>Team:</dt>
<dd> Aqua Duckies</dd>
<dt>Program:</dt>
<dd>FIRST LEGO League Jr.</dd>
</dl>
</div>
The script I have started to build looks like the following:
from urllib2 import urlopen as uReq
from bs4 import BeautifulSoup as soup
my_url = 'https://www.firstinspires.org/team-event-search#type=teams&sort=name&keyword=NJ&programs=FLLJR,FLL,FTC,FRC&year=2017'
uClient = uReq(my_url)
page_html = uClient.read()
uClient.close()
page_soup = soup(page_html, "html.parser")
rows = page_soup.findAll("div", {"class":"row team-event-result team-result"})
Whenever I run len(rows), it always returns 0. I seem to have hit a wall and am having trouble. Thanks for your help!
The content of this page is generated dynamically, so to capture it you need a browser simulator like Selenium. Here is a script that will fetch your desired content. Give it a shot:
from bs4 import BeautifulSoup
from selenium import webdriver

driver = webdriver.Chrome()
driver.get('https://www.firstinspires.org/team-event-search#type=teams&sort=name&keyword=NJ&programs=FLLJR,FLL,FTC,FRC&year=2017')
soup = BeautifulSoup(driver.page_source, "lxml")

# each '.main-info' block is one result row; pair up its <dt> labels and <dd> values
for items in soup.select('.main-info'):
    docs = ' '.join([' '.join([item.text, ' '.join(val.text.split())]) for item, val in zip(items.select(".event-name dt"), items.select(".event-name dd"))])
    location = ' '.join([' '.join(item.text.split()) for item in items.select(".event-location-type address")])
    print("Event_Info: {}\nEvent_Location: {}\n".format(docs, location))

driver.quit()
The results look something like:
Event_Info: Team Number: 11733 Team: Aqua Duckies Program: FIRST LEGO League Jr.
Event_Location: Sparta, NJ 07871 USA
Event_Info: Team Number: 4281 Team: Bulldogs Program: FIRST Robotics Competition
Event_Location: Somerset, NJ 08873 USA
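One caveat with this approach: driver.page_source can be read before the JavaScript has finished rendering. A minimal sketch of an explicit wait, assuming the same '.main-info' selector as above:
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC

driver = webdriver.Chrome()
driver.get('https://www.firstinspires.org/team-event-search#type=teams&sort=name&keyword=NJ&programs=FLLJR,FLL,FTC,FRC&year=2017')
# block for up to 10 seconds until at least one result block has rendered
WebDriverWait(driver, 10).until(
    EC.presence_of_element_located((By.CSS_SELECTOR, '.main-info')))
html = driver.page_source
driver.quit()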
This looks like an issue with multi-class tags. I believe this question might help you figure out the solution.
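For what it's worth, here is a minimal sketch of how bs4 treats multi-class attributes; matching on any single class, or chaining all the classes in a CSS selector, both find the row:
from bs4 import BeautifulSoup

html = '<div class="row team-event-result team-result">hit</div>'
soup = BeautifulSoup(html, 'html.parser')
# class_ with a single name matches any tag carrying that class
print(len(soup.find_all('div', class_='team-result')))  # 1
# a CSS selector can require all three classes at once
print(len(soup.select('div.row.team-event-result.team-result')))  # 1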
You can search specifically for dt and dd, the tags containing the target data:
from bs4 import BeautifulSoup as soup
from urllib2 import urlopen as uReq
import re

data = str(uReq('https://www.firstinspires.org/team-event-search#type=teams&sort=name&keyword=NJ&programs=FLLJR,FLL,FTC,FRC&year=2017').read())
s = soup(data, 'lxml')

# take the <dt> labels of the first <dl>, dropping each trailing ':'
headers = map(lambda x: x[:-1], [[b.text for b in i.find_all('dt')] for i in s.find_all('dl')][0])
# collect the <dd> values of every <dl>, collapsing runs of whitespace
data = [[re.sub(r'\s{2,}', '', b.text) for b in i.find_all('dd')] for i in s.find_all('dl')]
print(data)

final_data = [dict(zip(headers, i)) for i in data]
print(final_data)
When running this code on your example above, the output is:
[[u'11733', u' Aqua Duckies', u'FIRST LEGO League Jr.']]
[{u'Program': u'FIRST LEGO League Jr.', u'Team Number': u'11733', u'Team': u' Aqua Duckies'}]
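Note that this answer is Python 2 (urllib2, u'' strings). A rough Python 3 equivalent of the same idea, bearing in mind that the live page is rendered dynamically (see the Selenium answer above), so it only works on HTML that already contains the <dl> rows:
from urllib.request import urlopen
from bs4 import BeautifulSoup
import re

data = urlopen('https://www.firstinspires.org/team-event-search#type=teams&sort=name&keyword=NJ&programs=FLLJR,FLL,FTC,FRC&year=2017').read()
s = BeautifulSoup(data, 'lxml')
# <dt> labels (minus the trailing ':') keyed to whitespace-cleaned <dd> values
headers = [dt.text[:-1] for dt in s.find_all('dl')[0].find_all('dt')]
rows = [[re.sub(r'\s{2,}', '', dd.text) for dd in dl.find_all('dd')] for dl in s.find_all('dl')]
print([dict(zip(headers, row)) for row in rows])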
Pretty simple code:
import requests
from bs4 import BeautifulSoup
link = 'https://www.birdsnest.com.au/brands/boho-bird/73067-amore-wrap-dress'
page = requests.get(link)
soup = BeautifulSoup(page.content, 'html.parser')
page_new = soup.find('div', class_='model-info clearfix')
results = page_new.find_all('p')
for result in results:
    print(result.text)
Output:
usually wears a size .
She is wearing a size in this style.
Her height is .
Show ’s body measurements
The problem is that the model's name is inside <strong> tags, with a <span> inside each <strong> tag.
Like so.
<div class="model-info-header">
<p>
<strong><span class="model-info__name">Marnee</span></strong> usually wears a size <strong><span class="model-info__standard-size">8</span></strong>.
She is wearing a size <strong><span class="model-info__wears-size">10</span></strong> in this style.
</p>
<p class="model-info-header__height">Her height is <strong><span class="model-info__height">178 cm</span></strong>.</p>
<p>
<span class="js-model-info-more model-info__link model-info-header__more">Show <span class="model-info__name">Marnee</span>’s body measurements</span>
</p>
</div>
How do I get the bold elements inside the <p> tags?
The model name is generated dynamically. Try this:
from bs4 import BeautifulSoup
from selenium import webdriver
import time

link = 'https://www.birdsnest.com.au/brands/boho-bird/73067-amore-wrap-dress'
driver = webdriver.Chrome()
driver.get(link)
time.sleep(3)  # give the JavaScript a moment to fill in the model details
soup = BeautifulSoup(driver.page_source, 'html.parser')
driver.close()

page_new = soup.find('div', class_='model-info clearfix')
results = page_new.find_all('p')
for result in results:
    print(result.text)
Output:
Marnee usually wears a size 8.
She is wearing a size 10 in this style.
Her height is 178 cm.
Show Marnee’s body measurements
Marnee’s body measurements are:
Bust 81 cm
Waist 64 cm
Hips 89 cm
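If you only need the bold values rather than the whole paragraphs, you could also target the spans by the classes shown in the question's markup; a sketch, reusing the soup object from the answer above:
# pull just the values inside the <strong><span> pairs, by class
name = soup.select_one('.model-info__name').text            # 'Marnee'
size = soup.select_one('.model-info__standard-size').text   # '8'
wears = soup.select_one('.model-info__wears-size').text     # '10'
height = soup.select_one('.model-info__height').text        # '178 cm'
print(name, size, wears, height)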
I'm scraping a page and found that with my XPath and regex methods I can't seem to get to a set of values that are within a div class.
I have tried the method described in How to get all the li tag within div tag, and then the current logic shown below from my file.
#PRODUCT ATTRIBUTES (STYLE, SKU, BRAND) need to figure out how to loop thru a class and pull out the 2 list tags
prodattr = re.compile(r'<div class=\"pdp-desc-attr spec-prod-attr\">([^<]+)</div>', re.IGNORECASE)
prodattrmatches = re.findall(prodattr, html)
for m in prodattrmatches:
    m = re.compile(r'<li class=\"last last-item\">([^<]+)</li>', re.IGNORECASE)
    stymatches = re.findall(m, html)
#STYLE
sty = re.compile(r'<li class=\"last last-item\">([^<]+)</li>', re.IGNORECASE)
stymatches = re.findall(sty, html)
#BRAND
brd = re.compile(r'<li class=\"first first-item\">([^<]+)</li>', re.IGNORECASE)
brdmatches = re.findall(brd, html)
The above is the current code, which is NOT working; everything comes back empty. For the purpose of my testing I'm merely writing the data, if any, out with print so I can see it on the console.
itmDetails2 = dets['sku'] +","+ dets['description']+","+ dets['price']+","+ dets['brand']
And in the console I get the following, which is what I expect; the generic messages are just placeholders until I get this logic figured out.
SKUE GOES HERE,adidas Women's Essentials Tricot Track Jacket,34.97, BRAND GOES HERE
Here is the HTML section I am trying to parse:
<div class="pdp-desc-attr spec-prod-attr">
<ul class="prod-attr-list">
<li class="first first-item">Brand: adidas</li>
<li>Country of Origin: Imported</li>
<li class="last last-item">Style: F18AAW400D</li>
</ul>
</div>
Do not use Regex to parse HTML
There are better and safer ways to do this.
Take a look at this code, which uses Parsel and BeautifulSoup to extract the li tags from your sample:
from parsel import Selector
from bs4 import BeautifulSoup

html = ('<div class="pdp-desc-attr spec-prod-attr">'
        '<ul class="prod-attr-list">'
        '<li class="first first-item">Brand: adidas</li>'
        '<li>Country of Origin: Imported</li>'
        '<li class="last last-item">Style: F18AAW400D</li>'
        '</ul>'
        '</div>')

# Using parsel
sel = Selector(text=html)
for li in sel.xpath('//li'):
    print(li.xpath('./text()').get())

# Using BeautifulSoup
soup = BeautifulSoup(html, "html.parser")
for li in soup.find_all('li'):
    print(li.text)
Output:
Brand: adidas
Country of Origin: Imported
Style: F18AAW400D
Brand: adidas
Country of Origin: Imported
Style: F18AAW400D
I would use an HTML parser and target the class of the ul. This uses bs4 4.7.1, which is required for the :has() selector:
from bs4 import BeautifulSoup as bs
html = '''
<div class="pdp-desc-attr spec-prod-attr">
<ul class="prod-attr-list">
<li class="first first-item">Brand: adidas</li>
<li>Country of Origin: Imported</li>
<li class="last last-item">Style: F18AAW400D</li>
</ul>
</div>
'''
soup = bs(html, 'lxml')
for item in soup.select('.prod-attr-list:has(> li)'):
    print([sub_item.text for sub_item in item.select('li')])
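Since the end goal is named attributes (brand, style, and so on), you could go one step further and split each li on its first colon; a sketch built on the same soup object:
# turn 'Brand: adidas' style strings into a dict of attribute -> value
attrs = {}
for li in soup.select('.prod-attr-list li'):
    key, _, value = li.text.partition(':')
    attrs[key.strip()] = value.strip()
print(attrs)  # {'Brand': 'adidas', 'Country of Origin': 'Imported', 'Style': 'F18AAW400D'}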
My task is to automate printing Wikipedia infobox data. As an example, I am scraping the Star Trek Wikipedia page (https://en.wikipedia.org/wiki/Star_Trek) to extract the infobox section from the right-hand side and print it row by row on screen using Python. I specifically want the infobox. So far I have done this:
from bs4 import BeautifulSoup
import urllib.request
# specify the url
urlpage = 'https://en.wikipedia.org/wiki/Star_Trek'
# query the website and return the html to the variable 'page'
page = urllib.request.urlopen(urlpage)
# parse the html using beautiful soup and store in variable 'soup'
soup = BeautifulSoup(page, 'html.parser')
# find results within table
table = soup.find('table', attrs={'class': 'infobox vevent'})
results = table.find_all('tr')
print(type(results))
print('Number of results', len(results))
print(results)
This gives me everything from the info box. A snippet is shown below:
[<tr><th class="summary" colspan="2" style="text-align:center;font-
size:125%;font-weight:bold;font-style: italic; background: lavender;">
<i>Star Trek</i></th></tr>, <tr><td colspan="2" style="text-align:center">
<a class="image" href="/wiki/File:Star_Trek_TOS_logo.svg"><img alt="Star
Trek TOS logo.svg" data-file-height="132" data-file-width="560" height="59"
I want to extract only the data and print it on screen. So what I want is:
Created by Gene Roddenberry
Original work Star Trek: The Original Series
Print publications
Book(s)
List of reference books
List of technical manuals
Novel(s) List of novels
Comics List of comics
Magazine(s)
Star Trek: The Magazine
Star Trek Magazine
And so on till the end of the infobox. So basically I want a way of printing every row of the infobox data, so I can automate it for any wiki page. (The class of the infobox table on all wiki pages is 'infobox vevent', as shown in the code.)
This page should help you parse your HTML as a simple string, without the HTML tags: Using BeautifulSoup Extract Text without Tags.
This is code from that page; it belongs to #0605002:
>>> html = """
<p>
<strong class="offender">YOB:</strong> 1987<br />
<strong class="offender">RACE:</strong> WHITE<br />
<strong class="offender">GENDER:</strong> FEMALE<br />
<strong class="offender">HEIGHT:</strong> 5'05''<br />
<strong class="offender">WEIGHT:</strong> 118<br />
<strong class="offender">EYE COLOR:</strong> GREEN<br />
<strong class="offender">HAIR COLOR:</strong> BROWN<br />
</p>
"""
>>> from bs4 import BeautifulSoup
>>> soup = BeautifulSoup(html)
>>> print soup.text
YOB: 1987
RACE: WHITE
GENDER: FEMALE
HEIGHT: 5'05''
WEIGHT: 118
EYE COLOR: GREEN
HAIR COLOR: BROWN
With BeautifulSoup, you need to reformat the data as you want. Use fresult = [e.text for e in results] to get the text of each result.
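Applied to the results list of <tr> tags from the question's code, a minimal sketch that prints each infobox row as plain text would be:
# `results` is the list of <tr> tags found in the question's code
for row in results:
    cells = row.find_all(['th', 'td'])
    text = ' '.join(cell.get_text(' ', strip=True) for cell in cells)
    if text:
        print(text)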
If you want to read an HTML table you can try some code like this, though it uses pandas:
import pandas
urlpage = 'https://en.wikipedia.org/wiki/Star_Trek'
data = pandas.read_html(urlpage)[0]
null = data.isnull()
for x in range(len(data)):
    first = data.iloc[x][0]
    second = data.iloc[x][1] if not null.iloc[x][1] else ""
    print(first, second, "\n")
I'm trying to make a crawler using bs4, and this is the target page I'm going to crawl data from:
http://sports.news.naver.com/wfootball/news/index.nhn?page=1
And this is the data I want to get:
<html>~~
<head>...</head>
<body>
<some several layer...>
<div class="news_list" id="_newsList">
<h3 class="blind"> ~ </h3>
<ul>
<li>
<div class="text">
<a href="~~~~">
<span>"**targetData**"</span>
</li>
<li>
same structure
</li>
<li>...</li>
several <li>...</li>
</ul>
</div>
</layers...>
</body>
</html>
And this is my code:
#-*- coding: utf-8 -*-
import urllib
from bs4 import BeautifulSoup
targettrl = 'http://sports.news.naver.com/wfootball/news/index.nhn?page=1'
soup = BeautifulSoup(urllib.urlopen(targettrl).read(), 'html.parser')
print(soup.find_all("div", {"class":"news_list"}))
And the result:
[]
What should I do?
The content you are after is loaded dynamically with JavaScript, and hence not found in the page source inside the <div> tag you were searching for. But the data is available in the page source inside a <script> tag, in the form of JSON.
You can scrape it using this:
import re
import json
import requests

r = requests.get('http://sports.news.naver.com/wfootball/news/index.nhn?page=1')
script = re.findall('newsListModel:(.*})', r.text)[0]
data = json.loads(script)

for item in data['list']:
    print(item['title'])
    url = 'http://sports.news.naver.com/wfootball/news/read.nhn?oid={}&aid={}'.format(item['oid'], item['aid'])
    print(url)
Output:
‘로마 영웅’ 제코, “첼시 거절? 돈이 중요하지 않아”
http://sports.news.naver.com/wfootball/news/read.nhn?oid=139&aid=0002089601
손흥민, 맨시티전 선발 전망...'시즌 19호' 기대(英 매체)
http://sports.news.naver.com/wfootball/news/read.nhn?oid=139&aid=0002089600
드디어 정해진 UCL 4강 진출팀, 4강 대진은 과연?
http://sports.news.naver.com/wfootball/news/read.nhn?oid=047&aid=0002185662
극적 승리 후 ‘유물급 분수대’에 뛰어든 로마 구단주 벌금형
http://sports.news.naver.com/wfootball/news/read.nhn?oid=081&aid=0002907313
'토너먼트의 남자' 호날두, 팀 득점 6할 책임지다
http://sports.news.naver.com/wfootball/news/read.nhn?oid=216&aid=0000094120
맨유 스카우트 파견...타깃은 기성용 동료 수비수 모슨
http://sports.news.naver.com/wfootball/news/read.nhn?oid=139&aid=0002089598
이번에는 '타점', 늘 새로운 호날두의 키워드
http://sports.news.naver.com/wfootball/news/read.nhn?oid=413&aid=0000064740
승무패 14회차, 토트넘 홈에서 맨시티 이길 것
http://sports.news.naver.com/wfootball/news/read.nhn?oid=396&aid=0000477162
3부까지 추락했던 울버햄튼, 6년 만에 EPL 승격 눈앞
http://sports.news.naver.com/wfootball/news/read.nhn?oid=413&aid=0000064739
메시 밀어낸 호날두…역대 챔스리그 한 시즌 최다골 1∼3위 독식(종합)
http://sports.news.naver.com/wfootball/news/read.nhn?oid=001&aid=0010020684
‘에어 호날두’ 있기에···레알 8년 연속 챔스 4강
http://sports.news.naver.com/wfootball/news/read.nhn?oid=011&aid=0003268539
[UEL] '황희찬 복귀' 잘츠부르크, 역전 드라마 가능할까?
http://sports.news.naver.com/wfootball/news/read.nhn?oid=436&aid=0000028419
[UCL 포커스] 호날두 환상골도, 만주키치 저력도, 부폰에 묻혔다
http://sports.news.naver.com/wfootball/news/read.nhn?oid=139&aid=0002089597
[UCL 핫피플] ‘120골+11G 연속골’ 호날두, 역사는 진행형
http://sports.news.naver.com/wfootball/news/read.nhn?oid=139&aid=0002089596
[SPO 이슈] ‘재점화’ 케인vs살라, 득점왕 경쟁 포인트3
http://sports.news.naver.com/wfootball/news/read.nhn?oid=477&aid=0000118254
UCL 8강 키워드: 공은 둥글다…'3점 차'여도 안심할 수 없었으니까
http://sports.news.naver.com/wfootball/news/read.nhn?oid=477&aid=0000118253
"과르디올라, 시즌 종료 전 1년 계약 연장한다" (英 미러)
http://sports.news.naver.com/wfootball/news/read.nhn?oid=529&aid=0000022390
케이토토 승무패 14회차 투표율 중간집계
http://sports.news.naver.com/wfootball/news/read.nhn?oid=382&aid=0000638196
‘월드컵 스카우팅 리포트 2018’ 발간
http://sports.news.naver.com/wfootball/news/read.nhn?oid=005&aid=0001088317
레알 마드리드, 천신만고 끝에 챔피언스리그 4강
http://sports.news.naver.com/wfootball/news/read.nhn?oid=052&aid=0001134496
Here, item is a dictionary with the following format:
{'aid': '0002185662',
'datetime': '2018.04.12 14:09',
'officeName': '오마이뉴스',
'oid': '047',
'sectionName': '챔스·유로파',
'subContent': '최고의 명경기 예약, 추첨은 4월 13일 오후 7시[오마이뉴스 이윤파 기자]이변이 가득했던 17-18 UEFA 챔피언스리그 8강이 모두 끝나고 4강 진출 팀...',
'thumbnail': 'http://imgnews.naver.net/image/thumb154/047/2018/04/12/2185662.jpg',
'title': '드디어 정해진 UCL 4강 진출팀, 4강 대진은 과연?',
'totalCount': 0,
'type': 'PHOTO',
'url': None}
So, you can access anything you want using this. Just replace ['title'] with whatever you want. Everything inside each of the <li> tags is available in the dictionaries.
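For example, to print the timestamp and outlet alongside each title:
# every key shown in the dictionary above is available on each item
for item in data['list']:
    print(item['datetime'], item['officeName'], item['title'])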
This will work:
import requests
from bs4 import BeautifulSoup
targettrl = 'http://sports.news.naver.com/wfootball/news/index.nhn?page=1'
soup = BeautifulSoup(requests.get(targettrl).content, 'html.parser')
print(soup.find_all("div", class_="news_list"))
Try
soup = BeautifulSoup(urllib.urlopen(targettrl).read(), 'lxml')
instead of
soup = BeautifulSoup(urllib.urlopen(targettrl).read(), 'html.parser')
I tried with both your code and #KeyurPotdar's, and it worked for me.
I am using the following code to write to a csv file.
import urllib2
from BeautifulSoup import BeautifulSoup
import csv
import re
page = urllib2.urlopen('http://finance.yahoo.com/q/ks?s=F%20Key%20Statistics').read()
f = csv.writer(open("pe_ratio.csv","wb"))
f.writerow(["Name","PE","Revenue % YOY","ROA% YOY","OCF Positive","Debt - Equity"])
soup = BeautifulSoup(page)
all_data = soup.findAll('td', "yfnc_tabledata1")
f.writerow(('Ford', all_data[2].getText()))
name_company = soup.findAll("div", {"class" : "title"})
# find all h2
#print soup.prettify
#h2 div class="title"
print name_company
I have found what I want to put in the csv file, but now I need to limit it to just "Ford Motor Co. (F)". When I print name_company I get this:
[<div class="title"><h2>Ford Motor Co. (F)</h2> <span class="rtq_exch"> <span class="rtq_dash">-</span>NYSE </span><span class="wl_sign"></span></div>]
I have tried using name_company.next and name_company.content[0]. What would work? name_company uses findall and I don't know if that makes .content and .next null. Thanks for your help in advance.
Use find() to get the next <h2> tag and use .string to read its text node.
name_company = soup.findAll("div", {"class" : "title"})
for name in name_company:
    print name.find('h2').string
UPDATE: See comments.
for name in name_company:
    ford = name.find('h2').string
    f.writerow([ford, all_data[2].getText()])
It yields:
Name,PE,Revenue % YOY,ROA% YOY,OCF Positive,Debt - Equity
Ford Motor Co. (F),11.23
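As a final housekeeping note, the script opens pe_ratio.csv without ever closing it, so buffered rows can be lost. A sketch of the same writes wrapped in a with block, assuming name_company and all_data are built exactly as in the question:
import csv

with open("pe_ratio.csv", "wb") as out_file:
    f = csv.writer(out_file)
    f.writerow(["Name", "PE", "Revenue % YOY", "ROA% YOY", "OCF Positive", "Debt - Equity"])
    # name_company and all_data come from the question's scraping code
    for name in name_company:
        f.writerow([name.find('h2').string, all_data[2].getText()])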