questions about scraping a review website

I am currently crawling a review site with Beautiful Soup.
The review page contains reviews from different students,
and each student evaluates the school on several aspects.
Therefore, the structure of the page generally looks like this:
student A - title A:
aspect 1
comment on aspect 1
aspect 2
comment on aspect 2
aspect 3
comment on aspect 3
student B - title B:
aspect 1
comment on aspect 1
aspect 2
comment on aspect 2
aspect 4
comment on aspect 4
Some students only commented on particular aspects. Aspects they did not comment on are not shown on the website.
Each review looks like this in the HTML:
<!-- mod-reviewTop -->
<div class="mod-reviewTop">
<!-- mod-reviewTop-inner -->
<div class="mod-reviewTop-inner">
<dl>
<dd>
<div class="mod-reviewTitle" itemprop="summary">
title 1 : It was ok.
</div>
</dd>
</dl>
<!-- /mod-reviewItem -->
</div>
<!-- /mod-reviewTop -->
<!-- mod-reviewBottom -->
<div class="mod-reviewBottom">
<!-- mod-reviewList-list -->
<div class="mod-reviewList-list js-review-detail" itemprop="description">
<!-- js-mod-reviewList-list -->
<div class="js-mod-reviewList-list">
<ul>
<li>
<div class="mod-reviewTitle3">
Total Evaluation
</div>
<div class="mod-reviewList-txt">
We can freely choose the course we want, and thus a lot of different knowledge can be learned.
</div>
</li>
<li>
<div class="mod-reviewTitle3">
Course
</div>
<div class="mod-reviewList-txt">
the courses are good.
</div>
</li>
<li>
<div class="mod-reviewTitle3">
Lab
</div>
<div class="mod-reviewList-txt">
we don’t join lab in the first 2 year.
</div>
</li>
</ul>
</div>
<!-- /js-mod-reviewList-list -->
</div>
<!-- /mod-reviewList-list -->
</div>
<!-- /mod-reviewBottom -->
You can see that even though the aspect titles differ, they are all in 'div class="mod-reviewTitle3"' elements, and the comments are all in 'div class="mod-reviewList-txt"' elements.
My question is: how do I write good code to store this information in a data set like this?
| title | aspect1 comment | aspect2 comment |
| A     | good            | very nice       |
I have tried the code below, but extracting the aspect comments in each block doesn't work well:
datatest = soup.find_all("div", {"class":"mod-reviewTop"})
datatest1 = soup.find_all("div", {"class":"mod-reviewBottom"})
for item in datatest:
    a = item.select('.mod-reviewTitle')
    c = item.select('.mod-reviewTitle3')
    d = item.select('.mod-reviewList-txt')
    g = item.select('.js-mod-reviewList-list')
    f = item.select('.mod-reviewItem')
    for i in range(len(a)):
        f1 = f[i].text[7]
        f2 = f[i].text[17]
        f3 = f[i].text[26]
        f4 = f[i].text[37]
        f5 = f[i].text[46]
        f6 = f[i].text[55]
        f7 = f[i].text[63]
        print a[i].text
        print f1, f2, f3, f4, f5, f6, f7
for item in datatest1:
    for k in range(len(g)):
        print g[k].text
        print e[k].text
        print k
I regard this as a programming problem.
I have tried loops, but they didn't work well.
If you can give me a reference or explain how the logic should be structured, please leave a comment. Thanks.

Tips:
You should attach aspects and comments to their corresponding titles, which means storing them together in a suitable data structure, like this (just one possible way):
[
    (title1, [
        (aspect1, comment1),
        (aspect2, comment2),
        ...
    ]),
    (title2, [
        (aspect1, comment1),
        (aspect2, comment2),
        ...
    ]),
    ...
]
So when retrieving the data you want, organize the operations with nested for loops: once you find an aspect, get the corresponding comment and store the pair together. Avoid finding all aspects first and then all comments.
Code
Here is a demo.
blocks = soup.find_all("div", {"class": "mod-reviewTop"})
contents = soup.find_all("div", {"class": "mod-reviewBottom"})
data = []
# Assumes the i-th mod-reviewTop and the i-th mod-reviewBottom belong to the same review.
for block, content in zip(blocks, contents):
    title = block.find("div", {"class": "mod-reviewTitle"}).get_text(strip=True)
    aspects = []
    # Each <li> holds one aspect together with its comment.
    for aspect_block in content.find_all("li"):
        aspect = aspect_block.find("div", {"class": "mod-reviewTitle3"}).get_text(strip=True)
        comment = aspect_block.find("div", {"class": "mod-reviewList-txt"}).get_text(strip=True)
        aspects.append((aspect, comment))
    data.append((title, aspects))
print(data)
# One review per line: the title, then "|aspect<TAB>comment" for each aspect.
with open("output.txt", "w") as out:
    for title, aspects in data:
        out.write(title)
        for aspect, comment in aspects:
            out.write("|" + aspect + "\t" + comment)
        out.write("\n")
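The question asks for a table with one column per aspect, while the demo above stores each review as (title, [(aspect, comment), ...]). A minimal sketch of that last step follows, using only the standard library. The sample reviews and the to_rows helper are made up for illustration; aspects a student skipped are left as empty cells.

```python
import csv
import io

# Hypothetical sample in the (title, [(aspect, comment), ...]) shape built above.
data = [
    ("title 1 : It was ok.", [("Total Evaluation", "Lots of choice."),
                              ("Course", "the courses are good."),
                              ("Lab", "no lab in the first 2 years.")]),
    ("title 2 : Great.", [("Total Evaluation", "Very nice."),
                          ("Access", "Close to the station.")]),
]


def to_rows(reviews):
    """Return (columns, rows) with one column per aspect seen in any review."""
    columns = ["title"]
    rows = []
    for title, aspects in reviews:
        row = {"title": title}
        for aspect, comment in aspects:
            if aspect not in columns:
                columns.append(aspect)  # keep first-seen order
            row[aspect] = comment
        rows.append(row)
    return columns, rows


columns, rows = to_rows(data)

# restval="" fills in the aspects a student did not comment on.
buffer = io.StringIO()
writer = csv.DictWriter(buffer, fieldnames=columns, restval="")
writer.writeheader()
writer.writerows(rows)
print(buffer.getvalue())
```

Writing to a real file instead of io.StringIO only requires passing an open file object to csv.DictWriter. The sketch assumes no aspect is literally named "title".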

Related

Find different elements in two different divs with the same class using Selenium

I'm using Python and Selenium to scrape a website. I used the find_element methods to find all the values that I need, but I've run into something more challenging. The page's HTML uses exactly the same structure for two different values, and I cannot use a simple find_element_by_class_name because the elements have the same classes and ids. I don't want to use an XPath or a selector, because I am iterating through many "flight-row" divs and it would make things more hardcoded.
<div class="flight-row">
<div class="row row-eq-heights">
<div class="col-xs-4 col-md-4 no-padding"><span class="airline-name">gol</span><span class="flight-number">AM-477</span></div>
<div class="col-xs-4 col-md-4">
<div class="flight-timming"><span class="flight-time">06:15</span><span class="flight-destination">IAH</span></div><span class="flight-data">01/10/19</span></div>
<div class="col-xs-4 col-md-4 no-padding">
<div class="duration"><span class="flight-duration">21:25</span><span class="flight-stops" aria-label="Paradas do voo">2 paradas</span></div>
</div>
<div class="col-xs-4 col-md-4">
<div class="flight-timming"><span class="flight-destination">GIG</span><span class="flight-time">05:40</span></div><span class="flight-data">02/10/19</span></div>
</div>
</div>
I want to get the values of flight-time, flight-destination and flight-data from both "col-xs-4 col-md-4" divs.
This is part of my code:
outbound_flights = driver.find_elements_by_css_selector("div[class^='flight-item ']")
for outbound_flight in outbound_flights:
    airline = outbound_flight.find_element_by_css_selector("span[class='airline-name']")
Thank you!

Try the following CSS selector to get flight-time, flight-destination and flight-data:
outbound_flights = driver.find_elements_by_css_selector("div.col-xs-4.col-md-4:not(.no-padding)")
for outbound_flight in outbound_flights:
    flight_time = outbound_flight.find_element_by_css_selector("div.flight-timming span.flight-time").text
    print(flight_time)
    flight_destination = outbound_flight.find_element_by_css_selector("div.flight-timming span.flight-destination").text
    print(flight_destination)
    flight_data = outbound_flight.find_element_by_css_selector("span.flight-data").text
    print(flight_data)
Output on console:
06:15
IAH
01/10/19
05:40
GIG
02/10/19
EDITED Answer:
outbound_flights = driver.find_elements_by_css_selector("div.col-xs-4.col-md-4:not(.no-padding)")
flighttime = []
for outbound_flight in outbound_flights:
    flight_time = outbound_flight.find_element_by_css_selector("div.flight-timming span.flight-time").text
    print(flight_time)
    flighttime.append(flight_time)
    flight_destination = outbound_flight.find_element_by_css_selector("div.flight-timming span.flight-destination").text
    print(flight_destination)
    flight_data = outbound_flight.find_element_by_css_selector("span.flight-data").text
    print(flight_data)
departure_time = flighttime[0]
arrival_time = flighttime[1]
print("Departure time :" + departure_time)
print("Arrival time :" + arrival_time)

You can get the values by index (XPath positions start at 1):
(//*[@class='flight-time'])[1] and (//*[@class='flight-time'])[2]
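The index-based idea can be tried without a browser. The sketch below is not Selenium: it swaps in the standard library's xml.etree.ElementTree on a trimmed, well-formed copy of one "flight-row" from the question, and the texts helper is a made-up name. It collects each class in document order and then picks the first and second match, which correspond to XPath positions [1] and [2].

```python
import xml.etree.ElementTree as ET

# Trimmed, well-formed copy of one "flight-row" from the question.
ROW = """
<div class="flight-row">
  <div class="col-xs-4 col-md-4">
    <div class="flight-timming"><span class="flight-time">06:15</span><span class="flight-destination">IAH</span></div>
    <span class="flight-data">01/10/19</span>
  </div>
  <div class="col-xs-4 col-md-4">
    <div class="flight-timming"><span class="flight-destination">GIG</span><span class="flight-time">05:40</span></div>
    <span class="flight-data">02/10/19</span>
  </div>
</div>
"""

row = ET.fromstring(ROW)


def texts(root, css_class):
    """Return the text of every span whose class attribute equals css_class, in document order."""
    return [el.text for el in root.findall(f".//span[@class='{css_class}']")]


times = texts(row, "flight-time")
airports = texts(row, "flight-destination")
dates = texts(row, "flight-data")

# Python index 0 matches XPath position [1], index 1 matches position [2].
departure = {"time": times[0], "airport": airports[0], "date": dates[0]}
arrival = {"time": times[1], "airport": airports[1], "date": dates[1]}
print(departure)
print(arrival)
```

Running the lookup on each row element rather than on the whole page keeps the indexes local to one flight, so the same code works inside a loop over many rows.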

How to add a div to a list of words using python

I'm trying to create a table in standard HTML, but I have a lot of words to wrap in div tags. Normally I would do this server-side, but that isn't an option in this project, so I'm restricted to plain old web technologies. I was wondering whether there is a Python solution to my problem, i.e. a small script that runs through a list of given words and outputs the required HTML, which I can then copy and paste into various HTML files. So if I have the list of words below, could I run a Python program that adds the needed divs with the classes to each word? One complication is that the "order" value needs to increment with each word.
Words
animals
cat
dog
mouse
lion
What the output should be
<div class="Rtable Rtable--1cols">
<div style="order:0" class="Rtable-cell-head">animals</div>
<div style="order:1;" class="Rtable-cell">cat</div>
<div style="order:2;" class="Rtable-cell">dog</div>
<div style="order:3;" class="Rtable-cell">mouse</div>
<div style="order:4;" class="Rtable-cell">lion</div>
</div>

A jinja solution:
from __future__ import print_function
from jinja2 import Template
template = Template("""
<div class="Rtable Rtable--1cols">
<div style="order:0" class="Rtable-cell-head">animals</div>
{%- for order, animal in animals %}
<div style="order:{{ order }};" class="Rtable-cell">{{ animal }}</div>
{%- endfor %}
</div>
""")
animals = """
cat
dog
mouse
lion
""".split()
print(template.render(animals=list(enumerate(animals, 1))))
output:
<div class="Rtable Rtable--1cols">
<div style="order:0" class="Rtable-cell-head">animals</div>
<div style="order:1;" class="Rtable-cell">cat</div>
<div style="order:2;" class="Rtable-cell">dog</div>
<div style="order:3;" class="Rtable-cell">mouse</div>
<div style="order:4;" class="Rtable-cell">lion</div>
</div>
Pure python version:
from __future__ import print_function
template = """
<div class="Rtable Rtable--1cols">
<div style="order:0" class="Rtable-cell-head">animals</div>\
{animals}
</div>
"""
animal_template = """
<div style="order:{order};" class="Rtable-cell">{animal}</div>"""
animals = """
cat
dog
mouse
lion
""".split()
animal_divs = ''.join([animal_template.format(order=i, animal=animal)
                       for i, animal in enumerate(animals, 1)])
print(template.format(animals=animal_divs))
the output is the same.
Update: the great convenience of Python's split() is that it splits on any whitespace (including newlines). However, if you have animal names with spaces in them (e.g. "white rhino"), you'll need another approach: split by lines, strip the whitespace from each line, and skip lines that contain only whitespace:
animals = [animal.strip() for animal in """
cat
dog
mouse
lion
""".splitlines() if animal.strip()]
(this solution is similar to the node.js solution below)
However, if your users know JavaScript and not Python, then a Node.js solution might be better:
const animals = `
cat
dog
mouse
lion
`.split('\n').map(v => v.trim()).filter(v => !!v);
const animal_template = (animal, order) => `<div style="order:${order+1};" class="Rtable-cell">${animal}</div>`;
const template = animals => `
<div class="Rtable Rtable--1cols">
<div style="order:0" class="Rtable-cell-head">animals</div>
${animals.map(animal_template).join('\n ')}
</div>`
console.log(template(animals));
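For comparison, here is a small Python sketch that takes the whole word list, header included, and treats the first line as the header cell, as in the question. It also escapes the words, in case one contains characters such as & or <. The build_rtable name and the sample list (with "white rhino" added) are assumptions made for illustration.

```python
from html import escape

# Hypothetical input: the first line is the header, the rest are cells.
WORDS = """
animals
cat
white rhino
lion
"""


def build_rtable(text):
    """Render one word per line as Rtable divs; the first word becomes the header."""
    words = [line.strip() for line in text.splitlines() if line.strip()]
    if not words:
        return ""
    head, cells = words[0], words[1:]
    lines = ['<div class="Rtable Rtable--1cols">']
    lines.append(f'<div style="order:0" class="Rtable-cell-head">{escape(head)}</div>')
    # Cells are numbered from 1 because the header already uses order 0.
    for order, word in enumerate(cells, 1):
        lines.append(f'<div style="order:{order};" class="Rtable-cell">{escape(word)}</div>')
    lines.append("</div>")
    return "\n".join(lines)


html_table = build_rtable(WORDS)
print(html_table)
```

Because the function splits by line rather than on whitespace, multi-word names stay in a single cell.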

You can use bs4 to create, manipulate and delete HTML: pip install beautifulsoup4
Read the file containing the list of words (one word per line):
with open('/path/to/list', 'r') as f:
    # Strip newlines and skip blank lines.
    lines = [line.strip() for line in f if line.strip()]
Create the HTML you need:
from bs4 import BeautifulSoup
# creating parent
soup = BeautifulSoup('<div class="Rtable Rtable--1cols"></div>', 'html.parser')
parent = soup.div
# adding head (the first word)
new_tag = soup.new_tag('div', **{'class': 'Rtable-cell-head', 'style': 'order:0'})
new_tag.string = lines[0]
parent.append(new_tag)
# adding elements, numbered from 1
for i, line in enumerate(lines[1:], 1):
    new_tag = soup.new_tag('div', **{'class': 'Rtable-cell', 'style': f'order:{i};'})
    new_tag.string = line
    parent.append(new_tag)
print(soup)

Store web scraping results in DataFrame or dictionary

I'm taking an online course, and I'm trying to automate the process of capturing the course structure for my personal notes, which I keep locally in a Markdown file.
Here's a sample of how the HTML for a chapter looks:
<!-- Header of the chapter -->
<div class="chapter__header">
<div class="chapter__title-wrapper">
<span class="chapter__number">
<span class="chapter-number">1</span>
</span>
<h4 class="chapter__title">
Introduction to Experimental Design
</h4>
<span class="chapter__price">
Free
</span>
</div>
<div class="dc-progress-bar dc-progress-bar--small chapter__progress">
<span class="dc-progress-bar__text">0%</span>
<div class="dc-progress-bar__bar chapter__progress-bar">
<span class="dc-progress-bar__fill" style="width: 0%;"></span>
</div>
</div>
</div>
<p class="chapter__description">
An introduction to key parts of experimental design plus some power and sample size calculations.
</p>
<!-- !Header of the chapter -->
<!-- Body of the chapter -->
<ul class="chapter__exercises hidden">
<li class="chapter__exercise ">
<a class="chapter__exercise-link" href="https://campus.datacamp.com/courses/experimental-design-in-r/introduction-to-experimental-design?ex=1">
<span class="chapter__exercise-icon exercise-icon ">
<img width="23" height="23" src="https://cdn.datacamp.com/main-app/assets/courses/icon_exercise_video-3b15ea50771db747f7add5f53e535066f57d9f94b4b0ebf1e4ddca0347191bb8.svg" alt="Icon exercise video" />
</span>
<h5 class="chapter__exercise-title" title='Intro to Experimental Design'>Intro to Experimental Design</h5>
<span class="chapter__exercise-xp">
50 xp
</span>
</a> </li>
So far, I've used BeautifulSoup to pull out all the relevant information:
from urllib.request import urlopen
from bs4 import BeautifulSoup
url = 'https://www.datacamp.com/courses/experimental-design-in-r'
html = urlopen(url)
soup = BeautifulSoup(html, 'lxml')
lesson_outline = soup.find_all(['h4', 'li'])
outline_list = []
for item in lesson_outline:
    attributes = item.attrs
    try:
        class_type = attributes['class'][0]
        if class_type == 'chapter__title':
            outline_list.append(item.text.strip())
        if class_type == 'chapter__exercise':
            lesson_name = item.find('h5').text
            lesson_link = item.find('a').attrs['href']
            outline_list.append(lesson_name)
            outline_list.append(lesson_link)
    except KeyError:
        pass
This gives me a list like this:
['Introduction to Experimental Design', 'Intro to Experimental Design', 'https://campus.datacamp.com/courses/experimental-design-in-r/introduction-to-experimental-design?ex=1',...]
My goal is to put this all into an .md file that would look something like this:
# Introduction to Experimental Design
* [Intro to Experimental Design](https://campus.datacamp.com/courses/experimental-design-in-r/introduction-to-experimental-design?ex=1)
* [A basic experiment](https://campus.datacamp.com/courses/experimental-design-in-r/introduction-to-experimental-design?ex=2)
My question is: What's the best way to structure this data so that I can easily access it later on when I'm writing the text file? Would it be better to have a DataFrame with columns chapter, lesson, lesson_link? A DataFrame with a MultiIndex? A nested dictionary? If it were a dictionary, what should I name the keys? Or is there another option I'm missing? Some sort of database?
Any thoughts would be much appreciated!

If I see it right, you're currently appending every element to the list outline_list in order of its appearance. But you actually have three distinct types of data, not one:
chapter__title
chapter__exercise.name
chapter__exercise.link
Each title can have multiple exercises, which are always a pair of name and link. Since you also want to keep the data in this structure for your text-file, you can come up with any structure that represents this hierarchy. An example:
from urllib.request import urlopen
from bs4 import BeautifulSoup
from collections import OrderedDict
url = 'https://www.datacamp.com/courses/experimental-design-in-r'
html = urlopen(url)
soup = BeautifulSoup(html, 'lxml')
lesson_outline = soup.find_all(['h4', 'li'])
# Using OrderedDict assures that the order of the result will be the same as in the source
chapters = OrderedDict() # {chapter: [(lesson_name, lesson_link), ...], ...}
for item in lesson_outline:
    attributes = item.attrs
    try:
        class_type = attributes['class'][0]
        if class_type == 'chapter__title':
            chapter = item.text.strip()
            chapters[chapter] = []
        if class_type == 'chapter__exercise':
            lesson_name = item.find('h5').text
            lesson_link = item.find('a').attrs['href']
            chapters[chapter].append((lesson_name, lesson_link))
    except KeyError:
        pass
From there it should be easy to write your text file:
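with open('outline.md', 'w') as f:  # 'outline.md' is a placeholder file name
    for chapter, lessons in chapters.items():
        # write chapter title
        f.write('# {}\n'.format(chapter))
        for lesson_name, lesson_link in lessons:
            # write lesson
            f.write('* [{}]({})\n'.format(lesson_name, lesson_link))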
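The question also asks about a DataFrame with columns chapter, lesson and lesson_link. The nested mapping converts to that flat shape in one pass, so either structure can be derived from the other. The sketch below uses hypothetical sample data (the second chapter and the example.com links are invented) and a made-up to_records helper. The resulting list of dicts is a shape that pandas.DataFrame accepts directly, although pandas is not needed to run the sketch.

```python
# Hypothetical sample in the {chapter: [(lesson_name, lesson_link), ...]} shape.
chapters = {
    "Introduction to Experimental Design": [
        ("Intro to Experimental Design", "https://example.com/intro?ex=1"),
        ("A basic experiment", "https://example.com/intro?ex=2"),
    ],
    "Basic Experiments": [
        ("ANOVA", "https://example.com/basic?ex=1"),
    ],
}


def to_records(chapters):
    """Flatten the nested mapping into one dict per lesson."""
    return [
        {"chapter": chapter, "lesson": name, "lesson_link": link}
        for chapter, lessons in chapters.items()
        for name, link in lessons
    ]


records = to_records(chapters)
for record in records:
    print(record)
```

The nested form is convenient for writing the Markdown outline, and the flat form is convenient for filtering or exporting.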

python xpath extract text outside tag based on the span text

I want to extract the text outside the tag and match it with the text inside the span.
This is the HTML:
<div class="info">
<p>
<i class="icon-trending-up"></i>
<span>Rank:</span>
600
</p>
<p>
<i class="icon-play"></i>
<span>Total Videos:</span>
36
</p>
<p>
<i class="icon-bar-chart"></i>
<span>Video Views:</span>
1,815,767
</p>
<hr>
<p>
<i class="icon-user-plus"></i>
<span>Followers:</span>
732
</p>
</div>
I want to extract something like this into separate items:
item['rank'] = rank
Rank: 600
item['videos'] = videos
Total Videos: 36
item['views'] = views
Video Views: 1,815,767
I do not want the <p> tag below the <hr>.
This is what I have tried so far:
rank = response.xpath("//div[@class='info']//hr/preceding-sibling::p//text()='Videos:'").extract()
This is the result:
[u'0']
OR
rank = response.xpath("//div[@class='info']//hr/preceding-sibling::p/span[contains(text(), 'Videos:')]/text()|//hr/preceding-sibling::p//text()[not(parent::span)]").extract()
This is the result:
[u' 600', u'Total Videos:', u' 36', u' 1,815,767']
Basically, I want to extract the number based on the span text, with every <p> tag in its own item.
Thank you
UPDATE
I can't use anything like p[1], p[2], etc., because those <p> elements may be swapped, or there might be only two of them on other pages. The <span> text will remain the same.

What about:
item["rank"] = response.xpath('//span[.="Rank:"]/following-sibling::text()[1]').extract_first()
item["videos"] = response.xpath('//span[.="Total Videos:"]/following-sibling::text()[1]').extract_first()
item["views"] = response.xpath('//span[.="Video Views:"]/following-sibling::text()[1]').extract_first()

This should work. It looks a bit clumsy because it has to deal with the nested elements.
item['rank'] = ''.join(s.strip() for s in response.xpath('//div//span[contains(., "Rank:")]/ancestor::p/text()').extract())
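The same label-to-value idea can be tried outside Scrapy. The sketch below is an alternative using the standard library's xml.etree.ElementTree, not Scrapy selectors. It relies on the fact that the text following a closing </span> is exposed as that element's tail. It assumes a well-formed copy of the snippet (note <hr/>), and stats_before_hr is a made-up helper name.

```python
import xml.etree.ElementTree as ET

# Well-formed copy of the HTML from the question (<hr> written as <hr/>).
INFO = """
<div class="info">
  <p><i class="icon-trending-up"></i><span>Rank:</span> 600 </p>
  <p><i class="icon-play"></i><span>Total Videos:</span> 36 </p>
  <p><i class="icon-bar-chart"></i><span>Video Views:</span> 1,815,767 </p>
  <hr/>
  <p><i class="icon-user-plus"></i><span>Followers:</span> 732 </p>
</div>
"""


def stats_before_hr(div):
    """Map each span label to the text that follows it, stopping at the first <hr>."""
    stats = {}
    for child in div:
        if child.tag == "hr":
            break  # everything after the <hr> is ignored
        span = child.find("span")
        if span is not None and span.text:
            label = span.text.strip().rstrip(":")
            stats[label] = (span.tail or "").strip()
    return stats


stats = stats_before_hr(ET.fromstring(INFO))
print(stats)
```

Because the lookup is keyed on the span text, it keeps working if the <p> elements are reordered or if some of them are missing on other pages.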

Using Xpath axes to extract preceding element

I'm trying to scrape data from a site with the structure below. I want to extract the information in each <li id="entry">, and each entry should also carry the category information from the <h2> of its enclosing <li id="category">.
<ul class="html-winners">
<li id="category">
<h2>Redaktionell Print - Dagstidning</h2>
<ul>
<li id="entry">
<div class="entry-info">
<div class="block">
<img src="bilder/tumme10/4.jpg" width="110" height="147">
<span class="gold">Guld: Svenska Dagbladet</span><br>
<strong><strong>Designer:</strong></strong> Anna W Thurfjell och SvD:s medarbetare<br>
<strong><strong>Motivering:</strong></strong> "Konsekvent design som är lätt igenkänningsbar. Små förändringar förnyar ständigt och blldmotiven utnyttjas föredömligt."
</div>
</div>
</li>
<li id="entry">
<div class="entry-info">
<div class="block"><img src="bilder/tumme10/3.jpg" width="110" height="147">
<span class="silver">Silver: K2 - Kristianstadsbladet</span>
</div>
</div>
</li>
</ul>
</li>
I use Scrapy with the following code:
start_urls = [
    "http://www.designpriset.se/vinnare.php?year=2010"
]
rules = (
    Rule(LinkExtractor(allow = "http://www.designpriset.se/", restrict_xpaths=('//*[@class="html-winners"]')), callback='parse_item'),
)
def parse(self, response):
    for sel in response.xpath('//*[@class="entry-info"]'):
        item = ByrauItem()
        annons_list = sel.xpath('//span[@class="gold"]/text()|//span[@class="silver"]/text()').extract()
        byrau_list = sel.xpath('//div/text()').extract()
        kategori_list = sel.xpath('/preceding::h2/text()').extract()
        for x in range(0,len(annons_list)):
            item['Annonsrubrik'] = annons_list[x]
            item['Byrau'] = byrau_list[x]
            item['Kategori'] = kategori_list[x]
            yield item
annons_list and byrau_list work perfectly; they use XPath to go down the hierarchy from the starting point //*[@class="entry-info"]. But kategori_list gives me "IndexError: list index out of range". Am I writing the XPath preceding axis the wrong way?

As mentioned by @kjhughes in a comment, you need to add . just before / or // to make your XPath expression relative to the current context element. Otherwise the expression is evaluated from the document root, and that's why /preceding::h2/text() returned nothing.
In the case of /, you can alternatively remove it from the beginning of the XPath expression to make it relative to the current context element:
kategori_list = sel.xpath('preceding::h2/text()').extract()
Just a note: preceding::h2 will return all h2 elements located before the <div class="entry-info">. According to the HTML posted, I think the following XPath expression is safer against returning unwanted h2 elements (false positives):
query = 'parent::li/parent::ul/preceding-sibling::h2/text()'
kategori_list = sel.xpath(query).extract()
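A different way to avoid walking back up the tree is to iterate from the category down: select each <li id="category">, read its <h2> once, and then loop over the entries nested inside it. This is a separate technique from the preceding and parent axes, sketched here with the standard library's xml.etree.ElementTree on a trimmed, well-formed copy of the HTML rather than with Scrapy. The item keys follow the question; everything else is illustrative.

```python
import xml.etree.ElementTree as ET

# Trimmed, well-formed copy of the structure from the question.
WINNERS = """
<ul class="html-winners">
  <li id="category">
    <h2>Redaktionell Print - Dagstidning</h2>
    <ul>
      <li id="entry"><div class="entry-info"><div class="block">
        <span class="gold">Guld: Svenska Dagbladet</span>
      </div></div></li>
      <li id="entry"><div class="entry-info"><div class="block">
        <span class="silver">Silver: K2 - Kristianstadsbladet</span>
      </div></div></li>
    </ul>
  </li>
</ul>
"""

root = ET.fromstring(WINNERS)
items = []
for category in root.findall("./li[@id='category']"):
    # The heading is read once per category and shared by all of its entries.
    heading = category.findtext("h2", default="").strip()
    for entry in category.findall("./ul/li[@id='entry']"):
        span = entry.find(".//span")
        if span is not None and span.text:
            items.append({"Kategori": heading, "Annonsrubrik": span.text.strip()})

for item in items:
    print(item)
```

With this layout every entry is paired with exactly one heading, so no index into a separate list of headings is needed.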
