I want to extract the price of a player from Futbin. Part of the HTML is here:
<div class="pr pr_pc" id="pr_pc">PR: 10,250 - 150,000</div>
<div id="pclowest" class="hide">23500</div>
I've written this in Python:
from lxml import html
import requests
page = requests.get('https://www.futbin.com/18/player/15660/Mar%C3%A7al/')
tree = html.fromstring(page.content)
player = tree.xpath('//*[#id="pclowest"]')
print 'player: ', player
I want to extract the value 23500 automatically, but I can't. Can someone help me?
Edit:
There's another piece of code from which the data could perhaps be extracted:
<div class="bin_price lbin">
<span class="price_big_right">
<span id="pc-lowest-1" data-price="23,000">23,000 <img alt="c" class="coins_icon_l_bin" src="https://cdn.futbin.com/design/img/coins_bin.png">
</span>
</span>
</div>
Would it be possible to extract data-price here?
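For reference, a minimal sketch showing both extractions, parsing the snippets from the question as a literal string rather than the live page. Note that XPath addresses attributes with @id, not the CSS-style #id used in the question's code, and /text() or /@data-price pulls out the actual value:

```python
from lxml import html

# The snippets from the question, wrapped in one parent element
snippet = '''
<div>
  <div class="pr pr_pc" id="pr_pc">PR: 10,250 - 150,000</div>
  <div id="pclowest" class="hide">23500</div>
  <div class="bin_price lbin">
    <span class="price_big_right">
      <span id="pc-lowest-1" data-price="23,000">23,000</span>
    </span>
  </div>
</div>
'''
tree = html.fromstring(snippet)

# XPath uses @attr, not #attr; /text() selects the text node
lowest = tree.xpath('//div[@id="pclowest"]/text()')[0]
# /@data-price selects the attribute value directly
price = tree.xpath('//span[@id="pc-lowest-1"]/@data-price')[0]

print(lowest)  # 23500
print(price)   # 23,000
```

The same two XPath expressions should work unchanged on the full page fetched with requests.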
I'm trying to parse the following HTML in Python using Beautiful Soup. I would like to be able to search for the text inside a tag, for example "Color", and return the text of the next tag, "Slate, mykonos", and do the same for the following tags, so that for a given category I can return its corresponding information.
However, I'm finding it very difficult to find the right code to do this.
<h2>Details</h2>
<div class="section-inner">
<div class="_UCu">
<h3 class="_mEu">General</h3>
<div class="_JDu">
<span class="_IDu">Color</span>
<span class="_KDu">Slate, mykonos</span>
</div>
</div>
<div class="_UCu">
<h3 class="_mEu">Carrying Case</h3>
<div class="_JDu">
<span class="_IDu">Type</span>
<span class="_KDu">Protective cover</span>
</div>
<div class="_JDu">
<span class="_IDu">Recommended Use</span>
<span class="_KDu">For cell phone</span>
</div>
<div class="_JDu">
<span class="_IDu">Protection</span>
<span class="_KDu">Impact protection</span>
</div>
<div class="_JDu">
<span class="_IDu">Cover Type</span>
<span class="_KDu">Back cover</span>
</div>
<div class="_JDu">
<span class="_IDu">Features</span>
<span class="_KDu">Camera lens cutout, hard shell, rubberized, port cut-outs, raised edges</span>
</div>
</div>
I use the following code to retrieve my div tag
soup.find_all("div", "_JDu")
Once I have retrieved the tag I can navigate inside it, but I can't find the right code to find the text inside one tag and return the text in the tag after it.
Any help would be really appreciated, as I'm new to Python and I have hit a dead end.
You can define a function to return the value for the key you enter:
def get_txt(soup, key):
    key_tag = soup.find('span', text=key).parent
    return key_tag.find_all('span')[1].text
color = get_txt(soup, 'Color')
print('Color: ' + color)
features = get_txt(soup, 'Features')
print('Features: ' + features)
Output:
Color: Slate, mykonos
Features: Camera lens cutout, hard shell, rubberized, port cut-outs, raised edges
I hope this is what you are looking for.
Explanation:
soup.find('span', text=key) returns the <span> tag whose text=key.
.parent returns the parent tag of the current <span> tag.
Example:
When key='Color', soup.find('span', text=key).parent will return
<div class="_JDu">
<span class="_IDu">Color</span>
<span class="_KDu">Slate, mykonos</span>
</div>
Now we've stored this in key_tag. Only thing left is getting the text of second <span>, which is what the line key_tag.find_all('span')[1].text does.
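Putting it all together, the approach can be run as a self-contained script against the snippet from the question (trimmed here for brevity):

```python
from bs4 import BeautifulSoup

html_doc = '''
<div class="_JDu">
  <span class="_IDu">Color</span>
  <span class="_KDu">Slate, mykonos</span>
</div>
<div class="_JDu">
  <span class="_IDu">Features</span>
  <span class="_KDu">Camera lens cutout, hard shell, rubberized, port cut-outs, raised edges</span>
</div>
'''
soup = BeautifulSoup(html_doc, 'html.parser')

def get_txt(soup, key):
    # find the <span> whose text is exactly `key`, step up to its
    # parent <div class="_JDu">, then take the second <span> inside it
    key_tag = soup.find('span', text=key).parent
    return key_tag.find_all('span')[1].text

color = get_txt(soup, 'Color')
print(color)  # Slate, mykonos
```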
Give it a go. It can also give you the corresponding values. To try it out, wrap the HTML from the question in a triple-quoted string assigned to a content variable and see how it works.
from bs4 import BeautifulSoup
soup = BeautifulSoup(content,"lxml")
for elem in soup.select("._JDu"):
    item = elem.select_one("span")
    if "Features" in item.text:  # try other keys to see if it misses the corresponding values
        val = item.find_next("span").text
        print(val)
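For instance, with a trimmed copy of the question's HTML in content, a small variation of the same idea collects every key/value pair into a dict (using html.parser here in case lxml isn't installed):

```python
from bs4 import BeautifulSoup

content = """
<div class="_JDu">
  <span class="_IDu">Color</span>
  <span class="_KDu">Slate, mykonos</span>
</div>
<div class="_JDu">
  <span class="_IDu">Features</span>
  <span class="_KDu">Camera lens cutout, hard shell</span>
</div>
"""
soup = BeautifulSoup(content, "html.parser")

results = {}
for elem in soup.select("._JDu"):
    item = elem.select_one("span")          # the key span (class _IDu)
    results[item.text] = item.find_next("span").text  # the value span after it

print(results)
```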
I'm quite new to programming, and OO programming especially. Nonetheless, I'm trying to write a very simple spider for web crawling. Here's my first approach:
I need to fetch the data out of this page: http://europa.eu/youth/volunteering/evs-organisation_en
First, I view the page source to find the HTML elements:
view-source:https://europa.eu/youth/volunteering/evs-organisation_en
Note: I need to fetch the data that comes right below this line:
EVS accredited organisations search results: 6066
I chose Beautiful Soup for this job, since it is very powerful. I use find_all:
soup.find_all('p')[0].get_text()
We can also search for tags by class and id. Note: classes and IDs are used by CSS to determine which HTML elements to apply certain styles to. We can also use them when scraping to specify the specific elements we want to scrape.
See the class:
<div class="col-md-4">
<div class="vp ey_block block-is-flex">
<div class="ey_inner_block">
<h4 class="text-center">"People need people" Zaporizhya oblast civic organisation of disabled families</h4>
<p class="ey_info">
<i class="fa fa-location-arrow fa-lg"></i>
Zaporizhzhya, <strong>Ukraine</strong>
</p> <p class="ey_info"><i class="fa fa-hand-o-right fa-lg"></i> Sending</p>
<p><strong>PIC no:</strong> 935175449</p>
<div class="empty-block">
Read more </div>
</div>
so this leads to:
# import libraries
import requests
from bs4 import BeautifulSoup

page = requests.get("https://europa.eu/youth/volunteering/evs-organisation_en")
soup = BeautifulSoup(page.content, 'html.parser')
soup
Now we can use the find_all method to search for items by class or by id. In the example below, we'll search for any element that has the class col-md-4.
<div class="col-md-4">
so we choose:
soup.find_all(class_="col-md-4")
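Each matched block can then be drilled into. A sketch, run here against a trimmed copy of the block quoted above rather than the live page (the choice of fields is illustrative):

```python
from bs4 import BeautifulSoup

sample = '''
<div class="col-md-4">
  <div class="vp ey_block block-is-flex">
    <div class="ey_inner_block">
      <h4 class="text-center">"People need people" Zaporizhya oblast civic organisation of disabled families</h4>
      <p class="ey_info">Zaporizhzhya, <strong>Ukraine</strong></p>
      <p><strong>PIC no:</strong> 935175449</p>
    </div>
  </div>
</div>
'''
soup = BeautifulSoup(sample, 'html.parser')

orgs = []
for block in soup.find_all(class_='col-md-4'):
    name = block.h4.get_text(strip=True)              # organisation name from the <h4>
    country = block.find('strong').get_text(strip=True)  # first <strong> holds the country
    orgs.append((name, country))

print(orgs)
```

On the real page, the same loop over soup.find_all(class_="col-md-4") should yield one tuple per organisation.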
Now I have to combine it all.
Update: my approach so far:
I have extracted data wrapped within multiple HTML tags from a webpage using BeautifulSoup4. I want to store all of the extracted data in a list. To be more concrete: I want each piece of extracted data as a separate list element, separated by commas (i.e. CSV-formatted).
To begin at the beginning, here is the HTML content structure:
<div class="view-content">
<div class="row is-flex"></span>
<div class="col-md-4"></span>
<div class </span>
<div class= >
<h4 Data 1 </span>
<div class= Data 2</span>
<p class=
<i class=
<strong>Data 3 </span>
</p> <p class= Data 4 </span>
<p class= Data 5 </span>
<p><strong>Data 6</span>
<div class=</span>
<a href="Data 7</span>
</div>
</div>
Code to extract:
for data in elem.find_all('span', class_=""):
This should give an output:
data = [ele.text for ele in soup.find_all('span', {'class':'NormalTextrun'})]
print(data)
Output:
[' Data 1 ', ' Data 2 ', ' Data 3 ' and so forth]
My question: I need help with the extraction part...
Try this:
data = [ele.text for ele in soup.find_all(text = True) if ele.text.strip() != '']
print(data)
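To illustrate on a small, well-formed stand-in for the question's (mangled) HTML: find_all(text=True) yields every text node in the document, and the strip() filter drops the whitespace-only ones.

```python
from bs4 import BeautifulSoup

snippet = '<div><h4>Data 1</h4><p><i></i><strong>Data 3</strong></p></div>'
soup = BeautifulSoup(snippet, 'html.parser')

# every text node, minus whitespace-only entries
data = [ele.strip() for ele in soup.find_all(text=True) if ele.strip()]
print(data)  # ['Data 1', 'Data 3']
```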
This is my first question, so please forgive me if I have explained anything badly.
I am trying to scrape URLs from a specific website in Python and write the links to a CSV. The thing is, when I parse the website with BeautifulSoup I can't extract the URLs: all I get is <div id="dvScores" style="min-height: 400px;">\n</div>, and nothing under that branch. But when I open the browser console, copy the table where the links are, and paste it into a text editor, I get 600 pages of HTML. What I want to do is write a for loop that prints the links. The structure of the HTML is below:
<html xmlns="http://www.w3.org/1999/xhtml" xml:lang="en">
#shadow-root (open)
<head>...</head>
<body>
<div id="body">
<div id="wrapper">
#multiple divs but i don't need them
<div id="live-master"> #what I need is under this div
<span id="contextual">
#multiple divs but i don't need them
<div id="live-score-master"> #what I need is under this div
<div ng-app="live-menu" id="live-score-rightcoll">
#multiple divs but i don't need them
<div id="left-score-lefttemp" style="padding-top: 35px;">
<div id="dvScores">
<table cellspacing=0 ...>
<colgroup>...</colgroup>
<tbody>
<tr class="row line-bg1"> #this changes to bg2 or bg3
<td class="row">
<span class="row">
<a href="www.example.com" target="_blank" class="td_row">
#I need to extract this link
</span>
</td>
#Multiple td's
</tr>
#multiple tr class="row line-bg1" or "row line-bg2"
.
.
.
</tbody>
</table>
</div>
</div>
</div>
</div>
</span>
</div>
</div>
</body>
</html>
What am I doing wrong? I need to automate this in Python rather than pasting the HTML into a text editor and extracting the links with a regex.
My Python code is below:
import requests
from bs4 import BeautifulSoup
r=requests.get("http://example.com/example")
c=r.content
soup=BeautifulSoup(c,"html.parser")
all=soup.find_all("span",id="contextual")
span=all[0].find_all("tbody")
If you are trying to scrape URLs, then you should get the hrefs:
urls = soup.find_all('a', href=True)
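From there, writing the links to a CSV (as the question asks) could look something like this; the sample HTML and the filename are illustrative:

```python
import csv
from bs4 import BeautifulSoup

# stand-in for the table rows described in the question
snippet = '''
<span class="row">
  <a href="http://www.example.com/match1" target="_blank" class="td_row"></a>
  <a href="http://www.example.com/match2" target="_blank" class="td_row"></a>
</span>
'''
soup = BeautifulSoup(snippet, 'html.parser')

# href=True keeps only anchors that actually carry an href attribute
links = [a['href'] for a in soup.find_all('a', href=True)]

with open('links.csv', 'w', newline='') as f:
    csv.writer(f).writerows([link] for link in links)
```

Note this only helps once the table is actually present in the fetched HTML; as the other answers point out, on this site it is inserted by JavaScript.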
This site uses JavaScript to populate its content, so you can't get the URLs via BeautifulSoup alone. If you inspect the network tab in your browser you can spot this link. It contains all the data you need. You can simply parse it and extract all the desired values.
import requests
req = requests.get('http://goapi.mackolik.com/livedata?group=0').json()
for el in req['m'][4:100]:
    index = el[0]
    team_1 = el[2].replace(' ', '-')
    team_2 = el[4].replace(' ', '-')
    print('http://www.mackolik.com/Mac/{}/{}-{}'.format(index, team_1, team_2))
It seems like the HTML is being dynamically generated by JS. You would need to crawl it in a way that mimics a browser. Since you are using requests, it already provides a session:
session = requests.session()
data = session.get("http://website.com").content  # usage example
After this you can do the parsing, additional scraping, etc.
This is the link where I am trying to fetch data: flipkart. And the relevant part of the HTML:
<div class="toolbar-wrap line section">
<div class="ratings-reviews-wrap">
<div itemprop="aggregateRating" itemscope="" itemtype="http://schema.org/AggregateRating" class="ratings-reviews line omniture-field">
<div class="ratings">
<meta itemprop="ratingValue" content="1">
<div class="fk-stars" title="1 stars">
<span class="unfilled">★★★★★</span>
<span class="rating filled" style="width:20%">
★★★★★
</span>
</div>
<div class="count">
<span itemprop="ratingCount">2</span>
</div>
</div>
</div>
</div>
</div>
Here I have to fetch "1 stars" from title="1 stars" and 2 from <span itemprop="ratingCount">2</span>.
I tried the following code:
x = link_soup.find_all("div",class_='fk-stars')[0].get('title')
print x, " product_star"
y = link_soup.find_all("span",itemprop="ratingCount")[0].string.strip()
print y
but it gives:
IndexError: list index out of range
The content that you see in the browser is not actually present in the raw HTML that is retrieved from this URL.
When loaded with a browser, the page executes AJAX calls to load additional content, which is then dynamically inserted into the page. One of the calls gets the ratings info that you are after. Specifically this URL is the one that contains the HTML that is inserted as the "action bar".
But if you retrieve the main page using Python, e.g. with requests, urllib et. al., the dynamic content is not loaded and that is why BeautifulSoup can't find the tags.
You could analyse the main page to find the actual link, retrieve that, and then run it through BeautifulSoup. The link looks like it begins with /p/pv1/spotList1/spot1/actionBar, so searching for that, or perhaps just actionBar, should be sufficient to locate it.
Or you could use selenium to load the page and then grab and process the rendered HTML.
I have an html page like this:
<td class="subject windowbg2">
<div>
<span id="msg_152617">
<a href= SOME INFO THAT I WANT </a>
</span>
</div>
<div>
<span id="msg_465412">
<a href= SOME INFO THAT I WANT</a>
</span>
</div>
As you can see, the id="msg_465412" has a variable number, so this is my code:
import urllib.request, http.cookiejar,re
from bs4 import BeautifulSoup
contenturl = "http://megahd.me/peliculas-microhd/"
htmll=urllib.request.urlopen(contenturl).read()
soup = BeautifulSoup(htmll)
print (soup.find('span', attrs=re.compile(r"{'id': 'msg_\d{6}'}")))
In the last line I tried to find all the span tags whose id matches msg_###### (with any number), but something is wrong in my code and it doesn't find anything.
P.S.: everything I want is in a table with 6 columns, and I want the third column of every row, but I thought it was easier to use a regex.
You're a bit mixed up with your attrs argument ... at the moment it's a regex which contains the string representation of a dictionary, when it needs to be a dictionary containing the attribute you're searching for and a regex for its value.
This ought to work:
print (soup.find('span', attrs={'id': re.compile(r"msg_\d{6}")}))
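As a self-contained check of that fix, here's the find_all variant run against sample HTML adapted from the question (hrefs are made up for illustration):

```python
import re
from bs4 import BeautifulSoup

html_doc = '''
<span id="msg_152617"><a href="/topic/1">first link</a></span>
<span id="msg_465412"><a href="/topic/2">second link</a></span>
'''
soup = BeautifulSoup(html_doc, 'html.parser')

# attrs maps the attribute name to a compiled regex for its value
spans = soup.find_all('span', attrs={'id': re.compile(r'msg_\d{6}')})
hrefs = [s.a['href'] for s in spans]
print(hrefs)  # ['/topic/1', '/topic/2']
```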
Try using the following:
soup.find_all("span", id=re.compile(r"msg_\d{6}"))