lxml Python script: how can I remove the count of a duplicate id?

OK, so I'm stuck on how to work around this issue. This is just a private counter of online people for a game.
After some research I got down to the code below, to which I added a bit on the search to get the count of all the images with on.png ...and it does actually work!
from lxml import etree
import requests

def get_img_cnt(url):
    response = requests.get(url)
    parser = etree.HTMLParser()
    root = etree.fromstring(response.content, parser=parser)
    # XPath count() returns a float, so cast it to int
    return int(root.xpath('count(//img[@src="pics/on.png"])'))
Now my frustration is that "on.png" is counted twice when the Guild Master is online.
Can anyone think of a way to get around that? This is part of the HTML:
<tr><td class='tabellatitolo a_dx' style=' padding:10px;' >Master
<td class='tabelladati' style=' padding:10px;' ><img align=absmiddle src='pics/on.png'>
<a href='?f=pg&id=55110'>Modernist</a>
<tr><td class='tabellatitolo a_dx' style=' padding:10px;' >Membri<p>(5)
<td class='tabelladati' style=' padding:10px;' ><img align=absmiddle src='pics/on.png'>
<a href='?f=pg&id=55110'>Modernist</a> - <br><img align=absmiddle src='pics/off.png'>
<a href='?f=pg&id=232720'>Human Slayer</a> - <i>Ti stimo!</i><br>
<img align=absmiddle src='pics/off.png'> <a href='?f=pg&id=68194'>Juggernaut</a><br>
<img align=absmiddle src='pics/off.png'> <a href='?f=pg&id=67121'>XeDiOr ThE KoOl</a><br>
<img align=absmiddle src='pics/on.png'> <a href='?f=pg&id=142638'>Lisbet Irmgard</a><br>
I was thinking maybe to use the context position, or maybe to leverage that "Membri" (members) label?
Thanks, any hint will be appreciated :)

I'm going to give a more brutal but possibly simpler answer:
import re
import requests

def get_img_cnt(url):
    response = requests.get(url)
    # just take the bit after the 'Membri' section
    # (use .text rather than .content so we split a str, not bytes)
    member_content = response.text.split('>Membri<')[1]
    # count the number of times the online image appears
    return len(re.findall('pics/on.png', member_content))
How well this works will depend on the rest of the HTML (which you haven't provided). I'd generally go for string searching (like this) before I start doing HTML parsing. If it works, it's a simpler and faster solution.
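If you'd rather stay with lxml, here is a minimal sketch that anchors the count on the "Membri" cell the question mentions; it assumes the row layout shown in the fragment above (title cell followed by the data cell):

from lxml import etree
import requests

def get_member_online_cnt(url):
    response = requests.get(url)
    root = etree.fromstring(response.content, etree.HTMLParser())
    # count on.png only inside the data cell that follows the 'Membri' title cell
    return int(root.xpath(
        'count(//td[@class="tabellatitolo a_dx"][contains(., "Membri")]'
        '/following-sibling::td[1]//img[@src="pics/on.png"])'))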

Related

beautifulsoup Case Insensitive?

I was reading: Is it possible for BeautifulSoup to work in a case-insensitive manner?
But it's not what I actually needed. I'm looking for all img tags in a webpage, including IMG, Img, etc.
This code:
images = soup.findAll('img')
will only look for img tags case-sensitively, so how can I solve this without adding a new line for every single possibility (and maybe forgetting some)?
Please note that the question linked above isn't about the tag but its properties.
BeautifulSoup is not case sensitive per se; just give it a try. If some information is missing from your result, there is probably another issue. You could force case-sensitive parsing by using the xml parser if needed in some cases.
Note: in newer code avoid the old syntax findAll(); use find_all() instead. For more, take a minute to check the docs.
Example
from bs4 import BeautifulSoup
html = '''
<img src="" alt="lower">
<IMG src="" alt="upper">
<iMG src="" alt="mixed">
'''
soup = BeautifulSoup(html, 'html.parser')  # specifying a parser avoids a warning
soup.find_all('img')
Output
[<img alt="lower" src=""/>,
<img alt="upper" src=""/>,
<img alt="mixed" src=""/>]

How to extract specific string on a web page using Python

Here's the complete HTML code of the page I'm trying to scrape, so please take a look first: https://codepen.io/bendaggers/pen/LYpZMNv
As you can see, this is the page source of mbasic.facebook.com.
What I'm trying to do is scrape all the anchor tags that have a pattern like this:
Example
<a class="cf" href="/profile.php?id=100044454444312&fref=fr_tab">
Example with wildcard:
<a class="cf" href="*">
I decided to add a wildcard identifier after href="*" since the values are dynamic.
Here's my (not working) Python code:
driver.get('https://mbasic.facebook.com/cheska.cabral.796/friends')
pagex = re.compile(driver.page_source)
pattern = "<a class=\"cf\" href=\"*\">"
print(pagex.findall(pattern))
Note that there are several patterns like this in the page, so I need to capture and print all of them.
<td class="w n" style="vertical-align: middle"><img src="https://scontent.fceb2-1.fna.fbcdn.net/v/t1.0-1/cp0/e15/q65/p50x50/79342209_112439723581175_5245034566049071104_o.jpg?_nc_cat=108&_nc_sid=dbb9e7&efg=eyJpIjoiYiJ9&_nc_ohc=lADKURnNsk4AX8WTS1F&_nc_ht=scontent.fceb2-1.fna&_nc_tp=3&oh=96f40cb2f95acbcfe9f6e4dc6cb31161&oe=5EC27AEB" class="bo s" alt="Natividad Cruz, profile picture" /></td>
<td class="w t" style="vertical-align: middle"><a class="cf" href="/profile.php?id=100044454444312&fref=fr_tab">Natividad Cruz</a>
<td class="w n" style="vertical-align: middle"><img src="https://scontent.fceb2-1.fna.fbcdn.net/v/t1.0-1/cp0/e15/q65/p50x50/10306248_10201945477974508_4213924286888352892_n.jpg?_nc_cat=109&_nc_sid=dbb9e7&efg=eyJpIjoiYiJ9&_nc_ohc=Z2daQ-qGgpsAX8BmLKr&_nc_ht=scontent.fceb2-1.fna&_nc_tp=3&oh=22f2b487166a7cd06e4ff650af4f7a7b&oe=5EC34325" class="bo s" alt="John Vinas, profile picture" /></td>
<td class="w t" style="vertical-align: middle"><a class="cf" href="/john.vinas?fref=fr_tab">John Vinas</a>
My goal is to findall the anchor tags and display them in the terminal. I'd appreciate your help on this. Thank you!
I tried another set of code, but no luck :)
driver.get('https://mbasic.facebook.com/cheska.cabral.796/friends')
pagex = driver.page_source
pattern = "<td class=\".*\" style=\"vertical-align: middle\"><a class=\".*\">"
x = re.findall(pattern, pagex)
print(x)
I think your wildcard match needs a dot in front, like .*
I'd also recommend using a library like Beautiful Soup for this; it might make your life easier.
You should use a parsing library, such as BeautifulSoup or requests-html. If you want to do it manually, then build on the second attempt you posted. The first won't get you what you want because you are compiling the entire page as a regular expression.
import re
s = """<a class="cf" href="/profile.php?id=100044454444312&fref=fr_tab">\n\n<h1>\n<a class="cf" href="/profile.php?id=20004666644312&fref=fr_tab">"""
# non-greedy match for anchors whose class is "cf" and whose href contains "profile"
patt = r'<a.*?class[="]{2}cf.*?href.*?profile.*?>'
matches = re.findall(patt, s)
Output
>>>matches
['<a class="cf" href="/profile.php?id=100044454444312&fref=fr_tab">',
'<a class="cf" href="/profile.php?id=20004666644312&fref=fr_tab">']
As mentioned by the previous respondent, BeautifulSoup is the best that's available in Python to scrape web pages. To import Beautiful Soup and other libraries, use the following commands:
from urllib.request import Request, urlopen
from bs4 import BeautifulSoup
After this, the below set of commands should solve your purpose:
req = Request(url, headers={'User-Agent': 'Chrome/64.0.3282.140'})
result = urlopen(req).read()
soup = BeautifulSoup(result, "html.parser")
atags = soup('a')  # calling the soup object is shorthand for soup.find_all('a')
url in the above commands is the link you want to scrape, and the headers argument takes your browser specs/version.
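Tying those suggestions together, a minimal BeautifulSoup sketch over the fragment posted above (the class-based selector is an assumption drawn from that fragment):

from bs4 import BeautifulSoup

html = '''<td class="w t"><a class="cf" href="/profile.php?id=100044454444312&fref=fr_tab">Natividad Cruz</a></td>
<td class="w t"><a class="cf" href="/john.vinas?fref=fr_tab">John Vinas</a></td>'''

soup = BeautifulSoup(html, 'html.parser')
# match every anchor with class "cf", whatever its dynamic href is
for a in soup.find_all('a', class_='cf'):
    print(a.get('href'), a.get_text())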

Python - XPath issue while scraping the IMDb Website

I am trying to scrape movies on IMDb using Python, and I can get data about all the important aspects except the actors' names.
Here is a sample URL that I am working on:
https://www.imdb.com/title/tt0106464/
Using the "Inspect" browser functionality I found the XPath that relates to all actors names, but when it comes to run the code on Python, it looks like the XPath is not valid (does not return anything).
Here is a simple version of the code I am using:
import requests
from lxml import html
movie_to_scrape = "https://www.imdb.com/title/tt0106464"
timeout_time = 5
IMDb_html = requests.get(movie_to_scrape, timeout=timeout_time)
doc = html.fromstring(IMDb_html.text)
actors = doc.xpath('//table[@class="cast_list"]//tbody//tr//td[not(contains(@class,"primary_photo"))]//a/text()')
print(actors)
I tried changing the XPath many times, making it more generic and then more specific, but it still does not return anything.
Don't blindly accept the markup structure you see using inspect element.
Browsers are very lenient and will try to fix any markup issues in the source.
With that being said, if you check the source using view source, you can see that the table you're trying to scrape has no <tbody>, as those are inserted by the browser.
So if you remove it from here:
//table[@class="cast_list"]//tbody//tr//td[not(contains(@class,"primary_photo"))]//a/text() -> //table[@class="cast_list"]//tr//td[not(contains(@class,"primary_photo"))]//a/text()
your query should work.
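Putting that together with the asker's script, a sketch of the corrected query (IMDb's page structure may have changed since this was written):

import requests
from lxml import html

movie_to_scrape = "https://www.imdb.com/title/tt0106464"
IMDb_html = requests.get(movie_to_scrape, timeout=5)
doc = html.fromstring(IMDb_html.text)

# same query as above, minus the browser-inserted <tbody>
actors = doc.xpath('//table[@class="cast_list"]//tr'
                   '//td[not(contains(@class,"primary_photo"))]//a/text()')
print([a.strip() for a in actors if a.strip()])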
From looking at the HTML, start with a simple XPath like //td[@class="primary_photo"]
<table class="cast_list">
<tr><td colspan="4" class="castlist_label">Cast overview, first billed only:</td></tr>
<tr class="odd">
<td class="primary_photo">
<a href="/name/nm0000418/?ref_=tt_cl_i1"
><img height="44" width="32" alt="Danny Glover" title="Danny Glover" src="https://m.media-amazon.com/images/G/01/imdb/images/nopicture/32x44/name-2138558783._CB470041625_.png" class="loadlate hidden " loadlate="https://m.media-amazon.com/images/M/MV5BMTI4ODM2MzQwN15BMl5BanBnXkFtZTcwMjY2OTI5MQ##._V1_UY44_CR1,0,32,44_AL_.jpg" /></a> </td>
<td>
PYTHON:
for photo in doc.xpath('//td[@class="primary_photo"]'):
    print(photo)
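A possible next step from that starting point, pulling the actor link out of the neighbouring cell (the sibling layout is an assumption based on the fragment above):

import requests
from lxml import html

doc = html.fromstring(requests.get("https://www.imdb.com/title/tt0106464", timeout=5).text)
# each primary_photo cell is followed by the cell holding the actor's <a> tag
for photo in doc.xpath('//td[@class="primary_photo"]'):
    name = photo.xpath('following-sibling::td[1]/a/text()')
    if name:
        print(name[0].strip())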

Python, BeautifulSoup finding HTML segment

I am a newbie just trying to follow the web-scraping example from Automate the Boring Stuff. What I'm trying to do is automate downloading images from phdcomics in one Python program that will:
1. find the link of the image in the HTML and download it, then
2. find the link for the previous page in the HTML and go there, repeating step 1 until the very first page.
For downloading the current page's image, the relevant segment of the HTML after printing soup.prettify() looks like this:
<meta content="Link to Piled Higher and Deeper" name="description">
<meta content="PHD Comic: Remind me" name="title">
<link
href="http://www.phdcomics.com/comics/archive/phd041218s.gif" rel="image_src">
<div class="jumbotron" style="background-color:#52697d;padding: 0em 0em 0em; margin-top:0px; margin-bottom: 0px; background-image: url('http://phdcomics.com/images/bkg_bottom_stuff3.png'); background-repeat: repeat-x;">
<div align="center" class="container-fluid" style="max-width: 1800px;padding-left: 0px; padding-right:0px;">
Then, when I write
newurl=soup.find('link', {'rel': "image_src"}).get('href')
it gives me what I need, which is
"http://www.phdcomics.com/comics/archive/phd041218s.gif"
In the next step, I want to find the previous-page link, which I believe is in the following part of the HTML:
<!-- Comic Table --!>
<table border="0" cellspacing="0" cellpadding="0">
<tr>
<td align="right" valign="top">
<a href=http://phdcomics.com/comics/archive.php?comicid=2004><img height=52 width=49 src=http://phdcomics.com/comics/images/prev_button.gif border=0 align=middle><br></a><font
face=Arial,Helvetica,Geneva,Swiss,SunSans-Regular size=-1><i><b>previous </b></i></font><br><br><a href=http://phdcomics.com/comics/archive.php?comicid=1995><img src=http://phdcomics.com/comics/images/jump_bck10.gif border=0></a><br><a href=http://phdcomics.com/comics/archive.php?comicid=2000><img src=http://phdcomics.com/comics/images/jump_bck5.gif border=0></a><br><font face=Arial,Helvetica,Geneva,Swiss,SunSans-Regular size=-1><i><b>jump</b></i></font><br><br><a href=http://phdcomics.com/comics/archive.php?comicid=1><img src=http://phdcomics.com/comics/images/first_button.gif border=0 align=middle><br></a><font face=Arial,Helvetica,Geneva,Swiss,SunSans-Regular size=-1><i><b>first</b></i></font><br><br> </td>
<td align="center" valign="top"><font color="black">
From this part of the code I want to extract
http://phdcomics.com/comics/archive.php?comicid=2004
as my previous-page link.
When I try something like this:
Prevlink=soup.find('a',{'src': 'http://phdcomics.com/comics/images/prev_button.gif'}).get('href')
print(Prevlink)
it gives me an error like this:
Prevlink=soup.find('a',{'src': 'http://phdcomics.com/comics/images/prev_button.gif'}).get('href')
AttributeError: 'NoneType' object has no attribute 'get'
Even when I try to do this:
Prevlink=soup.find('a',{'href': 'http://phdcomics.com/comics/archive.php?comicid=2004'}).get('href')
print(Prevlink)
I get a similar error:
Prevlink=soup.find('a',{'href': 'http://phdcomics.com/comics/archive.php?comicid=2004'}).get('href')
AttributeError: 'NoneType' object has no attribute 'get'
What should be the right way to get the right 'href'?
TIA
The problem is in the way comments are written in the HTML of PHD Comics.
If you look closely at the output of soup.prettify(), you will find comments like this:
<!-- Comic Table --!>
when it should be:
<!-- Comic Table -->
This causes BeautifulSoup to miss certain tags. There are many ways to parse and remove comments, such as using regex or bs4's Comment class, but it might be difficult to get them to work in this case. The easiest way is to fix the comment tags after collecting the HTML.
from bs4 import BeautifulSoup
import requests

url = "https://phdcomics.com/"
r = requests.get(url)
data = r.text
data = data.replace("--!>", "-->")  # fix the malformed comment closers
soup = BeautifulSoup(data, 'html.parser')
Prevlink = soup.find('a', {'href': 'http://phdcomics.com/comics/archive.php?comicid=2004'}).get('href')
print(Prevlink)
http://phdcomics.com/comics/archive.php?comicid=2004
Update:
To find the requested link automatically, we need to find the parent element of the prev_button.gif image and extract its link:
img_tag = soup.find('img', {'src': 'http://phdcomics.com/comics/images/prev_button.gif'})
print(img_tag.find_parent().get('href'))
http://phdcomics.com/comics/archive.php?comicid=2005
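For reference, a minimal sketch of the full loop the asker describes, built from the two snippets above (the selectors come from the posted fragments; phdcomics.com's markup may differ today):

import os
import requests
from bs4 import BeautifulSoup

url = 'https://phdcomics.com/'
while url:
    data = requests.get(url).text.replace('--!>', '-->')  # fix malformed comments
    soup = BeautifulSoup(data, 'html.parser')

    # step 1: download the comic image linked via <link rel="image_src">
    img_url = soup.find('link', {'rel': 'image_src'}).get('href')
    with open(os.path.basename(img_url), 'wb') as f:
        f.write(requests.get(img_url).content)

    # step 2: follow the previous-page button; stop when there isn't one
    prev_btn = soup.find('img', {'src': 'http://phdcomics.com/comics/images/prev_button.gif'})
    url = prev_btn.find_parent('a').get('href') if prev_btn else None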

I am not able to parse using Beautiful Soup

<td>
<a name="corner"></a>
<div>
<div style="aaaaa">
<div class="class-a">My name is alis</div>
</div>
<div>
<span><span class="class-b " title="My title"><span>Very Good</span></span> </span>
<b>My Description</b><br />
My Name is Alis I am a python learner...
</div>
<div class="class-3" style="style-2 clear: both;">
alis
</div>
</div>
<br /></td>
I want the description after scraping it:
My Name is Alis I am a python learner...
I tried a lot of things, but I could not figure out the best way. Can you give a general solution for this?
from BeautifulSoup import BeautifulSoup
soup = BeautifulSoup("Your sample html here")
soup.td.div('div')[2].contents[-1]
This will return the string you are looking for (the unicode string, with any applicable whitespace, it should be noted).
This works by parsing the html, grabbing the first td tag and its contents, grabbing any div tags within the first div tag, selecting the 3rd item in the list (list index 2), and grabbing the last of its contents.
In BeautifulSoup, there are A LOT of ways to do this, so this answer probably hasn't taught you much and I genuinely recommend you read the tutorial that David suggested.
Have you tried reading the examples provided in the documentation? The quick start is located here: http://www.crummy.com/software/BeautifulSoup/documentation.html#Quick Start
Edit:
To find a specific div, such as the one with class "class-a", you would load your HTML up via
from BeautifulSoup import BeautifulSoup
soup = BeautifulSoup("My html here")
myDiv = soup.find("div", { "class" : "class-a" })
Also remember you can do most of this from the Python console, using dir() along with help() to walk through what you're trying to do. It might make life easier to try out IPython or perhaps Python's IDLE, which have very friendly consoles for beginners.
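With the modern bs4 package, one way to pull the same description string (a sketch; it assumes the question's fragment, shortened here to the relevant div):

from bs4 import BeautifulSoup

html = """<div>
<span><span class="class-b" title="My title"><span>Very Good</span></span></span>
<b>My Description</b><br />
My Name is Alis I am a python learner...
</div>"""

soup = BeautifulSoup(html, 'html.parser')
# the description is the text node right after <b>My Description</b> and its <br/>
description = soup.find('b').next_sibling.next_sibling
print(description.strip())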