wrong python html parsing


My code:
from bs4 import BeautifulSoup
import urllib.request
url = "http://yaz.tek.firat.edu.tr/tr/duyurular"
url_oku = urllib.request.urlopen(url)
soup = BeautifulSoup(url_oku, 'html.parser')
icerik = soup.find_all('div',attrs={'class':'views-row views-row-1 views-row-odd views-row-first'})
print(icerik)
My output:
[<div class="views-row views-row-1 views-row-odd views-row-first">
<span class="views-field views-field-title"> <span class="field-content">Grup-1, Grup-2, Grup-3, Grup-4 ve Grup-6 Öğrencileri İçin Staj Sunum Tarihleri</span> </span>
<span class="views-field views-field-created"> <span class="field-content"><i class="fa fa-calendar"></i> Salı, Aralık 5, 2017 - 09:58 </span> </span> </div>]
But I want to get only the title, "Grup-1, Grup-2, Grup-3, Grup-4 ve Grup-6 Öğrencileri İçin Staj Sunum Tarihleri". How can I achieve that?

You can call .text on each result returned by BeautifulSoup. It returns the text content of the element and its descendants, without the tags. Note that for this div the text includes the date as well as the title.
For example:
from bs4 import BeautifulSoup
import urllib.request
url = "http://yaz.tek.firat.edu.tr/tr/duyurular"
url_oku = urllib.request.urlopen(url)
soup = BeautifulSoup(url_oku, 'html.parser')
icerik = soup.find_all('div',attrs={'class':'views-row views-row-1 views-row-odd views-row-first'})
for result in icerik:
    print(result.text)
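If only the title is wanted, selecting the title span rather than the whole row may be closer to what the question asks for. The following is a minimal sketch, assuming the live page matches the markup shown in the question's output; the HTML is inlined so the example runs without network access, and the names title_tag and title are illustrative.

```python
from bs4 import BeautifulSoup

# HTML copied from the question's output (assumed representative of the live page).
html = """
<div class="views-row views-row-1 views-row-odd views-row-first">
<span class="views-field views-field-title"> <span class="field-content">Grup-1, Grup-2, Grup-3, Grup-4 ve Grup-6 Öğrencileri İçin Staj Sunum Tarihleri</span> </span>
<span class="views-field views-field-created"> <span class="field-content"><i class="fa fa-calendar"></i> Salı, Aralık 5, 2017 - 09:58 </span> </span> </div>
"""

soup = BeautifulSoup(html, "html.parser")

# Select only the title span inside the first row, so the date span is ignored.
title_tag = soup.select_one("div.views-row-first span.views-field-title")

# get_text(strip=True) drops the surrounding whitespace; guard against no match.
title = title_tag.get_text(strip=True) if title_tag else None
print(title)
```

Matching on the single class views-row-first is less brittle than matching the full class string, which stops matching if the site adds or reorders classes.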

You can also try this to get the title and link of each announcement on that page. It uses a CSS selector to find them:
from bs4 import BeautifulSoup
from urllib.parse import urljoin
import requests
url = "http://yaz.tek.firat.edu.tr/tr/duyurular"
res = requests.get(url)
soup = BeautifulSoup(res.text,'lxml')
for item in soup.select("#content .field-content a"):
    link = urljoin(url,item['href'])
    print("Title: {}\nLink: {}\n".format(item.text,link))
Partial output:
Title: 2017-2018 Güz Dönemi Final Sınav Programı (TASLAK)
Link: http://yaz.tek.firat.edu.tr/tr/node/481
Title: NETAŞ İşyeri Eğitimi Mülakatları Hakkında Duyuru
Link: http://yaz.tek.firat.edu.tr/tr/node/480

Related

Trying to get content of span in Python using BeautifulSoup

from bs4 import BeautifulSoup
url = 'C:\\Users\\Zandrio\\Documents\\Python-Selexion\\HTML-localhost\\Selexion.html'
page = open(url)
soup = BeautifulSoup(page.read(), features="lxml")
prettify = soup.prettify
Model = "".join([div.text for div in soup.find_all('div', {'class' : 'title-options'})])
print(Model)
Output:
PS C:\Users\Zandrio> & C:/Users/Zandrio/AppData/Local/Programs/Python/Python38/python.exe c:/Users/Zandrio/Documents/Requests/selexion.py
SQQE55Q90R
Merk:
Samsung Afdrukken
HTML:
<div class="title-options">
<span>
SQQE55Q90R
</span>
<span>
Merk: Samsung
</span>
<span class="print"> Afdrukken
</span>
</div>
I just want the Model number in this case, that is SQQE55Q90R here. Please suggest any solution.

from bs4 import BeautifulSoup
url = 'C:\\Users\\Zandrio\\Documents\\Python-Selexion\\HTML-localhost\\Selexion.html'
page = open(url)
soup = BeautifulSoup(page.read(), features="lxml")
div = soup.body.find('div', attrs={'class': 'title-options'})
model_number = div.span.text.strip() # text of first span
print(model_number)
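The same idea can be tried without the local file. This is a rough sketch, assuming the div looks exactly like the HTML posted in the question; the markup is inlined, and the guards against a missing div or span are an addition that the answer above does not have.

```python
from bs4 import BeautifulSoup

# HTML copied from the question; the real file may contain more markup around it.
html = """
<div class="title-options">
<span>
SQQE55Q90R
</span>
<span>
Merk: Samsung
</span>
<span class="print"> Afdrukken
</span>
</div>
"""

soup = BeautifulSoup(html, "html.parser")
div = soup.find("div", class_="title-options")

# find() returns None when nothing matches, so check before going deeper.
first_span = div.find("span") if div else None

# The first span holds the model number; strip=True removes the newlines.
model_number = first_span.get_text(strip=True) if first_span else None
print(model_number)
```

This relies on the model number always being in the first span, which is an assumption based on the single sample shown.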

Can't access a tweet id with beautiful soup

My goal is to retrieve the ids of tweets in a twitter search as they are being posted. My code so far looks like this:
import requests
from bs4 import BeautifulSoup
keys = some_key_words + " -filter:retweets AND -filter:replies"
query = "https://twitter.com/search?f=tweets&vertical=default&q=" + keys + "&src=typd&lang=es"
req = requests.get(query).text
soup = BeautifulSoup(req, "lxml")
for tweets in soup.findAll("li",{"class":"js-stream-item stream-item stream-item"}):
    print(tweets)
However, this doesn't return anything. Is there a problem with the code itself or am I looking at the wrong place of the source code? I understand that the ids should be stored here:
<div class="stream">
<ol class="stream-items js-navigable-stream" id="stream-items-id">
<li class="js-stream-item stream-item stream-item" data-item-id="1210306781806833664" id="stream-item-tweet-1210306781806833664" data-item-type="tweet">

from bs4 import BeautifulSoup
data = """
<div class="stream">
<ol class="stream-items js-navigable-stream" id="stream-items-id">
<li class="js-stream-item stream-item stream-item
" data-item-id="1210306781806833664"
id="stream-item-tweet-1210306781806833664"
data-item-type="tweet"
>
...
"""
soup = BeautifulSoup(data, 'html.parser')
for item in soup.findAll("li", {'class': 'js-stream-item stream-item stream-item'}):
    # .get() returns None instead of raising if the attribute is missing
    print(item.get("data-item-id"))
Output:
1210306781806833664
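A CSS selector is another way to read the attribute. The sketch below assumes the attribute is named data-item-id, as in the question's sample, and that the page source actually contains this markup; it does not address whether a plain requests.get call receives the same HTML that a browser shows. The name tweet_ids is illustrative.

```python
from bs4 import BeautifulSoup

# Sample markup based on the question, closed off so it is well-formed.
html = """
<div class="stream">
<ol class="stream-items js-navigable-stream" id="stream-items-id">
<li class="js-stream-item stream-item stream-item" data-item-id="1210306781806833664" id="stream-item-tweet-1210306781806833664" data-item-type="tweet">
</li>
</ol>
</div>
"""

soup = BeautifulSoup(html, "html.parser")

# Match <li> elements that carry the js-stream-item class AND a data-item-id
# attribute, then read that attribute from each match.
tweet_ids = [li["data-item-id"] for li in soup.select("li.js-stream-item[data-item-id]")]
print(tweet_ids)
```

If the list comes back empty on the real page, printing the first part of the response text is a quick way to check whether the expected list items are present at all.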

how to extract an attribute value of div using BeautifulSoup

I have a div whose id is "img-cont"
<div class="img-cont-box" id="img-cont" style='background-image: url("http://example.com/example.jpg");'>
I want to extract the URL in background-image using Beautiful Soup. How can I do it?

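You can use find_all, or find for the first match.
import re
from bs4 import BeautifulSoup
soup = BeautifulSoup(html_str, 'html.parser')  # html_str holds the page's HTML
# Match the div by id, and only if it has a style attribute
result = soup.find('div',attrs={'id':'img-cont','style':True})
if result is not None:
    url = re.findall(r'\("(http.*)"\)',result['style']) # returns a list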

Try this:
import re
from bs4 import BeautifulSoup
html = '''\
<div class="img-cont-box" \
id="img-cont" \
style='background-image: url("http://example.com/example.jpg");'>\
'''
soup = BeautifulSoup(html, 'html.parser')
div = soup.find('div', id='img-cont')
print(re.search(r'url\("(.+)"\)', div['style']).group(1))
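Both answers assume the URL is wrapped in double quotes. CSS also allows single quotes or no quotes inside url(), so a slightly more tolerant pattern might look like the sketch below. The pattern URL_PATTERN and the variable image_url are illustrative names, and the regex is a simplification that does not handle escaped characters or several url() values in one style.

```python
import re
from bs4 import BeautifulSoup

html = """<div class="img-cont-box" id="img-cont" style='background-image: url("http://example.com/example.jpg");'></div>"""

soup = BeautifulSoup(html, "html.parser")
div = soup.find("div", id="img-cont")

# url( ... ) with optional single or double quotes around the value.
URL_PATTERN = re.compile(r"""url\(\s*['"]?(.*?)['"]?\s*\)""")

# Fall back to an empty style string if the attribute is absent.
match = URL_PATTERN.search(div.get("style", "")) if div else None
image_url = match.group(1) if match else None
print(image_url)
```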

Iterate through the resultset bs4

I have used bs4 to extract this result set.
<div>
<div>
</div>
Content 1
</div>
<div>
Content 2
</div>
I am trying to extract these 2 elements.
Moi not cute not hot, the ugly bui bui type 1 and Actually, moi also dun know
from bs4 import BeautifulSoup
import urllib
import re
r = urllib.urlopen(
'http://forums.hardwarezone.com.sg/eat-drink-man-woman-16/%5Bofficial%5D-chit-chat-students-part-2-a-5526993-55.html').read()
soup = BeautifulSoup(r, "lxml")
letters = soup.find_all("div", attrs={"id":re.compile("post_message_\d+")})
Here is my code. However, how do I iterate through the result set so that it extracts only the content just before the closing div?
letters.find_all('div') returns an empty set.

All the messages (Python 3 syntax):
from bs4 import BeautifulSoup
import urllib.request
import re
r = urllib.request.urlopen(
    'http://forums.hardwarezone.com.sg/eat-drink-man-woman-16/%5Bofficial%5D-chit-chat-students-part-2-a-5526993-55.html').read()
soup = BeautifulSoup(r, "lxml")
letters = soup.find_all("div", attrs={"id":re.compile(r"post_message_\d+")})
for a in letters:
    # Split each post into its non-empty, stripped lines
    print([b.strip() for b in a.text.strip().split('\n') if b.strip()])
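The answer above prints all text in each post, including text from nested divs. If only the text that sits directly inside each matched div is wanted, as the question's "Content 1" and "Content 2" example suggests, one option is to keep just the string children. This is a sketch under assumptions: the ids post_message_1 and post_message_2, the "quoted text" filler, and the helper own_text are made up for illustration.

```python
import re
from bs4 import BeautifulSoup, NavigableString

# Simplified stand-in for the forum markup (ids and filler text are hypothetical).
html = """
<div id="post_message_1">
<div>
quoted text
</div>
Content 1
</div>
<div id="post_message_2">
Content 2
</div>
"""

soup = BeautifulSoup(html, "html.parser")
posts = soup.find_all("div", id=re.compile(r"post_message_\d+"))


def own_text(tag):
    """Join only the strings that are direct children of tag, skipping nested tags."""
    parts = (str(child).strip() for child in tag.children
             if isinstance(child, NavigableString))
    return " ".join(part for part in parts if part)


# find_all returns a list-like ResultSet, so iterate over it tag by tag.
messages = [own_text(post) for post in posts]
print(messages)
```

Note that find_all must be called on each tag inside the loop; the ResultSet itself has no find_all method, which is why calling it on letters directly does not work.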

Using BeautifulSoup to extract <span> WITH tags

How can I properly extract the value of a <span> WITH the <br/> tags?
i.e.
from bs4 import BeautifulSoup
html_text = '<span id="spamANDeggs">This is<br/>what<br/>I want. WITH the <br/> tags.</span>'
soup = BeautifulSoup(html_text)
text_wanted = soup.find('span',{'id':'spamANDeggs'}).GetText(including<br/>...)

You can use the decode_contents() method, which returns the markup inside a tag as a string:
from bs4 import BeautifulSoup
html_text = '<span id="spamANDeggs">This is<br/>what<br/>I want. WITH the <br/> tags.</span>'
soup = BeautifulSoup(html_text, 'html.parser')
text_wanted = soup.find('span', {'id': 'spamANDeggs'}).decode_contents(formatter="html")
Now text_wanted equals "This is<br/>what<br/>I want. WITH the <br/> tags."
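To make the difference concrete, here is a small comparison of three ways to read the same span. It is a sketch using the question's own HTML with the built-in html.parser; the variable names plain, inner, and outer are illustrative.

```python
from bs4 import BeautifulSoup

html_text = '<span id="spamANDeggs">This is<br/>what<br/>I want. WITH the <br/> tags.</span>'
soup = BeautifulSoup(html_text, "html.parser")
span = soup.find("span", id="spamANDeggs")

plain = span.get_text()         # text only; the <br/> tags are dropped
inner = span.decode_contents()  # markup inside the span; <br/> tags are kept
outer = str(span)               # like inner, but wrapped in the <span> itself

print(plain)
print(inner)
print(outer)
```

decode_contents() is the one that matches the question: it keeps the child tags but leaves out the enclosing span.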
