My beautiful soup code does not work as expected

This is my project, and it does not work:
from bs4 import BeautifulSoup
import requests
import lxml
import html
html_txt = requests.get("http://transfer.ttc.com.ge/?page=live&setLng=ka")
soup = BeautifulSoup(html_text, "lxml")
job = soup.find("tr", class_= "text-left right txt-td")
print(job)

First, you can't build a soup directly from a requests Response object; add .text to get the HTML as a string:
html_text = requests.get("http://transfer.ttc.com.ge/?page=live&setLng=ka").text
Second, you assign the variable as html_txt but then read it as html_text, which raises a NameError.
Try this:
from bs4 import BeautifulSoup
import requests
# .text is the decoded HTML string; the "lxml" parser needs the lxml package installed
html_text = requests.get("http://transfer.ttc.com.ge/?page=live&setLng=ka").text
soup = BeautifulSoup(html_text, "lxml")
job = soup.find("tr", class_="text-left right txt-td")
print(job)
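If find still returns None after these fixes, one thing worth checking is how the multi-class lookup behaves. The following is a minimal sketch on a made-up HTML snippet (not the real page): a class_ string containing spaces matches only when the class attribute is exactly that string, in that order, whereas a CSS selector matches the same classes in any order.

```python
from bs4 import BeautifulSoup

# Hypothetical markup for illustration; the real page may look different.
sample_html = """
<table>
  <tr class="right text-left txt-td"><td>row A</td></tr>
  <tr class="text-left right txt-td"><td>row B</td></tr>
</table>
"""

soup = BeautifulSoup(sample_html, "html.parser")

# class_ with spaces is compared against the full attribute string,
# so only the row whose classes appear in exactly this order matches.
exact = soup.find_all("tr", class_="text-left right txt-td")

# A CSS selector checks each class separately, so the order does not matter.
any_order = soup.select("tr.text-left.right.txt-td")

print(len(exact), len(any_order))
```

If the order of the classes on the real page is not guaranteed, the select form is probably the safer choice.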

Related

How do I exclude certain beautifulsoup results that I don't want?

I am having trouble excluding some of the results from my Beautiful Soup program. This is my code:
from bs4 import BeautifulSoup
import requests
URL = 'https://en.wikipedia.org/wiki/List_of_Wikipedia_mobile_applications'
page = requests.get(URL)
soup = BeautifulSoup(page.content, 'html.parser')
for link in soup.find_all('a'):
    print(link.get('href'))
I don't want the results that start with "#", for example #cite_ref-18.
I have tried using for loops, but I get this error message: KeyError: 0

You can use the str.startswith() method:
from bs4 import BeautifulSoup
import requests
URL = 'https://en.wikipedia.org/wiki/List_of_Wikipedia_mobile_applications'
page = requests.get(URL)
soup = BeautifulSoup(page.content, 'html.parser')
for tag in soup.find_all('a'):
    link = tag.get('href')
    if not str(link).startswith('#'):
        print(link)

You can use the CSS selector a[href]:not([href^="#"]). It selects all <a> tags that have an href attribute, except those whose href starts with the # character:
import requests
from bs4 import BeautifulSoup
URL = 'https://en.wikipedia.org/wiki/List_of_Wikipedia_mobile_applications'
page = requests.get(URL)
soup = BeautifulSoup(page.content, 'html.parser')
for link in soup.select('a[href]:not([href^="#"])'):
    print(link['href'])
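The two approaches can be compared offline. The sketch below assumes a small invented HTML snippet in place of the Wikipedia page, and uses find_all('a', href=True) in the first variant so that tags without an href are skipped rather than reported as None.

```python
from bs4 import BeautifulSoup

# Hypothetical snippet standing in for the downloaded page.
sample_html = """
<p>
  <a href="#cite_ref-18">back</a>
  <a href="/wiki/Main_Page">main</a>
  <a name="anchor-only">no href</a>
  <a href="https://example.org/">external</a>
</p>
"""

soup = BeautifulSoup(sample_html, "html.parser")

# Approach 1: check each href, considering only tags that have one.
with_startswith = [
    tag["href"]
    for tag in soup.find_all("a", href=True)
    if not tag["href"].startswith("#")
]

# Approach 2: let the CSS selector do the filtering.
with_selector = [tag["href"] for tag in soup.select('a[href]:not([href^="#"])')]

print(with_startswith)
print(with_selector)
```

Both lists contain the same two links here, so the choice between them is mostly a matter of taste.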

Unable to retrieve crawling information

As you can see from the result screen in the picture, the class name is correct and there seems to be no mistake. But I'm not getting any results.
from bs4 import BeautifulSoup
from urllib.request import urlopen
response = urlopen("https://www.naver.com")
soup = BeautifulSoup(response, 'html.parser')
for anchor in soup.select('span .realtime_item'):
    print(anchor)

The site can no longer be crawled this way.

It worked for me:
from bs4 import BeautifulSoup
from urllib.request import urlopen
response = urlopen("https://www.naver.com")
soup = BeautifulSoup(response, 'html.parser')
# select by class only, without requiring a span ancestor
for anchor in soup.select('.realtime_item'):
    print(anchor)
    print("\n\n")

You are not getting any data because no span contains an element with the class realtime_item. Print the soup first to check whether the value is present, and only then run select:
from bs4 import BeautifulSoup
from urllib.request import urlopen
response = urlopen("https://www.naver.com")
soup = BeautifulSoup(response, 'html.parser')
print(soup)
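The difference between the selectors in this thread comes down to one space. Here is a small sketch on invented markup (only the class name is taken from the question): 'span .realtime_item' matches elements nested inside a span, 'span.realtime_item' matches spans that carry the class themselves, and '.realtime_item' matches any element with the class.

```python
from bs4 import BeautifulSoup

# Hypothetical markup; only the class name comes from the question.
sample_html = """
<div>
  <span class="realtime_item">on the span itself</span>
  <span><b class="realtime_item">inside a span</b></span>
  <li class="realtime_item">not related to any span</li>
</div>
"""

soup = BeautifulSoup(sample_html, "html.parser")


def texts(selector):
    """Return the text of every element matched by a CSS selector."""
    return [el.get_text() for el in soup.select(selector)]


print(texts("span .realtime_item"))  # descendants of a span
print(texts("span.realtime_item"))   # spans that carry the class
print(texts(".realtime_item"))       # any element with the class
```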

locating child element by BeautifulSoup

I am new to BeautifulSoup and am practicing with small tasks. Here I am trying to get the "previous" link on this site. The HTML is here.
My code is:
import requests, bs4
from bs4 import BeautifulSoup
url = 'https://www.xkcd.com/'
page = requests.get(url)
soup = BeautifulSoup(page.content, 'html.parser')
result = soup.find('div', id="comic")
url2 = result.find('ul', class_='comicNav').find('a', rel='prev').find('href')
But it shows NoneType. I have read some posts about child elements in HTML and tried a few different things, but it still does not work. Thank you in advance for your help.

You could use a CSS selector instead:
import requests, bs4
from bs4 import BeautifulSoup
url = 'https://www.xkcd.com/'
page = requests.get(url)
soup = BeautifulSoup(page.content, 'html.parser')
result = soup.select('.comicNav a[rel~="prev"]')[0]
print(result)
If you want just the href, change the line to:
result = soup.select('.comicNav a[rel~="prev"]')[0]["href"]

To get the prev link, find the ul tag and then the a tag inside it. Try the code below:
import requests, bs4
from bs4 import BeautifulSoup
url = 'https://www.xkcd.com/'
page = requests.get(url)
soup = BeautifulSoup(page.content, 'html.parser')
url2 = soup.find('ul', class_='comicNav').find('a',rel='prev')['href']
print(url2)
Output:
/2254/
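Part of the original problem is that .find('href') looks for a child tag named href, while href is an attribute. The sketch below uses invented navigation markup, loosely modelled on the question, to show the difference between the two lookups.

```python
from bs4 import BeautifulSoup

# Hypothetical navigation markup, loosely modelled on the question.
sample_html = """
<ul class="comicNav">
  <li><a rel="prev" href="/2254/">&lt; Prev</a></li>
  <li><a rel="next" href="#">Next &gt;</a></li>
</ul>
"""

soup = BeautifulSoup(sample_html, "html.parser")
prev_link = soup.select_one('.comicNav a[rel~="prev"]')

# find() searches for a child *tag* named "href", and there is none.
as_tag = prev_link.find("href")

# Attributes are read like dictionary entries.
as_attribute = prev_link["href"]

# .get() returns None instead of raising KeyError when the attribute is missing.
missing = prev_link.get("title")

print(as_tag, as_attribute, missing)
```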

How to reach deeper divs inside a <span> tag using a Python crawler?

The body tag has a <span> tag, and there are many other divs inside that span. I want to go deeper, but when I try this code:
from bs4 import BeautifulSoup
from urllib.request import urlopen
url = 'https://www.instagram.com/artfido/'
data = urlopen(url)
soup = BeautifulSoup(data, 'html.parser')
result = soup.body.span
print (result)
the result was just this:
<span id="react-root"></span>
How can I reach the divs inside the span tag?
Is it possible to parse the <span> tag? If so, why am I not able to parse it?
When I use this:
result = soup.body.span.contents
The output was:
[]

As discussed in the comments, urlopen(url) returns a file-like object, which means that you need to read from it to get what's inside it.
from bs4 import BeautifulSoup
from urllib.request import urlopen
url = 'https://www.instagram.com/artfido/'
data = urlopen(url)
soup = BeautifulSoup(data.read(), 'html.parser')
result = soup.body.span
print (result)
The code I used for my Python 2.7 setup:
from bs4 import BeautifulSoup
import urllib2
url = 'https://www.instagram.com/artfido/'
data = urllib2.urlopen(url)
soup = BeautifulSoup(data.read(), 'lxml')
result = soup.body.span
print result
EDIT
For future reference, if you want something simpler for handling the URL, there is a package called requests. In this case the code is similar, but I find it easier to understand.
from bs4 import BeautifulSoup
import requests
url = 'https://www.instagram.com/artfido/'
r = requests.get(url)
data = r.text
soup = BeautifulSoup(data, 'lxml')
result = soup.body.span
print(result)
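One possible explanation that the answers above do not spell out, offered here as an assumption: the id react-root suggests the span is an empty placeholder that JavaScript fills in inside the browser, so the downloaded HTML has nothing in it however it is read. The sketch below uses a made-up page source and an in-memory file to show that BeautifulSoup builds the same tree from a string and from a file-like object, and that an empty span stays empty.

```python
import io

from bs4 import BeautifulSoup

# Hypothetical page source: the span is empty until a script fills it in a browser.
page_source = (
    '<html><body><span id="react-root"></span>'
    '<script src="app.js"></script></body></html>'
)

# BeautifulSoup accepts either a string or an open file-like object.
from_string = BeautifulSoup(page_source, "html.parser")
from_file = BeautifulSoup(io.BytesIO(page_source.encode("utf-8")), "html.parser")

print(from_string.body.span)           # the placeholder tag itself
print(from_string.body.span.contents)  # its children in the static HTML
print(str(from_string) == str(from_file))
```

If that assumption holds for the real page, a tool that runs the page's scripts (for example a browser driven by Selenium) would be needed to see the nested divs.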

Extracting table info using BeautifulSoup (bs4)

Could anyone please give me a snippet of BeautifulSoup code to extract some of the items in the table found here?
Here's my attempt:
from bs4 import BeautifulSoup
from urllib2 import urlopen
url = "http://biology.burke.washington.edu/conus/accounts/../recordview/record.php?ID=1ll&tabs=21100111&frms=1&res=&pglimit=A"
html = urlopen(url).read()
soup = BeautifulSoup(html,"lxml")
tables = soup.findAll("table")
However, this is failing -- tables turns out to be empty.
Sorry, I'm a BeautifulSoup noob.
Thanks!

The page at the given URL does not contain any table element in its source.
The table is generated by JavaScript inside an iframe. The code below requests the description.php URL instead:
from urllib.request import urlopen  # Python 3; on Python 2 use urllib.urlopen
from bs4 import BeautifulSoup
url = 'http://biology.burke.washington.edu/conus/recordview/description.php?ID=1l9l0l421l55llll&tabs=21100111&frms=1&pglimit=A&offset=&res=&srt=&sql2='
html = urlopen(url).read()
soup = BeautifulSoup(html, 'lxml')  # name the parser explicitly
tables = soup.find_all('table')
#print(tables)
Selenium solution (written for the Selenium 4 API; older releases used switch_to_frame and find_elements_by_tag_name):
from selenium import webdriver
from selenium.webdriver.common.by import By
from bs4 import BeautifulSoup
url = "http://biology.burke.washington.edu/conus/accounts/../recordview/record.php?ID=1ll&tabs=21100111&frms=1&res=&pglimit=A"
driver = webdriver.Firefox()
driver.get(url)
# switch into the first iframe before reading page_source
driver.switch_to.frame(driver.find_elements(By.TAG_NAME, 'iframe')[0])
soup = BeautifulSoup(driver.page_source, 'lxml')
tables = soup.find_all('table')
#print(tables)
driver.quit()
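Once a table element is actually present in the soup, pulling items out of it is mostly a matter of walking the rows and cells. The following is a minimal sketch on an invented table; the field names and values are placeholders and are not taken from the page in the question.

```python
from bs4 import BeautifulSoup

# Hypothetical table; the field names and values are invented for illustration.
sample_html = """
<table>
  <tr><th>Field</th><th>Value</th></tr>
  <tr><td>Genus</td><td>Conus</td></tr>
  <tr><td>Locality</td><td> unknown </td></tr>
</table>
"""

soup = BeautifulSoup(sample_html, "html.parser")

rows = []
for tr in soup.find("table").find_all("tr"):
    # Collect header and data cells alike, trimming surrounding whitespace.
    cells = [cell.get_text(strip=True) for cell in tr.find_all(["th", "td"])]
    rows.append(cells)

header, records = rows[0], rows[1:]
print(header)
print(records)
```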

This is my current workflow:
from bs4 import BeautifulSoup
from urllib2 import urlopen
url = "http://somewebpage.com"
html = urlopen(url).read()
soup = BeautifulSoup(html)
tables = soup.find_all('table')
