I am trying to figure out what I need to add to this code so that, after the URL source is read, I can eliminate everything except the text between the tags and then print the results:
import urllib.request
req = urllib.request.Request('http://myurlhere.com')
response = urllib.request.urlopen(req)
the_page = response.read()
print (the_page)
You would need an HTML parser.
Example using BeautifulSoup (it supports Python 3.x):
import urllib.request
from bs4 import BeautifulSoup
req = urllib.request.Request('http://onlinepermits.co.escambia.fl.us/CitizenAccess/Cap/CapDetail.aspx?Module=Building&capID1=14ACC&capID2=00000&capID3=00386&agencyCode=ESCAMBIA')
response = urllib.request.urlopen(req)
soup = BeautifulSoup(response, 'html.parser')
print(soup.find('td', id='ctl00_PlaceHolderMain_PermitDetailList1_owner').div.table.text)
Prints:
SNB HOTEL INC2607 WILDE LAKE BLVD PENSACOLA FL 32526
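If the goal is simply to drop all markup and keep only the text, rather than target one specific element, BeautifulSoup's get_text() does that directly. A minimal sketch against the original snippet (the URL is the placeholder from the question):
import urllib.request
from bs4 import BeautifulSoup

req = urllib.request.Request('http://myurlhere.com')
response = urllib.request.urlopen(req)

soup = BeautifulSoup(response, 'html.parser')
# get_text() discards the tags and returns only the text nodes;
# separator/strip keep the output readable
print(soup.get_text(separator='\n', strip=True))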
I need some help: for a project I have to parse information from a real estate website.
I am able to parse almost everything, but the page has a one-liner I have never seen before.
The code itself is too large to post in full, but here is an example snippet:
<div class="d-none" data-listing='{"strippedPhotos":[{"caption":"","description":"","urls":{"1920x1080":"https:\/\/ot.ingatlancdn.com\/d6\/07\/32844921_216401477_hd.jpg","800x600":"https:\/\/ot.ingatlancdn.com\/d6\/07\/32844921_216401477_l.jpg","228x171":"https:\/\/ot.ingatlancdn.com\/d6\/07\/32844921_216401477_m.jpg","80x60":"https:\/\/ot.ingatlancdn.com\/d6\/07
Can you please help me identify this, and maybe suggest a solution for how to parse all the contained info into a pandas DataFrame?
Edit: code added:
other = []

from bs4 import BeautifulSoup
from urllib.request import Request, urlopen

hdr = {'User-Agent': 'Mozilla/5.0'}
site = "https://ingatlan.com/xiii-ker/elado+lakas/tegla-epitesu-lakas/32844921"

req = Request(site, headers=hdr)
page = urlopen(req)
soup = BeautifulSoup(page, 'html.parser')
data = soup.find_all('div', id="listing", class_="d-none", attrs="data-listing")
data
You could access the value of the attribute and convert the string via json.loads():
data = json.loads(soup.find('div', id="listing", class_="d-none").get('data-listing'))
Then simply create your DataFrame via pandas.json_normalize():
pd.json_normalize(data['strippedPhotos'])
Example
Because the expected result is not clear, this should just point you in a direction:
from bs4 import BeautifulSoup
from urllib.request import Request, urlopen
import pandas as pd
import json

hdr = {'User-Agent': 'Mozilla/5.0'}
site = "https://ingatlan.com/xiii-ker/elado+lakas/tegla-epitesu-lakas/32844921"

req = Request(site, headers=hdr)
page = urlopen(req)
soup = BeautifulSoup(page, 'html.parser')

# the listing JSON is stored in the div's data-listing attribute
data = json.loads(soup.find('div', id="listing", class_="d-none").get('data-listing'))
### all data
pd.json_normalize(data)
### only strippedPhotos
pd.json_normalize(data['strippedPhotos'])
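Note that pandas.json_normalize() flattens nested objects into dot-separated column names by default, so each photo row should end up with columns such as caption, description and urls.800x600 (names taken from the snippet above).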
I am trying to scrape a page which is in French by converting it into English.
Here is my code, using the BeautifulSoup and requests packages in Python.
import requests
from bs4 import BeautifulSoup
url = '<url>'
headers = {"Accept-Language": "en,en-gb;q=0.5"}
r = requests.get(url, headers=headers)
c = r.content
soup = BeautifulSoup(c)
But this still gives the text in French.
Can anyone suggest changes or alternative code?
You can utilize TextBlob to translate strings into various languages. Here is an example that translates the spans from the French eBay site:
import requests
from bs4 import BeautifulSoup
from textblob import TextBlob

url = 'https://www.ebay.fr/'
french = []

r = requests.get(url)
c = r.content
soup = BeautifulSoup(c, 'html.parser')

# collect the text of every <span> on the page
for li in soup.find_all('span'):
    french.append(li.text)

# join with spaces so text from different spans does not run together
Frenchstr = ' '.join(french)
blob = TextBlob(Frenchstr)
print(Frenchstr)

Englishstr = blob.translate(to="en")
print('------------------------------------------------')
print(Englishstr)
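Note that TextBlob's translate() calls an online Google endpoint and has been deprecated in newer releases, so depending on your version it may no longer work. As an alternative, here is a minimal sketch using the third-party deep-translator package; the library choice and its GoogleTranslator API are assumptions on my part, not part of the original answer:
import requests
from bs4 import BeautifulSoup
from deep_translator import GoogleTranslator  # assumed alternative library

url = 'https://www.ebay.fr/'
r = requests.get(url)
soup = BeautifulSoup(r.content, 'html.parser')

# gather the span texts and translate them from French to English;
# very long pages may need to be translated in smaller chunks
french = ' '.join(span.text for span in soup.find_all('span'))
english = GoogleTranslator(source='fr', target='en').translate(french)
print(english)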
I have pulled the HTML code from a website, but I am not sure whether I have pulled all of it for some reason. Can anyone help?
import urllib.request
import re

# This requests the website URL
url = 'https://www.myvue.com/whats-on'
req = urllib.request.Request(url, headers={'User-Agent': 'Mozilla/5.0'})

def main():
    html_page = urllib.request.urlopen(req).read()
    content = html_page.decode(errors='ignore', encoding='utf-8')
    #data = re.findall('<span rv-text="item.title">(.*?)</span>', content)
    #print(data)
    print(html_page)

main()
The body tag has a <span> tag, and there are many other divs inside that span tag. I want to go deeper, but when I try this code:
from bs4 import BeautifulSoup
from urllib.request import urlopen
url = 'https://www.instagram.com/artfido/'
data = urlopen(url)
soup = BeautifulSoup(data, 'html.parser')
result = soup.body.span
print (result)
the result was just this:
<span id="react-root"></span>
How can I reach the divs inside the span tag?
Can we parse the <span> tag? Is it possible? If yes, why am I not able to parse the span?
Using this:
result = soup.body.span.contents
The output was:
[]
As discussed in the comments, urlopen(url) returns a file-like object, which means that you need to read from it if you want to get what's inside it.
from bs4 import BeautifulSoup
from urllib.request import urlopen
url = 'https://www.instagram.com/artfido/'
data = urlopen(url)
soup = BeautifulSoup(data.read(), 'html.parser')
result = soup.body.span
print (result)
The code I used for my Python 2.7 setup:
from bs4 import BeautifulSoup
import urllib2
url = 'https://www.instagram.com/artfido/'
data = urllib2.urlopen(url)
soup = BeautifulSoup(data.read(), 'lxml')
result = soup.body.span
print result
EDIT
For future reference, if you want something simpler for handling the URL, there is a package called requests. In this case it is similar, but I find it easier to understand.
from bs4 import BeautifulSoup
import requests
url = 'https://www.instagram.com/artfido/'
r = requests.get(url)
data = r.text
soup = BeautifulSoup(data, 'lxml')
result = soup.body.span
print result
I have this code:
import urllib
from bs4 import BeautifulSoup

url = "http://www.padtube.com/Audio-Files-Player/30-01-1-2.html"
pageurl = urllib.urlopen(url)
soup = BeautifulSoup(pageurl)

for b in soup.select("table#dl-tbl-list th a[href]"):
    print b['href']
When I run this code, it only gives me the links on the first page.
I can't get the application links on the next pages.
The site is using POST to go to the next page, so what you need is to send the page number via POST.
I did this via http://www.python-requests.org/:
import urllib
from bs4 import BeautifulSoup
import requests

url = "http://www.padtube.com/Audio-Files-Player/30-01-1-2.html"

#pageurl = urllib.urlopen(url)
# send the page number in the POST body to fetch the second page
pageurl = requests.post(url, data={'page': 2})
pageurl = pageurl.text

soup = BeautifulSoup(pageurl)
for b in soup.select("table#dl-tbl-list th a[href]"):
    print b['href']
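If you need the links from every page rather than just page 2, the same POST request can be repeated in a loop. This is a minimal sketch that assumes the page numbers run from 1 upward and that an empty result marks the end of the listing (both assumptions, not confirmed against the site):
from bs4 import BeautifulSoup
import requests

url = "http://www.padtube.com/Audio-Files-Player/30-01-1-2.html"
links = []
page = 1

while True:
    # request each page by sending its number in the POST body
    response = requests.post(url, data={'page': page})
    soup = BeautifulSoup(response.text, 'html.parser')
    found = [a['href'] for a in soup.select("table#dl-tbl-list th a[href]")]
    if not found:
        # assumption: an empty page means we ran past the last one
        break
    links.extend(found)
    page += 1

print(links)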