How to detect the language of webpage content using Python

I have to test a bunch of URLs to check whether the webpages contain the expected translated content. Is there any way to return the language of a webpage's content using Python? For example, if the page is in Chinese, it should return `"Chinese"`.
I tried the langdetect module but could not get the results I want. The URLs return XML, and the content appears under the <releasehigh> element.

Here is a simple example that uses BeautifulSoup to extract the HTML body text and langdetect to detect the language:
from bs4 import BeautifulSoup
from langdetect import detect
with open("foo.html", "rb") as f:
    soup = BeautifulSoup(f, "lxml")
for s in soup("script"):  # remove <script> elements so their code is not treated as text
    s.decompose()
body_text = soup.body.get_text()
print(detect(body_text))  # prints a language code such as "en"

You can extract a chunk of the content and then run a Python language-detection library such as langdetect or guess-language on it.
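For the XML case described in the question, the chunk to extract would be the text under <releasehigh>. Here is a minimal sketch using only the standard library, assuming the feed is well-formed XML. The sample document and the helper names `extract_text` and `detect_language` are made up for illustration, and the detector is passed in as a function so that any library can be plugged in:

```python
import xml.etree.ElementTree as ET

# Hypothetical sample; the real feed layout is not shown in the question.
SAMPLE_XML = """<release>
  <releasehigh>Ceci est un communiqué de presse.</releasehigh>
  <releasehigh>Il est rédigé en français.</releasehigh>
</release>"""


def extract_text(xml_string, tag="releasehigh"):
    """Return the text of every <tag> element, joined by spaces."""
    root = ET.fromstring(xml_string)
    chunks = ["".join(el.itertext()).strip() for el in root.iter(tag)]
    return " ".join(chunk for chunk in chunks if chunk)


def detect_language(xml_string, detector, tag="releasehigh"):
    """Run a caller-supplied detector function on the extracted text."""
    text = extract_text(xml_string, tag)
    return detector(text) if text else None


print(extract_text(SAMPLE_XML))
```

With langdetect installed, passing `langdetect.detect` as the `detector` argument should give a language code for the extracted text; mapping that code to a name such as "Chinese" is a separate step.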

Maybe the page has a header like this one:
<HTML xmlns="http://www.w3.org/1999/xhtml" xml:lang="fr" lang="fr">
If so, lang="fr" tells you that this is a French web page. If not, you have to guess the language from the text itself, which is not trivial.

You can use BeautifulSoup to extract the language from the HTML source code.
<html class="no-js" lang="cs">
Extract the lang attribute from the source code:
from bs4 import BeautifulSoup
import requests
url = "https://example.com"  # placeholder; replace with the page to check
html = requests.get(url).content
soup = BeautifulSoup(html, 'html.parser')
print(soup.html.get("lang"))  # None if the page declares no language
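Since the question asks for a name such as "Chinese" rather than a code, the declared lang value still has to be mapped to a name. The sketch below does this with the standard library's html.parser, so it runs without third-party packages. The `LANGUAGE_NAMES` table and the names `LangAttributeParser` and `declared_language` are assumptions made for this example, not part of any library:

```python
from html.parser import HTMLParser

# Small illustrative mapping; extend it with the languages you expect.
LANGUAGE_NAMES = {"zh": "Chinese", "fr": "French", "cs": "Czech", "en": "English"}


class LangAttributeParser(HTMLParser):
    """Record the lang (or xml:lang) attribute of the first <html> tag."""

    def __init__(self):
        super().__init__()
        self.lang = None

    def handle_starttag(self, tag, attrs):
        # HTMLParser lowercases tag and attribute names for us.
        if tag == "html" and self.lang is None:
            attributes = dict(attrs)
            self.lang = attributes.get("lang") or attributes.get("xml:lang")


def declared_language(html_text):
    """Return a readable language name, the raw code, or None if undeclared."""
    parser = LangAttributeParser()
    parser.feed(html_text)
    if not parser.lang:
        return None
    primary = parser.lang.split("-")[0].lower()  # "zh-CN" -> "zh"
    return LANGUAGE_NAMES.get(primary, parser.lang)


print(declared_language('<html class="no-js" lang="cs"><body></body></html>'))
```

When this returns None, the page declares no language and a text-based detector is the remaining option.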

Related

How to find all Elements of a specific Type with the new Requests-HTML library

I want to find all elements of a specific type in an HTML document. With Beautiful Soup this code works:
soup = BeautifulSoup(html_text, 'html.parser')
urls_previous = soup.find_all('h2', {'class': 'b_algo'})
How can I do the same search with the Requests-HTML library? Or can it only find a single element in an HTML document? I couldn't find how to do it in the docs or examples:
https://html.python-requests.org/
Example:
<li class="b_algo"><h2>Vereinigte Staaten – Wikipedia</h2>https://de.wikipedia.org/wiki/Vereinigte_Staaten</div><p>U.S., I wanna have THIS text here</p></li>
How can I find all elements of a specific type with Requests-HTML?

With requests-html, find() returns a list of all matching elements:
from requests_html import HTML
doc = """<li class="b_algo"><h2>Vereinigte Staaten – Wikipedia</h2>https://de.wikipedia.org/wiki/Vereinigte_Staaten</div><p>U.S., I wanna have THIS text here</p></li>"""
# load HTML from a string
html = HTML(html=doc)
x = html.find('h2')  # all <h2> elements
print(x)

Python HTML parsing: removing excess HTML from get request output

I want to write a simple Python script to automate pulling .mov files from an IP camera's SD card. This camera model supports HTTP requests that return HTML containing the .mov file info. My Python script so far:
from bs4 import BeautifulSoup
import requests
page = requests.get("http://192.168.1.99/form/getStorageFileList?type=3")
soup = BeautifulSoup(page.content, 'html.parser')
print(soup.prettify())
OUTPUT:
NAME2041=Record_continiously/2018-06-02/8/MP_2018-06-03_00-33-15_60.mov
I want to return only the .mov file name, removing:
"NAME2041=Record_continiously/2018-06-02/8/"
I'm new to HTML parsing with Python, so I'm a bit confused about how it works.
Is the returned HTML considered a string? If so, I understand that it is immutable and that I will have to create a new string instead of "stripping away" part of the existing one.
I have tried:
page.replace("NAME2041=Record_continiously/2018-06-02/8/","")
which raises an AttributeError. Is anyone aware of a method that could accomplish this?
Here is a sample of the HTML I am working with:
<html>
<head></head>
<body>
000 Success NUM=2039 NAME0=Record_Continuously/2018-06-04/10/MP_2018-06-04_12-17-38_60.mov SIZE0=15736218
NAME1=Record_Continuously/2018-06-04/10/MP_2018-06-04_12-16-37_60.mov SIZE1=15683077
NAME2=Record_Continuously/2018-06-04/10/MP_2018-06-04_12-15-36_60.mov SIZE2=15676882
NAME3=Record_Continuously/2018-06-04/10/MP_2018-06-04_12-14-35_60.mov SIZE3=15731539
</body>
</html>

Use str.split with negative indexing.
Example:
page = "NAME2041=Record_continiously/2018-06-02/8/MP_2018-06-03_00-33-15_60.mov"
print(page.split("/")[-1])
Output:
MP_2018-06-03_00-33-15_60.mov
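The same split can be applied to every entry in the response body rather than to a single line. Below is a small sketch, assuming every entry has the form NAME<n>=<path>.mov as in the sample from the question; the function name `mov_file_names` is made up for this example:

```python
import re

# Sample body text copied from the question.
BODY = """000 Success NUM=2039 NAME0=Record_Continuously/2018-06-04/10/MP_2018-06-04_12-17-38_60.mov SIZE0=15736218
NAME1=Record_Continuously/2018-06-04/10/MP_2018-06-04_12-16-37_60.mov SIZE1=15683077
NAME2=Record_Continuously/2018-06-04/10/MP_2018-06-04_12-15-36_60.mov SIZE2=15676882
NAME3=Record_Continuously/2018-06-04/10/MP_2018-06-04_12-14-35_60.mov SIZE3=15731539"""


def mov_file_names(text):
    """Return the bare .mov file name from every NAME<n>=<path> entry."""
    paths = re.findall(r"NAME\d+=(\S+\.mov)", text)
    # Keep only the part after the last slash, as in the answer above.
    return [path.split("/")[-1] for path in paths]


for name in mov_file_names(BODY):
    print(name)
```

In the real script, the text would come from the response (for example `page.text`) instead of the hard-coded `BODY` string.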

Since you asked for an explanation of your code, here it is:
# import statements
from bs4 import BeautifulSoup
import requests
page = requests.get("http://192.168.1.99/form/getStorageFileList?type=3") # returns a Response object
soup = BeautifulSoup(page.content, 'html.parser') # parses the response body
page.content returns the body of the response as bytes (page.text gives the decoded string). That is why page.replace(...) fails: page is a Response object, not a string.
You pass this content to the BeautifulSoup class, which is initialized with two arguments: the content (page.content) and the name of the parser, here html.parser.
soup is the resulting BeautifulSoup object.
.prettify() is a method that pretty-prints the content.
String slicing can give wrong results when the length of the content varies, so it is better to split the content as suggested by @Rakesh; that is the best approach in your case.

Using BeautifulSoup4 with Google Translate

I am currently going through the Web Scraping section of AutomateTheBoringStuff and trying to write a script that extracts translated words from Google Translate using BeautifulSoup4.
I inspected the html content of a page where 'Explanation' is the translated word:
<span id="result_box" class="short_text" lang="en">
<span class>Explanation</span>
</span>
Using BeautifulSoup4, I tried different selectors but nothing would return the translated word. Here are a few examples I tried, but they return no results at all:
soup.select('span[id="result_box"] > span')
soup.select('span span')
I even copied the selector directly from the Developer Tools, which gave me #result_box > span. This again returns no results.
Can someone explain to me how to use BeautifulSoup4 for my purpose? This is my first time using BeautifulSoup4 but I think I am using BeautifulSoup more or less correctly because the selector
soup.select('span[id="result_box"]')
gets me the outer span element**
[<span class="short_text" id="result_box"></span>]
**Not sure why the lang="en" part is missing, but I am fairly certain I have located the correct element regardless.
Here is the complete code:
import bs4, requests
url = 'https://translate.google.ca/#zh-CN/en/%E6%B2%BB%E5%85%B7'
res = requests.get(url)
res.raise_for_status()  # must be called; without the parentheses it does nothing
soup = bs4.BeautifulSoup(res.text, "html.parser")
translation = soup.select('#result_box span')
print(translation)
EDIT: If I save the Google Translate page as an offline html file and then make a soup object out of that html file, there would be no problem locating the element.
import bs4
file = open("Google Translate.html")
soup = bs4.BeautifulSoup(file, "html.parser")
translation = soup.select('#result_box span')
print(translation)

result_box is the correct element, but your code only works on the page saved from your browser because that copy includes the dynamically generated content. With requests you get only the page source, without any dynamically generated content. The translation is generated by an Ajax call to the URL below:
"https://translate.google.ca/translate_a/single?client=t&sl=zh-CN&tl=en&hl=en&dt=at&dt=bd&dt=ex&dt=ld&dt=md&dt=qca&dt=rw&dt=rm&dt=ss&dt=t&ie=UTF-8&oe=UTF-8&source=bh&ssel=0&tsel=0&kc=1&tk=902911.786207&q=%E6%B2%BB%E5%85%B7"
For your request it returns:
[[["Fixture","治具",,,0],[,,,"Zhì jù"]],,"zh-CN",,,[["治 具",1,[["Fixture",999,true,false],["Fixtures",0,true,false],["Jig",0,true,false],["Jigs",0,true,false],["Governance",0,true,false]],[[0,2]],"治具",0,1]],1,,[["ja"],,[1],["ja"]]]
So you will either have to mimic that request, passing all the necessary parameters, or use something that supports dynamic content, such as Selenium.
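Note that the returned text is not valid JSON, because empty array slots are left blank. Here is a rough sketch of one way to read it, assuming the response always has the shape shown above and that no string in it contains "[," or ",,". The endpoint is undocumented, so the meaning of each position is a guess based on this single sample, and `parse_sparse_array` is a made-up helper name:

```python
import json
import re

# Response text copied from the answer above.
RAW = '[[["Fixture","治具",,,0],[,,,"Zhì jù"]],,"zh-CN",,,[["治 具",1,[["Fixture",999,true,false],["Fixtures",0,true,false],["Jig",0,true,false],["Jigs",0,true,false],["Governance",0,true,false]],[[0,2]],"治具",0,1]],1,,[["ja"],,[1],["ja"]]]'


def parse_sparse_array(text):
    """Fill empty array slots with null so that json.loads accepts the text.

    Assumption: no string in the payload contains "[," or ",," itself.
    """
    # Insert "null" wherever a comma directly follows "[" or another comma.
    filled = re.sub(r"(?<=[\[,])(?=,)", "null", text)
    return json.loads(filled)


data = parse_sparse_array(RAW)
print(data[0][0][0])  # first translation candidate in this sample
print(data[2])        # source language code in this sample
```

A regular expression is a blunt tool here; if the payload can contain commas inside strings, a proper tokenizer would be safer.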

Simply try this :
translation = soup.select('#result_box span')[0].text
print(translation)

You can try this different approach:
if filename.endswith(extension_file):
    with open(os.path.join(files_from_folder, filename), encoding='utf-8') as html:
        soup = BeautifulSoup('<pre>' + html.read() + '</pre>', 'html.parser')
        for title in soup.findAll('title'):
            recursively_translate(title)
For the complete code, see:
https://neculaifantanaru.com/en/python-code-text-google-translate-website-translation-beautifulsoup-library.html
or:
https://neculaifantanaru.com/en/example-google-translate-api-key-python-code-beautifulsoup.html

How to find specific video html tag using beautiful soup?

Does anyone know how to use BeautifulSoup in Python?
I have a search engine with a list of different URLs.
I want to get only the HTML tag that contains a video embed URL, and then get the link.
Example:
import BeautifulSoup
html = '''https://archive.org/details/20070519_detroit2'''
#or this.. html = '''http://www.kumby.com/avatar-the-last-airbender-book-3-chapter-5/'''
#or this... html = '''https://www.youtube.com/watch?v=fI3zBtE_S_k'''
soup = BeautifulSoup.BeautifulSoup(html)
What should I do next to get the HTML tag of the video or object, or the exact link of the video?
I need to put it in my iframe. I will integrate the Python script with my PHP: the Python script outputs the link of the video, and I then echo it in my iframe.

You need to get the HTML of the page, not just the URL.
Use the built-in urllib library like this (Python 3):
from urllib.request import urlopen
from bs4 import BeautifulSoup as BS
url = '''https://archive.org/details/20070519_detroit2'''
# open and read the page
page = urlopen(url)
html = page.read()
# create the BeautifulSoup parse-able "soup"
soup = BS(html, 'html.parser')
# get the src attribute from the video tag
video = soup.find("video").get("src")
Also, for the site you are using, I noticed that to get the embed link you just change details in the link to embed, so it looks like this:
https://archive.org/embed/20070519_detroit2
So if you want to do this for multiple URLs without having to parse the pages, do something like this:
url = '''https://archive.org/details/20070519_detroit2'''
spl = url.split('/')
spl[3] = 'embed'
embed = "/".join(spl)
print(embed)
EDIT
To get the embed link for the other links you provided in your edit, look through the HTML of the page you are parsing until you find the link, then get the tag it is in, then the attribute.
For
'''http://www.kumby.com/avatar-the-last-airbender-book-3-chapter-5/'''
just do
soup.find("iframe").get("src")
Use iframe because the link is in the iframe tag, and .get("src") because the link is the src attribute.
You can try the next one yourself, because you should learn how to do it if you want to be able to do it in the future :)
Good luck!
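The details-to-embed rewrite can also be done with urllib.parse instead of a fixed list index, which leaves other URLs untouched. This is only a sketch: `to_embed_url` is a made-up helper name, and it assumes the archive.org pattern described above stays the same:

```python
from urllib.parse import urlsplit, urlunsplit


def to_embed_url(url):
    """Swap the first path segment 'details' for 'embed', if present."""
    parts = urlsplit(url)
    segments = parts.path.split("/")  # "/details/x" -> ["", "details", "x"]
    if len(segments) > 1 and segments[1] == "details":
        segments[1] = "embed"
    return urlunsplit(parts._replace(path="/".join(segments)))


print(to_embed_url("https://archive.org/details/20070519_detroit2"))
```

URLs whose first path segment is not details are returned unchanged, so the function can be applied to a mixed list of links.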

You can't parse a URL. BeautifulSoup is used to parse an HTML page. Retrieve the page first:
from urllib.request import urlopen
from bs4 import BeautifulSoup
data = urlopen("https://archive.org/details/20070519_detroit2")
html = data.read()
Then you can use find and take the src attribute:
soup = BeautifulSoup(html, 'html.parser')
video = soup.find('video')
src = video['src']

This is a one-liner to get all the downloadable MP4 files on that page, in case you need it.
import bs4
from urllib.request import urlopen
url = 'https://archive.org/details/20070519_detroit2'
soup = bs4.BeautifulSoup(urlopen(url), 'html.parser')
links = [a['href'] for a in soup.find_all('a', href=True) if '.mp4' in a['href']]  # skip <a> tags without href
print(links)
Here is the output:
['/download/20070519_detroit2/20070519_detroit_jungleearth.mp4',
'/download/20070519_detroit2/20070519_detroit_sweetkissofdeath.mp4',
'/download/20070519_detroit2/20070519_detroit_goodman.mp4',
...
'/download/20070519_detroit2/20070519_detroit_wilson_512kb.mp4']
These are relative links; join them with the domain to get absolute URLs.
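One way to combine the relative links with the domain is urllib.parse.urljoin from the standard library. Here is a short sketch using two of the links from the output above; the helper name `absolutize` is made up for this example:

```python
from urllib.parse import urljoin

PAGE_URL = "https://archive.org/details/20070519_detroit2"

# Two of the relative links from the output above.
relative_links = [
    "/download/20070519_detroit2/20070519_detroit_jungleearth.mp4",
    "/download/20070519_detroit2/20070519_detroit_goodman.mp4",
]


def absolutize(page_url, links):
    """Resolve each link against the URL of the page it was found on."""
    return [urljoin(page_url, link) for link in links]


for link in absolutize(PAGE_URL, relative_links):
    print(link)
```

urljoin also handles links that are already absolute, so the list does not need to be filtered first.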

Python code to consolidate CSS into HTML

I am looking for Python code that can take an HTML page and insert any linked CSS style definitions used by that page into it, so that externally referenced CSS files are not needed.
I need this to make single files that can be sent as email attachments from existing pages on a web site. Thanks for any help.

Sven's answer helped me, but it didn't work out of the box. The following did it for me:
import bs4  # BeautifulSoup 3 has been replaced by bs4
soup = bs4.BeautifulSoup(open("index.html").read(), "html.parser")
stylesheets = soup.findAll("link", {"rel": "stylesheet"})
for s in stylesheets:
    t = soup.new_tag('style')
    c = bs4.element.NavigableString(open(s["href"]).read())
    t.insert(0, c)
    t['type'] = 'text/css'
    s.replaceWith(t)  # swap the <link> for the new <style> tag
open("output.html", "w").write(str(soup))

You will have to code this yourself, but BeautifulSoup will take you a long way. Assuming all your files are local, you can do something like:
from BeautifulSoup import BeautifulSoup
soup = BeautifulSoup(open("index.html").read())
stylesheets = soup.findAll("link", {"rel": "stylesheet"})
for s in stylesheets:
    s.replaceWith('<style type="text/css" media="screen">' +
                  open(s["href"]).read() +
                  '</style>')
open("output.html", "w").write(str(soup))
If the files are not local, you can use Python's urllib or urllib2 to retrieve them.

You can use pynliner. Example from their documentation:
from pynliner import Pynliner
html = "html string"
css = "css string"
p = Pynliner()
p.from_string(html).with_cssString(css)
output = p.run()  # returns the HTML with the styles applied inline
