How do I scrape a multi-language website using Python

I'm using Python to scrape data from a Japanese website that offers both English and Japanese versions. Link here
The problem is that I get the data I need, but in the wrong language (the links for both language versions are identical). I inspected the HTML page and saw the `lang` attribute on the root element as follows:
<html xmlns="http://www.w3.org/1999/xhtml" lang="en" xml:lang="en">
<html xmlns="http://www.w3.org/1999/xhtml" lang="ja" xml:lang="ja" class="">
Here is the code I used:
import requests
import lxml.html as lh
import pandas as pd

url = 'https://data.j-league.or.jp/SFMS01/search?team_ids=33&home_away_select=0'
page = requests.get(url)
doc = lh.fromstring(page.content)
tr_elements = doc.xpath('//tr')

col = []
i = 0
for t in tr_elements[0]:
    i += 1
    name = t.text_content()
    print("{}".format(name))
    col.append((name, []))
At this point I get the header row of the table from the page, but in the Japanese version.
I'm new to Python and to web scraping. Is there any method I could use to get the data in English?
If there is any existing examples, templates or other resources I could use, that'd be better.
Thanks in advance!

I visited the website you added; for English it sets a cookie. Look at the response headers for the request to https://data.j-league.or.jp/SFMS01/search?team_ids=33&home_away_select=0 in the Network tab and you will see:
Set-Cookie: SFCM01LANG=en; Max-Age=63072000; Expires=Tue, 18-Oct-2022 19:14:29 GMT; Path=/
So I basically used that cookie. Change your code snippet to this:
import requests
import lxml.html as lh
import pandas as pd
url='https://data.j-league.or.jp/SFMS01/search?team_ids=33&home_away_select=0'
page = requests.get(url, cookies={'SFCM01LANG':'en'})
doc = lh.fromstring(page.content)
tr_elements = doc.xpath('//tr')
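Since the original snippet already imports pandas, a shorter route is to let pandas.read_html build the table directly. This is a sketch, assuming the results are rendered as a plain HTML table; the fetch_table helper name is my own:

```python
import io

import pandas as pd
import requests

def fetch_table(url, lang='en'):
    """Fetch the page with the language cookie set and parse the first HTML table."""
    page = requests.get(url, cookies={'SFCM01LANG': lang})
    # read_html returns one DataFrame per <table> found in the document
    tables = pd.read_html(io.StringIO(page.text))
    return tables[0]

# df = fetch_table('https://data.j-league.or.jp/SFMS01/search?team_ids=33&home_away_select=0')
# print(df.head())
```

The English column names then come back directly as the DataFrame header, so the manual loop over tr_elements is no longer needed.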

Related

Beautiful Soup doesn't give data for a site

I am trying to scrape this site for information:
https://farm.ewg.org/addrsearch.php?stab2=NY&fullname=B&b=1&page=0
I tried writing code that has worked for other sites, but here it just leaves me with an empty text file instead of filling it with data as it has for other sites. Here is my code:
import urllib
from urllib.request import urlopen
from bs4 import BeautifulSoup
import re
import json
import time

outfile = open('/Users/Luca/Desktop/test/farm_data.text', 'w')

my_list = list()
site = "https://farm.ewg.org/addrsearch.php?stab2=NY&fullname=A&b=1&page=0"
my_list.append(site)
site = "https://farm.ewg.org/addrsearch.php?stab2=NY&fullname=B&b=1&page=0"
my_list.append(site)
site = "https://farm.ewg.org/addrsearch.php?stab2=NY&fullname=C&b=1&page=0"
my_list.append(site)

for item in my_list:
    time.sleep(5)
    html = urlopen(item)
    bsObj = BeautifulSoup(html.read(), "html.parser")
    nameList = bsObj.prettify().split('.')
    count = 0
    for name in nameList:
        print(name[2:])
        outfile.write(name[2:] + ',' + item + '\n')
I am trying to split it into smaller parts and go from there. I have used this code on sites like this: https://www.mtggoldfish.com/price/Aether+Revolt/Heart+of+Kiran#online
for example and it worked.
Any ideas why it works for some sites and not others? thanks so much.
The website in question probably disallows web scraping, which is why you get:
HTTPError: HTTP Error 403: Forbidden
You can spoof your user agent by pretending to be a browser. Here's an example of how to do it using the excellent requests module, passing a User-Agent header when making the request:
import requests
from bs4 import BeautifulSoup

url = "https://farm.ewg.org/addrsearch.php?stab2=NY&fullname=A&b=1&page=0"
html = requests.get(url, headers={'User-Agent': 'Mozilla/5.0'}).text
bsObj = BeautifulSoup(html, "html.parser")
print(bsObj)
Output:
<!DOCTYPE doctype html>
<html class="no-js" lang="en" prefix="og: http://ogp.me/ns#" xmlns="http://www.w3.org/1999/xhtml" xmlns:fb="http://ogp.me/ns/fb#">
<head>
<meta charset="utf-8"/>
...
You can massage this code into your loop now.
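A sketch of that, folding the User-Agent header into the original three-URL loop (the fetch_pages helper is my own name; the delay mirrors the original time.sleep(5)):

```python
import time

import requests
from bs4 import BeautifulSoup

BASE = "https://farm.ewg.org/addrsearch.php?stab2=NY&fullname={}&b=1&page=0"

def fetch_pages(letters, delay=5):
    """Yield (url, soup) for each search page, requested with a browser User-Agent."""
    for letter in letters:
        url = BASE.format(letter)
        html = requests.get(url, headers={'User-Agent': 'Mozilla/5.0'}).text
        yield url, BeautifulSoup(html, "html.parser")
        time.sleep(delay)  # be polite between requests

# for url, soup in fetch_pages("ABC"):
#     print(url, soup.title)
```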

HTML in browser doesn't correspond to scraped data in python

For a project I have to scrape data from different websites, and I'm having a problem with one of them.
When I look at the source code, the things I want are in a table, so it seems easy to scrape. But when I run my script, that part of the source doesn't show up.
Here is my code. I tried different things: at first there weren't any headers, then I added some, but it made no difference.
# import libraries
import urllib2
from bs4 import BeautifulSoup
import csv
import requests

# specify the url
quote_page = 'http://www.airpl.org/Pollens/pollinariums-sentinelles'

# query the website and return the html to the variable 'response'
response = requests.get(quote_page)
response.addheaders = [('User-agent', 'Mozilla/5.0')]
print(response.text)

# parse the html using BeautifulSoup and store it in the variable `soup`
soup = BeautifulSoup(response.text, 'html.parser')

with open('allergene.txt', 'w') as f:
    f.write(soup.encode('UTF-8', 'ignore'))
What I'm looking for on the website is the content after "Herbacée", whose HTML looks like:
<p class="level1">
<img src="/static/img/state-0.png" alt="pas d'émission" class="state">
Herbacee
</p>
Do you have any idea what's wrong?
Thanks for your help, and happy new year guys :)
This page uses JavaScript to render the table. The real page that contains the table is:
http://www.alertepollens.org/gardens/garden/1/state/
You can find this URL in Chrome DevTools >> Network.
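A sketch of fetching that server-rendered URL with requests and BeautifulSoup (the level1 class comes from the markup quoted in the question; fetch_states is my name for the helper):

```python
import requests
from bs4 import BeautifulSoup

STATE_URL = 'http://www.alertepollens.org/gardens/garden/1/state/'

def fetch_states(url=STATE_URL):
    """Fetch the server-rendered fragment and return the <p class="level1"> texts."""
    response = requests.get(url, headers={'User-Agent': 'Mozilla/5.0'})
    soup = BeautifulSoup(response.text, 'html.parser')
    return [p.get_text(strip=True) for p in soup.find_all('p', class_='level1')]

# for entry in fetch_states():
#     print(entry)
```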

how to detect the language of webpage content by using python

I have to test a bunch of URLs to check whether those webpages have the respective translated content or not. Is there any way to return the language of the content of a webpage using Python? For example, if the page is in Chinese, it should return `"Chinese"`.
I checked the langdetect module, but was not able to get the results I desire. These URLs are in web XML format, and the content is shown under `<releasehigh>`.
Here is a simple example demonstrating the use of BeautifulSoup to extract the HTML body text and of langdetect for the language detection:
from bs4 import BeautifulSoup
from langdetect import detect

with open("foo.html", "rb") as f:
    soup = BeautifulSoup(f, "lxml")
    [s.decompose() for s in soup("script")]  # remove <script> elements
    body_text = soup.body.get_text()
    print(detect(body_text))
You can extract a chunk of content and then use a Python language-detection library like langdetect or guess-language.
Maybe you have a header like this one:
<HTML xmlns="http://www.w3.org/1999/xhtml" xml:lang="fr" lang="fr">
If that's the case, you can see from lang="fr" that this is a French web page. If not, guessing the language of a text is not trivial.
You can use BeautifulSoup to extract the language attribute from the HTML source code.
<html class="no-js" lang="cs">
Extract the lang attribute from the source code:
from bs4 import BeautifulSoup
import requests

url = "https://example.com"  # placeholder: the page you want to check
html = requests.get(url).content
soup = BeautifulSoup(html, 'html.parser')
print(soup.html["lang"])

Python web scraping returns error

I am currently learning Python and have tried to pick up web scraping. I have been using example code that I got from some tutorials, but I have encountered a problem with one of the sites I was looking at. The following code was supposed to return the title of the website:
import urllib
import re

urls = ["http://www.libyaherald.com"]
i = 0
regex = '<title>(.+?)</title>'
pattern = re.compile(regex)

while i < len(urls):
    htmlfile = urllib.urlopen(urls[i])
    htmltext = htmlfile.read()
    titles = re.findall(pattern, htmltext)
    print titles
    i += 1
The Libya Herald website returned an error instead of the title. I checked the source code for Libya Herald and the DOCTYPE is <!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN" "http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd">.
Does the doctype have something to do with me not being able to scrape it?
As #Puciek said, scraping HTML with regex is going to be very difficult. I would recommend you start using a package; one that is very easy to use and install is BeautifulSoup.
Once you install it you can try this simple example:
from bs4 import BeautifulSoup
import requests

html = requests.get('http://www.libyaherald.com').text
bs = BeautifulSoup(html, 'html.parser')
title = bs.find('title').text
print(title)
For serious python web scraping I strongly suggest Scrapy.
And as far as I know, when it comes to HTML parsing, regex is not a recommended approach. Try BeautifulSoup (bs4) as the Pizza guy said :)

Spynner doesn't load html from URL

I use spynner to scrape data from a site. My code is this:
import spynner
br = spynner.Browser()
br.load("http://www.venere.com/it/hotel/roma/hotel-ferrari/#reviews")
text = br._get_html()
This code fails to load the entire HTML page. This is the HTML that I received:
<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN" "http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd"><html xmlns="http://www.w3.org/1999/xhtml"><head>
<script type="text/javascript">(function(){var d=document,m=d.cookie.match(/_abs=(([or])[a-z]*)/i)
v_abs=m?m[1].toUpperCase():'N'
if(m){d.cookie='_abs='+v_abs+'; path=/; domain=.venere.com';if(m[2]=='r')location.reload(true)}
v_abp='--OO--OOO-OO-O'
v_abu=[,,1,1,,,1,1,1,,1,1,,1]})()
My question is: how do I load the complete html?
More information:
I tried with:
import spynner

br = spynner.Browser()
respond = br.load("http://www.venere.com/it/hotel/roma/hotel-ferrari/#reviews")
if respond == None:
    br.wait_load()
but the HTML load is never complete or reliable. What is the problem? I'm going crazy.
One more thing: I'm working in Django 1.3. If I use the same code in plain Python (2.7), it sometimes loads all the HTML.
Use a wait callback that only reports the page as ready once the reviews markup is present. After running the following and checking the contents of test.html, you will find the p elements with id="feedback-...somenumber...":
import spynner

def content_ready(browser):
    if 'id="feedback-' in browser.html:
        return True

br = spynner.Browser()
br.load("http://www.venere.com/it/hotel/roma/hotel-ferrari/#reviews", wait_callback=content_ready)
with open("test.html", "w") as hf:
    hf.write(br.html.encode("utf-8"))
