I am trying to scrape this site for information:
https://farm.ewg.org/addrsearch.php?stab2=NY&fullname=B&b=1&page=0
I tried writing code that has worked for other sites, but it just leaves me with an empty text file instead of one filled with data like the others. Here is my code:
import urllib
from urllib.request import urlopen
from bs4 import BeautifulSoup
import re
import json
import time
outfile = open('/Users/Luca/Desktop/test/farm_data.text','w')
my_list = list()
site = "https://farm.ewg.org/addrsearch.php?stab2=NY&fullname=A&b=1&page=0"
my_list.append(site)
site = "https://farm.ewg.org/addrsearch.php?stab2=NY&fullname=B&b=1&page=0"
my_list.append(site)
site = "https://farm.ewg.org/addrsearch.php?stab2=NY&fullname=C&b=1&page=0"
my_list.append(site)
for item in my_list:
    time.sleep(5)
    html = urlopen(item)
    bsObj = BeautifulSoup(html.read(), "html.parser")
    nameList = bsObj.prettify().split('.')
    count = 0
    for name in nameList:
        print(name[2:])
        outfile.write(name[2:] + ',' + item + '\n')
I am trying to split the page into smaller parts and go from there. I have used this code on sites like https://www.mtggoldfish.com/price/Aether+Revolt/Heart+of+Kiran#online, for example, and it worked.
Any ideas why it works for some sites and not others? Thanks so much.
The website in question probably disallows web scraping, which is why you get:
HTTPError: HTTP Error 403: Forbidden
You can spoof your user agent by pretending to be a browser. Here's an example of how to do it using the fantastic requests module: pass a User-Agent header when making the request.
import requests
from bs4 import BeautifulSoup
url = "https://farm.ewg.org/addrsearch.php?stab2=NY&fullname=A&b=1&page=0"
html = requests.get(url, headers={'User-Agent' : 'Mozilla/5.0'}).text
bsObj = BeautifulSoup(html, "html.parser")
print(bsObj)
Output:
<!DOCTYPE doctype html>
<html class="no-js" lang="en" prefix="og: http://ogp.me/ns#" xmlns="http://www.w3.org/1999/xhtml" xmlns:fb="http://ogp.me/ns/fb#">
<head>
<meta charset="utf-8"/>
.
.
.
You can massage this code into your loop now.
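For instance, here is a minimal sketch of your loop rewritten around requests (the file path, five-second delay, and splitting logic are kept from your code; the 'Mozilla/5.0' User-Agent string is just an example value):

import time

import requests
from bs4 import BeautifulSoup

outfile = open('/Users/Luca/Desktop/test/farm_data.text', 'w')
my_list = [
    "https://farm.ewg.org/addrsearch.php?stab2=NY&fullname=A&b=1&page=0",
    "https://farm.ewg.org/addrsearch.php?stab2=NY&fullname=B&b=1&page=0",
    "https://farm.ewg.org/addrsearch.php?stab2=NY&fullname=C&b=1&page=0",
]
for item in my_list:
    time.sleep(5)  # be polite between requests
    html = requests.get(item, headers={'User-Agent': 'Mozilla/5.0'}).text
    bsObj = BeautifulSoup(html, "html.parser")
    for name in bsObj.prettify().split('.'):
        print(name[2:])
        outfile.write(name[2:] + ',' + item + '\n')
outfile.close()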
Related
I'm trying to get BeautifulSoup to read this page but the URL is not passed correctly into the get() command.
The URL is https://www.econjobrumors.com/topic/supreme-court-to-%e2%80%9cconsider%e2%80%9d-taking-up-harvard-affirmative-action-case-on-june-10. But when I try to use BeautifulSoup to get the data from the URL, it always gives an error saying that the URL is incorrect:
import requests

response = requests.get(url="https://www.econjobrumors.com/topic/supreme-court-to-%e2%80%9cconsider%e2%80%9d-taking-up-harvard-affirmative-action-case-on-june-10",
                        verify=False)
print(response.request.url, end="\r")
It was the double quotes, “ (U+201C) and ” (U+201D), that caused the error. I've been trying for hours but still can't figure out a way to pass the URL correctly.
I changed the double quotes to single quotes around the URL:
from bs4 import BeautifulSoup
import requests
url = 'https://www.econjobrumors.com/topic/supreme-court-to-%e2%80%9cconsider%e2%80%9d-taking-up-harvard-affirmative-action-case-on-june-10'
r = requests.get(url, allow_redirects=False)
soup = BeautifulSoup(r.content, 'lxml')
print(soup)
This prints out the HTML as expected (I edited it down to fit in this answer):
<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.1//EN" "http://www.w3.org/TR/xhtml11/DTD/xhtml11.dtd">
<html xml:lang="en-US" xmlns="http://www.w3.org/1999/xhtml">
<head>
<meta content="IE=8" http-equiv="X-UA-Compatible"/>
<ALL THE CONTENT>Too much to paste in the answer</ALL THE CONTENT>
</html>
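As an aside, if a raw URL contains the literal curly quote characters rather than their percent-encoded form, you can normalize it yourself before requesting it. A small sketch using only the standard library (the set of safe characters is an assumption about which URL delimiters you want to preserve):

from urllib.parse import quote

# Raw URL containing the literal curly quotes (U+201C / U+201D)
raw = ('https://www.econjobrumors.com/topic/supreme-court-to-'
       '\u201cconsider\u201d-taking-up-harvard-affirmative-action-'
       'case-on-june-10')
# Percent-encode everything except the characters that structure a URL;
# the curly quotes become %E2%80%9C and %E2%80%9D
url = quote(raw, safe=':/?&=')
print(url)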
I'm using Python to scrape data from a Japanese website that offers both English and Japanese versions. Link here: https://data.j-league.or.jp/SFMS01/search?team_ids=33&home_away_select=0
The problem is that I got the data I needed, but in the wrong language (the links for both languages are identical). I tried inspecting the HTML page and saw the 'lang' attribute as follows:
<html xmlns="http://www.w3.org/1999/xhtml" lang="en" xml:lang="en">
<html xmlns="http://www.w3.org/1999/xhtml" lang="ja" xml:lang="ja" class="">
Here is the code I used:
import requests
import lxml.html as lh
import pandas as pd
url='https://data.j-league.or.jp/SFMS01/search?team_ids=33&home_away_select=0'
page = requests.get(url)
doc = lh.fromstring(page.content)
tr_elements = doc.xpath('//tr')
col = []
i = 0
for t in tr_elements[0]:
    i += 1
    name = t.text_content()
    print("{}".format(name))
    col.append((name, []))
At this point I get the header row of the table from the page, but in the Japanese version.
I'm new to Python and scraping, and I don't know if there's a method I could use to get the data in English.
If there are any existing examples, templates, or other resources I could use, that'd be even better.
Thanks in advance!
I visited the website you linked. For English, it sets a cookie: look at the response headers for the request URL https://data.j-league.or.jp/SFMS01/search?team_ids=33&home_away_select=0 in the network tab and you will see:
Set-Cookie: SFCM01LANG=en; Max-Age=63072000; Expires=Tue, 18-Oct-2022 19:14:29 GMT; Path=/
So I basically used that.
Change your code snippet to this:
import requests
import lxml.html as lh
import pandas as pd
url='https://data.j-league.or.jp/SFMS01/search?team_ids=33&home_away_select=0'
page = requests.get(url, cookies={'SFCM01LANG':'en'})
doc = lh.fromstring(page.content)
tr_elements = doc.xpath('//tr')
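To confirm the cookie took effect, you can print the header row again; it should now come back in English. As a sketch, pandas (already imported above) can also parse the tables straight from the response; using pd.read_html here is my suggestion, not something your original snippet did:

# The header row should now be in English
print([t.text_content().strip() for t in tr_elements[0]])

# Optionally, let pandas parse every <table> on the page at once
dfs = pd.read_html(page.text)  # one DataFrame per <table>
print(dfs[0].head())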
I am an absolute newbie in the field of web scraping, and right now I want to extract visible text from a web page. I found a piece of code online:
import urllib2
from bs4 import BeautifulSoup
url = "http://www.espncricinfo.com/"
web_page = urllib2.urlopen(url)
soup = BeautifulSoup(url , "lxml")
print (soup.prettify())
With the above code, I get the following result:
/usr/local/lib/python2.7/site-packages/bs4/__init__.py:282: UserWarning: "http://www.espncricinfo.com/" looks like a URL. Beautiful Soup is not an HTTP client. You should probably use an HTTP client like requests to get the document behind the URL, and feed that document to Beautiful Soup.
' that document to Beautiful Soup.' % decoded_markup
<html>
<body>
<p>
http://www.espncricinfo.com/
</p>
</body>
</html>
Is there any way I could get a more concrete result, and what is going wrong with the code? Sorry for being clueless.
Try passing the HTML document, not the URL, to BeautifulSoup:
import urllib2
from bs4 import BeautifulSoup
url = "http://www.espncricinfo.com/"
web_page = urllib2.urlopen(url)
soup = BeautifulSoup(web_page , 'html.parser')
print (soup.prettify().encode('utf-8'))
soup = BeautifulSoup(web_page, "lxml")
You should pass a file-like object (or the document itself) to BeautifulSoup, not the URL.
The URL is handled by urllib2.urlopen(url), and the response is stored in web_page.
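The warning message itself points to a third option: fetch the document with an HTTP client such as requests and feed the resulting text to BeautifulSoup. A sketch along those lines:

import requests
from bs4 import BeautifulSoup

url = "http://www.espncricinfo.com/"
web_page = requests.get(url).text  # the document behind the URL
soup = BeautifulSoup(web_page, "lxml")
print(soup.prettify())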
This question already has answers here:
Does python urllib2 automatically uncompress gzip data fetched from webpage? (4 answers)
Closed 7 years ago.
When I crawl the webpage using urllib2, I don't get the page source but a garbled string that I can't make sense of. My code is as follows:
import urllib2

url = 'http://finance.sina.com.cn/china/20150905/065523161502.shtml'
conn = urllib2.urlopen(url)
content = conn.read()
print content
Can anyone help me find out what's wrong? Thank you so much.
Update: I think you can run the code above to get what I get. This is what I get in Python:
{G?0????l???%ߐ?C0 ?K?z?%E
|?B ??|?F?oeB?'??M6?
y???~???;j????H????L?mv:??:]0Z?Wt6+Y+LV? VisV:캆P?Y?,
O?m?p[8??m/???Y]????f.|x~Fa]S?op1M?H?imm5??g?????k?K#?|??? ???????p:O
??(? P?FThq1??N4??P???X??lD???F???6??z?0[?}??z??|??+?pR"s?Lq??&g#?v[((J~??w1#-?G?8???'?V+ks0?????%???5)
And this is what I expected (using curl):
<html>
<head>
<link rel="mask-icon" sizes="any" href="http://www.sina.com.cn/favicon.svg" color="red">
<meta charset="gbk"/>
<meta http-equiv="X-UA-Compatible" content="IE=edge" />
Here is a possible way to get the source information using requests and BeautifulSoup:
import requests
from bs4 import BeautifulSoup
#Url to request
url = "http://finance.sina.com.cn/china/20150905/065523161502.shtml"
r = requests.get(url)
#Use BeautifulSoup to organise the 'requested' content
soup = BeautifulSoup(r.content, "lxml")
print soup
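The garbled string in the question is almost certainly the gzip-compressed response body (see the duplicate linked above). requests decompresses it transparently, which you can verify from the response headers; a quick check, assuming the server really does send gzip:

# requests has already gunzipped r.content; the header shows what
# the server sent on the wire
print r.headers.get('Content-Encoding')  # expected: 'gzip'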
I am building a basic data crawler in Python using BeautifulSoup, for Batoto, the manga host. For some reason, the URL works sometimes and other times it doesn't. For example:
from bs4 import BeautifulSoup
from urllib2 import urlopen
x = urlopen(*manga url here*)
y = BeautifulSoup(x)
print y
The result should be a tag soup of the page but instead I get a big wall of this
´ºŸ{›æP™oRhtüs2å÷%ëmßñ6Y›þ�GDŸ0ËÂ͇켮Yé)–ÀØÅð&ô]½f³ÓÞ€Þþ)ú$÷á�üv…úzW¿¾úà†lªÀí¥ï«·_ OTL_ˆêsÁÿƒÁÖ<Ø?°Þ›Â+WLç¥àEh>rýÜ>x ˆ‡eÇžù»èå»–Ùý e:›§`L_.‹¦úoÓ‘®e=‰ìÓ4Wëo’]~Ãõ¬À8>x:²âœ2¸ Á|&0ÍVpMLÎñ»v¥Ín÷-ÅÉ–T§`Ì.SÔsóë„œ¡×[˜·P6»�ùè�>Ô¾È]Œ—·ú£âÊgí%ضkwýÃ=Üϸ2cïÑfÙ_�×]Õê“ž?„UÖ* m³/`ñ§ÿL0³dµ·jªÅ}õ/õOXß×;«]®’ϯw‹·þ¡ÿ|Gýª`I{µœ}œí�ë–¼yÖÇ'�Wç�ëµÅþþ*ýœd{ÿDv:Ð íHzqÿÆ÷æélG-èÈâpÇßQé´^ÐO´®Xÿ�ýö(‹šëñþ"4!SÃõ2{òÿÜ´»ûE</kî?x´&ý˜`Ù)uÂï¹ã[ÏŠ²y°kÆpù}¢></uŒ¸kpž¼cì∬ƒcubÆ¡¢=en2‚påÓb9®`áï|z…p"i6pvif¨þõ“⟒></t`$ò-e></cé”r)$�ˆ)ìªÜrd&mÉÊ*ßdÒuÄ.Æ-hx#9[s=m�Ýfd2o1ˆ]‡[Ôádœtë¤qâxæ°‹qËÁ×,½ŠmʇꇢùÅýl></sí°çù¡h?‡ÌÜœbá‰æÆý¡sd~¬></zz¡ózwÎ[à!n‰Àš5¤…¸‘ݹŽ></sÃ:›3Ìæ></lÑggu�».Б#4õë\ÃñÆ:¸5ÔwÛ·…)~ÛacÑ,d³båÖ6></tg9y+wΉí%r8ƒ·}n`¼ÁÆ8˜”é²êÞ½°¶Ï></sÖ-di¨a±j9³4></ss„*w(ßibðïj*¶„)pâýÌ”a§%va{‰ò¦m mi></o³o˜Ÿ?¿Ñu-}{cÜ›a~:k²Ì></r+=ÅÌk˜c></wÓ¹âߊž‡ëf7vÑ�akÆ4ƒ‚></szŽµiÞêzâšÒ¬ú¢“âÀ#�-></qebndΑg*cxgsÆ€Ùüe¡³-ŠngÁ:�3ænæ5ï0`coäÏÖ9œ1Ða¯,æ—ªìàãÉÂð></j›h¶`à;)òiÖ š+></o”64ˆÎº9°��u—Úd¿ý¥pÎÖ‰0¢s:c�yƧ³t=ÕŸ“Ý‹41%}*,e³Ô¥ó></hiræe—';></v�fÞ«Ë¥n§Ð·¡kaììë\�`ùsõ©¸pv¦‘></bñ¼ut«w)Ø'¹ú#{)n0¡Žan¶Ë5èsª�–u–></y_x.mÅd:g}ëÕðhçð«õõ8ŠcËÕÌvžv™-šêÙ`b¹˜ùÃΓçˤÔÙtx¹�ßïǶÎgþ°r‹$ò†aÆ–š?ì<y«Ëñõo{%ׇo{ú¥Á»æ]‡></u´¬Ø¸eÖïÝtßÚ'è3®nh±ûk4È#l«s]–Åec¹ÑtmÓl|ë£Þ¼~zôéõûwêÓÑñÉÆw\soøÊiyjvØÖ$¯ÈoºÙoyã]æ5]-t^[“¡aÑ{²Å¸6¦ðtŒçm¼ÂÎz´></wà™´»äõ#©õ></mÏu:=¼þ·'�qwúËö«m„l^ˆær¥30q±ÒšŸëù></l(„7¼=xi’?¤;ö$ØË4ßoóiòyoµxÉøþ¨—«g³Ãíß{|></body></html>
wrapped in html and body tags.
Sometimes if I keep trying it works, but it is so inconsistent that I can't figure out the reason for it.
Any help would be appreciated.
It seems to be urlopen having issues with the encoding; requests works fine:
import requests
from bs4 import BeautifulSoup

x = requests.get("http://bato.to/comic/_/comics/rakudai-kishi-no-eiyuutan-r11615")
y = BeautifulSoup(x.content)
print y
<!DOCTYPE html>
<html lang="en" xmlns:fb="http://www.facebook.com/2008/fbml">
<head>
<meta charset="utf-8"/>
<title>Rakudai Kishi no Eiyuutan - Scanlations - Comic - Comic Directory - Batoto - Batoto</title>
.................
Using urlopen we get the following:
x = urlopen("http://bato.to/comic/_/comics/rakudai-kishi-no-eiyuutan-r11615")
print x.read()
���������s+I���2���l��9C<�� ^�����쾯�dw�xzNT%��,T��A^�ݫ���9��a��E�C���W!�����ڡϳ��f7���s2�Px$���}I�*�'��;'3O>���'g?�u®{����e.�ڇ�e{�u���jf:aث
�����DS��%��X�Zͮ���������9�:�Dx�����\-�
�*tBW������t�I���GQ�=�c��\:����u���S�V(�><y�C��ã�*:�ۜ?D��a�g�o�sPD�m�"�,�Ɲ<;v[��s���=��V2�fX��ì�Cj̇�В~�
-~����+;V���m�|kv���:V!�hP��D�K�/`oԣ|�k�5���B�{�0�wa�-���iS
�>�œ��gǿ�o�OE3jçCV<`���Q!��5�B��N��Ynd����?~��q���� _G����;T�S'�#��t��Ha�.;J�61'`Й�#���>>`��Z�ˠ�x�#� J*u��'���-����]p�9{>����������#�<-~�K"[AQh0HjP
0^��R�]�{N#��
...................
So as you can see, it is a problem with urlopen, not BeautifulSoup.
The server is returning gzipped bytes. So to download the content using urllib2:
import urllib2
import gzip
import io

url = "http://bato.to/comic/_/comics/rakudai-kishi-no-eiyuutan-r11615"
response = urllib2.urlopen(url)
# print(response.headers)
content = response.read()
if response.headers['Content-Encoding'] == 'gzip':
    g = gzip.GzipFile(fileobj=io.BytesIO(content))
    content = g.read()
encoding = response.info().getparam('charset')
content = content.decode(encoding)
This checks that the content is the same as the page.text returned by requests:
import requests
page = requests.get(url)
# print(page.headers)
assert content == page.text
Since requests handles the gunzipping and decoding for you, and more robustly too, using requests is highly recommended.
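For completeness, here is a rough Python 3 translation of the urllib2 snippet above (a sketch, untested against this server; the UTF-8 fallback is an assumption for when the charset header is missing):

import gzip
import io
import urllib.request

url = "http://bato.to/comic/_/comics/rakudai-kishi-no-eiyuutan-r11615"
response = urllib.request.urlopen(url)
content = response.read()
# urllib does not transparently decompress, so gunzip by hand
if response.headers.get('Content-Encoding') == 'gzip':
    content = gzip.GzipFile(fileobj=io.BytesIO(content)).read()
encoding = response.headers.get_content_charset() or 'utf-8'
text = content.decode(encoding)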