create list from parsed web page in python

I'm a little new to web parsing in Python. I am using Beautiful Soup. I would like to create a list of strings parsed from a webpage. I've looked around and can't seem to find the right answer. Does anyone know how to create a list of strings from a web page? Any help is appreciated.
My code is something like this:
from BeautifulSoup import BeautifulSoup
import urllib2
url="http://www.any_url.com"
page=urllib2.urlopen(url)
soup = BeautifulSoup(page.read())
#The data I need is coming from HTML tag of td
page_find=soup.findAll('td')
for page_data in page_find:
    print page_data.string

#I tried to create my list here
page_List = [page_data.string]
print page_List

It's a little difficult to understand what you are trying to achieve, but if you want all the values of page_data.string in page_List, then your code should look like this:
page_List = []
for page_data in page_find:
    page_List.append(page_data.string)
Or using a list comprehension:
page_List = [page_data.string for page_data in page_find]
The problem with your original code is that you create the list using the text from the last td element only (i.e. outside of the loop which processes each td element).
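Putting it together, a minimal corrected version of the original script (keeping the placeholder URL and the old BeautifulSoup/urllib2 imports from the question) would be something like:
from BeautifulSoup import BeautifulSoup
import urllib2

url = "http://www.any_url.com"
page = urllib2.urlopen(url)
soup = BeautifulSoup(page.read())

# collect the text of every td cell into a list
page_List = []
for page_data in soup.findAll('td'):
    # note: .string is None for cells that contain nested tags
    page_List.append(page_data.string)

print page_List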

I'd recommend lxml over BeautifulSoup; once you start scraping a lot of pages, the speed advantage of lxml is hard to ignore.
import requests
import lxml.html
dom = lxml.html.fromstring(requests.get('http://www.any_url.com').content)
page_list = dom.xpath('//td/text()')
print page_list
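One caveat with the //td/text() expression: it only returns text nodes that sit directly inside each td, so cells whose text is wrapped in child tags come back empty. A sketch that grabs the full text of each cell instead (same placeholder URL) could be:
import requests
import lxml.html

dom = lxml.html.fromstring(requests.get('http://www.any_url.com').content)
# text_content() also picks up text nested inside child elements of each td
page_list = [td.text_content().strip() for td in dom.xpath('//td')]
print page_list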

Here is a version modified to fetch the web page as a string first:
import requests
the_web_page_as_a_string = requests.get(some_path).content
from lxml import html
myTree = html.fromstring(the_web_page_as_a_string)
td_list = [ e for e in myTree.iter() if e.tag == 'td']
text_list = []
for td_e in td_list:
    text = td_e.text_content()
    text_list.append(text)
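The loop above can also be collapsed into a single comprehension over the td elements:
# equivalent one-liner: iterate the td elements directly and grab each cell's full text
text_list = [td_e.text_content() for td_e in myTree.iter('td')]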

Related

Removing Empty Lines of a list in python

My goal is to get a simple text output like:
https://widget.reviews.io/rating-snippet/dist.js
But I keep getting output like this (the same URL with many empty lines around it):
https://widget.reviews.io/rating-snippet/dist.js
All these empty lines are the problem.
--> Before there were [] entries, but I removed them with ''.join.
Now I only have these empty lines.
Here is my code:
import requests
import re
from bs4 import BeautifulSoup
html = requests.get("https://www.nutrimuscle.com")
soup = BeautifulSoup(html.text, "html.parser")
# Find all script tags
for n in soup.find_all('script'):
    # Check if the src attribute exists, and if it does grab the source URL
    if 'src' in n.attrs:
        javascript = n['src']
    # Otherwise assume that the javascript is contained within the tags
    else:
        javascript = ''

    kameleoonRegex = re.compile(r'[\w].*rating-snippet/dist.js')
    # Everything I tried :D
    kameleeonScript = kameleoonRegex.findall(javascript)
    text = ''.join(kameleeonScript)
    print(text)
It's probably not that hard but I've been on this for hours
if kameleeonScript: print(kameleeonScript[0])
did the job :)
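For reference, a sketch of the loop with that guard in place (same URL and regex as in the question) looks like this:
import re
import requests
from bs4 import BeautifulSoup

html = requests.get("https://www.nutrimuscle.com")
soup = BeautifulSoup(html.text, "html.parser")
kameleoonRegex = re.compile(r'[\w].*rating-snippet/dist.js')

for n in soup.find_all('script'):
    javascript = n.get('src', '')      # '' when the script tag has no src attribute
    kameleeonScript = kameleoonRegex.findall(javascript)
    if kameleeonScript:                # only print when the regex actually matched
        print(kameleeonScript[0])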

HTML Scraping the website with duplicated div class name

I'm currently working on HTML scraping of the Baka-Updates site.
However, the div class names are duplicated.
My goal is CSV or JSON output, so I would like to use the text in [sCat] as the column names and the text in [sContent] as the stored values.
Is there a way to scrape this kind of website?
Thanks,
Sample
https://www.mangaupdates.com/series.html?id=75363
from lxml import html
import requests
page = requests.get('http://www.mangaupdates.com/series.html?id=153558?')
tree = html.fromstring(page.content)
#Get the name of the columns.... I hope
sCat = tree.xpath('//div[@class="sCat"]/text()')
#Get the actual data
sContent = tree.xpath('//div[@class="sContent"]/text()')
print('sCat: ', sCat)
print('sContent: ', sContent)
I tried, but I couldn't find anything that works.
@Jasper Nichol M Fabella
I tried to edit your code and got the following output. Maybe it will help.
from lxml import html
import requests
page = requests.get('http://www.mangaupdates.com/series.html?id=153558?')
tree = html.fromstring(page.content)
# print(page.content)
#Get the name of the columns.... I hope
sCat = tree.xpath('//div[@class="sCat"]')
#Get the actual data
sContent = tree.xpath('//div[@class="sContent"]')
print('sCat: ', len(sCat))
print('sContent: ', len(sContent))
json_dict = {}
for i in range(0, len(sCat)):
    # print(''.join(i.itertext()))
    sCat_text = ''.join(sCat[i].itertext())
    sContent_text = ''.join(sContent[i].itertext())
    json_dict[sCat_text] = sContent_text
print(json_dict)
Running this, I got a dict mapping each sCat text to its sContent text.
Hope it helps.
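If the end goal is a JSON file, that dictionary can be dumped directly with the standard json module (a sketch; the output file name is just an example):
import json

# write the scraped {category: content} pairs to disk ('series.json' is only an example name)
with open('series.json', 'w') as f:
    json.dump(json_dict, f, ensure_ascii=False, indent=2)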
You can use XPath expressions to target exactly the elements you want to scrape.
Here is an example with the requests and lxml libraries:
from lxml import html
import requests
r = requests.get('https://www.mangaupdates.com/series.html?id=75363')
tree = html.fromstring(r.content)
sCat = [i.text_content().strip() for i in tree.xpath('//div[@class="sCat"]')]
sContent = [i.text_content().strip() for i in tree.xpath('//div[@class="sContent"]')]
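From there, the two lists can be zipped into the CSV layout the question asks for (a sketch, assuming the lists line up one-to-one; the file name is just an example):
import csv

# pair each category with its content
row = dict(zip(sCat, sContent))

# one CSV row with the categories as column headers ('series.csv' is only an example name)
with open('series.csv', 'w', newline='') as f:
    writer = csv.DictWriter(f, fieldnames=sCat)
    writer.writeheader()
    writer.writerow(row)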
What are you using to scrape?
If you are using BeautifulSoup, you can search for all matching content on the page with the find_all method and a class identifier, then iterate through the results. You can use the special class_ keyword argument.
Something like
import bs4
soup = bs4.BeautifulSoup(html.source)
soup.find_all('div', class_='sCat')
# do rest of your logic work here
Edit: I was typing on my mobile on a cached page before you made the edits, so I didn't see the changes. I see you are using the raw lxml library to parse. Yes, that's faster, but I'm not too familiar with it; I've only used raw lxml for one project, but I think you can chain two search methods to distill something equivalent.
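For completeness, a bs4 sketch along those lines, pairing the category and content divs in document order (this assumes the two sets of divs appear in matching order on the page):
import requests
import bs4

r = requests.get('https://www.mangaupdates.com/series.html?id=75363')
soup = bs4.BeautifulSoup(r.text, 'html.parser')

# walk the sCat/sContent divs in parallel
cats = soup.find_all('div', class_='sCat')
contents = soup.find_all('div', class_='sContent')
for cat, content in zip(cats, contents):
    print(cat.get_text(strip=True), ':', content.get_text(strip=True))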

Python BS4 crawler indexerror

I am trying to create a simple crawler that pulls metadata from websites and saves the information into a CSV. I have followed some guides but am now stuck with this error:
IndexError: list index out of range
from urllib import urlopen
from BeautifulSoup import BeautifulSoup
import re
# Copy all of the content from the provided web page
webpage = urlopen('http://www.tidyawaytoday.co.uk/').read()
# Grab everything that lies between the title tags using a REGEX
patFinderTitle = re.compile('<title>(.*)</title>')
# Grab the link to the original article using a REGEX
patFinderLink = re.compile('<link rel.*href="(.*)" />')
# Store all of the titles and links found in 2 lists
findPatTitle = re.findall(patFinderTitle,webpage)
findPatLink = re.findall(patFinderLink,webpage)
# Create an iterator that will cycle through the first 16 articles and skip a few
listIterator = []
listIterator[:] = range(2,16)
# Print out the results to screen
for i in listIterator:
    print findPatTitle[i]  # The title
    print findPatLink[i]   # The link to the original article
    articlePage = urlopen(findPatLink[i]).read()  # Grab all of the content from the original article
    divBegin = articlePage.find('<div>')  # Locate the div provided
    article = articlePage[divBegin:(divBegin+1000)]  # Copy the first 1000 characters after the div
    # Pass the article to the Beautiful Soup module
    soup = BeautifulSoup(article)
    # Tell Beautiful Soup to locate all of the p tags and store them in a list
    paragList = soup.findAll('p')
    # Print all of the paragraphs to screen
    for i in paragList:
        print i
        print '\n'
# Here I retrieve and print to screen the titles and links with just Beautiful Soup
soup2 = BeautifulSoup(webpage)
print soup2.findAll('title')
print soup2.findAll('link')
titleSoup = soup2.findAll('title')
linkSoup = soup2.findAll('link')
for i in listIterator:
    print titleSoup[i]
    print linkSoup[i]
    print '\n'
Any help would be greatly appreciated.
The error I get is
File "C:\Users......", line 24, in (module)
print findPatTitle[i] # the title
IndexError:list of index out of range
Thank you.
It seems that you are not using all the power that bs4 can give you.
You are getting this error because the length of findPatTitle is just one, since an HTML document usually has only one title element.
A simple way to grab the title of an HTML page is using bs4 itself:
from bs4 import BeautifulSoup
from urllib import urlopen
webpage = urlopen('http://www.tidyawaytoday.co.uk/').read()
soup = BeautifulSoup(webpage)
# get the content of title
title = soup.title.text
You will probably get the same error if you iterate over findPatLink in the current way, since it has length 6. It isn't clear to me whether you want all the link elements or all the anchor elements, but sticking with the first idea, you can improve your code using bs4 again:
link_href_list = [link['href'] for link in soup.find_all("link")]
Finally, since you don't want some of the URLs, you can slice link_href_list however you like. An improved version of the last expression, which excludes the first and second results, could be:
link_href_list = [link['href'] for link in soup.find_all("link")[2:]]
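Putting both suggestions together, a minimal bs4-only sketch (same site and Python 2 style imports as in the question) might look like:
from bs4 import BeautifulSoup
from urllib import urlopen

webpage = urlopen('http://www.tidyawaytoday.co.uk/').read()
soup = BeautifulSoup(webpage)

print soup.title.text  # the page title

# hrefs of the link elements, skipping the first two as above
# (assumes each remaining link element has an href attribute)
for href in [link['href'] for link in soup.find_all("link")[2:]]:
    print href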

Trying to scrape information from an iterated list of links using Beautiful Soup or ElementTree

I'm attempting to scrape an xml database list of links for these addresses. (The 2nd link is an example page that actually contains some addresses. Many of the links don't.)
I'm able to retrieve the list of initial links I'd like to crawl through, but I can't seem to go one step further and extract the final information I'm looking for (addresses).
I assume there's an error in my syntax; I've tried scraping it using both Beautiful Soup and Python's built-in ElementTree library, but it doesn't work.
BSoup:
from bs4 import BeautifulSoup
import requests
import re
resultsdict = {}
companyname = []
url1 = 'http://www.agenzia-interinale.it/sitemap-5.xml'
html = requests.get(url1).text
bs = BeautifulSoup(html)
# find the links to companies
company_menu = bs.find_all('loc')
for company in company_menu:
    data = bs.find("html", {"i"})
    print data
Non 3rd Party:
import requests
import xml.etree.ElementTree as et
req = requests.get('http://www.agenzia-interinale.it/sitemap-5.xml')
root = et.fromstring(req.content)
for i in root:
    print i[0].text
Any input is appreciated! Thanks.
Your syntax is OK. You simply need to follow the links found on the first page; here's how it looks for the Milano pages:
from bs4 import BeautifulSoup
import requests
import re
resultsdict = {}
companyname = []
url1 = 'http://www.agenzia-interinale.it/sitemap-5.xml'
html = requests.get(url1).text
bs = BeautifulSoup(html)
company_menu = bs.find_all('loc')
for item in company_menu:
    if 'milano' in item.text:
        subpage = requests.get(item.text)
        subsoup = BeautifulSoup(subpage.text)
        adresses = subsoup.find_all(class_='riquadro_agenzia_off')
        for adress in adresses:
            companyname.append(adress.text)
print companyname
To get all the addresses you can simply remove the if 'milano' check in the code. I don't know if every subpage follows the same rules; for Milano the addresses are under a div with class="riquadro_agenzia_off", so if the other subpages are formatted the same way this should work. Anyway, this should get you started. Hope it helps.

BeautifulSoup freezes

I was messing around with BeautifulSoup and found that it occasionally takes an awfully long time to parse a page despite no changes in the code or connection whatsoever. Any ideas?
from bs4 import BeautifulSoup
from urllib2 import urlopen
#The particular state website:
site = "http://sfbay.craigslist.org/rea/"
html = urlopen(site)
print "Done"
soup = BeautifulSoup(html)
print "Done"
#Get first 100 list of postings:
postings = soup('p')
If for some reason you want to read the text within the <a> tags, you can do something like this:
postings = [x.text for x in soup.find("div", {"class":"content"}).findAll("a", {"class":"hdrlnk"})]
print(str(postings).encode('utf-8'))
This will return a list with the length of 100.
postings = soup('p')
This code is not good: BeautifulSoup has to check every element in the document, one by one, to see whether it is a p tag.
aTag = soup.findAll('a',class_='result_title hdrlnk')
for link in aTag:
    print(link.text)
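If you only want the first 100 postings, as in the original comment, the result can simply be sliced (a sketch; the result_title hdrlnk class comes from the answer above and may change as Craigslist updates its markup):
# take at most the first 100 result links and keep just their text
postings = [link.text for link in soup.findAll('a', class_='result_title hdrlnk')[:100]]
print(postings)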
