Scraping multiple HTML files to CSV - Python

I am trying to scrape rows off of over 1200 .htm files that are on my hard drive. On my computer they are here 'file:///home/phi/Data/NHL/pl07-08/PL020001.HTM'. These .htm files are sequential from *20001.htm until *21230.htm. My plan is to eventually toss my data in MySQL or SQLite via a spreadsheet app or just straight in if I can get a clean .csv file out of this process.
This is my first attempt at code (Python) and scraping, and I just installed Ubuntu 9.04 on my crappy Pentium IV. Needless to say, I am a newb and have hit some roadblocks.
How do I get mechanize to go through all the files in the directory in order? Can mechanize even do this? Can mechanize/Python/BeautifulSoup read a 'file:///'-style URL, or is there another way to point it to /home/phi/Data/NHL/pl07-08/PL020001.HTM? Is it smart to do this in 100 or 250 file increments, or should I just send all 1230?
I just need the rows that start with "<tr class="evenColor">" and end with "</tr>". Ideally I only want the rows that contain "SHOT"|"MISS"|"GOAL" within them, but I want the whole row (every column). Note that "GOAL" is in bold, so do I have to specify this? There are 3 tables per .htm file.
Also, I would like the name of the parent file (pl020001.htm) to be included in the rows I scrape so I can ID them in their own column in the final database. I don't even know where to begin for that. This is what I have so far:
#/usr/bin/python
from BeautifulSoup import BeautifulSoup
import re
from mechanize import Browser
mech = Browser()
url = "file:///home/phi/Data/NHL/pl07-08/PL020001.HTM"
##but how do I do multiple urls/files? PL02*.HTM?
page = mech.open(url)
html = page.read()
soup = BeautifulSoup(html)
##this confuses me and seems redundant
pl = open("input_file.html","r")
chances = open("chancesforsql.csv,"w")
table = soup.find("table", border=0)
for row in table.findAll 'tr class="evenColor"'
#should I do this instead of before?
outfile = open("shooting.csv", "w")
##how do I end it?
Should I be using IDLE or something like it, or just the terminal in Ubuntu 9.04?

You won't need mechanize. Since I don't know the exact HTML content, I'd first try to see what matches, like this:
import glob
from BeautifulSoup import BeautifulSoup

for filename in glob.glob('/home/phi/Data/*.htm'):
    soup = BeautifulSoup(open(filename, "r").read())  # assuming some HTML
    for a_tr in soup.findAll("tr", attrs={"class": "evenColor"}):
        print a_tr
Then pick the stuff you want and write it to stdout with commas (and redirect it with > to a file), or write the CSV directly from Python with the csv module.
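For a rough sketch of that last step, assuming the same Python 2 / BeautifulSoup 3 setup as above and reusing the shooting.csv name and directory from the question, you could filter the rows and tag each one with its source file like this:
import csv
import glob
import os
from BeautifulSoup import BeautifulSoup

out = open("shooting.csv", "wb")  # output name reused from the question
writer = csv.writer(out)

for filename in glob.glob('/home/phi/Data/NHL/pl07-08/*.HTM'):
    soup = BeautifulSoup(open(filename, "r").read())
    for a_tr in soup.findAll("tr", attrs={"class": "evenColor"}):
        # collect the text of every column; findAll(text=True) also picks up
        # text inside nested tags, so a bolded "GOAL" still comes through
        cells = ["".join(td.findAll(text=True)).strip() for td in a_tr.findAll("td")]
        # keep only rows that mention SHOT, MISS or GOAL anywhere
        if any(word in " ".join(cells) for word in ("SHOT", "MISS", "GOAL")):
            # first column: the parent file's name, for the database id
            writer.writerow([os.path.basename(filename)] + [c.encode("utf-8") for c in cells])

out.close()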

MYYN's answer looks like a great start to me. One thing I'd point out that I've had luck with is:
import glob

for file_name in glob.glob('/home/phi/Data/*.htm'):
    # read the file and then parse with BeautifulSoup
I've found both the os and glob imports to be really useful for running through files in a directory.
Also, once you're using a for loop in this way, you have file_name available, which you can modify for use in the output file so that the output filenames match the input filenames.
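For instance (a tiny sketch; the .csv extension here is only an assumption for illustration):
import glob
import os

for file_name in glob.glob('/home/phi/Data/*.htm'):
    # read the file and parse it with BeautifulSoup here ...
    # then derive a matching output name, e.g. PL020001.HTM -> PL020001.csv
    output_name = os.path.splitext(os.path.basename(file_name))[0] + ".csv"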

Related

Can't replace HTML text using BeautifulSoup

I have been trying to use the code made available here to edit HTML files using Python:
https://www.geeksforgeeks.org/how-to-modify-html-using-beautifulsoup/
# Python program to modify HTML
# with the help of Beautiful Soup

# Import the libraries
from bs4 import BeautifulSoup as bs
import os
import re

# Remove the last segment of the path
base = os.path.dirname(os.path.abspath(__file__))

# Open the HTML in which you want to make changes
html = open(os.path.join(base, 'gfg.html'))

# Parse HTML file in Beautiful Soup
soup = bs(html, 'html.parser')

# Give location where text is
# stored which you wish to alter
old_text = soup.find("p", {"id": "para"})

# Replace the already stored text with
# the new text which you wish to assign
new_text = old_text.find(text=re.compile(
    'Geeks For Geeks')).replace_with('Vinayak Rai')

# Alter HTML file to see the changes done
with open("gfg.html", "wb") as f_output:
    f_output.write(soup.prettify("utf-8"))
But nothing really happens. I tried changing the way the file is opened and changing the HTML file type, but it does nothing.
I'm not very practiced when it comes to programming, so I don't know how well I will be able to answer any questions, but I will try my best to give any relevant information.
Thank you for your time.
The code works fine when you have both files right next to each other in a single directory, with "Geeks For Geeks" present within a p tag with id "para":
<p id="para">Geeks For Geeks</p>
It also works when there are other tags nested within the enclosing p tag with id "para":
<p id="para"><strong>Geeks For Geeks</strong></p>
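To convince yourself of the second case, here is a quick standalone check (just an illustrative sketch using in-memory markup):
# find(text=...).replace_with(...) works whether the text sits directly
# in the <p> or inside a nested tag such as <strong>.
from bs4 import BeautifulSoup
import re

for markup in ('<p id="para">Geeks For Geeks</p>',
               '<p id="para"><strong>Geeks For Geeks</strong></p>'):
    soup = BeautifulSoup(markup, 'html.parser')
    para = soup.find("p", {"id": "para"})
    para.find(text=re.compile('Geeks For Geeks')).replace_with('Vinayak Rai')
    print(soup)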
If you are using a code editor (such as Atom or Sublime Text) you should be able to see the changes. In a plain text editor, the changes may not show up right away unless you manually reopen the file (and make sure you have not re-saved the old version after running the Python script).
So my suggestions are:
Keep them both in the same directory.
Close the HTML file before running the Python script.
After the script has been executed through cmd/bash (or the built-in IDE console), reload the web page.
Feel free to reach out if the issue still persists.
Thanks.

How should I automate this work with Python?

I'm a beginner at Python, but I know intermediate JavaScript. I have one project to complete; it is like a scraper, but I want to automate some of the work for me.
1) I have an Excel sheet with more than 1000 rows of data, and it also has URLs. I want to write code so that Python visits every URL from that Excel sheet and searches the first page for some predefined search texts (a list of texts).
2) If my code finds any of the texts on that web page, it should return true, else false.
I would like any idea or logic for doing this kind of process. Any help will make my head ache less 😅
It is very heavy work, which is not a very good idea to do in JavaScript; that's why I want to do it in Python.
An easy way to do this would be to get the requests module. Then learn how to use the csv module, which can read spreadsheets such as Excel sheets once they are exported to CSV. Then here is roughly what you want to do:
import csv
import requests

URLS = []

def GetUrlFromCSVFile():
    global URLS
    # Figure out how to get the links from the csv file, then append them to the URLS list

for url in URLS:
    r = requests.get(url)  # you should probably send some headers too
    if whatever_keyword_u_looking_for in r.text:
        print("Found")
    else:
        print("Not here")
I suggest the following:
Read about the csv library, to read the content of your spreadsheet (saved or exported as a CSV file).
Read about the requests library - to get the page's content from its URL.
Read about regular expressions in the re library.
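A rough sketch combining those three pieces (the urls.csv filename and the "url" column name are assumptions; export your Excel sheet to CSV first) might look like this:
import csv
import re
import requests

SEARCH_TEXTS = ["text one", "text two"]  # your predefined list of texts
pattern = re.compile("|".join(re.escape(t) for t in SEARCH_TEXTS), re.IGNORECASE)

with open("urls.csv", newline="") as f:        # assumed CSV export of the Excel sheet
    for row in csv.DictReader(f):              # assumes a column named "url"
        page = requests.get(row["url"], timeout=10).text
        found = bool(pattern.search(page))     # True if any search text is on the page
        print(row["url"], found)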

Loop to automatically web-scrape data from a few pages

I've been trying to figure out how to make a loop and couldn't work it out from other threads, so I need help. I am totally new to this, so editing existing code is hard for me.
I am trying to web-scrape data from a website. Here's what I've done so far, but I have to insert the pages "manually". I want it to automatically scrape prices in zl/m2 from pages 1 to 20, for example:
import requests
from bs4 import BeautifulSoup

link=("https://ogloszenia.trojmiasto.pl/nieruchomosci-mam-do-wynajecia/wi,100.html?strona=1")
page = requests.get(link).text
link1=("https://ogloszenia.trojmiasto.pl/nieruchomosci-mam-do-wynajecia/wi,100.html?strona=2")
page1 = requests.get(link1).text
link2=("https://ogloszenia.trojmiasto.pl/nieruchomosci-mam-do-wynajecia/wi,100.html?strona=3")
page2 = requests.get(link2).text

pages=page+page1+page2+page3+page4+page5+page6

soup = BeautifulSoup(pages, 'html.parser')
price_box = soup.findAll('p', attrs={'class':'list__item__details__info details--info--price'})

prices=[]
for i in range(len(price_box)):
    prices.append(price_box[i].text.strip())
prices
I've tried with this code, but got stuck. I don't know what I should add to get output from 20 pages at once, or how to save it to a csv file.
npages=20
baselink="https://ogloszenia.trojmiasto.pl/nieruchomosci-mam-do-wynajecia/wi,100.html?strona="
for i in range (1,npages+1):
    link=baselink+str(i)
    page = requests.get(link).text
Thanks in advance for any help.
Python is whitespace sensitive, so the code block of any loop needs to be indented, like so:
for i in range (1,npages+1):
    link=baselink+str(i)
    page = requests.get(link).text
If you want all of the pages in a single string (so you can use the same approach as with your pages variable above), you can append the strings together in your loop:
pages = ""
for i in range (1,npages+1):
    link=baselink+str(i)
    pages += requests.get(link).text
To create a csv file with your results, you can look into the csv.writer() function in Python's built-in csv module, but I usually find it easier to write to a file using print():
with open(samplefilepath, mode="w+") as output_file:
    for price in prices:
        print(price, file=output_file)
w+ tells Python to create the file if it doesn't exist and overwrite it if it does exist; a+ would append to the existing file if it exists.
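Putting those pieces together, a sketch of the whole flow (the prices.csv output path is just a placeholder) could look like this:
import csv
import requests
from bs4 import BeautifulSoup

npages = 20
baselink = "https://ogloszenia.trojmiasto.pl/nieruchomosci-mam-do-wynajecia/wi,100.html?strona="

prices = []
for i in range(1, npages + 1):
    page = requests.get(baselink + str(i)).text
    soup = BeautifulSoup(page, 'html.parser')
    # same selector as in the question, applied page by page
    for box in soup.findAll('p', attrs={'class': 'list__item__details__info details--info--price'}):
        prices.append(box.text.strip())

# placeholder output path; csv.writer expects one list per row
with open("prices.csv", mode="w+", newline="") as output_file:
    writer = csv.writer(output_file)
    for price in prices:
        writer.writerow([price])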

Trouble parsing HTML files (to CSV) using ElementTree XPath in Python

I am trying to parse a few thousand HTML files and dump the variables into a csv file (Excel spreadsheet). I've come up against several roadblocks; the first one was (thankfully) solved here, a few days ago. The (hopefully) final roadblock is this: I cannot get it to properly parse the files using XPath. Below is a brief explanation, the Python code, and an example of the HTML code.
The trouble starts here:
for node in tree.iter():
    name = node.attrib.get('/html/body/table/tbody/tr/td/table/tbody/tr[3]/td/table/tbody/tr[2]/td[2]/table/tbody/tr[1]/td[1]/p/span')
    if category == '/html/body/center/table/tbody/tr/td/table/tbody/tr[3]/td/table/tbody/tr[2]/td[2]/table/tbody/tr[1]/td[1]/font':
        category = node.text
It runs, but does not parse. I do not get any traceback errors.
I think I am misunderstanding the logic of parsing with ElementTree.
There are several headers that are the same; it is therefore difficult to find a unique id/header. Here is an example of the HTML:
<span class="s1">Business: Give Back to the Community and Save Money
on Equipment, Technology, Promotional Products, and Market<span
class="Apple-converted-space"> </span></span>
For which the xpath is:
/html/body/table/tbody/tr/td/table/tbody/tr[3]/td/table/tbody/tr[2]/td[2]
/table/tbody/tr[1]/td[1]/p/span
I would like to scrape the text from this span (among others) and put it in the excel spreadsheet.
You can see an example of a similar page HERE
At any rate, because many spans/headers are not uniquely identified, I think I should use XPath. However, I have yet to figure out how to successfully use XPath commands with ElementTree. In searching the documentation, the answer to this question (as well as the logic) eludes me. I have read up on http://lxml.de/parsing.html as well as on this site and have yet to find something that works.
So far, the code iterates through all the files (in Dropbox) nicely. It also creates the csv file and creates the headers (though not in separate columns, only as one line separated by semicolons, but that should be easy to fix).
In sum, I would like it to parse the text from different lines in each file (webpage) and dump it into the Excel file.
Any input would be greatly appreciated.
The python code:
import xml.etree.ElementTree as ET
import csv, codecs, os
from cStringIO import StringIO
# Note: you need to download and install this..
import unicodecsv
import lxml.html

# TODO: make into command line params (instead of constant)
CSV_FILE='output.csv'
HTML_PATH='/Users/C/data/Folder_NS'

f = open(CSV_FILE, 'wb')
w = unicodecsv.writer(f, encoding='utf-8', delimiter=';')
w.writerow(['file', 'category', 'about', 'title', 'subtitle', 'date', 'bodyarticle'])

# redundant declarations:
category=''
about=''
title=''
subtitle=''
date=''
bodyarticle=''

print "headers created"
allFiles = os.listdir(HTML_PATH)
#with open(CSV_FILE, 'wb') as csvfile:
print "all defined"

for file in allFiles:
    #print allFiles
    if '.html' in file:
        print "in html loop"
        tree = lxml.html.parse(HTML_PATH+"/"+file)
        print '===================='
        print 'Parsing file: '+file
        print '===================='
        for node in tree.iter():
            name = node.attrib.get('/html/body/table/tbody/tr/td/table/tbody/tr[3]/td/table/tbody/tr[2]/td[2]/table/tbody/tr[1]/td[1]/p/span')
            if category == '/html/body/center/table/tbody/tr/td/table/tbody/tr[3]/td/table/tbody/tr[2]/td[2]/table/tbody/tr[1]/td[1]/font':
                print 'Category:'
                category = node.text

f.close()
14 June 2015 (most recent change): I have just changed this section
for node in tree.iter():
    name = node.attrib.get('/html/body/table/tbody/tr/td/table/tbody/tr[3]/td/table/tbody/tr[2]/td[2]/table/tbody/tr[1]/td[1]/p/span')
    if category == '/html/body/center/table/tbody/tr/td/table/tbody/tr[3]/td/table/tbody/tr[2]/td[2]/table/tbody/tr[1]/td[1]/font':
        print 'Category:'
        category = node.text
to this:
for node in tree.iter():
    row = dict.fromkeys(cols)
    Category_name = tree.xpath('/html/body/table/tbody/tr/td/table/tbody/tr[3]/td/table/tbody/tr[2]/td[2]/table/tbody/tr[1]/td[1]/p/span')
    row['category'] = Category_name[0].text_content().encode('utf-8')
It still runs, but does not parse.
Try the following code:
from lxml import etree
import requests
from StringIO import StringIO
data = requests.get('http://www.usprwire.com/Detailed/Banking_Finance_Investment/Confused.com_reveals_that_Life_Insurance_is_more_than_a_form_of_future_protection_284764.shtml').content
parser = etree.HTMLParser()
root = etree.parse(StringIO(data), parser)
category = root.xpath('//table/td/font/text()')
print category[0]
It uses the requests library to download the HTML code of the page; you can choose whatever method fits your needs. The important part is the XPath, which searches for any <table> followed by a <td> followed by a <font>. It returns a list with two elements; the second one is blank characters and the first one contains the text.
Running it yields just the sentence you are looking for:
Banking, Finance & Investment: Confused.com reveals that Life Insurance is more than a form of future protection
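To fold that back into the per-file loop from your question, a sketch (Python 2, keeping your unicodecsv setup, and assuming the same //table/td/font XPath also matches in the local files) might look like:
import os
import lxml.html
import unicodecsv

CSV_FILE = 'output.csv'
HTML_PATH = '/Users/C/data/Folder_NS'

f = open(CSV_FILE, 'wb')
w = unicodecsv.writer(f, encoding='utf-8', delimiter=';')
w.writerow(['file', 'category'])

for filename in os.listdir(HTML_PATH):
    if '.html' in filename:
        root = lxml.html.parse(HTML_PATH + "/" + filename)
        hits = root.xpath('//table/td/font/text()')
        # first match is the category text, if the xpath matched at all
        category = hits[0].strip() if hits else ''
        w.writerow([filename, category])

f.close()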

Reading 1000s of XML documents with BeautifulSoup

I'm trying to read a bunch of xml files and do stuff to them. The first thing I want to do is rename them based on a number that's inside the file.
You can see a sample of the data here (warning: this will initiate a download of a 108 MB zip file!). That's a huge XML file with thousands of smaller XML files inside it. I've broken those out into individual files. I want to rename the files based on a number inside (part of preprocessing). I have the following code:
from __future__ import print_function
from bs4 import BeautifulSoup  # To get everything
import os

def rename_xml_files(directory):
    xml_files = [xml_file for xml_file in os.listdir(directory)]
    for filename in xml_files:
        filename = filename.strip()
        full_filename = directory + "/" + filename
        print(full_filename)
        f = open(full_filename, "r")
        xml = f.read()
        soup = BeautifulSoup(xml)
        del xml
        del soup
        f.close()
If I comment out the "soup =" and "del" lines, it works perfectly. If I add the "soup = ..." line, it will work for a moment and then eventually crap out; it just crashes the Python kernel. I'm using Enthought Canopy, but I've tried running it from the command line and it craps out there, too.
I thought, perhaps, it was not deallocating the space for the variable "soup" so I tried adding the "del" commands. Same problem.
Any thoughts on how to circumvent this? I'm not stuck on BS. If there's a better way of doing this, I would love it, but I need a little sample code.
Try using cElementTree.parse() from Python's standard xml library instead of BeautifulSoup. 'Soup is great for parsing normal web pages, but cElementTree is blazing fast.
Like this:
import xml.etree.cElementTree as cET
# ...

def rename_xml_files(directory):
    xml_files = [xml_file for xml_file in os.listdir(directory)]
    for filename in xml_files:
        filename = filename.strip()
        full_filename = directory + "/" + filename
        print(full_filename)
        parsed = cET.parse(full_filename)
        del parsed
If your XML is formatted correctly, this should parse it. If your machine is still unable to handle all that data in memory, you should look into streaming the XML.
I would not separate that file out into many small files and then process them some more; I would process them all in one go.
I would just use a streaming-API XML parser, parse the master file, get the name, and write out the sub-files once with the correct name.
There is no need for BeautifulSoup, which is primarily designed to handle HTML and uses a document model instead of a streaming parser.
For what you are doing, there is no need to build an entire DOM all at once just to get a single element.
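As a rough illustration of that approach, a sketch using iterparse from the standard library might look like the following; the tag names "record" and "number" are assumptions, since the real element names in the master file aren't shown here.
import xml.etree.cElementTree as cET

def split_master(master_path, out_dir):
    # Stream through the master file; each element is handed to us as soon
    # as its closing tag is seen, so memory use stays flat.
    for event, elem in cET.iterparse(master_path, events=("end",)):
        if elem.tag == "record":                 # hypothetical per-document tag
            number = elem.findtext("number")     # hypothetical element holding the id
            with open("%s/%s.xml" % (out_dir, number), "wb") as out:
                out.write(cET.tostring(elem))
            elem.clear()                         # free the element we just wrote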
