I have been trying to use the code made available here to edit HTML files using Python:
https://www.geeksforgeeks.org/how-to-modify-html-using-beautifulsoup/
# Python program to modify HTML
# with the help of Beautiful Soup
# Import the libraries
from bs4 import BeautifulSoup as bs
import os
import re
# Remove the last segment of the path
base = os.path.dirname(os.path.abspath(__file__))
# Open the HTML in which you want to make changes
html = open(os.path.join(base, 'gfg.html'))
# Parse HTML file in Beautiful Soup
soup = bs(html, 'html.parser')
# Give location where text is
# stored which you wish to alter
old_text = soup.find("p", {"id": "para"})
# Replace the already stored text with
# the new text which you wish to assign
new_text = old_text.find(text=re.compile(
'Geeks For Geeks')).replace_with('Vinayak Rai')
# Alter HTML file to see the changes done
with open("gfg.html", "wb") as f_output:
    f_output.write(soup.prettify("utf-8"))
But nothing really happens. I tried changing the way the file is opened and changing the HTML file type, but neither made a difference.
I'm not very practiced when it comes to programming, so I don't know how well I will be able to answer questions, but I will try my best to give any relevant information.
Thank you for your time.
The code works fine when the script and the HTML file sit next to each other in the same directory.
It works when "Geeks For Geeks" appears directly inside a p tag with id "para":
<p id="para">Geeks For Geeks</p>
It also works when other tags sit inside the enclosing p tag with id "para":
<p id="para"><strong>Geeks For Geeks</strong></p>
If you are using a code editor (such as Atom or Sublime), you should be able to see the changes right away. A plain text editor may not show them until you manually reopen the file (make sure you have not saved the file from the editor after running the Python script, or the stale copy will overwrite the changes).
So my suggestion is:
Keep them both in the same directory.
Close the HTML file before running the Python script.
After the script has been executed through cmd/bash (or built-in IDE console), reload the web page.
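One detail worth checking in the question's script: it reads gfg.html from the script's own directory (through base) but writes to "gfg.html" relative to the current working directory, so when the script is started from another directory the changes can end up in a different file. Below is a minimal sketch that reads and writes the same path. It assumes Python 3 with the bs4 package installed; the function name replace_paragraph_text and its parameters are made up for illustration.

```python
import re
from pathlib import Path

from bs4 import BeautifulSoup


def replace_paragraph_text(html_path, old, new, element_id="para"):
    """Replace `old` with `new` inside <p id=element_id> and save to the same file.

    Hypothetical helper: returns True if a replacement was made, False otherwise.
    """
    path = Path(html_path)
    soup = BeautifulSoup(path.read_text(encoding="utf-8"), "html.parser")

    paragraph = soup.find("p", {"id": element_id})
    if paragraph is None:
        return False  # no such paragraph, leave the file untouched

    # `string=` is the current name of the older `text=` argument.
    target = paragraph.find(string=re.compile(re.escape(old)))
    if target is None:
        return False

    target.replace_with(target.replace(old, new))
    # Write back to the exact path that was read, not to a cwd-relative name.
    path.write_text(soup.prettify(), encoding="utf-8")
    return True
```

From a script it could be called as replace_paragraph_text(Path(__file__).resolve().parent / "gfg.html", "Geeks For Geeks", "Vinayak Rai"), which works no matter which directory the script is launched from.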
Feel free to reach out if the issue persists.
Thanks.
I'm building my first website using Flask and HTML. Some of the data that I want to migrate to this website is in Markdown format. I am trying to convert the Markdown into HTML using this library, but I cannot get my head around it:
https://github.com/Python-Markdown/markdown
I import it into my *.py file, but I'm not sure what the next steps are. This is what I have so far:
from markdown import markdown
html = markdown.markdown(text)
I'm not sure what should be put into the "text" variable. Also, my Markdown data resides in an HTML file; how do I reference that from here? I have read through the installation guide, but it's not very clear to me.
Thank you.
According to the docs located at https://python-markdown.github.io/reference/#using-markdown-as-a-python-library
text is supposed to contain your Markdown text. In the example below, taken from the docs, some_file.txt would be the file containing your Markdown.
import codecs
import markdown  # import the module itself, so that markdown.markdown() resolves
input_file = codecs.open("some_file.txt", mode="r", encoding="utf-8")
text = input_file.read()
html = markdown.markdown(text)
To get your text, you would need to parse it out of the HTML. There are several ways of doing this, but we would need more information about the file to proceed. Is your HTML file stored locally? Where in the file is the Markdown? A minimal reproducible example (MRE) would be helpful.
I've been trying to figure out how to write a loop and couldn't work it out from other threads, so I need help. I am totally new to this, so editing existing code is hard for me.
I am trying to scrape data from a website. Here's what I've done so far, but I have to insert the pages "manually". I want it to automatically scrape prices in zl/m2 from, for example, pages 1 to 20:
import requests
from bs4 import BeautifulSoup
link=("https://ogloszenia.trojmiasto.pl/nieruchomosci-mam-do-wynajecia/wi,100.html?strona=1")
page = requests.get(link).text
link1=("https://ogloszenia.trojmiasto.pl/nieruchomosci-mam-do-wynajecia/wi,100.html?strona=2")
page1 = requests.get(link1).text
link2=("https://ogloszenia.trojmiasto.pl/nieruchomosci-mam-do-wynajecia/wi,100.html?strona=3")
page2 = requests.get(link2).text
pages=page+page1+page2+page3+page4+page5+page6
soup = BeautifulSoup(pages, 'html.parser')
price_box = soup.findAll('p', attrs={'class':'list__item__details__info details--info--price'})
prices=[]
for i in range(len(price_box)):
    prices.append(price_box[i].text.strip())
prices
I've tried this code but got stuck. I don't know what I should add to get output from 20 pages at once, or how to save it to a CSV file.
npages=20
baselink="https://ogloszenia.trojmiasto.pl/nieruchomosci-mam-do-wynajecia/wi,100.html?strona="
for i in range (1,npages+1):
link=baselink+str(i)
page = requests.get(link).text
Thanks in advance for any help.
Python is whitespace sensitive, so the code block of any loops needs to be indented, like so:
for i in range (1,npages+1):
    link=baselink+str(i)
    page = requests.get(link).text
If you want all of the pages in a single string (so you can use the same approach as with your pages variable above), you can append the strings together in your loop:
pages = ""
for i in range (1,npages+1):
    link=baselink+str(i)
    pages += requests.get(link).text
To create a CSV file with your results, you can look into the csv.writer() function in Python's built-in csv module, but I usually find it easier to write to a file using print():
with open(samplefilepath, mode="w+") as output_file:
    for price in prices:
        print(price, file=output_file)
The w+ mode tells Python to create the file if it doesn't exist and to overwrite it if it does (plain w does the same when you only need to write). a+ would instead append to the file if it exists.
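For comparison, here is a minimal sketch of the csv.writer() route mentioned above. It assumes prices is a list of strings like the one built in the question; the function name save_prices and the "price" header are made up for illustration. One practical difference from print() is that csv.writer quotes a value that itself contains a comma, so it stays in a single column.

```python
import csv


def save_prices(prices, csv_path):
    """Write one price per row under a header row (hypothetical helper)."""
    # newline="" lets the csv module control line endings itself.
    with open(csv_path, mode="w", newline="", encoding="utf-8") as output_file:
        writer = csv.writer(output_file)
        writer.writerow(["price"])
        for price in prices:
            writer.writerow([price])
```

After the scraping loop this would be called as save_prices(prices, "prices.csv"), where the file name is only an example.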
I basically have to make a program that takes a user-input web address and parses the HTML to find links, then stores all the links in another HTML file in a certain format. I only have access to built-in Python modules (Python 3). I'm able to get the HTML code from the link using urllib.request and put it into a string. How would I actually go about extracting links from this string and putting them into an array of strings? Also, would it be possible to identify link types (such as an image link or an mp3 link) so I can put them into different arrays (then I could categorize them when I'm creating the output file)?
You can use the re module to search the HTML text for links. In particular, the findall function returns every match.
As for sorting by file type, that depends on whether the URL actually contains the extension (e.g. .mp3, .js, .jpeg).
You could use a simple for loop like this:
import re
html = getHTMLText()  # placeholder: however you obtain the HTML string
mp3s = []
other = []
for match in re.findall('<reexpression>', html):  # placeholder pattern
    if match.endswith('.mp3'):
        mp3s.append(match)
    else:
        other.append(match)
Try using the html.parser module or the re module; either can help you do that.
I think you can do it with a regex like this:
r'http[s]?://[^\s<>"]+|www\.[^\s<>"]+'
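As a rough sketch of how the regex and the extension check could fit together using only built-in modules: the category names, the extension sets and the function name categorize_links are made up for illustration, and the pattern only finds absolute URLs (relative links such as "page.html" are missed, and single-quoted attributes would leave a trailing quote in the match).

```python
import os
import re
from urllib.parse import urlparse

# Pattern from the answer above; the dot in "www." is escaped so it
# matches a literal dot.
URL_PATTERN = re.compile(r'https?://[^\s<>"]+|www\.[^\s<>"]+')

# Hypothetical grouping of extensions; extend as needed.
CATEGORIES = {
    "images": {".jpg", ".jpeg", ".png", ".gif"},
    "audio": {".mp3", ".wav", ".ogg"},
}


def categorize_links(html):
    """Return a dict mapping category name -> list of URLs found in `html`."""
    groups = {name: [] for name in CATEGORIES}
    groups["other"] = []
    for url in URL_PATTERN.findall(html):
        # Look at the path only, so a "?query" part does not hide the extension.
        extension = os.path.splitext(urlparse(url).path)[1].lower()
        for name, extensions in CATEGORIES.items():
            if extension in extensions:
                groups[name].append(url)
                break
        else:
            # No category claimed this extension.
            groups["other"].append(url)
    return groups
```

Each list in the returned dict can then be written to its own section of the output HTML file.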
Hello, I am trying to write a Python function that saves a list of URLs to a .txt file.
Example: visit http://forum.domain.com/ and save every URL containing viewtopic.php?t= to a .txt file:
http://forum.domain.com/viewtopic.php?t=1333
http://forum.domain.com/viewtopic.php?t=2333
I use this code, but it does not save anything.
I am very new to Python. Can someone help me create this?
web_obj = opener.open('http://forum.domain.com/')
data = web_obj.read()
fl_url_list = open('urllist.txt', 'r')
url_arr = fl_url_list.readlines()
fl_url_list.close()
This is far from trivial and can have quite a few corner cases (I suppose the page you're referring to is a web page)
To give you a few pointers, you need to:
Download the web page: you're already doing this (in data).
Extract the URLs: this is the hard part. Most probably you'll want to use an HTML parser, extract the <a> tags, fetch each href attribute and put it into a list, then filter that list to keep only the URLs formatted the way you want (say, those containing viewtopic). Let's say you end up with urlList.
Open a file for writing text (thus mode wt, not r).
Write the content: f.write('\n'.join(urlList))
Close the file.
I advise you to try following these steps and to ask specific questions when you're stuck on a particular issue.
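The extraction and writing steps above can be sketched with Python 3's built-in html.parser module. This is only one possible approach: the class name LinkCollector, the helper names and the marker default are made up for illustration, and the page is assumed to have already been downloaded and decoded to a string.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin


class LinkCollector(HTMLParser):
    """Collect the href of every <a> tag, resolved against a base URL."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        for name, value in attrs:
            if name == "href" and value:
                # Turn relative links such as "viewtopic.php?t=1" into full URLs.
                self.links.append(urljoin(self.base_url, value))


def extract_topic_urls(html, base_url, marker="viewtopic.php?t="):
    """Return the links in `html` whose URL contains `marker`."""
    parser = LinkCollector(base_url)
    parser.feed(html)
    return [url for url in parser.links if marker in url]


def save_urls(urls, path):
    """Write one URL per line."""
    # "w" opens the file for writing text, replacing any previous content.
    with open(path, "w", encoding="utf-8") as f:
        f.write("\n".join(urls))
```

Since read() on a downloaded page returns bytes, the call would look something like extract_topic_urls(data.decode("utf-8", errors="replace"), "http://forum.domain.com/"); the UTF-8 encoding is an assumption about the forum's pages.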
I am trying to scrape rows off of over 1200 .htm files that are on my hard drive. On my computer they are here 'file:///home/phi/Data/NHL/pl07-08/PL020001.HTM'. These .htm files are sequential from *20001.htm until *21230.htm. My plan is to eventually toss my data in MySQL or SQLite via a spreadsheet app or just straight in if I can get a clean .csv file out of this process.
This is my first attempt at code (Python), scraping, and I just installed Ubuntu 9.04 on my crappy pentium IV. Needless to say I am newb and have some roadblocks.
How do I get mechanize to go through all the files in the directory in order? Can mechanize even do this? Can mechanize/Python/BeautifulSoup read a 'file:///' style URL, or is there another way to point it to /home/phi/Data/NHL/pl07-08/PL020001.HTM? Is it smart to do this in 100 or 250 file increments, or should I just send all 1230?
I just need rows that start with this "<tr class="evenColor">" and end with this "</tr>". Ideally I only want the rows that contain "SHOT"|"MISS"|"GOAL" within them but I want the whole row (every column). Note that "GOAL" is in bold so do I have to specify this? There are 3 tables per htm file.
Also I would like the name of the parent file (pl020001.htm) to be included in the rows I scrape so I can id them in their own column in the final database. I don't even know where to begin for that. This is what I have so far:
#/usr/bin/python
from BeautifulSoup import BeautifulSoup
import re
from mechanize import Browser
mech = Browser()
url = "file:///home/phi/Data/NHL/pl07-08/PL020001.HTM"
##but how do I do multiple urls/files? PL02*.HTM?
page = mech.open(url)
html = page.read()
soup = BeautifulSoup(html)
##this confuses me and seems redundant
pl = open("input_file.html","r")
chances = open("chancesforsql.csv,"w")
table = soup.find("table", border=0)
for row in table.findAll 'tr class="evenColor"'
#should I do this instead of before?
outfile = open("shooting.csv", "w")
##how do I end it?
Should I be using IDLE or something like it? just Terminal in Ubuntu 9.04?
You won't need mechanize. Since I don't know the exact HTML content, I'd first try to see what matches, like this:
import glob
from BeautifulSoup import BeautifulSoup
for filename in glob.glob('/home/phi/Data/*.htm'):
    soup = BeautifulSoup(open(filename, "r").read()) # assuming some HTML
    for a_tr in soup.findAll("tr", attrs={ "class" : "evenColor" }):
        print a_tr
Then pick out the fields you want and write them to stdout separated by commas (and redirect the output to a file with >). Or write the CSV directly from Python.
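A rough sketch of the "write the CSV from Python" route follows. It is written for Python 3 with the bs4 package, not the older BeautifulSoup 3 import and print statement used above. It assumes each wanted row is a tr with class "evenColor" whose cells are plain td elements; the real files may nest tables, which would need extra handling. The function names are made up for illustration.

```python
import csv
import re

from bs4 import BeautifulSoup

# The three event types the question asks for, matched as whole words.
EVENTS = re.compile(r"\b(SHOT|MISS|GOAL)\b")


def event_rows(html, source_name):
    """Return one list per matching row: the source file name, then each cell's text."""
    soup = BeautifulSoup(html, "html.parser")
    rows = []
    for tr in soup.find_all("tr", class_="evenColor"):
        cells = [td.get_text(" ", strip=True) for td in tr.find_all("td")]
        # get_text() ignores markup, so a bold GOAL needs no special handling.
        if any(EVENTS.search(cell) for cell in cells):
            rows.append([source_name] + cells)
    return rows


def write_rows(rows, csv_path):
    """Write all collected rows to a CSV file."""
    with open(csv_path, "w", newline="", encoding="utf-8") as f:
        csv.writer(f).writerows(rows)
```

Passing the file's base name as source_name puts the parent file into the first column of every row, which covers the wish to identify rows by file in the final database.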
MYYN's answer looks like a great start to me. One thing I'd point out that I've had luck with is:
import glob
for file_name in glob.glob('/home/phi/Data/*.htm'):
    # read the file and then parse with BeautifulSoup
    pass
I've found both the os and glob imports to be really useful for running through files in a directory.
Also, once you're using a for loop in this way, you have the file_name which you can modify for use in the output file, so that the output filenames will match the input filenames.
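A small sketch of that idea, using only the standard library. The function names are made up for illustration; the pattern matches both .htm and .HTM, since the files in the question use the upper-case extension. Because the file names are zero-padded (PL020001 to PL021230), sorting by name also gives the sequential order.

```python
import glob
import os


def list_game_files(directory):
    """Return the .htm/.HTM files in `directory`, sorted by name."""
    # The character classes accept either case for each letter.
    pattern = os.path.join(directory, "*.[hH][tT][mM]")
    return sorted(glob.glob(pattern))


def output_name_for(input_path, suffix=".csv"):
    """Map e.g. "/data/PL020001.HTM" to "PL020001.csv"."""
    stem, _ = os.path.splitext(os.path.basename(input_path))
    return stem + suffix
```

With 1230 small local files there is no obvious need to work in batches; a single loop over list_game_files("/home/phi/Data/NHL/pl07-08") should do.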