Modify HTML text using bs4 - Python

I'm writing a script to translate the visible text of an HTML file from English to another language. I can extract the text, but I don't know how to put it back where it came from.
I'd like to know if there is a way, using bs4 or another scraping library, to grab a block of text or something human readable, modify it, and then put it right back where it came from, something like:
with open('../folder/index.html') as inf:
    txt = inf.read()
soup = bs4.BeautifulSoup(txt)
# for each block of text in the soup, extract it, translate it and put it back
with open('../folder/new_index.html', 'w') as f:
    f.write(str(soup))
Is there any way to do this?

Currently your whole HTML file is held in the variable soup. Try something like this:
# make a list of lines
soupList = str(soup).split("\n")
# get the line that you want to modify
lineToModify = soupList[<indexOfLine>]
# do something with the line
modifiedLine = lineToModify + "hello"
# and put it back in the list
soupList[<indexOfLine>] = modifiedLine
# put the html file together again and write it
soup = "\n".join(soupList)
with open('../folder/new_index.html', 'w') as f:
    f.write(soup)
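If you want to operate on the visible text itself rather than on raw lines, bs4 can also swap text nodes in place. A minimal sketch, assuming the html.parser backend and a hypothetical translate() stand-in for whatever translation call you actually use:
import bs4

def translate(text):
    # hypothetical placeholder: call your translation service here
    return text

with open('../folder/index.html') as inf:
    soup = bs4.BeautifulSoup(inf.read(), 'html.parser')

# walk every text node, skipping tags whose text is never rendered
for node in soup.find_all(string=True):
    if node.parent.name in ('script', 'style'):
        continue
    if node.strip():
        node.replace_with(translate(str(node)))

with open('../folder/new_index.html', 'w') as f:
    f.write(str(soup))
replace_with() swaps each text node in place, so the surrounding markup stays untouched.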

Related

How do I insert an HTML file as the content of a chapter when using ebooklib?

I'm making an EPUB using the EbookLib library and I'm following along with their documentation. I am trying to set the content of a chapter to be the content of an HTML file. The only way I got it to work was giving plain HTML when setting the content.
c1 = epub.EpubHtml(title='Chapter one', file_name='ch1.xhtml', lang='en')
c1.set_content(u'<html><body><h1>Introduction</h1><p>Introduction paragraph.</p></body></html>')
Is it possible to give an HTML file to be the content of the chapter?
I've tried things like c1.set_content(file_name='ch1.xhtml'), but that didn't work; it only accepts plain HTML.
I figured it out! I'm opening and reading the file into a variable and then passing that variable to the set_content function. Posting this so it may be of use to someone in the future.
with open('ch1.xhtml', 'r') as file:
    lines = file.read()
c1.set_content(lines)

Loop to automatically scrape data from several pages

I've been trying to figure out how to write a loop, and since I couldn't adapt one from other threads, I need help. I am totally new to this, so editing existing code is hard for me.
I am trying to scrape data from a website. Here's what I've done so far, but I have to insert the pages "manually". I want it to automatically scrape prices in zl/m2 from pages 1 to 20, for example:
import requests
from bs4 import BeautifulSoup

link = "https://ogloszenia.trojmiasto.pl/nieruchomosci-mam-do-wynajecia/wi,100.html?strona=1"
page = requests.get(link).text
link1 = "https://ogloszenia.trojmiasto.pl/nieruchomosci-mam-do-wynajecia/wi,100.html?strona=2"
page1 = requests.get(link1).text
link2 = "https://ogloszenia.trojmiasto.pl/nieruchomosci-mam-do-wynajecia/wi,100.html?strona=3"
page2 = requests.get(link2).text
pages = page + page1 + page2
soup = BeautifulSoup(pages, 'html.parser')
price_box = soup.findAll('p', attrs={'class': 'list__item__details__info details--info--price'})
prices = []
for i in range(len(price_box)):
    prices.append(price_box[i].text.strip())
prices
I've tried this code, but got stuck. I don't know what I should add to get output from all 20 pages at once, or how to save it to a CSV file.
npages = 20
baselink = "https://ogloszenia.trojmiasto.pl/nieruchomosci-mam-do-wynajecia/wi,100.html?strona="
for i in range(1, npages + 1):
    link = baselink + str(i)
    page = requests.get(link).text
Thanks in advance for any help.
Python is whitespace sensitive, so the body of any loop needs to be indented, like so:
for i in range(1, npages + 1):
    link = baselink + str(i)
    page = requests.get(link).text
If you want all of the pages in a single string (so you can use the same approach as with your pages variable above), you can append the strings together in your loop:
pages = ""
for i in range (1,npages+1):
link=baselink+str(i)
pages += requests.get(link).text
To create a CSV file with your results, you can look into csv.writer() in Python's built-in csv module, but I usually find it easier to write to a file using print():
with open(samplefilepath, mode="w+") as output_file:
    for price in prices:
        print(price, file=output_file)
w+ tells Python to create the file if it doesn't exist and to overwrite it if it does; a+ would append to the file if it already exists.
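Putting the pieces together, here is a minimal sketch of the whole job using csv.writer (the CSS class name comes from your snippet above; the output filename prices.csv is just an example):
import csv
import requests
from bs4 import BeautifulSoup

npages = 20
baselink = "https://ogloszenia.trojmiasto.pl/nieruchomosci-mam-do-wynajecia/wi,100.html?strona="

prices = []
for i in range(1, npages + 1):
    # fetch and parse each page separately instead of concatenating raw HTML
    soup = BeautifulSoup(requests.get(baselink + str(i)).text, 'html.parser')
    for box in soup.find_all('p', attrs={'class': 'list__item__details__info details--info--price'}):
        prices.append(box.text.strip())

with open('prices.csv', mode='w', newline='') as output_file:
    writer = csv.writer(output_file)
    writer.writerow(['price'])  # header row
    for price in prices:
        writer.writerow([price])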

Downloading a web page and searching its text with Python

I'm trying to scrape specific text from a website. Because I'm new to Python, I find it difficult to do the scraping in a single script, so I used this code first:
import urllib
import requests
from bs4 import BeautifulSoup
htmltext = urllib.urlopen("https://io.winmasters.com/Feeds/api/event/282576?lang=el").read()
data = htmltext
soup = BeautifulSoup(data)
f = open('/Desktop/text.txt', 'w')
f.write(data)
f.close()
and next I'm trying to write a script to search the text and print specific words:
with open("/Desktop/text.txt") as openfile:
for line in openfile:
for part in line.split():
if "odds=" in part:
print part
but the search script doesn't return the text I'm searching for. Any suggestions please?
If you simply want the values associated with the odds key, without any context at all, you could do the following:
import urllib
from json import loads # JSON parser
jsontext = urllib.urlopen("https://io.winmasters.com/Feeds/api/event/282576?lang=el").read()
data = loads(jsontext) # Parse the JSON
odds = [[b['odds'] for b in a['children']] for a in data['children']]
The nested list comprehension takes advantage of the structure of the data. An advantage of working with the parsed data structure is that you can do quite rich analytics without much effort. If you wanted other info in addition to the odds, then this would probably be better implemented as a nested for-loop.
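For instance, a sketch of the for-loop version, assuming the same children/odds structure as in the list comprehension above; it keeps each child dict alongside its odds value so the other fields stay available:
odds = []
for event in data['children']:
    for market in event['children']:
        # keep the whole dict next to the odds value for later analysis
        odds.append((market['odds'], market))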
How about:
import sys
from bs4 import BeautifulSoup
import mechanize

def viewPage(url):
    browser = mechanize.Browser()
    browser.set_handle_robots(False)
    browser.addheaders = [('user-agent', 'Mozilla/5.0')]
    page = browser.open(url)
    source_code = page.read()
    soup = BeautifulSoup(source_code)
    info = soup.findAll("insert what you want to locate")
    print(info)

viewPage("http://www.xkcd.com")
I have a program that, when you choose a webpage, reads off all the links, chooses one at random, and goes to it, doing the same again. It basically crawls across the web. The code above is a modified excerpt.

Using Python regular expressions to sub between two files

Basically I'm trying to read text from a text file, use a regular expression to sub it into something else, and then write it to an HTML file.
Here's a snippet of what I have:
from re import sub

def markup():
    ## sub code here
    sub('[a-z]+', 'test', file_contents)
The problem seems to be with that sub line.
The code below (part of the same function) needs to write an HTML file with the subbed text.
## write the HTML file
opfile = open(output_file, 'w')
opfile.write('<html>\n')
opfile.write('<head>\n')
opfile.write('<title>')
opfile.write(file_title)
opfile.write('</title>\n')
opfile.write('</head>\n')
opfile.write('<body>\n')
opfile.write(file_contents)
opfile.write('</body>\n')
opfile.write('</html>')
opfile.close()
The function here is designed so I can take text out of multiple files. After calling the markup function, I can copy everything after file_contents except for the stuff in brackets, which I would replace with the names of the other files.
def content_func():
    global file_contents
    global file_title
    global output_file
    file_contents = open('example.txt', 'U').read()
    file_title = 'example'
    output_file = 'example.html'
    markup()

content_func()
example.txt is just a text file containing the text "the quick brown fox jumps over the lazy dog". What I'm hoping to achieve is to search text for specific markup language and replace it with HTML markup, but I've simplified it here to help me figure it out.
Running this code should theoretically create an HTML file called example.html with a title and text saying "test"; however, this is not the case. I'm not familiar with regular expressions and they are driving me crazy. Can anyone please suggest what I should do with the regular expression sub?
EDIT: the code doesn't produce any errors, but the output HTML file lacks any substituted text. So the sub is reading the external text file, but the result isn't making it into the output HTML file.
You never save the result of sub(). Replace
sub('[a-z]+', 'test', file_contents)
with this
file_contents = sub('[a-z]+', 'test', file_contents)
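For context, re.sub() returns a new string because Python strings are immutable; the original is never modified in place. A condensed sketch of the fixed function, assuming the same globals set by content_func():
from re import sub

def markup():
    # sub() returns a new string; it must be assigned to be kept
    subbed = sub('[a-z]+', 'test', file_contents)
    with open(output_file, 'w') as opfile:
        opfile.write('<html>\n<head>\n<title>' + file_title + '</title>\n</head>\n')
        opfile.write('<body>\n' + subbed + '\n</body>\n</html>')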

Python: save a URL list in a txt file

Hello, I am trying to write a Python function to save a list of URLs in a .txt file.
Example: visit http://forum.domain.com/ and save every viewtopic.php?t= URL in a .txt file:
http://forum.domain.com/viewtopic.php?t=1333
http://forum.domain.com/viewtopic.php?t=2333
I use this function but it does not save anything.
I am very new to Python; can someone help me create this?
web_obj = opener.open('http://forum.domain.com/')
data = web_obj.read()
fl_url_list = open('urllist.txt', 'r')
url_arr = fl_url_list.readlines()
fl_url_list.close()
This is far from trivial and can have quite a few corner cases (I suppose the page you're referring to is a web page).
To give you a few pointers, you need to:
download the web page: you're already doing it (in data)
extract the URLs: this is the hard part. Most probably you'll want to use an HTML parser, extract <a> tags, fetch the href attribute and put that into a list, then filter that list to keep only the URLs formatted like you want (say, with viewtopic). Let's say you get it into urlList
then open a file for writing text (thus mode wt, not r)
write the content: f.write('\n'.join(urlList))
close the file
I advise you to try to follow these steps and ask relevant questions when you're stuck on a particular issue; a minimal sketch of the whole flow follows below.
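As a starting point, a sketch of those steps, assuming requests and bs4 are available (the forum URL and the urllist.txt filename come from the question):
import requests
from bs4 import BeautifulSoup

# 1. download the web page
data = requests.get('http://forum.domain.com/').text

# 2. parse it, extract <a> tags, and filter the hrefs
soup = BeautifulSoup(data, 'html.parser')
urlList = [a['href'] for a in soup.find_all('a', href=True)
           if 'viewtopic.php?t=' in a['href']]

# 3-5. open the file for writing text, write the URLs, close it
with open('urllist.txt', 'wt') as f:
    f.write('\n'.join(urlList))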
