Python BeautifulSoup Extracting PHP Links

I'm having a problem in Python with BeautifulSoup. I need to extract all the links on the page that end in ".php", but they also have to be local links; they can't point to another website. This is what I have so far:
from bs4 import BeautifulSoup
import mechanize
import sys
url = sys.argv[1]
br = mechanize.Browser()
code = br.open(url)
html = code.read()
soup = BeautifulSoup(html, "html.parser")
This is where I get stuck. I imagine using soup.find_all to get all the a tags and then filtering their href attributes.

Try it like this:
import urllib2
from bs4 import BeautifulSoup

page = urllib2.urlopen(url)
soup = BeautifulSoup(page.read())
for a in soup.findAll('a', href=True):
    if a['href'].endswith('.php'):
        print a['href']
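The snippet above matches the ".php" suffix but not the asker's "local files only" requirement. One way to keep only same-host links is to resolve each href against the page URL and compare hosts; a minimal sketch in Python 3 syntax using the stdlib urllib.parse (the hrefs would come from soup.find_all('a', href=True); the sample values here are made up):

```python
from urllib.parse import urljoin, urlparse

def is_local_php(href, base_url):
    """True if href, resolved against base_url, is a .php file on the same host."""
    full = urljoin(base_url, href)   # make relative links absolute
    parsed = urlparse(full)
    return parsed.path.endswith(".php") and parsed.netloc == urlparse(base_url).netloc

# hrefs would come from soup.find_all('a', href=True)
hrefs = ["index.php", "/about.php", "http://other.example/spam.php", "pic.png"]
base = "http://example.com/dir/page.html"
print([h for h in hrefs if is_local_php(h, base)])  # ['index.php', '/about.php']
```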

This searches a local directory on disk rather than the page, but if the .php files are already downloaded you can glob for them:
import glob, os

path = raw_input("Enter your path: ") + "/"
print path
for i in glob.glob(os.path.join(path, "*.php")):
    print i

Related

How to extract the correct script out of all the scripts using BeautifulSoup

I'm currently using BS4 to extract some information from a Kickstarter webpage: https://www.kickstarter.com/projects/louisalberry/louis-alberry-debut-album-uk-european-tour
The project information is located inside one of the script tags: (pseudo-code)
...
<script>...</script>
<script>
window.current_ip = ...
...
window.current_project = "<I want this part>"
</script>
...
My current code:
from bs4 import BeautifulSoup
from urllib.request import urlopen
import html
html_ = urlopen("https://www.kickstarter.com/projects/louisalberry/louis-alberry-debut-album-uk-european-tour").read()
soup = BeautifulSoup(html_, 'html.parser')
# why does this not work?
# soup.find('script', re.compile("window.current_project"))
# currently, I'm doing this:
all_string = html.unescape(soup.find_all('script')[4].get_text())
# then some regex here on all_string to extract the current_project information
Currently I can get the section I want by indexing with [4], but since that position may not hold in general, how can I extract the text from the correct script tag?
Thanks!
You can gather all the script elements and loop over them, accessing the response content with requests:
from bs4 import BeautifulSoup
import requests
res = requests.get("https://www.kickstarter.com/projects/louisalberry/louis-alberry-debut-album-uk-european-tour")
soup = BeautifulSoup(res.content, 'lxml')
scripts = soup.select('script')
for script in scripts:
    if 'window.current_project' in script.text:
        print(script)
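On the asker's commented-out find() line: the second positional argument of find() is interpreted as an attribute filter, not a text match, which is why passing a regex there finds nothing. The tag's text is matched with the string keyword instead; a minimal sketch on made-up script contents:

```python
import re
from bs4 import BeautifulSoup

html = '''<script>window.current_ip = "1.2.3.4";</script>
<script>window.current_project = "demo-project";</script>'''

soup = BeautifulSoup(html, "html.parser")
# match the script's text with the string keyword, not the attrs position
tag = soup.find("script", string=re.compile(r"window\.current_project"))
value = re.search(r'window\.current_project = "([^"]*)"', tag.string).group(1)
print(value)  # demo-project
```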
This should work (instead of dumping to JSON, you could just print the output if you want; and remember to change the variables where I said "CHOOSE A PATH" and "if there's any class add it here"):
from bs4 import BeautifulSoup
import requests
import json
import os

website = requests.get("https://www.kickstarter.com/projects/louisalberry/louis-alberry-debut-album-uk-european-tour")
soup = BeautifulSoup(website.content, 'lxml')
mytext = soup.findAll("script", {"class": "if there's any class add it here, or else delete this argument"})
save_path = 'CHOOSE A PATH'
ogname = "kickstarter_text.json"
completename = os.path.join(save_path, ogname)
with open(completename, "w") as output:
    json.dump([str(script) for script in mytext], output)

python with beautifulsoup - remove tags

I am writing a Python program to extract lyrics.
The code I use:
import urllib
from bs4 import BeautifulSoup
url = urllib.urlopen("http://www.lyricsnmusic.com/david-bowie/slip-away-lyrics/22143075")
soup = BeautifulSoup(url.read())
print soup.find('pre', itemprop='description')
the result gets me what I need, but with the tag still around it,
for example: <pre itemprop="description"> followed by the lyrics.
Does anyone know how to get only the lyrics? The structure puts the lyrics inside the pre tag.
Thanks in advance
Use the text attribute of the node that you've found:
import urllib
from bs4 import BeautifulSoup

url = urllib.urlopen("http://www.lyricsnmusic.com/david-bowie/slip-away-lyrics/22143075")
soup = BeautifulSoup(url.read())
desc = soup.find('pre', itemprop='description')
print desc.text
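One caveat with the .text approach: find() returns None when the tag is missing, and None.text raises. A small defensive sketch in Python 3 syntax, on a made-up snippet of the same shape:

```python
from bs4 import BeautifulSoup

html = '<pre itemprop="description">Line one\nLine two</pre>'
soup = BeautifulSoup(html, "html.parser")
desc = soup.find("pre", itemprop="description")
if desc is not None:          # find() returns None when nothing matches
    print(desc.get_text())    # same as desc.text: the tag markup is stripped
```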

Python, BeautifulSoup - print only links which contain <img> in content

Is it possible, with BeautifulSoup, to print only the links that have an <img> inside their content?
Currently my code looks like this:
import urllib
import re
import mechanize
from bs4 import BeautifulSoup
import urlparse
url = "http://www.nytimes.com"
htmlcontent = urllib.urlopen(url)
soup = BeautifulSoup(htmlcontent)
for link in soup.find_all('a'):
    print link.contents
which prints out all the content inside the links. But what I really need is to print only the links that have <img> tags inside their content, and I don't know how to do that.
Any help is welcome.
Just try to find an img tag inside the link:
for link in soup.find_all('a'):
    if link.find('img'):
        print link
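If you have a recent BeautifulSoup (4.7+, which uses the soupsieve CSS engine), the same filter can also be written as a CSS selector; a sketch on made-up markup:

```python
from bs4 import BeautifulSoup

html = '''<a href="/story"><img src="thumb.jpg"/></a>
<a href="/text-only">plain link</a>'''

soup = BeautifulSoup(html, "html.parser")
for link in soup.select("a:has(img)"):   # only anchors containing an <img>
    print(link["href"])  # /story
```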

Grabbing the info from beautiful soup and putting it into a text file?

I have started to learn how to scrape information from websites using urllib and beautifulsoup. I want to grab all the text from this page (in the code) and put it into a text file.
import urllib
from bs4 import BeautifulSoup as Soup
base_url = "http://www.galactanet.com/oneoff/theegg_mod.html"
url = (base_url)
soup = Soup(urllib.urlopen(url))
print(soup.get_text())
When I run this it grabs the text, but it outputs it with spaces between all the letters and still shows me some HTML; I'm not sure why.
i n ' > Y u p . B u t d o n t f e e
Like that. Any ideas?
Also, what would I do to put this info into a text file?
(Using beautifulsoup4 and running ubuntu 12.04 and python 2.7)
Thank you :)
I had some trouble with the encoding, so I changed your code slightly, then added the piece to print the results to a file:
import urllib
from bs4 import BeautifulSoup as Soup
base_url = "http://www.galactanet.com/oneoff/theegg_mod.html"
url = (base_url)
content = urllib.urlopen(url)
soup = Soup(content)
# print soup.original_encoding
theegg_text = soup.get_text().encode("windows-1252")
f = open("somefile.txt", "w")
f.write(theegg_text)
f.close()
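The spaced-out letters usually mean the page bytes were decoded with the wrong codec (for example UTF-16 content read as an 8-bit encoding). In Python 3 you can hand BeautifulSoup the raw bytes, let it sniff the encoding, and write the text out as UTF-8; a sketch, with the input bytes and file name made up:

```python
from bs4 import BeautifulSoup

# stand-in for urlopen(url).read(): UTF-16 bytes with a BOM
raw = "<html><body><p>Yup. But don't feel bad.</p></body></html>".encode("utf-16")

soup = BeautifulSoup(raw, "html.parser")   # bs4 detects the encoding from the bytes
with open("theegg.txt", "w", encoding="utf-8") as f:
    f.write(soup.get_text())
```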
You could try using html2text:
import html2text as htmlconverter
print htmlconverter.html2text('<HTML><BODY>HI</BODY></HTML>')

Rss Feed scraping with BeautifulSoup

I'm having trouble with my script. I am able to get the titles and links, but I can't seem to open and scrape the articles themselves. Can somebody please help?
from urllib import urlopen
from BeautifulSoup import BeautifulSoup
import re
source = urlopen('http://www.marketingmag.com.au/feed/').read()
title = re.compile('<title>(.*)</title>')
link = re.compile('<a href="(.*)">')
find_title = re.findall(title, source)
find_link = re.findall(link, source)
literate = []
literate[:] = range(1, 10)
for i in literate:
    print find_title[i]
    print find_link[i]
    articlePage = urlopen(find_link[i]).read()
    divBegin = articlePage.find('<div class="entry-content">')
    article = articlePage[divBegin:(divBegin + 1000)]
    soup = BeautifulSoup(article)
    paragList = soup.findAll('p')
    for parag in paragList:
        print parag
        print ("\n")
Do not use regex to parse HTML. Just use Beautiful Soup and its facilities, like find_all, to get the links; then you can use urllib2.urlopen to open each URL and read the contents.
Your Code strongly reminds me of: http://www.youtube.com/watch?v=Ap_DlSrT-iE
Why do you actually use BeautifulSoup for XML parsing? It's built for HTML pages, and Python itself has very good XML parsers. Example: http://docs.python.org/library/xml.dom.minidom.html
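Following that suggestion, an RSS feed like the one in the question can be parsed without any regex; a sketch with the stdlib ElementTree on a made-up minimal feed:

```python
import xml.etree.ElementTree as ET

rss = '''<rss version="2.0"><channel>
<item><title>First post</title><link>http://example.com/1</link></item>
<item><title>Second post</title><link>http://example.com/2</link></item>
</channel></rss>'''

root = ET.fromstring(rss)
for item in root.iter("item"):          # each <item> is one feed entry
    print(item.findtext("title"), item.findtext("link"))
```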
