I am trying to use urllib to access a website and then scrape the page source so I can collect some data from it. I know how to do this for public websites, but I don't know how to do it for password-protected pages. I know the username and password; I am just confused about how to get urllib to submit the correct credentials and then redirect me to the page I actually want to scrape. Currently, my code looks like this, but the problem is that it brings up the login page's source.
from tkinter import *
from tkinter import filedialog
import csv
from re import findall
import urllib.request

def info():
    file = filedialog.askopenfilename()
    fileR = open(file, 'r')
    hold = csv.reader(fileR, delimiter=',', quotechar='|')
    aList = []
    for item in hold:
        if item[1] and item[2] == "":
            print(item[1])
            url = "http://www.example.com/id=" + item[1]
            request = urllib.request.urlopen(url)
            html = request.read()
            data = str(html)
            person = findall(r'''\$MainContent\$txtRecipient\"\stype=\"text\"\svalue=\"([^\"]+)\"''', data)
        else:
            pass
    fileR.close()
Remember, I am using Python 3.3.3. Any help would be appreciated!
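One common way to handle this with plain urllib is to POST the login form's fields through an opener that keeps cookies (HTTPCookieProcessor), then request the protected page with that same opener. The sketch below is only illustrative and untested against your site: the login URL and the field names ("username", "password") are assumptions, and an ASP.NET-style form will usually also require hidden fields such as __VIEWSTATE taken from the login page itself.

import http.cookiejar
import urllib.parse
import urllib.request

# Keep cookies between requests so the session created at login persists.
cookies = http.cookiejar.CookieJar()
opener = urllib.request.build_opener(urllib.request.HTTPCookieProcessor(cookies))

# Hypothetical login URL and field names -- read the real ones from the
# login form's HTML (the <form action> and <input name> attributes).
login_url = "http://www.example.com/login.aspx"
form_data = urllib.parse.urlencode({
    "username": "my_user",
    "password": "my_pass",
}).encode("utf-8")

# Submit the credentials; on success the server sets a session cookie.
opener.open(login_url, form_data)

# Request the protected page with the same opener (same cookies).
response = opener.open("http://www.example.com/id=12345")
data = response.read().decode("utf-8", errors="replace")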
I am using the Python webbrowser module to try to open an HTML file. I added a short routine that fetches a page's source from a website so I can store a web page in case I ever need to view it without Wi-Fi, for instance a news article or something else.
The code itself is fairly short so far, so here it is:
import requests as req
from bs4 import BeautifulSoup as bs
import webbrowser
import re

webcheck = re.compile(r'^(https?://)?(www\.)?([a-z0-9]+\.[a-z]+)([/a-zA-Z0-9#\-_]+/?)*$')

# Valid URL check
while True:
    url = input('URL (MUST HAVE HTTP://): ')
    check = webcheck.search(url)
    if check is not None:
        # Drop groups that did not match (None) so the join below works.
        groups = [group for group in check.groups() if group]
        # Iterate over a copy, since the list is modified inside the loop.
        for group in list(groups):
            if group == 'https://':
                groups.remove(group)
            elif group.count('/') > 0:
                groups.append(group.replace('/', '--'))
                groups.remove(group)
        filename = ''.join(groups) + '.html'
        break

# Getting website data
reply = req.get(url)
soup = bs(reply.text, 'html.parser')

# Writing website
with open(filename, 'w') as file:
    file.write(reply.text)

# Open website
webbrowser.open(filename)
webbrowser.open('https://www.youtube.com')
I added webbrowser.open('https://www.youtube.com') so that I knew the module was working, which it was, as it did open up YouTube.
However, webbrowser.open(filename) doesn't do anything, yet it returns True if I define it as a variable and print it.
The HTML file itself has a period in its name, but I don't think that should matter, as I have also tried a filename without one and it still won't open.
Does webbrowser need special permissions to work?
I'm not sure what to do as I've removed characters from the filename and even showed that the module is working by opening youtube.
What can I do to fix this?
From the webbrowser documentation:
Note that on some platforms, trying to open a filename using this function, may work and start the operating system’s associated program. However, this is neither supported nor portable.
So it seems that webbrowser can't do what you want. Why did you expect that it would?
Adding file:// plus the full path name does the trick, for anyone wondering.
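To make that concrete, here is a minimal sketch (the filename below is just an illustration) that turns the saved file's path into an absolute file:// URI before handing it to webbrowser:

import os
import webbrowser

filename = 'wwwexample.com.html'  # whatever name the script generated

# Build an absolute file:// URI; webbrowser handles URIs more reliably
# than bare relative filenames.
webbrowser.open('file://' + os.path.abspath(filename))

On Windows, pathlib.Path(filename).resolve().as_uri() produces a correctly formed URI including the drive letter.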
I'm building this Shopify scraper to scrape shop properties like address, phone, email, etc., and I'm receiving urllib.error.HTTPError: HTTP Error 404: Not Found. The CSV is being created with the header, but none of the information is being scraped. Why isn't the address being scraped?
import csv
import json
import sys
from urllib.request import urlopen

base_url = sys.argv[1]
url = base_url + '/shopprops.json'

def get_page(page):
    data = urlopen(url + '?page={}'.format(page)).read()
    shopprops = json.loads(data)['shopprops']
    return shopprops

with open('shopprops.csv', 'w', newline='') as f:
    writer = csv.writer(f)
    writer.writerow(['Address1'])
    page = 1
    shopprops = get_page(page)
    while shopprops:
        for shop in shopprops:
            address1 = shop['address1']
            row = [address1]
            writer.writerow(row)
        page += 1
        shopprops = get_page(page)
It looks like the issue is with:
data = urlopen(url + '?page={}'.format(page)).read()
and:
shopprops = get_page(page)
That article is crappy for a few reasons, which might help you decide to move on. First off, you can't scrape a shop the way that guy says, just by asking for products.json. You get a really small payload of a few products at best, with no really interesting information exposed. Shopify is wise to that.
So before you invest too much effort in your scraper, you might want to rethink what you're doing and try a different approach.
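That said, if you do keep the urllib approach, here is a defensive sketch of the paging loop (reusing the question's hypothetical /shopprops.json endpoint and 'shopprops' key) that stops cleanly on a 404 or an empty page instead of crashing:

import json
import sys
from urllib.error import HTTPError
from urllib.request import urlopen

base_url = sys.argv[1]
url = base_url + '/shopprops.json'  # hypothetical endpoint, as in the question

def get_page(page):
    # Return one page's list, or an empty list on a 404 or a missing key.
    try:
        data = urlopen(url + '?page={}'.format(page)).read()
    except HTTPError as err:
        if err.code == 404:
            return []
        raise
    return json.loads(data.decode('utf-8')).get('shopprops', [])

page = 1
shopprops = get_page(page)
while shopprops:
    for shop in shopprops:
        print(shop.get('address1'))
    page += 1
    shopprops = get_page(page)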
Basically, I'm trying to remove all the characters after the URL extension in a URL, but it's proving difficult. The application works off a list of various URLs with various extensions.
Here's my source:
import requests
from bs4 import BeautifulSoup
from time import sleep

# Takes user input for the path of the panels they want tested
import_file_path = input('Enter the path of the websites to be tested: ')
# Takes user input for the path of the exported file
export_file_path = input('Enter the path of where we should export the panels to: ')

# Reads the imported panels
with open(import_file_path, 'r') as panels:
    panel_list = []
    for line in panels:
        panel_list.append(line.strip())  # strip the trailing newline from each URL

x = 0
for panel in panel_list:
    url = requests.get(panel)
    soup = BeautifulSoup(url.content, "html.parser")
    forms = soup.find_all("form")
    action = soup.find('form').get('action')
    values = {
        soup.find_all("input")[0].get("name"): "user",
        soup.find_all("input")[1].get("name"): "pass"
    }
    print(values)
    r = requests.post(action, data=values)
    print(r.headers)
    print(r.status_code)
    print(action)
    sleep(10)
    x += 1
What I'm trying to achieve is an application that automatically tests a username/password against a list of URLs provided in a text document. However, BeautifulSoup returns an incomplete URL when crawling for form action attributes, i.e. instead of returning the full http://example.com/action.php it returns action.php, exactly as it appears in the page source. The only way I can think to get past this would be to rebuild the 'action' variable as 'panel' with all characters after the domain removed, followed by 'action'.
Thanks!
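One way to avoid trimming the URL by hand, as a sketch rather than the only option, is to let urllib.parse.urljoin resolve the relative action against the panel URL it was scraped from:

from urllib.parse import urljoin

panel = 'http://example.com/admin/login.php'  # hypothetical URL from the text file
action = 'action.php'                         # relative action scraped from the form

# urljoin resolves a relative action against the page it came from,
# and leaves an already-absolute action untouched.
print(urljoin(panel, action))  # http://example.com/admin/action.php

In the loop above, requests.post(urljoin(panel, action), data=values) would then post to the right host without any string surgery.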
I am using 3 modules in this program, and I don't know if what I'm trying to do is even possible! I want to scrape some data off of Twitter and write it to a text file using Python. Can somebody please guide me and tell me why my code isn't writing the scraped data?
import urllib.request
from os import path
from bs4 import BeautifulSoup

# Here I define the url, request the page, and create my soup
theurl = "https://twitter.com/realDonaldTrump"
thepage = urllib.request.urlopen(theurl)
soup = BeautifulSoup(thepage, "html.parser")

def create_file(dest):
    """
    Creates a file for the user to write data in!
    :param dest:
    :return:
    """
    ## FileName == Month_Day_Year
    name = 'Data Scraped.txt'
    if not path.isfile(dest + name):
        f = open(dest + name, "w")
        f.write(soup.title.text)
        f.close()

if __name__ == '__main__':
    destination = 'C:\\Users\\edwin\\' \
                  'Desktop\\WebScrappin\\'
    create_file(destination)
    print("Your file has been created!!")
You're only writing the title of the document that you received:
f.write(soup.title.text)
Instead of scraping (which is against their ToS), you should gather your data from their RESTful API, or use a library like Twython.
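As a rough, untested sketch of the Twython route (the credential values are placeholders; you create the real ones in a Twitter developer account):

from twython import Twython

# Placeholder credentials from a Twitter developer account.
APP_KEY = 'your-app-key'
APP_SECRET = 'your-app-secret'
OAUTH_TOKEN = 'your-access-token'
OAUTH_TOKEN_SECRET = 'your-access-token-secret'

twitter = Twython(APP_KEY, APP_SECRET, OAUTH_TOKEN, OAUTH_TOKEN_SECRET)

# Fetch recent tweets for a user and write their text to a file.
tweets = twitter.get_user_timeline(screen_name='realDonaldTrump', count=20)
with open('Data Scraped.txt', 'w', encoding='utf-8') as f:
    for tweet in tweets:
        f.write(tweet['text'] + '\n')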
I'm trying to figure out how to write a website monitoring script (a cron job in the end) that opens a given URL and checks whether a tag exists; if the tag does not exist, or doesn't contain the expected data, it should write something to a log file or send an e-mail.
The tag would be something like … or something relatively similar.
Anyone have any ideas?
Your best bet imo is to check out BeautifulSoup. Something like so:
import urllib2
from BeautifulSoup import BeautifulSoup
page = urllib2.urlopen("http://yoursite.com")
soup = BeautifulSoup(page)
# See the docs on how to search through the soup. I'm not sure what
# you're looking for so my example stops here :)
After that, emailing it or logging it is pretty standard fare.
This is sample code (untested) that logs and sends mail:
#!/usr/bin/env python
import logging
import urllib2
import smtplib

# Log config
logging.basicConfig(filename='/tmp/yourscript.log', level=logging.INFO)

def check_content(data):
    # Your BeautifulSoup logic here
    return content_found

def send_mail(message_body):
    server = 'localhost'
    recipients = ['you@yourdomain.com']
    sender = 'script@yourdomain.com'
    message = 'From: %s \nSubject: script result \n\n%s' % (sender, message_body)
    session = smtplib.SMTP(server)
    session.sendmail(sender, recipients, message)

# Open requested url
url = "http://yoursite.com/tags/yourTag"
data = urllib2.urlopen(url)
if check_content(data):
    # Report to log
    logging.info('Content found')
else:
    # Send mail
    send_mail('Content not found')
I would code the check_content() function using BeautifulSoup.
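A minimal sketch of that check_content(), assuming the thing being monitored is a tag found by CSS class (the span/status/OK values below are placeholders, and this uses bs4 rather than the older BeautifulSoup import above):

from bs4 import BeautifulSoup

def check_content(data):
    # Return True if the expected tag exists and holds the expected text.
    soup = BeautifulSoup(data.read(), 'html.parser')
    tag = soup.find('span', class_='status')  # hypothetical target tag
    return tag is not None and tag.get_text(strip=True) == 'OK'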
The following (untested) code uses urllib2 to grab the page and re to search it.
import re
import urllib2

pageString = urllib2.urlopen('**insert url here**').read()
m = re.search(r'**insert regex for the tag you want to find here**', pageString)
if m is None:
    pass  # take action for NOT found here
else:
    pass  # take action for found here
The following (untested) code uses pycurl and StringIO to grab the page and re to search it.
import pycurl
import re
import StringIO

b = StringIO.StringIO()
c = pycurl.Curl()
c.setopt(pycurl.URL, '**insert url here**')
c.setopt(pycurl.WRITEFUNCTION, b.write)
c.perform()
c.close()

m = re.search(r'**insert regex for the tag you want to find here**', b.getvalue())
if m is None:
    pass  # take action for NOT found here
else:
    pass  # take action for found here