python save url list in txt file

Hello, I am trying to write a Python function that saves a list of URLs to a .txt file.
Example: visit http://forum.domain.com/ and save every URL containing viewtopic.php?t= to a .txt file:
http://forum.domain.com/viewtopic.php?t=1333
http://forum.domain.com/viewtopic.php?t=2333
I use the code below, but it does not save anything.
I am very new to Python. Can someone help me write this?
web_obj = opener.open('http://forum.domain.com/')
data = web_obj.read()
fl_url_list = open('urllist.txt', 'r')
url_arr = fl_url_list.readlines()
fl_url_list.close()

This is far from trivial and can have quite a few corner cases (I assume the page you're referring to is a web page).
To give you a few pointers, you need to:
1. download the web page: you're already doing that (the result is in data)
2. extract the URLs: this is the hard part. Most probably you'll want to use an HTML parser, extract the <a> tags, fetch each href attribute and put it into a list, then filter that list to keep only the URLs formatted the way you want (say, those containing viewtopic). Let's say you end up with urlList
3. open a file for writing text (mode wt, not r)
4. write the content: f.write('\n'.join(urlList))
5. close the file
I advise you to try following these steps and to ask specific questions when you get stuck on a particular issue.
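
The steps above could be sketched roughly as follows, using only the standard library. This is a minimal sketch, not a complete solution: the names LinkCollector, extract_topic_urls and save_urls are made up for illustration, and it assumes the page has already been downloaded and decoded into a string.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin


class LinkCollector(HTMLParser):
    """Collect the href value of every <a> tag fed to the parser."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)


def extract_topic_urls(html_text, base_url, marker="viewtopic.php?t="):
    """Return absolute URLs from html_text whose address contains marker."""
    collector = LinkCollector()
    collector.feed(html_text)
    # Relative links are resolved against the page address.
    absolute = (urljoin(base_url, link) for link in collector.links)
    return [url for url in absolute if marker in url]


def save_urls(urls, path="urllist.txt"):
    """Write one URL per line, opening the file for writing (not reading)."""
    with open(path, "w", encoding="utf-8") as f:
        f.write("\n".join(urls))


# Small in-memory sample instead of a real forum page.
sample = '<a href="viewtopic.php?t=1333">A</a> <a href="/index.php">B</a>'
print(extract_topic_urls(sample, "http://forum.domain.com/"))
```

With the code from the question, data is a bytes object, so it would need something like data.decode('utf-8', errors='replace') before being passed to extract_topic_urls; the right encoding depends on the site.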

Related

How should I automate work with Python

I'm a complete beginner in Python, but I know JavaScript at an intermediate level. I have a project to complete that is something like a scraper, and I want to automate some of the work.
1) I have an Excel sheet with more than 1000 rows, which also contain URLs. I want Python code that visits every URL from that sheet and searches the first page for some predefined search texts (a list of texts).
2) If my code finds any of the texts on that web page, it should return true, otherwise false.
I'd like any idea or logic for doing this kind of process. Any help will ease my headache 😅
It is heavy work that does not seem like a good idea to do in JavaScript, which is why I want to do it in Python.

An easy way to do this is to install the requests module, then learn how to use the csv module, which can read spreadsheets exported as CSV (for example from Excel). Then here is what you want to do:
import csv
import requests

URLS = []
KEYWORDS = []  # fill in the texts you are looking for

def GetUrlFromCSVFile():
    # Figure out how to get the links from the CSV file, then append them to the URLS list
    pass

GetUrlFromCSVFile()
for url in URLS:
    # You should probably pass some headers, e.g. a User-Agent
    r = requests.get(url, headers={})
    if any(keyword in r.text for keyword in KEYWORDS):
        print("Found")
    else:
        print("Not here")

I suggest the following:
Read about the csv library, to read the content of the Excel file once it has been exported as CSV.
Read about the requests library, to get a page's content from its URL.
Read about regular expressions in the re library, to search that content for your texts.
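
As a rough illustration of how these three pieces might fit together, here is a minimal sketch. The function names, the column name "url" and the sample data are all assumptions made for the example, and the page download is passed in as a fetch function so the logic can be tried without network access.

```python
import csv
import io
import re


def urls_from_csv(csv_file, column="url"):
    """Yield the non-empty values of one column of an open CSV file."""
    for row in csv.DictReader(csv_file):
        value = (row.get(column) or "").strip()
        if value:
            yield value


def contains_any(page_text, search_texts):
    """Return True if any search text occurs in page_text (case-insensitive)."""
    pattern = "|".join(re.escape(text) for text in search_texts)
    return bool(pattern) and re.search(pattern, page_text, re.IGNORECASE) is not None


def check_pages(urls, search_texts, fetch):
    """Map each URL to True/False using fetch(url) -> page text."""
    return {url: contains_any(fetch(url), search_texts) for url in urls}


# Demo with in-memory data and a fake fetch function instead of real requests.
sheet = io.StringIO("name,url\nA,http://a.example\nB,http://b.example\n")
pages = {"http://a.example": "We sell Python books", "http://b.example": "Nothing here"}
results = check_pages(urls_from_csv(sheet), ["python", "scraper"], pages.get)
print(results)
```

In real use, fetch could be something like lambda url: requests.get(url, timeout=10).text, and the sheet would be a file opened with open(path, newline='') after exporting the Excel data to CSV.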

Trying to HEAD request a list of websites from an Excel file

I have a large list of websites in a csv or xlsx file, and I need to check the HTTP status code each website returns. I think using requests is my best bet, but that's as far as I've got.
I'm basically all pseudocode at this point, so I haven't tried to run anything.
The example below is something I found and am trying to build on; it works great for individual websites. I know I have to replace "http://www.google.com" with the list from Excel, which is where I'm stuck.
import requests
resp = requests.head("http://www.google.com")
print(resp.status_code, resp.text, resp.headers)
I have some code snippets saved for getting stuff out of the excel file but it doesn't seem like what I need.

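Load the CSV into a list, csvList = list(csv.reader(open(path))), then iterate over that list:
import csv
import requests

with open(path, newline='') as f:  # path is the location of your CSV file
    csvList = list(csv.reader(f))
for row in csvList:
    for cell in row:
        resp = requests.head(cell)
        print(cell)
        print(resp.status_code)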
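
A slightly more defensive variant of the same idea is sketched below. It swaps requests for the standard library's urllib.request so that it has no third-party dependency; head_status and check_csv are hypothetical helper names, and the status lookup can be replaced by any function, which is how the demo avoids real network calls.

```python
import csv
import io
import urllib.error
import urllib.request


def head_status(url, timeout=10):
    """Return the HTTP status code of a HEAD request, or None if unreachable."""
    try:
        request = urllib.request.Request(url, method="HEAD")
        with urllib.request.urlopen(request, timeout=timeout) as response:
            return response.status
    except urllib.error.HTTPError as error:
        # 4xx and 5xx answers arrive as exceptions but still carry a code.
        return error.code
    except (urllib.error.URLError, ValueError, OSError):
        # Malformed URL, DNS failure, refused connection, timeout, ...
        return None


def check_csv(csv_file, status_of=head_status):
    """Return (url, status) pairs for every non-empty cell of the CSV file."""
    results = []
    for row in csv.reader(csv_file):
        for cell in row:
            url = cell.strip()
            if url:
                results.append((url, status_of(url)))
    return results


# Demo with an in-memory CSV and a fake lookup so nothing is actually requested.
fake_codes = {"http://a.example": 200, "http://b.example": 404}
demo = check_csv(io.StringIO("http://a.example\nhttp://b.example,\n"), fake_codes.get)
print(demo)
```

Catching the exceptions matters for a large list: a single dead or malformed address would otherwise stop the whole run.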

Open URLS from list and write data

I am writing code that creates several URLs, which are then stored in a list.
The next step is to open each URL, download the data (which is only text, formatted as XML or JSON) and save the downloaded data.
My code works fine so far, thanks to the online community here. It gets stuck at the point of opening the URL and downloading the data. I want urllib.request to loop through the list of URLs I created and call each URL separately, open it, display it and move on to the next. But it only runs the loop that creates the URLs, and then nothing. No feedback, nothing.
import urllib.request
.... some calculations for llong and llat ....
#create the URLs and store in list
urls = []
for lat,long,lat1,long1 in (zip(llat, llong,llat[1:],llong[1:])):
    for pages in range (1,17):
        print ("https://api.flickr.com/services/rest/?method=flickr.photos.search&format=json&api_key=5.b&nojsoncallback=1&page={}&per_page=250&bbox={},{},{},{}&accuracy=1&has_geo=1&extras=geo,tags,views,description".format(pages,long,lat,long1,lat1))
print (urls)
#accessing the website
data = []
for amounts in urls:
    response = urllib.request.urlopen(urls)
    flickrapi = data.read()
    data.append(+flickrapi)
    data.close()
print (data)
What am I doing wrong?
The next step would be downloading the data and saving it to a file or somewhere else for further processing.
Since I will receive a very large amount of data, I am not sure what the best way would be to store it so I can process it with R (or maybe Python? I need to do some statistical work on it). Any suggestions?

You're not appending your generated urls to the url list, you are printing them:
print ("https://api.flickr.com/services/rest/?method=flickr.photos.search&format=json&api_key=5.b&nojsoncallback=1&page={}&per_page=250&bbox={},{},{},{}&accuracy=1&has_geo=1&extras=geo,tags,views,description".format(pages,long,lat,long1,lat1))
Should be:
urls.append("https://api.flickr.com/services/rest/?method=flickr.photos.search&format=json&api_key=5.b&nojsoncallback=1&page={}&per_page=250&bbox={},{},{},{}&accuracy=1&has_geo=1&extras=geo,tags,views,description".format(pages,long,lat,long1,lat1))
Then you can iterate over the urls as planned.
But you'll then run into an error on the following line:
response = urllib.request.urlopen(urls)
Here you are feeding the whole list of urls into urlopen, whereas you should be passing in a single url from urls, which you have named amounts, like so:
response = urllib.request.urlopen(amounts)
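
The rest of the loop in the question has further problems: it calls read() and close() on the data list rather than on response, and it appends +flickrapi. One possible shape for the whole download step is sketched below. The helper names fetch_all and save_json_lines are invented for this example, it assumes the API returns JSON encoded as UTF-8, and the demo uses a fake opener in place of real requests.

```python
import io
import json
import urllib.request


def fetch_all(urls, opener=urllib.request.urlopen):
    """Open each URL in turn and return the decoded JSON bodies as a list."""
    results = []
    for url in urls:
        # Read from the response object, not from the results list.
        with opener(url) as response:
            body = response.read()
        results.append(json.loads(body.decode("utf-8")))
    return results


def save_json_lines(records, path):
    """Write one JSON document per line so the file can be processed later."""
    with open(path, "w", encoding="utf-8") as f:
        for record in records:
            f.write(json.dumps(record) + "\n")


# Demo with a fake opener so that no network access is needed.
def fake_opener(url):
    return io.BytesIO(json.dumps({"requested": url}).encode("utf-8"))


print(fetch_all(["page1", "page2"], opener=fake_opener))
```

Writing one JSON document per line is only one storage option; it keeps each response separate and lets later processing read the file line by line instead of loading everything at once.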

Extracting links from HTML in Python

I basically have to make a program that takes a user-supplied web address and parses the HTML to find links, then stores all the links in another HTML file in a certain format. I only have access to built-in Python modules (Python 3). I'm able to get the HTML code from the link using urllib.request and put it into a string. How would I actually go about extracting links from this string and putting them into an array of strings? Also, would it be possible to identify links by type (such as an image link or an mp3 link) so I can put them into different arrays (then I could categorize them when I'm creating the output file)?

You can use the re module to search the HTML text for links. In particular, the findall function returns every match.
As for sorting by file type, that depends on whether the URL actually contains the extension (e.g. .mp3, .js, .jpeg, etc.).
You could do a simple for loop like this:
import re
html = getHTMLText()  # placeholder: however you obtain the HTML string
mp3s = []
other = []
for match in re.findall('<reexpression>', html):  # replace '<reexpression>' with your link pattern
    if match.endswith('.mp3'):
        mp3s.append(match)
    else:
        other.append(match)

Try the html.parser module or the re module; either will help you do that.
I think you can use a regex like this one:
r'http[s]?://[^\s<>"]+|www\.[^\s<>"]+'
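
For the html.parser route, a minimal sketch using only built-in modules might look like this. The class and function names, the extension sets and the sample markup are all assumptions chosen for illustration; links are grouped purely by the extension in the URL path, so a link without a recognisable extension ends up in "other".

```python
import os
from html.parser import HTMLParser
from urllib.parse import urlparse

IMAGE_EXTENSIONS = {".jpg", ".jpeg", ".png", ".gif"}
AUDIO_EXTENSIONS = {".mp3"}


class LinkParser(HTMLParser):
    """Collect every href and src attribute value found in the HTML."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        for name, value in attrs:
            if name in ("href", "src") and value:
                self.links.append(value)


def categorize(html_text):
    """Group the links in html_text by the extension of their URL path."""
    parser = LinkParser()
    parser.feed(html_text)
    groups = {"images": [], "audio": [], "other": []}
    for link in parser.links:
        # urlparse drops the query string, so "song.mp3?x=1" still counts as .mp3
        extension = os.path.splitext(urlparse(link).path)[1].lower()
        if extension in IMAGE_EXTENSIONS:
            groups["images"].append(link)
        elif extension in AUDIO_EXTENSIONS:
            groups["audio"].append(link)
        else:
            groups["other"].append(link)
    return groups


page = '<a href="song.mp3?x=1">s</a><img src="/pic.PNG"><a href="about.html">a</a>'
print(categorize(page))
```

Compared with a regular expression, the parser also picks up relative links and attribute values that do not start with http or www.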

How to automate downloading .txt files from Scribd.com

This is a general question about whether it is possible, and if so how, to automate the download of a document from scribd.com search results.
Scenario:
I have a Scribd account and find a document I want. Normally I then have to click the download button to start the download.
Any ideas for automating this? I'm using the Scribd API and Python to automatically extract document IDs based on automated queries, but once I get the doc_ids I have to manually go to each document page and click the download button to get the actual txt/pdf file. I want to automate this step as well.
Any ideas?

Looking at the python-scribd documentation or the scribd API reference, any object that can give you a document ID or website URL can also give you a download URL. Or, if you already have a document ID, you can just call get to get an object that can give you a download URL.
Most likely, you've got a Document object, which has this method:
get_download_url(self, doc_type='original')
Returns a link that can be used to download a static version of the document.
So, wherever you're calling get_scribd_url, just call get_download_url.
And then, to download the result, Python has urllib2 (2.x) or urllib.request (3.x) built into the standard library, or you can use requests or any other third-party library instead.
Putting it all together as an example:
import os
import urllib.parse
import urllib.request

# do all the stuff to set up the api_key, get a `User` object, etc.
def is_document_i_want(document):
    return document.author == "Me"

urls = [document.get_download_url() for document in user.all()
        if is_document_i_want(document)]
for url in urls:
    path = urllib.parse.urlparse(url).path
    name = os.path.basename(path)
    u = urllib.request.urlopen(url)
    # u.read() returns bytes, so open the file in binary mode
    with open(name, 'wb') as f:
        f.write(u.read())
    print('Wrote {} as {}'.format(url, name))
Presumably you're going to want to use something like user.find instead of user.all. Or, if you've already written the code that gets the document IDs and don't want to change it, you can use user.get with each one.
And if you want to post-filter the results, you probably want to use attributes beyond the basic ones (or you would have just passed them to the query), which means you need to call load on each document before you can access them (so add document.load() at the top of the is_document_i_want function). But really, there's nothing complicated here.
