I have a list of links and I run a specific function on each one. The function takes about 25 seconds per link: I use Selenium to open the link and get its page source, and then do my processing. However, whenever I run the program and then cancel it, I have to start all over again from the first link.
Note: I get the links from different websites' sitemaps.
Is there a way to save my progress and continue it later on?
This code will work. I assume you already have a function for getting links; I have just used a dummy one, _get_links.
You will have to delete the contents of the links file and reset the index file to 0 after every successful run.
import time

def _get_links():
    return ["a", "b", "c"]

def _get_links_from_file():
    with open("links") as file:
        content = file.read().strip()
        # an empty file means no links have been saved yet
        return content.split(",") if content else []

def _do_something(link):
    print(link)
    time.sleep(30)

def _save_links_to_file(links):
    with open("links", "w") as file:
        file.write(",".join(links))
    print("links saved")

def _save_index_to_file(index):
    with open("index", "w") as file:
        file.write(str(index))
    print("index saved")

def _get_index_from_file():
    with open("index") as file:
        return int(file.read().strip())

def process_links():
    links = _get_links_from_file()
    if len(links) == 0:
        # first run: fetch the links and persist them
        links = _get_links()
        _save_links_to_file(links)
        start = 0
    else:
        # resume from the last saved position
        start = _get_index_from_file()
    for index, link in enumerate(links[start:], start=start):
        _do_something(link)
        # record progress after every link so a cancelled run can resume here
        _save_index_to_file(index + 1)

if __name__ == '__main__':
    process_links()
I would suggest that you write out the links to a file along with a date/time stamp of the last time it was processed. When you write links to the file, you will want to make sure that you don't write the same link twice. You will also want to date/time stamp a link after you are done processing it.
Once you have this list, when the script is started you read the entire list and start processing links that haven't been processed in X days (or whatever your criteria is).
Steps:
1. Load the links file.
2. Scrape links from the sitemap, compare them to the existing links from the file, and write any new links to the file.
3. Find the first link that hasn't been processed in X days.
4. Process that link, then write a date/time stamp next to it, e.g.
   http://www.google.com,1/25/2019 12:00PM
5. Go back to Step 3.
Now any time you kill the run, the process will pick up where you left off.
NOTE: Just writing out the date may be enough. It just depends on how often you want to refresh your list (hourly, etc.) or if you want that much detail.
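A minimal sketch of this approach, assuming a simple two-column CSV links file and treating the sitemap scraping and the per-link work as functions you already have (sitemap_urls and process below are placeholders, not part of the original code):

import csv
from datetime import datetime, timedelta

LINKS_FILE = "links.csv"           # one "url,last processed" pair per line
STALE_AFTER = timedelta(days=1)    # the "X days" criterion

def load_links():
    links = {}
    try:
        with open(LINKS_FILE, newline="") as f:
            for url, stamp in csv.reader(f):
                links[url] = stamp
    except FileNotFoundError:
        pass
    return links

def save_links(links):
    with open(LINKS_FILE, "w", newline="") as f:
        csv.writer(f).writerows(links.items())

def run(sitemap_urls, process):
    links = load_links()
    # write any new sitemap links to the file, never duplicating an existing one
    for url in sitemap_urls:
        links.setdefault(url, "")
    save_links(links)
    for url, stamp in links.items():
        last = datetime.strptime(stamp, "%m/%d/%Y %I:%M%p") if stamp else None
        if last is None or datetime.now() - last > STALE_AFTER:
            process(url)                                            # the ~25-second job per link
            links[url] = datetime.now().strftime("%m/%d/%Y %I:%M%p")
            save_links(links)                                       # progress survives a kill

Because the file is rewritten after every processed link, killing the run at any point loses at most the link currently being processed.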
You should save the links in a text file. You should also save the current index number in another text file, initializing it with 0.
In your code, you can then loop through the links using something like:
for link in links[index_number:]:
At the end of every loop iteration, write the updated index number to the text file holding the index. This will help you continue from where you left off.
tldr;
When writing to csv, how do I ensure the last element of a previous write is separated from the first element of a subsequent write by a comma and not a new line?
I am currently trying to collect followers data for a list of Twitter users using Tweepy. In the code below, you can see that I'm using pagination, as some users have a lot of followers. I'm trying to put all the followers into one csv file per user; however, when I test this code and inspect the csv, I can see there's only a new line between page writes, but no comma. I don't want improper csv format to come back and bite me later in this project.
for page in tweepy.Cursor(api.followers_ids, screen_name=username).pages():
    with open(f'output/{username}.csv', 'a') as outfile:
        writer = csv.writer(outfile)
        writer.writerow(page)
I've thought of enumerating the pages:
for i, page in enumerate(tweepy.Cursor(api.followers_ids,screen_name=username).pages()):
and doing something like if i > 0, add comma at end of the current file. This way feels inefficient, as I'd have to open, write a ',', and close the file each time this happens and I need every second I can save for this project.
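For context, each csv.writer.writerow() call terminates the row it writes with the writer's line terminator, which is why consecutive writes end up on separate lines rather than being joined by a comma. A small, self-contained demonstration (not part of the project code):

import csv
import io

buf = io.StringIO()
writer = csv.writer(buf)
writer.writerow([1, 2, 3])   # each writerow() call ends its row with '\r\n' by default
writer.writerow([4, 5, 6])
print(repr(buf.getvalue()))  # '1,2,3\r\n4,5,6\r\n' -- two rows, not one comma-separated line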
My website just launched a new, simple component that contains a series of links. Every 24 hours, the links update/change based on an algorithm. I want to see how long a particular link stays in the component (based on the algorithm, a particular link may stay in the component for multiple days, or it may be present for just one day).
I'm working on building a Python crawler to crawl the frontend of the website where this new component is present, and I want to have a simple output likely in a CSV file with two columns:
Column 1: URL (the URL that was found within the component)
Column 2: #/days seen (The number of times the Python crawler saw that URL. If it crawls every day, this could be simply thought of as the #/days the crawler has seen that particular URL. So this number would be updated every time the crawler runs. Or, if it was the first time a particular URL was seen, the URL would simply be added to the bottom of the list with a "1" in this column)
How can this be achieved from an output perspective? I'm pretty new to Python, but I'm pretty sure I've got the crawling part covered to identify the links. I'm just not sure how to accomplish the output part, especially as it will update daily, and I want to keep the historical data of how many times the link has been seen.
You need to learn how to web scrape; I suggest using the Beautiful Soup package for that.
Your scraping script should then iterate over your csv file, incrementing the number for each URL it finds, or adding a new row if it's not found.
Put this script in a cron job to run it once every 24 hours.
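For the scraping part, a minimal Beautiful Soup sketch might look like the following (the page URL and the CSS selector for the component are placeholders you would adapt to the real site):

import requests
from bs4 import BeautifulSoup

def find_component_links(page_url, component_selector):
    # fetch the page and collect the hrefs of all links inside the component
    html = requests.get(page_url).text
    soup = BeautifulSoup(html, 'html.parser')
    component = soup.select_one(component_selector)
    if component is None:
        return []
    return [a['href'] for a in component.find_all('a', href=True)]

# placeholder URL and selector; use the real page and the component's actual CSS selector
links_found = find_component_links('https://example.com', '.new-component')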
For part 2, you can do something like this:
from tempfile import NamedTemporaryFile
import shutil
import csv

links_found = []  # find the links here
filename = 'myfile.csv'
tempfile = NamedTemporaryFile(mode='w', newline='', delete=False)

with open(filename, newline='') as csv_file, tempfile:
    reader = csv.reader(csv_file)
    writer = csv.writer(tempfile)
    # Copy the header row
    writer.writerow(next(reader))
    # Increment existing links
    existing_links = []
    for row in reader:
        link = row[0]
        existing_links.append(link)
        times = int(row[1])
        if link in links_found:
            row[1] = str(times + 1)
        writer.writerow(row)
    # Add new links
    for link in links_found:
        if link not in existing_links:
            writer.writerow([link, 1])

# replace the original file with the updated copy
shutil.move(tempfile.name, filename)
I coded this scraper using Python 2.7 to fetch links from the first 3 pages of TrueLocal.com.au and write them to a text file.
When I run the program, only the first link is written to the text file. What can I do so that all the URLs returned are written to the file?
import requests
from bs4 import BeautifulSoup

def tru_crawler(max_pages):
    page = 1
    while page <= max_pages:
        url = 'http://www.truelocal.com.au/find/car-rental/' + str(page)
        code = requests.get(url)
        text = code.text
        soup = BeautifulSoup(text)
        for link in soup.findAll('a', {'class': 'name'}):
            href = 'http://www.truelocal.com.au' + link.get('href')
            fob = open('c:/test/true.txt', 'w')
            fob.write(href + '\n')
            fob.close()
            print(href)
        page += 1

#Run the function
tru_crawler(3)
Your problem is that for each link, you open the output file, write it, then close the file again. Not only is this inefficient, but unless you open the file in "append" mode each time, it will just get overwritten. What's happening is actually that the last link gets left in the file and everything prior is lost.
The quick fix would be to change the open mode from 'w' to 'a', but it'd be even better to slightly restructure your program. Right now the tru_crawler function is responsible for both crawling your site and writing output; instead it's better practice to have each function responsible for one thing only.
You can turn your crawl function into a generator that yields links one at a time, and then write the generated output to a file separately. Replace the three fob lines with:
yield href + '\n'
Then you can do the following:
lines = tru_crawler(3)
filename = 'c:/test/true.txt'
with open(filename, 'w') as handle:
    handle.writelines(lines)
Also note the usage of the with statement; opening the file using with automatically closes it once that block ends, saving you from having to call close() yourself.
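For clarity, the assembled generator version of tru_crawler (the same crawling logic as in the question, with the file handling removed) would look like this:

import requests
from bs4 import BeautifulSoup

def tru_crawler(max_pages):
    page = 1
    while page <= max_pages:
        url = 'http://www.truelocal.com.au/find/car-rental/' + str(page)
        soup = BeautifulSoup(requests.get(url).text)
        for link in soup.findAll('a', {'class': 'name'}):
            href = 'http://www.truelocal.com.au' + link.get('href')
            yield href + '\n'   # hand each line to the caller instead of writing it here
        page += 1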
Taking the idea of generators and task-separation one step further, you may notice that the tru_crawler function is also responsible for generating the list of URLs to crawl. That too can be separated out, if your crawler accepts an iterable of URLs instead of creating them itself. Something like:
def make_urls(base_url, pages):
    for page in range(1, pages + 1):
        yield base_url + str(page)

def crawler(urls):
    for url in urls:
        # fetch, parse, and yield hrefs
        soup = BeautifulSoup(requests.get(url).text)
        for link in soup.findAll('a', {'class': 'name'}):
            yield 'http://www.truelocal.com.au' + link.get('href') + '\n'
Then, instead of calling tru_crawler(3), it becomes:
urls = make_urls('http://www.truelocal.com.au/find/car-rental/', 3)
lines = crawler(urls)
and then proceed as above.
Now if you want to crawl other sites, you can just change your make_urls call, or create different generators for other URL-patterns, and the rest of your code doesn't need to change!
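For example, a hypothetical site that paginates via a query parameter would only need a different generator; crawler() itself stays untouched:

def make_query_urls(base_url, pages):
    # hypothetical URL pattern for a site that paginates via ?page=N
    for page in range(1, pages + 1):
        yield '{}?page={}'.format(base_url, page)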
By default 'w' is truncating mode and you may need append mode. See: https://docs.python.org/2/library/functions.html#open.
Maybe appending your hrefs to a list inside the while loop and then writing them to the file later would be more readable. Or, as suggested, use yield for efficiency.
Something like
with open('c:/test/true.txt', 'w') as fob:
    fob.writelines(yourlistofhref)
https://docs.python.org/2/library/stdtypes.html#file.writelines
I am writing code that creates several URLs, which are stored in a list.
The next step would be to open each URL, download the data (which is only text, formatted in XML or JSON) and save it.
My code works fine up to this point, thanks to the online community here. It is stuck at opening the URLs and downloading the data. I want urllib.request to loop through the list of my created URLs, call each URL separately, open it, display the data, and move on to the next one. But it only runs the loop that creates the URLs, and then nothing. No feedback, nothing.
import urllib.request
.... some calculations for llong and llat ....
#create the URLs and store in list
urls = []
for lat,long,lat1,long1 in (zip(llat, llong,llat[1:],llong[1:])):
    for pages in range (1,17):
        print ("https://api.flickr.com/services/rest/?method=flickr.photos.search&format=json&api_key=5.b&nojsoncallback=1&page={}&per_page=250&bbox={},{},{},{}&accuracy=1&has_geo=1&extras=geo,tags,views,description".format(pages,long,lat,long1,lat1))
print (urls)
#accessing the website
data = []
for amounts in urls:
    response = urllib.request.urlopen(urls)
    flickrapi = data.read()
    data.append(+flickrapi)
    data.close()
print (data)
What am I doing wrong?
The next step would be, downloading the data and save them to a file or somewhere else for further processing.
Since I will receive heaps of data (a lot), I am not sure what the best way to store it would be for processing it with R (or maybe Python? I need to do some statistical work on it). Any suggestions?
You're not appending your generated urls to the urls list; you are printing them:
print ("https://api.flickr.com/services/rest/?method=flickr.photos.search&format=json&api_key=5.b&nojsoncallback=1&page={}&per_page=250&bbox={},{},{},{}&accuracy=1&has_geo=1&extras=geo,tags,views,description".format(pages,long,lat,long1,lat1))
Should be:
urls.append("https://api.flickr.com/services/rest/?method=flickr.photos.search&format=json&api_key=5.b&nojsoncallback=1&page={}&per_page=250&bbox={},{},{},{}&accuracy=1&has_geo=1&extras=geo,tags,views,description".format(pages,long,lat,long1,lat1))
Then you can iterate over the urls as planned.
But you'll run into an error on the following line:
response = urllib.request.urlopen(urls)
Here you are feeding the whole list of urls into urlopen, when you should be passing in a single url from urls, which you have named amounts, like so:
response = urllib.request.urlopen(amounts)
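Putting both fixes together, the download loop would look roughly like this (note that it also reads from the response object rather than from the data list, which is an extra correction beyond the two points above):

data = []
for amounts in urls:
    response = urllib.request.urlopen(amounts)  # open one URL at a time
    flickrapi = response.read()                 # read the response body
    data.append(flickrapi)
print(data)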
Hello, I am trying to write a Python function that saves a list of URLs to a .txt file.
Example: visit http://forum.domain.com/ and save every URL containing the word viewtopic.php?t= to a .txt file, e.g.:
http://forum.domain.com/viewtopic.php?t=1333
http://forum.domain.com/viewtopic.php?t=2333
I use this code, but it does not save anything.
I am very new to Python; can someone help me create this?
web_obj = opener.open('http://forum.domain.com/')
data = web_obj.read()
fl_url_list = open('urllist.txt', 'r')
url_arr = fl_url_list.readlines()
fl_url_list.close()
This is far from trivial and can have quite a few corner cases (I assume the page you're referring to is a web page).
To give you a few pointers, you need to:
download the web page: you're already doing this (it's in data)
extract the URLs: this is the hard part. Most probably, you'll want to use an HTML parser, extract the <a> tags, fetch the href attribute, and put those into a list. Then filter that list to keep only the URLs formatted the way you like (say, containing viewtopic). Let's say you got them into urlList.
then open a file for writing text (thus 'wt', not 'r')
write the content: f.write('\n'.join(urlList))
close the file
I advise you to try to follow these steps and ask relevant questions when you're stuck on a particular issue; a rough sketch is given below.
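For reference, a minimal sketch of those steps using BeautifulSoup (the forum URL comes from the question; the parsing details, such as whether the hrefs are absolute, are assumptions you would adapt to the actual page):

import urllib.request
from bs4 import BeautifulSoup

# download the web page
data = urllib.request.urlopen('http://forum.domain.com/').read()

# extract the URLs: parse the <a> tags and keep only the viewtopic ones
soup = BeautifulSoup(data, 'html.parser')
urlList = [a['href'] for a in soup.find_all('a', href=True)
           if 'viewtopic.php?t=' in a['href']]

# open a file for writing text, write the content, close the file
with open('urllist.txt', 'wt') as f:
    f.write('\n'.join(urlList))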