Scraping and writing output to text file - python

I coded this scraper using Python 2.7 to fetch links from the first 3 pages of TrueLocal.com.au and write them to a text file.
When I run the program, only one link is written to the text file. What can I do so that all the returned URLs are written to the file?
import requests
from bs4 import BeautifulSoup

def tru_crawler(max_pages):
    page = 1
    while page <= max_pages:
        url = 'http://www.truelocal.com.au/find/car-rental/' + str(page)
        code = requests.get(url)
        text = code.text
        soup = BeautifulSoup(text)
        for link in soup.findAll('a', {'class':'name'}):
            href = 'http://www.truelocal.com.au' + link.get('href')
            fob = open('c:/test/true.txt', 'w')
            fob.write(href + '\n')
            fob.close()
            print (href)
        page += 1

#Run the function
tru_crawler(3)

Your problem is that for each link you open the output file, write to it, then close it again. Not only is this inefficient, but unless you open the file in append mode each time, it gets truncated on every open: what actually happens is that only the last link is left in the file and everything written before it is lost.
The quick fix would be to change the open mode from 'w' to 'a', but it would be even better to restructure your program slightly. Right now the tru_crawler function is responsible for both crawling the site and writing output; it is better practice to have each function responsible for one thing only.
You can turn your crawl function into a generator that yields links one at a time, and then write the generated output to a file separately. Replace the three fob lines with:
yield href + '\n'
Then you can do the following:
lines = tru_crawler(3)
filename = 'c:/test/true.txt'
with open(filename, 'w') as handle:
    handle.writelines(lines)
Also note the usage of the with statement; opening the file using with automatically closes it once that block ends, saving you from having to call close() yourself.
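For reference, here is a minimal sketch of what the whole generator version could look like (same URL pattern and 'name' class as your original code; the write-out stays exactly as shown above):

import requests
from bs4 import BeautifulSoup

def tru_crawler(max_pages):
    page = 1
    while page <= max_pages:
        url = 'http://www.truelocal.com.au/find/car-rental/' + str(page)
        soup = BeautifulSoup(requests.get(url).text, 'html.parser')
        for link in soup.findAll('a', {'class': 'name'}):
            # yield one output line per link instead of opening the file here
            yield 'http://www.truelocal.com.au' + link.get('href') + '\n'
        page += 1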
Taking the idea of generators and task-separation one step further, you may notice that the tru_crawler function is also responsible for generating the list of URLs to crawl. That too can be separated out, if your crawler accepts an iterable of URLs instead of creating them itself. Something like:
def make_urls(base_url, pages):
    for page in range(1, pages+1):
        yield base_url + str(page)

def crawler(urls):
    for url in urls:
        # fetch, parse, and yield hrefs
Then, instead of calling tru_crawler(3), it becomes:
urls = make_urls('http://www.truelocal.com.au/find/car-rental/', 3)
lines = crawler(urls)
and then proceed as above.
Now if you want to crawl other sites, you can just change your make_urls call, or create different generators for other URL-patterns, and the rest of your code doesn't need to change!
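If it helps, here is a rough sketch of that version, with the fetch-and-parse body filled in using the same requests/BeautifulSoup logic as before (the site prefix is still hard-coded here, but it could just as well be passed in as a parameter):

import requests
from bs4 import BeautifulSoup

def make_urls(base_url, pages):
    for page in range(1, pages + 1):
        yield base_url + str(page)

def crawler(urls):
    for url in urls:
        soup = BeautifulSoup(requests.get(url).text, 'html.parser')
        for link in soup.findAll('a', {'class': 'name'}):
            # yield one output line per matching link
            yield 'http://www.truelocal.com.au' + link.get('href') + '\n'

urls = make_urls('http://www.truelocal.com.au/find/car-rental/', 3)
with open('c:/test/true.txt', 'w') as handle:
    handle.writelines(crawler(urls))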

By default 'w' opens the file in truncating mode; you may need append mode ('a') instead. See: https://docs.python.org/2/library/functions.html#open.
Appending your hrefs to a list inside the while loop and writing them to the file afterwards would also read well. Or, as suggested above, use yield for efficiency.
Something like
with open('c:/test/true.txt', 'w') as fob:
    fob.writelines(yourlistofhref)
https://docs.python.org/2/library/stdtypes.html#file.writelines
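And if you only want the one-character quick fix mentioned above, a minimal sketch of the append-mode version of the original three lines would be (just make sure to clear or delete the file between runs, since 'a' never truncates):

fob = open('c:/test/true.txt', 'a')  # 'a' appends rather than truncating on every open
fob.write(href + '\n')
fob.close()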

Related

Replace part of a string with values from CSV, save all values in CSV PYTHON

I have a list of a million pins and one URL that contains a pin.
Example:
https://www.example.com/api/index.php?pin=101010&key=113494
I need to replace the number "101010" in the pin parameter with each of the million values (like 093939, 493943, 344454) that I have in a CSV file, and then save all of those new URLs to a CSV file.
Here's what I have tried doing so far that has not worked:
def change(var_data):
    var = str(var_data)
    url = 'https://www.example.com/api/index.php?pin=101010&key=113494'
    url1 = url.split('=')
    url2 = ''.join(url1[:-2] + [var] + [url1[-1]])
    print(url2)

change('xxxxxxxxxx')
Also, this is for an API request that returns a JSON page. Would using Python to iterate through the URLs I save in a CSV file be the best way to do this? I want to collect some information for all of the pins I have and save it to a BigQuery database, or somewhere I can connect to Google Data Studio, so that I can build a dashboard from all of this data.
Any ideas? What do you think would be the best way to get this done?
Answering the first part of the question: the change function below returns a URL built with an f-string.
It can then be applied to the whole list with a list comprehension.
The variable url_variables stands for the list of values you read in from the other file.
The resulting URL list is then written out as rows of a CSV file.
import csv

url_variables = [93939, 493943, 344454]

def change(var_data):
    var_data = str(var_data)
    url = 'https://www.example.com/api/index.php?pin='
    key = 'key=113494'
    new_url = f'{url}{var_data}&{key}'
    return new_url

url_list = [change(x) for x in url_variables]

with open('output.csv', 'w', newline='') as f:
    writer = csv.writer(f)
    for val in url_list:
        writer.writerow([val])
Output in output.csv
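Since the real pins live in a CSV file rather than a Python list, a small sketch of reading them in and writing the URLs out might look like this (the input file name 'pins.csv' and its single-column layout are assumptions; reading the pins as strings also keeps their leading zeros):

import csv

def change(var_data):
    # build the URL with the pin substituted in
    return f'https://www.example.com/api/index.php?pin={var_data}&key=113494'

# read the pins from the first column of the input file (name and layout assumed)
with open('pins.csv', newline='') as f:
    url_list = [change(row[0]) for row in csv.reader(f) if row]

with open('output.csv', 'w', newline='') as f:
    writer = csv.writer(f)
    for val in url_list:
        writer.writerow([val])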
1. First part of the question (replace the digits between "pin=" and "&"): I will use an answer from the "Change a text between two strings in Python with Regex" post:
import re

def change(var_data):
    var = str(var_data)
    url = 'https://www.example.com/api/index.php?pin=101010&key=113494'
    url2 = re.sub("(?<=pin=).*?(?=&)", var, url)
    print(url2)

change('xxxxxxxxxx')
Here I use the sub method from the built-in "re" package and the regex lookaround syntax, where:
(?<=pin=)  # asserts that what immediately precedes the current position in the string is "pin="
.*?        # matches any run of characters, as few as possible
(?=&)      # asserts that what immediately follows the current position in the string is "&"
Here is a formal explanation of the lookaround syntax.
2. Second part of the question: As another answer explains, you can write the URLs to the CSV file as rows, but I recommend reading this post about handling CSV files with Python to get an idea of the way you want to save them.
I am not very good at English, but I hope I have explained myself well.

Loop for automatically webscrape data from few pages

I've been trying to figure out how to write a loop and couldn't work it out from other threads, so I need help. I am totally new to this, so editing existing code is hard for me.
I am trying to scrape data from a website. Here's what I've done so far, but I have to insert the pages "manually". I want it to automatically scrape prices in zl/m2 from pages 1 to 20, for example:
import requests
from bs4 import BeautifulSoup

link = ("https://ogloszenia.trojmiasto.pl/nieruchomosci-mam-do-wynajecia/wi,100.html?strona=1")
page = requests.get(link).text

link1 = ("https://ogloszenia.trojmiasto.pl/nieruchomosci-mam-do-wynajecia/wi,100.html?strona=2")
page1 = requests.get(link1).text

link2 = ("https://ogloszenia.trojmiasto.pl/nieruchomosci-mam-do-wynajecia/wi,100.html?strona=3")
page2 = requests.get(link2).text

pages = page + page1 + page2 + page3 + page4 + page5 + page6

soup = BeautifulSoup(pages, 'html.parser')
price_box = soup.findAll('p', attrs={'class':'list__item__details__info details--info--price'})

prices = []
for i in range(len(price_box)):
    prices.append(price_box[i].text.strip())
prices
I've tried with this code, but got stuck. I don't know what I should add to get output from all 20 pages at once, or how to save it to a CSV file.
npages = 20
baselink = "https://ogloszenia.trojmiasto.pl/nieruchomosci-mam-do-wynajecia/wi,100.html?strona="
for i in range(1, npages+1):
    link = baselink + str(i)
    page = requests.get(link).text
Thanks in advance for any help.
Python is whitespace sensitive, so the body of any loop needs to be indented, like so:
for i in range(1, npages+1):
    link = baselink + str(i)
    page = requests.get(link).text
If you want all of the pages in a single string (so you can use the same approach as with your pages variable above), you can append the strings together in your loop:
pages = ""
for i in range (1,npages+1):
link=baselink+str(i)
pages += requests.get(link).text
To create a csv file with your results, you can look into csv.writer() in Python's built-in csv module, but I usually find it easier to write to a file using print():
with open(samplefilepath, mode="w+") as output_file:
    for price in prices:
        print(price, file=output_file)
w+ tells Python to create the file if it doesn't exist and overwrite it if it does; a+ would append to the file if it already exists.
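Putting those pieces together, one possible sketch of the whole job, using csv.writer for the output (the selector is the same one from your code; 'prices.csv' is just an example file name):

import csv
import requests
from bs4 import BeautifulSoup

npages = 20
baselink = "https://ogloszenia.trojmiasto.pl/nieruchomosci-mam-do-wynajecia/wi,100.html?strona="

prices = []
for i in range(1, npages + 1):
    page = requests.get(baselink + str(i)).text
    soup = BeautifulSoup(page, 'html.parser')
    # same price paragraphs as in the original code
    for box in soup.findAll('p', attrs={'class': 'list__item__details__info details--info--price'}):
        prices.append(box.text.strip())

# one price per row
with open('prices.csv', 'w', newline='') as f:
    writer = csv.writer(f)
    for price in prices:
        writer.writerow([price])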

Continuing where I left off Python

I have a list of links and I run a specific function on each one; the function takes about 25 seconds per link. I use Selenium to open each link and get its page source, then run my function. However, whenever I run the program and cancel it partway through, I have to start all over again.
Note: I get the links from the sitemaps of different websites.
Is there a way to save my progress and continue it later on?
This code should work. I assume you already have a function for getting links; I have just used a dummy one, _get_links.
You will have to empty the links file and put 0 back in the index file after every fully successful run.
import time

def _get_links():
    return ["a", "b", "c"]

def _get_links_from_file():
    with open("links") as file:
        # drop empty entries so an emptied file counts as "no links yet"
        return [link for link in file.read().split(",") if link]

def _do_something(link):
    print(link)
    time.sleep(30)

def _save_links_to_file(links):
    with open("links", "w") as file:
        file.write(",".join(links))
    print("links saved")

def _save_index_to_file(index):
    with open("index", "w") as file:
        file.write(str(index))
    print("index saved")

def _get_index_from_file():
    with open("index") as file:
        return int(file.read().strip())

def process_links():
    links = _get_links_from_file()
    if len(links) == 0:
        links = _get_links()
        _save_links_to_file(links)
        start = 0
    else:
        start = _get_index_from_file()
        links = links[start:]
    for index, link in enumerate(links):
        _do_something(link)
        # remember how far we got, counted from the start of the full list
        _save_index_to_file(start + index + 1)

if __name__ == '__main__':
    process_links()
I would suggest that you write out the links to a file along with a date/time stamp of the last time it was processed. When you write links to the file, you will want to make sure that you don't write the same link twice. You will also want to date/time stamp a link after you are done processing it.
Once you have this list, when the script is started you read the entire list and start processing links that haven't been processed in X days (or whatever your criterion is).
Steps:
1. Load the links file
2. Scrape links from the sitemap, compare them to the existing links from the file, and write any new links to the file
3. Find the first link that hasn't been processed in X days
4. Process that link, then write a date/time stamp next to it, e.g. http://www.google.com,1/25/2019 12:00PM
5. Go back to step 3
Now any time you kill the run, the next run will pick up where you left off.
NOTE: Just writing out the date may be enough. It just depends on how often you want to refresh your list (hourly, etc.) or if you want that much detail.
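A rough sketch of that idea, assuming a 'links.csv' file with one "url,timestamp" row per link, and with scrape_sitemap() and process() as placeholders standing in for your own sitemap scraper and 25-second Selenium function:

import csv
from datetime import datetime, timedelta

LINKS_FILE = 'links.csv'                 # assumed file name
STAMP_FMT = '%m/%d/%Y %I:%M%p'           # e.g. 01/25/2019 12:00PM

def load_links():
    links = {}
    try:
        with open(LINKS_FILE, newline='') as f:
            for row in csv.reader(f):
                if row:
                    # an empty timestamp means "never processed"
                    links[row[0]] = row[1] if len(row) > 1 else ''
    except FileNotFoundError:
        pass
    return links

def save_links(links):
    with open(LINKS_FILE, 'w', newline='') as f:
        writer = csv.writer(f)
        for url, stamp in links.items():
            writer.writerow([url, stamp])

def needs_processing(stamp, days=1):
    if not stamp:
        return True
    return datetime.strptime(stamp, STAMP_FMT) < datetime.now() - timedelta(days=days)

links = load_links()
for url in scrape_sitemap():             # placeholder: your existing sitemap scraper
    links.setdefault(url, '')            # never write the same link twice
save_links(links)

for url, stamp in links.items():
    if needs_processing(stamp):
        process(url)                     # placeholder: your existing per-link function
        links[url] = datetime.now().strftime(STAMP_FMT)
        save_links(links)                # stamp the link as soon as it is done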
You should save the links in a text file. You should also save the index number in another text file, probably initialized to 0.
In your code, you can then loop through the links using something like:
for link in links[index_number:]
At the end of every iteration, write the updated index number to the text file holding the index. This lets you continue from where you left off.
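A very small sketch of that approach, assuming the links live in 'links.txt' (one per line), the index lives in 'index.txt', and do_something() stands in for your per-link work:

with open('links.txt') as f:
    links = f.read().splitlines()

with open('index.txt') as f:
    index_number = int(f.read().strip() or 0)

for offset, link in enumerate(links[index_number:]):
    do_something(link)                            # placeholder for your Selenium work
    with open('index.txt', 'w') as f:
        # remember how far into the full list we got
        f.write(str(index_number + offset + 1))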

Open Browser with python along with a search command

My exact problem statement is:
I have:
1. sim_ques (a text)
2. options (a list)
and I want to do this:
1. Open a web browser like Chrome and search for the string search_query
2. Press Ctrl+F programmatically and search for an element of the options list
3. If step 2 is possible, I want to search multiple strings in the same browser.
NOTE
I have already tried:
Using Google-Search-API (https://github.com/abenassi/Google-Search-API) to get results, then retrieving the link descriptions from the result list and searching for the string in those descriptions.
Code is shown here:
print('Googling')
num_pages = 1
points = list()
content = ""

search_res = google.search(sim_ques, num_pages)
print('\nsearch results achieved\n')

page = ""
for re in search_res:
    page = page + re.description
page = page.lower()

# link = search_res[0].link
# print('\nlink obtained\n')
#
# content = get_page(link)
# print('\ncontent received\n')
#
# soup = BeautifulSoup(content, "lxml")
# print('\nsoup initialized\n')
#
# # kill all script and style elements
# for script in soup(["script", "style"]):
#     script.decompose()  # rip it out
#
# # get text
# text = soup.get_text().lower()
#
# # break into lines and remove leading and trailing space on each
# lines = (line.strip() for line in text.splitlines())
# # break multi-headlines into a line each
# chunks = (phrase.strip() for line in lines for phrase in line.split(" "))
# # drop blank lines
# page = '\n'.join(chunk for chunk in chunks if chunk)

print('page retrieved' + page)

for o in options:
    points.append(page.count(o.lower()))

return points
But I want the results in a new browser window and not inside the Python IDE. Also, the google-search-api results are very slow. Is there any way to make this faster?
Although you tried using Google-Search-API (https://github.com/abenassi/Google-Search-API), this is really just a wrapper around some scraping code, and the poor performance you are experiencing is a result of that specific implementation.
To get performance similar to what you experience when using Google Search in a browser, you can set up a custom search engine on the actual Google Custom Search JSON API; more here: https://developers.google.com/custom-search/v1/introduction
This requires getting an API key and limits you to 10,000 searches daily, but the control panel does allow you to include web search, and you can tweak the actual search in many ways. Many people ignore this option because they believe it is restricted to a single site.
Once you obtain an API key and set up your custom search engine, using it and getting good performance is as simple as making basic HTTP calls using the standard Python 3 package urllib.
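As a rough sketch of what such a call could look like with urllib (the key and search engine id are placeholders you get from the API console; the field names follow the Custom Search JSON response format):

import json
import urllib.parse
import urllib.request

API_KEY = 'your-api-key'            # placeholder: key from the Google API console
CX = 'your-search-engine-id'        # placeholder: custom search engine id

def google_search(query):
    params = urllib.parse.urlencode({'key': API_KEY, 'cx': CX, 'q': query})
    url = 'https://www.googleapis.com/customsearch/v1?' + params
    with urllib.request.urlopen(url) as response:
        results = json.loads(response.read().decode('utf-8'))
    # each result item carries a title, link and snippet
    return [item.get('snippet', '') for item in results.get('items', [])]

# count how often each option appears in the combined snippets
snippets = ' '.join(google_search(sim_ques)).lower()
points = [snippets.count(o.lower()) for o in options]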

Open URLS from list and write data

I am writing code which creates several URLs and stores them in a list.
The next step is to open each URL, download the data (which is only text, formatted as XML or JSON) and save it.
My code works fine thanks to the online community here. It gets stuck at the point of opening the URLs and downloading the data. I want urllib.request to loop through the list of created URLs, call each URL separately, open it, display the data and move on to the next. But it only runs the loop that creates the URLs, and then nothing. No feedback, nothing.
import urllib.request

# .... some calculations for llong and llat ....

# create the URLs and store them in a list
urls = []
for lat, long, lat1, long1 in zip(llat, llong, llat[1:], llong[1:]):
    for pages in range(1, 17):
        print("https://api.flickr.com/services/rest/?method=flickr.photos.search&format=json&api_key=5.b&nojsoncallback=1&page={}&per_page=250&bbox={},{},{},{}&accuracy=1&has_geo=1&extras=geo,tags,views,description".format(pages, long, lat, long1, lat1))
print(urls)

# accessing the website
data = []
for amounts in urls:
    response = urllib.request.urlopen(urls)
    flickrapi = data.read()
    data.append(+flickrapi)
    data.close()
print(data)
What am I doing wrong?
The next step would be downloading the data and saving it to a file or somewhere else for further processing.
Since I will receive heaps of data, I am not sure what the best way to store it would be so that I can process it with R (or maybe Python? I need to do some statistical work on it). Any suggestions?
You're not appending your generated urls to the url list, you are printing them:
print ("https://api.flickr.com/services/rest/?method=flickr.photos.search&format=json&api_key=5.b&nojsoncallback=1&page={}&per_page=250&bbox={},{},{},{}&accuracy=1&has_geo=1&extras=geo,tags,views,description".format(pages,long,lat,long1,lat1))
Should be:
urls.append("https://api.flickr.com/services/rest/?method=flickr.photos.search&format=json&api_key=5.b&nojsoncallback=1&page={}&per_page=250&bbox={},{},{},{}&accuracy=1&has_geo=1&extras=geo,tags,views,description".format(pages,long,lat,long1,lat1))
Then you can iterate over the urls as planned.
But then you'll run into an error on the following line:
response = urllib.request.urlopen(urls)
Here you are feeding the whole urls list into urlopen, where you should be passing in a single url from urls, which you have named amounts, like so:
response = urllib.request.urlopen(amounts)
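With both fixes applied, and with the response object (rather than the data list) being read, decoded and collected, the download loop could look roughly like this:

import urllib.request

data = []
for amounts in urls:
    response = urllib.request.urlopen(amounts)
    # read the body and decode the bytes into text (JSON or XML)
    flickrapi = response.read().decode('utf-8')
    data.append(flickrapi)
    response.close()

print(data)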
