My exact problem statement is:
I have:
1. sim_ques (a text string)
2. options (a list)
and I want to do this:
1. Open a web browser like Chrome and search with the string search_query
2. Press Ctrl+F programmatically and search for an element of the list options
3. If step 2 is possible, I want to search for multiple strings in the same browser.
NOTE
I have already tried:
Using Google-Search-API (https://github.com/abenassi/Google-Search-API) to get results and then retrieving the link descriptions from the result list. Then I searched for the string in that description.
Code is shown here:
print('Googling')
num_pages = 1
points = list()
content = ""
search_res = google.search(sim_ques, num_pages)
print('\nsearch results achieved\n')

page = ""
for re in search_res:
    page = page + re.description
page = page.lower()

# link = search_res[0].link
# print('\nlink obtained\n')
#
# content = get_page(link)
# print('\ncontent received\n')
#
# soup = BeautifulSoup(content, "lxml")
# print('\nsoup initialized\n')
#
# # kill all script and style elements
# for script in soup(["script", "style"]):
#     script.decompose()  # rip it out
#
# # get text
# text = soup.get_text().lower()
#
# # break into lines and remove leading and trailing space on each
# lines = (line.strip() for line in text.splitlines())
# # break multi-headlines into a line each
# chunks = (phrase.strip() for line in lines for phrase in line.split(" "))
# # drop blank lines
# page = '\n'.join(chunk for chunk in chunks if chunk)

print('page retrieved ' + page)

for o in options:
    points.append(page.count(o.lower()))
return points
But I want the results in a new browser window and not inside the Python IDE. Also, google-search-api returns results very slowly. Is there any way to make it faster?
Although you tried using Google-Search-API (https://github.com/abenassi/Google-Search-API), this is really just a wrapper around some scraping code, and the poor performance you are experiencing is a result of that specific implementation.
To get performance similar to what you experience when using Google Search in a browser, you can set up a custom search against the actual Google Custom Search JSON API; more here: https://developers.google.com/custom-search/v1/introduction
This requires that you get an API key and limits you to 10,000 searches daily, but the control panel does allow you to include web search and you can tweak the actual search in many ways. Many people ignore this option because they believe it is restricted to a single site.
Once you obtain an API key and set up your custom search engine, using it and getting good performance is as simple as basic http calls using the standard Python 3 package urllib.
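For reference, here is a minimal sketch of such a call using only the standard library. The API key and search engine ID are placeholders you would get from the Custom Search control panel, and the snippet field plays the role of re.description in the question's code:

import json
import urllib.parse
import urllib.request

API_KEY = "YOUR_API_KEY"           # placeholder: from the Google Cloud console
SEARCH_ENGINE_ID = "YOUR_CX_ID"    # placeholder: from the Custom Search control panel

def custom_search(query):
    # Build the request URL for the Custom Search JSON API
    params = urllib.parse.urlencode({
        "key": API_KEY,
        "cx": SEARCH_ENGINE_ID,
        "q": query,
    })
    url = "https://www.googleapis.com/customsearch/v1?" + params
    with urllib.request.urlopen(url) as response:
        results = json.loads(response.read().decode("utf-8"))
    # Each result item carries a title, link and snippet
    return [item.get("snippet", "") for item in results.get("items", [])]

# Example usage with the question's variables: count option hits in the snippets
snippets = " ".join(custom_search(sim_ques)).lower()
points = [snippets.count(o.lower()) for o in options]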
I am trying to make this bot with the help of ChatGPT. I know it isn't very good, but I really like testing it to its limits and seeing what it can or can't do. There is a website I use for my job where I type in some numbers and it tells me whether each number is active or not, and I want the bot to write the result to an Excel file and print it to the console so I don't have to check the Excel file every time (at least right now while I'm testing it).

Right now I have made it search for some keywords on the website, and when it finds them it should print what I want to the console. For example, if it finds the phrase "No results found" (in Greek, because that's the language of the site), it should write an "X" in the Excel file next to that number and print "No" to the console; if it doesn't find it, it should print "yes" and put "Active" in the Excel file, and so on. For some reason this code just doesn't seem to work. Is it because of the Greek characters? When I run it for testing purposes it just prints "yes" even though the keyword might not even be on the page. I have no idea, please help me. Thanks.
The code:
# Import necessary modules
import openpyxl
import requests
from lxml import html
import webbrowser
import time

# Open the Excel file and get the active worksheet
wb = openpyxl.load_workbook("numbers.xlsx")
sheet = wb.active

# Loop through the rows of the worksheet
for row in sheet.rows:
    # Get the value of the first cell in the row
    number = row[0].value
    # Construct the URL using the number
    url = "https://www.vrisko.gr/afm-etairies/" + str(number)
    # Open the URL in the default web browser
    webbrowser.open_new_tab(url)
    # Send a request to the URL and get the response
    response = requests.get(url)
    # Parse the response using lxml
    tree = html.fromstring(response.text)
    # Look for the phrase "Δε βρέθηκαν αποτελέσματα" ("No results found") in the page
    word = tree.xpath('//*[contains(text(), "Δε βρέθηκαν αποτελέσματα")]')
    word1 = tree.xpath('//*[contains(text(), "Έχετε υπερβεί το μέγιστο αριθμό αναζητήσεων ΑΦΜ για σήμερα.")]')
    # If the phrase is found, write an "X" to the next cell in the row
    if word:
        sheet.cell(row=row[0].row, column=row[0].column + 1).value = "X"
        print("no")
    elif word1:
        print("spam")
    else:
        # If the phrase is not found, write "ΕΝΕΡΓΟ" ("active") to the next cell in the row
        sheet.cell(row=row[0].row, column=row[0].column + 1).value = "ΕΝΕΡΓΟ"
        print("yes")
    # Wait for 30 seconds before continuing to the next row
    time.sleep(30)

# Save the changes to the Excel file
wb.save("numbers.xlsx")
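One way to narrow this down (just a debugging sketch, not from the original post, and the test number is made up) is to check whether the Greek phrase actually appears in the HTML that requests receives, since the page served to requests can differ from what the browser shows and the encoding may need to be fixed first:

import requests

url = "https://www.vrisko.gr/afm-etairies/123456789"    # hypothetical test number
response = requests.get(url)
response.encoding = response.apparent_encoding          # make sure the Greek text is decoded correctly
print(response.status_code)
print("Δε βρέθηκαν αποτελέσματα" in response.text)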
I've been trying to figure out how to make a loop and couldn't work it out from other threads, so I need help. I am totally new to this, so editing existing code is hard for me.
I am trying to scrape data from a website. Here's what I've done so far, but I have to insert the pages "manually". I want it to automatically scrape prices in zl/m2 from, for example, pages 1 to 20:
import requests
from bs4 import BeautifulSoup

link = "https://ogloszenia.trojmiasto.pl/nieruchomosci-mam-do-wynajecia/wi,100.html?strona=1"
page = requests.get(link).text
link1 = "https://ogloszenia.trojmiasto.pl/nieruchomosci-mam-do-wynajecia/wi,100.html?strona=2"
page1 = requests.get(link1).text
link2 = "https://ogloszenia.trojmiasto.pl/nieruchomosci-mam-do-wynajecia/wi,100.html?strona=3"
page2 = requests.get(link2).text

# concatenate the pages fetched above (only three pages so far)
pages = page + page1 + page2

soup = BeautifulSoup(pages, 'html.parser')
price_box = soup.findAll('p', attrs={'class': 'list__item__details__info details--info--price'})

prices = []
for i in range(len(price_box)):
    prices.append(price_box[i].text.strip())
prices
I've tried with this code, but got stuck. I don't know what I should add to get the output from all 20 pages at once, or how to save it to a CSV file.
npages = 20
baselink = "https://ogloszenia.trojmiasto.pl/nieruchomosci-mam-do-wynajecia/wi,100.html?strona="
for i in range(1, npages + 1):
    link = baselink + str(i)
    page = requests.get(link).text
Thanks in advance for any help.
Python is whitespace-sensitive, so the body of any loop needs to be indented, like so:
for i in range(1, npages + 1):
    link = baselink + str(i)
    page = requests.get(link).text
If you want all of the pages in a single string (so you can use the same approach as with your pages variable above), you can append the strings together in your loop:
pages = ""
for i in range(1, npages + 1):
    link = baselink + str(i)
    pages += requests.get(link).text
To create a CSV file with your results, you can look into the csv.writer() function in Python's built-in csv module, but I usually find it easier to write to a file using print():
with open(samplefilepath, mode="w+") as output_file:
    for price in prices:
        print(price, file=output_file)
w+ tells Python to create the file if it doesn't exist and overwrite it if it does; a+ would append to the existing file if it exists.
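If you do want a proper CSV instead, here is a minimal sketch of the csv.writer() approach mentioned above (the output filename is just an example):

import csv

with open("prices.csv", mode="w", newline="") as output_file:   # example filename
    writer = csv.writer(output_file)
    writer.writerow(["price"])            # header row
    for price in prices:
        writer.writerow([price])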
I coded this scraper using Python 2.7 to fetch links from the first 3 pages of TrueLocal.com.au and write them to a text file.
When I run the program, only the first link is written in the text file. What can I do so that all the URLs returned are written on the file?
import requests
from bs4 import BeautifulSoup

def tru_crawler(max_pages):
    page = 1
    while page <= max_pages:
        url = 'http://www.truelocal.com.au/find/car-rental/' + str(page)
        code = requests.get(url)
        text = code.text
        soup = BeautifulSoup(text)
        for link in soup.findAll('a', {'class': 'name'}):
            href = 'http://www.truelocal.com.au' + link.get('href')
            fob = open('c:/test/true.txt', 'w')
            fob.write(href + '\n')
            fob.close()
            print(href)
        page += 1

# Run the function
tru_crawler(3)
Your problem is that for each link, you open the output file, write it, then close the file again. Not only is this inefficient, but unless you open the file in "append" mode each time, it will just get overwritten. What's happening is actually that the last link gets left in the file and everything prior is lost.
The quick fix would be to change the open mode from 'w' to 'a', but it'd be even better to slightly restructure your program. Right now the tru_crawler function is responsible for both crawling your site and writing output; instead it's better practice to have each function responsible for one thing only.
You can turn your crawl function into a generator that yields links one at a time, and then write the generated output to a file separately. Replace the three fob lines with:
yield href + '\n'
Then you can do the following:
lines = tru_crawler(3)
filename = 'c:/test/true.txt'
with open(filename, 'w') as handle:
    handle.writelines(lines)
Also note the usage of the with statement; opening the file using with automatically closes it once that block ends, saving you from having to call close() yourself.
Taking the idea of generators and task-separation one step further, you may notice that the tru_crawler function is also responsible for generating the list of URLs to crawl. That too can be separated out, if your crawler accepts an iterable of URLs instead of creating them itself. Something like:
def make_urls(base_url, pages):
    for page in range(1, pages + 1):
        yield base_url + str(page)

def crawler(urls):
    for url in urls:
        # fetch, parse, and yield hrefs
Then, instead of calling tru_crawler(3), it becomes:
urls = make_urls('http://www.truelocal.com.au/find/car-rental/', 3)
lines = crawler(urls)
and then proceed as above.
Now if you want to crawl other sites, you can just change your make_urls call, or create different generators for other URL-patterns, and the rest of your code doesn't need to change!
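As an illustration, here is a sketch of how those pieces could fit together, filling in the placeholder crawler body with the same requests/BeautifulSoup parsing used in the question (the 'a.name' selector and output path are taken from the original code; everything else is just an assumption for the sketch):

import requests
from bs4 import BeautifulSoup

def make_urls(base_url, pages):
    for page in range(1, pages + 1):
        yield base_url + str(page)

def crawler(urls):
    for url in urls:
        # fetch the page and parse it, as in the original tru_crawler
        soup = BeautifulSoup(requests.get(url).text)
        for link in soup.findAll('a', {'class': 'name'}):
            yield 'http://www.truelocal.com.au' + link.get('href') + '\n'

urls = make_urls('http://www.truelocal.com.au/find/car-rental/', 3)
with open('c:/test/true.txt', 'w') as handle:
    handle.writelines(crawler(urls))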
By default 'w' is truncating mode and you may need append mode. See: https://docs.python.org/2/library/functions.html#open.
Appending your hrefs to a list inside the while loop and then writing them to the file afterwards would also be more readable. Or, as suggested, use yield for efficiency.
Something like
with open('c:/test/true.txt', 'w') as fob:
    fob.writelines(yourlistofhref)
https://docs.python.org/2/library/stdtypes.html#file.writelines
I am writing a code which creates several URLs, which again are stored in a list.
The next step would be, open each URL, download the data (which is only text, formatted in XML or JSON) and save the downloaded data.
My code works fine up to this point, thanks to the online community here. It gets stuck at opening the URLs and downloading the data. I want urllib.request to loop through the list of created URLs and call each URL separately: open it, display it, and move on to the next one. But it only runs the loop that creates the URLs, and then nothing happens. No feedback, nothing.
import urllib.request

.... some calculations for llong and llat ....

# create the URLs and store them in a list
urls = []
for lat, long, lat1, long1 in zip(llat, llong, llat[1:], llong[1:]):
    for pages in range(1, 17):
        print("https://api.flickr.com/services/rest/?method=flickr.photos.search&format=json&api_key=5.b&nojsoncallback=1&page={}&per_page=250&bbox={},{},{},{}&accuracy=1&has_geo=1&extras=geo,tags,views,description".format(pages, long, lat, long1, lat1))
print(urls)

# accessing the website
data = []
for amounts in urls:
    response = urllib.request.urlopen(urls)
    flickrapi = data.read()
    data.append(+flickrapi)
    data.close()
print(data)
What am I doing wrong?
The next step would be downloading the data and saving it to a file or somewhere else for further processing.
Since I will receive heaps of data (a lot), I am not sure what the best way to store it would be so I can process it with R (or maybe Python? I need to do some statistical work on it). Any suggestions?
You're not appending your generated urls to the url list, you are printing them:
print ("https://api.flickr.com/services/rest/?method=flickr.photos.search&format=json&api_key=5.b&nojsoncallback=1&page={}&per_page=250&bbox={},{},{},{}&accuracy=1&has_geo=1&extras=geo,tags,views,description".format(pages,long,lat,long1,lat1))
Should be:
urls.append("https://api.flickr.com/services/rest/?method=flickr.photos.search&format=json&api_key=5.b&nojsoncallback=1&page={}&per_page=250&bbox={},{},{},{}&accuracy=1&has_geo=1&extras=geo,tags,views,description".format(pages,long,lat,long1,lat1))
Then you can iterate over the urls as planned.
But you'll run into the error on the following line:
response = urllib.request.urlopen(urls)
Here you are feeding the whole list of urls into urlopen, whereas you should be passing in a single url from urls, which you have named amounts, like so:
response = urllib.request.urlopen(amounts)
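Putting the two fixes together, the download loop might look roughly like this (a sketch only; note it also reads from response rather than from the data list, which the original code mixed up):

import urllib.request

data = []
for amounts in urls:
    # open one URL at a time and read the JSON text it returns
    with urllib.request.urlopen(amounts) as response:
        flickrapi = response.read().decode("utf-8")
    data.append(flickrapi)
print(data)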
I'm doing some web scraping (actually geocoding using a web service) and writing the results to a file:
for i, row in enumerate(data):
    data[0] = url
    output = {}
    try:
        r = requests.get(url)
        if r.status_code == 200:
            results = r.json()
            if results:
                output['Lat'] = results['wgs84_lat']
                output['Lng'] = results['wgs84_lon']
                writer.writerow(output)
    except:
        pass
My problem is that if something goes wrong half-way through and I have to stop the process, I end up with a half-written file.
Then I have two choices: I can either restart from scratch (tedious with a million rows to geocode), or I can add boilerplate code to check whether the row already exists in the output file, and skip it if so.
I feel there must be a more convenient way to check whether a row has already been processed. What's the neatest, most Pythonic way to do this? Perhaps pickle (which I've never used)?
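For illustration only, here is a minimal sketch of the "check the output file and skip what's already done" idea mentioned above. It assumes rows are processed in a fixed order and that the output file is named output.csv (both assumptions, not from the original code):

import os

# Count how many rows were already written in a previous run,
# then skip that many input rows before resuming.
done = 0
if os.path.exists("output.csv"):
    with open("output.csv") as existing:
        done = sum(1 for _ in existing)

for i, row in enumerate(data):
    if i < done:
        continue   # already geocoded in a previous run
    ...            # fetch and write as before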