I am on Windows 8.1, Python 3.6.
Is it possible to get all currently open websites in the latest version of Chrome and save them to a text file in D:/?
I tried opening the file:
C:\Users\username\AppData\Local\Google\Chrome\User Data\Default\Current Tabs
But I received an error saying that the file is open in another program.
There is another file named History that contains the URLs that were opened, but it also contains characters like NULL.
I tried reading the file in Python but received a UnicodeDecodeError.
Then I tried opening the file with the following code:
with open('C:/Users/username/AppData/Local/Google/Chrome/User Data/Default/History', 'r+', encoding='latin-1') as file:
    data = file.read()
print(data)
And it worked, but I only got 1 or 2 URLs, while the text file contained no URLs.
Maybe there's another way, like importing a module.
Something like:
import chrome
url = chrome.get_url()
print(url)
Maybe Selenium can also do this, but I don't know how.
Or maybe there's another way to read the file with all the links in Python.
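For what it's worth, the History file is actually a SQLite database, which is why reading it as text shows NUL bytes and raises decode errors. A minimal sketch of reading it with the standard sqlite3 module (note: this lists browsing history, not the currently open tabs; the file has to be copied first because Chrome keeps it locked while running, and the username in the path is a placeholder):

import shutil
import sqlite3

# Copy the locked database so it can be opened while Chrome is running.
src = r'C:\Users\username\AppData\Local\Google\Chrome\User Data\Default\History'
shutil.copy(src, 'History_copy')

# The 'urls' table holds one row per visited URL.
con = sqlite3.connect('History_copy')
for (url,) in con.execute('SELECT url FROM urls ORDER BY last_visit_time DESC'):
    print(url)
con.close()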
What I want to do with this: detect which websites are open, and if mywebsite.com has been open for more than 10 minutes, block it automatically. The system has its own file:
C:\Windows\System32\drivers\etc\hosts
The script will append the following line at the end:
127.0.0.1 www.mywebsite.com
And the website will no longer be reachable.
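For illustration, a minimal sketch of that hosts-file step (the domain is the placeholder from above, and the script must be run with administrator rights to be allowed to write to this file):

hosts_path = r'C:\Windows\System32\drivers\etc\hosts'
entry = '127.0.0.1 www.mywebsite.com'

# Append the blocking entry unless it is already present.
with open(hosts_path, 'r+') as hosts:
    if entry not in hosts.read():
        hosts.write('\n' + entry + '\n')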
You can use the following approach to store the tab data and manipulate it:
windows = driver.window_handles
You can store the window handles using the line above.
current_window = driver.current_window_handle
This gives you the window that is currently being handled. You can iterate over the list windows and compare each handle with current_window to navigate between the tabs.
driver.switch_to.window(windows[5])
This call switches to a desired tab, but I assume you already have that part.
Now, how do you store the time spent after the tabs are opened?
There are two ways to do it:
Internally, by keeping a pandas DataFrame or a list
Reading from and writing to a file.
First you need to import the time library in the script:
current_time = time.time()
current_time is a float representing the current time as a Unix timestamp (seconds since the epoch).
In either one of these scenarios, you will need a structure such as this:
data = []
for i in range(len(windows)):
    data.append([windows[i], time.time()])
This will give a structure like the one below:
[[windows[0], 1234564879],
 [windows[1], 1234567896], ...]
Here's the part you're missing:
for i in range(len(data)):
    if time.time() - data[i][1] > 600:  # if the new timestamp minus the old one exceeds 600 seconds
        driver.switch_to.window(data[i][0])
        driver.close()
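Putting the pieces together, a minimal sketch of the whole loop (assuming Chrome was started by, and the tabs were opened through, this Selenium session; 600 seconds corresponds to the 10-minute limit in the question):

import time
from selenium import webdriver

driver = webdriver.Chrome()

# Record each open tab together with the moment it was observed.
data = [[handle, time.time()] for handle in driver.window_handles]

# Later: close every tab that has been open for more than 600 seconds.
for handle, opened_at in data:
    if time.time() - opened_at > 600:
        driver.switch_to.window(handle)
        driver.close()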
My personal advice is to start with stable API services to get whatever data you want instead of Selenium. I would recommend SerpApi, since I work there. It has a variety of scrapers, including a Google results scraper, and it gives 5,000 free calls to new accounts.
Related
I use Python 3.7 with Spyder in Anaconda. I don't have a lot of experience with Python, so I might use the wrong technical terms in my problem description.
I use the requests library to read process data for a list of part numbers from a database with a web page interface. I use the following code; I found most of it on Stack Overflow.
# Libraries
import requests
import pandas as pd
import lxml.html as LH  # lxml is used by pandas.read_html as a parser backend

# Get link for part results from hyperlink list
for link in hyperlink_list:
    # Add part number to database link
    process_url = database_link + link
    html = requests.get(process_url).content
    # Read data to dataframe
    df_list = pd.read_html(html)
The for loop fetches the link for the next part number from the hyperlink list and then modifies process_url to extract the data for that part number. The code above works well, except that it takes more than twice as long (2.2 seconds) as my VBA code that does the same. It looks like it opens and closes the connection for every part number. Is there any way to open the URL connection and read many different web pages before closing it?
I'm assuming that it opens and closes the connection for every part based on the fact that I had the same delay with Excel VBA code that opened and closed Internet Explorer for every data read. When I changed the VBA code to keep Internet Explorer open and read all the web pages, it took less than a second.
I managed to reduce the time by 0.5 seconds by removing requests.get(process_url).content
and letting pandas read the URL directly with df_list = pd.read_html(process_url). It now takes around 1.7 seconds to read the 400 rows of data in the table for each part. This adds up to a good time saving when I have to read thousands of tables, but it is still slower than the VBA script. Below is my new code:
import pandas as pd

# Get link for part results from hyperlink list
for link in hyperlink_list:
    # Add part number to database link
    process_url = database_link + link
    df_list = pd.read_html(process_url)
    df = df_list[-1]
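One likely cause of the remaining gap is that each call to requests.get() or pd.read_html(url) opens a new HTTP connection. A requests.Session reuses the underlying connection (HTTP keep-alive), much like keeping Internet Explorer open in the VBA version. A minimal sketch, assuming database_link and hyperlink_list are defined as above:

import pandas as pd
import requests

# One session = one reusable connection (HTTP keep-alive).
with requests.Session() as session:
    for link in hyperlink_list:
        process_url = database_link + link
        html = session.get(process_url).content
        df = pd.read_html(html)[-1]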
I have tens of thousands of URLs whose webpages I want to save to my computer.
I'm trying to open and save these webpages using Chrome automated by pywinauto.
I'm able to open the webpages using the following code:
from pywinauto.application import Application
import pyautogui
chrome_dir = r'C:\Program Files\Google\Chrome\Application\chrome.exe'
start_args = ' --force-renderer-accessibility --start-maximized https://pythonexamples.org/'
app = Application(backend="uia").start(chrome_dir + start_args)
I want to further send a shortcut to the webpage to save it as MHTML. Ctrl+Shift+Y is the shortcut of a Chrome extension (called SingleFile) that saves a webpage as MHTML. Then I want to close the tab with Ctrl+F4 before I open another one and repeat the same process.
The keys are not successfully sent to Chrome.
# Send shortcut (Ctrl+Shift+Y)
pyautogui.press(['ctrl', 'shift', 'y'])
# Close the current tab:
pyautogui.press(['ctrl', 'f4'])
I'm stuck at this step. What's the right way to do this? Thank you!
I tried other alternatives like Selenium, but it was blocked by the remote server.
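A note on the failing shortcut: pyautogui.press() with a list taps each key one after another, so Chrome never sees a combined shortcut, while pyautogui.hotkey() holds the modifiers down while pressing the final key, which is what a keyboard shortcut needs. A minimal sketch of that change, assuming the Chrome window has focus:

import pyautogui

# Send the SingleFile shortcut as a real key combination.
pyautogui.hotkey('ctrl', 'shift', 'y')
# Close the current tab.
pyautogui.hotkey('ctrl', 'f4')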
Why are you using Chrome to get the website data? Generally, driving an external application directly (i.e., emulating a user) is a slow and fragile way to do anything. If your objective is to quickly get and store the data from a website, you should talk directly to the website using something like the requests module, which lets you quickly and easily send an HTTP request and get all of the website data. To get MHTML data, you can try something like this:
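A rough sketch of the idea: fetch the page with requests and wrap the HTML in a minimal MIME multipart/related container, which is the format MHTML is based on. This does not inline images or CSS the way the SingleFile extension does; it only illustrates the direct-HTTP approach (the URL is the example one from the question):

import requests

url = 'https://pythonexamples.org/'
html = requests.get(url).text

# A single-part MHTML file: a multipart/related MIME message whose only
# part is the page's HTML.
mhtml = (
    'MIME-Version: 1.0\r\n'
    'Content-Type: multipart/related; boundary="BOUNDARY"\r\n\r\n'
    '--BOUNDARY\r\n'
    'Content-Type: text/html; charset="utf-8"\r\n'
    'Content-Location: ' + url + '\r\n\r\n'
    + html +
    '\r\n--BOUNDARY--\r\n'
)

with open('page.mhtml', 'w', encoding='utf-8') as f:
    f.write(mhtml)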
I wondered if it is possible to query multiple target URLs at the same time in Python Selenium. My current code, below, just calls one URL:
Target = 'https://www.skadden.com/professionals?skip=0&position=64063c2e-9576-4f66-91fa-166f4bede9b8&hassearched=true'
My body of code currently works and brings back the data I require. Am I able to call multiple URLs at the same time? For example:
Target = ['https://www.skadden.com/professionals?skip=0&position=64063c2e-9576-4f66-91fa-166f4bede9b8&hassearched=true','https://www.skadden.com/professionals?skip=0&position=f1c78f66-a0a6-45e5-8d22-541bd65bb325&hassearched=true']
I have tried this code and it doesn't work. Any ideas?
Thanks
Chris
To do this you need to either start two separate browsers (i.e. request a new session twice) or open two separate tabs, as described here: How to open a new tab using Selenium WebDriver with Java?
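A minimal sketch of the simplest variant, keeping one browser session and looping over the list (assuming driver is an already-started WebDriver and Target is the list from the question):

for url in Target:
    driver.get(url)
    # ... scrape the page here, exactly as in the single-URL code ...

Or, to have the pages open at once in separate tabs:

driver.get(Target[0])
for url in Target[1:]:
    driver.execute_script("window.open(arguments[0]);", url)

for handle in driver.window_handles:
    driver.switch_to.window(handle)
    # ... scrape the current tab here ...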
Is there a way, using urllib2 or something else, to check the time a file was uploaded to a URL, or even the time the file on the server side was last modified?
At the moment I'm manually using urllib2.urlopen() to read data from a URL. The arguments in the address change each day. What I'd like to do is figure out when each file first becomes available, so that I can pick the best time for the job to run automatically overnight.
The time is stored on the server and is usually sent to your browser in the HTTP headers. You can access it in JavaScript through the document.lastModified property. Here's a solution in Python that reads the headers, parses the information with a regular expression, and returns the result.
import re
import urllib2

def get_upload_datetime(myurl):
    info = urllib2.urlopen(myurl).info()
    # Extract the Last-Modified header, if the server sent one.
    match = re.search("Last-Modified: (.+)", str(info))
    if match:
        return match.groups()[0]
If you are also using the contents of the webpage, call urlopen() once and use .info() and .read() on the same object to avoid multiple fetches.
And if you want to do it manually, open the webpage in the browser, open the console (Ctrl+Shift+J) and type javascript:alert(document.lastModified). It should present an alert box with the last-modified time.
I want to write a Python script that can keep track of which webpages have been opened in my web browser (Mozilla Firefox 23). I don't know where to start. Python's standard webbrowser module allows webpages to be opened, but the standard documentation doesn't say anything about interacting with the webpage.
So do I need to write a plugin for my browser that can send the data to my Python script, or am I missing functionality in the standard library?
I have looked at some related questions like this one, but they are all about simulating a web browser in Python using mechanize and/or Selenium. I don't want to do that. I want to get data from my web browser using standard Python libraries.
EDIT
Just to add some more clarity to the question: I want to keep track of the webpages currently open in Firefox.
This answer may be a bit fuzzy -- that is because the question is not extremely specific.
If I understand it well, you want to examine the history of the visited pages. The problem is that this is not directly related to HTML, nor to the HTTP protocol, nor to web services. The history (which you can see in Firefox by pressing Ctrl+H) is a tool implemented in Firefox, and as such it is definitely implementation-dependent. There can be no standard library capable of extracting the information.
As for the HTTP protocol and the HTML content of the pages, there is no such thing as interacting with the content of the pages. The protocol uses GET with a URL as the argument, and the web server sends back the text body with some meta information. The caller (the browser) can do anything with the returned data. The browser uses the tagged text and renders it as a readable document, as nicely as possible. The interaction (clicking on an href) is implemented by the browser; it issues further GET commands of the HTTP protocol.
To answer your question, you need to find out how Mozilla Firefox 23 stores its history. It is likely that you can find it somewhere in the internal SQLite databases.
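For illustration, in Firefox of that era the history lives in places.sqlite inside the profile folder. A sketch, assuming the file has been copied into the working directory (Firefox locks it while running):

import sqlite3

# The moz_places table holds one row per known URL.
con = sqlite3.connect('places.sqlite')
query = ('SELECT url, title FROM moz_places '
         'ORDER BY last_visit_date DESC LIMIT 20')
for url, title in con.execute(query):
    print(url, title)
con.close()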
Update 2015-08-24: See erasmortg's comment about the changes in where Firefox places the information. (The text below is older than this note.)
Update: The list of open tabs is bound to the user. As you probably want it for Windows, you should first get a path like c:\Users\myname.mydomain\AppData\Roaming\Mozilla\Firefox\Profiles\yoodw5zk.default-1375107931124\sessionstore.js. The profile name should probably be extracted from c:\Users\myname.mydomain\AppData\Roaming\Mozilla\Firefox\profiles.ini. I just copied the sessionstore.js to try to get the data out. As the extension suggests JavaScript, I used the standard json module to parse it. You basically get a dictionary. The item with the key 'windows' contains another dictionary, and its 'tabs' entry in turn contains information about the tabs.
Copy your sessionstore.js to a working directory and execute the following script there:
#!python3
import json

with open('sessionstore.js', encoding='utf-8') as f:
    content = json.load(f)

# The loaded content is a dictionary. List the keys first (console).
for k in content:
    print(k)

# Now list the content bound to the keys. As the console may not be capable
# of displaying all characters, write it to a file.
with open('out.txt', 'w', encoding='utf-8') as f:
    # Write the overview of the content.
    for k, v in content.items():
        # Write the key and the type of the value.
        f.write('\n\n{}: {}\n'.format(k, type(v)))
        # The value could be of a list type, or just one item.
        if isinstance(v, list):
            for e in v:
                f.write('\t{}\n'.format(e))
        else:
            f.write('\t{}\n'.format(v))

    # Write the content of the tabs in each window.
    f.write('\n\n=======================================================\n\n')
    windows = content['windows']
    for n, w in enumerate(windows, 1):  # enumerate is used just for numbering the windows
        f.write('\n\tWindow {}:\n'.format(n))
        tabs = w['tabs']
        for tab in tabs:
            # The tab is a dictionary. Display only 'title' and 'url' from
            # the 'entries' subdictionary.
            e = tab['entries'][0]
            f.write('\t\t{}\n\t\t{}\n\n'.format(e['url'], e['title']))
The result is both displayed on the console (a few lines) and written to the out.txt file in the working directory. In my case, the end of out.txt contains something like this:
Window 1:
http://www.cyrilmottier.com/
Cyril Mottier
http://developer.android.com/guide/components/fragments.html#CommunicatingWithActivity
Fragments | Android Developers
http://developer.android.com/guide/components/index.html
App Components | Android Developers
http://www.youtube.com/watch?v=ONaD1mB8r-A
▶ Introducing RoboSpice: A Robust Asynchronous Networking Library for Android - YouTube
http://www.youtube.com/watch?v=5a91dBLX8Qc
Rocking the Gradle with Hans Dockter - YouTube
http://stackoverflow.com/questions/18439564/how-to-keep-track-of-webpages-opened-in-web-browser-using-python
How to keep track of webpages opened in web-browser using Python? - Stack Overflow
https://www.google.cz/search?q=Mozilla+firefox+list+of+open+tabs&ie=utf-8&oe=utf-8&rls=org.mozilla:cs:official&client=firefox-a&gws_rd=cr
Mozilla firefox list of open tabs - Hledat Googlem
https://addons.mozilla.org/en-US/developers/docs/sdk/latest/dev-guide/tutorials/list-open-tabs.html
List Open Tabs - Add-on SDK Documentation
https://support.mozilla.org/cs/questions/926077
list all tabs button not showing | Fórum podpory Firefoxu | Podpora Mozilly
https://support.mozilla.org/cs/kb/scroll-through-your-tabs-quickly
Scroll through your tabs quickly | Nápověda k Firefox
You want to keep track of the web pages opened in Firefox through Python. So why don't you write a web proxy in Python and configure Firefox to use that proxy?
After that, you can filter all the HTTP requests emitted by Firefox with regular expressions and store them in a file or database.
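A very rough sketch of such a logging proxy, just to illustrate the idea (plain HTTP only, no HTTPS/CONNECT handling, and a naive relay; configure Firefox to use localhost:8080 as its HTTP proxy):

import socket
import threading

def handle(client):
    request = client.recv(65536)
    if not request:
        client.close()
        return
    # The request line looks like: GET http://example.com/ HTTP/1.1
    first_line = request.split(b'\r\n', 1)[0].decode('latin-1')
    with open('visited.log', 'a') as log:
        log.write(first_line + '\n')
    # Extract host[:port] from the absolute URL and forward the request.
    url = first_line.split(' ')[1]
    host, _, port = url.split('/')[2].partition(':')
    upstream = socket.create_connection((host, int(port or 80)))
    upstream.sendall(request)
    # Naive relay: stream the response back until the server closes.
    while True:
        chunk = upstream.recv(65536)
        if not chunk:
            break
        client.sendall(chunk)
    upstream.close()
    client.close()

server = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
server.setsockopt(socket.SOL_SOCKET, socket.SO_REUSEADDR, 1)
server.bind(('127.0.0.1', 8080))
server.listen(5)
while True:
    conn, _ = server.accept()
    threading.Thread(target=handle, args=(conn,), daemon=True).start()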