How to run a search engine script in Python?

I am learning how to do web scraping, crawlers, etc., and I came across this repo. I understand how the code works and what the inputs and outputs should be, but how do I run it in a terminal on Windows? How do I call the respective .txt files and test the search engine?
I saw that someone else asked that and the creator showed them this link here. But it still doesn't explain how to actually apply it to files.

The author, logicx24, has hard-coded the target text files in querytexts.py. See line 122, which reads:
q = Query(['pg135.txt', 'pg76.txt', 'pg5200.txt'])
The items in the list passed to Query are all references to files that exist in the corpus directory. Try changing the list to include a different file from the corpus directory. Better yet, add a new target text file of your own and use that, as in the sketch below.
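For instance, a minimal tweak (my_book.txt is a hypothetical file you would add to the corpus directory first); after that, run the script from the repo root with something like python querytexts.py, assuming that module is the entry point:
# querytexts.py, line 122 -- point Query at your own corpus files.
# 'my_book.txt' is hypothetical; add it to the corpus directory first.
q = Query(['pg135.txt', 'pg76.txt', 'my_book.txt'])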
Good luck!

Why are you using text files? I don't get it. Either way, you could just use Python itself to do that. Use the selenium library for Python. There's a tutorial for installing it here. Once that's done, just use this code if you're using Google:
from selenium import webdriver
from selenium.webdriver.common.keys import Keys
import time
PATH = "C:\Program Files (x86)\chromedriver.exe"
driver = webdriver.Chrome(PATH)
driver.get("https://www.google.com")
search = driver.find_element_by_css_selector(".gLFyf.gsfi")
time.sleep(5)
search.send_keys("Desired Input Text Goes Here")
search.send_keys(Keys.RETURN)
Don't worry if it takes a while to load; it usually does. If you want to reduce the wait, use a lower number in the call on line 8 (time.sleep(5)).
Assuming you've gone ahead and learned a bit more about Selenium, there isn't much else to talk about apart from one thing: line 7 (search = driver.find_element_by_css_selector(".gLFyf.gsfi")). Assuming you've learned advanced CSS selectors already (if you have literally no experience in web development, specifically HTML and CSS, you can just copy-paste the code), .gLFyf.gsfi is simply the CSS selector for the search bar on Google. You can find the selector for the search bar in any engine by looking through the page source using Ctrl + Shift + I on Windows. You can use any other Selenium element selector for this as long as it works. Make sure to also change the URL on line 6 (driver.get("https://www.google.com")) to match that of your search engine if you're not using Google.
Sorry if this seemed a bit vague or strange. If you don't really care, feel free to download Selenium, copy-paste the code, and move on. Otherwise, I suggest also learning Selenium and HTML/CSS if you haven't already.

Related

Internet Shortcut in python

I have a problem. Let's say I have a website (e.g. www.google.com). Is there any way in Python to create a file with a .url extension linking to this website? (I am currently looking for a flat, and I am trying to save shortcuts on my hard drive only to apartment offers posted online that match my expectations.) I've tried to use the os and requests modules to create such files, but with no success. I would really appreciate the help. (I am using Python 3.9.6 on Windows 10.)
This is pretty straightforward. I had no idea what .URL files were before seeing this post, so I decided to drag its URL to my desktop. It created a file with the following contents which I viewed in Notepad:
[InternetShortcut]
URL=https://stackoverflow.com/questions/68304057/internet-shortcut-in-python
So, you just need to write out the same thing via Python, except replace the URL with the one you want:
test_url = r'https://www.google.com/'
with open('Google.url', 'w') as f:
    f.write(f"""[InternetShortcut]
URL={test_url}
""")
With regards to your current attempts:
I've tried to use os and requests module to create such file
It's not clear what you're using requests or os for, since you didn't provide a Minimal Reproducible Example of what you'd tried so far. If there's a more complex element to this that you didn't specify, such as automatically generating the file while you're in your browser, then you need to update your question to include all of your requirements.

Downloading file names with commas in them using Selenium?

So I'm doing a very simple click on a link to download a file in Selenium. It looks something like this:
driver.find_element_by_xpath('element_xpath{0}'.format(i)).click()
Which works just fine. My problem is that sometimes Chrome throws an ERR_RESPONSE_HEADERS_MULTIPLE_CONTENT_DISPOSITION error.
I googled that and found this question: basically, Chrome throws that error when there is a comma in the file name, and I have verified that this is exactly what is happening in my case as well. Now I realize I might be able to fix this with the requests library, using the same suggestions as the ones in that question, namely wrapping the file name in quotes or replacing the comma with another character (a rough sketch of that fallback is below).
But my question is, is there any way to handle this issue in Selenium? Chrome throws the same error when I manually try to download the file; IE works fine. Switching the Selenium driver to IE is something I would like to avoid because it creates a whole lot of other problems.
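For reference, the requests-based fallback I have in mind looks roughly like this (the URL is a placeholder, and the header parsing is an assumption about what the server sends):
import requests

url = 'https://example.com/download/42'  # placeholder for the real link target
resp = requests.get(url)

# Take the name from the Content-Disposition header, then replace the
# commas that trip Chrome up (both steps assume a well-behaved server).
disposition = resp.headers.get('Content-Disposition', 'attachment; filename=file.bin')
filename = disposition.split('filename=')[-1].strip('"').replace(',', '_')

with open(filename, 'wb') as f:
    f.write(resp.content)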
Any help is appreciated. Thanks.

Page number python-docx

I am trying to create a program in Python that can find a specific word in a .docx file and return the page number that it occurred on. So far, in looking through the python-docx documentation, I have been unable to find how to access the page number or even the footer where the number would be located. Is there a way to do this using python-docx, or even just Python? If not, what would be the best way to do this?
Short answer is no, because the page breaks are inserted by the rendering engine, not determined by the .docx file itself.
However, certain clients place a <w:lastRenderedPageBreak> element in the saved XML to indicate where they broke the page last time it was rendered.
I don't know which clients do this (although I expect Word itself does) or how reliable it is, but that's the direction I would recommend if you want to work in Python. You could potentially use python-docx to get a reference to the lxml element you want (like w:document/w:body) and then use XPath commands or something to iterate through to a specific page, but just thinking it through a bit, it's going to take some detailed development to get that working.
If you work in the native Windows MS Office API you might be able to get something better since it actually runs the Word application.
If you're generating the documents in python-docx, those elements won't be placed because it makes no attempt to render the document (nor is it ever likely to). We're also not likely to add support for w:lastRenderedPageBreak anytime soon; I'm not even quite sure what that would look like.
If you search on 'lastRenderedPageBreak' and/or 'python-docx page break' you'll see other questions/answers here that may give a little more.
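To make that direction a bit more concrete, here is a minimal sketch (the file name is a placeholder, and it assumes the document was last saved by a client that writes w:lastRenderedPageBreak elements):
from docx import Document
from docx.oxml.ns import qn

doc = Document('example.docx')  # hypothetical file
body = doc.element.body  # lxml element for w:document/w:body

# Each w:lastRenderedPageBreak marks where the last renderer broke a page.
breaks = body.findall('.//' + qn('w:lastRenderedPageBreak'))
print('approximate page count:', len(breaks) + 1)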
Using Python-docx: identify a page break in paragraph
from docx import Document
import re

fn = '1.docx'  # python-docx reads .docx files (not legacy .doc)
document = Document(fn)
pn = 1
for p in document.paragraphs:
    r = re.match(r'Chapter \d+', p.text)
    if r:
        print(r.group(), pn)
    for run in p.runs:
        # a manual page break shows up as a <w:br w:type="page"/> in the run's XML
        if 'w:br' in run._element.xml and 'type="page"' in run._element.xml:
            pn += 1
            print('!!', '=' * 50, pn)

Interact with other programs using Python

I have the idea of writing a program in Python that finds the lyrics of a song whose name I provide. I think the whole process should boil down to the couple of things below. This is what I want the program to do when I run it:
prompt me to enter a name of a song
copy that name
open a web browser (google chrome for example)
paste that name in the address bar and find information about the song
open a page that contains the lyrics
copy that lyrics
run a text editor (like Microsoft Word for instance)
paste the lyrics
save the new text file with the name of the song
I am not asking for code, of course. I just want to know the concepts or ideas behind using Python to interact with other programs.
To be more specific, I think I want to know, for example, just how we point out where the address bar is in Google Chrome and tell Python to paste the name there. Or how we tell Python to copy the lyrics and paste them into the Microsoft Word document, then save it.
I've been reading (I'm still reading) several books on Python: Byte of Python, Learn Python the Hard Way, Python for Dummies, Beginning Game Development with Python and Pygame. However, I found out that it seems like I only (or almost only) learn to create programs that work on their own (I can't tell my program to do things I want with other programs that are already installed on my computer).
I know that my question somehow sounds rather silly, but I really want to know how it works, the way we tell Python to recognize that this part of the Google Chrome browser is the address bar and that it should paste the name of the song into it. The whole idea of making Python interact with another program is really, really vague to me and I just extremely want to grasp that.
Thank you everyone, whoever spends their time reading my so-long question.
ttriet204
If what you're really looking into is a good excuse to teach yourself how to interact with other apps, this may not be the best one. Web browsers are messy, the timing is going to be unpredictable, etc. So, you've taken on a very hard task—and one that would be very easy if you did it the usual way (talk to the server directly, create the text file directly, etc., all without touching any other programs).
But if you do want to interact with other apps, there are a variety of different approaches, and which is appropriate depends on the kinds of apps you need to deal with.
Some apps are designed to be automated from the outside. On Windows, this nearly always means they have a COM interface, usually with an IDispatch interface, for which you can use pywin32's COM wrappers; on Mac, it means an AppleEvent interface, for which you use ScriptingBridge or appscript; on other platforms there is no universal standard. IE (but probably not Chrome) and Word both have such interfaces.
Some apps have a non-GUI interface—whether that's a command line you can drive with popen, or a DLL/SO/DYLIB you can load up through ctypes. Or, ideally, someone else has already written Python bindings for you.
Some apps have nothing but the GUI, and there's no way around doing GUI automation. You can do this at a low level, by crafting WM_ messages to send via pywin32 on Windows, using the accessibility APIs on Mac, etc., or at a somewhat higher level with libraries like pywinauto, or possibly at the very high level of selenium or similar tools built to automate specific apps.
So, you could do this with anything from selenium for Chrome and COM automation for Word, to crafting all the WM_ messages yourself. If this is meant to be a learning exercise, the question is which of those things you want to learn today.
Let's start with COM automation. Using pywin32, you directly access the application's own scripting interfaces, without having to take control of the GUI from the user, figure out how to navigate menus and dialog boxes, etc. This is the modern version of writing "Word macros"—the macros can be external scripts instead of inside Word, and they don't have to be written in VB, but they look pretty similar. The last part of your script would look something like this:
import win32com.client

word = win32com.client.Dispatch('Word.Application')  # attach to (or launch) Word
word.Visible = True
doc = word.Documents.Add()               # create a new blank document
word.Selection.TypeText(my_string)       # Selection lives on the Application object
doc.SaveAs(r'C:\TestFiles\TestDoc.doc')
If you look at Microsoft Word Scripts, you can see a bunch of examples. However, you may notice they're written in VBScript. And if you look around for tutorials, they're all written for VBScript (or older VB). And the documentation for most apps is written for VBScript (or VB, .NET, or even low-level COM). And all of the tutorials I know of for using COM automation from Python, like Quick Start to Client Side COM and Python, are written for people who already know about COM automation and just want to know how to do it from Python. The fact that Microsoft keeps changing the name of everything makes it even harder to search for—how would you guess that googling for OLE automation, ActiveX scripting, Windows Script Host, etc. would have anything to do with learning about COM automation? So, I'm not sure what to recommend for getting started. I can promise that it's all as simple as it looks from the example above once you do learn all the nonsense, but I don't know how to get past that initial hurdle.
Anyway, not every application is automatable. And sometimes, even if it is, describing the GUI actions (what a user would click on the screen) is simpler than thinking in terms of the app's object model. "Select the third paragraph" is hard to describe in GUI terms, but "select the whole document" is easy—just hit control-A, or go to the Edit menu and Select All. GUI automation is much harder than COM automation, because you either have to send the app the same messages that Windows itself sends to represent your user actions (e.g., see "Menu Notifications") or, worse, craft mouse messages like "go (32, 4) pixels from the top-left corner, click, mouse down 16 pixels, click again" to say "open the File menu, then click New".
Fortunately, there are tools like pywinauto that wrap both kinds of GUI automation up to make it a lot simpler. And there are tools like swapy that can help you figure out what commands you want to send. If you're not wedded to Python, there are also tools like AutoIt and Actions that are even easier than using swapy and pywinauto, at least when you're getting started. Going this way, the last part of your script might look like:
word.Activate()
word.MenuSelect('File->New')
word.KeyStrokes(my_string)
word.MenuSelect('File->Save As')
word.Dialogs[-1].FindTextField('Filename').Select()
word.KeyStrokes(r'C:\TestFiles\TestDoc.doc')
word.Dialogs[-1].FindButton('OK').Click()
Finally, even with all of these tools, web browsers are very hard to automate, because each web page has its own menus, buttons, etc. that aren't Windows controls, but HTML. Unless you want to go all the way down to the level of "move the mouse 12 pixels", it's very hard to deal with these. That's where selenium comes in—it scripts web GUIs the same way that pywinauto scripts Windows GUIs.
The following script uses Automa to do exactly what you want (tested on Word 2010):
def find_lyrics():
    print 'Please minimize all other open windows, then enter the song:'
    song = raw_input()
    start("Google Chrome")
    # Disable Google's autocompletion and set the language to English:
    google_address = 'google.com/webhp?complete=0&hl=en'
    write(google_address, into="Address")
    press(ENTER)
    write(song + ' lyrics filetype:txt')
    click("I'm Feeling Lucky")
    press(CTRL + 'a', CTRL + 'c')
    press(ALT + F4)
    start("Microsoft Word")
    press(CTRL + 'v')
    press(CTRL + 's')
    click("Desktop")
    write(song + ' lyrics', into="File name")
    click("Save")
    press(ALT + F4)
    print("\nThe lyrics have been saved in file '%s lyrics' "
          "on your desktop." % song)
To try it out for yourself, download Automa.zip from its Download page and unzip into, say, c:\Program Files. You'll get a folder called Automa 1.1.2. Run Automa.exe in that folder. Copy the code above and paste it into Automa by right-clicking into the console window. Press Enter twice to get rid of the last ... in the window and arrive back at the prompt >>>. Close all other open windows and type
>>> find_lyrics()
This performs the required steps.
Automa is a Python library: To use it as such, you have to add the line
from automa.api import *
to the top of your scripts, and add the file library.zip from Automa's installation directory to your PYTHONPATH environment variable.
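If you'd rather avoid environment variables, appending that zip to sys.path at the top of the script works too (the path assumes the c:\Program Files unzip location suggested above):
import sys
sys.path.append(r'C:\Program Files\Automa 1.1.2\library.zip')  # Python can import from zips
from automa.api import *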
If you have any other questions, just let me know :-)
Here's an implementation in Python of @Matteo Italia's comment:
You are approaching the problem from a "user perspective" when you
should approach it from a "programmer perspective"; you don't need to
open a browser, copy the text, open Word or whatever, you need to
perform the appropriate HTTP requests, parse the relevant HTML,
extract the text and write it to a file from inside your Python
script. All the tools to do this are available in Python (in
particular you'll need urllib2 and BeautifulSoup).
#!/usr/bin/env python
import codecs
import json
import sys
import urllib
import urllib2

import bs4  # pip install beautifulsoup4


def extract_lyrics(page):
    """Extract lyrics text from given lyrics.wikia.com html page."""
    soup = bs4.BeautifulSoup(page)
    result = []
    for tag in soup.find('div', 'lyricbox'):
        if isinstance(tag, bs4.NavigableString):
            if not isinstance(tag, bs4.element.Comment):
                result.append(tag)
        elif tag.name == 'br':
            result.append('\n')
    return "".join(result)

# get artist, song to search
artist = raw_input("Enter artist:")
song = raw_input("Enter song:")

# make request
query = urllib.urlencode(dict(artist=artist, song=song, fmt="realjson"))
response = urllib2.urlopen("http://lyrics.wikia.com/api.php?" + query)
data = json.load(response)

if data['lyrics'] != 'Not found':
    # print short lyrics
    print(data['lyrics'])
    # get full lyrics
    lyrics = extract_lyrics(urllib2.urlopen(data['url']))
    # save to file
    filename = "[%s] [%s] lyrics.txt" % (data['artist'], data['song'])
    with codecs.open(filename, 'w', encoding='utf-8') as output_file:
        output_file.write(lyrics)
    print("written '%s'" % filename)
else:
    sys.exit('not found')
Example
$ printf "Queen\nWe are the Champions" | python get-lyrics.py
Output
I've paid my dues
Time after time
I've done my sentence
But committed no crime
And bad mistakes
I've made a few
I've had my share of sand kicked [...]
written '[Queen] [We are the Champions] lyrics.txt'
If you really want to open a browser, etc., look at selenium. But that's overkill for your purposes: Selenium is used to simulate button clicks and the like for testing how websites appear in various browsers. Mechanize is less of an overkill for this.
What you really want to do is understand how a browser (or any other program) works under the hood i.e. when you click on the mouse or type on the keyboard or hit Save, what does the program do behind the scenes? It is this behind-the-scenes work that you want your python code to do.
So, use urllib, urllib2 or requests (or heck, even scrapy) to request a web page (learn how to put together the URL for a Google search, or the GET request to a lyrics website). Google also has a search API that you can take advantage of to perform a Google search.
Once you have the result of your page request, parse it with xml, BeautifulSoup, lxml, etc., and find the section of the result that has the information you're after.
Now that you have your lyrics, the simplest thing to do is open a text file, dump the lyrics in there, and write it to disk. But if you really want to do it with MS Word, then open a .doc file in Notepad or Notepad++ and look at its structure. Now, use Python to build a document with a similar structure, wherein the content will be the downloaded lyrics (see the sketch below).
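An easier route than hand-building the file structure is the python-docx package (my suggestion, not part of the workflow above); a minimal sketch, assuming song_title and lyrics already hold your strings:
from docx import Document

doc = Document()
doc.add_heading(song_title, level=1)  # song_title and lyrics are hypothetical variables
doc.add_paragraph(lyrics)
doc.save(song_title + '.docx')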
If this method fails, you could look into pywinauto or such to automate the pasting of text into an MS Word doc and clicking on Save
Citation: Matteo Italia, g.d.d.c from the comments on the OP
You should look into a package called selenium for interacting with web browsers.

Crawling all wikipedia pages for phrases in python

I need to design a program that finds certain four- or five-word phrases across the entire Wikipedia collection of articles (yes, I know it's a lot of pages, and I don't need answers calling me an idiot for doing this).
I haven't programmed much stuff like this before, so there are two issues that I would greatly appreciate some help with:
First, how would I get the program to crawl through all of the pages (i.e. NOT hardcode each one of the millions of pages)? I have downloaded all the articles onto my hard drive, but I'm not sure how I can tell the program to iterate through each one in the folder.
EDIT - I have all the wikipedia articles on my hard drive
Second, the snapshots of the pages have pictures and tables in them. How would I extract solely the main text of each article?
Your help on either of the issues is greatly appreciated!
Instead of crawling pages manually, which is slow and can get you blocked, you should download the official data dump. These don't contain images, so the second problem is also solved.
EDIT: I see that you have all the articles on your computer, so this answer might not help much.
The snapshots of the pages have pictures and tables in them. How would
I extract solely the main text of the article?
If you are okay with finding the phrases within the tables, you could try using regular expressions directly, but the better choice would be to use a parser and remove all the markup. You could use Beautiful Soup to do this (you will need lxml too):
from bs4 import BeautifulSoup

with open('article.xml') as f:  # one saved page from the dump
    markup_from_file = f.read()

# stripped_strings is a generator that yields the text of each tag in turn
gen = BeautifulSoup(markup_from_file, 'xml').stripped_strings
list_of_strings = [x for x in gen]  # materialize the generator into a list
text = ' '.join(list_of_strings)
BeautifulSoup produces unicode text, so if you need to change the encoding, you can just do:
list_of_strings = map(lambda x: x.encode('utf-8'),list_of_strings)
Plus, Beautiful Soup can help you to better navigate and select from each document. If you know the encoding of the data dump, that will definitely help it go faster. The author also says that it runs faster on Python 3.
Bullet point 1: Python has a module just for the task of recursively iterating over every file or directory at a path: os.walk.
Point 2: What you seem to be asking here is how to distinguish files that are images from files that are text. The magic module, available from the cheese shop, provides Python bindings for the standard Unix utility of the same name (usually invoked as file(1)).
You asked:
I have downloaded all the articles onto my hard drive, but I'm not
sure how I can tell the program to iterate through each one in the
folder
Assuming all the files are in a directory tree structure, you could use os.walk (link to Python documentation and example) to visit every file and then search each file for the phrase(s) using something like:
for line in open("filename"):
    if "search_string" in line:
        print line
Of course, this solution won't be featured on the cover of "Python Perf" magazine, but I'm new to Python so I'll pull the n00b card. There is likely a better way to grep within a file using Python's pre-baked modules.
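For what it's worth, a minimal sketch tying os.walk to that search (the root folder and phrase are placeholders):
import os

root = r'C:\wikipedia_dump'           # hypothetical folder holding the downloaded articles
phrase = 'some four word phrase'      # placeholder search string

for dirpath, dirnames, filenames in os.walk(root):
    for name in filenames:
        path = os.path.join(dirpath, name)
        with open(path) as f:
            for line in f:
                if phrase in line:
                    print path, ':', line.strip()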
