Crawling all Wikipedia pages for phrases in Python

I need to design a program that finds certain four- or five-word phrases across the entire Wikipedia collection of articles (yes, I know it's a lot of pages, and I don't need answers calling me an idiot for doing this).
I haven't programmed much stuff like this before, so there are two issues that I would greatly appreciate some help with:
First, how would I get the program to crawl through all of the pages (i.e. NOT hardcoding each of the millions of pages)? I have downloaded all the articles onto my hard drive, but I'm not sure how I can tell the program to iterate through each one in the folder.
EDIT - I have all the wikipedia articles on my hard drive
Second, the snapshots of the pages have pictures and tables in them. How would I extract solely the main text of the article?
Your help on either of the issues is greatly appreciated!

Instead of crawling pages manually, which is slower and can get you blocked, you should download the official data dump. The dumps don't contain images, so that solves the second problem as well.
EDIT: I see that you already have all the articles on your computer, so this answer might not help much.

The snapshots of the pages have pictures and tables in them. How would I extract solely the main text of the article?
If you are okay with finding the phrases within the tables, you could try using regular expressions directly, but the better choice would be to use a parser and remove all the markup. You could use Beautiful Soup to do this (you will need lxml too):
from bs4 import BeautifulSoup

# read one saved article from disk (example file name)
with open('article.xml', encoding='utf-8') as f:
    markup_from_file = f.read()

# stripped_strings is a generator that yields the text of each tag in turn
gen = BeautifulSoup(markup_from_file, 'xml').stripped_strings
list_of_strings = [x for x in gen]  # collect the text fragments into a list
text = ' '.join(list_of_strings)
BeautifulSoup produces unicode text, so if you need to change the encoding, you can just do:
list_of_strings = map(lambda x: x.encode('utf-8'), list_of_strings)
Plus, Beautiful Soup can help you to better navigate and select from each document. If you know the encoding of the data dump, that will definitely help it go faster. The author also says that it runs faster on Python 3.
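Once the markup is stripped, finding the four- or five-word phrases is a plain substring (or regex) check over the joined text. A minimal sketch, with the phrase list invented purely for illustration:

# Example phrases; replace with the real four- or five-word phrases.
phrases = ['four score and seven years', 'to be or not to be']

haystack = text.lower()  # `text` is the joined string from the snippet above
for phrase in phrases:
    if phrase in haystack:
        print('found:', phrase)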

Bullet point 1: Python has a module just for the task of recursively iterating over every file and directory under a path: os.walk.
Point 2: what you seem to be asking here is how to distinguish files that are images from files that are text. The magic module, available on PyPI (the "cheese shop") as python-magic, provides Python bindings for the standard Unix utility of the same name (usually invoked as file(1)). A sketch combining both points is below.
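Here is a minimal sketch combining both points, assuming the python-magic bindings are installed; the directory path is made up for illustration. Only files whose detected MIME type is text get past the filter:

import os

import magic  # python-magic bindings, from PyPI

detector = magic.Magic(mime=True)

for root, dirs, files in os.walk('/path/to/wikipedia/articles'):  # example path
    for name in files:
        path = os.path.join(root, name)
        mime_type = detector.from_file(path)
        if not mime_type.startswith('text/'):
            continue  # skip images and other binaries
        print(path, mime_type)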

You asked:
I have downloaded all the articles onto my hard drive, but I'm not sure how I can tell the program to iterate through each one in the folder
Assuming all the files are in a directory tree, you could use os.walk (see the Python documentation for an example) to visit every file and then search each file for the phrase(s) using something like:
for line in open("filename"):
    if "search_string" in line:
        print(line)
Of course, this solution won't be featured on the cover of "Python Perf" magazine, but I'm new to Python so I'll pull the n00b card. There is likely a better way to grep within a file using Python's pre-baked modules.
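For what it's worth, a minimal sketch tying os.walk together with the search loop above; the root directory and search phrase are made up for illustration:

import os

root_dir = '/path/to/articles'            # example directory
search_string = 'example search phrase'   # example phrase

for dirpath, dirnames, filenames in os.walk(root_dir):
    for filename in filenames:
        path = os.path.join(dirpath, filename)
        with open(path, encoding='utf-8', errors='replace') as f:
            for line in f:
                if search_string in line:
                    print(path, line.strip())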

Related

Database searches in separate files

I am looking for a kind of database that can search across separate files, e.g. PDF, XLS, and DOC, that I get from different suppliers. My idea is something like this:
For example, I need to search for a part number and check different data about it. The file containing the part number must then be opened with the part number marked. If there are multiple hits, the database should display a list of the various files containing the searched item number. The list entries should act as links that, when selected, open the corresponding file with the item number highlighted.
Does this already exist or how do I approach it?
Today, it's all assembled into a single PDF file of more than 1000 pages, and it's a time-consuming and laborious process to maintain.
I've only used VBA in connection with Excel, so maybe it's too complicated for me. But would it be possible for a programmer without spending 1,000 hours on it?
Please help me :-)
Either Access or Excel could do this. I noticed the Python tag. I'm sure Python could handle this as well, although it seems more like a database solution would be best. It sounds like a one-to-many scenario. See the link below for some ideas of how this technique works.
https://www.tutorialspoint.com/ms_access/ms_access_one_to_many_relationship.htm
Also, below is a link with a whole bunch of MS Access templates. Take a look at that and hopefully that will give you some ideas of how to get started.
https://www.microsoftaccessexpert.com/Microsoft-Access-Templates.aspx
I agree, keeping this in a PDF with 1000 pages is NOT the way to go!!

How to run a search engine script in Python?

I am learning how to do web scraping, crawlers, etc., and I came across this repo. I understand how the code works and what the inputs and outputs should be, but how do I run it in a terminal on Windows? How do I call the respective .txt files and test the search engine?
I saw that someone else asked that and the creator showed them this link here. But it still doesn't explain how to actually apply it to files.
The author of the logicx24 repo has hard-coded the target text files in querytexts.py. See line 122, which reads:
q = Query(['pg135.txt', 'pg76.txt', 'pg5200.txt'])
The list passed to Query contains references to files that exist in the corpus directory. Try changing it to include a different file from their corpus directory. Better yet, add a new target text file of your own and use that, for example:
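For instance, if you copied a text file of your own (say, mybook.txt, a hypothetical name) into the corpus directory, line 122 would become something like:

q = Query(['pg135.txt', 'pg76.txt', 'mybook.txt'])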
Good luck!
Why are you using text files? I don't get it. Either way, you could just use Python itself to do that. Use the selenium library for Python; there's a tutorial for installing it here. Once that's done, just use this code if you're using Google:
from selenium import webdriver
from selenium.webdriver.common.keys import Keys
import time
PATH = "C:\Program Files (x86)\chromedriver.exe"
driver = webdriver.Chrome(PATH)
driver.get("https://www.google.com")
search = driver.find_element_by_css_selector(".gLFyf.gsfi")
time.sleep(5)
search.send_keys("Desired Input Text Goes Here")
search.send_keys(Keys.RETURN)
Don't worry if it takes a while to load; it usually does. If you want to reduce the wait, use a lower number for the parameter on line 8 (time.sleep(5)).
Assuming you've gone ahead and learned a bit more about Selenium, there isn't much else to talk about apart from one thing: line 7 (search = driver.find_element_by_css_selector(".gLFyf.gsfi")). Assuming you've already learned advanced CSS selectors (if you have literally no experience in web development, specifically HTML and CSS, you can just copy-paste the code), .gLFyf.gsfi is simply the CSS selector for the search bar in Google. You can find the selector for the search bar in any engine by looking through the source code with Ctrl + Shift + I on Windows. You can use any other Selenium element selector for this as long as it works. Make sure to also change the URL on line 6 (driver.get("https://www.google.com")) to match your search engine if you're not using Google.
Sorry if this seemed a bit vague or strange. If you don't really care, feel free to download Selenium, copy-paste the code, and move on. Otherwise, I suggest also learning Selenium and HTML/CSS if you haven't already.

Page number python-docx

I am trying to create a program in Python that can find a specific word in a .docx file and return the page number it occurred on. So far, in looking through the python-docx documentation, I have been unable to find how to access the page number or even the footer where the number would be located. Is there a way to do this using python-docx, or even just Python? If not, what would be the best way to do this?
The short answer is no, because the page breaks are inserted by the rendering engine, not determined by the .docx file itself.
However, certain clients place a <w:lastRenderedPageBreak> element in the saved XML to indicate where they broke the page last time it was rendered.
I don't know which clients do this (although I expect Word itself does) or how reliable it is, but that's the direction I would recommend if you want to work in Python. You could potentially use python-docx to get a reference to the lxml element you want (like w:document/w:body) and then use XPath to iterate through to a specific page, but thinking it through a bit, it's going to take some detailed development to get that working.
If you work in the native Windows MS Office API you might be able to get something better since it actually runs the Word application.
If you're generating the documents in python-docx, those elements won't be placed because it makes no attempt to render the document (nor is it ever likely to). We're also not likely to add support for w:lastRenderedPageBreak anytime soon; I'm not even quite sure what that would look like.
If you search on 'lastRenderedPageBreak' and/or 'python-docx page break' you'll see other questions/answers here that may give a little more.
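To make that concrete, here is a rough sketch of the direction described above: it loads a document with python-docx, drops down to the underlying lxml element, and counts the w:lastRenderedPageBreak markers left by whichever client last rendered the file. The file name is hypothetical, and the count is only as trustworthy as those markers.

from docx import Document
from docx.oxml.ns import qn  # maps 'w:...' names to their full namespace form

document = Document('example.docx')  # hypothetical file name
body = document.element.body         # underlying lxml element for <w:body>

# Count the page-break markers saved by the last rendering client (if any).
breaks = body.findall('.//' + qn('w:lastRenderedPageBreak'))
print('approximate page count:', len(breaks) + 1)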
Using Python-docx: identify a page break in paragraph
import re
from docx import Document

fn = '1.docx'  # python-docx reads .docx files (not legacy .doc)
document = Document(fn)
pn = 1
for p in document.paragraphs:
    r = re.match(r'Chapter \d+', p.text)
    if r:
        print(r.group(), pn)
    for run in p.runs:
        if 'w:br' in run._element.xml and 'type="page"' in run._element.xml:
            pn += 1
            print('!!', '=' * 50, pn)

Faster alternative to BeautifulSoup for parsing visible text in an html file

I have a directory with 50,000 html files in it. For each of the files I want to find 'visible' texts (text data that would actually be visible to a person viewing the file in a browser). I've seen some fine solutions using libraries such as BeautifulSoup, but I want something faster.
A regex-based solution I wrote wasn't much faster.
Would I speed things up by using some kind of file stream reader in Python? What other faster alternatives are there?
(I'm happy to lose some accuracy by not using trusted parsers like BeautifulSoup if the solution is faster.)
EDIT:
Fast enough = 5 minutes. "Faster" = anything under roughly 1.3 hours (that's assuming BeautifulSoup takes an average of a tenth of a second to parse each file, which seems optimistic based on my previous work with it).
It sounds like you are just trying to render every HTML file in a directory. Why write your own renderer (in Python or in any other language) when there are plenty of others out there?
Here is an example using w3m (you could equally use Lynx, links, ...):
find . -name '*.html' -exec w3m -dump {} \;
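If you'd rather drive that from Python than from the shell, here is a minimal sketch that shells out to w3m for each file, assuming w3m is on your PATH; the directory name is made up:

import os
import subprocess

html_dir = '/path/to/html/files'  # example directory

for root, dirs, files in os.walk(html_dir):
    for name in files:
        if not name.endswith('.html'):
            continue
        path = os.path.join(root, name)
        # `w3m -dump` renders the page and prints only the visible text
        result = subprocess.run(['w3m', '-dump', path],
                                capture_output=True, text=True)
        visible_text = result.stdout
        # ... search `visible_text` for the phrases here ...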

want to add url links to .csv datafeed using python

I've looked through the current related questions but have not managed to find anything similar to my needs.
I'm in the process of creating an affiliate store using zencart. One of the issues is that zencart is not designed for redirects and affiliate stores, but it can be done. I will be changing the store so it acts like a showcase store showing prices.
There is a mod called Easy Populate which allows me to upload datafeeds. This is all well and good, however my affiliate link will not be in each product. I can do it manually after uploading the datafeed by going to each product and adding it as an image with a redirect link. However, with over 500 items that's going to be a long, repetitive, and time-consuming job.
I have been told that I can add the links to the datafeed before uploading it to zencart, and that this should be done using Python. I've been reading about Python for several days now and feel I'm looking for the wrong things. I was wondering if someone could please advise the simplest way for me to get this done.
I hope the question makes sense
thanks
abs
You could craft a Python script using the csv module, like this:
>>> import csv
>>> cartWriter = csv.writer(open('yourcart.csv', 'w', newline=''))
>>> cartWriter.writerow(['Product', 'yourinfo', 'yourlink'])
You need to know how the link should be formatted, in the hope that it can be composed from the other fields already present in the CSV file.
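As a rough sketch of what that could look like, assuming the feed has a product_id column and that the affiliate link can be built from it (both the column name and the URL pattern below are invented for illustration):

import csv

AFFILIATE_BASE = 'https://affiliate.example.com/redirect?sku='  # hypothetical URL pattern

with open('supplier_feed.csv', newline='') as src, \
     open('feed_with_links.csv', 'w', newline='') as dst:
    reader = csv.DictReader(src)
    writer = csv.DictWriter(dst, fieldnames=reader.fieldnames + ['affiliate_link'])
    writer.writeheader()
    for row in reader:
        # Compose the link from a field that already exists in the feed.
        row['affiliate_link'] = AFFILIATE_BASE + row['product_id']
        writer.writerow(row)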
First, use the csv module as systempuntoout told you. Secondly, you will want to set your response headers to:
mimetype='text/csv'
Content-Disposition = 'attachment; filename=name_of_your_file.csv'
The way to do it depends very much on your website implementation. In Django you would do that with an HttpResponse object, and there are some shortcuts for it as well.
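In Django, for instance, that could look roughly like the view below (the view name and the CSV columns are placeholders):

import csv

from django.http import HttpResponse

def download_feed(request):
    # Serve the datafeed as a CSV download with the headers described above.
    response = HttpResponse(content_type='text/csv')
    response['Content-Disposition'] = 'attachment; filename=name_of_your_file.csv'
    writer = csv.writer(response)  # HttpResponse is file-like, so csv can write to it
    writer.writerow(['Product', 'yourinfo', 'yourlink'])
    return response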
You can find a video demonstrating how to create CSV files with Python on showmedo. It's not free however.
Now, to provide a link to download the CSV: this depends on your website. What is the technology behind it: pure Python, Django, Pylons, TurboGears?
If you can't answer that question, you should ask your boss for training on your infrastructure before trying to make changes to it.
