Find text in gtk.TextView


I have a gtk.TextView and I want to find and select some of its text programmatically.
I have this code, but it's not working correctly:
search_str = self.text_to_find.get_text()
start_iter = textbuffer.get_start_iter()
match_start = textbuffer.get_start_iter()
match_end = textbuffer.get_end_iter()
found = start_iter.forward_search(search_str,0, None)
if found:
    textbuffer.select_range(match_start,match_end)
If the text is found, then it selects all the text in the TextView, but I need it to select only the found text.

start_iter.forward_search returns a tuple holding the start and end of the match, so your found variable already contains both match_start and match_end.
This should make it work:
search_str = self.text_to_find.get_text()
start_iter = textbuffer.get_start_iter()
# don't need these lines anymore
#match_start = textbuffer.get_start_iter()
#match_end = textbuffer.get_end_iter()
found = start_iter.forward_search(search_str,0, None)
if found:
    match_start,match_end = found #add this line to get match_start and match_end
    textbuffer.select_range(match_start,match_end)

Related

Extract Text from a word document

I am trying to scrape data from a Word document available at:
https://dl.dropbox.com/s/pj82qrctzkw9137/HE%20Distributors.docx
I need to scrape the Name, Address, City, State, and Email ID. I am able to scrape the E-mail using the below code.
import docx
content = docx.Document('HE Distributors.docx')
location = []
for i in range(len(content.paragraphs)):
    stat = content.paragraphs[i].text
    if 'Email' in stat:
        location.append(i)
for i in location:
    print(content.paragraphs[i].text)
I tried the steps described in "How to read data from .docx file in python pandas?", but I am still facing issues.
I need to convert this into a data frame with all the columns mentioned above.

There are some inconsistencies in the document: phone numbers start with Tel: in some places, Tel.: in others, and even Te: once; one of the emails sits on the last line for that distributor without the Email: prefix; and the State isn't always on the last line. Still, most of the data can be extracted with regex and/or splits.
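As a rough illustration of the prefix inconsistency, one pattern can cover all three spellings. This is only a sketch: the sample lines are made up, and the helper name strip_tel_prefix and the exact pattern are assumptions, not taken from the document or the function further down.

```python
import re

# Hypothetical sample lines; the real document's text is not reproduced here.
lines = [
    "Tel: 011-1234567",
    "Tel.: 022-7654321",
    "Te: 033-1112223",
    "Mob.: 9800000000",
]

# "Te", optional further letters, optional dot, optional spaces, then a colon.
TEL_PREFIX = re.compile(r"^Te[a-z]*\.?\s*:\s*")

def strip_tel_prefix(line):
    """Return the number part if the line starts with a Tel-like prefix, else None."""
    match = TEL_PREFIX.match(line)
    return line[match.end():].strip() if match else None

numbers = [strip_tel_prefix(line) for line in lines]
print(numbers)
```

Lines that do not start with a Tel-like prefix (such as the Mob.: line here) come back as None, so they can be routed to a different rule.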
The distributors are separated by empty lines, and the names are in a different color - so I defined this function to get the font color of any paragraph from its xml:
from bs4 import BeautifulSoup
def getParaColor(para):
    try:
        return BeautifulSoup(
            para.paragraph_format.element.xml, 'xml'
        ).find('color').get('w:val')
    except Exception:  # e.g. no color element in this paragraph
        return ''
The try...except hasn't been necessary yet, but just in case...
(The xml is actually also helpful for double-checking that .text hasn't missed anything - in my case, I noticed that the email for Shri Adhya Educational Books wasn't getting extracted.)
Then, you can process the paragraphs from docx.Document with a function like:
import re
def splitParas(paras):
    ptc = [(
        p.text, getParaColor(p), p.paragraph_format.element.xml
    ) for p in paras]
    curSectn = 'UNKNOWN'
    splitBlox = [{}]
    for pt, pc, px in ptc:
        # double-check for missing text
        xmlText = BeautifulSoup(px, 'xml').text
        xmlText = ' '.join([s for s in xmlText.split() if s != ''])
        if len(xmlText) > len(pt): pt = xmlText
        # initiate
        if not pt:
            # an empty paragraph ends the current distributor block
            if splitBlox[-1] != {}:
                splitBlox.append({})
            continue
        if pc == '20752E':
            # section heading color (East/West/etc)
            curSectn = pt.strip()
            continue
        if splitBlox[-1] == {}:
            splitBlox[-1]['section'] = curSectn
            splitBlox[-1]['raw'] = []
            splitBlox[-1]['Name'] = []
            splitBlox[-1]['address_raw'] = []
        # collect
        splitBlox[-1]['raw'].append(pt)
        if pc == 'D12229':
            # distributor name color
            splitBlox[-1]['Name'].append(pt)
        elif re.search("^Te.*:.*", pt):
            splitBlox[-1]['tel_raw'] = re.sub("^Te.*:", '', pt).strip()
        elif re.search("^Mob.*:.*", pt):
            splitBlox[-1]['mobile_raw'] = re.sub("^Mob.*:", '', pt).strip()
        elif pt.startswith('Email:') or re.search(".*[@].*[.].*", pt):
            splitBlox[-1]['Email'] = pt.replace('Email:', '').strip()
        else:
            splitBlox[-1]['address_raw'].append(pt)
    # some cleanup
    if splitBlox[-1] == {}: splitBlox = splitBlox[:-1]
    for i in range(len(splitBlox)):
        addrsParas = splitBlox[i]['address_raw'] # for later
        # join lists into strings
        splitBlox[i]['Name'] = ' '.join(splitBlox[i]['Name'])
        for k in ['raw', 'address_raw']:
            splitBlox[i][k] = '\n'.join(splitBlox[i][k])
        # search address for City, State and PostCode
        apLast = addrsParas[-1].split(',')[-1]
        maybeCity = [ap for ap in addrsParas if '–' in ap]
        if '–' not in apLast:
            splitBlox[i]['State'] = apLast.strip()
        if maybeCity:
            maybePIN = maybeCity[-1].split('–')[-1].split(',')[0]
            maybeCity = maybeCity[-1].split('–')[0].split(',')[-1]
            splitBlox[i]['City'] = maybeCity.strip()
            splitBlox[i]['PostCode'] = maybePIN.strip()
        # add mobile to tel
        if 'mobile_raw' in splitBlox[i]:
            if 'tel_raw' not in splitBlox[i]:
                splitBlox[i]['tel_raw'] = splitBlox[i]['mobile_raw']
            else:
                splitBlox[i]['tel_raw'] += (', ' + splitBlox[i]['mobile_raw'])
            del splitBlox[i]['mobile_raw']
        # split tel [as needed]
        if 'tel_raw' in splitBlox[i]:
            tel_i = [t.strip() for t in splitBlox[i]['tel_raw'].split(',')]
            telNum = []
            for t in range(len(tel_i)):
                if '/' in tel_i[t]:
                    tns = [n.strip() for n in tel_i[t].split('/')]
                    tel1 = tns[0]
                    telNum.append(tel1)
                    for tn in tns[1:]:
                        # the part after '/' replaces the tail of the first number
                        telNum.append(tel1[:-1*len(tn)]+tn)
                else:
                    telNum.append(tel_i[t])
            splitBlox[i]['Tel_1'] = telNum[0]
            splitBlox[i]['Tel'] = telNum[0] if len(telNum) == 1 else telNum
    return splitBlox
(Since I was getting font color anyway, I decided to add another column called "section" to put East/West/etc in. And I added "PostCode" too, since it seems to be on the other side of "City"...)
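The City / PostCode / State splitting can be tried in isolation. The following is a minimal sketch of the same split-on-dash idea, assuming address lines shaped like "City – PIN, State"; the helper name split_city_pin and the sample address are hypothetical.

```python
def split_city_pin(address_lines):
    """Pull State, City and PostCode out of address lines shaped like 'City – PIN'."""
    result = {}
    # State: whatever follows the last comma of the last line, unless it holds a dash.
    last_part = address_lines[-1].split(",")[-1]
    if "–" not in last_part:
        result["State"] = last_part.strip()
    # City and PostCode: taken from either side of the dash in the last dashed line.
    dashed = [line for line in address_lines if "–" in line]
    if dashed:
        before = dashed[-1].split("–")[0]
        after = dashed[-1].split("–")[-1]
        result["City"] = before.split(",")[-1].strip()
        result["PostCode"] = after.split(",")[0].strip()
    return result

# Hypothetical address, not taken from the document.
parsed = split_city_pin(["12 Example Road, Near Market", "Pune – 411001, Maharashtra"])
print(parsed)
```

An address without any dashed line simply yields no "City" or "PostCode" key, which is why those columns can be empty for some rows.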
Since "raw" is saved, any other value can be double checked manually at least.
The function combines "Mobile" into "Tel" even though they're extracted with separate regex.
I'd say "Tel_1" is fairly reliable, but some of the inconsistent patterns mean that other numbers in "Tel" might come out incorrect if they were separated with '/'.
Also, "Tel" is either a string or a list of strings depending on how many numbers there were in "tel_raw".
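To see why numbers separated with '/' can come out wrong, here is a small standalone sketch of the tail-replacement rule. The function name expand_slash_numbers and the sample numbers are made up for illustration.

```python
def expand_slash_numbers(raw):
    """Expand 'A/B' shorthand, where B replaces the last len(B) characters of A."""
    numbers = []
    for chunk in (c.strip() for c in raw.split(",")):
        parts = [p.strip() for p in chunk.split("/")]
        first = parts[0]
        numbers.append(first)
        for tail in parts[1:]:
            if tail:  # skip empty pieces such as a trailing '/'
                numbers.append(first[:-len(tail)] + tail)
    return numbers

expanded = expand_slash_numbers("23456781/82, 9800000000")
print(expanded)
```

The rule only works when the part after '/' really is a replacement for the end of the first number. If it is a complete second number of a different length, or carries its own area code, the result needs a manual check against "raw".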
After this, you can view the result as a DataFrame with:
import docx
import pandas
content = docx.Document('HE Distributors.docx')
# pandas.DataFrame(splitParas(content.paragraphs)) # <--all Columns
pandas.DataFrame(splitParas(content.paragraphs))[[
    'section', 'Name', 'address_raw', 'City',
    'PostCode', 'State', 'Email', 'Tel_1', 'tel_raw'
]]

Python tkinter - Searching for key word that is not a sub string

import keyword
from tkinter import END
def highlight(text):
    keywords = keyword.kwlist
    for kw in keywords:
        text.tag_remove(kw, 1.0, END)
        first = 1.0
        while True:
            first = text.search(
                kw, first,
                nocase=False,
                stopindex=END
            )
            if first is None or first == "":
                break
            first_splitted = first.split(".")
            if len(first_splitted) == 1:
                break
            last = f"{first_splitted[0]}.{int(first_splitted[1]) + len(kw)}"
            character_position_before_first = f"{first_splitted[0]}.{int(first_splitted[1]) - 1}"
            character_before_first = text.get(character_position_before_first)
            last_splitted = last.split(".")
            character_position_after_last = f"{last_splitted[0]}.{int(last_splitted[1])}"
            character_after_last = text.get(character_position_after_last)
            if not character_before_first.isspace() and not character_after_last.isspace():
                break
            text.tag_add(kw, first, last)
            first = last
        text.tag_config(
            kw,
            foreground="#aa71eb"
        )
With the code above, I'm trying to highlight keywords in a text. The issue is that substrings are being marked as well.
Example:
hello this is a test open works too lmao lol lol lol
This marks the is inside this as well as the standalone is.
I only want it to mark the second is, since the first is is a substring of this.
I have no clue why the code above is not working. Help would be appreciated.
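One possible direction, offered only as a sketch and not as an answer from the original thread: find whole-word matches with Python's re module and word boundaries (\b), then convert the positions into tkinter-style "line.column" indices. The function name keyword_spans is hypothetical.

```python
import keyword
import re

# One alternation of all Python keywords, each required to stand as a whole word.
KEYWORD_PATTERN = re.compile(
    r"\b(?:%s)\b" % "|".join(re.escape(kw) for kw in keyword.kwlist)
)

def keyword_spans(source):
    """Return (start, end, word) tuples for whole-word Python keywords.

    start and end use tkinter's 'line.column' form:
    lines count from 1, columns from 0, end is exclusive.
    """
    spans = []
    for line_no, line in enumerate(source.split("\n"), start=1):
        for m in KEYWORD_PATTERN.finditer(line):
            spans.append((f"{line_no}.{m.start()}", f"{line_no}.{m.end()}", m.group()))
    return spans

sample = "hello this is a test open works too lmao lol lol lol"
spans = keyword_spans(sample)
print(spans)
```

With a Text widget named text as in the question, each result could then be applied with text.tag_add(word, start, end). The is inside this is skipped because there is no word boundary between h and i.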

Add text in word based on content

I have a batch of .doc documents, and the first line of each document contains the name of a person. I would like to add the email address of that person to each document, based on a list I have. How can I use Python or VBA to program something that does the job for me?
I tried this VBA code, which finds the name of the person and then writes the email; I was thinking of looping it over the documents. However, even this minimum working example does not actually work. What am I doing wrong?
Sub email()
    Selection.find.ClearFormatting
    Selection.find.Replacement.ClearFormatting
    If Selection.find.Text = "Chiara Gatta" Then
        With Selection.find
            .Text = "E-mail:"
            .Replacement.Text = "E-mail: chiara.gatta@gmail.com"
            .Forward = True
            .Wrap = wdFindContinue
            .Format = False
            .MatchCase = False
            .MatchWholeWord = False
            .MatchByte = False
            .MatchWildcards = False
            .MatchSoundsLike = False
            .MatchAllWordForms = False
        End With
        Selection.find.Execute replace:=wdReplaceAll
    End If
End Sub

The question lacks the minimum details and code required for help. However, here is code that picks up person names and email addresses from a table in the document that contains the macro. The table should have 3 columns: the 1st column contains the name of the person, the 2nd column contains the email address, and the 3rd column is left blank for remarks written by the code.
On running the code, you are prompted to select the Word files in which the email address should be inserted. For a trial, use only copies of the files, and try only a handful of files at a time (if file sizes are large). It is assumed that the files contain the name and the word "E-mail:" (if "E-mail:" is not in the file, modify the code as commented).
Code:
Sub test2()
    Dim Fldg As FileDialog, Fl As Variant
    Dim Thdoc As Document, Edoc As Document
    Dim Tbl As Table, Rw As Long, Fnd As Boolean
    Dim xName As String, xEmail As String
    Set Thdoc = ThisDocument
    Set Tbl = Thdoc.Tables(1)
    Set Fldg = Application.FileDialog(msoFileDialogFilePicker)
    With Fldg
        .Filters.Clear
        .Filters.Add "Word Documents ", "*.doc,*.dot,*.docx,*.docm,*.dotm", 1
        .AllowMultiSelect = True
        .InitialFileName = "C:\users\user\desktop\folder1\*.doc*" 'use your choice of folder
        If .Show <> -1 Then Exit Sub
    End With
    'Search for each Name in Table 1 column 1
    For Rw = 1 To Tbl.Rows.Count
        xName = Tbl.Cell(Rw, 1).Range.Text
        xEmail = Tbl.Cell(Rw, 2).Range.Text
        If Len(xName) > 2 And Len(xEmail) > 2 Then
            xName = Left(xName, Len(xName) - 2) 'Clean special characters in word cell text
            xEmail = Left(xEmail, Len(xEmail) - 2) 'Clean special characters in word cell text
            'open each Document selected & search for names
            For Each Fl In Fldg.SelectedItems
                Set Edoc = Documents.Open(Fl)
                Fnd = False
                With Edoc.Content.Find
                    .ClearFormatting
                    .Text = xName
                    .Replacement.Text = xName & vbCrLf & "E-mail: " & xEmail
                    .Wrap = wdFindContinue
                    .Execute Replace:=wdReplaceNone
                    '.Execute Replace:=wdReplaceOne
                    Fnd = .Found
                End With
                'If the word "E-mail:" is not already in the file, delete the next If Fnd branch
                'and use .Execute Replace:=wdReplaceOne instead of .Execute Replace:=wdReplaceNone
                If Fnd Then ' If Name is found then Search for "E-Mail:"
                    Fnd = False
                    With Edoc.Content.Find
                        .ClearFormatting
                        .Text = "E-mail:"
                        .Replacement.Text = "E-mail: " & xEmail
                        .Wrap = wdFindContinue
                        .Execute Replace:=wdReplaceOne
                        Fnd = .Found
                    End With
                End If
                If Fnd Then
                    Edoc.Save
                    Tbl.Cell(Rw, 3).Range.Text = "Found & Replaced in " & Fl
                    Exit For
                Else
                    Tbl.Cell(Rw, 3).Range.Text = "Not found in any selected document"
                End If
                Edoc.Close False
            Next Fl
        End If
    Next Rw
End Sub
Try to understand each action in the code and modify it to your requirements.

Highlighting certain characters in Tkinter

I'm creating a simple text editor in Python 3.4 and Tkinter. At the moment I'm stuck on the find feature.
I can find characters successfully but I'm not sure how to highlight them. I've tried the tag method without success, error:
str object has no attribute 'tag_add'.
Here's my code for the find function:
def find(): # is called when the user clicks a menu item
    findin = tksd.askstring('Find', 'String to find:')
    contentsfind = editArea.get(1.0, 'end-1c') # editArea is a scrolledtext area
    findcount = 0
    for x in contentsfind:
        if x == findin:
            findcount += 1
    print('find - found ' + str(findcount) + ' of ' + findin)
    if findcount == 0:
        nonefound = ('No matches for ' + findin)
        tkmb.showinfo('No matches found', nonefound)
        print('find - found 0 of ' + findin)
The user inputs text into a scrolledtext field, and I want to highlight the matching strings on that scrolledtext area.
How would I go about doing this?

Use tag_add to add a tag to a region. Also, instead of getting all the text and searching it yourself, you can use the search method of the widget. It will return the start of the match, and can also report how many characters matched. You can then use that information to add the tag.
It would look something like this:
...
editArea.tag_configure("find", background="yellow")
...
def find():
    findin = tksd.askstring('Find', 'String to find:')
    countVar = tk.IntVar()
    index = "1.0"
    matches = 0
    while True:
        index = editArea.search(findin, index, "end", count=countVar)
        if index == "": break
        matches += 1
        start = index
        end = editArea.index("%s + %s c" % (index, countVar.get()))
        editArea.tag_add("find", start, end)
        index = end

How to get Text coordinates using PyUNO with OpenOffice writer

I have a Python script that successfully does a search and replace in an OpenOffice Writer document using PyUNO. How can I get the coordinates of the found text?
import string
search = document.createSearchDescriptor()
search.SearchString = unicode('find')
#search.SearchCaseSensitive = True
#search.SearchWords = True
found = document.findFirst(search)
if found:
    #log.debug('Found %s' % find)
    ## any code here help to get the coordinate of found text?
    pass

This is some StarBASIC code to find the page number of a search expression in a Writer document:
SUB find_page_number()
    oDoc = ThisComponent
    oViewCursor = oDoc.getCurrentController().getViewCursor()
    oSearchFor = "My Search example"
    oSearch = oDoc.createSearchDescriptor()
    With oSearch
        .SearchRegularExpression = False
        .SearchBackwards = False
        .setSearchString(oSearchFor)
    End With
    oFirstFind = oDoc.findFirst(oSearch)
    If NOT isNull(oFirstFind) Then
        oViewCursor.gotoRange(oFirstFind, False)
        MsgBox oViewCursor.getPage()
    Else
        msgbox "not found: " & oSearchFor
    End If
END SUB
Hope this helps you
