How to use Python to find all ISBNs in a text file?

I have a text file text_isbn with loads of ISBNs in it. I want to write a script to parse it and write the results to a new text file with each ISBN on its own line.
So far I have written the regular expression for finding the ISBNs, but I could not get any further:
import re
list = open("text_isbn", "r")
regex = re.compile('(?:[0-9]{3}-)?[0-9]{1,5}-[0-9]{1,7}-[0-9]{1,6}-[0-9]')
I tried to use the following but got an error (I guess the file object is not in the proper format...):
parsed = regex.findall(list)
How to do the parsing and write it to a new file (output.txt)?
Here is a sample of the text in text_isbn
Praxisguide Wissensmanagement - 978-3-540-46225-5
Programmiersprachen - 978-3-8274-2851-6
Effizient im Studium - 978-3-8348-8108-3

How about
import re
isbn = re.compile("(?:[0-9]{3}-)?[0-9]{1,5}-[0-9]{1,7}-[0-9]{1,6}-[0-9]")
matches = []
with open("text_isbn") as isbn_lines:
    for line in isbn_lines:
        matches.extend(isbn.findall(line))
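To also write the matches to output.txt, one per line, as the question asks, the loop can be followed by a plain write. A self-contained sketch (it recreates the question's sample file first so it can run as-is):

```python
import re

# Recreate the question's sample file so the snippet is self-contained
sample = (
    "Praxisguide Wissensmanagement - 978-3-540-46225-5\n"
    "Programmiersprachen - 978-3-8274-2851-6\n"
    "Effizient im Studium - 978-3-8348-8108-3\n"
)
with open("text_isbn", "w") as f:
    f.write(sample)

isbn = re.compile(r"(?:[0-9]{3}-)?[0-9]{1,5}-[0-9]{1,7}-[0-9]{1,6}-[0-9]")

matches = []
with open("text_isbn") as isbn_lines:
    for line in isbn_lines:
        matches.extend(isbn.findall(line))

# One ISBN per line in output.txt
with open("output.txt", "w") as out:
    out.write("\n".join(matches) + "\n")
```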

Try this regex (from the Regular Expressions Cookbook):
import re

regex = "(?:ISBN(?:-1[03])?:? )?(?=[-0-9 ]{17}$|[-0-9X ]{13}$|[0-9X]{10}$)(?:97[89][- ]?)?[0-9]{1,5}[- ]?(?:[0-9]+[- ]?){2}[0-9X]$"
with open("text_isbn", "r") as data, open("output.txt", "w") as outfile:
    for l in data:
        match = re.search(regex, l)
        if match:  # guard against lines without an ISBN
            outfile.write('%s\n' % match.group())
Tested with your sample data. This assumes each line contains at most one ISBN.

Related

Extract only the title of an article from a TXT file in Python

I would appreciate your guidance on the following problem. I need to bulk extract only the article titles from a series of publications. The idea is that I receive the files in PDF, I extract only the first page (done), bulk convert to TXT (done), and I am stuck in the last phase.
The structure of the TXTs is as follows:
--- JOURNAL of MEDICINE and LIFE
JML | REVIEW
The role of novel poly (ADP-ribose) inhibitors in the treatment of locally advanced and metastatic Her-2/neu negative breast cancer with inherited germline BRCA1/2 mutations.
A review of the literature
Authors list, etc, etc ---
I need only the title (in bold) from each file. I can do the iteration; that is not a problem.
With the code below I tried to identify paragraph 1:
data = file.read()
array1 = []
sp = data.split("\n\n")
for number, paragraph in enumerate(sp, 1):
    if number == 1:
        array1 += [paragraph]
print(array1)
No results whatsoever...
The idea is that I need to save only the titles in a file (could be TXT) as I need this list for another purpose.
Many thanks!
You might read the whole file using .read() and use a pattern with a capture group to match from JML to Authors.
^JML\s*\|.*\s*\r?\n((?:.*\r?\n)*?)Authors\b
The pattern matches:
^ Start of string
JML\s*\| match JML, optional whitespace chars and |
.*\s*\r?\n Match the rest of the line, optional whitespace chars and a newline
( Capture group 1
(?:.*\r?\n)*? Match as few whole lines as possible
) Close group 1
Authors\b Authors
For example:
import os
import re
pattern = r"^JML\s*\|.*\s*\r?\n((?:.*\r?\n)*?)Authors\b"
array1 = []
for file in os.listdir():
    with open(file, "r") as data:
        array1 = array1 + re.findall(pattern, data.read(), re.MULTILINE)
print(array1)
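To save only the titles to a file, as the question's final step asks, the findall result can be written out with each multi-line title collapsed to one line. A minimal sketch on an abridged version of the sample text (the titles.txt name is an assumption):

```python
import re

# Abridged version of the question's sample structure
sample = (
    "--- JOURNAL of MEDICINE and LIFE\n"
    "JML | REVIEW\n"
    "The role of novel poly (ADP-ribose) inhibitors in the treatment of breast cancer.\n"
    "A review of the literature\n"
    "Authors list, etc, etc ---\n"
)

pattern = r"^JML\s*\|.*\s*\r?\n((?:.*\r?\n)*?)Authors\b"
titles = re.findall(pattern, sample, re.MULTILINE)

# Collapse each multi-line title into a single line before saving
with open("titles.txt", "w") as out:
    for title in titles:
        out.write(" ".join(title.split()) + "\n")
```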

How to get the number of occurrences of an expression in a file using Python

I have code that reads files, finds the expressions matching the user input, and highlights them, using the findall function of the regular expression module.
I am also trying to save some information about these matches to a JSON file, such as:
file name
matching expression
number of occurrences
The problem is that the program reads the file and displays the text with the expression highlighted, but in the JSON file it saves the number of occurrences as the number of lines.
In this example the searched word is "this"; it exists in the text file twice, but the result in the JSON file is 12, which is the number of text lines.
(Screenshot: result of the JSON file and the highlighted text.)
code:
def MatchFunc(self):
    self.textEdit_PDFpreview.clear()
    x = self.lineEditSearch.text()
    TextString = self.ReadingFileContent(self.FileListSelected())
    d = defaultdict(list)
    filename = os.path.basename(self.FileListSelected())
    RepX = '<u><b style="color:#FF0000">' + x + '</b></u>'
    for counter, myLine in enumerate(filename):
        self.textEdit_PDFpreview.clear()
        thematch = re.sub(x, RepX, TextString)
        thematchFilt = re.findall(x, TextString, re.M | re.I)
        if thematchFilt:
            d[thematchFilt[0]].append(counter + 1)
            self.textEdit_PDFpreview.insertHtml(str(thematch))
        else:
            self.textEdit_PDFpreview.insertHtml('No Match Found')
    OutPutListMetaData = []
    for match, positions in d.items():
        print("this is match {}".format(match))
        print("this is position {}".format(positions))
        listMetaData = {"File Name": filename, "Searched Word": match, "Number Of Occurence": len(positions)}
        OutPutListMetaData.append(listMetaData)
        for p in positions:
            print("on line {}".format(p))
    jsondata = json.dumps(OutPutListMetaData, indent=4)
    print(jsondata)
    folderToCreate = "search_result"
    today = time.strftime("%Y%m%d__%H-%M")
    jsonFileName = "{}_searchResult.json".format(today)
    if not os.path.exists(os.getcwd() + os.sep + folderToCreate):
        os.mkdir("./search_result")
    fpJ = os.path.join(os.getcwd() + os.sep + folderToCreate, jsonFileName)
    print(fpJ)
    with open(fpJ, "a") as jsf:
        jsf.write(jsondata)
    print("finish writing")
This is straightforward with Counter: pass it an iterable and it maps each distinct element to its number of occurrences.
And since the re.findall function returns a list, you can also just do len(result).
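For example (the sample text and word are made up for illustration):

```python
import re
from collections import Counter

text = "this is a test. This file mentions this word often."
word = "this"

# len() of the findall result is the total occurrence count
occurrences = re.findall(word, text, re.M | re.I)
print(len(occurrences))  # 3

# Counter is handy when counting several words at once
counts = Counter(w.lower() for w in re.findall(r"\w+", text))
print(counts["this"])  # 3
```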

XML format in Python

I needed to take an XML file and replace certain values with other values.
This was easy enough, parsing through the XML (as text) and replacing the old values with the new.
The issue is that the new text file is in the wrong format.
It's all encased in square brackets and has "\n" characters instead of line breaks.
I did try the xml.dom.minidom lib but it's not working ...
I could parse the resulting file as well and remove the "\n" and square brackets, but I don't want to do that, as I am not sure this is the only thing that has been added in this format.
source code :
import json
import shutil
import itertools
import datetime
import time
import calendar
import sys
import string
import random
import uuid
import xml.dom.minidom
inputfile = open('data.txt')
outputfile = open('output.xml','w')
sess_id = "d87c2b8e063e5e5c789d277c34ea"
new_sess_id = ""
number_of_sessions = 4
my_text = str(inputfile.readlines())
my_text2 = ''
#print (my_text)
#Just replicate the session logs x times ...
#print ("UUID random : " + str(uuid.uuid4()).replace("-","")[0:28])
for i in range(0, number_of_sessions):
    new_sess_id = str(uuid.uuid4()).replace("-","")[0:28]
    my_text2 = my_text + my_text2
    my_text2 = my_text2.replace(sess_id, new_sess_id)
#xml = xml.dom.minidom.parseString(my_text2)
outputfile.write(my_text2)
print (my_text)
inputfile.close()
outputfile.close()
The original text is XML format but the output is like
time is it</span></div><div
class=\\"di_transcriptAvatarAnswerEntry\\"><span
class=\\"di_transcriptAvatarTitle\\">[AVATAR] </span> <span
class=\\"di_transcriptAvatarAnswerText\\">My watch says it\'s 6:07
PM.<br/>Was my answer helpful? No Yes</span></div>\\r\\n"
</variable>\n', '</element>\n', '</path>\n', '</transaction>\n',
'</session>\n', '<session>']
You are currently using readlines(). This reads each line of your file and returns a Python list, one line per entry (complete with \n on the end of each entry). You were then using str() to convert that list representation into a string, for example:
text = str(['line1\n', 'line2\n', 'line3\n'])
text would now be a string that looks like your list, complete with all the [ and quote characters. Rather than readlines(), you probably just need read(), which returns the whole file contents as a single string for you to work with.
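A tiny illustration of that artifact (the two-line file contents are hypothetical):

```python
# What readlines() would return for a two-line file ...
lines = ["<session>\n", "</session>\n"]

# ... and what str() turns it into: a list literal, brackets,
# quotes and \n escape sequences included
as_text = str(lines)
print(as_text)  # ['<session>\n', '</session>\n']
```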
Try the following approach, which also uses the preferred with context manager for dealing with files (it closes them automatically for you).
import uuid
sess_id = "d87c2b8e063e5e5c789d277c34ea"
new_sess_id = ""
number_of_sessions = 4
with open('data.txt') as inputfile, open('output.xml','w') as outputfile:
    my_text = inputfile.read()
    my_text2 = ''
    # Just replicate the session logs x times ...
    for i in range(0, number_of_sessions):
        new_sess_id = str(uuid.uuid4()).replace("-","")[0:28]
        my_text2 = my_text + my_text2
        my_text2 = my_text2.replace(sess_id, new_sess_id)
    outputfile.write(my_text2)

Regex to capture a Wordpress v2.6.0/2.6.1 hash

I have a hash that sits inside of a file:
Wordpress list:
eÝ>l;\x90\x9e(qNCDbf]_LÇa\x04\x0bTZZ
\x8aW\x8flëL\x03Z[esc]\x8e`R*\x87\,LÇa\x04\x0bTZZ
a\x9c[ff]\x04RýÎ\x03,ÔJANetáLÇa\x04\x0bTZZ
$H$91234567894GGuEJpikxPnFVUioh9e.
What I need to do is match the wordpress hash with a regex and output what line it was on. I've already done so, but I'm looking for a better regex if possible, the one I am using right now looks like this: ^\$H\$\d+\w+.$. Here is how I am using it:
import re
file_string = """
Wordpress list:
eÝ>l;\x90\x9e(qNCDbf]_LÇa\x04\x0bTZZ
\x8aW\x8flëL\x03Z[esc]\x8e`R*\x87\,LÇa\x04\x0bTZZ
a\x9c[ff]\x04RýÎ\x03,ÔJANetáLÇa\x04\x0bTZZ
$H$91234567894GGuEJpikxPnFVUioh9e."""
wordpress_re = re.compile(r"^\$H\$\d+\w+.$")
def find_wordpress_hash(data=file_string):
    for count, line in enumerate(data.split("\n"), start=1):
        if wordpress_re.match(line):
            print("Matched at line: {}".format(count))
            print(line)

if __name__ == '__main__':
    find_wordpress_hash()
    # <= Matched at line: 6
    # <= $H$91234567894GGuEJpikxPnFVUioh9e.
Is there a more accurate regex I could be using, or a better way to use this regex, or even a better way to capture a wordpress 2.6.0/2.6.1 hash?
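One option: WordPress's phpass portable hashes have a fixed shape: a $P$ or $H$ prefix, then one iteration-count character, an 8-character salt, and a 22-character hash, with all 31 trailing characters drawn from the itoa64 alphabet ./0-9A-Za-z. A pattern anchored to that shape would reject near-misses that \d+\w+.$ accepts; a sketch, worth verifying against your own hash corpus:

```python
import re

# $P$ or $H$ prefix, then exactly 31 itoa64 characters
phpass_re = re.compile(r"^\$[PH]\$[./0-9A-Za-z]{31}$")

print(bool(phpass_re.match("$H$91234567894GGuEJpikxPnFVUioh9e.")))  # True
print(bool(phpass_re.match("$H$123")))                              # False
```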

Python regex Unicode: extracting data from a UTF-8 file

I am having difficulty extracting data from a UTF-8 file that contains Chinese characters.
The file is actually the CEDICT (chinese-english dictionary) and looks like this :
賓 宾 [bin1] /visitor/guest/object (in grammar)/
賓主 宾主 [bin1 zhu3] /host and guest/
賓利 宾利 [Bin1 li4] /Bentley/
賓士 宾士 [Bin1 shi4] /Taiwan equivalent of 奔馳|奔驰[Ben1 chi2]/
賓夕法尼亞 宾夕法尼亚 [Bin1 xi1 fa3 ni2 ya4] /Pennsylvania/
賓夕法尼亞大學 宾夕法尼亚大学 [Bin1 xi1 fa3 ni2 ya4 Da4 xue2] /University of Pennsylvania/
賓夕法尼亞州 宾夕法尼亚州 [Bin1 xi1 fa3 ni2 ya4 zhou1] /Pennsylvania/
So far I have managed to get the first two fields using split(), but I can't figure out how to extract the two other fields (say, for the second line, "bin1 zhu3" and "host and guest"). I have been trying to use a regex but it doesn't work, for a reason I can't figure out.
#!/bin/python
#coding=utf-8
import re
class REMatcher(object):
    def __init__(self, matchstring):
        self.matchstring = matchstring

    def match(self, regexp):
        self.rematch = re.match(regexp, self.matchstring)
        return bool(self.rematch)

    def group(self, i):
        return self.rematch.group(i)
def look(character):
    myFile = open("/home/quentin/cedict_ts.u8","r")
    for line in myFile:
        line = line.rstrip()
        elements = line.split(" ")
        try:
            if line != "" and elements[1] == character:
                myFile.close()
                return line
        except:
            myFile.close()
            break
    myFile.close()
    return "Aucun résultat :("
translation = look("賓主") # translation contains one line of the file
elements = translation.split()
traditionnal = elements[0]
simplified = elements[1]
print "Traditionnal:" + traditionnal
print "Simplified:" + simplified
m = REMatcher(translation)
tr = ""
if m.match(r"\[(\w+)\]"):
    tr = m.group(1)
print "Pronouciation:" + tr
Any help appreciated.
This builds a dictionary to look up translations by either simplified or traditional characters and works in both Python 2.7 and 3.3:
# coding: utf8
import re
import codecs
# Process the whole file decoding from UTF-8 to Unicode
with codecs.open('cedict_ts.u8', encoding='utf8') as datafile:
    D = {}
    for line in datafile:
        # Skip comment lines
        if line.startswith('#'):
            continue
        trad, simp, pinyin, trans = re.match(r'(.*?) (.*?) \[(.*?)\] /(.*)/', line).groups()
        D[trad] = (simp, pinyin, trans)
        D[simp] = (trad, pinyin, trans)
Output (Python 3.3):
>>> D['马克']
('馬克', 'Ma3 ke4', 'Mark (name)')
>>> D['一路顺风']
('一路順風', 'yi1 lu4 shun4 feng1', 'to have a pleasant journey (idiom)')
>>> D['馬克']
('马克', 'Ma3 ke4', 'Mark (name)')
Output (Python 2.7, you have to print strings to see non-ASCII characters):
>>> D[u'马克']
(u'\u99ac\u514b', u'Ma3 ke4', u'Mark (name)')
>>> print D[u'马克'][0]
馬克
I would continue to use splits instead of regular expressions, with the maximum split number given. It depends on how consistent the format of the input file is.
elements = translation.split(' ',2)
traditionnal = elements[0]
simplified = elements[1]
rest = elements[2]
print "Traditionnal:" + traditionnal
print "Simplified:" + simplified
elems = rest.split(']')
tr = elems[0].strip('[')
print "Pronouciation:" + tr
Output:
Traditionnal:賓主
Simplified:宾主
Pronouciation:bin1 zhu3
EDIT: To split the last field into a list, split on the /:
translations = elems[1].strip().strip('/').split('/')
#strip the spaces, then the first and last slash,
#then split on the slashes
Output (for the first line of input):
['visitor', 'guest', 'object (in grammar)']
Heh, I've done this exact same thing before. Basically you just need to use a regex with groups. Unfortunately, I don't know Python's regex flavor super well (I did the same thing using C#), but you should do something like this:
matcher = r"(\b\w+\b) (\b\w+\b) \[(.*?)\] /(.*?)/"
Basically you match the entire line with one expression, then use ( ) to separate each item into a regex group. Then you just read out the groups and voilà!
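Translated into runnable Python, that idea might look like this (a sketch: \S+ is used for the character fields instead of \b\w+\b, since \w handling of CJK characters varies between regex flavors, and the sample line is taken from the question):

```python
import re

# Second sample line from the CEDICT excerpt in the question
line = "賓主 宾主 [bin1 zhu3] /host and guest/"

# One group per field: traditional, simplified, pinyin, translations
m = re.match(r"(\S+) (\S+) \[(.*?)\] /(.*)/", line)
if m:
    trad, simp, pinyin, trans = m.groups()
    print(pinyin)  # bin1 zhu3
    print(trans)   # host and guest
```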
