Extract only the specific value from a string with regex using Python

I am trying to extract specific text values from a string using regex, but because there are no spaces between the keywords and the values that follow them, I am getting an error.
I want to extract the values that come after certain keywords.
I tried PyPDF2 and pdfminer but get the error below.
fr = PyPDF2.PdfFileReader(file)
data = fr.getPage(0).extractText()
Output: ['Date : 2020-09-06 20:43:00Ack No : 3320000266Original for RecipientInvoice No.: IN05200125634Date of Issue: 06.09.2015TAX INVOICE(Issued u/s 31(1) of GST Act, 2017)POLO INDUSTRIES LIMITEDCIN: K253648B85PLC015063GSTIN: 3451256132uuy668803E1Z9PAN: BBB7653279K .....']
I want to capture Ack No, Date of Issue, and CIN from the above output, using this script:
regex_ack_no = re.compile(r"Ack No(\d+)")
regex_due_date = re.compile(r"Date of Issue(\S+ \d{1,2}, \d{4})")
regex_CIN = re.compile(r"CIN(\$\d+\.\d{1,2})")
ack_no = re.search(regex_ack_no, data).group(1)
due_date = re.search(regex_due_date, data).group(1)
cin = re.search(regex_CIN, data).group(1)
return [ack_no, due_date, cin]
Error:
AttributeError: 'NoneType' object has no attribute 'group'
When I use the same script with another PDF file that has its data in table format, it works.

You need to change the regex patterns to match the data format. The keywords are followed by optional spaces and a colon, which you have to match. The date format is not what you have in your pattern, and neither is the format of CIN.
Before calling .group(1), check that the match was successful. In my code below I return default values when there's no match.
import re

data = 'Date : 2020-09-06 20:43:00Ack No : 3320000266Original for RecipientInvoice No.: IN05200125634Date of Issue: 06.09.2015TAX INVOICE(Issued u/s 31(1) of GST Act, 2017)POLO INDUSTRIES LIMITEDCIN: K253648B85PLC015063GSTIN: 3451256132uuy668803E1Z9PAN: BBB7653279K .....'

regex_ack_no = re.compile(r"Ack No\s*:\s*(\d+)")
regex_due_date = re.compile(r"Date of Issue\s*:\s*(\d\d\.\d\d\.\d{4})")
regex_CIN = re.compile(r"CIN:\s*(\w+?)GSTIN:")

ack_no = re.search(regex_ack_no, data)
if ack_no:
    ack_no = ack_no.group(1)
else:
    ack_no = 'Ack No not found'

due_date = re.search(regex_due_date, data)
if due_date:
    due_date = due_date.group(1)
else:
    due_date = 'Due date not found'

cin = re.search(regex_CIN, data)
if cin:
    cin = cin.group(1)
else:
    cin = 'CIN not found'

print([ack_no, due_date, cin])
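If you extract many fields this way, the repeated if/else blocks can be folded into a small helper; search_or_default is a hypothetical name, not part of re:
def search_or_default(pattern, text, default):
    # return the first capture group if the pattern matches, else the default
    m = re.search(pattern, text)
    return m.group(1) if m else default

ack_no = search_or_default(r"Ack No\s*:\s*(\d+)", data, 'Ack No not found')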

Related

Extract Text from a word document

I am trying to scrape data from a Word document available at:
https://dl.dropbox.com/s/pj82qrctzkw9137/HE%20Distributors.docx
I need to scrape the Name, Address, City, State, and Email ID. I am able to scrape the email using the code below.
import docx

content = docx.Document('HE Distributors.docx')
location = []
for i in range(len(content.paragraphs)):
    stat = content.paragraphs[i].text
    if 'Email' in stat:
        location.append(i)
for i in location:
    print(content.paragraphs[i].text)
I tried to use the steps mentioned in:
How to read data from .docx file in python pandas?
I need to convert this into a data frame with all the columns mentioned above, but I am still facing issues.
There are some inconsistencies in the document: phone numbers start with Tel: sometimes, Tel.: other times, and even Te: once. One of the emails is just on the last line for that distributor, without the Email: prefix, and the State isn't always on the last line. Still, for the most part, most of the data can be extracted with regex and/or splits.
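For instance, the loose prefix pattern used in the code below (^Te.*:) absorbs all three spellings; a quick sanity check with made-up numbers:
import re
for s in ["Tel: 011-23456789", "Tel.: 011-23456789", "Te: 011-23456789"]:
    print(re.sub("^Te.*:", "", s).strip())  # prints 011-23456789 three times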
The distributors are separated by empty lines, and the names are in a different color - so I defined this function to get the font color of any paragraph from its xml:
# from bs4 import BeautifulSoup
def getParaColor(para):
    try:
        return BeautifulSoup(
            para.paragraph_format.element.xml, 'xml'
        ).find('color').get('w:val')
    except:
        return ''
The try...except hasn't been necessary yet, but just in case...
(The xml is actually also helpful for double-checking that .text hasn't missed anything - in my case, I noticed that the email for Shri Adhya Educational Books wasn't getting extracted.)
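As a standalone sketch, that double-check amounts to pulling every text node out of the paragraph's raw xml and comparing it against .text (same idea as the xmlText lines inside splitParas below; fullParaText is a hypothetical helper, reusing the same bs4 parse as getParaColor):
def fullParaText(para):
    # gather all text nodes from the paragraph's raw xml, collapsing whitespace
    xmlText = BeautifulSoup(para.paragraph_format.element.xml, 'xml').text
    return ' '.join(xmlText.split())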
Then, you can process the paragraphs from docx.Document with a function like:
# import re
def splitParas(paras):
    ptc = [(
        p.text, getParaColor(p), p.paragraph_format.element.xml
    ) for p in paras]
    curSectn = 'UNKNOWN'
    splitBlox = [{}]
    for pt, pc, px in ptc:
        # double-check for missing text
        xmlText = BeautifulSoup(px, 'xml').text
        xmlText = ' '.join([s for s in xmlText.split() if s != ''])
        if len(xmlText) > len(pt): pt = xmlText
        # initiate
        if not pt:
            if splitBlox[-1] != {}:
                splitBlox.append({})
            continue
        if pc == '20752E':
            curSectn = pt.strip()
            continue
        if splitBlox[-1] == {}:
            splitBlox[-1]['section'] = curSectn
            splitBlox[-1]['raw'] = []
            splitBlox[-1]['Name'] = []
            splitBlox[-1]['address_raw'] = []
        # collect
        splitBlox[-1]['raw'].append(pt)
        if pc == 'D12229':
            splitBlox[-1]['Name'].append(pt)
        elif re.search("^Te.*:.*", pt):
            splitBlox[-1]['tel_raw'] = re.sub("^Te.*:", '', pt).strip()
        elif re.search("^Mob.*:.*", pt):
            splitBlox[-1]['mobile_raw'] = re.sub("^Mob.*:", '', pt).strip()
        elif pt.startswith('Email:') or re.search(".*[#].*[.].*", pt):
            splitBlox[-1]['Email'] = pt.replace('Email:', '').strip()
        else:
            splitBlox[-1]['address_raw'].append(pt)
    # some cleanup
    if splitBlox[-1] == {}: splitBlox = splitBlox[:-1]
    for i in range(len(splitBlox)):
        addrsParas = splitBlox[i]['address_raw']  # for later
        # join lists into strings
        splitBlox[i]['Name'] = ' '.join(splitBlox[i]['Name'])
        for k in ['raw', 'address_raw']:
            splitBlox[i][k] = '\n'.join(splitBlox[i][k])
        # search address for City, State and PostCode
        apLast = addrsParas[-1].split(',')[-1]
        maybeCity = [ap for ap in addrsParas if '–' in ap]
        if '–' not in apLast:
            splitBlox[i]['State'] = apLast.strip()
        if maybeCity:
            maybePIN = maybeCity[-1].split('–')[-1].split(',')[0]
            maybeCity = maybeCity[-1].split('–')[0].split(',')[-1]
            splitBlox[i]['City'] = maybeCity.strip()
            splitBlox[i]['PostCode'] = maybePIN.strip()
        # add mobile to tel
        if 'mobile_raw' in splitBlox[i]:
            if 'tel_raw' not in splitBlox[i]:
                splitBlox[i]['tel_raw'] = splitBlox[i]['mobile_raw']
            else:
                splitBlox[i]['tel_raw'] += (', ' + splitBlox[i]['mobile_raw'])
            del splitBlox[i]['mobile_raw']
        # split tel [as needed]
        if 'tel_raw' in splitBlox[i]:
            tel_i = [t.strip() for t in splitBlox[i]['tel_raw'].split(',')]
            telNum = []
            for t in range(len(tel_i)):
                if '/' in tel_i[t]:
                    tns = [t.strip() for t in tel_i[t].split('/')]
                    tel1 = tns[0]
                    telNum.append(tel1)
                    for tn in tns[1:]:
                        telNum.append(tel1[:-1*len(tn)]+tn)
                else:
                    telNum.append(tel_i[t])
            splitBlox[i]['Tel_1'] = telNum[0]
            splitBlox[i]['Tel'] = telNum[0] if len(telNum) == 1 else telNum
    return splitBlox
(Since I was getting font color anyway, I decided to add another column called "section" to put East/West/etc. in. And I added "PostCode" too, since it seems to be on the other side of "City"...)
Since "raw" is saved, any other value can at least be double-checked manually.
The function combines "Mobile" into "Tel", even though they're extracted with separate regexes.
I'd say "Tel_1" is fairly reliable, but some of the inconsistent patterns mean that other numbers in "Tel" might come out incorrect if they were separated with '/'.
Also, "Tel" is either a string or a list of strings, depending on how many numbers there were in "tel_raw" (a normalization sketch follows the DataFrame snippet below).
After this, you can just view as DataFrame with:
#import docx
#import pandas
content = docx.Document('HE Distributors.docx')
# pandas.DataFrame(splitParas(content.paragraphs)) # <--all Columns
pandas.DataFrame(splitParas(content.paragraphs))[[
'section', 'Name', 'address_raw', 'City',
'PostCode', 'State', 'Email', 'Tel_1', 'tel_raw'
]]
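If the string-or-list ambiguity in "Tel" is inconvenient, a hypothetical post-processing pass before building the DataFrame could force a list:
rows = splitParas(content.paragraphs)
for r in rows:
    if isinstance(r.get('Tel'), str):
        r['Tel'] = [r['Tel']]  # wrap single numbers so "Tel" is always a list
pandas.DataFrame(rows)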

Extracting information from Strings in Python

The format is
"FirstName LastName Type_Date-Time_ref_PhoneNumber"
All this is a single string
Example: "Yasir Pirkani MCD_20201105-134700_abc123_12345678"
I want to extract Name, Type, Date, Time, ref, Phonenumber from this string.
You can do:
a = "Yasir Pirkani MCD_20201105-134700_abc123_12345678"
name = " ".join(a.split(" ")[:2])
Type, Date, ref, Phonenumber = a.replace(name, "").strip().split("_")
Time = Date[Date.find("-") + 1:]
Date = Date.replace(f"-{Time}", "")
print((name, Type, Date, Time, ref, Phonenumber))
That will output
('Yasir Pirkani', 'MCD', '20201105', '134700', 'abc123', '12345678')
There are multiple ways to do this; one of them is using regex.
import re

s = "Yasir Pirkani MCD_20201105-134700_abc123_12345678"
splited_s = re.split(r'[\s_]+', s)
# splited_s -> ['Yasir', 'Pirkani', 'MCD', '20201105-134700', 'abc123', '12345678']
Then you can access each element of splited_s and adjust it as needed.
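For example, a sketch of that adjustment step (the variable names here are just illustrative):
import re

s = "Yasir Pirkani MCD_20201105-134700_abc123_12345678"
first, last, typ, dt, ref, phone = re.split(r'[\s_]+', s)
date, time = dt.split('-')  # split '20201105-134700' into date and time
print(f"{first} {last}", typ, date, time, ref, phone)
# Yasir Pirkani MCD 20201105 134700 abc123 12345678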

TypeError: Byte-like object, not string

I have this code, but keep running into versions of the error in the title. Can anyone help me get past it? The traceback points to the newfilingDate line (fourth from the bottom), but I suspect that's not where the actual error is.
def getIndexLink(tickerCode, FormType):
    csvOutput = open(IndexLinksFile, "a+b")  # "a+b" indicates that we are adding lines rather than replacing lines
    csvWriter = csv.writer(csvOutput, quoting=csv.QUOTE_NONNUMERIC)
    urlLink = "https://www.sec.gov/cgi-bin/browse-edgar?action=getcompany&CIK="+tickerCode+"&type="+FormType+"&dateb=&owner=exclude&count=100"
    pageRequest = urllib.request.Request(urlLink)
    with urllib.request.urlopen(pageRequest) as url:
        pageRead = url.read()
    soup = BeautifulSoup(pageRead, "html.parser")
    # Check if there is a table to extract / code exists in edgar database
    try:
        table = soup.find("table", {"class": "tableFile2"})
    except:
        print("No tables found or no matching ticker symbol for " + tickerCode)
        return -1
    docIndex = 1
    for row in table.findAll("tr"):
        cells = row.findAll("td")
        if len(cells) == 5:
            if cells[0].text.strip() == FormType:
                link = cells[1].find("a", {"id": "documentsbutton"})
                docLink = "https://www.sec.gov" + link['href']
                description = cells[2].text.encode('utf8').strip()  # strip takes care of the space at the beginning and the end
                filingDate = cells[3].text.encode('utf8').strip()
                newfilingDate = filingDate.replace("-", "_")  # <=== change date format from 2012-1-1 to 2012_1_1 so it can be used as part of 10-K file names
                csvWriter.writerow([tickerCode, docIndex, docLink, description, filingDate, newfilingDate])
                docIndex = docIndex + 1
    csvOutput.close()
Bytes-like objects do support .replace, as long as the replace arguments are also bytes-like (special thanks to juanpa.arrivillaga for pointing this out):
foo = b'hi-mom'
foo = foo.replace(b"-", b"_")
print(foo)
Alternatively, you can decode to a string and then re-encode to bytes, but that is messy and inefficient:
foo = b'hi-mom'
foo = str(foo, 'utf-8').replace("-", "_").encode('utf-8')
print(foo)
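Applied to the original code: filingDate is bytes because of the .encode('utf8') call, so the traceback line can be reproduced and fixed in isolation:
filingDate = b'2012-1-1'  # what cells[3].text.encode('utf8').strip() produces
# filingDate.replace("-", "_")  # TypeError: a bytes-like object is required, not 'str'
newfilingDate = filingDate.replace(b"-", b"_")
print(newfilingDate)  # b'2012_1_1'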

Delete a repeating pattern in a string using Python

I have a JavaScript file with an array of data.
info = [ {
Date = "YR-MM-DDT00:00:10"
}, ....
What I'm trying to do is remove T and on in the Date field.
Here's what I've tried:
import re

with open("info.js", "r") as myFile:
    data = myFile.read()
data = re.sub('\0-9T', '', data)
Desired output for each Date field in the array:
Date = "YR-MM-DD"
You should match the T and the characters that come after it. This works for a single timestamp:
import re
print(re.sub('T.*$', '', 'YR-MM-DDT00:00:10'))
Or if you have text containing a bunch of timestamps, match the closing double quote as well, and replace with a double quote:
import re
text = """
info = [ {
Date = "YR-MM-DDT00:00:10",
Date = "YR-MM-DDT01:02:03",
Date = "YR-MM-DDT11:22:33"
}
"""
new_text = re.sub('T.*"', '"', text)
print(new_text)
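If the file might contain other quoted strings with a T in them, a stricter variant (an assumption about the data, not needed for the sample above) only rewrites quoted values and keeps the part before the T:
new_text = re.sub(r'"([^"T]*)T[^"]*"', r'"\1"', text)  # "YR-MM-DDT00:00:10" -> "YR-MM-DD"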

Python code returns 'NoneType' object has no attribute error sometimes and works perfectly other times

def dcrawl(link):
    #importing the req. libraries & modules
    from bs4 import BeautifulSoup
    import urllib
    #fetching the document
    op = urllib.FancyURLopener({})
    f = op.open(link)
    h_doc = f.read()
    #trimming for the base document
    idoc1 = BeautifulSoup(h_doc)
    idoc2 = str(idoc1.find(id = "bwStory"))
    bdoc = BeautifulSoup(idoc2)
    #extract the date as a string
    dat = str(bdoc.div.div.string)[0:13]
    date = dst(dat)
    #extract the title as a string
    title = str(bdoc.b.string)
    #extract the full report as a string
    freport = str(bdoc.find_all("p"))
    #extract the place as a string
    plc = bdoc.find(id = "bwStoryBody")
    puni = plc.p.string
    #encoding to ascii to eliminate discrepancies
    pasi = puni.encode('ascii', 'ignore')
    com = pasi.find("-")
    place = pasi[:com]
The same conversion, "bdoc.b.string", works here:
#extract the full report as a string
freport = str(bdoc.find_all("p"))
In the line:
plc = bdoc.find(id = "bwStoryBody")
plc returns some data, and plc.p returns the first <p>....</p>, but converting it to a string doesn't work.
Because puni returned a string object earlier, I stumbled upon unicode errors and had to use encode to handle the pasi result.
.find() returns None when an object was not found. Evidently some pages do not have the elements that you are looking for.
Test for it explicitly if you want to prevent attribute errors:
plc = bdoc.find(id = "bwStoryBody")
if plc is not None:
    puni = plc.p.string
    #encoding to ascii to eliminate discrepancies
    #By default python processes in unicode
    pasi = puni.encode('ascii', 'ignore')
    com = pasi.find("-")
    place = pasi[:com]
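A quick illustration of why the guard is needed, with made-up HTML:
from bs4 import BeautifulSoup
soup = BeautifulSoup("<div id='other'>text</div>", "html.parser")
print(soup.find(id="bwStoryBody"))  # None: the element is absent
# soup.find(id="bwStoryBody").p  # would raise AttributeError: 'NoneType' object has no attribute 'p'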
