The format is
"FirstName LastName Type_Date-Time_ref_PhoneNumber"
All this is a single string
Example: "Yasir Pirkani MCD_20201105-134700_abc123_12345678"
I want to extract Name, Type, Date, Time, ref, Phonenumber from this string.
You can do
a = "Yasir Pirkani MCD_20201105-134700_abc123_12345678"
name = " ".join(a.split(" ")[:2])
Type, Date, ref, Phonenumber = a.replace(name, "").strip().split("_")
Time = Date[Date.find("-")+ 1:]
Date = Date.replace(f"-{Time}", "")
print(name, Type, Date, Time, ref, Phonenumber)
That will output
Yasir Pirkani MCD 20201105 134700 abc123 12345678
There are multiple ways to do so, one of them is using regex.
s = "Yasir Pirkani MCD_20201105-134700_abc123_12345678"
splited_s = re.split('[\s_]+', s)
# splited_s -> ['Yasir', 'Pirkani', 'MCD', '20201105-134700', 'abc123', '12345678']
Then you can access each element of splited_s and adjust it properly
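For example, if the field positions are fixed, you could rebuild the named values from the split pieces like this (a rough sketch along the same lines; the variable names are just illustrative):
import re

s = "Yasir Pirkani MCD_20201105-134700_abc123_12345678"
parts = re.split(r'[\s_]+', s)
name = ' '.join(parts[:2])             # 'Yasir Pirkani'
typ, date_time, ref, phone = parts[2:]
date, time = date_time.split('-')      # '20201105', '134700'
print(name, typ, date, time, ref, phone)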
I am trying to scrape data from a Word document available at:
https://dl.dropbox.com/s/pj82qrctzkw9137/HE%20Distributors.docx
I need to scrape the Name, Address, City, State, and Email ID. I am able to scrape the E-mail using the below code.
import docx

content = docx.Document('HE Distributors.docx')
location = []
for i in range(len(content.paragraphs)):
    stat = content.paragraphs[i].text
    if 'Email' in stat:
        location.append(i)
for i in location:
    print(content.paragraphs[i].text)
I tried the steps mentioned in How to read data from .docx file in python pandas?
I need to convert this into a data frame with all the columns mentioned above, but I'm still facing issues.
There are some inconsistencies in the document: phone numbers start with Tel: sometimes, Tel.: other times, and even Te: once; one of the emails is just on the last line for that distributor, without the Email: prefix; and the State isn't always on the last line. Still, for the most part, the data can be extracted with regex and/or splits.
The distributors are separated by empty lines, and the names are in a different color - so I defined this function to get the font color of any paragraph from its xml:
# from bs4 import BeautifulSoup
def getParaColor(para):
    try:
        return BeautifulSoup(
            para.paragraph_format.element.xml, 'xml'
        ).find('color').get('w:val')
    except:
        return ''
The try...except hasn't been necessary yet, but just in case...
(The xml is actually also helpful for double-checking that .text hasn't missed anything - in my case, I noticed that the email for Shri Adhya Educational Books wasn't getting extracted.)
Then, you can process the paragraphs from docx.Document with a function like:
# import re
def splitParas(paras):
    ptc = [(
        p.text, getParaColor(p), p.paragraph_format.element.xml
    ) for p in paras]
    curSectn = 'UNKNOWN'
    splitBlox = [{}]
    for pt, pc, px in ptc:
        # double-check for missing text
        xmlText = BeautifulSoup(px, 'xml').text
        xmlText = ' '.join([s for s in xmlText.split() if s != ''])
        if len(xmlText) > len(pt): pt = xmlText
        # initiate
        if not pt:
            if splitBlox[-1] != {}:
                splitBlox.append({})
            continue
        if pc == '20752E':
            curSectn = pt.strip()
            continue
        if splitBlox[-1] == {}:
            splitBlox[-1]['section'] = curSectn
            splitBlox[-1]['raw'] = []
            splitBlox[-1]['Name'] = []
            splitBlox[-1]['address_raw'] = []
        # collect
        splitBlox[-1]['raw'].append(pt)
        if pc == 'D12229':
            splitBlox[-1]['Name'].append(pt)
        elif re.search("^Te.*:.*", pt):
            splitBlox[-1]['tel_raw'] = re.sub("^Te.*:", '', pt).strip()
        elif re.search("^Mob.*:.*", pt):
            splitBlox[-1]['mobile_raw'] = re.sub("^Mob.*:", '', pt).strip()
        elif pt.startswith('Email:') or re.search(".*[@].*[.].*", pt):  # '@' catches bare email lines
            splitBlox[-1]['Email'] = pt.replace('Email:', '').strip()
        else:
            splitBlox[-1]['address_raw'].append(pt)
    # some cleanup
    if splitBlox[-1] == {}: splitBlox = splitBlox[:-1]
    for i in range(len(splitBlox)):
        addrsParas = splitBlox[i]['address_raw']  # for later
        # join lists into strings
        splitBlox[i]['Name'] = ' '.join(splitBlox[i]['Name'])
        for k in ['raw', 'address_raw']:
            splitBlox[i][k] = '\n'.join(splitBlox[i][k])
        # search address for City, State and PostCode
        apLast = addrsParas[-1].split(',')[-1]
        maybeCity = [ap for ap in addrsParas if '–' in ap]
        if '–' not in apLast:
            splitBlox[i]['State'] = apLast.strip()
        if maybeCity:
            maybePIN = maybeCity[-1].split('–')[-1].split(',')[0]
            maybeCity = maybeCity[-1].split('–')[0].split(',')[-1]
            splitBlox[i]['City'] = maybeCity.strip()
            splitBlox[i]['PostCode'] = maybePIN.strip()
        # add mobile to tel
        if 'mobile_raw' in splitBlox[i]:
            if 'tel_raw' not in splitBlox[i]:
                splitBlox[i]['tel_raw'] = splitBlox[i]['mobile_raw']
            else:
                splitBlox[i]['tel_raw'] += (', ' + splitBlox[i]['mobile_raw'])
            del splitBlox[i]['mobile_raw']
        # split tel [as needed]
        if 'tel_raw' in splitBlox[i]:
            tel_i = [t.strip() for t in splitBlox[i]['tel_raw'].split(',')]
            telNum = []
            for t in range(len(tel_i)):
                if '/' in tel_i[t]:
                    tns = [t.strip() for t in tel_i[t].split('/')]
                    tel1 = tns[0]
                    telNum.append(tel1)
                    for tn in tns[1:]:
                        telNum.append(tel1[:-1*len(tn)]+tn)
                else:
                    telNum.append(tel_i[t])
            splitBlox[i]['Tel_1'] = telNum[0]
            splitBlox[i]['Tel'] = telNum[0] if len(telNum) == 1 else telNum
    return splitBlox
(Since I was getting font color anyway, I decided to add another column called "section" to put East/West/etc. in. And I added "PostCode" too, since it seems to be on the other side of "City"...)
Since "raw" is saved, any other value can be double checked manually at least.
The function combines "Mobile" into "Tel" even though they're extracted with separate regex.
I'd say "Tel_1" is fairly reliable, but some of the inconsistent patterns mean that other numbers in "Tel" might come out incorrect if they were separated with '/'.
Also, "Tel" is either a string or a list of strings depending on how many numbers there were in "tel_raw".
After this, you can just view as DataFrame with:
# import docx
# import pandas
content = docx.Document('HE Distributors.docx')
# pandas.DataFrame(splitParas(content.paragraphs))  # <-- all columns
pandas.DataFrame(splitParas(content.paragraphs))[[
    'section', 'Name', 'address_raw', 'City',
    'PostCode', 'State', 'Email', 'Tel_1', 'tel_raw'
]]
I am trying to extract specific text values from a string using regex, but because there are no spaces before the keywords whose values I need, I'm getting an error.
I'm looking to extract the values that follow each keyword.
I tried using PyPDF2 and pdfminer but am getting the error below.
fr = PyPDF2.PdfFileReader(file)
data = fr.getPage(0).extractText()
Output: ['Date : 2020-09-06 20:43:00Ack No : 3320000266Original for RecipientInvoice No.: IN05200125634Date of Issue: 06.09.2015TAX INVOICE(Issued u/s 31(1) of GST Act, 2017)POLO INDUSTRIES LIMITEDCIN: K253648B85PLC015063GSTIN: 3451256132uuy668803E1Z9PAN: BBB7653279K .....']
I am looking to capture Ack No, Date of Issue, and CIN from the above output, using this script:
regex_ack_no = re.compile(r"Ack No(\d+)")
regex_due_date = re.compile(r"Date of Issue(\S+ \d{1,2}, \d{4})")
regex_CIN = re.compile(r"CIN(\$\d+\.\d{1,2})")
ack_no = re.search(regex_ack_no, data).group(1)
due_date = re.search(regex_due_date, data).group(1)
cin = re.search(regex_CIN, data).group(1)
return [ack_no, due_date, cin]
Error:
AttributeError: 'NoneType' object has no attribute 'group'
When I use the same script with another PDF file that has the data in table format, it works.
You need to change the regex patterns to match the data format. The keywords are followed by spaces and a colon, and you have to match those. The date format is not what you have in your pattern, and neither is the CIN format.
Before calling .group(1), check that the match was successful. In my code below I return default values when there's no match.
import re
data = 'Date : 2020-09-06 20:43:00Ack No : 3320000266Original for RecipientInvoice No.: IN05200125634Date of Issue: 06.09.2015TAX INVOICE(Issued u/s 31(1) of GST Act, 2017)POLO INDUSTRIES LIMITEDCIN: K253648B85PLC015063GSTIN: 3451256132uuy668803E1Z9PAN: BBB7653279K .....'
regex_ack_no = re.compile(r"Ack No\s*:\s*(\d+)")
regex_due_date = re.compile(r"Date of Issue\s*:\s*(\d\d\.\d\d\.\d{4})")
regex_CIN = re.compile(r"CIN:\s*(\w+?)GSTIN:")
ack_no = re.search(regex_ack_no, data)
if ack_no:
    ack_no = ack_no.group(1)
else:
    ack_no = 'Ack No not found'

due_date = re.search(regex_due_date, data)
if due_date:
    due_date = due_date.group(1)
else:
    due_date = 'Due date not found'

cin = re.search(regex_CIN, data)
if cin:
    cin = cin.group(1)
else:
    cin = 'CIN not found'

print([ack_no, due_date, cin])
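If you'd rather not repeat the if/else for every field, the same checks could be folded into a small helper (just a sketch reusing the patterns above; search_or_default is not a library function, only an illustrative name):
import re

def search_or_default(pattern, text, default):
    # return the first capture group, or the default when there is no match
    m = re.search(pattern, text)
    return m.group(1) if m else default

ack_no = search_or_default(r"Ack No\s*:\s*(\d+)", data, 'Ack No not found')
due_date = search_or_default(r"Date of Issue\s*:\s*(\d\d\.\d\d\.\d{4})", data, 'Due date not found')
cin = search_or_default(r"CIN:\s*(\w+?)GSTIN:", data, 'CIN not found')
print([ack_no, due_date, cin])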
I was trying to split a combination of string and unicode in Python. The split has to be made on the ResultSet object retrieved from a web site. Using the code below, I am able to get the details (it is user details):
from bs4 import BeautifulSoup
import urllib2
import re

url = "http://www.mouthshut.com/vinay_beriwal"
profile_user = urllib2.urlopen(url)
profile_soup = BeautifulSoup(profile_user.read())
usr_dtls = profile_soup.find("div", id=re.compile("_divAboutMe")).find_all('p')
for dt in usr_dtls:
    usr_dtls = " ".join(dt.text.split())
    print(usr_dtls)
The output is as below:
i love yellow..
Name: Vinay Beriwal
Age: 39 years
Hometown: New Delhi, India
Country: India
Member since: Feb 11, 2016
What I need is to create 5 distinct variables (Name, Age, Hometown, Country, Member since) and store the corresponding value after the ':' in each.
Thanks
You can use a dictionary to store name-value pairs. For example:
my_dict = {"Name":"Vinay","Age":21}
In my_dict, Name and Age are the keys of the dictionary, you can access values like this -
print (my_dict["Name"]) #This will print Vinay
Also, it's better to use complete words for variable names.
results = profile_soup.find("div", id=re.compile("_divAboutMe")).find_all('p')
user_data = {}  # dictionary initialization
for result in results:
    result = " ".join(result.text.split())
    try:
        var, value = result.strip().split(':')
        user_data[var.strip()] = value.strip()
    except:
        pass

# If you print the user_data now
print(user_data)
'''
This is what it'll print:
{'Age': '39 years', 'Country': 'India', 'Hometown': 'New Delhi, India', 'Name': 'Vinay Beriwal', 'Member since': 'Feb 11, 2016'}
'''
You can use a dictionary to store your data:
my_dict = {}
for dt in usr_dtls:
    item = " ".join(dt.text.split())
    try:
        if ':' in item:
            k, v = item.split(':')
            my_dict[k.strip()] = v.strip()
    except:
        pass
Note: You should not reuse usr_dtls inside your for loop, because that would override your original usr_dtls.
I'd like to split a string with delimiters which are in a list.
The string has this pattern: Firstname, Lastname Email
The list of delimiters, taken from the pattern, is [', ', ' '].
I'd like to split the string to get a list like this
['Firstname', 'Lastname', 'Email']
For a better understanding of my problem, this is what I'm trying to achieve:
The user shall be able to provide a source pattern for the data to be imported: %Fn%, %Ln% %Mail%
and a target pattern for how the data shall be displayed:
%Ln%%Fn%; %Ln%, %Fn%; %Mail%
This is my attempt:
data = "Firstname, Lastname Email"
for delimiter in source_pattern_delimiter:
prog = re.compile(delimiter)
data_tuple = prog.split(data)
How do I 'merge' the data_tuple list(s)?
import re
re.split(re.compile("|".join([", ", " "])), "Firstname, Lastname Email")
Hope it helps.
Seems you want something like this,
>>> s = "Firstname, Lastname Email"
>>> delim = [', ',' ']
>>> re.split(r'(?:' + '|'.join(delim) + r')', s)
['Firstname', 'Lastname', 'Email']
A solution without regexes and if you want to apply a particular delimiter at a particular position:
def split(s, delimiters):
    for d in delimiters:
        item, s = s.split(d, 1)
        yield item
    else:
        yield s
>>> list(split("Firstname, Lastname Email", [", ", " "]))
["Firstname", "Lastname", "Email"]
What about splitting on spaces, then removing any trailing commas?
>>> data = "Firstname, Lastname Email"
>>> [s.rstrip(',') for s in data.split(' ')]
['Firstname', 'Lastname', 'Email']
You are asking for a template-based way to reconstruct the split data. The following script could give you an idea of how to proceed. It first splits the data into the three parts and assigns each to a dictionary entry. This can then be used to apply a target pattern:
import re
data = "Firstname, Lastname Email"
# Find a list of entries and display them
entries = re.findall("(\w+)", data)
print entries
# Convert the entries into a dictionary
dEntries = {"Fn": entries[0], "Ln": entries[1], "Mail": entries[2]}
# Use dictionary-based string formatting to provide a template system
print "%(Ln)s%(Fn)s; %(Ln)s, %(Fn)s; %(Mail)s" % dEntries
This displays the following:
['Firstname', 'Lastname', 'Email']
LastnameFirstname; Lastname, Firstname; Email
If you really need to use the exact template system you have provided then the following could be done to first convert your target pattern into one suitable for use with Python's dictionary system:
def display_with_template(data, target_pattern):
    entries = re.findall("(\w+)", data)
    dEntries = {"Fn": entries[0], "Ln": entries[1], "Mail": entries[2]}
    for item in ["Fn", "Ln", "Mail"]:
        target_pattern = target_pattern.replace("%%%s%%" % item, "%%(%s)s" % item)
    return target_pattern % dEntries

print display_with_template("Firstname, Lastname Email", r"%Ln%%Fn%; %Ln%, %Fn%; %Mail%")
Which would display the same result, but uses a custom target pattern:
LastnameFirstname; Lastname, Firstname; Email
I am trying to build a regex with Python which must match these:
STRING
STRING STRING
STRING (STRING) STRING (STRING)
STRING (STRING) STRING (STRING) STRING (STRING) STRING
I tried to do the job using the optional metacharacter ?, but for the second pattern STRING STRING it doesn't work: I get just the first character after the first string.
\w+\s+\w+?
gives
STRING S
but it should give
STRING STRING
and match on
STRING
STRING STRING
Here is the full code:
import csv
import re
import sys

fname = sys.argv[1]
r = r'(\w+) access = (\w+)\s+Vol ID = (\w+)\s+Snap ID = (\w+)\s+Inode = (\w+)\s+IP = ((\d|\.)+)\s+UID = (\w+)\s+Full Path = (\S+)\s+Handle ID: (\S+)\s+Operation ID: (\S+)\s+Process ID: (\d+)\s+Image File Name: (\w+\s+\w+\s+\w+)\s+Primary User Name: (\S+)\s+Primary Domain: (\S+)\s+Primary Logon ID: (.....\s+......)\s+Client User Name: (\S+)\s+Client Domain: (\S+)\s+Client Logon ID: (\S+)'
regex = re.compile(r)
out = csv.writer(sys.stdout)
f_hdl = open(fname, 'r')
csv_rdr = csv.reader(f_hdl)
header = True
for row in csv_rdr:
    #print row
    if header:
        header = False
    else:
        field = row[-1]
        res = re.search(regex, field)
        if res:
            audit_status = row[3]
            device = row[7]
            date_time = row[0]
            event_id = row[2]
            user = row[6]
            access_source = res.group(1)
            access_type = res.group(2)
            volume = res.group(3)
            snap = res.group(4)
            inode = res.group(5)
            ip = res.group(6)
            uid = res.group(8)
            path = res.group(9)
            handle_id = res.group(10)
            operation_id = res.group(11)
            process_id = res.group(12)
            image_file_name = res.group(13)
            primary_user_name = res.group(14)
            primary_domain = res.group(15)
            primary_logon_id = res.group(16)
            client_user_name = res.group(17)
            client_domain = res.group(18)
            client_logon_id = res.group(19)
            print audit_status, device, date_time, event_id, user, access_source, access_type, volume, snap, inode, ip, uid, path
            out.writerow(
                [audit_status, device, date_time, event_id, user, access_source, access_type, volume, snap, inode, ip, uid, path, handle_id, operation_id, process_id, image_file_name, primary_user_name, primary_domain, primary_logon_id, client_user_name, client_domain, client_logon_id]
            )
        else:
            print 'NOMATCH'
Any suggestions?
Some people, when confronted with a problem, think
“I know, I'll use regular expressions.” Now they have two problems.
If it's a csv file that uses space for separation and parentheses for quoting, use
csv.reader(csvfile, delimiter=' ', quotechar='(')
If it's an even simpler case, use the split method on the string and expand the result to fill all the fields with empty strings:
fields = field.split(' ')
fields = [i or j for i, j in map(None, fields, ('',) * 7)]
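(Note that map(None, ...) only pads like that on Python 2; on Python 3 the same padding could be written with itertools.zip_longest. A sketch, with a made-up sample line just for illustration:)
from itertools import zip_longest

field = "STRING (STRING) STRING"   # sample input, assumed for illustration
fields = field.split(' ')
# pad with empty strings up to 7 fields, like the Python 2 map(None, ...) trick above
fields = [i or j for i, j in zip_longest(fields, ('',) * 7)]
print(fields)   # ['STRING', '(STRING)', 'STRING', '', '', '', '']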
Try this for your regex string:
r = '(\\w+) access = (\\w+)\\s+Vol ID = (\\w+)\\s+Snap ID = (\\w+)\\s+Inode = (\\w+)\\s+IP = ((\\d|\\.)+)\\s+UID = (\\w+)\\s+Full Path = (\\S+)\\s+Handle ID: (\\S+)\\s+Operation ID: (\\S+)\\s+Process ID: (\\d+)\\s+Image File Name: (\\w+\\s+\\w+\\s+\\w+)\\s+Primary User Name: (\\S+)\\s+Primary Domain: (\\S+)\\s+Primary Logon ID: (.....\\s+......)\\s+Client User Name: (\\S+)\\s+Client Domain: (\\S+)\\s+Client Logon ID: (\\S+)\\s+Accesses: (.*)'
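As a side note on the \w+\s+\w+? pattern from the question: the +? quantifier is non-greedy, so once the overall match can succeed it matches as little as possible, which is why only one character of the second word is captured. Making the second word optional but greedy matches both STRING and STRING STRING. A small sketch (not tailored to the full log format, just the two-word case):
import re

print(re.match(r'\w+\s+\w+?', 'STRING STRING').group())     # STRING S  (non-greedy stops early)
print(re.match(r'\w+(?:\s+\w+)?', 'STRING').group())        # STRING
print(re.match(r'\w+(?:\s+\w+)?', 'STRING STRING').group()) # STRING STRING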