I am trying to build a regex in Python that must match all of these:
STRING
STRING STRING
STRING (STRING) STRING (STRING)
STRING (STRING) STRING (STRING) STRING (STRING) STRING
I tried to do the job with the optional metacharacter ?, but for the second pattern, STRING STRING, it doesn't work: I get just the first character after the first string.
\w+\s+\w+?
gives
STRING S
but it should give
STRING STRING
and match on
STRING
STRING STRING
Here is the full code:
import csv
import re
import sys

fname = sys.argv[1]
r = r'(\w+) access = (\w+)\s+Vol ID = (\w+)\s+Snap ID = (\w+)\s+Inode = (\w+)\s+IP = ((\d|\.)+)\s+UID = (\w+)\s+Full Path = (\S+)\s+Handle ID: (\S+)\s+Operation ID: (\S+)\s+Process ID: (\d+)\s+Image File Name: (\w+\s+\w+\s+\w+)\s+Primary User Name: (\S+)\s+Primary Domain: (\S+)\s+Primary Logon ID: (.....\s+......)\s+Client User Name: (\S+)\s+Client Domain: (\S+)\s+Client Logon ID: (\S+)'
regex = re.compile(r)
out = csv.writer(sys.stdout)
f_hdl = open(fname, 'r')
csv_rdr = csv.reader(f_hdl)
header = True
for row in csv_rdr:
    #print row
    if header:
        header = False
    else:
        field = row[-1]
        res = re.search(regex, field)
        if res:
            audit_status = row[3]
            device = row[7]
            date_time = row[0]
            event_id = row[2]
            user = row[6]
            access_source = res.group(1)
            access_type = res.group(2)
            volume = res.group(3)
            snap = res.group(4)
            inode = res.group(5)
            ip = res.group(6)
            uid = res.group(8)
            path = res.group(9)
            handle_id = res.group(10)
            operation_id = res.group(11)
            process_id = res.group(12)
            image_file_name = res.group(13)
            primary_user_name = res.group(14)
            primary_domain = res.group(15)
            primary_logon_id = res.group(16)
            client_user_name = res.group(17)
            client_domain = res.group(18)
            client_logon_id = res.group(19)
            print audit_status, device, date_time, event_id, user, access_source, access_type, volume, snap, inode, ip, uid, path
            out.writerow(
                [audit_status, device, date_time, event_id, user, access_source, access_type, volume, snap, inode, ip, uid, path, handle_id, operation_id, process_id, image_file_name, primary_user_name, primary_domain, primary_logon_id, client_user_name, client_domain, client_logon_id]
            )
        else:
            print 'NOMATCH'
Any suggestions?
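For what it's worth, a short sketch of why \w+? stops after one character, plus one pattern (my guess at the intended grammar, where words may be followed by parenthesized words) that matches all four example lines:

```python
import re

# The lazy quantifier \w+? matches as few characters as possible,
# so after the whitespace it grabs exactly one character and stops:
print(re.match(r'\w+\s+\w+?', 'STRING STRING').group())  # STRING S

# A candidate pattern: a word, then any number of further words
# or (parenthesized) words, separated by whitespace.
pattern = re.compile(r'\w+(?:\s+(?:\(\w+\)|\w+))*')
for line in ['STRING',
             'STRING STRING',
             'STRING (STRING) STRING (STRING)',
             'STRING (STRING) STRING (STRING) STRING (STRING) STRING']:
    print(pattern.match(line).group())
```

Each example line is matched in full, because the greedy + keeps consuming word/parenthesis groups until the input runs out.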
Some people, when confronted with a problem, think
“I know, I'll use regular expressions.” Now they have two problems.
If it's a csv file that uses a space for separation and parentheses for quoting, use
csv.reader(csvfile, delimiter=' ', quotechar='(')
If it's an even simpler case, use the split method on the string and pad the result so all fields are filled, using an empty string:
fields = field.split(' ')
fields = [i or j for i, j in map(None, fields, ('',) * 7)]
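Note that map(None, ...) only works in Python 2; on Python 3 the same padding trick can be sketched with itertools.zip_longest (field here is just a stand-in input):

```python
from itertools import zip_longest

field = 'STRING STRING'  # stand-in input
fields = field.split(' ')
# pad the split result out to 7 fields with empty strings
fields = [i or j for i, j in zip_longest(fields, ('',) * 7, fillvalue='')]
print(fields)  # ['STRING', 'STRING', '', '', '', '', '']
```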
Try this for your regex string:
r = '(\\w+) access = (\\w+)\\s+Vol ID = (\\w+)\\s+Snap ID = (\\w+)\\s+Inode = (\\w+)\\s+IP = ((\\d|\\.)+)\\s+UID = (\\w+)\\s+Full Path = (\\S+)\\s+Handle ID: (\\S+)\\s+Operation ID: (\\S+)\\s+Process ID: (\\d+)\\s+Image File Name: (\\w+\\s+\\w+\\s+\\w+)\\s+Primary User Name: (\\S+)\\s+Primary Domain: (\\S+)\\s+Primary Logon ID: (.....\\s+......)\\s+Client User Name: (\\S+)\\s+Client Domain: (\\S+)\\s+Client Logon ID: (\\S+)\\s+Accesses: (.*)'
I am trying to scrape data from a Word document available at:
https://dl.dropbox.com/s/pj82qrctzkw9137/HE%20Distributors.docx
I need to scrape the Name, Address, City, State, and Email ID. I am able to scrape the E-mail using the code below.
import docx

content = docx.Document('HE Distributors.docx')
location = []
for i in range(len(content.paragraphs)):
    stat = content.paragraphs[i].text
    if 'Email' in stat:
        location.append(i)
for i in location:
    print(content.paragraphs[i].text)
I tried the steps mentioned in:
How to read data from .docx file in python pandas?
I need to convert this into a data frame with all the columns mentioned above, but I am still facing issues.
There are some inconsistencies in the document: phone numbers start with Tel: sometimes, Tel.: other times, and even Te: once; one of the emails is just in the last line for its distributor, without the Email: prefix; and the State isn't always in the last line. Still, for the most part, most of the data can be extracted with regex and/or splits.
The distributors are separated by empty lines, and the names are in a different color - so I defined this function to get the font color of any paragraph from its xml:
# from bs4 import BeautifulSoup
def getParaColor(para):
    try:
        return BeautifulSoup(
            para.paragraph_format.element.xml, 'xml'
        ).find('color').get('w:val')
    except:
        return ''
The try...except hasn't been necessary yet, but just in case...
(The xml is actually also helpful for double-checking that .text hasn't missed anything - in my case, I noticed that the email for Shri Adhya Educational Books wasn't getting extracted.)
Then, you can process the paragraphs from docx.Document with a function like:
# import re
def splitParas(paras):
    ptc = [(
        p.text, getParaColor(p), p.paragraph_format.element.xml
    ) for p in paras]
    curSectn = 'UNKNOWN'
    splitBlox = [{}]
    for pt, pc, px in ptc:
        # double-check for missing text
        xmlText = BeautifulSoup(px, 'xml').text
        xmlText = ' '.join([s for s in xmlText.split() if s != ''])
        if len(xmlText) > len(pt): pt = xmlText
        # initiate
        if not pt:
            if splitBlox[-1] != {}:
                splitBlox.append({})
            continue
        if pc == '20752E':
            curSectn = pt.strip()
            continue
        if splitBlox[-1] == {}:
            splitBlox[-1]['section'] = curSectn
            splitBlox[-1]['raw'] = []
            splitBlox[-1]['Name'] = []
            splitBlox[-1]['address_raw'] = []
        # collect
        splitBlox[-1]['raw'].append(pt)
        if pc == 'D12229':
            splitBlox[-1]['Name'].append(pt)
        elif re.search("^Te.*:.*", pt):
            splitBlox[-1]['tel_raw'] = re.sub("^Te.*:", '', pt).strip()
        elif re.search("^Mob.*:.*", pt):
            splitBlox[-1]['mobile_raw'] = re.sub("^Mob.*:", '', pt).strip()
        elif pt.startswith('Email:') or re.search(".*[#].*[.].*", pt):
            splitBlox[-1]['Email'] = pt.replace('Email:', '').strip()
        else:
            splitBlox[-1]['address_raw'].append(pt)
    # some cleanup
    if splitBlox[-1] == {}: splitBlox = splitBlox[:-1]
    for i in range(len(splitBlox)):
        addrsParas = splitBlox[i]['address_raw']  # for later
        # join lists into strings
        splitBlox[i]['Name'] = ' '.join(splitBlox[i]['Name'])
        for k in ['raw', 'address_raw']:
            splitBlox[i][k] = '\n'.join(splitBlox[i][k])
        # search address for City, State and PostCode
        apLast = addrsParas[-1].split(',')[-1]
        maybeCity = [ap for ap in addrsParas if '–' in ap]
        if '–' not in apLast:
            splitBlox[i]['State'] = apLast.strip()
        if maybeCity:
            maybePIN = maybeCity[-1].split('–')[-1].split(',')[0]
            maybeCity = maybeCity[-1].split('–')[0].split(',')[-1]
            splitBlox[i]['City'] = maybeCity.strip()
            splitBlox[i]['PostCode'] = maybePIN.strip()
        # add mobile to tel
        if 'mobile_raw' in splitBlox[i]:
            if 'tel_raw' not in splitBlox[i]:
                splitBlox[i]['tel_raw'] = splitBlox[i]['mobile_raw']
            else:
                splitBlox[i]['tel_raw'] += (', ' + splitBlox[i]['mobile_raw'])
            del splitBlox[i]['mobile_raw']
        # split tel [as needed]
        if 'tel_raw' in splitBlox[i]:
            tel_i = [t.strip() for t in splitBlox[i]['tel_raw'].split(',')]
            telNum = []
            for t in range(len(tel_i)):
                if '/' in tel_i[t]:
                    tns = [t.strip() for t in tel_i[t].split('/')]
                    tel1 = tns[0]
                    telNum.append(tel1)
                    for tn in tns[1:]:
                        telNum.append(tel1[:-1*len(tn)]+tn)
                else:
                    telNum.append(tel_i[t])
            splitBlox[i]['Tel_1'] = telNum[0]
            splitBlox[i]['Tel'] = telNum[0] if len(telNum) == 1 else telNum
    return splitBlox
(Since I was getting font color anyway, I decided to add another
column called "section" to put East/West/etc in. And I added "PostCode" too, since it seems to be on the other side of "City"...)
Since "raw" is saved, any other value can be double checked manually at least.
The function combines "Mobile" into "Tel" even though they're extracted with separate regex.
I'd say "Tel_1" is fairly reliable, but some of the inconsistent patterns mean that other numbers in "Tel" might come out incorrect if they were separated with '/'.
Also, "Tel" is either a string or a list of strings depending on how many numbers there were in "tel_raw".
After this, you can view it as a DataFrame with:
# import docx
# import pandas
content = docx.Document('HE Distributors.docx')
# pandas.DataFrame(splitParas(content.paragraphs))  # <-- all columns
pandas.DataFrame(splitParas(content.paragraphs))[[
    'section', 'Name', 'address_raw', 'City',
    'PostCode', 'State', 'Email', 'Tel_1', 'tel_raw'
]]
The format is
"FirstName LastName Type_Date-Time_ref_PhoneNumber"
All of this is a single string.
Example: "Yasir Pirkani MCD_20201105-134700_abc123_12345678"
I want to extract Name, Type, Date, Time, ref, Phonenumber from this string.
You can do
a = "Yasir Pirkani MCD_20201105-134700_abc123_12345678"
name = " ".join(a.split(" ")[:2])
Type, Date, ref, Phonenumber = a.replace(name, "").strip().split("_")
Time = Date[Date.find("-")+ 1:]
Date = Date.replace(f"-{Time}", "")
print(name, Type, Date, Time, ref, Phonenumber)
That will output:
Yasir Pirkani MCD 20201105 134700 abc123 12345678
There are multiple ways to do so, one of them is using regex.
import re

s = "Yasir Pirkani MCD_20201105-134700_abc123_12345678"
splited_s = re.split(r'[\s_]+', s)
# splited_s -> ['Yasir', 'Pirkani', 'MCD', '20201105-134700', 'abc123', '12345678']
Then you can access each element of splited_s and process it as needed.
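Another option along the same lines is a single regex with named groups, which labels every piece in one pass (the group names are just labels I chose, not anything mandated by the format):

```python
import re

s = "Yasir Pirkani MCD_20201105-134700_abc123_12345678"
m = re.match(
    r'(?P<name>\w+ \w+) (?P<type>[^_]+)_(?P<date>\d{8})-(?P<time>\d{6})'
    r'_(?P<ref>\w+)_(?P<phone>\d+)',
    s
)
print(m.groupdict())
# {'name': 'Yasir Pirkani', 'type': 'MCD', 'date': '20201105',
#  'time': '134700', 'ref': 'abc123', 'phone': '12345678'}
```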
I am using the following line of code to execute and print data from my SQL database. For some reason, that is the only command that works for me.
json_string = json.dumps(location_query_1)
My question is that when I print json_string it shows data in the following format:
Actions.py code:
class FindByLocation(Action):
    def name(self) -> Text:
        return "action_find_by_location"

    def run(self, dispatcher: CollectingDispatcher,
            tracker: Tracker,
            domain: Dict[Text, Any]) -> List[Dict[Text, Any]]:
        global flag
        location = tracker.get_slot("location")
        price = tracker.get_slot("price")
        cuisine = tracker.get_slot("cuisine")
        print("In find by Location")
        print(location)
        location_query = "SELECT Name FROM Restaurant WHERE Location = '%s' LIMIT 5" % location
        location_count_query = "SELECT COUNT(Name) FROM Restaurant WHERE Location = '%s'" % location
        location_query_1 = getData(location_query)
        location_count_query_1 = getData(location_count_query)
        if not location_query_1:
            flag = 1
            sublocation_view_query = "CREATE VIEW SublocationView AS SELECT RestaurantID, Name, PhoneNumber, Rating, PriceRange, Location, Sublocation FROM Restaurant WHERE Sublocation = '%s'" % (location)
            sublocation_view = getData(sublocation_view_query)
            dispatcher.utter_message(text="یہ جگہ کس ایریا میں ہے")
        else:
            flag = 0
            if cuisine is None and price is None:
                json_string = json.dumps(location_query_1)
                print(isinstance(json_string, str))
                print("Check here")
                list_a = json_string.split(',')
                remove = ["'", '"', '[', ']']
                for i in remove:
                    list_a = [s.replace(i, '') for s in list_a]
                dispatcher.utter_message(text="Restaurants in Location only: ")
                dispatcher.utter_message(list_a)
What should I do so that the data is shown in a vertical list format (one item per line) and without the brackets and quotation marks? Thank you
First of all, have you tried reading your data into a pandas object? I have done some programs with a sqlite database and this worked for me:
df = pd.read_sql_query("SELECT * FROM {}".format(self.tablename), conn)
But now to the string formatting part:
# this code should do the work for you
# first of all we have our string a like yours
a = "[['hallo'],['welt'],['kannst'],['du'],['mich'],['hoeren?']]"
# now we split the string into a list on every comma
list_a = a.split(',')
# this is our list of chars we want to remove
remove = ["'", '"', '[', ']']
# now we replace each of them, step by step, with nothing
for i in remove:
    list_a = [s.replace(i, '') for s in list_a]
print(list_a)
for z in list_a:
    print(z)
The output is then:
['hallo', 'welt', 'kannst', 'du', 'mich', 'hoeren?']
hallo
welt
kannst
du
mich
hoeren?
I hope I could help.
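Since json_string in the question comes straight from json.dumps, another option is to parse it back with json.loads instead of stripping characters by hand. A sketch, with the sample data rewritten as valid JSON (double quotes):

```python
import json

a = '[["hallo"], ["welt"], ["kannst"], ["du"], ["mich"], ["hoeren?"]]'
rows = json.loads(a)                            # back to a real nested list
lines = [item for row in rows for item in row]  # flatten the inner lists
print('\n'.join(lines))                         # one item per line
```

This avoids the character-stripping loop entirely, and it can't accidentally remove brackets or quotes that are part of the data itself.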
I am using Zapier to catch a webhook and use that info for an API post. The action code runs perfectly fine with "4111111111111111" in place of Ccnum in doSale. But when I use the input_data variable and place it in doSale, it throws an error.
Zapier Input Variable:
Zapier Error:
Python code:
import pycurl
import urllib
import urlparse
import StringIO

class gwapi():
    def __init__(self):
        self.login = dict()
        self.order = dict()
        self.billing = dict()
        self.shipping = dict()
        self.responses = dict()

    def setLogin(self, username, password):
        self.login['password'] = password
        self.login['username'] = username

    def setOrder(self, orderid, orderdescription, tax, shipping, ponumber, ipadress):
        self.order['orderid'] = orderid
        self.order['orderdescription'] = orderdescription
        self.order['shipping'] = '{0:.2f}'.format(float(shipping))
        self.order['ipaddress'] = ipadress
        self.order['tax'] = '{0:.2f}'.format(float(tax))
        self.order['ponumber'] = ponumber

    def setBilling(self, firstname, lastname, company, address1, address2,
                   city, state, zip, country, phone, fax, email, website):
        self.billing['firstname'] = firstname
        self.billing['lastname'] = lastname
        self.billing['company'] = company
        self.billing['address1'] = address1
        self.billing['address2'] = address2
        self.billing['city'] = city
        self.billing['state'] = state
        self.billing['zip'] = zip
        self.billing['country'] = country
        self.billing['phone'] = phone
        self.billing['fax'] = fax
        self.billing['email'] = email
        self.billing['website'] = website

    def setShipping(self, firstname, lastname, company, address1, address2,
                    city, state, zipcode, country, email):
        self.shipping['firstname'] = firstname
        self.shipping['lastname'] = lastname
        self.shipping['company'] = company
        self.shipping['address1'] = address1
        self.shipping['address2'] = address2
        self.shipping['city'] = city
        self.shipping['state'] = state
        self.shipping['zip'] = zipcode
        self.shipping['country'] = country
        self.shipping['email'] = email

    def doSale(self, amount, ccnumber, ccexp, cvv=''):
        query = ""
        # Login Information
        query = query + "username=" + urllib.quote(self.login['username']) + "&"
        query += "password=" + urllib.quote(self.login['password']) + "&"
        # Sales Information
        query += "ccnumber=" + urllib.quote(ccnumber) + "&"
        query += "ccexp=" + urllib.quote(ccexp) + "&"
        query += "amount=" + urllib.quote('{0:.2f}'.format(float(amount))) + "&"
        if (cvv != ''):
            query += "cvv=" + urllib.quote(cvv) + "&"
        # Order Information
        for key, value in self.order.iteritems():
            query += key + "=" + urllib.quote(str(value)) + "&"
        # Billing Information
        for key, value in self.billing.iteritems():
            query += key + "=" + urllib.quote(str(value)) + "&"
        # Shipping Information
        for key, value in self.shipping.iteritems():
            query += key + "=" + urllib.quote(str(value)) + "&"
        query += "type=sale"
        return self.doPost(query)

    def doPost(self, query):
        responseIO = StringIO.StringIO()
        curlObj = pycurl.Curl()
        curlObj.setopt(pycurl.POST, 1)
        curlObj.setopt(pycurl.CONNECTTIMEOUT, 30)
        curlObj.setopt(pycurl.TIMEOUT, 30)
        curlObj.setopt(pycurl.HEADER, 0)
        curlObj.setopt(pycurl.SSL_VERIFYPEER, 0)
        curlObj.setopt(pycurl.WRITEFUNCTION, responseIO.write)
        curlObj.setopt(pycurl.URL, "https://secure.merchantonegateway.com/api/transact.php")
        curlObj.setopt(pycurl.POSTFIELDS, query)
        curlObj.perform()
        data = responseIO.getvalue()
        temp = urlparse.parse_qs(data)
        for key, value in temp.iteritems():
            self.responses[key] = value[0]
        return self.responses['response']

# NOTE: your username and password should replace the ones below
Ccnum = input_data['Ccnum']  # this variable I would like to use in
                             # the gw.doSale below

gw = gwapi()
gw.setLogin("demo", "password")
gw.setBilling("John", "Smith", "Acme, Inc.", "123 Main St", "Suite 200", "Beverly Hills",
              "CA", "90210", "US", "555-555-5555", "555-555-5556", "support#example.com",
              "www.example.com")
r = gw.doSale("5.00", Ccnum, "1212", '999')
print gw.responses['response']
if (int(gw.responses['response']) == 1):
    print "Approved"
elif (int(gw.responses['response']) == 2):
    print "Declined"
elif (int(gw.responses['response']) == 3):
    print "Error"
Towards the end is where the problems are. How can I pass the variables from Zapier into the Python code?
David here, from the Zapier Platform team. A few things.
First, I think your issue is the one described here. Namely, I believe input_data's values are unicode. So you'll want to call str(input_data['Ccnum']) instead.
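A minimal sketch of that fix (input_data is mocked here for illustration; in a real Code step Zapier supplies it):

```python
# input_data is provided by Zapier; mocked here for illustration
input_data = {'Ccnum': u'4111111111111111'}

# input_data values arrive as unicode; convert before handing them to
# code (urllib.quote / pycurl) that expects a plain str
Ccnum = str(input_data['Ccnum'])
print(Ccnum)  # 4111111111111111
```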
Alternatively, if you want to use Requests, it's also supported and is a lot less finicky.
All that said, I would be remiss if I didn't mention that everything in Zapier code steps gets logged in plain text internally. For that reason, I'd strongly recommend against putting credit card numbers, your password for this service, and any other sensitive data through a Code step. A private server that you control is a much safer option.
Let me know if you've got any other questions!
I am trying to find a line in a text that has 10 lines.
desc = re.findall(r'#description (.*)', comment.strip())
What happens is that it returns the #description, but it also returns 9 empty lists.
print(desc)
returns:
[]
[]
[]
[]
[]
[]
[]
[]
['the desc is here']
[]
So how do I get rid of those empty [] and make desc=['the desc is here']?
update
I tried list(filter(...)) and still get the same return.
The comment contains:
/**
* #param string username required the username of the registering user
* #param string password required
* #param string first_name required
* #param string last_name required
* #param string email required
* #package authentication
* #info user registration
* #description register a new user into the groupjump platform
*/
update
comment is a full string, so I split it like this so I can read it line by line:
comments = route['comment']
comments = list(filter(None, comments.split('\n')))
actual code:
#!/usr/bin/env python3
import re

routes = []
description = ''

with open('troutes.php', 'r') as f:
    current_comment = ''
    in_comment = False
    for line in f:
        line = line.lstrip()
        if line.startswith('/**'):
            in_comment = True
        if in_comment:
            current_comment += line
        if line.startswith('*/'):
            in_comment = False
        if line.startswith('Route::'):
            matches = re.search(r"Route::([A-Z]+)\('(.*)', '(.*)'\);", line)
            groups = matches.groups()
            routes.append({
                'comment': current_comment,
                'method': groups[0],
                'path': groups[1],
                'handler': groups[2],
            })
            current_comment = ''  # reset the comment

for route in routes:
    # get comments
    comments = route['comment']
    comments = list(filter(None, comments.split('\n')))
    for comment in comments:
        params = re.findall(r'#param (.*)', comment.strip())
        object = re.findall(r'#package (.*)', comment.strip())
        info = re.findall(r'#info (.*)', comment.strip())
        desc = re.search(r'#description (.*)', comment.strip())
        print(comment[15:])
data being read:
<?php
/**
* #param string username required the username of the registering user
* #param string password required
* #param string first_name required
* #param string last_name required
* #param string email required
* #package authentication
* #info user registration
* #description register a new user into the groupjump platform
*/
Route::POST('v3/register', 'UserController#Register');
/**
* #param string username required the username of the registering user
* #param string password required
*/
Route::GET('v3/login', 'UserController#login');
A condition for a single list is just:
if desc:
    print(desc)
It is a shorthand version of:
if len(desc) > 0:
    print(desc)
For a list of lists it's:
desc = [d for d in desc if d]
To get only the string do this:
if desc:
    print(desc[0])
It seems like you're matching the pattern line by line. Why don't you match against the whole comment?
>>> comment = '''/**
... * #param string username required the username of the registering user
... * #param string password required
... * #param string first_name required
... * #param string last_name required
... * #param string email required
... * #package authentication
... * #info user registration
... * #description register a new user into the groupjump platform
... */'''
>>>
>>> import re
>>> desc = re.findall(r'#description (.*)', comment)
>>> desc
['register a new user into the groupjump platform']
You can filter the empty strings out of the result with a list comprehension:
desc = re.findall(r'#description (.*)', comment.strip())
desc = [d for d in desc if d]
Another solution is to print an element only if it isn't empty:
desc = re.findall(r'#description (.*)', comment.strip())
for d in desc:
    if d:  # skip empty matches
        print(d)
To get your code to work you need to search a single string; if you have 10 lines, join them first:
joined = "\n".join(lines)
for i in re.findall(r'#description (.*)', joined):
    print(i)