I am searching for a line in a text that has 10 lines.
desc = re.findall(r'#description (.*)', comment.strip())
It returns the #description match, but it also prints 9 empty lists.
print(desc)
returns:
[]
[]
[]
[]
[]
[]
[]
[]
['the desc is here']
[]
So how do I get rid of those empty [] and end up with desc = ['the desc is here']?
Update:
I tried list(filter(...)) and I still get the same output.
The comment contains:
/**
* #param string username required the username of the registering user
* #param string password required
* #param string first_name required
* #param string last_name required
* #param string email required
* #package authentication
* #info user registration
* #description register a new user into the groupjump platform
*/
Update:
comment is a full string, so I split it like this so I can read it line by line:
comments = route['comment']
comments = list(filter(None, comments.split('\n')))
Actual code:
#!/usr/bin/env python3
import re

routes = []
description = ''

with open('troutes.php', 'r') as f:
    current_comment = ''
    in_comment = False
    for line in f:
        line = line.lstrip()
        if line.startswith('/**'):
            in_comment = True
        if in_comment:
            current_comment += line
        if line.startswith('*/'):
            in_comment = False
        if line.startswith('Route::'):
            matches = re.search(r"Route::([A-Z]+)\('(.*)', '(.*)'\);", line)
            groups = matches.groups()
            routes.append({
                'comment': current_comment,
                'method': groups[0],
                'path': groups[1],
                'handler': groups[2],
            })
            current_comment = ''  # reset the comment

for route in routes:
    # get comments
    comments = route['comment']
    comments = list(filter(None, comments.split('\n')))
    for comment in comments:
        params = re.findall(r'#param (.*)', comment.strip())
        object = re.findall(r'#package (.*)', comment.strip())
        info = re.findall(r'#info (.*)', comment.strip())
        desc = re.search(r'#description (.*)', comment.strip())
        print(comment[15:])
Data being read:
<?php
/**
* #param string username required the username of the registering user
* #param string password required
* #param string first_name required
* #param string last_name required
* #param string email required
* #package authentication
* #info user registration
* #description register a new user into the groupjump platform
*/
Route::POST('v3/register', 'UserController#Register');
/**
* #param string username required the username of the registering user
* #param string password required
*/
Route::GET('v3/login', 'UserController#login');
A condition for a single list is just:
if desc:
    print(desc)
It is a shorthand version of:
if len(desc) > 0:
    print(desc)
For a list of lists it's:
desc = [d for d in desc if d]
To get only the string do this:
if desc:
    print(desc[0])
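Putting both ideas together, here is a minimal sketch (with a few hypothetical input lines standing in for the comment) of the line-by-line loop with the truthiness check applied:

```python
import re

# Hypothetical lines, mimicking the comment block from the question.
lines = [
    '* #info user registration',
    '* #description register a new user into the groupjump platform',
    '*/',
]

description = ''
for line in lines:
    desc = re.findall(r'#description (.*)', line.strip())
    if desc:                 # skip lines where findall returned []
        description = desc[0]

print(description)  # register a new user into the groupjump platform
```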
It seems like you're matching the pattern line by line. Why don't you match against the whole comment?
>>> comment = '''/**
... * #param string username required the username of the registering user
... * #param string password required
... * #param string first_name required
... * #param string last_name required
... * #param string email required
... * #package authentication
... * #info user registration
... * #description register a new user into the groupjump platform
... */'''
>>>
>>> import re
>>> desc = re.findall(r'#description (.*)', comment)
>>> desc
['register a new user into the groupjump platform']
You can filter the empty results out of the list with a list comprehension:
desc = re.findall(r'#description (.*)', comment.strip())
desc = [d for d in desc if d]
Another solution is to print each element only if it isn't empty:
desc = re.findall(r'#description (.*)', comment.strip())
for d in desc:
    if d:  # skip empty matches
        print(d)
To get your code to work you need to operate on a single string; if you have 10 lines, join them first:
joined = "\n".join(lines)
for i in re.findall(r'#description (.*)', joined):
    print(i)
I am trying to scrape data from a Word document available at:
https://dl.dropbox.com/s/pj82qrctzkw9137/HE%20Distributors.docx
I need to scrape the Name, Address, City, State, and Email ID. I am able to scrape the E-mail using the code below.
import docx

content = docx.Document('HE Distributors.docx')
location = []
for i in range(len(content.paragraphs)):
    stat = content.paragraphs[i].text
    if 'Email' in stat:
        location.append(i)
for i in location:
    print(content.paragraphs[i].text)
I tried to use the steps mentioned:
How to read data from .docx file in python pandas?
I need to convert this into a data frame with all the columns mentioned above.
Still facing issues with the same.
There are some inconsistencies in the document: phone numbers start with Tel: sometimes, Tel.: other times, and even Te: once; one of the emails is just on the last line for that distributor without the Email: prefix; and the State isn't always on the last line. Still, for the most part, most of the data can be extracted with regex and/or splits.
The distributors are separated by empty lines, and the names are in a different color - so I defined this function to get the font color of any paragraph from its xml:
# from bs4 import BeautifulSoup
def getParaColor(para):
    try:
        return BeautifulSoup(
            para.paragraph_format.element.xml, 'xml'
        ).find('color').get('w:val')
    except:
        return ''
The try...except hasn't been necessary yet, but just in case...
(The xml is actually also helpful for double-checking that .text hasn't missed anything - in my case, I noticed that the email for Shri Adhya Educational Books wasn't getting extracted.)
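As a side note, the same double-check can be sketched with the standard library alone, swapping BeautifulSoup for xml.etree.ElementTree; the w:p fragment below is a made-up minimal example, not taken from the actual document:

```python
import xml.etree.ElementTree as ET

def xml_text(paragraph_xml):
    """Collect every text node in a paragraph's XML and normalize
    whitespace; runs are joined with single spaces (an approximation)."""
    root = ET.fromstring(paragraph_xml)
    return ' '.join(' '.join(root.itertext()).split())

# A minimal, made-up w:p fragment with two runs:
fragment = ('<w:p xmlns:w="urn:example">'
            '<w:r><w:t>Hello</w:t></w:r> <w:r><w:t>world</w:t></w:r></w:p>')
print(xml_text(fragment))  # Hello world
```

Comparing this against the paragraph's .text is a cheap way to spot runs that python-docx missed.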
Then, you can process the paragraphs from docx.Document with a function like:
# import re
def splitParas(paras):
    ptc = [(
        p.text, getParaColor(p), p.paragraph_format.element.xml
    ) for p in paras]
    curSectn = 'UNKNOWN'
    splitBlox = [{}]
    for pt, pc, px in ptc:
        # double-check for missing text
        xmlText = BeautifulSoup(px, 'xml').text
        xmlText = ' '.join([s for s in xmlText.split() if s != ''])
        if len(xmlText) > len(pt): pt = xmlText
        # initiate
        if not pt:
            if splitBlox[-1] != {}:
                splitBlox.append({})
            continue
        if pc == '20752E':
            curSectn = pt.strip()
            continue
        if splitBlox[-1] == {}:
            splitBlox[-1]['section'] = curSectn
            splitBlox[-1]['raw'] = []
            splitBlox[-1]['Name'] = []
            splitBlox[-1]['address_raw'] = []
        # collect
        splitBlox[-1]['raw'].append(pt)
        if pc == 'D12229':
            splitBlox[-1]['Name'].append(pt)
        elif re.search("^Te.*:.*", pt):
            splitBlox[-1]['tel_raw'] = re.sub("^Te.*:", '', pt).strip()
        elif re.search("^Mob.*:.*", pt):
            splitBlox[-1]['mobile_raw'] = re.sub("^Mob.*:", '', pt).strip()
        elif pt.startswith('Email:') or re.search(".*[#].*[.].*", pt):
            splitBlox[-1]['Email'] = pt.replace('Email:', '').strip()
        else:
            splitBlox[-1]['address_raw'].append(pt)
    # some cleanup
    if splitBlox[-1] == {}: splitBlox = splitBlox[:-1]
    for i in range(len(splitBlox)):
        addrsParas = splitBlox[i]['address_raw']  # for later
        # join lists into strings
        splitBlox[i]['Name'] = ' '.join(splitBlox[i]['Name'])
        for k in ['raw', 'address_raw']:
            splitBlox[i][k] = '\n'.join(splitBlox[i][k])
        # search address for City, State and PostCode
        apLast = addrsParas[-1].split(',')[-1]
        maybeCity = [ap for ap in addrsParas if '–' in ap]
        if '–' not in apLast:
            splitBlox[i]['State'] = apLast.strip()
        if maybeCity:
            maybePIN = maybeCity[-1].split('–')[-1].split(',')[0]
            maybeCity = maybeCity[-1].split('–')[0].split(',')[-1]
            splitBlox[i]['City'] = maybeCity.strip()
            splitBlox[i]['PostCode'] = maybePIN.strip()
        # add mobile to tel
        if 'mobile_raw' in splitBlox[i]:
            if 'tel_raw' not in splitBlox[i]:
                splitBlox[i]['tel_raw'] = splitBlox[i]['mobile_raw']
            else:
                splitBlox[i]['tel_raw'] += (', ' + splitBlox[i]['mobile_raw'])
            del splitBlox[i]['mobile_raw']
        # split tel [as needed]
        if 'tel_raw' in splitBlox[i]:
            tel_i = [t.strip() for t in splitBlox[i]['tel_raw'].split(',')]
            telNum = []
            for t in range(len(tel_i)):
                if '/' in tel_i[t]:
                    tns = [t.strip() for t in tel_i[t].split('/')]
                    tel1 = tns[0]
                    telNum.append(tel1)
                    for tn in tns[1:]:
                        telNum.append(tel1[:-1*len(tn)] + tn)
                else:
                    telNum.append(tel_i[t])
            splitBlox[i]['Tel_1'] = telNum[0]
            splitBlox[i]['Tel'] = telNum[0] if len(telNum) == 1 else telNum
    return splitBlox
(Since I was getting font color anyway, I decided to add another column called "section" to put East/West/etc. in. And I added "PostCode" too, since it seems to be on the other side of "City"...)
Since "raw" is saved, any other value can at least be double-checked manually.
The function combines "Mobile" into "Tel" even though they're extracted with separate regexes.
I'd say "Tel_1" is fairly reliable, but some of the inconsistent patterns mean that other numbers in "Tel" might come out incorrect if they were separated with '/'.
Also, "Tel" is either a string or a list of strings depending on how many numbers there were in "tel_raw".
After this, you can just view as DataFrame with:
# import docx
# import pandas
content = docx.Document('HE Distributors.docx')
# pandas.DataFrame(splitParas(content.paragraphs))  # <-- all columns
pandas.DataFrame(splitParas(content.paragraphs))[[
    'section', 'Name', 'address_raw', 'City',
    'PostCode', 'State', 'Email', 'Tel_1', 'tel_raw'
]]
I am using the following line of code for executing and printing data from my SQL database. For some reason it is the only command that works for me.
json_string = json.dumps(location_query_1)
My question is that when I print json_string it shows data in the following format:
Actions.py code:
class FindByLocation(Action):
    def name(self) -> Text:
        return "action_find_by_location"

    def run(self, dispatcher: CollectingDispatcher,
            tracker: Tracker,
            domain: Dict[Text, Any]) -> List[Dict[Text, Any]]:
        global flag
        location = tracker.get_slot("location")
        price = tracker.get_slot("price")
        cuisine = tracker.get_slot("cuisine")
        print("In find by Location")
        print(location)
        location_query = "SELECT Name FROM Restaurant WHERE Location = '%s' LIMIT 5" % location
        location_count_query = "SELECT COUNT(Name) FROM Restaurant WHERE Location = '%s'" % location
        location_query_1 = getData(location_query)
        location_count_query_1 = getData(location_count_query)
        if not location_query_1:
            flag = 1
            sublocation_view_query = "CREATE VIEW SublocationView AS SELECT RestaurantID, Name, PhoneNumber, Rating, PriceRange, Location, Sublocation FROM Restaurant WHERE Sublocation = '%s'" % (location)
            sublocation_view = getData(sublocation_view_query)
            dispatcher.utter_message(text="یہ جگہ کس ایریا میں ہے")  # "Which area is this place in?"
        else:
            flag = 0
        if cuisine is None and price is None:
            json_string = json.dumps(location_query_1)
            print(isinstance(json_string, str))
            print("Check here")
            list_a = json_string.split(',')
            remove = ["'", '"', '[', ']']
            for i in remove:
                list_a = [s.replace(i, '') for s in list_a]
            dispatcher.utter_message(text="Restaurants in Location only: ")
            dispatcher.utter_message(list_a)
What should I do so that the data is shown in a vertical list format (one item per line) and without the brackets and quotation marks? Thank you
First of all, have you tried reading your data into a pandas object? I have done some programs with a sqlite database and this worked for me:
df = pd.read_sql_query("SELECT * FROM {}".format(self.tablename), conn)
But now to the string formatting part:
# this code should do the work for you
# first of all we have our string a like yours
a = "[['hallo'],['welt'],['kannst'],['du'],['mich'],['hoeren?']]"
# now we split the string into a list on every ,
list_a = a.split(',')
# this is our list with chars we want to remove
remove = ["'", '"', '[', ']']
# now we replace all elements step by step with nothing
for i in remove:
    list_a = [s.replace(i, '') for s in list_a]
print(list_a)
for z in list_a:
    print(z)
The output is then:
['hallo', 'welt', 'kannst', 'du', 'mich', 'hoeren?']
hallo
welt
kannst
du
mich
hoeren?
I hope I could help.
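Alternatively, since json_string was produced by json.dumps in the first place, you could parse it back with json.loads and flatten the nested lists, avoiding the character-stripping entirely. This is a sketch, not the original code; rows_to_lines is a made-up helper name:

```python
import json

def rows_to_lines(json_string):
    """Parse the dumped query result back into Python and flatten
    one level of nesting: [["a"], ["b"]] -> ["a", "b"]."""
    rows = json.loads(json_string)
    return [item for row in rows for item in row]

names = rows_to_lines('[["hallo"], ["welt"], ["hoeren?"]]')
print('\n'.join(names))  # one name per line, no brackets or quotes
```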
I was trying to figure out a way to get simple_salesforce to just give me all the field names in a list. I want to create a SOQL query that pretty much does the same thing as a SELECT * does in SQL.
for obj in objects:
    fields = [x["name"] for x in sf[obj].describe()["fields"]]
thanks
A list of field names in an object can be achieved as follow:
def getObjectFields(obj):
    fields = getattr(sf, obj).describe()['fields']
    flist = [i['name'] for i in fields]
    return flist

getObjectFields('Contact')
Your query to get the effect of SELECT * would then look something like this:
sf.query_all('SELECT {} FROM Contact LIMIT 10'.format(','.join(getObjectFields('Contact'))))
On a related note:
In case it is helpful, a dictionary of label/name pairs can be achieved as follows:
def getObjectFieldsDict(obj):
    fields = getattr(sf, obj).describe()['fields']
    fdict = {}
    for i in fields:
        fdict[i['label']] = i['name']
    return fdict

getObjectFieldsDict('Contact')
I find this can be useful for figuring out the names of fields with labels that do not follow the standard format (e.g. the "My Favorite Website" field label for the "Favorite_Website__c" field name).
This method will return a query string with all fields for the object passed in. Well, all the fields the user has access to.
public static String getFullObjectQuery(String sObjectName){
    Schema.SObjectType convertType = Schema.getGlobalDescribe().get(sObjectName);
    Map<String,Schema.SObjectField> fieldMap = convertType.getDescribe().Fields.getMap();
    Set<String> fields = fieldMap.keySet();
    String query = 'SELECT ';
    for(String field : fields){
        Schema.DescribeFieldResult dfr = fieldMap.get(field).getDescribe();
        if(dfr.isAccessible()){
            query += field + ',';
        }
    }
    query = query.substring(0, query.length() - 1);
    query += ' FROM ' + sObjectName;
    return query;
}
#!/usr/bin/env python3
import argparse
import os

import simple_salesforce

parser = argparse.ArgumentParser()
parser.add_argument('--sandbox', action='store_true',
                    help='Use a sandbox')
parser.add_argument('sfobject', nargs='+', action='store',
                    help='Salesforce object to query (e.g. Contact)')
args = parser.parse_args()

sf = simple_salesforce.Salesforce(
    username=os.getenv('USERNAME'),
    password=os.getenv('PASSWORD'),
    security_token=os.getenv('SECURITY_TOKEN'),
    sandbox=args.sandbox)

for sfobject in args.sfobject:
    print(sfobject)
    fields = [x['name'] for x in getattr(sf, sfobject).describe()['fields']]
    print(fields)
This is a strange question, I know... I have a regular expression like:
rex = r"at (?P<hour>[0-2][0-9]) send email to (?P<name>\w*):? (?P<message>.+)"
so if I match that like this:
match = re.match(rex, "at 10 send email to bob: hi bob!")
match.groupdict() gives me this dict:
{"hour": "10", "name": "bob", "message": "hi bob!"}
My question is: given the dict above and rex, can I make a function that returns the original text? I know that many texts can match the same dict (in this case the ':' after the name is optional), but I want just one of the infinitely many texts that would match.
Using inverse_regex (note that this module is Python 2 code):
"""
http://www.mail-archive.com/python-list@python.org/msg125198.html
"""
import itertools as IT
import sre_constants as sc
import sre_parse
import string

# Generate strings that match a given regex
category_chars = {
    sc.CATEGORY_DIGIT : string.digits,
    sc.CATEGORY_SPACE : string.whitespace,
    sc.CATEGORY_WORD : string.digits + string.letters + '_'
}

def unique_extend(res_list, list):
    for item in list:
        if item not in res_list:
            res_list.append(item)

def handle_any(val):
    """
    This is different from normal regexp matching. It only matches
    printable ASCII characters.
    """
    return string.printable

def handle_branch((tok, val)):
    all_opts = []
    for toks in val:
        opts = permute_toks(toks)
        unique_extend(all_opts, opts)
    return all_opts

def handle_category(val):
    return list(category_chars[val])

def handle_in(val):
    out = []
    for tok, val in val:
        out += handle_tok(tok, val)
    return out

def handle_literal(val):
    return [chr(val)]

def handle_max_repeat((min, max, val)):
    """
    Handle a repeat token such as {x,y} or ?.
    """
    subtok, subval = val[0]
    if max > 5000:
        # max is the number of cartesian join operations needed to be
        # carried out. More than 5000 consumes way too much memory.
        # raise ValueError("Too many repetitions requested (%d)" % max)
        max = 5000
    optlist = handle_tok(subtok, subval)
    iterlist = []
    for x in range(min, max + 1):
        joined = IT.product(*[optlist]*x)
        iterlist.append(joined)
    return (''.join(it) for it in IT.chain(*iterlist))

def handle_range(val):
    lo, hi = val
    return (chr(x) for x in range(lo, hi + 1))

def handle_subpattern(val):
    return list(permute_toks(val[1]))

def handle_tok(tok, val):
    """
    Returns a list of strings of possible permutations for this regexp
    token.
    """
    handlers = {
        sc.ANY : handle_any,
        sc.BRANCH : handle_branch,
        sc.CATEGORY : handle_category,
        sc.LITERAL : handle_literal,
        sc.IN : handle_in,
        sc.MAX_REPEAT : handle_max_repeat,
        sc.RANGE : handle_range,
        sc.SUBPATTERN : handle_subpattern}
    try:
        return handlers[tok](val)
    except KeyError, e:
        fmt = "Unsupported regular expression construct: %s"
        raise ValueError(fmt % tok)

def permute_toks(toks):
    """
    Returns a generator of strings of possible permutations for this
    regexp token list.
    """
    lists = [handle_tok(tok, val) for tok, val in toks]
    return (''.join(it) for it in IT.product(*lists))


########## PUBLIC API ####################

def ipermute(p):
    return permute_toks(sre_parse.parse(p))
You could apply the substitutions given rex and data, and then use inverse_regex.ipermute to generate strings that match the original regex:
import re
import itertools as IT
import inverse_regex as ire

rex = r"(?:at (?P<hour>[0-2][0-9])|today) send email to (?P<name>\w*):? (?P<message>.+)"
match = re.match(rex, "at 10 send email to bob: hi bob!")
data = match.groupdict()
del match
new_regex = re.sub(r'[(][?]P<([^>]+)>[^)]*[)]', lambda m: data.get(m.group(1)), rex)
for s in IT.islice(ire.ipermute(new_regex), 10):
    print(s)
yields
today send email to bob hi bob!
today send email to bob: hi bob!
at 10 send email to bob hi bob!
at 10 send email to bob: hi bob!
Note: I modified the original inverse_regex to not raise a ValueError when the regex contains *s. Instead, the * is changed to be effectively like {,5000} so you'll at least get some permutations.
This is one of the texts that will match the regex:
'at {hour} send email to {name}: {message}'.format(**match.groupdict())
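That template idea can be generalized by substituting each named group in the pattern with its captured value. The sketch below only handles the simple constructs used in this particular pattern, and fill_template is a made-up helper name, not a library function:

```python
import re

def fill_template(rex, data):
    """Replace each named group (?P<name>...) in the pattern with its
    captured value, then drop the leftover optional-colon marker.
    Only a sketch: nested groups and other constructs are not handled."""
    text = re.sub(r'\(\?P<([^>]+)>[^)]*\)', lambda m: data[m.group(1)], rex)
    return text.replace(':?', ':')

rex = r"at (?P<hour>[0-2][0-9]) send email to (?P<name>\w*):? (?P<message>.+)"
data = {"hour": "10", "name": "bob", "message": "hi bob!"}
print(fill_template(rex, data))  # at 10 send email to bob: hi bob!
```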
I am trying to build a regex in Python which must match all of these:
STRING
STRING STRING
STRING (STRING) STRING (STRING)
STRING (STRING) STRING (STRING) STRING (STRING) STRING
I tried to do the job using the optional metacharacter ?, but for the second pattern STRING STRING it doesn't work: I get just the first character after the first string.
\w+\s+\w+?
gives
STRING S
but should give
STRING STRING
and match on
STRING
STRING STRING
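The lazy quantifier behaviour can be demonstrated directly, and one alternative pattern is sketched below; it simply treats each token after the first word as a word that may or may not be wrapped in parentheses (an assumption about the input shapes, not the asker's actual data):

```python
import re

# \w+? is lazy: it matches as little as possible, so only one
# character of the second word is consumed.
m = re.match(r'\w+\s+\w+?', 'STRING STRING')
print(m.group())  # STRING S

# Sketch of a pattern for all four shapes: a word followed by any
# number of further tokens, each optionally wrapped in parentheses.
pattern = re.compile(r'\w+(?:\s+\(?\w+\)?)*$')
samples = [
    'STRING',
    'STRING STRING',
    'STRING (STRING) STRING (STRING)',
    'STRING (STRING) STRING (STRING) STRING (STRING) STRING',
]
print(all(pattern.match(s) for s in samples))  # True
```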
Here is the full code:
import csv
import re
import sys

fname = sys.argv[1]

r = r'(\w+) access = (\w+)\s+Vol ID = (\w+)\s+Snap ID = (\w+)\s+Inode = (\w+)\s+IP = ((\d|\.)+)\s+UID = (\w+)\s+Full Path = (\S+)\s+Handle ID: (\S+)\s+Operation ID: (\S+)\s+Process ID: (\d+)\s+Image File Name: (\w+\s+\w+\s+\w+)\s+Primary User Name: (\S+)\s+Primary Domain: (\S+)\s+Primary Logon ID: (.....\s+......)\s+Client User Name: (\S+)\s+Client Domain: (\S+)\s+Client Logon ID: (\S+)'

regex = re.compile(r)
out = csv.writer(sys.stdout)
f_hdl = open(fname, 'r')
csv_rdr = csv.reader(f_hdl)
header = True
for row in csv_rdr:
    # print(row)
    if header:
        header = False
    else:
        field = row[-1]
        res = re.search(regex, field)
        if res:
            audit_status = row[3]
            device = row[7]
            date_time = row[0]
            event_id = row[2]
            user = row[6]
            access_source = res.group(1)
            access_type = res.group(2)
            volume = res.group(3)
            snap = res.group(4)
            inode = res.group(5)
            ip = res.group(6)
            uid = res.group(8)
            path = res.group(9)
            handle_id = res.group(10)
            operation_id = res.group(11)
            process_id = res.group(12)
            image_file_name = res.group(13)
            primary_user_name = res.group(14)
            primary_domain = res.group(15)
            primary_logon_id = res.group(16)
            client_user_name = res.group(17)
            client_domain = res.group(18)
            client_logon_id = res.group(19)
            print(audit_status, device, date_time, event_id, user, access_source, access_type, volume, snap, inode, ip, uid, path)
            out.writerow(
                [audit_status, device, date_time, event_id, user, access_source, access_type, volume, snap, inode, ip, uid, path, handle_id, operation_id, process_id, image_file_name, primary_user_name, primary_domain, primary_logon_id, client_user_name, client_domain, client_logon_id]
            )
        else:
            print('NOMATCH')
Any suggestions?
Some people, when confronted with a problem, think
“I know, I'll use regular expressions.” Now they have two problems.
If it's a CSV file that uses spaces for separation and parentheses for quoting, use:
csv.reader(csvfile, delimiter=' ', quotechar='(')
If it's an even simpler case, use the split method on the string and pad the result so all fields are filled, using empty strings:
fields = field.split(' ')
fields += [''] * (7 - len(fields))  # pad to 7 fields
Try this for your regex string:
r = '(\\w+) access = (\\w+)\\s+Vol ID = (\\w+)\\s+Snap ID = (\\w+)\\s+Inode = (\\w+)\\s+IP = ((\\d|\\.)+)\\s+UID = (\\w+)\\s+Full Path = (\\S+)\\s+Handle ID: (\\S+)\\s+Operation ID: (\\S+)\\s+Process ID: (\\d+)\\s+Image File Name: (\\w+\\s+\\w+\\s+\\w+)\\s+Primary User Name: (\\S+)\\s+Primary Domain: (\\S+)\\s+Primary Logon ID: (.....\\s+......)\\s+Client User Name: (\\S+)\\s+Client Domain: (\\S+)\\s+Client Logon ID: (\\S+)\\s+Accesses: (.*)'