Here is minimal code to reproduce the error I'm getting:
from nltk.tree import ParentedTree
teststr = 'a) Any service and handling fee imposed on customers by Used Motor Vehicle Dealers subject to these Rules: \r\n 1) shall be charged uniformly to all retail customers; \r\n 2) may not be presented as mandatory in writing, electronically, verbally, via American Sign Language, or via other media as mandatory; nor presented as mandatory or mandated by any entity, other than the Arkansas Used Motor Vehicle Dealer who or dealership which is legally permitted to invoice, charge and collect the service and handling fee established by these Rules; \r\n 3) must follow the procedures for disclosure set out by these Rules.'
# Using AllenNLP's constituency parser
from allennlp.predictors.predictor import Predictor
conspredictor = Predictor.from_path("https://s3-us-west-2.amazonaws.com/allennlp/models/elmo-constituency-parser-2018.03.14.tar.gz")
treestr = conspredictor.predict(sentence=teststr)['trees']
ptree = ParentedTree.fromstring(treestr)
Here is the error I'm receiving with the traceback:
<ipython-input-391-f600cbe3ff5e> in <module>
10 treestr = conspredictor.predict(sentence=teststr)['trees']
11
---> 12 ptree = ParentedTree.fromstring(treestr)
~\AppData\Local\Continuum\anaconda3\lib\site-packages\nltk\tree.py in fromstring(cls, s, brackets, read_node, read_leaf, node_pattern, leaf_pattern, remove_empty_top_bracketing)
616 if token[0] == open_b:
617 if len(stack) == 1 and len(stack[0][1]) > 0:
--> 618 cls._parse_error(s, match, 'end-of-string')
619 label = token[1:].lstrip()
620 if read_node is not None: label = read_node(label)
~\AppData\Local\Continuum\anaconda3\lib\site-packages\nltk\tree.py in _parse_error(cls, s, match, expecting)
677 offset = 13
678 msg += '\n%s"%s"\n%s^' % (' '*16, s, ' '*(17+offset))
--> 679 raise ValueError(msg)
680
681 #////////////////////////////////////////////////////////////
ValueError: ParentedTree.read(): expected 'end-of-string' but got '(:'
at index 273.
"...es))))))) (: :) (S (..."
^
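For reference, ParentedTree.fromstring expects exactly one bracketed tree, and this error message means extra material follows the first complete tree (the paragraph-length input apparently parsed into several top-level trees, a "forest"). A minimal way to inspect this (a sketch in plain Python, no AllenNLP or NLTK required; `split_top_level_trees` is a hypothetical helper name) is to split the returned string into balanced top-level groups:

```python
def split_top_level_trees(s):
    """Split a bracketed-forest string like '(S ...) (: :) (S ...)'
    into the individual balanced '(...)' groups at depth 0."""
    trees, depth, start = [], 0, None
    for i, ch in enumerate(s):
        if ch == '(':
            if depth == 0:
                start = i          # remember where a top-level tree begins
            depth += 1
        elif ch == ')':
            depth -= 1
            if depth == 0:
                trees.append(s[start:i + 1])  # a complete balanced group
    return trees
```

Each returned piece can then be passed to ParentedTree.fromstring on its own; alternatively, wrapping the whole string as `'(ROOT ' + treestr + ')'` turns the forest into a single parseable tree.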
I created a Python script to scrape the Facebook help page. I want to scrape cms_object_id, cmsID, and name. These values live inside <script> tags, so I first find all <script> tags, then iterate over them looking for __bbox inside each tag, which contains the values I want to scrape.
so this is my script:
import json
import requests
import bs4
from Essentials import Static
class CmsIDs:
    def GetIDs():
        # cont = requests.get(""https://www.facebook.com:443/help"", headers=Static.headers) # syntax error
        cont = requests.get("https://www.facebook.com:443/help", headers=Static.headers)
        soup = bs4.BeautifulSoup(cont.content, "html5lib")
        text = soup.find_all("script")
        start = ""
        txtstr = ""
        for i in range(len(text)):
            mystr = text[i]
            print("this is: ", mystr.find('__bbox":'))
            if text[i].get_text().find('__bbox":') != -1:
                # print(i, text[i].get_text())
                txtstr += text[i].get_text()
                start = text[i].get_text().find('__bbox":') + len('__bbox":')
        print('start:', start)
        count = 0
        for end, char in enumerate(txtstr[start:], start):
            if char == '{':
                count += 1
            if char == '}':
                count -= 1
                if count == 0:
                    break
        print('end:', end)
        # --- convert JSON string to Python structure (dict/list) ---
        data = json.loads(txtstr[start:end+1])
        # pp.pprint(data)
        print('--- search ---')
        CmsIDs.search(data)

    # --- use recursion to find all 'cms_object_id', 'cmsID', 'name' ---
    def search(data):
        if isinstance(data, dict):
            found = False
            if 'cms_object_id' in data:
                print('cms_object_id', data['cms_object_id'])
                found = True
            if 'cmsID' in data:
                print('cmsID', data['cmsID'])
                found = True
            if 'name' in data:
                print('name', data['name'])
                found = True
            if found:
                print('---')
            for val in data.values():
                CmsIDs.search(val)
        if isinstance(data, list):
            for val in data:
                CmsIDs.search(val)

if __name__ == '__main__':
    CmsIDs.GetIDs()
The page contains cms_object_id, cmsID, and name, and I want to scrape all three values, but I am getting an error:
for end, char in enumerate(txtstr[start:], start):
TypeError: slice indices must be integers or None or have an __index__ method
How can I solve this error and reach my ultimate goal?
Note: Since I'm unfamiliar with Essentials and failed to install it, and also because ""https://www.facebook.com:443/help"" raises a syntax error (there should be only one quote on each side of the string), I changed the requests line in my copy of the code.
cont = requests.get('https://www.facebook.com:443/help', headers={'accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,image/avif,image/webp,image/apng,*/*;q=0.8,application/signed-exchange;v=b3;q=0.9'})
TypeError: slice indices must be integers or None or have an __index__ method
You've initialized start as a string [start = ""] and it needs to be an integer. Unless the if text[i].get_text().find('__bbox":') != -1.... block is entered, start remains a string.
If you just want to avoid this error, you could exit the program when start hasn't been updated [indicating that __bbox": wasn't found in any of the script tags].
print('start:', start)
if start == "":
    print('{__bbox":} not found')
    return
count = 0
But that still leaves the problem of __bbox": not being found. I'm not sure why, but the issue is resolved for me if I don't use the html5lib parser; just changing to BeautifulSoup(cont.content) resolved it.
# soup = bs4.BeautifulSoup(cont.content, "html5lib")
soup = bs4.BeautifulSoup(cont.content) # don't define the parser
# soup = bs4.BeautifulSoup(cont.content, "html.parser") # you could also try other parsers
Other suggestions
Your code will probably work without these, but you might want to consider these suggested improvements for error-handling:
Filter the script Tags
If the text ResultSet only has script tags that contain __bbox":, you avoid looping unnecessarily through the 100+ other scripts, and you won't have to check with if....find('__bbox":') anymore.
text = soup.find_all(lambda t: t.name == 'script' and '"__bbox"' in t.get_text())
for mystr in [s.get_text() for s in text]:
    print("this is: ", mystr.find('__bbox":'))
    txtstr += mystr
    start = mystr.find('__bbox":') + len('__bbox":')
Initialize end
You should initialize the end variable [like end = 0] before the for end, char ... loop because you're also using it after the loop.
print('end:', end)
data = json.loads(txtstr[start:end+1])
If txtstr[start:] is empty somehow, these lines will raise a NameError since end would never have been defined.
Use a JavaScript Parser
This will make the previous suggestions unnecessary; but as it is, json.loads will raise an error if txtstr[start:end+1] is empty somehow, or if it contains any unpaired [and likely escaped] } or {. So it might be more reliable to use a parser rather than just trying to walk through the string.
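As a middle ground between manual brace counting and a full JavaScript parser, the standard library's json.JSONDecoder.raw_decode parses a single JSON value starting at a given index and ignores whatever follows it (a sketch; `extract_bbox` is a hypothetical helper name, and it assumes the JSON object begins immediately after the '__bbox":' marker, with no intervening whitespace):

```python
import json

def extract_bbox(script_text):
    """Parse the JSON object that follows '__bbox":' in a script's text,
    returning None if the marker is missing or the object doesn't parse."""
    marker = '__bbox":'
    start = script_text.find(marker)
    if start == -1:
        return None
    try:
        # raw_decode parses one JSON value at the given index and
        # tolerates trailing text, so no brace counting is needed
        obj, _end = json.JSONDecoder().raw_decode(script_text, start + len(marker))
        return obj
    except json.JSONDecodeError:
        return None
```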
I have this function that uses slimit to find values in strings containing JavaScript code. (It's far from perfect, but it seems to work for this script, at least.) GetIDs() could be re-written as below.
# import json, requests, bs4, slimit
# from slimit.visitors import nodevisitor
# def findObj_inJS... ## PASTE FROM https://pastebin.com/UVcLniSG
# class CmsIDs:
    def GetIDs():
        # cont=requests.get('https://www.facebook.com:443/help',headers=Static.headers)
        cont = requests.get('https://www.facebook.com:443/help', headers={
            'accept': ';'.join(
                ['text/html,application/xhtml+xml,application/xml',
                 'q=0.9,image/avif,image/webp,image/apng,*/*',
                 'q=0.8,application/signed-exchange',
                 'v=b3', 'q=0.9'])})
        ## in case of request errors ##
        try: cont.raise_for_status()
        except Exception as e:
            print('failed to fetch page HTML -', type(e), e)
            return
        print('fetched', cont.url, 'with', cont.status_code, cont.reason)
        soup = bs4.BeautifulSoup(cont.content)
        scrCond = lambda t: t.name == 'script' and '"__bbox"' in t.get_text()
        jScripts = [s.get_text() for s in soup.find_all(scrCond)]
        print(f'Found {len(jScripts)} script tags containing {{"__bbox"}}')
        data = [findObj_inJS(s, '"__bbox"') for s in jScripts]
        print('--- search ---')
        CmsIDs.search(data)
    # def search(data)....
Return the Data
This isn't for error-handling, but if you return the data printed by CmsIDs.search you could save it for further use.
    def search(data):
        rList, dKeys = [], ['cms_object_id', 'cmsID', 'name']
        if isinstance(data, dict):
            dObj = {k: data[k] for k in dKeys if k in data}
            rList += [dObj] if dObj else []
            for k, v in dObj.items(): print(k, v)
            if dObj: print('---')
            for val in data.values(): rList += CmsIDs.search(val)
        if isinstance(data, list):
            for val in data: rList += CmsIDs.search(val)
        return rList
The printed result will be the same as before, but if you change the last line of GetIDs to
return CmsIDs.search(data)
and then define a variable cmsList = CmsIDs.GetIDs(), then cmsList will be a list of dictionaries, which you could then [for example] save to CSV with pandas and view as a table in a spreadsheet,
# import pandas
pandas.DataFrame(cmsList).to_csv('CmsIDs_GetIDs.csv', index=False)
or print the markdown for the table [of the results I got] below
print(pandas.DataFrame(cmsList, dtype=str).fillna('').to_markdown())
| [index] | cms_object_id | name | cmsID |
|---:|:---|:---|:---|
| 0 |  | Використання Facebook |  |
| 1 | 570785306433644 | Створення облікового запису |  |
| 2 | 396528481579093 | Your Profile |  |
| 3 |  | Додати й редагувати інформацію у своєму профілі | 1017657581651994 |
| 4 |  | Ваші основна світлина й обкладинка | 1217373834962306 |
| 5 |  | Поширення дописів у своєму профілі та керування ними | 1640261589632787 |
| 6 |  | Усунення проблем | 191128814621591 |
| 7 | 1540345696275090 | Додавання в друзі |  |
| 8 |  | Додавання друзів | 246750422356731 |
| 9 |  | Люди, яких ви можете знати | 336320879782850 |
| 10 |  | Control Who Can Friend and Follow You | 273948399619967 |
| 11 |  | Upload Your Contacts to Facebook | 1041444532591371 |
| 12 |  | Видалення з друзів чи блокування користувача | 1000976436606344 |
| 13 | 312959615934334 | Facebook Dating |  |
| 14 | 753701661398957 | Ваша головна сторінка |  |
| 15 |  | How Feed Works | 1155510281178725 |
| 16 |  | Control What You See in Feed | 964154640320617 |
| 17 |  | Like and React to Posts | 1624177224568554 |
| 18 |  | Пошук | 821153694683665 |
| 19 |  | Translate Feed | 1195058957201487 |
| 20 |  | Memories | 1056848067697293 |
| 21 | 1071984682876123 | Повідомлення |  |
| 22 |  | Надсилання повідомлень | 487151698161671 |
| 23 |  | Переглянути повідомлення й керувати ними | 1117039378334299 |
| 24 |  | Поскаржитися на повідомлення | 968185709965912 |
| 25 |  | Відеовиклики | 287631408243374 |
| 26 |  | Fix a Problem | 1024559617598844 |
| 27 | 753046815962474 | Reels |  |
| 28 |  | Watching Reels | 475378724739085 |
| 29 |  | Creating Reels | 867690387846185 |
| 30 |  | Managing Your Reels | 590925116168623 |
| 31 | 862926927385914 | Розповіді |  |
| 32 |  | Як створити розповідь і поширити її | 126560554619115 |
| 33 |  | View and Reply to Stories | 349797465699432 |
| 34 |  | Page Stories | 425367811379971 |
| 35 | 1069521513115444 | Світлини й відео |  |
| 36 |  | Світлини | 1703757313215897 |
| 37 |  | Відео | 1738143323068602 |
| 38 |  | Going Live | 931327837299966 |
| 39 |  | Albums | 490693151131920 |
| 40 |  | Додавання позначок | 267689476916031 |
| 41 |  | Усунення проблеми | 507253956146325 |
| 42 | 1041553655923544 | Відео у Watch |  |
| 43 |  | Перегляд шоу та відео | 401287967326510 |
| 44 |  | Fix a Problem | 270093216665260 |
| 45 | 2402655169966967 | Gaming |  |
| 46 |  | Gaming on Facebook | 385894640264219 |
| 47 |  | Платежі в іграх | 248471068848455 |
| 48 | 282489752085908 | Сторінки |  |
| 49 |  | Interact with Pages | 1771297453117418 |
| 50 |  | Створити сторінку й керувати нею | 135275340210354 |
| 51 |  | Імена й імена користувачів | 1644118259243888 |
| 52 |  | Керування налаштуваннями сторінки | 1206330326045914 |
| 53 |  | Customize a Page | 1602483780062090 |
| 54 |  | Publishing | 1533298140275888 |
| 55 |  | Messaging | 994476827272050 |
| 56 |  | Insights | 794890670645072 |
| 57 |  | Banning and Moderation | 248844142141117 |
| 58 |  | Усунути проблему | 1020132651404616 |
| 59 | 1629740080681586 | Групи |  |
| 60 |  | Join and Choose Your Settings | 1210322209008185 |
| 61 |  | Post, Participate and Privacy | 530628541788770 |
| 62 |  | Create, Engage and Manage Settings | 408334464841405 |
| 63 |  | Керування групою для адміністраторів | 1686671141596230 |
| 64 |  | Community Chats | 3397387057158160 |
| 65 |  | Pages in Groups | 1769476376397128 |
| 66 |  | Fix a Problem | 1075368719167893 |
| 67 | 1076296042409786 | Події |  |
| 68 |  | Create and Manage an Event | 572885262883136 |
| 69 |  | View and Respond to Events | 1571121606521970 |
| 70 |  | Facebook Classes | 804063877226739 |
| 71 | 833144153745643 | Fundraisers and Donations |  |
| 72 |  | Creating a Fundraiser | 356680401435429 |
| 73 |  | Пожертва в рамках збору коштів | 1409509059114623 |
| 74 |  | Особисті збори коштів | 332739730519432 |
| 75 |  | For Nonprofits | 1640008462980459 |
| 76 |  | Fix a Problem | 2725517974129416 |
| 77 | 1434403039959381 | Meta Pay |  |
| 78 |  | Платежі в іграх | 248471068848455 |
| 79 |  | Payments in Messages | 863171203733904 |
| 80 |  | Пожертва в рамках збору коштів | 1409509059114623 |
| 81 |  | Квитки на заходи | 1769557403280350 |
| 82 |  | Monetization and Payouts | 1737820969853848 |
| 83 | 1713241952104830 | Marketplace |  |
| 84 |  | Як працює Marketplace | 1889067784738765 |
| 85 |  | Buying on Marketplace | 272975853291364 |
| 86 |  | Продаж на Marketplace | 153832041692242 |
| 87 |  | Sell with Shipping on Marketplace | 773379109714742 |
| 88 |  | Using Checkout on Facebook | 1411280809160810 |
| 89 |  | Групи з купівлі й продажу | 319768015124786 |
| 90 |  | Get Help with Marketplace | 1127970530677256 |
| 91 | 1642635852727373 | Додатки |  |
| 92 |  | Manage Your Apps | 942196655898243 |
| 93 |  | Видимість і конфіденційність додатка | 1727608884153160 |
| 94 | 866249956813928 | Додатки Facebook для мобільних пристроїв |  |
| 95 |  | Додаток для Android | 1639918076332350 |
| 96 |  | iPhone and iPad Apps | 1158027224227668 |
| 97 |  | Facebook Lite App | 795302980569545 |
| 98 | 273947702950567 | Спеціальні можливості |  |
| 99 |  | Керування обліковим записом |  |
| 100 | 1573156092981768 | Вхід і пароль |  |
| 101 |  | Вхід в обліковий запис | 1058033620955509 |
| 102 |  | Змінення пароля | 248976822124608 |
| 103 |  | Виправлення проблеми із входом | 283100488694834 |
| 104 |  | Завантаження посвідчення особи | 582999911881572 |
| 105 | 239070709801747 | Налаштування облікового запису |  |
| 106 |  | Як змінити налаштування облікового запису | 1221288724572426 |
| 107 |  | Ваше ім’я користувача | 1740158369563165 |
| 108 |  | Спадкоємці | 991335594313139 |
| 109 | 1090831264320592 | Імена у Facebook |  |
| 110 | 1036755649750898 | Сповіщення |  |
| 111 |  | Push, Email and Text Notifications | 530847210446227 |
| 112 |  | Виберіть, про що отримувати сповіщення | 269880466696699 |
| 113 |  | Усунення проблем | 1719980288275077 |
| 114 | 109378269482053 | Налаштування реклами |  |
| 115 |  | Як працює реклама у Facebook | 516147308587266 |
| 116 |  | Контроль реклами, яку ви бачите | 1075880512458213 |
| 117 |  | Ваша інформація та реклама у Facebook | 610457675797481 |
| 118 | 1701730696756992 | Доступ до вашої інформації та її завантаження |  |
| 119 | 250563911970368 | Деактивація або видалення облікового запису |  |
| 120 |  | Конфіденційність і безпека |  |
| 121 | 238318146535333 | Ваша конфіденційність |  |
| 122 |  | Керуйте тим, хто може переглядати контент, який ви поширюєте у Facebook | 1297502253597210 |
| 123 |  | Керування своїми дописами | 504765303045427 |
| 124 |  | Control Who Can Find You | 1718866941707011 |
| 125 | 592679377575472 | Безпека |  |
| 126 |  | Джерела щодо боротьби з жорстоким поводженням | 726709730764837 |
| 127 |  | Ресурси з допомоги для протидії самогубству та самоушкодженню | 1553737468262661 |
| 128 |  | Crisis Response | 141874516227713 |
| 129 |  | Ресурси з правил безпеки для допомоги батькам | 1079477105456277 |
| 130 |  | Інформація для правоохоронних органів | 764592980307837 |
| 131 | 235353253505947 | Захист облікового запису |  |
| 132 |  | Функції безпеки та поради з її забезпечення | 285695718429403 |
| 133 |  | Сповіщення про вхід і двоетапна перевірка | 909243165853369 |
| 134 |  | Уникайте спаму та шахрайства | 1584206335211143 |
| 135 | 236079651241697 | Безпека під час здійснення покупок |  |
| 136 |  | Розпізнавання шахрайства | 1086141928978559 |
| 137 |  | Уникнення шахрайства | 2374002556073992 |
| 138 |  | Купівля на Marketplace | 721562085854101 |
| 139 |  | Поради щодо безпечної купівлі | 123884166448529 |
| 140 |  | Купуйте впевнено | 1599248863596914 |
| 141 |  | Політики та скарги |  |
| 142 | 1753719584844061 | Скарга на порушення |  |
| 143 |  | Як поскаржитися на щось? | 1380418588640631 |
| 144 |  | Don't Have an Account? | 1723400564614772 |
| 145 | 1126628984024935 | Як повідомити про проблему у Facebook |  |
| 146 | 186614050293763 | Being Your Authentic Self on Facebook |  |
| 147 | 1561472897490627 | Повідомлення про порушення конфіденційності |  |
| 148 | 1216349518398524 | Зламані та фальшиві облікові записи |  |
| 149 | 275013292838654 | Керування обліковим записом померлої людини |  |
| 150 |  | About Memorialized Accounts | 1017717331640041 |
| 151 |  | Request to Memorialize or Remove an Account | 1111566045566400 |
| 152 | 399224883474207 | Інтелектуальна власність |  |
| 153 |  | Авторське право | 1020633957973118 |
| 154 |  | Торговельна марка | 507663689427413 |
| 155 | 1735443093393986 | Про наші політики |  |
I am trying to have each line be stored in a different element of the list. The text file is as follows...
244
Large Cake Pan
7
19.99
576
Assorted Sprinkles
3
12.89
212
Deluxe Icing Set
6
37.97
827
Yellow Cake Mix
3
1.99
194
Cupcake Display Board
2
27.99
285
Bakery Boxes
7
8.59
736
Mixer
5
136.94
I am trying to have 244, 576, etc. be in ID, and "Large Cake Pan", "Assorted Sprinkles", etc. in Name. You get the idea, but it's storing everything in ID, and I don't know how to make it store each piece of information in its corresponding element.
Here is my code so far:
import Inventory

def process_inventory(filename, inventory_dict):
    inventory_dict = {}
    inventory_file = open(filename, "r")
    for line in inventory_file:
        line = line.split('\n')
        ID = line[0]
        Name = line[1]
        Quantity = line[2]
        Price = line[3]
        my_inventory = Inventory.Inventory(ID, Name, Quantity, Price)
        inventory_dict[ID] = my_inventory
    inventory_file.close()
    return inventory_dict

def main():
    inventory1 = {}
    process_inventory("Inventory.txt", inventory1)
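The underlying issue is that each call to split('\n') operates on a single line, so line[1] onwards don't exist; each record actually spans four consecutive lines of the file. One possible fix (a sketch that returns plain tuples instead of the Inventory class, which isn't shown in the question) is to read all lines first and process them in groups of four:

```python
def process_inventory(filename):
    """Read 4-line records (ID, Name, Quantity, Price) into a dict keyed by ID."""
    with open(filename) as f:
        # strip newlines and skip any blank lines
        lines = [line.strip() for line in f if line.strip()]
    inventory = {}
    for i in range(0, len(lines), 4):          # step through one record at a time
        ID, name, qty, price = lines[i:i + 4]
        inventory[ID] = (name, int(qty), float(price))
    return inventory
```

With the sample file above, inventory["244"] would hold ("Large Cake Pan", 7, 19.99), and the same tuple unpacking could feed Inventory.Inventory(ID, Name, Quantity, Price) instead.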
I have a DataFrame of URLs. My code aims to use machine learning to classify each URL as benign or malicious.
I want to use host-based features, getting each URL's creation_date, last_updated and expiration_date with the whois package, but it raises an error.
Could somebody help me fix it?
Here are my code and the error:
# URL DataFrame
URL Lable
0 http://ovaismirza-politicalthoughts.blogspot.com/ 0
1 http://www.bluemoontea.com/ 0
2 http://www.viettiles.com/public/default/ckedit... 1
3 http://173.212.217.250/hescientiststravelled/o... 1
4 http://www.hole-in-the-wall.com/ 0
### Code
date = []
for i in range(len(df)):
    item = df["URL"].loc[i]
    domain = urlparse(item).netloc
    cr = whois.query(domain).creation_date
    up = whois.query(domain).last_updated
    exp = whois.query(domain).expiration_date
    if cr is not None and up is not None and exp is not None:
        date.append(0)
    else:
        date.append(1)
### Error
Exception Traceback (most recent call last)
<ipython-input-26-0d7930e66020> in <module>
3 item = df["URL"].loc[i]
4 domain = urlparse(item).netloc
----> 5 cr = whois.query(domain).creation_date
6 up = whois.query(domain).last_updated
7 exp = whois.query(domain).expiration_date
/Library/Frameworks/Python.framework/Versions/3.7/lib/python3.7/site-packages/whois/__init__.py in query(domain, force, cache_file, slow_down, ignore_returncode)
48
49 while 1:
---> 50 pd = do_parse(do_query(d, force, cache_file, slow_down, ignore_returncode), tld)
51 if (not pd or not pd['domain_name'][0]) and len(d) > 2: d = d[1:]
52 else: break
/Library/Frameworks/Python.framework/Versions/3.7/lib/python3.7/site-packages/whois/_1_query.py in do_query(dl, force, cache_file, slow_down, ignore_returncode)
42 CACHE[k] = (
43 int(time.time()),
---> 44 _do_whois_query(dl, ignore_returncode),
45 )
46 if cache_file: cache_save(cache_file)
/Library/Frameworks/Python.framework/Versions/3.7/lib/python3.7/site-packages/whois/_1_query.py in _do_whois_query(dl, ignore_returncode)
59 r = p.communicate()[0]
60 r = r.decode() if PYTHON_VERSION == 3 else r
---> 61 if not ignore_returncode and p.returncode != 0: raise Exception(r)
62 return r
63
Exception: whois: connect(): Operation timed out
% IANA WHOIS server
% for more information on IANA, visit http://www.iana.org
% This query returned 1 object
refer: whois.verisign-grs.com
domain: COM
organisation: VeriSign Global Registry Services
address: 12061 Bluemont Way
address: Reston Virginia 20190
address: United States
contact: administrative
name: Registry Customer Service
organisation: VeriSign Global Registry Services
address: 12061 Bluemont Way
address: Reston Virginia 20190
address: United States
phone: +1 703 925-6999
fax-no: +1 703 948 3978
e-mail: info#verisign-grs.com
contact: technical
name: Registry Customer Service
organisation: VeriSign Global Registry Services
address: 12061 Bluemont Way
address: Reston Virginia 20190
address: United States
phone: +1 703 925-6999
fax-no: +1 703 948 3978
e-mail: info#verisign-grs.com
nserver: A.GTLD-SERVERS.NET 192.5.6.30 2001:503:a83e:0:0:0:2:30
nserver: B.GTLD-SERVERS.NET 192.33.14.30 2001:503:231d:0:0:0:2:30
nserver: C.GTLD-SERVERS.NET 192.26.92.30 2001:503:83eb:0:0:0:0:30
nserver: D.GTLD-SERVERS.NET 192.31.80.30 2001:500:856e:0:0:0:0:30
nserver: E.GTLD-SERVERS.NET 192.12.94.30 2001:502:1ca1:0:0:0:0:30
nserver: F.GTLD-SERVERS.NET 192.35.51.30 2001:503:d414:0:0:0:0:30
nserver: G.GTLD-SERVERS.NET 192.42.93.30 2001:503:eea3:0:0:0:0:30
nserver: H.GTLD-SERVERS.NET 192.54.112.30 2001:502:8cc:0:0:0:0:30
nserver: I.GTLD-SERVERS.NET 192.43.172.30 2001:503:39c1:0:0:0:0:30
nserver: J.GTLD-SERVERS.NET 192.48.79.30 2001:502:7094:0:0:0:0:30
nserver: K.GTLD-SERVERS.NET 192.52.178.30 2001:503:d2d:0:0:0:0:30
nserver: L.GTLD-SERVERS.NET 192.41.162.30 2001:500:d937:0:0:0:0:30
nserver: M.GTLD-SERVERS.NET 192.55.83.30 2001:501:b1f9:0:0:0:0:30
ds-rdata: 30909 8 2 E2D3C916F6DEEAC73294E8268FB5885044A833FC5459588F4A9184CFC41A5766
whois: whois.verisign-grs.com
status: ACTIVE
remarks: Registration information: http://www.verisigninc.com
created: 1985-01-01
changed: 2017-10-05
source: IANA
Domain Name: VIETTILES.COM
Registry Domain ID: 1827514943_DOMAIN_COM-VRSN
Registrar WHOIS Server: whois.pavietnam.vn
Registrar URL: http://www.pavietnam.vn
Updated Date: 2018-09-07T01:13:32Z
Creation Date: 2013-09-14T04:35:12Z
Registry Expiry Date: 2019-09-14T04:35:12Z
Registrar: P.A. Viet Nam Company Limited
Registrar IANA ID: 1649
Registrar Abuse Contact Email: abuse#pavietnam.vn
Registrar Abuse Contact Phone: +84.19009477
Domain Status: clientTransferProhibited https://icann.org/epp#clientTransferProhibited
Name Server: NS1.PAVIETNAM.VN
Name Server: NS2.PAVIETNAM.VN
Name Server: NSBAK.PAVIETNAM.NET
DNSSEC: unsigned
URL of the ICANN Whois Inaccuracy Complaint Form: https://www.icann.org/wicf/
Last update of whois database: 2018-12-25T13:33:54Z <<<
By the way, can I use other methods in Python 3 to get the URL creation_date, updated_date and expiration_date instead of whois?
Thanks in advance!
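The traceback shows the system whois lookup timing out, which any bulk lookup over many domains will occasionally hit. One pragmatic workaround (a sketch; `label_from_dates` and `safe_label` are hypothetical helper names, not part of the whois package) is to wrap each lookup in try/except so that timeouts and unsupported domains are simply labelled the same as missing dates:

```python
def label_from_dates(cr, up, exp):
    """0 = all three whois dates present, 1 = anything missing (the question's labelling)."""
    return 0 if (cr is not None and up is not None and exp is not None) else 1

def safe_label(domain):
    """Label a domain, treating any whois failure (timeout, unknown TLD, ...) as missing data."""
    try:
        import whois  # the PyPI 'whois' package used in the question
        w = whois.query(domain)
        return label_from_dates(w.creation_date, w.last_updated, w.expiration_date)
    except Exception:
        # connect timeouts, unsupported TLDs, empty responses, etc.
        return 1
```

This keeps the loop running over the whole DataFrame instead of aborting on the first slow registry; adding a small time.sleep between queries may also reduce rate-limiting.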
Regex sub-string
I want to extract Phone, Fax and Mobile from a string, and if a field is missing it can return an empty string. I want the phone, fax and mobile values from any given text string. Example strings are given below.
ex1 = "miramar road margie shoop san diego ca 12793 manager phone 6035550160 fax 6035550161 mobile 6035550178 marsgies travel wwwmarpiestravelcom"
ex2 = "david packard electrical engineering 350 serra mall room 170 phone 650 7259327 stanford university fax 650 723 1882 stanford california 943059505 ulateecestanfordedu"
ex3 = "stanford electrical engineering vijay chandrasekhar electrical engineering 17 comstock circle apt 101 stanford ca 94305 phone 9162210411"
It is possible with regex like this:
phone_regex = re.match(".*phone(.*)fax(.*)mobile(.*)",ex1)
phone = [re.sub("[^0-9]","",x) for x in phone_regex.groups()][0]
mobile = [re.sub("[^0-9]","",x) for x in phone_regex.groups()][2]
fax = [re.sub("[^0-9]","",x) for x in phone_regex.groups()][1]
Result from ex1:
phone = 6035550160
fax = 6035550161
mobile = 6035550178
ex2 does not have a mobile entry, so I get:
Traceback (most recent call last):
phone = [re.sub("[^0-9]", "", x) for x in phone_regex.groups()][0]
AttributeError: 'NoneType' object has no attribute 'groups'
Question
I need either a better regex solution (I am new to regex), or a way to catch the AttributeError and assign an empty string.
You may use a simple re.findall like this:
dict(re.findall(r'\b({})\s*(\d+)'.format("|".join(keys)), ex))
The regex will look like
\b(phone|fax|mobile)\s*(\d+)
See the regex demo online.
Pattern details
\b - a word boundary
(phone|fax|mobile) - Group 1: one of the words listed
\s* - 0+ whitespaces
(\d+) - Group 2: one or more digits
See the Python demo:
import re
exs = ["miramar road margie shoop san diego ca 12793 manager phone 6035550160 fax 6035550161 mobile 6035550178 marsgies travel wwwmarpiestravelcom",
"david packard electrical engineering 350 serra mall room 170 phone 650 7259327 stanford university fax 650 723 1882 stanford california 943059505 ulateecestanfordedu",
"stanford electrical engineering vijay chandrasekhar electrical engineering 17 comstock circle apt 101 stanford ca 94305 phone 9162210411"]
keys = ['phone', 'fax', 'mobile']
for ex in exs:
res = dict(re.findall(r'\b({})\s*(\d+)'.format("|".join(keys)), ex))
print(res)
Output:
{'fax': '6035550161', 'phone': '6035550160', 'mobile': '6035550178'}
{'fax': '650', 'phone': '650'}
{'phone': '9162210411'}
Use re.search
Demo:
import re
ex1 = "miramar road margie shoop san diego ca 12793 manager phone 6035550160 fax 6035550161 mobile 6035550178 marsgies travel wwwmarpiestravelcom"
ex2 = "david packard electrical engineering 350 serra mall room 170 phone 650 7259327 stanford university fax 650 723 1882 stanford california 943059505 ulateecestanfordedu"
ex3 = "stanford electrical engineering vijay chandrasekhar electrical engineering 17 comstock circle apt 101 stanford ca 94305 phone 9162210411"
for i in [ex1, ex2, ex3]:
    phone = re.search(r"(?P<phone>(?<=\bphone\b).*?(?=([a-z]|$)))", i)
    if phone:
        print("Phone: ", phone.group("phone"))
    fax = re.search(r"(?P<fax>(?<=\bfax\b).*?(?=([a-z]|$)))", i)
    if fax:
        print("Fax: ", fax.group("fax"))
    mob = re.search(r"(?P<mob>(?<=\bmobile\b).*?(?=([a-z]|$)))", i)
    if mob:
        print("mob: ", mob.group("mob"))
    print("-----")
Output:
Phone: 6035550160
Fax: 6035550161
mob: 6035550178
-----
Phone: 650 7259327
Fax: 650 723 1882
-----
Phone: 9162210411
-----
I think I understand what you want: it has to do with getting exactly the first match after a keyword. What you need in such a case is the question mark ?:
"'?' is also a quantifier. It is short for {0,1}. It means 'match zero or one of the group preceding this question mark.' It can also be interpreted as: the part preceding the question mark is optional."
And here is some code that should work, in case the definition wasn't enough:
import re
res_dict = {}
list_keywords = ['phone', 'cell', 'fax']
for i_key in list_keywords:
    temp_res = re.findall(i_key + '(.*?) [a-zA-Z]', ex1)
    res_dict[i_key] = temp_res
I think the following regexes should work fine:
mobile = re.findall('mobile([0-9]*)', ex1.replace(" ",""))[0]
fax = re.findall('fax([0-9]*)', ex1.replace(" ",""))[0]
phone = re.findall('phone([0-9]*)', ex1.replace(" ",""))[0]
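Building on the re.findall approach from the earlier answer, the "assign an empty string" part of the question can be handled by merging the found pairs into a dict of defaults (a sketch; `extract_contacts` is my own name for the helper):

```python
import re

def extract_contacts(text, keys=('phone', 'fax', 'mobile')):
    """Return {key: digits} for each key, '' when the key is absent."""
    # capture "<keyword> <digits>" pairs, allowing spaces between digit groups
    pattern = r'\b({})\s*((?:\d+\s*)*\d)'.format('|'.join(keys))
    found = dict(re.findall(pattern, text))
    # default every key to '' and strip internal spaces from the numbers
    return {k: re.sub(r'\s+', '', found.get(k, '')) for k in keys}
```

On ex2 this yields a phone and fax but an empty string for mobile, with the split digit groups ("650 723 1882") joined into one number.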
I am attempting to write a script that will loop through list items and query the Google Places API.
The problem is that some of the queries will return no results, while others will.
The query results are gathered into lists. For every query that returns no results, I would like to insert a 'no results' string into the list.
This is the script I have so far (API Key is fake):
companies = ['company A', 'company B', 'company C']

#create list items to store API search results
google_name = []
place_id = []
formatted_address = []

#function to find company id and address from company names
def places_api_id():
    api_key = 'AIzaSyAKCp1kN0cHvO7t_NlqMagergrghhehtsrht'
    url = 'https://maps.googleapis.com/maps/api/place/textsearch/json'
    #replace spaces within list items with %20
    company_replaced = company.replace(' ', '%20')
    final_url = url + '?query=' + company_replaced + '&key=' + api_key
    json_obj = urllib2.urlopen(final_url)
    data = json.loads(json_obj)
    #if no results, insert 'no results'
    if data['status'] == 'ZERO RESULTS':
        google_name.append('no results')
        place_id.append('no results')
        formatted_address('no results')
    #otherwise, insert the result into list
    else:
        for item in data['results']:
            google_name.append(item['name'])
            place_id.append(item['place_id'])
            formatted_address.append(item['formatted_address'])

#run the script
for company in companies:
    places_api_id()
Unfortunately when I run the script python produces the following error:
---------------------------------------------------------------------------
TypeError Traceback (most recent call last)
<ipython-input-159-eadf5f84e27f> in <module>()
1 for company in companies:
----> 2 places_api_id()
3
<ipython-input-153-f0e25b871a0e> in places_api_id()
6 final_url = url + '?query=' + company_replaced +'&key=' + api_key
7 json_obj = urllib2.urlopen(final_url)
----> 8 data = json.loads(json_obj)
9 if data['status'] == 'ZERO RESULTS':
10 google_name.append('no results')
/usr/lib/python2.7/json/__init__.pyc in loads(s, encoding, cls, object_hook, parse_float, parse_int, parse_constant, object_pairs_hook, **kw)
336 parse_int is None and parse_float is None and
337 parse_constant is None and object_pairs_hook is None and not kw):
--> 338 return _default_decoder.decode(s)
339 if cls is None:
340 cls = JSONDecoder
/usr/lib/python2.7/json/decoder.pyc in decode(self, s, _w)
364
365 """
--> 366 obj, end = self.raw_decode(s, idx=_w(s, 0).end())
367 end = _w(s, end).end()
368 if end != len(s):
TypeError: expected string or buffer
I would really appreciate your help and advice on how to get this script working, I've been staring at it for hours.
Thank you
Kamil
UPDATE
I am now looping the following list through the script:
companies = ['MARINE AND GENERAL MUTUAL LIFE ASSURANCE SOCIETY',
'KENTSTONE PROPERTIES LIMITED',
'ASHFORD CATTLE MARKET COMPANY LIMITED(THE)',
'ORIENTAL GAS COMPANY, LIMITED(THE)',
'BRITISH INDIA STEAM NAVIGATION COMPANY LIMITED',
'N & C BUILDING PRODUCTS LIMITED',
'UNION MARINE AND GENERAL INSURANCE COMPANY LIMITED,(THE)',
'00000258 LIMITED',
'METHODIST NEWSPAPER COMPANY LIMITED',
'LONDON AND SUBURBAN LAND AND BUILDING COMPANY LIMITED(THE)']
After I run the script, this is what the Google Places API returns in the google_name list:
[u'The Ashford Cattle Market Co Ltd',
u'Orient Express Hotels',
u'British-India Steam-Navigation Co Ltd',
u'N-Of-One, Inc.',
u'In-N-Out Burger',
u'In-N-Out Burger Distribution Center',
u"Wet 'n Wild Orlando",
u'In-N-Out Burger',
u'Alt-N Technologies (MDaemon)',
u'Model N Inc',
u"Pies 'n' Thighs",
u"Bethany Women's Center",
u"Jim 'N Nick's Bar-B-Q",
u"Steak 'n Shake",
u'New Orleans Ernest N. Morial Convention Center',
u"Jim 'N Nick's Bar-B-Q",
u"Jim 'N Nick's Bar-B-Q",
u"Jim 'N Nick's Bar-B-Q",
u'Theatre N at Nemours',
u'Model N',
u"Jim 'N Nick's Bar-B-Q",
u'Memphis Rock n Soul Museum',
u"Eat'n Park - Squirrel Hill",
u'Travelers',
u'American General Life Insurance Co',
u'258 Ltd Rd',
u'The Limited',
u'258, New IPCL Rd',
u'London Metropolitan Archives',
u'Hampstead Garden Suburb Trust Ltd']
The majority of the company names returned by Google are not even on the companies list, and there are also many more of them. I am really confused now.
The error is not at the if line but before it:
json_obj is a file-like object, not a string, therefore you have to use json.load:
data = json.load(json_obj)
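The difference can be illustrated with an in-memory file-like object standing in for the HTTP response (shown in Python 3; the question's code is Python 2, where the same distinction applies):

```python
import json
from io import StringIO  # stands in for the file-like object urlopen returns

response = StringIO('{"status": "OK", "results": []}')
data = json.load(response)   # json.load reads from a file-like object
print(data["status"])        # OK

# json.loads, by contrast, expects a string:
assert data == json.loads('{"status": "OK", "results": []}')
```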
PS: if the status is not what you expect, you can just test if data['results'] is empty or not:
import json
import urllib2
from collections import namedtuple

API_KEY = 'AIzaSyAKCp1kN0cHvO7t_NlqMagergrghhehtsrht'
URL = 'https://maps.googleapis.com/maps/api/place/textsearch/json?query={q}&key={k}'

Place = namedtuple("Place", "google_name,place_id,formatted_address")

#function to find company id and address from company names
def places_api_id(company):
    places = []
    url = URL.format(q=urllib2.quote(company), k=API_KEY)
    json_obj = urllib2.urlopen(url)
    data = json.load(json_obj)
    if not data['results']:
        places.append(Place("no results", "no results", "no results"))
    else:
        for item in data['results']:
            places.append(Place(item['name'], item['place_id'], item['formatted_address']))
    return places

companies = ['company A', 'company B', 'company C']
places = []
for company in companies:
    places.extend(places_api_id(company))