Extract information out of strings - python

Given a string containing the following information:
[:T102684-1 coord="107,20,885,18":]27.[:/T102684-1:] [:T102684-2
coord="140,16,885,18":]A.[:/T102684-2:] [:T102684-3
coord="162,57,885,18":]Francke[:/T102684-3:][:T102684-4
coord="228,5,885,18":]:[:/T102684-4:] [:T102684-5
coord="240,27,885,18":]Die[:/T102684-5:] [:T102684-6
coord="274,42,885,18":]alpine[:/T102684-6:] [:T102684-7
coord="325,64,885,18":]Literatur[:/T102684-7:] [:T102684-8
coord="398,25,885,18":]des[:/T102684-8:] [:T102684-9
coord="427,46,885,18":]Jahres[:/T102684-9:] [:T102684-10
coord="480,33,885,18":]1888[:/T102684-10:] [:T102684-11
coord="527,29,885,18":]475[:/T102684-11:]
How can I extract the tab ID (here: T102684), the token ID (the number after the "-"), the coordinates (107,20,885,18) and the token itself ("27.")?
I tried simple find methods, but that doesn't work:
for tok in ele.text.split():
    print(tok.find("[:T"))
    print(tok.rfind(":]"))
    print(tok[(tok.find("[:T")+2):tok.rfind("-")])
Thanks for any help!

You can use regex for this:
>>> import re
>>> s = '[:T102684-1 coord="107,20,885,18":]27.[:/T102684-1:] [:T102684-2 coord="140,16,885,18":]A.[:/T102684-2:] [:T102684-3 coord="162,57,885,18":]Francke[:/T102684-3:][:T102684-4 coord="228,5,885,18":]:[:/T102684-4:] [:T102684-5 coord="240,27,885,18":]Die[:/T102684-5:] [:T102684-6 coord="274,42,885,18":]alpine[:/T102684-6:] [:T102684-7 coord="325,64,885,18":]Literatur[:/T102684-7:] [:T102684-8 coord="398,25,885,18":]des[:/T102684-8:] [:T102684-9 coord="427,46,885,18":]Jahres[:/T102684-9:] [:T102684-10 coord="480,33,885,18":]1888[:/T102684-10:] [:T102684-11 coord="527,29,885,18":]475[:/T102684-11:]'
>>> r = re.compile(r'''\[:/?T(?P<token_id>\d+)-(?P<id>\d+)\s+coord="
(?P<coord>(\d+,\d+,\d+,\d+))":\](?P<token>\w+)''', flags=re.VERBOSE)
>>> for m in r.finditer(s):
...     print(m.groupdict())
{'token_id': '102684', 'token': '27', 'id': '1', 'coord': '107,20,885,18'}
{'token_id': '102684', 'token': 'A', 'id': '2', 'coord': '140,16,885,18'}
{'token_id': '102684', 'token': 'Francke', 'id': '3', 'coord': '162,57,885,18'}
{'token_id': '102684', 'token': 'Die', 'id': '5', 'coord': '240,27,885,18'}
{'token_id': '102684', 'token': 'alpine', 'id': '6', 'coord': '274,42,885,18'}
{'token_id': '102684', 'token': 'Literatur', 'id': '7', 'coord': '325,64,885,18'}
{'token_id': '102684', 'token': 'des', 'id': '8', 'coord': '398,25,885,18'}
{'token_id': '102684', 'token': 'Jahres', 'id': '9', 'coord': '427,46,885,18'}
{'token_id': '102684', 'token': '1888', 'id': '10', 'coord': '480,33,885,18'}
{'token_id': '102684', 'token': '475', 'id': '11', 'coord': '527,29,885,18'}
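Note that the output above drops the trailing "." of "27." and "A.", and skips token 4 (":") entirely, because \w+ only matches word characters. A minimal sketch of a variant that captures everything between the opening and closing tag is shown below. The group names tab_id and token_id and the helper parse_tokens are illustrative choices, not part of the original answer, and the sketch assumes every token is wrapped in a matching [:T...:] / [:/T...:] pair.

```python
import re

# Opening tag, lazily captured token text, then the matching closing tag.
TOKEN_RE = re.compile(
    r'\[:T(?P<tab_id>\d+)-(?P<token_id>\d+)\s+coord="(?P<coord>\d+,\d+,\d+,\d+)":\]'
    r'(?P<token>.*?)'
    r'\[:/T(?P=tab_id)-(?P=token_id):\]',
    re.DOTALL,
)


def parse_tokens(text):
    """Return one dict per token found in text (hypothetical helper)."""
    return [m.groupdict() for m in TOKEN_RE.finditer(text)]


# Shortened sample taken from the question's string.
sample = ('[:T102684-1 coord="107,20,885,18":]27.[:/T102684-1:] '
          '[:T102684-3 coord="162,57,885,18":]Francke[:/T102684-3:]'
          '[:T102684-4 coord="228,5,885,18":]:[:/T102684-4:]')

tokens = parse_tokens(sample)
for tok in tokens:
    print(tok)
```

Because the token is matched lazily up to the closing tag, punctuation-only tokens such as ":" are kept as well.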

Related

Compare values in list of dicts in Python

I'm a newbie in Python. I have a list of members and a list of meetings (containing the member id):
memberList = [{'id': '1', 'name': 'Joe'},
{'id': '2', 'name': 'Jason'},
{'id': '3', 'name': 'Billy'}]
meetingList = [{'meetingId': '20', 'hostId' : '1'},
{'meetingId': '21', 'hostId' : '1'},
{'meetingId': '22', 'hostId' : '2'},
{'meetingId': '23', 'hostId' : '2'}]
Where the id of the member and the hostId of meeting is the same value.
Desired result: a list of the ids of members that have no meetings, ['3'], or the corresponding list of dicts, [{'id': '3', 'name': 'Billy'}].
What's the best and most readable way to do it?
You could build a set of hosts and then use a list comprehension to filter out the members:
member_list = [{'id': '1', 'name': 'Joe'},
{'id': '2', 'name': 'Jason'},
{'id': '3', 'name': 'Billy'}]
meeting_list = [{'meetingId': '20', 'hostId': '1'},
{'meetingId': '21', 'hostId': '1'},
{'meetingId': '22', 'hostId': '2'},
{'meetingId': '23', 'hostId': '2'}]
# create a set of hosts
hosts = set(meeting['hostId'] for meeting in meeting_list) # or { meeting['hostId'] for meeting in meeting_list }
# filter out the members that are in hosts
res = [member for member in member_list if member['id'] not in hosts]
print(res)
Output
[{'id': '3', 'name': 'Billy'}]
For the id only output, do:
res = [member['id'] for member in member_list if member['id'] not in hosts]
print(res)
Output
['3']
I'd extract out the id's from both lists of dictionaries and compare them directly.
First I'm just rewriting your list variables to assign them with =.
Using : won't save the variable.
memberList = [{'id': '1', 'name': 'Joe'},
{'id': '2', 'name': 'Jason'},
{'id': '3', 'name': 'Billy'}]
meetingList = [{'meetingId': '20', 'hostId' : '1'},
{'meetingId': '21', 'hostId' : '1'},
{'meetingId': '22', 'hostId' : '2'},
{'meetingId': '23', 'hostId' : '2'}]
Then use list comprehension to extract out the id's from each list of dicts.
member_id_list = [i["id"] for i in memberList]
meeting_hostid_list = [i["hostId"] for i in meetingList]
You could also use a list comprehension here, but if you aren't familiar with it, this for loop with an if check will print every member id that isn't a host.
for i in member_id_list:
    if i not in meeting_hostid_list:
        print(i)
>> 3
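The list-comprehension variant mentioned in this answer could be sketched as follows. This is only an illustration: the names host_ids, members_without_meetings and ids_without_meetings are made up for this example, and a set is used for the host ids so that each membership test does not scan a list.

```python
memberList = [{'id': '1', 'name': 'Joe'},
              {'id': '2', 'name': 'Jason'},
              {'id': '3', 'name': 'Billy'}]
meetingList = [{'meetingId': '20', 'hostId': '1'},
               {'meetingId': '21', 'hostId': '1'},
               {'meetingId': '22', 'hostId': '2'},
               {'meetingId': '23', 'hostId': '2'}]

# Collect every id that hosts at least one meeting.
host_ids = {meeting['hostId'] for meeting in meetingList}

# Keep the members whose id never appears as a host.
members_without_meetings = [m for m in memberList if m['id'] not in host_ids]
ids_without_meetings = [m['id'] for m in members_without_meetings]

print(members_without_meetings)
print(ids_without_meetings)
```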

Turn a text file into a dictionary with Python

I have a text file with a pattern:
[Badges_373382]
Deleted=0
Button2=0 1497592154
Button1=0 1497592154
ProxReader=0
StartProc=100 1509194246 ""
NextStart=0
LastSeen=1509194246
Enabled=1
Driver=Access Control
Program=AccessProxBadge
LocChg=1509120279
Name=asd
Neuron=7F0027BF2D
Owner=373381
LostSince=1509120774
Index1=218
Photo=unknown.jpg
LastProxReader=0
Temp=0
LastTemp=0
LastMotionless=0
LastMotion=1497592154
BatteryLow=0
PrevReader=10703
Reader=357862
SuspendTill=0
SuspendSince=0
Status=1001
ConvertUponDownload=0
AXSFlags=0
Params=10106
Motion=1
USER_DATA_CreationDate=6/15/2017 4:48:15 PM
OwnerOldName=asd
[Badges_373384]
Deleted=0
Button2=0 1497538610
Button1=0 1497538610
ProxReader=0
StartProc=100 1509194246 ""
NextStart=0
LastSeen=1513872678
Enabled=1
Driver=Access Control
Program=AccessProxBadge
LocChg=1513872684
Name=dsa
Neuron=7F0027CC1C
Owner=373383
LostSince=1513872723
Index1=219
Photo=unknown.jpg
LastProxReader=0
Temp=0
LastTemp=0
LastMotionless=0
LastMotion=1497538610
BatteryLow=0
PrevReader=357874
Reader=357873
SuspendTill=0
SuspendSince=0
Status=1001
ConvertUponDownload=0
AXSFlags=0
Params=10106
Motion=1
USER_DATA_CreationDate=6/15/2017 4:48:51 PM
OwnerOldName=dsa
[Badges_373386]
Deleted=0
Button2=0 1497780768
Button1=0 1497780768
ProxReader=0
StartProc=100 1509194246 ""
NextStart=0
LastSeen=1514124910
Enabled=1
Driver=Access Control
Program=AccessProxBadge
LocChg=1514124915
Name=ss
Neuron=7F0027B5FD
Owner=373385
LostSince=1514124950
Index1=220
Photo=unknown.jpg
LastProxReader=0
Temp=0
LastTemp=0
LastMotionless=0
LastMotion=1497780768
BatteryLow=0
PrevReader=357872
Reader=357871
SuspendTill=0
SuspendSince=0
Status=1001
ConvertUponDownload=0
AXSFlags=0
Params=10106
Motion=1
USER_DATA_CreationDate=6/15/2017 4:49:24 PM
OwnerOldName=ss
Every new "Badge" entry starts with [Badges_number] and ends with a blank line.
Using Python 3.6, I would like to turn this file into a dictionary so that I could easily access that information.
It should look like this:
content = {"Badges_373382": {"Deleted": 0, ..}, "Badges_371231": {"Deleted": 0, ..}}
I'm pretty confused about how to do that, and I'd love to get some help.
Thanks!
This is basically an INI file, and Python provides the configparser module to parse such files.
import configparser
config = configparser.ConfigParser()
# read_file replaces the deprecated readfp
with open('badges.ini') as f:
    config.read_file(f)
# note: ConfigParser lowercases option names by default
r = {section: dict(config[section]) for section in config.sections()}
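One detail worth knowing about the configparser approach: by default the parser lowercases option names and applies %-interpolation to values. The sketch below shows one way to keep the keys exactly as written in the file. It parses a shortened inline sample with read_string instead of opening a file, and the variable names sample and content are chosen for this illustration only.

```python
import configparser

# Shortened sample in the same format as the question's file.
sample = """\
[Badges_373382]
Deleted=0
Button2=0 1497592154
Driver=Access Control
USER_DATA_CreationDate=6/15/2017 4:48:15 PM

[Badges_373384]
Deleted=0
Name=dsa
"""

# Only "=" separates keys from values; no %-interpolation of values.
config = configparser.ConfigParser(delimiters=('=',), interpolation=None)
config.optionxform = str  # keep option names case-sensitive
config.read_string(sample)

content = {section: dict(config[section]) for section in config.sections()}
print(content['Badges_373382']['Driver'])
```

All values stay strings; convert them with int() where a number is needed.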
You can loop through the lines and keep track of whether you have seen a header in the format [Badges_373382]:
import re
import itertools
with open('filename.txt') as f:
    # drop newlines and skip empty lines
    lines = [i.strip('\n') for i in f if i.strip('\n')]
# group consecutive lines by whether they are a header
new_data = [(a, list(b)) for a, b in itertools.groupby(lines, key=lambda x: bool(re.findall(r'\[[a-zA-Z]+_+\d+\]', x)))]
# pair each header group with the data group that follows it
final_data = {new_data[i][-1][-1]: dict(c.split('=', 1) for c in new_data[i+1][-1]) for i in range(0, len(new_data), 2)}
Output:
{'[Badges_373384]': {'OwnerOldName': 'dsa', 'LastMotionless': '0', 'NextStart': '0', 'Driver': 'Access Control', 'LastTemp': '0', 'USER_DATA_CreationDate': '6/15/2017 4:48:51 PM', 'Program': 'AccessProxBadge', 'LocChg': '1513872684', 'Reader': '357873', 'LostSince': '1513872723', 'LastMotion': '1497538610', 'Status': '1001', 'Deleted': '0', 'SuspendTill': '0', 'ProxReader': '0', 'LastSeen': '1513872678', 'BatteryLow': '0', 'Index1': '219', 'Name': 'dsa', 'Temp': '0', 'Enabled': '1', 'StartProc': '100 1509194246 ""', 'Motion': '1', 'Button2': '0 1497538610', 'Button1': '0 1497538610', 'SuspendSince': '0', 'ConvertUponDownload': '0', 'PrevReader': '357874', 'AXSFlags': '0', 'LastProxReader': '0', 'Photo': 'unknown.jpg', 'Neuron': '7F0027CC1C', 'Owner': '373383', 'Params': '10106'}, '[Badges_373382]': {'OwnerOldName': 'asd', 'LastMotionless': '0', 'NextStart': '0', 'Driver': 'Access Control', 'LastTemp': '0', 'USER_DATA_CreationDate': '6/15/2017 4:48:15 PM', 'Program': 'AccessProxBadge', 'LocChg': '1509120279', 'Reader': '357862', 'LostSince': '1509120774', 'LastMotion': '1497592154', 'Status': '1001', 'Deleted': '0', 'SuspendTill': '0', 'ProxReader': '0', 'LastSeen': '1509194246', 'BatteryLow': '0', 'Index1': '218', 'Name': 'asd', 'Temp': '0', 'Enabled': '1', 'StartProc': '100 1509194246 ""', 'Motion': '1', 'Button2': '0 1497592154', 'Button1': '0 1497592154', 'SuspendSince': '0', 'ConvertUponDownload': '0', 'PrevReader': '10703', 'AXSFlags': '0', 'LastProxReader': '0', 'Photo': 'unknown.jpg', 'Neuron': '7F0027BF2D', 'Owner': '373381', 'Params': '10106'}, '[Badges_373386]': {'OwnerOldName': 'ss', 'LastMotionless': '0', 'NextStart': '0', 'Driver': 'Access Control', 'LastTemp': '0', 'USER_DATA_CreationDate': '6/15/2017 4:49:24 PM', 'Program': 'AccessProxBadge', 'LocChg': '1514124915', 'Reader': '357871', 'LostSince': '1514124950', 'LastMotion': '1497780768', 'Status': '1001', 'Deleted': '0', 'SuspendTill': '0', 'ProxReader': '0', 'LastSeen': '1514124910', 'BatteryLow': 
'0', 'Index1': '220', 'Name': 'ss', 'Temp': '0', 'Enabled': '1', 'StartProc': '100 1509194246 ""', 'Motion': '1', 'Button2': '0 1497780768', 'Button1': '0 1497780768', 'SuspendSince': '0', 'ConvertUponDownload': '0', 'PrevReader': '357872', 'AXSFlags': '0', 'LastProxReader': '0', 'Photo': 'unknown.jpg', 'Neuron': '7F0027B5FD', 'Owner': '373385', 'Params': '10106'}}
You can just go through each line of the file and add what you need. There are three kinds of lines you can come across:
1. The line is a header; it becomes a key in the final dictionary. You can just check whether a line starts with "[Badges" here, and store the current header in a temporary variable while reading the file.
2. The line is a blank line, marking the end of the current badge data being read. All you need to do here is add the information collected for the current badge to the dictionary under the corresponding key. Depending on your implementation, you can delete these lines beforehand, or keep them when reading the lines.
3. Otherwise, the line has some info that needs to be stored. You first need to split this info on "=", and then store it in your dictionary.
With these suggestions, you can write something like this to accomplish the task:
from collections import defaultdict
# dictionary of dictionary values
data = defaultdict(dict)
with open('pattern.txt') as file:
    lines = [line.strip('\n') for line in file]
# keeps track of current header
header = None
# case 2, deletes empty lines beforehand
valid_lines = [line for line in lines if line]
for line in valid_lines:
    # case 1, for headers
    if line.startswith('[Badges'):
        # updates current header, and deletes square brackets
        header = line.replace('[', '').replace(']', '')
    # case 3, data has been found
    else:
        # split and add the data
        info = line.split('=')
        key, value = info[0], info[1]
        data[header][key] = value
print(dict(data))
Which outputs:
{'Badges_373382': {'Deleted': '0', 'Button2': '0 1497592154', 'Button1': '0 1497592154', 'ProxReader': '0', 'StartProc': '100 1509194246 ""', 'NextStart': '0', 'LastSeen': '1509194246', 'Enabled': '1', 'Driver': 'Access Control', 'Program': 'AccessProxBadge', 'LocChg': '1509120279', 'Name': 'asd', 'Neuron': '7F0027BF2D', 'Owner': '373381', 'LostSince': '1509120774', 'Index1': '218', 'Photo': 'unknown.jpg', 'LastProxReader': '0', 'Temp': '0', 'LastTemp': '0', 'LastMotionless': '0', 'LastMotion': '1497592154', 'BatteryLow': '0', 'PrevReader': '10703', 'Reader': '357862', 'SuspendTill': '0', 'SuspendSince': '0', 'Status': '1001', 'ConvertUponDownload': '0', 'AXSFlags': '0', 'Params': '10106', 'Motion': '1', 'USER_DATA_CreationDate': '6/15/2017 4:48:15 PM', 'OwnerOldName': 'asd'}, 'Badges_373384': {'Deleted': '0', 'Button2': '0 1497538610', 'Button1': '0 1497538610', 'ProxReader': '0', 'StartProc': '100 1509194246 ""', 'NextStart': '0', 'LastSeen': '1513872678', 'Enabled': '1', 'Driver': 'Access Control', 'Program': 'AccessProxBadge', 'LocChg': '1513872684', 'Name': 'dsa', 'Neuron': '7F0027CC1C', 'Owner': '373383', 'LostSince': '1513872723', 'Index1': '219', 'Photo': 'unknown.jpg', 'LastProxReader': '0', 'Temp': '0', 'LastTemp': '0', 'LastMotionless': '0', 'LastMotion': '1497538610', 'BatteryLow': '0', 'PrevReader': '357874', 'Reader': '357873', 'SuspendTill': '0', 'SuspendSince': '0', 'Status': '1001', 'ConvertUponDownload': '0', 'AXSFlags': '0', 'Params': '10106', 'Motion': '1', 'USER_DATA_CreationDate': '6/15/2017 4:48:51 PM', 'OwnerOldName': 'dsa'}, 'Badges_373386': {'Deleted': '0', 'Button2': '0 1497780768', 'Button1': '0 1497780768', 'ProxReader': '0', 'StartProc': '100 1509194246 ""', 'NextStart': '0', 'LastSeen': '1514124910', 'Enabled': '1', 'Driver': 'Access Control', 'Program': 'AccessProxBadge', 'LocChg': '1514124915', 'Name': 'ss', 'Neuron': '7F0027B5FD', 'Owner': '373385', 'LostSince': '1514124950', 'Index1': '220', 'Photo': 'unknown.jpg', 
'LastProxReader': '0', 'Temp': '0', 'LastTemp': '0', 'LastMotionless': '0', 'LastMotion': '1497780768', 'BatteryLow': '0', 'PrevReader': '357872', 'Reader': '357871', 'SuspendTill': '0', 'SuspendSince': '0', 'Status': '1001', 'ConvertUponDownload': '0', 'AXSFlags': '0', 'Params': '10106', 'Motion': '1', 'USER_DATA_CreationDate': '6/15/2017 4:49:24 PM', 'OwnerOldName': 'ss'}}
Note: The above code is just one possibility; feel free to adapt it to your needs, or even improve it.
I also used collections.defaultdict to add the data, since it's easier to use. You can also wrap the result in dict() at the end to convert it to a normal dictionary, which is optional.
You can try a regex and split each match. Note that the matches are split on whitespace, so values that contain spaces (such as Driver=Access Control) end up as several list items:
import re
pattern = r'^\[Badges.+?OwnerOldName=\w+'
with open('file.txt', 'r') as f:
    match = re.finditer(pattern, f.read(), re.DOTALL | re.MULTILINE)
new = []
for kk in match:
    if kk.group() != '\n':
        new.append(kk.group())
print({i.split()[0]: i.split()[1:] for i in new})
output:
{'[Badges_373384]': ['Deleted=0', 'Button2=0', '1497538610', 'Button1=0', '1497538610', 'ProxReader=0', 'StartProc=100', '1509194246', '""', 'NextStart=0', 'LastSeen=1513872678', 'Enabled=1', 'Driver=Access', 'Control', 'Program=AccessProxBadge', 'LocChg=1513872684', 'Name=dsa', 'Neuron=7F0027CC1C', 'Owner=373383', 'LostSince=1513872723', 'Index1=219', 'Photo=unknown.jpg', 'LastProxReader=0', 'Temp=0', 'LastTemp=0', 'LastMotionless=0', 'LastMotion=1497538610', 'BatteryLow=0', 'PrevReader=357874', 'Reader=357873', 'SuspendTill=0', 'SuspendSince=0', 'Status=1001', 'ConvertUponDownload=0', 'AXSFlags=0', 'Params=10106', 'Motion=1', 'USER_DATA_CreationDate=6/15/2017', '4:48:51', 'PM', 'OwnerOldName=dsa'], '[Badges_373382]': ['Deleted=0', 'Button2=0', '1497592154', 'Button1=0', '1497592154', 'ProxReader=0', 'StartProc=100', '1509194246', '""', 'NextStart=0', 'LastSeen=1509194246', 'Enabled=1', 'Driver=Access', 'Control', 'Program=AccessProxBadge', 'LocChg=1509120279', 'Name=asd', 'Neuron=7F0027BF2D', 'Owner=373381', 'LostSince=1509120774', 'Index1=218', 'Photo=unknown.jpg', 'LastProxReader=0', 'Temp=0', 'LastTemp=0', 'LastMotionless=0', 'LastMotion=1497592154', 'BatteryLow=0', 'PrevReader=10703', 'Reader=357862', 'SuspendTill=0', 'SuspendSince=0', 'Status=1001', 'ConvertUponDownload=0', 'AXSFlags=0', 'Params=10106', 'Motion=1', 'USER_DATA_CreationDate=6/15/2017', '4:48:15', 'PM', 'OwnerOldName=asd'], '[Badges_373386]': ['Deleted=0', 'Button2=0', '1497780768', 'Button1=0', '1497780768', 'ProxReader=0', 'StartProc=100', '1509194246', '""', 'NextStart=0', 'LastSeen=1514124910', 'Enabled=1', 'Driver=Access', 'Control', 'Program=AccessProxBadge', 'LocChg=1514124915', 'Name=ss', 'Neuron=7F0027B5FD', 'Owner=373385', 'LostSince=1514124950', 'Index1=220', 'Photo=unknown.jpg', 'LastProxReader=0', 'Temp=0', 'LastTemp=0', 'LastMotionless=0', 'LastMotion=1497780768', 'BatteryLow=0', 'PrevReader=357872', 'Reader=357871', 'SuspendTill=0', 'SuspendSince=0', 'Status=1001', 
'ConvertUponDownload=0', 'AXSFlags=0', 'Params=10106', 'Motion=1', 'USER_DATA_CreationDate=6/15/2017', '4:49:24', 'PM', 'OwnerOldName=ss']}

How to get values in script using python

I'm creating a crawler with Python + Beautiful Soup.
I have to access the <script> tag to get some data in the dataLayer.
I did a search with BeautifulSoup and managed to return the tag that I need, but I cannot turn it into JSON to access the information.
This is the code that I wrote to get the tag:
page = get_html('URL')
dataLayer = page.findAll('script')[NUMBER OF SCRIPT]
And this is my return:
<script type="text/javascript">
dataLayer = [{
'site': {
'isMobile': false
},
'page': {
'pageType': 'ad_detail',
'detail': {
'parent_category_id': '2000',
'category_id': '2020',
'state_id': '2',
'region_id': '31',
'ad_id': '293231982',
'list_id': '250941507',
'city_id': '9208',
'zipcode':'34710620',
},
'adDetail': {
'adID': '293231982',
'listID': '250941507',
'sellerName': 'Marr',
'adDate': '2016-11-30 20:52:11',
},
},
'session': {
'user': {
'userID': '',
'loginType': ''
}
},
'pageType': 'Ad_detail',
'abtestingEnable' : '1',
// Listing information
'listingCategory': '2020',
// Ad information
'adId': '293231982',
'state': '2',
'region': '31',
'category': '2020',
'pictures': '8',
'listId': '250941507',
//Account Information
'loggedUser':'0',
'referrer': '',
//User Information
}];
</script>
I would like to get data such as adDate and zipcode.
import json
import re
s = soup.script.text.replace('\'', '"') # replace ' with "
s = re.search(r'\{.+\}', s, re.DOTALL).group() # get json data
s = re.sub(r'//.+\n', '', s) # remove comments
s = re.sub(r'\s+', '', s) # strip whitespace (this also removes spaces inside values such as adDate)
s = re.sub(r',}', '}', s) # get rid of the trailing , in each dict
json.loads(s)
out:
{'abtestingEnable': '1',
'adId': '293231982',
'category': '2020',
'listId': '250941507',
'listingCategory': '2020',
'loggedUser': '0',
'page': {'adDetail': {'adDate': '2016-11-3020:52:11',
'adID': '293231982',
'listID': '250941507',
'sellerName': 'Marr'},
'detail': {'ad_id': '293231982',
'category_id': '2020',
'city_id': '9208',
'list_id': '250941507',
'parent_category_id': '2000',
'region_id': '31',
'state_id': '2',
'zipcode': '34710620'},
'pageType': 'ad_detail'},
'pageType': 'Ad_detail',
'pictures': '8',
'referrer': '',
'region': '31',
'session': {'user': {'loginType': '', 'userID': ''}},
'site': {'isMobile': False},
'state': '2'}
Your JSON uses single quotes instead of double quotes.
You should replace all single quotes with double quotes to make your dataLayer variable JSON-compliant.
A simple .replace("'", '"') should do the trick.
Note: you also have to remove the comment lines with a second regex.
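Putting both suggestions together, here is a minimal sketch that keeps the whitespace inside values intact, so adDate stays '2016-11-30 20:52:11'. It runs on a shortened copy of the script body from the question; the helper name parse_data_layer is invented for this example. The sketch assumes that comments occupy whole lines and that no value contains an apostrophe or a comma directly followed by a closing bracket.

```python
import json
import re

# Shortened copy of the <script> body shown in the question.
script_text = """
dataLayer = [{
    'site': {
        'isMobile': false
    },
    'page': {
        'detail': {
            'city_id': '9208',
            'zipcode':'34710620',
        },
        'adDetail': {
            'sellerName': 'Marr',
            'adDate': '2016-11-30 20:52:11',
        },
    },
    // Ad information
    'adId': '293231982',
    //User Information
}];
"""


def parse_data_layer(text):
    """Turn the JavaScript dataLayer literal into Python objects (hypothetical helper)."""
    body = re.search(r'\[.*\]', text, re.DOTALL).group()       # the array literal
    body = re.sub(r'^\s*//.*$', '', body, flags=re.MULTILINE)  # drop whole-line comments
    body = body.replace("'", '"')                              # JSON needs double quotes
    body = re.sub(r',\s*([}\]])', r'\1', body)                 # drop trailing commas
    return json.loads(body)


data = parse_data_layer(script_text)[0]
ad_date = data['page']['adDetail']['adDate']
zipcode = data['page']['detail']['zipcode']
print(ad_date, zipcode)
```

With Beautiful Soup, the text of the selected script tag would be passed in place of script_text.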

Scraping information from hoverbox

As background, I am scraping a webpage in Python and using BeautifulSoup.
Some of the information that I need to access is in a little box about user profiles that pops up when the mouse hovers over the user's profile picture. The problem is that this information is not available in the HTML; instead, I get the following:
""div class="username mo"
span class="expand_inline scrname mbrName_1586A02614A388AEE215B4A3139A2C18" onclick="ta.trackEventOnPage('Reviews', 'show_reviewer_info_window', 'user_name_name_click')">Sapphire-Ed
""
(I have deleted some of the >s so that the html will show up in the question, sorry!)
Can anyone tell me how to do this? Thank you for the help!!
Here is the webpage if that is helpful:
view-source:http://www.tripadvisor.com/Attraction_Review-g143010-d108269-Reviews-Cadillac_Mountain-Acadia_National_Park_Mount_Desert_Island_Maine.html
The information I am trying to access is the review distribution.
Below is complete working code that outputs a dictionary where the keys are usernames and the values are review distributions. To understand how the code works, here are the key points to take into account:
the information in the overlay that appears on mouseover is loaded dynamically with an HTTP GET request that carries a number of user-specific parameters; the most important are uid and src
the uid and src values can be extracted with a regular expression from the id attribute of every user profile element
the response to this GET request is HTML, which you also need to parse with BeautifulSoup
you should maintain the web-scraping session with requests.Session
The code:
import re
from pprint import pprint
import requests
from bs4 import BeautifulSoup
data = {}
# this pattern extracts the uid and src needed to make the GET request
pattern = re.compile(r"UID_(\w+)-SRC_(\w+)")
# making a web-scraping session
with requests.Session() as session:
    # use the session (not requests.get) so cookies are shared between requests
    response = session.get("http://www.tripadvisor.com/Attraction_Review-g143010-d108269-Reviews-Cadillac_Mountain-Acadia_National_Park_Mount_Desert_Island_Maine.html")
    soup = BeautifulSoup(response.content, "lxml")
    # iterating over usernames on the page
    for member in soup.select("div.member_info div.memberOverlayLink"):
        # extracting uid and src from the `id` attribute
        match = pattern.search(member['id'])
        if match:
            username = member.find("div", class_="username").text.strip()
            uid, src = match.groups()
            # making a GET request for the overlay information
            response = session.get("http://www.tripadvisor.com/MemberOverlay", params={
                "uid": uid,
                "src": src,
                "c": "",
                "fus": "false",
                "partner": "false",
                "LsoId": ""
            })
            # getting the grades dictionary
            soup_overlay = BeautifulSoup(response.content, "lxml")
            data[username] = {grade_type: soup_overlay.find("span", text=grade_type).find_next_sibling("span", class_="numbersText").text.strip(" ()")
                              for grade_type in ["Excellent", "Very good", "Average", "Poor", "Terrible"]}
pprint(data)
Prints:
{'Anna T': {'Average': '2',
'Excellent': '0',
'Poor': '0',
'Terrible': '0',
'Very good': '2'},
'Arlyss T': {'Average': '0',
'Excellent': '6',
'Poor': '0',
'Terrible': '0',
'Very good': '1'},
'Bf B': {'Average': '1',
'Excellent': '22',
'Poor': '0',
'Terrible': '0',
'Very good': '17'},
'Charmingnl': {'Average': '15',
'Excellent': '109',
'Poor': '4',
'Terrible': '4',
'Very good': '45'},
'Jackie M': {'Average': '2',
'Excellent': '10',
'Poor': '0',
'Terrible': '0',
'Very good': '4'},
'Jonathan K': {'Average': '69',
'Excellent': '90',
'Poor': '6',
'Terrible': '0',
'Very good': '154'},
'Sapphire-Ed': {'Average': '8',
'Excellent': '47',
'Poor': '2',
'Terrible': '0',
'Very good': '49'},
'TundraJayco': {'Average': '14',
'Excellent': '59',
'Poor': '0',
'Terrible': '1',
'Very good': '49'},
'Versrii': {'Average': '2',
'Excellent': '8',
'Poor': '0',
'Terrible': '0',
'Very good': '10'},
'tripavisor83': {'Average': '12',
'Excellent': '9',
'Poor': '1',
'Terrible': '0',
'Very good': '20'}}
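The uid/src extraction step from the answer can be tried in isolation, without any network access. In the sketch below the sample_id value is hypothetical (the real page's id attributes are not shown in the question); only the pattern and the parameter names are taken from the answer's code.

```python
import re

# Same pattern as in the answer.
pattern = re.compile(r"UID_(\w+)-SRC_(\w+)")

# Hypothetical id attribute value, made up for illustration.
sample_id = "UID_1586A02614A388AEE215B4A3139A2C18-SRC_175475963"

match = pattern.search(sample_id)
uid, src = match.groups()

# Query parameters as they would be passed to session.get(..., params=params).
params = {"uid": uid, "src": src, "c": "", "fus": "false", "partner": "false", "LsoId": ""}
print(params)

# Elements whose id does not follow the pattern yield no match and are skipped.
no_match = pattern.search("some_other_id")
```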

How to update dict items inside a list of dicts

I have this list of dicts that I'm maintaining as a master list:
orig_list = [
    { 'cpu': '4', 'mem': '4', 'name': 'server1', 'drives': '4', 'nics': '1' },
    { 'cpu': '1', 'mem': '2', 'name': 'server2', 'drives': '2', 'nics': '2' },
    { 'cpu': '2', 'mem': '8', 'name': 'server3', 'drives': '1', 'nics': '1' }
]
However, I need to perform actions on things inside this list of dicts, like:
def modifyVM(local_list):
    local_temp_list = []
    for item in local_list:
        '''
        Tons of VM processy things happen here.
        '''
        item['cpu'] = 4
        item['notes'] = 'updated cpu'
        local_temp_list.append(item)
    return local_temp_list
temp_list = []
for item in orig_list:
    # cpu is stored as a string, so convert it before comparing
    if int(item['cpu']) < 4:
        temp_list.append(item)
result_list = modifyVM(temp_list)
At this point, result_list contains:
result_list = [
    { 'cpu': '4', 'mem': '2', 'name': 'server2', 'drives': '2', 'nics': '2' },
    { 'cpu': '4', 'mem': '8', 'name': 'server3', 'drives': '1', 'nics': '1' }
]
So my questions are:
1) What is the most efficient way to update orig_list with the results of result_list? I'm hoping to end up with:
orig_list = [
    { 'cpu': '4', 'mem': '4', 'name': 'server1', 'drives': '4', 'nics': '1' },
    { 'cpu': '4', 'mem': '2', 'name': 'server2', 'drives': '2', 'nics': '2', 'notes': 'updated cpu' },
    { 'cpu': '4', 'mem': '8', 'name': 'server3', 'drives': '1', 'nics': '1', 'notes': 'updated cpu' }
]
2) Is there a way to update orig_list without ever creating secondary lists?
Thank you in advance.
Collections store references to objects.
So the code you posted is already modifying the items in "orig_list" as well, because all the lists store references to the same original dictionaries.
As for the second part of your question, you don't need to create a new list. You can modify the objects directly, and the next time you iterate over the list you'll see the updated values.
For example:
orig_list = [
{ 'cpu': 4, 'mem': '4', 'name': 'server1', 'drives': '4', 'nics': '1' },
{ 'cpu': 1, 'mem': '2', 'name': 'server2', 'drives': '2', 'nics': '2' },
{ 'cpu': 2, 'mem': '8', 'name': 'server3', 'drives': '1', 'nics': '1' }
]
print(orig_list)
for item in orig_list:
    if item['cpu'] < 4:
        item['cpu'] = 4
print(orig_list)
Output of first print:
[{'mem': '4', 'nics': '1', 'drives': '4', 'cpu': 4, 'name': 'server1'},
{'mem': '2', 'nics': '2', 'drives': '2', 'cpu': 1, 'name': 'server2'},
{'mem': '8', 'nics': '1', 'drives': '1', 'cpu': 2, 'name': 'server3'}]
And second print:
[{'mem': '4', 'nics': '1', 'drives': '4', 'cpu': 4, 'name': 'server1'},
{'mem': '2', 'nics': '2', 'drives': '2', 'cpu': 4, 'name': 'server2'},
{'mem': '8', 'nics': '1', 'drives': '1', 'cpu': 4, 'name': 'server3'}]
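The point about shared references can be checked directly. The sketch below uses shortened dictionaries and a filtered secondary list like the one in the question; the name same_objects is just for this illustration. Updating the items through temp_list changes what orig_list shows, because both lists hold the same dict objects.

```python
orig_list = [
    {'cpu': 4, 'name': 'server1'},
    {'cpu': 1, 'name': 'server2'},
    {'cpu': 2, 'name': 'server3'},
]

# A secondary list that holds references to some of the same dicts.
temp_list = [item for item in orig_list if item['cpu'] < 4]

for item in temp_list:
    item['cpu'] = 4
    item['notes'] = 'updated cpu'

# Every dict in temp_list is the very same object as one in orig_list.
same_objects = all(any(t is o for o in orig_list) for t in temp_list)

print(orig_list)
print(same_objects)
```

If independent copies are wanted instead, copy.deepcopy(orig_list) creates new dictionaries that can be changed without affecting the originals.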
No, you don't need to create a separate list; you can update the dictionaries in place with a simple loop.
Just iterate through the list and check whether the value of the cpu key is less than 4. If it is, set the cpu key to 4 and add an extra notes key with the value 'updated cpu'. The value of orig_list after the iteration finishes is the desired result.
>>> orig_list = [{'cpu': 4, 'drives': '4', 'mem': '4', 'name': 'server1', 'nics': '1'},
{'cpu': 1, 'drives': '2', 'mem': '2', 'name': 'server2', 'nics': '2'},
{'cpu': 2, 'drives': '1', 'mem': '8', 'name': 'server3', 'nics': '1'}]
>>> for item in orig_list:
...     if item['cpu'] < 4:
...         item['cpu'] = 4
...         item['notes'] = 'updated cpu'
>>> orig_list
[{'cpu': 4, 'drives': '4', 'mem': '4', 'name': 'server1', 'nics': '1'},
{'cpu': 4, 'drives': '2', 'mem': '2', 'name': 'server2', 'nics': '2', 'notes': 'updated cpu'},
{'cpu': 4, 'drives': '1', 'mem': '8', 'name': 'server3', 'nics': '1', 'notes': 'updated cpu'}]
Thank you for all the input! I flagged eugenioy's post as the answer because he posted first. Both his answer and the one from Rahul Gupta are very efficient ways to update a list of dictionaries.
However, I kept trying other ways because these answers, as efficient as they are, still do one other thing I've always been told is taboo: modifying the list you're iterating over.
Keep in mind that I'm still learning Python. So if some of my "revelations" here are mundane, they are new and "wow" to me. To that effect, I'm adding the answer that I actually ended up implementing.
Here's the finished code:
def modifyVM(local_list, l_orig_list):
    for item in local_list[:]:
        l_orig_list.remove(item)
        '''
        Tons of VM processy things happen here.
        '''
        item['cpu'] = 4
        item['notes'] = 'updated cpu'
        l_orig_list.append(item)
temp_list = []
for item in orig_list[:]:
    # cpu is stored as a string in the question's data, so convert it before comparing
    if int(item['cpu']) < 4:
        temp_list.append(item)
modifyVM(temp_list, orig_list)
I changed this line:
def modifyVM(local_list):
To this line:
def modifyVM(local_list, l_orig_list):
In this way, I'm passing in both the list I want to use as well as the list I want to update.
Next I changed:
for item in local_list :
To this line:
for item in local_list[:] :
This causes "item" to iterate through a slice (copy) of "local_list" that contains everything.
I also added:
l_orig_list.remove(item)
And:
l_orig_list.append(item)
This solved several problems for me.
1) This avoids the potential of modifying any list that's being iterated through.
2) This allows "orig_list" to be updated as processes are happening, which cuts down on the "secondary lists" that are created and maintained.
3) The "orig_list" that's passed into the function and "l_orig_list" are linked variables until a hard assignment (i.e. l_orig_list = 'anything') is made. (Again, thank you to everyone that answered! This was some great "secret sauce" learning for me, and y'all pointed it out.) So, avoiding the "=", I'm able to update "l_orig_list" and have "orig_list" updated as well.
4) This also allows the movement of items from one list to another if needed (i.e. list items that generate errors can be removed from "orig_list" and placed in any other list, like "bad_list" for example).
In closing, I'd like to give recognition to Steven Rumbalski. When I read your comment, I was like, "Of course!!!" However, I spent 2 days on it before realizing that dictionaries cannot be sorted. I had to narrow down the technical problem I was facing to ask a question here. Sorting was an unstated requirement for other parts of the script. So GREAT suggestion, and I'll probably use that for another script.
