; commentary
[owner]
name=Justin Case
organization=Chilling Inc.
[database]
; more commentary
server=192.0.0.1
port=123
file=something.csv
[third section]
attribute=value,
that extends to
the third line,
but not the fourth
Given the above ini contents, I have to construct a dictionary such that:
{'owner' : {'name' : 'Justin Case','organization' : 'Chilling Inc.'},
'database' : {'server' : '192.0.0.1', 'port' : '123', 'file' : 'something.csv'},
'third section' : {'attribute' : 'multiline value'}}
I realize there is the configuration file parser (configparser), but it is not allowed for this assignment.
Progress at the moment:
with open('ini.txt', encoding='utf8') as data:
    lines = [row for row in data]
lines_nocom = []
for row in lines:
    if not row.startswith(';'):
        lines_nocom.append(row)
dictt = {}
I removed the rows with commentary in them since they are unnecessary.
How can I make Python recognize the sections and their respective attributes?
I.e. section1 could have 2 attributes and section2 could have any number of attributes.
If I do [row for row in lines_nocom], then how does it recognize where one section ends and another begins?
And how do I make Python recognize a multiline value?
Track the current section and add your keys to that; each time you find a line using square brackets create a new section.
For continuation lines, do something similar; track the last used name:
with open('ini.txt', encoding='utf8') as data:
    section = None  # current section
    name = None  # current name being stored
    result = {}
    for line in data:
        line = line.strip()
        if not line or line.startswith(';'):
            # skip comments and empty lines
            continue
        if line.startswith('[') and line.endswith(']'):
            # new section
            section_name = line.strip('[]')
            section = result[section_name] = {}
            continue
        # add entries to the existing section
        if '=' in line:
            name, _, value = line.partition('=')
            name = name.strip()
            section[name] = value.strip()
        else:
            # adding to last-used name
            section[name] += ' ' + line
Demo:
>>> from io import StringIO
>>> from pprint import pprint
>>> sample = StringIO('''\
... ; commentary
... [owner]
... name=Justin Case
... organization=Chilling Inc.
...
... [database]
... ; more commentary
... server=192.0.0.1
... port=123
... file=something.csv
...
... [third section]
... attribute=value,
... that extends to
... the third line,
... but not the fourth
... ''')
>>> section = None # current section
>>> name = None # current name being stored
>>> result = {}
>>> for line in sample:
...     line = line.strip()
...     if not line or line.startswith(';'):
...         # skip comments and empty lines
...         continue
...     if line.startswith('[') and line.endswith(']'):
...         # new section
...         section_name = line.strip('[]')
...         section = result[section_name] = {}
...         continue
...     # add entries to the existing section
...     if '=' in line:
...         name, _, value = line.partition('=')
...         name = name.strip()
...         section[name] = value.strip()
...     else:
...         # adding to last-used name
...         section[name] += ' ' + line
...
>>> pprint(result)
{'database': {'file': 'something.csv', 'port': '123', 'server': '192.0.0.1'},
 'owner': {'name': 'Justin Case', 'organization': 'Chilling Inc.'},
 'third section': {'attribute': 'value, that extends to the third line, but '
                                'not the fourth'}}
Related
So, from an input.txt file, I would like to create two dictionaries.
For example, here is a sample of the input.txt file:
#. VAR #first=Billy
#. VAR #last=Bob
#. PRINT VARS
#. VAR #petName=Gato
#. VAR #street="1234 Home Street"
#. VAR #city="New York"
#. VAR #state=NY
#. VAR #zip=21236
#. VAR #title=Dr.
#. PRINT VARS
#. FORMAT LM=5 JUST=LEFT
#. PRINT FORMAT
So the pattern is VAR #varName=value;
i.e. in the case of #first=Billy you would get something like varDict = {"first": "Billy"}, right?
Now I want to know how to do that through the entire file.
There are two dictionaries that I would need to populate: one for the variables, and one for FORMAT, which just holds values and doesn't actually do anything for now.
As for the desired output: in the input file there are commands that, when read, trigger either adding variables to the dictionary, printing that dictionary, or adding to the format dictionary. I would use the pprint function like this: pprint.pprint(varDict, width=30), and it would output something like this:
{'first': 'Billy',
 'last': 'Bob'}
{'city': 'New York',
 'first': 'Billy',
 'last': 'Bob',
 'petName': 'Gato',
 'state': 'NY',
 'street': '1234 Home Street',
 'title': 'Dr.',
 'zip': '21236'}
{'BULLET': 'o',
 'FLOW': 'YES',
 'JUST': 'LEFT',
 'LM': '5',
 'RM': '80'}
Unfortunately I keep getting errors all over the place in the driver and source file:
AttributeError: 'list' object has no attribute 'groups'
TypeError: expected string or buffer
Driver.py
input=(sys.argv[1])
# Group 1. VAR
# Group 2. #first=Mae or JUST=RIGHT FLOW=NO
# pass Group 2 as atString
regexSearch = re.compile(r'^#. ([A-Z]+) (.*)', re.MULTILINE)
regexPrintVAR = re.compile(r'^#\.\s*PRINT\s(VARS)', re.MULTILINE)
regexPrintFORMAT = re.compile(r'^#\.\s*PRINT\s(FORMAT)', re.MULTILINE)
regexERRCheck = re.compile(r'^#\.\s*FORMAT\s+BAD', re.MULTILINE)
varDictionary = dict()
formatDictionary = {"FLOW":"YES", "LM":"1", "RM":"80", "JUST":"LEFT", "BULLET":"o"}
file = open(input, "r")
while True:
    inputLine = file.readline()
    matchObj = regexSearch.search(inputLine)
    command, atString = matchObj.groups()
    if command == "VAR":
        setVariable(atString, varDictionary)
    if command == "FORMAT":
        formatListERR = regexERRCheck.search(inputLine)
        if formatListERR != None:
            print("*** Not a recognizable command")
            line = file.readline()
        setFormat(atString, formatDictionary)
    if command == "PRINT":
        printVARObj = regexPrintVAR.search(inputLine)
        printFormatObj = regexPrintFORMAT.search(inputLine)
        if printVARObj != None:
            pprint.pprint(varDictionary, width=30)
        elif printFormatObj != None:
            pprint.pprint(formatDict, width=30)
    inputLine = file.readline()
file.close()
importFileIUse.py
# The atString is the remainder of the string after the VAR or FORMAT key word.
varDictionary = dict()
formatDictionary = {"FLOW":"YES", "LM":"1", "RM":"80", "JUST":"LEFT", "BULLET":"o"}

def setFormat(atString, formatDictionary):
    regexFormat = re.compile(r'((?:(?:\w+)=(?:\w+)\s*)*)$')
    line = re.split(" +", atString)
    formatList = regexFormat.search(line)
    if formatList:
        for param in formatList[0].split():
            splitParam = param.split('=')
            formatDictionary[splitParam[0]] = splitParam[1]

def setVariable(atString, varDictionary):
    regexVAR = re.compile(r'#(\w+)=(\w+|.*)\s*$', re.MULTILINE)
    # file = open(input)
    # line = file.readline()
    # line = re.split(" +", atString)
    #while line:
    varList = regexVAR.findall(atString)
    for key, value in varList:
        varDictionary[key] = value
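For comparison, here is a minimal sketch of a simpler driver loop over the same kind of input. This is a hedged sketch rather than the assignment's required structure; skipping non-matching lines before calling .groups(), and handling VAR/FORMAT with lstrip/partition instead of the helper functions, are assumptions based only on the sample lines above:

import re
import sys
import pprint

regexCommand = re.compile(r'^#\.\s*([A-Z]+)\s*(.*)$')

varDictionary = {}
formatDictionary = {"FLOW": "YES", "LM": "1", "RM": "80", "JUST": "LEFT", "BULLET": "o"}

with open(sys.argv[1]) as fh:
    for rawLine in fh:
        matchObj = regexCommand.match(rawLine)
        if matchObj is None:
            continue  # not a '#.' command line, so skip it
        command, atString = matchObj.groups()
        if command == "VAR":
            # atString looks like '#first=Billy' or '#street="1234 Home Street"'
            name, _, value = atString.lstrip('#').partition('=')
            varDictionary[name] = value.strip('"')
        elif command == "FORMAT":
            # atString looks like 'LM=5 JUST=LEFT'
            for pair in atString.split():
                key, _, value = pair.partition('=')
                formatDictionary[key] = value
        elif command == "PRINT":
            target = varDictionary if atString.strip() == "VARS" else formatDictionary
            pprint.pprint(target, width=30)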
I have files with CommonChar in some of them, and my Python code works on them to build a dictionary. While building, there are some required keys which users might forget to put in. The code should be able to flag the file and the key which is missing.
The syntax for python code to work on is like this:
CommonChar pins Category General
CommonChar pins Contact Mark
CommonChar pins Description 1st line
CommonChar pins Description 2nd line
CommonChar nails Category specific
CommonChar nails Description 1st line
So for the above example, this "Contact" entry is missing:
CommonChar nails Contact Robert
I have a list of required keys, for example: mustNeededKeys = ["Category", "Description", "Contact"]
mainDict = {}
for dirName, subdirList, fileList in os.walk(sys.argv[1]):
    for eachFile in fileList:
        # excluding file names ending in .swp, .swo which are created temporarily when editing in vim
        if not eachFile.endswith(('.swp', '.swo', '~')):
            #print eachFile
            filePath = os.path.join(dirName, eachFile)
            #print filePath
            with open(filePath, "r") as fh:
                contents = fh.read()
            items = re.findall("CommonChar.*$", contents, re.MULTILINE)
            for x in items:
                cc, group, topic, data = x.split(None, 3)
                data = data.split()
                group_dict = mainDict.setdefault(group, {'fileLocation': [filePath]})
                if topic in group_dict:
                    group_dict[topic].extend(['</br>'] + data)
                else:
                    group_dict[topic] = data
The above code does its job of building a dict like this:
{'pins': {'Category': ['General'], 'Contact': ['Mark'], 'Description': ['1st', 'line', '2nd', 'line']}, 'nails': {'Category': ['specific'], 'Description': ['1st', 'line']}}
So when reading each file with CommonChar and building a group_dict, I need a way to check all the keys against mustNeededKeys, flag the group if any are missing, and proceed if they are all present.
Something like this should work:
# Setup mainDict (equivalent to code given above)
mainDict = {
    'nails': {
        'Category': ['specific'],
        'Description': ['1st', 'line'],
        'fileLocation': ['/some/path/nails.txt']
    },
    'pins': {
        'Category': ['General'],
        'Contact': ['Mark'],
        'Description': ['1st', 'line', '</br>', '2nd', 'line'],
        'fileLocation': ['/some/path/pins.txt']
    }
}

# check for missing keys
mustNeededKeys = {"Category", "Description", "Contact"}
for group, group_dict in mainDict.items():
    missing_keys = mustNeededKeys - set(group_dict.keys())
    if missing_keys:
        missing_key_list = ','.join(missing_keys)
        print(
            'group "{}" ({}) is missing key(s): {}'
            .format(group, group_dict['fileLocation'][0], missing_key_list)
        )
# group "nails" (/some/path/nails.txt) is missing key(s): Contact
If you must check for missing keys immediately after processing each group, you could use the code below. This assumes that each group is stored as a contiguous collection of rows in a single file (i.e., not mixed with other groups in the same file or spread across different files).
from itertools import groupby

mainDict = {}
mustNeededKeys = {"Category", "Description", "Contact"}
for dirName, subdirList, fileList in os.walk(sys.argv[1]):
    for eachFile in fileList:
        # excluding file names ending in .swp , swo which are created
        # temporarily when editing in vim
        if not eachFile.endswith(('.swp', '.swo', '~')):
            #print eachFile
            filePath = os.path.join(dirName, eachFile)
            #print filePath
            with open(filePath, "r") as fh:
                contents = fh.read()
            items = re.findall("CommonChar.*$", contents, re.MULTILINE)
            split_items = [line.split(None, 3) for line in items]
            # group the items by group name (element 1 in each row)
            for g, group_items in groupby(split_items, lambda row: row[1]):
                group_dict = {'fileLocation': [filePath]}
                # store all items in the current group
                for cc, group, topic, data in group_items:
                    data = data.split()
                    if topic in group_dict:
                        group_dict[topic].extend(['</br>'] + data)
                    else:
                        group_dict[topic] = data
                # check for missing keys
                missing_keys = mustNeededKeys - set(group_dict.keys())
                if missing_keys:
                    missing_key_list = ','.join(missing_keys)
                    print(
                        'group "{}" ({}) is missing key(s): {}'
                        .format(group, filePath, missing_key_list)
                    )
                # add group to mainDict
                mainDict[group] = group_dict
data = '''CommonChar pins Category General
CommonChar pins Contact Mark
CommonChar pins Description 1st line
CommonChar pins Description 2nd line
CommonChar nails Category specific
CommonChar nails Description 1st line'''
from collections import defaultdict
from pprint import pprint
required_keys = ["Category", "Description", "Contact"]
d = defaultdict(dict)
for line in data.splitlines():
    line = line.split()
    if line[2] == 'Description':
        if line[2] not in d[line[1]]:
            d[line[1]][line[2]] = []
        d[line[1]][line[2]].extend(line[3:])
    else:
        d[line[1]][line[2]] = [line[3]]
pprint(dict(d))
print('*' * 80)
# find missing keys
for k in d.keys():
    for missing_key in set(d[k].keys()) ^ set(required_keys):
        print('Key "{}" is missing "{}"!'.format(k, missing_key))
Prints:
{'nails': {'Category': ['specific'], 'Description': ['1st', 'line']},
 'pins': {'Category': ['General'],
          'Contact': ['Mark'],
          'Description': ['1st', 'line', '2nd', 'line']}}
********************************************************************************
Key "nails" is missing "Contact"!
I'm trying to parse the item names and their corresponding values from the snippet below. The dt tags hold the names and the dd tags contain the values. A few dt tags do not have a corresponding dd, so not all names have values. What I wish to do is keep the value blank for any name that doesn't have one.
These are the elements I would like to scrape data from:
content="""
<div class="movie_middle">
<dl>
<dt>Genres:</dt>
<dt>Resolution:</dt>
<dd>1920*1080</dd>
<dt>Size:</dt>
<dd>1.60G</dd>
<dt>Quality:</dt>
<dd>1080p</dd>
<dt>Frame Rate:</dt>
<dd>23.976 fps</dd>
<dt>Language:</dt>
</dl>
</div>
"""
I've tried like below:
soup = BeautifulSoup(content,"lxml")
title = [item.text for item in soup.select(".movie_middle dt")]
result = [item.text for item in soup.select(".movie_middle dd")]
vault = dict(zip(title,result))
print(vault)
It gives me messy results (wrong pairs):
{'Genres:': '1920*1080', 'Resolution:': '1.60G', 'Size:': '1080p', 'Quality:': '23.976 fps'}
My expected result:
{'Genres:': '', 'Resolution:': '1920*1080', 'Size:': '1.60G', 'Quality:': '1080p','Frame Rate:':'23.976 fps','Language:':''}
Any help on fixing the issue will be highly appreciated.
You can loop through the elements inside dl. If the current element is dt and the next element is dd, then store the value as the next element, else set the value as empty string.
dl = soup.select('.movie_middle dl')[0]
elems = dl.find_all() # Returns the list of dt and dd
data = {}
for i, el in enumerate(elems):
    if el.name == 'dt':
        key = el.text.replace(':', '')
        # check if the next element is a `dd`
        if i < len(elems) - 1 and elems[i+1].name == 'dd':
            data[key] = elems[i+1].text
        else:
            data[key] = ''
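For reference, running this against the sample content (assuming soup was built with BeautifulSoup(content, "lxml") as in the question) should yield roughly:

{'Genres': '', 'Resolution': '1920*1080', 'Size': '1.60G', 'Quality': '1080p', 'Frame Rate': '23.976 fps', 'Language': ''}

Note that the colons are stripped from the keys here; drop the replace(':', '') call if the keys should keep them.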
You can use BeautifulSoup to parse the dl structure, and then write a function to create the dictionary:
from bs4 import BeautifulSoup as soup
import re
def parse_result(d):
    while d:
        a, *_d = d
        if _d:
            if re.findall('\<dt', a) and re.findall('\<dd', _d[0]):
                yield [a[4:-5], _d[0][4:-5]]
                d = _d[1:]
            else:
                yield [a[4:-5], '']
                d = _d
        else:
            yield [a[4:-5], '']
            d = []
print(dict(parse_result(list(filter(None, str(soup(content, 'html.parser').find('dl')).split('\n')))[1:-1])))
Output:
{'Genres:': '', 'Resolution:': '1920*1080', 'Size:': '1.60G', 'Quality:': '1080p', 'Frame Rate:': '23.976 fps', 'Language:': ''}
For a slightly longer, although cleaner solution, you can create a decorator to strip the HTML tags of the output, thus removing the need for the extra string slicing in the main parse_result function:
def strip_tags(f):
    def wrapper(data):
        return {a[4:-5]: b[4:-5] for a, b in f(data)}
    return wrapper

@strip_tags
def parse_result(d):
    while d:
        a, *_d = d
        if _d:
            if re.findall('\<dt', a) and re.findall('\<dd', _d[0]):
                yield [a, _d[0]]
                d = _d[1:]
            else:
                yield [a, '']
                d = _d
        else:
            yield [a, '']
            d = []
print(parse_result(list(filter(None, str(soup(content, 'html.parser').find('dl')).split('\n')))[1:-1]))
Output:
{'Genres:': '', 'Resolution:': '1920*1080', 'Size:': '1.60G', 'Quality:': '1080p', 'Frame Rate:': '23.976 fps', 'Language:': ''}
from collections import defaultdict
test = soup.text.split('\n')
d = defaultdict(list)
for i in range(len(test)):
    if (':' in test[i]) and (':' not in test[i+1]):
        d[test[i]] = test[i+1]
    elif ':' in test[i]:
        d[test[i]] = ''
d
defaultdict(list,
            {'Frame Rate:': '23.976 fps',
             'Genres:': '',
             'Language:': '',
             'Quality:': '1080p',
             'Resolution:': '1920*1080',
             'Size:': '1.60G'})
The logic here is that you know every key will have a colon. Knowing this, you can write an if/else statement to capture the unique combinations, whether that is a key followed by a key or a key followed by a value.
Edit:
In case you wanted to clean your keys, below replaces the : in each one:
d1 = { x.replace(':', ''): d[x] for x in d.keys() }
d1
{'Frame Rate': '23.976 fps',
 'Genres': '',
 'Language': '',
 'Quality': '1080p',
 'Resolution': '1920*1080',
 'Size': '1.60G'}
The problem is that empty elements are not present. Since there is no hierarchy between the <dt> and the <dd>, I'm afraid you'll have to craft the dictionary yourself.
vault = {}
category = ""
for item in soup.find("dl").findChildren():
    if item.name == "dt":
        if category == "":
            category = item.text
        else:
            # the previous dt had no dd: store it with an empty value
            vault[category] = ""
            category = item.text
    elif item.name == "dd":
        vault[category] = item.text
        category = ""
# a trailing dt with no dd also gets an empty value
if category:
    vault[category] = ""
Basically this code iterates over the child elements of the <dl> and fills the vault dictionary with the values.
I am trying to read in from a data file that has lines like:
2007 ANDREA 30 31.40 -71.90 05/13/18Z 25 1007 LOW
2007 ANDREA 31 31.80 -69.40 05/14/00Z 25 1007 LOW
I am trying to create a nested dictionary that has a key holding the year and then the nested dictionary will hold the name and a tuple containing statistics. I would like the return value to look like this:
{'2007': {'ANDREA': [(31.4, -71.9, '05/13/18Z', 25.0, 1007.0), (31.8, -69.4, '05/14/00Z', 25.0, 1007.0)]}}
However, when I run the code it returns only one set of statistics. It seems to be overwriting itself, because I am getting only the last line of statistics from the txt file returned:
{'2007': {'ANDREA': [(31.8, -69.4, '05/14/00Z', 25.0, 1007.0)]}}
Here is the code:
def create_dictionary(fp):
    '''Remember to put a docstring here'''
    dict1 = {}
    f = []
    for line in fp:
        a = line.split()
        f.append(a)
    for item in f:
        a = (float(item[3]), float(item[4]), item[5], float(item[6]),
             float(item[7]))
        dict1 = update_dictionary(dict1, item[0], item[1], a)
    print(dict1)

def update_dictionary(dictionary, year, hurricane_name, data):
    if year not in dictionary:
        dictionary[year] = {}
        if hurricane_name not in dictionary:
            dictionary[year][hurricane_name] = [data]
        else:
            dictionary[year][hurricane_name].append(data)
    else:
        if hurricane_name not in dictionary:
            dictionary[year][hurricane_name] = [data]
        else:
            dictionary[year][hurricane_name].append(data)
    return dictionary
These lines:
if hurricane_name not in dictionary:
...should be:
if hurricane_name not in dictionary[year]:
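With that change the two outer branches become identical, so one way the corrected helper could look (a sketch based on the question's function, not the only possible structure):

def update_dictionary(dictionary, year, hurricane_name, data):
    if year not in dictionary:
        dictionary[year] = {}
    if hurricane_name not in dictionary[year]:
        dictionary[year][hurricane_name] = [data]
    else:
        dictionary[year][hurricane_name].append(data)
    return dictionary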
Since I was a little late, here's a suggestion instead of an answer to your original question. You can simplify the logic a bit because when the year doesn't exist, then the name also can't exist for that year. Everything can be put in a single function, and using a "with" statement to open the file will ensure it is properly closed even if your program encounters an error.
def build_dict(file_path):
    result = {}
    with open(file_path, 'r') as f:
        for line in f:
            items = line.split()
            year, name, data = items[0], items[1], tuple(items[2:])
            if year in result:
                if name in result[year]:
                    result[year][name].append(data)
                else:
                    result[year][name] = [data]
            else:
                result[year] = {name: [data]}
    return result

print(build_dict(file_path))
Output:
{'2007': {'ANDREA': [('30', '31.40', '-71.90', '05/13/18Z', '25', '1007', 'LOW'), ('31', '31.80', '-69.40', '05/14/00Z', '25', '1007', 'LOW')]}}
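If the tuples should match the question's expected output (floats, and without the storm-number column), the data value can be built the same way the question's create_dictionary does it, for example:

data = (float(items[3]), float(items[4]), items[5], float(items[6]), float(items[7]))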
There are four keywords: title, blog, tags, state
Excess keyword occurrences are being removed from their respective matches.
Example: for the line "blog: blog state title tags and", the matcher returns "state title tags and" instead of "blog state title tags and".
The sub function should be matching .+ after it sees blog:, so I don't know why it treats blog as an exception to .+
Regex:
re.sub(r'((^|\n|\s|\b)(title|blog|tags|state)(\:\s).+(\n|$))', matcher, a)
Code:
def n15():
    import re
    a = """blog: blog: fooblog
state: private
title: this is atitle bun
and text"""
    kwargs = {}
    def matcher(string):
        v = string.group(1).replace(string.group(2), '').replace(string.group(3), '').replace(string.group(4), '').replace(string.group(5), '')
        if string.group(3) == 'title':
            kwargs['title'] = v
        elif string.group(3) == 'blog':
            kwargs['blog_url'] = v
        elif string.group(3) == 'tags':
            kwargs['comma_separated_tags'] = v
        elif string.group(3) == 'state':
            kwargs['post_state'] = v
        return ''
    a = re.sub(r'((^|\n|\s|\b)(title|blog|tags|state)(\:\s).+(\n|$))', matcher, a)
    a = a.replace('\n', '<br />')
    a = a.replace('\r', '')
    a = a.replace('"', r'\"')
    a = '<p>' + a + '</p>'
    kwargs['body'] = a
    print kwargs
Output:
{'body': '<p>and text</p>', 'post_state': 'private', 'blog_url': 'foo', 'title': 'this is a bun'}
Edit:
Desired Output:
{'body': '<p>and text</p>', 'post_state': 'private', 'blog_url': 'fooblog', 'title': 'this is atitle bun'}
replace(string.group(3), '')
is replacing all occurrences of 'blog' with ''.
Rather than try to replace all the other parts of the matched string, which will be hard to get right, I suggest capture the string you actually want in the original match.
r'((^|\n|\s|\b)(title|blog|tags|state)(\:\s)(.+)(\n|$))'
which has () around the .+ to capture that part of the string, then
v = string.group(5)
at the start of matcher.
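For completeness, here is the matcher with that change applied; it is otherwise identical to the question's code, and the group numbering assumes the modified regex above:

def matcher(string):
    v = string.group(5)  # the captured value, e.g. 'private'
    if string.group(3) == 'title':
        kwargs['title'] = v
    elif string.group(3) == 'blog':
        kwargs['blog_url'] = v
    elif string.group(3) == 'tags':
        kwargs['comma_separated_tags'] = v
    elif string.group(3) == 'state':
        kwargs['post_state'] = v
    return ''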