I have a system log that looks like the following:
{
a = 1
b = 2
c = [
x:1,
y:2,
z:3,
]
d = 4
}
I want to parse this in Python into a dictionary object with = splitting key-value pairs. At the same time, the array that is enclosed by [] is also preserved. I want to keep this as generic as possible so the parsing can also hold some future variations.
What I tried so far (code will be written): split each line by "=" into a key-value pair, determine where [ and ] start and end, and then split the lines in between by ":" into key-value pairs. That seems a little hard-coded. Any better ideas?
This could be pretty easily simplified to YAML. pip install pyyaml, then set up like so:
import string, yaml
data = """
{
a = 1
b = 2
c = [
x:1,
y:2,
z:3,
]
d = 4
}
"""
With this setup, you can use the following to parse your data:
data2 = data.replace(":", ": ").replace("=", ":").replace("[","{").replace("]","}")
lines = data2.splitlines()
for i, line in enumerate(lines):
    if len(line)>0 and line[-1] in string.digits and not line.endswith(",") or i < len(lines) - 1 and line.endswith("}"):
        lines[i] += ","
data3 = "\n".join(lines)
yaml.safe_load(data3) # {'a': 1, 'b': 2, 'c': {'x': 1, 'y': 2, 'z': 3}, 'd': 4}
Explanation
In the first line, we perform some simple substitutions:
YAML requires that there is a space after colons in key/value pairs. So with replace(":", ": "), we can ensure this.
Since YAML key/value pairs are always denoted by a colon and your format sometimes uses equals signs, we replace equals signs with colons using .replace("=", ":")
Your format sometimes uses square brackets where curly brackets should be used in YAML. We fix using .replace("[","{").replace("]","}")
At this point, your data looks like this:
{
a : 1
b : 2
c : {
x: 1,
y: 2,
z: 3,
}
d : 4
}
Next, we have a for loop. This is simply responsible for adding commas after lines where they're missing. The two cases in which commas are missing are:
- They're absent after a numeric value
- They're absent after a closing bracket
We match the first of these cases using len(line)>0 and line[-1] in string.digits (the last character in the line is a digit)
The second case is matched using i < len(lines) - 1 and line.endswith("}"). This checks if the line ends with }, and also checks that the line is not the last, since YAML won't allow a comma after the last bracket.
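The two cases can also be written as a small helper, which makes the and/or grouping explicit. This is only a sketch using the standard library; the name needs_comma and the sample lines are made up for illustration and are not part of the original answer.

```python
import string

def needs_comma(line, is_last_line):
    """Return True if a comma should be appended to this line."""
    # Case 1: the line ends with a digit (a line already ending in "," cannot match).
    ends_with_digit = len(line) > 0 and line[-1] in string.digits
    # Case 2: the line closes a block and is not the final closing bracket.
    closes_block = line.endswith("}") and not is_last_line
    return ends_with_digit or closes_block

lines = ["{", "a : 1", "c : {", "x: 1,", "}", "d : 4", "}"]
fixed = [
    line + "," if needs_comma(line, i == len(lines) - 1) else line
    for i, line in enumerate(lines)
]
print(fixed)
```

Only the value lines without a comma and the inner closing bracket are changed; the last line is left alone.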
After the loop, we have:
{
a : 1,
b : 2,
c : {
x: 1,
y: 2,
z: 3,
},
d : 4,
}
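which is valid YAML. All that's left is to load it (yaml.safe_load is the safe choice; recent PyYAML versions require a Loader argument for plain yaml.load), and you've got yourself a Python dict.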
If anything isn't clear please leave a comment and I'll happily elaborate.
There is probably a better answer, but I would take advantage of all your dictionary keys being at the same indentation level. There's no obvious way to do this with newline splitting, JSON loading, or that sort of thing, since the list structure is a bit weird (it seems like a cross between a list and a dictionary).
Here's an implementation that parses keys based on indentation level:
import re
log = '''{
a = 1
b = 2
c = [
x:1,
y:2,
z:3,
]
d = 4
}'''
log_lines = log.split('\n')[1:-1] # strip bracket lines
KEY_REGEX = re.compile(r' [^ ]')
d = {}
current_pair = ''
for i, line in enumerate(log_lines):
    if KEY_REGEX.match(line):
        if current_pair:
            key, value = current_pair.split('=')
            d[key.strip()] = value.strip()
        current_pair = line
    else:
        current_pair += line.strip()
if current_pair:
    key, value = current_pair.split('=')
    d[key.strip()] = value.strip()
print(d)
Output:
{'d': '4', 'c': '[x:1,y:2,z:3,]', 'a': '1', 'b': '2'}
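In the output above every value is still a string, including the bracketed block. If a nested dictionary is wanted, one possible follow-up step is sketched below. The helper name parse_bracket_value is hypothetical, and the sketch assumes values are either integers, plain strings, or a flat [key:value,...] block like the one in the question.

```python
def parse_bracket_value(raw):
    """Convert '[x:1,y:2,z:3,]' to a dict; other values become int when possible."""
    raw = raw.strip()
    if raw.startswith('[') and raw.endswith(']'):
        # Drop the brackets, split on commas, and ignore the empty trailing piece.
        items = [item for item in raw[1:-1].split(',') if item.strip()]
        pairs = (item.split(':', 1) for item in items)
        return {key.strip(): parse_bracket_value(value) for key, value in pairs}
    try:
        return int(raw)
    except ValueError:
        return raw  # leave anything that is not an integer as a string

flat = {'d': '4', 'c': '[x:1,y:2,z:3,]', 'a': '1', 'b': '2'}
parsed = {key: parse_bracket_value(value) for key, value in flat.items()}
print(parsed)
```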
Related
Implement the function most_popular_character(my_string), which gets the string argument my_string and returns its most frequent letter. In case of a tie, break it by returning the letter of smaller ASCII value.
Note that lowercase and uppercase letters are considered different (e.g., ‘A’ < ‘a’). You may assume my_string consists of English letters only, and is not empty.
Example 1: >>> most_popular_character("HelloWorld") >>> 'l'
Example 2: >>> most_popular_character("gggcccbb") >>> 'c'
Explanation: cee and gee appear three times each (and bee twice), but cee precedes gee lexicographically.
Hints (you may ignore these):
Build a dictionary mapping letters to their frequency;
Find the largest frequency;
Find the smallest letter having that frequency.
def most_popular_character(my_string):
    char_count = {} # define dictionary
    for c in my_string:
        if c in char_count: #if c is in the dictionary:
            char_count[c] = 1
        else: # if c isn't in the dictionary - create it and put 1
            char_count[c] = 1
    sorted_chars = sorted(char_count) # sort the dictionary
    char_count = char_count.keys() # place the dictionary in a list
    max_per = 0
    for i in range(len(sorted_chars) - 1):
        if sorted_chars[i] >= sorted_chars[i+1]:
            max_per = sorted_chars[i]
            break
    return max_per
my function returns 0 right now, and I think the problem is in the last for loop and if statement - but I can't figure out what the problem is..
If you have any suggestions on how to adjust the code it would be very appreciated!
Your dictionary didn't get off to a good start: you forgot to add 1 to the character count, and instead reset it to 1 each time.
Have a look here to get the gist of getting the maximum value from a dict: https://datagy.io/python-get-dictionary-key-with-max-value/
def most_popular_character(my_string):
    # NOTE: you might want to convert the entire string to upper or lower case first, depending on the use
    # e.g. my_string = my_string.lower()
    char_count = {} # define dictionary
    for c in my_string:
        if c in char_count: # if c is in the dictionary:
            char_count[c] += 1 # add 1 to it
        else: # if c isn't in the dictionary - create it and put 1
            char_count[c] = 1
    # Never underestimate the power of print in debugging
    print(char_count)
    # max(char_count.values()) will give the highest value
    # But there may be more than 1 item with the highest count, so get them all
    max_keys = [key for key, value in char_count.items() if value == max(char_count.values())]
    # Choose the lowest by sorting them and pick the first item
    low_item = sorted(max_keys)[0]
    return low_item, max(char_count.values())
print(most_popular_character("HelloWorld"))
print(most_popular_character("gggcccbb"))
print(most_popular_character("gggHHHAAAAaaaccccbb 12 3"))
Result:
{'H': 1, 'e': 1, 'l': 3, 'o': 2, 'W': 1, 'r': 1, 'd': 1}
('l', 3)
{'g': 3, 'c': 3, 'b': 2}
('c', 3)
{'g': 3, 'H': 3, 'A': 4, 'a': 3, 'c': 4, 'b': 2, ' ': 2, '1': 1, '2': 1, '3': 1}
('A', 4)
So: l and 3, c and 3, A and 4
def most_popular_character(my_string):
    history_l = [l for l in my_string] #each letter in string
    char_dict = {} #creating dict
    for item in history_l: #for each letter in string
        char_dict[item] = history_l.count(item)
    return [max(char_dict.values()),min(char_dict.values())]
I didn't understand the last part of minimum frequency, so I make this function return a maximum frequency and a minimum frequency as a list!
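For reference, the tie-break in the question is not about the minimum frequency; it asks for the smallest letter among those that share the maximum frequency. A minimal sketch of that reading, following the three hints from the question (the variable names are illustrative):

```python
def most_popular_character(my_string):
    # Hint 1: build a dictionary mapping letters to their frequency.
    char_dict = {}
    for item in my_string:
        char_dict[item] = char_dict.get(item, 0) + 1
    # Hint 2: find the largest frequency.
    top = max(char_dict.values())
    # Hint 3: find the smallest letter having that frequency.
    return min(ch for ch, count in char_dict.items() if count == top)

print(most_popular_character("HelloWorld"))
print(most_popular_character("gggcccbb"))
```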
Use a Counter to count the characters, and use the max function to select the "biggest" character according to your two criteria.
>>> from collections import Counter
>>> def most_popular_character(my_string):
...     chars = Counter(my_string)
...     return max(chars, key=lambda c: (chars[c], -ord(c)))
...
>>> most_popular_character("HelloWorld")
'l'
>>> most_popular_character("gggcccbb")
'c'
Note that using max is more efficient than sorting the entire dictionary, because it only needs to iterate over the dictionary once and find the single largest item, as opposed to sorting every item relative to every other item.
I am trying to make my program count everything but numbers in a string, and store it in a dictionary.
So far I have this:
string = str(input("Enter a string: "))
stringUpper = string.upper()
dict = {}
for n in stringUpper:
    keys = dict.keys()
    if n in keys:
        dict[n] += 1
    else:
        dict[n] = 1
print(dict)
I just want the alphabetical characters counted, but I cannot figure out how to exclude the non-alphabetical characters.
Basically there are multiple steps involved:
Getting rid of the chars that you don't want to count
Count the remaining
You have several options available to do these. I'll just present one option, but keep in mind that there might be other (and better) alternatives.
from collections import Counter
the_input = input('Enter something: ')
Counter(char for char in the_input.upper() if char.isalpha())
For example:
Enter something: aashkfze3f8237rhbjasdkvjuhb
Counter({'A': 3,
'B': 2,
'D': 1,
'E': 1,
'F': 2,
'H': 3,
'J': 2,
'K': 2,
'R': 1,
'S': 2,
'U': 1,
'V': 1,
'Z': 1})
So it obviously worked. Here I used collections.Counter to count and a generator expression using str.isalpha as condition to get rid of the unwanted characters.
Note that there are several bad habits in your code that will make your life more complicated than it needs to be:
dict = {} will shadow the built-in dict. So it's better to choose a different name.
string is the name of a built-in module, so here a different name might be better (but not str which is a built-in name as well).
stringUpper = string.upper(). In Python you generally don't use camelCase but use _ to separate words (i.e. string_upper), but since you only use it to loop over, you might as well use for n in string.upper(): directly.
Variable names like n aren't very helpful. Usually you can name them char or character when iterating over a string or item when iterating over a "general" iterable.
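Applied to the loop from the question, the naming advice above could look roughly like this. It is a sketch without Counter; the names count_letters, text and letter_counts are just suggestions.

```python
def count_letters(text):
    """Count alphabetical characters, ignoring case; everything else is skipped."""
    letter_counts = {}  # does not shadow the built-in dict
    for char in text.upper():  # no intermediate stringUpper variable needed
        if char.isalpha():
            letter_counts[char] = letter_counts.get(char, 0) + 1
    return letter_counts

print(count_letters("ab3a!"))
```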
You can use re to remove all non-alphabetical characters before doing any other manipulation:
import re
regex = re.compile('[^a-zA-Z]')
# First parameter is the replacement, second parameter is your input string
letters_only = regex.sub('', stringUpper)
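Put together with the counting step, the regex approach might look like the sketch below. NON_LETTERS and count_ascii_letters are made-up names, and the pattern only keeps ASCII letters, so accented characters would be dropped.

```python
import re
from collections import Counter

NON_LETTERS = re.compile('[^a-zA-Z]')  # anything that is not an ASCII letter

def count_ascii_letters(text):
    """Strip non-letters first, then count what is left, ignoring case."""
    letters_only = NON_LETTERS.sub('', text)
    return Counter(letters_only.upper())

print(count_ascii_letters("a1b2 A!"))
```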
string = str(input("Enter a string: "))
stringUpper = string.upper()
dict = {}
for n in stringUpper:
    if n not in '0123456789':
        keys = dict.keys()
        if n in keys:
            dict[n] += 1
        else:
            dict[n] = 1
print(dict)
for n in stringUpper:
    if n.isalpha():
        # dict.get avoids a KeyError the first time a letter is seen
        dict[n] = dict.get(n, 0) + 1
print(dict)
You can check whether a character is alphanumeric:
n.isalnum()
or alphabetic:
n.isalpha()
So your code will be like:
dict = {}
for n in stringUpper:
    if n.isalpha():
        keys = dict.keys()
        if n in keys:
            dict[n] += 1
        else:
            dict[n] = 1
        print(dict)
    else:
        pass  # do something....
While iterating, check if the lower() and upper() is the same for a character. If they are different from each other, then it is an alphabetical letter.
if n.upper() == n.lower():
    continue
This should do it.
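In context, the check could be used as in the sketch below (count_cased is a hypothetical name). One caveat worth knowing: this test skips letters that have no upper/lower distinction, such as Chinese characters, whereas isalpha() would count them.

```python
def count_cased(text):
    """Count characters whose upper and lower forms differ, ignoring case."""
    counts = {}
    for n in text.upper():
        if n.upper() == n.lower():
            continue  # digits, spaces, punctuation, and caseless letters
        counts[n] = counts.get(n, 0) + 1
    return counts

print(count_cased("Hi 42 hi!"))
```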
I am new to python! I have created a code which successfully opens my text file and sorts my list of 100's of words. I then have put these in a list labelled stimuli_words, which consists of no duplicates words, all lower case etc.
However I now want to convert this list into a dictionary, where the keys are all possible 3 letter endings in my list of words, and the values are the words that correspond to those endings.
For instance 'ing: going, hiring...', but I only want the endings that have more than 40 words corresponding to them. So far I have this code:
from collections import defaultdict
fq = defaultdict( int )
for w in stimuli_list:
    fq[w] += 1
print(fq)
However it is just returning a dictionary with my words and how many times they occur which is obviously once. e.g 'going': 1, 'hiring': 1, 'driving': 1.
Really would appreciate some help!! Thank You!!
You could do something like this:
dictionary = {}
words = ['going', 'hiring', 'driving', 'letter', 'better', ...] # your list of words
# Creating words dictionary
for word in words:
    dictionary.setdefault(word[-3:], []).append(word)
# Removing lists that contain fewer than 40 words:
for key, value in dictionary.copy().items():
    if len(value) < 40:
        del dictionary[key]
print(dictionary)
Output:
{ # Only lists that are longer than 40 words
'ing': ['going', 'hiring', 'driving', ...],
'ter': ['letter', 'better', ...],
...
}
Since you're counting the words (because your key is the word), you only get 1 count per word.
You could create a key of the 3 last characters (and use Counter instead):
import collections
wordlist = ["driving","hunting","fishing","drive","a"]
endings = collections.Counter(x[-3:] for x in wordlist)
print(endings)
result:
Counter({'ing': 3, 'a': 1, 'ive': 1})
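The Counter only gives the number of words per ending. To get the words themselves, restricted to endings that are common enough, the counts can be used as a filter in a second pass. A rough sketch follows; group_by_common_ending, min_count and ending_len are hypothetical names, and the threshold of 2 is only for the small demo list.

```python
from collections import Counter, defaultdict

def group_by_common_ending(words, min_count, ending_len=3):
    """Map each ending to its words, keeping endings with at least min_count words."""
    endings = Counter(word[-ending_len:] for word in words)
    groups = defaultdict(list)
    for word in words:
        ending = word[-ending_len:]
        if endings[ending] >= min_count:
            groups[ending].append(word)
    return dict(groups)

wordlist = ["driving", "hunting", "fishing", "drive", "a"]
common = group_by_common_ending(wordlist, min_count=2)
print(common)
```

For the question's requirement of more than 40 words, min_count would be 41.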
Create DemoData:
import random
# seed the same for any run
random.seed(10)
# base lists for demo data
prae = ["help","read","muck","truck","sleep"]
post= ["ing", "biothign", "press"]
# lots of data
parts = [ x+str(y)+z for x in prae for z in post for y in range(100,1000,100)]
# shuffle and take every 120th element
random.shuffle(parts)
stimuli_list = parts[::120]
Creation of dictionary from stimuli_list
# create key with empty lists
dic = dict(("".join(e[len(e)-3:]),[]) for e in stimuli_list)
# process data and if fitting, fill list
for d in dic:
    fitting = [x for x in parts if x.endswith(d)] # adapt to only fit 2 last chars
    if len(fitting) > 5: # adapt this to have at least n in it
        dic[d] = fitting[:]
for d in [x for x in dic if not dic[x]]: # remove keys with empty lists
    del dic[d]
print()
print(dic)
Output:
{'ess': ['help400press', 'sleep100press', 'sleep600press', 'help100press', 'muck400press', 'muck900press', 'muck500press', 'help800press', 'muck100press', 'read300press', 'sleep400press', 'muck800press', 'read600press', 'help200press', 'truck600press', 'truck300press', 'read700press', 'help900press', 'truck400press', 'sleep200press', 'read500press', 'help600press', 'truck900press', 'truck800press', 'muck200press', 'truck100press', 'sleep700press', 'sleep500press', 'sleep900press', 'truck200press', 'help700press', 'muck300press', 'sleep800press', 'muck700press', 'sleep300press', 'help500press', 'truck700press', 'read400press', 'read100press', 'muck600press', 'read900press', 'read200press', 'help300press', 'truck500press', 'read800press']
, 'ign': ['truck200biothign', 'muck500biothign', 'help800biothign', 'muck700biothign', 'help600biothign', 'truck300biothign', 'read200biothign', 'help500biothign', 'read900biothign', 'read700biothign', 'truck400biothign', 'help300biothign', 'read400biothign', 'truck500biothign', 'read800biothign', 'help700biothign', 'help400biothign', 'sleep600biothign', 'sleep500biothign', 'muck300biothign', 'truck700biothign', 'help200biothign', 'sleep300biothign', 'muck100biothign', 'sleep800biothign', 'muck200biothign', 'sleep400biothign', 'truck100biothign', 'muck800biothign', 'read500biothign', 'truck900biothign', 'muck600biothign', 'truck800biothign', 'sleep100biothign', 'read300biothign', 'read100biothign', 'help900biothign', 'truck600biothign', 'help100biothign', 'read600biothign', 'muck400biothign', 'muck900biothign', 'sleep900biothign', 'sleep200biothign', 'sleep700biothign']
}
After performing some operations I get a list as following :
FreqItemset(items=[u'A_String_0'], freq=303)
FreqItemset(items=[u'A_String_0', u'Another_String_1'], freq=302)
FreqItemset(items=[u'B_String_1', u'A_String_0', u'A_OtherString_1'], freq=301)
I'd like to remove from list all items start from A_String_0 , but I'd like to keep other items (doesn't matter if A_String_0 exists in the middle or at the end of item )
So in example above delete lines 1 and 2 , keep line 3
I tried
filter(lambda a: a != 'A_String_0', result)
and
result.remove('A_String_0')
all this doesn't help me
It is as simple as this:
from pyspark.mllib.fpm import FPGrowth
sets = [
FPGrowth.FreqItemset(
items=[u'A_String_0'], freq=303),
FPGrowth.FreqItemset(
items=[u'A_String_0', u'Another_String_1'], freq=302),
FPGrowth.FreqItemset(
items=[u'B_String_1', u'A_String_0', u'A_OtherString_1'], freq=301)
]
[x for x in sets if x.items[0] != 'A_String_0']
## [FreqItemset(items=['B_String_1', 'A_String_0', 'A_OtherString_1'], freq=301)]
In practice it would be better to filter before collecting:
filtered_sets = (model
    .freqItemsets()
    .filter(lambda x: x.items[0] != 'A_String_0')
    .collect())
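The attempts in the question fail because they compare each whole record with a string, and a record is never equal to 'A_String_0'; the comparison has to look at the first element of items. The sketch below shows the same filter without Spark. The namedtuple is only a stand-in for pyspark's FreqItemset, and drop_starting_with is a hypothetical helper name.

```python
from collections import namedtuple

# Stand-in for pyspark's FreqItemset so the sketch runs without Spark installed.
FreqItemset = namedtuple("FreqItemset", ["items", "freq"])

def drop_starting_with(itemsets, first_item):
    """Keep itemsets whose first item differs from first_item (empty ones are kept)."""
    return [x for x in itemsets if not x.items or x.items[0] != first_item]

result = [
    FreqItemset(items=['A_String_0'], freq=303),
    FreqItemset(items=['A_String_0', 'Another_String_1'], freq=302),
    FreqItemset(items=['B_String_1', 'A_String_0', 'A_OtherString_1'], freq=301),
]
kept = drop_starting_with(result, 'A_String_0')
print(kept)
```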
How about result = result if result[0] != 'A_String_0' else result[1:]?
It seems that you are using a list called FreqItemset. However, the name suggests that you should be using a set, instead of a list.
This way, you could have a set of searchable pairs string, frequency. For example:
>>> d = { "the": 2, "a": 3 }
>>> d[ "the" ]
2
>>> d[ "the" ] = 4
>>> d[ "a" ]
3
>>> del d[ "a" ]
>>> d
{'the': 4}
You can easily access each word (which is a key of the dictionary), change its value (its frequency of occurrence), or remove it. None of these operations has to scan all the elements, since it is a dictionary, so performance is good (better than using a list, anyway).
Just my two cents.
I need to have my dictionary keys in this format '/ cat/' but i keep getting multiple forward slashes. Here is my code:
# Defining the Digraph method #
def digraphs(s):
dictionary = {}
count = 0;
while count <= len(s):
string = s[count:count + 2]
count += 1
dictionary[string] = s.count(string)
for entry in dictionary:
dictionary['/' + entry + '/'] = dictionary[entry]
del dictionary[entry]
print(dictionary)
#--End of the Digraph Method---#
Here is my output:
i do this:
digraphs('my cat is in the hat')
{'///in///': 1, '/// t///': 1, '/// c///': 1, '//s //': 1, '/my/': 1, '/n /': 1, '/e /': 1, '/ h/': 1, '////ha////': 1, '//////': 21, '/is/': 1, '///ca///': 1, '/he/': 1, '//th//': 1, '/t/': 3, '//at//': 2, '/t /': 1, '////y ////': 1, '/// i///': 2}
In Python, you generally shouldn't iterate over objects while modifying them. Instead of modifying your dictionary, make a new one:
new_dict = {}
for entry in dictionary:
    new_dict['/' + entry + '/'] = dictionary[entry]
return new_dict
Or more compactly (Python 2.7 and above):
return {'/' + key + '/': val for key, val in dictionary.items()}
An even better approach would be to skip creating your original dictionary in the first place:
# Defining the Digraph method #
def digraphs(s):
    dictionary = {}
    for count in range(len(s)):
        string = s[count:count + 2]
        dictionary['/' + string + '/'] = s.count(string)
    return dictionary
#--End of the Digraph Method---#
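A variation on the same idea, offered only as a sketch: pairing each character with its successor via zip yields full two-character digraphs only, so the trailing single character never becomes a key. Note that this swaps s.count for a Counter, which counts overlapping occurrences (for example 'aa' in 'aaa' is counted twice), so results can differ from the version above for repeated characters.

```python
from collections import Counter

def digraphs(s):
    """Return {'/xy/': count} for every pair of adjacent characters in s."""
    pair_counts = Counter(a + b for a, b in zip(s, s[1:]))
    return {'/' + pair + '/': count for pair, count in pair_counts.items()}

result = digraphs('my cat is in the hat')
print(result)
```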
You are adding entries to the dictionary as you loop over it, so your new entries are included in the loop too and get extra slashes added again. A better approach is to make a new dictionary containing the new keys you want:
newDict = dict(('/' + key + '/', val) for key, val in oldDict.iteritems())
As @Blender points out, you can also use a dictionary comprehension if you're using Python 2.7 or later:
{'/'+key+'/': val for key, val in oldDict.items()}