Python - convert text file to dict and convert to JSON

How can I convert this text file to json? Ultimately, I'll be inserting the json blobs into a NoSQL database, but for now I plan to parse the text files and build a python dict, then dump to json.
I think there has to be a way to do this with a dict comprehension that I'm just not seeing/following (I'm new to python).
Example of a file:
file_1.txt
[namespace1] => metric_A = value1
[namespace1] => metric_B = value2
[namespace2] => metric_A = value3
[namespace2] => metric_B = value4
[namespace2] => metric_B = value5
Example of dict I want to build to convert to json:
{ "file1" : {
"namespace1" : {
"metric_A" : "value_1",
"metric_B" : "value_2"
},
"namespace2" : {
"metric_A" : "value_3",
"metric_B" : ["value4", "value5"]
}
}
I currently have this working, but my code is a total mess (and much more complex than this example, with cleanup etc.). I'm basically going line by line through the file, building a Python dict: I check each namespace for existence in the dict, and if it exists, I check the metric. If the metric already exists, I know I have duplicates and need to convert the value to an array containing the existing value and my new value(s). There has to be a simpler/cleaner way.

import glob
import json
answer = {}
for fname in glob.glob('file_*.txt'):  # loop over all filenames
    answer[fname] = {}
    with open(fname) as infile:
        for line in infile:
            line = line.strip()
            if not line:
                continue
            splits = line.split()[::2]  # keep '[namespace]', metric, value; drop '=>' and '='
            splits[0] = splits[0][1:-1]  # strip the surrounding brackets
            namespace, metric, value = splits  # all the values in the line that we're interested in
            answer[fname].setdefault(namespace, {})[metric] = value  # populate the dict
required_json = json.dumps(answer)  # turn the dict into proper JSON

You can use regex for that. re.findall(r'\w+', line) will find all the text groups you're after; the rest is saving them in a dictionary of dictionaries. The simplest way to do that is to use defaultdict from collections.
import re
from collections import defaultdict
answer = defaultdict(lambda: defaultdict(list))
with open('file_1.txt', 'r') as f:
    for line in f:
        namespace, metric, value = re.findall(r'\w+', line)
        answer[namespace][metric].append(value)
Since we expect exactly three alphanumeric groups per line, we unpack them into three variables: namespace, metric, value. Finally, defaultdict returns a fresh inner defaultdict the first time we see a namespace, and the inner defaultdict returns an empty list for the first append, making the code more compact.
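If you also need the JSON output from the original question, json.dumps accepts the defaultdict directly, since it's a dict subclass. A minimal sketch, wrapping it under the filename as in the desired output:
import json
# defaultdict is a dict subclass, so json.dumps serializes it as-is;
# nesting it under the filename reproduces the desired structure.
required_json = json.dumps({'file_1.txt': answer}, indent=4)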

Related

how to replace the values of a dict in a txt file in python

I have a text file something.txt that holds data like:
sql_memory: 300
sql_hostname: server_name
sql_datadir: DEFAULT
I have a dict parameter = {"sql_memory": "900", "sql_hostname": "1234"}.
I need to replace the values from the parameter dict in the txt file; if a key from parameter doesn't match a key in the txt file, the value in the txt file should be left as it is.
For example, sql_datadir is not in the parameter dict, so its value in the txt file doesn't change.
Here is what I have tried :
import json
def create_json_file():
    with open('something.txt', 'r') as meta_data:
        lines = meta_data.read().splitlines()
    lines_key_value = [line.split(':') for line in lines]
    final_dict = {}
    for line in lines_key_value:
        final_dict[line[0]] = line[1]
    with open(json_file_path, 'w') as foo:
        json.dump(final_dict, foo, indent=4)
def generate_server_file(parameters):
    create_json_file()
    with open(json_file_path, 'r') as foo:
        server_json_data = json.load(foo)
    for key in parameters:
        if key not in server_json_data:
            raise KeyError("Cannot find keys")
        # Need to update the parameter in the json file
        # and convert the json file into txt again
x = {"sql_memory": "900", "sql_hostname": "1234"}
generate_server_file(x)
Is there a way I can do this without converting the txt file into a JSON ?
Expected output file(something.txt) :
sql_memory: 900
sql_hostname: 1234
sql_datadir: DEFAULT
Using Python 3.6
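For reference, a minimal sketch that rewrites the file line by line without any JSON round-trip, assuming every line follows the key: value layout shown above:
parameters = {"sql_memory": "900", "sql_hostname": "1234"}
with open('something.txt') as f:
    lines = f.read().splitlines()
with open('something.txt', 'w') as f:
    for line in lines:
        # partition splits only on the first colon, so values containing
        # colons stay intact
        key, _, value = line.partition(':')
        if key.strip() in parameters:
            line = f"{key}: {parameters[key.strip()]}"
        f.write(line + '\n')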
If you want to import data from a text file, use numpy.genfromtxt.
My Code:
import numpy
data = numpy.genfromtxt("something.txt", dtype='str', delimiter=';')
print(data)
something.txt:
Name;Jeff
Age;12
My Output:
[['Name' 'Jeff']
['Age' '12']]
It's very useful and I use it all the time.
If your full example is using Python dict literals, a way to do this would be to implement a serializer and a deserializer. Since yours closely follows object literal syntax, you could try using ast.literal_eval, which safely parses a literal from a string. Note that it will not handle variable names.
import ast
def split_assignment(string):
    '''Split on a variable assignment, only splitting on the first =.'''
    return string.split('=', 1)
def deserialize_collection(string):
    '''Deserialize the collection to a key as a string, and a value as a dict.'''
    key, value = split_assignment(string)
    return key, ast.literal_eval(value)
def dict_doublequote(dictionary):
    '''Print dictionary using double quotes.'''
    pairs = [f'"{k}": "{v}"' for k, v in dictionary.items()]
    return f'{{{", ".join(pairs)}}}'
def serialize_collection(key, value):
    '''Serialize the collection to a string.'''
    return f'{key}={dict_doublequote(value)}'
An example using the data above produces:
>>> data = 'parameter={"sql_memory":"900", "sql_hostname":"1234" }'
>>> key, value = deserialize_collection(data)
>>> key, value
('parameter', {'sql_memory': '900', 'sql_hostname': '1234'})
>>> serialize_collection(key, value)
'parameter={"sql_memory": "900", "sql_hostname": "1234"}'
Please note you'll probably want to use json.dumps rather than the hack I implemented to serialize the value, since it may incorrectly quote some complicated values. If single quotes are fine, a much more preferable solution would be:
def serialize_collection(key, value):
    '''Serialize the collection to a string.'''
    return f'{key}={str(value)}'
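And for reference, a sketch of the json-based serializer suggested above; json.dumps handles the quoting and escaping of nested values:
import json
def serialize_collection(key, value):
    '''Serialize the collection to a string using json.dumps.'''
    return f'{key}={json.dumps(value)}'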

Can I import dictionary items with the same values into a list?

I'm importing data from a text file, and then made a dictionary out of that. I'm now trying to make a separate one, with the entries that have the same value only. Is that possible?
Sorry if that's a little confusing! But basically, the text file looks like this:
"Andrew", "Present"
"Christine", "Absent"
"Liz", "Present"
"James", "Present"
I made it into a dictionary first, so I could group them into keys and values, and now I'm trying to make a list of the people who were 'present' only (I don't want to delete the absent ones, I just want a separate list), and then pick one from that list randomly.
This is what I tried:
import random
d = {}
with open('directory.txt') as f:
    for line in f:
        name, attendance = line.strip().split(',')
        d[name.strip()] = attendance.strip()
present_list = []
present_list.append({"name": str(d.keys), "attendance": "Present"})
print(random.choice(present_list))
When I tried running it, I only get:
{'name': '<built-in method keys of dict object at 0x02B26690>', 'attendance': 'Present'}
Which part should I change? Thank you so much in advance!
You can try this:
present_list = [key for key in d if d[key] == "Present"]
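Put together with your reading loop, and stripping the quotes that appear in the sample file, a complete sketch:
import random
d = {}
with open('directory.txt') as f:
    for line in f:
        name, attendance = line.strip().split(',')
        # strip whitespace and the surrounding quotes from the file format
        d[name.strip().strip('"')] = attendance.strip().strip('"')
present_list = [key for key in d if d[key] == "Present"]
print(random.choice(present_list))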
First, you have to change the way you read the lines; then you can use the attendance as the key in your initial dict:
import random
from collections import defaultdict
d = defaultdict(list)
with open('directory.txt') as f:
    for line in f.readlines():
        name, attendance = line.strip().split(',')
        # strip whitespace and the quotes that surround each field in the file
        d[attendance.strip().strip('"')].append(name.strip().strip('"'))
present_list = d["Present"]
print(random.choice(present_list) if present_list else "All absent")
dict.keys is a method, not a field. So you must instead call it:
d.keys()
This returns a view of the keys; if you want a comma-separated list with square brackets, calling str(list(d.keys())) is OK. If you want a different formatting, consider ','.join(d.keys()) to get a simple comma-separated list with no square brackets.
UPDATE:
You also have no filtering in place; instead I'd try something like this, where you group names by status as you read (the if/else block is the new code):
d = {}
with open('directory.txt') as f:
    for line in f:
        name, attendance = line.strip().split(',')
        name = name.strip().strip('"')  # drop whitespace and the quotes from the file
        attendance = attendance.strip().strip('"')
        if attendance not in d:
            d[attendance] = [name]
        else:
            d[attendance].append(name)
This way you don't need to go through all those intermediate steps, and you will end up with something like {'Present': ['Andrew', 'Liz', 'James']}

Python replace values in unknown structure JSON file

Say that I have a JSON file whose structure is either unknown or may change over time - I want to replace all values of "REPLACE_ME" with a string of my choice in Python.
Everything I have found assumes I know the structure. For example, I can read the JSON in with json.load and walk through the dictionary to do replacements then write it back. This assumes I know Key names, structure, etc.
How can I replace ALL of a given string value in a JSON file with something else?
This function recursively replaces all strings equal to the value original with the value new.
It works on the Python structure - but of course you can use it on a JSON file by using json.load first.
It doesn't replace keys in the dictionary - just the values.
def nested_replace(structure, original, new):
    if type(structure) == list:
        return [nested_replace(item, original, new) for item in structure]
    if type(structure) == dict:
        return {key: nested_replace(value, original, new)
                for key, value in structure.items()}
    if structure == original:
        return new
    else:
        return structure
d = [ 'replace', {'key1': 'replace', 'key2': ['replace', 'don\'t replace'] } ]
new_d = nested_replace(d, 'replace', 'now replaced')
print(new_d)
['now replaced', {'key1': 'now replaced', 'key2': ['now replaced', "don't replace"]}]
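To use it on a JSON file, as mentioned above, load the file, transform the structure, and dump it back; a minimal sketch (the file name is illustrative):
import json
with open('data.json') as f:  # illustrative file name
    structure = json.load(f)
structure = nested_replace(structure, 'REPLACE_ME', 'my new value')
with open('data.json', 'w') as f:
    json.dump(structure, f, indent=2)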
I think there's no big risk if you only replace values enclosed in double quotes, since quote characters are escaped inside JSON strings.
I would dump the structure, perform a str.replace (including the double quotes), and parse it again:
import json
d = { 'foo': {'bar' : 'hello'}}
d = json.loads(json.dumps(d).replace('"hello"','"hi"'))
print(d)
result:
{'foo': {'bar': 'hi'}}
I wouldn't risk replacing parts of strings, or strings without their surrounding quotes, because that could change other parts of the file; I can't think of a case where replacing a full string including its double quotes would change anything else.
There are "clean" solutions like adapting from Replace value in JSON file for key which can be nested by n levels but is it worth the effort? Depends on your requirements.
Why not modify the file directly instead of treating it as a JSON?
with open('filepath') as f:
    lines = f.readlines()
with open('filepath_new', 'w') as f:
    for line in lines:
        f.write(line.replace('REPLACE_ME', 'whatever'))
You could load the JSON file into a dictionary and recurse through that to find the proper values but that's unnecessary muscle flexing.
The best way is to simply treat the file as a string and do the replacements that way.
import json
json_file = 'my_file.json'
with open(json_file) as f:
    file_data = f.read()
file_data = file_data.replace('REPLACE_ME', 'new string')
<...>
with open(json_file, 'w') as f:
    f.write(file_data)
json_data = json.loads(file_data)
From here the file can be re-written and you can continue to use json_data as a dict.
Well, that depends. If you want to replace all the strings equal to "REPLACE_ME" with the same string, you can use this. The for loop goes through all the keys in the dictionary, and each key selects its value; if the value equals your search string, it is replaced with the replacement string.
search_string = "REPLACE_ME"
replacement = "SOME STRING"
test = {"test1":"REPLACE_ME", "test2":"REPLACE_ME", "test3":"REPLACE_ME", "test4":"REPLACE_ME","test5":{"test6":"REPLACE_ME"}}
def replace_nested(test):
for key,value in test.items():
if type(value) is dict:
replace_nested(value)
else:
if value==search_string:
test[key] = replacement
replace_nested(test)
print(test)
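Note this version only recurses into nested dicts; strings inside lists are left untouched. A sketch of a variant that also handles flat lists of strings, using the same search_string/replacement names as above:
def replace_nested_with_lists(test):
    for key, value in test.items():
        if type(value) is dict:
            replace_nested_with_lists(value)
        elif type(value) is list:
            # replace matching items in flat lists of strings
            test[key] = [replacement if item == search_string else item
                         for item in value]
        elif value == search_string:
            test[key] = replacement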
To solve this problem in a dynamic way, I ended up using the same JSON file to declare the variables that we want to replace.
Json File :
{
    "properties": {
        "property_1": "value1",
        "property_2": "value2"
    },
    "json_file_content": {
        "key_to_find": "{{property_1}} is my value",
        "dict1": {
            "key_to_find": "{{property_2}} is my other value"
        }
    }
}
Python code (references Replace value in JSON file for key which can be nested by n levels):
import json
from os import path

def fixup(a_dict: dict, k: str, subst_dict: dict) -> dict:
    """
    Function inspired by the answer linked above
    """
    for key in a_dict.keys():
        if key == k:
            for s_k, s_v in subst_dict.items():
                a_dict[key] = a_dict[key].replace("{{" + s_k + "}}", s_v)
        elif type(a_dict[key]) is dict:
            fixup(a_dict[key], k, subst_dict)

# ...
file_path = "my/file/path"
if path.exists(file_path):
    with open(file_path, 'rt') as f:
        json_dict = json.load(f)
    fixup(json_dict["json_file_content"], "key_to_find", json_dict["properties"])
    print(json_dict)  # json with variables resolved
else:
    print("file not found")
Hope it helps

write a whole line in a .txt file if not in a .yaml file

I am trying to write into a text file (download.txt) the lines from open.txt that don't repeat the same 'id' and are not in the exceptions (idexception, classexception). I have managed to write the ids that are not repeated and not in idexception.
MY QUESTION is how to add the 'classexception' condition; I tried but couldn't make it work. Any idea about the dictionaries/conditionals I have to use?
c = open('open.txt', 'r')  # structure: name:xxx; id:xxxx; class:xxxx; name:xxx; id:xxxx; class:xxxx etc
t = c.read()
d = open('download.txt', 'a')
allLines = t.split("\n")
lines = {}
classes = [s[10:-1] for s in t.split() if s.startswith("class")]  # 'class' is a reserved word, so use another name
for line in allLines:
    idPos = line.find("id:")
    colPos = line.find(";", idPos)
    if idPos > -1:
        id = line[idPos+4: colPos if colPos > -1 else None]
        if id not in idexception:
            lines.setdefault(id, line)
for l in lines:
    d.write(lines[l] + '\n')
c.close()
d.close()
Generally you are quite unclear, but if I understand correctly, here is my approach to your problem, with a lot of comments inside:
import re
id_exceptions = ['id_ex_1', 'id_ex_2']
class_exceptions = ['class_ex_1', 'class_ex_2']
# Values to be written to the download.txt file.
# Since ids need to be unique, the structure of this dict is:
# {[single key as value of an id]: {name: xxx, class: xxx}}
unique_values = dict()
# All files should be opened using the 'with' statement
with open('open.txt') as source:
    # Read the whole file into one single long string
    all_lines = source.read().replace('\n', '')
# Prepare a regular expression getting the values of name, id and class as a dict.
# Read https://regex101.com/r/Kby3fY/1 for extra explanation of what it does.
reg_exp = re.compile(r'name:(?P<name>[a-zA-Z0-9_-]*);\sid:(?P<id>[a-zA-Z0-9_-]*);\sclass:(?P<class>[a-zA-Z0-9_-]*);')
# Scan the single long string for every match of the above regular expression
for match in reg_exp.finditer(all_lines):
    # This produces a single dict {'name': xxx, 'id': xxx, 'class': xxx}
    single_line = match.groupdict()
    # Now we check against all conditions at once, and only if all of them
    # hold do we store the values under a unique id
    if (single_line['id'] not in unique_values and      # not present already
            single_line['id'] not in id_exceptions and  # not in id exceptions
            single_line['class'] not in class_exceptions):  # not in class exceptions
        # Add the values under the unique id
        unique_values[single_line['id']] = {'name': single_line['name'],
                                            'class': single_line['class']}
# Now we just need to write it to the download.txt file
with open('download.txt', 'w') as destination:
    for key, value in unique_values.items():  # in Python 2.x use .iteritems()
        line = "id:{}; name:{}; class:{}".format(key, value['name'], value['class'])
        destination.write(line + '\n')

Import Mongodb to CSV - removing duplicates

I am importing data from Mongo into a CSV file. The import consists of "timestamp" and "text" for each JSON Document.
The documents:
{
name: ...,
size: ...,
timestamp: ISODate("2013-01-09T21:04:12Z"),
data: { text:..., place:...},
other: ...
}
The code:
with open(output, 'w') as fp:
    for r in db.hello.find(fields=['text', 'timestamp']):
        print >>fp, '"%s","%s"' % (r['text'], r['timestamp'].strftime('%H:%M:%S'))
I would like to remove duplicates (some Mongo docs have the same text), and I would like to keep the first instance (with regards to the time) intact. Is it possible to remove these dupes as I import?
Thanks for your help!
I would use a set to store the hashes of the data, and check for duplicates. Something like this:
import md5
hashes = set()
with open(output, 'w') as fp:
    for r in db.hello.find(fields=['text', 'timestamp']):
        digest = md5.new(r['text']).digest()
        if digest in hashes:
            # It's a duplicate!
            continue
        else:
            hashes.add(digest)
        print >>fp, '"%s","%s"' % (r['text'], r['timestamp'].strftime('%H:%M:%S'))
It's worth noting that you could use the text field directly, but for larger text fields storing just the hash is much more memory efficient.
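On Python 3 the md5 module is gone; the same idea works with hashlib, encoding the text first. A sketch under the same assumptions as above (output path and db collection as in the question):
import hashlib
hashes = set()
with open(output, 'w') as fp:
    for r in db.hello.find(fields=['text', 'timestamp']):
        digest = hashlib.md5(r['text'].encode('utf-8')).digest()
        if digest in hashes:
            continue  # duplicate text, skip it
        hashes.add(digest)
        fp.write('"%s","%s"\n' % (r['text'], r['timestamp'].strftime('%H:%M:%S')))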
You just need to maintain a map (dictionary) of (text, timestamp) pairs. The text is the key, so there won't be any duplicates. I will assume the read order is not guaranteed to return the oldest timestamp first; in that case you have to make two passes: one for reading and a later one for writing.
textmap = {}
def insert(text, ts):
    if text in textmap:
        textmap[text] = min(ts, textmap[text])
    else:
        textmap[text] = ts
for r in db.hello.find(fields=['text', 'timestamp']):
    insert(r['text'], r['timestamp'])
for text in textmap:
    print >>fp, text, textmap[text]  # with whatever format desired
At the end, you can also easily convert the dictionary into list of tuples, in case you want to sort the results using timestamp before printing, for example.
(See Sort a Python dictionary by value )
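For example, to print the pairs ordered by timestamp:
for text, ts in sorted(textmap.items(), key=lambda item: item[1]):
    print >>fp, text, ts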
