Reading/Writing/Appending using json module - python

I was recommended some code from someone here but I still need some help. Here's the code that was recommended:
import json
def write_data(data, filename):
    with open(filename, 'w') as outfh:
        json.dump(data, outfh)

def read_data(filename):
    with open(filename, 'r') as infh:
        json.load(infh)
The writing code works fine, but the reading code doesn't return any strings after inputting the data:
read_data('player.txt')
Another thing I'd like to be able to do is specify a line to be printed. Something that is also pretty important for this project/exercise is being able to append strings to my file. Thanks to anyone who can help me.
Edit: I need to store strings in the file that I can convert to values, e.g.:
name="Karatepig"
Is something I would store so I can recall the data if I ever need to load a previous set of data or something of that sort.
I'm very much a noob at Python, so I don't know what would be best, a list or a dictionary; I haven't really used a dictionary before, and I also have no idea yet how I'm going to convert strings into values.

Suppose you have a dictionary like this:
data = {'name': 'Karatepig', 'score': 10}
Your write_data function is fine, but your read_data function needs to actually return the data after reading it from the file:
def read_data(filename):
    with open(filename, 'r') as infh:
        data = json.load(infh)
    return data
data = read_data('player.json')
Now suppose you want to print the name and update the score:
print(data['name'])
data['score'] += 5  # add 5 points to the score
You can then write the data back to disk using the write_data function. You cannot use the json module to append data to a file; you need to read the file, modify the data, and write the whole file again. This is not so bad for small amounts of data, but consider using a database for larger projects.
Whether you use a list or dictionary depends on what you're trying to do. You need to explain what you're trying to accomplish before anyone can recommend an approach. Dictionaries store values assigned to keys (see above). Lists are collections of items without keys and are accessed using an index.
It isn't clear what you mean by "convert strings into values." Notice that the above code shows the string "Karatepig" using the dictionary key "name".
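Putting the answer's two functions together, the full read-modify-write cycle described above can be sketched like this (the file name and the extra 'items' field are just examples):

```python
import json

def write_data(data, filename):
    """Serialize data to filename as JSON, replacing the file."""
    with open(filename, 'w') as outfh:
        json.dump(data, outfh)

def read_data(filename):
    """Read and return the JSON data stored in filename."""
    with open(filename, 'r') as infh:
        return json.load(infh)

data = {'name': 'Karatepig', 'score': 10}
write_data(data, 'player.json')

# "Appending" means: read, modify in memory, write the whole file back.
loaded = read_data('player.json')
loaded['score'] += 5           # update an existing value
loaded['items'] = ['sword']    # add a new key
write_data(loaded, 'player.json')

print(read_data('player.json')['score'])  # 15
```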


How to find duplicates in a csv with python, and then alter the row

For a little background, this is the csv file I'm starting with (the data is nonsensical and only used for proof of concept):
Jackson,Thompson,jackson.thompson#hotmail.com,test,
Luke,Wallace,luke.wallace#lycos.com,test,
David,Wright,david.wright#hotmail.com,test,
Nathaniel,Butler,nathaniel.butler#aol.com,test,
Eli,Simpson,noah.simpson#hotmail.com,test,
Eli,Mitchell,eli.mitchell#aol.com,,test2
Bob,Test,bob.test#aol.com,test,
What I am attempting to do with this csv on a larger scale is: if the first value in a row is duplicated, take the data from the later occurrence, append it to the row containing the first occurrence, and remove the duplicate row. For example, in the data above "Eli" is represented twice. The first instance has "test" after the email value; the second instance has no value there but instead has another value in the next index over.
I would want it to go from this:
Eli,Simpson,noah.simpson#hotmail.com,test,,
Eli,Mitchell,eli.mitchell#aol.com,,test2
To this:
Eli,Simpson,noah.simpson#hotmail.com,test,test2
I have been able to successfully import this csv into my code using what is below.
import csv
f = open(r'C:\Projects\Python\Test.csv', 'r')
csv_f = csv.reader(f)
test_list = []
for row in csv_f:
    test_list.append(row[0])
print(test_list)
At this point I was able to import my csv, and put the first names into my list. I'm not sure how to compare the indexes to make the changes I'm looking for. I'm a python rookie so any help/guidance would be greatly appreciated.
If you want to use pandas you could use the pandas .drop_duplicates() method. An example would look something like this (note that drop_duplicates returns a new DataFrame unless you pass inplace=True, so capture the result):
import pandas as pd
csv_f = pd.read_csv(r'C:\a file with addresses')
csv_f = csv_f.drop_duplicates(subset=['thing_to_drop'], keep='first')
See the pandas documentation: https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.drop_duplicates.html
I am kind of a newbie in Python as well, but I would suggest using csv.DictReader and treating the csv file as a sequence of dictionaries, meaning every row is a dictionary.
This way you can iterate through the names easily.
Second, I would suggest building a list of names already seen as you iterate through the file, so you can check whether a name is a duplicate. For example:
name_list.append("eli")
Then check with if "eli" in name_list:
and add the key/value to the first occurrence.
I don't know if this is best practice, so don't roast me guys, but this is a simple and quick solution.
This will help you practice iterating through lists and dictionaries as well.
Here is a helpful link for reading about csv handling.
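The merge-by-first-column idea described above can be sketched with the plain csv module like this (the file names are made up, and the column layout is assumed from the sample data):

```python
import csv

def merge_duplicates(in_path, out_path):
    """Merge rows sharing the same first-column value: empty cells in
    the first occurrence are filled from later duplicate rows, and the
    duplicate rows themselves are dropped."""
    merged = {}   # first-column value -> kept row
    order = []    # remember first-seen order of keys
    with open(in_path, newline='') as f:
        for row in csv.reader(f):
            key = row[0]
            if key not in merged:
                merged[key] = row
                order.append(key)
            else:
                kept = merged[key]
                for i, cell in enumerate(row):
                    if i >= len(kept):
                        kept.append(cell)      # duplicate row is longer
                    elif not kept[i] and cell:
                        kept[i] = cell         # fill an empty cell
    with open(out_path, 'w', newline='') as f:
        csv.writer(f).writerows(merged[k] for k in order)
```

For the sample above, the two "Eli" rows collapse into one row carrying both "test" and "test2".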

Django, Python: Best way to parse a CSV and convert to Django model instances

I have a page where a user uploads a CSV file. That bit works.
I can read the CSV and turn it into lists. It takes a fair bit of time for something I thought should be faster (about 7 seconds to parse through and convert it to lists, for a 17mb CSV file).
I'm wondering now though, what would be the best approach to doing this? The code I have so far is very convoluted (written a long time ago by a CS graduate colleague that has since left) and I think I want to rewrite it as it's too slow.
I haven't worked with CSVs before. Right now this is what I have:
import codecs
import csv
import os
import sys

def read_csv_file(self, file_path):
    is_file = False
    while not is_file:
        if os.path.exists(file_path):
            is_file = True
    result_data = []
    csv.field_size_limit(sys.maxsize)
    csv_reader = csv.reader(codecs.open(file_path, 'rU', 'utf-8'), delimiter=',')
    for row in csv_reader:
        result_data.append(row)
    return result_data
Is turning a CSV into lists (that I can then perhaps zip?) the best way to go about it?
Ultimately, the goal is to create DB objects (in a loop perhaps?) that would be something like looping through each list, using an index to create objects, appending those objects to an object list and then doing a bulk_create:
object_instance_list.append(My_Object.objects.get_or_create(property=csv_property[some_index], etc etc)[0])
My_Object.objects.bulk_create(object_instance_list)
Would that be efficient?
Should I be working with dicts instead?
Is there a built in method that would allow this for either Python's CSV or a bit of Django functionality that already does it?
Basically, since I'm not that experienced, and this is my first time working with CSVs, I would like to get it right(ish) from the beginning.
I would appreciate any help in this regard, so I can learn the proper way to handle this. Thanks!
So, this is untested, but conceptually you should be able to get the idea. The trick is to take advantage of using **kwargs.
import csv

def read_csv():
    """Read the csv into dictionaries, transform the keys necessary
    and return a list of cleaned-up dictionaries.
    """
    with open('data.csv', newline='') as csvfile:
        reader = csv.DictReader(csvfile)
        return [map_rows_to_fields(row) for row in reader]

def map_rows_to_fields(row):
    """Here for each dictionary you want to transform the dictionary
    in order to map the keys of the dict to match the names of the
    fields on the model you want to create so we can pass it in as
    `**kwargs`. This would be an opportunity to use a nice dictionary
    comprehension.
    """
    csv_fields_to_model_fields = {
        'csv_field_1': 'model_field_1',
        'csv_field_2': 'model_field_2',
        'csv_field_n': 'model_field_n',
    }
    return {
        csv_fields_to_model_fields[key]: value
        for key, value in row.items()
    }

def instantiate_models():
    """Finally, we have our data from the csv in dictionaries
    that map values to expected fields on our model constructor,
    then we can just instantiate each of those models from the
    dictionary data using a list comprehension, the result of which
    we pass as the argument to `bulk_create`, saving the rows to
    the database.
    """
    model_data = read_csv()
    MyModel.objects.bulk_create([
        MyModel(**data) for data in model_data
    ])
The method bulk_create does have a few caveats though so make sure it's ok to use it in your case.
https://docs.djangoproject.com/en/2.0/ref/models/querysets/#bulk-create
If you can't use bulk_create then just make the models in a loop.
for data in model_data:
    MyModel.objects.create(**data)

How do I write unknown keys to a CSV in large data sets?

I'm currently working on a script that will query data from a REST API, and write the resulting values to a CSV. The data set could potentially contain hundreds of thousands of records, but it returns the data in sets of 100 entries. My goal is to include every key from each entry in the CSV.
What I have so far (this is a simplified structure for the purposes of this question):
import csv

resp = client.get_list()
while resp.token:
    my_data = resp.data
    process_data(my_data)
    resp = client.get_list(resp.token)

def process_data(my_data):
    # This section should write my_data to a CSV file.
    # I know I can use csv.DictWriter as part of the solution.
    # Notice that in this example, "fieldnames" is undefined;
    # defining it is part of the question.
    with open('output.csv', 'a') as output_file:
        writer = csv.DictWriter(output_file, fieldnames=fieldnames)
        for element in my_data:
            writer.writerow(element)
The problem: Each entry doesn't necessarily have the same keys. A later entry missing a key isn't that big of a deal. My problem is, for example, entry 364 introducing an entirely new key.
Options that I've considered:
Whenever I encounter a new key, read in the output CSV, append the new key to the header, and append a comma to each previous line. This leads to a TON of file I/O, which I'm hoping to avoid.
Rather than writing to a CSV, write the raw JSON to a file. Meanwhile, build up a list of all known keys as I iterate over the data. Once I've finished querying the API, iterate over the JSON files that I wrote, and write the CSV using the list that I built. This leads to 2 total iterations over the data, and feels unnecessarily complex.
Hard code the list of potential keys beforehand. This approach is impossible, for a number of reasons.
None of these solutions feel particularly elegant to me, which leads me to my question. Is there a better way for me to approach this problem? Am I overlooking something obvious?
Options 1 and 2 both seem reasonable.
Does the CSV need to be valid and readable while you're creating it? If not, you could append the missing columns in one pass after you've finished reading from the API (which would be a combination of the two approaches). If you do this you'll probably have to use the regular csv.writer in the first pass rather than csv.DictWriter, since your column definition will grow while you're writing.
One thing to bear in mind: if the overall file is expected to be large (e.g. won't fit into memory), then your solution probably needs to use a streaming approach, which is easy with CSV but fiddly with JSON. You might also want to look into alternative formats to JSON for the intermediate data (e.g. XML, BSON, etc.).
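The two-pass variant of option 2 can be sketched like this: buffer each record as a line of JSON while collecting every key seen, then stream the buffer into a CSV with csv.DictWriter, leaving cells blank for missing keys (the file names and helper names here are illustrative, not from the question):

```python
import csv
import json

def buffer_rows_collecting_keys(rows, json_path):
    """First pass: write each record as one JSON line and collect
    every key seen across all records, in first-seen order."""
    fieldnames = []
    with open(json_path, 'w') as f:
        for row in rows:
            for key in row:
                if key not in fieldnames:
                    fieldnames.append(key)
            f.write(json.dumps(row) + '\n')
    return fieldnames

def json_lines_to_csv(json_path, csv_path, fieldnames):
    """Second pass: stream the buffered JSON lines into a CSV,
    writing an empty cell for any key a record is missing."""
    with open(json_path) as jf, open(csv_path, 'w', newline='') as cf:
        writer = csv.DictWriter(cf, fieldnames=fieldnames, restval='')
        writer.writeheader()
        for line in jf:
            writer.writerow(json.loads(line))
```

Because both passes work line by line, memory use stays flat no matter how many records the API returns.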

Combining JSON values based on key

I'm working on Python 2.6.6 and I'm struggling with one issue.
I have a large JSON file with the following structure:
{"id":"12345","ua":[{"n":"GROUP_A","v":["true"]}]}
{"id":"12345","ua":[{"n":"GROUP_B","v":["true"]}]}
{"id":"54321","ua":[{"n":"GROUP_C","v":["true"]}]}
{"id":"54321","ua":[{"n":"GROUP_D","v":["true"]}]}
{"id":"54321","ua":[{"n":"GROUP_E","v":["true"]}]}
{"id":"98765","ua":[{"n":"GROUP_F","v":["true"]}]}
And I need to merge the id's so they'll contain all the GROUPS as so:
{"id":"12345","ua":[{"n":"GROUP_A","v":["true"]},{"n":"GROUP_B","v":["true"]}]}
{"id":"54321","ua":[{"n":"GROUP_C","v":["true"]},{"n":"GROUP_D","v":["true"]},{"n":"GROUP_E","v":["true"]}]}
{"id":"98765","ua":[{"n":"GROUP_F","v":["true"]}]}
I tried using the 'json' library, but I couldn't append the values correctly. Also I tried to convert it all to a dictionary and append the values (GROUPS) to the key (id) as lists, but I got stuck on printing it all as I need to the output file.
I can do it using bash but it takes too long to parse all the information and rearrange it in the needed format.
Any help is appreciated!
Thanks.
First, let's get the JSON stuff out of the way.
Your file is not a single JSON structure; it's a bunch of separate JSON objects, one object per line judging from your sample. So, let's read this into a list:
import json

with open('spam.json') as f:
    things = [json.loads(line) for line in f]
Then we'll process this, and write it out:
with open('eggs.json', 'w') as f:
    for thing in new_things:
        f.write(json.dumps(thing) + '\n')
Now, you don't have a JSON structure that you want to append things to; you have a list of dicts, and you want to create a new list of dicts, merging together the ones with the same key.
Here's one way to do it:
new_things = {}
for thing in things:
    thing_id = thing['id']
    try:
        old_thing = new_things[thing_id]
    except KeyError:
        new_things[thing_id] = thing
    else:
        old_thing['ua'].extend(thing['ua'])
new_things = new_things.values()
There are a few different ways you could simplify this; I just wrote it this way because it uses no tricks that should be beyond a novice. For example, you could do it by sorting and grouping:
import itertools
import operator

def merge(things):
    return {'id': things[0]['id'],
            'ua': list(itertools.chain.from_iterable(t['ua'] for t in things))}

sorted_things = sorted(things, key=operator.itemgetter('id'))
grouped_things = itertools.groupby(sorted_things, key=operator.itemgetter('id'))
new_things = [merge(list(group)) for key, group in grouped_things]
I didn't realize from your original question that you had tens of millions of rows. All of the above steps require loading the entire original data set into memory, processing it with some temporary storage, then writing it back out. But if your data set is too large, you need to find a way to process one row at a time, and keep as little in memory simultaneously as possible.
First, to process one row at a time, you just need to change the initial list comprehension to a generator expression, and move the rest of the code inside the with statement, like this:
with open('spam.json') as f:
    things = (json.loads(line) for line in f)
    for thing in things:
        # blah blah
… at which point it might be just as easy to rewrite it like this:
with open('spam.json') as f:
    for line in f:
        thing = json.loads(line)
        # blah blah
Next, sorting obviously builds the whole sorted list in memory, so that's not acceptable here. But if you don't sort and group, the entire new_things result object has to be alive at the same time (because the very last input row could have to be merged into the very first output row).
Your sample data seems to already have the rows sorted by id. If you can count on that in real life—or just count on the rows always being grouped by id—just skip the sorting step, which isn't doing anything but wasting time and memory, and use a grouping solution.
On the other hand, if you can't count on the rows being grouped by id, there are only really two ways to reduce memory further: compress the data in some way, or back the storage to disk.
For the first, Foo Bar User's solution built a simpler and smaller data structure (a dict mapping each id to its list of uas, instead of a list of dicts, each with an id and a ua), which should take less memory, and which we could convert to the final format one row at a time. Like this:
from collections import defaultdict

with open('spam.json') as f:
    new_dict = defaultdict(list)
    for row in f:
        thing = json.loads(row)
        new_dict[thing["id"]].extend(thing["ua"])
with open('eggs.json', 'w') as f:
    for id, ua in new_dict.items():  # use iteritems in Python 2.x
        thing = {'id': id, 'ua': ua}
        f.write(json.dumps(thing) + '\n')
For the second, Python comes with a nice way to use a dbm database as if it were a dictionary. If your values are just strings, you can use the anydbm/dbm module (or one of the specific implementations). Since your values are lists, you'll need to use shelve instead.
Anyway, while this will reduce your memory usage, it could slow things down. On a machine with 4GB of RAM, the savings in pagefile swapping will probably blow away the extra cost of going through the database… but on a machine with 16GB of RAM, you may just be adding overhead for very little gain. You may want to experiment with smaller files first, to see how much slower your code is with shelve vs. dict when memory isn't an issue.
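The shelve variant described above might look like the following sketch. One caveat to flag as an assumption here: mutating a shelved list in place is not persisted unless you reassign the value (or open the shelf with writeback=True), so the sketch reassigns explicitly.

```python
import json
import shelve

def merge_on_disk(in_path, shelf_path):
    """Accumulate each id's 'ua' entries in a shelve database so the
    merged mapping lives on disk rather than in memory."""
    shelf = shelve.open(shelf_path)
    try:
        with open(in_path) as f:
            for line in f:
                thing = json.loads(line)
                key = thing['id']
                ua = shelf.get(key, [])
                ua.extend(thing['ua'])
                shelf[key] = ua  # reassign; in-place mutation is not saved
    finally:
        shelf.close()
```

Each merged list still has to fit in memory while it's being extended, but the full id-to-ua mapping never does.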
Alternatively, if things get way beyond the limits of your memory, you can always use a more powerful database that actually can sort things on disk. For example (untested):
import sqlite3

db = sqlite3.connect('temp.sqlite')
c = db.cursor()
c.execute('CREATE TABLE Things (tid, ua)')
for thing in things:
    for ua in thing['ua']:
        # each ua is a dict, so serialize it for storage
        c.execute('INSERT INTO Things (tid, ua) VALUES (?, ?)',
                  (thing['id'], json.dumps(ua)))
db.commit()
c.execute('SELECT tid, ua FROM Things ORDER BY tid')
rows = iter(c.fetchone, None)
grouped_things = itertools.groupby(rows, key=operator.itemgetter(0))
new_things = ({'id': key, 'ua': [json.loads(ua) for _, ua in group]}
              for key, group in grouped_things)
with open('eggs.json', 'w') as f:
    for thing in new_things:
        f.write(json.dumps(thing) + '\n')

Loading large file (25k entries) into dict is slow in Python?

I have a file which has about 25000 lines, and it's a s19 format file.
each line is like: S214 780010 00802000000010000000000A508CC78C 7A
There are no spaces in the actual file, the first part 780010 is the address of this line, and I want it to be a dict's key value, and I want the data part 00802000000010000000000A508CC78C be the value of this key. I wrote my code like this:
def __init__(self, filename):
    infile = file(filename, 'r')
    self.all_lines = infile.readlines()
    self.dict_by_address = {}
    for i in range(0, self.get_line_number()):
        self.dict_by_address[self.get_address_of_line(i)] = self.get_data_of_line(i)
    infile.close()
get_address_of_line() and get_data_of_line() are simple string-slicing functions, and get_line_number() iterates over self.all_lines and returns an int.
The problem is that the init process takes over a minute. Is the way I construct the dict wrong, or does Python just need that long to do this?
And by the way, I'm new to Python :) so the code may look more C/C++-like; any advice on how to write it more Pythonically is appreciated :)
How about something like this? (I made a test file with just a line S21478001000802000000010000000000A508CC78C7A so you might have to adjust the slicing.)
>>> with open('test.test') as f:
... dict_by_address = {line[4:10]:line[10:-3] for line in f}
...
>>> dict_by_address
{'780010': '00802000000010000000000A508CC78C'}
This code should be tremendously faster than what you have now. EDIT: As @sth pointed out, this doesn't work because there are no spaces in the actual file. I'll add a corrected version at the end.
def __init__(self, filename):
    self.dict_by_address = {}
    with open(filename, 'r') as infile:
        for line in infile:
            _, key, value, _ = line.split()
            self.dict_by_address[key] = value
Some comments:
Best practice in Python is to use a with statement, unless you are using an old Python that doesn't have it.
Best practice is to use open() rather than file(); I don't think Python 3.x even has file().
You can use the open file object as an iterator, and when you iterate it you get one line from the input. This is better than calling the .readlines() method, which slurps all the data into a list; then you use the data one time and delete the list. Since the input file is large, that means you are probably causing swapping to virtual memory, which is always slow. This version avoids building and deleting the giant list.
Then, having created a giant list of input lines, you use range() to make a big list of integers. Again it wastes time and memory to build a list, use it once, then delete the list. You can avoid this overhead by using xrange() but even better is just to build the dictionary as you go, as part of the same loop that is reading lines from the file.
It might be better to use your special slicing functions to pull out the "address" and "data" fields, but if the input is regular (always follows the pattern of your example) you can just do what I showed here. line.split() splits the line on white space, giving a list of four strings. Then we unpack it into four variables using "destructuring assignment". Since we only want to save two of the values, I used the variable name _ (a single underscore) for the other two. That's not really a language feature, but it is an idiom in the Python community: when you have data you don't care about you can assign it to _. This line will raise an exception if there are ever any number of values other than 4, so if it is possible to have blank lines or comment lines or whatever, you should add checks and handle the error (at least wrap that line in a try:/except).
EDIT: corrected version:
def __init__(self, filename):
    self.dict_by_address = {}
    with open(filename, 'r') as infile:
        for line in infile:
            key = extract_address(line)
            value = extract_data(line)
            self.dict_by_address[key] = value
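The corrected version leaves extract_address and extract_data to the asker's existing slicing helpers. Based on the offsets used in the dict-comprehension answer above (and assuming the record layout of the single sample line, so adjust for real data), they might look like this:

```python
def extract_address(line):
    # characters 4..9 hold the 6-digit address field in the sample record
    return line[4:10]

def extract_data(line):
    # data runs from offset 10 up to the 2-char checksum and the newline
    return line[10:-3]
```

For the sample line 'S21478001000802000000010000000000A508CC78C7A', this yields address '780010' and the 32-character data field.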
