Extracting data from a json file into a csv

I am new to dealing with json files and I am hoping for some help.
Here is a part of the json file (since it would be way too much for me to post it all) that I am dealing with
[{"id":804,"name":{"english":"Naganadel","japanese":"\u30a2\u30fc\u30b4\u30e8\u30f3"},"type":["Poison","Dragon"],"base":{"HP":73,"Attack":73,"Defense":73,"Sp. Attack":127,"Sp. Defense":73,"Speed":121}},{"id":805,"name":{"english":"Stakataka","japanese":"\u30c4\u30f3\u30c7\u30c4\u30f3\u30c7"},"type":["Rock","Steel"],"base":{"HP":61,"Attack":131,"Defense":211,"Sp. Attack":53,"Sp. Defense":101,"Speed":13}},{"id":806,"name":{"english":"Blacephalon","japanese":"\u30ba\u30ac\u30c9\u30fc\u30f3"},"type":["Fire","Ghost"],"base":{"HP":53,"Attack":127,"Defense":53,"Sp. Attack":151,"Sp. Defense":79,"Speed":107}},{"id":807,"name":{"english":"Zeraora","japanese":"\u30bc\u30e9\u30aa\u30e9"},"type":["Electric"],"base":{"HP":88,"Attack":112,"Defense":75,"Sp. Attack":102,"Sp. Defense":80,"Speed":143}},{"id":808,"name":{"english":"Meltan","japanese":"\u30e1\u30eb\u30bf\u30f3"},"type":["Steel"],"base":{"HP":46,"Attack":65,"Defense":65,"Sp. Attack":55,"Sp. Defense":35,"Speed":34}},{"id":809,"name":{"english":"Melmetal","japanese":"\u30e1\u30eb\u30e1\u30bf\u30eb"},"type":["Steel"],"base":{"HP":135,"Attack":143,"Defense":143,"Sp. Attack":80,"Sp. Defense":65,"Speed":34}}]
I am attempting to take the id, name, type, base, HP, attack, defense, and speed of each pokemon. Below is what I currently have, which includes my attempt to take the id and print it.
When I run this file I get: list indices must be integers or slices, not str.
import json
def main():
    f = open('pokedex.json')
    data = json.load(f)
    f.close()
    #print data
    id_poke = data['_embedded']['id_poke']
    id_info = []
    for i in id_poke:
        id_poke.append(i['id'])
if __name__ == '__main__':
    main()

Take a look at the json sample you included in your question: It starts with a [, meaning it is a list, not a dictionary. When you assign this object to the variable data and then try to index into this list with the (string) key _embedded, you get the error you saw.
I don't know how you expected this to work since your json file has neither _embedded nor id_poke as keys, but to get you started, here's how to print out the numeric id and English name of each object; you can take it from there.
for poke in data: # magic iteration over a list: data[0], data[1] etc.
    print(poke["id"], poke["name"]["english"])
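To go from that loop to an actual CSV file, something along the following lines might work. It is only a sketch in Python 3 syntax: the helper name pokedex_to_csv, the column order, and joining multiple types with "/" are assumptions made for illustration. Only the JSON keys and the two sample records come from the question.

```python
import csv
import io
import json

# Two records copied from the sample in the question (raw string so the
# \uXXXX escapes stay part of the JSON text).
SAMPLE = r'''[
{"id":804,"name":{"english":"Naganadel","japanese":"\u30a2\u30fc\u30b4\u30e8\u30f3"},"type":["Poison","Dragon"],"base":{"HP":73,"Attack":73,"Defense":73,"Sp. Attack":127,"Sp. Defense":73,"Speed":121}},
{"id":807,"name":{"english":"Zeraora","japanese":"\u30bc\u30e9\u30aa\u30e9"},"type":["Electric"],"base":{"HP":88,"Attack":112,"Defense":75,"Sp. Attack":102,"Sp. Defense":80,"Speed":143}}
]'''

# Hypothetical column layout; adjust to taste.
FIELDS = ["id", "name", "type", "HP", "Attack", "Defense", "Speed"]


def pokedex_to_csv(pokedex, out):
    """Write one CSV row per entry of the parsed JSON list to the file-like object out."""
    writer = csv.writer(out, lineterminator="\n")
    writer.writerow(FIELDS)
    for poke in pokedex:
        base = poke["base"]
        writer.writerow([
            poke["id"],
            poke["name"]["english"],
            "/".join(poke["type"]),  # assumption: flatten the type list into one cell
            base["HP"], base["Attack"], base["Defense"], base["Speed"],
        ])


buffer = io.StringIO()
pokedex_to_csv(json.loads(SAMPLE), buffer)
csv_text = buffer.getvalue()
print(csv_text)
```

To write to disk instead of an in-memory buffer, pass a file opened with open("pokedex.csv", "w", newline="") as the second argument.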

Declare
id_poke = data['_embedded']['id_poke']
As str()

Related

Removing duplicates from an attribute of a class variable

I'm extremely new to python and was having some trouble with removing duplicate values from an attribute of a class (I think this is the correct terminology).
Specifically, I want to remove every value that has the same year. I should note that I'm printing only the first four characters of each value and searching on those first four characters. The data within the attribute is actually in year-month-day format (example: 19070101 is the first of January, 1907).
Anyways, here is my code:
import csv
import os
class Datatype:
    'Data from the weather station'
    def __init__ (self, inputline):
        [ self.DATE,
          self.PRCP] = inputline.split(',')
filename ='LAWe.txt'
LAWd = open(filename, 'r')
LAWefile = LAWd.read()
LAWd.close()
'Recognize the line endings for MS-DOS, UNIX, and Mac and apply the .split() method to the string wholeFile'
if '\r\n' in LAWefile:
    filedat = LAWefile.split('\r\n') # the split method, applied to a string, produces a list
elif '\r' in LAWefile:
    filedat = LAWefile.split('\r')
else:
    filedat = LAWefile.split('\n')
collection = dict()
date= dict()
for thisline in filedat:
    thispcp = Datatype(thisline) # here is where the Datatype object is created (running the __init__ function)
    collection[thispcp.DATE] = thispcp # the dictionary will be keyed by the ID attribute
for thisID in collection.keys():
    studyPRP = collection[thisID]
    if studyPRP.DATE.isdigit():
        list(studyPRP.DATE)
        if len(date[studyPRP.DATE][0:4]):
            pass #if year is seen once, then skip and go to next value in attribute
        else:
            print studyPRP.DATE[0:4] #print value in this case the year)
            date[studyPRP.DATE]=studyPRP.DATE[0:4]
I get this error:
Traceback (most recent call last):
File "project.py", line 61, in <module>
if len(date[studyPRP.DATE][0:4]):
KeyError: '19770509'
A key error (which means a value isn't in a list? but it is for my data) can be fixed by using a set function (or so I've read), but I have 30,000 pieces of information I'm dealing with and it seems like you have to manually type in that info so that's not an option for me.
Any help at all would be appreciated
Sorry if this is confusing or nonsensical as I'm extremely new to python.

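Replace this
if len(date[studyPRP.DATE][0:4]):
with this
if len(date[studyPRP.DATE[0:4]]):
Explanation:
In the original line you use the whole date as the key (hence KeyError: '19770509') and only then take the first four characters of the value stored in date.
In the correction you take the first four characters of the date (the year) and use that as the key into the dictionary.
Note that this lookup still raises a KeyError for a year that has not been stored yet, so check that studyPRP.DATE[0:4] is in date before indexing.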

I'm not sure exactly what you want here, so I'll answer the parts I can help with.
Your error occurs because you look up the year in date before you have added it.
Also, what you are adding to your collection looks like
{
    <object>.DATE: <object>
}
I don't know what you need that for. Your lower for loop can be written as follows:
for thisID in collection:
    if thisID.isdigit():
        if thisID[0:4] in date and len(date[thisID[0:4]]):
            #if year is seen once, then skip and go to next
            # value in attribute
            pass
        else:
            print thisID[0:4] #print value in this case the year)
            date[thisID[0:4]]=thisID[0:4]
Note that your studyPRP.DATE is the same as thisID.
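Since the goal is simply to see each year once, a set may be a simpler fit than a dictionary, and nothing has to be typed in by hand. The following is a minimal sketch in Python 3 syntax. The function name unique_years and the sample lines are made up for illustration, and it assumes each line looks like DATE,PRCP with DATE in YYYYMMDD form as described in the question.

```python
def unique_years(lines):
    """Return the distinct years found in the lines, in order of first appearance."""
    seen = set()
    years = []
    for line in lines:
        date = line.split(",")[0].strip()
        if not date.isdigit():
            continue  # skip headers and blank or malformed lines
        year = date[:4]
        if year not in seen:  # membership test never raises KeyError
            seen.add(year)
            years.append(year)
    return years


# Made-up sample data in the DATE,PRCP layout.
sample_lines = ["DATE,PRCP", "19070101,0.0", "19070102,0.3", "19770509,1.2", ""]
print(unique_years(sample_lines))
```

The set is only used for the fast "have I seen this year?" check, while the list keeps the years in the order they first appear.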

dynamic variables from list

This is similar to, Python creating dynamic global variable from list, but I'm still confused.
I get lots of flo data in a semi-proprietary format. I've already used Python to strip the data down to my needs and save it into a json file called badactor.json, with each record in the following format:
[saddr as a integer, daddr as a integer, port, date as Julian, time as decimal number]
An arbitrary example [1053464536, 1232644361, 2222, 2014260, 15009]
I want to go through my weekly/monthly flo logs and save everything by Julian date. To start I want to go through the logs and create a list that is named according to the Julian date it happened, i.e, 2014260 and then save it to the same name 2014260.json. I have the following, but it is giving me an error:
#!/usr/bin/python
import sys
import json
import time
from datetime import datetime
import calendar
#these are varibles I've had to use throughout, kinda a boiler plate for now
x=0
templist2 = []
templist3 = []
templist4 = []
templist5 = []
bad = {}
#this is my list of "bad actors", list is in the following format
#[saddr as a integer, daddr as a integer, port, date as Julian, time as decimal number]
#or an arbitrary example [1053464536, 1232644361, 2222, 2014260, 15009]
badactor = 'badactor.json'
with open(badactor, 'r') as f1:
    badact = json.load(f1)
    f1.close()
for i in badact:
    print i[3] #troubleshooting to verify my value is being read in
    tmp = str(i[3])
    print tmp#again just troubleshooting
    tl=[i[0],i[4],i[1],i[2]]
    bad[tmp]=bad[tmp]+tl
    print bad[tmp]
Trying to create the variable is giving me the following error:
Traceback (most recent call last):
File "savetofiles.py", line 39, in <module>
bad[tmp]=bad[tmp]+tl
KeyError: '2014260'

By the time your code is executed, there is no key "2014260" in the "bad" dict.
Your problem is here:
bad[tmp]=bad[tmp]+tl
You're saying "add tl to something that doesn't exist."
Instead, you seem to want to do:
bad[tmp]=tl

I suggest you initialize bad to be an empty collections.defaultdict instead of just regular built-in dict. i.e.
import collections
...
bad = collections.defaultdict(list)
That way, initial empty list values will be created for you automatically the first time a date key is encountered and the error you're getting from the bad[tmp]=bad[tmp]+tl statement will go away since it will effectively become bad[tmp]=list()+tl — where the list() call just creates and returns an empty list — the first time a particular date is encountered.
It's also not clear whether you really need the tmp = str(i[3]) conversion, because values of any hashable type (which includes immutable built-ins such as ints and strings) are valid dictionary (or defaultdict) keys, not just strings, assuming i[3] isn't a string already. Regardless, subsequent code would be more readable if you named the result something else, like julian_date = i[3] (or julian_date = str(i[3]) if the conversion really is required).
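Put together, the defaultdict suggestion might look roughly like the sketch below (Python 3 syntax). The function name group_by_julian_date and the second and third records are made up for illustration. It also appends each record as its own sub-list instead of concatenating with +, which is a design choice rather than something the question asked for.

```python
import collections
import json


def group_by_julian_date(records):
    """Group [saddr, daddr, port, julian_date, time] records by julian_date."""
    grouped = collections.defaultdict(list)
    for saddr, daddr, port, julian_date, time_of_day in records:
        # Same field order the question uses for tl: saddr, time, daddr, port.
        grouped[julian_date].append([saddr, time_of_day, daddr, port])
    return grouped


records = [
    [1053464536, 1232644361, 2222, 2014260, 15009],  # example from the question
    [1053464537, 1232644361, 80, 2014260, 15100],    # made-up record
    [1053464538, 1232644362, 443, 2014261, 200],     # made-up record
]
grouped = group_by_julian_date(records)
for julian_date, entries in grouped.items():
    print("%s.json would contain %s" % (julian_date, json.dumps(entries)))
```

To actually save one file per date, the print call could be replaced by opening "%s.json" % julian_date for writing and calling json.dump(entries, f).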

python: argument generated by a prior function error

so this is probably very simple but it's confusing me.
I have a function that takes in a txt file in JSON form and sorts it in descending order of bandwidth. The function is:
def sort_guards_with_input(guards):
    json_source = json.dumps(guards)
    data = json.loads(str(json_source))
    data['relays'].sort(key=lambda item: item['bandwidth'], reverse=True)
    data = json.dumps(data)
    return data
segundo = sort_guards_with_input("the original txt file")
..and this returns the sorted file of the form (lets call it TEXT):
{"relays": [{"nickname": "Snowden4ever pd7wih1gdUU8bLhWsvH6QHDWfs8",
"bandwidth": 201000, "type": ["Fast", "Guard", "HSDir", "Named", "Running",
"Stable", "V2Dir", "Valid"]},{"nickname": "rmblue jMdIu0VZYE+S2oeHShQBAHsdj80",
"bandwidth": 8, "type": ["Fast", "Guard", "HSDir", "Running", "Stable", "Unnamed",
"V2Dir", "Valid"]}]}
Now I have a function that pulls out the bandwidth and nickname and creates a list. The function is:
def get_sorted_names_bw(guards):
    sorted_guards_bw = list(entry['bandwidth'] for entry in guards["relays"])
    sorted_guards_names = list(d['nickname'] for d in guards["relays"])
    temps = [None]*(len(sorted_guards_bw)+len(sorted_guards_names))
    temps[::2] = sorted_guards_bw
    temps[1::2] = sorted_guards_names
    sorted_grds_w_names = [temps[i:i+2] for i in range(0, len(temps), 2)]
    return sorted_grds_w_names
The problem is when I try and print the result of get_sorted_names_bw by doing:
print get_sorted_names_bw(segundo)
.. I get the error:
sorted_guards_bw = list(entry['bandwidth'] for entry in guards["relays"])
TypeError: string indices must be integers, not str
But if I try to print the result of get_sorted_names_bw by copying and pasting TEXT as the argument, it returns a result (the wrong one because nicknames and bandwidths are mixed up, that's another problem I'll deal with myself, unless the reader is feeling very kind and wants to help with that too :) ). Namely:
[[201000, 'rmblue jMdIu0VZYE+S2oeHShQBAHsdj80'], [8, 'Snowden4ever pd7wih1gdUU8bLhWsvH6QHDWfs8']]
Why do I get an error when I try use an argument generated by a prior function but don't get an error when I just copy and paste the argument?
Thanks and sorry for the long post.

Your function sort_guards_with_input dumps the data to a JSON string and returns that string. But get_sorted_names_bw assumes it is receiving the actual data (as a dict), not a string representation of it. The easiest thing is probably to just have sort_guards_with_input return data without dumping it to JSON. That is:
def sort_guards_with_input(guards):
    json_source = json.dumps(guards)
    data = json.loads(str(json_source))
    data['relays'].sort(key=lambda item: item['bandwidth'], reverse=True)
    return data
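The difference between the JSON string and the parsed data can be seen in a small sketch like the one below (Python 3 syntax). The function names sort_guards and names_with_bandwidth are hypothetical, and the relay entries are trimmed versions of the ones in the question with the type lists left out. Pairing each bandwidth with the nickname from the same entry should also avoid the mixed-up names mentioned in the question.

```python
import json

# Trimmed relay entries based on the question's sample (type lists omitted).
guards = {"relays": [
    {"nickname": "rmblue jMdIu0VZYE+S2oeHShQBAHsdj80", "bandwidth": 8},
    {"nickname": "Snowden4ever pd7wih1gdUU8bLhWsvH6QHDWfs8", "bandwidth": 201000},
]}


def sort_guards(data):
    """Return a new dict whose relays are sorted by descending bandwidth."""
    relays = sorted(data["relays"], key=lambda item: item["bandwidth"], reverse=True)
    return {"relays": relays}


def names_with_bandwidth(data):
    """Pair each bandwidth with the nickname from the same relay entry."""
    return [[relay["bandwidth"], relay["nickname"]] for relay in data["relays"]]


as_text = json.dumps(guards)   # a str: as_text["relays"] would raise TypeError
as_dict = json.loads(as_text)  # a dict again: safe to index by key
pairs = names_with_bandwidth(sort_guards(as_dict))
print(type(as_text).__name__, type(as_dict).__name__)
print(pairs)
```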

looking up values and adding to data structure

I have a .tsv file of text data, named world_bank_indicators.
I have another tsv file, which contains additional information that I need to append to a list item in my script. that file is named world_bank_regions
So far, I have code (thanks to some of the good people on this site) that filters the data that I need from world bank indicators and writes it as a 2D list to the variable mylist. additionally, I have code that reads in the second file as a dictionary. code is below:
from math import log
import csv
import re
#filehandles for spreadsheets
fhand=open("world_bank_indicators.txt", "rU")
fhand2=open("world_bank_regions.txt", "rU")
#csv reader objects for files
reader=csv.reader(fhand, dialect="excel", delimiter="\t")
reader2=csv.reader(fhand2, dialect="excel", delimiter="\t")
#empty list for appending data into
#appending into this will create a 2d list, or "a list OF lists"
mylist=list()
mylist2=list()
mydict=dict()
myset=set()
newset=set()
#filters data by iterating over each row in the reader object
#note that this IGNORES headers. This will need to be appended later
for row in reader:
    if row[1]=="7/1/2000" or row[1]=="7/1/2010":
        #plug columns into specific variables, for easier coding
        #replaces "," with empty space for columns that need to be converted to floats
        name=row[0]
        date=row[1]
        pop=row[9].replace(",",'')
        mobile=row[4].replace(",",'')
        health=row[6]
        internet=row[5]
        gdp=row[19].replace(",",'')
        #only appends rows that have COMPLETE rows of data
        if name != '' and date != '' and pop != '' and mobile != '' and health != '' and internet != '' and gdp != '':
            #declare calculated variables
            mobcap=(float(mobile)/float(pop))
            gdplog=log(float(gdp))
            healthlog=log(float(health))
            #re-declare variables as strings, rounds decimal points to 5th place
            #this could have been done once in above step, merely re-coded here for easier reading
            mobcap=str(round(mobcap, 5))
            gdplog=str(round(gdplog, 5))
            healthlog=str(round(healthlog,5))
            #put all columns into 2d list (list of lists)
            newrow=[name, date, pop, mobile, health, internet, gdp, mobcap, gdplog, healthlog]
            mylist.append(newrow)
            myset.add(name)
for row in reader2:
    mydict[row[2]]=row[0]
What I need to do now is:
1. read the country name from the mylist variable,
2. look up that string among the keys of mydict, and
3. append the value of that key back to mylist.
I'm totally stumped on how to do this.
should I make both data structures dictionaries? I still wouldn't know how to execute the above steps.
thanks for any insights.

It depends what you mean by "append the value of that key back to mylist". Do you mean, append the value we got from mydict to the list that contains the country name we used to look it up? Or do you mean to append that value from mydict to mylist itself?
The latter would be a strange thing to do, since mylist is a list of lists, whereas the value we are talking about ("row[0]") is a string. I can't intuit why we would append some strings to a list of lists, even though this is what your description says to do. So I'm assuming the former :)
Let's assume that your mylist is actually called "indicators", and mydict is called "region_info"
for indicator in indicators:
    try:
        indicator.append(region_info[indicator[0]])
    except KeyError:
        print "there is no region info for country name %s" % indicator[0]
Another comment on readability. I think that the elements of mylist would be better being dicts than lists. I would do this:
newrow={"country_name" : name,
        "date": date,
        "population": pop,
        #... etc
        }
because then when you use these things, you can use them by name instead of number, which will be more readable:
for indicator in indicators:
    try:
        indicator["region_info"] = region_info[indicator["country_name"]]
    except KeyError:
        print "there is no region info for country name %s" % indicator["country_name"]
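As an alternative to try/except, dict.get can supply a fallback value when a country has no region entry. The sketch below is in Python 3 syntax. The function name add_regions, the "unknown" placeholder, and the sample rows are assumptions for illustration, standing in for mylist and mydict from the question.

```python
def add_regions(indicators, region_info, missing="unknown"):
    """Append the region for each row's country name (row[0]) to that row, in place."""
    for row in indicators:
        # dict.get returns the fallback instead of raising KeyError
        row.append(region_info.get(row[0], missing))
    return indicators


# Made-up rows standing in for mylist and mydict.
indicators = [["Canada", "7/1/2000"], ["Atlantis", "7/1/2010"]]
region_info = {"Canada": "The Americas"}
add_regions(indicators, region_info)
print(indicators)
```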

Generating a .CSV with Several Columns - Use a Dictionary?

I am writing a script that looks through my inventory, compares it with a master list of all possible inventory items, and tells me what items I am missing. My goal is a .csv file where the first column contains a unique key integer and then the remaining several columns would have data related to that key. For example, a three row snippet of my end-goal .csv file might look like this:
100001,apple,fruit,medium,12,red
100002,carrot,vegetable,medium,10,orange
100005,radish,vegetable,small,10,red
The data for this is being drawn from a couple sources. 1st, a query to an API server gives me a list of keys for items that are in inventory. 2nd, I read in a .csv file into a dict that matches keys with item name for all possible keys. A snippet of the first 5 rows of this .csv file might look like this:
100001,apple
100002,carrot
100003,pear
100004,banana
100005,radish
Note how any key in my list of inventory will be found in this two column .csv file that gives all keys and their corresponding item name and this list minus my inventory on hand yields what I'm looking for (which is the inventory I need to get).
So far I can get a .csv file that contains just the keys and item names for the items that I don't have in inventory. Given a list of inventory on hand like this:
100003,100004
A snippet of my resulting .csv file looks like this:
100001,apple
100002,carrot
100005,radish
This means that I have pear and banana in inventory (so they are not in this .csv file.)
To get this I have a function to get an item name when given an item id that looks like this:
def getNames(id_to_name, ids):
    return [id_to_name[id] for id in ids]
Then a function which gives a list of keys as integers from my inventory server API call that returns a list and I've run this function like this:
invlist = ServerApiCallFunction(AppropriateInfo)
A third function takes this invlist as its input and returns a dict of keys (the item id) and names for the items I don't have. It also writes the information of this dict to a .csv file. I am using the set1 - set2 method to do this. It looks like this:
def InventoryNumbers(inventory):
    with open(csvfile,'w') as c:
        c.write('InvName' + ',InvID' + '\n')
    missinginvnames = []
    with open("KeyAndItemNameTwoColumns.csv","rb") as fp:
        reader = csv.reader(fp, skipinitialspace=True)
        fp.readline() # skip header
        invidsandnames = {int(id): str.upper(name) for id, name in reader}
    invids = set(invidsandnames.keys())
    invnames = set(invidsandnames.values())
    invonhandset = set(inventory)
    missinginvidsset = invids - invonhandset
    missinginvids = list(missinginvidsset)
    missinginvnames = getNames(invidsandnames, missinginvids)
    missinginvnameswithids = dict(zip(missinginvnames, missinginvids))
    print missinginvnameswithids
    with open(csvfile,'a') as c:
        for invname, invid in missinginvnameswithids.iteritems():
            c.write(invname + ',' + str(invid) + '\n')
    return missinginvnameswithids
Which I then call like this:
InventoryNumbers(invlist)
With that explanation, now on to my question here. I want to expand the data in this output .csv file by adding in additional columns. The data for this would be drawn from another .csv file, a snippet of which would look like this:
100001,fruit,medium,12,red
100002,vegetable,medium,10,orange
100003,fruit,medium,14,green
100004,fruit,medium,12,yellow
100005,vegetable,small,10,red
Note how this does not contain the item name (so I have to pull that from a different .csv file that just has the two columns of key and item name) but it does use the same keys. I am looking for a way to bring in this extra information so that my final .csv file will not just tell me the keys (which are item ids) and item names for the items I don't have in stock but it will also have columns for type, size, number, and color.
One option I've looked at is the defaultdict piece from collections, but I'm not sure if this is the best way to go about what I want to do. If I did use this method I'm not sure exactly how I'd call it to achieve my desired result. If some other method would be easier I'm certainly willing to try that, too.
How can I take my dict of keys and corresponding item names for items that I don't have in inventory and add to it this extra information in such a way that I could output it all to a .csv file?
EDIT: As I typed this up it occurred to me that I might make things easier on myself by creating a new single .csv file that would have data in the form key,item name,type,size,number,color (basically just copying the column for item name into the .csv that already has the other information for each key). This way I would only need to draw from one .csv file rather than from two. Even if I did this, though, how would I go about making my desired .csv file based on only those keys for items not in inventory?
ANSWER: I posted another question here about how to implement the solution I accepted (because it was giving me a value error since my dict values were strings rather than sets to start with) and I ended up deciding that I wanted a list rather than a set (to preserve the order). I also ended up adding the column with item names to my .csv file that had all the other data so that I only had to draw from one .csv file. That said, here is what this section of code now looks like:
MyDict = {}
infile = open('FileWithAllTheData.csv', 'r')
for line in infile.readlines():
    spl_line = line.split(',')
    if int(spl_line[0]) in missinginvids: #note that this is the list I was using as the keys for my dict which I was zipping together with a corresponding list of item names to make my dict before.
        MyDict.setdefault(int(spl_line[0]), list()).append(spl_line[1:])
print MyDict

It sounds like what you need is a dict mapping ints to sets, i.e.,
MyDict = {100001: set(['apple']), 100002: set(['carrot'])}
You can add to a set with update:
MyDict[100001].update(['fruit'])
which would give you: {100001: set(['apple', 'fruit']), 100002: set(['carrot'])}
Also, if you had a list of attributes of carrot, such as ['vegetable', 'orange'],
you could say MyDict[100002].update(['vegetable', 'orange'])
and get: {100001: set(['apple', 'fruit']), 100002: set(['carrot', 'vegetable', 'orange'])}
Does this answer your question?
EDIT:
To read the extra columns in from a CSV file:
infile = open('MyFile.csv', 'r')
for line in infile.readlines():
    spl_line = line.strip().split(',')
    key = int(spl_line[0]) # MyDict is keyed by int, so convert before the lookup
    if key in MyDict:
        MyDict[key].update(spl_line[1:])

This isn't an answer to the question, but here is a possible way of simplifying your current code.
This:
invids = set(invidsandnames.keys())
invnames = set(invidsandnames.values())
invonhandset = set(inventory)
missinginvidsset = invids - invonhandset
missinginvids = list(missinginvidsset)
missinginvnames = getNames(invidsandnames, missinginvids)
missinginvnameswithids = dict(zip(missinginvnames, missinginvids))
Can be replaced with:
invonhandset = set(inventory)
missinginvnameswithids = {name: invid for invid, name in invidsandnames.iteritems() if invid not in invonhandset}
Or:
invonhandset = set(inventory)
for key in invidsandnames.keys():
    if key in invonhandset:
        del invidsandnames[key]
missinginvnameswithids = invidsandnames

Have you considered making a temporary relational database? Python has sqlite support baked in, and for a reasonable number of items I don't think you would have any performance issues.
I would turn each CSV file and the result from the web API into a table (one table per data source). You can then do everything you want with some SQL queries and joins. Once you have the data you want, you can dump it back to CSV.
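A rough sketch of that idea, using the sqlite3 module from the standard library with an in-memory database, could look like this (Python 3 syntax). The table names, column names, and the function name missing_items are assumptions for illustration, and the rows are the snippets from the question typed in by hand instead of being read from the CSV files or the API.

```python
import sqlite3


def missing_items(names, details, on_hand):
    """Return (id, name, type, size, number, color) rows for items not on hand."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE names (id INTEGER PRIMARY KEY, name TEXT)")
    conn.execute(
        "CREATE TABLE details (id INTEGER PRIMARY KEY, type TEXT, size TEXT, "
        "number INTEGER, color TEXT)"
    )
    conn.execute("CREATE TABLE on_hand (id INTEGER PRIMARY KEY)")
    conn.executemany("INSERT INTO names VALUES (?, ?)", names)
    conn.executemany("INSERT INTO details VALUES (?, ?, ?, ?, ?)", details)
    conn.executemany("INSERT INTO on_hand VALUES (?)", [(i,) for i in on_hand])
    # Join names to details and keep only the ids that are not in inventory.
    rows = conn.execute(
        "SELECT n.id, n.name, d.type, d.size, d.number, d.color "
        "FROM names AS n JOIN details AS d ON d.id = n.id "
        "WHERE n.id NOT IN (SELECT id FROM on_hand) "
        "ORDER BY n.id"
    ).fetchall()
    conn.close()
    return rows


names = [(100001, "apple"), (100002, "carrot"), (100003, "pear"),
         (100004, "banana"), (100005, "radish")]
details = [(100001, "fruit", "medium", 12, "red"),
           (100002, "vegetable", "medium", 10, "orange"),
           (100003, "fruit", "medium", 14, "green"),
           (100004, "fruit", "medium", 12, "yellow"),
           (100005, "vegetable", "small", 10, "red")]
rows = missing_items(names, details, on_hand=[100003, 100004])
for row in rows:
    print(",".join(str(value) for value in row))
```

The resulting rows can be written out with csv.writer(f).writerows(rows) once a destination file is open.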
