My problem is as follows:
I have a txt file that holds nothing but a dictionary with one single key. The value of that single key is a huge list containing dictionaries as list entries. The first key:value pair, for reference:
"data": [{"type": "utl", "id": "53150", "attributes": {"timestamp": "T13:00:00Z", "count": 0.0}}, [...etc.]
I tried the following method to convert the value of the single-keyed dictionary into a list by calling the .values() method and then using list():
list_variable = list(dict_variable.values())
But it seems that this just wraps the value in a list with a single element: when I try to access index 0 the program hangs (the list is too big to display), and if I try index 1 I get an IndexError stating that the index is out of range. (My current idea is to first convert it into a list and THEN into a DataFrame.)
I'm a bloody beginner and have no idea what else I could try. What am I missing?
Thanks a lot in advance for your helpful comments!
Looks like JSON to me. Try using pandas.json_normalize:
d = {"data": [{"type": "utl", "id": "53150", "attributes": {"timestamp": "T13:00:00Z", "count": 0.0}}]}
pd.json_normalize(d['data'])
type id attributes.timestamp attributes.count
0 utl 53150 T13:00:00Z 0.0
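Since the data lives in a file, you would load it first; a minimal sketch, assuming the file contains valid JSON and is named data.txt (the file name is just a placeholder):

import json
import pandas as pd

with open("data.txt") as f:
    d = json.load(f)  # a dict with the single "data" key

# d["data"] is the list of records itself; list(d.values()) merely wraps
# that one list inside another single-element list, hence the confusion.
df = pd.json_normalize(d["data"])
print(df)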
Does the code below help you?
test.txt
"data": [{"type": "utl", "id": "53150", "attributes": {"timestamp": "T13:00:00Z", "count": 0.0}}, {"type": "utl2", "id": "53151", "attributes": {"timestamp": "T12:00:00Z", "count": 1.0}}]
from re import findall
from pandas import json_normalize

with open("test.txt") as f:
    # grab everything from the first "{" to the last "}" and evaluate it
    # (note: eval on file content is fragile; see the json.loads variant below)
    print(json_normalize(eval(findall("{.+}", f.read())[0])))
Output:
type id attributes.timestamp attributes.count
0 utl 53150 T13:00:00Z 0.0
1 utl2 53151 T12:00:00Z 1.0
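Note that eval will execute whatever the file contains, so a safer variant is to wrap the content in braces and parse it as JSON; a sketch, assuming the file holds exactly the "data": [...] pair with no surrounding braces:

import json
from pandas import json_normalize

with open("test.txt") as f:
    # wrapping the key/value pair in braces yields a valid JSON object
    d = json.loads("{" + f.read() + "}")

print(json_normalize(d["data"]))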
I have a sizable JSON file and I need to get the index of a certain value inside it. Here's what my JSON file looks like:
data.json
[{...many more elements here...
},
{
"name": "SQUARED SOS",
"unified": "1F198",
"non_qualified": null,
"docomo": null,
"au": "E4E8",
"softbank": null,
"google": "FEB4F",
"image": "1f198.png",
"sheet_x": 0,
"sheet_y": 28,
"short_name": "sos",
"short_names": [
"sos"
],
"text": null,
"texts": null,
"category": "Symbols",
"sort_order": 167,
"added_in": "0.6",
"has_img_apple": true,
"has_img_google": true,
"has_img_twitter": true,
"has_img_facebook": true
},
{...many more elements here...
}]
How can I get the index of the value "FEB4F" whose key is "google", for example?
My only idea was this but it doesn't work:
print(data.index('FEB4F'))
Your basic data structure is a list, so there's no way to avoid looping over it.
Loop through all the items, keeping track of the current position. If the current item has the desired key/value, print the current position.
# enumerate keeps track of the current position for us
for position, item in enumerate(data):
    if item.get('google') == 'FEB4F':
        print('position is:', position)
        break
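If only the first match is needed, the same search can be written as an equivalent one-liner with next; just another formulation, nothing more:

# returns the position of the first matching item, or None if nothing matches
position = next((i for i, item in enumerate(data) if item.get('google') == 'FEB4F'), None)
print('position is:', position)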
Assuming your data can fit in a table, I recommend using pandas for that. Here is the summary:
Read the data using pandas.read_json
Identify which column to filter
Filter using pandas.DataFrame.loc
E.g.:
import pandas as pd
data = pd.read_json("path_to_json.json")
print(data)
# let's assume you want to filter using the 'unified' column
filtered = data.loc[data['unified'] == 'something']
print(filtered)
Of course the steps would be different depending on the JSON structure
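Since the original question asks for the index, note that the row labels survive filtering; a small follow-up sketch, assuming data was read with read_json as above:

# pandas assigns a 0-based RangeIndex when reading a JSON array,
# so the row labels are the positions in the original list
matches = data.index[data['google'] == 'FEB4F'].tolist()
print(matches)  # e.g. [1] if the second element matched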
I have one JSON payload which is used for one service request. After processing, that payload (JSON) will be stored in S3, and through Athena we can download the data in CSV format. In the actual scenario there are more than 100 fields, and I want to verify their values through an automated script instead of manually.
Say my sample payload is similar to the following:
{
    "BOOK": {
        "serialno": "123",
        "author": "xyz",
        "yearofpublish": "2015",
        "price": "16"
    },
    "Author": [
        {
            "isbn": "xxxxx",
            "title": "first",
            "publisher": "xyz",
            "year": "2020"
        },
        {
            "isbn": "yyyy",
            "title": "second",
            "publisher": "zmy",
            "year": "2019"
        }
    ]
}
The sample CSV will look like the following:
Can anyone please help me with how exactly I can do this in Python? Maybe a library, or a dictionary approach?
It looks like you just want to flatten out the JSON structure. It'll be easiest to loop over the "Author" list. Since the CSV has renamed the columns, you'll need some way to represent that mapping. Based only on your example, this works:
import json

# load the source JSON file
with open(some_json_file, 'r') as fin:
    j = json.load(fin)

result = []
for author in j['Author']:
    # one output row per author, with the BOOK fields repeated on each row
    # and the keys renamed to the CSV column names
    val = {'book_serialno': j['BOOK']['serialno'],
           'book_author': j['BOOK']['author'],
           'book_yearofpublish': j['BOOK']['yearofpublish'],
           'book_price': j['BOOK']['price'],
           'author_isbn': author['isbn'],
           'author_title': author['title'],
           'author_publisher': author['publisher'],
           'author_year': author['year']}
    result.append(val)
This is using a dictionary to show the mapping of data points to the new column names. You might be able to get away with using a list as well; it depends on how you want to use it later on. To write to a CSV:
import csv

# newline='' prevents blank lines between rows on Windows
with open(some_csv_file, 'w', newline='') as fout:
    writer = csv.writer(fout)
    writer.writerow(result[0].keys())              # header row
    writer.writerows(r.values() for r in result)   # data rows
This writes the column names in the first row, then the data. If you don't want the column names, just leave out the writerow(...) line.
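As an aside, csv.DictWriter can do the key-to-column pairing for you; a sketch under the same assumptions (result as built above):

import csv

with open(some_csv_file, 'w', newline='') as fout:
    writer = csv.DictWriter(fout, fieldnames=result[0].keys())
    writer.writeheader()     # column names taken from the field names
    writer.writerows(result)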
I have a data structure like this:
data = [{
"name": "leopard",
"character": "mean",
"skills": ["sprinting", "hiding"],
"pattern": "striped",
},
{
"name": "antilope",
"character": "good",
"skills": ["running"],
},
.
.
.
]
Each key in the dictionaries has values of type integer, string, or list of strings (not all keys are present in all dicts). Each dictionary represents a row in a table; all rows are given as the list of dictionaries.
How can I easily import this into Pandas? I tried
df = pd.DataFrame.from_records(data)
but here I get a "ValueError: arrays must all be same length" error.
The DataFrame constructor accepts row-based structures such as a list of dicts (amongst others) as its data input. Therefore the following works:
import pandas as pd

data = [{
    "name": "leopard",
    "character": "mean",
    "skills": ["sprinting", "hiding"],
    "pattern": "striped",
},
{
    "name": "antilope",
    "character": "good",
    "skills": ["running"],
}]

df = pd.DataFrame(data)
print(df)
Output:
character name pattern skills
0 mean leopard striped [sprinting, hiding]
1 good antilope NaN [running]
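If you later need one row per skill instead of a list inside a cell, the list column can be expanded; an optional follow-up (DataFrame.explode has been available since pandas 0.25):

# each entry of the "skills" list becomes its own row; other columns repeat
print(df.explode("skills"))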
I have seen some answers for similar questions but I am not sure that they were the best way to fix my problem.
I have a very large table (100,000+ rows of 20+ columns) being handled as a list of dictionaries. I need to do a partial deduplication of this list using a comparison. I have simplified an example of what I am doing now below.
table = [
    { "serial": "111", "time": "1000", "name": "jon" },
    { "serial": "222", "time": "0900", "name": "sal" },
    { "serial": "333", "time": "1100", "name": "tim" },
    { "serial": "444", "time": "1300", "name": "ron" },
    { "serial": "111", "time": "1300", "name": "pam" }
]
for row in table:
    for row2 in table:
        if row != row2:
            if row['serial'] == row2['serial']:
                if row['time'] > row2['time']:
                    action
This method does work (it is obviously simplified, and "action" stands in for the real logic), but my question is whether there is a more efficient method to get to the "row" that I want without having to double-iterate the entire table. I don't have a way to predict where in the list matching rows would be located, but they would be listed under the same "serial" in this case.
I'm relatively new to Python and efficiency is the goal here. As of now with the amount of rows that are being iterated it is taking a long time to complete and I'm sure there is a more efficient way to do this, I'm just not sure where to start.
Thanks for any help!
You can sort the table with serial as the primary key and time as the secondary key, in reverse order (so that the latter of the duplicate items take precedence), then iterate through the sorted list and take action only on the first dict of every distinct serial:
from operator import itemgetter
table = [
{ "serial": "111", "time": "1000", "name": "jon" },
{ "serial": "222", "time": "0900", "name": "sal" },
{ "serial": "333", "time": "1100", "name": "tim" },
{ "serial": "444", "time": "1300", "name": "ron" },
{ "serial": "111", "time": "1300", "name": "pam" }
]
last_serial = ''
for d in sorted(table, key=itemgetter('serial', 'time'), reverse=True):
    # act only on the first (i.e. latest) entry of each distinct serial
    if d['serial'] != last_serial:
        action(d)
        last_serial = d['serial']
A list of dictionaries is always going to be fairly slow for this much data. Instead, look into whether Pandas is suitable for your use case - it is already optimised for this kind of work.
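A sketch of what that could look like, assuming the goal is to keep the latest time per serial (column names taken from the example above):

import pandas as pd

df = pd.DataFrame(table)

# sort so the latest time per serial comes first, then keep only that row
deduped = df.sort_values(['serial', 'time'], ascending=False).drop_duplicates('serial')
print(deduped)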
It may not be the most efficient, but one thing you can do is get a list of the serial numbers, then sort them. Let's call that list serialNumbersList. Serial numbers that only appear once cannot possibly be duplicates, so we remove them from serialNumbersList. Then you can use that list to reduce the number of rows to process. Again, I am sure there are better solutions, but this is a good starting point; a sketch of the idea follows below.
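A minimal sketch of that idea using collections.Counter (the variable names are just illustrative):

from collections import Counter

# count how often each serial appears
serial_counts = Counter(row['serial'] for row in table)

# only rows whose serial occurs more than once can possibly be duplicates,
# so only they need the expensive comparison
candidates = [row for row in table if serial_counts[row['serial']] > 1]
print(candidates)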
#GiraffeMan91 Just to clarify what I mean (typed directly here, do not copy-paste):
import collections
import multiprocessing

serials = collections.defaultdict(list)
for d in table:
    serials[d.pop('serial')].append(d)

def process_serial(entry):
    serial, values = entry
    # remove duplicates, take action based on time
    # return serial, processed values

results = dict(
    multiprocessing.Pool(10).imap(process_serial, serials.items())
)
I'm pulling data from a json endpoint, which returns a list.
Some of the elements in this list, I want to throw out. I'm only interested in certain elements.
I'm pulling the data as such:
import json

import requests

# Pull the data
url = "https://my-endpoint.com"
user = 'user1'
pwd = 'password1'
response = requests.get(url, auth=(user, pwd))
data = json.loads(response.text)
The payload looks similar to:
[{
    "apples": {
        "value": 0.0
    },
    "oranges": {
        "value": 0.0
    },
    "name": "testing123"
},
{
    "apples": {
        "value": 0.0
    },
    "oranges": {
        "value": 0.0
    },
    "name": "foobar"
},
{
    "apples": {
        "value": 0.0
    },
    "oranges": {
        "value": 0.0
    },
    "name": "testing456"
}]
Assume that the above continues on with many other elements, each with a different name. How can I pull all of the data, but exclude what I don't want?
From the example above, I would like to pull all data for names "testing123" and "testing456", but exclude the data from "foobar".
The new list is what I would iterate over to pull the data I need for my purposes.
There's a good deal of mismatched braces in your question, but I think I've figured it out. You have 3 (+ many more) dictionaries in a list, each with its own apples and oranges (or other) keys, and then a name key. You want a list of dictionaries with the same structure as this one, just restricted to the dictionaries whose name is in a set of pre-approved names. For the sake of brevity I'll assume you have such a list of names called OK_NAMES:
new_data = [Dict for Dict in data if Dict["name"] in OK_NAMES]
There you go!
If instead you wanted to eliminate all names with a specific pattern:
new_data = [Dict for Dict in data if not Dict["name"].startswith("foobar")]
That should work
Btw, I know it's almost never a good idea to name variables after a type; I was just doing it for clarity here.
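For completeness, a runnable sketch of the first approach, with OK_NAMES defined as a set (sets give faster membership tests than lists; the names are taken from the example payload above):

OK_NAMES = {"testing123", "testing456"}  # names we want to keep

new_data = [d for d in data if d["name"] in OK_NAMES]
print([d["name"] for d in new_data])  # -> ['testing123', 'testing456']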