I have a dataframe df that has a column tags. Each element of the tags column is a list of dictionaries and looks like this:
[
{
"id": "new",
"name": "new",
"slug": null,
"type": "HashTag",
"endIndex": 0,
"startIndex": 0
},
{
"id": "1234",
"name": "abc ltd.",
"slug": "5678",
"type": "StockTag",
"endIndex": 0,
"startIndex": 0
}
]
The list can have any number of elements.
I want to filter the dataframe df for rows where any element of the tags list has a type of either StockTag or UserTag.
I was able to check whether the first element of the list has the type StockTag as follows:
df[df['tags'].map(lambda d: d[0]['type'] == 'StockTag')]
I am unable to check the other elements. Instead of checking only the first (index=0) element, I want to iterate through all the elements and check each one.
Any help on this?
I'm supposing you have a dataframe like this:
data tags
0 some_data [{'id': 'new', 'name': 'new', 'slug': None, 't...
Where the tags column contains a list of dictionaries.
Then you can use any() to check whether any element of tags has the StockTag type:
print(df[df["tags"].apply(lambda x: any(d["type"] == "StockTag" for d in x))])
Related
I need to convert this JSON to a data frame in Python:
print(resp2)
{
"totalCount": 1,
"nextPageKey": null,
"result": [
{
"metricId": "builtin:tech.generic.cpu.usage",
"data": [
{
"dimensions": [
"process_345678"
],
"dimensionMap": {
"dt.entity.process_group_instance": "process_345678"
},
"timestamps": [
1642021200000,
1642024800000,
1642028400000
],
"values": [
10,
15,
12
]
}
]
}
]
}
Output needs to be like this:
metricId dimensions timestamps values
builtin:tech.generic.cpu.usage process_345678 1642021200000 10
builtin:tech.generic.cpu.usage process_345678 1642024800000 15
builtin:tech.generic.cpu.usage process_345678 1642028400000 12
I have tried this:
print(pd.json_normalize(resp2, "data"))
I get invalid syntax, any ideas?
Take a look at the examples for json_normalize, and you'll see it expects a list of dictionaries whose keys are the column names you want, one dictionary per row. When you have nested lists/objects, the columns are flattened into dot-notation, but nested arrays will not end up duplicated across rows.
Therefore, parse the data into a flat list first; then you can use from_records.
data = []
for r in resp2['result']:
    metricId = r['metricId']
    for d in r['data']:
        dimension = d['dimensions'][0]  # unclear why this is an array
        timestamps = d['timestamps']
        values = d['values']
        for t, v in zip(timestamps, values):
            data.append({'metricId': metricId, 'dimensions': dimension, 'timestamps': t, 'values': v})

df = pd.DataFrame.from_records(data)
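Alternatively, json_normalize can still get you most of the way there if you pair it with explode. A sketch assuming pandas >= 1.3 (needed to explode several columns at once) and that resp2 is the parsed dict shown above:
import pandas as pd

df = pd.json_normalize(resp2["result"], record_path="data", meta=["metricId"])
# "timestamps" and "values" are still list columns; expand them pairwise into rows
df = df.explode(["timestamps", "values"], ignore_index=True)
# unwrap the single-element dimensions list
df["dimensions"] = df["dimensions"].str[0]
print(df[["metricId", "dimensions", "timestamps", "values"]])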
I have a nested dictionary, whose first level keys are [0, 1, 2...] and the corresponding values of each key are of the form:
{
"geometry": {
"type": "Point",
"coordinates": [75.4516454, 27.2520587]
},
"type": "Feature",
"properties": {
"state": "Rajasthan",
"code": "BDHL",
"name": "Badhal",
"zone": "NWR",
"address": "Kishangarh Renwal, Rajasthan"
}
}
I want to make a pandas dataframe of the form:
Geometry Type Properties
Type Coordinates State Code Name Zone Address
0 Point [..., ...] Features Rajasthan BDHL ... ... ...
1
2
I am not able to understand the examples I have found online about multi-indexing/nested dataframes/pivoting. None of them seem to take the first-level keys as the primary index in the required dataframe.
How do I get from the data I have, to making it into this formatted dataframe?
I would suggest creating columns such as "geometry_type", "geometry_coord", etc. in order to differentiate these columns from the column you would name "type". In other words, use the first-level key as a prefix and the subkey as the name, producing a new column name. Then just parse the data and fill your DataFrame like this:
import json
import pandas as pd

with open("your_json.json") as f:
    j = json.load(f)  # json.load reads a file object; json.loads expects a string

rows = []
for k, v in j.items():
    if k == "geometry":
        rows.append({
            "geometry_type": v.get("type"),
            "geometry_coord": v.get("coordinates")
        })
    ...

df = pd.DataFrame(rows, columns=["geometry_type", "geometry_coord", ...])
Your output could then look like this:
  geometry_type             geometry_coord  ...
0         Point  [75.4516454, 27.2520587]  ...
PS: If you really want to go for your initial option with multi-level headers, you could check here: Giving a column multiple indexes/headers
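For completeness, a rough sketch of how the prefix idea could be applied across all of the first-level keys (0, 1, 2, ...); nested here is an assumed name for your outer dictionary:
import pandas as pd

rows = []
for idx, feature in nested.items():
    rows.append({
        "geometry_type": feature["geometry"]["type"],
        "geometry_coord": feature["geometry"]["coordinates"],
        "type": feature["type"],
        # prefix each properties subkey the same way
        **{f"properties_{k}": v for k, v in feature["properties"].items()},
    })

df = pd.DataFrame(rows, index=list(nested.keys()))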
I suppose you have a list of nested dictionaries.
Use json_normalize to read the JSON data, then split the current column index into 2 parts using str.partition:
import pandas as pd
import json
data = json.load(open('data.json'))
df = pd.json_normalize(data)
df.columns = df.columns.str.partition('.', expand=True).droplevel(level=1)
Output:
>>> df.columns
MultiIndex([( 'type', ''),
( 'geometry', 'type'),
( 'geometry', 'coordinates'),
('properties', 'state'),
('properties', 'code'),
('properties', 'name'),
('properties', 'zone'),
('properties', 'address')],
)
>>> df
type geometry properties
type coordinates state code name zone address
0 Feature Point [75.4516454, 27.2520587] Rajasthan BDHL Badhal NWR Kishangarh Renwal, Rajasthan
You can use pd.json_normalize() to normalize the nested dictionary into a dataframe df.
Then, split the column names with dots into multi-index with Index.str.split on df.columns with parameter expand=True, as follows:
Step 1: Normalize nested dict into a dataframe
j = {
"geometry": {
"type": "Point",
"coordinates": [75.4516454, 27.2520587]
},
"type": "Feature",
"properties": {
"state": "Rajasthan",
"code": "BDHL",
"name": "Badhal",
"zone": "NWR",
"address": "Kishangarh Renwal, Rajasthan"
}
}
df = pd.json_normalize(j)
Step 1 Result:
print(df)
type geometry.type geometry.coordinates properties.state properties.code properties.name properties.zone properties.address
0 Feature Point [75.4516454, 27.2520587] Rajasthan BDHL Badhal NWR Kishangarh Renwal, Rajasthan
Step 2: Create Multi-index column labels
df.columns = df.columns.str.split('.', expand=True)
Step 2 (Final) Result:
print(df)
type geometry properties
NaN type coordinates state code name zone address
0 Feature Point [75.4516454, 27.2520587] Rajasthan BDHL Badhal NWR Kishangarh Renwal, Rajasthan
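With the MultiIndex column labels in place, you can slice whole groups of columns at once, for example:
print(df["properties"])                 # all the properties.* columns
print(df[("geometry", "coordinates")])  # a single nested column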
I have a data structure like this:
data = [{
"name": "leopard",
"character": "mean",
"skills": ["sprinting", "hiding"],
"pattern": "striped",
},
{
"name": "antilope",
"character": "good",
"skills": ["running"],
},
.
.
.
]
Each key in the dictionaries has a value of type integer, string, or list of strings (not all keys are present in all dicts). Each dictionary represents a row in a table; all rows are given as the list of dictionaries.
How can I easily import this into Pandas? I tried
df = pd.DataFrame.from_records(data)
but here I get a "ValueError: arrays must all be same length" error.
The DataFrame constructor takes a list of row dictionaries (amongst other structures) as data input. Therefore the following works:
data = [{
"name": "leopard",
"character": "mean",
"skills": ["sprinting", "hiding"],
"pattern": "striped",
},
{
"name": "antilope",
"character": "good",
"skills": ["running"],
}]
df = pd.DataFrame(data)
print(df)
Output:
character name pattern skills
0 mean leopard striped [sprinting, hiding]
1 good antilope NaN [running]
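If you later need one row per skill rather than a list in a cell, the skills column can be expanded with explode (a small follow-up, assuming pandas >= 0.25 where explode was added):
print(df.explode("skills"))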
[MongoDB shell or pyMongo] I would like to know how to efficiently convert one record in a collection with an array in one field into multiple records in, say, a new collection. So far, the only solution I've been able to come up with is iterating the records one by one, then iterating the array in the field I want and doing individual inserts. I'm hoping there's a more efficient way to do this.
Example:
I want to take a collection in MongoDB with structure similar to :
[{
"_id": 1,
"points": ["a", "b", "c"]
}, {
"_id": 2,
"points": ["d"]
}]
and convert it to something like this:
[{
"_id": 1,
"points": "a"
}, {
"_id": 2,
"points": "b"
}, {
"_id": 3,
"points": "c"
}, {
"_id": 4,
"points": "d"
}]
Assuming you're ok with auto-generated _id values in the new collection, you can do this with an aggregation pipeline that uses $unwind to unwind the points array and $out to output the results to a new collection:
db.test.aggregate([
// Duplicate each doc, one per points array element
{$unwind: '$points'},
// Remove the _id field to prompt regeneration as there are now duplicates
{$project: {_id: 0}},
// Output the resulting docs to a new collection, named 'newtest'
{$out: 'newtest'}
])
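Since the question also mentions pyMongo, the same pipeline can be run from Python. A sketch in which the connection string, database name and collection names are assumptions:
from pymongo import MongoClient

client = MongoClient("mongodb://localhost:27017")
db = client["mydb"]

db.test.aggregate([
    # Duplicate each doc, one per points array element
    {"$unwind": "$points"},
    # Remove the _id field so new ones are generated on output
    {"$project": {"_id": 0}},
    # Output the resulting docs to a new collection, named 'newtest'
    {"$out": "newtest"},
])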
Here's another version which can be expected to perform worse than #JohnnyHK's solution because of a second $unwind and a potentially massive $group, but it generates integer IDs based on some order that you can specify in the $sort stage:
db.collection.aggregate([{
// flatten the "points" array to get individual documents
$unwind: { "path": "$points" },
}, {
// sort by some criterion
$sort: { "points": 1 }
}, {
// throw all sorted "points" in the very same massive array
$group: {
_id: null,
"points": { $push: "$points" },
}
}, {
// flatten the massive array making each document's position index its `_id` field
$unwind: {
"path": "$points",
includeArrayIndex: "_id"
}
} , {
// write results to new "result" collection
$out: "result"
}], {
// make sure we do not run into memory issues
allowDiskUse: true
})
I have seen some answers for similar questions but I am not sure that they were the best way to fix my problem.
I have a very large table (100,000+ rows of 20+ columns) being handled as a list of dictionaries. I need to do a partial deduplication of this list using a comparison. I have simplified an example of what I am doing now below.
table = [
    { "serial": "111", "time": "1000", "name": "jon" },
    { "serial": "222", "time": "0900", "name": "sal" },
    { "serial": "333", "time": "1100", "name": "tim" },
    { "serial": "444", "time": "1300", "name": "ron" },
    { "serial": "111", "time": "1300", "name": "pam" }
]
for row in table:
    for row2 in table:
        if row != row2:
            if row['serial'] == row2['serial']:
                if row['time'] > row2['time']:
                    action
This method does work (it is obviously simplified, and I just wrote "action" for that part), but my question is whether there is a more efficient way to get to the "row" I want without having to double-iterate the entire table. I don't have a way to predict where in the list matching rows would be located, but they would be listed under the same "serial" in this case.
I'm relatively new to Python, and efficiency is the goal here. As of now, with the number of rows being iterated, it is taking a long time to complete, and I'm sure there is a more efficient way to do this; I'm just not sure where to start.
Thanks for any help!
You can sort the table with serial as the primary key and time as the secondary key, in reverse order (so that the entry with the greatest time for each serial comes first), then iterate through the sorted list and take action only on the first dict of every distinct serial:
from operator import itemgetter
table = [
{ "serial": "111", "time": "1000", "name": "jon" },
{ "serial": "222", "time": "0900", "name": "sal" },
{ "serial": "333", "time": "1100", "name": "tim" },
{ "serial": "444", "time": "1300", "name": "ron" },
{ "serial": "111", "time": "1300", "name": "pam" }
]
last_serial = ''
for d in sorted(table, key=itemgetter('serial', 'time'), reverse=True):
    if d['serial'] != last_serial:
        action(d)
    last_serial = d['serial']
A list of dictionaries is always going to be fairly slow for this much data. Instead, look into whether Pandas is suitable for your use case - it is already optimised for this kind of work.
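For example, a rough sketch of the pandas route, keeping only the latest row per serial (the column names are taken from the example above, everything else is an assumption):
import pandas as pd

df = pd.DataFrame(table)
# sort by time, then keep the last (largest-time) row for each serial
latest = df.sort_values("time").drop_duplicates("serial", keep="last")
print(latest)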
It may not be the most efficient approach, but one thing you can do is get a list of the serial numbers and sort it. Let's call that list serialNumbersList. The serial numbers that appear only once cannot possibly be duplicates, so remove them from serialNumbersList. Then you can use that list to reduce the number of rows to process. Again, I am sure there are better solutions, but this is a good starting point.
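A quick sketch of that idea using collections.Counter (the names here are just illustrative):
from collections import Counter

counts = Counter(row["serial"] for row in table)
duplicated_serials = {s for s, c in counts.items() if c > 1}
# only these rows can possibly collide, so only they need the pairwise check
rows_to_check = [row for row in table if row["serial"] in duplicated_serials]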
#GiraffeMan91 Just to clarify what I mean (typed directly here, do not copy-paste):
import collections
import multiprocessing

serials = collections.defaultdict(list)
for d in table:
    serials[d.pop('serial')].append(d)

def process_serial(entry):
    serial, values = entry
    # remove duplicates, take action based on time
    # return serial, processed values

results = dict(
    multiprocessing.Pool(10).imap(process_serial, serials.items())
)