Extract specific column from csv stored in S3 - python

I am querying Athena through Lambda, and the results are stored in CSV format in an S3 bucket.
The CSV file has two columns: eventTime and instanceId.
I am reading the CSV file via one of the functions in my Lambda handler:
import boto3

def read_instanceids(path):
    s3 = boto3.resource('s3')
    bucket = s3.Bucket('aws-athena-query-results-mybucket-us-east-1')
    obj = bucket.Object(key=path)
    response = obj.get()
    lines = response['Body'].read().decode('utf-8').split()
    return lines
Output:
[
  "\"eventTime\",\"instanceId\"",
  "\"2021-09-27T19:46:08Z\",\"\"\"i-0aa1f4dd\"\"\"",
  "\"2021-09-27T21:04:13Z\",\"\"\"i-0465c287\"\"\"",
  "\"2021-09-27T21:10:48Z\",\"\"\"i-08b75f79\"\"\"",
  "\"2021-09-27T19:40:43Z\",\"\"\"i-0456700b\"\"\"",
  "\"2021-03-29T21:58:40Z\",\"\"\"i-0724f99f\"\"\"",
  "\"2021-03-29T23:27:44Z\",\"\"\"i-0fafbe64\"\"\"",
  "\"2021-03-29T21:41:12Z\",\"\"\"i-0064a8552\"\"\"",
  "\"2021-03-29T23:19:09Z\",\"\"\"i-07f5f08e5\"\"\""
]
I want to store only the instance IDs in one array. How can I achieve that? I can't use Pandas/NumPy.
If I use get_query_results and return the response, it's in the format below:
[
  {"Data": [{"VarCharValue": "eventTime"}, {"VarCharValue": "instanceId"}]},
  {"Data": [{"VarCharValue": "2021-09-23T22:36:15Z"}, {"VarCharValue": "\"i-053090803\""}]},
  {"Data": [{"VarCharValue": "2021-03-29T21:58:40Z"}, {"VarCharValue": "\"i-0724f62a\""}]},
  {"Data": [{"VarCharValue": "2021-03-29T21:41:12Z"}, {"VarCharValue": "\"i-552\""}]},
  {"Data": [{"VarCharValue": "2021-03-29T23:19:09Z"}, {"VarCharValue": "\"i-07f4e5\""}]},
  {"Data": [{"VarCharValue": "2021-03-29T23:03:09Z"}, {"VarCharValue": "\"i-0eb453\""}]},
  {"Data": [{"VarCharValue": "2021-03-30T19:18:11Z"}, {"VarCharValue": "\"i-062120\""}]},
  {"Data": [{"VarCharValue": "2021-03-30T18:15:26Z"}, {"VarCharValue": "\"i-0121a04\""}]},
  {"Data": [{"VarCharValue": "2021-03-29T23:27:44Z"}, {"VarCharValue": "\"i-0f213\""}]},
  {"Data": [{"VarCharValue": "2021-03-30T18:07:05Z"}, {"VarCharValue": "\"i-0ee19d8\""}]},
  {"Data": [{"VarCharValue": "2021-04-28T14:49:22Z"}, {"VarCharValue": "\"i-04ad3c29\""}]},
  {"Data": [{"VarCharValue": "2021-04-28T14:38:43Z"}, {"VarCharValue": "\"i-7c6166\""}]},
  {"Data": [{"VarCharValue": "2021-03-30T19:13:42Z"}, {"VarCharValue": "\"i-07bc579d\""}]},
  {"Data": [{"VarCharValue": "2021-04-29T19:47:34Z"}, {"VarCharValue": "\"i-0b8bc7df5\""}]}
]

You can use the result returned from Amazon Athena via get_query_results().
If the data variable contains the JSON shown in your question, you can extract a list of the instances with:
rows = [row['Data'][1]['VarCharValue'].replace('"', '') for row in data]
print(rows)
The output is:
['instanceId', 'i-053090803', 'i-0724f62a', 'i-552', 'i-07f4e5', 'i-0eb453', 'i-062120', 'i-0121a04', 'i-0f213', 'i-0ee19d8', 'i-04ad3c29', 'i-7c6166', 'i-07bc579d', 'i-0b8bc7df5']
You can skip the column header by referencing: rows[1:]
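Putting it together, a minimal sketch against the boto3 Athena client (query_execution_id is a placeholder, and the single-page read is an assumption; large result sets would need NextToken pagination):

import boto3

athena = boto3.client('athena')

def list_instance_ids(query_execution_id):
    # Fetch one page of Athena results; paginate with NextToken for larger sets.
    response = athena.get_query_results(QueryExecutionId=query_execution_id)
    rows = response['ResultSet']['Rows']
    # Each row looks like {'Data': [{'VarCharValue': ...}, {'VarCharValue': ...}]},
    # where column 1 holds the instanceId.
    ids = [row['Data'][1]['VarCharValue'].replace('"', '') for row in rows]
    return ids[1:]  # drop the 'instanceId' header row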

If your list were valid Python, you could do:
l = [ "eventTime",
"instanceId",
"2021-09-27T19:46:08Z",
"i-0aa1f4dd",
"2021-09-27T21:04:13Z",
"""i-0465c287""",
"2021-09-27T21:10:48Z",
"""i-08b75f79""",
"2021-09-27T19:40:43Z",
"""i-0456700b""",
"2021-03-29T21:58:40Z",
"""i-0724f99f""",
"2021-03-29T23:27:44Z",
"""i-0fafbe64""",
"2021-03-29T21:41:12Z",
"""i-0064a8552""",
"2021-03-29T23:19:09Z",
"""i-07f5f08e5""" ]
print(l[2:][1::2])
['i-0aa1f4dd', 'i-0465c287', 'i-08b75f79', 'i-0456700b', 'i-0724f99f', 'i-0fafbe64', 'i-0064a8552', 'i-07f5f08e5']
Here l[2:] drops the two header fields, and [1::2] then takes every second element, leaving only the instance IDs.

Python has a csv module in the standard library: https://docs.python.org/3/library/csv.html
But in this use case, if the instance IDs don't contain commas, you can split each line on the comma, take the second field, and strip the double quotes.
import boto3

def read_instanceids(path):
    s3 = boto3.resource('s3')
    bucket = s3.Bucket('aws-athena-query-results-mybucket-us-east-1')
    obj = bucket.Object(key=path)
    response = obj.get()
    lines = response['Body'].read().decode('utf-8').split()
    # Skip the header line, then take the second field and strip the quotes.
    return [line.split(',')[1].strip('"') for line in lines[1:]]
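If you would rather use the csv module mentioned above, a sketch along the same lines (same bucket and path assumptions as the function above):

import csv
import io

import boto3

def read_instanceids_csv(path):
    s3 = boto3.resource('s3')
    obj = s3.Bucket('aws-athena-query-results-mybucket-us-east-1').Object(key=path)
    body = obj.get()['Body'].read().decode('utf-8')
    reader = csv.reader(io.StringIO(body))
    next(reader)  # skip the eventTime,instanceId header row
    # Athena wraps the already-quoted IDs in another layer of quotes,
    # so strip any quotes that survive CSV parsing.
    return [row[1].strip('"') for row in reader]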

Related

How to delete an element of an array in a JSON file by its key in Python?

I am creating a kind of database using a .json file, and I want to delete a specific element of an array in the JSON file using Python, but I can't get it to work the way I want. Here's what the info.json file looks like:
{
  "dates": [
    {
      "date": "10/10",
      "desc": "test1"
    },
    {
      "date": "09/09",
      "desc": "test3"
    }
  ],
  "data": [
    {
      "name": "123",
      "infotext": "1234"
    },
    {
      "name": "!##",
      "infotext": "!##$"
    }
  ]
}
Here's what the json_import.py file looks like:
import json

def delete_data():
    name = input("Enter data name\n")
    with open("info.json", "r+") as f:
        file_data = json.load(f)
        for x in file_data["data"]:
            if x["name"] == name:
                file_data["data"].remove(x)
        f.seek(0)
        json.dump(file_data, f, indent=4)

delete_data()
TERMINAL:
Enter data name
!##
Expected:
{
  "dates": [
    {
      "date": "10/10",
      "desc": "test1"
    },
    {
      "date": "09/09",
      "desc": "test3"
    }
  ],
  "data": [
    {
      "name": "123",
      "datatext": "1234"
    }
  ]
}
Actual result:
{
  "dates": [
    {
      "date": "10/10",
      "desc": "test1"
    },
    {
      "date": "09/09",
      "desc": "test3"
    }
  ],
  "data": [
    {
      "name": "123",
      "datatext": "1234"
    }
  ]
} {
  "name": "!##",
  "infotext": "!##$"
}
]
}
So how do I fix it?
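For what it's worth, the leftover text after the closing brace is the classic symptom of rewriting a file opened in "r+" without truncating: the new JSON is shorter than the old content, so the tail of the old content survives. A minimal sketch of one common fix, calling truncate() after the dump (and rebuilding the list rather than removing items while iterating over it):

import json

def delete_data():
    name = input("Enter data name\n")
    with open("info.json", "r+") as f:
        file_data = json.load(f)
        # Build a new list instead of mutating the one being iterated.
        file_data["data"] = [x for x in file_data["data"] if x["name"] != name]
        f.seek(0)
        json.dump(file_data, f, indent=4)
        f.truncate()  # discard leftover bytes from the old, longer content

delete_data()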

Creating custom JSON from existing JSON using Python

(Python beginner alert) I am trying to create a custom JSON from an existing JSON. The scenario is: I have a source that can send many sets of fields, but I want to cherry-pick some of them and create a subset while maintaining the original JSON structure. Original sample:
{
  "Response": {
    "rCode": "11111",
    "rDesc": "SUCCESS",
    "pData": {
      "code": "123-abc-456-xyz",
      "sData": [
        {
          "receiptTime": "2014-03-02T00:00:00.000",
          "sessionDate": "2014-02-28",
          "dID": {
            "d": {
              "serialNo": "3432423423",
              "dType": "11111",
              "dTypeDesc": "123123sd"
            },
            "mode": "xyz"
          },
          "usage": {
            "duration": "661",
            "mOn": [
              "2014-02-28_20:25:00",
              "2014-02-28_22:58:00"
            ],
            "mOff": [
              "2014-02-28_21:36:00",
              "2014-03-01_03:39:00"
            ]
          },
          "set": {
            "abx": "1",
            "ayx": "1",
            "pal": "1"
          },
          "rEvents": {
            "john": "doe",
            "lorem": "ipsum"
          }
        },
        {
          "receiptTime": "2014-04-02T00:00:00.000",
          "sessionDate": "2014-04-28",
          "dID": {
            "d": {
              "serialNo": "123123",
              "dType": "11111",
              "dTypeDesc": "123123sd"
            },
            "mode": "xyz"
          },
          "usage": {
            "duration": "123",
            "mOn": [
              "2014-04-28_20:25:00",
              "2014-04-28_22:58:00"
            ],
            "mOff": [
              "2014-04-28_21:36:00",
              "2014-04-01_03:39:00"
            ]
          },
          "set": {
            "abx": "4",
            "ayx": "3",
            "pal": "1"
          },
          "rEvents": {
            "john": "doe",
            "lorem": "ipsum"
          }
        }
      ]
    }
  }
}
Here the sData array has several tags, of which I want to keep only a few and get rid of the rest. I know I could use element.pop(), but I can't go and delete every new incoming field each time the source publishes one. Below is the expected output.
Expected Output
{
  "Response": {
    "rCode": "11111",
    "rDesc": "SUCCESS",
    "pData": {
      "code": "123-abc-456-xyz",
      "sData": [
        {
          "receiptTime": "2014-03-02T00:00:00.000",
          "sessionDate": "2014-02-28",
          "usage": {
            "duration": "661",
            "mOn": [
              "2014-02-28_20:25:00",
              "2014-02-28_22:58:00"
            ],
            "mOff": [
              "2014-02-28_21:36:00",
              "2014-03-01_03:39:00"
            ]
          },
          "set": {
            "abx": "1",
            "ayx": "1",
            "pal": "1"
          }
        },
        {
          "receiptTime": "2014-04-02T00:00:00.000",
          "sessionDate": "2014-04-28",
          "usage": {
            "duration": "123",
            "mOn": [
              "2014-04-28_20:25:00",
              "2014-04-28_22:58:00"
            ],
            "mOff": [
              "2014-04-28_21:36:00",
              "2014-04-01_03:39:00"
            ]
          },
          "set": {
            "abx": "4",
            "ayx": "3",
            "pal": "1"
          }
        }
      ]
    }
  }
}
I took reference from How can I create a new JSON object form another using Python? but it's not working as expected. Looking forward to inputs/solutions from all of you gurus. Thanks in advance.
Kind of like this:
import json

def subset(d):
    # Keep only the whitelisted fields.
    newd = {}
    for name in ('receiptTime', 'sessionDate', 'usage', 'set'):
        newd[name] = d[name]
    return newd

with open("fullset.json") as f:
    data = json.load(f)

data['Response']['pData']['sData'] = [subset(d) for d in data['Response']['pData']['sData']]

with open('newdata.json', 'w') as f:
    json.dump(data, f)
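Since the question says the set of incoming fields varies, here is a tolerant variant of the same subset idea that skips keys a record happens to lack (a sketch, not part of the original answer):

KEEP = ('receiptTime', 'sessionDate', 'usage', 'set')

def subset(d):
    # Copy only the whitelisted keys that are actually present in the record.
    return {name: d[name] for name in KEEP if name in d}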

Filter MongoDB query to find documents only if a field in a list of objects is not empty

I have a MongoDB document structure like the following:
Structure
{
  "stores": [
    {
      "items": [
        {
          "feedback": [],
          "item_category": "101",
          "item_id": "10"
        },
        {
          "feedback": [],
          "item_category": "101",
          "item_id": "11"
        }
      ]
    },
    {
      "items": [
        {
          "feedback": [],
          "item_category": "101",
          "item_id": "10"
        },
        {
          "feedback": ["A feedback"],
          "item_category": "101",
          "item_id": "11"
        },
        {
          "feedback": [],
          "item_category": "101",
          "item_id": "12"
        },
        {
          "feedback": [],
          "item_category": "102",
          "item_id": "13"
        },
        {
          "feedback": [],
          "item_category": "102",
          "item_id": "14"
        }
      ],
      "store_id": 500
    }
  ]
}
This is a single document in a collection; some fields were removed to produce a minimal representation of the data.
What I want is to get the items only if the feedback field in the items array is not empty. The expected result is:
Expected result
{
  "stores": [
    {
      "items": [
        {
          "feedback": ["A feedback"],
          "item_category": "101",
          "item_id": "11"
        }
      ],
      "store_id": 500
    }
  ]
}
This is what I tried, based on the examples in this, which I think is pretty much the same situation, but it didn't work. What's wrong with my query? Isn't it the same situation as the zipcode search example in the link? It returns everything, like the first JSON block (Structure) above:
What I tried
query = {
    'date': {'$gte': since, '$lte': until},
    'stores.items': {"$elemMatch": {"feedback": {"$ne": []}}}
}
Thanks.
Please try this:
db.yourCollectionName.aggregate([
  { $match: { 'date': { '$gte': since, '$lte': until }, 'stores.items': { "$elemMatch": { "feedback": { "$ne": [] } } } } },
  { $unwind: '$stores' },
  { $match: { 'stores.items': { "$elemMatch": { "feedback": { "$ne": [] } } } } },
  { $unwind: '$stores.items' },
  { $match: { 'stores.items.feedback': { "$ne": [] } } },
  { $group: { _id: { _id: '$_id', store_id: '$stores.store_id' }, items: { $push: '$stores.items' } } },
  { $project: { _id: '$_id._id', store_id: '$_id.store_id', items: 1 } },
  { $group: { _id: '$_id', stores: { $push: '$$ROOT' } } },
  { $project: { 'stores._id': 0 } }
])
We need all these stages because you have to operate on an array of arrays. This query is written assuming you're dealing with a large data set; since you're already filtering on dates, if your documents shrink enough after the first $match, you can drop the second $match stage that sits between the two $unwind stages.
References: $match, $unwind, $project, $group
This aggregate query gets the needed result (using the provided sample document and run from the mongo shell):
db.stores.aggregate([
  { $unwind: "$stores" },
  { $unwind: "$stores.items" },
  { $addFields: { feedbackExists: { $gt: [ { $size: "$stores.items.feedback" }, 0 ] } } },
  { $match: { feedbackExists: true } },
  { $project: { _id: 0, feedbackExists: 0 } }
])
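For reference, the same pipeline can be run from Python with pymongo (a sketch; the default client settings and the mydb database name are assumptions):

from pymongo import MongoClient

client = MongoClient()  # assumes a local MongoDB instance
collection = client["mydb"]["stores"]

pipeline = [
    {"$unwind": "$stores"},
    {"$unwind": "$stores.items"},
    {"$addFields": {"feedbackExists": {"$gt": [{"$size": "$stores.items.feedback"}, 0]}}},
    {"$match": {"feedbackExists": True}},
    {"$project": {"_id": 0, "feedbackExists": 0}},
]
for doc in collection.aggregate(pipeline):
    print(doc)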

Customize Python JSON object_hook

I am trying to customize JSON data using object_hook in Python 3, but I don't know how to get started. Any pointers are much appreciated. I am trying to introduce a new key and move the existing data under it in the resulting Python object.
I am trying to convert the JSON text below:
{
  "output": [
    {
      "Id": "101",
      "purpose": "xyz text",
      "array": [
        {
          "data": "abcd"
        },
        {
          "data": "ef gh ij"
        }
      ]
    },
    {
      "Id": "102",
      "purpose": "11xyz text",
      "array": [
        {
          "data": "abcd"
        },
        {
          "data": "java"
        },
        {
          "data": "ef gh ij"
        }
      ]
    }
  ]
}
to
{
  "output": [
    {
      "Id": "101",
      "mydata": {
        "purpose": "xyz text",
        "array": [
          {
            "data": "abcd"
          },
          {
            "data": "ef gh ij"
          }
        ]
      }
    },
    {
      "Id": "102",
      "mydata": {
        "purpose": "11xyz text",
        "array": [
          {
            "data": "abcd"
          },
          {
            "data": "java"
          },
          {
            "data": "ef gh ij"
          }
        ]
      }
    }
  ]
}
My Python JSON object hook is defined as:
class JSONObject:
    def __init__(self, dict):
        vars(self).update(dict)

    def toJSON(self):
        return json.dumps(self, default=lambda o: o.__dict__,
                          sort_keys=True, indent=4)
You can specify a custom object_pairs_hook, which the decoder calls with the list of (key, value) pairs of every JSON object it parses, innermost first (input_json is the string with your input JSON):
def mydata_hook(obj):
    obj_d = dict(obj)
    if 'Id' in obj_d:
        return {'Id': obj_d['Id'],
                'mydata': {k: v for k, v in obj_d.items() if 'Id' not in k}}
    else:
        return obj_d

print(json.dumps(json.loads(input_json, object_pairs_hook=mydata_hook), indent=2))
And the output:
{
  "output": [
    {
      "mydata": {
        "array": [
          {
            "data": "abcd"
          },
          {
            "data": "ef gh ij"
          }
        ],
        "purpose": "xyz text"
      },
      "Id": "101"
    },
    {
      "mydata": {
        "array": [
          {
            "data": "abcd"
          },
          {
            "data": "java"
          },
          {
            "data": "ef gh ij"
          }
        ],
        "purpose": "11xyz text"
      },
      "Id": "102"
    }
  ]
}
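Since this particular transformation doesn't depend on key order, the plain object_hook the question asks about works the same way (a sketch; object_hook receives each decoded JSON object as a dict):

import json

def mydata_hook(obj_d):
    if 'Id' in obj_d:
        return {'Id': obj_d['Id'],
                'mydata': {k: v for k, v in obj_d.items() if 'Id' not in k}}
    return obj_d

result = json.loads(input_json, object_hook=mydata_hook)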

How to search through a JSON object for a specific list of expected key values?

I have this JSON object, and I am curious how to iterate through the serviceCatalog entries and alert on any name that does not equal "service-foo" or "service-bar".
Here is my JSON object:
{
  "access": {
    "serviceCatalog": [
      {
        "endpoints": [
          {
            "internalURL": "https://snet-storage101.example.com//v1.0",
            "publicURL": "https://storage101.example.com//v1.0",
            "region": "LON",
            "tenantId": "1
          },
          {
            "internalURL": "https://snet-storage101.example.com//v1.0",
            "publicURL": "https://storage101.example.com//v1.0",
            "region": "USA",
            "tenantId": "1
          }
        ],
        "name": "service-foo",
        "type": "object-store"
      },
      {
        "endpoints": [
          {
            "publicURL": "https://x.example.com:9384/v1.0/x",
            "tenantId": "6y5t4re32"
          }
        ],
        "name": "service-bar",
        "type": "rax:test"
      },
      {
        "endpoints": [
          {
            "publicURL": "https://y.example.com:9384/v1.0/x",
            "tenantId": "765432"
          }
        ],
        "name": "service-thesystem",
        "type": "rax:test"
      }
    ]
  }
If x is the above-mentioned dictionary, you could do:

for item in x["access"]["serviceCatalog"]:
    if item["name"] not in ["service-foo", "service-bar"]:
        print(item["name"])

P.S. You could use json.loads() to decode the JSON data if that is what you're asking. Also note that you have errors in your JSON: the tenantId strings are unterminated, and a closing brace is missing at the end.
