I have the following list of dictionaries:
datalist = [
    {'Business': 'Business A', 'Category': 'IT', 'Title': 'IT Manager'},
    {'Business': 'Business A', 'Category': 'Sourcing', 'Title': 'Sourcing Manager'}
]
I would like to upload to DynamoDB in the following format:
table.put_item(Item={
        'Business':'Business A would go here',
        'Category':'IT would go here',
        'Title':'IT Manager would go here'
    },
    {
        'Business':'Business B would go here',
        'Category':'Sourcing would go here',
        'Title':'Sourcing Manager would go here'
    }
)
I've tried converting the list of dicts to a dict first and accessing the elements that way, and iterating through the Item parameter, but no luck. Any help would be appreciated.
Here's my DynamoDB structure, 3 columns (Business, Category, Title):
{
    "Business": {
        "S": "Business goes here"
    },
    "Category": {
        "S": "Category goes here"
    },
    "Title": {
        "S": "Title goes here"
    }
}
The put_item API call uploads exactly one item; you're trying to upload two at the same time, and that doesn't work.
You can, however, batch multiple put_item requests with boto3 to reduce the number of API calls. Here's an example adapted from the documentation:
with table.batch_writer() as batch:
    for item in datalist:
        batch.put_item(
            Item=item
        )
This will automatically buffer the writes and send them in batches under the hood.
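For context, here's a minimal end-to-end sketch of the same idea; the table name and region are placeholders, and the table is assumed to already exist with an appropriate key schema:

import boto3

# Hypothetical table name and region; substitute your own.
dynamodb = boto3.resource('dynamodb', region_name='us-east-1')
table = dynamodb.Table('my-table')

datalist = [
    {'Business': 'Business A', 'Category': 'IT', 'Title': 'IT Manager'},
    {'Business': 'Business A', 'Category': 'Sourcing', 'Title': 'Sourcing Manager'}
]

# batch_writer buffers the items and sends them as BatchWriteItem calls,
# resending any unprocessed items for you.
with table.batch_writer() as batch:
    for item in datalist:
        batch.put_item(Item=item)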
I have a JSON file that I read in pandas and converted to a dataframe. I then exported this file as a CSV so I could edit it easier. Once finished, I read the CSV file back into a dataframe and then wanted to convert it back to a JSON file. However, in that process a whole lot of extra data was automatically added to my original list of dictionaries (the JSON file).
I'm sure I could hack together a fix, but wanted to know if anyone knows an efficient way to handle this process so that NO new data or columns are added to my original JSON data?
Original JSON (snippet):
[
    {
        "tag": "!= (not-equal-to operator)",
        "definition": "",
        "source": [
            {
                "title": "Compare Dictionaries",
                "URL": "https://learning.oreilly.com/library/view/introducing-python-2nd/9781492051374/ch08.html#idm45795007002280"
            }
        ]
    },
    {
        "tag": "\"intelligent\" applications",
        "definition": "",
        "source": [
            {
                "title": "Why Machine Learning?",
                "URL": "https://learning.oreilly.com/library/view/introduction-to-machine/9781449369880/https://learning.oreilly.com/library/view/introduction-to-machine/9781449369880/ch01.html#idm45613685872600"
            }
        ]
    },
    {
        "tag": "# (pound sign)",
        "definition": "",
        "source": [
            {
                "title": "Comment with #",
                "URL": "https://learning.oreilly.com/library/view/introducing-python-2nd/9781492051374/ch04.html#idm45795038172984"
            }
        ]
    },
CSV as a dataframe (an index was automatically added):
tag definition source
0 != (not-equal-to operator) [{'title': 'Compare Dictionaries', 'URL': 'htt...
1 "intelligent" applications [{'title': 'Why Machine Learning?', 'URL': 'ht...
2 # (pound sign) [{'title': 'Comment with #', 'URL': 'https://l...
3 $ (Mac/Linux prompt) [{'title': 'Test Driving Python', 'URL': 'http...
4 $ (anchor) [{'title': 'Patterns: Using Specifiers', 'URL'...
... ... ... ...
11375 { } (curly brackets) []
11376 | (vertical bar) [{'title': 'Combinations and Operators', 'URL'...
11377 || (concatenation) function (DB2/Oracle/Postgr... [{'title': 'Discussion', 'URL': 'https://learn...
11378 || (for Oracle Database) [{'title': 'Including special characters', 'UR...
11379 || (vertical bar, double), concatenation opera... [{'title': 'Including special characters', 'UR...
7009 rows × 3 columns
JSON file after converting from CSV (all sorts of awful):
{
    "0":{
        "Unnamed: 0":0,
        "tag":"!= (not-equal-to operator)",
        "definition":null,
        "source":"[{'title': 'Compare Dictionaries', 'URL': 'https:\/\/learning.oreilly.com\/library\/view\/introducing-python-2nd\/9781492051374\/ch08.html#idm45795007002280'}]"
    },
    "1":{
        "Unnamed: 0":1,
        "tag":"\"intelligent\" applications",
        "definition":null,
        "source":"[{'title': 'Why Machine Learning?', 'URL': 'https:\/\/learning.oreilly.com\/library\/view\/introduction-to-machine\/9781449369880\/https:\/\/learning.oreilly.com\/library\/view\/introduction-to-machine\/9781449369880\/ch01.html#idm45613685872600'}]"
    },
    "2":{
        "Unnamed: 0":2,
        "tag":"# (pound sign)",
        "definition":null,
        "source":"[{'title': 'Comment with #', 'URL': 'https:\/\/learning.oreilly.com\/library\/view\/introducing-python-2nd\/9781492051374\/ch04.html#idm45795038172984'}]"
    },
Here is my code:
import pandas as pd
import json
# to dataframe
tags_df = pd.read_json('dsa_tags_flat.json')
# csv file was manually cleaned then reloaded here
cleaned_csv_df = pd.read_csv('dsa-parser-flat.csv')
# write to JSON
cleaned_csv_df.to_json(r'dsa-tags.json', orient='index', indent=2)
EDIT: I added index=False to the code when going from dataframe to CSV, which helped, but I still have the index keys that were not in the original JSON. I wonder if there is a library function out there that would prevent this? Or do I have to just write some loops and remove them myself?
Also, as you can see, the URL forward-slashes were escaped. Not what I wanted.
{
    "0":{
        "tag":"!= (not-equal-to operator)",
        "definition":null,
        "source":"[{'title': 'Compare Dictionaries', 'URL': 'https:\/\/learning.oreilly.com\/library\/view\/introducing-python-2nd\/9781492051374\/ch08.html#idm45795007002280'}]"
    },
    "1":{
        "tag":"\"intelligent\" applications",
        "definition":null,
        "source":"[{'title': 'Why Machine Learning?', 'URL': 'https:\/\/learning.oreilly.com\/library\/view\/introduction-to-machine\/9781449369880\/https:\/\/learning.oreilly.com\/library\/view\/introduction-to-machine\/9781449369880\/ch01.html#idm45613685872600'}]"
    },
    "2":{
        "tag":"# (pound sign)",
        "definition":null,
        "source":"[{'title': 'Comment with #', 'URL': 'https:\/\/learning.oreilly.com\/library\/view\/introducing-python-2nd\/9781492051374\/ch04.html#idm45795038172984'}]"
    },
    "3":{
        "tag":"$ (Mac\/Linux prompt)",
        "definition":null,
        "source":"[{'title': 'Test Driving Python', 'URL': 'https:\/\/learning.oreilly.com\/library\/view\/data-wrangling-with\/9781491948804\/ch01.html#idm140080973230480'}]"
    },
The issue is that you are adding an index in two places.
First, while writing your file to CSV. This adds the "Unnamed: 0" fields to the final JSON file. You can pass index=False to the to_csv method while writing the CSV to disk, or specify the index_col parameter while reading the saved CSV back in read_csv.
Second, you are adding an index while writing the df to JSON with orient="index". This adds the outermost indices such as "0" and "1" to the final JSON file. You should use orient="records" if you intend to save the JSON in a format similar to the one it was loaded in.
To understand how the orient parameter works, refer to pandas.DataFrame.to_json
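As a minimal sketch of the corrected round trip (file names taken from the question, the manual CSV cleaning step elided):

import pandas as pd

# to dataframe
tags_df = pd.read_json('dsa_tags_flat.json')

# write the CSV without the dataframe index, so no "Unnamed: 0" column appears later
tags_df.to_csv('dsa-parser-flat.csv', index=False)

# ... manually clean the CSV here ...

# read the cleaned CSV back in
cleaned_csv_df = pd.read_csv('dsa-parser-flat.csv')

# orient='records' writes a list of objects, matching the original JSON layout
cleaned_csv_df.to_json('dsa-tags.json', orient='records', indent=2)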
I want to create a random video picker in Python, where you input a (YouTube) channel's name and it picks a random video from that channel. I saw an internet tutorial which said to use youtube.channels().list(part="contentDetails", forUsername="GoogleDevelopers"), take the playlist ID from that, and then call youtube.playlistItems().list(part="snippet", maxResults=50, playlistId="playlistId"). The problem is: how can I take just the playlist ID from youtube.channels().list(), instead of the long thing it normally outputs? The response is stored in a variable, and even if there is no way to get just the playlist ID, is there a way to read just the uploads value from the variable?
The normal output looks like this:
{
    'kind': 'youtube#channelListResponse',
    'etag': 'h612UhyviV63eK7y4HMgXE59VnY',
    'pageInfo': {
        'totalResults': 1,
        'resultsPerPage': 5
    },
    'items': [
        {
            'kind': 'youtube#channel',
            'etag': 'tjfVDNBL4GkV4fzZBO9NE36KY5o',
            'id': 'UC_x5XG1OV2P6uZZ5FSM9Ttw',
            'contentDetails': {
                'relatedPlaylists': {
                    'likes': '',
                    'uploads': 'UU_x5XG1OV2P6uZZ5FSM9Ttw'
                }
            }
        }
    ]
}
Sorry if my English isn't clear; please tell me if I should provide any further information.
If x is your 'normal output':
some_variable = x['items'][0]['contentDetails']['relatedPlaylists']['uploads']
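Put together with the two API calls from the question, a rough sketch of the whole picker might look like this (the API key is a placeholder, error handling is omitted, and forUsername only resolves legacy channel usernames):

import random
from googleapiclient.discovery import build

# Placeholder credentials; substitute your own API key.
youtube = build('youtube', 'v3', developerKey='YOUR_API_KEY')

channel_response = youtube.channels().list(
    part='contentDetails',
    forUsername='GoogleDevelopers'
).execute()

# Drill into the nested dicts to pull out just the uploads playlist ID.
uploads_playlist_id = channel_response['items'][0]['contentDetails']['relatedPlaylists']['uploads']

videos_response = youtube.playlistItems().list(
    part='snippet',
    maxResults=50,
    playlistId=uploads_playlist_id
).execute()

# Pick one random video from the first page of uploads.
random_video = random.choice(videos_response['items'])
print(random_video['snippet']['title'])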
I am trying to create a pandas dataframe out of a nested json. For some reason, I seem to be unable to address the third level.
My json looks something like this:
"numberOfResults": 376,
"results": [
{
"name": "single",
"docs": [
{
"id": "RAKDI342342",
"type": "Culture",
"category": "Culture",
"media": "unknown",
"label": "exampellabel",
"title": "testtitle and titletest",
"subtitle": "Archive"
]
},
{
"id": "GUI6N5QHBPTO6GJ66VP5OXB7GKX6J7ER",
"type": "Culture",
"category": "Culture",
"media": "image",
"label": "more label als example",
"title": "test the second title",
"subtitle": "picture"
and so on.
Within the "docs" part are all the actual results, each starting with "id". Once all the information for one item is there, the next block starting with "id" simply follows.
Now I am trying to create a table with the keys id, label and title (for a start) for each of these separate blocks (in this case actual items).
After defining the search_url (where I get the json from), my code for this currently looks like this:
result = requests.get(search_url)
data = result.json()
data.keys()
With this, I get told that the dict_keys are the following:
dict_keys(['numberOfResults', 'results', 'facets', 'entities', 'fulltexts', 'correctedQuery', 'highlightedTerms', 'randomSeed', 'nextCursorMark'])
Given the json from above, I know I want to look into "results" and then further into "docs". According to the documentation I found, I should be able to achieve this by addressing the results-part directly and then addressing the nested bit by separating the fields with ".".
I have now tried the following the code:
fields = ["docs.id", "docs.label", "docs.title"]
df = pd.json_normalize(data["results"])
df[fields]
This works until df[fields] - at this stage the program tells me:
KeyError: "['docs.id'] not in index"
It does work for the level above though, so if I try the same with "name" and "docs" I get a lovely dataframe. What am I doing wrong? I am still a python and pandas beginner and would appreciate any help very much!
EDIT:
The desired dataframe output would look roughly like this:
id label title
0 RAKDI342342 exampellabel testtitle and titletest
Use pandas.json_normalize()
The following code uses pandas v.1.2.4
If you don't want the other columns, remove the list of keys assigned to meta
Use pandas.DataFrame.drop to remove any other unwanted columns from df.
import pandas as pd
df = pd.json_normalize(data, record_path=['results', 'docs'], meta=[['results', 'name'], 'numberOfResults'])
display(df)
id type category media label title subtitle results.name numberOfResults
0 RAKDI342342 Culture Culture unknown exampellabel testtitle and titletest Archive single 376
1 GUI6N5QHBPTO6GJ66VP5OXB7GKX6J7ER Culture Culture image more label als example test the second title picture single 376
Data
The posted JSON / dict is not correctly formed.
Assuming the following corrected form:
data = \
{'numberOfResults': 376,
 'results': [{'docs': [{'category': 'Culture',
                        'id': 'RAKDI342342',
                        'label': 'exampellabel',
                        'media': 'unknown',
                        'subtitle': 'Archive',
                        'title': 'testtitle and titletest',
                        'type': 'Culture'},
                       {'category': 'Culture',
                        'id': 'GUI6N5QHBPTO6GJ66VP5OXB7GKX6J7ER',
                        'label': 'more label als example',
                        'media': 'image',
                        'subtitle': 'picture',
                        'title': 'test the second title',
                        'type': 'Culture'}],
              'name': 'single'}]}
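If you only want the three columns from the question, one option is to normalize without meta and then select them (using the corrected data above):

import pandas as pd

df = pd.json_normalize(data, record_path=['results', 'docs'])

# keep only the keys asked for in the question
df_small = df[['id', 'label', 'title']]
print(df_small)

This yields one row per entry in "docs", with just the id, label and title columns, matching the desired output shown in the question.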
In my Scrapy project I want to extract data from a website. It turned out that all the information is stored in a script tag that I can easily read as JSON and then extract the data I need from it.
That's my function:
def parse(self, response):
    items = response.css("script:contains('window.__INITIAL_STATE__')::text").re_first(r"window\.__INITIAL_STATE__ =(.*);")
    for item in json.loads(items)['offers']:
        yield {
            "title": item['jobTitle'],
            "employer": item['employer'],
            "country": item['countryName'],
            "details_page": item['companyProfileUrl'],
            "expiration_date": item['expirationDate'],
            'salary': item['salary'],
            'employmentLevel': item['employmentLevel'],
        }
And the JSON has this structure:
var = {
    "offers":[
        {
            "commonOfferId":"1200072247",
            "jobTitle":"Automatyk - Programista",
            "employer":"MULTIPAK Spółka Akcyjna",
            "companyProfileUrl":"https://pracodawcy.pracuj.pl/company/20379037/profile",
            "expirationDate":"2021-04-28T12:47:06.273",
            "salary":"",
            "employmentLevel":"Specjalista (Mid / Regular)",
            "offers": [
                {
                    "offerId":500092126,
                    "regionName":"kujawsko-pomorskie",
                    "cities":["Małe Czyste (pow. chełmiński)"],
                    "label":"Małe Czyste (pow. chełmiński)"}],
Above is an example of one element. So when I try to extract data like cities or regionName, I receive an error. How can I loop through both dictionaries and yield that data to a new dictionary?
You didn't make it clear what you want, but I'm guessing this is close:
def parse(self, response):
    items = response.css("script:contains('window.__INITIAL_STATE__')::text").re_first(r"window\.__INITIAL_STATE__ =(.*);")
    for item in json.loads(items)['offers']:
        for offer in item['offers']:
            yield {
                "title": item['jobTitle'],
                "employer": item['employer'],
                "country": item['countryName'],
                "details_page": item['companyProfileUrl'],
                "expiration_date": item['expirationDate'],
                'salary': item['salary'],
                'employmentLevel': item['employmentLevel'],
                'offernumber': offer['offerId'],
                'region': offer['regionName'],
                'city': offer['cities'][0]
            }
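One hedged caveat: countryName doesn't appear in the posted snippet, so if some offers are missing a field (or have an empty cities list), dict.get() and joining the list avoid a KeyError or IndexError. A small standalone demo of those lookups, using hypothetical sample data shaped like the snippet above:

import json

# Hypothetical sample shaped like the question's JSON; note the missing
# countryName and the empty cities list.
raw = '{"offers": [{"jobTitle": "Automatyk - Programista", "offers": [{"offerId": 500092126, "regionName": "kujawsko-pomorskie", "cities": []}]}]}'

for item in json.loads(raw)['offers']:
    for offer in item['offers']:
        row = {
            'title': item.get('jobTitle'),
            'country': item.get('countryName'),          # None instead of KeyError
            'region': offer.get('regionName'),
            'city': ', '.join(offer.get('cities', [])),  # '' instead of IndexError
        }
        print(row)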
I am trying to use the Cost Explorer API with boto3 to get the cost of EC2 snapshots. These snapshots have custom tags associated with them. What I want to retrieve is the cost of snapshots that have a particular tag.
I have written the following script:
import boto3
client = boto3.client('ce')
response = client.get_cost_and_usage(
    TimePeriod={
        'Start': '2019-01-20',
        'End': '2019-01-24'
    },
    Metrics=['BLENDED_COST','USAGE_QUANTITY','UNBLENDED_COST'],
    Granularity='MONTHLY',
    Filter={
        'Dimensions': {
            'Key':'USAGE_TYPE_GROUP',
            'Values': ['EC2: EBS - Snapshots']
        }
    }
)
This gives me the cost. But this is the total cost for the snapshot usage, i.e. for all the volumes. Is there any way to filter based on tags on the snapshot?
I tried adding the following Filter:
Filter={
    'And': [
        {
            'Dimensions': {
                'Key':'USAGE_TYPE_GROUP',
                'Values': ['EC2: EBS - Snapshots']
            }
        },
        {
            'Tags':{
                'Key': 'test',
                'Values': ['aj']
            }
        }
    ]
}
There is 1 snapshot where I have added that tag. I checked the date range and the snapshot was created within that time range and is still available. I tried changing granularity to DAILY too.
But this always shows 0 cost.
To query snapshots (or other services) using tags, you need to activate them as cost allocation tags in the billing console.
Refer to this link to activate the tags you need to query:
https://console.aws.amazon.com/billing/home?region=us-east-1#/preferences/tags
NOTE: Only master accounts in an organization and single accounts that are not members of an organization have access to the Cost Allocation Tags.
I hope that helps!
'Tags' can be added to your filter as follows. Note that a Cost Explorer filter Expression accepts only one top-level operator, so the dimension and tag conditions are combined with 'And':
response = client.get_cost_and_usage(
    TimePeriod={
        'Start': '2019-01-10',
        'End': '2019-01-15'
    },
    Metrics=['BLENDED_COST','USAGE_QUANTITY','UNBLENDED_COST'],
    Granularity='MONTHLY',
    Filter={
        'And': [
            {
                'Dimensions': {
                    'Key': 'USAGE_TYPE',
                    'Values': ['APN1-EBS:SnapshotUsage']
                }
            },
            {
                'Tags': {
                    'Key': 'keyName',
                    'Values': [
                        'keyValue',
                    ]
                }
            }
        ]
    }
)
You can find the exact usage in the boto3 cost explorer API reference.
You could also group by tag keys like this:
Filter={
    'Dimensions': {
        'Key':'USAGE_TYPE',
        'Values': ['APN1-EBS:SnapshotUsage']
    }
},
GroupBy=[
    {
        'Type': 'DIMENSION'|'TAG',
        'Key': 'string'
    },
],
It won't filter out tags, but it will group the returned data by tag key. This will return ALL tag values matching the tag key, so it may be too broad, but you can use it to troubleshoot any additional problems.
I'd confirm that your tag values and keys all match up.
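For example, a concrete call that groups the snapshot cost from the question by the test tag key might look like this (dates, metrics, and the tag key are taken from the question; the group-key format shown in the comment is how Cost Explorer typically returns tag groups):

import boto3

client = boto3.client('ce')

response = client.get_cost_and_usage(
    TimePeriod={'Start': '2019-01-20', 'End': '2019-01-24'},
    Metrics=['UNBLENDED_COST'],
    Granularity='MONTHLY',
    Filter={
        'Dimensions': {
            'Key': 'USAGE_TYPE_GROUP',
            'Values': ['EC2: EBS - Snapshots']
        }
    },
    GroupBy=[
        {'Type': 'TAG', 'Key': 'test'}
    ]
)

# Group keys typically look like 'test$aj' (or 'test$' for untagged snapshots),
# so you can check whether the tag value you expect shows up at all.
for result in response['ResultsByTime']:
    for group in result['Groups']:
        print(group['Keys'], group['Metrics']['UnblendedCost']['Amount'])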