How to pull all data in the table using aws cli? - python

I have a table on dynamodb with about 50,000 records. I want to pull this whole table to my local computer using aws cli. But I know there is a 1MB limit.
What I've done so far
Get data in json format using aws dynamodb scan --table-name my_table_name> output.json
Then, Convert json output to dataframe using the python code below.
import json
import pandas as pd
with open("output.json", encoding='utf-8-sig', errors='ignore') as json_data:
data = json.load(json_data, strict=False)
df=pd.DataFrame(data["Items"])
After, necessary preprocessing steps...
Is there a way I can pull the whole table?

Related

How to integrate automated csv processing with Quicksight ingestion through aws Lambda?

I've been working on a project that uses a fairly simple data pipeline to clean and transform raw csv files into processed data using Python3.8 and Lambda to create various subsets which are sent to respective S3 buckets. The Lambda function is triggered by uploading a raw csv file to an intake S3 bucket, which initiates the process.
However, I would like to also send some of that processed data directly to Quicksight for ingestion from that same Lambda function for visual inspection as well, and that's where I'm currently stuck.
A portion of the function (leaving out the imports) I have with just the csv processing and uploading to S3, and this is the portion I like direclty ingested to Quicksight:
def featureengineering(event, context):
bucket_name = event['Records'][0]['s3']['bucket']['name']
s3_file_name = event['Records'][0]['s3']['object']['key']
read_file = s3_client.get_object(Bucket=bucket_name,Key=s3_file_name)
#turning the CSV into a dataframe in AWS Lambda
s3_data = io.BytesIO(read_file.get('Body').read())
df = pd.read_csv(s3_data, encoding="ISO-8859-1")
#replacing erroneous zero values to nan (missing) which is more accurate and a general table,
#and creating a new column with just three stages instead for simplification
df[['Column_A','Column_B']] = df[['Column_A','Column_B']].replace(0,np.nan)
#applying function for feature engineering of 'newstage' function
df['NewColumn'] = df.Stage.apply(newstage)
df1 = df
df1.to_csv(csv_buffer1)
s3_resource.Object(bucket1, csv_file_1).put(Body=csv_buffer1.getvalue()) #downloading df1 to S3
So at that point where the df1 is sent to its S3 bucket (which works fine), but I'd like it directly ingested into Quicksight as an automated spice refresh as well.
In digging around I did found a similar question with an answer
import boto3
import time
import sys
client = boto3.client('quicksight')
response = client.create_ingestion(DataSetId='<dataset-id>',IngestionId='<ingetion-id>',AwsAccountId='<aws-account-id>')
but the hang up I'm having is in the DataSetId or more generally, how do I turn the pandas DataFrame df1 in the Lambda Function into something the CreateIngestion API can accept and automatically send to QuickSight as an automated spice refresh of the most recent processed data?
You should first create a Quicksight Dataset, quoting from the docs:
A dataset identifies the specific data in a data source that you want to use. For example, the data source might be a table if you are connecting to a database data source. It might be a file if you are connecting to an Amazon S3 data source.
Once you have saved your DataFrame on S3 (either as a csv or parquet file), you can create a Quicksight Dataset that sources data from it.
You can do so either via Console or programmatically (probably what you’re looking for).
Finally, once you have the Dataset ID you can reference it in other Quicksight API calls.

Get JSON format for Google BigQuery Data using Pandas/Python

I am trying to query Google BigQuery using the Pandas/Python client interface. I am following the tutorial here: https://cloud.google.com/bigquery/docs/bigquery-storage-python-pandas. I was able to get it to work but I want to query the data as the JSON format that can be downloaded directly from the WebUI (see screenshot). Is there a way to download data as the JSON structure pictured instead of converting it to the data frame object?
I imagine the command would be somewhere around this part of the code from the tutorial:
dataframe = (
bqclient.query(query_string)
.result()
.to_dataframe(bqstorage_client=bqstorageclient)
)
Just add .to_json(orient='records') call after converting to dataframe:
json_data = bqclient.query(query_string).result().to_dataframe(bqstorage_client=bqstorageclient).to_json(orient='records')
pandas docs

transform data in azure data factory using python data bricks

I have the task to transform and consolidate millions of single JSON file into BIG CSV files.
The operation would be very simple using a copy activity and mapping the schemas, I have already tested, the problem is that a massive amount of files have bad JSON format.
I know what is the error and the fix is very simple too, I figured that I could use a Python Data brick activity to fix the string and then pass the output to a copy activity that could consolidate the records into a big CSV file.
I have something in mind like this, I'm not sure if this is the proper way to address this task. I don't know to use the output of the Copy Activy in the Data Brick activity
It sounds like you want to transform a large number of single JSON file using Azure Data Factory, but it does not support on Azure now as #KamilNowinski said. However, now that you were using Azure Databricks, to write a simple Python script to do the same thing is easier for you. So a workaound solution is to directly use Azure Storage SDK and pandas Python package to do that via few steps on Azure Databricks.
Maybe these JSON files are all in a container of Azure Blob Storage, so you need to list them in container via list_blob_names and generate their urls with sas token for pandas read_json function, the code as below.
from azure.storage.blob.baseblobservice import BaseBlobService
from azure.storage.blob import ContainerPermissions
from datetime import datetime, timedelta
account_name = '<your account name>'
account_key = '<your account key>'
container_name = '<your container name>'
service = BaseBlobService(account_name=account_name, account_key=account_key)
token = service.generate_container_shared_access_signature(container_name, permission=ContainerPermissions.READ, expiry=datetime.utcnow() + timedelta(hours=1),)
blob_names = service.list_blob_names(container_name)
blob_urls_with_token = (f"https://{account_name}.blob.core.windows.net/{container_name}/{blob_name}?{token}" for blob_name in blob_names)
#print(list(blob_urls_with_token))
Then, you can read these JSON file directly from blobs via read_json function to create their pandas Dataframe.
import pandas as pd
for blob_url_with_token in blob_urls_with_token:
df = pd.read_json(blob_url_with_token)
Even if you want to merge them to a big CSV file, you can first merge them to a big Dataframe via pandas functions listed in Combining / joining / merging like append.
To write a dataframe to a csv file, I think it's very easy by to_csv function. Or you can convert a pandas dataframe to a PySpark dataframe on Azure Databricks, as the code below.
from pyspark.sql import SQLContext
from pyspark import SparkContext
sc = SparkContext()
sqlContest = SQLContext(sc)
spark_df = sqlContest.createDataFrame(df)
So next, whatever you want to do, it's simple. And if you want to schedule the script as notebook in Azure Databricks, you can refer to the offical document Jobs to run Spark jobs.
Hope it helps.
Copy JSON file to storage (e.g. BLOB) and you can get access to the storage from Databricks. Then you can fix the file using Python and even transform to the required format having cluster run.
So, in Copy Data activity do the copy of the files to BLOB if you haven't them there yet.

Kinesis firehose to s3 JSON objects not delimited

I have a data pipeline that is something like this : Kinesis Firehose -> S3
When I use the glue crawler to create an Athena table over this data the table only reads some of the actual rows. The data in the file looks like this :
{row1}{row2}{row3}{row4}\n
{row5}{row6}{row7}
If I modify the data to have a newline after each row the Athena table will read the data properly. What I am wondering is how other people have solved this problem.
The solution I am considering is to write a python lambda function that takes care of the \n new line delimiter for the rows. Is there a better way to do this?

Bigquery (and pandas) - ensure data-insert consistency

In my python project, I need to fill a bigquery table with a relational dataframe. I'm having a lot of trouble at creating a new table from scratch and being sure that the first data I upload to it are actually put into the table.
I've read the page https://cloud.google.com/bigquery/streaming-data-into-bigquery#dataconsistency and have seen that applying a insertId to the insert query would solve the problem, but since I use pandas's dataframes, the function to_gbq of the pandas-gbq package seems to be perfect for this task. Yet, when using to_gbq function and a new table is created/replaced, sometimes (apparently randomly) the first data chunk is not written into the table.
Does anybody know how to ensure the complete insertion of a DataFrame into a bigquery new created table? Thanks
I believe you are encountering https://github.com/pydata/pandas-gbq/issues/75. Basically, Pandas using the BigQuery streaming API to write data into tables, but the streaming API has a delay after table creation to when it starts working.
Edit: Version 0.3.0 of pandas-gbq fixes this issue by using a load job to upload data frames to BigQuery instead of streaming.
In the meantime, I'd recommend using a "load job" to create the tables. For example, using the client.load_table_from_file method in the google-cloud-bigquery package.
from google.cloud.bigquery import LoadJobConfig
from six import StringIO
destination_table = client.dataset(dataset_id).table(table_id)
job_config = LoadJobConfig()
job_config.write_disposition = 'WRITE_APPEND'
job_config.source_format = 'NEWLINE_DELIMITED_JSON'
rows = []
for row in maybe_a_dataframe:
row_json = row.to_json(force_ascii=False, date_unit='s', date_format='iso')
rows.append(row_json)
body = StringIO('{}\n'.format('\n'.join(rows)))
client.load_table_from_file(
body,
destination_table,
job_config=job_config).result()
Edit: This code sample fails for columns containing non-ASCII characters. See https://github.com/pydata/pandas-gbq/pull/108

Categories