Issue while inserting dataframe values into Azure Table Storage in Python

I am trying to push my dataframe to Azure Table Storage using Python. But when I insert the values, they end up jumbled, and some of the records are not inserted into Azure at all. I don't know whether it is a timing issue. Please find the code below.
for i in range(0, forecast.shape[0]):
    partition_key = ticker + str(i)
    stock_date = str(forecast.iloc[i]['ds'])
    row_key = partition_key
    stock_price = str(forecast.iloc[i]['yhat'])
    companyname = str(forecast.iloc[i]['Company_Name'])
    task = {'PartitionKey': partition_key, 'RowKey': row_key, 'StockPrice': stock_price,
            'CompanyName': companyname, 'Stock_date': stock_date}
    v = table_service_actual.insert_entity("StockPricePrediction", task)
But this is what I see in Power BI when I access the table storage:
But my actual dataframe looks like this:
Please help me in resolving the issue. I have also tried batch insertion.

The reason is ordering. Azure Table Storage keeps entities sorted by their keys (as strings), so the partition key needs to sort in the order you expect. Consider using the sorted index as the partition key.
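Because the keys are compared as strings, ticker + str(i) makes 'ABC10' sort before 'ABC2', which is why the records look jumbled. A minimal sketch of the same loop with zero-padded keys, reusing the forecast, ticker and table_service_actual objects from the question:

for i in range(0, forecast.shape[0]):
    # Zero-pad the index so the keys sort numerically even as strings,
    # e.g. 'ABC00000002' < 'ABC00000010'.
    partition_key = ticker + str(i).zfill(8)
    task = {
        'PartitionKey': partition_key,
        'RowKey': partition_key,
        'StockPrice': str(forecast.iloc[i]['yhat']),
        'CompanyName': str(forecast.iloc[i]['Company_Name']),
        'Stock_date': str(forecast.iloc[i]['ds']),
    }
    table_service_actual.insert_entity("StockPricePrediction", task)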

Related

How to create a managed Hive Table using pyspark

I am facing the problem that every table I create using pyspark has type EXTERNAL_TABLE in Hive, but I want to create managed tables and don't know what I am doing wrong. I have tried different ways to create those tables. For instance:
spark.sql('CREATE TABLE dev.managed_test(key int, value string) STORED AS PARQUET')
spark.read.csv('xyz.csv').write.saveAsTable('dev.managed_test2')
In both cases the resulting table is an EXTERNAL_TABLE. When I describe the table in Apache Hue or Beeline, I also find that the property TRANSLATED_TO_EXTERNAL is true.
Does anyone have an idea what could be wrong, or what I could do instead of the two options shown above? Maybe I am missing some configuration parameter?
Thank you!

How to compare hashes of table columns across SQL Server and Postgres?

I have a table in SQL Server 2017 which has many rows, and that table was migrated to Postgres 10.5 along with the data (my colleagues did it using the Talend tool).
I want to verify that the data is correct after migration by comparing the values in a column in SQL Server vs Postgres.
I could try reading the columns into NumPy series from SQL Server and Postgres and comparing both.
But neither of the DBs is on my local machine. They're hosted on a server that I have to reach over the network, which means the data retrieval would take a long time.
Instead, I want to do something like this:
Perform a sha256 or md5 hash on the column values, ordered by primary key, and compare the hash values from both databases, so that I don't need to retrieve the results from the databases to my local machine for comparison.
The function, or whatever it is, should return the same hash value if the column has exactly the same values.
I'm not even sure if this is possible, or whether there is a better way to do it.
Can someone please point me in the right direction?
If an FDW isn't going to work out for you, maybe the hash comparison is a good idea. MD5 is probably a good idea, only because you ought to get consistent results from different software.
Obviously, you'll need the columns to be in the same order in the two databases for the hash comparison to work. If the layouts are different, you can create a view in Postgres to match the column order in SQL Server.
Once you've got tables/views to compare, there's a shortcut to the hashing on the Postgres side. Imagine a table named facility:
SELECT MD5(facility::text) FROM facility;
If that's not obvious, here's what's going on there. Postgres has the ability to cast any compound type to text. Like:
select your_table_here::text from your_table_here
The result is like this example:
(2be4026d-be29-aa4a-a536-de1d7124d92d,2200d1da-73e7-419c-9e4c-efe020834e6f,"Powder Blue",Central,f)
Notice the (parens) around the result. You'll need to take that into account when generating the hash on the SQL Server side. This pithy piece of code strips both parens:
SELECT MD5(substring(facility::text, 2, length(facility::text) - 2)) FROM facility;
Alternatively, you can concatenate columns as strings manually, and hash that. Chances are, you'll need to do that, or use a view, if you've got ID or timestamp fields that automatically changed during the import.
The :: casting operator can also cast a row to another type, if you've got a conversion in place. And where I've listed a table above, you can use a view just as well.
On the SQL Server side, I have no clue. HASHBYTES?
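If you do pull per-row hashes rather than raw values, the comparison itself stays small on the Python side. A rough sketch, assuming psycopg2 and pyodbc are available and using placeholder connection strings, table and column names (my_table, id, some_col); the SQL Server expression only matches the Postgres one when the column's text rendering is identical on both sides (plain ASCII, same formatting):

import psycopg2
import pyodbc

pg = psycopg2.connect("host=pg-host dbname=mydb user=me password=secret")   # placeholder DSN
ms = pyodbc.connect("DSN=sqlserver-dsn;UID=me;PWD=secret")                  # placeholder DSN

with pg.cursor() as cur:
    # One MD5 per row, ordered by primary key.
    cur.execute("SELECT MD5(some_col::text) FROM my_table ORDER BY id")
    pg_hashes = [row[0] for row in cur.fetchall()]

ms_cur = ms.cursor()
# HASHBYTES returns varbinary; CONVERT(..., 2) renders it as hex without the 0x prefix.
ms_cur.execute(
    "SELECT LOWER(CONVERT(char(32), HASHBYTES('MD5', CAST(some_col AS varchar(4000))), 2)) "
    "FROM my_table ORDER BY id"
)
ms_hashes = [row[0] for row in ms_cur.fetchall()]

print("columns match" if pg_hashes == ms_hashes else "columns differ")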

BigQuery data not getting inserted

I'm using the Python client library to insert data into a BigQuery table. The code is as follows.
from google.cloud import bigquery

client = bigquery.Client(project_id)
errors = client.insert_rows_json(table=tablename, json_rows=data_to_insert)
assert errors == []
There are no errors, but the data is also not getting inserted.
Sample JSON rows:
[{'a': 'b', 'c': 'd'}, {'a': 'f', 'q': 'r'}, ...]
What's the problem? There is no exception either.
The client.insert_rows_json method uses streaming inserts. Data inserted into BigQuery with streaming inserts is subject to some latency before it shows up in the table preview in the BigQuery console; it does not appear immediately. So you need to query the table to confirm the data was inserted.
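For example, you can count the rows with the same client, reusing project_id and tablename from the question (this assumes tablename is a fully qualified table ID):

from google.cloud import bigquery

client = bigquery.Client(project_id)
rows = client.query(f"SELECT COUNT(*) AS n FROM `{tablename}`").result()  # runs the query and waits
print(next(iter(rows)).n)                                                 # rows actually in the table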
There are two possible situations:
your data does not match the schema
your table is freshly created, and the update is just not yet available
References:
Related GitHub issue
Data availability
I got the answer to my question. The problem was that I was inserting data for one extra column that does not exist in the table. I found a workaround to figure out why data is not being inserted into the BigQuery table:
Convert the data to newline-delimited JSON, with the keys as the column names and the values as the values you want for those columns (see the sketch at the end of this answer).
Run bq --location=US load --source_format=NEWLINE_DELIMITED_JSON dataset.tablename newline_delimited_json_file.json in your terminal and see if it throws any errors. If it does, something is likely wrong with your data or the table schema.
Fix the data or the table schema according to the error and retry the insert via Python.
It would be more helpful if the Python API raised an error or exception the way the terminal command does.
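For the conversion step, a minimal sketch that dumps the rows to a newline-delimited JSON file, reusing data_to_insert from the question (the file name is arbitrary):

import json

# Each row becomes one JSON object per line, keyed by column name.
with open("newline_delimited_json_file.json", "w") as f:
    for row in data_to_insert:
        f.write(json.dumps(row) + "\n")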

How do I delete items from a DynamoDB table wherever an attribute is missing, regardless of key?

Is it possible to delete items from a DynamoDB table without specifying partition or sort keys? I have numerous entries in a table with different partition and sort keys and I want to delete all the items where a certain attribute does not exist.
AWS CLI or boto3/python solutions are welcome.
To delete a large number of items from the table, you need to query or scan first and then delete the items using the BatchWriteItem or DeleteItem operation.
Query plus BatchWriteItem is better in terms of performance and cost, so if this is a job that happens frequently, it's better to add a global secondary index on the attribute you need to check for deletion. However, you need to drive BatchWriteItem iteratively for a large number of items, since Query returns paginated results.
Otherwise you can do a Scan and call DeleteItem iteratively (see the sketch below).
Check this Stack Overflow question for more insight.
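A minimal sketch of the scan-and-delete approach with boto3; the table name, key names and the attribute to check are placeholders:

import boto3
from boto3.dynamodb.conditions import Attr

table = boto3.resource("dynamodb").Table("my-table")      # placeholder table name

scan_kwargs = {
    # Only return items where the attribute is missing.
    "FilterExpression": Attr("my_attribute").not_exists(),
    # Only the key attributes are needed for the delete calls.
    "ProjectionExpression": "pk, sk",                      # placeholder key names
}

with table.batch_writer() as batch:                        # batches the DeleteItem calls
    while True:
        page = table.scan(**scan_kwargs)
        for item in page["Items"]:
            batch.delete_item(Key={"pk": item["pk"], "sk": item["sk"]})
        if "LastEvaluatedKey" not in page:
            break
        scan_kwargs["ExclusiveStartKey"] = page["LastEvaluatedKey"]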
It is worth trying the EMR Hive integration with DynamoDB. It lets you write SQL queries against a DynamoDB table. Hive supports the DELETE statement, and Amazon has implemented a DynamoDB connector. I am not sure how smoothly this would integrate, but it is worth a try. Here is how to work with DynamoDB using EMR Hive.
Another option is to use parallel scan. Just get all items from DynamoDB that match a filter expression, and delete each one of them. Here is how to do scans using boto client.
To speed up the process you can batch delete items using the BatchWriteItem method. Here is how to do this in boto.
Notice that BatchWriteItem has the following limitation:
BatchWriteItem can write up to 16 MB of data, which can comprise as many as 25 put or delete requests.
Keep in mind that scans are expensive: while scanning, you consume RCUs for all the items DynamoDB reads in your table, not just for the items it returns. So you either need to read the data slowly or provision a very high RCU for the table.
It's fine to do this operation infrequently, but you can't do it as part of a web-server request if you have a table of any decent size.

AWS DynamoDB retrieve entire table

Folks,
I am retrieving all items from a DynamoDB table and would like to replace the scan operation with a query.
Currently I am pulling in all the table's data via the following (Python):
from boto.dynamodb2.table import Table  # legacy boto 2 DynamoDB API

drivertable = Table(url['dbname'])
all_drivers = []
all_drivers_query = drivertable.scan()
for x in all_drivers_query:
    all_drivers.append(x['number'])
How would I change this to use the Query API?
Thanks!
There is no way to query and get the entire contents of the table. As of right now, you have a few options if you want to get all of your data out of a DynamoDB table, and all of them involve actually reading the data out of DynamoDB:
Scan the table. This can be done faster, at the expense of much more read capacity, by using a parallel scan (see the sketch after this list).
Export your data using AWS Data Pipeline. You can configure the export job for where and how it should store your data.
Use one of the AWS event platforms to capture new data and denormalize it. For all new data you can get a time-ordered stream of every update to the table from DynamoDB Streams, or process the events with AWS Lambda.
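A minimal sketch of the first option with boto3: a paginated scan that can optionally be split into segments for a parallel scan (the table name is a placeholder, and 'number' comes from the question):

import boto3

table = boto3.resource("dynamodb").Table("drivers")        # placeholder table name

def scan_all(segment=None, total_segments=None):
    # Return every item in the table, following the pagination markers.
    kwargs = {}
    if segment is not None:
        # Pass Segment/TotalSegments to run several of these calls in parallel,
        # e.g. one per thread or process.
        kwargs.update(Segment=segment, TotalSegments=total_segments)
    items = []
    while True:
        page = table.scan(**kwargs)
        items.extend(page["Items"])
        if "LastEvaluatedKey" not in page:
            break
        kwargs["ExclusiveStartKey"] = page["LastEvaluatedKey"]
    return items

all_drivers = [item["number"] for item in scan_all()]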
You can't query an entire table. Query is used to retrieve a set of items by supplying a hash key (the hash part of the table's composite hash-range primary key).
You cannot use Query without knowing the hash keys.
EDIT: a bounty was added to this old question, asking:
How do I get a list of hashes from DynamoDB?
Well, as of Dec 2014 you still can't get all the hash keys of a table with a single API call.
Even if you add a GSI, you still can't get a DISTINCT hash key count.
The way I would solve this is with denormalization: keep another table with no range key and put every hash key there alongside the main table. This adds housekeeping overhead at the application level (mainly when removing items), but it solves the problem you asked about.
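A rough sketch of that denormalization with boto3 (table, key and attribute names are placeholders; the second table uses the hash key as its only key, so repeated writes simply overwrite the same entry):

import boto3

dynamodb = boto3.resource("dynamodb")
main_table = dynamodb.Table("drivers")         # placeholder: hash + range key
keys_table = dynamodb.Table("driver-keys")     # placeholder: hash key only

def put_driver(driver_id, trip_id, attrs):
    # Write the real item to the main table...
    main_table.put_item(Item={"driver_id": driver_id, "trip_id": trip_id, **attrs})
    # ...and record its hash key in the keys-only table.
    keys_table.put_item(Item={"driver_id": driver_id})

def list_driver_ids():
    # Listing every distinct hash key is now a scan of the much smaller keys table.
    ids, kwargs = [], {}
    while True:
        page = keys_table.scan(**kwargs)
        ids.extend(item["driver_id"] for item in page["Items"])
        if "LastEvaluatedKey" not in page:
            break
        kwargs["ExclusiveStartKey"] = page["LastEvaluatedKey"]
    return ids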
