I am using a Python script to structure a CSV file fetched from a website and store it as a CSV in an Azure container. The CSV file is then loaded into an Azure database using an Azure Data Factory Copy Activity.
One of the amount columns in the CSV contains values in exponential notation. For example, it has the value 3.7E+08 (the actual value is 369719968). When the data is loaded into the database, the value is stored as 3.7E+08 in the staging table, where the column data type is varchar(30). I then convert it to float, and from float to decimal, in order to store it in the target table. After the float conversion the value comes out as 370000000 in the database, which causes a big mismatch in the amount.
Kindly suggest how we can resolve this issue. We tried converting the amount column to string in the CSV using Python, but the value still comes out in exponential notation.
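If the fetched file itself already contains only 3.7E+08, the missing digits cannot be recovered downstream, so the first thing to check is where in the pipeline the value becomes exponential. Assuming the Python script uses pandas and the column is named amount (a placeholder name), a minimal sketch that keeps the full digits looks like this:
import pandas as pd

# Keep the amount column as text end-to-end so pandas never parses it as a float
# and never re-renders it (e.g. in scientific notation) on the way back out.
df = pd.read_csv("source.csv", dtype={"amount": str})
df.to_csv("structured.csv", index=False)

# If the column has to be numeric at some point, write the full digits explicitly
# instead of relying on str()/astype(str):
# df["amount"] = df["amount"].map(lambda v: "" if pd.isna(v) else format(float(v), ".0f"))
With the full digits preserved in the CSV, the varchar staging value can then be converted straight to decimal without going through float.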
I am loading data from CSV files stored in S3 into a Snowflake table using a PySpark script. Earlier the table had 10 columns, and in the updated files I started receiving 12 columns. In Snowflake you can't add new columns in between existing ones (in MySQL you can alter a table and add columns in the middle, but in Snowflake new columns can only be added at the end).
I have now altered the table and added the two new columns at the end.
In the script I am doing upserts (insert + update). For new records inserted into Snowflake I use append mode; for updated records I use overwrite mode.
For inserts in append mode I have specified column mapping by name, and it works fine:
df.write.format("net.snowflake.spark.snowflake").option("column_mapping", "name").mode("append").save()
But when I added column mapping by name to the update logic (overwrite mode), it gives me an error saying column mapping only works with append mode:
df.write.format("net.snowflake.spark.snowflake").option("column_mapping", "name").mode("overwrite").save()
I have to specify column mapping because the order of columns in the Snowflake table is different from the order of columns in the CSV file.
Error:
the column mapping only works in append mode.
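One possible workaround, sketched below, is to reorder the DataFrame columns to match the Snowflake table before the overwrite, so that the default by-position mapping lines up and column_mapping is no longer needed. The sf_options dictionary, table name, and column names are placeholders for whatever the script actually uses:
# Hypothetical column order of the Snowflake table, with the two new columns at the end.
target_order = ["ID", "NAME", "AMOUNT", "NEW_COL_1", "NEW_COL_2"]

df_updates = df_updates.select(*target_order)  # align the DataFrame with the table's column order

(df_updates.write
    .format("net.snowflake.spark.snowflake")
    .options(**sf_options)                     # placeholder connection options
    .option("dbtable", "MY_TABLE")
    .mode("overwrite")
    .save())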
I'm trying to upload a CSV file to BigQuery via Google Cloud and am facing formatting issues. I have two date columns (date and cancel) to convert to the required BigQuery DATETIME format, and I'm using this code for the conversion:
df['date'] = pd.to_datetime(df['date'])
This works fine for the "date" column but doesn't work for the "cancel" column; the "cancel" column has some empty rows. Are empty rows an issue? Also, when I execute the code above, an additional column with random integer values is automatically added as the first column of the CSV. How do I get rid of these formatting issues?
I use the ELT approach: first load all the data into BigQuery, then transform it there. That is, load every column as a string so you don't get conversion errors, and do the transformations you need inside BigQuery.
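Whichever route you take, here is a minimal pandas sketch (file names are placeholders) that addresses both symptoms from the question: empty cancel cells are coerced to NaT instead of breaking the conversion, and writing without the index removes the extra integer column.
import pandas as pd

df = pd.read_csv("input.csv")

# Empty "cancel" cells become NaT (NULL in BigQuery) instead of failing the parse.
df["date"] = pd.to_datetime(df["date"])
df["cancel"] = pd.to_datetime(df["cancel"], errors="coerce")

# index=False stops pandas from adding the stray integer column to the CSV.
df.to_csv("for_bigquery.csv", index=False)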
I'm working in a limited Airflow environment in which I don't have access to google-cloud-bigquery but do have access to pandas-gbq. My goal is to load some JSON API data using some schema involving records into a BigQuery table. My strategy is to first read all the data into a pandas dataframe using a dictionary to represent the records: e.g.
uuid | metadata1 | ...
001 | {u'time_updated': u'', u'name': u'jeff'} | ...
Then I've been trying to use pandas_gbq.to_gbq to load this into BQ. The issue is that I get:
Error at Row: 0, Reason: invalid, Location: metadata1, Message: This field: metadata1 is not a record.
I realize this is because, according to the Google Cloud documentation, pandas-gbq "Converts the DataFrame to CSV format before sending to the API, which does not support nested or array values."
So I won't be able to upload a dataframe with records to BQ in this way, since, again, I can't use google-cloud-bigquery in my environment.
What would be the best strategy for me to upload my data to BQ (around 30k rows and 6 or so columns with 8ish nested fields each)?
I know this sounds like a very bad strategy, but I could upload a flattened version of all fields in a record as a single string to the BQ table and then run a query from my code to replace these flattened fields with their record-form versions. But this seems really bad, since for a time the table would contain the wrong schema.
Any thoughts would be much appreciated. Thanks in advance.
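If a flat schema is acceptable (which is a change from the record schema described above, so treat it only as one option), one workaround that stays inside pandas-gbq's CSV path is to expand each record into scalar columns with pd.json_normalize before loading; the dataset, table, and project names below are placeholders:
import pandas as pd
import pandas_gbq

# Example records in the shape described above; metadata1 is a nested dict.
records = [
    {"uuid": "001", "metadata1": {"time_updated": "", "name": "jeff"}},
]

# Flatten the nested dicts into scalar columns: metadata1_time_updated, metadata1_name, ...
df = pd.json_normalize(records, sep="_")

# pandas-gbq now only sees flat columns, so the CSV-based load no longer fails.
pandas_gbq.to_gbq(df, "my_dataset.my_table", project_id="my-project", if_exists="append")
A view or scheduled query in BigQuery could later reassemble these flat columns into STRUCTs if the record shape is still needed.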
I'm trying to create a Spark job that can read in thousands of JSON files, perform some actions, then write out to files (S3) again.
It is taking a long time and I keep running out of memory. I know that Spark tries to infer the schema if one isn't given. The obvious thing to do would be to supply the schema when reading. However, the schema changes from file to file depending on many factors that aren't important. There are about 100 'core' columns that are present in all files, and these are the only ones I want.
Is it possible to write a partial schema that only reads the specific fields I want into Spark using PySpark?
First, it is recommended to use JSON Lines (jsonl) files, where each line contains a single JSON input record. In general you can read just a specific set of fields from a big JSON document, but that should not be considered Spark's job. You should have an initial method that converts your JSON input into an object of a serializable datatype, and feed that object into your Spark pipeline.
Passing the schema is not an appropriate design here and only makes the problem worse. Instead, define a single method and extract the specific fields after loading the data from the files. You can use the following link to see how to extract specific fields and values from a JSON string in Python: How to extract specific fields and values from a JSON with python?
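A rough sketch of that approach in PySpark, assuming line-delimited JSON and using placeholder paths and field names: read the files as plain text so Spark never infers a schema over every field, then project each record down to the core columns yourself.
import json
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

CORE_FIELDS = ["id", "name", "amount"]  # placeholder subset of the ~100 core columns

def extract_core(line):
    """Parse one JSON line and keep only the core fields; missing fields become None."""
    record = json.loads(line)
    return tuple(record.get(field) for field in CORE_FIELDS)

# Reading as text avoids schema inference over every file; the projection happens in Python.
rows = spark.read.text("s3://bucket/input/*.json").rdd.map(lambda r: extract_core(r.value))
df = rows.toDF(CORE_FIELDS)
df.write.parquet("s3://bucket/output/")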
I am trying to push my dataframe to Azure Table Storage using Python, but when I insert the values they get jumbled up, and some of the records are not inserted into Azure at all. I don't know whether it is because of a timing issue. Please find the code below.
for i in range(0, forecast.shape[0]):
    partition_key = ticker + str(i)
    stock_date = str(forecast.iloc[i]['ds'])
    row_key = partition_key
    stock_price = str(forecast.iloc[i]['yhat'])
    companyname = str(forecast.iloc[i]['Company_Name'])
    task = {'PartitionKey': partition_key, 'RowKey': row_key, 'StockPrice': stock_price,
            'CompanyName': companyname, 'Stock_date': stock_date}
    v = table_service_actual.insert_entity("StockPricePrediction", task)
But when I look at the table in Power BI, the records appear jumbled compared to my actual dataframe. Please help me resolve this issue. I have also tried batch insertion.
The reason is ordering. Azure Table Storage uses a sorted index: entities are stored and returned sorted by PartitionKey (and RowKey) as strings, so a key like ticker10 sorts before ticker2 and the rows come back in a different order than they were inserted. Consider using a partition key that sorts in the order you want.
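To make that concrete, here is a small variation of the original loop (assuming the dataframe's row order is the order you want to see): zero-padding the index produces keys whose lexicographic order matches the row order.
for i in range(forecast.shape[0]):
    # "ticker000000", "ticker000001", ... sorts in the same order as the dataframe,
    # unlike "ticker0", "ticker1", ..., "ticker10", which sorts as 0, 1, 10, 11, 2, ...
    partition_key = ticker + format(i, "06d")
    row_key = partition_key
    task = {
        'PartitionKey': partition_key,
        'RowKey': row_key,
        'StockPrice': str(forecast.iloc[i]['yhat']),
        'CompanyName': str(forecast.iloc[i]['Company_Name']),
        'Stock_date': str(forecast.iloc[i]['ds']),
    }
    table_service_actual.insert_entity("StockPricePrediction", task)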