I am reading the schema of parquet and JSON files to generate a DDL for a Redshift table. I am getting some data types like timestamp[ns] and timestamp[s]. I tried looking it up on the internet but couldn't understand the difference.
Can you please make me understand with some examples?
timestamp[x] is a timestamp expressed in units of x.
In your case, s = seconds and ns = nanoseconds.
For example:
Timestamp[s] = 2020-03-14T15:32:52
Timestamp[ms] = 2020-03-14T15:32:52.192
Timestamp[us] = 2020-03-14T15:32:52.192548
Timestamp[ns] = 2020-03-14T15:32:52.192548165
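If it helps, here is a small illustrative sketch (using pandas, which is just an assumption about your toolchain; the instant matches the list above) of how the same moment loses precision at each unit:
import pandas as pd

# one instant, truncated to each timestamp unit
ts = pd.Timestamp("2020-03-14 15:32:52.192548165")   # nanosecond precision
print(ts.floor("s"))     # whole seconds, what a timestamp[s] column can hold
print(ts.floor("ms"))    # millisecond precision, like timestamp[ms]
print(ts.floor("us"))    # microsecond precision, like timestamp[us]
print(ts)                # full nanosecond precision, like timestamp[ns]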
I have a dataframe that can have some columns that are numerical values (some are whole, some are not), and I need to get the data into AWS DynamoDB. DynamoDB isn't happy about floats and wants Decimals only.
I am currently doing:
results = formatted_df.T.to_dict().values()
with self.approval_table.batch_writer() as batch:
    for result in results:
        formatted_item = json.loads(json.dumps(result), parse_float=Decimal)
        batch.put_item(Item=formatted_item)
But with or without the json.loads(json.dumps(...), ...) step before put_item, I'm still getting "Float types are not supported. Use Decimal types instead." I am not sure why, as I thought this round trip was supposed to get rid of the float types.
Any help figuring out what I'm missing here would be greatly appreciated.
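Not from the original thread, just a hedged guess worth checking: json.loads hands NaN and Infinity to parse_constant rather than parse_float, so NaN values come back as Python floats even with parse_float=Decimal. A self-contained sketch of a conversion helper (the function name is mine) that drops NaNs first:
import json
from decimal import Decimal

def to_dynamodb_item(record):
    # drop NaN values (NaN != NaN); json's parse_float never sees NaN/Infinity,
    # so they would otherwise survive the round trip as floats
    cleaned = {key: value for key, value in record.items() if value == value}
    return json.loads(json.dumps(cleaned), parse_float=Decimal)

print(to_dynamodb_item({"id": 1, "score": 0.75, "missing": float("nan")}))
# {'id': 1, 'score': Decimal('0.75')}  ('missing' was dropped)
The result could then be passed straight to batch.put_item(Item=...).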
I've got some timestamps in a database that are 9999-12-31, and I'm trying to convert the data to parquet. Somehow these timestamps all end up as 1816-03-29 05:56:08.066 in the parquet file.
Below is some code to reproduce the issue.
file_path = "tt.parquet"
schema = pa.schema([pa.field("tt", pa.timestamp("ms"))])
table = pa.Table.from_arrays([pa.array([datetime(9999, 12, 31),], pa.timestamp('ms'))], ["tt"])
writer = pq.ParquetWriter(file_path, schema)
writer.write_table(table)
writer.close()
I'm not ultimately reading the data with pandas, but I did try inspecting it with pandas and that fails with a pyarrow.lib.ArrowInvalid: Casting from timestamp[ms] to timestamp[ns] would result in out of bounds timestamp error.
I'm loading the parquet files into Snowflake and get back the incorrect timestamp. I've also tried inspecting with parquet-tools but that doesn't seem to work with timestamps.
Does parquet/pyarrow not support large timestamps? How can I store the correct timestamp?
It turns out that, for me, the cause was that I needed to set use_deprecated_int96_timestamps=False on the ParquetWriter.
The docs say it defaults to False, but I had set the flavor to 'spark', so I think that overrode it.
Thanks for the help.
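For reference, a minimal sketch of what that fix looks like with the reproduction code above (pq.ParquetWriter accepts both flavor and use_deprecated_int96_timestamps):
writer = pq.ParquetWriter(
    file_path,
    schema,
    flavor="spark",
    # passed explicitly so flavor="spark" does not switch the writer to int96 timestamps
    use_deprecated_int96_timestamps=False,
)
writer.write_table(table)
writer.close()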
Clearly the timestamp '9999-12-31' is being used not as a real timestamp, but as a flag for an invalid value.
If at the end of the pipeline Snowflake is seeing those as '1816-03-29 05:56:08.066', then you could just keep them as that - or re-cast them to whatever value you want them to have in Snowflake. At least it's consistent.
But if you insist that you want Python to handle the 9999 cases correctly, look at this question that solves it with use_deprecated_int96_timestamps=True:
handling large timestamps when converting from pyarrow.Table to pandas
More of a theoretical question as to the best way to set something up.
I have quite a large dataframe in pandas (roughly 330 columns) and I am hoping to transfer it into a table in SQL Server.
My current process has been to export the dataframe as a .csv and use the Import Flat File wizard to create the table the first time; after that I interact with it over a direct connection set up in Python. For smaller dataframes this has worked fine, since it has been easy enough to adjust the column data types until the import succeeds.
When doing it on the larger dataframes my problem is that I am frequently getting the following message:
TITLE: Microsoft SQL Server Management Studio
Error inserting data into table. (Microsoft.SqlServer.Import.Wizard)
The given value of type String from the data source cannot be converted to type nvarchar of the specified target column. (System.Data)
String or binary data would be truncated. (System.Data)
It doesn't tell me which specific column is causing the problem, so is there a more efficient way to get this data in than going through each column manually?
Any help would be appreciated! Thanks
As per your query, this is in fact an issue of a string value exceeding the size limit of the column it is being written to. Either increase the column size limit or truncate the values before inserting.
Let's say column A in df maps to a varchar(500) column. Try the following before the insert:
df.A = df.A.apply(lambda x: str(x)[:500])
Below is the SQLAlchemy alternative for the insertion.
To create a connection:
from sqlalchemy import create_engine
connect_str = "mssql+pyodbc://<username>:<password>@<dsnname>"
engine = create_engine(connect_str)
Create the table:
from sqlalchemy import Table, MetaData, Column, Integer
m = MetaData()
t = Table('example', m,
          Column('column_1', Integer),
          Column('column_2', Integer),
          ...)  # remaining column definitions
m.create_all(engine)
Once created, do the following:
df.to_sql('example', con=engine, if_exists='append')
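Since the wizard does not say which column overflows, here is a rough sketch for spotting candidates up front (the 500-character limit is just an assumed target size, adjust it per column):
# longest string in each object-typed column, compared against an assumed limit
limit = 500
max_lengths = df.select_dtypes(include="object").astype(str).apply(lambda col: col.str.len().max())
print(max_lengths[max_lengths > limit])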
I'm trying to append scrape data while the scrape is running.
Some of the columns contain an array of multiple strings which are to be saved and postprocessed into dummies after the scrape.
e.g. tags = array(['tag1', 'tag2'])
However writing to and reading from the database doesn't work for the arrays.
I've tried different storage methods (CSV, pickling, HDF) and all of them fail for different reasons
(mainly problems with appending to a central database and with lists being stored as strings).
I also tried different database backends (MySQL and Postgres), and I tried using dtype ARRAY, however that requires an array of fixed length known beforehand.
From what I gather, I can go the JSON route or the pickle route.
I chose the pickle route since I don't need the db to do anything with the contents of the array.
import numpy as np
import pandas as pd
from sqlalchemy import create_engine
from sqlalchemy.types import PickleType, String

# one placeholder row whose 'Tags' cell holds a whole numpy array
df = pd.DataFrame({'Name': ['example'], 'Tags': [np.array(['tag1', 'tag2'], dtype='<U8')]})
type_dict = {'Name': String, 'Tags': PickleType}
engine = create_engine('sqlite://', echo=False)
df.to_sql('test', con=engine, if_exists='append', index=False, dtype=type_dict)
df2 = pd.read_sql_table('test', con=engine)
expected output:
df2['Tags'].values
array(['tag1','tag2'], dtype='<U8')
actual output:
df2['Tags'].iloc[0]
b'\x80\x04\x95\xa4\x00\x00\x00\x00\x00\x00\x00\x8c\x15numpy.core.multiarray\x94\x8c\x0c_reconstruct\x94\x93\x94\x8c\x05numpy\x94\x8c\x07ndarray\x94\x93\x94K\x00\x85\x94C\x01b\x94\x87\x94R\x94(K\x01K\x01\x85\x94h\x03\x8c\x05dtype\x94\x93\x94\x8c\x02U8\x94K\x00K\x01\x87\x94R\x94(K\x03\x8c\x01<\x94NNNK K\x04K\x08t\x94b\x89C \xac \x00\x00\xac \x00\x00 \x00\x00\x00-\x00\x00\x00 \x00\x00\x00\xac \x00\x00\xac \x00\x00\xac \x00\x00\x94t\x94b.'
So something has gone wrong during pickling, and I cant figure out what.
Edit:
Okay, so np.loads(df2['Tags'].iloc[0]) gives the original array back. Is there a way to pass this to read_sql_table so that I immediately get the "original" dataframe back?
So the problem occurs during reading: the arrays are pickled on write, but they are not automatically unpickled on read. There is no way to pass dtype to read_sql_table, right?
def unpickle_the_pickled(df):
    df2 = df
    for col in df.columns:
        if type(df[col].iloc[0]) == bytes:
            # np.loads is a deprecated alias for pickle.loads; pickle.loads works too
            df2[col] = df[col].apply(np.loads)
    return df2
finally solved it, so happy!
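For anyone landing here later, a usage sketch of that helper with the names from the earlier snippet:
# read the pickled table back and restore the numpy arrays in one step
df2 = unpickle_the_pickled(pd.read_sql_table('test', con=engine))
print(df2['Tags'].iloc[0])   # array(['tag1', 'tag2'], dtype='<U8')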
I am attempting to insert a dataframe into my Postgres database using the psycopg2 driver with SQLAlchemy. The process is loading an Excel file into a pandas dataframe and then inserting the dataframe into the database via the predefined table schema.
I believe these are the relevant lines of code:
post_meta.reflect(schema="users")
df = pd.read_excel(path)
table = sql.Table(table_name, post_meta, schema="users")
dict_items = df.to_dict(orient='records')
connection.execute(table.insert().values(dict_items))
I'm getting the following error:
<class 'sqlalchemy.exc.ProgrammingError'>, ProgrammingError("(psycopg2.ProgrammingError) can't adapt type 'numpy.int64'",)
All data field types in the dataframe are int64.
I can't seem to find a similar question or information regarding why this error occurs and what it means.
Any direction would be great.
Thanks
Looks like you're trying to insert numpy integers, and psycopg2 doesn't know how to adapt those objects. You need to convert them to plain Python integers first, for example by calling int() on each value. Please provide more context with code if that fails.
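A sketch of that suggestion (the helper name is mine; the commented lines reuse the question's df, table and connection):
import numpy as np

def to_python_scalars(record):
    # replace numpy scalar types (e.g. numpy.int64) with plain Python scalars
    return {key: value.item() if isinstance(value, np.generic) else value
            for key, value in record.items()}

# with the question's objects this would become:
#   dict_items = [to_python_scalars(rec) for rec in df.to_dict(orient='records')]
#   connection.execute(table.insert().values(dict_items))
print(to_python_scalars({"id": np.int64(7), "name": "x"}))   # {'id': 7, 'name': 'x'}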
I also ran into this error, and then realized that I was trying to insert integer data into an SQLAlchemy Numeric column, which maps to float, not int. Changing the offending DataFrame column to float did the trick for me:
df[col] = df[col].astype(float)
Perhaps you are also trying to insert integer data into a non-integer column?