Question about DateTime in Data Frame vs DateTime in Google Big Query - python

I'm trying to push data from a data frame to Google BigQuery.
I set the date field of my data frame as
df['time'] = df['time'].astype('datetime64[ns]')
and I set the BigQuery field type to *DATETIME*. When I do the export from Python to GBQ, I get this error:
InvalidSchema: Please verify that the structure and data types in the
DataFrame match the schema of the destination table.
If I convert everything to string format, it works. I don't think you can set a data frame field to a plain date, right? Is there a clever way to get this working, or do dates have to be passed as strings?
TIA.

I found that loading data into a DATE or DATETIME column was not working, so I tried the TIMESTAMP type instead and was then able to load the data into the BigQuery table.
When defining the schema, declare the date column as TIMESTAMP, as below:
bigquery.SchemaField('dateofbirth', 'TIMESTAMP')
Then convert the dataframe column from object to a datetime dtype that BigQuery can understand:
df.dateofbirth = df.dateofbirth.astype('datetime64[ns]')
As of 8 March 2019, the DATE and DATETIME column types were not working for me.
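Put together, a minimal sketch of that workaround might look like the following (the dateofbirth column and the mydataset.people table ID are just placeholders, and the google-cloud-bigquery client library is assumed):

from google.cloud import bigquery
import pandas as pd

client = bigquery.Client()

df = pd.DataFrame({'dateofbirth': ['1990-01-15', '1985-07-03']})
# Convert the object column to a pandas datetime dtype first.
df['dateofbirth'] = df['dateofbirth'].astype('datetime64[ns]')

# Declare the column as TIMESTAMP rather than DATE/DATETIME in the schema.
table_schema = [bigquery.SchemaField('dateofbirth', 'TIMESTAMP')]
job_config = bigquery.LoadJobConfig(schema=table_schema)

load_job = client.load_table_from_dataframe(df, 'mydataset.people',
                                            job_config=job_config)
load_job.result()  # wait for the load job to finish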

Changing the DATETIME type to TIMESTAMP in the BigQuery schema will give you a time value with UTC attached, which might not be ideal for many of us. Instead, try the code below:
from google.cloud import bigquery  # google-cloud-bigquery client library
bigquery_client = bigquery.Client()

# Load the dataframe with an explicit schema, keeping the DATETIME column as-is.
job_config = bigquery.LoadJobConfig(
    schema=table_schema, source_format=bigquery.SourceFormat.CSV
)
load_job = bigquery_client.load_table_from_dataframe(
    dataframe, table_id, job_config=job_config
)
load_job.result()  # wait for the load job to finish

Related

Convert datetime.date or string to timestamp in python

I am aware that this question has been asked before, but I still have a few doubts. I have a datetime.date (e.g. mydate = date(2014,5,1)) which I converted to a string and saved in the DB as a column (dtype: object) of a table. Now I want to change the storage of dates from text to timestamp in the DB. I tried this:
For example, my table is tab1, which I read as dataframe df in Python.
# datetime to timestamp
df['X'] = pd.to_datetime(mydate)
When I check the dtype with df.info(), the dtype of X is datetime64[ns], but when I save this to MySQL and read it back as a dataframe, the dtype changes to object. The column type in MySQL is DATETIME, but I need it to be TIMESTAMP. Is there any way to do this? Also, I need only the date part of Timestamp('2014-05-01 00:00:00'), excluding the time.
The problem is that when you read the serialized value from MySQL, the Python MySQL connector does not convert it. You have to convert it to a datetime value after reading the data from the cursor, by calling your conversion again on the retrieved data:
df['X'] = pd.to_datetime(df['col'])
As suggested, I changed the column type directly by using the dtype argument of the to_sql() function while inserting into the database. Now I can have TIMESTAMP, DATETIME and also DATE datatypes in MySQL.
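For illustration, here is a hedged sketch of that dtype approach, assuming a SQLAlchemy engine and the tab1 table from the question (the connection string is a placeholder):

import pandas as pd
from sqlalchemy import create_engine
from sqlalchemy.types import TIMESTAMP, DATE

# Placeholder connection string; adjust user, password, host and database name.
engine = create_engine('mysql+pymysql://user:password@localhost/mydb')

df['X'] = pd.to_datetime(df['X'])
# Declare the MySQL column type explicitly instead of letting it default to TEXT.
# Use DATE() here instead if only the date part (no time) is needed.
df.to_sql('tab1', engine, if_exists='replace', index=False,
          dtype={'X': TIMESTAMP()})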
The following works for me:
df['Date'] = pd.to_datetime(df['Date'], format='%Y-%m-%d')
df['timestamp'] = df['Date'].apply(lambda x: x.timestamp())

Right format of Timestamp for filtering pyspark dataframe for Cassandra

I'm storing the timestamp as YYYY-mm-dd HH:MM:SSZ in Cassandra and I am able to filter the data to get a certain range of time in cql shell, but when I try the same on a pyspark dataframe I don't get any values in the filtered dataframe.
Can anyone help me find the right datetime format in pyspark for this?
Thank you.
This format for timestamps works just fine. I think you have a problem with Spark SQL types, so you may need to perform an explicit cast of the timestamp string so that Spark can perform the comparison correctly.
For example, this Scala code works correctly (you may need to adjust it to Python):
import org.apache.spark.sql.cassandra._
val data = spark.read.cassandraFormat("sdtest", "test").load()
val filtered = data.filter("ts >= cast('2019-07-17 14:41:34.373Z' as timestamp) AND ts <= cast('2019-07-19 19:01:56Z' as timestamp)")
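A rough PySpark translation of that Scala snippet might look like this (an untested sketch; the test keyspace, sdtest table and ts column come from the answer above, and the spark-cassandra-connector package is assumed to be available):

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Read the Cassandra table through the spark-cassandra-connector data source.
data = (spark.read
        .format("org.apache.spark.sql.cassandra")
        .options(table="sdtest", keyspace="test")
        .load())

# Cast the timestamp strings explicitly so Spark compares them as timestamps.
filtered = data.filter(
    "ts >= cast('2019-07-17 14:41:34.373Z' as timestamp) "
    "AND ts <= cast('2019-07-19 19:01:56Z' as timestamp)"
)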

How can I export to sqlite (or another format) and retain the date datatype?

I have a script that loads a CSV into a pandas dataframe, cleanses the resulting table (e.g. removes invalid values, formats dates as dates, etc.) and saves the output to a local SQLite .db file.
I then have other scripts that open that database file and perform other operations on it.
My problem is that SQLite doesn't have a dedicated date datatype: https://www.sqlite.org/datatype3.html
This means that operations on dates fail, e.g.:
df_read['Months since mydate 2'] = ( pd.to_datetime('15-03-2019') - df_read['mydate'] )
returns
TypeError: unsupported operand type(s) for -: 'Timestamp' and 'str'
How can I export my dataframe in a way which keeps track of all the data types, including dates?
I have thought of the following:
Export to another format, but what format? A proper SQL Server would be great, but I don't have access to one in this case. I'd need a format which EXPLICITLY declares the data type of each column, so CSV is not an option.
Having a small function which reconverts the columns to dates after reading them from SQLite. But this would mean I'd have to manually keep track of which columns are dates - it would be cumbersome and slow on large datasets.
Having another table in the SQLite database which keeps track of which columns are dates, and what format they are in (e.g. %Y-%m-%d); this could help with the reconversion into dates, but it still feels very cumbersome, clunky and very un-pythonic.
Here is a quick example of what I mean:
import numpy as np
import pandas as pd
import sqlite3

num = int(10e3)
df = pd.DataFrame()
df['month'] = np.random.randint(1, 13, num)
df['year'] = np.random.randint(2000, 2005, num)
# Build a date from year and month (the month value is reused as the day,
# which is always a valid day number between 1 and 12).
df['mydate'] = pd.to_datetime(df['year'] * 10000 + df['month'] * 100 + df['month'], format='%Y%m%d')
df.iloc[20:30, 2] = np.nan

# This works: 'mydate' is still datetime64[ns] in memory.
df['Months since mydate'] = pd.to_datetime('15-03-2019') - df['mydate']

conn = sqlite3.connect("test_sqllite_dates.db")
df.to_sql('mydates', conn, if_exists='replace')
conn.close()

conn2 = sqlite3.connect("test_sqllite_dates.db")
df_read = pd.read_sql('select * from mydates', conn2)

# This doesn't work: 'mydate' comes back as text (object dtype).
df_read['Months since mydate 2'] = pd.to_datetime('15-03-2019') - df_read['mydate']
conn2.close()

print(df.dtypes)
print(df_read.dtypes)
As shown here (writing to SQLite) and here (reading back from SQLite), the solution is to create the column in SQLite with a declared datetime type, so that when reading back, Python will automatically convert it to the datetime type.
Note that, when connecting to the database, you need to pass the parameter detect_types=sqlite3.PARSE_DECLTYPES.
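A minimal sketch of that idea, reusing the test_sqllite_dates.db file and mydates table from the example above (pandas already declares datetime64 columns as TIMESTAMP when writing to SQLite, so only the reading side changes here); the parse_dates argument is shown as a fallback in case the declared-type conversion alone is not enough:

import sqlite3
import pandas as pd

# Reconnect with detect_types so columns declared as TIMESTAMP are run through
# sqlite3's built-in converter and come back as datetime objects.
conn2 = sqlite3.connect("test_sqllite_dates.db",
                        detect_types=sqlite3.PARSE_DECLTYPES)
df_read = pd.read_sql('select * from mydates', conn2)

# Fallback: ask pandas to do the conversion explicitly while reading.
df_read2 = pd.read_sql('select * from mydates', conn2, parse_dates=['mydate'])
conn2.close()

# The date arithmetic from the question now works on the converted column.
df_read2['Months since mydate 2'] = pd.to_datetime('15-03-2019') - df_read2['mydate']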

Apache Spark Query only on YEAR from "dd/mm/yyyy" format

I have more than 1 million records in an Excel file. I want to query the table using Python, but the date format is dd/mm/yyyy. I know that in MySQL the supported format is yyyy-mm-dd, and I am not allowed to change the date format. Is there any possibility of doing this at run time, i.e. querying just on the yyyy part of dd/mm/yyyy and fetching the records?
How do I query such a format on the year only, and not on the month or day, to get the data?
Assuming the "date" is being received as a string, then RIGHT(date, 4) will give you just the year.
(I see no need to reformat the string if you only need the data. Otherwise see STR_TO_DATE()
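Since the question mentions Spark and Python, here is a hedged PySpark sketch of the same idea on a small made-up dataframe: slice the last four characters (the RIGHT(date, 4) equivalent) or parse the string into a date first (the STR_TO_DATE() equivalent):

from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.getOrCreate()
df = spark.createDataFrame([("15/06/2014",), ("01/01/2015",)], ["date"])

# Equivalent of RIGHT(date, 4): take the trailing four characters as the year.
by_slice = df.filter(F.substring("date", -4, 4) == "2014")

# Equivalent of STR_TO_DATE(): parse the string, then filter on year().
by_parse = df.filter(F.year(F.to_date("date", "dd/MM/yyyy")) == 2014)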

Python Pandas: Overwriting an Index with a list of datetime objects

I have an input CSV with timestamps in the header like this (there are several thousand such timestamp columns):
header1;header2;header3;header4;header5;2013-12-30CET00:00:00;2013-12-30CET00:01:00;...;2014-00-01CET00:00:00
In Pandas 0.12 I was able to do the following to convert the string timestamps into datetime objects. The code strips out the 'CEST' in the timestamp string (translate()), reads it in as a datetime (strptime()) and then localizes it to the correct timezone (localize()). [The reason for this approach was that, with the versions I had at least, CEST wasn't being recognised as a timezone.]
DF = pd.read_csv('some_csv.csv',sep=';')
transtable = string.maketrans(string.uppercase,' '*len(string.uppercase))
tz = pytz.country_timezones('nl')[0]
timestamps = DF.columns[5:]
timestamps = map(lambda x:x.translate(transtable), timestamps)
timestamps = map(lambda x:datetime.datetime.strptime(x, '%Y-%m-%d %H:%M:%S'), timestamps)
timestamps = map(lambda x: pytz.timezone(tz).localize(x), timestamps)
DF.columns[5:] = timestamps
However, my downstream code requires that I run on pandas 0.16.
Running the above code on 0.16, I get this error at the last line of the snippet:
*** TypeError: Indexes does not support mutable operations
I'm looking for a way to overwrite my column index with the datetime objects. Using to_datetime() doesn't work for me, returning:
*** ValueError: Unknown string format
I have some subsequent code that copies, then drops, the first few columns of data in this dataframe (all the 'header1; header2; header3' columns), leaving just the timestamps. The purpose is to then transpose, and index by the timestamp.
So, my question:
Either:
how can I overwrite a series of column names with datetimes, such that I can pass in a pre-arranged set of timestamps that pandas will recognise as timestamps in subsequent code (in pandas v0.16)?
Or:
Any other suggestions that achieve the same effect.
I've explored set_index(), replace(), to_datetime() and reindex(), and possibly some others, but none seem able to achieve this overwrite. Hopefully this is simple to do and I'm just missing something.
TIA
I ended up solving this as follows.
The issue was that I had several thousand column headers with timestamps that I couldn't directly parse into datetime objects.
So, in order to get these timestamp objects incorporated, I added a new column called 'Time', put the datetime objects in there, and then set the index to the new column. (I'm omitting the code where I purged the rows of other header data, through drop() methods.)
DF = DF.transpose()
DF['Time'] = timestamps
DF = DF.set_index('Time')
Summary: if you have a CSV with a set of timestamps in your headers that you cannot parse directly, a way around this is to parse them separately, put the correct datetime objects into a new 'Time' column, and then set_index() on the new column.
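As a self-contained sketch of that summary in modern pandas (the file name some_csv.csv, the five leading header columns and the Dutch timezone all come from the question; the regex-based cleanup is an assumption standing in for the original translate() call):

import re
import pandas as pd

DF = pd.read_csv('some_csv.csv', sep=';')

# Parse the timestamp headers separately: strip the embedded timezone name,
# parse to datetime, then localize.
raw = DF.columns[5:]
cleaned = [re.sub(r'[A-Z]+', ' ', c) for c in raw]  # '2013-12-30CET00:00:00' -> '2013-12-30 00:00:00'
timestamps = (pd.to_datetime(cleaned, format='%Y-%m-%d %H:%M:%S')
                .tz_localize('Europe/Amsterdam'))

# Drop the non-timestamp columns, transpose, attach the parsed datetimes as a
# new 'Time' column and promote it to the index.
DF = DF.drop(columns=DF.columns[:5]).transpose()
DF['Time'] = timestamps
DF = DF.set_index('Time')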
