I'm having an issue converting time. Column [0] is a timestamp, and I want to insert a new column at [1]; for now it's called timestamp2. I'm trying to then use the for statement to convert column [0] to a readable time and add it to column [1]. Currently the new column gets inserted, but I get this error:
raise TypeError(f"cannot convert the series to {converter}")
TypeError: cannot convert the series to <class 'int'>
I added .astype(int) to the timestamp variable but that didn't help.
Code:
import requests
import json
import pandas as pd
from datetime import datetime
url = 'https://us.market-api.kaiko.io/v2/data/trades.v1/exchanges/cbse/spot/btc-usd/aggregations/count_ohlcv_vwap?interval=1h&page_size=1000'
KEY = 'xxx'
headers = {
    "X-Api-Key": KEY,
    "Accept": "application/json",
    "Accept-Encoding": "gzip"
}
res = requests.get(url, headers=headers)
j_data = res.json()
parse_data = j_data['data']
# create dataframe
df = pd.DataFrame.from_dict(pd.json_normalize(parse_data), orient='columns')
df.insert(1, 'timestamp2', ' ')
for index, row in df.iterrows():
    timestamp = df['timestamp'].astype(int)
    dt = datetime.fromtimestamp(timestamp)
    df.at[index, "timestamp2"] = dt
print(df)
df.to_csv('test.csv', index=False, encoding='utf-8')
Parsed data:
timestamp,timestamp2,open,high,low,close,volume,price,count
1611169200000,5,35260,35260.6,35202.43,35237.93,7.1160681299999995,35231.58133242965,132
1611165600000,5,34861.78,35260,34780.26,35260,1011.0965832999998,34968.5318431902,11313
1611162000000,5,34730.11,35039.98,34544.33,34855.43,1091.5246025199979,34794.45207484006,12877
In this example I set the df.at[index, "timestamp2"] value to 5 just to make sure it was inserted in each row; it was, so I just need to convert column [0] to a readable time for column [1].
If you convert the timestamp to an integer, it seems to be milliseconds since the epoch, based on the magnitude of the values.
Here are some more details on Unix time if you are interested: https://en.wikipedia.org/wiki/Unix_time
You can convert this to datetime using pd.to_datetime.
It is a vectorised operation, so you don't need to loop through the dataframe: both pd.to_numeric and pd.to_datetime can be applied to an entire series.
It's hard to debug without all your data, but the below should work. .astype(int) is an alternative to pd.to_numeric; the only difference is that pd.to_numeric gives you more flexibility in the treatment of errors, allowing you to coerce invalid values to NaN (not sure if this is wanted or not).
import pandas as pd
df = pd.DataFrame({'timestamp':['1611169200000']})
# convert to integer. With errors='coerce', invalid entries become NaN;
# it depends on your case how you want to treat these.
timestamp_num = pd.to_numeric(df['timestamp'], errors='coerce')
df['timestamp2'] = pd.to_datetime(timestamp_num, unit='ms')
print(df.to_dict())
#{'timestamp': {0: '1611169200000'}, 'timestamp2': {0: Timestamp('2021-01-20 19:00:00')}}
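Applied to the dataframe from your question, that would look something like this (a sketch, assuming the column really holds epoch milliseconds):
df['timestamp2'] = pd.to_datetime(pd.to_numeric(df['timestamp'], errors='coerce'), unit='ms')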
I am trying to read a compressed CSV file from S3 using pandas.
The dataset has 2 date columns, or at least they should be parsed as dates. I have read the pandas docs, and using parse_dates=[col1, col2] should work fine. Indeed it parsed one column as a date but not the second one, which is weird because they have the same formatting (YYYYmmdd.0) and both have NaN values, as shown below.
I read the file as follows:
date_columns = ['PRESENCE_UO_DT_FIN_PREVUE', 'PERSONNE_DT_MODIF']
df = s3manager_in.read_csv(object_key='Folder1/file1', sep=';', encoding = 'ISO-8859-1', compression = 'gzip', parse_dates=date_columns, engine='python')
Is there any explanation why one column gets parsed as a date and the second one does not?
Thanks
The column 'PRESENCE_UO_DT_FIN_PREVUE' seems to carry some "bad" values (values not formatted as 'YYYYmmdd.0'). That's probably the reason why pandas.read_csv can't parse this column as a date, even when it is passed in the parse_dates parameter.
Try this :
df = s3manager_in.read_csv(object_key='Folder1/file1', sep=';', encoding = 'ISO-8859-1', compression = 'gzip', engine='python')
date_columns = ['PRESENCE_UO_DT_FIN_PREVUE', 'PERSONNE_DT_MODIF']
df[date_columns] = df[date_columns].apply(pd.to_datetime, errors='coerce')
Note that the 'coerce' in pandas.to_datetime will put NaN instead of every bad value in the column 'PRESENCE_UO_DT_FIN_PREVUE'.
errors : {'ignore', 'raise', 'coerce'}, default 'raise'
    If 'coerce', then invalid parsing will be set as NaN.
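A minimal illustration of that behaviour (plain pandas, nothing specific to your file):
import pandas as pd
s = pd.Series(['2021-01-20', 'not-a-date'])
# the invalid entry becomes NaT instead of raising
print(pd.to_datetime(s, errors='coerce'))
# 0   2021-01-20
# 1          NaT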
Found out what was wrong with the column. In fact, as my colleague pointed out, the column has some values that exceed the maximum value pandas can parse as a date.
The pandas maximum datetime is Timestamp.max = Timestamp('2262-04-11 23:47:16.854775807'),
and it turned out that values in that column can be as large as: df.loc[df['PRESENCE_UO_DT_FIN_PREVUE'].idxmax()]['PRESENCE_UO_DT_FIN_PREVUE'] = 99991231.0
The problem is that read_csv with parse_dates does not generate any error or warning, so it is difficult to find out what's wrong.
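To see the overflow directly, bypassing read_csv (a minimal check with plain pd.to_datetime, using the value found above):
import pandas as pd
print(pd.Timestamp.max)  # 2262-04-11 23:47:16.854775807
# 9999-12-31 is beyond Timestamp.max, so parsing it raises:
try:
    pd.to_datetime('99991231', format='%Y%m%d')
except pd.errors.OutOfBoundsDatetime as err:
    print(err)
# with errors='coerce' the overflow silently becomes NaT instead:
print(pd.to_datetime('99991231', format='%Y%m%d', errors='coerce'))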
So to work around this problem, I manually convert the column:
def date_time_processing(var):
    # if var == 99991231.0 or var == 99991230.0 or var == 29991231.0 or var == 99991212.0 or var == 29220331.0 or var == 30000131.0 or var == 30001231.0:
    # check for missing values first ('var is np.nan' is unreliable for floats)
    if pd.isna(var):
        return pd.NaT
    elif var > 21000000.0:
        return (pd.Timestamp.max).strftime('%Y%m%d')
    else:
        return pd.to_datetime(var, format='%Y%m%d')
and then apply it with a lambda function (plain .apply(date_time_processing) would also work):
df['PRESENCE_UO_DT_FIN_PREVUE'] = df['PRESENCE_UO_DT_FIN_PREVUE'].apply(lambda x: date_time_processing(x))
I can't see the result...
My result is 0 and it should be 824
import pandas as pd
apple = r'C:\Users\User\Downloads\AAPL.xlsx'
data = pd.read_excel(apple)
dateindextime = data.set_index("timestamp")
rango = dateindextime.loc["2011-08-20":"2008-05-15"]
print(len(rango))
If I do
print(rango)
output:
Empty DataFrame
Columns: [open, high, low, close, adjusted_close, volume]
Index: []
Kinda hard to tell without the AAPL.xlsx dataset, but I'm guessing you will need to convert the "timestamp" column to a datetime object first using pd.to_datetime. From there you can slice on the datetime objects instead of slicing on strings, which is what you were doing. If you post the AAPL.xlsx dataset, I can dig deeper.
import pandas as pd
import datetime
apple = r'C:\Users\User\Downloads\AAPL.xlsx'
data = pd.read_excel(apple)
data["datetime_timestamp"] = pd.to_datetime(data["timestamp"], infer_datetime_format=True)
dateindextime = data.set_index("datetime_timestamp")
ti = datetime.date(2008,5,15)
tf = datetime.date(2011,8,20)
rango = dateindextime.loc[ti:tf]
print(len(rango))
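One caveat, assuming the spreadsheet rows might be in reverse chronological order: slicing with .loc[ti:tf] on a DatetimeIndex needs a sorted (monotonic) index, so it is safest to sort first:
dateindextime = dateindextime.sort_index()  # make the index monotonic before slicing
rango = dateindextime.loc[ti:tf]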
I have a dataframe:
id timestamp
1 "2025-08-02 19:08:59"
1 "2025-08-02 19:08:59"
1 "2025-08-02 19:09:59"
I need to turn the timestamp into an integer number to iterate over conditions, so it looks like this:
id timestamp
1 20250802190859
1 20250802190859
1 20250802190959
You can convert the strings using the pandas string methods:
df = pd.DataFrame({'id':[1,1,1],'timestamp':["2025-08-02 19:08:59",
"2025-08-02 19:08:59",
"2025-08-02 19:09:59"]})
pd.set_option('display.float_format', lambda x: '%.3f' % x)
df['timestamp'] = df['timestamp'].str.replace(r'[-\s:]', '', regex=True).astype('float64')
>>> df
id timestamp
0 1 20250802190859.000
1 1 20250802190859.000
2 1 20250802190959.000
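If you need actual integers rather than floats (the question shows integer output), int64 comfortably holds these 14-digit values; a variant of the same line, assuming the original string column:
df['timestamp'] = df['timestamp'].str.replace(r'[-\s:]', '', regex=True).astype('int64')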
Have you tried opening the file, skipping the first line (or better: validating that it contains the header fields as expected), and splitting each line at the first space/tab/whitespace? The second part, e.g. "2025-08-02 19:08:59", can be parsed using datetime.fromisoformat(). You can then turn the datetime object back into a string using datetime.strftime(format) with e.g. format = '%Y%m%d%H%M%S'. Note that there is no "milliseconds" format in strftime though; you could use %f for microseconds.
Note: if datetime.fromisoformat() fails to parse the dates, try datetime.strptime(date_string, format) with a different format, e.g. format = '%Y-%m-%d %H:%M:%S'.
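A minimal sketch of that approach (the timestamp string is just the example from the question):
from datetime import datetime
ts = "2025-08-02 19:08:59"
dt = datetime.fromisoformat(ts)     # parse the ISO-formatted timestamp
print(dt.strftime('%Y%m%d%H%M%S'))  # 20250802190859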
You can use the solutions provided in this post: How to turn timestamp into float number? and loop through the dataframe.
Let's say you have already imported pandas and have a dataframe df; see the additional code below:
import re
df1 = df.copy()
for x in range(len(df[0])):
    df1.loc[x, 0] = re.sub(r'\D', '', df[0][x])
This way you will not modify the original dataframe df and will get desired output in a new dataframe df1.
Full code that I tried (including creation of the first dataframe); this might help in removing any confusion:
import pandas as pd
import re
l = ["2025-08-02 19:08:59", "2025-08-02 19:08:59", "2025-08-02 19:09:59"]
df = pd.DataFrame(l)
df1 = df.copy()
for x in range(len(df[0])):
    df1.loc[x, 0] = re.sub(r'\D', '', df[0][x])
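As a side note, the pandas string methods can do the same substitution without an explicit loop; a sketch using the same l as above:
df1 = pd.DataFrame(l)
df1[0] = df1[0].str.replace(r'\D', '', regex=True)  # strip every non-digit in one vectorised pass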
I have this code where I wish to change the date format, but I only manage to change one line and not the whole dataset.
Code:
import pandas as pd
df = pd.read_csv ("data_q_3.csv")
result = df.groupby ("Country/Region").max().sort_values(by='Confirmed', ascending=False)[:10]
pd.set_option('display.max_column', None)
print ("Covid 19 top 10 countries based on confirmed case:")
print(result)
from datetime import datetime
datetime.fromisoformat("2020-03-18T12:13:09").strftime("%Y-%m-%d-%H:%M")
Does anyone know how to adapt the code so that the datetime changes in the whole dataset?
Thanks!
After looking at your problem for a while, I figured out how to change the values in the 'DateTime' column. The only problem that may arise is if the 'Country/Region' column has duplicate location names.
Editing the time is simple, as all you have to do is make use of Python's slicing. You can slice a string by typing
string = 'abcdefghijklmnopqrstuvwxyz'
print(string[0:5])
which will result in abcde (the end index is exclusive).
Below is the finished code.
import pandas as pd
# read unknown data
df = pd.read_csv("data_q_3.csv")
# List of unknown data
result = df.groupby("Country/Region").max().sort_values(by='Confirmed', ascending=False)[:10]
pd.set_option('display.max_column', None)
# you need a for loop to go through the whole column
for row in result.index:
    # get the current stored time
    time = result.at[row, 'DateTime']
    # reformat the time string by slicing out the date
    # (indices 0 to 10) and the hours and minutes
    # (indices 11 to 16) and putting a dash in the middle
    time = time[0:10] + "-" + time[11:16]
    # store the new time in the result
    result.at[row, 'DateTime'] = time
#print result
print ("Covid 19 top 10 countries based on confirmed case:")
print(result)
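If the 'DateTime' values parse cleanly with pd.to_datetime, a vectorised alternative to the loop is also possible; this is a sketch assuming ISO strings like '2020-03-18T12:13:09':
result['DateTime'] = pd.to_datetime(result['DateTime']).dt.strftime('%Y-%m-%d-%H:%M')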
Given a list of values or strings, how can I detect whether these are either dates, date and times, or neither?
I have used the pandas api to infer data types but it doesn't work well with dates. See example:
import pandas as pd
def get_redshift_dtype(values):
dtype = pd.api.types.infer_dtype(values)
return dtype
This is the result that I'm looking for. Any suggestions on better methods?
# Should return "date"
values_1 = ['2018-10-01', '2018-02-14', '2017-08-01']
# Should return "date"
values_2 = ['2018-10-01 00:00:00', '2018-02-14 00:00:00', '2017-08-01 00:00:00']
# Should return "datetime"
values_3 = ['2018-10-01 02:13:00', '2018-02-14 11:45:00', '2017-08-01 00:00:00']
# Should return "None"
values_4 = ['123098', '213408', '801231']
You can write a function to return values dependent on conditions you specify:
def return_date_type(s):
    s_dt = pd.to_datetime(s, errors='coerce')
    if s_dt.isnull().any():
        return 'None'
    elif s_dt.normalize().equals(s_dt):
        return 'date'
    return 'datetime'
return_date_type(values_1) # 'date'
return_date_type(values_2) # 'date'
return_date_type(values_3) # 'datetime'
return_date_type(values_4) # 'None'
You should be aware that Pandas datetime series always include time. Internally, they are stored as integers, and if a time is not specified it will be set to 00:00:00.
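For example, a date-only string still parses to midnight, which is what the normalize() comparison above relies on:
import pandas as pd
s_dt = pd.to_datetime(['2018-10-01'])
print(s_dt[0])                        # 2018-10-01 00:00:00
print(s_dt.normalize().equals(s_dt))  # True: every time is midnight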
Here's something that'll give you exactly what you asked for using re
import re
classify_dict = {
    'date': r'^\d{4}(-\d{2}){2}$',
    'date_again': r'^\d{4}(-\d{2}){2} 00:00:00$',
    'datetime': r'^\d{4}(-\d{2}){2} \d{2}(:\d{2}){2}$',
}
def classify(mylist):
    key = 'None'
    for k, v in classify_dict.items():
        if all(bool(re.match(v, e)) for e in mylist):
            key = k
            break
    if key == 'date_again':
        key = 'date'
    return key
classify(values_2)
>>> 'date'
The checking is done iteratively using regex, trying to match every item of the list; a key is only returned if all items match. This works for all of the example lists you've given.
For now, the regex does not check that numbers fall within valid ranges, e.g. 25:00:00 would still match, but that would be relatively straightforward to implement.
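For example, a sketch of a stricter datetime pattern that rejects hours above 23 and minutes/seconds above 59 (the ranges here are an assumption, not part of the original answer):
import re
datetime_strict = r'^\d{4}(-\d{2}){2} ([01]\d|2[0-3])(:[0-5]\d){2}$'
print(bool(re.match(datetime_strict, '2018-10-01 02:13:00')))  # True
print(bool(re.match(datetime_strict, '2018-10-01 25:00:00')))  # False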