I am facing an issue while converting one of my datetime columns in pandas dataframe to int. My code is:
df['datetime_column'].astype(np.int64)
The error which I am getting is:
invalid literal for int() with base 10: '2018-02-25 09:31:15'
I am quite clueless about what is happening, as the conversion works fine for some of my other datetime columns. Is there some issue with the range of dates that can be converted to int?
You would use
df['datetime_column'].apply(lambda x: x.toordinal())
If it fails, the cause could be that your column is an object and not datetime. So you need:
df['datetime_column'] = pd.to_datetime(df['datetime_column'])
before sending it to ordinal.
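A minimal sketch of the full fix for the question above, assuming the failing column holds strings like '2018-02-25 09:31:15' (which is what the error message suggests):
import numpy as np
import pandas as pd

# Parse the strings first; the int cast only works on a true datetime column
df['datetime_column'] = pd.to_datetime(df['datetime_column'])

# Nanoseconds since the Unix epoch, as in the original attempt
ns = df['datetime_column'].astype(np.int64)

# Or the ordinal conversion suggested above
ordinals = df['datetime_column'].apply(lambda x: x.toordinal())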
If you are working on feature engineering, you can try creating features such as the number of days between date1 and date2, booleans for whether the month falls in winter, summer, autumn, or spring, and, if you have a time component, booleans for morning, noontime, or night, all depending on your machine learning problem.
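A hedged sketch of those features; the column names are hypothetical and the season/time-of-day buckets are one possible choice:
import pandas as pd

# Days elapsed between two datetime columns
df['days_between'] = (df['date2'] - df['date1']).dt.days

# Season booleans from the month number
month = df['datetime_column'].dt.month
df['is_winter'] = month.isin([12, 1, 2])
df['is_spring'] = month.isin([3, 4, 5])
df['is_summer'] = month.isin([6, 7, 8])
df['is_autumn'] = month.isin([9, 10, 11])

# Time-of-day booleans from the hour
hour = df['datetime_column'].dt.hour
df['is_morning'] = hour.between(5, 11)
df['is_noontime'] = hour.between(12, 17)
df['is_night'] = ~(df['is_morning'] | df['is_noontime'])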
It seems you solved the problem yourself, judging from your comment. My guess is that you created the data frame without specifying that the column should be read as anything other than a string, so it's a string. If I'm right and you check the column type, it should show as object; if you check an individual entry in the column, it should show as a string (see the quick checks below).
If the issue is something else, please follow up.
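For reference, a quick sketch of those checks ('datetime_column' stands in for the asker's column name):
print(df['datetime_column'].dtype)            # dtype('O') means object
print(type(df['datetime_column'].iloc[0]))    # <class 'str'> confirms a string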
Related
So I'm using Python and PostgreSQL, and I select two things from columns that both contain dates formatted yyyy-mm-dd, with the result stored in the array results. However, one of the columns is varchar and the other one is date, so results would look like [2022-01-10, 2022-01-10]. The type of idx0 is a string, and the type of idx1 is a date object? I have a method that allows me to add any number of days to a certain date, and it works for the varchar one, but it doesn't work for the one that is a date object instead of a string. How can I get around this? I'm trying to do if type(results[1]) == date object, or if results[1] isinstance(), but I'm having issues checking whether the type is a certain type, pretty much. Can someone show me how to do this? Apologies if this is unclear; I'm happy to explain further.
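A minimal sketch of that check, assuming results holds a 'YYYY-MM-DD' string at index 0 and a datetime.date at index 1 as described; add_days is a hypothetical stand-in for the asker's method:
from datetime import date, timedelta

def add_days(value, days):
    # Date objects support timedelta arithmetic directly
    if isinstance(value, date):
        return value + timedelta(days=days)
    # Otherwise assume a 'YYYY-MM-DD' string, as in the varchar column
    return date.fromisoformat(value) + timedelta(days=days)

print(add_days('2022-01-10', 5))         # varchar-style input
print(add_days(date(2022, 1, 10), 5))    # date-object input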
Hi, I have looked on Stack Overflow but have not found a solution to my problem. Any help is highly appreciated.
After importing a CSV I noticed that all the columns have type object and not float.
My goal is to convert all the columns except the YEAR column to float. I have read that you first have to strip the columns to remove blanks, then convert NaNs to 0, and then convert the strings to floats. But with the code below I'm getting an error.
My code in my Jupyter notebook is:
And I get the following error.
How do I have to change the code?
All the columns but the YEAR column have to be set to float.
If you can also help me set the YEAR column to datetime, that would be very nice. But my main problem is getting the data right so I can start making calculations.
Thanks
Runy
Easiest would be
df = df.astype(float)
df['YEAR'] = df['YEAR'].astype(int)
Also, your code fails because you have two columns with the same name BBPWN, so when you do df['BBPWN'], you will get a DataFrame with those two columns. Then df['BBPWN'].str will fail.
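A sketch under those assumptions ('data.csv' is a hypothetical filename, and the duplicate-name suffixes are one possible convention):
import pandas as pd

df = pd.read_csv('data.csv')

# Make duplicate column names unique so df['BBPWN'] returns a Series again
cols = pd.Series(df.columns)
for dup in cols[cols.duplicated()].unique():
    idx = cols[cols == dup].index
    cols[idx] = [f'{dup}_{i}' for i in range(len(idx))]
df.columns = cols

# Strip blanks, coerce bad values to NaN, fill with 0, end up with floats
value_cols = df.columns.difference(['YEAR'])
df[value_cols] = df[value_cols].apply(
    lambda s: pd.to_numeric(s.astype(str).str.strip(), errors='coerce')).fillna(0)

# And the YEAR column as datetime
df['YEAR'] = pd.to_datetime(df['YEAR'].astype(str).str.strip(), format='%Y')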
I'm trying to read an Excel file into Python and then split the data into numbers (integers and floats) and everything else. There are numerous columns of different types.
I currently bring in the data with
pd.read_excel
and then split the data up with
DataFrame.select_dtypes("number")
When users upload a time (e.g. 12:30:00), they expect it to be recognized as a time. However, Python currently treats it as dtype object.
If I specify the column with parse_dates then it works; however, since I don't know what the data is in advance, I ideally want this to be done automatically. I've tried setting parse_dates=True, but it doesn't seem to make a difference.
I'm not sure if there is a way to recognize the datetime after the file is uploaded. Again, however, I would want this to be done without having to specify the column (so anything that can be converted is).
Many Thanks
If your data contains only one column with dtype object (I assume it is a string) you can do the following:
1) filter the column with dtype object
import pandas as pd
datetime_col = df.select_dtypes(object)
2) convert it to seconds
datetime_col_in_seconds = pd.to_timedelta(datetime_col.iloc[:, 0]).dt.total_seconds()
Then you can re-append the converted column to your original data and/or do whatever processing you want.
Eventually, you can convert it back to datetime.
datetime_col = pd.to_datetime(datetime_col_in_seconds, unit='s')
If you have more than one column with dtype object, you might have to do some more pre-processing, but I guess this is a good way to start tackling your particular case.
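A consolidated sketch of the steps above, assuming df holds a single object column of 'HH:MM:SS' strings such as '12:30:00' (column name and sample values are hypothetical):
import pandas as pd

df = pd.DataFrame({'upload_time': ['12:30:00', '09:15:00']})

time_col = df.select_dtypes(object).iloc[:, 0]            # 1) the object column
seconds = pd.to_timedelta(time_col).dt.total_seconds()    # 2) seconds since midnight
df['upload_time'] = pd.to_datetime(seconds, unit='s')     # back to datetime if needed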
This does what I need
for column_name in df.columns:
    try:
        df.loc[:, column_name] = pd.to_timedelta(df.loc[:, column_name].astype(str))
    except ValueError:
        pass
This tries to convert every column into a timedelta format. If it isn't capable of transforming a column, it raises a ValueError and moves on to the next column.
After being run, any columns that could be recognized as a timedelta format have been transformed.
I stumbled upon a "problem" while working with my data some time ago, when I started messing with pandas. That problem is that, when you compare np.datetime64 objects with strings, numpy will fill out the rest of the information to fit datetime with the lowest value possible (01 for months, 01 for days and so on).
The same happens if you call an np.datetime64 object and specify only up to the month, the rest of the information will still be filled with the lowest possible value:
np.datetime64('2019-07', 'D')
>>numpy.datetime64('2019-07-01')
The problem for me is that, many times, my only concern is with what happens between time periods, like months.
For example, if I want to filter every row where payments were made within the last month, it would be ideal to use:
month = '2019-07'
df[df['pay_day']==month]
But when doing something like that, it compares down to the day and fails for every date that isn't the first day of the month. I have tried converting the datetimes to str, slicing, and putting them back together, but for filtering purposes it gets messy. Another thing I have tried is:
df['pay_day'].days=1
The idea was to set all days to 01, so there would be no problem when comparing and filtering, but it just fills the whole column with int64 1's.
Any ideas on how to do that?
You can use the pandas datetime accessor object .dt and compare the corresponding properties; note that .dt.month is an integer, so compare against numbers rather than the '2019-07' string:
df[(df['pay_day'].dt.year == 2019) & (df['pay_day'].dt.month == 7)]
I have found a way that works for this specific problem: if we set all days to 01, there should be no problem, but np.datetime64 is hard to manipulate. There is a way, though:
df['pay_day'] = df['pay_day'].astype('datetime64[M]')
So all days will be set to 01, and comparison by month becomes easy. If you need to set the days to any other value, I guess it's harder, but this works.
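A short usage sketch, assuming 'pay_day' is already a datetime64 column (note that recent pandas versions reject the month-unit astype on a Series, in which case going through the NumPy values still works):
# Floor every date to the first of its month via NumPy's month unit
df['pay_day'] = df['pay_day'].values.astype('datetime64[M]')

# Now the question's original comparison works: '2019-07' parses to 2019-07-01
month = '2019-07'
july_rows = df[df['pay_day'] == month]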
I got the idea from: https://stackoverflow.com/a/52810147/8424939
Simplified version of the code:
import datetime as dt

df.loc[df['A'].notnull(), 'B'] = df.loc[df['A'].notnull(), 'A'].map(
    lambda x: dt.datetime.strptime(x, '%m/%d/%Y'))
Problem:
This code populated column 'B' with 19-digit integers (I assume nanosecond Unix time) instead of the expected datetime or pandas Timestamp.
The data frame was read in from a CSV created by an MSSQL Server export. The date format in the file did not have padded zeros. That shouldn't be a problem, but when the file was reformatted, the column was then populated with timestamps as expected.
We found a pretty easy workaround, but I can't find any documentation that would explain why strptime would return an integer, and I can't reproduce this behavior when I enter values in a console (or Jupyter).
Q1: What's going on here?
Q2: Could string formatting from MSSQL somehow produce this behavior?
Q3: Should we just count our blessings and stop asking why?
I will just note that this was a weird circumstance and a solution was found by adjusting other parts of the code; I would just like to understand what was happening.
(python 2.7, pandas 0.18)
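For what it's worth, a hedged workaround sketch (assuming, without being able to confirm it, that the partial .loc assignment into a column of another dtype coerced the Timestamps to their integer nanosecond representation): parse the whole column in one go instead.
import pandas as pd

# errors='coerce' turns unparseable/missing entries into NaT, so the whole
# assignment produces a datetime64[ns] column rather than integers
df['B'] = pd.to_datetime(df['A'], format='%m/%d/%Y', errors='coerce')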