read_json not showing column as datetime64[ns] - python

So I'm having a lot of trouble getting a column of a pandas DataFrame to read back as datetime64[ns] dtype after it has been saved in JSON format. I've tried pretty much everything I've seen online: pd.to_datetime with errors='coerce' and a format string, astype('datetime64[ns]'), date_format='iso', etc.
This is strange and very frustrating, as all my other dataframes with date columns that were saved as JSON files are read back correctly with dtype datetime64[ns].
I would really appreciate some help
Here are the last few lines of my code where I create the data frame and what it returns:
player = pd.DataFrame(full, index = list(range(len(full))), columns = ['Name', 'Handedness', 'Height', 'Bday'])
player.Height = player.Height.str[:-2]
player.Height = pd.to_numeric(player.Height)
player.Bday = pd.to_datetime(player.Bday, format = '%d/%m/%Y')
player = player.reset_index(drop = True)
player.to_json(f'../../Datasets/Singles_players/Player_Traits/{Event}_players.json', date_format = 'iso')
print(player.info())
print(player.head())
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 25 entries, 0 to 24
Data columns (total 4 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 Name 25 non-null object
1 Handedness 25 non-null object
2 Height 25 non-null float64
3 Bday 25 non-null datetime64[ns]
dtypes: datetime64[ns](1), float64(1), object(2)
memory usage: 928.0+ bytes
None
Name Handedness Height Bday
0 KENTO MOMOTA Left 175.0 1994-09-01
1 VIKTOR AXELSEN Right 194.0 1994-01-04
2 ANDERS ANTONSEN Right 183.0 1997-04-27
3 CHOU TIEN CHEN Right 180.0 1990-01-08
4 ANTHONY SINISUKA GINTING Right 171.0 1996-05-11
All good BUT here is what happens when I read the file:
player = pd.read_json('../Datasets/Singles_Players/Player_Traits/MS_players.json')
print(player.info())
print(player.head())
<class 'pandas.core.frame.DataFrame'>
Int64Index: 25 entries, 0 to 24
Data columns (total 4 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 Name 25 non-null object
1 Handedness 25 non-null object
2 Height 25 non-null int64
3 Bday 25 non-null object
dtypes: int64(1), object(3)
memory usage: 1000.0+ bytes
None
Name Handedness Height Bday
0 KENTO MOMOTA Left 175 1994-09-01T00:00:00.000Z
1 VIKTOR AXELSEN Right 194 1994-01-04T00:00:00.000Z
2 ANDERS ANTONSEN Right 183 1997-04-27T00:00:00.000Z
3 CHOU TIEN CHEN Right 180 1990-01-08T00:00:00.000Z
4 ANTHONY SINISUKA GINTING Right 171 1996-05-11T00:00:00.000Z
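A likely explanation (my assumption, not stated in the post): pd.read_json only auto-converts columns whose names look date-like (e.g. ending in _at or named date/datetime), so a column called Bday has to be listed explicitly via convert_dates. A minimal sketch:

```python
import io
import pandas as pd

df = pd.DataFrame({'Name': ['KENTO MOMOTA'],
                   'Bday': pd.to_datetime(['1994-09-01'])})
payload = df.to_json(date_format='iso')

# default read: 'Bday' does not match read_json's date-like name patterns,
# so the column comes back as object (ISO strings)
plain = pd.read_json(io.StringIO(payload))

# explicit read: name the column in convert_dates
fixed = pd.read_json(io.StringIO(payload), convert_dates=['Bday'])
```

(io.StringIO is used because passing a literal JSON string to read_json is deprecated in pandas 2.x.)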

Related

How to parse a date column as datetimes, not objects in Pandas?

I'd like to create a DataFrame from a csv with one datetime-typed column.
Following the article, this code should create the needed DataFrame:
df = pd.read_csv('data/data_3.csv', parse_dates=['date'])
df.info()
RangeIndex: 3 entries, 0 to 2
Data columns (total 3 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 date 3 non-null datetime64[ns]
1 product 3 non-null object
2 price 3 non-null int64
dtypes: datetime64[ns](1), int64(1), object(1)
memory usage: 200.0+ bytes
But when I do exactly the same steps, I get an object-typed date column:
df = pd.read_csv(path, parse_dates=['published_at'])
df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 100000 entries, 0 to 99999
Data columns (total 6 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 name 100000 non-null object
1 salary_from 48041 non-null float64
2 salary_to 53029 non-null float64
3 salary_currency 64733 non-null object
4 area_name 100000 non-null object
5 published_at 100000 non-null object
dtypes: float64(2), object(4)
memory usage: 4.6+ MB
I have tried a couple of different ways to parse the datetime column and still can't get a DataFrame with a datetime dtype. So how do I parse a column as datetime (not object)?
When loading the csv, have you tried:
df = pd.read_csv(path, parse_dates=['published_at'], infer_datetime_format = True)
And/or when converting to datetime:
pd.to_datetime(df.published_at, utc=True)
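A common reason parse_dates silently falls back to object dtype is mixed UTC offsets in the column; utc=True normalises them. A sketch of the suggestion above, with made-up data:

```python
import io
import pandas as pd

csv = """published_at
2022-01-01T10:00:00+03:00
2022-01-02T11:00:00+05:00
"""
df = pd.read_csv(io.StringIO(csv))
# mixed offsets: without utc=True the column can stay object dtype;
# utc=True converts every value to a single tz-aware dtype
df['published_at'] = pd.to_datetime(df['published_at'], utc=True)
```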

How to replace a LONG string of text in a column that is dtype object in pandas?

I have a df ("data") full of information from IMDB, and I'm trying to replace the string of text for unavailable synopses with a NaN. I wanted to start by simply counting them, using this code:
data[data['synopsis'=='\\nIt looks like we don\'t have a Synopsis for this title yet. Be the first to contribute! Just click the "Edit page" button at the bottom of the page or learn more in the Synopsis submission guide.\\n']].count()
But, I get a key error. I have a hunch it's because of the dtype?
I've tried to convert the synopsis column from object into string, to no avail, using this code:
data['synopsis'] = data['synopsis'].apply(str)
and this code:
pd.Series('synopsis').astype('str')
But when I look at the info, nothing changes. I was able to convert startYear to datetime, though.
<class 'pandas.core.frame.DataFrame'>
Int64Index: 27007 entries, 0 to 31893
Data columns (total 17 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 tconst 27007 non-null object
1 titleType 27007 non-null object
2 primaryTitle 27007 non-null object
3 originalTitle 27007 non-null object
4 isAdult 27007 non-null int64
5 startYear 27007 non-null datetime64[ns]
6 endYear 27007 non-null object
7 runtimeMinutes 27007 non-null object
8 genres 27007 non-null object
9 storyline 20362 non-null object
10 synopsis 27007 non-null object
11 countries_of_origin 26640 non-null object
12 budget 11295 non-null object
13 opening_weekend 771 non-null object
14 production_company 19478 non-null object
15 rating 13641 non-null float64
16 number_of_votes 13641 non-null object
dtypes: datetime64[ns](1), float64(1), int64(1), object(14)
memory usage: 4.7+ MB
I'm new to all this--what am I doing wrong?
You've got the bracket in the wrong spot in your filter line: the closing bracket of ['synopsis'] needs to come before the ==, i.e. data[data['synopsis'] == ...].count().
And pandas uses object as the dtype for string columns, so it is correct for it to say object.
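Putting the two points together, a sketch of the corrected filter and the NaN replacement (using a tiny made-up frame in place of the IMDB data, with a shortened placeholder string):

```python
import numpy as np
import pandas as pd

bad = "\nIt looks like we don't have a Synopsis for this title yet.\n"  # stand-in text
data = pd.DataFrame({'synopsis': [bad, 'A real synopsis.']})

# closing bracket of ['synopsis'] before the ==; the boolean mask goes inside data[...]
count = data[data['synopsis'] == bad]['synopsis'].count()

# replace the placeholder string with NaN
data['synopsis'] = data['synopsis'].replace(bad, np.nan)
```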

Python to_datetime doesn't show the date in day/months(decimal) format

I am learning python by myself and I have the following problem.
First of all the file.csv that I am working on can be accessed here
https://www.dropbox.com/s/1vo5oqrwi0jhcrn/2013_ACCIDENTS_TIPUS_GU_BCN_2013.csv?dl=0
I want to group my data by date, and for this I am using the "groupby" function. When I run my program to get the mean, I get "nan". I think the problem is related to pd.to_datetime, because the new 'Date' column has 0 non-null values. I would appreciate any help you can provide.
Many thanks in advance!
This is the code:
import pandas as pd
import numpy as np
import matplotlib
import matplotlib.pyplot as plt
import datetime as dt
data = pd.read_csv("Desktop/2013_ACCIDENTS_TIPUS_GU_BCN_2013.csv")
print(data)
data["Date"] = data[u'Dia de mes'].apply(lambda x: str(x)) + '-' + data[u'Mes de any'].apply(lambda x: str(x))
data["Date"] = pd.to_datetime(data["Date"], format = '%d%m', errors = 'coerce')
data.info()
data
accidents = data.groupby(['Date']).size()
print(accidents)
print(accidents.mean())
This is the output:
Número d'expedient Codi districte ... Coordenada UTM (Y) Coordenada UTM (X)
0 2013S009145 10 ... 4585368,61 432116,29
1 2013S006244 10 ... 4585265,29 432053,62
2 2013S000511 10 ... 4585305,49 432014,19
3 2013S009354 10 ... 4585434,72 431625,93
4 2013S001212 10 ... 4585250,74 431554,85
... ... ... ... ... ...
10034 2013S008522 9 ... 4588380,51 433851,83
10035 2013S005935 9 ... 4588457,97 433753,46
10036 2013S004640 9 ... 4587839,29 433957,61
10037 2013S003063 9 ... 4588028,25 433629,35
10038 2013S006183 9 ... 4588430,86 433021,42
[10039 rows x 20 columns]
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 10039 entries, 0 to 10038
Data columns (total 21 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 Número d'expedient 10039 non-null object
1 Codi districte 10039 non-null object
2 Nom districte 10039 non-null object
3 Codi barri 10039 non-null object
4 Nom barri 10039 non-null object
5 Codi carrer 10039 non-null object
6 Nom carrer 10039 non-null object
7 Num postal caption 10039 non-null object
8 Descripció dia setmana 10039 non-null object
9 Dia setmana 10039 non-null object
10 Descripció tipus dia 10039 non-null object
11 NK Any 10039 non-null int64
12 Mes de any 10039 non-null int64
13 Nom mes 10039 non-null object
14 Dia de mes 10039 non-null int64
15 Hora de dia 10039 non-null int64
16 Descripció torn 10039 non-null object
17 Descripció tipus accident 10039 non-null object
18 Coordenada UTM (Y) 10039 non-null object
19 Coordenada UTM (X) 10039 non-null object
20 Date 0 non-null datetime64[ns]
dtypes: datetime64[ns](1), int64(4), object(16)
memory usage: 1.6+ MB
Series([], dtype: int64)
nan
This should do it. You need to zero-pad (zfill) the day and month, and use the dayfirst parameter. I'm assuming NK Any is the year; if it is not, you can drop it and change the format to simply %d-%m.
pd.to_datetime(df['Dia de mes'].astype(str).str.zfill(2) + '-' + df['Mes de any'].astype(str).str.zfill(2) + '-' + df['NK Any'].astype(str), dayfirst=True, format='%d-%m-%Y')
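As a self-contained sketch of that one-liner (made-up rows with the same column names):

```python
import pandas as pd

df = pd.DataFrame({'Dia de mes': [1, 27],
                   'Mes de any': [3, 11],
                   'NK Any': [2013, 2013]})
# zero-pad day and month so every row matches the %d-%m-%Y format
df['Date'] = pd.to_datetime(
    df['Dia de mes'].astype(str).str.zfill(2) + '-'
    + df['Mes de any'].astype(str).str.zfill(2) + '-'
    + df['NK Any'].astype(str),
    dayfirst=True, format='%d-%m-%Y')
```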

Including minutes column in CSV breaks date parsing

Problem
I have a CSV file with components of the date and time in separate columns. When I use pandas.read_csv, I can use the parse_date kwarg to combine the components into a single datetime column if I don't include the minutes column.
Example
Consider the following example:
from io import StringIO
import pandas
data = """\
GAUGE,YEAR,MONTH,DAY,HOUR,MINUTE,PRECIP
1,2008,03,27,19,30,0.02
1,2008,03,27,19,45,0.06
1,2008,03,27,20,0,0.01
1,2008,03,27,20,30,0.01
1,2008,03,27,21,0,0.12
1,2008,03,27,21,15,0.02
1,2008,03,27,23,15,0.02
1,2008,03,27,23,30,0.01
1,2008,03,30,04,0,0.05
1,2008,03,30,04,15,0.24
"""
df_has_dt = pandas.read_csv(
StringIO(data),
parse_dates={'datetime': ['YEAR', 'MONTH', 'DAY', 'HOUR']},
)
df_no_dt = pandas.read_csv(
StringIO(data),
parse_dates={'datetime': ['YEAR', 'MONTH', 'DAY', 'HOUR', 'MINUTE']},
)
If I look at the .info() method of each dataframe, I get:
The first:
RangeIndex: 10 entries, 0 to 9
Data columns (total 4 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 datetime 10 non-null datetime64[ns] # <--- good, but doesn't have minutes of course
1 GAUGE 10 non-null int64
2 MINUTE 10 non-null int64
3 PRECIP 10 non-null float64
The second:
RangeIndex: 10 entries, 0 to 9
Data columns (total 3 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 datetime 10 non-null object # <--- bad!
1 GAUGE 10 non-null int64
2 PRECIP 10 non-null float64
Indeed, df_no_dt.head() shows a very strange "datetime" column:
datetime GAUGE PRECIP
2008 03 27 19 30 1 0.02
2008 03 27 19 45 1 0.06
2008 03 27 20 0 1 0.01
2008 03 27 20 30 1 0.01
2008 03 27 21 0 1 0.12
Question:
What's causing this and how should I efficiently get the minutes of the time into the datetime column?
I'm not sure why just adding the minutes column to the datetime parsing isn't working, but you can specify a function to parse them like so:
from io import StringIO
import pandas
data = """\
GAUGE,YEAR,MONTH,DAY,HOUR,MINUTE,PRECIP
1,2008,03,27,19,30,0.02
1,2008,03,27,19,45,0.06
1,2008,03,27,20,0,0.01
1,2008,03,27,20,30,0.01
1,2008,03,27,21,0,0.12
1,2008,03,27,21,15,0.02
1,2008,03,27,23,15,0.02
1,2008,03,27,23,30,0.01
1,2008,03,30,04,0,0.05
1,2008,03,30,04,15,0.24
"""
DT_COLS = ['YEAR', 'MONTH', 'DAY', 'HOUR', 'MINUTE']
def dt_parser(*args):
return pandas.to_datetime(pandas.DataFrame(zip(*args), columns=DT_COLS))
df = pandas.read_csv(
StringIO(data),
parse_dates={'datetime': DT_COLS},
date_parser=dt_parser,
)
Which outputs:
>>> df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 10 entries, 0 to 9
Data columns (total 3 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 datetime 10 non-null datetime64[ns]
1 GAUGE 10 non-null int64
2 PRECIP 10 non-null float64
dtypes: datetime64[ns](1), float64(1), int64(1)
memory usage: 368.0 bytes
How dt_parser works: it relies on a feature of pd.to_datetime that recognises common names for date/time parts. These aren't fully documented, but at the time of writing they can be found in the pandas source. They are:
"year", "years",
"month", "months",
"day", "days",
"hour", "hours",
"minute", "minutes",
"second", "seconds",
"ms", "millisecond", "milliseconds",
"us", "microsecond", "microseconds",
"ns", "nanosecond", "nanoseconds",
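Note that date_parser is deprecated in pandas 2.x. Under that assumption, an equivalent approach is to read the part columns as-is and let to_datetime assemble them afterwards, since it accepts a DataFrame whose columns carry those part names:

```python
import io
import pandas as pd

data = """GAUGE,YEAR,MONTH,DAY,HOUR,MINUTE,PRECIP
1,2008,03,27,19,30,0.02
1,2008,03,27,19,45,0.06
"""
df = pd.read_csv(io.StringIO(data))
DT_COLS = ['YEAR', 'MONTH', 'DAY', 'HOUR', 'MINUTE']
# lower-case the part columns and let to_datetime assemble them into one column
df['datetime'] = pd.to_datetime(df[DT_COLS].rename(columns=str.lower))
df = df.drop(columns=DT_COLS)
```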

Pandas dataframe adding zero-padding before the datetime

I'm using a pandas DataFrame, and I have a DataFrame df like the following:
time id
-------------
5:13:40 1
16:20:59 2
...
For the first row, the time 5:13:40 has no zero padding before, and I want to convert it to 05:13:40. So my expected df would be like:
time id
-------------
05:13:40 1
16:20:59 2
...
The type of time is <class 'datetime.timedelta'>. Could anyone give me some hints on how to handle this problem? Thanks so much!
Use pd.to_timedelta:
df['time'] = pd.to_timedelta(df['time'])
Before:
print(df)
time id
1 5:13:40 1.0
2 16:20:59 2.0
df.info()
<class 'pandas.core.frame.DataFrame'>
Int64Index: 2 entries, 1 to 2
Data columns (total 2 columns):
time 2 non-null object
id 2 non-null float64
dtypes: float64(1), object(1)
memory usage: 48.0+ bytes
After:
print(df)
time id
1 05:13:40 1.0
2 16:20:59 2.0
df.info()
<class 'pandas.core.frame.DataFrame'>
Int64Index: 2 entries, 1 to 2
Data columns (total 2 columns):
time 2 non-null timedelta64[ns]
id 2 non-null float64
dtypes: float64(1), timedelta64[ns](1)
memory usage: 48.0 bytes
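A minimal, runnable version of the same before/after:

```python
import pandas as pd

df = pd.DataFrame({'time': ['5:13:40', '16:20:59'], 'id': [1.0, 2.0]})
# convert the object column to timedelta64[ns]
df['time'] = pd.to_timedelta(df['time'])
# timedelta64 values print zero-padded, e.g. '0 days 05:13:40'
```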
