I'd like to create a DataFrame from a CSV with one datetime-typed column.
Following the article, this code should create the needed DataFrame:
df = pd.read_csv('data/data_3.csv', parse_dates=['date'])
df.info()
RangeIndex: 3 entries, 0 to 2
Data columns (total 3 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 date 3 non-null datetime64[ns]
1 product 3 non-null object
2 price 3 non-null int64
dtypes: datetime64[ns](1), int64(1), object(1)
memory usage: 200.0+ bytes
But when I do exactly the same steps, I get an object-typed date column:
df = pd.read_csv(path, parse_dates=['published_at'])
df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 100000 entries, 0 to 99999
Data columns (total 6 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 name 100000 non-null object
1 salary_from 48041 non-null float64
2 salary_to 53029 non-null float64
3 salary_currency 64733 non-null object
4 area_name 100000 non-null object
5 published_at 100000 non-null object
dtypes: float64(2), object(4)
memory usage: 4.6+ MB
I have tried a couple of different ways to parse the datetime column and still can't get a DataFrame with a datetime dtype. So how do I parse a column as datetime (not object)?
When loading the csv, have you tried:
df = pd.read_csv(path, parse_dates=['published_at'], infer_datetime_format=True)
And/or when converting to datetime:
pd.to_datetime(df.published_at, utc=True)
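One detail worth stressing: pd.to_datetime returns a new Series, so you have to assign it back to the column. A minimal sketch (using a made-up published_at column with mixed timezone offsets, which is a common reason read_csv leaves the column as object):

```python
import pandas as pd

# Hypothetical data: mixed UTC offsets often prevent a plain parse,
# while utc=True normalizes everything to a single tz-aware dtype.
df = pd.DataFrame({'published_at': ['2022-06-01T10:00:00+0300',
                                    '2022-06-01T12:30:00+0500']})

# Assign the result back -- to_datetime does not modify df in place.
df['published_at'] = pd.to_datetime(df['published_at'], utc=True)
print(df['published_at'].dtype)
```

After the assignment, df.info() should report datetime64[ns, UTC] instead of object.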
I have a df ("data") full of information from IMDB, and I'm trying to replace the string of text for unavailable synopses with a NaN. I wanted to start by simply counting them, using this code:
data[data['synopsis'=='\\nIt looks like we don\'t have a Synopsis for this title yet. Be the first to contribute! Just click the "Edit page" button at the bottom of the page or learn more in the Synopsis submission guide.\\n']].count()
But I get a KeyError. I have a hunch it's because of the dtype?
I've tried to convert the synopsis column from object into string, to no avail, using this code:
data['synopsis'] = data['synopsis'].apply(str)
and this code:
pd.Series('synopsis').astype('str')
But when I look at the info, nothing changes. I was able to convert startYear to datetime, though.
<class 'pandas.core.frame.DataFrame'>
Int64Index: 27007 entries, 0 to 31893
Data columns (total 17 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 tconst 27007 non-null object
1 titleType 27007 non-null object
2 primaryTitle 27007 non-null object
3 originalTitle 27007 non-null object
4 isAdult 27007 non-null int64
5 startYear 27007 non-null datetime64[ns]
6 endYear 27007 non-null object
7 runtimeMinutes 27007 non-null object
8 genres 27007 non-null object
9 storyline 20362 non-null object
10 synopsis 27007 non-null object
11 countries_of_origin 26640 non-null object
12 budget 11295 non-null object
13 opening_weekend 771 non-null object
14 production_company 19478 non-null object
15 rating 13641 non-null float64
16 number_of_votes 13641 non-null object
dtypes: datetime64[ns](1), float64(1), int64(1), object(14)
memory usage: 4.7+ MB
I'm new to all this--what am I doing wrong?
You've got the bracket in the wrong spot in your filter line: the comparison belongs inside the outer brackets, as in data[data['synopsis'] == ...].count().
Also, pandas uses object as the dtype for strings, so it is correct for it to say object.
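A small sketch of both the corrected filter and the replacement step (the placeholder string is shortened here for readability, and the column data is made up):

```python
import numpy as np
import pandas as pd

placeholder = "It looks like we don't have a Synopsis for this title yet."
data = pd.DataFrame({'synopsis': ['A real plot.', placeholder, placeholder]})

# Corrected filter: build the boolean mask first, then index with it.
n_missing = (data['synopsis'] == placeholder).sum()

# Replace the placeholder string with NaN in one step.
data['synopsis'] = data['synopsis'].replace(placeholder, np.nan)
```

Series.replace handles the whole column at once, so there is no need to apply(str) or cast dtypes first.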
I am teaching myself Python, and I have the following problem.
First of all the file.csv that I am working on can be accessed here
https://www.dropbox.com/s/1vo5oqrwi0jhcrn/2013_ACCIDENTS_TIPUS_GU_BCN_2013.csv?dl=0
I want to group my data by date, and for this I am using the groupby function. When I run my program to get the mean, I get nan. I think the problem is related to pd.to_datetime, because the new column 'Date' has 0 non-null values. I would appreciate any help you can provide.
Many thanks in advance!
This is the code:
import pandas as pd
import numpy as np
import matplotlib
import matplotlib.pyplot as plt
import datetime as dt
data = pd.read_csv("Desktop/2013_ACCIDENTS_TIPUS_GU_BCN_2013.csv")
print(data)
data["Date"] = data[u'Dia de mes'].apply(lambda x: str(x)) + '-' + data[u'Mes de any'].apply(lambda x: str(x))
data["Date"] = pd.to_datetime(data["Date"], format = '%d%m', errors = 'coerce')
data.info()
data
accidents = data.groupby(['Date']).size()
print(accidents)
print(accidents.mean())
This is the output:
Número d'expedient Codi districte ... Coordenada UTM (Y) Coordenada UTM (X)
0 2013S009145 10 ... 4585368,61 432116,29
1 2013S006244 10 ... 4585265,29 432053,62
2 2013S000511 10 ... 4585305,49 432014,19
3 2013S009354 10 ... 4585434,72 431625,93
4 2013S001212 10 ... 4585250,74 431554,85
... ... ... ... ... ...
10034 2013S008522 9 ... 4588380,51 433851,83
10035 2013S005935 9 ... 4588457,97 433753,46
10036 2013S004640 9 ... 4587839,29 433957,61
10037 2013S003063 9 ... 4588028,25 433629,35
10038 2013S006183 9 ... 4588430,86 433021,42
[10039 rows x 20 columns]
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 10039 entries, 0 to 10038
Data columns (total 21 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 Número d'expedient 10039 non-null object
1 Codi districte 10039 non-null object
2 Nom districte 10039 non-null object
3 Codi barri 10039 non-null object
4 Nom barri 10039 non-null object
5 Codi carrer 10039 non-null object
6 Nom carrer 10039 non-null object
7 Num postal caption 10039 non-null object
8 Descripció dia setmana 10039 non-null object
9 Dia setmana 10039 non-null object
10 Descripció tipus dia 10039 non-null object
11 NK Any 10039 non-null int64
12 Mes de any 10039 non-null int64
13 Nom mes 10039 non-null object
14 Dia de mes 10039 non-null int64
15 Hora de dia 10039 non-null int64
16 Descripció torn 10039 non-null object
17 Descripció tipus accident 10039 non-null object
18 Coordenada UTM (Y) 10039 non-null object
19 Coordenada UTM (X) 10039 non-null object
20 Date 0 non-null datetime64[ns]
dtypes: datetime64[ns](1), int64(4), object(16)
memory usage: 1.6+ MB
Series([], dtype: int64)
nan
This should do it. You need to zfill the day and month so single digits match the format, and use the dayfirst parameter. I'm assuming NK Any is the year; if it isn't, you can remove it and change the format to simply %d-%m.
pd.to_datetime(df['Dia de mes'].astype(str).str.zfill(2) + '-' + df['Mes de any'].astype(str).str.zfill(2) + '-' + df['NK Any'].astype(str), dayfirst=True, format='%d-%m-%Y')
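Put together as a runnable sketch (with a few made-up rows standing in for the CSV), including the assignment back to the Date column and the groupby the question was after:

```python
import pandas as pd

# Toy stand-in for the accidents CSV, using the question's column names.
df = pd.DataFrame({'Dia de mes': [1, 1, 15],
                   'Mes de any': [3, 3, 12],
                   'NK Any': [2013, 2013, 2013]})

# zfill(2) pads '1' to '01' so the string matches '%d-%m-%Y' exactly;
# the original code failed because '%d%m' had no '-' separators.
df['Date'] = pd.to_datetime(
    df['Dia de mes'].astype(str).str.zfill(2) + '-'
    + df['Mes de any'].astype(str).str.zfill(2) + '-'
    + df['NK Any'].astype(str),
    format='%d-%m-%Y',
)

accidents = df.groupby('Date').size()
print(accidents.mean())  # 1.5 -- three rows spread over two dates
```

With a valid format string there is no need for errors='coerce', which was what silently turned every date into NaT in the original code.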
Problem
I have a CSV file with components of the date and time in separate columns. When I use pandas.read_csv, I can use the parse_dates kwarg to combine the components into a single datetime column, but only if I don't include the minutes column.
Example
Consider the following example:
from io import StringIO
import pandas
data = """\
GAUGE,YEAR,MONTH,DAY,HOUR,MINUTE,PRECIP
1,2008,03,27,19,30,0.02
1,2008,03,27,19,45,0.06
1,2008,03,27,20,0,0.01
1,2008,03,27,20,30,0.01
1,2008,03,27,21,0,0.12
1,2008,03,27,21,15,0.02
1,2008,03,27,23,15,0.02
1,2008,03,27,23,30,0.01
1,2008,03,30,04,0,0.05
1,2008,03,30,04,15,0.24
"""
df_has_dt = pandas.read_csv(
StringIO(data),
parse_dates={'datetime': ['YEAR', 'MONTH', 'DAY', 'HOUR']},
)
df_no_dt = pandas.read_csv(
StringIO(data),
parse_dates={'datetime': ['YEAR', 'MONTH', 'DAY', 'HOUR', 'MINUTE']},
)
If I look at the .info() method of each dataframe, I get:
The first:
RangeIndex: 10 entries, 0 to 9
Data columns (total 4 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 datetime 10 non-null datetime64[ns] # <--- good, but doesn't have minutes of course
1 GAUGE 10 non-null int64
2 MINUTE 10 non-null int64
3 PRECIP 10 non-null float64
The second:
RangeIndex: 10 entries, 0 to 9
Data columns (total 3 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 datetime 10 non-null object # <--- bad!
1 GAUGE 10 non-null int64
2 PRECIP 10 non-null float64
Indeed, df_no_dt.head() shows a very strange "datetime" column:
datetime GAUGE PRECIP
2008 03 27 19 30 1 0.02
2008 03 27 19 45 1 0.06
2008 03 27 20 0 1 0.01
2008 03 27 20 30 1 0.01
2008 03 27 21 0 1 0.12
Question:
What's causing this and how should I efficiently get the minutes of the time into the datetime column?
I'm not sure why simply adding the minutes column to the datetime parsing isn't working, but you can specify a function to parse the components yourself, like so:
from io import StringIO
import pandas
data = """\
GAUGE,YEAR,MONTH,DAY,HOUR,MINUTE,PRECIP
1,2008,03,27,19,30,0.02
1,2008,03,27,19,45,0.06
1,2008,03,27,20,0,0.01
1,2008,03,27,20,30,0.01
1,2008,03,27,21,0,0.12
1,2008,03,27,21,15,0.02
1,2008,03,27,23,15,0.02
1,2008,03,27,23,30,0.01
1,2008,03,30,04,0,0.05
1,2008,03,30,04,15,0.24
"""
DT_COLS = ['YEAR', 'MONTH', 'DAY', 'HOUR', 'MINUTE']
def dt_parser(*args):
return pandas.to_datetime(pandas.DataFrame(zip(*args), columns=DT_COLS))
df = pandas.read_csv(
StringIO(data),
parse_dates={'datetime': DT_COLS},
date_parser=dt_parser,
)
Which outputs:
>>> df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 10 entries, 0 to 9
Data columns (total 3 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 datetime 10 non-null datetime64[ns]
1 GAUGE 10 non-null int64
2 PRECIP 10 non-null float64
dtypes: datetime64[ns](1), float64(1), int64(1)
memory usage: 368.0 bytes
How dt_parser works: it relies on a feature of pd.to_datetime that recognises common names for date/time components when given a DataFrame. These aren't fully documented, but at the time of writing they can be found in the pandas source. They are:
"year", "years",
"month", "months",
"day", "days",
"hour", "hours",
"minute", "minutes",
"second", "seconds",
"ms", "millisecond", "milliseconds",
"us", "microsecond", "microseconds",
"ns", "nanosecond", "nanoseconds",