How can I print certain rows from a CSV in pandas - python

My problem is that I have a big dataframe with over 40000 Rows and now I want to select the rows from 2013-01-01 00:00:00 until 2013-31-12 00:00:00
print(df.loc[df['localhour'] == '2013-01-01 00:00:00'])
Thats my code now but I can not choose an intervall for printing out ... any ideas ?

One way is to set your index as datetime and then use pd.DataFrame.loc with string indexers:
df = pd.DataFrame({'Date': ['2013-01-01', '2014-03-01', '2011-10-01', '2013-05-01'],
'Var': [1, 2, 3, 4]})
df['Date'] = pd.to_datetime(df['Date'])
res = df.set_index('Date').loc['2010-01-01':'2013-01-01']
print(res)
Var
Date
2013-01-01 1
2011-10-01 3

Make a datetime object and then apply the condition:
print(df)
date
0 2013-01-01
1 2014-03-01
2 2011-10-01
3 2013-05-01
df['date']=pd.to_datetime(df['date'])
df['date'].loc[(df['date']<='2013-12-31 00:00:00') & (df['date']>='2013-01-01 00:00:00')]
Output:
0 2013-01-01
3 2013-05-01

Related

Create a list of years with pandas

I have a dataframe with a column of dates of the form
2004-01-01
2005-01-01
2006-01-01
2007-01-01
2008-01-01
2009-01-01
2010-01-01
2011-01-01
2012-01-01
2013-01-01
2014-01-01
2015-01-01
2016-01-01
2017-01-01
2018-01-01
2019-01-01
Given an integer number k, let's say k=5, I would like to generate an array of the next k years after the maximum date of the column. The output should look like:
2020-01-01
2021-01-01
2022-01-01
2023-01-01
2024-01-01
Let's use pd.to_datetime + max to compute the largest date in the column date then use pd.date_range to generate the dates based on the offset frequency one year and having the number of periods equals to k=5:
strt, offs = pd.to_datetime(df['date']).max(), pd.DateOffset(years=1)
dates = pd.date_range(strt + offs, freq=offs, periods=k).strftime('%Y-%m-%d').tolist()
print(dates)
['2020-01-01', '2021-01-01', '2022-01-01', '2023-01-01', '2024-01-01']
Here you go:
import pandas as pd
# this is your k
k = 5
# Creating a test DF
array = {'dt': ['2018-01-01', '2019-01-01']}
df = pd.DataFrame(array)
# Extracting column of year
df['year'] = pd.DatetimeIndex(df['dt']).year
year1 = df['year'].max()
# creating a new DF and populating it with k years
years_df = pd.DataFrame()
for i in range (1,k+1):
row = {'dates':[str(year1 + i) + '-01-01']}
years_df = years_df.append(pd.DataFrame(row))
years_df
The output:
dates
2020-01-01
2021-01-01
2022-01-01
2023-01-01
2024-01-01

Remove all rows between two timestamps from pandas dataframe

How to delete all rows from dataframe between two timestamps inclusive?
my Dataframe looks like :
b a
0 2016-12-02 22:00:00 19.218519
1 2016-12-02 23:00:00 19.171197
2 2016-12-03 00:00:00 19.257836
3 2016-12-03 01:00:00 19.195610
4 2016-12-03 02:00:00 19.176413
For eg : I want to delete all rows from above dataframe whose timestamp falls is in between : "2016-12-02 22:00:00" to "2016-12-03 00:00:00".
So, the result will contain only rows 3 and 4.
the type of b column is datetime64 and the type of a is float.
Please suggest.
You can filter those out:
from_ts = '2016-12-02 22:00:00'
to_ts = '2016-12-03 00:00:00'
df = df[(df['b'] < from_ts) | (df['b'] > to_ts)]
Convert the column b to datetime and then apply mask
df.b = pd.to_datetime(df.b, format = '%Y-%m-%d %H:%M:%S')
df[(df.b < '2016-12-02 22:00:00') | (df.b > '2016-12-03 00:00:00')]
b a
3 2016-12-03 01:00:00 19.195610
4 2016-12-03 02:00:00 19.176413
index_list= df.b[(df.b >= "2016-12-02 22:00:00") & (df.b <= "2016-12-03 00:00:00")].index.tolist()
df.drop(df.index[index_list] , inplace = True)

Pandas and csv import into dataframe. How to best to combine date anbd date fields into one

I have a csv file that I am trying to import into pandas.
There are two columns of intrest. date and hour and are the first two cols.
E.g.
date,hour,...
10-1-2013,0,
10-1-2013,0,
10-1-2013,0,
10-1-2013,1,
10-1-2013,1,
How do I import using pandas so that that hour and date is combined or is that best done after the initial import?
df = DataFrame.from_csv('bingads.csv', sep=',')
If I do the initial import how do I combine the two as a date and then delete the hour?
Thanks
Define your own date_parser:
In [291]: from dateutil.parser import parse
In [292]: import datetime as dt
In [293]: def date_parser(x):
.....: date, hour = x.split(' ')
.....: return parse(date) + dt.timedelta(0, 3600*int(hour))
In [298]: pd.read_csv('test.csv', parse_dates=[[0,1]], date_parser=date_parser)
Out[298]:
date_hour a b c
0 2013-10-01 00:00:00 1 1 1
1 2013-10-01 00:00:00 2 2 2
2 2013-10-01 00:00:00 3 3 3
3 2013-10-01 01:00:00 4 4 4
4 2013-10-01 01:00:00 5 5 5
Apply read_csv instead of read_clipboard to handle your actual data:
>>> df = pd.read_clipboard(sep=',')
>>> df['date'] = pd.to_datetime(df.date) + pd.to_timedelta(df.hour, unit='D')/24
>>> del df['hour']
>>> df
date ...
0 2013-10-01 00:00:00 NaN
1 2013-10-01 00:00:00 NaN
2 2013-10-01 00:00:00 NaN
3 2013-10-01 01:00:00 NaN
4 2013-10-01 01:00:00 NaN
[5 rows x 2 columns]
Take a look at the parse_dates argument which pandas.read_csv accepts.
You can do something like:
df = pandas.read_csv('some.csv', parse_dates=True)
# in which case pandas will parse all columns where it finds dates
df = pandas.read_csv('some.csv', parse_dates=[i,j,k])
# in which case pandas will parse the i, j and kth columns for dates
Since you are only using the two columns from the cdv file and combining those into one, I would squeeze into a series of datetime objects like so:
import pandas as pd
from StringIO import StringIO
import datetime as dt
txt='''\
date,hour,A,B
10-1-2013,0,1,6
10-1-2013,0,2,7
10-1-2013,0,3,8
10-1-2013,1,4,9
10-1-2013,1,5,10'''
def date_parser(date, hour):
dates=[]
for ed, eh in zip(date, hour):
month, day, year=list(map(int, ed.split('-')))
hour=int(eh)
dates.append(dt.datetime(year, month, day, hour))
return dates
p=pd.read_csv(StringIO(txt), usecols=[0,1],
parse_dates=[[0,1]], date_parser=date_parser, squeeze=True)
print p
Prints:
0 2013-10-01 00:00:00
1 2013-10-01 00:00:00
2 2013-10-01 00:00:00
3 2013-10-01 01:00:00
4 2013-10-01 01:00:00
Name: date_hour, dtype: datetime64[ns]

Handling monthly-binned data in pandas

I have a dataset I'm analyzing in pandas where all data is binned monthly. The data originates from a MySQL database where all dates are in the format 'YYYY-MM-01', such that, for example, all rows for October 2013 would have "2013-10-01" in the month column.
I'm currently reading the data into pandas (via a .tsv dump of the MySQL table) with
data = pd.read_table(filename,header=None,names=('uid','iid','artist','tag','date'),index_col=indexes, parse_dates='date')
This is all fine, except for the fact that any subsequent analyses I run in which I do monthly resampling always represents dates using the end-of-month convention (i.e. data from October becomes '2013-10-31' instead of '2013-10-01'), but this can lead to inconsistencies where the original data has months labeled as 'YYYY-MM-01', while any resampled data will have the months labeled as 'YYYY-MM-31' (or '-30' or '-28', as appropriate).
My question is this: What is the easiest and/or fastest way I can convert all the dates in my dataframe to the end-of-month format from the outset? Keep in mind that the date is one of several indexes in a multi-index, not a column. I think my best bet is to use a modified date_parser in my in my pd.read_table call that always converts month to the end-of-month convention, but I'm not sure how to approach it.
Read your dates in exactly like you are doing.
Create some test data. I am setting the dates to the start of month, but it doesn't matter.
In [39]: df = DataFrame(np.random.randn(10,2),columns=list('AB'),
index=date_range('20130101',periods=10,freq='MS'))
In [40]: df
Out[40]:
A B
2013-01-01 -0.553482 0.049128
2013-02-01 0.337975 -0.035897
2013-03-01 -0.394849 -1.755323
2013-04-01 -0.555638 1.903388
2013-05-01 -0.087752 1.551916
2013-06-01 1.000943 -0.361248
2013-07-01 -1.855171 -2.215276
2013-08-01 -0.582643 1.661696
2013-09-01 0.501061 -1.455171
2013-10-01 1.343630 -2.008060
Force convert them to the end-of-month in time space regardless of the day
In [41]: df.index = df.index.to_period().to_timestamp('M')
In [42]: df
Out[42]:
A B
2013-01-31 -0.553482 0.049128
2013-02-28 0.337975 -0.035897
2013-03-31 -0.394849 -1.755323
2013-04-30 -0.555638 1.903388
2013-05-31 -0.087752 1.551916
2013-06-30 1.000943 -0.361248
2013-07-31 -1.855171 -2.215276
2013-08-31 -0.582643 1.661696
2013-09-30 0.501061 -1.455171
2013-10-31 1.343630 -2.008060
Back to the start
In [43]: df.index = df.index.to_period().to_timestamp('MS')
In [44]: df
Out[44]:
A B
2013-01-01 -0.553482 0.049128
2013-02-01 0.337975 -0.035897
2013-03-01 -0.394849 -1.755323
2013-04-01 -0.555638 1.903388
2013-05-01 -0.087752 1.551916
2013-06-01 1.000943 -0.361248
2013-07-01 -1.855171 -2.215276
2013-08-01 -0.582643 1.661696
2013-09-01 0.501061 -1.455171
2013-10-01 1.343630 -2.008060
You can also work with (and resample) as periods
In [45]: df.index = df.index.to_period()
In [46]: df
Out[46]:
A B
2013-01 -0.553482 0.049128
2013-02 0.337975 -0.035897
2013-03 -0.394849 -1.755323
2013-04 -0.555638 1.903388
2013-05 -0.087752 1.551916
2013-06 1.000943 -0.361248
2013-07 -1.855171 -2.215276
2013-08 -0.582643 1.661696
2013-09 0.501061 -1.455171
2013-10 1.343630 -2.008060
use replace() to change the day value. and you can get the last day of month using
from datetime import date
import calendar
d = date(2000,1,1)
d = d.replace(day=calendar.monthrange(d.year, d.month)[1])
UPDATE
I add some example for pandas.
sample file date.csv
2013-01-01, 1
2013-02-01, 2
ipython shell log.
In [27]: import pandas as pd
In [28]: from datetime import datetime, date
In [29]: import calendar
In [30]: def parse(dt):
dt = datetime.strptime(dt, '%Y-%m-%d')
dt = dt.replace(day=calendar.monthrange(dt.year, dt.month)[1])
return dt.date()
....:
In [31]: parse('2013-01-01')
Out[31]: datetime.date(2013, 1, 31)
In [32]: r = pd.read_csv('date.csv', header=None, names=('date', 'value'), parse_dates=['date'], date_parser=parse)
In [33]: r
Out[33]:
date value
0 2013-01-31 1
1 2013-02-28 2

Pandas Timedelta in Days

I have a dataframe in pandas called 'munged_data' with two columns 'entry_date' and 'dob' which i have converted to Timestamps using pd.to_timestamp.I am trying to figure out how to calculate ages of people based on the time difference between 'entry_date' and 'dob' and to do this i need to get the difference in days between the two columns ( so that i can then do somehting like round(days/365.25). I do not seem to be able to find a way to do this using a vectorized operation. When I do munged_data.entry_date-munged_data.dob i get the following :
internal_quote_id
2 15685977 days, 23:54:30.457856
3 11651985 days, 23:49:15.359744
4 9491988 days, 23:39:55.621376
7 11907004 days, 0:10:30.196224
9 15282164 days, 23:30:30.196224
15 15282227 days, 23:50:40.261632
However i do not seem to be able to extract the days as an integer so that i can continue with my calculation.
Any help appreciated.
Using the Pandas type Timedelta available since v0.15.0 you also can do:
In[1]: import pandas as pd
In[2]: df = pd.DataFrame([ pd.Timestamp('20150111'),
pd.Timestamp('20150301') ], columns=['date'])
In[3]: df['today'] = pd.Timestamp('20150315')
In[4]: df
Out[4]:
date today
0 2015-01-11 2015-03-15
1 2015-03-01 2015-03-15
In[5]: (df['today'] - df['date']).dt.days
Out[5]:
0 63
1 14
dtype: int64
You need 0.11 for this (0.11rc1 is out, final prob next week)
In [9]: df = DataFrame([ Timestamp('20010101'), Timestamp('20040601') ])
In [10]: df
Out[10]:
0
0 2001-01-01 00:00:00
1 2004-06-01 00:00:00
In [11]: df = DataFrame([ Timestamp('20010101'),
Timestamp('20040601') ],columns=['age'])
In [12]: df
Out[12]:
age
0 2001-01-01 00:00:00
1 2004-06-01 00:00:00
In [13]: df['today'] = Timestamp('20130419')
In [14]: df['diff'] = df['today']-df['age']
In [16]: df['years'] = df['diff'].apply(lambda x: float(x.item().days)/365)
In [17]: df
Out[17]:
age today diff years
0 2001-01-01 00:00:00 2013-04-19 00:00:00 4491 days, 00:00:00 12.304110
1 2004-06-01 00:00:00 2013-04-19 00:00:00 3244 days, 00:00:00 8.887671
You need this odd apply at the end because not yet full support for timedelta64[ns] scalars (e.g. like how we use Timestamps now for datetime64[ns], coming in 0.12)
Not sure if you still need it, but in Pandas 0.14 i usually use .astype('timedelta64[X]') method
http://pandas.pydata.org/pandas-docs/stable/timeseries.html (frequency conversion)
df = pd.DataFrame([ pd.Timestamp('20010101'), pd.Timestamp('20040605') ])
df.ix[0]-df.ix[1]
Returns:
0 -1251 days
dtype: timedelta64[ns]
(df.ix[0]-df.ix[1]).astype('timedelta64[Y]')
Returns:
0 -4
dtype: float64
Hope that will help
Let's specify that you have a pandas series named time_difference which has type
numpy.timedelta64[ns]
One way of extracting just the day (or whatever desired attribute) is the following:
just_day = time_difference.apply(lambda x: pd.tslib.Timedelta(x).days)
This function is used because the numpy.timedelta64 object does not have a 'days' attribute.
To convert any type of data into days just use pd.Timedelta().days:
pd.Timedelta(1985, unit='Y').days
84494

Categories