Count the number of observations that occur per day - python

I have a pandas dataframe indexed by time. I want to know the total number of observations (i.e. dataframe rows) that happen each day.
Here is my dataframe:
import pandas as pd
data = {'date': ['2014-05-01 18:47:05.069722', '2014-05-01 18:47:05.119994', '2014-05-02 18:47:05.178768', '2014-05-02 18:47:05.230071', '2014-05-02 18:47:05.230071', '2014-05-02 18:47:05.280592', '2014-05-03 18:47:05.332662', '2014-05-03 18:47:05.385109', '2014-05-04 18:47:05.436523', '2014-05-04 18:47:05.486877'],
'value': [1, 1, 1, 1, 1, 1, 1, 1, 1, 1]}
df = pd.DataFrame(data, columns = ['date', 'value'])
print(df)
What I want is a dataframe (or series) that looks like this:
date value
0 2014-05-01 2
1 2014-05-02 4
2 2014-05-03 2
3 2014-05-04 2
After reading a bunch of StackOverflow questions, the closest I can get is:
df['date'].groupby(df.index.map(lambda t: t.day))
But that doesn't produce anything of use.

Use resampling. The date column needs to have a datetime dtype (as is, it holds strings), and you'll need to set it as the index to use resampling.
In [13]: df['date'] = pd.to_datetime(df['date'])
In [14]: df.set_index('date').resample('D', 'count')
Out[14]:
value
date
2014-05-01 2
2014-05-02 4
2014-05-03 2
2014-05-04 2
You can use any arbitrary function or built-in convenience functions given as strings, including 'count' and 'sum' etc.
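Later pandas releases dropped the `how` argument used in the call above, so on a current install the same idea is usually written by chaining the aggregation onto the resampler. A minimal sketch, assuming a recent pandas version and the sample data from the question (the name `daily_counts` is only illustrative):

```python
import pandas as pd

data = {
    'date': ['2014-05-01 18:47:05.069722', '2014-05-01 18:47:05.119994',
             '2014-05-02 18:47:05.178768', '2014-05-02 18:47:05.230071',
             '2014-05-02 18:47:05.230071', '2014-05-02 18:47:05.280592',
             '2014-05-03 18:47:05.332662', '2014-05-03 18:47:05.385109',
             '2014-05-04 18:47:05.436523', '2014-05-04 18:47:05.486877'],
    'value': [1] * 10,
}
df = pd.DataFrame(data)

# Resampling needs a DatetimeIndex, so parse the strings first
df['date'] = pd.to_datetime(df['date'])

# One bin per calendar day; count() tallies the non-null rows in each bin
daily_counts = df.set_index('date').resample('D').count()
print(daily_counts)
```

Other aggregations follow the same pattern, for example `.resample('D').sum()` or `.resample('D').agg('count')`.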

Wow, @Jeff wins:
df.resample('D',how='count')
My worse answer:
The first problem is that your date column is strings, not datetimes. Using code from this thread:
import dateutil
df['date'] = df['date'].apply(dateutil.parser.parse)
Then it's trivial, and you had the right idea:
grouped = df.groupby(df['date'].apply(lambda x: x.date()))
grouped['value'].count()
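The same groupby idea can be written without `dateutil` or `apply` by using the `.dt` accessor on a parsed datetime column. This is only a sketch under the assumption that a recent pandas is installed; the shortened sample data and the name `per_day` are illustrative.

```python
import pandas as pd

df = pd.DataFrame({
    'date': ['2014-05-01 18:47:05', '2014-05-01 18:47:06',
             '2014-05-02 18:47:05', '2014-05-02 18:47:06',
             '2014-05-02 18:47:06', '2014-05-02 18:47:07',
             '2014-05-03 18:47:05', '2014-05-03 18:47:06',
             '2014-05-04 18:47:05', '2014-05-04 18:47:06'],
    'value': [1] * 10,
})
df['date'] = pd.to_datetime(df['date'])

# Group on the calendar date of each timestamp and count rows per group;
# reset_index turns the result into a two-column frame like the one requested
per_day = df.groupby(df['date'].dt.date).size().reset_index(name='value')
print(per_day)
```

Unlike resampling, this only lists days that actually occur in the data; days with no observations are not filled in with zero.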

I know nothing about pandas, but in Python you could do something like:
data = {'date': ['2014-05-01 18:47:05.069722', '2014-05-01 18:47:05.119994', '2014-05-02 18:47:05.178768', '2014-05-02 18:47:05.230071', '2014-05-02 18:47:05.230071', '2014-05-02 18:47:05.280592', '2014-05-03 18:47:05.332662', '2014-05-03 18:47:05.385109', '2014-05-04 18:47:05.436523', '2014-05-04 18:47:05.486877'],
'value': [1, 1, 1, 1, 1, 1, 1, 1, 1, 1]}
import datetime
dates = [datetime.datetime.strptime(ts, '%Y-%m-%d %H:%M:%S.%f').strftime('%Y-%m-%d') for ts in data['date']]
cnt = {}
for d in dates: cnt[d] = (cnt.get(d) or 0) + 1
for i, k in enumerate(sorted(cnt)):
    print("%d %s %d" % (i, k, cnt[k]))
Which would output:
0 2014-05-01 2
1 2014-05-02 4
2 2014-05-03 2
3 2014-05-04 2
If you didn't care about parsing and reformatting your datetime strings, I suppose something like
dates = [d[0:10] for d in data['date']]
could replace the longer dates=... line, but it seems less robust.

As exp1orer mentions, you'll need to convert the string dates to a date format. Or, if you simply want to count observations and don't care about the date format, you can take the first 10 characters of the date column and then use the value_counts() method. (Personally, I prefer this to groupby + sum for simple observation counts.)
You can achieve what you need with a one-liner:
In [93]: df.date.str[:10].value_counts()
Out[93]:
2014-05-02 4
2014-05-04 2
2014-05-01 2
2014-05-03 2
dtype: int64

Related

Pivoting 3 Dates Per Personal-ID to Columns

I have a dataframe (DF1) as such - each Personal-ID will have 3 dates associated w/that ID:
I have created a dataframe (DF_ID) with one row for each Personal-ID and a column for each respective date (currently blank). I would like to load the 3 dates per Personal-ID from DF1 into the respective date columns, so that the final dataframe looks as such:
I am trying to learn Python and have tried a number of scripts to accomplish this, such as:
{for index, row in df_bnp_5.iterrows():
df_id['Date-1'] = (row.loc[0,'hv_lab_test_dt'])
df_id['Date-2'] = (row.loc[1,'hv_lab_test_dt'])
df_id['Date-3'] = (row.loc[2,'hv_lab_test_dt'])
for i in range(len(df_bnp_5)) :
df_id['Date-1'] = df1.iloc[i, 0], df_id['Date-2'] = df1.iloc[i, 2])}
Any assistance would be appreciated.
Thank You!
Here is one way. I created a 'helper' column to arrange the dates for each Personal-ID.
import pandas as pd
# create data frame
df = pd.DataFrame({'Personal-ID': [1, 1, 1, 5, 5, 5],
'Date': ['10/01/2019', '12/28/2019', '05/08/2020',
'01/19/2020', '06/05/2020', '07/19/2020']})
# change data type
df['Date'] = pd.to_datetime(df['Date'], format='%m/%d/%Y')
# create grouping key
df['x'] = df.groupby('Personal-ID')['Date'].rank().astype(int)
# convert to wide table
df = df.pivot(index='Personal-ID', columns='x', values='Date')
# change column names
df = df.rename(columns={1: 'Date-1', 2: 'Date-2', 3: 'Date-3'})
print(df)
x Date-1 Date-2 Date-3
Personal-ID
1 2019-10-01 2019-12-28 2020-05-08
5 2020-01-19 2020-06-05 2020-07-19
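One caveat with the `rank()` helper: with its default settings, two identical dates for the same Personal-ID share an averaged rank, so after `astype(int)` the pivot can hit duplicate entries. A hedged variant that numbers the rows with `cumcount()` after sorting avoids that. The names `df1`, `slot`, and `wide` below are illustrative, not from the original post.

```python
import pandas as pd

df1 = pd.DataFrame({'Personal-ID': [1, 1, 1, 5, 5, 5],
                    'Date': ['10/01/2019', '12/28/2019', '05/08/2020',
                             '01/19/2020', '06/05/2020', '07/19/2020']})
df1['Date'] = pd.to_datetime(df1['Date'], format='%m/%d/%Y')

# Sort so that the running number follows chronological order per ID
df1 = df1.sort_values(['Personal-ID', 'Date'])

# cumcount() numbers rows 0, 1, 2 within each ID, even when dates repeat
df1['slot'] = df1.groupby('Personal-ID').cumcount() + 1

# One row per ID, one column per slot
wide = df1.pivot(index='Personal-ID', columns='slot', values='Date')
wide.columns = [f'Date-{n}' for n in wide.columns]
print(wide)
```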

Change date format of pandas column (month-day-year to day-month-year)

Got the following issue.
I have a column in my pandas dataframe with some dates and some empty values.
Example:
1 - 3-20-2019
2 -
3 - 2-25-2019
etc
I want to convert the format from month-day-year to day-month-year, and when it's empty, I just want to keep it empty.
What is the fastest approach?
Thanks!
One can initialize the data for the days using strings, then convert the strings to datetimes. A print can then deliver the objects in the needed format.
I will use another format (with dots as separators), so that the conversion between the steps is clear.
Sample code first:
import pandas as pd
data = {'day': ['3-20-2019', None, '2-25-2019'] }
df = pd.DataFrame( data )
df['day'] = pd.to_datetime(df['day'])
df['day'] = df['day'].dt.strftime('%d.%m.%Y')
df[ df == 'NaT' ] = ''
Comments on the above.
The first instance of df is in the ipython interpreter:
In [56]: df['day']
Out[56]:
0 3-20-2019
1 None
2 2-25-2019
Name: day, dtype: object
After the conversion to datetime:
In [58]: df['day']
Out[58]:
0 2019-03-20
1 NaT
2 2019-02-25
Name: day, dtype: datetime64[ns]
so that we have
In [59]: df['day'].dt.strftime('%d.%m.%Y')
Out[59]:
0 20.03.2019
1 NaT
2 25.02.2019
Name: day, dtype: object
That NaT makes problems. So we replace all its occurrences with the empty string.
In [73]: df[ df=='NaT' ] = ''
In [74]: df
Out[74]:
day
0 20.03.2019
1
2 25.02.2019
Not sure if this is the fastest way to get it done. Anyway,
df = pd.DataFrame({'Date': {0: '3-20-2019', 1:"", 2:"2-25-2019"}}) #your dataframe
df['Date'] = pd.to_datetime(df.Date) #convert to datetime format
df['Date'] = [d.strftime('%d-%m-%Y') if not pd.isnull(d) else '' for d in df['Date']]
Output:
Date
0 20-03-2019
1
2 25-02-2019
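Both answers let pandas guess the input format. A slightly more defensive sketch passes the month-day-year format explicitly and blanks out the missing rows with a mask, which does not depend on how a given pandas version renders `NaT` after `strftime`. It assumes the missing values are `None`/NaN; the name `parsed` is illustrative.

```python
import pandas as pd

df = pd.DataFrame({'day': ['3-20-2019', None, '2-25-2019']})

# Parse as month-day-year explicitly; missing values become NaT
parsed = pd.to_datetime(df['day'], format='%m-%d-%Y')

# Format as day-month-year and keep an empty string wherever the date was missing
df['day'] = parsed.dt.strftime('%d-%m-%Y').where(parsed.notna(), '')
print(df)
```

Note that the result is a column of strings, so it is for display or export; keep the `parsed` datetime column around if further date arithmetic is needed.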

Convert Time to Integer

I have a column in a dataframe that contains time in the below format.
Dataframe: df
column: time
value: 07:00:00, 13:00:00 or 14:00:00
The column will have only one of these three values in each row. I want to convert these to 0, 1 and 2. Can you help replace the times with these numeric values?
Current:
df['time'] = [07:00:00, 13:00:00, 14:00:00]
Expected:
df['time'] = [0, 1, 2]
Thanks in advance.
You can use map to do this:
import datetime
mapping = {datetime.time(7, 0, 0): 0, datetime.time(13, 0, 0): 1, datetime.time(14, 0, 0): 2}
df['time']=df['time'].map(mapping)
One approach is to use map
Ex:
val = {"07:00:00":0, "13:00:00":1, "14:00:00":2}
df = pd.DataFrame({'time':["07:00:00", "13:00:00", "14:00:00"] })
df["time"] = df["time"].map(val)
print(df)
Output:
time
0 0
1 1
2 2
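If writing the mapping by hand is not desired, the codes can also be derived from the data. The sketch below assumes the column holds zero-padded `HH:MM:SS` strings (so alphabetical order equals chronological order) and uses `pd.factorize` with `sort=True`; the column name `time_code` is illustrative.

```python
import pandas as pd

df = pd.DataFrame({'time': ['13:00:00', '07:00:00', '14:00:00', '07:00:00']})

# factorize assigns an integer code to each distinct value;
# sort=True orders the distinct values first, so the earliest time gets 0
codes, uniques = pd.factorize(df['time'], sort=True)
df['time_code'] = codes

print(df)
print(list(uniques))  # the distinct times, in code order
```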

Python pandas dataframes - transforming 2 columns with date ranges - into monthly columns for each month

I am new to Python, starting to use Pandas to replace some processes done in MS Excel.
Below is my problem description
Initial dataframe:
Contract Id, Start date, End date
12378, '01-01-2018', '15-05-2018'
45679, '10-03-2018', '31-07-2018'
567982, '01-01-2018', '31-12-2020'
Expected output
Contract Id , Start date, End date, Jan-18,Feb-18,Mar-18,Apr-18,May-18...Dec-18
12378, '01-01-2018', '15-05-2018', 1, 1, 1, 1, 1, 0, 0, 0, 0, .....,0
45679, '10-03-2018', '31-07-2018', 0, 0, 1, 1, 1, 1, 1, 0, 0, 0....,0
567982,'01-01-2018', '31-12-2020', 1, 1, 1, 1.........………..., 1, 1, 1
I would like to create a set of new columns with Month Id as column headers and populate them with a flag (1 or 0) if the contract is active during the specified month.
any help will be highly appreciated. Thank you
I also am new to pandas. Maybe there is a better method to do this, but here is what I have:
data['S_month'] = data['S'].apply(lambda x:int(x.split('-')[1]))
data['E_month'] = data['E'].apply(lambda x:int(x.split('-')[1]))
months = []
for s_e in data[['S_month','E_month']].values:
    month = np.zeros(12)
    month[s_e[0]-1:s_e[1]] = 1
    months.append(month)
months = pd.DataFrame(months,dtype=int,columns=np.arange(1,13))
data.join(months)
Or you could just skip the first two lines and do this:
months = []
for s_e in data[['S','E']].values:
    month = np.zeros(12)
    month[int(s_e[0].split('-')[1])-1:int(s_e[1].split('-')[1])] = 1
    months.append(month)
months = pd.DataFrame(months,dtype=int,columns=np.arange(1,13))
data.join(months)
This approach uses the very rich date functionality in pandas, specifically the PeriodIndex:
import pandas as pd
import numpy as np
from io import StringIO
# Sample data (simplified)
df1 = pd.read_csv(StringIO("""
'Contract Id','Start date','End date'
12378,'01-02-2018','15-03-2018'
45679,'10-03-2018','31-05-2018'
567982,'01-01-2018','30-06-2018'
"""), quotechar="'", dayfirst=True, parse_dates=[1,2])
# Establish the month dates as a pandas PeriodIndex, using month end
dates = pd.period_range(df1['Start date'].min(), df1['End date'].max(), freq="M")
# create new dataframe with date matches with apply
# Match the start dates to the periods using the Period dates comparisons
# AND the result elementwise using numpy logical_and
data = df1.apply(lambda r: pd.Series(np.logical_and(r[1] <= dates, r[2] >= dates).astype(int)), axis=1)
# Data frame with named month columns
df2 = pd.DataFrame(data)
df2.columns = dates
# Concatenate
result = pd.concat([df1, df2], axis=1)
result
# Contract Id Start date End date 2018-01 2018-02 2018-03 2018-04 2018-05 2018-06
#0 12378 2018-02-01 2018-03-15 0 1 1 0 0 0
#1 45679 2018-03-10 2018-05-31 0 0 1 1 1 0
#2 567982 2018-01-01 2018-06-30 1 1 1 1 1 1
Pandas comes with a lot of built-in date/time handling methods that can be put to good use here.
# SETUP
# -----
import pandas as pd
# Initialize input dataframe
data = [
[12378, '01-01-2018', '15-05-2018'],
[45679, '10-03-2018', '31-07-2018'],
[567982, '01-01-2018', '31-12-2020'],
]
columns = ['Contract Id', 'Start date', 'End date']
df = pd.DataFrame(data, columns=columns)
# SOLUTION
# --------
# Convert strings to datetime objects
df['Start date'] = pd.to_datetime(df['Start date'], format='%d-%m-%Y')
df['End date'] = pd.to_datetime(df['End date'], format='%d-%m-%Y')
# For each month in year 2018 ...
for x in pd.date_range('2018-01', '2018-12', freq='MS'):
    # Create a column with contract-active flags
    df[x.strftime("%b-%y")] = (df['Start date'].dt.month <= x.month) & (x.month <= df['End date'].dt.month)
    # Optional: convert True/False values to 0/1 values
    df[x.strftime("%b-%y")] = df[x.strftime("%b-%y")].astype(int)
which gives the following result:
In [1]: df
Out[1]:
Contract Id Start date End date Jan-18 Feb-18 Mar-18 Apr-18 May-18 Jun-18 Jul-18 Aug-18 Sep-18 Oct-18 Nov-18 Dec-18
0 12378 2018-01-01 2018-05-15 1 1 1 1 1 0 0 0 0 0 0 0
1 45679 2018-03-10 2018-07-31 0 0 1 1 1 1 1 0 0 0 0 0
2 567982 2018-01-01 2020-12-31 1 1 1 1 1 1 1 1 1 1 1 1
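The loop above compares month numbers only, which works for this sample but would misfire for a contract that runs, say, from November 2018 to February 2019. A hedged, year-aware sketch converts the dates to monthly periods and compares those instead. The names `start_m` and `end_m` are illustrative, and the English month abbreviations in the column names assume the default locale.

```python
import pandas as pd

df = pd.DataFrame(
    [[12378, '01-01-2018', '15-05-2018'],
     [45679, '10-03-2018', '31-07-2018'],
     [567982, '01-01-2018', '31-12-2020']],
    columns=['Contract Id', 'Start date', 'End date'],
)
df['Start date'] = pd.to_datetime(df['Start date'], format='%d-%m-%Y')
df['End date'] = pd.to_datetime(df['End date'], format='%d-%m-%Y')

# Reduce each date to its year-month period so the comparison includes the year
start_m = df['Start date'].dt.to_period('M')
end_m = df['End date'].dt.to_period('M')

# One 0/1 column per month: active if the month lies between start and end
for p in pd.period_range('2018-01', '2018-12', freq='M'):
    df[p.strftime('%b-%y')] = ((start_m <= p) & (end_m >= p)).astype(int)

print(df)
```

Widening the `period_range` bounds (for example through '2020-12') extends the columns to later years without changing the comparison.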

Removing the words DateTimeIndex from a list of dates

I have multiple lists of dates in a pandas dataframe in this format:
col1 col2
1 [DatetimeIndex(['2018-10-01', '2018-10-02',
'2018-10-03', '2018-10-04'],
dtype='datetime64[ns]', freq='D')
I would like to take off the words DatetimeIndex and dtype='datetime64[ns]', freq='D' and turn the list into a set. The format I would be looking for is:
{'2018-10-01', '2018-10-02', '2018-10-03', '2018-10-04'}
Pandas is not designed to hold collections within series values, so what you are looking to do is strongly discouraged. A much better idea, especially if you have a consistent number of values in each DatetimeIndex series value, is to join extra columns:
D = pd.DatetimeIndex(['2018-10-01', '2018-10-02', '2018-10-03', '2018-10-04'],
dtype='datetime64[ns]', freq='D')
df = pd.DataFrame({'col1': [1], 'col2': [D]})
df = df.join(pd.DataFrame(df.pop('col2').values.tolist()))
print(df)
col1 0 1 2 3
0 1 2018-10-01 2018-10-02 2018-10-03 2018-10-04
If you really want a set as each series value, you can do so via map + set:
df['col2'] = list(map(set, df['col2'].values))
print(df)
col1 col2
0 1 {2018-10-01 00:00:00, 2018-10-02 00:00:00, 201...
Have you tried:
set(index_object.tolist())
I suspect this will return a set of Timestamp objects rather than strings, so whether this is what you want depends on your use case.
If it's strings you want, you can modify the code as follows (a DatetimeIndex has strftime directly; the .dt accessor exists only on a Series):
set(index_object.strftime("%Y-%m-%d").tolist())
For your specific format (which I don't necessarily approve of!) you can try this:
import itertools
string_lists = col2.apply(lambda x: x.strftime("%Y-%m-%d").tolist())
unique_set = set(itertools.chain.from_iterable(string_lists.tolist()))
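To make the flattening step concrete, here is a small self-contained sketch that works on a plain Python list of `DatetimeIndex` objects rather than a dataframe column. The sample ranges and the names `index_list` and `unique_dates` are made up for illustration.

```python
import itertools
import pandas as pd

# Two overlapping daily ranges standing in for the values held in col2
index_list = [
    pd.date_range('2018-10-01', periods=4, freq='D'),
    pd.date_range('2018-10-03', periods=3, freq='D'),
]

# Format every index as 'YYYY-MM-DD' strings
string_lists = [idx.strftime('%Y-%m-%d') for idx in index_list]

# Chain the per-index results together and deduplicate with a set
unique_dates = set(itertools.chain.from_iterable(string_lists))
print(sorted(unique_dates))
```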
