Pandas resample dataframe with strings and floats

I have a dataframe whose index is a DatetimeIndex, with one row per day over the course of a couple of years. I need to resample the dataframe by month so that the two float columns are summed, while the string columns give all the unique values seen during that month. I can do the resampling on a single column, but I don't know how to do it for everything, or how to combine the results back together if I do it one column at a time.
For the floats I was trying:
# go through the column list
for col in col_list:
    # process all run time columns for month
    if "float" in str(col):
        # resample for one month and sum
        df[col] = df[col].resample('M').sum()
        # rename the column
        df.rename(columns={col: col + " MONTHLY"}, inplace=True)
and for the strings:
elif "string" in str(col):
# get all the unique jobs run during the month
df[col] = df[col].groupby(pd.Grouper(freq='M')).unique()
df.rename(columns={col: col + " MONTHLY"}, inplace=True)
These were inserting the monthly values back into the dataframe while every daily row still existed, which made the results hard to find and isn't what I need.
Some sample data:
float_1 float_2 string_1 string_2
12/30/2019 1 2 a a
12/31/2019 1 3 a b
1/1/2020 2 4 a c
1/2/2020 3 5 b d
The expected output would be:
12/2019 2 5 a a, b
1/2020 5 9 a, b c, d
Not sure if it matters, but the real data does have NaN on random days throughout.

Try aggregating the numeric and non-numeric columns separately and then joining them back:
df.index = pd.to_datetime(df.index)
numerics = df.select_dtypes('number').resample('M').sum()
strings = df.select_dtypes('object').resample('M').agg(lambda x: ','.join(set(x)))
numerics.join(strings)
# float_1 float_2 string_1 string_2
#2019-12-31 2 5 a a,b
#2020-01-31 5 9 a,b d,c
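A note on those NaNs mentioned in the question: sum() skips them by default, but ','.join(set(x)) will raise a TypeError if a NaN (a float) ends up in a string column. A minimal sketch that drops missing values inside each month first, assuming the same df as above:
# drop NaN within each month before joining, so the join never sees a float
strings = df.select_dtypes('object').resample('M').agg(lambda x: ','.join(sorted(set(x.dropna()))))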

Related

How to populate date in a dataframe using pandas in python

I have a dataframe with two columns, Case and Date. Here Date is actually the starting date. I want to populate it as a time series, say adding three (month_num) more dates to each case and removing the original rows.
original dataframe:
Case Date
0 1 2010-01-01
1 2 2011-04-01
2 3 2012-08-01
after populating dates:
Case Date
0 1 2010-02-01
1 1 2010-03-01
2 1 2010-04-01
3 2 2011-05-01
4 2 2011-06-01
5 2 2011-07-01
6 3 2012-09-01
7 3 2012-10-01
8 3 2012-11-01
I tried declaring an empty dataframe with the same column names and data types, then used a for loop over Case and month_num to add rows into the new dataframe.
import pandas as pd
data = [[1, '2010-01-01'], [2, '2011-04-01'], [3, '2012-08-01']]
df = pd.DataFrame(data, columns=['Case', 'Date'])
df.Date = pd.to_datetime(df.Date)
df_new = pd.DataFrame(columns=df.columns)
df_new['Case'] = pd.to_numeric(df_new['Case'])
df_new['Date'] = pd.to_datetime(df_new['Date'])
month_num = 3
for c in df.Case:
    for m in range(1, month_num+1):
        temp = df.loc[df['Case']==c]
        temp['Date'] = temp['Date'] + pd.DateOffset(months=m)
        df_new = pd.concat([df_new, temp])
df_new.reset_index(inplace=True, drop=True)
My code works; however, when the original dataframe and month_num get large, it takes a huge amount of time to run. Are there any better ways to do what I need? Thanks a lot!
Your performance issue is probably related to the use of pd.concat inside the inner for loop. This answer explains why.
As the answer suggests, you may want to use an external list to collect all the dataframes you create in the for loop, and then concatenate the list once, as sketched below.
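A minimal sketch of that suggestion, keeping your original loop but calling pd.concat a single time at the end:
frames = []
for c in df.Case:
    for m in range(1, month_num + 1):
        temp = df.loc[df['Case'] == c].copy()
        temp['Date'] = temp['Date'] + pd.DateOffset(months=m)
        frames.append(temp)
# one concat instead of one per iteration
df_new = pd.concat(frames, ignore_index=True)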
Given your input data this is what worked on my notebook:
df2=pd.DataFrame()
df2['Date']=df['Date'].apply(lambda x: pd.date_range(start=x, periods=3,freq='M')).explode()
df3=pd.merge_asof(df2,df,on='Date')
df3['Date']=df3['Date']+ pd.DateOffset(days=1)
df3[['Case','Date']]
We create df2 and populate its 'Date' column with the needed dates derived from the original df.
Then df3 results from a merge_asof between df2 and df (to populate the 'Case' column).
Finally, we offset the resulting dates by 1 day.

How to remove a row in a dataframe if the entry of a specific column is numeric

It is not difficult to remove a row from a dataframe when a specific column's entry is non-numeric.
But in my case I have the opposite: I must remove the rows whose entry in a specific column is numeric.
Convert the column to numeric with errors='coerce' and test for missing values to remove the numeric entries:
df = pd.DataFrame(data={'tested col': [1,'b',2,'3'],
                        'A': [1,2,3,4]})
print (df)
tested col A
0 1 1
1 b 2
2 2 3
3 3 4
df1 = df[pd.to_numeric(df['tested col'], errors='coerce').isna()]
print (df1)
tested col A
1 b 2
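For the reverse filter (keeping only the numeric entries, i.e. the case the question calls easy), the same trick works with notna():
df2 = df[pd.to_numeric(df['tested col'], errors='coerce').notna()]
print (df2)
  tested col  A
0          1  1
2          2  3
3          3  4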

How to re-index as multi-index pandas dataframe from index value that repeats

I have an index in a pandas dataframe which repeats the index value. I want to re-index as multi-index where repeated indexes are grouped.
The index looks like this (screenshot not included): the same value, e.g. 112335586, repeated over many rows, so I would like all the 112335586 index values to be grouped under the same index.
I have looked at this question, Create pandas dataframe by repeating one row with new multiindex, but there the index values can be pre-defined, which is not possible here as my dataframe is far too large to hard-code them.
I also looked at the MultiIndex documentation, but this also pre-defines the values for the index.
I believe you need:
s = pd.Series([1,2,3,4], index=[10,10,20,20])
s.index.name = 'EVENT_ID'
print (s)
EVENT_ID
10 1
10 2
20 3
20 4
dtype: int64
s1 = s.index.to_series()
s2 = s1.groupby(s1).cumcount()
s.index = [s.index, s2]
print (s)
EVENT_ID
10 0 1
1 2
20 0 3
1 4
dtype: int64
Try this:
df.reset_index(inplace=True)
df['sub_idx'] = df.groupby('EVENT_ID').cumcount()
df.set_index(['EVENT_ID','sub_idx'], inplace=True)
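Applied to a small frame with the same repeated EVENT_ID index as above, this produces the same grouped MultiIndex (a quick illustration, not from the original post):
df = pd.DataFrame({'val': [1, 2, 3, 4]}, index=pd.Index([10, 10, 20, 20], name='EVENT_ID'))
df.reset_index(inplace=True)
df['sub_idx'] = df.groupby('EVENT_ID').cumcount()
df.set_index(['EVENT_ID','sub_idx'], inplace=True)
print (df)
                  val
EVENT_ID sub_idx
10       0          1
         1          2
20       0          3
         1          4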

Sort a column containing string in Pandas

I am new to Pandas, and looking to sort a column containing strings and generate a numerical value to uniquely identify the string. My data frame looks something like this:
df = pd.DataFrame({'key': range(8), 'year_week': ['2015_10', '2015_1', '2015_11', '2016_9', '2016_10','2016_3', '2016_9', '2016_10']})
First I'd like to sort the 'year_week' column into ascending order (2015_1, 2015_10, 2015_11, 2016_3, 2016_9, 2016_9, 2016_10, 2016_10) and then generate a numerical value for each unique 'year_week' string.
You can first convert the year_week column with to_datetime, then sort by the resulting date with sort_values, and finally use factorize to number the unique values:
df = pd.DataFrame({'key': range(8), 'year_week': ['2015_10', '2015_1', '2015_11', '2016_9', '2016_10','2016_3', '2016_9', '2016_10']})
#http://stackoverflow.com/a/17087427/2901002
df['date'] = pd.to_datetime(df.year_week + '-0', format='%Y_%W-%w')
#sort by column date
df.sort_values('date', inplace=True)
#create numerical values
df['num'] = pd.factorize(df.year_week)[0]
print (df)
key year_week date num
1 1 2015_1 2015-01-11 0
0 0 2015_10 2015-03-15 1
2 2 2015_11 2015-03-22 2
5 5 2016_3 2016-01-24 3
3 3 2016_9 2016-03-06 4
6 6 2016_9 2016-03-06 4
4 4 2016_10 2016-03-13 5
7 7 2016_10 2016-03-13 5
## 1st method: this one scales to a large dataset
## Split the "year_week" column into 2 columns
df[['year', 'week']] = df['year_week'].str.split("_", expand=True)
## Change the datatype of the newly created columns
df['year'] = df['year'].astype('int')
df['week'] = df['week'].astype('int')
## Sort the dataframe by the newly created columns
df = df.sort_values(['year', 'week'], ascending=True)
## Drop the year & week columns
df.drop(['year', 'week'], axis=1, inplace=True)
## Sorted dataframe
df
## 2nd method: this one is only practical for a small dataset
## Change the datatype of the column
df['year_week'] = df['year_week'].astype('str')
## List the categories in the order you want
cats = ['2015_1', '2015_10', '2015_11', '2016_3', '2016_9', '2016_10']
## Use pd.Categorical() to make the column an ordered categorical
df['year_week'] = pd.Categorical(df['year_week'], categories=cats, ordered=True)
## Sort the 'year_week' column
df = df.sort_values('year_week')
## Sorted dataframe
df
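After either sort, the unique numerical value the question asks for can be generated the same way as in the first answer, since identical year_week strings are now adjacent and in ascending order:
## Assign an increasing id per unique year_week (borrowed from the first answer)
df['num'] = pd.factorize(df['year_week'])[0]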

Fastest way to add an extra row to a groupby in pandas

I'm trying to create a new row for each group in a dataframe by copying the last row and then modifying some values. My approach is as follows; the concat step appears to be the bottleneck (I tried append too). Any suggestions?
def genNewObs(df):
    lastRowIndex = df.obsNumber.idxmax()
    row = pd.DataFrame(df.loc[lastRowIndex].copy())
    # changes some other values in row here
    df = pd.concat([df, row], ignore_index=True)
    return df

df = df.groupby(GROUP).apply(genNewObs)
Edit 1: Basically I have a bunch of data with the last observation on different dates. I want to create a final observation for all groups on the current date.
Group Date Days Since last Observation
A 1/1/2014 0
A 1/10/2014 9
B 1/5/2014 0
B 1/25/2014 20
B 1/27/2014 2
If we pretend the current date is 1/31/2014 this becomes:
Group Date Days Since last Observation
A 1/1/2014 0
A 1/10/2014 9
A 1/31/2014 21
B 1/5/2014 0
B 1/25/2014 20
B 1/27/2014 2
B 1/31/2014 4
I've tried setting with enlargement and it is the slowest of all techniques. Any ideas?
Thanks to user1827356, I sped it up by a factor of 100 by taking the operation out of the apply. For some reason first was dropping the Group column, so I used idxmax instead.
def genNewObs(df):
    lastRowIndex = df.groupby(Group).Date.idxmax()
    rows = df.loc[lastRowIndex]
    df = pd.concat([df, rows], ignore_index=True)
    df = df.sort_values([Group, Date], ascending=True)
    return df
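For completeness, a sketch of the remaining steps on the sample data above; the column names and the "current date" of 1/31/2014 come from the example, while stamping the copied rows and recomputing the gap are assumptions about the "changes some other values" step:
current_date = pd.Timestamp('2014-01-31')
# assumes df['Date'] has already been converted with pd.to_datetime
last = df.loc[df.groupby('Group')['Date'].idxmax()].copy()
# recompute the day gap against the current date, then stamp the new rows with it
last['Days Since last Observation'] = (current_date - last['Date']).dt.days
last['Date'] = current_date
df = pd.concat([df, last], ignore_index=True).sort_values(['Group', 'Date'])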
