Combining similar dataframe rows - python

I currently have a dataframe which looks like this
User Date FeatureA FeatureB
John DateA 1 2
John DateB 3 5
Is there any way that I can combine the two rows so that it becomes:
User Date1 Date2 FeatureA1 FeatureB1 FeatureA2 FeatureB2
John DateA DateB 1 2 3 5

I think you need:
g = df.groupby(['User']).cumcount()
df = df.set_index(['User', g]).unstack()
df.columns = ['{}{}'.format(i, j+1) for i, j in df.columns]
df = df.reset_index()
print (df)
User Date1 Date2 FeatureA1 FeatureA2 FeatureB1 FeatureB2
0 John DateA DateB 1 3 2 5
Explanation:
Get the counter per User group with cumcount
Create a MultiIndex with set_index
Reshape with unstack
Flatten the MultiIndex in columns
Convert the index back to columns with reset_index
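For completeness, a minimal construction of the example frame (values taken from the question), so the snippet above can be run as-is:
import pandas as pd

# sample data as shown in the question
df = pd.DataFrame({'User': ['John', 'John'],
                   'Date': ['DateA', 'DateB'],
                   'FeatureA': [1, 3],
                   'FeatureB': [2, 5]})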

Related

Identify invalid dates in pandas dataframe columns

Suppose we had the following dataframe-
How can I create the fourth column 'Invalid dates' as specified below using the first three columns in the dataframe?
Name Date1 Date2 Invalid dates
0 A 01-02-2022 03-04-2000 None
1 B 23 12-12-2012 Date1
2 C 18-04-1993 abc Date2
3 D 45 qcf Date1, Date2
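The answers below assume a frame built from the values shown above; a minimal construction for reference (numpy is only needed for the last snippet, which uses np.nan):
import pandas as pd
import numpy as np

# sample data as shown in the question
df = pd.DataFrame({'Name': ['A', 'B', 'C', 'D'],
                   'Date1': ['01-02-2022', '23', '18-04-1993', '45'],
                   'Date2': ['03-04-2000', '12-12-2012', 'abc', 'qcf']})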
You can select the Date columns with filter (or any other method, including a manual list), compute the invalid entries by converting with to_datetime and keeping the NaN values (i.e. invalid dates) with isna, then stack and join the offending column names back to the original DataFrame:
s = (df
.filter(like='Date') # keep only "Date" columns
# convert to datetime, NaT will be invalid dates
.apply(lambda s: pd.to_datetime(s, format='%d-%m-%Y', errors='coerce'))
.isna()
# reshape to long format (Series)
.stack()
)
out = (df
.join(s[s].reset_index(level=1) # keep only invalid dates
.groupby(level=0)['level_1'] # for all initial indices
.agg(','.join) # join the column names
.rename('Invalid Dates')
)
)
An alternative with melt to reshape the DataFrame:
cols = df.filter(like='Date').columns
out = df.merge(
df.melt(id_vars='Name', value_vars=cols, var_name='Invalid Dates')
.assign(value=lambda d: pd.to_datetime(d['value'], format='%d-%m-%Y',
errors='coerce'))
.loc[lambda d: d['value'].isna()]
.groupby('Name')['Invalid Dates'].agg(','.join),
left_on='Name', right_index=True, how='left'
)
output:
Name Date1 Date2 Invalid Dates
0 A 01-02-2022 03-04-2000 NaN
1 B 23 12-12-2012 Date1
2 C 18-04-1993 abc Date2
3 D 45 qcf Date1,Date2
Use DataFrame.filter to select the columns whose names contain Date, then convert all columns of df1 to datetimes with to_datetime and errors='coerce' (non-matching values become missing), test them with DataFrame.isna, and extract the offending column names, separated by ',', with DataFrame.dot:
df1 = df.filter(like='Date')
df['Invalid dates'] = ((df1.apply(lambda x: pd.to_datetime(x, format='%d-%m-%Y',
                                                           errors='coerce'))
                           .isna() & df1.notna())
                       .dot(df1.columns + ',')
                       .str[:-1]
                       .replace('', np.nan))
print (df)
Name Date1 Date2 Invalid dates
0 A 01-02-2022 03-04-2000 NaN
1 B 23 12-12-2012 Date1
2 C 18-04-1993 abc Date2
3 D 45 qcf Date1,Date2
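For anyone unfamiliar with the DataFrame.dot trick used above, a tiny standalone sketch: multiplying a boolean mask by the column-name strings and summing them concatenates the names of the True columns per row (the mask values here are made up purely for illustration):
import pandas as pd

mask = pd.DataFrame({'Date1': [True, False], 'Date2': [True, True]})
print (mask.dot(mask.columns + ',').str[:-1])
#0    Date1,Date2
#1          Date2
#dtype: object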

pandas reset index after performing groupby and retain selective columns

I want to take a pandas dataframe, count the unique elements grouped by a column and retain two of the columns. But after the groupby I get a MultiIndex dataframe which I am unable to (1) flatten and (2) restrict to only the relevant columns. Here is my code:
import pandas as pd
df = pd.DataFrame({
'ID':[1,2,3,4,5,1],
'Ticker':['AA','BB','CC','DD','CC','BB'],
'Amount':[10,20,30,40,50,60],
'Date_1':['1/12/2018','1/14/2018','1/12/2018','1/14/2018','2/1/2018','1/12/2018'],
'Random_data':['ax','','nan','','by','cz'],
'Count':[23,1,4,56,34,53]
})
df2 = df.groupby(['Ticker']).agg(['nunique'])
df2.reset_index()
print(df2)
df2 still comes out with two levels of column index, and it has all the columns: Amount, Count, Date_1, ID, Random_data.
How do I reduce it to one level of index?
And retain only ID and Random_data columns?
Try this instead:
1) Select only the relevant columns (['ID', 'Random_data'])
2) Don't pass a list to .agg - just 'nunique' - the list is what causes the MultiIndex behaviour.
df2 = df.groupby('Ticker')[['ID', 'Random_data']].agg('nunique')
df2.reset_index()
Ticker ID Random_data
0 AA 1 1
1 BB 2 2
2 CC 2 2
3 DD 1 1
Use GroupBy.nunique and select the wanted columns as a list after the groupby:
df2 = df.groupby('Ticker')[['Date_1','Count','ID']].nunique().reset_index()
print(df2)
Ticker Date_1 Count ID
0 AA 1 1 1
1 BB 2 2 2
2 CC 2 2 2
3 DD 1 1 1
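If you do want to keep the original agg(['nunique']) call, one way to answer the "reduce it to one level of index" part is to flatten the MultiIndex columns afterwards, similar to the column flattening shown in the first question above. A sketch; the joined names like 'ID_nunique' are just an assumed convention:
df2 = df.groupby('Ticker').agg(['nunique'])
df2.columns = ['_'.join(col) for col in df2.columns]   # e.g. 'ID_nunique'
df2 = df2.reset_index()[['Ticker', 'ID_nunique', 'Random_data_nunique']]
print (df2)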

Add an indicator to show where the data came from - Python

Many thanks for reading.
I have a pandas data frame which is the result of a concatenation of multiple smaller data frames. What I want to do is add multiple indicator columns to my final data frame, so that I can see what smaller data frame each row came from.
This would be my desired result:
Forename Surname Ind_1 Ind_2 Ind_3 Ind_4
jon smith 0 0 0 1
charlie jim 1 0 0 1
ian james 0 1 0 0
For example, "Jon Smith" came from data frame 4, and 'Charlie Jim" came from data frames 1 and 4 (duplicate rows).
I have been able to achieve this for rows that only came from one data frame (e.g. rows 1 and 3) but not for duplicate rows that came from multiple data frames (e.g. row 2).
Many thanks for any help.
You can use:
first concat with the keys parameter to identify the DataFrames
reset_index to turn the MultiIndex into columns
groupby and join the indicator labels
create the indicator columns with str.get_dummies
reindex if you need to append all-zero columns for missing categories
reset_index to turn the index back into columns
df1 = pd.DataFrame({'Forename':['charlie'], 'Surname':['jim']})
df2 = pd.DataFrame({'Forename':['ian'], 'Surname':['james']})
df3 = pd.DataFrame()
df4 = pd.DataFrame({'Forename':['charlie', 'jon'], 'Surname':['jim', 'smith']})
#list of DataFrames
dfs = [df1, df2, df3, df4]
#generate indicators
inds = ['Ind_{}'.format(x+1) for x in range(len(dfs))]
df = (pd.concat(dfs, keys=inds)
.reset_index()
.groupby(['Forename','Surname'])['level_0']
.apply('|'.join)
.str.get_dummies()
.reindex(columns=inds, fill_value=0)
.reset_index())
print (df)
Forename Surname Ind_1 Ind_2 Ind_3 Ind_4
0 charlie jim 1 0 0 1
1 ian james 0 1 0 0
2 jon smith 0 0 0 1
More general solution with groupby by all columns:
df = pd.concat(dfs, keys=inds)
print (df)
Forename Surname
Ind_1 0 charlie jim
Ind_2 0 ian james
Ind_4 0 charlie jim
1 jon smith
df1 = (df.reset_index()
.groupby(df.columns.tolist())['level_0']
.apply('|'.join)
.str.get_dummies()
.reindex(columns=inds, fill_value=0)
.reset_index())
print (df1)
Forename Surname Ind_1 Ind_2 Ind_3 Ind_4
0 charlie jim 1 0 0 1
1 ian james 0 1 0 0
2 jon smith 0 0 0 1
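A possible alternative sketch: apply get_dummies to the concat keys instead of str.get_dummies, then collapse duplicate rows with max. It reuses the same dfs and inds as above; the 'source' column name is just a throwaway label introduced here:
stacked = (pd.concat(dfs, keys=inds)
             .reset_index(level=0)
             .rename(columns={'level_0': 'source'}))
res = (pd.get_dummies(stacked, columns=['source'], prefix='', prefix_sep='', dtype=int)
         .groupby(['Forename', 'Surname'], as_index=False).max()
         .reindex(columns=['Forename', 'Surname'] + inds, fill_value=0))
print (res)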

Python dataframe transpose where some rows have multiple values

I've a dataframe:
field,value
a,1
a,2
b,8
I want to pivot it to this form
a,b
1,8
2,8
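A minimal construction of the sample data for reference (values from the question). Note that the answers below differ in how they assume the file was read: the first uses named columns 'field'/'value', while the later ones assume no header row, i.e. columns labelled 0 and 1:
import pandas as pd

df = pd.DataFrame({'field': ['a', 'a', 'b'], 'value': [1, 2, 8]})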
set_index with a cumcount on each field group + field
unstack + ffill
df.set_index(
[df.groupby('field').cumcount(), 'field']
).value.unstack().ffill().astype(df.value.dtype)
field a b
0 1 8
1 2 8
You can do it like so:
# df = pd.read_clipboard(sep=',')
df.pivot(columns='field', values='value').bfill().dropna()
The following solutions assume the data was read without a header row, so the columns are labelled 0 and 1:
print (df)
0 1
0 a 1
1 a 2
2 b 8
Solution with creating groups for a new index by GroupBy.cumcount, then pivoting (via pivot_table, since a plain array index needs pivot_table in current pandas) and forward filling missing values:
g = df.groupby(0).cumcount()
df1 = (df.pivot_table(index=g, columns=0, values=1, aggfunc='first')
         .ffill().astype(int).rename_axis(None, axis=1))
print (df1)
a b
0 1 8
1 2 8
Another solution creates the groups with apply and reshapes with unstack:
print (df.groupby(0).apply(lambda x: pd.Series(x[1].values)).unstack(0).ffill().astype(int)
.rename_axis(None, axis=1))
a b
0 1 8
1 2 8
A much simpler solution would just be to do DataFrame.T (transpose)
df_new = df.T

Get data into monthly datetime index

I have a pd.DataFrame that looks like the one below
Start Date End Date
1/1/1990 7/1/2014
7/1/2005 5/1/2013
8/1/1997 8/1/2004
9/1/2001
I'd like to capture this data so that it shows, on a monthly DatetimeIndex, how many items had started but not yet ended as of each month. What I want it to look like is illustrated below.
Date Count
4/1/2013 3
5/1/2013 2
6/1/2013 2
7/1/2013 2
So far I have created a series whose index is a string combining the start and finish dates and which counts all items with the same start and end dates.
1/1/19007/1/2014 1
7/1/20055/1/2013 1
8/1/19978/1/2004 1
9/1/2001 1
And I have a dataframe with the datetimeindex looking as follows:
4/1/2013
5/1/2013
6/1/2013
7/1/2013
Now I'm struggling to combine the two to get what I'm looking for. I'm probably thinking about this all wrong and was looking for better ideas.
You can try:
print (df1)
Start Date End Date
0 1/1/1990 7/1/2014
1 7/1/2005 5/1/2013
2 8/1/1997 8/1/2004
3 9/1/2001 NaN
print (df2)
Index: [4/1/2013, 5/1/2013, 6/1/2013, 7/1/2013]
#drop NaT in columns Start Date, End Date
df1 = df1.dropna(subset=['Start Date','End Date'])
#convert columns to datetime and then to month period
df1['Start Date'] = pd.to_datetime(df1['Start Date']).dt.to_period('M')
df1['End Date'] = pd.to_datetime(df1['End Date']).dt.to_period('M')
#create new column from datetimeindex and convert it to month period
df2['Date'] = pd.DatetimeIndex(df2.index).to_period('M')
print (df1)
Start Date End Date
0 1990-01 2014-07
1 2005-07 2013-05
2 1997-08 2004-08
print (df2)
Date
Date
4/1/2013 2013-04
5/1/2013 2013-05
6/1/2013 2013-06
7/1/2013 2013-07
#stack data for resampling
df1 = df1.stack().reset_index(drop=True, level=1).reset_index(name='Date')
print (df1)
index Date
0 0 1990-01
1 0 2014-07
2 1 2005-07
3 1 2013-05
4 2 1997-08
5 2 2004-08
#resample by column index
df = df1.groupby(df1['index']).apply(lambda x: x.set_index('Date').resample('M').first()).reset_index(level=1)
#remove unnecessary column index
df = df.drop('index', axis=1)
print (df.head())
Date
index
0 1990-01
0 1990-02
0 1990-03
0 1990-04
0 1990-05
#merge df and df2 by column Date, groupby by Date and count
print (pd.merge(df, df2, on='Date').groupby('Date')['Date'].count())
Date
2013-04 2
2013-05 2
2013-06 1
2013-07 1
Freq: M, Name: Date, dtype: int64
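For reference, a more direct sketch with current pandas: compare month periods instead of expanding every month per item. It assumes an item with no End Date counts as still active (which matches the desired counts in the question), and the month range is hard-coded just for this example:
import pandas as pd

df = pd.DataFrame({'Start Date': ['1/1/1990', '7/1/2005', '8/1/1997', '9/1/2001'],
                   'End Date':   ['7/1/2014', '5/1/2013', '8/1/2004', None]})

start = pd.to_datetime(df['Start Date']).dt.to_period('M')
end = pd.to_datetime(df['End Date']).dt.to_period('M')

months = pd.period_range('2013-04', '2013-07', freq='M')
counts = pd.Series([((start <= m) & (end.isna() | (end > m))).sum() for m in months],
                   index=months, name='Count')
print (counts)
#2013-04    3
#2013-05    2
#2013-06    2
#2013-07    2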
