Combining similar dataframe rows - python

I currently have a dataframe which looks like this
User Date FeatureA FeatureB
John DateA 1 2
John DateB 3 5
Is there any way that I can combine the two rows so that it becomes:
User Date1 Date2 FeatureA1 FeatureB1 FeatureA2 FeatureB2
John DateA DateB 1 2 3 5

I think you need:
g = df.groupby(['User']).cumcount()
df = df.set_index(['User', g]).unstack()
df.columns = ['{}{}'.format(i, j+1) for i, j in df.columns]
df = df.reset_index()
print (df)
User Date1 Date2 FeatureA1 FeatureA2 FeatureB1 FeatureB2
0 John DateA DateB 1 3 2 5
Explanation:
Get the counter per User group with cumcount
Create a MultiIndex with set_index
Reshape with unstack
Flatten the MultiIndex in columns
Convert the index back to columns with reset_index
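For completeness, a minimal construction of the example frame (values taken from the question), so the snippet above can be run as-is:
import pandas as pd

# sample data as shown in the question
df = pd.DataFrame({'User': ['John', 'John'],
                   'Date': ['DateA', 'DateB'],
                   'FeatureA': [1, 3],
                   'FeatureB': [2, 5]})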

Related

Identify invalid dates in pandas dataframe columns

Suppose we had the following dataframe-
How can I create the fourth column 'Invalid dates' as specified below using the first three columns in the dataframe?
Name Date1 Date2 Invalid dates
0 A 01-02-2022 03-04-2000 None
1 B 23 12-12-2012 Date1
2 C 18-04-1993 abc Date2
3 D 45 qcf Date1, Date2
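The answers below assume a frame built from the values shown above; a minimal construction for reference (numpy is only needed for the last snippet, which uses np.nan):
import pandas as pd
import numpy as np

# sample data as shown in the question
df = pd.DataFrame({'Name': ['A', 'B', 'C', 'D'],
                   'Date1': ['01-02-2022', '23', '18-04-1993', '45'],
                   'Date2': ['03-04-2000', '12-12-2012', 'abc', 'qcf']})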
You can select the Date columns with filter (or any other method, including a manual list), compute the invalid entries by converting with to_datetime and keeping the NaN values (i.e. invalid dates) with isna, then stack and join the offending column names back to the original DataFrame:
s = (df
.filter(like='Date') # keep only "Date" columns
# convert to datetime, NaT will be invalid dates
.apply(lambda s: pd.to_datetime(s, format='%d-%m-%Y', errors='coerce'))
.isna()
# reshape to long format (Series)
.stack()
)
out = (df
.join(s[s].reset_index(level=1) # keep only invalid dates
.groupby(level=0)['level_1'] # for all initial indices
.agg(','.join) # join the column names
.rename('Invalid Dates')
)
)
An alternative with melt to reshape the DataFrame:
cols = df.filter(like='Date').columns
out = df.merge(
df.melt(id_vars='Name', value_vars=cols, var_name='Invalid Dates')
.assign(value=lambda d: pd.to_datetime(d['value'], format='%d-%m-%Y',
errors='coerce'))
.loc[lambda d: d['value'].isna()]
.groupby('Name')['Invalid Dates'].agg(','.join),
left_on='Name', right_index=True, how='left'
)
output:
Name Date1 Date2 Invalid Dates
0 A 01-02-2022 03-04-2000 NaN
1 B 23 12-12-2012 Date1
2 C 18-04-1993 abc Date2
3 D 45 qcf Date1,Date2
Use DataFrame.filter to select the columns whose names contain Date, then convert all columns of df1 to datetimes with to_datetime and errors='coerce' (non-matching values become missing), test them with DataFrame.isna, and extract the offending column names, separated by ',', with DataFrame.dot:
df1 = df.filter(like='Date')
df['Invalid dates'] = ((df1.apply(lambda x: pd.to_datetime(x, format='%d-%m-%Y',
                                                           errors='coerce'))
                           .isna() & df1.notna())
                       .dot(df1.columns + ',')
                       .str[:-1]
                       .replace('', np.nan))
print (df)
Name Date1 Date2 Invalid dates
0 A 01-02-2022 03-04-2000 NaN
1 B 23 12-12-2012 Date1
2 C 18-04-1993 abc Date2
3 D 45 qcf Date1,Date2
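For anyone unfamiliar with the DataFrame.dot trick used above, a tiny standalone sketch: multiplying a boolean mask by the column-name strings and summing them concatenates the names of the True columns per row (the mask values here are made up purely for illustration):
import pandas as pd

mask = pd.DataFrame({'Date1': [True, False], 'Date2': [True, True]})
print (mask.dot(mask.columns + ',').str[:-1])
#0    Date1,Date2
#1          Date2
#dtype: object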

pandas reset index after performing groupby and retain selective columns

I want to take a pandas dataframe, count the unique elements grouped by a column and retain two of the columns. But after the groupby I get a MultiIndex dataframe which I am unable to (1) flatten and (2) restrict to only the relevant columns. Here is my code:
import pandas as pd
df = pd.DataFrame({
'ID':[1,2,3,4,5,1],
'Ticker':['AA','BB','CC','DD','CC','BB'],
'Amount':[10,20,30,40,50,60],
'Date_1':['1/12/2018','1/14/2018','1/12/2018','1/14/2018','2/1/2018','1/12/2018'],
'Random_data':['ax','','nan','','by','cz'],
'Count':[23,1,4,56,34,53]
})
df2 = df.groupby(['Ticker']).agg(['nunique'])
df2.reset_index()
print(df2)
df2 still comes out with two levels of column index, and it has all the columns: Amount, Count, Date_1, ID, Random_data.
How do I reduce it to one level of index?
And retain only ID and Random_data columns?
Try this instead:
1) Select only the relevant columns (['ID', 'Random_data'])
2) Don't pass a list to .agg - just 'nunique' - the list is what causes the MultiIndex behaviour.
df2 = df.groupby('Ticker')[['ID', 'Random_data']].agg('nunique')
df2.reset_index()
Ticker ID Random_data
0 AA 1 1
1 BB 2 2
2 CC 2 2
3 DD 1 1
Use GroupBy.nunique and select the wanted columns as a list after the groupby:
df2 = df.groupby('Ticker')[['Date_1','Count','ID']].nunique().reset_index()
print(df2)
Ticker Date_1 Count ID
0 AA 1 1 1
1 BB 2 2 2
2 CC 2 2 2
3 DD 1 1 1
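If you do want to keep the original agg(['nunique']) call, one way to answer the "reduce it to one level of index" part is to flatten the MultiIndex columns afterwards, similar to the column flattening shown in the first question above. A sketch; the joined names like 'ID_nunique' are just an assumed convention:
df2 = df.groupby('Ticker').agg(['nunique'])
df2.columns = ['_'.join(col) for col in df2.columns]   # e.g. 'ID_nunique'
df2 = df2.reset_index()[['Ticker', 'ID_nunique', 'Random_data_nunique']]
print (df2)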

Add an indicator to show where the data came from - Python

Many thanks for reading.
I have a pandas data frame which is the result of a concatenation of multiple smaller data frames. What I want to do is add multiple indicator columns to my final data frame, so that I can see what smaller data frame each row came from.
This would be my desired result:
Forename Surname Ind_1 Ind_2 Ind_3 Ind_4
jon smith 0 0 0 1
charlie jim 1 0 0 1
ian james 0 1 0 0
For example, "Jon Smith" came from data frame 4, and 'Charlie Jim" came from data frames 1 and 4 (duplicate rows).
I have been able to achieve this for rows that only came from one data frame (e.g. rows 1 and 3) but not for duplicate rows that came from multiple data frames (e.g. row 2).
Many thanks for any help.
You can use:
first concat with the keys parameter to identify the DataFrames
reset_index to turn the MultiIndex into columns
groupby and join the indicator labels
create the indicator columns with str.get_dummies
reindex if you need to append all-zero columns for missing categories
reset_index to turn the index back into columns
df1 = pd.DataFrame({'Forename':['charlie'], 'Surname':['jim']})
df2 = pd.DataFrame({'Forename':['ian'], 'Surname':['james']})
df3 = pd.DataFrame()
df4 = pd.DataFrame({'Forename':['charlie', 'jon'], 'Surname':['jim', 'smith']})
#list of DataFrames
dfs = [df1, df2, df3, df4]
#generate indicators
inds = ['Ind_{}'.format(x+1) for x in range(len(dfs))]
df = (pd.concat(dfs, keys=inds)
.reset_index()
.groupby(['Forename','Surname'])['level_0']
.apply('|'.join)
.str.get_dummies()
.reindex(columns=inds, fill_value=0)
.reset_index())
print (df)
Forename Surname Ind_1 Ind_2 Ind_3 Ind_4
0 charlie jim 1 0 0 1
1 ian james 0 1 0 0
2 jon smith 0 0 0 1
More general solution with groupby by all columns:
df = pd.concat(dfs, keys=inds)
print (df)
Forename Surname
Ind_1 0 charlie jim
Ind_2 0 ian james
Ind_4 0 charlie jim
1 jon smith
df1 = (df.reset_index()
.groupby(df.columns.tolist())['level_0']
.apply('|'.join)
.str.get_dummies()
.reindex(columns=inds, fill_value=0)
.reset_index())
print (df1)
Forename Surname Ind_1 Ind_2 Ind_3 Ind_4
0 charlie jim 1 0 0 1
1 ian james 0 1 0 0
2 jon smith 0 0 0 1
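A possible alternative sketch: apply get_dummies to the concat keys instead of str.get_dummies, then collapse duplicate rows with max. It reuses the same dfs and inds as above; the 'source' column name is just a throwaway label introduced here:
stacked = (pd.concat(dfs, keys=inds)
             .reset_index(level=0)
             .rename(columns={'level_0': 'source'}))
res = (pd.get_dummies(stacked, columns=['source'], prefix='', prefix_sep='', dtype=int)
         .groupby(['Forename', 'Surname'], as_index=False).max()
         .reindex(columns=['Forename', 'Surname'] + inds, fill_value=0))
print (res)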

Python dataframe transpose where some rows have multiple values

I've a dataframe:
field,value
a,1
a,2
b,8
I want to pivot it to this form
a,b
1,8
2,8
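A minimal construction of the sample data for reference (values from the question). Note that the answers below differ in how they assume the file was read: the first uses named columns 'field'/'value', while the later ones assume no header row, i.e. columns labelled 0 and 1:
import pandas as pd

df = pd.DataFrame({'field': ['a', 'a', 'b'], 'value': [1, 2, 8]})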
set_index with a cumcount on each field group + field
unstack + ffill
df.set_index(
[df.groupby('field').cumcount(), 'field']
).value.unstack().ffill().astype(df.value.dtype)
field a b
0 1 8
1 2 8
You can do it like so:
# df = pd.read_clipboard(sep=',')
df.pivot(columns='field', values='value').bfill().dropna()
The following solutions assume the data was read without a header row, so the columns are labelled 0 and 1:
print (df)
0 1
0 a 1
1 a 2
2 b 8
Solution with creating groups for a new index by GroupBy.cumcount, then pivoting (via pivot_table, since a plain array index needs pivot_table in current pandas) and forward filling missing values:
g = df.groupby(0).cumcount()
df1 = (df.pivot_table(index=g, columns=0, values=1, aggfunc='first')
         .ffill().astype(int).rename_axis(None, axis=1))
print (df1)
a b
0 1 8
1 2 8
Another solution creates the groups with apply and reshapes with unstack:
print (df.groupby(0).apply(lambda x: pd.Series(x[1].values)).unstack(0).ffill().astype(int)
.rename_axis(None, axis=1))
a b
0 1 8
1 2 8
A much simpler solution would just be to do DataFrame.T (transpose)
df_new = df.T

Get data into monthly datetime index

I have a pd.DataFrame that looks like the one below
Start Date End Date
1/1/1990 7/1/2014
7/1/2005 5/1/2013
8/1/1997 8/1/2004
9/1/2001
I'd like to capture this data so that it shows, on a monthly DatetimeIndex, how many items had started but not yet ended as of each month. What I want it to look like is illustrated below.
Date Count
4/1/2013 3
5/1/2013 2
6/1/2013 2
7/1/2013 2
So far I have created a series whose index is a string combining the start and finish dates and which counts all items with the same start and end dates.
1/1/19007/1/2014 1
7/1/20055/1/2013 1
8/1/19978/1/2004 1
9/1/2001 1
And I have a dataframe with the datetimeindex looking as follows:
4/1/2013
5/1/2013
6/1/2013
7/1/2013
Now I'm struggling to combine the two to get what I'm looking for. I'm probably thinking about this all wrong and was looking for better ideas.
You can try:
print (df1)
Start Date End Date
0 1/1/1990 7/1/2014
1 7/1/2005 5/1/2013
2 8/1/1997 8/1/2004
3 9/1/2001 NaN
print (df2)
Index: [4/1/2013, 5/1/2013, 6/1/2013, 7/1/2013]
#drop NaT in columns Start Date, End Date
df1 = df1.dropna(subset=['Start Date','End Date'])
#convert columns to datetime and then to month period
df1['Start Date'] = pd.to_datetime(df1['Start Date']).dt.to_period('M')
df1['End Date'] = pd.to_datetime(df1['End Date']).dt.to_period('M')
#create new column from datetimeindex and convert it to month period
df2['Date'] = pd.DatetimeIndex(df2.index).to_period('M')
print (df1)
Start Date End Date
0 1990-01 2014-07
1 2005-07 2013-05
2 1997-08 2004-08
print (df2)
Date
Date
4/1/2013 2013-04
5/1/2013 2013-05
6/1/2013 2013-06
7/1/2013 2013-07
#stack data for resampling
df1 = df1.stack().reset_index(drop=True, level=1).reset_index(name='Date')
print (df1)
index Date
0 0 1990-01
1 0 2014-07
2 1 2005-07
3 1 2013-05
4 2 1997-08
5 2 2004-08
#resample by column index
df = df1.groupby(df1['index']).apply(lambda x: x.set_index('Date').resample('M').first()).reset_index(level=1)
#remove unnecessary column index
df = df.drop('index', axis=1)
print (df.head())
Date
index
0 1990-01
0 1990-02
0 1990-03
0 1990-04
0 1990-05
#merge df and df2 by column Date, groupby by Date and count
print (pd.merge(df, df2, on='Date').groupby('Date')['Date'].count())
Date
2013-04 2
2013-05 2
2013-06 1
2013-07 1
Freq: M, Name: Date, dtype: int64
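For reference, a more direct sketch with current pandas: compare month periods instead of expanding every month per item. It assumes an item with no End Date counts as still active (which matches the desired counts in the question), and the month range is hard-coded just for this example:
import pandas as pd

df = pd.DataFrame({'Start Date': ['1/1/1990', '7/1/2005', '8/1/1997', '9/1/2001'],
                   'End Date':   ['7/1/2014', '5/1/2013', '8/1/2004', None]})

start = pd.to_datetime(df['Start Date']).dt.to_period('M')
end = pd.to_datetime(df['End Date']).dt.to_period('M')

months = pd.period_range('2013-04', '2013-07', freq='M')
counts = pd.Series([((start <= m) & (end.isna() | (end > m))).sum() for m in months],
                   index=months, name='Count')
print (counts)
#2013-04    3
#2013-05    2
#2013-06    2
#2013-07    2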
