I have a dataframe that looks like this
df = pd.DataFrame({
    'type': ['A', 'A', 'A', 'B', 'B', 'B', 'A', 'A', 'C', 'D', 'C', 'D', 'D', 'A', 'A'],
})
I want to create a unique id based on consecutive groups of the type column, except that the id should keep incrementing for every row where type equals 'A'.
The output dataframe should eventually look like this:
df = pd.DataFrame({
    'type': ['A', 'A', 'A', 'B', 'B', 'B', 'A', 'A', 'C', 'D', 'C', 'D', 'D', 'A', 'A'],
    'id': [1, 2, 3, 4, 4, 4, 5, 6, 7, 8, 9, 10, 10, 11, 12],
})
Any help would be much appreciated.
You can try shift with cumsum to create the group key, then assign each 'A' row a unique key:
# label consecutive runs of the same type, then number rows within each run
s = df.groupby(df.type.ne(df.type.shift()).cumsum()).cumcount().astype(str)
df['new'] = df['type']
# make every 'A' unique within its run, so each 'A' starts its own group
df.loc[df.new.eq('A'), 'new'] += s
# assign an increasing id at every value change
df['new'] = df['new'].ne(df['new'].shift()).cumsum()
df
Out[58]:
type new
0 A 1
1 A 2
2 A 3
3 B 4
4 B 4
5 B 4
6 A 5
7 A 6
8 C 7
9 D 8
10 C 9
11 D 10
12 D 10
13 A 11
14 A 12
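For this particular rule there is also a shorter sketch (my own variant, not part of the answer above): a new id should start at every type change and additionally at every 'A' row, so the id is simply the cumulative count of group starts.
# True wherever a new group starts: the type changed, or the row is an 'A'
is_new_group = df['type'].ne(df['type'].shift()) | df['type'].eq('A')
df['id'] = is_new_group.cumsum()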
I'm creating an additional column, Total_Count, to store a cumulative count based on the site and count_record columns. My code for the total cumulative count is almost done; however, the Total_Count column is shifted for a specific card, as shown below. Could someone help with a code modification? Thank you!
Expected Output:
Current Output:
My Code:
import pandas as pd

df1 = pd.DataFrame(columns=['site', 'card', 'date', 'count_record'],
                   data=[['A', 'C1', '12-Oct', 5],
                         ['A', 'C1', '13-Oct', 10],
                         ['A', 'C1', '14-Oct', 18],
                         ['A', 'C1', '15-Oct', 21],
                         ['A', 'C1', '16-Oct', 29],
                         ['B', 'C2', '12-Oct', 11],
                         ['A', 'C2', '13-Oct', 2],
                         ['A', 'C2', '14-Oct', 7],
                         ['A', 'C2', '15-Oct', 13],
                         ['B', 'C2', '16-Oct', 4]])

df_append_temp = []
total = 0
preCard = ''
preSite = ''
preCount = 0
for pc in df1['card'].unique():
    df2 = df1[df1['card'] == pc].sort_values(['date'])
    total = 0
    for i in range(0, len(df2)):
        site = df2.iloc[i]['site']
        count = df2.iloc[i]['count_record']
        if site == preSite:
            total += (count - preCount)
        else:
            total += count
        preCount = count
        preSite = site
        df2.loc[i, 'Total_Count'] = total  # something wrong using loc here
    df_append_temp.append(df2)
df3 = pd.DataFrame(pd.concat(df_append_temp), columns=df2.columns)
df3
The bug in the original loop is that df2.loc[i, 'Total_Count'] indexes by label, not position: df2 keeps the row labels it had in df1 (5-9 for card C2), so loc with i = 0..4 creates new rows instead of updating the existing ones, which is what shifts the column. To fix the implementation we can instead use groupby to create each df2, which lets us apply a function to each grouped DataFrame to build the new column. This should offer performance similar to the current implementation, but produces a correctly aligned Series:
def calc_total_count(df2: pd.DataFrame) -> pd.Series:
    total = 0
    pre_count = 0
    pre_site = ''
    lst = []
    for c, s in zip(df2['count_record'], df2['site']):
        if s == pre_site:
            total += (c - pre_count)
        else:
            total += c
        pre_count = c
        pre_site = s
        lst.append(total)
    return pd.Series(lst, index=df2.index, name='Total_Count')

df3 = pd.concat([
    df1,
    df1.sort_values('date').groupby('card').apply(calc_total_count).droplevel(0)
], axis=1)
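Note: because calc_total_count returns a Series indexed like each group, groupby('card').apply produces a result with a MultiIndex of (card, original row label); droplevel(0) strips the card level so the Series aligns with df1 inside pd.concat.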
Alternatively, we can use groupby, then within each group use Series.shift to get the previous site and count_record. Then use np.where to conditionally determine each row's value and ndarray.cumsum to calculate the running total of the resulting values:
def calc_total_count(df2: pd.DataFrame) -> pd.Series:
    return pd.Series(
        np.where(df2['site'] == df2['site'].shift(),
                 df2['count_record'] - df2['count_record'].shift(fill_value=0),
                 df2['count_record']).cumsum(),
        index=df2.index,
        name='Total_Count'
    )

df3 = pd.concat([
    df1,
    df1.sort_values('date').groupby('card').apply(calc_total_count).droplevel(0)
], axis=1)
Either approach produces df3:
site card date count_record Total_Count
0 A C1 12-Oct 5 5
1 A C1 13-Oct 10 10
2 A C1 14-Oct 18 18
3 A C1 15-Oct 21 21
4 A C1 16-Oct 29 29
5 B C2 12-Oct 11 11
6 A C2 13-Oct 2 13
7 A C2 14-Oct 7 18
8 A C2 15-Oct 13 24
9 B C2 16-Oct 4 28
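For reference, the same logic can be fully vectorized without groupby.apply. This is a minimal sketch of my own (it writes the column onto df1 directly and assumes, like the answers above, that rows are processed in date order; it uses the numpy import from the setup below):
s = df1.sort_values('date')
prev_site = s.groupby('card')['site'].shift()
prev_count = s.groupby('card')['count_record'].shift(fill_value=0)
# add only the delta when the site repeats within a card, else the full count
delta = np.where(s['site'].eq(prev_site),
                 s['count_record'] - prev_count,
                 s['count_record'])
df1['Total_Count'] = pd.Series(delta, index=s.index).groupby(s['card']).cumsum()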
Setup and imports:
import numpy as np # only needed if using np.where
import pandas as pd
df1 = pd.DataFrame(columns=['site', 'card', 'date', 'count_record'],
                   data=[['A', 'C1', '12-Oct', 5],
                         ['A', 'C1', '13-Oct', 10],
                         ['A', 'C1', '14-Oct', 18],
                         ['A', 'C1', '15-Oct', 21],
                         ['A', 'C1', '16-Oct', 29],
                         ['B', 'C2', '12-Oct', 11],
                         ['A', 'C2', '13-Oct', 2],
                         ['A', 'C2', '14-Oct', 7],
                         ['A', 'C2', '15-Oct', 13],
                         ['B', 'C2', '16-Oct', 4]])
I have the following dataframe -
df = pd.DataFrame({
'ID': [1, 2, 2, 3, 3, 3, 4],
'Prior': ['a', 'b', 'c', 'd', 'e', 'f', 'g'],
'Current': ['a1', 'c', 'c1', 'e', 'f', 'f1', 'g1'],
'Date': ['1/1/2019', '5/1/2019', '10/2/2019', '15/3/2019', '6/5/2019',
'7/9/2019', '16/11/2019']
})
This is my desired output -
desired_df = pd.DataFrame({
'ID': [1, 1, 2, 2, 2, 3, 3, 3, 3, 4, 4],
'Prior_Current': ['a', 'a1', 'b', 'c', 'c1', 'd', 'e', 'f', 'f1', 'g',
'g1'],
'Start_Date': ['', '1/1/2019', '', '5/1/2019', '10/2/2019', '', '15/3/2019',
'6/5/2019', '7/9/2019', '', '16/11/2019'],
'End_Date': ['1/1/2019', '', '5/1/2019', '10/2/2019', '', '15/3/2019',
'6/5/2019', '7/9/2019', '', '16/11/2019', '']
})
I tried the following -
keys = ['Prior', 'Current']
df2 = (
    pd.melt(df, id_vars='ID', value_vars=keys, value_name='Prior_Current')
    .merge(df[['ID', 'Date']], how='left', on='ID')
)
df2['Start_Date'] = np.where(df2['variable'] == 'Prior', df2['Date'], '')
df2['End_Date'] = np.where(df2['variable'] == 'Current', df2['Date'], '')
df2.sort_values(['ID'], ascending=True, inplace=True)
But this does not seem to be working. Please help.
You can use stack and pivot_table:
k = df.set_index(['ID', 'Date']).stack().reset_index()
df = k.pivot_table(index=['ID', 0], columns='level_2', values='Date',
                   aggfunc=''.join, fill_value='').reset_index()
df.columns = ['ID', 'prior-current', 'start-date', 'end-date']
OUTPUT:
ID prior-current start-date end-date
0 1 a 1/1/2019
1 1 a1 1/1/2019
2 2 b 5/1/2019
3 2 c 5/1/2019 10/2/2019
4 2 c1 10/2/2019
5 3 d 15/3/2019
6 3 e 15/3/2019 6/5/2019
7 3 f 6/5/2019 7/9/2019
8 3 f1 7/9/2019
9 4 g 16/11/2019
10 4 g1 16/11/2019
Explanation:
After stack / reset_index, k will look like this:
ID Date level_2 0
0 1 1/1/2019 Prior a
1 1 1/1/2019 Current a1
2 2 5/1/2019 Prior b
3 2 5/1/2019 Current c
4 2 10/2/2019 Prior c
5 2 10/2/2019 Current c1
6 3 15/3/2019 Prior d
7 3 15/3/2019 Current e
8 3 6/5/2019 Prior e
9 3 6/5/2019 Current f
10 3 7/9/2019 Prior f
11 3 7/9/2019 Current f1
12 4 16/11/2019 Prior g
13 4 16/11/2019 Current g1
Now we can pivot with ID and column 0 as the index, level_2 as the columns, and the Date column as the values.
Finally, we rename the columns to get the desired result.
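For comparison, here is a minimal sketch of the same idea spelled with melt instead of stack (column names are taken from the question; aggfunc='first' assumes each (ID, value, variable) pair has a single date, which holds for this data):
out = (df.melt(id_vars=['ID', 'Date'], value_name='Prior_Current')
         .pivot_table(index=['ID', 'Prior_Current'], columns='variable',
                      values='Date', aggfunc='first', fill_value='')
         .reset_index()
         .rename(columns={'Current': 'Start_Date', 'Prior': 'End_Date'}))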
My approach is to build the target df step by step. The first step is an extension of your code using melt() and merge(): merging on the 'Current' and 'Prior' columns retrieves the start and end dates.
df = pd.DataFrame({
'ID': [1, 2, 2, 3, 3, 3, 4],
'Prior': ['a', 'b', 'c', 'd', 'e', 'f', 'g'],
'Current': ['a1', 'c', 'c1', 'e', 'f', 'f1', 'g1'],
'Date': ['1/1/2019', '5/1/2019', '10/2/2019', '15/3/2019', '6/5/2019',
'7/9/2019', '16/11/2019']
})
df2 = (pd.melt(df, id_vars='ID', value_vars=['Prior', 'Current'], value_name='Prior_Current')
       .drop(columns='variable')
       .drop_duplicates()
       .sort_values('ID'))
df2 = df2.merge(df[['Current', 'Date']], how='left',
                left_on='Prior_Current', right_on='Current').drop(columns='Current')
df2 = df2.merge(df[['Prior', 'Date']], how='left',
                left_on='Prior_Current', right_on='Prior').drop(columns='Prior')
df2 = df2.fillna('').reset_index(drop=True)
df2.columns = ['ID', 'Prior_Current', 'Start_Date', 'End_Date']
An alternative way is to define a custom function that looks up the date, then apply it with a lambda:
def get_date(x, col):
    try:
        return df['Date'][df[col] == x].values[0]
    except IndexError:
        return ''

df2 = (pd.melt(df, id_vars='ID', value_vars=['Prior', 'Current'], value_name='Prior_Current')
       .drop(columns='variable')
       .drop_duplicates()
       .sort_values('ID')
       .reset_index(drop=True))
df2['Start_Date'] = df2['Prior_Current'].apply(lambda x: get_date(x, 'Current'))
df2['End_Date'] = df2['Prior_Current'].apply(lambda x: get_date(x, 'Prior'))
Output (matches desired_df above).
I already have a dictionary of data frames. I would like to loop over each data frame in the dictionary, group it by the column named Size, and then store each group of the data in a new data frame in B.
My problem is that on each iteration B gets replaced with newer data frames. I would like to keep the data frames for all possible groups. Does anyone have an idea how to do that?
Small example:
data = {'Name': ['Tom', 'nick', 'krish', 'jack', 'Kody', 'Kim'],
        'Age': [20, 21, 19, 18, 6, 6],
        'Size': ['M', 'M', 'L', 'S', 'S', 'M']}
data2 = {'Name': ['Jason', 'Damon', 'Ronda', 'Kylie', 'Ron', 'Harry'],
         'Age': [20, 12, 11, 13, 6, 5],
         'Size': ['L', 'M', 'L', 'M', 'L', 'L']}
df = pd.DataFrame(data)
df2 = pd.DataFrame(data2)
A = {}
A[0] = df
A[1] = df2
B = {}
for x in range(0, 2):
    A[x] = A[x].groupby(["Size"])
    KeysA = list(A[x].groups.keys())
    display(len(KeysA))
    for z in range(0, len(KeysA)):
        B[z] = A[x].get_group(str(KeysA[z]))
I want to have this output (see picture).
With my code the data frames are overwritten on each iteration, so in the end I have three data frames instead of five.
Is this what you want? Concatenating the frames first and then grouping yields each size exactly once, and keying B by the group label avoids the overwriting:
import pandas as pd

data = {'Name': ['Tom', 'nick', 'krish', 'jack', 'Kody', 'Kim'],
        'Age': [20, 21, 19, 18, 6, 6],
        'Size': ['M', 'M', 'L', 'S', 'S', 'M']}
data2 = {'Name': ['Jason', 'Damon', 'Ronda', 'Kylie', 'Ron', 'Harry'],
         'Age': [20, 12, 11, 13, 6, 5],
         'Size': ['L', 'M', 'L', 'M', 'L', 'L']}
df = pd.DataFrame(data)
df2 = pd.DataFrame(data2)
A = {}
A[0] = df
A[1] = df2
B = {}

new_df = pd.concat(A.values())
groups = new_df.groupby(["Size"])
for group in groups:
    B[group[0]] = group[1]

for k, v in B.items():
    print(f"{k}: {v}")
output:
L: Name Age Size
2 krish 19 L
0 Jason 20 L
2 Ronda 11 L
4 Ron 6 L
5 Harry 5 L
M: Name Age Size
0 Tom 20 M
1 nick 21 M
5 Kim 6 M
1 Damon 12 M
3 Kylie 13 M
S: Name Age Size
3 jack 18 S
4 Kody 6 S
For 5 DataFrames (in a list), do this:
import pandas as pd

data = {'Name': ['Tom', 'nick', 'krish', 'jack', 'Kody', 'Kim'],
        'Age': [20, 21, 19, 18, 6, 6],
        'Size': ['M', 'M', 'L', 'S', 'S', 'M']}
data2 = {'Name': ['Jason', 'Damon', 'Ronda', 'Kylie', 'Ron', 'Harry'],
         'Age': [20, 12, 11, 13, 6, 5],
         'Size': ['L', 'M', 'L', 'M', 'L', 'L']}
df = pd.DataFrame(data)
df2 = pd.DataFrame(data2)
A = {}
A[0] = df
A[1] = df2
B = []
for key, value in A.items():
    groups = value.groupby(["Size"])
    for group in groups:
        B.append(group[1])

for x in B:
    print(x)
output:
Name Age Size
2 krish 19 L
Name Age Size
0 Tom 20 M
1 nick 21 M
5 Kim 6 M
Name Age Size
3 jack 18 S
4 Kody 6 S
Name Age Size
0 Jason 20 L
2 Ronda 11 L
4 Ron 6 L
5 Harry 5 L
Name Age Size
1 Damon 12 M
3 Kylie 13 M
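The same nested loop can also be written as a comprehension; a compact sketch (iterating a groupby yields (key, group) pairs):
B = [group for value in A.values() for _, group in value.groupby("Size")]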
I have 3 dataframes like below,
df1 = [1q2 123
       1q3 212
       1d4 234 ...]

df2 = [1q1 223
       1q2 126
       1q3 42
       1d4 314 ...]

df3 = [1q2 923
       1q4 121
       1d3 423 ...]
How can I get a result like this?

dfans = [1q1 0   223 0
         1q2 123 126 923
         1q3 212 42  423
         1d4 234 314 121 ...]

Column 1 contains the id; column 2 holds the value matched from df1 for each id; similarly, column 3 holds the values matched from df2, and column 4 the values matched from df3. If no value is available for an id, place a 0 in that location. Is there any way to do this?
That is a simple merge.
import pandas as pd

df1 = pd.DataFrame({'id': ['A', 'B', 'D', 'E'], 'var1': [1, 2, 5, 8]})
df2 = pd.DataFrame({'id': ['B', 'C', 'D', 'E'], 'var2': [1, 3, 8, 8]})
df3 = pd.DataFrame({'id': ['A', 'B', 'C', 'D'], 'var3': [2, 4, 6, 7]})
dfAns = (df1.merge(df2, on='id', how='outer')
            .merge(df3, on='id', how='outer')
            .fillna(0))
The output will look like this:
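A sketch of the result (row order may differ across pandas versions, and the NaNs introduced by the outer merges upcast the integer columns to float before fillna(0)):
  id  var1  var2  var3
0  A   1.0   0.0   2.0
1  B   2.0   1.0   4.0
2  D   5.0   8.0   7.0
3  E   8.0   8.0   0.0
4  C   0.0   3.0   6.0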