I'm still relatively new to Pandas and I can't tell which of the functions I'm best off using to get to my answer. I have looked at pivot, pivot_table, groupby and aggregate but I can't seem to get any of them to do what I require. Quite possibly user error, for which I apologise!
I have data like this:
Code to create df:
import pandas as pd
df = pd.DataFrame([
['1', '1', 'A', 3, 7],
['1', '1', 'B', 2, 9],
['1', '1', 'C', 2, 9],
['1', '2', 'A', 4, 10],
['1', '2', 'B', 4, 0],
['1', '2', 'C', 9, 8],
['2', '1', 'A', 3, 8],
['2', '1', 'B', 10, 4],
['2', '1', 'C', 0, 1],
['2', '2', 'A', 1, 6],
['2', '2', 'B', 10, 2],
['2', '2', 'C', 10, 3]
], columns = ['Field1', 'Field2', 'Type', 'Price1', 'Price2'])
print(df)
I am trying to get data like this:
My end goal, though, is to end up with one column for A, one for B and one for C, where A uses Price1 and B and C use Price2.
I don't want to necessarily get the max or min or average or sum of the Price as theoretically (although unlikely) there could be two different Price1's for the same Fields & Type.
What's the best function to use in Pandas to get to what I need?
Use DataFrame.set_index with DataFrame.unstack to reshape. The output has a MultiIndex in the columns, so sort it by the second level with DataFrame.sort_index, flatten the column names, and finally turn the Field levels back into columns with reset_index:
df1 = (df.set_index(['Field1','Field2', 'Type'])
.unstack(fill_value=0)
.sort_index(axis=1, level=1))
df1.columns = [f'{b}-{a}' for a, b in df1.columns]
df1 = df1.reset_index()
print (df1)
Field1 Field2 A-Price1 A-Price2 B-Price1 B-Price2 C-Price1 C-Price2
0 1 1 3 7 2 9 2 9
1 1 2 4 10 4 0 9 8
2 2 1 3 8 10 4 0 1
3 2 2 1 6 10 2 10 3
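For the stated end goal (one column per Type, with A taken from Price1 and B and C from Price2), the unstacked frame can be indexed by its (price, type) column pairs before flattening. This is only a sketch built on the question's sample data; the names wide and result are illustrative and not part of the original answer.

```python
import pandas as pd

df = pd.DataFrame([
    ['1', '1', 'A', 3, 7], ['1', '1', 'B', 2, 9], ['1', '1', 'C', 2, 9],
    ['1', '2', 'A', 4, 10], ['1', '2', 'B', 4, 0], ['1', '2', 'C', 9, 8],
    ['2', '1', 'A', 3, 8], ['2', '1', 'B', 10, 4], ['2', '1', 'C', 0, 1],
    ['2', '2', 'A', 1, 6], ['2', '2', 'B', 10, 2], ['2', '2', 'C', 10, 3],
], columns=['Field1', 'Field2', 'Type', 'Price1', 'Price2'])

# Columns of `wide` are a MultiIndex of (price column, Type) pairs.
wide = df.set_index(['Field1', 'Field2', 'Type']).unstack()

# Pick one price column per Type: A from Price1, B and C from Price2.
result = pd.DataFrame({
    'A': wide[('Price1', 'A')],
    'B': wide[('Price2', 'B')],
    'C': wide[('Price2', 'C')],
}).reset_index()
print(result)
```

Like the unstack solution above, this assumes each Field1/Field2/Type combination occurs only once.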
A solution with DataFrame.pivot_table is also possible, but if the first three columns (Field1, Field2, Type) contain duplicate combinations it aggregates their values, using mean by default:
df2 = (df.pivot_table(index=['Field1','Field2'],
columns='Type',
values=['Price1', 'Price2'],
aggfunc='mean')
.sort_index(axis=1, level=1))
df2.columns = [f'{b}-{a}' for a, b in df2.columns]
df2 = df2.reset_index()
print (df2)
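The practical difference between the two approaches shows up when a Field1/Field2/Type combination is repeated, which the question says is possible. The sketch below uses a made-up three-row frame (dup is hypothetical data, not from the question): unstack refuses to reshape duplicate index entries, while pivot_table silently aggregates them.

```python
import pandas as pd

# Hypothetical data: the first two rows share Field1/Field2/Type
# but have different Price1 values.
dup = pd.DataFrame([
    ['1', '1', 'A', 3, 7],
    ['1', '1', 'A', 5, 7],
    ['1', '1', 'B', 2, 9],
], columns=['Field1', 'Field2', 'Type', 'Price1', 'Price2'])

# set_index + unstack raises ValueError on duplicate index entries.
try:
    dup.set_index(['Field1', 'Field2', 'Type']).unstack()
    unstack_failed = False
except ValueError as err:
    unstack_failed = True
    print('unstack failed:', err)

# pivot_table aggregates the duplicates instead (mean of 3 and 5).
agg = dup.pivot_table(index=['Field1', 'Field2'], columns='Type',
                      values='Price1', aggfunc='mean')
mean_a = agg.loc[('1', '1'), 'A']
print(mean_a)
```

If an average is not meaningful for the data, the error from unstack is arguably the safer behaviour, since it flags the duplicates rather than hiding them.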
Use pivot_table:
pd.pivot_table(df, values =['Price1', 'Price2'], index=['Field1','Field2'],columns='Type').reset_index()
I am trying to transform this Dataframe.
To look like the following:
Here is the code to create the sample df
df = pd.DataFrame(data = [[1, 'A', 0, '2021-07-01'],
[1, 'B', 1, '2021-07-02'],
[2, 'D', 3, '2021-07-02'],
[2, 'C', 2, '2021-07-02'],
[2, 'E', 4, '2021-07-02']
], columns = ['id', 'symbol', 'value', 'date'])
symbol_list = [['A', 'B', ''], ['C','D','E']]
The end result is grouped by the id field, with the symbol column turned into multiple columns whose order follows the user-supplied list.
I was using the .apply() method to construct each row of the result, but it takes a very long time for 10000+ data points.
I am trying to find a more efficient way to transform the dataframe. I think I will need a pivot-style function to unstack the dataframe, combined with resetting the index (to turn the category values into columns). Appreciate any help on this!
Use GroupBy.cumcount with DataFrame.unstack to reshape, then extract the date columns with DataFrame.pop and take the max per row, flatten the column names, and finally add the new date column with DataFrame.assign:
df = pd.DataFrame(data = [[1, 'A', 0, '2021-07-01'],
[1, 'B', 1, '2021-07-02'],
[2, 'D', 3, '2021-07-02'],
[2, 'C', 2, '2021-07-02'],
[2, 'E', 4, '2021-07-02']
], columns = ['id', 'symbol', 'value', 'date'])
#IMPORTANT all values from symbol_list are in column symbol (without empty strings)
symbol_list = [['A', 'B', ''], ['C','D','E']]
order = [y for x in symbol_list for y in x if y]
print (order)
['A', 'B', 'C', 'D', 'E']
#convert symbol to an ordered Categorical, using the flattened list as the category order
df['symbol'] = pd.Categorical(df['symbol'], ordered=True, categories=order)
df['date'] = pd.to_datetime(df['date'])
#sorting by id and symbol
df = df.sort_values(['id','symbol'])
df1 = df.set_index(['id',df.groupby('id').cumcount()]).unstack()
date_max = df1.pop('date').max(axis=1)
df1.columns = df1.columns.map(lambda x: f'{x[0]}_{x[1]}')
df1 = df1.assign(date = date_max)
print (df1)
symbol_0 symbol_1 symbol_2 value_0 value_1 value_2 date
id
1 A B NaN 0.0 1.0 NaN 2021-07-02
2 C D E 2.0 3.0 4.0 2021-07-02
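The step that makes the reshape work is GroupBy.cumcount, which numbers the rows within each id starting from 0; those numbers become the _0, _1, _2 suffixes after unstacking. A minimal illustration on a cut-down copy of the sample data (the variable names small and pos are illustrative only):

```python
import pandas as pd

small = pd.DataFrame({'id': [1, 1, 2, 2, 2],
                      'symbol': ['A', 'B', 'C', 'D', 'E']})

# Position of each row within its id group: 0, 1, ... per group.
pos = small.groupby('id').cumcount()
print(pos.tolist())

# Using (id, position) as the index gives unstack one column per position.
wide = small.set_index(['id', pos])['symbol'].unstack()
print(wide)
```

Because the positions depend on row order, the sort by id and symbol has to happen before cumcount is computed, as in the answer above.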
I have the following dataframe -
df = pd.DataFrame({
'ID': [1, 2, 2, 3, 3, 3, 4],
'Prior': ['a', 'b', 'c', 'd', 'e', 'f', 'g'],
'Current': ['a1', 'c', 'c1', 'e', 'f', 'f1', 'g1'],
'Date': ['1/1/2019', '5/1/2019', '10/2/2019', '15/3/2019', '6/5/2019',
'7/9/2019', '16/11/2019']
})
This is my desired output -
desired_df = pd.DataFrame({
'ID': [1, 1, 2, 2, 2, 3, 3, 3, 3, 4, 4],
'Prior_Current': ['a', 'a1', 'b', 'c', 'c1', 'd', 'e', 'f', 'f1', 'g',
'g1'],
'Start_Date': ['', '1/1/2019', '', '5/1/2019', '10/2/2019', '', '15/3/2019',
'6/5/2019', '7/9/2019', '', '16/11/2019'],
'End_Date': ['1/1/2019', '', '5/1/2019', '10/2/2019', '', '15/3/2019',
'6/5/2019', '7/9/2019', '', '16/11/2019', '']
})
I tried the following -
import numpy as np
keys = ['Prior', 'Current']
df2 = (
pd.melt(df, id_vars='ID', value_vars=keys, value_name='Prior_Current')
.merge(df[['ID', 'Date']], how='left', on='ID')
)
df2['Start_Date'] = np.where(df2['variable'] == 'Prior', df2['Date'], '')
df2['End_Date'] = np.where(df2['variable'] == 'Current', df2['Date'], '')
df2.sort_values(['ID'], ascending=True, inplace=True)
But this does not seem to be working. Please help.
you can use stack and pivot_table:
k = df.set_index(['ID', 'Date']).stack().reset_index()
df = k.pivot_table(index = ['ID',0], columns = 'level_2', values = 'Date', aggfunc = ''.join, fill_value= '').reset_index()
df.columns = ['ID', 'prior-current', 'start-date', 'end-date']
OUTPUT:
ID prior-current start-date end-date
0 1 a 1/1/2019
1 1 a1 1/1/2019
2 2 b 5/1/2019
3 2 c 5/1/2019 10/2/2019
4 2 c1 10/2/2019
5 3 d 15/3/2019
6 3 e 15/3/2019 6/5/2019
7 3 f 6/5/2019 7/9/2019
8 3 f1 7/9/2019
9 4 g 16/11/2019
10 4 g1 16/11/2019
Explanation:
After stack / reset_index df will look like this:
ID Date level_2 0
0 1 1/1/2019 Prior a
1 1 1/1/2019 Current a1
2 2 5/1/2019 Prior b
3 2 5/1/2019 Current c
4 2 10/2/2019 Prior c
5 2 10/2/2019 Current c1
6 3 15/3/2019 Prior d
7 3 15/3/2019 Current e
8 3 6/5/2019 Prior e
9 3 6/5/2019 Current f
10 3 7/9/2019 Prior f
11 3 7/9/2019 Current f1
12 4 16/11/2019 Prior g
13 4 16/11/2019 Current g1
Now, we can use ID and column 0 as index / level_2 as column / Date column as value.
Finally, we need to rename the columns to get the desired result.
My approach is to build and attain the target df step by step. The first step is an extension of your code using melt() and merge(). The merge is done based on the columns 'Current' and 'Prior' to get the start and end date.
df = pd.DataFrame({
'ID': [1, 2, 2, 3, 3, 3, 4],
'Prior': ['a', 'b', 'c', 'd', 'e', 'f', 'g'],
'Current': ['a1', 'c', 'c1', 'e', 'f', 'f1', 'g1'],
'Date': ['1/1/2019', '5/1/2019', '10/2/2019', '15/3/2019', '6/5/2019',
'7/9/2019', '16/11/2019']
})
# drop(columns=...) replaces the positional axis argument, which pandas 2.0 no longer accepts
df2 = pd.melt(df, id_vars='ID', value_vars=['Prior', 'Current'], value_name='Prior_Current').drop(columns='variable').drop_duplicates().sort_values('ID')
df2 = df2.merge(df[['Current', 'Date']], how='left', left_on='Prior_Current', right_on='Current').drop(columns='Current')
df2 = df2.merge(df[['Prior', 'Date']], how='left', left_on='Prior_Current', right_on='Prior').drop(columns='Prior')
df2 = df2.fillna('').reset_index(drop=True)
df2.columns = ['ID', 'Prior_Current', 'Start_Date', 'End_Date']
An alternative is to define a custom function that looks up the date, then apply it with a lambda function:
def get_date(x, col):
    # return the first Date whose `col` value equals x, or '' if there is none
    try:
        return df['Date'][df[col]==x].values[0]
    except IndexError:
        return ''
df2 = pd.melt(df, id_vars='ID', value_vars=['Prior', 'Current'], value_name='Prior_Current').drop(columns='variable').drop_duplicates().sort_values('ID').reset_index(drop=True)
df2['Start_Date'] = df2['Prior_Current'].apply(lambda x: get_date(x, 'Current'))
df2['End_Date'] = df2['Prior_Current'].apply(lambda x: get_date(x, 'Prior'))
Output
Example Dataframe =
df = pd.DataFrame({'ID': [1,1,2,2,2,3,3,3],
'Type': ['b','b','b','a','a','a','a','a']})
I would like to return the counts grouped by ID, with a column for each unique value in Type holding the count of that Type for the grouped row, plus a total:
pd.DataFrame({'ID': [1,2,3], 'CountTypeA': [0,2,3], 'CountTypeB': [2,1,0], 'TotalCount': [2,3,3]})
Is there an easy way to do this using the groupby function in pandas?
For this you can use the pandas function get_dummies, which converts a categorical variable into dummy/indicator variables. See the pandas documentation for get_dummies for details.
Check if this meets your requirements:
import pandas as pd
df = pd.DataFrame({'ID': [1, 1, 2, 2, 2, 3, 3, 3],
'Type': ['b', 'b', 'b', 'a', 'a', 'a', 'a', 'a']})
dummy_var = pd.get_dummies(df["Type"])
dummy_var.rename(columns={'a': 'CountTypeA', 'b': 'CountTypeB'}, inplace=True)
df1 = pd.concat([df['ID'], dummy_var], axis=1)
df_group1 = df1.groupby(by=['ID'], as_index=False).sum()
df_group1['TotalCount'] = df_group1['CountTypeA'] + df_group1['CountTypeB']
print(df_group1)
This will print the following result:
ID CountTypeA CountTypeB TotalCount
0 1 0 2 2
1 2 2 1 3
2 3 3 0 3
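As a possible alternative to get_dummies followed by groupby, pd.crosstab counts the ID/Type combinations in one step. This is a sketch rather than part of the original answer; the column names are built to match the output above, and counts is an illustrative name.

```python
import pandas as pd

df = pd.DataFrame({'ID': [1, 1, 2, 2, 2, 3, 3, 3],
                   'Type': ['b', 'b', 'b', 'a', 'a', 'a', 'a', 'a']})

# Rows are IDs, columns are Types, cells are occurrence counts.
counts = pd.crosstab(df['ID'], df['Type'])

# Rename 'a' -> 'CountTypeA', 'b' -> 'CountTypeB', then add the row total.
counts.columns = [f'CountType{c.upper()}' for c in counts.columns]
counts['TotalCount'] = counts.sum(axis=1)
counts = counts.reset_index()
print(counts)
```

Unlike the explicit rename in the answer, this picks up any additional Type values automatically.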
I have the following pandas DataFrame with mixed data types: string and integer values. I want to sort values of this DataFrame in descending order using multiple columns: Price and Name. The string values (i.e. Name) should be sorted in the alphabetical order, or actually can be ignored at all, because the most important ones are numerical values.
The problem is that the list of target columns can contain both string and integer columns, e.g. target_columns = ["Price","Name"]
d = {'1': ['25', 'AAA', 2], '2': ['30', 'BBB', 3], '3': ['5', 'CCC', 2], \
'4': ['300', 'DDD', 2], '5': ['30', 'DDD', 3], '6': ['100', 'AAA', 3]}
columns=['Price', 'Name', 'Class']
target_columns = ['Price', 'Name']
order_per_cols = [False] * len(target_columns)
df = pd.DataFrame.from_dict(data=d, orient='index')
df.columns = columns
df.sort_values(list(target_columns), ascending=order_per_cols, inplace=True)
Currently, this code fails with the following message:
TypeError: '<' not supported between instances of 'str' and 'int'
The expected output:
Price Name Class
300 DDD 2
100 AAA 3
30 DDD 3
30 BBB 3
25 AAA 2
5 CCC 2
If I understand you correctly, you want a generic way that excludes the object columns from your selection.
We can use DataFrame.select_dtypes for this, then sort on the numeric columns:
# Price is stored as strings in the question's data, so convert it first
df['Price'] = pd.to_numeric(df['Price'])
numeric = df[target_columns].select_dtypes('number').columns.tolist()
df = df.sort_values(numeric, ascending=[False]*len(numeric))
Price Name Class
4 300 DDD 2
6 100 AAA 3
2 30 BBB 3
5 30 DDD 3
1 25 AAA 2
3 5 CCC 2
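If the string columns should still act as tie-breakers, as in the question's expected output, another option is the key parameter of sort_values, which is applied to each sort column separately. The sketch below assumes pandas 1.1 or later; numeric_if_possible is a hypothetical helper name, not a pandas function.

```python
import pandas as pd

d = {'1': ['25', 'AAA', 2], '2': ['30', 'BBB', 3], '3': ['5', 'CCC', 2],
     '4': ['300', 'DDD', 2], '5': ['30', 'DDD', 3], '6': ['100', 'AAA', 3]}
df = pd.DataFrame.from_dict(data=d, orient='index',
                            columns=['Price', 'Name', 'Class'])
target_columns = ['Price', 'Name']


def numeric_if_possible(col):
    # Hypothetical helper: compare a column as numbers when every value
    # parses, otherwise leave it unchanged (e.g. the Name column).
    try:
        return pd.to_numeric(col)
    except (ValueError, TypeError):
        return col


# The key only affects the comparison; the stored values stay as they are.
sorted_df = df.sort_values(target_columns, ascending=False,
                           key=numeric_if_possible)
print(sorted_df)
```

Here the two rows with Price 30 come out as DDD then BBB, because Name is also sorted in descending order.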
One more solution could be -
Using 'by' parameter in sort_values function
d = ({'1': ['25', 'AAA', 2], '2': ['30', 'BBB', 3], '3': ['5', 'CCC', 2], \
'4': ['300', 'DDD', 2], '5': ['30', 'DDD', 3], '6': ['100', 'AAA', 3]})
df = pd.DataFrame.from_dict(data=d,columns=['Price','Name','Class'],orient='index')
df['Price'] = pd.to_numeric(df['Price'])
df.sort_values(by=['Price','Name'], ascending=False)
I am attempting to calculate the differences between two groups that may have mismatched data in an efficient manner.
The following dataframe, df,
df = pd.DataFrame({'type': ['A', 'A', 'A', 'W', 'W', 'W'],
'code': ['1', '2', '3', '1', '2', '4'],
'values': [50, 25, 25, 50, 10, 40]})
has two types that have mismatched "codes" -- notably code 3 is not present for the 'W' type and code 4 is not present for the 'A' type. I have wrapped codes as strings as in my particular case they are sometimes strings.
I would like to subtract the values for matching codes between the two types so that we obtain,
result = pd.DataFrame({'code': ['1', '2', '3', '4'],
'diff': [0, 15, 25, -40]})
Where the sign would indicate which type had the greater value.
I have spent some time examining variations on groupby diff methods here, but have not seen anything that deals with the particular issue of subtracting between two potentially mismatched columns. Instead, most questions appear to be appropriate for the intended use of the diff() method.
The route I've tried most recently is using a list comprehension on df.groupby('type') to split into two dataframes, but then I remain with a similar problem regarding subtracting mismatched cases.
Group by code, then substitute 0 for any missing value:
import numpy as np
import pandas as pd
df = pd.DataFrame({'type': ['A', 'A', 'A', 'W', 'W', 'W'],
'code': ['1', '2', '3', '1', '2', '4'],
'values': [50, 25, 25, 50, 10, 40]})
def my_func(x):
    # What if there are more than 1 value for a type/code combo?
    a_value = x[x.type == 'A']['values'].max()
    w_value = x[x.type == 'W']['values'].max()
    a_value = 0 if np.isnan(a_value) else a_value
    w_value = 0 if np.isnan(w_value) else w_value
    return a_value - w_value
df_new = df.groupby('code').apply(my_func)
df_new = df_new.reset_index()
df_new = df_new.rename(columns={0:'diff'})
print(df_new)
code diff
0 1 0
1 2 15
2 3 25
3 4 -40
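A vectorised variant of the same idea, offered as a sketch rather than as the original answer's method, is to pivot the two types into columns, fill the gaps with 0 and subtract. It keeps the answer's choice of max when a type/code pair occurs more than once; wide and result are illustrative names.

```python
import pandas as pd

df = pd.DataFrame({'type': ['A', 'A', 'A', 'W', 'W', 'W'],
                   'code': ['1', '2', '3', '1', '2', '4'],
                   'values': [50, 25, 25, 50, 10, 40]})

# One row per code, one column per type; missing pairs become 0.
wide = df.pivot_table(index='code', columns='type', values='values',
                      aggfunc='max', fill_value=0)

# Positive diff means type A had the larger value, negative means W.
result = (wide['A'] - wide['W']).rename('diff').reset_index()
print(result)
```

This avoids calling a Python function once per group, which may matter on larger frames.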