Pandas: idiomatic way to perform multiple complex aggregations?

I have a table as follows:
ID SCORE
A NaN
A NaN
B 1
B 2
C 5
I want the following output:
ID SUM_SCORE SIZE_SCORE
A NaN 2
B 3 2
C 5 1
Since I want to preserve NaNs, I need to use sum(min_count=1). This is what I have so far:
grp = df.groupby('ID')
sum_score = grp['SCORE'].sum(min_count=1).reset_index()
size_score = grp['SCORE'].size().reset_index()
result = pd.merge(sum_score, size_score, on=['ID'])
This feels really inelegant. Is there a better way to get the result I'm looking for?

s = df.groupby('ID').SCORE.agg([('sum_score', lambda x: x.sum(min_count=1)),
                                ('size_score', 'size')]).reset_index()
print(s)
ID sum_score size_score
0 A NaN 2
1 B 3.0 2
2 C 5.0 1
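If you are on pandas 0.25 or later, named aggregation expresses the same result without the tuple list. A minimal sketch, assuming the same df as in the question:
import numpy as np
import pandas as pd

df = pd.DataFrame({'ID': ['A', 'A', 'B', 'B', 'C'],
                   'SCORE': [np.nan, np.nan, 1, 2, 5]})

# keyword = (column, aggfunc); min_count=1 keeps the all-NaN group as NaN,
# and 'size' (unlike 'count') includes NaN rows
result = df.groupby('ID').agg(
    sum_score=('SCORE', lambda s: s.sum(min_count=1)),
    size_score=('SCORE', 'size'),
).reset_index()
print(result)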

You can aggregate in a single call; note that size, unlike count, includes NaN rows, and a plain sum would turn the all-NaN group into 0:
df_agg = df.groupby("ID")["SCORE"].agg(
    [lambda s: s.sum(min_count=1), "size"]).reset_index()
# rename your columns
df_agg.columns = ["ID", "SUM_SCORE", "SIZE_SCORE"]


pandas: update dataframe values with another dataframe that is not in the same format

I have two dataframes. The second dataframe contains the values to be updated in the first dataframe. df1:
data=[[1,"potential"],[2,"lost"],[3,"at risk"],[4,"promising"]]
df=pd.DataFrame(data,columns=['id','class'])
id class
1 potential
2 lost
3 at risk
4 promising
df2:
data2=[[2,"new"],[4,"loyal"]]
df2=pd.DataFrame(data2,columns=['id','class'])
id class
2 new
4 loyal
expected output:
data3=[[1,"potential"],[2,"new"],[3,"at risk"],[4,"loyal"]]
df3=pd.DataFrame(data3,columns=['id','class'])
id class
1 potential
2 new
3 at risk
4 loyal
The code below seems to be working, but I believe there is a more effective solution.
final=df.append([df2])
final = final.drop_duplicates(subset='id', keep="last")
Addition:
Is there a way for me to write the previous value in a new column? Like this:
id class prev_class modified date
1 potential nan nan
2 new lost 2022.xx.xx
3 at risk nan nan
4 loyal promising 2022.xx.xx
Your solution is good; here is an alternative with concat and DataFrame.sort_values added:
df = (pd.concat([df, df2])
.drop_duplicates(subset='id', keep="last")
.sort_values('id', ignore_index=True))
print (df)
id class
0 1 potential
1 2 new
2 3 at risk
3 4 loyal
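This also sidesteps DataFrame.append, which is deprecated and was removed in pandas 2.0, so the concat version is the forward-compatible choice.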
The solution changes if you also need the previous class values and today's date:
df3 = pd.concat([df, df2])
# rows whose id appears again later hold the previous class
mask = df3['id'].duplicated(keep='last')
df31 = df3[mask]
df32 = df3[~mask]
df3 = (df32.merge(df31, on='id', how='left', suffixes=('', '_prev'))
           .sort_values('id', ignore_index=True))
# stamp only the rows whose class actually changed
df3.loc[df3['class_prev'].notna(), 'modified date'] = pd.to_datetime('now').normalize()
print (df3)
id class class_prev modified date
0 1 potential NaN NaT
1 2 new lost 2022-03-31
2 3 at risk NaN NaT
3 4 loyal promising 2022-03-31
We can use DataFrame.update:
df = df.set_index('id')
df.update(df2.set_index('id'))
df = df.reset_index()
Result
print(df)
id class
0 1 potential
1 2 new
2 3 at risk
3 4 loyal
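Note that DataFrame.update modifies df in place and only overwrites positions where the other frame has non-NA values; rows whose id does not appear in df2 are left untouched, which is exactly the behaviour wanted here.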
You can operate along your ids by setting them as the index, and use combine_first to perform this operation. Assigning prev_class is then straightforward because the index does the alignment; note that the previous class must come from df (the values being replaced), not from df2:
df = df.set_index('id')
df2 = df2.set_index('id')
out = (
    df2.combine_first(df)
    .assign(
        # previous class = the value in df that df2 replaces;
        # assign aligns on the index, so other ids become NaN
        prev_class=df.loc[df2.index, "class"],
        modified=lambda d:
            d["prev_class"].where(
                d["prev_class"].isna(), pd.Timestamp.now()
            )
    )
)
print(out)
class prev_class modified
id
1 potential NaN NaN
2 new lost 2022-03-31 06:51:20.832668
3 at risk NaN NaN
4 loyal promising 2022-03-31 06:51:20.832668
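The design point is that combine_first takes df2's values wherever they exist and falls back to df elsewhere, and assign aligns on the shared id index, so no explicit merge or mask is needed.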

How would I go about creating a new data frame that has the unique values of the columns and their counts?

Let's suppose I have a python data frame that looks something like this:
Factor_1 Factor_2 Factor_3 Factor_4 Factor_5
A B A Nan Nan
B D F A Nan
F A D B A
Something like this, in which I have 5 columns that contain different factors. I would like to count how many times each of these factors appears in the dataframe, so the expected output would be something like this:
Factor Count
A 5
B 3
D 2
F 2
Nan 3
I've been trying to use a groupby but haven't been able to get the desired output, using something like this:
df['Counts'] = df.groupby(['Factor_1'])['Factor_2', 'Factor_3', 'Factor_4', 'Factor_5'].transform('count')
I actually don't know what else to do, so if someone could help me it would be great.
Try stack with value_counts:
df.stack(dropna=False).value_counts(dropna=False)
A 5
B 3
NaN 3
D 2
F 2
dtype: int64
Here is another potential solution using melt and value_counts:
df.melt().value.value_counts()
Output:
A 5
B 3
Nan 3
F 2
D 2
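One caveat: value_counts drops real NaN by default, so the melt version only shows Nan 3 here because those cells hold the literal string 'Nan'; for genuine missing values you would need value_counts(dropna=False), as in the stack answer above.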

How can one merge or concatenate Pandas series with different lengths and empty values?

I have a number of series with blanks as some values. Something like this
import pandas as pd
serie_1 = pd.Series(['a','','b','c','',''])
serie_2 = pd.Series(['','d','','','e','f','g'])
There is no problem filtering blanks in each series, with something like serie_1 = serie_1[serie_1 != ''].
However, when I combine them into one df, either by building the df from them or by building two one-column dfs and concatenating them, I don't obtain what I'm looking for.
I'm looking for a table like this:
col1 col2
0 a d
1 b e
2 c f
3 nan g
But I am obtaining something like this
0 a nan
1 nan d
2 b nan
3 c nan
4 nan e
5 nan f
6 nan g
How could I obtain the table I'm looking for?
Thanks in advance
Here is one approach, if I understand correctly:
pd.concat([
    serie_1[lambda x: x != ''].reset_index(drop=True).rename('col1'),
    serie_2[lambda x: x != ''].reset_index(drop=True).rename('col2')
], axis=1)
col1 col2
0 a d
1 b e
2 c f
3 NaN g
The logic is: select non-empty entries (with the lambda expression), restart the index numbering from 0 (with reset_index), set the column names (with rename), and create a wide table (with axis=1 in the concat call).
One way using pandas.concat:
ss = [serie_1, serie_2]
df = pd.concat([s[s.ne("")].reset_index(drop=True) for s in ss], axis=1)
print(df)
Output:
0 1
0 a d
1 b e
2 c f
3 NaN g
I would just filter out the blank values before creating the dataframe like this:
import pandas as pd
def filter_blanks(string_list):
    return [e for e in string_list if e]
serie_1 = pd.Series(filter_blanks(['a','','b','c','','']))
serie_2 = pd.Series(filter_blanks(['','d','','','e','f','g']))
pd.concat([serie_1, serie_2], axis=1)
Which results in:
0 1
0 a d
1 b e
2 c f
3 NaN g
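A variant of the same idea is to convert the blanks to real NaN and let dropna do the filtering. A sketch, assuming the same serie_1 and serie_2:
import numpy as np
import pandas as pd

serie_1 = pd.Series(['a', '', 'b', 'c', '', ''])
serie_2 = pd.Series(['', 'd', '', '', 'e', 'f', 'g'])

# replace blanks with NaN, drop them, and renumber from 0
cleaned = [s.replace('', np.nan).dropna().reset_index(drop=True)
           for s in (serie_1, serie_2)]
df = pd.concat(cleaned, axis=1, keys=['col1', 'col2'])
print(df)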

How to merge two 2-dimensional dataframes into a multi-indexed multidimensional pandas dataframe?

I have two same-sized data-frames, as follows:
cost_type1 = pd.DataFrame([[1,2,3,4], [100,200,300,400]]).transpose()
cost_type2 = pd.DataFrame([[1,4,9,25], [10,40,90,250]]).transpose()
As these data-frames both relate to costs, I want to merge them into one structure, so that I can say something like cost[i] and get the cost matrix for type i.
I tried to use multi-index as follows:
timestamps = ["2014-01-01", "2014-02-01"]
categories = ["A", "B", "C", "D"]
idx = pd.MultiIndex.from_product([timestamps, categories], names=["ts", "cat"])
df = pd.DataFrame(index=idx, columns=["col1", "col2"])
I get a nice empty data-frame like this:
col1 col2
ts cat
2014-01-01 A NaN NaN
B NaN NaN
C NaN NaN
D NaN NaN
2014-02-01 A NaN NaN
B NaN NaN
C NaN NaN
D NaN NaN
However, I can't manage to fill the "large" data frame with the two "smaller" ones that I already have. I tried something like this, but I wasn't successful:
df.loc["2014-01-01",:] = newdf1
df.loc["2014-02-01",:] = newdf2
Does any of you know how to solve this? Thanks!
Use concat, creating a new index for each DataFrame, so the empty DataFrame is not necessary:
timestamps = ["2014-01-01", "2014-02-01"]
categories = ["A", "B","C","D"]
idx = pd.MultiIndex.from_product([timestamps,categories], names=["ts", "cat"])
df = pd.concat([cost_type1.set_index([categories]),
                cost_type2.set_index([categories])], keys=timestamps)
df.columns = ["col1", "col2"]
df.index.names = ['ts', 'cat']
If the input is a list of DataFrames, use a list comprehension:
dfs = [cost_type1, cost_type2]
df = pd.concat([x.set_index([categories]) for x in dfs], keys=timestamps)
df.columns=["col1", "col2"]
df.index.names=['ts','cat']
print (df)
col1 col2
ts cat
2014-01-01 A 1 100
B 2 200
C 3 300
D 4 400
2014-02-01 A 1 10
B 4 40
C 9 90
D 25 250
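With the MultiIndex in place, getting the cost matrix for one type, which is what the question asked for, is a single .loc lookup:
print(df.loc["2014-01-01"])
col1 col2
cat
A 1 100
B 2 200
C 3 300
D 4 400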

Code multiple columns based on lists and dictionaries in Python

I have the following dataframe in Pandas
OfferPreference_A OfferPreference_B OfferPreference_C
A B A
B C C
C S G
I have the following dictionary of unique values under all the columns
dict1 = {'A': 1, 'B': 2, 'C': 3, 'S': 4, 'G': 5, 'D': 6}
I also have a list of the columnames
columnlist=['OfferPreference_A', 'OfferPreference_B', 'OfferPreference_C']
I am trying to get the following table as the output:
OfferPreference_A OfferPreference_B OfferPreference_C
1 2 1
2 3 3
3 4 5
How do I do this?
Use:
# if a value does not match, you get NaN
df = df[columnlist].applymap(dict1.get)
Or:
# if a value does not match, the original value is kept
df = df[columnlist].replace(dict1)
Or:
# if a value does not match, you get NaN
df = df[columnlist].stack().map(dict1).unstack()
print(df)
OfferPreference_A OfferPreference_B OfferPreference_C
0 1 2 1
1 2 3 3
2 3 4 5
You can use map for this, as shown below, assuming the values will always match:
for col in columnlist:
    df[col] = df[col].map(dict1)
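Like applymap(dict1.get), Series.map returns NaN for keys missing from dict1, so the loop and the applymap variant behave the same on unmatched values.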
