Sum with rows from two dataframes - python

I have two dataframes. One has months 1-5 and a value for each month, which are the same for every ID; the other has an ID and a unique multiplier, e.g.:
data = [['m', 10], ['a', 15], ['c', 14]]
# Create the pandas DataFrame
df = pd.DataFrame(data, columns=['ID', 'Unique'])
data2=[[1,0.2],[2,0.3],[3,0.01],[4,0.5],[5,0.04]]
df2 = pd.DataFrame(data2, columns=['Month', 'Value'])
I want to do sum ( value / (1+unique)^(Month/12) ). E.g. for ID m, I want to do (value/(1+10)^(Month/12)), for every row in df2, and sum them. I wrote a for-loop to do this but since my real table has 277,000 entries this takes too long!
df['baseTotal']=0
for i in df.index.unique():
    for i in df2.Month.unique():
        df['base']= df2['Value']/pow(1+df.loc[i,'Unique'],df2['Month']/12.0)
        df['baseTotal']=df['baseTotal']+df['base']
Is there a more efficient way to do this?

df['Unique'].apply(lambda x: (df2['Value']/((1+x) ** (df2['Month']/12))).sum())
0 0.609983
1 0.563753
2 0.571392
Name: Unique, dtype: float64
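If apply is still too slow on the real table, the same sum can be sketched with NumPy broadcasting, which builds one (IDs x months) array and sums across the months. This is only a sketch on the sample data from the question; the column name baseTotal is reused from the question's loop, and the intermediate names (base, exponent, value) are made up for illustration.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame([['m', 10], ['a', 15], ['c', 14]], columns=['ID', 'Unique'])
df2 = pd.DataFrame([[1, 0.2], [2, 0.3], [3, 0.01], [4, 0.5], [5, 0.04]],
                   columns=['Month', 'Value'])

# Column vector of (1 + Unique), one row per ID: shape (n_ids, 1)
base = 1.0 + df['Unique'].to_numpy(dtype=float)[:, None]
# Row vectors of Month/12 and Value, one column per month: shape (1, n_months)
exponent = df2['Month'].to_numpy(dtype=float)[None, :] / 12.0
value = df2['Value'].to_numpy(dtype=float)[None, :]

# Broadcasting gives an (n_ids, n_months) array; sum over the months axis
df['baseTotal'] = (value / base ** exponent).sum(axis=1)
print(df)
```

The intermediate array has one cell per (ID, month) pair, so with 277,000 IDs and 5 months it should stay small.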

Related

How to sum same columns (differentiated by suffix) in pandas?

I have a dataframe that looks like this:
total_customers total_customer_2021-03-31 total_purchases total_purchases_2021-03-31
1 10 4 6
3 14 3 2
Now, I want to sum up, row-wise, the columns that are the same except for the suffix. I.e., the expected output is:
total_customers total_purchases
11 10
17 5
I cannot do this manually because I have 100+ column pairs, so I need an efficient way to do this. Also, the order of the columns is not predictable. What do you recommend?
Thanks!
Somehow we need to get an Index of columns so pairs of columns share the same name, then we can groupby sum on axis=1:
cols = pd.Index(['total_customers', 'total_customers',
'total_purchases', 'total_purchases'])
result_df = df.groupby(cols, axis=1).sum()
With the shown example, we can str.replace an optional s, followed by an underscore, followed by the date format (four digits-two digits-two digits), with a single s. This pattern may need to be modified depending on the actual column names:
cols = df.columns.str.replace(r's?_\d{4}-\d{2}-\d{2}$', 's', regex=True)
result_df = df.groupby(cols, axis=1).sum()
result_df:
total_customers total_purchases
0 11 10
1 17 5
Setup and imports:
import pandas as pd
df = pd.DataFrame({
'total_customers': [1, 3],
'total_customer_2021-03-31': [10, 14],
'total_purchases': [4, 3],
'total_purchases_2021-03-31': [6, 2]
})
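Recent pandas releases deprecate groupby(..., axis=1), so the same regex idea may need a slightly different shape there. A minimal sketch, assuming the sample frame above: rename the columns to their shared base name, transpose, group on the row labels, and transpose back. The name renamed is introduced only for this illustration.

```python
import pandas as pd

df = pd.DataFrame({
    'total_customers': [1, 3],
    'total_customer_2021-03-31': [10, 14],
    'total_purchases': [4, 3],
    'total_purchases_2021-03-31': [6, 2]
})

# Map every column to its base name (same regex as in the answer above)
cols = df.columns.str.replace(r's?_\d{4}-\d{2}-\d{2}$', 's', regex=True)

# Work on a copy so the original column names are left untouched
renamed = df.copy()
renamed.columns = cols

# After transposing, the duplicated names are row labels, so a plain
# groupby on the index sums each pair; transpose back afterwards
result_df = renamed.T.groupby(level=0).sum().T
print(result_df)
```

Because the grouping key is derived from the column names themselves, the order of the columns in the input does not matter.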
Assuming that your dataframe is called df, a simple solution is:
sum_customers = df['total_customers'] + df['total_customer_2021-03-31']
sum_purchases = df['total_purchases'] + df['total_purchases_2021-03-31']
data = {"total_customers": sum_customers, "total_purchases": sum_purchases}
df_total = pd.DataFrame(data=data)
That will give you the output you want.
import pandas as pd
data = {"total_customers": [1, 3], "total_customer_2021-03-31": [10, 14], "total_purchases": [4, 3], "total_purchases_2021-03-31": [6, 2]}
df = pd.DataFrame(data=data)
final_df = pd.DataFrame()
final_df["total_customers"] = df.filter(regex='total_customers*').sum(1)
final_df["total_purchases"] = df.filter(regex='total_purchases*').sum(1)
output
final_df
total_customers total_purchases
0 11 10
1 17 5
Using @HenryEcker's sample data, and building off of the example in the docs, you can create a function and groupby on the column axis:
def get_column(column):
    if column.startswith('total_customer'):
        return 'total_customers'
    return 'total_purchases'
df.groupby(get_column, axis=1).sum()
total_customers total_purchases
0 11 10
1 17 5
I shortened the column names while coding, just FYI:
data = {"total_c" : [1,3], "total_c_2021" :[10,14],
"total_p": [4,3], "total_p_2021": [6,2]}
df = pd.DataFrame(data)
df["total_customers"] = df["total_c"] + df["total_c_2021"]
df["total_purchases"] = df["total_p"] + df["total_p_2021"]
If you don't want to keep the other columns, you can drop them:
df = df.loc[:, ['total_customers','total_purchases']]
NEW PART
I might have found a starting point for your solution. I don't know your column names, but the following code can be adapted if they follow a pattern (patterned dates, names, etc.). Can you rename the columns with a loop?
df['total_customer'] = df[[col for col in df.columns if col.startswith('total_c')]].sum(axis=1)
This solution might also be helpful for you with some alterations.

Filtering and grouping rows in one DataFrame, by another DataFrame

I have two DFs. I want to iterate through the rows in DF1, filter all the rows in DF2 with the same id, and put their column "B" values in a new column of DF1.
data = {'id': [1,2,3]}
df1 = pd.DataFrame(data)
data = {'id': [1, 1, 3,3,3], 'B': ['ab', 'bc','ad','ds','sd']}
df2 = pd.DataFrame(data)
DF1 - id (15k rows)
DF2 - id, col1 (50M rows)
Desired output
data = {'id': [1,2,3],'B':['[ab,bc]','[]','[ad,ds,sd]']}
pd.DataFrame(data)
def func(df1):
    temp3 = df2.merge(pd.DataFrame(data=[df1.values]*len(df1), columns=df1.index), how='right', on=['id'])
    temp1 = temp3.B.values
    return temp1
df1['B'] = df1.apply(func, axis=1)
I am using merge for filtering and applying the function to each row of df1. The code takes 1 hour to execute on the large data frame. How can I make this run faster?
Are you looking for a simple filter and grouped listification?
df2[df2['id'].isin(df1['id'])].groupby('id', as_index=False)[['B']].agg(list)
id B
0 1 [ab, bc]
1 3 [ad, ds, sd]
Note that grouping as lists is considered suboptimal in terms of performance.
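The desired output in the question also has an empty list for id 2, which has no rows in df2 and is therefore dropped by the filter above. One way to keep every id from df1, sketched on the sample data: group df2 once, map the resulting lists onto df1['id'], and replace the missing entries with empty lists. The name lists is introduced only for this illustration.

```python
import pandas as pd

df1 = pd.DataFrame({'id': [1, 2, 3]})
df2 = pd.DataFrame({'id': [1, 1, 3, 3, 3],
                    'B': ['ab', 'bc', 'ad', 'ds', 'sd']})

# One pass over df2: a Series indexed by id whose values are lists of B
lists = df2.groupby('id')['B'].agg(list)

# Look up each id of df1; ids that never occur in df2 come back as NaN
df1['B'] = df1['id'].map(lists)

# Replace the NaN placeholders with empty lists
df1['B'] = df1['B'].apply(lambda v: v if isinstance(v, list) else [])
print(df1)
```

This groups the large frame a single time instead of merging once per row of df1.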

Pandas, create MultiIndex column from a column in data frame

Suppose I have a data frame like this. The index of this dataframe is already a MultiIndex, date/id.
Column N indicates that the price information is from N periods earlier. How could I turn column['N'] into a column MultiIndex level?
In this example, suppose column N has two unique values [0, 1]; the final result would have 6 columns and should look like [0/priceClose] [0/priceLocal] [0/priceUSD] [1/priceClose] [1/priceLocal] [1/priceUSD]
I finally found the following method works:
step 1: melt
step 2: pivot
df = pd.melt(df, id_vars=['date', 'id', 'N'],
             value_vars=[p for p in df if p.startswith('price')],
             value_name='price')
df = pd.pivot_table(df, values='price', index=['date', 'id'],
                    columns=['variable', 'N'], aggfunc='max')
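A self-contained sketch of the melt-then-pivot steps on made-up data (the dates, ids and prices below are invented for illustration, and only two price columns are used). It assumes date/id start out as the index, so they are reset to ordinary columns before melting, and it puts N first in the column levels to match the layout asked for in the question.

```python
import pandas as pd

# Hypothetical sample: index is (date, id), N says how many periods back
df = pd.DataFrame({
    'date': ['2020-01-01', '2020-01-01', '2020-01-02', '2020-01-02'],
    'id': ['a', 'a', 'a', 'a'],
    'N': [0, 1, 0, 1],
    'priceClose': [1.0, 2.0, 3.0, 4.0],
    'priceUSD': [10.0, 20.0, 30.0, 40.0],
}).set_index(['date', 'id'])

# Step 1: melt the price columns into long form (one row per price value)
long_df = pd.melt(df.reset_index(), id_vars=['date', 'id', 'N'],
                  value_vars=[c for c in df.columns if c.startswith('price')],
                  value_name='price')

# Step 2: pivot back to wide form with (N, variable) as column levels
wide = pd.pivot_table(long_df, values='price', index=['date', 'id'],
                      columns=['N', 'variable'], aggfunc='max')
print(wide)
```

Swapping the order in columns=[...] swaps the two column levels, which gives the variable-first layout instead.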

Casting columns of categories to one string column in Python

This is a follow-up to a previously asked question (asked by me :)) Oneliner to create string column from multiple columns
I want to merge a subset of the columns in a dataframe to create a new string column. @Zero was kind enough to give me the solution to this problem:
import pandas as pd
df = pd.DataFrame({'gender' : ['m', 'f', 'f'],\
'code' : ['K2000', 'K2000', 'K2001']})
col_names = df.columns
df_str = df[col_names].astype(str).apply('_'.join, axis=1)
df_str
Out[17]:
0 K2000_m
1 K2000_f
2 K2001_f
dtype: object
However if I introduce interval data this fails
df = pd.DataFrame({'gender' : ['m', 'f', 'f'],\
'code' : ['K2000', 'K2000', 'K2001'],\
'num' : pd.cut([3, 6, 9], [0, 5, 10])})
col_names = df.columns
df_str = df[col_names].astype(str).apply('_'.join, axis=1)
Ideally I would also like to transform the data to categorical data (which also fails)
df_cat = pd.concat([df['gender'].astype('category'), \
df['code'].astype('category'), \
df['num'].astype('category')], axis=1)
df_cat_str = df_cat[col_names].astype(str).apply('_'.join, axis=1)
What is going on here? And how can I achieve the desired output?
0 K2000_m_(0, 5]
1 K2000_f_(5, 10]
2 K2001_f_(5, 10]
As with the previous question col_names should be a list containing any subset of the columns (not necessarily all columns as in this example)
You need to convert each row's values to str inside the lambda function:
df_str = df[col_names].apply(lambda x: '_'.join(x.astype(str)), axis=1)
print (df_str)
0 K2000_m_(0, 5]
1 K2000_f_(5, 10]
2 K2001_f_(5, 10]
dtype: object
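Since the question asks for any subset of columns, here is the same lambda sketched with an explicit subset. The choice of ['code', 'num'] is only an example; the order of col_names determines the order of the parts in the joined string.

```python
import pandas as pd

df = pd.DataFrame({'gender': ['m', 'f', 'f'],
                   'code': ['K2000', 'K2000', 'K2001'],
                   'num': pd.cut([3, 6, 9], [0, 5, 10])})

# Any subset of the columns, in the order they should appear in the string
col_names = ['code', 'num']

# Convert the values of each row to str, then join them with '_'
df_str = df[col_names].apply(lambda row: '_'.join(row.astype(str)), axis=1)
print(df_str)
```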

Python: Divide each row of a DataFrame by another DataFrame vector

I have a DataFrame (df1) with 2000 rows x 500 columns (excluding the index), and I want to divide each row by another DataFrame (df2) with 1 row x 500 columns. Both have the same column headers. I tried:
df.divide(df2) and
df.divide(df2, axis='index') and multiple other solutions and I always get a df with nan values in every cell. What argument am I missing in the function df.divide?
Instead of passing the whole DataFrame df2 to divide, pass its single row as a Series (e.g. df2.iloc[0]) so that it is matched against the column labels:
import pandas as pd
data1 = {"a":[1.,3.,5.,2.],
"b":[4.,8.,3.,7.],
"c":[5.,45.,67.,34]}
data2 = {"a":[4.],
"b":[2.],
"c":[11.]}
df1 = pd.DataFrame(data1)
df2 = pd.DataFrame(data2)
df1.div(df2.iloc[0], axis='columns')
or you can use df1/df2.values[0,:]
You can divide by the series i.e. the first row of df2:
In [11]: df = pd.DataFrame([[1., 2.], [3., 4.]], columns=['A', 'B'])
In [12]: df2 = pd.DataFrame([[5., 10.]], columns=['A', 'B'])
In [13]: df.div(df2)
Out[13]:
A B
0 0.2 0.2
1 NaN NaN
In [14]: df.div(df2.iloc[0])
Out[14]:
A B
0 0.2 0.2
1 0.6 0.4
A small clarification, just in case: you got NaN everywhere because div aligns on both the index and the columns. Andy's first example (df.div(df2)) works for the first row because index 0 exists in both dataframes, so the division is performed; index 1 exists only in df, so that row is filled with NaN. This behavior is even more obvious if you run the following (only the 't' row is divided):
import numpy as np
df_a = pd.DataFrame(np.random.rand(3,5), index= ['x', 'y', 't'])
df_b = pd.DataFrame(np.random.rand(2,5), index= ['z','t'])
df_a.div(df_b)
So in your case, the index of the only row of df2 was apparently not present in df1. "Luckily", the column headers are the same in both dataframes, so when you slice the first row, you get a Series whose index consists of the column headers of df2. This is what eventually allows the division to take place properly.
For a case with index and column matching:
df_a = pd.DataFrame(np.random.rand(3,5), index= ['x', 'y', 't'], columns = range(5))
df_b = pd.DataFrame(np.random.rand(2,5), index= ['z','t'], columns = [1,2,3,4,5])
df_a.div(df_b)
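The alignment behavior is easier to see with fixed numbers than with random ones. A small sketch with made-up values (the frames and the names aligned and by_row are invented for illustration): dividing by a DataFrame only fills rows whose index label exists in both frames, while dividing by a row Series matches on column labels and applies to every row.

```python
import numpy as np
import pandas as pd

df_a = pd.DataFrame({'A': [1., 3., 5.], 'B': [2., 4., 6.]},
                    index=['x', 'y', 't'])
df_b = pd.DataFrame({'A': [5.], 'B': [10.]}, index=['t'])

# DataFrame / DataFrame: aligned on index AND columns, so only the
# shared row label 't' is divided; 'x' and 'y' become NaN
aligned = df_a.div(df_b)

# DataFrame / Series: the Series index (A, B) is matched against the
# column labels, so every row is divided by the same values
by_row = df_a.div(df_b.iloc[0], axis='columns')

print(aligned)
print(by_row)
```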
If you want to divide every value in a column by a specific number, you could try:
df['column_name'] = df['column_name'].div(10000)
For me, this code divided each value of 'column_name' by 10,000.
To divide a single row (across one or multiple columns), do the following:
df.loc['index_value'] = df.loc['index_value'].div(10000)
