How to sum same columns (differentiated by suffix) in pandas?

I have a dataframe that looks like this:
total_customers  total_customer_2021-03-31  total_purchases  total_purchases_2021-03-31
              1                         10                4                           6
              3                         14                3                           2
Now, I want to sum up, row-wise, the columns that are the same except for the suffix. I.e., the expected output is:
total_customers  total_purchases
             11               10
             17                5
I cannot do this manually because I have 100+ column pairs, so I need an efficient way to do it. Also, the order of the columns is not predictable. What do you recommend?
Thanks!

Somehow we need to get an Index of columns such that paired columns share the same name; then we can groupby sum on axis=1:
cols = pd.Index(['total_customers', 'total_customers',
                 'total_purchases', 'total_purchases'])
result_df = df.groupby(cols, axis=1).sum()
With the shown example, we can use str.replace to replace an optional s, followed by an underscore, followed by the date (four digits-two digits-two digits), with a single s. This pattern may need to be modified depending on the actual column names:
cols = df.columns.str.replace(r's?_\d{4}-\d{2}-\d{2}$', 's', regex=True)
result_df = df.groupby(cols, axis=1).sum()
result_df:
total_customers total_purchases
0 11 10
1 17 5
Setup and imports:
import pandas as pd

df = pd.DataFrame({
    'total_customers': [1, 3],
    'total_customer_2021-03-31': [10, 14],
    'total_purchases': [4, 3],
    'total_purchases_2021-03-31': [6, 2]
})
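Note for newer pandas: DataFrame.groupby(..., axis=1) is deprecated in recent releases. A minimal sketch of the same idea via a transpose, assuming the column pattern above:
# group the transposed rows by the cleaned-up names, then transpose back
cols = df.columns.str.replace(r's?_\d{4}-\d{2}-\d{2}$', 's', regex=True)
result_df = df.T.groupby(cols).sum().T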

Assuming that your dataframe is called df, the best solution is:
sum_customers = df['total_customers'] + df['total_customer_2021-03-31']
sum_purchases = df['total_purchases'] + df['total_purchases_2021-03-31']
data = {"total_customers": sum_customers, "total_purchases": sum_purchases}
df_total = pd.DataFrame(data=data)
and that will give you the output you want.

import pandas as pd
data = {"total_customers": [1, 3], "total_customer_2021-03-31": [10, 14], "total_purchases": [4, 3], "total_purchases_2021-03-31": [6, 2]}
df = pd.DataFrame(data=data)
final_df = pd.DataFrame()
final_df["total_customers"] = df.filter(regex='total_customers*').sum(1)
final_df["total_purchases"] = df.filter(regex='total_purchases*').sum(1)
Output of final_df:
total_customers total_purchases
0 11 10
1 17 5

Using @HenryEcker's sample data, and building on the example in the docs, you can create a function and groupby on the column axis:
def get_column(column):
    if column.startswith('total_customer'):
        return 'total_customers'
    return 'total_purchases'

df.groupby(get_column, axis=1).sum()
total_customers total_purchases
0 11 10
1 17 5

I changed the headings while coding to make them shorter, just for your information.
data = {"total_c" : [1,3], "total_c_2021" :[10,14],
"total_p": [4,3], "total_p_2021": [6,2]}
df = pd.DataFrame(data)
df["total_costumers"] = df["total_c"] + df["total_c_2021"]
df["total_purchases"] = df["total_p"] + df["total_p_2021"]
If you don't want to see the other columns, you can drop them:
df = df.loc[:, ['total_costumers','total_purchases']]
NEW PART
So I might have found a starting point for your solution! I don't know your column names, but the following code can be adapted if the names follow a pattern (patterned dates, names, etc.). Could you rename the columns with a loop?
df['total_customer'] = df[[col for col in df.columns if col.startswith('total_c')]].sum(axis=1)
A solution along these lines might be helpful for you with some alterations.
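Building on that idea, a hedged sketch that derives each base name automatically (assuming the suffixes always look like _YYYY-MM-DD, as in the regex answer above) and sums every group in a loop:
import re

# strip an optional trailing 's' plus a date suffix down to a common base name
pattern = re.compile(r's?_\d{4}-\d{2}-\d{2}$')
result_df = pd.DataFrame()
for base in {pattern.sub('s', c) for c in df.columns}:
    matching = [c for c in df.columns if pattern.sub('s', c) == base]
    result_df[base] = df[matching].sum(axis=1)
The resulting column order follows the set and may vary; reindex if a fixed order matters.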

Related

Fill pandas column based on range and category of another

I have two dataframes, DF1 and DF2 (sample data for both is reconstructed in an answer below).
I want to add a column 'Speed' to DF1 that references the Track category and the LocationFrom/LocationTo range of DF2.
I have looked at merge_asof and IntervalIndex, but I am unable to figure out how to reference the category before the range.
Thanks.
Check the code below, using SQLite:
import pandas as pd
import sqlite3

conn = sqlite3.connect(':memory:')
DF1.to_sql('DF1', con=conn, index=False)
DF2.to_sql('DF2', con=conn, index=False)
pd.read_sql("""SELECT DF1.*, DF2.Speed
               FROM DF1
               JOIN DF2 ON DF1.Track = DF2.Track
               AND DF1.Location BETWEEN DF2.LocationFrom AND DF2.LocationTo""", con=conn)
As hinted in your question, this is a perfect use case for merge_asof:
pd.merge_asof(df1, df2, by='Track',
              left_on='Location', right_on='LocationTo',
              direction='forward'
              )  # .drop(columns=['LocationFrom', 'LocationTo'])
output:
Track Location LocationFrom LocationTo Speed
0 A 1 0 5 45
1 A 2 0 5 45
2 A 6 5 10 50
3 B 24 20 50 100
NB. uncomment the drop to remove the extra columns.
It works, but I would like to see someone do this without a for loop and without creating mini dataframes.
import pandas as pd

data1 = {'Track': list('AAAB'), 'Location': [1, 2, 6, 24]}
df1 = pd.DataFrame(data1)
data2 = {'Track': list('AABB'), 'LocationFrom': [0, 5, 0, 20], 'LocationTo': [5, 10, 20, 50], 'Speed': [45, 50, 80, 100]}
df2 = pd.DataFrame(data2)

speeds = []
for k in range(len(df1)):
    track = df1['Track'].iloc[k]
    location = df1['Location'].iloc[k]
    # mini dataframe: only the df2 rows for this track
    df2_track = df2.loc[df2['Track'] == track]
    speeds.append(df2_track['Speed'].loc[(df2_track['LocationFrom'] <= location)
                                         & (location < df2_track['LocationTo'])].iloc[0])
df1['Speed'] = speeds
print(df1)
Output:
Track Location Speed
0 A 1 45
1 A 2 45
2 A 6 50
3 B 24 100
This approach is probably not viable if your tables are large. It creates an intermediate table holding a merge of all pairs of matching Tracks between df1 and df2, then removes the rows where the location is not between the boundaries. Thanks @Aeronatix for the dfs.
The all_merge intermediate table gets really big really fast: if a1 rows of df1 are Track A, a2 rows of df2 are Track A, and so on, then the total number of rows in all_merge will be a1*a2 + b1*b2 + ... + z1*z2, which might or might not be gigantic depending on your dataset.
all_merge = df1.merge(df2)
results = all_merge[all_merge.Location.between(all_merge.LocationFrom,all_merge.LocationTo)]
print(results)
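For completeness, a hedged sketch of the IntervalIndex idea mentioned in the question, assuming the intervals within each Track don't overlap and every Location falls inside one of them (lookup_speed is an illustrative name):
def lookup_speed(row):
    # restrict df2 to this row's track, then locate the interval holding Location
    grp = df2[df2['Track'] == row['Track']]
    intervals = pd.IntervalIndex.from_arrays(grp['LocationFrom'], grp['LocationTo'], closed='left')
    return grp['Speed'].iloc[intervals.get_indexer([row['Location']])[0]]

df1['Speed'] = df1.apply(lookup_speed, axis=1)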

How to transform an index into a column in Python?

I am struggling with the following: I have one dataset and have transposed it. After transposing, the first column was set as the index automatically, and from then on this "index" column is not recognized as a variable. Here is an example of what I mean;
df =
Date      A  B  C
1/1/2021  1  2  3
1/2/2021  4  5  6

input: df_T = df.T
output:
index  1/1/2021  1/2/2021
A             1         4
B             2         5
C             3         6
I would like to have a variable, and name it if possible, instead of the generated "index".
To reproduce this dataset, I have used this chunk of code:
data = [['1/1/2021', 1, 2, 3], ['3/1/2021', 4, 5, 6]]
df = pd.DataFrame(data)
df.columns = ['Date', 'A', 'B', 'C']
df.set_index('Date', inplace=True)
To have a meaningful column in place of the index, the next line can be run:
df_T = df.T.reset_index()
To rename the column 'index', the rename method can be used (note that it returns a new frame unless inplace=True is passed):
df_T = df_T.rename(columns={'index': 'Variable'})
A pandas DataFrame typically has names both for the columns and for the rows. The list of names for the rows is called the "index". When you do a transpose, rows and columns switch places. In your case, the dates column is for some reason the index, so it becomes the column names for the new dataframe. You need to create a new index and turn the "Date" column into a regular column. As @sophocies wrote above, this is achieved with df.reset_index(...). I hope the code example below is helpful.
import pandas as pd

df = pd.DataFrame(columns=['Date', 'A', 'B', 'C'], data=[['1/1/2021', 1, 2, 3], ['1/2/2021', 4, 5, 6]])
df.set_index('Date', inplace=True)
print(df.transpose(copy=True))  # recreates the problem
df.reset_index(inplace=True)
print(df.transpose())
Output:
             0         1
Date  1/1/2021  1/2/2021
A            1         4
B            2         5
C            3         6
I hope this is what you wanted!
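A hedged variation on the same idea, assuming df still has 'Date' as its index (i.e., before the reset_index above): naming the transposed frame's index first yields the desired column name in one chain.
# name the transposed frame's index 'Variable', then turn it into a regular column
df_T = df.T.rename_axis('Variable').reset_index()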

Change column values based on other dataframe columns

I have two dataframes that look like this
df1 ==
IDLocation  x-coord  y-coord
         1   -1.546    7.845
         2    3.256    1.965
         .
         .
        35    5.723   -2.724

df2 ==
PIDLocation  DIDLocation
         14            5
          3            2
          7           26
I want to replace the columns PIDLocation and DIDLocation with Px-coord, Py-coord, Dx-coord, Dy-coord, such that each of PIDLocation and DIDLocation is an IDLocation, and each IDLocation corresponds to an x-coord and y-coord in the first dataframe.
If you set the ID column as the index of df1, you can get the coord values by indexing. I changed the values in df2 in the example below to avoid index errors that would result from not having the full dataset.
import pandas as pd

df1 = pd.DataFrame({'IDLocation': [1, 2, 35],
                    'x-coord': [-1.546, 3.256, 5.723],
                    'y-coord': [7.845, 1.965, -2.724]})
df2 = pd.DataFrame({'PIDLocation': [35, 1, 2],
                    'DIDLocation': [2, 1, 35]})
df1.set_index('IDLocation', inplace=True)

df2['Px-coord'] = [df1['x-coord'].loc[i] for i in df2.PIDLocation]
df2['Py-coord'] = [df1['y-coord'].loc[i] for i in df2.PIDLocation]
df2['Dx-coord'] = [df1['x-coord'].loc[i] for i in df2.DIDLocation]
df2['Dy-coord'] = [df1['y-coord'].loc[i] for i in df2.DIDLocation]
del df2['PIDLocation']
del df2['DIDLocation']
print(df2)
Px-coord Py-coord Dx-coord Dy-coord
0 5.723 -2.724 3.256 1.965
1 -1.546 7.845 -1.546 7.845
2 3.256 1.965 5.723 -2.724
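A hedged vectorized alternative, assuming df1 is already indexed by IDLocation as above and starting from df2 before its ID columns are deleted: .loc looks up all rows at once, and add_prefix builds the new column names.
# label-based bulk lookup instead of per-row list comprehensions
coords_p = df1.loc[df2['PIDLocation']].add_prefix('P').reset_index(drop=True)
coords_d = df1.loc[df2['DIDLocation']].add_prefix('D').reset_index(drop=True)
result = pd.concat([coords_p, coords_d], axis=1)
print(result)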

How to create a class - function for astype function python?

There are a lot of columns in the dataframe, like:
df_train_data['material'] = df_train_data['material'].astype('category',ordered=False)
df_train_data['aircon'] = df_train_data['aircon'].astype('category',ordered=False)
df_train_data['building_quality'] = df_train_data['building_quality'].astype('category',ordered=True)
df_train_data['fireplace'] = df_train_data['fireplace'].astype('category',ordered=False)
.
.
.
df_test_data.....
This applies to both the train and test dataframes.
So, instead of writing 20-30 odd lines for each column in train, and each in test again, how can I write them in a function where we can pass only the column names (comma separated) and ordered as an argument?
I can only think of one way (new to programming):
def data_type(df_name, col, ord_type):
    df_name[col] = df_name[col].astype('category', ordered=ord_type)
    return df_name
How to do this for multiple column names at once?
Actually your answer is working for multiple columns; just use lists instead of single values:
import numpy as np
import pandas as pd

df = pd.DataFrame(np.array([[1, 2, 3], [4, 5, 6], [7, 8, 9]]),
                  columns=['a', 'b', 'c'])

def data_type(df_name, col, ord_type):
    return df_name[col].astype('category', ordered=ord_type)

cols = ['a', 'b']
df[cols] = data_type(df, cols, [True, False])
df is now:
a b c
0 1 2 3
1 4 5 6
2 7 8 9
with dtypes :
a category
b category
c int32
dtype: object
It may help:
data = pd.read_excel(r"<file_location>.xlsx")

def data_type(df, as_type, ordered, *cols):
    for col in cols:
        df[col] = df[col].astype(as_type, ordered=ordered)
    return df

df = data_type(data, 'category', False, *data.columns)
If focusing only on setting/changing the type for a large number of columns (all columns) at once for several dataframes: pandas.DataFrame.astype allows passing a dict of column name -> data type (as the 1st argument). Note that astype returns a new dataframe, so assign the result back:
from itertools import zip_longest
...
df_train_data = df_train_data.astype(dict(zip_longest(df_train_data.columns, ('category',), fillvalue='category')))
df_test_data = df_test_data.astype(dict(zip_longest(df_test_data.columns, ('category',), fillvalue='category')))
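A hedged note for newer pandas: astype('category', ordered=...) is no longer supported there; the ordered flag travels on a CategoricalDtype instead. A minimal sketch under that assumption (set_categories is an illustrative name, not a pandas API):
from pandas.api.types import CategoricalDtype

def set_categories(df, cols, ordered=False):
    # cast several columns at once to an (un)ordered categorical dtype
    df[cols] = df[cols].astype(CategoricalDtype(ordered=ordered))
    return df

df_train_data = set_categories(df_train_data, ['material', 'aircon', 'fireplace'], ordered=False)
df_train_data = set_categories(df_train_data, ['building_quality'], ordered=True)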

Creating Pivot DataFrame using Multiple Columns in Pandas

I have a pandas dataframe following the form in the example below:
data = {'id': [1,1,1,1,2,2,2,2,3,3,3], 'a': [-1,1,1,0,0,0,-1,1,-1,0,0], 'b': [1,0,0,-1,0,1,1,-1,-1,1,0]}
df = pd.DataFrame(data)
Now, what I want to do is create a pivot table such that for each of the columns except the id, I will have 3 new columns corresponding to the values. That is, for column a, I will create a_neg, a_zero and a_pos. Similarly, for b, I will create b_neg, b_zero and b_pos. The values for these new columns would correspond to the number of times those values appear in the original a and b column. The final dataframe should look like this:
result = {'id': [1,2,3], 'a_neg': [1, 1, 1],
'a_zero': [1, 2, 2], 'a_pos': [2, 1, 0],
'b_neg': [1, 1, 1], 'b_zero': [2,1,1], 'b_pos': [1,2,1]}
df_result = pd.DataFrame(result)
Now, to do this, I can do the following steps and arrive at my final answer:
by_a = df.groupby(['id', 'a']).count().reset_index().pivot(index='id', columns='a', values='b').fillna(0).astype(int)
by_a.columns = ['a_neg', 'a_zero', 'a_pos']
by_b = df.groupby(['id', 'b']).count().reset_index().pivot(index='id', columns='b', values='a').fillna(0).astype(int)
by_b.columns = ['b_neg', 'b_zero', 'b_pos']
df_result = by_a.join(by_b).reset_index()
However, I believe this method is not optimal, especially if I have a lot of original columns aside from a and b. Is there a shorter and/or more efficient way to achieve what I want here? Thanks.
A shorter solution, though still quite inefficient:
In [11]: df1 = df.set_index("id")
In [12]: g = df1.groupby(level=0)
In [13]: g.apply(lambda x: x.apply(lambda x: x.value_counts())).fillna(0).astype(int).unstack(1)
Out[13]:
a b
-1 0 1 -1 0 1
id
1 1 1 2 1 2 1
2 1 2 1 1 1 2
3 1 2 0 1 1 1
Note: I think you should be aiming for the multi-index columns.
I'm reasonably sure I've seen a trick to remove the apply/value_counts/fillna with something cleaner and more efficient, but at the moment it eludes me...
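One candidate for such a trick is melt plus crosstab; a hedged sketch, assuming the only values are -1, 0 and 1 as in the example (the column order differs from the target and can be reindexed if needed):
# long form: one row per (id, original column, value) observation
long = df.melt(id_vars='id')
# count occurrences of each value per id and original column
counts = pd.crosstab(long['id'], [long['variable'], long['value']])
# flatten the (variable, value) MultiIndex into names like 'a_neg'
names = {-1: 'neg', 0: 'zero', 1: 'pos'}
counts.columns = [f'{var}_{names[val]}' for var, val in counts.columns]
df_result = counts.reset_index()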
