Average one field over two columns - python

I have searched around and have my own solution, but I believe there is a better way to achieve the result.
I have a dataframe with the following columns:
from_country to_country score
from_country and to_country columns have the same set of entries, e.g. US, UK, China, and so on. For each combination of from-to, there is a specific score.
I need to calculate the average score for each country, regardless of whether it appears in the from_country or the to_country field.
df_from = df[["from_country", "score"]].copy()
df_from.rename(columns={"from_country":"country"}, inplace=True)
df_to = df[["to_country", "score"]].copy()
df_to.rename(columns={"to_country":"country"}, inplace=True)
df_countries = pd.concat([df_from, df_to])
and then finally calculated the average over the new dataframe.
Is there a way to do it better?
Thanks

You can first stack the columns and then a simple groupby will get you all of the averages.
df.set_index('score').stack().reset_index().groupby(0).score.mean()
Here's an example, which also names the stacked column:
import pandas as pd
df = pd.DataFrame({'from_country': ['A', 'B', 'C', 'D', 'E', 'G'],
                   'to_country': ['G', 'C', 'Z', 'X', 'A', 'A'],
                   'score': [1, 2, 3, 4, 5, 6]})
stacked = df.set_index('score').stack().to_frame('country').reset_index().drop(columns='level_1')
#    score country
# 0      1       A
# 1      1       G
# 2      2       B
# 3      2       C
# 4      3       C
# 5      3       Z
# ...
stacked.groupby('country').score.mean()
Outputs:
country
A 4.0
B 2.0
C 2.5
D 4.0
E 5.0
G 3.5
X 4.0
Z 3.0
Name: score, dtype: float64

Another way with set_index + concat:
pd.concat((
    df.set_index('from_country').score,
    df.set_index('to_country').score
)).groupby(level=0).mean()
A 4.0
B 2.0
C 2.5
D 4.0
E 5.0
G 3.5
X 4.0
Z 3.0
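For completeness, here is a minimal end-to-end sketch of that approach on data shaped like the original question (the country names and scores below are made up for illustration):
import pandas as pd

df = pd.DataFrame({'from_country': ['US', 'UK', 'China'],
                   'to_country': ['UK', 'China', 'US'],
                   'score': [1, 2, 3]})

# Stack both country columns into one long Series, keeping each row's score
per_country = pd.concat((
    df.set_index('from_country').score,
    df.set_index('to_country').score
))

# Average the score for every country, whichever column it appeared in
per_country.groupby(level=0).mean()
# China    2.5
# UK       1.5
# US       2.0
# Name: score, dtype: float64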

Related

Aggregate values in pandas dataframe based on lists of indices in a pandas series

Suppose you have a dataframe with an "id" column and a column of values:
df1 = pd.DataFrame({'id': ['a', 'b', 'c'] , 'vals': [1, 2, 3]})
df1
id vals
0 a 1
1 b 2
2 c 3
You also have a series that contains lists of "id" values that correspond to those in df1:
df2 = pd.Series([['b', 'c'], ['a', 'c'], ['a', 'b']])
df2
id
0 [b, c]
1 [a, c]
2 [a, b]
Now, you need a computationally efficient method for taking the mean of the "vals" column in df1 using the corresponding ids in df2 and creating a new column in df1. For instance, for the first row (index=0) we would take the mean of the values for ids "b" and "c" in df1 (since these are the id values in df2 for index=0):
id vals avg_vals
0 a 1 2.5
1 b 2 2.0
2 c 3 1.5
You could do it this way:
df1['avg_vals'] = df2.apply(lambda x: df1.loc[df1['id'].isin(x), 'vals'].mean())
df1
id vals avg_vals
0 a 1 2.5
1 b 2 2.0
2 c 3 1.5
...but suppose it is too slow for your purposes; something much more computationally efficient is needed. Thanks for your help in advance.
Let us try
df1['new'] = pd.DataFrame(df2.tolist()).replace(dict(zip(df1.id,df1.vals))).mean(1)
df1
Out[109]:
id vals new
0 a 1 2.5
1 b 2 2.0
2 c 3 1.5
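To see why this one-liner works, here is the same chain broken into steps, with the intermediate frames shown as comments:
wide = pd.DataFrame(df2.tolist())       # each list becomes a row of ids
#    0  1
# 0  b  c
# 1  a  c
# 2  a  b
mapped = wide.replace(dict(zip(df1.id, df1.vals)))   # map each id to its value in df1
#    0  1
# 0  2  3
# 1  1  3
# 2  1  2
df1['new'] = mapped.mean(1)             # row-wise mean, aligned with df1's index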
Try something like:
df1['avg_vals'] = (df2.explode()
                      .map(df1.set_index('id')['vals'])
                      .groupby(level=0)
                      .mean()
                  )
output:
id vals avg_vals
0 a 1 2.5
1 b 2 2.0
2 c 3 1.5
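The chain works because explode repeats the original index, so grouping on level 0 still identifies which list each value came from. A step-by-step view:
exploded = df2.explode()                          # index repeats: 0, 0, 1, 1, 2, 2
vals = exploded.map(df1.set_index('id')['vals'])  # 'b' -> 2, 'c' -> 3, 'a' -> 1, ...
df1['avg_vals'] = vals.groupby(level=0).mean()    # mean per original list, aligned by index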
Thanks to #Beny and #mozway for their answers. But, these still were not performing as efficiently as I needed. I was able to take some of mozway's answer and add a merge and groupby to it which sped things up:
df1 = pd.DataFrame({'id': ['a', 'b', 'c'] , 'vals': [1, 2, 3]})
df2 = pd.Series([['b', 'c'], ['a', 'c'], ['a', 'b']])
df2 = df2.explode().reset_index(drop=False)
df1['avg_vals'] = pd.merge(df1, df2, left_on='id', right_on=0, how='right').groupby('index').mean()['vals']
df1
id vals avg_vals
0 a 1 2.5
1 b 2 2.0
2 c 3 1.5
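One caveat for newer pandas versions (2.0 and later, depending on your install): groupby().mean() no longer silently drops non-numeric columns, so the merge above can raise a TypeError. Selecting the vals column before aggregating sidesteps this:
# Select the numeric column first, then average per original list index
merged = pd.merge(df1, df2, left_on='id', right_on=0, how='right')
df1['avg_vals'] = merged.groupby('index')['vals'].mean()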

Summing multiple columns using a regular expression to select which columns to sum

I would like to perform the following:
test = pd.DataFrame({'A1':[1,1,1,1],
                     'A2':[1,2,2,1],
                     'A3':[1,1,1,1],
                     'B1':[1,1,1,1],
                     'B2':[pd.NA, 1,1,1]})
result = pd.DataFrame({'A': test.filter(regex='A').sum(axis=1),
                       'B': test.filter(regex='B').sum(axis=1)})
I was wondering whether there is a better method to do this, when we have more columns and more "regex"-matches.
Use a dict comprehension instead of repeating the code:
L = ['A','B']
df = pd.DataFrame({x: test.filter(regex=x).sum(axis=1) for x in L})
Or, if possible, simplify the solution by grouping on only the first letter of each column name:
df = test.groupby(lambda x: x[0], axis=1).sum()
print (df)
A B
0 3 1.0
1 4 2.0
2 4 2.0
3 3 2.0
If the regexes should be joined by | to get the matching substring for all columns, use:
vals = test.columns.str.extract('(A|B)', expand=False)
print (vals)
Index(['A', 'A', 'A', 'B', 'B'], dtype='object')
df = test.groupby(vals, axis=1).sum()
print (df)
A B
0 3 1.0
1 4 2.0
2 4 2.0
3 3 2.0
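A note for newer pandas versions, where groupby(..., axis=1) is deprecated: a sketch that avoids it, and also derives the prefixes automatically instead of hard-coding L = ['A','B'], could look like this:
# Derive the unique letter prefixes from the column names
prefixes = test.columns.str.extract(r'([A-Za-z]+)', expand=False).unique()

# Sum the columns belonging to each prefix; sum(axis=1) skips NA by default
df = pd.DataFrame({p: test.filter(regex=rf'^{p}').sum(axis=1) for p in prefixes})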

Find the mean of columns with matching column names

I have a dataframe similar to the following but with thousands of rows and columns:
x  y  ghb_00hr_rep1  ghb_00hr_rep2  ghb_00hr_rep3  ghl_06hr_rep1  ghl_06hr_rep2
x  y              2              3              2              1              3
x  y              5              7              6              2              1
I would like my output to look like this:
ghb_00hr  ghl_06hr
2.3       2
6         1.5
My goal is to find the average of the matching columns. I have come up with this:
temp = df.groupby(name, axis=1).agg('mean')
But I am not sure how to define 'name' as the matching columns.
My previous strategy was the following:
name = pd.Series(['_'.join(i.split('_')[:-1])
                  for i in df.columns[3:]],
                 index=df.columns[3:])
temp = df.groupby(name, axis=1).agg('mean')
avg = pd.concat([df.iloc[:, :3], temp], axis=1)
However, the number of 'replicates' ranges from 1-4, so grouping by index location isn't an option.
Not sure if there is a better way to do this or if I am on the right track.
An option is to groupby level=0:
(df.set_index(['name','x','y'])
   .groupby(level=0, axis=1)
   .mean()
   .reset_index()
)
Output:
name x y ghb_00hr ghl_06hr
0 gene1 x y 2.333333 2.0
1 gene2 x y 6.000000 1.5
Update: for the modified question:
d = df.filter(like='gh')
# or d = df.iloc[:, 2:]
# depending on your columns of interest
names = d.columns.str.rsplit('_', n=1).str[0]
d.groupby(names, axis=1).mean()
Output:
ghb_00hr ghl_06hr
0 2.333333 2.0
1 6.000000 1.5
You can convert df.columns to a set and then iterate:
df = pd.DataFrame([[1, 2, 3, 4, 5, 6]], columns=['a', 'a', 'a', 'b', 'b', 'b'])
for column in set(df.columns):
    print(column, df[column].mean(axis=1))
which outputs:
a 0 2.0
dtype: float64
b 0 5.0
dtype: float64
Use sorted if the order matters:
for column in sorted(set(df.columns)):
From here you can get the output in pretty much any format you want.
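For example, to collect the per-name means into a single frame, a small sketch building on the loop above:
# One column per distinct name, each holding the row-wise mean of its duplicates
means = pd.DataFrame({col: df[col].mean(axis=1) for col in sorted(set(df.columns))})
print(means)
#      a    b
# 0  2.0  5.0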

merge new data into old data by replacing old data while appending new rows

I have 2 data frames with the same column names: the old data frame old_df and the new data frame new_df, with one column acting as the key.
I am trying to merge the 2 data frames into a single data frame subject to the following conditions.
If the key is missing in the new table, the data from old_df should be taken.
If the key is missing in the old table, the data from new_df should be added.
If the key is present in both tables, the data from new_df should overwrite the data from old_df.
Below is my code snippet that I am trying to play with.
new_data = pd.read_csv(filepath)
new_data.set_index(['Name'])
old_data = pd.read_sql_query("select * from dbo.Details", con=engine)
old_data.set_index(['Name'])
merged_result = pd.merge(new_data[['Name','RIC','Volatility','Sector']],
                         old_data,
                         on='Name',
                         how='outer')
I am thinking of using np.where from this point onwards but not sure how to proceed. Please advise.
I believe you need DataFrame.combine_first with DataFrame.set_index for matching by the Name column:
merged_result = (new_data.set_index('Name')[['RIC','Volatility','Sector']]
                         .combine_first(old_data.set_index('Name'))
                         .reset_index())
Sample data:
old_data = pd.DataFrame({'RIC':range(6),
                         'Volatility':[5,3,6,9,2,4],
                         'Name':list('abcdef')})
print (old_data)
RIC Volatility Name
0 0 5 a
1 1 3 b
2 2 6 c
3 3 9 d
4 4 2 e
5 5 4 f
new_data = pd.DataFrame({'RIC':range(4),
                         'Volatility':[10,20,30, 40],
                         'Name': list('abhi')})
print (new_data)
RIC Volatility Name
0 0 10 a
1 1 20 b
2 2 30 h
3 3 40 i
merged_result = (new_data.set_index('Name')
                         .combine_first(old_data.set_index('Name'))
                         .reset_index())
print (merged_result)
Name RIC Volatility
0 a 0.0 10.0
1 b 1.0 20.0
2 c 2.0 6.0
3 d 3.0 9.0
4 e 4.0 2.0
5 f 5.0 4.0
6 h 2.0 30.0
7 i 3.0 40.0
#jezrael's answer looks good. You may also try splitting the dataset on a condition and concatenating the old and new dataframes.
In the following example, I'm taking col1 as index and producing results that comply with your question's rules for combination.
import pandas as pd
old_data = {'col1': ['a', 'b', 'c', 'd', 'e'], 'col2': ['A', 'B', 'C', 'D', 'E']}
new_data = {'col1': ['a', 'b', 'e', 'f', 'g'], 'col2': ['V', 'W', 'X', 'Y', 'Z']}
old_df = pd.DataFrame(old_data)
new_df = pd.DataFrame(new_data)
old_df:
  col1 col2
0    a    A
1    b    B
2    c    C
3    d    D
4    e    E
new_df:
  col1 col2
0    a    V
1    b    W
2    e    X
3    f    Y
4    g    Z
Now,
df = pd.concat([new_df, old_df[~old_df['col1'].isin(new_df['col1'])]], axis=0).reset_index(drop=True)
Which gives us
df:
  col1 col2
0    a    V
1    b    W
2    e    X
3    f    Y
4    g    Z
5    c    C
6    d    D
Hope this helps.

Selecting a subset using dropna() to select multiple columns

I have the following DataFrame:
df = pd.DataFrame([[1,2,3,3],[10,20,2,],[10,2,5,],[1,3],[2]],columns = ['a','b','c','d'])
From this DataFrame, I want to drop the rows where all values in the subset ['b', 'c', 'd'] are NA, which means the last row should be dropped.
The following code works:
df.dropna(subset=['b', 'c', 'd'], how = 'all')
However, considering that I will be working with larger data frames, I would like to select the same subset using the range ['b':'d']. How do I select this subset?
IIUC, use loc, retrieve those columns, and pass that to dropna.
c = df.loc[0:0, 'b':'d'].columns # slice only the 0th row for efficiency
df = df.dropna(subset=c, how='all')
print(df)
a b c d
0 1 2.0 3.0 3.0
1 10 20.0 2.0 NaN
2 10 2.0 5.0 NaN
3 1 3.0 NaN NaN
Similar to #ayhan's idea - using df.columns.slice_indexer:
In [25]: cols = df.columns[df.columns.slice_indexer('b','d')]
In [26]: cols
Out[26]: Index(['b', 'c', 'd'], dtype='object')
In [27]: df.dropna(subset=cols, how='all')
Out[27]:
a b c d
0 1 2.0 3.0 3.0
1 10 20.0 2.0 NaN
2 10 2.0 5.0 NaN
3 1 3.0 NaN NaN
You could also slice the column list numerically:
c = df.columns[1:4]
df = df.dropna(subset=c, how='all')
If using numbers is impractical (i.e. too many to count), there is a somewhat cumbersome work-around:
start, stop = df.columns.get_loc('b'), df.columns.get_loc('d')
c = df.columns[start:stop+1]
df = df.dropna(subset=c, how='all')
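If you need this often, the work-around can be wrapped in a small helper (the function name here is just for illustration):
def cols_between(frame, start, stop):
    """Return the column labels from start to stop, inclusive."""
    i, j = frame.columns.get_loc(start), frame.columns.get_loc(stop)
    return frame.columns[i:j + 1]

df = df.dropna(subset=cols_between(df, 'b', 'd'), how='all')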
