Append min value of two columns in pandas data frame - python

df
Purchase
1
3
2
5
4
7
df2
df2 = pd.DataFrame(columns=['Mean','Median','Max','Col4'])
df2 = df2.append({'Mean': df['Purchase'].mean(),
                  'Median': df['Purchase'].median(),
                  'Max': df['Purchase'].max(),
                  'Col4': df2[['Mean','Median']].min(axis=1)},
                 ignore_index=True)
Output obtained
Mean Median Max Col4
3.66 3.5 7 Series([], dtype: float64)
Output expected
Mean Median Max Col4
3.66 3.5 7 3.5 # value in Col4 is min(Mean, Median) of df2
Can anyone help?

Use np.minimum and pass the mean and the median. (In your code, df2[['Mean','Median']].min(axis=1) is evaluated while df2 is still empty, which is why Col4 ends up as an empty Series.)
import numpy as np

df2 = pd.DataFrame(columns=['Mean','Median','Max','Col4'])
df2 = df2.append({'Mean': df['Purchase'].mean(),
                  'Median': df['Purchase'].median(),
                  'Max': df['Purchase'].max(),
                  'Col4': np.minimum(df['Purchase'].mean(), df['Purchase'].median())},
                 ignore_index=True)
print (df2)
Mean Median Max Col4
0 3.666667 3.5 7.0 3.5
Better: use Series.agg, add the min as a new value in the next step, and finally create a one-row DataFrame:
s = df['Purchase'].agg(['mean','median','max'])
s.loc['col4'] = s[['mean','median']].min()
df = s.to_frame(0).T
print (df)
mean median max col4
0 3.666667 3.5 7.0 3.5
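Note that DataFrame.append was deprecated in pandas 1.4 and removed in 2.0, so on current pandas both snippets above need a substitute. A minimal sketch that builds the same one-row frame from a plain dict instead:

stats = {'Mean': df['Purchase'].mean(),
         'Median': df['Purchase'].median(),
         'Max': df['Purchase'].max()}
stats['Col4'] = min(stats['Mean'], stats['Median'])  # min of two scalars
df2 = pd.DataFrame([stats])  # a one-element list of dicts -> one row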

Related

How to get a new df constituted by partially transposed fragments of another dataframe

I am struggling to transpose my dataframe. It is not a simple transpose: I want to limit the number of columns to the number of rows in each index slice. To explain my problem, here is my dataframe:
df = pd.DataFrame({
    'n':    [0, 1, 2, 0, 1, 2, 0, 1, 2],
    'col1': ['A', 'A', 'A', 'B', 'B', 'B', 'C', 'C', 'C'],
    'col2': [9.6, 10.4, 11.2, 3.3, 6, 4, 1.94, 15.44, 6.17]
})
It displays as follows:
n col1 col2
0 0 A 9.60
1 1 A 10.40
2 2 A 11.20
3 0 B 3.30
4 1 B 6.00
5 2 B 4.00
6 0 C 1.94
7 1 C 15.44
8 2 C 6.17
From that dataframe I want to get the following new_df:
0 1 2
col1 A A A
col2 9.6 10.4 11.2
col1 B B B
col2 3.3 6.0 4.0
col1 C C C
col2 1.94 15.44 6.17
What I tried so far:
new_df = df.values.reshape(3, 9)
new_w = [x.reshape(3,3).T for x in new_df]
df_1 = pd.DataFrame(new_w[0])
df_1.index = ['n', 'col1', 'col2']
df_2 = pd.DataFrame(new_w[1])
df_2.index = ['n', 'col1', 'col2']
df_3 = pd.DataFrame(new_w[2])
df_3.index = ['n', 'col1', 'col2']
new_df = df_1.append(df_2)
new_df = new_df.append(df_3)
new_df[new_df.index!='n']
The code works, but it is long; I would like a shorter solution.
Any help from your side will be highly appreciated, thanks.
Identify the unique values in "col1" with factorize, then melt to combine the two columns and pivot:
(df.assign(idx=pd.factorize(df['col1'])[0]).melt(['n', 'idx'])
   .pivot(index=['idx', 'variable'], columns='n', values='value')
   .droplevel('idx').rename_axis(index=None, columns=None)  # optional cleanup
)
Or with groupby.cumcount:
(df.assign(idx=df.groupby('n').cumcount()).melt(['n', 'idx'])
   .pivot(index=['idx', 'variable'], columns='n', values='value')
   .droplevel('idx').rename_axis(index=None, columns=None)
)
Output:
0 1 2
col1 A A A
col2 9.6 10.4 11.2
col1 B B B
col2 3.3 6.0 4.0
col1 C C C
col2 1.94 15.44 6.17
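If your pandas predates list arguments to pivot (added around 1.1, to the best of my knowledge), the same reshape works with set_index plus unstack; a sketch under that assumption:

(df.assign(idx=pd.factorize(df['col1'])[0]).melt(['n', 'idx'])
   .set_index(['idx', 'variable', 'n'])['value'].unstack()
   .droplevel('idx').rename_axis(index=None, columns=None)
)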
In the following method I extract 3 dataframes so that I can concatenate them later. I have to do a bit of manipulation to get it into the correct format:
Select every 3 rows
Transpose these 3 rows
Get the column names from the first row
Remove the first row
Append it to a list
Once I have the 3 dataframes in a list, they can be concatenated using pd.concat
Code:
t_df = []
for i in range(int(len(df) / 3)):
    temp = df.iloc[i*3:(i+1)*3].T
    temp.columns = temp.iloc[0]
    temp = temp[1:]
    t_df.append(temp)
new_df = pd.concat(t_df)
print(new_df)
Output:
n 0 1 2
col1 A A A
col2 9.6 10.4 11.2
col1 B B B
col2 3.3 6.0 4.0
col1 C C C
col2 1.94 15.44 6.17
The logic is:
Group by "col1" and iterate the grouper.
Transpose the sub_group obtained in iteration.
Concat all transposed sub_groups.
df_arr = []
for key, sub_df in df.groupby("col1"):
    df_arr.append(sub_df.set_index("n").T)
df = pd.concat(df_arr).rename_axis("", axis="columns")
Output:
0 1 2
col1 A A A
col2 9.6 10.4 11.2
col1 B B B
col2 3.3 6.0 4.0
col1 C C C
col2 1.94 15.44 6.17
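For what it's worth, starting from the original df, the same groupby idea compresses into a single concat over the groups (my own condensation of the answer above):

new_df = pd.concat(g.set_index('n').T for _, g in df.groupby('col1')).rename_axis(columns=None)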

merge two dataframes on common cell values of different columns

I have two dataframes
df1 = pd.DataFrame({'col1': [1,2,3], 'col2': [4,5,6]})
df2 = pd.DataFrame({'col3': [1,5,3]})
and would like to left merge df1 to df2. I don't have a fixed merge column in df1 though. I would like to merge on col1 if the cell value of col1 exists in df2.col3 and on col2 if the cell value of col2 exists in df2.col3. So in the above example merge on col1, col2 and then col1. (This is just an example, I actually have more than only two columns).
I could do this but I'm not sure if it's ok.
df1 = df1.assign(merge_col = np.where(df1.col1.isin(df2.col3), df1.col1, df1.col2))
df1.merge(df2, left_on='merge_col', right_on='col3', how='left')
Are there any better ways to solve it?
Perform the merges in the preferred order, and use combine_first to combine the merges:
(df1.merge(df2, left_on='col1', right_on='col3', how='left')
    .combine_first(df1.merge(df2, left_on='col2', right_on='col3', how='left'))
)
For a generic method with many columns:
from functools import reduce

cols = ['col1', 'col2']
out = reduce(
    lambda a, b: a.combine_first(b),
    [df1.merge(df2, left_on=col, right_on='col3', how='left')
     for col in cols]
)
Output:
col1 col2 col3
0 1 4 1.0
1 2 5 5.0
2 3 6 3.0
Better example:
Adding another column to df2 to illustrate the merge:
df2 = pd.DataFrame({'col3': [1,5,3], 'new': ['A', 'B', 'C']})
Output:
col1 col2 col3 new
0 1 4 1.0 A
1 2 5 5.0 B
2 3 6 3.0 C
I think your solution can be modified: build the merge key as a Series by comparing all columns from the list, then merge on this Series:
cols = ['col1', 'col2']
s = df1[cols].where(df1[cols].isin(df2.col3.tolist())).bfill(axis=1).iloc[:, 0]
print (s)
0 1.0
1 5.0
2 3.0
Name: col1, dtype: float64
df = df1.merge(df2, left_on=s, right_on='col3', how='left')
print (df)
col1 col2 col3
0 1 4 1
1 2 5 5
2 3 6 3
Your solution with helper column:
cols = ['col1', 'col2']
df1 = df1.assign(merge_col=df1[cols].where(df1[cols].isin(df2.col3.tolist()))
                                    .bfill(axis=1).iloc[:, 0])
df = df1.merge(df2, left_on='merge_col', right_on='col3', how='left')
print (df)
col1 col2 merge_col col3
0 1 4 1.0 1
1 2 5 5.0 5
2 3 6 3.0 3
Explanation of s: compare all columns with DataFrame.isin, create missing values where there is no match with DataFrame.where, and for merge priority back-fill the missing values and select the first column by position:
print (df1[cols].isin(df2.col3.tolist()))
col1 col2
0 True False
1 False True
2 True False
print (df1[cols].where(df1[cols].isin(df2.col3.tolist())))
col1 col2
0 1.0 NaN
1 NaN 5.0
2 3.0 NaN
print (df1[cols].where(df1[cols].isin(df2.col3.tolist())).bfill(axis=1))
col1 col2
0 1.0 NaN
1 5.0 5.0
2 3.0 NaN
print (df1[cols].where(df1[cols].isin(df2.col3.tolist())).bfill(axis=1).iloc[:, 0])
0 1.0
1 5.0
2 3.0
Name: col1, dtype: float64
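A subtlety behind the .tolist() calls above: DataFrame.isin with a Series aligns on the index and tests element-wise equality rather than membership, so passing the raw column can silently give different answers. A small illustration, with the values deliberately reordered:

vals = pd.Series([3, 5, 1])
print(df1[['col1']].isin(vals.tolist()))  # membership test: True, False, True
print(df1[['col1']].isin(vals))           # index-aligned equality: all False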

Getting data from one column based on the value of another column

I am having trouble coming up with an algorithm for the following problem:
I have two data frames, df1 and df2 (the following are just an example):
import pandas as pd
df1 = pd.DataFrame({'Col1': [1, 7, 10, 50, 73, 80 ], 'Col2': [1,2,3,4,5,6]})
df2 = pd.DataFrame({'Col1': [0, 4, 10, 80], 'Col3': [7,6,8,9]})
As you can see, they both have Col1; the values don't always coincide, but they are in ascending order. I want to create a function that will create a new column on df1, let's call it Col4. The values in this column have to come from df2 following these rules:
1) If df1 and df2 have the same value in Col1, the value in Col4 should be the corresponding value in Col3.
2) If they do not share the same value in Col1, Col4 should be the average of the Col3 values corresponding to the values immediately before and after it.
For example:
As df2 does not have a value in Col1 for 1, the first entry in Col4 should be the average between 7 and 6 (1 is between 0 and 4).
I don't know if I made myself very clear, but the final result for Col4 should be:
(7+6)/2, (6+8)/2, 8, (8+9)/2, (8+9)/2, 9
It would be nice to have a function because I will have to make this operation on many different data frames.
I know it is a weird problem, but thanks for the help!
You can accomplish what you want with pandas.merge_asof
You merge df1 with df2 on Col1 in both directions, forward and backward. Then you simply average the results. I've concatenated the two merges into one df column-wise and renamed the columns so they don't wind up with the same names.
import pandas as pd
df = pd.concat([pd.merge_asof(df1, df2, on='Col1').rename(columns={'Col3': 'Col4_1'}),
                pd.merge_asof(df1, df2, on='Col1', direction='forward')[['Col3']]
                    .rename(columns={'Col3': 'Col4_2'})],
               axis=1)
print(df)
# Col1 Col2 Col4_1 Col4_2
#0 1 1 7 6
#1 7 2 6 8
#2 10 3 8 8
#3 50 4 8 9
#4 73 5 8 9
#5 80 6 9 9
# Calculate the average you want, drop helper columns.
df['Col4'] = (df.Col4_1 + df.Col4_2)/2
df.drop(columns=['Col4_1', 'Col4_2'], inplace=True)
print(df)
# Col1 Col2 Col4
#0 1 1 6.5
#1 7 2 7.0
#2 10 3 8.0
#3 50 4 8.5
#4 73 5 8.5
#5 80 6 9.0
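Since the question asks for something reusable across many data frames, here is a minimal sketch that wraps the same two-directional merge_asof logic in a function; the function name and keyword parameters are my own invention, not part of the answer above:

def interpolated_lookup(left, right, on='Col1', value='Col3', out='Col4'):
    # average the backward and forward nearest matches;
    # an exact match gives the same value in both directions
    back = pd.merge_asof(left, right, on=on)[value]
    fwd = pd.merge_asof(left, right, on=on, direction='forward')[value]
    return left.assign(**{out: (back + fwd) / 2})

df1 = interpolated_lookup(df1, df2)

As with the code above, merge_asof requires both frames sorted on the key, which the question guarantees.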

python pandas DataFrame iterate through rows and compare two columns and apply function

I have a DataFrame with two columns:
df:
ix Col1 Col2
1 11.0 'JPY'
2 51.0 'EUR'
..
1,000,000 27.0 'CAD'
I have a list of currencies l1 = ['JPY','EUR',...,'CAD']
I have a list of conversions l2 = [5.0, 1.0, ..., 0.5]
I have a function as well that I created:
def convert_currency(symbol, amount):
    index_value = list_of_symbols.index(symbol)
    rate = list_of_values[index_value]
    converted = amount * rate
    return converted
and I would like to apply this funcion as follows:
for index, row in df.iterrows():
    if row['currency'] != 'GBP':
        row['price_inc'] = convert_currency(row['currency'], row['price_inc'])
but it does not work.
What would be a fast, working solution that applies a function to the Col1 values based on the Col2 values, where the function takes the Col1 value and returns the value that replaces it?
IIUC you can use the following vectorized approach:
Source data sets:
In [108]: d1
Out[108]:
ix Col1 Col2
0 1 11.0 JPY
1 2 51.0 EUR
2 3 27.0 CAD
In [109]: l1 = ['JPY','EUR','CAD']
In [110]: l2 = [5.0, 1.0, 0.5]
Helper "exchange rate" Series:
In [111]: d2 = pd.Series(l2, l1)
In [112]: d2
Out[112]:
JPY 5.0
EUR 1.0
CAD 0.5
dtype: float64
Solution:
In [113]: d1.Col1 *= d1.Col2.map(d2)
In [114]: d1
Out[114]:
ix Col1 Col2
0 1 55.0 JPY
1 2 51.0 EUR
2 3 13.5 CAD
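A caveat worth adding (my note, not part of the answer): Col2 values missing from the exchange-rate Series become NaN after map, which would wipe out the corresponding Col1 amounts. If, as in the question, currencies such as GBP should stay untouched, a sketch:

rate = d1.Col2.map(d2)
d1.Col1 *= rate.fillna(1.0)  # unmapped currencies keep their original amount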
I'm not sure I fully understand, but it seems you want to multiply Col1 by a rate that differs per value of Col2. I would recommend using apply to create a new column, 'rate', holding the corresponding rate for each Col2 row; multiplying Col1 by 'rate' then gives the result. Here's some working code. I chose to store the mapping between Col2 and rate in a dictionary (whereas you use two lists), but the idea is the same.
df = pd.DataFrame([[11.0, 'JPY'], [51.0, 'EUR'], [27.0, 'CAD']], columns=['Col1', 'Col2'])
mydict = {'JPY': 5.0, 'EUR': 1.0, 'CAD': 0.5}

def get_rate(symbol):
    return mydict[symbol]

df['rate'] = df['Col2'].apply(get_rate)
df['price_inc'] = df['Col1'] * df['rate']
Out[87]:
Col1 Col2 rate price_inc
0 11.0 JPY 5.0 55.0
1 51.0 EUR 1.0 51.0
2 27.0 CAD 0.5 13.5
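As a side note, a dictionary maps directly with Series.map, so the helper function is optional. A minimal sketch that builds the mapping from the question's two lists (assuming they align position by position):

rates = dict(zip(l1, l2))  # e.g. {'JPY': 5.0, 'EUR': 1.0, 'CAD': 0.5}
df['price_inc'] = df['Col1'] * df['Col2'].map(rates)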

what would be a good approach to store a dictionary of dataframes in MySQL?

I have a dictionary where each key corresponds to a 3-row dataframe (all of the same kind). What would be a good way of storing this data in MySQL?
I am using python, pandas.
Thanks for any help!
Edit:
Here is the format of each dataframe
Col1 Col2
1 A 0.2
2 B 0.3
3 C 0.25
The purpose of the data is for searching. When we request by a key, I want to get all the information stored in its associated dataframe. Storing it in one table would be enough for future usage.
Consider the following approach:
Setup:
In [10]: df1
Out[10]:
Col1 Col2
1 A 0.20
2 B 0.30
3 C 0.25
In [11]: df2
Out[11]:
Col1 Col2
1 A 10.20
2 B 10.30
3 C 10.25
In [12]: df3
Out[12]:
Col1 Col2
1 A 20.20
2 B 20.30
3 C 20.25
In [13]: dfs = {'df1':df1, 'df2':df2, 'df3':df3}
We can merge all of our DFs into one DF and add additional column containing the key:
In [15]: df = pd.concat([df.assign(idx=key) for key, df in dfs.items()], ignore_index=True)
In [16]: df
Out[16]:
Col1 Col2 idx
0 A 0.20 df1
1 B 0.30 df1
2 C 0.25 df1
3 A 10.20 df2
4 B 10.30 df2
5 C 10.25 df2
6 A 20.20 df3
7 B 20.30 df3
8 C 20.25 df3
and save it to MySQL:
df.to_sql('table_name', engine, if_exists='...')
PS: I would also consider adding an index on the 'idx' column in order to speed up access/searching in MySQL...
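To round this out, a minimal sketch of writing and then reading back one frame by key; the connection string, table name, and key value are placeholders, and the parameter style assumes a pymysql driver:

import pandas as pd
from sqlalchemy import create_engine

engine = create_engine('mysql+pymysql://user:password@localhost/dbname')  # hypothetical credentials
df.to_sql('table_name', engine, if_exists='replace', index=False)

# fetch the rows stored under key 'df2'
df2_back = pd.read_sql_query(
    "SELECT Col1, Col2 FROM table_name WHERE idx = %(key)s",
    engine, params={'key': 'df2'})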
