Python pandas: apply on separated values

How can I sum values in a dataframe that are separated by a semicolon?
Got:
            col1   col2
2018-03-05   2.1      8
2018-03-06     8  3.1;2
2018-03-07   1;1    8;1
Need:
            col1  col2
2018-03-05   2.1     8
2018-03-06     8   5.1
2018-03-07     2     9

You can use apply to process each column: split on ;, cast to float, and sum the parts per row:
df = df.apply(lambda x: x.str.split(';', expand=True).astype(float).sum(axis=1))
Or process each value separately by applymap:
df = df.applymap(lambda x: sum(map(float, x.split(';'))))
print (df)
            col1  col2
2018-03-05   2.1   8.0
2018-03-06   8.0   5.1
2018-03-07   2.0   9.0
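For reference, a minimal runnable sketch of the setup, assuming all values are stored as strings as in the displayed example (the index labels are taken from the question):
import pandas as pd

# All cells are strings: the semicolon-separated entries force object dtype
df = pd.DataFrame({'col1': ['2.1', '8', '1;1'],
                   'col2': ['8', '3.1;2', '8;1']},
                  index=['2018-03-05', '2018-03-06', '2018-03-07'])
df = df.applymap(lambda x: sum(map(float, x.split(';'))))
print (df)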
EDIT:
If there is a mix of numeric and string columns, it is possible to use select_dtypes to exclude the numeric ones and work only on the string columns containing ;:
print (df)
            col1   col2  col3
2018-03-05   2.1      8     1
2018-03-06     8  3.1;2     2
2018-03-07   1;1    8;1     8
import numpy as np

cols = df.select_dtypes(exclude=np.number).columns
df[cols] = df[cols].apply(lambda x: x.str.split(';', expand=True).astype(float).sum(axis=1))
print (df)
            col1  col2  col3
2018-03-05   2.1   8.0     1
2018-03-06   8.0   5.1     2
2018-03-07   2.0   9.0     8

You can use numpy.vectorize if performance is an issue:
res = pd.DataFrame(np.vectorize(lambda x: sum(map(float, x.split(';'))))(df.values),
                   columns=df.columns, index=df.index)
Performance benchmarking
def jpp(df):
    res = pd.DataFrame(np.vectorize(lambda x: sum(map(float, x.split(';'))))(df.values),
                       columns=df.columns, index=df.index)
    return res

def jez(df):
    return df.applymap(lambda x: sum(map(float, x.split(';'))))

df = pd.concat([df]*1000)

%timeit jpp(df)  # 11 ms per loop
%timeit jez(df)  # 21.3 ms per loop
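Note that on newer pandas (2.1+), DataFrame.applymap is deprecated in favour of DataFrame.map, so the elementwise variant would then read:
# pandas >= 2.1: DataFrame.map is the elementwise equivalent of the deprecated applymap
df = df.map(lambda x: sum(map(float, x.split(';'))))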

You can use:
df['col2'] = df.col2.map(lambda s: sum(float(e) for e in s.split(';')))

Related

How to get a new df constituted by partially transposed fragments of another dataframe

I am struggling to transpose my dataframe: not simply transposed, but with the number of columns limited to the number of rows in each index slice. To explain my problem clearly, here is my dataframe:
df = pd.DataFrame({
    'n':    [0, 1, 2, 0, 1, 2, 0, 1, 2],
    'col1': ['A', 'A', 'A', 'B', 'B', 'B', 'C', 'C', 'C'],
    'col2': [9.6, 10.4, 11.2, 3.3, 6, 4, 1.94, 15.44, 6.17]
})
It has the following display:
   n col1   col2
0  0    A   9.60
1  1    A  10.40
2  2    A  11.20
3  0    B   3.30
4  1    B   6.00
5  2    B   4.00
6  0    C   1.94
7  1    C  15.44
8  2    C   6.17
From that dataframe I want to get the following new_df:
0 1 2
col1 A A A
col2 9.6 10.4 11.2
col1 B B B
col2 3.3 6.0 4.0
col1 C C C
col2 1.94 15.44 6.17
What I tried so far :
new_df = df.values.reshape(3, 9)
new_w = [x.reshape(3,3).T for x in new_df]
df_1 = pd.DataFrame(new_w[0])
df_1.index = ['n', 'col1', 'col2']
df_2 = pd.DataFrame(new_w[1])
df_2.index = ['n', 'col1', 'col2']
df_3 = pd.DataFrame(new_w[2])
df_3.index = ['n', 'col1', 'col2']
new_df = df_1.append(df_2)
new_df = new_df.append(df_3)
new_df[new_df.index!='n']
The code I tried works, but it looks long; I would like a shorter solution.
Any help will be highly appreciated, thanks.
Identify the unique values in "col1" with factorize, then melt to combine the two columns and pivot:
(df.assign(idx=pd.factorize(df['col1'])[0])
   .melt(['n', 'idx'])
   .pivot(index=['idx', 'variable'], columns='n', values='value')
   .droplevel('idx').rename_axis(index=None, columns=None)  # optional
)
Or with groupby.cumcount:
(df.assign(idx=df.groupby('n').cumcount())
   .melt(['n', 'idx'])
   .pivot(index=['idx', 'variable'], columns='n', values='value')
   .droplevel('idx').rename_axis(index=None, columns=None)
)
Output:
0 1 2
col1 A A A
col2 9.6 10.4 11.2
col1 B B B
col2 3.3 6.0 4.0
col1 C C C
col2 1.94 15.44 6.17
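In both variants above, the helper idx column only labels the three row blocks so that pivot gets a unique (idx, variable) index; a quick check of the two helpers on this data (the values in the comments are what they return here):
print (pd.factorize(df['col1'])[0])           # [0 0 0 1 1 1 2 2 2]
print (df.groupby('n').cumcount().tolist())   # [0, 0, 0, 1, 1, 1, 2, 2, 2]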
In the following method I extract 3 dataframes so that I can concatenate them later. I have to do a bit of manipulation to get it into the correct format:
Select every 3 rows
Transpose these 3 rows
Get the column names from the first row
Remove the first row
Append it to a list
Once I have the 3 dataframes in a list, they can be concatenated using pd.concat
Code:
t_df = []
for i in range(int(len(df) / 3)):
    temp = df.iloc[i*3:(i+1)*3].T
    temp.columns = temp.iloc[0]
    temp = temp[1:]
    t_df.append(temp)
new_df = pd.concat(t_df)
print(new_df)
Output:
n 0 1 2
col1 A A A
col2 9.6 10.4 11.2
col1 B B B
col2 3.3 6.0 4.0
col1 C C C
col2 1.94 15.44 6.17
The logic is:
Group by "col1" and iterate over the groups.
Transpose the sub_group obtained in each iteration.
Concat all the transposed sub_groups.
df_arr = []
for key, sub_df in df.groupby("col1"):
    df_arr.append(sub_df.set_index("n").T)
df = pd.concat(df_arr).rename_axis("", axis="columns")
Output:
0 1 2
col1 A A A
col2 9.6 10.4 11.2
col1 B B B
col2 3.3 6.0 4.0
col1 C C C
col2 1.94 15.44 6.17
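The same groupby logic can also be written as a single concat over a comprehension (a condensed sketch of the loop above, not a different technique):
new_df = pd.concat([sub_df.set_index('n').T for _, sub_df in df.groupby('col1', sort=False)])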

merge two dataframes on common cell values of different columns

I have two dataframes
df1 = pd.DataFrame({'col1': [1,2,3], 'col2': [4,5,6]})
df2 = pd.DataFrame({'col3': [1,5,3]})
and would like to left merge df1 to df2. I don't have a fixed merge column in df1 though. I would like to merge on col1 if the cell value of col1 exists in df2.col3 and on col2 if the cell value of col2 exists in df2.col3. So in the above example merge on col1, col2 and then col1. (This is just an example, I actually have more than only two columns).
I could do this but I'm not sure if it's ok.
df1 = df1.assign(merge_col = np.where(df1.col1.isin(df2.col3), df1.col1, df1.col2))
df1.merge(df2, left_on='merge_col', right_on='col3', how='left')
Are there any better ways to solve it?
Perform the merges in the preferred order, and use combine_first to combine the merges:
(df1.merge(df2, left_on='col1', right_on='col3', how='left')
    .combine_first(df1.merge(df2, left_on='col2', right_on='col3', how='left'))
)
For a generic method with many columns:
from functools import reduce

cols = ['col1', 'col2']
out = reduce(
    lambda a, b: a.combine_first(b),
    [df1.merge(df2, left_on=col, right_on='col3', how='left') for col in cols]
)
Output:
col1 col2 col3
0 1 4 1.0
1 2 5 5.0
2 3 6 3.0
Better example:
Adding another column to df2 to illustrate the merge:
df2 = pd.DataFrame({'col3': [1,5,3], 'new': ['A', 'B', 'C']})
Output:
col1 col2 col3 new
0 1 4 1.0 A
1 2 5 5.0 B
2 3 6 3.0 C
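To see why combine_first gives the desired priority here, a small sketch of the two intermediate merges on the example data (m1 and m2 are just illustrative names):
m1 = df1.merge(df2, left_on='col1', right_on='col3', how='left')  # matches rows 0 and 2
m2 = df1.merge(df2, left_on='col2', right_on='col3', how='left')  # matches row 1
out = m1.combine_first(m2)  # NaN cells of m1 are filled from m2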
I think your solution can be modified: build a merge Series by comparing all columns from the list, then merge on that Series:
cols = ['col1', 'col2']
s = df1[cols].where(df1[cols].isin(df2.col3)).bfill(axis=1).iloc[:, 0]
print (s)
0 1.0
1 5.0
2 3.0
Name: col1, dtype: float64
df = df1.merge(df2, left_on=s, right_on='col3', how='left')
print (df)
col1 col2 col3
0 1 4 1
1 2 5 5
2 3 6 3
Your solution with helper column:
cols = ['col1', 'col2']
df1 = (df1.assign(merge_col=df1[cols].where(df1[cols].isin(df2.col3))
                                     .bfill(axis=1).iloc[:, 0]))
df = df1.merge(df2, left_on='merge_col', right_on='col3', how='left')
print (df)
col1 col2 merge_col col3
0 1 4 1.0 1
1 2 5 5.0 5
2 3 6 3.0 3
Explanation of s: compare all columns with DataFrame.isin, create missing values where there is no match with DataFrame.where, and for merge priority back-fill the missing values and select the first column by position:
print (df1[cols].isin(df2.col3))
col1 col2
0 True False
1 False True
2 True False
print (df1[cols].where(df1[cols].isin(df2.col3)))
col1 col2
0 1.0 NaN
1 NaN 5.0
2 3.0 NaN
print (df1[cols].where(df1[cols].isin(df2.col3)).bfill(axis=1))
col1 col2
0 1.0 NaN
1 5.0 5.0
2 3.0 NaN
print (df1[cols].where(df1[cols].isin(df2.col3)).bfill(axis=1).iloc[:, 0])
0 1.0
1 5.0
2 3.0
Name: col1, dtype: float64

Append min value of two columns in pandas data frame

df
Purchase
1
3
2
5
4
7
df2
df2 = pd.DataFrame(columns=['Mean','Median','Max','Col4'])
df2 = df2.append({'Mean': (df['Purchase'].mean()),'Median':df['Purchase'].median(),'Max':(df['Purchase'].max()),'Col4':(df2[['Mean','Median']].min(axis=1))}, ignore_index=True)
Output obtained
Mean Median Max Col4
3.66 3.5 7 Series([], dtype: float64)
Output expected
Mean Median Max Col4
3.66 3.5 7 3.5   # Value in Col4 is min(Mean, Median) of df2
Can anyone help?
Use np.minimum and pass the mean and the median:
import numpy as np

df2 = pd.DataFrame(columns=['Mean', 'Median', 'Max', 'Col4'])
df2 = (df2.append({'Mean': df['Purchase'].mean(),
                   'Median': df['Purchase'].median(),
                   'Max': df['Purchase'].max(),
                   'Col4': np.minimum(df['Purchase'].mean(), df['Purchase'].median())},
                  ignore_index=True))
print (df2)
Mean Median Max Col4
0 3.666667 3.5 7.0 3.5
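Note that DataFrame.append was deprecated in pandas 1.4 and removed in 2.0; on current versions the same one-row append can be sketched with concat:
row = {'Mean': df['Purchase'].mean(),
       'Median': df['Purchase'].median(),
       'Max': df['Purchase'].max(),
       'Col4': np.minimum(df['Purchase'].mean(), df['Purchase'].median())}
# concat a one-row frame instead of the removed append
df2 = pd.concat([df2, pd.DataFrame([row])], ignore_index=True)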
Or better, use Series.agg, add the min as a new value in the next step, and finally create a one-row DataFrame:
s = df['Purchase'].agg(['mean','median','max'])
s.loc['col4'] = s[['mean','median']].min()
df = s.to_frame(0).T
print (df)
mean median max col4
0 3.666667 3.5 7.0 3.5

How to convert objects to numeric

I have very inconsistent data in one of my DataFrame columns:
col1
12.0
13,1
NaN
20.3
abc
"12,5"
200.9
I need to standardize these data and find a maximum value among numeric values, which should be less than 100.
This is my code:
df["col1"] = df["col1"].apply(lambda x: float(str(x).replace(',', '.')) if x.isdigit() else x)
num_temps = pd.to_numeric(df["col1"], errors='coerce')
temps = num_temps[num_temps < 100]
print(temps.max())
It fails when, for example, x is a float: AttributeError: 'float' object has no attribute 'isdigit'.
Cast the value to string with str(x), but then for the test it is also necessary to replace . and , with an empty string so isdigit can be used:
df["col1"] = df["col1"].apply(lambda x: float(str(x).replace(',', '.')) if str(x).replace(',', '').replace('.', '').isdigit() else x)
But here it is possible to cast the values to strings and then use Series.str.replace:
num_temps = pd.to_numeric(df["col1"].astype(str).str.replace(',', '.'), errors='coerce')
print (df)
col1
0 12.0
1 13.1
2 NaN
3 20.3
4 NaN
5 12.5
6 200.9
temps = num_temps[num_temps<100]
print(temps.max())
20.3
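Putting it together, a minimal end-to-end sketch with the sample values from the question (the NaN entry written as None):
import pandas as pd

df = pd.DataFrame({'col1': ['12.0', '13,1', None, '20.3', 'abc', '12,5', '200.9']})
# replace decimal commas, coerce everything non-numeric to NaN, then filter and take the max
num_temps = pd.to_numeric(df['col1'].astype(str).str.replace(',', '.'), errors='coerce')
print (num_temps[num_temps < 100].max())   # 20.3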
Alternative:
import numpy as np

def f(x):
    try:
        return float(str(x).replace(',', '.'))
    except ValueError:
        return np.nan

num_temps = df["col1"].apply(f)
print (num_temps)
0 12.0
1 13.1
2 NaN
3 20.3
4 NaN
5 12.5
6 200.9
Name: col1, dtype: float64
This works:
df.replace(",", ".", regex=True).replace("[a-zA-Z]+", np.NaN, regex=True).dropna().max()

python pandas DataFrame iterate through rows and compare two columns and apply function

I have a DataFrame with two columns:
df:
ix Col1 Col2
1 11.0 'JPY'
2 51.0 'EUR'
..
1000,000 27.0 'CAD'
I have a list of currencies l1 = ['JPY','EUR',...,'CAD']
I have a list of conversions l2 = [5.0, 1.0, ..., 0.5]
I have a function as well that I created:
def convert_currency(symbol, amount):
    index_value = list_of_symbols.index(symbol)
    rate = list_of_values[index_value]
    converted = amount * rate
    return converted
and I would like to apply this function as follows:
for index, row in df.iterrows():
    if row['currency'] != 'GBP':
        row['price_inc'] = convert_currency(row['currency'], row['price_inc'])
but it does not work.
What would be a fast, working solution to apply a function to Col1 values based on the Col2 values, where the function takes a Col1 value and returns a value that replaces it?
IIUC you can use the following vectorized approach:
Source data sets:
In [108]: d1
Out[108]:
ix Col1 Col2
0 1 11.0 JPY
1 2 51.0 EUR
2 3 27.0 CAD
In [109]: l1 = ['JPY','EUR','CAD']
In [110]: l2 = [5.0, 1.0, 0.5]
Helper "exchange rate" Series:
In [111]: d2 = pd.Series(l2, l1)
In [112]: d2
Out[112]:
JPY 5.0
EUR 1.0
CAD 0.5
dtype: float64
Solution:
In [113]: d1.Col1 *= d1.Col2.map(d2)
In [114]: d1
Out[114]:
ix Col1 Col2
0 1 55.0 JPY
1 2 51.0 EUR
2 3 13.5 CAD
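If some currencies (such as the GBP rows mentioned in the question) have no rate and should stay unchanged, missing rates can default to 1, a small extension of the map approach above:
# currencies missing from the rate Series keep their original amount
d1.Col1 *= d1.Col2.map(d2).fillna(1)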
I'm not sure I fully understand, but it seems you want to multiply Col1 by some rate that differs for different values of Col2. I would recommend using an apply function to create a new column 'rate', holding the corresponding rate for each Col2 row, and then multiplying Col1 by 'rate'. Here's some working code. I chose to store the mapping between Col2 and rate in a dictionary (whereas you use two lists), but the idea is the same.
df = pd.DataFrame([[11.0, 'JPY'], [51.0, 'EUR'], [27.0, 'CAD']], columns=['Col1', 'Col2'])
mydict = {'JPY': 5.0, 'EUR': 1.0, 'CAD': 0.5}

def get_rate(symbol):
    return mydict[symbol]

df['rate'] = df['Col2'].apply(get_rate)
df['price_inc'] = df['Col1'] * df['rate']
Out[87]:
Col1 Col2 rate price_inc
0 11.0 JPY 5.0 55.0
1 51.0 EUR 1.0 51.0
2 27.0 CAD 0.5 13.5
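As a side note, the helper function is not strictly required; Series.map accepts the dictionary directly (a variant of the answer above, not the original author's code):
df['rate'] = df['Col2'].map(mydict)
df['price_inc'] = df['Col1'] * df['rate']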
