I have two pandas dataframes that look like this:
df1
id 10 20 30
0 2020-04-01-10001 0.0 0.000000 0.000000
1 2020-04-01-10003 0.0 0.026587 0.053174
2 2020-04-01-10005 0.0 0.030884 0.061768
3 2020-04-01-1001 0.0 0.035875 0.071751
4 2020-04-01-1003 0.0 0.041673 0.083346
...
df2
10 20 30
id
2020-04-01-10001 0.15 0.17 0.18
2020-04-01-10003 0.55 0.57 0.61
2020-04-01-10005 0.36 0.37 0.38
2020-04-01-1001 0.00 0.00 0.02
2020-04-01-1003 0.00 0.02 0.04
...
I want to use rows of df1 to update the rows of df2, so I am trying to use df2.update(df1). Unfortunately, this doesn't change the rows of df2. I suspect it has something to do with the extra numbers on the left side of df1; I'm not sure what those are. I also noticed that the two tables give different results when I run df.to_dict():
{'10': {0: 0.0, 1: 0.0, 2: 0.0, 3: 0.0, 4: 0.0}, ...
{'10': {'2020-04-01-10001': 0.15,
'2020-04-01-10003': 0.55,
'2020-04-01-10005': 0.36,
'2020-04-01-1001': 0.0,
'2020-04-01-1003': 0.0},...
How would I go about converting df1 to the format of df2?
Use set_index with replace and fillna, so the two frames align on id:
import numpy as np

# make 'id' the index so df1 lines up with df2, then fill the zeros from df2
df1 = df1.set_index('id').replace(0, np.nan).fillna(df2)
10 20 30
id
2020-04-01-10001 0.15 0.170000 0.180000
2020-04-01-10003 0.55 0.026587 0.053174
2020-04-01-10005 0.36 0.030884 0.061768
2020-04-01-1001 0.00 0.035875 0.071751
2020-04-01-1003 0.00 0.041673 0.083346
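If the goal really is to update df2 in place, the reason df2.update(df1) did nothing is that update aligns on the index, and df1 still has its default RangeIndex (those are the "extra numbers on the left"). A minimal sketch:

# give df1 the same 'id' index as df2 so update can align the rows
df2.update(df1.set_index('id'))

Note that update overwrites df2 with every non-NA value from df1, zeros included, while the replace/fillna approach above keeps df2's values wherever df1 is 0.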
I am trying to take a "given" value and match it to a "year" in the same row using the following dataframe:
data = {
'Given' : [0.45, 0.39, 0.99, 0.58, None],
'Year 1' : [0.25, 0.15, 0.3, 0.23, 0.25],
'Year 2' : [0.39, 0.27, 0.55, 0.3, 0.4],
'Year 3' : [0.43, 0.58, 0.78, 0.64, 0.69],
'Year 4' : [0.65, 0.83, 0.95, 0.73, 0.85],
'Year 5' : [0.74, 0.87, 0.99, 0.92, 0.95]
}
df = pd.DataFrame(data)
print(df)
Output:
Given Year 1 Year 2 Year 3 Year 4 Year 5
0 0.45 0.25 0.39 0.43 0.65 0.74
1 0.39 0.15 0.27 0.58 0.83 0.87
2 0.99 0.30 0.55 0.78 0.95 0.99
3 0.58 0.23 0.30 0.64 0.73 0.92
4 NaN 0.25 0.40 0.69 0.85 0.95
However, the matching process has a few caveats. I am trying to match the given value to the closest year, and then count the years until the first "year" above 70%. So row 0 would match to "Year 3" (0.43 is closest to 0.45), and we can see in the same row that it takes two years to reach "Year 5", the first occurrence in the row above 70%.
For any "given" value already above 70%, we can just output "full", and for any "given" values that don't contain data, we can just output the first year above 70%. The output will look like the following:
Given Year 1 Year 2 Year 3 Year 4 Year 5 Output
0 0.45 0.25 0.39 0.43 0.65 0.74 2
1 0.39 0.15 0.27 0.58 0.83 0.87 2
2 0.99 0.30 0.55 0.78 0.95 0.99 full
3 0.58 0.23 0.30 0.64 0.73 0.92 1
4 NaN 0.25 0.40 0.69 0.85 0.95 4
It has taken me a horrendously long time to clean up this data, so at the moment I can't think of a way to begin other than some use of .abs() for the matching. All help appreciated.
Vectorized Pandas Approach:
Transpose, reset_index(drop=True), and transpose back, so that both frames carry the same positional column labels and can be subtracted from each other in a vectorized way. pd.concat() with * builds a dataframe that repeats the Given column, so you can take the absolute difference of whole dataframes instead of looping through columns.
Use idxmax and idxmin to identify the column numbers according to your criteria.
Use np.select according to your criteria.
import pandas as pd
import numpy as np

# identify the first column above 70% (columns renumbered positionally)
pct_70 = (df.T.reset_index(drop=True).T > 0.7).idxmax(axis=1)

# identify the column number with the lowest absolute difference to Given
nearest_col = (df.iloc[:, 1:].T.reset_index(drop=True).T
               - pd.concat([df.iloc[:, 0]] * len(df.columns[1:]), axis=1)
                 .T.reset_index(drop=True).T).abs().idxmin(axis=1)

# generate an output series
output = pct_70 - nearest_col - 1

# conditionally apply the output series
df['Output'] = np.select([output.gt(0), output.lt(0), output.isnull()],
                         [output, 'full', pct_70], np.nan)
df
df
Out[1]:
Given Year 1 Year 2 Year 3 Year 4 Year 5 Output
0 0.45 0.25 0.39 0.43 0.65 0.74 2.0
1 0.39 0.15 0.27 0.58 0.83 0.87 2.0
2 0.99 0.30 0.55 0.78 0.95 0.99 full
3 0.58 0.23 0.30 0.64 0.73 0.92 1.0
4 NaN 0.25 0.40 0.69 0.85 0.95 4
Here you go! A simple row-by-row approach with iterrows:
import numpy as np

def output(df):
    results = []
    for _, row in df.iterrows():
        values = row.to_list()
        given = values[0]
        compare = np.array(values[1:])
        # index of the first year above 70%
        first_70 = np.argmax(compare > 0.7)
        if np.isnan(given):
            results.append(first_70 + 1)
            continue
        if given > 0.7:
            results.append('full')
            continue
        # index of the year closest to the given value
        diff = np.abs(compare - given)
        closest_year = diff.argmin()
        results.append(first_70 - closest_year)
    return results

df['output'] = output(df)
Pretty new to Python and pandas. I have 15,000 values in a column of my dataframe, like this:
col1     col2
5        0.05964
19       0.00325
31       0.0225
12       0.03325
14       0.00525
I want to get output like this:
0.00 to 0.01 = 55 values,
0.01 to 0.02 = 365 values,
0.02 to 0.03 = 5464 values, etc., from 0.00 to 1.00.
I'm a bit lost with groupby, value_counts, etc.
Thanks for the help!
IIUC, use pd.cut:
import numpy as np

# bin col2 into 100 intervals of width 0.01 and sum col1 within each bin
out = df.groupby(pd.cut(df['col2'], np.linspace(0, 1, 101)))['col1'].sum()
print(out)
# Output
col2
(0.0, 0.01] 33
(0.01, 0.02] 0
(0.02, 0.03] 31
(0.03, 0.04] 12
(0.04, 0.05] 0
..
(0.95, 0.96] 0
(0.96, 0.97] 0
(0.97, 0.98] 0
(0.98, 0.99] 0
(0.99, 1.0] 0
Name: col1, Length: 100, dtype: int64
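If col1 is not a count column and you simply want the number of rows falling in each bin, value_counts on the cut gives that directly. A minimal sketch (sort=False keeps the bins in interval order):

# count rows per 0.01-wide bin instead of summing col1
counts = pd.cut(df['col2'], np.linspace(0, 1, 101)).value_counts(sort=False)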
I'm trying to create interaction terms in a dataset. Is there a simpler way of creating interaction terms from combinations of columns, for example columns 4:98 with columns 98:106? I tried looping over the columns using NumPy arrays, but with the following code the kernel keeps dying.
col1 = df.columns[4:98]    # 94 columns
col2 = df.columns[98:106]  # 8 columns
var1_np = df_np[:, 4:98]
var2_np = df_np[:, 98:106]

for i in range(94):
    for j in range(8):
        name = col1[i] + "*" + col2[j]
        df[name] = var1_np[:, i] * var2_np[:, j]
Here, df is the dataframe and df_np is df as a NumPy array.
You could use itertools.product, which is roughly equivalent to nested for-loops in a generator expression. Then use join to build the new column name from each pair of columns, and pandas prod to take the product of the two columns along axis=1 (across the columns).
import pandas as pd
import numpy as np
from itertools import product

# setup
np.random.seed(12345)
data = np.random.rand(5, 10).round(2)
df = pd.DataFrame(data)
df.columns = [f'col_{c}' for c in range(10)]
print(df)

# code
col1 = df.columns[3:5]
col2 = df.columns[5:8]

df_new = pd.DataFrame()
for i in product(col1, col2):
    name = "*".join(i)
    df_new[name] = df[list(i)].prod(axis=1)
print(df_new)
Output from df
col_0 col_1 col_2 col_3 col_4 col_5 col_6 col_7 col_8 col_9
0 0.93 0.32 0.18 0.20 0.57 0.60 0.96 0.65 0.75 0.65
1 0.75 0.96 0.01 0.11 0.30 0.66 0.81 0.87 0.96 0.72
2 0.64 0.72 0.47 0.33 0.44 0.73 0.99 0.68 0.79 0.17
3 0.03 0.80 0.90 0.02 0.49 0.53 0.60 0.05 0.90 0.73
4 0.82 0.50 0.81 0.10 0.22 0.26 0.47 0.46 0.71 0.18
Output from df_new
col_3*col_5 col_3*col_6 col_3*col_7 col_4*col_5 col_4*col_6 col_4*col_7
0 0.1200 0.1920 0.1300 0.3420 0.5472 0.3705
1 0.0726 0.0891 0.0957 0.1980 0.2430 0.2610
2 0.2409 0.3267 0.2244 0.3212 0.4356 0.2992
3 0.0106 0.0120 0.0010 0.2597 0.2940 0.0245
4 0.0260 0.0470 0.0460 0.0572 0.1034 0.1012
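On the kernel dying: inserting 94 x 8 = 752 new columns one at a time with df[name] = ... fragments the DataFrame (pandas itself warns about this and suggests pd.concat). A minimal sketch of a gentler variant, assuming the col1, col2, and df from the question:

from itertools import product

# build all interaction Series up front, then concatenate in a single step
products = {f"{a}*{b}": df[a] * df[b] for a, b in product(col1, col2)}
df = pd.concat([df, pd.DataFrame(products)], axis=1)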
I need to plot the following columns as a bar chart:
%_Var1 %_Var2 %_Val1 %_Val2 Class
2 0.00 0.00 0.10 0.01 1
3 0.01 0.01 0.07 0.05 0
17 0.00 0.00 0.02 0.01 0
24 0.00 0.00 0.11 0.04 0
27 0.00 0.00 0.02 0.03 1
44 0.00 0.00 0.05 0.02 0
53 0.00 0.00 0.03 0.01 1
67 0.00 0.00 0.06 0.02 0
87 0.00 0.00 0.22 0.01 1
115 0.00 0.00 0.03 0.02 0
comparing the values having Class 1 and Class 0 respectively (i.e. bars which show each column of the dataframe, putting the Class 1 bar and the Class 0 bar for a column next to each other).
So I should have 8 bars: 4 for Class 1 and the remaining 4 for Class 0.
Each column's Class 1 bar should be beside its Class 0 bar.
I tried as follows:
ax = df[["%_Var1", "%_Var2", "%_Val1", "%_Val2"]].plot(kind='bar')
but the output is completely wrong, and the same happens with ax = df[["%_Var1", "%_Var2", "%_Val1", "%_Val2"]].Label.plot(kind='bar')
I think I should use a groupby to group by Class, but I do not know how to set the order (plots are not my top skill).
If you want to try the seaborn way, melt the dataframe to long format and then put the class on hue:
import seaborn as sns

# long format: one row per (Class, variable) pair
data = df.melt(id_vars=['Class'], value_vars=['%_Var1', '%_Var2', '%_Val1', '%_Val2'])
sns.barplot(x='variable', y='value', hue='Class', data=data, errorbar=None)  # use ci=0 on older seaborn
which gives one group of bars per variable, with the Class 0 and Class 1 bars side by side.
Or, if you want the plot grouped by class instead, simply swap the hue and the x axis:
sns.barplot(x='Class', y='value', hue='variable', data=data, errorbar=None)
which gives one group of four bars per class, one bar per variable.
Using groupby (it plots the mean of each column per class):
df.groupby('Class').mean().plot.bar()
With the pivot_table method you can summarise the data per group as well:
df.pivot_table(index='Class').plot.bar()
# df.pivot_table(columns='Class').plot.bar() # invert order
By default, it calculates the mean of your target columns, but you can specify another aggregation method with the aggfunc parameter.
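For example, to plot medians instead of means, a quick sketch:

df.pivot_table(index='Class', aggfunc='median').plot.bar()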
I have the following dataframe (df):
           1     2     3     4     5
Row 0   0.24  0.16 -0.18 -0.20  1.24
Row 1   0.18  0.12 -0.73 -0.36 -0.54
Row 2  -0.01  0.25 -0.35 -0.08 -0.43
Row 3  -0.43  0.21  0.53  0.55 -1.03
Row 4  -0.24 -0.20  0.49  0.08  0.61
Row 5  -0.19 -0.29 -0.08 -0.16  0.34
I am attempting to sum all the negative and positive numbers respectively, e.g. sum(neg_numbers) = n and sum(pos_numbers) = x
I have tried:
df.groupby(df.agg([('negative' , lambda x : x[x < 0].sum()) , ('positive' , lambda x : x[x > 0].sum())])
to no avail.
How would I sum these values?
Thank you in advance!
You can do
sum_pos = df[df>0].sum(1)
sum_neg = df[df<0].sum(1)
if you want the sums per row. If you want to sum all values regardless of rows/columns, you can use np.nansum:
sum_pos = np.nansum(df[df>0])
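and likewise for the negatives:

sum_neg = np.nansum(df[df<0])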
You can do it with:
df.mul(df.gt(0)).sum().sum()
Out[447]: 5.0
df.mul(~df.gt(0)).sum().sum()
Out[448]: -5.5
If you need the sums per column:
df.mul(df.gt(0)).sum()
Out[449]:
1 0.42
2 0.74
3 1.02
4 0.63
5 2.19
dtype: float64
Yet another way for the total sums:
sum_pos = df.to_numpy().flatten().clip(min=0).sum()
sum_neg = df.to_numpy().flatten().clip(max=0).sum()
And for sums by columns:
sum_pos_col = sum(df.to_numpy().clip(min=0))
sum_neg_col = sum(df.to_numpy().clip(max=0))
If the dataframe has string columns and you want the sum for a particular numeric column:
df[df['column_name']>0]['column_name'].sum()
df[df['column_name']<0]['column_name'].sum()