I am trying to take a "given" value and match it to a "year" in the same row using the following dataframe:
data = {
'Given' : [0.45, 0.39, 0.99, 0.58, None],
'Year 1' : [0.25, 0.15, 0.3, 0.23, 0.25],
'Year 2' : [0.39, 0.27, 0.55, 0.3, 0.4],
'Year 3' : [0.43, 0.58, 0.78, 0.64, 0.69],
'Year 4' : [0.65, 0.83, 0.95, 0.73, 0.85],
'Year 5' : [0.74, 0.87, 0.99, 0.92, 0.95]
}
df = pd.DataFrame(data)
print(df)
Output:
Given Year 1 Year 2 Year 3 Year 4 Year 5
0 0.45 0.25 0.39 0.43 0.65 0.74
1 0.39 0.15 0.27 0.58 0.83 0.87
2 0.99 0.30 0.55 0.78 0.95 0.99
3 0.58 0.23 0.30 0.64 0.73 0.92
4 NaN 0.25 0.40 0.69 0.85 0.95
However, the matching process has a few caveats. I want to match the given value to the closest year and then calculate the time until the first "year" above 70%. So row 0 would match "Year 3", and in the same row it takes two years to reach "Year 5", which is the first occurrence above 70%.
For any "given" value already above 70%, we can just output "full", and for any "given" values that don't contain data, we can just output the first year above 70%. The output will look like the following:
Given Year 1 Year 2 Year 3 Year 4 Year 5 Output
0 0.45 0.25 0.39 0.43 0.65 0.74 2
1 0.39 0.15 0.27 0.58 0.83 0.87 2
2 0.99 0.30 0.55 0.78 0.95 0.99 full
3 0.58 0.23 0.30 0.64 0.73 0.92 1
4 NaN 0.25 0.40 0.69 0.85 0.95 4
It has taken me a horrendously long time to clean up this data so at the moment I can't think of a way to begin other than some use of .abs() to begin the matching process. All help appreciated.
Vectorized Pandas Approach:
Use .T.reset_index(drop=True).T to replace the column names with integer positions, so that both dataframes share the same column labels and can be subtracted from each other. pd.concat() with * repeats the Given column once per year column, which gives the absolute differences in one vectorized step instead of looping through columns.
Use idxmax and idxmin to identify the column numbers according to your criteria.
Use np.select according to your criteria.
import pandas as pd
import numpy as np
# identify 70% columns
pct_70 = (df.T.reset_index(drop=True).T > .7).idxmax(axis=1)
# identify column number of lowest absolute difference to Given
nearest_col = ((df.iloc[:,1:].T.reset_index(drop=True).T
                - pd.concat([df.iloc[:,0]] * len(df.columns[1:]), axis=1)
                .T.reset_index(drop=True).T)).abs().idxmin(axis=1)
# Generate an output series
output = pct_70 - nearest_col - 1
# Conditionally apply the output series
df['Output'] = np.select([output.gt(0),output.lt(0),output.isnull()],
                         [output, 'full', pct_70],np.nan)
df
Out[1]:
Given Year 1 Year 2 Year 3 Year 4 Year 5 Output
0 0.45 0.25 0.39 0.43 0.65 0.74 2.0
1 0.39 0.15 0.27 0.58 0.83 0.87 2.0
2 0.99 0.30 0.55 0.78 0.95 0.99 full
3 0.58 0.23 0.30 0.64 0.73 0.92 1.0
4 NaN 0.25 0.40 0.69 0.85 0.95 4
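The same three steps (first year above 70%, nearest year, conditional output) can also be sketched with plain NumPy broadcasting. This is only an illustration under stated assumptions: it rebuilds the question's dataframe, assumes the year columns are the ones whose names contain "Year", and assumes every row has at least one year above 0.7. The names years, given, first_70, closest and out are introduced here and are not part of the answer above.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    'Given':  [0.45, 0.39, 0.99, 0.58, None],
    'Year 1': [0.25, 0.15, 0.30, 0.23, 0.25],
    'Year 2': [0.39, 0.27, 0.55, 0.30, 0.40],
    'Year 3': [0.43, 0.58, 0.78, 0.64, 0.69],
    'Year 4': [0.65, 0.83, 0.95, 0.73, 0.85],
    'Year 5': [0.74, 0.87, 0.99, 0.92, 0.95],
})

years = df.filter(like='Year').to_numpy()      # shape (rows, 5)
given = df['Given'].to_numpy(dtype=float)      # NaN where Given is missing
missing = np.isnan(given)

# 0-based position of the first year above 70% in each row
# (assumes every row has one; argmax returns 0 otherwise).
first_70 = (years > 0.7).argmax(axis=1)

# 0-based position of the year closest to Given. Missing values are
# replaced by 0 only so argmin can run; those rows are overwritten below.
closest = np.abs(years - np.nan_to_num(given)[:, None]).argmin(axis=1)

out = (first_70 - closest).astype(object)      # years until the first year above 70%
out[missing] = (first_70 + 1)[missing]         # no Given: report the first year above 70%
out[given > 0.7] = 'full'                      # Given already above 70%

df['Output'] = out
print(df)
```

Using an object array for out keeps the integers and the string 'full' side by side without going through np.select.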
Here you go!
import numpy as np
def output(df):
    output = []
    for i in df.iterrows():
        row = i[1].to_list()
        given = row[0]
        compare = np.array(row[1:])
        first_70 = np.argmax(compare > 0.7)
        if np.isnan(given):
            output.append(first_70 + 1)
            continue
        if given > 0.7:
            output.append('full')
            continue
        diff = np.abs(np.array(compare) - np.array(given))
        closest_year = diff.argmin()
        output.append(first_70 - closest_year)
    return output
df['output'] = output(df)
Original question/answer for more context can be found here.
Hello, I am working with a dataframe that looks like the following:
data = {
'Given' : [0.45, 0.39, 0.99, 0.58, None],
'Year 1' : [0.25, 0.15, 0.3, 0.23, 0.25],
'Year 2' : [0.39, 0.27, 0.55, 0.3, 0.4],
'Year 3' : [0.43, 0.58, 0.78, 0.64, 0.69],
'Year 4' : [0.65, 0.83, 0.95, 0.73, 0.85],
'Year 5' : [0.74, 0.87, 0.99, 0.92, 0.95]
}
df = pd.DataFrame(data)
print(df)
Output:
Given Year 1 Year 2 Year 3 Year 4 Year 5
0 0.45 0.25 0.39 0.43 0.65 0.74
1 0.39 0.15 0.27 0.58 0.83 0.87
2 0.99 0.30 0.55 0.78 0.95 0.99
3 0.58 0.23 0.30 0.64 0.73 0.92
4 NaN 0.25 0.40 0.69 0.85 0.95
I am assigning the year of "given" and then checking how many years to >=70%. The "given" value is mapped to the lower year if the value is less than 75% of the distance between the two years on either side of "given". I map the "given" column to a "year" column using the following lines:
thresholds = df + df.diff(-1, axis=1).abs() * 0.75
below_75 = (df['Given'].to_numpy()[:, None] - thresholds.to_numpy()) < 0
min_year = thresholds.where(below_75).drop(columns=['Given']).idxmin(axis=1).str.replace('Year ', '').astype(float)
min_year = df.where(df > 0.7).drop(columns=['Given']).idxmin(axis=1).str.replace('Year ', '').astype(float) - min_year
This works perfectly in most cases, except when "given" maps to a year whose value is already above 70%. In this case, a row such as the following
Given Year 1 Year 2 Year 3 Year 4 Year 5
0 0.69 0.24 0.5 0.61 0.7 0.74
would map to "year 4", and then checks the next column (Year 5) and sees that it is above 0.7, so will output "1" (since it thinks it is 1 year until a value >70%).
But since it is originally mapped to "year 4" which is already above 70%, I would like it to output "done". I feel like this is an extremely easy fix but I am at a loss.
All help appreciated.
Quick summary:
Essentially I am trying to map the "given" value to a "year" column. If the "given" value is less than 3/4 of the way to the next year, it maps to the lower year (e.g. if year 3 is 10%, year 4 is 20%, and "given" is 17%, it would map to year 3 since 17% < 17.5%). Then it calculates how many years until > 70%.
Most of the code was already solved in an earlier question. I am just trying to work on the part of the code that assigns to a specific year.
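As a rough illustration of the rule described above, the mapping and the "done" check can be written out for a single row. This is a sketch, not the code from the earlier answer: years_to_70 is a hypothetical helper, it assumes the year values increase from left to right, and it treats a mapped year at or above 0.7 as already done (the question's prose says ">=70%", while its code compares with > 0.7).

```python
import numpy as np

def years_to_70(given, years, frac=0.75, target=0.7):
    """Hypothetical helper: map `given` to a year, then count years to `target`."""
    years = np.asarray(years, dtype=float)
    # Cut-off between each year and the next: lower value plus 75% of the gap.
    cutoffs = years[:-1] + frac * np.diff(years)
    below = given < cutoffs
    # First year whose cut-off lies above `given`; the last year if none does.
    mapped = int(below.argmax()) if below.any() else len(years) - 1
    if years[mapped] >= target:
        return 'done'                  # the mapped year has already reached the target
    reached = years >= target
    if not reached.any():
        return None                    # no year in this row reaches the target
    return int(reached.argmax()) - mapped

# The problem row from the question: 0.69 maps to Year 4, which is already at 0.7.
print(years_to_70(0.69, [0.24, 0.5, 0.61, 0.7, 0.74]))
# Row 0 of the dataframe: 0.45 maps to Year 3, and Year 5 is the first at or above 0.7.
print(years_to_70(0.45, [0.25, 0.39, 0.43, 0.65, 0.74]))
```

The point of the sketch is the order of the checks: the mapped year is tested against the target before any later column is considered.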
I'm trying to create interaction terms in a dataset. Is there another (simpler) way of creating interaction terms of columns in a dataset? For example, creating interaction terms in combinations of columns 4:98 and 98:106. I tried looping over the columns using numpy arrays, but with the following code, the kernel keeps dying.
col1 = df.columns[4:98] #94 columns
col2 = df.columns[98:106] #8 columns
var1_np = df_np[:, 4:98]
var2_np = df_np[:, 98:106]
for i in range(94):
    for j in range(8):
        name = col1[i] +"*" + col2[j]
        df[name] = var1_np[:,i]*var2_np[:,j]
Here, df is the dataframe and df_np is df converted to a NumPy array.
You could use itertools.product, which is roughly equivalent to nested for-loops in a generator expression. Then use join to build each new column name from the resulting pair of names. Finally, use the pandas prod method with axis=1 to multiply the two selected columns row by row.
import pandas as pd
import numpy as np
from itertools import product
#setup
np.random.seed(12345)
data = np.random.rand(5, 10).round(2)
df = pd.DataFrame(data)
df.columns = [f'col_{c}' for c in range(0,10)]
print(df)
#code
col1 = df.columns[3:5]
col2 = df.columns[5:8]
df_new = pd.DataFrame()
for i in product(col1, col2):
    name = "*".join(i)
    df_new[name] = df[list(i)].prod(axis=1)
print(df_new)
Output from df
col_0 col_1 col_2 col_3 col_4 col_5 col_6 col_7 col_8 col_9
0 0.93 0.32 0.18 0.20 0.57 0.60 0.96 0.65 0.75 0.65
1 0.75 0.96 0.01 0.11 0.30 0.66 0.81 0.87 0.96 0.72
2 0.64 0.72 0.47 0.33 0.44 0.73 0.99 0.68 0.79 0.17
3 0.03 0.80 0.90 0.02 0.49 0.53 0.60 0.05 0.90 0.73
4 0.82 0.50 0.81 0.10 0.22 0.26 0.47 0.46 0.71 0.18
Output from df_new
col_3*col_5 col_3*col_6 col_3*col_7 col_4*col_5 col_4*col_6 col_4*col_7
0 0.1200 0.1920 0.1300 0.3420 0.5472 0.3705
1 0.0726 0.0891 0.0957 0.1980 0.2430 0.2610
2 0.2409 0.3267 0.2244 0.3212 0.4356 0.2992
3 0.0106 0.0120 0.0010 0.2597 0.2940 0.0245
4 0.0260 0.0470 0.0460 0.0572 0.1034 0.1012
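With many columns (94 x 8 in the question), it may be worth computing all products in one step and building the new dataframe once, rather than inserting columns one at a time. The following is a sketch of that idea using NumPy broadcasting. The random data and the names a, b, inter and names are made up for illustration, and whether this helps with the dying kernel depends on how much memory the full result needs.

```python
import numpy as np
import pandas as pd
from itertools import product

# Illustrative data only
rng = np.random.default_rng(0)
df = pd.DataFrame(rng.random((5, 10)).round(2),
                  columns=[f'col_{c}' for c in range(10)])

col1 = df.columns[3:5]
col2 = df.columns[5:8]

a = df[col1].to_numpy()            # shape (rows, len(col1))
b = df[col2].to_numpy()            # shape (rows, len(col2))

# Broadcast to (rows, len(col1), len(col2)), then flatten the last two axes.
# The flattened order matches product(col1, col2).
inter = (a[:, :, None] * b[:, None, :]).reshape(len(df), -1)
names = ['*'.join(pair) for pair in product(col1, col2)]

df_new = pd.DataFrame(inter, columns=names, index=df.index)
print(df_new)
```

If the interaction terms should live in the original dataframe, pd.concat([df, df_new], axis=1) adds them in a single operation.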
I have two pandas dataframes that look like this:
df1
id 10 20 30
0 2020-04-01-10001 0.0 0.000000 0.000000
1 2020-04-01-10003 0.0 0.026587 0.053174
2 2020-04-01-10005 0.0 0.030884 0.061768
3 2020-04-01-1001 0.0 0.035875 0.071751
4 2020-04-01-1003 0.0 0.041673 0.083346
...
df2
10 20 30
id
2020-04-01-10001 0.15 0.17 0.18
2020-04-01-10003 0.55 0.57 0.61
2020-04-01-10005 0.36 0.37 0.38
2020-04-01-1001 0.00 0.00 0.02
2020-04-01-1003 0.00 0.02 0.04
...
I want to use rows of df1 to update the rows of df2, so I am trying to use df2.update(df1). Unfortunately, this doesn't change the rows of df2. I suspect that it has something to do with the extra numbers on the left side of df1. I'm not sure what those are. I also noticed that the two tables give different results when I run df.to_dict:
'10': {0: 0.0, 1: 0.0, 2: 0.0, 3: 0.0, 4: 0.0},,...
{'10': {'2020-04-01-10001': 0.15,
'2020-04-01-10003': 0.55,
'2020-04-01-10005': 0.36,
'2020-04-01-1001': 0.0,
'2020-04-01-1003': 0.0},...
How would I go about converting df1 to the format of df2?
Use set_index with replace and fillna:
import numpy as np
# np.nan is used here because the np.NaN alias was removed in NumPy 2.0
df1 = df1.set_index('id').replace(0, np.nan).fillna(df2)
10 20 30
id
2020-04-01-10001 0.15 0.170000 0.180000
2020-04-01-10003 0.55 0.026587 0.053174
2020-04-01-10005 0.36 0.030884 0.061768
2020-04-01-1001 0.00 0.035875 0.071751
2020-04-01-1003 0.00 0.041673 0.083346
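For context, the "extra numbers on the left side of df1" are its default integer index (0, 1, 2, ...). DataFrame.update aligns on index and column labels, so with df1 indexed by integers and df2 indexed by id, no rows match. A minimal sketch of the update route the question was attempting, using two rows copied from the tables above:

```python
import pandas as pd

df1 = pd.DataFrame({
    'id': ['2020-04-01-10001', '2020-04-01-10003'],
    '10': [0.0, 0.0],
    '20': [0.0, 0.026587],
    '30': [0.0, 0.053174],
})
df2 = pd.DataFrame(
    {'10': [0.15, 0.55], '20': [0.17, 0.57], '30': [0.18, 0.61]},
    index=pd.Index(['2020-04-01-10001', '2020-04-01-10003'], name='id'),
)

# Move 'id' into the index so that the row labels of both frames line up,
# then update df2 in place with the non-missing values of df1.
df2.update(df1.set_index('id'))
print(df2)
```

Note that update overwrites with every non-NaN value, including the zeros in df1. The replace(0, np.nan).fillna(df2) approach above instead treats zeros as missing, so pick whichever matches the intended meaning of 0.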
I have the following dataframe (df):
Row Number
Row 0 0.24 0.16 -0.18 -0.20 1.24
Row 1 0.18 0.12 -0.73 -0.36 -0.54
Row 2 -0.01 0.25 -0.35 -0.08 -0.43
Row 3 -0.43 0.21 0.53 0.55 -1.03
Row 4 -0.24 -0.20 0.49 0.08 0.61
Row 5 -0.19 -0.29 -0.08 -0.16 0.34
I am attempting to sum all the negative and positive numbers respectively, e.g. sum(neg_numbers) = n and sum(pos_numbers) = x
I have tried:
df.groupby(df.agg([('negative' , lambda x : x[x < 0].sum()) , ('positive' , lambda x : x[x > 0].sum())])
to no avail.
How would I sum these values?
Thank you in advance!
You can do
sum_pos = df[df>0].sum(1)
sum_neg = df[df<0].sum(1)
if you want to get the sums per row. If you want to sum all values regardless of rows/columns, you can use np.nansum:
sum_pos = np.nansum(df[df>0])
You can also do this with mul:
df.mul(df.gt(0)).sum().sum()
Out[447]: 5.0
df.mul(~df.gt(0)).sum().sum()
Out[448]: -5.5
If you need the sum by column
df.mul(df.gt(0)).sum()
Out[449]:
1 0.42
2 0.74
3 1.02
4 0.63
5 2.19
dtype: float64
Yet another way for the total sums:
sum_pos = df.to_numpy().flatten().clip(min=0).sum()
sum_neg = df.to_numpy().flatten().clip(max=0).sum()
And for sums by columns:
sum_pos_col = sum(df.to_numpy().clip(min=0))
sum_neg_col = sum(df.to_numpy().clip(max=0))
If the dataframe also has string columns and you want the sum for a particular column, then
df[df['column_name']>0]['column_name'].sum()
df[df['column_name']<0]['column_name'].sum()
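To make the difference between total, per-column and per-row sums concrete, here is a small sketch that rebuilds the dataframe from the question and uses DataFrame.clip. The column labels 1 to 5 are an assumption taken from the output shown in one of the answers, and the variable names are illustrative.

```python
import pandas as pd

df = pd.DataFrame(
    [[ 0.24,  0.16, -0.18, -0.20,  1.24],
     [ 0.18,  0.12, -0.73, -0.36, -0.54],
     [-0.01,  0.25, -0.35, -0.08, -0.43],
     [-0.43,  0.21,  0.53,  0.55, -1.03],
     [-0.24, -0.20,  0.49,  0.08,  0.61],
     [-0.19, -0.29, -0.08, -0.16,  0.34]],
    index=[f'Row {i}' for i in range(6)],
    columns=[1, 2, 3, 4, 5],   # assumed column labels
)

pos = df.clip(lower=0)   # negative values become 0
neg = df.clip(upper=0)   # positive values become 0

total_pos = pos.to_numpy().sum()   # one number for the whole frame
total_neg = neg.to_numpy().sum()
pos_by_col = pos.sum(axis=0)       # one value per column
pos_by_row = pos.sum(axis=1)       # one value per row

print(total_pos, total_neg)
print(pos_by_col)
print(pos_by_row)
```

Up to floating-point rounding, the totals agree with the 5.0 and -5.5 shown above.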
I need to write a df to a text file. To save some space on disk, I would like to set the number of decimal places for each column, i.e. give each column a different width.
I have tried:
df = pd.DataFrame(np.random.random(size=(10, 4)))
df.to_csv(path, float_format=['%.3f', '%.3f', '%.3f', '%.10f'])
But this does not work;
TypeError: unsupported operand type(s) for %: 'list' and 'float'
Any suggestions on how to do this with pandas (version 0.23.0)
You can do it this way:
df.iloc[:,0:3] = df.iloc[:,0:3].round(3)
df['d'] = df['d'].round(10)
df.to_csv('path')
Thanks for all the answers, inspired by #Joe I came up with:
df = df.round({'a':3, 'b':3, 'c':3, 'd':10})
or more generically
df = df.round({c:r for c, r in zip(df.columns, [3, 3, 3, 10])})
This is a workaround and does not answer the original question. round modifies the underlying dataframe, which may be undesirable.
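One way to keep the original values intact is to format a copy into strings just before writing. The following is a minimal sketch under stated assumptions: the column names a, b, d, the sample values and the formats dictionary are made up for illustration, and an in-memory buffer stands in for the output path.

```python
import io
import pandas as pd

df = pd.DataFrame({
    'a': [1.23456, 2.0],
    'b': [0.5, 0.25],
    'd': [0.123456789012, 0.5],
})

# Hypothetical per-column printf-style formats
formats = {'a': '%.3f', 'b': '%.3f', 'd': '%.10f'}

# Format a copy column by column; df itself keeps full precision.
formatted = df.copy()
for col, fmt in formats.items():
    formatted[col] = df[col].map(lambda v, fmt=fmt: fmt % v)

buf = io.StringIO()                  # stands in for a file path
formatted.to_csv(buf, index=False)
lines = buf.getvalue().splitlines()
print(lines)
```

Unlike round, string formatting also pads with trailing zeros, so every value in a column is written with the same number of decimal places.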
I usually do it this way:
a['column_name'] = round(a['column_name'], 3)
And then you can export it to csv as usual.
You can use applymap, which applies a function to every value in the dataframe (since pandas 2.1 it is deprecated in favor of DataFrame.map):
df = pd.DataFrame(np.random.random(size=(10, 4)))
df.applymap(lambda x: round(x,2))
Out[58]:
0 1 2 3
0 0.12 0.63 0.47 0.19
1 0.06 0.81 0.09 0.56
2 0.78 0.85 0.42 0.98
3 0.58 0.39 0.73 0.68
4 0.79 0.56 0.77 0.34
5 0.16 0.20 0.94 0.89
6 0.34 0.79 0.54 0.27
7 0.70 0.58 0.05 0.28
8 0.75 0.53 0.37 0.64
9 0.57 0.68 0.59 0.84