I have the following dataframe (df):
Row Number     1     2     3     4     5
Row 0       0.24  0.16 -0.18 -0.20  1.24
Row 1       0.18  0.12 -0.73 -0.36 -0.54
Row 2      -0.01  0.25 -0.35 -0.08 -0.43
Row 3      -0.43  0.21  0.53  0.55 -1.03
Row 4      -0.24 -0.20  0.49  0.08  0.61
Row 5      -0.19 -0.29 -0.08 -0.16  0.34
I am attempting to sum all the negative and positive numbers respectively, e.g. sum(neg_numbers) = n and sum(pos_numbers) = x
I have tried:
df.groupby(df.agg([('negative', lambda x: x[x < 0].sum()), ('positive', lambda x: x[x > 0].sum())]))
to no avail.
How would I sum these values?
Thank you in advance!
You can do
sum_pos = df[df > 0].sum(axis=1)
sum_neg = df[df < 0].sum(axis=1)
if you want to get the sums per row. If you want to sum all values regardless of rows/columns, you can use np.nansum:
sum_pos = np.nansum(df[df > 0])
sum_neg = np.nansum(df[df < 0])
You can do it with
df.mul(df.gt(0)).sum().sum()
Out[447]: 5.0
df.mul(~df.gt(0)).sum().sum()
Out[448]: -5.5
If you need the sum per column
df.mul(df.gt(0)).sum()
Out[449]:
1 0.42
2 0.74
3 1.02
4 0.63
5 2.19
dtype: float64
Yet another way for the total sums:
sum_pos = df.to_numpy().flatten().clip(min=0).sum()
sum_neg = df.to_numpy().flatten().clip(max=0).sum()
And for per-column sums:
sum_pos_col = df.to_numpy().clip(min=0).sum(axis=0)
sum_neg_col = df.to_numpy().clip(max=0).sum(axis=0)
If you have string columns in the dataframe and want to get the sum for a particular column, then
df[df['column_name']>0]['column_name'].sum()
df[df['column_name']<0]['column_name'].sum()
I am trying to take a "given" value and match it to a "year" in the same row using the following dataframe:
data = {
    'Given':  [0.45, 0.39, 0.99, 0.58, None],
    'Year 1': [0.25, 0.15, 0.3, 0.23, 0.25],
    'Year 2': [0.39, 0.27, 0.55, 0.3, 0.4],
    'Year 3': [0.43, 0.58, 0.78, 0.64, 0.69],
    'Year 4': [0.65, 0.83, 0.95, 0.73, 0.85],
    'Year 5': [0.74, 0.87, 0.99, 0.92, 0.95]
}
df = pd.DataFrame(data)
print(df)
Output:
Given Year 1 Year 2 Year 3 Year 4 Year 5
0 0.45 0.25 0.39 0.43 0.65 0.74
1 0.39 0.15 0.27 0.58 0.83 0.87
2 0.99 0.30 0.55 0.78 0.95 0.99
3 0.58 0.23 0.30 0.64 0.73 0.92
4 NaN 0.25 0.40 0.69 0.85 0.95
However, the matching process has a few caveats. I am trying to match the given value to the closest year before calculating the time to the first "year" above 70%. So row 0 would match to "Year 3" (0.43 is closest to 0.45), and we can see in the same row that it will take two years until "Year 5", which is the first occurrence in the row above 70%.
For any "given" value already above 70%, we can just output "full", and for any "given" values that don't contain data, we can just output the first year above 70%. The output will look like the following:
Given Year 1 Year 2 Year 3 Year 4 Year 5 Output
0 0.45 0.25 0.39 0.43 0.65 0.74 2
1 0.39 0.15 0.27 0.58 0.83 0.87 2
2 0.99 0.30 0.55 0.78 0.95 0.99 full
3 0.58 0.23 0.30 0.64 0.73 0.92 1
4 NaN 0.25 0.40 0.69 0.85 0.95 4
It has taken me a horrendously long time to clean up this data, so at the moment I can't think of a way to begin other than some use of .abs() for the matching step. All help appreciated.
Vectorized Pandas Approach:
1. Use reset_index() on the column names together with .T, so that both frames share the same column labels and can be subtracted from each other in a vectorized way. pd.concat() with * builds a dataframe that duplicates the first column, so you can take the absolute difference of the dataframes without looping through columns.
2. Use idxmax and idxmin to identify the column numbers according to your criteria.
3. Use np.select to build the output according to your criteria.
import pandas as pd
import numpy as np
# identify the first column above 70% in each row
pct_70 = (df.T.reset_index(drop=True).T > 0.7).idxmax(axis=1)
# identify column number of lowest absolute difference to Given
nearest_col = (df.iloc[:, 1:].T.reset_index(drop=True).T
               - pd.concat([df.iloc[:, 0]] * len(df.columns[1:]), axis=1)
                 .T.reset_index(drop=True).T).abs().idxmin(axis=1)
# Generate an output series
output = pct_70 - nearest_col - 1
# Conditionally apply the output series
df['Output'] = np.select([output.gt(0), output.lt(0), output.isnull()],
                         [output, 'full', pct_70], np.nan)
df
Out[1]:
Given Year 1 Year 2 Year 3 Year 4 Year 5 Output
0 0.45 0.25 0.39 0.43 0.65 0.74 2.0
1 0.39 0.15 0.27 0.58 0.83 0.87 2.0
2 0.99 0.30 0.55 0.78 0.95 0.99 full
3 0.58 0.23 0.30 0.64 0.73 0.92 1.0
4 NaN 0.25 0.40 0.69 0.85 0.95 4
Here you go!
import numpy as np
def output(df):
    output = []
    for i in df.iterrows():
        row = i[1].to_list()
        given = row[0]
        compare = np.array(row[1:])
        # index of the first year above 70%
        first_70 = np.argmax(compare > 0.7)
        if np.isnan(given):
            output.append(first_70 + 1)
            continue
        if given > 0.7:
            output.append('full')
            continue
        # index of the year closest to the given value
        diff = np.abs(compare - given)
        closest_year = diff.argmin()
        output.append(first_70 - closest_year)
    return output
df['output'] = output(df)
I am trying to use Seaborn to plot a simple bar plot using data that was transformed. The data started out looking like this (text follows):
element 1 2 3 4 5 6 7 8 9 10 11 12
C 95.6 95.81 96.1 95.89 97.92 96.71 96.1 96.38 96.09 97.12 95.12 95.97
N 1.9 1.55 1.59 1.66 0.53 1.22 1.57 1.63 1.82 0.83 2.37 2.13
O 2.31 2.4 2.14 2.25 1.36 1.89 2.23 1.8 1.93 1.89 2.3 1.71
Co 0.18 0.21 0.16 0.17 0.01 0.03 0.13 0.01 0.02 0.01 0.14 0.01
Zn 0.01 0.03 0.02 0.03 0.18 0.14 0.07 0.17 0.14 0.16 0.07 0.18
and after importing using:
df1 = pd.read_csv(r"C:\path.txt", sep='\t',header = 0, usecols=[0, 1, 2,3,4,5,6,7,8,9,10,11,12], index_col='element').transpose()
display(df1)
When I plot the values of an element versus the first column (which represents an observation), the first column of data corresponding to 'C' is used instead. What am I doing wrong and how can I fix it?
I also tried importing, then pivoting the dataframe, which resulted in an undesired shape that repeated the element set as columns 12 times.
ax = sns.barplot(x=df1.iloc[:,0], y='Zn', data=df1)
Edited to add: I am not married to using any particular package or technique. I just want to be able to use my data to build a bar plot with 1-12 on the x axis and elemental compositions on the y.
You have different possibilities here. The problem arises because 'element' is the index of your dataframe, so x=df1.iloc[:,0] selects the column of 'C'.
1)
ax = sns.barplot(x=df1.index, y='Zn', data=df1)
2)
df1.reset_index(inplace=True)  # now 'element' is the first column of df1
ax = sns.barplot(x=df1.iloc[:, 0], y='Zn', data=df1)
# equal to
ax = sns.barplot(x='element', y='Zn', data=df1)
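If you'd rather reshape than fight the index, a minimal long-format sketch is shown below, assuming df1 is the transposed frame from the question; the column names 'index', 'element', and 'composition' are chosen here for illustration:
import pandas as pd
import seaborn as sns

# melt the wide frame into one row per (observation, element) pair;
# assumes the observation labels 1-12 sit in df1's index
long_df = df1.reset_index().melt(id_vars='index', var_name='element',
                                 value_name='composition')
# observations 1-12 on the x axis, composition on the y, one color per element
ax = sns.barplot(x='index', y='composition', hue='element', data=long_df)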
I'm trying to create interaction terms in a dataset. Is there a simpler way of creating interaction terms from combinations of columns 4:98 and 98:106? I tried looping over the columns using NumPy arrays, but with the following code the kernel keeps dying.
col1 = df.columns[4:98] #94 columns
col2 = df.columns[98:106] #8 columns
var1_np = df_np[:, 4:98]
var2_np = df_np[:, 98:106]
for i in range(94):
    for j in range(8):
        name = col1[i] + "*" + col2[j]
        df[name] = var1_np[:, i] * var2_np[:, j]
Here, df is the dataframe and df_np is df as a NumPy array.
You could use itertools.product, which is roughly equivalent to nested for-loops in a generator expression. Then use join to create the new column name from each product result. After that, use pandas prod to return the product of the two columns over axis 1 (along the columns).
import pandas as pd
import numpy as np
from itertools import product
#setup
np.random.seed(12345)
data = np.random.rand(5, 10).round(2)
df = pd.DataFrame(data)
df.columns = [f'col_{c}' for c in range(0,10)]
print(df)
#code
col1 = df.columns[3:5]
col2 = df.columns[5:8]
df_new = pd.DataFrame()
for i in product(col1, col2):
    name = "*".join(i)
    df_new[name] = df[list(i)].prod(axis=1)
print(df_new)
Output from df
col_0 col_1 col_2 col_3 col_4 col_5 col_6 col_7 col_8 col_9
0 0.93 0.32 0.18 0.20 0.57 0.60 0.96 0.65 0.75 0.65
1 0.75 0.96 0.01 0.11 0.30 0.66 0.81 0.87 0.96 0.72
2 0.64 0.72 0.47 0.33 0.44 0.73 0.99 0.68 0.79 0.17
3 0.03 0.80 0.90 0.02 0.49 0.53 0.60 0.05 0.90 0.73
4 0.82 0.50 0.81 0.10 0.22 0.26 0.47 0.46 0.71 0.18
Output from df_new
col_3*col_5 col_3*col_6 col_3*col_7 col_4*col_5 col_4*col_6 col_4*col_7
0 0.1200 0.1920 0.1300 0.3420 0.5472 0.3705
1 0.0726 0.0891 0.0957 0.1980 0.2430 0.2610
2 0.2409 0.3267 0.2244 0.3212 0.4356 0.2992
3 0.0106 0.0120 0.0010 0.2597 0.2940 0.0245
4 0.0260 0.0470 0.0460 0.0572 0.1034 0.1012
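If the column-by-column assignment is what is killing the kernel (inserting 94*8 = 752 new columns one at a time repeatedly copies and fragments the frame), a rough sketch of a more memory-friendly variant, assuming the same col1/col2 split as above, computes every product in one NumPy broadcast and attaches the result with a single pd.concat:
import numpy as np
import pandas as pd
from itertools import product

a = df[col1].to_numpy()  # shape (rows, n1)
b = df[col2].to_numpy()  # shape (rows, n2)
# broadcast (rows, n1, 1) * (rows, 1, n2) -> (rows, n1, n2), then flatten pairs
inter = (a[:, :, None] * b[:, None, :]).reshape(len(df), -1)
names = [f"{c1}*{c2}" for c1, c2 in product(col1, col2)]
df = pd.concat([df, pd.DataFrame(inter, columns=names, index=df.index)],
               axis=1)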
I need to write a df to a text file. To save some space on disk, I would like to set the number of decimal places for each column, i.e. give each column a different width.
I have tried:
df = pd.DataFrame(np.random.random(size=(10, 4)))
df.to_csv(path, float_format=['%.3f', '%.3f', '%.3f', '%.10f'])
But this does not work:
TypeError: unsupported operand type(s) for %: 'list' and 'float'
Any suggestions on how to do this with pandas (version 0.23.0)?
You can do it this way:
df.iloc[:, 0:3] = df.iloc[:, 0:3].round(3)
df.iloc[:, 3] = df.iloc[:, 3].round(10)
df.to_csv('path')
Thanks for all the answers; inspired by @Joe, I came up with:
df = df.round({'a':3, 'b':3, 'c':3, 'd':10})
or more generically
df = df.round({c:r for c, r in zip(df.columns, [3, 3, 3, 10])})
This is a workaround and does not answer the original question; round modifies the underlying dataframe, which may be undesirable.
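If you want to keep full precision in memory and only shorten the text on disk, a minimal sketch (assuming four columns; the format list is illustrative) renders each column to strings first and then writes those out:
import numpy as np
import pandas as pd

df = pd.DataFrame(np.random.random(size=(10, 4)))

# format each column with its own width, then write the strings as-is
out = df.copy()
for col, fmt in zip(out.columns, ['%.3f', '%.3f', '%.3f', '%.10f']):
    out[col] = out[col].map(lambda v, fmt=fmt: fmt % v)
out.to_csv('path', index=False)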
I usually do it this way:
a['column_name'] = round(a['column_name'], 3)
And then you can export it to csv as usual.
You can use applymap; it applies the function to every value across all rows and columns.
df = pd.DataFrame(np.random.random(size=(10, 4)))
df.applymap(lambda x: round(x,2))
Out[58]:
0 1 2 3
0 0.12 0.63 0.47 0.19
1 0.06 0.81 0.09 0.56
2 0.78 0.85 0.42 0.98
3 0.58 0.39 0.73 0.68
4 0.79 0.56 0.77 0.34
5 0.16 0.20 0.94 0.89
6 0.34 0.79 0.54 0.27
7 0.70 0.58 0.05 0.28
8 0.75 0.53 0.37 0.64
9 0.57 0.68 0.59 0.84
I want to apply a function to slices of each row of a dataframe in pandas, returning a dataframe of the same shape in which every value has been computed within its slice.
So, for example
df = pandas.DataFrame(numpy.round(numpy.random.normal(size=(2, 10)),2))
f = lambda x: (x - x.mean())
What I want is to apply the lambda function f to columns 0 to 5 and to columns 5 to 10 separately.
I did this:
a = pandas.DataFrame(f(df.T.iloc[0:5, :]))
but this is only for the first slice. How can I include the second slice in the code, so that my resulting output frame looks exactly like the input frame -- just that every data point is changed to its value minus the mean of the corresponding slice?
I hope it makes sense. What would be the right way to go about this?
Thank you.
You can simply reassign the result to the original df, like this:
import pandas as pd
import numpy as np
# I'd rather use a function than lambda here, preference I guess
def f(x):
return x - x.mean()
df = pd.DataFrame(np.round(np.random.normal(size=(2,10)), 2))
df.T
0 1
0 0.92 -0.35
1 0.32 -1.37
2 0.86 -0.64
3 -0.65 -2.22
4 -1.03 0.63
5 0.68 -1.60
6 -0.80 -1.10
7 -0.69 0.05
8 -0.46 -0.74
9 0.02 1.54
# make a copy of df here
df1 = df.copy()
# just reassign the slices back to the copy
df1.T[:5], df1.T[5:] = f(df.T.iloc[0:5, :]), f(df.T.iloc[5:, :])
df1.T
0 1
0 0.836 0.44
1 0.236 -0.58
2 0.776 0.15
3 -0.734 -1.43
4 -1.114 1.42
5 0.930 -1.23
6 -0.550 -0.73
7 -0.440 0.42
8 -0.210 -0.37
9 0.270 1.91
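Note that assigning through .T like this relies on the transpose being a view, which holds only for a single-dtype frame. A sketch that sidesteps the issue, assuming the two slices are columns 0:5 and 5:10, applies f to each block and glues the pieces back together with pd.concat:
import numpy as np
import pandas as pd

df = pd.DataFrame(np.round(np.random.normal(size=(2, 10)), 2))

def f(x):
    return x - x.mean()

# demean each block of columns against its own slice mean, then recombine
result = pd.concat([f(df.iloc[:, :5].T).T, f(df.iloc[:, 5:].T).T], axis=1)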