Truncate all values in dataframe - python

I have a pandas dataframe with the first column being dates and then lots of adjusted stock prices (some of which have 16 decimals). I would like to truncate all the dataframe values to 8 decimals so I tried the following:
df = df.set_index("Day").pipe(lambda x: math.trunc(100000000 * x) / 100000000).reset_index()
But I get the following error:
TypeError: type DataFrame doesn't define __trunc__ method

Have you tried formatting?
dec = [1.2736, 9.3745, 5.412783, 8.25389]
to_3dp = lambda x: '%.3f' % x
rounded = [to_3dp(i) for i in dec]
print(rounded)  # note: '%.3f' rounds (e.g. 1.2736 -> '1.274') and returns strings; it does not truncate
So in your case:
df['column'] = df['column'].apply(lambda x: '%.8f'%(x))
If you want to round:
df['column'] = df['column'].apply(lambda x: round(x,8))
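Keep in mind that these two options behave differently: the string-formatting version turns the column into strings (object dtype), while round keeps it numeric. A minimal sketch of the difference:
import pandas as pd

s = pd.Series([1.2345678912345678, 2.0000000049])
formatted = s.apply(lambda x: '%.8f' % x)  # e.g. '1.23456789' -- values become strings
numeric = s.round(8)                       # values stay float64

print(formatted.dtype)  # object
print(numeric.dtype)    # float64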

Use numpy.trunc for a vectorized solution:
n = 10**8
out = np.trunc(df.set_index("Day").mul(n)).div(n).reset_index()
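A quick sanity check on made-up data (the column names and prices below are just for illustration): scaling by 10**8, truncating, and scaling back drops everything past the eighth decimal instead of rounding it:
import numpy as np
import pandas as pd

# dummy prices with long decimal tails
df = pd.DataFrame({
    "Day": ["2023-01-02", "2023-01-03"],
    "price": [125.0700073242187500, 0.1234567891234567],
})

n = 10**8
out = np.trunc(df.set_index("Day").mul(n)).div(n).reset_index()
# price becomes 125.07000732 and 0.12345678
# (rounding with .round(8) would have given 0.12345679 for the second value)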

IIUC, you are trying to apply the truncation lambda to multiple columns at once, which is the reason for the error. Try applymap, which applies your function to each cell independently. You have to first set your date column as the index, leaving only the float columns in the dataframe. Try this -
f = lambda x: math.trunc(100000000 * x) / 100000000 #<-- your function
df.set_index("Day").applymap(f).reset_index() #<-- applied on each cell
Since I don't have the sample dataset you are using, here is a working dummy example.
import math
import numpy as np
import pandas as pd

# Dummy dataframe
df = pd.DataFrame(np.random.random((10, 3)),
                  columns=['col1', 'col2', 'col3'])
f = lambda x: math.trunc(100 * x) / 100
df.applymap(f)
col1 col2 col3
0 0.80 0.76 0.14
1 0.40 0.48 0.85
2 0.58 0.40 0.76
3 0.82 0.04 0.10
4 0.23 0.04 0.91
5 0.57 0.41 0.12
6 0.72 0.71 0.71
7 0.32 0.59 0.99
8 0.11 0.70 0.32
9 0.95 0.80 0.24
Another, simpler option is to use df.set_index("Day").round(8) directly, if that works for you, but note that it rounds your numbers to 8 decimals rather than truncating them.
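The difference only shows up when the ninth decimal would round the eighth one up. A minimal illustration of the two behaviours (note that in recent pandas, 2.1+, applymap has been deprecated in favour of DataFrame.map, which works the same way here):
import math

x = 0.123456789  # the 9th decimal is a 9, so rounding and truncating disagree
print(round(x, 8))                    # 0.12345679
print(math.trunc(10**8 * x) / 10**8)  # 0.12345678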

Cannot do inplace boolean setting on mixed-types with a non np.nan value

I want to transform data. This is my DataFrame (expense_overview):
       Expense  Proportion
0     Salaries        0.62
1  Advertising        0.15
2  Office Rent        0.15
3    Equipment        0.03
4    Utilities        0.03
5     Supplies        0.01
6         Food        0.01
I want to replace the values of the Expense column whose corresponding Proportion values are less than 5% with the value "Other".
mask = expense_overview.isin(expense_overview[expense_overview["Proportion"] < 0.05]["Expense"])
expense_overview[mask] = "Other"
But I get the following error:
TypeError: Cannot do inplace boolean setting on mixed-types with a non np.nan value
Are you trying to do this:
expense_overview.loc[expense_overview["Proportion"] < 0.05, "Expense"] = "Other"
Result:
Expense Proportion
0 Salaries 0.62
1 Advertising 0.15
2 Office Rent 0.15
3 Other 0.03
4 Other 0.03
5 Other 0.01
6 Other 0.01
Pandas only solution:
expense_overview[lambda x: x.Proportion < 0.05] = expense_overview[lambda x: x.Proportion < 0.05].assign(Expense="Other")
Or using np.where:
expense_overview["Expense"] = np.where(expense_overview["Proportion"] < 0.05, "Other", expense_overview["Expense"])
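For completeness, a self-contained sketch of the .loc approach above, rebuilt from the data in the question:
import pandas as pd

expense_overview = pd.DataFrame({
    "Expense": ["Salaries", "Advertising", "Office Rent", "Equipment",
                "Utilities", "Supplies", "Food"],
    "Proportion": [0.62, 0.15, 0.15, 0.03, 0.03, 0.01, 0.01],
})

# overwrite Expense only where Proportion is below 5%
expense_overview.loc[expense_overview["Proportion"] < 0.05, "Expense"] = "Other"
print(expense_overview)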
I found a solution:
mask = expense_overview["Expense"].isin(expense_overview[expense_overview["Proportion"] < 0.05]["Expense"])
expense_overview.loc[mask, "Expense"] = "Other"

How to divide one dataframe by the other without converting to numpy first?

I have a dataframe with two columns, x and y, and a few hundred rows.
I have another dataframe with only one row and two columns, x and y.
I want to divide column x of the big dataframe by the value in x of the small dataframe, and column y by column y.
If I divide one dataframe by the other, I get all NaNs. For the division to work, I must convert the small dataframe to numpy.
Why can't I divide one dataframe by the other? What am I missing? I have a toy example below.
import numpy as np
import pandas as pd
df = pd.DataFrame()
r = int(10)
df['x'] = np.arange(0,r)
df['y'] = df['x'] * 2
other_df = pd.DataFrame()
other_df['x'] = [100]
other_df['y'] = [400]
# This doesn't work - I get all nans
new = df / other_df
# this works - it gives me what I want
new2 = df / [100,400]
# this also works
new3 = df / other_df.to_numpy()
You can convert the one-row DataFrame to a Series so that the columns align correctly, e.g. by selecting the first row with DataFrame.iloc:
new = df / other_df.iloc[0]
print (new)
x y
0 0.00 0.000
1 0.01 0.005
2 0.02 0.010
3 0.03 0.015
4 0.04 0.020
5 0.05 0.025
6 0.06 0.030
7 0.07 0.035
8 0.08 0.040
9 0.09 0.045
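As an alternative spelling (a sketch, not from the original answer), you can squeeze the one-row frame into a Series and use DataFrame.div, which also aligns on the column labels:
import numpy as np
import pandas as pd

df = pd.DataFrame({"x": np.arange(10), "y": np.arange(10) * 2})
other_df = pd.DataFrame({"x": [100], "y": [400]})

# squeeze() collapses the single row into a Series indexed by ['x', 'y'],
# so the division aligns on column labels just like df / other_df.iloc[0]
new = df.div(other_df.squeeze())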
You can also use numpy.divide(), which relies on NumPy broadcasting:
new = np.divide(df, other_df)
(Note that in recent pandas versions NumPy ufuncs align pandas inputs just like the / operator does, so this can still produce NaNs; passing other_df.to_numpy(), as in the question, avoids the alignment.) See the NumPy broadcasting documentation for more details.

Merging pandas dataframes, alternating rows without sorting rows

I'm trying to mimic an SPSS-style correlation table in my Pandas output to make it easier to read for supervisors who are used to seeing matrices laid out this way (and are annoyed that I don't use SPSS anymore because it's harder for them to read).
This means a table where each p-value is placed directly above the corresponding correlation coefficient. I have easily produced both the p-values and the coefficients and saved each into a separate dataframe like the ones below.
pvals
        T     4    Rw    Af
T    0.00  0.05  0.24  0.01
4    0.05  0.00  0.76  0.03
Rw   0.24  0.76  0.00  0.44
...

rs
        T     4    Rw    Af
T    1.00  0.65  0.28  0.44
4    0.65  1.00  0.01  0.03
Rw  -0.03  0.01  1.00  0.32
...
What I'd like to do is make a table where the two dataframes are merged without changing the order of the index. It would look like
T    P  0.00  0.05  0.24  0.01
     r  1.00  0.65  0.28  0.44
4    P  0.05  0.00  0.76  0.03
     r  0.65  1.00  0.01  0.03
...
Now, I understand that if my columns had alphabetically ordered names I could use something like
pd.concat([pvals, rs]).sort_index(kind='merge')
However, my columns are named with descriptive, non-ordered names and so this doesn't work because it reorders the index into alphabetical order. I also know that
df.corr()
will produce a matrix like the rs example I've given above but this is not what I'm looking for.
If anyone has any advice I'd really appreciate it.
Kev
You can add a helper index level with np.arange and DataFrame.set_index with append=True, pass the keys parameter to concat for the P and r labels, sort by the helper level with a stable merge sort (which keeps the original row order), remove that level, and finally swap the remaining levels with DataFrame.swaplevel:
s1 = pvals.set_index(np.arange(len(pvals)), append=True)
s2 = rs.set_index(np.arange(len(rs)), append=True)
df = (pd.concat([s1, s2], keys=('P','r'))
        .sort_index(kind='merge', level=2)
        .reset_index(level=2, drop=True)
        .swaplevel(0,1))
print (df)
T 4 Rw Af
T P 0.00 0.05 0.24 0.01
r 1.00 0.65 0.28 0.44
4 P 0.05 0.00 0.76 0.03
r 0.65 1.00 0.01 0.03
Rw P 0.24 0.76 0.00 0.44
r -0.03 0.01 1.00 0.32
Asker Edit
This answer worked once the code was changed to
s1 = pvals.assign(a = np.arange(len(pvals))).set_index('a', append=True)
s2 = rs.assign(a = np.arange(len(rs))).set_index('a', append=True)
df = (pd.concat([s1, s2], keys=('P','r'))
        .sort_index(kind='merge', level=2)
        .reset_index(level=2, drop=True)
        .swaplevel(0,1))
which was recommended by the answerer.
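As another sketch of the same idea (not part of the original answers, and assuming the row labels are unique), the helper level can be skipped: concatenate with keys, swap the levels, then restore the original order by selecting the outer level with a label list, which preserves the given order:
import pandas as pd

idx = ["T", "4", "Rw"]
cols = ["T", "4", "Rw", "Af"]
pvals = pd.DataFrame([[0.00, 0.05, 0.24, 0.01],
                      [0.05, 0.00, 0.76, 0.03],
                      [0.24, 0.76, 0.00, 0.44]], index=idx, columns=cols)
rs = pd.DataFrame([[1.00, 0.65, 0.28, 0.44],
                   [0.65, 1.00, 0.01, 0.03],
                   [-0.03, 0.01, 1.00, 0.32]], index=idx, columns=cols)

combined = pd.concat([pvals, rs], keys=["P", "r"]).swaplevel(0, 1)
combined = combined.loc[idx]  # selecting with a label list keeps the original row order
print(combined)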

Extract a list with values whose position is indicated by the values in another column from a Pandas DataFrame

It's probably pretty obvious in the end but I cannot think of a way to do this. Thanks for your help!
I did a prediction task and as a result I have a DataFrame with the percentages and a column with the predicted class, like so:
Class1 Class2 Class3 Prediction
0 0.99 0.01 0.00 Class1
1 0.15 0.14 0.71 Class3
2 0.05 0.80 0.15 Class2
Now I want to access the probability with which a class was predicted. So I want to have a list like below so I can work with it further.
0 0.99
1 0.71
2 0.80
I have problems finding a way to access only one value of df.Prediction at a time and have no idea how to search for it. How do I get this value or alternatively my desired list? I tried this:
values = []
for row in df.Prediction:
    values.append(row)
print(values)
but it returns the whole column instead of the individual probabilities. It also doesn't feel very pandas-like. I am using Python 3.5 in case it makes a difference.
IIUC:
In [15]: df.lookup(df.index, df.Prediction)
Out[15]: array([ 0.99, 0.71, 0.8 ])
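Note that DataFrame.lookup was deprecated in pandas 1.2 and removed in 2.0; on newer versions an equivalent (a sketch, not part of the original answer) indexes the underlying NumPy array by the column position of each predicted class:
import numpy as np

probs = df[["Class1", "Class2", "Class3"]]
cols = probs.columns.get_indexer(df["Prediction"])    # column position per row
vals = probs.to_numpy()[np.arange(len(probs)), cols]  # array([0.99, 0.71, 0.8 ])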
or
In [23]: df.max(axis=1)
Out[23]:
0 0.99
1 0.71
2 0.80
dtype: float64
or
In [24]: df.filter(regex='^Class').idxmax(axis=1)
Out[24]:
0 Class1
1 Class3
2 Class2
dtype: object

create matrix structure using pandas

I have loaded the CSV file below, containing code and coefficient data, into the dataframe df:
CODE|COEFFICIENT
A|0.5
B|0.4
C|0.3
import pandas as pd
import numpy as np
df= pd.read_csv('cod_coeff.csv', delimiter='|', encoding="utf-8-sig")
giving
  CODE  COEFFICIENT
0    A          0.5
1    B          0.4
2    C          0.3
From the above dataframe, I need to create a final dataframe as below which has a matrix structure with the product of the coefficients:
A B C
A 0.25 0.2 0.15
B 0.2 0.16 0.12
C 0.15 0.12 0.09
I am using np.multiply but I am not successful in producing the result.
numpy offers a faster alternative via np.outer (with CODE already set as the index):
pd.DataFrame(np.outer(df, df), df.index, df.index)
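A self-contained sketch of that approach, recreating the small frame from the question instead of reading the CSV:
import numpy as np
import pandas as pd

df = pd.DataFrame({"CODE": ["A", "B", "C"],
                   "COEFFICIENT": [0.5, 0.4, 0.3]}).set_index("CODE")

# outer product of the coefficient vector with itself
result = pd.DataFrame(np.outer(df, df), index=df.index, columns=df.index)
# rows and columns are both labelled A, B, C; the values are the pairwise products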
You want to do the math between a vector and its transpose. Transpose with .T and apply the matrix dot product between the two dataframes:
df = df.set_index('CODE')
df.T
Out[10]:
CODE A B C
COEFFICIENT 0.5 0.4 0.3
df.dot(df.T)
Out[11]:
CODE A B C
CODE
A 0.25 0.20 0.15
B 0.20 0.16 0.12
C 0.15 0.12 0.09
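Since the question mentions np.multiply: broadcasting the coefficient column against its transpose gives the same matrix. A sketch, assuming CODE has been set as the index as above:
import numpy as np
import pandas as pd

df = pd.DataFrame({"COEFFICIENT": [0.5, 0.4, 0.3]},
                  index=pd.Index(["A", "B", "C"], name="CODE"))

vals = df.to_numpy()               # shape (3, 1) column vector
outer = np.multiply(vals, vals.T)  # (3, 1) * (1, 3) broadcasts to (3, 3)
result = pd.DataFrame(outer, index=df.index, columns=df.index)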
