Read and write array at the same time in Python

I want to recalculate column "a" of a given dataframe df, but my way of doing it does not fill in the new, calculated values over the old ones.
import pandas as pd
import numpy as np
from numpy.random import randn
df = pd.DataFrame(randn(100))
df["a"] = np.nan
df["b"] = randn()
df.a[0] = 0.5
df.a= df.a.shift(1) * df.b
Do you have any ideas how I can solve that?
I want to calculate "a" depending on its previous value and "b":
a b
0.5  2   # set as starting value with df.a[0] = 0.5; since there is no value of a prior to this row, no calculation is performed
1.5  3   # a = previous value of a * b (0.5 * 3) = 1.5
15   10  # a = previous value of a * b (1.5 * 10) = 15
45   3   # a = previous value of a * b (15 * 3) = 45
The problem is that the calculation is not performed / the results of the calculation do not overwrite the previously set values.

How about this?
df = pd.DataFrame({'a': [None] * 4, 'b': [2, 3, 10, 3]})
df.loc[0, 'a'] = 0.5
df.loc[1:, 'a'] = (df.b.shift(-1).cumprod() * df.a.iat[0])[:-1].values
>>> df
a b
0 0.5 2
1 1.5 3
2 15 10
3 45 3

You can do it using a for loop like this:
for i in df.index[1:]:
    df.loc[i, 'a'] = df.loc[i, 'b'] * df.loc[i - 1, 'a']
If anyone knows a vectorized way I'd be interested to see it.
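Since each value is just the starting value multiplied by all the b values up to that row (a[i] = a[0] * b[1] * ... * b[i]), the loop can be collapsed into a cumulative product. A minimal sketch of that idea, using the small example frame from the question:
import pandas as pd

# example values from the question
df = pd.DataFrame({'b': [2, 3, 10, 3]})
a0 = 0.5  # starting value for a

# a[i] = a[0] * b[1] * ... * b[i], so a cumulative product replaces the loop
factors = df['b'].copy()
factors.iloc[0] = 1  # row 0 keeps the starting value unchanged
df['a'] = a0 * factors.cumprod()

print(df)
#     b     a
# 0   2   0.5
# 1   3   1.5
# 2  10  15.0
# 3   3  45.0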


Finding the percentage of each unique value for every column in Pandas

I know that to count each unique value of a column and turn it into a percentage I can use:
df['name_of_the_column'].value_counts(normalize=True)*100
I wonder how I can do this for all the columns as a function and then drop any column where a single unique value accounts for more than 95% of all values. Note that the function should also count the NaN values.
You can try this:
cols = list(df.columns)
for col in cols:
    res = df[col].value_counts(normalize=True, dropna=False) * 100  # dropna=False so NaN is counted too
    if res.iloc[0] >= 95:
        del df[col]
You can write a small wrapper around value_counts that returns False if any value is above some threshold, and True if the counts look good:
Sample Data
import pandas as pd
import numpy as np
df = pd.DataFrame({
    "A": [1] * 20,                    # should NOT survive
    "B": [1, 0] * 10,                 # should survive
    "C": [np.nan] * 20,               # should NOT survive
    "D": [1, 2, 3, 4] * 5,            # should survive
    "E": [0] * 18 + [np.nan, np.nan]  # should survive
})
print(df.head())
Implementation
def threshold_counts(s, threshold=0):
    counts = s.value_counts(normalize=True, dropna=False)
    if (counts >= threshold).any():
        return False
    return True

column_mask = df.apply(threshold_counts, threshold=0.95)
clean_df = df.loc[:, column_mask]
print(clean_df.head())
B D E
0 1 1 0.0
1 0 2 0.0
2 1 3 0.0
3 0 4 0.0
4 1 1 0.0
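The same check can also be written as a one-liner over the columns; this is just a compact variant of the wrapper above (same 0.95 threshold, with dropna=False so NaN counts as a value):
# keep only columns where no single value (NaN included) reaches 95% of the rows
column_mask = df.apply(lambda s: s.value_counts(normalize=True, dropna=False).max() < 0.95)
clean_df = df.loc[:, column_mask]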

How do I use np.nanmin when comparing one column of a pandas dataframe with an integer?

import pandas as pd
import numpy as np
a = np.array([[1, 2], [3, np.nan]])
np.nanmin(a, axis=0)
array([1., 2.])
I want to use the same logic but on pandas dataframe columns, comparing each value of a column with an integer.
use case:
MC_cond = df['MODEL'].isin(["MC"])
df_lgd_type = df['LGD_TYPE'].isin(["FIXED"])
df_without_lgd_type = ~(df_lgd_type)
x = np.nanmin((1, df.loc[MC_cond & df_without_lgd_type, 'A'] +
                  df.loc[MC_cond & df_without_lgd_type, 'B']))
I am comparing the sum of column A and column B with 1.
This should do the trick even without np.nanmin. I hope I've understood everything correctly from your sparse description.
I'm assuming you also want to replace those NaN values that are left after summation. So we fill those with 1 and then clip all values to max at 1.
a = df.loc[MC_cond & df_without_lgd_type, 'A']
b = df.loc[MC_cond & df_without_lgd_type, 'B']
x = (a + b).fillna(1).clip(upper=1)
Example:
df = pd.DataFrame({
    'A': [-1, np.nan, 2, 3, 4],
    'B': [-4, 5, np.nan, 7, -8]
})
(df.A + df.B).fillna(1).clip(upper=1)
# Output:
# 0 -5.0
# 1 1.0
# 2 1.0
# 3 1.0
# 4 -4.0
# dtype: float64
In case you don't want NaN values in one column to lead to the row sum being NaN too, just fill them beforehand:
x = (a.fillna(0) + b.fillna(0)).fillna(1).clip(upper=1)
Just for completeness, this would be a pure numpy solution resembling your approach:
a = df.loc[MC_cond & df_without_lgd_type, 'A'].to_numpy()
b = df.loc[MC_cond & df_without_lgd_type, 'B'].to_numpy()
# optionally fill NaNs with 0
# a = np.nan_to_num(a)
# b = np.nan_to_num(b)
s = a + b
x = np.nanmin(np.stack((s, np.ones_like(s))), axis=0)
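For completeness, np.fmin gives the element-wise minimum while ignoring NaNs, so the same result can be had in a single call (a sketch, assuming NaN sums should also end up as 1, as in the fillna version above):
import numpy as np
import pandas as pd

# a and b stand in for the masked column selections from above
a = pd.Series([-1, np.nan, 2, 3, 4])
b = pd.Series([-4, 5, np.nan, 7, -8])

s = (a + b).to_numpy()
x = np.fmin(s, 1.0)  # element-wise minimum with 1; NaN entries become 1.0 because fmin ignores NaNs
print(x)             # [-5.  1.  1.  1. -4.]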

Use previous row value for calculating log

I have a DataFrame as presented in the spreadsheet; it has a column A.
https://docs.google.com/spreadsheets/d/1h3ED1FbkxQxyci0ETQio8V4cqaAOC7bIJ5NvVx41jA/edit?usp=sharing
I have been trying to create a new column like A_output which uses the previous row value and the current row value for finding the natural log.
df.apply(custom_function, axis=1)  # using a custom function
But I am not sure how to access the value of the previous row.
The only thing I have tried is converting the values into a list, performing my operation, and appending the result back to the dataframe, something like this:
output = []
previous_value = 100
for value in df['A'].values:
    output.append(np.log(value / previous_value))
    previous_value = value
df['A_output'] = output
This is going to be an extremely expensive operation. What's the best way to approach this problem?
Another way with rolling():
import pandas as pd
import numpy as np
data = np.random.normal(loc=5., size=(6, 1))
df = pd.DataFrame(columns=['A'], data=data)
df['output'] = df['A'].rolling(2).apply(lambda x: np.log(x[1] / x[0]), raw=True)
init_val = 3.
df.loc[0, 'output'] = np.log(df['A'][0] / init_val)  # <-- manually assign value for the first item
print(df)
# A output
# 0 7.257160 0.883376
# 1 4.579390 -0.460423
# 2 4.630148 0.011023
# 3 5.153198 0.107029
# 4 6.004917 0.152961
# 5 6.633857 0.099608
If you want to apply the same operation on multiple columns:
import pandas as pd
import numpy as np
data = np.random.normal(loc=5., size=(6, 2))
df = pd.DataFrame(columns=['A', 'B'], data=data)
df[['output_A', 'output_B']] = df.rolling(2).apply(lambda x: np.log(x[1] / x[0]), raw=True)
init_val = 3.
df.loc[0, 'output_A'] = np.log(df['A'][0] / init_val)
df.loc[0, 'output_B'] = np.log(df['B'][0] / init_val)
print(df)
# A B output_A output_B
# 0 7.289657 4.986245 0.887844 0.508071
# 1 5.690721 5.010605 -0.247620 0.004874
# 2 5.773812 5.129814 0.014495 0.023513
# 3 4.417981 6.395500 -0.267650 0.220525
# 4 4.923170 5.363723 0.108270 -0.175936
# 5 5.279008 5.327365 0.069786 -0.006802
We can use Series.shift and then use .loc to assign the first value based on the base value.
Let's assume we have the following dataframe:
df = pd.DataFrame({'A':np.random.randint(1, 10, 5)})
print(df)
A
0 8
1 3
2 3
3 1
4 5
df['A_output'] = np.log(df['A'] / df['A'].shift())
df.loc[0, 'A_output'] = np.log(df.loc[0, 'A'] / 100)
print(df)
A A_output
0 8 -2.525729
1 3 -0.980829
2 3 0.000000
3 1 -1.098612
4 5 1.609438
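For completeness, pandas 0.24+ lets you fold the base value into the shift itself via fill_value, so the separate .loc assignment is not needed (a sketch, using the starting value of 100 from the question):
import numpy as np
import pandas as pd

df = pd.DataFrame({'A': [8, 3, 3, 1, 5]})
# shift(fill_value=100) makes the "previous value" of the first row 100
df['A_output'] = np.log(df['A'] / df['A'].shift(fill_value=100))
print(df)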

Nearest neighbor matching in Pandas

Given two DataFrames (t1, t2), both with a column 'x', how would I append a column to t1 with the ID of t2 whose 'x' value is the nearest to the 'x' value in t1?
t1:
id x
1 1.49
2 2.35
t2:
id x
3 2.36
4 1.5
output:
id id2
1 4
2 3
I can do this by creating a new DataFrame, iterating on t1.groupby(), doing lookups on t2, and then merging, but this takes incredibly long given a 17 million row t1 DataFrame.
Is there a better way to accomplish this? I've scoured the pandas docs regarding groupby, apply, transform, agg, etc., but an elegant solution has yet to present itself despite my thought that this would be a common problem.
Using merge_asof
df = pd.merge_asof(df1.sort_values('x'),
                   df2.sort_values('x'),
                   on='x',
                   direction='nearest',
                   suffixes=['', '_2'])
print(df)
Out[975]:
id x id_2
0 3 0.87 6
1 1 1.49 5
2 2 2.35 4
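One thing to note is that merge_asof returns rows in sorted-x order. A sketch of one way to carry t1's original row order through the merge (using the t1/t2 names from the question):
t1_sorted = t1.reset_index().sort_values('x')  # keep the original row position in 'index'
out = pd.merge_asof(t1_sorted, t2.sort_values('x'),
                    on='x', direction='nearest', suffixes=['', '2'])
out = out.sort_values('index').drop(columns='index').reset_index(drop=True)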
Method 2: reindex
df1['id2'] = df2.set_index('x').reindex(df1.x, method='nearest')['id'].values
df1
id x id2
0 1 1.49 4
1 2 2.35 3
Convert t1 and t2 to lists, sort them, and then match the ids with the zip() function:
list1 = t1.values.tolist()
list2 = t2.values.tolist()
list1.sort()  # ascending or descending, you decide
list2.sort()
list3 = list(zip(list1, list2))
print(list3)
# after that you should see output like (1, 4), (2, 3)
You can calculate a new array with the distance from each element in t1 to each element in t2, and then take the argmin along the rows to get the right index. This has the advantage that you can choose whatever distance function you like, and it does not require the dataframes to be of equal length.
It creates one intermediate array of size len(t1) * len(t2). Using a pandas builtin might be more memory-efficient, but this should be as fast as you can get since everything is done on the C side of numpy. You could always do this method in batches if memory is a problem (see the sketch after the code below).
import numpy as np
import pandas as pd
t1 = pd.DataFrame({"id": [1, 2], "x": np.array([1.49, 2.35])})
t2 = pd.DataFrame({"id": [3, 4], "x": np.array([2.36, 1.5])})
Now comes the part doing the actual work. The .to_numpy() calls are important since otherwise pandas tries to align on the indices. The distance line uses broadcasting to create horizontal and vertical "repetitions" in a memory-efficient way.
t1_x = t1["x"].to_numpy()
t2_x = t2["x"].to_numpy()
dist = np.abs(t1_x[:, np.newaxis] - t2_x[np.newaxis, :])  # shape (len(t1), len(t2))
closest_idx = np.argmin(dist, axis=1)                     # for each row of t1, the index of the nearest t2 row
closest_id = t2["id"].to_numpy()[closest_idx]
output = pd.DataFrame({"id1": t1["id"], "id2": closest_id})
print(output)
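As mentioned above, the distance matrix can be built in batches when t1 is very large. A rough sketch of that idea (the nearest_ids helper is hypothetical, not part of the answer above):
import numpy as np

def nearest_ids(t1_x, t2_x, t2_ids, batch_size=100_000):
    """For each value in t1_x, return the id from t2 whose x is nearest.

    Works on t1 in chunks so the (batch, len(t2)) distance matrix
    stays small even for a 17-million-row t1.
    """
    out = np.empty(len(t1_x), dtype=t2_ids.dtype)
    for start in range(0, len(t1_x), batch_size):
        chunk = t1_x[start:start + batch_size]
        dist = np.abs(chunk[:, np.newaxis] - t2_x[np.newaxis, :])
        out[start:start + batch_size] = t2_ids[np.argmin(dist, axis=1)]
    return out

# usage with the frames above:
# t1["id2"] = nearest_ids(t1["x"].to_numpy(), t2["x"].to_numpy(), t2["id"].to_numpy())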
Alternatively, you can round both frames to 1 decimal place and merge on the rounded x values:
t1 = {'id': [1, 2], 'x': [1.49,2.35]}
t2 = {'id': [3, 4], 'x': [2.36,1.5]}
df1 = pd.DataFrame(t1)
df2 = pd.DataFrame(t2)
df = df1.round(1).merge(df2.round(1), on='x', suffixes=('', '2')).drop(columns='x')
print(df)
id id2
0 1 4
1 2 3
Add .drop(columns='x') to remove the binding column 'x' from the output.
Add suffixes=('', '2') to distinguish the two id column titles.

When doing Pandas DataFrame Calculations top row returns all zeros and all other rows correct

I am writing a function that takes a dataframe and concatenates a second dataframe next to the original dataframe with a simple calculation of percentages. I want to have the rows simply be values followed by percentages. Here is an example:
A, B, A (%), B (%)
1, 1, 0.50, 0.50
1, 1, 0.50, 0.50
But instead my code is returning:
A, B, A (%), B (%)
1, 1, 0 , 0
1, 1, .50 , .50
The first row of any dataframe that I do this with returns a row of zeros, and the calculations that follow in later rows are all correct. The code I am running deals with a dataframe that has 3 columns containing values; Count, IV, and P are their titles.
I have attached the code below:
column_list = []
for column in frame.columns[1:]:
    column_list.append(column + ' (%)')
percentages = pd.DataFrame(columns=column_list)
for i in range(frame.shape[0]):
    percentages.loc[i] = [float(frame.iloc[i, 1]) / float(frame['Count'].sum()),
                          float(frame.iloc[i, 2]) / float(frame['IV'].sum()),
                          float(frame.iloc[i, 3]) / float(frame['P'].sum())]
return_frame = pd.concat([frame, percentages], axis=1)
return return_frame
I'm not sure where the bug in your code is, but here is a concise way to achieve your desired output:
import pandas as pd
df = pd.DataFrame({'A': [1, 3], 'B': [9, 7]})
df_percent = df.apply(lambda r: r / sum(r), axis=1).add_suffix(' (%)')
df_result = pd.concat([df, df_percent], axis=1)
Contents of df_result:
A B A (%) B (%)
0 1 9 0.1 0.9
1 3 7 0.3 0.7
Also, you may want to multiply the df_percent values by 100 to convert what are technically fractions into percentages.
EDIT: To get column-wise percentages instead of row-wise, change axis=1 to axis=0. The contents of df_result is then:
A B A (%) B (%)
0 1 9 0.25 0.5625
1 3 7 0.75 0.4375
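For the asker's original three-column frame, a vectorized version of the same column-wise calculation could look like this (a sketch assuming the frame has columns named Count, IV and P, as described in the question):
import pandas as pd

def add_percentages(frame):
    # divide each value by its column total, matching the loop in the question
    cols = ['Count', 'IV', 'P']
    percentages = frame[cols].div(frame[cols].sum(), axis=1).add_suffix(' (%)')
    return pd.concat([frame, percentages], axis=1)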
