Element-by-element division in pandas dataframe with "/"? - python

It would be great to understand how this actually works. Perhaps there is something in Python/Pandas that I don't quite understand.
I have a dataframe (price data) and would like to calculate the returns. Rows are the stocks while columns are the dates.
For simplicity, I have created the prices with some random numbers.
import pandas as pd
import numpy as np
df_price = pd.DataFrame(np.random.rand(10,10))
df_ret = df_price.iloc[:,1:]/df_price.iloc[:,:-1]-1
There are two things I find strange here:
My numerator and denominator are both 10 x 9. Why is the output 10 x 10, with the first column being NaNs?
Why are the results all 0, apart from the first column being NaNs? i.e. why wasn't the calculation performed?
Thanks.

When we do the division, pandas first aligns the index and columns of both df_price.iloc[:,1:] and df_price.iloc[:,:-1]. To avoid that alignment, add .values to one side so the operation works on the raw array; then the output is what we expected.
df_ret = df_price.iloc[:,1:]/df_price.iloc[:,:-1].values-1
Example
s=pd.Series([2,4,6])
s.iloc[1:]/s.iloc[:-1]
Out[54]:
0    NaN    # index 0 exists only in s.iloc[:-1]
1    1.0    # index 1 exists in both slices: 4/4
2    NaN    # index 2 exists only in s.iloc[1:]
dtype: float64
From the above we can say that pandas objects match on the index first, and the match behaves more like an outer join.
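For comparison, here is a minimal sketch (using the same s as above) showing the alignment being bypassed with .values, so the division becomes purely positional:
import pandas as pd

s = pd.Series([2, 4, 6])

# Dividing by the raw numpy array skips index alignment,
# so the division is positional: 4/2 and 6/4
ratios = s.iloc[1:] / s.iloc[:-1].values
print(ratios)
# 1    2.0
# 2    1.5
# dtype: float64
For the original price frame, df_price.pct_change(axis=1) should give the same returns directly, assuming the dates run across the columns as in the question.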

Related

How to calculate the max row value for each column through Vaex

I have an application that uses a Pandas dataframe to calculate the min/max row value for each column. For example:
col_a  col_b  col_c
    2      8      7
   10      4      3
    6      5      1
calling df.max() produces
col_a 10
col_b 8
col_c 7
Just as a reference, I'm trying to convert the following code:
bin_stats = {'min': df.min(),
             'max': df.max(),
             'binwidth': (df.max() - df.min() + 10**-6) / bincount}
# Transform data into bin positions for fast binning
data = ((df - bin_stats['min']) / bin_stats['binwidth']).apply(np.floor)
I'm converting my functionality to Vaex and I need to print out the max row value for every column in my dataframe like above. I have tried df.max(column_names) but I get the error:
ValueError: Could not find a class (AggMax_object), seems object is not supported. How do I get an array of max values?
In vaex you can do df.max(). You need to pass an expression or a list of expressions for which you want to get the maximum value.
Consider this example:
import vaex
df = vaex.example()
columns = df.get_column_names(dtype='numeric')
df.max(columns)
# returns array([ 3.2000000e+01,  1.3049751e+02,  6.0022778e+01,  5.4506802e+01,
#                 6.3641956e+02,  5.7964453e+02,  5.3974872e+02,  3.5941863e+04,
#                 3.7393040e+03,  1.7840929e+03, -3.0200911e-01], dtype=float32)
Note that vaex has a df.minmax() method that can get you the min and max values in a single pass over the data (i.e. faster if your data is large).
float_columns = df.get_column_names(dtype='float')
df.minmax(float_columns)
Having said all of this, vaex excels at binning stuff, so it might be worth looking into how to achieve what you want in a "vaex-native" way, instead of straight up translating pandas code into vaex. It should work, but you might not get optimal performance.
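As a rough sketch of that idea (the column x and the bincount of 64 here are just placeholders for the example), vaex can bin and count in one call with count(binby=...):
import vaex

df = vaex.example()
bincount = 64

# Count how many rows fall into each of `bincount` bins over the x column.
# limits='minmax' tells vaex to use the column's min/max as the bin edges,
# so the min/max/binwidth bookkeeping from the pandas version is handled internally.
counts = df.count(binby=df.x, limits='minmax', shape=bincount)
print(counts.shape)  # (64,)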

Removing columns if values are not in an ascending order python

Given data like so:
Symbol    One      Two
1         28.75    25.10
2         29.00    25.15
3         29.10    25.00
I want to drop the column which does not have its values in ascending order (though I want to allow for gaps) across all rows. In this case, I want to drop column 'Two'. I tried the following code with no luck:
df.drop(df.columns[df.all(x <= y for x,y in zip(df, df[1:]))])
Thanks
Drop those columns that yield at least one (any) negative value (lt(0)) when their values are differenced by one lag (diff(1)), after NaNs are dropped (dropna):
columns_to_drop = [col for col in df.columns if df[col].diff(1).dropna().lt(0).any()]
df.drop(columns=columns_to_drop)
Symbol One
0 1 28.75
1 2 29.00
2 3 29.10
An expression that works with gaps (NaN)
A.loc[:, ~(A.iloc[1:, :].reset_index(drop=True) < A.iloc[:-1, :].reset_index(drop=True)).any()]
Without gaps it would be equivalent to
A.loc[:, (A.iloc[1:, :].reset_index(drop=True) >= A.iloc[:-1, :].reset_index(drop=True)).all()]
This avoids explicit loops, taking better advantage of vectorization for bigger dataframes.
A.iloc[1:, :] returns a dataframe without the first line
A.iloc[:-1, :] returns a dataframe without the last line
Slices of a dataframe keep the index labels of the corresponding rows, so the two slices have different indices; reset_index(drop=True) creates a fresh index counting [0, 1, ...], making the two sides of the inequality compatible. drop=True also prevents the old index from being inserted as an extra column that would take part in the comparison.
.any() (implicitly with axis=0) checks, for every column, whether any value is True; if so, it means some value is smaller than its predecessor, i.e. the column is not ascending.
A.loc[:, mask] selects the columns where mask is True and drops the columns where it is False.
The logic could be read as: no value is smaller than its predecessor (first expression), or equivalently, without gaps, all values are greater than or equal to their predecessor (second expression).
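A quick check of the first expression on the table from the question (the dataframe below is just that sample data):
import pandas as pd

A = pd.DataFrame({'Symbol': [1, 2, 3],
                  'One': [28.75, 29.00, 29.10],
                  'Two': [25.10, 25.15, 25.00]})

# True wherever a value is smaller than the value directly above it
decreases = A.iloc[1:, :].reset_index(drop=True) < A.iloc[:-1, :].reset_index(drop=True)

# Keep only the columns with no decrease anywhere
print(A.loc[:, ~decreases.any()])
#    Symbol    One
# 0       1  28.75
# 1       2  29.00
# 2       3  29.10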
Check out the code below; the only logic is:
map(lambda i: list(df[i]) == sorted(list(df[i])), df.columns)
import pandas as pd

df = pd.DataFrame(
    {
        'Symbol': [1, 2, 3],
        'One': [28.75, 29.00, 29.10],
        'Two': [25.10, 25.15, 25.10],
    }
)
# keep only the columns whose values are already in sorted (ascending) order
print(df.loc[:, list(map(lambda i: list(df[i]) == sorted(list(df[i])), df.columns))])

Calculate the mean value of a dataframe column twice based on a valid other column

I would like to calculate the mean value of a column twice: once for all rows where name contains a valid string, and a second time for rows where name contains an empty string or np.nan.
import pandas as pd
import numpy as np
data = [[np.nan,1],['kkk',4],['ggg',2], ['',3]]
df = pd.DataFrame(data,columns=['name','value'])
Here:
mean 1 (valid rows mean): (4+2)/2 = 3
mean 2 (invalid rows mean): (1+3)/2 = 2
I could do this by iterating over each row, but that is not a very pythonic way. I guess there must be a much more pythonic and smoother solution for this?
Here you go:
print(df.loc[~df.name.isin([np.nan, '']), 'value'].mean())
print(df.loc[df.name.isin([np.nan, '']), 'value'].mean())
Output:
3.0
2.0
You can first unify your non-valid rows by replacing empty strings with np.nan, then extract all the rows with a np.nan in the name column and take the mean of the value column. Afterwards, you could do the inverse of the above to get the mean of the valid rows.
data = [[np.nan,1],['kkk',4],['ggg',2], ['',3]]
df = pd.DataFrame(data,columns=['name','value'])
replaced_empties = df.replace("", np.nan)
mean_2 = replaced_empties[replaced_empties.name.isnull()].value.mean()
mean_1 = replaced_empties[~replaced_empties.name.isnull()].value.mean()
print(mean_1) # 3.0
print(mean_2) # 2.0
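A compact variant of the same idea, grouping on a single validity mask so both means come out of one groupby (a sketch, assuming the same df as above):
import pandas as pd
import numpy as np

data = [[np.nan, 1], ['kkk', 4], ['ggg', 2], ['', 3]]
df = pd.DataFrame(data, columns=['name', 'value'])

# True for rows with a non-empty, non-NaN name
valid = df['name'].replace('', np.nan).notna()
print(df.groupby(valid)['value'].mean())
# name
# False    2.0
# True     3.0
# Name: value, dtype: float64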

How to use 2 methods of filling NA in 1 column in Python

I have a data frame with 1 column.
- There are many NA values at the beginning and at the end that I would like to eliminate completely.
- At the same time, there are some NA values in between 2 available values that I would like to fill with the mean of the 2 closest available values.
For illustration, I attached an image.
I cannot think of any solution. Just wondering if anyone can please help me with that.
Thank you for your help.
Try this; I have reproduced the example using random numbers:
import pandas as pd
import numpy as np

# Build an example column and punch some "#N/A" holes into it
random_index = np.random.randint(0, 100, size=(5, 1))
df = pd.DataFrame(np.random.randint(0, 100, size=(100, 1)), columns=list('A'))
df.loc[10:15, 'A'] = "#N/A"
for c in random_index:
    df.loc[c, "A"] = "#N/A"

# Replacing starts from here
df[df == "#N/A"] = np.nan
index = list(np.where(df['A'].isna())[0])
drops = []
for i in index:
    # If both neighbours exist, fill with their mean; otherwise mark the row for dropping
    if 0 < i < len(df) - 1 and not pd.isnull(df.loc[i - 1, "A"]) and not pd.isnull(df.loc[i + 1, "A"]):
        df.loc[i, "A"] = (df.loc[i - 1, "A"] + df.loc[i + 1, "A"]) / 2
    else:
        drops.append(i)
df = df.drop(df.index[drops]).reset_index(drop=True)
First, if each N/A is in string format, replace it with np.nan. The most straightforward way is to use isna on the given column, then extract the True indices (for example by applying the mask to an np.arange array). From there you can either use a for loop over the indices to check whether they are sequential, or calculate the distance between consecutive elements to find the ones not equal to 1.
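As an alternative sketch in plain pandas (assuming the column is already numeric with np.nan for the missing values): interpolate fills a single interior NaN with the mean of its two neighbours, limit_area='inside' leaves the leading and trailing NaNs alone, and dropna then removes them. For runs of several interior NaNs the values are spread linearly rather than all set to one mean, so check that this matches the intent.
import pandas as pd
import numpy as np

s = pd.Series([np.nan, np.nan, 10.0, np.nan, 14.0, 20.0, np.nan])

# Fill only interior gaps, then drop whatever is still NaN at the edges
cleaned = s.interpolate(limit_area='inside').dropna()
print(cleaned)
# 2    10.0
# 3    12.0
# 4    14.0
# 5    20.0
# dtype: float64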

pandas cut(): how to convert nans? Or to convert the output to non-categorical?

I am using pandas.cut() on dataframe columns with nans. I need to run groupby on the output of pandas.cut(), so I need to convert nans to something else (in the output, not in the input data), otherwise groupby will stupidly and infuriatingly ignore them.
I understand that cut() now outputs categorical data, but I cannot find a way to add a category to the output. I have tried add_categories(), which runs with no warnings or errors, but doesn't work because the categories are not added and, indeed, fillna fails for this very reason. A minimal example is below.
Any ideas?
Or is there maybe an easy way to convert this categorical object to a non-categorical one? I have tried np.asarray() but with no luck - it becomes an array of Interval objects.
import pandas as pd
import numpy as np
x=[np.nan,4,6]
intervals =[-np.inf,4,np.inf]
out_nolabels=pd.cut(x,intervals)
out_labels=pd.cut(x,intervals, labels=['<=4','>4'])
out_nolabels.add_categories(['missing'])
out_labels.add_categories(['missing'])
print(out_labels)
print(out_nolabels)
out_labels=out_labels.fillna('missing')
out_nolabels=out_nolabels.fillna('missing')
As the documentation says, out-of-bounds data will be considered NaN in the resulting Categorical object, so you can't use fillna with a constant on categorical data when the value you are filling in is not among its categories:
Any NA values will be NA in the result. Out of bounds values will be
NA in the resulting Categorical object
You can't use x.fillna('missing') because 'missing' is not in the categories of x, but you can do x.fillna('>4') because '>4' is.
We can use np.where here to overcome that
x = pd.cut(df['id'],intervals, labels=['<=4','>4'])
np.where(x.isnull(),'missing',x)
array(['<=4', '<=4', '<=4', '<=4', 'missing', 'missing'], dtype=object)
Or apply add_categories to the values, i.e.
x = pd.cut(df['id'],intervals, labels=['<=4','>4']).values.add_categories('missing')
x.fillna('missing')
[<=4, <=4, <=4, <=4, missing, missing]
Categories (3, object): [<=4 < >4 < missing]
If you want to group NaNs and keep the dtype, one way of doing it is by casting the key to str, i.e. if you have a dataframe
df = pd.DataFrame({'id':[1,1,1,4,np.nan,np.nan],'value':[4,5,6,7,8,1]})
df.groupby(df.id.astype(str)).mean()
Output:
      id  value
id
1.0  1.0    5.0
4.0  4.0    7.0
nan  NaN    4.5
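Coming back to the minimal example in the question: add_categories does not modify the Categorical in place, it returns a new object, so the result has to be assigned back before fillna can succeed. A small sketch of that fix, using the out_labels from the question:
import pandas as pd
import numpy as np

x = [np.nan, 4, 6]
intervals = [-np.inf, 4, np.inf]
out_labels = pd.cut(x, intervals, labels=['<=4', '>4'])

# add_categories returns a NEW Categorical; assign it back, then fillna works
out_labels = out_labels.add_categories(['missing']).fillna('missing')
print(out_labels)
# [missing, <=4, >4]
# Categories (3, object): [<=4 < >4 < missing]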
