Python Pandas Panel counting value occurrence

I have a large dataset stored as a pandas Panel. I would like to count the occurrence of values < 1.0 on the minor_axis for each item in the panel. What I have so far:
#%% Creating the first DataFrame
import numpy as np
import pandas as pd

dates1 = pd.date_range('2014-10-19', '2014-10-20', freq='H')
df1 = pd.DataFrame(index=dates1)
n1 = len(dates1)
df1.loc[:, 'a'] = np.random.uniform(3, 10, n1)
df1.loc[:, 'b'] = np.random.uniform(0.9, 1.2, n1)
#%% Creating the second DataFrame
dates2 = pd.date_range('2014-10-18','2014-10-20',freq='H')
df2 = pd.DataFrame(index = dates2)
n2 = len(dates2)
df2.loc[:,'a'] = np.random.uniform(3,10,n2)
df2.loc[:,'b'] = np.random.uniform(0.9,1.2,n2)
#%% Creating the panel from both DataFrames
dictionary = {}
dictionary['First_dataset'] = df1
dictionary['Second dataset'] = df2
P = pd.Panel.from_dict(dictionary)
#%% I want to count the number of values < 1.0 for all datasets in the panel
## Only for minor axis b, not minor axis a, stored separately for each dataset
for dataset in P:
    P.loc[dataset, :, 'b']  # I need to count the number of values < 1.0 in this pandas Series

To count all the "b" values < 1.0, I would first isolate b in its own DataFrame by swapping the minor axis and the items.
In [43]: b = P.swapaxes("minor","items").b
In [44]: b.where(b<1.0).stack().count()
Out[44]: 30

Thanks for thinking along with me, everyone, but I managed to figure out a surprisingly easy solution after many hours of trying. I thought I should share it in case someone else is looking for something similar.
for dataset in P:
    abc = P.loc[dataset, :, 'b']
    abc_low = sum(i < 1.0 for i in abc)
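A slightly more direct variant of the same idea, sketched under the assumption that P is the panel built above: comparing the whole series against 1.0 and summing the resulting booleans gives one count per dataset without iterating over individual values (NaN entries compare as False, so they are excluded).
counts = {dataset: (P.loc[dataset, :, 'b'] < 1.0).sum() for dataset in P}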

Related

How to reduce the time complexity of KS test python code?

I am currently working on a project where I need to compare whether two distributions are the same or not. For that I have two data frames, both containing numeric values only:
1) db_df - which comes from the database
2) data - which is the user-uploaded dataframe
I have to compare every column of db_df with the columns of data, find the similar columns in data, and suggest them to the user as suggestions for the db column.
The dimensions of both data frames are 100 rows by 239 columns.
import time
import pandas as pd
from scipy.stats import kstest

row_list = []
suggestions = dict()
s = time.time()
db_data_columns = db_df.columns
data_columns = data.columns
for i in db_data_columns:
    col_list = list()
    for j in data_columns:
        # perform Kolmogorov-Smirnov test
        col_list.append(kstest(db_df[i], data[j])[1])
    row_list.append(col_list)
print(f"=== AFTER FOR TIME {time.time()-s}")
df = pd.DataFrame(row_list).T
df.columns = db_df.columns
df.index = data.columns
for i in df.columns:
    sorted_df = df.sort_values(by=[i], ascending=False)
    sorted_df = sorted_df[sorted_df > 0.05]
    sorted_df = sorted_df[:3].loc[:, i:i]
    sorted_df = sorted_df.dropna()
    suggestions[sorted_df.columns[0]] = list(sorted_df.to_dict().values())[0]
After getting all the p-values for all the columns in db_df against the columns of data, I need to select the top 3 columns from data for each column in db_df.
Overall this takes 14 seconds, which is very long. Is there any chance of reducing the time to less than 5 seconds?
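As a side note (a hedged sketch, not from the original post): once the p-value matrix df above exists, the top-3 selection step can be written more directly with Series.nlargest, which may also shave a little time off the post-processing.
# hedged sketch: assumes df is the p-value matrix built above
# (db_df columns as columns, data columns as the index)
suggestions = {
    col: df[col][df[col] > 0.05].nlargest(3).to_dict()
    for col in df.columns
}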

How to optimize Pandas group by +- margin [duplicate]

This question already exists:
Group by +- margin using python
Closed 7 months ago.
I want to optimize my code that groups by +- margin in Python. I want to group my DataFrame, composed of two columns ['1', '2'], based on a margin of +-1 for column '1' and +-10 for column '2'.
For example, a really simplified overview:
[[273, 10], [274, 14], [275, 15]]
Expected output:
[[273, 10], [274, 14]], [[274, 14], [275, 15]]
My real data is much more complex, with nearly 1 million data points that look like 652.125454455.
This kind of code, for example, takes forever and gives no results:
import time
import numpy as np
import pandas as pd

a = np.random.uniform(low=300, high=1800, size=(300000,))
b = np.random.uniform(low=0, high=7200, size=(300000,))
print("Random numbers were created")
df = pd.DataFrame({'1': a, '2': b})
df['id'] = df.index
MARGIN_1 = 1    # renamed from 1_MARGIN: Python identifiers cannot start with a digit
MARGIN_2 = 10   # renamed from 2_MARGIN for the same reason
tic = time.time()
group = []
for index, row in df.iterrows():
    filtered_df = df[(row['1'] - MARGIN_1 < df['1']) & (df['1'] < row['1'] + MARGIN_1) &
                     (row['2'] - MARGIN_2 < df['2']) & (df['2'] < row['2'] + MARGIN_2)]
    group.append(filtered_df[['id', '1']].values.tolist())
toc = time.time()
print(f"for loop: {str(1000*(toc-tic))} ms")
I also tried
data = df.groupby('1')['2'].apply(list).reset_index(name='irt')
but in this case there is no margin
I tried my best to understand what you wanted and I arrived at a very slow solution but at least it's a solution.
import time
import pandas as pd
import numpy as np

a = np.random.uniform(low=300, high=1800, size=(300000,))
b = np.random.uniform(low=0, high=7200, size=(300000,))
df = pd.DataFrame({'1': a, '2': b})
dfbl1 = np.sort(df['1'].apply(int).unique())
dfbl2 = np.sort(df['2'].apply(int).unique())
MARGIN1 = 1
MARGIN2 = 10
marg1array = np.array(range(dfbl1[0], dfbl1[-1], MARGIN1))
marg2array = np.array(range(dfbl2[0], dfbl2[-1], MARGIN2))
a = time.perf_counter()
groupmarg1 = []
groupmarg2 = []
for low, upper in zip(marg1array[:-1], marg1array[1:]):
    for low2, upper2 in zip(marg2array[:-1], marg2array[1:]):
        groupmarg1.append(df.loc[(df['1'] > low) & (df['1'] < upper) & (df['2'] > low2) & (df['2'] < upper2)].values.tolist())
print(time.perf_counter() - a)
I also tried to do each loop separately and intersect them, which should be faster, but since we're storing .values.tolist() I couldn't figure out a faster way than the one below.
a = time.perf_counter()
groupmarg1 = []
groupmarg2 = []
for low, upper in zip(marg1array[:-1], marg1array[1:]):
    groupmarg1.append(df.loc[(df['1'] > low) & (df['1'] < upper)])
newgroup = []
for subgroup in groupmarg1:
    for low2, upper2 in zip(marg2array[:-1], marg2array[1:]):
        newgroup.append(subgroup.loc[(subgroup['2'] > low2) & (subgroup['2'] < upper2)].values.tolist())
print(time.perf_counter() - a)
which runs in ~9mins on my machine.
Oh and you need to filter out the empty dataframes, and if you want them as values.tolist() you can do it while filtering like this
gr2=[grp.values.tolist() for grp in newgroup if not grp.empty]
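For what it's worth, a hedged sketch of the same fixed-bin idea using pd.cut and a single groupby, which avoids the Python-level nested loops entirely. Like the code above, this bins each column into fixed-width intervals rather than building a true +-margin neighbourhood around every row.
# hedged sketch: fixed-width binning with pd.cut, assuming df, MARGIN1, MARGIN2 from above
bins1 = np.arange(np.floor(df['1'].min()), df['1'].max() + MARGIN1, MARGIN1)
bins2 = np.arange(np.floor(df['2'].min()), df['2'].max() + MARGIN2, MARGIN2)
binned = df.groupby([pd.cut(df['1'], bins1), pd.cut(df['2'], bins2)], observed=True)
groups = [g.values.tolist() for _, g in binned]   # observed=True drops the empty bins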

How to calculate the mean and standard deviation of multiple dataframes at one go?

I have several hundred pandas dataframes, and the number of rows is not exactly the same in all of them: some have 600 rows while others have only 540.
What I want to do is this: I have two samples with exactly the same number of dataframes, and I want to read all the dataframes (around 2000) from both samples. This is what the data looks like, and I can read the files like this:
5113.440 1 0.25846 0.10166 27.96867 0.94852 -0.25846 268.29305 5113.434129
5074.760 3 0.68155 0.16566 120.18771 3.02654 -0.68155 101.02457 5074.745627
5083.340 2 0.74771 0.13267 105.59355 2.15700 -0.74771 157.52406 5083.337081
5088.150 1 0.28689 0.12986 39.65747 2.43339 -0.28689 164.40787 5088.141849
5090.780 1 0.61464 0.14479 94.72901 2.78712 -0.61464 132.25865 5090.773443
#first sample
import os
import pandas as pd

path_to_files = '/home/Desktop/computed_2d_blaze/'
lst = []
for filen in [x for x in os.listdir(path_to_files) if '.ares' in x]:
    df = pd.read_table(path_to_files+filen, skiprows=0, usecols=(0,1,2,3,4,8),
                       names=['wave','num','stlines','fwhm','EWs','MeasredWave'], delimiter=r'\s+')
    df = df.sort_values('stlines', ascending=False)
    df = df.drop_duplicates('wave')
    df = df.reset_index(drop=True)
    lst.append(df)
#second sample
path_to_files1 = '/home/Desktop/computed_1d/'
lst1 = []
for filen in [x for x in os.listdir(path_to_files1) if '.ares' in x]:
    df1 = pd.read_table(path_to_files1+filen, skiprows=0, usecols=(0,1,2,3,4,8),
                        names=['wave','num','stlines','fwhm','EWs','MeasredWave'], delimiter=r'\s+')
    df1 = df1.sort_values('stlines', ascending=False)
    df1 = df1.drop_duplicates('wave')
    df1 = df1.reset_index(drop=True)
    lst1.append(df1)
Now the data is stored in lists, and since the number of rows is not the same in all the dataframes, I can't subtract them directly.
So how can I subtract them correctly? And after that I want to take the average (mean) of the residuals to make a dataframe.
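For the subtraction itself, a hedged sketch of one common approach (not from the original thread): align each pair of frames on the shared 'wave' column first, subtract the aligned 'stlines' values, and then average the residuals across pairs. This assumes lst and lst1 from above line up pairwise, which may not hold depending on how os.listdir orders the files.
# hedged sketch: align on 'wave', subtract, then average the residuals
residuals = []
for df_a, df_b in zip(lst, lst1):
    merged = df_a.set_index('wave')[['stlines']].join(
        df_b.set_index('wave')[['stlines']], how='inner',
        lsuffix='_2d', rsuffix='_1d')
    residuals.append(merged['stlines_2d'] - merged['stlines_1d'])
mean_residual = pd.concat(residuals, axis=1).mean(axis=1)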
You shouldn't use apply. Just use Boolean masking:
mask = df['waves'].between(lower_outlier, upper_outlier)
df[mask].plot(x='waves', y='stlines')
One solution that comes to mind is writing a function that finds outliers based on upper and lower bounds and then slicing the data frames based on the outliers' index, e.g.
df1 = pd.DataFrame({'wave': [1, 2, 3, 4, 5]})
df2 = pd.DataFrame({'stlines': [0.1, 0.2, 0.3, 0.4, 0.5]})

def outlier(value, upper, lower):
    """
    Flag values that are within the upper and lower bounds (i.e. not outliers)
    """
    # Check if input value is within bounds
    in_bounds = (value <= upper) and (value >= lower)
    return in_bounds

# Flag which rows of the wave column of df1 are within bounds
outlier_index = df1.wave.apply(lambda x: outlier(x, 4, 1))
# Return df2 without the values at the outlier index
df2[outlier_index]
# Return df1 without the values at the outlier index
df1[outlier_index]

Output unique values from a pandas dataframe without reordering the output

I know that a few posts have been made regarding how to output the unique values of a dataframe without reordering the data.
I have tried many times to implement these methods; however, I believe the problem relates to how the dataframe in question has been defined.
Basically, I want to look into the dataframe named "C", and output the unique values into a new dataframe named "C1", without changing the order in which they are stored at the moment.
The line that I use currently is:
C1 = pd.DataFrame(np.unique(C))
However, this returns the values in ascending order, while I simply want the original order preserved with only the duplicates removed.
Once again, I apologise to the advanced users who will look at my code and shake their heads -- I'm still learning! And, yes, I have tried numerous methods to solve this problem (redefining the C dataframe, converting the output to a list, etc.), to no avail unfortunately, so this is my cry for help to the Python gods. I defined both C and C1 as dataframes, as I understand that these are pretty much the best data structures to house data in, so that they can be recalled and used later, plus it is quite useful to be able to name the columns without affecting the data contained in the dataframe.
Once again, your help would be much appreciated.
F0 = ('08/02/2018','08/02/2018',50)
F1 = ('08/02/2018','09/02/2018',52)
F2 = ('10/02/2018','11/02/2018',46)
F3 = ('12/02/2018','16/02/2018',55)
F4 = ('09/02/2018','28/02/2018',48)
F_mat = [[F0,F1,F2,F3,F4]]
F_test = pd.DataFrame(np.array(F_mat).reshape(5,3),columns=('startdate','enddate','price'))
#convert string dates into DateTime data type
F_test['startdate'] = pd.to_datetime(F_test['startdate'])
F_test['enddate'] = pd.to_datetime(F_test['enddate'])
#work on a copy named F (assumed here to be the F_test frame defined above) and
#convert the startdate and enddate columns to datetime type
F = F_test.copy()
F['startdate'] = pd.to_datetime(F['startdate'])
F['enddate'] = pd.to_datetime(F['enddate'])
#create contract duration column
F['duration'] = (F['enddate'] - F['startdate']).dt.days + 1
#re-order the F matrix by column 'duration', ensure that the bootstrapping
#prioritises the shorter term contracts
F = F.sort_values(by=['duration'], ascending=[True])
# create prices P
P = pd.DataFrame()
for index, row in F.iterrows():
    new_P_row = pd.Series()
    for date in pd.date_range(row['startdate'], row['enddate']):
        new_P_row[date] = row['price']
    P = P.append(new_P_row, ignore_index=True)
P.fillna(0, inplace=True)
#create C matrix, which records the unique day prices across the observation interval
C = pd.DataFrame(np.zeros((1, intNbCalendarDays)))
C.columns = tempDateRange
#create the Repatriation matrix, which records the order in which contracts will be
#stored in the A matrix, which means that once results are generated
#from the linear solver, we know exactly which CalendarDays map to
#which columns in the results array
#this array contains numbers from 1 to NbContracts
R = pd.DataFrame(np.zeros((1, intNbCalendarDays)))
R.columns = tempDateRange
#define a zero filled matrix, P1, which will house the dominant daily prices
P1 = pd.DataFrame(np.zeros((intNbContracts, intNbCalendarDays)))
#rename columns of P1 to be the dates contained in matrix array D
P1.columns = tempDateRange
#create prices in correct rows in P
for i in list(range(0, intNbContracts)):
    for j in list(range(0, intNbCalendarDays)):
        if (P.iloc[i, j] != 0 and C.iloc[0, j] == 0):
            flUniqueCalendarMarker = P.iloc[i, j]
            C.iloc[0, j] = flUniqueCalendarMarker
            P1.iloc[i, j] = flUniqueCalendarMarker
            R.iloc[0, j] = i
            for k in list(range(j+1, intNbCalendarDays)):
                if (C.iloc[0, k] == 0 and P.iloc[i, k] != 0):
                    C.iloc[0, k] = flUniqueCalendarMarker
                    P1.iloc[i, k] = flUniqueCalendarMarker
                    R.iloc[0, k] = i
        elif (C.iloc[0, j] != 0 and P.iloc[i, j] != 0):
            P1.iloc[i, j] = C.iloc[0, j]
#convert C dataframe into C_list, in preparation for converting C_list
#into a unique, order-preserved list
C_list = C.values.tolist()
#create C1 matrix, which records the unique day prices across unique days in the observation period
C1 = pd.DataFrame(np.unique(C))
Use DataFrame.duplicated() to check whether your dataframe contains any duplicates.
If it does, then you can try DataFrame.drop_duplicates().
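For the order-preservation part of the question, a hedged sketch: pandas.unique (unlike numpy.unique) returns values in order of appearance, so flattening C and passing it through pd.unique should give the order-preserved C1.
# hedged sketch: order-preserving unique values of the C matrix
C1 = pd.DataFrame(pd.unique(C.values.ravel()))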

Pandas Columns Operations with List

I have a pandas dataframe with two columns: the first with a single date ('action_date') and the second with a list of dates ('verification_date'). I am trying to calculate the time difference between the date in 'action_date' and each of the dates in the list in the corresponding 'verification_date' column, and then fill two new columns with the number of dates in verification_date that differ by over or under 360 days.
Here is my code:
df = pd.DataFrame()
df['action_date'] = ['2017-01-01', '2017-01-01', '2017-01-03']
df['action_date'] = pd.to_datetime(df['action_date'], format="%Y-%m-%d")
df['verification_date'] = ['2016-01-01', '2015-01-08', '2017-01-01']
df['verification_date'] = pd.to_datetime(df['verification_date'], format="%Y-%m-%d")
df['user_name'] = ['abc', 'wdt', 'sdf']
df.index = df.action_date
df = df.groupby(pd.TimeGrouper(freq='2D'))['verification_date'].apply(list).reset_index()
def make_columns(df):
    df = df
    for i in range(len(df)):
        over_360 = []
        under_360 = []
        for w in [(df['action_date'][i] - x).days for x in df['verification_date'][i]]:
            if w > 360:
                over_360.append(w)
            else:
                under_360.append(w)
        df['over_360'] = len(over_360)
        df['under_360'] = len(under_360)
    return df

make_columns(df)
This kind of works, EXCEPT that the df has the same values in every row, which is wrong because the dates are different. For example, in the first row of the dataframe there IS a difference of over 360 days between the action_date and both of the items in the list in the verification_date column, so the over_360 column should be populated with 2. However, it is empty, and instead the under_360 column is populated with 1, which is accurate only for the second row of 'action_date'.
I have a feeling I'm just messing up the looping but am really stuck. Thanks for all the help!
Your problem was that you were always updating the whole column with the value of the last calculation with these lines:
df['over_360'] = len(over_360)
df['under_360'] = len(under_360)
what you want to do instead is set the value for each line calculation accordingly, you can do this by replacing the above lines with these:
df.set_value(i,'over_360',len(over_360))
df.set_value(i,'under_360',len(under_360))
what it does is, it sets a value in line i and column over_360 or under_360.
you can learn more about it here.
If you don't like using set_value you can also use this:
df.ix[i,'over_360'] = len(over_360)
df.ix[i,'under_360'] = len(under_360)
you can check dataframe.ix here.
you might want to try this:
df['over_360'] = df.apply(lambda x: sum([((x['action_date'] - i).days >360) for i in x['verification_date']]) , axis=1)
df['under_360'] = df.apply(lambda x: sum([((x['action_date'] - i).days <360) for i in x['verification_date']]) , axis=1)
I believe it should be a bit faster.
You didn't specify what to do if == 360, so you can just change > or < into >= or <=.
