Output unique values from a pandas dataframe without reordering the output - python

I know that a few posts have been made regarding how to output the unique values of a dataframe without reordering the data.
I have tried many times to implement these methods, however, I believe that the problem relates to how the dataframe in question has been defined.
Basically, I want to look into the dataframe named "C", and output the unique values into a new dataframe named "C1", without changing the order in which they are stored at the moment.
The line that I use currently is:
C1 = pd.DataFrame(np.unique(C))
However, this returns an ascending order list (while, I simply want the list order preserved only with duplicates removed).
Once again, I apologise to the advanced users who will look at my code and shake their heads -- I'm still learning! And, yes, I have tried numerous methods to solve this problem (redefining the C dataframe, converting the output to be a list etc), to no avail unfortunately, so this is my cry for help to the Python gods. I defined both C and C1 as dataframes, as I understand that these are pretty much the best datastructures to house data in, such that they can be recalled and used later, plus it is quite useful to name the columns without affecting the data contained in the dataframe).
Once again, your help would be much appreciated.
F0 = ('08/02/2018','08/02/2018',50)
F1 = ('08/02/2018','09/02/2018',52)
F2 = ('10/02/2018','11/02/2018',46)
F3 = ('12/02/2018','16/02/2018',55)
F4 = ('09/02/2018','28/02/2018',48)
F_mat = [[F0,F1,F2,F3,F4]]
F_test = pd.DataFrame(np.array(F_mat).reshape(5,3),columns=('startdate','enddate','price'))
#convert string dates into DateTime data type
F_test['startdate'] = pd.to_datetime(F_test['startdate'])
F_test['enddate'] = pd.to_datetime(F_test['enddate'])
#convert datetype to be datetime type for columns startdate and enddate
F['startdate'] = pd.to_datetime(F['startdate'])
F['enddate'] = pd.to_datetime(F['enddate'])
#create contract duration column
F['duration'] = (F['enddate'] - F['startdate']).dt.days + 1
#re-order the F matrix by column 'duration', ensure that the bootstrapping
#prioritises the shorter term contracts
F.sort_values(by=['duration'], ascending=[True])
# create prices P
P = pd.DataFrame()
for index, row in F.iterrows():
new_P_row = pd.Series()
for date in pd.date_range(row['startdate'], row['enddate']):
new_P_row[date] = row['price']
P = P.append(new_P_row, ignore_index=True)
P.fillna(0, inplace=True)
#create C matrix, which records the unique day prices across the observation interval
C = pd.DataFrame(np.zeros((1, intNbCalendarDays)))
C.columns = tempDateRange
#create the Repatriation matrix, which records the order in which contracts will be
#stored in the A matrix, which means that once results are generated
#from the linear solver, we know exactly which CalendarDays map to
#which columns in the results array
#this array contains numbers from 1 to NbContracts
R = pd.DataFrame(np.zeros((1, intNbCalendarDays)))
R.columns = tempDateRange
#define a zero filled matrix, P1, which will house the dominant daily prices
P1 = pd.DataFrame(np.zeros((intNbContracts, intNbCalendarDays)))
#rename columns of P1 to be the dates contained in matrix array D
P1.columns = tempDateRange
#create prices in correct rows in P
for i in list(range(0, intNbContracts)):
for j in list(range(0, intNbCalendarDays)):
if (P.iloc[i, j] != 0 and C.iloc[0,j] == 0) :
flUniqueCalendarMarker = P.iloc[i, j]
C.iloc[0,j] = flUniqueCalendarMarker
P1.iloc[i,j] = flUniqueCalendarMarker
R.iloc[0,j] = i
for k in list(range(j+1,intNbCalendarDays)):
if (C.iloc[0,k] == 0 and P.iloc[i,k] != 0):
C.iloc[0,k] = flUniqueCalendarMarker
P1.iloc[i,k] = flUniqueCalendarMarker
R.iloc[0,k] = i
elif (C.iloc[0,j] != 0 and P.iloc[i,j] != 0):
P1.iloc[i,j] = C.iloc[0,j]
#convert C dataframe into C_list, in prepataion for converting C_list
#into a unique, order preserved list
C_list = C.values.tolist()
#create C1 matrix, which records the unique day prices across unique days in the observation period
C1 = pd.DataFrame(np.unique(C))

Use DataFrame.duplicated() to check if your data-frame contains any duplicate or not.
If yes then you can try DataFrame.drop_duplicate() .

Related

trying to figure out a pythonic way of code that is taking time even after using list comprehension and pandas

I have two dataframes: one comprising a large data set, allprice_df, with time price series for all stocks; and the other, init_df, comprising selective stocks and trade entry dates. I am trying to find the highest price for each ticker symbol and its associated date.
The following code works but it is time consuming, and I am wondering if there is a better, more Pythonic way to accomplish this.
# Initial call
init_df = init_df.assign(HighestHigh = lambda x:
highestHigh(x['DateIdentified'], x['Ticker'], allprice_df))
# HighestHigh function in lambda call
def highestHigh(date1,ticker,allp_df):
if date1.size == ticker.size:
temp_df = pd.DataFrame(columns = ['DateIdentified','Ticker'])
temp_df['DateIdentified'] = date1
temp_df['Ticker'] = ticker
else:
print("dates and tickers size mismatching")
sys.exit(1)
counter = itertools.count(0)
high_list = [getHigh(x,y,allp_df, next(counter)) for x, y in zip(temp_df['DateIdentified'],temp_df['Ticker'])]
return high_list
# Getting high for each ticker
def getHigh(dateidentified,ticker,allp_df, count):
print("trade %s" % count)
currDate = datetime.datetime.now().date()
allpm_df = allp_df.loc[((allp_df['Ticker']==ticker)&(allp_df['date']>dateidentified)&(allp_df['date']<=currDate)),['high','date']]
hh = allpm_df.iloc[:,0].max()
hd = allpm_df.loc[(allpm_df['high']==hh),'date']
hh = round(hh,2)
h_list = [hh,hd]
return h_list
# Split the list in to 2 columns one with price and the other with the corresponding date
init_df = split_columns(init_df,"HighestHigh")
# The function to split the list elements in to different columns
def split_columns(orig_df,col):
split_df = pd.DataFrame(orig_df[col].tolist(),columns=[col+"Mod", col+"Date"])
split_df[col+"Date"] = split_df[col+"Date"].apply(lambda x: x.squeeze())
orig_df = pd.concat([orig_df,split_df], axis=1)
orig_df = orig_df.drop(col,axis=1)
orig_df = orig_df.rename(columns={col+"Mod": col})
return orig_df
There are a couple of obvious solutions that would help reduce your runtime.
First, in your getHigh function, instead of using loc to get the date associated with the maximum value for high, use idxmax to get the index of the row associated with the high and then access that row:
hh, hd = allpm_df[allpm_df['high'].idxmax()]
This will replace two O(N) operations (finding the maximum in a list, and doing a list lookup using a comparison) with one O(N) operation and one O(1) operation.
Edit
In light of your information on the size of your dataframes, my best guess is that this line is probably where most of your time is being consumed:
allpm_df = allp_df.loc[((allp_df['Ticker']==ticker)&(allp_df['date']>dateidentified)&(allp_df['date']<=currDate)),['high','date']]
In order to make this faster, I would setup your data frame to include a multi-index when you first create the data frame:
index = pd.MultiIndex.from_arrays(arrays = [ticker_symbols, dates], names = ['Symbol', 'Date'])
allp_df = pd.Dataframe(data, index = index)
allp_df.index.sortlevel(level = 0, sort_remaining = True)
This should create a dataframe with a sorted, multi-level index associated with your ticker symbol and date. Doing this will reduce your search time tremendously. Once you do that, you should be able to access all the data associated with a ticker symbol and a given date-range by doing this:
allp_df[ticker, (dateidentified: currDate)]
which should return your data much more quickly. For more information on multi-indexing, check out this helpful Pandas tutorial.

Check for existence of data from two Dataframe's columns in List

I'm searching for difference between columns in DataFrame and a data in List.
I'm doing it this way:
# pickled_data => list of dics
pickled_names = [d['company'] for d in pickled_data] # get values from dictionary to list
diff = df[~df['company_name'].isin(pickled_names)]
which works fine, but I realized that I need to check not only for company_name but also for place, because there could be two companies with the same name.
df contains also column place as well as pickled_data contains place key in the dictionary.
I would like to be able to do something like this
pickled_data = [(d['company'], d['place']) for d in pickled_data]
diff = df[~df['company_name', 'place'].isin(pickled_data)] # For two values in same row
You can convert values to MultiIndex by MultiIndex.from_tuples, then convert both columns too and compare:
pickled_data = [(d['company'], d['place']) for d in pickled_data]
mux = pd.MultiIndex.from_tuples(pickled_data)
diff = df[~df.set_index(['company_name', 'place']).index.isin(mux)]
Sample:
data = {'company_name':['A1','A2','A2','A1','A1','A3'],
'place':list('sdasas')}
df = pd.DataFrame(data)
pickled_data = [('A1','s'),('A2','d')]
mux = pd.MultiIndex.from_tuples(pickled_data)
diff = df[~df.set_index(['company_name', 'place']).index.isin(mux)]
print (diff)
company_name place
2 A2 a
4 A1 a
5 A3 s
You can form a set of tuples from your pickled_data for faster lookup later, then using a list comprehension over company_name and place columns of the frame, we get a boolean list of whether they are in the frame or not. Then we use this to index into the frame:
comps_and_places = set((d["company"], d["place"]) for d in pickled_data)
not_in_list = [(c, p) not in comps_and_places
for c, p in zip(df.company_name, df.place)]
diff = df[not_in_list]

Construct one row for each Index value from multiple by appending

I have got some data(42 features) collected from people during some months(maximum - 6; varies for different entries), every month's value is represented in its own row:
There are 9267 unique ID values(set as index) and as many as 50 000 rows in the df.
I want to convert it to 42 * 6 feature vectors for each ID(even though some will have a lot of NaNs there), so that i can train on them, here is how it should look like:
Here is my solution:
def flatten_features(f_matrix, ID):
'''constructs a 1x(6*n) vector from 6xn matrix'''
#check wether it is a series, not dataframe
if(len(f_matrix.shape) == 1):
f_matrix['ID'] = ID
return f_matrix
flattened_vector = f_matrix.iloc[0]
for i in range(1, f_matrix.shape[0]):
vector_append = f_matrix.iloc[i]
vector_append.index = (lambda month, series_names : series_names.map(lambda name : name + '_' + str(month)))\
(i, vector_append.index)
flattened_vector = flattened_vector.append(vector_append)
flattened_vector['ID'] = ID
return flattened_vector
#construct dataframe of flattened vectors for numerical features
new_indices = flatten_features(numerical_f.iloc[:6], 1).index
new_indices
flattened_num_f = pd.DataFrame(columns=new_indices)
flattened_num_f
for label in numerical_f.index.unique():
matr = numerical_f.loc[label]
flattened_num_f = flattened_num_f.append(flatten_features(matr, label))
It yields needed results, however it runs very slow. I wonder, is there a more elegant and fast solution?
if you want to transpose df, you could cam T function.
I assume you have id stored in unique_id variable
new_f = numerical_f.T
new_f.columns = unique_id

Python Pandas Panel counting value occurence

I have a large dataset stored as a pandas panel. I would like to count the occurence of values < 1.0 on the minor_axis for each item in the panel. What I have so far:
#%% Creating the first Dataframe
dates1 = pd.date_range('2014-10-19','2014-10-20',freq='H')
df1 = pd.DataFrame(index = dates)
n1 = len(dates)
df1.loc[:,'a'] = np.random.uniform(3,10,n1)
df1.loc[:,'b'] = np.random.uniform(0.9,1.2,n1)
#%% Creating the second DataFrame
dates2 = pd.date_range('2014-10-18','2014-10-20',freq='H')
df2 = pd.DataFrame(index = dates2)
n2 = len(dates2)
df2.loc[:,'a'] = np.random.uniform(3,10,n2)
df2.loc[:,'b'] = np.random.uniform(0.9,1.2,n2)
#%% Creating the panel from both DataFrames
dictionary = {}
dictionary['First_dataset'] = df1
dictionary['Second dataset'] = df2
P = pd.Panel.from_dict(dictionary)
#%% I want to count the number of values < 1.0 for all datasets in the panel
## Only for minor axis b, not minor axis a, stored seperately for each dataset
for dataset in P:
P.loc[dataset,:,'b'] #I need to count the numver of values <1.0 in this pandas_series
To count all the "b" values < 1.0, I would first isolate b in its own DataFrame by swapping the minor axis and the items.
In [43]: b = P.swapaxes("minor","items").b
In [44]: b.where(b<1.0).stack().count()
Out[44]: 30
Thanks for thinking with me guys, but I managed to figure out a surprisingly easy solution after many hours of attempting. I thought I should share it in case someone else is looking for a similar solution.
for dataset in P:
abc = P.loc[dataset,:,'b']
abc_low = sum(i < 1.0 for i in abc)

Python Import data dictionary and pattern

If I have data as:
Code, data_1, data_2, data_3, [....], data204700
a,1,1,0, ... , 1
b,1,0,0, ... , 1
a,1,1,0, ... , 1
c,0,1,0, ... , 1
b,1,0,0, ... , 1
etc. same code different value (0, 1, ?(not known))
I need to create a big matrix and I want to analyze.
How can I import data in a dictionary?
I want to use dictionary for column (204.700+1)
There is a built in function (or package) that return to me pattern?
(I expect a percent pattern). I mean as 90% of 1 in column 1, 80% of in column 2.
Alright so I am going to assume you want this in a dictionary for storing purposes and I will tell you that you don't want that with this kind of data. use a pandas DataFrame
this is how you will get your code into a dataframe:
import pandas as pd
my_file = 'file_name'
df = pd.read_csv(my_file)
now you don't need a package for returning the pattern you are looking for, just write a simple algorithm for returning that!
def one_percentage(data):
#get total number of rows for calculating percentages
size = len(data)
#get type so only grabbing the correct rows
x = data.columns[1]
x = data[x].dtype
#list of touples to hold amount of 1s and the column names
ones = [(i,sum(data[i])) for i in data if data[i].dtype == x]
my_dict = {}
#create dictionary with column names and percent
for x in ones:
percent = x[1]/float(size)
my_dict[x[0]] = percent
return my_dict
now if you want to get the percent of ones in any column, this is what you do:
percentages = one_percentage(df)
column_name = 'any_column_name'
print percentages[column_name]
now if you want to have it do every single column, then you can grab all of the column names and loop through them:
columns = [name for name in percentages]
for name in columns:
print str(percentages[name]) + "% of 1 in column " + name
let me know if you need anything else!

Categories