Subset a DataFrame - python

If I have this data frame:
df = pd.DataFrame(
{"A":[45,67,12,78,92,65,89,12,34,78],
"B":["h","b","f","d","e","t","y","p","w","q"],
"C":[True,False,False,True,False,True,True,True,True,True]})
How can I select 50% of the rows, so that column "C" is True in 90% of the selected rows and False in 10% of them?

First, create a DataFrame with 1000 rows:
import pandas as pd
df = pd.DataFrame(
{"A":[45,67,12,78,92,65,89,12,34,78],
"B":["h","b","f","d","e","t","y","p","w","q"],
"C":[True,False,False,True,False,True,True,True,True,True]})
df = pd.concat([df]*100)
print(df)
Second, compute true_row_num and false_row_num:
row_num, _ = df.shape
true_row_num = int(row_num * 0.5 * 0.9)
false_row_num = int(row_num * 0.5 * 0.1)
print(true_row_num, false_row_num)
Third, randomly sample true_df and false_df respectively:
true_df = df[df["C"]].sample(true_row_num)
false_df = df[~df["C"]].sample(false_row_num)
new_df = pd.concat([true_df, false_df])
new_df = new_df.sample(frac=1.0).reset_index(drop=True) # shuffle
print(new_df["C"].value_counts())

If you calculate the needed sizes ex ante and then sample each group at random, it should work. For example (using pd.concat, since DataFrame.append has been removed in recent pandas versions):
new = pd.concat([df.query('C == True').sample(int(0.5 * len(df) * 0.9)),
                 df.query('C == False').sample(int(0.5 * len(df) * 0.1))])
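The same idea generalizes to a small helper for stratified sampling. This is only a sketch, not from the answers above; the function name and the proportions mapping are mine:
import pandas as pd

def stratified_sample(df, frac, proportions, col="C", random_state=None):
    # proportions maps each value in `col` to its share of the final sample,
    # e.g. {True: 0.9, False: 0.1}; the shares are assumed to sum to 1
    n_total = int(len(df) * frac)
    parts = [df[df[col] == value].sample(int(n_total * share), random_state=random_state)
             for value, share in proportions.items()]
    return pd.concat(parts).sample(frac=1.0, random_state=random_state).reset_index(drop=True)

# usage: stratified_sample(df, 0.5, {True: 0.9, False: 0.1})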

Related

Remove non numeric rows from dataframe

I have a dataframe of patients and their gene expressions. It has this format:
Patient_ID | gene1 | gene2 | ... | gene10000
p1 0.142 0.233 ... bla
p2 0.243 0.243 ... -0.364
...
p4000 1.423 bla ... -1.222
As you can see, the dataframe contains noise: cells whose values are not floats.
I want to remove every row that has a non-numeric value in any column.
I've managed to do this using apply and pd.to_numeric like this:
cols = df.columns[1:]
df[cols] = df[cols].apply(pd.to_numeric, errors='coerce')
df = df.dropna()
The problem is that it takes forever to run, and I need a better, more efficient way of achieving this.
EDIT: To reproduce something like my data:
arr = np.random.random_sample((3000,10000))
df = pd.DataFrame(arr, columns=['gene' + str(i) for i in range(10000)])
df = pd.concat([pd.DataFrame(['p' + str(i) for i in range(10000)], columns=['Patient_ID']),df],axis = 1)
df['gene0'][2] = 'bla'
df['gene9998'][4] = 'bla'
It was indeed worth trying numpy :)
I got a 30-60x faster version (the bigger the array, the larger the improvement).
Convert to a numpy array (.values)
Iterate through all rows
Try to convert each row to a row of floats
If it fails (some non-numeric value present), note this in a boolean array
Build the filtered DataFrame based on the results
Code:
import pandas as pd
import numpy as np
from line_profiler_pycharm import profile

def op_version(df):
    cols = df.columns[1:]
    df[cols] = df[cols].apply(pd.to_numeric, errors='coerce')
    return df.dropna()

def np_version(df):
    keep = np.full(len(df), True)
    for idx, row in enumerate(df.values[:, 1:]):
        try:
            # np.float was removed from NumPy; the builtin float does the same here
            row.astype(float)
        except (ValueError, TypeError):
            keep[idx] = False
            # maybe it's better to store a to_remove list, depends on data
    return df[keep]

@profile
def main():
    arr = np.random.random_sample((3000, 5000))
    df = pd.DataFrame(arr, columns=['gene' + str(i) for i in range(5000)])
    df = pd.concat([pd.DataFrame(['p' + str(i) for i in range(3000)],
                                 columns=['Patient_ID']), df], axis=1)
    df['gene0'][2] = 'bla'
    df['gene998'][4] = 'bla'

    df2 = df.copy()

    df = op_version(df)
    df2 = np_version(df2)

if __name__ == '__main__':
    main()
Note that I decreased the number of columns so it is more feasible for tests.
Also, I fixed a small bug in your example; instead of:
df = pd.concat([pd.DataFrame(['p' + str(i) for i in range(10000)], columns=['Patient_ID']),df],axis = 1)
I think it should be
df = pd.concat([pd.DataFrame(['p' + str(i) for i in range(3000)], columns=['Patient_ID']),df],axis = 1)
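Another possibility, not from the answer above but worth timing against it, is to coerce the whole numeric block with a single pd.to_numeric call on the flattened array and build a row mask from the result (a sketch, assuming Patient_ID is the first column as in the question):
import numpy as np
import pandas as pd

# coerce everything except Patient_ID in one call, then reshape the NaN mask back to rows
vals = pd.to_numeric(df.iloc[:, 1:].to_numpy().ravel(), errors='coerce')
keep = ~np.isnan(vals).reshape(len(df), -1).any(axis=1)
clean = df[keep]
Like the dropna approach in the question, this also drops rows that contain genuine NaN values.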

How to reduce the time complexity of KS test python code?

I am currently working on a project where I need to compare whether two distributions are the same or not. For that I have two data frames, both containing numeric values only:
1) db_df - which is from the db
2) data - which is the user-uploaded dataframe
I have to compare every column of db_df with the columns of data, find the similar columns in data, and suggest them to the user as suggestions for the db column.
The dimensions of both data frames are 100 rows, 239 columns.
from scipy.stats import kstest
import pandas as pd
import time

row_list = []
suggestions = dict()
s = time.time()
db_data_columns = db_df.columns
data_columns = data.columns

for i in db_data_columns:
    col_list = list()
    for j in data_columns:
        # perform Kolmogorov-Smirnov test
        col_list.append(kstest(db_df[i], data[j])[1])
    row_list.append(col_list)

print(f"=== AFTER FOR TIME {time.time()-s}")

df = pd.DataFrame(row_list).T
df.columns = db_df.columns
df.index = data.columns

for i in df.columns:
    sorted_df = df.sort_values(by=[i], ascending=False)
    sorted_df = sorted_df[sorted_df > 0.05]
    sorted_df = sorted_df[:3].loc[:, i:i]
    sorted_df = sorted_df.dropna()
    suggestions[sorted_df.columns[0]] = list(sorted_df.to_dict().values())[0]
After getting all the p-values for all the columns in db_df against data, I need to select the top 3 columns from data for each column in db_df.
Overall, the time taken for this is 14 seconds, which is very long. Is there any chance to reduce the time to less than 5 seconds?
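No answer is shown here, but as a sketch of one direction (the helper names and the worker count are mine; the 0.05 cut-off and the top-3 rule come from the question), the KS loop can be spread over processes and the selection step written with nlargest:
from concurrent.futures import ProcessPoolExecutor

import pandas as pd
from scipy.stats import kstest

def ks_row(args):
    # p-values of one db_df column against every column of `data`
    db_col, data = args
    return [kstest(db_col, data[j])[1] for j in data.columns]

def suggest(db_df, data, workers=4):
    # on Windows this must be called under an `if __name__ == '__main__':` guard
    with ProcessPoolExecutor(max_workers=workers) as pool:
        rows = list(pool.map(ks_row, [(db_df[i], data) for i in db_df.columns]))
    pvals = pd.DataFrame(rows, index=db_df.columns, columns=data.columns).T
    # keep p-values above 0.05 and take the 3 best matching data columns per db column
    return {col: pvals[col][pvals[col] > 0.05].nlargest(3).to_dict()
            for col in pvals.columns}
Whether the process pool actually beats the plain loop depends on the machine and the pickling overhead, so it is worth timing both.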

pandas dataframe creating columns with loop

I'm trying to add new columns and fill them with data using for loops: take data from the Price column and insert 1000 values into a new dataframe column; after 1000 Price values, start a new column for the next 1000, and so on.
import pandas as pd
import matplotlib.pyplot as plt

data_frame = pd.read_csv('candle_data.csv', names=['Time', 'Symbol', 'Side', 'Size', 'Price', '1', '2', '3', '4', '5'])
price_df = pd.DataFrame()
count_tick = 0
count_candle = 0
for price in data_frame['Price']:
    if count_tick < 1000:
        price_df[count_candle] = price
        count_tick += 1
    elif count_tick == 1000:
        count_tick = 0
        count_candle += 1
price_df.head()
It's not necessary to loop through the data frame; you can use slicing to achieve this. Look at the sample code below. I have loaded a DataFrame with 100 rows and I create column 'col3' from the first 50 rows of 'col1', and then column 'col4' from the next 50 rows of 'col1'. You can modify the code to point to your columns and the values that you want.
import pandas as pd
import numpy as np

if __name__ == '__main__':
    col1 = np.linspace(0, 100, 100)
    col2 = np.linspace(100, 200, 100)
    dict = {'col1': col1, 'col2': col2}
    df = pd.DataFrame(dict)
    df['col3'] = df['col1'][0:50]
    df['col4'] = df['col1'][50:100]
    print(df)
Solution 2, based on the added info from the comments:
import pandas as pd
import numpy as np

if __name__ == '__main__':
    pd.set_option('display.width', 100000)
    pd.set_option('display.max_columns', 500)
    ## partition size; for the example I have taken a low volume of 20
    part_size = 20
    ## number generation for the data frame
    col1 = np.linspace(0, 100, 100)
    col2 = np.linspace(100, 200, 100)
    ## create the initial data frame
    dict = {'col1': col1, 'col2': col2}
    df = pd.DataFrame(dict)
    len = df.shape[0]
    ## tells you how many new columns you need
    rec = int(len / part_size)
    _ = {}
    ## initialize slicing variables
    low = 0
    high = part_size
    print(len)
    for i in range(rec):
        if high >= len:
            _['col_name_here{0}'.format(i)] = df[low:]['col1']
            break
        else:
            _['col_name_here{0}'.format(i)] = df[low:high]['col1']
            low = high
            high += part_size
    df = df.assign(**_)
    print(df)
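For the specific goal in the question (one column per block of 1000 prices), reshaping the underlying numpy array is another route. This is only a sketch and assumes the incomplete final chunk can be dropped:
import numpy as np
import pandas as pd

chunk = 1000
prices = data_frame['Price'].to_numpy()
n_full = (len(prices) // chunk) * chunk  # drop the incomplete tail chunk
price_df = pd.DataFrame(prices[:n_full].reshape(-1, chunk).T)  # one column per 1000-price block
price_df.head()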

How to add some calculation in columns of the dataframe in python

I am reading an Excel sheet using pandas.read_excel and get the output in a dataframe, but I want to add some calculations after reading it. I need to apply the following calculation to each x and y column:
ratiox = (73.77481944859028 - 73.7709567323327) / 720
ratioy = (18.567453940477293 - 18.56167674097576) / 1184
mapLongitudeStart = 73.7709567323327
mapLatitudeStart = 18.567453940477293
longitude, latitude = 0, 0
longitude = mapLongitudeStart + x1 * ratiox   # I have taken the value of a single column x1
latitude = mapLatitudeStart - (-y1 * ratioy)  # taken the column y1 value
How do I apply this calculation to every x and y column and row that has values? It should not take the null values. And I want a new dataframe created from the calculated columns.
Try the below code:
import pandas as pd
import itertools

df = pd.read_excel('file_path')
# ratiox, ratioy, mapLongitudeStart, mapLatitudeStart as defined in the question
dfx = df.loc[:, 'x1'::2]  # .ix has been removed from pandas; .loc label slicing works here
dfy = df.loc[:, 'y1'::2]
li = [dfx.apply(lambda x: mapLongitudeStart + x * ratiox),
      dfy.apply(lambda y: mapLatitudeStart - (-y * ratioy))]
df_new = pd.concat(li, axis=1)
df_new = df_new[list(itertools.chain(*zip(dfx.columns, dfy.columns)))]
print(df_new)
Hope this helps!
I would first recommend reshaping your data into a long format; that way you can get rid of the empty cells naturally. Also, most pandas functions work better that way, because you can then use things like group-by operations on all x or y or whatever dimension.
from itertools import chain
import numpy as np
import pandas as pd

## this part is only to have a running example
## here you would load your excel file
D = pd.DataFrame(
    np.random.randn(10, 6),
    columns=list(chain(*[[f"x{i}", f"y{i}"] for i in range(1, 4)]))
)
D["rowid"] = np.arange(len(D))  # pd.np has been removed; use numpy directly

D = D.melt(id_vars="rowid").dropna()
D["varIndex"] = D.variable.str[1]
D["variable"] = D.variable.str[0]
D = D.set_index(["varIndex", "rowid", "variable"])\
     .unstack("variable")\
     .droplevel(0, axis=1)
These transformations give you a table with an index both for the original row id (maybe it is a time series or something else) and for the variable index, so x1 or x2 etc.
Now you can do your calculations, for example by overwriting the previous columns:
## Everything here is a constant
ratiox = (73.77481944859028 - 73.7709567323327) / 720
ratioy = (18.567453940477293 - 18.56167674097576) / 1184
mapLongitudeStart = 73.7709567323327
mapLatitudeStart = 18.567453940477293

# apply the calculations directly to the columns
D["x"] = mapLongitudeStart + D["x"] * ratiox
D["y"] = mapLatitudeStart - (-D["y"] * ratioy)
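If you need the original wide layout back after the computation, unstacking the variable index is one way to get there (a sketch, assuming D as built above):
wide = D.unstack("varIndex")  # columns become a MultiIndex of (x/y, varIndex)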

Python Pandas Panel counting value occurence

I have a large dataset stored as a pandas panel. I would like to count the occurrence of values < 1.0 on the minor_axis for each item in the panel. What I have so far:
#%% Creating the first DataFrame
dates1 = pd.date_range('2014-10-19', '2014-10-20', freq='H')
df1 = pd.DataFrame(index=dates1)
n1 = len(dates1)
df1.loc[:, 'a'] = np.random.uniform(3, 10, n1)
df1.loc[:, 'b'] = np.random.uniform(0.9, 1.2, n1)
#%% Creating the second DataFrame
dates2 = pd.date_range('2014-10-18', '2014-10-20', freq='H')
df2 = pd.DataFrame(index=dates2)
n2 = len(dates2)
df2.loc[:, 'a'] = np.random.uniform(3, 10, n2)
df2.loc[:, 'b'] = np.random.uniform(0.9, 1.2, n2)
#%% Creating the panel from both DataFrames
dictionary = {}
dictionary['First_dataset'] = df1
dictionary['Second dataset'] = df2
P = pd.Panel.from_dict(dictionary)
#%% I want to count the number of values < 1.0 for all datasets in the panel,
## only for minor axis 'b', not minor axis 'a', stored separately for each dataset
for dataset in P:
    P.loc[dataset, :, 'b']  # I need to count the number of values < 1.0 in this pandas Series
To count all the "b" values < 1.0, I would first isolate b in its own DataFrame by swapping the minor axis and the items.
In [43]: b = P.swapaxes("minor","items").b
In [44]: b.where(b<1.0).stack().count()
Out[44]: 30
Thanks for thinking with me guys, but I managed to figure out a surprisingly easy solution after many hours of attempting. I thought I should share it in case someone else is looking for a similar solution.
for dataset in P:
    abc = P.loc[dataset, :, 'b']
    abc_low = sum(i < 1.0 for i in abc)
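Note that pd.Panel was removed in pandas 1.0; with a plain dict of DataFrames, like the dictionary built above, the same count can be done directly (a sketch):
# count the values < 1.0 in column 'b' for each dataset, keyed by dataset name
counts = {name: int((frame['b'] < 1.0).sum()) for name, frame in dictionary.items()}
print(counts)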
