appending to a pandas dataframe - python

I want to make a pandas dataframe with two columns: read_id and score.
I am using the following code:
reads_array = []
for x in Bio.SeqIO.parse("inp.fasta", "fasta"):
    reads_array.append(x)

columns = ["read_id", "score"]
df = pd.DataFrame(columns=columns)
df = df.fillna(0)

for x in reads_array:
    alignments = pairwise2.align.globalms("ACTTGAT", str(x.seq), 2, -1, -.5, -.1)
    sorted_alignments = sorted(alignments, key=operator.itemgetter(2), reverse=True)
    read_id = x.name
    score = sorted_alignments[0][2]
    df['read_id'] = read_id
    df['score'] = score
But this does not work. Can you suggest a way of generating the dataframe df?

At the top make sure you have
import numpy as np
Then replace the code you shared with
reads_array = []
for x in Bio.SeqIO.parse("inp.fasta", "fasta"):
    reads_array.append(x)

df = pd.DataFrame(np.zeros((len(reads_array), 2)), columns=["read_id", "score"])

for index, x in enumerate(reads_array):
    alignments = pairwise2.align.globalms("ACTTGAT", str(x.seq), 2, -1, -.5, -.1)
    sorted_alignments = sorted(alignments, key=operator.itemgetter(2), reverse=True)
    read_id = x.name
    score = sorted_alignments[0][2]
    df.loc[index, 'read_id'] = read_id
    df.loc[index, 'score'] = score
There were two main problems with your original code:
1) Your dataframe had 0 rows.
2) df['column_name'] refers to the entire column, not a single cell, so when you execute df['column_name'] = value, every cell in that column gets set to that value.
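A tiny illustration of point 2 (the two-row frame here is made up for this sketch, not from the question):
import pandas as pd

df = pd.DataFrame({'read_id': ['a', 'b'], 'score': [0, 0]})
df['score'] = 5          # assigns the whole column: every row's score becomes 5
df.loc[1, 'score'] = 7   # targets a single cell: only row 1 changes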

df['read_id'] and df['score'] are Series. So if you want to iterate over reads_array, calculate some values, and assign them to df's columns, try the following (use .loc rather than the deprecated .ix, and avoid chained indexing like df.ix[i]['read_id'] = ..., which may not write back to the dataframe):
for i, x in enumerate(reads_array):
    ...
    df.loc[i, 'read_id'] = read_id
    df.loc[i, 'score'] = score
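As an alternative sketch (my own, assuming the same reads_array, pairwise2, operator, and pd objects used above), you can avoid per-cell assignment entirely by collecting plain Python rows and building the DataFrame once at the end:
rows = []
for x in reads_array:
    alignments = pairwise2.align.globalms("ACTTGAT", str(x.seq), 2, -1, -.5, -.1)
    # highest-scoring alignment; element 2 of a pairwise2 tuple is the score
    best = sorted(alignments, key=operator.itemgetter(2), reverse=True)[0]
    rows.append({"read_id": x.name, "score": best[2]})

df = pd.DataFrame(rows, columns=["read_id", "score"])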

Related

How to drop a column if there are more than 55% repeated values in the column?

I have a dataframe and need to drop all the columns that contain more than 55% of repeated/duplicate values in each column.
Would anyone be able to assist me on how to do this?
Please try this:
Let df1 be your dataframe:
drop_columns = []
drop_threshold = 0.55  # define the percentage criterion for dropping

for cols in df1.columns:
    df_count = df1[cols].value_counts().reset_index()
    df_count['drop_percentage'] = df_count[cols] / df1.shape[0]
    df_count['drop_criterion'] = df_count['drop_percentage'] > drop_threshold
    if True in df_count.drop_criterion.values:
        drop_columns.append(cols)

df1 = df1.drop(columns=drop_columns, axis=1)
Let's use pd.Series.duplicated:
cols_to_keep = df.columns[df.apply(pd.Series.duplicated).mean() <= .55]
df[cols_to_keep]
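As a quick illustration (the toy frame below is made up for this sketch), pd.Series.duplicated marks every repeat of an earlier value as True, so the column-wise mean is the fraction of repeated entries:
import pandas as pd

df = pd.DataFrame({
    'a': [1, 1, 1, 1, 2],   # 3 of 5 entries are repeats -> 0.6 > 0.55, dropped
    'b': [1, 2, 3, 4, 5],   # no repeats -> 0.0, kept
})
cols_to_keep = df.columns[df.apply(pd.Series.duplicated).mean() <= .55]
print(df[cols_to_keep])     # only column 'b' survives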
If you're referring to columns in which the most common value is repeated in more than 55% of rows, here's a solution:
from collections import Counter

# assuming some DataFrame named df
bool_idx = df.apply(lambda x: max(Counter(x).values()) < len(x) * .55, axis=0)
df = df.loc[:, bool_idx]
If you're talking about non-unique values, this works:
bool_idx = df.apply(
    lambda x: sum(
        y for y in Counter(x).values() if y > 1
    ) < .55 * len(x),
    axis=0
)
df = df.loc[:, bool_idx]

Cover all columns using the least amount of rows in a pandas dataframe

I have a pandas dataframe looking like the following picture:
The goal here is to select the least amount of rows to have a "1" in all columns. In this scenario, the final selection should be these two rows:
The algorithm should work even if I add columns and rows. It should also work if I change the combination of 1 and 0 in any given row.
Use sum per row, then compare with Series.ge (>=) for greater than or equal, and filter by boolean indexing:
df[df.sum(axis=1).ge(2)]
If you want to test the 1 or 0 values explicitly, first compare with DataFrame.eq for equality (==):
df[df.eq(1).sum(axis=1).ge(2)]
df[df.eq(0).sum(axis=1).ge(2)]
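For clarity, a tiny demo of what the first expression selects (the 0/1 frame below is made up for this sketch): it keeps the rows that contain at least two 1s.
import pandas as pd

toy = pd.DataFrame({'A': [1, 0, 1], 'B': [0, 1, 1], 'C': [1, 0, 0]})
print(toy[toy.sum(axis=1).ge(2)])   # keeps rows 0 and 2, which each contain at least two 1s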
For those interested, this is how I managed to do it:
def _getBestRowsFinalSelection(self, df, cols):
    """
    Get the selected rows for the final selection

    Parameters:
        1. df: Dataframe to use
        2. cols: Columns of the binary variables in the Dataframe object (df)

    RETURNS -> DataFrame : dfSelected
    """
    isOne = df.loc[df[df.loc[:, cols] == 1].sum(axis=1) > 0, :]
    lstIsOne = isOne.loc[:, cols].values.tolist()
    lstIsOne = [(x, lstItem) for x, lstItem in zip(isOne.index.values.tolist(), lstIsOne)]

    winningComb = None
    stopFlag = False
    for i in range(1, isOne.shape[0] + 1):
        if stopFlag:
            break
        combs = combinations(lstIsOne, i)  # from itertools
        for c in combs:
            data = [x[1] for x in c]
            index = [x[0] for x in c]
            dfTmp = pd.DataFrame(data=data, columns=cols, index=index)
            if (dfTmp.sum() > 0).all():
                dfTmp["Final Selection"] = "Yes"
                winningComb = dfTmp
                stopFlag = True
                break

    return winningComb
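For reference, a standalone sketch of the same brute-force idea (the function name min_cover_rows and the toy frame are my own, not from the question): try row combinations of increasing size and stop at the first one that covers every column.
from itertools import combinations

import pandas as pd

def min_cover_rows(df):
    # Try row subsets of increasing size; return the first subset whose
    # column-wise maximum is 1 everywhere, i.e. every column is covered.
    for k in range(1, len(df) + 1):
        for rows in combinations(df.index, k):
            subset = df.loc[list(rows)]
            if (subset.max() == 1).all():
                return subset
    return None

toy = pd.DataFrame({'A': [1, 0, 1], 'B': [0, 1, 1], 'C': [1, 1, 0]},
                   index=['r1', 'r2', 'r3'])
print(min_cover_rows(toy))   # rows r1 and r2 together cover A, B and C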

simplify splitting a dataframe to several dataframes

So I have some dataframes (df0, df1, df2) with various numbers of rows. I want to split any dataframe that has more than 30 rows into several dataframes of 30 rows each. So for example, my dataframe df0 has 156 rows; I would then separate this dataframe into several dataframes like this:
if len(df0) > 30:
    df0_A = df0[0:30]
    df0_B = df0[31:60]
    df0_C = df0[61:90]
    df0_D = df0[91:120]
    df0_E = df0[121:150]
    df0_F = df0[151:180]
else:
    df0 = df0
The problem with this code is that I then need to repeat it exhaustively many times for the code that follows, like this:
df0 = pd.DataFrame(df0)
df0_A = pd.DataFrame(df0_A)
df0_B = pd.DataFrame(df0_B)
df0_C = pd.DataFrame(df0_C)
df0_D = pd.DataFrame(df0_D)
df0_E = pd.DataFrame(df0_E)
df0_F = pd.DataFrame(df0_F)

df0 = df0.to_string(header=False,
                    index=False,
                    index_names=False).split('\n')
df0_A = df0_A.to_string(header=False,
                        index=False,
                        index_names=False).split('\n')
df0_B = df0_B.to_string(header=False,
                        index=False,
                        index_names=False).split('\n')
df0_C = df0_C.to_string(header=False,
                        index=False,
                        index_names=False).split('\n')
df0_D = df0_D.to_string(header=False,
                        index=False,
                        index_names=False).split('\n')
df0_E = df0_E.to_string(header=False,
                        index=False,
                        index_names=False).split('\n')
df0_F = df0_F.to_string(header=False,
                        index=False,
                        index_names=False).split('\n')

df0 = [','.join(ele.split()) for ele in df0]
df0_A = [','.join(ele.split()) for ele in df0_A]
df0_B = [','.join(ele.split()) for ele in df0_B]
df0_C = [','.join(ele.split()) for ele in df0_C]
df0_D = [','.join(ele.split()) for ele in df0_D]
df0_E = [','.join(ele.split()) for ele in df0_E]
df0_F = [','.join(ele.split()) for ele in df0_F]
Now imagine I have ten dataframes, and each needs to be split into five dataframes. Then I would need to write the same code 50 times!
I'm quite new to Python. So, can anyone help me simplify this code, maybe with a simple for loop? Thanks.
You could probably automate it a little bit more, but this should be enough!
import copy
import numpy as np

df0 = pd.DataFrame({'Test': np.random.randint(100000, 999999, size=180)})
len(df0)

if len(df0) > 30:
    df_dict = {}
    x = 0
    y = 30
    for df_letter in ['A', 'B', 'C', 'D', 'E', 'F']:
        df_name = f'df0_{df_letter}'
        df_dict[df_name] = copy.deepcopy(df_letter)
        df_dict[df_name] = pd.DataFrame(df0[x:y]).to_string(header=False, index=False, index_names=False).split('\n ')
        df_dict[df_name] = [','.join(ele.split()) for ele in df_dict[df_name]]
        x += 30
        y += 30
        df_name
else:
    df0

for df in df_dict:
    print(df)
    print('--------------------------------------------------------------------')
    print(f'length: {len(df_dict[df])}')
    print('--------------------------------------------------------------------')
    print(df_dict[df])
    print('--------------------------------------------------------------------')
Assuming you have one column for identification,
def split_df(idf, idcol, nsize):
    g = idf.groupby(idcol)
    # Compute the size for each value of the identification column
    size = g.size()
    dflist = []
    for _id, _idcount in size.items():
        if _idcount > nsize:
            # print(_id, ' = ', _idcount)
            idx = idf[idf[idcol].eq(_id)].index
            # print(idx)
            # let's split the array into equal parts of `nsize`
            # e.g. [1,2,3,4,5] with nsize = 2 will split into ([1,2], [3,4], [5])
            ilist = np.array_split(idx, round(idx.shape[0] / nsize + 0.5))
            dflist += ilist
    return [idf.loc[idx].copy(deep=True) for idx in dflist]

df = pd.DataFrame(data=np.hstack((np.random.choice(np.arange(1, 3), 10).reshape(10, -1), np.random.rand(10, 3))),
                  columns=['id', 'a', 'b', 'c'])
df = df.astype({'id': np.int64})
split_df(df, 'id', 2)
This is a great problem, you can use this (data is the DataFrame here):
# Create subset start positions of size 30 for the DataFrame
subsets = list(range(0, len(data), 30))
# Start cutoffs for the subsets of the DataFrame
start_cutoff = subsets
# End cutoffs for the subsets of the DataFrame
end_cutoff = subsets[1:] + [len(data)]
# Zip the start and end cutoffs into a list of (start, end) pairs
cutoffs = list(zip(start_cutoff, end_cutoff))
# List containing the split DataFrames
list_dfs = [data.iloc[start:end] for start, end in cutoffs]
# Convert each split DataFrame into a list of comma-joined row strings
string_dfs = [df.to_string(header=False, index=False, index_names=False).split('\n') for df in list_dfs]
final_df_list = [[','.join(ele.split()) for ele in string_df] for string_df in string_dfs]
Now you can access the split DataFrames (as lists of comma-joined row strings) by:
print(final_df_list[0])
print(final_df_list[1])
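If the string conversion is not needed, a minimal sketch (my own; the chunk size of 30 and the dictionary keys are assumptions) that keeps each piece as a DataFrame is:
chunk_size = 30
chunks = {f'df0_{i}': df0.iloc[start:start + chunk_size]
          for i, start in enumerate(range(0, len(df0), chunk_size))}
# chunks['df0_0'] holds rows 0-29, chunks['df0_1'] holds rows 30-59, and so on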

Is there a pandas way to compute a function between 2 columns?

I am looking for a faster way to compute some kind of function across multiple columns.
My dataframe looks like:
c = 12 * 1000
b = int(c / 2)
d = int(b / 2)
newdf = {'Class': ['c1'] * c + ['c2'] * c + ['c3'] * c,
         'Section': ['A'] * b + ['B'] * b + ['C'] * b + ['D'] * b + ['E'] * b + ['F'] * b,
         'Time': [1, 2, 3, 4, 5, 6] * d + [3, 1, 3, 4, 5, 7] * d}
test = pd.DataFrame(newdf)
test['f_x'] = test['Time']**2 / 5
test['f_x_2'] = test['Time']**2 / 5 + test['f_x']

# working with 1 column
test['section_mean'] = test.groupby(['Class', 'Section'])['f_x'].transform(lambda x: x.mean())
test['two_col_sum'] = test[['Time', 'f_x']].apply(lambda x: x.Time + x.f_x, axis=1)

cols = ['f_x', 'f_x_2']
I know how to calculate, for example, a value for a single column within groups:
test['section_mean'] = test.groupby(['Class','Section'])['f_x'].transform(lambda x: x.mean())
Or do simple operations between multiple columns:
test['two_col_sum'] = test[['Time','f_x']].apply(lambda x: x.Time+x.f_x,axis=1)
However, what I'm trying to do is a computation over the full columns of each group:
%%time
slopes_df = pd.DataFrame()
grouped = test.groupby(['Class', 'Section'])
for name, group in grouped:
    nd = []
    for col in cols:
        ntest = group[['Time', col]]
        x = ntest.Time
        y = ntest[col]
        f = np.polyfit(x, y, deg=1).round(2)
        data = [name[0], name[1], col, f[0], f[1]]
        nd.append(data)
    slopes_df = pd.concat([slopes_df, pd.DataFrame(nd)])

slopes_df.columns = ['Class', 'Section', 'col', 'slope', 'intercept']
slopes_df_p = pd.pivot_table(data=slopes_df, index=['Class', 'Section'], columns=['col'], values=['slope', 'intercept']).reset_index()
slopes_df_p.columns = pd.Index(e[0] if e[0] in ['Class', 'Section'] else e[0] + '_' + e[1] for e in slopes_df_p.columns)
fdf = pd.merge(test, slopes_df_p, on=['Class', 'Section'])
I tried the proposed solution this way:
%%time
for col in cols:
    df1 = (test.groupby(['Class', 'Section'])
               .apply(lambda x: np.polyfit(x['Time'], x[col], deg=1).round(2)[0])
               .rename('slope_' + str(col)))
    df2 = (test.groupby(['Class', 'Section'])
               .apply(lambda x: np.polyfit(x['Time'], x[col], deg=1).round(2)[1])
               .rename('intercept_' + str(col)))
    df1['col'] = col
    df2['col'] = col
    test = pd.merge(test, df1, on=['Class', 'Section'])
    test = pd.merge(test, df2, on=['Class', 'Section'])
but it seems slower: on my PC the first loop takes 150 ms and the second version 300 ms.
Andrea
Your loop solution does not operate on the per-group data, so I think you need GroupBy.apply:
def f(x):
    for col in cols:
        x[f'slope_{col}'], x[f'intercept_{col}'] = np.polyfit(x['Time'], x[col], deg=1).round(2)
    return x

df1 = test.groupby(['Class', 'Section']).apply(f)
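If np.polyfit over many small groups is the bottleneck, a minimal sketch of a vectorised alternative (my own, not from the answer; the function name add_group_linfit is an assumption, and it reuses the test frame and cols list from the question) computes the degree-1 fit in closed form with grouped transforms, which avoids the Python-level loop over groups:
import numpy as np
import pandas as pd

def add_group_linfit(df, xcol, ycols, by):
    # Closed-form least squares per group:
    # slope = sum((x - mean_x) * (y - mean_y)) / sum((x - mean_x)**2)
    # intercept = mean_y - slope * mean_x
    g = df.groupby(by)
    keys = [df[c] for c in by]
    x_dev = df[xcol] - g[xcol].transform('mean')
    denom = (x_dev ** 2).groupby(keys).transform('sum')
    for col in ycols:
        y_dev = df[col] - g[col].transform('mean')
        slope = (x_dev * y_dev).groupby(keys).transform('sum') / denom
        intercept = g[col].transform('mean') - slope * g[xcol].transform('mean')
        df[f'slope_{col}'] = slope.round(2)
        df[f'intercept_{col}'] = intercept.round(2)
    return df

fdf = add_group_linfit(test.copy(), 'Time', cols, ['Class', 'Section'])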

Issue with .dropna() in Pandas

In the function below I am working with a pandas dataframe. I bring in a dataframe and immediately reset the index. I then make a copy of that dataframe so I avoid any chained-assignment issues.
I then want to use .dropna(inplace=True, subset=[header], axis=0) to remove any rows where my column of interest (header) is NaN. However, once I get into the for loop it is clear that the NaN values haven't been dropped, because I keep getting warnings like:
RuntimeWarning: Mean of empty slice
which is a result of my array neighbors containing all NaN values.
My question: in the line where I use df_copy.dropna(inplace=True, subset=[header], axis=0), does it seem like I am not actually getting a permanent drop of those rows?
n_samples = 10
tolerance = 1.5
dataframe = pd.read_csv('my_file.csv')

def removeOutliers(dataframe, header):
    dataframe.reset_index(inplace=True, drop=True)
    df_copy = dataframe.copy()

    # Why doesn't the below actually drop the NaNs?
    df_copy.dropna(inplace=True, subset=[header], axis=0)

    for ii in range(len(df_copy['Lng'])):
        a = df_copy.iloc[ii]['Lng'] - df_copy.iloc[:]['Lng']
        b = df_copy.iloc[ii]['Lat'] - df_copy.iloc[:]['Lat']
        c = np.array((a**2 + b**2)**0.5)

        d = np.zeros((len(df_copy['Lng'])))
        e = np.zeros((len(df_copy['Lng'])))
        d[:] = df_copy.iloc[:]['Well']
        e[:] = df_copy.iloc[:][header]

        idx = np.argpartition(c, n_samples + 1)
        max_loc = np.where(e[idx[0:n_samples + 1]] == e[ii])
        neighbors = np.delete(e[idx[0:n_samples + 1]], max_loc)
        avg = np.nanmean(neighbors)
        std = np.nanstd(neighbors)

        if df_copy.iloc[ii][header] > (avg + tolerance * std) or df_copy.iloc[ii][header] < (avg - tolerance * std):
            df_copy.iloc[ii, df_copy.columns.get_loc(header)] = np.nan

    return df_copy

test_data = removeOutliers(dataframe, 'myColumn')
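One quick way to verify whether the drop really happened (a diagnostic sketch of my own, not an answer from the thread; it reuses the dataframe and the 'myColumn' column from the question) is to count NaNs in the subset column right after the call. Note also that the loop writes NaN back into df_copy[header] for detected outliers and re-reads that column into e on later iterations, which may be where the all-NaN neighbors arrays come from.
header = 'myColumn'                      # the column of interest, as in the call above
df_copy = dataframe.copy()
df_copy.dropna(inplace=True, subset=[header], axis=0)
print(df_copy[header].isna().sum())      # prints 0 if the rows were really dropped
print(len(dataframe), len(df_copy))      # row counts before and after the drop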
