I have this pandas.core.frame.DataFrame:
Gorilla A T C C A G C T
Dog G G G C A A C T
Humano A T G G A T C T
Drosophila A A G C A A C C
Elefante T T G G A A C T
Mono A T G C C A T T
Unicornio A T G G C A C T
I would like to get a data frame like this:
A 5 1 0 0 5 5 0 0
C 0 0 1 4 2 0 6 1
G 1 1 6 3 0 1 0 0
T 1 5 0 0 0 1 1 6
Basically, what I want is to count the letter frequencies column by column and build the second data frame as shown.
I want to do this because, ultimately, I would like to get a consensus string. It should be something like: A T G C A A C T
Could anyone help me or give me some advice?
Try:
result = df.apply(pd.value_counts).fillna(0)
col1 col2 col3 col4 col5 col6 col7 col8
A 5.0 1.0 0.0 0.0 5.0 5.0 0.0 0.0
C 0.0 0.0 1.0 4.0 2.0 0.0 6.0 1.0
G 1.0 1.0 6.0 3.0 0.0 1.0 0.0 0.0
T 1.0 5.0 0.0 0.0 0.0 1.0 1.0 6.0
You could use Series.value_counts by column:
print(df.iloc[:, 1:].apply(pd.Series.value_counts).fillna(0))
Output
1 2 3 4 5 6 7 8
A 5.0 1.0 0.0 0.0 5.0 5.0 0.0 0.0
C 0.0 0.0 1.0 4.0 2.0 0.0 6.0 1.0
G 1.0 1.0 6.0 3.0 0.0 1.0 0.0 0.0
T 1.0 5.0 0.0 0.0 0.0 1.0 1.0 6.0
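Since the final goal is a consensus string, here is a minimal follow-up sketch, assuming the result frame from the first answer: DataFrame.idxmax returns, for each column, the row label with the highest count (ties resolve to the first label).
consensus = result.idxmax()   # most frequent letter per column
print(' '.join(consensus))    # A T G C A A C T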
How do I print the number of rows dropped while executing the following code in Python:
df.dropna(inplace = True)
# making new data frame with dropped NA values
new_data = data.dropna(axis = 0, how ='any')
len(new_data)
Use:
import numpy as np
import pandas as pd

np.random.seed(2022)
df = pd.DataFrame(np.random.choice([0, np.nan, 1], size=(10, 3)))
print(df)
0 1 2
0 NaN 0.0 NaN
1 0.0 NaN NaN
2 0.0 0.0 1.0
3 0.0 0.0 NaN
4 NaN NaN 1.0
5 1.0 0.0 0.0
6 1.0 0.0 1.0
7 NaN 0.0 1.0
8 1.0 1.0 NaN
9 1.0 0.0 NaN
You can count the rows containing missing values before dropping them, using DataFrame.isna with DataFrame.any and sum:
count = df.isna().any(axis=1).sum()
df.dropna(inplace = True)
print (df)
0 1 2
2 0.0 0.0 1.0
5 1.0 0.0 0.0
6 1.0 0.0 1.0
print (count)
7
Or take the difference in DataFrame length before and after dropna:
orig = df.shape[0]
df.dropna(inplace = True)
count = orig - df.shape[0]
print (df)
0 1 2
2 0.0 0.0 1.0
5 1.0 0.0 0.0
6 1.0 0.0 1.0
print (count)
7
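If you need this in several places, here is a small reusable sketch of the same idea (the helper name dropna_report is hypothetical):
def dropna_report(df):
    # drop rows with any missing value and report how many were removed
    before = len(df)
    out = df.dropna()
    print(f'dropped {before - len(out)} of {before} rows')
    return out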
I have the following dataframe:
0 0 0
1 0 0
1 1 0
1 1 1
1 1 1
0 0 0
0 1 0
0 1 0
0 0 0
How do you get a dataframe which looks like this:
0 0 0
4 0 0
4 3 0
4 3 2
4 3 2
0 0 0
0 2 0
0 2 0
0 0 0
Thank you for your help.
You may need a for loop here, with transform: use cumsum on the zero positions to build a run key per column, then assign the counts back to your original df.
for x in df.columns:
    # key each nonzero run by the number of zeros seen so far, then write the run length back
    df.loc[df[x] != 0, x] = df[x].groupby(df[x].eq(0).cumsum()[df[x] != 0]).transform('count')
df
Out[229]:
1 2 3
0 0.0 0.0 0.0
1 4.0 0.0 0.0
2 4.0 3.0 0.0
3 4.0 3.0 2.0
4 4.0 3.0 2.0
5 0.0 0.0 0.0
6 0.0 2.0 0.0
7 0.0 2.0 0.0
8 0.0 0.0 0.0
Or, without a for loop:
# stack into a Series ordered column by column, so runs can be counted per column
s = df.stack().sort_index(level=1)
# group by (column, zeros seen so far); each group is a zero plus its following nonzero run, so subtract 1 for the leading zero
s2 = s.groupby([s.index.get_level_values(1), s.eq(0).cumsum()]).transform('count').sub(1).unstack()
# keep the original zeros and fill the nonzero positions with the run lengths
df = df.mask(df != 0).combine_first(s2)
df
Out[255]:
1 2 3
0 0.0 0.0 0.0
1 4.0 0.0 0.0
2 4.0 3.0 0.0
3 4.0 3.0 2.0
4 4.0 3.0 2.0
5 0.0 0.0 0.0
6 0.0 2.0 0.0
7 0.0 2.0 0.0
8 0.0 0.0 0.0
I have a df like this:
A B C D E
1 2 3 0 2
2 0 7 1 1
3 4 0 3 0
0 0 3 4 3
I am trying to replace each 0 with the mean of the values from the first row down to that 0's row, in the corresponding column.
My expected output is:
A B C D E
1.0 2.00 3.000000 0.0 2.0
2.0 1.00 7.000000 1.0 1.0
3.0 4.00 3.333333 3.0 1.0
1.5 1.75 3.000000 4.0 3.0
The main problem here is that, when a column contains multiple 0s, each mean must use the previously replaced values, so it is really hard to create a vectorized solution:
def f(x):
    for i, v in enumerate(x):
        if v == 0:
            # mean of everything up to and including this position, with earlier replacements applied
            x.iloc[i] = x.iloc[:i + 1].mean()
    return x
df1 = df.astype(float).apply(f)
print (df1)
A B C D E
0 1.0 2.00 3.000000 0.0 2.0
1 2.0 1.00 7.000000 1.0 1.0
2 3.0 4.00 3.333333 3.0 1.0
3 1.5 1.75 3.000000 4.0 3.0
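For reference, the sample frame can be reconstructed to test this and the variants below, a minimal sketch:
import numpy as np
import pandas as pd

df = pd.DataFrame({'A': [1, 2, 3, 0], 'B': [2, 0, 4, 0], 'C': [3, 7, 0, 3],
                   'D': [0, 1, 3, 4], 'E': [2, 1, 0, 3]})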
Better solution:
# store the row/column indices of the zero values in a helper DataFrame
a, b = np.where(df.values == 0)
df1 = pd.DataFrame({'rows': a, 'cols': b})
# the first row never needs a mean, so drop it
df1 = df1[df1['rows'] != 0]
print(df1)
rows cols
1 1 1
2 2 2
3 2 4
4 3 0
5 3 1
# loop over each row of the helper df and assign the means
for i in df1.itertuples():
    df.iloc[i.rows, i.cols] = df.iloc[:i.rows + 1, i.cols].mean()
print (df)
A B C D E
0 1.0 2.00 3.000000 0 2.0
1 2.0 1.00 7.000000 1 1.0
2 3.0 4.00 3.333333 3 1.0
3 1.5 1.75 3.000000 4 3.0
Another similar solution (computing means for all zero positions, including the first row):
for i, j in zip(*np.where(df.values == 0)):
    df.iloc[i, j] = df.iloc[:i + 1, j].mean()
print (df)
A B C D E
0 1.0 2.00 3.000000 0.0 2.0
1 2.0 1.00 7.000000 1.0 1.0
2 3.0 4.00 3.333333 3.0 1.0
3 1.5 1.75 3.000000 4.0 3.0
If I understand correctly:
def f(x):
    for z in range(x.size):
        if x[z] == 0:
            x[z] = np.mean(x[:z + 1])
    return x
df.astype(float).apply(f)
A B C D E
0 1.0 2.00 3.000000 0.0 2.0
1 2.0 1.00 7.000000 1.0 1.0
2 3.0 4.00 3.333333 3.0 1.0
3 1.5 1.75 3.000000 4.0 3.0
I am working with a large array of 1s and need to systematically zero out sections of it. The large array is composed of many smaller arrays; for each smaller array I need to replace its upper and lower triangles with 0s. For example, take an array with 5 sub-arrays indicated by the index value (all sub-arrays have the same number of columns):
0 1 2
0 1.0 1.0 1.0
1 1.0 1.0 1.0
1 1.0 1.0 1.0
2 1.0 1.0 1.0
2 1.0 1.0 1.0
2 1.0 1.0 1.0
3 1.0 1.0 1.0
3 1.0 1.0 1.0
3 1.0 1.0 1.0
3 1.0 1.0 1.0
4 1.0 1.0 1.0
4 1.0 1.0 1.0
4 1.0 1.0 1.0
4 1.0 1.0 1.0
4 1.0 1.0 1.0
I want each group of rows to be modified in its upper and lower triangle such that the resulting matrix is:
0 1 2
0 1.0 1.0 1.0
1 1.0 1.0 0.0
1 0.0 1.0 1.0
2 1.0 0.0 0.0
2 0.0 1.0 0.0
2 0.0 0.0 1.0
3 1.0 0.0 0.0
3 1.0 1.0 0.0
3 0.0 1.0 1.0
3 0.0 0.0 1.0
4 1.0 0.0 0.0
4 1.0 1.0 0.0
4 1.0 1.0 1.0
4 0.0 1.0 1.0
4 0.0 0.0 1.0
At the moment I am using only numpy to build this array, but I think I can speed it up using pandas grouping. In reality my dataset is very large, almost 500,000 rows long. The numpy code is below:
import numpy as np
candidateLengths = np.array([1,2,3,4,5])
centroidLength =3
smallPaths = [min(l,centroidLength) for l in candidateLengths]
# This is the k_values of zeros to delete. To be used in np.tri
k_vals = list(map(lambda smallPath: centroidLength - (smallPath), smallPaths))
maskArray = np.ones((np.sum(candidateLengths), centroidLength))
startPos = 0
endPos = 0
for canNo, canLen in enumerate(candidateLengths):
    a = np.ones((canLen, centroidLength))
    a *= np.tri(*a.shape, dtype=bool, k=k_vals[canNo])
    b = np.fliplr(np.flipud(a))
    c = a * b
    endPos = startPos + canLen
    maskArray[startPos:endPos, :] = c
    startPos = endPos
print(maskArray)
When I run this on my real dataset it takes nearly 5-7 seconds to execute. I think this is down to the massive for loop. How can I use pandas groupings to achieve higher speed? Thanks
New Answer
def tris(n, m):
    # banded mask: a lower triangle times its 180-degree rotation
    if n < m:
        a = np.tri(m, n, dtype=int).T
    else:
        a = np.tri(n, m, dtype=int)
    return a * a[::-1, ::-1]
# run-length encode the index: find where the index value changes, then take lengths
idx = np.append(df.index.values, -1)
w = np.append(-1, np.flatnonzero(idx[:-1] != idx[1:]))
c = np.diff(w)
# build one banded mask per sub-array and multiply it in
df * np.vstack([tris(n, 3) for n in c])
0 1 2
0 1.0 1.0 1.0
1 1.0 1.0 0.0
1 0.0 1.0 1.0
2 1.0 0.0 0.0
2 0.0 1.0 0.0
2 0.0 0.0 1.0
3 1.0 0.0 0.0
3 1.0 1.0 0.0
3 0.0 1.0 1.0
3 0.0 0.0 1.0
4 1.0 0.0 0.0
4 1.0 1.0 0.0
4 1.0 1.0 1.0
4 0.0 1.0 1.0
4 0.0 0.0 1.0
Old Answer
I define some helper triangle functions
def tris(n, m):
    if n < m:
        a = np.tri(m, n, dtype=int).T
    else:
        a = np.tri(n, m, dtype=int)
    return a * a[::-1, ::-1]

def tris_df(df):
    n, m = df.shape
    return pd.DataFrame(tris(n, m), df.index, df.columns)
Then
df * df.groupby(level=0, group_keys=False).apply(tris_df)
0 1 2
0 1.0 1.0 1.0
1 1.0 1.0 0.0
1 0.0 1.0 1.0
2 1.0 0.0 0.0
2 0.0 1.0 0.0
2 0.0 0.0 1.0
3 1.0 0.0 0.0
3 1.0 1.0 0.0
3 0.0 1.0 1.0
3 0.0 0.0 1.0
4 1.0 0.0 0.0
4 1.0 1.0 0.0
4 1.0 1.0 1.0
4 0.0 1.0 1.0
4 0.0 0.0 1.0
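For reference, the sample frame used by both answers can be reconstructed like this, a sketch assuming the index encodes sub-array membership as shown in the question:
import numpy as np
import pandas as pd

lengths = [1, 2, 3, 4, 5]
df = pd.DataFrame(np.ones((sum(lengths), 3)),
                  index=np.repeat(np.arange(len(lengths)), lengths))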
I have a dataframe like the one below:
age type days
1 a 1
2 b 3
2 b 4
3 a 5
4 b 2
6 c 1
7 f 0
7 d 4
10 e 2
14 a 1
First, I would like to bin by age:
age
[0~4]
age type days
1 a 1
2 b 3
2 b 4
3 a 5
4 b 2
Then sum and count the days, grouping by type:
sum count
a 6 2
b 9 3
c 0 0
d 0 0
e 0 0
f 0 0
Then I would like to apply this method to the other bins:
[5~9]
[10~14]
My desired result is below:
[0~4] [5~9] [10~14]
sum count sum count sum count
a 6 2 0 0 1 1
b 9 3 0 0 0 0
c 0 0 1 1 0 0
d 0 0 4 1 0 0
e 0 0 0 0 2 1
f 0 0 0 1 0 0
How can this be done?
This is very complicated for me.
Consider a pivot_table with pd.cut. This suits you if you do not care too much about column ordering, since count and sum are not paired together under each bin; with some manipulation (a sketch follows the output) you can change that ordering.
df['bin'] = pd.cut(df.age, [0,4,9,14])
pvtdf = df.pivot_table(index='type', columns=['bin'], values='days',
aggfunc=('count', 'sum')).fillna(0)
# count sum
# bin (0, 4] (4, 9] (9, 14] (0, 4] (4, 9] (9, 14]
# type
# a 2.0 0.0 1.0 6.0 0.0 1.0
# b 3.0 0.0 0.0 9.0 0.0 0.0
# c 0.0 1.0 0.0 0.0 1.0 0.0
# d 0.0 1.0 0.0 0.0 4.0 0.0
# e 0.0 0.0 1.0 0.0 0.0 2.0
# f 0.0 1.0 0.0 0.0 0.0 0.0
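As a sketch of the manipulation mentioned above, swapping the column levels of pvtdf pairs count and sum under each bin:
pvtdf = pvtdf.swaplevel(axis=1).sort_index(axis=1)  # columns become (bin, aggfunc)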
We'll use some stacking and groupby operations to get us to the desired output.
import io
import numpy as np
import pandas as pd

string_ = io.StringIO('''age type days
1 a 1
2 b 3
2 b 4
3 a 5
4 b 2
6 c 1
7 f 0
7 d 4
10 e 2
14 a 1''')
df = pd.read_csv(string_, sep=r'\s+')
df['age_bins'] = pd.cut(df['age'], [0, 4, 9, 14])
df_stacked = df.groupby(['age_bins', 'type']).agg({'days': np.sum,
                                                   'type': 'count'}).transpose().stack().fillna(0)
df_stacked.rename(index={'days': 'sum', 'type': 'count'}, inplace=True)
>>> df_stacked
age_bins (0, 4] (4, 9] (9, 14]
type
sum a 6.0 0.0 1.0
b 9.0 0.0 0.0
c 0.0 1.0 0.0
d 0.0 4.0 0.0
e 0.0 0.0 2.0
f 0.0 0.0 0.0
count a 2.0 0.0 1.0
b 3.0 0.0 0.0
c 0.0 1.0 0.0
d 0.0 1.0 0.0
e 0.0 0.0 1.0
f 0.0 1.0 0.0
This doesn't produce the exact output you listed, but it's similar, and I think it is easier to index and retrieve data from; for example, df_stacked.loc['sum'] returns the table of sums and df_stacked.loc[('sum', 'a')] a single row. Alternatively, you could use the following to get something closer to the desired output.
>>> df_stacked.unstack(level=0)
age_bins (0, 4] (4, 9] (9, 14]
count sum count sum count sum
type
a 2.0 6.0 0.0 0.0 1.0 1.0
b 3.0 9.0 0.0 0.0 0.0 0.0
c 0.0 0.0 1.0 1.0 0.0 0.0
d 0.0 0.0 1.0 4.0 0.0 0.0
e 0.0 0.0 0.0 0.0 1.0 2.0
f 0.0 0.0 1.0 0.0 0.0 0.0
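To match the desired layout exactly, with sum before count under each bin, a minimal sketch reordering the inner column level:
out = df_stacked.unstack(level=0)
out = out.reindex(['sum', 'count'], axis=1, level=1)  # pair sum before count under each bin
print(out)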