substitute all numbers in a matrix with equivalent letters - python

There is a huge matrix whose elements are numbers in the range 1 to 15. I want to transform it into a matrix whose elements are letters, such that 1 becomes "a", 2 becomes "b", and so on. As a simple example:
import pandas as pd
import numpy as np, numpy.random
numpy.random.seed(1)
A = pd.DataFrame (np.random.randint(1,16,10).reshape(2,5))
# A 0 1 2 3 4
# 0 6 12 13 9 10
# 1 12 6 1 1 2
The expected output is
# B 0 1 2 3 4
# 0 f l m i j
# 1 l f a a b
I can do it with a loop, but for a huge matrix that doesn't seem sensible. There should be a more Pythonic way to do it. In R, chartr is the function for such a replacement. For the numbers 1 to 9, it works like this: chartr("123456789", "ABCDEFGHI", A). What is the equivalent in Python?

You can use chr:
>>> import pandas as pd
>>> import numpy as np
>>> np.random.seed(1)
>>> df = pd.DataFrame(np.random.randint(1, 16, 10).reshape(2, 5))
>>> df
0 1 2 3 4
0 6 12 13 9 10
1 12 6 1 1 2
>>> df = df.applymap(lambda n: chr(n + 96))
>>> df
0 1 2 3 4
0 f l m i j
1 l f a a b

This is one way. If possible, I would advise against using lambda with apply or applymap in pandas, as these loop in Python and carry overhead.
import pandas as pd
import numpy as np
import string
np.random.seed(1)
A = pd.DataFrame(np.random.randint(1,16,10).reshape(2,5))
# 0 1 2 3 4
# 0 6 12 13 9 10
# 1 12 6 1 1 2
d = dict(enumerate(string.ascii_uppercase, 1))
A_mapped = pd.DataFrame(np.vectorize(d.get)(A.values))
# 0 1 2 3 4
# 0 F L M I J
# 1 L F A A B
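For a mapping this regular, plain NumPy integer indexing into an array of letters avoids any per-element Python call. This is only a minimal sketch, assuming every value lies in 1..26; the names letters and B are illustrative, and the input is the example matrix typed in literally rather than regenerated from the seed:

```python
import string

import numpy as np
import pandas as pd

# The example matrix from the question, written out literally.
A = pd.DataFrame([[6, 12, 13, 9, 10],
                  [12, 6, 1, 1, 2]])

# Lookup table: position 0 holds "a", position 1 holds "b", and so on.
letters = np.array(list(string.ascii_lowercase))

# Integer-array indexing maps the whole matrix at once.
# Subtract 1 because the values start at 1 while positions start at 0.
B = pd.DataFrame(letters[A.to_numpy() - 1], index=A.index, columns=A.columns)
print(B)
```

Swapping string.ascii_lowercase for string.ascii_uppercase would give the uppercase variant shown in the answer above.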

Related

Good ways to wrap around the indices for slicing in pandas data frame

I want to slice the data frame by rows or columns using iloc, while wrapping around out-of-bounds indices. Here is an example:
import pandas as pd
df = pd.DataFrame([[1,2,3], [4,5,6], [7,8,9]],columns=['a', 'b', 'c'])
# Slice rows 2 to 4, although the data frame has only 3 rows
print(df.iloc[2:4,:])
Data frame:
a b c
0 1 2 3
1 4 5 6
2 7 8 9
The output will be:
a b c
2 7 8 9
But I want to wrap around the out-of-bounds index, like this:
a b c
2 7 8 9
0 1 2 3
In numpy, it is possible to use numpy.take to wrap around out-of-bounds indices when slicing:
import numpy as np
array = np.array([[1,2,3], [4,5,6], [7,8,9]])
print(array.take(range(2,4) , axis = 0, mode='wrap'))
The output is:
[[7 8 9]
[1 2 3]]
A possible solution for wrapping around in pandas is to use numpy.take:
import pandas as pd
import numpy as np
df = pd.DataFrame([[1,2,3], [4,5,6], [7,8,9]],columns=['a', 'b', 'c'])
# Get the integer indices of the dataframe
row_indices = np.arange(df.shape[0])
# Wrap the slice explicitly
wrap_slice = row_indices.take(range(2,4),axis = 0, mode='wrap')
print(df.iloc[wrap_slice, :])
This gives the output I want:
a b c
2 7 8 9
0 1 2 3
I looked into pandas.DataFrame.take and there is no "wrap" mode. What is a good and easy way to solve this problem? Thank you very much!
Let's try using np.roll:
df.reindex(np.roll(df.index, shift=-2)[0:2])
Output:
a b c
2 7 8 9
0 1 2 3
And, to make it a little more generic:
startidx = 2
endidx = 4
df.iloc[np.roll(df.index, shift=-1*startidx)[0:endidx-startidx]]
You could use the remainder (modulo) operation:
import numpy as np
start_id = 2
end_id = 4
idx = np.arange(start_id, end_id, 1)%len(df)
df.iloc[idx]
# a b c
#2 7 8 9
#0 1 2 3
This method actually allows you to loop around multiple times:
start_id = 2
end_id = 10
idx = np.arange(start_id, end_id, 1)%len(df)
df.iloc[idx]
# a b c
#2 7 8 9
#0 1 2 3
#1 4 5 6
#2 7 8 9
#0 1 2 3
#1 4 5 6
#2 7 8 9
#0 1 2 3
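The modulo idea can be wrapped in a small helper that works along either axis. This is only a sketch: wrap_iloc is a hypothetical name, not a pandas function, and it assumes the data frame is non-empty along the chosen axis:

```python
import numpy as np
import pandas as pd


def wrap_iloc(df, start, stop, axis=0):
    """Select positions start..stop-1 along `axis`, wrapping past the end."""
    # Reduce every requested position modulo the axis length.
    positions = np.arange(start, stop) % df.shape[axis]
    return df.iloc[positions] if axis == 0 else df.iloc[:, positions]


df = pd.DataFrame([[1, 2, 3], [4, 5, 6], [7, 8, 9]], columns=['a', 'b', 'c'])
rows = wrap_iloc(df, 2, 4)          # rows at positions 2 and 0
cols = wrap_iloc(df, 2, 4, axis=1)  # columns 'c' and 'a'
print(rows)
print(cols)
```

Because it works on positions rather than labels, the helper does not depend on the data frame having a default integer index.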

Pandas DataFrame: resampling along integer index / grouping by groups of n elements

I know about pandas resampling functions using a DateTimeIndex.
But how can I easily resample/group along an integer index?
The following code illustrates the problem and works:
import numpy as np
import pandas as pd
df = pd.DataFrame(np.random.randint(5, size=(10, 2)), columns=list('AB'))
print(df)
A B
0 3 2
1 1 1
2 0 1
3 2 3
4 2 0
5 4 0
6 3 1
7 3 4
8 0 2
9 4 4
# sum of n consecutive elements
n = 3
tuples = [(i, i+n-1) for i in range(0, len(df.index), n)]
df_new = pd.concat([df.loc[i[0]:i[1]].sum() for i in tuples], axis=1).T
print(df_new)
A B
0 4 4
1 8 3
2 6 7
3 4 4
But isn't there a more elegant way to accomplish this?
The code seems a bit heavy-handed to me..
Thanks in advance!
You can floor-divide the index and aggregate with some function:
df1 = df.groupby(df.index // n).sum()
If the index is not the default (integer, unique), group by a floor-divided numpy.arange built from the length of the DataFrame:
df1 = df.groupby(np.arange(len(df)) // n).sum()
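To illustrate the non-default-index case, here is a small sketch that reuses the ten rows printed in the question but gives them a string index (the index labels are an assumption made for the example):

```python
import numpy as np
import pandas as pd

# The ten rows printed in the question, relabelled with a string index.
df = pd.DataFrame(
    {'A': [3, 1, 0, 2, 2, 4, 3, 3, 0, 4],
     'B': [2, 1, 1, 3, 0, 0, 1, 4, 2, 4]},
    index=list('abcdefghij'),
)
n = 3

# Positional group ids: 0,0,0,1,1,1,2,2,2,3 (the last group is shorter).
groups = np.arange(len(df)) // n
df1 = df.groupby(groups).sum()
print(df1)
```

The sums match the df_new output in the question, since the grouping depends only on row position.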
You can group by the integer division of the index by n, i.e.
df.groupby(lambda i: i//n).sum()
Here is the code:
import numpy as np
import pandas as pd
n=3
df = pd.DataFrame(np.random.randint(5, size=(10, 2)), columns=list('AB'))
print('df:')
print(df)
res = df.groupby(lambda i: i//n).sum()
print('using groupby:')
print(res)
tuples = [(i, i+n-1) for i in range(0, len(df.index), n)]
df_new = pd.concat([df.loc[i[0]:i[1]].sum() for i in tuples], axis=1).T
print('using your method:')
print(df_new)
and the output:
df:
A B
0 1 0
1 3 0
2 1 1
3 0 4
4 3 4
5 0 1
6 0 4
7 4 0
8 0 2
9 2 2
using groupby:
A B
0 5 1
1 3 9
2 4 6
3 2 2
using your method:
A B
0 5 1
1 3 9
2 4 6
3 2 2

Pandas : Sum multiple columns and get results in multiple columns

I have a "sample.txt" like this.
idx A B C D cat
J 1 2 3 1 x
K 4 5 6 2 x
L 7 8 9 3 y
M 1 2 3 4 y
N 4 5 6 5 z
O 7 8 9 6 z
With this dataset, I want to get sums by row and by column.
By row, it is not a big deal.
I got a result like this.
### MY CODE ###
import pandas as pd
df = pd.read_csv('sample.txt',sep="\t",index_col='idx')
df.info()
df2 = df.groupby('cat').sum()
print( df2 )
The result is like this.
A B C D
cat
x 5 7 9 3
y 8 10 12 7
z 11 13 15 11
But I don't know how to write code to get a result like this.
(simply add the values in columns A and B, as well as those in columns C and D)
AB CD
J 3 4
K 9 8
L 15 12
M 3 7
N 9 11
O 15 15
Could anybody help me write the code?
By the way, I don't want to do it like this.
(It looks too dull, but if it is the only way, I'll accept it.)
df2 = df['A'] + df['B']
df3 = df['C'] + df['D']
df = pd.DataFrame([df2,df3],index=['AB','CD']).transpose()
print( df )
When you pass a dictionary or callable to groupby, it gets applied to an axis. I specified axis=1, which refers to the columns.
d = dict(A='AB', B='AB', C='CD', D='CD')
df.groupby(d, axis=1).sum()
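Recent pandas releases deprecate the axis=1 argument of groupby, so the same dictionary grouping can instead be sketched by transposing, grouping the rows, and transposing back. The data frame below is typed in from the sample in the question; treat this as one possible spelling rather than the only one:

```python
import pandas as pd

# The sample data from the question, built inline instead of read from sample.txt.
df = pd.DataFrame(
    {'A': [1, 4, 7, 1, 4, 7],
     'B': [2, 5, 8, 2, 5, 8],
     'C': [3, 6, 9, 3, 6, 9],
     'D': [1, 2, 3, 4, 5, 6],
     'cat': list('xxyyzz')},
    index=pd.Index(list('JKLMNO'), name='idx'),
)

# Map each numeric column to the name of the group it should be summed into.
d = dict(A='AB', B='AB', C='CD', D='CD')

# Keep only the mapped columns, turn them into rows, group those rows
# by the mapping, sum, and flip the result back.
out = df[list(d)].T.groupby(d).sum().T
print(out)
```

Selecting df[list(d)] first leaves the non-numeric cat column out of the transpose, which keeps the values as integers.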
Use concat with sum:
df = df.set_index('idx')
df = pd.concat([df[['A', 'B']].sum(1), df[['C', 'D']].sum(1)], axis=1, keys=['AB','CD'])
print( df)
AB CD
idx
J 3 4
K 9 8
L 15 12
M 3 7
N 9 11
O 15 15
Does this do what you need? By using axis=1 with DataFrame.apply, you can use the data that you want in a row to construct a new column. Then you can drop the columns that you don't want anymore.
In [1]: import pandas as pd
In [5]: df = pd.DataFrame(columns=['A', 'B', 'C', 'D'], data=[[1, 2, 3, 4], [1, 2, 3, 4]])
In [6]: df
Out[6]:
A B C D
0 1 2 3 4
1 1 2 3 4
In [7]: df['CD'] = df.apply(lambda x: x['C'] + x['D'], axis=1)
In [8]: df
Out[8]:
A B C D CD
0 1 2 3 4 7
1 1 2 3 4 7
In [13]: df.drop(['C', 'D'], axis=1)
Out[13]:
A B CD
0 1 2 7
1 1 2 7

(Python, DataFrame): Add a Column and insert the nth smallest value in the row

How do I find the nth smallest number in a row, within a DataFrame, and add that value as an entry in a new column (because I would ultimately like to export the data).
Example Data
Setup
np.random.seed([3,14159])
df = pd.DataFrame(np.random.randint(10, size=(4, 5)), columns=list('ABCDE'))
A B C D E
0 4 8 1 1 9
1 2 8 1 4 2
2 8 2 8 4 9
3 4 3 4 1 5
In all of the following solutions, I assume n = 3
Solution 1
function prt below
Use np.partition to place smallest to the left of a partition and the largest to the right. Then take all to the left and find the max.
df.assign(nth=np.partition(df.values, 3, axis=1)[:, :3].max(1))
A B C D E nth
0 4 8 1 1 9 4
1 2 8 1 4 2 2
2 8 2 8 4 9 8
3 4 3 4 1 5 4
Solution 2
function srt below
More intuitive, but with a higher time complexity, using np.sort
df.assign(nth=np.sort(df.values, axis=1)[:, 2])
A B C D E nth
0 4 8 1 1 9 4
1 2 8 1 4 2 2
2 8 2 8 4 9 8
3 4 3 4 1 5 4
Solution 3
function rnk below
Using pd.DataFrame.rank
Concise version that upcasts to float
df.assign(nth=df.where(df.rank(1, method='first').eq(3)).stack().values)
A B C D E nth
0 4 8 1 1 9 4.0
1 2 8 1 4 2 2.0
2 8 2 8 4 9 8.0
3 4 3 4 1 5 4.0
Solution 4
function whr below
Using np.where and pd.DataFrame.rank
i, j = np.where(df.rank(1, method='first') == 3)
df.assign(nth=df.values[i, j])
A B C D E nth
0 4 8 1 1 9 4
1 2 8 1 4 2 2
2 8 2 8 4 9 8
3 4 3 4 1 5 4
Timing
Notice that srt is quickest at first, though comparable to prt; for larger numbers of columns, the more efficient algorithm of prt takes over.
res.plot(loglog=True)
from timeit import timeit
prt = lambda df, n: df.assign(nth=np.partition(df.values, n, axis=1)[:, :n].max(1))
srt = lambda df, n: df.assign(nth=np.sort(df.values, axis=1)[:, n - 1])
rnk = lambda df, n: df.assign(nth=df.where(df.rank(1, method='first').eq(n)).stack().values)
def whr(df, n):
    i, j = np.where(df.rank(1, method='first').values == n)
    return df.assign(nth=df.values[i, j])
res = pd.DataFrame(
    index=[10, 30, 100, 300, 1000, 3000, 10000],
    columns='prt srt rnk whr'.split(),
    dtype=float
)
# Time each function on frames with a growing number of columns
for i in res.index:
    num_rows = int(np.log(i))
    d = pd.DataFrame(np.random.rand(num_rows, i))
    for j in res.columns:
        stmt = '{}(d, 3)'.format(j)
        setp = 'from __main__ import d, {}'.format(j)
        res.at[i, j] = timeit(stmt, setp, number=100)
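As a compact variant of the partition idea, the n-th smallest value can be read directly from position n - 1 of the partitioned row. The sketch below types in the example frame literally instead of relying on the seed; nth_smallest is a hypothetical helper name, and it assumes 1 <= n <= the number of columns:

```python
import numpy as np
import pandas as pd

# The example frame from the setup, written out literally.
df = pd.DataFrame(
    [[4, 8, 1, 1, 9],
     [2, 8, 1, 4, 2],
     [8, 2, 8, 4, 9],
     [4, 3, 4, 1, 5]],
    columns=list('ABCDE'),
)


def nth_smallest(frame, n):
    """Return the n-th smallest value (1-based) of each row as a Series."""
    values = frame.to_numpy()
    # After partitioning around position n - 1, that position holds the
    # value it would have in a fully sorted row.
    part = np.partition(values, n - 1, axis=1)
    return pd.Series(part[:, n - 1], index=frame.index)


result = df.assign(nth=nth_smallest(df, 3))
print(result)
```

Returning a Series aligned on frame.index keeps the new column attached to the right rows even when the index is not the default one.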
You can do this as follows:
df.assign(nth=df.apply(lambda x: np.partition(x, nth)[nth], axis='columns'))
Example:
In[72]: df = pd.DataFrame(np.random.rand(3, 3), index=list('abc'), columns=[1, 2, 3])
In[73]: df
Out[73]:
1 2 3
a 0.436730 0.653242 0.843014
b 0.643496 0.854859 0.531652
c 0.831672 0.575336 0.517944
In[74]: df.assign(nth=df.apply(lambda x: np.partition(x, 1)[1], axis='columns'))
Out[74]:
1 2 3 nth
a 0.436730 0.653242 0.843014 0.653242
b 0.643496 0.854859 0.531652 0.643496
c 0.831672 0.575336 0.517944 0.575336
Here is a method that finds the nth smallest item in a list:
def find_nth_in_list(items, n):
    return sorted(items)[n-1]
The usage:
items = [10,5,7,9,8,4,6,2,1,3]
print(find_nth_in_list(items, 2))
Output:
2
You can give the row items as a list to this function.
EDIT
You can find rows with this function:
# Returns all rows as a list
def find_rows(df):
    rows = []
    for row in df.iterrows():
        index, data = row
        rows.append(data.tolist())
    return rows
Example usage:
rows = find_rows(df) #all rows as a list
third_smallest = find_nth_in_list(rows[2], 3) #3rd row, 3rd smallest item
Generate some random data:
dd=pd.DataFrame(data=np.random.rand(7,3))
Find the minimum value per row using numpy:
dd['minPerRow']=dd.apply(np.min,axis=1)
Export the results:
dd['minPerRow'].to_csv('file.csv')

pandas groupby operation with missing data

In a pandas dataframe I have a column that looks like:
0 M
1 E
2 L
3 M.1
4 M.2
5 M.3
6 E.1
7 E.2
8 E.3
9 E.4
10 L.1
11 L.2
12 M.1.a
13 M.1.b
14 M.1.c
15 M.2.a
16 M.3.a
17 E.1.a
18 E.1.b
19 E.1.c
20 E.2.a
21 E.3.a
22 E.3.b
23 E.4.a
I need to group all the values whose first element is E, M, or L and then, for each group, create a subgroup indexed by 1, 2, or 3 that contains a record for each lowercase letter (a, b, c, ...).
Ideally the solution should work for any number of concatenated levels (in this case there are 3 levels, e.g. A.1.a).
0 1 2
E 1 a
    b
    c
  2 a
  3 a
    b
  4 a
L 1
  2
M 1 a
    b
    c
  2 a
  3 a
I tried with:
df.groupby([0,1,2]).count()
But the result is missing the L level because it has no records at the last sub-level.
A workaround is to add a dummy variable and then remove it, like:
df[2][(df[0]=='L') & (df[2].isnull()) & (df[1].notnull())]='x'
df = df.replace(np.nan,' ', regex=True)
df.sort_values(0, ascending=False, inplace=True)
newdf = df.groupby([0,1,2]).count()
which gives:
0 1 2
E 1 a
    b
    c
  2 a
  3 a
    b
  4 a
L 1 x
  2 x
M 1 a
    b
    c
  2 a
  3 a
I then deal with the dummy entry x later in my code.
How can I avoid this hackish way of using groupby?
Assuming the column under consideration is represented by s, we can:
Split on the "." delimiter with expand=True to produce an expanded DF.
fnc: checks whether all elements of the grouped frame are None; if so, it replaces them with a dummy entry "" via a list comprehension. A Series constructor is then called on the resulting list, and any remaining None values are removed using dropna.
Perform a groupby on column names 0 and 1 and apply fnc to column 2.
split_str = s.str.split(".", expand=True)
fnc = lambda g: pd.Series(["" if all(x is None for x in g) else x for x in g]).dropna()
split_str.groupby([0, 1])[2].apply(fnc)
produces:
0  1
E  1  1    a
      2    b
      3    c
   2  1    a
   3  1    a
      2    b
   4  1    a
L  1  0
   2  0
M  1  1    a
      2    b
      3    c
   2  1    a
   3  1    a
Name: 2, dtype: object
To obtain a flattened DF, reset the index levels that were used to group the DF:
split_str.groupby([0, 1])[2].apply(fnc).reset_index(level=[0, 1]).reset_index(drop=True)
produces:
0 1 2
0 E 1 a
1 E 1 b
2 E 1 c
3 E 2 a
4 E 3 a
5 E 3 b
6 E 4 a
7 L 1
8 L 2
9 M 1 a
10 M 1 b
11 M 1 c
12 M 2 a
13 M 3 a
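A different route to the same flattened table, offered only as a sketch and not as the method above: keep just the "leaf" labels (those that no other label extends) and split them. The names has_child, leaves and flat are illustrative, and the pairwise prefix check is quadratic in the number of labels, so it assumes a modest column size:

```python
import pandas as pd

# The column from the question, typed in literally.
s = pd.Series([
    'M', 'E', 'L', 'M.1', 'M.2', 'M.3', 'E.1', 'E.2', 'E.3', 'E.4',
    'L.1', 'L.2', 'M.1.a', 'M.1.b', 'M.1.c', 'M.2.a', 'M.3.a',
    'E.1.a', 'E.1.b', 'E.1.c', 'E.2.a', 'E.3.a', 'E.3.b', 'E.4.a',
])

# A label has a child when some other label starts with "<label>.".
has_child = s.apply(lambda label: s.str.startswith(label + '.').any()).astype(bool)
leaves = s[~has_child]

# Split the leaves into levels; shorter labels get "" in the missing levels.
flat = (
    leaves.str.split('.', expand=True)
    .fillna('')
    .sort_values([0, 1, 2])
    .reset_index(drop=True)
)
print(flat)
```

Because nothing is hard-coded to three levels except the sort keys, deeper labels would only need the sort_values list extended to the extra columns.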
Maybe you have to find a way with regex.
import pandas as pd
df = pd.read_clipboard(header=None).iloc[:, 1]
df2 = df.str.extract(r'([A-Z])\.?([0-9]?)\.?([a-z]?)')
print(df2.set_index([0,1]))
and the result is,
2
0 1
M
E
L
M 1
2
3
E 1
2
3
4
L 1
2
M 1 a
1 b
1 c
2 a
3 a
E 1 a
1 b
1 c
2 a
3 a
3 b
4 a
