How to convert a matrix to 3D arrays or vice versa? - python

I want to transform a matrix into a 3D array, or transform a 3D array back into a matrix. How do I input the data and perform the transformation in Python?
I've searched in many places, but there is no answer. Please help me.
matrix a:
a b c
d 1 2 3
e 2 3 4
f 4 3 2
array b:
a d 1
a e 2
a f 4
b d 2
b e 3
b f 3
c d 3
c e 4
c f 2
Can I use stack() to achieve my goal?
Like in: Python pandas - pd.melt a dataframe with datetime index results in NaN

So your data is not actually 3-dimensional, but 2-dimensional. You are essentially trying to unpivot your 2D data, an operation often called melt. Your best option is to load the data into a pandas DataFrame.
import pandas as pd
df = pd.DataFrame([['d',1,2,3],['e',2,3,4],['f',4,3,2]], columns=['idx','a','b','c'])
df
# returns:
idx a b c
0 d 1 2 3
1 e 2 3 4
2 f 4 3 2
pd.melt(df, id_vars='idx', value_vars=list('abc'))
# returns:
idx variable value
0 d a 1
1 e a 2
2 f a 4
3 d b 2
4 e b 3
5 f b 3
6 d c 3
7 e c 4
8 f c 2
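Since you asked about stack(): yes, it can do the same unpivot. A minimal sketch, assuming the df built above; transposing first reproduces the column-major order of your desired array b (the column names col/row/value are my own choice):
stacked = df.set_index('idx').T.stack()  # MultiIndex (column, row) -> value
out = stacked.reset_index()
out.columns = ['col', 'row', 'value']
out
# returns:
#   col row  value
# 0   a   d      1
# 1   a   e      2
# 2   a   f      4
# 3   b   d      2
# ...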

I'm not very familiar with the pandas library, but here is a rough solution using the Python standard library:
#!/usr/bin/env python2
"""
Convert a matrix to 2D arrays and vice versa
http://stackoverflow.com/questions/43289673
"""
from collections import OrderedDict

TEST_MATRIX = """\
a b c
d 1 2 3
e 2 3 4
f 4 3 2
"""

def parse_matrix(matrix_string):
    """Parse a matrix string and return a list of tuples representing the data"""
    matrix_string = matrix_string.strip()
    list_of_lines = matrix_string.splitlines()
    parsed_list = []
    y_headers = list_of_lines[0].split()
    data_rows = [i.split() for i in list_of_lines[1:]]
    for y in y_headers:
        for row in data_rows:
            parsed_list.append((y, row[0], row[y_headers.index(y) + 1]))
    return parsed_list

def convert_to_matrix(data):
    """
    Convert a parsed matrix (in the form of a list of tuples) to a matrix
    (string)
    """
    # Plain sets would mess up the ordering:
    # y_headers = set(i[0] for i in data)
    # x_headers = set(i[1] for i in data)
    y_headers = OrderedDict()
    x_headers = OrderedDict()
    for y, x, _ in data:
        y_headers.setdefault(y)
        x_headers.setdefault(x)
    matrix_string = " " + " ".join(y_headers)  # header row
    for x in x_headers:
        row = [x]
        for y in y_headers:
            val = [i[-1] for i in data if i[0] == y and i[1] == x][0]
            row.append(val)
        row_string = " ".join(row)
        matrix_string += "\n" + row_string
    return matrix_string

def main():
    print("Test matrix:")
    print(TEST_MATRIX)
    # parse the test matrix string to a list of tuples
    parsed_test_matrix = parse_matrix(TEST_MATRIX)
    # print the parsed matrix
    print("Parsed matrix:")
    for row in parsed_test_matrix:
        print(" ".join(row))
    print("")
    # convert the parsed matrix back to the original matrix and print it
    print("Convert parsed matrix back to matrix:")
    print(convert_to_matrix(parsed_test_matrix))

if __name__ == "__main__":
    main()

Related

Is there a function to write certain values of a dataframe to a .txt file in Python?

I have a dataframe as follows:
Index A B C D E F
1 0 0 C 0 E 0
2 A 0 0 0 0 F
3 0 0 0 0 E 0
4 0 0 C D 0 0
5 A B 0 0 0 0
Basically I would like to write the dataframe to a txt file, such that every row consists of the index followed by the column names of only the non-zero entries.
For example:
txt file
1 C E
2 A F
3 E
4 C D
5 A B
The dataset is quite big, about 1k rows, 16k columns. Is there any way I can do this using a function in Pandas?
Take a matrix-vector multiplication between the boolean matrix generated by asking "is this entry "0" or not?" and the column names of the dataframe, then write the result to a text file with to_csv (thanks to @Andreas' answer!):
df.ne("0").dot(df.columns + " ").str.rstrip().to_csv("text_file.txt")
where we right-strip the trailing space that the added " " leaves on the last entry.
If you don't want the name Index appearing in the text file, you can chain a rename_axis(index=None) to get rid of it, i.e.,
df.ne("0").dot(df.columns + " ").str.rstrip().rename_axis(index=None)
and then to_csv as above.
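To see why the dot product works, here is a self-contained sketch on the sample frame (rebuilding the data by hand; the index values 1-5 are taken from the example):
import pandas as pd

df = pd.DataFrame(
    [["0", "0", "C", "0", "E", "0"],
     ["A", "0", "0", "0", "0", "F"],
     ["0", "0", "0", "0", "E", "0"],
     ["0", "0", "C", "D", "0", "0"],
     ["A", "B", "0", "0", "0", "0"]],
    columns=list("ABCDEF"), index=range(1, 6))

mask = df.ne("0")                   # boolean matrix: True where the entry is not "0"
lines = mask.dot(df.columns + " ")  # each True entry contributes its column name, each False entry ""
print(lines.str.rstrip())
# 1    C E
# 2    A F
# 3      E
# 4    C D
# 5    A B
# dtype: object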
You can try this (replace '0' with 0 if those are numeric zeros instead of strings):
# Credits to Pygirl, who made the code even better.
import numpy as np

df.set_index('Index', inplace=True)
df = df.replace('0', np.nan)
df.stack().groupby(level=0).apply(list)
# Out[79]:
# Index
# 1    [C, E]
# 2    [A, F]
# 3       [E]
# 4    [C, D]
# 5    [A, B]
# dtype: object
For writing the result to text, you can use pandas as well:
df.to_csv('your_text_file.txt')
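If you want the exact space-separated lines from the question rather than lists, a small variation of the same idea (a sketch, assuming the same df after the set_index and replace above) joins each row's surviving values:
df.stack().groupby(level=0).apply(' '.join)
# Index
# 1    C E
# 2    A F
# 3      E
# 4    C D
# 5    A B
# dtype: object
and the result can go straight to to_csv as above.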
You could replace string '0' with empty string '', then do some string-list-join manipulation to get the final results. Finally, append each line to a text file. See code:
import pandas as pd

df = pd.DataFrame([
    ['0', '0', 'C', '0', 'E', '0'],
    ['A', '0', '0', '0', '0', 'F'],
    ['0', '0', '0', '0', 'E', '0'],
    ['0', '0', 'C', 'D', '0', '0'],
    ['A', 'B', '0', '0', '0', '0']], columns=['A', 'B', 'C', 'D', 'E', 'F']
)
df = df.replace('0', '')
logfile = open('test.txt', 'a')
for i in range(len(df)):
    temp = ''.join(list(df.loc[i, :]))
    logfile.write(str(i + 1) + ' ' + ' '.join(list(temp)) + '\n')
logfile.close()
Output test.txt
1 C E
2 A F
3 E
4 C D
5 A B

How to print the row and column of the value you're looking for in a dataframe

So I made this dataframe:
import numpy as np
import pandas as pd

alp = "abcdefghijklmnopqrstuvwxyz0123456789"
s = "carl"
for i in s:
    alp = alp.replace(i, "")
jaa = s + alp
x = list(jaa)
array = np.array(x)
re = np.reshape(array, (6, 6))
dt = pd.DataFrame(re)
dt.columns = [1, 2, 3, 4, 5, 6]
dt.index = [1, 2, 3, 4, 5, 6]
dt
1 2 3 4 5 6
1 c a r l b d
2 e f g h i j
3 k m n o p q
4 s t u v w x
5 y z 0 1 2 3
6 4 5 6 7 8 9
I want to search for a value and print its row (index) and column.
For example, for 'h' the output I want is 2, 4.
Is there any way to get that output?
row, col = np.where(dt == "h")
print(dt.index[row[0]], dt.columns[col[0]])
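Note that np.where returns every match, and the indexing above takes only the first one. If the value can occur more than once, here is a small sketch that prints all positions:
# print every (row label, column label) where the value occurs
for r, c in zip(*np.where(dt == "h")):
    print(dt.index[r], dt.columns[c])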

pandas convert text feature to numeric value

I can convert all text features in a pandas dataframe by casting to 'category' using the df.astype() method as below. However, I find categories hard to work with (e.g. for plotting data) and would prefer to create a new column of integers.
#convert all objects to categories
object_types = dataset.select_dtypes(include=['O'])
for col in object_types:
    dataset['{0}_category'.format(col)] = dataset[col].astype('category')
I can convert the text to integers using this hack:
#convert all objects to int values
object_types = dataset.select_dtypes(include=['O'])
new_cols = {}
for col in object_types:
    data_set = set(dataset[col].tolist())
    data_indexed = {}
    for i, item in enumerate(data_set):
        data_indexed[item] = i
    new_list = []
    for item in dataset[col].tolist():
        new_list.append(data_indexed[item])
    new_cols[col] = new_list
for key, val in new_cols.items():
    dataset['{0}_int_value'.format(key)] = val
But is there a better (or existing) way to do the same?
I would use the factorize method, which is designed for this particular task:
In [90]: x
Out[90]:
A B
9 c z
10 c z
4 b x
5 b y
1 a w
7 b z
In [91]: x.apply(lambda col: pd.factorize(col, sort=True)[0])
Out[91]:
A B
9 2 3
10 2 3
4 1 1
5 1 2
1 0 0
7 1 3
or:
In [92]: x.apply(lambda col: pd.factorize(col)[0])
Out[92]:
A B
9 0 0
10 0 0
4 1 1
5 1 2
1 2 3
7 1 0
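Another pandas-native option, as a sketch: astype('category') assigns codes in sorted category order, so the categorical codes match factorize(col, sort=True)[0]:
# integer codes straight from the categorical accessor
x.apply(lambda col: col.astype('category').cat.codes)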
Consider this df:
df = pd.DataFrame(dict(A=list('aaaabbbbcccc'),
                       B=list('wwxxxyyzzzzz')))
df
You can convert to integers like this:
def intify(s):
    u = np.unique(s)
    i = np.arange(len(u))
    return s.map(dict(zip(u, i)))
or with a shorter version:
def intify(s):
    u = np.unique(s)
    return s.map({k: i for i, k in enumerate(u)})
df.apply(intify)
Or in a single line:
df.apply(lambda s: s.map({k: i for i, k in enumerate(s.unique())}))

how do I card shuffle a pandas series quickly

Suppose I have the pd.Series
import pandas as pd
import numpy as np
s = pd.Series(np.arange(10), list('abcdefghij'))
I'd like to "shuffle" this series like a deck of cards by interweaving the top half with the bottom half.
I'd expect results like this
a 0
f 5
b 1
g 6
c 2
h 7
d 3
i 8
e 4
j 9
dtype: int32
Conclusions
final function
def perfect_shuffle(s):
    n = s.values.shape[0]  # get the length of s
    l = (n + 1) // 2 * 2   # round n up to an even number
    # use the even number to reshape, and keep only the first n after ravel
    a = np.arange(l).reshape(2, -1).T.ravel()[:n]
    # construct the new series, slicing both values and index
    return pd.Series(s.values[a], s.index.values[a])
demonstration
s = pd.Series(np.arange(11), list('abcdefghijk'))
print(perfect_shuffle(s))
a 0
g 6
b 1
h 7
c 2
i 8
d 3
j 9
e 4
k 10
f 5
dtype: int64
order='F' vs T
I had suggested using T.ravel() as opposed to ravel(order='F').
After investigating, it hardly matters, but ravel(order='F') is better for larger arrays.
d = pd.DataFrame(dict(T=[], R=[]))
for n in np.power(10, np.arange(1, 8)):
    a = np.arange(n).reshape(2, -1)
    stamp = pd.Timestamp.now()
    for _ in range(100):
        a.ravel(order='F')
    d.loc[n, 'R'] = (pd.Timestamp.now() - stamp).total_seconds()
    stamp = pd.Timestamp.now()
    for _ in range(100):
        a.T.ravel()
    d.loc[n, 'T'] = (pd.Timestamp.now() - stamp).total_seconds()

d
d.plot()
Thanks unutbu and Warren Weckesser
In the special case where the length of the Series is even, you can do a perfect shuffle by reshaping its values into two rows and then using ravel(order='F') to read the items off in Fortran order:
In [12]: pd.Series(s.values.reshape(2,-1).ravel(order='F'), s.index)
Out[12]:
a 0
b 5
c 1
d 6
e 2
f 7
g 3
h 8
i 4
j 9
dtype: int64
Fortran order makes the left-most axis increment fastest. So in a 2D array the
values are read off by going down the rows of one column before progressing to
the next column. This has the effect of interleaving the values, compared to the
usual C-order.
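A tiny illustration of the difference, using only numpy:
import numpy as np

a = np.arange(10).reshape(2, -1)
# a = [[0 1 2 3 4]
#      [5 6 7 8 9]]
a.ravel()           # C order:       array([0, 1, 2, 3, 4, 5, 6, 7, 8, 9])
a.ravel(order='F')  # Fortran order: array([0, 5, 1, 6, 2, 7, 3, 8, 4, 9])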
In the general case where the length of the Series could be odd,
perhaps the fastest way is to reassign the values using shifted slices:
import numpy as np
import pandas as pd
def perfect_shuffle(ser):
    arr = ser.values
    result = np.empty_like(arr)
    N = (len(arr) + 1) // 2
    result[::2] = arr[:N]    # first half goes to the even positions
    result[1::2] = arr[N:]   # second half fills the odd positions
    result = pd.Series(result, index=ser.index)
    return result
s = pd.Series(np.arange(11), list('abcdefghijk'))
print(perfect_shuffle(s))
yields
a 0
b 6
c 1
d 7
e 2
f 8
g 3
h 9
i 4
j 10
k 5
dtype: int64
To add to @unutbu's answer, some benchmarks:
>>> import timeit
>>> import numpy as np
>>>
>>> setup = '''
... import pandas as pd
... import numpy as np
... s = pd.Series(list('abcdefghij'), np.arange(10))
... '''
>>>
>>> funcs = ['s[np.random.permutation(s.index)]', "pd.Series(s.values.reshape(2,-1).ravel(order='F'), s.index)",
... 's.iloc[np.random.permutation(s.index)]', "s.values.reshape(-1, 2, order='F').ravel()"]
>>>
>>> for f in funcs:
... print(f)
... print(min(timeit.Timer(f, setup).repeat(3, 50)))
...
s[np.random.permutation(s.index)]
0.029795593000017107
pd.Series(s.values.reshape(2,-1).ravel(order='F'), s.index)
0.0035402200010139495
s.iloc[np.random.permutation(s.index)]
0.010904800990829244
s.values.reshape(-1, 2, order='F').ravel()
0.00019640100072138011
The final f in funcs is > 99% faster than the first np.random.permutation approach, so that's probably your best bet.

count cases in python

If I have a table like
ID Date Disease
1 03.07 A
1 03.07 B
1 03.09 A
1 03.09 C
1 03.10 D
I wrote code like this:
import MySQLdb as mdb  # assumed import: "mdb" is used below for the MySQL connection

def combination(listData):
    comListData = []
    for datum in listData:
        start = listData.index(datum) + 1
        while start < len(listData):
            if datum != listData[start]:
                comStr = datum + ':' + listData[start]
                if not comStr in comListData:
                    comListData.append(comStr)
            start += 1
    return comListData

def insertToDic(dic, comSick):
    for datum in comSick:
        if dic.has_key(datum):
            dic[datum] += 1
        else:
            dic[datum] = 1

try:
    con = mdb.connect('blahblah', 'blah', 'blah', 'blah')
    cur = con.cursor()
    sql = "select * from table"
    cur.execute(sql)
    data = cur.fetchall()
    start = 0
    end = 1
    sick = []
    dic = {}
    for datum in data:
        end = datum[0]
        if end != start:
            start = end
            comSick = combination(sick)
            insertToDic(dic, comSick)
            sick = []
        sick.append(datum[2])
        start = end
    comSick = combination(sick)
    insertToDic(dic, comSick)
    for k, v in dic.items():
        a, b = k.split(':')
        print >>f, a.ljust(0), b.ljust(0), v  # f: output file opened elsewhere
    f.close()
except mdb.Error:
    raise  # exception handling elided in the original post
then I got:
From To Count
A B 1
A A 1
A C 1
A D 1
B A 1
B C 1
B D 1
A C 1
A D 1
C D 1
and the final version of the table I got is (within the same ID, the same direction such as A --> C counts as 1, not 2; pairs of the same disease like A --> A don't count; and A --> B is different from B --> A):
From To Count
A B 1
A C 1
A D 1
B A 1
B C 1
B D 1
C D 1
but what I want is (the version excluding same-date cases):
From To Count
A A 1
A C 1
A D 1
B A 1
B C 1
B D 1
A D 1
C D 1
and finally
From To Count
A C 1
A D 1
B A 1
B C 1
B D 1
C D 1
Which part of my code should I edit?
Let me try to rephrase your question. For each ID (ignoring the date to keep the problem simpler), you want all possible ordered pairs of values in the Disease column and how often each occurs. There is a built-in function in Python that achieves exactly this:
from itertools import permutations
all_pairs = permutations(diseases, 2)
Given your data, I am guessing it is in CSV files. If it is not, please adjust the code yourself (a fairly trivial change). We will be using the well-known data-science library Pandas. Here is how it goes:
from itertools import permutations
import pandas as pd
df = pd.read_csv('data.csv', header=0)
pairs_by_did = df.groupby('ID').apply(lambda grp: pd.Series(list(permutations(grp['Disease'], 2))))
all_pairs = pd.concat([v for i, v in pairs_by_did.iterrows()])
pair_counts = all_pairs.value_counts()
print pair_counts
For your example, it prints
>>> print pair_counts
(A, B) 2
(D, A) 2
(A, D) 2
(C, A) 2
(B, A) 2
(A, C) 2
(A, A) 2
(C, B) 1
(D, C) 1
(C, D) 1
(D, B) 1
(B, D) 1
(B, C) 1
Name: 1, dtype: int64
Now group by ID and date at the same time, and see what you get.
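As a sketch of that suggestion (same df and imports as above; the variable names are mine): grouping on both keys collects the same-date pairs, whose counts you can subtract from pair_counts to keep only pairs spanning different dates:
# pairs formed within the same ID *and* the same date
same_date = (df.groupby(['ID', 'Date'])['Disease']
               .apply(lambda g: list(permutations(g, 2))))
same_date_counts = pd.Series(
    [p for pairs in same_date for p in pairs]).value_counts()
# subtract to keep only cross-date pairs
cross_date = pair_counts - same_date_counts.reindex(pair_counts.index, fill_value=0)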
