Store python array in each entry of a column - python

I have an array 'multilabel' which looks like this:
[[0, 0, 0, 1, 0, 0, 0, 0],
[1, 0, 0, 0, 0, 0, 0, 0],
[1, 0, 0, 1, 0, 0, 0, 0],
...
[0, 0, 0, 1, 0, 0, 0, 0],
[1, 0, 0, 0, 0, 0, 0, 0]]
and want to store each of those arrays in my target variable as I am facing a multi-label classification task. How can I achieve that? My code:
pd.DataFrame(multilabel)
Outputs multiple columns:
0 1 2 3 4 5 6 7
0 0 0 0 0 1 0 0 0
1 1 0 0 0 0 0 0 0
2 1 0 0 0 0 0 0 0
Thanks in advance!

df = pd.DataFrame(list(multilabel))
list_column = df.apply(lambda row: row.values, axis=1)
pd.DataFrame(list_column, columns=['list_column'])
Result df:

Have you considered using the following trick?
import pandas as pd
arr = [[0, 0, 0, 1, 0, 0, 0, 0],
[1, 0, 0, 0, 0, 0, 0, 0],
[1, 0, 0, 1, 0, 0, 0, 0],
[0, 0, 0, 1, 0, 0, 0, 0],
[1, 0, 0, 0, 0, 0, 0, 0]]
pd.DataFrame([arr]).T
Output
0
0 [0, 0, 0, 1, 0, 0, 0, 0]
1 [1, 0, 0, 0, 0, 0, 0, 0]
2 [1, 0, 0, 1, 0, 0, 0, 0]
3 [0, 0, 0, 1, 0, 0, 0, 0]
4 [1, 0, 0, 0, 0, 0, 0, 0]
EDIT
In case you are using numpy arrays you can use the following
import numpy as np
pd.DataFrame(np.array(arr))\
.apply(lambda x: np.array(x), axis=1)

So, the real question is why... it doesn't seem like the most useful data structure.
That said, the one-dimensional data type in pandas is the Series:
>>> pd.Series(multilabel)
0 [0, 0, 0, 1, 0, 0, 0, 0]
1 [1, 0, 0, 0, 0, 0, 0, 0]
2 [1, 0, 0, 1, 0, 0, 0, 0]
3 [0, 0, 0, 1, 0, 0, 0, 0]
4 [1, 0, 0, 0, 0, 0, 0, 0]
dtype: object
You can then convert it further into a DataFrame:
>>> pd.DataFrame(pd.Series(multilabel))
0
0 [0, 0, 0, 1, 0, 0, 0, 0]
1 [1, 0, 0, 0, 0, 0, 0, 0]
2 [1, 0, 0, 1, 0, 0, 0, 0]
3 [0, 0, 0, 1, 0, 0, 0, 0]
4 [1, 0, 0, 0, 0, 0, 0, 0]
Edit: Per further discussion, this works if multilabel is a nested Python list, but not if it's a NumPy array.
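As a minimal sketch of the single-column idea discussed above, the following converts the input to a nested list first, so it should behave the same whether multilabel starts out as a Python list or a NumPy array. The column name "target" and the three sample rows are assumptions for illustration only.

```python
import numpy as np
import pandas as pd

# Sample data (assumed); in the question this is the full 'multilabel' array.
multilabel = np.array([[0, 0, 0, 1, 0, 0, 0, 0],
                       [1, 0, 0, 0, 0, 0, 0, 0],
                       [1, 0, 0, 1, 0, 0, 0, 0]])

# np.asarray(...).tolist() yields a plain nested list for both list and array input.
rows = np.asarray(multilabel).tolist()

# Each inner list becomes one cell of an object-dtype column.
df = pd.DataFrame({"target": rows})
print(df)
```

Note that many estimators expect the 2-D indicator matrix itself as the multi-label target, so a column of lists may need to be stacked back into an array before fitting.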

Related

How to convert values to their index

I have a numpy array containing 1's and 0's:
a = np.array([[0, 0, 0, 0, 0, 0, 0, 0, 0, 0],
[0, 0, 0, 0, 0, 1, 0, 1, 0, 0],
[0, 0, 1, 0, 0, 0, 0, 0, 0, 0],
[0, 1, 1, 0, 0, 0, 1, 0, 0, 0],
[0, 0, 1, 0, 0, 1, 0, 0, 0, 1],
[0, 0, 0, 0, 0, 1, 0, 0, 0, 1],
[0, 0, 1, 0, 0, 0, 1, 0, 0, 0],
[0, 0, 0, 1, 0, 0, 1, 0, 1, 0],
[1, 0, 0, 0, 0, 0, 0, 0, 0, 1],
[0, 1, 1, 0, 0, 1, 1, 0, 0, 0]])
I'd like to convert each 1 to the index at which it occurs in its subarray, to get this:
e = np.array([[0, 0, 0, 0, 0, 0, 0, 0, 0, 0],
[0, 0, 0, 0, 0, 5, 0, 7, 0, 0],
[0, 0, 2, 0, 0, 0, 0, 0, 0, 0],
[0, 1, 2, 0, 0, 0, 6, 0, 0, 0],
[0, 0, 2, 0, 0, 5, 0, 0, 0, 9],
[0, 0, 0, 0, 0, 5, 0, 0, 0, 9],
[0, 0, 2, 0, 0, 0, 6, 0, 0, 0],
[0, 0, 0, 3, 0, 0, 6, 0, 8, 0],
[0, 0, 0, 0, 0, 0, 0, 0, 0, 9],
[0, 1, 2, 0, 0, 5, 6, 0, 0, 0]])
So far what I've done is multiply the array by a range:
a * np.arange(a.shape[0])
which is good, but I'm wondering if there's a better, simpler way to do it, like a single function call?
This modifies a in place:
In [4]: i, j = np.nonzero(a)
In [5]: a[i, j] = j
In [6]: a
Out[6]:
array([[0, 0, 0, 0, 0, 0, 0, 0, 0, 0],
[0, 0, 0, 0, 0, 5, 0, 7, 0, 0],
[0, 0, 2, 0, 0, 0, 0, 0, 0, 0],
[0, 1, 2, 0, 0, 0, 6, 0, 0, 0],
[0, 0, 2, 0, 0, 5, 0, 0, 0, 9],
[0, 0, 0, 0, 0, 5, 0, 0, 0, 9],
[0, 0, 2, 0, 0, 0, 6, 0, 0, 0],
[0, 0, 0, 3, 0, 0, 6, 0, 8, 0],
[0, 0, 0, 0, 0, 0, 0, 0, 0, 9],
[0, 1, 2, 0, 0, 5, 6, 0, 0, 0]])
Make a copy if you don't want to modify a in place.
Or, this creates a new array (in one line):
In [8]: np.where(a, np.arange(a.shape[1]), 0)
Out[8]:
array([[0, 0, 0, 0, 0, 0, 0, 0, 0, 0],
[0, 0, 0, 0, 0, 5, 0, 7, 0, 0],
[0, 0, 2, 0, 0, 0, 0, 0, 0, 0],
[0, 1, 2, 0, 0, 0, 6, 0, 0, 0],
[0, 0, 2, 0, 0, 5, 0, 0, 0, 9],
[0, 0, 0, 0, 0, 5, 0, 0, 0, 9],
[0, 0, 2, 0, 0, 0, 6, 0, 0, 0],
[0, 0, 0, 3, 0, 0, 6, 0, 8, 0],
[0, 0, 0, 0, 0, 0, 0, 0, 0, 9],
[0, 1, 2, 0, 0, 5, 6, 0, 0, 0]])
Your approach is as fast as it gets, but it uses the wrong dimension for the multiplication (it would fail if the matrix wasn't square).
Multiply the matrix by a range of column indexes:
import numpy as np
a = np.array([[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0],
[0, 0, 0, 0, 0, 1, 0, 1, 0, 0, 0],
[0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0],
[0, 1, 1, 0, 0, 0, 1, 0, 0, 0, 0],
[0, 0, 1, 0, 0, 1, 0, 0, 0, 1, 0],
[0, 0, 0, 0, 0, 1, 0, 0, 0, 1, 0],
[0, 0, 1, 0, 0, 0, 1, 0, 0, 0, 1],
[0, 0, 0, 1, 0, 0, 1, 0, 1, 0, 0],
[1, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0],
[0, 1, 1, 0, 0, 1, 1, 0, 0, 0, 0]])
e = a * np.arange(a.shape[1])
print(e)
[[ 0 0 0 0 0 0 0 0 0 0 0]
[ 0 0 0 0 0 5 0 7 0 0 0]
[ 0 0 2 0 0 0 0 0 0 0 0]
[ 0 1 2 0 0 0 6 0 0 0 0]
[ 0 0 2 0 0 5 0 0 0 9 0]
[ 0 0 0 0 0 5 0 0 0 9 0]
[ 0 0 2 0 0 0 6 0 0 0 10]
[ 0 0 0 3 0 0 6 0 8 0 0]
[ 0 0 0 0 0 0 0 0 0 9 0]
[ 0 1 2 0 0 5 6 0 0 0 0]]
I benchmarked the obligatory np.einsum approach, which was ~1.29x slower for larger arrays (100_000, 1000) than the corrected original solution. The inplace solution was ~8x slower than np.einsum.
np.einsum('ij,j->ij', a, np.arange(a.shape[1]))
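To check that the approaches in this thread agree, here is a small sketch on a deliberately non-square array. The 2x4 sample data and the variable names are assumptions made for illustration, not taken from the answers.

```python
import numpy as np

# Hypothetical non-square example (2 rows, 4 columns).
a = np.array([[0, 1, 0, 1],
              [1, 0, 0, 1]])
cols = np.arange(a.shape[1])  # column indexes 0..3

by_multiply = a * cols                      # broadcast the column indexes over every row
by_where = np.where(a, cols, 0)             # take the column index wherever a is nonzero
by_einsum = np.einsum('ij,j->ij', a, cols)  # the same elementwise product via einsum

print(by_multiply)
```

All three produce the same array here. A 1 in column 0 maps to index 0, so it cannot be told apart from an empty cell in the result.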

Looping and counting python 2d arrays

I have array like this:
[
[1, 0, 0, 0, 0, 0, 0, 0, 0],
[1, 0, 0, 0, 0, 0, 0, 0, 0],
[1, 0, 0, 0, 0, 0, 0, 0, 0],
[0, 1, 0, 0, 0, 0, 0, 0, 0],
[0, 1, 0, 0, 0, 0, 0, 0, 0],
[0, 1, 0, 0, 0, 0, 0, 0, 0],
[0, 0, 1, 0, 0, 0, 0, 0, 0],
[0, 0, 1, 0, 0, 0, 0, 0, 0],
[0, 0, 0, 1, 0, 0, 0, 0, 0]
]
I want to sum the values over every 3 rows, so the result I expect is:
[
[3, 0, 0, 0, 0, 0, 0, 0, 0],
[0, 3, 0, 0, 0, 0, 0, 0, 0],
[0, 0, 2, 1, 0, 0, 0, 0, 0]
]
I have no idea how to loop over it.
UPDATE:
This problem is solved, thank you. I used Shijith's code:
if len(arr) % 3 == 0:
    print([[sum(y) for y in zip(arr[x], arr[x+1], arr[x+2])] for x in range(0, len(arr), 3)])
Try:
import numpy as np
result = []
# Sum each consecutive block of three rows column-wise.
for i in range(len(a) // 3):
    result.append(np.sum(a[i*3:i*3+3], axis=0))
[array([3, 0, 0, 0, 0, 0, 0, 0, 0]),
array([0, 3, 0, 0, 0, 0, 0, 0, 0]),
array([0, 0, 2, 1, 0, 0, 0, 0, 0])]
You can use numpy.sum() along axis=0 for every three rows in your array.
import numpy as np
if len(arr) % 3 == 0:
    print(np.array([np.sum(arr[x:x+3], axis=0) for x in range(0, len(arr), 3)]))
[[3 0 0 0 0 0 0 0 0]
[0 3 0 0 0 0 0 0 0]
[0 0 2 1 0 0 0 0 0]]
Or use a simple list comprehension:
if len(arr) % 3 == 0:
    print([[sum(y) for y in zip(arr[x], arr[x+1], arr[x+2])] for x in range(0, len(arr), 3)])
[[3, 0, 0, 0, 0, 0, 0, 0, 0],
[0, 3, 0, 0, 0, 0, 0, 0, 0],
[0, 0, 2, 1, 0, 0, 0, 0, 0]]
You can do this in pandas like this (it splits the rows into groups of 3, then takes the sum of each group):
import pandas as pd
import numpy as np
df=pd.DataFrame([
[1, 0, 0, 0, 0, 0, 0, 0, 0],
[1, 0, 0, 0, 0, 0, 0, 0, 0],
[1, 0, 0, 0, 0, 0, 0, 0, 0],
[0, 1, 0, 0, 0, 0, 0, 0, 0],
[0, 1, 0, 0, 0, 0, 0, 0, 0],
[0, 1, 0, 0, 0, 0, 0, 0, 0],
[0, 0, 1, 0, 0, 0, 0, 0, 0],
[0, 0, 1, 0, 0, 0, 0, 0, 0],
[0, 0, 0, 1, 0, 0, 0, 0, 0]
])
df.groupby(np.arange(len(df))//3).sum().to_numpy().tolist()
output:
[[3, 0, 0, 0, 0, 0, 0, 0, 0],
[0, 3, 0, 0, 0, 0, 0, 0, 0],
[0, 0, 2, 1, 0, 0, 0, 0, 0]]
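For a pure, import-free way (assuming a is a list of lists):
combine = []
for x in range(len(a) // 3):
    # zip(*rows) lines up the columns of the three rows so each can be summed.
    combine.append([sum(col) for col in zip(*a[x*3:x*3+3])])
print(combine)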
output:
[[3, 0, 0, 0, 0, 0, 0, 0, 0],
[0, 3, 0, 0, 0, 0, 0, 0, 0],
[0, 0, 2, 1, 0, 0, 0, 0, 0]]
result = [[sum(three) for three in
           zip(arr[first], arr[first + 1], arr[first + 2])]
          for first in range(0, len(arr) - len(arr) % 3, 3)]
print(result)
Output
[[3, 0, 0, 0, 0, 0, 0, 0, 0], [0, 3, 0, 0, 0, 0, 0, 0, 0], [0, 0, 2, 1, 0, 0, 0, 0, 0]]
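Another option, not mentioned in the answers above, is to reshape the array into blocks of three rows and sum within each block. This is only a sketch: it assumes the data is (or can be turned into) a rectangular NumPy array, and it silently drops trailing rows that do not fill a whole block. The group size name n is an assumption.

```python
import numpy as np

# Rebuild the question's 9x9 input: each row is one-hot at the listed column.
arr = np.eye(9, dtype=int)[[0, 0, 0, 1, 1, 1, 2, 2, 3]]

n = 3                              # rows per group (assumed name)
usable = len(arr) - len(arr) % n   # ignore rows that do not fill a complete group

# Shape (groups, n, columns); summing over axis 1 adds the n rows of each group.
grouped = arr[:usable].reshape(-1, n, arr.shape[1]).sum(axis=1)
print(grouped.tolist())
```

This avoids the Python-level loop, which may matter for large inputs.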

Tensorflow Initialize list with 0s and 1s similar to tf.one_hot

I currently have a list of values over a time sequence say the values are [1, 3, 5, 7, 3]. Currently I am using tf.one_hot to get a one hot vector/tensor representative for each value within the list.
1 = [0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0]
3 = [0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0]
5 = [0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0]
7 = [0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0]
3 = [0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0]
In Tensorflow is there a way/function that would allow me to do something similar but initialize all values from 0 to the value with 1s?
Desired Result:
1 = [1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0]
3 = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0]
5 = [1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0]
7 = [1, 1, 1, 1, 1, 1, 1, 1, 0, 0, 0]
3 = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0]
You are looking for tf.sequence_mask (see the TensorFlow API documentation).
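To make the behaviour concrete without requiring TensorFlow, here is a NumPy sketch of the comparison that a sequence mask performs: position j in row i is set when j is less than the length for that row. Because the desired output sets positions 0 through the value inclusive, the lengths are the values plus one. The variable names are assumptions; the equivalent TensorFlow call would presumably be tf.sequence_mask(values + 1, maxlen=11, dtype=tf.int32).

```python
import numpy as np

values = np.array([1, 3, 5, 7, 3])  # the time-sequence values from the question
maxlen = 11                         # width of each row, as in the one-hot example

# Row i has ones at positions 0..values[i] inclusive, hence lengths = values + 1.
lengths = values + 1
mask = (np.arange(maxlen) < lengths[:, None]).astype(int)
print(mask)
```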

Github-API: How do I plot the number of commits to a repo?

I am a Python newbie, and I am trying to plot the number of commits, per day of the week.
I have the output as a pandas dataframe. However, I am unable to figure out how to plot this data using matplotlib.
Any help is greatly appreciated!
import requests
import json
import pandas as pd
r = requests.get('https://api.github.com/repos/thisismetis/dsp/stats/commit_activity')
raw = r.text
results = json.loads(raw)
print(pd.DataFrame(results))
Results:
days total week
0 [0, 0, 0, 0, 0, 0, 0] 0 1431216000
1 [0, 0, 0, 0, 0, 8, 0] 8 1431820800
2 [0, 19, 1, 4, 18, 8, 0] 50 1432425600
3 [0, 3, 23, 1, 0, 0, 0] 27 1433030400
4 [1, 0, 0, 0, 1, 0, 0] 2 1433635200
5 [0, 0, 0, 0, 0, 0, 0] 0 1434240000
6 [0, 2, 0, 0, 0, 0, 0] 2 1434844800
7 [0, 0, 0, 0, 0, 0, 0] 0 1435449600
8 [0, 0, 0, 0, 0, 0, 0] 0 1436054400
9 [0, 0, 0, 0, 0, 0, 0] 0 1436659200
10 [0, 0, 8, 0, 3, 0, 0] 11 1437264000
11 [0, 3, 36, 0, 1, 9, 0] 49 1437868800
12 [0, 2, 2, 2, 5, 1, 0] 12 1438473600
13 [0, 0, 0, 0, 0, 0, 0] 0 1439078400
14 [0, 2, 0, 0, 0, 0, 0] 2 1439683200
15 [0, 0, 0, 0, 0, 0, 0] 0 1440288000
16 [0, 0, 0, 0, 0, 0, 0] 0 1440892800
17 [0, 0, 0, 0, 0, 0, 0] 0 1441497600
18 [0, 0, 0, 0, 0, 3, 0] 3 1442102400
19 [0, 0, 0, 0, 0, 0, 0] 0 1442707200
20 [0, 0, 0, 0, 0, 0, 0] 0 1443312000
21 [0, 0, 0, 0, 0, 0, 0] 0 1443916800
22 [0, 0, 0, 0, 0, 0, 0] 0 1444521600
23 [0, 0, 10, 0, 0, 0, 0] 10 1445126400
24 [0, 0, 0, 0, 0, 0, 0] 0 1445731200
25 [1, 0, 0, 0, 0, 0, 0] 1 1446336000
26 [0, 0, 0, 0, 4, 3, 0] 7 1446940800
27 [0, 0, 0, 0, 0, 0, 0] 0 1447545600
28 [0, 0, 0, 0, 0, 0, 0] 0 1448150400
29 [0, 0, 0, 0, 0, 0, 0] 0 1448755200
30 [0, 0, 0, 0, 0, 0, 0] 0 1449360000
31 [0, 0, 0, 0, 0, 0, 0] 0 1449964800
32 [0, 0, 0, 0, 0, 0, 0] 0 1450569600
33 [0, 0, 0, 0, 0, 0, 1] 1 1451174400
34 [0, 0, 0, 0, 0, 0, 0] 0 1451779200
35 [0, 0, 0, 0, 0, 0, 0] 0 1452384000
36 [0, 0, 0, 0, 0, 0, 0] 0 1452988800
37 [0, 0, 0, 0, 0, 0, 0] 0 1453593600
38 [0, 0, 0, 0, 0, 0, 0] 0 1454198400
39 [0, 0, 5, 2, 0, 0, 0] 7 1454803200
40 [0, 0, 25, 2, 0, 0, 0] 27 1455408000
41 [1, 10, 0, 0, 3, 0, 0] 14 1456012800
42 [0, 0, 0, 0, 0, 0, 0] 0 1456617600
43 [0, 0, 0, 0, 0, 0, 0] 0 1457222400
44 [0, 0, 0, 2, 1, 0, 0] 3 1457827200
45 [0, 0, 0, 0, 0, 0, 0] 0 1458432000
46 [0, 0, 0, 0, 0, 0, 0] 0 1459036800
47 [0, 0, 0, 0, 0, 0, 0] 0 1459641600
48 [0, 0, 0, 0, 0, 0, 0] 0 1460246400
49 [0, 0, 0, 0, 0, 0, 0] 0 1460851200
50 [0, 0, 0, 0, 0, 0, 0] 0 1461456000
51 [0, 0, 0, 0, 0, 0, 0] 0 1462060800
You can do it this way:
df = pd.DataFrame(results)
df['date'] = pd.to_datetime(df.week, unit='s')
df['week_no'] = df.apply(lambda x: '{:d}-{:02d}'.format(x['date'].year, x['date'].weekofyear), axis=1)
df.set_index('week_no')['total'].plot.bar()
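The answer above plots the weekly totals. For the per-weekday view the question asks about, one possible sketch is to stack the days lists and sum each position. The three sample rows are copied from the question's output; the weekday labels assume that each days list starts on Sunday, which is how the GitHub commit activity endpoint is documented as far as I know.

```python
import numpy as np
import pandas as pd

# Three rows taken from the question's output, used here instead of a live API call.
sample = [
    {"days": [0, 0, 0, 0, 0, 8, 0], "total": 8, "week": 1431820800},
    {"days": [0, 19, 1, 4, 18, 8, 0], "total": 50, "week": 1432425600},
    {"days": [0, 3, 23, 1, 0, 0, 0], "total": 27, "week": 1433030400},
]
df = pd.DataFrame(sample)

# Assumption: position 0 of each 'days' list is Sunday.
day_names = ["Sun", "Mon", "Tue", "Wed", "Thu", "Fri", "Sat"]

# Stack the per-week lists into a (weeks, 7) array and sum down each weekday column.
per_day = pd.Series(np.array(df["days"].tolist()).sum(axis=0), index=day_names)
print(per_day)
```

With matplotlib installed, per_day.plot.bar() should then draw one bar per weekday.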

Converting an array of numpy arrays to DataFrame

I have a numpy object that contains the following:
17506 [0, 0, 0, 0, 0, 0]
17507 [0, 0, 0, 0, 0, 0]
17508 [0, 0, 0, 0, 0, 0]
17509 [0, 0, 0, 0, 0, 0]
17510 [0, 0, 0, 0, 0, 0]
17511 [0, 0, 0, 0, 0, 0]
17512 [0, 0, 0, 0, 0, 0]
17513 [0, 0, 0, 0, 0, 0]
17514 [0, 0, 0, 0, 0, 0]
17515 [0, 0, 0, 0, 0, 0]
17516 [0, 0, 0, 0, 0, 0]
17517 [0, 0, 0, 0, 0, 0]
17518 [0, 0, 0, 0, 0, 0]
17519 [0, 0, 0, 0, 0, 0]
(An array that contains arrays of dtype('int32'))
How can I efficiently convert this to a DataFrame in pandas and concatenate it (vertically) to an existing DataFrame?
What seems to be the problem? You may need to further describe your data.
a = np.array([np.zeros(6) for _ in range(3)])
>>> pd.DataFrame(a)
0 1 2 3 4 5
0 0 0 0 0 0 0
1 0 0 0 0 0 0
2 0 0 0 0 0 0
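If the object really is a 1-D container whose elements are themselves arrays (object dtype), passing it straight to pd.DataFrame gives a single column of arrays. A sketch of one way around that, assuming all inner arrays have the same length, is to stack them into a 2-D array first and then concatenate. The names obj and existing and their contents are made up for illustration.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for the question's data: an object array holding int32 arrays.
obj = np.empty(3, dtype=object)
for i in range(3):
    obj[i] = np.zeros(6, dtype=np.int32)

# Stack the inner arrays into one (3, 6) array, then wrap it in a DataFrame.
new_rows = pd.DataFrame(np.stack(list(obj)))

# Hypothetical existing frame with the same six columns.
existing = pd.DataFrame(np.ones((2, 6), dtype=np.int32))

# Vertical concatenation; ignore_index renumbers the rows 0..4.
combined = pd.concat([existing, new_rows], ignore_index=True)
print(combined)
```

If the data is a pandas Series of arrays rather than a NumPy object array, np.stack(series.tolist()) should work the same way.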
