Joining values in 3D pandas array on ", ", making it a 2D array - python

I have a "3D" dataframe (three 2D dataframes concatenated under a MultiIndex), and I want to get all values at one x,y position across the z axis, where the z axis moves between the original 2D dataframes. It is a little weird to visualize, so forgive me if I describe it wrongly, but the vector at x=0, y=0 would be [1, 5, 3].
So my result would be a dataframe where df_2d[0][0] is the string "1, 5, 3", and so on for every cell, using all the values in the 3D dataframe.
Is there any way I can achieve this without looping through each cell index and accessing the values explicitly?
The data frame is defined as:
import pandas as pd
columns = ['A', 'B']
index = [1, 2, 3]
df_1 = pd.DataFrame(data=[[1, 2], [99, 57], [57, 20]], index=index, columns=columns)
df_2 = pd.DataFrame(data=[[5, 6], [78, 47], [21, 11]], index=index, columns=columns)
df_3 = pd.DataFrame(data=[[3, 4], [66, 37], [33, 17]], index=index, columns=columns)
df_3d = pd.concat([df_1, df_2, df_3], keys=['1', '2', '3'])
And then to get the original data out I do:
print(df_3d.xs('1'))
print(df_3d.xs('2'))
print(df_3d.xs('3'))
A B
1 1 2
2 99 57
3 57 20
A B
1 5 6
2 78 47
3 21 11
A B
1 3 4
2 66 37
3 33 17
Again, to clarify, if looking at this print I would like to have a combined dataframe looking like:
A B
1 '1, 5, 3' '2, 6, 4'
2 '99, 78, 66' '57, 47, 37'
3 '57, 21, 33' '20, 11, 17'

Use .xs to pull out the dataframe for each top-level key, then use functools.reduce to combine them all into one.
from functools import reduce
# Get each level values
dfs = [df_3d.xs(i) for i in df_3d.index.levels[0]]
df = reduce(lambda left,right: left.astype(str) + ", " + right.astype(str), dfs)
df
A B
1 1, 5, 3 2, 6, 4
2 99, 78, 66 57, 47, 37
3 57, 21, 33 20, 11, 17
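And if you want the surrounding ' characters, you can use applymap to apply a function to every element (in pandas 2.1 and later, applymap is deprecated in favor of DataFrame.map).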
df.applymap(lambda x: "'" + x + "'")
A B
1 '1, 5, 3' '2, 6, 4'
2 '99, 78, 66' '57, 47, 37'
3 '57, 21, 33' '20, 11, 17'
Or df = "'" + df + "'"
df
A B
1 '1, 5, 3' '2, 6, 4'
2 '99, 78, 66' '57, 47, 37'
3 '57, 21, 33' '20, 11, 17'
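As an alternative to the reduce approach above, the same join can be sketched with a groupby on the inner index level. This is only a sketch: it assumes the frame was built with pd.concat(..., keys=...) as in the question, so that level 1 of the MultiIndex holds the original row labels. The name df_2d is borrowed from the question's wording.

```python
import pandas as pd

columns = ['A', 'B']
index = [1, 2, 3]
df_1 = pd.DataFrame([[1, 2], [99, 57], [57, 20]], index=index, columns=columns)
df_2 = pd.DataFrame([[5, 6], [78, 47], [21, 11]], index=index, columns=columns)
df_3 = pd.DataFrame([[3, 4], [66, 37], [33, 17]], index=index, columns=columns)
df_3d = pd.concat([df_1, df_2, df_3], keys=['1', '2', '3'])

# Convert every cell to str, group the rows that share the same inner label
# (level 1 of the MultiIndex), and join each group's values column by column.
df_2d = df_3d.astype(str).groupby(level=1).agg(lambda s: ", ".join(s))
print(df_2d)
```

Within each group the rows keep their original order, so the values appear in the order the frames were concatenated.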

Related

how to replace the same value in pandas dataframe with a different value in each row

I want to replace value 0 in each row in the pandas dataframe with a value that comes from a list that has the same index as the row index of the dataframe.
# here is my dataframe
df = pd.DataFrame({'a': [12, 52, 0], 'b': [33, 0, 110], 'c':[0, 15, 134]})
#here is the list
maxValueInRow = [3,5,34]
# the desired output would be:
df_updated = pd.DataFrame({'a': [12, 52, 3], 'b': [33, 5, 110], 'c':[34, 15, 134]})
I thought it could be something like
df.apply(lambda row: maxValueInRow[row.name] if row==0 else row, axis=1)
but that didn't work and produced the error 'The truth value of a Series is ambiguous. Use a.empty, a.bool(), a.item(), a.any() or a.all().'
Any thoughts would be greatly appreciated.
Here is what you need:
# here is my dataframe
df = pd.DataFrame({'a': [12, 52, 0], 'b': [33, 0, 110], 'c':[0, 15, 134]})
#here is the list
maxValueInRow = [3,5,34]
for index, row in df.iterrows():
    for column in df.columns:
        if row[column] == 0:
            # .loc avoids chained assignment, which may write to a temporary copy
            df.loc[index, column] = maxValueInRow[index]
df
Output
    a    b    c
0  12   33    3
1  52    5   15
2  34  110  134
Update
As per your comments, it seems that by replacing values "with the same index" you meant something other than the row index. Here is an updated solution for your problem:
# here is my dataframe
df = pd.DataFrame({'a': [12, 52, 0], 'b': [33, 0, 110], 'c':[0, 15, 134]})
data = df.to_dict()
maxValueInRow = [3,5,34]
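i = 0
for col, innerList in data.items():
    # innerList maps row index -> value for this column
    for index in range(len(innerList)):
        value = innerList[index]
        if value == 0:
            data[col][index] = maxValueInRow[i]
    # move on to the next replacement value once per column
    i += 1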
df = pd.DataFrame(data)
df
Output
    a    b    c
0  12   33   34
1  52    5   15
2   3  110  134
You could use .replace:
df = pd.DataFrame({'a': [12, 52, 0], 'b': [33, 0, 110], 'c':[0, 15, 134]})
maxValueInRow = [3,5,34]
repl = {col: {0: value} for col, value in zip(df.columns, maxValueInRow)}
df_updated = df.replace(repl)
Result:
a b c
0 12 33 34
1 52 5 15
2 3 110 134
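Because the question can be read two ways (one replacement value per column, or one per row), here is a sketch of both readings using NumPy broadcasting. It assumes the list has as many entries as the frame has columns (for the column-wise case) or rows (for the row-wise case). The names repl, is_zero, by_column and by_row are illustrative only.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'a': [12, 52, 0], 'b': [33, 0, 110], 'c': [0, 15, 134]})
maxValueInRow = [3, 5, 34]

repl = np.asarray(maxValueInRow)
is_zero = df.eq(0).to_numpy()
values = df.to_numpy()

# Column-wise: repl[None, :] has shape (1, 3), so entry k fills zeros in column k.
by_column = pd.DataFrame(np.where(is_zero, repl[None, :], values),
                         index=df.index, columns=df.columns)

# Row-wise: repl[:, None] has shape (3, 1), so entry k fills zeros in row k.
by_row = pd.DataFrame(np.where(is_zero, repl[:, None], values),
                      index=df.index, columns=df.columns)

print(by_column)
print(by_row)
```

The column-wise result matches the desired output in the question and the .replace answer. The row-wise result matches the first iterrows answer.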

Counting the cells in a row (across multiple columns) that are within x value of first column in Pandas

I have the following dataframe and I'm trying to determine how many of the column values in each row are within 12 of the max value found in the first four columns.
import pandas as pd
df = pd.DataFrame({
't1': [0, 0, 40, 37, 143],
't2': [0, 38, 149, 145, 151],
't3': [0, 140, 100, 37, 150],
't4': [0, 0, 23, 0, 19],
'other': ['str1', 'str2', 'str3', 'str4', 'NaN'],
'age': [21, 29, 57, 48, 37],
'new_max': [0,140,149,145,151]})
I want to check columns 't1' through 't4' to see if they are within 12 of the maximum value contained in those four columns for that row.
The output would be to add a 'w12_count' column for each row like this:
df = pd.DataFrame({
't1': [0, 0, 40, 37, 143],
't2': [0, 38, 149, 145, 151],
't3': [0, 140, 100, 37, 150],
't4': [0, 0, 23, 0, 19],
'other': ['str1', 'str2', 'str3', 'str4', 'NaN'],
'age': [21, 29, 57, 48, 37],
'new_max': [0,140,149,145,151],
'w12_count': [4, 1, 1, 1, 3]})
I know I could use .loc to create a new column based on each column I'm checking and assign it 0 if it is false or 1 if it is true and then sum those new columns to get the count. But my data actually has a lot of columns so I'm trying to find the syntax for using the count method to total the number of columns within 12 and assign a new column with the count.
We can filter the t-like columns, take their row-wise max (axis=1), subtract that max from each column to get the differences, compare the absolute differences with 12 to build a boolean mask, and finally sum the mask along axis=1 to get the counts:
t = df.filter(regex=r't\d+')
df['w12_count'] = t.sub(t.max(1), axis=0).abs().le(12).sum(1)
t1 t2 t3 t4 other age new_max w12_count
0 0 0 0 0 str1 21 0 4
1 0 38 140 0 str2 29 140 1
2 40 149 100 23 str3 57 149 1
3 37 145 37 0 str4 48 145 1
4 143 151 150 19 NaN 37 151 3
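To make the one-liner above easier to follow, here is the same computation split into named steps. The intermediate names (row_max, diff, within) are illustrative. The anchored pattern ^t\d+$ is an assumption that is stricter than the answer's pattern, so that a hypothetical column such as "cat1" would not be picked up.

```python
import pandas as pd

df = pd.DataFrame({
    't1': [0, 0, 40, 37, 143],
    't2': [0, 38, 149, 145, 151],
    't3': [0, 140, 100, 37, 150],
    't4': [0, 0, 23, 0, 19],
    'other': ['str1', 'str2', 'str3', 'str4', 'NaN'],
})

# Select only columns named exactly "t" followed by digits.
t = df.filter(regex=r'^t\d+$')

row_max = t.max(axis=1)            # largest value in each row
diff = t.sub(row_max, axis=0)      # each cell minus its row's max (<= 0)
within = diff.abs().le(12)         # True where the cell is within 12 of the max
df['w12_count'] = within.sum(axis=1)  # count the True cells per row

print(df)
```

Because True counts as 1 when summed, no helper columns are needed, however many t columns there are.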

Transforming multiindex to row-wise multi-dimensional NumPy array.

Suppose I have a MultiIndex DataFrame similar to an example from the MultiIndex docs.
>>> df
               0   1   2   3
first second
bar   one      0   1   2   3
      two      4   5   6   7
baz   one      8   9  10  11
      two     12  13  14  15
foo   one     16  17  18  19
      two     20  21  22  23
qux   one     24  25  26  27
      two     28  29  30  31
I want to generate a NumPy array from this DataFrame with a 3-dimensional structure like
>>> desired_arr
array([[[ 0, 4],
[ 1, 5],
[ 2, 6],
[ 3, 7]],
[[ 8, 12],
[ 9, 13],
[10, 14],
[11, 15]],
[[16, 20],
[17, 21],
[18, 22],
[19, 23]],
[[24, 28],
[25, 29],
[26, 30],
[27, 31]]])
How can I do so?
Hopefully it is clear what is happening here - I am effectively unstacking the DataFrame by the first level and then trying to turn each top level in the resulting column MultiIndex to its own 2-dimensional array.
I can get half way there with
>>> df.unstack(1)
         0       1       2       3
second one two one two one two one two
first
bar      0   4   1   5   2   6   3   7
baz      8  12   9  13  10  14  11  15
foo     16  20  17  21  18  22  19  23
qux     24  28  25  29  26  30  27  31
but then I am struggling to find a nice way to turn each column into a 2-dimensional array and then join them together, beyond doing so explicitly with loops and lists.
I feel like there should be some way to specify the shape of my desired NumPy array beforehand, fill it with np.nan, and then use a specific iteration order to fill in the values from my DataFrame, but I have not managed to solve the problem with this approach yet.
To generate the sample DataFrame
import numpy as np
import pandas as pd
iterables = [['bar', 'baz', 'foo', 'qux'], ['one', 'two']]
ind = pd.MultiIndex.from_product(iterables, names=['first', 'second'])
df = pd.DataFrame(np.arange(8*4).reshape((8, 4)), index=ind)
Some reshape and swapaxes magic -
df.values.reshape(4,2,-1).swapaxes(1,2)
Generalizable to -
m,n = len(df.index.levels[0]), len(df.index.levels[1])
arr = df.values.reshape(m,n,-1).swapaxes(1,2)
This splits the first axis into two axes of lengths 4 and 2, creating a 3D array, and then swaps the last two axes, i.e. pushes the axis of length 2 to the back (as the last one). It relies on the rows being sorted by index and on every first-level key having every second-level key.
Sample output -
In [35]: df.values.reshape(4,2,-1).swapaxes(1,2)
Out[35]:
array([[[ 0, 4],
[ 1, 5],
[ 2, 6],
[ 3, 7]],
[[ 8, 12],
[ 9, 13],
[10, 14],
[11, 15]],
[[16, 20],
[17, 21],
[18, 22],
[19, 23]],
[[24, 28],
[25, 29],
[26, 30],
[27, 31]]])
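As a sanity check on the reshape and swapaxes trick, the sketch below compares it with an explicit (slower) construction that takes one cross-section per first-level key. The name expected is illustrative, and the check assumes the complete, sorted sample frame from the question.

```python
import numpy as np
import pandas as pd

iterables = [['bar', 'baz', 'foo', 'qux'], ['one', 'two']]
ind = pd.MultiIndex.from_product(iterables, names=['first', 'second'])
df = pd.DataFrame(np.arange(8 * 4).reshape((8, 4)), index=ind)

# Vectorised version: split the row axis into (first, second), then move
# the "second" axis to the end.
m, n = len(df.index.levels[0]), len(df.index.levels[1])
arr = df.to_numpy().reshape(m, n, -1).swapaxes(1, 2)

# Explicit reference: for each first-level key, take its (second x columns)
# block and transpose it to (columns x second).
expected = np.stack([df.xs(key, level='first').to_numpy().T
                     for key in df.index.levels[0]])

print(arr.shape)
print(np.array_equal(arr, expected))
```

The reshape version only rearranges axes, so it is typically cheaper than building the blocks one by one.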
To complete @divakar's answer, here is a multidimensional generalisation:
# sort values by index
A = df.sort_index()
# fill missing index combinations with 0
for idx in A.index.names:
    A = A.unstack(idx).fillna(0).stack(1)
# create a tuple with the right dimensions
reshape_size = tuple([len(x) for x in A.index.levels])
# reshape
arr = np.reshape(A.values, reshape_size).swapaxes(0,1)
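When some (first, second) combinations are missing, one way to make the reshape safe is to reindex against the full product of labels first. This is a sketch for the two-level case only, not the general method above. The incomplete frame, the explicit first/second label lists, and the fill value 0 are all assumptions made for illustration.

```python
import numpy as np
import pandas as pd

first = ['bar', 'baz', 'foo', 'qux']
second = ['one', 'two']
ind = pd.MultiIndex.from_product([first, second], names=['first', 'second'])
df = pd.DataFrame(np.arange(8 * 4).reshape((8, 4)), index=ind)

# Hypothetical incomplete input: one (first, second) combination is missing.
incomplete = df.drop(index=[('bar', 'two')])

# Reindex against every combination so each first-level key has exactly
# len(second) rows again; missing rows are filled with 0.
full = pd.MultiIndex.from_product([first, second], names=['first', 'second'])
filled = incomplete.reindex(full, fill_value=0)

arr = filled.to_numpy().reshape(len(first), len(second), -1).swapaxes(1, 2)
print(arr[0])
```

Without the reindex step the row count would no longer be divisible into equal blocks, and the reshape would raise an error.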

How can I append the values of a list to the values of a DataFrame column using Python?

I have this DataFrame:
df = pd.DataFrame({'id': [14, 49, 21],
                   'parameter': [12, 23, 11]})
And I have this list:
[8, 1, 3]
<class 'list'>
I want to append the list values to each value of the column id of the dataframe df, I want something like this:
id parameter
0 148 12
1 491 23
2 213 11
How can I do it?
You can try like so:
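import pandas as pd
# DataFrame.from_items was removed in pandas 1.0; a plain dict keeps the column order
df = pd.DataFrame({'id': [14, 49, 21], 'parameter': [12, 23, 11]})
l = iter([8, 1, 3])
# append the next list value to each id as text, then convert back to int
df.id = df.id.apply(lambda x: str(x)+str(next(l))).astype(int)
print(df)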
Output:
id parameter
0 148 12
1 491 23
2 213 11
Using zip works:
import pandas as pd
df = pd.DataFrame({'id': [14, 49, 21], 'parameter': [12, 23, 11]})
l = [8, 1, 3]
# pair each id with its list value, join the digits, and convert back to int
df['id'] = [int(''.join([str(j) for j in i])) for i in zip(df['id'], l)]
and the resulting df is:
>>> df
id parameter
0 148 12
1 491 23
2 213 11
import pandas as pd
df = pd.DataFrame({'id': [14, 49, 21], 'parameter': [12, 23, 11]})
lista = [8,1,3]
for i in range(len(lista)):
    # .loc avoids chained assignment, which may write to a temporary copy
    df.loc[i, 'id'] = int(str(df.loc[i, 'id']) + str(lista[i]))
print(df)
id parameter
0 148 12
1 491 23
2 213 11
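The answers above all go element by element. A vectorised sketch of the same idea concatenates the two columns as strings and converts back to integers. It assumes the list has exactly one entry per row, and the names suffixes and suffix_str are illustrative.

```python
import pandas as pd

df = pd.DataFrame({'id': [14, 49, 21], 'parameter': [12, 23, 11]})
suffixes = [8, 1, 3]

# Build a string Series aligned with df's index, concatenate it to the ids
# as text, then convert the result back to integers.
suffix_str = pd.Series(suffixes, index=df.index).astype(str)
df['id'] = (df['id'].astype(str) + suffix_str).astype(int)

print(df)
```

Giving the helper Series the same index as df keeps the concatenation aligned even if the frame does not use the default 0..n-1 index.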

loop for computing average of selected data in dataframe using pandas

I have a 3 row x 96 column dataframe. I'm trying to compute the average of the two rows beneath the index (row1:96) for every 12 data points. Here is my dataframe:
Run 1 Run 2 Run 3 Run 4 Run 5 Run 6 \
0 1461274.92 1458079.44 1456807.1 1459216.08 1458643.24 1457145.19
1 478167.44 479528.72 480316.08 475569.52 472989.01 476054.89
2 ------ ------ ------ ------ ------ ------
Run 7 Run 8 Run 9 Run 10 ... Run 87 \
0 1458117.08 1455184.82 1455768.69 1454738.07 ... 1441822.45
1 473630.89 476282.93 475530.87 474200.22 ... 468525.2
2 ------ ------ ------ ------ ... ------
Run 88 Run 89 Run 90 Run 91 Run 92 Run 93 \
0 1445339.53 1461050.97 1446849.43 1438870.43 1431275.76 1430781.28
1 460076.8 473263.06 455885.07 475245.64 483875.35 487065.25
2 ------ ------ ------ ------ ------ ------
Run 94 Run 95 Run 96
0 1436007.32 1435238.23 1444300.51
1 474328.87 475789.12 458681.11
2 ------ ------ ------
[3 rows x 96 columns]
Currently I am trying to use df.irow(0) to select all the data in row index 0.
something along the lines of:
selection = np.arange(0,13)
for i in selection:
new_df = pd.DataFrame()
data = df.irow(0)
........
then i get lost
I just don't know how to link this range with the dataframe in order to compute the mean for every 12 data points in each column.
To summarize, I want the average for every 12 runs in each column. So, I should end up with a separate dataframe with 2 * 8 average values (96/12).
any ideas?
thanks.
You can do a groupby on axis=1 (using some dummy data I made up):
>>> h = df.iloc[:2].astype(float)
>>> h.groupby(np.arange(len(h.columns))//12, axis=1).mean()
0 1 2 3 4 5 6 7
0 0.609643 0.452047 0.536786 0.377845 0.544321 0.214615 0.541185 0.544462
1 0.382945 0.596034 0.659157 0.437576 0.490161 0.435382 0.476376 0.423039
First we extract the data and force recognition of a float (the presence of the ------ row means that you've probably got an object dtype, which will make the mean unhappy.)
Then we make an array saying what groups we want to put the different columns in:
>>> np.arange(len(df.columns))//12
array([0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
1, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3,
3, 3, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 5, 5, 5, 5, 5, 5, 5, 5, 5,
5, 5, 5, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 7, 7, 7, 7, 7, 7, 7, 7,
7, 7, 7, 7], dtype=int32)
which we feed as an argument to groupby. .mean() handles the rest.
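Recent pandas releases (2.1 and later) deprecate groupby(..., axis=1), so here is a sketch of the same grouping done on the transposed frame. The data is a made-up stand-in for the two numeric rows of the question's frame, and the names h, groups and means are illustrative.

```python
import numpy as np
import pandas as pd

# Made-up stand-in for the two numeric rows: 2 rows x 96 "Run" columns.
h = pd.DataFrame(np.arange(2 * 96, dtype=float).reshape(2, 96),
                 columns=[f"Run {i}" for i in range(1, 97)])

# Group label for each column: 0 for the first 12 columns, 1 for the next 12, ...
groups = np.arange(len(h.columns)) // 12

# Transpose so the columns become rows, group those rows, then transpose back.
means = h.T.groupby(groups).mean().T

print(means)
```

The result has one column per block of 12 runs and one row per original row, i.e. a 2 x 8 frame.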
It's always best to try to use pandas methods when you can, rather than iterating over the rows. The DataFrame's iloc method is useful for extracting any number of rows.
The following example shows you how to do what you want in a two-column DataFrame. The same technique will work independent of the number of columns:
In [14]: df = pd.DataFrame({"x": [1, 2, "-"], "y": [3, 4, "-"]})
In [15]: df
Out[15]:
x y
0 1 3
1 2 4
2 - -
In [16]: df.iloc[2] = df.iloc[0:2].sum()
In [17]: df
Out[17]:
x y
0 1 3
1 2 4
2 3 7
However, in your case you want to sum each group of eight cells in df.iloc[2], so you might be better simply taking the result of the summing expression with the statement
ds = df.iloc[0:2].sum()
which with your data will have the form
col1 0
col2 1
col3 2
col4 3
...
col93 92
col94 93
col95 94
col96 95
(These numbers are representative, you will obviously see your column sums). You can then turn this into a 12x8 matrix with
ds.values.reshape(12, 8)
whose value is
array([[ 0, 1, 2, 3, 4, 5, 6, 7],
[ 8, 9, 10, 11, 12, 13, 14, 15],
[16, 17, 18, 19, 20, 21, 22, 23],
[24, 25, 26, 27, 28, 29, 30, 31],
[32, 33, 34, 35, 36, 37, 38, 39],
[40, 41, 42, 43, 44, 45, 46, 47],
[48, 49, 50, 51, 52, 53, 54, 55],
[56, 57, 58, 59, 60, 61, 62, 63],
[64, 65, 66, 67, 68, 69, 70, 71],
[72, 73, 74, 75, 76, 77, 78, 79],
[80, 81, 82, 83, 84, 85, 86, 87],
[88, 89, 90, 91, 92, 93, 94, 95]])
but summing this array will give you the sum of all elements, so instead create another DataFrame with
rs = pd.DataFrame(ds.values.reshape(12, 8))
and then sum that:
rs.sum()
giving
0 528
1 540
2 552
3 564
4 576
5 588
6 600
7 612
dtype: int64
You may find in practice that it is easier to simply create two 12x8 matrices in the first place, which you can add together before creating a dataframe which you can then sum. Much depends on how you are reading your data.
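If the goal is the mean of each run of 12 consecutive columns, as the question describes, a reshape-based sketch needs each run to end up on its own axis. That means a shape of (rows, 8, 12) reduced along the last axis. By contrast, a (12, 8) reshape summed down its columns combines every eighth value. The array below is made-up data standing in for the two numeric rows.

```python
import numpy as np

# Made-up stand-in: 2 rows x 96 values.
values = np.arange(2 * 96, dtype=float).reshape(2, 96)

# Split the 96 columns into 8 consecutive blocks of 12, then average
# within each block (the last axis).
block_means = values.reshape(2, 8, 12).mean(axis=2)

print(block_means)
```

This gives a 2 x 8 array, one mean per block of 12 runs for each row.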
