I'm trying to understand the axis parameter in pandas. I understand that it's analogous to the numpy axis, but the following example still confuses me:
import pandas as pd

a = pd.DataFrame([[0, 1, 4], [1, 2, 3]])
print(a)

   0  1  2
0  0  1  4
1  1  2  3
According to this post, axis=0 runs along the rows (fixed column), while axis=1 runs along the columns (fixed row). Running print(a.drop(1, axis=1)) yields
   0  2
0  0  4
1  1  3
which results in a dropped column, while print(a.drop(1, axis=0)) drops a row. Why? That seems backwards to me.
It's slightly confusing, but axis=0 operates on rows, axis=1 operates on columns.
So when you use df.drop(1, axis=1) you are saying: drop the column labeled 1.
The other post has df.mean(axis=1), which essentially says: calculate the mean across the columns, i.e. one mean per row.
This is similar to indexing numpy arrays, where the first index specifies the row number (0th dimension), the second index the column number (1st dimension), and so on.
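To see both behaviors side by side, here is a minimal sketch using the same frame as in the question:

import pandas as pd

a = pd.DataFrame([[0, 1, 4], [1, 2, 3]])

# for drop, axis names the dimension whose label is removed
print(a.drop(1, axis=0))   # drops the row labeled 1
print(a.drop(1, axis=1))   # drops the column labeled 1

# for reductions, axis names the dimension that gets collapsed
print(a.mean(axis=0))      # one mean per column
print(a.mean(axis=1))      # one mean per row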
This should be an easy question since it is so fundamental. See - in R, when you want to slice rows from a dataframe based on some condition, you just write the condition and it selects the corresponding rows. For example, if only the third row in the dataframe meets the condition, it returns the third row. Easy peasy.
In Python, you have to use loc. If the index matches the row numbers, then everything is great. If you have been removing rows or re-ordering them for any reason, you have to remember that loc is based on the INDEX, NOT the ROW POSITION. So if the third row of your current dataframe matches the boolean condition in the loc statement, loc will retrieve the row whose index label is 3 - which could be the 50th row, rather than your current third row. This seems like an incredibly dangerous way to select rows, so I know I am doing something wrong.
So what is the best-practice method of ensuring you select the nth row based on a boolean conditional? Is it just to use loc and "always remember to use reset_index - because if you miss it even once, your entire dataframe is wrecked"? This can't be it.
Use iloc instead of loc for integer-based (positional) indexing:
import pandas as pd

data = {'A': [1, 2, 3], 'B': [4, 5, 6], 'C': [7, 8, 9]}
df = pd.DataFrame(data, index=[1, 2, 3])
df
Dataset:
   A  B  C
1  1  4  7
2  2  5  8
3  3  6  9
Label-based index:
df.loc[1]
Results:
A 1
B 4
C 7
Integer-based (positional) index:
df.iloc[1]
Results:
A 2
B 5
C 8
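For the boolean-conditional case in the question, one position-safe pattern - a sketch, assuming you want the first matching row regardless of index labels - is to filter first and then use iloc on the result:

import pandas as pd

data = {'A': [1, 2, 3], 'B': [4, 5, 6], 'C': [7, 8, 9]}
df = pd.DataFrame(data, index=[1, 2, 3])

# boolean filtering keeps rows by mask position, so no reset_index is needed
matches = df[df['A'] > 1]
print(matches.iloc[0])   # first matching row, by position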
In pandas, axis=0 represents rows and axis=1 represents columns.
Therefore, to get the sum of values in each row, df.sum(axis=0) should be called.
But it returns the sum of values in each column, and vice versa. Why???
import pandas as pd

df = pd.DataFrame({"x": [1, 2, 3, 4, 5], "y": [2, 4, 6, 8, 10]})
df.sum(axis=0)
Dataframe:
   x   y
0  1   2
1  2   4
2  3   6
3  4   8
4  5  10
Output:
x    15
y    30
Expected Output:
0     3
1     6
2     9
3    12
4    15
I think the right way to interpret the axis parameter is what axis you sum 'over' (or 'across'), rather than the 'direction' the sum is computed in. Specifying axis=0 computes the sum over the rows, giving you a total for each column; axis=1 computes the sum across the columns, giving you a total for each row.
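A minimal sketch of both directions on the frame from the question:

import pandas as pd

df = pd.DataFrame({"x": [1, 2, 3, 4, 5], "y": [2, 4, 6, 8, 10]})

print(df.sum(axis=0))   # over the rows: one total per column (x 15, y 30)
print(df.sum(axis=1))   # across the columns: one total per row (3, 6, 9, 12, 15)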
I was reading the source code of the pandas project, and I think this comes from NumPy, where the parameter is used the same way (0 sums vertically and 1 horizontally); additionally, pandas uses numpy under the hood to perform this sum.
In this link you can check that pandas uses the numpy.cumsum function to perform the sum.
And this link is for the numpy documentation.
If you are looking for a way to remember how to use the axis parameter, anant's answer is a good approach: interpret the sum as over the axis instead of across it. So when 0 is specified, you are computing the sum over the rows (iterating over the index, to be more pandas-doc compliant). When axis is 1, you are iterating over the columns.
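The same convention is visible in plain numpy - a small sketch, nothing pandas-specific:

import numpy as np

arr = np.array([[1, 2],
                [3, 4],
                [5, 6]])

print(np.sum(arr, axis=0))   # collapses the rows: [ 9 12]
print(np.sum(arr, axis=1))   # collapses the columns: [ 3  7 11]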
Let's say I write the code:
import numpy as np
import pandas as pd

df2 = pd.DataFrame(np.random.randint(1, 5, 4).reshape((2, 2)), columns=['A', 'B'])
print("dataframe: \n", df2)
print("mean: \n", df2.mean(0))
df2 = df2.apply(lambda x: x - [1, 2], axis=0)
print("altered df2: \n", df2)
It gives me the results:
dataframe:
   A  B
0  1  1
1  2  2
mean:
A    1.5
B    1.5
dtype: float64
altered df2:
   A  B
0  0  0
1  0  0
So first, I've asked it to give me the mean along axis=0. In my mind, this means treat each row as a vector and find the average of those vectors. It seems that Pandas agrees with me on this one!
However, I then use the DataFrame.apply() function, and I again specify axis=0. Here I expect the same logic to apply - namely, that the operation is executed on each row. In this case it's lambda x: x - [1, 2]. The result I expect is:
   A  B
0  0 -1
1  1  0
But instead, when I specify axis=0 (rows), it actually does the operation on the columns.
I'm having a very hard time with pandas, specifically wrapping my head around how it indexes rows and columns, and this has only added to the confusion. As it stands, every time I think I understand how things work, I find out I'm wrong, so nothing ever sticks. I'm asking for an easy way to think about this stuff so that it sticks in my dyslexic little head.
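For reference, a quick check - using a fixed frame matching the printout above - suggests that apply with axis=1 produces the table expected above:

import pandas as pd

df2 = pd.DataFrame([[1, 1], [2, 2]], columns=['A', 'B'])

# with axis=1, apply passes each row to the function,
# so the subtraction happens element-wise within each row
print(df2.apply(lambda x: x - [1, 2], axis=1))
#    A  B
# 0  0 -1
# 1  1  0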
I have a huge dataframe which has values and blanks/NAs in it. I want to remove the blanks from the dataframe and move the next values up in the column. Consider the sample dataframe below.
import pandas as pd
import numpy as np
df = pd.DataFrame(np.random.randn(5, 4))
df.iloc[1, 2] = np.nan
df.iloc[0, 1] = np.nan
df.iloc[2, 1] = np.nan
df.iloc[2, 0] = np.nan
df
          0         1         2         3
0  1.857476       NaN -0.462941 -0.600606
1  0.000267 -0.540645       NaN  0.492480
2       NaN       NaN -0.803889  0.527973
3  0.566922  0.036393 -1.584926  2.278294
4 -0.243182 -0.221294  1.403478  1.574097
I want my output to be as below
          0         1         2         3
0  1.857476 -0.540645 -0.462941 -0.600606
1  0.000267  0.036393 -0.803889  0.492480
2  0.566922 -0.221294 -1.584926  0.527973
3 -0.243182            1.403478  2.278294
4                                1.574097
I want the NaNs to be removed and the next values to move up. df.shift was not helpful. I tried multiple loops and if statements and achieved the desired result, but is there a better way to get it done?
You can use apply with dropna:
import numpy as np
import pandas as pd

np.random.seed(100)
df = pd.DataFrame(np.random.randn(5, 4))
df.iloc[1, 2] = np.nan
df.iloc[0, 1] = np.nan
df.iloc[2, 1] = np.nan
df.iloc[2, 0] = np.nan
print(df)
          0         1         2         3
0 -1.749765       NaN  1.153036 -0.252436
1  0.981321  0.514219       NaN -1.070043
2       NaN       NaN -0.458027  0.435163
3 -0.583595  0.816847  0.672721 -0.104411
4 -0.531280  1.029733 -0.438136 -1.118318
df1 = df.apply(lambda x: pd.Series(x.dropna().values))
print(df1)

          0         1         2         3
0 -1.749765  0.514219  1.153036 -0.252436
1  0.981321  0.816847 -0.458027 -1.070043
2 -0.583595  1.029733  0.672721  0.435163
3 -0.531280       NaN -0.438136 -0.104411
4       NaN       NaN       NaN -1.118318
Then, if you need to replace NaN with empty strings, be aware that this creates mixed values - strings with numerics - so some functions may break:
df1 = df.apply(lambda x: pd.Series(x.dropna().values)).fillna('')
print(df1)

          0         1         2         3
0  -1.74977  0.514219   1.15304 -0.252436
1  0.981321  0.816847 -0.458027 -1.070043
2 -0.583595   1.02973  0.672721  0.435163
3  -0.53128           -0.438136 -0.104411
4                               -1.118318
A numpy approach
The idea is to sort each column by np.isnan so that NaNs are put last. I use kind='mergesort' (a stable sort) to preserve the order within the non-NaN values. Finally, I index the array and reassign it. I follow this up with a fillna.
v = df.values
i = np.arange(v.shape[1])                     # column indices 0..n-1
a = np.isnan(v).argsort(0, kind='mergesort')  # per-column row order: non-NaN first
v[:] = v[a, i]                                # fancy-index and reassign in place
print(df.fillna(''))
          0         1         2         3
0   1.85748 -0.540645 -0.462941 -0.600606
1  0.000267  0.036393 -0.803889  0.492480
2  0.566922 -0.221294  -1.58493  0.527973
3 -0.243182             1.40348  2.278294
4                                1.574097
If you didn't want to alter the dataframe in place:
v = df.values
i = np.arange(v.shape[1])
a = np.isnan(v).argsort(0, kind='mergesort')
pd.DataFrame(v[a, i], df.index, df.columns).fillna('')
The point of this is to leverage numpy's speed.
Adding on to solution by piRSquared:
This shifts all the values to the left instead of up.
If not all values are numbers, use pd.isnull
v = df.values
a = [[n] * v.shape[1] for n in range(v.shape[0])]
b = pd.isnull(v).argsort(axis=1, kind='mergesort')
# a is a matrix used to reference the row index,
# b is a matrix used to reference the column index;
# taking an entry from a and the respective entry from b (same index),
# we have a position that references an entry in v
v[a, b]
A bit of explanation:
a is a list of length v.shape[0], and it looks something like this:
[[0, 0, 0, 0],
[1, 1, 1, 1],
[2, 2, 2, 2],
[3, 3, 3, 3],
[4, 4, 4, 4],
...
What happens here is: v is m x n, and I have made both a and b m x n as well, so we pair up every entry (i, j) of a and b to get the element of v at the row given by a[i][j] and the column given by b[i][j]. So if a and b both looked like the matrix above, then v[a, b] would return a matrix whose first row contains n copies of v[0][0], whose second row contains n copies of v[1][1], and so on.
In piRSquared's solution, his i is a 1-D list, not a matrix, so it is broadcast and reused v.shape[0] times - once for every row. Similarly, we could have done:
a = [[n] for n in range(v.shape[0])]
# which looks like
# [[0],[1],[2],[3]...]
# since we are trying to indicate the row indices of the matrix v as opposed to
# [0, 1, 2, 3, ...] which refers to column indices
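Here is a tiny self-contained demonstration of the pairing described above, with made-up values just to show the indexing:

import numpy as np

v = np.array([[10, 11],
              [20, 21]])

a = [[0, 0], [1, 1]]       # row index for each output position
b = [[1, 0], [0, 1]]       # column index for each output position
print(v[a, b])             # [[11 10]
                           #  [20 21]]

a_col = np.array([[0], [1]])   # a column vector broadcasts across the columns
print(v[a_col, b])             # same result via broadcasting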
Let me know if anything is unclear. Thanks :)
As a pandas beginner I wasn't immediately able to follow the reasoning behind @jezrael's
df.apply(lambda x: pd.Series(x.dropna().values))
but I figured out that it works by resetting the index of the column. df.apply (by default) works column-by-column, treating each column as a Series. Using dropna() removes the NaNs but doesn't change the index of the remaining numbers, so when the column is added back to the dataframe the numbers go back to their original positions, as their indices are still the same, and the empty spaces are filled with NaN again - recreating the original dataframe and achieving nothing.
By resetting the index of the column, in this case by changing the series to an array (using .values) and back to a series (using pd.Series), only the empty spaces after all the numbers (i.e. at the bottom of the column) are filled with NaN. The same can be accomplished by
df.apply(lambda x: x.dropna().reset_index(drop = True))
drop=True for reset_index keeps the old index from becoming a new column.
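A minimal illustration of the alignment behavior described above, using a small made-up column:

import numpy as np
import pandas as pd

s = pd.Series([1.0, np.nan, 3.0])

# dropna keeps the original labels (0 and 2), so assigning the result
# back into a frame would realign on those labels and restore the gap
print(s.dropna())                    # index 0 and 2
print(pd.Series(s.dropna().values))  # index reset to 0 and 1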
I would have posted this as a comment on @jezrael's answer but my rep isn't high enough!
I have a pandas dataframe that is populated from a CSV. After that I read the columns and iterate element by element within each row, writing each element to a file. My problem is that I have elements in a row that I want joined into one element.
Say I have columns A through Z, and their elements are 1 to 26. Let's say I want to join the numbers 9 and 10 (columns I and J) into one element only (columns I and J become one, and its value becomes [9, 10]).
How do I achieve that using pandas (while iterating)?
My code is long but you can find it here. I've tried groupby, but I think it only works with booleans and ints (correct me if I'm wrong).
Also, I'm pretty new to Python; any advice on my code would be much appreciated!!
Here is an example. It adds a new column where each entry is a list of the values from two other columns. I hope it helps!
import numpy as np
import pandas as pd

df = pd.DataFrame(np.random.randn(10, 4))
# build a list-valued column by pairing up columns 2 and 3, row by row
df[4] = [[df[2][x], df[3][x]] for x in range(df.shape[0])]
You can concat the columns, then convert to a list using numpy's tolist():
In [56]: df = pd.DataFrame(dict(A=[1,1,1], I=[9,9,9], J=[10,10,10]))
In [57]: df
Out[57]:
A I J
0 1 9 10
1 1 9 10
2 1 9 10
In [58]: df["IJ"] = pd.concat((df.I, df.J), axis=1).values.tolist()
In [59]: df.drop(["I","J"], axis=1)
Out[59]:
A IJ
0 1 [9, 10]
1 1 [9, 10]
2 1 [9, 10]
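Equivalently - a small variation, assuming the same df - you can skip the concat and take the two columns directly:

df["IJ"] = df[["I", "J"]].values.tolist()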