df.append() is not appending to the DataFrame - python

I formulated this question about adding rows WITH an index, but it is not yet clear to me how/why this happens when there is no index:
columnsList=['A','B','C','D']
df8=pd.DataFrame(columns=columnsList)
L=['value aa','value bb','value cc','value dd']
s = pd.Series(dict(zip(df8.columns, L)))
df8.append(s,ignore_index=True)
df8.append(s,ignore_index=True)
I expect a 2x4 DataFrame here. Nevertheless, no values were added, nor was any error raised.
print(df8.shape)
#>>> (0,4)
Why is the series not being added, and why is no error given?
If I try to add a row with loc, an index label is added:
df8.loc[df8.index.max() + 1, :] = [4, 5, 6,7]
print(df8)
result:
     A  B  C  D
NaN  4  5  6  7
I guess neither loc nor iloc can be used to append rows without an index label (i.e. loc adds the index label NaN, and iloc cannot be used when the position is beyond the current number of rows in the DataFrame).

DataFrame.append is not an in-place operation. From the docs,
DataFrame.append(other, ignore_index=False, verify_integrity=False, sort=None)
Append rows of other to the end of this frame, returning a new object.
Columns not in this frame are added as new columns.
You need to assign the result back.
df8 = df8.append([s] * 2, ignore_index=True)
df8
A B C D
0 value aa value bb value cc value dd
1 value aa value bb value cc value dd
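A note for newer pandas: DataFrame.append was deprecated in pandas 1.4 and removed in 2.0, where pd.concat is the replacement. A minimal sketch using the question's own names:
import pandas as pd
columnsList = ['A', 'B', 'C', 'D']
df8 = pd.DataFrame(columns=columnsList)
L = ['value aa', 'value bb', 'value cc', 'value dd']
s = pd.Series(dict(zip(df8.columns, L)))
# concat expects DataFrames, so turn the Series into a one-row frame first
row = s.to_frame().T
df8 = pd.concat([df8, row, row], ignore_index=True)
print(df8.shape)  # (2, 4)
Like append, concat returns a new object, so the result still has to be assigned back.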

The statement data.append(sub_data) does not work on its own, but data = data.append(sub_data) will.
Assigning it back solved the issue for me. Good tip, not easily found elsewhere.

Related

Selecting rows based on Boolean values in a non-dangerous way

This is an easy question since it is so fundamental. See, in R, when you want to slice rows from a dataframe based on some condition, you just write the condition and it selects the corresponding rows. For example, if only the third row in the dataframe meets the condition, it returns the third row. Easy peasy.
In Python, you have to use loc. IF the index matches the row numbers, then everything is great. IF you have been removing rows or re-ordering them for any reason, you have to remember that, since loc is based on INDEX, NOT ROW POSITION. So if in your current dataframe the third row matches your boolean conditional in the loc statement, it will retrieve the row with index label 3, which could be the 50th row rather than your current third row. This seems like an incredibly dangerous way to select rows, so I know I am doing something wrong.
So what is the best-practice method of ensuring you select the nth row based on a boolean conditional? Is it just to use loc and "always remember to use reset_index, otherwise if you miss it even once, your entire dataframe is wrecked"? This can't be it.
Use iloc instead of loc for integer-based indexing:
data = {'A': [1, 2, 3], 'B': [4, 5, 6], 'C': [7, 8, 9]}
df = pd.DataFrame(data, index=[1, 2, 3])
df
Dataset:
   A  B  C
1  1  4  7
2  2  5  8
3  3  6  9
Label-based index:
df.loc[1]
Results:
A 1
B 4
C 7
Integer-based:
df.iloc[1]
Results:
A 2
B 5
C 8
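For the boolean-conditional case the question actually asks about: a mask built from the same dataframe aligns by label, so df[mask] is safe regardless of how the index is numbered. To get the nth match by position, convert the mask to positions and use iloc. A small sketch with made-up values:
import numpy as np
import pandas as pd
df = pd.DataFrame({'A': [1, 2, 3], 'B': [4, 5, 6]}, index=[50, 3, 7])
# masks align by label, so this works with any index
print(df[df['A'] > 1])
# positions of the matching rows, for positional selection
positions = np.flatnonzero((df['A'] > 1).to_numpy())
print(df.iloc[positions[0]])  # first matching row by position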

Pandas apply function and update copy of dataframe

I have data frames
df = pd.DataFrame({'A':[1,2,2,1],'B':[20,21,22,32],'C':[4,5,6,7],'D':[99,98,97,96]})
dfcopy = df.copy()
I want to apply a function to values in df columns 'B' and 'C' based on the value in column 'A', and then store the result in the corresponding rows of dfcopy.
For example, for each row where 'A' is 1, get the 'B' and 'C' values for that row, apply the function, and store the results in dfcopy. For the first row where 'A' == 2, the value for 'B' is 21 and 'C' is 5. Assume the function is multiplication by a 2x2 ones matrix: np.dot(np.ones((2,2)), np.array([[21],[5]])). Then we want dfcopy.loc[1,'B'] = 26 and dfcopy.loc[1,'C'] = 26. Then I want to repeat for a different value in A, until the function has been applied once for each distinct value of A.
Lastly, I don't want to iterate row by row, check the value in A, and apply the function each time. This is because there is a different operation for each value of A (i.e. the np.ones((2,2)) will be replaced by values from a file corresponding to the value in A), and I don't want to repeat it.
I'm sure I can force a solution (e.g. by looping and setting values), but I'm guessing there is an elegant way to do this with Pandas API. I just can't find it.
In the example below I picked different matrices so it's obvious that I have applied them.
df = pd.DataFrame({'A':[1,2,2,1],'B':[20,21,22,32],'C':[4,5,6,7],'D':[99,98,97,96]})
# giving the matrices explicit 'B'/'C' columns keeps the dot result aligned with df for the assignment
matrices = [None,
            pd.DataFrame([[1, 0], [0, 0]], index=["B", "C"], columns=["B", "C"]),
            pd.DataFrame([[0, 0], [0, 1]], index=["B", "C"], columns=["B", "C"])]
df[["B", "C"]] = pd.concat(df.loc[df["A"] == i, ["B", "C"]].dot(matrices[i]) for i in set(df["A"]))
   A   B  C   D
0  1  20  0  99
1  2   0  5  98
2  2   0  6  97
3  1  32  0  96
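The same per-group matrix multiply can also be phrased with groupby; a sketch, assuming the matrices carry 'B'/'C' labels as above. group_keys=False keeps the original row index, so the assignment lands back in the right rows:
import pandas as pd
df = pd.DataFrame({'A':[1,2,2,1],'B':[20,21,22,32],'C':[4,5,6,7],'D':[99,98,97,96]})
matrices = {1: pd.DataFrame([[1, 0], [0, 0]], index=["B", "C"], columns=["B", "C"]),
            2: pd.DataFrame([[0, 0], [0, 1]], index=["B", "C"], columns=["B", "C"])}
# g.name is the group key, i.e. the value of 'A' shared by the group's rows
df[["B", "C"]] = (df.groupby("A", group_keys=False)
                    .apply(lambda g: g[["B", "C"]].dot(matrices[g.name])))
print(df)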

Question about drop=True in pd.DataFrame.reset_index()

In a Pandas dataframe, it's possible to reset the index using the reset_index() method. One optional argument is drop=True which according to the documentation:
drop : bool, default False
Do not try to insert index into dataframe columns.
This resets the index to the default integer index.
My question is, what does the first sentence mean? Will it try to convert an integer index to a new column in my df if I leave it False?
Also, will my row order be preserved or should I also sort to ensure proper ordering?
As you can see below, df.reset_index() will move the index into the dataframe as a column. If the index was just a generic numerical index, you probably don't care about it and can just discard it. Below is a simple dataframe, but I dropped the first row just to have differing values in the index.
df = pd.DataFrame([['a', 10], ['b', 20], ['c', 30], ['d', 40]], columns=['letter','number'])
df = df[df.number > 10]
print(df)
# letter number
# 1 b 20
# 2 c 30
# 3 d 40
The default behavior now shows a column named index, which holds the previous index. You can see that df['index'] matches the index from above, and the index itself has been renumbered starting from 0.
print(df.reset_index())
# index letter number
# 0 1 b 20
# 1 2 c 30
# 2 3 d 40
drop=True doesn't pretend the old index was important; it simply discards it and gives you a fresh default index.
print(df.reset_index(drop=True))
# letter number
# 0 b 20
# 1 c 30
# 2 d 40
Regarding row order, reset_index preserves it: the method only replaces the index labels, it never reorders rows. If you are performing an aggregate function, though, make sure the data is ordered properly for the aggregation first.
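A quick check (made-up data) that reset_index renumbers but does not reorder:
import pandas as pd
df = pd.DataFrame({'letter': ['a', 'b', 'c'], 'number': [10, 20, 30]})
df = df.sample(frac=1, random_state=0)  # shuffle the rows; the index is now out of order
print(df.reset_index(drop=True))        # rows keep their shuffled order, index becomes 0..n-1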

How to get values from one dataframe based on the position of a value in another dataframe

I have two dataframes with the same size.
df1
1 5 3
6 5 1
2 4 9
df2
a b c
d e f
g h i
I want to get the corresponding value in df2 that sits in the same position as the maximum value of each row in df1. For example, row 0 has element [0,1] (the 5) as its max, so I'd like to get the value at [0,1] of df2 in return.
Desired result would be:
df3
b
d
i
Thank you so much!
Don't use for loops. numpy can be handy here
vals = df2.values[np.arange(len(df2)), df1.values.argmax(1)]
Of course, you can then do df3 = pd.DataFrame(vals, columns=['col'])
col
0 b
1 d
2 i
S = df1.idxmax(axis=1)  # column label of each row's maximum
df3 = pd.DataFrame(index=df1.index, columns=['col'])
for p in range(len(df1)):
    df3.iloc[p, 0] = df2.iloc[p, S.iloc[p]]  # assumes default integer columns, so label == position
Try the code:
>>> for i, j in enumerate(df1.idxmax(axis=1)):
...     print(df2.iloc[i, j])
...
b
d
i
idxmax gives the index of the maximum value along an axis, either per column (axis=0, the default) or per row (axis=1); here we want the maximum column of each row, hence axis=1.
Your problem has two parts:
1- Finding the maximum value of each row
2- Choosing the maximum column of each row with values found in step one
You can easily use the lookup function. Its first argument is the row labels and its second is the column labels: df1.idxmax(axis=1) finds the maximum column of each row (step one), and lookup then selects those positions from df2 (step two).
df2.lookup(range(len(df1)), df1.idxmax(axis=1)) #output => array(['b', 'd', 'i'], dtype=object)
If an array does not work for you, you can also create a data frame from these values by simply passing the result to pd.DataFrame:
pd.DataFrame(df2.lookup(range(len(df1)), df1.idxmax(axis=1)))
One good feature of this solution is that it avoids loops, which makes it efficient.
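Note for newer pandas: DataFrame.lookup was deprecated in 1.2 and removed in 2.0. On current versions the equivalent is plain numpy fancy indexing, essentially the first answer above:
import numpy as np
import pandas as pd
df1 = pd.DataFrame([[1, 5, 3], [6, 5, 1], [2, 4, 9]])
df2 = pd.DataFrame([['a', 'b', 'c'], ['d', 'e', 'f'], ['g', 'h', 'i']])
rows = np.arange(len(df1))
cols = df1.to_numpy().argmax(axis=1)  # position of each row's maximum
print(df2.to_numpy()[rows, cols])     # ['b' 'd' 'i']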

How to remove blanks/NA's from dataframe and shift the values up

I have a huge dataframe which has values and blanks/NA's in it. I want to remove the blanks from the dataframe and move the next values up in the column. Consider below sample dataframe.
import pandas as pd
import numpy as np
df = pd.DataFrame(np.random.randn(5,4))
df.iloc[1,2] = np.NaN
df.iloc[0,1] = np.NaN
df.iloc[2,1] = np.NaN
df.iloc[2,0] = np.NaN
df
0 1 2 3
0 1.857476 NaN -0.462941 -0.600606
1 0.000267 -0.540645 NaN 0.492480
2 NaN NaN -0.803889 0.527973
3 0.566922 0.036393 -1.584926 2.278294
4 -0.243182 -0.221294 1.403478 1.574097
I want my output to be as below
          0          1          2          3
0  1.857476  -0.540645  -0.462941  -0.600606
1  0.000267   0.036393  -0.803889   0.492480
2  0.566922  -0.221294  -1.584926   0.527973
3 -0.243182              1.403478   2.278294
4                                   1.574097
I want the NaN to be removed and the next value to move up. df.shift was not helpful. I tried with multiple loops and if statements and achieved the desired result, but is there any better way to get it done?
You can use apply with dropna:
np.random.seed(100)
df = pd.DataFrame(np.random.randn(5,4))
df.iloc[1,2] = np.NaN
df.iloc[0,1] = np.NaN
df.iloc[2,1] = np.NaN
df.iloc[2,0] = np.NaN
print (df)
0 1 2 3
0 -1.749765 NaN 1.153036 -0.252436
1 0.981321 0.514219 NaN -1.070043
2 NaN NaN -0.458027 0.435163
3 -0.583595 0.816847 0.672721 -0.104411
4 -0.531280 1.029733 -0.438136 -1.118318
df1 = df.apply(lambda x: pd.Series(x.dropna().values))
print (df1)
0 1 2 3
0 -1.749765 0.514219 1.153036 -0.252436
1 0.981321 0.816847 -0.458027 -1.070043
2 -0.583595 1.029733 0.672721 0.435163
3 -0.531280 NaN -0.438136 -0.104411
4 NaN NaN NaN -1.118318
And then, if you need to replace NaN with empty space, be aware that this creates mixed values (strings with numerics), so some functions can break:
df1 = df.apply(lambda x: pd.Series(x.dropna().values)).fillna('')
print (df1)
          0         1          2          3
0  -1.74977  0.514219    1.15304  -0.252436
1  0.981321  0.816847  -0.458027  -1.070043
2 -0.583595   1.02973   0.672721   0.435163
3  -0.53128            -0.438136  -0.104411
4                                 -1.118318
A numpy approach:
The idea is to sort each column by np.isnan so that the np.nans are put last. I use kind='mergesort' because it is stable, which preserves the order of the non-NaN values. Finally, I slice the array and reassign it, following up with a fillna.
v = df.values
i = np.arange(v.shape[1])
a = np.isnan(v).argsort(0, kind='mergesort')
v[:] = v[a, i]
print(df.fillna(''))
          0          1          2          3
0   1.85748  -0.540645  -0.462941  -0.600606
1  0.000267   0.036393  -0.803889   0.492480
2  0.566922  -0.221294   -1.58493   0.527973
3 -0.243182               1.40348   2.278294
4                                   1.574097
If you didn't want to alter the dataframe in place
v = df.values
i = np.arange(v.shape[1])
a = np.isnan(v).argsort(0, kind='mergesort')
pd.DataFrame(v[a, i], df.index, df.columns).fillna('')
The point of this is to leverage numpy's speed.
naive time test (timing results omitted)
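On numpy 1.15+, np.take_along_axis expresses the same stable column-wise reorder a bit more directly; a sketch assuming the same all-numeric df:
import numpy as np
import pandas as pd
v = df.to_numpy()
order = np.isnan(v).argsort(axis=0, kind='mergesort')  # stable, so NaNs sink to the bottom
out = pd.DataFrame(np.take_along_axis(v, order, axis=0),
                   index=df.index, columns=df.columns)
print(out.fillna(''))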
Adding on to the solution by piRSquared:
This shifts all the values to the left instead of up.
If not all the values are numbers, use pd.isnull instead of np.isnan:
v = df.values
a = [[n]*v.shape[1] for n in range(v.shape[0])]
b = pd.isnull(v).argsort(axis=1, kind = 'mergesort')
# a is a matrix used to reference the row index,
# b is a matrix used to reference the column index
# taking an entry from a and the respective entry from b (Same index),
# we have a position that references an entry in v
v[a, b]
A bit of explanation:
a is a list of length v.shape[0], and it looks something like this:
[[0, 0, 0, 0],
[1, 1, 1, 1],
[2, 2, 2, 2],
[3, 3, 3, 3],
[4, 4, 4, 4],
...
What happens here is that v is m x n, and I have made both a and b m x n as well. We pair up every entry (i, j) of a and b, and take the element of v at row a[i][j] and column b[i][j]. So if a and b both looked like the matrix above, then v[a, b] would return a matrix where the first row contains n copies of v[0][0], the second row contains n copies of v[1][1], and so on.
In piRSquared's solution, i is a 1-D list, not a matrix, so it gets broadcast and is reused once for every row. Similarly, we could have done:
a = [[n] for n in range(v.shape[0])]
# which looks like
# [[0],[1],[2],[3]...]
# since we are trying to indicate the row indices of the matrix v as opposed to
# [0, 1, 2, 3, ...] which refers to column indices
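A tiny standalone demo of the pairing described above, with made-up index matrices:
import numpy as np
v = np.arange(12).reshape(3, 4)
a = [[0, 0, 0, 0], [1, 1, 1, 1], [2, 2, 2, 2]]  # row index for each output cell
b = [[3, 2, 1, 0], [0, 1, 2, 3], [1, 1, 1, 1]]  # column index for each output cell
print(v[a, b])
# [[ 3  2  1  0]
#  [ 4  5  6  7]
#  [ 9  9  9  9]]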
Let me know if anything is unclear,
Thanks :)
As a pandas beginner I wasn't immediately able to follow the reasoning behind @jezrael's
df.apply(lambda x: pd.Series(x.dropna().values))
but I figured out that it works by resetting the index of each column. df.apply (by default) works column by column, treating each column as a Series. Using dropna() removes the NaNs but doesn't change the index of the remaining numbers, so when such a column is added back to the dataframe the numbers go back to their original positions (their indices are still the same) and the empty spaces are filled with NaN again, recreating the original dataframe and achieving nothing.
By resetting the index of the column, in this case by changing the Series to an array (using .values) and back to a Series (using pd.Series), only the empty spaces after all the numbers (i.e. at the bottom of the column) are filled with NaN. The same can be accomplished by
df.apply(lambda x: x.dropna().reset_index(drop = True))
Passing drop=True to reset_index keeps the old index from becoming a new column.
I would have posted this as a comment on @jezrael's answer, but my rep isn't high enough!
