I am trying to create a new column based on two existing columns. Say I want to create a new column z; it should take the value of y when y is not missing, and the value of x when y is missing. So in this case, I expect z to be [1, 8, 10, 8].
x y
0 1 NaN
1 2 8
2 4 10
3 8 NaN
You can use apply with option axis=1. Then your solution is pretty concise.
df['z'] = df.apply(lambda row: row.y if pd.notnull(row.y) else row.x, axis=1)
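For reference, a minimal, self-contained run of this approach, assuming the example frame from the question:
import numpy as np
import pandas as pd

# the example frame from the question
df = pd.DataFrame({'x': [1, 2, 4, 8], 'y': [np.nan, 8, 10, np.nan]})

# take y where it is present, otherwise fall back to x
df['z'] = df.apply(lambda row: row.y if pd.notnull(row.y) else row.x, axis=1)
print(df['z'].tolist())  # [1.0, 8.0, 10.0, 8.0]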
The new column 'z' gets its values from column 'y' using df['z'] = df['y']. This brings the missing values over too, so fill them in with fillna using column 'x'. Chain these two actions:
>>> df['z'] = df['y'].fillna(df['x'])
>>> df
x y z
0 1 NaN 1
1 2 8 8
2 4 10 10
3 8 NaN 8
Use np.where:
In [3]:
df['z'] = np.where(df['y'].isnull(), df['x'], df['y'])
df
Out[3]:
x y z
0 1 NaN 1
1 2 8 8
2 4 10 10
3 8 NaN 8
Here np.where evaluates the boolean condition and, where it is True, takes the value from df['x'], otherwise from df['y'].
Let's say the DataFrame is called df. First copy the y column.
df["z"] = df["y"].copy()
Then set the NaN locations of z to the values of x at those same locations.
import numpy as np
df.z[np.isnan(df.z)] = df.x[np.isnan(df.z)]
>>> df
x y z
0 1 NaN 1
1 2 8 8
2 4 10 10
3 8 NaN 8
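A side note: the chained df.z[...] = ... assignment can raise pandas' SettingWithCopyWarning; an equivalent single .loc assignment on the same frame (a sketch) is:
mask = df['z'].isna()          # rows where z is still missing
df.loc[mask, 'z'] = df.loc[mask, 'x']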
I'm not sure if I understand the question, but would this be what you're looking for?
"if y[i]" will skip if the value is none.
for i in range(len(x)):
    if y[i]:
        z.append(y[i])
    else:
        z.append(x[i])
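For completeness, a runnable version of this idea against the question's DataFrame (a sketch; note that NaN is truthy in Python, so the missing-value test is made explicit with pd.notnull):
import pandas as pd

x = df['x'].tolist()
y = df['y'].tolist()

z = []
for i in range(len(x)):
    # test for missing values explicitly, since bool(float('nan')) is True
    if pd.notnull(y[i]):
        z.append(y[i])
    else:
        z.append(x[i])

df['z'] = z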
The update method does almost exactly this. The only caveat is that update will do so in place so you must first create a copy:
df['z'] = df.x.copy()
df.z.update(df.y)
In the above example you start with x and replace each value with the corresponding value from y, as long as the new value is not NaN.
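A quick check of the result, assuming the example frame from the question:
import numpy as np
import pandas as pd

df = pd.DataFrame({'x': [1, 2, 4, 8], 'y': [np.nan, 8, 10, np.nan]})
df['z'] = df.x.copy()
df.z.update(df.y)          # non-NaN values of y overwrite z in place
print(df['z'].tolist())    # [1.0, 8.0, 10.0, 8.0]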
Related
I'm aiming to drop rows in a pandas df where a row is equal to a specific value. However, I want to extend this so that it also drops associated rows grouped by a separate column. For instance, I want to drop all rows where Label is A or D, but I also want to drop the associated rows in Num from the same group.
import pandas as pd
df = pd.DataFrame({
    'Num': [1, 1, 1, 2, 2, 3, 3, 4, 4, 4],
    'Label': ['X', 'X', 'A', 'Y', 'Y', 'Y', 'Y', 'Y', 'Y', 'D'],
})
df = df.groupby('Num').filter(lambda x: (x['Label'].isin(['A','D'])).any())
intended output:
Num Label
3 2 Y
4 2 Y
5 3 Y
6 3 Y
You are close, just add negation:
df.groupby('Num').filter(lambda x: ~x['Label'].isin(['A','D']).any())
Output:
Num Label
3 2 Y
4 2 Y
5 3 Y
6 3 Y
Let us try using isin without groupby:
out = df.loc[~df.Num.isin(df.loc[df.Label.isin(['A','D']),'Num'])]
Out[108]:
Num Label
3 2 Y
4 2 Y
5 3 Y
6 3 Y
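Reading that one-liner inside out, with a hypothetical intermediate variable:
# Num values of every row whose Label is A or D
bad_nums = df.loc[df.Label.isin(['A', 'D']), 'Num']

# keep only the rows whose Num is not in that set
out = df.loc[~df.Num.isin(bad_nums)]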
I moved from R to python about a year ago, and I still find pandas inscrutable sometimes. Here's an example:
import pandas as pd
a = pd.DataFrame(dict(x = [1,2,3], y = [4,5,6], z = [7,8,9]))
a
Out[34]:
x y z
0 1 4 7
1 2 5 8
2 3 6 9
I want to replace this part:
a.loc[a.x>1, "y":]
Out[35]:
y z
1 5 8
2 6 9
with this part:
a.loc[a.x<3, 'x':"y"]
Out[36]:
x y
0 1 4
1 2 5
so I tried to just do this:
a.loc[a.x>1, "y":] = a.loc[a.x<3, 'x':"y"]
but I got this:
a
Out[38]:
x y z
0 1.0 4.0 7.0
1 2.0 5.0 NaN
2 3.0 NaN NaN
What the heck. I'm guessing that this has something to do with indexing?
It works when I convert the right-hand part into a numpy array:
a.loc[a.x>1, "y":] = np.array(a.loc[a.x<3, 'x':"y"])
a
Out[44]:
x y z
0 1 4 7
1 2 1 4
2 3 2 5
presumably because this throws away all of the metadata that's tripping me up.
I have two questions:
What's the rationale behind pandas' behavior here? It seems a bit obtuse to me, but my expectations were shaped by my background in R (and MATLAB before that), and it's likely that there is a good reason.
More concretely: what's the pandonic way to do what I'm attempting?
Your question is better represented with a picture:
The red square marks the cells you retrieve with .loc (which returns a view), and the blue square marks the cells you assign the values to. Since pandas aligns on the index and the only overlapping value is 5, the rest of the red square is filled with NaN.
You can assign the values without the index by using a.loc[a.x<3, 'x':'y'].values or a.loc[a.x<3, 'x':'y'].to_numpy().
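For example, the same assignment with .to_numpy() (a sketch on the frame above):
# .to_numpy() (or .values) drops the index, so the assignment is purely positional
a.loc[a.x > 1, "y":] = a.loc[a.x < 3, "x":"y"].to_numpy()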
I have a big DataFrame that I need to split into two (A and B), with the same number of rows for each value of a certain column in A and in B. That column has over 700 unique values, all of them strings. Here is an example:
DataFrame
Price Type
1 X
2 Y
3 Y
4 X
5 X
6 X
7 Y
8 Y
When splitting it (randomly), I should get two X rows and two Y rows in each of DataFrame A and DataFrame B, like:
A
Price Type
1 X
5 X
2 Y
3 Y
B
Price Type
4 X
6 X
7 Y
8 Y
Thanks in advance!
You can use groupby().cumcount() to enumerate the rows within Type, then %2 to divide rows into two groups:
df['groups'] = df.groupby('Type').cumcount()%2
A,B = df[df['groups']==0], df[df['groups']==1]
Output:
**A**
Price Type groups
0 1 X 0
1 2 Y 0
4 5 X 0
6 7 Y 0
**B**
Price Type groups
2 3 Y 1
3 4 X 1
5 6 X 1
7 8 Y 1
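If the helper column is not wanted afterwards, it can be dropped (a small follow-up, not part of the answer above):
A = A.drop(columns='groups')
B = B.drop(columns='groups')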
Could you group by the value of Type, assign A/B to half of each group as a new column, and then copy only the rows with the label A or B assigned? If you need an exact split you could base it on the size of the group; a sketch follows below.
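A rough sketch of that idea; the helper column name 'half' and the shuffling details are assumptions, not part of the original comment:
import numpy as np

def label_halves(g):
    # shuffle the group, then mark the first half 'A' and the rest 'B'
    g = g.sample(frac=1, random_state=0)
    cutoff = len(g) // 2
    g['half'] = np.where(np.arange(len(g)) < cutoff, 'A', 'B')
    return g

labelled = df.groupby('Type', group_keys=False).apply(label_halves)
A = labelled[labelled['half'] == 'A'].drop(columns='half')
B = labelled[labelled['half'] == 'B'].drop(columns='half')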
You can use the array_split feature of the numpy library like below:
import numpy as np
df_split = np.array_split(df, 2)
df1 = df_split[0]
df2 = df_split[1]
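Note that np.array_split divides the frame purely by position, so on its own it does not guarantee equal counts per Type; one hedged variant (not part of the original answer) is to split each Type group and concatenate:
import numpy as np
import pandas as pd

halves_a, halves_b = [], []
for _, group in df.groupby('Type'):
    first, second = np.array_split(group, 2)
    halves_a.append(first)
    halves_b.append(second)

A = pd.concat(halves_a)
B = pd.concat(halves_b)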
I have the following dataframe:
W Y
0 1 5
1 2 NaN
2 3 NaN
3 4 NaN
4 5 NaN
5 6 NaN
6 7 NaN
...
as the table keeps going until index 240. I want to get the following dataframe:
W Y
0 1 5
1 2 7
2 3 10
3 4 14
4 5 19
5 6 27
6 7 37
...
Please note that the values of W are arbitrary (just to make the computation here easier, in fact they are np.random.normal in my real program).
Or in other words:
If Y index is 0, then the value of Y is 5;
If Y index is between 1 and 4 (inclusive) then Y_i is the sum of the previous element in Y and the current element in W, i.e. Y_i = Y_{i-1} + W_i.
If Y index is >=5 then the value of Y is: Y_{i-1} + Y_{i-4} - Y_{i-5} + W_i
Using iipr's answer I've managed to compute the first five values by running:
def calculate(add):
    global value
    value = value + add
    return value
df.Y = np.nan
value = 5
df.loc[0, 'Y'] = value
df.loc[1:5, 'Y'] = df.loc[1:5].apply(lambda row: calculate(*row[['W']]), axis=1)
but I haven't managed to compute the rest of the values (where index >= 5).
Does anyone have any suggestions?
I wouldn't recommend using apply in this case.
Why not simply use two loops, one for each differently defined range:
for i in df.index[1:5]:
    df.loc[i, 'Y'] = df.W.loc[i] + df.Y.loc[i-1]

for i in df.index[5:]:
    df.loc[i, 'Y'] = df.W.loc[i] + df.Y.loc[i-1] + df.Y.loc[i-4] - df.Y.loc[i-5]
This is straightforward, and you will still know next week what the code does.
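Put together with the initial value from the question, a runnable sketch using the example W column shown above:
import numpy as np
import pandas as pd

df = pd.DataFrame({'W': [1, 2, 3, 4, 5, 6, 7]})
df['Y'] = np.nan
df.loc[0, 'Y'] = 5   # Y_0 is fixed at 5

for i in df.index[1:5]:
    df.loc[i, 'Y'] = df.W.loc[i] + df.Y.loc[i-1]

for i in df.index[5:]:
    df.loc[i, 'Y'] = df.W.loc[i] + df.Y.loc[i-1] + df.Y.loc[i-4] - df.Y.loc[i-5]

print(df.Y.tolist())   # [5.0, 7.0, 10.0, 14.0, 19.0, 27.0, 37.0]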
I have a df with values:
A B C D
0 1 2 3 2
1 2 3 3 9
2 5 3 6 6
3 3 6 7
4 6 7
5 2
df.shape is 6x4, say
df.iloc[:,1] pulls out the B column, but len(df.iloc[:,1]) is also 6.
How do I "reshape" df.iloc[:,1]? Which function can I use so that the output is the length of the actual values in the column?
My expected output in this case is 3
You can use last_valid_index. Just note that since your series originally contained NaN values and these are considered float, even after filtering your series will be float. You may wish to convert to int as a separate step.
# first convert dataframe to numeric
df = df.apply(pd.to_numeric, errors='coerce')
# extract column
B = df.iloc[:, 1]
# filter to the last valid value
B_filtered = B[:B.last_valid_index()]
print(B_filtered)
0 2.0
1 3.0
2 3.0
3 6.0
Name: B, dtype: float64
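If only the length is needed, either the filtered slice or the non-null count of the column gives it directly (a follow-up sketch; note that count() counts every non-null entry rather than stopping at the last valid one):
print(len(B_filtered))   # length of the slice up to the last valid value
print(B.count())         # number of non-null entries in the column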
You can use a list comprehension like this:
len([x for x in df.iloc[:,1] if x != ''])
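If the missing cells are NaN rather than empty strings, the same idea works with pd.notnull (a sketch):
import pandas as pd

len([x for x in df.iloc[:, 1] if pd.notnull(x)])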