I want to set only the first N rows of a pandas DataFrame that satisfy a condition. For example, given:
data = {'id':[1,1,2,2,2,4,2,2], 'val':[9,9,9,9,9,9,9,9]}
df = pd.DataFrame(data)
If I do:
df.loc[df.id == 2] = None
I set all rows with id == 2, but I only want to set N of them.
I've tried
df.loc[df.id == 2][range(N)] = None
but to no avail.
Is there any way to do this, without using loops?
IIUC, use pandas.DataFrame.loc twice and slice the matching index down to the first N labels (N = 2 here). Not so pretty, but it works:
df.loc[df.loc[df["id"].eq(2)].index[:2]]=None
print(df)
Output:
id val
0 1.0 9.0
1 1.0 9.0
2 NaN NaN
3 NaN NaN
4 2.0 9.0
5 4.0 9.0
6 2.0 9.0
7 2.0 9.0
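The same idea can be written for an arbitrary N with a cumulative count over the condition. This is a minimal sketch, not the answer's own method: `blank_first_n` is a hypothetical helper name, and it assumes every column is numeric so the frame can be cast to float before NaN is written (this avoids dtype-upcasting warnings in recent pandas versions).

```python
import numpy as np
import pandas as pd


def blank_first_n(df, cond, n):
    """Return a float copy of df with the first n rows where cond is True set to NaN.

    Hypothetical helper; assumes all columns are numeric.
    """
    # cond.cumsum() counts matches so far, so this keeps only the first n matches.
    hits = cond & (cond.cumsum() <= n)
    out = df.astype(float)
    out.loc[hits] = np.nan
    return out


df = pd.DataFrame({'id': [1, 1, 2, 2, 2, 4, 2, 2], 'val': [9] * 8})
result = blank_first_n(df, df['id'].eq(2), n=2)
print(result)
```

The original frame is left untouched; only the returned copy has rows 2 and 3 blanked.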
For the following dataframe:
a b c
0 NaN 5.0 NaN
1 2.0 6.0 NaN
2 3.0 7.0 11.0
3 4.0 NaN 12.0
I want to remove all rows with at least one NaN from the first row until a 'full' row is found. For the example above, rows 0 & 1 contain NaN so they are dropped. Row 2 is a 'full' row so it is retained, along with all following rows.
i.e., I want to get:
a b c
2 3.0 7.0 11.0
3 4.0 NaN 12.0
How can I achieve this?
Use DataFrame.notna with DataFrame.all(axis=1) to flag the rows that have no missing values, apply Series.cummax so that every row from the first full row onward is True, and filter with boolean indexing:
df = df[df.notna().all(axis=1).cummax()]
print (df)
a b c
2 3.0 7.0 11.0
3 4.0 NaN 12.0
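To see why the cumulative maximum does the job, it may help to look at the intermediate masks. The sketch below rebuilds the example frame from the question; the variable names `full`, `keep` and `trimmed` are only illustrative.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    'a': [np.nan, 2, 3, 4],
    'b': [5, 6, 7, np.nan],
    'c': [np.nan, np.nan, 11, 12],
})

# True only for rows without any NaN: here only row 2.
full = df.notna().all(axis=1)

# cummax turns every entry after the first True into True as well,
# so rows before the first full row are dropped and the rest are kept.
keep = full.cummax()

trimmed = df[keep]
print(trimmed)
```

Row 3 still contains a NaN but is kept, because it comes after the first full row.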
Suppose I have the following df
import pandas as pd
import numpy as np
test = pd.DataFrame(data = {
'a': [1,np.nan,np.nan,4,np.nan,5,6,7,8,np.nan,np.nan,6],
'b': [10,np.nan,np.nan,1,np.nan,1,1,np.nan,1,1,np.nan,1]
})
I would like to use pd.fillna(method='ffill'), but only when two separate conditions are both met:
Both elements of a row are NaN
The element of the previous row of column 'b' is 10
Note: the first row can never be NaN
I am looking for a smart way - maybe a lambda expression or a vectorized form, avoiding for loop or .iterrows()
Required result:
result= pd.DataFrame(data = {
'a': [1,1,1,4,np.nan,5,6,7,8,np.nan,np.nan,6],
'b': [10,10,10,1,np.nan,1,1,np.nan,1,1,np.nan,1]
})
You can test whether the forward-filled value in b is 10 and whether all columns in the row are missing, then pass the combined mask to DataFrame.mask together with the forward-filled frame:
mask = test['b'].ffill().eq(10) & test.isna().all(axis=1)
test = test.mask(mask, test.ffill())
print (test)
a b
0 1.0 10.0
1 1.0 10.0
2 1.0 10.0
3 4.0 1.0
4 NaN NaN
5 5.0 1.0
6 6.0 1.0
7 7.0 NaN
8 8.0 1.0
9 NaN 1.0
10 NaN NaN
11 6.0 1.0
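As a quick check, the approach can be compared against the required result from the question. This is a sketch with illustrative variable names. Note one assumption baked into the technique: `ffill().eq(10)` looks at the last non-missing value of b, so a run of consecutive all-NaN rows following a 10 is filled as a whole, which is what the required result shows for rows 1 and 2.

```python
import numpy as np
import pandas as pd

test = pd.DataFrame({
    'a': [1, np.nan, np.nan, 4, np.nan, 5, 6, 7, 8, np.nan, np.nan, 6],
    'b': [10, np.nan, np.nan, 1, np.nan, 1, 1, np.nan, 1, 1, np.nan, 1],
})
expected = pd.DataFrame({
    'a': [1, 1, 1, 4, np.nan, 5, 6, 7, 8, np.nan, np.nan, 6],
    'b': [10, 10, 10, 1, np.nan, 1, 1, np.nan, 1, 1, np.nan, 1],
})

last_b_is_10 = test['b'].ffill().eq(10)   # last known b value is 10
row_all_nan = test.isna().all(axis=1)     # both a and b are missing
mask = last_b_is_10 & row_all_nan

# Replace only the masked rows with their forward-filled values.
filled = test.mask(mask, test.ffill())
print(filled.equals(expected))
```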
Say I have two data frames:
Original:
A B C
0 NaN 4.0 7.0
1 2.0 5.0 NaN
2 NaN NaN 9.0
Imputation:
A B C
0 1 4 7
1 2 5 8
2 3 6 9
(both are the same dataframes except imputation has the NaN's filled in).
I would like to reintroduce the NaN values into column A of the imputation df so it looks like this (columns B and C are filled in, but A keeps its NaN values):
# A B C
# 0 NaN 4.0 7.0
# 1 2.0 5.0 8.0
# 2 NaN 6.0 9.0
import pandas as pd
import numpy as np
dfImputation = pd.DataFrame({'A':[1,2,3],
'B':[4,5,6],
'C':[7,8,9]})
dfOrginal = pd.DataFrame({'A':[np.nan,2,np.nan],
'B':[4,5,np.nan],
'C':[7,np.nan,9]})
print(dfOrginal.fillna(dfImputation))
I do not get the result I want because this obviously fills in all values. Is there a way to reintroduce NaN values, or to fill in NA only for specific columns? I'm not sure of the best approach to get the intended outcome.
You can fill in only specified columns by subsetting the frame you pass into the fillna operation:
>>> dfOrginal.fillna(dfImputation[["B", "C"]])
A B C
0 NaN 4.0 7.0
1 2.0 5.0 8.0
2 NaN 6.0 9.0
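Alternatively, use DataFrame.update, which modifies the frame in place (df and im here presumably stand for dfOrginal and dfImputation):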
df.update(im[['B','C']])
df
Out[7]:
A B C
0 NaN 4.0 7.0
1 2.0 5.0 8.0
2 NaN 6.0 9.0
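Both directions the question mentions can be sketched side by side: filling only selected columns of the original frame, or reintroducing NaN into the imputed frame. The names `orig`, `imputed`, `filled` and `reintroduced` are illustrative, and the second variant uses Series.where, which is a different technique from the answers above.

```python
import numpy as np
import pandas as pd

orig = pd.DataFrame({'A': [np.nan, 2, np.nan],
                     'B': [4, 5, np.nan],
                     'C': [7, np.nan, 9]})
imputed = pd.DataFrame({'A': [1, 2, 3],
                        'B': [4, 5, 6],
                        'C': [7, 8, 9]})

# Variant 1: fill the original, but only from columns B and C.
# Column A is absent from the fill frame, so its NaN values stay.
filled = orig.fillna(imputed[['B', 'C']])

# Variant 2: start from the imputed frame and blank column A
# wherever the original had a missing value.
reintroduced = imputed.copy()
reintroduced['A'] = imputed['A'].where(orig['A'].notna())

print(filled)
print(reintroduced)
```

Neither variant modifies `orig` or `imputed`; each returns or builds a new frame.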
I am trying to write a list of lists into an Excel sheet using pandas.
The list looks like:
List_of_Lists = [ [1,2,3,4],
[5,6,7,8],
[9,10,11,12],
........,
]
The number of lists inside the main list could go up to 1000.
I also want to label the columns columns1, columns2, and so on up to columns100, for
instance, on the same sheet. Can anyone familiar with pandas help me?
This could be really easy for some.
I believe you can just pass the list into pd.DataFrame() and you will just get NaNs for the values that don't exist.
For example:
List_of_Lists = [[1,2,3,4],
[5,6,7],
[9,10],
[11]]
df = pd.DataFrame(List_of_Lists)
print(df)
0 1 2 3
0 1 2.0 3.0 4.0
1 5 6.0 7.0 NaN
2 9 10.0 NaN NaN
3 11 NaN NaN NaN
Then to get the naming the way you want just use pandas.DataFrame.add_prefix
df = df.add_prefix('Column')
print(df)
Column0 Column1 Column2 Column3
0 1 2.0 3.0 4.0
1 5 6.0 7.0 NaN
2 9 10.0 NaN NaN
3 11 NaN NaN NaN
Now I guess there is the possibility that you also could want each list to be a column. In that case you need to transpose your List_of_Lists.
from itertools import zip_longest
df = pd.DataFrame(list(map(list, zip_longest(*List_of_Lists))))
print(df)
0 1 2 3
0 1 5.0 9.0 11.0
1 2 6.0 10.0 NaN
2 3 7.0 NaN NaN
3 4 NaN NaN NaN
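Since the question is about getting the data into an Excel sheet, here is a minimal sketch of the remaining step. The helper names `lists_to_frame` and `write_lists_to_excel` are hypothetical, the labels start at 1 rather than 0 (so add_prefix is replaced by an explicit list of names), and writing .xlsx files assumes an Excel engine such as openpyxl is installed.

```python
import pandas as pd


def lists_to_frame(rows):
    """One row per inner list; shorter lists are padded with NaN."""
    frame = pd.DataFrame(rows)
    # Label columns Column1, Column2, ... instead of 0, 1, ...
    frame.columns = [f"Column{i}" for i in range(1, frame.shape[1] + 1)]
    return frame


def write_lists_to_excel(rows, path):
    """Write the lists to a single sheet, without the row index."""
    lists_to_frame(rows).to_excel(path, index=False)


frame = lists_to_frame([[1, 2, 3, 4], [5, 6, 7], [9, 10], [11]])
print(frame)
```

Calling `write_lists_to_excel(List_of_Lists, "output.xlsx")` would then produce the file; the file name is just a placeholder.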
I am trying to create a very large dataframe, made up of one column from each of many smaller dataframes (renamed to the dataframe name). I am using pd.concat() while looping through dictionary values, which represent dataframes, and looping over index values to create the large dataframe. The join_axes passed to pd.concat() is the index common to all the dataframes. This works fine, but I then end up with duplicate column names.
I must be able to loop over the indexes at specific windows as part of my final dataframe creation, so removing this step isn't an option.
For example, this results in the following final dataframe with duplicate columns:
Is there any way I can use pd.concat() exactly as I am, but merge the columns to produce an output like so?
I think you need:
df = pd.concat([df1, df2])
Or, if the result has duplicate column names, group by column name with groupby; where values overlap, they are summed:
print (df.groupby(level=0, axis=1).sum())
Sample:
df1 = pd.DataFrame({'A':[5,8,7, np.nan],
'B':[1,np.nan,np.nan,9],
'C':[7,3,np.nan,0]})
df2 = pd.DataFrame({'A':[np.nan,np.nan,np.nan,2],
'B':[1,2,np.nan,np.nan],
'C':[np.nan,6,np.nan,3]})
print (df1)
A B C
0 5.0 1.0 7.0
1 8.0 NaN 3.0
2 7.0 NaN NaN
3 NaN 9.0 0.0
print (df2)
A B C
0 NaN 1.0 NaN
1 NaN 2.0 6.0
2 NaN NaN NaN
3 2.0 NaN 3.0
df = pd.concat([df1, df2],axis=1)
print (df)
A B C A B C
0 5.0 1.0 7.0 NaN 1.0 NaN
1 8.0 NaN 3.0 NaN 2.0 6.0
2 7.0 NaN NaN NaN NaN NaN
3 NaN 9.0 0.0 2.0 NaN 3.0
print (df.groupby(level=0, axis=1).sum())
A B C
0 5.0 2.0 7.0
1 8.0 2.0 9.0
2 7.0 NaN NaN
3 2.0 9.0 3.0
What you want is df1.combine_first(df2). Refer to pandas documentation.
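The two suggestions behave differently where both frames hold a value, which the sample data makes visible. The sketch below uses the sample frames from the earlier answer. Two points are assumptions on my part rather than something stated in the answers: recent pandas versions deprecate `groupby(..., axis=1)`, so the frame is transposed instead, and `min_count=1` is passed so that all-NaN cells stay NaN rather than becoming 0.

```python
import numpy as np
import pandas as pd

df1 = pd.DataFrame({'A': [5, 8, 7, np.nan],
                    'B': [1, np.nan, np.nan, 9],
                    'C': [7, 3, np.nan, 0]})
df2 = pd.DataFrame({'A': [np.nan, np.nan, np.nan, 2],
                    'B': [1, 2, np.nan, np.nan],
                    'C': [np.nan, 6, np.nan, 3]})

# combine_first keeps df1's value and only falls back to df2 where df1 is NaN.
merged = df1.combine_first(df2)

# Summing duplicate columns instead adds overlapping values together.
wide = pd.concat([df1, df2], axis=1)
summed = wide.T.groupby(level=0).sum(min_count=1).T

print(merged)
print(summed)
```

For example, row 1 of column C is 3.0 with combine_first but 9.0 with the sum, so the choice depends on whether overlapping values should be added or the first frame should take priority.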