I have a pandas dataframe:
Name A1 A2 A3
Andy 1 NaN NaN
Brian NaN NaN NaN
Carlos NaN 2 NaN
David NaN NaN 3
Frank 2 NaN NaN
In each row, at most one of the three columns A1, A2 and A3 holds a non-NaN value. I want to merge them into a single column and remove the rows where all three are NaN, so that the dataframe above becomes:
Name A A-ID
Andy 1 1
Carlos 2 2
David 3 3
Frank 2 1
A-ID will store the original column (A1, A2 or A3). The row with Brian is removed because all 3 columns are NaN.
Naively I can write a for loop to do the task, but is there a more pythonic and faster way?
This method should achieve the desired result:
import pandas as pd
import numpy as np
d = {"Name": ["Andy", "Brian", "Carlos", "David", "Frank"],
"A1": [1,np.nan,np.nan,np.nan,2],
"A2": [np.nan,np.nan,2,np.nan,np.nan],
"A3": [np.nan,np.nan,np.nan,3,np.nan]}
df = pd.DataFrame(data=d)
#Drops rows where all A* values are NaN
df = df.dropna(subset = ['A1', 'A2', 'A3'], how="all")
#Sums the A* columns only (NaN is skipped); summing the whole frame would also hit the string column "Name"
df["A"] = df[["A1", "A2", "A3"]].sum(axis=1)
#Alternative method for getting 'A'
#df["A"] = df[["A1", "A2", "A3"]].bfill(axis=1).iloc[:, 0]
#Returns final char of column name of first non-NaN column
df["A-ID"] = df[["A1", "A2", "A3"]].apply(lambda row: row.first_valid_index()[-1], axis=1)
#Dropping old A* columns
df = df.drop(["A1", "A2", "A3"], axis=1)
print(df)
Name A A-ID
0 Andy 1.0 1
2 Carlos 2.0 2
3 David 3.0 3
4 Frank 2.0 1
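As an alternative sketch, the same reshaping can be written with DataFrame.melt, which turns the A* columns into rows so that dropping NaN and reading the source column name become ordinary column operations. The names long and result are illustrative, and ignore_index=False assumes pandas 1.1 or later:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "Name": ["Andy", "Brian", "Carlos", "David", "Frank"],
    "A1": [1, np.nan, np.nan, np.nan, 2],
    "A2": [np.nan, np.nan, 2, np.nan, np.nan],
    "A3": [np.nan, np.nan, np.nan, 3, np.nan],
})

# One row per (Name, A* column) pair; keep the original row index
long = df.melt(id_vars="Name", var_name="A-ID", value_name="A", ignore_index=False)

# Drop the empty cells and restore the original row order
long = long.dropna(subset=["A"]).sort_index()

# Keep only the final character of the source column name ("A1" -> "1")
long["A-ID"] = long["A-ID"].str[-1]

result = long[["Name", "A", "A-ID"]]
print(result)
```

This relies on the stated condition that each row has at most one non-NaN value; a row with two values would appear twice in the result.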
There are several ways to do that. Probably the simplest is to define a new column holding the sum of the other columns. Use sum with min_count=1 rather than the + operator, because + returns NaN as soon as one operand is NaN:
df["B"] = df[["A1", "A2", "A3"]].sum(axis=1, min_count=1)
Then keep only the rows where B is not null:
df = df[df.B.notnull()]
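A caveat worth checking when summing columns that contain NaN: the + operator propagates NaN, whereas DataFrame.sum skips it. A minimal sketch on a reduced version of the data (the names plus, summed and kept are illustrative):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "A1": [1, np.nan, np.nan],
    "A2": [np.nan, np.nan, 2],
    "A3": [np.nan, np.nan, np.nan],
})
cols = ["A1", "A2", "A3"]

# Element-wise addition: any NaN operand makes the whole result NaN
plus = df["A1"] + df["A2"] + df["A3"]

# Row-wise sum that skips NaN; min_count=1 keeps all-NaN rows as NaN instead of 0
summed = df[cols].sum(axis=1, min_count=1)

# Only rows with at least one real value survive
kept = df[summed.notna()]
print(plus.isna().all(), summed.tolist(), kept.index.tolist())
```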
Regards
I am using openpyxl to edit three dataframes, df1, df2 and df3 (if necessary, they can also be treated as three independent Excel files):
import pandas as pd
data1 = [[1, 1],[1,1]]
df1 = pd.DataFrame(data1, index = ['I1a','I1b'], columns=['v1a', 'v1b'])
df1.index.name='I1'
data2 = [[2, 2,2,2],[2,2,2,2],[2,2,2,2],[2,2,2,2]]
df2 = pd.DataFrame(data2, index = ['I2a','I2b','I2c','I2d'], columns=['v2a','v2b','v2c','v2d'])
df2.index.name='I2'
data3 = [['a', 'b',3,3],['a','c',3,3],['b','c',3,3],['c','d',3,3]]
df3 = pd.DataFrame(data3, columns=['v3a','v3b','v3c','v3d'])
df3 = df3.groupby(['v3a','v3b']).first()
Here df3 has a MultiIndex. How can I concatenate them horizontally into one Excel sheet, with each dataframe starting on the same row?
The index should be treated as a column, and for the MultiIndex we keep the first level hidden.
Update
IIUC:
>>> pd.concat([df1.reset_index(), df2.reset_index(), df3.reset_index()], axis=1)
I1 v1a v1b I2 v2a v2b v2c v2d v3a v3b v3c v3d
0 I1a 1.0 1.0 I2a 2 2 2 2 a b 3 3
1 I1b 1.0 1.0 I2b 2 2 2 2 a c 3 3
2 NaN NaN NaN I2c 2 2 2 2 b c 3 3
3 NaN NaN NaN I2d 2 2 2 2 c d 3 3
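If the goal is the Excel file itself, one possible sketch is to skip the concat and write each frame to the same sheet at a different start column, which also avoids the NaN padding and the float upcast seen above. The helpers start_columns and write_side_by_side are hypothetical names, and writing .xlsx files assumes an Excel engine such as openpyxl is installed:

```python
import pandas as pd

df1 = pd.DataFrame([[1, 1], [1, 1]], index=["I1a", "I1b"], columns=["v1a", "v1b"])
df1.index.name = "I1"
df2 = pd.DataFrame([[2] * 4] * 4, index=["I2a", "I2b", "I2c", "I2d"],
                   columns=["v2a", "v2b", "v2c", "v2d"])
df2.index.name = "I2"
df3 = pd.DataFrame([["a", "b", 3, 3], ["a", "c", 3, 3], ["b", "c", 3, 3], ["c", "d", 3, 3]],
                   columns=["v3a", "v3b", "v3c", "v3d"])
df3 = df3.groupby(["v3a", "v3b"]).first()


def start_columns(frames):
    """Return the zero-based start column of each frame when laid out left to right."""
    offsets, col = [], 0
    for frame in frames:
        offsets.append(col)
        col += frame.shape[1]
    return offsets


def write_side_by_side(frames, path):
    """Write every frame to one sheet, all starting on the first row."""
    with pd.ExcelWriter(path) as writer:
        for frame, col in zip(frames, start_columns(frames)):
            frame.to_excel(writer, sheet_name="Sheet1", startcol=col, index=False)


# Turn each index into ordinary columns, as in the concat above
frames = [df1.reset_index(), df2.reset_index(), df3.reset_index()]
offsets = start_columns(frames)
print(offsets)
```

Calling write_side_by_side(frames, "output.xlsx") would then produce the file; the path is only an example.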
Old answer
Assuming you know the start row, you can use pandas to remove extra columns:
import pandas as pd
df = pd.read_excel('input.xlsx', header=0, skiprows=3).dropna(how='all', axis=1)
df.to_excel('output.xlsx', index=False)
I have two dataframes containing the same columns:
df1 = pd.DataFrame({'a': [1,2,3,4,5],
'b': [2,3,4,5,6]})
df2 = pd.DataFrame({'a': [1,3,4],
'b': [2,4,5]})
I want df2 to have the same number of rows as df1. Any values of a that are present in df1 but missing from df2 should be copied over, and the corresponding values of b should be taken from the row before.
In other words, I want df2 to look like this:
df2 = pd.DataFrame({'a': [1,2,3,4,5],
'b': [2,2,4,5,5]})
EDIT: I'm looking for an answer that is independent of the number of columns
Use DataFrame.merge with only column a from df1 and a left join, then forward-fill the missing values:
df = df1[['a']].merge(df2, how='left', on='a').ffill()
print (df)
a b
0 1 2.0
1 2 2.0
2 3 4.0
3 4 5.0
4 5 5.0
Or use merge_asof, which requires both frames to be sorted by a:
df = pd.merge_asof(df1[['a']], df2, on='a')
print (df)
a b
0 1 2
1 2 2
2 3 4
3 4 5
4 5 5
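To check that the merge approach does not depend on the number of columns, here is a small sketch with an extra column c (a hypothetical addition to the question's data). The final astype is an optional step that restores the integer dtypes lost to the temporary NaN values:

```python
import pandas as pd

df1 = pd.DataFrame({"a": [1, 2, 3, 4, 5], "b": [2, 3, 4, 5, 6]})
# "c" is an illustrative extra column, not part of the original question
df2 = pd.DataFrame({"a": [1, 3, 4], "b": [2, 4, 5], "c": [10, 30, 40]})

filled = (
    df1[["a"]]
    .merge(df2, how="left", on="a")   # rows for a=2 and a=5 get NaN in b and c
    .ffill()                          # take the values from the row before
    .astype(df2.dtypes.to_dict())     # back to the original integer dtypes
)
print(filled)
```

The astype step only works when no NaN remains, that is, when the first value of a in df1 also exists in df2.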
I'm having trouble adding a new column to a pandas dataframe when the new column has more values than the index has entries.
The data may look like this:
import pandas as pd
df = pd.DataFrame(
{
"bar": ["A","B","C"],
"zoo": [1,2,3],
})
As you can see, the length of this df's index is 3.
Next I want to add a new column, in one of the two ways below:
df["new_col"] = [1,2,3,4]
It raises an error: Length of values does not match length of index.
Or:
df["new_col"] = pd.Series([1,2,3,4])
I only get the values [1,2,3] in my dataframe df.
(The number of values in the new column cannot exceed the length of the index.)
What I want instead is for the dataframe to grow so that it holds all four values.
Is there a better way?
Looking forward to your answer, thanks!
Use DataFrame.join with a renamed Series and a right join:
#if not default index
#df = df.reset_index(drop=True)
df = df.join(pd.Series([1,2,3,4]).rename('new_col'), how='right')
print (df)
bar zoo new_col
0 A 1.0 1
1 B 2.0 2
2 C 3.0 3
3 NaN NaN 4
Another idea is to reindex by the new s.index first:
s = pd.Series([1,2,3,4])
df = df.reindex(s.index)
df["new_col"] = s
print (df)
bar zoo new_col
0 A 1.0 1
1 B 2.0 2
2 C 3.0 3
3 NaN NaN 4
The same in one chained statement:
s = pd.Series([1,2,3,4])
df = df.reindex(s.index).assign(new_col = s)
df = pd.DataFrame(
{
"bar": ["A","B","C"],
"zoo": [1,2,3],
})
new_col = pd.Series([1,2,3,4])
df = pd.concat([df,new_col],axis=1)
print(df)
bar zoo 0
0 A 1.0 1
1 B 2.0 2
2 C 3.0 3
3 NaN NaN 4
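In the concat output above the new column ends up named 0 because the Series has no name. A minimal sketch of the same approach with a named Series (out is an illustrative variable name):

```python
import pandas as pd

df = pd.DataFrame({"bar": ["A", "B", "C"], "zoo": [1, 2, 3]})

# Giving the Series a name makes concat use it as the column label
new_col = pd.Series([1, 2, 3, 4], name="new_col")

# axis=1 aligns on the index, so the frame grows to four rows
out = pd.concat([df, new_col], axis=1)
print(out)
```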
Input DF:
df = pd.DataFrame({'A': ['one',np.nan,'two',np.nan],
'B': [np.nan,22,np.nan,44],
'group':[0,0,1,1]
})
print(df)
A B group
0 one NaN 0
1 NaN 22.0 0
2 two NaN 1
3 NaN 44.0 1
I want to merge those rows into one, combining all cells of a column into a single cell, but taking the groups into account.
Currently I have:
df=df.agg(lambda x: ','.join(x.dropna().astype(str))
).to_frame().T
print(df)
A B group
0 one,two 22.0,44.0 0,0,1,1
But this takes all rows at once instead of working per group.
Expected Output:
A B
0 one 22.0
1 two 44.0
If it is enough to take the first non-missing value per group, the solution simplifies to:
df = df.groupby('group').first()
print(df)
A B
group
0 one 22.0
1 two 44.0
If not, and a general solution is needed:
df = pd.DataFrame({'A': ['one',np.nan,'two',np.nan],
'B': [np.nan,22,np.nan,44],
'group':[0,0,0,1]
})
def f(x):
    return x.apply(lambda x: pd.Series(x.dropna().to_numpy()))
df = df.set_index('group').groupby('group').apply(f).reset_index(level=1, drop=True).reset_index()
print(df)
group A B
0 0 one 22.0
1 0 two NaN
2 1 NaN 44.0
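If the comma-joined strings from the question's own attempt are what is wanted, the same join can be applied per group with groupby and agg. A minimal sketch on the modified data above (join_non_missing and joined are illustrative names):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "A": ["one", np.nan, "two", np.nan],
    "B": [np.nan, 22, np.nan, 44],
    "group": [0, 0, 0, 1],
})


def join_non_missing(col):
    """Join the non-missing values of one column within one group."""
    return ",".join(col.dropna().astype(str))


# agg calls the function once per column and per group
joined = df.groupby("group").agg(join_non_missing)
print(joined)
```

A group with no values in a column yields an empty string rather than NaN, and every column becomes a string column.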
df_a = df.drop('B', axis=1).dropna()
df_b = df.drop('A', axis=1).dropna()
pd.merge(df_a, df_b, on='group')
Given the following DataFrame:
A B
0 -10.0 NaN
1 NaN 20.0
2 -30.0 NaN
I want to merge columns A and B, filling the NaN cells in column A with the values from column B and then drop column B, resulting in a DataFrame like this:
A
0 -10.0
1 20.0
2 -30.0
I have managed to solve this problem by using the iterrows() function.
Complete code example:
import numpy as np
import pandas as pd
example_data = [[-10, np.nan], [np.nan, 20], [-30, np.nan]]
example_df = pd.DataFrame(example_data, columns = ['A', 'B'])
for index, row in example_df.iterrows():
    if pd.isnull(row['A']):
        row['A'] = row['B']
example_df = example_df.drop(columns = ['B'])
example_df
This seems to work fine, but I found this note in the documentation for iterrows():
You should never modify something you are iterating over.
So it seems like I'm doing it wrong.
What would be a better/recommended approach for achieving the same result?
Use Series.fillna with Series.to_frame:
df = df['A'].fillna(df['B']).to_frame()
#alternative
#df = df['A'].combine_first(df['B']).to_frame()
print (df)
A
0 -10.0
1 20.0
2 -30.0
With more columns, if you need the first non-missing value per row, back-fill along the rows and select the first column; passing a one-element list to iloc returns a one-column DataFrame:
df = df.bfill(axis=1).iloc[:, [0]]
print (df)
A
0 -10.0
1 20.0
2 -30.0
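As a quick check of the multi-column case, here is a sketch with a third column C (a hypothetical extension of the question's data), comparing the back-fill approach with chained combine_first calls:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "A": [-10.0, np.nan, np.nan],
    "B": [np.nan, 20.0, np.nan],
    "C": [np.nan, np.nan, -30.0],
})

# Back-fill along each row, then keep the first column as a one-column DataFrame
merged = df.bfill(axis=1).iloc[:, [0]]

# Equivalent result built column by column
via_combine = df["A"].combine_first(df["B"]).combine_first(df["C"]).to_frame()

print(merged)
```

The back-fill version scales to any number of columns without naming them, while combine_first makes the priority order explicit.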