I have an empty dataframe.
df=pd.DataFrame(columns=['a'])
for some reason I want to generate df2, another empty dataframe, with two columns 'a' and 'b'.
If I do
df.columns=df.columns+'b'
it does not work (I get the columns renamed to 'ab')
and neither does the following
df.columns=df.columns.tolist()+['b']
How can I add a separate column 'b' to df so that df.empty stays True?
Using .loc is also not possible
df.loc[:,'b']=None
as it returns
Cannot set dataframe with no defined index and a scalar
Here are a few ways to add an empty column to an empty dataframe:
df=pd.DataFrame(columns=['a'])
df['b'] = None
df = df.assign(c=None)
df = df.assign(d=df['a'])
df['e'] = pd.Series(index=df.index)
df = pd.concat([df,pd.DataFrame(columns=list('f'))])
print(df)
Output:
Empty DataFrame
Columns: [a, b, c, d, e, f]
Index: []
I hope it helps.
If you just do df['b'] = None then df.empty is still True and df is:
Empty DataFrame
Columns: [a, b]
Index: []
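A minimal sketch to check the claim above, using only the frame from the question: a scalar assignment on an empty frame adds the column label without adding rows.

```python
import pandas as pd

df = pd.DataFrame(columns=['a'])

# Assigning a scalar to a new label adds the column; no rows are created.
df['b'] = None

print(df.columns.tolist())
print(df.empty)
```

df.empty only looks at whether any axis has length 0, so it stays True as long as no rows are added.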
EDIT:
To create an empty df2 from the columns of df plus some new columns, you can do:
df2 = pd.DataFrame(columns = df.columns.tolist() + ['b', 'c', 'd'])
If you want to add multiple columns at the same time you can also reindex.
new_cols = ['c', 'd', 'e', 'f', 'g']
df2 = df.reindex(df.columns.union(new_cols), axis=1)
#Empty DataFrame
#Columns: [a, c, d, e, f, g]
#Index: []
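One thing to keep in mind with the reindex approach: Index.union sorts the labels when it can, so the resulting column order may differ from the order you wrote. A small sketch of the difference, with the column names 'z' and 'b' picked only for illustration:

```python
import pandas as pd

df = pd.DataFrame(columns=['a'])
new_cols = ['z', 'b']  # illustrative names, deliberately out of alphabetical order

# Index.union sorts the combined labels, so 'b' ends up before 'z'.
sorted_df = df.reindex(df.columns.union(new_cols), axis=1)

# Concatenating plain lists keeps the labels in the order given.
ordered_df = df.reindex(columns=df.columns.tolist() + new_cols)

print(sorted_df.columns.tolist())
print(ordered_df.columns.tolist())
```

Both results are still empty frames; only the column order differs.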
This is one way:
df2 = df.join(pd.DataFrame(columns=['b']))
The advantage of this method is that you can add an arbitrary number of columns without explicit loops.
In addition, this satisfies your requirement of df.empty evaluating to True if no data exists.
You can use concat:
df=pd.DataFrame(columns=['a'])
df
Out[568]:
Empty DataFrame
Columns: [a]
Index: []
df2=pd.DataFrame(columns=['b', 'c', 'd'])
pd.concat([df,df2])
Out[571]:
Empty DataFrame
Columns: [a, b, c, d]
Index: []
Related
How to add three columns from one data frame to another at a certain position?
I want to add these columns after a specific column: columns C and D from
df1 should go after columns A and B in df2. So how do I insert columns in
between the existing columns of another dataframe?
df1=pd.read_csv(csvfile)
df2=pd.read_csv(csvfile)
df1['C','D','E'] to df2['K','L','A','B','F']
so it looks like df3= ['K','L','A','B','C','D','F']
Use concat with DataFrame.reindex to change the order of the columns:
df3 = pd.concat([df1, df2], axis=1).reindex(['K','L','A','B','C','D'], axis=1)
More general solution:
df1 = pd.DataFrame(columns=['H','G','C','D','E'])
df2 = pd.DataFrame(columns=['K','L','A','B','F'])
df3 = pd.concat([df1, df2], axis=1)
c = df3.columns.difference(['C', 'D'], sort=False)
pos = c.get_loc('B') + 1
c = list(c)
#https://stackoverflow.com/a/3748092/2901002
c[pos:pos] = ['C', 'D']
df3 = df3.reindex(c, axis=1)
print (df3)
Empty DataFrame
Columns: [H, G, E, K, L, A, B, C, D, F]
Index: []
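The same list-splicing idea can be wrapped in a small helper. This is only a sketch: insert_columns_after is a hypothetical name, not a pandas function, and the sample values are made up for illustration.

```python
import pandas as pd

def insert_columns_after(df, anchor, new_cols):
    """Return df with new_cols placed directly after the anchor column.

    Hypothetical helper for illustration; assumes anchor and new_cols
    are all existing column labels of df.
    """
    rest = [c for c in df.columns if c not in new_cols]
    pos = rest.index(anchor) + 1
    rest[pos:pos] = new_cols  # splice the moved labels in after the anchor
    return df.reindex(columns=rest)

# Made-up sample data with the column names from the question.
df1 = pd.DataFrame({'C': [1, 2], 'D': [3, 4], 'E': [5, 6]})
df2 = pd.DataFrame({'K': [0, 0], 'L': [1, 1], 'A': [2, 2], 'B': [3, 3], 'F': [4, 4]})

combined = pd.concat([df2, df1[['C', 'D']]], axis=1)
df3 = insert_columns_after(combined, 'B', ['C', 'D'])
print(df3.columns.tolist())
```

Note that concat with axis=1 aligns rows on the index, so both frames should share the same index for the values to line up.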
Try:
df3=pd.DataFrame()
df3[['K','L','A','B']]=df2[['K','L','A','B']]
df3[['C','D','E']]=df1[['C','D','E']]
Finally:
df3=df3[['K','L','A','B','C','D']]
OR
df3=df3.loc[:,['K','L','A','B','C','D']]
This should work
pd.merge(df1, df2, left_index=True, right_index=True)[['K','L','A','B','C','D']]
or simply use join, which is a left join by default
df1.join(df2)[['K','L','A','B','C','D']]
Given a dataframe df, I need to select the columns that have only True values
df =
A B C D E
True False True False True
Output should be
output = [A, C, E]
Try boolean indexing with all (for only True values):
df.columns[df.all()]
Output:
Index(['A', 'C', 'E'], dtype='object')
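With more than one row the same expression still applies, since DataFrame.all() reduces each column to a single boolean. A short sketch with made-up data:

```python
import pandas as pd

# Made-up frame: column B contains one False, so it should be excluded.
df = pd.DataFrame({
    'A': [True, True],
    'B': [False, True],
    'C': [True, True],
})

# df.all() gives one boolean per column; use it as a mask on the labels.
all_true = df.columns[df.all()].tolist()
print(all_true)
```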
Try iterating through it and putting the keys in a list (you can easily modify this to result in a dict, though).
result = []
for i in df.keys():
    if df[i].all():
        result.append(i)
This is simplified example of what I want to do:
data1 = {'one':['A', 'E', 'G'], 'two':['B', 'D', 'H'], 'three':['C', 'F', 'J']}
df1 = pd.DataFrame(data1)
df1
one two three
0 A B C
1 E D F
2 G H J
data2 = {'one':['C', 'F', 'P'], 'two':['B', 'D', 'R'], 'three':['A', 'E', 'C']}
df2 = pd.DataFrame(data2)
df2
one two three
0 C B A
1 F D E
2 P R C
I want a function that will show me something like this:
diff(df1, df2) # this syntax can be different
one two three from
0 G H J df1
1 P R C df2
Basically, find rows where the text in column two is the same in both dataframes; if only the one and three columns are swapped, that is fine and the row should not be added to the new frame.
I know how to do it with a loop but would like to know the pandas way of doing this.
Using pandas.Index.symmetric_difference
df1.set_index(df1.apply(frozenset, 1), inplace=True)
df2.set_index(df2.apply(frozenset, 1), inplace=True)
df1['from'] = 'df1'
df2['from'] = 'df2'
new_df = pd.concat([df1, df2]).loc[df1.index.symmetric_difference(df2.index)].reset_index(drop=True)
print(new_df)
Output:
one three two from
0 G J H df1
1 P C R df2
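If you would rather not overwrite the index of the original frames, the same frozenset key can be used with plain Python membership tests. This is a sketch rather than the method above: diff is the hypothetical name from the question, and like the frozenset approach it treats a row as an unordered collection of its values.

```python
import pandas as pd

def diff(df1, df2):
    """Rows whose set of values appears in only one of the two frames."""
    keys1 = [frozenset(row) for row in df1.itertuples(index=False)]
    keys2 = [frozenset(row) for row in df2.itertuples(index=False)]
    seen1, seen2 = set(keys1), set(keys2)

    # Boolean masks: keep rows whose key is missing from the other frame.
    only1 = df1.loc[[k not in seen2 for k in keys1]].assign(**{'from': 'df1'})
    only2 = df2.loc[[k not in seen1 for k in keys2]].assign(**{'from': 'df2'})
    return pd.concat([only1, only2], ignore_index=True)

df1 = pd.DataFrame({'one': ['A', 'E', 'G'], 'two': ['B', 'D', 'H'], 'three': ['C', 'F', 'J']})
df2 = pd.DataFrame({'one': ['C', 'F', 'P'], 'two': ['B', 'D', 'R'], 'three': ['A', 'E', 'C']})

result = diff(df1, df2)
print(result)
```

Because 'from' is a Python keyword, it is passed to assign through dictionary unpacking.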
Simple enough: just compare the columns that you want to be the same and filter on that. In your example:
pd.concat([df.loc[df1["two"] != df2["two"]] for df in (df1, df2)], axis=0)
EDIT: if you want the "from" column as well, change the above line to:
pd.concat([df.loc[df1["two"] != df2["two"]].assign(from_df=df_name) for df, df_name in zip((df1, df2), ("df1", "df2"))], axis=0)
I have a Pandas DataFrame with two columns, each row contains a list of elements. I'm trying to find set difference between two columns for each row using pandas.apply method.
My df for example
A B
0 ['a','b','c'] ['a']
1 ['e', 'f', 'g'] ['f', 'g']
So it should look like this:
df.apply(set_diff_func, axis=1)
What I'm trying to achieve:
0 ['b','c']
1 ['e']
I can do it using iterrows, but I once read that it's better to use apply when possible.
How about
df.apply(lambda row: list(set(row['A']) - set(row['B'])), axis = 1)
or
(df['A'].apply(set) - df['B'].apply(set)).apply(list)
Here's the function you need, you can change the name of the columns with the col1 and col2 arguments by passing them to the args option in apply:
def set_diff_func(row, col1, col2):
    return list(set(row[col1]).difference(set(row[col2])))
This should return the required result:
>>> dataset = pd.DataFrame(
[{'A':['a','b','c'], 'B':['a']},
{'A':['e', 'f', 'g'] , 'B':['f', 'g']}])
>>> dataset.apply(set_diff_func, axis=1, args=['A','B'])
0 [c, b]
1 [e]
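As the [c, b] in the output shows, going through set does not keep the original order of the elements. If order matters, a list comprehension with a set used only for lookups is one option. A sketch, where ordered_diff is a hypothetical name:

```python
import pandas as pd

def ordered_diff(row, col1, col2):
    """Elements of row[col1] not in row[col2], in their original order."""
    exclude = set(row[col2])  # set only for fast membership tests
    return [x for x in row[col1] if x not in exclude]

dataset = pd.DataFrame({
    'A': [['a', 'b', 'c'], ['e', 'f', 'g']],
    'B': [['a'], ['f', 'g']],
})

result = dataset.apply(ordered_diff, axis=1, args=('A', 'B'))
print(result.tolist())
```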
I found a bunch of answers about how to drop columns using index numbers.
I am stumped about how to drop all columns after a particular column by name.
df = pd.DataFrame(columns=['A','B','C','D'])
df.drop(['B':],inplace=True)
I expect the new df to have only columns A and B.
Dropping all columns after is the same as keeping all columns up to and including. So:
In [321]: df = pd.DataFrame(columns=['A','B','C','D'])
In [322]: df = df.loc[:, :'B']
In [323]: df
Out[323]:
Empty DataFrame
Columns: [A, B]
Index: []
(Using inplace is typically not worth it.)
get_loc and iloc
Dropping some is the same as selecting the others.
df.iloc[:, :df.columns.get_loc('B') + 1]
Empty DataFrame
Columns: [A, B]
Index: []
df.drop(df.columns[list(df.columns).index("B")+1:], axis=1, inplace=True)
df = df[df.columns[:list(df.columns).index('B')+1]]
should work.
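For completeness, a sketch of the drop variant without inplace, using the columns keyword so the axis is explicit. The names cutoff and trimmed are just illustrative:

```python
import pandas as pd

df = pd.DataFrame(columns=['A', 'B', 'C', 'D'])

# Position just past 'B'; every label from here on is dropped.
cutoff = df.columns.get_loc('B') + 1
trimmed = df.drop(columns=df.columns[cutoff:])

print(trimmed.columns.tolist())
```

This returns a new frame and leaves df unchanged. get_loc returns an integer position only when the label is unique in the columns.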