Suppose we have two dataframes:
df = pd.DataFrame(columns = ['a','b','c']) ##empty
d = {'a': [1, 2], 'b': [3, 4]}
df1 = pd.DataFrame(data=d)
How can I join them so that the result looks like this?
a b c
1 3 NaN
2 4 NaN
Use reindex by columns from df:
df = pd.DataFrame(columns = ['a','b','c'])
d = {'a': [1, 2], 'b': [3, 4]}
df1 = pd.DataFrame(data=d).reindex(columns=df.columns)
print (df1)
a b c
0 1 3 NaN
1 2 4 NaN
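As a minimal sketch of what reindex does here (the names `template` and `data` and the extra column `x` are illustrative, not from the original post): columns that the target lists but the data lacks are added and filled with NaN, columns that the target does not list are dropped, and the result follows the target's column order.

```python
import pandas as pd

# Empty frame that only carries the desired column layout.
template = pd.DataFrame(columns=["a", "b", "c"])

# Data with one column missing from the layout ("c") and one extra ("x").
data = pd.DataFrame({"a": [1, 2], "b": [3, 4], "x": [9, 9]})

# Conform the data to the template's columns.
result = data.reindex(columns=template.columns)
print(result)
```

Because reindex returns a new frame, `data` itself is left unchanged.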
Difference between the solutions (reindex and merge): if the column order of df differs from that of df1, the outputs differ:
#different order
df = pd.DataFrame(columns = ['c','a','b'])
d = {'a': [1, 2], 'b': [3, 4]}
df1 = pd.DataFrame(data=d)
print (df1.reindex(columns=df.columns))
c a b
0 NaN 1 3
1 NaN 2 4
print (df1.merge(df,how='left'))
a b c
0 1 3 NaN
1 2 4 NaN
How can I join them
If the dataframe already exists somewhere (rather than being newly created), do:
df1.merge(df,how='left')
a b c
0 1 3 NaN
1 2 4 NaN
Note: The result keeps the columns of df1 first, followed by the columns that exist only in df. If that already matches the column order you want, this works fine; otherwise it does not.
I have two dataframes: df1 has 26,000 rows and df2 has 25,000 rows.
I'm trying to find the data points that are in df1 but not in df2, and vice versa.
This is what I wrote (code below), but when I cross-check the result, it still shows shared data points:
import pandas as pd
df1 = pd.read_csv('df1.csv')
df2 = pd.read_csv('df2.csv')
df_join = pd.concat([df1, df2], axis=1).drop_duplicates(keep=False)
only_df1 = df_join.loc[df_join[df2.columns.to_list()].isnull().all(axis = 1), df1.columns.to_list()]
Order doesn't matter; I just want to know whether a data point exists in one dataframe or the other.
With two dfs:
import pandas as pd
df1 = pd.DataFrame({'a': [1, 2, 3, 4, 5], 'b': [1, 1, 1, 1, 1]})
df2 = pd.DataFrame({'a': [2, 3, 4, 5, 6], 'b': [1, 1, 1, 1, 1]})
print(df1)
print(df2)
a b
0 1 1
1 2 1
2 3 1
3 4 1
4 5 1
a b
0 2 1
1 3 1
2 4 1
3 5 1
4 6 1
You could do:
df_differences = df1.merge(df2, how='outer', indicator=True)
print(df_differences)
Result:
a b _merge
0 1 1 left_only
1 2 1 both
2 3 1 both
3 4 1 both
4 5 1 both
5 6 1 right_only
And then:
only_df1 = df_differences[df_differences['_merge'].eq('left_only')].drop(columns=['_merge'])
only_df2 = df_differences[df_differences['_merge'].eq('right_only')].drop(columns=['_merge'])
print(only_df1)
print()
print(only_df2)
a b
0 1 1
a b
5 6 1
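The two filtering steps above can be wrapped in a small helper. This is only a sketch under stated assumptions: the function name `split_unique_rows` is hypothetical, and it assumes both frames share the same columns so that the merge compares whole rows.

```python
import pandas as pd

def split_unique_rows(left, right):
    """Return (rows only in left, rows only in right), matching on all shared columns."""
    # indicator=True adds a "_merge" column telling where each row came from.
    merged = left.merge(right, how="outer", indicator=True)
    only_left = merged.loc[merged["_merge"] == "left_only"].drop(columns="_merge")
    only_right = merged.loc[merged["_merge"] == "right_only"].drop(columns="_merge")
    return only_left, only_right

df1 = pd.DataFrame({"a": [1, 2, 3, 4, 5], "b": [1, 1, 1, 1, 1]})
df2 = pd.DataFrame({"a": [2, 3, 4, 5, 6], "b": [1, 1, 1, 1, 1]})

only_df1, only_df2 = split_unique_rows(df1, df2)
print(only_df1)
print(only_df2)
```

Rows that are duplicated within one frame are kept as duplicates in the output, so a drop_duplicates() on the inputs may be worth adding if that matters for your data.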
I'm struggling to figure out how to do a couple of transformations with pandas. I want a new dataframe with the sum of the values from the columns in the original. I also want to be able to merge two of these 'summed' dataframes.
Example #1: Summing the columns
Before:
A B C D
1 4 7 0
2 5 8 1
3 6 9 2
After:
A B C D
6 15 24 3
Right now I'm getting the sums of the columns I'm interested in, storing them in a dictionary, and creating a dataframe from the dictionary. I feel like there is a better way to do this with pandas that I'm not seeing.
Example #2: merging 'summed' dataframes
Before:
A B C D F
6 15 24 3 1
A B C D E
1 2 3 4 2
After:
A B C D E F
7 17 27 7 2 1
First question:
Summing the columns
Use sum, then convert the resulting Series to a DataFrame and transpose:
df1 = pd.DataFrame({'A': [1, 2, 3], 'B': [4, 5, 6],
'C': [7, 8, 9], 'D': [0, 1, 2]})
df1 = df1.sum().to_frame().T
print(df1)
# Output:
A B C D
0 6 15 24 3
Second question:
Merging 'summed' dataframes
Use combine
df2 = pd.DataFrame({'A': [1], 'B': [2], 'C': [3], 'D': [4], 'E': [2]})
out = df1.combine(df2, sum, fill_value=0)
print(out)
# Output:
A B C D E
0 7 17 27 7 2
First part: use DataFrame.sum() to sum the columns, then convert the resulting Series to a dataframe with .to_frame(), and finally transpose:
df_sum = df.sum().to_frame().T
Result:
print(df_sum)
A B C D
0 6 15 24 3
Second part, use DataFrame.add() with parameter fill_value, as follows:
df_sum2 = df1.add(df2, fill_value=0)
Result:
print(df_sum2)
A B C D E F
0 7 17 27 7 2.0 1.0
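Both parts can be put together in one runnable sketch using the numbers from the question. The variable names (`before`, `summed`, `left`, `right`, `merged`) are illustrative, and the final cast back to integers is an optional extra step that assumes no NaN values remain.

```python
import pandas as pd

# Part 1: collapse each column to its sum, as a one-row dataframe.
before = pd.DataFrame({"A": [1, 2, 3], "B": [4, 5, 6],
                       "C": [7, 8, 9], "D": [0, 1, 2]})
summed = before.sum().to_frame().T

# Part 2: add two one-row frames whose columns only partly overlap.
left = pd.DataFrame({"A": [6], "B": [15], "C": [24], "D": [3], "F": [1]})
right = pd.DataFrame({"A": [1], "B": [2], "C": [3], "D": [4], "E": [2]})

# fill_value=0 treats a column missing from one side as 0 instead of NaN.
merged = left.add(right, fill_value=0)

# Columns present on only one side come back as floats; cast if integers are wanted.
merged = merged.astype(int)
print(summed)
print(merged)
```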
For each row, I am computing values and storing them in a dictionary. I want to be able to take the dictionary and add it to the row where the keys are columns.
For example:
Dataframe
A B C
1 2 3
Dictionary:
{
'D': 4,
'E': 5
}
Result:
A B C D E
1 2 3 4 5
There will be more than one row in the dataframe, and for each row I'm computing a dictionary that might not necessarily have the same exact keys.
I ended up doing this to get it to work:
# func must be defined before it is used, and it receives the value passed from the row
def func(value):
    ...
    return pd.Series(dictionary)

applied_df = df.apply(lambda row: func(row['a']), axis='columns', result_type='expand')
df = pd.concat([df, applied_df], axis='columns')
If you want the dict values to appear in each row of the original dataframe, use:
d = {
'D': 4,
'E': 5
}
df_result = df.join(df.apply(lambda x: pd.Series(d), axis=1))
Demo
Data Input:
df
A B C
0 1 2 3
1 11 12 13
Output:
df_result = df.join(df.apply(lambda x: pd.Series(d), axis=1))
A B C D E
0 1 2 3 4 5
1 11 12 13 4 5
If you just want the dict to appear in the first row of the original dataframe, use:
d = {
'D': 4,
'E': 5
}
df_result = df.join(pd.Series(d).to_frame().T)
A B C D E
0 1 2 3 4.0 5.0
1 11 12 13 NaN NaN
Simply loop over the dictionary with a for loop and assign each value to a new column.
df = pd.DataFrame(columns=['A', 'B', 'C'], data=[[1,2,3]])
# You can test with df = pd.DataFrame(columns=['A', 'B', 'C'], data=[[1,2,3], [8,0,33]]), too.
d = {
'D': 4,
'E': 5
}
for k,v in d.items():
df[k] = v
print(df)
Output:
A B C D E
0 1 2 3 4 5
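When every row yields its own dictionary, possibly with different keys, as the question describes, one option is to collect the dictionaries in a list and build a second dataframe from them. The sketch below is only an illustration: `compute_extras` is a hypothetical stand-in for the real per-row computation, and the sample values are made up.

```python
import pandas as pd

df = pd.DataFrame({"A": [1, 11], "B": [2, 12], "C": [3, 13]})

def compute_extras(row):
    """Hypothetical per-row computation returning a dict with varying keys."""
    extras = {"D": row["A"] + row["B"]}
    if row["C"] > 10:
        # This key exists only for some rows.
        extras["E"] = row["C"] * 2
    return extras

# One dict per row, in the same order as the rows of df.
records = [compute_extras(row) for _, row in df.iterrows()]

# Keys become columns; a key missing from a row's dict becomes NaN for that row.
extras_df = pd.DataFrame(records, index=df.index)

df_result = df.join(extras_df)
print(df_result)
```

Passing index=df.index keeps the new columns aligned with the original rows even when df does not use a default 0..n-1 index.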
I would like to merge two data frames, df1 and df2:
import pandas as pd
df1 = pd.DataFrame({
'A': ['a'],
'B': ['b'],
'C': ['c']
})
df2 = pd.DataFrame({
'W': [1, 2, 3],
'X': [4, 5, 6],
'Y': [7, 8, 9],
'Z': [10, 11, 12]
})
df1: (will always have only one row)
df2: (can have any number of rows)
I want all the columns of df1 to be added to df2, with every row holding the same values that are present in df1.
I have tried:
df3 = pd.concat([df1,df2], sort=False, axis=1)
But this is giving me NaNs:
But I want all the rows to have the same constant values that are present in df1, like:
I would also like the new columns from df1 to come before the columns of df2, as above.
What might be the most efficient way to achieve this?
We can do a cross join by merging on an artificially created key:
df1.assign(key=1).merge(df2.assign(key=1), on='key').drop('key', axis=1)
A B C W X Y Z
0 a b c 1 4 7 10
1 a b c 2 5 8 11
2 a b c 3 6 9 12
Use DataFrame.assign with the first row of df1 selected, then change the column order with DataFrame.reindex:
df3 = df2.assign(**df1.iloc[0]).reindex(df1.columns.union(df2.columns, sort=False),axis=1)
print (df3)
A B C W X Y Z
0 a b c 1 4 7 10
1 a b c 2 5 8 11
2 a b c 3 6 9 12
Or expand df1 to the rows of df2.index by reindexing with method='ffill':
df3 = pd.concat([df1.reindex(df2.index, method='ffill'),df2], sort=False, axis=1)
print (df3)
A B C W X Y Z
0 a b c 1 4 7 10
1 a b c 2 5 8 11
2 a b c 3 6 9 12
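Another option, not mentioned in the answers above, is the built-in cross join. This sketch assumes pandas 1.2 or later, where merge accepts how='cross', so no artificial key column is needed.

```python
import pandas as pd

df1 = pd.DataFrame({"A": ["a"], "B": ["b"], "C": ["c"]})
df2 = pd.DataFrame({"W": [1, 2, 3], "X": [4, 5, 6],
                    "Y": [7, 8, 9], "Z": [10, 11, 12]})

# Every row of df1 (here just one) is paired with every row of df2;
# the columns of the left frame come first in the result.
df3 = df1.merge(df2, how="cross")
print(df3)
```

Since df1 has exactly one row, the result has as many rows as df2.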
I have two Pandas dataframes. df1 has columns ['a','b','c'] and df2 has columns ['a','c','d']. Now, I create a new dataframe df3 with columns ['a','b','c','d'].
I want to fill df3 with all the inputs from df1 and df2. For example, if I have x rows in df1, and y rows in df2, then I will have x+y rows in df3.
Which Pandas function fills the new dataframe based on partial columns?
Example data:
df1 = pd.DataFrame({'a':[1, 2, 3], 'b':[2, 3, 4], 'd':['h', 'j', 'k']})
df2 = pd.DataFrame({'a':[5, 6, 7], 'b':[1, 1, 1], 'c':[2, 2, 2]})
Code:
df1.append(df2)
Out:
a b c d
0 1 2 NaN h
1 2 3 NaN j
2 3 4 NaN k
0 5 1 2.0 NaN
1 6 1 2.0 NaN
2 7 1 2.0 NaN
Use the append method of DataFrame (deprecated in pandas 1.4 and removed in pandas 2.0, where pd.concat takes its place): https://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.append.html
anotherFrame = df1.append(df2, ignore_index=True)
Another way is merge: https://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.merge.html
df1.merge(df2, how='outer')
How about:
df1 = pd.DataFrame({"a": [1,2], "b": [3,4], "c": [5,6]})
df2 = pd.DataFrame({"a": [7,8], "c": [9,10], "d": [11,12]})
df3 = df1.append(df2, sort=False)
df3
a b c d
0 1 3.0 5 NaN
1 2 4.0 6 NaN
0 7 NaN 9 11.0
1 8 NaN 10 12.0
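Since DataFrame.append was removed in pandas 2.0, the same stacking can be sketched with pd.concat, which works in both older and newer versions. Using ignore_index=True here is a choice made for this example; drop it to keep the original row labels as in the output above.

```python
import pandas as pd

df1 = pd.DataFrame({"a": [1, 2], "b": [3, 4], "c": [5, 6]})
df2 = pd.DataFrame({"a": [7, 8], "c": [9, 10], "d": [11, 12]})

# Stack the rows; columns missing from one frame are filled with NaN.
# sort=False keeps the columns in order of first appearance.
df3 = pd.concat([df1, df2], ignore_index=True, sort=False)
print(df3)
```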