I have DataFrame number 1:
Price Things
0 1 pen
1 2 pencil
2 6 apple
I have DataFrame number 2:
Price Things
0 5 pen
1 6 pencil
2 10 cup
I want to join two DataFrames and I'd like to see this DataFrame:
DataFrame number 1 + DataFrame number 2
Price Things
0 6 pen
1 8 pencil
2 6 apple
3 10 cup
How can I do this?
This code builds the two DataFrames:
import pandas as pd
df = pd.DataFrame({'Price': [1, 2], 'Things': ['pen', 'pencil']})
df.loc[2] = [6, 'apple']  # row values must match the column order (Price, Things)
print("DataFrame number 1")
print(df)
df2 = pd.DataFrame({'Price': [5, 6], 'Things': ['pen', 'pencil']})
df2.loc[2] = [10, 'cup']
print("DataFrame number 2")
print(df2)
You can use the concat function to combine the two DataFrames along axis=0, then group by the Things column and sum:
df3 = pd.concat([df, df2], axis=0).groupby('Things').sum().reset_index()
df3
Output:
Things Price
0 apple 6
1 cup 10
2 pen 6
3 pencil 8
You can merge, add, then drop the interim columns:
common = pd.merge(
df,
df2,
on='Things',
how='outer').fillna(0)
common['Price'] = common.Price_x + common.Price_y
common.drop(['Price_x', 'Price_y'], axis=1, inplace=True)
>>> common
Things Price
0 pen 6.0
1 pencil 8.0
2 apple 6.0
3 cup 10.0
You can also set Things as the index on both DataFrames and then use add(..., fill_value=0):
df.set_index('Things').add(df2.set_index('Things'), fill_value=0).reset_index()
# Things Price
#0 apple 6.0
#1 cup 10.0
#2 pen 6.0
#3 pencil 8.0
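One detail worth noting, as a hedged aside: groupby sorts the keys alphabetically by default, which is why the concat answer lists apple first, while the requested output keeps the original order (pen, pencil, apple, cup). Passing sort=False preserves first-appearance order — a minimal sketch:

```python
import pandas as pd

df = pd.DataFrame({'Price': [1, 2, 6], 'Things': ['pen', 'pencil', 'apple']})
df2 = pd.DataFrame({'Price': [5, 6, 10], 'Things': ['pen', 'pencil', 'cup']})

# sort=False keeps the groups in order of first appearance
# (pen, pencil, apple, cup), matching the requested output
df3 = pd.concat([df, df2]).groupby('Things', sort=False, as_index=False)['Price'].sum()
print(df3)
```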
I have a dataset containing 250 employee names, gender and their salary. I am trying to create a new dataframe to simply 'extract' the salary for males and females respectively. This dataframe would have 2 columns, one with Male Salaries and another with Female Salaries.
From this dataframe, I would like to create a side by side boxplot with matplotlib to analyse if there is any gender wage gap.
# Import libraries
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
df = pd.read_csv("TMA_Data.csv")
df.head()
#Filter out female employees
df_female = df[(df.Gender == "F")]
df_female.head()
#Filter out male employees
df_male = df[(df.Gender == "M")]
df_male.head()
#Create new dataframe with only salaries
df2 = pd.DataFrame(columns = ["Male Salaries", "Female Salaries"])
print(df2)
#Assign Male Salaries column
df2["Male Salaries"] = df_male["Salary"]
df2.head() #This works
Output:
Male Salaries Female Salaries
3 93046 NaN
7 66808 NaN
10 46998 NaN
16 74312 NaN
17 50178 NaN
#Assign Female Salaries column (THIS IS WHERE THE PROBLEM LIES)
df2["Female Salaries"] = df_female["Salary"]
df2.head()
Output:
Male Salaries Female Salaries
3 93046 NaN
7 66808 NaN
10 46998 NaN
16 74312 NaN
17 50178 NaN
How come I am unable to add the values for female salaries (nothing seems to be added)? Also, given that my eventual goal is to create two side-by-side boxplots, feel free to suggest if I can do this in a completely different way. Thank you very much!
Solution:
Use .reset_index:
df2 = pd.DataFrame(columns = ["Male Salaries", "Female Salaries"])
df2["Male Salaries"] = df_male["Salary"].reset_index(drop=True)
df2["Female Salaries"] = df_female["Salary"].reset_index(drop=True)
Explanation:
When you assign a Series as a DataFrame column, the values are aligned by index, not by position.
Your male and female rows have different indices (they come from different rows of the original dataframe), so the female salaries cannot align with the male-salary rows already in df2.
Example:
df = pd.DataFrame([[1], [2], [3]])
df
0
0 1
1 2
2 3
Works as you expected:
df[1] = [4, 5, 6]
df
0 1
0 1 4
1 2 5
2 3 6
Does NOT work as you expected:
df[2] = pd.Series([4, 5, 6], index=[1, 0, 999])
df
0 1 2
0 1 4 5.0
1 2 5 4.0
2 3 6 NaN
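To tie this back to the salary question, a sketch under the assumption that the CSV has Gender/Salary columns as above (the data here is made up, standing in for TMA_Data.csv):

```python
import pandas as pd

# Hypothetical stand-in for TMA_Data.csv
df = pd.DataFrame({
    'Gender': ['M', 'F', 'M', 'F'],
    'Salary': [93046, 52000, 66808, 48000],
})

df_male = df[df.Gender == 'M']
df_female = df[df.Gender == 'F']

# reset_index(drop=True) discards the original row labels, so the two
# columns align positionally instead of by their old, disjoint indices
df2 = pd.DataFrame({
    'Male Salaries': df_male['Salary'].reset_index(drop=True),
    'Female Salaries': df_female['Salary'].reset_index(drop=True),
})
print(df2)
```

From there, df2.boxplot() should draw the side-by-side boxplots; with unequal group sizes the shorter column is padded with NaN, which boxplot ignores.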
I have 3 datasets
All the same shape
CustomerNumber, Name, Status
A customer can appear on 1, 2 or all 3.
Each dataset is a list of gold/silver/bronze.
example data:
Dataframe 1:
100,James,Gold
Dataframe 2:
100,James,Silver
101,Paul,Silver
Dataframe 3:
100,James,Bronze
101,Paul,Bronze
102,Fred,Bronze
Expected output/aggregated list:
100,James,Gold
101,Paul,Silver
102,Fred,Bronze
So a customer that is captured in all 3, I want to keep Status as gold.
Have been playing with join and merge and just can’t get it right.
Use concat, convert the status column to an ordered categorical so that sorting respects the Gold/Silver/Bronze priority, sort by multiple columns, and finally remove duplicates with DataFrame.drop_duplicates:
print (df1)
print (df2)
print (df3)
a b c
0 100 James Gold
a b c
0 100 James Silver
1 101 Paul Silver
a b c
0 101 Paul Bronze
1 102 Fred Bronze
df = pd.concat([df1, df2, df3], ignore_index=True)
df['c'] = pd.Categorical(df['c'], ordered=True, categories=['Gold','Silver','Bronze'])
df = df.sort_values(['a','b','c']).drop_duplicates(['a','b'])
print (df)
a b c
0 100 James Gold
2 101 Paul Silver
4 102 Fred Bronze
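A hedged alternative using the same ordered categorical but no explicit sort: since Gold comes first in the category order, min() per customer picks the best status. A sketch with the question's data:

```python
import pandas as pd

df1 = pd.DataFrame({'a': [100], 'b': ['James'], 'c': ['Gold']})
df2 = pd.DataFrame({'a': [100, 101], 'b': ['James', 'Paul'],
                    'c': ['Silver', 'Silver']})
df3 = pd.DataFrame({'a': [100, 101, 102], 'b': ['James', 'Paul', 'Fred'],
                    'c': ['Bronze', 'Bronze', 'Bronze']})

df = pd.concat([df1, df2, df3], ignore_index=True)
df['c'] = pd.Categorical(df['c'], ordered=True,
                         categories=['Gold', 'Silver', 'Bronze'])

# min() on an ordered categorical returns the highest-priority status
out = df.groupby(['a', 'b'], as_index=False)['c'].min()
print(out)
```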
I have a dataframe which can be generated from the code below
df = pd.DataFrame({'person_id' :[1,2,3],'date1':['12/31/2007','11/25/2009',np.nan],
'hero_id':[2,4,np.nan],'date2':['12/31/2017',np.nan,'10/06/2015'],
'heroine_id':[1,np.nan,5],'date3':['12/31/2027','11/25/2029',np.nan],
'bud_source_value':[1250000,250000,np.nan],
'prod__source_value':[10000,20000,np.nan]})
The resulting dataframe contains NaNs.
What I would like to do is
1) Fill NaNs with 0 (zeros) for columns that end with "id"
2) Fill NaNs with "unknown" for columns that end with "value"
3) Fill NaNs with "12/31/9999" for columns that start with "date"
I tried the approach below, but it's lengthy and doesn't feel elegant
df2 = df.filter(regex='id$')
df2.fillna(0)
df2 = df.filter(regex='^date')
df2.fillna('12/31/9999')
df2 = df.filter(regex='value$')
df2.fillna('unknown')
Is there any way to achieve this in one go? As you can see, I keep repeating the same steps.
For multiple choices based on multiple conditions, you can use np.select:
import numpy as np
# choices
c = df.columns.str
c1 = c.endswith('id')
c2 = c.endswith('value')
c3 = c.startswith('date')
out = np.select([c1,c2,c3], [df.fillna(0), df.fillna('unknown'), df.fillna("12/31/9999")])
pd.DataFrame(out, columns=df.columns)
person_id date1 hero_id date2 heroine_id date3 \
0 1 12/31/2007 2 12/31/2017 1 12/31/2027
1 2 11/25/2009 4 12/31/9999 0 11/25/2029
2 3 12/31/9999 0 10/06/2015 5 12/31/9999
bud_source_value prod__source_value
0 1.25e+06 10000
1 250000 20000
2 unknown unknown
You can use DataFrame.fillna with a dictionary:
d = {col:value for col_s,value in zip(['id','value','date'], [0,'unknown','12/31/9999']) for col in df.filter(like=col_s)}
df = df.fillna(d)
print(df)
person_id date1 hero_id date2 heroine_id date3 \
0 1 12/31/2007 2.0 12/31/2017 1.0 12/31/2027
1 2 11/25/2009 4.0 12/31/9999 0.0 11/25/2029
2 3 12/31/9999 0.0 10/06/2015 5.0 12/31/9999
bud_source_value prod__source_value
0 1.25e+06 10000
1 250000 20000
2 unknown unknown
print(d)
{'person_id': 0,
'hero_id': 0,
'heroine_id': 0,
'bud_source_value': 'unknown',
'prod__source_value': 'unknown',
'date1': '12/31/9999',
'date2': '12/31/9999',
'date3': '12/31/9999'}
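One caveat, offered tentatively: filter(like=...) does substring matching, so it doesn't literally encode the "ends with"/"starts with" conditions. If that distinction ever matters, filter(regex=...) can express it exactly — a sketch on a trimmed-down frame:

```python
import pandas as pd
import numpy as np

df = pd.DataFrame({'person_id': [1, 2, 3],
                   'date1': ['12/31/2007', '11/25/2009', np.nan],
                   'hero_id': [2, 4, np.nan],
                   'bud_source_value': [1250000, 250000, np.nan]})

# Regex anchors encode the asked-for patterns exactly:
# 'id$' = ends with id, 'value$' = ends with value, '^date' = starts with date
d = {col: value
     for regex, value in [('id$', 0), ('value$', 'unknown'), ('^date', '12/31/9999')]
     for col in df.filter(regex=regex)}
df = df.fillna(d)
print(df)
```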
Let's start with a very simplified abstract example. I have a dataframe like this:
import pandas as pd
d = {'1-A': [1, 2], '1-B': [3, 4], '2-A': [3, 4], '5-B': [2, 7]}
df = pd.DataFrame(data=d)
1-A 1-B 2-A 5-B
0 1 3 3 2
1 2 4 4 7
I'm looking for elegant pandastic solution to have dataframe like this:
1 2 5
0 4 3 2
1 6 4 7
To make the example more concrete: column 1-A means person id=1, expense category A. Rows are monthly expenses. As a result, I want monthly expenses per person across categories (so column 1 is the sum of columns 1-A and 1-B). Note that when a person has no expenses in a category, there is no column of 0s. Of course, it should work for more columns (ids and categories).
I'm quite sure that a smart solution with good separation of column selection and the summing operation exists.
Use groupby with a lambda that splits each column name and selects the first part; to group by columns, add axis=1:
df1 = df.groupby(lambda x: x.split('-')[0], axis=1).sum()
#alternative
#df1 = df.groupby(df.columns.str.split('-').str[0], axis=1).sum()
print (df1)
1 2 5
0 4 3 2
1 6 4 7
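A hedged note on versions: DataFrame.groupby(..., axis=1) was deprecated in pandas 2.x, so on newer versions the same idea can be expressed by transposing, grouping the (former) column labels, and transposing back:

```python
import pandas as pd

d = {'1-A': [1, 2], '1-B': [3, 4], '2-A': [3, 4], '5-B': [2, 7]}
df = pd.DataFrame(data=d)

# Transpose so columns become rows, group the labels by the part
# before '-', sum each group, then transpose back
df1 = df.T.groupby(df.columns.str.split('-').str[0]).sum().T
print(df1)
```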
I have two pandas.dataframes df1 and df2. Some of their index are equal. I want to find those index and combine the corresponding rows into a new dataframe.
df1 =
A B
Name
apple 1 5
orange 2 6
banana 3 7
df2 =
A B
Name
apple -1 10
audi -2 11
bmw 0 12
banana 2 8
vw -3 6
The new dataframe that I want is: 1) find the rows with the same index; 2) calculate the average value of the corresponding rows in column 'A'.
df_new =
A_average
Name
apple 0
banana 2.5
This is because: df1 and df2 both have the index apple and banana. The average value of apple in column 'A' is (1-1)/2=0, and the average value of banana in column 'A' is (3+2)/2=2.5.
Do you know how to use Python3 to achieve this? Please note that, in my real application, there can be many more rows than the example I showed above.
Thanks!
Option 1
You could concatenate the two dataframes and group by columns.
pd.concat([df1, df2], axis=1).dropna().mean(axis=1, level=0)
A B
apple 0.0 7.5
banana 2.5 7.5
If it's just A you want, then this should suffice -
pd.concat([df1, df2], axis=1).dropna()['A'].mean(axis=1, level=0)
A
apple 0.0
banana 2.5
Option 2
An alternative would be to find the intersecting indices with index.intersection and select them with loc -
i = df1.index.intersection(df2.index)
df1.loc[i, ['A']].add(df2.loc[i, ['A']]).div(2)
A
Name
apple 0.0
banana 2.5
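A third option, sketched tentatively for newer pandas versions (where mean(axis=1, level=0) is no longer available): stack the common rows with concat and average per index label with groupby(level=0):

```python
import pandas as pd

df1 = pd.DataFrame({'A': [1, 2, 3], 'B': [5, 6, 7]},
                   index=pd.Index(['apple', 'orange', 'banana'], name='Name'))
df2 = pd.DataFrame({'A': [-1, -2, 0, 2, -3], 'B': [10, 11, 12, 8, 6]},
                   index=pd.Index(['apple', 'audi', 'bmw', 'banana', 'vw'], name='Name'))

# Keep only rows whose index appears in both frames, then
# stack them and average column A per index label
i = df1.index.intersection(df2.index)
df_new = pd.concat([df1.loc[i], df2.loc[i]]).groupby(level=0)[['A']].mean()
df_new.columns = ['A_average']
print(df_new)
```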