I need to merge two data frames with different numbers of rows and no common key:
df1:
name | age | loc
Bob | 20 | USA
df2:
food | car | sports
Sushi | Toyota | soccer
meat | Ford | baseball
result I want:
name | age | loc | food | car | sports
Bob | 20 | USA | Sushi | Toyota | soccer
Bob | 20 | USA | Meat | Ford | baseball
My code is below:
pd.merge(df1,df2,how='right',left_index=True,right_index=True)
It works well when df2 has two or more rows, but is incorrect when df2 has only one row.
Any ideas?
Use reindex (reindex_axis in older pandas, now removed) with the index of df2:
df1 = df1.reindex(df2.index, method='ffill')
print (df1)
name age loc
0 Bob 20 USA
1 Bob 20 USA
df = pd.merge(df1,df2,how='right',left_index=True,right_index=True)
print (df)
name age loc food car sports
0 Bob 20 USA Sushi Toyota soccer
1 Bob 20 USA meat Ford baseball
You can use forward fill (.ffill) if there is no NaN data in df1 and df2:
#default outer join
df = pd.concat([df1,df2], axis=1).ffill()
print (df)
name age loc food car sports
0 Bob 20.0 USA Sushi Toyota soccer
1 Bob 20.0 USA meat Ford baseball
df = pd.merge(df1,df2,how='right',left_index=True,right_index=True).ffill()
print (df)
name age loc food car sports
0 Bob 20.0 USA Sushi Toyota soccer
1 Bob 20.0 USA meat Ford baseball
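For completeness, newer pandas (1.2+) can produce the same broadcast directly with a cross join; a minimal sketch on the question's data:
import pandas as pd

df1 = pd.DataFrame({'name': ['Bob'], 'age': [20], 'loc': ['USA']})
df2 = pd.DataFrame({'food': ['Sushi', 'meat'],
                    'car': ['Toyota', 'Ford'],
                    'sports': ['soccer', 'baseball']})

# every row of df1 paired with every row of df2 (Cartesian product)
df = df1.merge(df2, how='cross')
print(df)
This also behaves correctly when df2 has a single row.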
Another type of solution... based on concat.
x = range(0,5)
y = range(5,10)
z = range(10,15)
a = range(10,5,-1)
b = range(15,10,-1)
v = range(0,1)
w = range(2,3)
A = pd.DataFrame(dict(x=x,y=y,z=z))
B = pd.DataFrame(dict(a=a,b=b))
C = pd.DataFrame(dict(v=v,w=w))
>>> pd.concat([A, B], axis=1)
x y z a b
0 0 5 10 10 15
1 1 6 11 9 14
2 2 7 12 8 13
3 3 8 13 7 12
4 4 9 14 6 11
Edit: based on the comments, this solution does not answer the question, because the question's dataframes have different numbers of rows. Here is another solution.
This solution builds a helper dataframe D by repeating the single row of C to match B's length:
n_mult = B.shape[0]
D = pd.concat([C] * n_mult, ignore_index=True)[['v','w']]
pd.concat([D, B], axis=1)
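With the toy data above, this yields:
   v  w   a   b
0  0  2  10  15
1  0  2   9  14
2  0  2   8  13
3  0  2   7  12
4  0  2   6  11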
I am attempting to drop duplicates where the value of a specific column of the duplicated row is zero.
Name Division Clients
0 Dave Sales 0
1 Dave Sales 15
2 Karen Sales 10
3 Rachel HR 20
4 Dan HR 45
5 Dan HR 0
The output I'm hoping to achieve is shown below:
Name Division Clients
1 Dave Sales 15
2 Karen Sales 10
3 Rachel HR 20
4 Dan HR 45
Any assistance anyone could provide would be greatly appreciated.
You can build a boolean check for Clients == 0 and find all duplicates based on Name and Division, then combine the two with &, invert, and use the result as a boolean mask:
c = df['Clients'].eq(0)
df[~(df.duplicated(['Name','Division'],keep=False) & c)]
Name Division Clients
1 Dave Sales 15
2 Karen Sales 10
3 Rachel HR 20
4 Dan HR 45
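For reference, the df from the question can be built like this, so the snippet above runs as-is:
import pandas as pd

df = pd.DataFrame({'Name': ['Dave', 'Dave', 'Karen', 'Rachel', 'Dan', 'Dan'],
                   'Division': ['Sales', 'Sales', 'Sales', 'HR', 'HR', 'HR'],
                   'Clients': [0, 15, 10, 20, 45, 0]})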
Thanks to Seabean, consider the following df:
df1 = pd.concat([df, pd.DataFrame([['Dave','HR',0]], columns=df.columns)], ignore_index=True)
print(df1)
Name Division Clients
0 Dave Sales 0
1 Dave Sales 15
2 Karen Sales 10
3 Rachel HR 20
4 Dan HR 45
5 Dan HR 0
6 Dave HR 0
c = df1['Clients'].eq(0)
print(df1[~(df1.duplicated(['Name','Division'],keep=False) & c)])
Name Division Clients
1 Dave Sales 15
2 Karen Sales 10
3 Rachel HR 20
4 Dan HR 45
6 Dave HR 0
It depends on how your data is organized... If you're reading in from a csv you could do something like this:
# Get the data:
data = pd.read_csv("employees.csv")
# Sort by Clients so each pair's zero row comes before its non-zero row:
data.sort_values("Clients", inplace=True)
# Drop duplicates based on Name and Division, keeping the last (non-zero) row:
data.drop_duplicates(subset=["Name", "Division"], keep="last", inplace=True)
| 1st Most Common Value | 2nd Most Common Value | 3rd Most Common Value | 4th Most Common Value | 5th Most Common Value |
|-----------------------|-----------------------|-----------------------|-----------------------|-----------------------|
| Grocery Store | Pub | Coffee Shop | Clothing Store | Park |
| Pub | Grocery Store | Clothing Store | Park | Coffee Shop |
| Hotel | Theatre | Bookstore | Plaza | Park |
| Supermarket | Coffee Shop | Pub | Park | Cafe |
| Pub | Supermarket | Coffee Shop | Cafe | Park |
The dataframe is named df0. As you can see, many values repeat across the columns, so I want to create a dataframe that holds every unique value together with its frequency across all columns. Can someone please help with the code? I want to create a bar plot of it.
The Output should be as follows:
| Venues | Count |
|----------------|-------|
| Bookstore | 1 |
| Cafe | 2 |
| Coffee Shop | 4 |
| Clothing Store | 2 |
| Grocery Store | 2 |
| Hotel | 1 |
| Park | 5 |
| Plaza | 1 |
| Pub | 4 |
| Supermarket | 2 |
| Theatre | 1 |
EDIT: I got ahead of myself in my original answer (also, thanks OP for adding the edit/expected output). You want this post; I think the simplest answer is:
new_df = pd.DataFrame(df0.stack().value_counts())
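To shape that into the question's Venues/Count layout and plot it (a sketch; the Venues/Count names come from the expected output, and plotting assumes matplotlib is installed):
new_df = (df0.stack().value_counts()
             .rename_axis('Venues')
             .reset_index(name='Count')
             .sort_values('Venues'))
new_df.plot.bar(x='Venues', y='Count')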
If you don't care about which column the values are coming from, and you just want their counts, then use value_counts() (as #Celius Stingher said in the comments), following this post.
If you do want to report the frequency of each value for each column, you can use value_counts() for each column, but you may end up with uneven entries (to get back into a DataFrame, you could do some sort of join).
I instead made a little function to count the occurrences of values in a df, and return a new one:
import pandas as pd
import numpy as np
def counted_entries(df, array):
    # one row per candidate value, one column per column of df
    output = pd.DataFrame(columns=df.columns, index=array)
    for i in array:
        # count the occurrences of value i in each column
        output.loc[i] = (df == i).sum()
    return output
This works for a df filled with random animal names. You just pass the unique entries of the df, obtained from the set of its values:
columns = ['Column ' + str(i+1) for i in range(10)]
index = ['Row ' + str(i+1) for i in range(5)]
df = pd.DataFrame(np.random.choice(['pig','cow','sheep','horse','dog'],size=(5,10)), columns=columns, index=index)
unique_vals = list(set(df.stack())) #this is all the possible entries in the df
df2 = counted_entries(df, unique_vals)
df before:
Column 1 Column 2 Column 3 Column 4 ... Column 7 Column 8 Column 9 Column 10
Row 1 pig pig cow cow ... cow pig dog pig
Row 2 sheep cow pig sheep ... dog pig pig cow
Row 3 cow cow cow sheep ... horse dog sheep sheep
Row 4 sheep cow sheep cow ... cow horse pig pig
Row 5 dog pig sheep sheep ... sheep sheep horse horse
output of counted_entries()
Column 1 Column 2 Column 3 ... Column 8 Column 9 Column 10
pig 1 2 1 ... 2 2 2
horse 0 0 0 ... 1 1 1
sheep 2 0 2 ... 1 1 1
dog 1 0 0 ... 1 1 0
cow 1 3 2 ... 0 0 1
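To collapse this per-column table into a single count per value, as in the question's expected output, sum across the columns:
totals = df2.sum(axis=1)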
Thank you for the edit, maybe this is what you are looking for, using value_counts for the full dataframe and then aggregating the output:
df0 = pd.DataFrame({'1st':['Grocery','Pub','Hotel','Supermarket','Pub'],
'2nd':['Pub','Grocery','Theatre','Coffee','Supermarket'],
'3rd':['Coffee','Clothing','Supermarket','Pub','Coffee'],
'4th':['Clothing','Park','Plaza','Park','Cafe'],
'5th':['Park','Coffee','Park','Cafe','Park']})
df1 = df0.apply(pd.Series.value_counts)
df1['Count'] = df1.sum(axis=1)
df1 = df1.reset_index().rename(columns={'index':'Venues'}).drop(columns=list(df0))
print(df1)
Output:
Venues Count
5 Park 5.0
2 Coffee 4.0
7 Pub 4.0
8 Supermarket 3.0
0 Cafe 2.0
1 Clothing 2.0
3 Grocery 2.0
4 Hotel 1.0
6 Plaza 1.0
9 Theatre 1.0
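If you want the venues in alphabetical order, as in the expected output, finish with df1 = df1.sort_values('Venues').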
You can also do this:
df = pd.read_csv('test.csv', sep=',')
list_of_list = df.values.tolist()
t_list = sum(list_of_list, [])  # flatten the rows into a single list
df = pd.DataFrame(t_list, columns=['Columns'])
df = df.groupby('Columns').size().reset_index(name='Count')
print(df)
Columns Count
0 Bookstore 1
1 Cafe 2
2 Clothing Store 2
3 Coffee Shop 4
4 Grocery Store 2
5 Hotel 1
6 Park 5
7 Plaza 1
8 Pub 4
9 Supermarket 2
10 Theatre 1
I have the following dataframe:
data = {'Name': ['Peter | Jacker', 'John | Parcker', 'Paul | Cash', 'Tony'],
'Age': [10, 45, 14, 65]}
df = pd.DataFrame(data)
What I want to extract is the nicknames (the word after the '|' character), but only for people older than 16. For that I am using the following code:
df['nickname'] = df.apply(lambda x: x.str.split('|', 1)[-1] if x['Age'] > 16 else 0, axis=1)
However, when I print the nickname I only getting the following results:
Name Age nickname
Peter | Jacker 10 0.0
John | Parcker 45 NaN
Paul | Cash 14 0.0
Tony 65 NaN
And what I want is this:
Name Age nickname
Peter | Jacker 10 NaN
John | Parcker 45 Parcker
Paul | Cash 14 NaN
Tony 65 NaN
What am I doing wrong?
Use numpy.where to select the second element of the split lists where the condition matches, and fill in missing values (or 0, whatever you need) otherwise:
import numpy as np
df['nickname'] = np.where(df['Age'] > 16, df['Name'].str.split('|', n=1).str[1].str.strip(), np.nan)
print (df)
Name Age nickname
0 Peter | Jacker 10 NaN
1 John | Parcker 45 Parcker
2 Paul | Cash 14 NaN
3 Tony 65 NaN
Apply a split function on the Name column, row by row. Try the code below:
import numpy as np
df['nickname'] = df.apply(lambda x: x['Name'].split('|', 1)[-1].strip() if x['Age'] > 16 and len(x['Name'].split('|', 1)) > 1 else np.nan, axis=1)
print(df)
Name Age nickname
0 Peter | Jacker 10 NaN
1 John | Parcker 45 Parcker
2 Paul | Cash 14 NaN
3 Tony 65 NaN
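Another option (a sketch of my own, not from the answers above): extract the nickname with a regex, then blank out the under-16 rows:
df['nickname'] = (df['Name'].str.extract(r'\|\s*(.+)$', expand=False)
                            .where(df['Age'] > 16))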
Here's my dataframe:
user1 user2 cat quantity + other quantities
----------------------------------------------------
Alice Bob 0 ....
Alice Bob 1 ....
Alice Bob 2 ....
Alice Carol 0 ....
Alice Carol 2 ....
I want to make sure that every user1-user2 pair has a row corresponding to each category (there are three: 0,1,2). If not, I want to insert a row, and set the other columns to zero.
user1 user2 cat quantity + other quantities
----------------------------------------------------
Alice Bob 0 ....
Alice Bob 1 ....
Alice Bob 2 ....
Alice Carol 0 ....
Alice Carol 1 <SET ALL TO ZERO>
Alice Carol 2 ....
What I have so far is the list of all user1-user2 pairs that have fewer than 3 values for cat:
df.groupby(['user1','user2']).agg({'cat':'count'}).reset_index()[['user1','user2']]
I could iterate over these pairs, but that would take a long time (there are >1M of them). I've looked at other solutions for inserting rows in pandas based on some condition (like "Pandas/Python adding row based on condition" and "Insert row in Pandas Dataframe based on a condition"), but they're not exactly the same.
Also, since this is a huge dataset, the solution has to be vectorized. How should I proceed?
Use set_index, then reindex with a MultiIndex.from_product:
print (df)
user1 user2 cat quantity a
0 Alice Bob 0 2 4
1 Alice Bob 1 3 4
2 Alice Bob 2 4 4
3 Alice Carol 0 6 4
4 Alice Carol 2 3 4
df = df.set_index(['user1','user2', 'cat'])
mux = pd.MultiIndex.from_product(df.index.levels, names=df.index.names)
df = df.reindex(mux, fill_value=0).reset_index()
print (df)
user1 user2 cat quantity a
0 Alice Bob 0 2 4
1 Alice Bob 1 3 4
2 Alice Bob 2 4 4
3 Alice Carol 0 6 4
4 Alice Carol 1 0 0
5 Alice Carol 2 3 4
Another solution is to create a new DataFrame from all combinations of the columns' unique values and merge with a right join:
from itertools import product
df1 = pd.DataFrame(list(product(df['user1'].unique(),
df['user2'].unique(),
df['cat'].unique())), columns=['user1','user2', 'cat'])
df = df.merge(df1, how='right').fillna(0)
print (df)
user1 user2 cat quantity a
0 Alice Bob 0 2.0 4.0
1 Alice Bob 1 3.0 4.0
2 Alice Bob 2 4.0 4.0
3 Alice Carol 0 6.0 4.0
4 Alice Carol 2 3.0 4.0
5 Alice Carol 1 0.0 0.0
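The right join introduces NaN before fillna runs, so the numeric columns come back as floats; cast them back if you need integers:
df[['quantity', 'a']] = df[['quantity', 'a']].astype(int)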
EDIT2: If you only need the user1-user2 pairs that actually appear (MultiIndex.from_product would otherwise build every combination of users), join the two columns into a single key first:
df['user1'] = df['user1'] + '_' + df['user2']
df = df.set_index(['user1', 'cat']).drop(columns='user2')
mux = pd.MultiIndex.from_product(df.index.levels, names=df.index.names)
df = df.reindex(mux, fill_value=0).reset_index()
df[['user1','user2']] = df['user1'].str.split('_', expand=True)
print (df)
user1 cat quantity a user2
0 Alice 0 2 4 Bob
1 Alice 1 3 4 Bob
2 Alice 2 4 4 Bob
3 Alice 0 6 4 Carol
4 Alice 1 0 0 Carol
5 Alice 2 3 4 Carol
EDIT3: A groupby-based alternative that reindexes each pair's cat values to the full set of categories:
cols = df.columns.difference(['user1','user2'])
df = (df.groupby(['user1','user2'])[cols]
.apply(lambda x: x.set_index('cat').reindex(df['cat'].unique(), fill_value=0))
.reset_index())
print (df)
user1 user2 cat a quantity
0 Alice Bob 0 4 2
1 Alice Bob 1 4 3
2 Alice Bob 2 4 4
3 Alice Carol 0 4 6
4 Alice Carol 1 0 0
5 Alice Carol 2 4 3
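Another vectorized option (a sketch of my own, not from the answer above): unstack cat with fill_value=0, then stack it back, which only fills in the pairs that actually occur:
df = (df.set_index(['user1', 'user2', 'cat'])
        .unstack('cat', fill_value=0)   # one column per cat, missing cats filled with 0
        .stack('cat')                   # move cat back into the index
        .reset_index())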
I have a dataframe with cells that hold multiple pipe-delimited values. I need to pair those values with the corresponding pipe-delimited values in another column, give each pair its own row, and carry along the single-valued data from the original row. My dataframe (df1) looks like this:
index name city amount
1 frank | john | dave toronto | new york | anaheim 10
2 george | joe | fred fresno | kansas city | reno 20
I need it to look like this:
index name city amount
1 frank toronto 10
2 john new york 10
3 dave anaheim 10
4 george fresno 20
5 joe kansas city 20
6 fred reno 20
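A minimal sketch of one way to do this (my own; multi-column explode needs pandas 1.3+):
import pandas as pd

df1 = pd.DataFrame({'name': ['frank | john | dave', 'george | joe | fred'],
                    'city': ['toronto | new york | anaheim', 'fresno | kansas city | reno'],
                    'amount': [10, 20]})

# split each pipe-delimited cell into a list (the regex also eats the surrounding spaces)
df1[['name', 'city']] = df1[['name', 'city']].apply(lambda s: s.str.split(r'\s*\|\s*'))

# explode both list columns together; amount is repeated for each new row
out = df1.explode(['name', 'city'], ignore_index=True)
print(out)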