Table 1
Category Date Value
A 01/01/2015 4
A 02/01/2015 1
B 01/01/2015 6
B 02/01/2015 7
Table 1 above has the values for each category organized by month.
Table 2
Category Date Value
A 03/01/2015 10
C 03/01/2015 66
D 03/01/2015 9
Suppose Table 2 comes in, which has the values for each category in March 2015.
Table 3
Category Date Value
A 01/01/2015 4
A 02/01/2015 1
A 03/01/2015 10
B 01/01/2015 6
B 02/01/2015 7
B 03/01/2015 0
C 01/01/2015 0
C 02/01/2015 0
C 03/01/2015 66
D 01/01/2015 0
D 02/01/2015 0
D 03/01/2015 9
I want to "outer-join" the two tables "vertically" on Python:
If Table 2 has a category that Table 1 doesn't, that category is added to Table 3 with a value of 0 for 01/01/2015 and 02/01/2015. Likewise, a category that is in Table 1 but not in Table 2 is added to Table 3 with a value of 0 for 03/01/2015. Categories that appear in both tables are simply stacked vertically with their values from Table 1 and Table 2.
Any advice or help will be greatly appreciated. I've been thinking about this all day and still can't find an efficient way to do it.
Thanks so much!
I would do this using Pandas as follows (I'll call your tables df1 and df2):
First prepare the sets of dates and categories for the final table, using set together with concat to select only the unique values from your original tables:
import itertools
import numpy as np
import pandas as pd
dates = set(pd.concat([df1.Date,df2.Date]))
cats = set(pd.concat([df1.Category,df2.Category]))
Then we create the landing table by iterating through these sets (itertools.product is a neat trick here, although note that you have to cast the result to a list to feed it into the DataFrame constructor):
df3 = pd.DataFrame(list(itertools.product(cats,dates)),columns = ['Category','Date'])
df3
Out[88]:
Category Date
0 D 01/01/2015
1 D 03/01/2015
2 D 02/01/2015
3 C 01/01/2015
4 C 03/01/2015
5 C 02/01/2015
6 A 01/01/2015
7 A 03/01/2015
8 A 02/01/2015
9 B 01/01/2015
10 B 03/01/2015
11 B 02/01/2015
Finally we bring in the values using merge (with how='left'), using np.fmax to consolidate the two sets (you have to use fmax instead of max so the NaNs are ignored):
df3['Value'] = np.fmax(pd.merge(df3,df1,how='left')['Value'],pd.merge(df3,df2,how='left')['Value'])
df3
Out[127]:
Category Date Value
0 D 01/01/2015 NaN
1 D 03/01/2015 9
2 D 02/01/2015 NaN
3 C 01/01/2015 NaN
4 C 03/01/2015 66
5 C 02/01/2015 NaN
6 A 01/01/2015 4
7 A 03/01/2015 10
8 A 02/01/2015 1
9 B 01/01/2015 6
10 B 03/01/2015 NaN
11 B 02/01/2015 7
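As a small follow-up (to match Table 3 from the question), the remaining NaNs correspond to combinations that appear in only one table and can be filled with zeros:
df3['Value'] = df3['Value'].fillna(0)  # missing combinations become 0, as in Table 3
df3 = df3.sort_values(['Category', 'Date']).reset_index(drop=True)  # optional: order the rows like Table 3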
Related
I want to merge df1 and df2, which have different lengths. The join on the key column should pull in the df2 value for every matching key, since the key values repeat in df1.
df1
key value
1   5
1   5
2   9
3   11
4   14
4   14
df2
key value
1   a
2   b
3   c
4   d
output
key value value
1   5     a
1   5     a
2   9     b
3   11    c
4   14    d
4   14    d
I'm trying
output = pd.merge(df1, df2, left_on = 'key', right_on = 'key')
But it's creating extra rows.
Thanks for your help in advance.
Try:
pd.merge(df1, df2, on='key', how='left').rename(columns={'value_x': 'value', 'value_y': 'value'})
Prints:
  key value value
0   1     5     a
1   1     5     a
2   2     9     b
3   3    11     c
4   4    14     d
5   4    14     d
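For reference, a minimal runnable version of the above, assuming the lowercase column names shown in the question's frames:
import pandas as pd

# reconstruction of the question's frames (column names assumed as shown in the question)
df1 = pd.DataFrame({'key': [1, 1, 2, 3, 4, 4],
                    'value': [5, 5, 9, 11, 14, 14]})
df2 = pd.DataFrame({'key': [1, 2, 3, 4],
                    'value': ['a', 'b', 'c', 'd']})

# a left merge keeps every row of df1 and looks up the matching df2 value for each key
output = pd.merge(df1, df2, on='key', how='left')
print(output)  # columns: key, value_x, value_y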
Having two data frames:
df1 = pd.DataFrame({'a':[1,2,3],'b':[4,5,6]})
a b
0 1 4
1 2 5
2 3 6
df2 = pd.DataFrame({'c':[7],'d':[8]})
c d
0 7 8
The goal is to add all of df2's column values to df1, repeated for every row, to create the following result. It is assumed that the two data frames do not share any column names.
a b c d
0 1 4 7 8
1 2 5 7 8
2 3 6 7 8
If the column names are strings, you can use DataFrame.assign and unpack the Series created by selecting the first row of df2:
df = df1.assign(**df2.iloc[0])
print (df)
a b c d
0 1 4 7 8
1 2 5 7 8
2 3 6 7 8
Another idea is to repeat the values over df1.index with DataFrame.reindex and then use DataFrame.join (this works here because the first index value of df2 is the same as the first index value of df1.index):
df = df1.join(df2.reindex(df1.index, method='ffill'))
print (df)
a b c d
0 1 4 7 8
1 2 5 7 8
2 3 6 7 8
If there are no missing values in the original df, you can simply forward fill the missing values in a last step, but note that the dtypes are changed to floats (thanks @Dishin H Goyan):
df = df1.join(df2).ffill()
print (df)
a b c d
0 1 4 7.0 8.0
1 2 5 7.0 8.0
2 3 6 7.0 8.0
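On newer pandas versions (a sketch assuming pandas >= 1.2, where merge supports how='cross'), a cross join gives the same result directly, since df2 has a single row:
# every row of df1 is paired with the single row of df2
df = df1.merge(df2, how='cross')
print(df)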
I want to group my dataframe by two columns and then sort the aggregated results within the groups.
In [167]:df
count job source
0 2 sales A
1 4 sales B
2 6 sales C
3 3 sales D
4 7 sales E
5 5 market A
6 3 market B
7 2 market C
8 4 market D
9 1 market E
df.groupby(['job','source']).agg({'count':sum})
Out[168]:
job source count
market A 5
B 3
C 2
D 4
E 1
sales A 2
B 4
C 6
D 3
E 7
I would now like to sort the count column in descending order within each of the groups. And then take only the top three rows. To get something like:
job source count
market A 5
D 4
B 3
sales E 7
C 6
B 4
I also want to sort the result with respect to job, so that if the total count for sales is larger, the data is printed as:
job source count
sales E 7
C 6
B 4
market A 5
D 4
B 3
I am unable to get the top rows per job.
IIUC, we can do a further groupby and use nlargest(3) to get the top n values.
Then we can create an ordered list of jobs, sorted by their totals, and use it to build a categorical column for the final sort.
s = df.groupby(['job','source']).agg({'count':sum}).groupby(level=0)['count']\
.nlargest(3).reset_index(0,drop=True).to_frame()
# see which of your indices is higher and create a sorting list.
sorter = s.groupby(level=0)['count'].sum().sort_values(ascending=False).index
#Index(['sales', 'market'], dtype='object', name='job')
s['sort'] = pd.Categorical(s.index.get_level_values(0),sorter)
df2 = s.sort_values('sort').drop('sort',axis=1)
print(df2)
count
job source
sales E 7
C 6
B 4
market A 5
D 4
B 3
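A slightly shorter variant of the last step (a sketch reusing s and sorter from above): since sorter already lists the jobs in the desired order, you can select at the outer index level directly instead of building a categorical column:
# .loc with a list of outer-level labels returns the groups in that order
df2 = s[['count']].loc[sorter]
print(df2)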
You could use sort_values after aggregation, as mentioned in another similar answer, and then group by job again to take the top N per job, like:
>>> df
count job source
0 2 sales A
1 4 sales B
2 6 sales C
3 3 sales D
4 7 sales E
5 5 market A
6 3 market B
7 2 market C
8 4 market D
9 1 market E
>>> agg = df.groupby(['job','source']).agg({'count':sum})
>>> agg
count
job source
market A 5
B 3
C 2
D 4
E 1
sales A 2
B 4
C 6
D 3
E 7
>>> agg.reset_index().sort_values(['job', 'count'], ascending=False).set_index(['job', 'source']).groupby('job').head(3)
count
job source
sales E 7
C 6
B 4
market A 5
D 4
B 3
>>>
I need to remove all rows from a pandas.DataFrame, which satisfy an unusual condition.
If there is an exactly identical row except that it has a NaN value in column "C", I want to remove the NaN row.
Given a table:
A B C D
1 2 NaN 3
1 2 50 3
10 20 NaN 30
5 6 7 8
I need to remove the first row, since it has NaN in column C but there is an otherwise identical row (the second) with a real value in column C.
However, the third row must stay, because there are no other rows with the same A, B and D values.
How do you perform this using pandas? Thank you!
You can achieve this using drop_duplicates.
Initial DataFrame:
df=pd.DataFrame(columns=['a','b','c','d'], data=[[1,2,None,3],[1,2,50,3],[10,20,None,30],[5,6,7,8]])
df
a b c d
0 1 2 NaN 3
1 1 2 50 3
2 10 20 NaN 30
3 5 6 7 8
Then you can sort the DataFrame by column c. This will push the NaNs to the bottom of the column:
df = df.sort_values(['c'])
df
a b c d
3 5 6 7 8
1 1 2 50 3
0 1 2 NaN 3
2 10 20 NaN 30
Then remove duplicates based on the remaining columns (ignoring c), keeping the first row encountered:
df1 = df.drop_duplicates(['a','b','d'], keep='first')
a b c d
3 5 6 7 8
1 1 2 50 3
2 10 20 NaN 30
But this is only valid if the NaNs are in column c.
You can try fillna along with drop_duplicates:
# fill NaNs from neighbouring rows, drop A/B/D duplicates, then select the surviving rows from the original df so C keeps its NaN
df.loc[df.bfill().ffill().drop_duplicates(subset=['A', 'B', 'D'], keep='last').index]
This also handles the scenario where the A, B and D values are the same but both rows have non-NaN values in C.
You get
A B C D
1 1 2 50 3
2 10 20 NaN 30
3 5 6 7 8
This feels right to me
# rows that are NOT duplicated when column C is ignored
notdups = ~df.duplicated(df.columns.difference(['C']), keep=False)
# rows whose C value is present (not NaN)
notnans = df.C.notnull()
# keep a row if it is unique ignoring C, or if it has a real C value
df[notdups | notnans]
A B C D
1 1 2 50.0 3
2 10 20 NaN 30
3 5 6 7.0 8
Let's say I create a DataFrame:
import pandas as pd
df = pd.DataFrame({"a": [1,2,3,13,15], "b": [4,5,6,6,6], "c": ["wish", "you","were", "here", "here"]})
Like so:
a b c
0 1 4 wish
1 2 5 you
2 3 6 were
3 13 6 here
4 15 6 here
... and then group and aggregate by a couple columns ...
gb = df.groupby(['b','c']).agg({"a": lambda x: x.nunique()})
Yielding the following result:
a
b c
4 wish 1
5 you 1
6 here 2
were 1
Is it possible to merge df with the newly aggregated table gb such that I create a new column in df, containing the corresponding values from gb? Like this:
a b c nc
0 1 4 wish 1
1 2 5 you 1
2 3 6 were 1
3 13 6 here 2
4 15 6 here 2
I tried doing the simplest thing:
df.merge(gb, on=['b','c'])
But this gives the error:
KeyError: 'b'
Which makes sense because the grouped table has a Multi-index and b is not a column. So my question is two-fold:
Can I transform the multi-index of the gb DataFrame back into columns (so that it has the b and c column)?
Can I merge df with gb on the column names?
Whenever you want to add an aggregated column from a groupby operation back to the original df, you should use transform; it produces a Series whose index is aligned with your original df:
In [4]:
df['nc'] = df.groupby(['b','c'])['a'].transform(pd.Series.nunique)
df
Out[4]:
a b c nc
0 1 4 wish 1
1 2 5 you 1
2 3 6 were 1
3 13 6 here 2
4 15 6 here 2
There is no need to reset the index or perform an additional merge.
There's a simple way of doing this using reset_index().
df.merge(gb.reset_index(), on=['b','c'])
gives you
a_x b c a_y
0 1 4 wish 1
1 2 5 you 1
2 3 6 were 1
3 13 6 here 2
4 15 6 here 2
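If you also want the merged column to be named nc as in the desired output (a small variation on the answer above), you can rename the aggregated column before resetting the index:
# rename gb's aggregated column so the merge does not produce a_x / a_y
df.merge(gb.rename(columns={'a': 'nc'}).reset_index(), on=['b', 'c'])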