I have the following data frame.
ID Product quantity
9626 a 1
9626 b 1
9626 c 1
6600 f 1
6600 a 1
6600 d 1
And I want to join rows by ID.
Below is an example of the results.
(The quantity column is optional; it does not have to appear in the result.)
ID Product quantity
9626 a,b,c 3
6600 a,d,f 3
I used merge and sum, but it did not work.
Can this only be solved with a loop?
I'd appreciate it if you could provide me with a solution.
Use groupby.agg:
df = (df.sort_values('Product')
        .groupby('ID', as_index=False, sort=False)
        .agg({'Product': ','.join, 'quantity': 'sum'}))
print(df)
ID Product quantity
0 9626 a,b,c 3
1 6600 a,d,f 3
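If the quantity column really isn't needed, you can simply leave it out of the aggregation. A minimal sketch, assuming the frame is built exactly as shown in the question:
import pandas as pd

df = pd.DataFrame({'ID': [9626, 9626, 9626, 6600, 6600, 6600],
                   'Product': ['a', 'b', 'c', 'f', 'a', 'd'],
                   'quantity': [1, 1, 1, 1, 1, 1]})

# Only the comma-joined products per ID, without summing quantity
out = (df.sort_values('Product')
         .groupby('ID', as_index=False, sort=False)
         .agg({'Product': ','.join}))
print(out)
#      ID Product
# 0  9626   a,b,c
# 1  6600   a,d,f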
I have two dataframes; simplified, they look like this:
Dataframe A
ID  item
1   apple
2   peach
Dataframe B
ID  flag  price ($)
1   A     3
1   B     2
2   B     4
2   A     2
ID: unique identifier for each item
flag: unique identifier for each vendor
price: varies for each vendor
In this simplified case I want to extract the price values of dataframe B and add them to dataframe A in separate columns depending on their flag value.
The result should look similar to this:
Dataframe C
ID  item   price_A  price_B
1   apple  3        2
2   peach  2        4
I tried to split dataframe B into two dataframes by their flag values and merge them with dataframe A afterwards, but there must be an easier solution.
Thank you in advance! :)
*edit: removed the pictures
You can use pd.merge and pd.pivot_table for this:
df_C = pd.merge(df_A, df_B, on=['ID']).pivot_table(index=['ID', 'item'], columns='flag', values='price ($)')
df_C.columns = ['price_' + alpha for alpha in df_C.columns]
df_C = df_C.reset_index()
Output:
>>> df_C
ID item price_A price_B
0 1 apple 3 2
1 2 peach 2 4
Alternatively, as a single method chain:
(dfb
.merge(dfa, on="ID")
.pivot_table(index=['ID', 'item'], columns='flag', values='price ($)')
.add_prefix("price_")
.reset_index()
)
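If every (ID, flag) pair occurs only once, as in the sample, a plain pivot also works and avoids pivot_table's implicit aggregation. A sketch, assuming the frames are named df_A and df_B as in the first snippet:
# Spread prices into one column per flag, then attach the item names
wide = (df_B.pivot(index='ID', columns='flag', values='price ($)')
            .add_prefix('price_')
            .reset_index())
df_C = df_A.merge(wide, on='ID')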
I am trying to join two dataframes by ID and Date. However, the date criterion is that a.date <= b.date, and when several rows of A qualify for an ID, I want the one with the max a.date (while still <= b.date). How would I do that?
Dataframe A (cumulative sales table)
ID| date | cumulative_sales
1 | 2020-01-01 | 10
1 | 2020-01-03 | 15
1 | 2021-01-02 | 20
Dataframe B
ID| date | cumulative_sales (up to this date, how much was purchased for a given ID?)
1 | 2020-05-01 | 15
In SQL, I would join on a.date <= b.date, then use dense_rank() and take the max value within that partition for each ID. I'm not sure how to approach this with pandas. Any suggestions?
Looks like you simply want a merge_asof:
dfA['date'] = pd.to_datetime(dfA['date'])
dfB['date'] = pd.to_datetime(dfB['date'])
out = pd.merge_asof(dfB.sort_values(by='date'),
                    dfA.sort_values(by='date'),
                    on='date', by='ID')
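merge_asof's default direction='backward' already implements "take the last dfA row whose date is at or before the dfB date". A small self-contained sketch with the single-ID sample from the question (column names assumed as shown above):
import pandas as pd

dfA = pd.DataFrame({'ID': [1, 1, 1],
                    'date': ['2020-01-01', '2020-01-03', '2021-01-02'],
                    'cumulative_sales': [10, 15, 20]})
dfB = pd.DataFrame({'ID': [1],
                    'date': ['2020-05-01'],
                    'cumulative_sales': [15]})

dfA['date'] = pd.to_datetime(dfA['date'])
dfB['date'] = pd.to_datetime(dfB['date'])

out = pd.merge_asof(dfB.sort_values('date'), dfA.sort_values('date'),
                    on='date', by='ID', suffixes=('_b', '_a'))
print(out)
# The 2020-01-03 row of dfA (cumulative_sales 15) is the latest one
# at or before 2020-05-01, so it is the matched row.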
Here's a way to do what your question asks:
dfA = dfA.sort_values(['ID', 'date']).join(
    dfB.set_index('ID'), on='ID', rsuffix='_b').query('date <= date_b').drop(
    columns='date_b').groupby(['ID']).last().reset_index()
Explanation:
sort dfA by ID, date
use join to join with dfB on ID and bring the columns from dfB in with the suffix _b
use query to keep only rows where dfA.date <= dfB.date
use groupby on ID and then last to select the row with the highest remaining value of dfA.date (i.e., the highest dfA.date that is <= dfB.date for each ID)
use reset_index to convert ID from an index level back into a column label
Full test code:
import pandas as pd
dfA = pd.DataFrame({'ID':[1,1,1,2,2,2], 'date':['2020-01-01','2020-01-03','2020-01-02','2020-01-01','2020-01-03','2020-01-02'], 'cumulative_sales':[10,15,20,30,40,50]})
dfB = pd.DataFrame({'ID':[1,2], 'date':['2020-05-01','2020-01-01'], 'cumulative_sales':[15,30]})
print(dfA)
print(dfB)
dfA = dfA.sort_values(['ID', 'date']).join(
    dfB.set_index('ID'), on='ID', rsuffix='_b').query(
    'date <= date_b').drop(columns='date_b').groupby(['ID']).last().reset_index()
print(dfA)
Input:
dfA:
ID date cumulative_sales
0 1 2020-01-01 10
1 1 2020-01-03 15
2 1 2020-01-02 20
3 2 2020-01-01 30
4 2 2020-01-03 40
5 2 2020-01-02 50
dfB:
ID date cumulative_sales
0 1 2020-05-01 15
1 2 2020-01-01 30
Output:
ID date cumulative_sales cumulative_sales_b
0 1 2020-01-03 15 15
1 2 2020-01-01 30 30
Note: I have left cumulative_sales_b in place in case you want it. If it's not needed, it can be dropped by replacing drop(columns='date_b') with drop(columns=['date_b', 'cumulative_sales_b']).
UPDATE:
For fun, if your version of Python has the walrus operator := (the assignment expression operator, available from Python 3.8), you can do this instead of using query:
dfA = (dfA := dfA.sort_values(['ID', 'date']).join(
    dfB.set_index('ID'), on='ID', rsuffix='_b'))[dfA.date <= dfA.date_b].drop(
    columns='date_b').groupby(['ID']).last().reset_index()
We can do this with an ordinary merge, filter with query, and keep the latest date per ID with drop_duplicates:
out = (df1.merge(df2, on='ID', suffixes=('', '_x'))
          .query('date <= date_x')
          .sort_values('date')
          .drop_duplicates('ID', keep='last')[df1.columns])
Out[272]:
ID date cumulative_sales
1 1 2020-01-03 15
I have several dataframes. Some of their column names match and some differ, and columns may hold the same values even when their names differ. I want to find the columns in one dataset whose values match the columns of another dataset (whether the column names are the same or not). Is there an efficient way to do that in Python?
For example:
df1: ID count Name
0 1 A
1 2 B
2 3 C
df2: person_id count_number Name Value
0 1 A 11
2 3 C 22
3 4 D 33
df3: key Value
11 11
22 22
33 33
I tried isin(), which is not efficient, and datacompy, which doesn't seem to work here because I have different column names.
My expected output: the column names that have matches, and ideally also how many matches they have.
For example, I want to find the matching columns of df1, df2 and df3. The output I want is their pairwise matches: for df1 and df2: ID & person_id, count & count_number, Name; for df2 and df3: Value; and so on.
As you have no expected output, it's hard to answer. A first proposition:
>>> df1.merge(df2, left_on='ID', right_on='person_id').merge(df3, on='Value')
ID count Name_x person_id count_number Name_y Value key
0 0 1 A 0 1 A 11 11
1 2 3 C 2 3 C 22 22
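If the "how many matches" part is the goal, one rough approach is to count, for every column pair across two frames, how many values of the left column also appear in the right column. This is a sketch only; column_overlaps is a hypothetical helper, not a pandas function, it assumes df1, df2 and df3 are already loaded as DataFrames with the columns shown, and the nested isin calls may be slow on very wide frames:
import pandas as pd

def column_overlaps(left, right):
    # For each (left column, right column) pair, count how many values
    # of the left column also occur anywhere in the right column.
    counts = {}
    for lcol in left.columns:
        for rcol in right.columns:
            counts[(lcol, rcol)] = int(left[lcol].isin(right[rcol]).sum())
    return pd.Series(counts).sort_values(ascending=False)

print(column_overlaps(df1, df2))   # e.g. (ID, person_id), (count, count_number), (Name, Name)
print(column_overlaps(df2, df3))   # e.g. (Value, Value), (Value, key)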
Here is the snippet:
test = pd.DataFrame({'userid': [1,1,1,2,2], 'order_id': [1,2,3,4,5], 'fee': [2,1,5,3,1]})
I'd like to group based on userid and count the 'order_id' column and sum the 'fee' column:
test.groupby('userid').order_id.count()
test.groupby('userid').fee.sum()
Is it possible to perform these two operations in one line of code so that I can get a resulting df that looks like this:
userid counts sum
...
I've tried pivot_table:
test.pivot_table(index='userid', values=['order_id', 'fee'], aggfunc=[np.size, np.sum])
It gives something like this:
size sum
fee order_id fee order_id
userid
1 3 3 8 6
2 2 2 4 9
Is it possible to tell pandas to use np.size & np.sum on one column but not both?
Use DataFrameGroupBy.agg and rename the columns:
d = {'order_id':'counts','fee':'sum'}
df = (test.groupby('userid')
          .agg({'order_id':'count', 'fee':'sum'})
          .rename(columns=d)
          .reset_index())
print (df)
userid sum counts
0 1 8 3
1 2 4 2
But it is better to aggregate with size, because count is only needed if you want to exclude NaNs:
df = (test.groupby('userid')
          .agg({'order_id':'size', 'fee':'sum'})
          .rename(columns=d)
          .reset_index())
print (df)
userid sum counts
0 1 8 3
1 2 4 2
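On pandas 0.25 or later, named aggregation produces the renamed columns in one step; a sketch on the same test frame:
df = (test.groupby('userid')
          .agg(counts=('order_id', 'size'), sum=('fee', 'sum'))
          .reset_index())
print(df)
#    userid  counts  sum
# 0       1       3    8
# 1       2       2    4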
I am still new to pandas' pivot_table and would like to ask for a way to count the frequencies of values in one column, grouped by another column of IDs. The DataFrame looks like the following.
import pandas as pd
df = pd.DataFrame({'Account_number':[1,1,2,2,2,3,3],
'Product':['A', 'A', 'A', 'B', 'B','A', 'B']
})
For the output, I'd like to get something like the following:
Product
A B
Account_number
1 2 0
2 1 2
3 1 1
So far, I tried this code:
df.pivot_table(rows = 'Account_number', cols= 'Product', aggfunc='count')
This code gives me the same counts twice. What is the problem with the code above? Part of the reason I am asking is that this DataFrame is just an example; the real data I am working on has tens of thousands of account_numbers.
You need to specify the aggfunc as len:
In [11]: df.pivot_table(index='Account_number', columns='Product',
                        aggfunc=len, fill_value=0)
Out[11]:
Product A B
Account_number
1 2 0
2 1 2
3 1 1
It looks like count is counting the instances of each column (Account_number and Product); it's not clear to me whether this is a bug...
Solution: Use aggfunc='size'
Using aggfunc=len or aggfunc='count' as in the other answers on this page will not work for DataFrames with more than three columns. By default, pandas applies the aggfunc to every column not listed in the index or columns parameters.
For instance, if we had two more columns in our original DataFrame defined like this:
df = pd.DataFrame({'Account_number':[1, 1, 2 ,2 ,2 ,3 ,3],
'Product':['A', 'A', 'A', 'B', 'B','A', 'B'],
'Price': [10] * 7,
'Quantity': [100] * 7})
Output:
Account_number Product Price Quantity
0 1 A 10 100
1 1 A 10 100
2 2 A 10 100
3 2 B 10 100
4 2 B 10 100
5 3 A 10 100
6 3 B 10 100
If you apply the current solutions to this DataFrame, you would get the following:
df.pivot_table(index='Account_number',
columns='Product',
aggfunc=len,
fill_value=0)
Output:
Price Quantity
Product A B A B
Account_number
1 2 0 2 0
2 1 2 1 2
3 1 1 1 1
Solution
Instead, use aggfunc='size'. Since size returns the same count regardless of which column it is applied to, pandas computes it only once instead of once per column.
df.pivot_table(index='Account_number',
columns='Product',
aggfunc='size',
fill_value=0)
Output:
Product A B
Account_number
1 2 0
2 1 2
3 1 1
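An equivalent spelling without pivot_table, if it reads more naturally, is groupby plus unstack; a sketch on the same df (either version, with or without the extra columns):
df.groupby(['Account_number', 'Product']).size().unstack(fill_value=0)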
In newer versions of pandas, a slight modification is required. It took me some time to figure this out, so I am adding it here so that others can use it directly.
df.pivot_table(index='Account_number', columns='Product', aggfunc=len,
fill_value=0)
You can use count:
df.pivot_table(index='Account_number', columns='Product', aggfunc='count')
I know this question is about pivot_table but for the problem given in the question, we can use crosstab:
out = pd.crosstab(df['Account_number'], df['Product'])
Output:
Product A B
Account_number
1 2 0
2 1 2
3 1 1
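If row and column totals are also useful, crosstab can add them with margins=True (a small optional extra):
pd.crosstab(df['Account_number'], df['Product'], margins=True)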