I am having some trouble merging two data frames that don't share any columns.
I have two data frames and I want to combine them side by side (they both have the same number of rows).
For example, I have these two data frames:
A:
store   red candy  apple
first   5          3
second  1          2
third   4          2
B:
yellow candy  banana  green candy
10            5       4
5             3       3
1             1       0
and I want to merge them so that I end up with one data frame that looks like this:
store   red candy  apple  yellow candy  banana  green candy
first   5          3      10            5       4
second  1          2      5             3       3
third   4          2      1             1       0
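One straightforward way to do this (a minimal sketch, assuming the rows of A and B are already in the same order and you just want a positional, side-by-side join) is pd.concat along axis=1:

import pandas as pd

# Hypothetical reconstruction of the two frames shown above.
A = pd.DataFrame({'store': ['first', 'second', 'third'],
                  'red candy': [5, 1, 4],
                  'apple': [3, 2, 2]})
B = pd.DataFrame({'yellow candy': [10, 5, 1],
                  'banana': [5, 3, 1],
                  'green candy': [4, 3, 0]})

# Reset both indexes so the concatenation is purely positional,
# then glue the columns of B onto A.
merged = pd.concat([A.reset_index(drop=True), B.reset_index(drop=True)], axis=1)
print(merged)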
I am using this dataframe:
source fruit 2019 2020 2021
0 a apple 3 1 1
1 a banana 4 3 5
2 a orange 2 2 2
3 b apple 3 4 5
4 b banana 4 5 2
5 b orange 1 6 4
I want to refine it like this:
source fruit 2019 2020 2021
0 a total 9 6 8
1 a seeds 5 3 3
2 a banana 4 3 5
3 b total 8 15 11
4 b seeds 4 10 9
5 b banana 4 5 2
total is the sum of all fruits in that year for each source.
seeds is the sum of the fruits containing seeds for each year for each source.
I tried appending new empty rows (as in "Insert a new row after every nth row" and "Insert row at any position"), but wasn't getting the expected result.
What would be the best way to get the desired output?
TRY:
# total: per-source sums across all fruits
df1 = df.groupby('source', as_index=False).sum(numeric_only=True).assign(fruit='total')
# seeds: per-source sums over the seed-bearing fruits only
seeds = ['orange', 'apple']
df2 = df.loc[df['fruit'].isin(seeds)].groupby('source', as_index=False).sum(numeric_only=True).assign(fruit='seeds')
# keep the remaining fruit rows as-is and stack the three pieces together
final_df = pd.concat([df.loc[~df['fruit'].isin(seeds)], df1, df2])
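If you also want the rows ordered as in the desired output (total, then seeds, then the remaining fruit, per source), a small follow-up sort would do it; the helper column name _rank below is just illustrative:

order = {'total': 0, 'seeds': 1}  # anything else (e.g. banana) sorts last
final_df = (final_df.assign(_rank=final_df['fruit'].map(order).fillna(2))
                    .sort_values(['source', '_rank'])
                    .drop(columns='_rank')
                    .reset_index(drop=True))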
I'm trying to compute a cumulative sum in Python based on two different conditions.
As shown in the example data below, the CALCULATION column takes the same value as the Number column as long as the Cat1 and Cat2 columns don't change.
Once the Cat1 column changes, the calculation resets.
While Cat1 stays the same but Cat2 changes, the CALCULATION column takes the last value of the Number column from the previous Cat2 group and adds it to the following Number values.
Example of data below:
Cat1 Cat2 Number CALCULATION
a orange 1 1
a orange 2 2
a orange 3 3
a orange 4 4
a orange 5 5
a orange 6 6
a orange 7 7
a orange 8 8
a orange 9 9
a orange 10 10
a orange 11 11
a orange 12 12
a orange 13 13
b purple 1 1
b purple 2 2
b purple 3 3
b purple 4 4
b purple 5 5
b purple 6 6
b purple 7 7
b purple 8 8
b silver 1 9
b silver 2 10
b silver 3 11
b silver 4 12
b silver 5 13
b silver 6 14
b silver 7 15
Are you looking for:
import pandas as pd

df = pd.DataFrame({'Cat1': ['a'] * 13 + ['b'] * 15,
                   'Cat2': ['orange'] * 13 + ['purple'] * 8 + ['silver'] * 7})
# Number restarts at 1 for every (Cat1, Cat2) block...
df['Number'] = df.groupby(['Cat1', 'Cat2']).cumcount() + 1
# ...while CALCULATION keeps counting across Cat2 changes within the same Cat1.
df['CALCULATION'] = df.groupby('Cat1').cumcount() + 1
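If Number already exists in your real data and isn't simply a restarting counter, a more general approach (a hedged sketch, using a shortened illustrative sample rather than the full posted data) is to add, to each row's Number, the running total of the previous Cat2 blocks' last Number values within the same Cat1:

import pandas as pd

# Illustrative, shortened version of the posted data.
df = pd.DataFrame({'Cat1': ['a'] * 4 + ['b'] * 5,
                   'Cat2': ['orange'] * 4 + ['purple'] * 3 + ['silver'] * 2,
                   'Number': [1, 2, 3, 4, 1, 2, 3, 1, 2]})

# Last Number of each (Cat1, Cat2) block, in order of appearance.
block_last = df.groupby(['Cat1', 'Cat2'], sort=False)['Number'].last()

# Offset per block: running total of the previous blocks' last values
# within the same Cat1 (0 for the first block of each Cat1).
offset = block_last.groupby(level='Cat1', sort=False).cumsum() - block_last

df['CALCULATION'] = df['Number'] + [offset[k] for k in zip(df['Cat1'], df['Cat2'])]
print(df)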
How do I create a DataFrame mirroring the table below and assign it to data? Then group by the flavor column and find the mean price for each flavor; assign this series to price_by_flavor.
flavor      scoops  price
chocolate   1       2
vanilla     1       1.5
chocolate   2       3
strawberry  1       2
strawberry  3       4
vanilla     2       2
mint        1       4
mint        2       5
chocolate   3       5
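A minimal sketch (using the names data and price_by_flavor from the prompt):

import pandas as pd

# Build the DataFrame shown in the table above.
data = pd.DataFrame({
    'flavor': ['chocolate', 'vanilla', 'chocolate', 'strawberry',
               'strawberry', 'vanilla', 'mint', 'mint', 'chocolate'],
    'scoops': [1, 1, 2, 1, 3, 2, 1, 2, 3],
    'price':  [2, 1.5, 3, 2, 4, 2, 4, 5, 5],
})

# Group by flavor and take the mean price for each flavor.
price_by_flavor = data.groupby('flavor')['price'].mean()
print(price_by_flavor)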
I have a data frame with the following structure:
code  value
1     red
2     blue
3     yellow
1     NaN
4     NaN
4     pink
2     blue
So basically I want to update the value column so that the blank rows are filled with values from other rows. Since I know that code 4 refers to the value pink, I want it filled in on all the rows where that value is missing.
Using groupby and ffill and bfill
df.groupby('code').value.ffill().bfill()
0 red
1 blue
2 yellow
3 red
4 pink
5 pink
6 blue
Name: value, dtype: object
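Note that in this chain only the ffill runs per group; the trailing bfill is applied to the already-combined series, which happens to be fine for this data but could in principle pull a value across codes. A strictly per-group variant (a hedged sketch), written back into the frame, would be:

# Fill forward and backward within each code group only, then assign back.
df['value'] = df.groupby('code')['value'].transform(lambda s: s.ffill().bfill())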
You could use the first valid value of each code group:
In [379]: df.groupby('code')['value'].transform('first')
Out[379]:
0 red
1 blue
2 yellow
3 red
4 pink
5 pink
6 blue
Name: value, dtype: object
To assign back
In [380]: df.assign(value=df.groupby('code')['value'].transform('first'))
Out[380]:
code value
0 1 red
1 2 blue
2 3 yellow
3 1 red
4 4 pink
5 4 pink
6 2 blue
Or
df['value'] = df.groupby('code')['value'].transform('first')
You can create a series of your code-value pairs, and use that to map:
my_map = df[df['value'].notnull()].set_index('code')['value'].drop_duplicates()
df['value'] = df['code'].map(my_map)
>>> df
code value
0 1 red
1 2 blue
2 3 yellow
3 1 red
4 4 pink
5 4 pink
6 2 blue
Just to see what is happening, you are passing the following series to map:
>>> my_map
code
1 red
2 blue
3 yellow
4 pink
Name: value, dtype: object
So it says: "Where you find 1, give the value red, where you find 2, give blue..."
You can sort_values, ffill and then sort_index. The last step may not be necessary if order is not important. If it is, then the double sort may be unreasonably expensive.
df = df.sort_values(['code', 'value']).ffill().sort_index()
print(df)
code value
0 1 red
1 2 blue
2 3 yellow
3 1 red
4 4 pink
5 4 pink
6 2 blue
Using reindex
df.dropna().drop_duplicates('code').set_index('code').reindex(df.code).reset_index()
Out[410]:
code value
0 1 red
1 2 blue
2 3 yellow
3 1 red
4 4 pink
5 4 pink
6 2 blue
I have a data frame with multiple columns and I want to use count after groupby so that it is applied to a combination of two or more columns. For example, let's say I have two columns:
user_id product_name
1 Apple
1 Banana
1 Apple
2 Carrot
2 Tomato
2 Carrot
2 Tomato
3 Milk
3 Cucumber
...
What I want to achieve is something like this:
user_id product_name Product_Count_per_User
1 Apple 2
1 Banana 1
2 Carrot 2
2 Tomato 2
3 Milk 1
3 Cucumber 1
I cannot get it. I tried this:
dcf6 = df3.groupby(['user_id','product_name'])['user_id', 'product_name'].count()
but it does not seem to give what I want, and it displays 4 columns instead of 3. How do I do it? Thanks.
You are counting two columns at the same time; you can just use groupby.size:
(df.groupby(['user_id', 'product_name']).size()
 .rename('Product_Count_per_User').reset_index())
Or count only one column:
df.groupby(['user_id','product_name'])['user_id'].size()
Use GroupBy.size:
dcf6 = (df3.groupby(['user_id','product_name']).size()
           .reset_index(name='Product_Count_per_User'))
print (dcf6)
user_id product_name Product_Count_per_User
0 1 Apple 2
1 1 Banana 1
2 2 Carrot 2
3 2 Tomato 2
4 3 Cucumber 1
5 3 Milk 1
What is the difference between size and count in pandas?
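Briefly: size() counts all rows in each group (including NaN), while count() counts only the non-null values per column. A small illustrative example:

import pandas as pd
import numpy as np

tmp = pd.DataFrame({'g': ['x', 'x', 'y'], 'v': [1, np.nan, 2]})
print(tmp.groupby('g').size())        # x -> 2, y -> 1  (rows per group)
print(tmp.groupby('g')['v'].count())  # x -> 1, y -> 1  (non-null values only)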
Based on your own code, just do this:
(df.groupby(['user_id','product_name'])['user_id']
   .agg(Product_Count_per_User='count').reset_index(level=1))
product_name Product_Count_per_User
user_id
1 Apple 2
1 Banana 1
2 Carrot 2
2 Tomato 2
3 Cucumber 1
3 Milk 1