I have a table in a pandas DataFrame df:
id prod1 prod2 count
1 10 30 100
2 10 20 200
3 20 10 200
4 30 10 100
5 30 40 300
I also have another table in df2:
product price master_product
1000 1 10
5000 2 10
2000 2 20
9000 5 20
8000 1 20
30 3 0
4000 4 50
Check whether prod1 and prod2 belong to the values in master_product. If yes, I want to replace prod1 and prod2 in my first df with the cheapest product for that master_product. If a prod1 or prod2 value does not match any value in master_product, leave it as it is. I am looking for the final table as:
id prod1 prod2 count
1 1000 30 100
2 1000 8000 200
3 8000 1000 200
4 30 1000 100 #since 30 is not in master_product, leave it as it is
5 30 40 300
I was trying to use the .map function to achieve this, but I could only get this far:
df['prod1'] = df['prod1'].map(df2.set_index('master_product')['product'])
df['prod2'] = df['prod2'].map(df2.set_index('master_product')['product'])
But this replaces every value in prod1 and prod2 with the matching product from df2, leaving NaN where there is no match in master_product. Any ideas how to achieve this?
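For reference, a minimal sketch reconstructing the two frames above, so the answer below can be run as-is:

import pandas as pd

df = pd.DataFrame({'id': [1, 2, 3, 4, 5],
                   'prod1': [10, 10, 20, 30, 30],
                   'prod2': [30, 20, 10, 10, 40],
                   'count': [100, 200, 200, 100, 300]})
df2 = pd.DataFrame({'product': [1000, 5000, 2000, 9000, 8000, 30, 4000],
                    'price': [1, 2, 2, 5, 1, 3, 4],
                    'master_product': [10, 10, 20, 20, 20, 0, 50]})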
You can first reduce df2 to the minimal price per master_product with groupby and idxmin, which returns the index of the row holding the minimal price within each group:
df2 = df2.loc[df2.groupby('master_product')['price'].idxmin()]
print (df2)
product price master_product
5 30 3 0
0 1000 1 10
4 8000 1 20
6 4000 4 50
Create a dict for mapping:
d = df2.set_index('master_product')['product'].to_dict()
print (d)
{0: 30, 10: 1000, 20: 8000, 50: 4000}
Finally, map, and where a value has no match, fall back to the original with combine_first:
df.prod1 = df.prod1.map(d).combine_first(df.prod1)
df.prod2 = df.prod2.map(d).combine_first(df.prod2)
print (df)
id prod1 prod2 count
0 1 1000.0 30.0 100
1 2 1000.0 8000.0 200
2 3 8000.0 1000.0 200
3 4 30.0 1000.0 100
4 5 30.0 40.0 300
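Alternatively, fillna gives the same fallback, and the float result (a side effect of the intermediate NaN) can be cast back to integers; a minimal variant of the same mapping:

df['prod1'] = df['prod1'].map(d).fillna(df['prod1']).astype(int)
df['prod2'] = df['prod2'].map(d).fillna(df['prod2']).astype(int)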
I am trying to get the amount spent for each TYPE_ID based on the month column.
Dataset:
ID TYPE_ID Month_year Amount
100 1 jun_2019 20
100 1 jul_2019 30
100 2 jun_2019 10
200 1 jun_2019 50
200 1 jun_2019 30
100 2 jul_2019 20
200 2 jun_2019 40
200 2 jul_2019 10
200 2 jun_2019 20
200 1 jul_2019 30
100 1 jul_2019 10
Output:
For every TYPE_ID, I want to calculate the spend per month. The column TYPEID_1_jun2019 gives the number of transactions with TYPE_ID 1 in that month, and Amount_type1_jun2019 gives the total amount spent in that month for that TYPE_ID.
ID TYPEID_1_jun2019 Amount_type1_jun2019 TYPEID_1_jul2019 Amount_type1_jul2019 TYPEID_2_jun2019 Amount_type2_jun2019 TYPEID_2_jul2019 Amount_type2_jul2019
100 1 20 2 40 1 10 1 20
200 2 80 1 30 2 60 1 10
EDIT: I also want to calculate the average monthly spend for every ID.
Output: Also include these columns:
ID Average_type1_jul2019 Average_type1_jun2019
100 20 10
The formula I used to calculate the average is the amount spent in July with TYPE_ID 1 divided by the total number of months.
First convert Month_year to datetimes so the ordering is correct, then create a helper column type, aggregate sum together with size, reshape with DataFrame.unstack, sort with DataFrame.sort_index, and finally flatten the MultiIndex, converting the datetimes back to the original format:
import pandas as pd

df['Month_year'] = pd.to_datetime(df['Month_year'], format='%b_%Y')
df1 = (df.assign(type=df['TYPE_ID']).groupby(['ID','Month_year','TYPE_ID'])
.agg({'Amount':'sum', 'type':'size'})
.unstack([1,2])
.sort_index(axis=1, level=[1,2]))
df1.columns = df1.columns.map(lambda x: f'{x[0]}_{x[2]}_{x[1].strftime("%b_%Y")}')
df1 = df1.reset_index()
print (df1)
ID Amount_1_Jun_2019 type_1_Jun_2019 Amount_2_Jun_2019 \
0 100 20 1 10
1 200 80 2 60
type_2_Jun_2019 Amount_1_Jul_2019 type_1_Jul_2019 Amount_2_Jul_2019 \
0 1 40 2 20
1 2 30 1 10
type_2_Jul_2019
0 1
1 1
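Note that the flattened names come out as type_1_Jun_2019 / Amount_1_Jun_2019 rather than the exact headers in the question; if those are required, a hypothetical rename along these lines would do it:

df1.columns = [c.replace('type_', 'TYPEID_').replace('Amount_', 'Amount_type')
               for c in df1.columns]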
EDIT:
# removed the sorting and the flattening of the MultiIndex
df['Month_year'] = pd.to_datetime(df['Month_year'], format='%b_%Y')
df1 = (df.assign(type=df['TYPE_ID']).groupby(['ID','Month_year','TYPE_ID'])
.agg({'Amount':'sum', 'type':'size'})
.unstack([1,2]))
print (df1)
Amount type
Month_year 2019-06-01 2019-07-01 2019-06-01 2019-07-01
TYPE_ID 1 2 1 2 1 2 1 2
ID
100 20 10 40 20 1 1 2 1
200 80 60 30 10 2 2 1 1
# get the number of unique Month_year values per ID and TYPE_ID, then divide the Amount by it
df2 = df.groupby(['ID','TYPE_ID'])['Month_year'].nunique().unstack()
df3 = df1.xs('Amount', axis=1, level=0).div(df2, level=1)
# add a top level 'Average'
df3.columns = pd.MultiIndex.from_tuples([('Average', a, b) for a, b in df3.columns])
print (df3)
Average
2019-06-01 2019-07-01
1 2 1 2
ID
100 10.0 5.0 20.0 10.0
200 40.0 30.0 15.0 5.0
# join together, sort and flatten the MultiIndex
df5 = pd.concat([df1, df3],axis=1).sort_index(axis=1, level=[1,2])
df5.columns = df5.columns.map(lambda x: f'{x[0]}_{x[2]}_{x[1].strftime("%b_%Y")}')
df5 = df5.reset_index()
print (df5)
ID Amount_1_Jun_2019 Average_1_Jun_2019 type_1_Jun_2019 \
0 100 20 10.0 1
1 200 80 40.0 2
Amount_2_Jun_2019 Average_2_Jun_2019 type_2_Jun_2019 Amount_1_Jul_2019 \
0 10 5.0 1 40
1 60 30.0 2 30
Average_1_Jul_2019 type_1_Jul_2019 Amount_2_Jul_2019 Average_2_Jul_2019 \
0 20.0 2 20 10.0
1 15.0 1 10 5.0
type_2_Jul_2019
0 1
1 1
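Roughly the same aggregation can also be sketched with pivot_table, which computes both the sums and the counts in one call (flattening the resulting MultiIndex columns then proceeds as above); a minimal sketch assuming Month_year has already been converted to datetimes:

df1 = pd.pivot_table(df, index='ID', columns=['TYPE_ID', 'Month_year'],
                     values='Amount', aggfunc=['sum', 'size'])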
I have a pandas dataframe like this.
>data
ID Distance Speed
1 100 40
1 200 20
1 200 10
2 400 20
2 500 30
2 100 40
2 600 20
2 700 90
3 800 80
3 700 10
3 400 20
I want to group the table by ID and create a new column Time by dividing each value in the Distance column by the first Speed value of each ID group. So the result should look like this:
>data
ID Distance Speed Time
1 100 40 2.5
1 200 20 5
1 200 10 5
2 400 20 20
2 500 30 25
2 100 40 5
2 600 20 30
2 700 90 35
3 800 80 10
3 700 10 8.75
3 400 20 5
My attempt:
data['Time'] = data['Distance'] / data.loc[data.groupby('ID')['Speed'].head(1).index, 'Speed']
But the result is not correct. How can I do this?
Use transform with 'first' to return a Series with the same length as the original df:
data['Time'] = data['Distance'] / data.groupby('ID')['Speed'].transform('first')
Or use drop_duplicates with map:
s = data.drop_duplicates('ID').set_index('ID')['Speed']
data['Time'] = data['Distance'] / data['ID'].map(s)
print (data)
ID Distance Speed Time
0 1 100 40 2.50
1 1 200 20 5.00
2 1 200 10 5.00
3 2 400 20 20.00
4 2 500 30 25.00
5 2 100 40 5.00
6 2 600 20 30.00
7 2 700 90 35.00
8 3 800 80 10.00
9 3 700 10 8.75
10 3 400 20 5.00
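The original attempt misaligns because head(1) returns only the first row of each group, so the division lines up on those few index labels only. transform('first') instead broadcasts the first Speed back over every row of its group; checking the intermediate result on the sample data above:

print (data.groupby('ID')['Speed'].transform('first'))
0     40
1     40
2     40
3     20
4     20
5     20
6     20
7     20
8     80
9     80
10    80
Name: Speed, dtype: int64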
I'm trying to use two columns start and stop to define multiple ranges of values in another dataframe's age column. Ranges are defined in a df called intervals:
start stop
1 3
5 7
Ages are defined in another df:
age some_random_value
1 100
2 200
3 300
4 400
5 500
6 600
7 700
8 800
9 900
10 1000
Desired output is values where age is between the ranges defined in intervals (1-3 and 5-7):
age some_random_value
1 100
2 200
3 300
5 500
6 600
7 700
I've tried using numpy.r_ but it doesn't work quite as I want it to:
df.loc[pd.np.r_[intervals.start, intervals.stop]]
Which yields:
age some_random_value
2 200
6 600
4 400
8 800
Any ideas are much appreciated!
I believe you need the parameter closed='both' in IntervalIndex.from_arrays:
import pandas as pd

idx = pd.IntervalIndex.from_arrays(intervals['start'], intervals['stop'], closed='both')
And then select matching values:
df = df[idx.get_indexer(df.age.values) != -1]
print (df)
age some_random_value
0 1 100
1 2 200
2 3 300
4 5 500
5 6 600
6 7 700
Detail:
print (idx.get_indexer(df.age.values))
[ 0 0 0 -1 1 1 1 -1 -1 -1]
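An alternative sketch without IntervalIndex builds a boolean mask by checking age against each start/stop pair (fine while the intervals frame stays small; between is inclusive on both ends by default):

import numpy as np

mask = np.zeros(len(df), dtype=bool)
for start, stop in zip(intervals['start'], intervals['stop']):
    mask |= df['age'].between(start, stop).values
print (df[mask])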
I have a table in a pandas DataFrame df:
id product_1 product_2 count
1 100 200 10
2 200 600 20
3 100 500 30
4 400 100 40
5 500 700 50
6 200 500 60
7 100 400 70
I also have another table in a DataFrame df2:
product price
100 5
200 10
300 15
400 20
500 25
600 30
700 35
I have to create a new column price_product_2 in my first df, taking the price values from df2 for product_2. I also need to find the percentage of product_2's price with respect to product_1's and put it in one more column, %_diff. I.e. say product_1 = 100 and product_2 = 200; their prices are 5 and 10, so product_2 is 200% of the price of product_1. Similarly, if product_1 = 400 and product_2 = 100, it is a decline in price, so product_2 is -25% of product_1. My final output should be df =
id product_1 product_2 count price_product_2 %_diff
1 100 200 10 10 +200
2 200 600 20 30 +300
3 100 500 30 25 +500
4 400 100 40 5 -25
5 500 700 50 35 +140
6 200 500 60 25 +250
7 100 400 70 20 +400
Any ideas how to achieve this? I was trying to use the map function:
df['price_product_2'] = df['product_2'].map(df2.set_index('product')['price'])
But I could only get that one column; how do I get the %_diff column?
Use merge (or map) twice, once for each product, then calculate the difference.
# Add prices for products 1 and 2
df3 = (df1.
merge(df2, left_on='product_1', right_on='product').
merge(df2, left_on='product_2', right_on='product'))
# Calculate the relative difference as a fraction (e.g. 1.0 means +100%)
df3['pct_diff'] = (df3.price_y - df3.price_x) / df3.price_x
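Note that merge appends its default suffixes _x and _y to the two overlapping price columns, which is where price_x and price_y come from. Also, pct_diff here is a fractional difference (0.25 means +25%), a slightly different convention from the ratio asked for in the question.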
Suppose you have the following data frames:
In [32]: df1
Out[32]:
index id product_1 product_2 count
0 0 1 100 200 10
1 1 2 200 600 20
2 2 3 100 500 30
3 3 4 400 100 40
4 4 5 500 700 50
5 5 6 200 500 60
6 6 7 100 400 70
In [33]: df2
Out[33]:
product price
0 100 5
1 200 10
2 300 15
3 400 20
4 500 25
5 600 30
6 700 35
It is probably easier simply to set product as the index for df2:
In [35]: df2.set_index('product', inplace=True)
In [36]: df2
Out[36]:
price
product
100 5
200 10
300 15
400 20
500 25
600 30
700 35
Then you can do things like the following:
In [37]: df2.loc[df1['product_2']]
Out[37]:
price
product
200 10
600 30
500 25
100 5
700 35
500 25
400 20
Assign using the underlying values explicitly, or else the product index will misalign the assignment:
In [38]: df1['price_product_2'] = df2.loc[df1['product_2']].values
In [39]: df1
Out[39]:
index id product_1 product_2 count price_product_2
0 0 1 100 200 10 10
1 1 2 200 600 20 30
2 2 3 100 500 30 25
3 3 4 400 100 40 5
4 4 5 500 700 50 35
5 5 6 200 500 60 25
6 6 7 100 400 70 20
For the percentage difference, you can also use vectorized operations:
In [40]: df1.product_2 / df1.product_1 * 100
Out[40]:
0 200.0
1 300.0
2 500.0
3 25.0
4 140.0
5 250.0
6 400.0
dtype: float64
A solution with map via the dict d, with division by div:
d = df2.set_index('product')['price'].to_dict()
df['price_product_2'] = df['product_2'].map(d)
df['price_product_1'] = df['product_1'].map(d)
df['diff'] = df['price_product_2'].div(df['price_product_1']).mul(100)
print (df)
id product_1 product_2 count price_product_2 price_product_1 diff
0 1 100 200 10 10 5 200.0
1 2 200 600 20 30 10 300.0
2 3 100 500 30 25 5 500.0
3 4 400 100 40 5 20 25.0
4 5 500 700 50 35 25 140.0
5 6 200 500 60 25 10 250.0
6 7 100 400 70 20 5 400.0
But it seems only the division is necessary here: each price is the product number times the same constant, so the ratio of the product_1 and product_2 values equals the ratio of their prices:
df['diff1'] = df['product_2'].div(df['product_1']).mul(100)
print (df)
id product_1 product_2 count diff1
0 1 100 200 10 200.0
1 2 200 600 20 300.0
2 3 100 500 30 500.0
3 4 400 100 40 25.0
4 5 500 700 50 140.0
5 6 200 500 60 250.0
6 7 100 400 70 400.0
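If the signed convention from the question is wanted (a ratio below 100% reported as negative to mark a decline), a minimal sketch building on the ratio above:

ratio = df['product_2'].div(df['product_1']).mul(100)
# flip the sign when product_2 is the cheaper one, per the question's convention
df['%_diff'] = ratio.where(ratio >= 100, -ratio)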
I have a table in a pandas DataFrame df:
id count
0 10 3
1 20 4
2 30 5
3 40 NaN
4 50 NaN
5 60 NaN
6 70 NaN
I also have another pandas Series s:
0 1000
1 2000
2 3000
3 4000
What I want to do is replace the NaN values in my df with the respective values from the Series s. My final output should be:
id count
0 10 3
1 20 4
2 30 5
3 40 1000
4 50 2000
5 60 3000
6 70 4000
Any ideas how to achieve this? Thanks in advance.
There is a problem if the length of the Series differs from the number of NaN values in the column count. In that case you need to reindex the Series by the number of NaNs:
s = pd.Series({0: 1000, 1: 2000, 2: 3000, 3: 4000, 5: 5000})
print (s)
0 1000
1 2000
2 3000
3 4000
5 5000
dtype: int64
import numpy as np

df.loc[df['count'].isnull(), 'count'] = s.reindex(np.arange(df['count'].isnull().sum())).values
print (df)
id count
0 10 3.0
1 20 4.0
2 30 5.0
3 40 1000.0
4 50 2000.0
5 60 3000.0
6 70 4000.0
If the Series has exactly one value per NaN (in order), it's as simple as this:
# note: df.count is the DataFrame method, so use .loc with the column label
df.loc[df['count'].isnull(), 'count'] = s.values
In this case, I prefer iterrows for its readability.
counter = 0
for index, row in df.iterrows():
    if pd.isnull(row['count']):
        # .at replaces the long-deprecated set_value
        df.at[index, 'count'] = s[counter]
        counter += 1
I might add that this 'merging' of a DataFrame and a Series is a bit odd and prone to bizarre errors. If you can somehow get the Series into the same format as the DataFrame (e.g. by adding a matching index or key column), you might be better served by the merge function.
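For instance, if the replacement values carried an id key matching df, the alignment would be explicit. A hypothetical sketch (the replacements frame and its columns are assumptions, not part of the question):

# hypothetical: replacement values keyed by the same id as df
replacements = pd.DataFrame({'id': [40, 50, 60, 70],
                             'count_new': [1000, 2000, 3000, 4000]})
df = df.merge(replacements, on='id', how='left')
df['count'] = df['count'].fillna(df.pop('count_new'))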
You can re-index your Series with the indexes of the NaN rows from the DataFrame and then fillna() with it:
s.index = np.where(df['count'].isnull())[0]
df['count'] = df['count'].fillna(s)
print(df)
id count
0 10 3.0
1 20 4.0
2 30 5.0
3 40 1000.0
4 50 2000.0
5 60 3000.0
6 70 4000.0
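A non-mutating variant of the same idea builds the aligned Series inline instead of changing the index of s (a sketch; as above, it assumes s holds one value per NaN, in order):

aligned = pd.Series(s.values, index=df.index[df['count'].isnull()])
df['count'] = df['count'].fillna(aligned)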