Changing column values based on previous rows' values in Python

I have a data frame that looks like this:

Name          Order
Manufacturer  0
Company1      1
product       2
Company2      1
product       2
product       2
product       2
The only identifier for the values in the Name column is the Order, where 0 represents manufacturers, 1 represents companies, and 2 represents products.
I want to add a column whose value is based on a comparison between the previous and current rows within the same group. Basically, I want to identify that Company1 relates to the Manufacturer, that each product relates to its company, and so on:
Name          Order  Desired_Output
Manufacturer  0      Manufacturer
Company1      1      Manufacturer_Company1
product       2      Company1_product
Company2      1      Manufacturer_Company2
product       2      Company2_product
product       2      Company2_product
product       2      Company2_product
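For reference, here is a minimal sketch reconstructing the example frame (the Name/Order values are taken from the question), so the answers below can be run directly:
import pandas as pd

df = pd.DataFrame({
    'Name': ['Manufacturer', 'Company1', 'product', 'Company2',
             'product', 'product', 'product'],
    'Order': [0, 1, 2, 1, 2, 2, 2],
})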

You can pivot the data, ffill and join the last two items:
df['output'] = (df
    .reset_index()
    .pivot(index='index', columns='Order', values='Name')
    .ffill()
    .apply(lambda d: '_'.join(d.dropna().iloc[-2:]), axis=1)
)
NB. This should work with any number of 0 values.
output:
Name Order output
0 Manufacturer 0 Manufacturer
1 Company1 1 Manufacturer_Company1
2 product 2 Company1_product
3 Company2 1 Company2_product
4 product 2 Company2_product
5 product 2 Company2_product
6 product 2 Company2_product
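For clarity, the intermediate pivoted and forward-filled frame built by the code above roughly looks like this (computed from the example data); joining the last two non-NaN values of each row gives the output column:
print(df.reset_index().pivot(index='index', columns='Order', values='Name').ffill())
Order             0         1        2
index
0      Manufacturer       NaN      NaN
1      Manufacturer  Company1      NaN
2      Manufacturer  Company1  product
3      Manufacturer  Company2  product
4      Manufacturer  Company2  product
5      Manufacturer  Company2  product
6      Manufacturer  Company2  product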

This one is a little tricky but will do the job:
df['output'] = df.apply(lambda x: df[df['Order'] == x['Order'] - 1]['Name'].iloc[-1] + '_' + x['Name'] if x['Order'] > 0 else x['Name'], axis=1)
However, this is not the most efficient solution for large DataFrames.
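For readability, the same logic can be unpacked into a named function; this is just a sketch that mirrors the one-liner above (including the fact that it looks up the last matching row in the whole frame, not only the rows before the current one):
def label_row(row):
    # Rows at Order 0 keep their own name
    if row['Order'] == 0:
        return row['Name']
    # Otherwise prefix with the last Name in the frame whose Order is one less (the parent level)
    parent = df.loc[df['Order'] == row['Order'] - 1, 'Name'].iloc[-1]
    return parent + '_' + row['Name']

df['output'] = df.apply(label_row, axis=1)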

Related

How to get the Percentage of a Column based on a Condition? Python

I want to calculate the percentage of my Products column according to the occurrences per related Country. I would greatly appreciate your help.
Here is what I did so far.
I calculated my new dataframe with this code:
gb = data1.groupby(['Country', 'Products']).size()
df = gb.to_frame(name = 'ProductsCount').reset_index()
df
Which gives me something that look like this:
Countries Products ProductsCount
0 Country 1 Product 1 5
1 Country 1 Product 2 31
2 Country 2 Product 1 2
3 Country 2 Product 2 1
Note: I have a couple of thousand rows of output.
My goal is to get the percentage of each product within its country directly, without calculating ['ProductsCount'], like this:
Countries Products Percentage
0 Country 1 Product 1 0.138
1 Country 1 Product 2 0.861
2 Country 2 Product 1 0.667
3 Country 2 Product 2 0.333
Otherwise, if I can't get the output to show only the percentage, then I would like something like this:
Countries Products ProductsCount Products%
0 Country 1 Product 1 5 0.138
1 Country 1 Product 2 31 0.861
2 Country 2 Product 1 2 0.667
3 Country 2 Product 2 1 0.333
I managed to calculate only the % according to the whole dataset using this code:
df['Products%'] = df.ProductsCount/len(df.Country)
Thank you in advance!
Use SeriesGroupBy.value_counts with the normalize=True parameter:
df = (data1.groupby('Countries')['Products']
           .value_counts(normalize=True, sort=False)
           .reset_index(name='Percentage'))
print (df)
Countries Products Percentage
0 Country 1 Product 1 0.138889
1 Country 1 Product 2 0.861111
2 Country 2 Product 1 0.666667
3 Country 2 Product 2 0.333333
EDIT:
df = (data1.groupby('Countries')['Products']
           .value_counts(sort=False)
           .reset_index(name='ProductsCount')
           .assign(Percentage=lambda x: x['ProductsCount'].div(len(x))))
print (df)
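If the denominator should instead be the number of transactions within each country (which is what the desired percentages above imply), the counts can be normalized with a groupby transform; a small sketch, not part of the original answer:
df['Percentage'] = (df['ProductsCount']
                    / df.groupby('Countries')['ProductsCount'].transform('sum'))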

Top 2 products counts per day Pandas

I have a dataframe with a days column and a products column (originally shown only as an image).
First, I want the top 2 products; second, I need the top 2 most frequent products per day, so I need to group by days and select the top 2 products from the products column. I tried this code but it gives an error:
df.groupby("days", as_index=False)(["products"] == "Follow Up").count()
You need to groupby over both days and products and then use size. Once you have done this you will have all the counts you require in the df.
You will then need to sort by both the day and the default 0 column, which now contains your counts; that column is created by resetting the index after the initial groupby.
We then follow the instructions in "Pandas get topmost n records within each group" to give your desired result.
A full example:
Setup:
df = pd.DataFrame({'day': [1,1,1,1,1,1,2,2,2,2,2,2,3,3,3,3,3,3],
                   'value': ['a','a','b','b','b','c','a','a','b','b','b','c','a','a','b','b','b','c']})
df.head(6)
day value
0 1 a
1 1 a
2 1 b
3 1 b
4 1 b
5 1 c
df_counts = df.groupby(['day', 'value']).size().reset_index().sort_values(['day', 0], ascending=[True, False])
df_top_2 = df_counts.groupby('day').head(2)
df_top_2
day value 0
1 1 b 3
0 1 a 2
4 2 b 3
3 2 a 2
7 3 b 3
6 3 a 2
Of course, you should rename the 0 column to something more reasonable but this is a minimal example.
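Applied to the question's own frame, assuming its columns really are named days and products as in the attempted code, the same recipe would look roughly like this:
counts = (df.groupby(['days', 'products']).size()
            .reset_index(name='count')
            .sort_values(['days', 'count'], ascending=[True, False]))
top2_per_day = counts.groupby('days').head(2)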

How to maintain order when selecting rows in pandas dataframe?

I want to select rows in a particular order given in a list. For example
This dataframe
a=[['car',1],['bike',3],['jewel',2],['tv',5],['phone',6]]
df=pd.DataFrame(a,columns=['items','quantity'])
>>> df
items quantity
0 car 1
1 bike 3
2 jewel 2
3 tv 5
4 phone 6
I want to get the rows with this order ['tv','car','phone'], that is, first row tv and then car and then phone. I tried this method but it doesn't maintain order
arr=['tv','car','phone']
df.loc[df['items'].isin(arr)]
items quantity
0 car 1
3 tv 5
4 phone 6
Here's a non-intrusive solution using Index.get_indexer that doesn't involve setting the index:
df.iloc[pd.Index(df['items']).get_indexer(['tv','car','phone'])]
items quantity
3 tv 5
0 car 1
4 phone 6
Note that if this is going to become a frequent thing (by thing, I mean "indexing" with a list on a column), you're better off turning that column into an index. Bonus points if you sort it.
df2 = df.set_index('items')
df2.loc[['tv','car','phone']]
quantity
items
tv 5
car 1
phone 6
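Following up on the "bonus points if you sort it" remark, here is a small sketch of the sorted-index variant (same lookup; a sorted index mainly makes repeated .loc lookups faster on large frames):
df2 = df.set_index('items').sort_index()
df2.loc[['tv', 'car', 'phone']]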
IIUC, use Categorical:
df=df.loc[df['items'].isin(arr)]
df.iloc[pd.Categorical(df['items'],categories=arr,ordered=True).argsort()]
Out[157]:
items quantity
3 tv 5
0 car 1
4 phone 6
Or reindex. Note the only difference is that this will not preserve the previous index; if the original index matters, you should use Categorical. (As mentioned by Andy L., if you have duplicates in items, reindex will fail.)
df.set_index('items').reindex(arr).reset_index()
Out[160]:
items quantity
0 tv 5
1 car 1
2 phone 6
Or loop via the arr
pd.concat([df[df['items']==x] for x in arr])
Out[171]:
items quantity
3 tv 5
0 car 1
4 phone 6
merge to the rescue:
(pd.DataFrame({'items':['tv','car','phone']})
.merge(df, on='items')
)
Output:
items quantity
0 tv 5
1 car 1
2 phone 6
Assuming all the items to be chosen exist in the input df, here's one with searchsorted that should be good on performance:
In [43]: sidx = df['items'].argsort()
In [44]: df.iloc[sidx[df['items'].searchsorted(['tv','car','phone'],sorter=sidx)]]
Out[44]:
items quantity
3 tv 5
0 car 1
4 phone 6
I would create a dictionary from arr, map it onto items, then dropna and sort_values:
d = dict(zip(arr, range(len(arr))))
Out[684]: {'car': 1, 'phone': 2, 'tv': 0}
df.loc[df['items'].map(d).dropna().sort_values().index]
Out[693]:
items quantity
3 tv 5
0 car 1
4 phone 6
Here is another variety that uses .loc.
# Move items to the index, select, then reset.
df.set_index("items").loc[arr].reset_index()
Or another that doesn't change the index.
df.loc[df.reset_index().set_index("items").loc[arr]["index"]]
Why not:
>>> df.iloc[df.loc[df['items'].isin(arr), 'items'].apply(arr.index).sort_values().index]
items quantity
3 tv 5
0 car 1
4 phone 6
Why not search for index, filter and re-order:
df['new_order'] = df['items'].apply(lambda x: arr.index(x) if x in arr else -1)
df_new = df[df['new_order']>=0].sort_values('new_order')
items quantity new_order
3 tv 5 0
0 car 1 1
4 phone 6 2
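If the helper column should not appear in the result, it can be dropped afterwards (a trivial sketch):
df_new = df_new.drop(columns='new_order')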

Summarizing a dataset and creating new variables

I have a Dataset that lists individual transactions by country, quarter, division, the transaction type and the value. I would like to sum it up based on the first three variables but create new columns for the other two. The dataset looks like this:
Country Quarter Division Type Value
A 1 Sales A 50
A 2 Sales A 150
A 3 Sales B 20
A 1 Sales A 250
A 2 Sales B 50
A 3 Sales B 50
A 2 Marketing A 50
Now I would like to aggregate the data to get the number of transactions by type as a new variable. The overall number of transactions grouped by the first three variables is easy:
df.groupby(['Country', 'Quarter', 'Division'], as_index=False).agg({'Type':'count', 'Value':'sum'})
However, I would like my new dataframe to look as follows:
Country Quarter Division Type_A Type_B Value_A Value_B
A 1 Sales 2 0 300 0
A 2 Sales 1 1 150 50
A 3 Sales 0 2 0 70
A 2 Marketing 1 0 50 0
How do I do that?
Specify the column after groupby and pass tuples to agg for the new column names and their aggregate functions, then reshape with DataFrame.unstack, and finally flatten the MultiIndex columns with map:
df1 = (df.groupby(['Country', 'Quarter', 'Division', 'Type'])['Value']
         .agg([('Type', 'count'), ('Value', 'sum')])
         .unstack(fill_value=0))
df1.columns = df1.columns.map('_'.join)
df1 = df1.reset_index()
print (df1)
Country Quarter Division Type_A Type_B Value_A Value_B
0 A 1 Sales 2 0 300 0
1 A 2 Marketing 1 0 50 0
2 A 2 Sales 1 1 150 50
3 A 3 Sales 0 2 0 70
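Another sketch that reaches the same shape with pivot_table; this is not part of the original answer, and the column-flattening step below is an assumption about the wanted names:
df2 = df.pivot_table(index=['Country', 'Quarter', 'Division'],
                     columns='Type', values='Value',
                     aggfunc=['count', 'sum'], fill_value=0)
# Flatten ('count', 'A') -> 'Type_A' and ('sum', 'A') -> 'Value_A'
df2.columns = [('Type_' if agg == 'count' else 'Value_') + t for agg, t in df2.columns]
df2 = df2.reset_index()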

Most efficient way of joining dataframes in pandas: loc or join?

Suppose I have two dataframes: one holds transactions, trans, and the other holds product information, prod. I want to join the product prices (the price variable) onto the transaction data frame, repeating them for each matching transaction row. Which of these approaches is more efficient / preferred?
Method 1:
trans = trans.set_index('product_id').join(prod.set_index('product_id'))
Method 2:
trans = trans.set_index('product_id')
trans['price'] = prod.set_index('product_id').loc[trans.index, 'price'].values
It seems you need map:
trans = pd.DataFrame({'product_id': [1, 2, 3],
                      'price': [4, 5, 6]})
print (trans)
price product_id
0 4 1
1 5 2
2 6 3
prod = pd.DataFrame({'product_id': [1, 2, 4],
                     'price': [40, 50, 60]})
print (prod)
price product_id
0 40 1
1 50 2
2 60 4
d = prod.set_index('product_id')['price'].to_dict()
trans['price'] = trans['product_id'].map(d)
print (trans)
price product_id
0 40.0 1
1 50.0 2
2 NaN 3
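The intermediate dictionary is not strictly required; mapping directly from an indexed Series gives the same result (a sketch of the same idea):
trans['price'] = trans['product_id'].map(prod.set_index('product_id')['price'])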
