pandas add index to column name - python

I have a data frame with about 100 columns that repeat because the data is organized by weeks. It looks something like this:
hours   hours   clicks  clicks  days    days    minutes  minutes
week 1  week 2  week 1  week 2  week 1  week 2  week 1   week 2
2       2       2       3       6       2       2        3
1       7       6       3       8       2       9        3
I would like the output to look like this:
hours_w1  hours_w2  clicks_w1  clicks_w2  days_w1  days_w2  minutes_w1  minutes_w2
2         2         2          3          6        2        2           3
1         7         6          3          8        2        9           3
I know I can just rename the columns, but because I have over 100 columns I'm looking for a more efficient way.
I tried add_suffix, but I only managed to add the same suffix to all columns, when what I need is a different suffix for each week.
Any idea how to do this?
Thanks!!

Extract the suffixes from the first row, then add them to the column names, and finally remove the first row.
# Undo the '.1', '.2', ... suffixes that read_csv's mangle_dupe_cols adds to duplicate column names
df.columns = df.columns.str.split('.').str[0]
# 'week 1' -> '_w1' (underscore + first character + last character)
suffixes = '_' + df.iloc[0].str[0] + df.iloc[0].str[-1]
df.columns += suffixes
df = df.iloc[1:]
Output:
>>> df
   hours_w1  hours_w2  clicks_w1  clicks_w2  days_w1  days_w2  minutes_w1  minutes_w2
1         2         2          2          3        6        2           2           3
2         1         7          6          3        8        2           9           3
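For reference, here is a minimal self-contained sketch of the approach above; it assumes the frame was read in a way that mangled the duplicate headers to hours, hours.1, ... and that the first data row holds the 'week N' labels. The constructed frame is a hypothetical stand-in for the real data:
import pandas as pd

# Hypothetical reconstruction of the frame as read_csv would deliver it
df = pd.DataFrame(
    [['week 1', 'week 2'] * 4,
     [2, 2, 2, 3, 6, 2, 2, 3],
     [1, 7, 6, 3, 8, 2, 9, 3]],
    columns=['hours', 'hours.1', 'clicks', 'clicks.1',
             'days', 'days.1', 'minutes', 'minutes.1'])

df.columns = df.columns.str.split('.').str[0]            # hours.1 -> hours
suffixes = '_' + df.iloc[0].str[0] + df.iloc[0].str[-1]  # 'week 1' -> '_w1'
df.columns += suffixes
df = df.iloc[1:]
print(df)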

First, change the first row:
df.iloc[0] = df.iloc[0].apply(lambda x: 'w1' if x == 'week 1' else 'w2')
Then you can merge it with the column names like this:
df.columns = [i + '_' + j for i, j in zip(df.columns, df.iloc[0])]
And then you can remove the first row:
df = df.iloc[1:]
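Note that the lambda hard-codes two weeks. If there can be more, a string replacement generalizes the mapping (a small sketch, assuming the first row always holds labels of the form 'week N'):
df.iloc[0] = df.iloc[0].str.replace('week ', 'w')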

Related

Allocate lowest value over n rows to n rows in DataFrame

I need to take the lowest value over n rows and add it to these n rows in a new column of the dataframe. For example:
n=3
Column 1  Column 2
       5         3
       3         3
       4         3
       7         2
       8         2
       2         2
       5         4
       4         4
       9         4
       8         2
       2         2
       3         2
       5         2
Please note that if the number of rows is not divisible by n, the last values are incorporated into the last group; so in this example the final group spans 4 rows.
Thank you in advance!
I do not know any straightforward way to do this, but here is a working example (not elegant, but working...).
If you do not worry about the number of rows being divisible by n, you could use .groupby():
import pandas as pd

d = {'col1': [1, 2, 1, 5, 3, 2, 5, 6, 4, 1, 2]}
df = pd.DataFrame(data=d)
n = 3
df['new_col'] = df.groupby(df.index // n).transform('min')
which yields:
    col1  new_col
0      1        1
1      2        1
2      1        1
3      5        2
4      3        2
5      2        2
6      5        4
7      6        4
8      4        4
9      1        1
10     2        1
However, we can see that the last 2 rows are grouped together instead of being grouped with the 3 previous values, as intended in this case.
A way around this is to look at the .count() of elements in each group generated by groupby, and check the last one:
import pandas as pd

d = {'col1': [1, 2, 1, 5, 3, 2, 5, 6, 4, 1, 2]}
df = pd.DataFrame(data=d)
n = 3
# Temporary dataframe holding the per-group minima
A = df.groupby(df.index // n).transform('min')
# The min value of each group in a second dataframe
min_df = df.groupby(df.index // n).min()
# The size of the last group
last_batch = df.groupby(df.index // n).count()[-1:]
# If the last group's size is not equal to n, fold it into the previous group
if last_batch.values[0][0] != n:
    last_group = last_batch + n
    A[-last_group.values[0][0]:] = min_df[-2:].min()
# Assign the temporary modified dataframe to df
df['new_col'] = A
which yields the expected result:
    col1  new_col
0      1        1
1      2        1
2      1        1
3      5        2
4      3        2
5      2        2
6      5        1
7      6        1
8      4        1
9      1        1
10     2        1
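For the record, here is a more compact sketch of the same idea (an alternative to the code above, not the author's method): fold any short tail group into the previous group before taking the per-group minimum. It assumes a default RangeIndex and the example's column name col1:
import numpy as np
import pandas as pd

d = {'col1': [1, 2, 1, 5, 3, 2, 5, 6, 4, 1, 2]}
df = pd.DataFrame(data=d)
n = 3
# Group labels 0, 0, 0, 1, 1, 1, ...
groups = np.arange(len(df)) // n
if len(df) % n:                          # the tail group is shorter than n
    groups[groups == groups.max()] -= 1  # merge it into the previous group
df['new_col'] = df.groupby(groups)['col1'].transform('min')
which yields the same expected result.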

How to create N groups based on conditions in columns?

I need to create groups using two columns. For example, I took shop_id and week. Here is the df:
   shop_id  week
0        1     1
1        1     2
2        1     3
3        2     1
4        2     2
5        3     2
6        1     5
Imagine that each group is some promo which took place in each shop consecutively (week by week). So my attempt was to sort, shift by 1 to get last_week, build booleans, and then iterate over them, incrementing each time the condition is not met:
test_df = pd.DataFrame({'shop_id': [1, 1, 1, 2, 2, 3, 1], 'week': [1, 2, 3, 1, 2, 2, 5]})

def createGroups(df, shop_id, week, group):
    '''Create groups where the shop_id is the same and the weeks are consecutive'''
    periods = []
    period = 0
    # sorting to create chronological order
    df = df.sort_values(by=[shop_id, week], ignore_index=True)
    last_week = df[week].shift(+1) == df[week] - 1
    last_shop = df[shop_id].shift(+1) == df[shop_id]
    # here I iterate over the booleans and increment the group by 1
    # if the shop is different or the last period is more than 1 week ago
    for p, s in zip(last_week, last_shop):
        if (p == True) and (s == True):
            periods.append(period)
        else:
            period += 1
            periods.append(period)
    df[group] = periods
    return df

createGroups(test_df, 'shop_id', 'week', 'promo')
And I get the grouping I need:
   shop_id  week  promo
0        1     1      1
1        1     2      1
2        1     3      1
3        1     5      2
4        2     1      3
5        2     2      3
6        3     2      4
However, this function seems like overkill. Any ideas on how to get the same result without a for-loop, using native pandas functions? I saw .ngroup() in the docs but have no idea how to apply it to my case. Even better would be to vectorise it somehow, but I don't know how to achieve this :(
First we want to identify the promotions (consecutive weeks), then use groupby().ngroup() to enumerate the promotions:
df = df.sort_values('shop_id')
s = df['week'].diff().ne(1).groupby(df['shop_id']).cumsum()
df['promo'] = df.groupby(['shop_id',s]).ngroup() + 1
Update: This is based on your solution:
df = df.sort_values(['shop_id', 'week'])
s = df[['shop_id', 'week']]
df['promo'] = (s['shop_id'].ne(s['shop_id'].shift()) |
               s['week'].diff().ne(1)).cumsum()
Output:
   shop_id  week  promo
0        1     1      1
1        1     2      1
2        1     3      1
6        1     5      2
3        2     1      3
4        2     2      3
5        3     2      4
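For convenience, here is a self-contained, runnable version of the first solution using the question's test_df (a sketch; sorting by both columns is a small safety addition in case the weeks arrive unsorted within a shop):
import pandas as pd

df = pd.DataFrame({'shop_id': [1, 1, 1, 2, 2, 3, 1],
                   'week':    [1, 2, 3, 1, 2, 2, 5]})
df = df.sort_values(['shop_id', 'week'])
# Within each shop, a week gap other than 1 starts a new block; the cumsum
# of those block starts runs per shop thanks to the groupby.
s = df['week'].diff().ne(1).groupby(df['shop_id']).cumsum()
# Pairing shop_id with the block id enumerates promos across all shops.
df['promo'] = df.groupby(['shop_id', s]).ngroup() + 1
print(df)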

Sum up value in different numbers of columns for each row

I have a data frame that includes the number of tickets sold in different price buckets for each flight.
For each record/row, I want to use the value in one column as an index in the iloc function, to sum up values across a specific number of columns.
That is, for each row, I want to sum up the values from column index 5 up to the value in ['iloc_index'].
I tried df.iloc[:, 5:df['iloc_value']].sum(axis=1) but it did not work.
sample data:
   A  B  C  D  iloc_value  total
0  1  2  3  2           1
1  1  3  4  2           2
2  4  6  3  2           1
For each row, I want to sum up a number of columns based on the value in ['iloc_value'].
For example,
for row0, I want the total to be 1+2
for row1, I want the total to be 1+3+4
for row2, I want the total to be 4+6
EDIT:
I quickly got the results this way:
First define a function that can do it for one row:
def sum_till_iloc_value(row):
    return sum(row[:row['iloc_value'] + 1])
Then apply it to all rows to generate your output:
df_flights['sum'] = df_flights.apply(sum_till_iloc_value, axis=1)
   A  B  C  D  iloc_value  sum
0  1  2  3  2           1    3
1  1  3  4  2           2    8
2  4  6  3  2           1   10
PREVIOUSLY:
Assuming you have information that looks like:
df_flights = pd.DataFrame({'flight':['f1', 'f2', 'f3'], 'business':[2,3,4], 'economy':[6,7,8]})
df_flights
  flight  business  economy
0     f1         2        6
1     f2         3        7
2     f3         4        8
you can sum the columns you want as below:
df_flights['seat_count'] = df_flights['business'] + df_flights['economy']
This will create a new column that you can later select:
df_flights[['flight', 'seat_count']]
  flight  seat_count
0     f1           8
1     f2          10
2     f3          12
Here's a way to do that in a fully vectorized way: melting the dataframe, summing only the relevant columns, and getting the total back into the dataframe:
d = {y: x for x, y in enumerate(df.columns[:-1])}
temp_df = df.copy()
temp_df = temp_df.rename(columns=d)
temp_df = temp_df.reset_index().melt(id_vars=["index", "iloc_value"])
temp_df = temp_df[temp_df.variable <= temp_df.iloc_value]
df["total"] = temp_df.groupby("index").value.sum()
The output is:
   A  B  C  D  iloc_value  total
0  1  2  3  2           1      3
1  1  3  4  2           2      8
2  4  6  3  2           1     10
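Another fully vectorized option (a sketch, assuming the same sample frame with value columns A through D) builds a boolean mask over the columns with NumPy broadcasting and sums only the unmasked entries:
import numpy as np
import pandas as pd

df = pd.DataFrame({'A': [1, 1, 4], 'B': [2, 3, 6], 'C': [3, 4, 3],
                   'D': [2, 2, 2], 'iloc_value': [1, 2, 1]})

vals = df[['A', 'B', 'C', 'D']].to_numpy()
# Keep column j of row i only when j <= iloc_value of row i, then sum each row
mask = np.arange(vals.shape[1]) <= df['iloc_value'].to_numpy()[:, None]
df['total'] = (vals * mask).sum(axis=1)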

Is it possible to use vectorization for a conditional count of rows in a Pandas Dataframe?

I have a Pandas Dataframe with data about calls. Each call has a unique ID and each customer has an ID (but can have multiple calls). A third column gives a day. For each customer I want to calculate the maximum number of calls made in a period of 7 days.
I have been using the following code to count the number of calls within 7 days of the call on each row:
df['ContactsIN7Days'] = df.apply(lambda row: len(df[(df['PersonID']==row['PersonID']) & (abs(df['Day'] - row['Day']) <=7)]), axis=1)
Output:
CallID  Day  PersonID  ContactsIN7Days
     6    2         3                2
     3   14         2                2
     1    8         1                1
     5    1         3                2
     2   12         2                2
     7  100         3                1
This works; however, it is going to be applied to a big data set. Would there be a way to make this more efficient, perhaps through vectorization?
IIUC, this is a convoluted but, I think, effective solution to your issue. Note that the order of your dataframe is modified as a result, and that your Day column is converted to a timedelta dtype:
Starting from your dataframe df:
   CallID  Day  PersonID
0       6    2         3
1       3   14         2
2       1    8         1
3       5    1         3
4       2   12         2
5       7  100         3
Start by modifying Day to a timedelta series:
df['Day'] = pd.to_timedelta(df['Day'], unit='d')
Then use pd.merge_asof to merge your dataframe with the count of calls made by each individual in a 7-day period. To get the counts, use groupby with a pd.Grouper with a frequency of 7 days:
new_df = (pd.merge_asof(df.sort_values(['Day']),
                        df.sort_values(['Day'])
                          .groupby([pd.Grouper(key='Day', freq='7d'), 'PersonID'])
                          .size()
                          .to_frame('ContactsIN7Days')
                          .reset_index(),
                        left_on='Day', right_on='Day',
                        left_by='PersonID', right_by='PersonID',
                        direction='nearest'))
Your resulting new_df will look like this:
   CallID       Day  PersonID  ContactsIN7Days
0       5    1 days         3                2
1       6    2 days         3                2
2       1    8 days         1                1
3       2   12 days         2                2
4       3   14 days         2                2
5       7  100 days         3                1
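If you want to keep the exact semantics of the original apply (count all of a person's calls within 7 days either side of each call), a middle ground is to vectorize the pairwise comparison per person with NumPy broadcasting. This is a sketch, still quadratic per person but free of the row-wise apply over the whole frame:
import numpy as np
import pandas as pd

df = pd.DataFrame({'CallID': [6, 3, 1, 5, 2, 7],
                   'Day': [2, 14, 8, 1, 12, 100],
                   'PersonID': [3, 2, 1, 3, 2, 3]})

def counts_within_7(days):
    # Pairwise |day_i - day_j| <= 7 within one person, counted per row
    d = days.to_numpy()
    return (np.abs(d[:, None] - d) <= 7).sum(axis=1)

df['ContactsIN7Days'] = df.groupby('PersonID')['Day'].transform(counts_within_7)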

Pandas multi index Dataframe - Select and remove

I need some help with cleaning a dataframe that has a multi index.
It looks something like this:
                          cost
location         season
Thorp park       autumn    £12
                 spring    £13
                 summer    £22
Sea life centre  summer    £34
                 spring    £43
Alton towers     and so on.............
location and season are index columns. I want to go through the data and remove any location that doesn't have "season" values for all three seasons. So "Sea life centre" should be removed.
Can anyone help me with this?
Also, another question: my dataframe was created from a groupby command and doesn't have a column name for the "cost" column. Is this normal? There are values in the column, just no header.
Option 1
groupby + count. You can use the result to index your dataframe.
df

     col
a 1    0
  2    1
b 1    3
  2    4
  3    5
c 2    7
  3    8
v = df.groupby(level=0).transform('count').values
df = df[v == 3]
df

     col
b 1    3
  2    4
  3    5
Option 2
groupby + filter. This is Paul H's idea, will remove if he wants to post.
df.groupby(level=0).filter(lambda g: g.count() == 3)
     col
b 1    3
  2    4
  3    5
Option 1
Thinking outside the box...
df.drop(df.count(level=0).col[lambda x: x < 3].index)
     col
b 1    3
  2    4
  3    5
Same thing with a little more robustness because I'm not depending on values in a column.
df.drop(df.index.to_series().count(level=0).loc[lambda x: x < 3].index)
     col
b 1    3
  2    4
  3    5
Option 2
Robustified for the general case with an undetermined number of seasons.
This uses the groupby.pipe method introduced in pandas 0.21:
df.groupby(level=0).pipe(lambda g: g.filter(lambda d: len(d) == g.size().max()))
     col
b 1    3
  2    4
  3    5
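To reproduce these outputs, here is a minimal sketch of the sample frame used above (the letters and numbers are stand-ins for the question's locations and seasons):
import pandas as pd

idx = pd.MultiIndex.from_tuples(
    [('a', 1), ('a', 2), ('b', 1), ('b', 2), ('b', 3), ('c', 2), ('c', 3)])
df = pd.DataFrame({'col': [0, 1, 3, 4, 5, 7, 8]}, index=idx)

# Keep only outer-level groups that have all three inner-level entries
print(df[df.groupby(level=0)['col'].transform('count') == 3])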
