Make new column from slice of string from one column pandas python

I read the answer at the linked question (Pandas make new column from string slice of another column), but it does not solve my problem.
df
SKU Noodles FaceCream BodyWash Powder Soap
Jan10_Sales 122 100 50 200 300
Feb10_Sales 100 50 80 90 250
Mar10_sales 40 30 100 10 11
and so on
Now I want month and year columns that take their values from the SKU column: Jan for the month and 10 for the year (2010).
df['month']=df['SKU'].str[0:3]
df['year']=df['SKU'].str[4:5]
I get KeyError: 'SKU'
To understand why I get this error, I check the following:
[IN]df.index.name
[OUT]None
[IN]df.columns
[OUT]Index(['Noodles','FaceCream','BodyWash','Powder','Soap'], dtype='object', name='SKU')
Please help

I think the first column is the index, so use .index. Also, for the year, change the 4:5 slice to 3:5; the 0 in 0:3 can be omitted:
df['month']=df.index.str[:3]
df['year']=df.index.str[3:5]
print (df)
Noodles FaceCream BodyWash Powder Soap month year
SKU
Jan10_Sales 122 100 50 200 300 Jan 10
Feb10_Sales 100 50 80 90 250 Feb 10
Mar10_sales 40 30 100 10 11 Mar 10
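If the full calendar year is needed later, a small follow-up sketch (assuming every two-digit year in the data falls in the 2000s; year_full is just an illustrative column name):
df['year_full'] = 2000 + df.index.str[3:5].astype(int)  # e.g. '10' -> 2010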

Related

How to summarize only certain columns of dataframe (python pandas)

I want to get a new dataframe in which I can see the sum of certain columns for rows that have the same values in the 'index' columns (campaign_id and group_name in my example).
This is a sample (example) of my dataframe:
campaign_id group_name clicks conversions cost label city_id
101 blue 40 15 100 foo 15
102 red 20 5 50 bar 12
102 red 7 3 25 bar 12
102 brown 5 0 18 bar 12
this is what I want to get:
campaign_id group_name clicks conversions cost label city_id
101 blue 40 15 100 foo 15
102 red 27 8 75 bar 12
102 brown 5 0 18 bar 12
I tried:
df = df.groupby(['campaign_id','group_name'])['clicks','conversions','cost'].sum().reset_index()
but this gives me only the mentioned (summarized) columns (and the index), like this:
campaign_id group_name clicks conversions cost
101 blue 40 15 100
102 red 27 8 75
102 brown 5 0 18
I can try to add the leftover columns after this operation, but I'm not sure that would be the optimal and adequate way to solve the problem.
Is there a simple way to summarize certain columns and leave the other columns untouched? (I don't care if they differ, because in my data all leftover columns have the same values for rows with the same corresponding values in the 'index' columns, which are campaign_id and group_name.)
When I finished my post I saw the answer right away: since all columns except the ones I want to summarize have matching values, I just need to include all of them in the group keys for this operation. Like this:
df = df.groupby(['campaign_id','group_name','label','city_id'])[['clicks','conversions','cost']].sum().reset_index()
In this case I got exactly what I wanted.
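An alternative worth considering (a sketch, not part of the original answer): aggregate the numeric columns with sum and keep the first value of the leftover columns, so they do not have to be listed as group keys:
agg_map = {'clicks': 'sum', 'conversions': 'sum', 'cost': 'sum',
           'label': 'first', 'city_id': 'first'}
df = df.groupby(['campaign_id', 'group_name'], as_index=False).agg(agg_map)
Since the leftover columns are identical within each group, 'first' simply carries them through.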

python pandas averaging columns to produce new ones

I have a Pandas DataFrame with the following data, displaying the hours worked per week for employees at a company:
name week 1 week 2 week 3 week 4...
joey 20 15 35 10
thomas 20 10 25 15
mark 30 20 25 10
sal 25 25 15 20
amy 25 30 20 10
Assume the data carries on in the same way for 100+ weeks.
What I want to produce is a biweekly average of hours for each employee,
so the average hours worked over two weeks. Shown in the following DataFrame:
name weeks 1-2 weeks 3-4...
joey 17.5 22.5
thomas 15 20
mark 25 17.5
sal 25 17.5
amy 27.5 15
How could I make this work? Trying out iterating right now but I'm stuck.
You can achieve that with the following:
for i in range(0, len(df.columns), 2):
    df[f'weeks {i+1}-{i+2}'] = df.iloc[:, i:i+2].mean(axis=1)
This code iterates over the column positions in steps of 2. It then selects the column at the current position (variable i) and the following column (i+1), averages the two, and stores the result in a new column.
It assumes the week columns are properly ordered and that they are the only columns in the dataframe (e.g. with name as the index).
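An alternative sketch (not the original answer) that avoids the explicit loop: pair up the week columns with an integer array and average each pair. The dataframe below is a hypothetical reconstruction of the question's data with name as the index:
import numpy as np
import pandas as pd

df = pd.DataFrame(
    {'week 1': [20, 20, 30, 25, 25],
     'week 2': [15, 10, 20, 25, 30],
     'week 3': [35, 25, 25, 15, 20],
     'week 4': [10, 15, 10, 20, 10]},
    index=['joey', 'thomas', 'mark', 'sal', 'amy'])

# assign group 0 to weeks 1-2, group 1 to weeks 3-4, and so on
groups = np.arange(len(df.columns)) // 2

# transpose so the week columns become rows, average within each pair, transpose back
biweekly = df.T.groupby(groups).mean().T
biweekly.columns = [f'weeks {2 * g + 1}-{2 * g + 2}' for g in biweekly.columns]
print(biweekly)
This produces the weeks 1-2 and weeks 3-4 averages shown in the expected output.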

create column name repeats for column values when particular columns have duplicate rows

I have a dataframe that I need to spin around (I am not sure if this involves stacking or pivoting...).
So, where I have duplicate values in the columns "Year", "Month" and "Group", I want the following column names to be repeated for each Variable.
So if this is the original DF:
Year Month Group Variable feature1 feature2 feature3
2010 6 1 1 12 23 56
2010 6 1 2 34 56 25
The result will be :
Year Month Group Variable1 feature1_1 feature2_1 feature3_1 Variable2 feature1_2 feature2_2 feature3_2
2010 6 1 1 12 23 56 2 34 56 25
I am looking for something along these lines - any tips/help is much appreciated.
Thank you,
Izzy
IIUC, if you want to convert it from long back to wide, you can use cumcount to create the additional key, then reshape. (Notice this is the reverse of wide_to_long.)
df['New']=(df.groupby(['Year','Month','Group']).cumcount()+1).astype(str)
w=df.set_index(['Year','Month','Group','New']).unstack().sort_index(level=1,axis=1)
w.columns=pd.Index(w.columns).str.join('_')
w
Out[217]:
Variable_1 feature1_1 feature2_1 feature3_1 Variable_2 feature1_2 feature2_2 feature3_2
Year Month Group
2010 6 1 1 12 23 56 2 34 56 25
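As a quick sanity check (a sketch, assuming the flattened column names produced above and pandas imported as pd), pd.wide_to_long can take w back to the original long layout:
long_again = pd.wide_to_long(
    w.reset_index(),
    stubnames=['Variable', 'feature1', 'feature2', 'feature3'],
    i=['Year', 'Month', 'Group'],
    j='New',
    sep='_',
).reset_index()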

Update dataframe header with values from another dataframe

I'm working with census data (using the Census package). When I select variables with the census API, they come through in their raw format (e.g. B01001_007) and I'd like to replace the column name with the label (e.g. male 18 to 19 years).
I know this can be done through df.columns = ['male 18 to 19 years',
'male 20 years',
'male 21 years']
but this is tedious.
Is there a way to do some type of mapping that automatically applies these labels to the headers in my df below?
Sample data:
import pandas as pd
from pandas import DataFrame
variables_table = pd.DataFrame({'variable': ['B01001_007E','B01001_008E','B01001_009E'],
'label': ['male 18 to 19 years','male 20 years','male 21 years']
})
variables_table
label variable
male 18 to 19 years B01001_007E
male 20 years B01001_008E
male 21 years B01001_009E
Unclean output:
df = pd.DataFrame({'B01001_007E': ['100','200','300'],
'B01001_008E': ['300','200','100'],
'B01001_009E': ['500','100','200']})
df
B01001_007E B01001_008E B01001_009E
100 300 500
200 200 100
300 100 200
df.rename(columns=variables_table.set_index('variable')['label'])
Out:
male 18 to 19 years male 20 years male 21 years
0 100 300 500
1 200 200 100
2 300 100 200
Note that variables_table.set_index('variable')['label'] is a Series whose index is 'variable'. rename will do the mapping on that index.
This is not an inplace operation. If you want to change the actual dataframe, assign it back to df: df = df.rename(columns=variables_table.set_index('variable')['label']) or use the inplace parameter: df.rename(columns=variables_table.set_index('variable')['label'], inplace=True)
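If you prefer an explicit dictionary, an equivalent sketch built from the same variables_table:
mapping = dict(zip(variables_table['variable'], variables_table['label']))
df = df.rename(columns=mapping)
rename silently skips any column that is missing from the mapping, so extra raw variable codes are left unchanged.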

data cleaning a python dataframe

I have a Python dataframe with 1408 lines of data. My goal is to compare the largest and smallest numbers associated with a given weekday during one week to the next week's number on the same day of the week on which the prior largest/smallest occurred. Essentially, I want to look at quintiles (since there are 5 days in a business week), rank 1 and 5, and see how they change from week to week, and build a CDF of the numbers associated with each weekday.
To clean the data, I need to remove 18 weeks in total: every week in the dataframe associated with a holiday, plus the entire week following the week in which the holiday occurred.
After this, I think I should insert a column in the dataframe that labels all my data with Monday through Friday, for all the dates in the file (there are 6 years of data). The reason for labeling M-F is so that I can sort the numbers associated with each day of the week in ascending order and query on the day of the week.
Methodological suggestions on either step, or both, would be immensely appreciated.
Thank you!
The second step seems best tackled with a combination of df.groupby() and apply() on the resulting GroupBy object. Perhaps an example is the best way to explain.
Given a dataframe:
In [53]: df
Out[53]:
Value
2012-08-01 61
2012-08-02 52
2012-08-03 89
2012-08-06 44
2012-08-07 35
2012-08-08 98
2012-08-09 64
2012-08-10 48
2012-08-13 100
2012-08-14 95
2012-08-15 14
2012-08-16 55
2012-08-17 58
2012-08-20 11
2012-08-21 28
2012-08-22 95
2012-08-23 18
2012-08-24 81
2012-08-27 27
2012-08-28 81
2012-08-29 28
2012-08-30 16
2012-08-31 50
In [54]: def rankdays(df):
.....: if len(df) != 5:
.....: return pandas.Series()
.....: return pandas.Series(df.Value.rank(), index=df.index.weekday)
.....:
In [52]: df.groupby(lambda x: x.week).apply(rankdays).unstack()
Out[52]:
0 1 2 3 4
32 2 1 5 4 3
33 5 4 1 2 3
34 1 3 5 2 4
35 2 5 3 1 4
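For the cleaning and labeling steps (the first part of the question), a rough sketch assuming the dataframe has a DatetimeIndex like the example above; the holiday dates are hypothetical placeholders:
import pandas as pd

# label every row with its weekday name ('Monday' ... 'Friday')
df['weekday'] = df.index.day_name()

# hypothetical holiday dates; substitute the real list
holidays = pd.to_datetime(['2012-08-06'])

# collect the ISO week of each holiday plus the following week
# (simplified: keys on week number only; with 6 years of data you
# would key on (year, week) pairs instead)
holiday_weeks = set()
for h in holidays:
    wk = h.isocalendar()[1]
    holiday_weeks.update({wk, wk + 1})

week_numbers = df.index.isocalendar().week
df_clean = df[~week_numbers.isin(holiday_weeks)]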
