Creating a pandas column based on values from other columns - python

So I'm working with a pandas dataframe that looks like this:
Current Panda Table
I want to sum all of the times for each individual property in a given week; my idea is to append this to the dataframe like this:
Dataframe2
Then, to simplify things, I'd create a new dataframe that looks like this:
Property Name  Week  Total_weekly_time
A              1     60
A              2     xx
B              1     xx
etc.           etc.
I'm new to pandas and trying to learn the ins and outs. Any answers are much appreciated, as well as references for learning pandas better.

I think you need transform if you need a new column with the same dimension as df after groupby:
df['Total_weekly_time'] = (df.groupby(['Property Name', 'Week #'])['Duration']
                             .transform('sum'))
print (df)
Property Name Week # Duration Total_weekly_time
0 A 1 10 60
1 A 1 10 60
2 A 2 5 5
3 B 1 20 70
4 B 1 20 70
5 B 1 20 70
6 C 2 10 10
7 C 3 30 50
8 A 1 40 60
9 A 4 40 40
10 B 1 5 70
11 B 1 5 70
12 C 3 10 50
13 C 3 10 50
Pandas docs
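If you only need the simplified summary frame (one row per property/week) rather than a new column, an aggregating sum() is one possibility. A minimal sketch, using made-up sample data with the column names from the answer above:

```python
import pandas as pd

# Made-up sample data with the same column names as above.
df = pd.DataFrame({
    'Property Name': ['A', 'A', 'A', 'B', 'B'],
    'Week #':        [1,   1,   2,   1,   1],
    'Duration':      [10,  10,  5,   20,  20],
})

# Aggregate instead of transform: one row per (property, week) pair.
summary = (df.groupby(['Property Name', 'Week #'], as_index=False)['Duration']
             .sum()
             .rename(columns={'Duration': 'Total_weekly_time'}))
print(summary)
```

The difference is that transform('sum') keeps the original row count, while sum() collapses each group to a single row.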

Related

Adding multiple columns randomly to a dataframe from columns in another dataframe

I've looked everywhere but can't find a solution.
Let's say I have two tables:
Year
1
2
3
4
and
ID Value
1 10
2 50
3 25
4 20
5 40
I need to pick randomly from both columns of the 2nd table to add to the first table. So if ID=3 is picked randomly as a column to add to the first table, I also add Value=25, i.e. I end up with something like:
Year ID Value
1 3 25
2 1 10
3 1 10
4 5 40
5 2 50
IIUC, do you want this?
df_year[['ID', 'Value']] = df_id.sample(n=len(df_year), replace=True).to_numpy()
Output:
Year ID Value
0 1 4 20
1 2 4 20
2 3 2 50
3 4 3 25
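A runnable sketch of the sample-based approach, assuming hypothetical frame names df_year and df_id; random_state is only there to make the draw reproducible:

```python
import pandas as pd

df_year = pd.DataFrame({'Year': [1, 2, 3, 4]})
df_id = pd.DataFrame({'ID': [1, 2, 3, 4, 5],
                      'Value': [10, 50, 25, 20, 40]})

# Draw len(df_year) full rows with replacement, so each picked ID keeps
# its matching Value, then attach them positionally.
picks = df_id.sample(n=len(df_year), replace=True, random_state=0)
df_year[['ID', 'Value']] = picks.to_numpy()
print(df_year)
```

Sampling whole rows (rather than each column independently) is what keeps the ID/Value pairs intact.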

Python dataframe rank each column based on row values

I have a data frame. I want to rank each column based on its row value
Ex:
xdf = pd.DataFrame({'A':[10,20,30],'B':[5,30,20],'C':[15,3,8]})
xdf =
A B C
0 10 5 15
1 20 30 3
2 30 20 8
Expected result:
xdf =
A B C Rk_1 Rk_2 Rk_3
0 10 5 15 C A B
1 20 30 3 B A C
2 30 20 8 A B C
OR
xdf =
A B C A_Rk B_Rk C_Rk
0 10 5 15 2 3 1
1 20 30 3 2 1 3
2 30 20 8 1 2 3
Why I need this:
I want to track the trend of each column and how it is changing, and I would like to show this with a plot. Maybe a bar plot showing how many times A got Rank 1, 2, 3, etc.
My approach:
xdf[['Rk_1', 'Rk_2', 'Rk_3']] = ""
for i in range(len(xdf)):
    xdf.loc[i, ['Rk_1', 'Rk_2', 'Rk_3']] = dict(sorted(dict(xdf[['A', 'B', 'C']].loc[i]).items(),
                                                       reverse=True, key=lambda item: item[1])).keys()
Present output:
A B C Rk_1 Rk_2 Rk_3
0 10 5 15 C A B
1 20 30 3 B A C
2 30 20 8 A B C
I am iterating through each row, converting the row into a dictionary, sorting the values, and then extracting the keys (columns). Is there a better approach? My actual data frame has 10000 rows and 12 columns to be ranked. I just executed it and it took around 2 minutes.
You should be able to get your desired dataframe by using:
ranked = xdf.join(xdf.rank(ascending=False, method='first', axis=1), rsuffix='_rank')
This'll give you:
A B C A_rank B_rank C_rank
0 10 5 15 2.0 3.0 1.0
1 20 30 3 2.0 1.0 3.0
2 30 20 8 1.0 2.0 3.0
Then do whatever you need to do plotting wise.
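If you also want the first form (column names per rank rather than numeric ranks), one vectorized possibility is to argsort each row. A sketch assuming the same xdf as in the question:

```python
import numpy as np
import pandas as pd

xdf = pd.DataFrame({'A': [10, 20, 30], 'B': [5, 30, 20], 'C': [15, 3, 8]})

# argsort of the negated values gives, per row, the column positions in
# descending order of value; indexing the column names with that array
# yields the Rk_1..Rk_3 labels directly.
order = np.argsort(-xdf.to_numpy(), axis=1)
ranks = pd.DataFrame(xdf.columns.to_numpy()[order],
                     columns=['Rk_1', 'Rk_2', 'Rk_3'], index=xdf.index)
out = xdf.join(ranks)
print(out)
```

This avoids the per-row Python loop entirely, which is where the 2 minutes were going.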

pandas increment count column based on value of another column

I have a df that is the result of a join:
ID count
0 A 30
1 A 30
2 B 5
3 C 44
4 C 44
5 C 44
I would like to be able to iterate the count column based on the ID column. Here is an example of the desired result:
ID count
0 A 30
1 A 31
2 B 5
3 C 44
4 C 45
5 C 46
I know there are non-pythonic ways to do this via loops, but I am wondering if there is a more intelligent (and time efficient, as this table is large) way to do this.
Use a groupby cumcount to get a per-group cumulative count, and add it to count, e.g.:
df['count'] += df.groupby('ID')['count'].cumcount()
Gives you:
ID count
0 A 30
1 A 31
2 B 5
3 C 44
4 C 45
5 C 46

How to impute missing values based on other variables

I have a dataframe like below:
df = pd.DataFrame({'one': pd.Series(['a', 'b', 'c', 'd', 'aa', 'bb', np.nan, 'b', 'c', np.nan, np.nan]),
                   'two': pd.Series([10, 20, 30, 40, 50, 60, 10, 20, 30, 40, 50])})
The first column holds the variables and the second column the values. Each variable's value is constant and will never change.
For example, 'a' value is 10; whenever 'a' is present, the corresponding value will be 10.
Here some values are missing in the first column, e.g. NaN 10 which is 'a', NaN 40 which is 'd'; likewise, the dataframe contains 200 variables.
The values are not continuous variables; they are discrete and unsortable.
In this case, how can we impute the missing values?
Expected output should be:
Please help me on this.
Regards,
Venkat.
I think in general it would be better to group and fill. We use DataFrame.groupby:
df.groupby('two').apply(lambda x: x.ffill().bfill())
It can be done without using groupby but you have to sort by both columns:
df.sort_values(['two','one']).ffill().sort_index()
Below I show how the method proposed in another answer may fail. Here is an example:
df = pd.DataFrame({'one': ['a', np.nan, 'c', 'd', np.nan, 'c', 'b', 'b', np.nan, 'a'],
                   'two': [10, 20, 30, 40, 10, 30, 20, 20, 30, 10]})
print(df)
one two
0 a 10
1 NaN 20
2 c 30
3 d 40
4 NaN 10
5 c 30
6 b 20
7 b 20
8 NaN 30
9 a 10
df.sort_values(['two']).fillna(method='ffill').sort_index()
one two
0 a 10
1 a 20
2 c 30
3 d 40
4 a 10
5 c 30
6 b 20
7 b 20
8 c 30
9 a 10
As you can see, the method proposed in another answer fails here (see row 1). This occurs because a NaN can be the first row for a specific value of the column 'two', and it is then filled with the value of the preceding group.
This doesn't happen if we group first:
df.groupby('two').apply(lambda x: x.ffill().bfill())
one two
0 a 10
1 b 20
2 c 30
3 d 40
4 a 10
5 c 30
6 b 20
7 b 20
8 c 30
9 a 10
As I said, we can use DataFrame.sort_values, but we need to sort by both columns. I recommend this method.
df.sort_values(['two','one']).ffill().sort_index()
one two
0 a 10
1 b 20
2 c 30
3 d 40
4 a 10
5 c 30
6 b 20
7 b 20
8 c 30
9 a 10
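Another possibility: since each value of 'two' maps to exactly one variable, you can build that mapping from the non-null rows and fill from it, with no sorting needed. A sketch using the misordered example frame from above:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'one': ['a', np.nan, 'c', 'd', np.nan, 'c', 'b', 'b', np.nan, 'a'],
                   'two': [10, 20, 30, 40, 10, 30, 20, 20, 30, 10]})

# Build a two -> one lookup from the rows where 'one' is known, then
# fill the missing entries by mapping 'two' through it.
lookup = df.dropna(subset=['one']).drop_duplicates('two').set_index('two')['one']
df['one'] = df['one'].fillna(df['two'].map(lookup))
print(df)
```

This is immune to the ordering problem demonstrated above, because the fill never looks at neighboring rows at all.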
Here it is:
df.ffill(inplace=True)
output:
one two
0 a 10
1 b 20
2 c 30
3 d 40
4 aa 50
5 bb 60
6 a 10
7 b 20
8 c 30
9 d 40
10 aa 50
Try this:
df = df.sort_values(['two']).fillna(method='ffill').sort_index()
Which will give you
one two
0 a 10
1 b 20
2 c 30
3 d 40
4 aa 50
5 bb 60
6 a 10
7 b 20
8 c 30
9 d 40
10 aa 50

Embedding modified sub-frame into the original one

I have 2 pandas data frames, one of which consists of modified selected rows of the first (they have the same columns).
For simplicity, the frames below illustrate this problem.
df1 =
   A  B  C
0  1  2  3
1  2  3  4
2  3  4  5
3  4  5  6

df2 =
   A   B   C
1  20  30  40
3  40  50  60
Is there any more efficient and pythonic way than the code below to embed df2 into df1 by overwriting values? (I'm working with high-dimensional frames.)
for index, row in df2.iterrows():
    df1.ix[index, :] = df2.ix[index, :]
which results in:
df1 =
A B C
0 1 2 3
1 20 30 40
2 3 4 5
3 40 50 60
You can use update to update a df with another df; where the row and column labels agree, the values are updated. You will need to cast back to int using astype, because the dtype is changed to float due to missing values:
In [21]:
df1.update(df2)
df1 = df1.astype(int)
df1
Out[21]:
A B C
0 1 2 3
1 20 30 40
2 3 4 5
3 40 50 60
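A note on newer pandas: .ix has since been removed, and a loc-based assignment is another possibility that avoids the float cast, since no missing values are introduced. A sketch assuming the example frames from the question:

```python
import pandas as pd

df1 = pd.DataFrame({'A': [1, 2, 3, 4], 'B': [2, 3, 4, 5], 'C': [3, 4, 5, 6]})
df2 = pd.DataFrame({'A': [20, 40], 'B': [30, 50], 'C': [40, 60]}, index=[1, 3])

# Overwrite only the rows of df1 whose index labels appear in df2,
# in one label-aligned assignment; dtypes stay int.
df1.loc[df2.index, df2.columns] = df2
print(df1)
```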
