Adding calculated constant value into Python data frame - python

I'm new to Python, and I believe this is a very basic question (sorry for that), but I tried to look for a solution here: Better way to add constant column to pandas data frame and here: add column with constant value to pandas dataframe and in many other places...
I have a data frame like this "toy" sample:
A B
10 5
20 12
50 200
and I want to add a new column (C) holding the ratio of the last cells of A and B (50/200). So in my example, I'd like to get:
A B C
10 5 0.25
20 12 0.25
50 200 0.25
I tried to use this code:
groupedAC ['pNr'] = groupedAC['cIndCM'][-1:]/groupedAC['nTileCM'][-1:]
but I'm getting the result only in the last cell (I believe it's a result of my code acting as a "pointer" and not as a number - but as I said, I tried to "convert" my result into a constant (even using temp variables) but with no success).
Your help will be appreciated!

You need to index with .iloc[-1] instead of .iloc[-1:]. The latter returns a one-element Series, so when the result is assigned back to the data frame its index has to be matched and only the last row gets a value:
df.B.iloc[-1:] # returns a Series
#2 200
#Name: B, dtype: int64
df['C'] = df.A.iloc[-1:]/df.B.iloc[-1:] # the index has to be matched in this case, so only
# the row with index = 2 gets updated
df
# A B C
#0 10 5 NaN
#1 20 12 NaN
#2 50 200 0.25
df.B.iloc[-1] # returns a scalar
# 200
df['C'] = df.A.iloc[-1]/df.B.iloc[-1] # there's nothing to match when assigning a
# scalar to a new column, so the value is broadcast to every row
df
# A B C
#0 10 5 0.25
#1 20 12 0.25
#2 50 200 0.25
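As a self-contained check of the behaviour described above, here is a minimal sketch that rebuilds the toy data from the question. The column names C_series and C and the variables aligned and ratio are only illustrative. It contrasts the one-element Series, which is aligned on the index, with the scalar, which is broadcast to every row.

```python
import pandas as pd

# Toy data from the question (rebuilt here so the example runs on its own).
df = pd.DataFrame({"A": [10, 20, 50], "B": [5, 12, 200]})

# Slicing with [-1:] keeps a one-element Series with index label 2,
# so the assignment only fills the row whose index matches.
aligned = df["A"].iloc[-1:] / df["B"].iloc[-1:]
df["C_series"] = aligned

# Indexing with [-1] returns a plain scalar, which is broadcast to all rows.
ratio = df["A"].iloc[-1] / df["B"].iloc[-1]
df["C"] = ratio

print(df)
```

If only the value is needed, storing it in a variable first (as with ratio here) also makes it easy to reuse in later calculations.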


Merging content of two rows in Pandas

I have a data frame, where I would like to merge the content of two rows, and have it separated by underscore, within the same cell.
If this is the original DF:
0 eye-right eye-right hand
1 location location position
2 12 27.7 2
3 14 27.6 2.2
I would like it to become:
0 eye-right_location eye-right_location hand_position
1 12 27.7 2
2 14 27.6 2.2
Eventually I would like to translate row 0 to become header, and reset indexes for the entire df.
You can set your column labels, slice via iloc, then reset_index:
print(df)
# 0 1 2
# 0 eye-right eye-right hand
# 1 location location position
# 2 12 27.7 2
# 3 14 27.6 2.2
df.columns = (df.iloc[0] + '_' + df.iloc[1])
df = df.iloc[2:].reset_index(drop=True)
print(df)
# eye-right_location eye-right_location hand_position
# 0 12 27.7 2
# 1 14 27.6 2.2
I like jpp's answer a lot. Short and sweet. Perfect for quick analysis.
Just one quibble: The resulting DataFrame is generically typed. Because strings were in the first two rows, all columns are considered type object. You can see this with the info method.
For data analysis, it's often preferable that columns have specific numeric types. This can be tidied up with one more line:
df.columns = df.iloc[0] + '_' + df.iloc[1]
df = df.iloc[2:].reset_index(drop=True)
df = df.apply(pd.to_numeric)
The third line here applies pandas' to_numeric function to each column in turn, leaving a DataFrame whose columns have numeric dtypes.
While not essential for simple usage, as soon as you start performing math on DataFrames, or start using very large data sets, column types become something you'll need to pay attention to.
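To see the effect on the column types, here is a small sketch under a few assumptions: the frame is rebuilt by hand, all cells start out as strings, and the first column is renamed to eye-left so that the merged labels are unique (the original example has two identical labels). The variable names raw, tidy and numeric are illustrative only.

```python
import pandas as pd

# Hand-built stand-in for the original frame; every cell is a string.
raw = pd.DataFrame([
    ["eye-left", "eye-right", "hand"],
    ["location", "location", "position"],
    ["12", "27.7", "2"],
    ["14", "27.6", "2.2"],
])

# Merge the first two rows into the header, then drop them from the data.
raw.columns = raw.iloc[0] + "_" + raw.iloc[1]
tidy = raw.iloc[2:].reset_index(drop=True)
print(tidy.dtypes)     # still string-like columns

# Convert each column to a numeric dtype.
numeric = tidy.apply(pd.to_numeric)
print(numeric.dtypes)  # integer or float columns
```

Note that pd.to_numeric raises an error by default if a column contains a value it cannot parse, which is usually what you want while cleaning data.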

How can I keep all columns in a dataframe, plus add groupby, and sum?

I have a data frame with 5 fields. I want to copy 2 fields from this into a new data frame. This works fine. df1 = df[['task_id','duration']]
Now in this df1, when I try to group by task_id and sum duration, the task_id field drops off.
Before (what I have now).
After (what I'm trying to achieve).
So, for instance, I'm trying this:
df1['total'] = df1.groupby(['task_id'])['duration'].sum()
The result is:
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead
See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/indexing.html#indexing-view-versus-copy
I don't know why I can't just sum the values in a column and group by unique IDs in another column. Basically, all I want to do is preserve the original two columns (['task_id', 'duration']), sum duration, and calculate a percentage of duration in a new column named pct. This seems like a very simple thing but I can't get anything working. How can I get this straightened out?
The following keeps both columns, but note that it counts the rows for each (task_id, duration) pair rather than summing duration:
df[['task_id', 'duration']].groupby(['task_id', 'duration']).size().reset_index(name='counts')
Setup:
import numpy as np
import pandas as pd
X = np.random.choice([0,1,2], 20)
Y = np.random.uniform(2,10,20)
df = pd.DataFrame({'task_id':X, 'duration':Y})
Calculate pct:
df = pd.merge(df, df.groupby('task_id').sum().reset_index(), on='task_id')
df['pct'] = df['duration_x'].divide(df['duration_y'])*100
df = df.drop('duration_y', axis=1) # Drops the summed duration; remove this line if you want to see it.
Result:
duration_x task_id pct
0 8.751517 0 58.017921
1 6.332645 0 41.982079
2 8.828693 1 9.865355
3 2.611285 1 2.917901
4 5.806709 1 6.488531
5 8.045490 1 8.990189
6 6.285593 1 7.023645
7 7.932952 1 8.864436
8 7.440938 1 8.314650
9 7.272948 1 8.126935
10 9.162262 1 10.238092
11 7.834692 1 8.754639
12 7.989057 1 8.927129
13 3.795571 1 4.241246
14 6.485703 1 7.247252
15 5.858985 2 21.396850
16 9.024650 2 32.957771
17 3.885288 2 14.188966
18 5.794491 2 21.161322
19 2.819049 2 10.295091
Disclaimer: all data is randomly generated in the setup; the calculation itself does not depend on the values and should work for any input.
I finally got everything working in the following way.
# group by and sum durations
df1 = df1.groupby('task_id', as_index=False).agg({'duration': 'sum'})
list(df1)
# express each task_id's duration as a fraction of the whole (multiply by 100 for a percentage)
df1['pct'] = df1['duration']/(df1['duration'].sum())
df1 = pd.DataFrame(df1)
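If the goal is to keep every original row and add the per-task total next to it, one option is groupby with transform, which returns a result aligned to the original rows instead of one row per group. The sketch below uses made-up sample data; the column names total and pct follow the question, everything else is an assumption.

```python
import pandas as pd

# Made-up sample data standing in for the two columns from the question.
df1 = pd.DataFrame({
    "task_id": [1, 1, 2, 2, 2],
    "duration": [2.0, 6.0, 1.0, 1.0, 2.0],
})

# transform("sum") repeats each group's sum on every row of that group,
# so the result has the same length and index as df1.
df1["total"] = df1.groupby("task_id")["duration"].transform("sum")

# Share of each row within its task, as a percentage.
df1["pct"] = df1["duration"] / df1["total"] * 100

print(df1)
```

When df1 is created by selecting columns from another frame, adding .copy() to that selection (for example df[['task_id', 'duration']].copy()) avoids the "copy of a slice" warning quoted in the question.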

Slice column in panda database and averaging results

If I have a pandas database such as:
timestamp label value new
etc. a 1 3.5
b 2 5
a 5 ...
b 6 ...
a 2 ...
b 4 ...
I want the new column to be the average of the last two a's and the last two b's... so for the first it would be the average of 5 and 2 to get 3.5. It will be sorted by the timestamp. I know I could use a groupby to get the average of all the a's or all the b's but I'm not sure how to get an average of just the last two. I'm kinda new to python and coding so this might not be possible idk.
Edit: I should also mention this is not for a class or anything this is just for something I'm doing on my own and that this will be on a very large dataset. I'm just using this as an example. Also I would want each A and each B to have its own value for the last 2 average so the dimension of the new column will be the same as the others. So for the third line it would be the average of 2 and whatever the next a would be in the data set.
IIUC one way (among many) to do that:
In [139]: df.groupby('label').tail(2).groupby('label').mean().reset_index()
Out[139]:
label value
0 a 3.5
1 b 5.0
Edited to reflect a change in the question specifying the last two, not the ones following the first, and that you wanted the same dimensionality with values repeated.
import pandas as pd
data = {'label': ['a','b','a','b','a','b'], 'value':[1,2,5,6,2,4]}
df = pd.DataFrame(data)
grouped = df.groupby('label')
results = {'label':[], 'tail_mean':[]}
for item, grp in grouped:
    # mean of the last two values in this group
    subset_mean = grp['value'].tail(2).mean()
    results['label'].append(item)
    results['tail_mean'].append(subset_mean)
res_df = pd.DataFrame(results)
df = df.merge(res_df, on='label', how='left')
Outputs:
>> res_df
label tail_mean
0 a 3.5
1 b 5.0
>> df
label value tail_mean
0 a 1 3.5
1 b 2 5.0
2 a 5 3.5
3 b 6 5.0
4 a 2 3.5
5 b 4 5.0
Now you have a dataframe of your results only, if you need them, plus a column with it merged back into the main dataframe. Someone else posted a more succinct way to get to the results dataframe; probably no reason to do it the longer way I showed here unless you also need to perform more operations like this that you could do inside the same loop.
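For the repeated-value layout, a shorter route that avoids the explicit loop and merge is groupby with transform. This is only a sketch on the same sample data; it assumes the rows are already sorted by timestamp, and the column name tail_mean is reused from the answer above.

```python
import pandas as pd

data = {"label": ["a", "b", "a", "b", "a", "b"], "value": [1, 2, 5, 6, 2, 4]}
df = pd.DataFrame(data)

# For each label, take the mean of its last two values; transform
# broadcasts that single number back onto every row of the group.
df["tail_mean"] = (
    df.groupby("label")["value"].transform(lambda s: s.tail(2).mean())
)

print(df)
```

Because the lambda is called once per group in Python, this may be slower than built-in aggregations on very large data, so it is worth timing on a realistic sample.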

Setting the content of a pandas DataFrame cell based on the values of other columns cells

I have a pandas DataFrame df with the following content:
Serial N voltage current average
B 10 2
B 10 2
C 12 0.7
D 40 0.5
. . .
AB 10 3
AB 10 3
I would like the column "average" to hold the average of the column current over all rows that have the same voltage. Rows with a unique voltage should simply keep their own current value. For example, I would like my DataFrame to look something like this.
Serial N voltage current average
B 10 2 2.5
B 10 2 2.5
C 12 0.7 0.7
D 40 0.5 0.5
. . .
AB 10 3 2.5
AB 10 3 2.5
Serial N B and AB have the same voltage, so their average is the mean current over all rows with that voltage. How can I tackle this problem without using a loop if possible?
You can use pandas groupby function to get the averages. You then need to merge it with the rest of the data frame. Have a look at the result of each line to see what it does.
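# average only the current column; non-numeric columns such as Serial N cannot be averaged
averages = df.groupby('voltage')[['current']].mean()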
# rename the column so it's obvious what it is
averages.columns = ['average current']
averages = averages.reset_index()
df = df.merge(averages, how='left', on='voltage')
Have a look at the documentation on grouping; it should give you some hints for problems like this.
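The same result can also be written without a merge by using transform, which returns one value per original row. The sketch below rebuilds only the rows that are visible in the question (the elided rows are left out), so the data is an assumption.

```python
import pandas as pd

# Only the rows shown in the question; the elided ones are omitted.
df = pd.DataFrame({
    "Serial N": ["B", "B", "C", "D", "AB", "AB"],
    "voltage": [10, 10, 12, 40, 10, 10],
    "current": [2, 2, 0.7, 0.5, 3, 3],
})

# Mean current per voltage, repeated on every row with that voltage.
# A voltage that occurs once just gets its own current back.
df["average"] = df.groupby("voltage")["current"].transform("mean")

print(df)
```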

Re-shaping pandas data frame using shape or pivot_table (stack each row)

I have an almost embarrassingly simple question, which I cannot figure out for myself.
Here's a toy example to demonstrate what I want to do, suppose I have this simple data frame:
df = pd.DataFrame([[1,2,3,4,5,6],[7,8,9,10,11,12]],index=range(2),columns=list('abcdef'))
a b c d e f
0 1 2 3 4 5 6
1 7 8 9 10 11 12
What I want is to stack it so that it takes the following form, where the columns identifiers have been changed (to X and Y) so that they are the same for all re-stacked values:
X Y
0 1 2
3 4
5 6
1 7 8
9 10
11 12
I am pretty sure you can do it with stack() or pd.pivot_table(), but even after reading the documentation I cannot figure out how. Instead of appending each whole column to the end of the previous one, I just want to stack pairs (or actually triplets) of values from each row.
Just to add some more flesh to the bones of what I want to do;
df = pd.DataFrame(np.random.randn(3,6),index=range(3),columns=list('abcdef'))
a b c d e f
0 -0.168636 -1.878447 -0.985152 -0.101049 1.244617 1.256772
1 0.395110 -0.237559 0.034890 -1.244669 -0.721756 0.473696
2 -0.973043 1.784627 0.601250 -1.718324 0.145479 -0.099530
I want this to re-stacked into this form (where column labels have been changed again, to the same for all values):
X Y Z
0 -0.168636 -1.878447 -0.985152
-0.101049 1.244617 1.256772
1 0.395110 -0.237559 0.034890
-1.244669 -0.721756 0.473696
2 -0.973043 1.784627 0.601250
-1.718324 0.145479 -0.099530
Yes, one could just make a for-loop with the following logic operating on each row:
row.values.reshape(df.shape[1] // 3, 3)
But then you would have to compute each row individually and my actual data has tens of thousands of rows.
So I want to stack each individual row selectively (e.g. by pairs of values or triplets), and then stack that row-stack, for the entire data frame, basically. Preferably done on the entire data frame at once (if possible).
Apologies for such a trivial question.
Use numpy.reshape to reshape the underlying data in the DataFrame:
import pandas as pd
import numpy as np
df = pd.DataFrame(np.random.randn(3,6),index=range(3),columns=list('abcdef'))
print(df)
# a b c d e f
# 0 -0.889810 1.348811 -1.071198 0.091841 -0.781704 -1.672864
# 1 0.398858 0.004976 1.280942 1.185749 1.260551 0.858973
# 2 1.279742 0.946470 -1.122450 -0.355737 1.457966 0.034319
result = pd.DataFrame(df.values.reshape(-1,3),
                      index=df.index.repeat(2), columns=list('XYZ'))
print(result)
yields
X Y Z
0 -0.889810 1.348811 -1.071198
0 0.091841 -0.781704 -1.672864
1 0.398858 0.004976 1.280942
1 1.185749 1.260551 0.858973
2 1.279742 0.946470 -1.122450
2 -0.355737 1.457966 0.034319
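The repeat count of 2 above is tied to having six columns split into triplets. A slightly more general sketch is shown below; it uses fixed integers instead of random numbers so the result is easy to check, and the names width and chunks are hypothetical helpers, not part of pandas.

```python
import numpy as np
import pandas as pd

# Deterministic stand-in for the random data: two rows, six columns.
df = pd.DataFrame(np.arange(1, 13).reshape(2, 6), columns=list("abcdef"))

width = 3                       # values per output row
chunks = df.shape[1] // width   # output rows produced per input row

# reshape(-1, width) walks the values row by row, so each input row
# is cut into consecutive groups of `width` values.
result = pd.DataFrame(
    df.to_numpy().reshape(-1, width),
    index=df.index.repeat(chunks),
    columns=list("XYZ"),
)

print(result)
```

This assumes the number of columns is an exact multiple of width; otherwise the reshape either raises an error or mixes values from neighbouring rows.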
