Is there a more elegant way to achieve this? My current solution, based on various Stack Overflow answers, is as follows:
import pandas as pds

df = pds.DataFrame([[11, 12, 13, 14], [15, 16, 17, 18]], columns=[0, 1, 2, 3])
print(df)
dT = df.T
dT.reindex(dT.index[::-1]).cumsum().reindex(dT.index).T
Output
df is:
0 1 2 3
0 11 12 13 14
1 15 16 17 18
after the by-row reverse cumsum:
0 1 2 3
0 50 39 27 14
1 66 51 35 18
I have to perform this often on my data (which is also much bigger), and I am trying to find a shorter/better way to achieve this.
Thanks
Here is a slightly more readable alternative:
df[df.columns[::-1]].cumsum(axis=1)[df.columns]
There is no need to transpose your DataFrame; just use the axis=1 argument to cumsum.
Obviously the easiest thing would be to just store your DataFrame columns in the opposite order, but I assume there is some reason why you're not doing that.
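For reference, here is the one-liner applied to the example frame (a quick sketch; the output matches the expected result above):

import pandas as pd

df = pd.DataFrame([[11, 12, 13, 14], [15, 16, 17, 18]], columns=[0, 1, 2, 3])

# reverse the column order, take the row-wise cumsum, then restore the order
result = df[df.columns[::-1]].cumsum(axis=1)[df.columns]
print(result)
#     0   1   2   3
# 0  50  39  27  14
# 1  66  51  35  18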
I want to fill missing "Age" values of a DataFrame with the mean of a two-column subgroup.
df.groupby(["col_x","col_y"])["Age"].mean()
The code above returns the means of these sub-groups:
col_x col_y
X 1 35
2 29
3 22
Y 1 41
2 31
3 27
I have a feeling this can be achieved by using the .map function:
df.loc[df['Age'].isnull(), 'Age'] = df[['col_x', 'col_y']].map(something)
Can anybody help me with this?
It's better to use groupby().transform, which returns a Series with the same index as df, so you can fillna with it:
df['Age'] = df['Age'].fillna(df.groupby(['col_x','col_y'])['Age'].transform('mean'))
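A minimal sketch of why this works, on hypothetical data (the column names follow the question):

import numpy as np
import pandas as pd

df = pd.DataFrame({'col_x': ['X', 'X', 'Y', 'Y'],
                   'col_y': [1, 1, 2, 2],
                   'Age': [30, np.nan, 27, np.nan]})

# transform('mean') returns a Series aligned with df's index, so each
# NaN is filled with the mean of its own (col_x, col_y) group
df['Age'] = df['Age'].fillna(df.groupby(['col_x', 'col_y'])['Age'].transform('mean'))
print(df)
#   col_x  col_y   Age
# 0     X      1  30.0
# 1     X      1  30.0
# 2     Y      2  27.0
# 3     Y      2  27.0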
I'm looking for an efficient way to reshape an N*M dataframe into a 1*(N*M) dataframe:
INPUT>
df1
ID distUnit col_a col_b
1000 150 35 55
1000 250 10 20
1200 150 12 13
1200 250 16 20
DESIRED OUTPUT>
ID col_a_150 col_b_150 col_a_250 col_b_250
1000 35 55 10 20
1200 12 13 16 20
My idea >
Go through every row in df1
Add a prefix to col_a and col_b based on the value of row['distUnit']
Use combine_first to add the processed row back to the result dataframe
Challenging part >
Since the size of my input data is 14440 * 20, my idea is not efficient enough.
Is there a better way to implement this?
Thanks for reading.
If the pair (ID, distUnit) is unique across your dataset, you can simply "unmelt" your dataframe like this:
df = df.groupby(['ID', 'distUnit'])[['col_a', 'col_b']].mean().unstack()
df.columns = [f'{col[0]}_{col[1]}' for col in df.columns.values]
Check this question for similar approaches.
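Applied to the sample data, a sketch of the whole pipeline:

import pandas as pd

df = pd.DataFrame({'ID': [1000, 1000, 1200, 1200],
                   'distUnit': [150, 250, 150, 250],
                   'col_a': [35, 10, 12, 16],
                   'col_b': [55, 20, 13, 20]})

out = df.groupby(['ID', 'distUnit'])[['col_a', 'col_b']].mean().unstack()
# flatten the (column, distUnit) MultiIndex into col_a_150, col_a_250, ...
out.columns = [f'{col[0]}_{col[1]}' for col in out.columns.values]
print(out.reset_index())
#      ID  col_a_150  col_a_250  col_b_150  col_b_250
# 0  1000       35.0       10.0       55.0       20.0
# 1  1200       12.0       16.0       13.0       20.0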
I have a pandas df that looks like this:
import pandas as pd
df = pd.DataFrame({0: [1], 5: [1], 10: [1], 15: [1], 20: [0], 25: [0],
                   30: [1], 35: [1], 40: [0], 45: [0], 50: [0]})
df
The column names reflect coordinates. I would like to retrieve the start and end coordinates of runs of consecutive equal values.
The output should be something like this:
# start,end
0,15
20,25
30,35
40,50
IIUC, use groupby with diff and cumsum to split the groups:
s = df.T.reset_index()
s = s.groupby(s[0].diff().ne(0).cumsum())['index'].agg(['first', 'last'])
Out[241]:
first last
0
1 0 15
2 20 25
3 30 35
4 40 50
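To match the start/end labels from the desired output, you can rename the aggregated columns, e.g.:

s.rename(columns={'first': 'start', 'last': 'end'}).reset_index(drop=True)
#    start  end
# 0      0   15
# 1     20   25
# 2     30   35
# 3     40   50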
Use diff and cumsum to identify the groups, then groupby:
s = df.iloc[0].diff().ne(0).cumsum()
(df.columns.to_series()
.groupby(s).agg(['min','max'])
)
Output:
min max
0
1 0 15
2 20 25
3 30 35
4 40 50
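For intuition, the grouping key s starts a new group wherever the row value changes:

s = df.iloc[0].diff().ne(0).cumsum()
print(s.tolist())
# [1, 1, 1, 1, 2, 2, 3, 3, 4, 4, 4]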
df =
0 20
1 19
2 18
3 17
4 16
I am iterating with a loop:
from sklearn.cluster import AffinityPropagation
from sklearn.metrics import silhouette_score

for k in df:
    af = AffinityPropagation(preference=k).fit(X)
    labels = af.labels_
    score = silhouette_score(frechet, labels)
    print("Preference: {0}, Silhouette score: {1}".format(k, score))
I only get one number, but I want to get a result for every row of df, i.e. len(df) numbers.
You need to use iterrows, as @CodeDifferently points out in the comment above.
Here is an example:
Where df is:
df = pd.DataFrame({0:range(20,0,-1)})
Then using your method:
for k in df:
    print(k)
Output:
0
This zero is the column header of the dataframe: iterating over a dataframe yields its column names, not its rows.
Using iterrows:
for _, k in df.iterrows():
    print(k.iloc[0])
Output:
20
19
18
17
16
15
14
13
12
11
10
9
8
7
6
5
4
3
2
1
Here you are getting each row of the dataframe as a Series, and with iloc you get the first (and, in this case, only) value in each row.
You almost never need to iterate over a DataFrame. Columns are basically NumPy arrays and have array-like 'elementwise' superpowers. (You ~never need to iterate over NumPy arrays either.)
Maybe formulate your task as a function and use the apply() method on the DataFrame or Series. This 'applies' a function to every item in a column without the need for a loop.
But if you really only have one column like this, why use a DataFrame at all? Just use a NumPy array (or get at it with the column's values attribute).
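As a minimal sketch of that approach, assuming X and frechet are defined as in the question (score_for_preference is a hypothetical helper):

from sklearn.cluster import AffinityPropagation
from sklearn.metrics import silhouette_score

def score_for_preference(k):
    # fit one model per preference value and score it
    labels = AffinityPropagation(preference=k).fit(X).labels_
    return silhouette_score(frechet, labels)

# one score per row; df's single column is named 0 in the question
scores = df[0].apply(score_for_preference)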
I have two pandas dataframes.
df1
unique numerator
23 4
29 10
df2
unique denominator
23 2
29 5
Now I want a result like this:
unique result
23 2
29 2
Without using loops... or whichever is the most efficient way. It's a division: numerator/denominator.
If you set the index to 'unique' for both dfs, then you can just divide the two columns:
In [6]:
df1.set_index('unique')['numerator'] / df2.set_index('unique')['denominator']
Out[6]:
unique
23 2
29 2
dtype: float64
Or merge on 'unique' and then do the calculation as normal:
In [9]:
merged = df1.merge(df2, on='unique')
merged
Out[9]:
unique numerator denominator
0 23 4 2
1 29 10 5
In [10]:
merged['result'] = merged['numerator']/merged['denominator']
merged
Out[10]:
unique numerator denominator result
0 23 4 2 2
1 29 10 5 2
EdChum has provided two good options. An alternative is to use the div() method (divide() is an alias).
import pandas as pd

df1 = pd.DataFrame({'unique': [23, 29], 'numerator': [4, 10]})
df2 = pd.DataFrame({'unique': [23, 29], 'denominator': [2, 5]})
df1.set_index('unique', inplace=True)
df2.set_index('unique', inplace=True)
print(df1.div(df2['denominator'], axis=0))
An important thing to note is that you need to divide by a Series, i.e. df2['denominator'];
df1.div(df2, axis=0) will produce:
denominator numerator
unique
23 NaN NaN
29 NaN NaN
This is because the column label 'denominator' in df2 does not match 'numerator' in df1, so nothing aligns. A Series, however, has no column label, and its values are broadcast across the columns of df1 (with axis=0, alignment is on the index).
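If you do want to divide a DataFrame by a DataFrame, one option (a sketch) is to align the labels first, e.g. by renaming:

print(df1.div(df2.rename(columns={'denominator': 'numerator'})))
#         numerator
# unique
# 23            2.0
# 29            2.0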