I want to fill missing "Age" values of a DataFrame with the mean of a two-column subgroup.
df.groupby(["col_x","col_y"])["Age"].mean()
The code above returns the means of these sub-groups:
col_x  col_y
X      1        35
       2        29
       3        22
Y      1        41
       2        31
       3        27
I have a feeling this can be achieved by using the .map function:
df.loc[df['Age'].isnull(), 'Age'] = df[['col_x', 'col_y']].map(something)
Can anybody help me with this?
It's better to use groupby().transform, which returns a Series with the same index as df, so you can fillna with it:
df['Age'] = df['Age'].fillna(df.groupby(['col_x','col_y'])['Age'].transform('mean'))
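A minimal runnable sketch with made-up data (the column names come from the question; the values are illustrative):

import numpy as np
import pandas as pd

df = pd.DataFrame({
    "col_x": ["X", "X", "Y", "Y"],
    "col_y": [1, 1, 2, 2],
    "Age": [30, np.nan, 25, np.nan],
})

# transform('mean') returns a Series aligned with df's index,
# so fillna can match each missing value to its subgroup mean
df["Age"] = df["Age"].fillna(df.groupby(["col_x", "col_y"])["Age"].transform("mean"))
print(df)  # the NaN in group (X, 1) becomes 30.0; the one in (Y, 2) becomes 25.0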
Can I use set() to read the data in a specific column of a pandas DataFrame? For example, I have the following DataFrame df1:
df1 =
    0  1   2
0 -10  2   5
1  24  5  10
2  30  3   6
3  30  2   1
4  30  4   5
where the first column is the index. I first tried to isolate the second column:
[-10
24
30
30
30]
using the following:

x = pd.DataFrame(df1, columns=[0])

Then I transposed the column:

XX = x.T

Then I used the set() function.
However, instead of obtaining [-10 24 30], I got [0 1 2 3 4].
So set() read the index instead of the first column.
set() takes an iterable.
Using a pandas DataFrame as an iterable yields its column names in turn.
Since you've transposed the DataFrame, your index values are now column names, so when you use the transposed DataFrame as an iterable you get those index values.
If you want to get the values in the column using set(), you can use:
x = pd.DataFrame(df1, columns=[0])
set(x.iloc[:,0].values)
But if you just want the unique values in column 0, you can use:
df1[0].unique()
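A minimal runnable sketch of both approaches, using a reconstruction of the question's data:

import pandas as pd

df1 = pd.DataFrame([[-10, 2, 5], [24, 5, 10], [30, 3, 6], [30, 2, 1], [30, 4, 5]])

print(set(df1.iloc[:, 0].values))  # {24, -10, 30} -- a set is unordered
print(df1[0].unique())             # [-10  24  30] -- in order of first appearance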
I am new to Python. I want to create a column in pandas that contains the running sum of another column's values. For example, let's assume we have the following table:
id  A
1   21
2   32
3   55

And I want to have this table:

id  A   New column
1   21  21
2   32  53
3   55  108
Any help will be appreciated.
This is called a cumulative sum. You can just set df["C"] = df["A"].cumsum().
https://pandas.pydata.org/docs/reference/api/pandas.Series.cumsum.html
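A minimal runnable sketch on the question's data (the name "New column" is taken from the question):

import pandas as pd

df = pd.DataFrame({"id": [1, 2, 3], "A": [21, 32, 55]})
df["New column"] = df["A"].cumsum()
print(df)
#    id   A  New column
# 0   1  21          21
# 1   2  32          53
# 2   3  55         108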
df =
0 20
1 19
2 18
3 17
4 16
I am iterating with a loop:
for k in df:
    af = AffinityPropagation(preference=k).fit(X)
    labels = af.labels_
    score = silhouette_score(frechet, labels)
    print("Preference: {0}, Silhouette score: {1}".format(k, score))
I get only one number, but I want a result for each row of df, i.e. len(df) numbers.
You need to use iterrows, as @CodeDifferently points out in the comments.
Here is an example:
Where df is:
df = pd.DataFrame({0:range(20,0,-1)})
Then using your method:
for k in df:
    print(k)
Output:
0
This zero is the DataFrame's column header; you are iterating through the DataFrame's column names.
Using iterrows:
for _, k in df.iterrows():
    print(k.iloc[0])
Output:
20
19
18
17
16
15
14
13
12
11
10
9
8
7
6
5
4
3
2
1
Here you get each row of the DataFrame as a Series, and with iloc you get the first (and, in this case, only) value in the row.
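Putting it together for the clustering loop, here is a hedged sketch (it assumes X and frechet are defined as in the question, and that every preference value yields at least two clusters, which silhouette_score requires):

import pandas as pd
from sklearn.cluster import AffinityPropagation
from sklearn.metrics import silhouette_score

scores = []
for _, row in df.iterrows():
    k = row.iloc[0]  # the preference value stored in this row
    af = AffinityPropagation(preference=k).fit(X)
    scores.append({"preference": k, "silhouette": silhouette_score(frechet, af.labels_)})

result = pd.DataFrame(scores)  # one row per preference, so len(result) == len(df)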
You almost never need to iterate over a DataFrame. Columns are basically NumPy arrays and have array-like 'elementwise' superpowers. (You ~never need to iterate over NumPy arrays either.)
Maybe formulate your task as a function and use the apply() method on the DataFrame or Series. This 'applies' a function to every item in a column without the need for a loop.
But if you really only have one column like this, why use a DataFrame at all? Just use a NumPy array (or get at it with the column's values attribute).
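For instance, a minimal sketch of the apply() route, under the same assumptions about X and frechet as above:

def silhouette_for_preference(k):
    # fit one AffinityPropagation model per candidate preference and score it
    labels = AffinityPropagation(preference=k).fit(X).labels_
    return silhouette_score(frechet, labels)

df["score"] = df[0].apply(silhouette_for_preference)

This evaluates one model per preference without an explicit loop.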
I have two pandas dataframes.
df1
unique numerator
23 4
29 10
df2
unique denominator
23 2
29 5
Now I want something like this:
unique result
23 2
29 2
Without using loops, or whichever way is most efficient. It's a division: numerator/denominator.
If you set the index to 'unique' for both DataFrames, then you can just divide the two columns:
In [6]:
df1.set_index('unique')['numerator']/df2.set_index('unique')['denominator']
Out[6]:
unique
23    2.0
29    2.0
dtype: float64
or merge on 'unique' and then do the calc as normal:
In [9]:
merged = df1.merge(df2, on='unique')
merged
Out[9]:
   unique  numerator  denominator
0      23          4            2
1      29         10            5
In [10]:
merged['result'] = merged['numerator']/merged['denominator']
merged
Out[10]:
   unique  numerator  denominator  result
0      23          4            2     2.0
1      29         10            5     2.0
EdChum has provided two good options.
An alternative is to use the div() (or divide()) function.
df1 = pd.DataFrame({'unique': [23, 29], 'numerator': [4, 10]})
df2 = pd.DataFrame({'unique': [23, 29], 'denominator': [2, 5]})
df1.set_index('unique', inplace=True)
df2.set_index('unique', inplace=True)
print(df1.div(df2['denominator'], axis=0))
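On this data, that should print something like:

        numerator
unique
23            2.0
29            2.0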
An important thing to note is that you need to divide by a Series, i.e. df2['denominator'].
By contrast, df1.div(df2, axis=0) will produce:
        denominator  numerator
unique
23              NaN        NaN
29              NaN        NaN
This is because the column label 'denominator' in df2 does not match 'numerator' in df1. A Series, however, has no column label, so its values are broadcast across the columns of df1.
Is there a more elegant way to achieve this? My current solution, based on various Stack Overflow answers, is as follows:
import pandas as pd

df = pd.DataFrame([[11, 12, 13, 14], [15, 16, 17, 18]], columns=[0, 1, 2, 3])
print(df)
dT = df.T
print(dT.reindex(dT.index[::-1]).cumsum().reindex(dT.index).T)
Output:

df is:

    0   1   2   3
0  11  12  13  14
1  15  16  17  18

after the by-row reverse cumsum:

    0   1   2   3
0  50  39  27  14
1  66  51  35  18
I have to perform this often on my data (which is also much bigger), and I'm trying to find a shorter/better way to achieve this.
Thanks
Here is a slightly more readable alternative:
df[df.columns[::-1]].cumsum(axis=1)[df.columns]
There is no need to transpose your DataFrame; just use the axis=1 argument to cumsum.
Obviously the easiest thing would be to just store your DataFrame columns in the opposite order, but I assume there is some reason why you're not doing that.
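As a quick check on the question's data:

import pandas as pd

df = pd.DataFrame([[11, 12, 13, 14], [15, 16, 17, 18]], columns=[0, 1, 2, 3])
print(df[df.columns[::-1]].cumsum(axis=1)[df.columns])
#     0   1   2   3
# 0  50  39  27  14
# 1  66  51  35  18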