Pandas: Apply a function to exploded values - python

Assume I have the following dataframe:
df = pd.DataFrame([[1, 2, 3], [8, [10, 11, 12, 13], [6, 7, 8, 9]]], columns=list("abc"))
a  b              c
1  [2]            [3]
8  [10,11,12,13]  [6,7,8,9]
Columns b and c can consist of 1 to n elements, but in a row they will always have the same number of elements.
I am looking for a way to explode the columns b and c while applying a function to the whole row, here for example to divide column a by the number of elements in b and c. (I know this is a contrived example, since it could easily be solved by dividing first and exploding afterwards; the real use case is a bit more complicated, but that is of no importance here.)
So the result would look something like this:
a  b   c
1  2   3
2  10  6
2  11  7
2  12  8
2  13  9
I tried using the apply method as in the following snippet, but it only produces garbage and does not work when the number of elements in the lists does not match the number of columns:
def fun(row):
    if isinstance(row.c, list):
        result = [[row.a, row.b, c] for c in row.c]
        return result
    return row

df.apply(fun, axis=1)
Pandas' explode function also doesn't really fit here for me, because afterwards there is no chance of telling anymore whether the rows were exploded or not.
Is there an easier way than iterating through the dataframe, exploding the values and manually building up a new dataframe in the shape I need?
Thank you already for your help.
Edit:
The real use case is basically a mapping from b+c to a.
So I have another file that looks something like this:
b   c  a
2   3  1
10  6  1
11  7  1
12  8  2
13  9  4
So coming from this example, the result would actually be as follows:
a  b   c
1  2   3
1  10  6
1  11  7
2  12  8
4  13  9
The problem is that there is no 1:1 relation between those two files, as it might seem here.
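One way to handle this mapping (just a sketch; the mapping DataFrame below is made up to mirror the table above, and in practice it would be read from the second file) is to explode b and c first and then merge on both columns to look up a:
import pandas as pd

df = pd.DataFrame([[1, 2, 3], [8, [10, 11, 12, 13], [6, 7, 8, 9]]],
                  columns=list("abc"))

# hypothetical lookup table built from the second file
mapping = pd.DataFrame({"b": [2, 10, 11, 12, 13],
                        "c": [3, 6, 7, 8, 9],
                        "a": [1, 1, 1, 2, 4]})

# explode both list columns (a list of columns needs pandas >= 1.3)
# and align dtypes with the lookup table before merging
exploded = df.explode(["b", "c"]).astype({"b": "int64", "c": "int64"})

# replace the original a by the mapped a
result = (exploded.drop(columns="a")
                  .merge(mapping, on=["b", "c"], how="left")
                  [["a", "b", "c"]])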

You say:
Pandas' explode function also doesn't really fit here for me, because afterwards there is no chance of telling anymore whether the rows were exploded or not.
This isn't true. Consider this:
import pandas as pd
df = pd.DataFrame([[1, 2, 3], [8, [10, 11, 12, 13], [6, 7, 8, 9]]], columns=list("abc"))
# explode
df2 = df.explode(['b','c'])
As you can see from the print below, the exploded rows keep (repeat) their original index:
a b c
0 1 2 3
1 8 10 6
1 8 11 7
1 8 12 8
1 8 13 9
So, we can use the index to track how many elements per row got exploded. Try this:
# reset index, add old index as column and create a Series with frequence
# for old index; now populate 'a' with 'a' divided by this Series
df2.reset_index(drop=False, inplace=True)
df2['a'] = df2['a']/df2.loc[:,'index'].map(df2.loc[:,'index'].value_counts())
# drop the old index, now as column
df2.drop('index', axis=1, inplace=True)
Result:
a b c
0 1.0 2 3
1 2.0 10 6
2 2.0 11 7
3 2.0 12 8
4 2.0 13 9
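A small variation on the same idea (just a sketch): since each original row keeps its index label after explode, you can divide by the group size via groupby(level=0) without materialising the old index as a column:
df2 = df.explode(['b', 'c'])
# group by the (repeated) original index to get the number of
# exploded elements per original row
df2['a'] = df2['a'] / df2.groupby(level=0)['a'].transform('size')
df2 = df2.reset_index(drop=True)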

Related

Appending rows to existing pandas dataframe

I have a pandas dataframe df1
a b
0 1 2
1 3 4
I have another dataframe in the form of a dictionary
dictionary = {'2' : [5, 6], '3' : [7, 8]}
I want to append the dictionary values as rows in dataframe df1. I am using pandas.DataFrame.from_dict() to convert the dictionary into a dataframe. The constraint is that, when I do it, I cannot provide any value for the 'columns' argument of from_dict().
So, when I try to concatenate the two dataframes, pandas adds the contents of the new dataframe as new columns. I do not want that. The final output I want is in the format
a b
0 1 2
1 3 4
2 5 6
3 7 8
Can someone tell me how to do this in the least painful way?
Use concat with the help of pd.DataFrame.from_dict, setting the columns of df1 during the conversion:
out = pd.concat([df1,
                 pd.DataFrame.from_dict(dictionary, orient='index',
                                        columns=df1.columns)
                ])
Output:
a b
0 1 2
1 3 4
2 5 6
3 7 8
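Note that the dictionary keys ('2', '3') become the index labels of the appended rows. If you prefer a clean 0..n index like in the desired output, you should be able to pass ignore_index=True to concat:
out = pd.concat([df1,
                 pd.DataFrame.from_dict(dictionary, orient='index',
                                        columns=df1.columns)],
                ignore_index=True)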
Another possible solution, which uses numpy.vstack:
import numpy as np

pd.DataFrame(np.vstack([df1.values, np.array(list(dictionary.values()))]),
             columns=df1.columns)
Output:
a b
0 1 2
1 3 4
2 5 6
3 7 8

groupby a column to get a minimum value and corresponding value in another column [duplicate]

I have a DataFrame with columns A, B, and C. For each value of A, I would like to select the row with the minimum value in column B.
That is, from this:
df = pd.DataFrame({'A': [1, 1, 1, 2, 2, 2],
                   'B': [4, 5, 2, 7, 4, 6],
                   'C': [3, 4, 10, 2, 4, 6]})
A B C
0 1 4 3
1 1 5 4
2 1 2 10
3 2 7 2
4 2 4 4
5 2 6 6
I would like to get:
A B C
0 1 2 10
1 2 4 4
For the moment I am grouping by column A, then creating a value that indicates to me the rows I will keep:
a = data.groupby('A').min()
a['A'] = a.index
to_keep = [str(x[0]) + str(x[1]) for x in a[['A', 'B']].values]
data['id'] = data['A'].astype(str) + data['B'].astype('str')
data[data['id'].isin(to_keep)]
I am sure that there is a much more straightforward way to do this.
I have seen many answers here that use MultiIndex, which I would prefer to avoid.
Thank you for your help.
I feel like you're overthinking this. Just use groupby and idxmin:
df.loc[df.groupby('A').B.idxmin()]
A B C
2 1 2 10
4 2 4 4
df.loc[df.groupby('A').B.idxmin()].reset_index(drop=True)
A B C
0 1 2 10
1 2 4 4
Had a similar situation but with a more complex column heading (e.g. "B val") in which case this is needed:
df.loc[df.groupby('A')['B val'].idxmin()]
The accepted answer (suggesting idxmin) cannot be used with the pipe pattern. A pipe-friendly alternative is to first sort values and then use groupby with DataFrame.head:
data.sort_values('B').groupby('A').apply(DataFrame.head, n=1)
This is possible because by default groupby preserves the order of rows within each group, which is stable and documented behaviour (see pandas.DataFrame.groupby).
This approach has additional benefits:
it can be easily expanded to select n rows with smallest values in specific column
it can break ties by providing another column (as a list) to .sort_values(), e.g.:
data.sort_values(['final_score', 'midterm_score']).groupby('year').apply(DataFrame.head, n=1)
As with other answers, to exactly match the result desired in the question .reset_index(drop=True) is needed, making the final snippet:
df.sort_values('B').groupby('A').apply(DataFrame.head, n=1).reset_index(drop=True)
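A shorter pipe-friendly variant (a sketch relying on the same ordering guarantee, not part of the original answer) is GroupBy.head, which avoids apply altogether:
df.sort_values('B').groupby('A').head(1).reset_index(drop=True)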
I found an answer that is a little bit more wordy, but a lot more efficient:
This is the example dataset:
data = pd.DataFrame({'A': [1,1,1,2,2,2], 'B':[4,5,2,7,4,6], 'C':[3,4,10,2,4,6]})
data
Out:
A B C
0 1 4 3
1 1 5 4
2 1 2 10
3 2 7 2
4 2 4 4
5 2 6 6
First we will get the min values on a Series from a groupby operation:
min_value = data.groupby('A').B.min()
min_value
Out:
A
1 2
2 4
Name: B, dtype: int64
Then, we merge this series result on the original data frame
data = data.merge(min_value, on='A',suffixes=('', '_min'))
data
Out:
A B C B_min
0 1 4 3 2
1 1 5 4 2
2 1 2 10 2
3 2 7 2 4
4 2 4 4 4
5 2 6 6 4
Finally, we get only the lines where B is equal to B_min and drop B_min since we don't need it anymore.
data = data[data.B==data.B_min].drop('B_min', axis=1)
data
Out:
A B C
2 1 2 10
4 2 4 4
I have tested it on very large datasets and this was the only way I could make it work in a reasonable time.
You can sort_values and drop_duplicates:
df.sort_values('B').drop_duplicates('A')
Output:
A B C
2 1 2 10
4 2 4 4
The solution, as written before, is:
df.loc[df.groupby('A')['B'].idxmin()]
But if you then get an error like:
"Passing list-likes to .loc or [] with any missing labels is no longer supported.
The following labels were missing: Float64Index([nan], dtype='float64').
See https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#deprecate-loc-reindex-listlike"
it is probably because there are NaN values in column B. In my case there were, so I used dropna() and then it worked:
df.loc[df.groupby('A')['B'].idxmin().dropna()]
You can also use boolean indexing to select the rows where column B equals its group's minimal value:
out = df[df['B'] == df.groupby('A')['B'].transform('min')]
print(out)
A B C
2 1 2 10
4 2 4 4

More efficient way to filter over a subset of a dataframe

The problem is this, I have a dataframe like so:
A B C D
2 3 X 5
7 2 5 7
1 2 7 9
3 4 X 9
1 2 3 5
6 3 X 8
I wish to iterate over the rows of the dataframe and, every time column C=X, reset a counter and start adding the values in column B until column C=X again. Then rinse and repeat down the rows until complete.
Currently I am iterating over the rows using .iterrows(), comparing column C and then procedurally adding to a variable.
I'm hoping there is a more efficient 'pandas' like approach to doing something like this.
Use cumsum() as follows.
import pandas as pd
df = pd.DataFrame({"B":[3, 2, 2, 4, 2, 3],
"C":["X", 5, 7, "X", 3, "X"]})
df['C'].loc[df['C'] == "X"] = df['B'].loc[df['C'] == "X"].cumsum()
The output is
B C
0 3 3
1 2 5
2 2 7
3 4 7
4 2 3
5 3 10
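If what you actually need is a running total of B that resets at every "X" row, a common pattern (a sketch under that reading of the question; the result column name is made up) is to build a group key from the cumulative sum of the boolean mask and take a grouped cumsum:
import pandas as pd

df = pd.DataFrame({"B": [3, 2, 2, 4, 2, 3],
                   "C": ["X", 5, 7, "X", 3, "X"]})

# every "X" starts a new group; cumsum over the boolean mask labels the groups
group = (df["C"] == "X").cumsum()
# hypothetical result column with the per-group running total of B
df["running_B"] = df.groupby(group)["B"].cumsum()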

Assign new values to DF column based on a condition

I need to assign a value to all registers assigned as 0 (zero) in a column ('A'). This new value should be the mean of every register that shares the same value in another column ('B'), i.e. all rows that have 'A' assigned as 0 will have that value replaced by the mean of 'A' among the rows that share the same value in 'B'. Apparently, the following code is not working, because when I call print(df.A) afterwards, I still get some rows with 'A' as 0:
df = df[df.A == 0].groupby('B')['A'].mean().reset_index()
I tried a bunch of line codes, but some aren't even accepted...
What I expect is a situation that all 0 values for A are replaced for a mean of A column grouped by B column. Like this:
Before:
Output:
A B
1 0 7
2 0 7
3 9 7
4 10 6
5 8 6
6 0 6
7 0 2
After:
Output:
A B
1 3 7
2 3 7
3 9 7
4 10 6
5 8 6
6 3 6
7 0 2
Thank you for your support.
I think I understand your question now, but then I don't see how you got the '3' for row 6 col A. I am following the logic for how I was able to match the 3's in rows 1 and 2 in col A, which I will try to explain in code below. If this isn't quite the correct interpretation, hopefully it still gets you pointed in the right direction.
Your initial df
df = pd.DataFrame({
    'A': [0, 0, 9, 10, 8, 0, 0],
    'B': [7, 7, 7, 6, 6, 6, 2]
})
A B
1 0 7
2 0 7
3 9 7
4 10 6
5 8 6
6 0 6
7 0 2
Restating the objective:
For each unique value in col B where col A is 0, find the rows in col A where B has that value and take the mean of those col A values. Then write that mean over the rows in A that are 0 and line up with the value selected in B. So, for example, in the first 3 rows there are 7's in col B, and 0, 0, 9 in col A. The mean of the first 3 A values is 3, so that is the value that gets written over the 0s in col A, rows 1 and 2.
Steps
Get the unique values from col B where col A is also 0
bvals_when_a_zero = df[df['A'] == 0]['B'].unique()
array([7, 6, 2])
For each of those unique values, calculate the mean of the corresponding values in col A
means = [df[df['B'] == i]['A'].mean() for i in bvals_when_a_zero]
[3.0, 6.0, 0.0]
Loop over the bvals,means pairs and overwrite the 0's with the corresponding mean for the bval. The logic of the pandas where method keeps the values stated to the left (in this case, df['A'] values) that meet the condition in the first argument in the brackets, otherwise choose the second argument as the value to keep. Our condition (df['A'] == 0) & (df['B'] == bval) says get the rows where col A is 0 and col B is one of the unique bvals. But here we actually want to keep the df['A'] values that are NOT equal to the condition, so the conditional argument in the brackets is negated with the ~ symbol in front.
for bval, mean in zip(bvals_when_a_zero, means):
    df['A'] = df['A'].where(~((df['A'] == 0) & (df['B'] == bval)), mean)
This gives the final df
A B
1 3 7
2 3 7
3 9 7
4 10 6
5 8 6
6 6 6
7 0 2
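For what it's worth, the same three steps can be collapsed into a single vectorised expression (a sketch that uses the same zero-inclusive group mean as above; note that the column becomes float):
df['A'] = df['A'].mask(df['A'].eq(0),
                       df.groupby('B')['A'].transform('mean'))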

How to find the top 10 pairs that appear the most in a pandas dataframe in python

I have a pandas dataframe in python with columns 'a', 'b', 'c'. The 'a','b' pairs are unique and repeat multiple times. 'c' is changing all the time. I want to find the 10 pairs 'a','b' that appear the most and put them in a dataframe but don't know how. Any help is appreciated.
I'm not entirely sure I follow you, but assuming you mean you have a DataFrame looking something like
>>> N = 1000
>>> df = pd.DataFrame(np.random.randint(0, 10, (N, 3)), columns="A B C".split())
>>> df.head()
A B C
0 7 4 5
1 5 1 3
2 8 9 8
3 2 3 0
4 2 3 0
and you simply want to count (A, B) pairs, that's easy enough:
>>> df.groupby(["A", "B"]).size().order().iloc[-10:]
A B
6 1 13
1 0 14
4 0 14
7 2 14
1 6 15
8 2 15
1 8 16
2 6 16
6 4 16
7 4 16
dtype: int64
That can be broken down into four parts:
groupby, which groups the data by (A, B) tuples
size, which computes the size of each group
sort_values, which returns the Series sorted by value
iloc, which lets us select the last 10 entries in the Series by position
That results in a Series, but you could make a DataFrame out of it simply by passing it to pd.DataFrame.
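You could also get the top 10 directly as a DataFrame, for example (a sketch) with nlargest, which sorts in descending order, followed by reset_index:
top10 = (df.groupby(["A", "B"]).size()
           .nlargest(10)
           .reset_index(name="count"))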
