I want to insert a pandas dataframe into another pandas dataframe at certain indices.
Let's say we have this dataframe:
original_df = pd.DataFrame([[1,2,3],[4,5,6],[7,8,9]])
0 1 2
0 1 2 3
1 4 5 6
2 7 8 9
I can then change values at certain indices as follows:
original_df = pd.DataFrame([[1,2,3],[4,5,6],[7,8,9]])
original_df.iloc[[0,2],[0,1]] = 2
0 1 2
0 2 2 3
1 4 5 6
2 2 2 9
However, if I use the same technique to insert another dataframe, it doesn't work:
original_df = pd.DataFrame([[1,2,3],[4,5,6],[7,8,9]])
df_to_insert = pd.DataFrame([[10,11],[12,13]])
original_df.iloc[[0,2],[0,1]] = df_to_insert
0 1 2
0 10.0 11.0 3.0
1 4.0 5.0 6.0
2 NaN NaN 9.0
I am looking for a way to get the following result:
0 1 2
0 10 11 3
1 4 5 6
2 12 13 9
It seems that with the syntax I am using, the values from df_to_insert are aligned by index with their target locations rather than assigned positionally. Is there a way for me to avoid this?
When you insert, make sure to convert the dataframe to its underlying values; pandas is index-sensitive, which means it will always try to align on the index and columns during the operation.
original_df.iloc[[0,2],[0,1]] = df_to_insert.values
original_df
Out[651]:
0 1 2
0 10 11 3
1 4 5 6
2 12 13 9
It does work with an array rather than a df:
original_df.iloc[[0,2],[0,1]] = np.array([[10,11],[12,13]])
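Put together as a minimal runnable sketch (.to_numpy() is the modern spelling of .values):

import pandas as pd

original_df = pd.DataFrame([[1, 2, 3], [4, 5, 6], [7, 8, 9]])
df_to_insert = pd.DataFrame([[10, 11], [12, 13]])
# stripping the index/columns prevents pandas from aligning on them
original_df.iloc[[0, 2], [0, 1]] = df_to_insert.to_numpy()
print(original_df)
#     0   1  2
# 0  10  11  3
# 1   4   5  6
# 2  12  13  9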
This is my code:
mapping = {"ISTJ":1, "ISTP":2, "ISFJ":3, "ISFP":4, "INFP":6, "INTJ":7, "INTP":8, "ESTP":9, "ESTJ":10, "ESFP":11, "ESFJ":12, "ENFP":13, "ENFJ":14, "ENTP":15, "ENTJ":16, "NaN": 17}
q20 = castaway_details[["personality_type"]]
q20["personality_type"] = q20["personality_type"].map(mapping)
The data frame looks like this:
personality_type
0 INTP
1 INFP
2 INTJ
3 ISTJ
4 NAN
5 ESFP
I want the output like this:
personality_type
0 8
1 6
2 7
3 1
4 17
5 11
However, what I get from my code is all NaNs.
Try pandas.Series.str.strip before pandas.Series.map:
q20["personality_type"]= q20["personality_type"].str.strip().map(mapping)
# Output:
print(q20)
personality_type
0 8
1 6
2 7
3 1
4 17
5 11
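For reference, a self-contained sketch of this fix (the whitespace in the sample strings, and the literal "NaN " cell, are hypothetical stand-ins for whatever padding the real data carries; the mapping is trimmed to the keys used):

import pandas as pd

mapping = {"INTP": 8, "INFP": 6, "INTJ": 7, "ISTJ": 1, "ESFP": 11, "NaN": 17}
q20 = pd.DataFrame({"personality_type": [" INTP", "INFP ", " INTJ", "ISTJ ", "NaN ", "ESFP"]})
# strip stray whitespace so the strings match the dictionary keys exactly
q20["personality_type"] = q20["personality_type"].str.strip().map(mapping)
print(q20)
#    personality_type
# 0                 8
# 1                 6
# 2                 7
# 3                 1
# 4                17
# 5                11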
The key "NaN" in your mapping dictionary and the actual NaN values in your data frame do not match. Fill the missing values with a placeholder string and rename the dictionary key to match it ("NAN" here; I have modified the one in your dictionary):
df.apply(lambda x: x.fillna('NAN').map(mapping))
personality_type
0 8
1 6
2 7
3 1
4 17
5 11
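A self-contained sketch of this second approach with real missing values (the mapping is trimmed to the keys used; note the key renamed from "NaN" to "NAN" to match the fill value):

import numpy as np
import pandas as pd

mapping = {"INTP": 8, "INFP": 6, "INTJ": 7, "ISTJ": 1, "ESFP": 11, "NAN": 17}
df = pd.DataFrame({"personality_type": ["INTP", "INFP", "INTJ", "ISTJ", np.nan, "ESFP"]})
# replace real NaNs with the placeholder string before mapping
print(df.apply(lambda x: x.fillna("NAN").map(mapping)))
#    personality_type
# 0                 8
# 1                 6
# 2                 7
# 3                 1
# 4                17
# 5                11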
Say I have the following dataframe:
values
0 4
1 0
2 2
3 3
4 0
5 8
6 5
7 1
8 0
9 4
10 7
I want to find a pandas vectorized function (preferably using groupby) that would replace all nonzero values with the first nonzero value in that chunk of nonzero values, i.e. something that would give me
values new
0 4 4
1 0 0
2 2 2
3 3 2
4 0 0
5 8 8
6 5 8
7 1 8
8 0 0
9 4 4
10 7 4
Is there a good way of achieving this?
Make a boolean mask that selects the rows holding a zero and the rows immediately following them, then use where with this mask to replace the remaining values with NaN, and finally forward fill to propagate the values.
m = df['values'].eq(0)
df['new'] = df['values'].where(m | m.shift()).ffill().fillna(df['values'])
Result:
print(df)
values new
0 4 4.0
1 0 0.0
2 2 2.0
3 3 2.0
4 0 0.0
5 8 8.0
6 5 8.0
7 1 8.0
8 0 0.0
9 4 4.0
10 7 4.0
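A self-contained variant of the same idea (a sketch: marking each run's first row explicitly with shift(fill_value=True) also covers a leading nonzero run longer than one element, and the final astype keeps the integer dtype):

import pandas as pd

df = pd.DataFrame({'values': [4, 0, 2, 3, 0, 8, 5, 1, 0, 4, 7]})
m = df['values'].eq(0)                 # rows that are zero
first = ~m & m.shift(fill_value=True)  # first row of each nonzero run
# keep zeros and run-starts, blank the rest, then propagate forward
df['new'] = df['values'].where(m | first).ffill().astype(int)
print(df)
# new: 4 0 2 2 0 8 8 8 0 4 4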
Get the rows with zeros, and the rows immediately after:
zeros = df.index[df['values'].eq(0)]
after_zeros = zeros.union(zeros + 1)
Get the rows that need to be forward filled:
replace = df.index.difference(after_zeros)
replace = replace[replace > zeros[0]]
Assign the original values, set the replace positions to NaN, and forward fill:
df['new'] = df['values']
df.loc[replace, 'new'] = np.nan
df.ffill()
values new
0 4 4.0
1 0 0.0
2 2 2.0
3 3 2.0
4 0 0.0
5 8 8.0
6 5 8.0
7 1 8.0
8 0 0.0
9 4 4.0
10 7 4.0
The following function should do the job for you. Check the comments in the function to understand the workflow of the solution.
import pandas as pd
def ffill_nonZeros(values):
    # get the values that are not equal to 0
    non_zero = values[values != 0]
    # get their indexes
    non_zero_idx = non_zero.index.to_series()
    # find where the indexes are consecutive
    diff = non_zero_idx.diff()
    mask = diff == 1
    # blank out every value whose index change is consecutive (i.e. not the first of its run)
    non_zero[mask] = None
    # fill forward (replace every None with the previous valid value)
    new_non_zero = non_zero.ffill()
    # put the new values back at their indexes
    new = values.copy()
    new[new_non_zero.index] = new_non_zero
    return new
Now applying this function to your data:
df = pd.DataFrame([4, 0, 2, 3, 0, 8, 5, 1, 0, 4, 7], columns=['values'])
df['new'] = ffill_nonZeros(df['values'])
print(df)
Output:
values new
0 4 4
1 0 0
2 2 2
3 3 2
4 0 0
5 8 8
6 5 8
7 1 8
8 0 0
9 4 4
10 7 4
I have a data frame where there are several groups of numeric series where the values are cumulative. Consider the following:
df = pd.DataFrame({'Cat': ['A', 'A','A','A', 'B','B','B','B'], 'Indicator': [1,2,3,4,1,2,3,4], 'Cumulative1': [1,3,6,7,2,4,6,9], 'Cumulative2': [1,3,4,6,1,5,7,12]})
In [74]: df
Out[74]:
Cat Cumulative1 Cumulative2 Indicator
0 A 1 1 1
1 A 3 3 2
2 A 6 4 3
3 A 7 6 4
4 B 2 1 1
5 B 4 5 2
6 B 6 7 3
7 B 9 12 4
I need to create discrete series for Cumulative1 and Cumulative2, with the starting point being the earliest entry in 'Indicator'.
My approach is to use diff():
In [82]: df['Discrete1'] = df.groupby('Cat')['Cumulative1'].diff()
In [83]: df
Out[83]:
Cat Cumulative1 Cumulative2 Indicator Discrete1
0 A 1 1 1 NaN
1 A 3 3 2 2.0
2 A 6 4 3 3.0
3 A 7 6 4 1.0
4 B 2 1 1 NaN
5 B 4 5 2 2.0
6 B 6 7 3 2.0
7 B 9 12 4 3.0
I have 3 questions:
How do I avoid the NaNs in an elegant/Pythonic way? The correct values are to be found in the original Cumulative series.
Secondly, how do I elegantly apply this computation to all series, say -
cols = ['Cumulative1', 'Cumulative2']
Thirdly, I have a lot of data that needs this computation -- is this the most efficient way?
You do not want to avoid NaNs, you want to fill them with the start values from the "cumulative" column:
df['Discrete1'] = df['Discrete1'].combine_first(df['Cumulative1'])
To apply the operation to all (or select) columns, broadcast it to all columns of interest:
sources = ['Cumulative1', 'Cumulative2']
targets = ["Discrete" + x[len('Cumulative'):] for x in sources]
df[targets] = df.groupby('Cat')[sources].diff()
You still have to fill the NaNs in a loop:
for s, t in zip(sources, targets):
    df[t] = df[t].combine_first(df[s])
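Putting the whole answer together on the question's own data (a minimal sketch):

import pandas as pd

df = pd.DataFrame({'Cat': ['A', 'A', 'A', 'A', 'B', 'B', 'B', 'B'],
                   'Indicator': [1, 2, 3, 4, 1, 2, 3, 4],
                   'Cumulative1': [1, 3, 6, 7, 2, 4, 6, 9],
                   'Cumulative2': [1, 3, 4, 6, 1, 5, 7, 12]})
sources = ['Cumulative1', 'Cumulative2']
targets = ['Discrete1', 'Discrete2']
# per-group differences; the first row of each group comes out NaN
df[targets] = df.groupby('Cat')[sources].diff()
# fill each NaN with the starting value from the cumulative column
for s, t in zip(sources, targets):
    df[t] = df[t].combine_first(df[s])
print(df)
# Discrete1 per Cat A: 1, 2, 3, 1; per Cat B: 2, 2, 2, 3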
I have this pandas dataframe:
A B C
20 6 7
5 3.8 9
34 4 1
I want to create duplicate rows if the value in A is, say, >10.
So the Dataframe should finally look like:
A B C
10 6 7
10 6 7
5 3.8 9
10 4 1
10 4 1
10 4 1
4 4 1
Is there a way in pandas to do this elegantly? Or will I have to loop over the rows and do it manually?
I have already browsed similar queries on StackOverflow, but none of them does exactly what I want.
Use:
# create a default index
df = df.reset_index(drop=True)
# get the floor and modulo divisions
a = df['A'] // 10
b = df['A'] % 10
# repeat once where the remainder is not 0
df2 = df.loc[df.index.repeat(b.ne(0).astype(int))]
# replace the values of A with the remainder, mapped by index
df2['A'] = df2.index.map(b.get)
# repeat a times and assign the scalar 10
df1 = df.loc[df.index.repeat(a)].assign(A=10)
# join together, sort the index and create a default RangeIndex
df = pd.concat([df1, df2]).sort_index().reset_index(drop=True)
print(df)
A B C
0 10 6.0 7
1 10 6.0 7
2 5 3.8 9
3 10 4.0 1
4 10 4.0 1
5 10 4.0 1
6 4 4.0 1
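A compact variant of the same idea (a sketch, not the answer's exact code): compute the repeat counts once, then pick 10 or the remainder based on each row's position within its repeated block.

import numpy as np
import pandas as pd

df = pd.DataFrame({'A': [20, 5, 34], 'B': [6, 3.8, 4], 'C': [7, 9, 1]})
q, r = df['A'] // 10, df['A'] % 10
# one row per full ten, plus one extra row when there is a remainder
out = df.loc[df.index.repeat(q + r.ne(0))].copy()
pos = out.groupby(level=0).cumcount()  # position within each block
out['A'] = np.where(pos.to_numpy() < q.loc[out.index].to_numpy(),
                    10, r.loc[out.index].to_numpy())
out = out.reset_index(drop=True)
print(out)
#     A    B  C
# 0  10  6.0  7
# 1  10  6.0  7
# 2   5  3.8  9
# 3  10  4.0  1
# 4  10  4.0  1
# 5  10  4.0  1
# 6   4  4.0  1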
I'm trying to create a total column that sums the numbers from another column based on a third column. I can do this by using .groupby(), but that creates a truncated column, whereas I want a column that is the same length.
My code:
df = pd.DataFrame({'a':[1,2,2,3,3,3], 'b':[1,2,3,4,5,6]})
df['total'] = df.groupby(['a']).sum().reset_index()['b']
My result:
a b total
0 1 1 1.0
1 2 2 5.0
2 2 3 15.0
3 3 4 NaN
4 3 5 NaN
5 3 6 NaN
My desired result:
a b total
0 1 1 1.0
1 2 2 5.0
2 2 3 5.0
3 3 4 15.0
4 3 5 15.0
5 3 6 15.0
...where every row with the same 'a' value shows that group's total.
Returning the sum from a groupby operation produces a result only as long as the number of unique groups. Use transform to produce a like-indexed column, the same length as the original data frame, without performing any merges.
df['total'] = df.groupby('a')['b'].transform(sum)
>>> df
a b total
0 1 1 1
1 2 2 5
2 2 3 5
3 3 4 15
4 3 5 15
5 3 6 15
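As a small aside, passing the aggregation by name generally lets pandas use its optimized grouped implementation rather than the Python builtin, with the same result:

df['total'] = df.groupby('a')['b'].transform('sum')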