Consider a dataframe which contains several groups of integers:
d = pd.DataFrame({'label': ['a','a','a','a','b','b','b','b'], 'value': [1,2,3,2,7,1,8,9]})
d
label value
0 a 1
1 a 2
2 a 3
3 a 2
4 b 7
5 b 1
6 b 8
7 b 9
Within each of these groups, every integer has to be greater than or equal to the previous one. If it is not, it should take on the value of the previous integer. I replace values using
s.where(~(s < s.shift()), s.shift())
which works fine for a single Series. I can even group the dataframe and loop through each extracted Series:
grouped = d.groupby('label')['value']
for _, s in grouped:
    print(s.where(~(s < s.shift()), s.shift()))
0 1.0
1 2.0
2 3.0
3 3.0
Name: value, dtype: float64
4 7.0
5 7.0
6 8.0
7 9.0
Name: value, dtype: float64
However, how do I now get these values back into my original dataframe?
Or, is there a better way to do this? I don't care for using .groupby and don't consider the for loop a pretty solution either...
IIUC, you can use cummax within the groupby:
d['val_max'] = d.groupby('label')['value'].cummax()
print (d)
label value val_max
0 a 1 1
1 a 2 2
2 a 3 3
3 a 2 3
4 b 7 7
5 b 1 7
6 b 8 8
7 b 9 9
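If you would rather write your original where/shift logic back into the frame instead of using cummax, a minimal sketch with groupby().transform (val_fixed is just a hypothetical output column name) could look like this:
# sketch: apply the per-group where/shift replacement and assign it back to the frame
d['val_fixed'] = d.groupby('label')['value'].transform(
    lambda s: s.where(~(s < s.shift()), s.shift())
)
Note that this only compares each value with its immediate predecessor, while cummax enforces a running maximum, which is usually what "never smaller than the previous value" ends up meaning.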
Having two data frames:
df1 = pd.DataFrame({'a':[1,2,3],'b':[4,5,6]})
a b
0 1 4
1 2 5
2 3 6
df2 = pd.DataFrame({'c':[7],'d':[8]})
c d
0 7 8
The goal is to add all of df2's column values to df1, repeated down every row, to create the following result. It is assumed that the two data frames do not share any column names.
a b c d
0 1 4 7 8
1 2 5 7 8
2 3 6 7 8
If the column names are strings, it is possible to use DataFrame.assign and unpack the Series created by selecting the first row of df2:
df = df1.assign(**df2.iloc[0])
print (df)
a b c d
0 1 4 7 8
1 2 5 7 8
2 3 6 7 8
Another idea is to repeat the values over df1.index with DataFrame.reindex and then use DataFrame.join (here the first index value of df2 has to match the first index value of df1):
df = df1.join(df2.reindex(df1.index, method='ffill'))
print (df)
a b c d
0 1 4 7 8
1 2 5 7 8
2 3 6 7 8
If there are no missing values in the original df, it is possible to forward-fill the missing values in a last step, but the joined columns are then cast to floats (thanks @Dishin H Goyan):
df = df1.join(df2).ffill()
print (df)
a b c d
0 1 4 7.0 8.0
1 2 5 7.0 8.0
2 3 6 7.0 8.0
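On newer pandas versions (1.2+), a cross merge is another option; a minimal sketch assuming the same df1 and df2 as above:
# sketch: a cross merge repeats the single row of df2 against every row of df1
df = df1.merge(df2, how='cross')
print(df)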
I'm new to Python and I need your assistance with producing a result where the original columns become the rows and each row holds that column's average.
Here's an example:
columns
A B C
1 2 3
4 5 6
7 8 9
Expected result:
avg
A 4
B 5
C 6
I can do it easily in Excel by placing the columns in "Values" and then moving the values into rows to get the average, but I can't seem to do it in Python.
df=pd.DataFrame({'A':[1,4,7],'B':[2,5,8],'C':[3,6,9]})
df
A B C
0 1 2 3
1 4 5 6
2 7 8 9
ser=df.mean() #Result is a Series
df=pd.DataFrame({'avg':ser}) #Convert this Series into DataFrame
df
avg
A 4.0
B 5.0
C 6.0
Using to_frame
df.mean().to_frame('ave')
Out[186]:
ave
A 4.0
B 5.0
C 6.0
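As a side note, if you prefer to stay with DataFrames the whole way, an aggregate-then-transpose sketch (assuming the same df) gives the same shape:
# sketch: aggregate into a one-row frame, transpose, then rename the column
df.agg(['mean']).T.rename(columns={'mean': 'avg'})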
I have a data frame where there are several groups of numeric series where the values are cumulative. Consider the following:
df = pd.DataFrame({'Cat': ['A', 'A','A','A', 'B','B','B','B'], 'Indicator': [1,2,3,4,1,2,3,4], 'Cumulative1': [1,3,6,7,2,4,6,9], 'Cumulative2': [1,3,4,6,1,5,7,12]})
In [74]:df
Out[74]:
Cat Cumulative1 Cumulative2 Indicator
0 A 1 1 1
1 A 3 3 2
2 A 6 4 3
3 A 7 6 4
4 B 2 1 1
5 B 4 5 2
6 B 6 7 3
7 B 9 12 4
I need to create discrete series for Cumulative1 and Cumulative2, with the starting point being the earliest entry in 'Indicator'.
My approach is to use diff():
In[82]: df['Discrete1'] = df.groupby('Cat')['Cumulative1'].diff()
Out[82]: df
Cat Cumulative1 Cumulative2 Indicator Discrete1
0 A 1 1 1 NaN
1 A 3 3 2 2.0
2 A 6 4 3 3.0
3 A 7 6 4 1.0
4 B 2 1 1 NaN
5 B 4 5 2 2.0
6 B 6 7 3 2.0
7 B 9 12 4 3.0
I have 3 questions:
How do I avoid the NaN in an elegant/Pythonic way? The correct values are to be found in the original Cumulative series.
Secondly, how do I elegantly apply this computation to all series, say -
cols = ['Cumulative1', 'Cumulative2']
Thirdly, I have a lot of data that needs this computation -- is this the most efficient way?
You do not want to avoid NaNs, you want to fill them with the start values from the "cumulative" column:
df['Discrete1'] = df['Discrete1'].combine_first(df['Cumulative1'])
To apply the operation to all (or selected) columns, broadcast it across the columns of interest:
sources = ['Cumulative1', 'Cumulative2']  # use a list, not a tuple, so the groupby column selection works
targets = ["Discrete" + x[len('Cumulative'):] for x in sources]
df[targets] = df.groupby('Cat')[sources].diff()
You still have to fill the NaNs in a loop:
for s, t in zip(sources, targets):
    df[t] = df[t].combine_first(df[s])
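Putting the pieces together, an end-to-end sketch on the example df (the Discrete1/Discrete2 target names simply mirror the sources):
sources = ['Cumulative1', 'Cumulative2']
targets = ['Discrete1', 'Discrete2']
# per-group differences; the first row of each group comes out as NaN
df[targets] = df.groupby('Cat')[sources].diff()
# fill each group's first row with its original cumulative starting value
for s, t in zip(sources, targets):
    df[t] = df[t].combine_first(df[s])
print(df)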
I'm trying to create a total column that sums the numbers from one column, grouped by the values of another column. I can do this with .groupby(), but that creates a truncated column, whereas I want a column that is the same length as the original.
My code:
df = pd.DataFrame({'a':[1,2,2,3,3,3], 'b':[1,2,3,4,5,6]})
df['total'] = df.groupby(['a']).sum().reset_index()['b']
My result:
a b total
0 1 1 1.0
1 2 2 5.0
2 2 3 15.0
3 3 4 NaN
4 3 5 NaN
5 3 6 NaN
My desired result:
a b total
0 1 1 1.0
1 2 2 5.0
2 2 3 5.0
3 3 4 15.0
4 3 5 15.0
5 3 6 15.0
...where every row with the same 'a' value gets the same group total.
Returning the sum from a groupby operation in pandas produces a column only as long as the number of unique grouping keys. Use transform to produce a like-indexed column, i.e. one the same length as the original data frame, without performing any merges.
df['total'] = df.groupby('a')['b'].transform('sum')
>>> df
a b total
0 1 1 1
1 2 2 5
2 2 3 5
3 3 4 15
4 3 5 15
5 3 6 15
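For comparison, a sketch of an equivalent approach that aggregates once and maps the group sums back onto the key column (same df as above):
# sketch: compute each group's total, then map every 'a' value to its group total
totals = df.groupby('a')['b'].sum()
df['total'] = df['a'].map(totals)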
I have a dataframe with the quarterly U.S. GDP as column values. I would like to look at the values, 3 at a time, and find the index where the GDP fell for the next two consecutive quarters. This means I need to compare individual elements within df['GDP'] with each other, in groups of 3.
Here's an example dataframe.
df = pd.DataFrame(data=np.random.randint(0,10,10), columns=['GDP'])
df
GDP
0 4
1 4
2 4
3 1
4 4
5 4
6 8
7 2
8 3
9 9
I'm using df.rolling().apply(find_recession), but I don't know how I can access individual elements of the rolling window within my find_recession() function.
gdp['Recession_rolling'] = gdp['GDP'].rolling(window=3).apply(find_recession_start)
How can I access individual elements within the rolling window, so I can make a comparison such as gdp_val_2 < gdp_val_1 < gdp_val?
The .rolling().apply() will go through the entire dataframe, 3 values at a time, so let's take a look at one particular window, which starts at index location 6:
GDP
6 8 # <- gdp_val
7 2 # <- gdp_val_1
8 3 # <- gdp_val_2
How can I access gdp_val, gdp_val_1, and gdp_val_2 within the current window?
Using a lambda expression within .apply() (together with raw=True, so the window arrives as a NumPy array rather than a Series) passes an array into the custom function (find_recession_start), and so I can access the elements as I would any list/array, e.g. arr[0], arr[1], arr[2]:
df = pd.DataFrame(data=np.random.randint(0,10,10), columns=['GDP'])
def my_func(arr):
    # arr is a NumPy array holding the 3 values of the current window
    if (arr[2] < arr[1]) and (arr[1] < arr[0]):
        return 1
    else:
        return 0

df['Result'] = df['GDP'].rolling(window=3).apply(lambda x: my_func(x), raw=True)
df
GDP Result
0 8 NaN
1 0 NaN
2 8 0.0
3 1 0.0
4 9 0.0
5 7 0.0
6 9 0.0
7 8 0.0
8 3 1.0
9 9 0.0
The short answer is: you can't, but you can use your knowledge about the structure of the dataframe/series.
You know the size of the window, you know the current index - therefore, you can output the shift relative to the current index:
Let's pretend this is your gdp:
In [627]: gdp
Out[627]:
0 8
1 0
2 0
3 4
4 0
5 3
6 6
7 2
8 5
9 5
dtype: int64
The naive approach is just to return the (argmin() - 2) and add it to the current index:
In [630]: gdp.rolling(window=3).apply(lambda win: win.argmin() - 2) + gdp.index
Out[630]:
0 NaN
1 NaN
2 1.0
3 1.0
4 2.0
5 4.0
6 4.0
7 7.0
8 7.0
9 7.0
dtype: float64
The naive approach won't always return the correct result, since you can't predict which index argmin returns when there are ties, and it ignores a rise in the middle of the window. But it illustrates the idea.
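For completeness, a vectorized sketch that avoids rolling().apply() entirely, assuming the example df with a 'GDP' column; it flags each quarter that is followed by two consecutive declines:
# sketch: detect a drop versus the previous quarter, then require two drops in a row
fell = df['GDP'].diff() < 0
starts = fell.shift(-1, fill_value=False) & fell.shift(-2, fill_value=False)
recession_starts = df.index[starts]  # index labels where the next two quarters both decline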