I'm trying to create a DataFrame by repeating a row built from an existing arithmetic sequence.
For example, for a sequence increasing in 3s from 6 up to (but not including) 18, the row can be generated with np.arange(6, 18, 3), giving array([ 6,  9, 12, 15]).
How would I go about generating a dataframe in this way?
How could I get the below if I wanted 7 repeated rows?
0 1 2 3
0 6.0 9.0 12.0 15.0
1 6.0 9.0 12.0 15.0
2 6.0 9.0 12.0 15.0
3 6.0 9.0 12.0 15.0
4 6.0 9.0 12.0 15.0
5 6.0 9.0 12.0 15.0
6 6.0 9.0 12.0 15.0
The reason for creating this matrix is that I then wish to add a pandas Series row-wise to it.
pd.DataFrame([np.arange(6, 18, 3)]*7)
Alternatively:
pd.DataFrame(np.repeat([np.arange(6, 18, 3)],7, axis=0))
0 1 2 3
0 6 9 12 15
1 6 9 12 15
2 6 9 12 15
3 6 9 12 15
4 6 9 12 15
5 6 9 12 15
6 6 9 12 15
Here is a solution using NumPy broadcasting, which avoids Python loops, lists, and excess memory allocation (as done by np.repeat):
pd.DataFrame(np.broadcast_to(np.arange(6, 18, 3), (7, 4)))
To understand why this is more efficient than the other solutions, refer to the np.broadcast_to() docs: https://numpy.org/doc/stable/reference/generated/numpy.broadcast_to.html
"more than one element of a broadcasted array may refer to a single memory location."
This means that no matter how many rows you create before passing the array to Pandas, you only really allocate a single row, plus a 2D view that refers to that row's data multiple times.
If you assign the above to df, you can check that df.values.base is a single row; that one row is the only storage required, no matter how many rows appear in the DataFrame.
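A quick, hedged sketch of that memory behavior (the pandas-side check is version-dependent, so treat it as illustrative):

import numpy as np
import pandas as pd

row = np.arange(6, 18, 3)             # array([ 6,  9, 12, 15])
tiled = np.broadcast_to(row, (7, 4))  # 7x4 read-only view; no new row data is allocated

print(np.shares_memory(tiled, row))   # True: the 2-D view re-uses the single row's memory
print(tiled.strides)                  # (0, 8) for 64-bit ints: stride 0 walks the same row

df = pd.DataFrame(tiled)
print(np.shares_memory(df.values, row))  # typically True, though whether pandas keeps the view varies by version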
My pandas DataFrame looks like this:
DOY Value
0 5 5118
1 10 5098
2 15 5153
I've been trying to resample my data and fill in the gaps using pandas' resample function. My worry is that since I'm not resampling on real datetime values, I won't be able to resample at all.
My attempt was the line below, but it raised an error about the DataFrame having a RangeIndex. Perhaps I need a PeriodIndex somehow, but I'm not sure how to go about it.
inter.resample('1D').mean().interpolate()
Here's my intended result
DOY Value
0 5 5118
1 6 5114
2 7 5110
3 8 5106
4 9 5102
5 10 5098
: : :
10 15 5153
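For reference, a minimal reconstruction of the sample frame above (dtypes assumed):

import pandas as pd

df = pd.DataFrame({"DOY": [5, 10, 15], "Value": [5118, 5098, 5153]})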
Convert with to_datetime (day-of-year format "%j"), perform the resample, and then drop the helper column:
df["date"] = pd.to_datetime(df["DOY"].astype(str),format="%j")
output = (df.resample("D", on="date")
            .last()
            .drop("date", axis=1)
            .interpolate()
            .reset_index(drop=True))
>>> output
DOY Value
0 5.0 5118.0
1 6.0 5114.0
2 7.0 5110.0
3 8.0 5106.0
4 9.0 5102.0
5 10.0 5098.0
6 11.0 5109.0
7 12.0 5120.0
8 13.0 5131.0
9 14.0 5142.0
10 15.0 5153.0
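If you want DOY back as integers, a small cleanup such as output.astype({"DOY": int}) should do, assuming the interpolation left no NaNs.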
pd.DataFrame.interpolate works on the index, so let's start by setting an appropriate index, then reindex onto a complete range over which to interpolate:
d0 = df.set_index('DOY')
idx = pd.RangeIndex(d0.index.min(), d0.index.max()+1, name='DOY')
d0.reindex(idx).interpolate().reset_index()
DOY Value
0 5 5118.0
1 6 5114.0
2 7 5110.0
3 8 5106.0
4 9 5102.0
5 10 5098.0
6 11 5109.0
7 12 5120.0
8 13 5131.0
9 14 5142.0
10 15 5153.0
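A note on the design: .interpolate() defaults to method='linear', which treats rows as equally spaced. That is fine here because the reindexed DOY values are consecutive integers; for an unevenly spaced index you would want interpolate(method='index'), which interpolates against the index values themselves.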
I was extracting tables from a PDF with tabula-py, but where a table cell spans multiple lines in the PDF, tabula-py converts that single table row into multiple DataFrame rows. I'm giving a sample here.
Serial No. Name Type Total
0 1 Easter Multiple 19
1 2 Costeri Roundabout 16
2 3 Zhiop Tee 16
3 4 Nesss Cross 10
4 5 Uoar Lhahara Tee 10
5 6 Trino Nishra (KX) Tee 9
6 7 Old-FX Box Cross 8
7 8 Gardeners Roundabout 8
8 9 Max Detter Roundabout 7
9 NaN Others (Asynco, NaN NaN
10 10 D+ E, Cross 7
11 NaN etc) NaN NaN
If you look at the sample, you will see that the rows at indices 9, 10, and 11 are actually a single row: the cell spanned multiple lines in the PDF table. The table has more than 100 rows, and this issue occurs in at least 12 places, sometimes across 2 consecutive rows and sometimes across 3. How can we merge those rows using the index values?
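To make the answer below reproducible, here is a minimal sketch of just the problematic tail of the extracted table (values taken from the sample above, dtypes assumed):

import numpy as np
import pandas as pd

df = pd.DataFrame({
    "Serial No.": [9, np.nan, 10, np.nan],
    "Name": ["Max Detter", "Others (Asynco,", "D+ E,", "etc)"],
    "Type": ["Roundabout", np.nan, "Cross", np.nan],
    "Total": [7, np.nan, 7, np.nan],
})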
You can try:
import numpy as np

df['Serial No.'] = df['Serial No.'].bfill().ffill()           # fill the fragments' missing serials from their neighbors
df['Total'] = df['Total'].astype(str).replace('nan', np.nan)  # cast to str for the join below, restoring real NaNs
df_out = df.groupby('Serial No.', as_index=False).agg(lambda x: ''.join(x.dropna()))
df_out['Total'] = df_out['Total'].replace('', np.nan, regex=True).astype(float)
Result:
print(df_out)
Serial No. Name Type Total
0 1.0 Easter Multiple 19.0
1 2.0 Costeri Roundabout 16.0
2 3.0 Zhiop Tee 16.0
3 4.0 Nesss Cross 10.0
4 5.0 Uoar Lhahara Tee 10.0
5 6.0 Trino Nishra(KX) Tee 9.0
6 7.0 Old-FX Box Cross 8.0
7 8.0 Gardeners Roundabout 8.0
8 9.0 Max Detter Roundabout 7.0
9 10.0 Others (Asynco,D+ E,etc) Cross 7.0
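If you would rather keep spaces between the merged fragments, join with a space instead, e.g. .agg(lambda x: ' '.join(x.dropna())), which yields "Others (Asynco, D+ E, etc)" and leaves single-fragment cells unchanged.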
Consider the following dataframe:
df = pd.DataFrame({
'a': np.arange(1, 5),
'b': np.arange(1, 5) * 2,
'c': np.arange(1, 5) * 3
})
a b c
0 1 2 3
1 2 4 6
2 3 6 9
3 4 8 12
I want to calculate the cumulative sum for each row across the columns:
def expanding_func(s):
return s.sum()
df.expanding(1, axis=1).apply(expanding_func, raw=True)
# As expected:
a b c
0 1.0 3.0 6.0
1 2.0 6.0 12.0
2 3.0 9.0 18.0
3 4.0 12.0 24.0
However, if I set raw=False, expanding_func no longer works:
df.expanding(1, axis=1).apply(expanding_func, raw=False)
ValueError: Length of passed values is 3, index implies 4
The documentation says expanding_func
Must produce a single value from an ndarray input if raw=True or a single value from a Series if raw=False.
And that is exactly what I was doing. Why did expanding_func fail when raw=False?
Note: this is only a contrived example. I want to know how to write a custom rolling function, not how to calculate the cumulative sum across columns.
It seems this is a bug with pandas.
If you do:
df.iloc[:3].expanding(1, axis=1).apply(expanding_func, raw=False)
It actually works. When the window is passed as a Series, pandas seems to validate the number of returned values against the number of rows of the DataFrame, when it should be comparing against the number of columns (the axis being expanded over is 1).
A workaround is to transpose the DataFrame, apply your function, and transpose back, which seems to work. The bug only seems to affect axis=1:
df.T.expanding(1, axis=0).apply(expanding_func, raw=False).T
a b c
0 1.0 3.0 6.0
1 2.0 6.0 12.0
2 3.0 9.0 18.0
3 4.0 12.0 24.0
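Note that recent pandas versions deprecate the axis argument on rolling/expanding windows altogether, so the transpose-apply-transpose pattern above is also the forward-compatible way to run such a function across columns.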
You don't need to set raw at all; just do it the simple way:
df.expanding(0, axis=1).apply(expanding_func)
a b c
0 1.0 3.0 6.0
1 2.0 6.0 12.0
2 3.0 9.0 18.0
3 4.0 12.0 24.0
I am trying to create a new column that carries the last recorded peak value forward until the next peak comes along. For example, suppose this is my existing DataFrame:
index values
0 10
1 20
2 15
3 17
4 15
5 22
6 20
I want to get something like this:
index values last_recorded_peak
0 10 10
1 20 20
2 15 20
3 17 17
4 15 17
5 22 22
6 20 22
So far, I have tried np.maximum.accumulate, which accumulates the running max but not quite the "peaks" (a later peak can be lower than the running max).
I have also tried scipy.signal.find_peaks, which returns an array of the indices where my peaks are (in the example, indices 1, 3, and 5), which is not what I'm looking for.
I'm relatively new to coding, any pointer is very much appreciated!
You're on the right track; scipy.signal.find_peaks is the way I would go, you just need to do a little work on the result:
from scipy import signal

peaks = signal.find_peaks(df['values'])[0]  # indices of the interior peaks

df['last_recorded_peak'] = (df.assign(last_recorded_peak=float('nan'))
                              .last_recorded_peak
                              .combine_first(df.loc[peaks, 'values'])  # place peak values at the peak rows
                              .ffill()                                 # carry the last peak forward
                              .combine_first(df['values']))            # rows before the first peak keep their own value
print(df)
index values last_recorded_peak
0 0 10 10.0
1 1 20 20.0
2 2 15 20.0
3 3 17 17.0
4 4 15 17.0
5 5 22 22.0
6 6 20 22.0
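If you prefer to stay within pandas, an equivalent sketch without SciPy (assuming the same df, and taking a "peak" to mean strictly greater than both neighbors, as find_peaks does for interior points):

s = df['values']
is_peak = (s > s.shift(1)) & (s > s.shift(-1))                 # strictly above both neighbors
df['last_recorded_peak'] = s.where(is_peak).ffill().fillna(s)  # carry peaks forward; pre-peak rows keep their own value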
If I understand you correctly, you are looking for a rolling max:
Note: you might have to play around with the window size, which I set to 2 for your example DataFrame.
df['last_recorded_peak'] = df['values'].rolling(2).max().fillna(df['values'])
Output
values last_recorded_peak
0 10 10.0
1 20 20.0
2 15 20.0
3 17 17.0
4 15 17.0
5 22 22.0
6 20 22.0
I have a dataframe with the quarterly U.S. GDP as column values. I would like to look at the values, 3 at a time, and find the index where the GDP fell for the next two consecutive quarters. This means I need to compare individual elements within df['GDP'] with each other, in groups of 3.
Here's an example dataframe.
df = pd.DataFrame(data=np.random.randint(0,10,10), columns=['GDP'])
df
GDP
0 4
1 4
2 4
3 1
4 4
5 4
6 8
7 2
8 3
9 9
I'm using df.rolling().apply(find_recession_start), but I don't know how I can access individual elements of the rolling window within my find_recession_start() function.
gdp['Recession_rolling'] = gdp['GDP'].rolling(window=3).apply(find_recession_start)
How can I access individual elements within the rolling window, so I can make a comparison such as gdp_val_2 < gdp_val_1 < gdp_val?
The .rolling().apply() will go through the entire dataframe, 3 values at a time, so let's take a look at one particular window, which starts at index location 6:
GDP
6 8 # <- gdp_val
7 2 # <- gdp_val_1
8 3 # <- gdp_val_2
How can I access gdp_val, gdp_val_1, and gdp_val_2 within the current window?
Calling .apply() with raw=True passes each window into the custom function (find_recession_start) as a NumPy array, so I can access the elements as I would any list/array, e.g. arr[0], arr[1], arr[2]:
df = pd.DataFrame(data=np.random.randint(0,10,10), columns=['GDP'])
def my_func(arr):
    # 1 if GDP fell for two consecutive quarters within the window
    if (arr[2] < arr[1]) and (arr[1] < arr[0]):
        return 1
    return 0

df['Result'] = df['GDP'].rolling(window=3).apply(my_func, raw=True)  # raw=True hands my_func a NumPy array
df
GDP Result
0 8 NaN
1 0 NaN
2 8 0.0
3 1 0.0
4 9 0.0
5 7 0.0
6 9 0.0
7 8 0.0
8 3 1.0
9 9 0.0
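Since the check only compares each value with its two predecessors, the same flags can be computed without rolling at all, using shifted columns (a loop-free sketch; note the first two rows come out 0 rather than NaN):

fell = df['GDP'] < df['GDP'].shift(1)                  # fell versus the previous quarter
fell_before = df['GDP'].shift(1) < df['GDP'].shift(2)  # the previous quarter also fell
df['Result'] = (fell & fell_before).astype(int)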
The short answer is: you can't directly, but you can use your knowledge of the structure of the DataFrame/Series.
You know the size of the window and you know the current index; therefore, you can output the shift relative to the current index.
Suppose this is your gdp:
In [627]: gdp
Out[627]:
0 8
1 0
2 0
3 4
4 0
5 3
6 6
7 2
8 5
9 5
dtype: int64
The naive approach is just to return (argmin() - 2) and add it to the current index:
In [630]: gdp.rolling(window=3).apply(lambda win: win.argmin() - 2) + gdp.index
Out[630]:
0 NaN
1 NaN
2 1.0
3 1.0
4 2.0
5 4.0
6 4.0
7 7.0
8 7.0
9 7.0
dtype: float64
The naive approach won't return the correct result, since you can't predict which index argmin() will return when there are equal values, or when the rise happens in the middle of the window. But you get the idea.