Reshape Pandas DataFrame with TimeSeries in rows instead of columns - python

I have a DataFrame df that contains price data (Open, Close, High, Low) for every day from January 2010 to December 2021:
Name       ISIN          Data         02.01.2010  05.01.2010  06.01.2010  ...  31.12.2021
Apple      US9835635986  Price Open   12.45       13.45       12.48       ...  54.12
Apple      US9835635986  Price Close  12.58       15.35       12.38       ...  54.43
Apple      US9835635986  Price High   12.78       15.85       12.83       ...  54.91
Apple      US9835635986  Price Low    12.18       13.35       12.21       ...  53.98
Microsoft  US1223928384  Price Open   12.45       13.45       12.48       ...  43.56
...        ...           ...          ...         ...         ...         ...  ...
I am trying to reshape the table into the format below:
Date        Name       ISIN          Price Open  Price Close  Price High  Price Low
02.01.2010  Apple      US9835635986  12.45       12.58        12.78       12.18
05.01.2010  Apple      US9835635986  13.45       15.35        15.85       13.35
...         ...        ...           ...         ...          ...         ...
02.01.2010  Microsoft  US1223928384  12.45       13.67        13.74       12.35
Simply transposing the DataFrame did not work. I also tried pivot, which raised an error saying the operands could not be broadcast together with different shapes.
dates = ['NAME','ISIN']
dates.append(df.columns.tolist()[3:]) # appends all column names starting with 02.01.2010
df.pivot(index = dates, columns = 'Data', Values = 'Data')
How can I get this DataFrame in the desired format?

Use DataFrame.melt before pivoting, convert the dates to datetimes, and finally sort the MultiIndex:
df = (df.melt(['Name','ISIN','Data'], var_name='Date')
        .assign(Date = lambda x: pd.to_datetime(x['Date'], format='%d.%m.%Y'))
        .pivot(index=['Date','Name','ISIN'], columns='Data', values='value')
        .sort_index(level=[1,2,0])
        .reset_index())
print (df)
Data       Date       Name          ISIN  Price Close  Price High  Price Low  \
0    2010-01-02      Apple  US9835635986        12.58       12.78      12.18
1    2010-01-05      Apple  US9835635986        15.35       15.85      13.35
2    2010-01-06      Apple  US9835635986        12.38       12.83      12.21
3    2021-12-31      Apple  US9835635986        54.43       54.91      53.98
4    2010-01-02  Microsoft  US1223928384          NaN         NaN        NaN
5    2010-01-05  Microsoft  US1223928384          NaN         NaN        NaN
6    2010-01-06  Microsoft  US1223928384          NaN         NaN        NaN
7    2021-12-31  Microsoft  US1223928384          NaN         NaN        NaN

Data  Price Open
0          12.45
1          13.45
2          12.48
3          54.12
4          12.45
5          13.45
6          12.48
7          43.56
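Both solutions here can be tested against a minimal reconstruction of the question's frame (values copied from the table above):
import pandas as pd

df = pd.DataFrame({
    'Name': ['Apple'] * 4 + ['Microsoft'],
    'ISIN': ['US9835635986'] * 4 + ['US1223928384'],
    'Data': ['Price Open', 'Price Close', 'Price High', 'Price Low', 'Price Open'],
    '02.01.2010': [12.45, 12.58, 12.78, 12.18, 12.45],
    '05.01.2010': [13.45, 15.35, 15.85, 13.35, 13.45],
    '06.01.2010': [12.48, 12.38, 12.83, 12.21, 12.48],
    '31.12.2021': [54.12, 54.43, 54.91, 53.98, 43.56],
})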
Another idea is to first convert the column names to datetimes and then reshape with DataFrame.stack and Series.unstack:
L = df.columns.tolist()
df = (df.set_axis(L[:3] + pd.to_datetime(L[3:], format='%d.%m.%Y').tolist(), axis=1)
        .rename_axis('Date', axis=1)
        .set_index(L[:3])
        .stack()
        .unstack(2)
        .reorder_levels([2,0,1])
        .reset_index())
print (df)
Data       Date       Name          ISIN  Price Close  Price High  Price Low  \
0    2010-01-02      Apple  US9835635986        12.58       12.78      12.18
1    2010-01-05      Apple  US9835635986        15.35       15.85      13.35
2    2010-01-06      Apple  US9835635986        12.38       12.83      12.21
3    2021-12-31      Apple  US9835635986        54.43       54.91      53.98
4    2010-01-02  Microsoft  US1223928384          NaN         NaN        NaN
5    2010-01-05  Microsoft  US1223928384          NaN         NaN        NaN
6    2010-01-06  Microsoft  US1223928384          NaN         NaN        NaN
7    2021-12-31  Microsoft  US1223928384          NaN         NaN        NaN

Data  Price Open
0          12.45
1          13.45
2          12.48
3          54.12
4          12.45
5          13.45
6          12.48
7          43.56

Related

Merge two columns in Pandas

I have the following Pandas DataFrame:
     date                 at      weight  status  buy_ts               sell_ts
---  -------------------  ------  ------  ------  -------------------  -------------------
  0  2010-01-03 00:00:00  1.4286       7  buy     2010-01-04 01:47:00  nan
  1  2010-01-03 00:00:00  1.4288       7  buy     2010-01-04 00:00:00  nan
  2  2010-01-03 00:00:00  1.4289       7  buy     2010-01-04 00:00:00  nan
  3  2010-01-04 00:00:00  1.442       25  buy     2010-01-05 00:00:00  nan
  4  2010-01-05 00:00:00  1.4422      15  sell    nan                  2010-01-06 14:03:00
  5  2010-01-05 00:00:00  1.4423      15  sell    nan                  2010-01-06 14:03:00
  6  2010-01-05 00:00:00  1.4424      15  sell    nan                  2010-01-06 14:03:00
  7  2010-01-06 00:00:00  1.4403      18  sell    nan                  2010-01-07 00:04:00
  8  2010-01-06 00:00:00  1.4404      18  sell    nan                  2010-01-07 00:05:00
  9  2010-01-06 00:00:00  1.4405      18  sell    nan                  2010-01-08 08:54:00
 10  2010-01-07 00:00:00  1.4313      26  buy     2010-01-08 00:07:00  nan
 11  2010-01-07 00:00:00  1.4314      26  buy     2010-01-08 00:07:00  nan
 12  2010-01-07 00:00:00  1.4316      26  sell    nan                  2010-01-08 00:10:00
buy_ts and sell_ts contain Python datetime.datetime objects.
I would like to create a new column called merged_ts which contains the datetime.datetime object from buy_ts or sell_ts (when one column has a value the other is always nan, so it is never the case that both columns are populated).
Use combine_first:
df['merged'] = df['buy_ts'].combine_first(df['sell_ts'])
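For illustration, a minimal sketch with two rows taken from the question's table:
import pandas as pd

df = pd.DataFrame({'buy_ts':  [pd.Timestamp('2010-01-04 01:47:00'), pd.NaT],
                   'sell_ts': [pd.NaT, pd.Timestamp('2010-01-06 14:03:00')]})
# combine_first takes buy_ts where present and falls back to sell_ts
df['merged'] = df['buy_ts'].combine_first(df['sell_ts'])
print(df['merged'])
# 0   2010-01-04 01:47:00
# 1   2010-01-06 14:03:00
# Name: merged, dtype: datetime64[ns]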

How to combine multiple rows with the same index, with each row having only one true value, in pandas?

I have a pandas dataframe which has the following shape:
                           OPEN_INT  PX_HIGH  PX_LAST  VOL
timestamp  ticker source
2018-01-01 AAPL   NYSE            1      NaN      NaN  NaN
2018-01-01 AAPL   NYSE          NaN        2      NaN  NaN
2018-01-01 AAPL   NYSE          NaN      NaN        3  NaN
2018-01-01 AAPL   NYSE          NaN      NaN      NaN    4
2018-01-01 MSFT   NYSE            5      NaN      NaN  NaN
2018-01-01 MSFT   NYSE          NaN        6      NaN  NaN
2018-01-01 MSFT   NYSE          NaN      NaN        7  NaN
2018-01-01 MSFT   NYSE          NaN      NaN      NaN    8
In each column, for each (timestamp, ticker, source) group, there is guaranteed to be only one value; all other values are NaN. Is there any way I can combine these into single rows so it looks like:
                           OPEN_INT  PX_HIGH  PX_LAST  VOL
timestamp  ticker source
2018-01-01 AAPL   NYSE            1        2        3    4
2018-01-01 MSFT   NYSE            5        6        7    8
I have tried to use df.groupby(['timestamp', 'ticker', 'source']).agg(lambda x: x.dropna()) but I got an error saying Function does not reduce.
Use GroupBy.first:
df.groupby(['timestamp', 'ticker', 'source']).first()
If there is always only one value per group, you can also aggregate with max, min, sum, mean, ...:
df.groupby(['timestamp', 'ticker', 'source']).max()
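A minimal sketch of why first works here (values from the question; GroupBy.first returns the first non-NaN entry per column within each group):
import numpy as np
import pandas as pd

df = pd.DataFrame({'timestamp': ['2018-01-01'] * 4, 'ticker': ['AAPL'] * 4,
                   'source': ['NYSE'] * 4,
                   'OPEN_INT': [1, np.nan, np.nan, np.nan],
                   'PX_HIGH': [np.nan, 2, np.nan, np.nan],
                   'PX_LAST': [np.nan, np.nan, 3, np.nan],
                   'VOL': [np.nan, np.nan, np.nan, 4]})
print(df.groupby(['timestamp', 'ticker', 'source']).first())
#                           OPEN_INT  PX_HIGH  PX_LAST  VOL
# timestamp  ticker source
# 2018-01-01 AAPL   NYSE         1.0      2.0      3.0  4.0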

Convert pandas column with single list of values into rows

I have the following dataframe:
  symbol           PSAR
0   AAPL  [nan,100,200]
1   PYPL  [nan,300,400]
2    SPY  [nan,500,600]
I am trying to turn the PSAR list values into rows like the following:
symbol  PSAR
AAPL     nan
AAPL     100
AAPL     200
PYPL     nan
PYPL     300
...      ...
SPY      600
I have been trying to solve it by following the answers in this post (one key difference being that that post has a list of lists) but can't get there: How to convert column with list of values into rows in Pandas DataFrame.
# build a DataFrame from the lists first (a Series of lists has no stack)
(pd.DataFrame(df['PSAR'].tolist(), index=df.index).stack(dropna=False)
   .reset_index(level=1, drop=True).to_frame('PSAR')
   .join(df[['symbol']], how='left'))
Not a slick one but this does the job:
list_of_lists = []
df_as_dict = dict(df.values)
for key, values in df_as_dict.items():
    list_of_lists += [[key, value] for value in values]
pd.DataFrame(list_of_lists)
returns:
      0      1
0  AAPL    NaN
1  AAPL  100.0
2  AAPL  200.0
3  PYPL    NaN
4  PYPL  300.0
5  PYPL  400.0
6   SPY    NaN
7   SPY  500.0
8   SPY  600.0
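If you want the original column names back, you can pass them explicitly (a small follow-up to the sketch above):
pd.DataFrame(list_of_lists, columns=['symbol', 'PSAR'])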
Pandas >= 0.25:
df1 = pd.DataFrame({'symbol': ['AAPL', 'PYPL', 'SPY'],
                    'PSAR': [[None,100,200], [None,300,400], [None,500,600]]})
print(df1)
  symbol              PSAR
0   AAPL  [None, 100, 200]
1   PYPL  [None, 300, 400]
2    SPY  [None, 500, 600]
df1.explode('PSAR')
  symbol  PSAR
0   AAPL  None
0   AAPL   100
0   AAPL   200
1   PYPL  None
1   PYPL   300
1   PYPL   400
2    SPY  None
2    SPY   500
2    SPY   600
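If a fresh 0..n index is preferred over the repeated index above, reset_index can be chained:
df1.explode('PSAR').reset_index(drop=True)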

How to read a table with extra information as a dataframe and add new columns from that information

I have a file-like object generated from StringIO, which is a table with lines of information ahead of the table (see below, starting from #TIMESTAMP).
I want to add extra columns to the existing table using the information "Date" and "UTCOffset - Time (subtraction)" from #TIMESTAMP and "ZenAngle" from #GLOBAL_SUMMARY.
I used the pd.read_csv command to read it, but it only worked when I skipped the first 8 rows, which include the information I need. Also, the error "TypeError: data argument can't be an iterator" was reported when I tried to import the object below as a DataFrame.
#TIMESTAMP
UTCOffset,Date,Time
+00:30:32,2011-09-05,08:32:21
#GLOBAL_SUMMARY
Time,IntACGIH,IntCIE,ZenAngle,MuValue,AzimAngle,Flag,TempC,O3,Err_O3,SO2,Err_SO2,F324
08:32:21,7.3576,52.758,59.109,1.929,114.427,000000,24,291,1,,,91.9
#GLOBAL
Wavelength,S-Irradiance,Time
290.0,0.000e+00
290.5,0.000e+00
291.0,4.380e-06
291.5,2.234e-05
292.0,2.102e-05
292.5,2.204e-05
293.0,2.453e-05
293.5,2.256e-05
294.0,3.088e-05
294.5,4.676e-05
295.0,3.384e-05
295.5,3.582e-05
296.0,4.298e-05
296.5,3.774e-05
297.0,4.779e-05
297.5,7.399e-05
298.0,9.214e-05
298.5,1.080e-04
299.0,2.143e-04
299.5,3.180e-04
300.0,3.337e-04
300.5,4.990e-04
301.0,8.688e-04
301.5,1.210e-03
302.0,1.133e-03
I think you can first use read_csv to create 3 DataFrames:
import pandas as pd
import io
temp=u"""#TIMESTAMP
UTCOffset,Date,Time
+00:30:32,2011-09-05,08:32:21
#GLOBAL_SUMMARY
Time,IntACGIH,IntCIE,ZenAngle,MuValue,AzimAngle,Flag,TempC,O3,Err_O3,SO2,Err_SO2,F324
08:32:21,7.3576,52.758,59.109,1.929,114.427,000000,24,291,1,,,91.9
#GLOBAL
Wavelength,S-Irradiance,Time
290.0,0.000e+00
290.5,0.000e+00
291.0,4.380e-06
291.5,2.234e-05
292.0,2.102e-05
292.5,2.204e-05
293.0,2.453e-05
293.5,2.256e-05
294.0,3.088e-05
294.5,4.676e-05
295.0,3.384e-05
295.5,3.582e-05
296.0,4.298e-05
296.5,3.774e-05
297.0,4.779e-05
297.5,7.399e-05
298.0,9.214e-05
298.5,1.080e-04
299.0,2.143e-04
299.5,3.180e-04
300.0,3.337e-04
300.5,4.990e-04
301.0,8.688e-04
301.5,1.210e-03
302.0,1.133e-03
"""
df1 = pd.read_csv(io.StringIO(temp), skiprows=9)
print (df1)
    Wavelength  S-Irradiance  Time
0        290.0      0.000000   NaN
1        290.5      0.000000   NaN
2        291.0      0.000004   NaN
3        291.5      0.000022   NaN
4        292.0      0.000021   NaN
5        292.5      0.000022   NaN
6        293.0      0.000025   NaN
7        293.5      0.000023   NaN
8        294.0      0.000031   NaN
9        294.5      0.000047   NaN
10       295.0      0.000034   NaN
11       295.5      0.000036   NaN
12       296.0      0.000043   NaN
13       296.5      0.000038   NaN
14       297.0      0.000048   NaN
15       297.5      0.000074   NaN
16       298.0      0.000092   NaN
17       298.5      0.000108   NaN
18       299.0      0.000214   NaN
19       299.5      0.000318   NaN
20       300.0      0.000334   NaN
21       300.5      0.000499   NaN
22       301.0      0.000869   NaN
23       301.5      0.001210   NaN
24       302.0      0.001133   NaN
df2 = pd.read_csv(io.StringIO(temp), skiprows=1, nrows=1)
print (df2)
   UTCOffset        Date      Time
0  +00:30:32  2011-09-05  08:32:21
df3 = pd.read_csv(io.StringIO(temp), skiprows=5, nrows=1)
print (df3)
       Time  IntACGIH  IntCIE  ZenAngle  MuValue  AzimAngle  Flag  TempC   O3  \
0  08:32:21    7.3576  52.758    59.109    1.929    114.427     0     24  291

   Err_O3  SO2  Err_SO2  F324
0       1  NaN      NaN  91.9
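From there, the extra columns the question asks for can be attached from the single-row frames; a minimal sketch, where the subtraction below is one possible reading of "UTCOffset - Time":
df1['Date'] = df2.loc[0, 'Date']
df1['ZenAngle'] = df3.loc[0, 'ZenAngle']
# one possible reading of "UTCOffset - Time (subtraction)";
# the leading '+' is stripped so to_timedelta can parse the offset
offset = pd.to_timedelta(df2.loc[0, 'UTCOffset'].lstrip('+'))
df1['UTCOffset-Time'] = offset - pd.to_timedelta(df2.loc[0, 'Time'])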

concat pandas DataFrame along timeseries indexes

I have two largish pandas DataFrames (snippets provided) with unequal dates as indexes that I wish to concat into one:
NAB.AX                                          CBA.AX
            Close    Volume                                 Close    Volume
Date                                            Date
2009-06-05  36.51   4962900                     2009-06-08  21.95         0
2009-06-04  36.79   5528800                     2009-06-05  21.95   8917000
2009-06-03  36.80   5116500                     2009-06-04  22.21  18723600
2009-06-02  36.33   5303700                     2009-06-03  23.11  11643800
2009-06-01  36.16   5625500                     2009-06-02  22.80  14249900
2009-05-29  35.14  13038600       --AND--       2009-06-01  22.52  11687200
2009-05-28  33.95   7917600                     2009-05-29  22.02  22350700
2009-05-27  35.13   4701100                     2009-05-28  21.63   9679800
2009-05-26  35.45   4572700                     2009-05-27  21.74   9338200
2009-05-25  34.80   3652500                     2009-05-26  21.64   8502900
Problem is, if I run this:
keys = ['CBA.AX','NAB.AX']
mv = pandas.concat([data['CBA.AX'][650:660],data['NAB.AX'][650:660]], axis=1, keys=stocks,)
the following DataFrame is produced:
                           CBA.AX          NAB.AX
                            Close  Volume   Close  Volume
Date
2200-08-16 04:24:21.460041    NaN     NaN     NaN     NaN
2203-05-13 04:24:21.460041    NaN     NaN     NaN     NaN
2206-02-06 04:24:21.460041    NaN     NaN     NaN     NaN
2208-11-02 04:24:21.460041    NaN     NaN     NaN     NaN
2211-07-30 04:24:21.460041    NaN     NaN     NaN     NaN
2219-10-16 04:24:21.460041    NaN     NaN     NaN     NaN
2222-07-12 04:24:21.460041    NaN     NaN     NaN     NaN
2225-04-07 04:24:21.460041    NaN     NaN     NaN     NaN
2228-01-02 04:24:21.460041    NaN     NaN     NaN     NaN
2230-09-28 04:24:21.460041    NaN     NaN     NaN     NaN
2238-12-15 04:24:21.460041    NaN     NaN     NaN     NaN
Does anybody have any idea why this might be the case?
On another point: are there any python libraries around that pull data from yahoo and normalise it?
Cheers.
EDIT: For reference:
data = {
'CBA.AX': <class 'pandas.core.frame.DataFrame'>
DatetimeIndex: 2313 entries, 2011-12-29 00:00:00 to 2003-01-01 00:00:00
Data columns:
Close 2313 non-null values
Volume 2313 non-null values
dtypes: float64(1), int64(1),
'NAB.AX': <class 'pandas.core.frame.DataFrame'>
DatetimeIndex: 2329 entries, 2011-12-29 00:00:00 to 2003-01-01 00:00:00
Data columns:
Close 2329 non-null values
Volume 2329 non-null values
dtypes: float64(1), int64(1)
}
It is possible to read the data with pandas and to concatenate it.
First, import the data:
In [449]: import pandas.io.data as web
In [450]: nab = web.get_data_yahoo('NAB.AX', start='2009-05-25',
                                   end='2009-06-05')[['Close', 'Volume']]
In [451]: cba = web.get_data_yahoo('CBA.AX', start='2009-05-26',
                                   end='2009-06-08')[['Close', 'Volume']]
In [453]: nab
Out[453]:
            Close    Volume
Date
2009-05-25  21.15   9685100
2009-05-26  21.64   8541900
2009-05-27  21.74   9042900
2009-05-28  21.63   9701000
2009-05-29  22.02  14665700
2009-06-01  22.52   6782000
2009-06-02  22.80  10473400
2009-06-03  23.11   9931400
2009-06-04  22.21  17869000
2009-06-05  21.95   8214300
In [454]: cba
Out[454]:
            Close    Volume
Date
2009-05-26  35.45   4529600
2009-05-27  35.13   4521500
2009-05-28  33.95   7945400
2009-05-29  35.14  12548500
2009-06-01  36.16   4509400
2009-06-02  36.33   4304900
2009-06-03  36.80   4845400
2009-06-04  36.79   4592300
2009-06-05  36.51   4417500
2009-06-08  36.51         0
Then concatenate it:
In [455]: keys = ['CBA.AX','NAB.AX']
In [456]: pd.concat([cba, nab], axis=1, keys=keys)
Out[456]:
           CBA.AX            NAB.AX
            Close    Volume   Close    Volume
Date
2009-05-25    NaN       NaN   21.15   9685100
2009-05-26  35.45   4529600   21.64   8541900
2009-05-27  35.13   4521500   21.74   9042900
2009-05-28  33.95   7945400   21.63   9701000
2009-05-29  35.14  12548500   22.02  14665700
2009-06-01  36.16   4509400   22.52   6782000
2009-06-02  36.33   4304900   22.80  10473400
2009-06-03  36.80   4845400   23.11   9931400
2009-06-04  36.79   4592300   22.21  17869000
2009-06-05  36.51   4417500   21.95   8214300
2009-06-08  36.51         0     NaN       NaN
Try a join with how='outer'.
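For example, a minimal sketch with the cba and nab frames from above (suffixes are needed because the column names collide):
merged = cba.join(nab, how='outer', lsuffix='_CBA.AX', rsuffix='_NAB.AX')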
When I am working with a number of stocks, I usually keep one frame per field (open, high, low, close, etc.) with one column per ticker. If you want a single data structure, I would use a Panel for this.
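A sketch of that per-field layout using the frames above (pd.Panel existed at the time of this answer but has since been removed from pandas, so only the per-field frames are shown):
# one frame per field, one column per ticker
close = pd.concat([cba['Close'], nab['Close']], axis=1, keys=['CBA.AX', 'NAB.AX'])
volume = pd.concat([cba['Volume'], nab['Volume']], axis=1, keys=['CBA.AX', 'NAB.AX'])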
For Yahoo data, you can use pandas:
import pandas.io.data as data
spy = data.DataReader("SPY","yahoo","1991/1/1")
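Note that pandas.io.data was later split out of pandas into the separate pandas-datareader package; the equivalent call there would be:
# the yahoo source itself has been unreliable in recent years
import pandas_datareader.data as web
spy = web.DataReader("SPY", "yahoo", "1991/1/1")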
