Why are all elements NaN when constructing a MultiIndex Dataframe? - python

Suppose I have a Dataframe like this. I want to convert it to a 2-level MultiIndex Dataframe.
dt st close volume
0 20100101 000001.sz 1 10000
1 20100101 000002.sz 10 50000
2 20100101 000003.sz 5 1000
3 20100101 000004.sz 15 7000
4 20100101 000005.sz 100 100000
5 20100102 000001.sz 2 20000
6 20100102 000002.sz 20 60000
7 20100102 000003.sz 6 2000
8 20100102 000004.sz 20 8000
9 20100102 000005.sz 110 110000
But when I try this code:
import pandas as pd

data = pd.read_csv('data/trial.csv')
print(data)

idx = pd.MultiIndex.from_product([data.dt.unique(),
                                  data.st.unique()],
                                 names=['dt', 'st'])
col = ['close', 'volume']
df = pd.DataFrame(data, idx, col)
print(df)
I find that all the elements are NaN:
close volume
dt st
20100101 000001.sz NaN NaN
000002.sz NaN NaN
000003.sz NaN NaN
000004.sz NaN NaN
000005.sz NaN NaN
20100102 000001.sz NaN NaN
000002.sz NaN NaN
000003.sz NaN NaN
000004.sz NaN NaN
000005.sz NaN NaN
How can I handle this situation? Thanks.

You only need the index_col parameter in read_csv:
#by positions of columns
data = pd.read_csv('data/trial.csv', index_col=[0,1])
Or:
#by names of columns
data = pd.read_csv('data/trial.csv', index_col=['dt', 'st'])
print (data)
close volume
dt st
20100101 000001.sz 1 10000
000002.sz 10 50000
000003.sz 5 1000
000004.sz 15 7000
000005.sz 100 100000
20100102 000001.sz 2 20000
000002.sz 20 60000
000003.sz 6 2000
000004.sz 20 8000
000005.sz 110 110000
Why are all elements NaN when constructing a MultiIndex Dataframe?
The reason is in the DataFrame constructor:
df = pd.DataFrame(data, idx, col)
The DataFrame data has a RangeIndex, which does not align with the new MultiIndex, so the constructor reindexes the data and produces NaNs.
A possible solution, if each dt always has the same st values, is to select the columns and pass the underlying numpy array so that no alignment takes place, but the index_col and set_index solutions are better:
df = pd.DataFrame(data[col].values, idx, col)
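To see the alignment in isolation, here is a minimal sketch (the data is illustrative, not from the question): the constructor reindexes an existing DataFrame against the new index instead of discarding the old one, so labels that don't match become NaN.
import pandas as pd

src = pd.DataFrame({'close': [1, 10]})  # has a RangeIndex: 0, 1
idx = pd.MultiIndex.from_product([[20100101], ['000001.sz', '000002.sz']],
                                 names=['dt', 'st'])
print(pd.DataFrame(src, index=idx))  # all NaN: no index labels overlap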

Try using set_index() like this:
new_df = df.set_index(['dt', 'st'])
Result:
>>> new_df
close volume
dt st
20100101 000001.sz 1 10000
000002.sz 10 50000
000003.sz 5 1000
000004.sz 15 7000
000005.sz 100 100000
20100102 000001.sz 2 20000
000002.sz 20 60000
000003.sz 6 2000
000004.sz 20 8000
000005.sz 110 110000
>>> new_df.index
MultiIndex(levels=[[20100101, 20100102], ['000001.sz', '000002.sz', '000003.sz', '000004.sz', '000005.sz']],
labels=[[0, 0, 0, 0, 0, 1, 1, 1, 1, 1], [0, 1, 2, 3, 4, 0, 1, 2, 3, 4]],
names=['dt', 'st'])

Related

Python Pandas: find cumulative min per group

I am trying to compute the Desired_Output column, which is defined as follows: for each Name and Subj group, find the min of all previous Scores.
Name Date Subj Score Desired_Output
A 2022-05-11 1200 70.88 69.60
A 2022-03-20 1200 69.96 69.60
A 2022-02-23 1200 69.60 69.63
A 2022-01-26 1200 69.63 70.22
A 2022-01-05 1200 70.35 70.22
A 2021-12-08 1200 70.22 70.69
A 2021-11-17 1000 56.73 null
A 2021-11-10 1200 70.69 null
B 2022-05-07 1600 96.16 94.53
B 2022-04-24 1600 94.53 null
B 2022-03-20 2000 124.60 null
B 2022-02-27 1800 109.16 null
B 2022-02-03 1400 82.54 null
Here is the dataset:
import pandas as pd

df = pd.DataFrame({
    'Name': ['A','A','A','A','A','A','A','A','B','B','B','B','B'],
    'Date': ['2022-05-11','2022-03-20','2022-02-23','2022-01-26','2022-01-05','2021-12-08','2021-11-17','2021-11-10','2022-05-07','2022-04-24','2022-03-20','2022-02-27','2022-02-03'],
    'Subj': [1200,1200,1200,1200,1200,1200,1000,1200,1600,1600,2000,1800,1400],
    'Score': [70.88,69.96,69.6,69.63,70.35,70.22,56.73,70.69,96.16,94.53,124.6,109.16,82.54]})
I don't know how to achieve that in Pandas, especially without looping over the DataFrame.
Assuming the dates are sorted in reverse order, you can use a reversed cummin+shift per group:
df['Desired'] = (df[::-1]
                 .groupby(['Name', 'Subj'])['Score']
                 .apply(lambda s: s.cummin().shift())
                 )
Output:
Name Date Subj Score Desired
0 A 2022-05-11 1200 70.88 69.60
1 A 2022-03-20 1200 69.96 69.60
2 A 2022-02-23 1200 69.60 69.63
3 A 2022-01-26 1200 69.63 70.22
4 A 2022-01-05 1200 70.35 70.22
5 A 2021-12-08 1200 70.22 70.69
6 A 2021-11-17 1000 56.73 NaN
7 A 2021-11-10 1200 70.69 NaN
8 B 2022-05-07 1600 96.16 94.53
9 B 2022-04-24 1600 94.53 NaN
10 B 2022-03-20 2000 124.60 NaN
11 B 2022-02-27 1800 109.16 NaN
12 B 2022-02-03 1400 82.54 NaN
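If the reverse chronological order is not guaranteed, one defensive variant (a sketch, not part of the original answer) is to sort by Date first; the assignment then realigns the result on the original index:
# Sketch: sort chronologically so cummin runs over earlier dates first.
tmp = df.sort_values('Date')
df['Desired'] = (tmp.groupby(['Name', 'Subj'])['Score']
                    .apply(lambda s: s.cummin().shift()))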

Extrapolation of dataset based on index

I am trying to extrapolate my dataset. A snippet looks as follows. A simple linear extrapolation is fine here:
Index Value
3000 NaN
4000 NaN
5000 10
6000 20
6500 33
7000 44
8300 60
9300 NaN
9400 NaN
The extrapolation should take the index values into account. As the pandas package only provides a function for interpolation, I am stuck. I looked at the scipy package, but can't seem to implement my idea. I would really appreciate any help.
I'm more familiar with scikit-learn:
import pandas as pd
import numpy as np
from sklearn.linear_model import LinearRegression

df = pd.DataFrame([(3000, np.nan),
                   (4000, np.nan),
                   (5000, 10),
                   (6000, 20),
                   (6500, 33),
                   (7000, 44),
                   (8300, 60),
                   (9300, np.nan),
                   (9400, np.nan)], columns=['Index', 'Value'])

def extrapolate(df, X_col, y_col):
    # Fit a line on the rows where y is known, then predict for every row.
    df_ = df[[X_col, y_col]].dropna()
    return LinearRegression().fit(
        df_[X_col].values.reshape(-1, 1), df_[y_col]).predict(
        df[X_col].values.reshape(-1, 1))

df['Value_'] = extrapolate(df, 'Index', 'Value')
df
You should obtain something like this:
Index Value Value_
0 3000 NaN -23.219022
1 4000 NaN -7.314802
2 5000 10.0 8.589417
3 6000 20.0 24.493637
4 6500 33.0 32.445747
5 7000 44.0 40.397857
6 8300 60.0 61.073342
7 9300 NaN 76.977562
8 9400 NaN 78.567984
# I assume you don't want to overwrite the original values, only fill the NaNs
df['Value'] = df['Value'].fillna(df['Value_'])
df
Gives:
Index Value Value_
0 3000 -23.219022 -23.219022
1 4000 -7.314802 -7.314802
2 5000 10.000000 8.589417
3 6000 20.000000 24.493637
4 6500 33.000000 32.445747
5 7000 44.000000 40.397857
6 8300 60.000000 61.073342
7 9300 76.977562 76.977562
8 9400 78.567984 78.567984
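Since the question mentions scipy: the same straight-line fit can also be done with scipy.stats.linregress. A minimal sketch, assuming it runs on the original df before the fillna step above:
# Sketch: fit on the observed rows, evaluate the line everywhere.
from scipy.stats import linregress

known = df[['Index', 'Value']].dropna()
fit = linregress(known['Index'], known['Value'])
df['Value_scipy'] = fit.intercept + fit.slope * df['Index']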

Split a pandas dataframe into multiple dataframes when a row is all NaN

I have the following dataframe.
a b c d
0 4.65 30.572857 133.899994 23.705000
1 4.77 30.625713 134.690002 23.225000
2 4.73 30.138571 132.250000 23.040001
3 5.07 30.082857 130.000000 23.290001
4 4.98 30.282858 133.520004 23.389999
5 NaN NaN NaN NaN
6 4.82 29.674286 127.349998 23.700001
7 4.83 30.092857 129.110001 24.254999
8 4.85 29.918571 127.349998 24.695000
9 4.70 29.418571 127.139999 24.424999
10 4.69 30.719999 127.610001 25.200001
11 NaN NaN NaN NaN
12 4.43 29.724285 126.620003 24.764999
13 NaN NaN NaN NaN
14 4.29 29.010000 120.309998 24.730000
15 4.11 29.420000 119.480003 25.035000
I want to split this df into multiple dfs whenever there is a row that is all NaN.
I explored the following links but could not figure out how to apply them to my problem.
Split pandas dataframe in two if it has more than 10 rows
Splitting dataframe into multiple dataframes
In my example, I would have 4 dataframes with 5,5,1 and 2 rows as the output.
Please suggest the way forward.
Using isna, all, cumsum and groupby.
First we check if all the values in a row are NaN, then use cumsum to create a group indicator and finally we save these dataframes in a list with groupby:
grps = df.isna().all(axis=1).cumsum()
dfs = [df.dropna() for _, df in df.groupby(grps)]
for df in dfs:
    print(df)
a b c d
0 4.65 30.572857 133.899994 23.705000
1 4.77 30.625713 134.690002 23.225000
2 4.73 30.138571 132.250000 23.040001
3 5.07 30.082857 130.000000 23.290001
4 4.98 30.282858 133.520004 23.389999
a b c d
6 4.82 29.674286 127.349998 23.700001
7 4.83 30.092857 129.110001 24.254999
8 4.85 29.918571 127.349998 24.695000
9 4.70 29.418571 127.139999 24.424999
10 4.69 30.719999 127.610001 25.200001
a b c d
12 4.43 29.724285 126.620003 24.764999
a b c d
14 4.29 29.01 120.309998 24.730
15 4.11 29.42 119.480003 25.035
Something like this should do the trick:
import pandas as pd
import numpy as np

data_frame = pd.DataFrame({"a": [1, np.nan, 3, np.nan, 4, np.nan, 5],
                           "b": [1, np.nan, 3, np.nan, 4, np.nan, 5],
                           "c": [1, np.nan, 3, np.nan, 4, np.nan, 5],
                           "d": [1, np.nan, 3, np.nan, 4, np.nan, 5],
                           "e": [1, np.nan, 3, np.nan, 4, np.nan, 5],
                           "f": [1, np.nan, 3, np.nan, 4, np.nan, 5]})

all_nan = data_frame.index[data_frame.isnull().all(1)]
df_list = []
prev = 0
for i in all_nan:
    df_list.append(data_frame[prev:i])
    prev = i + 1

for i in df_list:
    print(i)
Just another flavor of doing the same thing:
nan_indices = df.index[df.isna().all(axis=1)]
df_list = [df.dropna() for df in np.split(df, nan_indices)]
df_list
[ a b c d
0 4.65 30.572857 133.899994 23.705000
1 4.77 30.625713 134.690002 23.225000
2 4.73 30.138571 132.250000 23.040001
3 5.07 30.082857 130.000000 23.290001
4 4.98 30.282858 133.520004 23.389999,
a b c d
6 4.82 29.674286 127.349998 23.700001
7 4.83 30.092857 129.110001 24.254999
8 4.85 29.918571 127.349998 24.695000
9 4.70 29.418571 127.139999 24.424999
10 4.69 30.719999 127.610001 25.200001,
a b c d
12 4.43 29.724285 126.620003 24.764999,
a b c d
14 4.29 29.01 120.309998 24.730
15 4.11 29.42 119.480003 25.035]
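One caveat on the last variant: np.split slices by position, while nan_indices holds index labels; the two coincide here only because df has the default RangeIndex. With an arbitrary index, a sketch using positional offsets is safer:
# Sketch: np.flatnonzero yields the positions of the all-NaN rows,
# which is what np.split actually expects.
import numpy as np

positions = np.flatnonzero(df.isna().all(axis=1))
df_list = [part.dropna() for part in np.split(df, positions)]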

How to read a table with extra information as a dataframe and add new columns from that information

I have a file-like object generated from StringIO, which is a table with lines of information ahead of the table (see below, starting from #TIMESTAMP).
I want to add extra columns to the existing table using the information "Date" and "UTCOffset - Time (subtraction)" from #TIMESTAMP and "ZenAngle" from #GLOBAL_SUMMARY.
I used the pd.read_csv command to read it, but it only worked when I skipped the first 8 rows, which include the information I need. Also, the error "TypeError: data argument can't be an iterator" was reported when I tried to import the object below as a dataframe.
#TIMESTAMP
UTCOffset,Date,Time
+00:30:32,2011-09-05,08:32:21
#GLOBAL_SUMMARY
Time,IntACGIH,IntCIE,ZenAngle,MuValue,AzimAngle,Flag,TempC,O3,Err_O3,SO2,Err_SO2,F324
08:32:21,7.3576,52.758,59.109,1.929,114.427,000000,24,291,1,,,91.9
#GLOBAL
Wavelength,S-Irradiance,Time
290.0,0.000e+00
290.5,0.000e+00
291.0,4.380e-06
291.5,2.234e-05
292.0,2.102e-05
292.5,2.204e-05
293.0,2.453e-05
293.5,2.256e-05
294.0,3.088e-05
294.5,4.676e-05
295.0,3.384e-05
295.5,3.582e-05
296.0,4.298e-05
296.5,3.774e-05
297.0,4.779e-05
297.5,7.399e-05
298.0,9.214e-05
298.5,1.080e-04
299.0,2.143e-04
299.5,3.180e-04
300.0,3.337e-04
300.5,4.990e-04
301.0,8.688e-04
301.5,1.210e-03
302.0,1.133e-03
I think you can first use read_csv to create 3 DataFrames:
import pandas as pd
import io
temp=u"""#TIMESTAMP
UTCOffset,Date,Time
+00:30:32,2011-09-05,08:32:21
#GLOBAL_SUMMARY
Time,IntACGIH,IntCIE,ZenAngle,MuValue,AzimAngle,Flag,TempC,O3,Err_O3,SO2,Err_SO2,F324
08:32:21,7.3576,52.758,59.109,1.929,114.427,000000,24,291,1,,,91.9
#GLOBAL
Wavelength,S-Irradiance,Time
290.0,0.000e+00
290.5,0.000e+00
291.0,4.380e-06
291.5,2.234e-05
292.0,2.102e-05
292.5,2.204e-05
293.0,2.453e-05
293.5,2.256e-05
294.0,3.088e-05
294.5,4.676e-05
295.0,3.384e-05
295.5,3.582e-05
296.0,4.298e-05
296.5,3.774e-05
297.0,4.779e-05
297.5,7.399e-05
298.0,9.214e-05
298.5,1.080e-04
299.0,2.143e-04
299.5,3.180e-04
300.0,3.337e-04
300.5,4.990e-04
301.0,8.688e-04
301.5,1.210e-03
302.0,1.133e-03
"""
df1 = pd.read_csv(io.StringIO(temp), skiprows=7)
print (df1)
Wavelength S-Irradiance Time
0 290.0 0.000000 NaN
1 290.5 0.000000 NaN
2 291.0 0.000004 NaN
3 291.5 0.000022 NaN
4 292.0 0.000021 NaN
5 292.5 0.000022 NaN
6 293.0 0.000025 NaN
7 293.5 0.000023 NaN
8 294.0 0.000031 NaN
9 294.5 0.000047 NaN
10 295.0 0.000034 NaN
11 295.5 0.000036 NaN
12 296.0 0.000043 NaN
13 296.5 0.000038 NaN
14 297.0 0.000048 NaN
15 297.5 0.000074 NaN
16 298.0 0.000092 NaN
17 298.5 0.000108 NaN
18 299.0 0.000214 NaN
19 299.5 0.000318 NaN
20 300.0 0.000334 NaN
21 300.5 0.000499 NaN
22 301.0 0.000869 NaN
23 301.5 0.001210 NaN
24 302.0 0.001133 NaN
df2 = pd.read_csv(io.StringIO(temp), skiprows=1, nrows=1)
print (df2)
UTCOffset Date Time
0 +00:30:32 2011-09-05 08:32:21
df3 = pd.read_csv(io.StringIO(temp), skiprows=4, nrows=1)
print (df3)
Time IntACGIH IntCIE ZenAngle MuValue AzimAngle Flag TempC O3 \
0 08:32:21 7.3576 52.758 59.109 1.929 114.427 0 24 291
Err_O3 SO2 Err_SO2 F324
0 1 NaN NaN 91.9
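A note on the asker's TypeError: if you read repeatedly from the same StringIO object instead of rebuilding it from the string each time as above, each read_csv call consumes the buffer, so rewind it between reads. A sketch, where sio stands in for the file-like object from the question:
# Sketch: rewind the buffer before each of the three reads.
sio.seek(0)
df2 = pd.read_csv(sio, skiprows=1, nrows=1)  # #TIMESTAMP block
sio.seek(0)
df3 = pd.read_csv(sio, skiprows=4, nrows=1)  # #GLOBAL_SUMMARY block
sio.seek(0)
df1 = pd.read_csv(sio, skiprows=7)           # #GLOBAL table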

Dataframe Merge in Pandas

For some reason, I cannot get this merge to work correctly.
This Dataframe (rspars) has 2,000+ rows...
rsparid f1mult f2mult f3mult
0 1 0.318 0.636 0.810
1 2 0.348 0.703 0.893
2 3 0.384 0.777 0.000
3 4 0.296 0.590 0.911
4 5 0.231 0.458 0.690
5 6 0.275 0.546 0.839
6 7 0.248 0.486 0.731
7 8 0.430 0.873 0.000
8 9 0.221 0.438 0.655
9 11 0.204 0.399 0.593
When trying to join the above, on the rsparid column, to this Dataframe...
line_track line_race rsparid
line_date
2013-03-23 TP 10 1400
2013-02-23 GP 7 634
2013-01-01 GP 7 1508
2012-11-11 AQU 5 96
2012-10-11 BEL 2 161
Using this...
df = pd.merge(datalines, rspars, how='left', on='rsparid')
I get blanks:
line_track line_race rsparid f1mult f2mult f3mult
0 TP 10 1400 NaN NaN NaN
1 TP 10 1400 NaN NaN NaN
2 TP 10 1400 NaN NaN NaN
3 GP 7 634 NaN NaN NaN
4 GP 10 634 NaN NaN NaN
Note that "datalines" can have thousands more rows than rspars, thus the left join. I must be doing something wrong?
I also tried it this way...
df = datalines.merge(rspars, how='left', on='rsparid')
EXAMPLE #2
I dropped the data down to a few rows...
rspars:
rsparid f1mult f2mult f3mult
0 1400 0.216 0.435 0.656
datalines:
rsparid
0 1400
1 634
2 1508
3 96
4 161
5 1011
6 1007
7 518
8 1955
9 678
Merging...
datalines.merge(rspars, how='left', on='rsparid')
Output...
rsparid f1mult f2mult f3mult
0 1400 NaN NaN NaN
1 634 NaN NaN NaN
2 1508 NaN NaN NaN
3 96 NaN NaN NaN
4 161 NaN NaN NaN
5 1011 NaN NaN NaN
6 1007 NaN NaN NaN
7 518 NaN NaN NaN
8 1955 NaN NaN NaN
9 678 NaN NaN NaN
The NaNs mean the two frames have no rsparid values in common. This can be tricky when merging things that look the same when they repr.
The repr of small DataFrames with strings (of integers) or integers looks the same, and no dtype information is printed when frames are small. You can get this information (and more) for small frames by calling the DataFrame.info() method, like so: df.info(). This will give you a nice summary of what's in the DataFrame and what the dtypes of its columns are:
In [205]: datalines_int = DataFrame({'rsparid':[1400,634,1508,96,161,1011,1007,518,1955,678]})
In [206]: datalines_str = DataFrame({'rsparid':map(str,[1400,634,1508,96,161,1011,1007,518,1955,678])})
In [207]: datalines_int
Out[207]:
rsparid
0 1400
1 634
2 1508
3 96
4 161
5 1011
6 1007
7 518
8 1955
9 678
In [208]: datalines_str
Out[208]:
rsparid
0 1400
1 634
2 1508
3 96
4 161
5 1011
6 1007
7 518
8 1955
9 678
In [209]: datalines_int.info()
<class 'pandas.core.frame.DataFrame'>
Int64Index: 10 entries, 0 to 9
Data columns (total 1 columns):
rsparid 10 non-null values
dtypes: int64(1)
In [210]: datalines_str.info()
<class 'pandas.core.frame.DataFrame'>
Int64Index: 10 entries, 0 to 9
Data columns (total 1 columns):
rsparid 10 non-null values
dtypes: object(1)
NOTE: You'll notice a slight difference in the reprs here, most likely because of the padding of numeric DataFrames. The point is, no one would really be able to see the difference interactively unless they were specifically looking for it.
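Once info() reveals the mismatch, a minimal sketch of the fix (assuming the keys really are the same values, just different dtypes) is to cast one side before merging:
# Sketch: put both key columns on a common dtype, then merge again.
datalines['rsparid'] = datalines['rsparid'].astype('int64')
df = datalines.merge(rspars, how='left', on='rsparid')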
