Extract sub-DataFrames - python

I have this kind of dataframe in pandas:
NaN
1
NaN
452
1175
12
NaN
NaN
NaN
145
125
NaN
1259
2178
2514
1
On the other hand I have this other dataframe:
1
2
3
4
5
6
I would like to separate the first one into different sub-dataframes like this:
DataFrame 1:
1
DataFrame 2:
452
1175
12
DataFrame 3:
DataFrame 4:
DataFrame 5:
145
125
DataFrame 6:
1259
2178
2514
1
How can I do that without a loop?

UPDATE: thanks to @piRSquared for pointing out that the solution below will not work for DFs/Series with non-numeric indexes. Here is a more generic solution:
dfs = [x.dropna()
       for x in np.split(df, np.arange(len(df))[df['column'].isnull().values])]
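To see why this version is index-agnostic, here is a minimal sketch (the string-labelled frame is made up for illustration): np.arange(len(df)) yields positional split points, so the index labels never enter the computation.

import numpy as np
import pandas as pd

# Hypothetical frame with a string index, where label-based arithmetic
# like df.index[...] + 1 would not give row positions.
df = pd.DataFrame({'column': [np.nan, 1.0, np.nan, 452.0]},
                  index=['a', 'b', 'c', 'd'])

dfs = [x.dropna()
       for x in np.split(df, np.arange(len(df))[df['column'].isnull().values])]
# dfs -> [empty frame, row 'b' (1.0), row 'd' (452.0)]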
OLD answer:
IIUC you can do something like this:
Source DF:
In [40]: df
Out[40]:
column
0 NaN
1 1.0
2 NaN
3 452.0
4 1175.0
5 12.0
6 NaN
7 NaN
8 NaN
9 145.0
10 125.0
11 NaN
12 1259.0
13 2178.0
14 2514.0
15 1.0
Solution:
In [31]: dfs = [x.dropna()
                for x in np.split(df, df.index[df['column'].isnull()].values + 1)]
In [32]: dfs[0]
Out[32]:
Empty DataFrame
Columns: [column]
Index: []
In [33]: dfs[1]
Out[33]:
column
1 1.0
In [34]: dfs[2]
Out[34]:
column
3 452.0
4 1175.0
5 12.0
In [35]: dfs[3]
Out[35]:
Empty DataFrame
Columns: [column]
Index: []
In [36]: dfs[4]
Out[36]:
Empty DataFrame
Columns: [column]
Index: []
In [38]: dfs[5]
Out[38]:
column
9 145.0
10 125.0
In [39]: dfs[6]
Out[39]:
column
12 1259.0
13 2178.0
14 2514.0
15 1.0

Another approach: locate the NaN positions with np.where, append len(df) as a final boundary, then slice between consecutive markers:
w = np.append(np.where(np.isnan(df.iloc[:, 0].values))[0], len(df))
splits = {'DataFrame{}'.format(c): df.iloc[i + 1:j]
          for c, (i, j) in enumerate(zip(w, w[1:]))}
Print out splits to demonstrate
for k, v in splits.items():
    print(k)
    print(v)
    print()
DataFrame0
0
1 1.0
DataFrame1
0
3 452.0
4 1175.0
5 12.0
DataFrame2
Empty DataFrame
Columns: [0]
Index: []
DataFrame3
Empty DataFrame
Columns: [0]
Index: []
DataFrame4
0
9 145.0
10 125.0
DataFrame5
0
12 1259.0
13 2178.0
14 2514.0
15 1.0


Summarize rows in pandas dataframe by column value and append specific column values as columns [duplicate]

I have a dataframe as follows with multiple rows per id (maximum 3).
dat = pd.DataFrame({'id':[1,1,1,2,2,3,4,4], 'code': ["A","B","D","B","D","A","A","D"], 'amount':[11,2,5,22,5,32,11,5]})
id code amount
0 1 A 11
1 1 B 2
2 1 D 5
3 2 B 22
4 2 D 5
5 3 A 32
6 4 A 11
7 4 D 5
I want to consolidate the df and have only one row per id so that it looks as follows:
id code1 amount1 code2 amount2 code3 amount3
0 1 A 11 B 2 D 5
1 2 B 22 D 5 NaN NaN
2 3 A 32 NaN NaN NaN NaN
3 4 A 11 D 5 NaN NaN
How can I achieve this in pandas?
Use GroupBy.cumcount to build a per-id counter, reshape with DataFrame.unstack and DataFrame.sort_index, then flatten the MultiIndex columns and convert id back to a column with DataFrame.reset_index:
df = (dat.set_index(['id', dat.groupby('id').cumcount().add(1)])
         .unstack()
         .sort_index(axis=1, level=1, sort_remaining=False))
df.columns = df.columns.map(lambda x: f'{x[0]}{x[1]}')
df = df.reset_index()
print(df)
id code1 amount1 code2 amount2 code3 amount3
0 1 A 11.0 B 2.0 D 5.0
1 2 B 22.0 D 5.0 NaN NaN
2 3 A 32.0 NaN NaN NaN NaN
3 4 A 11.0 D 5.0 NaN NaN
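To see what the counter contributes, this is the intermediate cumcount result for the sample data, worked out from the code above:

dat.groupby('id').cumcount().add(1)
# 0    1
# 1    2
# 2    3
# 3    1
# 4    2
# 5    1
# 6    1
# 7    2
# dtype: int64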

Access keys of pandas dataframe when using groupby [duplicate]

I have a multi-index data frame with columns 'A' and 'B'.
Is there is a way to select rows by filtering on one column of the multi-index without resetting the index to a single column index?
For Example.
# has multi-index (A,B)
df
# can I do this? I know this doesn't work because the index is multi-index so I need to specify a tuple
df.ix[df.A == 1]
One way is to use the get_level_values Index method:
In [11]: df
Out[11]:
0
A B
1 4 1
2 5 2
3 6 3
In [12]: df.iloc[df.index.get_level_values('A') == 1]
Out[12]:
0
A B
1 4 1
In 0.13 you'll be able to use xs with drop_level argument:
df.xs(1, level='A', drop_level=False) # axis=1 if columns
Note: if this were column MultiIndex rather than index, you could use the same technique:
In [21]: df1 = df.T
In [22]: df1.iloc[:, df1.columns.get_level_values('A') == 1]
Out[22]:
A 1
B 4
0 1
You can also use query which is very readable in my opinion and straightforward to use:
import pandas as pd
df = pd.DataFrame({'A': [1, 2, 3, 4], 'B': [10, 20, 50, 80], 'C': [6, 7, 8, 9]})
df = df.set_index(['A', 'B'])
C
A B
1 10 6
2 20 7
3 50 8
4 80 9
For what you had in mind you can now simply do:
df.query('A == 1')
C
A B
1 10 6
You can also have more complex queries using and:
df.query('A >= 1 and B >= 50')
C
A B
3 50 8
4 80 9
and using or:
df.query('A == 1 or B >= 50')
C
A B
1 10 6
3 50 8
4 80 9
You can also query on different index levels, e.g.
df.query('A == 1 or C >= 8')
will return
C
A B
1 10 6
3 50 8
4 80 9
If you want to use variables inside your query, you can use @:
b_threshold = 20
c_threshold = 8
df.query('B >= @b_threshold and C <= @c_threshold')
C
A B
2 20 7
3 50 8
You can use DataFrame.xs():
In [36]: df = DataFrame(np.random.randn(10, 4))
In [37]: df.columns = [np.random.choice(['a', 'b'], size=4).tolist(), np.random.choice(['c', 'd'], size=4)]
In [38]: df.columns.names = ['A', 'B']
In [39]: df
Out[39]:
A b a
B d d d d
0 -1.406 0.548 -0.635 0.576
1 -0.212 -0.583 1.012 -1.377
2 0.951 -0.349 -0.477 -1.230
3 0.451 -0.168 0.949 0.545
4 -0.362 -0.855 1.676 -2.881
5 1.283 1.027 0.085 -1.282
6 0.583 -1.406 0.327 -0.146
7 -0.518 -0.480 0.139 0.851
8 -0.030 -0.630 -1.534 0.534
9 0.246 -1.558 -1.885 -1.543
In [40]: df.xs('a', level='A', axis=1)
Out[40]:
B d d
0 -0.635 0.576
1 1.012 -1.377
2 -0.477 -1.230
3 0.949 0.545
4 1.676 -2.881
5 0.085 -1.282
6 0.327 -0.146
7 0.139 0.851
8 -1.534 0.534
9 -1.885 -1.543
If you want to keep the A level (the drop_level keyword argument is only available starting from v0.13.0):
In [42]: df.xs('a', level='A', axis=1, drop_level=False)
Out[42]:
A a
B d d
0 -0.635 0.576
1 1.012 -1.377
2 -0.477 -1.230
3 0.949 0.545
4 1.676 -2.881
5 0.085 -1.282
6 0.327 -0.146
7 0.139 0.851
8 -1.534 0.534
9 -1.885 -1.543
Understanding how to access a multi-indexed pandas DataFrame can help you with all kinds of tasks like this.
Copy-paste this into your code to generate the example:
# hierarchical indices and columns
index = pd.MultiIndex.from_product([[2013, 2014], [1, 2]],
                                   names=['year', 'visit'])
columns = pd.MultiIndex.from_product([['Bob', 'Guido', 'Sue'], ['HR', 'Temp']],
                                     names=['subject', 'type'])
# mock some data (random, so your exact numbers will differ from the outputs below)
data = np.round(np.random.randn(4, 6), 1)
data[:, ::2] *= 10
data += 37
# create the DataFrame
health_data = pd.DataFrame(data, index=index, columns=columns)
health_data
This will give you a table like this:
Standard access by column
health_data['Bob']
type HR Temp
year visit
2013 1 22.0 38.6
2 52.0 38.3
2014 1 30.0 38.9
2 31.0 37.3
health_data['Bob']['HR']
year visit
2013 1 22.0
2 52.0
2014 1 30.0
2 31.0
Name: HR, dtype: float64
# filtering by column/subcolumn - your case:
health_data['Bob']['HR']==22
year visit
2013 1 True
2 False
2014 1 False
2 False
health_data['Bob']['HR'][2013]
visit
1 22.0
2 52.0
Name: HR, dtype: float64
health_data['Bob']['HR'][2013][1]
22.0
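As an aside (not part of the original answer), the chained lookups above can be collapsed into single .loc calls with tuples, which avoids chained-indexing pitfalls:

health_data.loc[:, ('Bob', 'HR')]          # same as health_data['Bob']['HR']
health_data.loc[(2013, 1), ('Bob', 'HR')]  # same scalar as the chain above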
Access by row
health_data.loc[2013]
subject Bob Guido Sue
type HR Temp HR Temp HR Temp
visit
1 22.0 38.6 40.0 38.9 53.0 37.5
2 52.0 38.3 42.0 34.6 30.0 37.7
health_data.loc[2013,1]
subject type
Bob HR 22.0
Temp 38.6
Guido HR 40.0
Temp 38.9
Sue HR 53.0
Temp 37.5
Name: (2013, 1), dtype: float64
health_data.loc[2013,1]['Bob']
type
HR 22.0
Temp 38.6
Name: (2013, 1), dtype: float64
health_data.loc[2013,1]['Bob']['HR']
22.0
Slicing multi-index
idx = pd.IndexSlice
health_data.loc[idx[:,1], idx[:,'HR']]
subject Bob Guido Sue
type HR HR HR
year visit
2013 1 22.0 40.0 53.0
2014 1 30.0 52.0 45.0
You can use DataFrame.loc:
>>> df.loc[1]
Example
>>> print(df)
result
A B C
1 1 1 6
2 9
2 1 8
2 11
2 1 1 7
2 10
2 1 9
2 12
>>> print(df.loc[1])
result
B C
1 1 6
2 9
2 1 8
2 11
>>> print(df.loc[2, 1])
result
C
1 7
2 10
Another option is:
filter1 = df.index.get_level_values('A') == 1
filter2 = df.index.get_level_values('B') == 4
df.iloc[filter1 & filter2]
Out[11]:
0
A B
1 4 1
You can use MultiIndex slicing. For example:
arrays = [["bar", "bar", "baz", "baz", "foo", "foo", "qux", "qux"],
          ["one", "two", "one", "two", "one", "two", "one", "two"]]
tuples = list(zip(*arrays))
index = pd.MultiIndex.from_tuples(tuples, names=["A", "B"])
df = pd.DataFrame(np.random.randint(9, size=(8, 2)), index=index, columns=["col1", "col2"])
col1 col2
A B
bar one 0 8
two 4 8
baz one 6 0
two 7 3
foo one 6 8
two 2 6
qux one 7 0
two 6 4
To select all from A and two from B:
df.loc[(slice(None), 'two'), :]
Output:
col1 col2
A B
bar two 4 8
baz two 7 3
foo two 2 6
qux two 6 4
To select bar and baz from A and two from B:
df.loc[(['bar', 'baz'], 'two'), :]
Output:
col1 col2
A B
bar two 4 8
baz two 7 3
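The same selections can also be written with pd.IndexSlice (shown earlier on this page), which reads a little more naturally than bare slice(None) tuples:

idx = pd.IndexSlice
df.loc[idx[:, 'two'], :]               # all of A, 'two' from B
df.loc[idx[['bar', 'baz'], 'two'], :]  # 'bar'/'baz' from A, 'two' from B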

How to sort a big dataframe by two columns?

I have a big dataframe which records all price info for the stock market.
In this dataframe there are two index columns, 'time' and 'con'.
Here is an example:
In [15]: df = pd.DataFrame(np.reshape(range(20), (5,4)))
In [16]: df
Out[16]:
0 1 2 3
0 0 1 2 3
1 4 5 6 7
2 8 9 10 11
3 12 13 14 15
4 16 17 18 19
In [17]: df.columns = ['open', 'high', 'low', 'close']
In [18]: df['tme'] = ['9:00','9:00', '9:01', '9:01', '9:02']
In [19]: df['con'] = ['a', 'b', 'a', 'b', 'a']
In [20]: df
Out[20]:
open high low close tme con
0 0 1 2 3 9:00 a
1 4 5 6 7 9:00 b
2 8 9 10 11 9:01 a
3 12 13 14 15 9:01 b
4 16 17 18 19 9:02 a
What I want is a dataframe like this:
## here is the close dataframe, which only contains close info, indexed by 'time' and 'con'
Out[31]:
a b
9:00 3 7.0
9:01 11 15.0
9:02 19 NaN
How can I get this dataframe?
Use df.pivot:
In [117]: df.pivot('tme', 'con', 'close')
Out[117]:
con a b
tme
9:00 3.0 7.0
9:01 11.0 15.0
9:02 19.0 NaN
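Note: in pandas 2.0 and later, DataFrame.pivot only accepts keyword arguments, so the positional call above no longer runs; the equivalent keyword form is:

df.pivot(index='tme', columns='con', values='close')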
One solution is to use pivot_table. Try this out:
df.pivot_table(index=df['tme'], columns='con', values='close')
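A third route (not from the original answers) is set_index plus unstack, which behaves like pivot here:

df.set_index(['tme', 'con'])['close'].unstack()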

Form single Row from all rows with corresponding values in pandas

I have dataframe as follows:
2017 2018
A B C A B C
0 12 NaN NaN 98 NaN NaN
1 NaN 23 NaN NaN 65 NaN
2 NaN NaN 45 NaN NaN 43
I want to convert this data frame into:
2017 2018
A B C A B C
0 12 23 45 98 65 43
First back-fill the missing values, then select the first row with double brackets so the result stays a one-row DataFrame:
df = df.bfill().iloc[[0]]
# alternative
# df = df.ffill().iloc[[-1]]
print (df)
2017 2018
A B C A B C
0 12.0 23.0 45.0 98.0 65.0 43.0
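The double brackets matter: single brackets return a Series, double brackets keep a one-row DataFrame:

df.bfill().iloc[0]    # Series: the row flattened to 1-D
df.bfill().iloc[[0]]  # DataFrame: one row, column structure preserved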
One could sum along the columns:
import pandas as pd
import numpy as np
# Create DataFrame:
tmp = np.hstack((np.diag([12., 23., 42.]), np.diag([98., 65., 43.])))
tmp[tmp == 0] = np.nan
df = pd.DataFrame(tmp)
# Sum:
df2 = pd.DataFrame(df.sum(axis=0)).T
Resulting in:
0 1 2 3 4 5
0 12.0 23.0 42.0 98.0 65.0 43.0
This is convenient because DataFrame.sum ignores NaN by default. A couple of notes:
One loses the column names in this approach.
All-NaN columns will return 0 in the result.
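If the second caveat matters, sum's min_count parameter (available since pandas 0.22) keeps all-NaN columns as NaN instead of 0:

df2 = pd.DataFrame(df.sum(axis=0, min_count=1)).T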

summarizing data frame in pandas - python

df = pd.DataFrame({'a':['y',NaN,'y',NaN,NaN,'x','x','y',NaN],'b':[NaN,'x',NaN,'y','x',NaN,NaN,NaN,'y'],'d':[1,0,0,1,1,1,0,1,0]})
I'm trying to summarize this dataframe using sum. I thought df.groupby(['a','b']).aggregate(sum) would work but it returns an empty Series.
How can I achieve this result?
a b
x 1 1
y 2 1
import numpy as np
import pandas as pd

NaN = np.nan
df = pd.DataFrame(
    {'a': ['y', NaN, 'y', NaN, NaN, 'x', 'x', 'y', NaN],
     'b': [NaN, 'x', NaN, 'y', 'x', NaN, NaN, NaN, 'y'],
     'd': [32, 12, 55, 98, 23, 11, 9, 91, 3]})
melted = pd.melt(df, id_vars=['d'], value_vars=['a', 'b'])
result = pd.pivot_table(melted, values='d', index=['value'], columns=['variable'],
                        aggfunc=np.median)
print(result)
yields
variable a b
value
x 10.0 17.5
y 55.0 50.5
Explanation:
Melting the DataFrame with melted = pd.melt(df, id_vars=['d'], value_vars=['a', 'b']) produces
d variable value
0 32 a y
1 12 a NaN
2 55 a y
3 98 a NaN
4 23 a NaN
5 11 a x
6 9 a x
7 91 a y
8 3 a NaN
9 32 b NaN
10 12 b x
11 55 b NaN
12 98 b y
13 23 b x
14 11 b NaN
15 9 b NaN
16 91 b NaN
17 3 b y
and now we can use pd.pivot_table to pivot and aggregate the d values:
result = pd.pivot_table(melted, values='d', index=['value'], columns=['variable'],
                        aggfunc=np.median)
Note that the aggfunc can take a list of functions, such as [np.sum, np.median, np.min, np.max, np.std] if you wish to summarize the data in more than one way.
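Circling back to the question's own frame: the same melt + pivot_table pattern with aggfunc='sum' reproduces the exact table the asker wanted (a sketch using the original d values):

import numpy as np
import pandas as pd

df = pd.DataFrame({'a': ['y', np.nan, 'y', np.nan, np.nan, 'x', 'x', 'y', np.nan],
                   'b': [np.nan, 'x', np.nan, 'y', 'x', np.nan, np.nan, np.nan, 'y'],
                   'd': [1, 0, 0, 1, 1, 1, 0, 1, 0]})

melted = pd.melt(df, id_vars=['d'], value_vars=['a', 'b'])
print(pd.pivot_table(melted, values='d', index=['value'],
                     columns=['variable'], aggfunc='sum'))
# variable  a  b
# value
# x         1  1
# y         2  1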
