How to sort a big dataframe by two columns? - python

I have a big dataframe which records all the price info for a stock market.
In this dataframe there are two pieces of index info, 'time' and 'con'.
Here is an example:
In [15]: df = pd.DataFrame(np.reshape(range(20), (5,4)))
In [16]: df
Out[16]:
0 1 2 3
0 0 1 2 3
1 4 5 6 7
2 8 9 10 11
3 12 13 14 15
4 16 17 18 19
In [17]: df.columns = ['open', 'high', 'low', 'close']
In [18]: df['tme'] = ['9:00','9:00', '9:01', '9:01', '9:02']
In [19]: df['con'] = ['a', 'b', 'a', 'b', 'a']
In [20]: df
Out[20]:
open high low close tme con
0 0 1 2 3 9:00 a
1 4 5 6 7 9:00 b
2 8 9 10 11 9:01 a
3 12 13 14 15 9:01 b
4 16 17 18 19 9:02 a
What I want is a dataframe like this:
## the close dataframe, which contains only the close info, indexed by 'time' and 'con'
Out[31]:
a b
9:00 3 7.0
9:01 11 15.0
9:02 19 NaN
How can I get this dataframe?

Use df.pivot:
In [117]: df.pivot('tme', 'con', 'close')
Out[117]:
con a b
tme
9:00 3.0 7.0
9:01 11.0 15.0
9:02 19.0 NaN

One solution is to use pivot_table. Try this out:
df.pivot_table(index=df['tme'], columns='con', values='close')
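Both answers can be reproduced end to end with a small, self-contained sketch (note that in recent pandas versions pivot takes index, columns and values as keyword-only arguments, so keywords are used below):
import numpy as np
import pandas as pd
df = pd.DataFrame(np.reshape(range(20), (5, 4)),
                  columns=['open', 'high', 'low', 'close'])
df['tme'] = ['9:00', '9:00', '9:01', '9:01', '9:02']
df['con'] = ['a', 'b', 'a', 'b', 'a']
# pivot requires a single value per (tme, con) pair
close = df.pivot(index='tme', columns='con', values='close')
# pivot_table gives the same result here and also aggregates duplicate pairs
close_pt = df.pivot_table(index='tme', columns='con', values='close')
print(close)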

Related

How can I fill missing data in my dataframe faster for a big dataset, and without a SettingWithCopyWarning?

I have a dataframe with the count of people per day and per location. Without any missing data, I would expect 4 rows per day: 2 locations and 2 genders. Some data is missing and should be replaced by the mean count, but only if that location has data for that gender on the day before.
If data is missing for more than 1 day, I assume there is supposed to be no data. For example, in my example dataframe Day 2, Location X, Gender F should be filled, because Day 1, Location X, Gender F exists. But Day 4, Location Y, Gender F must stay empty, because Day 3, Location Y, Gender F does not exist.
The code below works for this small dataframe, but it's really slow for my large dataset. Is there a way to do this faster?
Can I avoid the SettingWithCopyWarning in this case?
import pandas as pd
import numpy as np
import random
data = pd.DataFrame({'day': [1, 1, 2, 3, 3, 4, 5, 1, 2],
                     'location': ['X', 'X', 'X', 'X', 'X', 'X', 'X', 'Y', 'Y'],
                     'gender': ['F', 'M', 'M', 'F', 'M', 'F', 'F', 'F', 'F'],
                     'count': random.sample(range(10, 30), 9)})
print(data.sort_values('day').reset_index(drop=True))
day location gender count
0 1 X F 17
1 1 X M 10
2 1 Y F 12
3 2 X M 20
4 2 Y F 15
5 3 X F 24
6 3 X M 29
7 4 X F 11
8 5 X F 14
data2 = pd.DataFrame()
for e, today in enumerate(list(set(data['day'].sort_values()))[1:]):
    yesterday = list(set(data['day'].sort_values()))[e]
    today_df = data[(data['day'] == today)].set_index(['location', 'gender'])
    yesterday_df = data[(data['day'] == yesterday)].set_index(['location', 'gender'])
    today_missing = [[i[0], i[1]] for i in yesterday_df.index if i not in today_df.index]
    for i in today_missing:
        new_row = data[(data['day'] == yesterday) & (data['location'] == i[0]) & (data['gender'] == i[1])]
        new_row['day'] = today
        new_row['count'] = int(np.mean(data['count'][(data['location'] == i[0]) & (data['gender'] == i[1])]))
        data2 = data2.append(new_row, ignore_index=True)
data = data.append(data2).sort_values('day').reset_index(drop=True)
print(data)
day location gender count
0 1 X F 17
1 1 X M 10
2 1 Y F 12
3 2 X M 20
4 2 Y F 15
5 2 X F 16
6 3 X F 24
7 3 X M 29
8 3 Y F 13
9 4 X F 11
10 4 X M 19
11 5 X F 14
One solution can be to re-generate the possible combinations of location, gender and day:
df = (data.set_index(['location', 'gender', 'day'])
          .reindex(pd.MultiIndex.from_product(
              [['X', 'Y'], ['F', 'M'], range(1, 8)],
              names=['location', 'gender', 'day'])))
count
location gender day
X F 1 17.0
2 NaN
3 24.0
4 11.0
5 14.0
6 NaN
7 NaN
M 1 10.0
2 20.0
3 29.0
4 NaN
5 NaN
6 NaN
7 NaN
Y F 1 12.0
2 15.0
3 NaN
4 NaN
5 NaN
6 NaN
7 NaN
M 1 NaN
2 NaN
3 NaN
4 NaN
5 NaN
6 NaN
7 NaN
1: Solution filling with the mean per (location, gender) group
df.groupby(['location', 'gender']).transform(lambda x: x.fillna(x.mean(), limit=1)).dropna()
count
location gender day
X F 1 17.000000
2 16.500000
3 24.000000
4 11.000000
5 14.000000
M 1 10.000000
2 20.000000
3 29.000000
4 19.666667
Y F 1 12.000000
2 15.000000
3 13.500000
2: Solution interpolating linearly between days
Another solution is to interpolate between days within the (location, gender) groups, with a filling limit of 1 day:
df.interpolate(level=['location', 'gender'], limit=1).dropna()
count
location gender day
X F 1 17.000000
2 20.500000
3 24.000000
4 11.000000
5 14.000000
6 12.666667
M 1 10.000000
2 20.000000
3 29.000000
4 25.600000
Y F 1 12.000000
2 15.000000
3 15.000000
You can remove the MultiIndex by doing df.reset_index(). Hope it helps.
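The second part of the question, the SettingWithCopyWarning, is raised because the loop assigns into a filtered slice of data (new_row['day'] = today). A minimal, self-contained sketch of the usual fix, which is to take an explicit copy before mutating:
import pandas as pd
data = pd.DataFrame({'day': [1, 1, 2], 'count': [17, 10, 20]})
# filtering may return a view; .copy() makes the new frame independent,
# so assigning to it no longer triggers SettingWithCopyWarning
new_row = data[data['day'] == 1].copy()
new_row['day'] = 2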

Pandas groupby datetime columns by periods

I have the following dataframe:
df = pd.DataFrame(np.array([[1,2,3,4,7,9,5],[2,6,5,4,9,8,2],[3,5,3,21,12,6,7],[1,7,8,4,3,4,3]]),
                  columns=['9:00:00','9:05:00','09:10:00','09:15:00','09:20:00','09:25:00','09:30:00'])
>>> 9:00:00 9:05:00 09:10:00 09:15:00 09:20:00 09:25:00 09:30:00 ....
a 1 2 3 4 7 9 5
b 2 6 5 4 9 8 2
c 3 5 3 21 12 6 7
d 1 7 8 4 3 4 3
I would like to get, for each row (e.g. a, b, c, d, ...), the mean value between specific hours. The hours are between 9 and 15, and I want to group by period, for example to calculate the mean value between 09:00:00 and 11:00:00, between 11 and 12, between 13 and 15 (or any period I decide on).
I first tried to convert the column values to datetime format, thinking it would then be easier to do this:
df.columns = pd.to_datetime(df.columns,format="%H:%M:%S")
but then I got column names with a fake year, "1900-01-01 09:00:00"...
Also, the column headers' type was object, so I felt a bit lost...
My end goal is to be able to calculate new columns with the mean value for each row, using only the columns that fall inside the defined time period (e.g. 9-11 etc.).
If you need a regular period, e.g. every 2 hours:
df.columns = pd.to_datetime(df.columns,format="%H:%M:%S")
df1 = df.resample('2H', axis=1).mean()
print (df1)
1900-01-01 08:00:00
0 4.428571
1 5.142857
2 8.142857
3 4.285714
If you need custom periods, it is possible to use cut:
df.columns = pd.to_datetime(df.columns,format="%H:%M:%S")
bins = ['5:00:00','9:00:00','11:00:00','12:00:00', '23:59:59']
dates = pd.to_datetime(bins,format="%H:%M:%S")
labels = [f'{i}-{j}' for i, j in zip(bins[:-1], bins[1:])]
df.columns = pd.cut(df.columns, bins=dates, labels=labels, right=False)
print (df)
9:00:00-11:00:00 9:00:00-11:00:00 9:00:00-11:00:00 9:00:00-11:00:00 \
0 1 2 3 4
1 2 6 5 4
2 3 5 3 21
3 1 7 8 4
9:00:00-11:00:00 9:00:00-11:00:00 9:00:00-11:00:00
0 7 9 5
1 9 8 2
2 12 6 7
3 3 4 3
And last, take the mean per column; the reason for the NaN columns is that the column labels are categorical:
df2 = df.mean(level=0, axis=1)
print (df2)
9:00:00-11:00:00 5:00:00-9:00:00 11:00:00-12:00:00 12:00:00-23:59:59
0 4.428571 NaN NaN NaN
1 5.142857 NaN NaN NaN
2 8.142857 NaN NaN NaN
3 4.285714 NaN NaN NaN
To avoid the NaN columns, convert the column names to strings:
df3 = df.rename(columns=str).mean(level=0, axis=1)
print (df3)
9:00:00-11:00:00
0 4.428571
1 5.142857
2 8.142857
3 4.285714
EDIT: The solution above with timedeltas, because the format is HH:MM:SS:
df.columns = pd.to_timedelta(df.columns)
print (df)
0 days 09:00:00 0 days 09:05:00 0 days 09:10:00 0 days 09:15:00 \
0 1 2 3 4
1 2 6 5 4
2 3 5 3 21
3 1 7 8 4
0 days 09:20:00 0 days 09:25:00 0 days 09:30:00
0 7 9 5
1 9 8 2
2 12 6 7
3 3 4 3
bins = ['9:00:00','11:00:00','12:00:00']
dates = pd.to_timedelta(bins)
labels = [f'{i}-{j}' for i, j in zip(bins[:-1], bins[1:])]
df.columns = pd.cut(df.columns, bins=dates, labels=labels, right=False)
print (df)
9:00:00-11:00:00 9:00:00-11:00:00 9:00:00-11:00:00 9:00:00-11:00:00 \
0 1 2 3 4
1 2 6 5 4
2 3 5 3 21
3 1 7 8 4
9:00:00-11:00:00 9:00:00-11:00:00 9:00:00-11:00:00
0 7 9 5
1 9 8 2
2 12 6 7
3 3 4 3
#missing values because no times fall between 11:00:00 and 12:00:00
df2 = df.mean(level=0, axis=1)
print (df2)
9:00:00-11:00:00 11:00:00-12:00:00
0 4.428571 NaN
1 5.142857 NaN
2 8.142857 NaN
3 4.285714 NaN
df3 = df.rename(columns=str).mean(level=0, axis=1)
print (df3)
9:00:00-11:00:00
0 4.428571
1 5.142857
2 8.142857
3 4.285714
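For the original end goal, a new per-row mean column over an arbitrary window, a small sketch along the same lines using timedeltas and a boolean column mask; the window bounds and the 'mean_9-11' column name are only illustrative:
import numpy as np
import pandas as pd
df = pd.DataFrame(np.array([[1, 2, 3, 4, 7, 9, 5],
                            [2, 6, 5, 4, 9, 8, 2],
                            [3, 5, 3, 21, 12, 6, 7],
                            [1, 7, 8, 4, 3, 4, 3]]),
                  columns=['9:00:00', '9:05:00', '09:10:00', '09:15:00',
                           '09:20:00', '09:25:00', '09:30:00'])
# parse the headers once, then any window becomes a boolean mask over the columns
cols = pd.to_timedelta(df.columns)
mask = (cols >= pd.Timedelta('09:00:00')) & (cols < pd.Timedelta('11:00:00'))
df['mean_9-11'] = df.loc[:, mask].mean(axis=1)
print(df)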
I am going to show you my code and the results after execution.
First, import the libraries and build the dataframe:
import numpy as np
import pandas as pd
df = pd.DataFrame(np.array([[1,2,3,4,7,9,5],[2,6,5,4,9,8,2],[3,5,3,21,12,6,7],
                            [1,7,8,4,3,4,3]]),
                  columns=['9:00:00','9:05:00','09:10:00','09:15:00',
                           '09:20:00','09:25:00','09:30:00'])
It is convenient to create a class in order to define what a period is:
class Period():
    def __init__(self, initial, end):
        self.initial = initial
        self.end = end

    def __repr__(self):
        return self.initial + ' -- ' + self.end
With the .loc command we can get a sub-dataframe with the desired columns:
def get_colMean(df, period):
    df2 = df.loc[:, period.initial:period.end]
    array_mean = df2.mean(axis=1).values
    col_name = 'mean_' + period.initial + '--' + period.end
    pd_colMean = pd.DataFrame(array_mean, columns=[col_name])
    return pd_colMean
Finally, we use .join in order to add the column with the means to our original dataframe:
def join_colMean(df, period):
    pd_colMean = get_colMean(df, period)
    df = df.join(pd_colMean)
    return df
I am going to show you my results:
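A hypothetical usage of the helpers above, assuming the df defined at the start of this answer; the period bounds must match existing column labels:
period = Period('9:00:00', '09:15:00')
df = join_colMean(df, period)
print(df)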

How to calculate Quarterly difference and add missing Quarterly with count in python pandas

I have a data frame like this, and I have to get the missing quarters and the count between them for each Id.
The data frame is:
year Data Id
2019Q4 57170 A
2019Q3 55150 A
2019Q2 51109 A
2019Q1 51109 A
2018Q1 57170 B
2018Q4 55150 B
2017Q4 51109 C
2017Q2 51109 C
2017Q1 51109 C
The expected output is:
Id start-year end-year count
B 2018Q2 2018Q3 2
B 2017Q3 2018Q3 1
How can I achieve this using Python pandas?
Use:
#changed data for more general solution - multiple missing years per groups
print (df)
year Data Id
0 2015 57170 A
1 2016 55150 A
2 2019 51109 A
3 2023 51109 A
4 2000 47740 B
5 2002 44563 B
6 2003 43643 C
7 2004 42050 C
8 2007 37312 C
#add missing rows for absent years by reindexing
df1 = (df.set_index('year')
.groupby('Id')['Id']
.apply(lambda x: x.reindex(np.arange(x.index.min(), x.index.max() + 1)))
.reset_index(name='val'))
print (df1)
Id year val
0 A 2015 A
1 A 2016 A
2 A 2017 NaN
3 A 2018 NaN
4 A 2019 A
5 A 2020 NaN
6 A 2021 NaN
7 A 2022 NaN
8 A 2023 A
9 B 2000 B
10 B 2001 NaN
11 B 2002 B
12 C 2003 C
13 C 2004 C
14 C 2005 NaN
15 C 2006 NaN
16 C 2007 C
#boolean mask checking for non-NaN values, stored in a variable for reuse
m = df1['val'].notnull().rename('g')
#create index by cumulative sum for unique groups for consecutive NaNs
df1.index = m.cumsum()
#filter only the NaN rows and aggregate first, last and count
df2 = (df1[~m.values].groupby(['Id', 'g'])['year']
.agg(['first','last','size'])
.reset_index(level=1, drop=True)
.reset_index())
print (df2)
Id first last size
0 A 2017 2018 2
1 A 2020 2022 3
2 B 2001 2001 1
3 C 2005 2006 2
EDIT:
#convert to datetimes
df['year'] = pd.to_datetime(df['year'], format='%Y%m')
#resample by start of months with asfreq
df1 = df.set_index('year').groupby('Id')['Id'].resample('MS').asfreq().rename('val').reset_index()
print (df1)
Id year val
0 A 2015-05-01 A
1 A 2015-06-01 NaN
2 A 2015-07-01 A
3 A 2015-08-01 NaN
4 A 2015-09-01 A
5 B 2000-01-01 B
6 B 2000-02-01 NaN
7 B 2000-03-01 B
8 C 2003-01-01 C
9 C 2003-02-01 C
10 C 2003-03-01 NaN
11 C 2003-04-01 NaN
12 C 2003-05-01 C
m = df1['val'].notnull().rename('g')
#create index by cumulative sum for unique groups for consecutive NaNs
df1.index = m.cumsum()
#filter only the NaN rows and aggregate first, last and count
df2 = (df1[~m.values].groupby(['Id', 'g'])['year']
.agg(['first','last','size'])
.reset_index(level=1, drop=True)
.reset_index())
print (df2)
Id first last size
0 A 2015-06-01 2015-06-01 1
1 A 2015-08-01 2015-08-01 1
2 B 2000-02-01 2000-02-01 1
3 C 2003-03-01 2003-04-01 2
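The question itself uses quarter labels such as 2019Q4 rather than years or months, so here is a sketch adapting the same reindex-and-cumsum idea to quarterly periods (the data values are illustrative, taken from the question):
import pandas as pd
df = pd.DataFrame({'year': ['2019Q4', '2019Q3', '2019Q2', '2019Q1',
                            '2018Q1', '2018Q4',
                            '2017Q4', '2017Q2', '2017Q1'],
                   'Data': [57170, 55150, 51109, 51109,
                            57170, 55150,
                            51109, 51109, 51109],
                   'Id': ['A', 'A', 'A', 'A', 'B', 'B', 'C', 'C', 'C']})
#parse the quarter strings into Periods so consecutive quarters can be generated
df['year'] = pd.PeriodIndex(df['year'], freq='Q')
#add missing quarters per Id by reindexing over the full quarter range
df1 = (df.set_index('year')
         .groupby('Id')['Id']
         .apply(lambda x: x.reindex(pd.period_range(x.index.min(),
                                                    x.index.max(), freq='Q')))
         .rename('val')
         .rename_axis(['Id', 'year'])
         .reset_index())
#same trick as above: NaN rows are the missing quarters, grouped while consecutive
m = df1['val'].notnull().rename('g')
df1.index = m.cumsum()
df2 = (df1[~m.values].groupby(['Id', 'g'])['year']
         .agg(['first', 'last', 'size'])
         .reset_index(level=1, drop=True)
         .reset_index())
print(df2)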

pandas: finding maximum for each series in dataframe

Consider this data:
df = pd.DataFrame(np.random.randint(0, 20, size=(5, 4)),
                  columns=list('ABCD'),
                  index=pd.date_range('2016-04-01', '2016-04-05'))
date A B C D
1/1/2016 15 5 19 2
2/1/2016 18 1 14 11
3/1/2016 10 16 8 8
4/1/2016 7 17 17 18
5/1/2016 10 15 18 18
where date is the index.
What I want to get back is a tuple of (date, <max>, <series_name>) for each column:
2/1/2016, 18, 'A'
4/1/2016, 17, 'B'
1/1/2016, 19, 'C'
4/1/2016, 18, 'D'
How can this be done in idiomatic pandas?
You could use idxmax and max with axis=0 for that and then join them:
np.random.seed(632)
df = pd.DataFrame(np.random.randint(0,20,size=(5, 4)), columns=list('ABCD'))
In [28]: df
Out[28]:
A B C D
0 10 14 16 1
1 12 13 8 8
2 8 16 11 1
3 8 1 17 12
4 4 2 1 7
In [29]: df.idxmax(axis=0)
Out[29]:
A 1
B 2
C 3
D 3
dtype: int64
In [30]: df.max(axis=0)
Out[30]:
A 12
B 16
C 17
D 12
dtype: int32
In [32]: pd.concat([df.idxmax(axis=0) , df.max(axis=0)], axis=1)
Out[32]:
0 1
A 1 12
B 2 16
C 3 17
D 3 12
I think you can concat max and idxmax. Then you can reset_index, rename the index column, and reorder the columns:
print(df)
A B C D
date
1/1/2016 15 5 19 2
2/1/2016 18 1 14 11
3/1/2016 10 16 8 8
4/1/2016 7 17 17 18
5/1/2016 10 15 18 18
print(pd.concat([df.max(), df.idxmax()], axis=1, keys=['max','date']))
max date
A 18 2/1/2016
B 17 4/1/2016
C 19 1/1/2016
D 18 4/1/2016
df = (pd.concat([df.max(), df.idxmax()], axis=1, keys=['max','date'])
        .reset_index()
        .rename(columns={'index':'name'}))
#change order of columns
df = df[['date','max','name']]
print(df)
date max name
0 2/1/2016 18 A
1 4/1/2016 17 B
2 1/1/2016 19 C
3 4/1/2016 18 D
Another solution with rename_axis (new in pandas 0.18.0):
print(pd.concat([df.max().rename_axis('name'), df.idxmax()], axis=1, keys=['max','date']))
max date
name
A 18 2/1/2016
B 17 4/1/2016
C 19 1/1/2016
D 18 4/1/2016
df = (pd.concat([df.max().rename_axis('name'), df.idxmax()], axis=1, keys=['max','date'])
        .reset_index())
#change order of columns
df = df[['date','max','name']]
print(df)
date max name
0 2/1/2016 18 A
1 4/1/2016 17 B
2 1/1/2016 19 C
3 4/1/2016 18 D
Setup
import numpy as np
import pandas as pd
np.random.seed(314)
df = pd.DataFrame(np.random.randint(0, 20, size=(5, 4)),
                  columns=list('ABCD'),
                  index=pd.date_range('2016-04-01', '2016-04-05'))
print(df)
A B C D
2016-04-01 8 13 9 19
2016-04-02 10 14 16 7
2016-04-03 2 7 16 3
2016-04-04 12 7 4 0
2016-04-05 4 13 8 16
Solution
stacked = df.stack()
stacked = stacked[stacked.groupby(level=1).idxmax()]
produces
print(stacked)
2016-04-04 A 12
2016-04-02 B 14
C 16
2016-04-01 D 19
dtype: int32
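If the output should literally be a list of (date, max, name) tuples, a short follow-up sketch building on idxmax/max (using the Setup frame above):
summary = pd.concat([df.idxmax(), df.max()], axis=1, keys=['date', 'max'])
tuples = [(d, m, name) for name, (d, m) in summary.iterrows()]
print(tuples)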

Multiindex on DataFrames and sum in Pandas

I am currently trying to make use of pandas' MultiIndex. I am trying to group an existing DataFrame object df_original based on its columns in a smart way, and was therefore thinking of a MultiIndex.
print(df_original)
by_currency by_portfolio A B C
1 AUD a 1 2 3
2 AUD b 4 5 6
3 AUD c 7 8 9
4 AUD d 10 11 12
5 CHF a 13 14 15
6 CHF b 16 17 18
7 CHF c 19 20 21
8 CHF d 22 23 24
Now, what I would like to have is a MultiIndex DataFrame object, with A, B and C, and by_portfolio as the indices, looking like:
CHF AUD
A a 13 1
b 16 4
c 19 7
d 22 10
B a 14 2
b 17 5
c 20 8
d 23 11
C a 15 3
b 18 6
c 21 9
d 24 12
I have tried making all columns in df_original and the sought-after indices into list objects, and from there creating a new DataFrame. This seems a bit cumbersome, and I can't figure out how to add the actual values afterwards.
Perhaps some sort of groupby is better for this purpose? The thing is, I will need to be able to add this data to another, similar DataFrame, so the resulting DataFrame has to be addable to another one later on.
Thanks
You can use a combination of stack and unstack:
In [50]: df.set_index(['by_currency', 'by_portfolio']).stack().unstack(0)
Out[50]:
by_currency AUD CHF
by_portfolio
a A 1 13
B 2 14
C 3 15
b A 4 16
B 5 17
C 6 18
c A 7 19
B 8 20
C 9 21
d A 10 22
B 11 23
C 12 24
To obtain your desired result, we only need to swap the levels of the index:
In [51]: df2 = df.set_index(['by_currency', 'by_portfolio']).stack().unstack(0)
In [52]: df2.columns.name = None
In [53]: df2.index = df2.index.swaplevel(0,1)
In [55]: df2 = df2.sort_index()
In [56]: df2
Out[56]:
AUD CHF
by_portfolio
A a 1 13
b 4 16
c 7 19
d 10 22
B a 2 14
b 5 17
c 8 20
d 11 23
C a 3 15
b 6 18
c 9 21
d 12 24
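Since the question mentions adding this data to another, similar DataFrame later, here is a minimal sketch of combining two frames with this layout; the second frame is just a stand-in:
other = df2 * 10                        # stand-in for the other, similar DataFrame
total = df2.add(other, fill_value=0)    # aligns on the shared MultiIndex and columns
print(total)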
