Reverting from multiindex to single index dataframe in pandas - python

                          NI
YEAR MONTH datetime
2000 1     2000-01-01   NaN
           2000-01-02   NaN
           2000-01-03   NaN
           2000-01-04   NaN
           2000-01-05   NaN
In the dataframe above, I have a multilevel index consisting of the columns:
names=[u'YEAR', u'MONTH', u'datetime']
How do I revert to a dataframe with 'datetime' as index and 'YEAR' and 'MONTH' as normal columns?

Pass level=[0,1] to reset just those two index levels:
dist_df = dist_df.reset_index(level=[0,1])
In [28]:
df.reset_index(level=[0,1])
Out[28]:
            YEAR  MONTH  NI
datetime
2000-01-01  2000      1 NaN
2000-01-02  2000      1 NaN
2000-01-03  2000      1 NaN
2000-01-04  2000      1 NaN
2000-01-05  2000      1 NaN
Alternatively, you can pass the level names:
df.reset_index(level=['YEAR','MONTH'])
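As a self-contained sketch of the round trip (index names as in the question; the NI values are just NaN placeholders):

```python
import numpy as np
import pandas as pd

dates = pd.date_range('2000-01-01', periods=5, freq='D')
df = pd.DataFrame({'YEAR': dates.year, 'MONTH': dates.month,
                   'datetime': dates, 'NI': np.nan})
df = df.set_index(['YEAR', 'MONTH', 'datetime'])

# Reset only the first two levels; 'datetime' stays as the index
out = df.reset_index(level=['YEAR', 'MONTH'])
print(out.index.name)      # datetime
print(list(out.columns))   # ['YEAR', 'MONTH', 'NI']
```

The reset levels are inserted as the leading columns, in level order.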

Another simple way is to overwrite the column index directly with a flat list of labels:
consolidated_data.columns = country_master
ref: https://riptutorial.com/pandas/example/18695/how-to-change-multiindex-columns-to-standard-columns
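For MultiIndex columns specifically, the idea in the linked example is to replace df.columns with a flat index. A minimal sketch (the column names here are made up for illustration):

```python
import pandas as pd

df = pd.DataFrame([[1, 2], [3, 4]],
                  columns=pd.MultiIndex.from_tuples([('A', 'x'), ('B', 'y')]))

# Keep only the top level of the column hierarchy
df.columns = df.columns.get_level_values(0)
print(list(df.columns))  # ['A', 'B']
```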

Related

Python: Add 1 every month in dataframe for all columns

I have a dataframe:
A B C
date
2021-01-01 1 nan 1
2021-01-23 nan 1 1
2021-02-03 1 nan 1
How can I add 1 to all columns at the beginning of each month? (I also want to do this quarterly.) The dataframe should end up looking like this:
              A    B  C
date
2021-01-01    2  nan  2
2021-01-23  nan    1  1
2021-02-01  nan    1  1
2021-02-03    1  nan  1
The beginning of the month should have "nan" in the same place as the last instance of the previous month.
IIUC the logic, you could do:
# ensure datetime
df.index = pd.to_datetime(df.index)
# fill missing starts of month
idx = pd.date_range(df.index.min(), df.index.max(), freq='MS')
df = df.reindex(df.index.union(idx))
# update starts of month
prev = df.shift(1).loc[idx] # get last data of previous month
df.loc[idx] = df.loc[idx].add(1).combine_first(prev) # increment/fill
output:
              A    B    C
2021-01-01  2.0  NaN  2.0
2021-01-23  NaN  1.0  1.0
2021-02-01  NaN  1.0  1.0
2021-02-03  1.0  NaN  1.0
df[(df.index.is_month_start) & (df.index >= df.first_valid_index())] += 1
Figured it out. Note it is is_month_start, since the increment targets the beginning of each month (is_month_end would pick the wrong rows). And for quarterly, it would be .is_quarter_start.
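A runnable sketch of that mask-based approach on the question's data (using is_month_start; note that NaN + 1 stays NaN, which gives the desired behavior for missing cells):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'A': [1, np.nan, 1],
                   'B': [np.nan, 1, np.nan],
                   'C': [1, 1, 1]},
                  index=pd.to_datetime(['2021-01-01', '2021-01-23', '2021-02-03']))

mask = df.index.is_month_start   # True only for 2021-01-01 here
df.loc[mask] += 1                # NaN cells remain NaN
print(df.loc['2021-01-01', 'A'])  # 2.0
```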

How can I check whether a column value is NaN for rows whose datetime value is the first of the month?

That was the clearest way I could have asked the question, I do apologize. I have monthly data like this, with only the first of the month having a data point:
city    time        value
London  2000-01-01  5
London  2000-01-02  nan
London  2000-01-03  nan
..
London  2000-01-31  nan
London  2000-02-01  nan
London  2000-02-02  nan
London  2000-02-01  nan
...
London  2000-02-31  nan
London  2000-03-01  3
London  2000-01-01  nan
..
I basically want to do this following statement in pandas form:
If value is NaN for a timestamp whose day is 1, replace that first-of-the-month value with -1. I am struggling with the pandas subsetting notation using a condition as a mask.
So from above I want my data to then look like
city    time        value
London  2000-01-01  5
London  2000-01-02  nan
London  2000-01-03  nan
..
London  2000-01-31  nan
London  2000-02-01  -1
London  2000-02-02  nan
London  2000-02-01  nan
...
London  2000-02-31  nan
London  2000-03-01  3
London  2000-01-01  nan
..
but it obviously continues and there are thousands of rows.
edit-
Below is what I am starting to attempt:
So I saw online that I can build a condition and then use df.loc[condition] to subsection the data, something like:
mask = (df.time.dt.day==1)
So I believe this subsections the times for day=1 but I am not sure how to proceed.
Use numpy.where with pd.to_datetime, Series.eq and Series.isna:
In [503]: import numpy as np
# Convert 'time' column into pandas datetime
In [499]: df['time'] = pd.to_datetime(df['time'], format='%Y-%m-%d')
In [504]: df['value'] = np.where(df['time'].dt.day.eq(1) & df['value'].isna(), -1, df['value'])
In [505]: df
Out[505]:
city time value
0 London 2000-01-01 5.0
1 London 2000-01-02 NaN
2 London 2000-01-03 NaN
3 London 2000-01-31 NaN
4 London 2000-02-01 -1.0
5 London 2000-02-02 NaN
6 London 2000-02-01 -1.0
7 London 2000-03-01 3.0
8 London 2000-01-01 -1.0
OR use df.loc:
In [499]: df['time'] = pd.to_datetime(df['time'], format='%Y-%m-%d')
In [510]: df.loc[df['time'].dt.day.eq(1) & df['value'].isna(), 'value'] = -1
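Putting the df.loc variant together as a runnable sketch (with a shortened version of the sample data):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'city': ['London'] * 4,
                   'time': ['2000-01-01', '2000-01-31', '2000-02-01', '2000-03-01'],
                   'value': [5, np.nan, np.nan, 3]})
df['time'] = pd.to_datetime(df['time'])

# Write -1 only where the day is the 1st AND the value is missing
df.loc[df['time'].dt.day.eq(1) & df['value'].isna(), 'value'] = -1
print(df['value'].tolist())  # [5.0, nan, -1.0, 3.0]
```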

How to deal with a Dataframe with many columns and if statement

Here is my problem
You will find below a sample of my DataFrame:
df = pd.DataFrame({'Date': ['01/03/2000', '01/04/2000', '01/05/2000',
                            '01/06/2000', '01/07/2000', '01/08/2000'],
                   'Paul_Score': [3, 10, 22, 32, 20, 40],
                   'John_Score': [8, 42, 10, 57, 3, 70]})
df['Date'] = pd.to_datetime(df['Date'])
df = df.set_index('Date')
And I started to work on a loop with an If statement like this:
def test(selection, symbol):
    df_end = selection * 0
    rolling_mean = selection.rolling(2).mean().fillna(0)
    calendar = pd.Series(df_end.index)
    for date in calendar:
        module = 1 / selection.loc[date, symbol]
        if selection.loc[date, symbol] > rolling_mean.loc[date, symbol]:
            df_end.loc[date, symbol] = module
        else:
            df_end.loc[date, symbol] = 0
    return df_end
Then:
test(df,'John_Score')
However, my problem is that I don't know how to handle many columns at once; my goal is to apply this function to the whole dataframe (all columns). This sample has only 2 columns, but in reality I have 30, and I don't know how to do it.
EDIT :
This is what I have with test(df,'John_Score') :
Paul_Score John_Score
Date
2000-01-03 0 0.125000
2000-01-04 0 0.023810
2000-01-05 0 0.000000
2000-01-06 0 0.017544
2000-01-07 0 0.000000
2000-01-08 0 0.014286
And this is what I have with test(df,'Paul_Score') :
Paul_Score John_Score
Date
2000-01-03 0.333333 0
2000-01-04 0.100000 0
2000-01-05 0.045455 0
2000-01-06 0.031250 0
2000-01-07 0.000000 0
2000-01-08 0.025000 0
And I would like something like that :
Paul_Score John_Score
Date
2000-01-03 0.333333 0.125000
2000-01-04 0.100000 0.023810
2000-01-05 0.045455 0.000000
2000-01-06 0.031250 0.017544
2000-01-07 0.000000 0.000000
2000-01-08 0.025000 0.014286
My goal is to check df every day, column by column: if the value is greater than its 2-day rolling mean, compute 1/value; otherwise write 0.
There may be a simpler way, but I'm trying to improve my coding skills with for/if statements, and I find it hard to do this kind of computation on DataFrames with many columns.
Any ideas are welcome.
Maybe this code does the job:
import pandas as pd

df = pd.DataFrame({'Date': ['01/03/2000', '01/04/2000', '01/05/2000',
                            '01/06/2000', '01/07/2000', '01/08/2000'],
                   'Paul_Score': [3, 10, 22, 32, 20, 40],
                   'John_Score': [8, 42, 10, 57, 3, 70]})
df['Date'] = pd.to_datetime(df['Date'])
df = df.set_index('Date')

def test(selection, symbols):
    df_end = selection * 0
    rolling_mean = selection.rolling(2).mean().fillna(0)
    calendar = pd.Series(df_end.index)
    for date in calendar:
        for col in symbols:
            module = 1 / selection.loc[date, col]
            if selection.loc[date, col] > rolling_mean.loc[date, col]:
                df_end.loc[date, col] = module
            else:
                df_end.loc[date, col] = 0
    return df_end
test(df,['Paul_Score', 'John_Score'])
Output:
Paul_Score John_Score
Date
2000-01-03 0.333333 0.125000
2000-01-04 0.100000 0.023810
2000-01-05 0.045455 0.000000
2000-01-06 0.031250 0.017544
2000-01-07 0.000000 0.000000
2000-01-08 0.025000 0.014286
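Since the goal is to handle all columns at once, the nested loops can also be replaced by one vectorized step: DataFrame.where keeps 1/df where the condition holds and fills 0 elsewhere. A sketch under the same setup:

```python
import pandas as pd

df = pd.DataFrame({'Paul_Score': [3, 10, 22, 32, 20, 40],
                   'John_Score': [8, 42, 10, 57, 3, 70]},
                  index=pd.to_datetime(['01/03/2000', '01/04/2000', '01/05/2000',
                                        '01/06/2000', '01/07/2000', '01/08/2000']))
df.index.name = 'Date'

# 1/value where the value beats its 2-day rolling mean, else 0
rolling_mean = df.rolling(2).mean().fillna(0)
result = (1 / df).where(df > rolling_mean, 0)
print(result)
```

This produces the same table for every column in one pass, with no per-column calls.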

How to select column and rows in pandas without column or row names?

I have a pandas dataframe(df) like this
Close Close Close Close Close
Date
2000-01-03 00:00:00 NaN NaN NaN NaN -0.033944
2000-01-04 00:00:00 NaN NaN NaN NaN 0.0351366
2000-01-05 00:00:00 -0.033944 NaN NaN NaN -0.0172414
2000-01-06 00:00:00 0.0351366 -0.033944 NaN NaN -0.00438596
2000-01-07 00:00:00 -0.0172414 0.0351366 -0.033944 NaN 0.0396476
In R, if I want to select the fifth column:
five = df[,5]
and everything except the fifth column:
rest = df[,-5]
How can I do similar operations with a pandas dataframe?
I tried this in pandas
five=df.ix[,5]
but its giving this error
File "", line 1
df.ix[,5]
^
SyntaxError: invalid syntax
Use iloc. It is explicitly a position-based indexer. ix can be both and will get confused if an index is integer based. (In current pandas, .ix has been removed entirely, so iloc is the way to go.)
df.iloc[:, [4]]
For all but the fifth column
slc = list(range(df.shape[1]))
slc.remove(4)
df.iloc[:, slc]
or equivalently
df.iloc[:, [i for i in range(df.shape[1]) if i != 4]]
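Equivalently, df.drop can exclude a column by position without building an index list. A small sketch with made-up data:

```python
import pandas as pd

df = pd.DataFrame([[0, 1, 2, 3, 4]], columns=list('abcde'))

five = df.iloc[:, 4]                    # fifth column, as a Series
rest = df.drop(columns=df.columns[4])   # everything except the fifth
print(list(rest.columns))  # ['a', 'b', 'c', 'd']
```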
If your DataFrame does not have column/row labels and you want to select a specific column, you should use the iloc method.
For example, to select the first column and all rows:
df = dataset.iloc[:, 0]
Here the df variable will contain the values stored in the first column of your dataframe.
Do remember that
type(df) -> pandas.core.series.Series
because selecting a single column returns a Series, not a DataFrame. Hope it helps.
If you want the fifth column:
df.ix[:, 4]
Stick the colon in there to take all the rows for that column.
To exclude the fifth column you could try:
df.ix[:, [x for x in range(len(df.columns)) if x != 4]]
(Note: .ix has since been removed from pandas; df.iloc accepts the same position-based arguments in current versions.)
To select filter column by index:
In [19]: df
Out[19]:
Date Close Close.1 Close.2 Close.3 Close.4
0 2000-01-0300:00:00 NaN NaN NaN NaN -0.033944
1 2000-01-0400:00:00 NaN NaN NaN NaN 0.035137
2 2000-01-0500:00:00 -0.033944 NaN NaN NaN -0.017241
3 2000-01-0600:00:00 0.035137 -0.033944 NaN NaN -0.004386
4 2000-01-0700:00:00 -0.017241 0.035137 -0.033944 NaN 0.039648
In [20]: df.ix[:, 5]
Out[20]:
0 -0.033944
1 0.035137
2 -0.017241
3 -0.004386
4 0.039648
Name: Close.4, dtype: float64
In [21]: df.icol(5)
/usr/bin/ipython:1: FutureWarning: icol(i) is deprecated. Please use .iloc[:,i]
#!/usr/bin/python2
Out[21]:
0 -0.033944
1 0.035137
2 -0.017241
3 -0.004386
4 0.039648
Name: Close.4, dtype: float64
In [22]: df.iloc[:, 5]
Out[22]:
0 -0.033944
1 0.035137
2 -0.017241
3 -0.004386
4 0.039648
Name: Close.4, dtype: float64
To select all columns except index:
In [29]: df[[df.columns[i] for i in range(len(df.columns)) if i != 5]]
Out[29]:
Date Close Close.1 Close.2 Close.3
0 2000-01-0300:00:00 NaN NaN NaN NaN
1 2000-01-0400:00:00 NaN NaN NaN NaN
2 2000-01-0500:00:00 -0.033944 NaN NaN NaN
3 2000-01-0600:00:00 0.035137 -0.033944 NaN NaN
4 2000-01-0700:00:00 -0.017241 0.035137 -0.033944 NaN

Selecting Subset of Pandas DataFrame

I have two different pandas DataFrames and I want to extract data from one DataFrame whenever the other DataFrame has a specific value at the same time. To be concrete, I have one object called "GDP" which looks as follows:
GDP
DATE
1947-01-01 243.1
1947-04-01 246.3
1947-07-01 250.1
I additionally have a DataFrame called "recession" which contains data like the following:
USRECQ
DATE
1949-07-01 1
1949-10-01 1
1950-01-01 0
I want to create two new time series. One should contain GDP data whenever USRECQ has a value of 0 at the same DATE. The other one should contain GDP data whenever USRECQ has a value of 1 at the same DATE. How can I do that?
Let's modify the example you posted so the dates overlap:
import pandas as pd
import numpy as np
GDP = pd.DataFrame({'GDP': np.arange(10)*10},
                   index=pd.date_range('2000-1-1', periods=10, freq='D'))
# GDP
# 2000-01-01 0
# 2000-01-02 10
# 2000-01-03 20
# 2000-01-04 30
# 2000-01-05 40
# 2000-01-06 50
# 2000-01-07 60
# 2000-01-08 70
# 2000-01-09 80
# 2000-01-10 90
recession = pd.DataFrame({'USRECQ': [0]*5 + [1]*5},
                         index=pd.date_range('2000-1-2', periods=10, freq='D'))
# USRECQ
# 2000-01-02 0
# 2000-01-03 0
# 2000-01-04 0
# 2000-01-05 0
# 2000-01-06 0
# 2000-01-07 1
# 2000-01-08 1
# 2000-01-09 1
# 2000-01-10 1
# 2000-01-11 1
Then you could join the two dataframes:
combined = GDP.join(recession, how='outer') # change to how='inner' to remove NaNs
# GDP USRECQ
# 2000-01-01 0 NaN
# 2000-01-02 10 0
# 2000-01-03 20 0
# 2000-01-04 30 0
# 2000-01-05 40 0
# 2000-01-06 50 0
# 2000-01-07 60 1
# 2000-01-08 70 1
# 2000-01-09 80 1
# 2000-01-10 90 1
# 2000-01-11 NaN 1
and select rows based on a condition like this:
In [112]: combined.loc[combined['USRECQ']==0]
Out[112]:
GDP USRECQ
2000-01-02 10 0
2000-01-03 20 0
2000-01-04 30 0
2000-01-05 40 0
2000-01-06 50 0
In [113]: combined.loc[combined['USRECQ']==1]
Out[113]:
GDP USRECQ
2000-01-07 60 1
2000-01-08 70 1
2000-01-09 80 1
2000-01-10 90 1
2000-01-11 NaN 1
To get just the GDP column supply the column name as the second term to combined.loc:
In [116]: combined.loc[combined['USRECQ']==1, 'GDP']
Out[116]:
2000-01-07 60
2000-01-08 70
2000-01-09 80
2000-01-10 90
2000-01-11 NaN
Freq: D, Name: GDP, dtype: float64
As PaulH points out, you could also use query, which has a nicer syntax:
In [118]: combined.query('USRECQ==1')
Out[118]:
GDP USRECQ
2000-01-07 60 1
2000-01-08 70 1
2000-01-09 80 1
2000-01-10 90 1
2000-01-11 NaN 1
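Tying it back to the original question, the two requested GDP series can be produced in one pass; an inner join drops the non-overlapping dates, which also avoids the NaN rows shown above:

```python
import numpy as np
import pandas as pd

GDP = pd.DataFrame({'GDP': np.arange(10) * 10},
                   index=pd.date_range('2000-01-01', periods=10, freq='D'))
recession = pd.DataFrame({'USRECQ': [0] * 5 + [1] * 5},
                         index=pd.date_range('2000-01-02', periods=10, freq='D'))

combined = GDP.join(recession, how='inner')  # only dates present in both
expansion_gdp = combined.loc[combined['USRECQ'] == 0, 'GDP']
recession_gdp = combined.loc[combined['USRECQ'] == 1, 'GDP']
print(len(expansion_gdp), len(recession_gdp))  # 5 4
```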
