Python Pandas: How to set the name of a MultiIndex?

I want to add a name to the index of a multi-index DataFrame.
I want to set the name shown in the red box in the image (the first index level) to 'Ticker'.
How can I do that?

Set index.names (plural, because it is a MultiIndex) or use rename_axis:
df.index.names = ['Ticker','date']
# if you only want to set the first name and keep the existing second one
df.index.names = ['Ticker', df.index.names[1]]
Or:
df = df.rename_axis(['Ticker','date'])
# if you only want to set the first name and keep the existing second one
df = df.rename_axis(['Ticker', df.index.names[1]])
Sample:
mux = pd.MultiIndex.from_product([['NAVER'], ['2018-11-28','2018-12-01','2018-12-02']],
                                 names=[None, 'date'])
df = pd.DataFrame({'open':[1,2,3]}, index=mux)
print(df)
                  open
      date
NAVER 2018-11-28     1
      2018-12-01     2
      2018-12-02     3

df = df.rename_axis(['Ticker','date'])
print(df)
                   open
Ticker date
NAVER  2018-11-28     1
       2018-12-01     2
       2018-12-02     3
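If only the first level needs a name, MultiIndex.set_names with the level argument is another option (a small sketch, using the same sample df):
df.index = df.index.set_names('Ticker', level=0)
print(df.index.names)   # FrozenList(['Ticker', 'date'])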

Related

How to count values in a pandas dataframe with specific index dates

How can I count with a loop how many 2-up and 2-dn are in a column at the same index date in a pandas DataFrame?
df1 = pd.DataFrame()
index = ['2020-01-01','2020-01-01','2020-01-01','2020-01-08','2020-01-08','2020-01-08']
df1 = pd.DataFrame(index = index)
bars = ['1-inside','2-up','2-dn','2-up','2-up','1-inside']
df1['Strat'] = bars
df1
Result should be:
2020-01-01 2-up = 1, 2-dn = 1
2020-01-08 2-up = 2, 2-dn = 0
Afterwards I would like to plot the results with matplotlib.
Use SeriesGroupBy.value_counts for the counts, reshape with Series.unstack and then plot with DataFrame.plot.bar:
need = ['2-up','2-dn']
df1 = df1['Strat'].groupby(level=0).value_counts().unstack(fill_value=0)[need]
print (df1)
Strat       2-up  2-dn
2020-01-01     1     1
2020-01-08     2     0
Or you can filter before counting with Series.isin in boolean indexing:
need = ['2-up','2-dn']
df1 = (df1.loc[df1['Strat'].isin(need), 'Strat']
          .groupby(level=0)
          .value_counts()
          .unstack(fill_value=0))
df1.plot.bar()
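To finish the plot with matplotlib (a minimal sketch; rot and the axis labels are illustrative choices, not part of the answer above):
import matplotlib.pyplot as plt

ax = df1.plot.bar(rot=0)    # one group of bars per index date
ax.set_xlabel('date')
ax.set_ylabel('count')
plt.tight_layout()
plt.show()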

Merge two dataframes and keep the common values while retaining values based on another column

When I merge two dataframes, it keeps the columns from the left and the right dataframes
with a _x and _y appended.
But I want it to make one column and 'merge' the values of the two columns such that:
- when the values are the same, it just puts that one value
- when the values are different, it keeps the value based on another column called 'date' and takes the value which is the 'latest' based on that date.
I also tried doing it using concatenate and in this case it does 'merge' the two columns, but it just seems to 'append' the two rows.
In the code below for example, I would like to get as output the dataframe df_desired. How can I get that?
import pandas as pd
import numpy as np
np.random.seed(30)
company1 = ('comA','comB','comC','comD')
df1 = pd.DataFrame(columns=None)
df1['company'] = company1
df1['clv']=[100,200,300,400]
df1['date'] = [20191231,20191231,20191001,20190931]
print("\ndf1:")
print(df1)
company2 = ('comC','comD','comE','comF')
df2 = pd.DataFrame(columns=None)
df2['company'] = company2
df2['clv']=[300,450,500,600]
df2['date'] = [20191231,20191231,20191231,20191231]
print("\ndf2:")
print(df2)
df_desired = pd.DataFrame(columns=None)
df_desired['company'] = ('comA','comB','comC','comD','comE','comF')
df_desired['clv']=[100,200,300,450,500,600]
df_desired['date'] = [20191231,20191231,20191231,20191231,20191231,20191231]
print("\ndf_desired:")
print(df_desired)
df_merge = pd.merge(df1,df2,left_on = 'company',
right_on = 'company',how='outer')
print("\ndf_merge:")
print(df_merge)
# alternately
df_concat = pd.concat([df1, df2], ignore_index=True, sort=False)
print("\ndf_concat:")
print(df_concat)
One approach is to concat the two dataframes, sort the concatenated dataframe on date in ascending order, and drop the duplicate entries based on company (while keeping the latest entry):
df = pd.concat([df1, df2])
df['date'] = pd.to_datetime(df['date'], format='%Y%m%d', errors='coerce')
df = df.sort_values('date', na_position='first').drop_duplicates('company', keep='last', ignore_index=True)
Result:
  company  clv       date
0    comA  100 2019-12-31
1    comB  200 2019-12-31
2    comC  300 2019-12-31
3    comD  450 2019-12-31
4    comE  500 2019-12-31
5    comF  600 2019-12-31
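A note on why errors='coerce' and na_position='first' matter here: 20190931 in df1 is not a valid calendar date, so to_datetime turns it into NaT; sorting with NaT first and dropping duplicates with keep='last' then retains comD's valid 2019-12-31 row from df2. A quick check:
print(pd.to_datetime('20190931', format='%Y%m%d', errors='coerce'))  # NaT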

Group by Category and find Percent Change for given frequency

I have a dataset, df, that I wish to group by the category and find the percent change for a given frequency
Cat  Value      Date
A        1  7/1/2020
A        2  7/2/2020
B       20  7/1/2020
B       40  7/3/2020
Desired Output
Cat  Diff  pct_change      Date
A       1         100  7/2/2020
B      20         100  7/3/2020
This is what I am doing
df1=df.groupby(pd.Grouper(key='Cat', freq='1D')).sum() #Group by the Cat
df1['PercentageDiff'] = df1['Value'].pct_change().mul(100) #Find Pct_change
df1['ValueDiff'] = df1['Value'].diff() #Find Value diff
Any help is appreciated.
I believe you want to work per group with DataFrame.groupby and, at the end, remove the first row of each group (which is filled with missing values) using DataFrame.dropna:
df['Date'] = pd.to_datetime(df['Date'])
df['Diff'] = df.groupby('Cat')['Value'].diff()
df['pct_change'] = df.groupby('Cat')['Value'].pct_change().mul(100)
df = df.dropna(subset=['pct_change'])[['Cat','Diff','pct_change','Date']]
print (df)
  Cat  Diff  pct_change       Date
1   A   1.0       100.0 2020-07-02
3   B  20.0       100.0 2020-07-03
This should help:
def f(x):
    d = {}
    d['Diff'] = x['Value'].iloc[1] - x['Value'].iloc[0]
    d['Perc_change'] = 100 * (x['Value'].iloc[1] - x['Value'].iloc[0]) / x['Value'].iloc[0]
    d['Date'] = max(x['Date'])
    return pd.Series(d, index=['Diff', 'Perc_change', 'Date'])

df['Date'] = pd.to_datetime(df.Date)
df = df.sort_values('Date')
df.groupby(['Cat']).apply(f)

Parsing Column names as DateTime

Is there a way of parsing the column names themselves as datetime? My column names look like this:
Name SizeRank 1996-06 1996-07 1996-08 ...
I know that I can convert values for a column to datetime values, e.g for a column named datetime, I can do something like this:
temp = pd.read_csv('data.csv', parse_dates=['datetime'])
Is there a way of converting the column names themselves? I have 285 columns i.e my data is from 1996-2019.
There's no way of doing that immediately while reading the data from a file afaik, but you can fairly simply convert the columns to datetime after you've read them in. You just need to watch out that you don't pass columns that don't actually contain a date to the function.
Could look something like this, assuming all columns after the first two are dates (as in your example):
dates = pd.to_datetime(df.columns[2:])
You can then do whatever you need to do with those datetimes.
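For example (a sketch; the 1997 cutoff and variable names are just illustrative), the parsed dates can be used to select a subset of the original columns:
dates = pd.to_datetime(df.columns[2:])
cols_1996 = df.columns[2:][dates < '1997-01-01']      # boolean mask on the original labels
subset = df[list(df.columns[:2]) + list(cols_1996)]   # Name, SizeRank plus only the 1996 columns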
You could do something like this.
df.columns = df.columns[:2].append(pd.to_datetime(df.columns[2:]))
It seems pandas will accept a datetime object as a column name...
import pandas as pd
from datetime import datetime
import re

columns = ["Name", "2019-01-01", "2019-01-02"]
data = [["Tom", 1, 0], ["Dick", 1, 1], ["Harry", 0, 0]]
df = pd.DataFrame(data, columns=columns)
print(df)

newcolumns = {}
for col in df.columns:
    if re.search(r"\d+-\d+-\d+", col):
        newcolumns[col] = pd.to_datetime(col)
    else:
        newcolumns[col] = col
print(newcolumns)

df.rename(columns=newcolumns, inplace=True)
print("--------------------")
print(df)
print("--------------------")
for col in df.columns:
    print(type(col), col)
OUTPUT:
    Name  2019-01-01  2019-01-02
0    Tom           1           0
1   Dick           1           1
2  Harry           0           0
{'Name': 'Name', '2019-01-01': Timestamp('2019-01-01 00:00:00'), '2019-01-02': Timestamp('2019-01-02 00:00:00')}
--------------------
    Name  2019-01-01 00:00:00  2019-01-02 00:00:00
0    Tom                    1                    0
1   Dick                    1                    1
2  Harry                    0                    0
--------------------
<class 'str'> Name
<class 'pandas._libs.tslibs.timestamps.Timestamp'> 2019-01-01 00:00:00
<class 'pandas._libs.tslibs.timestamps.Timestamp'> 2019-01-02 00:00:00
For brevity you can use...
newcolumns = {col: (pd.to_datetime(col) if re.search(r"\d+-\d+-\d+", col) else col) for col in df.columns}
df.rename(columns=newcolumns, inplace=True)
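A possible follow-up (a sketch; the long_df names are just illustrative): with Timestamp column labels you can reshape the date columns into a long format for time-series work:
date_cols = [c for c in df.columns if isinstance(c, pd.Timestamp)]
long_df = df.set_index('Name')[date_cols].stack().rename('value').reset_index()
long_df.columns = ['Name', 'date', 'value']
print(long_df.head())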

Conditionally merge pd.DataFrames

I want to know if this is possible with pandas:
From df2, I want to create new1 and new2.
- new1 as the latest date that can be found in df1 matching on both columns A and B.
- new2 as the latest date that can be found in df1 matching on column A but not B.
I managed to get new1 but not new2.
Code:
import pandas as pd
d1 = [['1/1/19', 'xy','p1','54'], ['1/1/19', 'ft','p2','20'], ['3/15/19', 'xy','p3','60'],['2/5/19', 'xy','p4','40']]
df1 = pd.DataFrame(d1, columns = ['Name', 'A','B','C'])
d2 =[['12/1/19', 'xy','p1','110'], ['12/10/19', 'das','p10','60'], ['12/20/19', 'fas','p50','40']]
df2 = pd.DataFrame(d2, columns = ['Name', 'A','B','C'])
d3 = [['12/1/19', 'xy','p1','110','1/1/19','3/15/19'], ['12/10/19', 'das','p10','60','0','0'], ['12/20/19', 'fas','p50','40','0','0']]
dfresult = pd.DataFrame(d3, columns = ['Name', 'A','B','C','new1','new2'])
Updated!
IIUC, you want to add two columns to df2: new1 and new2.
First I modified two things:
df1 = pd.DataFrame(d1, columns = ['Name1', 'A','B','C'])
df2 = pd.DataFrame(d2, columns = ['Name2', 'A','B','C'])
df1.Name1 = pd.to_datetime(df1.Name1)
I renamed Name to Name1 and Name2 for ease of use. Then I turned Name1 into a real date, so we can get the maximum date by group.
Then we merge df2 with df1 on the A column. This gives us the rows that match on that column:
aux = df2.merge(df1, on='A')
Then, when the B column is the same in both dataframes, we get Name1 out of it:
df2['new1'] = df2.index.map(aux[aux.B_x==aux.B_y].Name1).fillna(0)
If they're different we get the maximum date for every A group:
df2['new2'] = df2.A.map(aux[aux.B_x!=aux.B_y].groupby('A').Name1.max()).fillna(0)
Output:
      Name2    A    B    C                 new1                 new2
0   12/1/19   xy   p1  110  2019-01-01 00:00:00  2019-03-15 00:00:00
1  12/10/19  das  p10   60                    0                    0
2  12/20/19  fas  p50   40                    0                    0
You can do this by:
- a standard merge based on A
- removing all entries which match B values
- sorting by date
- dropping duplicates on A, keeping the last date (n.b. this assumes the dates are in date format, not strings!)
- merging back onto df2 (on Name)
Thus:
source = df1.copy() # renamed
v = df2.merge(source, on='A', how='left') # get all values where df2.A == source.A
v = v[v['B_x'] != v['B_y']] # drop entries where B values are the same
nv = v.sort_values(by=['Name_y']).drop_duplicates(subset=['Name_x'], keep='last')
df2.merge(nv[['Name_y', 'Name_x']].rename(columns={'Name_y': 'new2', 'Name_x': 'Name'}),
          on='Name', how='left')  # keeps non-matching, consider inner
This yields:
Out[94]:
       Name    A    B    C     new2
0   12/1/19   xy   p1  110  3/15/19
1  12/10/19  das  p10   60      NaN
2  12/20/19  fas  p50   40      NaN
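As the n.b. above warns, the sort only orders chronologically if the dates are real datetimes; with the sample data Name holds strings, so (a sketch) convert them first:
df1['Name'] = pd.to_datetime(df1['Name'])
df2['Name'] = pd.to_datetime(df2['Name'])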
My initial thought was to do something like the below. Sadly, it is not elegant. Generally, this way of determining a value is frowned upon, mostly because it fails to scale and gets especially slow with large data.
def find_date(row, source=df1):  # renamed df1 to source
    t = source[source['B'] != row['B']]
    t = t[t['A'] == row['A']]
    if t.empty:  # no row in df1 shares A while differing in B
        return None
    # 'Name' holds the dates; pick the latest one
    return t.loc[pd.to_datetime(t['Name']).idxmax(), 'Name']

df2['new2'] = df2.apply(find_date, axis=1)
