I have attached a photo of how the data is formatted when I print the df in Jupyter, please check that for reference.
I set the DATE column as the index, checked the data type of the index, and converted the index to a DatetimeIndex.
import pandas as pd
df = pd.read_csv ('UMTMVS.csv',index_col='DATE',parse_dates=True)
df.index = pd.to_datetime(df.index)
I need to print out the percent increase in value from one Month/Year to the next and the percent decrease in value from one Month/Year to the next.
The first correction pertains to how you read your DataFrame.
parse_dates=True already asks pandas to try parsing the index, but passing
a list of the columns to parse as dates makes the intent explicit. So this instruction can be changed to:
df = pd.read_csv('UMTMVS.csv', index_col='DATE', parse_dates=['DATE'])
and then the second instruction (the pd.to_datetime call) is not needed.
To find the percent change in the UMTMVS column, use: df.UMTMVS.pct_change().
For your data the result is:
DATE
1992-01-01 NaN
1992-02-01 0.110968
1992-03-01 0.073036
1992-04-01 -0.040080
1992-05-01 0.014875
1992-06-01 -0.330455
1992-07-01 0.368293
1992-08-01 0.078386
1992-09-01 0.082884
1992-10-01 -0.030528
1992-11-01 -0.027791
Name: UMTMVS, dtype: float64
These values are fractions, so multiply them by 100 to get percentages.
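To print each change as an increase or decrease from one Month/Year to the next, something along these lines might work. This is only a sketch: the three values below are made up for illustration and are not taken from UMTMVS.csv, and the output wording is an assumption about the format you want.

```python
import pandas as pd

# Hypothetical sample values standing in for the real UMTMVS column.
values = pd.Series(
    [100.0, 110.0, 99.0],
    index=pd.to_datetime(["1992-01-01", "1992-02-01", "1992-03-01"]),
    name="UMTMVS",
)

# pct_change() returns fractions; multiply by 100 and drop the leading NaN.
pct = values.pct_change().mul(100).dropna()

lines = []
# Pair each month with the following month and its percent change.
for prev, curr, change in zip(values.index[:-1], values.index[1:], pct):
    direction = "increase" if change >= 0 else "decrease"
    lines.append(f"{abs(change):.2f}% {direction} from {prev:%m/%Y} to {curr:%m/%Y}")

print("\n".join(lines))
```

With the real data you would replace values with df.UMTMVS.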
Related
I have a pandas DataFrame whose row index and column labels are both a DatetimeIndex.
import pandas as pd
data = pd.DataFrame(
{
"PERIOD_END_DATE": pd.date_range(start="2018-01", end="2018-04", freq="M"),
"first": list("abc"),
"second": list("efg")
}
).set_index("PERIOD_END_DATE")
data.columns = pd.date_range(start="2018-01", end="2018-03", freq="M")
data
Unfortunately, I am getting a variety of errors when I try to pull out a value:
data['2018-01', '2018-02'] # InvalidIndexError: ('2018-01', '2018-02')
data['2018-01', ['2018-02']] # InvalidIndexError: ('2018-01', ['2018-02'])
data.loc['2018-01', '2018-02'] # TypeError: only integer scalar arrays can be converted to a scalar index
data.loc['2018-01', ['2018-02']] # KeyError: "None of [Index(['2018-02'], dtype='object')] are in the [columns]"
How do I extract a value from a DataFrame that uses a DatetimeIndex?
There are two issues:
Since you are using a DataFrame with a DatetimeIndex on both axes, the correct notation for selecting by row and column is:
a) data.loc[row_index_name, [column_index_name]]
or
b) data.loc[row_index_name, column_index_name]
depending on the type of output you want.
Notation (a) returns a Series, while notation (b) returns a scalar (here, a string).
In this case the index labels should not be truncated: specify the whole date string.
As such, your issue will be resolved with:
data.loc['2018-01-31', ['2018-01-31']] or data.loc['2018-01-31', '2018-01-31']
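If string keys keep behaving unexpectedly, one option is to pass pd.Timestamp objects so that no string parsing is involved. The sketch below rebuilds the frame from the question with explicitly listed month-end dates (an assumption made to avoid depending on the freq alias) and uses the names row and col purely for illustration.

```python
import pandas as pd

# Same shape as the frame in the question, with month-end dates written out.
dates = pd.to_datetime(["2018-01-31", "2018-02-28", "2018-03-31"])
data = pd.DataFrame({"first": list("abc"), "second": list("efg")}, index=dates)
data.index.name = "PERIOD_END_DATE"
data.columns = pd.to_datetime(["2018-01-31", "2018-02-28"])

row = pd.Timestamp("2018-01-31")
col = pd.Timestamp("2018-02-28")

value = data.loc[row, col]          # scalar lookup
as_series = data.loc[row, [col]]    # list of columns gives a one-element Series
same_value = data.at[row, col]      # .at is meant for single-cell access

print(value, same_value)
print(as_series)
```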
In my experience, once the date was set as the index I could not slice it or extract parts of it. I could extract the month and day only while it was still a regular column, not an index. I ran into this before and that was the solution.
I kept it as a regular column, extracted the Month, Day and Year into a separate column each, and only then assigned the date column as the index.
You are indexing with a period string (YYYY-MM) on columns that hold full dates.
Converting the columns to periods would help in this case:
data.columns = pd.period_range(start="2018-01", end="2018-02", freq='M')
data[['2018-01']]
2018-01
PERIOD_END_DATE
2018-01-31 a
2018-02-28 b
2018-03-31 c
Timestamp indexes are finicky. Pandas accepts each of the following expressions, but they return different types.
data.loc['2018-01',['2018-01-31']]
data.loc['2018-01-31',['2018-01-31']]
data.loc['2018-01','2018-01-31']
data.loc['2018-01-31','2018-01']
data.loc['2018-01-31','2018-01-31']
With pd.read_excel, pandas automatically parses the column names as dates, and parses them incorrectly. The dates are dd/mm/yy but they are parsed as mm/dd/yy.
The column names are dates.
Code used:
df = pd.read_excel('check.xlsx')
print(df)
The printed df has the dates parsed in the wrong format.
Here's the Excel file: https://docs.google.com/spreadsheets/d/1rgl0Je5EyxpBunk7FWPHcpZxXFdUZUni/edit?usp=drivesdk&ouid=109057655084381529864&rtpof=true&sd=true
The column names are in dd/mm/Y format.
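Use strftime with '%Y-%m-%d' to format the dates the way you want,
e.g.
import pandas as pd
df = pd.DataFrame({"Date": ["26-12-2007", "27-12-2007", "28-12-2007"]})
# dayfirst=True makes the dd-mm-yyyy reading explicit instead of relying on inference
df["Date"] = pd.to_datetime(df["Date"], dayfirst=True).dt.strftime('%Y-%m-%d')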
print(df)
Output:
Date
0 2007-12-26
1 2007-12-27
2 2007-12-28
You can also set the column labels to the values in the first row, e.g.
df.columns = df.iloc[0]
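If the column labels arrive as strings, another possibility is to parse them with an explicit day-first format so that day and month cannot be swapped. This is a sketch under that assumption; the two labels and the cell values are invented for illustration and are not taken from check.xlsx. If Excel stores the headers as real date cells, this step would not apply as written.

```python
import pandas as pd

# Hypothetical frame whose column labels are dd/mm/YYYY strings.
df = pd.DataFrame([[1, 2], [3, 4]], columns=["01/02/2021", "02/02/2021"])

# An explicit format removes any ambiguity between day and month.
df.columns = pd.to_datetime(df.columns, format="%d/%m/%Y")

# Render the parsed labels as ISO strings to check the result.
labels = df.columns.strftime("%Y-%m-%d").tolist()
print(labels)
```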
Using Python pandas, how can I change this data frame?
First, how do I copy the column name down to the other cell (blue)?
Second, how do I delete the row and the index column (orange)?
Third, how do I modify the date format (green)?
I would appreciate any feedback.
Update
df.iloc[1,1] = df.columns[0]
df = df.iloc[1:].reset_index(drop=True)
df.columns = df.iloc[0]
df = df.drop(df.index[0])
df = df.set_index('Date')
print(df.columns)
Question 1 - How to copy the column name to a column (Edit - Rename column)
To rename columns, assign to df.columns or use pandas.DataFrame.rename.
# Assign all names at once; the list needs 2 entries because you have 2 columns
df.columns = ['Date','Asia Pacific Equity Fund']
# Or rename selected columns using pandas.DataFrame.rename
df.rename(columns = {'Asia Pacific Equity Fund':'Date',"Unnamed: 1":"Asia Pacific Equity Fund"}, inplace = True)
df.columns returns all the column names of the DataFrame, and you can access each name by position.
Please refer to "Rename unnamed column pandas dataframe" for changing unnamed columns.
Question 2 - Delete a row
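# Keep the rows from position 1 onward; drop=True avoids adding the old index as a column
df = df.iloc[1:].reset_index(drop=True)
# Alternatively, remove specific rows by index label
df = df.drop([0,1]).reset_index(drop=True)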
Question 3 - Modify the date format
current_format = '%Y-%m-%d %H:%M:%S'
desired_format = "%Y-%m-%d"
# Option 1: let pandas infer the existing format
df['Date'] = pd.to_datetime(df['Date']).dt.strftime(desired_format)
# Option 2 (use instead of option 1): state the existing format explicitly
df['Date'] = pd.to_datetime(df['Date'], format=current_format).dt.strftime(desired_format)
# To update the date format of the index
df.index = pd.to_datetime(df.index, format=current_format).strftime(desired_format)
Please refer to the pandas.to_datetime documentation for more details.
I'm not sure I understand your questions. Do you actually want to change the dataframe, or only how it is printed/displayed?
Indexes can be changed with the methods .set_index() or .reset_index(), or dropped altogether. If you just want to remove the first digit from each index (that's what I understood from the orange column), you should create a list with the new index values and assign it to your dataframe.
Regarding the date format, it depends on what you want the new format to be. Take a look at Python's datetime module.
I would strongly suggest you take a closer look at the pandas features and documentation, and at how to handle a dataframe with this library. There are plenty of great sources a Google search away :)
Delete the first two rows using this.
Rename the second column using this.
Work with datetime format using the datetime package. Read about it here
I am trying to resample a time series to get annual maximum values for different time steps (e.g., 3h, 6h, etc.). The original series is at an hourly resolution. I first converted the date column to pandas datetime, used that column as the index, and resampled it. The final output should be the years and the corresponding maximum values at the desired time step. However, I am getting a list of NaN. I am not sure how to incorporate a range of time steps in my code. Here is my code so far for a 3H time step:
import pandas as pd
df = pd.read_csv('data.txt', delimiter = ";")
df = pd.DataFrame(df[['yyyymmddhh', 'rainfall']])
datin["yyyymmddhh"] = pd.to_datetime(datin["yyyymmddhh"], format="%Y%M%d%H")
datin.set_index("yyyymmddhh").resample("3H").sum().resample("Y").max()
stn_n;yyyymmddhh;rainfall
xyz;1980123123;-
xyz;1981010100;0.0
xyz;1981010101;0.0
xyz;1981010102;0.0
xyz;1981010103;0.0
xyz;1981010104;0.0
xyz;1981010105;0.0
xyz;1981010106;0.0
xyz;1981010107;0.0
xyz;1981010108;0.0
xyz;1981010109;0.4
xyz;1981010110;0.6
xyz;1981010111;0.1
xyz;1981010112;0.1
xyz;1981010113;0.0
xyz;1981010114;0.1
xyz;1981010115;0.6
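A possible way to approach this, offered as a sketch and not a confirmed fix: in the code above the frame is named df but later referred to as datin, the format string uses %M (minutes) where %m (month) is needed, and the "-" placeholder keeps rainfall from being read as numeric. The example below reads the sample rows from a string, treats "-" as missing, sums rainfall in fixed 3-hour bins, and takes the maximum per calendar year. The names rain, three_hourly and annual_max are illustrative.

```python
import io
import pandas as pd

raw = """stn_n;yyyymmddhh;rainfall
xyz;1980123123;-
xyz;1981010100;0.0
xyz;1981010101;0.0
xyz;1981010102;0.0
xyz;1981010103;0.0
xyz;1981010104;0.0
xyz;1981010105;0.0
xyz;1981010106;0.0
xyz;1981010107;0.0
xyz;1981010108;0.0
xyz;1981010109;0.4
xyz;1981010110;0.6
xyz;1981010111;0.1
xyz;1981010112;0.1
xyz;1981010113;0.0
xyz;1981010114;0.1
xyz;1981010115;0.6
"""

# Treat "-" as a missing value so the rainfall column is numeric.
df = pd.read_csv(io.StringIO(raw), sep=";", na_values="-")

# %m is the month directive; %M would be read as minutes.
df["yyyymmddhh"] = pd.to_datetime(df["yyyymmddhh"].astype(str), format="%Y%m%d%H")

rain = df.set_index("yyyymmddhh")["rainfall"]

# Sum rainfall over fixed 3-hour bins, then take the largest bin per year.
three_hourly = rain.resample(pd.Timedelta(hours=3)).sum()
annual_max = three_hourly.groupby(three_hourly.index.year).max()
print(annual_max)
```

For other time steps, the same two lines could be repeated in a loop over, say, hours in (3, 6, 12), changing only the Timedelta.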
please see the data here: screenshot from Google Colab
I am trying to assign the time 19:00 (7pm) to all records of the column "Beginn_Zeit". For now I put in the float 19.00. Now I need to convert it to a time format so that I can subsequently merge it with the date in the column "Beginn_Datum". Once I have this merged column, I need to copy its value into all records that have NaT in a different column, "Delta2".
dfd['Beginn'] = pd.to_datetime(df['Beginn'], dayfirst=True)
dfd['Ende'] = pd.to_datetime(df['Ende'], dayfirst=True)
dfd['Delta2'] = dfd['Ende']-dfd['Beginn']
dfd.Ende.fillna(dfd.Beginn,inplace=True)
dfd['Beginn_Datum'] = dfd['Beginn'].dt.date
dfd["Beginn_Zeit"] = 19.00
Edited to better match your updated example.
from datetime import time, datetime
dfd['Beginn_Zeit'] = time(19,0)
# create new column combining date and time
new_col = dfd.apply(lambda row: datetime.combine(row['Beginn_Datum'], row['Beginn_Zeit']), axis=1)
# replace null values in Delta2 with new combined dates
dfd.loc[dfd['Delta2'].isnull(), 'Delta2'] = new_col
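As an alternative to the row-wise apply, the 19:00 timestamp could probably be built in a vectorized way by truncating Beginn to midnight and adding 19 hours. This is a sketch only: the three Beginn values below are invented for illustration, and beginn_19 is a hypothetical name.

```python
import pandas as pd

# Hypothetical Beginn values, including one missing entry.
dfd = pd.DataFrame(
    {"Beginn": pd.to_datetime(["2021-03-01 08:15", None, "2021-04-15 10:30"])}
)

# normalize() resets the time to 00:00; adding 19 hours gives 19:00 on the same day.
beginn_19 = dfd["Beginn"].dt.normalize() + pd.Timedelta(hours=19)
print(beginn_19)
```

Missing values in Beginn stay NaT in the result.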