I'm having some difficulties converting an integer-type column of a Pandas DataFrame representing dates (in YYYYMMDD format) to a DateTime type column and then formatting the result in a specific way (e.g., 01JAN2021). Here's a sample DataFrame to get started:
import pandas as pd
df = pd.DataFrame(data={"CUS_DATE": [19550703, 19631212, 19720319, 19890205, 19900726]})
print(df)
CUS_DATE
0 19550703
1 19631212
2 19720319
3 19890205
4 19900726
Of all the things I have tried so far, the only one that worked was the following:
df["CUS_DATE"] = pd.to_datetime(df['CUS_DATE'], format='%Y%m%d')
print(df)
CUS_DATE
0 1955-07-03
1 1963-12-12
2 1972-03-19
3 1989-02-05
4 1990-07-26
But the above is not the result I'm looking for. My desired output should be the following:
CUS_DATE
0 03JUL1955
1 12DEC1963
2 19MAR1972
3 05FEB1989
4 26JUL1990
Any additional help would be appreciated.
Do this:
In [1347]: df["CUS_DATE"] = pd.to_datetime(df['CUS_DATE'], format='%Y%m%d')
In [1359]: df["CUS_DATE"] = df["CUS_DATE"].apply(lambda x: x.strftime('%d%b%Y').upper())
In [1360]: df
Out[1360]:
CUS_DATE
0 03JUL1955
1 12DEC1963
2 19MAR1972
3 05FEB1989
4 26JUL1990
You can use, in addition to pandas.to_datetime, the methods pandas.Series.dt.strftime and pandas.Series.str.upper:
df["CUS_DATE"] = (pd.to_datetime(df['CUS_DATE'], format='%Y%m%d')
.dt.strftime('%d%b%Y').str.upper())
# CUS_DATE
#0 03JUL1955
#1 12DEC1963
#2 19MAR1972
#3 05FEB1989
#4 26JUL1990
Also, check the Python strftime()/strptime() documentation, where you can find all the datetime format codes.
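For reference, the same format codes work in both directions; here is a quick single-value sketch using the plain Python datetime module:

from datetime import datetime

parsed = datetime.strptime('03JUL1955', '%d%b%Y')  # %d = day, %b = abbreviated month, %Y = 4-digit year
print(parsed.strftime('%d%b%Y').upper())           # formats back to '03JUL1955'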
I have a DataFrame; you can reproduce it by running:
import pandas as pd
from io import StringIO
df = """
case_id duration_time other_column
3456 1 random value 1
7891 3 ddd
1245 0 fdf
9073 null 111
"""
df= pd.read_csv(StringIO(df.strip()), sep='\s\s+', engine='python')
Now I first drop the null rows using dropna, then calculate the average of the remaining duration_time values as average_duration:
average_duration = df.duration_time.dropna().mean()
The output is average_duration:
1.3333333333333333
My question is how can I convert the result to something similar as:
average_duration = '1 day,8 hours'
Since 1.33 days is 1 day and 7.92 hours
With pandas, you can use Timedelta:
td = pd.Timedelta(days=average_duration)
d, h = td.components.days, td.components.hours
Output:
print("average_duration:", f"{d} day(s), {h} hour(s)")
average_duration: 1 day(s), 8 hour(s)
If you need to store the resulting string in a variable, use this:
average_duration = f"{d} day(s), {h} hour(s)"
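If you need this for a whole column of durations rather than a single average, pd.to_timedelta vectorizes the same idea; a sketch, assuming duration_time holds day counts:

durations = pd.to_timedelta(df['duration_time'].dropna(), unit='D')
print(durations.dt.components[['days', 'hours']])  # per-row breakdown into days and hours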
There are third party libraries that might do a better job, but it is trivial to define your own function:
def time_passed(duration: float):
    days = int(duration)
    hours = round((duration - days) * 24)
    return f"{days} days, {hours} hours"
print(time_passed(1.3333333333333333))
This should print
1 days, 8 hours
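If the slightly off "1 days" bothers you, a small variant of the same function (my own sketch, not part of the original answer) handles pluralization:

def time_passed(duration: float) -> str:
    days = int(duration)
    hours = round((duration - days) * 24)
    # append 's' only when the count differs from 1
    day_label = "day" if days == 1 else "days"
    hour_label = "hour" if hours == 1 else "hours"
    return f"{days} {day_label}, {hours} {hour_label}"

print(time_passed(1.3333333333333333))  # 1 day, 8 hours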
I read csv files into dataframe using
from glob import glob
import pandas as pd
def read_file(f):
    df = pd.read_csv(f)
    df['ticker'] = f.split('.')[0]
    return df
df = pd.concat([read_file(f) for f in glob('*.csv')])
df = df.set_index(['Date','ticker'])[['Close']].unstack()
And got the following dataframe:
Close
ticker AAPL AMD BIDU GOOGL IXIC
Date
2011-06-01 12.339643 8.370000 132.470001 263.063049 2769.189941
...
Now I would like to use the 'ticker' labels to reindex another random dataframe, created by:
import numpy as np

data = np.random.random((df.shape[1], 100))
df1 = pd.DataFrame(data)
which looks like:
0 1 2 3 4 5 6 \...
0 0.493036 0.114539 0.862388 0.156381 0.030477 0.094902 0.132268
1 0.486184 0.483585 0.090874 0.751288 0.042761 0.150361 0.781567
2 0.318586 0.078662 0.238091 0.963334 0.815566 0.274273 0.320380
3 0.708489 0.354177 0.285239 0.565553 0.212956 0.275228 0.597578
4 0.150210 0.423037 0.785664 0.956781 0.894701 0.707344 0.883821
5 0.005920 0.115123 0.334728 0.874415 0.537229 0.557406 0.338663
6 0.066458 0.189493 0.887536 0.915425 0.513706 0.628737 0.132074
7 0.729326 0.241142 0.574517 0.784602 0.287874 0.402234 0.926567
8 0.284867 0.996575 0.002095 0.325658 0.525330 0.493434 0.701801
9 0.355176 0.365045 0.270155 0.681947 0.153718 0.644909 0.952764
10 0.352828 0.557434 0.919820 0.952302 0.941161 0.246068 0.538714
11 0.465394 0.101752 0.746205 0.897994 0.528437 0.001023 0.979411
I tried
df1 = df1.set_index(df.columns.values)
but it seems my df only has one level of index since the error says
IndexError: Too many levels: Index has only 1 level, not 2
But if I check the index with df.index, it gives me the Date. Can someone help me solve this problem?
You can get the column labels of a particular level of the MultiIndex in df by MultiIndex.get_level_values, as follows:
df_ticker = df.columns.get_level_values('ticker')
Then, if df1 has the same number of columns, you can copy the extracted labels to df1 with:
df1.columns = df_ticker
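Note, though, that data = np.random.random((df.shape[1], 100)) puts one row (not one column) per ticker in df1, so for the frame shown in the question the labels arguably belong on the index; a sketch of that variant:

df_ticker = df.columns.get_level_values('ticker')
df1 = df1.set_index(df_ticker)  # one row per ticker, 100 random columns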
I have two data frames that collect the historical price series of two different stocks. Applying describe() I noticed that the first stock has 1291 rows while the second has 1275. This difference is due to the fact that the two securities are listed on different stock exchanges and therefore differ on some dates.

What I would like to do is keep the two dataframes separate, but make sure that in the first dataframe all rows whose dates are not present in the second dataframe are deleted, in order to have a perfect match between the two dataframes for the analyses. I have read that there are functions such as merge() or join(), but I have not been able to understand how to use them (if these are indeed the correct functions). I thank those who will spend some of their time answering my question.
"ValueError: all the input array dimensions for the concatenation axis must match exactly, but along dimension 1, the array at index 0 has size 1275 and the array at index 1 has size 1291"
Thank you
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import pandas_datareader as web
from scipy import stats
import seaborn as sns
pd.options.display.min_rows = None
pd.options.display.max_rows = None
tickers = ['DISW.MI','IXJ','NRJ.PA','SGOL','VDC','VGT']
wts = [0.19,0.18,0.2,0.08,0.09,0.26]
price_data = web.get_data_yahoo(tickers,
                                start='2016-01-01',
                                end='2021-01-01')
price_data = price_data['Adj Close']
ret_data = price_data.pct_change()[1:]
port_ret = (ret_data * wts).sum(axis=1)
benchmark_price = web.get_data_yahoo('ACWE.PA',
                                     start='2016-01-01',
                                     end='2021-01-01')
benchmark_ret = benchmark_price["Adj Close"].pct_change()[1:].dropna()
# From here on I get the error
sns.regplot(benchmark_ret.values,
            port_ret.values)
plt.xlabel("Benchmark Returns")
plt.ylabel("Portfolio Returns")
plt.title("Portfolio Returns vs Benchmark Returns")
plt.show()
(beta, alpha) = stats.linregress(benchmark_ret.values,
                                 port_ret.values)[0:2]
print("The portfolio beta is", round(beta, 4))
Let's consider a toy example.
df1 consists of 6 days of data and df2 consists of 5 days of data.
From what I understand, you want df1 to also end up with 5 days of data, with dates matching df2.
df1
df1 = pd.DataFrame({
    'date': pd.date_range('2021-05-17', periods=6),
    'px': np.random.rand(6)
})
df1
date px
0 2021-05-17 0.054907
1 2021-05-18 0.192294
2 2021-05-19 0.214051
3 2021-05-20 0.623223
4 2021-05-21 0.004627
5 2021-05-22 0.127086
df2
df2 = pd.DataFrame({
    'date': pd.date_range('2021-05-17', periods=5),
    'px': np.random.rand(5)
})
df2
date px
0 2021-05-17 0.650976
1 2021-05-18 0.393061
2 2021-05-19 0.985700
3 2021-05-20 0.879786
4 2021-05-21 0.463206
Code
To keep only the rows of df1 whose dates also appear in df2:
df1 = df1[df1.date.isin(df2.date)]
Output df1
date px
0 2021-05-17 0.054907
1 2021-05-18 0.192294
2 2021-05-19 0.214051
3 2021-05-20 0.623223
4 2021-05-21 0.004627
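The merge() mentioned in the question would work too; an inner join on the date column keeps exactly the matching rows (a sketch using the same toy frames):

df1_matched = df1.merge(df2[['date']], on='date', how='inner')

Unlike the isin filter, an inner merge would duplicate rows if df2 contained repeated dates, so isin is the safer choice for pure filtering.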
I am working on the df below but am unable to apply a filter on the percentage field, although it works in normal Excel.
I need to apply the filter condition > 100.00% to that particular field using pandas.
I tried reading it from HTML, CSV and Excel in pandas but was unable to apply the condition.
It requires a float conversion, but that does not work with the given data.
I am assuming that the values you have are read as strings in Pandas:
data = ['4,700.00%', '3,900.00%', '1,500.00%', '1,400.00%', '1,200.00%', '0.15%', '0.13%', '0.12%', '0.10%', '0.08%', '0.07%']
df = pd.DataFrame(data)
df.columns = ['data']
printing the df:
data
0 4,700.00%
1 3,900.00%
2 1,500.00%
3 1,400.00%
4 1,200.00%
5 0.15%
6 0.13%
7 0.12%
8 0.10%
9 0.08%
10 0.07%
then:
df['data'] = df['data'].str.rstrip('%').str.replace(',','').astype('float')
df_filtered = df[df['data'] > 100]
Results:
data
0 4700.0
1 3900.0
2 1500.0
3 1400.0
4 1200.0
I have used the code above as well: .str.rstrip('%') and .str.replace(',','').astype('float'). It is working fine.
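If you would rather keep the original percent strings visible in the frame, a variation on the same idea is to filter on a temporary numeric Series (a sketch):

pct = df['data'].str.rstrip('%').str.replace(',', '', regex=False).astype(float)
df_filtered = df[pct > 100]  # rows keep their original '4,700.00%' formatting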
I have a file with rows like this:
blablabla (CODE1513A15), 9.20, 9.70, 0
I want pandas to read each column, but from the first column I am interested only in the data between brackets, and I want to extract it into additional columns. Therefore, I tried using a Pandas converter:
import pandas as pd
from datetime import datetime
import string
code = 'CODE'
code_parser = lambda x: {
    'date': datetime(int(x.split('(', 1)[1].split(')')[0][len(code):len(code)+2]),
                     string.uppercase.index(x.split('(', 1)[1].split(')')[0][len(code)+4:len(code)+5]) + 1,
                     int(x.split('(', 1)[1].split(')')[0][len(code)+2:len(code)+4])),
    'value': float(x.split('(', 1)[1].split(')')[0].split('-')[0][len(code)+5:])
}
column_names = ['first_column', 'second_column', 'third_column', 'fourth_column']
pd.read_csv('myfile.csv', usecols=[0,1,2,3], names=column_names, converters={'first_column': code_parser})
With this code, I can convert the text between brackets to a dict containing a datetime object and a value.
If the code is CODE1513A15 as in the sample, it will be built from:
a known code (in this example, 'CODE')
two digits for the year
two digits for the day of month
A letter from A to L, which is the month (A for January, B for February, ...)
A float value
I tested the lambda function and it correctly extracts the information I want, and its output is a dict {'date': datetime(15, 1, 13), 'value': 15}. Nevertheless, if I print the result of the pd.read_csv method, the 'first_column' is a dict, while I was expecting it to be replaced by two columns called 'date' and 'value':
first_column second_column third_column fourth_column
0 {u'date':13-01-2015, u'value':15} 9.20 9.70 0
1 {u'date':14-01-2015, u'value':16} 9.30 9.80 0
2 {u'date':15-01-2015, u'value':12} 9.40 9.90 0
What I want to get is:
date value second_column third_column fourth_column
0 13-01-2015 15 9.20 9.70 0
1 14-01-2015 16 9.30 9.80 0
2 15-01-2015 12 9.40 9.90 0
Note: I don't care how the date is formatted, this is only a representation of what I expect to get.
Any idea?
I think it's better to do things step by step.
# read data into a data frame
column_names = ['first_column', 'second_column', 'third_column', 'fourth_column']
df = pd.read_csv(data, names=column_names)
# extract values using a regular expression, which is much more robust
# than string splitting
tmp = df.first_column.str.extract(r'CODE(\d{2})(\d{2})([A-L])(\d+)')
tmp.columns = ['year', 'day', 'month', 'value']
tmp['month'] = tmp['month'].apply(lambda m: str(ord(m) - 64))
Sample output:
print tmp
year day month value
0 15 13 1 15
Then transform your original data frame into the format that you want:
df['date'] = (tmp['year'] + tmp['day'] + tmp['month']).apply(lambda d: datetime.strptime(d, '%y%d%m'))
df['value'] = tmp['value']
del df['first_column']
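One loose end: str.extract returns strings, so value may still need a numeric cast, e.g. (my addition, not in the original steps):

df['value'] = df['value'].astype(int)  # '15' -> 15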
Is the conversion inside read_csv mandatory? If not, passing a function that returns a Series to apply yields a DataFrame.
df
first_column second_column third_column fourth_column
0 blablabla (CODE1513A15) 9.2 9.7 0
1 blablabla (CODE1514A16) 9.2 9.7 0
code_parser = lambda x: pd.Series({
    'date': datetime(2000 + int(x.split('(', 1)[1].split(')')[0][len(code):len(code)+2]),
                     string.uppercase.index(x.split('(', 1)[1].split(')')[0][len(code)+4:len(code)+5]) + 1,
                     int(x.split('(', 1)[1].split(')')[0][len(code)+2:len(code)+4])),
    'value': float(x.split('(', 1)[1].split(')')[0].split('-')[0][len(code)+5:])
})
df['first_column'].apply(code_parser)
date value
0 2015-01-13 15
1 2015-01-14 16
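To reach the exact layout asked for in the question, the parsed columns can then be joined back and the original column dropped; a short sketch building on the apply above:

parsed = df['first_column'].apply(code_parser)
df = df.join(parsed)    # adds the 'date' and 'value' columns
del df['first_column']  # keep only the extracted fields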