I have a DataFrame; you can reproduce it by running:
import pandas as pd
from io import StringIO
df = """
case_id  duration_time  other_column
3456     1              random value 1
7891     3              ddd
1245     0              fdf
9073     null           111
"""
df = pd.read_csv(StringIO(df.strip()), sep=r'\s\s+', engine='python')
Now I first drop the null rows using the dropna function, then calculate the average of the remaining duration_time values as average_duration:
average_duration = df.duration_time.dropna().mean()
The output is average_duration:
1.3333333333333333
My question is: how can I convert the result to something like:
average_duration = '1 day,8 hours'
Since 1.33 days is 1 day and 7.92 hours
With pandas, you can use Timedelta:
td = pd.Timedelta(days=average_duration)
d, h = td.components.days, td.components.hours
print("average_duration:", f"{d} day(s), {h} hour(s)")
Output:
average_duration: 1 day(s), 8 hour(s)
If you need to assign the resulting string to a variable, use this:
average_duration = f"{d} day(s), {h} hour(s)"
There are third-party libraries that might do a better job, but it is trivial to define your own function:
def time_passed(duration: float):
    days = int(duration)
    hours = round((duration - days) * 24)
    return f"{days} days, {hours} hours"
print(time_passed(1.3333333333333333))
This should print
1 days, 8 hours
If the difference between the current year and the year in the corresponding column is 5 or more, the function is designed to output 1, but NaN values come out instead.
import pandas as pd
from datetime import datetime
today = datetime.today()
def time(x):
    if today.year - x.year > 5:
        x = 1
        return x
    else:
        x = 0
        return x
df['VIP'] = df[condition]['DaysSinceJoined'].apply(time)
df['VIP']
Instead of the expected values, I get:
0 NaN
1 NaN
2 NaN
3 NaN
4 NaN
..
2235 NaN
2236 NaN
2237 NaN
2238 NaN
2239 NaN
Name: VIP, Length: 2240, dtype: float64
The function works just fine. The issue might lie within your initial condition:
First, let's generate a bit of sample data:
foo = pd.DataFrame({'time':['1979-11-10','1962-07-22','1987-09-16','2020-09-16']})
from datetime import datetime
today = datetime.today()
def time(x):
    if today.year - x.year > 5:
        return 1
    else:
        return 0
First, we make sure it's not a data format issue, as I suggested above:
foo['VIP'] = foo['time'].apply(time)
'str' object has no attribute 'year'
We fix this by converting the dates to datetime:
foo['time'] = pd.to_datetime(foo['time'])
Let's test the function:
foo['VIP'] = foo['time'].apply(time)
time VIP
0 1979-11-10 1
1 1962-07-22 1
2 1987-09-16 1
3 2020-09-16 0
All good.
Now let's apply some random condition:
foo['VIP'] = foo[foo['time'].dt.year > 1980]['time'].apply(time)
time VIP
0 1979-11-10 NaN
1 1962-07-22 NaN
2 1987-09-16 1.0
3 2020-09-16 0.0
The reason is that you first filter your dataframe down to a smaller subset and then feed only those rows to your function. The rows that are filtered out are never processed, so they never get return values.
I suggest you do this with .loc instead:
foo.loc[(today.year - foo['time'].dt.year > 5) & (other_condition_here), 'VIP'] = 1
foo.loc[(today.year - foo['time'].dt.year <= 5) & (other_condition_here), 'VIP'] = 0
For more about .loc, see the documentation.
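A runnable sketch on the toy frame above (other_condition_here is a placeholder; substitute your real filter). Rows that match neither branch keep NaN:
import pandas as pd
from datetime import datetime

today = datetime.today()
foo = pd.DataFrame({'time': ['1979-11-10', '1962-07-22', '1987-09-16', '2020-09-16']})
foo['time'] = pd.to_datetime(foo['time'])

# hypothetical second condition, for illustration only
other_condition_here = foo['time'].dt.year > 1900

foo.loc[(today.year - foo['time'].dt.year > 5) & other_condition_here, 'VIP'] = 1
foo.loc[(today.year - foo['time'].dt.year <= 5) & other_condition_here, 'VIP'] = 0
print(foo)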
I suspect that when you use .apply here, it passes several arguments. Use map instead:
df['VIP'] = df[condition]['DaysSinceJoined'].map(time)
or:
df['VIP'] = df[condition].apply(lambda x: time(x['DaysSinceJoined']))
If that doesn't work, show us some sample data.
I have a .csv file that I group by year so that it gives me the maximum, minimum, and average values as a result:
import pandas as pd
DF = pd.read_csv("PJME_hourly.csv")
for i in range(2002, 2019):
    neblina = DF[DF.Datetime.str.contains(str(i))]
    dateframe = neblina.agg({"PJME_MW": ['max', 'min', 'mean']})
    print(i, pd.concat([dateframe], axis=0, sort=False))
Its output is as follows:
2002 PJME_MW
max 55934.000000
min 19247.000000
mean 31565.617106
2003 PJME_MW
max 53737.000000
min 19414.000000
mean 31698.758621
2004 PJME_MW
max 51962.000000
min 19543.000000
mean 32270.434867
I would like to know how I can join it all into a single column (PJME_MW), with each group of operations (max, min, mean) identified by the year it corresponds to.
If you convert the dates with to_datetime(), you can group them using the dt.year accessor:
df = pd.read_csv('PJME_hourly.csv')
df.Datetime = pd.to_datetime(df.Datetime)
df.groupby(df.Datetime.dt.year).agg(['min', 'max', 'mean'])
Toy example:
df = pd.DataFrame({'Datetime': ['2019-01-01','2019-02-01','2020-01-01','2020-02-01','2021-01-01'], 'PJME_MV': [3,5,30,50,100]})
# Datetime PJME_MV
# 0 2019-01-01 3
# 1 2019-02-01 5
# 2 2020-01-01 30
# 3 2020-02-01 50
# 4 2021-01-01 100
df.Datetime = pd.to_datetime(df.Datetime)
df.groupby(df.Datetime.dt.year).agg(['min', 'max', 'mean'])
# PJME_MV
# min max mean
# Datetime
# 2019 3 5 4
# 2020 30 50 40
# 2021 100 100 100
The code could be optimized, but it works as it is now. Change this part of your code:
for i in range(2002, 2019):
    neblina = DF[DF.Datetime.str.contains(str(i))]
    dateframe = neblina.agg({"PJME_MW": ['max', 'min', 'mean']})
    print(i, pd.concat([dateframe], axis=0, sort=False))
Use this instead:
aggs = ['max', 'min', 'mean']
df_group = df.groupby('Datetime')['PJME_MW'].agg(aggs).reset_index()
out_columns = ['agg_year', 'PJME_MW']
out = []
for agg in aggs:
    # build a fresh frame each iteration; reusing a single frame would
    # append the same (mutated) object three times
    aux = pd.DataFrame(columns=out_columns)
    aux['agg_year'] = agg + '_' + df_group['Datetime'].astype(str)
    aux['PJME_MW'] = df_group[agg]
    out.append(aux)
df_out = pd.concat(out)
Edit: The concatenation form has been changed.
Final edit: I didn't understand the whole problem, sorry. You don't need the code after the groupby function.
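In other words, once the dates are converted, the whole loop collapses into one groupby. A minimal sketch of the single-column shape the question asks for (stacking the aggregates into the index), assuming the column names from the question:
import pandas as pd

df = pd.read_csv('PJME_hourly.csv')
df.Datetime = pd.to_datetime(df.Datetime)

# one row per (year, aggregate) pair, a single PJME_MW column
yearly = df.groupby(df.Datetime.dt.year)['PJME_MW'].agg(['max', 'min', 'mean'])
single_column = yearly.stack().rename('PJME_MW')
print(single_column)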
I'm having some difficulties converting an integer-type column of a Pandas DataFrame representing dates (in YYYYMMDD format) to a DateTime type column and parsing the result in a specific format (e.g., 01JAN2021). Here's a sample DataFrame to get started:
import pandas as pd
df = pd.DataFrame(data={"CUS_DATE": [19550703, 19631212, 19720319, 19890205, 19900726]})
print(df)
CUS_DATE
0 19550703
1 19631212
2 19720319
3 19890205
4 19900726
Of all the things I have tried so far, the only one that worked was the following:
df["CUS_DATE"] = pd.to_datetime(df['CUS_DATE'], format='%Y%m%d')
print(df)
CUS_DATE
0 1955-07-03
1 1963-12-12
2 1972-03-19
3 1989-02-05
4 1990-07-26
But the above is not the result I'm looking for. My desired output should be the following:
CUS_DATE
0 03JUL1955
1 12DEC1963
2 19MAR1972
3 05FEB1989
4 26JUL1990
Any additional help would be appreciated.
Do this:
In [1347]: df["CUS_DATE"] = pd.to_datetime(df['CUS_DATE'], format='%Y%m%d')
In [1359]: df["CUS_DATE"] = df["CUS_DATE"].apply(lambda x: x.strftime('%d%b%Y').upper())
In [1360]: df
Out[1360]:
CUS_DATE
0 03JUL1955
1 12DEC1963
2 19MAR1972
3 05FEB1989
4 26JUL1990
You can use, in addition to pandas.to_datetime, the methods pandas.Series.dt.strftime and pandas.Series.str.upper:
df["CUS_DATE"] = (pd.to_datetime(df['CUS_DATE'], format='%Y%m%d')
.dt.strftime('%d%b%Y').str.upper())
# CUS_DATE
#0 03JUL1955
#1 12DEC1963
#2 19MAR1972
#3 05FEB1989
#4 26JUL1990
Also, check this documentation where you can find the datetime format codes.
Suppose df.bun (df is a Pandas dataframe) is a multi-index (date and name) series with the variable being category values written as strings:
date name values
20170331 A122630 stock-a
A123320 stock-a
A152500 stock-b
A167860 bond
A196030 stock-a
A196220 stock-a
A204420 stock-a
A204450 curncy-US
A204480 raw-material
A219900 stock-a
How can I make this represent the total counts for the same date and their percentages, to make a table like the one below for each date?
date variable counts Percentage
20170331 stock 7 70%
bond 1 10%
raw-material 1 10%
curncy 1 10%
I have tried print(df.groupby('bun').count()) as an approach to this question, but it falls short.
cf) Before getting df.bun, I used the following code to import a nested dictionary into a Pandas dataframe.
import numpy as np
import pandas as pd
result = pd.DataFrame()
origDict = np.load("Hannah Lee.npy")
for item in range(len(origDict)):
    newdict = {(k1, k2): v2 for k1, v1 in origDict[item].items() for k2, v2 in origDict[item][k1].items()}
    df = pd.DataFrame([newdict[i] for i in sorted(newdict)],
                      index=pd.MultiIndex.from_tuples([i for i in sorted(newdict.keys())]))
print(df.bun)
I believe you need SeriesGroupBy.value_counts:
g = df.groupby('date')['values']
df = pd.concat([g.value_counts(),
g.value_counts(normalize=True).mul(100)],axis=1, keys=('counts','percentage'))
print (df)
counts percentage
date values
20170331 stock-a 6 60.0
bond 1 10.0
curncy-US 1 10.0
raw-material 1 10.0
stock-b 1 10.0
Another solution uses size for the counts and then divides by a new Series created with transform and sum:
df2 = df.reset_index().groupby(['date', 'values']).size().to_frame('count')
df2['percentage'] = df2['count'].div(df2.groupby('date')['count'].transform('sum')).mul(100)
print (df2)
count percentage
date values
20170331 bond 1 10.0
curncy-US 1 10.0
raw-material 1 10.0
stock-a 6 60.0
stock-b 1 10.0
The difference between the solutions is that the first sorts by counts within each group, while the second sorts by the MultiIndex.
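Note that your desired table also merges stock-a/stock-b into stock and curncy-US into curncy; neither solution above does that. If that is intended, one way is to normalize the category first (a sketch, assuming the date index level is named 'date' and that only the stock and curncy categories carry suffixes):
# collapse e.g. stock-a/stock-b -> stock, curncy-US -> curncy;
# bond and raw-material are left unchanged
base = df['values'].str.replace(r'^(stock|curncy)-.*', r'\1', regex=True)
g = base.groupby(level='date')
df3 = pd.concat([g.value_counts(),
                 g.value_counts(normalize=True).mul(100)],
                axis=1, keys=('counts', 'percentage'))
print(df3)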
I have a file with rows like this:
blablabla (CODE1513A15), 9.20, 9.70, 0
I want pandas to read each column, but from the first column I am interested only in the data between brackets, and I want to extract it into additional columns. Therefore, I tried using a Pandas converter:
import pandas as pd
from datetime import datetime
import string
code = 'CODE'
code_parser = lambda x: {
'date': datetime(int(x.split('(', 1)[1].split(')')[0][len(code):len(code)+2]), string.uppercase.index(x.split('(', 1)[1].split(')')[0][len(code)+4:len(code)+5])+1, int(x.split('(', 1)[1].split(')')[0][len(code)+2:len(code)+4])),
'value': float(x.split('(', 1)[1].split(')')[0].split('-')[0][len(code)+5:])
}
column_names = ['first_column', 'second_column', 'third_column', 'fourth_column']
pd.read_csv('myfile.csv', usecols=[0,1,2,3], names=column_names, converters={'first_column': code_parser})
With this code, I can convert the text between brackets to a dict containing a datetime object and a value.
If the code is CODE1513A15 as in the sample, it will be built from:
a known code (in this example, 'CODE')
two digits for the year
two digits for the day of month
A letter from A to L, which is the month (A for January, B for February, ...)
A float value
I tested the lambda function and it correctly extracts the information I want, and its output is a dict {'date': datetime(15, 1, 13), 'value': 15}. Nevertheless, if I print the result of the pd.read_csv method, the 'first_column' is a dict, while I was expecting it to be replaced by two columns called 'date' and 'value':
first_column second_column third_column fourth_column
0 {u'date':13-01-2015, u'value':15} 9.20 9.70 0
1 {u'date':14-01-2015, u'value':16} 9.30 9.80 0
2 {u'date':15-01-2015, u'value':12} 9.40 9.90 0
What I want to get is:
date value second_column third_column fourth_column
0 13-01-2015 15 9.20 9.70 0
1 14-01-2015 16 9.30 9.80 0
2 15-01-2015 12 9.40 9.90 0
Note: I don't care how the date is formatted, this is only a representation of what I expect to get.
Any idea?
I think it's better to do things step by step.
# read data into a data frame
column_names = ['first_column', 'second_column', 'third_column', 'fourth_column']
df = pd.read_csv(data, names=column_names)
# extract values using a regular expression, which is much more robust
# than string splitting
tmp = df.first_column.str.extract(r'CODE(\d{2})(\d{2})([A-L])(\d+)')
tmp.columns = ['year', 'day', 'month', 'value']
tmp['month'] = tmp['month'].apply(lambda m: str(ord(m) - 64))
Sample output:
print(tmp)
year day month value
0 15 13 1 15
Then transform your original data frame into the format that you want:
df['date'] = (tmp['year'] + tmp['day'] + tmp['month']).apply(lambda d: datetime.strptime(d, '%y%d%m'))
df['value'] = tmp['value']
del df['first_column']
Is the conversion inside read_csv mandatory? If not, passing a function which returns a Series to apply yields a DataFrame:
df
first_column second_column third_column fourth_column
0 blablabla (CODE1513A15) 9.2 9.7 0
1 blablabla (CODE1514A16) 9.2 9.7 0
code_parser = lambda x: pd.Series({
'date': datetime(2000+int(x.split('(', 1)[1].split(')')[0][len(code):len(code)+2]), string.uppercase.index(x.split('(', 1)[1].split(')')[0][len(code)+4:len(code)+5])+1, int(x.split('(', 1)[1].split(')')[0][len(code)+2:len(code)+4])),
'value': float(x.split('(', 1)[1].split(')')[0].split('-')[0][len(code)+5:])
})
df['first_column'].apply(code_parser)
date value
0 2015-01-13 15
1 2015-01-14 16
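If you also need the remaining columns from the file, you can join the parsed result back onto the original frame and drop the raw column (a short sketch following the same approach):
parsed = df['first_column'].apply(code_parser)
df = df.join(parsed).drop('first_column', axis=1)
# columns are now: date, value, second_column, third_column, fourth_column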