Update dataframe value with other dataframe value if condition met? - python

I have two DataFrames. I want to assign df1.date = df2.start_date if df1.date <= df2.end_date.
import pandas as pd
df1 = pd.DataFrame({"date": ['2020-12-23 18:20:37', '2021-08-20 12:17:41.487'], "result": ['pass', 'fail']})
df2 = pd.DataFrame({"start_date": ['2021-08-19 12:17:41.487', '2021-08-12 12:17:41.487', '2021-08-26 12:17:41.487'],
                    "end_date": ['2021-08-26 12:17:41.487', '2021-08-19 12:17:41.487', '2021-09-02 12:17:41.487']})
I've only shown a couple of rows here, but in reality I'm running this on 100,000 rows. How do I achieve this?

Assuming I'm understanding your question correctly, and that both your DataFrames line up with each other, you could loop through each row and compare across to the other df. However, if you have thousands of records this could take some time.
import datetime
import pandas as pd

df1 = pd.DataFrame({"date": [datetime.date(2014, 12, 29), datetime.date(2015, 1, 26), datetime.date(2015, 2, 26), datetime.date(2015, 3, 8), datetime.date(2015, 4, 10)],
                    "result": ['pass', 'fail', 'fail', 'pass', 'pass']})
df2 = pd.DataFrame({'start_date': [datetime.date(2015, 1, 1), datetime.date(2015, 2, 1), datetime.date(2015, 3, 1), datetime.date(2015, 4, 1), datetime.date(2015, 5, 1)],
                    'end_date': [datetime.date(2015, 1, 25), datetime.date(2015, 2, 20), datetime.date(2015, 3, 15), datetime.date(2015, 4, 24), datetime.date(2015, 5, 23)]})

for i in range(len(df1)):
    if df1.date[i] <= df2.end_date[i]:
        df1.date[i] = df2.start_date[i]
But again, this assumes that both DataFrames have the same length and that it's a direct row-by-row comparison across.

We can make use of numpy's where.
# df1.date = df2.start_date if df1.date <= df2.end_date
import numpy as np
df1.date = np.where(df1.date <= df2.end_date, df2.start_date, df1.date)
New df1
date result
0 2015-01-01 pass
1 2015-02-01 pass
2 2015-03-01 fail
3 2015-04-01 fail
4 2015-05-01 pass
Data used
df1 = {"date": [ '2014-12-29', '2015-01-26', '2015-02-26', '2015-03-08', '2015-04-10' ],
"result": [ 'pass', 'pass', 'fail', 'fail', 'pass' ]}
df2 = {"start_date": [ '2015-01-01', '2015-02-01', '2015-03-01', '2015-04-01', '2015-05-01' ],
"end_date": [ '2015-01-25', '2015-02-20', '2015-03-15', '2015-04-24', '2015-05-23' ]}
df1 = pd.DataFrame(data = df1)
df2 = pd.DataFrame(data = df2)
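A hedged side note (not from the original answers): np.where compares the two columns row by row, so it assumes df1 and df2 have the same length and matching row order. If the date columns are plain strings, converting them to datetime first makes the comparison by date rather than by string. A minimal sketch using the df1/df2 built above:
import numpy as np
import pandas as pd
# assumes df1 and df2 from the "Data used" block above
df1['date'] = pd.to_datetime(df1['date'])
df2['start_date'] = pd.to_datetime(df2['start_date'])
df2['end_date'] = pd.to_datetime(df2['end_date'])
# row-by-row comparison, relying on df1 and df2 being index-aligned
df1['date'] = np.where(df1['date'] <= df2['end_date'], df2['start_date'], df1['date'])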

Related

Identifying rows with same positive and negative values with particular order in pandas dataframe

I am looking for an efficient way to flag an order as returned when both a positive and a negative entry are present in the data, given that the negative entry occurs on the same day as, or a later date than, the positive value.
import pandas as pd
import datetime
data = [['US', '100', 'Ven1', datetime.datetime(2020, 5, 17), -100], ['US', '100', 'Ven1', datetime.datetime(2020, 5, 19), 100], ['US', '100', 'Ven1', datetime.datetime(2020, 5, 25), -100], ['CA', 'AR-100', '1276238', datetime.datetime(2020, 3, 25), 10], ['UK', 'UKV', 'Daily', datetime.datetime(2022, 1, 12), 2500], ['UK', 'UKV', 'Daily', datetime.datetime(2022, 1, 12), -2500], ['UK', 'UKV', 'Daily', datetime.datetime(2022, 1, 14), 2500]]
# Create the pandas DataFrame
df = pd.DataFrame(data, columns=['region', 'productid', 'vendor', 'date', 'qty'])
Expected output: the -100 from 2020-05-17 is not flagged, because no corresponding positive value is present in the dataset before this negative entry.
My current solution sorts and groups the data and then checks values row by row inside loops. This works, but I don't think it's the best way to solve this problem.
match = []
df = df.sort_values(['date', 'qty'], ascending=[True, False])
grouped = df.groupby(['region', 'productid', 'vendor'])
for name, group in grouped:
    ...
    for i, outer_row in group.iterrows():
        for j, inner_row in group.iterrows():
            if (j > i) & (outer_row.qty + inner_row.qty == 0) & (i not in match) & (j not in match):
                match.append(i)
                match.append(j)
                ...
                df['returned'].at[i] = 'Y'
                df['returned'].at[j] = 'Y'
I have found some solutions, but they do not consider order when finding the rows with positive and negative values. Any suggestions would be really appreciated.
You can create groups with the same absolute qty and then apply your logic:
import pandas as pd
import datetime
data = [['US', '100', 'Ven1', datetime.datetime(2020, 5, 17), -100],
['US', '100', 'Ven1', datetime.datetime(2020, 5, 19), 100],
['US', '100', 'Ven1', datetime.datetime(2020, 5, 25), -100],
['CA', 'AR-100', '1276238', datetime.datetime(2020, 3, 25), 10],
['UK', 'UKV', 'Daily', datetime.datetime(2022, 1, 12), 2500],
['UK', 'UKV', 'Daily', datetime.datetime(2022, 1, 12), -2500],
['UK', 'UKV', 'Daily', datetime.datetime(2022, 1, 14), 2500]]
df = pd.DataFrame(data, columns=['region', 'productid', 'vendor', 'date', 'qty'])
df['abs_qty'] = df['qty'].abs()
df['returned'] = False
def return_logic(d):
    if not len(d) > 1 or not (d['qty'] > 0).any():
        return d['returned']
    g = d.sort_values(['date', 'qty'], ascending=[True, False])
    g = g.loc[g[g['qty'] > 0].index[0]:]  # cut rows with no positive value before them
    ind = g[g['qty'] < 0].index.values
    g.loc[ind, 'returned'] = True
    g.loc[ind - 1, 'returned'] = True
    d.loc[g.index, 'returned'] = g['returned']
    return d['returned']

df['returned'] = df.groupby(['region', 'productid', 'vendor', 'abs_qty'], group_keys=False).apply(return_logic)
In the logic function:
Check to make sure there is more than 1 row and that there's at least 1 positive value.
Find the first row with a positive value.
Flag every negative qty line and the line before it.
This won't handle situations where there are consecutive negative qty rows.
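A small usage sketch (hedged, not part of the original answer) for inspecting the result after running the code above:
# assumes df and the 'returned' column produced by the groupby/apply above
print(df.sort_values(['region', 'productid', 'vendor', 'date'])
        [['region', 'productid', 'vendor', 'date', 'qty', 'returned']])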

Python Pandas counting the occurrences of an event in each year

I've got a dataframe describing events in a company and it looks like this:
employee_id event event_start_date event_end_date hire_date
1 "data change" 1.01.2018 1.01.2018 1.09.2005
2 "data change" 4.04.2018 4.04.2018 1.06.2007
2 "termination" 2.10.2020 NaT 1.06.2007
3 "hire" 23.05.2019 23.05.2019 23.05.2019
3 "leave" 23.07.2019 30.07.2019 23.05.2019
3 "termination" 3.11.2020 NaT 23.05.2019
Table is indexed by employee_id and event, and sorted by event_start_date.
So one employee has one or more events listed in the table. The "hire" event is not always present in the "event" column, so I assume the hiring date information is only available in the "hire_date" column. I would like to:
count the number of hiring events in each year
count the number of termination events in each year
count the number of active employees in each year
Build the example df:
import pandas as pd
import datetime
import numpy as np
# example df
emp = [1, 2, 2, 3, 3, 3]
event = ["data change", "data change", "termination", "hire", "leave", "termination"]
s_date = [datetime.datetime(2018, 1, 1), datetime.datetime(2018, 4, 4), datetime.datetime(2020, 10, 2),
datetime.datetime(2019, 5, 23), datetime.datetime(2019, 7, 23), datetime.datetime(2020, 11, 3)]
e_date = [datetime.datetime(2018, 1, 1), datetime.datetime(2018, 4, 4), np.datetime64('NaT'),
datetime.datetime(2019, 5, 23), datetime.datetime(2019, 7, 30), np.datetime64('NaT')]
h_date = [datetime.datetime(2005, 9, 1), datetime.datetime(2007, 6, 1), datetime.datetime(2017, 6, 1),
datetime.datetime(2019, 5, 23), datetime.datetime(2019, 5, 23), datetime.datetime(2019, 5, 23)]
df = pd.DataFrame(emp, columns=['employee_id'])
df['event'] = event
df['event_start_date'] = s_date
df['event_end_date'] = e_date
df['hire_date'] = h_date
1st question
def calculate_hire_for_year():
    df['hire_year'] = pd.DatetimeIndex(df['hire_date']).year
    dict_years = {}
    ids = set(list(df['employee_id']))
    for id in ids:
        result = df[df['employee_id'] == id]
        year = list(result['hire_year'])[0]
        # increment the count for this hire year
        dict_years[year] = dict_years.get(year, 0) + 1
    return dict_years

print("Number of hiring events in each year:")
print(calculate_hire_for_year())
2nd question
def calculate_termination_per_year():
    df['year'] = pd.DatetimeIndex(df['event_start_date']).year
    result = df[df['event'] == "termination"]
    count_series = result.groupby(["event", "year"]).size()
    return count_series

print("Number of termination events in each year:")
print(calculate_termination_per_year())
3rd question
def calculate_employee_per_year():
    dict_years = {}
    df['year'] = pd.DatetimeIndex(df['event_start_date']).year
    years = set(list(df['year']))
    for year in years:
        result = df[df['year'] == year]
        count_emp = len(set(list(result['employee_id'])))
        dict_years[year] = count_emp
    return dict_years

print("Number of active employees in each year:")
print(calculate_employee_per_year())
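For the first two counts, a more vectorized sketch is also possible (hedged; it assumes the example df built above and is not from the original post):
# hires per year: one hire_date per employee, counted by hire year
hires_per_year = df.drop_duplicates('employee_id')['hire_date'].dt.year.value_counts().sort_index()
# terminations per year: filter the event column, count by event start year
terminations_per_year = df.loc[df['event'] == 'termination', 'event_start_date'].dt.year.value_counts().sort_index()
print(hires_per_year)
print(terminations_per_year)
The third count depends on how "active" is defined (for example, hired on or before a given year and not yet terminated), so it is left out of this sketch.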

Add tuple values to dict if one value is a key in the dict

So I have been stuck on this for a while now.
My data looks as follows:
Initialer Start uge Start dag Start tid End uge End dag End tid
0 MBAU 18 3 09:00:00 18 5 12:00:00
1 MBAU 22 2 14:00:00 22 2 15:00:00
2 MBAU 13 4 09:00:00 13 4 10:00:00
3 AMPE 14 1 12:00:00 14 1 13:30:00
4 AMPE 26 6 09:00:00 27 2 22:00:00
I am trying to generate a dictionary with 'Initialer' as keys, where the values consist of two tuples or lists per row: one containing the "Start" columns and one containing the "End" columns, like this { 'Initialer': [(Start uge, Start dag, Start tid), (End uge, End dag, End tid)] }:
{'MBAU': [[(18, 3, 09:00:00), (18, 5, 12:00:00)],
[(22, 2, 14:00:00), (22, 2, 15:00:00)],
[(13, 4, 09:00:00), (13, 4, 10:00:00)]],
'AMPE': [[(14, 1, 12:00:00), (14, 1, 13:30:00)],
[(26, 6, 09:00:00), (27, 2, 22:00:00)]] }
However, I am struggling to get it right. I have tried generating two lists of tuples containing the start columns and end columns respectively:
start_tuple = self.u_data[['Initialer','Start uge', 'Start dag', 'Start tid']].apply(tuple, axis=1).values
>>>
[('MBAU', 18, 3, datetime.time(9, 0))
('MBAU', 22, 2, datetime.time(14, 0))
('MBAU', 13, 4, datetime.time(9, 0))
('AMPE', 14, 1, datetime.time(12, 0))
('AMPE', 26, 6, datetime.time(9, 0))]
end_tuple = self.u_data[['Initialer','End uge', 'End dag', 'End tid']].apply(tuple, axis=1).values
>>>
[('MBAU', 18, 5, datetime.time(12, 0))
('MBAU', 22, 2, datetime.time(15, 0))
('MBAU', 13, 4, datetime.time(10, 0))
('AMPE', 14, 1, datetime.time(13, 30))
('AMPE', 27, 2, datetime.time(22, 0))]
I then created a dict based on unique values in 'Initialer' and tried to use list comprehension to populate it as such:
start_dict = {k:[] for k in self.u_data.Initialer.unique()}
(start_dict[initialer].append((x,y,z)) for initialer, x, y, z in start_tuple)
>>>
{'MBAU': [], 'AMPE': []}
But this returns only empty values {'MBAU': [], 'AMPE': []}. I have tried to research how I could do this but without any luck.
Is there a smart way to accomplish this?
Why does it fail?
The reason you are getting {'MBAU': [], 'AMPE': []} is that list.append() is an in-place operation that doesn't return anything, and (i for i in l) creates a generator object instead of actually running the append operations.
You can see what happens here instead -
start_tuple = df[['Initialer','Start_uge', 'Start_dag', 'Start_tid']].apply(tuple, axis=1)
start_dict = {k:[] for k in df.Initialer.unique()}
#list comprehension runs the append operation but returns None
[start_dict[initialer].append((x,y,z)) for initialer, x, y, z in start_tuple]
### Returns:
### [None, None, None, None, None]
### But if you print start_dict
print(start_dict)
{'MBAU': [(18, 3, '09:00:00'), (22, 2, '14:00:00'), (13, 4, '09:00:00')], 'AMPE': [(14, 1, '12:00:00'), (26, 6, '09:00:00')]}
This means that the operation would run and return None, but the original start_dict object will get modified since now you are actually iterating and not creating a generator.
A modified approach with defaultdict
More in line with the approach you have used already, but using collections.defaultdict:
from collections import defaultdict
init = df['Initialer'].tolist()
start_tuple = df[['Start_uge', 'Start_dag', 'Start_tid']].apply(tuple, axis=1)
end_tuple = df[['End_uge', 'End_dag', 'End_tid']].apply(tuple, axis=1)
items = zip(init, start_tuple, end_tuple)
d = defaultdict(list)
for i, j, k in items:
    d[i].append([j, k])
output = dict(d)
output
{'MBAU': [[(18, 3, '09:00:00'), (18, 5, '12:00:00')],
[(22, 2, '14:00:00'), (22, 2, '15:00:00')],
[(13, 4, '09:00:00'), (13, 4, '10:00:00')]],
'AMPE': [[(14, 1, '12:00:00'), (14, 1, '13:30:00')],
[(26, 6, '09:00:00'), (27, 2, '22:00:00')]]}
Another variation
You can solve this in a slightly shorter way using collections.defaultdict:
from collections import defaultdict
d = defaultdict(list)
for _, row in df.iterrows():
    vals = row.tolist()
    d[vals[0]].append([tuple(vals[1:4]), tuple(vals[4:])])
output = dict(d)
output
{'MBAU': [[(18, 3, '09:00:00'), (18, 5, '12:00:00')],
[(22, 2, '14:00:00'), (22, 2, '15:00:00')],
[(13, 4, '09:00:00'), (13, 4, '10:00:00')]],
'AMPE': [[(14, 1, '12:00:00'), (14, 1, '13:30:00')],
[(26, 6, '09:00:00'), (27, 2, '22:00:00')]]}
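The same dictionary can also be built without an explicit Python loop. A hedged sketch, assuming the underscore column names (Start_uge, ..., End_tid) used in the answer's df rather than the spaced names from the question:
import pandas as pd
# assumes df as used in the answers above
start_tuple = df[['Start_uge', 'Start_dag', 'Start_tid']].apply(tuple, axis=1)
end_tuple = df[['End_uge', 'End_dag', 'End_tid']].apply(tuple, axis=1)
# pair each start tuple with its end tuple, then collect the pairs per Initialer
pairs = pd.Series(list(zip(start_tuple, end_tuple)), index=df.index).apply(list)
output = pairs.groupby(df['Initialer']).agg(list).to_dict()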

Grouping data by id, var1 into consecutive dates in python using pandas

I have some data that looks like:
df_raw_dates = pd.DataFrame({"id": [102, 102, 102, 103, 103, 103, 104], "var1": ['a', 'b', 'a', 'b', 'b', 'a', 'c'],
"val": [9, 2, 4, 7, 6, 3, 2],
"dates": [pd.Timestamp(2020, 1, 1),
pd.Timestamp(2020, 1, 1),
pd.Timestamp(2020, 1, 2),
pd.Timestamp(2020, 1, 2),
pd.Timestamp(2020, 1, 3),
pd.Timestamp(2020, 1, 5),
pd.Timestamp(2020, 3, 12)]})
I want to group this data by id and var1 where the dates are consecutive; if a day is missed, I want to start a new record.
For example the final output should be:
df_end_result = pd.DataFrame({"id": [102, 102, 103, 103, 104], "var1": ['a', 'b', 'b', 'a', 'c'],
"val": [13, 2, 13, 3, 2],
"start_date": [pd.Timestamp(2020, 1, 1),
pd.Timestamp(2020, 1, 1),
pd.Timestamp(2020, 1, 2),
pd.Timestamp(2020, 1, 5),
pd.Timestamp(2020, 3, 12)],
"end_date": [pd.Timestamp(2020, 1, 2),
pd.Timestamp(2020, 1, 1),
pd.Timestamp(2020, 1, 3),
pd.Timestamp(2020, 1, 5),
pd.Timestamp(2020, 3, 12)]})
I have tried this a few ways and keep failing. The length of time something can exist for is unknown, and the number of possible var1 values can change with each id and with each date window as well.
For example, I have tried to identify consecutive days like this, but it always returns ['count_days'] == 0 (clearly something is wrong!). Then I thought I could take date(min) and date(min) + count_days to get 'start_date' and 'end_date':
s = df_raw_dates.groupby(['id','var1']).dates.diff().eq(pd.Timedelta(days=1))
s1 = s | s.shift(-1, fill_value=False)
df['count_days'] = np.where(s1, s1.groupby(df.id).cumsum(), 0)
I have also tried:
df = df_raw_dates.groupby(['id', 'var1']).agg({'val': 'sum', 'dates': ['first', 'last']}).reset_index()
This gets me closer, but I don't think it deals with the consecutive-days problem; it instead provides the earliest and latest day, which unfortunately isn't something I can take forward.
EDIT: adding more context
Another approach is:
df = df_raw_dates.groupby(['id', 'dates']).size().reset_index().rename(columns={0: 'del'}).drop('del', axis=1)
which provides a list of ids and dates, but I am getting stuck finding the min/max consecutive dates within this new window.
Extended example that has a break in the date range for group (102,'a').
df_raw_dates = pd.DataFrame(
{
"id": [102, 102, 102, 103, 103, 103, 104, 102, 102, 102, 102, 108, 108],
"var1": ["a", "b", "a", "b", "b", "a", "c", "a", "a", "a", "a", "a", "a"],
"val": [9, 2, 4, 7, 6, 3, 2, 1, 2, 3, 4, 99, 99],
"dates": [
pd.Timestamp(2020, 1, 1),
pd.Timestamp(2020, 1, 1),
pd.Timestamp(2020, 1, 2),
pd.Timestamp(2020, 1, 2),
pd.Timestamp(2020, 1, 3),
pd.Timestamp(2020, 1, 5),
pd.Timestamp(2020, 3, 12),
pd.Timestamp(2020, 1, 3),
pd.Timestamp(2020, 1, 7),
pd.Timestamp(2020, 1, 8),
pd.Timestamp(2020, 1, 9),
pd.Timestamp(2020, 1, 21),
pd.Timestamp(2020, 1, 25),
],
}
)
Further example
This is using the answer below from wwii.
import pandas as pd
import collections
df_raw_dates1 = pd.DataFrame(
{
"id": [100,105,105,105,100,105,100,100,105,105,105,105,105,105,105,105,105,105,105,105,105,105,105],
"var1": ["a","b","d","a","d","c","b","b","b","a","c","d","c","a","d","b","a","d","b","b","d","c","a"],
"val": [0, 2, 0, 0, 0, 0, 0, 0, 9, 1, 0, 1, 1, 0, 9, 5, 10, 12, 13, 15, 0, 1, 2 ],
"dates": [
pd.Timestamp(2021, 1, 22),
pd.Timestamp(2021, 1, 22),
pd.Timestamp(2021, 1, 22),
pd.Timestamp(2021, 1, 22),
pd.Timestamp(2021, 1, 22),
pd.Timestamp(2021, 1, 22),
pd.Timestamp(2021, 1, 22),
pd.Timestamp(2021, 1, 21),
pd.Timestamp(2021, 1, 21),
pd.Timestamp(2021, 1, 21),
pd.Timestamp(2021, 1, 21),
pd.Timestamp(2021, 1, 21),
pd.Timestamp(2021, 1, 20),
pd.Timestamp(2021, 1, 20),
pd.Timestamp(2021, 1, 20),
pd.Timestamp(2021, 1, 20),
pd.Timestamp(2021, 1, 19),
pd.Timestamp(2021, 1, 19),
pd.Timestamp(2021, 1, 19),
pd.Timestamp(2021, 1, 18),
pd.Timestamp(2021, 1, 18),
pd.Timestamp(2021, 1, 18),
pd.Timestamp(2021, 1, 18)
],
}
)
day = pd.Timedelta('1d')
# again using the extended example in the question
gb = df_raw_dates1.groupby(['id', 'var1'])
new_df = collections.defaultdict(list)
for k, g in gb:
    # print(g)
    eyed, var1 = k
    dt = g['dates']
    in_block = ((dt - dt.shift(-1)).abs() == day) | (dt.diff() == day)
    filt = g.loc[in_block]
    breaks = filt['dates'].diff() != day
    groups = breaks.cumsum()
    date_groups = g.groupby(groups)
    # print(k, groups, groups.any())
    # accommodate groups with only one date
    if not groups.any():
        new_df['id'].append(eyed)
        new_df['var1'].append(var1)
        new_df['val'].append(g.val.sum())
        new_df['start'].append(g.dates.min())
        new_df['end'].append(g.dates.max())
        continue
    for _, date_range in date_groups:
        start, end = date_range['dates'].min(), date_range['dates'].max()
        val = date_range.val.sum()
        new_df['id'].append(eyed)
        new_df['var1'].append(var1)
        new_df['val'].append(val)
        new_df['start'].append(start)
        new_df['end'].append(end)

print(pd.DataFrame(new_df))
>>> id var1 val start end
0 100 a 0.0 2021-01-22 2021-01-22
1 100 b 0.0 2021-01-22 2021-01-22
2 100 d 0.0 2021-01-22 2021-01-22
3 105 a 0.0 2021-01-22 2021-01-22
4 105 a 1.0 2021-01-21 2021-01-21
5 105 a 0.0 2021-01-20 2021-01-20
6 105 a 10.0 2021-01-19 2021-01-19
7 105 b 2.0 2021-01-22 2021-01-22
8 105 b 9.0 2021-01-21 2021-01-21
9 105 b 5.0 2021-01-20 2021-01-20
10 105 b 13.0 2021-01-19 2021-01-19
From the above I would have expected rows 3, 4, 5, 6 to be grouped together, and 7, 8, 9, 10 as well. I am not sure why this example breaks. What is the difference between this example and the extended example above, and why does it not seem to work here?
I don't have Pandas superpowers so I never try to do groupby one-liners, maybe someday.
Adapting the accepted answer to the SO question Find group of consecutive dates in Pandas DataFrame: first group by ['id', 'var1']; then, for each group, group by consecutive date ranges.
import pandas as pd
sep = "************************************\n"
day = pd.Timedelta('1d')
# using the extended example in the question.
gb = df_raw_dates.groupby(['id', 'var1'])
for k, g in gb:
    print(g)
    dt = g['dates']
    # find difference in days between rows
    in_block = ((dt - dt.shift(-1)).abs() == day) | (dt.diff() == day)
    # create a Series to identify consecutive ranges to group by
    # this cumsum trick can be found in many SO answers
    filt = g.loc[in_block]
    breaks = filt['dates'].diff() != day
    groups = breaks.cumsum()
    # split into date ranges
    date_groups = g.groupby(groups)
    for _, date_range in date_groups:
        print(date_range)
    print(sep)
You can see that the (102,'a') group has been split into two groups.
id var1 val dates
0 102 a 9 2020-01-01
2 102 a 4 2020-01-02
7 102 a 1 2020-01-03
id var1 val dates
8 102 a 2 2020-01-07
9 102 a 3 2020-01-08
10 102 a 4 2020-01-09
Going a bit further: while iterating construct a dictionary to make a new DataFrame with.
import pandas as pd
import collections
day = pd.Timedelta('1d')
# again using the extended example in the question
gb = df_raw_dates.groupby(['id', 'var1'])
new_df = collections.defaultdict(list)
for k, g in gb:
    # print(g)
    eyed, var = k
    dt = g['dates']
    in_block = ((dt - dt.shift(-1)).abs() == day) | (dt.diff() == day)
    filt = g.loc[in_block]
    breaks = filt['dates'].diff() != day
    groups = breaks.cumsum()
    date_groups = g.groupby(groups)
    # print(k, groups, groups.any())
    # accommodate groups with only one date
    if not groups.any():
        new_df['id'].append(eyed)
        new_df['var1'].append(var)
        new_df['val'].append(g.val.mean())
        new_df['start'].append(g.dates.min())
        new_df['end'].append(g.dates.max())
        continue
    for _, date_range in date_groups:
        start, end = date_range['dates'].min(), date_range['dates'].max()
        val = date_range.val.mean()
        new_df['id'].append(eyed)
        new_df['var1'].append(var)
        new_df['val'].append(val)
        new_df['start'].append(start)
        new_df['end'].append(end)

print(pd.DataFrame(new_df))
>>>
id var1 val start end
0 102 a 4.666667 2020-01-01 2020-01-03
1 102 a 3.000000 2020-01-07 2020-01-09
2 102 b 2.000000 2020-01-01 2020-01-01
3 103 a 3.000000 2020-01-05 2020-01-05
4 103 b 6.500000 2020-01-02 2020-01-03
5 104 c 2.000000 2020-03-12 2020-03-12
6 108 a 99.000000 2020-01-21 2020-01-25
Seems pretty tedious, maybe someone will come along with a less-verbose solution. Maybe some of the operations could be put in functions and .apply or .transform or .pipe could be used making it a little cleaner.
It does not account for ('id', 'var1') groups that have more than one date but no consecutive dates, e.g.
id var1 val dates
11 108 a 99 2020-01-21
12 108 a 99 2020-01-25
You might need to detect if there are any gaps in a datetime Series and use that fact to accommodate.
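A more vectorized sketch along those lines (hedged; not from the original answer). It assumes the extended df_raw_dates example and sums val per run, matching the question's expected output rather than the mean used above:
import pandas as pd
# assumes df_raw_dates from the question
s = df_raw_dates.sort_values(['id', 'var1', 'dates'])
# True where a new run starts within a group; the first row of each group
# has a NaT diff, which compares as False
gap = s.groupby(['id', 'var1'])['dates'].diff() > pd.Timedelta('1d')
# a global cumsum is enough: run ids only need to be distinct within each
# (id, var1) group, and the final groupby includes id and var1 anyway
s['run'] = gap.cumsum()
out = (s.groupby(['id', 'var1', 'run'], as_index=False)
         .agg(val=('val', 'sum'), start_date=('dates', 'min'), end_date=('dates', 'max'))
         .drop(columns='run'))
print(out)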

How do I save a pandas df to excel sheet then format it?

import pandas as pd
from datetime import datetime, date
df = pd.DataFrame({'Date and time': [datetime(2015, 1, 1, 11, 30, 55),
datetime(2015, 1, 2, 1, 20, 33),
datetime(2015, 1, 3, 11, 10 ),
datetime(2015, 1, 4, 16, 45, 35),
datetime(2015, 1, 5, 12, 10, 15)],
'Dates only': [date(2015, 2, 1),
date(2015, 2, 2),
date(2015, 2, 3),
date(2015, 2, 4),
date(2015, 2, 5)],
})
writer = pd.ExcelWriter("pandas_datetime_format.xlsx",
engine='xlsxwriter',
datetime_format='mmm d yyyy hh:mm:ss',
date_format='mmmm dd yyyy')
df.to_excel(writer, sheet_name='Sheet1')
workbook = writer.book
worksheet = writer.sheets['Sheet1']
format_bc = workbook.add_format({
'font_name': 'Arial',
'font_size' : 14,
'font_color': 'white',
'bold': 0,
'border': 1,
'align': 'left',
'valign': 'vcenter',
'text_wrap': 1,
'fg_color': '#005581'})
worksheet.set_column('B:C', 20, format_bc)
writer.save()
The above code was expected to generate a formatted Excel sheet, with columns B and C having a blue background and the other aspects specified in format_bc. Instead, the desired cells were not formatted in the file I received.
Is there a way to write a DataFrame to an Excel sheet with formatting?
Unfortunately, as seen here: https://github.com/jmcnamara/XlsxWriter/issues/336, it is not possible to format the date and datetime cells this way with XlsxWriter.
What you could do instead is add df = df.astype(str) to change your DataFrame columns from date/datetime to string.
df = pd.DataFrame(
{
'Date and time': [
datetime(2015, 1, 1, 11, 30, 55),
datetime(2015, 1, 2, 1, 20, 33),
[ ... ]
],
'Dates only': [
date(2015, 2, 1),
date(2015, 2, 2),
[ ... ]
]
}
)
df = df.astype(str)
writer = pd.ExcelWriter("pandas_datetime_format.xlsx",
engine='xlsxwriter',
[ ... ])
[ ... ]
Note that if you want the headers to be overwritten with the new format as well, add this to the beginning of your code:
import pandas.io.formats.excel
pandas.io.formats.excel.header_style = None
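Putting the pieces together, a minimal hedged sketch of the workaround (it uses the DataFrame from the question; the header_style line is the one from the note above and its exact location can differ between pandas versions):
import pandas as pd
from datetime import datetime, date
import pandas.io.formats.excel
# header formatting reset, as noted above; attribute location may vary by pandas version
pandas.io.formats.excel.header_style = None
df = pd.DataFrame({'Date and time': [datetime(2015, 1, 1, 11, 30, 55),
                                     datetime(2015, 1, 2, 1, 20, 33)],
                   'Dates only': [date(2015, 2, 1),
                                  date(2015, 2, 2)]})
# convert date/datetime columns to strings so the cell-level date formats
# written by pandas no longer take precedence over the column format
df = df.astype(str)
writer = pd.ExcelWriter("pandas_datetime_format.xlsx", engine='xlsxwriter')
df.to_excel(writer, sheet_name='Sheet1')
workbook = writer.book
worksheet = writer.sheets['Sheet1']
format_bc = workbook.add_format({'font_name': 'Arial', 'font_size': 14,
                                 'font_color': 'white', 'border': 1,
                                 'align': 'left', 'valign': 'vcenter',
                                 'text_wrap': 1, 'fg_color': '#005581'})
worksheet.set_column('B:C', 20, format_bc)
writer.save()  # writer.close() in newer pandas versions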
