I have this multiindex column df:
None INT INT INT PP PP PP
DATE 2021-12-01 2021-12-02 2021-12-03 2021-12-04 2021-12-05 2021-12-06
0 1.0 0.0 2.0 2.0 4.0 2.0
1 NaN NaN NaN NaN NaN NaN
2 0.0 0.0 2.0 0.0 3.0 4.0
3 0.0 2.0 2.0 2.0 3.0 2.0
4 0.0 0.0 0.0 0.0 0.0 0.0
5 0.0 0.0 0.0 0.0 0.0 0.0
6 0.0 0.0 0.0 0.0 0.0 0.0
7 2.0 1.0 0.0 1.0 2.0 0.0
8 NaN NaN NaN NaN NaN NaN
9 0.0 0.0 0.0 0.0 0.0 0.0
I want to give a background color style to only values in 'PP' columns (and export to Excel) based on their values (white to values 0, lightgray to values 1, etc.). So I have in mind this:
###############################################################################
n=len(df.columns)
def colors_excel(s):
    if s.PP == 0:
        return ['background-color: white']*n
    elif s.PP == 1:
        return ['background-color: lightgray']*n
    elif s.PP == 2:
        return ['background-color: gray']*n
    elif s.PP == 3:
        return ['background-color: yellow']*n
    elif s.PP == 4:
        return ['background-color: orange']*n
    elif s.PP == 5:
        return ['background-color: red']*n
    else:
        return ['background-color: black']*n
###############################################################################
exceldata=df.style.apply(colors_excel, axis=0)
exceldata.to_excel('ROUTE/name_of_thefile.xlsx',
engine='openpyxl', index=True)
But this doesn't work in a multiindex column. And I don't want to drop the date of the multiindex columns. How can I solve this?
Here is another example of what I expect to get:
I will appreciate any help.
Thanks in advance.
Fixed Cell Format
Let's prepare the data:
import pandas as pd
from io import StringIO
data = '''None INT INT INT PP PP PP
DATE 2021-12-01 2021-12-02 2021-12-03 2021-12-04 2021-12-05 2021-12-06
0 1.0 0.0 2.0 2.0 4.0 2.0
1 NaN NaN NaN NaN NaN NaN
2 0.0 0.0 2.0 0.0 3.0 4.0
3 0.0 2.0 2.0 2.0 3.0 2.0
4 0.0 0.0 0.0 0.0 0.0 0.0
5 0.0 0.0 0.0 0.0 5.0 0.0
6 0.0 0.0 0.0 6.0 0.0 0.0
7 2.0 1.0 0.0 1.0 2.0 0.0
8 NaN NaN NaN NaN NaN NaN
9 0.0 0.0 0.0 0.0 0.0 0.0
'''
df = pd.read_csv(StringIO(data), sep=r'\s+', header=[0,1], index_col=0)
To map the values to the specified colors, I'd use a dictionary and the applymap method:
colors = {
0: 'white',
1: 'lightgray',
2: 'gray',
3: 'yellow',
4: 'orange',
5: 'red'
}
default_color = 'black'
get_color = lambda x: colors.get(x, default_color)
color_map = lambda df: 'background-color: ' + df.applymap(get_color).values + ';'
In the last line I used .values to switch from a DataFrame to a numpy.ndarray, which avoids any mismatch with index or column labels.
Next, in the styler I'd use apply with:
color_map as a function,
axis=None to pass a frame as an argument to color_map,
and subset='PP' to restrict the frame to the columns with PP in their header:
exceldata = df.style.apply(color_map, subset='PP', axis=None)
exceldata.to_excel('file.xlsx', engine='openpyxl', index=True)
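To check the mapping without opening Excel, the grid of CSS strings can be inspected directly. This is only a minimal sketch on a made-up two-row frame: the toy values and the helper name css_grid are assumptions for illustration, and it uses Series.map column by column instead of applymap, which should give the same result here.

```python
import pandas as pd

colors = {0: 'white', 1: 'lightgray', 2: 'gray', 3: 'yellow', 4: 'orange', 5: 'red'}
default_color = 'black'

def css_grid(frame):
    # Map every cell to a color name; anything missing from the dict
    # (NaN, 6.0, ...) falls back to the default color.
    names = frame.apply(lambda col: col.map(lambda x: colors.get(x, default_color)))
    # Return a plain object array so no index/column labels are involved.
    return 'background-color: ' + names.to_numpy(dtype=object) + ';'

# Hypothetical toy data with the same two-level header layout as above.
columns = pd.MultiIndex.from_product([['INT', 'PP'], ['2021-12-01', '2021-12-02']])
toy = pd.DataFrame([[1.0, 0.0, 2.0, 4.0],
                    [0.0, 2.0, float('nan'), 6.0]], columns=columns)

grid = css_grid(toy['PP'])
print(grid)
```

The second row comes out black in both cells, because neither NaN nor 6.0 is a key of the dictionary. The same function could be passed to the styler in place of color_map.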
Conditional Format
Since the data are formatted conditionally, it seems natural to use conditional formatting in the Excel file. This can be handy if we are going to keep working with the data in Excel. How to do this depends on the engine we use (openpyxl, xlsxwriter, etc.).
Let's stick to openpyxl:
file = 'test.xlsx'
writer = pd.ExcelWriter(file, engine='openpyxl')
df.to_excel(writer, sheet_name = 'Data')
Now, before closing the writer, we have to assign the conditional formatting. For this we need the upper-left and lower-right corners of the range where df.PP is placed. Note that by default a row for the index names is placed between the headers and the data, so the data start at row (df.columns.nlevels + 1) + 1:
row_start = df.columns.nlevels + 2
row_end = row_start + len(df) - 1
As for the columns, we could use something like df.columns.get_level_values(0) == 'PP' to find the columns with PP in their header, or df.columns.get_loc('PP'), which in this case returns slice(3, 6), i.e. the 4th through 6th data columns. Let's do it with get_loc:
from openpyxl.utils import get_column_letter
col_slice = df.columns.get_loc('PP')
col_start = get_column_letter(col_slice.start + 2)
col_end = get_column_letter(col_slice.stop + 1)
range_str = f'{col_start}{row_start}:{col_end}{row_end}'
Here:
range_str is the address of the df.PP data on the worksheet, here 'E4:G13';
col_slice.start + 2: we add +1 because Excel numbers columns from 1, and another +1 because the first column is occupied by the index;
col_slice.stop + 1: the same +2 as above, minus 1 because the .stop value of a slice is exclusive, i.e. the last position actually in the slice is col_slice.stop - 1.
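The offset arithmetic above can be checked in isolation. The following is a sketch only: column_letter is a hand-written stand-in for openpyxl's get_column_letter, and data_range is a hypothetical helper name, so that the example runs without pandas or openpyxl.

```python
def column_letter(n):
    """Convert a 1-based column number to an Excel letter (1 -> 'A', 27 -> 'AA')."""
    letters = ''
    while n > 0:
        n, rem = divmod(n - 1, 26)
        letters = chr(ord('A') + rem) + letters
    return letters

def data_range(n_header_levels, n_rows, first_col, stop_col):
    """Worksheet address of a block of data columns.

    first_col / stop_col are the zero-based bounds of a slice (stop excluded),
    like the slice returned by df.columns.get_loc('PP').
    """
    row_start = n_header_levels + 2           # header rows + index-name row, 1-based
    row_end = row_start + n_rows - 1
    col_start = column_letter(first_col + 2)  # +1 for 1-based columns, +1 for the index column
    col_end = column_letter(stop_col + 1)     # same +2, minus 1 because stop is exclusive
    return f'{col_start}{row_start}:{col_end}{row_end}'

# Two header levels, ten data rows, PP in zero-based positions 3..5
print(data_range(n_header_levels=2, n_rows=10, first_col=3, stop_col=6))  # E4:G13
```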
Now we can add the conditional formats (sheet is the worksheet writer.book['Data'], as in the full code below):
from openpyxl.styles import PatternFill, Font
from openpyxl.formatting.rule import CellIsRule
BLACK = '010101'
colors = {
'""': BLACK, # black for blank cells
'0': "FFFFFF", # white
'1': "CCCCCC", # lightgray
'2': "999999", # gray
'3': "FFFF00", # yellow
'4': "FFCC00", # orange
'5': "FF0000", # red
}
colors = {k: PatternFill(bgColor=v, fill_type='solid') for k, v in colors.items()}
for value, color in colors.items():
    sheet.conditional_formatting.add(
        range_str,
        CellIsRule('equal', formula=[value], stopIfTrue=True, fill=color),
    )
sheet.conditional_formatting.add(
    range_str,
    CellIsRule('notBetween', formula=['0','5'], stopIfTrue=True,
               fill=PatternFill(bgColor=BLACK, fill_type='solid'),
               font=Font(color='FFFFFF')),
)
Notes:
Excel treats blank cells as equal to zero in comparisons, so we have to check for them first (before comparing with 0) by comparing to an empty string;
the black color "0x000000" is treated by openpyxl as white (I don't know why), so we have to define it as almost black;
at the end we add an extra rule for values outside the interval [0, 5]; to make this more specific, e.g. "not in the list [0,1,2,3,4,5]", we would have to come up with some other rule.
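One candidate for such a stricter rule is a formula-based rule. The sketch below is an assumption on my part, not something from the tested code above: it uses openpyxl's FormulaRule with a formula that is true for values below 0, above 5, or not whole numbers, and I have not checked it against every Excel version. The formula text and the name not_listed are made up for illustration.

```python
from openpyxl import Workbook
from openpyxl.formatting.rule import FormulaRule
from openpyxl.styles import Font, PatternFill

wb = Workbook()
sheet = wb.active
range_str = 'E4:G13'  # the range computed for this example

# The formula is written for the top-left cell of the range (E4); Excel shifts
# the relative reference for every other cell in the range.
not_listed = 'OR(E4<0,E4>5,E4<>INT(E4))'
rule = FormulaRule(
    formula=[not_listed],
    stopIfTrue=True,
    fill=PatternFill(bgColor='010101', fill_type='solid'),
    font=Font(color='FFFFFF'),
)
sheet.conditional_formatting.add(range_str, rule)
```

Used instead of the notBetween rule, this would also paint a value such as 2.5 black. Blank cells are still expected to be caught by the earlier empty-string rule, because that rule is added first with stopIfTrue=True.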
Full Code
Part 1
colors = {
0: 'white',
1: 'lightgray',
2: 'gray',
3: 'yellow',
4: 'orange',
5: 'red'
}
default_color = 'black'
get_color = lambda x: colors.get(x, default_color)
color_map = lambda df: 'background-color: ' + df.applymap(get_color).values + ';'
exceldata = df.style.apply(color_map, subset='PP', axis=None)
exceldata.to_excel('file.xlsx', engine='openpyxl', index=True)
Part 2
from openpyxl.utils import get_column_letter
from openpyxl.styles import PatternFill, Font
from openpyxl.formatting.rule import CellIsRule
file = 'test.xlsx'
writer = pd.ExcelWriter(file, engine='openpyxl')
df.to_excel(writer, sheet_name = 'Data')
book = writer.book
sheet = book['Data']
row_start = df.columns.nlevels + 2
row_end = row_start + len(df) - 1
col_slice = df.columns.get_loc('PP')
col_start = get_column_letter(col_slice.start + 2)
col_end = get_column_letter(col_slice.stop + 1)
range_str = f'{col_start}{row_start}:{col_end}{row_end}'
BLACK = '010101'
colors = {
'""': BLACK, # black for blank cells
'0': "FFFFFF", # white
'1': "CCCCCC", # lightgray
'2': "999999", # gray
'3': "FFFF00", # yellow
'4': "FFCC00", # orange
'5': "FF0000", # red
}
colors = {k: PatternFill(bgColor=v, fill_type='solid') for k, v in colors.items()}
for value, color in colors.items():
    sheet.conditional_formatting.add(
        range_str,
        CellIsRule('equal', formula=[value], stopIfTrue=True, fill=color),
    )
sheet.conditional_formatting.add(
    range_str,
    CellIsRule('notBetween', formula=['0','5'], stopIfTrue=True,
               fill=PatternFill(bgColor=BLACK, fill_type='solid'),
               font=Font(color='FFFFFF')),
)
writer.close()
python: 3.10.7
pandas: 1.5.1
openpyxl: 3.0.10
You can use Styler.apply(func, axis=None, subset=...) so that func receives a DataFrame, restricted to subset, with valid index and column labels:
def highlight_cols(df):
    color = pd.DataFrame().reindex_like(df.droplevel(0, axis=1))
    colors = ['yellow', 'lightgray', 'gray']
    color = color.fillna(f'background-color: {colors[-1]}')
    for idx, col in enumerate(df.columns.get_level_values(1)):
        if idx < len(colors) - 1:
            color = color.mask(df.eq(idx).values, f'background-color: {colors[idx]}')
    return color.values
idx = pd.IndexSlice
style = df.style.apply(highlight_cols, axis=None, subset=idx[:, idx['PP', :]])
style.to_excel('74075209.xlsx')
p = df[df['Name'].str.contains(NER, na = True)]
p['Result'] = p['Result'].astype('float')
p.groupby(["Name"])["Result"].plot(legend=True ,figsize=(15,10))
plt.legend(loc ='upper right')
plt.savefig('figure.png')
plt.close()
How can I use the Timestamp column as the values of the x-axis in the plot?
I tried:
p.set_index("Timestamp", inplace=True)
But the x-axis starts from 00:00:00 and not from the time of the first index (09:34:54).
p after the line: p['Result'] = p['Result'].astype('float')
Timestamp Name Result
7 09:34:54 TRX0_NER_M0 1.0
8 09:34:54 TRX0_NER_M1 1.0
9 09:34:54 TRX1_NER_M0 1.0
10 09:34:54 TRX1_NER_M1 1.0
11 09:34:54 TRX2_NER_M0 1.0
... ... ... ...
401465 09:47:00 TRX1_NER_M1 1.0
401466 09:47:00 TRX2_NER_M0 1.0
401467 09:47:00 TRX2_NER_M1 1.0
401468 09:47:00 TRX3_NER_M0 1.0
401469 09:47:01 TRX3_NER_M1 1.0
[38341 rows x 3 columns]
I can see you are using times in the format hh:mm:ss. You can use this code to get the number of seconds, which can then be used for the x-axis. I wrote a function you can use:
def get_seconds(timestamp):
    time=list(map(int,timestamp.split(":")))
    h=time[0]
    m=time[1]
    s=time[2]
    total=h*3600+m*60+s
    return total
print(get_seconds("09:34:54"))
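Since the question was about the axis starting at 00:00:00 rather than at the first sample, the seconds can also be taken relative to the first timestamp. A minimal sketch, assuming the timestamps are 'hh:mm:ss' strings in chronological order; elapsed_seconds is a hypothetical helper name:

```python
def get_seconds(timestamp):
    """Convert an 'hh:mm:ss' string to seconds since midnight."""
    h, m, s = map(int, timestamp.split(":"))
    return h * 3600 + m * 60 + s

def elapsed_seconds(timestamps):
    """Seconds elapsed since the first timestamp in the sequence."""
    start = get_seconds(timestamps[0])
    return [get_seconds(t) - start for t in timestamps]

# Values taken from the table in the question
sample = ["09:34:54", "09:34:54", "09:47:00", "09:47:01"]
print(elapsed_seconds(sample))  # [0, 0, 726, 727]
```

On the DataFrame itself, something like p["Timestamp"].map(get_seconds) should give the per-row seconds, provided the column really holds strings in this format.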
I need to visualize labels in a network from which I extract k-core information.
The dataset is
Source Target Edge_Weight Label_Source Label_Target
0 A F 29.1 0.0 0.0
1 A G 46.9 0.0 1.0
2 A B 24.4 0.0 1.0
3 C F 43.4 0.0 0.0
4 C N 23.3 0.0 1.0
5 D S 18.0 1.0 0.0
6 D G 67.6 1.0 0.0
7 D B 37.2 1.0 1.0
8 D E 46.9 1.0 2.0
For extracting kcore information I used the code
G = nx.from_pandas_edgelist(df, 'Source', 'Target')
kcore=nx.k_core(G)
plt.subplot(122)
nx.draw(kcore)
plt.show()
Do you know how I can add the label information?
My expected result would be a graph whose nodes are colored by their labels (it does not matter which color is assigned to which label value; the values are 0, 1, 2).
Many thanks
One way to do this is to create a colormap and associate it with your node labels. You can then use the node_color argument of nx.draw to set the color of the nodes. Additionally, you can use plt.scatter with empty data to create legend entries for the labels in your graph.
See code below:
import numpy as np
import networkx as nx
import matplotlib.pyplot as plt
import pandas as pd
from matplotlib import cm
df=pd.read_fwf('graph.txt') #Stored your dataset in a file called 'graph.txt'
G = nx.from_pandas_edgelist(df, 'Source', 'Target')
kcore=nx.k_core(G)
N_colors=3
cm_dis=np.linspace(0, 1,N_colors)
colors = [cm.viridis(x) for x in cm_dis]
color_nodes=[]
for node in kcore:
    #Finding out label of the node
    temp_src=df.index[df['Source'] == node].tolist()
    temp_targ=df.index[df['Target']==node].tolist()
    if len(temp_targ)!=0:
        label=df['Label_Target'][temp_targ[0]]
        color=colors[int(label)]
    elif len(temp_src)!=0:
        label=df['Label_Source'][temp_src[0]]
        color=colors[int(label)]
    #Setting up legend
    if color not in color_nodes:
        plt.scatter([],[],color=color,label=str(label))
    color_nodes.append(color)
#Draw graph
nx.draw(kcore,with_labels=True,node_color=color_nodes)
plt.legend()
plt.show()
And the output gives:
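As a side note on the label lookup in the loop above: the same rule (first occurrence as Target, otherwise first occurrence as Source) can be written as a single node-to-label dictionary built once from the edge list. This is only a sketch with the rows copied by hand from the question; node_labels is a hypothetical helper name.

```python
# Rows copied from the question: (Source, Target, Label_Source, Label_Target)
edges = [
    ("A", "F", 0, 0), ("A", "G", 0, 1), ("A", "B", 0, 1),
    ("C", "F", 0, 0), ("C", "N", 0, 1), ("D", "S", 1, 0),
    ("D", "G", 1, 0), ("D", "B", 1, 1), ("D", "E", 1, 2),
]

def node_labels(rows):
    """Map each node to one label: first Target occurrence, else first Source occurrence."""
    labels = {}
    for _, target, _, label_target in rows:
        labels.setdefault(target, label_target)   # keep the first label seen as a target
    for source, _, label_source, _ in rows:
        labels.setdefault(source, label_source)   # only for nodes never seen as a target
    return labels

labels = node_labels(edges)
print(labels)
```

Note that in the sample data node G appears with label 1 in one row and label 0 in another, so "first occurrence wins" is a choice rather than a fact about the data. With such a dictionary, the color list could be built as [colors[labels[n]] for n in kcore], assuming colors is the list created from the colormap.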
I'm working on creating a linear trendline from data that contains dates and another measure (volume). The goal is to create a linear trendline that shows how volume trends over time.
The data looks as follows:
date typeID lowPrice highPrice avgPrice volume orders \
0 2003-11-30 22.0 9000.00 9000.00 9000.00 5.0 1.0
1 2003-12-31 22.0 9000.00 9000.00 9000.00 2.0 1.0
2 2004-01-31 22.0 15750.00 15750.00 15750.00 9.5 1.0
3 2004-02-29 22.0 7000.00 7000.00 7000.00 11.0 1.0
4 2004-03-31 22.0 7000.00 7000.00 7000.00 8.0 1.0
6 2004-05-31 22.0 15000.00 15000.00 15000.00 16.0 1.0
10 2004-09-30 22.0 6500.00 6500.00 6500.00 27.0 1.0
The issue is that for some months (the interval at which the dates are stored) there is no volume data available, as can be seen above. This is the approach I currently take to create a trendline from the available dates:
x = df2["date"]
df2["inc_dates"] = np.arange(len(x))
y = df2["ln_vold"]
plt.subplot(15, 4, count)
plt.plot_date(x, y, xdate = True)
model = smf.ols('ln_vold ~ inc_dates', missing = "drop", data = df2).fit()
intercept, coef = model.params
l = [intercept]
for i in range(len(x) -1):
    l.append(intercept + coef*i)
plt.plot_date(x, l, "r--", xdate = True)
However the output for this currently shows:
Which clearly isn't the right trendline (seen by the beginning being non-linear).
Now I don't see how this could go wrong, as all I do in the for-loop is add constant values to an increasing integer. All I'd like to see is a linear trendline going straight from the intercept to the end.
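One thing that stands out in the posted loop is that the list starts with the intercept and the first iteration (i = 0) appends the intercept again, so the first two points are identical. That may be what produces the kink at the beginning. A small sketch with made-up parameters (intercept, coef and n are hypothetical values, not fitted from the data):

```python
intercept, coef = 2.0, 0.5   # hypothetical fitted parameters
n = 6                        # hypothetical number of rows in df2

# Loop as posted: starts with the intercept, then appends it again for i = 0.
as_posted = [intercept]
for i in range(n - 1):
    as_posted.append(intercept + coef * i)

# Evaluating the fitted line at every value of inc_dates (0 .. n-1) instead.
straight = [intercept + coef * i for i in range(n)]

print(as_posted)  # [2.0, 2.0, 2.5, 3.0, 3.5, 4.0]
print(straight)   # [2.0, 2.5, 3.0, 3.5, 4.0, 4.5]
```

Separately, inc_dates is evenly spaced while the dates have gaps, so a line that is straight in inc_dates may still bend slightly when plotted against the actual dates wherever months are missing.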
I am trying to loop through a list to create a series of boxplots using Matplotlib. Each item in the list should print a plot that has 2 boxplots, 1 using df1 data and 1 using df2 data.
I am successfully plotting x1, but x2 is blank and I don't know why.
I am using jupyter notebook with Python 3. Any help is appreciated!
df1 = df[df.order == 1]
df2 = df[df.order == 0]
lst = ['device', 'ship', 'bill']
i = 0
for item in lst:
    plt.figure(i)
    x1= df1[item].values
    x2 = df2[item].values
    plt.boxplot([x1, x2])
    plt.title(item)
    i = i+1
The series that I'm trying to plot have the following format with several thousand observations each:
df[order] == 1
df['device'] df['ship'] df['bill']
0.0 0.0 0.0
19.0 5.0 0.0
237.0 237.0 237.0
df[order] == 0
df['device'] df['ship'] df['bill']
1.0 21.0 0.0
75.0 31.0 100.0
5.0 18.0 71.0
The dataframe contains data for orders. The columns listed in lst are of dtype float64.
Solved it... there were a couple of NaN values, which appear to have prevented the second boxplot from being drawn.
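For anyone hitting the same thing, a minimal sketch of filtering out the NaN values before plotting. The NaN in x2 is added here for illustration and drop_nan is a hypothetical helper; the other numbers are the sample 'device' values from the question.

```python
import math

def drop_nan(values):
    """Return the values without NaN entries."""
    return [v for v in values if not math.isnan(v)]

x1 = [0.0, 19.0, 237.0]               # sample 'device' values for order == 1
x2 = [1.0, float('nan'), 75.0, 5.0]   # order == 0 values with one NaN added (hypothetical)

clean = [drop_nan(x1), drop_nan(x2)]
print(clean)  # [[0.0, 19.0, 237.0], [1.0, 75.0, 5.0]]
```

The cleaned lists can then be passed to plt.boxplot. With pandas, df2[item].dropna().values should do the same filtering inside the loop.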