I want to plot a variable by date, across days and months. The grid is uneven when the month changes. How can I force the grid size in this case?
The data is loaded via pandas as a DataFrame.
ga =
Reference Organic_search Direct Date
0 0 0 0 2021-11-22
1 0 0 0 2021-11-23
2 0 0 0 2021-11-24
3 0 0 0 2021-11-25
4 0 0 0 2021-11-26
5 0 0 0 2021-11-27
6 0 0 0 2021-11-28
7 42 19 35 2021-11-29
8 69 33 48 2021-11-30
9 107 32 35 2021-12-01
10 62 30 26 2021-12-02
11 20 26 30 2021-12-03
12 22 22 20 2021-12-04
13 40 41 20 2021-12-05
14 14 39 26 2021-12-06
15 18 25 34 2021-12-07
16 8 21 13 2021-12-08
17 11 21 17 2021-12-09
18 23 27 20 2021-12-10
19 46 26 17 2021-12-11
20 29 42 20 2021-12-12
21 122 37 19 2021-12-13
22 97 25 29 2021-12-14
23 288 51 39 2021-12-15
24 96 29 26 2021-12-16
25 51 25 36 2021-12-17
26 23 16 21 2021-12-18
27 47 32 10 2021-12-19
Code:
import matplotlib.pyplot as plt
from matplotlib.dates import DateFormatter

fig, ax = plt.subplots(figsize=(15, 5))
ax.plot(ga.Date, ga.Reference)
ax.set(xlabel='Date',
       ylabel='Site traffic')
date_form = DateFormatter('%d/%m')
ax.xaxis.set_major_formatter(date_form)
[graph: line plot of Reference over Date, with uneven grid spacing where the month changes]
Looking at the added data, I realized why the interval was not constant: the number of days in each month is different.
So I converted the date column to plain strings, which makes the x-axis categorical, and then forced the grid spacing to be equal.
import pandas as pd
import matplotlib.pyplot as plt
import matplotlib.ticker as ticker

df = pd.read_excel('test.xlsx', index_col=0)
fig, ax = plt.subplots(figsize=(15, 5))
ax.plot(df['Date'].dt.strftime('%d/%m'), df.Reference)
ax.set(xlabel='Date',
       ylabel='Site traffic')
ax.grid(True)

# set x-axis tick interval: one grid line every 3 categorical positions
interval = 3
ax.xaxis.set_major_locator(ticker.MultipleLocator(interval))
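If you would rather keep real datetime values on the x-axis, here is a minimal alternative sketch (assuming ga['Date'] is a datetime column, as in the question): days are uniform in time, so a DayLocator gives evenly spaced grid lines even across month boundaries.
import matplotlib.pyplot as plt
import matplotlib.dates as mdates

fig, ax = plt.subplots(figsize=(15, 5))
ax.plot(ga['Date'], ga['Reference'])
ax.xaxis.set_major_locator(mdates.DayLocator(interval=3))  # a tick every 3 days
ax.xaxis.set_major_formatter(mdates.DateFormatter('%d/%m'))
ax.grid(True)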
I have a set of data that looks like this:
Exp # ID Q1 Q2 All IDs Q1 unique Q2 unique Overlap Unnamed: 8
0 1 58 32 58 58 14 40 18 18
1 2 55 38 44 55 28 34 10 10
2 4 95 69 83 95 37 51 32 32
3 5 92 68 84 92 31 47 37 37
4 6 0 0 0 0 0 0 0 0
5 7 71 52 65 71 27 40 25 25
6 8 84 69 69 84 39 39 30 30
7 10 65 35 63 65 17 45 18 18
8 11 90 72 72 90 39 39 33 33
9 14 88 84 80 88 52 48 32 32
10 17 89 56 75 89 30 49 26 26
11 19 83 56 70 83 32 46 24 24
12 20 94 72 83 93 35 46 37 37
13 21 73 57 56 73 38 37 19 19
For each exp #, I want to make a Venn diagram with the values Q1 Unique, Q2 Unique, and Overlap.
I have tried a couple of things, the below code has gotten me the closest:
from matplotlib import pyplot as plt
import numpy as np
from matplotlib_venn import venn2, venn2_circles
import pandas as pd

val_path = r"C:\Users\lawashburn\Documents\DIA\DSD First Pass\20220202_Acquisition\Overlap_Values.csv"
val_tab = pd.read_csv(val_path)
exp_num = val_tab['Exp #']
cols = ['Q1 unique', 'Q2 unique', 'Overlap']
df = pd.DataFrame()
df['Exp #'] = exp_num
df['combined'] = val_tab[cols].apply(lambda row: ','.join(row.values.astype(str)), axis=1)
print(df)
exp_no = df['Exp #'].tolist()
combined = df['combined'].tolist()
#combined = [int(i) for i in combined]
print(combined)
for a in exp_no:
    plt.figure(figsize=(4, 4))
    plt.title(a)
    for b in combined:
        v = venn2(subsets=(b), set_labels=('Q1', 'Q2'), set_colors=('purple', 'skyblue'), alpha=0.7)
        v.get_label_by_id('A').set_text('Q1')
        c = venn2_circles(subsets=(b))
        plt.show()
        plt.savefig(a + 'output.png')
This generates a DataFrame:
Exp # combined
0 1 14,40,18
1 2 28,34,10
2 4 37,51,32
3 5 31,47,37
4 6 0,0,0
5 7 27,40,25
6 8 39,39,30
7 10 17,45,18
8 11 39,39,33
9 14 52,48,32
10 17 30,49,26
11 19 32,46,24
12 20 35,46,37
13 21 38,37,19
However, I think I run into the issue when I export the combined column into a list:
['14,40,18', '28,34,10', '37,51,32', '31,47,37', '0,0,0', '27,40,25', '39,39,30', '17,45,18', '39,39,33', '52,48,32', '30,49,26', '32,46,24', '35,46,37', '38,37,19']
After this I get the error:
numpy.core._exceptions.UFuncTypeError: ufunc 'absolute' did not contain a loop with signature matching types dtype('<U8') -> dtype('<U8')
How should I proceed from here? I would like 13 separate Venn Diagrams, and to export each of them into a separate .png file.
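For what it's worth, a minimal sketch of one way to proceed, assuming the df built above: venn2 expects a tuple of three integers, not a comma-separated string, so each combined string needs to be split and converted; pairing each experiment with its own subset values (instead of nesting the loops) and calling savefig before show should yield one .png per experiment.
from matplotlib import pyplot as plt
from matplotlib_venn import venn2, venn2_circles

for a, b in zip(exp_no, combined):
    subsets = tuple(int(x) for x in b.split(','))  # '14,40,18' -> (14, 40, 18)
    if sum(subsets) == 0:
        continue  # skip all-zero rows (e.g. Exp 6), which have nothing to draw
    plt.figure(figsize=(4, 4))
    plt.title(str(a))
    v = venn2(subsets=subsets, set_labels=('Q1', 'Q2'),
              set_colors=('purple', 'skyblue'), alpha=0.7)
    venn2_circles(subsets=subsets)
    plt.savefig(str(a) + 'output.png')  # save before show(), which closes the figure
    plt.show()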
I am using Python pandas to calculate the efficiency of employees. I have a DataFrame describing the employees of a company; each employee has a unique employee ID. The DataFrame holds a monthly record of the number of hours worked per employee, so some days may be missing from the DataFrame for each employee. Those missing dates have to be filled in as zero rows, with the Date column set to the missing date and the ID set to the employee's ID. Example:
Employee WH Date C3 C4 C5
11 6 2021-06-03 - - -
11 7 2021-06-06
11 8 2021-06-08
13 5 2021-06-01
13 7 2021-06-02
13 7 2021-06-28
The missing dates for employee 11 are 01, 02, 04, 05, 07, 09-30.
The missing dates for employee 13 are 03-27, 29 and 30. There can be multiple employees with missing date ranges like this. The DataFrame needs to be filled with all those missing rows, each having the employee ID and date and the rest of the columns as 0, and then reindexed.
This can be accomplished by reindexing
https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.reindex.html
You'll need to first construct the new index you want (all employees, daily frequency), then set the identifying columns in the original dataframe as the index, and finally reindex and specify the fill value as 0.
id_cols = ['Date', 'Employee']
new_index = pd.MultiIndex.from_product(
    [pd.date_range(start='2021-06-01', end='2021-06-30', freq='D'),
     list_of_unique_employee_IDs],
    names=id_cols
)
df2 = df.set_index(id_cols).reindex(new_index, fill_value=0)
If you don't already have a list of all of the unique employee IDs, you can get it from your original df with df.Employee.unique().
If you want to go back to the default integer index rather than keeping the MultiIndex of Date and Employee, you can add a .reset_index() to the end of the last line, and it will insert Date and Employee back as columns in the dataframe.
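Put together on the sample data from the question (a quick sketch; only the WH value column is included):
import pandas as pd

df = pd.DataFrame({
    'Employee': [11, 11, 11, 13, 13, 13],
    'WH': [6, 7, 8, 5, 7, 7],
    'Date': pd.to_datetime(['2021-06-03', '2021-06-06', '2021-06-08',
                            '2021-06-01', '2021-06-02', '2021-06-28'])
})

id_cols = ['Date', 'Employee']
new_index = pd.MultiIndex.from_product(
    [pd.date_range(start='2021-06-01', end='2021-06-30', freq='D'),
     df.Employee.unique()],
    names=id_cols
)
# 60 rows: every (date, employee) pair, with WH = 0 where no record existed
df2 = df.set_index(id_cols).reindex(new_index, fill_value=0).reset_index()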
Instead of using a static date_range, you can generate it automatically from the data:
def missing_dates_from_series(sr):
    # full months covering the employee's dates: from the first day of the
    # earliest month to the last day of the latest month
    return pd.date_range(sr.min().strftime('%Y-%m'),
                         (sr.max() + pd.DateOffset(months=1)).strftime('%Y-%m'),
                         closed='left', freq='D')  # inclusive='left' on pandas >= 1.4

mi = pd.MultiIndex.from_frame(df.groupby('Employee')['Date']
                                .apply(missing_dates_from_series)
                                .explode().reset_index())
out = df.set_index(['Employee', 'Date']).reindex(mi, fill_value=0).reset_index()
>>> out
Employee Date WH
0 11 2021-06-01 0
1 11 2021-06-02 0
2 11 2021-06-03 6
3 11 2021-06-04 0
4 11 2021-06-05 0
5 11 2021-06-06 7
6 11 2021-06-07 0
7 11 2021-06-08 8
8 11 2021-06-09 0
9 11 2021-06-10 0
10 11 2021-06-11 0
11 11 2021-06-12 0
12 11 2021-06-13 0
13 11 2021-06-14 0
14 11 2021-06-15 0
15 11 2021-06-16 0
16 11 2021-06-17 0
17 11 2021-06-18 0
18 11 2021-06-19 0
19 11 2021-06-20 0
20 11 2021-06-21 0
21 11 2021-06-22 0
22 11 2021-06-23 0
23 11 2021-06-24 0
24 11 2021-06-25 0
25 11 2021-06-26 0
26 11 2021-06-27 0
27 11 2021-06-28 0
28 11 2021-06-29 0
29 11 2021-06-30 0
30 13 2021-06-01 5
31 13 2021-06-02 7
32 13 2021-06-03 0
33 13 2021-06-04 0
34 13 2021-06-05 0
35 13 2021-06-06 0
36 13 2021-06-07 0
37 13 2021-06-08 0
38 13 2021-06-09 0
39 13 2021-06-10 0
40 13 2021-06-11 0
41 13 2021-06-12 0
42 13 2021-06-13 0
43 13 2021-06-14 0
44 13 2021-06-15 0
45 13 2021-06-16 0
46 13 2021-06-17 0
47 13 2021-06-18 0
48 13 2021-06-19 0
49 13 2021-06-20 0
50 13 2021-06-21 0
51 13 2021-06-22 0
52 13 2021-06-23 0
53 13 2021-06-24 0
54 13 2021-06-25 0
55 13 2021-06-26 0
56 13 2021-06-27 0
57 13 2021-06-28 7
58 13 2021-06-29 0
59 13 2021-06-30 0
Try the pd.DataFrame.reindex method. The inspiration for this solution is taken from this excellent post. Since it is a dataframe instead of a series, a few extra steps are needed to get to the expected output, as shown below.
idx = pd.date_range('2021-06-01', '2021-06-30')  # set your date range
df.set_index('Date', inplace=True)
df.index = pd.DatetimeIndex(df.index)
output = df.groupby('Employee').apply(pd.DataFrame.reindex, idx, fill_value=0)\
           .drop(columns='Employee')\
           .reset_index()\
           .rename(columns={'level_1': 'Date'})
print(output)
Employee Date WH
0 11 2021-06-01 0
1 11 2021-06-02 0
2 11 2021-06-03 6
3 11 2021-06-04 0
4 11 2021-06-05 0
5 11 2021-06-06 7
6 11 2021-06-07 0
7 11 2021-06-08 8
8 11 2021-06-09 0
9 11 2021-06-10 0
10 11 2021-06-11 0
11 11 2021-06-12 0
12 11 2021-06-13 0
13 11 2021-06-14 0
14 11 2021-06-15 0
15 11 2021-06-16 0
16 11 2021-06-17 0
17 11 2021-06-18 0
18 11 2021-06-19 0
19 11 2021-06-20 0
20 11 2021-06-21 0
21 11 2021-06-22 0
22 11 2021-06-23 0
23 11 2021-06-24 0
24 11 2021-06-25 0
25 11 2021-06-26 0
26 11 2021-06-27 0
27 11 2021-06-28 0
28 11 2021-06-29 0
29 11 2021-06-30 0
30 13 2021-06-01 5
31 13 2021-06-02 7
32 13 2021-06-03 0
33 13 2021-06-04 0
34 13 2021-06-05 0
35 13 2021-06-06 0
36 13 2021-06-07 0
37 13 2021-06-08 0
38 13 2021-06-09 0
39 13 2021-06-10 0
40 13 2021-06-11 0
41 13 2021-06-12 0
42 13 2021-06-13 0
43 13 2021-06-14 0
44 13 2021-06-15 0
45 13 2021-06-16 0
46 13 2021-06-17 0
47 13 2021-06-18 0
48 13 2021-06-19 0
49 13 2021-06-20 0
50 13 2021-06-21 0
51 13 2021-06-22 0
52 13 2021-06-23 0
53 13 2021-06-24 0
54 13 2021-06-25 0
55 13 2021-06-26 0
56 13 2021-06-27 0
57 13 2021-06-28 7
58 13 2021-06-29 0
59 13 2021-06-30 0
I have a simple dataframe df that contains three columns:
Time: expressed in seconds
A: set of values that can vary between -inf to +inf
B: set of angles (degrees) which range between 0 and 359
Here is the dataframe
df = pd.DataFrame({'Time':[0,12,23,25,44,50], 'A':[5,7,9,8,11,6], 'B':[300,358,4,10,2,350]})
And it looks like this:
Time A B
0 0 5 300
1 12 7 358
2 23 9 4
3 25 8 10
4 44 11 2
5 50 6 350
My idea is to interpolate the data from 0 to 50 seconds, and I was able to achieve my goal using the following lines of code (the + 1 makes the range include the final time):
y = pd.DataFrame({'Time': list(range(df['Time'].iloc[0], df['Time'].iloc[-1] + 1))})
df = pd.merge(left=y, right=df, on='Time', how='left').interpolate()
Problem: even though column A is interpolated correctly, column B is wrong because the interpolation does not account for the wrap-around at 360 degrees! Here is an example:
Time A B
12 12 7.000000 358.000000
13 13 7.181818 325.818182
14 14 7.363636 293.636364
15 15 7.545455 261.454545
16 16 7.727273 229.272727
17 17 7.909091 197.090909
18 18 8.090909 164.909091
19 19 8.272727 132.727273
20 20 8.454545 100.545455
21 21 8.636364 68.363636
22 22 8.818182 36.181818
23 23 9.000000 4.000000
Question: can you suggest a smart and efficient way to solve this issue and correctly interpolate angles across the 0/360 degree boundary?
You should be able to use the method described in this question for the angle column:
import numpy as np
import pandas as pd
df = pd.DataFrame({'Time':[0,12,23,25,44,50], 'A':[5,7,9,8,11,6], 'B':[300,358,4,10,2,350]})
df['B'] = np.rad2deg(np.unwrap(np.deg2rad(df['B'])))
y = pd.DataFrame({'Time': list(range(df['Time'].iloc[0], df['Time'].iloc[-1] + 1))})
df = pd.merge(left=y, right=df, on='Time', how='left').interpolate()
df['B'] %= 360
print(df)
Output:
Time A B
0 0 5.000000 300.000000
1 1 5.166667 304.833333
2 2 5.333333 309.666667
3 3 5.500000 314.500000
4 4 5.666667 319.333333
5 5 5.833333 324.166667
6 6 6.000000 329.000000
7 7 6.166667 333.833333
8 8 6.333333 338.666667
9 9 6.500000 343.500000
10 10 6.666667 348.333333
11 11 6.833333 353.166667
12 12 7.000000 358.000000
13 13 7.181818 358.545455
14 14 7.363636 359.090909
15 15 7.545455 359.636364
16 16 7.727273 0.181818
17 17 7.909091 0.727273
18 18 8.090909 1.272727
19 19 8.272727 1.818182
20 20 8.454545 2.363636
21 21 8.636364 2.909091
22 22 8.818182 3.454545
23 23 9.000000 4.000000
24 24 8.500000 7.000000
25 25 8.000000 10.000000
26 26 8.157895 9.578947
27 27 8.315789 9.157895
28 28 8.473684 8.736842
29 29 8.631579 8.315789
30 30 8.789474 7.894737
31 31 8.947368 7.473684
32 32 9.105263 7.052632
33 33 9.263158 6.631579
34 34 9.421053 6.210526
35 35 9.578947 5.789474
36 36 9.736842 5.368421
37 37 9.894737 4.947368
38 38 10.052632 4.526316
39 39 10.210526 4.105263
40 40 10.368421 3.684211
41 41 10.526316 3.263158
42 42 10.684211 2.842105
43 43 10.842105 2.421053
44 44 11.000000 2.000000
45 45 10.166667 0.000000
46 46 9.333333 358.000000
47 47 8.500000 356.000000
48 48 7.666667 354.000000
49 49 6.833333 352.000000
50 50 6.000000 350.000000
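For intuition, np.unwrap is what removes the artificial jumps: whenever consecutive angles differ by more than 180 degrees, it adds or subtracts a multiple of 360 degrees so the sequence becomes continuous and safe to interpolate linearly. A quick check on the B column:
import numpy as np

b = [300, 358, 4, 10, 2, 350]
print(np.rad2deg(np.unwrap(np.deg2rad(b))))
# [300. 358. 364. 370. 362. 350.]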
Given a file with the following columns:
date, userid, amount
where date is in yyyy-mm-dd format. I am trying to use Python pandas to convert yyyy-mm-dd dates from multiple years into accumulated week numbers. For example:
2017-01-01 => 1
2017-12-31 => 52
2018-01-01 => 53
df_counts_dates=pd.read_csv("counts.csv")
print (df_counts_dates['date'].unique())
df = pd.to_datetime(df_counts_dates['date'])
print (df.unique())
print (df.dt.week.unique())
since the data contains Aug 2017-Aug 2018 dates, the above returns
[33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 1 2 3 4 5
6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30
31 32]
I am wondering if there is any easy way to make the first date "week 1" and have the week number accumulate across years, instead of resetting to 1 at the beginning of each year?
I believe you need a slightly different approach: subtract the first value of the column from all values, convert the timedeltas to days, floor-divide by 7, and finally add 1 so the numbering does not start at 0:
rng = pd.date_range('2017-08-01', periods=365)
df = pd.DataFrame({'date': rng, 'a': range(365)})
print (df.head())
date a
0 2017-08-01 0
1 2017-08-02 1
2 2017-08-03 2
3 2017-08-04 3
4 2017-08-05 4
w = ((df['date'] - df['date'].iloc[0]).dt.days // 7 + 1).unique()
print (w)
[ 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24
25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48
49 50 51 52 53]
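Applied to the counts.csv from the question, the same idea would look like this (a sketch; min() is used instead of iloc[0] so the rows do not have to be sorted by date):
import pandas as pd

df_counts_dates = pd.read_csv('counts.csv', parse_dates=['date'])
# accumulated week number: 1 for the earliest week, increasing across year boundaries
df_counts_dates['week'] = (df_counts_dates['date']
                           - df_counts_dates['date'].min()).dt.days // 7 + 1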
Assume that I have a dataframe like this:
Date Artist percent_gray percent_blue percent_black percent_red
33 Leonardo 22 33 36 46
45 Leonardo 23 47 23 14
46 Leonardo 13 34 33 12
23 Michelangelo 28 19 38 25
25 Michelangelo 24 56 55 13
26 Michelangelo 21 22 45 13
13 Titian 24 17 23 22
16 Titian 45 43 44 13
19 Titian 17 45 56 13
24 Raphael 34 34 34 45
27 Raphael 31 22 25 67
I want to get the maximum color difference between different pictures by the same artist. I can also compare percent_gray with percent_blue, e.g. for Leonardo the biggest difference is percent_red (date 46) - percent_blue (date 45) = 12 - 47 = -35. I want to see how it evolves over time, so I only want to compare newer pictures of the same artist with older ones (in this case I can compare the third picture with the first and second ones, and the second picture only with the first one) and get the maximum differences. So the dataframe should look like:
Date Artist max_d
33 Leonardo NaN
45 Leonardo -32
46 Leonardo -35
23 Michelangelo NaN
25 Michelangelo 37
26 Michelangelo -43
13 Titian NaN
16 Titian 28
19 Titian 43
24 Raphael NaN
27 Raphael 33
I think I have to use groupby but couldn't manage to get the output I want.
You can use:
import numpy as np

#first sort by Artist and Date in the real data
df = df.sort_values(['Artist', 'Date'])
mi = df.iloc[:,2:].min(axis=1)
ma = df.iloc[:,2:].max(axis=1)
ma1 = ma.groupby(df['Artist']).shift()
mi1 = mi.groupby(df['Artist']).shift()
mad1 = mi - ma1
mad2 = ma - mi1
df['max_d'] = np.where(mad1.abs() > mad2.abs(), mad1, mad2)
print (df)
Date Artist percent_gray percent_blue percent_black \
0 33 Leonardo 22 33 36
1 45 Leonardo 23 47 23
2 46 Leonardo 13 34 33
3 23 Michelangelo 28 19 38
4 25 Michelangelo 24 56 55
5 26 Michelangelo 21 22 45
6 13 Titian 24 17 23
7 16 Titian 45 43 44
8 19 Titian 17 45 56
9 24 Raphael 34 34 34
10 27 Raphael 31 22 25
percent_red max_d
0 46 NaN
1 14 -32.0
2 12 -35.0
3 25 NaN
4 13 37.0
5 13 -43.0
6 22 NaN
7 13 28.0
8 13 43.0
9 45 NaN
10 67 33.0
Explanation (with new columns):
#get min and max per rows
df['min'] = df.iloc[:,2:].min(axis=1)
df['max'] = df.iloc[:,2:].max(axis=1)
#get shifted min and max by Artist
df['max1'] = df.groupby('Artist')['max'].shift()
df['min1'] = df.groupby('Artist')['min'].shift()
#get differences
df['max_d1'] = df['min'] - df['max1']
df['max_d2'] = df['max'] - df['min1']
#if else of absolute values
df['max_d'] = np.where(df['max_d1'].abs() > df['max_d2'].abs(), df['max_d1'], df['max_d2'])
print (df)
percent_red min max max1 min1 max_d1 max_d2 max_d
0 46 22 46 NaN NaN NaN NaN NaN
1 14 14 47 46.0 22.0 -32.0 25.0 -32.0
2 12 12 34 47.0 14.0 -35.0 20.0 -35.0
3 25 19 38 NaN NaN NaN NaN NaN
4 13 13 56 38.0 19.0 -25.0 37.0 37.0
5 13 13 45 56.0 13.0 -43.0 32.0 -43.0
6 22 17 24 NaN NaN NaN NaN NaN
7 13 13 45 24.0 17.0 -11.0 28.0 28.0
8 13 13 56 45.0 13.0 -32.0 43.0 43.0
9 45 34 45 NaN NaN NaN NaN NaN
10 67 22 67 45.0 34.0 -23.0 33.0 33.0
And if you use the second, explanatory solution, remove the helper columns:
df = df.drop(['min','max','max1','min1','max_d1', 'max_d2'], axis=1)
print (df)
Date Artist percent_gray percent_blue percent_black \
0 33 Leonardo 22 33 36
1 45 Leonardo 23 47 23
2 46 Leonardo 13 34 33
3 23 Michelangelo 28 19 38
4 25 Michelangelo 24 56 55
5 26 Michelangelo 21 22 45
6 13 Titian 24 17 23
7 16 Titian 45 43 44
8 19 Titian 17 45 56
9 24 Raphael 34 34 34
10 27 Raphael 31 22 25
percent_red max_d
0 46 NaN
1 14 -32.0
2 12 -35.0
3 25 NaN
4 13 37.0
5 13 -43.0
6 22 NaN
7 13 28.0
8 13 43.0
9 45 NaN
10 67 33.0
How about a custom apply function? Does this work?
import pandas
import itertools

p = pandas.read_csv('Artists.tsv', sep=r'\s+')

def max_any_color(cols):
    grey = []
    blue = []
    black = []
    red = []
    for row in cols.iterrows():
        grey.append(row[1]['percent_gray'])
        blue.append(row[1]['percent_blue'])
        black.append(row[1]['percent_black'])
        red.append(row[1]['percent_red'])
    # maximum absolute difference for every pair of colors
    gb = max([abs(a[0] - a[1]) for a in itertools.product(grey, blue)])
    gblack = max([abs(a[0] - a[1]) for a in itertools.product(grey, black)])
    gr = max([abs(a[0] - a[1]) for a in itertools.product(grey, red)])
    bb = max([abs(a[0] - a[1]) for a in itertools.product(blue, black)])
    br = max([abs(a[0] - a[1]) for a in itertools.product(blue, red)])
    blackr = max([abs(a[0] - a[1]) for a in itertools.product(black, red)])
    l = [gb, gblack, gr, bb, br, blackr]
    c = ['grey/blue', 'grey/black', 'grey/red', 'blue/black', 'blue/red', 'black/red']
    max_ = max(l)
    between_colors_index = l.index(max_)
    return c[between_colors_index], max_

p.groupby('Artist').apply(lambda x: max_any_color(x))
Output:
Leonardo (blue/red, 35)
Michelangelo (blue/red, 43)
Raphael (blue/red, 45)
Titian (black/red, 43)