How to split into train and test sets by month - Python

I have a dataframe structured like this
Time      Z   X   Y
01-01-18  1   20  10
02-01-18  20  4   15
03-01-18  34  16  21
04-01-18  67  38  8
05-01-18  89  10  18
06-01-18  45  40  4
07-01-18  22  10  13
08-01-18  1   46  11
...
24-12-20  56  28  9
25-12-20  6   14  22
26-12-20  9   5   40
27-12-20  56  11  10
28-12-21  78  61  35
29-12-21  33  23  29
30-12-21  2   35  12
31-12-21  0   31  7
I have data for all days and months from 2018 to 2021, with around 50k observations.
How can I aggregate all the data for the same month and perform a train-test split for each month, i.e. over all the data from January, February, March, and so on?

Try this to get the month out of the Time column:
df['month'] = df.Time.apply(lambda x: x.split('-')[1])  # dates are DD-MM-YY, so index 1 is the month
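That only extracts the month string; a minimal sketch of the full per-month split, assuming scikit-learn's train_test_split is acceptable (the 80/20 ratio and random_state are illustrative):

import pandas as pd
from sklearn.model_selection import train_test_split

# parse the DD-MM-YY strings and key on the calendar month
df['Time'] = pd.to_datetime(df['Time'], format='%d-%m-%y')
df['month'] = df['Time'].dt.month

# one train/test split per month, pooling that month across all years
splits = {}
for month, group in df.groupby('month'):
    train, test = train_test_split(group, test_size=0.2, random_state=42)
    splits[month] = (train, test)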

Related

Pandas - merging start/end time ranges with short gaps

Say I have a series of start and end times for a given event:
np.random.seed(1)
df = pd.DataFrame(np.random.randint(1,5,30).cumsum().reshape(-1, 2), columns = ["start", "end"])
start end
0 2 6
1 7 8
2 12 14
3 18 20
4 24 25
5 26 28
6 29 33
7 35 36
8 39 41
9 44 45
10 48 50
11 53 54
12 58 59
13 62 63
14 65 68
I'd like to merge time ranges with a gap less than or equal to n, so for n = 1 the result would be:
fn(df, n = 1)
start end
0 2 8
2 12 14
3 18 20
4 24 33
7 35 36
8 39 41
9 44 45
10 48 50
11 53 54
12 58 59
13 62 63
14 65 68
I can't seem to find a way to do this with pandas without iterating and building up the result line-by-line. Is there some simpler way to do this?
You can subtract the shifted end values from the start values, compare against N to get a mask, build group ids with a cumulative sum, and pass them to groupby to aggregate the min start and max end:
N = 1
g = df['start'].sub(df['end'].shift())
df = df.groupby(g.gt(N).cumsum()).agg({'start':'min', 'end':'max'})
print (df)
start end
1 2 8
2 12 14
3 18 20
4 24 33
5 35 36
6 39 41
7 44 45
8 48 50
9 53 54
10 58 59
11 62 63
12 65 68
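Wrapped into the fn(df, n) the question asks for (a sketch of the same logic, nothing changed):

def fn(df, n=1):
    # a new group starts wherever the gap to the previous row's end exceeds n
    group_ids = df['start'].sub(df['end'].shift()).gt(n).cumsum()
    return df.groupby(group_ids).agg({'start': 'min', 'end': 'max'})

merged = fn(df, n=1)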

How to drop a row based on the condition of a column in pandas?

I am trying to drop the rows that contain 'CENTER' in column 2.
Works:
BuildingNameContains_center = dframe[dframe[2].str.contains('CENTER')]
Output works:
BuildingNameContains_center
Produces Error:
dframe.drop(BuildingNameContains_center, inplace= True)
KeyError: '[ 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23\n 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41] not found in axis'
IIUC, instead of subsetting a dataframe and then trying to drop those rows, just subset the inverse of your original selection:
dframe = dframe.loc[~dframe[2].str.contains('CENTER'), :]
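For what it's worth, the original call fails because DataFrame.drop expects index labels, not a DataFrame. Passing the subset's index works too (equivalent to the inversion above):

# drop by the index labels of the matching rows
dframe.drop(BuildingNameContains_center.index, inplace=True)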

python pandas assign yyyy-mm-dd from multiple years into accumulated week numbers

Given a file with the following columns:
date, userid, amount
where date is in yyyy-mm-dd format. I am trying to use python pandas to assign yyyy-mm-dd from multiple years into accumulated week numbers. For example:
2017-01-01 => 1
2017-12-31 => 52
2018-01-01 => 53
df_counts_dates=pd.read_csv("counts.csv")
print (df_counts_dates['date'].unique())
df = pd.to_datetime(df_counts_dates['date'])
print (df.unique())
print (df.dt.week.unique())
Since the data covers Aug 2017 through Aug 2018, the above returns:
[33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 1 2 3 4 5
6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30
31 32]
I am wondering if there is any easy way to make the first date "week 1", and make the week number accumulate across years instead of becoming 1 at the beginning of each year?
I believe you need a slightly different approach: subtract the first value of the column from every value, convert the timedeltas to days, floor-divide by 7, and add 1 so the numbering doesn't start at 0:
rng = pd.date_range('2017-08-01', periods=365)
df = pd.DataFrame({'date': rng, 'a': range(365)})
print (df.head())
date a
0 2017-08-01 0
1 2017-08-02 1
2 2017-08-03 2
3 2017-08-04 3
4 2017-08-05 4
w = ((df['date'] - df['date'].iloc[0]).dt.days // 7 + 1).unique()
print (w)
[ 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24
25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48
49 50 51 52 53]
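Applied back to the CSV from the question (a sketch; min() instead of iloc[0] guards against unsorted dates):

df_counts_dates = pd.read_csv("counts.csv")
dates = pd.to_datetime(df_counts_dates['date'])

# accumulated week number, counting from the earliest date as week 1
df_counts_dates['week'] = (dates - dates.min()).dt.days // 7 + 1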

How to plot multiple lines as histograms per group from a pandas DataFrame

I am trying to look at 'time of day' effects on my users on a week over week basis to get a quick visual take on how consistent time of day trends are. So as a first start I've used this:
df[df['week'] < 10][['realLocalTime', 'week']].hist(by = 'week', bins = 24, figsize = (15, 15))
This produces a grid of histograms, one per week (image not shown here).
This is a nice easy start, but what I would really like is to represent the histogram as a line plot, and overlay all the lines, one for each week on the same plot. Is there a way to do this?
I have a bit more experience with ggplot, where I would just do this by adding a factor-level dependency on color. Is there a similarly easy way to do this with pandas and/or matplotlib?
Here's what my data looks like:
realLocalTime week
1 12 10
2 12 10
3 12 10
4 12 10
5 13 5
6 17 5
7 17 5
8 6 6
9 17 5
10 20 6
11 18 5
12 18 5
13 19 6
14 21 6
15 21 6
16 14 6
17 6 6
18 0 6
19 21 5
20 17 6
21 23 6
22 22 6
23 22 6
24 17 6
25 22 5
26 13 6
27 23 6
28 22 5
29 21 6
30 17 6
... ... ...
70 14 5
71 9 5
72 19 6
73 19 6
74 21 6
75 20 5
76 20 5
77 21 5
78 15 6
79 22 6
80 23 6
81 15 6
82 12 6
83 7 6
84 9 6
85 8 6
86 22 6
87 22 6
88 22 6
89 8 5
90 8 5
91 8 5
92 9 5
93 7 5
94 22 5
95 8 6
96 10 6
97 0 6
98 22 5
99 14 6
Maybe you can simply use crosstab to compute the number of elements per (time, week) pair and plot it:
# Test data
d = {'realLocalTime': ['12','14','14','12','13','17','14','17'],
     'week': ['10','10','10','10','5','5','6','6']}
df = pd.DataFrame(d)
ax = pd.crosstab(df['realLocalTime'], df['week']).plot()
Alternatively, use groupby and value_counts:
df.groupby('week').realLocalTime.value_counts().unstack(0).fillna(0).plot()
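One caveat: in the test data above realLocalTime holds strings, so the x-axis sorts lexicographically ('10' before '5'); casting to int first keeps the hours in order. A minimal sketch combining the two ideas (the column values here are hypothetical):

import pandas as pd

df = pd.DataFrame({'realLocalTime': [12, 14, 14, 12, 13, 17, 14, 17],
                   'week': [10, 10, 10, 10, 5, 5, 6, 6]})

# count observations per (hour, week) and draw one line per week
ax = pd.crosstab(df['realLocalTime'], df['week']).plot()
ax.set_xlabel('hour of day')
ax.set_ylabel('count')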

Find maximum value in python dataframe combining several rows

I have a dataframe that looks like the following (already sorted by the item column). Items 1-10, 11-20, ... (every 10 items) belong to the same category, and I want to find the item in each category that has the highest score and return it.
What is the most efficient way to do that?
item score
1 1 10
3 4 1
4 6 6
39 11 2
8 12 1
9 13 1
10 15 24
11 17 9
12 18 12
13 20 7
14 22 1
59 25 3
18 28 3
19 29 2
22 34 2
23 37 1
24 38 3
25 39 2
26 40 2
27 42 3
29 45 1
31 48 1
32 53 4
33 58 4
Assuming your dataframe is stored in df, group the items into bins of 10:
g = df.groupby(pd.cut(df.item, np.arange(1, df.item.max(), 10), right=False))
Get the index of the row with the max score in each category:
max_score_ids = g.score.agg('idxmax')
This gives you the ids of the rows that contain the max score in each category:
item
[1, 11) 1
[11, 21) 10
[21, 31) 59
[31, 41) 24
[41, 51) 27
Then get the items associated with these ids:
df.loc[max_score_ids].item
1 1
10 15
59 25
24 38
27 42
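One edge case worth flagging: np.arange(1, df.item.max(), 10) stops before the max, so with this data the last bin is [41, 51) and items 53 and 58 fall outside every bin (pd.cut assigns them NaN and they are dropped by the groupby). Extending the edges past the max avoids that:

import numpy as np

# extend the bin edges past the max so the last items get a bin too
bins = np.arange(1, df.item.max() + 10, 10)
g = df.groupby(pd.cut(df.item, bins, right=False))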
