Python Data Frame: How do I work with rows? - python

I have imported this file as a DataFrame in pandas. The left-most column is time (7 am to 9:15 am); rows show traffic volume at an intersection in 15-minute increments. How do I find the peak hour, i.e. the hour with the most volume? To get the hourly volumes, I have to add 4 rows.
I am a newbie with Python and any help is appreciated.
import pandas as pd
f_path ="C:/Users/reggi/Dropbox/1. 2020/6. Programming Python/Text Files/TMC118txt.txt"
df = pd.read_csv(f_path, index_col=0, sep='\s+')
Sample of the data file below: the first column is time in 15-minute increments, the first row is the traffic count by movement.
NBL NBT NBR SBL SBT SBR EBL EBT EBR WBL WBT WBR
715 8 3 12 1 1 0 4 93 18 36 68 4
730 16 5 20 5 2 1 0 135 12 39 128 3
745 9 1 29 6 2 3 4 169 21 28 163 6
800 10 2 33 4 0 4 4 147 8 34 174 6
815 11 1 30 1 4 3 4 93 10 28 140 8

My approach would be to move the time to a column:
df.reset_index(inplace=True)
Then I would create a new column for hour and one for minutes:
df['hour'] = df['index'].astype(str).str[:-2]    # 715 -> '7'
df['minute'] = df['index'].astype(str).str[-2:]  # 715 -> '15'
(The index values are numbers like 715, so convert to string before slicing.)
Then you could do a groupby on hour and sum the traffic movements, sort, etc.
hourly = df.groupby(by='hour').sum()
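Another option, if it helps, is a rolling sum: total each 15-minute row, then sum a sliding window of 4 rows so every window covers one hour (not just the clock hours). A minimal, self-contained sketch using the sample rows above, inlined via StringIO instead of the local file path:

```python
from io import StringIO

import pandas as pd

# Sample data from the question (15-minute counts per movement)
data = """NBL NBT NBR SBL SBT SBR EBL EBT EBR WBL WBT WBR
715 8 3 12 1 1 0 4 93 18 36 68 4
730 16 5 20 5 2 1 0 135 12 39 128 3
745 9 1 29 6 2 3 4 169 21 28 163 6
800 10 2 33 4 0 4 4 147 8 34 174 6
815 11 1 30 1 4 3 4 93 10 28 140 8"""

df = pd.read_csv(StringIO(data), index_col=0, sep=r'\s+')

# Total volume across all movements in each 15-minute bin
totals = df.sum(axis=1)

# An hour is 4 consecutive bins: rolling sum of 4 rows.
# Each window is labelled by its last bin, so the peak hour *ends* there.
hourly = totals.rolling(4).sum()
peak_end = hourly.idxmax()
peak_volume = hourly.max()
```

With this sample the busiest window ends at 815, i.e. the 7:30–8:30 hour.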

Related

Grouping of a dataframe monthly after calculating the highest daily values

I've got a dataframe with two columns: one is a datetime column of dates, and the other holds a quantity. It looks something like this,
Date Quantity
0 2019-01-05 10
1 2019-01-10 15
2 2019-01-22 14
3 2019-02-03 12
4 2019-05-11 25
5 2019-05-21 4
6 2019-07-08 1
7 2019-07-30 15
8 2019-09-05 31
9 2019-09-10 44
10 2019-09-25 8
11 2019-12-09 10
12 2020-04-11 111
13 2020-04-17 5
14 2020-06-05 17
15 2020-06-16 12
16 2020-06-22 14
I want to make another dataframe with two columns: Month/Year and Till Highest. I basically want the highest quantity value seen up to and including that month, grouped by month/year. Precisely, what I want is,
Month/Year Till Highest
0 Jan/2019 15
1 Feb/2019 15
2 May/2019 25
3 Jul/2019 25
4 Sep/2019 44
5 Dec/2019 44
6 Apr/2020 111
7 Jun/2020 111
In my case the dataset is vast: I have readings for almost every day of each month and year in the specified timeline. I've made a dummy dataset here to show an example of what I want.
Please help me with this. Thanks in advance :)
See the annotated code:
(df
 # convert date to monthly period (2019-01)
 .assign(Date=pd.to_datetime(df['Date']).dt.to_period('M'))
 # period and max quantity per month
 .groupby('Date')
 .agg(**{'Month/Year': ('Date', 'first'),
         'Till highest': ('Quantity', 'max')})
 # format periods as Jan/2019 and get the cumulative max quantity
 .assign(**{
     'Month/Year': lambda d: d['Month/Year'].dt.strftime('%b/%Y'),
     'Till highest': lambda d: d['Till highest'].cummax()
 })
 # drop the groupby index
 .reset_index(drop=True)
)
output:
Month/Year Till highest
0 Jan/2019 15
1 Feb/2019 15
2 May/2019 25
3 Jul/2019 25
4 Sep/2019 44
5 Dec/2019 44
6 Apr/2020 111
7 Jun/2020 111
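The same result can also be reached with a slightly shorter chain: take the monthly max, then a running cummax. A sketch using a subset of the sample data (one representative row per month, so it stays short):

```python
import pandas as pd

df = pd.DataFrame({
    'Date': ['2019-01-05', '2019-01-10', '2019-02-03', '2019-05-11',
             '2019-09-10', '2020-04-11'],
    'Quantity': [10, 15, 12, 25, 44, 111],
})

out = (
    df.assign(Date=pd.to_datetime(df['Date']).dt.to_period('M'))
      .groupby('Date')['Quantity'].max()   # monthly maximum
      .cummax()                            # running maximum up to each month
      .rename('Till Highest')
      .reset_index()
      .assign(**{'Month/Year': lambda d: d['Date'].dt.strftime('%b/%Y')})
      [['Month/Year', 'Till Highest']]
)
```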
In R you can use cummax:
df = data.frame(
  Date = c("2019-01-05","2019-01-10","2019-01-22","2019-02-03","2019-05-11",
           "2019-05-21","2019-07-08","2019-07-30","2019-09-05","2019-09-10",
           "2019-09-25","2019-12-09","2020-04-11","2020-04-17","2020-06-05",
           "2020-06-16","2020-06-22"),
  Quantity = c(10,15,14,12,25,4,1,15,31,44,8,10,111,5,17,12,14)
)
data.frame(`Month/Year` = unique(format(as.Date(df$Date), "%b/%Y")),
           `Till Highest` = cummax(tapply(df$Quantity, sub("-..$", "", df$Date), max)),
           check.names = F, row.names = NULL)
Month/Year Till Highest
1 Jan/2019 15
2 Feb/2019 15
3 May/2019 25
4 Jul/2019 25
5 Sep/2019 44
6 Dec/2019 44
7 Apr/2020 111
8 Jun/2020 111

vertically integrate a datasheet with over 1,500 columns

I have a data sheet with about 1,700 columns and 100 rows of data with a unique identifier. It is survey data: every employee of an organization answers the same 9 questions, but it's compiled into one row of data per organization. Is there a way in python/pandas to vertically integrate (stack) this data, as opposed to the elongated format along the x-axis it is in now? I am currently cutting and pasting.
You can reshape the underlying numpy array and reindex with proper companies:
import numpy as np
import pandas as pd

# sample data, assuming the index is the company
df = pd.DataFrame(np.arange(36).reshape(2, -1))
# new index
idx = df.index.repeat(df.shape[1]//9)
# new data:
new_df = pd.DataFrame(df.values.reshape(-1,9), index=idx)
Output:
0 1 2 3 4 5 6 7 8
0 0 1 2 3 4 5 6 7 8
0 9 10 11 12 13 14 15 16 17
1 18 19 20 21 22 23 24 25 26
1 27 28 29 30 31 32 33 34 35
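If you would rather stay in pandas (for instance to keep per-column dtypes or labels), a sketch of an equivalent approach slices the frame into consecutive 9-column blocks, gives each block the same column labels, and concatenates them, regrouping by company:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(np.arange(36).reshape(2, -1))  # 2 companies x 18 answers

# slice into consecutive 9-column blocks, relabel each block's columns
# as 0..8, then stack the blocks and regroup rows by company
blocks = [df.iloc[:, i:i + 9].set_axis(range(9), axis=1)
          for i in range(0, df.shape[1], 9)]
new_df = pd.concat(blocks).sort_index(kind='stable')
```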

Creating a Box-Plot but by value_counts() [Number of events occurred]

I have the following dataframe. Each entry is an event that occurred [550624 events]. Suppose we are interested in a box-plot of the number of events occurring per day each month.
print(df)
Month Day
0 4 1
1 4 1
2 4 1
3 4 1
4 4 1
... ...
550619 10 31
550620 10 31
550621 10 31
550622 10 31
550623 10 31
[550624 rows x 2 columns]
df2 = df.groupby('Month')['Day'].value_counts().sort_index()
Month Day
4 1 2162
2 1564
3 1973
4 1620
5 1860
...
10 27 2022
28 1606
29 1316
30 1674
31 1726
sns.boxplot(x = df2.index.get_level_values('Month'), y = df2)
Output of sns.boxplot
My question is whether this is the most efficient/direct way to create this visual, or whether I am taking a roundabout route. Is there a more direct way to achieve it?
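One common alternative (whether it is more direct is a matter of taste) is to count events with groupby(...).size() and hand seaborn a flat DataFrame with named columns, rather than passing index levels and a Series separately. A sketch with a tiny made-up event frame, since the real 550,624-row data isn't reproducible here:

```python
import pandas as pd

# Hypothetical stand-in for the 550,624-row event frame
df = pd.DataFrame({'Month': [4, 4, 4, 4, 10, 10, 10],
                   'Day':   [1, 1, 2, 2, 31, 30, 30]})

# One row per (Month, Day) holding the event count; groupby(...).size()
# is equivalent to value_counts() on Day within each Month, but yields
# a flat frame that plotting libraries accept directly.
counts = df.groupby(['Month', 'Day']).size().reset_index(name='events')

# Then, with seaborn:
# sns.boxplot(data=counts, x='Month', y='events')
```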

Compare Relative Start Dates in Pandas

I would like to create a table of relative start dates using the output of a Pandas pivot table. The columns of the pivot table are months, the rows are accounts, and the cells are a running total of actions. For example:
Date1 Date2 Date3 Date4
1 1 2 3
N/A 1 2 2
The first row's first instance is Date1.
The second row's first instance is Date2.
The new table would be formatted such that the columns are now the months relative to the first action and would look like:
FirstMonth SecondMonth ThirdMonth
1 1 2
1 2 2
Creating the initial pivot table is straightforward in pandas. I'm curious whether there are any suggestions for how to develop the table of relative starting points. Thank you!
First, make sure your dataframe columns are actual datetime values. Then you can run the following to calculate the sum of actions for each date and group those sums by month:
>>> df
2019-01-01 2019-01-02 2019-02-01
Row
0 4 22 40
1 22 67 86
2 72 27 25
3 0 26 60
4 44 62 32
5 73 86 81
6 81 17 58
7 88 29 21
>>> df.sum().groupby(df.sum().index.month).sum()
1 720
2 403
And if you want it to reflect what you had above:
>>> out = df.sum().groupby(df.sum().index.month).sum().to_frame().T
>>> import datetime
>>> out.columns = [datetime.datetime.strftime(datetime.datetime.strptime(str(x), '%m'), '%B') for x in out.columns]
>>> out
January February
0 720 403
And if I misunderstood you, and you want it broken out by record / row:
>>> df.T.groupby(df.T.index.month).sum().T
1 2
Row
0 26 40
1 89 86
2 99 25
3 26 60
4 106 32
5 159 81
6 98 58
7 117 21
Rename the columns as above.
The trick is to use .apply() combined with dropna().
df.T.apply(lambda x: pd.Series(x.dropna().values)).T
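To see that trick on the question's example (using NaN for the N/A cell), a minimal sketch:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame([[1, 1, 2, 3],
                   [np.nan, 1, 2, 2]],
                  columns=['Date1', 'Date2', 'Date3', 'Date4'])

# For each row (a column of df.T), drop the NaNs (leading ones, in this
# data) and rebuild the Series, so every account's first real month
# shifts to position 0; trailing positions are padded with NaN.
out = df.T.apply(lambda x: pd.Series(x.dropna().values)).T
```

Note that dropna removes all NaNs, not just leading ones; for a running total that starts N/A and then has values, that is exactly the shift you want.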

create new columns with info of other column on python pandas DataFrame

I have a grouped dataframe
id  num  week
101 23   7      3
         8      1
         9      2
102 34   8      4
         9      1
         10     2
...
And I need to create new columns and have a dataFrame like this
id num 7 8 9 10
101 23 3 1 2 0
102 34 0 4 1 2
...
As you may see, the values of the week column turned into several columns.
I may also have the input dataFrame not grouped, or with reset_index, like this:
id num week
101 23 7 3
101 23 8 1
101 23 9 2
102 34 8 4
102 34 9 1
102 34 10 2
...
but I don't know with which would be easier to start.
Notice that id and num are both keys
Use unstack(), plus fillna(0) so you don't end up with NaNs.
Let's load the data:
id num week val
101 23 7 3
101 23 8 1
101 23 9 2
102 34 8 4
102 34 9 1
102 34 10 2
s = pd.read_clipboard(index_col=[0, 1, 2], squeeze=True)  # on pandas >= 2.0, drop squeeze=True and call .squeeze("columns") on the result
Notice I have set the index to be id, num and week. If you haven't yet, use set_index.
Now we can unstack: move an index level from the rows to the columns. By default it unstacks the last level, which is week here, but you can specify it with level=-1 or level='week'.
s.unstack().fillna(0)
Note that, as pointed out by @piRsquared, you can do s.unstack(fill_value=0) to do it in one go.
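A self-contained version of the same steps, hardcoding the sample instead of using read_clipboard:

```python
import pandas as pd

df = pd.DataFrame({'id':   [101, 101, 101, 102, 102, 102],
                   'num':  [23, 23, 23, 34, 34, 34],
                   'week': [7, 8, 9, 8, 9, 10],
                   'val':  [3, 1, 2, 4, 1, 2]})

# id/num stay as row keys; week moves from the index to the columns,
# and missing (id, num, week) combinations are filled with 0
out = df.set_index(['id', 'num', 'week'])['val'].unstack(fill_value=0)
```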
