I have a dataframe created by the following code:
dfHubR2I=dfHubPV2.loc[dfHubPV2['Ind'].dt.year == year, :].groupby(['SHOP_CODE', dfHubPV2['Ind'].dt.month])['R2I'].agg(['median']).fillna('-')
dfHubR2I=dfHubR2I['median'].unstack('SHOP_CODE')
dfHubR2I=dfHubR2I.iloc[:date.month-1]
dfHubR2I
It looks like this:
shop code   A   B   C   D  All Shops
ind
1          23  34  23  56         34
2          13  23  45  47         34
3          56  67  42  85         57
4           3   3   2   6         46
where ind is months and the letters are different shops
I then got the median across all the shops for each month from this code:
dfHubR2Imonthallshops=dfHubPV2.loc[dfHubPV2['Ind'].dt.year == year, :].groupby([dfHubPV2['Ind'].dt.month])['R2I'].agg(['median']).fillna('-')
dfHubR2Imonthallshops=dfHubR2Imonthallshops.rename(columns={'median':'All Shops'})
dfHubR2Imonthallshops=dfHubR2Imonthallshops.iloc[:date.month-1]
dfHubR2Imonthallshops
which looks like this:
        A  B  C  D  All shops
median  2  3  4  5          2
And I need to append it onto the bigger dataframe as a row, but when I try to use pd.concat I get the error InvalidIndexError: Reindexing only valid with uniquely valued Index objects.
I'm assuming it's because the larger dataframe has 2 levels, but I'm not sure how to go about getting my final desired result:
shop code   A   B   C   D  All shops
ind
1          23  34  23  56         34
2          13  23  45  47         34
3          56  67  42  85         57
4           3   3   2   6         46
YTD         2   3   4   5          2
Have you tried doing it with an assignment?
dfHubR2I.loc['YTD', :] = dfHubR2Imonthallshops.loc['median', :]
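If you'd rather keep pd.concat, one thing worth trying (a rough sketch, assuming dfHubR2Imonthallshops really has the same columns as dfHubR2I, as in your printed output) is to relabel its single 'median' row before concatenating, so the appended row already carries the label you want:
# Hypothetical sketch: relabel the row, then append it.
ytd_row = dfHubR2Imonthallshops.rename(index={'median': 'YTD'})
dfHubR2I = pd.concat([dfHubR2I, ytd_row])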
Eleonora
I asked something similar yesterday but I had to rephrase the question and change the dataframes that I'm using. So here is my question again:
I have a dataframe called df_location. In this dataframe I have duplicated ids because each id has a timestamp.
location = {'location_id': [1,1,1,1,2,2,2,3,3,3,4,5,6,7,8,9,10],
'temperature_value':[20,21,22,23,24,25,27,28,29,30,31,32,33,34,35,36,37],
'humidity_value':[60,61,62,63,64,65,66,67,68,69,70,71,72,73,74,75,76]}
df_location = pd.DataFrame(location)
I have another dataframe called df_islands:
islands = {'island_id':[10,20,30,40,50,60],
'list_of_locations':[[1],[2,3],[4,5],[6,7,8],[9],[10]]}
df_islands = pd.DataFrame(islands)
What I am trying to achieve is to map the values of list_of_locations to location_id. If the values match, then the island_id for that location should go into a new column in df_location.
(Note: I don't want to remove any duplicated ids; I need to keep them as they are.)
Resulting dataframe:
final_dataframe = {'location_id': [1,1,1,1,2,2,2,3,3,3,4,5,6,7,8,9,10],
'temperature_value': [20,21,22,23,24,25,27,28,29,30,31,32,33,34,35,36,37],
'humidity_value':[60,61,62,63,64,65,66,67,68,69,70,71,72,73,74,75,76],
'island_id':[10,10,10,10,20,20,20,20,20,20,30,30,40,40,40,50,60]}
df_final_dataframe = pd.DataFrame(final_dataframe)
This is just a sample of the dataframe that I have. What I actually have is a dataframe of about 13,000,000 rows and 4 columns. How can this be achieved in an efficient way? Is there a pythonic way to do it? I tried using for loops, but it takes too long and still didn't work. I would really appreciate it if someone could give me a solution to this problem.
Here's a solution:
island_lookup = df_islands.explode("list_of_locations").rename(columns = {"list_of_locations": "location"})
pd.merge(df_location, island_lookup, left_on="location_id", right_on="location").drop("location", axis=1)
The output is:
location_id temperature_value humidity_value island_id
0 1 20 60 10
1 1 21 61 10
2 1 22 62 10
3 1 23 63 10
4 2 24 64 20
5 2 25 65 20
6 2 27 66 20
7 3 28 67 20
8 3 29 68 20
9 3 30 69 20
10 4 31 70 30
11 5 32 71 30
12 6 33 72 40
13 7 34 73 40
14 8 35 74 40
15 9 36 75 50
16 10 37 76 60
If some of the locations don't have a matching island_id, but you'd still like to see them in the results (with island_id NaN), use how="left" in the merge statement, as in:
island_lookup = df_islands.explode("list_of_locations").rename(columns={"list_of_locations": "location"})
pd.merge(df_location, island_lookup,
         left_on="location_id",
         right_on="location",
         how="left").drop("location", axis=1)
The result would be (note location_id 12 on row 3):
location_id temperature_value humidity_value island_id
0 1 20 60 10.0
1 1 21 61 10.0
2 1 22 62 10.0
3 12 23 63 NaN
4 2 24 64 20.0
5 2 25 65 20.0
6 2 27 66 20.0
...
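If memory becomes an issue with the merge on ~13,000,000 rows, an alternative sketch (assuming each location_id belongs to exactly one island, as in your sample) is to build a plain dict once and map it onto the column:
# Hypothetical alternative: build a {location_id: island_id} lookup and map it.
island_lookup = df_islands.explode("list_of_locations")
location_to_island = dict(zip(island_lookup["list_of_locations"], island_lookup["island_id"]))
df_location["island_id"] = df_location["location_id"].map(location_to_island)
Unmatched locations simply come out as NaN, the same as with the how="left" merge.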
I have imported this file as a DataFrame in pandas. The left-most column is time (7 am to 9:15 am); rows show traffic volume at the intersection in 15-minute increments. How do I find the peak hour, i.e. the hour with the most volume? To get the hourly volumes, I have to add 4 rows.
I am a newbie with Python and any help is appreciated.
import pandas as pd
f_path ="C:/Users/reggi/Dropbox/1. 2020/6. Programming Python/Text Files/TMC118txt.txt"
df = pd.read_csv(f_path, index_col=0, sep='\s+')
Sample of the data file below: the first column is time in 15-minute increments, and the first row is the traffic count by movement.
NBL NBT NBR SBL SBT SBR EBL EBT EBR WBL WBT WBR
715 8 3 12 1 1 0 4 93 18 36 68 4
730 16 5 20 5 2 1 0 135 12 39 128 3
745 9 1 29 6 2 3 4 169 21 28 163 6
800 10 2 33 4 0 4 4 147 8 34 174 6
815 11 1 30 1 4 3 4 93 10 28 140 8
My approach would be to move the time to a column:
df.reset_index(inplace=True)
Then I would create a new column for the hour and one for the minutes (converting the time values to strings so they can be sliced):
df['hour'] = df['index'].astype(str).str[:-2]
df['minute'] = df['index'].astype(str).str[-2:]
Then you could do a groupby on hour and sum the traffic movements, sort, etc.
hourly = df.groupby(by='hour').sum()
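Note that grouping by clock hour gives 7:00-7:59, 8:00-8:59, and so on. If by peak hour you mean the busiest 60-minute window starting at any 15-minute mark, a rolling sum over 4 rows is another option. A rough sketch, assuming df is the frame as originally read (time still in the index):
# Hypothetical sketch: total volume per 15-minute row, then a rolling
# sum over 4 consecutive rows (one hour ending at that row's time).
row_totals = df.sum(axis=1)
hourly_totals = row_totals.rolling(window=4).sum()
peak_hour_end = hourly_totals.idxmax()   # time at which the busiest hour ends
peak_volume = hourly_totals.max()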
I would like to sum the frequencies over multiple columns with pandas. The number of columns can vary between 2 and 15. Here is an example with just 3 columns:
code1 code2 code3
27 5 56
534 27 78
27 312 55
89 312 27
And I would like to have the following result:
code frequency
5 1
27 4
55 1
56 1
78 1
89 1
312 2
534 1
Counting values inside one column is not the problem; I just need the total frequency of each value across the whole dataframe, no matter how many columns there are.
You could stack and take the value_counts on the resulting series:
df.stack().value_counts().sort_index()
5 1
27 4
55 1
56 1
78 1
89 1
312 2
534 1
dtype: int64
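If you prefer the output shaped like your code/frequency table rather than a Series, one possible follow-up (using rename_axis and reset_index) is:
# Optional: turn the Series into a two-column frame matching the desired layout.
freq = df.stack().value_counts().sort_index()
freq = freq.rename_axis('code').reset_index(name='frequency')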
I'm working on an ML project for a class. I'm currently cleaning the data and I encountered a problem. I basically have a column (identified as dtype object) that holds ratings of a certain aspect of a hotel. When I checked what the values of this column were and how often they appeared, I noticed that there are some wrong values in it (as you can see below, instead of ratings, some rows have a date as a value):
rating.value_counts()
100 527
98 229
97 172
99 163
96 150
95 127
93 100
90 94
94 93
80 65
92 55
91 39
88 35
89 32
87 31
85 25
86 17
84 12
60 12
83 8
70 5
73 5
82 4
78 3
67 3
2018-11-11 3
20 2
81 2
2018-11-03 2
40 2
79 2
75 2
2018-10-26 2
2 1
2018-08-30 1
2018-09-03 1
2015-09-05 1
55 1
2018-10-12 1
2018-05-11 1
2018-11-14 1
2018-09-15 1
2018-04-07 1
2018-08-16 1
71 1
2018-09-18 1
2018-11-05 1
2018-02-04 1
NaN 1
What I wanted to do was to replace all the values that look like dates with NaN so I can later fill them with appropriate values. Is there a good way to do this other than selecting each different date one by one and replacing it with a NaN? Is there a way to select similar values (in this case all the dates that start in the same way, 2018) and replace them all?
Thank you for taking the time to read this!!
There are multiple options to clean this data.
Option 1: The rating column is of object type; search the strings for the presence of '-' and replace those values with np.nan.
df.loc[df['rating'].str.contains('-', na = False), 'rating'] = np.nan
Option 2: Convert the column to numeric, which will coerce the dates to NaN.
df['rating'] = pd.to_numeric(df['rating'], errors = 'coerce')
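If searching for '-' feels too broad (it would also catch things like negative numbers stored as strings), a stricter variant is to match a full date pattern instead; a sketch assuming the same df and that numpy is imported as np:
# Hypothetical variant: flag only values that look like YYYY-MM-DD dates.
is_date = df['rating'].astype(str).str.match(r'^\d{4}-\d{2}-\d{2}$', na=False)
df.loc[is_date, 'rating'] = np.nan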