I am facing a weird scenario.
I have a data frame holding the 3 largest scores per unique id, like this:
id  rid  code  score
1   9    67    43
1   8    87    22
1   4    32    20
2   3    56    43
3   10   22    100
3   5    67    50
Here the id column repeats across rows, but each row is different.
I want to make my data frame like this:
id  first_code  second_code  third_code
1   67          87           32
2   56          none         none
3   22          67           none
So I have built my dataframe so that it shows the top 3 highest scores; if there are not 3 values, I take the top 2 or the single available score. Depending on the score value, I want to re-arrange the code column into three different columns: first_code represents the code with the highest score, second_code the second highest, and third_code the third highest. If a value is not found, I will leave it blank.
Kindly help me to solve this.
Use GroupBy.cumcount as a counter, create a MultiIndex, and reshape with Series.unstack:
df = df.set_index(['id', df.groupby('id').cumcount()])['code'].unstack()
df.columns = ['first_code', 'second_code', 'third_code']
df = df.reset_index()
print(df)
   id  first_code  second_code  third_code
0   1        67.0         87.0        32.0
1   2        56.0          NaN         NaN
2   3        22.0         67.0         NaN
Btw, cumcount should also be used in the previous step that filters the top 3 values per id.
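A minimal sketch of that filtering step, using the sample data plus one extra low-scoring row (rid 2, code 11, score 5 is hypothetical, added only so the filter has something to drop):

import pandas as pd

raw = pd.DataFrame({'id':    [1, 1, 1, 1, 2, 3, 3],
                    'rid':   [9, 8, 4, 2, 3, 10, 5],
                    'code':  [67, 87, 32, 11, 56, 22, 67],
                    'score': [43, 22, 20, 5, 43, 100, 50]})

# sort so the highest scores come first within each id, then keep the
# first three rows per id via the 0-based counter from cumcount
raw = raw.sort_values(['id', 'score'], ascending=[True, False])
top3 = raw[raw.groupby('id').cumcount() < 3]
print(top3)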
I have a dataframe with data from an ecommerce panel.
It has orders and returns mixed together.
Each row has orderID - it's the same number for normal orders and for corresponding returns that come back from customers.
My data looks like this:
orderID  Shop  Revenue  Note
44       0     -32      Return
45       0     -100     Return
44       1     14
45       3     20       Something else
46       2     50
47       1     80       Something
48       2     222
For each return I want to find the 'Shop' column value that corresponds to the original order.
For example: 'orderID' == 44 comes twice: once as a return (with 'Shop' == 0) and once as a normal order (with 'Shop' == 1).
I want to replace all the 0 values in the 'Shop' column with the Shop values from the corresponding normal orders.
My desired output looks like this:
orderID  Shop  Revenue  Note
44       1     -32      Return
45       3     -100     Return
44       1     14
45       3     20       Something else
46       2     50
47       1     80       Something
48       2     222
I know how to do it in Google Sheets (first I filter the table, removing the 'Shop' == 0 values, and then VLOOKUP the numbers in that filtered range).
I know how to filter this table using Pandas, but I don't know how to write the lookup itself.
I assume that I will need to create a temporary column first, where I store both types of values: copied for normal orders, and looked up for returns.
The original dataframe is 1,000,000+ rows.
My data in .csv is available here:
https://docs.google.com/spreadsheets/d/e/2PACX-1vQAJ4tMc_Bcvv-4FsUy3E7sG0m9hm-nLTVLj-LwlSEns-YJ1pbq6gSKp5mj5lZqRI2EgHOsOutwnn1I/pub?gid=0&single=true&output=csv
Thank you for any advice!
IIUC, using map:
m = df.query('Shop != 0').set_index('orderID')['Shop']
df['Shop'] = df['orderID'].map(m)
print(df)
Output:
orderID Shop Revenue Note
0 44 1 -32 Return
1 45 3 -100 Return
2 44 1 14 NaN
3 45 3 20 Something else
4 46 2 50 NaN
5 47 1 80 Something
6 48 2 222 NaN
Create a pd.Series by using query to filter out the zero shops, then set_index and map the shops to orderID.
This works if there is a 1-to-1 shop-to-order mapping. If you have multiple shops per order, then you'll need logic to determine which shop is valid.
If the same order appears more than once with the same shop, then you need to drop_duplicates first.
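A minimal sketch of that safeguard on the sample data (the drop_duplicates call is the only addition; it keeps the first non-zero shop per orderID so the lookup index stays unique):

import pandas as pd

df = pd.DataFrame({'orderID': [44, 45, 44, 45, 46, 47, 48],
                   'Shop':    [0, 0, 1, 3, 2, 1, 2],
                   'Revenue': [-32, -100, 14, 20, 50, 80, 222],
                   'Note':    ['Return', 'Return', None, 'Something else',
                               None, 'Something', None]})

# a non-unique lookup index would make map() raise, so keep one row per orderID
m = (df.query('Shop != 0')
       .drop_duplicates('orderID')
       .set_index('orderID')['Shop'])
df['Shop'] = df['orderID'].map(m)
print(df)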
I have a column named volume in a pandas data frame, and I want to look back at the previous 5 volumes from the current row and find the 40th percentile.
The volume data is as follows:
1200
3400
5000
2300
4502
3420
5670
5400
4320
7890
8790
For the first 5 values we don't have enough data to look back, but from the 6th value (3420) we should find the 40th percentile of the previous 5 volumes (1200, 3400, 5000, 2300, 4502), and keep doing this for the rest of the data, taking the previous 5 values before each current value.
Not sure if I understand correctly, since there is no MCVE.
However, it sounds like you want a rolling quantile:
>>> s.rolling(5).quantile(0.4)
0 NaN
1 NaN
2 NaN
3 NaN
4 2960.0
5 3412.0
6 4069.2
7 4069.2
8 4429.2
9 4968.0
10 5562.0
dtype: float64
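Note that rolling(5) includes the current row in its window. If the quantile should cover only the five values before each row, as the question describes, shifting the result down by one lines it up (a sketch, assuming the volumes are in a Series s):

import pandas as pd

s = pd.Series([1200, 3400, 5000, 2300, 4502, 3420,
               5670, 5400, 4320, 7890, 8790])

# shift(1) moves each window's result down one row, so row 5 (value 3420)
# gets the 40th percentile of rows 0-4: 1200, 3400, 5000, 2300, 4502
prev5_q40 = s.rolling(5).quantile(0.4).shift(1)
print(prev5_q40)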
I have a dataframe with recordings of statistics in multiple columns.
I have a list of the column names: stat_columns = ['Height', 'Speed'].
I want to combine the data to get one row per id.
The data comes sorted with the newest records on the top. I want the most recent data, so I must use the first value of each column, by id.
My dataframe looks like this:
Index      id  Height  Speed
0      100007            8.3
1      100007      54
2      100007            8.6
3      100007      52
4      100035      39
5      100014      44
6      100035            5.6
And I want it to look like this:
Index      id  Height  Speed
0      100007      54    8.3
1      100014      44
2      100035      39    5.6
I have tried a simple groupby myself:
df_stats = df_path.groupby(['id'], as_index=False).first()
But this seems to give me only the values from the first row found per id, empty cells included.
Your solution works for me; maybe it is necessary to replace the empty values with NaNs first:
import numpy as np

df_stats = df_path.replace('', np.nan).groupby('id', as_index=False).first()
print(df_stats)
id Index Height Speed
0 100007 0 54.0 8.3
1 100014 5 44.0 NaN
2 100035 4 39.0 5.6
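A small reproduction of why the replace step matters, assuming the blanks in the sample are empty strings: GroupBy.first skips NaN but treats '' as a perfectly good first value.

import numpy as np
import pandas as pd

df_path = pd.DataFrame({'id':     [100007, 100007, 100007, 100007,
                                   100035, 100014, 100035],
                        'Height': ['', 54, '', 52, 39, 44, ''],
                        'Speed':  [8.3, '', 8.6, '', '', '', 5.6]})

# without the replace, first() would return the '' from the first row;
# with it, the empty cells become NaN and first() skips over them
df_stats = (df_path.replace('', np.nan)
                   .groupby('id', as_index=False)
                   .first())
print(df_stats)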
I have the following sample data frame:
id category time
43 S 8
22 I 10
15 T 350
18 L 46
I want to apply the following logic:
1) If the category value equals "T", create a new column called "time_2" where the "time" value is divided by 24.
2) If the category value equals "L", create a new column called "time_2" where the "time" value is divided by 3.5.
3) Otherwise, take the existing "time" value (categories S or I).
Below is my desired output table:
id category time time_2
43 S 8 8
22 I 10 10
15 T 350 14.58333333
18 L 46 13.14285714
I've tried using pd.np.where to get the above to work, but am confused by the syntax.
You can use map for the rules:
In [1066]: df['time_2'] = df.time / df.category.map({'T': 24, 'L': 3.5}).fillna(1)
In [1067]: df
Out[1067]:
id category time time_2
0 43 S 8 8.000000
1 22 I 10 10.000000
2 15 T 350 14.583333
3 18 L 46 13.142857
You can use np.select. This is a good alternative to nested np.where logic.
conditions = [df['category'] == 'T', df['category'] == 'L']
values = [df['time'] / 24, df['time'] / 3.5]
df['time_2'] = np.select(conditions, values, df['time'])
print(df)
id category time time_2
0 43 S 8 8.000000
1 22 I 10 10.000000
2 15 T 350 14.583333
3 18 L 46 13.142857
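For comparison, the nested np.where version that np.select replaces might look like this (a sketch over the same sample frame):

import numpy as np
import pandas as pd

df = pd.DataFrame({'id': [43, 22, 15, 18],
                   'category': ['S', 'I', 'T', 'L'],
                   'time': [8, 10, 350, 46]})

# each np.where handles one rule and its final argument is the fallback;
# the nesting gets hard to read as more categories are added
df['time_2'] = np.where(df['category'] == 'T', df['time'] / 24,
               np.where(df['category'] == 'L', df['time'] / 3.5,
                        df['time']))
print(df)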
I have a DataFrame of several trips that looks kind of like this:
TripID Lat Lon time delta_t
0 1 53.55 9.99 74 1
1 1 53.58 9.99 75 1
2 1 53.60 9.98 76 5
3 1 53.60 9.98 81 1
4 1 53.58 9.99 82 1
5 1 53.59 9.97 83 NaN
6 2 52.01 10.04 64 1
7 2 52.34 10.05 65 1
8 2 52.33 10.07 66 NaN
As you can see, I have records of location and time, which all belong to some trip, identified by a trip ID. I have also computed delta_t as the time that passes until the entry that follows in the trip. The last entry of each trip is assigned NaN as its delta_t.
Now I need to make sure that the time step of my records is the same value across all my data. I've gone with one time unit for this example. For the most part the trips do fulfill this condition, but every now and then I have a single record, such as record no. 2, within an otherwise fine trip, that doesn't.
That's why I want to simply split my trip into two trips at this point. That got me stuck, though. I can't seem to find a good way of doing it.
To consider each trip by itself, I was thinking of something like this:
for key, grp in df.groupby('TripID'):
# split trip at too long delta_t(s)
However, the actual splitting within the loop is what I don't know how to do. Basically, I need to assign a new trip ID to every entry from one large delta_t to the next (or the end of the trip), or have some sort of grouping operation that can group between those large delta_t.
I know this is quite a specific problem. I hope someone has an idea how to do this.
I think the new NaNs, which would then be needed, can be neglected at first and easily added later with this line (which I know only works for ascending trip IDs):
df.loc[df['TripID'].diff().shift(-1) > 0, 'delta_t'] = np.nan
IIUC, there is no need for a loop. The following creates a new column called new_TripID based on two conditions: that the original TripID changes from one row to the next, or that the difference in your time column is greater than 1:
df['new_TripID'] = ((df['TripID'] != df['TripID'].shift()) | (df.time.diff() > 1)).cumsum()
>>> df
TripID Lat Lon time delta_t new_TripID
0 1 53.55 9.99 74 1.0 1
1 1 53.58 9.99 75 1.0 1
2 1 53.60 9.98 76 5.0 1
3 1 53.60 9.98 81 1.0 2
4 1 53.58 9.99 82 1.0 2
5 1 53.59 9.97 83 NaN 2
6 2 52.01 10.04 64 1.0 3
7 2 52.34 10.05 65 1.0 3
8 2 52.33 10.07 66 NaN 3
Note that from your description and your data, it looks like you could really use groupby, and you should probably look into it for other manipulations. However, in the particular case you're asking for, it's unnecessary.
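If the delta_t column should then get its trailing NaN per new trip, as the question anticipates, one way (a sketch over the df above, after new_TripID has been added) is to blank the last row of every new_TripID group:

import numpy as np

# the last row of each new trip is wherever new_TripID changes on the
# next row (or the frame ends); reset its delta_t to NaN
last_row_of_trip = df['new_TripID'] != df['new_TripID'].shift(-1)
df.loc[last_row_of_trip, 'delta_t'] = np.nan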