How to summarize only certain columns of dataframe (python pandas) - python

I want to get new dataframe, in which I need to see sum of certain columns for rows which have same value of 'Index' columns (campaign_id and group_name in my example)
This is sample (example) of my dataframe:
campaign_id group_name clicks conversions cost label city_id
101 blue 40 15 100 foo 15
102 red 20 5 50 bar 12
102 red 7 3 25 bar 12
102 brown 5 0 18 bar 12
this is what I want to get:
campaign_id group_name clicks conversions cost label city_id
101 blue 40 15 100 foo 15
102 red 27 8 75 bar 12
102 brown 5 0 18 bar 12
I tried:
df = df.groupby(['campaign_id','group_name'])['clicks','conversions','cost'].sum().reset_index()
but this gives my only mentioned (summarized) columns (and Index), like this:
campaign_id group_name clicks conversions cost
101 blue 40 15 100
102 red 27 8 75
102 brown 5 0 18
I can try to add leftover columns after this operation, but I'm not sure if this will be optimal and adequate way to solve the problem
Is there simple way to summarize certain columns and leave other columns untouched (I don't care if they would differ, because in my data all leftover columns have same data for rows with same corresponding values in 'Index' columns (which are campaign_id and group_name)

When I finished my post I saw the answer right away: since all columns except those which I want to summarize - have matching values - I just need to take all those columns as part of multi-index, for this operation. Like this:
df = df.groupby(['campaign_id','group_name','lavel','city_id'])['clicks','conversions','cost'].sum().reset_index()
In this case I got exacty what I wanted.

Related

lookup within filtered range

I have a dataframe with data from ecommerce panel.
It has orders and returns mixed together.
Each row has orderID - it's the same number for normal orders and for corresponding returns that come back from customers.
My data looks like this:
orderID
Shop
Revenue
Note
44
0
-32
Return
45
0
-100
Return
44
1
14
45
3
20
Something else
46
2
50
47
1
80
Something
48
2
222
For each return I want to find a 'Shop' column value that corresponds to original order.
For example : 'orderID' == 44 comes twice: once as return (with 'Shop' == 0) and once as normal order (with 'Shop' == 1).
I want to replace all the 0 values with 'Shop' column with values from earlier orders
My desired output looks like this:
orderID
Shop
Revenue
Note
44
1
-32
Return
45
3
-100
Return
44
1
14
45
3
20
Something else
46
2
50
47
1
80
Something
48
2
222
I know how to do it in Google Sheets (first I filter table removing 'Shop'==0 values and then I vlookup for numbers in this filtered array)
I know how to filter this table using Pandas but I don't know how to write it.
I assume that I will need to write a temporary column first, where I store both types of values - for normal orders (just copied) and for returns.
Original dataframe is 1 000 000+ rows
My data in .csv is available here:
https://docs.google.com/spreadsheets/d/e/2PACX-1vQAJ4tMc_Bcvv-4FsUy3E7sG0m9hm-nLTVLj-LwlSEns-YJ1pbq6gSKp5mj5lZqRI2EgHOsOutwnn1I/pub?gid=0&single=true&output=csv
Thank you for any advice!
IIUC, using map:
m = df.query('Shop != 0').set_index('orderID')['Shop']
df['Shop'] = df['orderID'].map(m)
print(df)
Output:
orderID Shop Revenue Note
0 44 1 -32 Return
1 45 3 -100 Return
2 44 1 14 NaN
3 45 3 20 Something else
4 46 2 50 NaN
5 47 1 80 Something
6 48 2 222 NaN
Create a pd.Series using query to filter out zero shops then set_index and map shops to orderID​.
This works if there is a 1-1 shop to order mapping. If you have multiple shops per order, then you'll need logic to determine which shop valid.
If you have duplicate order to the same shop, then you need to drop_duplicates first.

Average for similar looking data in a column using Pandas

I'm working on a large data with more than 60K rows.
I have continuous measurement of current in a column. A code is measured for a second where the equipment measures it for 14/15/16/17 times, depending on the equipment speed and then the measurement moves to the next code and again measures for 14/15/16/17 times and so forth.
Every time measurement moves from one code to another, there is a jump of more than 0.15 on the current measurement
The data with top 48 rows is as follows,
Index
Curr(mA)
0
1.362476
1
1.341721
2
1.362477
3
1.362477
4
1.355560
5
1.348642
6
1.327886
7
1.341721
8
1.334804
9
1.334804
10
1.348641
11
1.362474
12
1.348644
13
1.355558
14
1.334805
15
1.362477
16
1.556172
17
1.542336
18
1.549252
19
1.528503
20
1.549254
21
1.528501
22
1.556173
23
1.556172
24
1.542334
25
1.556172
26
1.542336
27
1.542334
28
1.556170
29
1.535415
30
1.542334
31
1.729109
32
1.749863
33
1.749861
34
1.749861
35
1.736024
36
1.770619
37
1.742946
38
1.763699
39
1.749861
40
1.749861
41
1.763703
42
1.756781
43
1.742946
44
1.736026
45
1.756781
46
1.964308
47
1.957395
I want to write a script where similar data of 14/15/16/17 times is averaged in a separate column for each code measurement .. I have been thinking of doing this with pandas..
I want the data to look like
Index
Curr(mA)
0
1.34907
1
1.54556
2
1.74986
Need some help to get this done. Please help
First get the indexes of every row where there's a jump. Use Pandas' DataFrame.diff() to get the difference between the value in each row and the previous row, then check to see if it's greater than 0.15 with >. Use that to filter the dataframe index, and save the resulting indices (in the case of your sample data, three) in a variable.
indices = df.index[df['Curr(mA)'].diff() > 0.15]
The next steps depend on if there are more columns in the source dataframe that you want in the output, or if it's really just curr(mA) and index. In the latter case, you can use np.split() to cut the dataframe into a list of dataframes based on the indexes you just pulled. Then you can go ahead and average them in a list comphrension.
[df['Curr(mA)'].mean() for df in np.split(df, indices)]
> [1.3490729374999997, 1.5455638666666667, 1.7498627333333332, 1.9608515]
To get it to match your desired output above (same thing but as one-column dataframe rather than list) convert the list to pd.Series and reset_index().
pd.Series(
[df['Curr(mA)'].mean() for df in np.split(df, indices)]
).reset_index(drop=True)
index 0
0 0 1.349073
1 1 1.545564
2 2 1.749863
3 3 1.960851

Is it possible to do full text search in pandas dataframe

currently, I'm using pandas DataFrame.filter to filter the records of the dataset. if I give a word, I have got all the records that are matching with that word. now if I give two words that are present in the dataset but they are not in one record then I got an empty set. Is there any way in either pandas or other python modules that I can find something that can search multiple words ( not in one record )?
With python list comprehension, we can build a full-text search by mapping. in pandas DataFrame.filter uses indexing. is there any difference between mapping and indexing? if yes what is it and which can give a better performance?
CustomerID Genre Age AnnualIncome (k$) SpendingScore (1-100)
1 Male 19 15 39
2 Male 21 15 81
3 Female 20 16 6
4 Female 23 16 77
5 Female 31 17 40
pokemon[pokemon['CustomerID'].isin(['200','5'])]
Output:
CustomerID Genre Age AnnualIncome (k$) SpendingScore (1-100)
5 Female 31 17 40
200 Male 30 137 83
Name Qty.
0 Apple 3
1 Orange 4
2 Cake 5
Considering the above dataframe, if you want to find quantities of Apples and Oranges, you can do it like this:
result = df[df['Name'].isin(['Apple','Orange'])]
print (result)

Creating Multiple DataFrames from single DataFrame based on different values of single column

I have 3 days of time series data with multiple columns in it. I have one single DataFrame which includes all 3 days data. I want 3 different DataFrames based on Column name "Dates" i.e df["Dates"]
For Example:
Available Dataframe is: df
Expected Output: Based on Three different Dates
First DataFrame: df_23
Second DataFrame: df_24
Third DataFrame: df_25
I want to use these all three DataFrames separately for analysis.
I tried below code but I am not able to use those three dataframes (Rather I don't know how to use.) Can anybody help me to work my code better. Thank you.
Above code is just printing the DataFrame in three DataFrames that too not as expected as per code!
Unsure if your saving your variable into a csv or keep it in memory for further use,
you could pass each unique value into a dict and access by it's value :
print(df)
Cal Dates
0 85 23
1 75 23
2 74 23
3 97 23
4 54 24
5 10 24
6 77 24
7 95 24
8 58 25
9 53 25
10 44 25
11 94 25
d = {}
for frame, data in df.groupby('Dates'):
d[f'df{frame}'] = data
print(d['df23'])
Cal Dates
0 85 23
1 75 23
2 74 23
3 97 23
edit updated request :
for k,v in d.items():
i = (v['Cal'].loc[v['Cal'] > 70].count())
print(f"{v['Dates'].unique()[0]} --> {i} times")
23 --> 4 times
24 --> 2 times
25 --> 1 times

Performing calculations on subset of data frame subset in Python

user_id char_id rating
100 33 3
100 44 2
100 33 1
100 44 4
111 55 5
111 44 4
111 55 5
I have a data frame formatted similarly to this one and am trying to perform calculations on the ratings after they have been grouped by user_id and char_id.
It doesn't work but I need to do something like data.groupby('user_id', 'char_id') and then calculate the moving average for each char_id for each user_id. Any help? I have several thousand user_id so I can't go through and select one at a time for the calculations.
I need to somehow iterate over the user_id column and group all the same user_ids together, and save that format so that user_ids are separate. Then I need to do the same thing, iterating over char_id for each user_id subset and saving that format so that I can finally perform calculations on the subsets of subsets of ratings. So far all my attempts have been unsuccessful. The closest I came was:
def divide_by_user(data):
for user in data['user_id']:
user_data = data.where(data['user_id'] == user)
return user_data
There's no need to do this manually, creating and summarizing subsets like this is exactly what DataFrame.groupby() is for. Create your groupby:
grouped = df.groupby(['user_id', 'char_id'])
Then you can apply a function to each subset. It sounds like you want either rolling_mean or expanding_mean, both of which are already available in pandas:
df['cum_average'] = grouped['rating'].apply(pd.expanding_mean)
# New column now contains the average rating for each subset,
# including all values that have been seen so far.
df
Out[43]:
user_id char_id rating cum_average
0 100 33 3 3
1 100 44 2 2
2 100 33 1 2
3 100 44 4 3
4 111 55 5 5
5 111 44 4 4
6 111 55 5 5
Using a larger randomly-generated dataset to demonstrate rolling_window():
df = pd.DataFrame({
'user_id': [random.choice([100, 111, 112]) for n in range(n_rows)],
'char_id': [random.choice([33, 44, 55]) for n in range(n_rows)],
'rating': [random.choice([1, 2, 3, 4, 5]) for n in range(n_rows)]
})
grouped = df.groupby(['user_id', 'char_id'])
df['cum_average'] = grouped['rating'].apply(pd.rolling_mean, window=7)
# Output. The rolling average will be NaN until enough values have been
# observed for that subset, you can change this using the
# min_periods argument to rolling_window
df.sort(columns=['user_id', 'char_id'])
char_id rating user_id cum_average
3 33 1 100 NaN
19 33 2 100 NaN
22 33 5 100 NaN
34 33 1 100 NaN
47 33 1 100 NaN
48 33 1 100 NaN
49 33 1 100 1.714286
51 33 4 100 2.142857
55 33 2 100 2.142857
60 33 2 100 1.714286
66 33 2 100 1.857143
...
etc.
Try this:
"df" is the dataFrame
mean=pd.rolling_mean(df.rating, 7)

Categories