Summarize dataframe by extracting and grouping column with pandas - python

I would like to summarize a column from a csv file: essentially, extract the column data and match it up with the relevant ratings and counts.
Also, any idea how I should match the expected dataframe with the website image?
  website  rate
1     two     5
2     two     3
3     two     5
4     one     2
5     one     4
6     one     4
7     one     2
8     one     2
9     two     2

  website  rate(over 5)  count  appeal(rate over 5 / count >= 0.5)
  one      0             5      0
  two      2             4      1

You can use a groupby operation:
res = (df.assign(rate_over_5=df['rate'].ge(5))
         .groupby('website')
         .agg({'rate_over_5': ['sum', 'size']})
         .xs('rate_over_5', axis=1)
         .reset_index())
res['appeal'] = ((res['sum'] / res['size']) >= 0.5).astype(int)
print(res)
  website  sum  size  appeal
0     one  0.0     5       0
1     two  2.0     4       1
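For reference, here is a self-contained sketch of the same approach, with the sample frame reconstructed from the question; it uses pandas named aggregation instead of the dict/xs combination, which gives the same numbers with clearer column names:

```python
import pandas as pd

# sample data reconstructed from the question
df = pd.DataFrame({
    'website': ['two', 'two', 'two', 'one', 'one', 'one', 'one', 'one', 'two'],
    'rate': [5, 3, 5, 2, 4, 4, 2, 2, 2],
})

# count ratings of 5 or more, and total ratings, per website
res = (df.assign(rate_over_5=df['rate'].ge(5))
         .groupby('website')
         .agg(rate_over_5=('rate_over_5', 'sum'),
              count=('rate_over_5', 'size'))
         .reset_index())

# a site "appeals" when at least half of its ratings are 5 or more
res['appeal'] = ((res['rate_over_5'] / res['count']) >= 0.5).astype(int)
print(res)
```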


Merging tables and concatenate strings in specific table

I am trying to merge two different tables into one table.
The first table is a pandas DataFrame that contains the period years from 2000 through 2005, i.e. six observations:
time_horizon=pd.DataFrame(range(2000,2005+1))
Now I want to concatenate the text 'WT' with the time_horizon values:
time_horizon + str('WT')
After this, the next step should be to add specific values for these observations:
values = pd.DataFrame(range(1, 7))
In the end, I need to have a data frame like the one shown on the pic below.
The second step (the concatenation) does not work for me, so I can't implement the third step and build this table.
Can anybody help me make this table?
Here is a solution to the second step that failed for you:
'WT' + time_horizon.astype(str)
0
0 WT2000
1 WT2001
2 WT2002
3 WT2003
4 WT2004
5 WT2005
One way to solve it is:
# create a DataFrame with the year columns only (no rows yet)
df = pd.DataFrame(columns=range(2000, 2005 + 1)).add_prefix('WT')
# fill the first column with the range of values; this also creates the rows
df['WT2000'] = range(1, 7)
# forward fill across the columns
df.ffill(axis=1)
WT2000 WT2001 WT2002 WT2003 WT2004 WT2005
0 1 1 1 1 1 1
1 2 2 2 2 2 2
2 3 3 3 3 3 3
3 4 4 4 4 4 4
4 5 5 5 5 5 5
5 6 6 6 6 6 6
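If the goal is just that final table, it can also be built directly in one step, without ffill (a minimal sketch assuming the same 2000-2005 labels and 1-6 values):

```python
import pandas as pd

# one 'WT<year>' column per year, each holding the values 1..6
cols = ['WT' + str(year) for year in range(2000, 2006)]
df = pd.DataFrame({c: range(1, 7) for c in cols})
print(df)
```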

Sliding minimum value in a pandas column

I am working with a pandas dataframe where I have the following two columns: "personID" and "points". I would like to create a third variable ("localMin") which will store the minimum value of the column "points" at each point in the dataframe as compared with all previous values in the "points" column for each personID (see image below).
Does anyone have an idea how to achieve this most efficiently? I have approached this problem using shift() with different period sizes, but of course, shift is sensitive to variations in the sequence and doesn't always produce the output I would expect.
Thank you in advance!
Use groupby.cummin:
df['localMin'] = df.groupby('personID')['points'].cummin()
Example:
df = pd.DataFrame({'personID': list('AAAAAABBBBBB'),
                   'points': [3, 4, 2, 6, 1, 2, 4, 3, 1, 2, 6, 1]})
df['localMin'] = df.groupby('personID')['points'].cummin()
output:
personID points localMin
0 A 3 3
1 A 4 3
2 A 2 2
3 A 6 2
4 A 1 1
5 A 2 1
6 B 4 4
7 B 3 3
8 B 1 1
9 B 2 1
10 B 6 1
11 B 1 1

Having trouble with calculation on Dataframe

I am having trouble creating a couple of new calculated columns in my DataFrame. Here is what I'm looking for:
Original DF:
Col_IN Col_OUT
5 2
1 2
2 2
3 0
3 1
What I want is to add two columns. One is a 'running end of day total' that takes the net of the current day plus the total of the day before. The second column, 'Available Units', factors in the previous day's end total plus the incoming units. Desired result:
Desired DF:
Col_IN Available_Units Col_OUT End_Total
5 5 2 3
1 4 2 2
2 4 2 2
3 5 0 5
3 8 1 7
It's a weird one - anybody have an idea? Thanks.
For the End_Total you can use np.cumsum and for Available Units you can use shift.
import numpy as np
import pandas as pd

df = pd.DataFrame({
    'Col_IN': [5, 1, 2, 3, 3],
    'Col_OUT': [2, 2, 2, 0, 1]
})
df['End_Total'] = np.cumsum(df['Col_IN'] - df['Col_OUT'])
df['Available_Units'] = df['End_Total'].shift().fillna(0) + df['Col_IN']
print(df)
will output
Col_IN Col_OUT End_Total Available_Units
0 5 2 3 5.0
1 1 2 2 4.0
2 2 2 2 4.0
3 3 0 5 5.0
4 3 1 7 8.0
Running totals are also known as cumulative sums, for which pandas has the cumsum() function.
The end totals can be calculated through the cumulative sum of incoming minus the cumulative sum of outgoing:
df["End_Total"] = df["Col_IN"].cumsum() - df["Col_OUT"].cumsum()
The available units can be calculated in the same way, if you shift the outgoing column one down:
df["Available_Units"] = df["Col_IN"].cumsum() - df["Col_OUT"].shift(1).fillna(0).cumsum()
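Both variants produce the same numbers; here is a self-contained check on the sample data (the .fillna(0) supplies the zero starting total before the first day):

```python
import pandas as pd

df = pd.DataFrame({'Col_IN': [5, 1, 2, 3, 3],
                   'Col_OUT': [2, 2, 2, 0, 1]})

# running end-of-day total: cumulative net of incoming minus outgoing
df['End_Total'] = (df['Col_IN'] - df['Col_OUT']).cumsum()
# available units: previous day's end total plus today's incoming units
df['Available_Units'] = df['End_Total'].shift().fillna(0) + df['Col_IN']
print(df)
```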

Trying to group by, then sort a dataframe based on multiple values [duplicate]

Suppose I have pandas DataFrame like this:
df = pd.DataFrame({'id':[1,1,1,2,2,2,2,3,4], 'value':[1,2,3,1,2,3,4,1,1]})
which looks like:
id value
0 1 1
1 1 2
2 1 3
3 2 1
4 2 2
5 2 3
6 2 4
7 3 1
8 4 1
I want to get a new DataFrame with top 2 records for each id, like this:
id value
0 1 1
1 1 2
3 2 1
4 2 2
7 3 1
8 4 1
I can do it by numbering the records within each group after a groupby:
dfN = df.groupby('id').apply(lambda x:x['value'].reset_index()).reset_index()
which looks like:
id level_1 index value
0 1 0 0 1
1 1 1 1 2
2 1 2 2 3
3 2 0 3 1
4 2 1 4 2
5 2 2 5 3
6 2 3 6 4
7 3 0 7 1
8 4 0 8 1
then for the desired output:
dfN[dfN['level_1'] <= 1][['id', 'value']]
Output:
id value
0 1 1
1 1 2
3 2 1
4 2 2
7 3 1
8 4 1
But is there a more effective/elegant approach to do this? Also, is there a more elegant way to number the records within each group (like the SQL window function row_number())?
Did you try
df.groupby('id').head(2)
Output generated:
id value
id
1 0 1 1
1 1 2
2 3 2 1
4 2 2
3 7 3 1
4 8 4 1
(Keep in mind that you might need to order/sort before, depending on your data)
EDIT: As mentioned by the questioner, use
df.groupby('id').head(2).reset_index(drop=True)
to remove the MultiIndex and flatten the results:
id value
0 1 1
1 1 2
2 2 1
3 2 2
4 3 1
5 4 1
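Run end to end on the question's DataFrame, the flattened version looks like this (a quick check; in current pandas, head keeps the original row labels, which reset_index(drop=True) then renumbers):

```python
import pandas as pd

df = pd.DataFrame({'id': [1, 1, 1, 2, 2, 2, 2, 3, 4],
                   'value': [1, 2, 3, 1, 2, 3, 4, 1, 1]})

# first two rows of each id, with a flat 0..n-1 index
out = df.groupby('id').head(2).reset_index(drop=True)
print(out)
```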
Since 0.14.1, you can now do nlargest and nsmallest on a groupby object:
In [23]: df.groupby('id')['value'].nlargest(2)
Out[23]:
id
1 2 3
1 2
2 6 4
5 3
3 7 1
4 8 1
dtype: int64
There's a slight weirdness that you get the original index in there as well, but this might be really useful depending on what your original index was.
If you're not interested in it, you can do .reset_index(level=1, drop=True) to get rid of it altogether.
(Note: From 0.17.1 you'll be able to do this on a DataFrameGroupBy too but for now it only works with Series and SeriesGroupBy.)
Sometimes sorting the whole dataset ahead of time is very time consuming.
We can group first and take the top k rows of each group:
topk = 2
g = df.groupby(['id']).apply(lambda x: x.nlargest(topk, ['value'])).reset_index(drop=True)
df.groupby('id').apply(lambda x: x.sort_values(by='value', ascending=False).head(2).reset_index(drop=True))
Here sort_values with ascending=False behaves like nlargest, and ascending=True behaves like nsmallest.
The value passed to head is the same as the value given to nlargest: the number of rows to keep for each group.
The reset_index calls are optional.
This works for duplicated values.
If you have duplicated values among the top-n values and want only unique values, you can do it like this:
import pandas as pd
ifile = "https://raw.githubusercontent.com/bhishanpdl/Shared/master/data/twitter_employee.tsv"
df = pd.read_csv(ifile,delimiter='\t')
print(df.query("department == 'Audit'")[['id','first_name','last_name','department','salary']])
id first_name last_name department salary
24 12 Shandler Bing Audit 110000
25 14 Jason Tom Audit 100000
26 16 Celine Anston Audit 100000
27 15 Michale Jackson Audit 70000
If we do not remove duplicates, the top 3 salaries for the Audit department are 110k, 100k, and 100k.
If we want only non-duplicated salaries for each department, we can do this:
(df.groupby('department')['salary']
.apply(lambda ser: ser.drop_duplicates().nlargest(3))
.droplevel(level=1)
.sort_index()
.reset_index()
)
This gives
department salary
0 Audit 110000
1 Audit 100000
2 Audit 70000
3 Management 250000
4 Management 200000
5 Management 150000
6 Sales 220000
7 Sales 200000
8 Sales 150000
To get the first N rows of each group, another way is via groupby().nth[:N]. The outcome of this call is the same as groupby().head(N). For example, for the top-2 rows for each id, call:
N = 2
df1 = df.groupby('id', as_index=False).nth[:N]
To get the largest N values of each group, I suggest two approaches.
First sort by "id" and "value" (make sure to sort "id" in ascending order and "value" in descending order by using the ascending parameter appropriately) and then call groupby().nth[].
N = 2
df1 = df.sort_values(by=['id', 'value'], ascending=[True, False])
df1 = df1.groupby('id', as_index=False).nth[:N]
Another approach is to rank the values of each group and filter using these ranks.
# for the entire rows
N = 2
msk = df.groupby('id')['value'].rank(method='first', ascending=False) <= N
df1 = df[msk]
# for specific column rows
df1 = df.loc[msk, 'value']
Both of these are much faster than the groupby().apply() and groupby().nlargest() calls suggested in the other answers here (1, 2, 3). On a sample with 100k rows and 8000 groups, a %timeit test showed they were 24-150 times faster than those solutions.
Instead of slicing, you can also pass a list/tuple/range to a .nth() call:
df.groupby('id', as_index=False).nth([0,1])
# doesn't even have to be consecutive
# the following returns 1st and 3rd row of each id
df.groupby('id', as_index=False).nth([0,2])
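As a sanity check, the rank-based filter can be adapted to take the two smallest values per id (matching the question's expected output) and compared against groupby().head(2); on this already-sorted data the two agree:

```python
import pandas as pd

df = pd.DataFrame({'id': [1, 1, 1, 2, 2, 2, 2, 3, 4],
                   'value': [1, 2, 3, 1, 2, 3, 4, 1, 1]})

# keep rows whose within-group rank is <= 2 ('first' breaks ties by order)
msk = df.groupby('id')['value'].rank(method='first') <= 2
top2 = df[msk]
print(top2)
```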

In a Pandas dataframe, how can I extract the difference between the values on separate rows within the same column, conditional on a second column? [duplicate]

This question already has answers here:
Pandas groupby multiple fields then diff
(2 answers)
Closed 4 years ago.
This is part of a larger project, but I've broken my problem down into steps, so here's the first step. Take a Pandas dataframe, like this:
index  user  time
0      F     0
1      T     0
2      T     0
3      T     1
4      B     1
5      K     2
6      J     2
7      T     3
8      J     4
9      B     4
For each unique user, can I extract the difference between the values in column "time," but with some conditions?
So, for example, there are two instances of user J, and the "time" difference between these two instances is 2. Can I extract the difference, 2, between these two rows? Then if that user appears again, extract the difference between that row and the previous appearance of that user in the dataframe?
I believe you need DataFrameGroupBy.diff:
df['new'] = df.groupby('user')['time'].diff()
print(df)
user time new
0 F 0 NaN
1 T 0 NaN
2 T 0 0.0
3 T 1 1.0
4 B 1 NaN
5 K 2 NaN
6 J 2 NaN
7 T 3 2.0
8 J 4 2.0
9 B 4 3.0
I think np.where and pandas shift can also do this.
This subtracts between two consecutive times, but only if the users on adjacent rows are the same:
df['new'] = np.where(df['user'] == df['user'].shift(), df['time'] - df['time'].shift(), np.nan)
Note that this only compares adjacent rows, so unlike groupby.diff it misses a repeat appearance of a user that is not on the very next row.
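A minimal runnable check of the groupby.diff answer on the sample data (note it pairs each row with the same user's previous appearance anywhere above it, not only the adjacent row):

```python
import pandas as pd

# sample data from the question
df = pd.DataFrame({'user': list('FTTTBKJTJB'),
                   'time': [0, 0, 0, 1, 1, 2, 2, 3, 4, 4]})

# difference from the same user's previous time; NaN on a first appearance
df['new'] = df.groupby('user')['time'].diff()
print(df)
```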
