Creating a groupby column based on ids and category - python

I have the table below, where I need to create "Relevant" and "Non-Relevant" columns based on ID.
The table looks something like this:
+----+--------------+--------+
| ID | Experience   | Length |
+----+--------------+--------+
| 1  | Relevant     | 2      |
| 1  | Non-Relevant | 1      |
| 4  | Relevant     | 3      |
| 4  | Relevant     | 4      |
| 4  | Non-Relevant | 0      |
| 5  | Relevant     | 1      |
| 5  | Relevant     | 1      |
+----+--------------+--------+
This is the output I am trying to get
+----+----------+--------------+
| ID | Relevant | Non-Relevant |
+----+----------+--------------+
| 1  | 2        | 1            |
| 4  | 7        | 0            |
| 5  | 2        | 0            |
+----+----------+--------------+

import pandas as pd
# 'r' and 'n' stand in for Relevant and Non-Relevant
df = pd.DataFrame({'id': [1, 1, 4, 4, 4, 5, 5], 'exp': list('rnrrnrr'), 'len': [2, 1, 3, 4, 0, 1, 1]})
pd.pivot_table(df, index='id', values='len', columns='exp', aggfunc='sum', fill_value=0)
Documentation: https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.pivot_table.html
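Not part of the original answer, but the same call with the question's column names should reproduce the desired table directly; a minimal sketch (pivoted columns come out in alphabetical order, hence the final reorder):
import pandas as pd

df = pd.DataFrame({'ID': [1, 1, 4, 4, 4, 5, 5],
                   'Experience': ['Relevant', 'Non-Relevant', 'Relevant', 'Relevant',
                                  'Non-Relevant', 'Relevant', 'Relevant'],
                   'Length': [2, 1, 3, 4, 0, 1, 1]})
out = pd.pivot_table(df, index='ID', columns='Experience', values='Length',
                     aggfunc='sum', fill_value=0)
print(out[['Relevant', 'Non-Relevant']])  # put Relevant first, as in the desired output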

To create the dataframe:
ID = [1, 1, 4, 4, 4, 5, 5]
Experience = ['Relevant', 'Non-Relevant', 'Relevant', 'Relevant', 'Non-Relevant',
              'Relevant', 'Relevant']
length = [2, 1, 3, 4, 0, 1, 1]
dictionary = {'ID': ID,
              'Experience': Experience,
              'Length': length}
df = pd.DataFrame(dictionary)
To group it and then unstack:
df.groupby(by=['ID', 'Experience']).sum().unstack()['Length'].fillna(0)
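For the record, pd.crosstab is a close cousin of the groupby/unstack route; a minimal sketch on the same df (not from the original answer):
# Cross-tabulate ID vs Experience, summing Length; fillna(0) covers
# (ID, Experience) pairs that never occur
pd.crosstab(df['ID'], df['Experience'], values=df['Length'], aggfunc='sum').fillna(0)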

Related

Make a row-wise Conditional Column

I got this dataframe:
import numpy as np
import pandas as pd

Df = pd.DataFrame({'TIPOIDPRESTADOR': ['CC', 'NI', 'CE', 'RS'],
                   'Levels': [0, 1, np.nan, np.nan]})
| TIPOIDPRESTADOR | Levels |
| -------- | -------- |
| CC | 0 |
| NI | 1 |
| CE | NaN |
| RS | NaN |
and I want to make a loop so that, given the maximum value of the column 'Levels' (in this case 1), if the next row is NaN it gets assigned the maximum value of the column plus 1, and so on.
the desired output should be something like this:
Desired_Output = pd.DataFrame({'TIPOIDPRESTADOR': ['CC', 'NI', 'CE', 'RS'],
'Levels': [0, 1, 2, 3]
})
| TIPOIDPRESTADOR | Levels |
| -------- | -------- |
| CC | 0 |
| NI | 1 |
| CE | 2 |
| RS | 3 |
I was trying to use iterrows like this:
for row in Df.iterrows():
    Max_value = float(max(Df[["TIPOIDPRESTADOR"]]))
    Df['TIPOIDPRESTADOR'] = np.where(Df["TIPOIDPRESTADOR"].isna() == True, Max_value + 1, Df["TIPOIDPRESTADOR"])
    Max_value = Max_value + 1
but i'm getting something like this:
| TIPOIDPRESTADOR | Levels |
| -------- | -------- |
| CC | 0 |
| NI | 1 |
| CE | 2 |
| RS | 2 |
I know it's a simple task, but it's really got me struggling.
I would greatly appreciate your help.
You were performing operations on the TIPOIDPRESTADOR column rather than on Levels (I assume those were typos; otherwise you wouldn't have got your result). Also, when using np.where() in a loop, you most likely filled all the NaN values in the first iteration, leaving nothing to update afterwards.
Try this:
for i, row in Df.iterrows():
    if pd.isna(row['Levels']):
        Df.loc[i, 'Levels'] = Df['Levels'].max() + 1

Df
Output:
TIPOIDPRESTADOR Levels
0 CC 0.0
1 NI 1.0
2 CE 2.0
3 RS 3.0
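For what it's worth, the same fill works without a loop; a sketch using a cumulative count over the NaN mask, starting from the original Df (not part of the answer above):
# Number the NaN rows 1, 2, ... and add that to the current maximum,
# so consecutive NaNs become max+1, max+2, ...
mask = Df['Levels'].isna()
Df.loc[mask, 'Levels'] = Df['Levels'].max() + mask.cumsum()[mask]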

Pandas crosstab dataframe and setting the new columns as True/False/Null based on if they existed or not and based on another column

As the title states I want to pivot/crosstab my dataframe
Let's say I have a df that looks like this:
import numpy as np
import pandas as pd

df = pd.DataFrame({'ID': [0, 0, 1, 1, 1],
                   'REV': [0, 0, 1, 1, 1],
                   'GROUP': [1, 2, 1, 2, 3],
                   'APPR': [True, True, np.nan, np.nan, True]})
+----+-----+-------+------+
| ID | REV | GROUP | APPR |
+----+-----+-------+------+
| 0  | 0   | 1     | True |
| 0  | 0   | 2     | True |
| 1  | 1   | 1     | NULL |
| 1  | 1   | 2     | NULL |
| 1  | 1   | 3     | True |
+----+-----+-------+------+
I want to do some kind of pivot so my result of the table looks like
+----+-----+------+------+-------+
| ID | REV | 1    | 2    | 3     |
+----+-----+------+------+-------+
| 0  | 0   | True | True | False |
| 1  | 1   | NULL | NULL | True  |
+----+-----+------+------+-------+
Now the values from the GROUP column become their own columns. The value of each of those columns is True or NULL based on APPR; I want it to be False when the group didn't exist for that ID/REV combo.
similar question I've asked before, but I wasn't sure how to make this answer work with my new scenario:
Pandas pivot dataframe and setting the new columns as True/False based on if they existed or not
Hope that makes sense!
Have you tried to pivot?
pd.pivot(df, index=['ID','REV'], columns=['GROUP'], values='APPR').fillna(False).reset_index()
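Note that .fillna(False) also overwrites the NULLs that came from APPR itself, whereas the desired output keeps them. A sketch (not from the original answer; assumes pandas >= 1.1 for the list-valued index) that fills only the genuinely missing combinations:
# Pivot APPR; at this point NaNs from APPR and NaNs from missing
# (ID, REV, GROUP) combinations look identical
out = df.pivot(index=['ID', 'REV'], columns='GROUP', values='APPR')
# Mark every (ID, REV, GROUP) combination that actually existed, pivot
# that marker, and set only the truly missing cells to False
exists = df.assign(present=True).pivot(index=['ID', 'REV'], columns='GROUP', values='present').notna()
out = out.where(exists, False).reset_index()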

The total number of times User U bought Item T out of the 5 recent months

I have the following dataframe:
| order_id | item_id | user_id | order_date |
| -------- | -------------- | -------- | ----------- |
| 383706 | 1 | A | 2012-09-11 |
| 354776 | 2 | A | 2018-05-19 |
| 33333 | 2 | A | 2014-01-19 |
| 383706 | 3 | B | 2013-12-10 |
and I want to calculate the following variable: total_buy_m5(User U, Item T) is the total number of times User U bought Item T in the 5 most recent months (between 2019-07-01 and 2019-12-01).
I want this final table:
| user_id | item_id | count |
| -------------- | -------- | -------- |
| A | 1 | 100 |
| A | 2 | 1 |
| A | 3 | 12 |
| B | 1 | 5 |
Assuming that your order_date column is of datetime type, you can filter like this. If not, you have to convert that column to the datetime type first.
df = df[(df['user_id'] == U) & (df['item_id'] == T) & ((df['order_date'] >= start_date) & (df['order_date'] <= end_date))]
In order to get your final desired table, you can use a groupby.
import pandas as pd
from datetime import datetime
# Creating some sample data to illustrate the example
df = pd.DataFrame(columns=['user_id', 'item_id', 'order_date'], data=[['a', 1, datetime(2020, 1, 1)], ['a', 1, datetime(2020, 1, 2)]])
# Filter the DataFrame based on your function arguments
df = df[(df['user_id'] == 'a') &
(df['item_id'] == 1) &
((df['order_date'] >= '2019-02-01') & (df['order_date'] <= '2020-02-02'))]
# Now do a groupby and rename the order_date column to count
df2 = df.groupby(['user_id', 'item_id']).count().reset_index()
df3 = df2.rename(columns={'order_date': 'count'})
print(df3)
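If you want the whole final table (every user/item pair) rather than one (U, T) count, one option is to filter by date only and group on both keys; a sketch, reusing the sample df and date window from above:
# Keep only orders inside the window, then count rows per (user_id, item_id)
recent = df[(df['order_date'] >= '2019-02-01') & (df['order_date'] <= '2020-02-02')]
counts = recent.groupby(['user_id', 'item_id']).size().reset_index(name='count')
print(counts)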

Pandas keeping certain rows based on strings in other rows

I have the following dataframe
+-------+------------+
| index | keep       |
+-------+------------+
| 0     | not useful |
| 1     | start_1    |
| 2     | useful     |
| 3     | end_1      |
| 4     | not useful |
| 5     | start_2    |
| 6     | useful     |
| 7     | useful     |
| 8     | end_2      |
+-------+------------+
There are two pairs of strings (start_1/end_1 and start_2/end_2) that indicate that the rows between them are the only ones relevant in the data. Hence, for the dataframe below, the output would be composed only of the rows at index 2, 6, and 7 (since 2 is between start_1 and end_1, and 6 and 7 are between start_2 and end_2).
import pandas as pd

d = {'keep': ["not useful", "start_1", "useful", "end_1", "not useful", "start_2", "useful", "useful", "end_2"]}
df = pd.DataFrame(data=d)
What is the most Pythonic/Pandas approach to this problem?
Thanks
Here's one way to do that (in a couple of steps, for clarity). There might be others:
df["sections"] = 0
df.loc[df.keep.str.startswith("start"), "sections"] = 1
df.loc[df.keep.str.startswith("end"), "sections"] = -1
df["in_section"] = df.sections.cumsum()
res = df[(df.in_section == 1) & ~df.keep.str.startswith("start")]
Output:
index keep sections in_section
2 2 useful 0 1
6 6 useful 0 1
7 7 useful 0 1
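The same bookkeeping also fits in one pass without adding helper columns to df; a sketch of the identical cumsum trick (not from the original answer):
# Running count of opened sections minus closed ones: 1 inside a section, 0 outside
s = df['keep']
in_block = s.str.startswith('start').cumsum() - s.str.startswith('end').cumsum()
res = df[(in_block == 1) & ~s.str.startswith('start')]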

How to convert dict to spark map output

I'm working with spark and python. I would like to transform my input dataset.
My input dataset (RDD)
--------------------------------------------------------------
| id | var                                                    |
--------------------------------------------------------------
| 1  | "[{index: 1, value: 200}, {index: 2, value: A}, ...]"  |
| 2  | "[{index: 1, value: 140}, {index: 2, value: C}, ...]"  |
| .. | ...                                                    |
--------------------------------------------------------------
I would like to have this DataFrame (output dataset)
-----------------------
| id | index | value |
-----------------------
| 1  | 1     | 200   |
| 1  | 2     | A     |
| 1  | ...   | ...   |
| 2  | 1     | 140   |
| 2  | 2     | C     |
| .. | ...   | ...   |
-----------------------
I create a map function
def process(row):
    # build one output dict per element of the 'var' array
    rows = []
    for item in row['var']:
        rows.append({'id': row['id'],
                     'index': item['index'],
                     'value': item['value']})
    return rows
I would like to map my process function like this:
output_rdd = input_rdd.map(process)
Is it possible to do it this way (or is there a simpler way)?
I found the solution:
output_rdd = input_rdd.map(process).flatMap(lambda x: x)
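If the input is (or can be parsed into) a DataFrame whose var column is an array of structs, explode reaches the same shape without an RDD pass. A sketch, where input_df and the struct fields index and value are assumptions, not from the original thread:
from pyspark.sql import functions as F

# Explode the array so each struct becomes its own row, then lift the
# struct fields into top-level columns (input_df is a hypothetical
# DataFrame with columns id and var)
output_df = (input_df
             .select('id', F.explode('var').alias('item'))
             .select('id',
                     F.col('item.index').alias('index'),
                     F.col('item.value').alias('value')))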
