Better way to split column by value into new columns? [duplicate] - python

This question already has answers here:
How can I pivot a dataframe?
(5 answers)
Closed 2 years ago.
I have the following dataframe:
data = {'ID': [1, 1, 2, 2, 3, 3, 4, 4],
        'User': ['type 1', 'type 2', 'type 1', 'type 2',
                 'type 1', 'type 2', 'type 1', 'type 2'],
        'Rating': [4, 5, 3, 2, 1, 5, 4, 3]}
df = pd.DataFrame(data)
print(df)
I want to create two new columns based on "User", one for type 1 and another for type 2. I suspect I have to create a new dataframe:
Type_1 = df[df.User == 'type 1']
Type_2 = df[df.User == 'type 2']
df1 = pd.merge(Type_1, Type_2, how="left", on=['ID'])
print(df1)
Is there a quicker way of accomplishing this?

IIUC,
df.set_index(['ID','User'])['Rating'].unstack('User').reset_index()
OR
df.pivot(index='ID', columns='User', values='Rating').reset_index()
Output:
User ID type 1 type 2
0 1 4 5
1 2 3 2
2 3 1 5
3 4 4 3
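As a related sketch (not part of the accepted answer): pivot raises a ValueError if the same (ID, User) pair appears more than once, whereas pivot_table aggregates the duplicates. The mean aggregation below is an assumption; pick whatever reduction fits your data:

```python
import pandas as pd

# Hypothetical smaller sample; if (ID, User) pairs can repeat,
# pivot_table aggregates them instead of raising like pivot does.
data = {'ID': [1, 1, 2, 2],
        'User': ['type 1', 'type 2', 'type 1', 'type 2'],
        'Rating': [4, 5, 3, 2]}
df = pd.DataFrame(data)

out = df.pivot_table(index='ID', columns='User', values='Rating',
                     aggfunc='mean').reset_index()
print(out)
```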


Removing rows from a pandas dataframe if a column contains a particular word alone

I am working on a pandas dataframe from which I have to remove rows if a column contains a particular word alone. For example,
df = pd.DataFrame({'team': ['Team 1', 'Team 1 abc', 'Team 2',
                            'Team 3', 'Team 2', 'Team 3'],
                   'Subject': ['Math', 'Science', 'Science',
                               'Math', 'Science', 'Math'],
                   'points': [10, 8, 10, 6, 6, 5]})
I tried to remove the rows that contain Team 1 alone. For that I tried
df = df[df["team"].str.contains("Team 1") == False]
but this also drops "Team 1 abc", because str.contains matches substrings. The dataframe I need keeps that row, and the row numbers should also be in order.
Just use != instead of .str.contains:
df = df[df["team"] != "Team 1"]
Output:
>>> df
team Subject points
1 Team 1 abc Science 8
2 Team 2 Science 10
3 Team 3 Math 6
4 Team 2 Science 6
5 Team 3 Math 5
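Since the question also asks for the row numbers to be in order, chaining reset_index(drop=True) renumbers the surviving rows from 0; a runnable sketch:

```python
import pandas as pd

df = pd.DataFrame({'team': ['Team 1', 'Team 1 abc', 'Team 2',
                            'Team 3', 'Team 2', 'Team 3'],
                   'Subject': ['Math', 'Science', 'Science',
                               'Math', 'Science', 'Math'],
                   'points': [10, 8, 10, 6, 6, 5]})

# Keep rows whose team is not exactly "Team 1", then renumber from 0
df = df[df['team'] != 'Team 1'].reset_index(drop=True)
print(df)
```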

Transforming my concatenated tuple into a pandas DataFrame [duplicate]

This question already has answers here:
How to unnest (explode) a column in a pandas DataFrame, into multiple rows
(16 answers)
Closed last year.
I wrote the following code (not sure if this is the best approach); just know the data I have is divided into two separate lists, in the correct order: z[0] is the step and z[1] is its list of user ids.
for i, z in enumerate(zip(steps, userids_list)):
    print(z)
This results in the following tuple values:
# SAMPLE
(('Step 1 string', [list of userid of that step]),
('Step 2 string', [list of userid of that step]),
('Step 3 string', [list of userid of that step]),
('Step n string', [list of userids of that step]))
My goal is to transform that style of data into the following pandas DataFrame.
Column 1 Column 2
Step 1 User id
Step 1 User id
Step 2 User id
Step 2 User id
Step 3 User id
Step 3 User id
Unfortunately I couldn't find a way to transform the data into what I want. Any ideas on what I could try to do?
explode is perfect for this. Load your data into a dataframe and then explode the column containing the lists:
df = pd.DataFrame({
    'Column 1': Z[0],
    'Column 2': Z[1],
})
df = df.explode('Column 2')
For example:
steps = ['Step 1', 'Step 2', 'Step 3']
user_ids = [
    ['user a', 'user b'],
    ['user a', 'user b', 'user c'],
    ['user c'],
]
df = pd.DataFrame({
    'step': steps,
    'user_id': user_ids,
})
df = df.explode('user_id').reset_index(drop=True)
print(df)
Output:
step user_id
0 Step 1 user a
1 Step 1 user b
2 Step 2 user a
3 Step 2 user b
4 Step 2 user c
5 Step 3 user c
data = (('Step 1 string', [list of userid of that step]),
        ('Step 2 string', [list of userid of that step]),
        ('Step 3 string', [list of userid of that step]),
        ('Step n string', [list of userids of that step]))
df = pd.DataFrame(data, columns=['Column 1', 'Column 2'])
This should do the job.
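For completeness, the two-column constructor still leaves lists in Column 2, so an explode step is needed to reach the flat shape the question asks for. A runnable sketch, with made-up user ids standing in for the elided lists:

```python
import pandas as pd

# Hypothetical concrete values in place of "[list of userid of that step]"
data = (('Step 1 string', ['u1', 'u2']),
        ('Step 2 string', ['u3']),
        ('Step 3 string', ['u4', 'u5']))

df = (pd.DataFrame(data, columns=['Column 1', 'Column 2'])
        .explode('Column 2')          # one row per user id
        .reset_index(drop=True))
print(df)
```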

Skip item with more columns when creating Pandas DataFrame

I have a list of lists:
list = [
    ['Row 1', 'Value 1'],
    ['Row 2', 'Value 2'],
    ['Row 3', 'Value 3', 'Value 4']
]
And I have a list for dataframe header:
header_list = ['RowID', 'Value']
If I create the DataFrame using df = pd.DataFrame(list, columns=header_list), then Python will throw an error saying Row 3 has more than 2 columns, which is inconsistent with header_list.
So how can I skip Row 3 when creating the DataFrame? And how can I achieve this "in place", i.e. without creating a new list by looping through the original list and appending only the items with length 2?
Thanks for the help!
First, rename the variable list to L, because list shadows the Python built-in.
Then for filter use list comprehension:
L = [['Row 1','Value 1'], ['Row 2', 'Value 2'], ['Row 3', 'Value 3', 'Value 4']]
# omit all rows where len != 2
df = pd.DataFrame([x for x in L if len(x) == 2], columns = header_list)
print (df)
RowID Value
0 Row 1 Value 1
1 Row 2 Value 2
# keep only the last 2 values if len != 2
df = pd.DataFrame([x if len(x) == 2 else x[-2:] for x in L], columns = header_list)
print (df)
RowID Value
0 Row 1 Value 1
1 Row 2 Value 2
2 Value 3 Value 4
Or:
# keep only the first 2 values if len != 2
df = pd.DataFrame([x if len(x) == 2 else x[:2] for x in L], columns = header_list)
print (df)
RowID Value
0 Row 1 Value 1
1 Row 2 Value 2
2 Row 3 Value 3
Try the code below:
list1 = [['Row 1', 'Value 1'], ['Row 2', 'Value 2'], ['Row 3', 'Value 3', 'Value 4']]
dff = pd.DataFrame(list1)
dff = dff[[x for x in range(len(header_list))]]
dff.columns = header_list
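Another sketch, not taken from the answers above: let pandas pad the ragged rows with None, then truly skip the long rows by keeping only those where the extra column is empty. This assumes at least one row really is longer (otherwise column 2 would not exist):

```python
import pandas as pd

header_list = ['RowID', 'Value']
L = [['Row 1', 'Value 1'],
     ['Row 2', 'Value 2'],
     ['Row 3', 'Value 3', 'Value 4']]

# Ragged rows are padded with None, so column 2 is non-null
# exactly for the rows that have too many values.
raw = pd.DataFrame(L)
df = raw[raw[2].isna()].iloc[:, :2]
df.columns = header_list
print(df)
```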

How can I turn pandas dataframe into an ordered list with many to one relationship?

I currently have a pandas dataframe where there are many answers joined on a single question, so I am trying to turn it into a list so I can do cosine similarity.
Currently I have the dataframe, where the questions are joined to the answers through parent_id = q_id:
print (df)
q_id q_body parent_id a_body
0 1 question 1 1 answer 1
1 1 question 1 1 answer 2
2 1 question 1 1 answer 3
3 2 question 2 2 answer 1
4 2 question 2 2 answer 2
and the product I am looking for is:
("question 1", "answer 1", "answer 2", "answer 3")
("question 2", "answer 1", "answer 2")
Any help would be appreciated! Thank you very much.
I think you need groupby with apply:
#output is tuple with question value
df = df.groupby('q_body')['a_body'].apply(lambda x: tuple([x.name] + list(x)))
print (df)
q_body
question 1 (question 1, answer 1, answer 2, answer 3)
question 2 (question 2, answer 1, answer 2)
Name: a_body, dtype: object
#output is list with question value
df = df.groupby('q_body')['a_body'].apply(lambda x: [x.name] + list(x))
print (df)
q_body
question 1 [question 1, answer 1, answer 2, answer 3]
question 2 [question 2, answer 1, answer 2]
Name: a_body, dtype: object
#output is list without question value
df = df.groupby('q_body')['a_body'].apply(list)
print (df)
q_body
question 1 [answer 1, answer 2, answer 3]
question 2 [answer 1, answer 2]
Name: a_body, dtype: object
#grouping by parent_id without question value
df = df.groupby('parent_id')['a_body'].apply(list)
print (df)
parent_id
1 [answer 1, answer 2, answer 3]
2 [answer 1, answer 2]
Name: a_body, dtype: object
#output is a string, values concatenated by ','
df = df.groupby('parent_id')['a_body'].apply(', '.join)
print (df)
parent_id
1 answer 1, answer 2, answer 3
2 answer 1, answer 2
Name: a_body, dtype: object
If you need the output as a list, add tolist():
L = df.groupby('q_body')['a_body'].apply(lambda x: tuple([x.name] + list(x))).tolist()
print (L)
[('question 1', 'answer 1', 'answer 2', 'answer 3'), ('question 2', 'answer 1', 'answer 2')]
df = pd.DataFrame([
    ['question 1', 'answer 1'],
    ['question 1', 'answer 2'],
    ['question 1', 'answer 3'],
    ['question 2', 'answer 1'],
    ['question 2', 'answer 2'],
], columns=['q_body', 'a_body'])
print(df)
q_body a_body
0 question 1 answer 1
1 question 1 answer 2
2 question 1 answer 3
3 question 2 answer 1
4 question 2 answer 2
apply(list)
df.groupby('q_body').a_body.apply(list)
q_body
question 1 [answer 1, answer 2, answer 3]
question 2 [answer 1, answer 2]
See if this helps:
result = df.groupby('q_id').agg({'q_body': lambda x: x.iloc[0], 'a_body': lambda x: ', '.join(x)})
result['output'] = result.q_body + ', ' + result.a_body
This will create a new column output with the desired result.
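A runnable sketch of that agg approach on the sample data, with the string alias 'first' standing in for the first lambda (an equivalent shortcut):

```python
import pandas as pd

df = pd.DataFrame({'q_id': [1, 1, 1, 2, 2],
                   'q_body': ['question 1'] * 3 + ['question 2'] * 2,
                   'a_body': ['answer 1', 'answer 2', 'answer 3',
                              'answer 1', 'answer 2']})

# One row per question: keep the first q_body, join the answers
result = df.groupby('q_id').agg({'q_body': 'first',
                                 'a_body': ', '.join})
result['output'] = result.q_body + ', ' + result.a_body
print(result['output'].tolist())
```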

Filtering Pandas dataframe on two criteria where one column is a list

I have a Pandas Dataframe with columns Project Type and Parts. I would like to know how many part As are used in projects of Project Type 1. I am trying to use .count(), but it doesn't return just a single number.
import pandas as pd
parts_df = pd.DataFrame(data = [['Type 1', ['A', 'B']], ['Type 2', ['B']]], columns=['Project Type', 'Parts'])
print((parts_df[(parts_df['Project Type'] == 'Type 1') & ('A' in parts_df['Parts'])]).count())
Output:
Project Type 0
Parts 0
dtype: int64
Desired Output:
1
You can try something like this:
sum(['A' in i for i in parts_df[parts_df['Project Type']=='Type 1']['Parts'].tolist()])
Sample:
In[32]: parts_df = pd.DataFrame(data = [['Type 1', ['A', 'B']], ['Type 2', ['A']], ['Type 1', ['C']]], columns=['Project Type', 'Parts'])
In[33]: sum(['A' in i for i in parts_df[parts_df['Project Type']=='Type 1']['Parts'].tolist()])
Out[33]: 1
IIUC you want the following:
In [13]:
parts_df.loc[parts_df['Project Type'] == 'Type 1','Parts'].apply(lambda x: x.count('A'))
Out[13]:
0 1
Name: Parts, dtype: int64
If you want the scalar value rather than a series then you can call .values attribute and index into the np array:
In [15]:
parts_df.loc[parts_df['Project Type'] == 'Type 1','Parts'].apply(lambda x: x.count('A')).values[0]
Out[15]:
1
You could just add a column that counts the 'A' parts:
In [17]:
parts_df['A count'] = parts_df['Parts'].apply(lambda x: x.count('A'))
parts_df
Out[17]:
Project Type Parts A count
0 Type 1 [A, B] 1
1 Type 2 [B] 0
you can then filter:
In [18]:
parts_df[(parts_df['Project Type'] == 'Type 1') & (parts_df['A count'] > 0)]
Out[18]:
Project Type Parts A count
0 Type 1 [A, B] 1
Change the 'A' in parts_df['Parts'] test to a lambda:
import pandas as pd
parts_df = pd.DataFrame(data = [['Type 1', ['A', 'B']], ['Type 2', ['B']]], columns=['Project Type', 'Parts'])
res = (parts_df[(parts_df['Project Type'] == 'Type 1') & (parts_df['Parts'].apply(lambda x: 'A' in x))]).count()
res.max()
Result:
1
You can spend a second to re-format the columns, and make life a little easier:
parts_df.Parts = parts_df.Parts.map(lambda x: ' '.join(x))
# Project type Parts
#0 Type 1 A B
#1 Type 2 B
Now you can use the Series.str.get_dummies method:
dummies = parts_df.Parts.str.get_dummies( sep=' ')
# A B
#0 1 1
#1 0 1
which shows the presence or absence of each "Part" using either a 1 or 0 respectively. Use this dummies frame to create a dataframe that can easily be manipulated using all of the standard pandas methods (pandas doesn't like lists in columns):
new_parts_df = pd.concat((parts_df['Project Type'], dummies), axis=1)
# Project type A B
#0 Type 1 1 1
#1 Type 2 0 1
You can now easily count groups in several ways. The most efficient thing to do would be use pandas.DataFrame.query, but the unfortunate white space in your column name "Project Type" makes this difficult. I would avoid white spaces in column names whenever possible. Try this:
new_parts_df.rename( columns={'Project Type': 'Project_Type'}, inplace=True)
print(len(new_parts_df.query( 'Project_Type=="Type 1" and A==1')))
# 1
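On pandas 0.25 or newer, DataFrame.explode offers yet another route that avoids per-row lambdas entirely; a sketch:

```python
import pandas as pd

parts_df = pd.DataFrame(data=[['Type 1', ['A', 'B']], ['Type 2', ['B']]],
                        columns=['Project Type', 'Parts'])

# One row per (project, part) pair, then a plain scalar comparison
exploded = parts_df.explode('Parts')
count = len(exploded[(exploded['Project Type'] == 'Type 1')
                     & (exploded['Parts'] == 'A')])
print(count)
```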
