This question already has answers here:
How to unnest (explode) a column in a pandas DataFrame, into multiple rows
(16 answers)
Closed last year.
I wrote the following code (not sure if this is the best approach); just know that the data I have is divided into two separate lists, in the correct order: Z[0] is the steps, and Z[1] is the lists of user ids.
for z in zip(steps, userids_list):
    print(z)
This results in the following tuple values:
# SAMPLE
(('Step 1 string', [list of userid of that step]),
('Step 2 string', [list of userid of that step]),
('Step 3 string', [list of userid of that step]),
('Step n string', [list of userids of that step]))
My goal is to transform that style of data into the following pandas DataFrame.
Column 1 Column 2
Step 1 User id
Step 1 User id
Step 2 User id
Step 2 User id
Step 3 User id
Step 3 User id
Unfortunately I couldn't find a way to transform the data into what I want. Any ideas on what I could try to do?
explode is perfect for this. Load your data into a dataframe and then explode the column containing the lists:
import pandas as pd

df = pd.DataFrame({
    'Column 1': Z[0],
    'Column 2': Z[1],
})
df = df.explode('Column 2')
For example:
steps = ['Step 1', 'Step 2', 'Step 3']
user_ids = [
    ['user a', 'user b'],
    ['user a', 'user b', 'user c'],
    ['user c'],
]
df = pd.DataFrame({
    'step': steps,
    'user_id': user_ids,
})
df = df.explode('user_id').reset_index(drop=True)
print(df)
Output:
step user_id
0 Step 1 user a
1 Step 1 user b
2 Step 2 user a
3 Step 2 user b
4 Step 2 user c
5 Step 3 user c
data = (('Step 1 string', [list of userid of that step]),
        ('Step 2 string', [list of userid of that step]),
        ('Step 3 string', [list of userid of that step]),
        ('Step n string', [list of userids of that step]))
df = pd.DataFrame(data, columns=['Column 1', 'Column 2'])
This should do the job, followed by df.explode('Column 2') as above.
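Putting both steps together, a minimal runnable sketch (with made-up step names and user ids standing in for the asker's actual lists):

```python
import pandas as pd

# Tuple of (step, [user ids]) pairs, shaped like the asker's data
data = (('Step 1', ['u1', 'u2']),
        ('Step 2', ['u3']))

df = pd.DataFrame(data, columns=['Column 1', 'Column 2'])

# One row per (step, user id) pair, with a clean 0..n index
df = df.explode('Column 2').reset_index(drop=True)
```

Each step string is repeated once per user id in its list, which is exactly the two-column layout the question asks for.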
I am working on a pandas dataframe from which I have to remove rows if they contain a particular word alone. For example,
df = pd.DataFrame({'team': ['Team 1', 'Team 1 abc', 'Team 2',
                            'Team 3', 'Team 2', 'Team 3'],
                   'Subject': ['Math', 'Science', 'Science',
                               'Math', 'Science', 'Math'],
                   'points': [10, 8, 10, 6, 6, 5]})
I tried to remove the rows that contain Team 1 alone. For that I tried,
df = df[df["team"].str.contains("Team 1") == False]
but str.contains matches substrings, so the 'Team 1 abc' row was removed as well. The dataframe I need keeps that row, and the row numbers should also be in order.
Just use != (an exact comparison) instead of .str.contains (a substring match):
df = df[df["team"] != "Team 1"]
Output:
>>> df
team Subject points
1 Team 1 abc Science 8
2 Team 2 Science 10
3 Team 3 Math 6
4 Team 2 Science 6
5 Team 3 Math 5
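Since the question also asks for the row numbers to run in order after filtering, reset_index(drop=True) can be chained on; a small sketch rebuilding the sample data:

```python
import pandas as pd

df = pd.DataFrame({'team': ['Team 1', 'Team 1 abc', 'Team 2',
                            'Team 3', 'Team 2', 'Team 3'],
                   'Subject': ['Math', 'Science', 'Science',
                               'Math', 'Science', 'Math'],
                   'points': [10, 8, 10, 6, 6, 5]})

# Keep rows whose team is not exactly 'Team 1', then renumber 0, 1, 2, ...
df = df[df['team'] != 'Team 1'].reset_index(drop=True)
```

Without drop=True the old index would be kept as a new column; with it, the surviving rows are simply renumbered.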
This question already has answers here:
How can I pivot a dataframe?
(5 answers)
Closed 2 years ago.
I have the following dataframe:
data = {'ID': [1, 1, 2, 2, 3, 3, 4, 4],
        'User': ['type 1', 'type 2', 'type 1', 'type 2',
                 'type 1', 'type 2', 'type 1', 'type 2'],
        'Rating': [4, 5, 3, 2, 1, 5, 4, 3]}
df = pd.DataFrame(data)
print(df)
I want to create two new columns based on "User", one for type 1 and another for type 2. I suspect I have to create a new dataframe:
Type_1 = df[df.User == 'type 1']
Type_2 = df[df.User == 'type 2']
df1 = pd.merge(Type_1, Type_2, how="left", on=['ID'])
print(df1)
Is there a quicker way of accomplishing this?
IIUC,
df.set_index(['ID','User'])['Rating'].unstack('User').reset_index()
OR
df.pivot(index='ID', columns='User', values='Rating').reset_index()
Output:
User ID type 1 type 2
0 1 4 5
1 2 3 2
2 3 1 5
3 4 4 3
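One caveat: pivot raises an error if any (ID, User) pair occurs more than once. If duplicates are possible, pivot_table with an aggregation function is the more forgiving variant (a sketch, assuming averaging duplicate ratings is acceptable):

```python
import pandas as pd

data = {'ID': [1, 1, 2, 2, 3, 3, 4, 4],
        'User': ['type 1', 'type 2', 'type 1', 'type 2',
                 'type 1', 'type 2', 'type 1', 'type 2'],
        'Rating': [4, 5, 3, 2, 1, 5, 4, 3]}
df = pd.DataFrame(data)

# pivot_table aggregates duplicate (ID, User) pairs instead of raising
wide = df.pivot_table(index='ID', columns='User',
                      values='Rating', aggfunc='mean').reset_index()
```

With no duplicates present, the result matches the plain pivot output.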
I have a dataframe as below, and I want to convert it to one row, with new columns created from its original rows and columns,
data = {'Contract': ["Team A", "Team B", "Team C"],
        'Revenue': [11, 7, 10],
        'Cost': [5, 2, 9],
        'Tax': [4, 2, 2]}
like this (a single row with columns such as Team A_Revenue, Team A_Cost, and so on):
I tried:
import pandas as pd
df = pd.DataFrame(data)
print (df.values.flatten())
The result is not ideal:
['Team A' 5L 11L 4L 'Team B' 2L 7L 2L 'Team C' 9L 10L 2L]
How can I achieve it?
Check stack, then transpose:
s = df.set_index('Contract').stack().to_frame(0).T
s.columns=s.columns.map('_'.join)
s
Team A_Revenue Team A_Cost ... Team C_Cost Team C_Tax
0 11 5 ... 9 2
[1 rows x 9 columns]
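The same one-row reshape can also be written without stack, by walking the rows with a plain dict comprehension (a sketch rebuilding the sample data):

```python
import pandas as pd

data = {'Contract': ["Team A", "Team B", "Team C"],
        'Revenue': [11, 7, 10],
        'Cost': [5, 2, 9],
        'Tax': [4, 2, 2]}
df = pd.DataFrame(data).set_index('Contract')

# One row; column names joined as 'Team A_Revenue', 'Team A_Cost', ...
flat = pd.DataFrame([{f'{team}_{col}': val
                      for team, row in df.iterrows()
                      for col, val in row.items()}])
```

This builds the 'Contract_column' names explicitly instead of relying on a MultiIndex and map('_'.join).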
I have a list of lists:
list = [
    ['Row 1', 'Value 1'],
    ['Row 2', 'Value 2'],
    ['Row 3', 'Value 3', 'Value 4']
]
And I have a list for dataframe header:
header_list = ['RowID', 'Value']
If I create the DataFrame using df = pd.DataFrame(list, columns=header_list), then Python will throw an error saying Row 3 has more than 2 columns, which is inconsistent with header_list.
So how can I skip Row 3 when creating the DataFrame? And how can I do this "in place", i.e. without building a new list by looping through the original and appending only the items with length 2?
Thanks for the help!
First change the variable name list to L, because list is a built-in Python name.
Then filter using a list comprehension:
L = [['Row 1','Value 1'], ['Row 2', 'Value 2'], ['Row 3', 'Value 3', 'Value 4']]
#for omit all rows != 2
df = pd.DataFrame([x for x in L if len(x) == 2], columns = header_list)
print (df)
RowID Value
0 Row 1 Value 1
1 Row 2 Value 2
#filter last 2 values if len != 2
df = pd.DataFrame([x if len(x) == 2 else x[-2:] for x in L], columns = header_list)
print (df)
RowID Value
0 Row 1 Value 1
1 Row 2 Value 2
2 Value 3 Value 4
Or:
#filter first 2 values if len != 2
df = pd.DataFrame([x if len(x) == 2 else x[:2] for x in L], columns = header_list)
print (df)
RowID Value
0 Row 1 Value 1
1 Row 2 Value 2
2 Row 3 Value 3
Try the code below; it keeps only the first len(header_list) columns, so the extra value in Row 3 is dropped:
list1 = [['Row 1', 'Value 1'], ['Row 2', 'Value 2'], ['Row 3', 'Value 3', 'Value 4']]
dff = pd.DataFrame(list1)
dff = dff[[x for x in range(len(header_list))]]
dff.columns = header_list
I have a Pandas Dataframe with columns Project Type and Parts. I would like to know how many part As are used in projects of Project Type 1. I am trying to use .count(), but it doesn't return just a single number.
import pandas as pd
parts_df = pd.DataFrame(data = [['Type 1', ['A', 'B']], ['Type 2', ['B']]], columns=['Project Type', 'Parts'])
print (parts_df[(parts_df['Project Type'] == 'Type 1') & ('A' in parts_df['Parts'])]).count()
Output:
Project Type 0
Parts 0
dtype: int64
Desired Output:
1
you can try something like this :
sum(['A' in i for i in parts_df[parts_df['Project Type']=='Type 1']['Parts'].tolist()])
sample :
In[32]: parts_df = pd.DataFrame(data = [['Type 1', ['A', 'B']], ['Type 2', ['A']], ['Type 1', ['C']]], columns=['Project Type', 'Parts'])
In[33]: sum(['A' in i for i in parts_df[parts_df['Project Type']=='Type 1']['Parts'].tolist()])
Out[33]: 1
IIUC you want the following:
In [13]:
parts_df.loc[parts_df['Project Type'] == 'Type 1','Parts'].apply(lambda x: x.count('A'))
Out[13]:
0 1
Name: Parts, dtype: int64
If you want the scalar value rather than a series then you can call .values attribute and index into the np array:
In [15]:
parts_df.loc[parts_df['Project Type'] == 'Type 1','Parts'].apply(lambda x: x.count('A')).values[0]
Out[15]:
1
You could just add a column that counts the 'A' parts:
In [17]:
parts_df['A count'] = parts_df['Parts'].apply(lambda x: x.count('A'))
parts_df
Out[17]:
Project Type Parts A count
0 Type 1 [A, B] 1
1 Type 2 [B] 0
you can then filter:
In [18]:
parts_df[(parts_df['Project Type'] == 'Type 1') & (parts_df['A count'] > 0)]
Out[18]:
Project Type Parts A count
0 Type 1 [A, B] 1
Change the 'A' in parts_df['Parts'] to a lambda applied with .apply:
import pandas as pd
parts_df = pd.DataFrame(data = [['Type 1', ['A', 'B']], ['Type 2', ['B']]], columns=['Project Type', 'Parts'])
res = (parts_df[(parts_df['Project Type'] == 'Type 1') & (parts_df['Parts'].apply(lambda x: 'A' in x))]).count()
res.max()
Result:
1
You can spend a second to re-format the columns, and make life a little easier:
parts_df.Parts = parts_df.Parts.map(lambda x: ' '.join(x))
# Project type Parts
#0 Type 1 A B
#1 Type 2 B
Now you can use the Series.str.get_dummies method:
dummies = parts_df.Parts.str.get_dummies( sep=' ')
# A B
#0 1 1
#1 0 1
which shows the presence or absence of each "Part" using either a 1 or 0 respectively. Use this dummies frame to create a dataframe that can easily be manipulated using all of the standard pandas methods (pandas doesn't like lists in columns):
new_parts_df = pandas.concat( (parts_df['Project Type'], dummies), axis=1)
# Project type A B
#0 Type 1 1 1
#1 Type 2 0 1
You can now easily count groups in several ways. The most efficient would be to use pandas.DataFrame.query, but the unfortunate white space in the column name "Project Type" makes this difficult. I would avoid white space in column names whenever possible. Try this:
new_parts_df.rename( columns={'Project Type': 'Project_Type'}, inplace=True)
print(len(new_parts_df.query( 'Project_Type=="Type 1" and A==1')))
# 1
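Once the parts are one-hot encoded, counting per project type also works in one go with groupby, without the rename (a sketch rebuilding the sample data):

```python
import pandas as pd

parts_df = pd.DataFrame(data=[['Type 1', ['A', 'B']], ['Type 2', ['B']]],
                        columns=['Project Type', 'Parts'])

# One indicator column per part, then sum indicators per project type
dummies = parts_df['Parts'].map(' '.join).str.get_dummies(sep=' ')
counts = (pd.concat([parts_df['Project Type'], dummies], axis=1)
          .groupby('Project Type').sum())
```

counts then holds, for every project type, how many rows contain each part; the answer to the original question is counts.loc['Type 1', 'A'].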