ALTER COLUMN equivalents in pandas - python

When I'm working in SQL, I find almost all the things I do with a column are related to the following four operations:
Add a column.
Remove a column.
Change a column type.
Rename a column.
What is the preferred way to do these four DDL operations in pandas? For example, let's suppose I am starting with the following DataFrame:
import pandas as pd
df = pd.DataFrame([
    {'product': 'drink', 'brand': 'spindrift', 'id': '19'},
    {'product': 'cup', 'brand': None, 'id': '11'}
])
How would I:
Change the df.id column from a string (or object, as it says) to an int64?
Rename the column product to product_type?
Add a new column called 'cost' with values [2.99, 3.99]?
Remove the column called brand?

Simple and complete:
import numpy as np
import pandas as pd
df = pd.DataFrame([
    {'product': 'drink', 'brand': 'spindrift', 'id': '19'},
    {'product': 'cup', 'brand': None, 'id': '11'}
])
# Change the df.id column from a string (or object) to an int64
df['id'] = df['id'].astype(np.int64)
# Rename the column product to product_type
df = df.rename(columns={'product': 'product_type'})
# Add a new column called 'cost' with values [2.99, 3.99]
df['cost'] = pd.Series([2.99, 3.99])
# Remove the column called brand
df = df.drop(columns='brand')
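To verify the result, a quick check (the output in the comments is what the steps above should produce):
print(df)
#   product_type  id  cost
# 0        drink  19  2.99
# 1          cup  11  3.99
print(df.dtypes)
# product_type     object
# id                int64
# cost            float64
# dtype: object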
These functions can also be chained together. I would not recommend it, as a long chain is harder to debug and fix than the step-by-step version above:
# do all the steps above with a single line
# do all the steps above in a single chained expression
# note: astype takes a dict of column dtypes and has no axis argument
df = (df.astype({'id': np.int64})
        .rename(columns={'product': 'product_type'})
        .assign(cost=[2.99, 3.99])
        .drop(columns='brand'))
There is also another way in which you can use inplace=True. This modifies the DataFrame in place instead of reassigning it. I don't recommend it, as it is not as explicit as the first method:
# Using inplace=True
# astype has no inplace option, so the result must be assigned back
df['id'] = df['id'].astype(np.int64)
df.rename(columns={'product': 'product_type'}, inplace=True)
# No change from previous
df['cost'] = pd.Series([2.99, 3.99])
# pop brand out
df.pop('brand')
print(df)
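Note that df.pop('brand') also returns the removed column as a Series, so you can keep it in a variable if you still need the data.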

You can perform these steps like this (starting with your original data frame):
# add a column
df = pd.concat([df, pd.Series([2.99, 3.99], name='cost')], axis=1)
# change column name
df = df.rename(columns={'product': 'product_type'})
# remove brand
df = df.drop(columns='brand')
# change data type
df['id'] = df['id'].astype('int')
print(df)
  product_type  id  cost
0        drink  19  2.99
1          cup  11  3.99

You could do:
df = pd.DataFrame([
    {'product': 'drink', 'brand': 'spindrift', 'id': '19'},
    {'product': 'cup', 'brand': None, 'id': '11'}
])
df = (df.assign(cost=[2.99, 3.99],
                id=lambda d: d.id.astype(int))
        .drop(columns=['brand'])
        .rename({'product': 'product_type'}, axis=1))
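For reference, this should leave the same frame as the answers above; doing the astype inside assign via a lambda keeps the whole transformation one chained expression:
print(df)
#   product_type  id  cost
# 0        drink  19  2.99
# 1          cup  11  3.99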

This should work:
# change datatype
>>> df['id'] = df['id'].astype('int64')
>>> df.dtypes
brand      object
id          int64
product    object
dtype: object
# rename column
>>> df.rename(columns={'product': 'product_type'}, inplace=True)
>>> df
       brand  id product_type
0  spindrift  19        drink
1       None  11          cup
# create new column
>>> df['Cost'] = pd.Series([2.99, 3.99])
>>> df
       brand  id product_type  Cost
0  spindrift  19        drink  2.99
1       None  11          cup  3.99
# drop column
>>> df.drop(['brand'], axis=1, inplace=True)
>>> df
   id product_type  Cost
0  19        drink  2.99
1  11          cup  3.99

Related

How can I map multiple row entries in df1 (name, category, amount) to one row entry per name under the respective category columns of another df2?

The first dataframe has a column containing categories, which are the same as the headers of the second. There are multiple row entries per name in df1; df2 will have one row entry per name. df1 has one row entry per category per name. All rows for one name occur in sequence in df1.
The structure of df1, the headers of df2, and the desired output are all reproduced in the code below.
How can I map data from the df1 to df2?
More specifically, how can I map multiple rows from df1 to 1 row and the respective columns of df2 in a more efficient way than looping twice to check for each category under each name?
Any help is appreciated,
Have a great day
Code:
import pandas as pds
df1 = pds.DataFrame({'Client': ['Rick', 'Rick', 'John'],
                     'Category': ['Service1', 'Service2', 'Service1'],
                     'Amount': [250, 6, 79]})
df2 = pds.DataFrame(columns=['Client', 'Due_Date', 'Service1', 'Service2'])
output = pds.DataFrame({'Client': ['Rick', 'John'],
                        'Due_Date': [None, None],
                        'Service1': [250, 79],
                        'Service2': [6, 0]})
This is an alternative approach using .pivot() and .assign():
df1_pivot = (df1.pivot(index='Client', columns='Category', values='Amount')
                .reset_index()
                .assign(Due_Date=None))
df_out = df2.assign(**df1_pivot)
print(df_out)
  Client Due_Date  Service1  Service2
0   John     None      79.0       NaN
1   Rick     None     250.0       6.0
You're looking for pandas.DataFrame.pivot:
out = (df1.pivot(index="Client", columns="Category")
          .reset_index()
          .set_axis(["Client", "Service1", "Service2"], axis=1)
          .assign(Due_Date=None))
NB: I suggest you use import pandas as pd, as per the usual import convention.
Output:
print(out)
  Client  Service1  Service2 Due_Date
0   John      79.0       NaN     None
1   Rick     250.0       6.0     None
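Both answers leave NaN where a client has no entry for a category, while the desired output in the question shows 0 for John's Service2. If you want that, one option (a sketch using fillna) is:
out = out.fillna({'Service1': 0, 'Service2': 0})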

How to preserve column heading while using .split() function

My goal below is to create a single column containing all the individual words of each string in the 'Name' column.
Although I am achieving this, I am losing the column header on df = df['Name'].str.split(' ', expand=True). I would like to preserve the header if possible so that I can refer to it later in the script.
I am also ending up with duplicate index values, which is fine, but if there is a way to avoid this, it would be great.
Any help is appreciated greatly. Thank you
import pandas as pd
data = {'Name':['Tom Wilson', 'nick snyder', 'krish moham', 'jack oconnell']}
df = pd.DataFrame(data)
df = df['Name'].str.split(' ', expand=True)
df = df.stack(dropna=True)
print(df)
Try this:
data = {'Name': ['Tom Wilson', 'nick snyder', 'krish moham', 'jack oconnell']}
df = pd.DataFrame(data)
df = df['Name'].str.split(' ').explode().to_frame()
print(df)
Prints:
       Name
0       Tom
0    Wilson
1      nick
1    snyder
2     krish
2     moham
3      jack
3  oconnell
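If you also want to avoid the repeated index values mentioned in the question, a reset_index at the end should do it:
df = df['Name'].str.split(' ').explode().to_frame().reset_index(drop=True)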

Python Pivot Table multi Sub-totals in column

I would like to show the sub-total column of a multi-index pivot table in different ways: for example, the sum for one selected row and the max for another. Is this possible?
I managed to get the code half right, but I am stuck on replicating it without overwriting the previous result, and I am not able to loop over this code.
In my example I want to get the max value from Toyota and the sum value from Honda shown in the newly created Total column.
import numpy as np
import pandas as pd

cars = {'Brand': ['Honda', 'Toyota', 'Honda', 'Toyota'],
        'Target': ['A', 'B', 'A', 'B'],
        'Speed': [20, 80, 30, 10],
        'Date': ['13/02/2019', '18/02/2019', '18/02/2019', '13/02/2019']}
df = pd.DataFrame(cars)
table = pd.pivot_table(df, values=['Speed'],
                       index=['Target', 'Brand'],
                       columns=['Date'],
                       fill_value=0, aggfunc=np.sum, dropna=True)
table
The code I created (which only works for the last line, as it overwrites the first one):
table['Total'] = table.loc(axis=0)[:, ['Toyota']].max(axis=1)
table['Total'] = table.loc(axis=0)[:, ['Honda']].sum(axis=1)
Current output: the Total column only shows the Honda sum.
Desired output: I would also like to see the max value for Toyota, which would be 80.
Use slicers to set new values on both sides; here : means all values of that index level:
idx = pd.IndexSlice
table.loc[idx[:, 'Toyota'], 'Total'] = table.max(axis=1)
table.loc[idx[:, 'Honda'], 'Total'] = table.sum(axis=1)
print (table)
                   Speed            Total
Date          13/02/2019 18/02/2019
Target Brand
A      Honda          20         30   50.0
B      Toyota         10         80   80.0
You can also set and select on both sides:
idx = pd.IndexSlice
table.loc[idx[:, 'Toyota'], 'Total'] = table.loc[idx[:, 'Toyota'], :].max(axis=1)
table.loc[idx[:, 'Honda'], 'Total'] = table.loc[idx[:, 'Honda'], :].sum(axis=1)
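If you need to loop this over many brands (the looping the question asks about), one option is a small dict mapping each brand to its aggregation; this is a sketch that assumes the pivot table built above:
idx = pd.IndexSlice
aggs = {'Toyota': 'max', 'Honda': 'sum'}  # brand -> aggregation
for brand, how in aggs.items():
    rows = idx[:, brand]
    # aggregate only the 'Speed' columns so the new 'Total' column is not included
    table.loc[rows, 'Total'] = table.loc[rows, 'Speed'].agg(how, axis=1)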

Populating several values in an empty column in a row, based on a value from another column

I already have an idea as to how I'm going to do this - I'm just curious about whether my method is the most efficient.
So for instance, let's say that for whatever reason, I have the following table:
The first 4 columns in the table are all repeated - they just say info about the employee. The reason these rows repeat is because that employee handles multiple clients.
In some cases, I am missing info on the Age and Employee duration of an employee. Another colleague gave me this information in an excel sheet.
So now, I have info on Brian's and Dennis' age and employment duration, and I need to fill in all rows that match their employee IDs with this information. My plan for doing that is this:
data = {"14": # Brian's Employee ID
{"Age":31,
:"Employment Duration":3},
"21": # Dennis' Employee ID
{"Age":45,
"Employment Duratiaon":12}
}
After making the above dictionary of dictionaries with the necessary values, my plan is to iterate over each row in the above dataframe, and fill in the 'Age' and 'Employment Duration' columns based on the value in 'Employee ID':
for index, row in df.iterrows():
    if row["Employee ID"] in data:
        # write through df.loc so the change sticks; mutating `row` would not
        df.loc[index, "Age"] = data[row["Employee ID"]]["Age"]
        df.loc[index, "Employment Duration"] = data[row["Employee ID"]]["Employment Duration"]
That's my plan for populating the missing values!
I'm curious about whether there's a simpler way that's just not presenting itself to me, because this was the first thing that sprang to mind!
Don't iterate over rows in pandas when you can avoid it. Instead, make the most of the pandas library with operations like these:
Assume we have a dataframe:
data = pd.DataFrame({
    'name': ['john', 'john', 'mary', 'mary'],
    'age': ['', '', 25, 25]
})
Which looks like:
   name age
0  john
1  john
2  mary  25
3  mary  25
We can apply a lambda function like so (note the bracket access: with axis=1, x.name is the row's index label, not the 'name' column):
data['age'] = data.apply(lambda x: 27 if x['name'] == 'john' else x.age, axis=1)
Or we can use pandas .loc (selecting rows and column in a single call avoids chained assignment):
data.loc[data['name'] == 'john', 'age'] = 27
Test them out and compare how long each take to execute vs. iterating over rows.
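As a quick way to compare them, here is a sketch with the standard timeit module (the setup string rebuilds the frame before each timing run):
import timeit

setup = ("import pandas as pd\n"
         "data = pd.DataFrame({'name': ['john', 'john', 'mary', 'mary'],"
         " 'age': ['', '', 25, 25]})")
print(timeit.timeit("data.loc[data['name'] == 'john', 'age'] = 27",
                    setup=setup, number=1000))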
Ensure missing values are represented as null values (np.NaN). The second set of information should be stored in another DataFrame with the same column labels.
Then, by setting the index to 'Employee ID', update will align on the indices and fill in the missing values.
Sample Data
import pandas as pd
import numpy as np
df = pd.DataFrame({'Employee ID': ["11", "11", "14", "21"],
                   'Name': ['Alan', 'Alan', 'Brian', 'Dennis'],
                   'Age': [14, 14, np.NaN, np.NaN],
                   'Employment Duration': [3, 3, np.NaN, np.NaN],
                   'Clients Handled': ['A', 'B', 'C', 'G']})
data = {"14": {"Age": 31, "Employment Duration": 3},
        "21": {"Age": 45, "Employment Duration": 12}}
df2 = pd.DataFrame.from_dict(data, orient='index')
Code
#df = df.replace('', np.NaN)  # if your missing values are empty strings rather than NaN
df = df.set_index('Employee ID')
df.update(df2, overwrite=False)
print(df)
               Name   Age  Employment Duration Clients Handled
Employee ID
11             Alan  14.0                  3.0               A
11             Alan  14.0                  3.0               B
14            Brian  31.0                  3.0               C
21           Dennis  45.0                 12.0               G
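If you need 'Employee ID' back as a regular column afterwards, df = df.reset_index() restores it.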

Applying Conditional Exclusions to Pandas DataFrame using Counts

I have the following DataFrame in pandas:
import pandas as pd

example_data = [{'ticker': 'aapl', 'loc': 'us'},
                {'ticker': 'mstf', 'loc': 'us'},
                {'ticker': 'baba', 'loc': 'china'},
                {'ticker': 'ibm', 'loc': 'us'},
                {'ticker': 'db', 'loc': 'germany'}]
df = pd.DataFrame(example_data)
print(df)
       loc ticker
0       us   aapl
1       us   mstf
2    china   baba
3       us    ibm
4  germany     db
I want to create a new DataFrame such that each row is created from the original df but rows with loc counts greater than 2 are excluded. That is, the new df is created by looping through the old df, counting the number of loc rows that have come before, and including / excluding the row based on this count.
The following code gives the desired output.
country_counts = {}
output = []
for row in df.values:
    if row[0] not in country_counts:
        country_counts[row[0]] = 1
    else:
        country_counts[row[0]] += 1
    if country_counts[row[0]] <= 2:
        output.append({'loc': row[0], 'ticker': row[1]})
new_df = pd.DataFrame(output)
print(new_df)
       loc ticker
0       us   aapl
1       us   mstf
2    china   baba
3  germany     db
The output excludes the 4th row in the original df because its loc count is greater than 2 (i.e. 3).
Does there exist a better method to perform this type of operation? Any help is greatly appreciated.
How about groupby and .head:
In [90]: df.groupby('loc').head(2)
Out[90]:
       loc ticker
0       us   aapl
1       us   mstf
2    china   baba
4  germany     db
Also, be careful with your column names, since loc clashes with the .loc method.
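An equivalent boolean-mask alternative (a sketch using groupby.cumcount, which numbers the rows within each group starting at 0; the reset_index matches the question's re-numbered output):
new_df = df[df.groupby('loc').cumcount() < 2].reset_index(drop=True)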
