Pandas: loop over variables, adding a suffix and transforming the original column - python

I would like to loop over some variable names and the equivalent columns with an added "_plus" suffix
import pandas as pd

# original dataset
raw_data = {'time': [2,1,4,2],
            'zone': [5,1,3,0],
            'time_plus': [5,6,2,3],
            'zone_plus': [0,9,6,5]}
df = pd.DataFrame(raw_data, columns=['time','zone','time_plus','zone_plus'])
df
# desired dataset
df['time'] = df['time'] * df['time_plus']
df['zone'] = df['zone'] * df['zone_plus']
df
I would like to do the multiplication in a more elegant way, through a loop, since I have many variables with this pattern: original name * transformed variable with the _plus suffix
something similar to this or better:
my_list = ['time','zone']
for i in my_list:
    df[i] = df[i] * df[i+"_plus"]

Try:
# iterate over the columns whose names do not end in "_plus"
for c in df.filter(regex=r".*(?<!_plus)$", axis=1):
    df[c] *= df[c + "_plus"]
print(df)
Prints:
time zone time_plus zone_plus
0 10 0 5 0
1 6 9 6 9
2 8 18 2 6
3 6 0 3 5
Or:
for c in df.columns:
    if not c.endswith("_plus"):
        df[c] *= df[c + "_plus"]

raw_data = {'time': [2,1,4,2],
            'zone': [5,1,3,0],
            'time_plus': [5,6,2,3],
            'zone_plus': [0,9,6,5]}
df = pd.DataFrame(raw_data, columns=['time','zone','time_plus','zone_plus'])
# Take every column that doesn't have a "_plus" suffix
cols = [i for i in df.columns if "_plus" not in i]
# Calculate new columns
for col in cols:
    df[col + "_2"] = df[col] * df[col + "_plus"]
I decided to create the new columns with a "_2" suffix, this way we don't mess up the original data.

for c in df.columns:
    # multiply only when a matching "_plus" column actually exists
    if f"{c}_plus" in df.columns:
        df[c] *= df[f"{c}_plus"]
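If you want to avoid the Python-level loop entirely, here is a vectorized sketch (assuming, as in the sample data, that every base column has a matching "_plus" partner):
import pandas as pd

df = pd.DataFrame({'time': [2,1,4,2], 'zone': [5,1,3,0],
                   'time_plus': [5,6,2,3], 'zone_plus': [0,9,6,5]})
base = [c for c in df.columns if not c.endswith("_plus")]
plus = [c + "_plus" for c in base]
# .values strips the column labels so the two blocks multiply positionally
df[base] = df[base].values * df[plus].values
print(df)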

Related

Python pandas groupby agg- sum one column while getting the mean of the rest

Looking to group my fields based on date, and get a mean of all the columns except a binary column which I want to sum in order to get a count.
I know I can do this by:
newdf=df.groupby('date').agg({'var_a': 'mean', 'var_b': 'mean', 'var_c': 'mean', 'binary_var':'sum'})
But there are about 50 columns (other than the binary one) that I want to average, and I feel there must be a simpler, quicker way of doing this instead of writing 'column title': 'mean' for every one of them. I've tried making a list of column names, but when I put it in the agg function it says a list is an unhashable type.
Thanks!
Something like this might work -
df = pd.DataFrame({'a': ['a','a','b','b','b','b'], 'b': [10,20,30,40,20,10],
                   'c': [1,1,0,0,0,1], 'd': [20,30,10,15,34,10]})
df
a b c d
0 a 10 1 20
1 a 20 1 30
2 b 30 0 10
3 b 40 0 15
4 b 20 0 34
5 b 10 1 10
Assuming c is the binary variable column, then:
cols = [val for val in df.columns if val not in ('a', 'c')]  # exclude the groupby key and the binary column
temp = pd.concat([df.groupby(['a'])[cols].mean(), df.groupby(['a'])['c'].sum()], axis=1).reset_index()
temp
a b d c
0 a 15.0 25.00 2
1 b 25.0 17.25 1
In general, I would build the agg dict automatically:
sum_cols = ['binary_var']
agg_dict = {col: 'sum' if col in sum_cols else 'mean'
            for col in df.columns if col != 'date'}
df.groupby('date').agg(agg_dict)
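As a quick sanity check, a minimal sketch (the data is invented for illustration, reusing the question's column names):
import pandas as pd

df = pd.DataFrame({'date': ['d1', 'd1', 'd2'],
                   'var_a': [1.0, 3.0, 5.0],
                   'var_b': [2.0, 4.0, 6.0],
                   'binary_var': [1, 0, 1]})
sum_cols = ['binary_var']
agg_dict = {col: 'sum' if col in sum_cols else 'mean'
            for col in df.columns if col != 'date'}
# agg_dict == {'var_a': 'mean', 'var_b': 'mean', 'binary_var': 'sum'}
print(df.groupby('date').agg(agg_dict))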

Pandas groupby different aggregation values with big dataframe

I have a dataframe with 700+ columns. I am doing a groupby with one column, let's say df.a, and I want to aggregate every column by mean except the last 10, which I want to aggregate by max. I am aware of creating a conditional dictionary and then passing it in like this:
d = {'DATE': 'last', 'AE_NAME': 'last', 'ANSWERED_CALL': 'sum'}
res = df.groupby(df.a).agg(d)
However, with so many columns, I do not want to have to write this all out. Is there a quick way to do this?
You could use zip and some not really elegant code imo, but it works:
cols = df.drop("a", axis=1).columns  # drop groupby column since it's not in agg
len_means = len(cols[:-10])  # all cols except the last ten
len_max = len(cols[-10:])    # the last ten cols
d_means = {i: j for i, j in zip(cols[:-10], ["mean"] * len_means)}
d_max = {i: j for i, j in zip(cols[-10:], ["max"] * len_max)}
d = {**d_means, **d_max}  # dict.update returns None, so merge instead
res = df.groupby(df.a).agg(d)
Edit: since OP mentioned the columns are named differently (ending with the letter c), then:
c_cols = [col for col in df.columns if col.endswith('c')]
non_c_cols = [col for col in df.columns if col not in c_cols]
and one only needs to plug these cols into the code above to get the result.
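Alternatively, a single dict comprehension builds the whole mapping in one step (a sketch, reusing df from the question and assuming the last ten columns take the max):
cols = df.drop("a", axis=1).columns  # drop the groupby column
d = {c: 'max' if c in cols[-10:] else 'mean' for c in cols}
res = df.groupby(df.a).agg(d)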
I would approach this problem as follows:
1. Define a cutoff for which columns to select
2. Select the columns you need
3. Create both your mean and max aggregations with GroupBy
4. Join both dataframes together:
# example dataframe
import numpy as np
import pandas as pd

df = pd.DataFrame(np.random.rand(5,10), columns=list('abcdefghij'))
df.insert(0, 'ID', ['aaa', 'bbb', 'aaa', 'ccc', 'bbb'])
ID a b c d e f g h i j
0 aaa 0.228208 0.822641 0.407747 0.416335 0.039717 0.854789 0.108124 0.666190 0.074569 0.329419
1 bbb 0.285293 0.274654 0.507607 0.527335 0.599833 0.511760 0.747992 0.930221 0.396697 0.959254
2 aaa 0.844373 0.431420 0.083631 0.656162 0.511913 0.486187 0.955340 0.130358 0.759013 0.181874
3 ccc 0.259888 0.992480 0.365106 0.041288 0.833069 0.474904 0.212645 0.178981 0.595891 0.143127
4 bbb 0.823457 0.172947 0.907415 0.719616 0.632012 0.199703 0.672745 0.563852 0.120827 0.092455
cutoff = 7  # 'ID' plus columns a-f take the mean; g-j take the max
mean_cols = df.columns[:cutoff]
max_cols = ['ID'] + df.columns[cutoff:].tolist()
df1 = df[mean_cols].groupby('ID').mean()
df2 = df[max_cols].groupby('ID').max()
df = df1.join(df2).reset_index()
ID a b c d e f g h i j
0 aaa 0.536290 0.627031 0.245689 0.536248 0.275815 0.670488 0.955340 0.666190 0.759013 0.329419
1 bbb 0.554375 0.223800 0.707511 0.623476 0.615923 0.355732 0.747992 0.930221 0.396697 0.959254
2 ccc 0.259888 0.992480 0.365106 0.041288 0.833069 0.474904 0.212645 0.178981 0.595891 0.143127
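The join can also be avoided by building a single agg dict from the same cutoff (a sketch against the original example frame, before it is overwritten by the join above):
value_cols = df.columns.drop('ID')
agg_dict = {c: 'mean' if i < cutoff - 1 else 'max'  # cutoff - 1 because 'ID' is dropped
            for i, c in enumerate(value_cols)}
out = df.groupby('ID').agg(agg_dict).reset_index()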

Run a function that requires multiple arguments through multiple columns - Pandas

Hi, I currently have a function that can split values in the same cell delimited by a newline. However, the function below only lets me pass one column at a time; I was thinking whether there is any other way to pass it multiple columns, or in fact the whole dataframe.
A sample would be like this
A B C
1\n2\n3 2\n5 A
The code is below
def tidy_split(df, column, sep='|', keep=False):
    indexes = list()
    new_values = list()
    df = df.dropna(subset=[column])
    for i, presplit in enumerate(df[column].astype(str)):
        values = presplit.split(sep)
        if keep and len(values) > 1:
            indexes.append(i)
            new_values.append(presplit)
        for value in values:
            indexes.append(i)
            new_values.append(value)
    new_df = df.iloc[indexes, :].copy()
    new_df[column] = new_values
    return new_df
It currently works when I run
df1 = tidy_split(df, 'A', '\n')
After running the function, only column A is split:
A B C
1 2\n5 A
2 2\n5 A
3 2\n5 A
I was hoping to be able to pass in more than one column, in this case splitting column 'B' as well. Previously I attempted passing in a lambda or using apply, but the function requires a positional argument, 'column'. Would appreciate any help given! Was thinking if a loop is possible.
EDIT: Desired output, as each number refers to something important
Before
A B C
1\n2\n3 2\n5 A
After
A B C
1 2 A
2 5 A
3 n/a A
Input:
A B C
0 1\n2\n3 2\n5 A
Code:
import pandas as pd

cols = df.columns.tolist()
# create list in each cell by detecting '\n'
for col in cols:
    df[col] = df[col].apply(lambda x: str(x).split("\n"))
# empty dataframe to store result
dfs = pd.DataFrame()
# loop over rows to construct small dataframes
# and then accumulate each into the resulting dataframe
for ind, row in df.iterrows():
    a_vals = row['A']
    b_vals = row['B'] + ["n/a"] * (len(a_vals) - len(row['B']))
    c_vals = row['C'] + [row['C'][0]] * (len(a_vals) - len(row['C']))
    temp = pd.DataFrame({'A': a_vals, 'B': b_vals, 'C': c_vals})
    dfs = pd.concat([dfs, temp], axis=0, ignore_index=True)
Output:
A B C
0 1 2 A
1 2 5 A
2 3 n/a A
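On newer pandas (multi-column explode requires 1.3+), a shorter sketch of the same idea: split every cell into a list, pad each row's lists to a common length, then explode all columns at once. The padding rule mirrors the answer above: single-value cells repeat, shorter lists fill with "n/a".
import pandas as pd

df = pd.DataFrame({'A': ['1\n2\n3'], 'B': ['2\n5'], 'C': ['A']})

lists = df.apply(lambda s: s.astype(str).str.split('\n'))  # each cell becomes a list
width = lists.applymap(len).max(axis=1)                    # longest list per row

def pad(values, w):
    # repeat single-value cells; pad shorter lists with 'n/a'
    return values * w if len(values) == 1 else values + ['n/a'] * (w - len(values))

for col in lists.columns:
    lists[col] = [pad(v, w) for v, w in zip(lists[col], width)]

out = lists.explode(list(lists.columns), ignore_index=True)
print(out)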

move column in pandas dataframe

I have the following dataframe:
a b x y
0 1 2 3 -1
1 2 4 6 -2
2 3 6 9 -3
3 4 8 12 -4
How can I move columns b and x such that they are the last 2 columns in the dataframe? I would like to specify b and x by name, but not the other columns.
You can rearrange columns directly by specifying their order:
df = df[['a', 'y', 'b', 'x']]
In the case of larger dataframes where the column titles are dynamic, you can use a list comprehension to select every column not in your target set and then append the target set to the end.
>>> df[[c for c in df if c not in ['b', 'x']] + ['b', 'x']]
a y b x
0 1 -1 2 3
1 2 -2 4 6
2 3 -3 6 9
3 4 -4 8 12
To make it more bulletproof, you can ensure that your target columns are indeed in the dataframe:
cols_at_end = ['b', 'x']
df = df[[c for c in df if c not in cols_at_end]
        + [c for c in cols_at_end if c in df]]
cols = list(df.columns.values) #Make a list of all of the columns in the df
cols.pop(cols.index('b')) #Remove b from list
cols.pop(cols.index('x')) #Remove x from list
df = df[cols+['b','x']] #Create new dataframe with columns in the order you want
For example, to move column "name" to be the first column in df you can use insert:
column_to_move = df.pop("name")
# insert column with insert(location, column_name, column_value)
df.insert(0, "name", column_to_move)
Similarly, if you want this column to be, e.g., the third column from the beginning:
df.insert(2, "name", column_to_move)
You can use the approach below. It's very simple, but similar to the good answer given by Charlie Haley.
df1 = df.pop('b')  # remove column b and store it in df1
df2 = df.pop('x')  # remove column x and store it in df2
df['b'] = df1  # add b series back as a 'new' column
df['x'] = df2  # add x series back as a 'new' column
Now you have your dataframe with the columns 'b' and 'x' at the end. You can see this video from OSPY: https://youtu.be/RlbO27N3Xg4
Similar to ROBBAT1's answer above, but hopefully a bit more robust:
# len(df.columns)-1 is evaluated before pop removes the column,
# so each insert lands at the final position
df.insert(len(df.columns)-1, 'b', df.pop('b'))
df.insert(len(df.columns)-1, 'x', df.pop('x'))
This function will reorder your columns without losing data. Any omitted columns remain in the middle of the data set, in their original order:
def reorder_columns(columns, first_cols=[], last_cols=[], drop_cols=[]):
    # keep the middle columns in their original order (set arithmetic would scramble it)
    exclude = set(first_cols) | set(last_cols) | set(drop_cols)
    middle = [c for c in columns if c not in exclude]
    new_order = first_cols + middle + last_cols
    return new_order
Example usage:
my_list = ['first', 'second', 'third', 'fourth', 'fifth', 'sixth']
reorder_columns(my_list, first_cols=['fourth', 'third'], last_cols=['second'], drop_cols=['fifth'])
# Output:
['fourth', 'third', 'first', 'sixth', 'second']
To assign to your dataframe, use:
my_list = df.columns.tolist()
reordered_cols = reorder_columns(my_list, first_cols=['fourth', 'third'], last_cols=['second'], drop_cols=['fifth'])
df = df[reordered_cols]
Simple solution:
new_cols = ['a', 'y', 'b', 'x']
df = df.reindex(columns=new_cols)
An alternative, more generic method:
from pandas import DataFrame

def move_columns(df: DataFrame, cols_to_move: list, new_index: int) -> DataFrame:
    """
    Re-arrange the columns in a dataframe to place the desired columns at the desired index.
    ex Usage: df = move_columns(df, ['Rev'], 2)
    :param df: The dataframe to re-arrange
    :param cols_to_move: The names of the columns to move, as a list
    :param new_index: The 0-based location to place the columns
    :return: A dataframe with the columns re-arranged
    """
    other = [c for c in df if c not in cols_to_move]
    start = other[0:new_index]
    end = other[new_index:]
    return df[start + cols_to_move + end]
You can use pd.Index.difference with np.hstack, then reindex or use label-based indexing. In general, it's a good idea to avoid list comprehensions or other explicit loops with NumPy / Pandas objects.
import numpy as np

cols_to_move = ['b', 'x']
# note: Index.difference returns the remaining labels sorted alphabetically
new_cols = np.hstack((df.columns.difference(cols_to_move), cols_to_move))
# OPTION 1: reindex
df = df.reindex(columns=new_cols)
# OPTION 2: direct label-based indexing
df = df[new_cols]
# OPTION 3: loc label-based indexing
df = df.loc[:, new_cols]
print(df)
# a y b x
# 0 1 -1 2 3
# 1 2 -2 4 6
# 2 3 -3 6 9
# 3 4 -4 8 12
You can use the movecolumn package in Python to move columns:
pip install movecolumn
Then you can write your code as:
import movecolumn as mc
mc.MoveToLast(df,'b')
mc.MoveToLast(df,'x')
Hope that helps.
P.S.: The package can be found here: https://pypi.org/project/movecolumn/
You can also do this as a one-liner:
df.drop(columns=['b', 'x']).assign(b=df['b'], x=df['x'])
This will move any column to the last column of the dataframe:
df = df[[col for col in df.columns if col != 'col_name_to_moved'] + ['col_name_to_moved']]
Move any column to the first column of the dataframe:
df = df[['col_name_to_moved'] + [col for col in df.columns if col != 'col_name_to_moved']]
where col_name_to_moved is the column that you want to move.
I use the Pokémon database as an example; the columns of my database are
['Name', '#', 'Type 1', 'Type 2', 'Total', 'HP', 'Attack', 'Defense', 'Sp. Atk', 'Sp. Def', 'Speed', 'Generation', 'Legendary']
Here is the code:
import pandas as pd

df = pd.read_html('https://gist.github.com/armgilles/194bcff35001e7eb53a2a8b441e8b2c6')[0]
cols = df.columns.to_list()
cols_end = ["Name", "Total", "HP", "Defense"]
for i, j in enumerate(cols_end, start=(len(cols) - len(cols_end))):
    cols.insert(i, cols.pop(cols.index(j)))
print(cols)
df = df.reindex(columns=cols)
print(df)

Pandas read multiindexed csv with blanks

I'm struggling to properly load a csv that has a multi-line header with blanks. The CSV looks like this:
,,C,,,D,,
A,B,X,Y,Z,X,Y,Z
1,2,3,4,5,6,7,8
What I would like to get is: (screenshot omitted; essentially a MultiIndex with C and D spanning their X, Y, Z subcolumns)
When I try to load with pd.read_csv(file, header=[0,1], sep=','), I end up with the following: (screenshot omitted; the blanks come back as "Unnamed: ..." entries)
Is there a way to get the desired result?
Note: alternatively, I would accept this as a result: (screenshot omitted)
Versions used:
Python: 2.7.8
Pandas 0.16.0
Here is an automated way to fix the column index. First, pull the column level values into a DataFrame:
columns = pd.DataFrame(df.columns.tolist())
then rename the Unnamed: columns to NaN:
columns.loc[columns[0].str.startswith('Unnamed:'), 0] = np.nan
and then forward-fill the NaNs:
columns[0] = columns[0].fillna(method='ffill')
so that columns now looks like
In [314]: columns
Out[314]:
0 1
0 NaN A
1 NaN B
2 C X
3 C Y
4 C Z
5 D X
6 D Y
7 D Z
Now we can find the remaining NaNs and fill them with empty strings:
mask = pd.isnull(columns[0])
columns[0] = columns[0].fillna('')
To make the first two columns, A and B, indexable as df['A'] and df['B'] -- as though they were single-leveled -- you could swap the values in the first and second columns:
columns.loc[mask, [0,1]] = columns.loc[mask, [1,0]].values
Now you can build a new MultiIndex and assign it to df.columns:
df.columns = pd.MultiIndex.from_tuples(columns.to_records(index=False).tolist())
Putting it all together, if data is
,,C,,,D,,
A,B,X,Y,Z,X,Y,Z
1,2,3,4,5,6,7,8
3,4,5,6,7,8,9,0
then
import numpy as np
import pandas as pd
df = pd.read_csv('data', header=[0,1], sep=',')
columns = pd.DataFrame(df.columns.tolist())
columns.loc[columns[0].str.startswith('Unnamed:'), 0] = np.nan
columns[0] = columns[0].fillna(method='ffill')
mask = pd.isnull(columns[0])
columns[0] = columns[0].fillna('')
columns.loc[mask, [0,1]] = columns.loc[mask, [1,0]].values
df.columns = pd.MultiIndex.from_tuples(columns.to_records(index=False).tolist())
print(df)
yields
A B C D
X Y Z X Y Z
0 1 2 3 4 5 6 7 8
1 3 4 5 6 7 8 9 0
There is no magical way of making pandas aware of how you want your index to look; the closest you can get is by specifying a lot yourself, like this:
names = ['A', 'B',
         ('C','X'), ('C','Y'), ('C','Z'),
         ('D','X'), ('D','Y'), ('D','Z')]
pd.read_csv(file, mangle_dupe_cols=True,
            header=1, names=names, index_col=[0, 1])
Gives:
C D
X Y Z X Y Z
A B
1 2 3 4 5 6 7 8
To do this in a dynamic fashion, you could read the first two lines of the CSV as they are and loop through the columns you get to generate the names variable dynamically before loading the full dataset.
pd.read_csv(file, nrows=1, header=[0,1], index_col=[0, 1])
Then access the columns and loop to create your header.
Again, not a very clean solution, but should work.
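A minimal sketch of that dynamic approach (assuming file is the CSV path and the header layout shown above):
import pandas as pd

# read just the header rows to inspect the raw column tuples
header = pd.read_csv(file, nrows=1, header=[0, 1])

names, level0 = [], None
for top, bottom in header.columns:
    if not top.startswith('Unnamed:'):
        level0 = top  # forward-fill the top level
    names.append(bottom if level0 is None else (level0, bottom))

# re-read the full data with the reconstructed names
df = pd.read_csv(file, header=1, names=names, index_col=[0, 1])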
You can read it using:
df = pd.read_csv('file.csv', header=[0, 1], skipinitialspace=True, tupleize_cols=True)
and then
df.columns = pd.MultiIndex.from_tuples(df.columns)
(note that tupleize_cols has been removed in later pandas versions, so this only works on older releases)
Load the dataframe, with multiindex:
df = pd.read_csv(filelist,header=[0,1], sep=',')
Write a function to replace the index:
def replace_index(df):
    arr = df.columns.values
    l = [list(x) for x in arr]
    for i in range(len(l)):
        if l[i][0][:7] == 'Unnamed':
            if l[i-1][0][:7] != 'Unnamed':
                l[i][0] = l[i-1][0]
    for i in range(len(l)):
        if l[i][0][:7] == 'Unnamed':
            l[i][0] = l[i][1]
            l[i][1] = ''
    index = pd.MultiIndex.from_tuples(l)
    df.columns = index
    return df
Return the new dataframe properly indexed:
replace_index(df)
I used a technique to flatten the multi-index columns into a single level. It works well for me.
your_df.columns = ['_'.join(col).strip() for col in your_df.columns.values]
Import your csv file providing the header row indexes:
df = pd.read_csv('file.csv', header=[0, 1, 2])
Then, you can iterate over each column header, clean it up, assign it to a tuple, then re-assign the dataframe columns using pd.MultiIndex.from_tuples(list_of_tuples):
df.columns = pd.MultiIndex.from_tuples(
    [tuple(['' if y.find('Unnamed') == 0 else y for y in x]) for x in df.columns]
)
This is the quick one-liner I was looking for when trying to figure this out.
