I have a DataFrame with a two-level column index.
Reproducible dataset:
df = pd.DataFrame(
    [['Gaz', 'Gaz', 'Gaz', 'Gaz'],
     ['X', 'X', 'X', 'X'],
     ['Y', 'Y', 'Y', 'Y'],
     ['Z', 'Z', 'Z', 'Z']],
    columns=pd.MultiIndex.from_arrays([['A', 'A', 'C', 'D'],
                                       ['Name', 'Name', 'Company', 'Company']]))
I want to rename duplicated MultiIndex columns, but only when the combination of level 0 and level 1 is duplicated, by appending a numeric suffix to each level.
Below is a solution I found, but it only works for a single-level column index.
class renamer():
    def __init__(self):
        self.d = dict()

    def __call__(self, x):
        if x not in self.d:
            self.d[x] = 0
            return x
        else:
            self.d[x] += 1
            return "%s_%d" % (x, self.d[x])

df = df.rename(columns=renamer())
I think the above method can be modified to support the multi-level case, but I am too new to pandas/Python.
Thanks in advance.
@Datanovice
To clarify the output I need, here is a snippet:
import pandas as pd
import numpy as np
df = pd.DataFrame(
    [['Gaz', 'Gaz', 'Gaz', 'Gaz'],
     ['X', 'X', 'X', 'X'],
     ['Y', 'Y', 'Y', 'Y'],
     ['Z', 'Z', 'Z', 'Z']],
    columns=pd.MultiIndex.from_arrays([
        ['A', 'A', 'C', 'A'],
        ['A', 'A', 'C', 'A'],
        ['Company', 'Company', 'Company', 'Name']]))
s = pd.DataFrame(df.columns.tolist())
cond = s.groupby(0).cumcount()
s = [np.where(cond.gt(0), s[i] + '_' + cond.astype(str), s[i])
     for i in range(df.columns.nlevels)]
s = pd.DataFrame(s)
# print(s)
df.columns = pd.MultiIndex.from_arrays(s.values.tolist())
print(df)
With this snippet, the last column is also given a suffix.
What I need is for the last column not to be counted as a duplicate, because "A-A-Name" is not the same as the first two columns.
Thank you again.
There might be a better way to do this, but you could build a DataFrame from your columns, apply a conditional operation to it, and reassign the result.
df = pd.DataFrame(
    [['Gaz', 'Gaz', 'Gaz', 'Gaz'],
     ['X', 'X', 'X', 'X'],
     ['Y', 'Y', 'Y', 'Y'],
     ['Z', 'Z', 'Z', 'Z']],
    columns=pd.MultiIndex.from_arrays([['A', 'A', 'C', 'D'],
                                       ['Name', 'Name', 'Company', 'Company']]))
s = pd.DataFrame(df.columns.tolist())
cond = s.groupby([0,1]).cumcount()
s[0] = np.where(cond.gt(0),s[0] + '_' + cond.astype(str),s[0])
s[1] = np.where(cond.gt(0),s[1] + '_' + cond.astype(str),s[1])
df.columns = pd.MultiIndex.from_frame(s)
print(df)
0    A     A_1        C        D
1 Name  Name_1  Company  Company
0  Gaz     Gaz      Gaz      Gaz
1    X       X        X        X
2    Y       Y        Y        Y
3    Z       Z        Z        Z
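For the three-level follow-up case, where a column should count as a duplicate only when its full tuple repeats, the counter idea from the question's renamer can be applied to whole column tuples instead of single labels. This is a minimal sketch, assuming every level value formats cleanly as a string; dedupe_columns is a hypothetical helper name, not a pandas function.

```python
from collections import Counter

import pandas as pd


def dedupe_columns(columns):
    """Suffix every level of a column tuple that was seen before (hypothetical helper)."""
    seen = Counter()
    new_tuples = []
    for tup in columns:  # iterating a MultiIndex yields one tuple per column
        n = seen[tup]    # number of earlier occurrences of this exact tuple
        seen[tup] += 1
        new_tuples.append(tup if n == 0 else tuple(f"{level}_{n}" for level in tup))
    return pd.MultiIndex.from_tuples(new_tuples, names=columns.names)


columns = pd.MultiIndex.from_arrays([
    ['A', 'A', 'C', 'A'],
    ['A', 'A', 'C', 'A'],
    ['Company', 'Company', 'Company', 'Name']])
new_columns = dedupe_columns(columns)
print(list(new_columns))
```

Assigning df.columns = dedupe_columns(df.columns) applies it to a frame. Only the second ('A', 'A', 'Company') column is renamed; ('A', 'A', 'Name') is left alone because its full tuple is unique.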
Try this:
arrays = [['A', 'A', 'A', 'A', 'B', 'B', 'B', 'B'],['A', 'A', 'A', 'B', 'C', 'C', 'D', 'D']]
tuples = list(zip(*arrays))
index = pd.MultiIndex.from_tuples(tuples)
df = pd.DataFrame(np.random.randn(3, 8), columns=index)
   A           B
   A  A  A  B  C  C  D  D
0  0  0  1  3  1  2  1  4
1  0  1  1  1  1  3  0  1
2  1  1  4  2  3  2  1  4
suffix = pd.DataFrame(df.columns)
suffix['count'] = suffix.groupby(0).cumcount()
suffix['new'] = [((i[0]+'_'+str(j)),(i[1]+'_'+str(j))) for i,j in zip(suffix[0],suffix['count'])]
new_index = pd.MultiIndex.from_tuples(list(suffix['new']))
df.columns = new_index
I have a data frame and a dictionary like this:
thresholds = {'column':{'A':10,'B':11,'C':9}}
df:
Column
A 13
A 7
A 11
B 12
B 14
B 14
C 7
C 8
C 11
For every index group, I want to calculate the count of values less than the threshold and greater than the threshold value.
So my output looks like this:
df:
Values<Thr Values>Thr
A 1 2
B 0 3
C 2 1
Can anyone help me with this?
You can use:
import numpy as np
t = df.index.to_series().map(thresholds['column'])
out = (pd.crosstab(df.index, np.where(df['Column'].gt(t), 'Values>Thr', 'Values≤Thr'))
.rename_axis(index=None, columns=None)
)
Output:
Values>Thr Values≤Thr
A 2 1
B 3 0
C 1 2
Syntax variant:
out = (pd.crosstab(df.index, df['Column'].gt(t))
.rename_axis(index=None, columns=None)
.rename(columns={False: 'Values≤Thr', True: 'Values>Thr'})
)
To apply this to many columns, based on the key in the dictionary:
def count(s):
    # Look up the per-index thresholds for this column (empty dict if the column has none)
    t = s.index.to_series().map(thresholds.get(s.name, {}))
    return (pd.crosstab(s.index, s.gt(t))
              .rename_axis(index=None, columns=None)
              .rename(columns={False: 'Values≤Thr', True: 'Values>Thr'})
           )
out = pd.concat({c: count(df[c]) for c in df})
NB. The key of the dictionary must match exactly. I changed the case for the demo.
Output:
Values≤Thr Values>Thr
Column A 1 2
B 0 3
C 2 1
Here is another option:
import pandas as pd
df = pd.DataFrame({'Column': [13, 7, 11, 12, 14, 14, 7, 8, 11]})
df.index = ['A', 'A', 'A', 'B', 'B', 'B', 'C', 'C', 'C']
thresholds = {'column':{'A':10,'B':11,'C':9}}
df['smaller'] = df['Column'].groupby(df.index).transform(lambda x: x < thresholds['column'][x.name]).astype(int)
df['greater'] = df['Column'].groupby(df.index).transform(lambda x: x > thresholds['column'][x.name]).astype(int)
df.drop(columns=['Column'], inplace=True)
# group by index summing the greater and smaller columns
sums = df.groupby(df.index).sum()
sums
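The question asks for strictly smaller and strictly greater counts, so a value equal to its threshold belongs in neither column. A minimal sketch of that reading, assuming the same sample data and the lower-case 'column' key from the question; the names flags and out are only illustrative:

```python
import pandas as pd

df = pd.DataFrame({'Column': [13, 7, 11, 12, 14, 14, 7, 8, 11]},
                  index=['A', 'A', 'A', 'B', 'B', 'B', 'C', 'C', 'C'])
thresholds = {'column': {'A': 10, 'B': 11, 'C': 9}}

# Look up one threshold per row from the row's index label
t = df.index.map(thresholds['column']).to_numpy()
values = df['Column'].to_numpy()

# Boolean flags per row; summing booleans per group gives the counts
flags = pd.DataFrame({'Values<Thr': values < t, 'Values>Thr': values > t},
                     index=df.index)
out = flags.groupby(level=0).sum()
print(out)
```

Working on plain NumPy arrays sidesteps any alignment questions that a duplicated index might raise.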
My dataframe looks like this:
id column1 column2
a x l
a x n
a y n
b y l
b y m
Currently, I generate value counts with this:
def value_occurences(grouped, column_name):
    return (grouped[column_name].value_counts(normalize=False, dropna=False)
            .to_frame('count_'+column_name)
            .reset_index(level=1))
result = value_occurences(grouped, 'column1')
"""
>>>result
id column1 count_column1
a x 2
a y 1
b y 2
"""
And I need to count value occurrences in this format:
id column1 column2
a 'x:2; y:1' 'l:1; n:2'
b 'y:2' 'l:1; m:1'
How can I turn my result into that format?
I know this is not using Pandas, but it might still help you:
from collections import defaultdict
import pandas as pd
df = pd.DataFrame({'id': ['a', 'a', 'a', 'b', 'b'], 'column1': ['x', 'x', 'y', 'y', 'y'], 'column2': ['l', 'n', 'n', 'l', 'm']})
# id column1 column2
# 0 a x l
# 1 a x n
# 2 a y n
# 3 b y l
# 4 b y m
c1_counter = defaultdict(lambda: defaultdict(int))
c2_counter = defaultdict(lambda: defaultdict(int))
for idx, row in df.iterrows():
    c1_counter[row['id']][row['column1']] += 1
    c2_counter[row['id']][row['column2']] += 1
new_data = defaultdict(list)
for k, v in c1_counter.items():
    new_data['id'].append(k)
    c1_items = [f'{v_}:{f}' for v_, f in v.items()]
    c2_items = [f'{v_}:{f}' for v_, f in c2_counter[k].items()]
    new_data['column1'].append(';'.join(c1_items))
    new_data['column2'].append(';'.join(c2_items))
df = pd.DataFrame(new_data)
then df will look like:
id column1 column2
0 a x:2;y:1 l:1;n:2
1 b y:2 l:1;m:1
You can first group the DataFrame with df.groupby('id') and then apply value_counts to each group:
import io
import pandas as pd

def seqdict(x):
    # Format a {value: count} dict as 'value:count' pairs, sorted by value
    return ', '.join('{}:{}'.format(*i) for i in sorted(x.items()))

def value_occurences(df):
    # For each non-id column, map every id to its formatted value counts
    return pd.DataFrame({c: {i: seqdict(d[c].value_counts().to_dict())
                             for i, d in df.groupby('id')}
                         for c in df.columns if c != 'id'})

grouped = pd.read_table(io.StringIO("""id column1 column2
a x l
a x n
a y n
b y l
b y m
"""), sep=r'\s+')
value_occurences(grouped)
Results:
column1 column2
a x:2, y:1 l:1, n:2
b y:2 l:1, m:1
You can use groupby twice: first count the values, then join them together:
dfs = []
for column in ['column1', 'column2']:
    # Count each value per id
    df_ = df.groupby(['id'])[column].value_counts()
    # Build 'value:count' strings from the last index level and the counts
    df_ = df_.index.get_level_values(-1) + ':' + df_.astype(str)
    # Join the strings of each id into one cell
    df_ = df_.groupby('id').agg(lambda x: '; '.join(x)).rename(column)
    dfs.append(df_)
pd.concat(dfs, axis=1)
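The same result can also be sketched with a single groupby and a custom aggregation function. This assumes the sample frame from the question; counts_as_text is a hypothetical helper name, and values are sorted alphabetically inside each cell:

```python
import pandas as pd

df = pd.DataFrame({'id': ['a', 'a', 'a', 'b', 'b'],
                   'column1': ['x', 'x', 'y', 'y', 'y'],
                   'column2': ['l', 'n', 'n', 'l', 'm']})


def counts_as_text(values):
    """Turn one group's column into a 'value:count; value:count' string."""
    counts = values.value_counts().sort_index()
    return '; '.join(f'{value}:{count}' for value, count in counts.items())


# agg calls counts_as_text once per (group, column) pair
result = df.groupby('id')[['column1', 'column2']].agg(counts_as_text)
print(result)
```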
If I have a dataframe,
df = pd.DataFrame({
'name' : ['A', 'B', 'C'],
'john_01' : [1, 2, 3],
'mary_02' : [4,5,6],
})
I'd like to prepend a '#' mark to the name whenever the value in column 'name' appears in a list containing 'A' and 'B', so that the result looks like the one below. Does anyone know how to do this elegantly with pandas?
name_list = ['A','B','D'] # But we only have A and B in df.
john_01 mary_02 name
0 1 4 #A
1 2 5 #B
2 3 6 C
If name_list is the same length as the name Series, you could try this:
df1['name_list'] = ['A','B','D']
df1.loc[df1.name == df1.name_list, 'name'] = '#'+df1.name
This prepends a '#' only where name and name_list hold the same value at the same index.
In [81]: df1
Out[81]:
john_01 mary_02 name name_list
0 1 4 #A A
1 2 5 #B B
2 3 6 C D
In [82]: df1.drop('name_list', axis=1, inplace=True) # Drop assist column
If the two are not the same length, and you therefore don't care about the index, you could try this:
In [84]: name_list = ['A','B','D']
In [87]: df1.loc[df1.name.isin(name_list), 'name'] = '#'+df1.name
In [88]: df1
Out[88]:
john_01 mary_02 name
0 1 4 #A
1 2 5 #B
2 3 6 C
I hope this helps.
Use the df.loc[row_indexer, column_indexer] indexer together with the isin method of a Series:
df.loc[df.name.isin(name_list), 'name'] = '#'+df.name
print(df)
The output:
john_01 mary_02 name
0 1 4 #A
1 2 5 #B
2 3 6 C
http://pandas.pydata.org/pandas-docs/stable/indexing.html
You can use isin to check whether the name is in the list, and use numpy.where to prepend #:
df['name'] = np.where(df['name'].isin(name_list), '#', '') + df['name']
df
Out:
john_01 mary_02 name
0 1 4 #A
1 2 5 #B
2 3 6 C
import pandas as pd
def exclude_list(x):
    # Prefix the value with '#' when it appears in the list
    list_exclude = ['A', 'B']
    if x in list_exclude:
        x = '#' + x
    return x
df = pd.DataFrame({
'name' : ['A', 'B', 'C'],
'john_01' : [1, 2, 3],
'mary_02' : [4,5,6],
})
df['name'] = df['name'].apply(lambda row: exclude_list(row))
print(df)
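Another vectorized spelling uses Series.mask, which replaces values wherever a condition is True and leaves the rest untouched. A minimal sketch, assuming the sample frame and name_list from the question; in_list is just an illustrative variable name:

```python
import pandas as pd

df = pd.DataFrame({
    'name': ['A', 'B', 'C'],
    'john_01': [1, 2, 3],
    'mary_02': [4, 5, 6],
})
name_list = ['A', 'B', 'D']  # 'D' does not occur in df and is simply ignored

# True for rows whose name appears in the list
in_list = df['name'].isin(name_list)
# Replace only those rows with the '#'-prefixed name
df['name'] = df['name'].mask(in_list, '#' + df['name'])
print(df)
```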
I need to change individual elements in a DataFrame. I tried doing something like this, but it doesn't work:
for index, row in df.iterrows():
    if df.at[row, index] == 'something':
        df.at[row, index] = df.at[row, index] + 'add a string'
    else:
        df.at[row, index] = df.at[row, index] + 'add a value'
How can I do that?
If you need to modify all columns in the DataFrame, use numpy.where with the DataFrame constructor, because where returns a NumPy array:
df = pd.DataFrame(np.where(df == 'something', df + 'add a string', df + 'add a value'),
index=df.index,
columns=df.columns)
If you only need to modify one column, col:
df['col'] = np.where(df['col'] == 'something',
df['col'] + 'add a string',
df['col'] + 'add a value')
Sample:
df = pd.DataFrame({'col': ['a', 'b', 'a'], 'col1': ['a', 'b', 'b']})
print (df)
col col1
0 a a
1 b b
2 a b
df = pd.DataFrame(np.where(df == 'a', df + 'add a string', df + 'add a value'),
index=df.index,
columns=df.columns)
print (df)
col col1
0 aadd a string aadd a string
1 badd a value badd a value
2 aadd a string badd a value
df['col'] = np.where(df['col'] == 'a',
df['col'] + 'add a string',
df['col'] + 'add a value')
print (df)
col col1
0 aadd a string a
1 badd a value b
2 aadd a string b
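If an explicit loop is still preferred, the attempt in the question can be repaired: iterrows yields (index, row) pairs, and df.at expects a row label followed by a column label, so the column has to be looped over as well. A minimal sketch on the sample frame above, using 'a' as the value to test for:

```python
import pandas as pd

df = pd.DataFrame({'col': ['a', 'b', 'a'], 'col1': ['a', 'b', 'b']})

for index, row in df.iterrows():
    for col in df.columns:
        # df.at takes (row label, column label), in that order
        if row[col] == 'a':
            df.at[index, col] = row[col] + 'add a string'
        else:
            df.at[index, col] = row[col] + 'add a value'
print(df)
```

This gives the same frame as the numpy.where version, but it visits every cell in Python, so it will be slower on large frames.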
You can use .loc and apply a function like this:
import pandas as pd
D = pd.DataFrame({'A': ['a', 'b', 3,7,'b','a'], 'B': ['a', 'b', 3,7,'b','a']})
D.loc[D.index%2 == 0,'A'] = D.loc[D.index%2 == 0,'A'].apply(lambda s: s+'x' if isinstance(s,str) else s+1)
D.loc[D.index[2:5],'B'] = D.loc[D.index[2:5],'B'].apply(lambda s: s+'y' if isinstance(s,str) else s-1)
The first example appends x to each string, or adds 1 to each non-string, in column A for every even index.
The second example appends y to each string, or subtracts 1 from each non-string, in column B for the indices 2, 3 and 4.
Original Frame:
A B
0 a a
1 b b
2 3 3
3 7 7
4 b b
5 a a
Modified Frame:
A B
0 ax a
1 b b
2 4 2
3 7 6
4 bx by
5 a a
I have the following dataframe:
a b x y
0 1 2 3 -1
1 2 4 6 -2
2 3 6 9 -3
3 4 8 12 -4
How can I move columns b and x such that they are the last 2 columns in the dataframe? I would like to specify b and x by name, but not the other columns.
You can rearrange columns directly by specifying their order:
df = df[['a', 'y', 'b', 'x']]
In the case of larger dataframes where the column titles are dynamic, you can use a list comprehension to select every column not in your target set and then append the target set to the end.
>>> df[[c for c in df if c not in ['b', 'x']]
+ ['b', 'x']]
a y b x
0 1 -1 2 3
1 2 -2 4 6
2 3 -3 6 9
3 4 -4 8 12
To make it more bulletproof, you can ensure that your target columns are indeed in the dataframe:
cols_at_end = ['b', 'x']
df = df[[c for c in df if c not in cols_at_end]
+ [c for c in cols_at_end if c in df]]
cols = list(df.columns.values) #Make a list of all of the columns in the df
cols.pop(cols.index('b')) #Remove b from list
cols.pop(cols.index('x')) #Remove x from list
df = df[cols+['b','x']] #Create new dataframe with columns in the order you want
For example, to move column "name" to be the first column in df you can use insert:
column_to_move = df.pop("name")
# insert column with insert(location, column_name, column_value)
df.insert(0, "name", column_to_move)
Similarly, if you want this column to be, e.g., the third column from the beginning:
df.insert(2, "name", column_to_move)
You can use the approach below. It's very simple, and similar to the good answer given by Charlie Haley.
df1 = df.pop('b') # remove column b and store it in df1
df2 = df.pop('x') # remove column x and store it in df2
df['b'] = df1 # add the b Series back as a 'new' column
df['x'] = df2 # add the x Series back as a 'new' column
Now your dataframe has the columns 'b' and 'x' at the end. You can see this video from OSPY: https://youtu.be/RlbO27N3Xg4
Similar to ROBBAT1's answer above, but hopefully a bit more robust:
df.insert(len(df.columns)-1, 'b', df.pop('b'))
df.insert(len(df.columns)-1, 'x', df.pop('x'))
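These one-liners work because Python evaluates arguments left to right: len(df.columns)-1 is computed before df.pop(...) removes the column, so it equals the column count after the pop, which is the end position. A sketch that makes the order explicit, assuming the sample frame from the question:

```python
import pandas as pd

df = pd.DataFrame({'a': [1, 2, 3, 4],
                   'b': [2, 4, 6, 8],
                   'x': [3, 6, 9, 12],
                   'y': [-1, -2, -3, -4]})

for name in ['b', 'x']:
    column = df.pop(name)                     # df now has one column fewer
    df.insert(len(df.columns), name, column)  # insert at the new end position
print(df)
```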
This function reorders your columns without losing data. Any columns you do not mention stay in the middle, in their original order:
def reorder_columns(columns, first_cols=(), last_cols=(), drop_cols=()):
    # Filter with a list comprehension so the remaining columns keep their
    # original order (a plain set difference would not preserve it)
    excluded = set(first_cols) | set(last_cols) | set(drop_cols)
    middle = [c for c in columns if c not in excluded]
    new_order = list(first_cols) + middle + list(last_cols)
    return new_order
Example usage:
my_list = ['first', 'second', 'third', 'fourth', 'fifth', 'sixth']
reorder_columns(my_list, first_cols=['fourth', 'third'], last_cols=['second'], drop_cols=['fifth'])
# Output:
['fourth', 'third', 'first', 'sixth', 'second']
To assign to your dataframe, use:
my_list = df.columns.tolist()
reordered_cols = reorder_columns(my_list, first_cols=['fourth', 'third'], last_cols=['second'], drop_cols=['fifth'])
df = df[reordered_cols]
Simple solution:
old_cols = df.columns.values
new_cols= ['a', 'y', 'b', 'x']
df = df.reindex(columns=new_cols)
An alternative, more generic method:
from pandas import DataFrame
def move_columns(df: DataFrame, cols_to_move: list, new_index: int) -> DataFrame:
    """
    This method re-arranges the columns in a dataframe to place the desired columns at the desired index.
    ex Usage: df = move_columns(df, ['Rev'], 2)
    :param df:
    :param cols_to_move: The names of the columns to move. They must be a list
    :param new_index: The 0-based location to place the columns.
    :return: Return a dataframe with the columns re-arranged
    """
    other = [c for c in df if c not in cols_to_move]
    start = other[0:new_index]
    end = other[new_index:]
    return df[start + cols_to_move + end]
You can use pd.Index.difference with np.hstack, then reindex or use label-based indexing. Pass sort=False so that the remaining columns keep their original order instead of being sorted. In general, it's a good idea to avoid list comprehensions or other explicit loops with NumPy / pandas objects.
cols_to_move = ['b', 'x']
new_cols = np.hstack((df.columns.difference(cols_to_move, sort=False), cols_to_move))
# OPTION 1: reindex
df = df.reindex(columns=new_cols)
# OPTION 2: direct label-based indexing
df = df[new_cols]
# OPTION 3: loc label-based indexing
df = df.loc[:, new_cols]
print(df)
# a y b x
# 0 1 -1 2 3
# 1 2 -2 4 6
# 2 3 -3 6 9
# 3 4 -4 8 12
You can use movecolumn package in Python to move columns:
pip install movecolumn
Then you can write your code as:
import movecolumn as mc
mc.MoveToLast(df,'b')
mc.MoveToLast(df,'x')
Hope that helps.
P.S : The package can be found here. https://pypi.org/project/movecolumn/
You can also do this as a one-liner:
df.drop(columns=['b', 'x']).assign(b=df['b'], x=df['x'])
To move any column to the last position of the dataframe:
df = df[[col for col in df.columns if col != 'col_name_to_moved'] + ['col_name_to_moved']]
To move any column to the first position of the dataframe:
df = df[['col_name_to_moved'] + [col for col in df.columns if col != 'col_name_to_moved']]
where col_name_to_moved is the name of the column that you want to move.
I use a Pokémon database as an example. The columns of my database are
['Name', '#', 'Type 1', 'Type 2', 'Total', 'HP', 'Attack', 'Defense', 'Sp. Atk', 'Sp. Def', 'Speed', 'Generation', 'Legendary']
Here is the code:
import pandas as pd
df = pd.read_html('https://gist.github.com/armgilles/194bcff35001e7eb53a2a8b441e8b2c6')[0]
cols = df.columns.to_list()
cos_end = ["Name", "Total", "HP", "Defense"]
# Pop each target column and append it, so the targets end up last, in the given order
for name in cos_end:
    cols.append(cols.pop(cols.index(name)))
print(cols)
df = df.reindex(columns=cols)
print(df)