Renaming columns on slice of dataframe not performing as expected - python

I was trying to clean up column names in a dataframe but only a part of the columns.
It doesn't work when trying to replace column names on a slice of the dataframe somehow, why is that?
Lets say we have the following dataframe:
Note, on the bottom is copy-able code to reproduce the data:
Value ColAfjkj ColBhuqwa ColCouiqw
0 1 a e i
1 2 b f j
2 3 c g k
3 4 d h l
I want to clean up the column names (expected output):
Value ColA ColB ColC
0 1 a e i
1 2 b f j
2 3 c g k
3 4 d h l
Approach 1:
I can get the clean column names like this:
df.iloc[:, 1:].columns.str[:4]
Index(['ColA', 'ColB', 'ColC'], dtype='object')
Or
Approach 2:
s = df.iloc[:, 1:].columns
[col[:4] for col in s]
['ColA', 'ColB', 'ColC']
But when I try to overwrite the column names, nothing happens:
df.iloc[:, 1:].columns = df.iloc[:, 1:].columns.str[:4]
Value ColAfjkj ColBhuqwa ColCouiqw
0 1 a e i
1 2 b f j
2 3 c g k
3 4 d h l
Same for the second approach:
s = df.iloc[:, 1:].columns
cols = [col[:4] for col in s]
df.iloc[:, 1:].columns = cols
Value ColAfjkj ColBhuqwa ColCouiqw
0 1 a e i
1 2 b f j
2 3 c g k
3 4 d h l
This does work, but you have to manually concat the name of the first column, which is not ideal:
df.columns = ['Value'] + df.iloc[:, 1:].columns.str[:4].tolist()
Value ColA ColB ColC
0 1 a e i
1 2 b f j
2 3 c g k
3 4 d h l
Is there an easier way to achieve this? Am I missing something?
Dataframe for reproduction:
df = pd.DataFrame({'Value':[1,2,3,4],
'ColAfjkj':['a', 'b', 'c', 'd'],
'ColBhuqwa':['e', 'f', 'g', 'h'],
'ColCouiqw':['i', 'j', 'k', 'l']})

This is because pandas' index is immutable. If you check the documentation for class pandas.Index, you'll see that it is defined as:
Immutable ndarray implementing an ordered, sliceable set
So in order to modify it you'll have to create a new list of column names, for instance with:
df.columns = [df.columns[0]] + list(df.iloc[:, 1:].columns.str[:4])
Another option is to use rename with a dictionary containing the columns to replace:
df.rename(columns=dict(zip(df.columns[1:], df.columns[1:].str[:4])))

To overwrite columns names you can .rename() method:
So, it will look like:
df.rename(columns={'ColA_fjkj':'ColA',
'ColB_huqwa':'ColB',
'ColC_ouiqw':'ColC'}
, inplace=True)
More info regarding rename here in docs: https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.rename.html

I had this problem as well and came up with this solution:
First, create a mask of the columns you want to rename
mask = df.iloc[:,1:4].columns
Then, use list comprehension and a conditional to rename just the columns you want
df.columns = [x if x not in mask else str[:4] for x in df.columns]

Related

Make a new dataframe from multiple dataframes

Suppose I have 3 dataframes that are wrapped in a list. The dataframes are:
df_1 = pd.DataFrame({'text':['a','b','c','d','e'],'num':[2,1,3,4,3]})
df_2 = pd.DataFrame({'text':['f','g','h','i','j'],'num':[1,2,3,4,3]})
df_3 = pd.DataFrame({'text':['k','l','m','n','o'],'num':[6,5,3,1,2]})
The list of the dfs is:
df_list = [df_1, df_2, df_3]
Now I want to make a for loop such that goes on df_list, and for each df takes the text column and merge them on a new dataframe with a new column head called topic. Now since each text column is different from each dataframe I want to populate the headers as topic_1, topic_2, etc. The desired outcome should be as follow:
topic_1 topic_2 topic_3
0 a f k
1 b g l
2 c h m
3 d i n
4 e j o
I can easily extract the text columns as:
lst = []
for i in range(len(df_list)):
lst.append(df_list[i]['text'].tolist())
It is just that I am stuck on the last part, namely bringing the columns into 1 df without using brute force.
You can extract the wanted columns with a list comprehension and concat them:
pd.concat([d['text'].rename(f'topic_{i}')
for i,d in enumerate(df_list, start=1)],
axis=1)
output:
topic_1 topic_2 topic_3
0 a f k
1 b g l
2 c h m
3 d i n
4 e j o
Generally speaking you want to avoid looping anything on a pandas DataFrame. However, in this solution I do use a loop to rename your columns. This should work assuming you just have these 3 dataframes:
import pandas as pd
df_1 = pd.DataFrame({'text':['a','b','c','d','e'],'num':[2,1,3,4,3]})
df_2 = pd.DataFrame({'text':['f','g','h','i','j'],'num':[1,2,3,4,3]})
df_3 = pd.DataFrame({'text':['k','l','m','n','o'],'num':[6,5,3,1,2]})
df_list = [df_1.text, df_2.text, df_3.text]
df_combined = pd.concat(df_list,axis=1)
df_combined.columns = [f"topic_{i+1}" for i in range(len(df_combined.columns))]
>>> df_combined
topic_1 topic_2 topic_3
0 a f k
1 b g l
2 c h m
3 d i n
4 e j o

how to write a function to filter rows based on a list values one by one and make analysis

I have a list contains more than 10 values and I have a full dataframe. I'd like to filter each value from the list to a subdataframe and do some analysis on each of them. How can I write a function so I don't need to copy paste and change value so many times.
eg.
list = ['A','B','C']
df1 = df[df['column1']=='A']
df2 = df[df['column1']=='B']
df3 = df[df['column1']=='C']
for each subdataframe ,I will do a groupby and value count
df1.groupby(['column2']).size()
df2.groupby(['column2']).size()
df3.groupby(['column2']).size()
First many DataFrames is here not necessary.
You can filter only necessary values for column1 and pass both columns to groupby:
L = ['A','B','C']
s = df1[df1['column1'].isin(L)].groupby(['column1', 'column2']).size()
Last select by values of list:
s.loc['A']
s.loc['B']
s.loc['C']
If want function:
def f(df, x):
return df[df['column1'].eq(L)].groupby(['column2']).size()
print (f(df1, 'A'))
You can use locals() to create variables dynamically (it's not really a good practice but it works well):
df = pd.DataFrame({'column1': list('ABC') * 3, 'column2': list('IKJIJKLNK')})
lst = ['A', 'B', 'C']
for idx, elmt in enumerate(lst, 1):
locals()[f"df{idx}"] = df[df['column1'] == elmt]
>>> df3
column1 column2
2 C J
5 C K
8 C K
>>> df3.value_counts('column2')
column1 column2
column2
K 2
J 1
dtype: int64
>>> df
column1 column2
0 A I
1 B K
2 C J
3 A I
4 B J
5 C K
6 A L
7 B N
8 C K

Python: Pivot Table/group by specific conditions

I'm trying to change structure of my data from text file(.txt) which data look like this:
:1:A
:2:B
:3:C
:1:D
:2:E
:3:F
:4:G
:1:H
:3:I
:4:J
And I would like to transform them into this format (like pivot-table in excel which column name is character between ":" and each group always start with :1:)
Group :1: :2: :3: :4:
1 A B C
2 D E F G
3 H I J
Does anyone have any idea? Thanks in advance.
First create DataFrame by read_csv with header=None, because no header in file:
import pandas as pd
temp=u""":1:A
:2:B
:3:C
:1:D
:2:E
:3:F
:4:G
:1:H
:3:I
:4:J"""
#after testing replace 'pd.compat.StringIO(temp)' to 'filename.csv'
df = pd.read_csv(pd.compat.StringIO(temp), header=None)
print (df)
0
0 :1:A
1 :2:B
2 :3:C
3 :1:D
4 :2:E
5 :3:F
6 :4:G
7 :1:H
8 :3:I
9 :4:J
Extract original column by DataFrame.pop, then remove traling : by Series.str.strip and Series.str.split values to 2 new columns. Then create groups by compare with Series.eq for == by string 0 with Series.cumsum, create MultiIndex by DataFrame.set_index and last reshape by Series.unstack:
df[['a','b']] = df.pop(0).str.strip(':').str.split(':', expand=True)
df1 = df.set_index([df['a'].eq('1').cumsum(), 'a'])['b'].unstack(fill_value='')
print (df1)
a 1 2 3 4
a
1 A B C
2 D E F G
3 H I J
Use:
# Reading text file (assuming stored in CSV format, you can also use pd.read_fwf)
df = pd.read_csv('SO.csv', header=None)
# Splitting data into two columns
ndf = df.iloc[:, 0].str.split(':', expand=True).iloc[:, 1:]
# Grouping and creating a dataframe. Later dropping NaNs
res = ndf.groupby(1)[2].apply(pd.DataFrame).apply(lambda x: pd.Series(x.dropna().values))
# Post processing (optional)
res.columns = [':' + ndf[1].unique()[i] + ':' for i in range(ndf[1].nunique())]
res.index.name = 'Group'
res.index = range(1, res.shape[0] + 1)
res
Group :1: :2: :3: :4:
1 A B C
2 D E F G
3 H I J
Another way to do this:
#read the file
with open("t.txt") as f:
content = f.readlines()
#Create a dictionary and read each line from file to keep the column names (ex, :1:) as keys and rows(ex, A) as values in dictionary.
my_dict={}
for v in content:
key = v.rstrip(':')[0:3] # take the value ':1:'
value = v.rstrip(':')[3] # take value 'A'
my_dict.setdefault(key,[]).append(value)
#convert dictionary to dataframe and transpose it
df = pd.DataFrame.from_dict(my_dict,orient='index').transpose()
df
The output will be looking like this:
:1: :2: :3: :4:
0 A B C G
1 D E F J
2 H None I None

How to merge strings pandas df

I am trying merge specific strings in a pandas df. The df below is just an example. The values in my df will differ but the basic rules will apply. I basically want to merge all rows until there's a 4 letter string.
Whilst the 4 letter string in this df is always Excl, my df will contain numerous 4 letter strings.
import pandas as pd
d = ({
'A' : ['Include','Inclu','Incl','Inc'],
'B' : ['Excl','de','ude','l'],
'C' : ['X','Excl','Excl','ude'],
'D' : ['','Y','ABC','Excl'],
})
df = pd.DataFrame(data=d)
Out:
A B C D
0 Include Excl X
1 Inclu de Excl Y
2 Incl ude Excl ABC
3 Inc l ude Excl
Intended Output:
A B C D
0 Include Excl X
1 Include Excl Y
2 Include Excl ABC
3 Include Excl
So row 0 stays the same as col B has 4 letters. Row 1 merges Col A,B as Col C 4 letters. Row 2 stays the same as above. Row 3 merges Col A,B,C as Col D has 4 letters.
I have tried to do this manually by merging all columns and then go back and removing unwanted values.
df["Com"] = df["A"].map(str) + df["B"] + df["C"]
But I would have to manually go through each row and remove different lengths of letters.
The above df is just an example. The central similarity is I need to merge everything before the 4 letter string.
You could do something like
mask = (df.iloc[:, 1:].applymap(len) == 4).cumsum(1) == 0
df.A = df.A + df.iloc[:, 1:][mask].apply(lambda x: x.str.cat(), 1)
df.iloc[:, 1:] = df.iloc[:, 1:][~mask].fillna('')
try this,
Sorry for the clumsy solution, I'll try to improve the performance ,
temp=df.eq('Excl').shift(-1,axis=1)
df['end']= temp.apply(lambda x:x.argmax(),axis=1)
res=df.apply(lambda x:x.loc[:x['end']].sum(),axis=1)
mask=temp.replace(False,np.NaN).fillna(method='ffill').fillna(False).astype(bool)
del df['end']
df[:]=np.where(mask,'',df)
df['A']=res
print df
Output:
A B C D
0 Include Excl X
1 Include Excl Y
2 Include Excl ABC
3 Include Excl
Improved solution:
res= df.apply(lambda x:x.loc[:x.eq('Excl').shift(-1).argmax()].sum(),axis=1)
mask=df.eq('Excl').shift(-1,axis=1).replace(False,np.NaN).fillna(method='ffill').fillna(False).astype(bool)
df[:]=np.where(mask,'',df)
df['A']=res
More simplified solution:
t=df.eq('Excl').shift(-1,axis=1)
res= df.apply(lambda x:x.loc[:x.eq('Excl').shift(-1).argmax()].sum(),axis=1)
df[:]=np.where(t.fillna(0).astype(int).cumsum() >= 1,'',df)
df['A']=res
I am giving you a rough approach,
Here, we are finding the location of the 'Excl' and merging the column values up it so as to obtain our desired output.
ls=[]
for i in range(len(df)):
end=(df.loc[i,:].index[(df.loc[i,:]=='Excl')][0])
ls.append(''.join(df.loc[i,:end].replace({'Excl':''}).values))
df['A']=ls

Find all duplicate columns in a collection of data frames

Having a collection of data frames, the goal is to identify the duplicated column names and return them as a list.
Example
The input are 3 data frames df1, df2 and df3:
df1 = pd.DataFrame({'a':[1,5], 'b':[3,9], 'e':[0,7]})
a b e
0 1 3 0
1 5 9 7
df2 = pd.DataFrame({'d':[2,3], 'e':[0,7], 'f':[2,1]})
d e f
0 2 0 2
1 3 7 1
df3 = pd.DataFrame({'b':[3,9], 'c':[8,2], 'e':[0,7]})
b c e
0 3 8 0
1 9 2 7
The output is a list [b, e]
pd.Series.duplicated
Since you are using Pandas, you can use pd.Series.duplicated after concatenating column names:
# concatenate column labels
s = pd.concat([df.columns.to_series() for df in (df1, df2, df3)])
# keep all duplicates only, then extract unique names
res = s[s.duplicated(keep=False)].unique()
print(res)
array(['b', 'e'], dtype=object)
pd.Series.value_counts
Alternatively, you can extract a series of counts and identify rows which have a count greater than 1:
s = pd.concat([df.columns.to_series() for df in (df1, df2, df3)]).value_counts()
res = s[s > 1].index
print(res)
Index(['e', 'b'], dtype='object')
collections.Counter
The classic Python solution is to use collections.Counter followed by a list comprehension. Recall that list(df) returns the columns in a dataframe, so we can use this map and itertools.chain to produce an iterable to feed Counter.
from itertools import chain
from collections import Counter
c = Counter(chain.from_iterable(map(list, (df1, df2, df3))))
res = [k for k, v in c.items() if v > 1]
here is my code for this problem, for comparing with only two data frames, with out concat them.
def getDuplicateColumns(df1, df2):
df_compare = pd.DataFrame({'df1':df1.columns.to_list()})
df_compare["df2"] = ""
# Iterate over all the columns in dataframe
for x in range(df1.shape[1]):
# Select column at xth index.
col = df1.iloc[:, x]
# Iterate over all the columns in DataFrame from (x+1)th index till end
duplicateColumnNames = []
for y in range(df2.shape[1]):
# Select column at yth index.
otherCol = df2.iloc[:, y]
# Check if two columns at x y index are equal
if col.equals(otherCol):
duplicateColumnNames.append(df2.columns.values[y])
df_compare.loc[df_compare["df1"]==df1.columns.values[x], "df2"] = str(duplicateColumnNames)
return df_compare

Categories