I have a data frame with 20 columns. All of the column names share a common text part plus a serial number. I want to trim the text part to make the names shorter. Below is an example:
import pandas as pd

xdf = pd.DataFrame({'Column1': [10, 20], 'Column2': [80, 90]})
Column1 Column2
0 10 80
1 20 90
Expected output:
C1 C2
0 10 80
1 20 90
Solution 1:
oldcols = ['Column1','Column2']
newcols = ['C1','C2']
xdf.rename(columns=dict(zip(oldcols,newcols)),inplace=True)
C1 C2
0 10 80
1 20 90
Solution 2:
for i in range(len(oldcols)):
    xdf.rename(columns={'%s' % (xdf[i]): '%s' % (xdf[i].replace('Column', 'C'))}, inplace=True)
This fails with:
raise KeyError(key) from err
Solution 1 works fine, but I have to prepare lists of old and new column names. Instead, I want to iterate through each column name and replace the text part. However, Solution 2 is not working.
You could use str.findall on the columns to split each name into its text and number parts; then use a list comprehension to take only the first letter of the text and join it with the number for each column name:
xdf.columns = [x[0]+y for li in xdf.columns.str.findall(r'([A-Za-z]+)(\d+)') for x,y in li]
Output:
C1 C2
0 10 80
1 20 90
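An alternative that avoids preparing the two lists: DataFrame.rename also accepts a callable, which is applied to every column label. A minimal sketch, assuming each name is literally the text "Column" followed by a number:

```python
import pandas as pd

xdf = pd.DataFrame({'Column1': [10, 20], 'Column2': [80, 90]})

# rename applies the callable to every column label,
# so no old/new name lists are needed
xdf = xdf.rename(columns=lambda c: c.replace('Column', 'C'))
print(xdf)
```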
Related
I have this example CSV file:
Name,Dimensions,Color
Chair,!12:88:33!!9:10:50!!40:23:11!,Red
Table,!9:10:50!!40:23:11!,Brown
Couch,!40:23:11!!12:88:33!,Blue
I read it into a dataframe, then split Dimensions by ! and take the first value of each !..:..:..!-section. I append these as new columns to the dataframe, and delete Dimensions. (code for this below)
import pandas as pd
df = pd.read_csv("./data.csv")
df[["first", "second", "third"]] = (df['Dimensions']
                                      .str.strip('!')
                                      .str.split('!{1,}', expand=True)
                                      .apply(lambda x: x.str.split(':').str[0]))
df = df.drop("Dimensions", axis=1)
And I get this:
Name Color first second third
0 Chair Red 12 9 40
1 Table Brown 9 40 None
2 Couch Blue 40 12 None
I named them ["first", "second", "third"] manually here.
But what if there are more than 3 in the future, or only 2, or I don't know how many there will be, and I want them to be named using a string + an enumerating number?
Like this:
Name Color data_0 data_1 data_2
0 Chair Red 12 9 40
1 Table Brown 9 40 None
2 Couch Blue 40 12 None
Question:
How do I make the naming automatic, based on the string "data_" so it gives each column the name "data_" + the number of the column? (So I don't have to type in names manually)
Use DataFrame.pop to select and drop the Dimensions column in one step, add DataFrame.add_prefix to the default column names, and append the result to the original DataFrame with DataFrame.join:
df = (df.join(df.pop('Dimensions')
                .str.strip('!')
                .str.split('!{1,}', expand=True)
                .apply(lambda x: x.str.split(':').str[0])
                .add_prefix('data_')))
print(df)
Name Color data_0 data_1 data_2
0 Chair Red 12 9 40
1 Table Brown 9 40 None
2 Couch Blue 40 12 None
Nevermind, hahah, I solved it.
import pandas as pd
df = pd.read_csv("./data.csv")
df2 = (df['Dimensions']
         .str.strip('!')
         .str.split('!{1,}', expand=True)
         .apply(lambda x: x.str.split(':').str[0]))
df[["data_" + str(i) for i in range(len(df2.columns))]] = df2
df = df.drop("Dimensions", axis=1)
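The range-based renaming above can also be written with DataFrame.set_axis, which replaces all column labels of the expanded frame in one step, however many there turn out to be. A sketch on the same sample data (regex=True on str.split assumes pandas >= 1.4):

```python
import pandas as pd
from io import StringIO

csv = StringIO("""Name,Dimensions,Color
Chair,!12:88:33!!9:10:50!!40:23:11!,Red
Table,!9:10:50!!40:23:11!,Brown
Couch,!40:23:11!!12:88:33!,Blue""")

df = pd.read_csv(csv)
df2 = (df.pop('Dimensions')
         .str.strip('!')
         .str.split(r'!+', expand=True, regex=True)
         .apply(lambda x: x.str.split(':').str[0]))

# set_axis renames every expanded column at once
df = df.join(df2.set_axis([f'data_{i}' for i in range(df2.shape[1])], axis=1))
print(df)
```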
I'm trying to multiply certain columns within a data frame by one specific column. While the specific column will always be the same, the names and number of the other columns will vary, so I cannot specify those column names except for the string they begin with. I have been trying to use .mul but have had trouble getting it to work. Is there any way I can do this?
Original DF:
ID #   Column To Multiply By   ABC Column1   ABC Column2
12     2                       1             2
13     3                       1             2
14     4                       1             2
Desired DF:
ID #   Column To Multiply By   ABC Column1   ABC Column2
12     2                       2             4
13     3                       3             6
14     4                       4             8
I tried using the line below but am met with a syntax error:
df2 = df1.loc[:, df1.columns.str.startswith('ABC').mul(df1.Column to Multiply By, axis=0)
This should do the job:
df2 = df1
df2['ABC Column1'] = df1['Column To Multiply By'] * df1['ABC Column1']
df2['ABC Column2'] = df1['Column To Multiply By'] * df1['ABC Column2']
Edit: If the name of ABC columns changes but the prefix is constant, this should work:
df[df.columns[pd.Series(df.columns).str.startswith('ABC')]] = df[df.columns[pd.Series(df.columns).str.startswith('ABC')]].multiply(df['Column To Multiply By'], axis='index')
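The one-liner above works, but the same logic reads more easily split into two steps: select the matching columns once, then multiply them row-wise. A sketch with made-up sample data matching the question:

```python
import pandas as pd

df = pd.DataFrame({'ID #': [12, 13, 14],
                   'Column To Multiply By': [2, 3, 4],
                   'ABC Column1': [1, 1, 1],
                   'ABC Column2': [2, 2, 2]})

# select the target columns once, then multiply them row-wise
abc_cols = df.columns[df.columns.str.startswith('ABC')]
df[abc_cols] = df[abc_cols].multiply(df['Column To Multiply By'], axis=0)
print(df)
```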
I have a dataframe with many rows like this:
ID   Variable
1    A1_1 - Red
2    A1_2 - Blue
3    A1_3 - Yellow
I'm trying to iterate over all rows so that all the 2nd column's values change to just "A1". The code I've come up with is:
for row in df.iterrows():
    current_response_id = row[1][0]
    columncount = 0
    for columncount in range(2):
        variable = row[1][1]
        row[1][1] = variable.split("_")[0].split(" -")[0]
        variable = row[1][1]
However, this isn't achieving the desired result. How could I go about this?
Try:
df["Variable"] = df["Variable"].str.split("_").str[0]
print(df)
Prints:
ID Variable
0 1 A1
1 2 A1
2 3 A1
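An equivalent alternative is str.extract with a regex, which captures everything up to the first underscore in a single pass (a sketch on the same sample data; expand=False makes it return a Series):

```python
import pandas as pd

df = pd.DataFrame({'ID': [1, 2, 3],
                   'Variable': ['A1_1 - Red', 'A1_2 - Blue', 'A1_3 - Yellow']})

# capture everything up to (but not including) the first underscore
df['Variable'] = df['Variable'].str.extract(r'^([^_]+)', expand=False)
print(df)
```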
I have a dataframe df_in, which contains column names that start with pi and pm.
df_in = pd.DataFrame(
    [[1, 2, 3, 4, "", 6, 7, 8, 9],
     ["", 1, 32, 43, 59, 65, "", 83, 97],
     ["", 51, 62, 47, 58, 64, 74, 86, 99],
     [73, 51, 42, 67, 54, 65, "", 85, 92]],
    columns=["piabc", "pmed", "pmrde", "pmret", "pirtc", "pmere", "piuyt", "pmfgf", "pmthg"])
If a row's value in a column whose name starts with pi is blank, the same row should also be made blank in the following columns that start with pm, up to the next column that starts with pi. Then repeat the same process for the other pi columns.
Expected Output:
df_out = pd.DataFrame([[1,2,3,4,"","",7,8,9],["","","","",59,65,"","",""],["","","","",58,64,74,86,99],[73,51,42,67,54,65,"","",""]], columns=["piabc","pmed","pmrde","pmret","pirtc","pmere","piuyt","pmfgf","pmthg"])
How to do it?
You can create groups by testing the column names with str.startswith and taking a cumulative sum; then, within each group, check whether the first (pi) column's value is the empty string, broadcast that with groupby + transform, and use it as the mask in DataFrame.mask to set empty strings:
g = df_in.columns.str.startswith('pi').cumsum()
df = df_in.mask(df_in.eq('').groupby(g, axis=1).transform(lambda x: x.iat[0]), '')
# transform('first') failed for me in pandas 1.2.3
# df = df_in.mask(df_in.eq('').groupby(g, axis=1).transform('first'), '')
print(df)
piabc pmed pmrde pmret pirtc pmere piuyt pmfgf pmthg
0 1 2 3 4 7 8 9
1 59 65
2 58 64 74 86 99
3 73 51 42 67 54 65
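Note that groupby(..., axis=1) is deprecated in pandas >= 2.1. A sketch of the same idea for newer pandas versions: group on the transposed frame, then transpose the resulting mask back:

```python
import pandas as pd

df_in = pd.DataFrame(
    [[1, 2, 3, 4, "", 6, 7, 8, 9],
     ["", 1, 32, 43, 59, 65, "", 83, 97],
     ["", 51, 62, 47, 58, 64, 74, 86, 99],
     [73, 51, 42, 67, 54, 65, "", 85, 92]],
    columns=["piabc", "pmed", "pmrde", "pmret", "pirtc",
             "pmere", "piuyt", "pmfgf", "pmthg"])

# group id increases at every column name that starts with 'pi'
g = df_in.columns.str.startswith('pi').cumsum()

# groupby(axis=1) is deprecated, so group the transposed frame
# by row labels instead and transpose the mask back
mask = df_in.eq('').T.groupby(g).transform('first').T
df = df_in.mask(mask, '')
print(df)
```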
This is very similar to this question, except I want my code to be able to apply to the length of a dataframe, instead of specific columns.
I have a DataFrame, and I'm trying to get a sum of each row to append to the dataframe as a column.
df = pd.DataFrame([[1,0,0],[20,7,1],[63,13,5]],columns=['drinking','drugs','both'],index = ['First','Second','Third'])
drinking drugs both
First 1 0 0
Second 20 7 1
Third 63 13 5
Desired output:
drinking drugs both total
First 1 0 0 1
Second 20 7 1 28
Third 63 13 5 81
Current code:
df['total'] = df.apply(lambda row: (row['drinking'] + row['drugs'] + row['both']),axis=1)
This works great. But what if I have another dataframe, with seven columns, which are not called 'drinking', 'drugs', or 'both'? Is it possible to adjust this function so that it applies to the length of the dataframe? That way I can use the function for any dataframe at all, with a varying number of columns, not just a dataframe with columns called 'drinking', 'drugs', and 'both'?
Something like:
df['total'] = df.apply(for col in df: [code to calculate sum of each row]),axis=1)
You can use sum:
df['total'] = df.sum(axis=1)
If you need sum only some columns, use subset:
df['total'] = df[['drinking', 'drugs', 'both']].sum(axis=1)
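If the frame also contains non-numeric columns (a name or ID column, say), one variant is to restrict the sum to numeric columns with select_dtypes; a sketch on made-up data:

```python
import pandas as pd

df = pd.DataFrame({'name': ['First', 'Second', 'Third'],
                   'drinking': [1, 20, 63],
                   'drugs': [0, 7, 13],
                   'both': [0, 1, 5]})

# sum only the numeric columns, leaving text columns out of the total
df['total'] = df.select_dtypes('number').sum(axis=1)
print(df)
```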
What about something like this:
df.loc[:, 'Total'] = df.sum(axis=1)
with the output :
Out[4]:
drinking drugs both Total
First 1 0 0 1
Second 20 7 1 28
Third 63 13 5 81
It will sum all columns by row.