Python 3: match values based on column name similarity

I have a dataframe of the following form:
   Year 1 Grade  Year 2 Grade  Year 3 Grade  Year 4 Grade  Year 1 Students  Year 2 Students  Year 3 Students  Year 4 Students
0            60            70            80           100               20               32               18               25
I would like to somehow transpose this table to the following format:

Year  Grade  Students
   1     60        20
   2     70        32
   3     80        18
   4    100        25
I created a list of years and initialized a new dataframe with the "Year" column. I was thinking of matching the year integer to the column name containing it in the original df, then assigning the matched value, but got stuck there.
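For reference, the matching idea described in the question can be sketched directly: build each output column by formatting the year into the source column name (a minimal sketch, assuming the column names follow the "Year <n> Grade" / "Year <n> Students" pattern shown above):

```python
import pandas as pd

# one-row frame shaped like the question's data
df = pd.DataFrame(
    [[60, 70, 80, 100, 20, 32, 18, 25]],
    columns=["Year 1 Grade", "Year 2 Grade", "Year 3 Grade", "Year 4 Grade",
             "Year 1 Students", "Year 2 Students", "Year 3 Students", "Year 4 Students"])

years = [1, 2, 3, 4]
# look up the column whose name contains each year and take its single value
out = pd.DataFrame({
    "Year": years,
    "Grade": [df[f"Year {y} Grade"].iloc[0] for y in years],
    "Students": [df[f"Year {y} Students"].iloc[0] for y in years],
})
```

This is the brute-force version of the idea; the answers below do the same reshape without hard-coding the column pattern twice.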

You need a manual reshape: split the column Index into a MultiIndex, then unstack:

out = (df
 .set_axis(df.columns.str.split(expand=True), axis=1)  # make a MultiIndex
 .iloc[0]               # select the row as a Series
 .unstack()             # unstack Grade/Students
 .droplevel(0)          # remove the literal "Year" level
 .rename_axis('Year')   # set the index name
 .reset_index()         # index to column
)
output:

  Year Grade Students
0    1    60       20
1    2    70       32
2    3    80       18
3    4   100       25
Or using pivot_longer from janitor:

# pip install pyjanitor
import janitor

out = (df.pivot_longer(names_to=('ignore', 'Year', '.value'),
                       names_sep=' ')
         .drop(columns='ignore')
)
out

  Year Grade Students
0    1    60       20
1    2    70       32
2    3    80       18
3    4   100       25
The .value placeholder determines which parts of the column sub-labels are retained; the labels are split apart by names_sep, which can be a string or a regex. Another option is names_pattern, a regex whose capture groups split and reshape the columns:
df.pivot_longer(names_to=('Year', '.value'),
                names_pattern=r'.+(\d)\s(.+)')

  Year Grade Students
0    1    60       20
1    2    70       32
2    3    80       18
3    4   100       25

Here's one way to do it. Feel free to ask questions about how it works.
import pandas as pd

cols = ["Year 1 Grade", "Year 2 Grade", "Year 3 Grade", "Year 4 Grade",
        "Year 1 Students", "Year 2 Students", "Year 3 Students", "Year 4 Students"]
vals = [[v] for v in [60, 70, 80, 100, 20, 32, 18, 25]]
df = pd.DataFrame({k: v for k, v in zip(cols, vals)})

grades = df.filter(like="Grade").T.reset_index(drop=True).rename(columns={0: "Grade"})
students = df.filter(like="Student").T.reset_index(drop=True).rename(columns={0: "Students"})
pd.concat([grades, students], axis=1)

I came up with this. grades here is your first row as a list.

import numpy as np
import pandas as pd

df = pd.DataFrame([grades])  # your one-row dataframe, without column names
# the first four values are grades and the last four are student counts,
# so reshape to 2x4 and transpose to pair each Grade with its Students
new_df = pd.DataFrame(np.array(df.iloc[0]).reshape(2, 4).T).reset_index()
new_df.columns = ['Year', 'Grade', 'Students']  # rename the columns
new_df['Year'] += 1  # reset_index counts from 0, years start at 1

Related

Pandas: Give (string + numbered) name to unknown number of added columns

I have this example CSV file:
Name,Dimensions,Color
Chair,!12:88:33!!9:10:50!!40:23:11!,Red
Table,!9:10:50!!40:23:11!,Brown
Couch,!40:23:11!!12:88:33!,Blue
I read it into a dataframe, then split Dimensions by ! and take the first value of each !..:..:..!-section. I append these as new columns to the dataframe, and delete Dimensions. (code for this below)
import pandas as pd

df = pd.read_csv("./data.csv")
df[["first", "second", "third"]] = (df['Dimensions']
    .str.strip('!')
    .str.split('!{1,}', expand=True)
    .apply(lambda x: x.str.split(':').str[0]))
df = df.drop("Dimensions", axis=1)
And I get this:
    Name  Color first second third
0  Chair    Red    12      9    40
1  Table  Brown     9     40  None
2  Couch   Blue    40     12  None
I named them ["first", "second", "third"] manually here.
But what if there are more than 3 in the future, or only 2, or I don't know how many there will be, and I want them named using a string plus an enumerating number?
Like this:
Name Color data_0 data_1 data_2
0 Chair Red 12 9 40
1 Table Brown 9 40 None
2 Couch Blue 40 12 None
Question:
How do I make the naming automatic, based on the string "data_" so it gives each column the name "data_" + the number of the column? (So I don't have to type in names manually)
Use DataFrame.pop to select and drop the column Dimensions, add DataFrame.add_prefix to the default column names, and append to the original DataFrame with DataFrame.join:
df = (df.join(df.pop('Dimensions')
                .str.strip('!')
                .str.split('!{1,}', expand=True)
                .apply(lambda x: x.str.split(':').str[0])
                .add_prefix('data_')))
print(df)
    Name  Color data_0 data_1 data_2
0  Chair    Red     12      9     40
1  Table  Brown      9     40   None
2  Couch   Blue     40     12   None
Nevermind, hahah, I solved it.
import pandas as pd

df = pd.read_csv("./data.csv")
df2 = (df['Dimensions']
    .str.strip('!')
    .str.split('!{1,}', expand=True)
    .apply(lambda x: x.str.split(':').str[0]))
df[["data_" + str(i) for i in range(len(df2.columns))]] = df2
df = df.drop("Dimensions", axis=1)

Rearranging data frame table Pandas Python

I have the following kind of data frame.
Id Name Exam  Result Exam      Result
1  Bob  Maths 10     Physics   9
2  Mar  ML    8      Chemistry 10
What I would like is to remove the duplicate columns and append their values as extra rows, something like below:

Id Name Exam      Result
1  Bob  Maths     10
1  Bob  Physics   9
2  Mar  ML        8
2  Mar  Chemistry 10
Is there any way to do this in Python?
Any help is appreciated!
First create a MultiIndex from the first, non-duplicated columns with DataFrame.set_index. Then build a MultiIndex in the columns with a counter of the duplicated names via GroupBy.cumcount, which works on a Series, hence Index.to_series. Last, reshape with DataFrame.stack, remove the helper level with reset_index(level=2, drop=True), and turn the remaining MultiIndex into columns with reset_index:
df = df.set_index(['Id', 'Name'])
s = df.columns.to_series()
df.columns = [s, s.groupby(s).cumcount()]
df = df.stack().reset_index(level=2, drop=True).reset_index()
print(df)

   Id Name       Exam  Result
0   1  Bob      Maths      10
1   1  Bob    Physics       9
2   2  Mar         ML       8
3   2  Mar  Chemistry      10
This is an alternative using pandas melt:

# flip table into long format
(df.melt(['Id', 'Name'])
   # sort by Id so that each Result follows immediately after its Exam
   .sort_values('Id')
   # create a new column on rows that have "Result" in the variable column
   .assign(Result=lambda x: x.loc[x['variable'] == "Result", 'value'])
   .bfill()
   # get rid of rows that contain "Result" in the variable column
   .query('variable != "Result"')
   .drop(['variable'], axis=1)
   .rename(columns={'value': 'Exam'})
)

   Id Name       Exam Result
0   1  Bob      Maths     10
4   1  Bob    Physics      9
1   2  Mar         ML      8
5   2  Mar  Chemistry     10
Alternatively, just for fun:

df = df.set_index(['Id', 'Name'])
# get a boolean mask of the duplicated columns
dupes = df.columns.duplicated()
# concatenate the first columns and their duplicates
pd.concat([df.loc[:, ~dupes],
           df.loc[:, dupes]]).sort_index()

Slicing pandas dataframe on equal column values

I have a pandas df that looks like this:
import pandas as pd

df = pd.DataFrame({0: [1], 5: [1], 10: [1], 15: [1], 20: [0], 25: [0],
                   30: [1], 35: [1], 40: [0], 45: [0], 50: [0]})
df
The column names reflect coordinates. I would like to retrieve the start and end coordinate of columns with consecutive equal numbers.
The output should be something like this:
# start,end
0,15
20,25
30,35
40,50
IIUC, using groupby with diff and cumsum to split the groups:
s = df.T.reset_index()
s = s.groupby(s[0].diff().ne(0).cumsum())['index'].agg(['first', 'last'])

Out[241]:
   first  last
0
1      0    15
2     20    25
3     30    35
4     40    50
Use cumsum to identify the groups, then groupby:
s = df.iloc[0].diff().ne(0).cumsum()
(df.columns.to_series()
   .groupby(s).agg(['min', 'max'])
)

Output:
   min  max
0
1    0   15
2   20   25
3   30   35
4   40   50

Resetting index and setting a new index creates multi-level column in pandas dataframe

I have the following code:
df = pageview_df[['student_id', 'page_id']].groupby('student_id').agg('count')
df.head(3)

Which generates the following dataframe:

            page_id
student_id
1                22
2                34
3                30
Then, in an attempt to have only a single layer of columns, I reset the index like this:
df.reset_index(inplace=True)
df.head(3)

Resulting in this dataframe:

   student_id  page_id
0           1       22
1           2       34
2           3       30
But then I want to get rid of the new automatically generated index and use student_id as the new index:
df = df.set_index('student_id')
df.head(3)
But then this code gives me back what I originally had:

            page_id
student_id
1                22
2                34
3                30
Can someone explain to me why it works this way? How can I resolve this issue? I want to obtain this dataframe:

student_id  page_id
         1       22
         2       34
         3       30
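For what it's worth: this is expected behaviour, not a bug. reset_index moves student_id from the index into a column, and set_index('student_id') moves it straight back, so the round-trip returns the original frame; the two-line header is just how pandas displays a named index, not a multi-level column. If the goal is to keep student_id as an ordinary column from the start, one option is groupby(..., as_index=False) (a minimal sketch with hypothetical data standing in for pageview_df):

```python
import pandas as pd

# hypothetical stand-in for pageview_df
pageview_df = pd.DataFrame({"student_id": [1, 1, 2, 3],
                            "page_id": [10, 11, 12, 13]})

# as_index=False keeps student_id as an ordinary column instead of the index
df = pageview_df.groupby("student_id", as_index=False)["page_id"].count()
```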

pandas: append new column of row subtotals

This is very similar to this question, except I want my code to be able to apply to the length of a dataframe, instead of specific columns.
I have a DataFrame, and I'm trying to get a sum of each row to append to the dataframe as a column.
df = pd.DataFrame([[1, 0, 0], [20, 7, 1], [63, 13, 5]],
                  columns=['drinking', 'drugs', 'both'],
                  index=['First', 'Second', 'Third'])

        drinking  drugs  both
First          1      0     0
Second        20      7     1
Third         63     13     5
Desired output:

        drinking  drugs  both  total
First          1      0     0      1
Second        20      7     1     28
Third         63     13     5     81
Current code:
df['total'] = df.apply(lambda row: (row['drinking'] + row['drugs'] + row['both']),axis=1)
This works great. But what if I have another dataframe with seven columns that are not called 'drinking', 'drugs', or 'both'? Is it possible to adjust this function so that it applies across all of a dataframe's columns, whatever their number and names?
Something like:
df['total'] = df.apply(for col in df: [code to calculate sum of each row]),axis=1)
You can use sum:
df['total'] = df.sum(axis=1)
If you need to sum only some columns, use a subset:
df['total'] = df[['drinking', 'drugs', 'both']].sum(axis=1)
What about something like this:
df.loc[:, 'Total'] = df.sum(axis=1)
with the output:

Out[4]:
        drinking  drugs  both  Total
First          1      0     0      1
Second        20      7     1     28
Third         63     13     5     81
It will sum all columns by row.
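One caveat worth adding: because df.sum(axis=1) tries to include every column, a non-numeric column can make a plain row sum raise a TypeError in recent pandas versions. Passing numeric_only=True restricts the sum to numeric columns (a sketch with a hypothetical extra text column):

```python
import pandas as pd

df = pd.DataFrame([[1, 0, 0], [20, 7, 1], [63, 13, 5]],
                  columns=["drinking", "drugs", "both"],
                  index=["First", "Second", "Third"])
df["label"] = ["a", "b", "c"]  # hypothetical non-numeric column

# restrict the row sum to numeric columns so the strings don't break it
df["total"] = df.sum(axis=1, numeric_only=True)
```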
