Understanding Pandas Pivot function - python

I want to convert a categorical column in a pandas dataframe to multiple columns containing values. Here is a minimal example dataframe
import numpy as np
import pandas as pd

dfTest = pd.DataFrame({
    'animal': ['cat', 'cat', 'dog', 'dog', 'mouse', 'mouse', 'rat', 'rat'],
    'color': ['black', 'white', 'black', 'white', 'black', 'white', 'black', 'white'],
    'weight': np.random.uniform(3, 20, 8)
})
dfTest
The table has eight rows, one per (animal, color) pair, each with a random weight.
According to the pandas user guide, it seems to me that what I want to do is called a pivot. Namely, the result should look something like this:
animal weight_black weight_white
0 cat 1.23456 2.34234
1 dog 3.634634 3.4554646
2 mouse 5.24234 5.463452
3 rat 4.56456 2.3364
However, when I run
dfTest.pivot(columns='color', values='weight')
I get a frame where the animal column is gone and the values are scattered among NaNs. I don't want other categorical columns (such as animal) to disappear, and I don't want NaNs in between; I want everything to be compact. How do I do this?
EDIT: Here's a more involved example of what I want
animal color hair_length weight
1 cat black long 1.23
2 cat white long 2.34
3 cat black short 34534
4 cat white short 345
5 dog black long 234
6 dog white long 123
7 dog black short 444
8 dog white short 345
9 rat black long 5465
10 rat white long 2343
11 rat black short 123
12 rat white short 2343
13 bat black long 423
14 bat white long 23
15 bat black short 11123
16 bat white short 13423
I want to convert it to
animal hair_length weight_black weight_white
1 cat long 2.34 235
2 cat short 345 3423
3 dog long 123 56346
4 dog short 345 .... you get the point
5 rat long 2343
6 rat short 2343
7 bat long 23
8 bat short 13423

OK, I think I figured it out; @Randy's hint was actually enough:
index = list(set(dfTest.columns) - {'color', 'weight'})
dfResult = dfTest.pivot(index=index, columns='color', values='weight').reset_index()
So we:
1. Put all of the columns except the two columns of interest into the index.
2. Perform the pivot, which results in a hierarchical (MultiIndex) row index.
3. Convert back to a simple index by doing reset_index().
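The same recipe extends unchanged to the more involved example from the EDIT: everything except color and weight (here, animal and hair_length) goes into the index. Below is a minimal sketch; dfTest2 is a shortened, hypothetical version of that example data, and add_prefix is just one convenient way to get the weight_black/weight_white column names (note that passing a list of columns to pivot's index requires pandas 1.1+):
import pandas as pd

dfTest2 = pd.DataFrame({
    'animal': ['cat', 'cat', 'cat', 'cat', 'dog', 'dog', 'dog', 'dog'],
    'color': ['black', 'white'] * 4,
    'hair_length': ['long', 'long', 'short', 'short'] * 2,
    'weight': [1.23, 2.34, 34534, 345, 234, 123, 444, 345],
})

# everything except the pivoted pair goes into the index
index = list(set(dfTest2.columns) - {'color', 'weight'})

dfResult = (dfTest2.pivot(index=index, columns='color', values='weight')
                   .add_prefix('weight_')   # black/white -> weight_black/weight_white
                   .reset_index())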

Related

How to filter dataframe rows between two rows that contain a specific string in a column?

I am trying to understand how to select only those rows in my dataframe that are between two specific rows. These rows contain two specific strings in one of the columns. I will explain further with this example.
I have the following dataframe:
String Value
-------------------------
0 Blue 45
1 Red 35
2 Green 75
3 Start 65
4 Orange 33
5 Purple 65
6 Teal 34
7 Indigo 44
8 End 32
9 Yellow 22
10 Red 14
There is only one instance of "Start" and only one instance of "End" in the "String" column. I only want the rows of this dataframe that are between (and including) the rows that contain "Start" and "End" in the "String" column, and so I want to produce this output dataframe:
String Value
-------------------------
3 Start 65
4 Orange 33
5 Purple 65
6 Teal 34
7 Indigo 44
8 End 32
Also, I want to preserve the original order of the rows, i.e. "Start", "Orange", "Purple", "Teal", "Indigo", "End".
I know I can find the indices of these specific rows by doing:
index_start = df.index[df['String'] == 'Start']
index_end = df.index[df['String'] == 'End']
But I am not sure how to actually filter out all rows that are not between these two strings. How can I accomplish this in python?
If both values are present, you can temporarily set "String" as the index:
df.set_index('String').loc['Start':'End'].reset_index()
output:
String Value
0 Start 65
1 Orange 33
2 Purple 65
3 Teal 34
4 Indigo 44
5 End 32
Alternatively, using isin (then the order of Start/End doesn't matter):
m = df['String'].isin(['Start', 'End']).cumsum().eq(1)
df[m|m.shift()]
output:
String Value
3 Start 65
4 Orange 33
5 Purple 65
6 Teal 34
7 Indigo 44
8 End 32
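To see why this works, it helps to look at the intermediate pieces; a small sketch against the example df above (fill_value=False just avoids the NaN that a plain shift would leave in the first slot):
hits = df['String'].isin(['Start', 'End'])   # True only on the 'Start' and 'End' rows
m = hits.cumsum().eq(1)                      # True from 'Start' up to, but not including, 'End'
out = df[m | m.shift(fill_value=False)]      # shifting down one row pulls 'End' back in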
This should be enough. iloc[] is useful when you locate rows by position, and it works the same as slicing a list (the end point is exclusive, hence the +1):
index_start = df.index[df['String'] == 'Start']
index_end = df.index[df['String'] == 'End']
df.iloc[index_start[0]:index_end[0]+1]
More information: https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.iloc.html
You can build a boolean mask using eq + cummax and filter:
out = df[df['String'].eq('Start').cummax() & df.loc[::-1, 'String'].eq('End').cummax()]
Output:
String Value
3 Start 65
4 Orange 33
5 Purple 65
6 Teal 34
7 Indigo 44
8 End 32
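Unpacking that one-liner (a sketch against the same example df): the forward cummax is True from 'Start' onward, the reversed cummax is True up to and including 'End', and boolean operators align on the index, so the reversed order is harmless:
fwd = df['String'].eq('Start').cummax()           # True from the 'Start' row onward
bwd = df.loc[::-1, 'String'].eq('End').cummax()   # True up to and including 'End'
out = df[fwd & bwd]                               # aligned by index before combining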
Reusing the index values you already computed (note the +1, so the "End" row is included):
df.iloc[index_start.item():index_end.item() + 1]

pandas/python: replacing categorical values in dataframe through iteration

I created a dataframe and I am trying to substitute the categorical variables with some numerical values that I calculated via pivot_table. In my code, I iterate through the whole dataframe, and if a cell in one of the categorical columns matches an element of sublist_names, it should be replaced by the element of sublist_values at the same position.
For example, the first value of the column 'Name' is the string 'tom'. 'tom' is the 7th element in sublist_names, which means it should be replaced by the 7th element in sublist_values, which is 150.
I was able to obtain all the needed values, but I am not sure how to solve this last task by iterating over the whole dataframe instead of working column by column.
I hope I explained clearly; feel free to ask any questions.
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd

data = [['tom', 10, 6, 'brown', 200],
        ['nick', 15, 5.10, 'red', 150],
        ['juli', 14, 5.5, 'black', 170],
        ['peter', 10, 6, 'blue', 290],
        ['axel', 15, 5.10, 'yellow', 190],
        ['william', 14, 5.5, 'yellow', 170],
        ['tom', 10, 6, 'orange', 100],
        ['tom', 15, 5.10, 'brown', 150],
        ['angela', 14, 5.5, 'black', 160],
        ['peter', 10, 6, 'purple', 220],
        ['nick', 15, 5.10, 'orange', 150],
        ['aroon', 14, 5.5, 'red', 170]]
df = pd.DataFrame(data, columns=['Name', 'Age', 'height', 'color', 'weight'])

categorical_variables = df.select_dtypes('object')  # categorical variables
categ_var_list = list(categorical_variables)
print(categ_var_list)

condition_pivot_list_names = []
pivot_values_list = []
for i in categ_var_list:
    condition_pivot = df.pivot_table(index=i, values='weight', aggfunc=np.mean)
    pivot_names = condition_pivot.index.values.tolist()
    condition_pivot_list_names.append(pivot_names)
    pivot_values_draft = condition_pivot.values.tolist()
    pivot_values = [i[0] for i in pivot_values_draft]
    pivot_values_list.append(pivot_values)

print(condition_pivot_list_names, 'condition pivot list names')
print(pivot_values_list, 'pivot values list')

sublist_names = [sublists for sublists in condition_pivot_list_names]
print(sublist_names)
sublist_values = [sublists1 for sublists1 in pivot_values_list]
print(sublist_values)

def myfunc(x):
    if x in sublist_names:
        index = sublist_names.index(x)
        return sublist_values[index]
    return x

df['Name'] = df['Name'].apply(lambda x: myfunc(x))
print(df['Name'])
This is what print(df['Name']) shows:
0 tom
1 nick
2 juli
3 peter
4 axel
5 william
6 tom
7 tom
8 angela
9 peter
10 nick
11 aroon
And this is what should show:
0 150
1 150
2 170
3 255
4 190
5 170
6 150
7 150
8 160
9 255
10 150
11 170
You have two categorical columns, Name and color, so you can do something like this:
df['Name'] = df['Name'].apply(lambda x: myfunc(x))
Then create a function myfunc() which receives x from the code above. That code iterates over the column and passes each row's value, one by one, to the function; inside the function you define the logic to convert the categorical value, something like this:
def myfunc(x):
    if x in sublist_names:
        index = sublist_names.index(x)
        return sublist_values[index]
    return x
Do the same thing for the column color.
Try this:
df.Name = np.where(df.groupby('Name', as_index=False)['Name'].cumcount().eq(0), df.Name, df.weight)
Output:
Name Age height color weight
0 tom 10 6.0 brown 200
1 nick 15 5.1 red 150
2 juli 14 5.5 black 170
3 peter 10 6.0 blue 290
4 axel 15 5.1 yellow 190
5 william 14 5.5 yellow 170
6 100 10 6.0 orange 100
7 150 15 5.1 brown 150
8 angela 14 5.5 black 160
9 220 10 6.0 purple 220
10 150 15 5.1 orange 150
11 aroon 14 5.5 red 170
Okay, I see your problem: sublist_names is a list of lists, so x in sublist_names never matches a single name. Just flatten both lists before the function declaration:
sub_names = []
sub_values = []
for i in sublist_names:
    sub_names.extend(i)
for i in sublist_values:
    sub_values.extend(i)
Also don't forget to update the variable names in myfunc() to sub_names and sub_values.
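For reference, the whole pipeline can also be collapsed into a few lines without the intermediate lists; this is a hedged alternative (group means via groupby, replacement via map), not a fix to the code above:
import pandas as pd

# assumes df as defined in the question
for col in df.select_dtypes('object').columns:
    means = df.groupby(col)['weight'].mean()   # e.g. 'tom' -> 150.0, 'peter' -> 255.0
    df[col] = df[col].map(means)

print(df['Name'])
Each value is looked up in the means Series, so 'tom' becomes the average weight of all rows named tom, matching the expected output above.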

How do I calculate an average of a range from a series within a dataframe?

I'm new to Python and to working with data manipulation.
I have a dataframe
df3
Out[22]:
Breed Lifespan
0 New Guinea Singing Dog 18
1 Chihuahua 17
2 Toy Poodle 16
3 Jack Russell Terrier 16
4 Cockapoo 16
.. ... ...
201 Whippet 12--15
202 Wirehaired Pointing Griffon 12--14
203 Xoloitzcuintle 13
204 Yorkie--Poo 14
205 Yorkshire Terrier 14--16
As you observe above, some of the lifespans are given as a range like 14--16. The type of df3['Lifespan'] is
type(df3['Lifespan'])
Out[24]: pandas.core.series.Series
I want each such entry to reflect the average of the two numbers, i.e. 15. I do not want any ranges, just the average as a single number. How do I do this?
Using str.split with expand=True:
df = pd.DataFrame({'Breed': ['Dog1', 'Dog2'],
                   'Lifespan': [12, '14--15']})
df['Lifespan'] = (df['Lifespan']
                  .astype(str).str.split('--', expand=True)
                  .astype(float).mean(axis=1))
df
#   Breed  Lifespan
# 0  Dog1      12.0
# 1  Dog2      14.5
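For reference, this is roughly what the intermediate expand=True step produces. Single values have nothing after the (absent) separator, so the second column holds None, which becomes NaN after astype(float); mean(axis=1) skips NaNs, so plain numbers like 12 pass through unchanged:
df['Lifespan'].astype(str).str.split('--', expand=True)
#     0     1
# 0  12  None
# 1  14    15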

Shift rows with missing data in python

I have a txt file that I read in through python that comes like this:
Text File:
18|Male|66|180|Brown
23|Female|67|120|Brown
16|71|192|Brown
22|Male|68|185|Brown
24|Female|62|100|Blue
One of the rows has missing data and the problem is that when I read it into a dataframe it appears like this:
Age Gender Height Weight Eyes
0 18 Male 66 180 Brown
1 23 Female 67 120 Brown
2 16 71 192 Brown NaN
3 22 Male 68 185 Brown
4 24 Female 62 100 Blue
I'm wondering if there is a way to shift part of a row that has missing data without shifting all of its columns.
Here is what I have so far:
import pandas as pd
df = pd.read_csv('C:/Documents/file.txt', sep='|', names=['Age','Gender', 'Height', 'Weight', 'Eyes'])
df_full = df.loc[df['Gender'].isin(['Male','Female'])]
df_missing = df.loc[~df['Gender'].isin(['Male','Female'])]
df_missing = df_missing.shift(1,axis=1)
df_final = pd.concat([df_full, df_missing])
I was hoping to just separate out the ones with missing data, shift the columns by one, and then put the dataframe back to the data that has no missing data. But I'm not sure how to go about shifting the columns at a certain point. This is the result I'm trying to get to:
Age Gender Height Weight Eyes
0 18 Male 66 180 Brown
1 23 Female 67 120 Brown
2 16 NaN 71 192 Brown
3 22 Male 68 185 Brown
4 24 Female 62 100 Blue
It doesn't really matter how I get it done, but the files I'm using have thousands of rows so I can not fix them individually. Any help is appreciated. Thanks!
Selectively shift a portion of each of the rows that have missing values.
df.apply(lambda r: r[:1].append(r[1:].shift())
         if r['Gender'] not in ['Male', 'Female']
         else r, axis=1)
The misaligned column data for each affected record will be aligned with 'NaN' inserted where the missing value was in the input text.
   Age  Gender  Height  Weight   Eyes             Age  Gender  Height  Weight   Eyes
1   23  Female      67     120  Brown          1   23  Female      67     120  Brown
2   16      71     192   Brown    NaN  ====>   2   16     NaN      71     192  Brown
For a single record, this'll do it:
df.loc[2] = df.loc[2][:1].append(df.loc[2][1:].shift())
Starting at the 'Gender' column, data is shifted right. The default fill is 'NaN'. The 'Age' column is preserved.
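One caveat: Series.append was removed in pandas 2.0. A sketch of the same row-wise shift using pd.concat instead, assuming the df from the question:
import pandas as pd

def realign(r):
    # shift everything after 'Age' one slot to the right for bad rows
    if r['Gender'] not in ['Male', 'Female']:
        return pd.concat([r[:1], r[1:].shift()])
    return r

df = df.apply(realign, axis=1)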
RegEx could help here.
Searching for ^(\d+\|)(\d) and replacing with $1|$2 inserts one extra vertical bar where Gender is missing (group 1 + | + group 2).
This can be done in almost every text editor (Notepad++, VS Code, Sublime, etc.).
See the example at this link: https://regexr.com/50gkh
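Since the files have thousands of rows, the same regex can also be applied in Python before parsing; a sketch, with placeholder file names:
import re

with open('file.txt') as f:
    text = f.read()

# insert an empty Gender field wherever a digit directly follows the age
fixed = re.sub(r'^(\d+\|)(\d)', r'\1|\2', text, flags=re.M)

with open('file_fixed.txt', 'w') as f:
    f.write(fixed)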

Pandas: combine columns without duplicates/ find unique words after combining

I have a dataframe where I would like to concatenate certain columns.
My issue is that the text in these columns may or may not contain duplicate information. I would like to strip out the duplicates in order to retain only the relevant information.
For example, if I had a data frame such as:
pd.read_csv("animal.csv")
animal1 animal2 label
1 cat dog dolphin 19
2 dog cat cat 72
3 pilchard 26 koala 26
4 newt bat 81 bat 81
I want to combine the columns but retain only unique information from each of the strings.
You can see that in row 2, 'cat' is contained in both columns 'animal1' and 'animal2'. In row 3, the number 26 is in both 'animal1' and 'label'. Whereas in row 4, the information in columns 'animal2' and 'label' is already contained, in order, in 'animal1'.
I combine the columns by doing the following
animals["detail"] = animals["animal1"].map(str) + animals["animal2"].map(str) + animals["label"].map(str)
animal1 animal2 label detail
1 cat dog dolphin 19 cat dog dolphin 19
2 dog cat cat 72 dog cat cat 72
3 pilchard 26 koala 26 pilchard 26 koala 26
4 newt bat 81 bat 81 newt bat 81 bat 81
Row 1 is fine, but the other rows, of course, contain duplicates as described above.
The output I would desire is:
animal1 animal2 label detail
1 cat dog dolphin 19 cat dog dolphin 19
2 dog cat cat 72 dog cat 72
3 pilchard 26 koala 26 pilchard koala 26
4 newt bat 81 bat 81 newt bat 81
or if I could retain only the first unique instance of each word/ number per row in the detail column, this would also be suitable i.e.:
detail
1 cat dog dolphin 19
2 dog cat 72
3 pilchard koala 26
4 newt bat 81
I've had a look at doing this for a string in python e.g. How can I remove duplicate words in a string with Python?, How to get all the unique words in the data frame?, show distinct column values in pyspark dataframe: python
but can't figure out how to apply this to individual rows within the detail column. I've looked at splitting the text after I've combined the columns, then using apply and lambda, but haven't got this to work yet. Or is there perhaps a way to do it when combining the columns?
I have the solution in R but want to recode it in Python.
Would greatly appreciate any help or advice. I'm currently using Spyder (Python 3.5).
You can use a custom function: first split by whitespace, then get the unique values with pandas.unique, and finally join back to a string:
animals["detail"] = animals["animal1"].map(str) + ' ' +
animals["animal2"].map(str) + ' ' +
animals["label"].map(str)
animals["detail"] = animals["detail"].apply(lambda x: ' '.join(pd.unique(x.split())))
print (animals)
animal1 animal2 label detail
1 cat dog dolphin 19 cat dog dolphin 19
2 dog cat cat 72 dog cat 72
3 pilchard 26 koala 26 pilchard 26 koala
4 newt bat 81 bat 81 newt bat 81
It is also possible to join the values inside apply:
animals["detail"] = (animals.astype(str)
                     .apply(lambda x: ' '.join(pd.unique(' '.join(x).split())), axis=1))
print(animals)
animal1 animal2 label detail
1 cat dog dolphin 19 cat dog dolphin 19
2 dog cat cat 72 dog cat 72
3 pilchard 26 koala 26 pilchard 26 koala
4 newt bat 81 bat 81 newt bat 81
A solution with set is also possible, but it changes the order:
animals["detail"] = (animals.astype(str)
                     .apply(lambda x: ' '.join(set(' '.join(x).split())), axis=1))
print(animals)
animal1 animal2 label detail
1 cat dog dolphin 19 cat dolphin 19 dog
2 dog cat cat 72 cat dog 72
3 pilchard 26 koala 26 26 pilchard koala
4 newt bat 81 bat 81 bat 81 newt
If you want to keep the order of appearance of the words, you can first split the words in each column, merge them, remove duplicates, and finally join them back together into a new column.
df['detail'] = (df.astype(str).T.apply(lambda x: x.str.split())
                .apply(lambda x: ' '.join(pd.Series(sum(x, [])).drop_duplicates())))
df
Out[46]:
animal1 animal2 label detail
0 1 cat dog dolphin 19 1 cat dog dolphin 19
1 2 dog cat cat 72 2 dog cat 72
2 3 pilchard 26 koala 26 3 pilchard 26 koala
3 4 newt bat 81 bat 81 4 newt bat 81
I'd suggest removing the duplicates at the end of the process by using a Python set.
Here is an example function to do so:
def dedup(value):
    words = set(value.split(' '))
    return ' '.join(words)
That works like this:
val = 'dog cat cat 81'
print(dedup(val))
81 dog cat
In case you want the details ordered, you can use OrderedDict from collections, or pd.unique, instead of set.
Then just apply it (similar to map) on your detail column for the desired result:
animals.detail = animals.detail.apply(dedup)
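A sketch of the order-preserving variant mentioned above: OrderedDict.fromkeys keeps the first occurrence of each word, in order (and works on Python 3.5, where plain dicts are not yet ordered):
from collections import OrderedDict

def dedup_ordered(value):
    return ' '.join(OrderedDict.fromkeys(value.split()))

animals['detail'] = animals['detail'].apply(dedup_ordered)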
