Pandas: combine columns without duplicates/ find unique words after combining - python

I have a dataframe where I would like to concatenate certain columns.
My issue is that the text in these columns may or may not contain duplicate information. I would like to strip out the duplicates in order to retain only the relevant information.
For example, if I had a data frame such as:
animals = pd.read_csv("animal.csv")
     animal1      animal2  label
1    cat dog      dolphin     19
2    dog cat      cat         72
3    pilchard 26  koala       26
4    newt bat 81  bat         81
I want to combine the columns but retain only unique information from each of the strings.
You can see that in row 2, 'cat' is contained in both columns 'animal1' and 'animal2'. In row 3, the number 26 is in both columns 'animal1' and 'label'. In row 4, the information in columns 'animal2' and 'label' is already contained, in order, in 'animal1'.
I combine the columns by doing the following:
animals["detail"] = animals["animal1"].map(str) + " " + animals["animal2"].map(str) + " " + animals["label"].map(str)
     animal1      animal2  label  detail
1    cat dog      dolphin     19  cat dog dolphin 19
2    dog cat      cat         72  dog cat cat 72
3    pilchard 26  koala       26  pilchard 26 koala 26
4    newt bat 81  bat         81  newt bat 81 bat 81
Row 1 is fine, but the other rows, of course, contain duplicates as described above.
The output I would desire is:
     animal1      animal2  label  detail
1    cat dog      dolphin     19  cat dog dolphin 19
2    dog cat      cat         72  dog cat 72
3    pilchard 26  koala       26  pilchard koala 26
4    newt bat 81  bat         81  newt bat 81
Or, if I could retain only the first unique instance of each word/number per row in the detail column, that would also be suitable, i.e.:
detail
1 cat dog dolphin 19
2 dog cat 72
3 pilchard koala 26
4 newt bat 81
I've had a look at doing this for a string in Python, e.g. 'How can I remove duplicate words in a string with Python?', 'How to get all the unique words in the data frame?', 'show distinct column values in pyspark dataframe: python',
but can't figure out how to apply this to individual rows within the detail column. I've looked at splitting the text after combining the columns, then using apply and lambda, but haven't got this to work yet. Or is there perhaps a way to do it when combining the columns?
I have a solution in R but want to recode it in Python.
Would greatly appreciate any help or advice. I'm currently using Spyder (Python 3.5).
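For reference, here is a minimal way to construct the example frame (a sketch; the exact split of each row across the columns is inferred from the duplicate description above):
import pandas as pd

# Assumed reconstruction of animal.csv, column split as shown in the table above
animals = pd.DataFrame({
    'animal1': ['cat dog', 'dog cat', 'pilchard 26', 'newt bat 81'],
    'animal2': ['dolphin', 'cat', 'koala', 'bat'],
    'label':   ['19', '72', '26', '81'],
}, index=[1, 2, 3, 4])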

You can add a custom function: first split by whitespace, then get the unique values with pandas.unique, and finally join back to a string:
animals["detail"] = animals["animal1"].map(str) + ' ' +
animals["animal2"].map(str) + ' ' +
animals["label"].map(str)
animals["detail"] = animals["detail"].apply(lambda x: ' '.join(pd.unique(x.split())))
print (animals)
     animal1      animal2  label  detail
1    cat dog      dolphin     19  cat dog dolphin 19
2    dog cat      cat         72  dog cat 72
3    pilchard 26  koala       26  pilchard 26 koala
4    newt bat 81  bat         81  newt bat 81
It is also possible to join the values inside apply:
animals["detail"] = animals.astype(str)
.apply(lambda x: ' '.join(pd.unique(' '.join(x).split())),axis=1)
print (animals)
     animal1      animal2  label  detail
1    cat dog      dolphin     19  cat dog dolphin 19
2    dog cat      cat         72  dog cat 72
3    pilchard 26  koala       26  pilchard 26 koala
4    newt bat 81  bat         81  newt bat 81
A solution with set is also possible, but it changes the order:
animals["detail"] = animals.astype(str)
.apply(lambda x: ' '.join(set(' '.join(x).split())), axis=1)
print (animals)
     animal1      animal2  label  detail
1    cat dog      dolphin     19  cat dolphin 19 dog
2    dog cat      cat         72  cat dog 72
3    pilchard 26  koala       26  26 pilchard koala
4    newt bat 81  bat         81  bat 81 newt

If you want to keep the order in which the words appear, you can first split the words in each column, merge them, remove the duplicates, and finally join them back together into a new column.
df['detail'] = (df.astype(str).T.apply(lambda x: x.str.split())
                .apply(lambda x: ' '.join(pd.Series(sum(x, [])).drop_duplicates())))
df
Out[46]:
animal1 animal2 label detail
0 1 cat dog dolphin 19 1 cat dog dolphin 19
1 2 dog cat cat 72 2 dog cat 72
2 3 pilchard 26 koala 26 3 pilchard 26 koala
3 4 newt bat 81 bat 81 4 newt bat 81

I'd suggest removing the duplicates at the end of the process by using a Python set.
Here is an example function to do so:
def dedup(value):
    words = set(value.split(' '))
    return ' '.join(words)
That works like this:
val = 'dog cat cat 81'
print(dedup(val))
81 dog cat
In case you want the details ordered, you can use OrderedDict from collections or pd.unique instead of set.
Then just apply it (similar to map) on your detail column for the desired result:
animals.detail = animals.detail.apply(dedup)
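For example, an order-preserving variant of dedup (a sketch, not part of the original answer) can use OrderedDict and be applied in exactly the same way:
from collections import OrderedDict

def dedup_ordered(value):
    # fromkeys keeps only the first occurrence of each word, preserving order
    return ' '.join(OrderedDict.fromkeys(value.split()))

print(dedup_ordered('dog cat cat 72'))  # dog cat 72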

Related

How to filter dataframe columns between two rows that contain specific string in column?

I am trying to understand how to select only those rows in my dataframe that are between two specific rows. These rows contain two specific strings in one of the columns. I will explain further with this example.
I have the following dataframe:
String Value
-------------------------
0 Blue 45
1 Red 35
2 Green 75
3 Start 65
4 Orange 33
5 Purple 65
6 Teal 34
7 Indigo 44
8 End 32
9 Yellow 22
10 Red 14
There is only one instance of "Start" and only one instance of "End" in the "String" column. I only want the rows of this dataframe that are between the rows that contain "Start" and "End" in the "String" column, and so I want to produce this output dataframe:
String Value
-------------------------
3 Start 65
4 Orange 33
5 Purple 65
6 Teal 34
7 Indigo 44
8 End 32
Also, I want to preserve the order of the rows I keep, i.e. the order "Start", "Orange", "Purple", "Teal", "Indigo", "End".
I know I can get the index of these specific rows by doing:
index_start = df.index[df['String'] == 'Start']
index_end = df.index[df['String'] == 'End']
But I am not sure how to actually filter out all rows that are not between these two strings. How can I accomplish this in python?
If both values are present, you can temporarily set "String" as the index:
df.set_index('String').loc['Start':'End'].reset_index()
output:
String Value
0 Start 65
1 Orange 33
2 Purple 65
3 Teal 34
4 Indigo 44
5 End 32
Alternatively, using isin (then the order of Start/End doesn't matter):
m = df['String'].isin(['Start', 'End']).cumsum().eq(1)
df[m|m.shift()]
output:
String Value
3 Start 65
4 Orange 33
5 Purple 65
6 Teal 34
7 Indigo 44
8 End 32
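To see why the mask works, here is a commented sketch of the same logic (the frame is rebuilt from the example above):
import pandas as pd

df = pd.DataFrame({
    'String': ['Blue', 'Red', 'Green', 'Start', 'Orange', 'Purple',
               'Teal', 'Indigo', 'End', 'Yellow', 'Red'],
    'Value': [45, 35, 75, 65, 33, 65, 34, 44, 32, 22, 14],
})

hits = df['String'].isin(['Start', 'End'])    # True only at the 'Start' and 'End' rows
m = hits.cumsum().eq(1)                       # True from 'Start' up to, but not including, 'End'
mask = m | m.shift(fill_value=False)          # shifting by one extends the window so 'End' is kept
print(df[mask])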
This should be enough. iloc[] locates rows by position, and it works the same way as slicing a list.
index_start = df.index[df['String'] == 'Start']
index_end = df.index[df['String'] == 'End']
df.iloc[index_start[0]:index_end[0]+1]
More information: https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.iloc.html
You can build a boolean mask using eq + cummax and filter:
out = df[df['String'].eq('Start').cummax() & df.loc[::-1, 'String'].eq('End').cummax()]
Output:
String Value
3 Start 65
4 Orange 33
5 Purple 65
6 Teal 34
7 Indigo 44
8 End 32
Since you already computed the index values in your attempt:
df.iloc[index_start.item(): index_end.item() + 1]

Understanding Pandas Pivot function

I want to convert a categorical column in a pandas dataframe to multiple columns containing values. Here is a minimal example dataframe
dfTest = pd.DataFrame({
'animal' : ['cat','cat','dog','dog', 'mouse', 'mouse', 'rat', 'rat'],
'color' : ['black', 'white', 'black', 'white', 'black', 'white', 'black', 'white'],
'weight' : np.random.uniform(3, 20, 8)
})
dfTest
The resulting table has one row per animal/color combination, with a random weight in each row.
According to pandas user guide, it seems to me that what I want to do is called a pivot. Namely, what I want to do should look something like this
animal weight_black weight_white
0 cat 1.23456 2.34234
1 dog 3.634634 3.4554646
2 mouse 5.24234 5.463452
3 rat 4.56456 2.3364
However, when I run
dfTest.pivot(columns='color', values='weight')
I get a frame where the animal column is gone and every row has a NaN in one of the two color columns.
I don't want the other categorical columns (such as animal) to disappear. Also, I don't want NaNs in between; I want everything to be compact. How do I do this?
EDIT: Here's a more involved example of what I want
animal color hair_length weight
1 cat black long 1.23
2 cat white long 2.34
3 cat black short 34534
4 cat white short 345
5 dog black long 234
6 dog white long 123
7 dog black short 444
8 dog white short 345
9 rat black long 5465
10 rat white long 2343
11 rat black short 123
12 rat white short 2343
13 bat black long 423
14 bat white long 23
15 bat black short 11123
16 bat white short 13423
I want to convert it to
animal hair_length weight_black weight_white
1 cat long 2.34 235
2 cat short 345 3423
3 dog long 123 56346
4 dog short 345 .... you get the point
5 rat long 2343
6 rat short 2343
7 bat long 23
8 bat short 13423
OK, I think I figured it out; @Randy's hint was actually enough:
index = list(set(dfTest.columns) - {'color', 'weight'})
dfResult = dfTest.pivot(index=index, columns='color', values='weight').reset_index()
So we:
Put all of the columns except for the two columns of interest into index
Perform the pivot, which results in a complicated hierarchical index
Convert from the complicated index back to a simple index by doing reset_index()
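Putting those steps together on the larger example from the EDIT, a sketch looks like this (values taken from the table above; passing a list as index= to DataFrame.pivot needs a reasonably recent pandas, roughly 1.1+):
import pandas as pd

df = pd.DataFrame({
    'animal':      ['cat', 'cat', 'cat', 'cat', 'dog', 'dog', 'dog', 'dog'],
    'color':       ['black', 'white', 'black', 'white', 'black', 'white', 'black', 'white'],
    'hair_length': ['long', 'long', 'short', 'short', 'long', 'long', 'short', 'short'],
    'weight':      [1.23, 2.34, 34534, 345, 234, 123, 444, 345],
})

index = list(set(df.columns) - {'color', 'weight'})
out = df.pivot(index=index, columns='color', values='weight').reset_index()

# Optionally rename the value columns to match the desired weight_black / weight_white layout
out = out.rename(columns={'black': 'weight_black', 'white': 'weight_white'})
print(out)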

Removing from pandas dataframe all rows having less than 3 characters

I have this dataframe
Word Frequency
0 : 79
1 , 60
2 look 26
3 e 26
4 a 25
... ... ...
95 trump 2
96 election 2
97 step 2
98 day 2
99 university 2
I would like to remove all words having less than 3 characters.
I tried as follows:
df['Word'] = df['Word'].str.findall(r'\w{3,}').str.join(' ')
but it does not remove them from my dataset.
Can you please tell me how to remove them?
My expected output would be:
Word Frequency
2 look 26
... ... ...
95 trump 2
96 election 2
97 step 2
98 day 2
99 university 2
Try with
df = df[df['Word'].str.len()>=3]
Instead of attempting a regular expression (which only rewrites the Word column and never drops any rows), you can use .str.len() to get the length of each string in your column. Then you can simply filter on that length being >= 3.
Should look like:
df.loc[df["Word"].str.len() >= 3]
Please try:
df[df.Word.str.len()>=3]
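For completeness, a small end-to-end sketch (data abbreviated from the example above). Note that .str.len() gives NaN for missing values, so rows with a missing Word are dropped by the comparison as well:
import pandas as pd

df = pd.DataFrame({'Word': [':', ',', 'look', 'e', 'a', 'trump', 'day'],
                   'Frequency': [79, 60, 26, 26, 25, 2, 2]})

out = df[df['Word'].str.len() >= 3]
print(out)
#     Word  Frequency
# 2   look         26
# 5  trump          2
# 6    day          2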

Shift rows with missing data in python

I have a txt file that I read in through python that comes like this:
Text File:
18|Male|66|180|Brown
23|Female|67|120|Brown
16|71|192|Brown
22|Male|68|185|Brown
24|Female|62|100|Blue
One of the rows has missing data and the problem is that when I read it into a dataframe it appears like this:
Age Gender Height Weight Eyes
0 18 Male 66 180 Brown
1 23 Female 67 120 Brown
2 16 71 192 Brown NaN
3 22 Male 68 185 Brown
4 24 Female 62 100 Blue
I'm wondering if there is a way to shift the row that has missing data over without shifting all columns.
Here is what I have so far:
import pandas as pd
df = pd.read_csv('C:/Documents/file.txt', sep='|', names=['Age','Gender', 'Height', 'Weight', 'Eyes'])
df_full = df.loc[df['Gender'].isin(['Male','Female'])]
df_missing = df.loc[~df['Gender'].isin(['Male','Female'])]
df_missing = df_missing.shift(1,axis=1)
df_final = pd.concat([df_full, df_missing])
I was hoping to just separate out the rows with missing data, shift their columns by one, and then put them back together with the rows that have no missing data. But I'm not sure how to shift the columns starting at a certain point. This is the result I'm trying to get to:
Age Gender Height Weight Eyes
0 18 Male 66 180 Brown
1 23 Female 67 120 Brown
2 16 NaN 71 192 Brown
3 22 Male 68 185 Brown
4 24 Female 62 100 Blue
It doesn't really matter how I get it done, but the files I'm using have thousands of rows so I can not fix them individually. Any help is appreciated. Thanks!
Selectively shift a portion of each row that has missing values:
df.apply(lambda r: r[:1].append(r[1:].shift())
                   if r['Gender'] not in ['Male', 'Female']
                   else r, axis=1)
The misaligned column data for each affected record is realigned, with NaN inserted where the value was missing in the input text.
   Age  Gender Height Weight   Eyes               Age  Gender Height Weight   Eyes
1   23  Female     67    120  Brown   ======>  1   23  Female     67    120  Brown
2   16      71    192  Brown    NaN   ======>  2   16     NaN     71    192  Brown
For a single record, this'll do it:
df.loc[2] = df.loc[2][:1].append(df.loc[2][1:].shift())
Starting at the 'Gender' column, data is shifted right. The default fill is 'NaN'. The 'Age' column is preserved.
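Note that Series.append was deprecated and later removed in newer pandas releases; an equivalent sketch of the same idea using pd.concat (not from the original answer) would be:
import pandas as pd

def realign(r):
    # Keep the first field ('Age'); shift everything from 'Gender' onward one slot to the right
    if r['Gender'] not in ['Male', 'Female']:
        return pd.concat([r.iloc[:1], r.iloc[1:].shift()])
    return r

df = df.apply(realign, axis=1)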
RegEx could help here.
Search for ^(\d+\|)(\d) and replace with $1|$2 (this just adds one vertical bar where Gender is missing: "group 1 + | + group 2").
This can be done in almost every text editor (Notepad++, VS Code, Sublime, etc.).
See the example at this link: https://regexr.com/50gkh
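If you prefer to do the same fix in Python before parsing, a sketch along those lines (file names assumed) could be:
import re

with open('file.txt') as f:
    text = f.read()

# Insert an empty Gender field wherever the second field starts with a digit
fixed = re.sub(r'^(\d+\|)(\d)', r'\1|\2', text, flags=re.MULTILINE)

with open('file_fixed.txt', 'w') as f:
    f.write(fixed)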

randomly subsampling lines in a file

I have a file like this:
Tree 5
Jaguar 9
Cat 23
Monkey 12
Gorilla 67
Is it possible to randomly subsample 3 of these lines?
For example:
Jaguar 9
Gorilla 67
Tree 5
or
Monkey 12
Tree 5
Cat 23
etc.?
Using random.sample on readlines:
import random
random.sample(open('foo.txt', 'r').readlines(), 3)
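A slightly fuller sketch of the same idea, using a context manager so the file handle is closed (file name assumed):
import random

with open('foo.txt') as f:
    lines = f.readlines()

for line in random.sample(lines, 3):
    print(line, end='')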
