Sort string columns with numbers in them in Pandas - python

I want to order my table by a column. The column is a string that has numbers in it, for example ASH11, ASH2, ASH1, etc. The problem is that the method sort_values does a "character" (lexicographic) order, so the values from the example will be ordered like this --> ASH1, ASH11, ASH2. And I want the order like this --> AS20H1, AS20H2, AS20H11 (taking into account the last number).
I thought about taking the last characters of the string, but sometimes it would be only the last one and in other cases the last two. The other way around (taking the characters from the beginning) doesn't work either, because the strings are not always of the same length (i.e. in some cases the name is ASH1, ASGH22, ASHGT3, etc.).

Use the key parameter of sort_values (new in pandas 1.1.0):
import re
df.sort_values(by=['xxx'], key=lambda col: col.map(lambda x: int(re.split(r'(\d+)', x)[-2])))
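For example, with the sample labels from the question (a minimal sketch; 'label' stands in for your column name):
>>> import re
>>> import pandas as pd
>>> df = pd.DataFrame({'label': ['ASH11', 'ASH2', 'ASH1']})
>>> # the key callable receives the whole column; map pulls out each trailing number
>>> df.sort_values(by='label', key=lambda col: col.map(lambda x: int(re.split(r'(\d+)', x)[-2])))
label
2 ASH1
1 ASH2
0 ASH11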

Using list comprehension and regular expression:
>>> import pandas as pd
>>> import re #Regular expression
>>> a = pd.DataFrame({'label':['AS20H1','AS20H2','AS20H11','ASH1','ASGH22','ASHGT3']})
>>> a
label
0 AS20H1
1 AS20H2
2 AS20H11
3 ASH1
4 ASGH22
5 ASHGT3
The regex r'(\d+)(?!.*\d)' matches the last number in a string (a run of digits not followed by any later digit):
>>> a['sort_int'] = [ int(re.search(r'(\d+)(?!.*\d)',i).group(0)) for i in a['label']]
>>> a
label sort_int
0 AS20H1 1
1 AS20H2 2
2 AS20H11 11
3 ASH1 1
4 ASGH22 22
5 ASHGT3 3
>>> a.sort_values(by='sort_int',ascending=True)
label sort_int
0 AS20H1 1
3 ASH1 1
1 AS20H2 2
5 ASHGT3 3
2 AS20H11 11
4 ASGH22 22

You could extract the integers from your column and then use them to sort your DataFrame. The conversion to numbers matters; otherwise the extracted digits would sort as strings again:
df["new_index"] = pd.to_numeric(df.yourColumn.str.extract(r'(\d+)(?!.*\d)', expand=False))
df.sort_values(by=["new_index"], inplace=True)
In case you get some NA in your "new_index" column, you can use the na_position option of sort_values to choose where to put them (beginning or end).
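For example (a minimal sketch, assuming "new_index" holds NaN wherever no digits were found):
df.sort_values(by=["new_index"], na_position='last', inplace=True) # or na_position='first'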

Related

Splitting row values and counting uniques from a DataFrame

I have the following data in a column titled Reference:
ABS052
ABS052/01
ABS052/02
ADA010/00
ADD005
ADD005/01
ADD005/02
ADD005/03
ADD005/04
ADD005/05
...
WOO032
WOO032/01
WOO032/02
WOO032/03
WOO045
WOO045/01
WOO045/02
WOO045/03
WOO045/04
I would like to know how to split the row values to create a Dataframe that contains the single Reference code, plus a Count value, for example:
Reference  Count
ABS052     3
ADA010     0
ADD005     2
...        ...
WOO032     3
WOO045     4
I have the following code:
df['Reference'] = df['Reference'].str.split('/')
Results in:
['ABS052'],
['ABS052','01'],
['ABS052','02'],
['ABS052','03'],
...
But I'm not sure how to ditch the trailing digits from the list in each row.
All I want now is to keep the first element ([0]) of each row's list; then I could just call value_counts on the 'Reference' column.
There seems to be something wrong with the expected result listed in the question.
Let's say you want to ditch the digits and count the prefix occurrences:
df.Reference.str.split("/", expand=True)[0].value_counts()
If instead the suffix means something and you want to keep the highest value this should do
df.Reference.str.split("/", expand=True).fillna("00").astype({0: str, 1: int}).groupby(0).max()
You can just use regex to strip the trailing '/' and digits like this:
df = pd.DataFrame({'a':['ABS052','ABS052/01','ABS052/02','ADA010/00','ADD005','ADD005/01','ADD005/02','ADD005/03','ADD005/04','ADD005/05']})
df = df['a'].str.replace(r'/\d+$', '', regex=True).value_counts().reset_index()
Output:
    index  a
0  ADD005  6
1  ABS052  3
2  ADA010  1
You are almost there, you can add expand=True to split and then use groupby:
df['Reference'].str.split("/", expand=True).fillna("--").groupby(0).count()
returns:
        1
0
ABS052  3
ADA010  1
ADD005  6
for the first couple of rows of your data.
The fillna("--") makes sure you also count lines like ABS052 that have no "/", i.e. rows with None in the second column.
To output a df with column names:
df['Reference'] = df['Reference'].str.split('/').str[0]
df_counts = df['Reference'].value_counts().rename_axis('Reference').reset_index(name='Counts')
output
Reference Counts
0 ADD005 6
1 ABS052 3
2 ADA010 1
Explanation - The first line gives a clean series called 'Reference'. The second line gives a count of unique items and then resets the index and renames the columns.

Python Split output column with fixed & dynamic length

I want to split a single data frame column into three columns. Sample input and output:
[(Col1 = fixed length), (Col2 = dynamic length), (Col3 = remaining part)]
import re
import pandas as pd
text='Raw Data'
out = re.findall(r"RIY-[A-Z]{6}-\d{6}\.\d{6,8}\.\d{5,7}", text)
df = pd.DataFrame(out, columns = ["RIY"])
df["col1"] = df.RIY.str[0:15]
df["col2"] = df.RIY.str[15:24]# need to split based on criteria (find next '.' less 2 char
df["col3"] = df.RIY.str[24:] # remaining all text after splitting 2 column
#Output
[1]: https://i.stack.imgur.com/Lupcd.png
I tried splitting with fixed lengths (solution by Roy2012), which works perfectly only for the first part, [0:15]; the length varies for the remaining two columns. I want to split by finding the second dot ('.') minus 2 characters (to avoid removing the 46) and then splitting there.
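A literal sketch of that idea, assuming the column is named RIY as in the code above: locate the second dot, step back 2 characters, and slice there.
def split_at_second_dot(s):
    # index of the second '.', minus 2 so the leading digits of the
    # last part (e.g. the 46) stay attached to col3
    cut = s.find('.', s.find('.') + 1) - 2
    return s[:15], s[15:cut], s[cut:]

df[['col1', 'col2', 'col3']] = pd.DataFrame(
    df['RIY'].apply(split_at_second_dot).tolist(), index=df.index)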
Is this working for you?
df.RIY.str.extract(r"(.*)(\d\d\.\d+)(\d\d\.\d+)")
The output I get is:
0 1 2
0 RIY-OUHOMH-1002 24.534768 46.650127
1 RIY-OUHOHH-1017 24.51472 46.663988
2 RIY-OUHOMH-1004 24.532244 46.651758
3 RIY-OUHOHH-1007 24.529029 46.653571
4 RIY-OUHOHH-1006 24.530071 46.651934
5 RIY-OUHOHH-1005 24.531786 46.65279
6 RIY-OUHOMH-1001 24.535972 46.649456
7 RIY-DIRAHH-0151 24.495407 46.641877
8 RIY-DIRAHH-0152 24.494105 46.644253
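To keep named columns as in the question, the three groups can be assigned straight back (a sketch reusing the regex above):
df[["col1", "col2", "col3"]] = df.RIY.str.extract(r"(.*)(\d\d\.\d+)(\d\d\.\d+)")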

Filter Dataframe by using ~isin([list_of_substrings])

Given a dataframe full of emails, I want to filter out rows containing potentially blocked domain names or clearly fake emails. The dataframe below represents an example of my data.
>> print(df)
email number
1 fake#fake.com 2
2 real.email#gmail.com 1
3 no.email#email.com 5
4 real#yahoo.com 2
5 rich#money.com 1
I want to filter by two lists. The first list is fake_lst = ['noemail', 'noaddress', 'fake', ... 'no.email'].
The second list is just the set from disposable_email_domains import blocklist converted to a list (or kept as a set).
When I use df = df[~df['email'].str.contains('noemail')] it works fine and filters out that entry. Yet when I do df = df[~df['email'].str.contains(fake_lst)] I get TypeError: unhashable type: 'list'.
The obvious answer is to use df = df[~df['email'].isin(fake_lst)] as in many other stackoverflow questions, like Filter Pandas Dataframe based on List of substrings or pandas filtering using isin function but that ends up having no effect.
I suppose I could use str.contains('string') for each possible list entry, but that is ridiculously cumbersome.
Therefore, I need to filter this dataframe based on the substrings contained in the two lists, such that any email containing one of those substrings, and the row in which it appears, is removed.
In the example above, the dataframe after filtering would be:
>> print(df)
email number
2 real.email#gmail.com 1
4 real#yahoo.com 2
5 rich#money.com 1
Here is a potential solution, assuming you have the following df and fake_lst:
df = pd.DataFrame({
    'email': ['fake#fake.com', 'real.email#gmail.com', 'no.email#email.com',
              'real#yahoo.com', 'rich#money.com'],
    'number': [2, 1, 5, 2, 1]
})
fake_lst = ['fake', 'money']
Option 1:
Filter out rows that have any of the fake_lst words in email with apply:
df.loc[
    ~df['email'].apply(lambda x: any(i in x for i in fake_lst))
]
email number
1 real.email#gmail.com 1
2 no.email#email.com 5
3 real#yahoo.com 2
Option 2:
Filter out without apply
df.loc[
    [not any(i) for i in zip(*[df['email'].str.contains(word) for word in fake_lst])]
]
email number
1 real.email#gmail.com 1
2 no.email#email.com 5
3 real#yahoo.com 2
Use Series.isin to check whether each element in the Series is contained in values. Another issue is that your fake list contains the local part of the address without the domain, so you need str.split to drop the characters you are not matching against.
Note: str.contains tests whether a pattern or regex is contained within each string of a Series, hence df['email'].str.contains('noemail') works fine, but it doesn't accept a list.
df[~df['email'].str.split('#').str[0].isin(fake_lst)]
email number
1 real.email#gmail.com 1
3 real#yahoo.com 2
4 rich#money.com 1
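For completeness, the substring match the asker first attempted also works in a single str.contains call if the list is joined into one pattern (a sketch; re.escape stops entries like 'no.email' from being read as regex):
import re
pattern = '|'.join(map(re.escape, fake_lst))
df[~df['email'].str.contains(pattern)]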

How to convert the extension of the images in dataframe to UPPERCASE

I have links to images in the profile column of my pandas dataframe, and I only want to convert the extension of each image to UPPERCASE.
I have tried this, but the problem is that the whole string ends up in uppercase; I only want the extension of the image in uppercase:
df.profile.astype(str).str.upper()
The results of my dataframe are
1 DATA/IMAGES/PCNAYAK1971.PNG
2 DATA/IMAGES/SC_INDIVISIBLE.JPG
3 DATA/IMAGES/DEVPLACEMENT.JPG
4 DATA/IMAGES/PHOENIXINFORMER.JPG
5 DATA/IMAGES/UNIA_MAY.COM/PROFILE_IMAGES/212183...
6 DATA/IMAGES/AADANIELS3.JPG
7 DATA/IMAGES/CHRISTI02463358.JPG
8 DATA/IMAGES/BABIE__BEAR.JPG
9 DATA/IMAGES/NC0303.JPG
I just want to convert it like this:
1 data/images/pcnayak1971.PNG
2 data/images/sc_indivisible.JPG
3 data/images/devplacement.JPG
You could use str.rsplit to split the strings on '.' from the end, and then modify and combine them using pandas' vectorized string functions:
l = df.profile.str.rsplit('.', n=1)
l.str[0].str.cat(l.str[-1].str.upper(), sep='.')
Let's try with the first two rows:
profile
1 data/images/pcnayak1971.png
2 data/images/cs_indivisible.jpg
l = df.profile.str.rsplit('.', n=1)
df['profile'] = l.str[0].str.cat(l.str[-1].str.upper(), sep='.')
profile
1 data/images/pcnayak1971.PNG
2 data/images/cs_indivisible.JPG
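An alternative that avoids the split/recombine round trip is str.replace with a callable replacement, which uppercases only the matched extension (a sketch):
df['profile'] = df['profile'].str.replace(
    r'\.(\w+)$', lambda m: '.' + m.group(1).upper(), regex=True)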

split one pandas column text to multiple columns

For example, I have one pandas column containing:
text
A1V2
B2C7Z1
I want to split it into 26 (A-Z) columns, one per letter, holding the value that follows the letter; if a letter is missing, use -1.
So, it can be
text A B C D ... Z
A1V2 1 -1 -1 -1 ... -1
B2C7Z1 -1 2 7 -1 ... 1
Is there any fast way rather than using df.apply()?
Follow-up:
Thanks Psidom for the brilliant answer. When I ran the method on 4 million rows, it took an hour. I hope there's a way to make it faster; str.extractall() seems to be the most time-consuming part.
Try str.extractall with the regex (?P<key>[A-Z])(?P<value>[0-9]+), which extracts the key ([A-Z]) and value ([0-9]+) into separate columns; a long-to-wide transform should get you there.
Here the regex (?P<key>[A-Z])(?P<value>[0-9]+) matches a letter-digits pattern, and the two capture groups go into two separate columns of the result, named key and value (via the ?P<> syntax);
And since extractall puts multiple matches into separate rows, you will need to transform it to wide format with unstack on the key column:
(df.text.str.extractall("(?P<key>[A-Z])(?P<value>[0-9]+)")
.reset_index('match', drop=True)
.set_index('key', append=True)
.value.unstack('key').fillna(-1))
#key A B C V Z
# 0 1 -1 -1 2 -1
# 1 -1 2 7 -1 1
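On the follow-up about speed: one option worth profiling is dropping to plain re at the Python level and building the wide frame in one go, skipping the extractall/unstack machinery (a sketch, not benchmarked at the 4-million-row scale):
import re
import string
import pandas as pd

pat = re.compile(r"([A-Z])([0-9]+)")
# one dict per row, e.g. {'A': 1, 'V': 2}
rows = [{k: int(v) for k, v in pat.findall(s)} for s in df['text']]
wide = (pd.DataFrame(rows, index=df.index)
        .reindex(columns=list(string.ascii_uppercase)) # force all 26 A-Z columns
        .fillna(-1)
        .astype(int))
result = df.join(wide)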
