I have image links in the profile column of my pandas dataframe, and I only want to convert the file extension of each image to UPPERCASE.
I have tried the following, but the problem is that it converts the whole string to upper case, and I only want the extension in uppercase:
df.profile.astype(str).str.upper()
The result on my dataframe is:
1 DATA/IMAGES/PCNAYAK1971.PNG
2 DATA/IMAGES/SC_INDIVISIBLE.JPG
3 DATA/IMAGES/DEVPLACEMENT.JPG
4 DATA/IMAGES/PHOENIXINFORMER.JPG
5 DATA/IMAGES/UNIA_MAY.COM/PROFILE_IMAGES/212183...
6 DATA/IMAGES/AADANIELS3.JPG
7 DATA/IMAGES/CHRISTI02463358.JPG
8 DATA/IMAGES/BABIE__BEAR.JPG
9 DATA/IMAGES/NC0303.JPG
I only want it converted like this:
1 data/images/pcnayak1971.PNG
2 data/images/sc_indivisible.JPG
3 data/images/devplacement.JPG
You could use str.rsplit to split the strings on '.' from the end, and then modify and combine them using pandas' vectorized string functions:
l = df.profile.str.rsplit('.', n=1)
l.str[0].str.cat(l.str[-1].str.upper(), sep='.')
Let's try it with the first two rows:
profile
1 data/images/pcnayak1971.png
2 data/images/sc_indivisible.jpg
l = df.profile.str.rsplit('.', n=1)
df['profile'] = l.str[0].str.cat(l.str[-1].str.upper(), sep='.')
profile
1 data/images/pcnayak1971.PNG
2 data/images/sc_indivisible.JPG
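If you prefer a one-liner, a regex replacement with a callable works as well (a sketch, not part of the original answer; it assumes the extension is the final '.xyz' token and a pandas version whose str.replace accepts a callable repl):
df['profile'] = df['profile'].str.replace(r'\.(\w+)$', lambda m: '.' + m.group(1).upper(), regex=True)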
I have a column of 8000 rows, and I need to create a new column whose value is extracted from the existing column.
The strings look like this:
TP-ETU06-01-525-W-133
and I want to create two new columns from it: the value of the first new column is extracted from the second segment, which is ETU06, and the second one from the last segment, which is 133.
I have done this by using:
df["sys_no"] = df.apply(lambda x:x["test_no"].split("-")[1] if (pd.notnull(x["test_no"]) and x["test_no"]!="" and len(x["test_no"].split("-"))>0) else None,axis=1)
df["package_no"] = df.apply(lambda x:x["test_no"].split("-")[-1] if (pd.notnull(x["test_no"]) and x["test_no"]!="" and len(x["test_no"].split("-"))>0) else None,axis=1)
It actually works fine, but the existing column also contains random strings that don't follow the pattern of the others. For those rows I want the new columns left empty.
How should I change my script?
Thank you
Use Series.str.contains to build a mask, then split the values with Series.str.split and select the second and last elements with .str indexing, assigning only the rows filtered by the mask:
print (df)
test_no
0 temp data
1 NaN
2 TP-ETU06-01-525-W-133
mask = df["test_no"].str.contains('-', na=False)
splitted = df["test_no"].str.split("-")
df.loc[mask, "sys_no"] = splitted[mask].str[1]
df.loc[mask, "package_no"] = splitted[mask].str[-1]
print (df)
test_no sys_no package_no
0 temp data NaN NaN
1 NaN NaN NaN
2 TP-ETU06-01-525-W-133 ETU06 133
This approach uses regex and named capture groups to find and extract the strings of interest, in just two lines of code.
Benefit of regex over split:
It is true that regex is not required. However, from the standpoint of data validation, using regex helps to prevent 'stray' data from creeping in. A 'blind' split() simply splits the data on a character; but what if the source data changes? The split function is blind to this, whereas a regex will highlight the issue because the pattern simply won't match. Yes, you may get missing values or an error message, but this is a good thing: you'll be alerted to a data format change, giving you the opportunity to address the issue or update the regex pattern.
Additionally, regex provides a robust solution as the pattern matches the entire string, and anything outside of this pattern is ignored - like the example mentioned in the question.
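A quick illustration of that point (the malformed value here is hypothetical):
import pandas as pd
s = pd.Series(['TP-ETU06-01-525-W-133', 'TP/ETU99/bad-format'])
print(s.str.split('-').str[1])  # the blind split happily yields 'format' for the bad row
print(s.str.extract(r'^[A-Z]{2}-(?P<sys_no>[A-Z]{3}\d{2})-'))  # the regex yields NaN instead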
If you'd like some explanation on the regex pattern itself, just add a comment and I'll update the answer to explain.
Sample Data:
test_no
0 TP-ETU05-01-525-W-005
1 TP-ETU06-01-525-W-006
2 TP-ETU07-01-525-W-007
3 TP-ETU08-01-525-W-008
4 TP-ETU09-01-525-W-009
5 NaN
6 NaN
7 otherstuff
Code:
import re
exp = re.compile(r'^[A-Z]{2}-(?P<sys_no>[A-Z]{3}\d{2})-\d{2}-\d{3}-[A-Z]-(?P<package_no>\d{3})$')
df[['sys_no', 'package_no']] = df['test_no'].str.extract(exp, expand=True)
Output:
test_no sys_no package_no
0 TP-ETU05-01-525-W-005 ETU05 005
1 TP-ETU06-01-525-W-006 ETU06 006
2 TP-ETU07-01-525-W-007 ETU07 007
3 TP-ETU08-01-525-W-008 ETU08 008
4 TP-ETU09-01-525-W-009 ETU09 009
5 NaN NaN NaN
6 NaN NaN NaN
7 otherstuff NaN NaN
I want to order my table by a column. The column is a string that has numbers in it, for example ASH11, ASH2, ASH1, etc. The problem is that sort_values does a "character" sort, so the values from the example will be ordered like this --> ASH1, ASH11, ASH2. And I want the order like this --> AS20H1, AS20H2, AS20H11 (taking the last number into account).
I thought about taking the last characters of the string, but sometimes it would be only the last one and in other cases the last two. Going the other way (taking characters from the beginning) doesn't work either, because the strings are not always the same length (e.g. in some cases the names are ASH1, ASGH22, ASHGT3, etc.)
Use the key parameter (new in pandas 1.1.0):
import re
df.sort_values(by=['xxx'], key=lambda col: col.map(lambda x: int(re.split(r'(\d+)', x)[-2])))
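For example, with the labels from the question (the df and the column name 'xxx' are assumptions):
import re
import pandas as pd

df = pd.DataFrame({'xxx': ['ASH11', 'ASH2', 'ASH1']})
# re.split(r'(\d+)', x) keeps the captured digits, so [-2] is the last number in the string
print(df.sort_values(by=['xxx'], key=lambda col: col.map(lambda x: int(re.split(r'(\d+)', x)[-2]))))
#      xxx
# 2   ASH1
# 1   ASH2
# 0  ASH11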
Using a list comprehension and a regular expression:
>>> import pandas as pd
>>> import re #Regular expression
>>> a = pd.DataFrame({'label':['AS20H1','AS20H2','AS20H11','ASH1','ASGH22','ASHGT3']})
>>> a
label
0 AS20H1
1 AS20H2
2 AS20H11
3 ASH1
4 ASGH22
5 ASHGT3
r'(\d+)(?!.*\d)'
Matches the last number in a string
>>> a['sort_int'] = [ int(re.search(r'(\d+)(?!.*\d)',i).group(0)) for i in a['label']]
>>> a
label sort_int
0 AS20H1 1
1 AS20H2 2
2 AS20H11 11
3 ASH1 1
4 ASGH22 22
5 ASHGT3 3
>>> a.sort_values(by='sort_int',ascending=True)
label sort_int
0 AS20H1 1
3 ASH1 1
1 AS20H2 2
5 ASHGT3 3
2 AS20H11 11
4 ASGH22 22
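Once the order is applied, the helper column can be dropped again (a small follow-up on the same data):
>>> a.sort_values(by='sort_int', ascending=True).drop(columns='sort_int')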
You could extract the trailing integer from your column and then use it to sort your DataFrame:
df["new_index"] = df.yourColumn.str.extract(r'(\d+)$', expand=False).astype(float)  # last number; float keeps NaN for rows without one
df.sort_values(by=["new_index"], inplace=True)
In case you get some NA in your "new_index" column (values with no trailing number), you can use the na_position option of the sort_values method to choose where to put them (beginning or end).
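For example (a minimal sketch on the same df):
df.sort_values(by=["new_index"], inplace=True, na_position="last")  # rows without a number go last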
I want to split a single column of my data frame into three columns. Sample input and output:
[(Col1 = fixed length), (Col2 = dynamic length), (Col3 = remaining part)]
import re
import pandas as pd
text='Raw Data'
out = re.findall(r"RIY-[A-Z]{6}-\d{6}\.\d{6,8}\.\d{5,7}", text)
df = pd.DataFrame(out, columns = ["RIY"])
df["col1"] = df.RIY.str[0:15]
df["col2"] = df.RIY.str[15:24]# need to split based on criteria (find next '.' less 2 char
df["col3"] = df.RIY.str[24:] # remaining all text after splitting 2 column
Output: https://i.stack.imgur.com/Lupcd.png
I tried splitting at fixed lengths (the solution by Roy2012), which works perfectly only for the first part, [0:15]; the length varies for the remaining two columns. For those, I want to split by finding the second dot ('.') and stepping back 2 characters (to avoid cutting off the 46).
Does this work for you?
df.RAW.str.extract(r"(.*)(\d\d\.\d+)(\d\d\.\d+)")
The output I get is:
0 1 2
0 RIY-OUHOMH-1002 24.534768 46.650127
1 RIY-OUHOHH-1017 24.51472 46.663988
2 RIY-OUHOMH-1004 24.532244 46.651758
3 RIY-OUHOHH-1007 24.529029 46.653571
4 RIY-OUHOHH-1006 24.530071 46.651934
5 RIY-OUHOHH-1005 24.531786 46.65279
6 RIY-OUHOMH-1001 24.535972 46.649456
7 RIY-DIRAHH-0151 24.495407 46.641877
8 RIY-DIRAHH-0152 24.494105 46.644253
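If you want the question's column names directly, the same pattern works with named capture groups (a small variant of the code above, nothing else changed):
df = df.join(df.RAW.str.extract(r"(?P<col1>.*)(?P<col2>\d{2}\.\d+)(?P<col3>\d{2}\.\d+)"))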
Given a dataframe full of emails, I want to filter out rows containing potentially blocked domain names or clearly fake emails. The dataframe below represents an example of my data.
>> print(df)
email number
1 fake#fake.com 2
2 real.email#gmail.com 1
3 no.email#email.com 5
4 real#yahoo.com 2
5 rich#money.com 1
I want to filter by two lists. The first list is fake_lst = ['noemail', 'noaddress', 'fake', ... 'no.email'].
The second list is just the set from disposable_email_domains import blocklist converted to a list (or kept as a set).
When I use df = df[~df['email'].str.contains('noemail')] it works fine and filters out that entry. Yet when I do df = df[~df['email'].str.contains(fake_lst)] I get TypeError: unhashable type: 'list'.
The obvious answer is to use df = df[~df['email'].isin(fake_lst)] as in many other stackoverflow questions, like Filter Pandas Dataframe based on List of substrings or pandas filtering using isin function but that ends up having no effect.
I suppose I could use str.contains('string') for each possible list entry, but that is ridiculously cumbersome.
Therefore, I need to filter this dataframe on the substrings contained in the two lists, so that any email containing one of those substrings is removed along with its row.
In the example above, the dataframe after filtering would be:
>> print(df)
email number
2 real.email#gmail.com 1
4 real#yahoo.com 2
5 rich#money.com 1
Here is a potential solution, assuming you have the following df and fake_lst:
df = pd.DataFrame({
'email': ['fake#fake.com', 'real.email#gmail.com', 'no.email#email.com',
'real#yahoo.com', 'rich#money.com'],
'number': [2, 1, 5, 2, 1]
})
fake_lst = ['fake', 'money']
Option 1:
Filter out rows that have any of the fake_lst words in email with apply:
df.loc[
~df['email'].apply(lambda x: any([i in x for i in fake_lst]))
]
email number
1 real.email#gmail.com 1
2 no.email#email.com 5
3 real#yahoo.com 2
Option 2:
Filter out without apply
df.loc[
[not any(i) for i in zip(*[df['email'].str.contains(word) for word in fake_lst])]
]
email number
1 real.email#gmail.com 1
2 no.email#email.com 5
3 real#yahoo.com 2
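A common alternative to both options is to join the list into a single alternation pattern for str.contains; re.escape guards against regex metacharacters such as the dot in 'no.email':
import re
pattern = '|'.join(map(re.escape, fake_lst))
df[~df['email'].str.contains(pattern)]  # drops any row whose email contains a listed substring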
Use Series.isin to check whether each element is contained in values. Another issue is that your fake list contains the local part without the domain, so you need str.split to remove the characters you are not matching against.
Note: str.contains tests whether a pattern or regex is contained within each string of a Series, hence your code df['email'].str.contains('noemail') works fine, but it doesn't accept a list.
df[~df['email'].str.split('#').str[0].isin(fake_lst)]
email number
1 real.email#gmail.com 1
3 real#yahoo.com 2
4 rich#money.com 1
I have a 30+ million row data set that I need to apply a whole host of data transformation rules to. For this task, I am trying to explore Pandas as a possible solution because my current solution isn't very fast.
Currently, I am performing a row by row manipulation of the data set, and then exporting it to a new table (CSV file) on disk.
There are 5 functions users can perform on the data within a given column:
remove white space
capitalize all text
format date
replace letter/number
replace word
My first thought was to use the dataframe's apply or applymap, but this can only be used on a single column.
Is there a way to use apply or applymap to many columns instead of just one?
Is there a better workflow I should consider, since I could be manipulating 1 to n columns in my dataset, where the maximum number of columns is currently around 30?
Thank you
You can use a list comprehension with concat if you need to apply a function that works only on Series:
import pandas as pd
data = pd.DataFrame({'A':[' ff ','2','3'],
'B':[' 77','s gg','d'],
'C':['s',' 44','f']})
print (data)
A B C
0 ff 77 s
1 2 s gg 44
2 3 d f
print (pd.concat([data[col].str.strip().str.capitalize() for col in data], axis=1))
A B C
0 Ff 77 S
1 2 S gg 44
2 3 D F
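If the rules only apply to a subset of columns, a DataFrame-level apply over that subset is another option (a sketch; the column list here is hypothetical):
cols = ['A', 'B']  # the 1:n columns the user selected, up to ~30
data[cols] = data[cols].apply(lambda s: s.str.strip().str.capitalize())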