Pandas how to create unique 4 character string from longer string - python

I have a pandas dataframe with strings. I would like to shorten them, so I decided to remove the vowels. My next step is to take the first four characters of each string, but I am running into collisions. Is there a smarter way to do this so that I avoid duplicate strings while still keeping them to 4 characters?
import pandas as pd
import re
d = {'test': ['gregorypolanco','franciscoliriano','chrisarcher', 'franciscolindor']}
df = pd.DataFrame(data=d)
def remove_vowels(r):
    result = re.sub(r'[AEIOU]', '', r, flags=re.IGNORECASE)
    return result
no_vowel = pd.DataFrame(df['test'].apply(remove_vowels))
no_vowel['test'].str[0:4]
Output:
0 grgr
1 frnc
2 chrs
3 frnc
Name: test, dtype: object
From the above you can see that 'franciscoliriano' and 'franciscolindor' are the same when shortened.
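One possible way to break such ties, offered only as a sketch and not taken from the original thread: keep the four-character code for the first occurrence, and for each repeat swap the last character for a running digit. The column name code and the variable dup_rank are made up for illustration, and the sketch assumes that fewer than eleven rows share a base code and that a digit-suffixed code never matches another row's base code.

```python
import re
import pandas as pd

df = pd.DataFrame({"test": ["gregorypolanco", "franciscoliriano",
                            "chrisarcher", "franciscolindor"]})

# Strip vowels, then take the first four characters as the base code.
base = (df["test"]
        .str.replace(r"[aeiou]", "", flags=re.IGNORECASE, regex=True)
        .str[:4])

# Number repeated base codes 0, 1, 2, ... in order of appearance.
dup_rank = base.groupby(base).cumcount()

# The first occurrence keeps its code; repeats replace the last character
# with their rank (assumption: ranks stay in the range 1-9).
df["code"] = base.where(dup_rank == 0, base.str[:3] + dup_rank.astype(str))
print(df)
```

Checking df["code"].is_unique afterwards is a cheap way to confirm that the assumptions held for a given dataset.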

Related

Extracting a string from between two strings in a dataframe

I'm trying to extract a value from my data frame.
I have a column ['DESC'] that contains strings in the following format:
_000it_ZZZ$$$-
_0780it_ZBZT$$$-
_011it_BB$$$-
_000it_CCCC$$$-
I want to extract the string between 'it_' and '$$$'
I have tried this code, but it does not seem to work:
# initializing substrings
sub1 = "it_"
sub2 = "$$$"
# getting index of substrings
idx1 = df['DESC'].find(sub1)
idx2 = df['DESC'].find(sub2)
# length of substring 1 is added to
# get string from next character
df['results'] = df['DESC'][idx1 + len(sub1) + 1: idx2]
I would appreciate your help
You can use str.extract to get the desired output in your new column.
import pandas as pd
import re
df = pd.DataFrame({
    'DESC' : ["_000it_ZZZ$$$-", "_0780it_ZBZT$$$-", "_011it_BB$$$-", "_000it_CCCC$$$-", "_000it_123$$$-"]
})
pat = r"(?<=it_)(.+)(?=[\$]{3}-)"
df['results'] = df['DESC'].str.extract(pat)
print(df)
DESC results
0 _000it_ZZZ$$$- ZZZ
1 _0780it_ZBZT$$$- ZBZT
2 _011it_BB$$$- BB
3 _000it_CCCC$$$- CCCC
4 _000it_123$$$- 123
You can see the regex pattern on Regex101 for more details.
You could try using a regex pattern. It matches your cases you listed here, but I can't guarantee that it will generalize to all possible patterns.
import re
string = "_000it_ZZZ$$$-"
p = re.compile(r"(?<=it_)(.*)(?<!\W)")
m = p.findall(string)
print(m) # ['ZZZ']
The pattern looks behind for it_ and then matches greedily up to the last word character in the string, so the trailing non-word characters ($$$-) are left out.
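The single-string pattern above can also be applied to a whole column. The following is a minimal sketch, assuming the delimiters are always the literal it_ and $$$; it uses a lazy capture group rather than lookarounds, which is a different pattern from the ones in the answers above.

```python
import pandas as pd

df = pd.DataFrame({
    "DESC": ["_000it_ZZZ$$$-", "_0780it_ZBZT$$$-", "_011it_BB$$$-", "_000it_CCCC$$$-"]
})

# Capture the shortest run of characters between "it_" and the literal "$$$".
# expand=False returns a Series instead of a one-column DataFrame.
df["results"] = df["DESC"].str.extract(r"it_(.*?)\${3}", expand=False)
print(df)
```

Rows that do not contain both delimiters get a missing value in results instead of raising an error.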

Regex replace first two letters within column in python

I have a dataframe such as
COL1
A_element_1_+_none
C_BLOCA_element
D_element_3
element_'
BasaA_bloc
B_basA_bloc
BbasA_bloc
and I would like to remove the first 2 letters within each row of COL1 only if they are within that list :
the_list =['A_','B_','C_','D_']
Then I should get the following output:
COL1
element_1_+_none
BLOCA_element
element_3
element_'
BasaA_bloc
basA_bloc
BbasA_bloc
So far I tried the following :
df['COL1']=df['COL1'].str.replace("A_","")
df['COL1']=df['COL1'].str.replace("B_","")
df['COL1']=df['COL1'].str.replace("C_","")
df['COL1']=df['COL1'].str.replace("D_","")
But this also removes the pattern elsewhere in the string (such as the A_ inside row 2) instead of removing only the first 2 letters...
If the values to replace in the_list always have that format, you could also consider using str.replace with a simple pattern matching an uppercase char A-D followed by an underscore at the start of the string ^[A-D]_
import pandas as pd
strings = [
    "A_element_1_+_none ",
    "C_BLOCA_element ",
    "D_element_3",
    "element_'",
    "BasaA_bloc",
    "B_basA_bloc",
    "BbasA_bloc"
]
df = pd.DataFrame(strings, columns=["COL1"])
# regex=True is required in pandas 2.0+, where str.replace treats the pattern literally by default
df['COL1'] = df['COL1'].str.replace(r"^[A-D]_", "", regex=True)
print(df)
Output
COL1
0 element_1_+_none
1 BLOCA_element
2 element_3
3 element_'
4 BasaA_bloc
5 basA_bloc
6 BbasA_bloc
You can also use the apply() function from pandas. If the string starts with one of the patterns concerned, we omit the first two characters; otherwise we return the whole string.
df["COL1"] = df["COL1"].apply(lambda x: x[2:] if x.startswith(("A_","B_","C_","D_")) else x)
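If the prefixes should come from the_list itself rather than being hard-coded, one option is to build the anchored pattern from the list. This is a sketch, not something from the answers above; the variable name pattern is illustrative.

```python
import re
import pandas as pd

the_list = ["A_", "B_", "C_", "D_"]
df = pd.DataFrame({"COL1": [
    "A_element_1_+_none", "C_BLOCA_element", "D_element_3", "element_'",
    "BasaA_bloc", "B_basA_bloc", "BbasA_bloc",
]})

# Escape each prefix so it is matched literally, join the alternatives,
# and anchor the group at the start of the string with ^.
pattern = "^(?:" + "|".join(re.escape(p) for p in the_list) + ")"

df["COL1"] = df["COL1"].str.replace(pattern, "", regex=True)
print(df)
```

Because of the ^ anchor, an A_ in the middle of a value such as C_BLOCA_element is left alone.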

In python, I want to delete the last letter from the string value in the data and convert it to a number

Data
I want to remove the + sign at the end of the Installs column in the image and convert it to a number.
import numpy as np
import pandas as pd
import os
data = pd.read_csv("../input/googleplaystore.csv")
data.info()
data.head(10)
if data.Installs.endswith("+"):
    data.Installs = data.Installs[:-1]
With a pandas Series like this:
>>> import pandas as pd
>>> installs = pd.Series(["10,000+", "500,000+", "5,000,000+"])
>>> print(installs)
0 10,000+
1 500,000+
2 5,000,000+
dtype: object
Use the pandas str accessor to replace all occurrences of "+" and "," with an empty string. This is far more robust than just removing the last character.
>>> installs = installs.str.replace("+", "", regex=False)
>>> installs = installs.str.replace(",", "", regex=False)
To apply numeric functions afterwards (e.g. sum) change the datatype to int.
>>> installs = installs.astype(int)
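With regular expressions we can do both replacements in one step. (The brackets [] define a set of characters to be replaced.) Starting again from the original strings:
>>> installs = pd.Series(["10,000+", "500,000+", "5,000,000+"])
>>> installs = installs.str.replace("[+,]", "", regex=True).astype(int)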
>>> print(installs)
0 10000
1 500000
2 5000000
dtype: int64
Conclusions:
This should solve your problem:
data.Installs = data.Installs.str.replace("[+,]", "", regex=True).astype(int)
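If the column might contain entries that are not numbers at all, astype(int) will raise. A hedged sketch of a more forgiving conversion uses pd.to_numeric with errors="coerce". The value "Free" below is a made-up bad entry for illustration, not something taken from the question's data.

```python
import pandas as pd

# "Free" is a hypothetical non-numeric entry used only to show the behaviour.
installs = pd.Series(["10,000+", "500,000+", "Free"])

# Strip "+" and "," first, exactly as in the answer above.
cleaned = installs.str.replace(r"[+,]", "", regex=True)

# errors="coerce" turns anything that still is not a number into NaN
# instead of raising, so the result is a float column.
numbers = pd.to_numeric(cleaned, errors="coerce")
print(numbers)
```

The NaN rows can then be inspected or dropped before converting to an integer type.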
You can refer to GeeksForGeeks for this:
https://www.geeksforgeeks.org/python-remove-last-character-in-list-of-strings/
# Python3 code to demonstrate
# remove last character from list of strings
# using map() + lambda
# initializing list
test_list = ['Manjeets']
# printing original list
print("The original list : " + str(test_list))
# using map() + lambda
# remove last character from list of strings
res = list(map(lambda i: i[ : -1], test_list))
# printing result
print("The list after removing last characters : " + str(res))
Output:
The original list : ['Manjeets']
The list after removing last characters : ['Manjeet']
import numpy as np
import pandas as pd
import os
data = pd.read_csv("../input/googleplaystore.csv")
data.info()
data.head(10)
# A Series has no .endswith(); use the vectorized .str accessor instead.
data.Installs = data.Installs.str.rstrip("+").str.replace(",", "", regex=False).astype(int) # Assuming of course that every value looks like "10,000+".

Conversion of List of words (inside a dataframe) to a set of words

In my dataframe, I have a column whose values are lists like [cell, protein, expression]. I want to convert each one to a plain sequence of words like cell, protein, expression, and this should apply to the entire column of the dataframe. Please suggest a possible way to do it.
Try this:
data['column_name'] = data['column_name'].apply(lambda x: ', '.join(x))
The issue is that df['Final_Text'] is not a list but rather a string. Try using ast.literal_eval first:
import ast
from io import StringIO
import pandas as pd
# your sample df
s = """
,Final_Text
0,"['study', 'response', 'cell']"
1,"['cell', 'protein', 'effect']"
2,"['cell', 'patient', 'expression']"
3,"['patient', 'cell', 'study']"
4,"['study', 'cell', 'activity']"
"""
df = pd.read_csv(StringIO(s))
# convert your string representation of a list to an actual list
df['Final_Text'] = df['Final_Text'].apply(ast.literal_eval)
# use a lambda expression with join to keep the text inside the list
df['Final_Text'] = df['Final_Text'].apply(lambda x: ', '.join(x))
Unnamed: 0 Final_Text
0 0 study, response, cell
1 1 cell, protein, effect
2 2 cell, patient, expression
3 3 patient, cell, study
4 4 study, cell, activity

String function on a pandas series

I want to use the string function below (with text.lower) on a pandas Series instead of on a text file. I tried different methods to convert the series to a list and then to a string, but no luck. I am still not able to use the function below directly. Help is much appreciated.
def words(text):
    return re.findall(r'\w+', text.lower())
WORDS = Counter(words(open('some.txt').read()))
I think you need to apply your function:
import re
import pandas as pd
s = pd.Series(['Aasa dsad d','GTH rr','SSD'])
print (s)
0 Aasa dsad d
1 GTH rr
2 SSD
dtype: object
def words(text):
    return re.findall(r'\w+', text.lower())
print (s.apply(words))
0 [aasa, dsad, d]
1 [gth, rr]
2 [ssd]
dtype: object
But in pandas it is better to use str.lower and str.findall, because they also work with NaNs:
print (s.str.lower().str.findall(r'\w+'))
0 [aasa, dsad, d]
1 [gth, rr]
2 [ssd]
dtype: object
Something like this?
from collections import Counter
import pandas as pd
series = pd.Series(['word', 'Word', 'WORD', 'other_word'])
counter = Counter(series.apply(lambda x: x.lower()))
print(counter)
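To get back to the WORDS counter from the question, the tokenizing and counting steps can be combined. This is a sketch that assumes a Series of free-text rows; the None entry is added here only to show that missing values are skipped.

```python
from collections import Counter
import pandas as pd

s = pd.Series(["Aasa dsad d", "GTH rr", "SSD", None])

# Lowercase and tokenize each row; missing values stay missing.
tokens = s.str.lower().str.findall(r"\w+")

# explode() puts one token per row; dropna() discards rows that had no text.
WORDS = Counter(tokens.explode().dropna())
print(WORDS)
```

An alternative with the same intent is tokens.explode().value_counts(), which returns a pandas Series instead of a Counter.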
