I am trying to create a function that splits text in a column of a dataframe and puts each half of the split into a different new column. I want to split the text right after a specific phrase (defined as "search_text" in the function "create_var") and then trim that text to a specified number of characters (defined as left_trim_number in the function). My function has worked in some cases but does not work in others.
Here is the basic structure of my dataframe, where "lst" is my list of text items and "cols" are the two columns of the original dataframe:
import pandas as pd
cols = ['page', 'text_i']
df1 = pd.DataFrame(lst, columns=cols)
Here is my function:
def create_var(varname, search_text, left_trim_number):
    df1[['a', varname]] = df1['text_i'].str.split(search_text, expand=True)
    df1[varname] = df1[varname].str[:left_trim_number]
create_var('var1','I am looking for the text that follows this ',3)
In the cases where it doesn't work, I get this error (which I assume comes from pandas):
"ValueError: Columns must be same length as key"
Is there a better way of doing this?
You could try this:
import pandas as pd
df = pd.DataFrame({"text":["hello world", "a", "again hello world"]})
search_text = "hello "
parts = df['text'].str.partition(search_text)
df['a'] = parts[0] + parts[1]
df['var1'] = parts[2]
df['var1'] = df['var1'].str[:3]
print(df)
Output:
                text             a var1
0        hello world        hello   wor
1                  a             a
2  again hello world  again hello   wor
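The ValueError in the question most likely appears when the search phrase occurs more than once in a row (or in no row at all), because str.split with expand=True then returns a number of columns other than two. str.partition always returns exactly three columns, so the same idea can be wrapped back into a function. This is only a sketch: the sample rows and the "KEY " phrase are made up, and passing the dataframe in as an argument is my own choice rather than something from the question.

```python
import pandas as pd

def create_var(df, varname, search_text, left_trim_number):
    """Return a copy of df with a new column holding the first
    left_trim_number characters that follow search_text in 'text_i'."""
    # partition splits at the FIRST occurrence only and always yields
    # three columns: text before, the separator, text after
    parts = df["text_i"].str.partition(search_text)
    out = df.copy()
    out[varname] = parts[2].str[:left_trim_number]
    return out

# Hypothetical sample data: one match, no match, and a repeated match
df1 = pd.DataFrame(
    {"page": [1, 2, 3],
     "text_i": ["x KEY abcdef", "no match", "KEY one KEY two"]}
)
result = create_var(df1, "var1", "KEY ", 3)
print(result)
```

Rows without the phrase get an empty string in the new column instead of raising an error.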
Related
I am working on a data cleaning project in which I have to remove outliers of price_per_sqft. I group the data by location with groupby, keep the rows that lie within one standard deviation of each group's mean, and concatenate the result onto an output dataframe.
But in the output, the location names come back with extra words in front of them. How can I get clean location names instead?
Code:
def remove_pps_outliers(df):
    df_out = pd.DataFrame()
    for key, subdf in df.groupby('location'):
        m = np.mean(subdf.price_per_sqft)
        st = np.std(subdf.price_per_sqft)
        reduced_df = subdf[(subdf.price_per_sqft>(m-st)) & (subdf.price_per_sqft<=(m+st))]
        df_out = pd.concat([df_out,reduced_df],ignore_index=True)
    return df_out
df6 = remove_pps_outliers(df5)
df6.head()
Output:
[image: screenshot of the df6.head() output]
How can I get the answer without "1st Phase" or "1st Block" keywords like this...
[image: screenshot of the desired output]
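A rudimentary fix would be to just remove the characters you do not want. Luckily, in this example both '1st Phase ' and '1st Block ' contain 10 characters, so you could use the following (note that it strips the first 10 characters from every row, whether or not the row has one of these prefixes):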
df6['location'] = df6['location'].str.slice_replace(0,10,'')
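If some locations do not start with one of those prefixes, a pattern anchored at the start of the string may be safer, since it leaves other names untouched. A minimal sketch, assuming the only unwanted prefixes are "1st Phase " and "1st Block "; the sample location names are invented for illustration:

```python
import pandas as pd

# Hypothetical sample; the real df6 comes from remove_pps_outliers
df6 = pd.DataFrame(
    {"location": ["1st Phase JP Nagar", "1st Block Jayanagar", "Whitefield"]}
)

# ^ anchors the match at the start, so only a leading prefix is removed
df6["location"] = df6["location"].str.replace(
    r"^1st (?:Phase|Block) ", "", regex=True
)
print(df6)
```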
I am new to stack overflow so let me know if this is not allowed.
Currently I am using the pandas.DataFrame.applymap function to apply a text cleaning function to every element of the dataframe (df). My df is fairly large, so I added an icecream call in the text cleaning function to see its progress. For further clarity, I would like to add an argument that specifies the index of the df row being processed. Is there a way to access df indices in this way? For reference, here is my text cleaner and applymap call:
def get_clean_text(text):
    """
    returns: clean text string
    """
    text = gen_clean(text) #function to remove punctuation, HTML tags, etc
    doc = NLP(text) #spacy tokenization
    sans_stops = rm_stops(doc) #removes stop words from doc, return type string
    sugs = SYM_SPELL.lookup_compound(sans_stops, max_edit_distance=2) #symspellpy spell checker, return type list
    spell_check = " ".join([sug.term for sug in sugs])
    ic()
    return spell_check
DF = pd.read_csv('data.csv', index_col=0, encoding='utf-8')
DF = DF.applymap(get_clean_text)
Desired output would look something like this:
ic | id1
ic | id2
...
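applymap only hands each cell's value to the function, so the index is not available inside it. One possible workaround is to loop over the columns and pair every value with its index label via Series.items(). The sketch below is an assumption-laden illustration: the cleaner is a stand-in for the real one, the extra index parameter is hypothetical, and print is used in place of ic() to keep the example self-contained.

```python
import pandas as pd

def get_clean_text(text, index=None):
    """Stand-in cleaner; 'index' is a hypothetical extra argument
    used only for progress reporting."""
    if index is not None:
        print(f"cleaning row {index}")  # ic(index) could go here instead
    return text.strip().lower()

# Hypothetical data in place of data.csv
DF = pd.DataFrame(
    {"col_a": ["  Foo ", "BAR"], "col_b": [" Baz", "Qux  "]},
    index=["id1", "id2"],
)

for col in DF.columns:
    # Series.items() yields (index label, value) pairs in row order
    DF[col] = [get_clean_text(text, index=idx) for idx, text in DF[col].items()]
```

The progress line is printed once per cell, so each index label appears once for every column.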
I am new to Python and this is my first post on Stack Overflow. I have a list of keywords and a dataframe containing multiple columns.
I want to search for these keywords in a particular column and write the keyword that appears against it.
This is what I am doing. My code
This is the error I am getting. The loop with the error
This is what I want to get. Desired output
Please help me figure out what is going wrong, or suggest a better way to do this. Thanks!
I am including the code below in case it makes things easier.
import pandas as pd
keywords = ["hello","hi","greetings","wassup"]
data = ["hello, my name is Harry", "Hi I am John", "Yo! Wassup", "Greetings fellow traveller","Hey im Henry", "Hello there General Kenobi"]
df = pd.DataFrame(data,columns = ['strings'])
df['Keywords'] = ""
df2 = pd.DataFrame(data = None, columns = df.columns)
for word in keywords:
    temp = df[df['strings'].str.contains(word,na = False)]
    temp.reset_index(drop = True)
    temp['Keywords'] = word
    df2.append(temp)
Error:
C:\Users\harka\Anaconda3\lib\site-packages\ipykernel_launcher.py:5: SettingWithCopyWarning:
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead
See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/indexing.html#indexing-view-versus-copy
"""
I added 'Yo' to show that it can return multiple strings
import pandas as pd
def keyword(row):
    strings = row['strings']
    keywords = ["hello","hi","greetings","wassup",'yo']
    keyword = [key for key in keywords if key.upper() in strings.upper()]
    return keyword
data = ["hello, my name is Harry", "Hi I am John", "Yo! Wassup", "Greetings fellow traveller","Hey im Henry", "Hello there General Kenobi"]
df = pd.DataFrame(data,columns = ['strings'])
df['keyword'] = df.apply(keyword, axis=1)
If you don't like getting a list of strings back, then perhaps a comma-separated string?
import pandas as pd
def keyword(row):
    strings = row['strings']
    keywords = ["hello","hi","greetings","wassup",'yo']
    keyword = [key for key in keywords if key.upper() in strings.upper()]
    return ','.join(keyword)
data = ["hello, my name is Harry", "Hi I am John", "Yo! Wassup", "Greetings fellow traveller","Hey im Henry", "Hello there General Kenobi"]
df = pd.DataFrame(data,columns = ['strings'])
df['keyword'] = df.apply(keyword, axis=1)
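One caveat with plain substring checks is that a short keyword such as "hi" also matches inside longer words like "this". If that matters, a word-boundary regex with Series.str.findall could be used instead. This is a sketch of that variation, not the approach above; note that it lists keywords in the order they appear in the text rather than in keyword-list order.

```python
import re
import pandas as pd

keywords = ["hello", "hi", "greetings", "wassup", "yo"]
# \b restricts matches to whole words; re.escape guards special characters
pattern = r"\b(?:" + "|".join(map(re.escape, keywords)) + r")\b"

data = ["hello, my name is Harry", "Hi I am John", "Yo! Wassup",
        "Greetings fellow traveller", "Hey im Henry",
        "Hello there General Kenobi"]
df = pd.DataFrame(data, columns=["strings"])

df["keyword"] = (
    df["strings"]
    .str.findall(pattern, flags=re.IGNORECASE)
    # lowercase the hits and drop duplicates while keeping their order
    .apply(lambda found: ",".join(dict.fromkeys(w.lower() for w in found)))
)
print(df)
```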
Because I want to remove ambiguity when I train the data. I want to clean it well. So how can I remove all rows that contain 3 words or less in python?
Hello World! This will be my first contribution ever to SO :-)
Let's create some data:
import pandas as pd
data = {'Source': ['Hello all Im Happy', 'Its a lie, dont trust him', 'Oops', 'foo', 'bar']}
df = pd.DataFrame(data, columns=['Source'])
My approach is very straightforward, simple, and a little "brute" and inefficient; however, I ran this on a large dataframe (1013952 rows) and the time was fairly acceptable.
Let's find the indices of the dataframe rows that have fewer than n tokens:
from nltk.tokenize import word_tokenize
def get_indices(df, col, n):
    """
    Get the indices of the dataframe rows that have fewer than n tokens in a specific column
    Parameters:
    df(pandas dataframe)
    col(string): column name
    n(int): threshold value for minimum words
    """
    tmp = []
    for i in range(len(df)):  # df.iterrows() wasn't working for me
        if len(word_tokenize(df[col][i])) < n:
            tmp.append(i)
    return tmp
Next we just need to call the function and drop the rows at said indices:
tmp = get_indices(df, 'Source', 4)  # n=4 flags rows with 3 tokens or fewer
df_clean = df.drop(tmp)
Best!
df = pd.DataFrame({"mycolumn": ["", " ", "test string", "test string 1", "test string 2 2"]})
df = df.loc[df["mycolumn"].str.count(" ") >= 2]
Avoid looping over a dataframe where you can; prefer vectorized operations.
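Counting spaces can miscount when a string has leading, trailing, or repeated spaces. A variation that counts whitespace-separated words directly is sketched below, assuming "3 words or less" means rows with more than three words should be kept. The last sample row is changed from the example above so that one row survives the filter.

```python
import pandas as pd

df = pd.DataFrame(
    {"mycolumn": ["", " ", "test string", "test string 1", "one two three four"]}
)

# str.split() with no arguments splits on any run of whitespace,
# so empty and blank strings give zero words
word_counts = df["mycolumn"].str.split().str.len()

# keep only rows with more than three words
df_clean = df.loc[word_counts > 3]
print(df_clean)
```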
I am working in jupyter notebook and have a pandas dataframe "data":
Question_ID | Customer_ID | Answer
1           | 234         | Data is very important to use because ...
2           | 234         | We value data since we need it ...
I want to go through the text in column "Answer" and get the three words before and after the word "data".
So in this scenario I would have gotten "is very important"; "We value", "since we need".
Is there a good way to do this within a pandas dataframe? So far I have only found solutions where "Answer" would be its own file run through Python code (without a pandas dataframe). While I realize that I need to use the NLTK library, I haven't used it before, so I don't know what the best approach would be. (This was a great example: "Extracting a word and its prior 10 word context to a dataframe in Python".)
This may work:
import pandas as pd
import re
df = pd.read_csv('data.csv')
for value in df.Answer.values:
    non_data = re.split('Data|data', value) # split text removing "data"
    terms_list = [term for term in non_data if len(term) > 0] # skip empty terms
    substrs = [term.split()[0:3] for term in terms_list] # slice and grab first three terms
    result = [' '.join(term) for term in substrs] # combine the terms back into substrings
    print(result)
output:
['is very important']
['We value', 'since we need']
A solution using a generator expression together with the re.findall and itertools.chain.from_iterable functions:
import pandas as pd, re, itertools
data = pd.read_csv('test.csv') # change with your current file path
data_adjacents = ((i for sublist in (list(filter(None,t))
                    for t in re.findall(r'(\w*?\s*\w*?\s*\w*?\s+)(?=\bdata\b)|(?<=\bdata\b)(\s+\w*\s*\w*\s*\w*)', l, re.I)) for i in sublist)
                  for l in data.Answer.tolist())
print(list(itertools.chain.from_iterable(data_adjacents)))
The output:
[' is very important', 'We value ', ' since we need']
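If the regex is hard to maintain, a token-based variant might be easier to read and keeps the result next to each row in the dataframe. The sketch below assumes a "word" is a run of \w characters; the function name context_words and the new "context" column are my own, and the two rows are rebuilt inline from the question instead of being read from a file.

```python
import re
import pandas as pd

def context_words(text, keyword="data", n=3):
    """Return (before, after) pairs holding up to n words on each
    side of every case-insensitive occurrence of keyword."""
    tokens = re.findall(r"\w+", text)
    contexts = []
    for i, token in enumerate(tokens):
        if token.lower() == keyword:
            before = " ".join(tokens[max(0, i - n):i])
            after = " ".join(tokens[i + 1:i + 1 + n])
            contexts.append((before, after))
    return contexts

data = pd.DataFrame({
    "Question_ID": [1, 2],
    "Customer_ID": [234, 234],
    "Answer": ["Data is very important to use because ...",
               "We value data since we need it ..."],
})
data["context"] = data["Answer"].apply(context_words)
print(data[["Question_ID", "context"]])
```

An empty string in a pair means there were no words on that side of the keyword.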