I am new to Stack Overflow, so let me know if this is not allowed.
Currently I am using the pandas.DataFrame.applymap function to apply a text cleaning function to an entire column in the dataframe (df). My df is fairly large, so I added an icecream call in the text cleaning function to see the progress of the function. For further clarity, I would like to pass an argument that reports the index of the df as each cell is processed. Is there a way to access df indices in this way? For reference, here are my text cleaner and applymap call:
def get_clean_text(text):
    """
    returns: clean text string
    """
    text = gen_clean(text)  # function to remove punctuation, HTML tags, etc.
    doc = NLP(text)  # spaCy tokenization
    sans_stops = rm_stops(doc)  # removes stop words from doc, return type string
    sugs = SYM_SPELL.lookup_compound(sans_stops, max_edit_distance=2)  # symspellpy spell checker, return type list
    spell_check = " ".join([sug.term for sug in sugs])
    ic()
    return spell_check

DF = pd.read_csv('data.csv', index_col=0, encoding='utf-8')
DF = DF.applymap(get_clean_text)
Desired output would look something like this:
ic | id1
ic | id2
...
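One possible approach, sketched below (not tested against the original data; the helper name clean_with_index is mine): applymap works element-wise and never sees the index, but applying a function column-wise with DataFrame.apply hands you each column as a Series, whose index is the DataFrame index.
import pandas as pd
from icecream import ic

def clean_with_index(col):
    # col is one column (a Series); items() yields (index, value) pairs,
    # so the DataFrame index can be logged as each cell is cleaned
    cleaned = []
    for idx, val in col.items():
        ic(idx)
        cleaned.append(get_clean_text(val))
    return pd.Series(cleaned, index=col.index)

DF = DF.apply(clean_with_index)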
Hi, I'm attempting my first ML project and am probably way out of my depth. I've been trying to clean the data ready for NLP, and need to get rid of the punctuation and split the sentences into single words. I can't seem to get the script to run on all rows in the column; it only looks at the first line, which is the name of the column, in this case 'description'.
I've put the code below that I've run to load the file, and pictures to show the output at each stage. Any help would be much appreciated.
def load_user_info():
    userdata = pd.read_csv('userinfo.csv')
    return userdata
[Image: output from the head command]
[Image: creation of the dataframe]
Cleaning the data to remove punctuation etc
import re
import string

def clean_text_round1(df):
    text = text.lower()
    text = re.sub('[%s]' % re.escape(string.punctuation), ' ', df)
    text = re.sub('\w*\d\w*', ' ', df)
    replace_no_space = re.compile('(\.)|(\;)|(\:)|(\!)|(\')|(\?)|(\,)|(\")|(\|)|(\()|(\))|(\[)|(\])|(\%)|(\$)|(\<)|(\>)|(\{)|(\})')
    replace_with_space = re.compile('(<br\s/><br\s/?)|(-)|(/)|(:)')
    return df

round1 = lambda x: clean_text_round1
Applying the data clean
data_clean = pd.DataFrame(df.apply(round1))
data_clean
The above command returns:
description <function clean_text_round1 at 0x000001726BA53598>
When I call df I'm expecting to see clean data, but as I've said above it only seems to apply the clean to the first row and not the rest of the data in the description column. I get the same result when I try to split the sentences in the dataframe into single words: it applies the code to line one and ignores the rest of the rows in the column.
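For what it's worth, the <function clean_text_round1 at 0x...> output above is the clue: round1 = lambda x: clean_text_round1 returns the function object itself without ever calling it, and df.apply then runs that lambda once per column. A minimal corrected sketch (assuming the text lives in the 'description' column) might look like this:
import re
import string

def clean_text_round1(text):
    # operate on a single string: lower-case, strip punctuation,
    # and drop tokens containing digits
    text = text.lower()
    text = re.sub('[%s]' % re.escape(string.punctuation), ' ', text)
    text = re.sub(r'\w*\d\w*', ' ', text)
    return text

# apply the cleaner to every value in the column, not to the whole DataFrame
df['description'] = df['description'].apply(clean_text_round1)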
I am doing some data mining. I have a database that looks like this (pulling out three lines):
100324822$10032482$1$PS$BENICAR$OLMESARTAN MEDOXOMIL$1$Oral$UNK$$$Y$$$$021286$$$TABLET$
1014687010$10146870$2$SS$BENICAR HCT$HYDROCHLOROTHIAZIDE\OLMESARTAN MEDOXOMIL$1$Oral$1/2 OF 40/25MG TABLET$$$Y$$$$$.5$DF$FILM-COATED TABLET$QD
115700162$11570016$5$C$Olmesartan$OLMESARTAN$1$Unknown$UNK$$$U$U$$$$$$$
My code looks like this:
with open('DRUG20Q4.txt') as fileDrug20Q4:
    drugTupleList20Q4 = [tuple(map(str, i.split('$'))) for i in fileDrug20Q4]

drug20Q4 = []
for entryDrugPrimaryID20Q4 in drugTupleList20Q4:
    drug20Q4.append((entryDrugPrimaryID20Q4[0], entryDrugPrimaryID20Q4[3], entryDrugPrimaryID20Q4[5]))

drugNameDataFrame20Q4 = pd.DataFrame(drug20Q4, columns=['PrimaryID', 'Role', 'Drug Name'])
drugNameDataFrame20Q4 = pd.DataFrame(drugNameDataFrame20Q4.loc[drugNameDataFrame20Q4['Drug Name'] == 'OLMESARTAN'])
Currently the code pulls out only entries with the exact name "OLMESARTAN". How do I capture all the variations, for instance "OLMESARTAN MEDOXOMIL" etc.? I can't simply list all the varieties, as there are endless variations, so I need something that captures anything with the term "OLMESARTAN" in it.
Thanks!
You can use str.contains to get what you are looking for.
Here's an example (using some string I found in the documentation):
import pandas as pd
df = pd.DataFrame()
item = 'Return boolean Series or Index based on whether a given pattern or regex is contained within a string of a Series or Index.'
df['test'] = item.split(' ')
df[df['test'].str.contains('de')]
This outputs:
test
4 Index
22 Index.
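Applied to the DataFrame from the question, reusing its column names (na=False is an assumption to guard against missing values):
# keep every row whose Drug Name mentions OLMESARTAN in any combination
olmesartan20Q4 = drugNameDataFrame20Q4[
    drugNameDataFrame20Q4['Drug Name'].str.contains('OLMESARTAN', case=False, na=False)
]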
I have about 10 columns of data in a CSV file that I want to get statistics on using Python. I am currently using the csv module to open the file and read the contents. But I also want to look at 2 particular columns to compare data and get a percentage of accuracy based on the data.
Although I can open the file and parse through the rows I cannot figure out for example how to compare:
Row[i] Column[8] with Row[i] Column[10]
My pseudo code would be something like this:
category = Row[i] Column[8]
label = Row[i] Column[10]
if (category != label):
    difference += 1
    totalChecked += 1
else:
    correct += 1
    totalChecked += 1
The only thing I am able to do is to read the entire row. But I want to get the exact Row and Column of my 2 variables category and label and compare them.
How do I work with specific row/columns for an entire excel sheet?
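A direct pandas version of that pseudo code might look like the sketch below; the file name and the 0-indexed column positions 8 and 10 are assumptions to adjust for your data:
import pandas as pd

df = pd.read_csv('data.csv')  # hypothetical file name
category = df.iloc[:, 8]      # column at 0-indexed position 8
label = df.iloc[:, 10]        # column at 0-indexed position 10

total_checked = len(df)
correct = (category == label).sum()
difference = total_checked - correct
print(f'{correct}/{total_checked} correct ({correct / total_checked:.1%})')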
Convert both to pandas DataFrames and compare them, similarly to the example below. Whatever dataset you're working on, loading it with the pandas module and transforming the data into lists and DataFrames is the first step to working with it, in my opinion.
I've taken the time and effort to delve into this myself, as it will be useful to me going forward. The columns don't have to have the same lengths at all in this example, so that's good. I've tested the code below (Python 3.8) and it runs successfully.
With only slight adaptations it can be used for your specific data columns, objects, and purposes.
import pandas as pd
import re

A = pd.read_csv(r'C:\Users\User\Documents\query_sequences.csv')
B = pd.read_csv(r'C:\Users\User\Documents\Sequence_reference.csv')

print(A.columns)
print(B.columns)

# pull each column out as a plain Python list
my_unknown_id = A['Unknown_sample_no'].tolist()
my_unknown_seq = A['Unknown_sample_seq'].tolist()
Reference_Species1 = B['Reference_sequences_ID'].tolist()
Reference_Sequences1 = B['Reference_Sequences'].tolist()

# pair IDs with sequences for quick lookups
Ref_dict = dict(zip(Reference_Species1, Reference_Sequences1))
Unknown_dict = dict(zip(my_unknown_id, my_unknown_seq))
print(Ref_dict)
print(Unknown_dict)

filename = 'seq_match_compare2.csv'
f = open(filename, 'a')  # 'a' appends; use 'w' to overwrite instead
headers = 'Query_ID, Query_Seq, Ref_species, Ref_seq, Match, Match start Position\n'
f.write(headers)

# search for each unknown sequence inside every reference sequence
for ID, seq in Unknown_dict.items():
    for species, seq1 in Ref_dict.items():
        m = re.search(seq, seq1)
        if m:
            match = m.group()
            pos = m.start() + 1
            f.write(str(ID) + ',' + seq + ',' + species + ',' + seq1 + ',' + match + ',' + str(pos) + '\n')
f.close()
And I tried it myself too, assuming your columns contain integers, and following your specifications as best I can at the moment. It's my first attempt, so go easy. You could use my code below as a benchmark for how to move forward on your question.
Basically it does what you want (gives you the skeleton): it imports a CSV into Python using the pandas module, converts it to a DataFrame, works on specific columns in that DataFrame, makes new result columns, prints the results alongside the original data in the terminal, and saves them to a new CSV. It's as messy as my Python is, but it works, and you can see how to convert your columns and rows into lists with pandas DataFrames, do calculations with them in Python, and get your results back out to a new CSV. It's a start on how you can answer your question going forward.
import pandas as pd

A = pd.read_csv(r'C:\Users\User\Documents\book6 category labels.csv')

# fill any missing values so comparisons don't fail
A["Category"].fillna("empty data - missing value", inplace=True)
# ...repeat for other columns as needed
print(A.columns)

# pull the columns out as plain Python lists
MyCat = A['Category'].tolist()
MyLab = A['Label'].tolist()
My_Cats = A['Category1'].tolist()
My_Labs = A['Label1'].tolist()

# pair categories with labels for quick lookups
Ref_dict = dict(zip(My_Cats, My_Labs))
print(Ref_dict)

print("Given Dataframe :\n", A)

# element-wise difference between the two numeric columns
A['Lab-Cat_diff'] = A['Category1'].sub(A['Label1'], axis=0)
print("\nDifference of Category1 and Label1 :\n", A)

# you can do other matches, comparisons, and calculations here
# and add them to the output before saving

# write the original data plus the new result column to a new CSV
A.to_csv('some_name5523.csv')
Yes, I know it's by no means perfect, but I wanted to give you a heads-up about pandas and DataFrames for doing what you want moving forward.
I am trying to create a function that splits text in a column of a dataframe and puts each half of the split into a different new column. I want to split the text right after a specific phrase (defined as "search_text" in the function "create_var") and then trim that text to a specified number of characters (defined as left_trim_number in the function). My function has worked in some cases but does not work in others.
Here is the basic structure of my dataframe, where "lst" is my list of text items and "cols" are the two columns of the original dataframe:
import pandas as pd
cols = ['page', 'text_i']
df1 = pd.DataFrame(lst, columns=cols)
Here is my function:
def create_var(varname, search_text, left_trim_number):
    df1[['a', varname]] = df1['text_i'].str.split(search_text, expand=True)
    df1[varname] = df1[varname].str[:left_trim_number]

create_var('var1', 'I am looking for the text that follows this ', 3)
In the cases where it doesn't work, I get this error (which I assume is related to pandas):
"ValueError: Columns must be same length as key"
Is there a better way of doing this?
You could try str.partition instead. The ValueError comes from str.split with expand=True: it produces as many columns as the row with the most splits, so when search_text is missing from some rows (or occurs more than once) the column count no longer matches the two names being assigned. partition always returns exactly three parts (before, separator, after):
import pandas as pd
df = pd.DataFrame({"text":["hello world", "a", "again hello world"]})
search_text = "hello "
parts = df['text'].str.partition(search_text)
df['a'] = parts[0] + parts[1]
df['var1'] = parts[2]
df['var1'] = df['var1'].str[:3]
print(df)
Output:
                text             a var1
0        hello world        hello   wor
1                  a             a
2  again hello world  again hello   wor
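Folding that into the question's create_var, as a sketch against the df1 defined there:
def create_var(varname, search_text, left_trim_number):
    # partition always returns exactly three columns (before, separator, after),
    # even for rows that don't contain search_text, so the lengths always match
    parts = df1['text_i'].str.partition(search_text)
    df1['a'] = parts[0] + parts[1]
    df1[varname] = parts[2].str[:left_trim_number]

create_var('var1', 'I am looking for the text that follows this ', 3)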
I've seen questions posted here that are similar to mine, but I'm still getting errors in my code when trying some accepted answers. I have a dataframe with three columns: created_at, text, and words (which is just a tokenized version of text).
Now, I have a list of companies ['Starbucks', 'Nvidia', 'IBM', 'Dell'], and I only want to keep the rows where the text includes those words above.
I've tried a few things, but with no success:
small_DF.filter(lambda x: any(word in x.text for word in test_list))
Returns : TypeError: condition should be string or Column
I tried creating a function and using foreach():
def filters(line):
    return any(word in line for word in test_list)

df = df.foreach(filters)
That turns df into 'NoneType'.
And the last one I tried:
df = df.filter(col("text").isin(test_list))
This returns an empty dataframe, which is nice as I get no error, but obviously not what I want.
Your .filter returns an error because it is the SQL filter function on DataFrames (expecting a BooleanType() column), not the filter function on RDDs. If you want to use the RDD one, just add .rdd:
small_DF.rdd.filter(lambda x: any(word in x.text for word in test_list))
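The result of that is an RDD of Row objects; if you need a DataFrame again afterwards, it can be converted back (a sketch, assuming an active SparkSession):
filtered_df = small_DF.rdd.filter(
    lambda x: any(word in x.text for word in test_list)
).toDF()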
You don't have to use a UDF, you can use regular expressions in pyspark with .rlike on your column "text":
from pyspark.sql import HiveContext
hc = HiveContext(sc)
import pyspark.sql.functions as psf
words = [x.lower() for x in ['starbucks', 'Nvidia', 'IBM', 'Dell']]
data = [['i love Starbucks'],['dell laptops rocks'],['help me I am stuck!']]
df = hc.createDataFrame(data).toDF('text')
df.filter(psf.lower(df.text).rlike('|'.join(words)))
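One caveat worth noting: rlike matches the regex anywhere in the string, so a bare alternation like 'dell' would also hit inside longer words. Adding word boundaries tightens it up (a sketch building on the snippet above):
# \b anchors keep 'dell' from matching inside words like 'modelling'
pattern = r'\b(' + '|'.join(words) + r')\b'
df.filter(psf.lower(df.text).rlike(pattern)).show()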
I think filter isn't working because it expects a boolean output from the lambda function, and isin just compares with a column. You are trying to compare a list of words to a list of words. Here is something I tried that can give you some direction:
# prepare some test data ==>
words = [x.lower() for x in ['starbucks', 'Nvidia', 'IBM', 'Dell']]
data = [['i love Starbucks'],['dell laptops rocks'],['help me I am stuck!']]
df = spark.createDataFrame(data).toDF('text')
from pyspark.sql.types import BooleanType
from pyspark.sql.functions import udf

def intersect(row):
    # convert each word to lowercase
    row = [x.lower() for x in row.split()]
    return True if set(row).intersection(set(words)) else False

filterUDF = udf(intersect, BooleanType())
df.where(filterUDF(df.text)).show()
Output:
+------------------+
| text|
+------------------+
| i love Starbucks|
|dell laptops rocks|
+------------------+