Data Cleaning - Issues with applying clean to all elements of the dataframe - python

Hi, I'm attempting my first ML project and am probably way out of my depth. I've been trying to clean the data ready for NLP, and need to get rid of the punctuation and split the sentences down into single words. I can't seem to get the script to run on all rows in the column; it only looks at the first line, which is the name of the column, in this case 'description'.
I've put the code I've run below, to load the file, and pics to show the output at each stage. Any help would be much appreciated.
import pandas as pd

def load_user_info():
    userdata = pd.read_csv('userinfo.csv')
    return userdata
Output from head command
Creation of dataframe
Cleaning the data to remove punctuation etc
import re
import string

def clean_text_round1(df):
    text = text.lower()
    text = re.sub('[%s] ' % re.escape(string.punctuation), ' ', df)
    text = re.sub(r'\w*\d\w*', ' ', df)
    replace_no_space = re.compile("(\.)|(\;)|(\:)|(\!)|(\')|(\?)|(\,)|(\")|(\|)|(\()|(\))|(\[)|(\])|(\%)|(\$)|(\<)|(\>)|(\{)|(\})")
    replace_with_space = re.compile('(<br\s/><br\s/?)|(-)|(/)|(:)')
    return df

round1 = lambda x: clean_text_round1
Applying the data clean
data_clean = pd.DataFrame(df.apply(round1))
data_clean
The above command returns:
description <function clean_text_round1 at 0x000001726BA53598>
When I call df I'm expecting to see clean data, but as I've said above it only seems to be applying the clean to the first row and not to the rest of the data in the description column. I'm getting the same result when I try to split the sentences within the dataframe down into single words: it applies the code to line one and ignores the rest of the rows in the column.
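For reference, a minimal sketch of how this step is usually written, assuming the goal is to clean every row of the 'description' column: apply wants a function that takes a single value and returns a new one, whereas round1 = lambda x: clean_text_round1 returns the function object itself instead of calling it, which is exactly the repr that shows up in the output above.
import re
import string
import pandas as pd

def clean_text_round1(text):
    # lower-case, then strip punctuation and word/digit mixes
    text = text.lower()
    text = re.sub('[%s]' % re.escape(string.punctuation), ' ', text)
    text = re.sub(r'\w*\d\w*', ' ', text)
    return text

# apply the cleaner element-wise to the one column of interest
data_clean = pd.DataFrame(df['description'].apply(clean_text_round1))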

Related

Using icecream to debug within dataframe applymap function

I am new to Stack Overflow so let me know if this is not allowed.
Currently I am using the pandas.DataFrame.applymap function to apply a text-cleaning function to an entire column in the dataframe (df). My df is fairly large, so I added an icecream call in the text-cleaning function to see the function's progress. For further clarity, I would like to add an argument that reports the index of the df as it executes. Is there a way to access df indices in this way? For reference, here is my text cleaner and applymap call:
import pandas as pd
from icecream import ic

def get_clean_text(text):
    """
    returns: clean text string
    """
    text = gen_clean(text)  # function to remove punctuation, HTML tags, etc.
    doc = NLP(text)  # spaCy tokenization
    sans_stops = rm_stops(doc)  # removes stop words from doc, return type string
    sugs = SYM_SPELL.lookup_compound(sans_stops, max_edit_distance=2)  # symspellpy spell checker, return type list
    spell_check = " ".join([sug.term for sug in sugs])
    ic()
    return spell_check

DF = pd.read_csv('data.csv', index_col=0, encoding='utf-8')
DF = DF.applymap(get_clean_text)
Desired output would look something like this:
ic | id1
ic | id2
...
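One possible approach, sketched under the assumption that the dataframe should still be cleaned element by element: applymap only ever sees the cell value, so to get at the index you can switch to DataFrame.apply, which hands each column to the function as a whole Series whose (index, value) pairs are available via items().
def clean_series(s):
    # s is one whole column; s.items() yields (index, value) pairs
    cleaned = []
    for idx, text in s.items():
        ic(idx)  # log the row index as a progress marker
        cleaned.append(get_clean_text(text))
    return pd.Series(cleaned, index=s.index)

DF = DF.apply(clean_series)
The ic() call inside get_clean_text could then be dropped, since the index is logged here instead.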

Mining for Term that is "Included In" Entry Rather than "Equal To"

I am doing some data mining. I have a database that looks like this (pulling out three lines):
100324822$10032482$1$PS$BENICAR$OLMESARTAN MEDOXOMIL$1$Oral$UNK$$$Y$$$$021286$$$TABLET$
1014687010$10146870$2$SS$BENICAR HCT$HYDROCHLOROTHIAZIDE\OLMESARTAN MEDOXOMIL$1$Oral$1/2 OF 40/25MG TABLET$$$Y$$$$$.5$DF$FILM-COATED TABLET$QD
115700162$11570016$5$C$Olmesartan$OLMESARTAN$1$Unknown$UNK$$$U$U$$$$$$$
My code looks like this:
import pandas as pd

with open('DRUG20Q4.txt') as fileDrug20Q4:
    drugTupleList20Q4 = [tuple(map(str, i.split('$'))) for i in fileDrug20Q4]

drug20Q4 = []
for entryDrugPrimaryID20Q4 in drugTupleList20Q4:
    drug20Q4.append((entryDrugPrimaryID20Q4[0], entryDrugPrimaryID20Q4[3], entryDrugPrimaryID20Q4[5]))

drugNameDataFrame20Q4 = pd.DataFrame(drug20Q4, columns=['PrimaryID', 'Role', 'Drug Name'])
drugNameDataFrame20Q4 = pd.DataFrame(drugNameDataFrame20Q4.loc[drugNameDataFrame20Q4['Drug Name'] == 'OLMESARTAN'])
Currently the code will pull out only entries with the exact name "OLMESARTAN". How do I capture all the variations, for instance "OLMESARTAN MEDOXOMIL" etc.? I can't simply list all the varieties, as there are endless possible variations, so I need something that captures anything with the term "OLMESARTAN" within it.
Thanks!
You can use str.contains to get what you are looking for.
Here's an example (using some string I found in the documentation):
import pandas as pd
df = pd.DataFrame()
item = 'Return boolean Series or Index based on whether a given pattern or regex is contained within a string of a Series or Index.'
df['test'] = item.split(' ')
df[df['test'].str.contains('de')]
This outputs:
test
4 Index
22 Index.
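Applied to the dataframe from the question, that would look something like this (case=False is an optional extra so mixed-case entries such as "Olmesartan" are caught too):
matches = drugNameDataFrame20Q4[drugNameDataFrame20Q4['Drug Name'].str.contains('OLMESARTAN', case=False)]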

How to work with Rows/Columns from CSV files?

I have about 10 columns of data in a CSV file that I want to get statistics on using Python. I am currently using the csv module to open the file and read the contents. But I also want to look at 2 particular columns, to compare data and get a percentage of accuracy based on the data.
Although I can open the file and parse through the rows, I cannot figure out, for example, how to compare:
Row[i] Column[8] with Row[i] Column[10]
My pseudo code would be something like this:
category = Row[i] Column[8]
label = Row[i] Column[10]
if (category != label):
    difference += 1
    totalChecked += 1
else:
    correct += 1
    totalChecked += 1
The only thing I am able to do is to read the entire row. But I want to get the exact Row and Column of my 2 variables category and label and compare them.
How do I work with specific row/columns for an entire excel sheet?
Convert both to pandas dataframes and compare them, similarly to the example below. Whatever dataset you're working on, loading it with the pandas module (alongside any other relevant modules) and transforming the data into lists and dataframes would be my first step in working with it.
I've taken the liberty and the time to delve into this myself, as it will be useful to me going forward. The columns don't have to have the same lengths at all in this example, so that's good. I've tested the code below (Python 3.8) and it works successfully.
With only slight adaptations it can be used for your specific data columns, objects and purposes.
import pandas as pd
import re  # for the string matching between the two tables

A = pd.read_csv(r'C:\Users\User\Documents\query_sequences.csv')
B = pd.read_csv(r'C:\Users\User\Documents\Sequence_reference.csv')

print(A.columns)
print(B.columns)

my_unknown_id = A['Unknown_sample_no'].tolist()
my_unknown_seq = A['Unknown_sample_seq'].tolist()
Reference_Species1 = B['Reference_sequences_ID'].tolist()
Reference_Sequences1 = B['Reference_Sequences'].tolist()

# pair IDs with sequences for both tables
Ref_dict = dict(zip(Reference_Species1, Reference_Sequences1))
Unknown_dict = dict(zip(my_unknown_id, my_unknown_seq))
print(Ref_dict)
print(Unknown_dict)

filename = 'seq_match_compare2.csv'
f = open(filename, 'a')  # 'a' appends on each run; the original example used 'w'
headers = 'Query_ID, Query_Seq, Ref_species, Ref_seq, Match, Match start Position\n'
f.write(headers)

# search each unknown sequence against every reference sequence
for ID, seq in Unknown_dict.items():
    for species, seq1 in Ref_dict.items():
        m = re.search(seq, seq1)
        if m:
            match = m.group()
            pos = m.start() + 1
            f.write(str(ID) + ',' + seq + ',' + species + ',' + seq1 + ',' + match + ',' + str(pos) + '\n')
f.close()
And I tried it myself too, assuming your columns contain integers, and following your specifications as best I can at the moment. It's my first attempt, so go easy. You could use my code below as a benchmark for how to move forward on your question.
Basically it does what you want (gives you the skeleton): it imports the CSV into Python using the pandas module, converts it to dataframes, works on specific columns in those dataframes, makes new results columns, prints the results alongside the original data in the terminal, and saves everything to a new CSV. It's as messy as my Python is, but it works. It's a work in progress with some redundant lines still in it (I'm a self-teaching newbie on my weekends), and I'll hopefully be working on it later to improve its readability, scope and functionality.
import pandas as pd

A = pd.read_csv(r'C:\Users\User\Documents\book6 category labels.csv')

# fill missing values so the comparisons don't trip over NaN
A["Category"].fillna("empty data - missing value", inplace=True)
# A["Blank1"].fillna("empty data - missing value", inplace=True)
# ...etc.

print(A.columns)

# pull the columns out as plain Python lists
MyCat = A['Category'].tolist()
MyLab = A['Label'].tolist()
My_Cats = A['Category1'].tolist()
My_Labs = A['Label1'].tolist()

# dictionary pairing each category with its label, handy for comparing whole columns as a block
Ref_dict = dict(zip(My_Cats, My_Labs))
print(Ref_dict)

print("Given Dataframe :\n", A)

# element-wise difference of the two integer columns, added as a new results column
A['Lab-Cat_diff'] = A['Category1'].sub(A['Label1'], axis=0)
print("\nDifference of Category1 and Label1 :\n", A)

# you can do other matches, comparisons and calculations here and add them to the output

# save the original data plus the new results column to a new CSV
A.to_csv('some_name5523.csv')
Yes, I know it's by no means perfect at all, but I wanted to give you the heads-up about pandas and dataframes for doing what you want moving forward.
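For the accuracy calculation in the original pseudocode specifically, pandas can do it in a few lines; this is a sketch assuming the two columns to compare are simply named 'Category' and 'Label' (adjust the names and the placeholder file path to your data):
import pandas as pd

df = pd.read_csv('yourfile.csv')
matches = (df['Category'] == df['Label'])  # boolean Series, True where the two columns agree
correct = matches.sum()
difference = (~matches).sum()
totalChecked = len(df)
print(correct, 'correct of', totalChecked, 'checked:', correct / totalChecked)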

python pandas dataframe words in context: get 3 words before and after

I am working in jupyter notebook and have a pandas dataframe "data":
Question_ID | Customer_ID | Answer
1           | 234         | Data is very important to use because ...
2           | 234         | We value data since we need it ...
I want to go through the text in column "Answer" and get the three words before and after the word "data".
So in this scenario I would have gotten "is very important"; "We value", "since we need".
Is there a good way to do this within a pandas dataframe? So far I have only found solutions where "Answer" would be its own file run through Python code (without a pandas dataframe). While I realize that I need to use the NLTK library, I haven't used it before, so I don't know what the best approach would be. (This was a great example: Extracting a word and its prior 10 word context to a dataframe in Python)
This may work:
import pandas as pd
import re

df = pd.read_csv('data.csv')
for value in df.Answer.values:
    non_data = re.split('Data|data', value)  # split the text on "data", removing it
    terms_list = [term for term in non_data if len(term) > 0]  # skip empty terms
    substrs = [term.split()[0:3] for term in terms_list]  # slice and grab the first three words
    result = [' '.join(term) for term in substrs]  # combine the words back into substrings
    print(result)
output:
['is very important']
['We value', 'since we need']
A solution using a generator expression with the re.findall and itertools.chain.from_iterable functions:
import pandas as pd, re, itertools

data = pd.read_csv('test.csv')  # change to your current file path
data_adjacents = ((i for sublist in (list(filter(None, t))
                                     for t in re.findall(r'(\w*?\s*\w*?\s*\w*?\s+)(?=\bdata\b)|(?<=\bdata\b)(\s+\w*\s*\w*\s*\w*)', l, re.I))
                   for i in sublist)
                  for l in data.Answer.tolist())
print(list(itertools.chain.from_iterable(data_adjacents)))
The output:
[' is very important', 'We value ', ' since we need']
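If you would rather keep the results inside the dataframe than print them, the first answer's splitting approach can also populate a new column; a sketch, assuming the same 'data.csv' as above:
df['context'] = df['Answer'].apply(
    lambda value: [' '.join(term.split()[0:3])
                   for term in re.split('Data|data', value) if term.strip()])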

How to load in txt file as data in Python?

I'm learning how to use sklearn and scikit and all that to do some machine learning.
I was wondering how to import this as data?
This is a dataset from the million song genre dataset.
How can I make my data.target[0] and data.target[1] equal to 0, which is "classic pop and rock", and data.target[640] equal to 1, which is "folk"?
And how can I make my data.data[0,:] equal to -8.697, 155.007, 1, 9, and so forth (all the numerical values after the title column)?
As others have mentioned, it was a little unclear what shape you were looking for, but just as a general starter, and to get the data into a very flexible format, you could read the text file into Python and convert it to a pandas dataframe. I am certain there are other more compact ways of doing this, but to provide clear steps we could start with:
import pandas as pd
import re

file = 'filepath'  # this is the file path to the saved text file
music = open(file, 'r')
lines = music.readlines()
music.close()

# split the lines by comma
lines = [line.split(',') for line in lines]
# capture the column line
columns = lines[9]
# capture the actual content of the data, dismissing the header info
content = lines[10:]

musicdf = pd.DataFrame(content)
# assign the column names to our dataframe
musicdf.columns = columns
# preview the dataframe
musicdf.head(10)

# the final column had formatting issues, so this gets rid of the "\n"
# in both the column title and the column values
def cleaner(txt):
    txt = re.sub(r'[\n]+', '', txt)
    return txt

# rename the column of issue
musicdf = musicdf.rename(columns={'var_timbre12\n': 'var_timbre12'})
# apply the cleaning function above to the column of interest
musicdf['var_timbre12'] = musicdf['var_timbre12'].apply(lambda p: cleaner(p))

# check the top and bottom of the dataframe for column var_timbre12
musicdf['var_timbre12'].head(10)
musicdf['var_timbre12'].tail(10)
The result of this would be the following:
%genre track_id artist_name
0 classic pop and rock TRFCOOU128F427AEC0 Blue Oyster Cult
1 classic pop and rock TRNJTPB128F427AE9F Blue Oyster Cult
By having the data in this format, you can now do lots of grouping tasks, finding certain genres and their relative attributes, etc. using pandas groupby function.
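From there, if you specifically want the sklearn-style data.target and data.data arrays from the question, one possible sketch (assuming the numeric feature columns start right after %genre, track_id and artist_name; note that cat.codes happens to give 'classic pop and rock' = 0 and 'folk' = 1 here because it assigns codes alphabetically):
# encode each genre string as an integer code
target = musicdf['%genre'].astype('category').cat.codes.to_numpy()
# treat every column after the first three as a float feature
data = musicdf.iloc[:, 3:].astype(float).to_numpy()
print(target[0], data[0, :])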
Hope this helps!
