Cleaning Dataframe in Python 3 - python

I've got a dataframe (haveleft) full of people who have left a service and their reason for leaving. The 'text' column is their reason, but some of them aren't strings. Not many, so I just want to remove those rows, either in place or to a new dataframe. Below code just gives me a dataframe populated with only NaN. Why doesn't it work?
cleanedleft = pd.DataFrame()
cleanedleft = haveleft[haveleft[haveleft['text'] == str]]
print(holder[0:10])
or if I remove one of the 'haveleft[ ]' I get an empty dataframe
cleanedleft = pd.DataFrame()
cleanedleft = haveleft[haveleft['text'] == str]
print(holder[0:10])
I've tried to add a type() but can't seem to figure out the way to do this.

It doesn't work because DataFrame columns cannot contain mixed types; your text column will be string or object, even if some values are numerical. You'll want to figure out how to characterize unwanted data and drop them accordingly.
For instance, to drop rows where 'text' consists only of digits as in the single-line example you give:
cleaned = df[~df['text'].str.match('^\d+$')]

Related

Not able to insert a string to a position in a dataframe

I'm trying to iterate over two data frames with different lenghts in order to look for some data in a string. If the data is found, I should be able to add info to a specific position in a data frame. Here is the code.
In the df data frame, I created an empty column, which is going to receive the data in the future.
I also have the df_userstory data frame. And here is where I'm looking for the data. So, I created the code below.
Both df['Issue key'][i] and df_userstory['parent_of'][i] contains strings.
df['parent'] = ""
for i in df_userstory.index:
if df['Issue key'][i] in df_userstory['parent_of'][i]:
item = df_userstory['Issue key'][i]
df['parent'].iloc[i] = item
df
For some reason, when I run this code the df['parent'] remains empty. I've tried different approaches, but everything failed.
I've tried to do the following in order to check what was happening:
df['parent'] = ""
for i in df_userstory.index:
if df['Issue key'][i] in df_userstory['parent_of'][i]:
print('True')
Nothing was printed out.
I appreciate your help here.
Cheers
Iterating over each index will lose all the performance benefits of using Pandas dataframe.
You should use dataframe merge.
# select the two columns you need from df_userstory
df_us = df_userstory.loc[:, ['parent_of', 'Issue key']]
# rename to parent, that will be merged with df dataframe
df_us.rename(columns={'Issue key': 'parent'}, inplace=True).drop_duplicates()
# merge
df = df.merge(df_us, left_on='Issue key', right_on='parent_of', how='left')
Ref: Pandas merge

Pandas, when merging two dataframes and values for some columns don't carry over

I'm trying to combine two dataframes together in pandas using left merge on common columns, only when I do that the data that I merged doesn't carry over and instead gives NaN values. All of the columns are objects and match that way, so i'm not quite sure whats going on.
this is my first dateframe header, which is the output from a program
this is my second data frame header. the second df is a 'key' document to match the first output with its correct id/tastant/etc and they share the same date/subject/procedure/etc
and this is my code thats trying to merge them on the common columns.
combined = first.merge(second, on=['trial', 'experiment','subject', 'date', 'procedure'], how='left')
with output (the id, ts and tastant columns should match correctly with the first dataframe but doesn't.
Check your dtypes, make sure they match between the 2 dataframes. Pandas makes assumptions about data types when it imports, it could be assuming numbers are int in one dataframe and object in another.
For the string columns, check for additional whitespaces. They can appear in datasets and since you can't see them and Pandas can, it result in no match. You can use df['column'].str.strip().
https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.Series.str.strip.html

How to update/apply validation to pandas columns

I am working on automating a process with python using pandas. Previously I would use Excel PowerQuery to combine files and manipulate data but PowerQuery is not as versatile as I need so I am now using pandas. I have the process working up to a point where I can loop through files, select the columns that I need in the correct order, dependent on each workbook, and insert that into a dataframe. Once each dataframe is created, I then concatenate them into a single dataframe and write to csv. Before writing, I need to apply some validation to certain columns.
For example, I have a Stock Number column that will always need to be exactly 11 characters long. Sometimes, dependent on the workbook, the data will be missing the leading zeros or will have more than 11 characters (but those extra characters should be removed). I know that what I need to do is something along the lines of:
STOCK_NUM.zfill(13)[:13]
but I'm not sure how to actually modify the existing dataframe values. Do I actually need to loop through the dataframe or is there a way to apply formatting to an entire column?
e.g.
dataset = [['51346812942315.01', '01-15-2018'], ['13415678', '01-15-2018'], ['5134687155546628', '01/15/2018']]
df = pd.DataFrame(dataset, columns = ['STOCK_NUM', 'Date'])
for x in df["STOCK_NUM"]:
print(x.zfill(13)[:13])
I would like to know the most optimal way to apply that format to the existing values and only if those values are present (i.e. not touching it if there are null values).
Also, I have a need to ensure that the date columns are truly date values. Sometimes the dates are formatted as MM-DD-YYYY or sometimes MM/DD/YY, etc.. and any of those are fine but what is not fine is if the actual value in the date column is an Excel serial number that Excel can fomat as a date. Is there some way to apply validation logic to an entire dataframe column the ensure that as there is a valid date instead of serial number?
I honestly have no idea how to approach this date issue.
Any and all advice, insight would be greatly appreciated!
Not an expert, but from things I could gather here and there you could try try:
df['STOCK_NUM']=df['STOCK_NUM'].str.zfill(13)
followed by:
df['STOCK_NUM'] = df['STOCK_NUM'].str.slice(0,13)
For the first part.
For dates you can do a try-except on:
df['Date'] = pd.to_datetime(df['Date'])
for your STOCK_NUM question, you could potentially apply a function to the column but the way I approach this is using list comprehensions. The first thing I would do is replace all the NAs in your STOCK_NUM column by a unique string and then apply the list comprehension as you can see in the code below:
import pandas as pd
dataset = [['51346812942315.01', '01-15-2018'], ['13415678', '01-15-2018'], ['5134687155546628', '01/15/2018'], [None,42139]]
df = pd.DataFrame(dataset, columns = ['STOCK_NUM', 'Date'])
#replace NAs with a string
df.STOCK_NUM.fillna('IS_NA',inplace=True)
#use list comprehension to reformat the STOCK_NUM column
df['STOCK_NUM'] = [None if i=='IS_NA' else i.zfill(13)[:13] for i in df.STOCK_NUM]
Then for your question relating to converting excel serial number to a date, I looked at an already answered question. I am assuming that the serial number in your dataframe is an integer type:
import datetime
def xldate_to_datetime(xldate):
temp = datetime.datetime(1900, 1, 1)
delta = datetime.timedelta(days=xldate) - datetime.timedelta(days=2)
return pd.to_datetime(temp+delta)
df['Date'] = [xldate_to_datetime(i) if type(i)==int else pd.to_datetime(i) for i in df.Date]
Hopefully this works for you! Accept this answer if it does, otherwise reply with whatever remains an issue.

How to check splitted values of a column without changing the dataframe?

I am trying to find columns hitting specific conditions and put a value in the column col.
My current implementation is:
df.loc[~(df['myCol'].isin(myInfo)), 'col'] = 'ok'
In the future, myCol will have multiple info. So I need to split the value in myCol without changing the dataframe and check if any of the splitted values are in myInfo. If one of them are, the current row should get the value 'ok' in the column col. Is there an elegant way without really splitting and saving in an extra variable?
Currently, I do not know how the multiple info will be represented (either separated by a character or just concatenated one after one, each consisting of 4 alphanumeric values).
Let's say you need to split on "-" for your myCol column.
sep='-'
deconcat = df['MyCol'].str.split(sep, expand=True)
new_df=df.join(deconcat)
The new_df DataFrame will have the same index as df, therefore you can do what you want with new_df and then join back to df to filter it how you want.
You can do the above .isin code for each of the new split columns to get your desired result.
Source:
Code taken from the pyjanitor documentation which has a built-in function, deconcatenate_column, that does this.
Source code for deconcatenate_column

Convert multiple columns to string in pandas dataframe

I have a pandas data frame with different data types. I want to convert more than one column in the data frame to string type. I have individually done for each column but want to know if there is an efficient way?
So at present I am doing something like this:
repair['SCENARIO']=repair['SCENARIO'].astype(str)
repair['SERVICE_TYPE']= repair['SERVICE_TYPE'].astype(str)
I want a function that would help me pass multiple columns and convert them to strings.
To convert multiple columns to string, include a list of columns to your above-mentioned command:
df[['one', 'two', 'three']] = df[['one', 'two', 'three']].astype(str)
# add as many column names as you like.
That means that one way to convert all columns is to construct the list of columns like this:
all_columns = list(df) # Creates list of all column headers
df[all_columns] = df[all_columns].astype(str)
Note that the latter can also be done directly (see comments).
I know this is an old question, but I was looking for a way to turn all columns with an object dtype to strings as a workaround for a bug I discovered in rpy2. I'm working with large dataframes, so didn't want to list each column explicitly. This seemed to work well for me so I thought I'd share in case it helps someone else.
stringcols = df.select_dtypes(include='object').columns
df[stringcols] = df[stringcols].fillna('').astype(str)
The "fillna('')" prevents NaN entries from getting converted to the string 'nan' by replacing with an empty string instead.
You can also use list comprehension:
df = [df[col_name].astype(str) for col_name in df.columns]
You can also insert a condition to test if the columns should be converted - for example:
df = [df[col_name].astype(str) for col_name in df.columns if 'to_str' in col_name]

Categories