I need help cleaning a very large dataframe. One of the columns, "PostingTimeUtc", should contain only dates, but several rows were inserted incorrectly and contain strings of text instead. How can I select all the rows where "PostingTimeUtc" holds a string instead of a date and drop them?
I'm new to this site and to coding, so please let me know if I'm being vague.
Please remember to add examples, even if short.
This may work in your case:
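import pandas as pd
is_date = pd.to_datetime(df['column name'], errors='coerce').notna()
df[is_date]
Here pd.to_datetime with errors='coerce' turns every value that cannot be parsed as a date into NaT, so notna() results in True or False for each row, and that Boolean filter is applied to the dataframe. Mapping is_datetime64_any_dtype over the column would not work here, because that function checks the dtype of a whole array rather than individual values.
Don't forget to assign the result back to df to retain the values, as the filtering is not done in place.
df = df[is_date]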
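As a small worked example of dropping non-date rows, here is a sketch on made-up data. The sample values and the `value` column are assumptions for illustration, and it assumes the valid dates all share one format.

```python
import pandas as pd

# Hypothetical sample data: two valid dates and two stray text entries.
df = pd.DataFrame({
    "PostingTimeUtc": ["2019-01-05", "some stray text", "2019-02-10", "N/A"],
    "value": [1, 2, 3, 4],
})

# Values that cannot be parsed as dates become NaT instead of raising an error.
parsed = pd.to_datetime(df["PostingTimeUtc"], errors="coerce")

# Keep only the rows that parsed, and store the parsed dates in the column.
clean = df[parsed.notna()].copy()
clean["PostingTimeUtc"] = parsed[parsed.notna()]
print(clean)
```

Keeping `parsed` as a separate Series means the original column is untouched until you decide which rows to keep.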
I'm assuming it's a pandas DataFrame. You can filter rows with a regex; str.contains returns a Boolean mask that you can use to index the dataframe:
df[df.column_name.str.contains('your regex here')]
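To make the regex approach concrete, here is a minimal sketch. The sample data and the pattern are assumptions: it treats any entry that starts with YYYY-MM-DD as a date, which may need adjusting for your real format.

```python
import pandas as pd

# Hypothetical sample data with one bad row.
df = pd.DataFrame({
    "PostingTimeUtc": ["2019-01-05 10:00:00", "bad row", "2019-02-10 08:30:00"],
})

# Assumed pattern: valid entries start with YYYY-MM-DD.
date_pattern = r"^\d{4}-\d{2}-\d{2}"

# na=False makes missing values count as non-matching instead of NaN.
looks_like_date = df["PostingTimeUtc"].str.contains(date_pattern, regex=True, na=False)

# Keep only the rows that match the pattern.
df = df[looks_like_date]
print(df)
```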
I see a lot of questions related to dropping rows that have a certain value in a column, or dropping the entirety of columns, but pretend we have a Pandas Dataframe like the one below.
In this case, how could one write a line to go through the CSV, and drop all rows like 2 and 4? Thank you.
You could try
~((~df).all(axis=1))
to get the rows that you want to keep/drop. To get the dataframe with just those rows, you would use
df = df[~((~df).all(axis=1))]
A more detailed explanation is here:
Delete rows from a pandas DataFrame based on a conditional expression involving len(string) giving KeyError
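Here is a short sketch of the all-False filter on a made-up Boolean dataframe. The data is a hypothetical stand-in for the table in the question, with rows 2 and 4 holding only False values.

```python
import pandas as pd

# Hypothetical Boolean dataframe; rows 2 and 4 are False in every column.
df = pd.DataFrame(
    {"a": [True, False, True, False], "b": [False, False, True, False]},
    index=[1, 2, 3, 4],
)

# True for rows where every value is False.
all_false = (~df).all(axis=1)

# Keep the rows that have at least one True value.
kept = df[~all_false]

# An equivalent, shorter way to write the same filter.
kept_alt = df[df.any(axis=1)]
print(kept)
```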
This should help
column_names = df.columns
for i in df.index:
    count = 0
    for column_name in column_names:
        # count the False values in row i
        if df.loc[i, column_name] == False:
            count = count + 1
    # drop the row when every column is False
    if count == len(column_names):
        df.drop(index=i, inplace=True)
Apologies if this is contained in a previous answer but I've read this one: How to select rows from a DataFrame based on column values? and can't work out how to do what I need to do:
Suppose I have some pandas dataframe X and one of the columns is 'timestamp'. The entries are formatted like '2010-11-03 09:44:05'. I want to select just those rows that correspond to a specific day, for example, select just those rows for which the actual string in the timestamp column starts with '2010-11-03'. Is there a neat way to do this? Can I do it with a mask or Boolean indexing? Or should I just write a separate line to peel off the day from each entry and then select the rows? Bear in mind the dataframe is large, if it helps.
i.e. I want to write something like
X.loc[X['timestamp'].startswith('2010-11-03')]
or
mask = '2010-11-03' in X["timestamp"]
but these don't actually make any sense.
This should work:
X[X['timestamp'].str.startswith('2010-11-03')]
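For illustration, here is a sketch with made-up rows that shows the string-prefix mask next to a datetime-based alternative. The sample data and the `value` column are assumptions.

```python
import pandas as pd

# Hypothetical sample data in the 'YYYY-MM-DD HH:MM:SS' format from the question.
X = pd.DataFrame({
    "timestamp": ["2010-11-03 09:44:05", "2010-11-03 17:01:12", "2010-11-04 00:00:01"],
    "value": [1, 2, 3],
})

# Boolean mask built from the string prefix.
by_prefix = X[X["timestamp"].str.startswith("2010-11-03")]

# Alternative: parse once, then compare the date part (time set to midnight).
ts = pd.to_datetime(X["timestamp"])
by_date = X[ts.dt.normalize() == pd.Timestamp("2010-11-03")]
print(by_prefix)
```

The datetime version costs one parse up front but makes later range queries (for example, a whole week) easier to express.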
I am working on automating a process with python using pandas. Previously I would use Excel PowerQuery to combine files and manipulate data but PowerQuery is not as versatile as I need so I am now using pandas. I have the process working up to a point where I can loop through files, select the columns that I need in the correct order, dependent on each workbook, and insert that into a dataframe. Once each dataframe is created, I then concatenate them into a single dataframe and write to csv. Before writing, I need to apply some validation to certain columns.
For example, I have a Stock Number column that will always need to be exactly 11 characters long. Sometimes, dependent on the workbook, the data will be missing the leading zeros or will have more than 11 characters (but those extra characters should be removed). I know that what I need to do is something along the lines of:
STOCK_NUM.zfill(13)[:13]
but I'm not sure how to actually modify the existing dataframe values. Do I actually need to loop through the dataframe or is there a way to apply formatting to an entire column?
e.g.
dataset = [['51346812942315.01', '01-15-2018'], ['13415678', '01-15-2018'], ['5134687155546628', '01/15/2018']]
df = pd.DataFrame(dataset, columns = ['STOCK_NUM', 'Date'])
for x in df["STOCK_NUM"]:
print(x.zfill(13)[:13])
I would like to know the most optimal way to apply that format to the existing values and only if those values are present (i.e. not touching it if there are null values).
Also, I need to ensure that the date columns are truly date values. Sometimes the dates are formatted as MM-DD-YYYY or sometimes MM/DD/YY, etc., and any of those are fine, but what is not fine is if the actual value in the date column is an Excel serial number that Excel can format as a date. Is there some way to apply validation logic to an entire dataframe column to ensure that there is a valid date instead of a serial number?
I honestly have no idea how to approach this date issue.
Any and all advice, insight would be greatly appreciated!
Not an expert, but from things I could gather here and there, you could try:
df['STOCK_NUM']=df['STOCK_NUM'].str.zfill(13)
followed by:
df['STOCK_NUM'] = df['STOCK_NUM'].str.slice(0,13)
For the first part.
For dates you can do a try-except on:
df['Date'] = pd.to_datetime(df['Date'])
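A minimal sketch of both steps on made-up data follows. It assumes the serial number arrives as text and that valid dates use the MM-DD-YYYY format. Instead of a try-except, it uses errors="coerce" so that invalid rows can be listed rather than stopping at the first failure.

```python
import pandas as pd

# Hypothetical data: one missing stock number and one Excel-style serial as text.
dataset = [
    ["51346812942315.01", "01-15-2018"],
    ["13415678", "01-15-2018"],
    [None, "42139"],
]
df = pd.DataFrame(dataset, columns=["STOCK_NUM", "Date"])

# The .str methods skip missing values, so nulls stay null without a loop.
df["STOCK_NUM"] = df["STOCK_NUM"].str.zfill(13).str.slice(0, 13)

# Entries that do not match the assumed format become NaT.
parsed = pd.to_datetime(df["Date"], format="%m-%d-%Y", errors="coerce")

# Rows whose Date value is not a valid date in that format.
invalid_rows = df[parsed.isna()]
print(df)
print(invalid_rows)
```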
For your STOCK_NUM question, you could potentially apply a function to the column, but the way I approach this is with list comprehensions. The first thing I would do is replace all the NAs in your STOCK_NUM column with a unique string and then apply the list comprehension, as you can see in the code below:
import pandas as pd
dataset = [['51346812942315.01', '01-15-2018'], ['13415678', '01-15-2018'], ['5134687155546628', '01/15/2018'], [None,42139]]
df = pd.DataFrame(dataset, columns = ['STOCK_NUM', 'Date'])
#replace NAs with a string
df['STOCK_NUM'] = df['STOCK_NUM'].fillna('IS_NA')
#use list comprehension to reformat the STOCK_NUM column
df['STOCK_NUM'] = [None if i=='IS_NA' else i.zfill(13)[:13] for i in df.STOCK_NUM]
Then for your question relating to converting excel serial number to a date, I looked at an already answered question. I am assuming that the serial number in your dataframe is an integer type:
import datetime
def xldate_to_datetime(xldate):
    temp = datetime.datetime(1900, 1, 1)
    delta = datetime.timedelta(days=xldate) - datetime.timedelta(days=2)
    return pd.to_datetime(temp+delta)
df['Date'] = [xldate_to_datetime(i) if type(i)==int else pd.to_datetime(i) for i in df.Date]
Hopefully this works for you! Accept this answer if it does, otherwise reply with whatever remains an issue.
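If the serial numbers sit in their own integer column, the same conversion can likely be done without a helper function. This is a sketch under that assumption; the origin of 1899-12-30 is equivalent to the "1900-01-01 minus 2 days" offset used above and holds for serials after February 1900.

```python
import pandas as pd

# Hypothetical Excel serial numbers stored as integers.
serials = pd.Series([42139, 43115])

# Count days from 1899-12-30, matching Excel's 1900 date system.
dates = pd.to_datetime(serials, unit="D", origin="1899-12-30")
print(dates)
```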
I am trying to find columns hitting specific conditions and put a value in the column col.
My current implementation is:
df.loc[~(df['myCol'].isin(myInfo)), 'col'] = 'ok'
In the future, myCol will hold multiple pieces of info. So I need to split the value in myCol without changing the dataframe and check whether any of the split values are in myInfo. If one of them is, the current row should get the value 'ok' in the column col. Is there an elegant way to do this without really splitting and saving into an extra variable?
Currently, I do not know how the multiple info will be represented (either separated by a character or just concatenated one after one, each consisting of 4 alphanumeric values).
Let's say you need to split on "-" for your myCol column.
sep='-'
deconcat = df['myCol'].str.split(sep, expand=True)
new_df=df.join(deconcat)
The new_df DataFrame will have the same index as df, therefore you can do what you want with new_df and then join back to df to filter it how you want.
You can do the above .isin code for each of the new split columns to get your desired result.
Source:
Code taken from the pyjanitor documentation which has a built-in function, deconcatenate_column, that does this.
Source code for deconcatenate_column
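If you would rather not add the split columns to the dataframe, one possible sketch is to split, explode, and collapse back to one Boolean per row. The sample values, the "-" separator, and the contents of myInfo are assumptions for illustration.

```python
import pandas as pd

# Hypothetical data: 4-character codes joined by "-".
df = pd.DataFrame({"myCol": ["AB12-CD34", "EF56", "GH78-IJ90"]})
myInfo = ["CD34", "EF56"]

# One row per split value; the original row label is kept in the index.
parts = df["myCol"].str.split("-").explode()

# True for each original row where at least one split value is in myInfo.
has_match = parts.isin(myInfo).groupby(level=0).any()

# Mark the matching rows; other rows stay NaN in the new column.
df.loc[has_match, "col"] = "ok"
print(df)
```

To mark the rows with no match instead, as in the `~` version from the question, negate the mask with `~has_match`.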
Essentially this is the same question as in this link: How to automatically shrink down row numbers in R data frame when removing rows in R. However, I want to do this with a pandas dataframe. How would I go about doing so? There seems to be nothing similar to the rownames method of R dataframes in the pandas library. Any ideas?
What you call "row number" is part of the index in pandas-speak, in this case an integer index. You can rebuild the index using
df = df.reset_index(drop=True)
There is another way of doing this, which does not generate a new column with the old index:
df.index=range(len(df.index))
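A short sketch of the renumbering after rows are removed, using made-up data:

```python
import pandas as pd

# Hypothetical dataframe with the default 0..3 index.
df = pd.DataFrame({"x": [10, 20, 30, 40]})

# Dropping rows leaves gaps in the index (0 and 3 remain).
df = df.drop(index=[1, 2])

# Rebuild a gap-free 0..n-1 index and discard the old labels.
df = df.reset_index(drop=True)
print(df)
```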