I read CSV files that are generated by Excel.
These can contain empty columns to the right of the data range/table and empty rows below it.
Empty here meaning really empty. So: No column header, no data whatsoever, clearly an artifact.
In a first iteration I just used
pd.read_csv().dropna(axis=1, how='all', inplace=False).dropna(axis='index', how='all', inplace=False)
which seemed to work fine.
But it also removes correctly empty columns. Correctly empty here meaning regular columns including a column name, that are really supposed to be empty because that is their data.
I do want to keep all columns that
have a proper column name OR contain data -> someone might have just forgotten to give a column name, but it is a proper column
So, per https://stackoverflow.com/a/43983654/2215053 I first used
unnamed_cols_mask = basedata_df2.columns.str.contains('^Unnamed')
basedata_df2.loc[:, ~unnamed_cols_mask] + basedata_df2.loc[:, unnamed_cols_mask].dropna(axis=1, how='all', inplace=False)
which looks and feels clean, but it scrambles the column order.
So now I go with:
df = pd.read_csv().dropna(axis='index', how='all', inplace=False)
df = df[[column_name for column_name in df.columns.array if not column_name.startswith('Unnamed: ') or not df[column_name].isnull().all()]]
Which works.
But there should be an obviously right way to accomplish this frequently occurring task?
So how could I do this better?
Specifically: Is there a way to make sure the column names starting with 'Unnamed: ' were created by the pd.read_csv() and not originally imported from the csv?
Unfortunately, I think there is no built-in function. Also not in pandas.read_csv. But you can apply the following code:
# flag every column that contains only NaN values
ser_all_na = df.isna().all(axis='rows')
# flag every column that got a generic name ("Unnamed: ...")
del_indexer = ser_all_na.index.str.startswith('Unnamed: ')
# keep only the columns that have no explicit name and contain only NaN values
del_indexer &= ser_all_na.to_numpy()
df.drop(columns=ser_all_na[del_indexer].index, inplace=True)
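On the question of telling pandas-generated 'Unnamed: ' names apart from names that were really in the file: one possible approach is to read the header row a second time as plain data and check which header cells are actually blank. The sketch below assumes the CSV is available as a string and that the header is the first line; read_csv_without_padding is a hypothetical helper name, not a pandas function. It also keeps the original column order.

```python
import io
import pandas as pd

def read_csv_without_padding(csv_text):
    """Hypothetical helper: drop Excel padding rows/columns, keep column order."""
    # Pass 1: read only the header line as data, so blank cells stay empty strings.
    header_row = pd.read_csv(
        io.StringIO(csv_text), header=None, nrows=1, dtype=str, keep_default_na=False
    ).iloc[0]
    blank_header = (header_row.str.strip() == "").to_numpy()

    # Pass 2: normal read, then drop rows that are completely empty.
    df = pd.read_csv(io.StringIO(csv_text)).dropna(axis="index", how="all")
    all_na = df.isna().all(axis="index").to_numpy()

    # Drop a column only if its header cell was blank AND it holds no data.
    return df.loc[:, ~(blank_header & all_na)]

csv_text = "a,b,,\n1,,x,\n2,,,\n,,,\n"
print(read_csv_without_padding(csv_text).columns.tolist())
```

A column whose header in the file literally reads "Unnamed: 9" is treated as named here, because the check looks at the raw header cell rather than at the generated label.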
I am working on automating a process with python using pandas. Previously I would use Excel PowerQuery to combine files and manipulate data but PowerQuery is not as versatile as I need so I am now using pandas. I have the process working up to a point where I can loop through files, select the columns that I need in the correct order, dependent on each workbook, and insert that into a dataframe. Once each dataframe is created, I then concatenate them into a single dataframe and write to csv. Before writing, I need to apply some validation to certain columns.
For example, I have a Stock Number column that will always need to be exactly 11 characters long. Sometimes, dependent on the workbook, the data will be missing the leading zeros or will have more than 11 characters (but those extra characters should be removed). I know that what I need to do is something along the lines of:
STOCK_NUM.zfill(13)[:13]
but I'm not sure how to actually modify the existing dataframe values. Do I actually need to loop through the dataframe or is there a way to apply formatting to an entire column?
e.g.
dataset = [['51346812942315.01', '01-15-2018'], ['13415678', '01-15-2018'], ['5134687155546628', '01/15/2018']]
df = pd.DataFrame(dataset, columns = ['STOCK_NUM', 'Date'])
for x in df["STOCK_NUM"]:
print(x.zfill(13)[:13])
I would like to know the most optimal way to apply that format to the existing values and only if those values are present (i.e. not touching it if there are null values).
Also, I need to ensure that the date columns contain true date values. Sometimes the dates are formatted as MM-DD-YYYY, sometimes as MM/DD/YY, etc., and any of those are fine. What is not fine is if the actual value in the date column is an Excel serial number that Excel can format as a date. Is there some way to apply validation logic to an entire dataframe column to ensure that there is a valid date instead of a serial number?
I honestly have no idea how to approach this date issue.
Any and all advice, insight would be greatly appreciated!
Not an expert, but from things I could gather here and there you could try:
df['STOCK_NUM']=df['STOCK_NUM'].str.zfill(13)
followed by:
df['STOCK_NUM'] = df['STOCK_NUM'].str.slice(0,13)
For the first part.
For dates you can do a try-except on:
df['Date'] = pd.to_datetime(df['Date'])
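A minimal sketch of the STOCK_NUM part as one chained, vectorized expression. The .str methods skip missing values, so nulls stay null without any special handling. The width variable is my own addition: the question mentions both 11 and 13 characters, so set it to whichever is correct for your data.

```python
import pandas as pd

df = pd.DataFrame(
    {"STOCK_NUM": ["51346812942315.01", "13415678", "5134687155546628", None]}
)

width = 13  # assumption: taken from the question's zfill(13)[:13]; adjust as needed

# Pad with leading zeros, then cut off anything beyond the target width.
# Missing values pass through untouched.
df["STOCK_NUM"] = df["STOCK_NUM"].str.zfill(width).str.slice(0, width)

print(df["STOCK_NUM"].tolist())
```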
For your STOCK_NUM question, you could apply a function to the column, but the way I approach this is with a list comprehension. The first thing I would do is replace all the NAs in your STOCK_NUM column with a unique string and then apply the list comprehension, as you can see in the code below:
import pandas as pd
dataset = [['51346812942315.01', '01-15-2018'], ['13415678', '01-15-2018'], ['5134687155546628', '01/15/2018'], [None,42139]]
df = pd.DataFrame(dataset, columns = ['STOCK_NUM', 'Date'])
#replace NAs with a string
df['STOCK_NUM'] = df['STOCK_NUM'].fillna('IS_NA')
#use list comprehension to reformat the STOCK_NUM column
df['STOCK_NUM'] = [None if i=='IS_NA' else i.zfill(13)[:13] for i in df.STOCK_NUM]
Then for your question relating to converting excel serial number to a date, I looked at an already answered question. I am assuming that the serial number in your dataframe is an integer type:
import datetime
def xldate_to_datetime(xldate):
    temp = datetime.datetime(1900, 1, 1)
    delta = datetime.timedelta(days=xldate) - datetime.timedelta(days=2)
    return pd.to_datetime(temp+delta)
df['Date'] = [xldate_to_datetime(i) if type(i)==int else pd.to_datetime(i) for i in df.Date]
Hopefully this works for you! Accept this answer if it does, otherwise reply with whatever remains an issue.
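For the date column, here is a rough sketch that handles each cell separately, so mixed formats do not interfere with one another. It assumes that any purely numeric cell (a number, or a string made only of digits) is an Excel serial number in the 1900 date system; parse_date_cell and EXCEL_EPOCH are names I made up for illustration.

```python
import numbers
import pandas as pd

EXCEL_EPOCH = "1899-12-30"  # origin that maps Excel 1900-system serials to dates

def parse_date_cell(value):
    """Hypothetical helper: turn one cell into a Timestamp (or NaT if missing)."""
    if pd.isna(value):
        return pd.NaT
    # Real numbers (but not booleans) are treated as Excel serial numbers.
    if isinstance(value, numbers.Real) and not isinstance(value, bool):
        return pd.to_datetime(value, unit="D", origin=EXCEL_EPOCH)
    text = str(value).strip()
    # Digit-only strings are assumed to be serial numbers stored as text.
    if text.isdigit():
        return pd.to_datetime(int(text), unit="D", origin=EXCEL_EPOCH)
    # Everything else goes through the normal date parser.
    return pd.to_datetime(text)

df = pd.DataFrame({"Date": ["01-15-2018", "01/15/2018", 42139, None]})
df["Date"] = [parse_date_cell(v) for v in df["Date"]]
print(df)
```

If serial numbers should be rejected rather than converted, the two numeric branches could raise or return pd.NaT instead.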
I am trying to find columns hitting specific conditions and put a value in the column col.
My current implementation is:
df.loc[~(df['myCol'].isin(myInfo)), 'col'] = 'ok'
In the future, myCol will hold multiple pieces of info. So I need to split the value in myCol without changing the dataframe and check whether any of the split values are in myInfo. If one of them is, the current row should get the value 'ok' in the column col. Is there an elegant way to do this without actually splitting and saving the result in an extra variable?
Currently, I do not know how the multiple info will be represented (either separated by a character or just concatenated one after one, each consisting of 4 alphanumeric values).
Let's say you need to split on "-" for your myCol column.
sep='-'
deconcat = df['myCol'].str.split(sep, expand=True)
new_df=df.join(deconcat)
The new_df DataFrame will have the same index as df, therefore you can do what you want with new_df and then join back to df to filter it how you want.
You can do the above .isin code for each of the new split columns to get your desired result.
Source:
Code taken from the pyjanitor documentation which has a built-in function, deconcatenate_column, that does this.
Source code for deconcatenate_column
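If you would rather not add the split columns to a DataFrame at all, a possible alternative is to split, explode, and collapse back to one boolean per row. This is only a sketch: it assumes a "-" separator and made-up sample values, and it marks rows where any part is in myInfo (the question's current snippet negates the mask, so flip it with ~ if that is what you need).

```python
import pandas as pd

df = pd.DataFrame({"myCol": ["AB12-CD34", "EF56", "GH78-AB12", None]})
myInfo = ["AB12", "ZZ99"]  # sample lookup values, invented for this example
sep = "-"                  # assumption: the real separator is not known yet

# One row per split part; the original row index is repeated for each part.
parts = df["myCol"].str.split(sep).explode()

# True for every original row where at least one part is in myInfo.
hit = parts.isin(myInfo).groupby(level=0).any()

df.loc[hit, "col"] = "ok"
print(df)
```

For the fixed-width variant (4 alphanumeric characters each, no separator), df["myCol"].str.findall(r"[A-Za-z0-9]{4}") could replace the str.split call, with the rest unchanged.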
I've got some big CSVs. They can easily have over 300k rows and 500 columns. So obviously I'd like to get rid of some unneeded data in the resulting dataframe to save resources.
There are some columns with fixed labels and also a variable number of columns that have similar labels but are numbered.
example=pd.DataFrame(columns=["fix","variable 1","variable 2","waste 1","waste 2"])
I want to get all these variable columns, which I can get via
example.filter(regex="var")
but I want to include "fix" as well. As df.loc doesn't allow regexes and df.filter only supports a single argument, is there a smooth way to do this? Or do I have to create a quite complex callable?
thanks in advance
Just modify your regex to do a full match for "fix":
df.filter(regex=r"var|(^fix$)")
Empty DataFrame
Columns: [fix, variable 1, variable 2]
Index: []
Another option is using Index.str.contains in the same fashion:
df.loc[:,df.columns.str.contains(r'var|(?:^fix$)') ]
Empty DataFrame
Columns: [fix, variable 1, variable 2]
Index: []
I made the group non-capturing, otherwise pandas complains.
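Since the goal is to save resources, it may also be worth skipping the unwanted columns at read time instead of filtering afterwards. read_csv accepts a callable for usecols that is evaluated against each column name. A small sketch with an in-memory file standing in for the real CSV; the wanted function is just an illustrative name:

```python
import io
import pandas as pd

# Stand-in for the real file, using the column names from the question.
csv_text = "fix,variable 1,variable 2,waste 1,waste 2\n1,2,3,4,5\n"

def wanted(name):
    # Keep the exact column "fix" plus every column starting with "variable".
    return name == "fix" or name.startswith("variable")

df = pd.read_csv(io.StringIO(csv_text), usecols=wanted)
print(df.columns.tolist())
```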
New to pandas here. I have a df:
inked=tracker[['A','B','C','D','AA','BB','CC', 'DD', 'E', 'F']]
Single-letter column names contain names, and double-letter column names contain numbers but also NaN.
I am converting all NaN to zeros by using this:
inked.loc[:,'AA':'DD'].fillna(0)
and it works, but when I do
inked.head()
I get the original df with the NaN. How can I make the change permanently in the df?
By default, fillna() is not performed in place. If you were operating directly on the DataFrame, then you could use the inplace=True argument, like this:
inked.fillna(0, inplace=True)
However, if you first select a subset of the columns, using loc, then the results are lost.
This was covered here. Basically, you need to re-assign the updated DataFrame back to the original DataFrame. For a list of columns (rather than a range, like you originally tried), you can do this:
inked[['AA','DD']] = inked[['AA','DD']].fillna(0)
In general when performing dataframe operations, when you want to alter a dataframe you either need to re-assign it to itself, or to a new variable. (In my experience at least)
inked.loc[:,'AA':'DD'] = inked.loc[:,'AA':'DD'].fillna(0)
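To illustrate the write-back for the whole 'AA' to 'DD' range while leaving every other column alone, here is a small sketch with made-up data. It looks up the labels in the range once and then assigns the filled block back to those columns:

```python
import numpy as np
import pandas as pd

# Made-up stand-in for the question's DataFrame.
inked = pd.DataFrame(
    {
        "A": ["x", "y"],
        "AA": [1.0, np.nan],
        "BB": [np.nan, 2.0],
        "DD": [np.nan, np.nan],
        "E": ["p", None],
    }
)

# Labels of every column between 'AA' and 'DD' (inclusive).
cols = inked.loc[:, "AA":"DD"].columns

# Fill only those columns and write the result back into the same DataFrame.
inked[cols] = inked[cols].fillna(0)
print(inked)
```

Columns outside the range (here 'A' and 'E') keep their values, including any missing ones.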
Recently I have been developing some code to read a csv file and store key data columns in a dataframe. Afterwards I plan to have some mathematical functions performed on certain columns in the dataframe.
I've been fairly successful in storing the correct columns in the dataframe. I have been able to have it do whatever maths is necessary such as summations, additions of dataframe columns, averaging etc.
My problem lies in accessing specific columns once they are stored in the dataframe. I was working with a test file to get everything working and managed this no problem. The problems arise when I open a different csv file: it will store the data in the dataframe, but accessing the column I want no longer works and it stops at the calculation part.
From what I can tell the problem lies with how it reads the column name. The column names are all numbers. For example, df['300'], df['301'] etc. When accessing the column df['300'] works fine in the testfile, while the next file requires df['300.0']. If I switch to a different file it may require df['300'] again. All the data was obtained in the same way so I am not certain why some are read as 300 and the others 300.0.
Short of constantly changing the column labels each time I open a different file, is there any way to have it automatically distinguish between '300' and '300.0' when opening the file, or force '300.0' = '300'?
Thanks
In your dataframe df, one way to stay consistent is to convert all column names to the same form. You can turn each float-like name into the string of its integer value, i.e. '300.0' to '300', by assigning to .columns as below. After that, df['300'] (or any other column accessed the same way) should work for every file.
df.columns = [str(int(float(column))) for column in df.columns]
Or, if integer value is not required,extra int conversion can be removed and float string value can be used:
df.columns = [str(float(column)) for column in df.columns]
Then, df['300.0'] can be used instead of df['300'].
If string type is not required then, I think converting them float would work as well.
df.columns = [float(column) for column in df.columns]
Then, df[300.0] would work as well.
Another way to change the column names is to use map (wrapped in list so that a concrete sequence is assigned):
Changing to float value for all columns, then as mentioned above use df[300.0]:
df.columns = list(map(float, df.columns))
Changing to string value of float, then df['300.0']:
df.columns = list(map(str, map(float, df.columns)))
Changing to string value of int, then df['300']:
df.columns = list(map(str, map(int, map(float, df.columns))))
Some solutions:
Go through all the files, change the columns names, then save the result in a new folder. Now when you read a file, you can go to the new folder and read it from there.
Wrap the normal file read function in another function that automatically changes the column names, and call that new function when you read a file.
Wrap column selection in a function. Use a try/except block to have the function try to access the given column, and if it fails, use the other form.
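The third option could look roughly like this. It is only a sketch: get_column is a hypothetical helper, and it assumes the labels are strings that differ only by a trailing ".0".

```python
import pandas as pd

def get_column(df, label):
    """Hypothetical helper: return df[label], falling back to the other spelling."""
    try:
        return df[label]
    except KeyError:
        # Try the alternative form: '300' <-> '300.0'.
        if label.endswith(".0"):
            alternative = label[:-2]
        else:
            alternative = label + ".0"
        return df[alternative]

file_a = pd.DataFrame({"300": [1, 2]})
file_b = pd.DataFrame({"300.0": [3, 4]})

print(get_column(file_a, "300").tolist())
print(get_column(file_b, "300").tolist())
```

If neither spelling exists, the second lookup raises KeyError as usual, so a genuinely missing column is still reported.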
This answer assumes you want only the integer part to remain in the column name. It takes the column names and does a float->int->string conversion to strip the decimal places.
Be careful, if you have numbers like '300.5' as a column name, this will turn them into '300'.
cols = df.columns.tolist()
new_columns = dict([(c,str(int(float(c)))) for c in cols])
df = df.rename(columns = new_columns)
For clarity, most of the 'magic' is happening on the middle line. I iterate over the currently existing columns, and turn them into tuples of the form (old_name, new_name). df.rename takes that dictionary and then does the renaming for you.
My thanks to user Nipun Batra for this answer that explained df.rename.
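A slightly more defensive variant of the same idea, in case some labels are not numeric or are genuine decimals like '300.5'. This is a sketch under the assumption that only whole-number labels should be rewritten; normalize_label is a name I made up. df.rename also accepts a function, so no dictionary is needed:

```python
import pandas as pd

def normalize_label(label):
    """Hypothetical helper: '300.0' -> '300'; leave everything else unchanged."""
    try:
        number = float(label)
    except (TypeError, ValueError):
        return label  # not a number at all, e.g. 'name'
    if number.is_integer():
        return str(int(number))
    return label  # a real decimal such as '300.5' is kept as it is

df = pd.DataFrame(columns=["300.0", "301", "300.5", "name"])
df = df.rename(columns=normalize_label)
print(df.columns.tolist())
```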