So I have an Excel file with a date column, but not every row has a date, as can be seen.
I read this Excel file into a pandas dataframe, rename the column, and get the following:
My question is: how can I fill every empty date in the dataframe with the last date encountered before it? For example, all of the blanks between 04/03/2021 and 05/03/2021 get replaced with 04/03/2021, so that every row in my dataframe has a date associated with it.
Thanks!
After reading the data into a dataframe, you can fill the missing values with a forward fill (ffill).
Just use the built-in pandas way:
duplicate_df['StartDate'] = duplicate_df['StartDate'].ffill()
This replaces each NaN in the column with the last non-missing value above it. (The older spelling fillna(method='ffill') does the same thing but is deprecated in recent pandas.)
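A minimal end-to-end sketch, with hypothetical file and column names standing in for the ones in the question:

import pandas as pd

df = pd.read_excel('dates.xlsx')                      # hypothetical file name
df = df.rename(columns={'Start Date': 'StartDate'})   # hypothetical rename

# Forward-fill: each missing date takes the last date seen above it.
df['StartDate'] = df['StartDate'].ffill()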
I have a big csv file with data from an experiment. The first part of each person's response is a trial section that doesn't record the time they took for each response, and I don't need that. After that section, the data gains another column, the time, and those are the rows I need. So, basically, the csv has a lot of unusable rows with 9 columns instead of 10, and I only need the rows with 10 columns. How can I grab just that data instead of all of it?
As an example, the first row below shows the data without the time column (second to last), and the second row shows the data I need, with the time column added. Basically, I only need rows like the second one, and there are thousands of them. Any tips would be appreciated.
1619922425,5fe43773223070f515613ba23f3b770c,PennController,7,0,experimental-trial2,NULL,PennController,9,_Trial_,End,1619922289638,FLOR, red, r,NULL
1619922425,5fe43773223070f515613ba23f3b770c,PennController,55,0,experimental-trial,NULL,PennController,56,_Trial_,Start,1619922296066,CASA, red, r,1230,NULL
Read the CSV using pandas. Then filter with df[~df.time.isna()] to select all rows with non-NaN values in the "time" column.
You can change this to filter based on the presence of data in any column. Think of it as a mask: mask = ~df.time.isna() flags each row as True or False depending on the condition.
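A minimal sketch of that idea, assuming the file has already been read into a frame with a column named time (the file name is a placeholder):

import pandas as pd

df = pd.read_csv('experiment.csv')

mask = ~df['time'].isna()   # True for rows that actually have a time value
df = df[mask]               # keep only those rows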
One option is to load the whole file and then keep only valid data:
import pandas as pd
df = pd.read_csv("your_file.csv")
invalid_rows = df.iloc[:,-1].isnull() # Find rows where the last column is missing
df = df[~invalid_rows] # Select only valid rows
If your columns are named, you can use df['column_name'] instead of df.iloc[:,-1].
Of course, this means you first load the full dataset, but in many cases that is not a problem.
I want to read a file 'tos_year.csv' into a Pandas dataframe, such that all values are in one single column. I will later use pd.concat() to add this column to an existing dataframe.
The CSV file holds 80 entries in the form of years, i.e. "... 1966,1966,1966,1966,1967,1967,... "
What I can't figure out is how to read the values into one column with 80 rows, instead of 80 columns with one row.
This is probably quite basic but I'm new to this. Here's my code:
import pandas as pd
tos_year = pd.read_csv('tos_year.csv').T
tos_year.reset_index(inplace=True)
tos_year.columns = ['Year']
As you can see, I tried reading it in and then transposing the dataframe, but when it is first read in, the year numbers are interpreted as column names, and since there apparently cannot be several columns with identical names, I end up with a dataframe of str values like
...
1966
1966.1
1966.2
1966.3
1967
1967.1
...
which is not what I want. So clearly, it's preferable to read it in correctly from the start.
Thanks for any advice!
Add header=None to avoid parsing the years as column names, then transpose and rename the column, e.g. with DataFrame.set_axis:
tos_year = pd.read_csv('tos_year.csv', header=None).T.set_axis(['Year'], axis=1)
Or:
tos_year = pd.read_csv('tos_year.csv', header=None).T
tos_year.columns = ['Year']
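For the later step mentioned in the question, a minimal sketch of attaching the new column to an existing dataframe with pd.concat (df here stands in for that existing frame, assumed to also have 80 rows):

import pandas as pd

tos_year = pd.read_csv('tos_year.csv', header=None).T.set_axis(['Year'], axis=1)

# Reset both indexes so the rows pair up positionally.
df = pd.concat([df.reset_index(drop=True),
                tos_year.reset_index(drop=True)], axis=1)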
I have a csv in which one header name is missing, e.g. I have n data columns but only n-1 header names. When this happens, pandas shifts my first column over to be the index, as shown in the image. So the column to the right of date_time in the csv ends up under the date_time column in the pandas dataframe.
My question is: how can I force pandas to read from the left so that the date_time data stays under the date_time column instead of becoming the index? If pandas could simply read left to right and add dummy column names for the extra columns at the end, that would be great.
Side note: I concede that my input csv should be "clean". Still, I think pandas (and frameworks in general) should be able to handle the case where some data is unclean but the user wants to proceed with the analysis, rather than spending 30 minutes writing a side script to fix these minor issues. In my case, the data I care about is usually in the first 15 columns, and I don't really care whether the columns after that are misaligned. However, when I read the csv into pandas, I'm forced to care and to waste time fixing these issues even though the remaining columns don't matter to me.
Since you don't care about the last column, just set index_col=False:
df = pd.read_csv(file, index_col=False)
That way, pandas matches the n-1 column names to the first n-1 data columns, left to right. Data beyond that will not be in the dataframe.
Alternatively, you can skip the header row so that all of your data lands in the dataframe first (header=None stops pandas from treating the first data row as the header):
df = pd.read_csv(file, skiprows=1, header=None)
and then just set the column names afterwards:
df.columns = ['col1', 'col2', ...] + ['dummy_col1', 'dummy_col2', ...]
where the first list holds the names from row 0 of your csv, and the second list you fill dynamically with a list comprehension.
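A rough sketch of that, with placeholder names standing in for the real columns:

import pandas as pd

df = pd.read_csv(file, skiprows=1, header=None)

real_cols = ['date_time', 'col2', 'col3']   # the names from row 0 of your csv
n_dummy = df.shape[1] - len(real_cols)      # however many columns are left over
df.columns = real_cols + [f'dummy_col{i}' for i in range(1, n_dummy + 1)]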
I have a new DataFrame df, which was created using:
df = pd.DataFrame()
I have a date value called 'day' which is in format dd-mm-yyyy and a cost value called 'cost'.
How can I append the date and cost values to the df and assign the date as the index?
So for example if I have the following values
day = 01-01-2001
cost = 123.12
the resulting df would look like
date cost
01-01-2001 123.12
I will eventually be adding paired values for multiple days, so the df will eventually look something like:
date cost
01-01-2001 123.12
02-01-2001 23.25
03-01-2001 124.23
: :
01-07-2016 2.214
I have tried to append the paired values to the dataframe but am unsure of the syntax. I've tried various things, including the below, but without success.
df.append([day,cost], columns='date,cost',index_col=[0])
There are a few things here. First, making a column the index goes like this, though you can also do it when you load the dataframe from a file (see below):
df.set_index('date', inplace=True)
To add new rows, you should write them out to a file first. Pandas isn't great at adding rows dynamically, and this way you can just read the data in when you need it for analysis.
new_row = ...  # a row of new data as a string, with values
               # separated by commas and ending with \n
with open(path, 'a') as f:
    f.write(new_row)
You can do this in a loop, or singly, as many times as you need. Then when you're ready to work with the data, use:
df = pd.read_csv(path, index_col=0, parse_dates=True)
index_col takes either the position or the name of the index column; passing 0 here makes the first column (the dates) the index. Passing parse_dates=True turns the date strings you declared as the index into datetime objects.
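Putting those pieces together, a minimal end-to-end sketch (the file name and rows are placeholders; 'w' is used so the demo starts from a fresh file, whereas the appending loop above would use 'a'):

import pandas as pd

path = 'costs.csv'   # hypothetical file name

# Write a header plus a couple of sample rows.
with open(path, 'w') as f:
    f.write('date,cost\n')
    f.write('01-01-2001,123.12\n')
    f.write('02-01-2001,23.25\n')

# Load it back with the date column as a datetime index.
# dayfirst=True matches the dd-mm-yyyy format from the question.
df = pd.read_csv(path, index_col=0, parse_dates=True, dayfirst=True)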
Try this:
dfapp = pd.DataFrame([[day, cost]], columns=['date', 'cost'])
df = pd.concat([df, dfapp], ignore_index=True)
Note that append returned a new dataframe rather than modifying df in place, and DataFrame.append was removed in pandas 2.0, so pd.concat is the way to go now.
I have a dataframe df with two columns date and data. I want to take the first difference of the data column and add it as a new column.
It seems that df.set_index('date').shift() or df.set_index('date').diff() give me the desired result. However, when I try to add it as a new column, I get NaN for all the rows.
How can I fix this command:
df['firstdiff'] = df.set_index('date').shift()
to make it work?
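For what it's worth, the NaNs come from index alignment: df.set_index('date').shift() carries a date index, while df keeps its original integer index, so nothing lines up when you assign. A minimal sketch of two ways around that, assuming the value column is named data:

import pandas as pd

# Operate on the column directly, so the index never changes:
df['firstdiff'] = df['data'].diff()

# Or strip the mismatched index off before assigning:
df['firstdiff'] = df.set_index('date')['data'].diff().to_numpy()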