I have a dataframe stations with four columns "1990", "2000", "2006", and "2012" with area data. To interpolate the years in between I want to insert columns with empty values in the gaps.
I did use pandas.DataFrame.insert to insert columns at specific locations but couldn't find out how to do that with multiple columns like pandas.DataFrame.insert[1, ["1991":"1999"], np.nan].
Is there a way to insert multiple columns with a consecutive number/name to fill the gaps?
I appreciate any help!
You won't hear this often for a question about pandas, but in this instance I think looping is probably the clearest solution:
for year in range(1991, 2000):
    df[str(year)] = np.nan
You can then reorder the columns afterwards.
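For what it's worth, here is a rough sketch of the loop together with the reordering step. The frame and its area values are made up for illustration, and it assumes the year columns are named with strings as in the question:

```python
import numpy as np
import pandas as pd

# Hypothetical sample data; only the four column names come from the question.
df = pd.DataFrame({"1990": [1.0, 2.0], "2000": [1.5, 2.5],
                   "2006": [1.8, 2.8], "2012": [2.0, 3.0]})

# Add an empty column for every year that is not present yet.
for year in range(1990, 2013):
    if str(year) not in df.columns:
        df[str(year)] = np.nan

# Reorder the columns chronologically (the names are strings, so sort by int value).
df = df[sorted(df.columns, key=int)]
```

Calling df.reindex(columns=[str(y) for y in range(1990, 2013)]) on the original frame should give the same layout in one step, since reindex creates any missing column filled with NaN.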
I have a dataframe that contains product sales for each day from 2018 to 2021. The dataframe contains four columns (Date, Place, Product Category and Sales). In the first two columns (Date, Place) I want to use the available data to fill in the gaps. Once the data is added, I would like to delete the rows that have no data in ProductCategory. I would like to do this with pandas in Python.
The sample of my data set looked like this:
I would like the dataframe to look like this:
Use ffill (forward fill), which propagates the last valid observation forward until the next valid one. Then drop the rows that contain NAs.
df['Date'] = df['Date'].ffill()
df['Place'] = df['Place'].ffill()
df.dropna(inplace=True)
Use forward filling to replace each null value with the nearest non-null value above it: df[['Date', 'Place']] = df[['Date', 'Place']].ffill(). Next, drop the rows with a missing product category: df.dropna(subset=['ProductCategory'], inplace=True). Congrats, now you have your desired df 😄
Documentation: Pandas fillna function, Pandas dropna function
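Since the sample data from the question isn't shown, here is a small sketch on invented rows. The values and the layout of the gaps are assumptions; only the column names follow the answers above:

```python
import numpy as np
import pandas as pd

# Hypothetical sample: Date and Place appear once, then stay empty for the detail rows.
df = pd.DataFrame({
    "Date": ["2018-01-01", np.nan, np.nan, "2018-01-02", np.nan],
    "Place": ["Berlin", np.nan, np.nan, "Paris", np.nan],
    "ProductCategory": [np.nan, "Food", "Toys", np.nan, "Food"],
    "Sales": [np.nan, 10.0, 5.0, np.nan, 7.0],
})

# Carry the last seen Date and Place down into the empty cells.
df[["Date", "Place"]] = df[["Date", "Place"]].ffill()

# Drop the rows that have no product category and renumber the index.
df = df.dropna(subset=["ProductCategory"]).reset_index(drop=True)
```

Restricting dropna to ProductCategory keeps rows whose only missing value is in some other column, which may or may not be what you want.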
Compute the frequency of the categories in the column by plotting;
in the plot, the tallest bars represent the most repeated values:
df['column'].value_counts().plot.bar()
Then get the most frequent value from the index: index[0] gives the most repeated value,
index[1] gives the second most repeated, and you can choose as per your requirement.
most_frequent_attribute = df['column'].value_counts().index[0]
Then fill the missing values with that value:
df['column'] = df['column'].fillna(most_frequent_attribute)
To fill multiple columns the same way, just define this as a function, like this:
def impute_nan(df, column):
    # mode() can return several values on a tie; take the first one
    most_frequent_category = df[column].mode()[0]
    df[column] = df[column].fillna(most_frequent_category)
for feature in ['column1', 'column2']:
    impute_nan(df, feature)
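If it helps, the same idea can probably be written without the loop, because DataFrame.mode() works on several columns at once. This is only a sketch on made-up data, using the placeholder column names from above:

```python
import numpy as np
import pandas as pd

# Hypothetical categorical data with one gap per column.
df = pd.DataFrame({
    "column1": ["a", "b", "a", np.nan],
    "column2": ["x", np.nan, "y", "y"],
})
cols = ["column1", "column2"]

# mode() returns one row per tied mode; iloc[0] picks the first mode of each column.
# Passing that Series to fillna fills each column with its own most frequent value.
df[cols] = df[cols].fillna(df[cols].mode().iloc[0])
```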
I just edited the question as maybe I didn't make myself clear.
I have two dataframes (MR and DT)
The column 'ID' in dataframe DT is a subset of the column 'ID' in dataframe MR. The two dataframes are only similar (not equal) in this ID column; the rest of the columns are different, as is the number of rows.
How can I get the rows from dataframe MR['ID'] that are equal to the dataframe DT['ID']? Knowing that values in 'ID' can appear several times in the same column.
(DT has 1538 rows and MR has 2060 rows.)
I tried some lines proposed here: https://stackoverflow.com/questions/28901683/pandas-get-rows-which-are-not-in-other-dataframe but I got bizarre results, as I don't fully understand the methods they proposed (and the goal is a little different).
Thanks!
Take a look at the pandas.Series.isin() method. In your case you'd want to use something like:
matching_id = MR.ID.isin(DT.ID) # This returns a boolean Series of whether values match or not
# Now filter your dataframe to keep only matching rows
new_df = MR.loc[matching_id, :]
Or if you want to just get a new dataframe of combined records for the same ID you need to use merge():
new_df = pd.merge(MR, DT, on='ID')
This will create a new dataframe with columns from both original dfs but only where ID is the same.
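Because the question says an ID can appear several times, it may be worth seeing how the two approaches differ on repeated IDs. A small sketch with invented values (the mr_value and dt_value columns are hypothetical):

```python
import pandas as pd

# Hypothetical frames; ID 2 repeats in MR and ID 4 repeats in DT.
MR = pd.DataFrame({"ID": [1, 2, 2, 3, 4], "mr_value": list("abcde")})
DT = pd.DataFrame({"ID": [2, 4, 4], "dt_value": [10, 20, 30]})

# isin keeps each MR row at most once, whatever DT contains.
filtered = MR[MR["ID"].isin(DT["ID"])]

# merge pairs every matching MR row with every matching DT row.
merged = pd.merge(MR, DT, on="ID")
```

Here filtered has three rows (IDs 2, 2, 4), while merged has four, because the single MR row with ID 4 is paired with both DT rows. If you only want the MR rows back, isin is the safer choice.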
Apologies if this is contained in a previous answer but I've read this one: How to select rows from a DataFrame based on column values? and can't work out how to do what I need to do:
Suppose I have some pandas dataframe X and one of the columns is 'timestamp'. The entries are formatted like '2010-11-03 09:44:05'. I want to select just those rows that correspond to a specific day, for example, select just those rows for which the actual string in the timestamp column starts with '2010-11-03'. Is there a neat way to do this? Can I do it with a mask or Boolean indexing? Or should I just write a separate line to peel off the day from each entry and then select the rows? Bear in mind the dataframe is large, if that helps.
i.e. I want to write something like
X.loc[X['timestamp'].startswith('2010-11-03')]
or
mask = '2010-11-03' in X["timestamp"]
but these don't actually make any sense.
This should work:
X[X['timestamp'].str.startswith('2010-11-03')]
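Note that the .str accessor only applies when the column holds strings. If the column has been parsed to datetimes, comparing the date part is one alternative. A sketch on made-up rows (the value column is hypothetical):

```python
import pandas as pd

# Hypothetical data in the format described in the question.
X = pd.DataFrame({
    "timestamp": ["2010-11-03 09:44:05", "2010-11-03 17:01:00", "2010-11-04 00:00:01"],
    "value": [1, 2, 3],
})

# String column: match on the prefix.
by_prefix = X[X["timestamp"].str.startswith("2010-11-03")]

# Datetime column: truncate each timestamp to midnight and compare with the day.
ts = pd.to_datetime(X["timestamp"])
by_date = X[ts.dt.normalize() == pd.Timestamp("2010-11-03")]
```

Both selections return the first two rows here. On a large frame, converting once with pd.to_datetime and reusing the result is likely worthwhile if you filter by day repeatedly.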
How to keep rows in a DataFrame based on column unique pairs in Python?
I have a massive ocean dataset with over 300k rows. Given that some unique latitude-longitude pairs have multiple depths, I am only interested in keeping rows that contain unique sets of Latitude-Longitude-Year-Month.
The goal here is to know how many months of sampling for a given Latitude-Longitude location.
I tried using pandas conditions but the sets that I want are dependent on each other.
Any ideas on how to do this?
So far I've tried the following:
# keep Latitude, Longitude, Year and Month
glp = glp[['latitude', 'longitude', 'year', 'month']]
# only keep unique rows
glp.drop_duplicates(keep = False, inplace = True)
but it removes too many lines as I want those four variables to work together
The code you are looking for is .drop_duplicates()
Assuming your dataframe variable is df, you can use
df.drop_duplicates()
or pass a list of column names if you're only looking for unique values within specified columns
df.drop_duplicates(subset=column_list)  # column_list: list of the column names you want to compare
Edit:
If that's the case, I guess you could just do
df.groupby(column_list).first()  # first() takes the first values of the other columns
And then you could just use df.reset_index() if you want the unique sets as columns again.
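To tie this back to the goal in the question (months of sampling per location), here is a sketch on invented rows. The column names come from the question's code, and the depth column is an assumption:

```python
import pandas as pd

# Hypothetical data: the first two rows differ only in depth.
glp = pd.DataFrame({
    "latitude":  [10.0, 10.0, 10.0, 10.0, 20.0],
    "longitude": [50.0, 50.0, 50.0, 50.0, 60.0],
    "year":      [2001, 2001, 2001, 2002, 2001],
    "month":     [1, 1, 2, 1, 1],
    "depth":     [5, 10, 5, 5, 5],
})
keys = ["latitude", "longitude", "year", "month"]

# Keep one row per latitude-longitude-year-month combination.
unique_rows = glp.drop_duplicates(subset=keys)

# Count the sampled year-month combinations for each location.
months_sampled = (unique_rows.groupby(["latitude", "longitude"])
                  .size()
                  .reset_index(name="months_sampled"))
```

The default keep='first' retains one row from each duplicated group. With keep=False, as in the question's attempt, every row that has a duplicate is dropped, which would explain why too many lines disappeared.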
I am trying to read a dataset, which has few rows with uneven column count ('ragged'). I want to leave out those rows and read the rest of the rows. Is it possible in pandas instead of breaking the dataset into separate data frames and combining them?
If I understand your question, you have uneven columns, but want to drop any rows that don't have every column. If so, simply read the entire data set (read_csv) and then call dropna() on the dataframe. dropna() has a kwarg called 'how' which defaults to 'any', that is, a row (or column) is dropped if any of its items is NA. (Consider also passing 'inplace=True'.) See also: https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.dropna.html
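One caveat: rows with too few fields are padded with NaN, but rows with too many fields make read_csv raise an error by default. A sketch of handling both cases, assuming pandas 1.3 or newer (where on_bad_lines is available) and using an invented in-memory file:

```python
import io

import pandas as pd

# Hypothetical ragged data: line 3 is short one field, line 4 has one too many.
raw = "a,b,c\n1,2,3\n4,5\n6,7,8,9\n10,11,12\n"

# on_bad_lines="skip" drops the lines that have too many fields.
df = pd.read_csv(io.StringIO(raw), on_bad_lines="skip")

# Short lines were padded with NaN, so dropna removes them.
df = df.dropna()
```

After this only the two complete rows remain. Bear in mind that dropna also removes rows where a value was legitimately missing in the file, not just the ragged ones.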