I have created a pandas DataFrame from a NumPy array, but I want to know how to add values to specific columns horizontally (across a row) rather than vertically.
Let's assume I have this dataframe:
df = pd.DataFrame(data=data1)
How can I add [1.2, 3.5, 2.2] to the second row of the columns (-1,label), (-2,label), (0,label)?
Use DataFrame.loc:
# if you need to set the last 3 columns at index label 1
df.loc[1, df.columns[-3:]] = [1.2, 3.5, 2.2]
Or DataFrame.iloc:
# if you need to set the last 3 columns at the second row by position
df.iloc[1, -3:] = [1.2, 3.5, 2.2]
Or:
# if you need to set columns by name
cols = ['col1', 'col3', 'col5']
df.loc[1, cols] = [1.2, 3.5, 2.2]
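Since data1 isn't shown, a minimal runnable sketch with a hypothetical 5-column frame:
import numpy as np
import pandas as pd

# hypothetical stand-in for the original data1 array
df = pd.DataFrame(np.zeros((3, 5)), columns=['a', 'b', 'c', 'd', 'e'])

# write across the second row (positional index 1), last 3 columns
df.iloc[1, -3:] = [1.2, 3.5, 2.2]
print(df)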
I want to drop the first row of a dataframe subset which is a subset of the main dataframe main. The first row of the dataframe has index = 31, so when I try dropping the first row I get the following error:
>>> subset.drop(0, axis=1)
KeyError: '[0] not found in axis'
I want to perform this drop on multiple dataframes, so I cannot drop index 31 on every dataframe. Is it possible to drop the first row when the index isn't equal to 0?
Simplest is to select all rows except the first, by position:
df = df.iloc[1:]
Or with drop it is possible to pass the first index value, but if that value is duplicated in the index, all matching rows are removed:
df = df.drop(df.index[0])
Your solution tries to remove a column named 0, because axis=1 refers to columns:
subset.drop(0, axis=1)
Or, to drop the first row only when its index label isn't 0:
df = df if df.index[0] == 0 else df.iloc[1:]
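A quick sketch of the difference (the duplicated index labels here are hypothetical):
import pandas as pd

df = pd.DataFrame({'a': [1, 2, 3]}, index=[31, 31, 32])

print(df.iloc[1:])           # drops exactly one row, by position
print(df.drop(df.index[0]))  # drops BOTH rows labelled 31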
First create a dataframe with a regular index; this is the df that I want to resample using the index of df1:
df0 = pd.DataFrame(index=pd.date_range(start='2018-10-31 00:17:24', periods=50,freq='1s'))
I didn't know how to create a df with an irregular index, so I created a new dataframe (whose index I want to use) to resample df0:
df1 = pd.DataFrame(index=pd.date_range(start='2018-10-31 00:17:24', periods=50,freq='20s'))
For a minimal reproducible example, create a column with values between 0 and 1:
df0['dat'] = np.random.rand(len(df0))
I want to flag the rows where the dat column has a value of 0.5 or greater:
df0['target'] = 0
df0.loc[(df0['dat'] >= 0.5), 'target'] = 1
I then want to reindex df0 using the index of df1, but each row of the df0['target'] column should hold the sum of the values that fall in that window.
What I have tried is:
new_index = df1.index
df_new = df0.reindex(df0.index.union(new_index)).interpolate(method='linear').reindex(new_index).sum()
But this sum() collapses everything into a single value instead of summing per window.
IIUC:
try:
df_new = df0.reindex(df1.index.union(df0.index)).interpolate(method='linear').reset_index()
Finally make use of pd.Grouper() and groupby():
out = df_new.groupby(pd.Grouper(key='index', freq='1 min')).sum()
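Putting the pieces together as one runnable sketch (the answer's 1-minute windows are kept; swap in freq='20s' if the windows should match df1's spacing):
import numpy as np
import pandas as pd

df0 = pd.DataFrame(index=pd.date_range(start='2018-10-31 00:17:24', periods=50, freq='1s'))
df1 = pd.DataFrame(index=pd.date_range(start='2018-10-31 00:17:24', periods=50, freq='20s'))

df0['dat'] = np.random.rand(len(df0))
df0['target'] = 0
df0.loc[df0['dat'] >= 0.5, 'target'] = 1

# align df0 onto the union of both indexes, interpolate the gaps,
# then sum per time window; reset_index() turns the datetime index
# into a column named 'index' for pd.Grouper to key on
df_new = df0.reindex(df1.index.union(df0.index)).interpolate(method='linear').reset_index()
out = df_new.groupby(pd.Grouper(key='index', freq='1 min')).sum()
print(out)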
I have multiple dataframes on which I want to run this function, which drops unnecessary columns from a dataframe and returns the result:
def dropunnamednancols(df):
    """
    Drop any columns starting with 'Unnamed', and NaN columns.

    Args:
        df (DataFrame): dataframe whose columns are to be dropped
    """
    # first drop NaN columns
    df = df.loc[:, df.columns.notnull()]
    # then search for columns starting with 'Unnamed'
    df = df.loc[:, ~df.columns.str.contains('^Unnamed')]
    return df
Now I iterate over the list of dataframes: [df1, df2, df3]
dfsublist = [df1, df2, df3]
for index in enumerate(dfsublist):
    dfsublist[index] = dropunnamednancols(dfsublist[index])
Whereas the items of dfsublist have been changed, the original dataframes df1, df2, df3 still retain the unnecessary columns. How could I achieve this?
If I understand correctly, you want to apply a function to multiple dataframes separately.
The underlying issue is that your function returns a new dataframe, and you replace the stored dataframe in the list with that new one instead of modifying the original.
If you want to modify the original, you have to use the inplace=True parameters of the pandas functions. This is possible, but not recommended, as seen here.
Your code could therefore look like this:
def dropunnamednancols(df):
    """
    Drop any columns starting with 'Unnamed', and NaN columns.

    Args:
        df (DataFrame): dataframe whose columns are to be dropped
    """
    # use `or` (not `|`) so col.startswith is never called on None
    cols = [col for col in df.columns if (col is None) or (col.startswith('Unnamed'))]
    df.drop(cols, axis=1, inplace=True)
As example on sample data:
import pandas as pd
df_1 = pd.DataFrame({'a':[0,1,2,3], 'Unnamed':[9,8,7,6]})
df_2 = pd.DataFrame({'Unnamed':[9,8,7,6], 'b':[0,1,2,3]})
lst_dfs = [df_1, df_2]
[dropunnamednancols(df) for df in lst_dfs]
# df_1
# Out[55]:
# a
# 0 0
# 1 1
# 2 2
# 3 3
# df_2
# Out[56]:
# b
# 0 0
# 1 1
# 2 2
# 3 3
The reason is probably that you are using enumerate wrong. In your case, you just want the index, so what you should do is:
for index in range(len(dfsublist)):
    ...
Enumerate yields a tuple of an index and the actual value in your list. So in your code, the loop variable index will actually be assigned:
(0, df1) # First iteration
(1, df2) # Second iteration
(2, df3) # Third iteration
So either you use enumerate correctly and unpack the tuple:
for index, df in enumerate(dfsublist):
    ...
or you get rid of it altogether because you access the values with the index either way.
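A minimal sketch of the corrected loop, reusing the names from the question; the final line rebinds the original names so df1, df2, df3 themselves are updated:
dfsublist = [df1, df2, df3]
for index, df in enumerate(dfsublist):
    dfsublist[index] = dropunnamednancols(df)

df1, df2, df3 = dfsublist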
What I have is a list of dataframes.
What is important to note is that the shape of the dataframes differs, between 2 and 7 columns, and the columns are named 0 up to len(columns) - 1 (e.g. df1 has 5 columns named 0, 1, 2, 3, 4; df2 has 4 columns named 0, 1, 2, 3).
What I would like is to check whether any row in a column contains a certain string, and if so, delete that column.
list_dfs1=[df1,df2,df3...df100]
What I have done so far is below, and I get an error that column 5 is not in axis (it is there for some DFs):
for i, df in enumerate(list_dfs1):
    for index, row in df.iterrows():
        if np.where(row.str.contains("DEC")):
            df.drop(index, axis=1)
Any suggestions?
You could try:
for df in list_dfs1:
    for col in df.columns:
        # If you are unsure about column types, cast the column as string:
        df[col] = df[col].astype(str)
        # Check if the column contains the string of interest
        if df[col].str.contains("DEC").any():
            df.drop(columns=[col], inplace=True)
If you know that all columns are of type string, you don't have to actually do df[col] = df[col].astype(str).
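Dropping with inplace=True inside the loop is safe here because the for loop iterates over the Index object returned by df.columns when the loop starts; the drop gives df a new columns Index and leaves the one being iterated untouched.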
You can write a custom function that checks whether a column contains the pattern, using pd.Series.str.contains with pd.Series.any:
def func(s):
    return s.str.contains('DEC').any()
list_df = [df.loc[:, ~df.apply(func)] for df in list_dfs1]
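A quick demo of this on hypothetical data (column 1 of the first frame contains 'DEC' and is removed):
import pandas as pd

def func(s):
    return s.str.contains('DEC').any()

list_dfs1 = [pd.DataFrame({0: ['a', 'b'], 1: ['DEC', 'c']}),
             pd.DataFrame({0: ['x', 'y'], 1: ['z', 'w']})]

list_df = [df.loc[:, ~df.apply(func)] for df in list_dfs1]
print(list_df[0])  # only column 0 remains
print(list_df[1])  # unchanged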
I would take another approach: concatenate the list into one dataframe and then eliminate the columns where the string is found.
import pandas as pd
df = pd.concat(list_dfs1)
Let us say your condition was to eliminate any column with "DEC"
df.mask(df == "DEC").dropna(axis=1, how="any")
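Two caveats worth noting: df == "DEC" matches cells that are exactly "DEC", not substrings, and concatenating frames with different column counts introduces NaNs that dropna(axis=1, how="any") will then remove as well. A quick sketch on hypothetical data:
import pandas as pd

list_dfs1 = [pd.DataFrame({0: ['a', 'b'], 1: ['DEC', 'c']}),
             pd.DataFrame({0: ['x', 'y'], 1: ['z', 'w']})]

df = pd.concat(list_dfs1)
print(df.mask(df == "DEC").dropna(axis=1, how="any"))  # column 1 is dropped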
I want to compare 2 CSVs (A and B) and find the rows which are present in B but not in A, based only on specific columns.
I found a few answers, but they still don't give the result I expect.
Answer 1 :
df = new[~new['column1', 'column2'].isin(old['column1', 'column2'].values)]
This doesn't work. It works for a single column but not for multiple columns.
Answer 2 :
df = pd.concat([old, new])  # concat dataframes
df = df.reset_index(drop=True)  # reset the index
df_gpby = df.groupby(list(df.columns))  # group by all columns
idx = [x[0] for x in df_gpby.groups.values() if len(x) == 1]  # index of unique rows
final = df.reindex(idx)
This takes as an input specific columns and also outputs specific columns. I want to print the whole record and not only the specific columns of the record.
I tried this and it gave me the rows:
import pandas as pd
columns = [{Name of columns you want to use}]
new = pd.merge(A, B, how = 'right', on = columns)
col = new['{Any column from the first DataFrame which isn't in the list columns. You will probably have to add an '_x' at the end of the column name}']
col = col.dropna()
new = new[~new['{Any column from the first DataFrame which isn't in the list columns. You will probably have to add an '_x' at the end of the column name}'].isin(col)]
This will give you the rows based on the columns list. Sorry for the bad naming. If you want to rename the columns a bit too, here's the code for that:
for column in new.columns:
    if '_x' in column:
        new = new.drop(column, axis=1)
    elif '_y' in column:
        new = new.rename(columns={column: column[:column.find('_y')]})
Tell me if it works.
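To make the placeholders concrete, a minimal sketch with hypothetical frames A and B, keyed on columns 'c1' and 'c2', plus a column 'val' that is not part of the key:
import pandas as pd

A = pd.DataFrame({'c1': [1, 2], 'c2': ['a', 'b'], 'val': [10, 20]})
B = pd.DataFrame({'c1': [1, 3], 'c2': ['a', 'c'], 'val': [10, 30]})

columns = ['c1', 'c2']
new = pd.merge(A, B, how='right', on=columns)

# rows that matched A have a non-null 'val_x'; B-only rows do not
col = new['val_x'].dropna()
new = new[~new['val_x'].isin(col)]

# drop the A-side helper columns and restore B's column names
for column in new.columns:
    if '_x' in column:
        new = new.drop(column, axis=1)
    elif '_y' in column:
        new = new.rename(columns={column: column[:column.find('_y')]})
print(new)  # the row present in B but not in A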