I have the following DataFrame:
I need to swap the values of col2 and col3 with the values of col4 and col5. The values of col1 stay the same. The end result should look like the following:
Is there a way to do this without looping through the DataFrame?
Use rename in pandas
In [160]: df = pd.DataFrame({'A':[1,2,3],'B':[3,4,5]})
In [161]: df
Out[161]:
A B
0 1 3
1 2 4
2 3 5
In [167]: df.rename({'B':'A','A':'B'},axis=1)
Out[167]:
B A
0 1 3
1 2 4
2 3 5
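Note that rename only relabels the columns; the data stays where it is, so the renamed frame shows B before A. If the original left-to-right order matters, a minimal sketch (reusing the same A/B frame, with `swapped` as an illustrative name) is to select the columns again after renaming:

```python
import pandas as pd

df = pd.DataFrame({'A': [1, 2, 3], 'B': [3, 4, 5]})

# Swap the labels, then select the columns in the original order
# so that 'A' is listed first again.
swapped = df.rename(columns={'A': 'B', 'B': 'A'})[['A', 'B']]
print(swapped)
```

After this, column A holds what used to be in B and vice versa.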
This should do:
og_cols = df.columns
new_cols = [df.columns[0], *df.columns[3:], *df.columns[1:3]]
df = df[new_cols] # Sort columns in the desired order
df.columns = og_cols # Use original column names
If you want to swap the column values:
df.iloc[:, 1:3], df.iloc[:, 3:] = df.iloc[:,3:].to_numpy(copy=True), df.iloc[:,1:3].to_numpy(copy=True)
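The same value swap can also be written with column labels instead of positions. This is only a sketch: the question's data is not shown, so the frame below is hypothetical and assumes the five columns are named col1 to col5.

```python
import pandas as pd

# Hypothetical data; the question's original DataFrame is not shown.
df = pd.DataFrame({
    'col1': [1, 2],
    'col2': [10, 20],
    'col3': [30, 40],
    'col4': [50, 60],
    'col5': [70, 80],
})

# Copy the right-hand side into a NumPy array first, so the assignment
# is purely positional and does not align on column labels.
df[['col2', 'col3', 'col4', 'col5']] = (
    df[['col4', 'col5', 'col2', 'col3']].to_numpy(copy=True)
)
print(df)
```

Converting to a NumPy array matters here: assigning a DataFrame directly would align on labels and leave the values where they were.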
Pandas reindex could help:
cols = df.columns
#reposition the columns
df = df.reindex(columns=['col1','col4','col5','col2','col3'])
#pass in new names
df.columns = cols
When I run this code it drops the first row instead of the first column:
df.drop(axis=1, index=0)
How do you drop a column by index?
You can use df.columns[i] to denote the column. Example:
df.drop(df.columns[0], axis=1)
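The same idea extends to several positions at once. A small sketch, using a made-up three-column frame (the names a, b, c are only for illustration):

```python
import pandas as pd

df = pd.DataFrame({'a': [1, 2], 'b': [3, 4], 'c': [5, 6]})

# df.columns[[0, 2]] looks up the labels at positions 0 and 2,
# and drop then removes the columns with those labels.
result = df.drop(columns=df.columns[[0, 2]])

# Equivalent purely positional selection: keep only position 1.
same = df.iloc[:, [1]]
print(result)
```

Because drop works on labels, duplicated column names would all be removed together; the iloc form avoids that if it is a concern.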
Using the example
df = pd.DataFrame([
[1023.423,12.59595],
[1000,11.63024902],
[975,9.529815674],
[100,-48.20524597]], columns = ['col1', 'col2'])
col1 col2
0 1023.423 12.595950
1 1000.000 11.630249
2 975.000 9.529816
3 100.000 -48.205246
If you do df.drop(index=0), the row with index 0 is dropped:
col1 col2
1 1000.0 11.630249
2 975.0 9.529816
3 100.0 -48.205246
If you do df.drop('col1', axis=1), the column named 'col1' is dropped:
col2
0 12.595950
1 11.630249
2 9.529816
3 -48.205246
Remember that drop returns a new DataFrame and leaves the original unchanged, so assign the result back or use inplace=True where necessary.
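A short sketch of that last point, using the first two rows of the example frame (the variable names `without_col1`, `columns_before` and `columns_after` are only for illustration):

```python
import pandas as pd

df = pd.DataFrame({'col1': [1023.423, 1000.0],
                   'col2': [12.59595, 11.63024902]})

# drop returns a new DataFrame and leaves df untouched.
without_col1 = df.drop(columns='col1')
columns_before = list(df.columns)

# To keep the change, reassign the result (or pass inplace=True).
df = df.drop(columns='col1')
columns_after = list(df.columns)

print(columns_before, columns_after)
```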
I have the list below:
ColumnName = 'Emp_id','Emp_Name','EmpAGe'
When I try to read the above columns and assign them to a DataFrame, I get extra double quotes:
df = pd.dataframe(data,columns=[ColumnName])
columns=[ColumnName]
I am getting columns = ["'Emp_id','Emp_Name','EmpAGe'"]
How can I remove these extra double quotes when assigning the header to the data?
This code
ColumnName = 'Emp_id','Emp_Name','EmpAGe'
creates a tuple, not a list.
If you want three columns, one for each value in the tuple above, you need:
df = pd.DataFrame(data, columns=list(ColumnName))
The problem is how you define the columns for the pandas DataFrame.
The example below builds a correct DataFrame:
import pandas as pd
ColumnName1 = 'Emp_id','Emp_Name','EmpAGe'
df1 = [['A1','A1','A2'],['1','2','1'],['a0','a1','a3']]
df = pd.DataFrame(data=df1,columns=ColumnName1 )
df
Result:
Emp_id Emp_Name EmpAGe
0 A1 A1 A2
1 1 2 1
2 a0 a1 a3
Just for the sake of understanding, you can use col.replace to get the desired result.
Let's take an example:
>>> df
col1" col2"
0 1 1
1 2 2
Result:
>>> df.columns = [col.replace('"', '') for col in df.columns]
# df.columns = df.columns.str.replace('"', '') <-- can use this as well
>>> df
col1 col2
0 1 1
1 2 2
OR
>>> df = pd.DataFrame({ '"col1"':[1, 2], '"col2"':[1,2]})
>>> df
"col1" "col2"
0 1 1
1 2 2
>>> df.columns = [col.replace('"', '') for col in df.columns]
>>> df
col1 col2
0 1 1
1 2 2
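If the stray quotes only ever sit at the start or end of a label, str.strip is a slightly narrower alternative, since it leaves any quote in the middle of a name alone. A sketch with made-up labels:

```python
import pandas as pd

df = pd.DataFrame({'"col1"': [1, 2], 'col2"': [1, 2]})

# str.strip removes the given character from both ends of each label only.
df.columns = df.columns.str.strip('"')
print(list(df.columns))
```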
Your input is not quite right. ColumnName is already list-like and it should be passed on directly rather than wrapped in another list. In the latter case it would be interpreted as one single column.
df = pd.DataFrame(data, columns=ColumnName)
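A rough sketch of both situations follows. The rows in `data` are hypothetical, since the question does not show them. The second case is only an assumption about the asker's input: the reported header looks as if the names arrived as one quoted string rather than as a tuple.

```python
import pandas as pd

# Hypothetical rows; the question does not show its data.
data = [[1, 'Ann', 30], [2, 'Bob', 41]]

# Case 1: ColumnName is a tuple of names, so it is passed directly.
ColumnName = 'Emp_id', 'Emp_Name', 'EmpAGe'
df = pd.DataFrame(data, columns=ColumnName)

# Case 2 (assumption): the names came in as a single quoted string,
# so split on commas and strip the quotes from each part.
raw = "'Emp_id','Emp_Name','EmpAGe'"
names = [part.strip().strip("'") for part in raw.split(',')]
df2 = pd.DataFrame(data, columns=names)

print(list(df.columns), list(df2.columns))
```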
Assuming that I have a dataframe with the following values:
df:
col1 col2 value
1 2 3
1 2 1
2 3 1
I want to first group my DataFrame by the first two columns (col1 and col2) and then average the values of the third column (value). So the desired output would look like this:
col1 col2 avg-value
1 2 2
2 3 1
I am using the following code:
columns = ['col1','col2','avg']
df = pd.DataFrame(columns=columns)
df.loc[0] = [1,2,3]
df.loc[1] = [1,3,3]
print(df[['col1','col2','avg']].groupby('col1','col2').mean())
which gets the following error:
ValueError: No axis named col2 for object type <class 'pandas.core.frame.DataFrame'>
Any help would be much appreciated.
You need to pass a list of the columns to groupby. What you passed was interpreted as the axis param, which is why it raised an error:
In [30]:
columns = ['col1','col2','avg']
df = pd.DataFrame(columns=columns)
df.loc[0] = [1,2,3]
df.loc[1] = [1,3,3]
print(df[['col1','col2','avg']].groupby(['col1','col2']).mean())
           avg
col1 col2
1    2       3
     3       3
If you want to group by multiple columns, you should put them in a list:
columns = ['col1','col2','value']
df = pd.DataFrame(columns=columns)
df.loc[0] = [1,2,3]
df.loc[1] = [1,3,3]
df.loc[2] = [2,3,1]
print(df.groupby(['col1','col2']).mean())
Or slightly more verbose, to get the name 'avg' into your aggregated DataFrame. Named aggregation (pandas 0.25+) does this; the older nested-dict form, agg({'value': {'avg': np.mean}}), was removed in pandas 1.0:
import pandas as pd
columns = ['col1','col2','value']
# build the frame in one step so the columns get a numeric dtype
df = pd.DataFrame([[1,2,3], [1,3,3], [2,3,1]], columns=columns)
print(df.groupby(['col1','col2']).agg(avg=('value', 'mean')))
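To reproduce the flat table from the question (col1, col2, avg-value), one possible sketch uses the question's three rows with as_index=False. The column name 'avg-value' is taken from the desired output; it is not a valid Python identifier, so it is passed through dict unpacking.

```python
import pandas as pd

df = pd.DataFrame({'col1': [1, 1, 2],
                   'col2': [2, 2, 3],
                   'value': [3, 1, 1]})

# as_index=False keeps col1 and col2 as regular columns
# instead of moving them into the index.
result = df.groupby(['col1', 'col2'], as_index=False).agg(
    **{'avg-value': ('value', 'mean')}
)
print(result)
```

The group (1, 2) averages 3 and 1 to 2.0, and the group (2, 3) keeps its single value 1.0.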
Here is a very simple dataframe:
df = pd.DataFrame({'col1' :[1,2,3],
'col2' :[1,3,3] })
I'm trying to remove rows where there are duplicate values (e.g., row 3)
This doesn't work,
df = df[(df.col1 != 3 & df.col2 != 3)]
and the documentation is pretty clear about why, which makes sense.
But I still don't know how to delete that row.
Does anyone have any ideas? Thanks. Monica.
If I understand your question correctly, I think you were close.
Starting from your data:
In [20]: df
Out[20]:
col1 col2
0 1 1
1 2 3
2 3 3
And doing this:
In [21]: df = df[df['col1'] != df['col2']]
Returns:
In [22]: df
Out[22]:
col1 col2
1 2 3
What about:
In [43]: df = pd.DataFrame({'col1' :[1,2,3],
'col2' :[1,3,3] })
In [44]: df[df.max(axis=1) != df.min(axis=1)]
Out[44]:
col1 col2
1 2 3
[1 rows x 2 columns]
We want to remove rows whose values are the same in all columns; in other words, rows whose minimum and maximum are equal. This method works on a DataFrame with any number of columns. If we apply the above, we remove rows 0 and 2.
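Another way to express the same condition, sketched here on the question's frame, is to count the distinct values per row with nunique (`filtered` is just an illustrative name):

```python
import pandas as pd

df = pd.DataFrame({'col1': [1, 2, 3], 'col2': [1, 3, 3]})

# nunique(axis=1) counts the distinct values in each row;
# rows with only one distinct value are filtered out.
filtered = df[df.nunique(axis=1) > 1]
print(filtered)
```

By default nunique ignores NaN, so rows containing missing values may need separate handling.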
Any row whose values are all the same will have a standard deviation of zero. One way to filter them out is as follows:
import pandas as pd
import numpy as np
df = pd.DataFrame({'col1' :[1, 2, 3, np.nan],
'col2' :[1, 3, 3, np.nan]})
>>> df.loc[df.std(axis=1, skipna=False) > 0]
col1 col2
1 2.0 3.0
I have a DataFrame in pandas, for example:
Col1 Col2
A 1
B 2
C 3
Now I would like to add one more column named Col3 whose value is based on Col2: if Col2 > 1, then Col3 is 0; otherwise it is 1. For the example above, the output would be:
Col1 Col2 Col3
A 1 1
B 2 0
C 3 0
Any idea on how to achieve this?
Just do the opposite comparison, Col2 <= 1. This returns a boolean Series that is False where Col2 is greater than 1 and True otherwise. If you convert it to an integer dtype, True becomes 1 and False becomes 0:
df['Col3'] = (df['Col2'] <= 1).astype(int)
If you want a more general solution, where you can assign any number to Col3 depending on the value of Col2 you should do something like:
df['Col3'] = df['Col2'].map(lambda x: 42 if x > 1 else 55)
Or:
df['Col3'] = 0
condition = df['Col2'] > 1
df.loc[condition, 'Col3'] = 42
df.loc[~condition, 'Col3'] = 55
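For a two-way choice like this, numpy.where is another common option. A minimal sketch on the question's example data:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'Col1': ['A', 'B', 'C'], 'Col2': [1, 2, 3]})

# np.where(condition, value_if_true, value_if_false) works element-wise.
df['Col3'] = np.where(df['Col2'] > 1, 0, 1)
print(df)
```

The two constants can be swapped for any other values, such as the 42 and 55 used above.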
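The easiest way I found to add a column to a DataFrame was to use the "add" method. Strictly speaking, add performs element-wise addition; because the two frames here share no column names, the result contains the columns of both, and fill_value=0 fills the missing side with 0 before adding, so the values are numerically unchanged. Here's a snippet of code that also writes the result to a CSV file. Note that the "columns" argument sets the name of the column (which here is the same as the name of the np.array used as the source of the data).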
# now to create a PANDAS data frame
df = pd.DataFrame(data = FF_maxRSSBasal, columns=['FF_maxRSSBasal'])
# from here on, we use the trick of creating a new dataframe and then "add"ing it
df2 = pd.DataFrame(data = FF_maxRSSPrism, columns=['FF_maxRSSPrism'])
df = df.add( df2, fill_value=0 )
df2 = pd.DataFrame(data = FF_maxRSSPyramidal, columns=['FF_maxRSSPyramidal'])
df = df.add( df2, fill_value=0 )
df2 = pd.DataFrame(data = deltaFF_strainE22, columns=['deltaFF_strainE22'])
df = df.add( df2, fill_value=0 )
df2 = pd.DataFrame(data = scaled, columns=['scaled'])
df = df.add( df2, fill_value=0 )
df2 = pd.DataFrame(data = deltaFF_orientation, columns=['deltaFF_orientation'])
df = df.add( df2, fill_value=0 )
#print(df)
df.to_csv('FF_data_frame.csv')
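A more direct route, offered only as a sketch, is to build the frame from a dict of arrays or to assign new columns by name. The arrays below are hypothetical stand-ins for the ones in the snippet above and are assumed to have equal length.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-ins for the arrays used above.
FF_maxRSSBasal = np.array([0.1, 0.2, 0.3])
FF_maxRSSPrism = np.array([1.0, 2.0, 3.0])
scaled = np.array([10.0, 20.0, 30.0])

# Build the frame in one step from a dict of equal-length arrays.
df = pd.DataFrame({
    'FF_maxRSSBasal': FF_maxRSSBasal,
    'FF_maxRSSPrism': FF_maxRSSPrism,
})

# Attach another array as a new column afterwards.
df['scaled'] = scaled
print(df)
```

Unlike the add-based approach, this does no arithmetic, so the column dtypes stay exactly as they are in the source arrays.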