I have a DataFrame called v whose columns are ['self','id','desc','name','arch','rel']. When I rename it as follows, it won't let me drop columns, giving a "column not found in axis" error.
Case 1:

for i in range(len(v.columns)):
    # I'm trying to add a 'v_' prefix to all column names
    v.columns.values[i] = 'v_' + v.columns.values[i]

v.drop('v_self', 1)
# leads to an error:
KeyError: "['v_self'] not found in axis"
But if I do it as follows, it works fine:
Case 2:
v.columns = ['v_self','v_id','v_desc','v_name','v_arch','v_rel']
v.drop('v_self',1)
# no error
In both cases, inspecting the columns afterwards gives the same result:

v.columns
# both cases give
Index(['v_self', 'v_id', 'v_desc', 'v_name', 'v_arch', 'v_rel'], dtype='object')
I can't understand why Case 1 gives an error. Please help, thanks.
That's because .columns.values returns the underlying NumPy array, and you're not supposed to mutate it directly: pandas' label lookup does not pick up the change, so the index keeps answering to the old names. Assigning a whole new list to .columns is supported, though.
Try something like this:
import pandas

df = pandas.DataFrame(
    [
        {key: 0 for key in ["self", "id", "desc", "name", "arch", "rel"]}
        for _ in range(100)
    ]
)
# Add a v_ to every column
df.columns = [f"v_{column}" for column in df.columns]
# Drop one column
df = df.drop(columns=["v_self"])
To your "case 1":
You meet a bug (#38547) in pandas — “Direct renaming of 1 column seems to be accepted, but only old name is working”.
It means that after that "renaming", you may delete the first column
not by using
v.drop('v_self',1)
but using the old name
v.drop('self',1)`.
Of course, the better option is not using such a buggy renaming in the
current versions of pandas.
To rename columns by adding a prefix to every label, there is a direct DataFrame method for exactly this, .add_prefix():

v = v.add_prefix("v_")
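If you ever need something more flexible than a fixed prefix, .rename() also accepts a callable; a minimal sketch (the lambda is just an illustration):

v = v.rename(columns=lambda c: "v_" + c)  # same effect as add_prefix("v_")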
Related
I have a multiIndex dataframe created with pandas similar to this one:
nest = {'A1': dfx[['aa','bb','cc']],
        'B1': dfx[['dd']],
        'C1': dfx[['ee','ff']]}
reform = {(outerKey, innerKey): values
          for outerKey, innerDict in nest.items()
          for innerKey, values in innerDict.items()}
dfzx = pd.DataFrame(reform)
What I am trying to achieve is to add a new row at the end of the dataframe that contains a summary of the total for the three categories represented by the new index (A1, B1, C1).
I have tried with df.loc (what I would normally use in this case), but I get an error; similarly for iloc.
a1sum = dfzx['A1'].sum().to_list()
a1sum = sum(a1sum)
b1sum = dfzx['B1'].sum().to_list()
b1sum = sum(b1sum)
c1sum = dfzx['C1'].sum().to_list()
c1sum = sum(c1sum)
totalcat = a1sum, b1sum, c1sum
newrow = ['Total', totalcat]
newrow
dfzx.loc[len(dfzx)] = newrow
ValueError: cannot set a row with mismatched columns
#Alternatively
newrow2 = ['Total', a1sum, b1sum, c1sum]
newrow2
dfzx.loc[len(dfzx)] = newrow2
ValueError: cannot set a row with mismatched columns
How can I fix this? Or is there another function that would let me proceed?
Note: the DataFrame will eventually be written to an Excel file (I use ExcelWriter).
The kind of result I want to achieve is a table with a gray "SUM" row at the bottom, as in my screenshot.
I came up with a sort of solution on my own.
I created a separate DataFrame in pandas that contains the summary.
I used ExcelWriter to put both DataFrames on the same Excel worksheet.
Technically it would then be possible to style and format the data in Excel (xlsxwriter and styleframe seem to be popular modules for this); alternatively, one can do that manually.
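For reference, a minimal sketch of that workaround, assuming the dfzx frame from the question (the file and sheet names are made up):

import pandas as pd

# Per-category totals: sum every column, then aggregate by the outer
# level of the column MultiIndex (A1, B1, C1).
totals = dfzx.sum().groupby(level=0).sum()
summary = totals.to_frame(name='SUM').T

# Write both frames to the same worksheet, the summary a few rows below.
with pd.ExcelWriter('report.xlsx') as writer:
    dfzx.to_excel(writer, sheet_name='Sheet1')
    summary.to_excel(writer, sheet_name='Sheet1', startrow=len(dfzx) + 3)

(The original ValueError happened because the assigned row had fewer items than the frame has columns: dfzx has six columns, but newrow supplied only two values, and newrow2 only four.)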
A little new to Python, I am trying to merge two DataFrames with similar columns. The second DataFrame has one extra column, which I need appended in the new DataFrame.
(A detailed view of the DataFrames was attached as a screenshot.)
Code Used :
df3 = pd.merge(df, df1[['Id','Value_data']], on='Id')
df3 = pd.merge(df, df1[['Id','Value_data']], on='Id', how='outer')
The output CSV I get has these columns:

Unnamed: 0 Id_x Number_x Class_x Section_x Place_x Name_x Executed_Date_x Version_x Value PartDateTime_x Cycles_x Id_y Number_y Class_y Section_y Place_y Name_y Executed_Date_y Version_y Value_data PartDateTime_y Cycles_y
whereas I don't want _x and _y; I want the output to be:
Id Number Class Section Place Name Executed_Date Version Value Value_data PartDateTime Cycles
If I use df2 = pd.concat([df, df1], axis=0, ignore_index=True),
then I get values in the below-mentioned format in all columns except Value_data, which ends up as an empty column:
Id Number Class Section Place Name Executed_Date Version Value Value_data PartDateTime Cycles
Please help me with a solution for this. Thanks for your time.
I think the easiest path is to make a temporary DataFrame, let's call it df_temp2, which is a copy of df_2 with the differing column renamed, then append it to df_1:

df_temp2 = df_2.copy()
df_temp2 = df_temp2.rename(columns={'Value_data': 'Value'})

then

df_total = df_1.append(df_temp2)

This gives you a total DataFrame with all the rows of df_1 and df_2. The append() method supports a few arguments; check the docs for more details.
--- Added --------
One other possible approach is to use the pd.concat() function, which works in much the same way as the .append() method, like this:

result = pd.concat([df_1, df_temp2])

In your case the two approaches lead to similar performance. You can think of append() as a method written on top of pd.concat() but applied to a DataFrame itself.
Full docs about concat() here: pd.concat() docs
Hope this was helpful.
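For completeness, a minimal sketch using the column names from the question above (note that DataFrame.append() was deprecated and then removed in pandas 2.0, so pd.concat() is the future-proof spelling):

import pandas as pd

# Align the one differing column, then stack the rows.
df_temp2 = df_2.rename(columns={'Value_data': 'Value'})
df_total = pd.concat([df_1, df_temp2], ignore_index=True)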
import pandas as pd

df = pd.read_csv('C:/Users/output_2.csv')
df1 = pd.read_csv('C:/Users/output_1.csv')
df1_temp = df1[['Id','Cycles','Value_data']].copy()
df3 = pd.merge(df, df1_temp, on=['Id','Cycles'], how='inner')
df3 = df3.drop(columns="Unnamed: 0")
df3.to_csv('C:/Users/output.csv')
This worked
I am trying to sort a dataframe by a particular column: "Lat". However, although when I print out the column names, "Lat" clearly shows up, when I try to use it as the "by" parameter in the sort_values function, I get a KeyError. It doesn't matter which column name I use, I get a key error no matter what.
I have tried using different columns, sorting in place, and stripping the column names; nothing seems to work.
print(lights_df.columns.tolist())
lights_by_lat = lights_df.sort_values(axis='columns', by="Lat", kind="mergesort")
outputs:
['the_geom', 'OBJECTID', 'TYPE', 'Lat', 'Long']
KeyError: 'Lat'
^output from trying to sort
All you have to do is remove the axis argument. With axis='columns', sort_values sorts the columns, so by must be a row (index) label; "Lat" is a column label, hence the KeyError. The default axis=0 sorts the rows by a column:

lights_by_lat = lights_df.sort_values(by="Lat", kind="mergesort")

and you should be good.
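To see what each axis does, here is a tiny self-contained example (the frame is made up):

import pandas as pd

df = pd.DataFrame({'b': [3, 1, 2], 'a': [2, 3, 1]})
print(df.sort_values(by='a', kind='mergesort'))  # sorts the rows by column 'a'
print(df.sort_values(by=0, axis='columns'))      # sorts the columns by the values in row 0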
I have two dataframes: df1 and df2. I am iterating through df1 using iterrows, and for a particular field in each line, I am looking into df2 for the line that matches that field, and trying to pull out a corresponding value from that line in df2 in a SCALAR format. Every way I try to do this I end up with another dataframe or series and I can't use that value as a scalar. Here is my latest attempt:
for index, row in df1.iterrows():
    a = row[0]
    b = df2.loc[(df2['name'] == a), 'weight']
    c = row[1] - b  # this is where the error happens
    df1.set_value(index, 'wtdif', c)
I get an error because b in this case is not a scalar. If I print it out, here is an example of what it looks like (the 24 is the index of the row it was found in in df2). The other confusing part is that I can't index b in any way even though it is a Series (i.e. b[0] raises an error, as does b['weight'], etc.):

24    141.5
Name: weight, dtype: float64
You're getting an error because the only index label in b is 24. You could use that, or (more easily) index by location using

b.iloc[0]

This is a common gotcha for new pandas users. Indices are preserved when pulling data out of a Series or DataFrame; they do not, in general, run from 0 to N-1, where N is the length of the Series or the number of rows in the DataFrame.
This will help a bit http://pandas.pydata.org/pandas-docs/stable/indexing.html although I admit it was confusing for me at first as well.
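A tiny illustration of the gotcha, with made-up data:

import pandas as pd

df2 = pd.DataFrame({'name': ['ann', 'bob'], 'weight': [120.0, 141.5]}, index=[7, 24])
b = df2.loc[df2['name'] == 'bob', 'weight']
print(b.iloc[0])  # 141.5 -- first element by position; always works
print(b.loc[24])  # 141.5 -- by the preserved label
b[0]              # KeyError: with an integer index, 0 is treated as a label, and no label 0 exists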
Welp, I am still getting "IndexError: single positional indexer is out-of-bounds" when I make that change to my code.
Your suggestion makes a lot of sense though and does work, thanks for posting that. I wrote a quick test script to verify the fix, and it did in fact work so thumbs up for that. I'll post that code here in case anyone else is ever curious.
I'm missing something here, I'll just have to keep working on what is wrong and what my next question should be...
import pandas as pd

def foo(df1, df2):
    df1['D'] = 0
    for index, row in df1.iterrows():
        # Label-based access avoids surprises with column order (my original
        # positional row[2] only worked because older pandas sorted dict keys
        # alphabetically, which put 'name' at position 2 rather than 0).
        name = row['name']
        temp = df2.loc[df2['name'] == name, 'weight']
        x = row['weight'] + temp.iloc[0]
        df1.at[index, 'D'] = x  # set_value in the original; it was removed in pandas 1.0
    print(df1)

df1 = pd.DataFrame({'name': ['alex', 'bob', 'chris'], 'weight': [140, 150, 160], 'A': ['1', '2', '3'], 'B': ['4', '5', '6']})
df2 = pd.DataFrame({'name': ['alex', 'bob', 'chris'], 'weight': [180, 190, 200], 'C': ['1', '2', '3'], 'D': ['4', '5', '6']})
print(df1)
print(df2)
foo(df1, df2)
I have the following code which imports a CSV file. There are 3 columns and I want to set the first two of them to variables. When I set the second column to the variable "efficiency" the index column is also tacked on. How can I get rid of the index column?
df = pd.DataFrame.from_csv('Efficiency_Data.csv', header=0, parse_dates=False)
energy = df.index
efficiency = df.Efficiency
print(efficiency)
I tried using
del df['index']
after I set
energy = df.index
which I found in another post but that results in "KeyError: 'index' "
When writing to and reading from a CSV file, include the arguments index=False and index_col=False, respectively. For example:

To write:

df.to_csv(filename, index=False)

and to read from the CSV:

df = pd.read_csv(filename, index_col=False)

This prevents the issue, so you don't need to fix it later.
df.reset_index(drop=True, inplace=True)
DataFrames and Series always have an index. Although it displays alongside the column(s), it is not a column, which is why del df['index'] did not work.
If you want to replace the index with simple sequential numbers, use df.reset_index().
To get a sense for why the index is there and how it is used, see e.g. 10 minutes to Pandas.
You can set one of the columns as the index, for example when it is an "id" column. The default index will then be replaced by the column you chose:

df.set_index('id', inplace=True)
If your problem is the same as mine, where you just want to reset the column headers to 0 through the number of columns, do

df = pd.DataFrame(df.values)

EDIT: this is not a good idea if you have heterogeneous data types. Better to just use

df.columns = range(len(df.columns))
You can specify which column to use as the index of your CSV file by using the index_col parameter of from_csv.
If this doesn't solve your problem, please provide an example of your data.
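For example, assuming the energy readings are in the first column, as implied by the question's from_csv call (pd.read_csv is the modern replacement for the long-deprecated from_csv):

import pandas as pd

# Use the first CSV column as the index ...
df = pd.read_csv('Efficiency_Data.csv', index_col=0)

# ... or keep every column as data and get the default RangeIndex:
df = pd.read_csv('Efficiency_Data.csv', index_col=False)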
One thing that I do is df = df.reset_index() followed by df = df.drop(['index'], axis=1), which together are equivalent to df = df.reset_index(drop=True) when the index has no name.
To avoid creating the default index column, you can set index_col to False and keep header as zero. Here is an example of how you can do it:

recording = pd.read_excel("file.xls",
                          sheet_name="sheet1",
                          header=0,
                          index_col=False)

header=0 makes the first row of the sheet your column headers, which you can use later for calling the columns.
It works for me this way:

df = data.set_index("name of the column to use as the index")