I have a DataFrame that, after grouping, looks like this:
Now I want to move the row index (name) so that it becomes the first column. How can I do that?
I tried this:
gr.reset_index(drop=True)
but the result looks like this:
The name field now holds the count information.
Don't pass the drop parameter: as its name suggests, drop=True discards the index instead of turning it into a column. It is also probably better to rename the index first, since you already have a column called name:
gr.index.name = "company"
gr = gr.reset_index()
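As a minimal sketch of the difference, assuming a grouped frame whose index and only column are both called name (the data below is hypothetical, made up to stand in for the one in the question):

```python
import pandas as pd

# Hypothetical stand-in for the grouped frame: the index holds the group
# labels, and the single column (also called "name") holds the counts.
gr = pd.DataFrame(
    {"name": [2, 1]},
    index=pd.Index(["acme", "bolt"], name="name"),
)

# drop=True throws the index labels away, so only the counts remain.
dropped = gr.reset_index(drop=True)

# Renaming the index first avoids a clash with the existing "name" column,
# then reset_index() moves the labels into the first column.
flat = gr.copy()
flat.index.name = "company"
flat = flat.reset_index()

print(flat)
```

Here flat has the columns company and name, while dropped keeps only name.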
I have a Pandas Dataframe that looks like this:
First DF
and another that looks like this:
Second DF
Which I would like to combine, and get something that looks like this:
Third DF
Is there any way to do this?
I have tried merging and joining, but every time I do, the transaction # column is dropped. If possible, I would like to join on the name column but still keep the transaction column, so that every row that has a name also has a net volume. TIA
I have a rather messy dataframe in which I need to use 3 rows as multilevel column names.
This is my dataframe, and I need the rows at index 3, 4 and 5 to be my MultiIndex column names.
For example, 'MINERAL TOTAL' should be level 0 until the next item; 'TRATAMIENTO (ts)' should be level 1 until 'LEY Cu(%)' comes up.
What I actually need is to emulate what pandas.read_excel does when 'header' is specified with multiple rows.
Please help!
I am trying this, but no luck at all:
pd.DataFrame(data=df.iloc[3:, :].to_numpy(), columns=tuple(df.iloc[:3, :].to_numpy(dtype='str')))
You can pass a list of row indexes to the header argument and pandas will combine them into a MultiIndex.
import pandas as pd
df = pd.read_excel('ExcelFile.xlsx', header=[0,1,2])
By default, pandas will read in the top row as the sole header row. You can pass the header argument into pandas.read_excel() that indicates how many rows are to be used as headers. This can be either an int, or list of ints. See the pandas.read_excel() documentation for more information.
As you mentioned, you are unable to use pandas.read_excel(). However, if you already have a DataFrame of the data you need, you can use pandas.MultiIndex.from_arrays(). First you would need to build a list of the header rows, which in your case would look something like:
array = [df.iloc[0].values, df.iloc[1].values, df.iloc[2].values]
df.columns = pd.MultiIndex.from_arrays(array)
The only issue here is this includes the "NaN" values in the new MultiIndex header. To get around this, you could create some function to clean and forward fill the lists that make up the array.
Although not the prettiest, nor the most efficient, this could look something like the following (off the top of my head):
def forward_fill(iterable):
    # Replace each NaN with the last non-NaN value to its left.
    return pd.Series(iterable).ffill().to_list()
zero = forward_fill(df.iloc[0].to_list())
one = forward_fill(df.iloc[1].to_list())
two = forward_fill(df.iloc[2].to_list())
array = [zero, one, two]
df.columns = pd.MultiIndex.from_arrays(array)
You may also wish to drop the header rows (in this case rows 0, 1 and 2) and reindex the DataFrame.
df.drop(index=[0,1,2], inplace=True)
df.reset_index(drop=True, inplace=True)
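Putting those steps together, here is a rough end-to-end sketch. The frame raw and its values are hypothetical, invented only to show the mechanics, and it assumes the header levels sit in the first three rows:

```python
import numpy as np
import pandas as pd

# Hypothetical raw frame: the first three rows hold the header levels,
# with NaN where a merged cell would have continued to the right.
raw = pd.DataFrame([
    ["MINERAL TOTAL", np.nan, "OTHER"],
    ["TRATAMIENTO (ts)", "LEY Cu(%)", "TRATAMIENTO (ts)"],
    ["a", "b", "c"],
    [10, 0.5, 20],
    [11, 0.6, 21],
])

# Forward-fill each header row left to right, one list per level.
levels = [raw.iloc[i].ffill().to_list() for i in range(3)]

# Keep only the data rows, renumber them, and attach the new header.
data = raw.iloc[3:].reset_index(drop=True)
data.columns = pd.MultiIndex.from_arrays(levels)

print(data)
```

Building data from a slice rather than dropping rows in place leaves raw untouched, which can make it easier to retry with a different number of header rows.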
Since columns are also indices, you can just transpose, set index levels, and transpose back.
df.T.ffill().set_index([3, 4, 5]).T
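One caveat with the one-liner is that it forward-fills every transposed column, so gaps in the data rows get filled as well. A possible variant, sketched here on a hypothetical frame whose header rows carry the labels 3, 4 and 5, fills only those header columns:

```python
import numpy as np
import pandas as pd

# Hypothetical frame: rows labelled 3, 4, 5 are header levels and rows
# 6, 7 are data, including one genuine missing value.
raw = pd.DataFrame(
    [
        ["MINERAL TOTAL", np.nan, "OTHER"],
        ["TRATAMIENTO (ts)", "LEY Cu(%)", "TRATAMIENTO (ts)"],
        ["a", "b", "c"],
        [10, np.nan, 20],
        [11, 0.6, 21],
    ],
    index=[3, 4, 5, 6, 7],
)

t = raw.T.copy()
# Forward-fill only the header columns so gaps in the data are left alone.
t[[3, 4, 5]] = t[[3, 4, 5]].ffill()
# The header columns become index levels; transposing back turns them
# into the column MultiIndex.
result = t.set_index([3, 4, 5]).T

print(result)
```

The missing value in row 6 stays missing in result, whereas filling the whole transposed frame would have copied the 10 from the cell to its left.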
I have this dataframe
For some reason when I did this all_data = all_data.set_index(['Location+Type']) the name 'Location+Type' is now gone. How do I now set the first column of the dataframe to have the name 'Location+Type'?
By using set_index(['Location+Type']), you no longer have a Location+Type column; it is now the index of your dataframe, so you do not see the name in the column header when printing. If you wish to recover Location+Type as a column, use:
all_data = all_data.reset_index()
This will create a new index, while returning the original index (Location+Type) to a column. As the documentation states:
When we reset the index, the old index is added as a column, and a new sequential index is used:
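A small round-trip sketch of this, using a made-up all_data frame (the values and the count column are hypothetical):

```python
import pandas as pd

# Hypothetical data with the column from the question.
all_data = pd.DataFrame({"Location+Type": ["A-x", "B-y"], "count": [3, 5]})

# After set_index the name is not lost: it lives on the index.
indexed = all_data.set_index(["Location+Type"])
print(indexed.index.name)

# reset_index moves it back to a regular column and restores 0..n-1.
restored = indexed.reset_index()
print(list(restored.columns))
```

So the name is still available as indexed.index.name even while it is missing from indexed.columns.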
I have a task of reading each column of a Cassandra table into a dataframe to perform some operations. For example, if the table has 5 columns, I want to feed the data like this:
first column in the first iteration
first and second column in the second iteration to the same dataframe
and likewise.
I need generic code. Has anyone tried something similar? Please help me out with an example.
This will work:
import pandas as pd
# DataFrame.append was removed in pandas 2.0, so stack the slices with pd.concat.
slices = [df.iloc[:, 0:i+1] for i in range(len(df.columns))]
df2 = pd.concat(slices, sort=True)
Because the same column names repeat from one slice to the next, the frames are aligned on those names, so no column appears twice and each iteration just adds rows.
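If the goal is instead to process each widening set of columns separately rather than stack them, one possible sketch is a small generator. The helper name growing_column_slices and the sample table are hypothetical, and this assumes the data has already been loaded into a pandas DataFrame:

```python
import pandas as pd


def growing_column_slices(df):
    """Yield the first column, then the first two, and so on."""
    for i in range(1, len(df.columns) + 1):
        yield df.iloc[:, :i]


# Hypothetical table with three columns.
table = pd.DataFrame({"a": [1], "b": [2], "c": [3]})

# Record which columns each iteration sees.
widths = [list(part.columns) for part in growing_column_slices(table)]
print(widths)
```

Each iteration receives one more column than the previous one, whatever the number of columns in the table.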
You can extract the names from dataframe's schema and then access that particular column and use it the way you want to.
names = df.schema.names
columns = []
for name in names:
    columns.append(name)
# df[columns]  use it the way you want
I have a pandas DataFrame with duplicate column names (the column names are a, b, a, a, a).
Below is an example.
Is there any way I can change column name only for 3rd column from the left by specifying column location? I found that there is a way to change column name by making a new list. But I wanted to see if there is any way I can specify column location and change the name.
Below is what I want.
Since I am new to programming, I would appreciate any of your help!
Does this work?:
column_names = df.columns.to_list()  # a plain list copy, safe to modify
column_names[2] = 'Changed'
df.columns = column_names
df.rename(inplace=True,columns={'3col':'Changed'})
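To illustrate why position matters here, this is a short sketch with the duplicate names from the question (the row values are made up). Renaming through a name mapping touches every column called a, while assigning a modified list changes only the chosen position:

```python
import pandas as pd

names = ["a", "b", "a", "a", "a"]
df = pd.DataFrame([[1, 2, 3, 4, 5]], columns=names)

# A name-based mapping renames every column called "a".
renamed_all = df.rename(columns={"a": "Changed"})

# Position-based: copy the names, change index 2, assign the list back.
cols = df.columns.to_list()
cols[2] = "Changed"
df.columns = cols

print(list(renamed_all.columns))
print(list(df.columns))
```

Only the third column from the left ends up renamed in df.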