How to get a particular column of a DataFrame in pandas? - python

I have a DataFrame, call it df, and I want its first and second columns (as Series) in variables x and y.
I would normally do that by column name, like df['A'] or df['B'].
But the problem here is that the data itself is being used as the header, so the columns have no names. The header looks like 2.17, 3.145, and so on.
So my questions are:
a) How do I name the columns so that the data (which currently starts at the header) begins right after the names?
b) How do I get a particular column's data if I don't know its name, or it doesn't have one?
Thank you.

You might want to read the
documentation on indexing.
For what you specified in the question, you can use
x, y = df.iloc[:, 0], df.iloc[:, 1]
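As a side note, `df.iloc[:, 0]` returns a Series while `df.iloc[:, [0]]` returns a one-column DataFrame; a minimal sketch with made-up data:

```python
import pandas as pd

# Hypothetical unnamed data, like in the question
df = pd.DataFrame([[2.17, 3.145], [4.0, 5.0]])

x = df.iloc[:, 0]    # first column as a Series
y = df.iloc[:, [1]]  # second column as a one-column DataFrame

print(type(x).__name__)  # Series
print(type(y).__name__)  # DataFrame
```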

Set the names kwarg when reading the DataFrame (see the read_csv docs).
So instead of pd.read_csv('kndkma'), use pd.read_csv('kndkma', names=['a', 'b', ...]).

It is usually easier to name the columns when you read or create the DataFrame, but you can also name (or rename) the columns afterwards with something like:
df.columns = ['A','B', ...]
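For instance, a minimal sketch of renaming after the fact (the data here is made up):

```python
import pandas as pd

# Hypothetical data that was read without column names
df = pd.DataFrame([[1, 2], [3, 4]])

# Assign names after the fact
df.columns = ['A', 'B']

print(df['A'].tolist())  # [1, 3]
```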

Related

Updating element of dataframe while referencing column name and row number

I am coming from an R background and am used to being able to retrieve a value from a dataframe with syntax like:
r_dataframe$some_column_name[row_number]
And I can assign a value to the dataframe by the following syntax:
r_dataframe$some_column_name[row_number] <- some_value
or without the arrow:
r_dataframe$some_column_name[row_number] = some_value
For example:
#create R dataframe data
employee <- c('John Doe','Peter Gynn','Jolie Hope')
salary <- c(21000, 23400, 26800)
startdate <- as.Date(c('2010-11-1','2008-3-25','2007-3-14'))
employ.data <- data.frame(employee, salary, startdate)
#print out the name of this employee
employ.data$employee[2]
#assign the name
employ.data$employee[2] <- 'Some other name'
I'm now learning some Python, and from what I can see the most straightforward way to retrieve a value from a pandas dataframe is:
pandas_dataframe['SomeColumnName'][row_number]
I can see the similarities to R.
However, what confuses me is that when it comes to modifying/assigning the value in the pandas dataframe I need to completely change the syntax to something like:
pandas_dataframe.at[row_number, 'SomeColumnName'] = some_value
Reading this code requires a lot more concentration, because the column name and row number have swapped order.
Is this the only way to perform this pair of operations? Is there a more logical way to do this that respects the consistent use of column name and row number order?
If I understand correctly, you can, as #sammywemmy mentioned, use .loc and .iloc to get or change the value in any row and column.
If the order of your dataframe rows can change, define an index so that you can retrieve every row (data point) by its index even after the order has changed.
Like below:
df = pd.DataFrame(index=['a', 'b', 'c'], columns=['time', 'date', 'name'])
Now you can get the first row by its index:
df.loc['a'] # equivalent to df.iloc[0]
It turns out that pandas_dataframe.at[row_number, 'SomeColumnName'] can be used to modify AND retrieve information.
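A minimal sketch of .at doing both, reusing the employee data from the question:

```python
import pandas as pd

employ_data = pd.DataFrame({
    'employee': ['John Doe', 'Peter Gynn', 'Jolie Hope'],
    'salary': [21000, 23400, 26800],
})

# Retrieve (row label 1, since the default RangeIndex starts at 0)
name = employ_data.at[1, 'employee']   # 'Peter Gynn'

# Assign with exactly the same syntax
employ_data.at[1, 'employee'] = 'Some other name'
```

Note that .at takes the index label, not the positional row number; with the default RangeIndex the two happen to coincide.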

Adding column/s if not existing using Pandas

I am using Pandas with PsychoPy to reorder my results in a dataframe. The problem is that the dataframe is going to vary according to the participant performance. However, I would like to have a common dataframe, where non-existing columns are created as empty. Then the columns have to be in a specific order in the output file.
Let's suppose I have a dataframe from a participant with the following columns:
x = ["Error_1", "Error_2", "Error_3"]
I want the final dataframe to look like this:
x = x[["Error_1", "Error_2", "Error_3", "Error_4"]]
Where "Error_4" is created as an empty column.
I tried applying something like this (adapted from another question):
if "Error_4" not in x:
    x["Error_4"] = ""
x = x[["Error_1", "Error_2", "Error_3", "Error_4"]]
In principle this should work, but I have around 70 other columns for which I would need to do the same, and it doesn't seem practical to repeat it for each of them.
Do you have any suggestions?
I also tried creating a new dataframe with all the possible columns, e.g.:
y = ["Error_1", "Error_2", "Error_3", "Error_4"]
However, it is still not clear to me how to merge the dataframes x and y skipping columns with the same header.
Use DataFrame.reindex:
x = x.reindex(["Error_1", "Error_2", "Error_3", "Error_4"], axis=1, fill_value='')
Thanks for the reply, I followed your suggestion and adapted it. I post it here, since it may be useful for someone else.
First I create a dataframe y as I want my output to look like:
y = ["Error_1", "Error_2", "Error_3", "Error_4", "Error_5", "Error_6"]
Then, I get my actual output file df and modify it as df2, adding to it all the columns of y in the exact same order.
df = pd.DataFrame(myData)
columns = df.columns.values.tolist()
df2 = df.reindex(columns = y, fill_value='')
In this case, all the columns that are present in y but absent from df are added to df2.
However, let's suppose that df has a column "Error_7" that is absent from y. To keep track of such columns I simply apply merge and create a new dataframe df3:
df3 = pd.merge(df2,df)
df3.to_csv(filename+'UPDATED.csv')
The missing columns are going to be added at the end of the dataframe.
If you think this procedure might have drawbacks, or if there is another way to do it, let me know :)
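One alternative sketch (not from the thread; the names reuse the example above): reindex on the expected columns plus any extras, which keeps unexpected columns like "Error_7" without a separate merge step:

```python
import pandas as pd

# Hypothetical participant data with one unexpected column, Error_7
y = ["Error_1", "Error_2", "Error_3", "Error_4", "Error_5", "Error_6"]
df = pd.DataFrame({"Error_1": [1], "Error_3": [2], "Error_7": [3]})

# Expected columns first, then any extras that y does not know about
cols = y + [c for c in df.columns if c not in y]
df3 = df.reindex(columns=cols, fill_value='')

print(list(df3.columns))
# ['Error_1', ..., 'Error_6', 'Error_7']
```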

Rename last column in a dataframe passed along in method chain

How can I rename the last column in a dataframe that was passed along in a method chain? Think about the following example (the real use case is more complex). How can the rename function refer to the dataframe that it processes (which is different from the "table" dataframe)? Is there something like the following? Unfortunately "self" does not exist.
result = table.iloc[:,2:-1].rename(columns={self.columns[-1]: "Text"})
Use pipe():
result = table.iloc[:,2:-1].pipe(lambda df: df.rename(columns={df.columns[-1]: "Text"}))
I think that you can just do the following:
result = table.iloc[:, 2:-1]
result.columns = list(result.columns[:-1]) + ["Text"]
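A quick sanity check of the pipe approach on a made-up table (the column names here are assumptions, not from the question):

```python
import pandas as pd

table = pd.DataFrame(columns=['id', 'flag', 'a', 'b', 'note', 'extra'])

# pipe hands the sliced frame to the lambda, so df.columns[-1]
# refers to the slice's last column, not table's
result = (
    table.iloc[:, 2:-1]
    .pipe(lambda df: df.rename(columns={df.columns[-1]: "Text"}))
)

print(list(result.columns))  # ['a', 'b', 'Text']
```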

Using pd.DataFrame.replace with an apply function as the replace value

I have several dataframes that have mixed in some columns with dates in this ASP.NET format "/Date(1239018869048)/". I've figured out how to parse this into python's datetime format for a given column. However I would like to put this logic into a function so that I can pass it any dataframe and have it replace all the dates that it finds that match a regex using pd.Dataframe.replace.
something like:
def pretty_dates(df):
    ...  # messy logic here

df.replace(to_replace=r'/Date\((\d+)\)/', value=pretty_dates(df), regex=True)
The problem with this is that the df being passed to pretty_dates is the whole dataframe, not just the cell that needs to be replaced.
So the concept I'm trying to figure out is if there is a way that the value that should be replaced when using df.replace can be a function instead of a static value.
Thank you so much in advance
EDIT
To try to add some clarity: I have many columns in a dataframe, over a hundred, that contain this date format. I would prefer not to list out every single column that has a date. Is there a way to apply the function to clean my dates across all the columns in my dataset? So I do not want to clean one column but all the hundreds of columns of my dataframe.
I'm sure you can use regex to do this in one step, but here is how to apply it to the whole column at once:
df = pd.Series(['/Date(1239018869048)/',
                '/Date(1239018869048)/'], dtype=str)
df = df.str.replace(r'/Date\(', '', regex=True)
df = df.str.replace(r'\)/', '', regex=True)
print(df)
0    1239018869048
1    1239018869048
dtype: object
As far as I understand, you need to apply a custom function to selected cells in a specified column. I hope the following example helps:
import pandas as pd
df = pd.DataFrame({'x': ['one', 'two', 'three']})
selection = df.x.str.contains('t', regex=True) # put your regexp here
df.loc[selection, 'x'] = df.loc[selection, 'x'].map(lambda x: x+x) # do some logic instead
You can apply this procedure to all columns of the df in a loop:
for col in df.columns:
    selection = df.loc[:, col].str.contains('t', regex=True)  # put your regexp here
    df.loc[selection, col] = df.loc[selection, col].map(lambda x: x + x)  # do some logic instead
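Putting the pieces together for the EDIT above, here is a hedged sketch (column names and data made up) that converts every /Date(ms)/ cell across all columns into a Timestamp, leaving other cells untouched:

```python
import re
import pandas as pd

# Hypothetical dataframe mixing ASP.NET dates and ordinary strings
df = pd.DataFrame({
    'created': ['/Date(1239018869048)/', '/Date(1239018869048)/'],
    'name': ['a', 'b'],
})

pattern = re.compile(r'/Date\((\d+)\)/')

def pretty_date(cell):
    """Turn '/Date(1239018869048)/' into a Timestamp; pass other cells through."""
    if isinstance(cell, str):
        m = pattern.fullmatch(cell)
        if m:
            return pd.to_datetime(int(m.group(1)), unit='ms')
    return cell

# Apply the converter element-wise across every column
df = df.apply(lambda col: col.map(pretty_date))
```

The per-element map avoids replace entirely: instead of substituting a static value, each matching cell is routed through a function.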

Unable to rename Series column

I am unable to rename the column of a series:
tabla_paso4
Date
2015-06-29    0.003559
2015-09-18    0.025024
2015-08-24    0.037058
2014-11-20    0.037088
2014-10-02    0.037098
Name: decay, dtype: float64
I have tried:
tabla_paso4.rename('decay_acumul')
tabla_paso4.rename(columns={'decay':'decay_acumul'})
I already had a look at the possible duplicate, but I don't know why, although applying:
tabla_paso4.rename(columns={'decay':'decay_acumul'},inplace=True)
returns the series like this:
Date
2015-06-29    0.003559
2015-09-18    0.025024
2015-08-24    0.037058
2014-11-20    0.037088
2014-10-02    0.037098
dtype: float64
It looks like your tabla_paso4 - is a Series, not a DataFrame.
You can make a DataFrame with named column out of it:
new_df = tabla_paso4.to_frame(name='decay_acumul')
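For example, reconstructing the series from the question:

```python
import pandas as pd

# Reconstruction of tabla_paso4 from the question (dates as index labels)
tabla_paso4 = pd.Series([0.003559, 0.025024],
                        index=['2015-06-29', '2015-09-18'],
                        name='decay')

# to_frame promotes the Series to a one-column DataFrame with the given name
new_df = tabla_paso4.to_frame(name='decay_acumul')

print(list(new_df.columns))  # ['decay_acumul']
```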
Try
tabla_paso4 = tabla_paso4.rename('decay_acumul')
or
tabla_paso4.rename('decay_acumul', inplace=True)
Since tabla_paso4 is a Series rather than a DataFrame, rename takes the new name directly (there is no columns kwarg for a Series). What you were doing wrong earlier is that the renamed series was returned but never assigned, because you missed the inplace=True part.
I hope this helps!
