New column in DataFrame from other columns AND rows - python

I want to create a new column, V, in an existing DataFrame, df. I would like the value of the new column to be the difference between the value in the 'x' column in that row, and the value of the 'x' column in the row below it.
As an example, in the picture below, I want the value of the new column to be
93.244598 - 93.093285 = 0.151313.
I know how to create a new column based on existing columns in Pandas, but I don't know how to reference other rows using this method. Is there a way to do this that doesn't involve iterating over the rows in the dataframe? (since I have read that this is generally a bad idea)

You can use pandas.DataFrame.shift for this.
The last row has no row below it to subtract, so its value in the new column will be NaN.
df['temp_x'] = df['x'].shift(-1)
df['new_col'] = df['x'] - df['temp_x']
Or as a one-liner:
df['new_col'] = df['x'] - df['x'].shift(-1)
The column new_col will then contain the expected data.

A more direct solution is to use diff, which computes the same row-to-row difference in one call:
df['new'] = df['x'].diff(-1)
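A minimal sketch comparing the two approaches. Only the first two values come from the question; the third value and the column names via_shift and via_diff are made up for illustration.

```python
import pandas as pd

# Hypothetical sample data: the first two values are from the question,
# the third is invented so that there is a "last row" to inspect.
df = pd.DataFrame({"x": [93.244598, 93.093285, 92.5]})

# Explicit form: subtract the value found in the row below.
df["via_shift"] = df["x"] - df["x"].shift(-1)

# Shorthand: diff(-1) computes x[i] - x[i+1] directly.
df["via_diff"] = df["x"].diff(-1)

print(df)
```

Both columns should hold the same numbers, with NaN in the last row because there is nothing below it to subtract.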

Related

Averaging data of dataframe columns based on redundancy of another column

I want to average the data of one column in a pandas dataframe if the rows share the same 'id', which is stored in another column of the same dataframe. To make it simple, I have:
and I want:
It is clear here that the elements of the 'nx' and 'ny' columns have been averaged wherever their value of 'nodes' was the same. The column 'maille', on the other hand, has to remain untouched.
I'm trying with groupby but couldn't manage till now to keep the column 'maille' as it is.
Any idea?
Use GroupBy.transform, passing the names of the columns to aggregate as a list, and assign the result back:
cols = ['nx','ny']
df[cols] = df.groupby('nodes')[cols].transform('mean')
print(df)
Another option is DataFrame.update:
df.update(df.groupby('nodes')[cols].transform('mean'))
print(df)
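Since the question's tables were not preserved, here is a small sketch with invented data showing what transform('mean') does. The values are assumptions; only the column names come from the question.

```python
import pandas as pd

# Hypothetical data: rows sharing a 'nodes' value should get averaged nx/ny.
df = pd.DataFrame({
    "nodes": [1, 1, 2, 3, 3],
    "maille": [10, 11, 12, 13, 14],
    "nx": [1.0, 3.0, 5.0, 2.0, 4.0],
    "ny": [0.0, 2.0, 7.0, 6.0, 8.0],
})

cols = ["nx", "ny"]
# transform returns a result aligned with the original rows,
# so it can be assigned straight back without touching 'maille'.
df[cols] = df.groupby("nodes")[cols].transform("mean")

print(df)
```

Unlike an aggregation, transform keeps one output row per input row, which is why 'maille' and the row count stay as they were.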

Creating a Pandas column based on a value in a specific row and column with .map or similar

I have a use case where I need to fill a new pandas column with the contents of a specific cell in the same table. There are 60 countries in Europe, so I need to fill a shared currency column with the contents of one country's currency (as an example only).
I need an SQL "Where" clause for Pandas - that:
1. Searches the dataframe rows for the single occurrence of "Britain" in column "country"
2. Returns a single, unique value "pound" from df['currency'].
3. Creates a new column filled with just this value = string "pound"
w['Euro_currency'] = w['Euro_currency'].map(w.loc["country"]=="Britain"["currency"])
# [Britain][currency] - contains the value - "Pound"
When this works correctly, every row in the new column 'Euro_currency' contains the value "pound"
How about taking the value from that cell and creating a new column from it, as below? Note that this lookup assumes "country" is the index of w:
p = w.loc["Britain"]["currency"]
w['Euro_currency'] = p
Does this work for you?
Thanks for help. I found this answer by #anton-protopopov at extract column value based on another column pandas dataframe
currency_value = df.loc[df['country'] == 'Britain', 'currency'].iloc[0]
df['currency_britain'] = currency_value
#anderson-zhu also mentioned that .item() would work as well (it raises a ValueError unless exactly one row matches):
currency_value = df.loc[df['country'] == 'Britain', 'currency'].item()
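A self-contained sketch of the boolean-mask lookup, using an invented three-row table. The data and the empty-match check are assumptions added for illustration.

```python
import pandas as pd

# Hypothetical table; 'country' is an ordinary column here, not the index.
df = pd.DataFrame({
    "country": ["France", "Britain", "Germany"],
    "currency": ["euro", "pound", "euro"],
})

# Boolean mask plays the role of SQL's WHERE clause.
matches = df.loc[df["country"] == "Britain", "currency"]

# Guard against a missing country before taking the first match.
if matches.empty:
    raise KeyError("no row with country == 'Britain'")

currency_value = matches.iloc[0]

# Assigning a scalar broadcasts it to every row of the new column.
df["currency_britain"] = currency_value

print(df)
```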

in Pandas, how do I use a variable name to represent a row index to obtain a string that can be used as a header row?

I'm trying to clean an excel file that has some random formatting. The file has blank rows at the top, with the actual column headings at row 8. I've gotten rid of the blank rows, and now want to use the row 8 string as the true column headings in the dataframe.
I use this code to get the position of the column headings by searching for the string 'Destination' in the whole dataframe, and then take the location of the True value in the Boolean mask to get the list for renaming the column headers:
boolmsk=df.apply(lambda row: row.astype(str).str.contains('Destination').any(), axis=1)
print(boolmsk)
hdrindex=boolmsk.index[boolmsk == True].tolist()
print(hdrindex)
hdrstr=df.loc[7]
print(hdrstr)
df2=df.rename(columns=hdrstr)
However when I try to use hdrindex as a variable, I get errors when the second dataframe is created (ie when I try to use hdrstr to replace column headings.)
boolmsk=df.apply(lambda row: row.astype(str).str.contains('Destination').any(), axis=1)
print(boolmsk)
hdrindex=boolmsk.index[boolmsk == True].tolist()
print(hdrindex)
hdrstr=df.loc[hdrindex]
print(hdrstr)
df2=df.rename(columns=hdrstr)
How do I use a variable to specify an index, so that the resulting list can be used as column headings?
I assume the marker for the actual header row in the dataframe is the string "Destination". Let's find where it is:
start_tag = df.eq("Destination").any(axis=1)
Keep the index of the first row that contains "Destination" for further use:
start_row = df.loc[start_tag].index.min()
Using that index, get the list of values in the header row (this assumes the frame has a default integer index, so label and position coincide):
new_col_names = df.iloc[start_row].values.tolist()
Then assign the new column names to the dataframe:
df.columns = new_col_names
From here you can work with a new dataframe that has the actual column names and proper indexing:
df2 = df.iloc[start_row+1:].reset_index(drop=True)
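The steps above can be sketched end to end on a small invented frame. One likely reason the original attempt failed is that hdrindex is a list, and df.loc with a list returns a DataFrame rather than a single row (Series). Picking one scalar label avoids that. The sample data and the name is_header are assumptions.

```python
import pandas as pd

# Hypothetical frame mimicking a sheet with junk rows above the real header.
df = pd.DataFrame([
    [None, None, None],
    ["report", None, None],
    ["Origin", "Destination", "Cost"],
    ["Paris", "Rome", 120],
    ["Oslo", "Bern", 200],
])

# True for every row that contains the exact string "Destination".
is_header = df.eq("Destination").any(axis=1)

# Take a single scalar label, not a list of labels.
start_row = is_header[is_header].index[0]

# A scalar label makes .loc return one row as a Series.
header = df.loc[start_row].tolist()

# Keep only the rows below the header; relies on the default RangeIndex
# so that the label can be used as a position.
df2 = df.iloc[start_row + 1:].reset_index(drop=True)
df2.columns = header

print(df2)
```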

How to add values to a new column in pandas dataframe?

I want to create a new named column in a Pandas dataframe, insert first value into it, and then add another values to the same column:
Something like:
import pandas
df = pandas.DataFrame()
df['New column'].append('a')
df['New column'].append('b')
df['New column'].append('c')
etc.
How do I do that?
If I understand correctly, you want to append values to an existing column in a pandas DataFrame. A DataFrame has to keep a matrix-like shape, so every column has the same number of rows. What you can do is add a column with a default value and then update it row by row:
for index, row in df.iterrows():
    df.at[index, 'new_column'] = new_value  # new_value: whatever belongs in this row
Don't do it, because it's slow:
updating an empty frame a-single-row-at-a-time. I have seen this method used WAY too much. It is by far the slowest. It is probably common place (and reasonably fast for some python structures), but a DataFrame does a fair number of checks on indexing, so this will always be very slow to update a row at a time. Much better to create new structures and concat.
It is better to build a list of the data and create the DataFrame with the constructor:
vals = ['a','b','c']
df = pandas.DataFrame({'New column':vals})
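If the values arrive one at a time, a sketch of the same idea is to append to a plain Python list first and build the DataFrame once at the end. The loop source below is a stand-in for wherever the values really come from.

```python
import pandas as pd

vals = []

# Stand-in for any process that produces values one by one.
for item in ["a", "b", "c"]:
    vals.append(item)  # list.append is cheap, unlike growing a DataFrame

# Build the frame once, after all values are collected.
df = pd.DataFrame({"New column": vals})

print(df)
```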
If you need to fill the newly created column with random values, you could also use NumPy (the upper bound 9 is exclusive):
import numpy as np
df['new_column'] = np.random.randint(1, 9, len(df))

Cleaning Data: Replacing Current Column Values with Values mapped in Dictionary

I have been trying to wrap my head around this for a while now and have yet to come up with a solution.
My question is how do I change current column values in multiple columns based on the column name if criteria is met???
I have survey data which has been read in as a pandas csv dataframe:
import pandas as pd
df = pd.read_csv("survey_data")
I have created a dictionary with column names and the values I want in each column if the current column value is equal to 1. Each column contains 1 or NaN. Basically any column within the data frame ending in '_SA' =5, '_A' =4, '_NO' =3, '_D' =2 and '_SD' stays as the current value 1. All of the 'NaN' values remain as is. This is the dictionary:
op_dict = {
'op_dog_SA':5,
'op_dog_A':4,
'op_dog_NO':3,
'op_dog_D':2,
'op_dog_SD':1,
'op_cat_SA':5,
'op_cat_A':4,
'op_cat_NO':3,
'op_cat_D':2,
'op_cat_SD':1,
'op_fish_SA':5,
'op_fish_A':4,
'op_fish_NO':3,
'op_fish_D':2,
'op_fish_SD':1}
I have also created a list of the columns within the data frame I would like to be changed if the current column value = 1 called [op_cols]. Now I have been trying to use something like this that iterates through the values in those columns and replaces 1 with the mapped value in the dictionary:
for i in df[op_cols]:
    if i == 1:
        df[op_cols].apply(lambda x: op_dict.get(x,x))
df[op_cols]
It is not spitting out an error but it is not replacing the 1 values with the corresponding value from the dictionary. It remains as 1.
Any advice/suggestions on why this would not work or a more efficient way would be greatly appreciated
So if I understand your question, you want to replace every 1 in a column with 1, 2, 3, 4 or 5, depending on the column name?
I think all you need to do is iterate through your list and multiply each column by the value your dict returns:
for col in op_cols:
    df[col] = df[col]*op_dict[col]
This does what you describe and is far faster than replacing every value. NaNs will still be NaNs; you could handle those in the loop with fillna if you like.
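A small runnable sketch of the multiplication approach, using a cut-down version of the dictionary and invented survey rows. It works because every cell is either 1 or NaN, so multiplying by the mapped value yields that value or stays NaN.

```python
import numpy as np
import pandas as pd

# Hypothetical survey data: each cell is either 1 or NaN.
df = pd.DataFrame({
    "op_dog_SA": [1, np.nan, np.nan],
    "op_dog_A": [np.nan, 1, np.nan],
    "op_dog_SD": [np.nan, np.nan, 1],
})

# Cut-down version of the question's dictionary.
op_dict = {"op_dog_SA": 5, "op_dog_A": 4, "op_dog_SD": 1}
op_cols = list(op_dict)

for col in op_cols:
    # 1 * value gives the mapped value; NaN * value stays NaN.
    df[col] = df[col] * op_dict[col]

print(df)
```

The original loop had no effect partly because iterating over a DataFrame yields column names, not cell values, and because the result of apply was never assigned back.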
