Replace values in pandas dataframe based on column names - python

I would like to replace the values in a pandas dataframe from another series based on the column names. I have the following dataframe:
Y2000  Y2001  Y2002  Y2003  Y2004  Item  Item Code
34     43     0      0      25     Test  Val
and I have another series:
Y2000 41403766
Y2001 45283735
Y2002 47850796
Y2003 38639101
Y2004 45226813
How do I replace the values in the first dataframe based on the values in the 2nd series?
--MORE EDITS:
To recreate the problem, the code and data are here: umd.box.com/s/hqd6oopj6vvp4qvpwnj8r4lm3z7as4i3
To run this code:
Replace data_dir in config_rotations.txt with the path to the input directory i.e. where the files are kept
Replace out_dir in config_rotations.txt with whatever output path you want
Run python code\crop_stats.py. The problem is in line 133 of crop_stats.py
--EDIT:
Based on @Andy's query, here's the result I want:
Y2000     Y2001     Y2002     Y2003     Y2004     Item  Item Code
41403766  45283735  47850796  38639101  45226813  Test  Val
I tried
df_a.replace(df_b)
but this does not change any value in df_a

You can construct a df from the series after reshaping and overwrite the columns:
In [85]:
df1[s.index] = pd.DataFrame(columns = s.index, data = s.values.reshape(1,5))
df1
Out[85]:
      Y2000     Y2001     Y2002     Y2003     Y2004  Item  Item Code
0  41403766  45283735  47850796  38639101  45226813  Test        Val
This uses the series' index values to sub-select the columns from the df, and then constructs a df from the same series; here we have to reshape the array to make a single-row df.
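For reference, here is a minimal, self-contained version of the above; df1 and s are reconstructed from the question's sample data, so treat it as a sketch rather than the asker's actual setup:
import pandas as pd
df1 = pd.DataFrame({'Y2000': [34], 'Y2001': [43], 'Y2002': [0], 'Y2003': [0], 'Y2004': [25],
                    'Item': ['Test'], 'Item Code': ['Val']})
s = pd.Series({'Y2000': 41403766, 'Y2001': 45283735, 'Y2002': 47850796,
               'Y2003': 38639101, 'Y2004': 45226813})
# Overwrite just the year columns named in the series' index:
df1[s.index] = pd.DataFrame(columns=s.index, data=s.values.reshape(1, len(s)))
print(df1)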
EDIT
The reason my code above won't work on your real code is that, firstly, when assigning you can't do this:
df.loc[(df['Country Code'] == replace_cnt) & (df['Item'] == crop)][s.index]
This is called chained indexing and raises a warning, see the docs.
So to correct this you can put the columns inside the []:
df.loc[(df['Country Code'] == replace_cnt) & (df['Item'] == crop),s.index]
Additionally, pandas tries to align along index values and column names; if they don't match, you'll get NaN values. You can get around this by calling .values to get a numpy array, which becomes anonymous data with no index or column labels; so long as the data shape is broadcastable, it will do what you want:
df.loc[(df['Country Code'] == replace_cnt) & (df['Item'] == crop),s.index] = pd.DataFrame(columns=s.index, data=s.values.reshape(1, len(s.index))).values
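A toy illustration of the alignment pitfall and the .values fix (the frames below are made up for demonstration, not taken from the question's data):
import pandas as pd
target = pd.DataFrame({'Y2000': [34], 'Y2001': [43]}, index=[7])
source = pd.DataFrame({'Y2000': [41403766], 'Y2001': [45283735]}, index=[0])
# Index 7 and index 0 don't align, so this assignment produces NaN:
target[['Y2000', 'Y2001']] = source
print(target)  # both columns are now NaN
# Stripping the labels with .values sidesteps alignment entirely:
target[['Y2000', 'Y2001']] = source.values
print(target)  # the values are copied as expected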

Related

Creating a subset from a dataframe based on a condition from another array

I have a numeric np array which I want to use as a condition/filter over column number 4 of a dataframe (df) to extract a subset of the dataframe (sale_data_sub). However, I am getting an empty sale_data_sub (with just the names of all the columns and no rows) as the result of this code:
sale_data_sub = df.loc[df[4].isin(sale_condition_arr)].values
sale_condition_arr is a numpy array
df is the original dataframe with 100 columns
sale_data_sub is the desired sub-dataframe
Sorry that I didn't include a working sample.
The issue is that your df dataframe doesn't have headers assigned.
try:
#give your dataframe a header:
df = df.set_axis([str(i) for i in list(range(len(df.columns)))], axis='columns')
#then proceed to your usual work with df:
sale_data_sub = df.loc[df["4"].isin(sale_condition_arr)].values #be careful, it's df["4"] not df[4]
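For what it's worth, here is a minimal runnable version of the fix; the df and sale_condition_arr below are made-up stand-ins for the question's objects:
import numpy as np
import pandas as pd
df = pd.DataFrame(np.random.randint(0, 10, size=(6, 8)))  # headerless: columns are the integers 0..7
sale_condition_arr = np.array([2, 5])
# Rename the integer columns to strings:
df = df.set_axis([str(i) for i in range(len(df.columns))], axis='columns')
# Filter on column "4":
sale_data_sub = df.loc[df["4"].isin(sale_condition_arr)]
print(sale_data_sub)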

Set an index from the first level of a hierarchical index column, PANDAS

I have a dataframe which is the result of a concatenation of dataframes. I use the "keys=" option for the title of each block when I export to Excel.
Now I want to define ID2 as an index together with ID (to have a MultiIndex).
I tried to use .reset_index(), but it didn't work the way I want.
I have: [screenshot of the current dataframe, with ID2 as a column]
I want: [screenshot of the desired result, with ID2 in the index]
You can extract your indexes to lists, create a MultiIndex object, and then simply define the index of your DataFrame with this MultiIndex. This works on my side (pandas imported as pd).
Let's assume your initial DataFrame is this one (just a smaller version of what you have):
df = pd.DataFrame({'ID2': ['b','c','b'], 'name' : ['tomato', 'pizza', 'kebap']}, index = [1,2,4])
Then we extract the final indices from the index and from the ID2 column of the dataframe in order to build a list of tuples, with which we create the MultiIndex via the pandas.MultiIndex.from_tuples method:
ID2 = df.ID2.to_list()
ID1 = df.index.to_list()
indexes = [(id1, id2) for id1,id2 in zip(ID1,ID2)]
final_indices = pd.MultiIndex.from_tuples(indexes, names=["Id1", "Id2"])
Finally, you redefine your index and you can drop the 'ID2' column:
df.index = final_indices
df = df.drop('ID2', axis = 1)
This gives the following DataFrame:
Note: I also tried the df.reindex method, but the values of the DataFrame became NaN; reindex aligns the existing rows to the new labels rather than relabeling them, so labels that don't match the old index produce NaN.
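As a side note, a shorter route to the same result (an alternative suggestion, not part of the approach above) is to append the column to the existing index directly:
import pandas as pd
df = pd.DataFrame({'ID2': ['b','c','b'], 'name': ['tomato', 'pizza', 'kebap']}, index=[1,2,4])
# Append ID2 to the current index instead of building the tuples by hand:
df = df.set_index('ID2', append=True)
df.index = df.index.set_names(['Id1', 'Id2'])
print(df)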

List of Dataframes, drop Dataframe column (columns have different names) if row contains a special string

What I have is a list of DataFrames.
What is important to note is that the shapes of the dataframes differ, between 2 and 7 columns, and the columns are named from 0 up to the number of columns (e.g. df1 has 5 columns named 0,1,2,3,4; df2 has 4 columns named 0,1,2,3).
What I would like to do is check whether any row in a column contains a certain string and, if so, delete that column.
list_dfs1=[df1,df2,df3...df100]
What I have done so far is below, and I get an error that column 5 is not in axis (it is there for some DFs):
for i, df in enumerate(list_dfs1):
    for index, row in df.iterrows():
        if np.where(row.str.contains("DEC")):
            df.drop(index, axis=1)
Any suggestions?
You could try:
for df in list_dfs1:
    for col in df.columns:
        # If you are unsure about column types, cast the column as string:
        df[col] = df[col].astype(str)
        # Check if the column contains the string of interest:
        if df[col].str.contains("DEC").any():
            df.drop(columns=[col], inplace=True)
If you know that all columns are of type string, you don't have to actually do df[col] = df[col].astype(str).
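To make that concrete, here is the loop run on two made-up frames mirroring the question's setup (the data is invented for illustration; list(df.columns) takes a snapshot so dropping while iterating is safe):
import pandas as pd
df1 = pd.DataFrame({0: ['JAN', 'FEB'], 1: ['DEC', 'MAR'], 2: ['APR', 'MAY']})
df2 = pd.DataFrame({0: ['DEC', 'JUN'], 1: ['JUL', 'AUG']})
list_dfs1 = [df1, df2]
for df in list_dfs1:
    for col in list(df.columns):
        if df[col].astype(str).str.contains("DEC").any():
            df.drop(columns=[col], inplace=True)
print(list_dfs1[0])  # column 1 was dropped
print(list_dfs1[1])  # column 0 was dropped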
You can write a custom function that checks whether the dataframe has the pattern or not. You can use pd.Series.str.contains with pd.Series.any
def func(s):
    return s.str.contains('DEC').any()

list_df = [df.loc[:, ~df.apply(func)] for df in list_dfs1]
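Note that, unlike the loop above, this builds new frames rather than mutating the originals; the astype(str) guard below is an addition of mine in case some columns aren't strings:
list_df = [df.loc[:, ~df.astype(str).apply(func)] for df in list_dfs1]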
I would take another approach. I would concatenate the list into a single data frame and then eliminate any column in which the string is found:
import pandas as pd
df = pd.concat(list_dfs1)
Let us say your condition was to eliminate any column containing "DEC":
df.mask(df == "DEC").dropna(axis=1, how="any")
(Note that df == "DEC" matches cells that are exactly "DEC"; for a substring match you would need str.contains.)

Updating element of dataframe while referencing column name and row number

I am coming from an R background and am used to being able to retrieve the value from a dataframe by using syntax like:
r_dataframe$some_column_name[row_number]
And I can assign a value to the dataframe with the following syntax:
r_dataframe$some_column_name[row_number] <- some_value
or without the arrow:
r_dataframe$some_column_name[row_number] = some_value
For example:
#create R dataframe data
employee <- c('John Doe','Peter Gynn','Jolie Hope')
salary <- c(21000, 23400, 26800)
startdate <- as.Date(c('2010-11-1','2008-3-25','2007-3-14'))
employ.data <- data.frame(employee, salary, startdate)
#print out the name of this employee
employ.data$employee[2]
#assign the name
employ.data$employee[2] <- 'Some other name'
I'm now learning some Python, and from what I can see the most straightforward way to retrieve a value from a pandas dataframe is:
pandas_dataframe['SomeColumnName'][row_number]
I can see the similarities to R.
However, what confuses me is that when it comes to modifying/assigning the value in the pandas dataframe I need to completely change the syntax to something like:
pandas_dataframe.at[row_number, 'SomeColumnName'] = some_value
Reading this code takes a lot more concentration because the column name and row number have swapped order.
Is this the only way to perform this pair of operations? Is there a more logical way to do this that respects the consistent use of column name and row number order?
If I understand what you mean correctly, as @sammywemmy mentioned you can use .loc and .iloc to get/change the value in any row and column.
If the order of your dataframe rows can change, you should define an explicit index so that you can still get every row (datapoint) by its label, even after the order has changed.
Like below:
df = pd.DataFrame(index=['a', 'b', 'c'], columns=['time', 'date', 'name'])
Now you can get the first row by its index:
df.loc['a'] # equivalent to df.iloc[0]
It turns out that pandas_dataframe.at[row_number, 'SomeColumnName'] can be used to modify AND retrieve information.
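A small sketch of that symmetry (made-up data, loosely based on the R example above): both reads and writes use the same (row, column) order, so nothing has to be reordered between the two operations:
import pandas as pd
df = pd.DataFrame({'employee': ['John Doe', 'Peter Gynn', 'Jolie Hope'],
                   'salary': [21000, 23400, 26800]})
name = df.at[1, 'employee']               # retrieve
df.at[1, 'employee'] = 'Some other name'  # assign, same (row, column) order
# .loc works the same way and also accepts slices and boolean masks:
df.loc[1, 'salary'] = 25000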

How can I select data in a multiindex DataFrame and have the resulting DataFrame keep an appropriate index?

I have a multiindex DataFrame and I'm trying to select data in it based on certain criteria. So far so good. The problem is that once I have selected my data using .loc and pd.IndexSlice, the resulting DataFrame, which should logically have fewer rows and fewer elements in the first level of the multiindex, keeps exactly the same MultiIndex, with some keys in it referring to empty dataframes.
I've tried creating a completely new DataFrame with a new index, but the structure of my data set is complicated and there is not always the same number of elements in a given level, so it is not easy to create a dataFrame of the right shape in which to put the data.
import numpy as np
import pandas as pd

np.random.seed(3)  # so my example is reproducible
idx = pd.IndexSlice
iterables = [['A','B','C'], [0,1,2], ['some','rdm','data']]
my_index = pd.MultiIndex.from_product(iterables, names=['first','second','third'])
my_columns = ['col1','col2','col3']
df1 = pd.DataFrame(data=np.random.randint(10, size=(len(my_index), len(my_columns))),
                   index=my_index,
                   columns=my_columns)
# OK, so let's say I want to keep only the elements in the first level of my index
# (["A","B","C"]) for which the total sum in column 3 is less than 35, for some reason
boolean_mask = (df1.groupby(level="first").col3.sum() < 35).tolist()
first_level_to_keep = df1.index.levels[0][boolean_mask].tolist()
# let's select the wanted data and put it in df2
df2 = df1.loc[idx[first_level_to_keep, :, :], :]
So far, everything is as expected
The problem is when I want to access the df2 index. I expected the following:
df2.index.levels[0].tolist() == ['B','C']
to be true. But this is what gives a True statement:
df2.index.levels[0].tolist() == ['A','B','C']
So my question is the following: is there a way to select data and have in return a dataFrame with a MultiIndex reflecting what is actually in it? Because I find it weird to be able to select non-existing data in my df2.
I tried to put some images of the dataframes in question but I couldn't because I don't have enough «reputation»... sorry about that.
Thank you for your time!
Even if you delete the rows corresponding to a particular value in an index level, that value still exists. You can reset the index and then set those columns back as an index in order to generate a MultiIndex with new level values.
df2 = df2.reset_index().set_index(['first','second','third'])
print(df2.index.levels[0].tolist() == ['B','C'])
True
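Alternatively (assuming pandas >= 0.20), MultiIndex.remove_unused_levels does the same cleanup without rebuilding the index from columns:
df2.index = df2.index.remove_unused_levels()
print(df2.index.levels[0].tolist() == ['B','C'])
True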
