Pandas dropping column issue - python

I'm trying to drop a column from my dataframe, but the problem is that whenever I drop the column (which works) my columns always get rearranged into a different order. Does anyone know why this might be? This is my code right now:
df = df.drop('column_name', axis=1)

One way to do it is:
df = df[['col1', 'col2', 'col4']]  # if col3 is what you want to drop
This is useful when you have fewer columns.

I cannot reproduce your problem, but I believe the following code can keep the order. It's the same idea as runzhi xiao's answer, but without typing out all the remaining columns.
import pandas as pd

df = pd.DataFrame({
    'col1': [1, 2, 3, 4],
    'col2': ['a', 'e', 'i', 'o'],
    'col3': ['a', 'e', 'i', 'o'],
    'col4': [0.1, 0.2, 1, 2],
})
cols_to_drop = ['col2']
new_columns = df.columns.drop(cols_to_drop).to_list()
df[new_columns]
   col1 col3  col4
0     1    a   0.1
1     2    e   0.2
2     3    i   1.0
3     4    o   2.0
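
For what it's worth, df.drop itself should not rearrange anything either: the remaining columns keep their original positions. A quick check of that (my own sketch, reusing the sample frame from above):

import pandas as pd

df = pd.DataFrame({
    'col1': [1, 2, 3, 4],
    'col2': ['a', 'e', 'i', 'o'],
    'col3': ['a', 'e', 'i', 'o'],
    'col4': [0.1, 0.2, 1, 2],
})
# drop removes col2 and leaves the remaining columns in their original order
print(df.drop(columns=['col2']).columns.tolist())  # ['col1', 'col3', 'col4']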


How do I multiply values of a dataframe column by values from a column of another dataframe based on a common category?

I have two dataframes:
data1 = {'Item': ['A', 'B', 'C', 'N'], 'Price': [1, 2, 3, 10], 'Category': ['X', 'Y', 'X', 'Z'], 'County': ['K', 'L', 'L', 'K']}
df1 = pd.DataFrame(data1)
df1
data2 = {'Category': ['X', 'Y', 'Z'], 'Value retained': [0.1, 0.2, 0.8]}
df2 = pd.DataFrame(data2)
df2
How do I multiply 'Value retained' by 'Price' following their respective Category and add the result as a new column in df1?
I've searched a lot for a solution and tried several different things, among them:
df3 = df1
for cat, VR in df2['Category', 'Value retained']:
    if cat in df1.columns:
        df3[cat] = df1['Price'] * VR
and
df3 = df1['Price'] * df2.set_index('Category')['Value retained']
df3
In my real dataframe I have 250k+ items and 32 categories with different values of 'value retained'.
I'd really appreciate any help; I'm a newbie in Python coding.
Your second approach would work if both dataframes had Category as their index, but since you can't set_index on Category in df1 (because it has duplicated entries), you need to do a left merge of the two dataframes on the Category column and then multiply.
df3 = df1.merge(df2, on='Category', how='left')
df3['result'] = df3['Price'] * df3['Value retained']
print(df3)
  Item  Price Category County  Value retained  result
0    A      1        X      K             0.1     0.1
1    B      2        Y      L             0.2     0.4
2    C      3        X      L             0.1     0.3
3    N     10        Z      K             0.8     8.0
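If you'd rather not carry the merged columns around, a map-based variant (my own sketch, not part of the answer above) gives the same result by looking up each row's Category directly:

# look up 'Value retained' for each row's Category and multiply by Price
df1['result'] = df1['Price'] * df1['Category'].map(df2.set_index('Category')['Value retained'])
print(df1)

This works because df2 has one row per Category, so set_index('Category') gives a unique lookup table for map.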
You can use this:
import pandas as pd
data1 = {'Item': ['A', 'B', 'C', 'N'], 'Price': [1, 2, 3, 10], 'Category': ['X', 'Y', 'X', 'Z'], 'County': ['K', 'L', 'L', 'K']}
df1 = pd.DataFrame(data1)
data2 = {'Category': ['X', 'Y', 'Z'], 'Value_retained': [0.1, 0.2, 0.8]}
df2 = pd.DataFrame(data2)
df = df1.merge(df2, how='left')
df['Values'] = df.Price * df.Value_retained
print(df)
The output is:
  Item  Price Category County  Value_retained  Values
0    A      1        X      K             0.1     0.1
1    B      2        Y      L             0.2     0.4
2    C      3        X      L             0.1     0.3
3    N     10        Z      K             0.8     8.0

Making a long-to-wide transformation by grouping/separating rows with a delimiter

I need to make a long-to-wide transformation (see image below) using Pandas.
I wrote this code but unfortunately it does not work!
Code:
import pandas as pd
import numpy as np

df = pd.DataFrame({'Id': ['Id001', 'Id001', 'Id002', 'Id003', 'Id003', 'Id003'],
                   'val1': [np.nan, 'B', 3, 'H', np.nan, 'J'],
                   'val2': ['N', np.nan, 'M', 2, 'K', 'I'],
                   'val3': [5, 'E', 'P', 'L', np.nan, 'R']})
df = (df.groupby('Id')
        .agg(val1=('val1', ' | '.join),
             val2=('val2', ' | '.join),
             val3=('val3', ' | '.join))
        .rename_axis(None))
df
Here is the error I'm getting:
Error:
TypeError: sequence item 0: expected str instance, float found
Do you have any suggestions/solutions ?
The error is due to the presence of NaN values: NaN is a floating point value, and hence you can't join strings with NaN. The solution is to explicitly cast the NaNs to strings:
df.filter(like='val').astype(str).groupby(df['Id']).agg('|'.join)
          val1   val2     val3
Id
Id001    nan|B  N|nan      5|E
Id002        3      M        P
Id003  H|nan|J  2|K|I  L|nan|R
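If the literal nan pieces in the joined strings are unwanted, a small variation (my own sketch, not part of the answer above) drops the NaNs in each group before joining:

out = (df.filter(like='val')
         .groupby(df['Id'])
         .agg(lambda s: ' | '.join(s.dropna().astype(str))))
print(out)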

Combine rows of df based on sequence in 1 column

I have df like this:
d = {'col1': ['A', 'B', 'C', 'K', 'L', 'M'], 'col2': ['Open', 'Done', 'Open', 'Open', 'Done', 'Open'], 'col3': [1, 2, 3, 3, 1, 2]}
df = pd.DataFrame(data=d)
I'd like to iterate over col3 whenever the next row is increasing, until the same value reoccurs, then combine rows/columns like this:
d = {'col1': ['A', 'B', 'C', 'K', 'L', 'M'], 'col2': ['Open', 'Done', 'Open', 'Open', 'Done', 'Open'], 'col3': [1, 2, 3, 3, 1, 2], 'col4': ['B/Done;C/Open;K/Open', 'C/Open;K/Open', 'None', 'None', 'M/Open', 'None']}
df = pd.DataFrame(data=d)
I have thousands of rows, so I am trying to avoid using a for loop if possible.
I believe you can't perform this in a vectorized way.
Here is a working approach, but using a loop in a custom function:
def combine(series):
    out = []
    # build the running ';'-joined string, starting from the second row of the group
    for s in series.iloc[1:]:
        out.append(out[-1] + ';' + s if out else s)
    # reverse so each row gets the combination of the rows that follow it
    out = out[::-1]
    out.append(None)  # the last row of a group has nothing after it
    return pd.Series(out, index=series.index)
group = df['col3'].diff().eq(0)[::-1].cumsum()[::-1]
df['col4'] = (df.assign(col=df['col1'] + '/' + df['col2'])
                .groupby(group, sort=False)['col']
                .apply(combine))
output:
  col1  col2  col3                  col4
0    A  Open     1  B/Done;C/Open;K/Open
1    B  Done     2         B/Done;C/Open
2    C  Open     3                B/Done
3    K  Open     3                  None
4    L  Done     1                M/Open
5    M  Open     2                  None
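In case the grouping line is not obvious, here is a small check (my own illustration, not part of the answer) of what it produces on the sample data: a repeated value in col3 ends a run, and the reversed cumulative sum propagates that break backwards so every row up to and including the repeat lands in the same group.

import pandas as pd

d = {'col1': ['A', 'B', 'C', 'K', 'L', 'M'],
     'col2': ['Open', 'Done', 'Open', 'Open', 'Done', 'Open'],
     'col3': [1, 2, 3, 3, 1, 2]}
df = pd.DataFrame(data=d)

# repeated col3 value -> diff() == 0 -> group boundary, propagated backwards
group = df['col3'].diff().eq(0)[::-1].cumsum()[::-1]
print(group.tolist())  # [1, 1, 1, 1, 0, 0]: rows 0-3 form one group, rows 4-5 another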

Dataframe pivoting: converting category values into columns with a prefix

I am trying to transform this DataFrame into the wide layout described below.
Here is the code to create the sample df:
df = pd.DataFrame(data=[[1, 'A', 0, '2021-07-01'],
                        [1, 'B', 1, '2021-07-02'],
                        [2, 'D', 3, '2021-07-02'],
                        [2, 'C', 2, '2021-07-02'],
                        [2, 'E', 4, '2021-07-02']],
                  columns=['id', 'symbol', 'value', 'date'])
symbol_list = [['A', 'B', ''], ['C', 'D', 'E']]
The end result is a dataframe grouped by the id field, with the symbol column turned into multiple columns whose ordering follows the user input list.
I was using the .apply() method to construct each data row of the above dataframe, but it is taking a very long time for 10000+ data points.
I am trying to find a more efficient way to transform the dataframe. I am thinking that I will need to use a pivot function to unstack the dataframe, combined with resetting the index (to turn the category values into columns). Appreciate any help on this!
Use GroupBy.cumcount with DataFrame.unstack to reshape, then extract the date with DataFrame.pop and take its max per row, flatten the columns, and finally add the date column back with DataFrame.assign:
df = pd.DataFrame(data=[[1, 'A', 0, '2021-07-01'],
                        [1, 'B', 1, '2021-07-02'],
                        [2, 'D', 3, '2021-07-02'],
                        [2, 'C', 2, '2021-07-02'],
                        [2, 'E', 4, '2021-07-02']],
                  columns=['id', 'symbol', 'value', 'date'])
# IMPORTANT: all values from symbol_list are in the symbol column (ignoring the empty strings)
symbol_list = [['A', 'B', ''], ['C', 'D', 'E']]
order = [y for x in symbol_list for y in x if y]
print(order)
['A', 'B', 'C', 'D', 'E']
# convert all values to Categoricals with the order given by the flattened lists
df['symbol'] = pd.Categorical(df['symbol'], ordered=True, categories=order)
df['date'] = pd.to_datetime(df['date'])
# sort by id and symbol
df = df.sort_values(['id', 'symbol'])
df1 = df.set_index(['id',df.groupby('id').cumcount()]).unstack()
date_max = df1.pop('date').max(axis=1)
df1.columns = df1.columns.map(lambda x: f'{x[0]}_{x[1]}')
df1 = df1.assign(date = date_max)
print (df1)
   symbol_0 symbol_1 symbol_2  value_0  value_1  value_2       date
id
1         A        B      NaN      0.0      1.0      NaN 2021-07-02
2         C        D        E      2.0      3.0      4.0 2021-07-02
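
Since the question mentions a pivot, here is a rough equivalent using DataFrame.pivot instead of unstack (my own sketch, assuming pandas >= 1.1 where pivot accepts a list of value columns, and assuming the Categorical/sorting steps above have already been run on df):

df['n'] = df.groupby('id').cumcount()
wide = df.pivot(index='id', columns='n', values=['symbol', 'value'])
# flatten the (column name, counter) MultiIndex into symbol_0, value_0, ...
wide.columns = [f'{a}_{b}' for a, b in wide.columns]
wide['date'] = df.groupby('id')['date'].max()
print(wide)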

Why does the np.where function also seem to evaluate values that don't satisfy the condition?

I'm trying to change the values of only certain values in a dataframe:
test = pd.DataFrame({'col1': ['a', 'a', 'b', 'c'], 'col2': [1, 2, 3, 4]})
dict_curr = {'a':2}
test['col2'] = np.where(test.col1 == 'a', test.col1.map(lambda x: dict_curr[x]), test.col2)
However, this doesn't seem to work because even though I'm looking only at the values in col1 that are 'a', the error says
KeyError: 'b'
Implying that it also looks at the rows where col1 is 'b'. Why is this? And how do I fix it?
The error is originating from the test.col1.map(lambda x: dict_curr[x]) part. You look up the values from col1 in dict_curr, which only has an entry for 'a', not for 'b'.
You can also just index the dataframe:
test.loc[test.col1 == 'a', 'col2'] = 2
The problem is that when you call np.where, all of its arguments are evaluated first, and only then is the result picked depending on the condition. So the dictionary is queried for 'b' and 'c' as well, even though those values will be discarded later. Probably the easiest fix is:
import pandas as pd
import numpy as np
test = pd.DataFrame({'col1': ['a', 'a', 'b', 'c'], 'col2': [1, 2, 3, 4]})
dict_curr = {'a': 2}
test['col2'] = np.where(test.col1 == 'a', test.col1.map(lambda x: dict_curr.get(x, 0)), test.col2)
This will give the value 0 for keys not in the dictionary, but since it will be discarded later it does not matter which value you use.
Another easy way of getting the same result is:
import pandas as pd
test = pd.DataFrame({'col1': ['a', 'a', 'b', 'c'], 'col2': [1, 2, 3, 4]})
dict_curr = {'a': 2}
test['col2'] = test.apply(lambda x: dict_curr.get(x.col1, x.col2), axis=1)
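Another option in the same spirit (my own sketch, not from the answers above) is to map col1 through the dict, which yields NaN for missing keys, and fall back to the original col2 with fillna:

import pandas as pd

test = pd.DataFrame({'col1': ['a', 'a', 'b', 'c'], 'col2': [1, 2, 3, 4]})
dict_curr = {'a': 2}
# rows whose col1 is not in dict_curr get NaN from map and keep their old col2 value
test['col2'] = test['col1'].map(dict_curr).fillna(test['col2'])
print(test)

Note that col2 comes back as float because of the intermediate NaNs; cast it back with astype(int) if that matters.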
