I want a new column based on certain conditions on existing columns. Below is what I am doing right now, but it takes too much time for large data. Is there a more efficient or faster way to do it?
DF["A"][0] = 0
for x in range(1,rows):
if(DF["B"][x]>DF["B"][x-1]):
DF["A"][x] = DF["A"][x-1] + DF["C"][x]
elif(DF["B"][x]<DF["B"][x-1]):
DF["A"][x] = DF["A"][x-1] - DF["C"][x]
else:
DF["A"][x] = DF["A"][x-1]
If I got you right, this is what you want:
import pandas as pd
import numpy as np
df = pd.DataFrame({'A': [1, 2, 3, 4, 5],
                   'B': [12, 15, 9, 8, 15],
                   'C': [3, 9, 12, 6, 8]})
df['A'] = np.where(df.index == 0,
                   0,
                   np.where(df['B'] > df['B'].shift(),
                            df['A'].shift() + df['C'],
                            np.where(df['B'] < df['B'].shift(),
                                     df['A'].shift() - df['C'],
                                     df['A'].shift())))
df
#      A   B   C
#0   0.0  12   3
#1  10.0  15   9
#2 -10.0   9  12
#3  -3.0   8   6
#4  12.0  15   8
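Note that the loop in the question is cumulative: each A[x] builds on the already updated A[x-1], so A is just a running sum of signed C values. If that is the intended behaviour, a hedged sketch of a fully vectorized version (assuming numeric B and C and A[0] = 0; not taken from the answer above) could look like this:
import numpy as np
import pandas as pd
df = pd.DataFrame({'A': [1, 2, 3, 4, 5],
                   'B': [12, 15, 9, 8, 15],
                   'C': [3, 9, 12, 6, 8]})
# +1 where B increased, -1 where it decreased, 0 otherwise (the first row gets 0)
sign = np.sign(df['B'].diff().fillna(0))
# the running sum of the signed C contributions reproduces the recurrence
df['A'] = (sign * df['C']).cumsum()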
The goal, quoting the question, is "a new column based on certain conditions of existing columns".
I'm using the DataFrame provided by @zipa:
df = pd.DataFrame({'A': [1, 2, 3, 4, 5],
                   'B': [12, 15, 9, 8, 15],
                   'C': [3, 9, 12, 6, 8]})
First approach
Here's a function that implements the update efficiently, as you specified. It works by leveraging pandas' indexing features, specifically boolean row masks:
def update(df):
    cond_larger = df['B'] > df['B'].shift().fillna(0)
    cond_smaller = df['B'] < df['B'].shift().fillna(0)
    cond_else = ~(cond_larger | cond_smaller)
    for cond, sign in [(cond_larger, +1),   # A[x-1] + C[x]
                       (cond_smaller, -1),  # A[x-1] - C[x]
                       (cond_else, 0)]:     # A[x-1] + 0
        if cond.any():
            df.loc[cond, 'A_updated'] = (df['A'].shift().fillna(0) +
                                         sign * df.loc[cond, 'C'])
    df['A'] = df['A_updated']
    df.drop(columns=['A_updated'], inplace=True)
    return df
update(df)
=>
A B C
0 3.0 12 3
1 10.0 15 9
2 -10.0 9 12
3 -3.0 8 6
4 12.0 15 8
Optimized
It turns out you can use DataFrame.mask to achieve the same as above. Note that you could combine the conditions into a single call to mask; however, I find it easier to read like this:
# specify conditions
cond_larger = df['B'] > df['B'].shift().fillna(0)
cond_smaller = df['B'] < df['B'].shift().fillna(0)
cond_else = ~(cond_larger | cond_smaller)
# apply
A_shifted = (df['A'].shift().fillna(0)).copy()
df.mask(cond_larger, A_shifted + df['C'], axis=0, inplace=True)
df.mask(cond_smaller, A_shifted - df['C'], axis=0, inplace=True)
df.mask(cond_else, A_shifted, axis=0, inplace=True)
=>
(same results as above)
Notes:
- I'm assuming a default value of 0 for A/B[x-1]. If the first row should be treated differently, remove or replace .fillna(0); results will then differ.
- Conditions are checked in sequence. Depending on whether updates should use the original values in A or those updated by the previous condition, you may not need the helper column A_updated.
- See previous versions of this answer for a history of how I got here.
I have the following DataFrame:
Now I want to insert an empty row after every row in which the column "Zweck" equals 7.
So, for example, the third row should be an empty row.
import numpy as np
import pandas as pd

df = pd.DataFrame({'a': [1, 2, 3, 4, 5], 'b': [1, 2, 3, 4, 5], 'f': [1, 7, 3, 4, 7]})
ren_dict = {i: df.columns[i] for i in range(len(df.columns))}
ind = df[df['f'] == 7].index + 1  # insert *after* the matching rows
df = pd.DataFrame(np.insert(df.values, ind, values=[33], axis=0))
df.rename(columns=ren_dict, inplace=True)
ind_empt = df['a'] == 33
df[ind_empt] = ''
print(df)
Output
  a b f
0 1 1 1
1 2 2 7
2
3 3 3 3
4 4 4 4
5 5 5 7
6
Here the DataFrame is rebuilt in one step rather than appended to row by row, since repeated appends would be resource-intensive. The inserted rows first get the placeholder value 33, because np.insert cannot insert string values into a numeric array. The columns are then renamed back to their original names with df.rename, and finally the rows where df['a'] == 33 are set to empty strings.
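A sentinel-free alternative, sketched under the same column names as above (not from the original answer): shift each row's position down by the number of empty rows that should come before it, then reindex to open the gaps.
import pandas as pd
df = pd.DataFrame({'a': [1, 2, 3, 4, 5], 'b': [1, 2, 3, 4, 5], 'f': [1, 7, 3, 4, 7]})
mask = df['f'] == 7
# each row moves down by the number of empty rows inserted before it
new_pos = df.index + mask.cumsum().shift(fill_value=0).to_numpy()
out = (df.astype(object)                        # keep ints intact once gaps appear
         .set_index(new_pos)
         .reindex(range(len(df) + mask.sum()))  # open the gaps
         .fillna(''))                           # gaps become empty rows
print(out)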
I have two dataframes like so:
data = {'A': [3, 2, 1, 0], 'B': [1, 2, 3, 4]}
data2 = {'A': [3, 2, 1, 0, 3, 2], 'B': [1, 2, 3, 4, 20, 2], 'C':[5,3,2,1, 5, 1]}
df1 = pd.DataFrame.from_dict(data)
df2 = pd.DataFrame.from_dict(data2)
Now I did a groupby of df2 to get the mean of C for every (A, B) pair:
values_to_map = df2.groupby(['A', 'B']).mean().to_dict()
Now I would like to map this onto df1 as a new column new_C wherever the columns A and B match:
   A  B  new_C
0  3  1    5.0
1  2  2    2.0
2  1  3    2.0
3  0  4    1.0
where new_C is simply the average of C for every pair (A, B) from df2.
Note that A and B don't have to be keys of the DataFrame (i.e. they aren't unique identifiers), which is why I originally wanted to map it with a dictionary, but that failed with multiple keys.
How would I go about that?
Thank you for looking into it with me!
I found a solution to this:
values_to_map = df2.groupby(['A', 'B'])['C'].mean().to_dict()
df1['new_c'] = df1.apply(lambda x: values_to_map[x['A'], x['B']], axis=1)
Thanks for looking into it!
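As a small follow-up (not part of the original solution): using dict.get avoids a KeyError when an (A, B) pair from df1 has no match in df2; unmatched rows then get None instead.
# illustrative variant of the lookup above; missing pairs yield None instead of raising
df1['new_c'] = df1.apply(lambda x: values_to_map.get((x['A'], x['B'])), axis=1)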
Just do np.vectorize over the two key columns:
values_to_map = df2.groupby(['A', 'B'])['C'].mean().to_dict()
df1['new_c'] = np.vectorize(lambda a, b: values_to_map.get((a, b)))(df1['A'], df1['B'])
You can first form a MultiIndex from the [["A", "B"]] subset of the frame df1 and use its map function to map the A-B pairs to the desired grouped mean values:
cols = ["A", "B"]
mapper = df2.groupby(cols).C.mean()
df1["new_c"] = pd.MultiIndex.from_frame(df1[cols]).map(mapper)
to get
>>> df1
A B new_c
0 3 1 5.0
1 2 2 2.0
2 1 3 2.0
3 0 4 1.0
(if an A-B pair in df1 isn't found in df2's groups, new_c corresponding to that pair will be NaN with this method.)
Note that neither pandas' apply nor np.vectorize is a truly "vectorized" routine. However, they might be fast enough for one's purposes and can prove more readable in places.
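For larger frames, a fully vectorized alternative is a plain left merge on the grouped means; this is a sketch rather than something from the answers above:
# grouped means of C per (A, B) pair, kept as regular columns
means = df2.groupby(['A', 'B'], as_index=False)['C'].mean().rename(columns={'C': 'new_c'})
# a left merge keeps every row of df1; unmatched pairs get NaN in new_c
df1 = df1.merge(means, on=['A', 'B'], how='left')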
I'm working in Python. I have two dataframes df1 and df2:
d1 = {'timestamp1': [88148 , 5617900, 5622548, 5645748, 6603950, 6666502], 'col01': [1, 2, 3, 4, 5, 6]}
df1 = pd.DataFrame(d1)
d2 = {'timestamp2': [5629500, 5643050, 6578800, 6583150, 6611350], 'col02': [7, 8, 9, 10, 11], 'col03': [0, 1, 0, 0, 1]}
df2 = pd.DataFrame(d2)
I want to create a new column in df1 with the value of the minimum timestamp of df2 greater than the current df1 timestamp, where df2['col03'] is zero. This is the way I did it:
df1['colnew'] = np.nan
TSs = df1['timestamp1']
for TS in TSs:
    values = df2['timestamp2'][(df2['timestamp2'] > TS) & (df2['col03'] == 0)]
    if not values.empty:
        df1.loc[df1['timestamp1'] == TS, 'colnew'] = values.iloc[0]
It works, but I'd prefer not to use a for loop. Is there a better way to do this?
Use pandas.merge_asof with a forward direction:
pd.merge_asof(
df1, df2.loc[df2.col03 == 0, ['timestamp2']],
left_on='timestamp1', right_on='timestamp2', direction='forward'
).rename(columns=dict(timestamp2='colnew'))
col01 timestamp1 colnew
0 1 88148 5629500.0
1 2 5617900 5629500.0
2 3 5622548 5629500.0
3 4 5645748 6578800.0
4 5 6603950 NaN
5 6 6666502 NaN
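One caveat not spelled out above: merge_asof requires both frames to be sorted by their join keys. A hedged sketch that sorts first and assigns the result back to df1:
right = df2.loc[df2.col03 == 0, ['timestamp2']].sort_values('timestamp2')
df1 = pd.merge_asof(
    df1.sort_values('timestamp1'), right,
    left_on='timestamp1', right_on='timestamp2', direction='forward'
).rename(columns={'timestamp2': 'colnew'})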
Give the apply method a try.
def func(x):
    values = df2['timestamp2'][(df2['timestamp2'] > x) & (df2['col03'] == 0)]
    if not values.empty:
        return values.iloc[0]
    else:
        return np.nan

df1["timestamp1"].apply(func)
You can create a separate function to do what has to be done.
The output is your new column
0 5629500.0
1 5629500.0
2 5629500.0
3 6578800.0
4 NaN
5 NaN
Name: timestamp1, dtype: float64
It is not a one-line solution, but it helps keep things organised.
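To actually store the result as a column (a small follow-up, reusing the colnew name from the question):
df1['colnew'] = df1['timestamp1'].apply(func)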
I am trying to create a program that will delete a column in a Panda's dataFrame if the column's sum is less than 10.
I currently have the following solution, but I was curious if there is a more pythonic way to do this.
df = pandas.DataFrame(AllData)
sum = df.sum(axis=1)
badCols = list()
for index in range(len(sum)):
    if sum[index] < 10:
        badCols.append(index)
df = df.drop(df.columns[badCols], axis=1)
In my approach, I create a list of column indexes that have sums less than 10, then I delete this list. Is there a better approach for doing this?
You can call sum to generate a Series that gives the sum of each column, then use this to generate a boolean mask against your column array and use that to filter the df. DataFrame generation code borrowed from @Alexander:
In [2]:
df = pd.DataFrame({'a': [1, 10], 'b': [1, 1], 'c': [20, 30]})
df
Out[2]:
a b c
0 1 1 20
1 10 1 30
In [3]:
df.sum()
Out[3]:
a 11
b 2
c 50
dtype: int64
In [6]:
df[df.columns[df.sum()>10]]
Out[6]:
a c
0 1 20
1 10 30
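Equivalently, a shorthand sketch that skips the explicit column lookup by passing the boolean mask to loc:
# keep only the columns whose sum exceeds 10
df.loc[:, df.sum() > 10]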
You can accomplish your objective with a one-liner, using a list comprehension and iteritems to identify all columns that meet your criteria.
df = pd.DataFrame({'a': [1, 10], 'b': [1, 1], 'c': [20, 30]})
>>> df
a b c
0 1 1 20
1 10 1 30
df.drop([col for col, val in df.sum().iteritems() if val < 10], axis=1, inplace=True)
>>> df
a c
0 1 20
1 10 30
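Note that Series.iteritems was deprecated and removed in pandas 2.0; Series.items is the drop-in replacement, so the same one-liner on recent pandas would be:
df.drop([col for col, val in df.sum().items() if val < 10], axis=1, inplace=True)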
I'm trying to make a table, and the way Pandas formats its indices is exactly what I'm looking for. That said, I don't want the actual data, and I can't figure out how to get Pandas to print out just the indices without the corresponding data.
You can access the index attribute of a df using .index:
In [277]:
df = pd.DataFrame({'a':np.arange(10), 'b':np.random.randn(10)})
df
Out[277]:
a b
0 0 0.293422
1 1 -1.631018
2 2 0.065344
3 3 -0.417926
4 4 1.925325
5 5 0.167545
6 6 -0.988941
7 7 -0.277446
8 8 1.426912
9 9 -0.114189
In [278]:
df.index
Out[278]:
Int64Index([0, 1, 2, 3, 4, 5, 6, 7, 8, 9], dtype='int64')
.index.tolist() is another way to get the index as a list:
In [1391]: datasheet.head(20).index.tolist()
Out[1391]: [0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19]
You can access individual index entries of a df using df.index[i]:
>>> import pandas as pd
>>> import numpy as np
>>> df = pd.DataFrame({'a': np.arange(5), 'b': np.random.randn(5)})
   a         b
0  0  1.088998
1  1 -1.381735
2  2  0.035058
3  3 -2.273023
4  4  1.345342
>>> df.index[1]   ## Second index
>>> df.index[-1]  ## Last index
>>> for i in range(len(df)): print(df.index[i])  ## Using a loop
...
0
1
2
3
4
You can use a list comprehension to pull the index values out:
index = [x for x in df.index]
print(index)
You can always try df.index. This attribute will show you the range index.
Or you can always set your own index. Let's say you had a weather.csv file with the headers
'date', 'temperature' and 'event', and you want to set "date" as your index.
import pandas as pd
df = pd.read_csv('weather.csv')
df.set_index('date', inplace=True)
df
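If the goal is still just to display the indices without the data, a short follow-up sketch assuming the weather.csv layout above:
print(df.index)  # prints only the index labels, no data columns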