I have a Pandas dataframe with a column full of values I want to replace with another, non conditionally.
For the purpose of this question, let's assume I don't know how long this column is and I don't want to iterate over its values.
Using .replace() is not appropriate since I don't know which values are in that column: I want to replace all values, non conditionally.
Using df.loc[<row selection>, <column selection>] is not appropriate since there is no row selection logic: I want all the rows and simply writing True (as in data.loc[True, 'ColumnName'] = new_value) returns KeyError(True,). I tried data.loc[1, 'ColumnName'] = new_value and it works but it really looks like a shitty solution.
If I know len() of data['ColumnName'] I could create an array of that size, filled with as many time of my new_value and simply replace the column with that array. 10 lines of code to do something simpler than something that requires 1 line of code (doing so conditionally): this is also not ok.
How can I tell Pandas in 1 line: all the values in ColumnName are now new_value? I refuse to believe there's no way to tell Pandas not to bother me with conditions.
As I explained in the comment, you don't need to create an array.
Let's say you have df:
InvoiceNO Month Year Size
0 1 1 2 7
1 2 1 2 8
2 3 2 2 11
3 4 3 2 9
4 5 7 2 8.5
..and you want to change all values in InvoiceNO to 1234:
df['InvoiceNO'] = 1234
Output:
InvoiceNO Month Year Size
0 1234 1 2 7
1 1234 1 2 8
2 1234 2 2 11
3 1234 3 2 9
4 1234 7 2 8.5
import pandas as pd
df = pd.DataFrame(
{'num1' : [3, 5, 9, 9, 14, 1],
'num2' : [3, 5, 9, 9, 14, 1]},
index=[0, 1, 2, 3, 4, 5])
print(df)
print('\n')
df['num1'] = 100
print(df)
df['num1'] = 'Hi'
print('\n')
print(df)
The output is
num1 num2
0 3 3
1 5 5
2 9 9
3 9 9
4 14 14
5 1 1
num1 num2
0 100 3
1 100 5
2 100 9
3 100 9
4 100 14
5 100 1
num1 num2
0 Hi 3
1 Hi 5
2 Hi 9
3 Hi 9
4 Hi 14
5 Hi 1
Related
I have the following dataframe:
import pandas as pd
data = {'id': [542588, 542594, 542594, 542605, 542605, 542605, 542630, 542630],
'label': [3, 3, 1, 1, 2, 0, 0, 2]}
df = pd.DataFrame(data)
df
id label
0 542588 3
1 542594 3
2 542594 1
3 542605 1
4 542605 2
5 542605 0
6 542630 0
7 542630 2
The id columns contains large integers (6-digits). I want a way to simplify it, starting from 10, so that 542588 becomes 10, 542594 becomes 11, etc...
Required output:
id label
0 10 3
1 11 3
2 11 1
3 12 1
4 12 2
5 12 0
6 13 0
7 13 2
You can use factorize:
df['id'] = df['id'].factorize()[0] + 10
Output:
id label
0 10 3
1 11 3
2 11 1
3 12 1
4 12 2
5 12 0
6 13 0
7 13 2
Note: factorize will enumerate the keys in the order that they occur in your data, while groupby().ngroup() solution will enumerate the key in the increasing order. You can mimic the increasing order with factorize by sorting the data first. Or you can replicate the data order with groupby() by passing sort=False to it.
You can try
df['id'] = df.groupby('id').ngroup().add(10)
print(df)
id label
0 10 3
1 11 3
2 11 1
3 12 1
4 12 2
5 12 0
6 13 0
7 13 2
This is a naive way of looping through the IDs, and every time you encounter an ID you haven't seen before, associate it in a dictionary with a new ID (starting at 10, incrementing by 1 each time).
You can then swap out the values of the ID column using the map method.
new_ids = dict()
new_id = 10
for old_id in df['id']:
if old_id not in new_ids:
new_ids[old_id] = new_id
new_id += 1
df['id'] = df['id'].map(new_ids)
I have a dataframe which is called "df". It looks like this:
a
0 2
1 3
2 0
3 5
4 1
5 3
6 1
7 2
8 2
9 1
I would like to produce a cummulative sum column which:
Sums the contents of column "a" cumulatively;
Until it gets a sum of "5";
Resets the cumsum total, to 0, when it reaches a sum of "5", and continues with the summing process;
I would like the dataframe to look like this:
a a_cumm_sum
0 2 2
1 3 5
2 0 0
3 5 5
4 1 1
5 3 4
6 1 5
7 2 2
8 2 4
9 1 5
In the dataframe, the column "a_cumm_summ" contains the results of the cumulative sum.
Does anyone know how I can achieve this? I have hunted through the forums. And saw similar questions, for example, this one, but they did not meet my exact requirements.
You can get the cumsum, and floor divide by 5. Then subtract the result of the floor division, multiplied by 5, from the below row's cumulative sum:
c = df['a'].cumsum()
g = 5 * (c // 5)
df['a_cumm_sum'] = (c.shift(-1) - g).shift().fillna(df['a']).astype(int)
df
Out[1]:
a a_cumm_sum
0 2 2
1 3 5
2 0 0
3 5 5
4 1 1
5 3 4
6 1 5
7 2 2
8 2 4
9 1 5
Solution #2 (more robust):
Per Trenton's comment, A good, diverse sample dataset goes a long way to figure out unbreakable logic for these types of problems. I probably would have come up with a better solution first time around with a good sample dataset. Here is a solution that overcomes the sample dataset that Trenton mentioned in the comments. As shown, there are more conditions to handle as you have to deal with carry-over. On a large dataset, this would still be much more performant than a for-loop, but it is much more difficult logic to vectorize:
df = pd.DataFrame({'a': {0: 2, 1: 4, 2: 1, 3: 5, 4: 1, 5: 3, 6: 1, 7: 2, 8: 2, 9: 1}})
c = df['a'].cumsum()
g = 5 * (c // 5)
df['a_cumm_sum'] = (c.shift(-1) - g).shift().fillna(df['a']).astype(int)
over = (df['a_cumm_sum'].shift(1) - 5)
df['a_cumm_sum'] = df['a_cumm_sum'] - np.where(over > 0, df['a_cumm_sum'] - over, 0).cumsum()
s = np.where(df['a_cumm_sum'] < 0, df['a_cumm_sum']*-1, 0).cumsum()
df['a_cumm_sum'] = np.where((df['a_cumm_sum'] > 0) & (s > 0), s + df['a_cumm_sum'],
df['a_cumm_sum'])
df['a_cumm_sum'] = np.where(df['a_cumm_sum'] < 0, df['a_cumm_sum'].shift() + df['a'], df['a_cumm_sum'])
df
Out[2]:
a a_cumm_sum
0 2 2.0
1 4 6.0
2 1 1.0
3 5 6.0
4 1 1.0
5 3 4.0
6 1 5.0
7 2 2.0
8 2 4.0
9 1 5.0
The assignment can be combined with a condition. The code is as follows:
import numpy as np
import pandas as pd
a = [2, 3, 0, 5, 1, 3, 1, 2, 2, 1]
df = pd.DataFrame(a, columns=["a"])
df["cumsum"] = df["a"].cumsum()
df["new"] = df["cumsum"]%5
df["new"][((df["cumsum"]/5)==(df["cumsum"]/5).astype(int)) & (df["a"]!=0)] = 5
df
The output is as follows:
a cumsum new
0 2 2 2
1 3 5 5
2 0 5 0
3 5 10 5
4 1 11 1
5 3 14 4
6 1 15 5
7 2 17 2
8 2 19 4
9 1 20 5
Working:
Basically, take remainder for the cumulative sum for 5. In cases where the actual sum is 5 also becomes zero. So, for these cases, check if the value/5 == int(value/5). Then, remove cases where the actual value is zero.
EDIT:
As Trenton McKinney pointed out in the comments, OP likely wanted to reset it to 0 whenever the cumsum exceeded 5. This makes the definition to be a recurrence which is usually difficult to do with pandas/numpy (see David's solution). I'd recommend using numba to speed up the for loop in this case
Another alternative: using groupby
In [78]: df.groupby((df['a'].cumsum()% 5 == 0).shift().fillna(False).cumsum()).cumsum()
Out[78]:
a
0 2
1 5
2 0
3 5
4 1
5 4
6 5
7 2
8 4
9 5
You could try using this for loop:
lastvalue = 0
newcum = []
for i in df['a']:
if lastvalue >= 5:
lastvalue = i
else:
lastvalue += i
newcum.append(lastvalue)
df['a_cum_sum'] = newcum
print(df)
Output:
a a_cum_sum
0 2 2
1 3 5
2 0 0
3 5 5
4 1 1
5 3 4
6 1 5
7 2 2
8 2 4
9 1 5
The above for loop iterates through the a column, and when the cumulative sum is 5 or more, it resets it to 0 then adds the a column's value i, but if the cumulative sum is lower than 5, it just adds the a column's value i (the iterator).
The problem is this, I have a dataframe like so:
A B C D
2 3 X 5
7 2 5 7
1 2 7 9
3 4 X 9
1 2 3 5
6 3 X 8
I wish to iterate over the rows of the dataframe and every time column C=X I want to reset a counter and start adding the values in column B until column C=X again. Then rince and repeat down the rows till complete.
Currently I am iterating over the rows using .iterrows(), comparing column C and then procedurally adding to a variable.
I'm hoping there is a more efficient 'pandas' like approach to doing something like this.
Use cumsum() as follows.
import pandas as pd
df = pd.DataFrame({"B":[3, 2, 2, 4, 2, 3],
"C":["X", 5, 7, "X", 3, "X"]})
df['C'].loc[df['C'] == "X"] = df['B'].loc[df['C'] == "X"].cumsum()
The output is
B C
0 3 3
1 2 5
2 2 7
3 4 7
4 2 3
5 3 10
I have a table with 40 columns and 1500 rows. I want to find the maximum value among the 30-32nd (3 columns). How can it be done? I want to return the maximum value among these 3 columns and the index of dataframe.
print(Max_kVA_df.iloc[30:33].max())
hi you can refer this example
import pandas as pd
df=pd.DataFrame({'col1':[1,2,3,4,5],
'col2':[4,5,6,7,8],
'col3':[2,3,4,5,7]
})
print(df)
#print(df.iloc[:,0:3].max())# Mention range of the columns which you want, In your case change 0:3 to 30:33, here 33 will be excluded
ser=df.iloc[:,0:3].max()
print(ser.max())
Output
8
Select values by positions and use np.max:
Sample: for maximum by first 5 rows:
np.random.seed(123)
df = pd.DataFrame(np.random.randint(10, size=(10, 3)), columns=list('ABC'))
print (df)
A B C
0 2 2 6
1 1 3 9
2 6 1 0
3 1 9 0
4 0 9 3
print (df.iloc[0:5])
A B C
0 2 2 6
1 1 3 9
2 6 1 0
3 1 9 0
4 0 9 3
print (np.max(df.iloc[0:5].max()))
9
Or use iloc this way:
print(df.iloc[[30, 31], 2].max())
This might be a very simple question, but I am trying to understand how grouping and indexing work in pandas.
Let's say I have a DataFrame with the following data:
df = pd.DataFrame(data={
'p_id': [1, 1, 1, 2, 3, 3, 3, 4, 4],
'rating': [5, 3, 2, 2, 5, 1, 3, 4, 5]
})
Now, the index would be assigned automatically, so the DataFrame looks like:
p_id rating
0 1 5
1 1 3
2 1 2
3 2 2
4 3 5
5 3 1
6 3 3
7 4 4
8 4 5
When I try to group it by p_id, I get:
>> df[['p_id', 'rating']].groupby('p_id').count()
rating
p_id
1 3
2 1
3 3
4 2
I noticed that p_id now becomes an index for the grouped DataFrame, but the first row looks weird to me -- why does it have p_id index in it with empty rating?
I know how to fix it, kind of, if I do this:
>> df[['p_id', 'rating']].groupby('p_id', as_index=False).count()
p_id rating
0 1 3
1 2 1
2 3 3
3 4 2
Now I don't have this weird first column, but I have both index and p_id.
So my question is, where does this extra row coming from when I don't use as_index=False and is there a way to group DataFrame and keep p_id as index while not having to deal with this extra row? If there are any docs I can read on this, that would also be greatly appreciated.
It's just an index name...
Demo:
In [46]: df
Out[46]:
p_id rating
0 1 5
1 1 3
2 1 2
3 2 2
4 3 5
5 3 1
6 3 3
7 4 4
8 4 5
In [47]: df.index.name = 'AAA'
pay attention at the index name: AAA
In [48]: df
Out[48]:
p_id rating
AAA
0 1 5
1 1 3
2 1 2
3 2 2
4 3 5
5 3 1
6 3 3
7 4 4
8 4 5
You can get rid of it using rename_axis() method:
In [42]: df[['p_id', 'rating']].groupby('p_id').count().rename_axis(None)
Out[42]:
rating
1 3
2 1
3 3
4 2
There is no "extra row", it's simply how pandas visually renders a GroupBy object, i.e. how pandas.core.groupby.generic.DataFrameGroupBy.__str__ method renders a grouped dataframe object: rating is the column, but now p_id has now gone from being a column to being the (row) index.
Another reason they stagger them (i.e. the row with the column names, and the row with the index/multi-index name) is because the index can be a MultiIndex (if you grouped-by multiple columns).