pandas set one column equal to 1 but both DataFrames change - python

I'm trying to set column D of df b to 1. However, when I run this code, it also changes column D of df a to 1. Why is that? Why are the variables linked, and how do I change df b only?
import pandas as pd
import numpy as np
df = pd.DataFrame(np.random.randint(0, 100, size=(100, 4)), columns=list('ABCD'))
a = df
b = df
b['D'] = 1
output:
>>> a
A B C D
0 98 84 3 1
1 13 35 76 1
2 17 84 28 1
3 22 9 41 1
4 54 3 20 1
>>> b
A B C D
0 98 84 3 1
1 13 35 76 1
2 17 84 28 1
3 22 9 41 1
4 54 3 20 1
>>>

a, b and df are all references to the same object. When you assign to b['D'], you are mutating that single underlying object, so the change is visible through every name. It looks like you want to copy the DataFrame instead:
import pandas as pd
import numpy as np
df = pd.DataFrame(np.random.randint(0, 100, size=(100, 4)), columns=list('ABCD'))
a = df.copy()
b = df.copy()
b['D'] = 1
which yields
b.head()
Out:
A B C D
0 63 52 92 1
1 98 35 43 1
2 24 87 70 1
3 38 4 7 1
4 71 30 25 1
a.head()
Out:
A B C D
0 63 52 92 80
1 98 35 43 78
2 24 87 70 26
3 38 4 7 48
4 71 30 25 61
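To make the aliasing visible, you can check object identity directly with Python's is operator; a minimal sketch of the idea (my addition, not part of the answer above):
import pandas as pd
import numpy as np
df = pd.DataFrame(np.random.randint(0, 100, size=(5, 4)), columns=list('ABCD'))
a = df          # another name bound to the same object
b = df.copy()   # an independent copy
print(a is df)  # True: mutating a mutates df too
print(b is df)  # False: b has its own data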
There are also more detailed discussions of this behavior elsewhere on this site.

Don't use = when trying to copy a DataFrame; use pd.DataFrame.copy(yourdataframe) instead:
a = pd.DataFrame.copy(df)
b = pd.DataFrame.copy(df)
b['D'] = 1
This should solve your problem.

You should use copy. Change
a = df
b = df
to
a = df.copy()
b = df.copy()
Check out this reference, where this issue is discussed in more depth. I also had this confusion when I started using pandas.

Related

Overwrite some rows in pandas dataframe with ones from another dataframe based on index

I have a pandas dataframe, df1.
I want to overwrite its values with values in df2, where the index and column name match.
I've found a few answers on this site, but nothing that quite does what I want.
df1
A B C
0 33 44 54
1 11 32 54
2 43 55 12
3 43 23 34
df2
A
0 5555
output
A B C
0 5555 44 54
1 11 32 54
2 43 55 12
3 43 23 34
You can use combine_first, converting to integer if necessary:
df = df2.combine_first(df1).astype(int)
print(df)
A B C
0 5555 44 54
1 11 32 54
2 43 55 12
3 43 23 34
If you need to check the intersection of the index and columns between both DataFrames:
df2 = pd.DataFrame({'A': [5555, 2222],
                    'D': [3333, 4444]}, index=[0, 10])
idx = df2.index.intersection(df1.index)
cols = df2.columns.intersection(df1.columns)
df = df2.loc[idx, cols].combine_first(df1).astype(int)
print(df)
A B C
0 5555 44 54
1 11 32 54
2 43 55 12
3 43 23 34
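As a side note (my addition, not part of the answer above): if you would rather modify df1 in place than build a new frame, DataFrame.update aligns on index and columns in the same way, and it silently ignores labels of df2 that don't exist in df1:
import pandas as pd
df1 = pd.DataFrame({'A': [33, 11, 43, 43],
                    'B': [44, 32, 55, 23],
                    'C': [54, 54, 12, 34]})
df2 = pd.DataFrame({'A': [5555]}, index=[0])
df1.update(df2)         # in-place overwrite where index/column labels match
print(df1.astype(int))  # update may upcast to float, so cast back if needed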

Retrieving future value in Python using offset variable from another column

I'm trying to figure out how to retrieve values from future rows using an offset stored in a separate column. For instance, given the DataFrame df below, I'd like to find a way to produce column C:
Orig A Orig B Desired Column C
54 1 76
76 4 46
14 3 46
35 1 -3
-3 0 -3
46 0 46
64 0 64
93 0 93
72 0 72
Any help is much appreciated, thank you!
You can use NumPy for a vectorised solution:
import numpy as np
idx = np.arange(df.shape[0]) + df['OrigB'].values
df['C'] = df['OrigA'].iloc[idx].values
print(df)
OrigA OrigB C
0 54 1 76
1 76 4 46
2 14 3 46
3 35 1 -3
4 -3 0 -3
5 46 0 46
6 64 0 64
7 93 0 93
8 72 0 72
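A caveat worth adding to this NumPy approach (my note, not from the original answer): if any offset points past the last row, iloc will raise an IndexError. A defensive variant clips the computed positions to the valid range:
import pandas as pd
import numpy as np
df = pd.DataFrame({'OrigA': [54, 76, 14, 35, -3, 46, 64, 93, 72],
                   'OrigB': [1, 4, 3, 1, 0, 0, 0, 0, 0]})
idx = np.arange(df.shape[0]) + df['OrigB'].to_numpy()
idx = np.clip(idx, 0, df.shape[0] - 1)   # keep every position inside the frame
df['C'] = df['OrigA'].iloc[idx].to_numpy()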
import pandas as pd
data = {"Orig A": [54, 76, 14, 35, -3, 46, 64, 93, 72],
        "Orig B": [1, 4, 3, 1, 0, 0, 0, 0, 0],
        "Desired Column C": [76, 46, 46, -3, -3, 46, 64, 93, 72]}
df = pd.DataFrame(data)
df["desired_test"] = [df["Orig A"].values[i + j] for i, j in enumerate(df["Orig B"].values)]
df
Orig A Orig B Desired Column C desired_test
0 54 1 76 76
1 76 4 46 46
2 14 3 46 46
3 35 1 -3 -3
4 -3 0 -3 -3
5 46 0 46 46
6 64 0 64 64
7 93 0 93 93
8 72 0 72 72

Pythonic way to randomly assign pandas dataframe entries

Suppose we have a data frame
In [1]: df = pd.DataFrame(np.random.randint(0,100,size=(100, 4)), columns=list('ABCD'))
In [2]: df
Out[3]:
A B C D
0 45 88 44 92
1 62 34 2 86
2 85 65 11 31
3 74 43 42 56
4 90 38 34 93
5 0 94 45 10
.. .. .. .. ..
How can I randomly replace x% of all entries with a value, such as None?
In [4]: something(df, percent=25)
Out[5]:
A B C D
0 45 88 None 92
1 62 34 2 86
2 None None 11 31
3 74 43 None 56
4 90 38 34 None
5 None 94 45 10
.. .. .. .. ..
I've found information about sampling particular axes, and I can imagine a way of randomly generating integers within the dimensions of my data frame and setting those equal to None, but that doesn't feel very Pythonic.
You could combine DataFrame.where and np.random.uniform:
In [37]: df
Out[37]:
A B C D
0 1 0 2 2
1 2 2 0 3
2 3 0 0 3
3 0 2 3 1
In [38]: df.where(np.random.uniform(size=df.shape) > 0.3, None)
Out[38]:
A B C D
0 1 0 2 None
1 2 2 0 3
2 3 0 None None
3 None 2 3 None
It's not the most concise, but it gets the job done.
Note though that you should ask yourself whether you really want to do this if you still have computations to do. If you put None in a column, then pandas is going to have to use the slow object dtype instead of something fast like int64 or float64.
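A related sketch (my addition): if you still need numeric dtypes afterwards, filling with np.nan instead of None keeps columns as float64 rather than object. DataFrame.mask uses NaN as its default replacement, so the same idea becomes:
import pandas as pd
import numpy as np
df = pd.DataFrame(np.random.randint(0, 100, size=(100, 4)), columns=list('ABCD'))
masked = df.mask(np.random.uniform(size=df.shape) < 0.25)  # ~25% of entries become NaN
print(masked.dtypes)  # float64 columns, not object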

Filter pandas data frame with results of groupby

I have a large data frame (40M rows) and I want to filter out rows based on one column if the value meets a condition in a groupby object.
For example, here is some random data. The 'letter' column would actually have thousands of unique values:
x y z letter
0 47 86 30 e
1 58 9 28 b
2 96 59 42 a
3 79 6 45 e
4 77 80 37 d
5 66 91 35 d
6 96 31 52 d
7 56 8 26 e
8 78 96 14 a
9 22 60 13 e
10 75 82 9 d
11 5 54 29 c
12 83 31 40 e
13 37 70 2 c
14 53 67 66 a
15 76 33 78 d
16 64 67 81 b
17 23 94 1 d
18 10 1 31 e
19 52 11 3 d
Apply a groupby on the 'letter' column, and get the sum of column x for each letter:
df.groupby('letter').x.sum()
>>> a 227
b 122
c 42
d 465
e 297
Then, I sort to see the letters with the highest sum, and manually identify a threshold. In this example the threshold might be 200.
df.groupby('letter').x.sum().reset_index().sort_values('x', ascending=False)
>>> letter x
3 d 465
4 e 297
0 a 227
1 b 122
2 c 42
Here's where I am stuck. In the original dataframe, I want to keep letters if the groupby sum of column 'x' > 200, and drop the other rows. So in this example, it would keep all the rows with d, e or a.
I was trying something like this but it doesn't work:
df.groupby('letter').x.sum().filter(lambda x: len(x) > 200)
And even if I filter the groupby object, how do I use it to filter the original dataframe?
You can use groupby transform to calculate the sum of x for each row, producing a Boolean Series from the condition, which you can then use to subset:
df1 = df[df.x.groupby(df.letter).transform('sum') > 200]
df1.letter.unique()
# array(['e', 'a', 'd'], dtype=object)
Or another option using groupby.filter:
df2 = df.groupby('letter').filter(lambda g: g.x.sum() > 200)
df2.letter.unique()
# array(['e', 'a', 'd'], dtype=object)
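A third option (my sketch, not from the answers above) that makes the threshold step explicit: compute the per-letter sums once, pick the letters above the threshold, and filter with isin:
sums = df.groupby('letter').x.sum()
keep = sums[sums > 200].index       # letters whose x-sum exceeds the threshold
df3 = df[df['letter'].isin(keep)]
df3.letter.unique()
# array(['e', 'a', 'd'], dtype=object)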

Combine duplicated columns within a DataFrame

If I have a DataFrame with columns that share the same name, is there a way to combine the columns that have the same name with some sort of function (e.g. sum)?
For instance with:
In [186]:
df["NY-WEB01"].head()
Out[186]:
NY-WEB01 NY-WEB01
DateTime
2012-10-18 16:00:00 5.6 2.8
2012-10-18 17:00:00 18.6 12.0
2012-10-18 18:00:00 18.4 12.0
2012-10-18 19:00:00 18.2 12.0
2012-10-18 20:00:00 19.2 12.0
How might I collapse the NY-WEB01 columns (there are a bunch of duplicate columns, not just NY-WEB01) by summing each row where the column name is the same?
I believe this does what you are after:
df.groupby(lambda x: x, axis=1).sum()
Alternatively, between 3% and 15% faster depending on the length of the df:
df.groupby(df.columns, axis=1).sum()
EDIT: To extend this beyond sums, use .agg() (short for .aggregate()):
df.groupby(df.columns, axis=1).agg(numpy.max)
pandas >= 0.20: df.groupby(level=0, axis=1)
You don't need a lambda here, nor do you explicitly have to query df.columns; groupby accepts a level argument you can specify in conjunction with the axis argument. This is cleaner, IMO.
# Setup
np.random.seed(0)
df = pd.DataFrame(np.random.choice(50, (5, 5)), columns=list('AABBB'))
df
A A B B B
0 44 47 0 3 3
1 39 9 19 21 36
2 23 6 24 24 12
3 1 38 39 23 46
4 24 17 37 25 13
df.groupby(level=0, axis=1).sum()
A B
0 91 6
1 48 76
2 29 60
3 39 108
4 41 75
Handling MultiIndex columns
Another case to consider is when dealing with MultiIndex columns. Consider
df.columns = pd.MultiIndex.from_arrays([['one']*3 + ['two']*2, df.columns])
df
one two
A A B B B
0 44 47 0 3 3
1 39 9 19 21 36
2 23 6 24 24 12
3 1 38 39 23 46
4 24 17 37 25 13
To perform aggregation across the upper levels, use
df.groupby(level=1, axis=1).sum()
A B
0 91 6
1 48 76
2 29 60
3 39 108
4 41 75
or, to aggregate duplicate columns separately within each upper-level group (keeping both levels), use
df.groupby(level=[0, 1], axis=1).sum()
one two
A B B
0 91 0 6
1 48 19 57
2 29 24 36
3 39 39 69
4 41 37 38
Alternate Interpretation: Dropping Duplicate Columns
If you came here looking to find out how to simply drop duplicate columns (without performing any aggregation), use Index.duplicated:
df.loc[:, ~df.columns.duplicated()]
A B
0 44 0
1 39 19
2 23 24
3 1 39
4 24 37
Or, to keep the last ones, specify keep='last' (default is 'first'),
df.loc[:, ~df.columns.duplicated(keep='last')]
A B
0 47 3
1 9 36
2 6 12
3 38 46
4 17 13
The groupby alternatives for the two solutions above are df.groupby(level=0, axis=1).first(), and ... .last(), respectively.
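A compatibility note (my addition, not from the answers above): grouping along columns with axis=1 is deprecated as of pandas 2.1, so on newer versions an equivalent spelling is to transpose, group on the index, and transpose back:
# assumes the AABBB sample frame from above
df.T.groupby(level=0).sum().T   # same result as df.groupby(level=0, axis=1).sum()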
Here is a possibly simpler solution for common aggregation functions like sum, mean, median, max, min, and std: just pass axis=1 to work with columns, together with the level parameter:
# coldspeed's sample data
np.random.seed(0)
df = pd.DataFrame(np.random.choice(50, (5, 5)), columns=list('AABBB'))
print(df)
print(df.sum(axis=1, level=0))
A B
0 91 6
1 48 76
2 29 60
3 39 108
4 41 75
df.columns = pd.MultiIndex.from_arrays([['one']*3 + ['two']*2, df.columns])
print(df.sum(axis=1, level=1))
A B
0 91 6
1 48 76
2 29 60
3 39 108
4 41 75
print(df.sum(axis=1, level=[0, 1]))
one two
A B B
0 91 0 6
1 48 19 57
2 29 24 36
3 39 39 69
4 41 37 38
It works similarly for the index; just use axis=0 instead of axis=1:
np.random.seed(0)
df = pd.DataFrame(np.random.choice(50, (5, 5)), columns=list('ABCDE'), index=list('aabbc'))
print(df)
A B C D E
a 44 47 0 3 3
a 39 9 19 21 36
b 23 6 24 24 12
b 1 38 39 23 46
c 24 17 37 25 13
print(df.min(axis=0, level=0))
A B C D E
a 39 9 0 3 3
b 1 6 24 23 12
c 24 17 37 25 13
df.index = pd.MultiIndex.from_arrays([['bar']*3 + ['foo']*2, df.index])
print(df.mean(axis=0, level=1))
A B C D E
a 41.5 28.0 9.5 12.0 19.5
b 12.0 22.0 31.5 23.5 29.0
c 24.0 17.0 37.0 25.0 13.0
print(df.max(axis=0, level=[0, 1]))
A B C D E
bar a 44 47 19 21 36
b 23 6 24 24 12
foo b 1 38 39 23 46
c 24 17 37 25 13
If you need other functions like first, last, size, or count, use the groupby approach from coldspeed's answer instead.
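One more compatibility note (my addition): the level argument to reductions such as sum, mean, min, and max was deprecated in pandas 1.3 and removed in pandas 2.0, so on current versions spell these as groupby calls:
df.groupby(level=0).min()        # replaces df.min(axis=0, level=0)
df.groupby(level=[0, 1]).max()   # replaces df.max(axis=0, level=[0, 1])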
