Replace values of two columns in pandas - python

What is the fastest way to swap the values of two columns in pandas? Say we are given:
A B
0 2
1 3
and we want to get:
A B
2 0
3 1

This code flips the values between the columns:
>>> df_name[['A', 'B']] = df_name[['B', 'A']]
>>> print(df_name)
A B
2 0
3 1

Related

Is there a way to put a dataframe as the value of a specific column in pandas python?

I have a set of data with column names and values to create a dataframe.
However, one of the column values is another dataframe. Is it possible to do this in pandas, or is each column value meant to be a single value?
For example, what I am trying to achieve would look something like this:
df
out:
    A   B    C
0  A1  B1    D   E
            D1  E1
             F   G
            F1  G1
Here, the letters with numbers attached (A1, D1, ...) are the values and the bare letters are the column names; the cell in column C holds a nested dataframe.
Yes it is possible to put another dataframe (or any type of object) in a pandas cell.
In[2]: df1 = pd.DataFrame({'a':range(2)})
df1
Out[2]:
a
0 0
1 1
In[3]: df2 = pd.DataFrame({'x':range(3), 'y':range(3)})
df2
Out[3]:
x y
0 0 0
1 1 1
2 2 2
In[4]: df1['b'] = [df2, {'cat':'meow', 'otter':'clap'}]
df1
Out[4]:
a b
0 0 x y 0 0 0 1 1 1 2 2 2
1 1 {u'otter': u'clap', u'cat': u'meow'}
In[5]: df1.at[0, 'b']  # df1.get_value(0, 'b') in older pandas; get_value was removed in pandas 1.0
Out[5]:
x y
0 0 0
1 1 1
2 2 2
As you can see, it's not very readable to print a dataframe containing another dataframe. If you want it to look like your example, you should go with a MultiIndex, as Wen suggested.
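For completeness, a minimal sketch of the MultiIndex alternative; Wen's exact layout isn't shown above, so the column structure here is an assumption:
In[6]: columns = pd.MultiIndex.from_tuples(
           [('A', ''), ('B', ''), ('C', 'D'), ('C', 'E'), ('C', 'F'), ('C', 'G')])
       df3 = pd.DataFrame([['A1', 'B1', 'D1', 'E1', 'F1', 'G1']], columns=columns)
Printing df3 then shows a two-level column header with D, E, F and G nested under C, instead of the flattened repr above.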

pandas - number of unique rows occurrences in dataframe

How can I count the number of occurrences of each unique row in a DataFrame?
data = {'x1': ['A','B','A','A','B','A','A','A'], 'x2': [1,3,2,2,3,1,2,3]}
df = pd.DataFrame(data)
df
x1 x2
0 A 1
1 B 3
2 A 2
3 A 2
4 B 3
5 A 1
6 A 2
7 A 3
And I would like to obtain
x1 x2 count
0 A 1 2
1 A 2 3
2 A 3 1
3 B 3 2
IIUC you can group on both columns and take the group sizes, naming the resulting column with reset_index:
In [100]:
df.groupby(['x1','x2']).size().reset_index(name='count')
Out[100]:
x1 x2 count
0 A 1 2
1 A 2 3
2 A 3 1
3 B 3 2
You could also drop duplicated rows:
In [4]: df.shape[0]
Out[4]: 8
In [5]: df.drop_duplicates().shape[0]
Out[5]: 4
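In pandas 1.1+ there is also DataFrame.value_counts, which returns the same counts as a Series sorted by frequency; a one-line sketch:
In [6]: df.value_counts().reset_index(name='count')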
There are two ways you can find unique row occurrences in your dataframe.
1st: Using drop_duplicates
df.drop_duplicates().sort_values('x1',ignore_index=True)
2nd: Using groupby.nunique
df.groupby(['x1','x2'], as_index=False).nunique()
For finding the number of occurrences, the answer from @EdChum above works precisely.

Python: given list of columns and list of values, return subset of dataframe that meets all criteria

I have a dataframe like the following.
df = pd.DataFrame({'A' : ['Bob','Jean','Sally','Sue'], 'B' : [1,2,3, 2],'C' : [7,8,9,8] })
Assume that column A will always be in the dataframe, but sometimes there could be column B; columns B and C; or any number of additional columns.
I have written code that saves the column names (other than A) in a list, as well as the unique permutations of the values in those columns. For instance, in this example, columns B and C are saved into col:
col = ['B','C']
The permutations in this simple df are 1,7; 2,8; and 3,9. For simplicity, assume one permutation is saved as follows:
permutation = [2,8]
How do I select the entire rows (and only those rows) that match that permutation?
Right now, I am using:
a[a[col].isin(permutation)]
Unfortunately, I don't get the values in column A.
(I know how to drop the NaN values later.) But how should I do this to keep it dynamic? Sometimes there will be multiple columns. (Ultimately, I'll run through a loop and save the different iterations, based upon multiple permutations in the columns other than A.)
Use the intersection of boolean series (where both conditions are true) - first setup code:
import pandas as pd
df = pd.DataFrame({'A' : ['Bob','Jean','Sally','Sue'], 'B' : [1,2,3, 2],'C' : [7,8,9,8] })
col = ['B','C']
permutation = [2,8]
And here's the solution for this limited example:
>>> df[(df[col[0]] == permutation[0]) & (df[col[1]] == permutation[1])]
A B C
1 Jean 2 8
3 Sue 2 8
To break that down:
>>> b, c = col
>>> per_b, per_c = permutation
>>> column_b_matches = df[b] == per_b
>>> column_c_matches = df[c] == per_c
>>> intersection = column_b_matches & column_c_matches
>>> df[intersection]
A B C
1 Jean 2 8
3 Sue 2 8
Additional columns and values
To take any number of columns and values, I would create a function:
def select_rows(df, columns, values):
    if not columns or not values:
        raise Exception('must pass columns and values')
    if len(columns) != len(values):
        raise Exception('columns and values must be same length')
    intersection = True
    for c, v in zip(columns, values):
        intersection &= df[c] == v
    return df[intersection]
and to use it:
>>> select_rows(df, col, permutation)
A B C
1 Jean 2 8
3 Sue 2 8
Or you can coerce the permutation to an array and accomplish this with a single comparison, assuming numeric values:
import numpy as np
def select_rows(df, columns, values):
    # compare the selected columns against the values array, position by position
    return df[(df[columns] == np.array(values)).all(axis=1)]
But this does not work with your code sample as given.
I figured out a solution. Aaron's answer above works well if I only have two columns, but I need a solution that works regardless of the size of the df (it will have 3-7 columns).
df = pd.DataFrame({'A' : ['Bob','Jean','Sally','Sue'], 'B' : [1,2,3, 2],'C' : [7,8,9,8] })
permutation = [2,8]
col = ['B','C']
interim = df[col].isin(permutation)
df[df.index.isin(interim[(interim != 0).all(1)].index)]
you can do it this way:
In [77]: permutation = np.array([0,2,2])
In [78]: col
Out[78]: ['a', 'b', 'c']
In [79]: df.loc[(df[col] == permutation).all(axis=1)]
Out[79]:
a b c
10 0 2 2
15 0 2 2
16 0 2 2
your solution will not always work properly:
sample DF:
In [71]: df
Out[71]:
a b c
0 0 2 1
1 1 1 1
2 0 1 2
3 2 0 1
4 0 1 0
5 2 0 0
6 2 0 0
7 0 1 0
8 2 1 0
9 0 0 0
10 0 2 2
11 1 0 1
12 2 1 1
13 1 0 0
14 2 1 0
15 0 2 2
16 0 2 2
17 1 0 2
18 0 1 1
19 1 2 0
In [67]: col = ['a','b','c']
In [68]: permutation = [0,2,2]
In [69]: interim = df[col].isin(permutation)
Pay attention to the result: isin() only tests membership cell by cell, so any row built entirely of values from the permutation list matches, regardless of which column holds which value:
In [70]: df[df.index.isin(interim[(interim != 0).all(1)].index)]
Out[70]:
a b c
5 2 0 0
6 2 0 0
9 0 0 0
10 0 2 2
15 0 2 2
16 0 2 2
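A variant that does not depend on column order is to compare against a Series indexed by the column names, so pandas aligns each value with its column by label (a sketch reusing col and permutation from above):
In [80]: target = pd.Series(permutation, index=col)
In [81]: df.loc[(df[col] == target).all(axis=1)]
Out[81]:
    a  b  c
10  0  2  2
15  0  2  2
16  0  2  2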

pandas - Going from aggregated format to long format

If I wanted to go from a long format to a grouped, aggregated format, I would simply do:
s = pd.DataFrame(['a','a','a','a','b','b','c'], columns=['value'])
s.groupby('value').size()
value
a 4
b 2
c 1
dtype: int64
Now, if I wanted to revert that aggregation and go from the grouped format back to the long format, how would I go about doing that? I guess I could loop through the grouped series and repeat 'a' 4 times, 'b' 2 times, and so on.
Is there a better way to do this in pandas or any other Python package?
I'd be thankful for any hints.
Perhaps .transform can help with this:
s.set_index('value', drop=False, inplace=True)
s['size'] = s.groupby(level='value')['value'].transform('size')
s.reset_index(inplace=True, drop=True)
s
yielding:
value size
0 a 4
1 a 4
2 a 4
3 a 4
4 b 2
5 b 2
6 c 1
Another, rather simple, approach is to use np.repeat (assuming s2 is the aggregated series):
In [17]: np.repeat(s2.index.values, s2.values)
Out[17]: array(['a', 'a', 'a', 'a', 'b', 'b', 'c'], dtype=object)
In [18]: pd.DataFrame(np.repeat(s2.index.values, s2.values), columns=['value'])
Out[18]:
value
0 a
1 a
2 a
3 a
4 b
5 b
6 c
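Since the aggregated object is a Series, its own index can do the repetition too, staying entirely within pandas (a small sketch with the same s2), which yields the same frame as above:
In [19]: pd.DataFrame({'value': s2.index.repeat(s2.values)})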
There might be something cleaner, but here's an approach. First, store your groupby results in a dataframe and rename the columns.
agg = s.groupby('value').size().reset_index()
agg.columns = ['key', 'count']
Then, build a frame with columns that track the count for each letter.
counts = agg['count'].apply(lambda x: pd.Series([0] * x))
counts['key'] = agg['key']
In [107]: counts
Out[107]:
0 1 2 3 key
0 0 0 0 0 a
1 0 0 NaN NaN b
2 0 NaN NaN NaN c
Finally, this can be melted and the nulls dropped to get your desired frame.
In [108]: pd.melt(counts, id_vars='key').dropna()[['key']]
Out[108]:
key
0 a
1 b
2 c
3 a
4 b
6 a
9 a

How to create a new column with a tuple (or a list)?

Let's say I have the following DataFrame:
d = pd.DataFrame({ 'a': [10,20,30], 'b': [1,2,3] })
a b
0 10 1
1 20 2
2 30 3
I want to create a new column 'c' that will contain a tuple of 'a' and 'b' (per row). Something like this:
a b c
0 10 1 (10,1)
1 20 2 (20,2)
2 30 3 (30,3)
I just can't make it work, no matter what I try (I tried apply with axis=1 and had it return a tuple, a list, a Series object... nothing worked).
I saw that I can create a DataFrame and set the dtype to 'object' and then I can put tuples in a cell. How do I do it with apply?
What I'm trying to do is count distinct combinations of a and b, get the most common, and print a summary with some data on them (the data comes from other columns, say 'd' and 'e').
Is there any more elegant way to do it?
You could do it using zip (note that in Python 3, zip returns an iterator, so wrap it in list):
>>> df = pd.DataFrame({'a': [10,20,30], 'b': [1,2,3]})
>>> df["c"] = list(zip(df["a"], df["b"]))
>>> df
a b c
0 10 1 (10, 1)
1 20 2 (20, 2)
2 30 3 (30, 3)
[3 rows x 3 columns]
but usually putting a tuple in a column is the wrong way to go because pandas can't really do anything else with it at that point. If you want to count distinct combinations of a and b and do something with the associated groups of rows, you should use groupby instead:
>>> df = pd.DataFrame({'a': [10,20,30,20,30], 'b': [1,2,3,2,1]})
>>> df
a b
0 10 1
1 20 2
2 30 3
3 20 2
4 30 1
[5 rows x 2 columns]
>>> df_counts = df.groupby(["a", "b"]).size()
>>> df_counts = df_counts.sort_values(ascending=False)
>>> df_counts
a b
20 2 2
30 3 1
1 1
10 1 1
dtype: int64
"Print a summary with some data on them" is too broad to say anything useful about, but you can use groupby to perform all sorts of summary operations on the groups.
