I have a dataset given below:
a,b,c
1,1,1
1,1,1
1,1,2
2,1,2
2,1,1
2,2,1
I created crosstab with pandas:
cross_tab = pd.crosstab(index=a, columns=[b, c], rownames=['a'], colnames=['b', 'c'])
my crosstab is given as an output:
b 1 2
c 1 2 1
a
1 2 1 0
2 1 1 1
I want to iterate over this crosstab for given each a,b and c values. How can I get values such as cross_tab[a=1][b=1, c=1]? Thank you.
You can use slicers:
a,b,c = 1,1,1
idx = pd.IndexSlice
print (cross_tab.loc[a, idx[b,c]])
2
You can also reshape df by DataFrame.unstack, reorder_levels and then use loc:
a = cross_tab.unstack().reorder_levels(('a','b','c'))
print (a)
a b c
1 1 1 2
2 1 1 1
1 1 2 1
2 1 2 1
1 2 1 0
2 2 1 1
dtype: int64
print (a.loc[1,1,1])
2
You are looking for df2.xxx.get_level_values:
In [777]: cross_tab.loc[cross_tab.index.get_level_values('a') == 1,\
(cross_tab.columns.get_level_values('b') == 1)\
& (cross_tab.columns.get_level_values('c') == 1)]
Out[777]:
b 1
c 1
a
1 2
Another way to consider, albeit at loss of a little bit of readability, might be to simply use the .loc to navigate the hierarchical index generated by pandas.crosstab. Following example illustrates it:
import pandas as pd
import numpy as np
np.random.seed(1234)
df = pd.DataFrame(
{
"a": np.random.choice([1, 2], 5, replace=True),
"b": np.random.choice([11, 12, 13], 5, replace=True),
"c": np.random.choice([21, 22, 23], 5, replace=True),
}
)
df
Output
a b c
0 2 11 23
1 2 11 23
2 1 12 23
3 2 12 21
4 1 12 21
crosstab output is:
cross_tab = pd.crosstab(
index=df.a, columns=[df.b, df.c], rownames=["a"], colnames=["b", "c"]
)
cross_tab
b 11 12
c 23 21 23
a
1 0 1 1
2 2 1 0
Now let's say you want to access value when a==2, b==11 and c==23, then simply do
cross_tab.loc[2].loc[11].loc[23]
2
Why does this work? .loc allows one to select by index labels. In the dataframe output by crosstab, our erstwhile column values now become index labels. Thus, with every .loc selection we do, it gives the slice of the dataframe corresponding to that index label. Let's navigate cross_tab.loc[2].loc[11].loc[23] step by step:
cross_tab.loc[2]
yields:
b c
11 23 2
12 21 1
23 0
Name: 2, dtype: int64
Next one:
cross_tab.loc[2].loc[11]
Yields:
c
23 2
Name: 2, dtype: int64
And finally we have
cross_tab.loc[2].loc[11].loc[23]
which yields:
2
Why do I say that this reduces the readability a bit? Because to understand this selection you have to be aware of how the crosstab was created, i.e. rows are a and columns were in the order [b, c]. You have to know that to be able to interpret what cross_tab.loc[2].loc[11].loc[23] would do. But I have found that often to be a good tradeoff.
Related
for example if I have an original list:
A B
1 3
2 4
to be turned into
A B
3 1
4 2
two cents worth:
3 ways to do it
you could add a 3rd column C, copy A to C, then delete A. This would take more memory.
you could create a swap function for the values in a row, then wrap it into a loop.
you could just swap the labels of the columns. This is probably the most efficient way.
You could use rename:
df2 = df.rename(columns={'A': 'B', 'B': 'A'})
output:
B A
0 1 3
1 2 4
If order matters:
df2 = df.rename(columns={'A': 'B', 'B': 'A'})[df.columns]
output:
A B
0 3 1
1 4 2
Use DataFrame.rename with dictionary for swapping columnsnames, last check orcer by selecting columns:
df = df.rename(columns=dict(zip(df.columns, df.columns[::-1])))[df.columns]
print (df)
A B
0 3 1
1 4 2
You can also just simple use masking to change the values.
import pandas as pd
df = pd.DataFrame({"A":[1,2],"B":[3,4]})
df[["A","B"]] = df[["B","A"]].values
df
A B
0 3 1
1 4 2
for more than 2 columns:
df = pd.DataFrame({'A':[1,2,3],'B':[4,5,6],'C':[7,8,9], 'D':[10,11,12]})
print(df)
'''
A B C D
0 1 4 7 10
1 2 5 8 11
2 3 6 9 12
'''
df = df.set_axis(df.columns[::-1],axis=1)[df.columns]
print(df)
'''
A B C D
0 10 7 4 1
1 11 8 5 2
2 12 9 6 3
I assume that your list is like this:
my_list = [[1, 3], [2, 4]]
So you can use this code:
print([[each_element[1], each_element[0]] for each_element in my_list])
The output is:
[[3, 1], [4, 2]]
I am learning pandas. I'm not sure when to use the .count() function and when to use .value_counts().
count() is used to count the number of non-NA/null observations across the given axis. It works with non-floating type data as well.
Now as an example create a dataframe df
df = pd.DataFrame({"A":[10, 8, 12, None, 5, 3],
"B":[-1, None, 6, 4, None, 3],
"C":["Shreyas", "Aman", "Apoorv", np.nan, "Kunal", "Ayush"]})
Find the count of non-NA value across the row axis.
df.count(axis = 0)
Output:
A 5
B 4
C 5
dtype: int64
Find the number of non-NA/null value across the column.
df.count(axis = 1)
Output:
0 3
1 2
2 3
3 1
4 2
5 3
dtype: int64
value_counts() function returns Series containing counts of unique values. The resulting object will be in descending order so that the first element is the most frequently-occurring element. Excludes NA values by default.
So for the example shown below
s = pd.Series([3, 1, 2, 3, 4, np.nan])
s.value_counts()
The output would be:
3.0 2
4.0 1
2.0 1
1.0 1
dtype: int64
value_counts() aggregates the data and counts each unique value. You can achieve the same by using groupby which is a more broad function to aggregate data in pandas.
count() simply returns the number of non NaN/Null values in column (series) you apply it on.
df = pd.DataFrame({'Id':['A', 'B', 'B', 'C', 'D', 'E', 'F', 'F'],
'Value':[10, 20, 15, 5, 35, 20, 10, 25]})
print(df)
Id Value
0 A 10
1 B 20
2 B 15
3 C 5
4 D 35
5 E 20
6 F 10
7 F 25
# Value counts
df['Id'].value_counts()
F 2
B 2
C 1
A 1
D 1
E 1
Name: Id, dtype: int64
# Same operation but with groupby
df.groupby('Id')['Id'].count()
Id
A 1
B 2
C 1
D 1
E 1
F 2
Name: Id, dtype: int64
# Count()
df['Id'].count()
8
Example with NaN values and count:
print(df)
Id Value
0 A 10
1 B 20
2 B 15
3 NaN 5
4 D 35
5 E 20
6 F 10
7 F 25
df['Id'].count()
7
count() returns the total number of non-null values in the series.
value_counts() returns a series of the number of times each unique non-null value appears, sorted from most to least frequent.
As usual, an example is the best way to convey this:
ser = pd.Series(list('aaaabbbccdef'))
ser
>
0 a
1 a
2 a
3 a
4 b
5 b
6 b
7 c
8 c
9 d
10 e
11 f
dtype: object
ser.count()
>
12
ser.value_counts()
>
a 4
b 3
c 2
f 1
d 1
e 1
dtype: int64
Note that a dataframe has the count() method, which returns a series of the count() (scalar) value for each column in the df. However, a dataframe has no value_counts() method.
I have a dataframe like the following.
df = pd.DataFrame({'A' : ['Bob','Jean','Sally','Sue'], 'B' : [1,2,3, 2],'C' : [7,8,9,8] })
Assume that column A will always in the dataframe but sometimes the could be column B, column B and C, or multiple number of columns.
I have created a code to save the columns names (other than A) in a list as well as the unique permutations of the values in the other columns into a list. For instance, in this example, we have columns B and C saved into columns:
col = ['B','C']
The permutations in the simple df are 1,7; 2,8; 3,9. For simplicity assume one permutation is saved as follows:
permutation = [2,8]
How do I select the entire rows (and only those) that equal that permutation?
Right now, I am using:
a[a[col].isin(permutation)]
Unfortunately, I don't get the values in column A.
(I know how to drop those values that are NaN later. BuT How should I do this to keep it dynamic? Sometimes there will be multiple columns. (Ultimately, I'll run through a loop and save the different iterations) based upon multiple permutations in the columns other than A.
Use the intersection of boolean series (where both conditions are true) - first setup code:
import pandas as pd
df = pd.DataFrame({'A' : ['Bob','Jean','Sally','Sue'], 'B' : [1,2,3, 2],'C' : [7,8,9,8] })
col = ['B','C']
permutation = [2,8]
And here's the solution for this limited example:
>>> df[(df[col[0]] == permutation[0]) & (df[col[1]] == permutation[1])]
A B C
1 Jean 2 8
3 Sue 2 8
To break that down:
>>> b, c = col
>>> per_b, per_c = permutation
>>> column_b_matches = df[b] == per_b
>>> column_c_matches = df[c] == per_c
>>> intersection = column_b_matches & column_c_matches
>>> df[intersection]
A B C
1 Jean 2 8
3 Sue 2 8
Additional columns and values
To take any number of columns and values, I would create a function:
def select_rows(df, columns, values):
if not columns or not values:
raise Exception('must pass columns and values')
if len(columns) != len(values):
raise Exception('columns and values must be same length')
intersection = True
for c, v in zip(columns, values):
intersection &= df[c] == v
return df[intersection]
and to use it:
>>> select_rows(df, col, permutation)
A B C
1 Jean 2 8
3 Sue 2 8
Or you can coerce the permutation to an array and accomplish this with a single comparison, assuming numeric values:
import numpy as np
def select_rows(df, columns, values):
return df[(df[col] == np.array(values)).all(axis=1)]
But this does not work with your code sample as given
I figured out a solution. Aaron's above works well if I only have two columns. I need a solution that works regardless of the size of the df (as size will be 3-7 columns).
df = pd.DataFrame({'A' : ['Bob','Jean','Sally','Sue'], 'B' : [1,2,3, 2],'C' : [7,8,9,8] })
permutation = [2,8]
col = ['B','C']
interim = df[col].isin(permutation)
df[df.index.isin(interim[(interim != 0).all(1)].index)]
you can do it this way:
In [77]: permutation = np.array([0,2,2])
In [78]: col
Out[78]: ['a', 'b', 'c']
In [79]: df.loc[(df[col] == permutation).all(axis=1)]
Out[79]:
a b c
10 0 2 2
15 0 2 2
16 0 2 2
your solution will not always work properly:
sample DF:
In [71]: df
Out[71]:
a b c
0 0 2 1
1 1 1 1
2 0 1 2
3 2 0 1
4 0 1 0
5 2 0 0
6 2 0 0
7 0 1 0
8 2 1 0
9 0 0 0
10 0 2 2
11 1 0 1
12 2 1 1
13 1 0 0
14 2 1 0
15 0 2 2
16 0 2 2
17 1 0 2
18 0 1 1
19 1 2 0
In [67]: col = ['a','b','c']
In [68]: permutation = [0,2,2]
In [69]: interim = df[col].isin(permutation)
pay attention at the result:
In [70]: df[df.index.isin(interim[(interim != 0).all(1)].index)]
Out[70]:
a b c
5 2 0 0
6 2 0 0
9 0 0 0
10 0 2 2
15 0 2 2
16 0 2 2
How do I add a order number column to an existing DataFrame?
This is my DataFrame:
import pandas as pd
import math
frame = pd.DataFrame([[1, 4, 2], [8, 9, 2], [10, 2, 1]], columns=['a', 'b', 'c'])
def add_stats(row):
row['sum'] = sum([row['a'], row['b'], row['c']])
row['sum_sq'] = sum(math.pow(v, 2) for v in [row['a'], row['b'], row['c']])
row['max'] = max(row['a'], row['b'], row['c'])
return row
frame = frame.apply(add_stats, axis=1)
print(frame.head())
The resulting data is:
a b c sum sum_sq max
0 1 4 2 7 21 4
1 8 9 2 19 149 9
2 10 2 1 13 105 10
First, I would like to add 3 extra columns with order numbers, sorting on sum, sum_sq and max, respectively. Next, these 3 columns should be combined into one column - the mean of the order numbers - but I do know how to do that part (with apply and axis=1).
I think you're looking for rank where you mention sorting. Given your example, add:
frame['sum_order'] = frame['sum'].rank()
frame['sum_sq_order'] = frame['sum_sq'].rank()
frame['max_order'] = frame['max'].rank()
frame['mean_order'] = frame[['sum_order', 'sum_sq_order', 'max_order']].mean(axis=1)
To get:
a b c sum sum_sq max sum_order sum_sq_order max_order mean_order
0 1 4 2 7 21 4 1 1 1 1.000000
1 8 9 2 19 149 9 3 3 2 2.666667
2 10 2 1 13 105 10 2 2 3 2.333333
The rank method has some options as well, to specify the behavior in case of identical or NA-values for example.
I am querying a dataframe like below:
>>> df
A,B,C
1,1,200
1,1,433
1,1,67
1,1,23
1,2,330
1,2,356
1,2,56
1,3,30
if I do part_df = df[df['A'] == 1 & df['B'] == 2], I am able to get a sub-dataframe as
>>> part_df
A, B, C
1, 2, 330
1, 2, 356
1, 2, 56
Now i wanna make some changes to part_df like:
part_df['C'] = 0
The changes are not reflected in the original df at all. I guess it is because of numpy's array mechanism that everytime a new copy of dataframe is produced. I am wondering how do I query a dataframe with some conditions and makes changes to the selected part as the example I provided and reflect value back to original dataframe in place?
You should do this instead:
In [28]:
df.loc[(df['A'] == 1) & (df['B'] == 2),'C']=0
df
Out[28]:
A B C
0 1 1 200
1 1 1 433
2 1 1 67
3 1 1 23
4 1 2 0
5 1 2 0
6 1 2 0
7 1 3 30
[8 rows x 3 columns]
You should use loc and select the column of interest 'C' in the square brackets at the end