What I want to do
I would like to count the number of rows with conditions. Each column should have different numbers.
import numpy as np
import pandas as pd
## Sample DataFrame
data = [[1, 2], [0, 3], [np.nan, np.nan], [1, -1]]
index = ['i1', 'i2', 'i3', 'i4']
columns = ['c1', 'c2']
df = pd.DataFrame(data, index=index, columns=columns)
print(df)
## Output
# c1 c2
# i1 1.0 2.0
# i2 0.0 3.0
# i3 NaN NaN
# i4 1.0 -1.0
## Question 1: Count non-NaN values
## Expected result
# [3, 3]
## Question 2: Count non-zero numerical values
## Expected result
# [2, 3]
Note: Data types of results are not important. They can be list, pandas.Series, pandas.DataFrame etc. (I can convert data types anyway.)
What I have checked
## For Question 1
print(df[df['c1'].apply(lambda x: not pd.isna(x))].count())
## For Question 2
print(df[df['c1'] != 0].count())
Obviously these two print calls only cover column c1. It's easy to check one column at a time, but I would like to know if there is a way to calculate the counts for all columns at once.
Environment
Python 3.10.5
pandas 1.4.3
You do not need to iterate over your data using apply. You can achieve your results in a vectorized fashion:
print(df.notna().sum().to_list()) # [3, 3]
print((df.ne(0) & df.notna()).sum().to_list()) # [2, 3]
Note that I have assumed that "Question 2: Count non-zero values" also excluded nan values, otherwise you would get [3, 4].
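As a quick check of the counts above, here is a minimal sketch that rebuilds the sample frame from the question. It uses DataFrame.count() as a shorthand for the non-NaN count and fills NaN with 0 before comparing against zero; the variable names (non_nan, non_zero, naive) are only illustrative.

```python
import numpy as np
import pandas as pd

# Sample frame from the question
df = pd.DataFrame(
    [[1, 2], [0, 3], [np.nan, np.nan], [1, -1]],
    index=["i1", "i2", "i3", "i4"],
    columns=["c1", "c2"],
)

# Question 1: count() skips NaN, so it counts non-NaN values per column
non_nan = df.count()

# Question 2: treat NaN as 0 first, then count what is left as non-zero
non_zero = (df.fillna(0) != 0).sum()

# Without the NaN handling, NaN != 0 evaluates to True and is counted too
naive = df.ne(0).sum()

print(non_nan.to_list())   # [3, 3]
print(non_zero.to_list())  # [2, 3]
print(naive.to_list())     # [3, 4]
```

All three results are pandas.Series indexed by column name, so they can be converted with to_list() or kept as they are.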
You were close, I think! To answer your first question:
>>> df.apply(lambda x: x.notna().sum(), axis=0)
c1 3
c2 3
dtype: int64
Change to axis=1 to apply this operation to each row instead.
To answer your second question, this comes from an already answered question on SO. Note that astype(bool) turns NaN into True, so NaN values are counted as non-zero here:
>>> df.astype(bool).sum(axis=0)
c1 3
c2 4
dtype: int64
In the same way, you can change axis to 1 if you want.
Hope it helps!
I have a pandas DataFrame with the following columns:
df = pd.DataFrame([
['A2', 2],
['B1', 1],
['A1', 2],
['A2', 1],
['B1', 2],
['A1', 1]],
columns=['one','two'])
Which I am hoping to sort primarily by column 'two', then by column 'one'. For the secondary sort, I would like to use a custom sorting rule that will sort column 'one' by the alphabetic character [A-Z] and then the trailing numeric number [0-100]. So, the outcome of the sort would be:
one two
A1 1
B1 1
A2 1
A1 2
B1 2
A2 2
I have sorted a list of strings similar to column 'one' before using a sorting rule like so:
def custom_sort(value):
    return (value[0], int(value[1:]))
my_list.sort(key=custom_sort)
If I try to apply this rule via a pandas sort, I run into a number of issues including:
The pandas DataFrame.sort_values() function accepts a key for sorting like the sort() function, but the key function should be vectorized (per the pandas documentation). If I try to apply the sorting key to only column 'one', I get the error "TypeError: cannot convert the series to <class 'int'>"
When you use the pandas DataFrame.sort_values() method, it applies the sort key to all columns you pass in. This will not work since I want to sort first by the column 'two' using a native numerical sort.
How would I go about sorting the DataFrame as mentioned above?
You can split column one into its constituent parts, add them as columns to the dataframe and then sort on them with column two. Finally, remove the temporary columns.
>>> (df.assign(lhs=df['one'].str[0], rhs=df['one'].str[1:].astype(int))
.sort_values(['two', 'rhs', 'lhs'])
.drop(columns=['lhs', 'rhs']))
one two
5 A1 1
1 B1 1
3 A2 1
2 A1 2
4 B1 2
0 A2 2
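If you would rather not create temporary columns, one alternative is to chain stable sorts, starting with the least significant key and ending with the most significant one. The sketch below assumes that kind="mergesort" is used for every step (mergesort is stable, so earlier orderings survive ties) and that each key function is vectorized, i.e. it receives the whole column as a Series. The lambda keys are illustrative, not part of the answer above.

```python
import pandas as pd

df = pd.DataFrame(
    [["A2", 2], ["B1", 1], ["A1", 2], ["A2", 1], ["B1", 2], ["A1", 1]],
    columns=["one", "two"],
)

result = (
    df
    # least significant key: the leading letter of column 'one'
    .sort_values("one", key=lambda s: s.str[0], kind="mergesort")
    # next key: the trailing number, converted with a vectorized astype
    # (calling int(s) on the whole Series is what raises the TypeError)
    .sort_values("one", key=lambda s: s.str[1:].astype(int), kind="mergesort")
    # most significant key: column 'two', sorted natively
    .sort_values("two", kind="mergesort")
)

print(result.index.tolist())  # [5, 1, 3, 2, 4, 0]
```

This performs three passes over the data, so the temporary-column version is likely the better choice for large frames.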
Use str.extract to create some temporary columns based on 1) the letters ([a-zA-Z]+) and 2) the number (\d+), sort on them, and then drop them:
df = pd.DataFrame([
['A2', 2],
['B1', 1],
['A1', 2],
['A2', 1],
['B1', 2],
['A1', 1]],
columns=['one','two'])
df['one-letter'] = df['one'].str.extract(r'([a-zA-Z]+)', expand=False)
# convert to int so that e.g. '10' sorts after '2'
df['one-number'] = df['one'].str.extract(r'(\d+)', expand=False).astype(int)
df = df.sort_values(['two', 'one-number', 'one-letter']).drop(['one-letter', 'one-number'], axis=1)
df
Out[38]:
one two
5 A1 1
1 B1 1
3 A2 1
2 A1 2
4 B1 2
0 A2 2
One of the solutions is to make both columns pd.Categorical and pass the expected order as an argument "categories".
But I have some requirements where I cannot coerce unknown/unexpected values, and unfortunately that is what pd.Categorical does. Also, None is not supported as a category and is coerced automatically.
So my solution was to use a key to sort on multiple columns with a custom sorting order:
import pandas as pd
df = pd.DataFrame([
    ['A2', 2],
    ['B1', 1],
    ['A1', 2],
    ['A2', 1],
    ['B1', 2],
    ['A1', 1]],
    columns=['one', 'two'])
def custom_sorting(col: pd.Series) -> pd.Series:
    """Series is input and ordered series is expected as output"""
    to_ret = col
    # apply custom sorting only to column one:
    if col.name == "one":
        custom_dict = {}
        # for example ensure that A2 is first, pass items in sorted order here:
        def custom_sort(value):
            return (value[0], int(value[1:]))
        ordered_items = list(col.unique())
        ordered_items.sort(key=custom_sort)
        # apply custom order first:
        for index, item in enumerate(ordered_items):
            custom_dict[item] = index
        to_ret = col.map(custom_dict)
    # default sorting is applied to the other columns
    return to_ret
# pass two columns to be sorted
df.sort_values(
    by=["two", "one"],
    ascending=True,
    inplace=True,
    key=custom_sorting,
)
print(df)
Output:
5 A1 1
3 A2 1
1 B1 1
2 A1 2
0 A2 2
4 B1 2
Be aware that this solution can be slow.
I have created a function to solve the issue of using the key argument for multiple columns, following the suggestion from @Alexander. It also avoids duplicating names when creating the temporary columns. Furthermore, it can sort the whole dataframe including the index (using index.names).
It can be improved, but using copy-paste should work:
https://github.com/DavidDB33/pandas_helpers/blob/main/pandas_helpers/helpers.py
With pandas >= 1.1.0 and natsort, you can also do this now:
import natsort
sorted_df = df.sort_values(["one", "two"], key=natsort.natsort_keygen())
I am comparing two DataFrames. .equals() gives me False, but if I append the two DataFrames together and use drop_duplicates(), no extra rows are left. Can someone explain this?
TL;DR
These are completely different operations and I'd have never expected them to produce the same results.
pandas.DataFrame.equals
Will return a boolean value depending on whether Pandas determines that the dataframes being compared are the "same". That means that the index of one is the "same" as the index of the other, the columns of one are the "same" as the columns of the other, and the data of one is the "same" as the data of the other.
See docs
It is NOT the same as pandas.DataFrame.eq which will return a dataframe of boolean values.
Setup
Consider these three dataframes
df0 = pd.DataFrame([[0, 1], [2, 3]], [0, 1], ['A', 'B'])
df1 = pd.DataFrame([[1, 0], [3, 2]], [0, 1], ['B', 'A'])
df2 = pd.DataFrame([[0, 1], [2, 3]], ['foo', 'bar'], ['A', 'B'])
df0          df1          df2
   A  B         B  A           A  B
0  0  1      0  1  0      foo  0  1
1  2  3      1  3  2      bar  2  3
If we check whether df1 equals df0, we get
df0.equals(df1)
False
Even though all elements are the same
df0.eq(df1).all().all()
True
And that is because the columns are not aligned. If I sort the columns then ...
df0.equals(df1.sort_index(axis=1))
True
pandas.DataFrame.drop_duplicates
Compares the values in rows and doesn't care about the index.
So, both of these produce the same looking results
df0.append(df2).drop_duplicates()
and
df0.append(df1, sort=True).drop_duplicates()
A B
0 0 1
1 2 3
When I append (or use pandas.concat), Pandas aligns the columns and adds the appended dataframe as new rows. Then drop_duplicates does its thing. It was this inherent alignment of the columns that did what I did above with sort_index and axis=1.
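The difference between the two operations can be reproduced with the three dataframes from the setup. This is a minimal sketch that uses pandas.concat in place of DataFrame.append (append was removed in pandas 2.0); the variable names are illustrative.

```python
import pandas as pd

df0 = pd.DataFrame([[0, 1], [2, 3]], [0, 1], ["A", "B"])
df1 = pd.DataFrame([[1, 0], [3, 2]], [0, 1], ["B", "A"])
df2 = pd.DataFrame([[0, 1], [2, 3]], ["foo", "bar"], ["A", "B"])

# equals() is strict about layout: same values, but different column order
same_layout = df0.equals(df1)

# After putting df1's columns in df0's order, the frames are equal
same_after_align = df0.equals(df1[df0.columns])

# concat aligns the columns by label, then drop_duplicates compares row values
stacked_cols = pd.concat([df0, df1], sort=True).drop_duplicates()

# The index labels differ here, but drop_duplicates ignores the index
stacked_index = pd.concat([df0, df2]).drop_duplicates()

print(same_layout, same_after_align)  # False True
print(len(stacked_cols), len(stacked_index))  # 2 2
```

In both stacked results only the two rows of df0 survive, which is why equals() can report False while drop_duplicates() appears to find nothing new.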
Maybe the rows in both dataframes are not ordered the same way? Dataframes are equal only when the rows corresponding to the same index are the same.
Given two DataFrames (t1, t2), both with a column 'x', how would I append a column to t1 with the ID of t2 whose 'x' value is the nearest to the 'x' value in t1?
t1:
id x
1 1.49
2 2.35
t2:
id x
3 2.36
4 1.5
output:
id id2
1 4
2 3
I can do this by creating a new DataFrame, iterating on t1.groupby(), doing lookups on t2 and then merging, but this takes incredibly long given a 17 million row t1 DataFrame.
Is there a better way to accomplish this? I've scoured the pandas docs regarding groupby, apply, transform, agg, etc., but an elegant solution has yet to present itself, despite my thought that this would be a common problem.
Using merge_asof
df = pd.merge_asof(df1.sort_values('x'),
df2.sort_values('x'),
on='x',
direction='nearest',
suffixes=['', '_2'])
print(df)
Out[975]:
id x id_2
0 3 0.87 6
1 1 1.49 5
2 2 2.35 4
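Applied to the two small tables from the question, the merge_asof approach can be sketched as follows. The suffixes value ("", "2") is an assumption chosen so that the id column coming from t2 ends up named id2, as in the desired output.

```python
import pandas as pd

t1 = pd.DataFrame({"id": [1, 2], "x": [1.49, 2.35]})
t2 = pd.DataFrame({"id": [3, 4], "x": [2.36, 1.5]})

# merge_asof requires both frames to be sorted on the key column
matched = pd.merge_asof(
    t1.sort_values("x"),
    t2.sort_values("x"),
    on="x",
    direction="nearest",  # pick the closest x on either side
    suffixes=("", "2"),   # t1's id stays 'id', t2's id becomes 'id2'
)

print(matched[["id", "id2"]])
#    id  id2
# 0   1    4
# 1   2    3
```

The result is ordered by x rather than by the original row order of t1, so sort it again afterwards if the original order matters.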
Method 2 reindex
df1['id2']=df2.set_index('x').reindex(df1.x,method='nearest').values
df1
id x id2
0 1 1.49 4
1 2 2.35 3
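Convert t1 and t2 to lists, sort each of them by 'x',
and then match the ids with the zip() function:
list1 = t1.values.tolist()
list2 = t2.values.tolist()
list1.sort(key=lambda row: row[1])  # sort by 'x'; ascending or descending, you decide
list2.sort(key=lambda row: row[1])
# rows are paired by rank, so this only works when both lists have the same length
list3 = [(int(a[0]), int(b[0])) for a, b in zip(list1, list2)]
print(list3)
# after that you should see output like [(1, 4), (2, 3)]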
You can calculate a new array with the distance from each element in t1 to each element in t2, and then take the argmin along the rows to get the right index. This has the advantage that you can choose whatever distance function you like, and it does not require the dataframes to be of equal length.
It creates one intermediate array of size len(t1) * len(t2). Using a pandas builtin might be more memory-efficient, but this should be as fast as you can get as everything is done on the C side of numpy. You could always do this method in batches if memory is a problem.
import numpy as np
import pandas as pd
t1 = pd.DataFrame({"id": [1, 2], "x": np.array([1.49, 2.35])})
t2 = pd.DataFrame({"id": [3, 4], "x": np.array([2.36, 1.5])})
Now comes the part doing the actual work. The .to_numpy() bit is important since otherwise Pandas tries to align on the indices. The dist line uses broadcasting to create horizontal and vertical "repetitions" in a memory-efficient way.
x1 = t1["x"].to_numpy()
x2 = t2["x"].to_numpy()
# dist[i, j] is the distance between row i of t1 and row j of t2
dist = np.abs(x1[:, np.newaxis] - x2[np.newaxis, :])
# for each row of t1, the position of the nearest row in t2
closest_idx = np.argmin(dist, axis=1)
closest_id = t2["id"].to_numpy()[closest_idx]
output = pd.DataFrame({"id1": t1["id"], "id2": closest_id})
print(output)
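The batching idea mentioned above could look roughly like this. It is only a sketch: nearest_ids_batched and batch_size are hypothetical names, and the distance is the absolute difference, as in the code above. Each batch builds an intermediate array of at most batch_size * len(x2) elements instead of len(x1) * len(x2).

```python
import numpy as np

def nearest_ids_batched(x1, x2, ids2, batch_size=1024):
    """For every value in x1, return the id (from ids2) of the nearest value in x2."""
    x1 = np.asarray(x1, dtype=float)
    x2 = np.asarray(x2, dtype=float)
    ids2 = np.asarray(ids2)
    out = np.empty(len(x1), dtype=ids2.dtype)
    for start in range(0, len(x1), batch_size):
        chunk = x1[start:start + batch_size]
        # shape (len(chunk), len(x2)): distance from each chunk value to each x2 value
        dist = np.abs(chunk[:, np.newaxis] - x2[np.newaxis, :])
        out[start:start + batch_size] = ids2[np.argmin(dist, axis=1)]
    return out

# Values from the question; batch_size=1 just forces more than one batch
ids = nearest_ids_batched([1.49, 2.35], [2.36, 1.5], [3, 4], batch_size=1)
print(ids.tolist())  # [4, 3]
```

The total work is still proportional to len(x1) * len(x2), so for very large inputs a sort-based method such as merge_asof will likely scale better.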
Alternatively, you can round to 1 decimal place and merge on the rounded value:
t1 = {'id': [1, 2], 'x': [1.49,2.35]}
t2 = {'id': [3, 4], 'x': [2.36,1.5]}
df1 = pd.DataFrame(t1)
df2 = pd.DataFrame(t2)
df = df1.round(1).merge(df2.round(1), on='x', suffixes=('', '2')).drop(columns='x')
print(df)
id id2
0 1 4
1 2 3
Add .drop(columns='x') to remove the merge column 'x' from the output.
Add suffixes=('', '2') to rename the column titles.
I have a pandas dataframe following the form in the example below:
data = {'id': [1,1,1,1,2,2,2,2,3,3,3], 'a': [-1,1,1,0,0,0,-1,1,-1,0,0], 'b': [1,0,0,-1,0,1,1,-1,-1,1,0]}
df = pd.DataFrame(data)
Now, what I want to do is create a pivot table such that for each of the columns except the id, I will have 3 new columns corresponding to the values. That is, for column a, I will create a_neg, a_zero and a_pos. Similarly, for b, I will create b_neg, b_zero and b_pos. The values for these new columns would correspond to the number of times those values appear in the original a and b column. The final dataframe should look like this:
result = {'id': [1,2,3], 'a_neg': [1, 1, 1],
'a_zero': [1, 2, 2], 'a_pos': [2, 1, 0],
'b_neg': [1, 1, 1], 'b_zero': [2,1,1], 'b_pos': [1,2,1]}
df_result = pd.DataFrame(result)
Now, to do this, I can do the following steps and arrive at my final answer:
by_a = df.groupby(['id', 'a']).count().reset_index().pivot(index='id', columns='a', values='b').fillna(0).astype(int)
by_a.columns = ['a_neg', 'a_zero', 'a_pos']
by_b = df.groupby(['id', 'b']).count().reset_index().pivot(index='id', columns='b', values='a').fillna(0).astype(int)
by_b.columns = ['b_neg', 'b_zero', 'b_pos']
df_result = by_a.join(by_b).reset_index()
However, I believe that that method is not optimal especially if I have a lot of original columns aside from a and b. Is there a shorter and/or more efficient solution for getting what I want to achieve here? Thanks.
A shorter solution, though still quite inefficient:
In [11]: df1 = df.set_index("id")
In [12]: g = df1.groupby(level=0)
In [13]: g.apply(lambda x: x.apply(lambda x: x.value_counts())).fillna(0).astype(int).unstack(1)
Out[13]:
    a        b
   -1  0  1 -1  0  1
id
1   1  1  2  1  2  1
2   1  2  1  1  1  2
3   1  2  0  1  1  1
Note: I think you should be aiming for the multi-index columns.
I'm reasonably sure I've seen a trick to remove the apply/value_count/fillna with something cleaner and more efficient, but at the moment it eludes me...
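One possible way to avoid the nested apply is to reshape the data to long form and cross-tabulate it. This is a sketch rather than the trick alluded to above: it assumes the values are limited to -1, 0 and 1, and the labels mapping and the names long and counts are illustrative.

```python
import pandas as pd

data = {"id": [1, 1, 1, 1, 2, 2, 2, 2, 3, 3, 3],
        "a": [-1, 1, 1, 0, 0, 0, -1, 1, -1, 0, 0],
        "b": [1, 0, 0, -1, 0, 1, 1, -1, -1, 1, 0]}
df = pd.DataFrame(data)

# Long form: one row per (id, original column, value)
long = df.melt(id_vars="id", var_name="col", value_name="val")

# Count how often each (column, value) pair occurs for every id;
# the result has MultiIndex columns (col, val)
counts = pd.crosstab(long["id"], [long["col"], long["val"]])

# Optional: flatten the MultiIndex columns into names like a_neg, a_zero, a_pos
labels = {-1: "neg", 0: "zero", 1: "pos"}
counts.columns = [f"{col}_{labels[val]}" for col, val in counts.columns]
df_result = counts.reset_index()

print(df_result)
```

Skip the flattening step if you prefer to keep the multi-index columns. Note that a (column, value) pair that never occurs in the data gets no column at all, so it would need to be added with reindex.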
my input dataframe (shortened) looks like this:
>>> import numpy as np
>>> import pandas as pd
>>> df_in = pd.DataFrame([[1, 2, 'a', 3, 4], [6, 7, 'b', 8, 9]],
... columns=(['c1', 'c2', 'col', 'c3', 'c4']))
>>> df_in
c1 c2 col c3 c4
0 1 2 a 3 4
1 6 7 b 8 9
It is supposed to be manipulated, i.e.
if row (sample) in column 'col' (feature) has a specific value (e.g., 'b' here),
then convert the entries in columns 'c1' and 'c2' in the same row to np.nan.
Result wanted:
>>> df_out = pd.DataFrame([[1, 2, 'a', 3, 4], [np.nan, np.nan, 'b', 8, 9]],
... columns=(['c1', 'c2', 'col', 'c3', 'c4']))
>>> df_out
c1 c2 col c3 c4
0 1 2 a 3 4
1 NaN NaN b 8 9
So far, I have managed to obtain the desired result via the code
>>> dic = {'col' : ['c1', 'c2']} # auxiliary
>>> b_w = df_in[df_in['col'] == 'b'] # Subset with 'b' in 'col'
>>> b_w = b_w.drop(dic['col'], axis=1) # ...inject np.nan in 'c1', 'c2'
>>> b_wo = df_in[df_in['col'] != 'b'] # Subset without 'b' in 'col'
>>> df_out = pd.concat([b_w, b_wo]) # Both Subsets together again
>>> df_out
c1 c2 c3 c4 col
1 NaN NaN 8 9 b
0 1.0 2.0 3 4 a
Although I get what I want (the original data consists entirely of floats, so don't
worry about the change from int to float here), it is a rather inelegant
snippet of code. My educated guess is that this could be done faster
by using the built-in functions from pandas and numpy, but I have been unable to manage this.
Any suggestions how to code this in a fast and efficient way for daily use? Any help is highly appreciated. :)
You can condition on both the row and column positions to assign values using loc, which supports both boolean indexing and label-based indexing:
df_in.loc[df_in.col == 'b', ['c1', 'c2']] = np.nan
df_in
# c1 c2 col c3 c4
# 0 1.0 2.0 a 3 4
# 1 NaN NaN b 8 9
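If the original frame should stay untouched, the same loc assignment can be wrapped in a small helper that works on a copy. This is a sketch: set_nan_where and its parameters are hypothetical names, and the explicit cast to float is an assumption made so that NaN can be stored in integer columns without relying on implicit upcasting.

```python
import numpy as np
import pandas as pd

def set_nan_where(df, flag_col, flag_values, target_cols):
    """Return a copy of df with target_cols set to NaN where flag_col is in flag_values."""
    # astype returns a new frame; float columns can hold NaN
    out = df.astype({col: float for col in target_cols})
    mask = out[flag_col].isin(flag_values)
    out.loc[mask, target_cols] = np.nan
    return out

df_in = pd.DataFrame([[1, 2, 'a', 3, 4], [6, 7, 'b', 8, 9]],
                     columns=['c1', 'c2', 'col', 'c3', 'c4'])
df_out = set_nan_where(df_in, 'col', ['b'], ['c1', 'c2'])
print(df_out)
```

Because isin accepts a list, several marker values (for example ['b', 'c']) can be handled in one call.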
When using pandas, I would go for the solution provided by @Psidom.
However, for larger datasets it is faster when doing the whole pandas -> numpy -> pandas procedure, i.e. dataframe -> numpy.array -> dataframe (minus 10% process time for my setup). Without converting back to a dataframe, numpy is almost twice as fast for my dataset.
Solution for the question asked:
cols, df_out = df_in.columns, df_in.values
for i in [0, 1]:
    df_out[df_out[:, 2] == 'b', i] = np.nan
df_out = pd.DataFrame(df_out, columns=cols)