I have a column
df['COL_1']
and a list of numbers
num_range = list(range(200,281, 5))
The column contains either words such as UNREADABLE or NOT_PASSIVE, some of the values present in the list above (200, 205, 210, etc.), or nothing.
I am trying to count how many rows in that column contain a number from the given range.
What I have tried:
df['COL_1'].value_counts(num_range)
I am not sure what else to try; the various attempts I have made, similar to the above, have all failed.
I'm new to Python; any guidance is highly appreciated.
Python 2.7 and pandas 0.24.2
EDIT:
I was getting errors because, as other users have mentioned, my data was not numeric. I fixed this with .astype, or alternatively by redefining target_range like so:
target_range = map(str, range(200, 281, 5))
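For example, the column can be converted instead of the targets, a sketch along these lines (pd.to_numeric turns any non-numeric entries into NaN):
df['COL_1'] = pd.to_numeric(df['COL_1'], errors='coerce')  # non-numeric entries become NaN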
If you're after the total count and are not interested in the breakdown of individual counts:
target_range = range(200, 281, 5)
df["COL_1"].isin(target_range).sum()
Notice that you do not need to cast the range object to a list.
If you want the breakdown of value counts, see Corralien's answer.
Details: pandas.DataFrame.isin() is a method that returns a boolean mask.
>>> import pandas as pd
>>> # Data provided by Corralien
>>> df = pd.DataFrame({'COL_1': ['UNREADABLE', 200, 'NOT_PASSIVE', 205, 210, 200, '', 210, 180, 170, '']})
>>> target_range = range(200, 281, 5)
>>> df.isin(target_range)
COL_1
0 False
1 True
2 False
3 True
4 True
5 True
6 False
7 True
8 False
9 False
10 False
Notice I'm using df.isin() instead of df["COL_1"].isin(). If you only want to perform this operation on some of your columns, you can first select them with a list of column names, e.g. df[['COL_1', 'COL_2']]; if you want it on your entire DataFrame, simply use df.isin().
Since bool is a subtype of int, you can simply call sum() on the resulting mask to add up the 1s and 0s, which gives your final tally of the rows that match your criterion.
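Continuing the example above, summing the mask gives the total count of matching rows:
>>> df.isin(target_range).sum()
COL_1    5
dtype: int64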
IIUC, you can try:
df = pd.DataFrame({'COL_1': ['UNREADABLE', 200, 'NOT_PASSIVE', 205, 210,
                             200, '', 210, 180, 170, '']})

out = (df.loc[df['COL_1'].apply(pd.to_numeric, errors='coerce')
                         .isin(num_range), 'COL_1']
         .value_counts())
>>> out
200 2
210 2
205 1
Name: COL_1, dtype: int64
>>> out.sum()
5
You have a DataFrame column with a list of items and another list of values, and you want to count how many items from the DataFrame column exist in that list, right?
So use this:
count = 0
for i in df['COL_1']:
    if i in num_range:
        count += 1
In each iteration over your column, if the value exists in the list, the count variable is incremented by one.
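A minimal runnable version, using the sample data from Corralien's answer:
import pandas as pd

df = pd.DataFrame({'COL_1': ['UNREADABLE', 200, 'NOT_PASSIVE', 205, 210, 200, '', 210, 180, 170, '']})
num_range = list(range(200, 281, 5))

count = 0
for i in df['COL_1']:
    if i in num_range:
        count += 1

print(count)  # 5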
Related: I am using a similar question on SO (Count occurrences of False or True in a column in pandas), except that I need to limit the range of rows in the Pandas DataFrame.
So, using this example, I would like to count the number of False occurrences in rows whose patient_id is between 50000 and 80000.
I would need to first sort by the values in column 0 (patient_id),
then find the range of rows that is between 50000 and 80000,
then count the number of False occurrences for that limited range.
The table is below:
patient_id test_result has_cancer
0 79452 Negative False
1 81667 Positive True
2 76297 Negative False
3 36593 Negative False
4 53717 Negative False
5 67134 Negative False
6 40436 Negative False
I will assume the data you presented is in a variable called "df".
import pandas as pd

# 0: load your data into a DataFrame
df = pd.DataFrame()

# 1: keep only the rows whose patient_id falls in the interval
df = df[(df['patient_id'] >= 50000) & (df['patient_id'] <= 80000)]

# 2: retrieves a Series with the count of each value (True/False)
df['has_cancer'].value_counts()

# 3: retrieving the actual number of False rows
df[~df['has_cancer']].shape[0]
0: If the data is stored in a CSV file, use pd.read_csv(), or look up how to read your specific file type with pandas.
1: The first line with patient_id just extracts what you specified: patient_id within your interval (assuming the IDs are int types).
2: We take the Series "has_cancer" and count the number of occurrences of each entry (True or False in this case).
3: We grab the Series "has_cancer", which is already boolean, so when passed into df[...] we would only get the rows that are True. So we use ~ to negate the values and only get the rows that are False. Calling df.shape gives a tuple of (# rows, # columns), so we grab the item at the first index, which is the number of rows.
There is no need to sort for these solutions, but if necessary just call df.sort_values(...) and pass a list of the column name(s) you want to sort by.
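Putting the pieces together, a minimal sketch (patients.csv is a hypothetical file name holding the table above):
import pandas as pd

df = pd.read_csv('patients.csv')  # hypothetical file containing the table shown above
subset = df[(df['patient_id'] >= 50000) & (df['patient_id'] <= 80000)]
false_count = subset[~subset['has_cancer']].shape[0]  # rows where has_cancer is False
print(false_count)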
The binary operators in pandas.DataFrame, such as the gt greater-than operator, seem to permute the DataFrame's index/columns in undesirable ways. The issue seems to be that they union together mismatched indices
Mismatched indices will be unioned together. NaN values are considered
different (i.e. NaN != NaN).
and that pd.Index.union (which I assume is called internally by the binary operators) sorts the indices by default. Yet this behavior is undesirable for many (most?) applications of the binary operator and there does not appear to be a clean way to override it.
Example
Here is an example to demonstrate. Setting up the data:
import pandas as pd
df = pd.DataFrame({'volume': [1, 2, 3],
                   'cost': [250, 150, 100],
                   'revenue': [100, 250, 300]},
                  index=['A', 'B', 'C'])
thresh = pd.Series({'volume': 2, 'cost': 125, 'revenue': 200})
and printing it:
print(df)
volume cost revenue
A 1 250 100
B 2 150 250
C 3 100 300
print(thresh)
volume 2
cost 125
revenue 200
dtype: int64
Since the indices of the DataFrame and Series match exactly (including order), the binary operator preserves order:
print(df.gt(thresh, axis=1)) # preserves column order
volume cost revenue
A False True False
B False True True
C True False True
What happens when we apply the binary operator with a Series whose indices (i) do not match the DataFrame order, and (ii) are not sorted?
thresh_perm = thresh.loc[['revenue', 'volume', 'cost']]
print(df.gt(thresh_perm, axis=1))
cost revenue volume
A True False False
B True True True
C False True True
The output column order matches neither the input DataFrame's column order nor the Series' index order. Instead, the columns are sorted alphabetically by name.
Conclusion
Is my analysis correct (i.e., that pandas binary operators always sort the indices)? What is the best way to fix this? Ideally the binary operators would have an option that controls output order, such as:
df.gt(thresh_perm, axis=1, sort=False) # desired call; code does not work
Since pandas.Index.union has a sort option I think it makes sense for binary operators to be able to pass the option through.
For now, the best option seems to be to manually restore the column order:
print(df.gt(thresh_perm, axis=1).loc[:, df.columns])
volume cost revenue
A False True False
B False True True
C True False True
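An equivalent workaround is to reindex the result against the original column order, which should give the same output as the .loc version above:
print(df.gt(thresh_perm, axis=1).reindex(columns=df.columns))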
I'm trying to fill two pandas columns with multiple lists of different sizes.
So, for example, I have the first column with a list like "angioplasty, aortic, artery" and the second one with a list like "251, 2882, 401, 4019, 412".
First I tried to append each list like this:
matches.code_matches.append(code_series)
which resulted in this TypeError:
TypeError: cannot concatenate object of type '<class 'list'>'; only Series and DataFrame objs are valid
So I tried converting the lists into a Series and appending them to the DataFrame with this code:
code_series = pd.Series( (v[0] for v in code_matches))
matches.code_matches.append(code_series)
However, after appending the series, my DataFrame still ends up empty.
What is the best way to fill the columns of a DataFrame with different-sized lists? Note that I want to keep the lists and fill them into single fields instead of filling in each element separately (I want to assign the values to IDs).
Use:
In [704]: l1 = ['angioplasty', 'aortic', 'artery']
In [705]: l2 = [251, 2882, 401, 4019, 412]
In [707]: df = pd.DataFrame({'a': pd.Series(l1), 'b': pd.Series(l2)})
In [708]: df
Out[708]:
a b
0 angioplasty 251
1 aortic 2882
2 artery 401
3 NaN 4019
4 NaN 412
I used the .at method in the code below and it seems to work.
import numpy as np
import pandas as pd

x = pd.DataFrame()
x['A'] = 1, 2, 3, 4, 5, 6
x['B'] = 1.0, 2, 3, 4, 5, 6
x['C'] = 9, 8, 7, 6, 5, 4
x['D'] = 0
x['D'] = x['D'].astype('object')  # object dtype so each cell can hold a list
for i in range(len(x)):
    x.at[i, 'D'] = list(np.random.rand(3))
You can replace the list(np.random.rand(3)) with your own data.
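Applied to the lists from the question, a minimal sketch (the matches DataFrame and its column names here are assumptions, since the original structure isn't shown):
import pandas as pd

matches = pd.DataFrame(index=[0], columns=['word_matches', 'code_matches'], dtype='object')
matches.at[0, 'word_matches'] = ['angioplasty', 'aortic', 'artery']  # whole list stored in a single cell
matches.at[0, 'code_matches'] = [251, 2882, 401, 4019, 412]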
I am trying to split my DataFrame into two based on medical_plan_id: if it is empty, into df1; if not empty, into df2.
df1 = df_with_medicalplanid[df_with_medicalplanid['medical_plan_id'] == ""]
df2 = df_with_medicalplanid[df_with_medicalplanid['medical_plan_id'] is not ""]
The code below works, but if there are no empty fields, my code raises TypeError("invalid type comparison").
df1 = df_with_medicalplanid[df_with_medicalplanid['medical_plan_id'] == ""]
How to handle such situation?
My df_with_medicalplanid looks like below:
wellthie_issuer_identifier ... medical_plan_id
0 UHC99806 ... None
1 UHC99806 ... None
Use ==, not is, to test equality
Likewise, use != instead of is not for inequality.
is has a special meaning in Python. It returns True if two variables point to the same object, while == checks if the objects referred to by the variables are equal. See also Is there a difference between == and is in Python?.
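A quick illustration of the difference:
a = [1, 2]
b = [1, 2]
a == b  # True: the two lists have equal contents
a is b  # False: they are two distinct objects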
Don't repeat mask calculations
The Boolean masks you are creating are the most expensive part of your logic. It's also logic you want to avoid repeating manually as your first and second masks are inverses of each other. You can therefore use the bitwise inverse ~ ("tilde"), also accessible via operator.invert, to negate an existing mask.
Empty strings are different from null values
Equality against empty strings can be tested via == '', but equality against null values requires a specialized method: pd.Series.isnull. This is because null values are represented by np.nan in the NumPy arrays that Pandas uses, and np.nan != np.nan by design.
If you want to replace empty strings with null values, you can do so:
df['medical_plan_id'] = df['medical_plan_id'].replace('', np.nan)
Conceptually, it makes sense for missing values to be null (np.nan) rather than empty strings. But the opposite of the above process, i.e. converting null values to empty strings, is also possible:
df['medical_plan_id'] = df['medical_plan_id'].fillna('')
If the difference matters, you need to know your data and apply the appropriate logic.
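If a column can contain both, a sketch of a combined mask that treats either case as missing:
mask = df['medical_plan_id'].isnull() | (df['medical_plan_id'] == '')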
Semi-final solution
Assuming you do indeed have null values, calculate a single Boolean mask and its inverse:
mask = df['medical_plan_id'].isnull()
df1 = df[mask]
df2 = df[~mask]
Final solution: avoid extra variables
Creating additional variables is something you should, as a programmer, look to avoid. In this case, there's no need to create two new variables; you can use GroupBy with dict to get a dictionary of DataFrames with False (== 0) and True (== 1) keys corresponding to your masks:
dfs = dict(tuple(df.groupby(df['medical_plan_id'].isnull())))
Then dfs[0] represents df2 and dfs[1] represents df1 (see also this related answer). As a variant of the above, you can forgo the dictionary construction and use Pandas GroupBy methods:
dfs = df.groupby(df['medical_plan_id'].isnull())
dfs.get_group(0) # equivalent to dfs[0] from dict solution
dfs.get_group(1) # equivalent to dfs[1] from dict solution
Example
Putting all the above in action:
df = pd.DataFrame({'medical_plan_id': [np.nan, '', 2134, 4325, 6543, '', np.nan],
                   'values': [1, 2, 3, 4, 5, 6, 7]})
df['medical_plan_id'] = df['medical_plan_id'].replace('', np.nan)
dfs = dict(tuple(df.groupby(df['medical_plan_id'].isnull())))
print(dfs[0], dfs[1], sep='\n'*2)
medical_plan_id values
2 2134.0 3
3 4325.0 4
4 6543.0 5
medical_plan_id values
0 NaN 1
1 NaN 2
5 NaN 6
6 NaN 7
Another variant is to unpack df.groupby, which returns an iterator of tuples (the first item being the group key and the second being the DataFrame).
Like this for instance:
cond = df_with_medicalplanid['medical_plan_id'] == ''
(_, df1) , (_, df2) = df_with_medicalplanid.groupby(cond)
In Python, _ is conventionally used for variables you are not interested in keeping. I have split the code across two lines for readability.
Full example
import pandas as pd
df_with_medicalplanid = pd.DataFrame({
    'medical_plan_id': ['214212', '', '12251', '12421', ''],
    'value': 1
})
cond = df_with_medicalplanid['medical_plan_id'] == ''
(_, df1) , (_, df2) = df_with_medicalplanid.groupby(cond)
print(df1)
Returns:
medical_plan_id value
0 214212 1
2 12251 1
3 12421 1
cond = df_with_medicalplanid['medical_plan_id'] == ''
(_, df1) , (_, df2) = df_with_medicalplanid.groupby(cond)
# Anton missed cond inside the groupby() brackets
print(df1)
I have a long series like the following:
series = pd.Series([[(1,2)],[(3,5)],[],[(3,5)]])
In [151]: series
Out[151]:
0 [(1, 2)]
1 [(3, 5)]
2 []
3 [(3, 5)]
dtype: object
I want to remove all entries with an empty list. For some reason, boolean indexing does not work.
The following tests both give the same error:
series == [[(1,2)]]
series == [(1,2)]
ValueError: Arrays were different lengths: 4 vs 1
This is very strange, because in the simple example below, the comparison works as expected:
In [146]: pd.Series([1,2,3]) == [3]
Out[146]:
0 False
1 False
2 True
dtype: bool
P.S. Ideally, I'd also like to split the tuples in the series into a DataFrame with two columns.
You could check to see if the lists are empty using str.len():
series.str.len() == 0
and then use this boolean series to remove the rows containing empty lists.
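For example, to keep only the non-empty rows:
series = series[series.str.len() > 0]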
If each of your entries is a list containing a two-tuple (or else empty), you could create a two-column DataFrame by using the str accessor twice (once to select the first element of the list, then to access the elements of the tuple):
pd.DataFrame({'a': series.str[0].str[0], 'b': series.str[0].str[1]})
Missing entries default to NaN with this method.
Using the built-in apply, you can filter by the length of each list:
series = pd.Series([[(1,2)],[(3,5)],[],[(3,5)]])
series = series[series.apply(len) > 0]
Your series is in a bad state -- having a Series of lists of tuples of ints
buries the useful data, the ints, inside too many layers of containers.
However, to form the desired DataFrame, you could use
df = series.apply(lambda x: pd.Series(x[0]) if x else pd.Series()).dropna()
which yields
     0    1
0  1.0  2.0
1  3.0  5.0
3  3.0  5.0
A better way would be to avoid building the malformed series altogether and
form df directly from the data:
data = [[(1,2)],[(3,5)],[],[(3,5)]]
data = [pair for row in data for pair in row]
df = pd.DataFrame(data)
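which gives:
   0  1
0  1  2
1  3  5
2  3  5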