I have this df:
id started completed
1 2022-02-20 15:00:10.157 2022-02-20 15:05:10.044
and I have this other one data:
timestamp x y
2022-02-20 14:59:47.329 16 0.0
2022-02-20 15:01:10.347 16 0.2
2022-02-20 15:06:35.362 16 0.3
what I wanna do is filter the rows in data where timestamp > started and timestamp < completed (which will leave me with the middle row only)
I tried to do it like this:
res = data[(data['timestamp'] > '2022-02-20 15:00:10.157')]
res = res[(res['timestamp'] < '2022-02-20 15:05:10.044')]
and it works.
But when I wanted to combine the two like this:
res = data[(data['timestamp'] > df['started']) and (data['timestamp'] < df['completed'])]
I get ValueError: Can only compare identically-labeled Series objects
Can anyone please explain why and where am I doing the mistake? Do I have to convert to string the df['started'] or something?
You have two issues here.
The first is the use of and. To combine multiple masks (Boolean arrays) element-wise with "and" logic, use & instead of and.
The second is the use of df['started'] and df['completed'] in the comparison. If you use a debugger, you can see that
df['started'] is a Series with its own index, and the same goes for data['timestamp']. pandas can only compare two Series whose indexes are identically labeled. Here df has only one row, while data has several. Convert the values from df to scalars instead, using loc for instance.
For instance :
Using masks
from string import ascii_lowercase
import numpy as np
import pandas as pd
n = 10
np.random.seed(0)
df = pd.DataFrame(
{
"x": np.random.choice(np.array([*ascii_lowercase]), size=n),
"y": np.random.normal(size=n),
}
)
df2 = pd.DataFrame(
{
"max_val" : [0],
"min_val" : [-0.5]
}
)
df[(df.y < df2.loc[0, 'max_val']) & (df.y > df2.loc[0, 'min_val'])]
Out[95]:
x y
2 v -0.055035
3 a -0.107310
5 d -0.097696
7 j -0.453056
8 t -0.470771
Using query
df2 = pd.DataFrame(
{
"max_val" : np.repeat(0, n),
"min_val" : np.repeat(-0.5, n)
}
)
df.query("y < @df2.max_val and y > @df2.min_val")
Out[124]:
x y
2 v -0.055035
3 a -0.107310
5 d -0.097696
7 j -0.453056
8 t -0.470771
To compare two Series element-wise, pandas needs them to have identical index labels (and therefore the same number of rows): the first row of data['timestamp'] is compared with the first row of df['started'], and so on.
The error arises because the second and later rows of data['timestamp'] have nothing to compare with.
To make the code work as written, df would need one row for every row of data. pandas then returns a Boolean result for every row, and you can combine the two masks so that only rows where both are True are kept.
pandas does not accept Python's and operator for this; use the element-wise & operator instead, so the code looks like this:
data[(data['timestamp'] > df['started']) & (data['timestamp'] < df['completed'])]
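For the one-row df in the question, a minimal sketch (assuming both timestamp columns are parsed with pd.to_datetime and that the single row of df has index label 0) pulls the bounds out as scalars, so no index alignment is involved:

```python
import pandas as pd

# Data copied from the question; dtypes are an assumption (parsed datetimes).
df = pd.DataFrame({
    "id": [1],
    "started": pd.to_datetime(["2022-02-20 15:00:10.157"]),
    "completed": pd.to_datetime(["2022-02-20 15:05:10.044"]),
})
data = pd.DataFrame({
    "timestamp": pd.to_datetime([
        "2022-02-20 14:59:47.329",
        "2022-02-20 15:01:10.347",
        "2022-02-20 15:06:35.362",
    ]),
    "x": [16, 16, 16],
    "y": [0.0, 0.2, 0.3],
})

# Scalars broadcast against every row of data, so the indexes never need to match.
started = df.loc[0, "started"]
completed = df.loc[0, "completed"]

# Combine the two Boolean masks element-wise with &.
res = data[(data["timestamp"] > started) & (data["timestamp"] < completed)]
print(res)
```

Only the middle row of data falls strictly between the two bounds.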
Given data like so:
Symbol  One    Two
1       28.75  25.10
2       29.00  25.15
3       29.10  25.00
I want to drop every column whose values are not in ascending order across all rows (gaps are allowed). In this case, I want to drop column 'Two'. I tried the following code with no luck:
df.drop(df.columns[df.all(x <= y for x,y in zip(df, df[1:]))])
Thanks
Dropping those columns that give at least one (any) negative value (lt(0)) when their values are differenced by 1 lag (diff(1)) after NaNs are neglected (dropna):
columns_to_drop = [col for col in df.columns if df[col].diff(1).dropna().lt(0).any()]
df.drop(columns=columns_to_drop)
Symbol One
0 1 28.75
1 2 29.00
2 3 29.10
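One caveat with diff(1) followed by dropna: a difference taken across a NaN is itself NaN, so the values on either side of a gap are never compared with each other. A hedged alternative sketch drops the NaNs first and then uses Series.is_monotonic_increasing; the extra row with a gap is made up for illustration:

```python
import numpy as np
import pandas as pd

# Sample data from the question plus one hypothetical row containing a gap.
df = pd.DataFrame({
    "Symbol": [1, 2, 3, 4],
    "One": [28.75, np.nan, 29.00, 29.10],
    "Two": [25.10, np.nan, 25.15, 25.00],
})

# Drop gaps first, then test whether the remaining values never decrease.
keep = [col for col in df.columns if df[col].dropna().is_monotonic_increasing]
result = df[keep]
print(result)
```

Column 'Two' is dropped because 25.00 follows 25.15; 'Symbol' and 'One' are kept.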
An expression that works with gaps (NaN):
A.loc[:, ~(A.iloc[1:, :].reset_index(drop=True) < A.iloc[:-1, :].reset_index(drop=True)).any()]
Without gaps it would be equivalent to:
A.loc[:, (A.iloc[1:, :].reset_index(drop=True) >= A.iloc[:-1, :].reset_index(drop=True)).all()]
This avoids Python-level loops, which takes better advantage of the framework on bigger dataframes.
A.iloc[1:, :] returns the dataframe without its first row.
A.iloc[:-1, :] returns the dataframe without its last row.
Slices of a dataframe keep the index labels of their rows, so the two slices have different indices. reset_index(drop=True) replaces them with a fresh 0, 1, ... index, which makes the two sides of the inequality comparable; drop=True discards the old index instead of keeping it as a column.
any (implicitly with axis=0) checks, for every column, whether any comparison is True; if so, some value in that column is smaller than its predecessor.
A.loc[:, mask] selects the columns where mask is True and drops the columns where it is False.
The logic can be read as "no value is smaller than its predecessor" or, without gaps, "every value is greater than or equal to its predecessor".
Check out the code below; the only logic is:
list(map(lambda i: list(df[i]) == sorted(df[i]), df.columns))
import pandas as pd
import numpy as np
df = pd.DataFrame(
{
'Symbol': [1, 2, 3],
'One': [28.75, 29.00, 29.10],
'Two': [25.10, 25.15, 25.10],
}
)
print(df.loc[:, list(map(lambda i: list(df[i]) == sorted(df[i]), df.columns))])
I have a ~2MM row dataframe. I have a problem where, after splitting a column by a delimiter, it looks as though there wasn't a consistent number of columns merged into this split.
To remedy this, I'm attempting to use a conditional new column C where, if a condition is true, should equal column A. If false, set equal to column B.
EDIT: In attempting a provided solution, I tried some code listed below, but it did not update any rows. Here is a better example of the dataset that I'm working with:
Scenario meteorology time of day
0 xxx D7 Bus. Hours
1 yyy F3 Offshift
2 zzz Bus. Hours NaN
3 aaa Offshift NaN
4 bbb Offshift NaN
The first two rows are well-formed. The Scenario, meteorology, and time of day have been split out from the merged column correctly. However, on the other rows, the merged column did not have data for meteorology. Therefore, the 'time of day' data has populated in 'Meteorology', resulting in 'time of day' being nan.
Here was the suggested approach:
from dask import dataframe as dd
ddf = dd.from_pandas(df, npartitions=10)
ddf[(ddf.met=='Bus. Hours') | (ddf.met == 'Offshift')]['time'] = ddf['met']
ddf[(ddf.met=='Bus. Hours') | (ddf.met == 'Offshift')]['met'] = np.nan
This does not update the appropriate rows in 'time' or 'met'.
I have also tried doing this in pandas:
df.loc[(df.met == 'Bus.Hours') | (df.met == 'Offshift'), 'time'] = df['met']
df.loc[(df.met == 'Bus.Hours') | (df.met == 'Offshift'), 'met'] = np.nan
This approach runs, but appears to hang indefinitely.
Try the following and time it; afterwards, call print(ddf.head(10)) to see the output:
from dask import dataframe as dd
ddf = dd.from_pandas(df, npartitions=10)
ddf[(ddf.A == 2) | (ddf.A == 1)]['C'] = ddf['A']
ddf[(ddf.A != 2) & (ddf.A != 1)]['C'] = ddf['B']
print(ddf.head(10))
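The working solution was adapted from the comments and ended up as follows. Note that 'time' is filled before 'met' is blanked; in the opposite order, the values to copy would already have been replaced by NaN:
cond = df.met.isin(['Bus. Hours', 'Offshift'])
df['time'] = np.where(cond, df['met'], df['time'])
df['met'] = np.where(cond, np.nan, df['met'])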
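A self-contained sketch of this fix on the sample rows from the question, assuming the columns are named met and time as in the attempted code. It fills time before blanking met, so the values are still there to copy:

```python
import numpy as np
import pandas as pd

# Sample rows from the question; the short column names are an assumption.
df = pd.DataFrame({
    "Scenario": ["xxx", "yyy", "zzz", "aaa", "bbb"],
    "met": ["D7", "F3", "Bus. Hours", "Offshift", "Offshift"],
    "time": ["Bus. Hours", "Offshift", np.nan, np.nan, np.nan],
})

# Rows where the time-of-day value landed in the meteorology column.
cond = df["met"].isin(["Bus. Hours", "Offshift"])

# Copy first, then blank the source column.
df["time"] = np.where(cond, df["met"], df["time"])
df["met"] = np.where(cond, np.nan, df["met"])
print(df)
```

The well-formed rows are left untouched because np.where falls back to the existing value wherever cond is False.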
Came across another situation where this was needed. It was along the lines of a string that shouldn't contain a substring:
df1 = dataset.copy(deep=True)
df1['F_adj'] = 0
cond = (df1['Type'] == 'Delayed Ignition') | ~(df1['Type'].str.contains('Delayed'))
df1['F_adj'] = np.where(cond,df1['F'], 0)
I am trying to create a pandas data frame using two lists and the output is erroneous for a given length of the lists.(this is not due to varying lengths)
Here I have two cases, one that works as expected and one that doesn't(commented out):
import string
d = dict.fromkeys(string.ascii_lowercase, 0).keys()
groups = sorted(d)[:3]
numList = range(0,4)
# groups = sorted(d)[:20]
# numList = range(0,25)
df = DataFrame({'Number':sorted(numList)*len(groups), 'Group':sorted(groups)*len(numList)})
df.sort_values(['Group', 'Number'])
Expected Output: every item in groups, to correspond to all items in numList
Group Number
a 0
a 1
a 2
a 3
b 0
b 1
b 2
b 3
c 0
c 1
c 2
c 3
Actual Results: Works for case in which lists are sized 3 and 4 but not 20 , and 25 (I have commented out that case in the above code)
Why is that? and how to fix that?
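If I understand this correctly, you want to make a dataframe that contains all pairs of groups and numbers. That operation is called a cartesian product.
Your approach repeats each list and pairs the results position by position, which covers every pair only when the two lengths share no common factor (as 3 and 4 do), so it works more by accident than by design. With 20 and 25, which share the factor 5, some pairs repeat and others never appear. For the general case, you want to do this: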
df1 = DataFrame({'Number': sorted(numList)})
df2 = DataFrame({'Group': sorted(groups)})
df = df1.assign(key=1).merge(df2.assign(key=1), on='key').drop(columns='key')
A note about sorting dataframes: in pandas, most DataFrame operations return a new DataFrame by default and do not modify the old one, unless you pass the inplace=True parameter.
So you should do
df = df.sort_values(['Group', 'Number'])
or
df.sort_values(['Group', 'Number'], inplace=True)
and it should work now.
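On pandas 1.2 or later, the same cartesian product can be sketched without a helper key via merge(how='cross'). The example uses the 20-by-25 case from the question; num_list is just a renamed numList:

```python
import string

import pandas as pd

# The case from the question that did not work with list repetition.
groups = sorted(string.ascii_lowercase)[:20]
num_list = range(25)

df1 = pd.DataFrame({"Group": groups})
df2 = pd.DataFrame({"Number": list(num_list)})

# how="cross" pairs every row of df1 with every row of df2.
df = (
    df1.merge(df2, how="cross")
    .sort_values(["Group", "Number"])
    .reset_index(drop=True)
)
print(len(df))
```

The result has one row per (Group, Number) pair, i.e. 20 * 25 rows, with no duplicates.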
I am trying to split my dataframe into two based of medical_plan_id. If it is empty, into df1. If not empty into df2.
df1 = df_with_medicalplanid[df_with_medicalplanid['medical_plan_id'] == ""]
df2 = df_with_medicalplanid[df_with_medicalplanid['medical_plan_id'] is not ""]
The code below works, but if there are no empty fields, my code raises TypeError("invalid type comparison").
df1 = df_with_medicalplanid[df_with_medicalplanid['medical_plan_id'] == ""]
How to handle such situation?
My df_with_medicalplanid looks like below:
wellthie_issuer_identifier ... medical_plan_id
0 UHC99806 ... None
1 UHC99806 ... None
Use ==, not is, to test equality
Likewise, use != instead of is not for inequality.
is has a special meaning in Python. It returns True if two variables point to the same object, while == checks whether the objects referred to by the variables are equal. See also the question "Is there a difference between == and is in Python?".
Don't repeat mask calculations
The Boolean masks you are creating are the most expensive part of your logic. It's also logic you want to avoid repeating manually as your first and second masks are inverses of each other. You can therefore use the bitwise inverse ~ ("tilde"), also accessible via operator.invert, to negate an existing mask.
Empty strings are different to null values
Equality versus empty strings can be tested via == '', but equality versus null values requires a specialized method: pd.Series.isnull. This is because null values are represented in NumPy arrays, which are used by Pandas, by np.nan, and np.nan != np.nan by design.
If you want to replace empty strings with null values, you can do so:
df['medical_plan_id'] = df['medical_plan_id'].replace('', np.nan)
Conceptually, it makes sense for missing values to be null (np.nan) rather than empty strings. But the opposite of the above process, i.e. converting null values to empty strings, is also possible:
df['medical_plan_id'] = df['medical_plan_id'].fillna('')
If the difference matters, you need to know your data and apply the appropriate logic.
Semi-final solution
Assuming you do indeed have null values, calculate a single Boolean mask and its inverse:
mask = df['medical_plan_id'].isnull()
df1 = df[mask]
df2 = df[~mask]
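If the column can hold both None and empty strings, one possible sketch treats either as "empty" in a single mask and reuses its inverse. The third row and its values are made up for illustration:

```python
import pandas as pd

# Two rows modeled on the question plus one hypothetical row with a real id.
df = pd.DataFrame({
    "wellthie_issuer_identifier": ["UHC99806", "UHC99806", "UHC99807"],
    "medical_plan_id": [None, "", "2134"],
})

# A value counts as empty if it is null or an empty string.
mask = df["medical_plan_id"].isna() | df["medical_plan_id"].eq("")

df1 = df[mask]    # rows without a medical_plan_id
df2 = df[~mask]   # rows with a medical_plan_id
print(df1, df2, sep="\n\n")
```

Because the mask is computed once, both frames stay valid (possibly empty) even when no row is empty or every row is.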
Final solution: avoid extra variables
As a programmer, you should look to avoid creating additional variables. In this case there is no need to create two new variables: you can use GroupBy with dict to build a dictionary of dataframes whose False (== 0) and True (== 1) keys correspond to your masks:
dfs = dict(tuple(df.groupby(df['medical_plan_id'].isnull())))
Then dfs[0] represents df2 and dfs[1] represents df1. As a variant of the above, you can forgo the dictionary construction and use the pandas GroupBy methods directly:
dfs = df.groupby(df['medical_plan_id'].isnull())
dfs.get_group(0) # equivalent to dfs[0] from dict solution
dfs.get_group(1) # equivalent to dfs[1] from dict solution
Example
Putting all the above in action:
df = pd.DataFrame({'medical_plan_id': [np.nan, '', 2134, 4325, 6543, '', np.nan],
'values': [1, 2, 3, 4, 5, 6, 7]})
df['medical_plan_id'] = df['medical_plan_id'].replace('', np.nan)
dfs = dict(tuple(df.groupby(df['medical_plan_id'].isnull())))
print(dfs[0], dfs[1], sep='\n'*2)
medical_plan_id values
2 2134.0 3
3 4325.0 4
4 6543.0 5
medical_plan_id values
0 NaN 1
1 NaN 2
5 NaN 6
6 NaN 7
Another variant is to unpack df.groupby, which returns an iterator with tuples (first item being the element of groupby and the second being the dataframe).
Like this for instance:
cond = df_with_medicalplanid['medical_plan_id'] == ''
(_, df1) , (_, df2) = df_with_medicalplanid.groupby(cond)
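In Python, _ is conventionally used for values you are not interested in keeping. I have split the code into two lines for readability. The groups come out in sorted key order (False, then True), so df1 holds the rows where cond is False. Note that the unpacking raises a ValueError when only one group exists, for example when no field is empty.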
Full example
import pandas as pd
df_with_medicalplanid = pd.DataFrame({
'medical_plan_id': ['214212','','12251','12421',''],
'value': 1
})
cond = df_with_medicalplanid['medical_plan_id'] == ''
(_, df1) , (_, df2) = df_with_medicalplanid.groupby(cond)
print(df1)
Returns:
medical_plan_id value
0 214212 1
2 12251 1
3 12421 1
how would I delete all rows from a dataframe that come after a certain fulfilled condition? As an example I have the following dataframe:
import pandas as pd
xEnd=1
yEnd=2
df = pd.DataFrame({'x':[1,1,1,2,2,2], 'y':[1,2,3,3,4,3], 'id':[0,1,2,3,4,5]})
How would I get a dataframe that drops the last 4 rows and keeps the first 2, since the condition x == xEnd and y == yEnd is fulfilled in row 2?
EDITED: I should have mentioned that the dataframe is not necessarily ascending. It could also be descending, and I would still like to get the upper rows.
To slice your dataframe up to the first row where a condition across two series is satisfied, first calculate the required position and then slice via iloc.
You can calculate the position via set_index, isin and np.ndarray.argmax:
idx = df.set_index(['x', 'y']).index.isin([(xEnd, yEnd)]).argmax()
res = df.iloc[:idx+1]
print(res)
x y id
0 1 1 0
1 1 2 1
If you need better performance, see Efficiently return the index of the first value satisfying condition in array.
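One thing to watch with argmax: it returns 0 when the condition is never met, which would silently keep only the first row. A minimal sketch that checks for a match first; keeping the whole dataframe when nothing matches is an assumption, not something the question specifies:

```python
import pandas as pd

x_end, y_end = 1, 2
df = pd.DataFrame({
    "x": [1, 1, 1, 2, 2, 2],
    "y": [1, 2, 3, 3, 4, 3],
    "id": [0, 1, 2, 3, 4, 5],
})

# Boolean array marking rows where both conditions hold.
hit = ((df["x"] == x_end) & (df["y"] == y_end)).to_numpy()

if hit.any():
    # argmax gives the position of the first True; keep rows up to and including it.
    res = df.iloc[: hit.argmax() + 1]
else:
    # Assumption: keep everything when the condition is never fulfilled.
    res = df
print(res)
```

This works on row positions, so it does not depend on x or y being sorted.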
not 100% sure i understand correctly, but you can filter your dataframe like this:
df[(df.x <= xEnd) & (df.y <= yEnd)]
this yields the dataframe:
id x y
0 0 1 1
1 1 1 2
If x and y are not strictly increasing and you want everything up to and including the first row that satisfies the condition:
df[df.index <= df[(df.x == xEnd) & (df.y == yEnd)].index[0]]
df = df.iloc[:2, :]
This selects just the first two rows, keeps all columns, and assigns the result to a dataframe variable.
You can use a new variable name or reuse the same one, as above.