I need to filter out/eliminate the leading rows, for as long as NaN values are present, from a much bigger DataFrame like the df2 shown below.
import numpy as np
import pandas as pd

main_df = {
'Courses':["Spark","Java","Python","Go"],
'Discount':[2000,2300,1200,2000],
'Pappa':[np.nan,np.nan,"2","ai"],
'Puppo':["Glob","Java","n","Godo"],
}
index_labels2=['r1','r6','r3','r5']
df2 = pd.DataFrame(main_df,index=index_labels2)
I tried:
maino_df = df2.loc[:, (df2.iloc[0] != np.nan) & (df2.iloc[0, :] < 1000)]
to obtain:
main_dfnew = {
'Courses':["Python","Go"],
'Discount':[1200,2000],
'Pappa':["2","ai"],
'Puppo':["n","Godo"],
}
index_labels2=['r3','r5']
df2 = pd.DataFrame(main_dfnew, index=index_labels2)
but this also eliminates the columns where NaN is present.
IIUC, you want to drop the leading rows that have NaNs, and keep everything from the first row that has no NaNs onward?
NB: I am assuming real NaNs here; if not, first use replace or another method to convert the placeholders to NaN, or adapt the comparison to match whatever data you consider invalid.
You could use:
df3 = df2[df2.notna().all(axis=1).cummax()]
output:
Courses Discount Pappa Puppo
r3 Python 1200 2 n
r5 Go 2000 ai Godo
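To see what the one-liner does, the intermediate steps can be spelled out; this is equivalent, just more verbose:
mask = df2.notna().all(axis=1)   # r1 False, r6 False, r3 True, r5 True
keep = mask.cummax()             # stays True from the first fully valid row onward
df3 = df2[keep]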
If you just want to remove all the rows with NaNs (not only the leading ones), use dropna:
df3 = df2.dropna(axis=0)
I have this df:
id started completed
1 2022-02-20 15:00:10.157 2022-02-20 15:05:10.044
and I have this other one, data:
timestamp x y
2022-02-20 14:59:47.329 16 0.0
2022-02-20 15:01:10.347 16 0.2
2022-02-20 15:06:35.362 16 0.3
What I want to do is filter the rows in data where timestamp > started and timestamp < completed (which should leave me with the middle row only).
I tried to do it like this:
res = data[(data['timestamp'] > '2022-02-20 15:00:10.157')]
res = res[(res['timestamp'] < '2022-02-20 15:05:10.044')]
and it works.
But when I wanted to combine the two like this:
res = data[(data['timestamp'] > df['started']) and (data['timestamp'] < df['completed'])]
I get ValueError: Can only compare identically-labeled Series objects
Can anyone please explain why, and where I'm making the mistake? Do I have to convert df['started'] to a string or something?
You have two issues here.
The first is the use of and: if you want to combine multiple masks (boolean arrays) element-wise with "and" logic, use & instead of and.
The second is the use of df['started'] and df['completed'] in the comparison. If you use a debugger, you can see that df['started'] is a Series with its own index, and the same goes for data['timestamp']. The rules for comparing two Series are described in the pandas documentation: essentially, you can only compare Series with identical labels. But here df has only one row while data has several. Try extracting the element from df as a scalar instead, using loc for instance.
For instance:
Using masks
from string import ascii_lowercase

import numpy as np
import pandas as pd

n = 10
np.random.seed(0)
df = pd.DataFrame(
{
"x": np.random.choice(np.array([*ascii_lowercase]), size=n),
"y": np.random.normal(size=n),
}
)
df2 = pd.DataFrame(
{
"max_val" : [0],
"min_val" : [-0.5]
}
)
df[(df.y < df2.loc[0, 'max_val']) & (df.y > df2.loc[0, 'min_val'])]
Out[95]:
x y
2 v -0.055035
3 a -0.107310
5 d -0.097696
7 j -0.453056
8 t -0.470771
Using query
df2 = pd.DataFrame(
{
"max_val" : np.repeat(0, n),
"min_val" : np.repeat(-0.5, n)
}
)
df.query("y < #df2.max_val and y > #df2.min_val")
Out[124]:
x y
2 v -0.055035
3 a -0.107310
5 d -0.097696
7 j -0.453056
8 t -0.470771
To make the comparison, pandas needs the same row count in both objects, because the comparison is made between the first row of the data['timestamp'] Series and the first row of the df['started'] Series, and so on.
The error occurs because the second row of the data['timestamp'] Series has nothing to compare against.
To make the code work, you can add a row in df for every row of data to match against. That way, pandas returns a Boolean result for every row, and you can combine the two conditions to keep the rows where both are True.
pandas doesn't accept Python's and operator here, so you need to use the & operator; your code will then look like this:
data[(data['timestamp'] > df['started']) & (data['timestamp'] < df['completed'])]
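Tying the two answers together for the original frames, here is a minimal sketch (assuming df really has just the one row shown, and that the columns hold comparable values, e.g. after pd.to_datetime):
start = df['started'].iloc[0]      # scalar, so no index alignment is involved
end = df['completed'].iloc[0]
res = data[(data['timestamp'] > start) & (data['timestamp'] < end)]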
I have a df that has multiple pairs of related items, for example fxr_dl2_rank.r1 and fxr_dl2_rank.r1_wp. Is it possible to filter all the related pairs where both results are positive?
import pandas as pd

data = {'item':['fxr_dl2_rank.r1','fxr_dl2_rank.r2','fxr_dl2_rank.r3',
'fxr_dl2_rank.r4','fxr_dl2_rank.r5',
'fxr_dl2_rank.r1_wp','fxr_dl2_rank.r2_wp','fxr_dl2_rank.r3_wp',
'fxr_dl2_rank.r4_wp','fxr_dl2_rank.r5_wp',],
'result':[-0.15,0.13,-0.29,0.18,-0.18,0.00,0.16,0.15,0.17,-0.17]}
df = pd.DataFrame(data)
df
First rework 'item' to get the common part, use it to group the rows, check whether all elements in each group are non-negative, and use the resulting boolean mask for slicing:
group = df['item'].str.replace('_wp$', '', regex=True)
df[df.groupby(group)['result'].transform(lambda s: all(s.ge(0)))]
output:
item result
1 fxr_dl2_rank.r2 0.13
3 fxr_dl2_rank.r4 0.18
6 fxr_dl2_rank.r2_wp 0.16
8 fxr_dl2_rank.r4_wp 0.17
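An equivalent formulation, assuming result is numeric, lets groupby aggregate with the built-in min instead of a Python lambda (if the smallest value in a pair is non-negative, all of them are):
group = df['item'].str.replace('_wp$', '', regex=True)
df[df.groupby(group)['result'].transform('min').ge(0)]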
You can use the following:
#create helper column
df["helper"] = df["item"].str[:15]
#filter out all negative values in result
df = df[df["result"] >= 0]
#keep only duplicated rows in helper column
df[df.duplicated(subset="helper", keep=False)]
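Note that the str[:15] slice assumes every item name has the same length before the optional _wp suffix; if that is not guaranteed, a more robust helper strips the suffix instead (same idea as in the first answer):
df["helper"] = df["item"].str.replace("_wp$", "", regex=True)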
I have a ~2MM row dataframe. I have a problem where, after splitting a column by a delimiter, it looks as though the original merged column did not contain a consistent number of fields.
To remedy this, I'm attempting to create a conditional new column C that, if a condition is true, equals column A, and otherwise equals column B.
EDIT: In attempting a provided solution, I tried some code listed below, but it did not update any rows. Here is a better example of the dataset that I'm working with:
Scenario meteorology time of day
0 xxx D7 Bus. Hours
1 yyy F3 Offshift
2 zzz Bus. Hours NaN
3 aaa Offshift NaN
4 bbb Offshift NaN
The first two rows are well-formed: the Scenario, meteorology, and time of day have been split out from the merged column correctly. However, on the other rows the merged column had no data for meteorology, so the 'time of day' value has ended up in 'meteorology', leaving 'time of day' as NaN.
Here was the suggested approach:
from dask import dataframe as dd
ddf = dd.from_pandas(df, npartitions=10)
ddf[(ddf.met=='Bus. Hours') | (ddf.met == 'Offshift')]['time'] = ddf['met']
ddf[(ddf.met=='Bus. Hours') | (ddf.met == 'Offshift')]['met'] = np.nan
This does not update the appropriate rows in 'time' or 'met'.
I have also tried doing this in pandas:
df.loc[(df.met == 'Bus.Hours') | (df.met == 'Offshift'), 'time'] = df['met']
df.loc[(df.met == 'Bus.Hours') | (df.met == 'Offshift'), 'met'] = np.nan
This approach runs, but appears to hang indefinitely.
Try this (and calculate time the same way); afterwards, print(ddf.head(10)) to see the output:
from dask import dataframe as dd
ddf = dd.from_pandas(df, npartitions=10)
ddf[(ddf.A == 2) | (ddf.A == 1)]['C'] = ddf['A']
ddf[(ddf.A != 2) & (ddf.A != 1)]['C'] = ddf['B']
print(ddf.head(10))
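As the question itself observed, chained indexing such as ddf[mask]['C'] = ... assigns to a temporary object and is not written back to ddf (the same holds in plain pandas). A minimal sketch of the intended logic with a single direct assignment, using the A/B/C names from the question and made-up data purely for illustration:
import numpy as np
import pandas as pd

df = pd.DataFrame({"A": [1, 2, 3], "B": [10, 20, 30]})   # hypothetical sample data
cond = df["A"].isin([1, 2])                              # stand-in for the real condition
df["C"] = np.where(cond, df["A"], df["B"])               # A where cond is True, else B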
The working solution was adapted from the comments, and ended up as follows:
cond = df.met.isin(['Bus. Hours', 'Offshift'])
df['time'] = np.where(cond, df['met'], df['time'])
df['met'] = np.where(cond, np.nan, df['met'])
Came across another situation where this was needed. It was along the lines of a string that shouldn't contain a substring:
df1 = dataset.copy(deep=True)
df1['F_adj'] = 0
cond = (df1['Type'] == 'Delayed Ignition') | ~(df1['Type'].str.contains('Delayed'))
df1['F_adj'] = np.where(cond,df1['F'], 0)
I know I can do the following if I'm only checking two columns against each other.
df['flag'] = df['a_id'].isin(df['b_id'])
where df is a DataFrame, and a_id and b_id are two columns of the DataFrame. It returns True or False based on the match. But I need to compare multiple columns together.
For example, if there are a_id, a_region, a_ip, b_id, b_region and b_ip columns, I want to compare them like below:
a_key = df['a_id'] + df['a_region'] + df['a_ip']
b_key = df['b_id'] + df['b_region'] + df['b_ip']
df['flag'] = a_key.isin(b_key)
Somehow the above code always returns False. The expected output is that the first row's flag is True because there is a match: a_key becomes 2a10, which matches the last row of b_key (2a10).
You were going in the right direction; the keys just need consistent types. If the id/ip columns are numeric, concatenating them with string columns won't produce the keys you expect, so cast them to str first:
a_key = df['a_id'].astype(str) + df['a_region'] + df['a_ip'].astype(str)
b_key = df['b_id'].astype(str) + df['b_region'] + df['b_ip'].astype(str)
a_key.isin(b_key)
This gives the following results:
0 True
1 False
2 False
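For reference, here is a self-contained sketch; the sample values are reconstructed from the table shown in the merge-based answer further below, purely for illustration:
import pandas as pd

df = pd.DataFrame({
    'a_id': [2, 22222, 33333], 'a_region': ['a', 'bcccc', 'acccc'], 'a_ip': [10, 10000, 120000],
    'b_id': [3222222, 43333, 2], 'b_region': ['sssss', 'ddddd', 'a'], 'b_ip': [22222, 11111, 10],
})
a_key = df['a_id'].astype(str) + df['a_region'] + df['a_ip'].astype(str)
b_key = df['b_id'].astype(str) + df['b_region'] + df['b_ip'].astype(str)
df['flag'] = a_key.isin(b_key)   # [True, False, False]: only '2a10' appears in b_key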
You can use isin with a DataFrame as the value, but as per the docs:
If values is a DataFrame, then both the index and column labels must match.
So this should work:
# Removing the prefixes from column names
df_a = df[['a_id', 'a_region', 'a_ip']].rename(columns=lambda x: x[2:])
df_b = df[['b_id', 'b_region', 'b_ip']].rename(columns=lambda x: x[2:])
# Find rows where all values are in the other
matched = df_a.isin(df_b).all(axis=1)
# Get actual rows with boolean indexing
df_a.loc[matched]
# ... or add boolean flag to dataframe
df['flag'] = matched
Here's one approach using DataFrame.merge, pandas.concat and testing for duplicated values:
df_merged = df.merge(df,
left_on=['a_id', 'a_region', 'a_ip'],
right_on=['b_id', 'b_region', 'b_ip'],
suffixes=('', '_y'))
df['flag'] = pd.concat([df, df_merged[df.columns]]).duplicated(keep=False)[:len(df)].values
[out]
a_id a_region a_ip b_id b_region b_ip flag
0 2 a 10 3222222 sssss 22222 True
1 22222 bcccc 10000 43333 ddddd 11111 False
2 33333 acccc 120000 2 a 10 False
I have a data frame that has multiple columns, example:
Prod_A Prod_B Prod_C State Region
1 1 0 1 1 1
I would like to drop all columns that start with Prod_ (I can't select or drop by name because the data frame has 200 variables).
Is it possible to do this?
Thank you
Use startswith to build a mask, then select the remaining columns with loc and boolean indexing:
df = df.loc[:, ~df.columns.str.startswith('Prod')]
print (df)
State Region
1 1 1
First, select all columns to be deleted:
unwanted = df.columns[df.columns.str.startswith('Prod_')]
Then, drop them all:
df.drop(unwanted, axis=1, inplace=True)
We can also use a negative regex:
In [269]: df.filter(regex=r'^(?!Prod_).*$')
Out[269]:
State Region
1 1 1
Drop all rows where the path column starts with /var:
df = df[~df['path'].map(lambda x: (str(x).startswith('/var')))]
This can be further simplified to:
df = df[~df['path'].str.startswith('/var')]
map + lambda offers more flexibility by letting you handle raw values rather than only strings. In the example below, rows are removed when they start with /var or are falsy (None, an empty string, etc.).
df = df[~df['path'].map(lambda x: (str(x).startswith('/var') or not x))]
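If you prefer to stick with the string accessor, str.startswith also takes an na argument controlling how missing values are treated; a small sketch assuming you want NaN/None paths dropped as well (empty strings would still need the map approach above):
df = df[~df['path'].str.startswith('/var', na=True)]   # missing paths count as matches and are dropped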
Drop all rows where the path column starts with /var or /tmp (you can also pass a tuple to startswith):
df = df[~df['path'].map(lambda x: (str(x).startswith(('/var', '/tmp'))))]
The tilde ~ is used for negation; if you wanted instead to keep all rows starting with /var, just remove the ~.