I have a dataframe with this format:
ID measurement_1 measurement_2
0 3 NaN
1 NaN 5
2 NaN 7
3 NaN NaN
I want to combine to:
ID measurement measurement_type
0 3 1
1 5 2
2 7 2
For each row there will be a value in either measurement_1 or measurement_2 column, not in both, the other column will be NaN.
In some rows both columns will be NaN.
I want to add a column for the measurement type (depending on which column has the value) and take the actual value out of both columns, and remove the rows that have NaN in both columns.
Is there an easy way of doing this?
Thanks!
Use DataFrame.stack to reshape the dataframe then use reset_index and use DataFrame.assign to assign the column measurement_type by using Series.str.split + Series.str[:1] on level_1:
df1 = (
df.set_index('ID').stack().reset_index(name='measurement')
.assign(mesurement_type=lambda x: x.pop('level_1').str.split('_').str[-1])
)
Result:
print(df1)
ID measurement mesurement_type
0 0 3.0 1
1 1 5.0 2
2 2 7.0 2
Maybe combine_first could help?
import numpy as np
df["measurement"] = df["measurement_1"].combine_first(df["measurement_2"])
df["measurement_type"] = np.where(df["measurement_1"].notnull(), 1, 2)
df.drop(["measurement_1", "measurement_2"], 1)
ID measurement measurement_type
0 0 3 1
1 1 5 2
2 2 7 2
Set a threshold and drop any that has more than one NaN. Use df.assign to fillna() measurement_1 and apply np.where on measurement_2
df= df.dropna(thresh=2).assign(measurement=df.measurement_1.fillna\
(df.measurement_2), measurement_type=np.where(df.measurement_2.isna(),1,2)).drop(columns=['measurement_1','measurement_2'])
ID measurement measurement_type
0 0 3 1
1 1 5 2
2 2 7 2
You could use pandas melt :
(
df.melt("ID", var_name="measurement_type", value_name="measurement")
.dropna()
.assign(measurement_type=lambda x: x.measurement_type.str[-1])
.iloc[:, [0, -1, 1]]
.astype("int8")
)
or wide to long :
(
pd.wide_to_long(df, stubnames="measurement", i="ID",
j="measurement_type", sep="_")
.dropna()
.reset_index()
.astype("int8")
.iloc[:, [0, -1, 1]]
)
ID measurement measurement_type
0 0 3 1
1 1 5 2
2 2 7 2
Related
I have two dataframes df1 and df2.
df1:
id val
1 25
2 40
3 78
df2:
id val
2 8
1 5
Now I want to do something like df1['val'] = df1['val']/df2['val'] for matching id. I can do that by iterating over all df2 rows as df2 is a subset of df1 so it may be missing some values, which I want to keep unchanged. This is what I have right now:
for row in df2.iterrows():
df1.loc[df1['id']==row[1]['id'], 'val'] /= row[1]['val']
df1:
id val
1 5
2 5
3 78
How can I achieve the same without using for loop to improve speed?
Use Series.map with Series.div:
df1['val'] = df1['val'].div(df1['id'].map(df2.set_index('id')['val']), fill_value=1)
print (df1)
id val
0 1 5.0
1 2 5.0
2 3 78.0
Solution with merge with left join:
df1['val'] = df1['val'].div(df1.merge(df2, on='id', how='left')['val_y'], fill_value=1)
print (df1)
id val
0 1 5.0
1 2 5.0
2 3 78.0
I have two DataFrames and I would like to take the median of one column grouped by a set of two other columns from dataframe A and then merge the calculated median into dataframe B. Let me explain it using the example below:
I have two DataFrames which look like
# DataFrame 1
pu_c do_c fare
0 0 5 10
1 0 5 20
2 1 1 3
# DataFrame 2
pu_c do_c
0 0 3
1 0 5
2 1 1
I would like to take the median of fare grouped by pu_c and do_c using:
a = df1.groupby(['pu_c', 'do_c']).median()['fare']
which will result in:
pu_c do_c
0 5 15
1 1 3
Now I want to merge the median fare calculated in a from df1 into another dataframe such as df2. I know how to do it using the for loops and messy code. I am wondering if there is an efficient way to do it using pandas' merge or concat functions.
My desired output in this example is
pu_c do_c median_fare
0 0 3 NaN (or whatever)
1 0 5 15
2 1 1 3
Note: to reproduce my dataframes use:
import pandas as pd
pu_c = [0, 0, 1]
do_c = [5, 5, 1]
do_c2 = [3, 5, 1]
fare = [10, 20, 3]
df1 = pd.DataFrame({'pu_c': pu_c, 'do_c': do_c, 'fare': fare})
df2 = pd.DataFrame({'pu_c': pu_c, 'do_c': do_c2})
Turn a into a dataframe and rename the values to median_fare using a.to_frame('median_fare'), reset the index, then do an outer merge with df2. It will automatically merge on the 2 columns in common (do_c and pu_c)
df2.merge(a.to_frame('median_fare').reset_index(), how='outer')
do_c pu_c median_fare
0 3 0 NaN
1 5 0 15.0
2 1 1 3.0
I'm trying to create a class column in a pandas dataframe conditional another columns values. The value will be 1 if the other column's i+1 value is greater than the i value and 0 otherwise.
For example:
column1 column2
5 1
6 0
3 0
2 1
4 0
How do create column2 by iterating through column1?
You can use the diff method on the first column with a period of -1, then check if it is less than zero to create the second column.
import pandas as pd
df = pd.DataFrame({'c1': [5,6,3,2,4]})
df['c2'] = (df.c1.diff(-1) < 0).astype(int)
df
# returns:
c1 c2
0 5 1
1 6 0
2 3 0
3 2 1
4 4 0
You can also use shift. Performance is almost the same as diff but diff seems to be faster by a a little.
df = pd.DataFrame({'column1': [5,6,3,2,4]})
df['column2'] = (df['column1'] <df['column1'].shift(-1)).astype(int)
print(df)
column1 column2
0 5 1
1 6 0
2 3 0
3 2 1
4 4 0
I have a Pandas dataset that I want to clean up prior to applying my ML algorithm. I am wondering if it was possible to remove a row if an element of its columns does not match a set of values. For example, if I have the dataframe:
a b
0 1 6
1 4 7
2 2 4
3 3 7
...
And I desire the values of a to be one of [1,3] and of b to be one of [6,7], such that my final dataset is:
a b
0 1 6
1 3 7
...
Currently, my implementation is not working as some of my data rows have erroneous strings attached to the value. For example, instead of a value of 1 I'll have something like 1abc. Hence why I would like to remove anything that is not an integer of that value.
My workaround is also a bit archaic, as I am removing entries for column a that do not have 1 or 3 via:
dataset = dataset[(dataset.commute != 1)]
dataset = dataset[(dataset.commute != 3)]
You can use boolean indexing with double isin and &:
df1 = df[(df['a'].isin([1,3])) & (df['b'].isin([6,7]))]
print (df1)
a b
0 1 6
3 3 7
Or use numpy.in1d:
df1 = df[(np.in1d(df['a'], [1,3])) & (np.in1d(df['b'], [6,7])) ]
print (df1)
a b
0 1 6
3 3 7
But if need remove all rows with non numeric then need to_numeric with errors='coerce' which return NaN and then is possible filter it by notnull:
df = pd.DataFrame({'a':['1abc','2','3'],
'b':['4','5','dsws7']})
print (df)
a b
0 1abc 4
1 2 5
2 3 dsws7
mask = pd.to_numeric(df['a'], errors='coerce').notnull() &
pd.to_numeric(df['b'], errors='coerce').notnull()
df1 = df[mask].astype(int)
print (df1)
a b
1 2 5
If need check if some value is NaN or None:
df = pd.DataFrame({'a':['1abc',None,'3'],
'b':['4','5',np.nan]})
print (df)
a b
0 1abc 4
1 None 5
2 3 NaN
print (df[df.isnull().any(axis=1)])
a b
1 None 5
2 3 NaN
You can use pandas isin()
df = df[df.a.isin([1,3]) & df.b.isin([6,7])]
a b
0 1 6
3 3 7
I'm surely missing something simple here. Trying to merge two dataframes in pandas that have mostly the same column names, but the right dataframe has some columns that the left doesn't have, and vice versa.
>df_may
id quantity attr_1 attr_2
0 1 20 0 1
1 2 23 1 1
2 3 19 1 1
3 4 19 0 0
>df_jun
id quantity attr_1 attr_3
0 5 8 1 0
1 6 13 0 1
2 7 20 1 1
3 8 25 1 1
I've tried joining with an outer join:
mayjundf = pd.DataFrame.merge(df_may, df_jun, how="outer")
But that yields:
Left data columns not unique: Index([....
I've also specified a single column to join on (on = "id", e.g.), but that duplicates all columns except id like attr_1_x, attr_1_y, which is not ideal. I've also passed the entire list of columns (there are many) to on:
mayjundf = pd.DataFrame.merge(df_may, df_jun, how="outer", on=list(df_may.columns.values))
Which yields:
ValueError: Buffer has wrong number of dimensions (expected 1, got 2)
What am I missing? I'd like to get a df with all rows appended, and attr_1, attr_2, attr_3 populated where possible, NaN where they don't show up. This seems like a pretty typical workflow for data munging, but I'm stuck.
I think in this case concat is what you want:
In [12]:
pd.concat([df,df1], axis=0, ignore_index=True)
Out[12]:
attr_1 attr_2 attr_3 id quantity
0 0 1 NaN 1 20
1 1 1 NaN 2 23
2 1 1 NaN 3 19
3 0 0 NaN 4 19
4 1 NaN 0 5 8
5 0 NaN 1 6 13
6 1 NaN 1 7 20
7 1 NaN 1 8 25
by passing axis=0 here you are stacking the df's on top of each other which I believe is what you want then producing NaN value where they are absent from their respective dfs.
The accepted answer will break if there are duplicate headers:
InvalidIndexError: Reindexing only valid with uniquely valued Index objects.
For example, here A has 3x trial columns, which prevents concat:
A = pd.DataFrame([[3, 1, 4, 1]], columns=['id', 'trial', 'trial', 'trial'])
# id trial trial trial
# 0 3 1 4 1
B = pd.DataFrame([[5, 9], [2, 6]], columns=['id', 'trial'])
# id trial
# 0 5 9
# 1 2 6
pd.concat([A, B], ignore_index=True)
# InvalidIndexError: Reindexing only valid with uniquely valued Index objects
To fix this, deduplicate the column names before concat:
parser = pd.io.parsers.base_parser.ParserBase({'usecols': None})
for df in [A, B]:
df.columns = parser._maybe_dedup_names(df.columns)
pd.concat([A, B], ignore_index=True)
# id trial trial.1 trial.2
# 0 3 1 4 1
# 1 5 9 NaN NaN
# 2 2 6 NaN NaN
Or as a one-liner but less readable:
pd.concat([df.set_axis(parser._maybe_dedup_names(df.columns), axis=1) for df in [A, B]], ignore_index=True)
Note that for pandas <1.3.0, use: parser = pd.io.parsers.ParserBase({})
I had this problem today using any of concat, append or merge, and I got around it by adding a helper column sequentially numbered and then doing an outer join
helper=1
for i in df1.index:
df1.loc[i,'helper']=helper
helper=helper+1
for i in df2.index:
df2.loc[i,'helper']=helper
helper=helper+1
df1.merge(df2,on='helper',how='outer')