How to exclude data present on another dataframe? - python

I'm trying to exclude rows from one data frame whose IDs appear in another data frame, using pandas in Jupyter. An example of the data frames can be seen below.
Data frame 1:
ID     Amount
AB-01  2.65
AB-02  3.6
AB-03  5.6
AB-04  7.6
AB-05  2
Data frame 2:
ID     Amount
AB-01  2.65
AB-02  3.6
Desired outcome:
ID     Amount
AB-03  5.6
AB-04  7.6
AB-05  2

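You can use isin to build a boolean mask of the IDs that appear in df2, then negate it with ~ to keep only the other rows: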
out = df1[~df1['ID'].isin(df2['ID'])]
print(out)
      ID  Amount
2  AB-03     5.6
3  AB-04     7.6
4  AB-05     2.0
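For comparison, the same exclusion can be sketched as an anti-join with merge and its indicator flag. This is only an illustrative alternative to isin, not something the answer above proposes; the frames are rebuilt from the example data in the question, and the names merged and out are chosen here for illustration.

```python
import pandas as pd

# Example data copied from the question.
df1 = pd.DataFrame({
    "ID": ["AB-01", "AB-02", "AB-03", "AB-04", "AB-05"],
    "Amount": [2.65, 3.6, 5.6, 7.6, 2.0],
})
df2 = pd.DataFrame({"ID": ["AB-01", "AB-02"], "Amount": [2.65, 3.6]})

# Left-merge on ID only; indicator=True adds a "_merge" column that says
# whether each row was found in both frames or only in the left one.
merged = df1.merge(df2[["ID"]].drop_duplicates(), on="ID", how="left", indicator=True)

# Keep the rows that exist only in df1, then drop the helper column.
out = merged.loc[merged["_merge"] == "left_only"].drop(columns="_merge")
print(out)
```

The isin version is shorter for a single key column; the merge form may be easier to extend when the match has to be made on several columns at once.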

Related

Move index values into column names in pandas Data Frame

I'm trying to reshape a multi-indexed data frame so that the values from the second level of the index are incorporated into the column names in the new data frame. In the data frame below, I want to move A and B from "source" into the columns so that I have s1_A, s1_B, s2_A, ..., s3_B.
I've tried creating the structure of the new data frame explicitly and populating it with a nested for loop to reassign the values, but it is excruciatingly slow. I've tried a number of functions from the pandas API, but without much luck. Any help would be much appreciated.
import numpy as np
import pandas as pd
midx = pd.MultiIndex.from_product( [[1,2,3], ['A','B']], names=["sample","source"])
df = pd.DataFrame( index=midx, columns=['s1', 's2', 's3'], data=np.ndarray(shape=(6,3)) )  # np.ndarray leaves the values uninitialized
>>> df
                s1   s2   s3
sample source
1      A       1.2  3.4  5.6
       B       1.2  3.4  5.6
2      A       1.2  3.4  5.6
       B       1.2  3.4  5.6
3      A       1.2  3.4  5.6
       B       1.2  3.4  5.6
# Want to build a new data frame that looks like this:
>>> df_new
        s1_A  s1_B  s2_A  s2_B  s3_A  s3_B
sample
1        1.2   1.2   3.4   3.4   5.6   5.6
2        1.2   1.2   3.4   3.4   5.6   5.6
3        1.2   1.2   3.4   3.4   5.6   5.6
Here's how I'm currently doing it. It's extremely slow, and I know there must be a more idiomatic way to do this with pandas, but I'm still new to its API:
substances = df.columns.values
sources = ['A','B']
subst_and_src = sorted([ subst + "_" + src for src in sources for subst in substances ])
df_new = pd.DataFrame(index=df.index.unique(0), columns=subst_and_src)
# Runs forever
for (sample, source) in df.index:
    for subst in df.columns:
        df_new[sample, subst + "_" + source] = df.loc[(sample,source), subst]
df = df.unstack(level=1)
df.columns = ['_'.join(col).strip() for col in df.columns.values]
print(df)
Prints:
                 s1_A           s1_B  s2_A  s2_B           s3_A           s3_B
sample
1       4.665045e-310  6.904071e-310   0.0   0.0  6.903913e-310  2.121996e-314
2       6.904071e-310   0.000000e+00   0.0   0.0  3.458460e-323   0.000000e+00
3        0.000000e+00   0.000000e+00   0.0   0.0   0.000000e+00   0.000000e+00
Unstack into a new dataframe and collapse the multilevel columns of the resulting frame with str.format:
df1 = df.unstack()
df1.columns = df1.columns.map('{0[0]}_{0[1]}'.format)
        s1_A  s1_B  s2_A  s2_B  s3_A  s3_B
sample
1        1.2   1.2   3.4   3.4   5.6   5.6
2        1.2   1.2   3.4   3.4   5.6   5.6
3        1.2   1.2   3.4   3.4   5.6   5.6
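A self-contained sketch of the unstack approach from the answers above. It assumes a deterministic array built with np.tile in place of the question's uninitialized np.ndarray, so that the values match the table shown in the question, and it joins the two column levels with an f-string in a list comprehension. The name wide is chosen here for illustration.

```python
import numpy as np
import pandas as pd

midx = pd.MultiIndex.from_product([[1, 2, 3], ["A", "B"]], names=["sample", "source"])

# Assumption: every row holds 1.2, 3.4, 5.6, as in the question's printed frame.
data = np.tile([1.2, 3.4, 5.6], (6, 1))
df = pd.DataFrame(data, index=midx, columns=["s1", "s2", "s3"])

# Move the "source" index level into the columns: columns become (s1, A), (s1, B), ...
wide = df.unstack("source")

# Flatten the two-level column index into single strings such as "s1_A".
wide.columns = [f"{subst}_{src}" for subst, src in wide.columns]
print(wide)
```

Because unstack works on the whole frame at once, it avoids the cell-by-cell assignment that made the nested loop slow.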

Complex Transformation of GB to TB in mixed column using Python

I have a dataframe, df, that looks like this:
TB TB2
4.6 5.0
6.8 502.4 G
My desired output is to convert any value that has the letter G behind it to TB, without disturbing the other values within that column.
1000 Gigabytes = 1 Terabyte
TB TB2
4.6 5.0
6.8 0.5024
A member has suggested the following code:
df['Partial_Capacity TB']=df['Partial_Capacity TB'].str.replace(r'\s\w+','',regex=True).astype(float).div(1000)
However, all of the values within the column are being converted, regardless of the 'G' behind it.
I am working on this now, any suggestions are appreciated
#Select rows containing G (na=False treats non-string values as non-matches)
m=df.TB2.str.contains('G', na=False)
#Use the loc accessor to mask the relevant rows, strip the G suffix from the string and divide by 1000
df.loc[m,'TB2']=df.loc[m,'TB2'].str.replace(r'\s\w+','',regex=True).astype(float).div(1000)
print(df)
TB TB2
0 4.6 5.0
1 6.8 0.5024
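The masking idea can also be sketched end to end on a small frame. This is a minimal example under stated assumptions: TB2 is assumed to hold strings such as "5.0" and "502.4 G", the suffix is assumed to be exactly the letter G at the end of the value, and the names is_gb and values are chosen here for illustration.

```python
import pandas as pd

# Assumed input: TB2 is a string column where gigabyte values carry a trailing " G".
df = pd.DataFrame({"TB": [4.6, 6.8], "TB2": ["5.0", "502.4 G"]})

# Flag the rows that are expressed in gigabytes.
is_gb = df["TB2"].str.contains("G", na=False)

# Strip the optional trailing " G" and parse everything as a number.
values = pd.to_numeric(df["TB2"].str.replace(r"\s*G$", "", regex=True))

# Keep terabyte values as they are; divide only the flagged rows by 1000.
df["TB2"] = values.where(~is_gb, values / 1000)
print(df)
```

Converting the whole column to numbers in one pass leaves TB2 with a float dtype, which a row-by-row assignment into a string column would not do.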

How to create new columns by looping through columns in different dataframes?

I have two pd.dataframes:
df1:
Year Replaced Not_replaced
2015 1.5 0.1
2016 1.6 0.3
2017 2.1 0.1
2018 2.6 0.5
df2:
Year HI LO RF
2015 3.2 2.9 3.0
2016 3.0 2.8 2.9
2017 2.7 2.5 2.6
2018 2.6 2.2 2.3
I need to create a third df3 by using the following equation:
df3['column1']=df1['Replaced']-df1['Not_replaced']+df2['HI']
df3['column2']=df1['Replaced']-df1['Not_replaced']+df2['LO']
df3['column3']=df1['Replaced']-df1['Not_replaced']+df2['RF']
I can merge the two dataframes and manually create 3 new columns one by one, but I can't figure out how to use the loop function to create the results.
You can create an empty dataframe & fill it with values while looping
(Note: col_names & df3.columns must be of the same length)
df3 = pd.DataFrame(columns = ['column1','column2','column3'])
col_names = ["HI", "LO","RF"]
for incol,df3column in zip(col_names,df3.columns):
    df3[df3column] = df1['Replaced']-df1['Not_replaced']+df2[incol]
print(df3)
output
column1 column2 column3
0 4.6 4.3 4.4
1 4.3 4.1 4.2
2 4.7 4.5 4.6
3 4.7 4.3 4.4
For the loop, I would first merge df1 and df2 to create a new dataframe called df3. Then I would create a list of the names of the columns you want to iterate through:
df3 = df1.merge(df2, on='Year')
col_names = ["HI", "LO","RF"]
for col in col_names:
    df3[f"column_{col}"] = df3['Replaced']-df3['Not_replaced']+df3[col]
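The merge-then-loop idea can be sketched as a runnable example. The frames are rebuilt from the tables in the question; the names merged and net and the column_HI style of output names are choices made here for illustration. Merging on Year first means rows are matched by year instead of by position.

```python
import pandas as pd

# Data copied from the question.
df1 = pd.DataFrame({
    "Year": [2015, 2016, 2017, 2018],
    "Replaced": [1.5, 1.6, 2.1, 2.6],
    "Not_replaced": [0.1, 0.3, 0.1, 0.5],
})
df2 = pd.DataFrame({
    "Year": [2015, 2016, 2017, 2018],
    "HI": [3.2, 3.0, 2.7, 2.6],
    "LO": [2.9, 2.8, 2.5, 2.2],
    "RF": [3.0, 2.9, 2.6, 2.3],
})

# Align the two frames on Year so the arithmetic pairs matching years.
merged = df1.merge(df2, on="Year")

# The part of the equation that is the same for every output column.
net = merged["Replaced"] - merged["Not_replaced"]

# Build one output column per scenario column.
df3 = pd.DataFrame({"Year": merged["Year"]})
for col in ["HI", "LO", "RF"]:
    df3[f"column_{col}"] = net + merged[col]

print(df3.round(2))
```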

reorder panda data frame columns vertical

I'm a bit new to pandas and have some diabetic data that I would like to reorder.
I'd like to copy the data from column 'wakeup' through '23:00:00'
and stack it vertically, one column under the other, so that I get a single new dataframe column:
5.6
8.1
9.9
6.3
4.1
13.3
NAN
3.9
3.3
6.8
.....etc
I'm assuming the data is in a dataframe already. You can index the columns you want and then use melt as suggested. Without any parameters, melt will 'stack' all your data into one column of a new dataframe. There's another column created to identify the original column names, but you can drop that if needed.
df.loc[:, 'wakeup':'23:00:00'].melt()
variable value
0 wakeup 5.6
1 wakeup 8.1
2 wakeup 9.9
3 wakeup 6.3
4 wakeup 4.1
5 wakeup 13.3
6 wakeup NAN
7 09:30:00 3.9
8 09:30:00 3.3
9 09:30:00 6.8
...
You mention you want this as another column, but there's no sensible way to add it to your existing dataframe: the stacked column has one row per original cell, so its length will most likely not match.
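A small sketch of melt on made-up data, since the original dataframe is not shown. The weekday labels, the values in the 23:00:00 column, and the names long, time and measurement are all assumptions made for illustration. Passing weekday as id_vars keeps a row identifier next to each stacked value.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for the question's data; only the column names
# 'weekday', 'wakeup', '09:30:00' and '23:00:00' come from the thread.
df = pd.DataFrame({
    "weekday": ["Mon", "Tue", "Wed"],
    "wakeup": [5.6, 8.1, np.nan],
    "09:30:00": [3.9, 3.3, 6.8],
    "23:00:00": [7.2, 6.1, 5.0],
})

# Stack every measurement column under the previous one; NaN values are kept.
long = df.melt(id_vars="weekday", var_name="time", value_name="measurement")
print(long)

# The single stacked column on its own.
stacked = long["measurement"]
```

melt goes through the value columns in order, so all wakeup values come first, followed by the 09:30:00 values, and so on.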
I finally solved it myself; it took me quite some time.
Note that the original data is in df1 and the result ends up in dfAllMeasurements:
dfAllMeasurements = df1.loc[:, 'weekday':'23:00:00']
temp = dfAllMeasurements.set_index('weekday').stack(dropna=False) # index by weekday; dropna=False keeps the NaN values
dfAllMeasurements = temp.reset_index(drop=False, level=0).reset_index()

Pandas data manipulation - multiple measurements per line to one per line [duplicate]

This question already has answers here:
Reshape wide to long in pandas
(2 answers)
Closed 4 years ago.
I am manipulating a data frame using Pandas in Python to match a specific format.
I currently have a data frame with a row for each measurement location (A or B). Each row has a nominal target and multiple measured data points.
This is the format I currently have:
df=
Location Nominal Meas1 Meas2 Meas3
A 4.0 3.8 4.1 4.3
B 9.0 8.7 8.9 9.1
I need to manipulate this data so there is only one measured data point per row, and copy the Location and Nominal values from the source rows to the new rows. The measured data also needs to be put in the first column.
This is the format I need:
df =
Meas Location Nominal
3.8 A 4.0
4.1 A 4.0
4.3 A 4.0
8.7 B 9.0
8.9 B 9.0
9.1 B 9.0
I have tried concat and append functions with and without transpose() with no success.
This is the most similar example I was able to find, but it did not get me there:
for index, row in df.iterrows():
    pd.concat([row]*3, ignore_index=True)
Thank you!
It's a wide-to-long problem:
pd.wide_to_long(df,'Meas',i=['Location','Nominal'],j='drop').reset_index().drop(columns='drop')
Out[637]:
Location Nominal Meas
0 A 4.0 3.8
1 A 4.0 4.1
2 A 4.0 4.3
3 B 9.0 8.7
4 B 9.0 8.9
5 B 9.0 9.1
Another solution, using melt:
new_df = (df.melt(['Location','Nominal'],
                  ['Meas1', 'Meas2', 'Meas3'],
                  value_name = 'Meas')
            .drop('variable', axis=1)
            .sort_values('Location'))
>>> new_df
Location Nominal Meas
0 A 4.0 3.8
2 A 4.0 4.1
4 A 4.0 4.3
1 B 9.0 8.7
3 B 9.0 8.9
5 B 9.0 9.1
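Neither answer puts the measured value in the first column, which the question asks for. A minimal sketch of that last step, assuming the melt approach and the example frame from the question (the name long is chosen here for illustration):

```python
import pandas as pd

# Data copied from the question.
df = pd.DataFrame({
    "Location": ["A", "B"],
    "Nominal": [4.0, 9.0],
    "Meas1": [3.8, 8.7],
    "Meas2": [4.1, 8.9],
    "Meas3": [4.3, 9.1],
})

long = (
    # With no value_vars given, melt stacks every column that is not an id column.
    df.melt(id_vars=["Location", "Nominal"], value_name="Meas")
      # Sort by location, then by the original column name (Meas1, Meas2, Meas3).
      .sort_values(["Location", "variable"])
      .drop(columns="variable")
      .reset_index(drop=True)
)

# Reorder so the measurement comes first, as requested in the question.
long = long[["Meas", "Location", "Nominal"]]
print(long)
```

Sorting on both Location and variable makes the row order explicit instead of relying on how ties are broken when sorting on Location alone.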
