DataFrame concatenation with indexing - Python

I have a Python DataFrame that I read from a file.
The next step is to break the dataset into two datasets, df_LastYear & df_ThisYear.
Note: the index is not continuous; 2 and 6 are missing.
ID AdmissionAge
0 14 68
1 22 86
3 78 40
4 124 45
5 128 35
7 148 92
8 183 71
9 185 98
10 219 79
After applying some predictive models, I get the predicted values y_ThisYear:
Prediction
0 2.400000e+01
1 1.400000e+01
2 1.000000e+00
3 2.096032e+09
4 2.000000e+00
5 -7.395179e+11
6 6.159412e+06
7 5.592327e+07
8 5.303477e+08
9 5.500000e+00
10 6.500000e+00
I am trying to concat the datasets df_ThisYear and y_ThisYear into one dataset,
but I always get these results:
ID AdmissionAge Prediction
0 14.0 68.0 2.400000e+01
1 22.0 86.0 1.400000e+01
2 NaN NaN 1.000000e+00
3 78.0 40.0 2.096032e+09
4 124.0 45.0 2.000000e+00
5 128.0 35.0 -7.395179e+11
6 NaN NaN 6.159412e+06
7 148.0 92.0 5.592327e+07
8 183.0 71.0 5.303477e+08
9 185.0 98.0 5.500000e+00
10 219.0 79.0 6.500000e+00
There are NaNs which did not exist before.
I found that these NaNs belong to the indices that were not included in df_ThisYear.
Therefore I tried resetting the index to get continuous indices, using
df_ThisYear.reset_index(drop=True)
but I still get the same indices.
How can I fix this so that I can concatenate df_ThisYear with y_ThisYear correctly?

Then you just need join (here df is df_ThisYear and Y is y_ThisYear):
df.join(Y)
ID AdmissionAge Prediction
0 14 68 2.400000e+01
1 22 86 1.400000e+01
3 78 40 2.096032e+09
4 124 45 2.000000e+00
5 128 35 -7.395179e+11
7 148 92 5.592327e+07
8 183 71 5.303477e+08
9 185 98 5.500000e+00
10 219 79 6.500000e+00

If you really want to use concat, you can pass 'inner' to the join argument (note that pd.concat calls this parameter join, not how):
pd.concat([df_ThisYear, y_ThisYear], axis=1, join='inner')
This returns
Out[6]:
ID AdmissionAge Prediction
0 14 68 2.400000e+01
1 22 86 1.400000e+01
3 78 40 2.096032e+09
4 124 45 2.000000e+00
5 128 35 -7.395179e+11
7 148 92 5.592327e+07
8 183 71 5.303477e+08
9 185 98 5.500000e+00
10 219 79 6.500000e+00

Because y_ThisYear has a different index than df_ThisYear, when I joined both using
df_ThisYear.join(y_ThisYear)
it matched each number to its matching index.
That is correct only if the indices actually represent the same record, i.e. if index 7 in df_ThisYear matches index 7 in y_ThisYear.
In my case I just want to match the first record in y_ThisYear to the first record in df_ThisYear, regardless of their index numbers.
I found this code that does that (reset_index(drop=True) returns a new DataFrame, so it has to be passed into concat or assigned back -- which is why calling it on its own earlier seemed to have no effect):
df_ThisYear = pd.concat([df_ThisYear.reset_index(drop=True), pd.DataFrame(y_ThisYear)], axis=1)
Thanks to everyone who helped with the answer.
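The accepted fix can be sketched end to end (a minimal reconstruction; the prediction values here are shortened stand-ins for the model output in the question):

```python
import pandas as pd

# df_ThisYear has a non-continuous index (2 and 6 are missing),
# while y_ThisYear has a fresh 0..8 index.
df_ThisYear = pd.DataFrame(
    {"ID": [14, 22, 78, 124, 128, 148, 183, 185, 219],
     "AdmissionAge": [68, 86, 40, 45, 35, 92, 71, 98, 79]},
    index=[0, 1, 3, 4, 5, 7, 8, 9, 10])
y_ThisYear = pd.DataFrame({"Prediction": [24.0, 14.0, 1.0, 2.1e9, 2.0,
                                          -7.4e11, 6.2e6, 5.6e7, 5.3e8]})

# reset_index returns a NEW frame -- it must be assigned back (or passed
# straight into concat as here), otherwise the old index is kept.
result = pd.concat([df_ThisYear.reset_index(drop=True), y_ThisYear], axis=1)
print(result)
```

Because both frames now share the default 0..8 index, concat aligns rows by position and no NaNs appear.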

Related

Swap or transpose data for stacked bar chart in Matplotlib

I'm trying to generate some stacked bar charts. I'm using this data:
index 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17
0 No 94 123 96 108 122 106.0 95.0 124 104 118 73 82 106 124 109 70 59
1 Yes 34 4 33 21 5 25.0 34.0 5 21 9 55 46 21 3 19 59 41
2 Dont know 1 2 1 1 2 NaN NaN 1 4 2 2 2 2 2 2 1 7
Basically I want to use the column names as x and the Yes, No, Don't know rows as the y values. Here is my code and the result I have at the moment:
ax = dfu.plot.bar(x='index', stacked=True)
UPDATE:
Here is an example:
data = [{0:1,1:2,2:3},{0:3,1:2,2:1},{0:1,1:1,2:1}]
index = ["yes","no","dont know"]
df = pd.DataFrame(data,index=index)
df.T.plot.bar(stacked=True) # Note .T is used to transpose the DataFrame
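Applied to data shaped like the question's (a sketch; this stand-in dfu only has columns 1..4 instead of the real 1..17), the same transpose idea looks like this:

```python
import pandas as pd

# A stand-in for dfu: an 'index' column holding the series labels
# plus the numbered data columns.
dfu = pd.DataFrame({
    "index": ["No", "Yes", "Dont know"],
    1: [94, 34, 1],
    2: [123, 4, 2],
    3: [96, 33, 1],
    4: [108, 21, 1],
})

# Move the labels into the index, then transpose so the former column
# names (1..4) become the x axis and each label becomes one stack segment.
plot_data = dfu.set_index("index").T

# plot_data.plot.bar(stacked=True)  # draws the stacked bars
print(plot_data)
```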

Pandas.DataFrame: How to sort rows by the largest value in each row

I have a dataframe as in the figure (the result of a word2vec analysis). I need to sort the rows in descending order by the largest value in each row, so that the order of the rows after sorting is as indicated by the red numbers in the image.
Thanks
Michael
Find the max along axis=1, sort this series of maxes, and reindex using the resulting index.
Sample df
A B C D E F
0 95 86 29 38 79 18
1 15 8 34 46 71 50
2 29 9 78 97 83 45
3 88 25 17 83 78 77
4 40 82 3 0 78 38
df_final = df.reindex(df.max(1).sort_values(ascending=False).index)
Out[675]:
A B C D E F
2 29 9 78 97 83 45
0 95 86 29 38 79 18
3 88 25 17 83 78 77
4 40 82 3 0 78 38
1 15 8 34 46 71 50
You can use .max(axis=1) to find the row-wise max and then .argsort() to get the integer positions that would sort the Series. Finally, use .iloc to arrange the rows in that order (.loc also works here, but only because the index is the default RangeIndex, where labels and positions coincide):
df.iloc[df.max(axis=1).argsort()[::-1]]
([::-1] is added for descending order; remove it for ascending order.)
Input:
1 2 3 4
0 0.32 -1.09 -0.040000 0.600062
1 -0.32 1.19 3.287113 0.620000
2 2.04 1.23 1.010000 1.320000
Output:
1 2 3 4
1 -0.32 1.19 3.287113 0.620000
2 2.04 1.23 1.010000 1.320000
0 0.32 -1.09 -0.040000 0.600062
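Both answers can be checked against the sample frame from the first answer (a quick runnable sketch):

```python
import pandas as pd

df = pd.DataFrame({"A": [95, 15, 29, 88, 40],
                   "B": [86, 8, 9, 25, 82],
                   "C": [29, 34, 78, 17, 3],
                   "D": [38, 46, 97, 83, 0],
                   "E": [79, 71, 83, 78, 78],
                   "F": [18, 50, 45, 38, 38]})

# Row-wise max, sorted descending, then reorder the rows by that index.
order = df.max(axis=1).sort_values(ascending=False).index
df_final = df.reindex(order)
print(df_final)

# The argsort variant gives the same order on this default RangeIndex.
df_alt = df.iloc[df.max(axis=1).argsort()[::-1]]
assert list(df_alt.index) == list(df_final.index)
```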

Calculate mean of df, BUT if >=1 of the values differs >20% from this mean, the mean is set to NaN

I want to calculate the mean of columns a, b, c, d of the dataframe, BUT if one of the four values in a row differs more than 20% from this mean (of the four values), the mean has to be set to NaN.
Calculating the mean of 4 columns is easy, but I'm stuck at expressing the condition 'if not (mean*0.8 <= value <= mean*1.2) for one of the values in the row, then mean = NaN'.
In the example, one or more of the values in ID 5 and ID 87 don't fit in the interval and therefore the mean is set to NaN.
(NaN values in the initial dataframe are ignored when calculating the mean and when applying the 20% condition to the calculated mean.)
So I'm trying to calculate the mean only for the rows with no 'outliers'.
Initial df:
ID a b c d
2 31 32 31 31
5 33 52 159 2
7 51 NaN 52 51
87 30 52 421 2
90 10 11 10 11
102 41 42 NaN 42
Desired df:
ID a b c d mean
2 31 32 31 31 31.25
5 33 52 159 2 NaN
7 51 NaN 52 51 51.33
87 30 52 421 2 NaN
90 10 11 10 11 10.50
102 41 42 NaN 42 41.67
Code:
import pandas as pd
import numpy as np


df = pd.DataFrame({"ID": [2,5,7,87,90,102],
"a": [31,33,51,30,10,41],
"b": [32,52,np.nan,52,11,42],
"c": [31,159,52,421,10,np.nan],
"d": [31,2,51,2,11,42]})

print(df)

a = df.loc[:, ['a','b','c','d']]

df['mean'] = a.mean(axis=1)

print(df)

# This is where I'm stuck -- a chained comparison like this does not work
# on arrays (it raises "ValueError: The truth value of an array ... is
# ambiguous"); also note df.mean is the method, the new column is df['mean']:
b = df['mean'].values[:, None]*0.8 < a.values < df['mean'].values[:, None]*1.2
print(b)
...
Try this:
# extract the value columns
s = df.iloc[:, 1:]
# calculate the row mean (NaNs are skipped by default)
mean = s.mean(axis=1)
# where the condition is violated
mask = s.lt(mean*.8, axis=0) | s.gt(mean*1.2, axis=0)
# set the mean to NaN wherever the mask is True anywhere in the row
df['mean'] = mean.mask(mask.any(axis=1))
Output:
ID a b c d mean
0 2 31 32.0 31.0 31 31.250000
1 5 33 52.0 159.0 2 NaN
2 7 51 NaN 52.0 51 51.333333
3 87 30 52.0 421.0 2 NaN
4 90 10 11.0 10.0 11 10.500000
5 102 41 42.0 NaN 42 41.666667
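Put together as a runnable sketch of the masking approach, using the question's own data:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"ID": [2, 5, 7, 87, 90, 102],
                   "a": [31, 33, 51, 30, 10, 41],
                   "b": [32, 52, np.nan, 52, 11, 42],
                   "c": [31, 159, 52, 421, 10, np.nan],
                   "d": [31, 2, 51, 2, 11, 42]})

s = df[["a", "b", "c", "d"]]           # the value columns
mean = s.mean(axis=1)                   # row mean, NaNs skipped by default
# A row violates the condition if any value falls outside mean +/- 20%;
# comparisons against NaN are False, so NaNs are ignored here too.
mask = s.lt(mean * 0.8, axis=0) | s.gt(mean * 1.2, axis=0)
df["mean"] = mean.mask(mask.any(axis=1))
print(df)
```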

How to select and replace similar occurrences in a column

I'm working on an ML project for a class. I'm currently cleaning the data and I encountered a problem. I have a column (dtype object) that holds ratings of a certain aspect of a hotel. When I checked the values of this column and the frequency in which they appear, I noticed that there are some wrong values in it (as you can see below, instead of ratings, some rows have a date as a value):
rating value_counts()
100 527
98 229
97 172
99 163
96 150
95 127
93 100
90 94
94 93
80 65
92 55
91 39
88 35
89 32
87 31
85 25
86 17
84 12
60 12
83 8
70 5
73 5
82 4
78 3
67 3
2018-11-11 3
20 2
81 2
2018-11-03 2
40 2
79 2
75 2
2018-10-26 2
2 1
2018-08-30 1
2018-09-03 1
2015-09-05 1
55 1
2018-10-12 1
2018-05-11 1
2018-11-14 1
2018-09-15 1
2018-04-07 1
2018-08-16 1
71 1
2018-09-18 1
2018-11-05 1
2018-02-04 1
NaN 1
What I want to do is replace all the values that look like dates with NaN so I can later fill them with appropriate values. Is there a good way to do this other than selecting each date one by one and replacing it with NaN? Is there a way to select similar values (in this case, all the dates that start the same way, with 2018) and replace them all?
Thank you for taking the time to read this!!
There are multiple options to clean this data.
Option 1: The rating column is of object type, so search the strings for '-' and replace matches with np.nan:
df.loc[df['rating'].str.contains('-', na = False), 'rating'] = np.nan
Option 2: Convert the column to numeric, which will coerce the dates to NaN:
df['rating'] = pd.to_numeric(df['rating'], errors = 'coerce')
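Option 2 in action (a small sketch with made-up values mirroring the column):

```python
import pandas as pd

df = pd.DataFrame({"rating": ["100", "98", "2018-11-11", "95", "2018-10-26"]})

# Anything that cannot be parsed as a number (here, the dates) becomes NaN.
df["rating"] = pd.to_numeric(df["rating"], errors="coerce")
print(df)
```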

Pandas plotting two columns with series defined by value in third column

Hi, I have a pandas dataframe that looks like:
deflector wFlow aContent DO Difference
64 3 127.5 10 2.007395
65 3 127.5 3 1.163951
66 3 127.5 5 1.451337
67 3 127.5 7 1.535639
68 3 24.0 10 1.046328
69 3 24.0 3 0.854763
70 3 24.0 5 0.766780
71 3 24.0 7 0.905270
72 3 56.0 10 1.274954
73 3 56.0 3 1.298657
74 3 56.0 5 1.049621
75 3 56.0 7 1.004255
76 3 88.0 10 1.194174
77 3 88.0 3 1.056968
78 3 88.0 5 1.066173
79 3 88.0 7 1.097231
I would like to plot the aContent column vs. the DO Difference column, with one line per wFlow value (x = aContent, y = DO Difference; 4 different lines, one for each wFlow).
Thanks!
You can pivot the data and use pandas.DataFrame.plot:
df.pivot(index='aContent',columns='wFlow',values='DO Difference').plot()
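A minimal sketch of the pivot, using a few rows shaped like the question's data (only two of the four wFlow values are included here):

```python
import pandas as pd

df = pd.DataFrame({
    "deflector": [3] * 8,
    "wFlow": [127.5, 127.5, 127.5, 127.5, 24.0, 24.0, 24.0, 24.0],
    "aContent": [10, 3, 5, 7, 10, 3, 5, 7],
    "DO Difference": [2.007, 1.164, 1.451, 1.536, 1.046, 0.855, 0.767, 0.905],
})

# One column per wFlow value with aContent on the index,
# so .plot() draws one line per wFlow.
pivoted = df.pivot(index="aContent", columns="wFlow", values="DO Difference")
# pivoted.plot()  # draws the lines
print(pivoted)
```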
