How to select and replace similar occurrences in a column - python

I'm working on an ML project for a class. I'm currently cleaning the data and I encountered a problem. I have a column (identified as dtype object) that holds ratings about a certain aspect of a hotel. When I checked the values of this column and how often they appear, I noticed that some of them are wrong (as you can see below, instead of ratings, some rows have a date as a value):
rating value_counts()
100 527
98 229
97 172
99 163
96 150
95 127
93 100
90 94
94 93
80 65
92 55
91 39
88 35
89 32
87 31
85 25
86 17
84 12
60 12
83 8
70 5
73 5
82 4
78 3
67 3
2018-11-11 3
20 2
81 2
2018-11-03 2
40 2
79 2
75 2
2018-10-26 2
2 1
2018-08-30 1
2018-09-03 1
2015-09-05 1
55 1
2018-10-12 1
2018-05-11 1
2018-11-14 1
2018-09-15 1
2018-04-07 1
2018-08-16 1
71 1
2018-09-18 1
2018-11-05 1
2018-02-04 1
NaN 1
What I want to do is replace all the values that look like dates with NaN so I can later fill them with appropriate values. Is there a good way to do this other than selecting each different date one by one and replacing it with NaN? Is there a way to select similar values (in this case, all the dates that start the same way, with 2018) and replace them all at once?
Thank you for taking the time to read this!!

There are multiple options to clean this data.
Option 1: Since the rating column is of object (string) type, search for strings containing '-' and replace them with np.nan:
df.loc[df['rating'].str.contains('-', na = False), 'rating'] = np.nan
Option 2: Convert the column to numeric, which will coerce the dates (and any other non-numeric strings) to NaN:
df['rating'] = pd.to_numeric(df['rating'], errors = 'coerce')
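A small hedged sketch putting Option 2 to work end to end, assuming the column is named rating as in the value_counts output above:
import pandas as pd

# coerce anything non-numeric (the date strings) to NaN
df['rating'] = pd.to_numeric(df['rating'], errors='coerce')

# sanity check: how many entries are NaN now (the coerced dates plus the original NaN)
print(df['rating'].isna().sum())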

Related

Pandas.DataFrame: How to sort rows by the largest value in each row

I have a dataframe as in the figure (the result of a word2vec analysis). I need to sort the rows in descending order by the largest value in each row, so that the order of the rows after sorting is as indicated by the red numbers in the image.
Thanks
Michael
Find the max on axis=1, sort this Series of maxes in descending order, and reindex using the resulting index.
Sample df
A B C D E F
0 95 86 29 38 79 18
1 15 8 34 46 71 50
2 29 9 78 97 83 45
3 88 25 17 83 78 77
4 40 82 3 0 78 38
df_final = df.reindex(df.max(axis=1).sort_values(ascending=False).index)
Out[675]:
A B C D E F
2 29 9 78 97 83 45
0 95 86 29 38 79 18
3 88 25 17 83 78 77
4 40 82 3 0 78 38
1 15 8 34 46 71 50
You can use .max(axis=1) to find the row-wise max and then use .argsort() to return the integer indices that would sort the Series values. Finally, use .loc to arrange the rows in the desired sequence:
df.loc[df.max(axis=1).argsort()[::-1]]
([::-1] added for descending order. Remove it for ascending order)
Input:
1 2 3 4
0 0.32 -1.09 -0.040000 0.600062
1 -0.32 1.19 3.287113 0.620000
2 2.04 1.23 1.010000 1.320000
Output:
1 2 3 4
1 -0.32 1.19 3.287113 0.620000
2 2.04 1.23 1.010000 1.320000
0 0.32 -1.09 -0.040000 0.600062
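A hedged caveat on the .loc version: argsort returns integer positions, not index labels, so .loc only lines up correctly when the index is the default 0..n-1 range (as in the samples above). A variant with .iloc, which is position-based and works for any index:
df.iloc[df.max(axis=1).to_numpy().argsort()[::-1]]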

Adding a column to a dataframe through a mapping between 2 dataframes in Python?

I asked something similar yesterday but I had to rephrase the question and change the dataframes that I'm using. So here is my question again:
I have a dataframe called df_location. In this dataframe I have duplicated ids because each id has a timestamp.
location = {'location_id': [1,1,1,1,2,2,2,3,3,3,4,5,6,7,8,9,10],
            'temperature_value': [20,21,22,23,24,25,27,28,29,30,31,32,33,34,35,36,37],
            'humidity_value': [60,61,62,63,64,65,66,67,68,69,70,71,72,73,74,75,76]}
df_location = pd.DataFrame(location)
I have another dataframe called df_islands:
islands = {'island_id': [10,20,30,40,50,60],
           'list_of_locations': [[1],[2,3],[4,5],[6,7,8],[9],[10]]}
df_islands = pd.DataFrame(islands)
What I am trying to achieve is to map the values of list_of_locations to location_id. If the values match, then the island_id for that location should be appended to a new column in df_location.
(Note: I don't want to remove any duplicated ids; I need to keep them as they are.)
Resulting dataframe:
final_dataframe = {'location_id': [1,1,1,1,2,2,2,3,3,3,4,5,6,7,8,9,10],
                   'temperature_value': [20,21,22,23,24,25,27,28,29,30,31,32,33,34,35,36,37],
                   'humidity_value': [60,61,62,63,64,65,66,67,68,69,70,71,72,73,74,75,76],
                   'island_id': [10,10,10,10,20,20,20,20,20,20,30,30,40,40,40,50,60]}
df_final_dataframe = pd.DataFrame(final_dataframe)
This is just a sample of the dataframe that I have; the real one has 13,000,0000 rows and 4 columns. How can this be achieved in an efficient way? Is there a pythonic way to do it? I tried using for loops, but it takes too long and still didn't work. I would really appreciate it if someone could give me a solution to this problem.
Here's a solution:
island_lookup = df_islands.explode("list_of_locations").rename(columns = {"list_of_locations": "location"})
pd.merge(df_location, island_lookup, left_on="location_id", right_on="location").drop("location", axis=1)
The output is:
location_id temperature_value humidity_value island_id
0 1 20 60 10
1 1 21 61 10
2 1 22 62 10
3 1 23 63 10
4 2 24 64 20
5 2 25 65 20
6 2 27 66 20
7 3 28 67 20
8 3 29 68 20
9 3 30 69 20
10 4 31 70 30
11 5 32 71 30
12 6 33 72 40
13 7 34 73 40
14 8 35 74 40
15 9 36 75 50
16 10 37 76 60
If some of the locations don't have a matching island_id, but you'd still like to see them in the results (with island_id NaN), use how="left" in the merge statement, as in:
island_lookup = df_islands.explode("list_of_locations").rename(columns = {"list_of_locations": "location"})
pd.merge(df_location, island_lookup,
left_on="location_id",
right_on="location",
how = "left").drop("location", axis=1)
The result would be (note location-id 12 on row 3):
location_id temperature_value humidity_value island_id
0 1 20 60 10.0
1 1 21 61 10.0
2 1 22 62 10.0
3 12 23 63 NaN
4 2 24 64 20.0
5 2 25 65 20.0
6 2 27 66 20.0
...
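One hedged caveat about the explode approach: after explode, the list_of_locations column typically keeps object dtype, and some pandas versions refuse to merge an object key against the int64 location_id. If the merge complains about mismatched key dtypes, cast the key first, roughly:
island_lookup = df_islands.explode("list_of_locations").rename(columns={"list_of_locations": "location"})
island_lookup["location"] = island_lookup["location"].astype(df_location["location_id"].dtype)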

Single column with value counts from multiple column dataframe

I would like to sum the frequencies over multiple columns with pandas. The number of columns can vary between 2 and 15. Here is an example with just 3 columns:
code1 code2 code3
27 5 56
534 27 78
27 312 55
89 312 27
And I would like to have the following result:
code frequency
5 1
27 4
55 1
56 1
78 1
89 1
312 2
534 1
Counting values inside one column is not the problem; what I need is the total frequency of each value across the whole dataframe, no matter how many columns it appears in.
You could stack and take the value_counts on the resulting series:
df.stack().value_counts().sort_index()
5 1
27 4
55 1
56 1
78 1
89 1
312 2
534 1
dtype: int64
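If you want the exact two-column code/frequency layout from the question, a hedged follow-up on the same idea:
out = (df.stack()
         .value_counts()
         .sort_index()
         .rename_axis('code')
         .reset_index(name='frequency'))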

Compare Relative Start Dates in Pandas

I would like to create a table of relative start dates using the output of a Pandas pivot table. The columns of the pivot table are months, the rows are accounts, and the cells are a running total of actions. For example:
Date1 Date2 Date3 Date4
1 1 2 3
N/A 1 2 2
The first row's first instance is Date1.
The second row's first instance is Date2.
The new table would be formatted such that the columns are now the months relative to the first action and would look like:
FirstMonth SecondMonth ThirdMonth
1 1 2
1 2 2
Creating the initial pivot table is straightforward in pandas; I'm curious if there are any suggestions for how to develop the table of relative starting points. Thank you!
First, make sure your dataframe columns are actual datetime values (for string labels, df.columns = pd.to_datetime(df.columns) does the conversion). Then you can run the following to sum the actions for each date and group those sums by month:
>>>df
2019-01-01 2019-01-02 2019-02-01
Row
0 4 22 40
1 22 67 86
2 72 27 25
3 0 26 60
4 44 62 32
5 73 86 81
6 81 17 58
7 88 29 21
>>>df.sum().groupby(df.sum().index.month).sum()
1 720
2 403
And if you want it to reflect what you had above:
>>> out = df.sum().groupby(df.sum().index.month).sum().to_frame().T
>>> out.columns = [datetime.datetime.strftime(datetime.datetime.strptime(str(x),'%m'),'%B') for x in out.columns]
>>> out
January February
0 720 403
And if I misunderstood you, and you want it broken out by record / row:
>>> df.T.groupby(df.T.index.month).sum().T
1 2
Row
0 26 40
1 89 86
2 99 25
3 26 60
4 106 32
5 159 81
6 98 58
7 117 21
Rename the columns as above.
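A slightly simpler, hedged way to do that renaming, assuming the column labels are month numbers as in the outputs above (shown here on the out frame; assign the per-row result to a name first to do the same there):
import calendar
out.columns = [calendar.month_name[m] for m in out.columns]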
The trick is to use .apply() combined with dropna().
df.T.apply(lambda x: pd.Series(x.dropna().values)).T
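In other words: transposing makes each account a column, apply with dropna repacks each column's non-NaN values so they start at position 0, and transposing back yields columns that are relative to each row's first action. A small hedged sketch using the pivot table from the question:
import numpy as np
import pandas as pd

# pivot table from the question: columns are months, rows are accounts
pivot = pd.DataFrame({'Date1': [1, np.nan],
                      'Date2': [1, 1],
                      'Date3': [2, 2],
                      'Date4': [3, 2]})

relative = pivot.T.apply(lambda x: pd.Series(x.dropna().values)).T
# row 0 -> 1, 1, 2, 3     (first action in Date1)
# row 1 -> 1, 2, 2, NaN   (first action in Date2, so everything shifts left)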

dataframe concatenating with indexing

I have a Python dataframe that is read from a file.
The next step I do is break the dataset into 2 datasets, df_LastYear & df_ThisYear.
Note that the index is not continuous; 2 & 6 are missing.
ID AdmissionAge
0 14 68
1 22 86
3 78 40
4 124 45
5 128 35
7 148 92
8 183 71
9 185 98
10 219 79
After applying some predictive models I get the predicted values y_ThisYear:
Prediction
0 2.400000e+01
1 1.400000e+01
2 1.000000e+00
3 2.096032e+09
4 2.000000e+00
5 -7.395179e+11
6 6.159412e+06
7 5.592327e+07
8 5.303477e+08
9 5.500000e+00
10 6.500000e+00
I am trying to concat both datasets df_ThisYear and y_ThisYear into one dataset, but I always get these results:
ID AdmissionAge Prediction
0 14.0 68.0 2.400000e+01
1 22.0 86.0 1.400000e+01
2 NaN NaN 1.000000e+00
3 78.0 40.0 2.096032e+09
4 124.0 45.0 2.000000e+00
5 128.0 35.0 -7.395179e+11
6 NaN NaN 6.159412e+06
7 148.0 92.0 5.592327e+07
8 183.0 71.0 5.303477e+08
9 185.0 98.0 5.500000e+00
10 219.0 79.0 6.500000e+00
There are NaNs which did not exist before.
I found that these NaNs belong to the indices that were not included in df_ThisYear.
Therefore I tried resetting the index to get continuous indices. I used
df_ThisYear.reset_index(drop=True)
but I am still getting the same indices.
How do I fix this problem so I can concatenate df_ThisYear with y_ThisYear correctly?
Then you just need join:
df.join(Y)
ID AdmissionAge Prediction
0 14 68 2.400000e+01
1 22 86 1.400000e+01
3 78 40 2.096032e+09
4 124 45 2.000000e+00
5 128 35 -7.395179e+11
7 148 92 5.592327e+07
8 183 71 5.303477e+08
9 185 98 5.500000e+00
10 219 79 6.500000e+00
If you are really excited about using concat, you can pass 'inner' to the join argument:
pd.concat([df_ThisYear, y_ThisYear], axis=1, join='inner')
This returns
Out[6]:
ID AdmissionAge Prediction
0 14 68 2.400000e+01
1 22 86 1.400000e+01
3 78 40 2.096032e+09
4 124 45 2.000000e+00
5 128 35 -7.395179e+11
7 148 92 5.592327e+07
8 183 71 5.303477e+08
9 185 98 5.500000e+00
10 219 79 6.500000e+00
Because y_ThisYear has a different index than df_ThisYear, when I joined both using
df_ThisYear.join(y_ThisYear)
it matched each number to its matching index.
I know this is right if the indices actually represent the same record, i.e. the value at index 7 in df_ThisYear matches index 7 in y_ThisYear too.
In my case I just want to match the first record in y_ThisYear to the first in df_ThisYear, regardless of their index numbers.
I found this code that does that (note that reset_index returns a new DataFrame rather than modifying it in place, which is why calling it earlier without assigning the result had no effect):
df_ThisYear = pd.concat([df_ThisYear.reset_index(drop=True), pd.DataFrame(y_ThisYear)], axis=1)
Thanks to everyone who helped with the answer.
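For the general case, a hedged variant that resets both indices (assuming y_ThisYear is, or can be wrapped as, a DataFrame) so the alignment is purely positional, whatever the original indices were:
result = pd.concat([df_ThisYear.reset_index(drop=True),
                    pd.DataFrame(y_ThisYear).reset_index(drop=True)],
                   axis=1)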
