Highlighting Values in Pandas - python

Good Evening all,
I have made a pandas DataFrame from an Excel spreadsheet. I am trying to highlight the names in this list of people who logged in late, e.g. at 09:01:00: anyone who logged in at least 1 minute past the hour or half hour, but excluding those who logged in early, e.g. at 07:59:00 or 07:29:00. The rows I mean are the ones with * around the time. I am a complete amateur coder, so I apologise. If things could be put in the simplest form, without assuming a great degree of knowledge, I would very much appreciate it. Also, if this is incredibly complex or impossible, I apologise as well.
Name Login\nTime
0 ITo 07:59:09
1 Ann 07:59:13
2 Darryll 07:59:24
3 Darren 07:59:31
4 FlorR 07:59:42
5 Colm 07:59:56
6 NatashaBr 07:59:59
7 AlexRobe 07:59:59
8 JonathanSinn 08:00:02
9 BrendanJo 08:00:04
10 DanielCov 08:00:15
11 RW 08:00:17
12 SaraHerrma 08:00:26
13 RobertStew 08:00:37
14 JasonBal *08:04:36*
17 KevinAll 08:59:52
18 JFo 09:00:05
19 LiviaHarr 09:00:22
20 Patrick *09:01:36*
24 SianDi 09:30:32
25 AlisonBri 09:59:27
26 MMulholl 10:00:02
27 TiffanyThom 10:00:07
29 GeorgeEdw 11:00:00
30 JackSha 11:00:50
31 UsmanA 11:59:46
32 LewisBrad 12:02:30
34 RyanmacCor 12:59:20
35 GerardMcphil 12:59:56
36 TanjaN 13:00:07
37 MartinRichar 13:30:08
38 MarkBellin 13:30:20
39 KyranSpur 13:30:24
40 RichRam 13:58:53
41 OctavioSan 14:30:10
42 CharlesS 16:45:07
43 DanielHoll 16:50:55
44 ThomasHoll 16:59:45
45 RosieFl 16:59:56
46 CiaranMur 17:00:01
47 LouiseDa 17:29:29
48 WilliamAi 17:30:02

You can have a look at the pandas styling options. The Styler has an applymap function (renamed to map in pandas 2.1), which colors individual cells based on a condition of your choice, and an apply function, which does the same for whole rows or columns.
The documentation (https://pandas.pydata.org/pandas-docs/stable/user_guide/style.html) has some examples; you can go through these and decide how you want to highlight the column values.
Assuming you have a DataFrame called df, you can define your own styling function func() and apply it to your DataFrame:
df.style.apply(func)
You can write your styling function based on the examples in the documentation. Let me know if you have any more questions.
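To make this concrete, here is a minimal sketch of such a styling function. The question does not say exactly where "late" stops and "early for the next slot" begins, so the rule below is an assumption: a login counts as late if it falls at least 1 and fewer than 15 minutes after the hour or half hour. The column name "Login Time", the helper names is_late and style_late_logins, and the shortened sample data are also assumptions made for illustration.

```python
import pandas as pd

# A few rows from the question; the column name "Login Time" is an assumption.
df = pd.DataFrame({
    "Name": ["ITo", "JonathanSinn", "JasonBal", "Patrick", "SianDi", "CharlesS"],
    "Login Time": ["07:59:09", "08:00:02", "08:04:36", "09:01:36", "09:30:32", "16:45:07"],
})

def is_late(time_value, grace_minutes=1, window_minutes=15):
    """Assumed rule: late means 1 to 14 minutes past the hour or half hour."""
    # str() also handles datetime.time values such as those read from Excel
    hours, minutes, seconds = (int(part) for part in str(time_value).split(":"))
    minutes_past_slot = minutes % 30  # minutes since the last :00 or :30
    return grace_minutes <= minutes_past_slot < window_minutes

# Boolean column marking the late logins
late = df["Login Time"].map(is_late)

def highlight_late(row):
    # Return one CSS string per cell in the row
    css = "background-color: yellow" if is_late(row["Login Time"]) else ""
    return [css] * len(row)

def style_late_logins(frame):
    # axis=1 passes each row to highlight_late
    return frame.style.apply(highlight_late, axis=1)

print(df[late])
```

In a Jupyter notebook, evaluating style_late_logins(df) as the last line of a cell should display the table with the late rows highlighted. If your cutoff is different, change grace_minutes and window_minutes.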

Related

Turn a second-order (2D) array into a pandas DataFrame

I have a data set in the form of a second-order (2D) array of arbitrary length, as shown below:
[['15,39' '17,43']
['23,40' '18,44']
['28,41' '18,45']
['28,42' '27,46']
['34,43' '26,47']
.
.
.
]
I want to turn it into a pandas DataFrame with columns and rows, as shown below:
15 39 17 43
23 40 18 44
28 41 18 45
28 42 27 46
34 43 26 47
.
.
.
Does anyone have an idea how to achieve this without saving the data out to files along the way?
My strategy is to first define a function that deals with the commas and quotes. Keeping in mind that your data is already a 2-dimensional NumPy array, I define the following function:
import numpy as np
import pandas as pd

def str_to_flt(lst):
    # Split each "a,b" string on the comma and convert both parts to float
    return np.array([[float(i.split(",")[0]), float(i.split(",")[1])] for i in lst])

# Convert each column of strings into two numeric columns and join them side by side
df = pd.DataFrame(np.concatenate((str_to_flt(data[:, 0]), str_to_flt(data[:, 1])), axis=1))
Your data:
from io import StringIO
s="""[['15,39' '17,43']
['23,40' '18,44']
['28,41' '18,45']
['28,42' '27,46']
['34,43' '26,47']]"""
df=pd.read_csv(StringIO(s),header=None)
You can do:
d={r"\[\['":"",r"'\]\]":"",r"'\]\]'":"",r"'\]":"",r"\['":"","' '":','}
df=df.replace(d,regex=True)
df[[1.2,1.5]]=df.pop(1).str.extract(r"(\d+),(\d+)")
df=df.sort_index(axis=1)
output of df:
0.0 1.2 1.5 2.0
0 15 39 17 43
1 23 40 18 44
2 28 41 18 45
3 28 42 27 46
4 34 43 26 47
Of course, you can rename the columns as needed using the columns attribute or the rename() method, and convert the data types with the astype() method.
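As a compact alternative to the two answers above, here is a minimal sketch that stays in memory and avoids regular expressions. It assumes the data is a NumPy array of strings in which every cell contains comma-separated integers, as in the question; the variable names are made up for illustration.

```python
import numpy as np
import pandas as pd

# Sample data in the shape described in the question (first three rows only)
data = np.array([
    ["15,39", "17,43"],
    ["23,40", "18,44"],
    ["28,41", "18,45"],
])

# Join the cells of each row with a comma, then split on every comma,
# which turns "15,39" and "17,43" into ["15", "39", "17", "43"]
rows = [",".join(row).split(",") for row in data]

# Build the DataFrame and convert the strings to integers
df = pd.DataFrame(rows).astype(int)
print(df)
```

Use astype(float) instead if the values are not always whole numbers.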

Plotting of dot points based on np.where condition

I have a lot of data points (in CSV form) that I am trying to visualize. I would like to read the CSV and look at the "result" column; where the value in that column is positive (I was trying to use an np.where condition), I would like to plot the corresponding A B C D E F G parameters, with the parameter value on the y-axis and the parameter name on the x-axis (something like a dot/scatter plot). I would like to plot all the values in the same graph. Furthermore, if there are more than 20 points, I would like to use only the first 20 for plotting.
An example of the type of dataset is below. (Mine contains around 12000 rows)
A B C D E F G result
23 -54 36 27 98 39 80 -0.86
14 44 -16 47 28 29 26 1.65
67 84 26 67 -88 29 10 0.5
-45 14 76 37 68 59 90 0
24 34 56 27 38 79 48 -1.65
Any help in guiding for this would be appreciated !
From your question I assume that your data is a pandas dataframe. In this case you can do the selection with pandas and use its built-in plotting function:
df.loc[df.result>0, df.columns[:-1]].T.plot(ls='', marker='o')
If you want to plot only the first 20 selected rows, append [:20] (or better, .iloc[:20]) to the df.loc[...] selection, before the .T.
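The selection step can be checked on its own before plotting. This is a small sketch using the five sample rows from the question; the intermediate variable names are made up, and the plot call is left as a comment because it needs matplotlib installed.

```python
import pandas as pd

# The sample rows from the question
df = pd.DataFrame(
    [
        [23, -54, 36, 27, 98, 39, 80, -0.86],
        [14, 44, -16, 47, 28, 29, 26, 1.65],
        [67, 84, 26, 67, -88, 29, 10, 0.5],
        [-45, 14, 76, 37, 68, 59, 90, 0],
        [24, 34, 56, 27, 38, 79, 48, -1.65],
    ],
    columns=list("ABCDEFG") + ["result"],
)

# Keep rows with a strictly positive result, and every column except "result"
positive = df.loc[df["result"] > 0, df.columns[:-1]]

# Limit to the first 20 matching rows (here there are only 2)
first_20 = positive.iloc[:20]

# Transpose so the parameter names A..G become the x-axis
to_plot = first_20.T
print(to_plot)

# With matplotlib installed, this draws one set of dots per selected row:
# to_plot.plot(ls="", marker="o")
```

Note that the row with result 0 is excluded, since 0 is not greater than 0.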

How do I sort columns of numerical file data in python

I'm trying to write a piece of code in python to graph some data from a tab separated file with numerical data.
I'm very new to Python so I would appreciate it if any help could be dumbed down a little bit.
Basically, I have this file and I would like to take two columns from it, sort them each in ascending order, and then graph those sorted columns against each other.
First of all, you should not put code as images, since there is a functionality to insert and format here in the editor.
It's as simple as calling x.sort() and y.sort() since both of them are slices from data so that should work fine (assuming they are 1 dimensional arrays).
Here is an example:
import numpy as np
array = np.random.randint(0, 100, size=20)
print(array)
Output:
[89 47 4 10 29 21 91 95 32 12 97 66 59 70 20 20 36 79 23 4]
Note that sort() sorts the array in place and returns None, so call it first and then print the array:
array.sort()
print(array)
Output:
[ 4 4 10 12 20 20 21 23 29 32 36 47 59 66 70 79 89 91 95 97]
Easy as that :)
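For the file case in the question, a minimal sketch could look like the following. The file contents are made up and read from a string so the example runs on its own; with a real file you would pass its path to np.loadtxt instead. Which two columns to use is also an assumption.

```python
import io
import numpy as np

# Made-up tab-separated data standing in for the real file
text = "3.0\t10.0\t7.0\n1.0\t30.0\t8.0\n2.0\t20.0\t9.0\n"
data = np.loadtxt(io.StringIO(text), delimiter="\t")

# np.sort returns a sorted copy and leaves `data` untouched
x = np.sort(data[:, 0])
y = np.sort(data[:, 1])
print(x, y)

# By contrast, the .sort() method works in place and returns None
column = data[:, 2].copy()
returned = column.sort()
print(returned)

# With matplotlib installed, the sorted columns can then be plotted:
# import matplotlib.pyplot as plt
# plt.plot(x, y); plt.show()
```

Keep in mind that sorting the two columns independently breaks the pairing between values that were on the same row of the file.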

Select/Group rows from a data frame with the nearest values for a specific column(s)

I have the two columns in a data frame (you can see a sample down below)
Usually in columns A & B I get 10 to 12 rows with similar values.
So for example: from index 1 to 10 and then from index 11 to 21.
I would like to group these values and get the mean and standard deviation of each group.
I found the following line of code, which gets the index of the nearest value, but I don't know how to do this repeatedly:
Index = df['A'].sub(df['A'][0]).abs().idxmin()
Does anyone have any ideas on how to approach this problem?
A B
1 3652.194531 -1859.805238
2 3739.026566 -1881.965576
3 3742.095325 -1878.707674
4 3747.016899 -1878.728626
5 3746.214554 -1881.270329
6 3750.325368 -1882.915532
7 3748.086576 -1882.406672
8 3751.786422 -1886.489485
9 3755.448968 -1885.695822
10 3753.714126 -1883.504098
11 -337.969554 24.070990
12 -343.019575 23.438956
13 -344.788697 22.250254
14 -346.433460 21.912217
15 -343.228579 22.178519
16 -345.722368 23.037441
17 -345.923108 23.317620
18 -345.526633 21.416528
19 -347.555162 21.315934
20 -347.229210 21.565183
21 -344.575181 22.963298
22 23.611677 -8.499528
23 26.320500 -8.744512
24 24.374874 -10.717384
25 25.885272 -8.982414
26 24.448127 -9.002646
27 23.808744 -9.568390
28 24.717935 -8.491659
29 25.811393 -8.773649
30 25.084683 -8.245354
31 25.345618 -7.508419
32 23.286342 -10.695104
33 -3184.426285 -2533.374402
34 -3209.584366 -2553.310934
35 -3210.898611 -2555.938332
36 -3214.234899 -2558.244347
37 -3216.453616 -2561.863807
38 -3219.326197 -2558.739058
39 -3214.893325 -2560.505207
40 -3194.421934 -2550.186647
41 -3219.728445 -2562.472566
42 -3217.630380 -2562.132186
43 234.800448 -75.157523
44 236.661235 -72.617806
45 238.300501 -71.963103
46 239.127539 -72.797922
47 232.305335 -70.634125
48 238.452197 -73.914015
49 239.091210 -71.035163
50 239.855953 -73.961841
51 238.936811 -73.887023
52 238.621490 -73.171441
53 240.771812 -73.847028
54 -16.798565 4.421919
55 -15.952454 3.911043
56 -14.337879 4.236691
57 -17.465204 3.610884
58 -17.270147 4.407737
59 -15.347879 3.256489
60 -18.197750 3.906086
A simpler approach consists of grouping consecutive values whose percentage change does not exceed a given threshold (let's say 0.5):
df['Group'] = (df.A.pct_change().abs()>0.5).cumsum()
df.groupby('Group').agg(['mean', 'std'])
Output:
                 A                        B
              mean        std          mean       std
Group
0      3738.590934  30.769420  -1880.148905  7.582856
1      -344.724684   2.666137     22.496995  0.921008
2        24.790470   0.994361     -9.020824  0.977809
3     -3210.159806  11.646589  -2555.676749  8.810481
4       237.902230   2.439297    -72.998817  1.366350
5       -16.481411   1.341379      3.964407  0.430576
Note: I have only used the "A" column, since the "B" column appears to follow the same pattern of consecutive nearest values. You can check if the identified groups are the same between columns with:
grps = (df[['A','B']].pct_change().abs()>1).cumsum()
grps.A.eq(grps.B).all()
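To see why this works, here is a small sketch on nine rows taken from the question's data (three from each of the first three groups). A large relative jump between consecutive rows marks the start of a new group, and cumsum() turns those markers into group numbers. The intermediate names jump and stats are made up for illustration.

```python
import pandas as pd

# Three rows from each of the first three groups in the question
df = pd.DataFrame({
    "A": [3652.194531, 3739.026566, 3742.095325,
          -337.969554, -343.019575, -344.788697,
          23.611677, 26.320500, 24.374874],
    "B": [-1859.805238, -1881.965576, -1878.707674,
          24.070990, 23.438956, 22.250254,
          -8.499528, -8.744512, -10.717384],
})

# True wherever A changes by more than 50% relative to the previous row;
# the first row has no previous value (NaN), which compares as False
jump = df["A"].pct_change().abs() > 0.5

# Each True bumps the running total, so rows between jumps share a number
df["Group"] = jump.cumsum()

# Mean and standard deviation of A and B within each group
stats = df.groupby("Group").agg(["mean", "std"])
print(df["Group"].tolist())
print(stats)
```

The threshold of 0.5 suits this data because values within a group differ by a few percent at most, while the jumps between groups are far larger; other data may need a different value.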
I would say that if you know the length of each group (or the index range you want), you can first select the column and rows, and then take the mean:
df['A'].iloc[0:10].mean()
The standard deviation of the same selection is available through the std() method, e.g. df['A'].iloc[0:10].std().

Analysing Json file in Python using pandas

I have to analyse a lot of data for my Bachelor's project.
The data will be handed to me in .json files. My supervisor has told me that it should be fairly easy if I just use pandas.
Since I am completely new to Python (I have decent experience with MATLAB and C, though), I am having a rough start.
If someone would be so kind as to explain to me how to do this, I would really appreciate it.
The files look like this:
{"columns":["id","timestamp","offset_freq","reprate_freq"],
"index":[0,1,2,3,4,5,6,7 ...
"data":[[526144,1451900097533,20000000.495000001,250000093.9642499983],[...
I need to import the data and analyse it (make some plots), but I'm not sure how to import data like this.
P.S. I have Python and the required packages installed.
You did not give the full format of the JSON file, but if it looks like
{"columns":["id","timestamp","offset_freq","reprate_freq"],
"index":[0,1,2,3,4,5,6,7,8,9],
"data":[[39,69,50,51],[62,14,12,49],[17,99,65,79],[93,5,29,0],[89,37,42,47],[83,79,26,29],[88,17,2,7],[95,87,34,34],[40,54,18,68],[84,56,94,40]]}
then you can do the following (I made up random numbers):
import pandas as pd
df = pd.read_json(file_name_or_Python_string, orient='split')
print(df)
id timestamp offset_freq reprate_freq
0 39 69 50 51
1 62 14 12 49
2 17 99 65 79
3 93 5 29 0
4 89 37 42 47
5 83 79 26 29
6 88 17 2 7
7 95 87 34 34
8 40 54 18 68
9 84 56 94 40
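Here is a self-contained sketch of the same call that you can run without a file. The first data row is the one shown in the question; the second row is made up. Treating the timestamp column as milliseconds since 1970 is an assumption based on the size of the value, and the new column name "time" is made up as well.

```python
from io import StringIO
import pandas as pd

# JSON in the "split" layout: separate "columns", "index" and "data" keys.
# The second data row is invented for illustration.
json_text = """{"columns":["id","timestamp","offset_freq","reprate_freq"],
"index":[0,1],
"data":[[526144,1451900097533,20000000.495000001,250000093.9642499983],
        [526145,1451900098533,20000000.5,250000093.97]]}"""

# Wrap the string in StringIO so pandas treats it like a file.
# convert_dates=False keeps the raw integers so the conversion below is explicit.
df = pd.read_json(StringIO(json_text), orient="split", convert_dates=False)

# Assumption: the timestamps are milliseconds since 1970-01-01
df["time"] = pd.to_datetime(df["timestamp"], unit="ms")

print(df)
```

With matplotlib installed, a first plot could then be something like df.plot(x="time", y="offset_freq").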
