Make a frequency table with categories in Python

I am trying to make a simple frequency table in Python, but I can't find out how. My data contains numbers from 0 to 10, for example:
1,2,3,4,5,5,5,8,8,8,0,9,10,2,2,10,10,7,7,7,7,9
I want to make a frequency table with the counts and percentages (zeros excluded!) of these values grouped into 3 categories:
Category 1 : lower than 5.5
Category 2 : from 5.5 up to, but not including, 8
Category 3 : 8 or higher
My output then needs to be:
Category 1 : frequency 9/43%
Category 2 : frequency 4/19%
Category 3 : frequency 8/38%
How do I do this in Python?

Updated version that will work for your use-case:
dd = {"cat_1": 0, "cat_2": 0, "cat_3": 0}
values = [1,2,3,4,5,5,5,8,8,8,0,9,10,2,2,10,10,7,7,7,7,9]
for value in values:
    # Zeros match none of the branches, so they are never counted
    if 0 < value < 5.5:
        dd["cat_1"] += 1
    elif 5.5 <= value < 8:
        dd["cat_2"] += 1
    elif value >= 8:
        dd["cat_3"] += 1
# Percentages are taken over the non-zero values only
total = len(values) - values.count(0)
print(f"Category 1 : frequency {dd['cat_1']}/{dd['cat_1'] / total * 100:.0f}%")
print(f"Category 2 : frequency {dd['cat_2']}/{dd['cat_2'] / total * 100:.0f}%")
print(f"Category 3 : frequency {dd['cat_3']}/{dd['cat_3'] / total * 100:.0f}%")
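
The same binning can also be written without an if/elif chain. The sketch below is one possible variant, not the answerer's method: it assumes the category boundaries 5.5 and 8 from the question, uses the standard-library bisect module to map each non-zero value to a category index, and counts with collections.Counter. The names BOUNDARIES, LABELS and frequency_table are made up for illustration.

```python
from bisect import bisect_right
from collections import Counter

# Category boundaries taken from the question: <5.5, 5.5 to <8, >=8
BOUNDARIES = [5.5, 8]
LABELS = ["Category 1", "Category 2", "Category 3"]


def frequency_table(values):
    """Return {label: (count, percent)} for the non-zero values."""
    nonzero = [v for v in values if v != 0]  # zeros are excluded entirely
    # bisect_right gives 0 for v < 5.5, 1 for 5.5 <= v < 8, 2 for v >= 8
    counts = Counter(bisect_right(BOUNDARIES, v) for v in nonzero)
    total = len(nonzero)
    return {
        label: (counts[i], round(counts[i] / total * 100) if total else 0)
        for i, label in enumerate(LABELS)
    }


values = [1, 2, 3, 4, 5, 5, 5, 8, 8, 8, 0, 9, 10, 2, 2, 10, 10, 7, 7, 7, 7, 9]
table = frequency_table(values)
for label, (count, percent) in table.items():
    print(f"{label} : frequency {count}/{percent}%")
```

Adding a fourth category would then only need one more boundary and one more label.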

Related

Time Series from different variables

I am trying to create a variable that displays how many days a bulb was functional, based on several score variables (Score_Day_0 and the other score columns).
The dataset I am using looks like the one below, where the scores on different days range from 1 (working very well) to 10 (stopped working).
What I want to know is how to create the variable Days, which holds the number of days each bulb was working. For example, for sample 2 the score on day 10 is 8 and on day 20 it is 10 (stopped working), so the number of days the bulb was working is 20.
Any suggestions?
Thank you so much for your help, hope you have a terrific day!!
sample    Score_Day_0  Score_Day_10  Score_Day_20  Score_Day_30  Score_Day_40  Days
sample 1  1            3             5             8             10            40
sample 2  3            8             10            10            10            20
I tried to solve this myself with a conditional loop, but the number of observations in Days is much higher than the number of observations in the original df.
Here is the code I used:
cols = df[['Score_Day_0', 'Score_Day_10....,'Score_Day_40']]
Days = []
for j in cols['Score_Day_0']:
    if j == 10:
        Days.append(0)
for k in cols['Score_Day_10']:
    if k == 10:
        Days.append('10')
for l in cols['Score_Day_20']:
    if l == 10:
        Days.append('20')
for n in cols['Score_Day_30']:
    if n == 10:
        Days.append('30')
for m in cols['Score_Day_40']:
    if m == 10:
        Days.append('40')

You're looking for the first column label (left to right) at which the value is maximal in each row.
You can apply a function to each row using pandas.DataFrame.apply with axis=1:
df.apply(function, axis=1)
The passed function receives each row as a Series object. To find the first occurrence of the maximum, filter the Series with a boolean condition and take the first label of the resulting index. That label is the column where the row first reaches its maximal value:
lambda x: x[x == x.max()].index[0]
Example:
import pandas as pd

df = pd.DataFrame(dict(d0=[1,1,1], d10=[1,5,10], d20=[5,10,10], d30=[8,10,10]))
#    d0  d10  d20  d30
# 0   1    1    5    8
# 1   1    5   10   10
# 2   1   10   10   10
df['days'] = df.apply(lambda x: x[x == x.max()].index[0], axis=1)
df
#    d0  d10  d20  d30 days
# 0   1    1    5    8  d30
# 1   1    5   10   10  d20
# 2   1   10   10   10  d10
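
For the bulb data in the question, the day number itself may be more useful than the column label. The following is a minimal sketch under assumptions that are not in the answer above: columns follow the Score_Day_<n> naming from the question and appear in ascending day order, a score of 10 means the bulb has stopped working, and days_until_failure is a hypothetical helper. It only relies on .items(), so it works on a plain dict and should also work on the row Series that df.apply(..., axis=1) passes in.

```python
def days_until_failure(row, failed_score=10, prefix="Score_Day_"):
    """Return the first day at which the score equals failed_score, else None."""
    for label, score in row.items():
        # Skip non-score columns such as "sample"
        if not str(label).startswith(prefix):
            continue
        if score == failed_score:
            return int(str(label)[len(prefix):])
    return None  # the bulb never reached the failure score


rows = [
    {"sample": "sample 1", "Score_Day_0": 1, "Score_Day_10": 3,
     "Score_Day_20": 5, "Score_Day_30": 8, "Score_Day_40": 10},
    {"sample": "sample 2", "Score_Day_0": 3, "Score_Day_10": 8,
     "Score_Day_20": 10, "Score_Day_30": 10, "Score_Day_40": 10},
]
days = [days_until_failure(row) for row in rows]
print(days)  # [40, 20]
```

With a DataFrame, this could be used as df['Days'] = df.apply(days_until_failure, axis=1).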

Eliminating Negative or Non_Negative values in pandas

I'm working on an automation task in Python in which, in each row, the first negative value should be added to the first non-negative value from the left. The result should then replace the positive value, and 0 should replace the negative value.
This process should continue until the row contains only negative or only positive values.
CUSTOMER  <30Days  31-60 Days  61-90Days  91-120Days  120-180Days  180-360Days  >360Days
ABC       -2       23          2          3           2            2            -1
(>360Days)+(180-360Days)
-1 + 2
CUSTOMER  <30Days  31-60 Days  61-90Days  91-120Days  120-180Days  180-360Days  >360Days
ABC       -2       23          2          3           2            1            0
(<30Days)+(180-360Days)
-2 + 1
CUSTOMER  <30Days  31-60 Days  61-90Days  91-120Days  120-180Days  180-360Days  >360Days
ABC       0        23          2          3           2            -1           0
(180-360Days)+(120-180Days)
-1 + 2
CUSTOMER  <30Days  31-60 Days  61-90Days  91-120Days  120-180Days  180-360Days  >360Days
ABC       0        23          2          3           2            0            0

Check this code:
import pandas as pd

# Enter the data (DataFrame.append was removed in pandas 2.0, so build the frame directly)
new_row = {'CUSTOMER':'ABC','<30Days':-2,'31-60 Days':23,'61-90Days':2,'91-120Days':3,'120-180Days':2,'180-360Days':2,'>360Days':-1}
df = pd.DataFrame([new_row])
# Keep the column order as per the requirement
df = df[['CUSTOMER','<30Days','31-60 Days','61-90Days','91-120Days','120-180Days','180-360Days','>360Days']]
# Take the column names and reverse the order
ls = list(df.columns)
ls.reverse()
# Remove the non-integer column
ls.remove('CUSTOMER')
for j in range(len(df)):
    # Initialize variables for each row
    flag1 = 1
    flag = 0
    new_ls = []
    new_ls_index = []
    while flag1 != 0:
        # Perform logic: find a negative value, then the next non-negative one
        for i in ls:
            if int(df.loc[j, i]) < 0 and flag == 0:
                new_ls.append(int(df.loc[j, i]))
                new_ls_index.append(i)
                flag = 1
            elif flag == 1 and int(df.loc[j, i]) >= 0:
                new_ls.append(int(df.loc[j, i]))
                new_ls_index.append(i)
                flag = 2
            elif flag == 2:
                # Write the sum into the non-negative cell and zero the negative one
                df.loc[j, new_ls_index[1]] = new_ls[0] + new_ls[1]
                df.loc[j, new_ls_index[0]] = 0
                flag = 0
                new_ls = []
                new_ls_index = []
        # Check whether all values in the row are negative or all are non-negative
        if new_ls == []:
            new_ls_neg = []
            new_ls_pos = []
            for i in ls:
                if int(df.loc[j, i]) < 0:
                    new_ls_neg.append(int(df.loc[j, i]))
                if int(df.loc[j, i]) >= 0:
                    new_ls_pos.append(int(df.loc[j, i]))
            if len(new_ls_neg) == len(ls) or len(new_ls_pos) == len(ls):
                flag1 = 0  # Set flag to stop the loop

Frequency of numbers in an array

I want to get the frequency of numbers in an unsorted array. I am getting the frequency of numbers, but the output shows the frequency of a particular number multiple times. I want the resulting frequency to be shown only once.
A = [2,5,1,2,4,6,3,10,3,4,3,2,3,2,15]
B = max(A) + 1
F = [None] * B
for i in range(0, B):
    F[i] = 0
for j in range(0, len(A)):
    F[A[j]] = F[A[j]] + 1
for k in range(0, len(A)):
    if F[A[k]] != 0:
        print("Frequency of ", A[k], " is : ", F[A[k]])
The output shows the frequency of, say, 2 four times:
Frequency of 2 is : 4
Frequency of 5 is : 1
Frequency of 1 is : 1
Frequency of 2 is : 4
Frequency of 4 is : 2
Frequency of 6 is : 1
Frequency of 3 is : 4
Frequency of 10 is : 1
Frequency of 3 is : 4
Frequency of 4 is : 2
Frequency of 3 is : 4
Frequency of 2 is : 4
Frequency of 3 is : 4
Frequency of 2 is : 4
Frequency of 15 is : 1

Use collections.Counter for this:
In [1]: from collections import Counter
In [2]: A = [2,5,1,2,4,6,3,10,3,4,3,2,3,2,15]
In [3]: for k, v in Counter(A).items():
   ...:     print('Frequency of {} is {}'.format(k, v))
   ...:
Frequency of 2 is 4
Frequency of 5 is 1 ...

You can use a dict data structure for that. See the commented code below:
# This function creates the collection frequencies
def get_collection_frequency(mylist):
    # Dictionary data structure is used
    mydict = {}
    # Loop through the input list
    for item in mylist:
        # If the item is already there
        if item in mydict:
            # Increase its frequency
            mydict[item] += 1
        # If it is not
        else:
            # Set its frequency equal to 1
            mydict[item] = 1
    # Return the dictionary
    return mydict

A = [2,5,1,2,4,6,3,10,3,4,3,2,3,2,15]
new = get_collection_frequency(A)
print(new)
Returns: {2: 4, 5: 1, 1: 1, 4: 2, 6: 1, 3: 4, 10: 1, 15: 1}

Get the set of the list to remove multiple occurrences, then just loop through it (sorted, so the output order is predictable):
for num in sorted(set(A)):
    print("Frequency of {} is {}".format(num, A.count(num)))
Output:
Frequency of 1 is 1
Frequency of 2 is 4
Frequency of 3 is 4
Frequency of 4 is 2
Frequency of 5 is 1
Frequency of 6 is 1
Frequency of 10 is 1
Frequency of 15 is 1
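
The duplicates in the question come from the last loop, which walks over A again and therefore prints a count every time a value occurs. Below is a minimal sketch of the original counting-array idea that reports each value once, assuming the list is non-empty and holds only non-negative integers (count_frequencies is a name chosen here for illustration):

```python
def count_frequencies(numbers):
    """Return (value, count) pairs in ascending order of value."""
    counts = [0] * (max(numbers) + 1)  # one slot per possible value
    for n in numbers:
        counts[n] += 1
    # Iterate over the slots, not over the input, so each value appears once
    return [(value, c) for value, c in enumerate(counts) if c != 0]


A = [2, 5, 1, 2, 4, 6, 3, 10, 3, 4, 3, 2, 3, 2, 15]
result = count_frequencies(A)
for value, c in result:
    print("Frequency of", value, "is :", c)
```

The array needs max(A) + 1 slots, so for sparse or very large values a dict or Counter is likely the better fit.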

How to compare values in a column and create a new column using pandas?

I have a df named value of size 567 and it has a column index as follows:
index
96.875
96.6796875
96.58203125
96.38671875
95.80078125
94.7265625
94.62890625
94.3359375
58.88671875
58.7890625
58.69140625
58.59375
58.49609375
58.3984375
58.30078125
58.203125
I also have 2 additional variables:
mu = 56.80877955613938
sigma= 17.78935620293665
What I want is to check the values in the index column. If the value is greater than, say, mu+3*sigma, a new column named alarm must be added to the value df and a value of 4 must be added.
I tried:
for i in value['index']:
    if (i >= mu+3*sigma):
        value['alarm'] = 4
    elif ((i < mu+3*sigma) and (i >= mu+2*sigma)):
        value['alarm'] = 3
    elif ((i < mu+2*sigma) and (i >= mu+sigma)):
        value['alarm'] = 2
    elif ((i < mu+sigma) and (i >= mu)):
        value['alarm'] = 1
But it creates an alarm column and fills it completely with 1.
What mistake am I making here?
Expected output:
index alarm
96.875 3
96.6796875 3
96.58203125 3
96.38671875 3
95.80078125 3
94.7265625 3
94.62890625 3
94.3359375 3
58.88671875 1
58.7890625 1
58.69140625 1
58.59375 1
58.49609375 1
58.3984375 1
58.30078125 1
58.203125 1

If you have multiple conditions, you don't want to loop through your dataframe and use if, elif, else. In your loop, value['alarm'] = ... assigns to the whole column on every iteration, so only the last assignment survives. A better solution is to use np.select, where we define conditions and, based on those conditions, the matching choices:
import numpy as np

conditions = [
    value['index'] >= mu+3*sigma,
    (value['index'] < mu+3*sigma) & (value['index'] >= mu+2*sigma),
    (value['index'] < mu+2*sigma) & (value['index'] >= mu+sigma),
]
choices = [4, 3, 2]
value['alarm'] = np.select(conditions, choices, default=1)
value
           alarm
index
96.875000      3
96.679688      3
96.582031      3
96.386719      3
95.800781      3
94.726562      3
94.628906      3
94.335938      3
58.886719      1
58.789062      1
58.691406      1
58.593750      1
58.496094      1
58.398438      1
58.300781      1
58.203125      1
If you have 10 minutes, here's a good post by CS95 explaining why looping over a dataframe is bad practice.
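
A self-contained way to try np.select is sketched below. It uses a plain NumPy array instead of the value DataFrame, the mu and sigma from the question, and a few made-up sample readings (120.0, 80.0 and 40.0 are not from the original data). One deliberate difference from the answer above, flagged as an assumption: an explicit fourth condition for values at or above mu is added and default=0 is used, so readings below mu are not labelled 1.

```python
import numpy as np

mu = 56.80877955613938
sigma = 17.78935620293665

# 96.875 and 58.203125 come from the question; the rest are invented
readings = np.array([120.0, 96.875, 80.0, 58.203125, 40.0])

# Conditions are checked in order; the first match wins for each element
conditions = [
    readings >= mu + 3 * sigma,
    readings >= mu + 2 * sigma,
    readings >= mu + sigma,
    readings >= mu,
]
choices = [4, 3, 2, 1]

alarm = np.select(conditions, choices, default=0)
print(alarm.tolist())  # [4, 3, 2, 1, 0]
```

Because the first matching condition wins, the upper bounds used in the answer are not strictly needed here.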

2D binary string pyevolve

I am new to pyevolve and GA in Python.
I am trying to make a 2D binary array representing matches. Something like this:
   A  B  C
1  1  0  0
2  0  1  0
3  1  0  0
4  0  0  1
My goal is to have exactly one "1" in each row, so the number of "1"s in the array should equal the number of rows. One number can be matched with only one letter, but a letter can be matched with multiple numbers.
I wrote this code in the evaluation function:
def eval_func(chromosome):
    score = 0.0
    num_of_rows = chromosome.getHeight()
    num_of_cols = chromosome.getWidth()
    # create 2 lists. One with the sums of each row and one
    # with the sums of each column
    row_sums = [sum(chromosome[i]) for i in xrange(num_of_rows)]
    col_sums = [sum(x) for x in zip(*chromosome)]
    # if the sum of "1"s in a row is > 1 then a number (1,2,3,4) is matched with
    # more than one letter. We want to avoid that.
    for row_sum in row_sums:
        if row_sum <= 1:
            score += 0.5
        else:
            score -= 1.0
    # col_sum is actually the number of "1"s in the array
    col_sum = sum(col_sums)
    # if all the numbers are matched we increase the score
    if col_sum == num_of_rows:
        score += 0.5
    if score < 0:
        score = 0.0
    return score
This seems to work, but when I add other checks, e.g. if 1 is in A then 2 cannot be in C, it fails.
How can I make this work with many checks?
Thanks in advance.
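
One way to keep many checks manageable is to describe them as data and count violations in a single helper, then subtract a penalty per violation inside eval_func. The sketch below is not pyevolve-specific: it works on a plain list of lists, assumes the genome can be indexed as chromosome[row][column] (the question's code already reads whole rows as chromosome[i]), and the names FORBIDDEN_PAIRS and count_violations are hypothetical.

```python
# Each rule says: if cell `first` is 1, then cell `second` must not be 1.
# Cells are (row, column) indices; rows 0..3 stand for numbers 1..4 and
# columns 0..2 for letters A..C.
FORBIDDEN_PAIRS = [
    ((0, 0), (1, 2)),  # "if 1 is in A, 2 cannot be in C"
]


def count_violations(matrix, forbidden_pairs):
    """Count how many forbidden pairs are both set to 1 in the matrix."""
    violations = 0
    for (r1, c1), (r2, c2) in forbidden_pairs:
        if matrix[r1][c1] == 1 and matrix[r2][c2] == 1:
            violations += 1
    return violations


valid = [[1, 0, 0], [0, 1, 0], [1, 0, 0], [0, 0, 1]]
invalid = [[1, 0, 0], [0, 0, 1], [1, 0, 0], [0, 0, 1]]
print(count_violations(valid, FORBIDDEN_PAIRS))    # 0
print(count_violations(invalid, FORBIDDEN_PAIRS))  # 1
```

Inside eval_func, something like score -= 1.0 * count_violations(chromosome, FORBIDDEN_PAIRS) before the final clamp to zero would penalize each broken rule; the weight 1.0 is an arbitrary choice.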
