Counting frequency of specific dataframe cells - python

I need to count the number of times each number appears within a specific cell.
[DataFrame screenshot]
The values are between 1 and 7.
In the column Entity_Types, the first cell contains 7,7,6,7,6,7,1,7,7,7,2. I think I need to create 7 additional empty columns, count the frequency of each number, and append the counts to new columns labeled Entity_Types_1, Entity_Types_2, etc.
Example: new column 7 would hold the count of 7's in that respective cell, while new column 1 would hold the count of all 1's. I have a table with 30,000 rows, so I was wondering how to run this in a loop to fill out the rest of the dataset.
I can easily do it in excel using this formula
=SUMPRODUCT(LEN(O2)-LEN(SUBSTITUTE(O2,"2","")))
Where O2 is Entity_Types and "2" = the number we are looking to find.
End Example

It looks like Entity_Types is a column of strings in your data frame df. If that is the case, you can use:
for i in range(1, 8):  # the values run from 1 to 7
    df['Entity_Types_{}'.format(i)] = df.Entity_Types.str.count(str(i))
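For reference, a minimal runnable sketch of the same approach on the example cell from the question (the one-row frame is just illustrative):

import pandas as pd

df = pd.DataFrame({'Entity_Types': ['7,7,6,7,6,7,1,7,7,7,2']})
for i in range(1, 8):
    df['Entity_Types_{}'.format(i)] = df.Entity_Types.str.count(str(i))

# Entity_Types_1 -> 1, Entity_Types_2 -> 1, Entity_Types_6 -> 2, Entity_Types_7 -> 7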

Related

I need to slice a lists column into 5 columns in pandas/spark dataframe

I have a spark dataframe with 3 columns
df = df.select("timestamp","outer_value","values")
[current format screenshot]
The values column contains lists of variable lengths (although always in multiples of 5, so for instance the 1st row might have 30 values, the 2nd row might have 60 values, etc.).
I want to split the values column so that it takes the first 5 values and puts them in 5 new columns, then puts the next 5 values in a second row of the same 5 columns, and so on.
So, for the above example, if the 1st row of values is a list of 15 elements, the second row a list of 20, and the 3rd row a list of 10 elements, then my dataframe should look something like this:
[desired format screenshot]
I tried F.slice with something like the code below, but that only works for fixed lengths: it assumes 60 values per row, giving 12 slices of 5 each. I need it to divide the values column dynamically, stopping when there are no more list elements left.
[code I tried screenshot]
So instead of 12 fixed slices of 5 each, I need it to keep stacking 5 values into the 5 separate columns, then from the 6th value onwards return to the 1st column and insert values 6-10 into the same 5 columns, continuing until each row's list is finished.
Is there a better way to do this?
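One way to make the slicing dynamic is to explode each list together with its position, derive which block of 5 the element belongs to and its slot within that block, then pivot the slots back into 5 columns. A minimal PySpark sketch, assuming a synthetic row identifier is acceptable (row_id, chunk and slot are hypothetical names):

from pyspark.sql import functions as F

df = df.withColumn("row_id", F.monotonically_increasing_id())

exploded = (
    df.select("row_id", "timestamp", "outer_value",
              F.posexplode("values").alias("pos", "val"))
      .withColumn("chunk", (F.col("pos") / 5).cast("int"))  # which block of 5
      .withColumn("slot", F.col("pos") % 5)                 # position inside the block
)

result = (
    exploded.groupBy("row_id", "timestamp", "outer_value", "chunk")
            .pivot("slot", [0, 1, 2, 3, 4])                 # 5 output columns
            .agg(F.first("val"))
            .orderBy("row_id", "chunk")
)

Because the chunk index is computed from each element's own position, rows with 15, 20 or 10 elements simply produce 3, 4 or 2 output rows; nothing assumes a fixed list length.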

Converting multiple column headers into a single column with python?

I don't even know where to start with this one. I have a dataset where the yield percentages of a particular product are broken into date columns. For instance, 08/03 is one column with a few hundred percentages as its values, 08/04 is another column, and the columns go on and on. I want to break this out and put the dates in their own column and the yield percentages in another: a single column built from the date headers, and a second column built from the percentages. I have no code to share, as I'm not sure where to start.
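This reshaping is what pandas.melt does: column headers become values in one column, and the cells become values in another. A minimal sketch, assuming a hypothetical identifier column product and two of the date columns from the question:

import pandas as pd

# hypothetical frame: one identifier column plus the date columns
df = pd.DataFrame({
    'product': ['A', 'B'],
    '08/03': [0.91, 0.87],
    '08/04': [0.93, 0.88],
})

# headers become a 'date' column, cell contents a 'yield_pct' column
long_df = df.melt(id_vars='product', var_name='date', value_name='yield_pct')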

How to create binary representations of words in pandas column?

I have a column which contains lists of variable sizes. The lists contain a limited number of short text values, around 60 unique values altogether.
0 ["AC","BB"]
1 ["AD","CB", "FF"]
2 ["AA","CC"]
3 ["CA","BB"]
4 ["AA"]
I want to make these values columns in my data-frame, where a column's value is 1 if that value is present in the row's list and 0 if not.
I know I could explode the lists, then call unique and set those as new columns. But after that I don't know what to do.
Here's one way:
df = pd.get_dummies(df.explode('val')).groupby(level=0).sum()
NOTE: Grouping by level=0 re-aggregates using the index, which explode preserves, so every list collapses back to its original row. (Older pandas accepted .sum(level=0) directly, but that shortcut is deprecated.) So, I prefer to use this after exploding the dataframe.
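Run end-to-end on the sample data (assuming the column is named val, as the answer does), a minimal sketch:

import pandas as pd

df = pd.DataFrame({'val': [['AC', 'BB'], ['AD', 'CB', 'FF'],
                           ['AA', 'CC'], ['CA', 'BB'], ['AA']]})

# explode: one row per list element (original index preserved);
# get_dummies: one-hot encode; groupby(level=0).sum(): back to one row per list
out = pd.get_dummies(df.explode('val')).groupby(level=0).sum()
print(out)  # columns val_AA, val_AC, ..., val_FF holding 0/1 indicators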

Slicing Pandas DataFrame every nth row

I have a CSV file, which I read in as a pandas DataFrame. I want to slice the data after every nth row and store each slice as an individual CSV.
My data looks somewhat like this:
index,acce_x,acce_y,acce_z,grav_x,grav_y,grav_z
0,-2.53406373,6.92596131,4.499464420000001,-2.8623820449999995,7.850541115,5.129520459999999
1,-2.3442032099999994,6.878311170000001,5.546690349999999,-2.6456542850000004,7.58022081,5.62603916
2,-1.8804458600000005,6.775125179999999,6.566146829999999,-2.336306185,7.321197125,6.088656729999999
3,-1.7059021099999998,6.650866649999999,7.07060242,-2.1012737650000006,7.1111130000000005,6.416324900000001
4,-1.6802886999999995,6.699703990000001,7.15823367,-1.938001715,6.976289879999999,6.613534820000001
5,-1.6156433,6.871610960000001,7.13333286,-1.81060772,6.901037819999999,6.72789553
6,-1.67286072,7.005918899999999,7.22047422,-1.722352455,6.848503825,6.8044359100000005
7,-1.56608278,7.136883599999999,7.150566069999999,-1.647941205,6.821055315,6.850329440000001
8,-1.3831649899999998,7.2735946999999985,6.88074028,-1.578703155,6.821634375,6.866061665000001
9,-1.25986478,7.379898050000001,6.590330490000001,-1.5190086099999998,6.839881785,6.861375744999999
10,-1.1101097050000002,7.48500525,6.287461959999999,-1.4641099750000002,6.870566649999999,6.842625039999999
For example, I would like to slice it after every 5th row and store the rows indexed 0-4 and 5-9 each in their own CSV (so in this case I would get 2 new CSVs); row 10 should be discarded.
One issue is that I'll have to apply this to multiple files which differ in length, and I also need to name the newly created CSVs.
You can do it with a for loop:
for i in range(len(df) // 5):  # integer division keeps only complete groups of 5
    df.iloc[i*5:(i+1)*5].to_csv('Stored_files_' + str(i) + '.csv')
So on the first iteration rows 0 to 4 are stored under the name "Stored_files_0.csv",
on the second iteration rows 5 to 9 under the name "Stored_files_1.csv",
and so on... (Note that .iloc with an exclusive end is used here; .loc slicing is inclusive on both ends and would grab 6 overlapping rows per file.)
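For the multi-file part of the question, a minimal sketch, assuming the input CSVs sit in the current directory (the output naming scheme is hypothetical):

import glob
import pandas as pd

for path in glob.glob('*.csv'):
    df = pd.read_csv(path)
    stem = path.rsplit('.', 1)[0]
    for i in range(len(df) // 5):  # incomplete trailing rows are discarded
        df.iloc[i*5:(i+1)*5].to_csv('{}_part_{}.csv'.format(stem, i), index=False)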

Iterating rows AND columns, in Pandas/Python

So I am trying to iterate over the rows and columns of a data frame, and I would like to compare some of the column values to another dataframe's values under identical column names.
Both data frames have about 30 columns, where some are objects, some are floats and some are integers.
I would mostly like to compare all of the integer columns to another data frame, ip, that holds 1 row extracted from the original. I want to compute the similarity of each row in the dataframe CB to that one row in ip, and then write the value into the sim column of my dataframe.
(if it's possible to compare all relevant columns in a way that would be great too)
[Image of dataframes]
In the end I would like to be able to change the sim column value based on the final if statement for each row. Ideally this would be reusable, like a function, as I would like to compare against multiple "ip"s in the future.
Below is an example of one of the variations I tried:
for i in range(len(CB)):
    top, bottom = 0, 0                        # reset the counters for each row
    for j in range(ip.shape[1]):              # walk ip's columns, assumed to line up with CB's
        current = CB.iloc[i, j]
        ipValue = ip.iloc[0, j]
        if current == ipValue:
            top += 1
        if (current == 1) or (ipValue == 1):
            bottom += 1
    if bottom > 0:
        CB.iloc[i, 30] = top / bottom         # column 30 is the sim column
If anyone could help me with this it would be wonderful, thank you :)
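Since reuse across multiple "ip"s is wanted, a vectorized alternative to the explicit loops might look like the sketch below; it assumes CB and ip share their integer column names and that CB has a sim column (both assumptions, not confirmed by the question):

import pandas as pd

def similarity(CB, ip):
    # compare only the integer columns the two frames share
    cols = CB.select_dtypes(include='int').columns.intersection(ip.columns)
    row = ip.iloc[0][cols]
    top = (CB[cols] == row).sum(axis=1)                  # matches per row ("top")
    bottom = ((CB[cols] == 1) | (row == 1)).sum(axis=1)  # either side equals 1 ("bottom")
    return (top / bottom).where(bottom > 0)              # NaN where bottom == 0

CB['sim'] = similarity(CB, ip)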
