I have a spark dataframe with 3 columns
df = df.select("timestamp","outer_value","values")
current format
The "values" column contains lists of variable length (always a multiple of 5; for instance, the 1st row might have 30 values, the 2nd row 60 values, etc.)
I want to split the values column so that the first 5 values go into 5 new columns, the next 5 values go into a second row of the same 5 columns, and so on.
So let's say the 1st row of the "values" column is a list of 15 elements, the second row is a list of 20, and the 3rd row is a list of 10 elements; then my dataframe should look something like this:
desired format
Is there a better way to do this?
I tried F.slice with something like this, but it only works for a fixed length: assuming 60 as a fixed number, I get 12 slices of 5 each. I need it to divide the values column dynamically, stopping when there are no more list elements left.
code I tried
So instead of a fixed 12 slices of 5 each, I need it to keep stacking groups of 5 values into 5 separate columns: values 1-5 go into the first output row, values 6-10 into the next row of the same 5 columns, and so on until each row's list is exhausted.
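A sketch of one way to do this with posexplode, assuming each input row is identified by its timestamp (column names follow the question; nothing here is tested against your schema):

from pyspark.sql import functions as F

# Explode each list together with each element's position in the list.
exploded = df.select(
    "timestamp", "outer_value",
    F.posexplode("values").alias("pos", "value"),
)

# pos // 5 numbers the output rows within each input row;
# pos % 5 says which of the 5 output columns the value belongs to.
grouped = (exploded
    .withColumn("row_id", (F.col("pos") / 5).cast("int"))
    .withColumn("col_id", F.col("pos") % 5))

# Pivot the within-group position into 5 columns; one output row per group of 5.
result = (grouped
    .groupBy("timestamp", "outer_value", "row_id")
    .pivot("col_id", [0, 1, 2, 3, 4])
    .agg(F.first("value"))
    .orderBy("timestamp", "row_id"))

Because the row count grows (each group of 5 becomes its own row), this has to go through an explode/regroup rather than a fixed set of slice expressions.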
I don't even know where to start with this one. I've got a dataset where the yield percentages of a particular product are broken into date columns. For instance, 08/03 is one column with a few hundred percentages as its values, 08/04 is another column, and the columns go on and on. I want to break this out so the dates are in their own column and the Yield % is in its own column, i.e. create a single column out of the date headers and another column out of the percentages. I have no code to share as I'm not sure where to start.
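A minimal sketch with pandas.melt, assuming the data sits in a pandas DataFrame whose column headers are the dates; the names 'Date' and 'Yield %' are my own choices:

import pandas as pd

# Hypothetical frame: one column per date, values are yield percentages.
df = pd.DataFrame({'08/03': [98.2, 97.5, 96.1],
                   '08/04': [96.9, 99.1, 97.8]})

# melt() stacks the date headers into a single 'Date' column and the
# percentages into a single 'Yield %' column.
long_df = df.melt(var_name='Date', value_name='Yield %')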
I have a column which contains lists of variable sizes. The lists contain a limited set of short text values, around 60 unique values altogether.
0 ["AC","BB"]
1 ["AD","CB", "FF"]
2 ["AA","CC"]
3 ["CA","BB"]
4 ["AA"]
I want to make these values columns in my dataframe, where a column's value is 1 if that value appears in the row's list and 0 if not.
I know I could explode the list, then call unique and set those as new columns, but after that I don't know what to do.
Here's one way:
# Explode each list into one row per element, one-hot encode the elements,
# then collapse back to one row per original index.
df = pd.get_dummies(df.explode('val')).groupby(level=0).sum()
NOTE: grouping on level=0 groups the exploded rows by the original index, so the sum collapses the dummy rows back into one row per list. (Older pandas accepted .sum(level=0) as a shorthand for this, but it has since been removed in favor of groupby.)
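With the sample data above (and assuming the list column is literally named 'val'; substitute your own column name), the result has one 0/1 column per unique value:

   val_AA  val_AC  val_AD  val_BB  val_CA  val_CB  val_CC  val_FF
0       0       1       0       1       0       0       0       0
1       0       0       1       0       0       1       0       1
2       1       0       0       0       0       0       1       0
3       0       0       0       1       1       0       0       0
4       1       0       0       0       0       0       0       0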
I have a CSV file, which I read in as a pandas DataFrame. I want to slice the data after every nth row and store each slice as an individual CSV.
My data looks somewhat like this:
index,acce_x,acce_y,acce_z,grav_x,grav_y,grav_z
0,-2.53406373,6.92596131,4.499464420000001,-2.8623820449999995,7.850541115,5.129520459999999
1,-2.3442032099999994,6.878311170000001,5.546690349999999,-2.6456542850000004,7.58022081,5.62603916
2,-1.8804458600000005,6.775125179999999,6.566146829999999,-2.336306185,7.321197125,6.088656729999999
3,-1.7059021099999998,6.650866649999999,7.07060242,-2.1012737650000006,7.1111130000000005,6.416324900000001
4,-1.6802886999999995,6.699703990000001,7.15823367,-1.938001715,6.976289879999999,6.613534820000001
5,-1.6156433,6.871610960000001,7.13333286,-1.81060772,6.901037819999999,6.72789553
6,-1.67286072,7.005918899999999,7.22047422,-1.722352455,6.848503825,6.8044359100000005
7,-1.56608278,7.136883599999999,7.150566069999999,-1.647941205,6.821055315,6.850329440000001
8,-1.3831649899999998,7.2735946999999985,6.88074028,-1.578703155,6.821634375,6.866061665000001
9,-1.25986478,7.379898050000001,6.590330490000001,-1.5190086099999998,6.839881785,6.861375744999999
10,-1.1101097050000002,7.48500525,6.287461959999999,-1.4641099750000002,6.870566649999999,6.842625039999999
For example, I would like to slice it after every 5th row and store the rows indexed 0-4 and 5-9 each in a single CSV (so in this case I would get 2 new CSVs); row 10 should be discarded.
One complication is that I'll have to apply this to multiple files which differ in length, and each newly created CSV needs its own name.
You can do it with a for loop:
for i in range(len(df) // 5):  # integer division keeps only complete chunks of 5
    df.iloc[i*5:(i+1)*5].to_csv('Stored_files_' + str(i) + '.csv')
So the first iteration stores rows 0 to 4 under the name "Stored_files_0.csv", the second iteration rows 5 to 9 under "Stored_files_1.csv", and so on. The integer division means any leftover rows at the end (fewer than 5) are dropped, as in your example. (Note that .iloc is used rather than .loc, since .loc slices are label-based and inclusive of the endpoint, which would give overlapping 6-row chunks.)
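To cover the multiple-files part, a sketch assuming all inputs are CSVs in one directory; the glob pattern and naming scheme are hypothetical:

import glob
import os
import pandas as pd

for path in glob.glob('data/*.csv'):
    df = pd.read_csv(path)
    base = os.path.splitext(os.path.basename(path))[0]
    for i in range(len(df) // 5):
        # e.g. 'recording1_0.csv', 'recording1_1.csv', ... per input file
        df.iloc[i*5:(i+1)*5].to_csv(base + '_' + str(i) + '.csv', index=False)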
I am trying to iterate over the rows and columns of a dataframe and compare some of its column values to another dataframe's values in identically named columns.
Both dataframes have about 30 columns, where some are objects, some are floats, and some are integers.
I would mostly like to compare all of the integer columns of the dataframe 'CB' to 'ip', a dataframe holding one row extracted from the data, compute the similarity of each row in 'CB' to the one row in 'ip', and put that value into the sim column of my dataframe.
(If it's possible to compare all relevant columns in one go, that would be great too.)
Image of dataframes
In the end I would like to set the sim column value based on the final if statement for each row. Ideally this would be reusable as a function, since I would like to compare against multiple "ip"s.
Below is an example of one of the variations I tried:
for i in range(len(CB)):
    top = 0
    bottom = 0
    for j in range(30):                    # the columns shared with ip
        current = CB.iloc[i, j]
        ipValue = ip.iloc[0, j]
        if current == ipValue:
            top += 1
        if (current == 1) or (ipValue == 1):
            bottom += 1
    if bottom > 0:
        CB.iloc[i, 30] = top / bottom      # column 30 is the sim column
If anyone could help me with this it would be wonderful, thank you :)
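A minimal sketch of a reusable, vectorized version, assuming the similarity you want is (number of matching columns) / (number of columns where either side is 1), as in your loop; the function name and the column selection are my own choices:

import pandas as pd

def similarity(CB, ip, cols):
    row = ip.iloc[0][cols]                       # the single ip row, shared columns only
    top = (CB[cols] == row).sum(axis=1)          # columns where CB matches ip, aligned by name
    bottom = ((CB[cols] == 1) | (row == 1)).sum(axis=1)  # columns where either side is 1
    return (top / bottom).where(bottom > 0)      # NaN where bottom is 0, like the guard above

# Hypothetical usage: compare the numeric columns (excluding 'sim' itself)
# against one ip; call again with other ip frames as needed.
cols = CB.select_dtypes('number').columns.drop('sim', errors='ignore')
CB['sim'] = similarity(CB, ip, cols)

Comparing a DataFrame to a Series aligns on column names, which is what lets this avoid the nested loop entirely.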