Slicing Pandas DataFrame every nth row - python

I have a CSV file, which I read in as a pandas DataFrame. I want to slice the data after every nth row and store each slice as an individual CSV.
My data looks somewhat like this:
index,acce_x,acce_y,acce_z,grav_x,grav_y,grav_z
0,-2.53406373,6.92596131,4.499464420000001,-2.8623820449999995,7.850541115,5.129520459999999
1,-2.3442032099999994,6.878311170000001,5.546690349999999,-2.6456542850000004,7.58022081,5.62603916
2,-1.8804458600000005,6.775125179999999,6.566146829999999,-2.336306185,7.321197125,6.088656729999999
3,-1.7059021099999998,6.650866649999999,7.07060242,-2.1012737650000006,7.1111130000000005,6.416324900000001
4,-1.6802886999999995,6.699703990000001,7.15823367,-1.938001715,6.976289879999999,6.613534820000001
5,-1.6156433,6.871610960000001,7.13333286,-1.81060772,6.901037819999999,6.72789553
6,-1.67286072,7.005918899999999,7.22047422,-1.722352455,6.848503825,6.8044359100000005
7,-1.56608278,7.136883599999999,7.150566069999999,-1.647941205,6.821055315,6.850329440000001
8,-1.3831649899999998,7.2735946999999985,6.88074028,-1.578703155,6.821634375,6.866061665000001
9,-1.25986478,7.379898050000001,6.590330490000001,-1.5190086099999998,6.839881785,6.861375744999999
10,-1.1101097050000002,7.48500525,6.287461959999999,-1.4641099750000002,6.870566649999999,6.842625039999999
For example, I would like to slice it after every 5th row, storing rows 0-4 and 5-9 each in their own CSV (so in this case I would get 2 new CSVs); row 10 should be discarded.
One issue is that I'll have to apply this to multiple files of differing lengths, and I'll also need a scheme for naming the newly created CSVs.

You can do it with a for loop:
n = 5
for i in range(len(df) // n):  # floor division discards the incomplete final chunk
    df.iloc[i * n:(i + 1) * n].to_csv('Stored_files_' + str(i) + '.csv')
Note the use of .iloc rather than .loc: .loc slices by label and includes both endpoints (which would give you 6 rows per file), while .iloc slices by position and excludes the stop index.
So on the first iteration rows 0-4 are stored as "Stored_files_0.csv", on the second iteration rows 5-9 as "Stored_files_1.csv", and so on. With your 11-row example, row 10 is discarded, as required.
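To handle the multiple-files part of the question, here is a minimal sketch, assuming the input CSVs sit in the current directory; the glob pattern and the output naming scheme are placeholders to adapt:
import glob
import os
import pandas as pd

n = 5
for path in glob.glob('*.csv'):  # placeholder pattern for your input files
    df = pd.read_csv(path)
    stem = os.path.splitext(os.path.basename(path))[0]
    for i in range(len(df) // n):  # floor division drops the incomplete tail
        # name each chunk after its source file so outputs from files of
        # differing lengths don't collide
        df.iloc[i * n:(i + 1) * n].to_csv(stem + '_part_' + str(i) + '.csv', index=False)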

Related

I need to slice a list column into 5 columns in a pandas/Spark dataframe

I have a Spark dataframe with 3 columns:
df = df.select("timestamp","outer_value","values")
[screenshot: current format]
The "values" column contains lists of variable length (always a multiple of 5, so for instance the 1st row might have 30 values, the 2nd row 60 values, etc.).
I want to split the values column so that it takes the first 5 values and puts them in 5 new columns, then puts the next 5 values in a second row of the same 5 columns, and so on.
So, say the 1st row of "values" is a list of 15 elements, the second row a list of 20, and the 3rd row a list of 10 elements; then my dataframe should look something like this:
[screenshot: desired format]
Is there a better way to do this?
I tried F.slice with something like the code below, but that only works for a fixed length: assuming 60 values, I hard-code 12 slices of 5 each. I need it to divide the values column dynamically, stopping when there are no more list elements left.
[screenshot: code I tried]
So instead of 12 fixed slices of 5, I need it to keep stacking groups of 5 values into the 5 new columns: after the first 5 values, return to the 1st column and insert values 6-10 as a new row in the same 5 columns, continuing until each row's list is exhausted.
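One possible approach is to explode each list together with element positions and pivot the position-within-chunk back into columns. A minimal sketch, assuming the column names from the question (timestamp, outer_value, values) and chunks of 5:
from pyspark.sql import functions as F

# posexplode yields each list element together with its position in the list
exploded = df.select(
    'timestamp', 'outer_value',
    F.posexplode('values').alias('pos', 'val'),
)

# every run of 5 consecutive elements becomes one output row (chunk);
# the position inside the run (0-4) selects the output column (slot)
exploded = (exploded
    .withColumn('chunk', (F.col('pos') / 5).cast('int'))
    .withColumn('slot', F.col('pos') % 5))

result = (exploded
    .groupBy('timestamp', 'outer_value', 'chunk')
    .pivot('slot', [0, 1, 2, 3, 4])
    .agg(F.first('val'))
    .orderBy('timestamp', 'chunk'))
This stops naturally when a row's list runs out, so no fixed length has to be assumed.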

How to get data from a row on a DataFrame

I have a list of lists containing the indexes of the minimum values in each column of a DataFrame whose column names run from 0 to 399 and whose row names run from 0 to 1595. I want to use those lists to access the data of another DataFrame. For example, given the list (43, 579, 100), I want to access the 43rd, 579th and 100th values of a column in the second DataFrame. However, the second DataFrame has row names that do not run from 0 to 1595, so I don't want to mistakenly access the row that happens to be named "43"; I want the 43rd row by position.
I added a picture of my DataFrames.
I would like to get a list with the data from the selected rows.
You can use .values to convert the column data to a numpy array and index with your list. For example, if your data is in variable df and the list of indexes is idxs, then for a given column:
df[column].values[idxs]
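A quick sketch with made-up data, showing that .values (and .iloc) index by position while .loc would index by label:
import pandas as pd

# index labels are deliberately arbitrary here
df = pd.DataFrame({'a': [10, 20, 30, 40, 50]}, index=[7, 3, 43, 0, 9])
idxs = [2, 0, 4]

print(df['a'].values[idxs])         # positional -> [30 10 50]
print(df['a'].iloc[idxs].tolist())  # .iloc is positional too -> [30, 10, 50]
# df['a'].loc[idxs] would instead look up the labels 2, 0 and 4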

Concatenating the entire row in Pandas

I'm new to Pandas. I want to join all the strings in each row into a single paragraph, so that each row ends up with one combined text column instead of a number of columns. For example, if a DataFrame has 10 columns and 5 rows, I want the output to be 5 rows with a single column that combines all 10 columns' string data. I need this to apply Bag of Words / TF-IDF for sentiment analysis.
A solution I thought of for row 1 might be:
' '.join(str(x) for x in df.iloc[1, 0:10])
If there is any better solution, it would be more helpful for me.
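A vectorized alternative, as a sketch: cast every cell to string and let pandas join across the columns, which avoids hard-coding the column slice:
# join each row's values with a space, whatever the number of columns
df['combined'] = df.astype(str).agg(' '.join, axis=1)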

Python pandas dataframe: Is it possible to have rows in a dataframe with different numbers of columns (column lengths)?

I'm trying to compare two CSV files, where each row in one file is compared with all rows in the other file; if the rows match (based on some unique column present in both), I create a new row in a third CSV file, where each row contains some values in its cells.
I'm using a pandas DataFrame for this, as I can first create an empty DataFrame, then add values into it, and at the end use DataFrame.to_csv() to write its contents to the third CSV file.
The problem is that, depending on a condition, a row in the third file should have only 522 columns (condition true) or 621 columns (condition false).
Currently the third file always contains 621 columns, because I initialized the DataFrame as pd.DataFrame(pd.np.empty((0,621), dtype=object)), irrespective of whether the condition is true or false, but I need the behaviour described above.
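A DataFrame is rectangular by construction (and note pd.np has been removed in recent pandas versions), so ragged rows are easier to write with the standard csv module. A sketch, where matched_rows and condition_holds are hypothetical placeholders for the comparison logic described above:
import csv

with open('third_file.csv', 'w', newline='') as f:
    writer = csv.writer(f)
    for row in matched_rows:         # hypothetical: rows produced by the comparison
        if condition_holds(row):     # hypothetical predicate from the question
            writer.writerow(row[:522])  # true case: 522 columns
        else:
            writer.writerow(row[:621])  # false case: 621 columns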

Counting frequency of specific dataframe cells

I need to count the number of times a specific number appears within a specific cell.
[screenshot: DataFrame]
The values are between 1 and 7.
In the column Entity_Types, the first cell holds 7,7,6,7,6,7,1,7,7,7,2. I think I need to create 7 additional empty columns, count the frequency of each number in the cell, and append the counts to new columns labeled Entity_Types_1, Entity_Types_2, etc.
Example: new column 7 would hold the count of 7s in that respective cell, while new column 1 would hold the count of all 1s. My table has 30,000 rows, so I was wondering how to run this in a loop to fill out the rest of the dataset.
I can easily do it in Excel using this formula:
=SUMPRODUCT(LEN(O2)-LEN(SUBSTITUTE(O2,"2","")))
where O2 is the Entity_Types cell and "2" is the number we are looking to find.
It looks like Entity_Types is a column of strings in your data frame df. If that is the case, you can use:
for i in range(1, 8):  # the values run from 1 to 7
    df['Entity_Types_{}'.format(i)] = df.Entity_Types.str.count(str(i))
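As a quick check on the question's example cell, using a toy one-row frame:
import pandas as pd

df = pd.DataFrame({'Entity_Types': ['7,7,6,7,6,7,1,7,7,7,2']})
for i in range(1, 8):
    df['Entity_Types_{}'.format(i)] = df.Entity_Types.str.count(str(i))
print(df.iloc[0])  # Entity_Types_7 == 7, Entity_Types_6 == 2, _1 == 1, _2 == 1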
