Pandas - Read only the first few lines of each row - python

I have a large CSV file with about 10000 rows of text, and each row of my dataset consists of a number of lines. However, I only want to read, say, the first 20 lines of each row of my CSV file.
I came across the nrows parameter of pandas' read_csv method, which limits the number of rows of the dataset that are loaded. Is there also a way to read only the first 20 lines of data from each row in pandas?

You can read in the CSV with df = pd.read_csv(r'path\file.csv') and then just select the first 20 rows with df_new = df.head(20). Is that what you were thinking of?

If I get your question correctly, your CSV file has multiple rows, where each row has multiple lines separated by the newline character '\n'. And you want to choose the first (say, for example) 3 lines from each row.
This can be achieved as:
# Read in CSV file using pandas-
data = pd.read_csv("example.csv")
# The first two rows (toy example) of dataset are-
data.iloc[0,0]
# 'Hello, this is first line\nAnd this is the second line\nThird and final line'
data.iloc[1,0]
# 'Today is 3rd June\nThe year is 2020\nSummer weather'
# First row's first line-
data.iloc[0,0].split("\n")[0]
# 'Hello, this is first line'
# First row's first two lines-
data.iloc[0,0].split("\n")[0:2]
# ['Hello, this is first line', 'And this is the second line']
The general syntax to get the first 'n' lines from row 'x' (assuming that the first column has the string data) is:
data.iloc[x,0].split("\n")[:n]
To select the first 'm' lines (assuming there are m lines or more) from the first 'x' rows, use the code:
data.iloc[:x, 0].apply(lambda y: y.split("\n")[0:m])
Does this help?
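The same idea can be applied to a whole column at once with the pandas string accessor. This is a minimal sketch, assuming the multi-line text lives in a column called "text" (a hypothetical name) and that the truncated lines should be joined back into one string per row:

```python
import pandas as pd

# Toy data reusing the two example rows above; the column name "text" is an assumption.
data = pd.DataFrame({
    "text": [
        "Hello, this is first line\nAnd this is the second line\nThird and final line",
        "Today is 3rd June\nThe year is 2020\nSummer weather",
    ]
})

n = 2  # number of lines to keep from each row

# Split each cell into a list of lines, keep the first n, and join them back together.
data["text_head"] = data["text"].str.split("\n").str[:n].str.join("\n")
```

Rows with fewer than n lines are kept unchanged, because slicing a short list simply returns the whole list.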

If TiTo's answer is not what you are looking for, maybe the iloc method is. You can store the first 20 rows by doing firstRows = df.iloc[:20].
However, if you only ever need the first 20 rows, you shouldn't load the whole file into memory. As you mentioned, this can be achieved with the nrows parameter.
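A small sketch of the nrows approach, using an in-memory CSV so it runs without a file on disk; the column names and contents are made up for illustration:

```python
import io

import pandas as pd

# Build a 100-row CSV in memory; in practice this would be a path to the real file.
csv_text = "id,text\n" + "".join(f"{i},row {i}\n" for i in range(100))

# Only the first 20 data rows are parsed; the rest of the input is not loaded.
df = pd.read_csv(io.StringIO(csv_text), nrows=20)
```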

Related

Pandas: how to keep data that has all the needed columns

I have this big csv file that has data from an experiment. The first part of each person's response is a trial part that doesn't have the time they took for each response and I don't need that. After that part, the data adds another column which is the time, and those are the rows I need. So, basically, the csv has a lot of unusable data that has 9 columns instead of 10 and I need only the data with the 10 columns. How can I manage to grab that data instead of all of it?
As an example of it, the first row shows the data without the time column (second to last) and the second row the data I need with the time column added. I only need all the second rows basically, which is thousands of them. Any tips would be appreciated.
1619922425,5fe43773223070f515613ba23f3b770c,PennController,7,0,experimental-trial2,NULL,PennController,9,_Trial_,End,1619922289638,FLOR, red, r,NULL
1619922425,5fe43773223070f515613ba23f3b770c,PennController,55,0,experimental-trial,NULL,PennController,56,_Trial_,Start,1619922296066,CASA, red, r,1230,NULL
Read the CSV using pandas. Then filter by using df[~df.time.isna()] to select all rows with non-NaN values in the "time" column.
You can change this to filter based on the presence of data in any column. Think of it as a mask, i.e. mask = ~df.time.isna() flags rows as True/False depending on the condition.
One option is to load the whole file and then keep only valid data:
import pandas as pd
df = pd.read_csv("your_file.csv")
invalid_rows = df.iloc[:,-1].isnull() # Find rows, where last column is not valid (missing)
df = df[~invalid_rows] # Select only valid rows
If your columns are named, you can use df['column_name'] instead of df.iloc[:,-1].
Of course it means you first load the full dataset, but in many cases this is not a problem.
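If the rows really do have different numbers of fields, as in the two sample lines in the question, pd.read_csv may refuse to parse the file, so another option is to filter on the field count before building the DataFrame. This is only a sketch using the standard csv module; the expected count of 17 is taken from the second sample line, and the variable names are illustrative:

```python
import csv
import io

import pandas as pd

# The two sample lines from the question: 16 fields (no time) and 17 fields (with time).
raw = (
    "1619922425,5fe43773223070f515613ba23f3b770c,PennController,7,0,"
    "experimental-trial2,NULL,PennController,9,_Trial_,End,1619922289638,FLOR, red, r,NULL\n"
    "1619922425,5fe43773223070f515613ba23f3b770c,PennController,55,0,"
    "experimental-trial,NULL,PennController,56,_Trial_,Start,1619922296066,CASA, red, r,1230,NULL\n"
)

expected_fields = 17  # assumption: rows that include the time column have 17 fields

# Keep only the rows with the full number of fields, then build the DataFrame.
rows = [row for row in csv.reader(io.StringIO(raw)) if len(row) == expected_fields]
df = pd.DataFrame(rows)
```

For a real file, io.StringIO(raw) would be replaced by an open file handle. All values come back as strings here, so numeric columns still need converting afterwards.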

Slicing Pandas DataFrame every nth row

I have a CSV file, which I read in as a pandas DataFrame. I want to slice the data after every nth row and store each slice as an individual CSV.
My data looks somewhat like this:
index,acce_x,acce_y,acce_z,grav_x,grav_y,grav_z
0,-2.53406373,6.92596131,4.499464420000001,-2.8623820449999995,7.850541115,5.129520459999999
1,-2.3442032099999994,6.878311170000001,5.546690349999999,-2.6456542850000004,7.58022081,5.62603916
2,-1.8804458600000005,6.775125179999999,6.566146829999999,-2.336306185,7.321197125,6.088656729999999
3,-1.7059021099999998,6.650866649999999,7.07060242,-2.1012737650000006,7.1111130000000005,6.416324900000001
4,-1.6802886999999995,6.699703990000001,7.15823367,-1.938001715,6.976289879999999,6.613534820000001
5,-1.6156433,6.871610960000001,7.13333286,-1.81060772,6.901037819999999,6.72789553
6,-1.67286072,7.005918899999999,7.22047422,-1.722352455,6.848503825,6.8044359100000005
7,-1.56608278,7.136883599999999,7.150566069999999,-1.647941205,6.821055315,6.850329440000001
8,-1.3831649899999998,7.2735946999999985,6.88074028,-1.578703155,6.821634375,6.866061665000001
9,-1.25986478,7.379898050000001,6.590330490000001,-1.5190086099999998,6.839881785,6.861375744999999
10,-1.1101097050000002,7.48500525,6.287461959999999,-1.4641099750000002,6.870566649999999,6.842625039999999
For example, I would like to slice it after every 5th row and store the rows indexed 1-4 and 5-9 each in a single CSV (so in this case I would get 2 new CSVs), row 10 should be discarded.
One issue is that I'll have to apply this to multiple files which differ in length as well as naming the newly created CSVs.
You can do it with a for loop:
n = 5
for i in range(len(df) // n):  # floor division drops the incomplete final chunk
    df.iloc[i*n:(i+1)*n].to_csv('Stored_files_' + str(i) + '.csv')
Use iloc rather than loc here: loc slices by label and includes the end label, so df.loc[0:5] would return six rows.
So the first iteration stores rows 0 to 4 under the name "Stored_files_0.csv".
The second iteration stores rows 5 to 9 under the name "Stored_files_1.csv".
And so on...
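Since the question mentions several input files of different lengths, the loop can be wrapped in a small helper. The sketch below is one possible way to do it; the function name write_full_chunks, the "recording" stem, and the output naming scheme are all hypothetical:

```python
import tempfile
from pathlib import Path

import pandas as pd


def write_full_chunks(df, n, stem, out_dir):
    """Write each complete block of n rows to its own CSV and return the paths."""
    paths = []
    for i in range(len(df) // n):  # floor division skips the leftover rows
        path = Path(out_dir) / f"{stem}_{i}.csv"
        # iloc is positional and end-exclusive, so each slice has exactly n rows
        df.iloc[i * n:(i + 1) * n].to_csv(path, index=False)
        paths.append(path)
    return paths


# Toy frame with 11 rows, like the sample in the question; row 10 is discarded.
df = pd.DataFrame({"index": range(11), "acce_x": [0.1 * k for k in range(11)]})
out_dir = tempfile.mkdtemp()
paths = write_full_chunks(df, 5, "recording", out_dir)
```

Passing the stem of each source file (for example Path(filename).stem) as the stem argument keeps the output names tied to their input.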

How to remove a row that has no index number - Python

I am a newbie in Python.
Say I have the following very simple table saved in a text file:
one two three
four five six
seven eight nine
I can view this table using pandas:
import pandas as pd
df = pd.read_csv('table.txt')
print(df)
The outcome of this is:
one two three
0 four five six
1 seven eight nine
Now, I want to delete the first row (i.e. one two three). My first thought was to write:
df.drop([0])
Because rows and columns are numbered starting from 0. But what this does is that it drops the second row (i.e. four five six).
Thus the title of this question. How to remove a row that has no index number? Because as seen from print(df), the first row is not given an index.
I did try to Google this, but I was not able to find an answer. This is probably because I do not know some keywords.
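"one two three" is being read as the header of the CSV file, which is why it has no index number.
To have pandas treat it as an ordinary data row instead, pass header=None (note the capital N):
df = pd.read_csv('table.txt', header=None)
The first line then gets index 0, so df.drop([0]) removes it.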
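A short sketch of both routes, reading the table from a string so it runs as-is. It assumes the columns are separated by whitespace (hence sep=r"\s+", which the original call does not use) and also shows skiprows=1 as an alternative that never loads the first line:

```python
import io

import pandas as pd

text = "one two three\nfour five six\nseven eight nine\n"

# header=None makes the first line an ordinary row with index 0.
df = pd.read_csv(io.StringIO(text), sep=r"\s+", header=None)

# Option 1: drop that row after loading and renumber the index.
without_first = df.drop(index=0).reset_index(drop=True)

# Option 2: skip the first line while reading.
skipped = pd.read_csv(io.StringIO(text), sep=r"\s+", header=None, skiprows=1)
```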

Concatenating the entire row in Pandas

I'm new to Pandas. I want to join each entire row of strings into one paragraph, so that each row ends up with a single paragraph instead of a number of columns. For example, if a DataFrame has 10 columns and 5 rows, I want the output to be 5 rows with a single column that combines the string data of all 10 columns. I need this in order to apply Bag of Words / TF-IDF for sentiment analysis.
I thought a solution for row 1 might be
' '.join(str(x) for x in df.iloc[1, 0:11])
If there is a better solution, it would be more helpful for me.
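One way to do this for every row at once is to convert all cells to strings and join them along each row. A minimal sketch with a made-up three-column frame; the column names and the output column "paragraph" are illustrative:

```python
import pandas as pd

df = pd.DataFrame({
    "a": ["good", "bad"],
    "b": ["movie", "service"],
    "c": ["overall", 3],  # mixed types are fine after astype(str)
})

# Cast every cell to str, then join the cells of each row with a space.
paragraphs = df.astype(str).apply(" ".join, axis=1)

# Keep the result as a single-column DataFrame, one paragraph per original row.
result = paragraphs.to_frame(name="paragraph")
```

The resulting column can then be passed to a vectorizer for Bag of Words or TF-IDF.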

Read data from .dat file as rows and columns, ignore comments with loop

I have a .plt file with 110 datasets, each with 11 rows and 9 columns. The datasets are separated by 3 rows of comments each time. I want to read the file as rows and columns. pandas reads it as rows but doesn't recognize the columns. This may be because the first three lines of the file are comments, one starting with '#' and the other two with '$'. How do I make pandas convert this to a CSV-type file?
I have tried read_csv and read_fwf, set delimiters and comment characters, and skipped the first three rows, but pandas still reads all the columns as a single column with index 0.
I ended up ignoring those lines by using
test = pd.read_table(filename, header=None)
list_a = test.iloc[start:end]
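An alternative that avoids working out start and end for every dataset is to drop the comment lines before pandas sees them. The sketch below assumes the data columns are separated by whitespace, which the question does not state, and uses a hypothetical helper name and a tiny made-up sample in place of the real file:

```python
import io

import pandas as pd


def read_commented_table(lines, comment_prefixes=("#", "$")):
    """Parse whitespace-separated data lines, skipping blank and comment lines."""
    kept = [
        ln for ln in lines
        if ln.strip() and not ln.lstrip().startswith(comment_prefixes)
    ]
    return pd.read_csv(io.StringIO("".join(kept)), sep=r"\s+", header=None)


# Two tiny "datasets", each preceded by three comment lines.
sample = (
    "# title\n$ a\n$ b\n1.0 2.0 3.0\n4.0 5.0 6.0\n"
    "# next\n$ a\n$ b\n7.0 8.0 9.0\n10.0 11.0 12.0\n"
)
table = read_commented_table(sample.splitlines(keepends=True))
```

For a real file, the lines argument can be an open file object. The combined table can be written out with table.to_csv(...), or cut into per-dataset blocks by slicing every 11 rows.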
