How to remove a row that has no index number - Python

I am a newbie in Python.
Say I have the following very simple table saved in a text file:
one two three
four five six
seven eight nine
I can view this table using pandas:
import pandas as pd
df = pd.read_csv('table.txt')
print(df)
The outcome of this is:
one two three
0 four five six
1 seven eight nine
Now, I want to delete the first row (i.e. one two three). My first thought was to write:
df.drop([0])
Rows and columns are numbered starting from 0, so I expected this to work. But what it actually does is drop the second row (i.e. four five six).
Thus the title of this question: how do I remove a row that has no index number? Because, as seen from print(df), the first row is not given an index.
I did try to Google this, but I was not able to find an answer, probably because I do not know the right keywords.

one two three is being treated as the header of the CSV file, which is why it has no index: pandas does not consider it part of the data. To read it in as an ordinary row instead, tell read_csv that the file has no header:
df = pd.read_csv('table.txt', header=None)
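If the goal is to get rid of that first line entirely, here is a minimal sketch (assuming the same table.txt as above):

import pandas as pd

# Read without a header so 'one two three' becomes data row 0 ...
df = pd.read_csv('table.txt', header=None)
# ... and now drop(0) removes it, exactly as originally attempted.
df = df.drop(0).reset_index(drop=True)
print(df)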

Related

Counting elements in specified column of a .csv file

I am programming in Python.
I want to count how many times each word appears in a column. Column 4 of my .csv file contains about 7 different words, and I need to know how many times each one appears. E.g. there are 700 lines and I need to count how many times the phrase HelloWorld appears in column 4.
You can use pandas.Series.value_counts() on the column you want. Since you mentioned it's the fourth column, you can get it by index using iloc as well. Of course you have to install pandas as it's not from the standard library, e.g. using pip with pip install pandas if you haven't already.
An example:
import pandas as pd
df = pd.read_csv("path/to/file.csv")
fourth_column = df.iloc[:, 3]  # Gets all rows for the fourth column (index starts at 0)
counts = fourth_column.value_counts()
print(counts) # You'll see the number of times each string appears in the column
# The keys are the strings and the values are the number of times they appear
hello_world_counts = counts["HelloWorld"]

Slicing Pandas DataFrame every nth row

I have a CSV file, which I read in as a pandas DataFrame. I want to slice the data after every nth row and store each slice as an individual CSV.
My data looks somewhat like this:
index,acce_x,acce_y,acce_z,grav_x,grav_y,grav_z
0,-2.53406373,6.92596131,4.499464420000001,-2.8623820449999995,7.850541115,5.129520459999999
1,-2.3442032099999994,6.878311170000001,5.546690349999999,-2.6456542850000004,7.58022081,5.62603916
2,-1.8804458600000005,6.775125179999999,6.566146829999999,-2.336306185,7.321197125,6.088656729999999
3,-1.7059021099999998,6.650866649999999,7.07060242,-2.1012737650000006,7.1111130000000005,6.416324900000001
4,-1.6802886999999995,6.699703990000001,7.15823367,-1.938001715,6.976289879999999,6.613534820000001
5,-1.6156433,6.871610960000001,7.13333286,-1.81060772,6.901037819999999,6.72789553
6,-1.67286072,7.005918899999999,7.22047422,-1.722352455,6.848503825,6.8044359100000005
7,-1.56608278,7.136883599999999,7.150566069999999,-1.647941205,6.821055315,6.850329440000001
8,-1.3831649899999998,7.2735946999999985,6.88074028,-1.578703155,6.821634375,6.866061665000001
9,-1.25986478,7.379898050000001,6.590330490000001,-1.5190086099999998,6.839881785,6.861375744999999
10,-1.1101097050000002,7.48500525,6.287461959999999,-1.4641099750000002,6.870566649999999,6.842625039999999
For example, I would like to slice it after every 5th row and store the rows indexed 0-4 and 5-9 each in a single CSV (so in this case I would get 2 new CSVs); row 10 should be discarded.
One issue is that I'll have to apply this to multiple files which differ in length, as well as name the newly created CSVs.
You can do it with a for loop:
for i in range(len(df) // 5):  # integer division keeps only complete chunks of 5 rows
    df.iloc[i*5:(i+1)*5].to_csv('Stored_files_' + str(i) + '.csv')
So on the first iteration rows 0 to 4 are stored under the name "Stored_files_0.csv".
On the second iteration rows 5 to 9 under "Stored_files_1.csv".
And so on... Any leftover rows that do not fill a complete chunk of 5 (row 10 here) are discarded.
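To handle several input files that differ in length, a sketch along these lines could work (the data/ folder and the _part naming scheme are placeholder assumptions):

import glob
import os
import pandas as pd

# Hypothetical layout: every input CSV sits in data/; adjust the pattern as needed.
for path in glob.glob('data/*.csv'):
    df = pd.read_csv(path)
    base = os.path.splitext(os.path.basename(path))[0]
    for i in range(len(df) // 5):  # complete chunks of 5 only; the remainder is dropped
        df.iloc[i*5:(i+1)*5].to_csv(base + '_part' + str(i) + '.csv', index=False)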

pandas - single column to multiple columns (getting key length error)

I'm having trouble with a thought process on a single to multiple columns using pandas. I have a main column that could have up to ten words separated by commas. I only have eight columns to split out these words to (no more).
I'm currently using the code below to split words out into multiple columns. It works as long as I know exactly how many words are in the longest cell. For example, one of the cells in the original file must contain exactly eight words for this to work; otherwise I get an error (Columns must be same length as key). In testing, I have found that the number of columns I assign to must exactly match the number of words in the longest cell. No more, no less.
df[['column1','column2','column3','column4','column5','column6','column7','column8']] = df['main'].str.split(',', expand=True)
What I'd like is a way to not worry about how many words are in the cells of the main column. If the longest cell contains 6 words, split them out into 6 columns. If the longest cell contains 8 words, split them into 8 columns. If the longest cell contains 10 words, drop the last two words and split the rest across the 8 columns.
About the original file's main column: I will not know how many words exist in each of the cells. I just have 8 columns, so the first eight words (if there are that many) get the honor of a column. The rest of the words (if any) get dropped.
Question: why do I need exactly the same number of columns in the code above, given that my longest cell's word count doesn't exceed my column count? I'm not understanding something.
Any help with the logic would be appreciated.
Here is my current attempt:
import numpy as np

cols = df[['column1','column2','column3','column4','column5','column6','column7','column8']]
df2 = df['main'].str.split(',', expand=True, n=8)
#df = df.assign(**df2.set_axis(cols[:df2.shape[1]], axis=1))
#-------
if 0 in df2.columns:
    df['column1'] = np.where(df2[0].isnull(), df['column1'], df2[0])
You can limit the number of splits with the n argument and then trim the last column:
df2 = df['main'].str.split(',', expand=True, n=7)  # at most 8 parts; the 8th holds any overflow
if 7 in df2.columns:  # only present when some cell has 8 or more words
    df2[7] = df2[7].str.split(',').str[0]  # keep the 8th word, drop everything after it
cols = ['column1','column2','column3','column4','column5','column6','column7','column8']
df = df.assign(**df2.set_axis(cols[:df2.shape[1]], axis=1))
You can use df.columns instead of the cols list if you do want to save the result into the first df2.shape[1] columns of df.
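As for why the key length error appears at all: str.split(..., expand=True) only produces as many columns as the longest split, so assigning to 8 labels fails whenever the split yields a different number of parts. A minimal self-contained sketch (toy data, hypothetical values) that always pads out to 8 columns:

import pandas as pd

df = pd.DataFrame({'main': ['a,b,c', 'a,b,c,d,e,f,g,h,i,j']})
cols = ['column' + str(i) for i in range(1, 9)]

parts = df['main'].str.split(',', n=7, expand=True)  # at most 8 parts
if parts.shape[1] == 8:  # only when some cell had 8 or more words
    parts[7] = parts[7].str.split(',').str[0]  # drop words beyond the 8th
parts = parts.reindex(columns=range(8))  # pad to exactly 8 columns with NaN
df[cols] = parts.to_numpy()  # the key length now always matches
print(df)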

Pandas - Read only first few lines of each rows

I have a large CSV file with about 10000 rows of text information, and each row of my dataset consists of a number of lines. However, I just want to read, say, the first 20 lines of each of the rows of my CSV file.
I came across the nrows parameter in the pandas read_csv method, which helps in limiting the number of rows of the dataset to load. Is there also a way to read only the first 20 lines of data from each of the rows in pandas?
You can read in the csv with df = pd.read_csv('path\file.csv') and then just select the first 20 rows with df_new = df.head(20). Is that what you were thinking of?
If I get your question correctly, your CSV file has multiple rows, where each row has multiple lines separated by the newline character '\n'. And you want to choose the first (say, for example) 3 lines from each row.
This can be achieved as:
import pandas as pd

# Read in the CSV file using pandas-
data = pd.read_csv("example.csv")
# The first two rows (toy example) of dataset are-
data.iloc[0,0]
# 'Hello, this is first line\nAnd this is the second line\nThird and final line'
data.iloc[1,0]
# 'Today is 3rd June\nThe year is 2020\nSummer weather'
# First row's first line-
data.iloc[0,0].split("\n")[0]
# 'Hello, this is first line'
# First row's first two lines-
data.iloc[0,0].split("\n")[0:2]
# ['Hello, this is first line', 'And this is the second line']
The general syntax to get the first 'n' lines from row 'x' (assuming that the first column has the string data) is:
data.iloc[x,0].split("\n")[:n]
To select the first 'm' lines (assuming there are m lines or more) from the first 'x' rows, use the code:
data.iloc[:x, 0].apply(lambda y: y.split("\n")[0:m])
Does this help?
If TiTo's answer is not what you are looking for, maybe the iloc method is. You can store the first 20 rows by doing firstRows = df.iloc[:20].
However, if you only ever need the first 20 rows, you shouldn't load the whole file into memory. As you mentioned, this can be achieved with the nrows parameter.
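For completeness, a minimal sketch of the nrows route (the file path is just a placeholder):

import pandas as pd

# Parsing stops after 20 rows, so the rest of the large file is never loaded.
df = pd.read_csv('path/to/file.csv', nrows=20)
print(df.shape)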

How to compare two dataframes and create a new one for those entries which are the same across two columns in the same row

I have been trying to compare two dataframes, creating new dataframes for the entries which have the same values in two columns. I thought I had cracked it, but the code I have now just looks at the two columns of interest and, if a string is found anywhere in a column, considers it a match. I need the two strings to be common on the same row across the columns. A sample of the code follows.
#produce table with common items
vto_in_jeff = df_vto[(df_vto['source'].isin(df_jeff['source']) & df_vto['target'].isin(df_jeff['target']))].dropna().reset_index(drop=True)
#vto_in_jeff.index = vto_in_jeff.index + 1
vto_in_jeff['compare'] = 'Common_terms'
print(vto_in_jeff)
vto_in_jeff.to_csv(output_path+'vto_in_'+f+'.csv', index=False)
So this code produces a table listing the rows whose source and target strings both appear somewhere in the other dataframe, but not necessarily in the same row. Can anyone help me compare row by row?
You can use the pandas merge method:
result = pd.merge(df1, df2, on='key')
here are more details:
https://pandas.pydata.org/pandas-docs/stable/user_guide/merging.html#brief-primer-on-merge-methods-relational-algebra
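Applied to this question, here is a minimal sketch (df_vto and df_jeff below are toy stand-ins for the real data): an inner merge on both columns keeps a row only when the same source/target pair occurs in a single row of each frame.

import pandas as pd

df_vto = pd.DataFrame({'source': ['a', 'b', 'c'], 'target': ['z', 'y', 'c']})
df_jeff = pd.DataFrame({'source': ['a', 'b', 'c'], 'target': ['y', 'z', 'c']})

# The isin() approach keeps all three rows of df_vto, because every source and
# target value appears somewhere in df_jeff; only ('c', 'c') is a row-wise match.
vto_in_jeff = pd.merge(df_vto, df_jeff, on=['source', 'target'], how='inner')
print(vto_in_jeff)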
