I'm trying to import a .csv file using pandas.read_csv(), however, I don't want to import the 2nd row of the data file (the row with index = 1 for 0-indexing).
I can't see how to avoid importing it, because the arguments to the function seem ambiguous:
From the pandas website:
skiprows : list-like or integer
Row numbers to skip (0-indexed) or number of rows to skip (int) at the
start of the file.
If I put skiprows=1 in the arguments, how does it know whether to skip the first row or skip the row with index 1?
You can try yourself:
>>> import pandas as pd
>>> from io import StringIO  # Python 2: from StringIO import StringIO
>>> s = """1, 2
... 3, 4
... 5, 6"""
>>> pd.read_csv(StringIO(s), skiprows=[1], header=None)
0 1
0 1 2
1 5 6
>>> pd.read_csv(StringIO(s), skiprows=1, header=None)
0 1
0 3 4
1 5 6
I don't have reputation to comment yet, but I want to add to alko's answer for future reference.
From the docs:
skiprows: A collection of numbers for rows in the file to skip. Can also be an integer to skip the first n rows
I ran into the same issue when using skiprows while reading a csv file. Note that the parameter is skiprows: I was passing skip_rows=1, which does not work.
A simple example shows how to use skiprows when reading a csv file:
import pandas as pd

# skiprows=1 skips the first line, so reading starts from the second line
df = pd.read_csv('my_csv_file.csv', skiprows=1)

# print the data frame
df
All of these answers miss one important point -- the n'th line is the n'th line in the file, and not the n'th row in the dataset. I have a situation where I download some antiquated stream gauge data from the USGS. The head of the dataset is commented with '#', the first line after the comments is the label row, next comes a row that describes the data types, and last the data itself. I never know how many comment lines there are, but I know what the first couple of rows are. Example:
> # ----------------------------- WARNING ----------------------------------
> # Some of the data that you have obtained from this U.S. Geological Survey database
> # may not have received Director's approval. ...
> agency_cd  site_no   datetime          tz_cd  139719_00065  139719_00065_cd
> 5s         15s       20d               6s     14n           10s
> USGS       08041780  2018-05-06 00:00  CDT    1.98          A
It would be nice if there was a way to automatically skip the n'th row as well as the n'th line.
As a note, I was able to fix my issue with:
import pandas as pd
ds = pd.read_csv(fname, comment='#', sep='\t', header=0, parse_dates=True)
ds.drop(0, inplace=True)
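To make this concrete, here is a minimal, self-contained sketch with made-up data mimicking the USGS layout (comments, label row, type row, data). comment='#' drops the comment lines no matter how many there are, header=0 takes the label row, and the type row then becomes data row 0, which drop(0) removes:

```python
import pandas as pd
from io import StringIO

# Made-up sample mimicking the USGS layout described above
raw = (
    "# comment line one\n"
    "# comment line two\n"
    "site_no\tdatetime\tvalue\n"   # label row
    "15s\t20d\t14n\n"              # type row
    "08041780\t2018-05-06 00:00\t1.98\n"  # data
)

ds = pd.read_csv(StringIO(raw), comment='#', sep='\t', header=0)
ds = ds.drop(0).reset_index(drop=True)  # the type row landed at index 0
print(ds)
```

Note that because the type row was read as data, every column comes back as strings (object dtype); convert dtypes afterwards if needed.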
Indices in read_csv refer to line/row numbers in your csv file (the first line has the index 0). You have the following options to skip rows:
import pandas as pd
from io import StringIO
csv = \
"""col1,col2
1,a
2,b
3,c
4,d
"""
pd.read_csv(StringIO(csv))
# Output:
col1 col2 # index 0
0 1 a # index 1
1 2 b # index 2
2 3 c # index 3
3 4 d # index 4
Skip two lines at the start of the file (index 0 and 1). The column-name line (index 0) is skipped as well, so the first remaining line (2,b) is used for the column names. To set proper column names instead, use the names=['col1', 'col2'] parameter:
pd.read_csv(StringIO(csv), skiprows=2)
# Output:
2 b
0 3 c
1 4 d
Skip second and fourth lines (index 1 and 3):
pd.read_csv(StringIO(csv), skiprows=[1, 3])
# Output:
col1 col2
0 2 b
1 4 d
Skip last two lines:
pd.read_csv(StringIO(csv), engine='python', skipfooter=2)
# Output:
col1 col2
0 1 a
1 2 b
Use a lambda function to skip every second line (index 1 and 3):
pd.read_csv(StringIO(csv), skiprows=lambda x: (x % 2) != 0)
# Output:
col1 col2
0 2 b
1 4 d
skiprows=[1] will skip the second line, not the first one.
Related
I have a csv file which is continuously updated with new data. The data has 5 columns, however lately the data have been changed to 4 columns. The column that is not present in later data is the first one. When I try to read this csv file into a dataframe, half of the data is under the wrong columns. The data is around 50k entries.
df
################################
   0  time  C-3  C-4  C-5
______________________________
   0  1     str  4    5      <- old entries
   0  1     str  4    5
   1  str   4    5    NaN    <- new entries
   1  str   4    5    NaN
   1  str   4    5    NaN
#################################
The first column in the earlier entries (where the value is 0) is not important.
My expected output is the DataFrame printed with the right values in the right columns. I have no idea how to add the 0 before the str values where it is missing, or conversely, how to remove the 0. (The 0 looks to be an index counter with values starting at 1, then 2, etc.)
Here is how I load and process the csv file at the moment:
def process_csv(obj):
    data = 'some_data'
    path = 'data/'
    file = f'{obj}_{data}.csv'
    df = pd.read_csv(path + file)
    df = df.rename(columns=df.iloc[0]).drop(df.index[0])
    df = df[df.time != 'time']
    mask = df['time'].str.contains(' ')
    df['time'] = (pd.to_datetime(df.loc[mask, 'time'])
                  .reindex(df.index)
                  .fillna(pd.to_datetime(df.loc[~mask, 'time'], unit='ms')))
    df = df.drop_duplicates(subset=['time'])
    df = df.set_index("time")
    df = df.sort_index()
    return df
Since column time should be of type int, a missing first column causes str values that belong in C-3 to land in column time, which causes this error:
ValueError: non convertible value str with the unit 'ms'
Question: How can I either remove the early values from column 0 or add some values to the later entries?
If one CSV file contains multiple formats, and these CSV files cannot be changed, then you could parse the files before creating data frames.
For example, the test data has both 3- and 4-field records. The function gen_read_lines() always returns 3 fields per record.
from io import StringIO
import csv
import pandas as pd
data = '''1,10,11
2,20,21
3,-1,30,31
4,-2,40,41
'''
def gen_read_lines(text):
    lines = csv.reader(StringIO(text))
    for record in lines:
        if len(record) == 3:    # return all fields
            yield record
        elif len(record) == 4:  # drop field 1
            yield [record[i] for i in [0, 2, 3]]
        else:
            raise ValueError(f'did not expect {len(record)} fields')

records = gen_read_lines(data)
df = pd.DataFrame(records, columns=['A', 'B', 'C'])
print(df)
A B C
0 1 10 11
1 2 20 21
2 3 30 31
3 4 40 41
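If you would rather stay entirely in pandas after reading, an alternative sketch (assuming the same sample data) is to read every row into the maximum number of columns and realign the short rows with where(). The helper column names c0–c3 are made up for illustration:

```python
import pandas as pd
from io import StringIO

data = '''1,10,11
2,20,21
3,-1,30,31
4,-2,40,41
'''

# Read into the maximum width; 3-field rows get NaN in the last column
raw = pd.read_csv(StringIO(data), header=None, names=['c0', 'c1', 'c2', 'c3'])

short = raw['c3'].isna()  # True for the 3-field rows

# 3-field rows: A,B,C come from c0,c1,c2; 4-field rows: from c0,c2,c3
df = pd.DataFrame({
    'A': raw['c0'],
    'B': raw['c1'].where(short, raw['c2']),
    'C': raw['c2'].where(short, raw['c3']),
})
print(df)
```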
The csv file:
Link to Github
This is my code:
import pandas as pd
df = pd.read_csv("log_1_2018_09_07.csv", encoding="ISO-8859-1", delimiter=';')
print(df.columns.tolist())
dates = []
times = []
outputs = []
for date in df.loc[:, "Datum"]:
    dates.append(date)
    print("date")
    print(date)

for time in df.loc[:, " Zeit"]:
    times.append(time)
    print("time")
    print(time)

for out in df.iloc[:, 19]:
    print("output")
    outputs.append(out)
    print(out)
It reads the dates and times correctly, but the 19th column (column T) should be all 0 with the 6th value being 990; however, pandas reads it as all 0 with the 9th value as 1.
Does anybody know why it's reading the wrong values?
Thank you!!
import pandas as pd
url = 'https://raw.github.com/liamrisch/helper/master/log_1_2018_09_07.csv'
df = pd.read_csv(url, encoding="ISO-8859-1", delimiter=';')
df.iloc[:,[6,19]]
Gives:
Teil 1-8 - Abstand Rasthaken MP1-MP2 Teil 1-8
0 26,764 0
1 26,787 0
2 26,792 0
3 26,788 0
4 26,771 0
5 999,990 0
6 26,786 0
7 26,785 0
8 26,780 1
9 26,783 0
10 26,798 0
Take a very close look at the data: the value really is 1, but since 999 takes more visual space it creates the illusion of 999 being the value in that cell.
Printing that column of the df (before any manipulation) shows the actual values of the column, without any surprises.
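When aligned console output is ambiguous like this, inspecting individual cells avoids the illusion entirely. A small sketch with made-up data shaped like the frame above:

```python
import pandas as pd
from io import StringIO

# Two columns whose printed alignment can mislead the eye:
# comma-decimal strings next to narrow integer flags
raw = """wide;flag
26,771;0
999,990;0
26,780;1
"""
df = pd.read_csv(StringIO(raw), delimiter=';')

# Look at single cells instead of eyeballing the printed frame
print(df.loc[1, 'wide'])  # the wide value in its own right
print(df.loc[2, 'flag'])  # the flag really belongs to its own column
```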
10 4
0 0 0 0
0 0 1 0
1 0 2 0
2 2 0 1
0 0 1 1
0 1 2 1
1 1 1 1
1 2 2 0
0 0 1 0
2 0 1 0
I want to read the above text file and store it in xlsx format, but with my code the whole row is stored in the same cell. I want each item in a different cell. The items are space separated.
import pandas as pd
df=pd.read_csv('input.txt')
df.to_excel('test2.xls',index=False)
Expected result: Each item stored in different cell.
Note that the first row only has 2 elements.
You need to specify the delimiter to be a space instead of a comma.
import pandas as pd
df = pd.read_csv("input.txt", header=None, delim_whitespace=True, names=['a','b','c','d'])
df.to_excel('test2.xls',index=False, header=None)
Notice that I used delim_whitespace=True to tell pandas to split on whitespace instead of commas. Also, since you have only 2 elements in the first row, I name the columns to let pandas know that you expect 4 columns.
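As a side note, newer pandas releases deprecate delim_whitespace in favor of a regex separator, sep=r'\s+'. An equivalent call, sketched on inline data shaped like input.txt:

```python
import pandas as pd
from io import StringIO

# Inline stand-in for input.txt; the first row only has 2 elements
raw = """10 4
0 0 0 0
2 2 0 1
"""

# sep=r'\s+' splits on runs of whitespace; the short first row pads with NaN
df = pd.read_csv(StringIO(raw), header=None, sep=r'\s+',
                 names=['a', 'b', 'c', 'd'])
print(df)
```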
import pandas as pd

with open('input.txt') as fin:
    lines = fin.readlines()

# pad the first row so "10 4" becomes "0 10 4 0" (four fields)
lines[0] = "0 " + lines[0].rstrip() + " 0\n"

with open('input2.txt', 'w') as fout:
    fout.writelines(lines)

df = pd.read_csv('input2.txt', header=None, delim_whitespace=True)
df.to_excel('output.xls', index=False, header=None)
This creates an unnecessary input2.txt, and it assumes your first line is the first row of the text. What it changes is that 10 4 becomes 0 10 4 0, so that row also has 4 elements.
I used delim_whitespace=True in read_csv() because it tells pandas to split on whitespace rather than on commas.
At the top, I open the file and call lines = fin.readlines(); lines[0] is the first line, and I pad it with a 0 and a space on the left and a space and a 0 on the right. Then I write all the lines out to a new file.
Finally, I pass that file to read_csv() and call to_excel() with the parameters discussed above.
It's okay that this creates a new text file, and that the padding step is not pandas at all.
Hope this helps!
For example, my csv file is as below; (1,2,3) is the header:
1,2,3
0,0,0
I read the csv file using pd.read_csv and print:
import pandas as pd
df = pd.read_csv('./test.csv')
print(df[1])
It raises KeyError: 1.
It seems that read_csv parses the header as strings.
Is there any way to use integer column names in the dataframe?
I think a more general approach is to cast the column names to integers with astype:
df = pd.read_csv('./test.csv')
df.columns = df.columns.astype(int)
Another way is to first get only the first row and use the names parameter in read_csv:
import csv
import numpy as np

with open("file.csv", "r") as f:
    reader = csv.reader(f)
    i = np.array(next(reader)).astype(int)
#another way
#i = pd.read_csv("file.csv", nrows=0).columns.astype(int)
print (i)
[1 2 3]
df = pd.read_csv("file.csv", names=i, skiprows=1)
print (df.columns)
Int64Index([1, 2, 3], dtype='int64')
Skip the header row using skiprows=1 and header=None. This automatically loads a dataframe with integer headers starting from 0 onwards.
df = pd.read_csv('test.csv', skiprows=1, header=None).rename(columns=lambda x: x + 1)
df
1 2 3
0 0 0 0
The rename call is optional, but if you want your headers to start from 1, you may keep it in.
If you have a MultiIndex, use set_levels to cast just the 0th level to integer. Note that set_levels expects the unique level values, so pass columns.levels[0] rather than get_level_values(0), which contains duplicates:
df.columns = df.columns.set_levels(
    df.columns.levels[0].astype(int), level=0
)
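A self-contained sketch of the MultiIndex case, using a made-up frame:

```python
import pandas as pd

# A frame whose columns form a MultiIndex with string numbers at level 0
df = pd.DataFrame(
    [[1, 3, 5, 7], [0, 2, 4, 6]],
    columns=pd.MultiIndex.from_tuples(
        [('1', 'a'), ('1', 'b'), ('2', 'a'), ('2', 'b')]
    ),
)

# Cast only level 0 of the column MultiIndex to integers;
# set_levels takes the unique level values (columns.levels[0])
df.columns = df.columns.set_levels(
    df.columns.levels[0].astype(int), level=0
)
print(df)
```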
You can use set_axis in conjunction with a lambda and pd.Index.map
Consider a csv that looks like:
1,1,2,2
a,b,a,b
1,3,5,7
0,2,4,6
Read it like:
df = pd.read_csv('test.csv', header=[0, 1])
df
1 2
a b a b
0 1 3 5 7
1 0 2 4 6
You can pipeline the column setting with integers in the first level like:
df.set_axis(df.columns.map(lambda i: (int(i[0]), i[1])), axis=1)
1 2
a b a b
0 1 3 5 7
1 0 2 4 6
Is there any way to use integer column names in the dataframe?
I find this quite elegant:
df = pd.read_csv('test.csv').rename(columns=int)
Note that int here is the built-in function int().
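For completeness, a quick demonstration of this on the question's data, inlined via StringIO:

```python
import pandas as pd
from io import StringIO

# Header "1,2,3" is parsed as strings; rename(columns=int) casts the names
raw = """1,2,3
0,0,0
"""
df = pd.read_csv(StringIO(raw)).rename(columns=int)
print(df[1])  # integer column labels now work
```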