pandas beginner here,
I read that pandas.read_csv automatically assumes that the first row is a header row, and that if this is not the case, I should pass header=None.
Now I have code that loads CSVs which sometimes have headers and sometimes don't. Is there a way, or a flag for read_csv, to detect a header row automatically?
The rule I have in mind: if a column (or several) has numbers in every row except the first, then the first row is a header row; otherwise there are no headers.
Ok, so quick (and probably fragile) idea:
import pandas as pd
df = pd.DataFrame(columns=["ints_only", "strings_only"],
data=[[1,"a"], [3,"b"]])
df.to_csv("header.csv")
df.to_csv("noheader.csv", header=None)
def has_header(file, nrows=20):
    df = pd.read_csv(file, header=None, nrows=nrows)
    df_header = pd.read_csv(file, nrows=nrows)
    return tuple(df.dtypes) != tuple(df_header.dtypes)
has_header("header.csv") # gives True
has_header("noheader.csv") # gives False
What's happening here?
We read the first nrows (default 20) lines of the csv file twice: once with a header and once without. Then we look at what datatypes pandas assigns to each column. If the datatypes don't change when the first row is treated as a header, then there is no header. That of course only works if you always have at least one column where the header is a string but all other entries share one other datatype that is not a string, e.g. all floats.
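To make that limitation concrete, here is a minimal sketch of the same dtype comparison run on in-memory text instead of files. The helper name has_header_from_text and the three sample strings are made up for illustration; the last sample shows the case the heuristic misses, where every column holds strings.

```python
import io

import pandas as pd


def has_header_from_text(text, nrows=20):
    """Hypothetical helper: True if treating row 0 as a header changes the dtypes."""
    no_header = pd.read_csv(io.StringIO(text), header=None, nrows=nrows)
    with_header = pd.read_csv(io.StringIO(text), nrows=nrows)
    return tuple(no_header.dtypes) != tuple(with_header.dtypes)


# Illustrative samples (not from the original post).
numeric_with_header = "x,y\n1,2.5\n3,4.5\n"
numeric_no_header = "1,2.5\n3,4.5\n"
strings_with_header = "name,city\nann,oslo\nbob,rome\n"

print(has_header_from_text(numeric_with_header))   # header detected
print(has_header_from_text(numeric_no_header))     # no header detected
print(has_header_from_text(strings_with_header))   # header is missed: all columns are strings
```

The third call is the fragile case: the header and the data have the same type, so the dtypes never change and the header goes undetected.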
When a DataFrame has no header, its DataFrame.columns property uses integer labels. Otherwise, it uses strings. So, just check the type of the first column label.
import pandas as pd
import io
def has_header(df):
    return isinstance(df.columns[0], str)
csv=u"""col1,col2,col3
5,2,7
4,9,6
7,3,1"""
df1 = pd.read_csv(io.StringIO(csv))
print(df1.head())
if has_header(df1):
    print("Dataframe 1 has header")
else:
    print("Dataframe 1 doesn't have header")
csv=u"""5,2,7
4,9,6
7,3,1"""
df2 = pd.read_csv(io.StringIO(csv), header=None)
print(df2.head())
if has_header(df2):
    print("Dataframe 2 has header")
else:
    print("Dataframe 2 doesn't have header")
df3= pd.read_csv(io.StringIO(csv))
print(df3.head())
if has_header(df3):
    print("Dataframe 3 has header")
else:
    print("Dataframe 3 doesn't have header")
df4 = pd.read_csv(io.StringIO(csv), header='infer')
print(df4.head())
if has_header(df4):
    print("Dataframe 4 has header")
else:
    print("Dataframe 4 doesn't have header")
Here is the output produced by the above code.
col1 col2 col3
0 5 2 7
1 4 9 6
2 7 3 1
Dataframe 1 has header
0 1 2
0 5 2 7
1 4 9 6
2 7 3 1
Dataframe 2 doesn't have header
5 2 7
0 4 9 6
1 7 3 1
Dataframe 3 has header
5 2 7
0 4 9 6
1 7 3 1
Dataframe 4 has header
Please note that when using pd.read_csv to create your DataFrame you have to explicitly set header=None. Otherwise, the column names are inferred from the first line of the file (see pandas.read_csv), which is why Dataframes 3 and 4 above report a header even though the data has none.
You may use str.contains:
df['column_name'].str.contains('text_you_are_expecting_in_header')
This would return a True/False based on whether the column entries contain what you are looking for.
Thereafter, you may read off the first entry (for your header row), and if it matches the text you expect in your header, then you have a header, else you don't have a header.
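A different technique, not mentioned in the answers above, is the standard library's csv.Sniffer, whose has_header method applies a heuristic close to the rule in the question (it compares the first row against the types of the rows below it). The sketch below is only an illustration: the wrapper name sniff_header and the sample strings are made up, and the heuristic can be fooled in the same way as the others, for example by all-string data.

```python
import csv


def sniff_header(sample: str) -> bool:
    """Hypothetical wrapper around csv.Sniffer().has_header for a text sample."""
    return csv.Sniffer().has_header(sample)


# Same data as in the answer above, once with and once without a header row.
with_header = "col1,col2,col3\n5,2,7\n4,9,6\n7,3,1\n"
without_header = "5,2,7\n4,9,6\n7,3,1\n"

print(sniff_header(with_header))
print(sniff_header(without_header))
```

The result could then decide whether to pass header=None or header=0 to pd.read_csv.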
So I have a dataframe like this:-
0 1 2 ...
0 Index Something Something2 ...
1 1 5 8 ...
2 2 6 9 ...
3 3 7 10 ...
Now, I want to append some columns in between those "Something" column names, for which I have used this code:-
j = 1
for i in range(2, 51):
    if i % 2 != 0 and i != 4:
        df.insert(i, f"% Difference {j}", " ")
        j += 1
where df is the dataframe. Now what happens is that the columns do get inserted but like this:-
0 1 Difference 1 2 ...
0 Index Something NaN Something2 ...
1 1 5 NaN 8 ...
2 2 6 NaN 9 ...
3 3 7 NaN 10 ...
whereas what I wanted was this:-
0 1 2 3 ...
0 Index Something Difference 1 Something2 ...
1 1 5 NaN 8 ...
2 2 6 NaN 9 ...
3 3 7 NaN 10 ...
Edit 1 Using jezrael's logic:-
df.columns = df.iloc[0].tolist()
df = df.iloc[1:].reset_index(drop = True)
print(df)
The output of that is still this:-
0 1 2 ...
0 Index Something Something2 ...
1 1 5 8 ...
2 2 6 9 ...
3 3 7 10 ...
Any ideas or suggestions as to where or how I am going wrong?
If your dataframe looks like what you've shown in your first code block, your column names aren't Index, Something, etc. - they're actually 0, 1, etc.
Pandas is seeing Index, Something, etc. as data in row 0, NOT as column names (which exist above row 0). So when you add a column with the name Difference 1, you're adding a column above row 0, which is where the range of integers is located.
A couple potential solutions to this:
If you'd like the actual column names to be Index, Something, etc. then the best solution is to import the data with that row as the headers. What is the source of your data? If it's a csv, make sure to NOT use the header = None option. If it's from somewhere else, there is likely an option to pass in a list of the column names to use. I can't think of any reason why you'd want to have a range of integer values as your column names rather than the more descriptive names that you have listed.
Alternatively, you can do what @jezrael suggested and convert your first row of data to column names, then delete that data row. I'm not sure why their solution isn't working for you, since the code seems to work fine in my testing. Here's what it's doing:
df.columns = df.iloc[0].tolist()
df.columns tells pandas what to (re)name the columns of the dataframe. df.iloc[0].tolist() creates a list out of the first row of data, which in your case is the column names that you actually want.
df = df.iloc[1:].reset_index(drop = True)
This grabs the 2nd through last rows of data to recreate the dataframe. So you have new column names based on the first row, then you recreate the dataframe starting at the second row. The .reset_index(drop = True) isn't totally necessary to include. That just restarts your actual data rows with an index value of 0 rather than 1.
If for some reason you want to keep the column names as they currently exist (as integers rather than labels), you could do something like the following under the if statement in your for loop:
df.insert(i, i, np.nan, allow_duplicates = True)
df.iat[0, i] = f"%Difference {j}"
df.columns = np.arange(len(df.columns))
The first line inserts a column with an integer label, filled with NaN values to start with (assuming you have numpy imported as np). You need to allow duplicates; otherwise you'll get an error, since the integer label is already the name of a pre-existing column.
The second line changes the value in the 1st row of the newly-created column to what you want.
The third line resets the column names to be a range of integers like you had to start with.
As @jezrael suggested, it seems like you might be a little unclear about the difference between column names, indices, and data rows and columns. An index is its own thing, so it's not usually necessary to have a column named Index like you have in your dataframe, especially since that column has the same values in it as the actual index. Clarifying those sorts of things at import can help prevent a lot of hassle later on, so I'd recommend taking a good look at your data source to see if you can create a clearer dataframe to start with!
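As a minimal sketch of the second solution end to end, the example below rebuilds a small frame shaped like the one in the question (the literal values are copied from the question's table, and the variable name raw is made up), promotes the first data row to column names, and then inserts a new column between Something and Something2.

```python
import numpy as np
import pandas as pd

# A frame like the one in the question: the names sit in data row 0,
# while the real column labels are the integers 0, 1, 2.
raw = pd.DataFrame([["Index", "Something", "Something2"],
                    [1, 5, 8],
                    [2, 6, 9],
                    [3, 7, 10]])

# Promote row 0 to column names and drop it from the data.
df = raw.iloc[1:].reset_index(drop=True)
df.columns = raw.iloc[0].tolist()

# Position 2 places the new column between "Something" and "Something2".
df.insert(2, "% Difference 1", np.nan)
print(df.columns.tolist())
```

Once the labels are real column names, df.insert puts the new name in the header row, next to the others, rather than above a row of data.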
I want to append some columns in between those "Something" column names
No, there are no columns named Something. To get them, you need to set the first row of data as the column names:
print (df.columns)
Int64Index([0, 1, 2], dtype='int64')
print (df.iloc[0].tolist())
['Index', 'Something', 'Something2']
df.columns = df.iloc[0].tolist()
df = df.iloc[1:].reset_index(drop=True)
print (df)
Index Something Something2
0 1 5 8
1 2 6 9
2 3 7 10
print (df.columns)
Index(['Index', 'Something', 'Something2'], dtype='object')
Then your solution creates the Difference columns, but the output is different from what you expected: there are no columns 0, 1, 2, 3.
I am doing this:
df.drop(['id','Unnamed: 0'],axis=1,inplace=True)
My syntax is correct but it is not working.
The "Unnamed: 0" column usually happens when you read a file which contains the row's indexes, however, the corresponding column name is empty. Suppose you have the following ".csv" file:
,id,A,B
0,1,2,3
1,4,5,6
2,7,8,9
Note that the above file starts the first line with a comma, which will generate the undesired "Unnamed: 0" column. To read from the file and build your DataFrame without this issue, try setting index_col=0:
import pandas as pd
df = pd.read_csv('my_file.csv', index_col=0)
print(df)
The result will be:
id A B
0 1 2 3
1 4 5 6
2 7 8 9
Now let's drop the 'id' column:
df.drop(['id'], axis=1, inplace=True)
print(df)
and the result is:
A B
0 2 3
1 5 6
2 8 9
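The following sketch reproduces the situation with in-memory text so it can be run without a file. The variable names are made up for illustration, and errors="ignore" is used so that drop does not raise if one of the listed columns happens to be absent.

```python
import io

import pandas as pd

# Same content as the example file above, held in a string.
csv_text = ",id,A,B\n0,1,2,3\n1,4,5,6\n2,7,8,9\n"

# Without index_col, the unlabeled first column becomes "Unnamed: 0".
plain = pd.read_csv(io.StringIO(csv_text))
print(plain.columns.tolist())

# With index_col=0, that column is used as the index instead.
indexed = pd.read_csv(io.StringIO(csv_text), index_col=0)
print(indexed.columns.tolist())

# Dropping by name; errors="ignore" skips any label that is not present.
cleaned = plain.drop(columns=["id", "Unnamed: 0"], errors="ignore")
print(cleaned.columns.tolist())
```

Printing df.columns.tolist() before calling drop is a quick way to check whether the labels really match what is being dropped.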
I have a dataframe converted from a tab-separated text file, but the first label is an extra, unnecessary label.
a b c
0 1 2 NaN
1 2 3 NaN
The label a is an extra one. The dataframe should be:
b c
0 1 2
1 2 3
How to remove a? Thanks in advance.
You can skip the first header row with the skiprows parameter and then supply new column names with the names parameter. The number of names must match the number of fields in the data rows:
df = pd.read_csv(file, skiprows=1, names=['b','c'])
print (df)
b c
0 1 2
1 2 3
Or, more dynamically, read only the header with nrows=0 to get the column names, then pass them to the names parameter with the first value removed by indexing:
names = pd.read_csv(file, nrows=0).columns
df = pd.read_csv(file, skiprows=1, names=names[1:])
Another idea is to use the default columns, a RangeIndex:
df = pd.read_csv(file, skiprows=1, header=None)
print (df)
0 1
0 1 2
1 2 3
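Here is a self-contained sketch of the dynamic variant. It assumes the file is tab-separated as the question says, so sep="\t" is passed explicitly, and it uses an in-memory string (the variable text is made up) in place of the real file.

```python
import io

import pandas as pd

# Tab-separated text with one label too many in the header row.
text = "a\tb\tc\n1\t2\n2\t3\n"

# Read only the header row to get the labels, then drop the first one.
names = pd.read_csv(io.StringIO(text), sep="\t", nrows=0).columns
wanted = list(names[1:])

# Skip the original header and apply the trimmed labels.
df = pd.read_csv(io.StringIO(text), sep="\t", skiprows=1, names=wanted)
print(df)
```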
Say I have the following Excel file:
A B C
0 - - -
1 Start - -
2 3 2 4
3 7 8 4
4 11 2 17
I want to read the file in a dataframe making sure that I start to read it below the row where the Start value is.
Attention: the Start value is not always located in the same row, so if I were to use:
import pandas as pd
xls = pd.ExcelFile(r'C:\Users\MyFolder\MyFile.xlsx')  # raw string, so \U is not read as an escape
df = xls.parse('Sheet1', skiprows=4, index_col=None)
this would fail as skiprows needs to be fixed. Is there any workaround to make sure that xls.parse finds the string value instead of the row number?
df = pd.read_excel('your/path/filename')
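The loop below finds the row containing 'Start' in the df (note the capital S, matching the cell value in the file):
row_start = None
for row in range(df.shape[0]):
    for col in range(df.shape[1]):
        if df.iat[row, col] == 'Start':
            row_start = row
            break
    if row_start is not None:
        break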
Once you have row_start, you can take a slice of the DataFrame:
df_required = df.loc[row_start:]
And if you don't need the row containing 'Start', just increment row_start by 1:
df_required = df.loc[row_start+1:]
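The same search can also be written without explicit loops. The sketch below is one possible vectorized version, run on a small in-memory frame that mimics the sheet in the question rather than on a real Excel file, and it assumes the default integer index that read_excel produces.

```python
import pandas as pd

# Stand-in for the frame read from Excel (values copied from the question).
df = pd.DataFrame({"A": ["-", "Start", 3, 7, 11],
                   "B": ["-", "-", 2, 8, 2],
                   "C": ["-", "-", 4, 4, 17]})

# True for every row in which any cell equals the marker.
mask = df.eq("Start").any(axis=1)
if not mask.any():
    raise ValueError("marker 'Start' not found")

# idxmax returns the label of the first True value.
row_start = mask.idxmax()

# Keep only the rows below the marker and renumber them from 0.
df_required = df.loc[row_start + 1:].reset_index(drop=True)
print(df_required)
```

The explicit check before idxmax matters because idxmax on an all-False mask would still return the first label.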
If you know the specific rows you are interested in, you can skip rows from the top using skiprows and then parse only the row (or rows) you want using nrows; see pandas.read_excel
df = pd.read_excel('myfile.xlsx', 'Sheet1', skiprows=2, nrows=3)
You could use pd.read_excel(r'C:\Users\MyFolder\MyFile.xlsx', sheet_name='Sheet1') as it ignores empty Excel cells.
Your DataFrame should then look like this:
A B C
0 Start NaN NaN
1 3 2 4
2 7 8 4
3 11 2 17
Then drop the first row by using
df = df.drop([0]).reset_index(drop=True)
to get
A B C
0 3 2 4
1 7 8 4
2 11 2 17
I need to output only a particular row from a pandas dataframe to a CSV file. In other words, the output needs to have only the data in row X, in a single line separated by commas, and nothing else. The problem I am running into with to_csv is that I cannot find a way to write just the data; I always get an extra line with a column count.
data.to_csv(filename, index=False)
gives
0,1,2,3,4,5
X,Y,Z,A,B,C
The first line is just a column count and is part of the dataframe, not the data. I need just the data. Is there any way to do this simply, or do I need to break out of pandas and manipulate the data further in python?
Note: the preceding example has only 1 row of data, but it would be nice to have the syntax for choosing row too.
You can try this:
df = pd.DataFrame({'A': ['a','b','c','d','e','f'], 'B': [1,2,3,4,5,6]})
A B
0 a 1
1 b 2
2 c 3
3 d 4
4 e 5
5 f 6
You can select the row you want, in this case, I select the row at index 1:
df.iloc[1:2].to_csv('test.csv', index=False, header=False)
The output in the csv file looks like this (make sure you use header=False):
b,2
You can use this
data.to_csv(filename, index=False, header=False)
The header parameter means:
header : boolean or list of string, default True
Write out column names. If a list of string is given it is assumed to be aliases for the column names
You can find more specific info in pandas.DataFrame.to_csv
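To check the result without creating a file, to_csv can be called with no path, in which case it returns the CSV text as a string. The sketch below wraps that in a small helper; the name row_as_csv_line is made up for illustration, and the sample frame is the one from the earlier answer.

```python
import pandas as pd


def row_as_csv_line(df, position):
    """Hypothetical helper: one row, chosen by position, as a bare CSV line."""
    # iloc with a list keeps the result a DataFrame, so the row stays on one line.
    text = df.iloc[[position]].to_csv(index=False, header=False)
    return text.strip()


df = pd.DataFrame({"A": ["a", "b", "c", "d", "e", "f"],
                   "B": [1, 2, 3, 4, 5, 6]})

line = row_as_csv_line(df, 1)
print(line)
```

Passing a filename instead of leaving the path empty writes the same single line to disk.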
It seems like you want to filter rows from the existing dataframe and write them to a .csv file.
For that, filter your data first and then call to_csv on the result.
Here is the filtering command:
df[df.index.isin([3,4])]
If this is your data:
>>> df
A B
0 X 1
1 Y 2
2 Z 3
3 A 4
4 B 5
5 C 6
then this is the filtered content, and you can call to_csv on it:
>>> df[df.index.isin([3,4])]
A B
3 A 4
4 B 5