Pandas CSV output only the data in a certain row (to_csv) - python

I need to output only a particular row from a pandas dataframe to a CSV file. In other words, the output needs to have only the data in row X, in a single line separated by commas, and nothing else. The problem I am running into with to_csv is that I cannot find a way to write just the data; I always receive an extra first line with the default integer column labels.
data.to_csv(filename, index=False)
gives
0,1,2,3,4,5
X,Y,Z,A,B,C
The first line is just the default integer column labels; it is part of the dataframe, not the data. I need just the data. Is there any way to do this simply, or do I need to break out of pandas and manipulate the data further in Python?
Note: the preceding example has only one row of data, but it would be nice to have the syntax for choosing a particular row too.

You can try this:
df = pd.DataFrame({'A': ['a','b','c','d','e','f'], 'B': [1,2,3,4,5,6]})
   A  B
0  a  1
1  b  2
2  c  3
3  d  4
4  e  5
5  f  6
You can select the row you want, in this case, I select the row at index 1:
df.iloc[1:2].to_csv('test.csv', index=False, header=False)
The output to the csv file looks like this (make sure you use header=False):
b,2
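For an arbitrary row, iloc[[x]] (double brackets) likewise keeps the result a one-row DataFrame, so to_csv still writes a single line. A minimal sketch, where x = 3 is just an example position:
x = 3  # zero-based position of the row to export
df.iloc[[x]].to_csv('test.csv', index=False, header=False)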

You can use this:
data.to_csv(filename, index=False, header=False)
The header parameter means:
header : boolean or list of string, default True
    Write out column names. If a list of string is given it is assumed to be aliases for the column names
You can find more specific info in the pandas.DataFrame.to_csv documentation.
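As a minimal end-to-end demo (the DataFrame below is reconstructed from the question's sample output, and 'out.csv' is a placeholder filename):
import pandas as pd

# Reconstructed from the question's sample: one row, default integer columns
data = pd.DataFrame([['X', 'Y', 'Z', 'A', 'B', 'C']])
data.to_csv('out.csv', index=False, header=False)
# out.csv now contains exactly one line: X,Y,Z,A,B,C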

It seems like you are looking to filter data from the existing dataframe and write it to a .csv file.
For that, you need to filter your data first, then apply the to_csv command.
Here is the command:
df[df.index.isin([3,4])]
If this is your data:
>>> df
   A  B
0  X  1
1  Y  2
2  Z  3
3  A  4
4  B  5
5  C  6
This would be your expected filtered content; you can then apply to_csv on top of it:
>>> df[df.index.isin([3,4])]
   A  B
3  A  4
4  B  5
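Putting it together, a sketch of the combined filter-and-write step ('out.csv' is a placeholder filename):
# Filter rows with index 3 and 4, then write only the data
df[df.index.isin([3, 4])].to_csv('out.csv', index=False, header=False)
# out.csv then contains:
# A,4
# B,5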

Related

Columns getting appended to wrong row in pandas

So I have a dataframe like this:
       0          1           2  ...
0  Index  Something  Something2  ...
1      1          5           8  ...
2      2          6           9  ...
3      3          7          10  ...
Now, I want to append some columns in between those "Something" column names, for which I have used this code:
j = 1
for i in range(2, 51):
    if i % 2 != 0 and i != 4:
        df.insert(i, f"% Difference {j}", " ")
        j += 1
where df is the dataframe. Now what happens is that the columns do get inserted, but like this:
       0          1  Difference 1           2  ...
0  Index  Something           NaN  Something2  ...
1      1          5           NaN           8  ...
2      2          6           NaN           9  ...
3      3          7           NaN          10  ...
whereas what I wanted was this:
       0          1             2           3  ...
0  Index  Something  Difference 1  Something2  ...
1      1          5           NaN           8  ...
2      2          6           NaN           9  ...
3      3          7           NaN          10  ...
Edit 1: Using jezrael's logic:
df.columns = df.iloc[0].tolist()
df = df.iloc[1:].reset_index(drop = True)
print(df)
The output of that is still this:
       0          1           2  ...
0  Index  Something  Something2  ...
1      1          5           8  ...
2      2          6           9  ...
3      3          7          10  ...
Any ideas or suggestions as to where or how I am going wrong?
If your dataframe looks like what you've shown in your first code block, your column names aren't Index, Something, etc. - they're actually 0, 1, etc.
Pandas is seeing Index, Something, etc. as data in row 0, NOT as column names (which exist above row 0). So when you add a column with the name Difference 1, you're adding a column above row 0, which is where the range of integers is located.
A couple potential solutions to this:
If you'd like the actual column names to be Index, Something, etc., then the best solution is to import the data with that row as the headers (see the sketch after this list). What is the source of your data? If it's a csv, make sure NOT to use the header=None option. If it's from somewhere else, there is likely an option to pass in a list of the column names to use. I can't think of any reason why you'd want a range of integer values as your column names rather than the more descriptive names that you have listed.
Alternatively, you can do what @jezrael suggested and convert your first row of data to column names, then delete that data row. I'm not sure why their solution isn't working for you, since the code seems to work fine in my testing. Here's what it's doing:
df.columns = df.iloc[0].tolist()
df.columns tells pandas what to (re)name the columns of the dataframe. df.iloc[0].tolist() creates a list out of the first row of data, which in your case is the column names that you actually want.
df = df.iloc[1:].reset_index(drop = True)
This grabs the 2nd through last rows of data to recreate the dataframe. So you have new column names based on the first row, then you recreate the dataframe starting at the second row. The .reset_index(drop = True) isn't totally necessary to include. That just restarts your actual data rows with an index value of 0 rather than 1.
If for some reason you want to keep the column names as they currently exist (as integers rather than labels), you could do something like the following under the if statement in your for loop:
df.insert(i, i, np.nan, allow_duplicates = True)
df.iat[0, i] = f"%Difference {j}"
df.columns = np.arange(len(df.columns))
The first line inserts a column with an integer label, filled with NaN values to start with (assuming you have numpy imported as np). You need to allow duplicates, otherwise you'll get an error since the integer value will be the name of a pre-existing column.
The second line changes the value in the 1st row of the newly-created column to what you want.
The third line resets the column names to be a range of integers like you had to start with.
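To illustrate the first solution in the list above, a minimal sketch ('data.csv' is a placeholder filename):
import pandas as pd

# Default behavior: the first line of the file becomes the column names
df = pd.read_csv('data.csv')

# What NOT to do here: header=None demotes the real names into data row 0
# and leaves you with integer column labels 0, 1, 2, ... instead
df_bad = pd.read_csv('data.csv', header=None)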
As @jezrael suggested, it seems like you might be a little unclear about the difference between column names, indices, and data rows and columns. An index is its own thing, so it's not usually necessary to have a column named Index like you have in your dataframe, especially since that column has the same values in it as the actual index. Clarifying those sorts of things at import can help prevent a lot of hassle later on, so I'd recommend taking a good look at your data source to see if you can create a clearer dataframe to start with!
I want to append some columns in between those "Something" column names
No, there are no columns named Something; for that, you need to set the first row of data as the column names:
print (df.columns)
Int64Index([0, 1, 2], dtype='int64')
print (df.iloc[0].tolist())
['Index', 'Something', 'Something2']
df.columns = df.iloc[0].tolist()
df = df.iloc[1:].reset_index(drop=True)
print (df)
  Index Something Something2
0     1         5          8
1     2         6          9
2     3         7         10
print (df.columns)
Index(['Index', 'Something', 'Something2'], dtype='object')
Then your solution creates the Difference columns, but the output is different: there are no columns 0,1,2,3.
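A sketch of the combined fix, reusing the question's own loop bounds: first promote the first data row to column names, then run the insert loop, so the new names land in the header alongside the Something names:
# Promote the first row to column names (jezrael's fix)
df.columns = df.iloc[0].tolist()
df = df.iloc[1:].reset_index(drop=True)

# Original loop from the question, now inserting into the real header
j = 1
for i in range(2, 51):
    if i % 2 != 0 and i != 4:
        df.insert(i, f"% Difference {j}", " ")
        j += 1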

Is there a way to create an index column for a pandas df that is not one of my data columns? [duplicate]

I am going through the 'Python for Data Analysis' book and having trouble in the 'Example: 2012 Federal Election Commission Database' section, reading the data into a DataFrame. The trouble is that one of the columns of data is always being set as the index column, even when the index_col argument is set to None.
Here is the link to the data : http://www.fec.gov/disclosurep/PDownload.do.
Here is the loading code (to save time checking, I set nrows=10):
import pandas as pd
fec = pd.read_csv('P00000001-ALL.csv',nrows=10,index_col=None)
To keep it short I am excluding the data column outputs, but here is my output (please note the Index values):
In [20]: fec
Out[20]:
<class 'pandas.core.frame.DataFrame'>
Index: 10 entries, C00410118 to C00410118
Data columns:
...
dtypes: float64(4), int64(3), object(11)
And here is the book's output (again with data columns excluded):
In [13]: fec = read_csv('P00000001-ALL.csv')
In [14]: fec
Out[14]:
<class 'pandas.core.frame.DataFrame'>
Int64Index: 1001731 entries, 0 to 1001730
...
dtypes: float64(1), int64(1), object(14)
The Index values in my output are actually the first column of data in the file, which then shifts all the rest of the data to the left by one. Would anyone know how to prevent this column of data from being listed as the index? I would like the index to just be +1 increasing integers.
I am fairly new to python and pandas, so I apologize for any inconvenience. Thanks.
Quick Answer
Use index_col=False instead of index_col=None when you have delimiters at the end of each line to turn off index column inference and discard the last column.
More Detail
After looking at the data, there is a comma at the end of each line. And this quote (the documentation has been edited since the time this post was created):
index_col: column number, column name, or list of column numbers/names, to use as the index (row labels) of the resulting DataFrame. By default, it will number the rows without using any column, unless there is one more data column than there are headers, in which case the first column is taken as the index.
from the documentation shows that pandas believes you have n headers and n+1 data columns and is treating the first column as the index.
EDIT 10/20/2014 - More information
I found another valuable entry that is specifically about trailing delimiters and how to simply ignore them:
If a file has one more column of data than the number of column names, the first column will be used as the DataFrame’s row names: ...
Ordinarily, you can achieve this behavior using the index_col option.
There are some exception cases when a file has been prepared with delimiters at the end of each data line, confusing the parser. To explicitly disable the index column inference and discard the last column, pass index_col=False: ...
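A minimal demo of the trailing-delimiter case, using hypothetical inline data:
import io
import pandas as pd

# Each data line ends with an extra comma, so there is one more
# data column than there are headers
csv = "a,b,c\n1,2,3,\n4,5,6,\n"

# Default inference: the first column becomes the index
print(pd.read_csv(io.StringIO(csv)))

# index_col=False disables index inference and discards the extra column
print(pd.read_csv(io.StringIO(csv), index_col=False))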
Re: craigts's response, for anyone having trouble with using either False or None parameters for index_col, such as in cases where you're trying to get rid of a range index, you can instead use an integer to specify the column you want to use as the index. For example:
df = pd.read_csv('file.csv', index_col=0)
The above will set the first column as the index (and not add a range index in my "common case").
Update
Given the popularity of this answer, I thought I'd add some context / a demo:
# Setting up the dummy data
In [1]: df = pd.DataFrame({"A":[1, 2, 3], "B":[4, 5, 6]})
In [2]: df
Out[2]:
   A  B
0  1  4
1  2  5
2  3  6
In [3]: df.to_csv('file.csv', index=None)
File[3]:
A,B
1,4
2,5
3,6
Reading without index_col or with None/False will all result in a range index:
In [4]: pd.read_csv('file.csv')
Out[4]:
   A  B
0  1  4
1  2  5
2  3  6
# Note that this is the default behavior, so the same as In [4]
In [5]: pd.read_csv('file.csv', index_col=None)
Out[5]:
   A  B
0  1  4
1  2  5
2  3  6
In [6]: pd.read_csv('file.csv', index_col=False)
Out[6]:
   A  B
0  1  4
1  2  5
2  3  6
However, if we specify that "A" (the 0th column) is actually the index, we can avoid the range index:
In [7]: pd.read_csv('file.csv', index_col=0)
Out[7]:
   B
A
1  4
2  5
3  6
If pandas is treating your first row as a header, you can use header=None, as such:
df = pd.read_csv("csv-file.csv", header=None)
This way pandas will treat your first row like any other row.

Can pandas auto recognize if header is present

pandas beginner here,
I read that pandas.read_csv automatically assumes that the first row is a header row, and if this is not the case, I should pass a flag, header=None.
Now I have a code which loads CSVs which sometimes have headers and sometimes not... Is there a way or a flag to read_csv to try and automatically detect a header row?
If a column (or several) has numbers in all rows except the first - then it's a header row, otherwise no headers.
Ok, so quick (and probably fragile) idea:
import pandas as pd

df = pd.DataFrame(columns=["ints_only", "strings_only"],
                  data=[[1, "a"], [3, "b"]])
df.to_csv("header.csv")
df.to_csv("noheader.csv", header=None)

def has_header(file, nrows=20):
    df = pd.read_csv(file, header=None, nrows=nrows)
    df_header = pd.read_csv(file, nrows=nrows)
    return tuple(df.dtypes) != tuple(df_header.dtypes)

has_header("header.csv")    # gives True
has_header("noheader.csv")  # gives False
What's happening here?
We read the first nrows (default 20) lines of the csv file, once with a header and once without. Then we look at what datatypes pandas assigns to each column. If the datatypes don't change when ignoring the first row, then there is no header. (That of course only works if there is at least one column whose header is a string while all its other entries are of some other, non-string datatype, e.g. all floats.)
When the dataframe has no header, its DataFrame.columns property employs numerical indexes. Otherwise, it uses strings. So, just check the type of the first column label.
import pandas as pd
import io

def has_header(df):
    return isinstance(df.columns[0], str)

csv = u"""col1,col2,col3
5,2,7
4,9,6
7,3,1"""
df1 = pd.read_csv(io.StringIO(csv))
print(df1.head())
if has_header(df1):
    print("Dataframe 1 has header")
else:
    print("Dataframe 1 doesn't have header")

csv = u"""5,2,7
4,9,6
7,3,1"""
df2 = pd.read_csv(io.StringIO(csv), header=None)
print(df2.head())
if has_header(df2):
    print("Dataframe 2 has header")
else:
    print("Dataframe 2 doesn't have header")

df3 = pd.read_csv(io.StringIO(csv))
print(df3.head())
if has_header(df3):
    print("Dataframe 3 has header")
else:
    print("Dataframe 3 doesn't have header")

df4 = pd.read_csv(io.StringIO(csv), header='infer')
print(df4.head())
if has_header(df4):
    print("Dataframe 4 has header")
else:
    print("Dataframe 4 doesn't have header")
Here is the output produced by the above code.
   col1  col2  col3
0     5     2     7
1     4     9     6
2     7     3     1
Dataframe 1 has header
   0  1  2
0  5  2  7
1  4  9  6
2  7  3  1
Dataframe 2 doesn't have header
   5  2  7
0  4  9  6
1  7  3  1
Dataframe 3 has header
   5  2  7
0  4  9  6
1  7  3  1
Dataframe 4 has header
Please note that when using pd.read_csv to create your DataFrame from a headerless file, you have to explicitly set header=None. Otherwise, the column names are inferred from the first line of the file (see pandas.read_csv).
You may use str and contains:
df['column_name'].str.contains('text_you_are_expecting_in_header')
This would return a True/False based on whether the column entries contain what you are looking for.
Thereafter, you may read off the first entry (for your header row), and if it matches the text you expect in your header, then you have a header, else you don't have a header.
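A hedged sketch of that approach ('data.csv' and the search text 'col' are placeholders):
import pandas as pd

# Read just the first line without assuming a header
first_row = pd.read_csv('data.csv', header=None, nrows=1)

# Does any cell of the first row contain the text expected in the header?
looks_like_header = first_row.iloc[0].astype(str).str.contains('col').any()

# Re-read the file accordingly
df = pd.read_csv('data.csv', header=0 if looks_like_header else None)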

Python Pandas Dynamically Read Excel Sheet with Multiple Header Rows of Different Column Size

I have an excel sheet that I am trying to read in as a dataframe. The sheet has multiple header rows that can each have a varying number of columns. Some of the columns are similar, but not always. Is there a way I can split the rows into separate dataframes?
The data for example would be:
A B C D
1 1 1 1
2 2 2 2

A B C D E
1 1 1 1 1
2 2 2 2 2
3 3 3 3 3

A B C
1 1 1
The ideal output would be three separate dataframes, each with its respective rows and column headers.
.read_excel has header, skiprows and skipfooter arguments that let you do this, provided that you can detect or know ahead of time what row each header is on. With these and usecols you can define any "window" on the sheet as your df. Combining multiple windows can then be accomplished with concat, merge, append, and join, per usual.
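For the sample layout above, a sketch assuming the header rows sit at 0-indexed row offsets 0, 3, and 7, and that the data lives in 'book.xlsx' (both are assumptions to adjust for your sheet):
import pandas as pd

# First block: header row 0, two data rows, columns A:D
df1 = pd.read_excel('book.xlsx', header=0, nrows=2, usecols='A:D')

# Second block: header row 3, three data rows, columns A:E
df2 = pd.read_excel('book.xlsx', header=3, nrows=3, usecols='A:E')

# Third block: header row 7, one data row, columns A:C
df3 = pd.read_excel('book.xlsx', header=7, nrows=1, usecols='A:C')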

Pandas automatically converts row to column

I have a very simple dataframe like so:
In [8]: df
Out[8]:
   A  B  C
0  2  a  a
1  3  s  3
2  4  c  !
3  1  f  1
My goal is to extract the first row so that it looks like this:
   A  B  C
0  2  a  a
As you can see the dataframe shape (1x3) is preserved and the first row still has 3 columns.
However, when I type the following command, df.loc[0], the output is this:
df.loc[0]
Out[9]:
A    2
B    a
C    a
Name: 0, dtype: object
As you can see, the row has turned into a column with 3 rows (3x1 instead of 1x3)! How is this possible? How can I simply extract the row and preserve its shape as described in my goal? Could you provide a smart and elegant way to do it?
I tried to use the transpose command .T but without success... I know I could create another dataframe where the columns are extracted from the original dataframe, but this is quite tedious and not elegant, I would say (pd.DataFrame({'A':[2], 'B':'a', 'C':'a'})).
Here is the dataframe if you need it:
import pandas as pd
df = pd.DataFrame({'A':[2,3,4,1], 'B':['a','s','c','f'], 'C':['a', 3, '!', 1]})
You need to add [] to get a DataFrame:
#select by index value
print (df.loc[[0]])
   A  B  C
0  2  a  a
Or:
print (df.iloc[[0]])
   A  B  C
0  2  a  a
If you need to transpose the Series, first convert it to a DataFrame with to_frame:
print (df.loc[0].to_frame())
   0
A  2
B  a
C  a
print (df.loc[0].to_frame().T)
   A  B  C
0  2  a  a
Using a range selector will preserve the DataFrame format.
df.iloc[0:1]
Out[221]:
   A  B  C
0  2  a  a
