I'm creating a Pandas DataFrame to store data. Unfortunately, I can't know the number of rows of data that I'll have ahead of time. So my approach has been the following.
First, I declare an empty DataFrame.
df = DataFrame(columns=['col1', 'col2'])
Then, I append a row of missing values.
df = df.append([None] * 2, ignore_index=True)
Finally, I can insert values into this DataFrame one cell at a time. (Why I have to do this one cell at a time is a long story.)
df['col1'][0] = 3.28
This approach works perfectly fine, except that the append statement inserts an additional column into my DataFrame. At the end of the process, the output I see when I type df looks like this (with 100 rows of data).
<class 'pandas.core.frame.DataFrame'>
Data columns (total 3 columns):
0       0 non-null values
col1    100 non-null values
col2    100 non-null values
df.head() looks like this.
      0  col1  col2
0  None  3.28     1
1  None     1     0
2  None     1     0
3  None     1     0
4  None     1     1
Any thoughts on what is causing this 0 column to appear in my DataFrame?
The append is trying to add a column to your DataFrame, not a row: the data you pass in is unnamed and holds two None/NaN elements, so pandas gives it the default column name 0.
For the append to work as intended, the column names of the incoming data must match the current DataFrame's column names; otherwise new columns will be created (by default).
# you need to explicitly name the columns of the incoming row in the append call
from pandas import DataFrame, Series
from numpy import arange
import numpy as np

df = DataFrame(columns=['col1', 'col2'])
print(df.append(Series([None] * 2, index=['col1', 'col2']), ignore_index=True))

# as an aside
df = DataFrame(np.random.randn(8, 4), columns=['A', 'B', 'C', 'D'])
dfRowImproper = [1, 2, 3, 4]
# dfRowProper = DataFrame(arange(4) + 1, columns=['A','B','C','D'])  # will not work: arange returns a 1-D vector, whereas DataFrame expects a 2-D array
dfRowProper = DataFrame([arange(4) + 1], columns=['A', 'B', 'C', 'D'])  # will work

print(df.append(dfRowImproper))   # creates the unwanted column named 0, holding the four new rows' values
print(df.append(dfRowProper))     # works as you would like, since the column names are consistent
print(df.append(DataFrame(np.random.randn(1, 4))))  # adds four new integer-named columns along with one additional row
print(df.append(Series(dfRowImproper, index=['A', 'B', 'C', 'D']), ignore_index=True))  # works as you want
You could use a Series for row insertion, naming its index after your columns so the values line up:
df = pd.DataFrame(columns=['col1', 'col2'])
df = df.append(pd.Series([None] * 2, index=['col1', 'col2']), ignore_index=True)
df.loc[0, "col1"] = 3.28
df looks like:
   col1 col2
0  3.28  NaN
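As an aside for anyone reading this on a recent install: DataFrame.append was deprecated in pandas 1.4 and removed in 2.0, so on modern pandas the same row insertion has to go through pd.concat. A minimal sketch of the equivalent (the same column-name rule applies):
import pandas as pd

df = pd.DataFrame(columns=['col1', 'col2'])
row = pd.DataFrame([[None, None]], columns=['col1', 'col2'])  # column names must match df's
df = pd.concat([df, row], ignore_index=True)
df.loc[0, 'col1'] = 3.28
print(df)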
Hi, I've run into a case where I need to subtract all the column values between two PySpark DataFrames, like this:
df1:
col1 col2 ... col100
1 2 ... 100
df2:
col1 col2 ... col100
5 4 ... 20
And I want to get the final dataframe with df1 - df2 :
new df:
col1 col2 ... col100
-4 -2 ... 80
I found that a possible solution is to subtract two columns, like this:
new_df = df1.withColumn('col1', df1['col1'] - df2['col1'])
But I have 101 columns; how can I simply traverse them all and avoid writing 101 copies of the same logic?
Any answers are much appreciated!
You can create a for loop to iterate over the columns and create new columns in the dataframe with the subtracted values. Here's one way to do it in PySpark:
columns = df1.columns
for col in columns:
    df1 = df1.withColumn(col, df1[col] - df2[col])
This will create a new dataframe with the subtracted values for each column.
Edit (to address @Kay's comments):
The error you're encountering is due to a duplicate column name in the output DataFrame. You can resolve this by using a different name for the new columns. Try using the alias method together with withColumn:
columns = df1.columns
for col in columns:
    df1 = df1.withColumn(col + "_diff", df1[col] - df2[col]).alias(col)
That way you will add a suffix "_diff" to the new columns in the output dataframe to avoid the duplicate column name issue.
Within a single select, with a Python list comprehension:
columns = df1.columns
df1 = df1.select(*[(df1[col] - df2[col]).alias(col) for col in columns])
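Note that all of the above assume df1 and df2 can be referenced in the same expression, which Spark only allows when the two DataFrames share lineage; two independently loaded DataFrames usually need an explicit join to align the rows first. A minimal sketch under that assumption (the row_id key is hypothetical, added only for alignment, and monotonically_increasing_id only lines up reliably when both frames are partitioned identically):
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.getOrCreate()

df1 = spark.createDataFrame([(1, 2, 100)], ["col1", "col2", "col100"])
df2 = spark.createDataFrame([(5, 4, 20)], ["col1", "col2", "col100"])

# hypothetical row_id, used only to line the two frames up row by row
d1 = df1.withColumn("row_id", F.monotonically_increasing_id())
d2 = df2.withColumn("row_id", F.monotonically_increasing_id())

joined = d1.join(d2, "row_id")
diff = joined.select(*[(d1[c] - d2[c]).alias(c) for c in df1.columns])
diff.show()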
As the title says, I would like to find a way to drop (erase) the values in a DataFrame from a given column to the end of the DataFrame, but I can't find a way to do so.
I would like to start with
A B C
-----------
1 1 1
1 1 1
1 1 1
and get
A B C
-----------
1
1
1
I was trying with
df.drop(df.loc[:, 'B':].columns, axis = 1, inplace = True)
But this deletes the columns themselves too:
A
-
1
1
1
Am I missing something?
If you only know the column name that you want to keep:
import pandas as pd
new_df = pd.DataFrame(df["A"])
If you only know the column names that you want to drop:
new_df = df.drop(["B", "C"], axis=1)
For your case, to keep the columns, but remove the content, one possible way is:
new_df = pd.DataFrame(df["A"], columns=df.columns)
The resulting df contains all the columns, but "B" and "C" are left without values (NaN instead).
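If you'd rather keep the original object and just blank the values out, one more possible way (a sketch, assuming NaN is an acceptable stand-in for "empty") is to assign over a label slice of the columns in place:
import numpy as np

df.loc[:, 'B':] = np.nan  # keeps columns B and C, clears every value in them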
I am going through the 'Python for Data Analysis' book and having trouble in the 'Example: 2012 Federal Election Commission Database' section with reading the data into a DataFrame. The trouble is that one of the columns of data is always being set as the index column, even when the index_col argument is set to None.
Here is the link to the data : http://www.fec.gov/disclosurep/PDownload.do.
Here is the loading code (to save time when checking, I set nrows=10):
import pandas as pd
fec = pd.read_csv('P00000001-ALL.csv',nrows=10,index_col=None)
To keep it short I am excluding the data column outputs, but here is my output (please note the Index values):
In [20]: fec
Out[20]:
<class 'pandas.core.frame.DataFrame'>
Index: 10 entries, C00410118 to C00410118
Data columns:
...
dtypes: float64(4), int64(3), object(11)
And here is the book's output (again with data columns excluded):
In [13]: fec = read_csv('P00000001-ALL.csv')
In [14]: fec
Out[14]:
<class 'pandas.core.frame.DataFrame'>
Int64Index: 1001731 entries, 0 to 1001730
...
dtypes: float64(1), int64(1), object(14)
The Index values in my output are actually the first column of data in the file, which is then moving all the rest of the data to the left by one. Would anyone know how to prevent this column of data from being used as the index? I would like the index to just be the default increasing integers.
I am fairly new to python and pandas, so I apologize for any inconvenience. Thanks.
Quick Answer
Use index_col=False instead of index_col=None when you have delimiters at the end of each line to turn off index column inference and discard the last column.
More Detail
After looking at the data, there is a comma at the end of each line. And this quote from the documentation (which has been edited since this post was created):
index_col: column number, column name, or list of column numbers/names, to use as the index (row labels) of the resulting DataFrame. By default, it will number the rows without using any column, unless there is one more data column than there are headers, in which case the first column is taken as the index.
shows that pandas believes you have n headers and n+1 data columns, and is therefore treating the first column as the index.
EDIT 10/20/2014 - More information
I found another valuable entry that is specifically about trailing delimiters and how to simply ignore them:
If a file has one more column of data than the number of column names, the first column will be used as the DataFrame’s row names: ...
Ordinarily, you can achieve this behavior using the index_col option.
There are some exception cases when a file has been prepared with delimiters at the end of each data line, confusing the parser. To explicitly disable the index column inference and discard the last column, pass index_col=False: ...
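To illustrate, here is a minimal sketch (an inline string stands in for the real file) showing both behaviors; the trailing comma produces one more data field than there are headers:
import io
import pandas as pd

data = "a,b,c\n1,2,3,\n4,5,6,\n"  # note the trailing comma on each data line

print(pd.read_csv(io.StringIO(data)))                   # column 'a' is pulled in as the index
print(pd.read_csv(io.StringIO(data), index_col=False))  # inference off: plain 0..n-1 index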
Re: craigts's response, for anyone having trouble with using either False or None parameters for index_col, such as in cases where you're trying to get rid of a range index, you can instead use an integer to specify the column you want to use as the index. For example:
df = pd.read_csv('file.csv', index_col=0)
The above will set the first column as the index (and not add a range index in my "common case").
Update
Given the popularity of this answer, I thought I'd add some context and a demo:
# Setting up the dummy data
In [1]: df = pd.DataFrame({"A":[1, 2, 3], "B":[4, 5, 6]})
In [2]: df
Out[2]:
A B
0 1 4
1 2 5
2 3 6
In [3]: df.to_csv('file.csv', index=False)
file.csv now contains:
A,B
1,4
2,5
3,6
Reading without index_col or with None/False will all result in a range index:
In [4]: pd.read_csv('file.csv')
Out[4]:
A B
0 1 4
1 2 5
2 3 6
# Note that this is the default behavior, so the same as In [4]
In [5]: pd.read_csv('file.csv', index_col=None)
Out[5]:
A B
0 1 4
1 2 5
2 3 6
In [6]: pd.read_csv('file.csv', index_col=False)
Out[6]:
A B
0 1 4
1 2 5
2 3 6
However, if we specify that "A" (the 0th column) is actually the index, we can avoid the range index:
In [7]: pd.read_csv('file.csv', index_col=0)
Out[7]:
B
A
1 4
2 5
3 6
If pandas is treating your first row as a header, you can use header=None, like this:
df = pd.read_csv("csv-file.csv", header=None)
This way pandas will treat your first row like any other row.
I'd like to replace the zero values in a DataFrame with the value found in the last column of each row. I can solve this with a for loop over the columns or the rows, but that didn't seem very Pythonic to me.
In short, I have a dataframe like this:
col1 col2 col3 nonzero
1 2 0 10
1 0 3 20
and I'd like to do an operation like
df[df==0] = df.nonzero
so I'd get
col1 col2 col3 nonzero
1 2 10 10
1 20 3 20
This however does not work, as [df==0] is a DataFrame itself with True/False values. How can this be done?
One option is to use the apply method: loop through the rows of the data frame and replace each zero with the last element of its row:
df.apply(lambda row: row.where(row != 0, row.iat[-1]), axis=1)
You can also modify the data frame in place:
df[df == 0] = (df == 0).mul(df["nonzero"], axis=0)
which yields the same result as above. Here, (df == 0).mul(df["nonzero"], axis=0) creates a data frame whose zero positions hold the values from the nonzero column and whose other entries are zero; combined with boolean indexing and assignment, this conditionally overwrites the zero entries of the original data frame:
(df == 0).mul(df["nonzero"], axis=0)
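For what it's worth, a sketch of arguably the most direct spelling of the same idea uses DataFrame.mask, which replaces values wherever the condition is True and, with axis=0, broadcasts the replacement column down the rows:
df = df.mask(df == 0, df["nonzero"], axis=0)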
I have a pandas dataframe that looks like this:
I would like to iterate through column 3 and, whenever an element exists there, add a new row to the dataframe: the value found in column 3 becomes the new row's value in column 2, and the data in columns 0 and 1 from the row where it was found becomes the new row's values for columns 0 and 1:
Here, row 2 is the newly added row. The values in columns 0 and 1 in this row come from the row where "D" was found, and now column 2 of the new row contains the value from column 3 in the first row, "D".
Here is one way to do it, but surely there must be a more general solution, especially if I wish to scan more than a single column:
a = pd.DataFrame([['A','B','C','D'],[1,2,'C']])
b = a.copy()
for tu in a.itertuples(index=False):  # iterate by row
    if tu[3]:  # if the element exists
        b = b.append([[tu[0], tu[1], tu[3]]], ignore_index=True)  # append a new row built from the relevant tuple elements
You can do this without any loops by creating a new df with the columns you want and appending it to the original.
import pandas as pd
import numpy as np
df = pd.DataFrame([['A','B','C','D'],[1,2,'C']])
ndf = df[pd.notnull(df[3])][[0,1,3]]
ndf.columns = [0,1,2]
df = df.append(ndf, ignore_index=True)
This will leave NaN for the new missing values, which you can then change to None.
df[3] = df[3].where((pd.notnull(df[3])), None)
prints
   0  1  2     3
0  A  B  C     D
1  1  2  C  None
2  A  B  D  None
This may be a bit more general (assuming your columns are integers and that you are always looking to fill the previous columns in this pattern):
import pandas as pd

def append_rows(scan_row, scanned_dataframe):
    new_df = pd.DataFrame()
    for i, row in scanned_dataframe.iterrows():
        if row[scan_row]:  # only rows where the scanned column holds a value
            new_row = [row[j] for j in range(scan_row - 1)]  # copy the previous columns
            new_row.append(row[scan_row])
            print(new_row)
            new_df = new_df.append([new_row], ignore_index=True)
    return new_df
a = pd.DataFrame([['A','B','C','D'],[1,2,'C']])
b = a.copy()
b = b.append(append_rows(3,a))
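On pandas 2.x, where append is gone, a sketch of the same idea that collects the new rows in a list and concatenates once at the end (which is also cheaper than growing the frame row by row):
import pandas as pd

def append_rows_concat(scan_col, df):
    new_rows = [
        list(row[:scan_col - 1]) + [row[scan_col]]  # previous columns plus the scanned value
        for row in df.itertuples(index=False)
        if row[scan_col]  # skip rows where the scanned column is empty/None
    ]
    return pd.concat([df, pd.DataFrame(new_rows)], ignore_index=True)

a = pd.DataFrame([['A', 'B', 'C', 'D'], [1, 2, 'C']])
print(append_rows_concat(3, a))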