df =
0 20
1 19
2 18
3 17
4 16
I am iterating with a loop:
for k in df:
    af = AffinityPropagation(preference=k).fit(X)
    labels = af.labels_
    score = silhouette_score(frechet, labels)
    print("Preference: {0}, Silhouette score: {1}".format(k,score))
I get only one number, but I want a dataframe of scores with the same length as df, i.e. len(df).
You need to use iterrows, as @CodeDifferently points out in the comment above.
Here is an example:
Where df is:
df = pd.DataFrame({0:range(20,0,-1)})
Then using your method:
for k in df:
    print(k)
Output:
0
This zero is the column header of the dataframe. You are iterating through the dataframe's column names.
Using iterrows:
for _,k in df.iterrows():
    print(k.iloc[0])
Output:
20
19
18
17
16
15
14
13
12
11
10
9
8
7
6
5
4
3
2
1
Here you are getting each row of the dataframe as a Series, and with iloc you take the first (and, in this case, only) value in each row.
You almost never need to iterate over a DataFrame. Columns are basically NumPy arrays and have array-like 'elementwise' superpowers. (You ~never need to iterate over NumPy arrays either.)
Maybe formulate your task as a function and use the apply() method on the DataFrame or Series. This 'applies' a function to every item in a column without the need for a loop.
But if you really only have one column like this, why use a DataFrame at all? Just use a NumPy array (or get at it with the column's values attribute).
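As a minimal sketch of the apply() suggestion, assuming the clustering and scoring step is wrapped in a function: score_for below is a hypothetical stand-in (it does not compute a real silhouette score), and the column names "preference" and "score" are made up for illustration. The point is only that the result has one row per preference value:

```python
import pandas as pd

# Preference values, stored in a single column named 0 as in the question
df = pd.DataFrame({0: range(20, 15, -1)})

def score_for(preference):
    # Hypothetical placeholder: the real version would fit
    # AffinityPropagation(preference=preference) and return silhouette_score(...)
    return preference / 10.0

# apply() calls score_for once per value in the column, no explicit loop needed
results = pd.DataFrame({
    "preference": df[0],
    "score": df[0].apply(score_for),
})
print(results)
```

The resulting DataFrame has len(df) rows, which is what the question asked for.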
May I ask you please if we can use set() to read the data in a specific column in pandas? For example, I have the following output from a DataFrame df1:
df1= [
0 -10 2 5
1 24 5 10
2 30 3 6
3 30 2 1
4 30 4 5
]
where the first column is the index. I first tried to isolate the second column
[-10
24
30
30
30]
using x = pd.DataFrame(df1, columns=[0]). Then I transposed the column with XX = x.T and applied set() to the result.
However, instead of obtaining [-10 24 30], I got [0 1 2 3 4].
So set() read the index instead of the first column.
set() takes an iterable.
Using a pandas dataframe as an iterable yields the column names in turn.
Since you've transposed the dataframe, your index values are now column names, so when you use the transposed dataframe as an iterable you get those index values.
If you want to use set() to get the values in the column, you can use:
x = pd.DataFrame(df1, columns=[0])
set(x.iloc[:,0].values)
But if you just want the unique values in column 0, you can use
df1[0].unique()
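A small sketch of the behaviour described in this answer, assuming df1 is a DataFrame with integer column names 0, 1, 2 as in the question. It shows that iterating the transposed frame yields the old index values, while the column itself yields the data:

```python
import pandas as pd

df1 = pd.DataFrame([[-10, 2, 5], [24, 5, 10], [30, 3, 6], [30, 2, 1], [30, 4, 5]])

# Iterating a DataFrame yields its column names; after .T these are the old index values
names_after_transpose = set(df1.T)

# Iterating a single column (a Series) yields the values themselves
values_as_set = set(df1[0])

# unique() is a Series method and keeps the order of first appearance
values_unique = df1[0].unique()

print(names_after_transpose, values_as_set, values_unique)
```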
I have a dataframe df :
>>> df
sales discount net_sales cogs
STK_ID RPT_Date
600141 20060331 2.709 NaN 2.709 2.245
20060630 6.590 NaN 6.590 5.291
20060930 10.103 NaN 10.103 7.981
20061231 15.915 NaN 15.915 12.686
20070331 3.196 NaN 3.196 2.710
20070630 7.907 NaN 7.907 6.459
Then I want to drop the rows at certain positions given in a list, say [1,2,4], leaving:
sales discount net_sales cogs
STK_ID RPT_Date
600141 20060331 2.709 NaN 2.709 2.245
20061231 15.915 NaN 15.915 12.686
20070630 7.907 NaN 7.907 6.459
How can I do that, and which function should I use?
Use DataFrame.drop and pass it a Series of index labels:
In [65]: df
Out[65]:
one two
one 1 4
two 2 3
three 3 2
four 4 1
In [66]: df.drop(df.index[[1,3]])
Out[66]:
one two
one 1 4
three 3 2
Note that drop returns a new DataFrame by default. To modify df itself, pass inplace=True:
df.drop(df.index[[1,3]], inplace=True)
Without it, df is left unchanged unless you assign the result back to it.
http://pandas.pydata.org/pandas-docs/version/0.17.0/generated/pandas.DataFrame.drop.html
If the DataFrame is huge, and the number of rows to drop is large as well, then simple drop by index df.drop(df.index[]) takes too much time.
In my case, I have a multi-indexed DataFrame of floats with 100M rows x 3 cols, and I need to remove 10k rows from it. The fastest method I found is, quite counterintuitively, to take the remaining rows.
Let indexes_to_drop be an array of positional indexes to drop ([1, 2, 4] in the question).
indexes_to_keep = set(range(df.shape[0])) - set(indexes_to_drop)
df_sliced = df.take(sorted(indexes_to_keep))  # sort: a set has no guaranteed order
In my case this took 20.5s, while the simple df.drop took 5min 27s and consumed a lot of memory. The resulting DataFrame is the same.
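A minimal sketch of the "take the remaining rows" idea on a tiny frame, to check that it gives the same result as drop. The use of np.setdiff1d here is my own substitution for the set arithmetic above (it returns the kept positions already sorted); the timings quoted in this answer are not reproduced:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"sales": [2.709, 6.590, 10.103, 15.915, 3.196, 7.907]})
indexes_to_drop = [1, 2, 4]  # positional indexes, as in the question

# Positions to keep: everything in 0..len(df)-1 that is not being dropped
indexes_to_keep = np.setdiff1d(np.arange(len(df)), indexes_to_drop)
df_sliced = df.take(indexes_to_keep)

# Reference result using drop with the labels found at those positions
df_dropped = df.drop(df.index[indexes_to_drop])
print(df_sliced)
```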
I solved this in a simpler way, in just 2 steps.
Step 1: Make a dataframe with the unwanted rows/data.
Step 2: Use the index of this unwanted dataframe to drop the rows from the original dataframe.
Example:
Suppose you have a dataframe df which has many columns, including 'Age', which is an integer. Now let's say you want to drop all the rows where 'Age' is negative.
df_age_negative = df[ df['Age'] < 0 ] # Step 1
df = df.drop(df_age_negative.index, axis=0) # Step 2
Hope this is much simpler and helps you.
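For comparison, a sketch of the same filter written as a single boolean selection. The 'Name' column and the sample values are made up for illustration. One difference worth knowing: the two-step version drops by index label, so with a non-unique index it would also remove other rows that share a label with an unwanted row, whereas the mask works row by row.

```python
import pandas as pd

df = pd.DataFrame({"Name": ["a", "b", "c", "d"], "Age": [25, -3, 40, -1]})

# Two-step version from the answer above
df_age_negative = df[df["Age"] < 0]                  # Step 1
two_step = df.drop(df_age_negative.index, axis=0)    # Step 2

# One-step version: keep the rows where the condition is NOT true
one_step = df[~(df["Age"] < 0)]
print(one_step)
```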
You can also pass the label itself to DataFrame.drop (instead of a Series of index labels):
In[17]: df
Out[17]:
a b c d e
one 0.456558 -2.536432 0.216279 -1.305855 -0.121635
two -1.015127 -0.445133 1.867681 2.179392 0.518801
In[18]: df.drop('one')
Out[18]:
a b c d e
two -1.015127 -0.445133 1.867681 2.179392 0.518801
Which is equivalent to:
In[19]: df.drop(df.index[[0]])
Out[19]:
a b c d e
two -1.015127 -0.445133 1.867681 2.179392 0.518801
If I want to drop a row with, say, index x, I would do the following:
df = df[df.index != x]
If I wanted to drop multiple rows by position (say the positions are in the list unwanted_indices), I would do:
desired_indices = [i for i in range(len(df.index)) if i not in unwanted_indices]
desired_df = df.iloc[desired_indices]
Here is a more specific example. Say some of your rows contain many duplicate entries. If the entries are strings, you can easily use string methods to find all the indexes to drop.
ind_drop = df[df['column_of_strings'].apply(lambda x: x.startswith('Keyword'))].index
And now drop those rows using their indexes:
new_df = df.drop(ind_drop)
Use only the index argument to drop a row:
df.drop(index=2, inplace=True)
For multiple rows:
df.drop(index=[1,3], inplace=True)
In a comment on @theodros-zelleke's answer, @j-jones asked what to do if the index is not unique. I had to deal with such a situation. What I did was rename the duplicates in the index before calling drop(), a la:
dropped_indexes = <determine-indexes-to-drop>
df.index = rename_duplicates(df.index)
df.drop(df.index[dropped_indexes], inplace=True)
where rename_duplicates() is a function I defined that went through the elements of index and renamed the duplicates. I used the same renaming pattern as pd.read_csv() uses on columns, i.e., "%s.%d" % (name, count), where name is the name of the row and count is how many times it has occurred previously.
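The answer does not show rename_duplicates() itself, so the following is only one possible sketch of it, assuming string labels and the "%s.%d" pattern described above. It does not guard against a generated name such as "a.1" colliding with a label that already exists in the index:

```python
import pandas as pd

def rename_duplicates(index):
    """Return a new Index where repeated labels get a '.N' suffix (hypothetical helper)."""
    seen = {}          # label -> number of times it has occurred so far
    new_labels = []
    for name in index:
        count = seen.get(name, 0)
        # First occurrence keeps its name; later ones become name.1, name.2, ...
        new_labels.append(name if count == 0 else "%s.%d" % (name, count))
        seen[name] = count + 1
    return pd.Index(new_labels)

df = pd.DataFrame({"value": [1, 2, 3, 4]}, index=["a", "b", "a", "a"])
df.index = rename_duplicates(df.index)

# With unique labels, dropping by position removes exactly one row
result = df.drop(df.index[[2]])
print(result)
```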
Determining the index from the boolean mask as described above, e.g.
df[df['column'].isin(values)].index
can be more memory intensive than determining the index using this method
pd.Index(np.where(df['column'].isin(values))[0])
applied like so
df.drop(pd.Index(np.where(df['column'].isin(values))[0]), inplace = True)
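This method is useful when dealing with large dataframes and limited memory. Note that np.where returns row positions, so passing them to drop as labels only works as written when df has the default integer index (0, 1, 2, ...).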
To drop rows with indices 1, 2, 4 you can use:
df[~df.index.isin([1, 2, 4])]
The tilde operator ~ negates the result of the method isin. Another option is to drop indices:
df.loc[df.index.drop([1, 2, 4])]
Look at the following dataframe df
df
column1 column2 column3
0 1 11 21
1 2 12 22
2 3 13 23
3 4 14 24
4 5 15 25
5 6 16 26
6 7 17 27
7 8 18 28
8 9 19 29
9 10 20 30
Let's drop all the rows that have an odd number in column1.
Create a list of all the elements in column1 and keep only those that are even numbers (the elements that you don't want to drop):
keep_elements = [x for x in df.column1 if x%2==0]
All the rows with the values [2, 4, 6, 8, 10] in column1 will be retained.
df.set_index('column1',inplace = True)
df.drop(df.index.difference(keep_elements),axis=0,inplace=True)
df.reset_index(inplace=True)
We make column1 the index and drop all the rows that are not required. Then we reset the index.
df
column1 column2 column3
0 2 12 22
1 4 14 24
2 6 16 26
3 8 18 28
4 10 20 30
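For comparison, a sketch of the same result obtained with a boolean condition instead of the set_index / drop / reset_index round trip. This is an alternative technique, not the method used in this answer:

```python
import pandas as pd

df = pd.DataFrame({
    "column1": range(1, 11),
    "column2": range(11, 21),
    "column3": range(21, 31),
})

# Keep rows whose column1 value is even, then renumber the index from 0
even_rows = df[df["column1"] % 2 == 0].reset_index(drop=True)
print(even_rows)
```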
As Dennis Golomazov's answer suggests, rather than using drop to remove rows, you can select the rows to keep instead. Let's say you have a list of row positions to drop called indices_to_drop. You can convert it to a mask as follows:
mask = np.ones(len(df), bool)
mask[indices_to_drop] = False
You can use this mask directly:
df_new = df.iloc[mask]
The nice thing about this method is that mask can come from any source: it can be a condition involving many columns, or something else.
The really nice thing is, you really don't need the index of the original DataFrame at all, so it doesn't matter if the index is unique or not.
The disadvantage is of course that you can't do the drop in-place with this method.
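A short worked sketch of the mask approach on a frame with a deliberately non-unique index, to illustrate that the index labels play no role. The sample data is made up:

```python
import numpy as np
import pandas as pd

# Duplicate labels on purpose: the mask is purely positional
df = pd.DataFrame({"value": [10, 20, 30, 40, 50]}, index=["a", "a", "b", "c", "c"])
indices_to_drop = [1, 3]

mask = np.ones(len(df), dtype=bool)   # start by keeping every row
mask[indices_to_drop] = False         # switch off the positions to drop

df_new = df.iloc[mask]                # returns a new DataFrame; df is unchanged
print(df_new)
```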
Consider an example dataframe
df =
index column1
0 00
1 10
2 20
3 30
We want to drop the 2nd and 3rd rows (positions 1 and 2).
Approach 1:
df = df.drop(df.index[[1,2]])
or
df.drop(df.index[[1,2]],inplace=True)
print(df)
df =
index column1
0 00
3 30
#This approach removes the rows as we wanted, but the index is no longer consecutive
Approach 2
df.drop(df.index[[1,2]],inplace=True)
df.reset_index(drop=True,inplace=True)
print(df)
df =
index column1
0 00
1 30
#This approach removes the rows as we wanted and resets the index.
This worked for me
# Create a list containing the index numbers you want to remove
index_list = list(range(42766, 42798))
df.drop(df.index[index_list], inplace =True)
df.shape
This should drop all rows at the positions in that range.
I have the following pandas DataFrame.
import pandas as pd
df = pd.read_csv('filename.csv')
print(df)
A B C D
0 2 0 11 0.053095
1 2 0 11 0.059815
2 0 35 11 0.055268
3 0 35 11 0.054573
4 0 1 11 0.054081
5 0 2 11 0.054426
6 0 1 11 0.054426
7 0 1 11 0.054426
8 42 7 3 0.048208
9 42 7 3 0.050765
10 42 7 3 0.05325
....
The problem is that the data is naturally "clustered" into groups, but the group labels are not given. From the above, rows 0-1 are one group, rows 2-3 are a group, rows 4-7 are a group, and 8-10 are a group.
I need to impute this information. One could use machine learning; however, is it possible to do this using only pandas?
Can I use groupby on the column values to create these groups? The problem is that the values are not exact. For the third group, column B has the values 1, 2, 1, 1.
A pure pandas solution would involve binning, assuming that the values within a cluster are close to each other and that your bin size is large enough to cover the variation within a cluster but smaller than the distance between clusters. Whether that holds depends on your data.
The binning approach uses the cut function in pandas. You provide a series (or array) and the number of bins you want to the function. The function evenly subdivides the range of your series into the given number of bins and determines where each value in the input falls. The output for the below set of columns will be which bin the value fell in and will be what you can group by, following your original train of thought.
The way this would come out in practice for bins of size ~5 is
import numpy as np
for col in df.columns:
    binned_name = col + '_binned'
    # pd.cut needs an integer bin count, so cast the result of np.ceil
    num_bins = int(np.ceil(df[col].max()/5))
    df[binned_name] = pd.cut(df[col],num_bins,labels=False)
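To illustrate how the binned columns can then be grouped, here is a sketch on a reduced version of the question's data (columns A and B only, six rows). The names binned and group_id are my own, and using groupby(...).ngroup() to number the groups is one possible follow-up, not something stated in the answer. As the answer warns, values that fall on either side of a bin edge would end up in different groups:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "A": [2, 2, 0, 0, 42, 42],
    "B": [0, 0, 35, 35, 7, 7],
})

# Bin every column into bins of width ~5; labels=False returns integer bin numbers
binned = pd.DataFrame({
    col: pd.cut(df[col], int(np.ceil(df[col].max() / 5)), labels=False)
    for col in df.columns
})

# Rows that share the same combination of bins get the same group number
group_id = binned.groupby(list(binned.columns)).ngroup()
print(group_id.tolist())
```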
Is there a more elegant way to achieve this? My current solution, based on various Stack Overflow answers, is as follows:
import pandas as pd
df = pd.DataFrame([[11,12,13,14],[15,16,17,18]], columns = [0,1,2,3])
print(df)
dT = df.T
dT.reindex(dT.index[::-1]).cumsum().reindex(dT.index).T
Output
df is:
0 1 2 3
0 11 12 13 14
1 15 16 17 18
After the row-wise reverse cumsum:
0 1 2 3
0 50 39 27 14
1 66 51 35 18
I have to perform this often on my data (which is also much bigger), so I am trying to find a shorter or better way to achieve it.
Thanks
Here is a slightly more readable alternative:
df[df.columns[::-1]].cumsum(axis=1)[df.columns]
There is no need to transpose your DataFrame; just use the axis=1 argument to cumsum.
Obviously the easiest thing would be to just store your DataFrame columns in the opposite order, but I assume there is some reason why you're not doing that.
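As a worked check on the question's data, here is a sketch of an equivalent purely positional variant that reverses the column order with iloc before and after the cumulative sum. This is an alternative spelling, not the one-liner given in the answer:

```python
import pandas as pd

df = pd.DataFrame([[11, 12, 13, 14], [15, 16, 17, 18]], columns=[0, 1, 2, 3])

# Reverse the columns, accumulate left to right along each row, reverse back
reversed_cumsum = df.iloc[:, ::-1].cumsum(axis=1).iloc[:, ::-1]
print(reversed_cumsum)
```

Because it works by position, this form does not depend on the column labels being unique.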