I am using the following to select specific columns from the dataframe comb, which I would like to bring into a new dataframe. The individual selections work fine, e.g. comb.ix[:,0:1], but when I try to combine them using + I get a bad result: the first selection ([:,0:1]) ends up stuck on the end of the dataframe, and the values from the original column 1 are wiped out while appearing at the end of the row. What is the right way to get just the columns I want? (I'd include sample data, but as you can see there are too many columns, which is why I'm trying to do it this way.)
comb.ix[:,0:1]+comb.ix[:,17:342]
If you want to concatenate a sub-selection of your df columns, use pd.concat:
pd.concat([comb.ix[:,0:1],comb.ix[:,17:342]], axis=1)
So long as the indices match then this will align correctly.
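For intuition, + does element-wise arithmetic and aligns on column labels, so the non-overlapping columns come back as NaN, while pd.concat simply stitches the selections together. A minimal sketch (column names assumed):
import pandas as pd

df = pd.DataFrame({'a': [1, 2], 'b': [3, 4], 'c': [5, 6]})

# '+' aligns on labels and does arithmetic: non-shared columns become NaN.
print(df[['a']] + df[['b', 'c']])   # all-NaN frame with columns a, b, c

# pd.concat stitches the selections side by side instead.
print(pd.concat([df[['a']], df[['b', 'c']]], axis=1))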
Thanks to @iHightower: note that you can also sub-select by passing the labels:
pd.concat([df.ix[:,'Col1':'Col5'],df.ix[:,'Col9':'Col15']],axis=1)
Note that .ix will be deprecated in a future version, so the following should work instead:
In [115]:
df = pd.DataFrame(columns=['col' + str(x) for x in range(10)])
df
Out[115]:
Empty DataFrame
Columns: [col0, col1, col2, col3, col4, col5, col6, col7, col8, col9]
Index: []
In [118]:
pd.concat([df.loc[:, 'col2':'col4'], df.loc[:, 'col7':'col8']], axis=1)
Out[118]:
Empty DataFrame
Columns: [col2, col3, col4, col7, col8]
Index: []
Or using iloc:
In [127]:
pd.concat([df.iloc[:, df.columns.get_loc('col2'):df.columns.get_loc('col4')], df.iloc[:, df.columns.get_loc('col7'):df.columns.get_loc('col8')]], axis=1)
Out[127]:
Empty DataFrame
Columns: [col2, col3, col7]
Index: []
Note that iloc slicing is half-open, so the end of the range is not included; you'd have to find the column after the column of interest if you want to include it:
In [128]:
pd.concat([df.iloc[:, df.columns.get_loc('col2'):df.columns.get_loc('col4')+1], df.iloc[:, df.columns.get_loc('col7'):df.columns.get_loc('col8')+1]], axis=1)
Out[128]:
Empty DataFrame
Columns: [col2, col3, col4, col7, col8]
Index: []
NumPy has a nice index-building object named np.r_, allowing you to solve this with the modern DataFrame selection interface, iloc:
df.iloc[:, np.r_[0:1, 17:342]]
I believe this is a more elegant solution.
It even supports more complex selections:
df.iloc[:, np.r_[0:1, 5, 16, 17:342:2, -5:]]
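A quick check on a small frame (shape assumed) shows that np.r_ just builds a single integer array that iloc can consume:
import numpy as np
import pandas as pd

df = pd.DataFrame(np.arange(12).reshape(2, 6),
                  columns=['col' + str(i) for i in range(6)])

print(np.r_[0:1, 3:5])              # array([0, 3, 4])
print(df.iloc[:, np.r_[0:1, 3:5]])  # columns col0, col3, col4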
I recently solved it by just concatenating the ranges:
r1 = pd.Series(range(5))
r2 = pd.Series([10, 15, 20])
final_range = pd.concat([r1, r2])  # Series.append was removed in pandas 2.0
df.iloc[:, final_range]
Then you will get columns 0 through 4, plus 10, 15 and 20.
I have a dataset, df, with some empty values in its second column, col2.
So I create a new table with the same column names, whose length equals the number of missing values in df's col2. I call the new dataframe df2.
df[df['col2'].isna()] = df2
But this returns NaN for the entire rows where col2 was missing, which means that df[df['col2'].isna()] is now missing everywhere, not only in col2.
Why is that, and how can I fix it?
Assuming that by df2 you really meant a Series, renamed here as s:
df.loc[df['col2'].isna(), 'col2'] = s.values
Example
nan = float('nan')
df = pd.DataFrame({'col1': [1,2,3], 'col2': [nan, 0, nan]})
s = pd.Series([10, 11])
df.loc[df['col2'].isna(), 'col2'] = s.values
>>> df
col1 col2
0 1 10.0
1 2 0.0
2 3 11.0
Note
I don't like this, because it relies on knowing that the number of NaNs in df matches the length of s. It would be better to know how you created the missing values; with that information, we could probably propose a better and more robust solution.
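For example, if the replacement values can be tied to the rows they belong to, build s with the missing rows' index and let fillna align by label instead of position (a sketch, values assumed):
# Align replacements by index label instead of position (values assumed).
missing_idx = df.index[df['col2'].isna()]
s_aligned = pd.Series([10, 11], index=missing_idx)
df['col2'] = df['col2'].fillna(s_aligned)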
I want to find duplicate columns from a list, so not just any columns.
example of correct csv looks like this:
col1, col2, col3, col4, custom, custom
1,2,3,4,test,test
4,3,2,1,test,test
list looks like this:
columnNames = ['col1', 'col2', 'col3', 'col4']
So when I run something like df.columns.duplicated() I don't want it to detect the duplicate 'custom' fields, only whether there is more than one 'col1' column, or more than one 'col2' column, etc., returning True when one of those columns is duplicated.
I found that when I include a duplicate 'colN' column name (col4 in the example) and print the columns, it shows me Index(['col1', 'col2', 'col3', 'col4', 'col4.1'], dtype='object').
No idea how to write that line of code.
Use Index.isin + Index.duplicated to create a boolean mask:
c = df.columns.str.rsplit('.', n=1).str[0]
mask = c.isin(columnNames) & c.duplicated()
If you want to find the duplicated column names, use boolean indexing with this mask:
dupe_cols = df.columns[mask]
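A quick check on a header like the example's, where read_csv has already renamed the second col4 to col4.1 (header assumed):
import pandas as pd

df = pd.DataFrame(columns=['col1', 'col2', 'col3', 'col4', 'col4.1',
                           'custom', 'custom.1'])
columnNames = ['col1', 'col2', 'col3', 'col4']

c = df.columns.str.rsplit('.', n=1).str[0]
mask = c.isin(columnNames) & c.duplicated()
print(df.columns[mask])  # Index(['col4.1'], dtype='object')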
When you read this CSV file using pandas you will not get two columns with the same name. As far as I know, the second custom column name will be replaced by custom.1, so you can tell from that how many duplicates there are.
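A minimal check of that renaming behaviour (inline CSV assumed):
import io
import pandas as pd

csv = "col1,col2,custom,custom\n1,2,test,test\n"
df = pd.read_csv(io.StringIO(csv))
print(df.columns.tolist())  # ['col1', 'col2', 'custom', 'custom.1']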
Here is one more way to do it using a list comprehension:
import pandas as pd
df = pd.DataFrame([[1,1,2,3,4,"test","test"],[4,4,3,2,1,"test","test"]],
columns = ["col1", "col1.1", "col2", "col3", "col4", "custom", "custom"])
print(df)
Out[1]:
col1 col1.1 col2 col3 col4 custom custom
0 1 1 2 3 4 test test
1 4 4 3 2 1 test test
columnNames = ['col1', 'col2', 'col3', 'col4']
splitColumns = pd.Index([i.split('.')[0] for i in df.columns])
[False if col not in columnNames else dup for col, dup in zip(splitColumns, splitColumns.duplicated())]
Out[2]: [False, True, False, False, False, False, False]
I am using this code
searchfor = ["s", 'John']
df = df[~df.iloc[1].astype(str).str.contains('|'.join(searchfor),na=False)]
This returns the error
IndexingError: Unalignable boolean Series provided as indexer (index of the boolean Series and of the indexed object do not match)
However this works fine if run as a column search
df = df[~df.iloc[:,1].astype(str).str.contains('|'.join(searchfor),na=False)]
I am trying to remove a row based on whether the row contains a certain phrase.
To drop rows
Create a mask which returns True or False depending on whether each cell contains one of your strings:
search_for = ["s", "John"]
mask = df.applymap(lambda x: any(s in str(x) for s in search_for))
Then use .any to check for at least one True per row, and with boolean indexing keep only the rows where no match was found:
df_filtered = df[~mask.any(axis=1)]
To drop columns
search_for = ["s", "John"]
mask = data.applymap(lambda x: any(s in str(x) for s in search_for))
axis=0 instead of 1 to check for each column:
columns_analysis = mask.any(axis=0)
get the indexes when True to drop
columns_to_drop = columns_analysis[columns_analysis == True].index.tolist()
df_filtered = data.drop(columns_to_drop, axis=1)
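Putting both together on a tiny frame (sample data assumed; note that applymap is renamed DataFrame.map in pandas 2.1+):
import pandas as pd

df = pd.DataFrame({'col1': ['John', 'Mary'], 'col2': ['cat', 'dog']})
search_for = ["s", "John"]

mask = df.applymap(lambda x: any(s in str(x) for s in search_for))
print(df[~mask.any(axis=1)])              # keeps only the 'Mary' row
cols = mask.any(axis=0)
print(df.drop(cols[cols].index, axis=1))  # drops col1, keeps col2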
This is related to the way you are slicing your data.
In the first statement, you are asking pandas for the second row of your dataframe (index 1 is the second; for the first, change the index to 0), while in the second case you are asking for the second column, and in your dataframe these have different shapes. See this example:
d = {'col1': [1, 2], 'col2': [3, 4], 'col3':[23,23]}
df = pd.DataFrame(data=d)
print(df)
   col1  col2  col3
0     1     3    23
1     2     4    23
First row:
df.iloc[0]
col1 1
col2 3
col3 23
Name: 0, dtype: int64
First column:
df.iloc[:,0]
0    1
1    2
Name: col1, dtype: int64
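To see why the row version raises the error: df.iloc[1] is a row Series whose index holds the column labels, so as a boolean filter it cannot align with the frame's row index. A sketch with the frame above:
# df.iloc[1] is a row Series indexed by column labels, not row labels.
row_mask = df.iloc[1].astype(str).str.contains('2')
print(row_mask.index.tolist())  # ['col1', 'col2', 'col3'] -- column labels
# df[~row_mask] raises IndexingError: Unalignable boolean Series ...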
I found a bunch of answers about how to drop columns using index numbers.
I am stumped about how to drop all columns after a particular column by name.
df = pd.DataFrame(columns=['A','B','C','D'])
df.drop(['B':],inplace=True)
I expect the new df to have only columns A and B.
Dropping all columns after is the same as keeping all columns up to and including. So:
In [321]: df = pd.DataFrame(columns=['A','B','C','D'])
In [322]: df = df.loc[:, :'B']
In [323]: df
Out[323]:
Empty DataFrame
Columns: [A, B]
Index: []
(Using inplace is typically not worth it.)
get_loc and iloc
Dropping some is the same as selecting the others.
df.iloc[:, :df.columns.get_loc('B') + 1]
Empty DataFrame
Columns: [A, B]
Index: []
df.drop(columns=df.columns[list(df.columns).index("B")+1:], inplace=True)
or
df = df[df.columns[:list(df.columns).index('B')+1]]
should work.
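A quick sanity check of the drop variant (sample data assumed):
import pandas as pd

df = pd.DataFrame([[1, 2, 3, 4]], columns=['A', 'B', 'C', 'D'])
df.drop(columns=df.columns[list(df.columns).index('B') + 1:], inplace=True)
print(df.columns.tolist())  # ['A', 'B']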
I would like to add new records with new indices to a pandas dataframe
for example:
df = pandas.DataFrame(columns = ['COL1', 'COL2'])
Now I have a new record, with index label 'Test1', and values [20, 30]
I would like to do something like (pseudo-code):
df.append(index='Test1', [20, 30])
so my result would be
COL1 COL2
Test1 20 30
The furthest i've reached was:
df = df.append({'COL1':20, 'COL2':30}, ignore_index=True)
but this solution does not include the new index.
Thanks!
Please note that, as per here, Series are size-immutable (i.e. appending an entry to a Series copies the original and creates a new object). This means that appending rows to a DataFrame keeps making unnecessary copies of the entire DataFrame. I highly recommend building a list with your rows and then making one DataFrame when you have all the rows you need.
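A sketch of that pattern (row labels and values assumed):
import pandas as pd

# Collect the rows first, then construct the DataFrame once.
rows = [pd.Series([20, 30], index=['COL1', 'COL2'], name='Test1'),
        pd.Series([40, 50], index=['COL1', 'COL2'], name='Test2')]
df = pd.DataFrame(rows)
print(df)
#        COL1  COL2
# Test1    20    30
# Test2    40    50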
Citing from the documentation here:
Warning Starting in 0.20.0, the .ix indexer is deprecated, in favor of
the more strict .iloc and .loc indexers.
So, you should use loc instead
>>> import pandas as pd
>>> df = pd.DataFrame(columns = ['COL1', 'COL2'])
>>> df.loc['test1'] = [20, 30]
>>> df
COL1 COL2
test1 20 30
>>> df.shape
(1, 2)
You can use .ix (though, as noted above, it is deprecated in favor of .loc):
In [1]: df = pd.DataFrame(columns = ['COL1', 'COL2'])
In [2]: df.ix['test1'] = [20, 30]
In [3]: df
Out[3]:
COL1 COL2
test1 20 30
[1 rows x 2 columns]