Drop rows from a dataframe that contain a specific string - python

I have a number of CSV files where the head looks something like:
09/07/2014,26268315,,
10/07/2014,6601181,16.3857
11/07/2014,916651,12.5879
14/07/2014,213357,,
15/07/2014,205019,10.8607
I need to read this into a dataframe and remove any rows containing ,,. However, when I read the CSV data into a dataframe using:
df = pd.read_csv(raw_directory+'\\'+filename, error_bad_lines=False,header=None)
I get:
0 1 2 3
0 09/07/2014 26268315 NaN NaN
1 10/07/2014 6601181 16.3857 NaN
2 11/07/2014 916651 12.5879 NaN
3 14/07/2014 213357 NaN NaN
4 15/07/2014 205019 10.8607 NaN
How can I read the CSV data into a dataframe and get:
0
0 09/07/2014,26268315,,
1 10/07/2014,6601181,16.3857
2 11/07/2014,916651,12.5879
3 14/07/2014,213357,,
4 15/07/2014,205019,10.8607
I need to remove any rows where ,, is present, and then resave the adjusted dataframe to a new CSV file. I was going to use:
stringList = [',,']
df = df[~df[0].isin([stringList])]
to remove the rows with ,, present so the resulting .csv head looks like:
10/07/2014,6601181,16.3857
11/07/2014,916651,12.5879
15/07/2014,205019,10.8607

I guess it is possible here to remove all columns that are entirely NaN and then drop rows with any NaNs:
df = df.dropna(axis=1, how='all').dropna()
print (df)
0 1 2
1 10/07/2014 6601181 16.3857
2 11/07/2014 916651 12.5879
4 15/07/2014 205019 10.8607
Another solution is to pass a separator that does not occur in the data, like |, so each line is read as a single string, and then filter with endswith:
df = pd.read_csv(raw_directory+'\\'+filename, error_bad_lines=False,header=None, sep='|')
df = df[~df[0].str.endswith(',')]
#alternative solution - $ is for end of string
#df = df[~df[0].str.contains(',$')]
print (df)
0
1 10/07/2014,6601181,16.3857
2 11/07/2014,916651,12.5879
4 15/07/2014,205019,10.8607
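To finish the original task of resaving the cleaned data to a new CSV, a minimal sketch based on the first approach (the 'cleaned_' output name is only illustrative):
import pandas as pd

df = pd.read_csv(raw_directory + '\\' + filename, header=None)
# drop the all-NaN column, then any row that still contains NaN (i.e. the ',,' rows)
df = df.dropna(axis=1, how='all').dropna()
# write back without index or header so the file keeps its original shape
df.to_csv(raw_directory + '\\cleaned_' + filename, index=False, header=False)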

Related

pandas dataframe: delete empty label name

I have a dataframe converted from a tab-separated text file, but the first label is an extra, unnecessary one.
a b c
0 1 2 NaN
1 2 3 NaN
The label a is an extra one. The dataframe should be:
b c
0 1 2
1 2 3
How to remove a? Thanks in advance.
You can omit the first header row with the skiprows parameter and then pass new column names via the names parameter - the number of names must match the number of fields in the data rows:
df = pd.read_csv(file, skiprows=1, names=['b','c'])
print (df)
b c
0 1 2
1 2 3
A more dynamic approach is to read only the header row with nrows=0 to get the columns, then pass them to names with the first value removed by indexing:
names = pd.read_csv(file, nrows=0).columns
df = pd.read_csv(file, skiprows=1, names=names[1:])
Another idea is to fall back to the default columns (a RangeIndex):
df = pd.read_csv(file, skiprows=1, header=None)
print (df)
0 1
0 1 2
1 2 3
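A self-contained sketch of the dynamic variant, using io.StringIO to stand in for the tab-separated file (the raw text here is an assumption reconstructed from the frames shown above):
import pandas as pd
from io import StringIO

sample = StringIO('a\tb\tc\n1\t2\n2\t3\n')   # header has one extra label

names = pd.read_csv(sample, sep='\t', nrows=0).columns
sample.seek(0)                                # rewind before reading the data rows
df = pd.read_csv(sample, sep='\t', skiprows=1, names=names[1:])
print(df)
#    b  c
# 0  1  2
# 1  2  3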

How to read in Pandas DataFrame while ignoring index and column labels?

A while back I made a DataFrame full of ints with strings for column and index labels and saved it as a .csv.
Something like this:
A B C
A 1 5 8
B 5 2 4
C 8 4 0
Now I am trying to read the csv and perform operations on it. In order to do that, I have to get rid of those labels. I have tried using drop but they don't go away. This is my code:
import pandas as pd
df = pd.read_csv(filepath_or_buffer='path',header=None,index_col=False)
print(df.head())
This is what comes out:
0 ... 12
0 NaN ... 10.1021/nn502895s
1 10.1063/1.4973245 ... 3.1641066942926606
2 10.3891/acta.chem.scand.26-0333 ... 3.8644527240688675
3 10.1063/1.463096 ... 2.9273855677735448
4 10.1146/annurev-physchem-040412-110130 ... 6.1534904155247325
How do I get rid of the labels (the strings)?
Thank you!
Use the parameter skiprows=1 to keep the CSV header from becoming the first row of the DataFrame, then add index_col=[0] so the index column is parsed correctly, and finally remove it with DataFrame.reset_index and drop=True:
df = pd.read_csv('file.csv', header=None, skiprows=1, index_col=[0]).reset_index(drop=True)
print (df)
1 2 3
0 1 5 8
1 5 2 4
2 8 4 0
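A runnable sketch of the same idea, with io.StringIO standing in for the saved CSV (the raw text is an assumption based on how to_csv writes an index and header):
import pandas as pd
from io import StringIO

csv = StringIO(',A,B,C\nA,1,5,8\nB,5,2,4\nC,8,4,0\n')

df = pd.read_csv(csv, header=None, skiprows=1, index_col=[0]).reset_index(drop=True)
print(df)
#    1  2  3
# 0  1  5  8
# 1  5  2  4
# 2  8  4  0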

how to split the data in a column into multiple columns, based on multiple delimiters, in pandas

I have a dataframe with only one column, named 'ALL_category'. Each row contains between one and three names separated by the delimiters '|', '||' or '|||', which can appear at the beginning, in the middle, or at the end of the row. I want to split the column into multiple columns so that each new column contains one of the names. How can I do it?
Below is the code to generate the dataframe:
x = {'ALL Categories': ['Rakesh||Ramesh|','||Rajesh|','HARPRIT|||','Tushar||manmit|']}
df = pd.DataFrame(x)
When I used the code below to modify the dataframe, it didn't give me any result.
data = data.ALL_HOLDS.str.split(r'w', expand = True)
I believe you need Series.str.extractall if you want each word in a separate column:
df1 = df['ALL Categories'].str.extractall(r'(\w+)')[0].unstack()
print (df1)
match 0 1
0 Rakesh Ramesh
1 Rajesh NaN
2 HARPRIT NaN
3 Tushar manmit
Or a slightly modified version of the code from @Chris A in the comments, using Series.str.strip and Series.str.split on one or more |:
df1 = df['ALL Categories'].str.strip('|').str.split(r'\|+', expand=True)
print (df1)
0 1
0 Rakesh Ramesh
1 Rajesh None
2 HARPRIT None
3 Tushar manmit
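If you want named columns afterwards, a small follow-up sketch (the 'name_1', 'name_2' labels are only illustrative):
df1 = df['ALL Categories'].str.strip('|').str.split(r'\|+', expand=True)
df1.columns = ['name_{}'.format(i + 1) for i in range(df1.shape[1])]
print(df1)
#     name_1  name_2
# 0   Rakesh  Ramesh
# 1   Rajesh    None
# 2  HARPRIT    None
# 3   Tushar  manmit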

how to convert categorical data to numerical data in a for loop in python pandas

I have a categorical dataframe and I want to convert it into numerical data. I have more than 50 columns, so I want to run the .replace command in a loop.
replace_map = {'w': 4, '+': 5, '.': 6, 'g': 7}
and I have written code that iterates over the columns:
for column in df1_replace.columns[1:76]:
    # Select column contents by column name using [] operator
    columnSeriesObj = df1_replace[column]
    print('Column Name : ', column)
    print('Column Contents : ', columnSeriesObj.values)
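For the conversion itself, a minimal sketch of applying the mapping from replace_map inside that loop (whether every column uses the same mapping is an assumption):
replace_map = {'w': 4, '+': 5, '.': 6, 'g': 7}

for column in df1_replace.columns[1:76]:
    # replace the categorical codes with their numeric values, column by column
    df1_replace[column] = df1_replace[column].replace(replace_map)

# equivalent without a loop:
# cols = df1_replace.columns[1:76]
# df1_replace[cols] = df1_replace[cols].replace(replace_map)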
Here is how you could do it using dropna() and drop_duplicates().
I have used my own sample data with one column with no values.
import pandas as pd
from io import StringIO
csv = StringIO('''2001,1,,a,a
2001,2,,b,b
2001,3,,c,c
2005,1,,a,a
2005,1,,c,c''')
df = pd.read_csv(csv, header=None )
print(df)
df will look like this
0 1 2 3 4
0 2001 1 NaN a a
1 2001 2 NaN b b
2 2001 3 NaN c c
3 2005 1 NaN a a
4 2005 1 NaN c c
Then drop all columns (how='all') where all values are NaN:
df_new = df.dropna(how='all', axis=1)
Take a transpose of the dataframe so the duplicate columns become duplicate rows, then use drop_duplicates to remove the duplicate rows. Transpose it back to get your original data without the empty and duplicate columns.
df_new = df_new.T.drop_duplicates().T
df_new.columns = range(len(df_new.columns))
print(df_new)

Pandas: reading Excel file starting from the row below that with a specific value

Say I have the following Excel file:
A B C
0 - - -
1 Start - -
2 3 2 4
3 7 8 4
4 11 2 17
I want to read the file into a dataframe, making sure that I start reading below the row that holds the Start value.
Attention: the Start value is not always located in the same row, so if I were to use:
import pandas as pd
xls = pd.ExcelFile(r'C:\Users\MyFolder\MyFile.xlsx')
df = xls.parse('Sheet1', skiprows=4, index_col=None)
this would fail as skiprows needs to be fixed. Is there any workaround to make sure that xls.parse finds the string value instead of the row number?
df = pd.read_excel('your/path/filename')
This helps in finding the location of 'Start' in the df:
for row in range(df.shape[0]):
    for col in range(df.shape[1]):
        if df.iat[row, col] == 'Start':
            row_start = row
            break
After finding row_start you can take a sub-frame:
df_required = df.loc[row_start:]
And if you don't need the row containing 'Start', just increment row_start by 1:
df_required = df.loc[row_start+1:]
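A vectorized alternative for locating that row - a sketch assuming 'Start' appears exactly once in the sheet and df was read as above:
# boolean mask of rows that contain 'Start' in any column; take the first matching label
start_row = df.index[df.eq('Start').any(axis=1)][0]
df_required = df.loc[start_row + 1:]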
If you know the specific rows you are interested in, you can skip from the top using skiprows and then parse only the row (or rows) you want using nrows - see pandas.read_excel:
df = pd.read_excel('myfile.xlsx', 'Sheet1', skiprows=2, nrows=3)
You could use pd.read_excel(r'C:\Users\MyFolder\MyFile.xlsx', sheet_name='Sheet1') as it ignores empty Excel cells.
Your DataFrame should then look like this:
A B C
0 Start NaN NaN
1 3 2 4
2 7 8 4
3 11 2 17
Then drop the first row by using
df.drop([0])
to get
A B C
0 3 2 4
1 7 8 4
2 11 2 17
