csv file data cleaning process - python

See the attached screenshot. I want to delete all the rows that contain entries in the 'Unnamed' column.
I know that the column can be removed with data.drop(data.columns[27], axis=1, inplace=True), but that won't delete the entire rows along with it.
import pandas as pd
import numpy as np
data = pd.read_csv('/home/syed/ML-Notebook/FL-P1/DATASET_FRAUDE.csv',
                   engine='python',
                   encoding='latin1',
                   parse_dates=['FECHA_SINIESTRO', 'FECHA_INI_VIGENCIA', 'FECHA_FIN_VIGENCIA', 'FECHA_DENUNCIO'])
#data.drop(data.columns[27], axis=1, inplace=True)
print(data.info())

df = df[df['Unnamed: 27'].astype(str).map(len) >0]
df
Drop Column:
df = df.loc[:, ~df.columns.str.contains('^Unnamed')]

To delete rows matching a condition you can do:
df = df.drop(df[df.column_name == 'Unnamed'].index)
This question should also be helpful: Deleting DataFrame row in Pandas based on column value
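If the goal is to remove every row that has a value in the stray column and then drop the column itself, a null check may be the simplest route. Below is a minimal sketch, not the asker's real data: the inline CSV and the column name "Unnamed: 2" are made up for illustration (in the question the column would be 'Unnamed: 27'). Note that a length test after astype(str) can be misleading, because missing values may be converted to the text 'nan', which has a non-zero length.

```python
import io

import pandas as pd

# Hypothetical stand-in for the real file: the trailing comma in the header
# makes pandas create a column named "Unnamed: 2".
raw = "a,b,\n1,2,\n3,4,stray\n5,6,\n"
data = pd.read_csv(io.StringIO(raw))

stray_col = "Unnamed: 2"  # assumed name; check data.columns for the real one

# Keep only the rows where the stray column is empty (NaN),
# then drop the stray column, which is now entirely empty.
clean = data[data[stray_col].isna()].drop(columns=stray_col)

print(clean)
```

Here the row containing "stray" is removed and only columns a and b remain.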

Related

Dropping index in DataFrame for CSV file

Working with a CSV file in PyCharm. I want to delete the automatically-generated index column. When I print it, however, the answer I get in the terminal is "None". All the answers by other users indicate that the reset_index method should work.
If I just say "df = df.reset_index(drop=True)" it does not delete the column, either.
import pandas as pd
df = pd.read_csv("music.csv")
df['id'] = df.index + 1
cols = list(df.columns.values)
df = df[[cols[-1]]+cols[:3]]
df = df.reset_index(drop=True, inplace=True)
print(df)
I agree with @It_is_Chris. Also,
this is incorrect, because with inplace=True the method returns None:
df = df.reset_index(drop=True, inplace=True)
It should be like this:
df.reset_index(drop=True, inplace=True)
or
df = df.reset_index(drop=True)
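A quick way to see the difference is the small check below. The DataFrame is a made-up example, not the music.csv from the question.

```python
import pandas as pd

# Hypothetical data with a non-default index
df = pd.DataFrame({"title": ["a", "b", "c"]}, index=[10, 20, 30])

# With inplace=True the frame is modified directly and nothing is returned
result = df.reset_index(drop=True, inplace=True)
print(result)          # the return value of the in-place call
print(list(df.index))  # df itself now has a fresh 0..n-1 index

# Without inplace, a new frame is returned and can be assigned
df2 = df.reset_index(drop=True)
```

So assigning the result of the in-place call back to df replaces the DataFrame with None, which is why print(df) showed "None".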
Since you said you're trying to "delete the automatically-generated index column", I can think of two solutions.
First solution:
Use the dataset's own index column as the DataFrame index. If your dataset already has an index/number column, you could do something like this:
# assuming the first column of the dataset (position 0) is its index column
df = pd.read_csv("yourfile.csv", index_col=0)
#you won't see the automatically-generated index column anymore
df.head()
Second solution:
You could delete it in the final csv:
#To export your df to a csv without the automatically-generated index column
df.to_csv("yourfile.csv", index=False)
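Both solutions can be tried without touching any files. This is only a sketch with made-up data: the CSV text is kept in memory instead of being written to "yourfile.csv".

```python
import io

import pandas as pd

# Hypothetical data standing in for the real file
df = pd.DataFrame({"id": [1, 2], "title": ["x", "y"]})

# Default export: the index is written as an extra, unnamed first column
with_index = df.to_csv()

# index=False leaves the automatically generated index out of the file
without_index = df.to_csv(index=False)

# Reading the first export back with index_col=0 turns that extra column
# into the index again instead of an "Unnamed: 0" data column
restored = pd.read_csv(io.StringIO(with_index), index_col=0)

print(with_index.splitlines()[0])
print(without_index.splitlines()[0])
print(restored)
```

Keep in mind that the index is always shown when a DataFrame is printed; these options only control what is stored in the CSV and which column is used as the index.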

How can I restore a column that I already deleted from a pandas DataFrame?

Using df.drop() I removed the "ID" column from the df, and now I want to get that column back.
df.drop('ID', axis=1, inplace=True)
df
# shows me df without ID column
What method should I use?
I found that all you need to do is re-run the original command that loaded the df :)
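If reloading is expensive, another option is to keep the column around when removing it. The sketch below uses a made-up DataFrame and relies on DataFrame.pop, which removes a column and returns it, and DataFrame.insert to put it back. It only helps if you plan ahead; a column dropped with inplace=True and not saved anywhere cannot be recovered from the DataFrame itself.

```python
import pandas as pd

# Hypothetical data
df = pd.DataFrame({"ID": [1, 2, 3], "value": [10, 20, 30]})

# pop removes the column from df and hands it back as a Series
ids = df.pop("ID")
cols_without_id = list(df.columns)

# ... work with df without the ID column ...

# Put the column back in its original first position
df.insert(0, "ID", ids)
print(df)
```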

How do I use my first row in my spreadsheet for my Dataframe column names instead of 0 1 2...etc?

I want my dataframe to display the first row names as my dataframe column name instead of numbering from 0 etc. How do I do this?
I tried using pandas and openpyxl modules to turn my Excel spreadsheet into a dataframe.
import pandas as pd
from openpyxl import load_workbook
from openpyxl.utils.dataframe import dataframe_to_rows
wb = load_workbook(filename='Budget1.xlsx')
print(wb.sheetnames)
sheet_ranges=wb['May 2019']
print(sheet_ranges['A3'].value)
ws=wb['May 2019']
df=pd.DataFrame(ws.values)
print(df) # This displays my dataframe.
I expect my column titles of my dataframe to display Date, Description, and Amount instead of 0, 1, 2.
After reading the data into a DataFrame with pandas, you can separate out the first row and use it as the column names:
columnNames = df.iloc[0]
df = df[1:]
df.columns = columnNames
Or you can read the sheet directly with pandas, which uses the first row as the column names by default:
excelDF = pd.ExcelFile('Budget1.xlsx')
df1 = pd.read_excel(excelDF, 'SheetNameThatYouWantTORead')
print(df1.columns)
You can reset the columns to be the first row of your DataFrame:
df.columns = df.iloc[0, :]
df.drop(df.index[0], inplace=True)
df
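When the rows come from openpyxl's ws.values, it may be simpler to take the header off before building the DataFrame. The sketch below does not open Budget1.xlsx: the sheet_values function and its rows are made up to stand in for ws.values, which also yields one tuple per row.

```python
import pandas as pd


def sheet_values():
    """Hypothetical stand-in for ws.values: yields one tuple per sheet row."""
    yield ("Date", "Description", "Amount")
    yield ("2019-05-01", "Rent", 1200)
    yield ("2019-05-02", "Coffee", 4)


values = sheet_values()

# The first row holds the titles; the remaining rows are the data
header = list(next(values))
df = pd.DataFrame(list(values), columns=header)

print(df)
```

Compared with promoting row 0 after the fact, this keeps the titles out of the data, so pandas can infer a numeric type for the Amount column.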

Need help to solve the "Unnamed" columns and change them in a pandas DataFrame

How do I set my indexes from "Unnamed" to the first line of my DataFrame in Python?
import pandas as pd
df = pd.read_excel('example.xls', 'Day_Report', index_col=None, skipfooter=31)
df = df.dropna(how='all',axis=1)
df = df.dropna(how='all')
df = df.drop(2)
To set the column names (assuming that's what you mean by "indexes") to the first row, you can use
df.columns = df.loc[0, :].values
Following that, if you want to drop the first row, you can use
df.drop(0, inplace=True)
Edit
As coldspeed correctly notes below, if the source of this is reading a CSV, then adding the skiprows=1 parameter is much better.
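To illustrate the skiprows suggestion, here is a small sketch with made-up CSV text in which a title line sits above the real header. The contents are hypothetical; read_excel accepts a skiprows parameter as well.

```python
import io

import pandas as pd

# Hypothetical file contents: one junk title line, then the real header
raw = "Day report\nname,score\nann,1\nbob,2\n"

# Skip the first line so the second line is used as the column names
df = pd.read_csv(io.StringIO(raw), skiprows=1)

print(df.columns.tolist())
print(df)
```

This way the column names are right from the start and there is no header row left in the data to drop.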

Moving multiple columns of pandas dataframe to csv

I have a dataframe that I imported using pandas.read_csv that is two columns. I manipulated one column, and now would like to save all three columns as a .csv file. I have been able to save one column at a time, but am unable to get all three (df.Time, df.Distance, and df.Velocity). Here is what I'm working with.
import pandas as pd
df = pd.read_csv('/Users/path/file.csv', delimiter=',', usecols=['A', 'B'])
df.columns = ['Time', 'Range']
df.Time = df['Time'].round(14)
df.Range = df['Range'].round(14)
df.Velocity = (df.Range.shift(1) - df.Range) / (df.Time.shift(1) - df.Time)
df2 = [df.Time, df.Range, df.Velocity]
df2.to_csv('test5.csv', columns = header)
Your assignment makes df2 a list and not a DataFrame (df2 = [df.Time, df.Range, df.Velocity]).
You probably want:
df[['Time', 'Range', 'Velocity']].to_csv('test5.csv')
You can also select the columns you want and write them out like this:
import pandas as pd
data=pd.read_csv('filename.csv')
data[['column1','column2','column3',...]].to_csv('fileNameWhereYouwantToWrite.csv')
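One more thing to watch for in the question's code: assigning to df.Velocity sets a plain attribute on the object rather than creating a column, so selecting 'Velocity' afterwards would fail. New columns need bracket assignment. Below is a minimal sketch with made-up Time and Range values, writing the CSV to a string instead of 'test5.csv'.

```python
import pandas as pd

# Hypothetical measurements standing in for the real file
df = pd.DataFrame({"Time": [0.0, 1.0, 2.0], "Range": [0.0, 2.0, 6.0]})

# Bracket assignment creates a real column; the first row has no previous
# sample to difference against, so its velocity is NaN
df["Velocity"] = (df["Range"].shift(1) - df["Range"]) / (
    df["Time"].shift(1) - df["Time"]
)

# Select all three columns and export them together, without the index
csv_text = df[["Time", "Range", "Velocity"]].to_csv(index=False)
print(csv_text)
```

Passing a file path as the first argument of to_csv writes the same text to disk.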
