How to preserve the format when writing to csv using pandas? - python

I have a text file like this:
id,name,sex,
1,Sam,M,
2,Ann,F,
3,Peter,
4,Ben,M,
Then, I read the file:
df = pd.read_csv('data.csv')
After that, I write it to another file:
df.to_csv('new_data.csv', index=False)
Then, I get
id,name,sex,Unnamed: 3
1,Sam,M,
2,Ann,F,
3,Peter,,
4,Ben,M,
You see that there are two commas instead of one in the fourth line.
How to preserve the format when using DataFrame.to_csv?

pandas is preserving the format: the 3rd row has no sex, so the CSV has to contain an empty field there, and that is why you get two commas - you are delimiting an empty column.
Your original text file was not a valid CSV file: every line ends with a trailing comma, and the Peter row has one field fewer than the others.
What you want to do is something else, which is not to write a valid CSV file. You will have to do this yourself; I do not know of an existing method that produces your format.
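For example, a minimal hand-rolled writer along these lines should reproduce the original layout; it assumes, as in your sample, that only trailing fields are ever missing:
import pandas as pd

df = pd.read_csv('data.csv', usecols=['id', 'name', 'sex'])

with open('new_data.csv', 'w') as f:
    # Keep the trailing comma the original file has on every line.
    f.write(','.join(df.columns) + ',\n')
    for row in df.itertuples(index=False):
        # Drop missing trailing fields so Peter's row comes out as '3,Peter,'.
        vals = [str(v) for v in row if pd.notna(v)]
        f.write(','.join(vals) + ',\n')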

The problem is that your file has a trailing comma after the sex column on every line, so read_csv thinks there is one more column, which has no name and no data.
df = pd.read_csv('data.csv')
df
   id   name  sex  Unnamed: 3
0   1    Sam    M         NaN
1   2    Ann    F         NaN
2   3  Peter  NaN         NaN
3   4    Ben    M         NaN
Hence you have an extra Unnamed column, and when to_csv writes the file it emits two empty fields in the Peter row, which is why you see two commas.
Try (note the parameter is spelled usecols, not use_cols):
df = pd.read_csv('data.csv', usecols=['id', 'name', 'sex'])
df.to_csv('new_data.csv', index=False)

Related

Csv from Kaggle puts all columns into 1 - how to separate with pd.read_csv and make usable df

I just downloaded this CSV from Kaggle:
https://www.kaggle.com/psvishnu/bank-direct-marketing?select=bank-full.csv
However, when it downloads, all the 17 or so columns are in 1, so when I use
df = pd.read_csv('bank-full.csv')
it too has all values in one column.
Any thoughts would be great, I haven't come across this issue before, thanks!
df sample
58;"management";"married";"tertiary";"no";2143;"yes";"no";"unknown";5;"may";261;1;-1;0;"unknown";"no"
0 44;"technician";"single";"secondary";"no";29;"yes";"no";"unknown";5;"may";151;1;-1;0;"unknown";"no"
1 33;"entrepreneur";"married";"secondary";"no";2;"yes";"yes";"unknown";5;"may";76;1;-1;0;"unknown";"no"
2 47;"blue-collar";"married";"unknown";"no";1506;"yes";"no";"unknown";5;"may";92;1;-1;0;"unknown";"no"
3 33;"unknown";"single";"unknown";"no";1;"no";"no";"unknown";5;"may";198;1;-1;0;"unknown";"no"
4 35;"management";"married";"tertiary";"no";231;"yes";"no";"unknown";5;"may";139;1;-1;0;"unknown";"no"
5 28;"management";"single";"tertiary";"no";447;"yes";"yes";"unknown";5;"may";217;1;-1;0;"unknown";"no"
6 42;"entrepreneur";"divorced";"tertiary";"yes";2;"yes";"no";"unknown";5;"may";380;1;-1;0;"unknown";"no"
7 58;"retired";"married";"primary";"no";121;"yes";"no";"unknown";5;"may";50;1;-1;0;"unknown";"no"
8 43;"technician";"single";"secondary";"no";593;"yes";"no";"unknown";5;"may";55;1;-1;0;"unknown";"no"
9 41;"admin.";"divorced";"secondary";"no";270;"yes";"no";"unknown";5;"may";222;1;-1;0;"unknown";"no"
You can do this:
import pandas as pd

df = pd.read_csv('<filename.csv>', sep=';')  # or delimiter=';', an alias for sep
print(df)
Your file's columns are separated by ;, so we set the separator to ;.
You can find more information about read_csv in the documentation.
You can also pass the delimiter argument to read_csv to set the separator:
df = pd.read_csv('bank-full.csv', delimiter=';')
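As a quick sanity check, the shape should now match the roughly 17 columns mentioned in the question:
import pandas as pd

df = pd.read_csv('bank-full.csv', delimiter=';')
print(df.shape)             # expect 17 columns rather than 1
print(df.columns.tolist())  # the individual column names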

Skip initial empty rows and columns while reading in pandas

I have an Excel sheet like the one below (screenshot omitted): the data block sits in the middle of the sheet, surrounded by empty rows and columns.
I have to read the Excel file and do some operations, and the problem is I have to skip the empty rows and columns. In the example it should read only from B3:D6, but with the code below it reads in all the empty rows as well.
Code I'm using:
import pandas as pd
user_input = input("Enter the path of your file: ")
user_input_sheet_master = input("Enter the Sheet name : ")
master = pd.read_excel(user_input, user_input_sheet_master)
print(master.head(5))
How do I ignore the empty rows and columns to get the output below?
ColA ColB ColC
0 10 20 30
1 23 NaN 45
2 NaN 30 50
Based on some research I have tried df.dropna(how='all'), but it also deleted ColA and ColB. I cannot hardcode values for skiprows or skipped columns because the layout is not the same every time: the number of rows and columns to skip may vary, and sometimes there are no empty rows or columns at all, in which case nothing should be deleted.
You do need dropna, applied along both axes:
df = df.dropna(how='all').dropna(axis=1, how='all')
EDIT:
Suppose we have a file like the first screenshot (omitted here), with the data block offset by empty rows and columns, and we read it with header=None so the header row stays in the data:
df = pd.read_excel('tst1.xlsx', header=None)
df = df.dropna(how='all').dropna(how='all', axis=1)
headers = df.iloc[0]
new_df = pd.DataFrame(df.values[1:], columns=headers)
new_df then has the proper column names and no empty rows or columns. Starting from the other layouts in the screenshots (a different offset, or no offset at all) and running exactly the same code gives the same result as in the first case.
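Since the screenshots are not reproduced here, a minimal self-contained sketch of the same idea, mimicking a data block that starts at B3 as in the question:
import numpy as np
import pandas as pd

# One empty column and two empty rows in front of the header row,
# like a block starting at cell B3.
raw = pd.DataFrame([
    [np.nan, np.nan, np.nan, np.nan],
    [np.nan, np.nan, np.nan, np.nan],
    [np.nan, 'ColA', 'ColB', 'ColC'],
    [np.nan, 10,     20,     30],
    [np.nan, 23,     np.nan, 45],
    [np.nan, np.nan, 30,     50],
])

df = raw.dropna(how='all').dropna(how='all', axis=1)
headers = df.iloc[0]
new_df = pd.DataFrame(df.values[1:], columns=headers)
print(new_df)  # ColA/ColB/ColC with the three data rows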

How do I parse a JSON string in a CSV column and break it down into multiple columns?

My objective is to read a JSON string located in the 4th column, "REQUEST_RE", of a CSV file and break that column down into its own individual columns.
My data is in the following format in the 4th column of each CSV row:
Row 2: {"Fruit":"Apple","Cost":"1.5","Attributes":{"ID":"001","Country":"America"}}
Row 3: {"Fruit":"Orange","Cost":"2.0","Attributes":{"ID":"002","Country":"China"}}
to be changed into a table with one column per key (screenshot omitted).
I was trying this:
Parsing a JSON string which was loaded from a CSV using Pandas
and I ended up using this:
InterimReport = pd.read_csv(filename, index_col=False, usecols=lambda col: col not in ["SYSTEMID"])
InterimReport.join(InterimReport['REQUEST_RE'].apply(json.loads).apply(pd.Series))
but I was unable to split my JSON string into columns; it remained a single, unchanged string column.
Note that join returns a new DataFrame, and you never assign its result back, which is one reason nothing appeared to change. A cleaner approach: load the CSV ignoring the JSON string at first, then convert the column to a list of dicts and normalize it:
tmp = pd.json_normalize(InterimReport['REQUEST_RE'].apply(json.loads).tolist()).rename(
    columns=lambda x: x.replace('Attributes.', ''))
You should get something like:
Fruit Cost ID Country
0 Apple 1.5 001 America
1 Orange 2.0 002 China
That you can easily concat to the original dataframe:
InterimReport = pd.concat([InterimReport.drop(columns=['REQUEST_RE']), tmp], axis=1)
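Put together as a runnable sketch, using the two sample rows from the question in place of the real file:
import json
import pandas as pd

# Stand-in for the frame read from the CSV; only the JSON column is shown.
InterimReport = pd.DataFrame({'REQUEST_RE': [
    '{"Fruit":"Apple","Cost":"1.5","Attributes":{"ID":"001","Country":"America"}}',
    '{"Fruit":"Orange","Cost":"2.0","Attributes":{"ID":"002","Country":"China"}}',
]})

# Flatten the nested JSON, then strip the 'Attributes.' prefix from the names.
tmp = pd.json_normalize(InterimReport['REQUEST_RE'].apply(json.loads).tolist()).rename(
    columns=lambda x: x.replace('Attributes.', ''))
InterimReport = pd.concat([InterimReport.drop(columns=['REQUEST_RE']), tmp], axis=1)
print(InterimReport)  # Fruit, Cost, ID, Country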

pandas multiple separator not working

I'm having an issue importing a dataset with multiple separators. The files are mostly tab separated, but there is a single column that has around 700 values that are all semi-colon delimited.
I saw a previous similar question and the solution is simply to specify multiple separators as follows using the 'sep' argument:
dforigin = pd.read_csv(filename, header=0, skiprows=6,
                       skipfooter=1, sep='\t|;', engine='python')
This does not work for some reason. If I do this it just looks like a mess. Up to this point my workaround has been to import the file as tab-separated, cut out the offending column ('emg data', which is offscreen just to the right of the last column) and save as a temporary .csv, reimport the data and then append it to the initial dataframe.
My workaround feels a bit sloppy and I'm wondering if anybody can help make it a cleaner process.
IIUC, you want the semicolon-delimited values from that one column to each occupy a column in your data frame, alongside the other initial columns from your file. In that case, I'd suggest you read in the file with sep='\t' and then split out the semicolon column afterwards.
With sample data:
data = {'foo': [1, 2, 3], 'bar': ['a;b;c', 'i;j;k', 'x;y;z']}
df = pd.DataFrame(data)
df
   foo    bar
0    1  a;b;c
1    2  i;j;k
2    3  x;y;z
Concatenate df with a new data frame constructed from the split semicolon column (note that df.drop('bar', 1) with a positional axis argument no longer works in recent pandas):
pd.concat([df.drop(columns='bar'),
           df.bar.str.split(';', expand=True)], axis=1)
foo 0 1 2
0 1 a b c
1 2 i j k
2 3 x y z
Note: If your actual data don't include a column name for the semicolon-separated column, but if it's definitely the last column in the table, then per unutbu's suggestion, replace df.bar with df.iloc[:, -1].
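Applied to the original file, a sketch along these lines should work; the filename and the assumption that the semicolon-packed 'emg data' column is the last one come from the question:
import pandas as pd

filename = 'data.txt'  # placeholder path, substitute your own

# Read as tab-separated only, leaving the semicolon-packed column intact.
df = pd.read_csv(filename, header=0, skiprows=6,
                 skipfooter=1, sep='\t', engine='python')

# Split the last column into one column per semicolon-delimited value.
emg = df.iloc[:, -1].str.split(';', expand=True)
df = pd.concat([df.iloc[:, :-1], emg], axis=1)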

Using Python Pandas to fill a new table with NaN values

I've imported data from a csv file which has columns NAME, ADDRESS, PHONE_NUMBER.
Sometimes at least one of the columns has a missing value for a row, e.g.
0 - Bill, Flat 2, 555123
1 - Katie, NaN, NaN
2 - Ruth, Flat 1, ?
I'm trying to get the rows with missing values into a new table, which I can do when a filler value has already been put in, such as:
newDetails = details[details['PHONE_NUMBER'] == "?"]
which gives me:
2 - Ruth, Flat 1, ?
I tried to use fillna but I couldn't find the syntax that would work.
Pandas fillna (pandas.DataFrame.fillna) is quite simple. Suppose your data frame is df. Here's how you can do it:
df.fillna('_missing_value_', inplace=True)
It looks like you have missing values in different fields. Maybe try this:
df = df.where(pd.notnull(df), '_missing_value_')
Edit 1: to replace in a single column
Note that 'Flat 2' is a value, not a column; the column in your data is ADDRESS. To fill missing values in just that column:
df['ADDRESS'] = df['ADDRESS'].fillna('?')
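A runnable sketch using the sample rows from the question (Ruth's phone number already carries the '?' filler):
import numpy as np
import pandas as pd

details = pd.DataFrame({
    'NAME': ['Bill', 'Katie', 'Ruth'],
    'ADDRESS': ['Flat 2', np.nan, 'Flat 1'],
    'PHONE_NUMBER': ['555123', np.nan, '?'],
})

# Fill every NaN with the '?' marker, then filter as before ...
details = details.fillna('?')
newDetails = details[details['PHONE_NUMBER'] == '?']
print(newDetails)  # Katie and Ruth

# ... or, before filling, select rows where the value is genuinely missing:
# newDetails = details[details['PHONE_NUMBER'].isna()]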
