pandas multiple separator not working - python

I'm having an issue importing a dataset with multiple separators. The files are mostly tab-separated, but there is a single column with around 700 values that are all semicolon-delimited.
I saw a previous similar question and the solution is simply to specify multiple separators as follows using the 'sep' argument:
dforigin = pd.read_csv(filename, header=0, skiprows=6,
                       skipfooter=1, sep='\t|;', engine='python')
This does not work for some reason; the result is just a mess. Up to this point my workaround has been to import the file as tab-separated, cut out the offending column ('emg data'), save it as a temporary .csv, re-import that, and then append it to the initial dataframe.
My workaround feels a bit sloppy and I'm wondering if anybody can help make it a cleaner process.

IIUC, you want the semicolon-delimited values from that one column to each occupy a column in your data frame, alongside the other initial columns from your file. In that case, I'd suggest you read in the file with sep='\t' and then split out the semicolon column afterwards.
With sample data:
data = {'foo':[1,2,3], 'bar':['a;b;c', 'i;j;k', 'x;y;z']}
df = pd.DataFrame(data)
df
bar foo
0 a;b;c 1
1 i;j;k 2
2 x;y;z 3
Concatenate df with a new data frame built by splitting the semicolon column:
pd.concat([df.drop('bar', axis=1),
           df.bar.str.split(";", expand=True)], axis=1)
foo 0 1 2
0 1 a b c
1 2 i j k
2 3 x y z
Note: If your actual data don't include a column name for the semicolon-separated column, but if it's definitely the last column in the table, then per unutbu's suggestion, replace df.bar with df.iloc[:, -1].
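Putting it together for the file in the question, here is a minimal sketch (reusing the question's filename variable and read settings, and assuming the packed column really is named 'emg data'):

import pandas as pd

# Read the file as plain tab-separated data first
df = pd.read_csv(filename, header=0, skiprows=6,
                 skipfooter=1, sep='\t', engine='python')
# Expand the semicolon-packed column into its ~700 component columns
emg = df['emg data'].str.split(';', expand=True)
# Replace the packed column with the expanded columns
df = pd.concat([df.drop('emg data', axis=1), emg], axis=1)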

Related

Csv from Kaggle puts all columns into 1 - how to separate with pd.read_csv and make usable df

I just downloaded this CSV from kaggle
https://www.kaggle.com/psvishnu/bank-direct-marketing?select=bank-full.csv
However, when it downloads, all the 17 or so columns are in 1, so when I use
df = pd.read_csv('bank-full.csv')
it too has all values in one column.
Any thoughts would be great, I haven't come across this issue before, thanks!
df sample
58;"management";"married";"tertiary";"no";2143;"yes";"no";"unknown";5;"may";261;1;-1;0;"unknown";"no"
0 44;"technician";"single";"secondary";"no";29;"yes";"no";"unknown";5;"may";151;1;-1;0;"unknown";"no"
1 33;"entrepreneur";"married";"secondary";"no";2;"yes";"yes";"unknown";5;"may";76;1;-1;0;"unknown";"no"
2 47;"blue-collar";"married";"unknown";"no";1506;"yes";"no";"unknown";5;"may";92;1;-1;0;"unknown";"no"
3 33;"unknown";"single";"unknown";"no";1;"no";"no";"unknown";5;"may";198;1;-1;0;"unknown";"no"
4 35;"management";"married";"tertiary";"no";231;"yes";"no";"unknown";5;"may";139;1;-1;0;"unknown";"no"
5 28;"management";"single";"tertiary";"no";447;"yes";"yes";"unknown";5;"may";217;1;-1;0;"unknown";"no"
6 42;"entrepreneur";"divorced";"tertiary";"yes";2;"yes";"no";"unknown";5;"may";380;1;-1;0;"unknown";"no"
7 58;"retired";"married";"primary";"no";121;"yes";"no";"unknown";5;"may";50;1;-1;0;"unknown";"no"
8 43;"technician";"single";"secondary";"no";593;"yes";"no";"unknown";5;"may";55;1;-1;0;"unknown";"no"
9 41;"admin.";"divorced";"secondary";"no";270;"yes";"no";"unknown";5;"may";222;1;-1;0;"unknown";"no"
You can do this:
import pandas as pd
df = pd.read_csv("<filename.csv>", sep=";")  # or you may use delimiter=";"
print(df)
Your file's columns are separated by ;, so we set the separator to ;.
You can find more information about read_csv in the documentation.
You can use the delimiter argument of read_csv to set the separator character:
df = pd.read_csv('bank-full.csv', delimiter=';')
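As a quick sanity check, reading one of the sample rows above with sep=';' yields seventeen separate columns; a sketch, using io.StringIO to stand in for the file:

import pandas as pd
from io import StringIO

row = '58;"management";"married";"tertiary";"no";2143;"yes";"no";"unknown";5;"may";261;1;-1;0;"unknown";"no"'
df = pd.read_csv(StringIO(row), sep=';', header=None)
print(df.shape)  # (1, 17) -- the quoted fields parse as plain strings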

How to preserve the format when writing to csv using pandas?

I have a text file like this:
id,name,sex,
1,Sam,M,
2,Ann,F,
3,Peter,
4,Ben,M,
Then, I read the file:
df = pd.read_csv('data.csv')
After that, I write it to another file:
df.to_csv('new_data.csv', index = False)
Then, I get
id,name,sex,Unnamed: 3
1,Sam,M,
2,Ann,F,
3,Peter,,
4,Ben,M,
You see that there are two commas instead of one in the fourth line.
How can I preserve the format when using to_csv?
pandas is preserving the format: the 3rd row has no sex, so the CSV should have an empty column there. That is why you get two commas; they delimit an empty field.
Your original text file was not a valid CSV file.
What you want to produce is therefore not a valid CSV file either; you will have to write it yourself, as I do not know of any existing method that creates your format.
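A quick way to see the inconsistency is to count the fields per row; a sketch using the standard csv module:

import csv
from io import StringIO

text = 'id,name,sex,\n1,Sam,M,\n2,Ann,F,\n3,Peter,\n4,Ben,M,\n'
rows = list(csv.reader(StringIO(text)))
print([len(r) for r in rows])  # [4, 4, 4, 3, 4] -- the Peter row has one field fewer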
The problem is that you have a trailing comma after the sex column in your file, so read_csv thinks there is an extra column with no name and no data.
df = pd.read_csv('data.csv')
df
id name sex Unnamed: 3
0 1 Sam M NaN
1 2 Ann F NaN
2 3 Peter NaN NaN
3 4 Ben M NaN
Hence you have an extra Unnamed column. So when you call to_csv, it writes two empty values in the 3rd row, which is why you see two commas.
Try:
df = pd.read_csv('data.csv', usecols=['id', 'name', 'sex'])
df.to_csv('new_data.csv', index = False)
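An alternative sketch, in case you would rather not list the columns by hand: drop whatever placeholder columns read_csv auto-named (this assumes the trailing-comma column is the only one pandas names 'Unnamed: ...'):

import pandas as pd

df = pd.read_csv('data.csv')
# Drop the columns that read_csv auto-named because of the trailing commas
df = df.loc[:, ~df.columns.str.startswith('Unnamed')]
df.to_csv('new_data.csv', index=False)

Like the usecols approach, this writes the rows without a trailing comma, since that comma was never part of a valid CSV layout to begin with.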

Pandas drop first columns after csv read

Is there a way to reference an object within the line of its instantiation?
See the following example:
I wanted to drop the first column (by index) of a csv file just after reading it (pd.to_csv usually writes the index out as the first column):
df = pd.read_csv(csvfile).drop(self.columns[[0]], axis=1)
I understand that self belongs in an object context; here it just describes what I intend to do.
(Of course, doing this operation in two separate lines works perfectly.)
One way is to use pd.DataFrame.iloc:
import pandas as pd
from io import StringIO
mystr = StringIO("""col1,col2,col3
a,b,c
d,e,f
g,h,i
""")
df = pd.read_csv(mystr).iloc[:, 1:]
# col2 col3
# 0 b c
# 1 e f
# 2 h i
Assuming you know the total number of columns in the dataset and the indexes you want to remove:
a = list(range(3))  # range objects have no .remove() in Python 3, so build a list
a.remove(1)
df = pd.read_csv('test.csv', usecols=a)
Here 3 is the total number of columns, and I wanted to remove the 2nd column (index 1). You can also write the indexes of the columns to keep directly.
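If the unwanted first column is in fact the index that to_csv wrote out, as the question suggests, another option is to read it back as the index rather than as data:

import pandas as pd

df = pd.read_csv('test.csv', index_col=0)  # treat the first column as the index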

Dynamic - Automated multiplication - Pandas dataframes

After spending quite a while searching and reading on Stack Overflow and around the web, I am desperate...
I have a Pandas DataFrame with some imported data (spectra). The first column is the wavelength while the others are the various spectra (the data). The names of the columns are imported from a list that reads the filenames from a path and keeps just the names.
What I would like to achieve, and can't quite figure out, is to multiply each of the columns by the wavelength column and either overwrite the existing ones or create a new dataframe (it doesn't matter much which).
This is the code I have so far that does the job (even if it's not the most elegant, it gets the job done):
import glob
import os
import pandas as pd

path = r'thePathToData\PL_calc\Data_NIR'
idx = 0
# Create the DataFrame with all the data from the path above; use the filenames as column names
all_files = glob.glob(os.path.join(path, "*.asc"))
df = pd.concat((pd.read_csv(f, usecols=[1], sep='\t') for f in all_files), axis=1)  # usecols=[1] for the spectrum only
fileNames = []  # create a list for the filenames
for i in range(0, len(all_files)):
    fileNames.append(all_files[i][71:-4])  # keep just the name, stripping the path prefix and the .asc extension
df.columns = fileNames  # assign the filenames as column names
wavelengths = pd.read_csv(all_files[0], usecols=[0], sep='\t')  # read the wavelength column
df.insert(loc=idx, column='Wavelength', value=wavelengths)  # insert it as the first column
If I plot just the head of the DF it looks like this:
Wavelength F8BT_Pure_Batch1_px1_spectra_4V \ ...
0 478.0708 -3.384101
1 478.3917 -1.580399
2 478.7126 -0.323580
3 479.0334 -1.131425
4 479.3542 1.202728
The complete DF is:
1599 rows × 46 columns
Question 1:
I can't quite find an automated (dynamic) way of multiplying each col with the first one, essentially this:
for i in range(1, len(df.columns)):
    df[[i]] = df[[0]] * df[[i]]
Question 2:
Why does this work:
df['F8BT_Pure_Batch1_px1_spectra_4V'] = df['Wavelength']*df['F8BT_Pure_Batch1_px1_spectra_4V']
while this doesn't and gives me an "IndexError: indices are out-of-bounds"
df[[1]] = df[[0]]*df[[1]]
But when I print(df[['Wavelength']]) and print(df[[0]]), I get the same numbers: a 1599-row column, Name: Wavelength, dtype: float64.
Question 3:
Why does df[fileNames] = df[fileNames].multiply(df.Wavelength) give me a ValueError: Columns must be same length as key? All the columns are of the same length (1599 rows, indexed 0-1598, with 46 columns in total in this case). fileNames contains the names of the imported files, which are also the names of the columns of the dataframe.
Many many thanks in advance for your help...
Alex
Question 1
To multiply your wavelength column by every other column in your DataFrame, you can use:
df.iloc[:, 1:] = df.iloc[:, 1:].mul(df['Wavelength'], axis=0)
This assumes your wavelength column is the first column.
Question 2
Selecting columns like that with an integer asks for columns of your DataFrame that are literally named 0, 1, etc., as ints; there are none in your DataFrame. To select columns by position, look into the documentation for pandas' iloc method.
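A toy sketch of the difference (the column names here are made up):

import pandas as pd

df = pd.DataFrame({'Wavelength': [478.1], 'spec': [-3.38]})
print(df.iloc[:, [0]])  # first column selected by position -- works
print(df[[0]])          # asks for a column literally named 0 -> raises an error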
Question 3
When you call df[fileNames], you get a DataFrame with as many columns as your list fileNames has entries. Without axis=0, df[fileNames].multiply(df.Wavelength) tries to align the Series index with the DataFrame's column labels, so the result does not have the same columns as df[fileNames], and the assignment fails. Using the axis=0 parameter in the multiply function works for me.
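Here is a toy sketch of that fix, with hypothetical column names standing in for the real spectra:

import pandas as pd

df = pd.DataFrame({'Wavelength': [478.0, 478.3, 478.7],
                   'spec_a': [2.0, 3.0, 4.0],
                   'spec_b': [5.0, 6.0, 7.0]})
fileNames = ['spec_a', 'spec_b']
# axis=0 aligns the Wavelength Series with the rows, so every spectrum
# column is scaled element-wise; without it pandas aligns the Series
# index with the column labels and the shapes no longer match
df[fileNames] = df[fileNames].multiply(df['Wavelength'], axis=0)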

How to keep leading zeros in a column when reading CSV with Pandas?

I am importing study data into a Pandas data frame using read_csv.
My subject codes are 6-digit numbers encoding, among other things, the day of birth. For some of my subjects this results in a code with a leading zero (e.g. "010816").
When I import into Pandas, the leading zero is stripped off and the column is formatted as int64.
Is there a way to import this column unchanged maybe as a string?
I tried using a custom converter for the column, but it does not work - it seems as if the custom conversion takes place before Pandas converts to int.
As indicated in this answer by Lev Landau, a simple solution is to use the converters option for the column in question in the read_csv function:
converters={'column_name': str}
Let's say I have csv file projects.csv like below:
project_name,project_id
Some Project,000245
Another Project,000478
For example, the code below trims the leading zeros:
from pandas import read_csv
dataframe = read_csv('projects.csv')
print(dataframe)
Result:
project_name project_id
0 Some Project 245
1 Another Project 478
Solution code example:
from pandas import read_csv
dataframe = read_csv('projects.csv', converters={'project_id': str})
print(dataframe)
Required result:
project_name project_id
0 Some Project 000245
1 Another Project 000478
To have all columns as str:
pd.read_csv('sample.csv', dtype=str)
To have certain columns as str:
# column names which need to be string
lst_str_cols = ['prefix', 'serial']
dict_dtypes = {x: 'str' for x in lst_str_cols}
pd.read_csv('sample.csv', dtype=dict_dtypes)
Here is a shorter, robust, and fully working solution: simply define a mapping (dictionary) from column names to the desired data types:
dtype_dic = {'subject_id': str,
             'subject_number': 'float'}
Then use that mapping with pd.read_csv():
df = pd.read_csv(yourdata, dtype=dtype_dic)
et voila!
If you have a lot of columns and you don't know which ones contain leading zeros that might be missed, or if you just need to automate your code, you can do the following:
df = pd.read_csv("your_file.csv", nrows=1)  # just read the first row to extract the column names
col_str_dic = {column: str for column in list(df)}
df = pd.read_csv("your_file.csv", dtype=col_str_dic)  # now you can read the complete file
You could also do:
df = pd.read_csv("your_file.csv", dtype=str)
By doing this you will have all your columns as strings and you won't lose any leading zeros.
You can do this; it works on all versions of pandas:
pd.read_csv('filename.csv', dtype={'zero_column_name': object})
You can use converters to pad the number to a fixed width if you know the width.
For example, if the width is 5, then
data = pd.read_csv('text.csv', converters={'column1': lambda x: str(x).zfill(5)})
This will do the trick. Wrapping in str() also keeps it working with read_excel, where the converter may receive an already-parsed number rather than a string.
I don't think you can specify a column type the way you want (unless there have been changes recently, and unless the 6-digit number is a date that you can convert to datetime). You could try using np.genfromtxt() and create the DataFrame from there.
EDIT: Take a look at Wes McKinney's blog; there might be something for you there. It seems a new parser is coming in pandas 0.10, in November.
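For what it's worth, a sketch of the np.genfromtxt() route (the delimiter and column names here are assumptions):

import numpy as np
import pandas as pd

# dtype=str keeps every field, leading zeros included, as text
arr = np.genfromtxt('data.csv', delimiter=',', dtype=str, skip_header=1)
df = pd.DataFrame(arr, columns=['subject_id', 'other_col'])  # hypothetical names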
As an example, consider the following my_data.txt file:
id,A
03,5
04,6
To preserve the leading zeros for the id column:
df = pd.read_csv("my_data.txt", dtype={"id":"string"})
df
id A
0 03 5
1 04 6
