How to import CSV data into a single-column dataframe? - python

I know where the error is coming from.
I try df = pd.read_csv("\words.csv")
In this CSV, I have a column in which each row is filled with text.
Sometimes this text contains the separator, a comma ,.
But the text contains practically every possible symbol, so I can't pick one good separator! (I have ; in there too.)
The only thing I know is that I only need 1 column. Is there a way to force the number of columns and not interpret the other "separators"?

Since you are aiming to have one column, one way to achieve this goal is to use a newline \n as a separator like so:
import pandas as pd
df = pd.read_csv("\words.csv", sep="\n")
Since there will always be exactly one line per row, it is bound to always detect a single column.
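If your pandas version refuses "\n" as a separator, a minimal fallback sketch (assuming the file really is named words.csv, sits in the working directory, and is UTF-8 encoded) is to read the lines yourself and build the single-column dataframe directly:
import pandas as pd
# Fallback sketch: each line of the file becomes one cell of a single-column DataFrame.
with open("words.csv", encoding="utf-8") as f:
    lines = f.read().splitlines()
df = pd.DataFrame({"text": lines})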

Related

sep=';' not shaping dataframe in Python

I am importing a file that is semicolon-delimited. My code:
df = pd.read_csv('bank-full.csv', sep = ';')
print(df.shape)
When I use this in Jupyter Notebooks and Spyder I get a shape output of (45211, 1). When I print my dataframe the data looks like this at this point:
<bound method NDFrame.head of age;"job";"marital";"education";"default";"balance";"housing";"loan";"contact";"day";"month";"duration";"campaign";"pdays";"previous";"poutcome";"y"
0 58;"management";"married";"tertiary";"no";2143...
I can get the correct shape by using
df = pd.read_csv('bank-full.csv', sep = '[;]')
print(df.shape)
or
df = pd.read_csv('bank-full.csv', sep = '\;')
print(df.shape)
However, when I do this, the data seems to get pulled in as though each row were a single string. The first and last columns gain a leading and a trailing double quotation mark respectively, and nothing I try will strip them, so either way I am stuck with many of my columns typed as objects and unable to force them into integers when needed. My data comes out like this:
"age ""job"" ""marital"" ""education"" ""default"" \
0 "58 ""management"" ""married"" ""tertiary"" ""no""
with final column:
""y"""
0 ""no"""
I have reached out to those in my class and had them send me their .csv file, restarted from scratch, tried a different UI, and even copy/pasted their line of code to read and shape the data and get nothing. I have used every resource except asking this here and am out of ideas.
CSVs are usually separated by commas, but sometimes the cells are separated by a different character (or characters). So, since I don't have access to your exact dataset, I will give you advice that should help you overall.
First, look at the CSV and assess what character(s) are separating each value, then use that as the value in "sep" during your pd.read_csv() call.
Then, whatever columns you want to convert to numeric, you can use pd.to_numeric() to convert the data type. This may present problems if any of the values in the column cannot be converted to numeric, and you will then need to do additional data cleaning.
Below is an example of how to do this to a particular column that I am calling "col":
import pandas as pd
df = pd.read_csv('bank-full.csv', sep = '[;]')
df["col"] = pd.to_numeric(df["col"])  # replace "col" with the actual column name
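If the stray quotation marks described in the question are still attached to the headers or the values, a hedged cleanup sketch (using the "age" column visible in the sample header, and errors="coerce" so unparseable values become NaN instead of raising) could look like this:
import pandas as pd
# Sketch: strip leftover double quotes from the column names and the string values,
# then convert the "age" column to a numeric dtype.
df = pd.read_csv('bank-full.csv', sep='[;]')
df.columns = df.columns.str.strip('"')
df = df.apply(lambda col: col.map(lambda v: v.strip('"') if isinstance(v, str) else v))
df["age"] = pd.to_numeric(df["age"], errors="coerce")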
Let me know if you have further questions, or better yet, share the data with me if you can't get this to work for you.

How to split column without delimiter pandas

I have this CSV file where the first column is the date in the format YYYYMMDD. I don't need the YYYY part; I would like to delete it and keep the MMDD part, but there is no delimiter between them. I've tried a couple of things, and nothing worked except for this method, which loops through each row and deletes the year, but that takes ages for my file of more than a million rows.
This is my loop, but I can't seem to find a way to do it for all rows in one go.
def drop_year(row):
    print(row[0])
    data.iloc[row[0]] = str(row[0])[4:]

[drop_year(row) for row in data.iterrows()]
I found a solution, thank you all for your help.
data["YYYYMMDD"] = data["YYYYMMDD"].astype(str).str[4:]
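A minimal sketch of the same vectorized slice on a toy frame (the sample values below are assumptions for illustration; the column name "YYYYMMDD" is taken from the solution above):
import pandas as pd
# Convert the integer dates to strings and keep only the MMDD part for every row at once.
data = pd.DataFrame({"YYYYMMDD": [20230115, 20231224]})
data["MMDD"] = data["YYYYMMDD"].astype(str).str[4:]
print(data)
#    YYYYMMDD  MMDD
# 0  20230115  0115
# 1  20231224  1224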

How do you separate the Column names and its values in Pandas?

I wanted to import this dataset, named "wind.data", to perform some operations on it, but I couldn't find a way to turn it into a proper table-like structure.
This is how it looks after importing:
[screenshot: wind dataframe]
I tried using the sep=' ' parameter, pd.read_csv('wind.data', sep=' '), but it's not working.
How do I separate the column names and their respective values from this dataset?
The file is not comma-separated (or separated by any other single character) but is fixed-width formatted.
Instead of trying to force read_csv to handle it correctly, you should use read_fwf.
df = pd.read_fwf("wind.data", header=1)
Try:
pd.read_csv('wind.data', delimiter=r'\s+')
The regex is needed because there is not always a single space between columns.
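A small illustrative sketch of why the regex helps (the sample lines and column names below are made up, not the actual wind.data contents): a single-space separator turns every extra space into an empty column, while r'\s+' collapses any run of whitespace.
import pandas as pd
from io import StringIO
# Toy whitespace-separated data with a variable number of spaces between fields.
sample = "Yr Mo Dy   RPT   VAL\n61  1  1 15.04 14.96\n61  1  2 14.71 16.88\n"
df = pd.read_csv(StringIO(sample), sep=r'\s+')
print(df.columns.tolist())   # ['Yr', 'Mo', 'Dy', 'RPT', 'VAL']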

How to prevent truncation using pd.read_sas()

I am new to Python and use the following code to read in a SAS file:
df=pd.read_sas('C:\\test\\test.sas7bdat', format = 'sas7bdat', encoding = 'latin-1')
There are columns which contain either a 7-character string code or just "M" for missing. Columns whose first few rows contain only "M", with the 7-character codes appearing only later, are truncated to just one character for all rows; this does not happen for columns where a 7-character code appears in the first rows.
This is how the original data looks in SAS: [screenshot]
How can I prevent pandas from truncating the text when reading in the data?
Thank you.
Lia

How to search in a pandas dataframe column with a space in the column name

If I need to check whether a value exists in a pandas dataframe column which has a name without any spaces, then I simply do something like this:
if value in df.Timestamp.values
This will work if the column name is Timestamp. However, I have plenty of data with column names like 'Date Time'. How do I use the if ... in statement in that case?
If there is no easy way to check for this using the if in statement, can I search for the existence of the value in some other way? Note that I just need to search for the existence of the value. Also, this is not an index column.
Thank you for any inputs
It's better practice to use the square bracket notation:
df["Date Time"].values
which does exactly the same thing.
There are two ways of indexing columns in pandas: the dot notation you are using, and square brackets. Both work the same way.
if value in df["Date Time"].values
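A minimal sketch of the membership check (the toy "Date Time" values below are assumptions for illustration):
import pandas as pd
# Bracket notation lets you index columns whose names contain spaces.
df = pd.DataFrame({"Date Time": ["2021-01-01 10:00", "2021-01-02 11:30"]})
value = "2021-01-01 10:00"
if value in df["Date Time"].values:
    print("found")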
In the case where you want to work with a column whose header name contains spaces, but you don't want the name changed permanently because you may have to forward the file, one way is to rename it, do whatever you want with the new space-free name, then rename it back. For example, to drop the rows with the value "DUMMY" in the column 'Recipient Fullname':
df.rename(columns={'Recipient Fullname':'Recipient_Fullname'}, inplace=True)
df = df[(df.Recipient_Fullname != "DUMMY")]
df.rename(columns={'Recipient_Fullname':'Recipient Fullname'}, inplace=True)
