I wanted to import this [dataset][1] named "wind.data" to perform some operations on it, but I couldn't find a way to turn it into a proper table-like structure.
This is what it looks like after importing:
[screenshot: wind dataframe]
I tried using the sep=' ' parameter, as in pd.read_csv('wind.data', sep=' '), but it doesn't work.
How do I separate the column names and their respective values from this dataset?
[1]:
The file is not comma-separated (or separated by any other character); it is fixed-width formatted.
Instead of trying to force read_csv to handle it correctly, you should use read_fwf.
df = pd.read_fwf("wind.data", header=1)
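A quick way to check that the parse worked (a minimal sketch; whether header=1 is right depends on this particular file's layout):
import pandas as pd

# read_fwf infers the column widths from the data itself; header=1 assumes
# the column names sit on the file's second line, as in the answer above.
df = pd.read_fwf("wind.data", header=1)

print(df.shape)   # number of rows and columns actually detected
print(df.head())  # values should now land in separate columns
print(df.dtypes)  # numeric columns should no longer be plain object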
Try:
pd.read_csv('wind.data', delimiter=r'\s+')
The r'\s+' pattern is needed because there is not always a single space between columns; it matches any run of whitespace as one separator.
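A small self-contained illustration (the sample rows here are made up, not the real wind data):
import pandas as pd
from io import StringIO

# Columns separated by a varying number of spaces, like the wind data.
sample = "Yr  Mo  Dy   RPT   VAL\n61   1   1 15.04 14.96\n61   1   2 14.71 16.88\n"

df = pd.read_csv(StringIO(sample), sep=r'\s+')  # any run of whitespace counts as one separator
print(df)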
I have a Pandas dataframe that I'm outputting to csv. I would like to keep the data types (i.e. not convert everything to string). I need to format the date properly and there are other non-float columns.
How do I remove trailing zeros from the floats while not changing datatypes? This is what I've tried:
pd.DataFrame(myDataFrame).to_csv("MyOutput.csv", index=False, date_format='%m/%d/%Y', float_format="%.8f")
For example, this:
09/26/2022,43.27334000,2,111.37000000
09/24/2022,16.25930000,5,73.53000000
Should be this:
09/26/2022,43.27334,2,111.37
09/24/2022,16.2593,5,73.53
Any help would be greatly appreciated!
You can write the file like this, without the float_format. Also, if myDataFrame is already a DataFrame object, you don't need the pd.DataFrame(...) wrapper; you can just do the following:
myDataFrame.to_csv("MyOutput.csv", index=False, date_format='%m/%d/%Y')
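A minimal sketch with made-up data showing that the trailing zeros disappear once float_format is dropped (the column names here are placeholders):
import pandas as pd

myDataFrame = pd.DataFrame({
    'date': pd.to_datetime(['2022-09-26', '2022-09-24']),
    'a': [43.27334, 16.2593],   # floats keep their natural representation
    'b': [2, 5],                # integers stay integers
    'c': [111.37, 73.53],
})

myDataFrame.to_csv("MyOutput.csv", index=False, date_format='%m/%d/%Y')
print(open("MyOutput.csv").read())
# date,a,b,c
# 09/26/2022,43.27334,2,111.37
# 09/24/2022,16.2593,5,73.53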
I am importing a file that is semicolon-delimited. My code:
df = pd.read_csv('bank-full.csv', sep = ';')
print(df.shape)
When I use this in Jupyter Notebooks and Spyder I get a shape output of (45211, 1). When I print my dataframe the data looks like this at this point:
<bound method NDFrame.head of age;"job";"marital";"education";"default";"balance";"housing";"loan";"contact";"day";"month";"duration";"campaign";"pdays";"previous";"poutcome";"y"
0 58;"management";"married";"tertiary";"no";2143...
I can get the correct shape by using
df = pd.read_csv('bank-full.csv', sep = '[;]')
print(df.shape)
or
df = pd.read_csv('bank-full.csv', sep = '\;')
print(df.shape)
However, when I do this the data seems to get pulled in as though each row were a single string. The first and last columns get double quotation marks added at the beginning and end respectively, and nothing I try will strip them, so either way I am stuck with many of my columns typed as object and unable to force them into integers when needed. My data comes out like this:
"age ""job"" ""marital"" ""education"" ""default"" \
0 "58 ""management"" ""married"" ""tertiary"" ""no""
with final column:
""y"""
0 ""no"""
I have reached out to those in my class and had them send me their .csv file, restarted from scratch, tried a different UI, and even copy/pasted their line of code to read and shape the data, and I still get nothing. I have used every resource except asking here, and I am out of ideas.
CSVs are usually separated by commas, but sometimes the cells are separated by a different character (or characters). Since I don't have access to your exact dataset, I will give you advice that should help you overall.
First, look at the CSV and assess what character(s) are separating each value, then use that as the value in "sep" during your pd.read_csv() call.
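If it's hard to eyeball, here is a hedged sketch of letting Python guess the delimiter for you (the file name is taken from your question; Sniffer can fail on unusual files):
import csv

with open('bank-full.csv', newline='') as fh:
    sample = fh.read(2048)

print(sample.splitlines()[0])        # look at the raw header row
dialect = csv.Sniffer().sniff(sample)
print(repr(dialect.delimiter))       # Sniffer's best guess at the separator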
Then, whatever columns you want to convert to numeric, you can use pd.to_numeric() to convert the data type. This may present problems if any of the values in the column cannot be converted to numeric, and you will then need to do additional data cleaning.
Below is an example of how to do this to a particular column that I am calling "col":
import pandas as pd
df = pd.read_csv('bank-full.csv', sep = '[;]')
df['col'] = pd.to_numeric(df['col'])
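Continuing from the snippet above with the same placeholder column, one option for the "additional data cleaning" case (my addition, not part of the original suggestion) is errors='coerce', which turns unparseable values into NaN so you can find and fix them:
df['col'] = pd.to_numeric(df['col'], errors='coerce')
print(df[df['col'].isna()])   # rows that still need cleaning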
Let me know if you have further questions, or better yet, share the data with me if you can't get this to work for you.
I know where this error is coming from.
I try to df = pd.read_csv("\words.csv")
In this CSV, I have a column with each row filled with text.
Sometimes this text contains the separator character, a comma (,).
But the text contains practically every possible symbol, so I can't pick a good separator (it contains ; too).
The only thing I know is that I only need one column. Is there a way to force the number of columns and not interpret the other "separators"?
Since you are aiming to have one column, one way to achieve this goal is to use a newline \n as a separator like so:
import pandas as pd
df = pd.read_csv("\words.csv", sep="\n")
Since there is always exactly one line per row, it is bound to detect a single column.
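An alternative sketch (my own, not part of the answer above): skip delimiter handling entirely by reading the raw lines and building a one-column frame. The file name mirrors the question; adjust the path as needed.
import pandas as pd

with open('words.csv', encoding='utf-8') as fh:
    lines = [line.rstrip('\n') for line in fh]

df = pd.DataFrame({'text': lines})   # one row per line; commas and semicolons are left untouched
print(df.head())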
I have a dataframe, for example df.
I'm trying to replace the dots with commas to be able to do calculations in Excel.
I used :
df = df.stack().str.replace('.', ',').unstack()
or
df = df.apply(lambda x: x.str.replace('.', ','))
Results:
Nothing changes, but I receive this warning at the end of an execution that otherwise completes without errors:
FutureWarning: The default value of regex will change from True to False in a future version. In addition, single character regular expressions will not be treated as literal strings when regex=True.
View of what I have:
Expected results:
Updated question with more information, thanks to @Pythonista anonymous:
print(df.dtypes)
returns :
Date object
Open object
High object
Low object
Close object
Adj Close object
Volume object
dtype: object
I'm extracting data with the to_excel method:
df.to_excel()
I'm not exporting the dataframe to a .csv file but to an .xlsx file.
Where does the dataframe come from - how was it generated? Was it imported from a CSV file?
Your code works if you apply it to columns which are strings, as long as you remember to do
df = df.apply() and not just df.apply(), e.g.:
import pandas as pd
df = pd.DataFrame()
df['a'] =['some . text', 'some . other . text']
df = df.apply(lambda x: x.str.replace('.', ',', regex=False))  # regex=False treats the dot as a literal character and avoids the FutureWarning
print(df)
However, you are trying to do this with numbers, not strings.
To be precise, the other question is: what are the dtypes of your dataframe?
If you type
df.dtypes
what's the output?
I presume your columns are numeric and not strings, right? After all, if they are numbers they should be stored as such in your dataframe.
The next question: how are you exporting this table to Excel?
If you are saving a csv file, pandas' to_csv() method has a decimal argument which lets you specify what should be the separator for the decimals (typically a dot in the English-speaking world and a comma in many countries in continental Europe). Look up the syntax.
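For example (a sketch with made-up data, not yours; note that with a comma as the decimal mark you typically also switch the field separator to a semicolon):
import pandas as pd

df = pd.DataFrame({'price': [1.5, 2.25]})
df.to_csv('out.csv', index=False, sep=';', decimal=',')
print(open('out.csv').read())
# price
# 1,5
# 2,25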
If you are using the to_excel() method, it shouldn't matter because Excel should treat it internally as a number, and how it displays it (whether with a dot or comma for decimal separator) will typically depend on the options set in your computer.
Please clarify how you are exporting the data and what happens when you open it in Excel: does Excel treat it as a string? Or as a number, but you would like to see a different separator for the decimals?
Also look here for how to change decimal separators in Excel: https://www.officetooltips.com/excel_2016/tips/change_the_decimal_point_to_a_comma_or_vice_versa.html
UPDATE
OP, you have still not explained where the dataframe comes from. Do you import it from an external source? Do you create or calculate it yourself?
The fact that the columns are objects makes me think they are either stored as strings, or maybe some rows are numeric and some are not.
What happens if you try to convert a column to float?
df['Open'] = df['Open'].astype('float64')
If the entire column should be numeric but it's not, then start by cleansing your data.
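A hedged sketch of that kind of cleansing (toy data standing in for your df; the real unparseable values could be anything):
import pandas as pd

df = pd.DataFrame({'Open': ['1.23', '4.56', 'n/a']})    # one value is not a number

converted = pd.to_numeric(df['Open'], errors='coerce')  # bad values become NaN
print(df.loc[converted.isna(), 'Open'])                 # inspect what needs fixing
df['Open'] = converted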
Second question: what happens when you use Excel to open the file you have just created? Excel displays a comma, but what character Excel uses to separate decimals depends on the Windows/Mac/Excel settings, not on how pandas created the file. Have you tried the link I gave above? Can you change how Excel displays decimals? Also, does Excel treat those numbers as numbers or as strings?
Any idea why the code below can't keep the first column of my csv file? I would like to keep several columns in a new csv file, the first column included. When I include the name of the first column in the list, I get an error:
"Type" not in index.
import pandas as pd
f = pd.read_csv("1.csv")
keep_col = ['Type','Pol','Country','User Site Code','PG','Status']
new_f = f[keep_col]
new_f.to_csv("2.csv", index=False)
Thanks a lot.
Try f.columns.values.tolist() and check the output for the first column. It sounds like there is an encoding issue when you read the CSV. You can try specifying the encoding option in your pd.read_csv() call to see if that gets rid of the extra characters at the front. Otherwise, you can use f.rename(columns={'F48FBFBFType': 'Type'}) to change whatever the current name of your first column is to simply 'Type'.
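If the stray characters at the front are a UTF-8 byte-order mark (an assumption on my part, since that is a common cause of this symptom), reading with the 'utf-8-sig' codec usually removes them:
import pandas as pd

f = pd.read_csv('1.csv', encoding='utf-8-sig')   # strips a leading BOM from the header
print(f.columns.values.tolist())                 # the first column should now be plain 'Type'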
You are better off specifying the columns to read from your csv file; usecols does exactly that:
pd.read_csv('1.csv', usecols=keep_col).to_csv("2.csv", index=False)
Do you have any special characters in your first column?