I'm trying to read an Excel file into pandas and then split the columns into numeric ones (integers and floats) and everything else. There are numerous columns of different types.
I currently bring in the data with
pd.read_excel
and then split the data up with
DataFrame.select_dtypes("number")
When users upload a time (for example 12:30:00) they expect it to be recognized as a time. However, pandas currently treats it as dtype object.
If I specify the column with parse_dates then it works, but since I don't know what the data is in advance I ideally want this to be done automatically. I've tried setting parse_dates=True, but it doesn't seem to make a difference.
I'm not sure if there is a way to recognize the datetime after the file is uploaded. Again, I would want this to be done without having to specify the column (so that anything that can be converted is converted).
Many Thanks
If your data contains only one column with dtype object (I assume it is a string) you can do the following:
1) filter the column with dtype object
import pandas as pd
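datetime_col = df.select_dtypes(object)
2) convert it to seconds
# .iloc[:, 0] picks the single object column; astype(str) lets time-like values be parsed
datetime_col_in_seconds = pd.to_timedelta(datetime_col.iloc[:, 0].astype(str)).dt.total_seconds()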
Then you can re-append the converted column to your original data and/or do whatever processing you want.
Afterwards, you can convert it back to datetime (the date part defaults to 1970-01-01):
datetime_col = pd.to_datetime(datetime_col_in_seconds, unit='s')
If you have more than one column with dtype object you might have to do some more pre-processing, but I guess this is a good way to start tackling your particular case.
This does what I need
for column_name in df.columns:
    try:
        # assign with df[...] so the column is replaced and takes the timedelta dtype
        df[column_name] = pd.to_timedelta(df[column_name].astype(str))
    except ValueError:
        pass
This tries to convert every column to timedelta. If a column can't be converted, pd.to_timedelta raises a ValueError and the loop moves on to the next column.
After it runs, every column that could be parsed as a timedelta has been converted.
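One caveat with looping over every column is that numeric columns, once cast to strings, may also be parsed as timedeltas. A minimal sketch that limits the attempt to object-dtype columns is shown here; the sample frame and its column names (start, label, count) are made up for illustration, and it assumes the time cells arrive as datetime.time objects:

```python
import datetime

import pandas as pd

# Hypothetical sample data standing in for an uploaded sheet.
df = pd.DataFrame(
    {
        "start": [datetime.time(12, 30), datetime.time(8, 15)],
        "label": ["a", "b"],
        "count": [1, 2],
    }
)

# Only try object-dtype columns, so numeric columns are left untouched.
for column_name in df.select_dtypes(include="object").columns:
    try:
        df[column_name] = pd.to_timedelta(df[column_name].astype(str))
    except ValueError:
        # Not parseable as a timedelta: keep the column as it was.
        pass

print(df.dtypes)
```

Numeric columns are never passed to pd.to_timedelta here, so select_dtypes("number") still sees them unchanged afterwards.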
Related
I am importing a file that is semicolon-delimited. My code:
df = pd.read_csv('bank-full.csv', sep = ';')
print(df.shape)
When I use this in Jupyter Notebooks and Spyder I get a shape output of (45211, 1). When I print my dataframe the data looks like this at this point:
<bound method NDFrame.head of age;"job";"marital";"education";"default";"balance";"housing";"loan";"contact";"day";"month";"duration";"campaign";"pdays";"previous";"poutcome";"y"
0 58;"management";"married";"tertiary";"no";2143...
I can get the correct shape by using
df = pd.read_csv('bank-full.csv', sep = '[;]')
print(df.shape)
or
df = pd.read_csv('bank-full.csv', sep = '\;')
print(df.shape)
However, when I do this the data seems to get pulled in as though each row is a string. The first column gains a leading double quote and the last column a trailing one, and nothing I try strips them, so I am stuck with many of my columns typed as object and unable to convert them to integers when needed. My data comes out like this:
"age ""job"" ""marital"" ""education"" ""default"" \
0 "58 ""management"" ""married"" ""tertiary"" ""no""
with final column:
""y"""
0 ""no"""
I have reached out to those in my class and had them send me their .csv file, restarted from scratch, tried a different UI, and even copy/pasted their line of code to read and shape the data and get nothing. I have used every resource except asking this here and am out of ideas.
CSVs are usually separated by commas, but sometimes the cells are separated by a different character(s). So, since I don't have access to your exact dataset, I will give you advice that should help you overall.
First, look at the CSV and assess what character(s) are separating each value, then use that as the value in "sep" during your pd.read_csv() call.
Then, whatever columns you want to convert to numeric, you can use pd.to_numeric() to convert the data type. This may present problems if any of the values in the column cannot be converted to numeric, and you will then need to do additional data cleaning.
Below is an example of how to do this to a particular column that I am calling "col":
import pandas as pd
df = pd.read_csv('bank-full.csv', sep = '[;]')
df['col'] = pd.to_numeric(df['col'])
Let me know if you have further questions, or better yet, share the data with me if you can't get this to work for you.
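The output in the question suggests that every line of the file is wrapped in one extra pair of double quotes, with the inner quotes doubled. Under that assumption (it cannot be confirmed without the file), one possible approach is to turn off quote handling while reading and strip the quotes afterwards. The sample text and the helper strip_quotes_and_convert below are hypothetical:

```python
import csv
import io

import pandas as pd

# Hypothetical sample mimicking the assumed layout of the file.
raw = (
    '"age;""job"";""balance"";""y"""\n'
    '"58;""management"";2143;""no"""\n'
    '"44;""technician"";29;""no"""\n'
)

# QUOTE_NONE makes the parser treat double quotes as ordinary characters,
# so the semicolons split every line into separate fields.
df = pd.read_csv(io.StringIO(raw), sep=";", quoting=csv.QUOTE_NONE)

# Remove the leftover quotes from the column names.
df.columns = df.columns.str.strip('"')


def strip_quotes_and_convert(series):
    """Strip stray quotes, then convert to numbers where possible."""
    cleaned = series.astype(str).str.strip('"')
    try:
        return pd.to_numeric(cleaned)
    except ValueError:
        return cleaned


for col in df.columns:
    df[col] = strip_quotes_and_convert(df[col])

print(df.dtypes)
```

With a real file, io.StringIO(raw) would be replaced by the path to the CSV.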
I have a dataframe for example df :
I'm trying to replace the dot with a comma to be able to do calculations in Excel.
I used :
df = df.stack().str.replace('.', ',').unstack()
or
df = df.apply(lambda x: x.str.replace('.', ','))
Results :
Nothing changes, but I receive this warning at the end of an execution without errors:
FutureWarning: The default value of regex will change from True to
False in a future version. In addition, single character regular
expressions will not be treated as literal strings when regex=True.
View of what I have :
Expected Results :
Updated question with more information, thanks to @Pythonista anonymous:
print(df.dtypes)
returns :
Date         object
Open         object
High         object
Low          object
Close        object
Adj Close    object
Volume       object
dtype: object
I'm extracting data with the to_excel method:
df.to_excel()
I'm not exporting the dataframe in a .csv file but an .xlsx file
Where does the dataframe come from - how was it generated? Was it imported from a CSV file?
Your code works if you apply it to columns which are strings, as long as you remember to do
df = df.apply() and not just df.apply() , e.g.:
import pandas as pd
df = pd.DataFrame()
df['a'] =['some . text', 'some . other . text']
df = df.apply(lambda x: x.str.replace('.', ',', regex=False))  # regex=False: treat '.' as a literal dot
print(df)
However, you are trying to do this with numbers, not strings.
To be precise, the other question is: what are the dtypes of your dataframe?
If you type
df.dtypes
what's the output?
I presume your columns are numeric and not strings, right? After all, if they are numbers they should be stored as such in your dataframe.
The next question: how are you exporting this table to Excel?
If you are saving a csv file, pandas' to_csv() method has a decimal argument which lets you specify what should be the separator for the decimals (typically, dot in the English-speaking world and comma in many countries in continental Europe). Look up the syntax.
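As a small illustration of the decimal argument, the sketch below writes a made-up frame to CSV text with a comma as the decimal mark. The column names and values are invented, and a semicolon is used as the field separator so it does not clash with the decimal comma:

```python
import pandas as pd

# Hypothetical numeric data; the columns must be real floats, not strings.
df = pd.DataFrame({"Open": [1.5, 2.25], "Close": [3.0, 4.75]})

# With no path given, to_csv returns the CSV content as a string.
csv_text = df.to_csv(sep=";", decimal=",", index=False)
print(csv_text)
```

The decimal argument only affects numeric columns; values stored as strings are written unchanged.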
If you are using the to_excel() method, it shouldn't matter because Excel should treat it internally as a number, and how it displays it (whether with a dot or comma for decimal separator) will typically depend on the options set in your computer.
Please clarify how you are exporting the data and what happens when you open it in Excel: does Excel treat it as a string? Or as a number, but you would like to see a different separator for the decimals?
Also look here for how to change decimal separators in Excel: https://www.officetooltips.com/excel_2016/tips/change_the_decimal_point_to_a_comma_or_vice_versa.html
UPDATE
OP, you have still not explained where the dataframe comes from. Do you import it from an external source? Do you create or calculate it yourself?
The fact that the columns are objects makes me think they are either stored as strings, or maybe some rows are numeric and some are not.
What happens if you try to convert a column to float?
df['Open'] = df['Open'].astype('float64')
If the entire column should be numeric but it's not, then start by cleansing your data.
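One way to start that cleansing is to let pd.to_numeric turn unparseable entries into NaN and then look at which original values were affected. This is only a sketch on invented data; the "n/a" entry stands in for whatever non-numeric text the real column may contain:

```python
import pandas as pd

# Hypothetical object column with one entry that is not a number.
df = pd.DataFrame({"Open": ["1.5", "2.25", "n/a"]})

# errors="coerce" replaces anything unparseable with NaN instead of raising.
converted = pd.to_numeric(df["Open"], errors="coerce")

# Values that were present originally but could not be converted.
bad_values = df.loc[converted.isna() & df["Open"].notna(), "Open"]

print(converted)
print(bad_values)
```

Once the offending values are known, they can be fixed or dropped before converting the column for good.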
Second question: what happens when you use Excel to open the file you have just created? Excel displays a comma, but what character Excel uses to separate decimals depends on the Windows/Mac/Excel settings, not on how pandas created the file. Have you tried the link I gave above? Can you change how Excel displays decimals? Also, does Excel treat those numbers as numbers or as strings?
Hi, I have looked on Stack Overflow but have not found a solution for my problem. Any help is highly appreciated.
After importing a csv I noticed that all the types of the columns are object and not float.
My goal is to convert all the columns but the YEAR column to float. I have read that you first have to strip the columns for taking blanks out and then also convert NaNs to 0 and then try to convert strings to floats. But in the code below I'm getting an error.
My code in a Jupyter notebook is:
And I get the following error:
How do I have to change the code?
All the columns but the YEAR column have to be set to float.
If you can help me set the column Year to datetime that would be also very nice. But my main problem is getting the data right so I can start making calculations.
Thanks
Runy
Easiest would be
df = df.astype(float)
df['YEAR'] = df['YEAR'].astype(int)
Also, your code fails because you have two columns with the same name BBPWN, so when you do df['BBPWN'], you will get a dataframe with those two columns. Then, df['BBPWN'].str will fail.
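The duplicate-name problem can be reproduced and worked around as sketched below. The sample values are invented, and renaming the second BBPWN column to BBPWN_2 is just one possible choice, not something taken from the question:

```python
import pandas as pd

# Hypothetical frame with a duplicated column name, all values read as text.
df = pd.DataFrame(
    [["2001", " 1.5", "2.5 "], ["2002", "3.0", "4"]],
    columns=["YEAR", "BBPWN", "BBPWN"],
)

# Which names occur more than once?
duplicated_names = df.columns[df.columns.duplicated()].tolist()

# Selecting a duplicated name returns a DataFrame, so .str is unavailable.
selection_is_frame = isinstance(df["BBPWN"], pd.DataFrame)

# Make the names unique by adding a numeric suffix to repeats.
counts = {}
new_names = []
for name in df.columns:
    counts[name] = counts.get(name, 0) + 1
    new_names.append(name if counts[name] == 1 else f"{name}_{counts[name]}")
df.columns = new_names

# Now each selection is a Series and can be stripped and converted.
for col in df.columns:
    if col != "YEAR":
        df[col] = pd.to_numeric(df[col].str.strip())
df["YEAR"] = df["YEAR"].astype(int)

print(df.dtypes)
```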
I have an input CSV with timestamps in the header like this (the number of timestamps forming columns is several thousand):
header1;header2;header3;header4;header5;2013-12-30CET00:00:00;2013-12-30CET00:01:00;...;2014-00-01CET00:00:00
In Pandas 0.12 I was able to do this, to convert string timestamps into datetime objects. The following code strips out the 'CEST' in the timestamp string (translate()), reads it in as a datetime (strptime()) and then localizes it to the correct timezone (localize()) [The reason for this approach was because, with the versions I had at least, CEST wasn't being recognised as a timezone].
DF = pd.read_csv('some_csv.csv',sep=';')
transtable = string.maketrans(string.uppercase,' '*len(string.uppercase))
tz = pytz.country_timezones('nl')[0]
timestamps = DF.columns[5:]
timestamps = map(lambda x:x.translate(transtable), timestamps)
timestamps = map(lambda x:datetime.datetime.strptime(x, '%Y-%m-%d %H:%M:%S'), timestamps)
timestamps = map(lambda x: pytz.timezone(tz).localize(x), timestamps)
DF.columns[5:] = timestamps
However, my downstream code required that I run off of pandas 0.16
While running on 0.16, I get this error with the above code at the last line of the above snippet:
*** TypeError: Indexes does not support mutable operations
I'm looking for a way to overwrite my index with the datetime object. Using the method to_datetime() doesn't work for me, returning:
*** ValueError: Unknown string format
I have some subsequent code that copies, then drops, the first few columns of data in this dataframe (all the 'header1', 'header2', 'header3' columns), leaving just the timestamps. The purpose is to then transpose and index by the timestamp.
So, my question:
Either:
how can I overwrite a series of column names with a datetime, such that I can pass in a pre-arranged set of timestamps that pandas will be able to recognise as a timestamp in subsequent code (in pandas v0.16)
Or:
Any other suggestions that achieve the same effect.
I've explored set_index(), replace(), to_datetime() and reindex(), and possibly some others, but none seem to be able to achieve this overwrite. Hopefully this is simple to do, and I'm just missing something.
TIA
I ended up solving this by the following:
The issue was that I had several thousand column headers with timestamps, that I couldn't directly parse into datetime objects.
So, in order to get these timestamp objects incorporated, I added a new column called 'Time' and put the datetime objects in there, then set the index to the new column (I'm omitting the code where I purged the rows of other header data through drop() methods):
DF = DF.transpose()
DF['Time'] = timestamps
DF = DF.set_index('Time')
Summary: If you have a CSV with a set of timestamps in your headers that you cannot parse; a way around this is to parse them separately, include in a new column of Time with the correct datetime objects, then set_index() based on the new column.
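For readers on a current pandas and Python 3, the same idea can be sketched as follows. This is not the code from the answer: the two-timestamp sample, the n_meta variable and the choice of "Europe/Amsterdam" as the time zone are assumptions made for illustration:

```python
import io

import pandas as pd

# Hypothetical miniature of the input: two metadata columns, two timestamps.
raw = (
    "header1;header2;2013-12-30CET00:00:00;2013-12-30CET00:01:00\n"
    "a;b;1.0;2.0\n"
)
df = pd.read_csv(io.StringIO(raw), sep=";")

n_meta = 2  # number of leading non-timestamp columns (assumed)

# Replace the embedded zone abbreviation with a space, then parse.
labels = df.columns[n_meta:].str.replace(r"[A-Z]+", " ", regex=True)
stamps = pd.to_datetime(labels, format="%Y-%m-%d %H:%M:%S")
stamps = stamps.tz_localize("Europe/Amsterdam")

# Keep only the timestamp columns, transpose, and index by the parsed times.
values = df.iloc[:, n_meta:].transpose()
values.index = stamps
values.index.name = "Time"

print(values)
```

Assigning a complete new index is allowed; what raises the TypeError is modifying part of an existing Index in place.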
My PANDAS data has columns that were read as objects. I want to change these into floats. Following the post linked below (1), I tried:
pdos[cols] = pdos[cols].astype(float)
But PANDAS gives me an error saying that an object can't be recast as float.
ValueError: invalid literal for float(): 17_d
But when I search for 17_d in my data set, it tells me it's not there.
>>> '17_d' in pdos
False
I can look at the raw data to see what's happening outside of python, but feel if I'm going to take python seriously, I should know how to deal with this sort of issue. Why doesn't this search work? How could I do a search over objects for strings in PANDAS? Any advice?
Pandas: change data type of columns
Of course it does: you're only looking in the column labels!
'17_d' in pdos
checks whether '17_d' is in pdos.columns.
So what you want to do is pdos[cols] == '17_d', which will give you a truth table. If you want to find which row it is, you can do (pdos[cols] == '17_d').any(axis=1)
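The difference between the two checks can be seen on a small made-up frame; the column names x and y and the values are invented for this sketch:

```python
import pandas as pd

# Hypothetical data with one stray non-numeric entry.
pdos = pd.DataFrame({"x": ["1.0", "17_d", "3.5"], "y": ["2", "4", "6"]})
cols = ["x", "y"]

# The "in" operator on a DataFrame tests the column labels only.
in_columns = "17_d" in pdos

# Element-wise comparison, reduced to one boolean per row.
row_mask = (pdos[cols] == "17_d").any(axis=1)
matching_rows = pdos[row_mask]

print(in_columns)
print(matching_rows)
```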