How to add a number at the beginning of a row (pandas) - python

I'm trying to convert my column type from object to date, but I get an error that says:
OutOfBoundsDatetime: Out of bounds nanosecond timestamp: 20-04-09 00:00:00
After looking at the data and trying to find the cause, I saw that all the dates I have use the following format:
print(df['date'])
0 020-04-02
1 020-04-02
2 020-04-05
3 NaN
4 020-04-05
...
60 NaN
61 020-04-07
62 NaN
63 020-04-09
64 020-04-09
As you can see, the dates begin with a zero, so adding a 2 at the beginning will fix the problem for me.
So, the question is: how can I add a 2 while ignoring the NaNs?

If you add '2' to the missing values they simply stay missing, so use:
pd.to_datetime('2' + df['date'])
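As a minimal sketch of why the NaNs take care of themselves here (using the df['date'] column from the question), string concatenation with a missing value yields NaN, and pd.to_datetime then passes it through as NaT:
import numpy as np
import pandas as pd

df = pd.DataFrame({'date': ['020-04-02', np.nan, '020-04-09']})
# '2' + NaN evaluates to NaN, so the missing values survive the concatenation
df['date'] = pd.to_datetime('2' + df['date'])
print(df)
#         date
# 0 2020-04-02
# 1        NaT
# 2 2020-04-09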


ValueError: could not convert string to float - without positional indication

For a current project, I am planning to run scikit-learn's Stochastic Gradient Boosting algorithm over a CSV set that includes numerical data.
When calling the line sgbr.fit(X_train, y_train) of the script, I am however receiving a ValueError: could not convert string to float, with no further details given on the respective value that cannot be converted.
I assume that this error is not related to the Python code itself but rather to the CSV input. I have however already checked the CSV file to confirm all sections exclusively include floats.
Does anyone have an idea why the ValueError is appearing without further positional indication?
I think there is no direct function to get a positional indication. You can try pd.to_numeric with errors='coerce' to see which values fail to convert:
print (df)
column
0 01
1 02
2 03
3 04
4 05
5 LS
print (pd.to_numeric(df.column, errors='coerce'))
0 1.0
1 2.0
2 3.0
3 4.0
4 5.0
5 NaN
Name: column, dtype: float64
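Building on that, a small sketch (assuming the same single-column df as above) that recovers the positional indication the error message lacks - rows where coercion produced NaN even though the original value was present are the offending strings:
import pandas as pd

df = pd.DataFrame({'column': ['01', '02', '03', '04', '05', 'LS']})
coerced = pd.to_numeric(df['column'], errors='coerce')
# rows that failed to convert: NaN after coercion although the original was not missing
print(df[coerced.isna() & df['column'].notna()])
#   column
# 5     LS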

Python calculations using groups to sum

Good afternoon all,
Bit stuck with the last stage of a calculation.
I have a dataframe which outputs as such:
LaCode Group Frequency
0 718 NaN 2
1 718 3 1
2 719 1 4
3 719 2 10
I'm struggling with the percentage calculation, which is: for each LaCode, ignore rows where Group is NaN (and just put NaN, or blank, for those) and calculate each row's percentage of the Frequency total across the rows where Group is known.
Should output as such:
Percentage
NaN
100
28.571
71.428
Can anyone help with this? My code doesn't take into account the change in LaCode and I can't work out the correct syntax to incorporate that issue.
Thanks.
Edit: For completeness, I have converted the NaN to an integer that stands out so I can see it (in this instance 0, as that isn't a valid group in the survey).
The code I'm using for the calculation was provided to me and I tweaked it a little. It works OK when there is just one LaCode:
df['Percentage'] = df[df['Value'] != 0]['Count'].apply(lambda x: x/sum(df[df['Value'] != 0]['Count']))
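A minimal sketch of one way to do the per-LaCode calculation, assuming the columns are named LaCode, Group and Frequency as in the printout above and that NaN (rather than the substituted 0) still marks the unknown group:
import numpy as np
import pandas as pd

df = pd.DataFrame({'LaCode': [718, 718, 719, 719],
                   'Group': [np.nan, 3, 1, 2],
                   'Frequency': [2, 1, 4, 10]})

known = df['Group'].notna()
# per-LaCode total Frequency, counting only rows whose Group is known
totals = df['Frequency'].where(known).groupby(df['LaCode']).transform('sum')
# percentage for known rows, NaN where Group itself is NaN
df['Percentage'] = np.where(known, df['Frequency'] / totals * 100, np.nan)
print(df)
#    LaCode  Group  Frequency  Percentage
# 0     718    NaN          2         NaN
# 1     718    3.0          1  100.000000
# 2     719    1.0          4   28.571429
# 3     719    2.0         10   71.428571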

What is happening when pandas.Series converts int64s into NaNs?

I have a csv with dates and integers (Headers: Date, Number), separated by a tab.
I'm trying to create a calendar heatmap with CalMap (demo on that page). The function that creates the chart takes data that's indexed by DateTime.
df = pd.read_csv("data.csv",delimiter="\t")
df['Date'] = df['Date'].astype('datetime64[ns]')
events = pd.Series(df['Date'],index = df['Number'])
calmap.yearplot(events)
But when I check events.head(5), it gives the date followed by NaN. I check df['Number'].head(5) and they appear as int64.
What am I doing wrong that is causing this conversion?
Edit: Data below
Date Number
7/9/2018 40
7/10/2018 40
7/11/2018 40
7/12/2018 70
7/13/2018 30
Edit: Output of events.head(5)
2018-07-09 NaN
2018-07-10 NaN
2018-07-11 NaN
2018-07-12 NaN
2018-07-13 NaN
dtype: float64
First of all, it is not NaN, it is NaT (Not a Time), which is unique to pandas, though pandas makes it compatible with NaN and uses it, like NaN in floating-point columns, to mark missing data.
What pd.Series(data, index=index) does apparently depends on the type of data. If data is a list, then index has to be of equal length, and a new Series will be constructed, with data being data, and index being index. However, if data is already a Series (such as df['Date']), it will instead take the rows corresponding to index and construct a new Series out of those rows. For example:
pd.Series(df['Date'], [1, 1, 4])
will give you
1 2018-07-10
1 2018-07-10
4 2018-07-13
Here 2018-07-10 comes from row #1, and 2018-07-13 from row #4 of df['Date']. However, there is no row with index 40, 70 or 30 in your sample input data, so missing data is presumed, and NaT is inserted instead.
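In other words (a small equivalence sketch, not part of the original answer), constructing a Series from a Series with an explicit index behaves like conforming the existing Series to that index:
# equivalent, spelled out: conform the existing Series to the new index
print(df['Date'].reindex([1, 1, 4]))
# 1   2018-07-10
# 1   2018-07-10
# 4   2018-07-13
# Name: Date, dtype: datetime64[ns]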
In contrast, this is what you get when you use a list instead:
pd.Series(df['Date'].to_list(), index=df['Number'])
# => Number
# 40 2018-07-09
# 40 2018-07-10
# 40 2018-07-11
# 70 2018-07-12
# 30 2018-07-13
# dtype: datetime64[ns]
I was able to fix this by turning the series into lists via df['Date'].tolist() and df['Number'].tolist(). calmap.calendarplot(events) accepted these in place of the original series parameters.
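For the heatmap itself, the question states that the plotting function wants data indexed by DateTime, so a sketch of the Series shape you likely need (swapping the roles of the two columns relative to the original attempt):
# numbers as the data, dates as the index - values indexed by DateTime
events = pd.Series(df['Number'].to_list(), index=pd.DatetimeIndex(df['Date']))
calmap.yearplot(events)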

Pandas converting column of strings and NaN (floats) to integers, keeping the NaN [duplicate]

This question already has answers here:
Convert Pandas column containing NaNs to dtype `int`
(27 answers)
Closed 3 years ago.
I am having problems converting a column which contains both two-digit numbers in string format (type: str) and NaN (type: float64). I want to obtain a new column made this way: NaN where there was NaN, and integer numbers where there was a two-digit number in string format.
As an example: I want to obtain column Yearbirth2 from column YearBirth1 like this:
YearBirth1 #numbers here are formatted as strings: type(YearBirth1[0])=str
34 # and NaN are floats: type(YearBirth1[2])=float64.
76
Nan
09
Nan
91
YearBirth2 #numbers here are formatted as integers: type(YearBirth2[0])=int
34 #NaN can remain floats as they were.
76
Nan
9
Nan
91
I have tried this:
csv['YearBirth2'] = (csv['YearBirth1']).astype(int)
And as I expected i got this error:
ValueError: cannot convert float NaN to integer
So I tried this:
csv['YearBirth2'] = (csv['YearBirth1']!=NaN).astype(int)
And got this error:
NameError: name 'NaN' is not defined
Finally I have tried this:
csv['YearBirth2'] = (csv['YearBirth1']!='NaN').astype(int)
NO error, but when I checked the column YearBirth2, this was the result:
YearBirth2:
1
1
1
1
1
1
Very bad. I think the idea is right, but there is a problem making Python understand what I mean by NaN. Or maybe the method I tried is wrong.
I also tried the pd.to_numeric() method, but that way I obtain floats, not integers.
Any help?!
Thanks to everyone!
P.S: csv is the name of my DataFrame;
Sorry if I am not so clear, I am still improving my English!
You can use to_numeric, but it is impossible to get int with NaN values - they are always converted to float; see the docs on NA type promotions.
df['YearBirth2'] = pd.to_numeric(df.YearBirth1, errors='coerce')
print (df)
YearBirth1 YearBirth2
0 34 34.0
1 76 76.0
2 Nan NaN
3 09 9.0
4 Nan NaN
5 91 91.0
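If an integer dtype really matters, a sketch using the nullable Int64 extension dtype - this assumes pandas 0.24 or newer, which postdates the answer above:
# coerce the strings to numbers, then cast to the nullable integer dtype;
# missing values become pd.NA instead of forcing the column to float
df['YearBirth2'] = pd.to_numeric(df['YearBirth1'], errors='coerce').astype('Int64')
print(df['YearBirth2'])
# 0      34
# 1      76
# 2    <NA>
# 3       9
# 4    <NA>
# 5      91
# Name: YearBirth2, dtype: Int64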

Is this a Pandas bug with notnull() or a fundamental misunderstanding on my part (probably misunderstanding)

I have a pandas dataframe with two columns and default indexing. The first column is a string and the second is a date. The top date is NaN (though it should be NaT really).
index somestr date
0 ON NaN
1 1C 2014-06-11 00:00:00
2 2C 2014-07-09 00:00:00
3 3C 2014-08-13 00:00:00
4 4C 2014-09-10 00:00:00
5 5C 2014-10-08 00:00:00
6 6C 2014-11-12 00:00:00
7 7C 2014-12-10 00:00:00
8 8C 2015-01-14 00:00:00
9 9C 2015-02-11 00:00:00
10 10C 2015-03-11 00:00:00
11 11C 2015-04-08 00:00:00
12 12C 2015-05-13 00:00:00
Call this dataframe df.
When I run:
df[pd.notnull(df['date'])]
I expect the first row to go away. It doesn't.
If I remove the column with string by setting:
df=df[['date']]
Then apply:
df[pd.notnull(df['date'])]
then the first row with the null does go away.
Also, the row with the null always goes away if all columns are number/date types. When a column with a string appears, this problem occurs.
Surely this is a bug, right? I am not sure if others will be able to replicate this.
This was on my Enthought Canopy for Windows (I am not smart enough for UNIX/Linux command line noise)
Per requests below from Jeff and unutbu:
@unutbu -
df.dtypes
somestr object
date object
dtype: object
Also:
type(df.iloc[0]['date'])
pandas.tslib.NaTType
In the code this column was specifically assigned as pd.NaT
I also do not understand why it says NaN when it should say NaT. The filtering I used worked fine when I used this toy frame:
df=pd.DataFrame({'somestr' : ['aa', 'bb'], 'date' : [pd.NaT, dt.datetime(2014,4,15)]}, columns=['somestr', 'date'])
It should also be noted that although the table above had NaN in the output, the following output NaT:
df['date'][0]
NaT
Also:
pd.notnull(df['date'][0])
False
pd.notnull(df['date'][1])
True
but....when evaluating the array, they all came back True - bizarre...
np.all(pd.notnull(df['date']))
True
@Jeff - this is 0.12. I am stuck with this. The frame was created by concatenating two different frames that were grabbed from database queries using psql. The date and some other float columns were then added by calculations I did. Of course, I filtered down to the two relevant columns that made sense here once I had pinpointed that the string-valued columns were causing problems.
************ How to Replicate **********
import pandas as pd
import datetime as dt
print(pd.__version__)
# 0.12.0
df = pd.DataFrame({'somestr': ['aa', 'bb'], 'date': ['cc', 'dd']},
columns=['somestr', 'date'])
df['date'].iloc[0] = pd.NaT
df['date'].iloc[1] = pd.to_datetime(dt.datetime(2014, 4, 15))
print(df[pd.notnull(df['date'])])
# somestr date
# 0 aa NaN
# 1 bb 2014-04-15 00:00:00
df2 = df[['date']]
print(df2[pd.notnull(df2['date'])])
# date
# 1 2014-04-15 00:00:00
So, this dataframe originally had all string entries - then the date column was converted to dates with an NaT at the top - note that in the table it is NaN, but when using df.iloc[0]['date'] you do see the NaT. Using the snippet above, you can see that the filtering by not null is bizarre with and without the somestr column. Again - this is Enthought Canopy for Windows with Pandas 0.12 and NumPy 1.8.
I encountered this problem also. Here's how I fixed it. isnull() is a function that checks whether something is NaN or empty. The ~ (tilde) operator negates the following expression. So we are saying: give me a dataframe from your original dataframe, but only the rows where the 'date' column is NOT null.
df = df[~df['date'].isnull()]
Hope this helps!
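Since the root problem in the question is an object-dtype column mixing NaT with other values, a sketch of a more thorough fix (assuming a modern pandas where errors='coerce' is available): convert the column to a real datetime64 dtype first, after which notnull/isnull behave as expected regardless of the other columns:
# force the column to datetime64; anything unparseable becomes NaT
df['date'] = pd.to_datetime(df['date'], errors='coerce')
print(df[df['date'].notnull()])
#   somestr       date
# 1      bb 2014-04-15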
