Pandas - Python: Remodel the Date Column

I have a date column in my pandas DataFrame that looks like this:
ID SerialDate
1 2008-1-15
2 T1
3 2008-1-17
4 T1
T1 is the only text that will be found in this column, and there won't be any blanks. The dtype of this column is object.
I need to change it to look like this.
Expected DataFrame:
ID SerialDate
1 15/01/2008
2 T1
3 17/01/2008
4 T1
Expected dtype: object.
How can I do this using pandas? I would prefer a user-defined function, something like df[colb] = requiredfunction(df, colb).

You can easily generate the required output with basic string operations and pandas' apply function:
df['SerialDate'] = df['SerialDate'].apply(lambda x:'/'.join(x.split('-')[::-1]))
In your particular case, 'T1' is not affected by this operation, since it contains no '-' to split on.
Explanation:
What does [::-1] do?
>>> [1, 2, 3][::-1]
[3, 2, 1]
It reverses a list.
To zero-pad single-digit day and month parts (e.g. 1 -> 01):
df['SerialDate'] = df['SerialDate'].apply(lambda x:'/'.join([y.zfill(2) for y in x.split('-')[::-1]]))
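If you want the user-defined function form asked for in the question, here is a minimal sketch wrapping the same one-liner (the function name remodel_dates is illustrative, not from the original answer):
import pandas as pd

def remodel_dates(df, col):
    # Reverse 'YYYY-M-D' into 'DD/MM/YYYY', zero-padding day and month;
    # values without '-' (such as 'T1') pass through unchanged.
    return df[col].apply(lambda x: '/'.join(y.zfill(2) for y in x.split('-')[::-1]))

df = pd.DataFrame({'ID': [1, 2, 3, 4],
                   'SerialDate': ['2008-1-15', 'T1', '2008-1-17', 'T1']})
df['SerialDate'] = remodel_dates(df, 'SerialDate')
print(df)  # dtype of SerialDate remains object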

Using to_datetime with strftime, then masking back with the original column:
df=pd.read_clipboard()
s=pd.to_datetime(df.SerialDate,errors = 'coerce').dt.strftime('%d/%m/%Y')
df.SerialDate=s.mask(s=='NaT',df.SerialDate)
df
Out[402]:
ID SerialDate
0 1 15/01/2008
1 2 T1
2 3 17/01/2008
3 4 T1
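A self-contained version of the same idea, building the frame inline instead of pd.read_clipboard(). Note this is a sketch: newer pandas renders unparsable values as NaN rather than the string 'NaT' after strftime, so the mask below checks both.
import pandas as pd

df = pd.DataFrame({'ID': [1, 2, 3, 4],
                   'SerialDate': ['2008-1-15', 'T1', '2008-1-17', 'T1']})
# errors='coerce' turns non-dates into NaT before reformatting
s = pd.to_datetime(df.SerialDate, errors='coerce').dt.strftime('%d/%m/%Y')
# restore the original value wherever parsing failed
df.SerialDate = s.mask(s.isna() | (s == 'NaT'), df.SerialDate)
print(df)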

Related

convert pandas series to a dataframe [duplicate]

I have a Series, like this:
series = pd.Series({'a': 1, 'b': 2, 'c': 3})
I want to convert it to a dataframe like this:
a b c
0 1 2 3
pd.Series.to_frame() doesn't work; it gives a result like:
0
a 1
b 2
c 3
How can I construct a DataFrame from Series, with index of Series as columns?
You can also try this:
df = pd.DataFrame(series).transpose()
Using the transpose() function you can interchange the indices and the columns.
The output looks like this:
a b c
0 1 2 3
You don't need the transposition step, just wrap your Series inside a list and pass it to the DataFrame constructor:
pd.DataFrame([series])
a b c
0 1 2 3
Alternatively, call Series.to_frame, then transpose using the shortcut .T:
series.to_frame().T
a b c
0 1 2 3
You can also try this:
a = series.to_frame()
a['id'] = list(a.index)
Explanation:
The 1st line converts the Series into a single-column DataFrame.
The 2nd line adds a column to this DataFrame with the same values as the index.
Try reset_index. It will convert your index into a column in your dataframe.
df = series.to_frame().reset_index()
This
pd.DataFrame([series]) #method 1
produces a slightly different result than
series.to_frame().T #method 2
With method 1, the elements in the resulting dataframe retain their original types; e.g. an int64 in the series stays int64.
With method 2, the elements in the resulting dataframe all become objects if there is an object-type element anywhere in the series; e.g. an int64 in the series becomes an object.
This difference may cause different behaviors in your subsequent operations depending on the version of pandas.
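A small demo of that dtype difference (a sketch; exact inference can vary with the pandas version):
import pandas as pd

series = pd.Series({'a': 1, 'b': 'text', 'c': 3})
print(pd.DataFrame([series]).dtypes)  # method 1: 'a' and 'c' inferred back to int64
print(series.to_frame().T.dtypes)     # method 2: every column stays object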

Pandas - How to extract values from a large DF without any 'keys' using another DF's values?

I've got one large matrix as a pandas DF without any 'keys', just plain numbers on top. A smaller version, just to demonstrate the problem here, would be this input:
M=pd.DataFrame(np.random.rand(4,5))
What I want to accomplish is to use another given DF as a reference, with a structure like this:
N=pd.DataFrame({'A':[2,2,2],'B':[2,3,4]})
...to extract the values from the large DF, where the values of 'A' correspond to the ROW number and the 'B' values to the COLUMN number of the large DF, so that the expected output would look like this:
Large DF
0 1 2 3 4
0 0.766275 0.910825 0.378541 0.775416 0.639854
1 0.505877 0.992284 0.720390 0.181921 0.501062
2 0.439243 0.416820 0.285719 0.100537 0.429576
3 0.243298 0.560427 0.162422 0.631224 0.033927
Small DF
A B
0 2 2
1 2 3
2 2 4
Expected Output:
A B extracted values
0 2 2 0.285719
1 2 3 0.100537
2 2 4 0.429576
So far I've tried different versions of something like this:
N['extracted'] = M.iloc[N['A'].astype(int):,N['B'].astype(int)]
...but it keeps failing with an error saying:
TypeError: cannot do positional indexing on RangeIndex with these indexers
[0 2
1 2
2 2
Which approach would be best?
Is this job better accomplished by converting the DFs into numpy arrays?
Thanks for the help!
I think you want to use the apply function. This goes row by row through your data set.
N['extracted'] = N.apply(lambda row: M.iloc[row['A'], row['B']], axis=1)
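Since the question asks whether numpy would help: a vectorized alternative is to fancy-index the underlying array directly (a sketch, assuming 'A' and 'B' hold valid integer positions):
import numpy as np
import pandas as pd

M = pd.DataFrame(np.random.rand(4, 5))
N = pd.DataFrame({'A': [2, 2, 2], 'B': [2, 3, 4]})
# Pair row positions from 'A' with column positions from 'B'
N['extracted'] = M.to_numpy()[N['A'].to_numpy(), N['B'].to_numpy()]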

Create a column based on multiple column distinct count pandas [duplicate]

I want to add an aggregate, grouped, nunique column to my pandas dataframe but not aggregate the entire dataframe. I'm trying to do this in one line and avoid creating a new aggregated object and merging that, etc.
My df has track, type, and id. I want the number of unique ids for each track/type combination as a new column in the table (without collapsing track/type combos in the resulting df): same number of rows, one more column.
Something like this isn't working:
df['n_unique_id'] = df.groupby(['track', 'type'])['id'].nunique()
nor is
df['n_unique_id'] = df.groupby(['track', 'type'])['id'].transform(nunique)
This last one works with some aggregating functions but not others. The following works (but is meaningless on my dataset):
df['n_unique_id'] = df.groupby(['track', 'type'])['id'].transform(sum)
in R this is easily done in data.table with
df[, n_unique_id := uniqueN(id), by = c('track', 'type')]
Thanks!
df.groupby(['track', 'type'])['id'].transform(nunique)
implies that there is a name nunique in the namespace that performs some function. transform will take a function, or a string that it knows a function for; 'nunique' is definitely one of those strings.
As pointed out by @root, the methods pandas uses to perform the transformations indicated by these strings are often optimized and should generally be preferred over passing your own functions. This is true even for numpy functions in some cases.
For example, transform('sum') should be preferred over transform(sum).
Try this instead
df.groupby(['track', 'type'])['id'].transform('nunique')
demo
df = pd.DataFrame(dict(
    track=list('11112222'), type=list('AAAABBBB'), id=list('XXYZWWWW')))
print(df)
id track type
0 X 1 A
1 X 1 A
2 Y 1 A
3 Z 1 A
4 W 2 B
5 W 2 B
6 W 2 B
7 W 2 B
df.groupby(['track', 'type'])['id'].transform('nunique')
0 3
1 3
2 3
3 3
4 1
5 1
6 1
7 1
Name: id, dtype: int64
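To get the new column the question asks for, assign the transform result back:
df['n_unique_id'] = df.groupby(['track', 'type'])['id'].transform('nunique')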

Adding new columns to DataFrame Python. SettingWithCopyWarning

I'm trying to add a new column to a data frame. I have a column of dates; I turn it into seconds-since-epoch and add that to a new column of the data frame:
def addEpochTime(df):
    df[7] = np.NaN  # Adding empty column.
    for n in range(0, len(df)):  # Writing to empty column.
        df[7][n] = df[0][n] - 5  # Conduct some mathematical mutations...

addEpochTime(df)
What I've written above works, but I do get a warning, i.e.: SettingWithCopyWarning.
My question is: how can I add a new column to a data frame and write data to it?
I don't fully understand the way data frames are indexed, despite having read about it in the pandas documentation.
Since you say -
I have column of dates, I turn it into seconds-since-epoch and add that to a new column of the data frame
If what you are actually doing is as simple as df[7][n] = df[0][n] - 5, then you can simply use the Series.apply method to do the same thing. In your case:
def addEpochTime(df):
    df[7] = df[0].apply(lambda x: x - 5)
The .apply method accepts a function as its parameter; the function is passed each value of the Series and should return the value after applying the logic.
You can also pass .apply() a function that accepts the date as a parameter and returns the seconds since epoch, which might be what you are looking for.
Example -
In [4]: df = pd.DataFrame([[1,2],[3,4]],columns=['A','B'])
In [5]: df
Out[5]:
A B
0 1 2
1 3 4
In [6]: df['C'] = df['A'].apply(lambda x: x-5)
In [7]: df
Out[7]:
A B C
0 1 2 -4
1 3 4 -2
You can do it in a single line and avoid the warning:
df
>>    a
0     1
1     2
df['b'] = df['a'] - 5
df
>>    a  b
0     1 -4
1     2 -3
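For the actual dates-to-seconds-since-epoch step the question describes, a vectorized sketch (column names are illustrative; assumes the column is already datetime64[ns]):
import pandas as pd

df = pd.DataFrame({'date': pd.to_datetime(['2015-01-01', '2015-06-15'])})
# datetime64[ns] viewed as int64 gives nanoseconds since the epoch
df['epoch_seconds'] = df['date'].astype('int64') // 10**9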

Is there a way to do a Series.map in place, but keep original value if no match?

The scenario here is that I've got a dataframe df with raw integer data, and a dict map_array which maps those ints to string values.
I need to replace the values in the dataframe with the corresponding values from the map, but keep the original value if it doesn't map to anything.
So far, the only way I've been able to figure out how to do what I want is by using a temporary column. However, with the size of data that I'm working with, this could sometimes get a little bit hairy. And so, I was wondering if there was some trick to do this in pandas without needing the temp column...
import pandas as pd
import numpy as np
df = pd.DataFrame(np.random.randint(1,5, size=(100,1)))
map_array = {1:'one', 2:'two', 4:'four'}
df['__temp__'] = df[0].map(map_array, na_action=None)
#I've tried varying the na_action arg to no effect
nan_index = df['__temp__'][df['__temp__'].isnull() == True].index
df['__temp__'].ix[nan_index] = df[0].ix[nan_index]
df[0] = df['__temp__']
df = df.drop(['__temp__'], axis=1)
I think you can simply use .replace, whether on a DataFrame or a Series:
>>> df = pd.DataFrame(np.random.randint(1,5, size=(3,3)))
>>> df
0 1 2
0 3 4 3
1 2 1 2
2 4 2 3
>>> map_array = {1:'one', 2:'two', 4:'four'}
>>> df.replace(map_array)
0 1 2
0 3 four 3
1 two one two
2 four two 3
>>> df.replace(map_array, inplace=True)
>>> df
0 1 2
0 3 four 3
1 two one two
2 four two 3
I'm not sure what the memory hit of changing column dtypes will be, though.
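Another pattern that keeps unmapped values without a temp column is map followed by fillna with the original column (a sketch):
import numpy as np
import pandas as pd

df = pd.DataFrame(np.random.randint(1, 5, size=(100, 1)))
map_array = {1: 'one', 2: 'two', 4: 'four'}
# map() yields NaN for keys missing from the dict; fillna restores the original
df[0] = df[0].map(map_array).fillna(df[0])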
