How would you add a time stamp column to a Pandas dataframe? - python

Every time I append a dataframe to a text file, I want it to contain a column with the same timestamp for each row. The timestamp can be any arbitrary time, as long as it's different from the next time I append a new dataframe to the existing text file. The code below inserts a column named TimeStamp, but doesn't actually insert datetime values; the column is simply empty. I must be overlooking something simple. What am I doing wrong?
t = [datetime.datetime.now().replace(microsecond=0) for i in range(df.shape[0])]
s = pd.Series(t, name = 'TimeStamp')
df.insert(0, 'TimeStamp', s)

I think the simplest is to use insert with a scalar value only; it is broadcast to every row:
df = pd.DataFrame({'A': list('AAA'), 'B': range(3)}, index=list('xyz'))
print (df)
   A  B
x  A  0
y  A  1
z  A  2
df.insert(0, 'TimeStamp', pd.to_datetime('now').replace(microsecond=0))
print (df)
             TimeStamp  A  B
x  2018-02-15 07:35:35  A  0
y  2018-02-15 07:35:35  A  1
z  2018-02-15 07:35:35  A  2
Your version works if you build the Series with index=df.index, so there are the same indices in the Series and in the DataFrame (otherwise insert aligns on the Series' default RangeIndex and the column comes out all NaT):
t = [datetime.datetime.utcnow().replace(microsecond=0) for i in df.index]
s = pd.Series(t, name='TimeStamp', index=df.index)
df.insert(0, 'TimeStamp', s)
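For the append-to-a-text-file part of the question, a minimal sketch of the full workflow might look like the following (the file name log.txt, the tab separator and the sample frame are illustrative assumptions, not from the question):
import datetime
import os
import pandas as pd

df = pd.DataFrame({'A': list('AAA'), 'B': range(3)})
# one scalar timestamp, broadcast to every row of this append
df.insert(0, 'TimeStamp', datetime.datetime.now().replace(microsecond=0))
# append to the text file, writing the header only on the first write
df.to_csv('log.txt', mode='a', sep='\t', index=False,
          header=not os.path.exists('log.txt'))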

Related

Add data series content in another data series

Is there a way to concatenate the information of two data series in pandas? Not append or merge the information of two data frames, but actually combine the content of each data series into a new data series.
Example:
ColumnA (Item type) Row 2 = 1 (float64)
ColumnB (Item number) Row 2 = 1 (float64)
ColumnC (Registration Date) Row 2 = 04/07/2018 (43285) (datetime64[ns])
In Excel I would concatenate the values in columns A, B and C, combining the number in each column, using the formula =CONCAT(A2, B2, C2).
The result would be 1143285 in another cell, D2 for example.
Is there a way for me to do that in Pandas? I could only find ways to join, combine or append the series in the data frame, but not to combine the contents of the series themselves.
You can use apply for that:
df['D'] = df.apply(lambda row: str(row['A']) +
                   str(row['B']) + str(row['C']), axis=1)
Following your example it would be
import pandas as pd
d = {'A': [1],'B':[1],'C':[43285]}
df = pd.DataFrame(data=d)
df['D'] = df.apply(lambda row: str(row['A']) +
                   str(row['B']) + str(row['C']), axis=1)
Output:
   A  B      C        D
0  1  1  43285  1143285
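If the frame is large, a vectorized sketch of the same idea (just an alternative, assuming the columns hold plain numbers as in this toy example) avoids apply:
import pandas as pd

d = {'A': [1], 'B': [1], 'C': [43285]}
df = pd.DataFrame(data=d)
# cast each column to string and concatenate the columns element-wise
df['D'] = df['A'].astype(str) + df['B'].astype(str) + df['C'].astype(str)
print(df)
#    A  B      C        D
# 0  1  1  43285  1143285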

Merge two dataframes and keep the common values while retaining values based on another column

When I merge two dataframes, it keeps the columns from the left and the right dataframes with _x and _y appended.
But I want it to produce one column and 'merge' the values of the two columns such that:
when the values are the same it just keeps that one value
when the values are different it keeps the value based on another column called 'date',
taking the value from the row whose 'date' is the latest.
I also tried doing it using concatenate and in this case it does 'merge' the two columns, but it just seems to 'append' the two rows.
In the code below for example, I would like to get as output the dataframe df_desired. How can I get that?
import pandas as pd
import numpy as np
np.random.seed(30)
company1 = ('comA','comB','comC','comD')
df1 = pd.DataFrame(columns=None)
df1['company'] = company1
df1['clv']=[100,200,300,400]
df1['date'] = [20191231,20191231,20191001,20190931]
print("\ndf1:")
print(df1)
company2 = ('comC','comD','comE','comF')
df2 = pd.DataFrame(columns=None)
df2['company'] = company2
df2['clv']=[300,450,500,600]
df2['date'] = [20191231,20191231,20191231,20191231]
print("\ndf2:")
print(df2)
df_desired = pd.DataFrame(columns=None)
df_desired['company'] = ('comA','comB','comC','comD','comE','comF')
df_desired['clv']=[100,200,300,450,500,600]
df_desired['date'] = [20191231,20191231,20191231,20191231,20191231,20191231]
print("\ndf_desired:")
print(df_desired)
df_merge = pd.merge(df1, df2, left_on='company',
                    right_on='company', how='outer')
print("\ndf_merge:")
print(df_merge)
# alternately
df_concat = pd.concat([df1, df2], ignore_index=True, sort=False)
print("\ndf_concat:")
print(df_concat)
One approach is to concat the two dataframes, then sort the concatenated dataframe on date in ascending order and drop the duplicate entries (while keeping the latest entry) based on company:
df = pd.concat([df1, df2])
df['date'] = pd.to_datetime(df['date'], format='%Y%m%d', errors='coerce')
df = df.sort_values('date', na_position='first').drop_duplicates('company', keep='last', ignore_index=True)
Result:
  company  clv       date
0    comA  100 2019-12-31
1    comB  200 2019-12-31
2    comC  300 2019-12-31
3    comD  450 2019-12-31
4    comE  500 2019-12-31
5    comF  600 2019-12-31
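An alternative sketch, assuming df1 and df2 are built exactly as in the question: keep, for each company, the row holding the latest date via groupby/idxmax, which gives the same result here:
df = pd.concat([df1, df2], ignore_index=True)
df['date'] = pd.to_datetime(df['date'], format='%Y%m%d', errors='coerce')
# for every company keep the row with the latest (non-NaT) date
latest = df.loc[df.groupby('company')['date'].idxmax()].reset_index(drop=True)
print(latest)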

How to insert Pandas dataframe into another Pandas dataframe without wrapping it in a Series?

import pandas as pd
df1 = pd.DataFrame([{'a':1,'b':2}])
df2 = pd.DataFrame([{'x':100}])
df2['y'] = [df1]
# df2['y'].shape == (1,)
# type(df2.iloc[:, 1][0]) == pandas.core.frame.DataFrame
I want to make a df a column in an existing row. However Pandas wraps this df in a Series object so that I cannot access it with dot notation such as df2.y.a to get the value 1. Is there a way to make this not occur or is there some constraint on object type for df elements such that this is impossible?
the desired output is a df like:
x y
0 100 a b
0 1 2
and type(df2.y) == pd.DataFrame
You can combine two DataFrame objects along the columns axis, which I think achieves what you're trying to do. Let me know if this is what you're looking for:
import pandas as pd
df1 = pd.DataFrame([{'a':1,'b':2}])
df2 = pd.DataFrame([{'x':100}])
combined_df = pd.concat([df1, df2], axis=1)
print(combined_df)
   a  b    x
0  1  2  100
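As a small usage note on that layout (not part of the original answer): because a and b are now ordinary top-level columns, the dot access the question mentions works directly:
# attribute access now reaches the values without any nesting
print(combined_df.a[0])   # 1
print(combined_df.x[0])   # 100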

Parsing Column names as DateTime

Is there a way of parsing the column names themselves as datetime? My column names look like this:
Name SizeRank 1996-06 1996-07 1996-08 ...
I know that I can convert the values of a column to datetime values, e.g. for a column named datetime I can do something like this:
temp = pd.read_csv('data.csv', parse_dates=['datetime'])
Is there a way of converting the column names themselves? I have 285 columns, i.e. my data spans 1996-2019.
There's no way of doing that immediately while reading the data from a file afaik, but you can fairly simply convert the columns to datetime after you've read them in. You just need to watch out that you don't pass columns that don't actually contain a date to the function.
Could look something like this, assuming all columns after the first two are dates (as in your example):
dates = pd.to_datetime(df.columns[2:])
You can then do whatever you need to do with those datetimes.
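For example, a small sketch of one thing you could do with those datetimes, keeping only the columns after some cutoff (the toy frame and the cutoff date are assumptions for illustration):
import pandas as pd

# toy frame standing in for the question's layout: two label columns, then monthly date columns
df = pd.DataFrame([["CityA", 1, 100, 110, 120]],
                  columns=["Name", "SizeRank", "1996-06", "1996-07", "1996-08"])

dates = pd.to_datetime(df.columns[2:])           # parse the date-like column names
keep = df.columns[2:][dates >= '1996-07-01']     # only the months from July 1996 on
subset = df[list(df.columns[:2]) + list(keep)]
print(subset)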
You could do something like this, using Index.append to concatenate the unchanged names with the parsed dates:
df.columns = df.columns[:2].append(pd.to_datetime(df.columns[2:]))
It seems pandas will accept a datetime object as a column name...
import pandas as pd
from datetime import datetime
import re
columns = ["Name", "2019-01-01","2019-01-02"]
data = [["Tom", 1,0], ["Dick",1,1], ["Harry",0,0]]
df = pd.DataFrame(data, columns = columns)
print(df)
newcolumns = {}
for col in df.columns:
    if re.search(r"\d+-\d+-\d+", col):
        newcolumns[col] = pd.to_datetime(col)
    else:
        newcolumns[col] = col
print(newcolumns)
df.rename(columns = newcolumns, inplace = True)
print("--------------------")
print(df)
print("--------------------")
for col in df.columns:
    print(type(col), col)
OUTPUT:
    Name  2019-01-01  2019-01-02
0    Tom           1           0
1   Dick           1           1
2  Harry           0           0
{'Name': 'Name', '2019-01-01': Timestamp('2019-01-01 00:00:00'), '2019-01-02': Timestamp('2019-01-02 00:00:00')}
--------------------
    Name  2019-01-01 00:00:00  2019-01-02 00:00:00
0    Tom                    1                    0
1   Dick                    1                    1
2  Harry                    0                    0
--------------------
<class 'str'> Name
<class 'pandas._libs.tslibs.timestamps.Timestamp'> 2019-01-01 00:00:00
<class 'pandas._libs.tslibs.timestamps.Timestamp'> 2019-01-02 00:00:00
For brevity you can use...
newcolumns = {col: (pd.to_datetime(col) if re.search(r"\d+-\d+-\d+", col) else col) for col in df.columns}
df.rename(columns = newcolumns, inplace = True)

Pandas dataframe, delete rows in between 2 rows that have same values in some columns

Given a pandas dataframe, how would I delete all rows that are in between 2 rows that have the same values in 2 specific columns? In my case I have columns x, y and id. If an x-y pair appears twice in the dataframe, I would like to delete all rows that lie in between those 2 occurrences.
Example:
import pandas as pd
df1 = pd.DataFrame({'x':[1,2,3,2,1,3,4],
                    'y':[1,2,3,4,3,3,4],
                    'id':[1,2,3,4,5,6,7]})
^ ^
As you can see the value pair x=3,y=3 appears twice in the dataframe, once at id=3, once at id=6.
How could I spot these rows and drop all rows in between?
So that I would get this for example:
df1 = pd.DataFrame({'x':[1,2,3,4],
                    'y':[1,2,3,4],
                    'id':[1,2,3,7]})
The dataframe could also look like the next example, where there are more "duplicates" (here the 4,2 pair). I want to spot the outermost duplicates, so that by deleting the rows in between them, all other pairs that appear twice or more are eliminated too. For example:
df1 = pd.DataFrame({'x':[1,2,3,4,1,4,3,4],
                    'y':[1,2,3,2,3,2,3,4],
                    'id':[1,2,3,4,5,6,7,8]})
^ ^ ^ ^
out in in out
#should become:
df1 = pd.DataFrame({'x':[1,2,3,4],
                    'y':[1,2,3,4],
                    'id':[1,2,3,8]})
For my example this should cause a kind of loop elimination in the graph that I represent with the dataframe.
How would I implement that?
One of possible solutions:
Let's start by creating your DataFrame (here I omitted the required import):
d = {'id': [1,2,3,4,5,6,7,8], 'x': [1,2,3,4,1,4,3,4], 'y': [1,2,3,2,3,2,3,4]}
df = pd.DataFrame(data=d)
Note that the index values are consecutive numbers (from 0), which will be used later.
Then we have to find duplicated rows, marking all instances (keep=False):
dups = df[df.duplicated(subset=['x', 'y'], keep=False)]
These duplicates should then be grouped on x and y:
gr = dups.groupby(['x', 'y'])
Then, the number of the group to which each row belongs should be added to df, e.g. as a grpNo column.
df['grpNo'] = gr.ngroup()
The next step is to find the first and last index of the rows that were grouped within the first group (group No == 0) and save them in ind1 and ind2.
ind1 = df[df['grpNo'] == 0].index[0]
ind2 = df[df['grpNo'] == 0].index[-1]
Then we find a list of index values to be deleted:
indToDel = df[(df.index > ind1) & (df.index <= ind2)].index
To perform actual deletion of rows, we should execute:
df.drop(indToDel, inplace=True)
And the last step is to delete the grpNo column, which is not needed any more.
df.drop('grpNo', axis=1, inplace=True)
The result is:
   id  x  y
0   1  1  1
1   2  2  2
2   3  3  3
7   8  4  4
So the whole script can be as follows:
import pandas as pd
d = {'id': [1,2,3,4,5,6,7,8], 'x': [1,2,3,4,1,4,3,4], 'y': [1,2,3,2,3,2,3,4]}
df = pd.DataFrame(data=d)
dups = df[df.duplicated(subset=['x', 'y'], keep=False)]
gr = dups.groupby(['x', 'y'])
df['grpNo'] = gr.ngroup()
ind1 = df[df['grpNo'] == 0].index[0]
ind2 = df[df['grpNo'] == 0].index[-1]
indToDel = df[(df.index > ind1) & (df.index <= ind2)].index
df.drop(indToDel, inplace=True)
df.drop('grpNo', axis=1, inplace=True)
print(df)
This works for both your examples, although I'm not sure it generalizes to all the examples you have in mind:
df1[df1['x']==df1['y']]
