Using Pandas to join and append columns in a loop - python

I want to append columns from tables generated in a loop to a dataframe. I was hoping to accomplish this using pandas.merge, but it doesn't seem to be working out for me.
My code:
from datetime import date
from datetime import timedelta
import pandas
import numpy
import pyodbc

date1 = date(2017, 1, 1)   # Starting date
date2 = date(2017, 1, 10)  # Ending date
DateDelta = date2 - date1
DateAdd = DateDelta.days
StartDate = date1
count = 1

# Create the holding table
conn = pyodbc.connect('Server Information')
basetable = pandas.read_sql("SELECT....")

while count <= DateAdd:
    print(StartDate)
    datatable = pandas.read_sql("SELECT...WHERE Date = " + str(StartDate) + "...")
    finaltable = basetable.merge(datatable, how='left', left_on='OrganizationName', right_on='OrganizationName')
    StartDate = StartDate + timedelta(days=1)
    count = count + 1

print(finaltable)
Shortened the select statements for brevity's sake, but the tables produced look like this:
Basetable

School_District
---------------
District_Alpha
District_Beta
...
District_Zed

Datatable

School_District|2016-01-01|
---------------|----------|
District_Alpha |   400    |
District_Beta  |   300    |
...            |   200    |
District_Zed   |   100    |
I have the datatable written so the column takes the name of the date selected for that particular loop iteration, so column names will be unique once I get this up and running. My problem, however, is that the above code only produces one column of data. I have a good guess as to why: only the last merge is being kept. I thought pandas.append would be the way around that, but append doesn't join the way merge does. Is there some other way to accomplish a sort of join-and-append using pandas? My goal is to keep this flexible so that other dates can easily be input depending on our data needs.
In the end, what I want to see is:
School_District|2016-01-01|2016-01-02|... |2016-01-10|
---------------|----------|----------|-----|----------|
District_Alpha | 400 | 1 | | 45 |
District_Beta | 300 | 2 | | 33 |
... | 200 | 3 | | 5435 |
District_Zed | 100 | 4 | | 333 |

Your error is in the statement finaltable = basetable.merge(datatable,...). At each loop iteration, you merge the original basetable with the new datatable, store the result in the finaltable... and discard it. What you need is basetable = basetable.merge(datatable,...). No finaltables.
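A minimal sketch of the corrected loop (same truncated queries as in the question; only the merge target changes):

basetable = pandas.read_sql("SELECT....")
while count <= DateAdd:
    datatable = pandas.read_sql("SELECT...WHERE Date = " + str(StartDate) + "...")
    # Accumulate each day's column onto the running result instead of
    # merging against the original base table every time
    basetable = basetable.merge(datatable, how='left', on='OrganizationName')
    StartDate = StartDate + timedelta(days=1)
    count = count + 1
print(basetable)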

Related

How can I copy values from one dataframe column to another based on the difference between the values

I have two csv mirror files generated by two different servers. Both files have the same number of lines and should have the exact same unix timestamp column. However, due to some clock issues, some records in one file might have a small difference of a nanosecond from their counterpart records in the other csv file; see the example below, where the difference is always 1:
dataframe_A                          dataframe_B
|   | ts_ns               |          |   | ts_ns               |
|---|---------------------|          |---|---------------------|
| 1 | 1661773636777407794 |          | 1 | 1661773636777407793 |
| 2 | 1661773636786474677 |          | 2 | 1661773636786474677 |
| 3 | 1661773636787956823 |          | 3 | 1661773636787956823 |
| 4 | 1661773636794333099 |          | 4 | 1661773636794333100 |
Since these are huge files with millions of lines, I use pandas and dask to process them, but before I process them I need to ensure they have the same timestamp column.
I need to check the difference between column ts_ns in A and B and if there is a difference of 1 or -1 I need to replace the value in B with the corresponding ts_ns value in A so I can finally have the same ts_ns value in both files for corresponding records.
How can I do this in a decent way using pandas/dask?
If you're sure that the timestamps should be identical, why don't you simply use the timestamp column from dataframe A and overwrite the timestamp column in dataframe B with it?
Why even check whether the difference is there or not?
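That overwrite is a one-liner (a sketch, assuming the rows of both frames line up positionally, as in the example):

# Positional overwrite: row i of B gets row i of A's timestamp
df_b['ts_ns'] = df_a['ts_ns'].to_numpy()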
You can use the pandas merge_asof function for this; see https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.merge_asof.html. The tolerance parameter accepts an int or timedelta and should be set to 1 for your example, with direction set to 'nearest'.
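A sketch of that approach (assuming both frames are already time-ordered, as in the example, since merge_asof requires sorted keys):

import pandas as pd

df_a = pd.DataFrame({'ts_ns': [1661773636777407794, 1661773636786474677,
                               1661773636787956823, 1661773636794333099]})
df_b = pd.DataFrame({'ts_ns': [1661773636777407793, 1661773636786474677,
                               1661773636787956823, 1661773636794333100]})

# merge_asof keeps only the left frame's key column, so carry A's
# timestamp along as a payload column
a = df_a.assign(ts_ns_a=df_a['ts_ns'])

# direction='nearest' with tolerance=1 matches each B row to the closest
# A row at most 1 ns away
matched = pd.merge_asof(df_b, a, on='ts_ns', tolerance=1, direction='nearest')

# Every row matches in this example; an unmatched row would leave NaN in
# ts_ns_a (and upcast the column to float), so guard against that in production
df_b['ts_ns'] = matched['ts_ns_a'].to_numpy()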
Assuming your files are identical except for the ts_ns column, you can perform a .merge on indices.

import pandas as pd
import numpy as np

df_a = pd.DataFrame({'ts_ns': [1661773636777407794, 1661773636786474677, 1661773636787956823, 1661773636794333099]})
df_b = pd.DataFrame({'ts_ns': [1661773636777407793, 1661773636786474677, 1661773636787956823, 1661773636794333100]})

df_b = (df_b
    .merge(df_a, how='left', left_index=True, right_index=True, suffixes=('', '_a'))
    .assign(
        # take A's value wherever B differs from A by at most 1
        ts_ns=lambda df_: np.where(abs(df_.ts_ns - df_.ts_ns_a) <= 1, df_.ts_ns_a, df_.ts_ns)
    )
    .loc[:, ['ts_ns']]
)
But I agree with @ManEngel: just overwrite all the values if you know they are identical.

Calculate difference in days in one single column based on the values in another column (pandas)

I have a pandas df (called df2) like this:
id  | orderdate
----|-----------
123 | 2020-11-01
123 | 2020-08-01
233 | 2020-07-01
233 | 2020-11-04
444 | 2020-11-04
444 | 2020-05-03
444 | 2020-04-01
444 | 2020-11-25
The values of orderdate are datetimes with the format '%Y-%m-%d'. They represent orders of a client. I want to calculate the time delta between the first order and the second one for each id (each client).
I come up with:
for i in list(set(df2.id)):
    # sorted list of the distinct order dates, ascending
    list_sorted = sorted(set(df2.loc[df2['id'] == i, 'orderdate']))
    min_list = list_sorted[0]  # first element is the first order
    df2.loc[df2['id'] == i, 'First Order'] = min_list
    if len(list_sorted) > 1:
        penultimate_list = list_sorted[1]  # second element is the second order
        df2.loc[df2['id'] == i, 'Second Order'] = penultimate_list
        # calculate delta (second order minus first order)
        df2.loc[df2['id'] == i, 'Delta orders'] = penultimate_list - min_list
    else:
        df2.loc[df2['id'] == i, 'Delta orders'] = None
My expected outcome is:
id  | orderdate  | First Order | Second Order | Delta orders
----|------------|-------------|--------------|-------------
123 | 2020-11-01 | 2020-08-01  | 2020-11-01   | 92 days
123 | 2020-08-01 | 2020-08-01  | 2020-11-01   | 92 days
233 | 2020-07-01 | 2020-07-01  | 2020-11-04   | 126 days
233 | 2020-11-04 | 2020-07-01  | 2020-11-04   | 126 days
444 | 2020-11-04 | 2020-04-01  | 2020-05-03   | 32 days
444 | 2020-05-03 | 2020-04-01  | 2020-05-03   | 32 days
444 | 2020-04-01 | 2020-04-01  | 2020-05-03   | 32 days
444 | 2020-11-25 | 2020-04-01  | 2020-05-03   | 32 days
It works but I feel like it's cumbersome. Any easier way to do it?
Slightly different from what you want, but it's a start:
import pandas as pd
from io import StringIO

data = StringIO(
"""id|orderdate
123|2020-11-01
123|2020-08-01
233|2020-07-01
233|2020-11-04
444|2020-11-04
444|2020-05-03
444|2020-04-01
444|2020-11-25""")

df = pd.read_csv(data, sep='|')
df['orderdate'] = pd.to_datetime(df['orderdate'], infer_datetime_format=True)
df = df.sort_values(['id', 'orderdate'], ascending=False)

def date_diff(df):
    df['order_time_diff'] = (df['orderdate'] - df['orderdate'].shift(-1)).dt.days
    df = df.dropna()
    return df

# this calculates all order differences
df.groupby('id').apply(date_diff)

# this will get the data as requested
df.groupby('id', as_index=False).apply(date_diff).groupby('id').tail(1)
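For the exact columns in the expected output, here is a vectorized sketch using groupby/transform (the helper second_order is hypothetical; it assumes df2 as defined in the question, with orderdate already a datetime column):

import pandas as pd

def second_order(s):
    # second-smallest distinct order date in the group, NaT when there is only one
    vals = s.drop_duplicates().nsmallest(2)
    return vals.iloc[1] if len(vals) > 1 else pd.NaT

df2['First Order'] = df2.groupby('id')['orderdate'].transform('min')
df2['Second Order'] = df2.groupby('id')['orderdate'].transform(second_order)
df2['Delta orders'] = df2['Second Order'] - df2['First Order']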

Date difference from a list in pandas dataframe

I have a pandas dataframe of text data. I created it by doing a group-by and aggregate to get the texts per id, like below. I later calculated the word count.
df = df.groupby('id') \
       .agg({'chat': ', '.join}) \
       .reset_index()
It looks like this:
chat is the collection of the text data per id. created_at holds the dates of the chats, converted to string type.
|id|chat |word count|created_at |
|23|hi,hey!,hi|3 |2018-11-09 02:11:24,2018-11-09 02:11:43,2018-11-09 03:13:22|
|24|look there|2 |2017-11-03 18:05:34,2017-11-06 18:03:22 |
|25|thank you!|2 |2017-11-07 09:18:01,2017-11-18 11:09:37 |
I want to add a chat duration column that gives the difference between the first date and the last date, in days, as an integer. If the chat ends the same day, then 1. The new expected column is:
|chat_duration|
|1 |
|3 |
|11 |
Copied to the clipboard, the data looks like this before the group-by:
,id,chat,created_at
0,23,"hi",2018-11-09 02:11:24
1,23,"hey!",2018-11-09 02:11:43
2,23,"hi",2018-11-09 03:13:22
If I were doing the entire process, beginning with the unprocessed data:
id,chat,created_at
23,"hi i'm at school",2018-11-09 02:11:24
23,"hey! how are you",2018-11-09 02:11:43
23,"hi mom",2018-11-09 03:13:22
24,"leaving home",2018-11-09 02:11:24
24,"not today",2018-11-09 02:11:43
24,"i'll be back",2018-11-10 03:13:22
25,"yesterday i had",2018-11-09 02:11:24
25,"it's to hot",2018-11-09 02:11:43
25,"see you later",2018-11-12 03:13:22
# create the dataframe with this data on the clipboard
df = pd.read_clipboard(sep=',')
set created_at to datetime
df.created_at = pd.to_datetime(df.created_at)
create word_count
df['word_count'] = df.chat.str.split(' ').map(len)
groupby and agg to get all chat as a string, created_at as a list, and word_count as a total sum.
df = df.groupby('id').agg({'chat': ','.join , 'created_at': list, 'word_count': sum}).reset_index()
calculate chat_duration
df['chat_duration'] = df['created_at'].apply(lambda x: (max(x) - min(x)).days)
convert created_at to desired string format
If you skip this step, created_at will be a list of datetimes.
df['created_at'] = df['created_at'].apply(lambda x: ','.join([y.strftime("%m/%d/%Y %H:%M:%S") for y in x]))
Final df
| | id | chat | created_at | word_count | chat_duration |
|---:|-----:|:------------------------------------------|:------------------------------------------------------------|-------------:|----------------:|
| 0 | 23 | hi i'm at school,hey! how are you,hi mom | 11/09/2018 02:11:24,11/09/2018 02:11:43,11/09/2018 03:13:22 | 10 | 0 |
| 1 | 24 | leaving home,not today,i'll be back | 11/09/2018 02:11:24,11/09/2018 02:11:43,11/10/2018 03:13:22 | 7 | 1 |
| 2 | 25 | yesterday i had,it's to hot,see you later | 11/09/2018 02:11:24,11/09/2018 02:11:43,11/12/2018 03:13:22 | 9 | 3 |
After some tries I got it.
First convert the string to a list:
df['created_at'] = df['created_at'].str.split(',').apply(lambda s: list(s))
Then subtract the max and min date items:
df['created_at'] = df['created_at'].apply(
    lambda s: (datetime.strptime(str(max(s)), '%Y-%m-%d') - datetime.strptime(str(min(s)), '%Y-%m-%d')).days)
Create a DataFrame by split, then subtract the first and last columns converted to datetimes:
df1 = df['created_at'].str.split(',', expand=True).ffill(axis=1)
df['created_at'] = (pd.to_datetime(df1.iloc[:, -1]) - pd.to_datetime(df1.iloc[:, 0])).dt.days
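One detail none of the snippets above handle: the question counts a same-day chat as 1, while .days yields 0 in that case. A clip fixes that (a sketch, applied to whichever duration column you computed):

# The question's rule: a chat that starts and ends the same day counts as 1 day
df['chat_duration'] = df['chat_duration'].clip(lower=1)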

How do I create a heatmap from two columns plus the value of those two in Python

I would like to generate a heatmap in Python from the dataframe df.
(I am using pandas, seaborn, numpy, and matplotlib in my project.)
The dataframe df looks like:
index | a | b | c | year | month
------|---|---|---|------|------
0     |   |   |   | 2013 | 1
1     |   |   |   | 2015 | 4
2     |   |   |   | 2016 | 10
3     |   |   |   | 2017 | 1
In the dataset, each row is a ticket.
The dataset is big (51 columns and 100k+ rows), so a, b, c are just placeholder columns. (For month: 1 = Jan, 2 = Feb, ...)
For the heatmap:
x-axis = year,
y-axis = month,
value: the count of rows, i.e. how many tickets were given in that year and month.
The result I imagine should look something like this example from the seaborn documentation:
https://seaborn.pydata.org/_images/seaborn-heatmap-4.png
I am new to coding; I have tried a lot of things I found on the internet and have not been able to make it work.
Thank you for helping!
This should do (with generated data):
import pandas as pd
import seaborn as sns
import random

# generate some fake (year, month) ticket data
y = [random.randint(2013, 2017) for n in range(2000)]
m = [random.randint(1, 12) for n in range(2000)]
df = pd.DataFrame([y, m]).T
df.columns = ['y', 'm']

# count rows per (year, month), then pivot months x years for the heatmap
df['count'] = 1
df2 = df.groupby(['y', 'm'], as_index=False).count()
df_p = pd.pivot_table(df2, values='count', index='m', columns='y')
sns.heatmap(df_p)
You probably won't need the column count but I added it because I needed an extra column for the groupby to work.
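As an aside, the dummy column can be avoided entirely with pd.crosstab, which counts the (month, year) pairs directly (a sketch, assuming the question's year and month columns):

import pandas as pd
import seaborn as sns

# Count the tickets in each (month, year) cell and plot directly
sns.heatmap(pd.crosstab(df['month'], df['year']))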

Pandas calculate last few months data in new column

I have a dataframe in the following format:
ID | 01/01/2016 | 02/03/2016 | 02/15/2016 | ........
11 | 100        | 200        | 100        | ........
I am trying to calculate the sum of, e.g., the last 3 months of data in a new column. The expected output is as follows:
ID | 01/01/2016 | 02/03/2016 | 02/15/2016 | ........ | Last 3 Months
11 | 100        | 200        | 100        | ........ | 300
As a solution, I need to pick today's date, compare it with the dates in the columns, and sum up the values. However, I am not sure how to do that. Could you please give some tips?
Thank you.
This is not as straightforward as it may initially seem. You need to decide how you will handle year-to-year changes and the differing number of days in each month. I do this with a simple function. You can adjust the code below to meet your needs, but it should get you started.
from __future__ import division, print_function
import datetime as dt
import calendar
import numpy as np
import pandas as pd

def subtract_months(m):
    '''Subtracts the specified number of months from the current date.

    Parameters
    ----------
    m : integer
        how many months to subtract from today's date

    Returns
    -------
    date : datetime.date value
    '''
    yr = dt.date.today().year
    mon = dt.date.today().month - m
    day = dt.date.today().day
    # test whether we went into another year
    if mon <= 0:
        yr -= 1
        mon = 12 + mon
    # test whether we have exceeded the maximum number of days in the month
    if day > calendar.monthrange(yr, mon)[1]:
        day = calendar.monthrange(yr, mon)[1]
    return dt.date(yr, mon, day)

dates = pd.date_range('20160101', '20170101', freq='1D')
data = np.random.randint(0, 100, (5, 367))
df = pd.DataFrame(data=data, index=list('ABCDE'), columns=dates)

# now add a new column: sum of all columns whose date falls in the window
df['Last 3 Months'] = df.T.truncate(before=subtract_months(3), after=dt.date.today()).sum(axis=0)
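For what it's worth, the same month arithmetic, including the clamping to the last valid day of the month, can be handed off to dateutil (a sketch, assuming the python-dateutil package is available):

import datetime as dt
from dateutil.relativedelta import relativedelta

# relativedelta clamps the day automatically, e.g. May 31 minus 3 months
# becomes Feb 28 (or 29 in a leap year)
three_months_ago = dt.date.today() - relativedelta(months=3)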
