Calculating date differences in Excel data - python

I have a scenario where I have to read an Excel file and calculate the date difference for each status and store the output in another Excel file.
date name status
1/15/2017 ABC insert_start
1/16/2017 ABC insert_complete
1/17/2017 DEF remove_start
1/18/2017 DEF remove_complete
1/19/2017 GHI create_start
1/20/2017 GHI create_complete
I need the output in the following format:
name created inserted removed
ABC 0 1 0
DEF 0 0 1
GHI 1 0 0
Here the value 1 is the date difference (in days) it took ABC to go from insert_start to insert_complete.
Any help would be greatly appreciated.

Let's say df is the dataframe created by loading the Excel file (which looks like the one in your example). You might have loaded it with
import pandas as pd
df = pd.read_csv('foo.csv', sep=r'\s+', parse_dates=['date'])
Now, you can do this:
pivoted = df.pivot(index='name', columns='status').fillna(0)
ops = ("create", "insert", "remove")
result = pd.concat([pivoted['date', op + '_complete']
                    - pivoted['date', op + '_start']
                    for op in ops], axis=1)
result.columns = ops
# create insert remove
#name
#ABC 0 days 1 days 0 days
#DEF 0 days 0 days 1 days
#GHI 1 days 0 days 0 days
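The columns of result hold Timedelta values ("0 days", "1 days"). If you want plain integers like in the desired output and then write them to another Excel file, a minimal sketch (assuming an Excel writer such as openpyxl is installed; the output file name is just an example) would be:
result = result.apply(lambda col: col.dt.days)  # Timedelta -> whole days as integers
result.to_excel('date_differences.xlsx')        # hypothetical output file name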

Related

Distinguish repeating column names by adding an integer using pandas

I have some columns that have the same names. I would like to add a 1 to the repeating column names.
Data
Date Type hi hello stat hi hello
1/1/2022 a 0 0 1 1 0
Desired
Date Type hi hello stat hi1 hello1
1/1/2022 a 0 0 1 1 0
Doing
mask = df['col2'].duplicated(keep=False)
I believe I can utilize mask, but I'm not sure how to achieve this efficiently without naming the actual column. I would like to run it over the full dataset and let the algorithm update the duplicates.
Any suggestion is appreciated
Use the built-in parser method _maybe_dedup_names():
df.columns = pd.io.parsers.base_parser.ParserBase({'usecols': None})._maybe_dedup_names(df.columns)
# Date Type hi hello stat hi.1 hello.1
# 0 1/1/2022 a 0 0 1 1 0
This is what pandas uses to deduplicate column headers from read_csv().
Note that it scales to any number of duplicate names:
cols = ['hi'] * 3 + ['hello'] * 5
pd.io.parsers.base_parser.ParserBase({'usecols': None})._maybe_dedup_names(cols)
# ['hi', 'hi.1', 'hi.2', 'hello', 'hello.1', 'hello.2', 'hello.3', 'hello.4']
In pandas < 1.3:
df.columns = pd.io.parsers.ParserBase({})._maybe_dedup_names(df.columns)
You need to apply the duplicated operation to the column names, and then map the duplication information to a string, which you can then append to the original column names.
df.columns = df.columns+[{False:'',True:'1'}[x] for x in df.columns.duplicated()]
We can do
s = df.columns.to_series().groupby(df.columns).cumcount().replace({0:''}).astype(str).radd('.')
df.columns = (df.columns + s).str.strip('.')
df
Out[153]:
Date Type hi hello stat hi.1 hello.1
0 1/1/2022 a 0 0 1 1 0
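If you would rather not rely on pandas internals (the location of _maybe_dedup_names has moved between versions), here is a small self-contained sketch of my own that renames duplicates with a running counter, following the same .1, .2 convention as read_csv (dedup_columns is a hypothetical helper, not a pandas function):
from collections import defaultdict

def dedup_columns(cols):
    # append .1, .2, ... to the second and later occurrences of each name
    seen = defaultdict(int)
    out = []
    for c in cols:
        out.append(c if seen[c] == 0 else f"{c}.{seen[c]}")
        seen[c] += 1
    return out

df.columns = dedup_columns(df.columns)
# ['Date', 'Type', 'hi', 'hello', 'stat', 'hi.1', 'hello.1']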

CSV: alternative to excel "IF" statement in python. Read column and create a new one with numpy.where or other function

I have a CSV file with several columns and I want to write code that will read a specific column called 'ARPU average 6 month w/t roaming and discount' and then create a new column called "Logical" based on numpy.where(). Here is what I have at the moment:
csv_data = pd.read_csv("Results.csv")
data = csv_data[['ARPU average 6 month w/t roaming and discount']]
data = data.to_numpy()
sol = []
for target in data:
    if1 = np.where(data < 0, 1, 0)
    sol.append(if1)
csv_data["Logical"] = [sol].values
csv_data.to_csv ('Results2.csv', index = False, header=True)
This loop is written incorrectly and does not work: it does not create a new column with the corresponding value for each row. To make it clear: if the value in the column is bigger than 0, it should record "1", otherwise "0". The solution can be done any way (neither np.where() nor a loop is required).
To clarify what "Results.csv" is: it is a big file with data, and the column we work with is the one named above. The code needs to check whether each value in that column is bigger than 0 and write back 1 or 0 in the new column (as described in the question).
updated answer
import pandas as pd
f1 = pd.read_csv("L1.csv")
f2 = pd.read_csv("L2.csv")
f3 = pd.merge(f1, f2, on ='CTN', how ='inner')
# f3.to_csv("Results.csv") # -> you do not need to save the file to a csv unless you really want to
# csv_data = pd.read_csv("Results.csv") # -> f3 is already saved in memory you do not need to read it again
# data = csv_data[['ARPU average 6 month w/t roaming and discount']] # -> you do not need this line
f3['Logical'] = (f3['ARPU average 6 month w/t roaming and discount']>0).astype(int)
f3.to_csv('Results2.csv', index = False, header=True)
original answer
Generally you do not need to use a loop when using pandas or numpy. Take this sample dataframe: df = pd.DataFrame([0,1,2,3,0,0,0,1], columns=['data'])
You can simply use the boolean values returned (where column is greater than 0 return 1 else return 0) to create a new column.
df['new_col'] = (df['data'] > 0).astype(int)
data new_col
0 0 0
1 1 1
2 2 1
3 3 1
4 0 0
5 0 0
6 0 0
7 1 1
Or, if you want to use numpy:
df['new_col'] = np.where(df['data']>0, 1, 0)
data new_col
0 0 0
1 1 1
2 2 1
3 3 1
4 0 0
5 0 0
6 0 0
7 1 1
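One caveat worth adding (not part of the original answer): if the column contains missing values, df['data'] > 0 evaluates NaN as False, so those rows end up as 0. If you would rather keep them as missing, a small sketch:
df['new_col'] = np.where(df['data'] > 0, 1, 0)
df.loc[df['data'].isna(), 'new_col'] = np.nan  # note: the column becomes float once it holds NaN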

Pandas Dataframe reads value of one column wrong

The csv file:
Link to Github
This is my code:
import pandas as pd
df = pd.read_csv("log_1_2018_09_07.csv", encoding="ISO-8859-1", delimiter=';')
print(df.columns.tolist())
dates = []
times = []
outputs = []
for date in df.loc[:, "Datum"]:
    dates.append(date)
    print("date")
    print(date)
for time in df.loc[:, " Zeit"]:
    times.append(time)
    print("time")
    print(time)
for out in df.iloc[:, 19]:
    print("output")
    outputs.append(out)
    print(out)
It reads the dates and times correctly, but in the 19th column (column T) the values should be all 0 with the 6th value being 990; however, pandas reads it as all 0 with the 9th value as 1.
Does anybody know why it's reading the wrong values?
Thank you!!
import pandas as pd
url = 'https://raw.github.com/liamrisch/helper/master/log_1_2018_09_07.csv'
df = pd.read_csv(url, encoding="ISO-8859-1", delimiter=';')
df.iloc[:,[6,19]]
Gives:
Teil 1-8 - Abstand Rasthaken MP1-MP2 Teil 1-8
0 26,764 0
1 26,787 0
2 26,792 0
3 26,788 0
4 26,771 0
5 999,990 0
6 26,786 0
7 26,785 0
8 26,780 1
9 26,783 0
10 26,798 0
Take a very close look at the data: the value in that cell is actually 1, but since the 999,990 in the neighbouring column takes up more visual space, it creates the illusion that 999 is the value in that cell.
Printing that column of the df (before any manipulation) shows the actual values of the column, without any surprises.
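To convince yourself, you can print the raw values of that column directly (a small check reusing the df loaded above):
print(df.iloc[:, 19].tolist())        # the actual values pandas read for column T
print(df.iloc[:, 19].value_counts())  # how many 0s and 1s there really are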

Add values in a dataframe row based on a specific condition

I have a dataframe with columns. The first is filled with timestamps. I want to create a new column and add 0 or 1 based on the hour value of each timestamp. For example, if %H >= "03" -> 1 else 0.
The df looks like that:
2018-08-29T00:03:09 12 0
2018-08-23T00:08:10 2 0
And I want to change the values in the 3rd column to "1" as described. Thank you all in advance for the effort!
Let's say you have a dataframe like the following:
import pandas as pd
from datetime import datetime
d={'time':['2018-08-29T00:03:09', '2018-08-29T12:03:09', '2018-08-31T10:05:09'],
'serial':[1,2,3]}
data=pd.DataFrame(data=d)
data
time serial
0 2018-08-29T00:03:09 1
1 2018-08-29T12:03:09 2
2 2018-08-31T10:05:09 3
Define a function based on which the new column's values will be obtained:
def func(t):
    if datetime.strptime(t, '%Y-%m-%dT%H:%M:%S').hour >= 3:
        return 1
    else:
        return 0
Now insert a new column, applying the function to get the values:
data.insert(2, 'new_column', data['time'].apply(lambda x: func(x)).tolist())
This is the updated dataframe
data
time serial new_column
0 2018-08-29T00:03:09 1 0
1 2018-08-29T12:03:09 2 1
2 2018-08-31T10:05:09 3 1
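As an aside (not part of the original answer), if you let pandas parse the timestamps, the same column can be built without a Python-level function:
data['new_column'] = (pd.to_datetime(data['time']).dt.hour >= 3).astype(int)  # vectorized equivalent of func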

Getting a new series conditional on some rows being present in Python and Pandas

I could not think of a better name for what I am trying to do; edits welcome. Here is what I want to do.
I have store, date, and product indices and a column called price.
I have two unique products 1 and 2.
But for each store, I don't have an observation for every date, and for every date, I don't have both products necessarily.
I want to create a series for each store that is indexed only by the dates when both products are present, because I want the value of the series to be the product 1 price divided by the product 2 price.
This is a highly unbalanced panel, and my current workaround is a horrible 75 or so lines of code, so I appreciate any tips. This will be very useful in the future.
Data looks like below.
weeknum Location_Id Item_Id averageprice
70 201138 8501 1 0.129642
71 201138 8501 2 0.188274
72 201138 8502 1 0.129642
73 201139 8504 1 0.129642
Expected output in this simple case would be:
weeknum Location_Id averageprice
? 201138 8501 0.129642/0.188274
Since that is the only one with every requirement met.
I think this could be a join on the two sub-frames (but perhaps there is a cleaner pivot-y way; see the sketch after this answer):
In [11]: res = pd.merge(df[df['Item_Id'] == 1], df[df['Item_Id'] == 2],
                        on=['weeknum', 'Location_Id'])
In [12]: res
Out[12]:
weeknum Location_Id Item_Id_x averageprice_x Item_Id_y averageprice_y
0 201138 8501 1 0.129642 2 0.188274
Now you can divide those two columns in the result:
In [13]: res['price'] = res['averageprice_x'] / res['averageprice_y']
In [14]: res
Out[14]:
weeknum Location_Id Item_Id_x averageprice_x Item_Id_y averageprice_y price
0 201138 8501 1 0.129642 2 0.188274 0.688582
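For the "cleaner pivot-y way" hinted at above, a sketch of my own using pivot_table (assuming the same column names as in the question) could be:
# one averageprice column per Item_Id; pairs missing a product get NaN
wide = df.pivot_table(index=['weeknum', 'Location_Id'],
                      columns='Item_Id', values='averageprice')
relprice = (wide[1] / wide[2]).dropna()  # keeps only (weeknum, store) pairs where both products exist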
Example data similar to yours:
weeknum loc_id item_id avg_price
0 1 8 1 8
1 1 8 2 9
2 1 9 1 10
3 2 10 1 11
First create a date mask that gets you the correct dates:
df_group = df.groupby(['loc_id', 'weeknum'])
df = df.join(df_group.item_id.apply(lambda x: len(x.unique()) == 2),
             on=['loc_id', 'weeknum'], rsuffix='_r')
weeknum loc_id item_id avg_price item_id_r
0 1 8 1 8 True
1 1 8 2 9 True
2 1 9 1 10 False
3 2 10 1 11 False
This gives you a boolean mask flagging, for each store and date group, whether exactly two unique Item_Ids are present. From this you can now apply the function that concatenates your prices:
df[df.item_id_r].groupby(['loc_id','weeknum']).avg_price.apply(lambda x: '/'.join([str(y) for y in x]))
loc_id weeknum
8 1 8/9
It's a bit verbose and lots of lambdas but it will get you started and you can refactor to make faster and/or more concise if you want.
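If you want the numeric ratio rather than a joined string (my own addition, reusing the item_id_r mask built above), you could do something like:
ratio = (df[df.item_id_r]
         .sort_values('item_id')                   # ensure item 1 comes before item 2 in each group
         .groupby(['loc_id', 'weeknum']).avg_price
         .apply(lambda x: x.iloc[0] / x.iloc[1]))  # price of item 1 / price of item 2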
Let's say your full dataset is called TILPS. Then you might try this:
from __future__ import division
import pandas as pd

# Get list of unique dates present in TILPS
datelist = list(TILPS.loc[:, 'datetime'].unique())
# Get list of unique stores present in TILPS
storelist = list(TILPS.loc[:, 'store'].unique())

# For a given date, extract relative price
def dateLevel(daterow):
    price1 = float(daterow.loc[daterow['Item_id'] == 1, 'averageprice'].unique()[0])
    price2 = float(daterow.loc[daterow['Item_id'] == 2, 'averageprice'].unique()[0])
    return pd.DataFrame(pd.Series({'relprice': price1 / price2}))

# For each store, extract relative price for each date
def storeLevel(group, datelist):
    info = {d: None for d in datelist}  # placeholder dict (not used below)
    exist = group.loc[group['datetime'].isin(datelist), ['weeknum', 'locid']]
    exist_gr = exist.groupby('datetime')
    relprices = exist_gr.apply(dateLevel)
    # Merge relprices with exist on INDEX
    exist = exist.merge(relprices, left_index=True, right_index=True)
    return exist

# Group TILPS by store
gr_store = TILPS.groupby('store')
fn = lambda x: storeLevel(x, datelist)
output = gr_store.apply(fn)

# Peek at output
print(output.head(30))
