Rearranging with pandas melt - python

I am trying to rearrange a DataFrame. Currently I have 1035 rows and 24 columns, one for each hour of the day. I want to turn this into an array with 1035*24 rows. If you want to see the data, it can be extracted from the following JSON endpoint:
url = "https://www.svk.se/services/controlroom/v2/situation?date={}&biddingArea=SE1"
svk = []
for i in parsing_range_svk:
data_json_svk = json.loads(urlopen(url.format(i)).read())
svk.append([v["y"] for v in data_json_svk["Data"][0]["data"]])
This is the code I am using to rearrange the data, but it is not doing the job. The first observation ends up in the right place, but then it gets messy, and I have not been able to figure out where each observation goes.
from datetime import datetime, timedelta
import pandas as pd

svk = pd.DataFrame(svk)
date_start1 = datetime(2020, 1, 1)
date_range1 = [date_start1 + timedelta(days=x) for x in range(1035)]
date_svk = pd.DataFrame(date_range1, columns=['date'])
svk['date'] = date_svk['date']
svk.drop(24, axis=1, inplace=True)
consumption_svk_1 = (svk.melt('date', value_name='SE1_C')
                        .assign(date=lambda x: x['date'] +
                                pd.to_timedelta(x.pop('variable').astype(float), unit='h'))
                        .sort_values('date', ignore_index=True))
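For reference, here is a minimal, self-contained sketch of the same melt-plus-offset pattern on toy data (two days and three "hour" columns instead of 24, with made-up values). It can help verify that the reshaping itself behaves as intended before running it on the full SVK data:

import pandas as pd

# toy frame: 2 days x 3 hourly columns named 0, 1, 2
toy = pd.DataFrame([[10, 11, 12], [20, 21, 22]])
toy['date'] = pd.to_datetime(['2020-01-01', '2020-01-02'])

long_toy = (toy.melt('date', value_name='SE1_C')
               .assign(date=lambda x: x['date'] +
                       pd.to_timedelta(x.pop('variable').astype(int), unit='h'))
               .sort_values('date', ignore_index=True))
print(long_toy)
# expected: one row per hour - 2020-01-01 00:00, 01:00, 02:00, then 2020-01-02 00:00, ...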

Related

Pandas Reindex Multiindex Dataframe Replicating Index

Thank you for taking a look! I am having issues with a 4-level MultiIndex and am attempting to make sure every possible value of the 4th index level is represented.
Here is my dataframe:
import numpy as np
import pandas as pd

np.random.seed(5)
size = 25
dict = {'Customer': np.random.choice(['Bob'], size),
        'Grouping': np.random.choice(['Corn','Wheat','Soy'], size),
        'Date': np.random.choice(pd.date_range('1/1/2018','12/12/2022', freq='D'), size),
        'Data': np.random.randint(20,100, size=(size))
        }
df = pd.DataFrame(dict)
# create the Sub-Group column
df['Sub-Group'] = np.nan
df.loc[df['Grouping'] == 'Corn', 'Sub-Group'] = np.random.choice(['White', 'Dry'], size=len(df[df['Grouping'] == 'Corn']))
df.loc[df['Grouping'] == 'Wheat', 'Sub-Group'] = np.random.choice(['SRW', 'HRW', 'SWW'], size=len(df[df['Grouping'] == 'Wheat']))
df.loc[df['Grouping'] == 'Soy', 'Sub-Group'] = np.random.choice(['Beans', 'Meal'], size=len(df[df['Grouping'] == 'Soy']))
df['Year'] = df.Date.dt.year
With that, I'm looking to create a groupby like the following:
(df.groupby(['Customer','Grouping','Sub-Group',df['Date'].dt.month,'Year'])
   .agg(Units = ('Data','sum'))
   .unstack()
)
This works as expected. I want to reindex this dataframe so that every single month (index level 3) is represented and filled with 0s. The reason I want this is that later on I'll be doing a cumulative sum over a groupby.
I have tried the following reindex and nothing happens - many months are still missing.
rere = pd.date_range('2018-01-01','2018-12-31', freq='M').month
(df.groupby(['Customer','Grouping','Sub-Group',df['Date'].dt.month,'Year'])
   .agg(Units = ('Data','sum'))
   .unstack()
   .fillna(0)
   .pipe(lambda x: x.reindex(rere, level=3, fill_value=0))
)
I've also tried the following:
(df.groupby(['Customer','Grouping','Sub-Group',df['Date'].dt.month,'Year'])
   .agg(Units = ('Data','sum'))
   .unstack()
   .fillna(0)
   .pipe(lambda x: x.reindex(pd.MultiIndex.from_product(x.index.levels)))
)
The issue with the last one is that the index is much too long - it's taking the Cartesian product of Grouping and Sub-Group, when really there are no combinations such as 'Wheat' as a Grouping with 'Dry' as a Sub-Group.
I'm looking for a flexible way to reindex this dataframe to make sure a specific index level (the month level, index 3, in this case) has every option.
Thanks so much for any help!
try this:
def reindex_sub(g: pd.DataFrame):
    # fill out the month level to the full 1-12 range within each group
    g = g.droplevel([0, 1, 2])
    result = g.reindex(range(1, 13))
    return result

tmp = (df.groupby(['Customer','Grouping','Sub-Group',df['Date'].dt.month,'Year'])
         .agg(Units = ('Data','sum'))
         .unstack()
      )
grouped = tmp.groupby(level=[0,1,2], group_keys=True)
out = grouped.apply(reindex_sub)
print(out)
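Since the stated end goal is a cumulative sum over a groupby, a possible follow-up on the reindexed result might look like this sketch (fillna(0) turns the newly introduced empty months into zeros so the running total carries across them):

cumulative = out.fillna(0).groupby(level=[0, 1, 2]).cumsum()
print(cumulative)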

Pandas DF, DateOffset, creating new column

So I'm working with the JHU COVID-19 data, and they've let their recovered dataset go - they're no longer tracking it, just confirmed cases and deaths. What I'm trying to do here is recreate it. The table holds the confirmed cases and deaths for every country for every date, sorted by date, and my getRecovered function below attempts to take the date of each row, find the date two weeks before it for that row's country, and return a 'Recovered' column: the confirmed count from two weeks ago minus the deaths today.
Maybe a pointless exercise, but I'd still like to know how to do it haha. I know it's a big dataset and there are a lot of operations there, but it's been running for 20 minutes now and still going. Did I do something wrong, or would it just take this long?
Thanks for any help, friends.
urls = [
    'https://raw.githubusercontent.com/CSSEGISandData/COVID-19/master/csse_covid_19_data/csse_covid_19_time_series/time_series_covid19_confirmed_global.csv',
    'https://raw.githubusercontent.com/CSSEGISandData/COVID-19/master/csse_covid_19_data/csse_covid_19_time_series/time_series_covid19_deaths_global.csv'
]
[wget.download(url) for url in urls]
confirmed = pd.read_csv('time_series_covid19_confirmed_global.csv')
deaths = pd.read_csv('time_series_covid19_deaths_global.csv')
dates = confirmed.columns[4:]
confirmed_long_form = confirmed.melt(
    id_vars=['Province/State', 'Country/Region', 'Lat', 'Long'],
    value_vars=dates,
    var_name='Date',
    value_name='Confirmed'
)
deaths_long_form = deaths.melt(
    id_vars=['Province/State', 'Country/Region', 'Lat', 'Long'],
    value_vars=dates,
    var_name='Date',
    value_name='Deaths'
)
full_table = confirmed_long_form.merge(
    right=deaths_long_form,
    how='left',
    on=['Province/State', 'Country/Region', 'Date', 'Lat', 'Long']
)
full_table['Date'] = pd.to_datetime(full_table['Date'])
full_table = full_table.sort_values(by='Date', ascending=True)
def getRecovered(row):
    ts = row['Date']
    country = row['Country/Region']
    ts = pd.Timestamp(ts)
    do = pd.tseries.offsets.DateOffset(n=14)
    newTimeStamp = ts - do
    oldrow = full_table.loc[(full_table['Date'] == newTimeStamp) & (full_table['Country/Region'] == country)]
    return oldrow['Confirmed'] - row['Deaths']

full_table['Recovered'] = full_table.apply(lambda row: getRecovered(row), axis=1)
full_table
Your function is being applied row by row, which is likely why performance is suffering. Pandas is fastest when you make use of vectorised operations. For example, you can use
pd.to_datetime(full_table['Date'])
to convert the whole date column much faster (see here: Convert DataFrame column type from string to datetime).
You can then add the date offset to that column, something like:
full_table['Recovery_date'] = pd.to_datetime(full_table['Date']) - pd.tseries.offsets.DateOffset(n = 14)
You can then self merge the table on date==recovery_date (plus any other keys) and subtract the numbers.
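To make that concrete, here is a rough sketch of such a self-merge (untested; the '_14d_ago' suffix is just a placeholder name), reusing the Recovery_date column created above:

# self-merge: pair each row with the same location's row from 14 days earlier
merged = full_table.merge(
    full_table[['Province/State', 'Country/Region', 'Date', 'Confirmed']],
    left_on=['Province/State', 'Country/Region', 'Recovery_date'],
    right_on=['Province/State', 'Country/Region', 'Date'],
    how='left',
    suffixes=('', '_14d_ago')
)

# recovered ~ confirmed two weeks ago minus deaths today
merged['Recovered'] = merged['Confirmed_14d_ago'] - merged['Deaths']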

A column is dropped with multi-merge - Python

I'm creating columns with past data from the same columns of the dataset. For example, for a given day I need the Y value of the day before and of the same day of the week in the previous week. So:
x = df.copy()
x["Date"] = pd.to_datetime(df.rename(columns={"Año":"Year","Mes":"Month","Dia":"Day"})[["Year","Month","Day"]])
y = x[["Date","Y"]]
y.rename(columns={"Y":"Y_DiaAnterior"}, inplace=True)
y["Date"] = y["Date"] + dt.timedelta(days=1)
z = pd.merge(x,y,on=["Date"], how="left")
display(y.head()) # First merge result
a = x[["Date","Y"]]
a.rename(columns={"Y":"Y_DiaSemAnterior"}, inplace=True)
a["Date"] = a["Date"] + dt.timedelta(days=7)
z = pd.merge(x,a,on=["Date"], how="left")
z.head() # Second merge result
Here y is an auxiliary df used to create the column with the previous day's Y value, and a is an auxiliary df used to create the column with the Y value from the same day of the previous week.
When I merge them separately it works perfectly, but when I merge all of them (first x with y and then x with a), the merge of x with y gets 'deleted': as you can see, the Y_DiaAnterior column is not in the final df (the 'second merge result'), even though I had already merged it.
[Screenshots in the original post show the first and second merge results.]
So, how can I make the final df contain both the Y_DiaAnterior and Y_DiaSemAnterior columns?
That's because you're overwriting your z with the new merge of x and a. Also, you aren't actually showing the result of the first merge in your code, because you're displaying y.head() instead of z.head().
If you want the merge results of all 3 df's, you can chain the merges:
# prep x
x = df.copy()
x["Date"] = pd.to_datetime(df.rename(columns={"Año":"Year", "Mes":"Month", "Dia":"Day"})[["Year", "Month", "Day"]])
# prep y
y = x[["Date", "Y"]].copy()
y.rename(columns={"Y":"Y_DiaAnterior"}, inplace=True)
y["Date"] = y["Date"] + dt.timedelta(days=1)
# prep a
a = x[["Date", "Y"]].copy()
a.rename(columns={"Y":"Y_DiaSemAnterior"}, inplace=True)
a["Date"] = a["Date"] + dt.timedelta(days=7)
# now merge all
z = x.merge(y, on='Date', how='left') \
     .merge(a, on='Date', how='left')
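With the merges chained like this, both new columns survive in z; a quick check (assuming the frames above):

print(z[['Date', 'Y', 'Y_DiaAnterior', 'Y_DiaSemAnterior']].head())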

Pandas Columns Operations with List

I have a pandas dataframe with two columns: one with a single date ('action_date') and one with a list of dates ('verification_date'). I am trying to calculate the time difference between the date in 'action_date' and each of the dates in the corresponding 'verification_date' list, and then fill two new columns with the number of dates in verification_date that differ by over or under 360 days.
Here is my code:
df = pd.DataFrame()
df['action_date'] = ['2017-01-01', '2017-01-01', '2017-01-03']
df['action_date'] = pd.to_datetime(df['action_date'], format="%Y-%m-%d")
df['verification_date'] = ['2016-01-01', '2015-01-08', '2017-01-01']
df['verification_date'] = pd.to_datetime(df['verification_date'], format="%Y-%m-%d")
df['user_name'] = ['abc', 'wdt', 'sdf']
df.index = df.action_date
df = df.groupby(pd.TimeGrouper(freq='2D'))['verification_date'].apply(list).reset_index()  # note: pd.TimeGrouper is deprecated; newer pandas uses pd.Grouper(freq='2D')
def make_columns(df):
    df = df
    for i in range(len(df)):
        over_360 = []
        under_360 = []
        for w in [(df['action_date'][i]-x).days for x in df['verification_date'][i]]:
            if w > 360:
                over_360.append(w)
            else:
                under_360.append(w)
        df['over_360'] = len(over_360)
        df['under_360'] = len(under_360)
    return df

make_columns(df)
This kind of works, EXCEPT that the df ends up with the same values in every row, which can't be right since the dates are different. For example, in the first row of the dataframe there IS a difference of over 360 days between the action_date and both of the items in the list in the verification_date column, so the over_360 column should be populated with 2. However, it is empty, and instead the under_360 column is populated with 1, which is accurate only for the second row of 'action_date'.
I have a feeling I'm just messing up the looping but am really stuck. Thanks for all the help!
Your problem was that you were always updating the whole column with the value of the last calculation with these lines:
df['over_360'] = len(over_360)
df['under_360'] = len(under_360)
What you want to do instead is set the value for each row's calculation individually; you can do this by replacing the lines above with these:
df.set_value(i,'over_360',len(over_360))
df.set_value(i,'under_360',len(under_360))
What it does is set the value at row i in the over_360 or under_360 column.
You can learn more about it here.
If you don't like using set_value you can also use this:
df.ix[i,'over_360'] = len(over_360)
df.ix[i,'under_360'] = len(under_360)
You can check DataFrame.ix here.
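A note for readers on current pandas versions: both set_value and .ix have since been deprecated and removed, so the equivalent row-wise assignment today would use .at (or .loc):

df.at[i, 'over_360'] = len(over_360)
df.at[i, 'under_360'] = len(under_360)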
You might want to try this:
df['over_360'] = df.apply(lambda x: sum([((x['action_date'] - i).days >360) for i in x['verification_date']]) , axis=1)
df['under_360'] = df.apply(lambda x: sum([((x['action_date'] - i).days <360) for i in x['verification_date']]) , axis=1)
I believe it should be a bit faster.
You didn't specify what to do if == 360, so you can just change > or < into >= or <=.

Pandas: how to make algorithm faster

I have a task: I need to find some data in a big file and add it to another file.
The file I search in has 22 million rows, and I read it in pieces using chunksize.
The other file has a column with 600 user ids, and I look up information about each of those users in the big file.
First I split the data into intervals, and then I search for information about every user in each of these chunks.
I use a timer to see how long the writing takes; the average time to find the information in a 1-million-row df and write it to file is 1.7 sec. Adding everything up, the whole program comes to about 6 hours (1.5 sec * 600 ids * 22 intervals).
I want to do it faster, but I don't know any way besides chunksize.
Here is my code:
el = pd.read_csv('df2.csv', iterator=True, chunksize=1000000)
buys = pd.read_excel('smartphone.xlsx')
buys['date'] = pd.to_datetime(buys['date'])
dates1 = buys['date']
ids1 = buys['id']
for i in el:
    i['used_at'] = pd.to_datetime(i['used_at'])
    df = i.sort_values(['ID', 'used_at'])
    dates = df['used_at']
    ids = df['ID']
    urls = df['url']
    for i, (id, date, url, id1, date1) in enumerate(zip(ids, dates, urls, ids1, dates1)):
        start = time.time()
        df1 = df[(df['ID'] == ids1[i]) & (df['used_at'] < (dates1[i] + dateutil.relativedelta.relativedelta(days=5)).replace(hour=0, minute=0, second=0)) & (df['used_at'] > (dates1[i] - dateutil.relativedelta.relativedelta(months=1)).replace(day=1, hour=0, minute=0, second=0))]
        df1 = DataFrame(df1)
        if df1.empty:
            continue
        else:
            with open('3.csv', 'a') as f:
                df1.to_csv(f, header=False)
        end = time.time()
        print(end - start)
There are some issues in your code:
zip takes arguments that may be of different lengths and silently stops at the shortest one;
dateutil.relativedelta may not be compatible with pandas Timestamp.
With pandas 0.18.1 and python 3.5, I'm getting this:
now = pd.Timestamp.now()
now
Out[46]: Timestamp('2016-07-06 15:32:44.266720')
now + dateutil.relativedelta.relativedelta(day=5)
Out[47]: Timestamp('2016-07-05 15:32:44.266720')
So it's better to use pd.Timedelta
now + pd.Timedelta(5, 'D')
Out[48]: Timestamp('2016-07-11 15:32:44.266720')
But it's somewhat inaccurate for months:
now - pd.Timedelta(1, 'M')
Out[49]: Timestamp('2016-06-06 05:03:38.266720')
This is a sketch of the code. I didn't test it, and I may be wrong about what you want.
The crucial part is to merge the two data frames instead of iterating row by row.
# 1) convert to datetime here
# 2) optionally, you can select only relevant cols with e.g. usecols=['ID', 'used_at', 'url']
# 3) iterator is prob. superfluous
el = pd.read_csv('df2.csv', chunksize=1000000, parse_dates=['used_at'])

buys = pd.read_excel('smartphone.xlsx')
buys['date'] = pd.to_datetime(buys['date'])
# consider loading only relevant columns to buys

# compute time intervals here (not in a loop!)
buys['date_min'] = buys['date'] - pd.Timedelta(1, unit='M')
buys['date_max'] = buys['date'] + pd.Timedelta(5, unit='D')

# now replace (probably it needs to be done row by row)
buys['date_min'] = buys['date_min'].apply(lambda x: x.replace(day=1, hour=0, minute=0, second=0))
buys['date_max'] = buys['date_max'].apply(lambda x: x.replace(hour=0, minute=0, second=0))

# not necessary
# dates1 = buys['date']
# ids1 = buys['id']

for chunk in el:
    # already converted to datetime
    # chunk['used_at'] = pd.to_datetime(chunk['used_at'])

    # defer sorting until later
    # df = chunk.sort_values(['ID', 'used_at'])

    # merge!
    # (option how='inner' selects only rows that have the same id in both data frames; it's the default)
    merged = pd.merge(chunk, buys, left_on='ID', right_on='id', how='inner')
    bool_idx = (merged['used_at'] < merged['date_max']) & (merged['used_at'] > merged['date_min'])
    selected = merged.loc[bool_idx]

    # probably don't need additional columns from buys,
    # so either drop them or select the ones from chunk (beware of possible duplicates in names)
    selected = selected[chunk.columns]

    # sort now (possibly a smaller frame)
    selected = selected.sort_values(['ID', 'used_at'])

    if selected.empty:
        continue

    with open('3.csv', 'a') as f:
        selected.to_csv(f, header=False)
Hope this helps. Please double-check the code and adjust it to your needs.
Also take a look at the docs to understand the options of merge.
