I'm trying to restructure a large DataFrame of the following form as a MultiIndex:
date store_nbr item_nbr units snowfall preciptotal event
0 2012-01-01 1 1 0 0.0 0.0 0.0
1 2012-01-01 1 2 0 0.0 0.0 0.0
2 2012-01-01 1 3 0 0.0 0.0 0.0
3 2012-01-01 1 4 0 0.0 0.0 0.0
4 2012-01-01 1 5 0 0.0 0.0 0.0
I want to group by store_nbr (1-45), then within each store_nbr group by item_nbr (1-111), and then, for each index pair (e.g., store_nbr=12, item_nbr=109), display the rows in chronological order, so that the ordered rows look like, for example:
store_nbr=12, item_nbr=109: date=2014-02-06, units=0, snowfall=...
date=2014-02-07, units=0, snowfall=...
date=2014-02-08, units=0, snowfall=...
... ...
store_nbr=12, item_nbr=110: date=2014-02-06, units=0, snowfall=...
date=2014-02-07, units=1, snowfall=...
date=2014-02-08, units=1, snowfall=...
...
It looks like some combination of groupby and set_index might be useful here, but I'm getting stuck after the following line:
grouped = stores.set_index(['store_nbr', 'item_nbr'])
This produces the following MultiIndex:
date units snowfall preciptotal event
store_nbr item_nbr
1 1 2012-01-01 0 0.0 0.0 0.0
2 2012-01-01 0 0.0 0.0 0.0
3 2012-01-01 0 0.0 0.0 0.0
4 2012-01-01 0 0.0 0.0 0.0
5 2012-01-01 0 0.0 0.0 0.0
Does anyone have any suggestions from here? Is there an easy way to do this by manipulating groupby objects?
You can sort your rows with:
df.sort_values(by='date')
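If you also want the MultiIndex layout from the question, a minimal sketch (assuming the stores DataFrame above) is to sort first and then build the index:
import pandas as pd

# sort by store, item and date, then index by (store_nbr, item_nbr);
# rows inside each (store, item) pair stay in chronological order
grouped = (stores.sort_values(['store_nbr', 'item_nbr', 'date'])
                 .set_index(['store_nbr', 'item_nbr']))
print(grouped.loc[(12, 109)])  # one (store, item) pair, ordered by date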
I'm trying to insert a range of date labels in my dataframe, df1. I've made it part of the way, but I still have some bumps that I want to smooth out.
I'm trying to generate a column with dates from 2017-01-01 to 2020-12-31 with all dates repeated 24 times, i.e., a column with 35,068 rows.
dates = pd.date_range(start="01-01-2017", end="31-12-2020")
num_repeats = 24
repeated_dates = pd.DataFrame(dates.repeat(num_repeats))
df1.insert(0, 'Date', repeated_dates)
However, it only generates some of the repetitions of the last date, meaning that my column is NaT for the remaining rows.
output:
Date DK1 Up DK1 Down DK2 Up DK2 Down
0 2017-01-01 0.0 0.0 0.0 0.0
1 2017-01-01 0.0 0.0 0.0 0.0
2 2017-01-01 0.0 0.0 0.0 0.0
3 2017-01-01 0.0 0.0 0.0 0.0
4 2017-01-01 0.0 0.0 0.0 0.0
... ... ... ... ... ...
35063 2020-12-31 0.0 0.0 0.0 0.0
35064 NaT 0.0 0.0 0.0 0.0
35065 NaT 0.0 -54.1 0.0 0.0
35066 NaT 25.5 0.0 0.0 0.0
35067 NaT 0.0 0.0 0.0 0.0
Furthermore, how can I change the date format from '2017-01-01' to '01-01-2017'?
You set this up perfectly, so here are the dates that you have:
import pandas as pd
import numpy as np
dates = pd.date_range(start="01-01-2017", end="31-12-2020")
num_repeats = 24
df = pd.DataFrame(dates.repeat(num_repeats),columns=['date'])
and converting the column to the format you want is simple with the strftime function
df['newFormat'] = df['date'].dt.strftime('%d-%m-%Y')
Which gives
date newFormat
0 2017-01-01 01-01-2017
1 2017-01-01 01-01-2017
2 2017-01-01 01-01-2017
3 2017-01-01 01-01-2017
4 2017-01-01 01-01-2017
... ... ...
35059 2020-12-31 31-12-2020
35060 2020-12-31 31-12-2020
35061 2020-12-31 31-12-2020
35062 2020-12-31 31-12-2020
35063 2020-12-31 31-12-2020
Now,
dates = pd.date_range(start="01-01-2017", end="31-12-2020")
gives
DatetimeIndex(['2017-01-01', '2017-01-02', '2017-01-03', '2017-01-04',
'2017-01-05', '2017-01-06', '2017-01-07', '2017-01-08',
'2017-01-09', '2017-01-10',
...
'2020-12-22', '2020-12-23', '2020-12-24', '2020-12-25',
'2020-12-26', '2020-12-27', '2020-12-28', '2020-12-29',
'2020-12-30', '2020-12-31'],
dtype='datetime64[ns]', length=1461, freq='D')
and
1461 * 24 = 35064
so I am not sure where 35,068 comes from. Are you sure about that number?
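As an aside, if the goal is really one row per hour, it may be simpler to build an hourly index directly instead of repeating daily dates. A sketch (not the original code; the exact alias spelling is "H" on older pandas versions, "h" on newer ones):
import pandas as pd

# hourly timestamps covering the same four years: 1461 days * 24 = 35064 rows
hours = pd.date_range(start="2017-01-01", end="2020-12-31 23:00", freq="h")
print(len(hours))  # 35064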
Even though this seems really simple, it drives me nuts. Why is .astype(int) not changing the floats to ints? Thank you
import numpy as np
import pandas as pd

df_new = pd.crosstab(df["date"], df["place"]).reset_index()
places = ['cityA', 'cityB', 'cityC']
df_new[places] = df_new[places].fillna(0).astype(int)
sums = df_new.select_dtypes(np.number).sum().rename('total')
df_new = df_new.append(sums)
print(df_new)
Output:
place date cityA cityB cityC
0 2008-01-01 0.0 0.0 51.0
1 2009-06-01 0.0 618.0 0.0
2 2015-07-01 549.0 0.0 0.0
3 2016-01-01 41.0 0.0 0.0
4 2016-04-01 62.0 0.0 0.0
5 2017-01-01 800.0 0.0 0.0
6 2018-07-01 69.0 0.0 0.0
total NaT 1521.0 618.0 51.0
If there are NAs in a column (NaN is a float in pandas), the other values in that column will be stored as floats as well.
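If the goal is to keep integer display even with missing values around, one option (a sketch, assuming the df_new built above) is pandas' nullable integer dtype, which tolerates NA without falling back to float:
# nullable Int64 dtype: NA is allowed, so the city columns stay integers
places = ['cityA', 'cityB', 'cityC']
df_new[places] = df_new[places].astype('Int64')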
I have a particular CSV, for example:
col1 col2 col3 col4
a 1 2 3
b 1 2 1
c 1 1 3
d 3 1 2
I want to count the number of occurrences of a particular value, e.g. 1, in col2, col3 and col4.
I am using the following pandas code:
import pandas as pd
fname = input('Enter the filename:')
df = pd.read_csv(fname, header='infer')
one = df.iloc[:,1:4].value_counts(normalize=False).loc[1]
It shows an error, but when I do the same for a single named column the code runs properly:
import pandas as pd
fname = input('Enter the filename:')
df = pd.read_csv(fname, header='infer')
one = df['col1'].value_counts(normalize=False).loc[1]
I want the following output
col2 3
col3 2
col4 1
Any help or tips would be greatly appreciated! Thank you in advance. :)
Use eq with the desired value, i.e. 1, and then sum:
df[['col2', 'col3', 'col4']].eq(1).sum()
col2 3
col3 2
col4 1
dtype: int64
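As an aside (assuming the same df), per-column value_counts also works, and is closer to what the iloc attempt was reaching for. DataFrame.value_counts counts unique row combinations (or does not exist at all in older pandas), which is likely why the original call errored:
# value_counts per column; .loc[1] then picks the count of the value 1 in each column
df[['col2', 'col3', 'col4']].apply(pd.Series.value_counts).fillna(0).astype(int).loc[1]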
I reached this question when searching for a way to check how many values are actually higher/lower than zero, on the columns 'Buys' and 'Sells', on the following DataFrame (named "trade_track"):
Ticker Pre-trade Buys Sells Net Exposure Ch. Post-trade
CX 10126.0 0.0 -964.0 -0.095200 9162.0
OI 3311.0 0.0 -24.0 -0.007249 3287.0
THO 748.0 0.0 -33.0 -0.044118 715.0
WRK 1002.0 0.0 -43.0 -0.042914 959.0
TAP 646.0 0.0 -4.0 -0.006192 642.0
TRN 1987.0 0.0 -93.0 -0.046804 1894.0
SJM 312.0 6.0 0.0 0.019231 318.0
WW 1100.0 0.0 -22.0 -0.020000 1078.0
FAST -655.0 13.0 0.0 -0.019847 -642.0
CSX -301.0 6.0 0.0 -0.019934 -295.0
ODFL -123.0 0.0 0.0 -0.000000 -123.0
HELE -130.0 0.0 0.0 -0.000000 -130.0
SBUX -203.0 0.0 0.0 -0.000000 -203.0
WM -166.0 0.0 0.0 -0.000000 -166.0
HD -90.0 2.0 0.0 -0.022222 -88.0
VMC -141.0 0.0 0.0 -0.000000 -141.0
CTAS -76.0 2.0 0.0 -0.026316 -74.0
ORLY -53.0 0.0 0.0 -0.000000 -53.0
Here is some simple code that worked:
(i) to find all numbers above zero on column 'Buys':
(trade_track['Buys'] > 0).sum()
(ii) to find all zeros on column 'Buys':
(trade_track['Buys'] == 0).sum()
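The same idea vectorises over several columns at once; for example (a sketch on the trade_track frame above):
# count strictly positive and strictly negative entries in both columns in one pass
(trade_track[['Buys', 'Sells']] > 0).sum()
(trade_track[['Buys', 'Sells']] < 0).sum()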
How can I calculate the count of 1 values and the count of 0 values contained in each date?
or
How can I calculate the count of 1 values divided by the count of 0 values in each date?
sentiment_value = log10(count_of_1 / count_of_0) is the formula I am using.
date new_sentiment
0 2017-04-28 1.0
1 2017-04-28 1.0
2 2017-04-28 1.0
3 2017-04-27 0.0
4 2017-04-27 1.0
5 2017-04-26 0.0
6 2017-04-26 1.0
7 2017-04-26 1.0
8 2017-04-26 0.0
9 2017-04-26 1.0
result_neg = date_df.appl
You need:
import numpy as np

g = data.groupby(['date', 'new_sentiment']).size().unstack(fill_value=0).reset_index()
g['sentiment_value'] = np.log(g[1.0] / g[0.0])
Output:
new_sentiment date 0.0 1.0 sentiment_value
0 2017-04-26 2 3 0.405465
1 2017-04-27 1 1 0.000000
2 2017-04-28 0 3 inf
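Note that the question's formula uses log10 rather than the natural log used above; if that distinction matters, the same frame can be fed to np.log10 (a one-line variation, not part of the original answer):
# base-10 logarithm, matching the sentiment_value formula in the question
g['sentiment_value'] = np.log10(g[1.0] / g[0.0])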
I have two dataframes
df_a=
Start Stop Value
0 0 100 0.0
1 101 200 1.0
2 201 1000 0.0
df_b=
Start Stop Value
0 0 50 0.0
1 51 300 1.0
2 301 1000 0.0
I would like to generate a DataFrame that contains the intervals, as identified by Start and Stop, where Value was the same in df_a and df_b. For each interval I would like to store whether Value was the same, and what the value was in df_a and df_b.
Desired output:
df_out=
Start Stop SameValue Value_dfA Value_dfB
0 50 1 0 0
51 100 0 0 1
101 200 1 1 1
201 300 0 0 1
[...]
Not sure if this is the best way to do this but you can reindex, join, groupby and agg to get your intervals, e.g.:
Expand each df so that the index is every single value of the range (Start to Stop) using reindex() and padding the values:
In []:
df_a_expanded = df_a.set_index('Start').reindex(range(max(df_a['Stop'])+1)).fillna(method='pad')
df_a_expanded
Out[]:
Stop Value
Start
0 100.0 0.0
1 100.0 0.0
2 100.0 0.0
3 100.0 0.0
4 100.0 0.0
...
997 1000.0 0.0
998 1000.0 0.0
999 1000.0 0.0
1000 1000.0 0.0
[1001 rows x 2 columns]
In []:
df_b_expanded = df_b.set_index('Start').reindex(range(max(df_b['Stop'])+1)).fillna(method='pad')
Join the two expanded dfs:
In []:
df = df_a_expanded.join(df_b_expanded, lsuffix='_dfA', rsuffix='_dfB').reset_index()
df
Out[]:
Start Stop_dfA Value_dfA Stop_dfB Value_dfB
0 0 100.0 0.0 50.0 0.0
1 1 100.0 0.0 50.0 0.0
2 2 100.0 0.0 50.0 0.0
3 3 100.0 0.0 50.0 0.0
4 4 100.0 0.0 50.0 0.0
...
Note: you can ignore the Stop columns and could have dropped them in the previous step.
There is no standard way to groupby only consecutive values (à la itertools.groupby), so resorting to a cumsum() hack:
In []:
groups = (df[['Value_dfA', 'Value_dfB']] != df[['Value_dfA', 'Value_dfB']].shift()).any(axis=1).cumsum()
g = df.groupby([groups, 'Value_dfA', 'Value_dfB'], as_index=False)
Now you can get the result you want by aggregating the group with min, max:
In []:
df_out = g.agg(Start=('Start', 'min'), Stop=('Start', 'max'))
df_out
Out[]:
Value_dfA Value_dfB Start Stop
0 0.0 0.0 0 50
1 0.0 1.0 51 100
2 1.0 1.0 101 200
3 0.0 1.0 201 300
4 0.0 0.0 301 1000
Now you just have to add the SameValue column and, if desired, order the columns to get the exact output you want:
In []:
df_out['SameValue'] = (df_out['Value_dfA'] == df_out['Value_dfB'])*1
df_out[['Start', 'Stop', 'SameValue', 'Value_dfA', 'Value_dfB']]
Out[]:
Start Stop SameValue Value_dfA Value_dfB
0 0 50 1 0.0 0.0
1 51 100 0 0.0 1.0
2 101 200 1 1.0 1.0
3 201 300 0 0.0 1.0
4 301 1000 1 0.0 0.0
This assumes the ranges of the two dataframes are the same, or you will need to handle the NaNs you will get with the join().
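For instance (a sketch only), the unmatched rows could simply be dropped before the groupby step:
# if df_a and df_b cover different ranges, the join leaves NaNs; drop them before grouping
df = df.dropna(subset=['Value_dfA', 'Value_dfB'])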
I found a way, but I am not sure it is the most efficient. Starting from the input data:
import pandas as pd
dfa = pd.DataFrame({'Start': [0, 101, 201], 'Stop': [100, 200, 1000], 'Value': [0., 1., 0.]})
dfb = pd.DataFrame({'Start': [0, 51, 301], 'Stop': [50, 300, 1000], 'Value': [0., 1., 0.]})
First I would create the columns Start and Stop of df_out with:
df_out = pd.DataFrame({'Start': sorted(set(dfa['Start'])|set(dfb['Start'])),
'Stop': sorted(set(dfa['Stop'])|set(dfb['Stop']))})
Then, to get the value of dfa (and dfb) associated with the right (Start, Stop) range in a column named Value_dfA (and Value_dfB), I would do:
df_out['Value_dfA'] = df_out['Start'].apply(lambda x: dfa['Value'][dfa['Start'] <= x].iloc[-1])
df_out['Value_dfB'] = df_out['Start'].apply(lambda x: dfb['Value'][dfb['Start'] <= x].iloc[-1])
To get the column SameValue, do:
df_out['SameValue'] = df_out.apply(lambda x: 1 if x['Value_dfA'] == x['Value_dfB'] else 0, axis=1)
If it matters, you can reorder the columns with:
df_out = df_out[['Start', 'Stop', 'SameValue', 'Value_dfA', 'Value_dfB']]
Your output is then
Start Stop SameValue Value_dfA Value_dfB
0 0 50 1 0.0 0.0
1 51 100 0 0.0 1.0
2 101 200 1 1.0 1.0
3 201 300 0 0.0 1.0
4 301 1000 1 0.0 0.0
I have an O(n log n) solution, where n is the total number of rows in df_a and df_b. Here's how it goes:
Rename the value column of both dataframes to value_a and value_b respectively. Next, append df_b to df_a.
df = df_a.append(df_b)
Sort the df with respect to the start column.
df = df.sort_values('start')
The resulting dataframe will look like this:
start stop value_a value_b
0 0 100 0.0 NaN
0 0 50 NaN 0.0
1 51 300 NaN 1.0
1 101 200 1.0 NaN
2 201 1000 0.0 NaN
2 301 1000 NaN 0.0
Forward fill the missing values:
df = df.fillna(method='ffill')
Compute same_value column:
df['same_value'] = df['value_a'] == df['value_b']
Recompute stop column:
df.stop = df.start.shift(-1)
You will get the dataframe you desire (except for the first and last rows, which are pretty easy to fix; see the sketch after the output below):
start stop value_a value_b same_value
0 0 0.0 0.0 NaN False
0 0 51.0 0.0 0.0 True
1 51 101.0 0.0 1.0 False
1 101 201.0 1.0 1.0 True
2 201 301.0 0.0 1.0 False
2 301 NaN 0.0 0.0 True
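For completeness, one way to patch those two rows (a sketch only, hard-coding 1000 as the known end of the range and assuming the df shown above):
# drop the spurious first row (value_b was still NaN there before the forward fill)
df = df.iloc[1:].reset_index(drop=True)
# the shifted start gives an exclusive stop; subtract 1 to make it inclusive like the desired output
df["stop"] = df["stop"] - 1
# the last stop became NaN after the shift; restore the true end of the range
df.loc[df.index[-1], "stop"] = 1000
df["stop"] = df["stop"].astype(int)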
Here is an answer which computes the overlapping intervals really quickly (which answers the question in the title):
from io import StringIO
import pandas as pd
from ncls import NCLS
c1 = StringIO("""Start Stop Value
0 100 0.0
101 200 1.0
201 1000 0.0""")
c2 = StringIO("""Start Stop Value
0 50 0.0
51 300 1.0
301 1000 0.0""")
df1 = pd.read_table(c1, sep=r"\s+")
df2 = pd.read_table(c2, sep=r"\s+")
ncls = NCLS(df1.Start.values, df1.Stop.values, df1.index.values)
x1, x2 = ncls.all_overlaps_both(df2.Start.values, df2.Stop.values, df2.index.values)
df1 = df1.reindex(x2).reset_index(drop=True)
df2 = df2.reindex(x1).reset_index(drop=True)
# print(df1)
# print(df2)
df = df1.join(df2, rsuffix="2")
print(df)
# Start Stop Value Start2 Stop2 Value2
# 0 0 100 0.0 0 50 0.0
# 1 0 100 0.0 51 300 1.0
# 2 101 200 1.0 51 300 1.0
# 3 201 1000 0.0 51 300 1.0
# 4 201 1000 0.0 301 1000 0.0
With this final df it should be simple to get to the result you need (but it is left as an exercise for the reader).
See NCLS for the interval overlap data structure.
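Completing that exercise (a sketch only, assuming the joined df printed above): clip each overlapping pair of intervals to its intersection and compare the two values.
import numpy as np

# intersection of each overlapping pair, plus the values from both frames
out = pd.DataFrame({
    "Start": np.maximum(df["Start"], df["Start2"]),
    "Stop": np.minimum(df["Stop"], df["Stop2"]),
    "Value_dfA": df["Value"],
    "Value_dfB": df["Value2"],
})
out["SameValue"] = (out["Value_dfA"] == out["Value_dfB"]).astype(int)
out = out[["Start", "Stop", "SameValue", "Value_dfA", "Value_dfB"]]
print(out)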