I have a pandas series like below:
Year  Month
2016  09        41
      10        76
      11        54
      12       271
2017  01        88
      02        48
      03        54
      04        61
      05       156
      06        43
      07        57
      08        43
      09        69
      10        67
      11        99
      12       106
2018  01        34
Name: CustomerId, dtype: int64
I just want to create a numpy array in which every year is matched with all twelve months and the corresponding values, like this:
2016 01 0
2016 02 0
.
.
.
2016 09 41
2016 10 76
.
.
.
2018 01 34
2018 02 0
.
.
.
How can I do this?
Thanks.
unstack + stack
Unstacking pivots the Month level into columns, introducing NaN for the missing months; stacking back with dropna=False keeps every year/month pair, which fillna(0) then fills:
S.unstack().stack(dropna=False).fillna(0).astype(int)
Out[591]:
Year  Month
2016  1        0
      2        0
      3        0
      4        0
      5        0
      6        0
      7        0
      8        0
      9       41
      10      76
      11      54
      12     271
2017  1       88
      2       48
      3       54
      4       61
      5      156
      6       43
      7       57
      8       43
      9       69
      10      67
      11      99
      12     106
2018  1       34
      2        0
      3        0
      4        0
      5        0
      6        0
      7        0
      8        0
      9        0
      10       0
      11       0
      12       0
dtype: int32
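An alternative is to reindex against the full year/month grid. A minimal sketch, assuming S is the series above and that the Month level holds integers; if it holds zero-padded strings such as '09', build the month list as [f'{m:02d}' for m in range(1, 13)] instead:
import pandas as pd

# Every (Year, Month) combination for the years present in S
full = pd.MultiIndex.from_product(
    [S.index.get_level_values('Year').unique(), range(1, 13)],
    names=['Year', 'Month'])
# Missing combinations are created and filled with 0
S.reindex(full, fill_value=0)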
I have two DataFrames shown below. The DataFrames in reality are larger than the sample below.
df1
route_no cost_h1 cost_h2 cost_h3 cost_h4 cost_h5 max min location
0 0010 20 22 21 23 26 26 20 NY
1 0011 30 25 23 31 33 33 23 CA
2 0012 67 68 68 69 65 69 67 GA
3 0013 34 33 31 30 35 35 31 MO
4 0014 44 42 40 39 50 50 39 WA
df2
route_no cost_h1 cost_h2 cost_h3 cost_h4 cost_h5 location
0 0020 19 27 21 24 20 NY
1 0021 31 22 23 30 33 CA
2 0023 66 67 68 70 65 GA
3 0022 34 33 31 30 35 MO
4 0025 41 42 40 39 50 WA
5 0030 19 26 20 24 20 NY
6 0032 37 31 31 20 35 MO
7 0034 40 41 39 39 50 WA
The idea is to compare each row of df2 against the appropriate max and min values specified in df1; which thresholds apply depends on the match in the location column. If any of the cost values in a row fall outside the range defined by the min and max values, that row should be put in a separate dataframe. Please note that the number of cost columns may vary.
Solution
# Merge the dataframes on location to append the min/max columns to df2
df3 = df2.merge(df1[['location', 'max', 'min']], on='location', how='left')
# Select the cost-like columns
cost = df3.filter(like='cost')
# Check whether the cost values satisfy the interval condition
mask = cost.ge(df3['min'], axis=0) & cost.le(df3['max'], axis=0)
# Filter the rows where one or more values do not satisfy the condition
df4 = df2[~mask.all(axis=1)]
Result
print(df4)
route_no cost_h1 cost_h2 cost_h3 cost_h4 cost_h5 location
0 0020 19 27 21 24 20 NY
1 0021 31 22 23 30 33 CA
2 0023 66 67 68 70 65 GA
3 0022 34 33 31 30 35 MO
5 0030 19 26 20 24 20 NY
6 0032 37 31 31 20 35 MO
I want to split my data frame on a space in every column. I can do it for one column; how do I apply it to the whole dataframe (maybe with a loop)?
df =
       0      1      2      4
0  11 22  12 22  13 22  14 22
1  15 16  17 18  33 44  22 55
2  19 20  21 22  66 55  33 66
3  23 24  25 26  22 44  66 44
I am splitting it like:
df[0].str.split(' ', n=1, expand=True)
Output is:
    0   1
0  11  22
1  15  16
2  19  20
3  23  24
You can stack and unstack:
df.stack().str.split(' ', expand=True).unstack()
Output:
    0               1
    0   1   2   4   0   1   2   4
0  11  12  13  14  22  22  22  22
1  15  17  33  22  16  18  44  55
2  19  21  66  33  20  22  55  66
3  23  25  22  66  24  26  44  44
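If you would rather keep the two halves of each original column next to each other, a small sketch that swaps and sorts the column levels afterwards:
out = df.stack().str.split(' ', expand=True).unstack()
# Put the original column labels on the outer level so the split pairs sit together
out = out.swaplevel(0, 1, axis=1).sort_index(axis=1)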
Replace dataframe values with the mean, using multiple grouping columns. The snapshot below shows the dataframe:
Current Loan Amount DateTime Day Month Year
0 611314 1-Jan-92 1 Jan 92
1 266662 2-Jan-92 2 Jan 92
2 153494 3-Jan-92 3 Jan 92
3 176242 4-Jan-92 4 Jan 92
4 321992 5-Jan-92 5 Jan 92
5 202928 6-Jan-92 6 Jan 92
6 621786 7-Jan-92 7 Jan 92
7 266794 8-Jan-92 8 Jan 92
8 202466 9-Jan-92 9 Jan 92
9 266288 10-Jan-92 10 Jan 92
10 121110 11-Jan-92 11 Jan 92
11 258104 12-Jan-92 12 Jan 92
12 161722 13-Jan-92 13 Jan 92
13 753016 14-Jan-92 14 Jan 92
14 444664 15-Jan-92 15 Jan 92
15 172282 16-Jan-92 16 Jan 92
16 275440 17-Jan-92 17 Jan 92
17 218834 18-Jan-92 18 Jan 92
18 0 19-Jan-92 19 Jan 92
19 0 20-Jan-92 20 Jan 92
I need to replace the 0 values with the mean of the Current Loan Amount for the same year and month.
I tried different methods; the one below does give me the mean, but it does not change the dataframe and it drops the rest of the columns:
data = data_loan.groupby(['Year', 'Month'])

def replace(group):
    mask = (group == 0)
    group[mask] = group[~mask].mean()
    return group

new_data = data.transform(replace)
Replace the zeros with NaN first, then fill the NaNs within each group (the column in the snapshot is Current Loan Amount, so use that exact name):
import numpy as np

data_loan['Current Loan Amount'] = data_loan['Current Loan Amount'].replace(0, np.nan)
data_loan['Current Loan Amount'] = data_loan.groupby(['Month', 'Year'])['Current Loan Amount'].transform(lambda x: x.fillna(x.mean()))
This replaces each 0 with the mean of its (Month, Year) group.
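One caveat: if an entire (Month, Year) group happens to be all zeros, its mean is NaN and those rows stay NaN after the transform. If that can happen in your data, a fallback along these lines (a sketch, using the overall mean) fills any remaining gaps:
data_loan['Current Loan Amount'] = data_loan['Current Loan Amount'].fillna(data_loan['Current Loan Amount'].mean())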
Given a file with the following columns:
date, userid, amount
where date is in yyyy-mm-dd format. I am trying to use pandas to convert dates from multiple years into accumulated week numbers. For example:
2017-01-01 => 1
2017-12-31 => 52
2018-01-01 => 53
df_counts_dates = pd.read_csv("counts.csv")
print(df_counts_dates['date'].unique())
df = pd.to_datetime(df_counts_dates['date'])
print(df.unique())
print(df.dt.week.unique())
Since the data contains dates from Aug 2017 to Aug 2018, the above returns:
[33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 1 2 3 4 5
6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30
31 32]
I am wondering if there is any easy way to make the first date "week 1", and make the week number accumulate across years instead of becoming 1 at the beginning of each year?
I believe a slightly different approach is needed: subtract the first value of the column from every value, convert the timedeltas to days, floor-divide by 7, and finally add 1 so the numbering does not start at 0:
import pandas as pd

rng = pd.date_range('2017-08-01', periods=365)
df = pd.DataFrame({'date': rng, 'a': range(365)})
print(df.head())
date a
0 2017-08-01 0
1 2017-08-02 1
2 2017-08-03 2
3 2017-08-04 3
4 2017-08-05 4
w = ((df['date'] - df['date'].iloc[0]).dt.days // 7 + 1).unique()
print(w)
[ 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24
25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48
49 50 51 52 53]
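Applied to the question's own file, the same idea looks like this (a sketch; 'week' is a column name introduced here, and min() replaces iloc[0] so the result does not depend on row order):
df_counts_dates['date'] = pd.to_datetime(df_counts_dates['date'])
df_counts_dates['week'] = (df_counts_dates['date'] - df_counts_dates['date'].min()).dt.days // 7 + 1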
def main():
    l = []
    for i in range(1981, 2018):
        df = pd.read_csv("ftp://ftp.cpc.ncep.noaa.gov/htdocs/degree_days/weighted/daily_data/" + str(i) + "/Population.Heating.txt")
        print(df[12:])
I am trying to download and read the "CONUS" row in Population.Heating.txt for 1981 through 2017.
My code seems to reach the CONUS part, but how can I actually parse the file as CSV with | as the separator?
Thank you!
Try this:
def main():
    l = []
    url = "ftp://ftp.cpc.ncep.noaa.gov/htdocs/degree_days/weighted/daily_data/{}/Population.Heating.txt"
    for i in range(1981, 2018):
        df = pd.read_csv(url.format(i), sep=r'\|', skiprows=3, engine='python')
        print(df[12:])
Demo:
In [14]: url = "ftp://ftp.cpc.ncep.noaa.gov/htdocs/degree_days/weighted/daily_data/{}/Population.Heating.txt"
In [15]: i = 2017
In [16]: df = pd.read_csv(url.format(i), sep=r'\|', skiprows=3, engine='python')
In [17]: df
Out[17]:
Region 20170101 20170102 20170103 20170104 20170105 20170106 20170107 20170108 20170109 ... 20171222 20171223 \
0 1 30 36 31 25 37 39 47 51 55 ... 40 32
1 2 28 32 28 23 39 41 46 49 51 ... 31 25
2 3 34 30 26 43 52 58 57 54 44 ... 29 32
3 4 37 34 37 57 60 62 59 54 43 ... 39 45
4 5 15 11 9 10 20 21 27 36 33 ... 12 7
5 6 16 9 7 22 31 38 45 44 35 ... 9 9
6 7 8 5 9 23 23 34 37 32 17 ... 9 19
7 8 30 32 34 33 36 42 42 31 23 ... 36 33
8 9 25 25 24 23 22 25 23 15 17 ... 23 20
9 CONUS 24 23 21 26 33 38 40 39 34 ... 23 22
20171224 20171225 20171226 20171227 20171228 20171229 20171230 20171231
0 32 34 43 53 59 59 57 59
1 30 33 43 49 54 53 50 55
2 41 47 58 62 60 54 54 60
3 47 55 61 64 57 54 62 68
4 12 20 21 22 27 26 24 29
5 22 33 31 35 37 33 32 39
6 19 24 23 28 28 23 19 27
7 34 30 32 29 26 24 27 30
8 18 17 17 15 13 11 12 15
9 26 30 34 37 38 35 34 40
[10 rows x 366 columns]
def main():
    l = []
    for i in range(1981, 2018):
        l.append(pd.read_csv("ftp://ftp.cpc.ncep.noaa.gov/htdocs/degree_days/weighted/daily_data/" + str(i) + "/Population.Heating.txt",
                             sep='|', skiprows=3))
Files look like:
Product: Daily Heating Degree Days
Regions: Regions::CensusDivisions
Weights: Population
[... data ...]
so you need to skip 3 rows. Afterwards you have one DataFrame per year in your list l for further processing.
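To then pull just the "CONUS" row out of each frame, a sketch assuming the first column is named Region as in the demo above:
# One single-row DataFrame per year, keeping only the CONUS region
conus = [df[df['Region'] == 'CONUS'] for df in l]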