I have the following dataframe df to process:
Name C1 Value_1 C2 Value_2
0 A 112 2.36 112 3.77
1 A 211 1.13 122 2.53
2 A 242 1.22 211 1.13
3 A 245 3.87 242 1.38
4 A 311 3.13 243 4.00
5 A 312 7.11 311 2.07
6 A NaN NaN 312 7.11
7 A NaN NaN 324 1.06
As you can see, the two code columns, C1 and C2, are not aligned row by row: codes 122, 243 and 324 (in column C2) do not appear in column C1, and code 245 (in column C1) does not appear in column C2.
I would like to rebuild the dataframe so that the codes are aligned by value, in order to obtain this:
Name C1 Value_1 C2 Value_2
0 A 112 2.36 112 3.77
1 A 122 NaN 122 2.53
2 A 211 1.13 211 1.13
3 A 242 1.22 242 1.38
4 A 243 NaN 243 4.00
5 A 245 3.87 245 NaN
6 A 311 3.13 311 2.07
7 A 312 7.11 312 7.11
8 A 324 NaN 324 1.06
In order to do so, I thought of creating 2 subsets:
left = df[['Name', 'C1', 'Value_1']]
right = df[['Name', 'C2', 'Value_2']]
and I tried to merge them by playing with the parameters of merge:
left.merge(right, on=..., how=..., suffixes=...)
but I got lost in the parameters that should be used to achieve the result.
What do you think would be the best way to do it?
Appendix:
In order to create the initial dataframe, one could use:
import numpy as np
import pandas as pd

names = ['A', 'A', 'A', 'A', 'A', 'A', 'A', 'A']
code1 = [112, 211, 242, 245, 311, 312, np.nan, np.nan]
zone1 = [2.36, 1.13, 1.22, 3.87, 3.13, 7.11, np.nan, np.nan]
code2 = [112, 122, 211, 242, 243, 311, 312, 324]
zone2 = [3.77, 2.53, 1.13, 1.38, 4.00, 2.07, 7.11, 1.06]
df = pd.DataFrame({'Name': names, 'C1': code1, 'Value_1': zone1,
                   'C2': code2, 'Value_2': zone2})
You are almost there:
left.merge(right, right_on = "C2", left_on = "C1", how="right").fillna(0)
Output:
Name_x   C1  Value_1 Name_x   C2  Value_2
     A  112     2.36      A  112     3.77
     0    0     0.00      A  122     2.53
     A  211     1.13      A  211     1.13
     A  242     1.22      A  242     1.38
     0    0     0.00      A  243     4.00
     A  311     3.13      A  311     2.07
     A  312     7.11      A  312     7.11
     0    0     0.00      A  324     1.06
IIUC, you can perform an outer merge on Name and the code columns, then dropna the rows where both codes are missing:
(df[['Name', 'C1', 'Value_1']]
 .merge(df[['Name', 'C2', 'Value_2']],
        left_on=['Name', 'C1'], right_on=['Name', 'C2'], how='outer')
 .dropna(subset=['C1', 'C2'], how='all')
)
output:
Name C1 Value_1 C2 Value_2
0 A 112.0 2.36 112.0 3.77
1 A 211.0 1.13 211.0 1.13
2 A 242.0 1.22 242.0 1.38
3 A 245.0 3.87 NaN NaN
4 A 311.0 3.13 311.0 2.07
5 A 312.0 7.11 312.0 7.11
8 A NaN NaN 122.0 2.53
9 A NaN NaN 243.0 4.00
10 A NaN NaN 324.0 1.06
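If the goal is to reproduce exactly the aligned table from the question, one possible follow-up (my own addition, not part of the answer above) is to fill each code column from the other one and then sort by the code:

out = (df[['Name', 'C1', 'Value_1']]
       .merge(df[['Name', 'C2', 'Value_2']],
              left_on=['Name', 'C1'], right_on=['Name', 'C2'], how='outer')
       .dropna(subset=['C1', 'C2'], how='all'))

# fill each code column from the other one, then sort by the aligned code
out['C1'] = out['C1'].fillna(out['C2'])
out['C2'] = out['C2'].fillna(out['C1'])
out = out.sort_values('C1').reset_index(drop=True)
print(out)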
import pandas as pd
diamonds = pd.read_csv('diam.csv')
print(diamonds.head())
Unnamed: 0 carat cut color clarity depth table price x y z quality?color
0 0 0.23 Ideal E SI2 61.5 55.0 326 3.95 3.98 2.43 Ideal,E
1 1 0.21 Premium E SI1 59.8 61.0 326 3.89 3.84 2.31 Premium,E
2 2 0.23 Good E VS1 56.9 65.0 327 4.05 4.07 2.31 Good,E
3 3 0.29 Premium I VS2 62.4 58.0 334 4.20 4.23 2.63 Premium,I
4 4 0.31 Good J SI2 63.3 58.0 335 4.34 4.35 2.75 Good,J
I want to print only the columns with object data type:
x=diamonds.dtypes=='object'
diamonds.where(diamonds[x]==True)
But I get this error:
unalignable boolean Series provided as indexer (index of the boolean Series and of the indexed object do not match).
where works along the row axis, so a boolean mask over the columns cannot be used there. Use diamonds.loc[:, diamonds.dtypes == object], or the built-in select_dtypes.
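For illustration, a minimal sketch of both options (assuming the diamonds DataFrame from the question):

# boolean mask over the columns axis, selected with .loc
obj_cols = diamonds.loc[:, diamonds.dtypes == object]

# equivalent built-in: keep only the object-typed columns
obj_cols = diamonds.select_dtypes(include='object')
print(obj_cols.head())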
From your post (badly formatted) I recreated the diamonds DataFrame, getting the result below:
Unnamed: 0 carat cut color clarity depth table price x y z quality?color
0 0 0.23 Ideal E SI2 61.5 55.0 326 3.95 3.98 2.43 Ideal,E
1 1 0.21 Premium E SI1 59.8 61.0 326 3.89 3.84 2.31 Premium,E
2 2 0.23 Good E VS1 56.9 65.0 327 4.05 4.07 2.31 Good,E
3 3 0.29 Premium I VS2 62.4 58.0 334 4.20 4.23 2.63 Premium,I
4 4 0.31 Good J SI2 63.3 58.0 335 4.34 4.35 2.75 Good,J
When you run x = diamonds.dtypes == 'object' and print x, the result is:
Unnamed: 0 False
carat False
cut True
color True
clarity True
depth False
table False
price False
x False
y False
z False
quality?color True
dtype: bool
It is a bool vector containing the answer to the question: is this column of object type?
Note that diamonds.columns.where(x).dropna().tolist() prints:
['cut', 'color', 'clarity', 'quality?color']
i.e. names of columns with object type.
So to print all columns of object type you should run:
diamonds[diamonds.columns.where(x).dropna()]
I have a dataframe with some dates, and data associated with each date, that I am reading in from a csv file (the file is relatively small, on the order of tens of thousands of rows and ~10 columns):
memid date a b
10000 7/3/2017 221 143
10001 7/4/2017 442 144
10002 7/6/2017 132 145
10003 7/8/2017 742 146
10004 7/10/2017 149 147
I want to add a column, "date_diff", to this dataframe that calculates the number of days between each date and the most recent preceding date (the rows are always sorted by date):
memid date a b date_diff
10000 7/3/2017 221 143 NaN
10001 7/4/2017 442 144 1
10002 7/6/2017 132 145 2
10003 7/8/2017 742 146 2
10004 7/10/2017 149 147 2
I am having trouble figuring out a good way to create this "date_diff" column as iterating row by row tends to be frowned upon when using pandas/numpy. Is there an easy way to create this column in python/pandas/numpy or is this job better done before the csv is read into my script?
Thanks!
EDIT: Thanks to jpp and Tai for their answers. They cover the original question, but I have a follow-up:
What if my dataset has multiple rows for each date? Is there a way to easily compute the difference between each group of dates to produce an output like the example below? Is it easier if there is a set number of rows for each date?
memid date a b date_diff
10000 7/3/2017 221 143 NaN
10001 7/3/2017 442 144 NaN
10002 7/4/2017 132 145 1
10003 7/4/2017 742 146 1
10004 7/6/2017 149 147 2
10005 7/6/2017 457 148 2
Edit to answer the OP's new question: what if there are duplicates in the date column?
Set up: create a frame df_no_dup that does not contain duplicate dates and compute the diff on it:
df.date = pd.to_datetime(df.date, infer_datetime_format=True)
df_no_dup = df.drop_duplicates("date").copy()
df_no_dup["diff"] = df_no_dup["date"].diff().dt.days
Method 1: merge
df.merge(df_no_dup[["date", "diff"]], left_on="date", right_on="date", how="left")
memid date a b diff
0 10000 2017-07-03 221 143 NaN
1 10001 2017-07-03 442 144 NaN
2 10002 2017-07-04 132 145 1.0
3 10003 2017-07-04 742 146 1.0
4 10004 2017-07-06 149 147 2.0
5 10005 2017-07-06 457 148 2.0
Method 2: map
df["diff"] = df["date"].map(df_no_dup.set_index("date")["diff"])
Try this.
df.date = pd.to_datetime(df.date, infer_datetime_format=True)
df.date.diff()
0 NaT
1 1 days
2 2 days
3 2 days
4 2 days
Name: date, dtype: timedelta64[ns]
To convert the timedeltas to day counts:
df['diff'] = df['date'].diff() / np.timedelta64(1, 'D')
# memid date a b diff
# 0 10000 2017-07-03 221 143 NaN
# 1 10001 2017-07-04 442 144 1.0
# 2 10002 2017-07-06 132 145 2.0
# 3 10003 2017-07-08 742 146 2.0
# 4 10004 2017-07-10 149 147 2.0
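As a small aside (my own note, assuming a reasonably recent pandas), the same day counts can be obtained with the .dt accessor instead of dividing by np.timedelta64:

# equivalent: extract whole days from the timedeltas directly
df['diff'] = df['date'].diff().dt.days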
I have a pandas dataframe pmov with columns SDRFT and DRFT containing float values. Some of the DRFT values are 0.0. When that happens, I want to replace the DRFT value with the SDRFT value. For testing purposes, I've stored the rows where DRFT = 0.0 in dataframe df.
I've tried defining the function:
def SDRFT_is_DRFT(row):
    if row['SDRFT'] == row['DRFT']:
        pass
    elif row['SDRFT'] == 0:
        row['SDRFT'] = row['DRFT']
    elif ['DRFT'] == 0:
        row['DRFT'] = row['SDRFT']
    return row[['SDRFT','DRFT']]
and applying it with: df.apply(SDRFT_is_DRFT, axis=1)
which returns:
In []: df.apply(SDRFT_is_DRFT, axis=1)
Out[]:
SDRFT DRFT
118 29.500000 0.0
144 0.000000 0.0
212 29.166667 0.0
250 21.000000 0.0
308 21.500000 0.0
317 24.500000 0.0
327 11.000000 0.0
334 31.000000 0.0
347 29.500000 0.0
348 35.000000 0.0
Which isn't the outcome I want.
I also tried the function:
def drft_repl(row):
    if row['DRFT'] == 0:
        row['DRFT'] = row['SDRFT']
which appeared to work for df.DRFT = df.apply(drft_repl, axis=1)
but pmov.DRFT = pmov.apply(drft_repl, axis=1) resulted in 100% replacement of DRFT values with SDRFT values, except where the DRFT value was nan.
How can I conditionally replace cell values in one column with values in another column of the same row?
try this:
df.loc[df.DRFT == 0, 'DRFT'] = df.SDRFT
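A quick illustration of that one-liner on a tiny made-up frame (hypothetical values, just to show the effect):

import pandas as pd

df = pd.DataFrame({'SDRFT': [29.5, 0.0, 21.0], 'DRFT': [0.0, 5.98, 21.0]})
df.loc[df.DRFT == 0, 'DRFT'] = df.SDRFT   # only rows where DRFT is 0 are touched
print(df)
#    SDRFT   DRFT
# 0   29.5  29.50
# 1    0.0   5.98
# 2   21.0  21.00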
I think you can use mask. First, column SDRFT is replaced with the values of DRFT where the condition is True, and then column DRFT is replaced with the values of SDRFT:
pmov.SDRFT = pmov.SDRFT.mask(pmov.SDRFT == 0, pmov.DRFT)
pmov.DRFT = pmov.DRFT.mask(pmov.DRFT == 0, pmov.SDRFT)
print pmov
SDRFT DRFT
118 29.500000 29.500000
144 0.000000 0.000000
212 29.166667 29.166667
250 21.000000 21.000000
308 21.500000 21.500000
317 24.500000 24.500000
327 11.000000 11.000000
334 31.000000 31.000000
347 29.500000 29.500000
348 35.000000 35.000000
Another solution with loc:
pmov.loc[pmov.SDRFT == 0, 'SDRFT'] = pmov.DRFT
pmov.loc[pmov.DRFT == 0, 'DRFT'] = pmov.SDRFT
print pmov
SDRFT DRFT
118 29.500000 29.500000
144 0.000000 0.000000
212 29.166667 29.166667
250 21.000000 21.000000
308 21.500000 21.500000
317 24.500000 24.500000
327 11.000000 11.000000
334 31.000000 31.000000
347 29.500000 29.500000
348 35.000000 35.000000
For better testing, the DataFrame was changed:
print pmov
SDRFT DRFT
118 29.5 29.50
144 0.0 5.98
212 0.0 7.30
250 21.0 0.00
308 21.5 0.00
317 0.0 0.00
327 11.0 0.00
334 31.0 0.00
347 29.5 0.00
348 35.0 35.00
pmov.SDRFT = pmov.SDRFT.mask(pmov.SDRFT == 0, pmov.DRFT)
pmov.DRFT = pmov.DRFT.mask(pmov.DRFT == 0, pmov.SDRFT)
print pmov
SDRFT DRFT
118 29.50 29.50
144 5.98 5.98
212 7.30 7.30
250 21.00 21.00
308 21.50 21.50
317 0.00 0.00
327 11.00 11.00
334 31.00 31.00
347 29.50 29.50
348 35.00 35.00
pmov.loc[pmov.DRFT == 0, 'DRFT'] = pmov.SDRFT
pmov.loc[pmov.SDRFT == 0, 'SDRFT'] = pmov.DRFT
print pmov
SDRFT DRFT
118 29.50 29.50
144 5.98 5.98
212 7.30 7.30
250 21.00 21.00
308 21.50 21.50
317 0.00 0.00
327 11.00 11.00
334 31.00 31.00
347 29.50 29.50
348 35.00 35.00
I have some data with multiple observations for a given Collector, Date, Sample, and Type where the observation values vary by ID.
import StringIO
import pandas as pd
data = """Collector,Date,Sample,Type,ID,Value
Emily,2014-06-20,201,HV,A,34
Emily,2014-06-20,201,HV,B,22
Emily,2014-06-20,201,HV,C,10
Emily,2014-06-20,201,HV,D,5
John,2014-06-22,221,HV,A,40
John,2014-06-22,221,HV,B,39
John,2014-06-22,221,HV,C,11
John,2014-06-22,221,HV,D,2
Emily,2014-06-23,203,HV,A,33
Emily,2014-06-23,203,HV,B,35
Emily,2014-06-23,203,HV,C,13
Emily,2014-06-23,203,HV,D,1
John,2014-07-01,218,HV,A,35
John,2014-07-01,218,HV,B,29
John,2014-07-01,218,HV,C,13
John,2014-07-01,218,HV,D,1
"""
>>> df = pd.read_csv(StringIO.StringIO(data), parse_dates=["Date"])
After doing some graphing with the data in this long format, I pivot it to a wide summary table format with columns for each ID.
>>> table = df.pivot_table(index=["Collector", "Date", "Sample", "Type"], columns="ID", values="Value")
ID A B C D
Collector Date Sample Type
Emily 2014-06-20 201 HV 34 22 10 5
2014-06-23 203 HV 33 35 13 1
John 2014-06-22 221 HV 40 39 11 2
2014-07-01 218 HV 35 29 13 1
However, I can't find a concise way to calculate and add some summary rows to the wide format data with mean, median, and maybe some custom aggregation function applied to each of the ID-based columns. This is what I want to end up with:
ID Collector Date Sample Type A B C D
0 Emily 2014-06-20 201 HV 34 22 10 5
2 John 2014-06-22 221 HV 40 39 11 2
1 Emily 2014-06-23 203 HV 33 35 13 1
3 John 2014-07-01 218 HV 35 29 13 1
4 mean 35.5 31.3 11.8 2.3
5 median 34.5 32.0 12.0 1.5
I tried things like calling mean or median on the summary table, but I end up with a Series rather than a row I can concatenate to the summary table. The summary rows I want are sort of like pivot_table margins, but the aggregation function is not sum.
>>> table.mean()
ID
A 35.50
B 31.25
C 11.75
D 2.25
dtype: float64
>>> table.median()
ID
A 34.5
B 32.0
C 12.0
D 1.5
dtype: float64
You could use aggfunc=[np.mean, np.median] to compute both the means and the medians. Then you could use margins=True to also obtain the means and medians for each column and for each row.
import numpy as np

result = df.pivot_table(index=["Collector", "Date", "Sample", "Type"],
                        columns="ID", values="Value", margins=True,
                        aggfunc=[np.mean, np.median]).stack(level=0)
yields
ID A B C D All
Collector Date Sample Type
Emily 2014-06-20 201 HV mean 34.0 22.00 10.00 5.00 17.7500
median 34.0 22.00 10.00 5.00 16.0000
2014-06-23 203 HV mean 33.0 35.00 13.00 1.00 20.5000
median 33.0 35.00 13.00 1.00 23.0000
John 2014-06-22 221 HV mean 40.0 39.00 11.00 2.00 23.0000
median 40.0 39.00 11.00 2.00 25.0000
2014-07-01 218 HV mean 35.0 29.00 13.00 1.00 19.5000
median 35.0 29.00 13.00 1.00 21.0000
All mean 35.5 31.25 11.75 2.25 20.1875
median 34.5 32.00 12.00 1.50 17.5000
Yes, result contains more data than you asked for, but
result.loc['All']
has the additional values:
ID A B C D All
Date Sample Type
mean 35.5 31.25 11.75 2.25 20.1875
median 34.5 32.00 12.00 1.50 17.5000
Or, you could further subselect result to get just the rows you are looking for:
result.index.names = [u'Collector', u'Date', u'Sample', u'Type', u'aggfunc']
mask = result.index.get_level_values('aggfunc') == 'mean'
mask[-1] = True
result = result.loc[mask]
print(result)
yields
ID A B C D All
Collector Date Sample Type aggfunc
Emily 2014-06-20 201 HV mean 34.0 22.00 10.00 5.00 17.7500
2014-06-23 203 HV mean 33.0 35.00 13.00 1.00 20.5000
John 2014-06-22 221 HV mean 40.0 39.00 11.00 2.00 23.0000
2014-07-01 218 HV mean 35.0 29.00 13.00 1.00 19.5000
All mean 35.5 31.25 11.75 2.25 20.1875
median 34.5 32.00 12.00 1.50 17.5000
This might not be super clean, but you could assign to the new entries with .loc.
In [131]: table_mean = table.mean()
In [132]: table_median = table.median()
In [134]: table.loc['Mean', :] = table_mean.values
In [135]: table.loc['Median', :] = table_median.values
In [136]: table
Out[136]:
ID A B C D
Collector Date Sample Type
Emily 2014-06-20 201 HV 34.0 22.00 10.00 5.00
2014-06-23 203 HV 33.0 35.00 13.00 1.00
John 2014-06-22 221 HV 40.0 39.00 11.00 2.00
2014-07-01 218 HV 35.0 29.00 13.00 1.00
Mean 35.5 31.25 11.75 2.25
Median 34.5 32.00 12.00 1.50
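As a side note (my own variant, not from the answer above, assuming pandas >= 0.20): DataFrame.agg can compute both summaries in one call, and the result can then be appended with the same .loc trick:

# compute mean and median per column in one step; the result is a small
# DataFrame indexed by the function names
summary = table.agg(['mean', 'median'])
print(summary)
# ID         A      B      C     D
# mean    35.5  31.25  11.75  2.25
# median  34.5  32.00  12.00  1.50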