import pandas as pd
diamonds = pd.read_csv('diam.csv')
print(diamonds.head())
Unnamed: 0 carat cut color clarity depth table price x y z quality?color
0 0 0.23 Ideal E SI2 61.5 55.0 326 3.95 3.98 2.43 Ideal,E
1 1 0.21 Premium E SI1 59.8 61.0 326 3.89 3.84 2.31 Premium,E
2 2 0.23 Good E VS1 56.9 65.0 327 4.05 4.07 2.31 Good,E
3 3 0.29 Premium I VS2 62.4 58.0 334 4.20 4.23 2.63 Premium,I
4 4 0.31 Good J SI2 63.3 58.0 335 4.34 4.35 2.75 Good,J
I want to print only the columns with object data types:
x=diamonds.dtypes=='object'
diamonds.where(diamonds[x]==True)
But I get this error:
unalignable boolean Series provided as indexer (index of the boolean Series and of the indexed object do not match).
Boolean indexing (and where) works along the row axis, so a mask indexed by column names does not align. Use diamonds.loc[:, diamonds.dtypes == object], or the built-in select_dtypes.
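A minimal sketch of both options, assuming diamonds is loaded as above:
# boolean mask over the columns: True where the dtype is object
obj_cols = diamonds.dtypes == object
print(diamonds.loc[:, obj_cols].head())
# equivalent, using the dedicated method
print(diamonds.select_dtypes(include='object').head())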
From your post (badly formatted) I recreated the diamonds DataFrame, getting the result below:
Unnamed: 0 carat cut color clarity depth table price x y z quality?color
0 0 0.23 Ideal E SI2 61.5 55.0 326 3.95 3.98 2.43 Ideal,E
1 1 0.21 Premium E SI1 59.8 61.0 326 3.89 3.84 2.31 Premium,E
2 2 0.23 Good E VS1 56.9 65.0 327 4.05 4.07 2.31 Good,E
3 3 0.29 Premium I VS2 62.4 58.0 334 4.20 4.23 2.63 Premium,I
4 4 0.31 Good J SI2 63.3 58.0 335 4.34 4.35 2.75 Good,J
When you run x = diamonds.dtypes == 'object' and print x, the result is:
Unnamed: 0 False
carat False
cut True
color True
clarity True
depth False
table False
price False
x False
y False
z False
quality?color True
dtype: bool
It is a boolean vector, answering for each column the question: is this column of object type?
Note that diamonds.columns.where(x).dropna().tolist() prints:
['cut', 'color', 'clarity', 'quality?color']
i.e. names of columns with object type.
So to print all columns of object type you should run:
diamonds[diamonds.columns.where(x).dropna()]
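Equivalently, since x is already a boolean mask over the columns, a shorter sketch:
# select the object-dtype columns directly along the column axis
diamonds.loc[:, x]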
I am trying to read this table
on the webpage: https://datahub.io/sports-data/german-bundesliga
I am using this code:
import pandas as pd
url="https://datahub.io/sports-data/german-bundesliga"
pd.read_html(url)[2]
It reads other tables on the page, but not tables of this type.
Also there is a link to this specific table:
https://datahub.io/sports-data/german-bundesliga/r/0.html
I also tried this:
import pandas as pd
url="https://datahub.io/sports-data/german-bundesliga/r/0.html"
pd.read_html(url)
But it says that there are no tables to read.
There is no need to use the HTML form of the table, because the table is also available in CSV format:
pd.read_csv('https://datahub.io/sports-data/german-bundesliga/r/season-1819.csv').head()
output:
Div Date HomeTeam AwayTeam FTHG FTAG FTR HTHG HTAG HTR ... BbAv<2.5 BbAH BbAHh BbMxAHH BbAvAHH BbMxAHA BbAvAHA PSCH PSCD PSCA
0 D1 24/08/2018 Bayern Munich Hoffenheim 3 1 H 1 0 H ... 3.55 22 -2.00 1.92 1.87 2.05 1.99 1.23 7.15 14.10
1 D1 25/08/2018 Fortuna Dusseldorf Augsburg 1 2 A 1 0 H ... 1.76 20 0.00 1.80 1.76 2.17 2.11 2.74 3.33 2.78
2 D1 25/08/2018 Freiburg Ein Frankfurt 0 2 A 0 1 A ... 1.69 20 -0.25 2.02 1.99 1.92 1.88 2.52 3.30 3.07
3 D1 25/08/2018 Hertha Nurnberg 1 0 H 1 0 H ... 1.76 20 -0.25 1.78 1.74 2.21 2.14 1.79 3.61 5.21
4 D1 25/08/2018 M'gladbach Leverkusen 2 0 H 0 0 D ... 2.32 20 0.00 2.13 2.07 1.84 1.78 2.63 3.70 2.69
5 rows × 61 columns
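If you only need the match results, a follow-up sketch (column names taken from the output above; the remaining columns appear to be bookmaker odds):
import pandas as pd
url = 'https://datahub.io/sports-data/german-bundesliga/r/season-1819.csv'
df = pd.read_csv(url)
# keep just the result columns; FTHG/FTAG are full-time goals, FTR the result
results = df[['Div', 'Date', 'HomeTeam', 'AwayTeam', 'FTHG', 'FTAG', 'FTR']]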
I have the following dataframe df to process:
Name C1 Value_1 C2 Value_2
0 A 112 2.36 112 3.77
1 A 211 1.13 122 2.53
2 A 242 1.22 211 1.13
3 A 245 3.87 242 1.38
4 A 311 3.13 243 4.00
5 A 312 7.11 311 2.07
6 A NaN NaN 312 7.11
7 A NaN NaN 324 1.06
As you can see, the 2 columns of "codes", C1 and C2, are not aligned on the same levels: codes 122, 243, 324 (in column C2) do not appear in column C1, and code 245 (in column C1) does not appear in column C2.
I would like to reconstruct a file where the codes are aligned according to their value, so as to obtain this:
Name C1 Value_1 C2 Value_2
0 A 112 2.36 112 3.77
1 A 122 NaN 122 2.53
2 A 211 1.13 211 1.13
3 A 242 1.22 242 1.38
4 A 243 NaN 243 4.00
5 A 245 3.87 245 NaN
6 A 311 3.13 311 2.07
7 A 312 7.11 312 7.11
8 A 324 NaN 324 1.06
In order to do so, I thought of creating 2 subsets:
left = df[['Name', 'C1', 'Value_1']]
right = df[['Name', 'C2', 'Value_2']]
and I tried to merge them with the merge function:
left.merge(right, on=..., how=..., suffixes=...)
but I got lost in the parameters that should be used to achieve the result.
What do you think would be the best way to do it?
Appendix:
In order to create the initial dataframe, one could use:
import numpy as np
import pandas as pd

names = ['A', 'A', 'A', 'A', 'A', 'A', 'A', 'A']
code1 = [112,211,242,245,311,312,np.nan,np.nan]
zone1 = [2.36, 1.13, 1.22, 3.87, 3.13, 7.11, np.nan, np.nan]
code2 = [112,122,211,242,243,311,312,324]
zone2 = [3.77, 2.53, 1.13, 1.38, 4.00, 2.07, 7.11, 1.06]
df = pd.DataFrame({'Name': names, 'C1': code1, 'Value_1': zone1, 'C2': code2, 'Value_2': zone2})
You are almost there:
left.merge(right, left_on="C1", right_on="C2", how="right").fillna(0)
Output
Name_x  C1   Value_1  Name_y  C2   Value_2
A       112  2.36     A       112  3.77
0       0    0        A       122  2.53
A       211  1.13     A       211  1.13
A       242  1.22     A       242  1.38
0       0    0        A       243  4
A       311  3.13     A       311  2.07
A       312  7.11     A       312  7.11
0       0    0        A       324  1.06
Note that how="right" keeps only codes present in C2, so code 245 (which appears only in C1) is dropped here; the outer merge below preserves it.
IIUC, you can perform an outer merge, then dropna on the missing values:
(df[['Name', 'C1', 'Value_1']]
.merge(df[['Name', 'C2', 'Value_2']],
left_on=['Name', 'C1'], right_on=['Name', 'C2'], how='outer')
.dropna(subset=['C1', 'C2'], how='all')
)
output:
Name C1 Value_1 C2 Value_2
0 A 112.0 2.36 112.0 3.77
1 A 211.0 1.13 211.0 1.13
2 A 242.0 1.22 242.0 1.38
3 A 245.0 3.87 NaN NaN
4 A 311.0 3.13 311.0 2.07
5 A 312.0 7.11 312.0 7.11
8 A NaN NaN 122.0 2.53
9 A NaN NaN 243.0 4.00
10 A NaN NaN 324.0 1.06
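To reach the exact layout requested (codes coalesced and rows sorted), a sketch continuing from the merge above, with the result stored in out:
out = (df[['Name', 'C1', 'Value_1']]
       .merge(df[['Name', 'C2', 'Value_2']],
              left_on=['Name', 'C1'], right_on=['Name', 'C2'], how='outer')
       .dropna(subset=['C1', 'C2'], how='all'))
# fill each code column from the other so both are fully populated,
# then sort by code to match the desired output
out['C1'] = out['C1'].fillna(out['C2'])
out['C2'] = out['C2'].fillna(out['C1'])
out = out.sort_values('C1').reset_index(drop=True)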
I have 3 dataframes/series:
>>> max_q.head()
Out[16]:
130 1.00
143 2.00
146 2.00
324 10.00
327 6.00
dtype: float64
>>> min_q.head()
Out[17]:
130 8.00
143 6.00
146 4.00
324 8.00
327 8.00
dtype: float64
>>> dirx.head()
Out[18]:
side
130 B
143 S
146 S
324 B
327 S
I want to create a new dataframe which equals max_q when dirx == 'S' and min_q otherwise.
Is there a simple way to do this? The following have failed so far:
1. np.where(dirx=='S', max_q, min_q)
2. rp = pd.DataFrame(max_q, columns = ['id'])
rp.loc[dirx=='B', 'id'] = min_q
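One likely culprit: dirx appears to be a DataFrame with a single side column, so dirx == 'S' is two-dimensional and does not align with the Series. A minimal sketch, assuming all three objects share the same index:
import numpy as np
import pandas as pd
# compare the 'side' column (a Series), not the whole DataFrame
mask = dirx['side'] == 'S'
rp = pd.Series(np.where(mask, max_q, min_q), index=max_q.index)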
In the dataframe below:
T2MN T2MX RH2M DFP2M RAIN
6.96 9.32 84.27 5.57 -
6.31 10.46 - 5.63 -
- 10.66 79.38 3.63 -
0.79 4.45 94.24 1.85 -
1.45 3.99 91.71 1.17 -
How do I replace all the - with NaNs? I do not want to specify column names, since I do not know beforehand which columns will have -.
If those are strings, then your floats are probably also strings.
Assuming your dataframe is df, I'd try
pd.to_numeric(df.stack(), errors='coerce').unstack()
Deeper explanation
Pandas doesn't usually represent missing floats with '-'. Therefore, that '-' must be a string. Thus, the dtype of any column with a '-' in it must be 'object'. That makes it highly likely that whatever parsed the data left the floats as strings.
setup
from io import StringIO
import pandas as pd
txt = """T2MN T2MX RH2M DFP2M RAIN
6.96 9.32 84.27 5.57 -
6.31 10.46 - 5.63 -
- 10.66 79.38 3.63 -
0.79 4.45 94.24 1.85 -
1.45 3.99 91.71 1.17 - """
df = pd.read_csv(StringIO(txt), delim_whitespace=True)
print(df)
T2MN T2MX RH2M DFP2M RAIN
0 6.96 9.32 84.27 5.57 -
1 6.31 10.46 - 5.63 -
2 - 10.66 79.38 3.63 -
3 0.79 4.45 94.24 1.85 -
4 1.45 3.99 91.71 1.17 -
What are the dtypes?
print(df.dtypes)
T2MN object
T2MX float64
RH2M object
DFP2M float64
RAIN object
dtype: object
What is the type of the first element?
print(type(df.iloc[0, 0]))
<class 'str'>
This means that any column with a '-' is a column of strings that merely look like floats. You want to use pd.to_numeric with parameter errors='coerce' to force non-numeric items to np.nan. However, pd.to_numeric does not operate on a pd.DataFrame, so we stack and unstack.
pd.to_numeric(df.stack(), errors='coerce').unstack()
T2MN T2MX RH2M DFP2M RAIN
0 6.96 9.32 84.27 5.57 NaN
1 6.31 10.46 NaN 5.63 NaN
2 NaN 10.66 79.38 3.63 NaN
3 0.79 4.45 94.24 1.85 NaN
4 1.45 3.99 91.71 1.17 NaN
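On recent pandas versions the stack/unstack round trip can likely be skipped, since DataFrame.apply feeds pd.to_numeric one column at a time (same df as above):
# convert every column, coercing non-numeric entries to NaN
df.apply(pd.to_numeric, errors='coerce')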
Just replace() the string:
In [10]: df.replace('-', 'NaN')
Out[10]:
T2MN T2MX RH2M DFP2M RAIN
0 6.96 9.32 84.27 5.57 NaN
1 6.31 10.46 NaN 5.63 NaN
2 NaN 10.66 79.38 3.63 NaN
3 0.79 4.45 94.24 1.85 NaN
4 1.45 3.99 91.71 1.17 NaN
I think you want the actual numpy.nan instead of the string 'NaN', so that you can use methods such as fillna/isnull/notnull on the pandas.Series/pandas.DataFrame:
import pandas as pd
import numpy as np
df = pd.DataFrame([['-']*10]*10)
df = df.replace('-',np.nan)
It looks like you read this data from a CSV/fixed-width file. If so, the easiest way to get rid of '-' is to tell pandas that it is a NaN representation:
df = pd.read_csv(filename, na_values=['NaN', 'nan', '-'])
Test:
In [79]: df
Out[79]:
T2MN T2MX RH2M DFP2M RAIN
0 6.96 9.32 84.27 5.57 NaN
1 6.31 10.46 NaN 5.63 NaN
2 NaN 10.66 79.38 3.63 NaN
3 0.79 4.45 94.24 1.85 NaN
4 1.45 3.99 91.71 1.17 NaN
In [80]: df.dtypes
Out[80]:
T2MN float64
T2MX float64
RH2M float64
DFP2M float64
RAIN float64
dtype: object
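As a self-contained check, the same idea applied to the sample data from the setup earlier (txt as defined there):
from io import StringIO
import pandas as pd
# declare '-' as a NaN marker at parse time; every column then parses as float64
df = pd.read_csv(StringIO(txt), delim_whitespace=True, na_values=['-'])
print(df.dtypes)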
I have some data with multiple observations for a given Collector, Date, Sample, and Type where the observation values vary by ID.
from io import StringIO
import pandas as pd
data = """Collector,Date,Sample,Type,ID,Value
Emily,2014-06-20,201,HV,A,34
Emily,2014-06-20,201,HV,B,22
Emily,2014-06-20,201,HV,C,10
Emily,2014-06-20,201,HV,D,5
John,2014-06-22,221,HV,A,40
John,2014-06-22,221,HV,B,39
John,2014-06-22,221,HV,C,11
John,2014-06-22,221,HV,D,2
Emily,2014-06-23,203,HV,A,33
Emily,2014-06-23,203,HV,B,35
Emily,2014-06-23,203,HV,C,13
Emily,2014-06-23,203,HV,D,1
John,2014-07-01,218,HV,A,35
John,2014-07-01,218,HV,B,29
John,2014-07-01,218,HV,C,13
John,2014-07-01,218,HV,D,1
"""
>>> df = pd.read_csv(StringIO(data), parse_dates=["Date"])
After doing some graphing with the data in this long format, I pivot it to a wide summary table format with columns for each ID.
>>> table = df.pivot_table(index=["Collector", "Date", "Sample", "Type"], columns="ID", values="Value")
ID A B C D
Collector Date Sample Type
Emily 2014-06-20 201 HV 34 22 10 5
2014-06-23 203 HV 33 35 13 1
John 2014-06-22 221 HV 40 39 11 2
2014-07-01 218 HV 35 29 13 1
However, I can't find a concise way to calculate and add some summary rows to the wide format data with mean, median, and maybe some custom aggregation function applied to each of the ID-based columns. This is what I want to end up with:
ID Collector Date Sample Type A B C D
0 Emily 2014-06-20 201 HV 34 22 10 5
2 John 2014-06-22 221 HV 40 39 11 2
1 Emily 2014-06-23 203 HV 33 35 13 1
3 John 2014-07-01 218 HV 35 29 13 1
4 mean 35.5 31.3 11.8 2.3
5 median 34.5 32.0 12.0 1.5
I tried things like calling mean or median on the summary table, but I end up with a Series rather than a row I can concatenate to the summary table. The summary rows I want are sort of like pivot_table margins, but the aggregation function is not sum.
>>> table.mean()
ID
A 35.50
B 31.25
C 11.75
D 2.25
dtype: float64
>>> table.median()
ID
A 34.5
B 32.0
C 12.0
D 1.5
dtype: float64
You could use aggfunc=[np.mean, np.median] to compute both the means and the medians. Then you could use margins=True to also obtain the means and medians for each column and for each row.
import numpy as np

result = df.pivot_table(index=["Collector", "Date", "Sample", "Type"],
                        columns="ID", values="Value", margins=True,
                        aggfunc=[np.mean, np.median]).stack(level=0)
yields
ID A B C D All
Collector Date Sample Type
Emily 2014-06-20 201 HV mean 34.0 22.00 10.00 5.00 17.7500
median 34.0 22.00 10.00 5.00 16.0000
2014-06-23 203 HV mean 33.0 35.00 13.00 1.00 20.5000
median 33.0 35.00 13.00 1.00 23.0000
John 2014-06-22 221 HV mean 40.0 39.00 11.00 2.00 23.0000
median 40.0 39.00 11.00 2.00 25.0000
2014-07-01 218 HV mean 35.0 29.00 13.00 1.00 19.5000
median 35.0 29.00 13.00 1.00 21.0000
All mean 35.5 31.25 11.75 2.25 20.1875
median 34.5 32.00 12.00 1.50 17.5000
Yes, result contains more data than you asked for, but
result.loc['All']
has the additional values:
ID A B C D All
Date Sample Type
mean 35.5 31.25 11.75 2.25 20.1875
median 34.5 32.00 12.00 1.50 17.5000
Or, you could further subselect result to get just the rows you are looking for:
result.index.names = ['Collector', 'Date', 'Sample', 'Type', 'aggfunc']
mask = result.index.get_level_values('aggfunc') == 'mean'
mask[-1] = True
result = result.loc[mask]
print(result)
yields
ID A B C D All
Collector Date Sample Type aggfunc
Emily 2014-06-20 201 HV mean 34.0 22.00 10.00 5.00 17.7500
2014-06-23 203 HV mean 33.0 35.00 13.00 1.00 20.5000
John 2014-06-22 221 HV mean 40.0 39.00 11.00 2.00 23.0000
2014-07-01 218 HV mean 35.0 29.00 13.00 1.00 19.5000
All mean 35.5 31.25 11.75 2.25 20.1875
median 34.5 32.00 12.00 1.50 17.5000
This might not be super clean, but you could assign to the new entries with .loc.
In [131]: table_mean = table.mean()
In [132]: table_median = table.median()
In [134]: table.loc['Mean', :] = table_mean.values
In [135]: table.loc['Median', :] = table_median.values
In [136]: table
Out[136]:
ID A B C D
Collector Date Sample Type
Emily 2014-06-20 201 HV 34.0 22.00 10.00 5.00
2014-06-23 203 HV 33.0 35.00 13.00 1.00
John 2014-06-22 221 HV 40.0 39.00 11.00 2.00
2014-07-01 218 HV 35.0 29.00 13.00 1.00
Mean 35.5 31.25 11.75 2.25
Median 34.5 32.00 12.00 1.50
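An alternative sketch that computes both summary rows in one call and leaves table untouched (note the appended rows carry plain string labels alongside the MultiIndex entries):
# agg with a list of function names returns one row per function
summary = table.agg(['mean', 'median'])
out = pd.concat([table, summary])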