I have data like this in a csv file
Symbol Action Year
AAPL Buy 2001
AAPL Buy 2001
BAC Sell 2002
BAC Sell 2002
I am able to read it and groupby like this
df.groupby(['Symbol','Year']).count()
I get
Action
Symbol Year
AAPL 2001 2
BAC 2002 2
I desire this (order does not matter)
Action
Symbol Year
AAPL 2001 2
AAPL 2002 0
BAC 2001 0
BAC 2002 2
I want to know if it's possible to also count the zero occurrences.
You can use this:
df = df.groupby(['Symbol','Year']).count().unstack(fill_value=0).stack()
print(df)
Output:
Action
Symbol Year
AAPL 2001 2
2002 0
BAC 2001 0
2002 2
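For a fully self-contained run, here is a small sketch that rebuilds the sample data from the question and applies the same one-liner (the DataFrame literal is just a reconstruction of the sample shown above):
import pandas as pd

# Rebuild the sample data from the question
df = pd.DataFrame({'Symbol': ['AAPL', 'AAPL', 'BAC', 'BAC'],
                   'Action': ['Buy', 'Buy', 'Sell', 'Sell'],
                   'Year':   [2001, 2001, 2002, 2002]})

# unstack creates the missing Symbol/Year combinations, fill_value=0 fills them,
# and stack brings the result back to a (Symbol, Year) MultiIndex
print(df.groupby(['Symbol', 'Year']).count().unstack(fill_value=0).stack())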
You can use pivot_table with unstack:
print(df.pivot_table(index='Symbol',
                     columns='Year',
                     values='Action',
                     fill_value=0,
                     aggfunc='count').unstack())
Year Symbol
2001 AAPL 2
BAC 0
2002 AAPL 0
BAC 2
dtype: int64
If you need the output as a DataFrame, use to_frame:
print(df.pivot_table(index='Symbol',
                     columns='Year',
                     values='Action',
                     fill_value=0,
                     aggfunc='count').unstack()
        .to_frame()
        .rename(columns={0: 'Action'}))
Action
Year Symbol
2001 AAPL 2
BAC 0
2002 AAPL 0
BAC 2
Datatype category
Maybe this feature didn't exist back when this thread was opened; however, the datatype "category" can help here:
# create a dataframe with one combination of a,b missing
df = pd.DataFrame({"a":[0,1,1], "b": [0,1,0]})
df = df.astype({"a":"category", "b":"category"})
print(df)
Dataframe looks like this:
a b
0 0 0
1 1 1
2 1 0
And now, grouping by a and b
print(df.groupby(["a","b"]).size())
yields:
a b
0 0 1
1 0
1 0 1
1 1
Note the 0 in the rightmost column. This behavior is also documented in the pandas user guide (search the page for "groupby").
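Applied to the Symbol/Year data from the question, the same idea would look roughly like the sketch below (observed=False is passed explicitly so that unobserved category combinations are kept; whether you need it depends on your pandas version's default):
# df is the Symbol/Action/Year frame from the question; cast the grouping columns to "category"
df_cat = df.astype({'Symbol': 'category', 'Year': 'category'})

# With categorical keys, groupby can emit every combination, giving 0 for the missing ones
print(df_cat.groupby(['Symbol', 'Year'], observed=False).size())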
If you want to do this without using pivot_table, you can try the below approach:
midx = pd.MultiIndex.from_product([df['Symbol'].unique(), df['Year'].unique()],
                                  names=['Symbol', 'Year'])
df_grouped_by = df.groupby(['Symbol', 'Year']).count()
df_grouped_by = df_grouped_by.reindex(midx, fill_value=0)
What we are essentially doing above is creating a MultiIndex of all possible values (the Cartesian product of the unique values of the two columns) and then using that MultiIndex to reindex our grouped DataFrame, filling the missing combinations with zeroes.
Step 1: Create a dataframe that stores the count of each non-zero class in the column counts
count_df = df.groupby(['Symbol','Year']).size().reset_index(name='counts')
Step 2: Now use pivot_table to get the desired dataframe with counts for both existing and non-existing classes.
import numpy as np

df_final = pd.pivot_table(count_df,
                          index=['Symbol', 'Year'],
                          values='counts',
                          fill_value=0,
                          dropna=False,
                          aggfunc=np.sum)
Now the values of the counts can be extracted as a list with the command
list(df_final['counts'])
All the answers above focus on groupby or pivot_table. However, as is well described in this article and in this question, this is a beautiful case for pandas' crosstab function:
import pandas as pd
df = pd.DataFrame({
    "Symbol": 2 * ['AAPL', 'BAC'],
    "Action": 2 * ['Buy', 'Sell'],
    "Year": 2 * [2001, 2002]
})
pd.crosstab(df["Symbol"], df["Year"]).stack()
yielding:
Symbol Year
AAPL 2001 2
2002 0
BAC 2001 0
2002 2
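If you also want the result as a DataFrame with a named count column (matching the Action column in the question), a possible follow-up is to rename the stacked Series and convert it; the column name here is just an assumption:
# Turn the stacked counts into a one-column DataFrame; the name 'Action' is an assumption
out = pd.crosstab(df["Symbol"], df["Year"]).stack().rename("Action").to_frame()
print(out)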
Related
I have the following data frame.
ID Product quantity
9626 a 1
9626 b 1
9626 c 1
6600 f 1
6600 a 1
6600 d 1
And I want to join rows by ID.
Below is an example of the results.
(The quantity column is optional. This column is not necessary.)
ID Product quantity
9626 a,b,c 3
6600 a,d,f 3
I used merge and sum, but it did not work.
Can this problem only be solved with a loop?
I'd appreciate it if you could provide me with a solution.
Use groupby.agg:
df = (df.sort_values('Product')
        .groupby('ID', as_index=False, sort=False)
        .agg({'Product': ','.join, 'quantity': 'sum'}))
print(df)
ID Product quantity
0 9626 a,b,c 3
1 6600 a,d,f 3
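For completeness, a minimal runnable sketch that rebuilds the input frame from the question and applies the same groupby.agg call:
import pandas as pd

# Rebuild the sample data from the question
df = pd.DataFrame({'ID': [9626, 9626, 9626, 6600, 6600, 6600],
                   'Product': ['a', 'b', 'c', 'f', 'a', 'd'],
                   'quantity': [1, 1, 1, 1, 1, 1]})

# Sort so the joined products come out alphabetically, then aggregate per ID
out = (df.sort_values('Product')
         .groupby('ID', as_index=False, sort=False)
         .agg({'Product': ','.join, 'quantity': 'sum'}))
print(out)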
I have a Pandas dataframe resulting from a groupby() operation. This dataframe has two indexes (year, month). How can I normalize a column relative to the corresponding month in a specific year?
My dataframe looks like the following:
value
year month
2000 1 1234
2 4567
2001 1 2345
2 5678
2002 1 3456
2 6789
I would like the resulting dataframe to have each value divided by the corresponding monthly value in 2002, thus expressing all values relative to 2002 levels. This would result in the values for 2002 being 1.0 for both months.
What is the most efficient way of doing this? Appreciate any help!
Use DataFrame.div with a level argument.
df.div(df.xs(2002), level=1, axis=0)
value
year month
2000 1 0.357060
2 0.672706
2001 1 0.678530
2 0.836353
2002 1 1.000000
2 1.000000
Where,
df.xs(2002)
value
month
1 3456
2 6789
The division is aligned on the month level (level=1) of the row index.
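A self-contained sketch of the whole thing, rebuilding the (year, month) frame from the question (the numbers are copied from the sample above):
import pandas as pd

# Rebuild the sample MultiIndex frame from the question
idx = pd.MultiIndex.from_product([[2000, 2001, 2002], [1, 2]], names=['year', 'month'])
df = pd.DataFrame({'value': [1234, 4567, 2345, 5678, 3456, 6789]}, index=idx)

# Divide every row by the matching month of 2002; alignment happens on the month level
print(df.div(df.xs(2002), level=1, axis=0))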
I have the following two dataframes:
df1:
date id
2000 1
2001 1
2002 2
df2:
date id
2000 1
2002 2
I now want to extract a list of observations that are in df1 but not in df2 based on date AND id.
The result should look like this:
date id
2001 1
I know how make a command to compare a column to a list with isin like this:
result = df1[~df1["id"].isin(df2["id"].tolist())]
However, this only compares the two dataframes based on the id column. Because an id could be present in both df1 and df2 but for different dates, it is important that I only get rows whose combination of id and date does not appear in df2. Does somebody know how to do that?
Using merge
In [795]: (df1.merge(df2, how='left', indicator='_a')
              .query('_a == "left_only"')
              .drop('_a', axis=1))
Out[795]:
date id
1 2001 1
Details
In [796]: df1.merge(df2, how='left', indicator='_a')
Out[796]:
date id _a
0 2000 1 both
1 2001 1 left_only
2 2002 2 both
In [797]: df1.merge(df2, how='left', indicator='_a').query('_a == "left_only"')
Out[797]:
date id _a
1 2001 1 left_only
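Putting it together as a runnable sketch with the two small frames from the question:
import pandas as pd

df1 = pd.DataFrame({'date': [2000, 2001, 2002], 'id': [1, 1, 2]})
df2 = pd.DataFrame({'date': [2000, 2002], 'id': [1, 2]})

# Left-merge on all shared columns and keep only the rows that exist solely in df1
result = (df1.merge(df2, how='left', indicator='_a')
             .query('_a == "left_only"')
             .drop(columns='_a'))
print(result)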
Follow up to this post:
Merging two columns which don't overlap and create new columns
import pandas as pd
df1 = pd.DataFrame([["2014", "q2", 2],
["2013", "q1", 1],],
columns=('Year', 'Quarter', 'Value'))
df2 = pd.DataFrame([["2016", "q1", 3],
["2015", "q1", 3]],
columns=('Year', 'Quarter', 'Value'))
print(df1.merge(df2, on='Year', how='outer'))
Results in:
Year Quarter_x Value_x Quarter_y Value_y
0 2014 q2 2 NaN NaN
1 2013 q1 1 NaN NaN
2 2016 NaN NaN q1 3
3 2015 NaN NaN q1 3
But I want to get this:
Year Quarter Value
0 2014 q2 2
1 2013 q1 1
2 2016 q1 3
3 2015 q1 3
Note: This doesn't produce the desired result... :(
print(df1.merge(df2, on=['Year', 'Quarter','Value'], how='outer').dropna())
Year Quarter Value
0 2014 q2 2
1 2013 q1 1
... using 'left', 'right' or 'inner' also doesn't cut it.
Not sure what's happening here, but if I do
df1.merge(df2, on=['Year', 'Quarter', 'Value'], how='outer').dropna()
I get:
Year Quarter Value
0 2014 q2 2.0
1 2013 q1 1.0
2 2016 q1 3.0
3 2015 q1 3.0
You may want to take a look at the merge, join & concat docs.
The most 'intuitive' way for this is probably .append():
df1.append(df2)
Year Quarter Value
0 2014 q2 2.0
1 2013 q1 1.0
2 2016 q1 3.0
3 2015 q1 3.0
If you look into the source code, you'll find it calls concat behind the scenes.
Merge is useful and intended for cases where you have columns with overlapping values.
pandas concat is much better suited for this.
pd.concat([df1, df2]).reset_index(drop=True)
Year Quarter Value
0 2014 q2 2
1 2013 q1 1
2 2016 q1 3
3 2015 q1 3
concat is intended to place one dataframe adjacent to another while keeping the index or columns aligned. In the default case, it keeps the columns aligned. Considering your example dataframes, the columns are aligned and your stated expected output shows df2 placed exactly after df1 where the columns are aligned. Every aspect of what you've asked for is exactly what concat was designed to provide. All I've done is point you to an appropriate function.
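If you would rather get a fresh 0..n-1 index directly instead of calling reset_index afterwards, concat also accepts ignore_index:
# Same result, but let concat renumber the rows itself
pd.concat([df1, df2], ignore_index=True)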
You're looking for the append feature:
df_final = df1.append(df2)
I want to make all column headers in my pandas data frame lower case
Example
If I have:
data =
country country isocode year XRAT tcgdp
0 Canada CAN 2001 1.54876 924909.44207
1 Canada CAN 2002 1.56932 957299.91586
2 Canada CAN 2003 1.40105 1016902.00180
....
I would like to change XRAT to xrat by doing something like:
data.headers.lowercase()
So that I get:
country country isocode year xrat tcgdp
0 Canada CAN 2001 1.54876 924909.44207
1 Canada CAN 2002 1.56932 957299.91586
2 Canada CAN 2003 1.40105 1016902.00180
3 Canada CAN 2004 1.30102 1096000.35500
....
I will not know the names of each column header ahead of time.
You can do it like this:
data.columns = map(str.lower, data.columns)
or
data.columns = [x.lower() for x in data.columns]
example:
>>> data = pd.DataFrame({'A':range(3), 'B':range(3,0,-1), 'C':list('abc')})
>>> data
A B C
0 0 3 a
1 1 2 b
2 2 1 c
>>> data.columns = map(str.lower, data.columns)
>>> data
a b c
0 0 3 a
1 1 2 b
2 2 1 c
You could do it easily with str.lower for columns:
df.columns = df.columns.str.lower()
Example:
In [63]: df
Out[63]:
country country isocode year XRAT tcgdp
0 Canada CAN 2001 1.54876 9.249094e+05
1 Canada CAN 2002 1.56932 9.572999e+05
2 Canada CAN 2003 1.40105 1.016902e+06
In [64]: df.columns = df.columns.str.lower()
In [65]: df
Out[65]:
country country isocode year xrat tcgdp
0 Canada CAN 2001 1.54876 9.249094e+05
1 Canada CAN 2002 1.56932 9.572999e+05
2 Canada CAN 2003 1.40105 1.016902e+06
If you want to do the rename using a chained method call, you can use
data.rename(columns=str.lower)
If you're not chaining method calls, you can add inplace=True
data.rename(columns=str.lower, inplace=True)
df.columns = df.columns.str.lower()
is the easiest but will give an error if some headers are numeric
if you have numeric headers then use this:
df.columns = [str(x).lower() for x in df.columns]
I noticed some of the other answers will fail if a column name is made of digits (e.g. "123"). Try these to handle such cases too.
Option 1: Use df.rename
def rename_col(old_name):
    return str(old_name).lower()

df = df.rename(columns=rename_col)
Option 2 (from this comment):
df.columns = df.columns.astype(str).str.lower()
Another convention based on the official documentation:
frame.rename(mapper=lambda x:x.lower(), axis='columns', inplace=True)
Parameters:
mapper:
Dict-like or function transformations to apply to that axis’ values. Use either mapper and axis to specify the axis to target with mapper, or index and columns.
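Since the mapper can be any callable, the numeric-header case mentioned earlier can be handled in the same chained style; a small sketch:
# str() guards against non-string (e.g. numeric) column labels
data = data.rename(columns=lambda c: str(c).lower())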