I want to make all of the column headers in my pandas DataFrame lower case.
Example
If I have:
data =
  country country isocode  year     XRAT          tcgdp
0  Canada             CAN  2001  1.54876   924909.44207
1  Canada             CAN  2002  1.56932   957299.91586
2  Canada             CAN  2003  1.40105  1016902.00180
....
I would like to change XRAT to xrat by doing something like:
data.headers.lowercase()
So that I get:
  country country isocode  year     xrat          tcgdp
0  Canada             CAN  2001  1.54876   924909.44207
1  Canada             CAN  2002  1.56932   957299.91586
2  Canada             CAN  2003  1.40105  1016902.00180
3  Canada             CAN  2004  1.30102  1096000.35500
....
I will not know the names of each column header ahead of time.
You can do it like this:
data.columns = map(str.lower, data.columns)
or
data.columns = [x.lower() for x in data.columns]
example:
>>> data = pd.DataFrame({'A':range(3), 'B':range(3,0,-1), 'C':list('abc')})
>>> data
   A  B  C
0  0  3  a
1  1  2  b
2  2  1  c
>>> data.columns = map(str.lower, data.columns)
>>> data
   a  b  c
0  0  3  a
1  1  2  b
2  2  1  c
You can do it easily with str.lower on the columns:
df.columns = df.columns.str.lower()
Example:
In [63]: df
Out[63]:
  country country isocode  year     XRAT         tcgdp
0  Canada             CAN  2001  1.54876  9.249094e+05
1  Canada             CAN  2002  1.56932  9.572999e+05
2  Canada             CAN  2003  1.40105  1.016902e+06
In [64]: df.columns = df.columns.str.lower()
In [65]: df
Out[65]:
  country country isocode  year     xrat         tcgdp
0  Canada             CAN  2001  1.54876  9.249094e+05
1  Canada             CAN  2002  1.56932  9.572999e+05
2  Canada             CAN  2003  1.40105  1.016902e+06
If you want to do the rename using a chained method call, you can use
data.rename(columns=str.lower)
If you're not chaining method calls, you can add inplace=True
data.rename(columns=str.lower, inplace=True)
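For instance, a hypothetical chain could look like this (the DataFrame and the sort_values step are made up purely for illustration):
import pandas as pd

data = pd.DataFrame({'Country': ['Canada', 'Canada'], 'XRAT': [1.54876, 1.56932]})

# rename returns a new DataFrame, so it slots into the middle of a chain
result = (data.rename(columns=str.lower)
              .sort_values('xrat'))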
df.columns = df.columns.str.lower()
is the easiest, but it will raise an error if some headers are numeric.
If you have numeric headers, use this instead:
df.columns = [str(x).lower() for x in df.columns]
I noticed some of the other answers will fail if a column name is made of digits (e.g. "123"). Try these to handle such cases too.
Option 1: Use df.rename
def rename_col(old_name):
    return str(old_name).lower()

df = df.rename(columns=rename_col)
Option 2 (from a comment):
df.columns = df.columns.astype(str).str.lower()
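A minimal sketch of the numeric-header case (the DataFrame here is made up for illustration):
import pandas as pd

df = pd.DataFrame([[1, 2]], columns=[123, 'NAME'])
# df.columns.str.lower() would fail here because the Index is not
# string-typed; casting to str first handles the numeric header
df.columns = df.columns.astype(str).str.lower()
print(df.columns)  # Index(['123', 'name'], dtype='object')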
Another convention based on the official documentation:
frame.rename(mapper=lambda x: x.lower(), axis='columns', inplace=True)
Parameters:
mapper:
Dict-like or function transformations to apply to that axis’ values. Use either mapper and axis to specify the axis to target with mapper, or index and columns.
I want to group by ID and get the three most frequent cities. For example, I have this original dataframe:
ID  City
1   London
1   London
1   New York
1   London
1   New York
1   Berlin
2   Shanghai
2   Shanghai
and the result I want is like this:
ID  first_frequent_city  second_frequent_city  third_frequent_city
1   London               New York              Berlin
2   Shanghai             NaN                   NaN
The first step is to use SeriesGroupBy.value_counts to count the values of City per ID (an advantage is that the values are already sorted). Then get a counter with GroupBy.cumcount, keep the first three values per group with loc, pivot with DataFrame.pivot, rename the columns, and finally convert ID back into a column with DataFrame.reset_index:
df = (df.groupby('ID')['City'].value_counts()
        .groupby(level=0).cumcount()
        .loc[lambda x: x < 3]
        .reset_index(name='c')
        .pivot(index='ID', columns='c', values='City')
        .rename(columns={0: 'first_', 1: 'second_', 2: 'third_'})
        .add_suffix('frequent_city')
        .rename_axis(None, axis=1)
        .reset_index())
print(df)
   ID first_frequent_city second_frequent_city third_frequent_city
0   1              London             New York              Berlin
1   2            Shanghai                  NaN                 NaN
Another way: use the count as a reference to sort, then recreate the dataframe by looping through the groupby object:
import numpy as np

df = (df.assign(count=df.groupby(["ID", "City"])["City"].transform("count"))
        .drop_duplicates(["ID", "City"])
        .sort_values(["ID", "count"], ascending=False))
print(pd.DataFrame([i["City"].unique()[:3] for _, i in df.groupby("ID")]).fillna(np.nan))
          0         1       2
0    London  New York  Berlin
1  Shanghai       NaN     NaN
A bit long; essentially you group by twice. The first part works on the idea that grouping sorts the data in ascending order, and the second part splits the data into individual columns:
(df
 .groupby("ID")
 .tail(3)
 .drop_duplicates()
 .groupby("ID")
 .agg(",".join)
 .City.str.split(",", expand=True)
 .set_axis(["first_frequent_city",
            "second_frequent_city",
            "third_frequent_city"],
           axis="columns")
)
   first_frequent_city second_frequent_city third_frequent_city
ID
1               London             New York              Berlin
2             Shanghai                 None                None
Get the count by ID and City, then use np.where() with groupby transforms of max, median, and min to label each row. Then set the index and unstack the max column from rows to columns.
import numpy as np

df = df.assign(count=df.groupby(['ID', 'City'])['City'].transform('count')).drop_duplicates()
df['max'] = np.where(df['count'] == df.groupby('ID')['count'].transform('min'),
                     'third_frequent_city', np.nan)
df['max'] = np.where(df['count'] == df.groupby('ID')['count'].transform('median'),
                     'second_frequent_city', df['max'])
df['max'] = np.where(df['count'] == df.groupby('ID')['count'].transform('max'),
                     'first_frequent_city', df['max'])
df = df.drop('count', axis=1).set_index(['ID', 'max']).unstack(1)
Output:
                     City
max   first_frequent_city second_frequent_city third_frequent_city
ID
1                  London             New York              Berlin
2                Shanghai                  NaN                 NaN
I'm trying to change the structure of my dataset.
Currently I have:
RE id  Country  0   1    2    ...  n
1001   CN,TH    CN  TH   nan  ...  nan
1002   UK       UK  nan  nan  ...  nan
I've split the Country column out, hence the additional columns. Now I am trying to use df.melt to get this:
RE id  var  val
1001   0    CN
1001   0    TH
So I can eventually get to this by using a pivot:
RE id  Country
1001   TH
1001   CN
I've tried:
df = a.melt(id_vars=[a[[0]],a[[1]],a[[2]]], value_vars=['RE id'])
How can I select the range of columns in my dataframe to use as the identifier variables?
https://pandas.pydata.org/pandas-docs/version/0.23.4/generated/pandas.DataFrame.melt.html#pandas.DataFrame.melt
I think the problem was that you were referencing the column names incorrectly. Also, I believe you had id_vars (should be Re id, I think) and value_vars (column names 0 and 1) inverted in your code.
Here is how I approached this
Imports
import pandas as pd
import numpy as np
Here is a part of the data, sufficient to demonstrate the likely problem
a = [['Re id', 0, 1],[1001,'CN','TH'],[1002,'UK',np.nan]]
df = pd.DataFrame(a[1:], columns=a[0])
print(df)
   Re id   0    1
0   1001  CN   TH
1   1002  UK  NaN
Now, use pd.melt with:
- id_vars pointing to Re id
- value_vars as the 2 columns you want to melt (namely, column names 0 and 1)
df_melt = pd.melt(df, id_vars=['Re id'], value_vars=[0,1], value_name='Country')
df_melt.sort_values(by=['Re id', 'Country'], ascending=[True,False], inplace=True)
print(df_melt)
   Re id variable Country
2   1001        1      TH
0   1001        0      CN
1   1002        0      UK
3   1002        1     NaN
Also, since you have the Country names in separate columns (0,1), I do not think that you need to use the Country column at all.
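To directly answer the question about selecting a range of columns, you can slice df.columns instead of listing the names by hand. A sketch, assuming the split columns are everything after Re id:
df_melt = pd.melt(df,
                  id_vars=['Re id'],
                  value_vars=df.columns[1:],  # every column after 'Re id'
                  value_name='Country')
# drop the padding rows produced by the nan-filled split columns
df_melt = df_melt.dropna(subset=['Country'])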
I have data like this in a CSV file:
Symbol  Action  Year
AAPL    Buy     2001
AAPL    Buy     2001
BAC     Sell    2002
BAC     Sell    2002
I am able to read it and group it like this:
df.groupby(['Symbol','Year']).count()
I get:
             Action
Symbol Year
AAPL   2001       2
BAC    2002       2
I desire this (order does not matter):
             Action
Symbol Year
AAPL   2001       2
AAPL   2002       0
BAC    2001       0
BAC    2002       2
I want to know if it's possible to count zero occurrences.
You can use this:
df = df.groupby(['Symbol','Year']).count().unstack(fill_value=0).stack()
print(df)
Output:
             Action
Symbol Year
AAPL   2001       2
       2002       0
BAC    2001       0
       2002       2
You can use pivot_table with unstack:
print(df.pivot_table(index='Symbol',
                     columns='Year',
                     values='Action',
                     fill_value=0,
                     aggfunc='count').unstack())
Year  Symbol
2001  AAPL      2
      BAC       0
2002  AAPL      0
      BAC       2
dtype: int64
If you need the output as a DataFrame, use to_frame:
print(df.pivot_table(index='Symbol',
                     columns='Year',
                     values='Action',
                     fill_value=0,
                     aggfunc='count').unstack()
        .to_frame()
        .rename(columns={0: 'Action'}))
             Action
Year Symbol
2001 AAPL         2
     BAC          0
2002 AAPL         0
     BAC          2
Datatype category
Maybe this feature didn't exist back when this thread was opened, but the datatype "category" can help here:
# create a dataframe with one combination of a,b missing
df = pd.DataFrame({"a":[0,1,1], "b": [0,1,0]})
df = df.astype({"a":"category", "b":"category"})
print(df)
The dataframe looks like this:
   a  b
0  0  0
1  1  1
2  1  0
And now, grouping by a and b
print(df.groupby(["a","b"]).size())
yields:
a  b
0  0    1
   1    0
1  0    1
   1    1
Note the 0 in the rightmost column. This behavior is also documented in the pandas userguide (search on page for "groupby").
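One caveat worth knowing: in recent pandas versions the default of groupby's observed parameter for categorical keys is changing, so you may need to pass it explicitly to keep the zero-count rows (a small sketch):
# observed=False keeps unobserved category combinations (the zeros);
# newer pandas warns that the default will eventually flip to True
print(df.groupby(["a", "b"], observed=False).size())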
If you want to do this without using pivot_table, you can try the approach below:
midx = pd.MultiIndex.from_product([df['Symbol'].unique(), df['Year'].unique()],
                                  names=['Symbol', 'Year'])
df_grouped_by = df.groupby(['Symbol', 'Year']).count()
df_grouped_by = df_grouped_by.reindex(midx, fill_value=0)
What we are essentially doing above is creating a multi-index of all the possible values (the product of the two columns) and then using that multi-index to fill zeroes into our grouped dataframe.
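Put together with the question's sample data, a minimal end-to-end sketch (the sample frame is reconstructed here for illustration):
import pandas as pd

df = pd.DataFrame({'Symbol': ['AAPL', 'AAPL', 'BAC', 'BAC'],
                   'Action': ['Buy', 'Buy', 'Sell', 'Sell'],
                   'Year': [2001, 2001, 2002, 2002]})

midx = pd.MultiIndex.from_product([df['Symbol'].unique(), df['Year'].unique()],
                                  names=['Symbol', 'Year'])

# count the existing (Symbol, Year) pairs, then expand onto the full grid
df_grouped_by = df.groupby(['Symbol', 'Year']).count()
df_grouped_by = df_grouped_by.reindex(midx, fill_value=0)
print(df_grouped_by)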
Step 1: Create a dataframe that stores the count of each non-zero class in the column counts
count_df = df.groupby(['Symbol','Year']).size().reset_index(name='counts')
Step 2: Now use pivot_table to get the desired dataframe with counts for both existing and non-existing classes.
df_final = pd.pivot_table(count_df,
                          index=['Symbol', 'Year'],
                          values='counts',
                          fill_value=0,
                          dropna=False,
                          aggfunc='sum')
Now the values of the counts can be extracted as a list with the command
list(df_final['counts'])
All the answers above are focusing on groupby or pivot table. However, as is well described in this article and in this question, this is a beautiful case for pandas' crosstab function:
import pandas as pd

df = pd.DataFrame({
    "Symbol": 2 * ['AAPL', 'BAC'],
    "Action": 2 * ['Buy', 'Sell'],
    "Year": 2 * [2001, 2002],
})
pd.crosstab(df["Symbol"], df["Year"]).stack()
yielding:
Symbol  Year
AAPL    2001    2
        2002    0
BAC     2001    0
        2002    2
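If you want the result shaped exactly like the Action frame in the question, a small follow-up sketch (the rename just labels the counts column):
counts = pd.crosstab(df["Symbol"], df["Year"]).stack().rename("Action").to_frame()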
I have two DataFrames: one with the columns "Name", "Year" and "Type", and the other one with different parameters. There are 4 different types, and each one has its own specific parameters. Now I need to merge them together.
My approach is to use an if-function to find out the type. For example, in row two of df3 I have type 'a'. The parameters for type 'a' are in row 3 of df4. I tried to connect them with the following code:
s1 = df3.ix[[2]]
s2 = df4.ix[[3]]
result = pd.concat([s1, s2], axis=1)
My problem is now that the parameters end up in a separate row and are not added to row 2. Is there a way to merge them together in one row? Thanks for your answers!
If df3 has a Type column and df4 has a type column, then the two DataFrames can be merged with
pd.merge(df3, df4, left_on='Type', right_on='type')
This is by default an inner join.
In [13]: df3
Out[13]:
  Name  Year   Type
1    A  2012   boat
2    B  2013    car
3    C  2011  truck
4    D  2013   boat

In [14]: df4
Out[14]:
    type  Parameter1  Parameter2  Parameter3
0   boat           2           8           7
1    car           1           9           3
2  truck           5           4           2

In [15]: pd.merge(df3, df4, left_on='Type', right_on='type')
Out[15]:
  Name  Year   Type   type  Parameter1  Parameter2  Parameter3
0    A  2012   boat   boat           2           8           7
1    D  2013   boat   boat           2           8           7
2    B  2013    car    car           1           9           3
3    C  2011  truck  truck           5           4           2
Note that if the column names matched exactly, then
pd.merge(df3, df4)
would merge on column names shared in common by default.
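If you would rather not carry both Type and type columns in the result, one option (a sketch, not part of the original answer) is to align the key names first:
# renaming df4's key lets merge join on the shared 'Type' column,
# so the result contains a single key column
merged = pd.merge(df3, df4.rename(columns={'type': 'Type'}))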
I am using a pandas DataFrame, and I would like to pull one column's values down by one relative to the index, so the DataFrame's length will be one less. Just like this in my example image:
The new DataFrame should contain ids 2-5, but of course be re-indexed after the manipulation to 1-4. There are more columns than just name and place.
How can I quickly manipulate the DataFrame like this?
Thank you very much.
You can shift the name column and then take a slice using iloc:
In [55]:
df = pd.DataFrame({'id':np.arange(1,6), 'name':['john', 'bla', 'tim','walter','john'], 'place':['new york','miami','paris','rome','sydney']})
df
Out[55]:
   id    name     place
0   1    john  new york
1   2     bla     miami
2   3     tim     paris
3   4  walter      rome
4   5    john    sydney
In [56]:
df['name'] = df['name'].shift(-1)
df = df.iloc[:-1]
df
Out[56]:
   id    name     place
0   1     bla  new york
1   2     tim     miami
2   3  walter     paris
3   4    john      rome
If your 'id' column is your index the above still works:
In [62]:
df = pd.DataFrame({'name':['john', 'bla', 'tim','walter','john'], 'place':['new york','miami','paris','rome','sydney']},index=np.arange(1,6))
df.index.name = 'id'
df
Out[62]:
      name     place
id
1     john  new york
2      bla     miami
3      tim     paris
4   walter      rome
5     john    sydney
In [63]:
df['name'] = df['name'].shift(-1)
df = df.iloc[:-1]
df
Out[63]:
      name     place
id
1      bla  new york
2      tim     miami
3   walter     paris
4     john      rome