How to create a dataframe from two dataframes using pandas - python

I have two dataframes, df1 and df2:

df1 =
   a        b
   1  such as
   2  who I'm

df2 =
   a  keyword
   1  such
   1  as
   2  who
   2  I'm

Based on these two dataframes, I want to create the following dataframe:

result =
   a        keyword
   such as  such
   such as  as
   who I'm  who
   who I'm  I'm

IIUC, just perform a replacement with map:
df2['a'] = df2['a'].map(df1.set_index('a')['b'])
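For completeness, a minimal runnable sketch of that mapping approach, reconstructing the example frames shown above (the printed result is what I'd expect, not taken from the original post):

import pandas as pd

df1 = pd.DataFrame({'a': [1, 2], 'b': ['such as', "who I'm"]})
df2 = pd.DataFrame({'a': [1, 1, 2, 2], 'keyword': ['such', 'as', 'who', "I'm"]})

# Build a lookup Series (index: a, values: b) and map df2['a'] through it
df2['a'] = df2['a'].map(df1.set_index('a')['b'])
print(df2)
#          a keyword
# 0  such as    such
# 1  such as      as
# 2  who I'm     who
# 3  who I'm     I'm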

Related

Stick the columns based on one column, keeping the ids

I have a DataFrame with 100 columns (however, I provide only three of them here) and I want to build a new DataFrame with two columns. Here is the DataFrame:
import pandas as pd

df = pd.DataFrame()
df['id'] = [1, 2, 3]
df['c1'] = [1, 5, 1]
df['c2'] = [-1, 6, 5]
df
I want to stick the values of all columns for each id and put them in one column. For example, for id=1 I want to stick its values 1 and -1 into one column. Here is the DataFrame that I want:

   id  c
0   1  1
1   1 -1
2   2  5
3   2  6
4   3  1
5   3  5

Note: df.melt does not solve my question, since I want to keep the ids as well.
Note 2: I already tried stack and reset_index, and it does not help:
df = df.stack().reset_index()
df.columns = ['id','c']
df
You could first set_index with "id"; then stack + reset_index:
out = (df.set_index('id')
         .stack()
         .droplevel(1)
         .reset_index(name='c'))
Output:

   id  c
0   1  1
1   1 -1
2   2  5
3   2  6
4   3  1
5   3  5
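As an aside on the melt note in the question: melt can keep the ids if you pass id_vars. A sketch of that alternative, assuming the row order within each id only needs to follow the original column order:

out = (df.melt(id_vars='id', value_name='c')   # keeps the id column
         .drop(columns='variable')             # discard the c1/c2 labels
         .sort_values('id', kind='stable')     # group rows by id, preserving column order
         .reset_index(drop=True))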

Aggregate DataFrame down to one row using different functions

I have a dataframe with multiple rows, which I'd like to aggregate down, per-column, to a 1-row dataframe, using a different function per-column.
Take the following dataframe, as an example:
import pandas as pd

df = pd.DataFrame([[1, 2], [2, 3]], columns=['A', 'B'])
print(df)
Result:

   A  B
0  1  2
1  2  3
I'd like to aggregate the first column using sum and the second using mean. There is a convenient DataFrame.agg() method which can take a mapping of column names to aggregation functions, like so:

aggfns = {
    'A': 'sum',
    'B': 'mean',
}
print(df.agg(aggfns))
However, this results in a Series rather than a DataFrame:
A    3.0
B    2.5
dtype: float64
Among other problems, a Series has a single dtype, so the per-column dtypes are lost. A Series is well-suited to representing a single dataframe column, but not a single dataframe row.
I managed to come up with this tortured incantation:
df['dummy'] = 0
dfa = df.groupby('dummy').agg(aggfns).reset_index(drop=True)
print(dfa)
This creates a dummy column which is 0 everywhere, groups on it, does the aggregation and drops it, which produces the desired result:
   A    B
0  3  2.5
Certainly there is something better?
Using Series.to_frame + DataFrame.T (short for transpose):
dfa = df.agg(aggfns).to_frame().T
Output:

>>> dfa
     A    B
0  3.0  2.5
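Note the 3.0: this route goes through a Series, so the per-column dtypes are lost, which is the very problem noted in the question. A hedged fix, assuming you know the intended dtypes:

# restore per-column dtypes after the Series round-trip
dfa = df.agg(aggfns).to_frame().T.astype({'A': 'int64', 'B': 'float64'})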
You could add the dummy column inline with assign, instead of permanently creating a new column on df:
dfa = df.assign(d=0).groupby('d').agg(aggfns).reset_index(drop=True)
Output:

>>> dfa
   A    B
0  3  2.5
You can explicitly create a new DataFrame:

>>> pd.DataFrame({'A': [df.A.sum()], 'B': [df.B.mean()]})
   A    B
0  3  2.5
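The explicit construction generalizes to any aggfns mapping with a dict comprehension; a minimal sketch that also preserves per-column dtypes:

# one single-element list per column, aggregated with that column's function
dfa = pd.DataFrame({col: [df[col].agg(fn)] for col, fn in aggfns.items()})
print(dfa)
#    A    B
# 0  3  2.5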

Concatenate multiple pandas groupby outputs

I would like to make multiple .groupby() operations on different subsets of a given dataset and bind them all together. For example:
import pandas as pd

df = pd.DataFrame({"ID": [1, 1, 2, 2, 2, 3],
                   "Subset": [1, 1, 2, 2, 2, 3],
                   "Value": [5, 7, 4, 1, 7, 8]})
print(df)

   ID  Subset  Value
0   1       1      5
1   1       1      7
2   2       2      4
3   2       2      1
4   2       2      7
5   3       3      8
I would then like to concatenate the following objects and store the result in a pandas data frame:
gr1 = df[df["Subset"] == 1].groupby(["ID","Subset"]).mean()
gr2 = df[df["Subset"] == 2].groupby(["ID","Subset"]).mean()
# Why do gr1 and gr2 have column names in different rows?
I realize that df.groupby(["ID","Subset"]).mean() would give me the concatenated object I'm looking for. Just bear with me, this is a reduced example of what I'm actually dealing with.
I think the solution could be to transform gr1 and gr2 to pandas data frames and then concatenate them like I normally would.
In essence, my questions are the following:
How do I convert a groupby result to a data frame object?
In case this can be done without transforming the series to data frames, how do you bind two groupby results together and then transform that to a pandas data frame?
PS: I come from an R background, so to me it's odd to group a data frame by something and have the output returned as a different type of object (a Series or a MultiIndex data frame). This is part of my question too: why does .groupby return a Series? What kind of Series is this? How can a Series have multiple columns and an index?
The result in your example is not a Series but a DataFrame with a MultiIndex: ID and Subset have become index levels, which is why the column names print on different rows. To return a dataframe directly when applying a single aggregation function, you can use the following. Note the inclusion of as_index=False.
>>> gr1 = df[df["Subset"] == 1].groupby(["ID","Subset"], as_index=False).mean()
>>> gr1
   ID  Subset  Value
0   1       1      6
This, however, won't work if you wish to aggregate with multiple functions at once. If you wish to avoid using df.groupby(["ID","Subset"]).mean() on the whole frame, then you can use the following for your example.
>>> gr1 = df[df["Subset"] == 1].groupby(["ID","Subset"], as_index=False).mean()
>>> gr2 = df[df["Subset"] == 2].groupby(["ID","Subset"], as_index=False).mean()
>>> pd.concat([gr1, gr2]).reset_index(drop=True)
   ID  Subset  Value
0   1       1      6
1   2       2      4
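If there are many subsets, the same pattern scales with a list comprehension; a sketch, where ignore_index=True plays the role of reset_index(drop=True):

subsets = [1, 2]
out = pd.concat(
    [df[df["Subset"] == s].groupby(["ID", "Subset"], as_index=False).mean()
     for s in subsets],
    ignore_index=True,
)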
If you're only concerned with dealing with a specific subset of rows, the following could be applicable, since it removes the necessity to concatenate results.
>>> values = [1, 2]
>>> df[df['Subset'].isin(values)].groupby(["ID", "Subset"], as_index=False).mean()
   ID  Subset  Value
0   1       1      6
1   2       2      4
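To answer the first question directly: if you already have the MultiIndexed results, reset_index() converts the index levels back into ordinary columns, after which the frames concatenate cleanly. A sketch with the example data:

gr1 = df[df["Subset"] == 1].groupby(["ID", "Subset"]).mean()
gr2 = df[df["Subset"] == 2].groupby(["ID", "Subset"]).mean()

# reset_index() turns the ID/Subset index levels back into regular columns
out = pd.concat([gr1, gr2]).reset_index()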

Python Pandas DataFrame: Rename all Column Names via Map [duplicate]

I would like to go through all the columns in a dataframe and rename (or map) columns if they contain certain strings.
For example: rename all columns that contain 'agriculture' with the string 'agri'
I'm thinking about using rename and str.contains but can't figure out how to combine them to achieve what I want.
You can use str.replace to process the columns first, and then re-assign the new columns back to the DataFrame:
import pandas as pd

df = pd.DataFrame({'A_agriculture': [1, 2, 3],
                   'B_agriculture': [11, 22, 33],
                   'C': [4, 5, 6]})
df.columns = df.columns.str.replace('agriculture', 'agri')
print(df)
Output:

   A_agri  B_agri  C
0       1      11  4
1       2      22  5
2       3      33  6
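If you specifically want to go through rename, as the question suggests, a hedged sketch combining it with a dict comprehension over the matching columns (equivalent result):

# build a {old: new} mapping only for columns containing the substring
df = df.rename(columns={c: c.replace('agriculture', 'agri')
                        for c in df.columns if 'agriculture' in c})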

Python Pandas: Convert ".value_counts" output to dataframe

Hi, I want to get the counts of unique values in a dataframe column. value_counts implements this, but I want to use its output somewhere else. How can I convert the .value_counts output to a pandas dataframe? Here is an example code:
import pandas as pd
df = pd.DataFrame({'a':[1, 1, 2, 2, 2]})
value_counts = df['a'].value_counts(dropna=True, sort=True)
print(value_counts)
print(type(value_counts))
output is:

2    3
1    2
Name: a, dtype: int64
<class 'pandas.core.series.Series'>

What I need is a dataframe like this:

unique_values  counts
2              3
1              2
Thank you.
Use rename_axis to name the index, then reset_index to turn it into a column:

df = df['a'].value_counts().rename_axis('unique_values').reset_index(name='counts')
print(df)

   unique_values  counts
0              2       3
1              1       2
Or, if you need a one-column DataFrame, use Series.to_frame:

df = df['a'].value_counts().rename_axis('unique_values').to_frame('counts')
print(df)

               counts
unique_values
2                   3
1                   2
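Worth noting: in recent pandas versions (2.0 and later, as far as I know), value_counts() already names the resulting Series count and names its index after the column, so a plain reset_index gets you most of the way; relabel only if you want different names:

out = df['a'].value_counts().reset_index()   # columns: ['a', 'count']
out.columns = ['unique_values', 'counts']    # optional relabel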
I just ran into the same problem, so I'll provide my thoughts here.

Warning
When you deal with the data structures of Pandas, you have to be aware of the return type.

Another solution here
Like @jezrael mentioned before, Pandas does provide the API pd.Series.to_frame.

Step 1
You can also wrap the pd.Series into a pd.DataFrame by just doing:

df_val_counts = pd.DataFrame(value_counts)  # wrap pd.Series into pd.DataFrame

Then you have a pd.DataFrame with the column name 'a', and your first column becomes the index:
Input:  print(df_val_counts.index.values)
Output: [2 1]

Input:  print(df_val_counts.columns)
Output: Index(['a'], dtype='object')
Step 2
What now? If you want new column names here, as a pd.DataFrame, you can simply reset the index with reset_index(), and then change the column names by assigning a list to df.columns:

df_val_counts = df_val_counts.reset_index()
df_val_counts.columns = ['unique_values', 'counts']
Then you get what you need:

Output:

   unique_values  counts
0              2       3
1              1       2
Full answer here:

import pandas as pd

df = pd.DataFrame({'a': [1, 1, 2, 2, 2]})
value_counts = df['a'].value_counts(dropna=True, sort=True)

# solution here
df_val_counts = pd.DataFrame(value_counts)
df_value_counts_reset = df_val_counts.reset_index()
df_value_counts_reset.columns = ['unique_values', 'counts']  # change column names
I'll throw in my hat as well: essentially the same as @wy-hsu's solution, but in function format:
def value_counts_df(df, col):
    """
    Returns value_counts() as a DataFrame.

    Parameters
    ----------
    df : pandas DataFrame
        DataFrame on which to run value_counts(); must have column `col`.
    col : str
        Name of column in `df` for which to generate counts.

    Returns
    -------
    pandas DataFrame
        The returned dataframe will have a single column named "count" which
        contains the value_counts() for each unique value of df[col]. The
        index name of this dataframe is `col`.

    Example
    -------
    >>> value_counts_df(pd.DataFrame({'a': [1, 1, 2, 2, 2]}), 'a')
       count
    a
    2      3
    1      2
    """
    df = pd.DataFrame(df[col].value_counts())
    df.index.name = col
    df.columns = ['count']
    return df
For a per-group value count, the same pattern works on a groupby result: wrap it in a DataFrame, rename the counts column, and flatten the index with reset_index:

pd.DataFrame(
    df.groupby(['groupby_col'])['column_to_perform_value_count'].value_counts()
).rename(
    columns={'old_column_name': 'new_column_name'}
).reset_index()
Example of selecting a subset of columns from a dataframe, grouping, applying value_counts per group, naming the value_counts column Count, and displaying the first n groups:

# Select 5 columns (A..E) from a dataframe (data_df).
# Sort on A, B. Group by B. Display the first 3 groups.
df = data_df[['A', 'B', 'C', 'D', 'E']].sort_values(['A', 'B'])
g = df.groupby(['B'])

for n, (k, gg) in enumerate(list(g)[:3]):  # display first 3 groups
    display(k, gg.value_counts().to_frame('Count').reset_index())
