I have a big DataFrame with millions of rows and many columns, and I need to group by one column and count the values of several other columns.
I'm looking for an efficient solution: as few lines of code as possible, and fast to run.
Below is a simplified example of my problem.
This is my input CSV:
UID,CONTINENT,AGE_GROUP,APPROVAL_STATUS
user1,ASIA,26-30,YES
user10,ASIA,26-30,NO
user11,ASIA,36-40,YES
user12,EUROPE,21-25,NO
user13,AMERICA,31-35,not_confirmed
user14,ASIA,26-30,YES
user15,EUROPE,41-45,not_confirmed
user16,AMERICA,21-25,NO
user17,ASIA,26-30,YES
user18,EUROPE,41-45,NO
user19,AMERICA,31-35,YES
user2,AMERICA,31-35,NO
user20,ASIA,46-50,NO
user21,EUROPE,18-20,not_confirmed
user22,ASIA,26-30,not_confirmed
user23,ASIA,36-40,YES
user24,AMERICA,26-30,YES
user25,EUROPE,36-40,NO
user26,EUROPE,Above 50,NO
user27,ASIA,46-50,YES
user28,AMERICA,31-35,NO
user29,AMERICA,Above 50,not_confirmed
user3,ASIA,36-40,YES
user30,EUROPE,41-45,YES
user4,EUROPE,41-45,NO
user5,ASIA,26-30,not_confirmed
user6,ASIA,46-50,not_confirmed
user7,ASIA,26-30,YES
user8,AMERICA,18-20,YES
user9,EUROPE,31-35,NO
I expect the output below. It should show:
CONTINENT as the main groupby column.
The unique values of AGE_GROUP and APPROVAL_STATUS as separate columns, each holding the count of that value for each CONTINENT.
A USER_COUNT column with the number of users per CONTINENT.
Output:
CONTINENT,18-20,21-25,26-30,31-35,36-40,41-45,46-50,Above 50,NO,YES,not_confirmed,USER_COUNT
AMERICA,1,1,1,4,0,0,0,1,3,3,2,8
ASIA,0,0,7,0,3,0,3,0,2,8,3,13
EUROPE,1,1,0,1,1,4,0,1,6,1,2,9
Below is how I'm currently achieving it, but this is not an efficient way.
I've also seen that this could be done with a pandas pivot table, but I'm not sure how.
import pandas as pd
in_file = "/Users/user1/groupby.csv"
out_file = "/Users/user1/groupby1.csv"
df= pd.read_csv(in_file)
print(df)
df1 = df.groupby(['CONTINENT', 'AGE_GROUP']).size().unstack(fill_value=0).reset_index()
df1 = df1.sort_values(["CONTINENT"], axis=0, ascending=True)
print(df1)
df2 = df.groupby(['CONTINENT', 'APPROVAL_STATUS']).size().unstack(fill_value=0).reset_index()
df2 = df2.sort_values(["CONTINENT"], axis=0, ascending=True)
print(df2)
df3 = df.groupby("CONTINENT").count().reset_index()
df3 = df3[df3.columns[0:2]]
df3.columns = ["CONTINENT", "USER_COUNT"]
df3 = df3.sort_values(["CONTINENT"], axis=0, ascending=True)
df3.reset_index(drop=True, inplace=True)
# df3.to_csv(out_file, index=False)
print(df3)
df2.drop('CONTINENT', axis=1, inplace=True)
df3.drop('CONTINENT', axis=1, inplace=True)
df_final = pd.concat([df1, df2, df3], axis=1)
print(df_final)
df_final.to_csv(out_file, index=False)
Easy solution
Use pd.crosstab to compute the frequency tables, then concatenate the tables along the columns axis:
s1 = pd.crosstab(df['CONTINENT'], df['AGE_GROUP'])
s2 = pd.crosstab(df['CONTINENT'], df['APPROVAL_STATUS'])
pd.concat([s1, s2, s2.sum(1).rename('USER_COUNT')], axis=1)
           18-20  21-25  26-30  31-35  36-40  41-45  46-50  Above 50  NO  YES  not_confirmed  USER_COUNT
CONTINENT
AMERICA        1      1      1      4      0      0      0         1   3    3              2           8
ASIA           0      0      7      0      3      0      3         0   2    8              3          13
EUROPE         1      1      0      1      1      4      0         1   6    1              2           9
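If more columns need counting, one variant that avoids a separate crosstab per column is to melt the value columns into long form first. This is only a sketch on a five-row sample of the question's data. The names long_df, counts and result are made up for illustration, and it assumes that no value occurs in more than one of the melted columns (otherwise their counts would be added together).

```python
import pandas as pd

# Five-row sample of the question's CSV, built inline so the sketch runs on its own
df = pd.DataFrame({
    "UID": ["user1", "user10", "user12", "user13", "user16"],
    "CONTINENT": ["ASIA", "ASIA", "EUROPE", "AMERICA", "AMERICA"],
    "AGE_GROUP": ["26-30", "26-30", "21-25", "31-35", "21-25"],
    "APPROVAL_STATUS": ["YES", "NO", "NO", "not_confirmed", "NO"],
})

# Stack the columns to be counted into a single "value" column
long_df = df.melt(id_vars="CONTINENT", value_vars=["AGE_GROUP", "APPROVAL_STATUS"])

# One crosstab over the stacked values gives one count column per unique value
counts = pd.crosstab(long_df["CONTINENT"], long_df["value"])

# Users per continent, aligned on the CONTINENT index
counts["USER_COUNT"] = df.groupby("CONTINENT").size()

result = counts.reset_index()
print(result)
```

Adding another column to count then only means adding its name to value_vars. Whether this is faster than separate crosstabs on millions of rows would need to be measured on the real data.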
I have two dataframes. The second dataframe contains the values to be updated in the first dataframe.
df1:
data=[[1,"potential"],[2,"lost"],[3,"at risk"],[4,"promising"]]
df=pd.DataFrame(data,columns=['id','class'])
id class
1 potential
2 lost
3 at risk
4 promising
df2:
data2=[[2,"new"],[4,"loyal"]]
df2=pd.DataFrame(data2,columns=['id','class'])
id class
2 new
4 loyal
expected output:
data3=[[1,"potential"],[2,"new"],[3,"at risk"],[4,"loyal"]]
df3=pd.DataFrame(data3,columns=['id','class'])
id class
1 potential
2 new
3 at risk
4 loyal
The code below seems to be working, but I believe there is a more effective solution.
final=df.append([df2])
final = final.drop_duplicates(subset='id', keep="last")
addition:
Is there a way for me to write the previous value in a new column?
like this:
id  class      prev_class  modified date
1   potential  nan         nan
2   new        lost        2022.xx.xx
3   at risk    nan         nan
4   loyal      promising   2022.xx.xx
Your solution is good. Here is an alternative with concat (DataFrame.append was deprecated in pandas 1.4 and removed in 2.0), plus DataFrame.sort_values:
df = (pd.concat([df, df2])
        .drop_duplicates(subset='id', keep="last")
        .sort_values('id', ignore_index=True))
print (df)
id class
0 1 potential
1 2 new
2 3 at risk
3 4 loyal
The solution changes if you also need the previous class values and today's date:
df3 = pd.concat([df, df2])
mask = df3['id'].duplicated(keep='last')
df31 = df3[mask]
df32 = df3[~mask]
df3 = (df32.merge(df31, on='id', how='left', suffixes=('','_prev'))
           .sort_values('id', ignore_index=True))
df3.loc[df3['class_prev'].notna(), 'modified date'] = pd.to_datetime('now').normalize()
print (df3)
   id      class class_prev modified date
0   1  potential        NaN           NaT
1   2        new       lost    2022-03-31
2   3    at risk        NaN           NaT
3   4      loyal  promising    2022-03-31
We can use DataFrame.update
df = df.set_index('id')
df.update(df2.set_index('id'))
df = df.reset_index()
Result
print(df)
id class
0 1 potential
1 2 new
2 3 at risk
3 4 loyal
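The update approach can probably be extended to the follow-up request (previous value and modification date) by taking a snapshot of the column before updating. The following is a minimal sketch under that assumption. The names indexed, previous, changed and result are illustrative, and a row only counts as modified if its class actually changes.

```python
import pandas as pd

df = pd.DataFrame({"id": [1, 2, 3, 4],
                   "class": ["potential", "lost", "at risk", "promising"]})
df2 = pd.DataFrame({"id": [2, 4], "class": ["new", "loyal"]})

indexed = df.set_index("id")
previous = indexed["class"].copy()  # snapshot of the values before the update

# update() modifies `indexed` in place and aligns the two frames on the index
indexed.update(df2.set_index("id"))

# Rows whose class differs from the snapshot are the modified ones
changed = indexed["class"].ne(previous)
indexed["prev_class"] = previous.where(changed)
indexed.loc[changed, "modified date"] = pd.Timestamp.now().normalize()

result = indexed.reset_index()
print(result)
```

Note that update() skips NaN values in the second frame and ignores ids that do not exist in the first one, so this does not add new rows.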
You can operate along your ids by setting them as your index, and use combine_first to perform this operation. Then assigning your prev_class is extremely straightforward because you've properly used the Index!
df = df.set_index('id')
df2 = df2.set_index('id')
out = (
    df2.combine_first(df)
    .assign(
        # old value from the original frame, only for the ids present in df2
        prev_class=df["class"].reindex(df2.index),
        modified=lambda d:
            d["prev_class"].where(
                d["prev_class"].isna(), pd.Timestamp.now()
            )
    )
)
print(out)
        class prev_class                    modified
id
1   potential        NaN                         NaN
2         new       lost  2022-03-31 06:51:20.832668
3     at risk        NaN                         NaN
4       loyal  promising  2022-03-31 06:51:20.832668
I have two files opened with pandas. Where the first column of the two files shares a common part (the colored letters), I want to paste the second column of the second file into the matching row of the first file. Where there is no match, I want to write 'NaN'. Is there a way to do this?
File1
0 1
0 JCW 574
1 MBM 4212
2 COP 7424
3 KVI 4242
4 ECX 424
File2
0 1
0 G=COP d4ssd5vwe2e2
1 G=DDD dfd23e1rv515j5o
2 G=FEW cwdsuve615cdldl
3 G=JCW io55i5i55j8rrrg5f3r
4 G=RRR c84sdw5e5vwldk455
5 G=ECX j4ut84mnh54t65y
File1#
0 1 2
0 JCW 574 io55i5i55j8rrrg5f3r
1 MBM 4212 NaN
2 COP 7424 d4ssd5vwe2e2
3 KVI 4242 NaN
4 ECX 424 j4ut84mnh54t65y
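First use Series.str.extract to build a new Series holding the df1[0] value matched in each row of df2[0], then do a left join with DataFrame.merge:
import numpy as np
import pandas as pd
# header=None is assumed here so that the columns are labelled 0 and 1, as in the question
df1 = pd.read_csv(file1, header=None)
df2 = pd.read_csv(file2, header=None)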
s = df2[0].str.extract(f'({"|".join(df1[0])})', expand=False)
df = df1.merge(df2[[1]], how='left', left_on=0, right_on=s)
df.columns = np.arange(len(df.columns))
print (df)
0 1 2
0 JCW 574 io55i5i55j8rrrg5f3r
1 MBM 4212 NaN
2 COP 7424 d4ssd5vwe2e2
3 KVI 4242 NaN
4 ECX 424 j4ut84mnh54t65y
Or, if you need to match on the last 3 characters of df1[0], use:
s = df2[0].str.extract(f'({"|".join(df1[0].str[-3:])})', expand=False)
df = df1.merge(df2[[1]], how='left', left_on=0, right_on=s)
df.columns = np.arange(len(df.columns))
print (df)
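A variant of the same idea looks the values up with Series.map instead of merge. This is a hedged sketch, not a drop-in replacement: the frames are rebuilt inline from the question's tables, the names pattern, keys, lookup and result are illustrative, and re.escape is added on the assumption that the keys could contain regex metacharacters.

```python
import re

import pandas as pd

df1 = pd.DataFrame({0: ["JCW", "MBM", "COP", "KVI", "ECX"],
                    1: [574, 4212, 7424, 4242, 424]})
df2 = pd.DataFrame({0: ["G=COP", "G=DDD", "G=FEW", "G=JCW", "G=RRR", "G=ECX"],
                    1: ["d4ssd5vwe2e2", "dfd23e1rv515j5o", "cwdsuve615cdldl",
                        "io55i5i55j8rrrg5f3r", "c84sdw5e5vwldk455", "j4ut84mnh54t65y"]})

# One capture group listing every key of df1, escaped so it is matched literally
pattern = "(" + "|".join(re.escape(key) for key in df1[0]) + ")"
keys = df2[0].str.extract(pattern, expand=False)

# Lookup table: matched key -> value from the second column of df2.
# Rows without a match are dropped, and only the first hit per key is kept.
lookup = (
    df2.assign(key=keys)
    .dropna(subset=["key"])
    .drop_duplicates("key")
    .set_index("key")[1]
)

result = df1.copy()
result[2] = result[0].map(lookup)  # keys without a match become NaN
print(result)
```

Keeping only the first hit per key is a design choice made for this sketch. If a key can legitimately match several rows of df2, the merge version above keeps all of them.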
Have a look at the concat-function of pandas using join='outer' (https://pandas.pydata.org/pandas-docs/stable/user_guide/merging.html). There is also this question and the answer to it that can help you.
It involves reindexing each of your data frames to use the column that is now called "0" as the index, and then joining two data frames based on their indices.
Also, can I suggest that you do not paste an image of your dataframes, but upload the data in a form that other people can use to test their suggestions.
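The index-based approach described above could look roughly like this. It is a sketch only: the frames are rebuilt from the question's tables, the names left, right_key, right and result are illustrative, and it assumes every key in the second file carries the constant prefix "G=" that can simply be stripped.

```python
import pandas as pd

df1 = pd.DataFrame({0: ["JCW", "MBM", "COP", "KVI", "ECX"],
                    1: [574, 4212, 7424, 4242, 424]})
df2 = pd.DataFrame({0: ["G=COP", "G=DDD", "G=FEW", "G=JCW", "G=RRR", "G=ECX"],
                    1: ["d4ssd5vwe2e2", "dfd23e1rv515j5o", "cwdsuve615cdldl",
                        "io55i5i55j8rrrg5f3r", "c84sdw5e5vwldk455", "j4ut84mnh54t65y"]})

# Use column 0 as the index of the first frame
left = df1.set_index(0)

# Strip the assumed "G=" prefix so both indexes hold comparable keys
right_key = df2[0].str.replace("G=", "", regex=False)
right = df2.set_index(right_key)[[1]].rename(columns={1: 2})

# Left join on the indexes: rows of df1 without a partner get NaN
result = left.join(right, how="left").reset_index()
print(result)
```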
I'm trying to merge two pandas dataframes but I can't figure out how to get the result I need. These are the example versions of dataframes I'm looking at:
df1 = pd.DataFrame([["09/10/2019",None],["10/10/2019",None], ["11/10/2019",6],
["12/10/2019",5], ["13/10/2019",3], ["14/10/2019",3],
["15/10/2019",5],
["16/10/2019",None]], columns = ['Date', 'A'])
df2 = pd.DataFrame([["10/10/2019",3], ["11/10/2019",5], ["12/10/2019",6],
["13/10/2019",1], ["14/10/2019",2], ["15/10/2019",4]],
columns = ['Date', 'A'])
I have checked the Pandas merging 101 but still can't find the way to do it correctly. Essentially what I need using the same graphics as in the guide is this:
i.e. I want to keep the data from df1 that falls outside the shared keys section, but within shared area I want df2 data from column 'A' to overwrite data from df1. I'm not even sure that merge is the right tool to use.
I've tried using df1 = pd.merge(df1, df2, how='right', on='Date') with different options, but in most cases it creates two separate columns - A_x and A_y in the output.
This is what I want to get as the end result:
Date A
0 09/10/2019 NaN
1 10/10/2019 3.0
2 11/10/2019 5.0
3 12/10/2019 6.0
4 13/10/2019 1.0
5 14/10/2019 2.0
6 15/10/2019 4.0
7 16/10/2019 NaN
Thanks in advance!
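Here is a way using combine_first:
df2.set_index('Date').combine_first(df1.set_index('Date')).reset_index()
Or reindex_like (this keeps only df2's values on df1's index, so it matches combine_first only while df1 has no values of its own outside the shared dates, as in this example):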
df2.set_index('Date').reindex_like(df1.set_index('Date')).reset_index()
Date A
0 09/10/2019 NaN
1 10/10/2019 3.0
2 11/10/2019 5.0
3 12/10/2019 6.0
4 13/10/2019 1.0
5 14/10/2019 2.0
6 15/10/2019 4.0
7 16/10/2019 NaN
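The two calls give the same result on the question's data, but they differ as soon as df1 holds a real value on a date that df2 lacks. The sketch below uses a shortened, modified version of the frames to show this. The value 7.0 on 09/10/2019 is an assumption added purely for illustration and is not in the question.

```python
import pandas as pd

# Shortened frames; the 7.0 on 09/10/2019 is a made-up value for illustration
df1 = pd.DataFrame({"Date": ["09/10/2019", "10/10/2019", "11/10/2019"],
                    "A": [7.0, None, 6.0]})
df2 = pd.DataFrame({"Date": ["10/10/2019", "11/10/2019"], "A": [3, 5]})

left = df1.set_index("Date")
right = df2.set_index("Date")

# df2 wins where it has a value, df1 fills the remaining dates
combined = right.combine_first(left).reset_index()

# Only df2's values survive; dates missing from df2 become NaN
reindexed = right.reindex_like(left).reset_index()

print(combined)
print(reindexed)
```

combine_first returns the union of both indexes in sorted order. Because the dates here are strings in dd/mm/yyyy form, that order is only chronological within a single month, so converting the column with pd.to_datetime first may be worth considering.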
Hi, I have two dataframes. The first one was created by grouping another df by id (which is now the index) and then sorting by the 'due' column.
df1:
paid due
id
3 13.000000 5.000000
2 437.000000 5.000000
5 90.000000 5.000000
1 60.000000 5.000000
4 675.000000 5.000000
The other one is a normal dataframe which has 3 columns: 'id' 'name' and 'country'.
df2:
id name country
1 'AB' 'DE'
2 'CD' 'DE'
3 'EF' 'NL'
4 'HAH' 'SG'
5 'NOP' 'NOR'
So what I was trying to do is to add the 'name' column to the 1st dataframe based on the id number (which is index in first df and column in second one).
So I thought this code would work:
pd.merge(df1, df2['name'], left_index=True, right_on='id')
But I get this error:
ValueError: can not merge DataFrame with instance of type <class 'pandas.core.series.Series'>
You can use rename to map the index through a Series, which works like a dict here:
df1['name'] = df1.rename(index=df2.set_index('id')['name']).index
print (df1)
paid due name
id
3 13.0 5.0 'EF'
2 437.0 5.0 'CD'
5 90.0 5.0 'NOP'
1 60.0 5.0 'AB'
4 675.0 5.0 'HAH'
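Two closely related alternatives are Index.map and DataFrame.join, both of which look the names up through the id index. This is a small sketch with the question's frames rebuilt inline. The names names, mapped and joined are illustrative.

```python
import pandas as pd

df1 = pd.DataFrame({"paid": [13.0, 437.0, 90.0, 60.0, 675.0],
                    "due": [5.0] * 5},
                   index=pd.Index([3, 2, 5, 1, 4], name="id"))
df2 = pd.DataFrame({"id": [1, 2, 3, 4, 5],
                    "name": ["AB", "CD", "EF", "HAH", "NOP"],
                    "country": ["DE", "DE", "NL", "SG", "NOR"]})

# Series of names indexed by id, used as the lookup table
names = df2.set_index("id")["name"]

# Option 1: map every index label of df1 to its name
mapped = df1.assign(name=df1.index.map(names))

# Option 2: left join the named Series on the index
joined = df1.join(names)

print(mapped)
print(joined)
```

Both options keep the row order and the id index of df1, which the sorting by 'due' relies on.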
You might find that pd.concat is a better option here because it can accept a mix of dataframe and series: http://pandas.pydata.org/pandas-docs/stable/merging.html#concatenating-with-mixed-ndims.
Okay, so I figured out that I can't pass a single column (a Series) that way, but I can reduce df2 so that it contains only the columns I need:
df2=df2[['id', 'name']]
pd.merge(df1, df2, left_index=True, right_on='id')
And there is no error anymore.