Create a new dataframe from a for loop with value_counts() - python

I have a dataframe (df3) with 51 columns and managed to show the most common values in each feature with a for loop.
for col in df3.columns:
    print('-' * 40 + col + '-' * 40, end=' - ')
    display(df3[col].value_counts().head(10))
Now I'd like to create a new dataframe called df4 with the results from the loop. That is the 10 most frequent values from all columns of df3. How can I do that?

I get the values using
df4 = df3.apply(lambda col: col.value_counts().head(10).index)
Instead of a for-loop I use apply.
Because .value_counts() creates a Series whose index holds the original values, I take .index to get the values themselves.
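For illustration, a tiny sketch of what a single column gives (using made-up data, not the asker's df3): the counted values end up in the Series index, so .index returns the most frequent values.
import pandas as pd

s = pd.Series([1, 2, 3, 3, 4, 5, 6, 6, 6], name='A')

counts = s.value_counts()              # index = the values, data = their counts
print(counts.head(2))                  # 6 occurs 3 times, 3 occurs 2 times
print(counts.head(2).index.tolist())   # [6, 3]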
Minimal working example - because I have fewer values here, I use head(2)
import pandas as pd
data = {
    'A': [1, 2, 3, 3, 4, 5, 6, 6, 6],
    'B': [4, 5, 6, 4, 2, 3, 4, 8, 8],
    'C': [7, 8, 9, 7, 1, 1, 1, 2, 2],
}  # columns
df = pd.DataFrame(data)
df2 = df.apply(lambda col: col.value_counts().head(2).index)
print(df2)
Result
A B C
0 6 4 1
1 3 8 7
EDIT:
If a column has fewer than 10 distinct values, you can convert the index to a list, extend it with a list of 10 NaN, and then crop the result to the first 10 items:
(col.value_counts().head(10).index.tolist() + [np.NaN] * 10)[:10]
Minimal working example
import pandas as pd
import numpy as np
data = {
    'A': [1, 2, 3, 3, 4, 5, 6, 6, 6],
    'B': [4, 5, 6, 4, 2, 3, 4, 8, 8],
    'C': [7, 8, 9, 7, 1, 1, 1, 2, 2],
}  # columns
df = pd.DataFrame(data)
NAN10 = [np.NaN] * 10  # list of 10 NaN used for padding
df2 = df.apply(lambda col: (col.value_counts().head(10).index.tolist() + NAN10)[:10])
print(df2)
Result
A B C
0 6.0 4.0 1.0
1 3.0 8.0 7.0
2 5.0 6.0 2.0
3 4.0 5.0 9.0
4 2.0 3.0 8.0
5 1.0 2.0 NaN
6 NaN NaN NaN
7 NaN NaN NaN
8 NaN NaN NaN
9 NaN NaN NaN
You can also convert to a Series; it adds NaN in the missing places automatically, but the result only has as many rows as the longest column (rows that would contain only NaN are not created).
import pandas as pd
import numpy as np
data = {
    'A': [1, 2, 3, 3, 4, 5, 6, 6, 6],
    'B': [4, 5, 6, 4, 2, 3, 4, 8, 8],
    'C': [7, 8, 9, 7, 1, 1, 1, 2, 2],
}  # columns
df = pd.DataFrame(data)
df3 = df.apply(lambda col: pd.Series(col.value_counts().head(10).index))
print(df3)
Result
A B C
0 6 4 1.0
1 3 8 7.0
2 5 6 2.0
3 4 5 9.0
4 2 3 8.0
5 1 2 NaN
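Applied to the original question's df3, the same pattern gives df4 directly (a sketch, assuming df3 is already loaded with its 51 columns; rows beyond the longest column's value count are simply not created):
import pandas as pd

# Each column becomes a Series of its (up to) 10 most frequent values;
# shorter columns are padded with NaN automatically.
df4 = df3.apply(lambda col: pd.Series(col.value_counts().head(10).index))
print(df4)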

Related

Python Pandas: divide a Series by a Dataframe

I have a Series and a Dataframe that share the same index:
s = pd.Series([300, 300])
df = pd.DataFrame({
    'A': [10, 20],
    'B': [20, 30]
})
When I do s.div(df), I see:
     A    B    0    1
0  NaN  NaN  NaN  NaN
1  NaN  NaN  NaN  NaN
I expect:
A B
0 30 15
1 15 10
pandas.__version__: 1.3.4.
Use DataFrame.rdiv to divide from the right side:
df1 = df.rdiv(s, axis=0)
print (df1)
A B
0 30.0 15.0
1 15.0 10.0
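A short sketch of what goes wrong in the original call, for reference: s.div(df) aligns the Series index ([0, 1]) against the DataFrame columns (['A', 'B']), nothing matches, and every cell becomes NaN; rdiv with axis=0 aligns s against the row index instead.
import pandas as pd

s = pd.Series([300, 300])
df = pd.DataFrame({'A': [10, 20], 'B': [20, 30]})

print(s.div(df))            # columns A, B, 0, 1 - all NaN (aligned on the columns)
print(df.rdiv(s, axis=0))   # s / df aligned on the row index -> A: 30, 15  B: 15, 10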

Join/merge dataframes and preserve the row-order

I work in python and pandas.
Let's suppose that I have the following two dataframes df_1 and df_2 (INPUT):
# df1
A B C
0 2 8 6
1 5 2 5
2 3 4 9
3 5 1 1
# df2
A B C
0 2 7 NaN
1 5 1 NaN
2 3 3 NaN
3 5 0 NaN
I want to join/merge them to get a new dataframe which looks like this (EXPECTED OUTPUT):
A B C
0 2 7 NaN
1 5 1 1
2 3 3 NaN
3 5 0 NaN
So basically it is a right-merge/join but preserving the row order of the original right dataframe.
However, if I do this:
df_2 = df_1.merge(df_2[['A', 'B']], on=['A', 'B'], how='right')
then I get this:
A B C
0 5 1 1.0
1 2 7 NaN
2 3 3 NaN
3 5 0 NaN
So I get the right rows joined/merged but the output dataframe does not have the same row-order as the original right dataframe.
How can I do the join/merge and preserve the row-order too?
The code to create the original dataframes is the following:
import pandas as pd
import numpy as np
columns = ['A', 'B', 'C']
data_1 = [[2, 5, 3, 5], [8, 2, 4, 1], [6, 5, 9, 1]]
data_1 = np.array(data_1).T
df_1 = pd.DataFrame(data=data_1, columns=columns)
columns = ['A', 'B', 'C']
data_2 = [[2, 5, 3, 5], [7, 1, 3, 0], [np.nan, np.nan, np.nan, np.nan]]
data_2 = np.array(data_2).T
df_2 = pd.DataFrame(data=data_2, columns=columns)
I think that by using either .join() or .update() I could get what I want but to start with I am quite surprised that .merge() does not do this very simple thing too.
I think it is a bug.
Possible solution with left join:
df_2 = df_2.merge(df_1, on=['A', 'B'], how='left', suffixes=('_','')).drop('C_', axis=1)
print (df_2)
A B C
0 2.0 7.0 NaN
1 5.0 1.0 1.0
2 3.0 3.0 NaN
3 5.0 0.0 NaN
You can play with the index of both dataframes:
print(df)
# A B C
# 0 5 1 1.0
# 1 2 7 NaN
# 2 3 3 NaN
# 3 5 0 NaN
df = df.set_index('B')
df = df.reindex(index=df_2['B'])
df = df.reset_index()
df = df[['A', 'B', 'C']]
print(df)
# A B C
# 0 2 7.0 NaN
# 1 5 1.0 1.0
# 2 3 3.0 NaN
# 3 5 0.0 NaN
One quick way is:
df_2=df_2.set_index(['A','B'])
temp = df_1.set_index(['A','B'])
df_2.update(temp)
df_2.reset_index(inplace=True)
As I discussed with #jezrael above, and if I am not missing something: if you do not need both C columns from the original dataframes and only need the column C with the matching values, then .update() is the quickest way, since you do not have to drop the columns you do not need.
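For reference, a minimal end-to-end sketch of the .update() approach with the question's data (frames built directly rather than via np.array, so A and B stay integers):
import pandas as pd
import numpy as np

df_1 = pd.DataFrame({'A': [2, 5, 3, 5], 'B': [8, 2, 4, 1], 'C': [6, 5, 9, 1]})
df_2 = pd.DataFrame({'A': [2, 5, 3, 5], 'B': [7, 1, 3, 0], 'C': [np.nan] * 4})

# Index both frames on the join keys, copy the matching C values into df_2,
# then restore the original (order-preserving) integer index.
df_2 = df_2.set_index(['A', 'B'])
df_2.update(df_1.set_index(['A', 'B']))
df_2.reset_index(inplace=True)
print(df_2)
#    A  B    C
# 0  2  7  NaN
# 1  5  1  1.0
# 2  3  3  NaN
# 3  5  0  NaN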

Concatenating dataframes creates too many columns

I am reading in a number of csv files using a loop; all have 38 columns. I add them all to a list and then concatenate them into a dataframe. My issue is that despite all these csv files having 38 columns, my resultant dataframe somehow ends up with 105 columns.
How can I make the resultant dataframe have the correct 38 columns and stack all of the rows on top of each other?
import boto3
import pandas as pd
import io
s3 = boto3.resource('s3')
client = boto3.client('s3')
bucket = s3.Bucket('alpha-enforcement-data-engineering')
appended_data = []
for obj in bucket.objects.filter(Prefix='closed/closed_processed/year_201'):
    print(obj.key)
    df = pd.read_csv(f's3://alpha-enforcement-data-engineering/{obj.key}', low_memory=False)
    print(df.shape)
    appended_data.append(df)

df_closed = pd.concat(appended_data, axis=0, sort=False)
print(df_closed.shape)
TL;DR: check your column headers.
c = appended_data[0].columns
df_closed = pd.concat(
    [df.set_axis(c, axis=1, inplace=False) for df in appended_data], sort=False)
This happens because your column headers are different. Pandas will align your DataFrames on the headers when concatenating vertically, and will insert empty columns for DataFrames where that header is not present. Here's an illustrative example:
df = pd.DataFrame({'A': [1, 2, 3], 'B': [4, 5, 6]})
df2 = pd.DataFrame({'C': [7, 8, 9], 'D': [10, 11, 12]})
df
A B
0 1 4
1 2 5
2 3 6
df2
C D
0 7 10
1 8 11
2 9 12
pd.concat([df, df2], axis=0, sort=False)
A B C D
0 1.0 4.0 NaN NaN
1 2.0 5.0 NaN NaN
2 3.0 6.0 NaN NaN
0 NaN NaN 7.0 10.0
1 NaN NaN 8.0 11.0
2 NaN NaN 9.0 12.0
This creates 4 columns, whereas you wanted only two. Try:
df2.columns = df.columns
pd.concat([df, df2], axis=0, sort=False)
A B
0 1 4
1 2 5
2 3 6
0 7 10
1 8 11
2 9 12
Which works as expected.
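If you want to see which headers actually differ before forcing them to match, a quick diagnostic sketch (assuming appended_data is the list built in the loop above) could be:
# Compare each frame's columns against the first frame's columns.
base = set(appended_data[0].columns)
for i, frame in enumerate(appended_data[1:], start=1):
    extra = set(frame.columns) - base
    missing = base - set(frame.columns)
    if extra or missing:
        print(f'frame {i}: extra columns {sorted(extra)}, missing columns {sorted(missing)}')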

When the dataframe has duplicate columns, it seems that the fillna function cannot work correctly with a dict parameter

I find that after using pd.concat() to concatenate dataframes that share a column name, df.fillna() does not work correctly with the dict parameter specifying which value to use for each column.
I don't know why. Is something wrong with my understanding?
import pandas as pd
import numpy as np

a1 = pd.DataFrame({'a': [1, 2, 3]})
a2 = pd.DataFrame({'a': [1, 2, 3]})
b = pd.DataFrame({'b': [np.nan, 20, 30]})
c = pd.DataFrame({'c': [40, np.nan, 60]})
x = pd.concat([a1,a2, b, c], axis=1)
print(x)
x = x.fillna({'b':10, 'c': 50})
print(x)
Initial dataframe:
a a b c
0 1 1 NaN 40.0
1 2 2 20.0 NaN
2 3 3 30.0 60.0
Data is unchanged after df.fillna():
a a b c
0 1 1 NaN 40.0
1 2 2 20.0 NaN
2 3 3 30.0 60.0
As mentioned in the comments, there's a problem assigning values to a dataframe in the presence of duplicate column names.
However, you can use this workaround:
for col, val in {'b': 10, 'c': 50}.items():
    new_col = x[col].fillna(val)
    idx = int(x.columns.get_loc(col))
    x = x.drop(col, axis=1)
    x.insert(loc=idx, column=col, value=new_col)

print(x)
Result:
a a b c
0 1 1 10.0 40.0
1 2 2 20.0 50.0
2 3 3 30.0 60.0
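If the duplicate 'a' label itself is not needed, another option (just a sketch, not the only fix) is to make the labels unique before calling fillna, after which the dict parameter works as usual:
import pandas as pd
import numpy as np

a1 = pd.DataFrame({'a': [1, 2, 3]})
a2 = pd.DataFrame({'a': [1, 2, 3]})
b = pd.DataFrame({'b': [np.nan, 20, 30]})
c = pd.DataFrame({'c': [40, np.nan, 60]})

# Rename the second 'a' so every column label is unique,
# then fillna with a dict behaves normally.
x = pd.concat([a1, a2.rename(columns={'a': 'a2'}), b, c], axis=1)
print(x.fillna({'b': 10, 'c': 50}))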

Delete values from pandas dataframe based on logical operation

I want to delete the values that are greater than a certain threshold from a pandas dataframe. Is there an efficient way to perform this? I am doing it with apply and lambda, which works fine but is a bit slow for a large dataframe, and I feel like there must be a better method.
df = pd.DataFrame({'A': [1,2,3,4,5], 'B': [1,2,3,4,5]})
df
A B
0 1 1
1 2 2
2 3 3
3 4 4
4 5 5
How can this be done without apply and lambda?
df['A'] = df.apply(lambda x: x['A'] if x['A'] < 3 else None, axis=1)
df
A B
0 1.0 1
1 2.0 2
2 NaN 3
3 NaN 4
4 NaN 5
Use a boolean mask against the df:
In[21]:
df[df<3]
Out[21]:
A
0 1.0
1 2.0
2 NaN
3 NaN
4 NaN
Where the boolean condition is not met, False is returned; this just masks out the df value, returning NaN.
If you actually want to overwrite df with this masked result then self-assign:
df = df[df<3]
To compare a specific column:
In[22]:
df[df['A']<3]
Out[22]:
A
0 1
1 2
If you want NaN in the removed rows then you can use a trick: double square brackets return a single-column df, so we can mask the df:
In[25]:
df[df[['A']]<3]
Out[25]:
A
0 1.0
1 2.0
2 NaN
3 NaN
4 NaN
If you have multiple columns then the above won't work, as the boolean mask has to match the shape of the original df; in that case you can reindex against the original df index:
In[31]:
df = pd.DataFrame({'A': [1,2,3,4,5], 'B': [1,2,3,4,5]})
df[df['A']<3].reindex(df.index)
Out[31]:
A B
0 1.0 1.0
1 2.0 2.0
2 NaN NaN
3 NaN NaN
4 NaN NaN
EDIT
You've updated your question again; if you want to just overwrite the single column:
In[32]:
df = pd.DataFrame({'A': [1,2,3,4,5], 'B': [1,2,3,4,5]})
df['A'] = df.loc[df['A'] < 3,'A']
df
Out[32]:
A B
0 1.0 1
1 2.0 2
2 NaN 3
3 NaN 4
4 NaN 5
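Equivalently, Series.where keeps the values that satisfy the condition and fills the rest with NaN, so the single-column overwrite can also be written as (a small variation, same result):
import pandas as pd

df = pd.DataFrame({'A': [1, 2, 3, 4, 5], 'B': [1, 2, 3, 4, 5]})

# Keep A where A < 3, replace everything else with NaN
# (same result as the df.loc self-assignment above).
df['A'] = df['A'].where(df['A'] < 3)
print(df)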
