Python DataFrame: count how many different elements

I need to count how many different elements are in my DataFrame (df).
My df holds the day of the month (as a number: 1, 2, 3, ..., 31) on which a certain variable was measured. Three columns describe the day number, and because there are multiple measurements per day, the columns contain repeated values. I need to know on how many days in a month the variable was measured, regardless of how many measurements were taken on each day. So I want to count the days while ignoring repeated values.
As an example, the data of my df would look like this:
col1  col2  col3
   2     2     2
   2     2     3
   3     3     3
   3     4     8
I need an output that tells me that in that DataFrame the numbers are 2, 3, 4 and 8.
Thanks!

Just do:
import pandas as pd

df = pd.DataFrame({"col1": [2, 2, 3, 3], "col2": [2, 2, 3, 4], "col3": [2, 3, 3, 8]})
df.stack().unique()
Outputs:
[2 3 4 8]
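If what you actually need is the number of distinct days rather than the values themselves, a short follow-up against the same df:
df.stack().nunique()
# 4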

You can use the drop_duplicates function on your DataFrame, like:
import pandas as pd
df = pd.DataFrame({'a': [2, 2, 3], 'b': [2, 2, 3], 'c': [2, 2, 3]})
   a  b  c
0  2  2  2
1  2  2  2
2  3  3  3
df = df.drop_duplicates()
print(df['a'].count())
out: 2
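Note that drop_duplicates removes duplicate rows, so the count above is the number of distinct rows, not distinct values. A minimal sketch in the same spirit that extracts the distinct values across all columns (here just 2 and 3) is to melt first:
unique_vals = df.melt()['value'].drop_duplicates()
print(unique_vals.tolist())
# out: [2, 3]
print(unique_vals.count())
# out: 2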

Or you can use numpy to get the unique values in the dataframe:
import pandas as pd
import numpy as np
df = pd.DataFrame({'X': [2, 2, 3, 3], 'Y': [2, 2, 3, 4], 'Z': [2, 3, 3, 8]})
df_unique = np.unique(df.to_numpy())
print(df_unique)
# Output: [2 3 4 8]

# For the count of days:
print(len(df_unique))
# Output: 4

How about:
Assuming this is your initial df:
   col1  col2  col3
0     2     2     2
1     2     2     2
2     3     3     3
Then:
count_df = pd.DataFrame()
for i in df.columns:
    df2 = df[i].value_counts()
    count_df = pd.concat([count_df, df2], axis=1)
final_df = count_df.sum(axis=1)
final_df = pd.DataFrame(data=final_df, columns=['Occurrences'])
print(final_df)
   Occurrences
2            6
3            3
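The same occurrence table can be built in one step by stacking all columns into a single Series first; a sketch against the same df:
print(df.stack().value_counts())
# 2    6
# 3    3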

You can use pandas.unique() like so:
pd.unique(df.to_numpy().flatten())
I have done some basic benchmarking, and this method appears to be the fastest.
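The benchmark itself isn't reproduced here; a minimal sketch of how such a comparison might be run with timeit (timings will of course vary with data shape and pandas version):
import timeit
import numpy as np
import pandas as pd

# A larger frame so the differences are measurable
df = pd.DataFrame(np.random.randint(0, 31, size=(10_000, 3)),
                  columns=['col1', 'col2', 'col3'])
print(timeit.timeit(lambda: pd.unique(df.to_numpy().flatten()), number=100))
print(timeit.timeit(lambda: df.stack().unique(), number=100))
print(timeit.timeit(lambda: np.unique(df.to_numpy()), number=100))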

Related

Pandas Dataframe groupby with overlapping

I'm using a pandas DataFrame to read a CSV that has data points for machine learning. I'm trying to come up with a way to index the DataFrame so that, given an index, I get that row and the next N rows. I don't want to group the DataFrame into bins with no overlap (i.e. indices 0:4, 4:8, etc.). What I do want is a result like this: indices 0:4, 1:5, 2:6, etc. How would this be done?
Maybe you can create a list of DataFrames, like:
import pandas as pd
import numpy as np
nrows = 7
group_size = 5
df = pd.DataFrame({'col1': np.random.randint(0, 10, nrows)})
print(df)
grp = [df.iloc[x:x + group_size] for x in range(df.shape[0] - group_size + 1)]
print(grp[1])
Original DataFrame:
   col1
0     2
1     6
2     6
3     5
4     3
5     3
6     8
2nd DataFrame from the list of DataFrames:
   col1
1     6
2     6
3     5
4     3
5     3
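Newer pandas versions (1.1+) also let you iterate over a rolling window directly; a sketch under that assumption, using the same df and group_size:
# rolling() yields growing windows first (length 1, 2, ...),
# so keep only the full-size ones to match grp above
windows = [w for w in df.rolling(group_size) if len(w) == group_size]
print(windows[1])  # the same frame as grp[1]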

Summing columns according to pattern in column names

Let's start with a very simplified, abstract example. I have a DataFrame like this:
import pandas as pd
d = {'1-A': [1, 2], '1-B': [3, 4], '2-A': [3, 4], '5-B': [2, 7]}
df = pd.DataFrame(data=d)
   1-A  1-B  2-A  5-B
0    1    3    3    2
1    2    4    4    7
I'm looking for an elegant, pandastic solution that yields a DataFrame like this:
   1  2  5
0  4  3  2
1  6  4  7
To make the example more concrete: column 1-A means person id=1, expense category A. Rows are expenses per month. In the result, I want monthly expenses per person summed across categories (so column 1 is the sum of columns 1-A and 1-B). Note that when there are no expenses, there is no column of 0s. Of course, it should be ready for more columns (ids and categories).
I'm quite sure a smart solution with a good separation of the column selection and the summing operation exists.
Use groupby with a lambda function that splits each column name and selects the first part; to group by columns, add axis=1:
df1 = df.groupby(lambda x: x.split('-')[0], axis=1).sum()
#alternative
#df1 = df.groupby(df.columns.str.split('-').str[0], axis=1).sum()
print(df1)
   1  2  5
0  4  3  2
1  6  4  7
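Newer pandas versions deprecate groupby(..., axis=1); a sketch of an equivalent that transposes, groups the rows by the column-name prefix, and transposes back:
df1 = df.T.groupby(lambda x: x.split('-')[0]).sum().T
print(df1)
#    1  2  5
# 0  4  3  2
# 1  6  4  7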

Concatenate dataframes alternating rows with Pandas

I have two dataframes df1 and df2 that are defined like so:
df1                df2
Out[69]:           Out[70]:
   A  B               A  B
0  2  a            0  5  q
1  1  s            1  6  w
2  3  d            2  3  e
3  4  f            3  1  r
My goal is to concatenate the dataframes by alternating the rows so that the resulting dataframe is like this:
dff
Out[71]:
   A  B
0  2  a    <--- belongs to df1
0  5  q    <--- belongs to df2
1  1  s    <--- belongs to df1
1  6  w    <--- belongs to df2
2  3  d    <--- belongs to df1
2  3  e    <--- belongs to df2
3  4  f    <--- belongs to df1
3  1  r    <--- belongs to df2
As you can see the first row of dff corresponds to the first row of df1 and the second row of dff is the first row of df2. The pattern repeats until the end.
I tried to reach my goal by using the following lines of code:
import pandas as pd
df1 = pd.DataFrame({'A':[2,1,3,4], 'B':['a','s','d','f']})
df2 = pd.DataFrame({'A':[5,6,3,1], 'B':['q','w','e','r']})
dfff = pd.DataFrame()
for i in range(0, 4):
    dfx = pd.concat([df1.iloc[i].T, df2.iloc[i].T])
    dfff = pd.concat([dfff, dfx])
However, this approach doesn't work because df1.iloc[i] and df2.iloc[i] are automatically reshaped into columns instead of rows, and I cannot revert the process (even by using .T).
Question: Can you please suggest a nice and elegant way to reach my goal?
Optional: Can you also explain how to convert a column back into a row?
I'm unable to comment on the accepted answer, but note that the sort operation is unstable by default, so you must choose a stable sorting algorithm:
pd.concat([df1, df2]).sort_index(kind='mergesort')
IIUC
In [64]: pd.concat([df1, df2]).sort_index()
Out[64]:
   A  B
0  2  a
0  5  q
1  1  s
1  6  w
2  3  d
2  3  e
3  4  f
3  1  r
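If you want a clean 0..n-1 index on the result, reset it after the stable sort; a small follow-up assuming the df1 and df2 defined above:
dff = pd.concat([df1, df2]).sort_index(kind='mergesort').reset_index(drop=True)
print(dff.head(4))
#    A  B
# 0  2  a
# 1  5  q
# 2  1  s
# 3  6  w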

Pandas python - matching values

I currently have two DataFrames that share a matching column. For example:
DataFrame 1 with columns: A, B, C
DataFrame 2 with column: A
I want to keep only the rows of df1 whose values in column A also appear in df2's column A. For example, if df1 and df2 are:
df1
   A  B  C
0  0  1  3
1  4  2  5
2  6  3  1
3  8  0  0
4  2  1  1
df2
   A
0  4
1  6
2  1
So in this case, I want to keep only the second and third rows of df1.
I tried doing it like this, but it is far too slow since both DataFrames are pretty big:
for index, row in df1.iterrows():
    counter = 0
    for index2, row2 in df2.iterrows():
        if row["A"] == row2["A"]:
            counter = counter + 1
    if counter == 0:
        df2.drop(index, inplace=True)
Use isin to test for membership:
In [176]:
df1[df1['A'].isin(df2['A'])]
Out[176]:
   A  B  C
1  4  2  5
2  6  3  1
Or use the merge method:
import pandas

df1 = pandas.DataFrame([[0, 1, 3], [4, 2, 5], [6, 3, 1], [8, 0, 0], [2, 1, 1]], columns=['A', 'B', 'C'])
df2 = pandas.DataFrame([4, 6, 1], columns=['A'])
df2.merge(df1, on='A')
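One difference worth noting: isin keeps df1's original index and row order, while the merge above resets the index and follows df2's key order. Merging from df1's side instead preserves df1's order; a sketch with the same frames:
print(df1.merge(df2, on='A'))
#    A  B  C
# 0  4  2  5
# 1  6  3  1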

python pandas pivot_table count frequency in one column

I am still new to Python pandas' pivot_table and would like to ask about a way to count the frequency of values in one column, which is linked to another column of IDs. The DataFrame looks like the following.
import pandas as pd
df = pd.DataFrame({'Account_number': [1, 1, 2, 2, 2, 3, 3],
                   'Product': ['A', 'A', 'A', 'B', 'B', 'A', 'B']})
For the output, I'd like to get something like the following:
Product         A  B
Account_number
1               2  0
2               1  2
3               1  1
So far, I tried this code:
df.pivot_table(rows = 'Account_number', cols= 'Product', aggfunc='count')
This code gives me the same counts twice. What is the problem with the code above? Part of the reason why I am asking is that this DataFrame is just an example; the real data that I am working on has tens of thousands of account numbers.
You need to specify the aggfunc as len:
In [11]: df.pivot_table(index='Account_number', columns='Product',
                        aggfunc=len, fill_value=0)
Out[11]:
Product         A  B
Account_number
1               2  0
2               1  2
3               1  1
It looks like count is counting the instances of each column (Account_number and Product); it's not clear to me whether this is a bug...
Solution: Use aggfunc='size'
Using aggfunc=len or aggfunc='count', as in the other answers on this page, will not work for DataFrames with more than three columns. By default, pandas applies the aggfunc to every column not found in the index or columns parameters.
For instance, if we had two more columns in our original DataFrame defined like this:
df = pd.DataFrame({'Account_number': [1, 1, 2, 2, 2, 3, 3],
                   'Product': ['A', 'A', 'A', 'B', 'B', 'A', 'B'],
                   'Price': [10] * 7,
                   'Quantity': [100] * 7})
Output:
   Account_number Product  Price  Quantity
0               1       A     10       100
1               1       A     10       100
2               2       A     10       100
3               2       B     10       100
4               2       B     10       100
5               3       A     10       100
6               3       B     10       100
If you apply the current solutions to this DataFrame, you would get the following:
df.pivot_table(index='Account_number',
               columns='Product',
               aggfunc=len,
               fill_value=0)
Output:
               Price    Quantity
Product            A  B        A  B
Account_number
1                  2  0        2  0
2                  1  2        1  2
3                  1  1        1  1
Solution
Instead, use aggfunc='size'. Since size always returns the same number for each column, pandas does not call it on every single column and just does it once.
df.pivot_table(index='Account_number',
               columns='Product',
               aggfunc='size',
               fill_value=0)
Output:
Product         A  B
Account_number
1               2  0
2               1  2
3               1  1
In newer versions of pandas, a slight modification is required. I had to spend some time figuring it out, so I just wanted to add it here so that someone can use it directly.
df.pivot_table(index='Account_number', columns='Product', aggfunc=len,
               fill_value=0)
You can use count:
df.pivot_table(index='Account_number', columns='Product', aggfunc='count')
I know this question is about pivot_table but for the problem given in the question, we can use crosstab:
out = pd.crosstab(df['Account_number'], df['Product'])
Output:
Product         A  B
Account_number
1               2  0
2               1  2
3               1  1
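For completeness, the same table can also be built with groupby plus unstack; a sketch against the same df:
out = df.groupby(['Account_number', 'Product']).size().unstack(fill_value=0)
print(out)
# Product         A  B
# Account_number
# 1               2  0
# 2               1  2
# 3               1  1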
