Pandas: joining items with same index - python

I have a pandas data frame that is essentially a vector of values with a (repeated) index, say:
row1 10
row1 11
row2 9
row2 8
However, I want to create a 2x2 matrix from this, in which the row index is actually a header (column index). Like this:
row1 row2
10 9
11 8
What is the most efficient way of doing this? This example is a simplification, but I could be dealing with thousands of data points. Does pandas have a specific function for joining items with same index into a table?
Observation: all indexes would have the same number of entries.
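For reference, a minimal way to build this example in code (a sketch; the answers below refer to the data either as a Series s or as a one-column DataFrame df, neither of which is constructed in the original post):
import pandas as pd
# the example data as a Series with a repeated index
s = pd.Series([10, 11, 9, 8], index=['row1', 'row1', 'row2', 'row2'])
# the same data as a one-column DataFrame (the single column is named 0)
df = s.to_frame()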

Assign a second index level (the cumulative count within each group) and unstack:
s.index = [s.groupby(level=0).cumcount(), s.index]
s.unstack()
0 row1 row2
0 10 9
1 11 8
An alternative NumPy approach (still slower than the unstack version):
# find the unique index labels and, for each element, which label it belongs to
u, inv = np.unique(s.index.values, return_inverse=True)
# for each label, build a boolean mask over the elements and collect the matching values
data = dict(zip(u, [s.values[g] for g in (np.arange(len(u))[:, None] == inv)]))
pd.DataFrame(data)

You can create an id variable for each unique index and then pivot the table to wide format:
df.assign(id = df.groupby([0]).cumcount()).set_index(['id', 0]).unstack(level=1)
# 1
#0 row1 row2
#id
# 0 10 9
# 1 11 8
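The same reshaping can also be spelled with pivot instead of set_index/unstack (a sketch, assuming, as in the output above, that the label column is named 0 and the value column is named 1):
# 'id' becomes the row index, the labels in column 0 become the headers,
# and the values are taken from column 1
df.assign(id=df.groupby([0]).cumcount()).pivot(index='id', columns=0, values=1)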

Use groupby on the index to get a list of elements for each index, use to_dict to get a dictionary, then use the pd.DataFrame constructor:
pd.DataFrame(df.groupby(level=0)['column_name'].apply(list).to_dict())
If you have a Series, say s, instead of a DataFrame you don't need to supply a column name:
pd.DataFrame(s.groupby(level=0).apply(list).to_dict())
The resulting output:
row1 row2
0 10 9
1 11 8
Timings
Using the following setup to produce larger sample data, assuming the input data is a DataFrame:
n = 10**6
df = pd.DataFrame(np.random.random(size=n), index=['row1', 'row2']*(n//2))
def pir2(s):
    s.index = [s.groupby(level=0).cumcount(), s.index]
    return s.unstack()
I get the following timings:
%timeit pd.DataFrame(df.groupby(level=0)[0].apply(list).to_dict())
1 loop, best of 3: 210 ms per loop
%timeit pir2(df.copy())
1 loop, best of 3: 486 ms per loop
%timeit df.assign(id = df.groupby([0]).cumcount()).set_index(['id', 0]).unstack(level=1)
1 loop, best of 3: 1.34 s per loop

And just a little bit faster, using @root's example :)
pd.DataFrame({name:group.values for name, group in df.groupby(level=0)[0]})
Timings:
%timeit pd.DataFrame({name:group.values for name, group in df.groupby(level=0)[0]})
10 loops, best of 3: 73.6 ms per loop
%timeit pd.DataFrame(df.groupby(level=0)[0].apply(list).to_dict())
1 loop, best of 3: 249 ms per loop

Related

How do I select specific columns of a data frame, and sum them based on a condition?

So here is an analogous situation to what I am trying to do:
data = pd.read_csv(data)
df = pd.DataFrame(data)
print(df)
The data frame looks as follows
... 'd1' 'd2' 'd3... 'd13'
0 ... 0 0 0... 0
1 ... 0 0.95 0... 0
2 ... 0 0.95 0.95... 0
And so on; essentially I would like to select the last 13 columns of my data frame, count how many values per row are greater than a certain value, and then append that count to my data frame.
I figure there must be a simple way. I have been trying to use df.iloc[:, 21:], as my first column of interest begins there, but from that point on I feel stuck. I have tried many different methods, such as criteria and for loops. I know this is a trivial thing, but I have spent hours on it. Any help would be much appreciated.
for x in df:
    a = df.iloc[:,21:].values()
    if a.any[:, 12] > 0.9:
        a[x] = 1
    else:
        a[x] = 0
sumdi = sum(a)
df.append(sumdi)
I believe you need to compare the last 13 columns, selected by iloc, with gt (>), count the True values with sum and cast to integers:
df['new'] = df.iloc[:,-13:].gt(0.9).sum(axis=1).astype(int)
Sample:
np.random.seed(12)
df = pd.DataFrame(np.random.rand(10, 6))
#compare last 3 columns for > 0.5
df['new'] = df.iloc[:,-3:].gt(.5).sum(axis=1).astype(int)
print (df)
0 1 2 3 4 5 new
0 0.154163 0.740050 0.263315 0.533739 0.014575 0.918747 2
1 0.900715 0.033421 0.956949 0.137209 0.283828 0.606083 1
2 0.944225 0.852736 0.002259 0.521226 0.552038 0.485377 2
3 0.768134 0.160717 0.764560 0.020810 0.135210 0.116273 0
4 0.309898 0.671453 0.471230 0.816168 0.289587 0.733126 2
5 0.702622 0.327569 0.334648 0.978058 0.624582 0.950314 3
6 0.767476 0.825009 0.406640 0.451308 0.400632 0.995138 1
7 0.177564 0.962597 0.419250 0.424052 0.463149 0.373723 0
8 0.465508 0.035168 0.084273 0.732521 0.636200 0.027908 2
9 0.300170 0.220853 0.055020 0.523246 0.416370 0.048219 1
Using apply is slow, because there are loops under the hood:
np.random.seed(12)
df = pd.DataFrame(np.random.rand(10000, 20))
In [172]: %timeit df['new'] = df.iloc[:,-13:].gt(0.9).sum(axis=1).astype(int)
3.46 ms ± 91.3 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
In [173]: %timeit df['new'] = df[df.columns[-13:]].apply(lambda x: x > .9, axis=1).sum(axis=1)
1.57 s ± 5.26 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
Yes, you'll want to apply row-wise functions.
# Select subset of columns
cols = df1.iloc[:, -13:].columns
# Create a new column counting, per row, how many values are greater than 1
df1['new'] = df1[cols].apply(lambda x: x > 1, axis=1).sum(axis=1)
Under the hood this does the same thing as @jezrael's answer, just with slightly different syntax: gt() is replaced with an applied lambda. This offers slightly more flexibility for cases where your logic is more complex.
Note: axis=1 is important to ensure your function is applied per row; change it to axis=0 to work column by column.
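For instance, a sketch of a slightly more involved row-wise rule where that flexibility helps, reusing df1 and cols from the snippet above (the thresholds lo and hi are made up here):
# count values strictly between lo and hi in each row
lo, hi = 0.5, 0.9
df1['in_band'] = df1[cols].apply(lambda x: ((x > lo) & (x < hi)).sum(), axis=1)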

How can I group by and count similar elements in a pandas dataframe? [duplicate]

I have a dataset
category
cat a
cat b
cat a
I'd like to be able to return something like (showing unique values and frequency)
category freq
cat a 2
cat b 1
This question is a duplicate; see the answers under "Count the frequency that a value occurs in a dataframe column" below.

compare values in two columns of data frame

I have the following two columns in pandas data frame
256 Z
0 2 2
1 2 3
2 4 4
3 4 9
There are around 1594 rows. '256' and 'Z' are the column headers, whereas 0, 1, 2, 3 are row numbers (the first column above). I want to print the row numbers where the value in column '256' is not equal to the value in column 'Z'; the output in the above case would be 1, 3.
How can this comparison be made in pandas? I will be very grateful for help. Thanks.
Create the data frame:
import pandas as pd
df = pd.DataFrame({"256":[2,2,4,4], "Z": [2,3,4,9]})
output:
256 Z
0 2 2
1 2 3
2 4 4
3 4 9
After subsetting your data frame, use the index to get the id of rows in the subset:
row_ids = df[df["256"] != df.Z].index
gives
Int64Index([1, 3], dtype='int64')
Another way is to use the .loc method of pandas.DataFrame, which returns the rows that satisfy the boolean indexing; .index then gives their labels:
df.loc[(df['256'] != df['Z'])].index
with an output of:
Int64Index([1, 3], dtype='int64')
This happens to be the quickest of the listed implementations, as can be seen with %timeit in an IPython notebook:
import pandas as pd
import numpy as np
df = pd.DataFrame({"256":np.random.randint(0,10,1594), "Z": np.random.randint(0,10,1594)})
%timeit df.loc[(df['256'] != df['Z'])].index
%timeit row_ids = df[df["256"] != df.Z].index
%timeit rows = list(df[df['256'] != df.Z].index)
%timeit df[df['256'] != df['Z']].index
with an output of:
1000 loops, best of 3: 352 µs per loop
1000 loops, best of 3: 358 µs per loop
1000 loops, best of 3: 611 µs per loop
1000 loops, best of 3: 355 µs per loop
However, when the differences come down to 5-10 microseconds it doesn't matter much; if in the future you have a very large data set, timing and efficiency may become a much more important issue. For your relatively small data set of 1594 rows, I would go with the solution that looks the most elegant and promotes readability.
You can try this:
# Assuming your DataFrame is named "frame"
rows = list(frame[frame['256'] != frame.Z].index)
rows will now be a list containing the row numbers for which those two column values are not equal. So with your data:
>>> frame
256 Z
0 2 2
1 2 3
2 4 4
3 4 9
[4 rows x 2 columns]
>>> rows = list(frame[frame['256'] != frame.Z].index)
>>> print(rows)
[1, 3]
Assuming df is your dataframe, this should do it:
df[df['256'] != df['Z']].index
yielding:
Int64Index([1, 3], dtype='int64')
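If you want plain Python integers to print rather than an Int64Index object, a small follow-up sketch using the same df:
# boolean-mask the index directly and convert it to a list of row numbers
rows = df.index[df['256'] != df['Z']].tolist()
print(rows)  # [1, 3]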

Count the frequency that a value occurs in a dataframe column

I have a dataset
category
cat a
cat b
cat a
I'd like to be able to return something like (showing unique values and frequency)
category freq
cat a 2
cat b 1
Use value_counts(), as @DSM commented.
In [37]:
df = pd.DataFrame({'a':list('abssbab')})
df['a'].value_counts()
Out[37]:
b 3
a 2
s 2
dtype: int64
Also groupby and count. Many ways to skin a cat here.
In [38]:
df.groupby('a').count()
Out[38]:
   a
a
a  2
b  3
s  2
[3 rows x 1 columns]
See the online docs.
If you wanted to add frequency back to the original dataframe use transform to return an aligned index:
In [41]:
df['freq'] = df.groupby('a')['a'].transform('count')
df
Out[41]:
a freq
0 a 2
1 b 3
2 s 2
3 s 2
4 b 3
5 a 2
6 b 3
[7 rows x 2 columns]
If you want to apply to all columns you can use:
df.apply(pd.value_counts)
This will apply a column based aggregation function (in this case value_counts) to each of the columns.
df.category.value_counts()
This short little line of code will give you the output you want.
If your column name has spaces you can use
df['category'].value_counts()
df.apply(pd.value_counts).fillna(0)
value_counts - returns an object containing counts of unique values
apply - counts the frequency in every column; if you set axis=1, you get the frequency in every row (see the sketch below)
fillna(0) - makes the output tidier by replacing NaN with 0
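For example, on a hypothetical two-column DataFrame (df2 below is made up for illustration), the per-row variant mentioned above looks like this sketch:
# hypothetical two-column frame of categorical labels
df2 = pd.DataFrame({'x': list('aabb'), 'y': list('abab')})
# one row of counts per input row; fillna(0) fills in values absent from a row
df2.apply(pd.value_counts, axis=1).fillna(0)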
In 0.18.1 groupby together with count does not give the frequency of unique values:
>>> df
a
0 a
1 b
2 s
3 s
4 b
5 a
6 b
>>> df.groupby('a').count()
Empty DataFrame
Columns: []
Index: [a, b, s]
However, the unique values and their frequencies are easily determined using size:
>>> df.groupby('a').size()
a
a 2
b 3
s 2
df.a.value_counts() returns the values sorted in descending order (largest count first) by default.
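If you prefer the opposite ordering, value_counts takes an ascending flag (a small sketch on the same example df):
# smallest counts first instead of the default descending order
df['a'].value_counts(ascending=True)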
Using list comprehension and value_counts for multiple columns in a df
[my_series[c].value_counts() for c in list(my_series.select_dtypes(include=['O']).columns)]
https://stackoverflow.com/a/28192263/786326
As everyone said, the fastest solution is to do:
df.column_to_analyze.value_counts()
But if you want to use the output in your dataframe, with this schema:
df input:
category
cat a
cat b
cat a
df output:
category counts
cat a 2
cat b 1
cat a 2
you can do this:
df['counts'] = df.category.map(df.category.value_counts())
df
If your DataFrame's values all have the same type, you can also set return_counts=True in numpy.unique().
index, counts = np.unique(df.values,return_counts=True)
np.bincount() could be faster if your values are integers.
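A minimal np.bincount sketch (the array here is made up; bincount assumes non-negative integers and returns, at position i, the count of value i):
import numpy as np
codes = np.array([1, 1, 3, 2, 1])
np.bincount(codes)
# array([0, 3, 1, 1]): value 0 occurs 0 times, 1 occurs 3 times, 2 once, 3 once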
You can also do this with pandas by casting your columns to categories first, i.e. dtype="category", e.g.
cats = ['client', 'hotel', 'currency', 'ota', 'user_country']
df[cats] = df[cats].astype('category')
and then calling describe:
df[cats].describe()
This will give you a nice table of value counts and a bit more :):
client hotel currency ota user_country
count 852845 852845 852845 852845 852845
unique 2554 17477 132 14 219
top 2198 13202 USD Hades US
freq 102562 8847 516500 242734 340992
Without any libraries, you could do this instead:
def to_frequency_table(data):
    frequencytable = {}
    for key in data:
        if key in frequencytable:
            frequencytable[key] += 1
        else:
            frequencytable[key] = 1
    return frequencytable
Example:
to_frequency_table([1,1,1,1,2,3,4,4])
>>> {1: 4, 2: 1, 3: 1, 4: 2}
I believe this should work fine for any DataFrame columns list.
def column_list(x):
    column_list_df = []
    for col_name in x.columns:
        y = col_name, len(x[col_name].unique())
        column_list_df.append(y)
    return pd.DataFrame(column_list_df).rename(columns={0: "Feature", 1: "Value_count"})
The function column_list iterates over the column names and counts the number of unique values in each column.
@metatoaster has already pointed this out.
Go for Counter. It's blazing fast.
import pandas as pd
from collections import Counter
import timeit
import numpy as np
df = pd.DataFrame(np.random.randint(1, 10000, (100, 2)), columns=["NumA", "NumB"])
Timers
%timeit -n 10000 df['NumA'].value_counts()
# 10000 loops, best of 3: 715 µs per loop
%timeit -n 10000 df['NumA'].value_counts().to_dict()
# 10000 loops, best of 3: 796 µs per loop
%timeit -n 10000 Counter(df['NumA'])
# 10000 loops, best of 3: 74 µs per loop
%timeit -n 10000 df.groupby(['NumA']).count()
# 10000 loops, best of 3: 1.29 ms per loop
Cheers!
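If you need the Counter result back as a pandas object, it can be wrapped in a Series (a small sketch, not part of the original answer, reusing the imports and df from the snippet above):
# Counter is a dict subclass, so it feeds straight into the Series constructor
pd.Series(Counter(df['NumA'])).sort_values(ascending=False)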
The following code creates a frequency table for the various values in a column called "Total_score" in a dataframe called "smaller_dat1", and then returns the number of times the value 300 appears in the column.
valuec = smaller_dat1.Total_score.value_counts()
valuec.loc[300]
n_values = data.income.value_counts()
# First unique value count
n_at_most_50k = n_values[0]
# Second unique value count
n_greater_50k = n_values[1]
n_values
Output:
<=50K    34014
>50K     11208
Name: income, dtype: int64
(n_greater_50k, n_at_most_50k) gives:
(11208, 34014)
your data:
|category|
cat a
cat b
cat a
solution:
df['freq'] = df.groupby('category')['category'].transform('count')
df = df.drop_duplicates()

pandas python how to count the number of records or rows in a dataframe

Obviously new to Pandas. How can I simply count the number of records in a dataframe?
I would have thought something as simple as this would do it, and I can't seem to even find the answer in searches... probably because it is too simple.
cnt = df.count
print cnt
The above code actually just prints the whole df.
To get the number of rows in a dataframe use:
df.shape[0]
(and df.shape[1] to get the number of columns).
As an alternative you can use
len(df)
or
len(df.index)
(and len(df.columns) for the columns)
shape is more versatile and more convenient than len(), especially for interactive work (just needs to be added at the end), but len is a bit faster (see also this answer).
Avoid count(), because it returns the number of non-NA/null observations over the requested axis.
len(df.index) is faster
import pandas as pd
import numpy as np
df = pd.DataFrame(np.arange(24).reshape(8, 3),columns=['A', 'B', 'C'])
df['A'][5]=np.nan
df
# Out:
# A B C
# 0 0 1 2
# 1 3 4 5
# 2 6 7 8
# 3 9 10 11
# 4 12 13 14
# 5 NaN 16 17
# 6 18 19 20
# 7 21 22 23
%timeit df.shape[0]
# 100000 loops, best of 3: 4.22 µs per loop
%timeit len(df)
# 100000 loops, best of 3: 2.26 µs per loop
%timeit len(df.index)
# 1000000 loops, best of 3: 1.46 µs per loop
df.__len__ is just a call to len(df.index)
import inspect
print(inspect.getsource(pd.DataFrame.__len__))
# Out:
# def __len__(self):
# """Returns length of info axis, but here we use the index """
# return len(self.index)
Why you should not use count()
df.count()
# Out:
# A 7
# B 8
# C 8
Regarding your question about counting a single field: here is an example that I hope helps.
Say I have the following DataFrame
import numpy as np
import pandas as pd
df = pd.DataFrame(np.random.normal(0, 1, (5, 2)), columns=["A", "B"])
You could count a single column by
df.A.count()
#or
df['A'].count()
both evaluate to 5.
The cool thing (or one of many w.r.t. pandas) is that if you have NA values, count takes that into consideration.
So if I did
df['A'][1::2] = np.NAN
df.count()
The result would be
A 3
B 5
Simply, row_num = df.shape[0] # gives number of rows, here's the example:
import pandas as pd
import numpy as np
In [322]: df = pd.DataFrame(np.random.randn(5,2), columns=["col_1", "col_2"])
In [323]: df
Out[323]:
col_1 col_2
0 -0.894268 1.309041
1 -0.120667 -0.241292
2 0.076168 -1.071099
3 1.387217 0.622877
4 -0.488452 0.317882
In [324]: df.shape
Out[324]: (5, 2)
In [325]: df.shape[0] ## Gives no. of rows/records
Out[325]: 5
In [326]: df.shape[1] ## Gives no. of columns
Out[326]: 2
The NaN example above misses one piece, which makes it less generic. To do this more "generically", use df['column_name'].value_counts().
This will give you the counts of each value in that column.
d=['A','A','A','B','C','C'," " ," "," "," "," ","-1"] # for simplicity
df=pd.DataFrame(d)
df.columns=["col1"]
df["col1"].value_counts()
       5     # the label of this row is the blank string " "
A      3
C      2
-1     1
B      1
dtype: int64
"""len(df) give you 12, so we know the rest must be Nan's of some form, while also having a peek into other invalid entries, especially when you might want to ignore them like -1, 0 , "", also"""
Simple method to get the records count:
df.count()[0]
I used the pandas library for this. Here is the code:
import pandas as pd
name_of_file = "test.xlsx"
data = pd.read_excel(name_of_file)
required_colum_name = "Post test Number"
print(len(data[required_colum_name]))
# this also works -> data["Post test Number"].count()
