Count text records in a column with pandas - python

I need to count how many times a value appears in a column. I did something similar in Excel and I want to understand how to do it in pandas. Thanks

You can try something like this:
import pandas as pd
df = pd.DataFrame({'char_list':list('aabbbbssbbaaabdddcccsbcderfffrrcashhttyy')})
df = df['char_list'].value_counts().reset_index()
df.columns = ['char_list', 'count']
print(df)
Output:
   char_list  count
0          b      8
1          a      6
2          c      5
3          s      4
4          d      4
5          r      3
6          f      3
7          h      2
8          t      2
9          y      2
10         e      1
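As a side note, a minimal alternative sketch (assuming a reasonably recent pandas) that names the count column directly, so the separate df.columns assignment is not needed:
import pandas as pd

df = pd.DataFrame({'char_list': list('aabbbbssbbaaabdddcccsbcderfffrrcashhttyy')})

# name the index and the count column while resetting the index
counts = (df['char_list']
          .value_counts()
          .rename_axis('char_list')
          .reset_index(name='count'))
print(counts)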

Do you want something like this:
df = pd.DataFrame({"a": [1, 2, 3, 1, 1, 4, 5, 6, 2, 1]})
oc = df.groupby("a").size()
df["count"] = df["a"].map(oc)
print(oc)
print()
print(df)
to get
a
1    4
2    2
3    1
4    1
5    1
6    1
dtype: int64

   a  count
0  1      4
1  2      2
2  3      1
3  1      4
4  1      4
5  4      1
6  5      1
7  6      1
8  2      2
9  1      4
Or do you prefer something like Pandas: Incrementally count occurrences in a column with an increment of occurrences?
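As a side note, the same per-row count can also be obtained in one step with GroupBy.transform; a minimal sketch with the same data:
import pandas as pd

df = pd.DataFrame({"a": [1, 2, 3, 1, 1, 4, 5, 6, 2, 1]})

# transform broadcasts each group's size back onto the original rows,
# so the intermediate Series and the map() call are not needed
df["count"] = df.groupby("a")["a"].transform("size")
print(df)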

Clarify and describe your requirements
Count the occurrence of string X inside what?
Where to look, how to count?
What is X?
What does your Excel formula do?
Your Excel formula performs a window-based aggregation, where the aggregation function is a count (the COUNTIF function) and the window runs from the first row to the current row (the first parameter, a range). The value to count (the given criteria) is specified per row (the second parameter, a cell value).
See Excel's function COUNTIF:
Counts the number of cells within a range that meet the given criteria
Illustrate by example
Instead of "window-based" we could also say cumulative:
The formula counts the occurrences of the string key123 (the value in column A of the current row, e.g. row 1) in the rows from the first ($A$1) to the current ($A1).
Given a column with strings where the first string is key123, then
its first occurrence should have count 1,
the second should have count 2
and so on
Equivalent functions in pandas
So your Excel formula =COUNTIF($A$1:$A1; A1) would directly translate to pandas GroupBy.cumcount like
df.groupby("Column_A").cumcount()+1
as already answered in:
Pandas: Incrementally count occurrences in a column
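A minimal, self-contained sketch of that translation (the column name Column_A and the running_count label are only placeholders mirroring the Excel layout described above):
import pandas as pd

df = pd.DataFrame({"Column_A": ["key123", "foo", "key123", "key123", "foo"]})

# 1-based cumulative count per value, equivalent to =COUNTIF($A$1:$A1; A1)
df["running_count"] = df.groupby("Column_A").cumcount() + 1
print(df)
#   Column_A  running_count
# 0   key123              1
# 1      foo              1
# 2   key123              2
# 3   key123              3
# 4      foo              2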
Terminology
The cumulative count increases by one for each occurrence, similar to a cumulative sum, also known as a running total.
See also related SQL keywords/concepts:
GROUP BY: grouping records and applying aggregate-functions
COUNT: an aggregate-function like SUM, AVG, MAX, MIN
window functions: allow further fine-grained aggregation

Related

How to restrict DataFrame number of rows to the Xth unique value in certain column?

Say for example we have the following DataFrame:
A B
1 2
1 2
2 3
3 4
4 5
4 2
And suppose we want x (say 3) unique values in column A.
Then the desired output would be:
A B
1 2
1 2
2 3
3 4
I thought about looping through the column in question, keeping a count of the unique values seen so far, and taking the subset of the DataFrame up to the right index. I am still a newbie to Python and I believe there is a more efficient way to do this, so please share your solutions. Appreciated!
You can try Series.factorize, which indexes the unique values starting at 0, and then select the rows whose code is <= n-1 (because the index starts at 0); this preserves the order too:
n=3
df[df['A'].factorize()[0]<=n-1]
A B
0 1 2
1 1 2
2 2 3
3 3 4
You can use np.random.choice to select n unique ids at random, then isin to select the rows with those ids:
import numpy as np

selected_ids = np.random.choice(df['A'].unique(), replace=False, size=3)
df[df['A'].isin(selected_ids)]
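Note that np.random.choice picks a random sample of ids rather than the first n in order of appearance; if you want the latter, one hedged alternative is to slice unique() directly:
import pandas as pd

df = pd.DataFrame({"A": [1, 1, 2, 3, 4, 4], "B": [2, 2, 3, 4, 5, 2]})
n = 3

# unique() preserves order of appearance, so the first n entries
# are the first n distinct values in the column
first_n = df["A"].unique()[:n]
print(df[df["A"].isin(first_n)])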

Top 2 products counts per day Pandas

I have a dataframe like the one in the picture below.
First, I want the top 2 products; second, I need the frequency of the top 2 products per day, so I need to group by day and select the top 2 products from the products column. I tried this code but it gives an error:
df.groupby("days", as_index=False)(["products"] == "Follow Up").count()
You need to groupby over both day and product and then use size. Once you have done this you will have all the counts you require in the resulting df.
You will then need to sort by both the day and the default 0 column, which now contains your counts; that column was created by resetting the index on the initial groupby.
We follow the instructions in Pandas get topmost n records within each group to give your desired result.
A full example:
Setup:
df = pd.DataFrame({'day': [1,1,1,1,1,1,2,2,2,2,2,2,3,3,3,3,3,3],
                   'value': ['a','a','b','b','b','c','a','a','b','b','b','c','a','a','b','b','b','c']})
df.head(6)
day value
0 1 a
1 1 a
2 1 b
3 1 b
4 1 b
5 1 c
df_counts = df.groupby(['day', 'value']).size().reset_index().sort_values(['day', 0], ascending=[True, False])
df_top_2 = df_counts.groupby('day').head(2)
df_top_2
day value 0
1 1 b 3
0 1 a 2
4 2 b 3
3 2 a 2
7 3 b 3
6 3 a 2
Of course, you should rename the 0 column to something more reasonable but this is a minimal example.
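For instance, one way (assuming a reasonably recent pandas, and reusing the df from the setup above) is to name the size column while resetting the index, so there is no 0 column to rename; the column name n below is arbitrary:
# name the size column directly instead of getting an unnamed 0 column
df_counts = (df.groupby(['day', 'value'])
               .size()
               .reset_index(name='n')
               .sort_values(['day', 'n'], ascending=[True, False]))
df_top_2 = df_counts.groupby('day').head(2)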

Multiple Condition Apply Function that iterates over itself

So I have a Dataframe that is the same thing 348 times, but with a different date as a static column. What I would like to do is add a column that checks against that date and then counts the number of rows that are within 20 miles using a lat/lon column and geopy.
My frame is like this:
What I am looking to do is something like an apply function that takes all of the identifying dates that are equal to the column and then runs this:
geopy.distance.vincenty(x, y).miles
X would be the location's lat/lon and y would be the iterative lat/lon. I'd want the count of locations in which the above is < 20. I'd then like to store this count as a column in the initial Dataframe.
I'm ok with Pandas, but this is just outside my comfort zone. Thanks.
I started with this DataFrame (because I did not want to type that much by hand and you did not provide any code for the data):
df
Index Number la ID
0 0 1 [43.3948, -23.9483] 1/1/90
1 1 2 [22.8483, -34.3948] 1/1/90
2 2 3 [44.9584, -14.4938] 1/1/90
3 3 4 [22.39458, -55.34924] 1/1/90
4 4 5 [33.9383, -23.4938] 1/1/90
5 5 6 [22.849, -34.397] 1/1/90
Now I introduced an artificial column which is only there to help us get the cartesian product of the distances
df['join'] = 1
df_c = pd.merge(df, df[['la', 'join','Index']], on='join')
The next step is to apply the vincenty function via .apply and store the result in an extra column:
from geopy import distance

df_c['distance'] = df_c.apply(lambda x: distance.vincenty(x.la_x, x.la_y).miles, axis=1)
Now we have the cartesian product of the original frame, which means we also have the comparison of each city with itself. We take that into account in the next step by subtracting 1. We group by Index_x and count the distances smaller than 20 miles.
df['num_close_cities'] = df_c.groupby('Index_x').apply(lambda x: sum(x.distance < 20)) - 1
df.drop('join', axis=1)
Index Number la ID num_close_cities
0 0 1 [43.3948, -23.9483] 1/1/90 0
1 1 2 [22.8483, -34.3948] 1/1/90 1
2 2 3 [44.9584, -14.4938] 1/1/90 0
3 3 4 [22.39458, -55.34924] 1/1/90 0
4 4 5 [33.9383, -23.4938] 1/1/90 0
5 5 6 [22.849, -34.397] 1/1/90 1
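One caveat: distance.vincenty was deprecated and later removed in geopy 2.0; on a current geopy the same distance step would use distance.geodesic instead, e.g.:
from geopy import distance

# same cartesian-product distance step, with geodesic replacing the removed vincenty
df_c['distance'] = df_c.apply(lambda x: distance.geodesic(x.la_x, x.la_y).miles, axis=1)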

How to calculate the sum of two adjacent rows from one column?

I have a data frame with the following columns, the first column is index:
para
0 223.46
1 92.26
2 66.86
3 52.14
4 69.55
5 94.20
6 129.96
7 297.48
The sum should be of the two adjacent rows from the column:
new_index 0 will keep the first value,
new_index 1 = old_index 0 + old_index 1,
new_index 2 = old_index 1 + old_index 2, ... and so on.
So I guess I need a for loop here (or maybe not). I tried several ways but really have no idea how to do it.
The following is what I tried:
def sum(i):
    for i in range(0, i):
        sum = data_10.icol[i] + data_10.icol[i+1]
    return sum
I expected to get:
para
0 223.46
1 315.72
2 159.12
3 119.00
4 121.69
5 163.75
6 224.16
7 427.38
This is a rolling sum:
df.rolling(2, min_periods=1).sum()
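A quick check with the data from the question, plus an equivalent shift-based formulation for comparison:
import pandas as pd

df = pd.DataFrame({'para': [223.46, 92.26, 66.86, 52.14, 69.55, 94.20, 129.96, 297.48]})

# window of 2; min_periods=1 keeps the first row as-is instead of NaN
print(df.rolling(2, min_periods=1).sum())

# equivalent: add the previous value, treating the missing one before row 0 as 0
print(df['para'] + df['para'].shift(1, fill_value=0))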

Efficient way to call previous row in python

I want to substitute the previous row's value whenever a 0 value is found in a column of the dataframe in Python. I used the following code:
if not a[j]:
    a[j] = a[j-1]
and also
if a[j] == 0:
    a[j] = a[j-1]
Update:
Complete code updated:
for i in pd.unique(r.a):
    sub = r[r.vehicle_id == i]
    sub = DataFrame(sub, columns=['a', 'b', 'c', 'd', 'e'])
    sub = sub.drop_duplicates(["a", "b", "c", "d"])
    sub['c'] = pd.to_datetime(sub['c'], unit='s')
    for j in range(1, len(sub[1:])):
        if not sub.d[j]:
            sub.d[j] = sub.d[j-1]
        if not sub.e[j]:
            sub.e[j] = sub.e[j-1]
    sub = sub.drop_duplicates(["lash_angle", "lash_check_count"])
This is the start of my code. Only the sub.d[j] line is getting delayed.
Both of these seem to work well with integer values. One of the columns contains decimal values, and when using the code for that column, the statement takes a huge amount of time to complete (nearly 15-20 seconds). I am looping through nearly 10000 ids, and wasting 15 seconds at this step makes my entire code inefficient. Is there a better way I can do this for the float (decimal) values, so that it would be much faster?
Thanks
Assuming that by "column of the dataframe" you mean you're actually talking about a column (Series) of a pandas DataFrame, then one trick is to replace the 0 by nan and then forward-fill. For example:
>>> import numpy as np
>>> import pandas as pd
>>> df = pd.DataFrame(np.random.randint(0, 4, 10**6))
>>> df.head(10)
0
0 0
1 3
2 3
3 0
4 1
5 2
6 3
7 2
8 0
9 3
>>> df[0] = df[0].replace(0, np.nan).ffill()
>>> df.head(10)
0
0 NaN
1 3
2 3
3 3
4 1
5 2
6 3
7 2
8 2
9 3
where you can decide for yourself how you want to handle the case of a 0 at the start, where you have no value to fill. This assumes that there aren't already NaN values you want to leave alone, but if there are, you can just use a mask with .loc to select only the ones you want to change.
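A minimal sketch of that masking idea, assuming a float Series that contains both zeros to fill and genuine NaNs to leave alone:
import numpy as np
import pandas as pd

s = pd.Series([1.0, 0.0, np.nan, 4.0, 0.0])

zeros = s == 0
# forward-fill a copy with the zeros hidden, then write the filled values
# back only into the positions that were zero; the original NaN stays NaN
s.loc[zeros] = s.mask(zeros).ffill()[zeros]
print(s)
# 0    1.0
# 1    1.0
# 2    NaN
# 3    4.0
# 4    4.0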
