Using groupby for counting - python

I have a file with the following structure (there are around 10K rows):
User Destination Country
123 34578 US
123 34578 US
345 76590 US
123 87640 MX
890 11111 CA
890 88888 CA
890 99999 CA
Each user can go to multiple destinations that are located in different countries. I need to find the number of unique destinations each user goes to, plus the median and mean of those unique-destination counts, and the same for countries. I don't know how to use groupby to achieve that. I managed to get the stats by placing everything in a nested dictionary, but I feel there may be a much easier approach using pandas dataframes and groupby.
I am not looking for a count on each groupby section. I am looking for something like: on average, users visit X destinations and Y countries. So, I am looking for aggregate stats over all groupby results.
Edit. Here is my dict approach:
from collections import defaultdict

test = lambda: defaultdict(test)
conn_l = test()

with open('myfile') as f:
    for line in f:
        current = line.split(' ')
        s = current[0]
        d = current[1]
        if conn_l[s][d]:
            conn_l[s][d] += 1
        else:
            conn_l[s][d] = 1

lengths = []
for k, v in conn_l.items():
    lengths.append(len(v))

I think this one might be a little harder than it looks at first glance (or perhaps there is a simpler approach than what I do below).
ser = df.groupby('User')['Destination'].value_counts()
123 34578 2
87640 1
345 76590 1
890 11111 1
99999 1
88888 1
The output of value_counts() is a Series, so you can then do a second groupby to get a count of the unique destinations.
ser2 = ser.groupby(level=0).count()
User
123 2
345 1
890 3
That's split into two steps for clarity, but you could do it all on one line:
df.groupby('User')['Destination'].value_counts().groupby(level=0).count()
With ser2 you ought to be able to do all the other things.
ser2.median()
ser2.mean()
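As a side note (not part of the answer above), if you only need the per-user unique counts, nunique on the groupby gets you there in one step; a minimal sketch, assuming the same df:
# counts unique destinations per user, equivalent to ser2 above
ser2 = df.groupby('User')['Destination'].nunique()
ser2.mean()    # average unique destinations per user
ser2.median()  # median unique destinations per user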

Agree with JohnE that counting the number of entries for User is not obvious.
I found that:
df2 = df.groupby(['User','Destination'])
df3 = df2.size().groupby(level=0).count()
also works, the only difference being that df2 is a DataFrameGroupBy rather than a SeriesGroupBy, so it potentially has a bit more functionality since it retains the Country information.
A trivial example:
for name, group in df2:
    print(name, group)
(123, 34578) User Destination Country
0 123 34578 US
1 123 34578 US
(123, 87640) User Destination Country
3 123 87640 MX
(345, 76590) User Destination Country
2 345 76590 US
(890, 11111) User Destination Country
4 890 11111 CA
(890, 88888) User Destination Country
5 890 88888 CA
(890, 99999) User Destination Country
6 890 99999 CA
ser = df.groupby('User')['Destination']
for name, group in ser:
    print(name, group)
123 0 34578
1 34578
3 87640
Name: Destination, dtype: int64
345 2 76590
Name: Destination, dtype: int64
890 4 11111
5 88888
6 99999
Name: Destination, dtype: int64
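To cover the "same for countries" part of the question as well, here is a self-contained sketch; the read_csv call and the named-aggregation style (pandas >= 0.25) are my own assumptions about how the file would be loaded, not part of the answers above:
import pandas as pd

# the question's whitespace-separated file with columns User, Destination, Country
df = pd.read_csv('myfile', sep=r'\s+')

# unique destinations and countries per user
per_user = df.groupby('User').agg(
    n_destinations=('Destination', 'nunique'),
    n_countries=('Country', 'nunique'),
)

print(per_user.mean())    # "on average, users visit X destinations and Y countries"
print(per_user.median())  # median unique destinations and countries per user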

Related

Pandas filtering based on minimum data occurrences across multiple columns

I have a dataframe like this
country data_fingerprint organization
US 111 Tesco
UK 222 IBM
US 111 Yahoo
PY 333 Tesco
US 111 Boeing
CN 333 TCS
NE 458 Yahoo
UK 678 Tesco
I want the data_fingerprint values for rows where both the organization and the country are among the top 2 by count.
Looking at organization, the top 2 occurrences are Tesco and Yahoo, and for country they are US and UK.
Based on that, the output for data_fingerprint should be:
data_fingerprint
111
678
What I have tried, to check whether the organization exists in my complete dataframe, is this:
# First find top 2 occurrences of organization
nd = df['organization'].value_counts().groupby(level=0, group_keys=False).head(2)
# Then check if the organization exists in the complete dataframe and filter those rows
new = df["organization"].isin(nd)
But I am not getting any data here. Once I get data for this I can do it along with country.
Can someone please help me get the output? I have little data, so I am using Pandas.
Here is one way to do it:
df[
    df['organization'].isin(df['organization'].value_counts().head(2).index) &
    df['country'].isin(df['country'].value_counts().head(2).index)
]['data_fingerprint'].unique()
array([111, 678], dtype=int64)
Annotated code
# find top 2 most occurring country and organization
i1 = df['country'].value_counts().index[:2]
i2 = df['organization'].value_counts().index[:2]
# Create boolean mask to select the rows having top 2 country and org.
mask = df['country'].isin(i1) & df['organization'].isin(i2)
# filter the rows using the mask and drop dupes in data_fingerprint
df.loc[mask, ['data_fingerprint']].drop_duplicates()
Result
data_fingerprint
0 111
7 678
You can do
# First find top 2 occurrences of organization
nd = df['organization'].value_counts().head(2).index
# Then check if the organization exists in the complete dataframe and filter those rows
new = df["organization"].isin(nd)
Output - Only Tesco and Yahoo left
df[new]
country data_fingerprint organization
0 US 111 Tesco
2 US 111 Yahoo
3 PY 333 Tesco
6 NE 458 Yahoo
7 UK 678 Tesco
You can do the same for country
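For completeness, a sketch of what combining both masks could look like; this mirrors the annotated answer above, so nothing new is assumed beyond the same df:
top_orgs = df['organization'].value_counts().head(2).index
top_countries = df['country'].value_counts().head(2).index

mask = df['organization'].isin(top_orgs) & df['country'].isin(top_countries)
print(df.loc[mask, 'data_fingerprint'].drop_duplicates())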

How to search a substring from one df in another df?

I have read this post and would like to do something similar.
I have 2 dfs:
df1:
file_num  city      address_line
1         Toronto   123 Fake St
2         Montreal  456 Sample Ave
df2:
DB_Num  Address
AB1     Toronto 123 Fake St
AB3     789 Random Drive, Toronto
I want to know which DB_Num values in df2 match address_line and city in df1, and include which file_num the match was from.
My ideal output is:
file_num  city     address_line  DB_Num  Address
1         Toronto  123 Fake St   AB1     Toronto 123 Fake St
Based on the above linked post, I have made a lookahead regex, and am searching using the insert and str.extract methods.
df1['search_field'] = "(?=.*" + df1['city'] + ")(?=.*" + df1['address_line'] + ")"
pat = "|".join(df1['search_field'])
df = df2.insert(0, 'search_field', df2['Address'].str.extract("(" + pat + ')', expand=False))
Since my address in df2 is entered manually, it is sometimes out of order.
Because it is out of order, I am using the lookahead method of regex.
The lookahead method causes str.extract to not output any value, although I can still filter out nulls and it will keep only the correct matches.
My main problem is I have no way to join back to df1 to get the file_num.
I can do this problem with a for loop and iterating each record to search, but it takes too long. df1 is actually around 5000 records, and df2 has millions, so it takes over 2 hours to run. Is there a way to leverage vectorization for this problem?
Thanks!
Start by creating a new series that maps each "Address" in df2 to the "address_line" in df1 it corresponds to, if such a row exists:
r = '({})'.format('|'.join(df1.address_line))
merge_df = df2.Address.str.extract(r, expand=False)
merge_df
#output:
0 123 Fake St
1 NaN
Name: Address, dtype: object
Now we merge our df1 on the "address_line" column, and our df2 on our "merge_df" series:
df1.merge(df2, left_on='address_line', right_on=merge_df)
   file_num  city     address_line  DB_Num  Address
0  1.0       Toronto  123 Fake St   AB1     Toronto 123 Fake St
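A self-contained sketch of this extract-then-merge approach, using the toy frames from the question (the literal DataFrame construction is my own assumption; the extract and merge logic is the answer's):
import pandas as pd

df1 = pd.DataFrame({'file_num': [1, 2],
                    'city': ['Toronto', 'Montreal'],
                    'address_line': ['123 Fake St', '456 Sample Ave']})
df2 = pd.DataFrame({'DB_Num': ['AB1', 'AB3'],
                    'Address': ['Toronto 123 Fake St', '789 Random Drive, Toronto']})

# pull out whichever address_line (if any) appears inside each df2 Address
r = '({})'.format('|'.join(df1.address_line))
merge_df = df2.Address.str.extract(r, expand=False)

# merge df1's address_line against the extracted series to recover file_num
print(df1.merge(df2, left_on='address_line', right_on=merge_df))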

Finding the earliest date after a groupby on a specific column

I have a dataframe that looks like the below.
id name tag location date
1 John 34 FL 01/12/1990
1 Peter 32 NC 01/12/1990
1 Dave 66 SC 11/25/1990
1 Mary 12 CA 03/09/1990
1 Sue 29 NY 07/10/1990
1 Eve 89 MA 06/12/1990
: : : : :
n John 34 FL 01/12/2000
n Peter 32 NC 01/12/2000
n Dave 66 SC 11/25/1999
n Mary 12 CA 03/09/1999
n Sue 29 NY 07/10/1998
n Eve 89 MA 06/12/1997
I need to find the location information based on the id column, but with one condition: only the earliest date. For example, the earliest date for the id=1 group is 01/12/1990, which means the locations are FL and NC. Then apply this to all the different id groups to get the top 3 locations. I have written code to do this:
# Get the earliest date based on id group
df_ear = df.loc[df.groupby('id')['date'].idxmin()]
# Count the occurrences of the location
df_ear['location'].value_counts()
The code works fine, but it cannot return more than 1 location per group (using my first line of code) if rows share the same earliest date; for example, the id=1 group will only return FL instead of FL and NC. I am wondering how I can fix my code to include all rows that tie for the earliest date.
Thanks!
Use GroupBy.transform to get a Series of the minimal date per group, so you can compare it against the date column with boolean indexing:
df['date'] = pd.to_datetime(df['date'])
df_ear = df[df.groupby('id')['date'].transform('min').eq(df['date'])]
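A minimal runnable sketch of this transform approach on a cut-down version of the example data (the DataFrame literal is my own; the filter line is the answer's):
import pandas as pd

df = pd.DataFrame({'id': [1, 1, 1],
                   'location': ['FL', 'NC', 'SC'],
                   'date': ['01/12/1990', '01/12/1990', '11/25/1990']})

df['date'] = pd.to_datetime(df['date'])

# keep every row whose date equals its group's minimum, so ties are included
df_ear = df[df.groupby('id')['date'].transform('min').eq(df['date'])]

# then count locations as in the question
print(df_ear['location'].value_counts().head(3))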

Create a new column using str.contains and, where the condition fails, set it to null (NaN)

I am trying to create a new column in my pandas dataframe, but only with a value if another column contains a certain string.
My dataframe looks something like this:
raw val1 val2
0 Vendor Invoice Numbe Inv Date
1 Vendor: Company Name 1 123 456
2 13445 07708-20-2019 US 432 676
3 79935 19028808-15-2019 US 444 234
4 Vendor: company Name 2 234 234
I am trying to create a new column, vendor, that transforms the dataframe into:
raw val1 val2 vendor
0 Vendor Invoice Numbe Inv Date Vendor Invoice Numbe Inv Date
1 Vendor: Company Name 1 123 456 Vendor: Company Name 1
2 13445 07708-20-2019 US 432 676 NaN
3 79935 19028808-15-2019 US 444 234 NaN
4 Vendor: company Name 2 234 234 company Name 2
5 Vendor: company Name 2 928 528 company Name 2
However, whenever I try,
df['vendor'] = df.loc[df['raw'].str.contains('Vendor', na=False), 'raw']
I get the error
ValueError: cannot reindex from a duplicate axis
I know that at index 4 and 5 it's the same value for the company, but what am I doing wrong and how do I add the new column to my dataframe?
The problem is that df.loc[df['raw'].str.contains('Vendor', na=False), 'raw'] has a different length than df.
You can try np.where, which assigns a new column from an np.array of the same size, so it doesn't need index alignment.
df['vendor'] = np.where(df['raw'].str.contains('Vendor'), df['raw'], np.nan)
You could .extract() the part of the string that comes after Vendor: using a positive lookbehind:
df['vendor'] = df['raw'].str.extract(r'(?<=Vendor:\s)(.*)')
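A quick, self-contained sketch trying both suggestions side by side (the small df literal is my own reconstruction of a few rows from the question's data):
import numpy as np
import pandas as pd

df = pd.DataFrame({'raw': ['Vendor Invoice Numbe Inv Date',
                           'Vendor: Company Name 1',
                           '13445 07708-20-2019 US',
                           'Vendor: company Name 2']})

# np.where: keep the whole raw string where it contains 'Vendor', else NaN
df['vendor'] = np.where(df['raw'].str.contains('Vendor'), df['raw'], np.nan)

# str.extract with a lookbehind: keep only the text after 'Vendor: '
df['vendor_name'] = df['raw'].str.extract(r'(?<=Vendor:\s)(.*)')

print(df)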

Filtering a DataFrame index by another index or a value

I have a multi-indexed dataframe that looks like this:
status value
id country
1234 US Complete 54
2345 US Ongoing 3
UK Complete 343
JP Complete 54
IT Complete 32
3456 CA Ongoing 20
UK Complete 123
FR Complete 245
I'm not sure how to filter the id level of the index by either the presence of a value in the other index level (country) or by one of the column values.
Essentially, it would be great if, say, I wanted all columns for all ids that don't contain "US" and could get something back like this:
status value
id country
3456 CA Ongoing 20
UK Complete 123
FR Complete 245
Or additionally be able to say "Filter out each ID in which at least 1 Status is Ongoing" and get this back:
status value
id country
1234 US Complete 54
Eventually I'd like to be able to combine this but learning how to do each individually is probably a good first step.
Your 2nd question: no Ongoing
sliceidx=~df.index.get_level_values(0).isin(df.loc[df.status=='Ongoing'].index.get_level_values(0))
df[sliceidx]
Out[474]:
status value
id country
1234 US Complete 54
Your 1st question: no US
sliceidx=df.index.get_level_values(0)[df.index.get_level_values(1)=='US']
df[~df.index.get_level_values(0).isin(sliceidx)]
Out[478]:
status value
id country
3456 CA Ongoing 20
UK Complete 123
FR Complete 245
More info: what I usually do is reset_index
df1=df.copy().reset_index()
df[df1.country.ne('US').groupby(df1['id']).transform('all').values]
Out[486]:
status value
id country
3456 CA Ongoing 20
UK Complete 123
FR Complete 245
df[df1.status.ne('Ongoing').groupby(df1['id']).transform('all').values]
Out[487]:
status value
id country
1234 US Complete 54
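A self-contained sketch that rebuilds the example MultiIndex frame and runs both filters from the first answer (the from_tuples construction is my own assumption about how the data is indexed):
import pandas as pd

idx = pd.MultiIndex.from_tuples(
    [(1234, 'US'), (2345, 'US'), (2345, 'UK'), (2345, 'JP'), (2345, 'IT'),
     (3456, 'CA'), (3456, 'UK'), (3456, 'FR')],
    names=['id', 'country'])
df = pd.DataFrame({'status': ['Complete', 'Ongoing', 'Complete', 'Complete',
                              'Complete', 'Ongoing', 'Complete', 'Complete'],
                   'value': [54, 3, 343, 54, 32, 20, 123, 245]}, index=idx)

# 1st question: drop every id that has a 'US' row
us_ids = df.index.get_level_values(0)[df.index.get_level_values(1) == 'US']
print(df[~df.index.get_level_values(0).isin(us_ids)])

# 2nd question: drop every id that has at least one 'Ongoing' row
ongoing_ids = df.loc[df.status == 'Ongoing'].index.get_level_values(0)
print(df[~df.index.get_level_values(0).isin(ongoing_ids)])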
