I am trying to use two datasets to do a count, and afterwards I want to remove the duplicates from Col 1 while keeping the number of calls.
Basically I have a df like this:
Client Number  Call Count
Bob            3
Bob            3
John           1
Bob            3
So what happens is the duplicates get removed, but the call count also changes and turns to 1. How do I stop this from occurring?
Any help would be appreciated.
# Count the number of times an account number comes up
CallCount['Call Count'] = CallCount.groupby('Client Number').cumcount() + 1

# Remove the duplicates
df2 = df1.drop_duplicates(subset=["Client Number"], keep=False)
I have tried these, but it's the same outcome.
import pandas as pd
df1 = pd.DataFrame({'c': ['Bob', 'Bob', 'John', 'Bob'], 'n': [3, 3, 1, 0]})
df1 = df1.groupby(['c'], as_index = False).count()
df1
      c  n
0   Bob  3
1  John  1
This is what you are trying to achieve, right?
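For reference, here is a minimal sketch of another way to get there, assuming the column names from the question: count the rows per client with transform, then drop the duplicate rows while keeping the first occurrence.

import pandas as pd

df = pd.DataFrame({'Client Number': ['Bob', 'Bob', 'John', 'Bob']})

# Broadcast the per-client row count back onto every row...
df['Call Count'] = df.groupby('Client Number')['Client Number'].transform('size')

# ...then keep one row per client; the count is preserved
df = df.drop_duplicates(subset=['Client Number'], keep='first')
print(df)
#   Client Number  Call Count
# 0           Bob           3
# 2          John           1

Note that keep=False (as in the question) removes every client that appears more than once, so keep='first' is usually what you want here.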
I have two dataframes.
DF1
DF2
I want to add a column 'Speed' to DF1 that references the Track category and the LocationFrom/LocationTo range, to produce the result below.
I have looked at merge_asof and IntervalIndex, but I am unable to figure out how to reference the category before the range.
Thanks.
Check the code below (SQLite):
import pandas as pd
import sqlite3
conn = sqlite3.connect(':memory:')
DF1.to_sql('DF1', con=conn, index=False)
DF2.to_sql('DF2', con=conn, index=False)
pd.read_sql("""SELECT DF1.*, DF2.Speed
               FROM DF1
               JOIN DF2 ON DF1.Track = DF2.Track
                       AND DF1.Location BETWEEN DF2.LocationFrom AND DF2.LocationTo""", con=conn)
Output:
As hinted in your question, this is a perfect use case for merge_asof:
pd.merge_asof(df1, df2, by='Track',
              left_on='Location', right_on='LocationTo',
              direction='forward'
              )  # .drop(columns=['LocationFrom', 'LocationTo'])
Output:
  Track  Location  LocationFrom  LocationTo  Speed
0     A         1             0           5     45
1     A         2             0           5     45
2     A         6             5          10     50
3     B        24            20          50    100
NB. uncomment the drop to remove the extra columns.
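One caveat worth noting (a general requirement of merge_asof, not specific to this data): both frames must be sorted on the merge keys, so if your real data is not already ordered you would sort first, for example:

# merge_asof requires the left_on/right_on columns to be sorted
df1 = df1.sort_values('Location')
df2 = df2.sort_values('LocationTo')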
It works, but I would like to see someone do this without a for loop and without creating mini dataframes.
import pandas as pd
data1 = {'Track': list('AAAB'), 'Location': [1, 2, 6, 24]}
df1 = pd.DataFrame(data1)
data2 = {'Track': list('AABB'), 'LocationFrom': [0, 5, 0, 20], 'LocationTo': [5, 10, 20, 50], 'Speed': [45, 50, 80, 100]}
df2 = pd.DataFrame(data2)
speeds = []
for k in range(len(df1)):
    track = df1['Track'].iloc[k]
    location = df1['Location'].iloc[k]
    df1_track = df1.loc[df1['Track'] == track]
    df2_track = df2.loc[df2['Track'] == track]
    speeds.append(df2_track['Speed'].loc[(df2_track['LocationFrom'] <= location)
                                         & (location < df2_track['LocationTo'])].iloc[0])
df1['Speed'] = speeds
print(df1)
Output:
  Track  Location  Speed
0     A         1     45
1     A         2     45
2     A         6     50
3     B        24    100
This approach is probably not viable if your tables are large. It creates an intermediate table that merges all pairs of matching Tracks between df1 and df2, then removes the rows where the location is not between the boundaries. Thanks @Aeronatix for the dfs.
The all_merge intermediate table gets really big really fast: if a1 rows of df1 are Track A, a2 rows of df2 are Track A, and so on, then the total number of rows in all_merge will be a1*a2 + b1*b2 + ... + z1*z2, which may or may not be gigantic depending on your dataset.
all_merge = df1.merge(df2)
results = all_merge[all_merge.Location.between(all_merge.LocationFrom,all_merge.LocationTo)]
print(results)
Say I have a simple dataframe with the names of people, and I perform a groupby on name:
import pandas as pd
df = pd.DataFrame({'col1' : [1,2,3,4,5,6,7], 'name': ['George', 'John', 'Tim', 'Joe', 'Issac', 'George', 'Tim'] })
df1 = df.groupby('name')
Question: How can I select a table of specific names from a list that contains a string prefix of the names, either 2 or 3 characters long?
e.g. say I have the following list, where both Tim and Geo are the first 3 characters of some entries in the name column, and Jo is the first 2 characters of certain entries in the name column.
list = ['Jo', 'Tim', 'Geo']
Attempted: My initial thought was to create new columns in the original dataframe that were either a 2- or 3-character subset of the name column and then try grouping by that; however, since 2- and 3-character prefixes are different, the grouping wouldn't output the correct result.
I am not sure whether it would be better to use some if condition, such as if v in list is len(2): groupby(2char) else groupby(3char), and output the result as one dataframe.
list
df1['name_2char_subset'] = df1['name'].str[0:2]
df1['name_3char_subset'] = df1['name'].str[0:3]
if v in list is len(2):
    df2 = df1.groupby('name_2char_subset')
else:
    df2 = df1.groupby('name_3char_subset')
Desired output: since there are 2 counts of each of Jo, Geo and Tim, the output should group each case, i.e. for Jo there are both John and Joe, hence a count of 2 in the groupby.
df3 = pd.DataFrame({'name': ['Jo', 'Tim', 'Geo'], 'col1': [2,2,2]})
How could we group by name and output the entries in name that start with the given initial characters from the list?
Any alternative ways of doing this would be helpful, for example extracting the values from the list after the groupby has been performed.
First, don't use list as a variable name, because it shadows a Python built-in. Then use Series.str.extract to test for matches at the start of the strings (with ^) and count with Series.value_counts:
L = ['Jo', 'Tim', 'Geo']
pat = '|'.join(r"^{}".format(x) for x in L)
df = (df['name'].str.extract('(' + pat + ')', expand=False)
        .dropna()
        .value_counts()
        .reindex(L, fill_value=0)
        .rename_axis('name')
        .reset_index(name='col1'))
print (df)
name col1
0 Jo 2
1 Tim 2
2 Geo 2
Your solution:
L = ['Jo', 'Tim', 'Geo']
s1 = df['name'].str[:2]
s2 = df['name'].str[:3]
df = (s1.where(s1.isin(L)).fillna(s2.where(s2.isin(L)))
        .dropna()
        .value_counts()
        .reindex(L, fill_value=0)
        .rename_axis('name')
        .reset_index(name='col1'))
print (df)
name col1
0 Jo 2
1 Tim 2
2 Geo 2
Solution from a deleted answer, changed to use Series.str.startswith to test whether each string starts with a value from the list:
L = ['Jo', 'Tim', 'Geo']
df3 = pd.DataFrame({'name': L})
df3['col1'] = df3['name'].apply(lambda x: sum(df['name'].str.startswith(x)))
print (df3)
name col1
0 Jo 2
1 Tim 2
2 Geo 2
EDIT: If you need to group by more columns, use the first or second solution, assign the extracted names back, and aggregate with named aggregation in GroupBy.agg:
df = pd.DataFrame({'age': [1,2,3,4,5,6,7],
                   'name': ['George', 'John', 'Tim', 'Joe', 'Issac', 'George', 'Tim']})
print (df)
L = ['Jo', 'Tim', 'Geo']
pat = '|'.join(r"^{}".format(x) for x in L)
df['name'] = df['name'].str.extract('('+ pat + ')', expand=False)
df = df.groupby('name').agg(sum_age=('age','sum'), col1=('name', 'count'))
print (df)
sum_age col1
name
Geo 7 2
Jo 6 2
Tim 10 2
I'm new to pandas and trying to figure out how to put the values of two different variables in the same column.
import pandas as pd
import requests
from bs4 import BeautifulSoup

itemproducts = pd.DataFrame()
url = 'https://www.trwaftermarket.com/en/catalogue/product/BCH720/'
r = requests.get(url)
soup = BeautifulSoup(r.content, 'lxml')
code_name = soup.find_all('div', {'class': 'col-sm-6 intro-section reset-margin'})
for head in code_name:
    item_code = head.find('span', {'class': 'heading'}).text
    item_name = head.find('span', {'class': 'subheading'}).text
    # tab_4 is not defined in this snippet; presumably it holds the table rows parsed elsewhere in the original script
    for tab_ in tab_4:
        ab = tab_.find_all('td')
        make_name1 = ab[0].text.replace('Make', '')
        code1 = ab[1].text.replace('OE Number', '')
        make_name2 = ab[2].text.replace('Make', '')
        code2 = ab[3].text.replace('OE Number', '')
        itemproducts = itemproducts.append({'CODE': item_code,
                                            'NAME': item_name,
                                            'MAKE': [make_name1, make_name2],
                                            'OE NUMBER': [code1, code2]}, ignore_index=True)
OUTPUT (Excel image)
What I actually want:
In pandas, all the columns must have the same length. So, in this case, I suggest that you make each column or row a fixed-length list; for those that have one member less, append a NaN to match.
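For that first idea (padding the shorter lists with NaN so every column ends up the same length), a minimal hypothetical sketch could look like this (the values are just the ones from your output):

import numpy as np
import pandas as pd

# Hypothetical lists of unequal length taken from the scraped row
columns = {'MAKE': ['HONDA'], 'OE NUMBER': ['43019-SAA-J51', '43019-SAA-J50']}

# Pad every list with NaN up to the length of the longest one
width = max(len(v) for v in columns.values())
padded = {k: v + [np.nan] * (width - len(v)) for k, v in columns.items()}
print(pd.DataFrame(padded))
#     MAKE      OE NUMBER
# 0  HONDA  43019-SAA-J51
# 1    NaN  43019-SAA-J50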
I found a similar question here on Stack Overflow that can help you. Another approach is to use the explode function from the pandas DataFrame.
Below is an example from the pandas documentation.
>>> df = pd.DataFrame({'A': [[1, 2, 3], 'foo', [], [3, 4]], 'B': 1})
>>> df
A B
0 [1, 2, 3] 1
1 foo 1
2 [] 1
3 [3, 4] 1
>>> df.explode('A')
A B
0 1 1
0 2 1
0 3 1
1 foo 1
2 NaN 1
3 3 1
3 4 1
I couldn't reproduce the results from your script. However, based on your end dataframe, perhaps you can make use of explode together with apply on the dataframe at the end:
#creating your dataframe
itemproducts = pd.DataFrame({'CODE':'BCH720','MAKE':[['HONDA','HONDA']],'NAME':['Brake Caliper'],'OE NUMBER':[['43019-SAA-J51','43019-SAA-J50']]})
>>> itemproducts
CODE MAKE NAME OE NUMBER
0 BCH720 ['HONDA', 'HONDA'] Brake Caliper ['43019-SAA-J51', '43019-SAA-J50']
#using apply method with explode on 'MAKE' and 'OE NUMBER'
>>> itemproducts.apply(lambda x: x.explode() if x.name in ['MAKE', 'OE NUMBER'] else x)
CODE MAKE NAME OE NUMBER
0 BCH720 HONDA Brake Caliper 43019-SAA-J51
0 BCH720 HONDA Brake Caliper 43019-SAA-J50
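As a side note, on newer pandas versions (1.3+, if I remember correctly) explode accepts a list of columns, so both can be exploded in a single call without apply:

# Explode both list columns at once (requires the per-row lists to have equal lengths)
itemproducts.explode(['MAKE', 'OE NUMBER'])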
I have a dataframe with a list in one column and want to match all items in this list with a second dataframe. The matched values should then be added (as a list) to a new column in the first dataframe.
data = {'froots': [['apple','banana'], ['apple','strawberry']]
}
df1 = pd.DataFrame(data)
data = {'froot': ['apple','banana','strawberry'],
'age': [2,3,5]
}
df2 = pd.DataFrame(data)
DF1
index froots
1 ['apple','banana']
2 ['apple','strawberry']
DF2
index fruit age
1 apple 2
2 banana 3
3 strawberry 5
New DF1
index froots age
1 ['apple','banana'] [2,3]
2 ['apple','strawberry'] [2,5]
I have a simple solution that takes way too long:
age = list()
for index, row in df1.iterrows():
    numbers = row.froots
    tmp = df2[['froot', 'age']].apply(lambda x: x['age'] if x['froot'] in numbers else None, axis=1).dropna().tolist()
    age.append(tmp)
df1['age'] = age
Is there maybe a faster solution to this problem?
Thanks in Advance!
Use a list comprehension with a dictionary created from df2, and add a value to the list only if it exists in the dictionary (tested with if):
d = df2.set_index('froot')['age'].to_dict()
df1['age'] = df1['froots'].apply(lambda x: [d[y] for y in x if y in d])
print (df1)
                froots     age
0      [apple, banana]  [2, 3]
1  [apple, strawberry]  [2, 5]
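Another sketch, if you prefer to stay with DataFrame operations instead of a dict lookup (assuming the df1/df2 from the question): explode the lists, merge the ages in, and collect them back per original row.

# Flatten the lists, look up ages via a merge, then re-collect one list per original row
exploded = df1['froots'].explode().rename('froot').reset_index()
merged = exploded.merge(df2, on='froot', how='left')
df1['age'] = merged.groupby('index')['age'].agg(list)
print(df1)
#                 froots     age
# 0      [apple, banana]  [2, 3]
# 1  [apple, strawberry]  [2, 5]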
I have a pandas dataframe following the form in the example below:
data = {'id': [1,1,1,1,2,2,2,2,3,3,3], 'a': [-1,1,1,0,0,0,-1,1,-1,0,0], 'b': [1,0,0,-1,0,1,1,-1,-1,1,0]}
df = pd.DataFrame(data)
Now, what I want to do is create a pivot table such that for each of the columns except the id, I will have 3 new columns corresponding to the values. That is, for column a, I will create a_neg, a_zero and a_pos. Similarly, for b, I will create b_neg, b_zero and b_pos. The values for these new columns would correspond to the number of times those values appear in the original a and b column. The final dataframe should look like this:
result = {'id': [1,2,3], 'a_neg': [1, 1, 1],
'a_zero': [1, 2, 2], 'a_pos': [2, 1, 0],
'b_neg': [1, 1, 1], 'b_zero': [2,1,1], 'b_pos': [1,2,1]}
df_result = pd.DataFrame(result)
Now, to do this, I can do the following steps and arrive at my final answer:
by_a = df.groupby(['id', 'a']).count().reset_index().pivot(index='id', columns='a', values='b').fillna(0).astype(int)
by_a.columns = ['a_neg', 'a_zero', 'a_pos']
by_b = df.groupby(['id', 'b']).count().reset_index().pivot(index='id', columns='b', values='a').fillna(0).astype(int)
by_b.columns = ['b_neg', 'b_zero', 'b_pos']
df_result = by_a.join(by_b).reset_index()
However, I believe that this method is not optimal, especially if I have a lot of original columns aside from a and b. Is there a shorter and/or more efficient solution for getting what I want to achieve here? Thanks.
A shorter solution, though still quite inefficient:
In [11]: df1 = df.set_index("id")
In [12]: g = df1.groupby(level=0)
In [13]: g.apply(lambda x: x.apply(lambda x: x.value_counts())).fillna(0).astype(int).unstack(1)
Out[13]:
    a         b
   -1  0  1  -1  0  1
id
1   1  1  2   1  2  1
2   1  2  1   1  1  2
3   1  2  0   1  1  1
Note: I think you should be aiming for the multi-index columns.
I'm reasonably sure I've seen a trick to remove the apply/value_count/fillna with something cleaner and more efficient, but at the moment it eludes me...
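One option along those lines (a sketch, assuming the df defined in the question) is to melt the frame into long format and let pd.crosstab do the counting, then flatten the column labels:

# Long format: one row per (id, original column, value) observation
melted = df.melt(id_vars='id')

# Count occurrences of each value per id and per original column
counts = pd.crosstab(melted['id'], [melted['variable'], melted['value']])

# Flatten the MultiIndex columns into names like a_neg, a_zero, a_pos
labels = {-1: 'neg', 0: 'zero', 1: 'pos'}
counts.columns = [f'{col}_{labels[val]}' for col, val in counts.columns]
print(counts.reset_index())
#    id  a_neg  a_zero  a_pos  b_neg  b_zero  b_pos
# 0   1      1       1      2      1       2      1
# 1   2      1       2      1      1       1      2
# 2   3      1       2      0      1       1      1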