Closed. This question needs to be more focused. It is not currently accepting answers.
Want to improve this question? Update the question so it focuses on one problem only by editing this post.
Closed 4 years ago.
I want to group by and sum on a dataframe. The standard groupby function only groups rows whose strings are exactly the same, but I need this to be done on similar strings. For example:
United States | 10
Germnay | 23
Unaited Staetes | 20
Germany | 21
Germanny | 32
Uniited Staites | 30
This should result in:
United States 60
Germnay 76
The order of the names is not that important. The sums of the values are.
Thanks a lot :)
EDIT:
Perhaps it would be simpler to create an ID column that gives the same id for similar countries. I can just groupby on that then.
This is not a real solution, but a hack that might help if you are doing something quick and dirty:
lowercase the country names
remove vowels from the country names
collapse consecutive repeats of the same consonant
After you transform the data this way, you can use a normal groupby and it should work pretty well.
I suggest this since your data appears to be country names entered by users.
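The three steps above can be sketched as follows. This is only an illustration on the sample data from the question; the column names `country` and `value` and the helper `crude_key` are assumptions, and the key is deliberately crude (it can merge unrelated names that happen to share consonants).

```python
import re
import pandas as pd

# Sample data from the question; the column names are assumed.
df = pd.DataFrame({
    "country": ["United States", "Germnay", "Unaited Staetes",
                "Germany", "Germanny", "Uniited Staites"],
    "value": [10, 23, 20, 21, 32, 30],
})

def crude_key(name: str) -> str:
    """Hypothetical helper: lowercase, drop vowels, collapse repeated letters."""
    key = name.lower()
    key = re.sub(r"[aeiou]", "", key)        # remove vowels
    key = re.sub(r"([a-z])\1+", r"\1", key)  # "tt" -> "t", "nn" -> "n"
    return key

df["key"] = df["country"].map(crude_key)

# Group on the crude key, keep the first spelling seen as the label.
totals = df.groupby("key", sort=False).agg(
    country=("country", "first"),
    value=("value", "sum"),
)
print(totals)
```

The `key` column also plays the role of the ID column mentioned in the question's edit.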
Another idea:
preprocessing step:
use a spelling corrector trained on country names to guess the country name from the wrong spelling (https://norvig.com/spell-correct.html)
transform each row of the data using that.
then use groupby to group.
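As a rough stand-in for a trained spelling corrector, the same preprocessing idea can be sketched with the standard library's `difflib.get_close_matches`. This is not Norvig's corrector; it simply picks the closest entry from a reference list. The list `KNOWN_COUNTRIES`, the helper `closest_country`, the column names, and the cutoff of 0.6 are all assumptions for illustration.

```python
import difflib
import pandas as pd

# Hypothetical reference list of correctly spelled names.
KNOWN_COUNTRIES = ["United States", "Germany"]

def closest_country(name, known=KNOWN_COUNTRIES, cutoff=0.6):
    """Return the most similar known name, or the input if nothing is close."""
    matches = difflib.get_close_matches(name, known, n=1, cutoff=cutoff)
    return matches[0] if matches else name

df = pd.DataFrame({
    "country": ["United States", "Germnay", "Unaited Staetes",
                "Germany", "Germanny", "Uniited Staites"],
    "value": [10, 23, 20, 21, 32, 30],
})

# Correct each row first, then group on the corrected name.
df["country_clean"] = df["country"].map(closest_country)
totals = df.groupby("country_clean")["value"].sum()
print(totals)
```

Names that match nothing in the reference list are left unchanged, so they end up in their own group rather than being silently merged.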
Closed. This question needs to be more focused. It is not currently accepting answers.
Closed 1 year ago.
I need to create a dataframe which lists all patients and their matching doctors.
I have a txt file with doctor/patient records organized in the following format:
Doctor_1: patient23423,patient292837,patient1232423...
Doctor_2: patient456785,patient25363,patient23425665...
And a list of all unique patients.
To do this, I imported the txt file into a doctorsDF dataframe, split on the colon. I also created a patientsDF dataframe with 2 columns: 'Patient', filled from the patient list, and 'Doctor', left empty.
I then ran the following:
for pat in patientsDF['Patient']:
    for i, doc in enumerate(doctorsDF[1]):
        if doctorsDF[1][i].find(str(pat)) >= 0 :
            patientsDF['Doctor'][i] = doctorsDF.loc[i,0]
        else:
            continue
This worked fine, and now all patients are matched with the doctors, but the method seems clumsy. Is there any function that can more cleanly achieve the result? Thanks!
(First StackOverflow post here. Sorry if this is a newb question!)
If you use Pandas, try:
import pandas as pd
# one row per doctor, then one row per (doctor, patient) pair
df = pd.read_csv('data.txt', sep=':', header=None, names=['Doctor', 'Patient'])
df = df[['Doctor']].join(df['Patient'].str.strip().str.split(',')
                                      .explode()).reset_index(drop=True)
Output:
>>> df
     Doctor          Patient
0  Doctor_1     patient23423
1  Doctor_1    patient292837
2  Doctor_1   patient1232423
3  Doctor_2    patient456785
4  Doctor_2     patient25363
5  Doctor_2  patient23425665
How to search:
>>> df.loc[df['Patient'] == 'patient25363', 'Doctor'].squeeze()
'Doctor_2'
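To fill the 'Doctor' column for a whole list of patients at once, the exploded table can be merged onto the patient list. A minimal sketch, assuming the file layout from the question; the in-memory text, the variable names `long_df`, `patients`, and `matched`, and the patient `patient999` (added to show an unmatched case) are made up for illustration.

```python
import io
import pandas as pd

# Stand-in for the txt file, in the format described in the question.
raw = io.StringIO(
    "Doctor_1: patient23423,patient292837\n"
    "Doctor_2: patient456785,patient25363\n"
)

doctors = pd.read_csv(raw, sep=":", header=None, names=["Doctor", "Patient"])

# One row per (doctor, patient) pair.
long_df = (
    doctors.assign(Patient=doctors["Patient"].str.strip().str.split(","))
           .explode("Patient")
           .reset_index(drop=True)
)

# Hypothetical list of unique patients; patient999 has no doctor.
patients = pd.DataFrame({"Patient": ["patient25363", "patient23423", "patient999"]})

# Left merge keeps every patient and leaves NaN where no doctor is found.
matched = patients.merge(long_df, on="Patient", how="left")
print(matched)
```

Unlike a substring search, the merge compares whole patient IDs, so `patient234` would not be matched to `patient23423`.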
Closed. This question needs to be more focused. It is not currently accepting answers.
Closed 1 year ago.
I am working on a dataframe. The data is in the image.
Q. I want the number of shows released per year, but when I apply the count() function, it gives me 6 instead of 3. Could anyone suggest how to get the correct count?
To count the unique shows for a single year, you can use
count = len(df.loc[df['release_year'] == 1945, 'show_id'].unique())
# or
count = df.loc[df['release_year'] == 1945, 'show_id'].nunique()
To summarize the unique shows in the dataframe by year, you can call drop_duplicates() on the show_id column first.
df.drop_duplicates(subset=['show_id']).groupby('release_year').count()
Or use value_counts() on column after dropping duplicates.
df.drop_duplicates(subset=['show_id'])['release_year'].value_counts()
df.groupby('release_year')['show_id'].nunique()
should do the job.
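The difference between counting rows and counting distinct shows can be seen on a small made-up frame. The data below is invented (the original image is not available); only the column names `show_id` and `release_year` come from the answers above.

```python
import pandas as pd

# Invented data: three shows from 1945, each listed twice, plus one from 1946.
df = pd.DataFrame({
    "show_id": ["s1", "s1", "s2", "s2", "s3", "s3", "s4"],
    "release_year": [1945, 1945, 1945, 1945, 1945, 1945, 1946],
})

# count() counts rows, so duplicated shows are counted repeatedly.
rows_per_year = df.groupby("release_year")["show_id"].count()

# nunique() counts distinct show_id values per year.
shows_per_year = df.groupby("release_year")["show_id"].nunique()

print(rows_per_year)
print(shows_per_year)
```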
Closed. This question needs to be more focused. It is not currently accepting answers.
Closed 1 year ago.
I wanted to know if there's a way for me to remove an entire row of a dataframe given a condition (an outlier) in one column?
Df
Name    Value
Tree    300
Orange  50
Apple   75
Mango   60
Cherry  1
In this case I would want to remove Cherry and Tree since they're outliers.
This is quite a subjective question. There is no single definitive way to do it; in fact, there are multiple ways. Use descriptive stats to build the logic and then code it.
In this case I define the upper and lower limits and then filter using Series.where() followed by dropna(). Note that between() includes both endpoints, so with these limits Cherry (value 1) is kept; raise lower_limit to drop it as well. Code below:
lower_limit=1
upper_limit=100
df=df.assign(Value=df['Value'].where(df['Value'].between(lower_limit,upper_limit))).dropna()
     Name  Value
1  Orange   50.0
2   Apple   75.0
3   Mango   60.0
4  Cherry    1.0
df[df.apply(lambda row: row['Value'] > lower_limit and row['Value'] < upper_limit, axis=1)]
apply() returns a boolean Series that is True where the row's df['Value'] lies strictly between the limits (that is, where it is not an outlier). df['Value'].between(lower_limit, upper_limit, inclusive='neither') has the same effect; by default between() also includes the endpoints.
A dataframe accepts boolean indexing. The effect is to drop the rows where the mask is False and keep the rows where it is True.
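One way to derive the limits from descriptive stats instead of hard-coding them is the common 1.5 × IQR rule. This is just one convention among several, not something the answers above prescribe; the sketch below applies it to the sample data from the question.

```python
import pandas as pd

df = pd.DataFrame({
    "Name": ["Tree", "Orange", "Apple", "Mango", "Cherry"],
    "Value": [300, 50, 75, 60, 1],
})

# Quartiles and interquartile range of the Value column.
q1 = df["Value"].quantile(0.25)
q3 = df["Value"].quantile(0.75)
iqr = q3 - q1

# Assumed rule: anything beyond 1.5 * IQR from the quartiles is an outlier.
lower_limit = q1 - 1.5 * iqr
upper_limit = q3 + 1.5 * iqr

# Boolean indexing keeps only the rows inside the limits.
filtered = df[df["Value"].between(lower_limit, upper_limit)]
print(filtered)
```

On this data the limits come out as 12.5 and 112.5, so both Tree and Cherry are removed.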
Closed. This question needs to be more focused. It is not currently accepting answers.
Closed 2 years ago.
The data looks like this and is in a data frame.
   PatientId  Payor
0  PAT10000   [Cash, Britam]
1  PAT10001   [Madison, Cash]
2  PAT10002   [Cash]
3  PAT10003   [Cash, Madison, Resolution]
4  PAT10004   [CIC Corporate, Cash]
I want to remove the square brackets and filter all patients who used a certain mode of payment at least once, e.g. Madison, then obtain their IDs. Please help.
This will generate a list of tuples (id, payor). (df is the dataframe)
payment = 'Madison'
ids = [(id, df.Payor[i][1:-1]) for i, id in enumerate(df.PatientId) if payment in df.Payor[i]]
Let's say your data frame variable is named "df" and, after removing the square brackets, you want to keep all rows containing "Madison" in the "Payor" column:
# assumes the Payor values are strings such as "[Cash, Britam]"
# the brackets must be escaped in the regex, and the result must be assigned back
df['Payor'] = df['Payor'].str.replace(r'[\[\]]', '', regex=True)
filteredDf = df.loc[df['Payor'].str.contains("Madison")]
print(filteredDf)
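If the "Payor" column holds actual Python lists rather than strings (the question does not say which), the brackets are only part of how the lists are printed. A minimal sketch under that assumption, using the data from the question:

```python
import pandas as pd

# Assumption: each Payor cell is a Python list, not a string.
df = pd.DataFrame({
    "PatientId": ["PAT10000", "PAT10001", "PAT10002", "PAT10003", "PAT10004"],
    "Payor": [
        ["Cash", "Britam"],
        ["Madison", "Cash"],
        ["Cash"],
        ["Cash", "Madison", "Resolution"],
        ["CIC Corporate", "Cash"],
    ],
})

payment = "Madison"

# True for rows whose list contains the payment mode.
mask = df["Payor"].apply(lambda payors: payment in payors)
ids = df.loc[mask, "PatientId"].tolist()

# Joining each list into one string removes the brackets from the display.
df["Payor"] = df["Payor"].str.join(", ")

print(ids)
print(df)
```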
Closed. This question needs debugging details. It is not currently accepting answers.
Edit the question to include desired behavior, a specific problem or error, and the shortest code necessary to reproduce the problem. This will help others answer the question.
Closed 3 years ago.
Pandas: I have a dataframe, shown below, which contains the same set of banks twice. I need to slice the data from the 0th index, which contains a bank name, up to the index where the same bank name appears again (here, DEUTSCH BANK AG). I need to apply the same logic to any dataframe of this kind. Thank you.
I tried this logic: df25.iloc[0,1]==df25[1].any(), but it only returns True, not the index position.
DataFrame: [1]: https://i.stack.imgur.com/iJ1hJ.png, https://i.stack.imgur.com/J2aDX.png
You need to get the indices of all the rows that have the value you are looking for (in this case the bank name) and slice the data frame using those indices.
Example:
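import pandas as pd
df = pd.DataFrame({'Col1':list('abcdeafgbfhi')})
search_str = 'b'
# index labels of every row equal to the search value
idx_list = list(df[(df['Col1']==search_str)].index.values)
# rows from the first occurrence up to (not including) the second
print(df[idx_list[0]:idx_list[1]])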
Output:
  Col1
1    b
2    c
3    d
4    e
5    a
6    f
7    g
Note that this assumes there are only 2 rows with the same value. If there are more than 2, you have to pick the entries you need from the index list. Hope this helps.
Keep in mind that posting a sample data set will always help you get more answers: people tend to move on to another question when they see images or screenshots, because reproducing the issue then takes additional steps.
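For the case described in the question, where the slice should start at row 0 and stop just before the first row's value appears again, the same idea can be wrapped in a small function. The name `slice_until_repeat` and the toy column are hypothetical; the real frame from the screenshots is not available.

```python
import pandas as pd

def slice_until_repeat(df: pd.DataFrame, col) -> pd.DataFrame:
    """Return rows from position 0 up to (not including) the first
    later row whose value in `col` equals the value in row 0."""
    first_value = df[col].iloc[0]
    # Positions (relative to row 1) where the first value shows up again.
    repeats = (df[col].iloc[1:] == first_value).to_numpy().nonzero()[0]
    if len(repeats) == 0:
        return df  # the value never repeats: keep everything
    return df.iloc[: repeats[0] + 1]

df = pd.DataFrame({"Col1": list("abcdeafgbfhi")})
print(slice_until_repeat(df, "Col1"))
```

Because it uses positions (`iloc`) rather than index labels, this does not depend on the frame having a default integer index.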