Pandas: I have a dataframe (shown below) that contains the same set of banks twice. I need to slice the data from index 0, which contains a bank name, up to the index that contains the same bank name again (here, DEUTSCH BANK AG). I need to apply the same logic to any dataframe of this kind. Thanks.
I tried this logic: df25.iloc[0,1] == df25[1].any(), but it only returns True, not the index position.
DataFrame screenshots: https://i.stack.imgur.com/iJ1hJ.png, https://i.stack.imgur.com/J2aDX.png
You need to get the indices of all the rows that contain the value you are looking for (in this case the bank name), and then slice the dataframe using those indices.
Example:
import pandas as pd

df = pd.DataFrame({'Col1': list('abcdeafgbfhi')})
search_str = 'b'

# Positions of every row whose value matches the search string
idx_list = list(df[df['Col1'] == search_str].index.values)

# Slice from the first occurrence up to (but not including) the second
print(df[idx_list[0]:idx_list[1]])
Output:
Col1
1 b
2 c
3 d
4 e
5 a
6 f
7 g
Note that this assumes there are only two rows with the same value. If there are more than two, you will have to work with the index list values to get what you need; a sketch of that case follows below. Hope this helps.
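For example, a minimal sketch of the more-than-two case, pairing each occurrence with the next one (the data here is made up for illustration):

import pandas as pd

df = pd.DataFrame({'Col1': list('abcabcabc')})
search_str = 'a'

# All positions where the value occurs: [0, 3, 6]
idx_list = list(df[df['Col1'] == search_str].index.values)

# One slice per pair of consecutive occurrences
for start, end in zip(idx_list, idx_list[1:]):
    print(df[start:end])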
Keep in mind that posting a sample data set will always help you get more answers: people tend to move on to another question when they see images or screenshots, because reproducing the issue then involves extra steps.
import pandas as pd

ms_dict = {'Mz': [200.035234, 200.035523, 200.035125, 200.042546],
           'Type': ['Blank', 'Sample', 'Sample', 'Sample'],
           'Well': ['E17', 'E18', 'A04', 'H12'],
           'Intensity': [21655, 56415, 56456, 546454]}
df = pd.DataFrame.from_dict(ms_dict)
df
           Mz    Type Well  Intensity
0  200.035234   Blank  E17      21655
1  200.035523  Sample  E18      56415
2  200.035125  Sample  A04      56456
3  200.042546  Sample  H12     546454
In our data, we often have row features that are within a given tolerance of each other but do not collapse into a single bucket under pre-determined limits such as rounding (2 decimal places instead of 5). In the example provided, we want to group the first three rows into a single bucket, with the values in the Mz and Intensity columns combined in some "meaningful" way (summed or averaged), but not the fourth row. We also want to show the distribution of the new bucket across the Well positions in which it is found.
The tolerance is determined by the ppm difference (ppm_diff) between the values of two rows; rows within the same tolerance window should be grouped:
ppm_diff = ((value1 - value2) / value1) * 1e6
Is there a way in python or pandas to group these rows based on the difference within the single column?
We have tried simple rounding, but that splits values inappropriately, making the distribution matrix of individual features inaccurate (edge effects of the rounding: 200.0355 and 200.0354 are well within a tolerance of 2 ppm, yet end up in different buckets). Coarser grouping, rounding to fewer decimal places, can inappropriately merge features that should remain distinct.
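A minimal sketch of one possible approach, assuming a 2 ppm tolerance (the threshold and the aggregation choices below are assumptions, not requirements from the question): sort by Mz, take the ppm difference between consecutive rows, and start a new bucket whenever the gap exceeds the tolerance.

import pandas as pd

ms_dict = {'Mz': [200.035234, 200.035523, 200.035125, 200.042546],
           'Type': ['Blank', 'Sample', 'Sample', 'Sample'],
           'Well': ['E17', 'E18', 'A04', 'H12'],
           'Intensity': [21655, 56415, 56456, 546454]}
df = pd.DataFrame.from_dict(ms_dict)

tolerance_ppm = 2.0  # assumed tolerance window

# Sort so that near-identical masses end up adjacent
df = df.sort_values('Mz').reset_index(drop=True)

# ppm difference between each row and the previous one
ppm_diff = df['Mz'].diff() / df['Mz'] * 1e6

# A gap larger than the tolerance starts a new bucket
df['bucket'] = (ppm_diff > tolerance_ppm).cumsum()

print(df.groupby('bucket').agg(
    Mz=('Mz', 'mean'),               # averaged, one "meaningful" option
    Intensity=('Intensity', 'sum'),  # summed, the other option mentioned
    Wells=('Well', ','.join)))

One caveat of chaining consecutive differences this way: a long run of rows can be merged even when the two ends of the run are more than the tolerance apart, so whether that is acceptable depends on the data.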
I need to create a dataframe which lists all patients and their matching doctors.
I have a txt file with doctor/patient records organized in the following format:
Doctor_1: patient23423,patient292837,patient1232423...
Doctor_2: patient456785,patient25363,patient23425665...
And a list of all unique patients.
To do this, I imported the txt file into a doctorsDF dataframe, separated by a colon. I also created a patientsDF dataframe with two columns: 'Patient', filled from the patient list, and an empty 'Doctor' column.
I then ran the following:
for pat in patientsDF['Patient']:
    for i, doc in enumerate(doctorsDF[1]):
        if doctorsDF[1][i].find(str(pat)) >= 0:
            patientsDF['Doctor'][i] = doctorsDF.loc[i, 0]
        else:
            continue
This worked fine, and now all patients are matched with their doctors, but the method seems clumsy. Is there a function that achieves this more cleanly? Thanks!
(First StackOverflow post here. Sorry if this is a newb question!)
If you use Pandas, try:
import pandas as pd

df = pd.read_csv('data.txt', sep=':', header=None, names=['Doctor', 'Patient'])

# Split the comma-separated patient string and give each patient its own row
df = (df[['Doctor']]
      .join(df['Patient'].str.strip().str.split(',').explode())
      .reset_index(drop=True))
Output:
>>> df
Doctor Patient
0 Doctor_1 patient23423
1 Doctor_1 patient292837
2 Doctor_1 patient1232423
3 Doctor_2 patient456785
4 Doctor_2 patient25363
5 Doctor_2 patient23425665
How to search:
>>> df.loc[df['Patient'] == 'patient25363', 'Doctor'].squeeze()
'Doctor_2'
I want a range of values from two columns, From and To. Since the number in the To column should be included in the range, I am adding 1 to it, as shown below:
df.apply(lambda x : range(x['From'],x['To']+1),1)
df.apply(lambda x : ','.join(map(str, range(x['From'],x['To']))),1)
I need output something like this: if the From value is 5 and the To value is 11, my output should be
5,6,7,8,9,10,11
but I'm getting only up to 10, even though I added +1 to the end of the range.
df:
----
From To
15887 16251
15888 16252
15889 16253
15890 16254
The range should be written to a new column.
Try this:
import pandas as pd

df = pd.DataFrame({'From': [15887, 15888, 15889, 15890],
                   'To': [16251, 16252, 16253, 16254]})

# k + 1 makes the To value inclusive
df['Range'] = [list(range(i, k + 1)) for i, k in zip(df['From'], df['To'])]
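If you want the comma-separated string form shown in the question rather than a list, a small variation on the same idea:

df['Range'] = [','.join(map(str, range(i, k + 1))) for i, k in zip(df['From'], df['To'])]

Note that the question's ','.join attempt used range(x['From'], x['To']) without the +1, which is why its output stopped one short of the To value.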
I want to group by and sum on a dataframe. The standard groupby function groups rows whose strings match exactly, but I need this done on similar strings. For example:
United States | 10
Germnay | 23
Unaited Staetes | 20
Germany | 21
Germanny | 32
Uniited Staites | 30
This should result in:
United States 60
Germnay 76
The order of the names is not that important. The sums of the values are.
Thanks a lot :)
EDIT:
Perhaps it would be simpler to create an ID column that gives the same id to similar countries. I can then just group by that.
Not a full solution, but a hack that might help if you are doing something quick and dirty (a sketch follows the list):
lowercase the country names
remove vowels from the country names
collapse consecutive occurrences of the same consonant
After transforming the data this way, a normal groupby should work pretty well. I suggest this since your data appears to be country names entered by users.
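A minimal sketch of that normalization, assuming pandas and the sample data from the question:

import re
import pandas as pd

df = pd.DataFrame({'Country': ['United States', 'Germnay', 'Unaited Staetes',
                               'Germany', 'Germanny', 'Uniited Staites'],
                   'Value': [10, 23, 20, 21, 32, 30]})

def normalize(name):
    key = name.lower()                   # lowercase
    key = re.sub(r'[aeiou]', '', key)    # remove vowels
    key = re.sub(r'(.)\1+', r'\1', key)  # collapse consecutive repeats
    return key

# 'United States', 'Unaited Staetes', 'Uniited Staites' all map to 'ntd stts'
print(df.groupby(df['Country'].map(normalize))['Value'].sum())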
Another idea, as a preprocessing step: use a spelling corrector trained on country names to guess the country name from the wrong spelling (https://norvig.com/spell-correct.html), transform each row of the data using it, and then use groupby to group.
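Along the same lines, a sketch that stands in difflib.get_close_matches from the standard library for a trained spell corrector; the canonical name list here is an assumption:

import difflib
import pandas as pd

df = pd.DataFrame({'Country': ['United States', 'Germnay', 'Unaited Staetes',
                               'Germany', 'Germanny', 'Uniited Staites'],
                   'Value': [10, 23, 20, 21, 32, 30]})

canonical = ['United States', 'Germany']  # assumed reference list

def correct(name):
    # Fall back to the original string when nothing is close enough
    matches = difflib.get_close_matches(name, canonical, n=1, cutoff=0.6)
    return matches[0] if matches else name

print(df.groupby(df['Country'].map(correct))['Value'].sum())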
I am trying to drop NaN values using the dropna() method provided by pandas. I've read the documentation and looked at other StackOverflow posts, but I still could not fix the error.
In my code, I first read an Excel file. Where rows have the value '-', I change it to NaN. After that, I use dropna() to drop the NaN values and assign the result to a new variable called mydf2. Below are my code and screenshots.
import pandas as pd

mydf = pd.read_excel('pandas lab datasets/singstats_maritalstatus.xlsx',
                     na_values='-')
mydf = mydf.set_index(['Variables'])
print(mydf.head(5))  # Original data

mydf2 = mydf.dropna()
print(mydf2)
dropna() has worked correctly. You have two print statements. The first one printed five rows, as asked for by print(mydf.head(5)).
The output of your second print statement, print(mydf2), is an empty dataframe [0 rows x 37 columns] because you apparently have a NaN in every row (see the bottom of your screenshot).
It sounds like the NaN here is a string, so do:

import numpy as np

mydf2 = mydf.replace('-', np.nan).dropna()
I wrote a piece of code here; it works fine with my data, so try it out.

import pandas as pd

mydf = pd.read_excel('pandas lab datasets/singstats_maritalstatus.xlsx')
to_del = []

# Collect the positions of rows that contain a '-' anywhere
for i in range(mydf.shape[0]):
    if '-' in list(mydf.iloc[i]):
        to_del.append(i)

out_df = mydf.drop(to_del, axis=0)
As you have not posted your data, I'm not sure whether every row has NaN values. If so, df.dropna() will simply drop every row. For example, the columns 1981 and 1982 are all NaN values in your image; df.dropna(axis=1) will drop those two columns instead and will not return an empty df.

import numpy as np
import pandas as pd

df = pd.DataFrame({'Variables': ['Total', 'Single', 'Married', 'Widowed', 'Divorced/Separated'],
                   '1980': range(5),
                   '1981': [np.nan] * 5})
df = df.set_index('Variables')  # assign back so the index change sticks

# Drops the all-NaN 1981 column rather than dropping every row
df.dropna(axis=1)