Formatting a string in pandas dataframe [closed] - python

I have a dataframe that (simplified) looks something like:
col1 col2
1 a
2 b
3 c,ddd,ee,f,5,hfsf,a
In col2, I need to keep only the first three comma-separated items and drop everything after them; if the value has no commas, it should stay as is:
col1 col2
1 a
2 b
3 c,ddd,ee
Again, this is simplified. The solution needs to scale up to something that has thousands of rows, and the space between each comma will not always be the same.
Edit:
This got me on the right track:
df.col2 = df.col2.str.split(',').str[:2].str.join(',')

Pandas provides access to many familiar string functions, including slicing and selection, through the .str attribute:
df.col2.str.split(',').str[:3].str.join(',')
#0 a
#1 b
#2 c,ddd,ee
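As a self-contained sketch of the same idea, the frame below is rebuilt from the question's sample data (the assignment back to col2 is my addition, the answer above only displays the result):

```python
import pandas as pd

# Sample data rebuilt from the question
df = pd.DataFrame({
    "col1": [1, 2, 3],
    "col2": ["a", "b", "c,ddd,ee,f,5,hfsf,a"],
})

# Split on commas, keep at most the first three pieces, then glue them back.
# Values without a comma become a one-element list and come back unchanged.
df["col2"] = df["col2"].str.split(",").str[:3].str.join(",")
print(df)
```

Everything here is vectorised through the .str accessor, so it should behave the same on thousands of rows.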

Related

How to remove duplicates from data frame using python [closed]

dframe= pd.DataFrame({'col1':['A']*3 + ['B']*4 + ['C','B','A'],'col2':[2,3,4,2,4,2,1,3,4,4]})
I want to remove duplicates from both columns and final result should look like this:
pd.DataFrame({'col1':['A'] + ['B'] + ['C'],'col2':[2,4,3]})
I tried the following, but the result was not what I expected:
dframe.drop_duplicates(subset=['col1'], keep='first')
Please help.
Thanks
Try one of these.
Using agg() and dropna():
out=dframe.agg(lambda x:pd.Series(pd.unique(x))).dropna()
Or using apply() and dropna():
out=dframe.apply(lambda x:pd.Series(pd.unique(x))).dropna()
Output of out:
col1 col2
0 A 2
1 B 3
2 C 4
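For anyone wondering why this works, here is a runnable sketch of the apply() variant using the frame from the question. Each column is de-duplicated on its own, so the values in a row of the result no longer come from the same original row; whether that is acceptable is an assumption about what the asker wants.

```python
import pandas as pd

dframe = pd.DataFrame({
    "col1": ["A"] * 3 + ["B"] * 4 + ["C", "B", "A"],
    "col2": [2, 3, 4, 2, 4, 2, 1, 3, 4, 4],
})

# pd.unique keeps values in order of first appearance, column by column.
# col1 has 3 unique values and col2 has 4, so col1 is padded with NaN in
# the fourth row, and dropna() then removes that incomplete row.
out = dframe.apply(lambda x: pd.Series(pd.unique(x))).dropna()
print(out)
```

Note that this gives col2 as 2, 3, 4 (first-seen order), not the 2, 4, 3 shown in the question.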

how to get Unique count from a DataFrame in case of duplicate index [closed]

I am working on a dataframe. The data is in the image.
Q. I want the number of shows released per year, but when I apply the count() function it gives me 6 instead of 3. Could anyone suggest how to get the correct count?
To get the number of unique values for a single year, you can use
count = len(df.loc[df['release_year'] == 1945, 'show_id'].unique())
# or
count = df.loc[df['release_year'] == 1945, 'show_id'].nunique()
To summarize the unique values of the dataframe by year, you can call drop_duplicates() on the show_id column first.
df.drop_duplicates(subset=['show_id']).groupby('release_year').count()
Or use value_counts() on the column after dropping duplicates.
df.drop_duplicates(subset=['show_id'])['release_year'].value_counts()
df.groupby('release_year')['show_id'].nunique()
should do the job.
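Since the original data is only available as an image, here is a small sketch with made-up rows (the show_id values and the 1946 row are hypothetical) that reproduces the "6 instead of 3" symptom and the per-year unique count:

```python
import pandas as pd

# Hypothetical data: each 1945 show appears twice, so count() double-counts
df = pd.DataFrame({
    "show_id": ["s1", "s1", "s2", "s2", "s3", "s3", "s4"],
    "release_year": [1945, 1945, 1945, 1945, 1945, 1945, 1946],
})

# Counts every row, duplicates included
raw_counts = df.groupby("release_year")["show_id"].count()

# Counts each distinct show_id once per year
unique_counts = df.groupby("release_year")["show_id"].nunique()

print(raw_counts)
print(unique_counts)
```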

How do I remove squared brackets from data that is saved as a list in a Dataframe in Python [closed]

The data is like this and it is in a data frame.
PatientId Payor
0 PAT10000 [Cash, Britam]
1 PAT10001 [Madison, Cash]
2 PAT10002 [Cash]
3 PAT10003 [Cash, Madison, Resolution]
4 PAT10004 [CIC Corporate, Cash]
I want to remove the square brackets and filter all patients who used at least a certain mode of payment, e.g. Madison, then obtain their IDs. Please help.
This will generate a list of tuples (id, payor), where df is the dataframe. The [1:-1] slice assumes each Payor value is a string that includes the brackets.
payment = 'Madison'
ids = [(id, df.Payor[i][1:-1]) for i, id in enumerate(df.PatientId) if payment in df.Payor[i]]
Let's say your dataframe variable is "df" and, after removing the square brackets, you want to filter all rows containing "Madison" in the "Payor" column (this assumes Payor holds strings):
# '[' and ']' must be escaped in a regex, and the result has to be assigned back
df['Payor'] = df['Payor'].str.replace(r'[\[\]]', '', regex=True)
filteredDf = df.loc[df['Payor'].str.contains("Madison")]
print(filteredDf)
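If the Payor column really holds Python lists rather than strings (the title suggests so, but that is an assumption), the brackets are only how pandas prints a list and there is nothing to strip. A minimal sketch for that case, with the frame rebuilt from the question:

```python
import pandas as pd

df = pd.DataFrame({
    "PatientId": ["PAT10000", "PAT10001", "PAT10002", "PAT10003", "PAT10004"],
    "Payor": [
        ["Cash", "Britam"],
        ["Madison", "Cash"],
        ["Cash"],
        ["Cash", "Madison", "Resolution"],
        ["CIC Corporate", "Cash"],
    ],
})

payment = "Madison"

# Membership test on each list: True where the patient used this payor
mask = df["Payor"].apply(lambda payors: payment in payors)
ids = df.loc[mask, "PatientId"].tolist()

# Turn each list into plain text, which also gets rid of the brackets
df["Payor"] = df["Payor"].str.join(", ")

print(ids)
print(df)
```

The filter runs before the join so that "Madison" is matched as a whole list element rather than as a substring.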

How to Read A CSV With A Variable Number of Columns? [closed]

My csv file looks like this:
5783,145v
g656,4589,3243,tt56
6579
How do I read this with pandas (or otherwise)?
(the table should contain empty cells)
You could pass a dummy separator, and then use str.split (by ",") with expand=True:
df = pd.read_csv('path/to/file.csv', sep=" ", header=None)
df = df[0].str.split(",", expand=True).fillna("")
print(df)
Output
      0     1     2     3
0  5783  145v
1  g656  4589  3243  tt56
2  6579
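The split step can be tried without a file on disk. This sketch feeds the three sample lines in as a plain Series (that shortcut is mine, not part of the answer above) and applies the same str.split with expand=True:

```python
import pandas as pd

# The sample file contents from the question, held in memory
raw = "5783,145v\ng656,4589,3243,tt56\n6579"

# One string per line, standing in for the single column read from the file
lines = pd.Series(raw.splitlines())

# expand=True spreads the pieces over as many columns as the longest row
# needs; shorter rows are padded with missing values, which fillna("")
# turns into empty cells
df = lines.str.split(",", expand=True).fillna("")
print(df)
```

All values stay strings with this approach, so any numeric columns would need converting afterwards.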
I think that the solution proposed by @researchnewbie is good. If you need to replace the NaN values with, say, zero, you could add this line after the read:
dataFrame.fillna(0, inplace=True)
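Try doing the following:
import pandas as pd
# Name as many columns as the widest row has (4 here) so shorter rows are padded with NaN
dataFrame = pd.read_csv(filename, header=None, names=range(4))
Your empty cells should contain the NaN value, which is essentially null.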

Python :Select the rows for the most recent entry from multiple users [closed]

I have a dataframe df with 3 columns :
df=pd.DataFrame({
'User':['A','A','B','A','C','B','C'],
'Values':['x','y','z','p','q','r','s'],
'Date':[14,11,14,12,13,10,14]
})
I want to create a new dataframe that contains the rows corresponding to the highest value in the 'Date' column for each user. For example, for the above dataframe I want the desired dataframe to be as follows (it's a JPEG image):
Can anyone help me with this problem?
This answer assumes that each user has a single maximum value in the Date column (if several rows tie for a user's maximum, all of them are returned):
In [10]: def get_max(group):
    ...:     return group[group.Date == group.Date.max()]
    ...:
In [12]: df.groupby('User').apply(get_max).reset_index(drop=True)
Out[12]:
   Date User Values
0    14    A      x
1    14    B      z
2    14    C      s
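A possible alternative that avoids apply() is to look up the index of each user's maximum Date with idxmax(). This is a sketch rather than the answerer's method, and it keeps only the first row when a user has tied maxima:

```python
import pandas as pd

df = pd.DataFrame({
    "User": ["A", "A", "B", "A", "C", "B", "C"],
    "Values": ["x", "y", "z", "p", "q", "r", "s"],
    "Date": [14, 11, 14, 12, 13, 10, 14],
})

# Index label of the row holding the largest Date for each user
latest_idx = df.groupby("User")["Date"].idxmax()

# Pull those rows out of the original frame and renumber them
latest = df.loc[latest_idx].reset_index(drop=True)
print(latest)
```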
