df.dropna() modifies row indices - python

I was trying to remove the rows with NaN values from a Python dataframe, and when I do so, I want the row identifiers to shift so that the identifiers in the new dataframe start from 0 and are one number apart. By identifiers I mean the numbers on the left of the following example. Notice that this is not an actual column of my df; it is placed by default in every dataframe.
If my Df is like:
name toy born
0 a 1 2020
1 na 2 2020
2 c 5 2020
3 na 1 2020
4 asf 1 2020
What I want after dropna():
name toy born
0 a 1 2020
1 c 5 2020
2 asf 1 2020
This is what I get instead, which I don't want:
name toy born
0 a 1 2020
2 c 5 2020
4 asf 1 2020

You can simply add .reset_index(drop=True) after the dropna() call.

By default, df.dropna and df.reset_index are not performed in place. Therefore, the complete answer would be as follows.
df = df.dropna().reset_index(drop=True)
Results and Explanations
The above code yields the following result.
>>> df = df.dropna().reset_index(drop=True)
>>> df
name toy born
0 a 1 2020
1 c 5 2020
2 asf 1 2020
We pass drop=True so the old index is discarded rather than inserted back as a column. Otherwise, the result would look like this.
>>> df = df.dropna().reset_index()
>>> df
index name toy born
0 0 a 1 2020
1 2 c 5 2020
2 4 asf 1 2020
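As a side note, newer pandas versions (2.0 and later, if I recall the release right) let you do this in a single call: dropna() accepts an ignore_index flag that relabels the surviving rows. A minimal sketch, assuming such a version is available:
import pandas as pd
import numpy as np

df = pd.DataFrame({'name': ['a', np.nan, 'c', np.nan, 'asf'],
                   'toy': [1, 2, 5, 1, 1],
                   'born': [2020, 2020, 2020, 2020, 2020]})

# ignore_index=True renumbers the remaining rows 0, 1, 2, ... in one step
df = df.dropna(ignore_index=True)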

Related

Getting max row from multi-index table

I have a table that looks similar to this:
user_id  date  count
1        2020      5
         2021      7
2        2017      1
3        2020      2
         2019      1
         2021      3
I'm trying to keep only the row for each user_id that has the greatest count, so it should look something like this:
user_id  date  count
1        2021      7
2        2017      1
3        2021      3
I've tried using df.groupby(level=0).apply(max), but it removes the date column from the final table, and I'm not sure how to modify it to keep all three original columns.
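For a runnable setup, the sample frame can be reconstructed like this (a sketch; it assumes user_id is a single-level row index, which is consistent with the groupby(level=0) calls below):
import pandas as pd

# Hypothetical reconstruction of the sample table, with user_id as the row index
df = pd.DataFrame({'user_id': [1, 1, 2, 3, 3, 3],
                   'date': [2020, 2021, 2017, 2020, 2019, 2021],
                   'count': [5, 7, 1, 2, 1, 3]}).set_index('user_id')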
You can select only the count column after .groupby() and then use .apply() to build a boolean Series indicating whether each entry in a group equals the group's maximum count. Then, pass the boolean Series to .loc to select the matching rows of the whole dataframe.
df.loc[df.groupby(level=0)['count'].apply(lambda x: x == x.max())]
Result:
date count
user_id
1 2021 7
2 2017 1
3 2021 3
Note that if there are multiple entries in one user_id that have the same greatest count, all these entries will be kept.
If, in the case of such ties, you want to keep only one entry per user_id, you can use the following logic instead:
df1 = df.reset_index()
df1.loc[df1.groupby('user_id')['count'].idxmax()].set_index('user_id')
Result:
date count
user_id
1 2021 7
2 2017 1
3 2021 3
Note that we cannot simply use df.loc[df.groupby(level=0)["count"].idxmax()] because user_id is the row index. That code just returns all rows, exactly like the original, unprocessed dataframe. The reason is that the index idxmax() returns here is the user_id itself (rather than a simple RangeIndex 0, 1, 2, ...). Then, when .loc looks up those user_id labels, it simply returns all entries under each user_id.
Demo
Let's add more entries to the sample data and see the differences between the 2 solutions:
Our base df (user_id is the row index):
date count
user_id
1 2018 7 <=== max1
1 2020 5
1 2021 7 <=== max2
2 2017 1
3 2020 3 <=== max1
3 2019 1
3 2021 3 <=== max2
1st Solution result:
df.loc[df.groupby(level=0)['count'].apply(lambda x: x == x.max())]
date count
user_id
1 2018 7
1 2021 7
2 2017 1
3 2020 3
3 2021 3
2nd Solution result:
df1 = df.reset_index()
df1.loc[df1.groupby('user_id')['count'].idxmax()].set_index('user_id')
date count
user_id
1 2018 7
2 2017 1
3 2020 3
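A third route, as a side note, is to sort by count and keep the last row of each group; a sketch under the same assumptions (kind='stable' makes the tie behavior predictable: the last tied row per user_id survives, whereas idxmax keeps the first):
# Sort ascending by count, keep each user_id's last (largest) row, restore index order
df.sort_values('count', kind='stable').groupby(level=0).tail(1).sort_index()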

Ordering columns according to columns values in Python

So I have the below dataframe:
Name Tues Mon Tues Mon
col 0 0 1 1 <-
bill 2 1 2 1
jon 4 3 4 3
and I want to order the dataframe columns according to the "col" row, grouping the 0's and 1's in order, but also ordering by day of the week within each group, so the result is below.
Name Mon Tues Mon Tues
col 0 0 1 1
bill 1 2 1 2
jon 3 4 3 4
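For a runnable test, the frame can be reconstructed directly; a sketch, assuming Name is a regular column and that the duplicate headers were mangled to Mon.1/Tues.1 on the way in (read_csv does this, and it matches the .1 suffixes in the output below):
import pandas as pd

# Hypothetical reconstruction; the .1 suffixes mimic how read_csv
# deduplicates repeated column headers
df = pd.DataFrame([['col', 0, 0, 1, 1],
                   ['bill', 2, 1, 2, 1],
                   ['jon', 4, 3, 4, 3]],
                  columns=['Name', 'Tues', 'Mon', 'Tues.1', 'Mon.1'])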
It's easier to sort by rows, so you can use .T to transpose the dataframe and then .T again to transpose it back after running some operations.
The first thing you need to do is sort by the day of the week. You can create a new column that maps the (possibly abbreviated) weekday strings to numbers, sort in day-of-week order, and then drop that column after sorting.
df = df.set_index('Name').T.reset_index()
df['day'] = df['index'].replace(['Su.*', 'M.*', 'Tu.*', 'W.*', 'Th.*', 'F.*', 'Sa.*'],
                                [1, 2, 3, 4, 5, 6, 7], regex=True)
df = df.sort_values(['col', 'day']).drop('day', axis=1).set_index('index').T.reset_index()
df
Out[1]:
index Name Mon Tues Mon.1 Tues.1
0 col 0 0 1 1
1 bill 1 2 1 2
2 jon 3 4 3 4
You can change the column names with:
df.columns = [col.split('.')[0] for col in df.columns]
Name Mon Tues Mon Tues
0 col 0 0 1 1
1 bill 1 2 1 2
2 jon 3 4 3 4
You can sort df with the following simple, self-explanatory line (this assumes Name is set as the index, so that 'col' and 'bill' are row labels):
df_sorted = df.T.sort_values(['col', 'bill']).T

How to convert one row two columns dataframe into multiple rows two columns dataframe

I am new to Python.
I have a dataframe with two columns: one is an ID column and the other holds year and count information related to the ID.
I want to convert this format into multiple rows with the same ID.
The current dataframe looks like:
ID information
1 2014:Total:0, 2015:Total:1, 2016:Total:2
2 2017:Total:3, 2018:Total:1, 2019:Total:2
I expect the converted dataframe should like this:
ID Year Value
1 2014 0
1 2015 1
1 2016 2
2 2017 3
2 2018 1
2 2019 2
I tried to use the str.split method of the pandas dataframe, but no luck.
Any suggestions would be appreciated.
Let us use explode :-) (new in pandas 0.25.0):
df.information = df.information.str.split(', ')
Yourdf = df[['ID']].join(df.information.explode().str.split(':', expand=True).drop(1, axis=1))
Yourdf
ID 0 2
0 1 2014 0
0 1 2015 1
0 1 2016 2
1 2 2017 3
1 2 2018 1
1 2 2019 2
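To get the labels in the expected shape, a small hypothetical cleanup can follow (column 0 holds the year and column 2 the value after the split):
# Hypothetical follow-up: name the columns and renumber the duplicated index
Yourdf = Yourdf.rename(columns={0: 'Year', 2: 'Value'}).reset_index(drop=True)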
Try using the below code; unlike @WenYoBen's answer, this works for much lower pandas versions as well:
# Split the comma-separated pairs, split each pair on ':', and flatten
# row by row so the years stay grouped with their ID
df2 = pd.DataFrame(df['information'].str.split(', ', expand=True)
                     .apply(lambda x: x.str.split(':'))
                     .values.flatten().tolist(),
                   columns=['Year', '', 'Value']).iloc[:, [0, 2]]
print(pd.DataFrame(sorted(df['ID'].tolist() * (len(df2) // 2)), columns=['ID']).join(df2))
Output:
   ID  Year Value
0   1  2014     0
1   1  2015     1
2   1  2016     2
3   2  2017     3
4   2  2018     1
5   2  2019     2

Pandas Time-Series: Find previous value for each ID based on year and semester

I realize this is a fairly basic question, but I couldn't find what I'm looking for through searching (partly because I'm not sure how to summarize what I want). In any case:
I have a dataframe that has the following columns:
* ID (each one represents a specific college course)
* Year
* Term (0 = fall semester, 1 = spring semester)
* Rating (from 0 to 5)
My goal is to create another column for Previous Rating. This column would be equal to the course's rating the last time the course was held, and would be NaN for the first offering of the course. The goal is to use the course's rating from the last time the course was offered in order to predict the current semester's enrollment. I am struggling to figure out how to find the last offering of each course for a given row.
I'd appreciate any help in performing this operation! I am working in Pandas but could move my data to R if that'd make it easier. Please let me know if I need to clarify my question.
I think there are two critical points: (1) sorting by Year and Term so that the order corresponds to temporal order; and (2) using groupby to collect on IDs before selecting and shifting the Rating. So, from a frame like
>>> df
ID Year Term Rating
0 1 2010 0 2
1 2 2010 0 2
2 1 2010 1 1
3 2 2010 1 0
4 1 2011 0 3
5 2 2011 0 3
6 1 2011 1 4
7 2 2011 1 0
8 2 2012 0 4
9 2 2012 1 4
10 1 2013 0 2
We get
>>> df = df.sort_values(["ID", "Year", "Term"])
>>> df["Previous_Rating"] = df.groupby("ID")["Rating"].shift()
>>> df
ID Year Term Rating Previous_Rating
0 1 2010 0 2 NaN
2 1 2010 1 1 2
4 1 2011 0 3 1
6 1 2011 1 4 3
10 1 2013 0 2 4
1 2 2010 0 2 NaN
3 2 2010 1 0 2
5 2 2011 0 3 0
7 2 2011 1 0 3
8 2 2012 0 4 0
9 2 2012 1 4 4
Note that we didn't actually need to sort by ID -- the groupby would have worked equally well without it -- but this way it's easier to see that the shift has done the right thing. Reading up on the split-apply-combine pattern might be helpful.
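One small hedged addition: the sort left the rows grouped by ID rather than in their original order; since the default RangeIndex survived the sort, a final sort_index puts them back if needed:
# Restore the original row order (the RangeIndex labels are still intact)
df = df.sort_index()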
Use this method to create the new column:
DataFrame.shift(periods=1, freq=None, axis=0)
It shifts the values by the desired number of periods, with an optional time freq.
Let's say you have a dataframe like this...
ID Rating Term Year
1 1 0 2002
2 2 1 2003
3 3 0 2004
2 4 0 2005
where ID is the course ID and you have multiple entries for each ID across years and semesters. Your goal is to find the row for a given ID in the most recent year and term.
For that you can do this...
df[((df['Year'] == max(df.Year)) & (df['ID'] == 2) & (df['Term'] == 0))]
Here we find the course for the given ID and term in its last offering. If you want the rating, then you can do
df[((df['Year'] == max(df.Year)) & (df['ID'] == 2) & (df['Term'] == 0))].Rating
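One caveat worth flagging: max(df.Year) is the global maximum year, so this filter returns nothing if course 2 was not offered that year. A per-course sketch restricts to the ID first and then takes that course's own latest year:
# Hypothetical per-course variant: latest fall offering of course 2
sub = df[(df['ID'] == 2) & (df['Term'] == 0)]
sub[sub['Year'] == sub['Year'].max()].Rating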
Hope this is the result you were trying to accomplish.
Thanks.

Grouping by value in a column and getting another column's value

This is the seed DataSet:
In[1]: my_data =
[{'client':'A','product_s_n':'1','status':'in_store','month':'Jan'},
{'client':'A','product_s_n':'1','status':'sending', 'month':'Feb'},
{'client':'A','product_s_n':'2','status':'in_store','month':'Jan'},
{'client':'A','product_s_n':'2','status':'in_store','month':'Feb'},
{'client':'B','product_s_n':'3','status':'in_store','month':'Jan'},
{'client':'B','product_s_n':'3','status':'sending', 'month':'Feb'},
{'client':'B','product_s_n':'4','status':'in_store','month':'Jan'},
{'client':'B','product_s_n':'4','status':'in_store','month':'Feb'},
{'client':'C','product_s_n':'5','status':'in_store','month':'Jan'},
{'client':'C','product_s_n':'5','status':'sending', 'month':'Feb'}]
df = pd.DataFrame(my_data)
df
Out[1]:
client month product_s_n status
0 A Jan 1 in_store
1 A Feb 1 sending
2 A Jan 2 in_store
3 A Feb 2 in_store
4 B Jan 3 in_store
5 B Feb 3 sending
6 B Jan 4 in_store
7 B Feb 4 in_store
8 C Jan 5 in_store
9 C Feb 5 sending
The question I want to ask of this data is: what is the client for each product_s_n? From the data in this example, this is what the resulting DataFrame would look like (I need a new DataFrame as a result):
product_s_n client
0 1 A
1 2 A
2 3 B
3 4 B
4 5 C
As you may have noticed, the 'status' and 'month' fields are just there to give sense and structure to the data in this sample dataset. I tried using groupby, with no success. Any ideas?
Thanks!
After calling df.groupby(['product_s_n']) you can restrict attention to a particular column by indexing with ['client']. You can then select the first value of client from each group by calling first().
>>> df.groupby(['product_s_n'])['client'].first()
product_s_n
1 A
2 A
3 B
4 B
5 C
Name: client, dtype: object
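Since the expected result keeps product_s_n as a regular column rather than the index, a hedged variant passes as_index=False to get a two-column DataFrame directly:
# as_index=False keeps the group key as a column in the result
df.groupby('product_s_n', as_index=False)['client'].first()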
