How to get the last occurrence of all items in a column (pandas) [duplicate]

This question already has answers here:
Python Pandas Dataframe select row by max value in group
(2 answers)
Closed 18 days ago.
Let's suppose I have a dataset like this:
item_id | date | cat |
----------------------------
0 | 2020-01-01 | A |
0 | 2020-02-01 | B |
1 | 2020-04-01 | A |
2 | 2020-02-01 | C |
2 | 2021-01-01 | B |
So I need to get the last category (column cat) for each item, meaning the result dataframe would be the following:
item_id | cat |
---------------
0 | B |
1 | A |
2 | B |
I know I could sort the values by date and then iterate over the items, but that would be too time-consuming. Is there another method in pandas to achieve this?

Use drop_duplicates after sort_values:
>>> df.sort_values('date').drop_duplicates('item_id', keep='last')
item_id date cat
1 0 2020-02-01 B
2 1 2020-04-01 A
4 2 2021-01-01 B
Comment by @mozway:
Sorting is O(n log n); this groupby/idxmax lookup avoids the sort:
>>> df.loc[pd.to_datetime(df['date']).groupby(df['item_id'], sort=False).idxmax()]
item_id date cat
1 0 2020-02-01 B
2 1 2020-04-01 A
4 2 2021-01-01 B
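For completeness, a third option keeps the sort but replaces the deduplication with a groupby. A minimal sketch, assuming the dates are ISO-formatted strings (which sort correctly even without parsing):
import pandas as pd

df = pd.DataFrame({
    'item_id': [0, 0, 1, 2, 2],
    'date': ['2020-01-01', '2020-02-01', '2020-04-01', '2020-02-01', '2021-01-01'],
    'cat': ['A', 'B', 'A', 'C', 'B'],
})

# Sort by date once, then take the last row of each item_id group.
last = df.sort_values('date').groupby('item_id', as_index=False).last()
print(last[['item_id', 'cat']])
This still pays for the O(n log n) sort, so mozway's idxmax version remains the asymptotically cheaper one.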

Pandas: convert list data in a record into separate rows [duplicate]

This question already has answers here:
Split (explode) pandas dataframe string entry to separate rows
(27 answers)
Closed 1 year ago.
I have a pandas dataframe in a format like this:
| Group | ID_LIST |
| -------- | -------------- |
| A | [1,2,3] |
| B | [1,3,5] |
| C | [2,4] |
I would like to unpack the lists into separate rows like this:
| Group | ID_LIST |
| -------- | -------------- |
| A | 1 |
| A | 2 |
| A | 3 |
| B | 1 |
| B | 3 |
| B | 5 |
| C | 2 |
| C | 4 |
Is it possible to do this with a pandas function, or should I convert to lists and handle it manually instead?
If the ID_LIST column contains real lists:
>>> df.explode('ID_LIST')
Group ID_LIST
0 A 1
0 A 2
0 A 3
1 B 1
1 B 3
1 B 5
2 C 2
2 C 4
If the ID_LIST column contains strings (which merely look like lists):
>>> df.assign(ID_LIST=pd.eval(df['ID_LIST'])).explode('ID_LIST')
Group ID_LIST
0 A 1
0 A 2
0 A 3
1 B 1
1 B 3
1 B 5
2 C 2
2 C 4
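If the column really does hold strings, a more defensive alternative to pd.eval is ast.literal_eval applied element-wise; it parses Python literals only, so a malformed cell fails loudly instead of being evaluated as an expression. A sketch using the question's data:
from ast import literal_eval
import pandas as pd

df = pd.DataFrame({'Group': ['A', 'B', 'C'],
                   'ID_LIST': ['[1,2,3]', '[1,3,5]', '[2,4]']})

# Parse each string into a real list, then explode as before.
out = df.assign(ID_LIST=df['ID_LIST'].apply(literal_eval)).explode('ID_LIST')
print(out)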
Use explode:
df = df.explode('ID_LIST')
Did you try using pandas explode() to separate list elements into separate rows? Adapted to this question's column (assuming the values are bracketed strings):
df.assign(ID_LIST=df['ID_LIST'].str.strip('[]').str.split(',')).explode('ID_LIST')
Try this code (note that row-wise iteration is much slower than explode):
pd.concat([pd.Series(row['ID_LIST'], index=[row['Group']] * len(row['ID_LIST']))
           for _, row in df.iterrows()]).rename_axis('Group').reset_index(name='ID_LIST')
Do let me know if it works.

How to dimensionalize a pandas dataframe

I'm looking for a more elegant way of doing this, other than a for-loop and unpacking manually...
Imagine I have a dataframe that looks like this
| id | value | date | name |
| -- | ----- | ---------- | ---- |
| 1 | 5 | 2021-04-05 | foo |
| 1 | 6 | 2021-04-06 | foo |
| 5 | 7 | 2021-04-05 | bar |
| 5 | 9 | 2021-04-06 | bar |
If I wanted to dimensionalize this, I could split it up into two different tables. One, perhaps, would contain "meta" information about the person, and the other serving as "records" that would all relate back to one person... a pretty simple idea as far as SQL-ian ideas go...
The resulting tables would look like this...
Meta
| id | name |
| -- | ---- |
| 1 | foo |
| 5 | bar |
Records
| id | value | date |
| -- | ----- | ---------- |
| 1 | 5 | 2021-04-05 |
| 1 | 6 | 2021-04-06 |
| 5 | 7 | 2021-04-05 |
| 5 | 9 | 2021-04-06 |
My question is, how can I achieve this "dimensionalizing" of a dataframe with pandas, without having to write a for loop on the unique id key field and unpacking manually?
Think about this not as "splitting" the existing dataframe, but as creating two new dataframes from the original. You can do this in a couple of lines:
meta = df[['id','name']].drop_duplicates() #Select the relevant columns and remove duplicates
records = df.drop("name", axis=1) #Replicate the original dataframe but drop the name column
You could drop_duplicates on the subset of columns you want to keep. For the second dataframe, you can drop the name column:
df1 = df.drop_duplicates(['id', 'name']).loc[:,['id', 'name']] # perigon's answer is simpler with df[['id','name']].drop_duplicates()
df2 = df.drop('name', axis=1)
df1, df2
Output:
( id name
0 1 foo
2 5 bar,
id value date
0 1 5 2021-04-05
1 1 6 2021-04-06
2 5 7 2021-04-05
3 5 9 2021-04-06)
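As a quick sanity check (a sketch built on the example data above), merging the two tables back together on id should round-trip to the original dataframe, which is the usual test that the normalization lost nothing:
import pandas as pd

df = pd.DataFrame({'id': [1, 1, 5, 5],
                   'value': [5, 6, 7, 9],
                   'date': ['2021-04-05', '2021-04-06', '2021-04-05', '2021-04-06'],
                   'name': ['foo', 'foo', 'bar', 'bar']})

meta = df[['id', 'name']].drop_duplicates()
records = df.drop('name', axis=1)

# Joining records back to meta on id should reproduce the original rows.
restored = records.merge(meta, on='id')
print(restored.equals(df[restored.columns]))  # True if nothing was lost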

SQL: select values of a column which are present for the whole date range

How should I write a SQL query to find all unique values of a column that are present on every date in a given date range?
+-------------+--------+------------+
| primary_key | column | date |
+-------------+--------+------------+
| 1 | a | 2020-03-01 |
| 2 | a | 2020-03-02 |
| 3 | a | 2020-03-03 |
| 4 | a | 2020-03-04 |
| 5 | b | 2020-03-01 |
| 6 | b | 2020-03-02 |
| 7 | b | 2020-03-03 |
| 8 | b | 2020-03-04 |
| 9 | c | 2020-03-01 |
| 10 | c | 2020-03-02 |
| 11 | c | 2020-03-03 |
| 12 | d | 2020-03-04 |
+-------------+--------+------------+
In the above example if query date range is 2020-03-01 to 2020-03-04 output should be
a
b
since only a and b are present for the whole range.
Similarly, if the query date range is 2020-03-01 to 2020-03-03, the output should be
a
b
c
I could do this in a Python script by fetching all rows and using a set.
Is it possible to write a SQL query to achieve the same result?
You may aggregate by column value and then assert the distinct date count:
SELECT col
FROM yourTable
WHERE date BETWEEN '2020-03-01' AND '2020-03-04'
GROUP BY col
HAVING COUNT(DISTINCT date) = 4;  -- 4 = number of distinct dates in the range
One more way to solve the above problem, computing the distinct date count dynamically instead of hard-coding it:
SELECT a.col
FROM (SELECT col, COUNT(DISTINCT date) AS datecnt FROM yourTable
      WHERE date BETWEEN '2020-03-01' AND '2020-03-04' GROUP BY col) a,
     (SELECT COUNT(DISTINCT date) AS cnt FROM yourTable
      WHERE date BETWEEN '2020-03-01' AND '2020-03-04') b
WHERE a.datecnt = b.cnt;
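Since the asker mentions falling back to Python anyway, here is a rough pandas equivalent of the accepted query (a sketch, assuming the table has been loaded into a dataframe with columns column and date): keep the values whose distinct-date count within the range equals the range's total distinct-date count.
import pandas as pd

df = pd.DataFrame({
    'column': ['a'] * 4 + ['b'] * 4 + ['c'] * 3 + ['d'],
    'date': ['2020-03-01', '2020-03-02', '2020-03-03', '2020-03-04'] * 2
            + ['2020-03-01', '2020-03-02', '2020-03-03', '2020-03-04'],
})

# Restrict to the query range, count distinct dates per value,
# and keep the values that cover every distinct date in the range.
sub = df[df['date'].between('2020-03-01', '2020-03-04')]
counts = sub.groupby('column')['date'].nunique()
print(counts[counts == sub['date'].nunique()].index.tolist())  # ['a', 'b']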

How do I add or subtract a row to an entire pandas dataframe?

I have a dataframe like this:
| a | b | c |
0 | 0 | 0 | 0 |
1 | 5 | 5 | 5 |
I have a dataframe row (or series) like this:
| a | b | c |
0 | 1 | 2 | 3 |
I want to add the row to (or subtract it from) every row of the dataframe; adding it would give this:
| a | b | c |
0 | 1 | 2 | 3 |
1 | 6 | 7 | 8 |
Any help is appreciated, thanks.
Use DataFrame.add or DataFrame.sub and convert the one-row DataFrame to a Series, e.g. with DataFrame.iloc for the first row:
df = df1.add(df2.iloc[0])
# alternative: select by row label
# df = df1.add(df2.loc[0])
print(df)
a b c
0 1 2 3
1 6 7 8
Detail:
print (df2.iloc[0])
a 1
b 2
c 3
Name: 0, dtype: int64
You can convert the second dataframe to a NumPy array, which broadcasts across the rows:
df1 + df2.values
Output:
a b c
0 1 2 3
1 6 7 8
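The subtraction from the title works the same way; a minimal sketch with the frames above rebuilt:
import pandas as pd

df1 = pd.DataFrame({'a': [0, 5], 'b': [0, 5], 'c': [0, 5]})
df2 = pd.DataFrame({'a': [1], 'b': [2], 'c': [3]})

# sub broadcasts the single row across all rows, just like add.
print(df1.sub(df2.iloc[0]))
#    a  b  c
# 0 -1 -2 -3
# 1  4  3  2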

Pandas - How to dynamically get the min and max value of each session in a column

So I have a dataframe similar to this:
timestamp | name
------------+------------
1 | a
1 | b
2 | c
2 | d
2 | e
3 | f
4 | g
Essentially I want to get the min and max value of each timestamp session (defined by a unique timestamp value; there are 4 sessions in this example). The expected result would be something like this:
timestamp | name | start | end
------------+----------+--------+------
1 | a | 1 | 2
1 | b | 1 | 2
2 | c | 2 | 3
2 | d | 2 | 3
2 | e | 2 | 3
3 | f | 3 | 4
4 | g | 4 | 4
I was thinking of indexing on the timestamp column and then shifting the index up by 1, but this approach doesn't work for the fourth bucket in the example above (there is no next session to shift in).
Any help is greatly appreciated!
Try numpy.clip(), e.g. df['end'] = np.clip(df['timestamp'] + 1, 0, 4). (This assumes the timestamps are consecutive integers and hard-codes the maximum.)
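The clip trick only works because the example timestamps happen to be consecutive integers ending at 4. A sketch of a more general approach (names are my own): map each unique timestamp to the next one, with the last session mapping to itself.
import numpy as np
import pandas as pd

df = pd.DataFrame({'timestamp': [1, 1, 2, 2, 2, 3, 4],
                   'name': list('abcdefg')})

# Each session's end is the next session's start; the last maps to itself.
ts = np.sort(df['timestamp'].unique())
next_ts = dict(zip(ts, np.append(ts[1:], ts[-1])))

df['start'] = df['timestamp']
df['end'] = df['timestamp'].map(next_ts)
print(df)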