Pandas: combine two dataframes with same columns by picking values - python

I have two dataframes:
The first:
id time_begin time_end
0 1938 1946
1 1991 1991
2 1359 1991
4 1804 1937
6 1368 1949
... ... ...
Second:
id time_begin time_end
1 1946 1946
3 1940 1954
5 1804 1925
6 1978 1978
7 1912 1949
Now I want to combine the two dataframes so that I get all rows from both. But since a row can be present in both dataframes (e.g. rows 1 and 6), I want to pick the minimum time_begin and the maximum time_end of the two. Thus my expected result:
id time_begin time_end
0 1938 1946
1 1946 1991
2 1359 1991
3 1940 1954
4 1804 1937
5 1804 1925
6 1368 1978
7 1912 1949
... ... ...
How can I achieve this? Normal join/combine operations do not allow for this as far as I can tell.

You could first merge the dataframes and then use groupby with agg to take min(time_begin) and max(time_end):
import pandas as pd

df1 = pd.DataFrame({'id': [0, 1, 2, 4, 6],
                    'time_begin': [1938, 1991, 1359, 1804, 1368],
                    'time_end': [1946, 1991, 1991, 1937, 1949]})
df2 = pd.DataFrame({'id': [1, 3, 5, 6, 7],
                    'time_begin': [1946, 1940, 1804, 1978, 1912],
                    'time_end': [1946, 1954, 1925, 1978, 1949]})

# outer merge on all shared columns keeps every distinct row from both frames
df = df1.merge(df2, how='outer')
# per id, keep the earliest begin and the latest end
df = df.groupby('id').agg({'time_begin': 'min', 'time_end': 'max'})
Output (id becomes the index):
id time_begin time_end
0 1938 1946
1 1946 1991
2 1359 1991
3 1940 1954
4 1804 1937
5 1804 1925
6 1368 1978
7 1912 1949

The trick is to define different aggregation functions per column:
pd.concat([df1, df2]).groupby('id').agg({'time_begin':'min', 'time_end':'max'})
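Note that after the groupby, id ends up as the index. A minimal sketch (reusing df1 and df2 from the first answer) that restores it as a regular column:
import pandas as pd

# concatenate both frames, then aggregate per id:
# the earliest time_begin and the latest time_end win
combined = (
    pd.concat([df1, df2])
      .groupby('id')
      .agg({'time_begin': 'min', 'time_end': 'max'})
      .reset_index()  # turn the id index back into a column
)
print(combined)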

Related

Histogram of distribution of values for each day

I have a dataframe which looks like this:
df
date x1_count x2_count x3_count x4_count x5_count x6_count
0 2022-04-01 1981 0 0 0 0 0
1 2022-04-02 1434 1202 1802 1202 1102 1902
2 2022-04-03 1768 1869 1869 1869 1969 1189
3 2022-04-04 1823 1310 1210 1110 1610 1710
...
29 2022-04-30 1833 1890 1810 1830 1834 1870
I'm trying to create a histogram to see the distribution of values for each day, but the buckets of the histogram are too broad to see any detail. How can I fix this?
Below is what I attempted:
df[['date','x1_count']].set_index('date').hist()
You should be able to narrow the buckets by raising the number of bins, e.g.:
df[['date','x1_count']].set_index('date').hist(bins=30)
By default pandas uses 10 bins, so more bins means narrower buckets.
https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.hist.html
Edit:
Depending on how you want to group the dates, you could also aggregate by month and plot the counts as a bar chart:
df.groupby(df["date"].dt.month).count().plot(kind="bar")
See also: Can Pandas plot a histogram of dates?
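For reference, a minimal self-contained sketch of the bins approach; the frame is synthetic, with column names following the question:
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
df = pd.DataFrame({
    'date': pd.date_range('2022-04-01', periods=30),
    'x1_count': rng.integers(1000, 2000, size=30),
})

# more bins -> narrower buckets (pandas defaults to 10)
df.set_index('date')['x1_count'].hist(bins=30)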

Python: Dataframe, how to plot two lines?

result = df[(df['Sex']=='M')].groupby(['Year', 'Season'], as_index=False).size()
Year Season size
0 1896 Summer 380
1 1900 Summer 1903
2 1904 Summer 1285
3 1906 Summer 1722
4 1908 Summer 3054
5 1912 Summer 3953
6 1920 Summer 4158
7 1924 Summer 4989
8 1924 Winter 443
9 1928 Summer 4588
10 1928 Winter 549
11 1932 Summer 2622
12 1932 Winter 330
I need a plot with two lines, one for Winter and one for Summer, with Year on the x-axis.
So far:
result.plot.line(x='Year')
But it plots only one.
Answer:
result = df[df['Sex'] == 'M'].groupby(['Year', 'Season'], as_index=False).size()
# pivot so each Season becomes its own column; plot then draws one line per column
result2 = result.pivot_table(index='Year', columns='Season', values='size')
result2.plot.line()
Alternatively, plot each Season's series on shared axes; this should also show two lines:
result.set_index("Year", inplace=True)
result.groupby("Season")["size"].plot.line(legend=True, xlabel="Year", ylabel="Size")

python pandas add new column with values grouped count

I want to add a new column with the per-year count of rows where Points is over 700, restricted to years after 2014.
import numpy as np
import pandas as pd

ipl_data = {'Year': [2014, 2015, 2014, 2015, 2014, 2015, 2016, 2017, 2016, 2014, 2015, 2017],
            'Points': [876, 789, 863, 673, 741, 812, 756, 788, 694, 701, 804, 690]}
df = pd.DataFrame(ipl_data)

grouped = df.groupby('Year')
df.loc[(df['Points'] > 700) & (df['Year'] > 2014), 'High_points'] = df['Points']
#df['Point_per_year_gr_700'] = df.groupby(by='Year')['Points'].transform('count')
df['Point_per_year_gr_700'] = grouped['Points'].agg(np.size)
The end dataframe should look like this, but I can't get 'Point_per_year_gr_700' right:
Year Points Point_per_year_gr_700 High_points
0 2014 876 NaN NaN
1 2015 789 3 789.0
2 2014 863 NaN NaN
3 2015 673 NaN NaN
4 2014 741 NaN NaN
5 2015 812 3 812.0
6 2016 756 1 756.0
7 2017 788 1 788.0
8 2016 694 NaN NaN
9 2014 701 NaN NaN
10 2015 804 3 804.0
11 2017 690 NaN NaN
Use where to mask the DataFrame to NaN where your condition isn't met. You can use this both to create the High_points column and to exclude the rows that shouldn't count when you group by year and count how many rows satisfy the condition each year.
# keep Points only where the condition holds; otherwise NaN
df['High_points'] = df['Points'].where(df['Year'].gt(2014) & df['Points'].gt(700))
# mask out non-qualifying rows entirely, then count rows per year;
# masked rows have a NaN Year, so they drop out of the groups
df['ppy_gt700'] = (df.where(df['High_points'].notnull())
                     .groupby('Year')['Year'].transform('size'))
Year Points High_points ppy_gt700
0 2014 876 NaN NaN
1 2015 789 789.0 3.0
2 2014 863 NaN NaN
3 2015 673 NaN NaN
4 2014 741 NaN NaN
5 2015 812 812.0 3.0
6 2016 756 756.0 1.0
7 2017 788 788.0 1.0
8 2016 694 NaN NaN
9 2014 701 NaN NaN
10 2015 804 804.0 3.0
11 2017 690 NaN NaN
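An equivalent sketch without the intermediate masked frame: compute the boolean condition once, count it per year with a grouped transform, and blank out the non-qualifying rows (same data as above):
# True where the row qualifies
mask = df['Year'].gt(2014) & df['Points'].gt(700)
df['High_points'] = df['Points'].where(mask)
# per-year count of qualifying rows, shown only on qualifying rows
df['ppy_gt700'] = mask.groupby(df['Year']).transform('sum').where(mask)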

Replace missing NaN values based on the values of another column (conditions)

Hi, I would like to fill in the NaN values in area based on the value of source.
I have tried np.select, but that also overwrites the other, already-correct values:
landline_area1['area'] = np.select(area_conditions, values)
Table overview
source codes area
4 1304 1304 Dover
5 1768 1768 Penrith
6 2077 NaN NaN
7 1225 1225 Bath
8 1142 NaN NaN
Conditions:
area_conditions = [
    landline_area1['source'].str.startswith('20'),
    landline_area1['source'].str.startswith('23'),
    landline_area1['source'].str.startswith('24')]
Values:
values = [
    'London',
    'Southampton / Portsmouth',
    'Coventry']
Expected result
source codes area
4 1304 1304 Dover
5 1768 1768 Penrith
6 2077 NaN London
7 1225 1225 Bath
8 1142 NaN Sheffield
Try np.select with the source column cast to str (e.g. landline_area1['source'].astype(str).str.startswith('20')), then fill only the missing values so the correct ones are untouched:
import numpy as np
import pandas as pd

# default=None leaves unmatched rows missing, so fillna skips them
s = pd.Series(np.select(area_conditions, values, default=None),
              index=landline_area1.index)
landline_area1['area'] = landline_area1['area'].fillna(s)
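A self-contained sketch of that fix; the frame is reconstructed from the question's table (codes omitted), and the '11' -> 'Sheffield' rule is hypothetical, added only to cover row 8 of the expected result:
import numpy as np
import pandas as pd

landline_area1 = pd.DataFrame({
    'source': ['1304', '1768', '2077', '1225', '1142'],
    'area': ['Dover', 'Penrith', np.nan, 'Bath', np.nan],
}, index=[4, 5, 6, 7, 8])

src = landline_area1['source'].astype(str)
area_conditions = [
    src.str.startswith('20'),
    src.str.startswith('11'),  # hypothetical rule for the Sheffield row
]
values = ['London', 'Sheffield']

# default=None leaves unmatched rows missing, so fillna skips them
s = pd.Series(np.select(area_conditions, values, default=None),
              index=landline_area1.index)
landline_area1['area'] = landline_area1['area'].fillna(s)
print(landline_area1)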

Pandas: how to drop rows that contain invalid month/day column combinations, such as February 30th?

I have source data that uses 31 columns for day values, with a row for each month. I've melted the 31 day columns into a single day column, and now I want to combine the year, month, and day columns into a datetime(?) column so I can sort the rows by year/month/day.
After the melt, my dataframe looks like so:
year month day prcp
0 1893 1 01 0.0
1 1893 2 01 0.0
2 1893 3 01 0.0
3 1893 4 01 NaN
4 1893 5 01 NaN
5 1893 6 01 NaN
6 1893 7 01 NaN
7 1893 8 01 0.0
8 1893 9 01 10.0
9 1893 10 01 0.0
10 1893 11 01 0.0
11 1893 12 01 NaN
12 1894 1 01 NaN
13 1894 2 01 0.0
14 1894 3 01 NaN
...
Next I'm trying to create a 'time' column that I can sort on, using the year, month, and day columns as arguments to the datetime constructor. I've tried doing this using this approach:
from datetime import datetime
import numpy as np

def make_datetime(y, m, d):
    return datetime(year=y, month=m, day=d)

df['time'] = np.vectorize(make_datetime)(df['year'].astype(int), df['month'].astype(int), df['day'].astype(int))
The above isn't going to get me there since it fails in cases where the month/day columns don't make sense together, such as February 29th during non-leap years, April 31st, etc. What I think I want to do next is to somehow wrap the datetime() call in a try/catch, and when it croaks due to incompatible month/day combinations I should drop the row within the catch block. How would I go about doing that without doing a for loop over all the rows? Or is there a better way to solve this?
You can pass the year/month/day columns of your df directly to to_datetime; with errors='coerce' the impossible combinations become NaT (the sample frame here uses month=2, day=29 for the years 1890-1898):
pd.to_datetime(df[['year', 'month', 'day']], errors='coerce')
Out[905]:
0          NaT
1          NaT
2   1892-02-29
3          NaT
4          NaT
5          NaT
6   1896-02-29
7          NaT
8          NaT
dtype: datetime64[ns]
df['New'] = pd.to_datetime(df[['year', 'month', 'day']], errors='coerce')
df.dropna()
Out[907]:
   year  month  day        New
2  1892      2   29 1892-02-29
6  1896      2   29 1896-02-29
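Applied to the question's frame, a minimal sketch (assuming the melted year/month/day/prcp columns shown above) that builds the sort key and drops the impossible dates in one pass:
import pandas as pd

# errors='coerce' turns invalid combinations (Feb 30, Apr 31,
# Feb 29 in non-leap years, ...) into NaT
df['time'] = pd.to_datetime(df[['year', 'month', 'day']], errors='coerce')
# drop the impossible dates, then sort chronologically
df = df.dropna(subset=['time']).sort_values('time').reset_index(drop=True)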
Here is one way, using your suggestion of wrapping the datetime call in a try/except:
from datetime import datetime

def dater(x):
    try:
        return datetime(year=x['year'], month=x['month'], day=x['day'])
    except ValueError:
        return None

df['date'] = df.apply(dater, axis=1)
# year month day date
# 0 1890 2 29 NaT
# 1 1891 2 29 NaT
# 2 1892 2 29 1892-02-29
# 3 1893 2 29 NaT
# 4 1894 2 29 NaT
# 5 1895 2 29 NaT
# 6 1896 2 29 1896-02-29
# 7 1897 2 29 NaT
# 8 1898 2 29 NaT
df = df.dropna(subset=['date'])
# year month day date
# 2 1892 2 29 1892-02-29
# 6 1896 2 29 1896-02-29
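As a rule of thumb, the vectorized to_datetime(..., errors='coerce') route avoids the per-row Python overhead of apply, so it is usually faster on large frames; the try/except version is mainly useful when each failure needs custom handling.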
