result = df[(df['Sex']=='M')].groupby(['Year', 'Season'], as_index=False).size()
Year Season size
0 1896 Summer 380
1 1900 Summer 1903
2 1904 Summer 1285
3 1906 Summer 1722
4 1908 Summer 3054
5 1912 Summer 3953
6 1920 Summer 4158
7 1924 Summer 4989
8 1924 Winter 443
9 1928 Summer 4588
10 1928 Winter 549
11 1932 Summer 2622
12 1932 Winter 330
I need a plot with two lines, one for Winter and one for Summer, with x='Year'.
So far I have:
result.plot.line(x='Year')
But it plots only one line.
Answer:
result = df[(df['Sex']=='M')].groupby(['Year', 'Season'], as_index=False).size()
result2 = result.pivot_table(index='Year', columns='Season', values='size')
result2.plot.line()
Alternatively, try this; it should also show two lines:
result.set_index("Year", inplace=True)
result.groupby("Season")["size"].plot.line(legend=True, xlabel="Year", ylabel="Size")
I have a dataframe which looks like this:
df
date x1_count x2_count x3_count x4_count x5_count x6_count
0 2022-04-01 1981 0 0 0 0 0
1 2022-04-02 1434 1202 1802 1202 1102 1902
2 2022-04-03 1768 1869 1869 1869 1969 1189
3 2022-04-04 1823 1310 1210 1110 1610 1710
...
29 2022-04-30 1833 1890 1810 1830 1834 1870
I'm trying to create a histogram to see the distribution of values for each day, but the buckets of the histogram are too broad to see. How could I fix this?
Below is what I attempted:
df[['date','x1_count']].set_index('date').hist()
You should be able to set the number of bins manually, e.g.
df[['date','x1_count']].set_index('date').hist(bins=50)
By default pandas uses 10 bins; increasing bins makes each bucket narrower.
https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.hist.html
Edit:
Depending on how you want to group the dates, you could also group them as such:
df.groupby(df["date"].dt.month).count().plot(kind="bar")
See also: Can Pandas plot a histogram of dates?
I have the following pandas dataframe:
For each country I wish to create as many rows as the number of years it exists.
For instance, the US will have 201 rows, Canada 95 and so forth.
I thought of doing something like:
for _, row in df.iterrows():
    for year in range(row['styear'], row['endyear']):
        df = df.append(row)
Any ideas how to make this work?
You can create a new column containing the range of years, then explode that column:
# sample dataframe
import pandas as pd

df = pd.DataFrame({
    'country': ['United States', 'Canada', 'Bahamas', 'Cuba'],
    'styear': [1816, 1920, 1973, 1902],
    'endyear': [2016, 2016, 2016, 1906]
})

# one list of years per row, inclusive of both endpoints
df['allyears'] = [list(range(start, end + 1))
                  for start, end in zip(df.styear, df.endyear)]
df = df.explode('allyears')
print(df)
Output
country styear endyear allyears
0 United States 1816 2016 1816
0 United States 1816 2016 1817
0 United States 1816 2016 1818
0 United States 1816 2016 1819
0 United States 1816 2016 1820
.. ... ... ... ...
3 Cuba 1902 1906 1902
3 Cuba 1902 1906 1903
3 Cuba 1902 1906 1904
3 Cuba 1902 1906 1905
3 Cuba 1902 1906 1906
[347 rows x 4 columns]
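One caveat worth noting: after explode, allyears typically comes back with object dtype. If you need integer years downstream, cast it (a one-liner, not part of the original answer):

df['allyears'] = df['allyears'].astype(int)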
I have two dataframes:
The first:
id time_begin time_end
0 1938 1946
1 1991 1991
2 1359 1991
4 1804 1937
6 1368 1949
... ... ...
Second:
id time_begin time_end
1 1946 1946
3 1940 1954
5 1804 1925
6 1978 1978
7 1912 1949
Now, I want to combine the two dataframes in such a way that I get all rows from both. But since sometimes the row will be present in both dataframes (e.g. row 1 and 6), I want to pick the minimum time_begin of the two, and the maximum time_end for the two. Thus my expected result:
id time_begin time_end
0 1938 1946
1 1946 1991
2 1359 1991
3 1940 1954
5 1804 1925
4 1804 1937
6 1368 1978
7 1912 1949
... ... ...
How can I achieve this? Normal join/combine operations do not allow for this as far as I can tell.
You could first merge the dataframes and then use groupby with agg to pick min(time_begin) and max(time_end):

import pandas as pd

df1 = pd.DataFrame({'id': [0, 1, 2, 4, 6],
                    'time_begin': [1938, 1991, 1359, 1804, 1368],
                    'time_end': [1946, 1991, 1991, 1937, 1949]})
df2 = pd.DataFrame({'id': [1, 3, 5, 6, 7],
                    'time_begin': [1946, 1940, 1804, 1978, 1912],
                    'time_end': [1946, 1954, 1925, 1978, 1949]})

# merge
df = df1.merge(df2, how='outer')

# groupby, keeping the earliest begin and the latest end per id
df = df.groupby('id').agg({'time_begin': 'min', 'time_end': 'max'})
Output:
    time_begin  time_end
id
0         1938      1946
1         1946      1991
2         1359      1991
3         1940      1954
4         1804      1937
5         1804      1925
6         1368      1978
7         1912      1949
The trick is to define different aggregation functions per column:
pd.concat([df1, df2]).groupby('id').agg({'time_begin':'min', 'time_end':'max'})
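If you would rather have id as a regular column than as the index, the same call takes as_index=False (a minor variation on the answer above):

pd.concat([df1, df2]).groupby('id', as_index=False).agg({'time_begin': 'min', 'time_end': 'max'})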
Is there an easy way to convert from Type A to Type B?
Note: Kutools (an Excel plugin) provides a solution for this, but it is not robust and does not seem scalable.
Any workaround for this?
Assuming you can make the df look like the below (just remove the top row, which says Type A):
GDP per capita 1950 1951 1952 1953
0 Antigua and Barbuda 3544 3633 3723 3817
1 Argentina 7540 7612 7019 7198
2 Armenia 1862 1834 1914 1958
3 Aruba 3897 3994 4094 4196
4 Australia 12073 12229 12084 12228
5 Austria 6919 7382 7386 7692
Using pd.melt():
pd.melt(df, id_vars='GDP per capita', var_name='Year', value_name='GDP Value')
GDP per capita Year GDP Value
0 Antigua and Barbuda 1950 3544
1 Argentina 1950 7540
2 Armenia 1950 1862
3 Aruba 1950 3897
4 Australia 1950 12073
5 Austria 1950 6919
6 Antigua and Barbuda 1951 3633
7 Argentina 1951 7612
8 Armenia 1951 1834
9 Aruba 1951 3994
10 Australia 1951 12229
11 Austria 1951 7382
12 Antigua and Barbuda 1952 3723
13 Argentina 1952 7019
14 Armenia 1952 1914
15 Aruba 1952 4094
16 Australia 1952 12084
17 Austria 1952 7386
18 Antigua and Barbuda 1953 3817
19 Argentina 1953 7198
20 Armenia 1953 1958
21 Aruba 1953 4196
22 Australia 1953 12228
23 Austria 1953 7692
To get the exact look of the image you posted, use:
df1 = pd.melt(df, id_vars='GDP per capita', var_name='Year', value_name='GDP Value')
df1.rename(columns={'GDP per capita': 'Country'}, inplace=True)
df1['GDP'] = 'GDP per capita'
df1 = df1[['GDP', 'Country', 'Year', 'GDP Value']]
df1.to_csv('filepath+filename.csv', index=False)
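A small follow-up, in case it matters downstream: melt takes the Year values from the original column labels, so if those labels were strings the new Year column will have object dtype. Assuming you want numeric years for sorting or plotting:

df1['Year'] = df1['Year'].astype(int)
df1 = df1.sort_values(['Country', 'Year'])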
I have source data that uses 31 columns for day values, with a row for each month. I've melted the 31 day columns into a single day column, and now I want to combine the year, month, and day columns into a datetime(?) column so I can sort the rows by year/month/day.
After the melt, my dataframe looks like so:
year month day prcp
0 1893 1 01 0.0
1 1893 2 01 0.0
2 1893 3 01 0.0
3 1893 4 01 NaN
4 1893 5 01 NaN
5 1893 6 01 NaN
6 1893 7 01 NaN
7 1893 8 01 0.0
8 1893 9 01 10.0
9 1893 10 01 0.0
10 1893 11 01 0.0
11 1893 12 01 NaN
12 1894 1 01 NaN
13 1894 2 01 0.0
14 1894 3 01 NaN
...
Next I'm trying to create a 'time' column that I can sort on, using the year, month, and day columns as arguments to the datetime constructor. I've tried this approach:

from datetime import datetime
import numpy as np

def make_datetime(y, m, d):
    return datetime(year=y, month=m, day=d)

df['time'] = np.vectorize(make_datetime)(df['year'].astype(int), df['month'].astype(int), df['day'].astype(int))
The above isn't going to get me there since it fails in cases where the month/day columns don't make sense together, such as February 29th during non-leap years, April 31st, etc. What I think I want to do next is to somehow wrap the datetime() call in a try/catch, and when it croaks due to incompatible month/day combinations I should drop the row within the catch block. How would I go about doing that without doing a for loop over all the rows? Or is there a better way to solve this?
You can pass the year/month/day columns of your df directly to to_datetime (subset to just those three, since extra columns such as prcp would raise an error):
pd.to_datetime(df[['year', 'month', 'day']], errors='coerce')
Out[905]:
# NaT
# NaT
# 1892-02-29
# NaT
# NaT
# NaT
# 1896-02-29
# NaT
# NaT
dtype: datetime64[ns]
df['New'] = pd.to_datetime(df[['year', 'month', 'day']], errors='coerce')
df.dropna(subset=['New'])
Out[907]:
year month day New
# 1892 2 29 1892-02-29
# 1896 2 29 1896-02-29
Here is one way using your suggestion of wrapping in a try / except clause.
from datetime import datetime
def dater(x):
    try:
        return datetime(year=x['year'], month=x['month'], day=x['day'])
    except ValueError:
        return None

df['date'] = df.apply(dater, axis=1)
# year month day date
# 0 1890 2 29 NaT
# 1 1891 2 29 NaT
# 2 1892 2 29 1892-02-29
# 3 1893 2 29 NaT
# 4 1894 2 29 NaT
# 5 1895 2 29 NaT
# 6 1896 2 29 1896-02-29
# 7 1897 2 29 NaT
# 8 1898 2 29 NaT
df = df.dropna(subset=['date'])
# year month day date
# 2 1892 2 29 1892-02-29
# 6 1896 2 29 1896-02-29
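Putting the pieces together for the melted frame from the question (columns year, month, day, prcp as shown above), a minimal sketch that builds the time column, drops only the impossible dates, and sorts chronologically, which was the original goal:

import pandas as pd

# assemble datetimes from the three columns; impossible dates become NaT
df['time'] = pd.to_datetime(df[['year', 'month', 'day']], errors='coerce')
# drop rows whose date failed to parse, then sort by date
df = df.dropna(subset=['time']).sort_values('time')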