Histogram of the distribution of values for each day - Python

I have a dataframe which looks like this:
df
date x1_count x2_count x3_count x4_count x5_count x6_count
0 2022-04-01 1981 0 0 0 0 0
1 2022-04-02 1434 1202 1802 1202 1102 1902
2 2022-04-03 1768 1869 1869 1869 1969 1189
3 2022-04-04 1823 1310 1210 1110 1610 1710
...
29 2022-04-30 1833 1890 1810 1830 1834 1870
I'm trying to create a histogram to see the distribution of values for each day, but the buckets of the histogram are too broad to be useful. How can I fix this?
Below is what I attempted:
df[['date','x1_count']].set_index('date').hist()

You should be able to set the bins of the histogram manually, e.g.
df[['date','x1_count']].set_index('date').hist(bins=30)
By default pandas uses bins=10, so passing a larger number gives narrower buckets.
https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.hist.html
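For a concrete illustration, here is a minimal, self-contained sketch with made-up data (the full dataframe isn't shown in the question) of how a larger bins value narrows the buckets:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

# Made-up daily counts standing in for the x1_count column in the question
rng = np.random.default_rng(0)
df = pd.DataFrame({
    "date": pd.date_range("2022-04-01", periods=30, freq="D"),
    "x1_count": rng.integers(1000, 2000, size=30),
})

# The default is bins=10; a larger value gives narrower buckets
df.set_index("date")["x1_count"].hist(bins=30)
plt.xlabel("x1_count")
plt.ylabel("frequency")
plt.show()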
Edit:
Depending on how you want to group the dates, you could also group them as such:
df.groupby(df["date"].dt.month).count().plot(kind="bar")
Can Pandas plot a histogram of dates?

Related

How to get the body of the table using Python?

I am self-learning web scraping and I am trying to get the tbody from a table with BeautifulSoup.
My attempt:
import requests
from bs4 import BeautifulSoup

url = 'https://www.agrolok.pl/notowania/notowania-cen-pszenicy.htm'
page = requests.get(url).content
soup = BeautifulSoup(page, 'lxml')
table = soup.findAll('table', class_='hover')
print(table)
That's what I get:
<table class="hover"></table>
Any hints are highly appreciated.
The table with class 'hover' that holds the table data (tbody, tr, td and so on) is rendered dynamically with JavaScript, which is why you are not getting a tbody from requests alone. You can mimic a real browser with Selenium and then parse the rendered page with pandas/bs4. I use Selenium with pandas.
Script:
import time
import pandas as pd
from bs4 import BeautifulSoup
from selenium import webdriver
from selenium.webdriver.chrome.service import Service
from webdriver_manager.chrome import ChromeDriverManager

# Launch Chrome and load the page so the JavaScript can render the table
driver = webdriver.Chrome(service=Service(ChromeDriverManager().install()))
driver.get('https://www.agrolok.pl/notowania/notowania-cen-pszenicy.htm')
driver.maximize_window()
time.sleep(2)

# Parse the rendered HTML and let pandas read the first table it finds
soup = BeautifulSoup(driver.page_source, 'lxml')
df = pd.read_html(str(soup))[0]
driver.quit()

# Promote the first row to the column headers and drop it from the data
d = df.rename(columns=df.iloc[0]).drop(df.index[0])
print(d)
Output:
7/4/2022 1410 1380 343.25 4.7002 1613 1640
1 7/1/2022 1410 1300 334.50 4.7176 1578 1630
2 6/30/2022 1410 1320 350.25 4.6806 1639 1650
3 6/29/2022 1500 1380 358.50 4.6809 1678 1710
4 6/28/2022 1450 1360 356.75 4.7004 1677 1690
5 6/27/2022 1450 1360 350.00 4.6965 1644 1690
6 6/24/2022 1450 1360 357.25 4.7094 1682 1700
7 6/23/2022 1450 1360 359.00 4.7096 1691 1690
8 6/22/2022 1470 1410 370.50 4.6590 1726 1750
9 6/21/2022 1500 1370 372.50 4.6460 1731 1730
10 6/20/2022 1540 1460 388.25 4.6731 1814 1780
11 6/15/2022 1560 1460 392.75 4.6642 1832 1780
12 6/14/2022 1560 1460 392.25 4.6548 1826 1780
13 6/13/2022 1540 1460 394.50 4.6313 1827 1800
14 6/10/2022 1530 1450 391.75 4.6030 1803 1760
15 6/9/2022 1540 1500 386.25 4.5826 1770 1730
16 6/8/2022 1550 1520 381.75 4.5817 1749 1730
17 6/7/2022 1500 1540 385.50 4.5855 1768 1700
18 6/6/2022 1600 1510 397.50 4.5880 1824 1760
19 6/3/2022 1560 1490 378.25 4.5908 1736 1700
20 6/2/2022 1590 1490 382.50 4.5876 1755 1710
21 6/1/2022 1590 1490 380.50 4.5891 1746 1700
22 5/31/2022 1650 1560 392.25 4.5756 1795 1750
23 5/30/2022 1670 1590 406.75 4.5869 1866 1800
24 5/27/2022 1670 1580 414.75 4.6102 1912 1700
25 5/26/2022 1650 1580 409.50 4.6135 1889 1700
26 5/25/2022 1670 1600 404.50 4.5955 1859 1700
27 5/24/2022 1690 1630 410.50 4.6107 1893 1800
28 5/23/2022 1700 1600 426.00 4.6171 1966 1860
29 5/20/2022 1700 1630 420.75 4.6366 1951 1840
30 5/19/2022 1700 1640 422.25 4.6429 1960 1850
31 5/18/2022 1700 1640 430.50 4.6528 2003 1850
32 5/17/2022 1690 1640 438.25 4.6558 2040 1850
33 5/16/2022 1690 1640 438.25 4.6724 2048 1880
34 5/13/2022 1670 1560 416.50 4.6679 1944 1800
35 5/12/2022 1670 1540 414.25 4.6841 1940 1790
36 5/11/2022 1670 1540 403.25 4.6700 1883 1790
37 5/10/2022 1680 1560 396.50 4.6761 1854 1780
38 5/9/2022 1670 1560 394.50 4.7059 1856 1780
39 5/6/2022 1600 1580 406.25 4.6979 1909 1760
40 5/5/2022 1660 1610 401.00 4.6658 1871 1780
41 5/4/2022 1660 1630 390.50 4.6777 1827 1735
42 4/29/2022 1660 1630 400.75 4.6582 1867 1720
43 4/28/2022 1670 1640 416.50 4.6915 1954 1740
44 4/27/2022 1670 1630 418.25 4.7076 1969 1720
45 4/26/2022 1660 1640 415.25 4.6429 1928 1685
46 4/25/2022 1665 1630 408.25 4.6405 1894 1670
47 4/22/2022 1665 1650 407.00 4.6361 1887 1690
48 4/21/2022 1660 1650 405.75 4.6523 1888 1690
49 4/20/2022 1660 1660 398.50 4.6295 1845 1700
50 4/19/2022 1680 1660 399.50 4.6361 1852 1740
51 4/15/2022 1690 1660 401.00 4.6378 1860 1770
52 4/14/2022 1690 1660 401.00 4.6447 1863 1770
53 4/13/2022 1680 1630 403.00 4.6460 1872 1780
54 4/12/2022 1650 1620 399.25 4.6626 1862 1700
55 4/11/2022 1630 1590 379.50 4.6451 1763 1670
56 4/8/2022 1650 1610 372.75 4.6405 1730 1660
57 4/7/2022 1650 1610 363.75 4.6478 1691 1670
58 4/6/2022 1650 1600 364.00 4.6539 1694 1670
59 4/5/2022 1650 1620 364.50 4.6317 1688 1680
60 4/4/2022 1640 1610 363.75 4.6373 1687 1680
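As a side note, the fixed time.sleep(2) can be flaky on slow connections; a hedged alternative is to wait explicitly until the table rows have been rendered before grabbing the page source. The table.hover selector below is an assumption based on the class shown in the question:
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC

# Wait up to 10 seconds for at least one data row to appear inside the table;
# "table.hover" is assumed from the class attribute quoted in the question
WebDriverWait(driver, 10).until(
    EC.presence_of_element_located((By.CSS_SELECTOR, "table.hover tbody tr"))
)
# After this, driver.page_source contains the populated table as in the script above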
Another option, once you have the fully rendered HTML, is to locate the table directly with BeautifulSoup and walk its rows:
soup = BeautifulSoup(HTML, 'html.parser')
# the first argument to find tells it what tag to search for;
# as the second you can pass a dict of attr->value pairs to filter
# results that match the first tag
table = soup.find("table", {"title": "TheTitle"})
rows = []
for row in table.findAll("tr"):
    rows.append(row)
# now rows contains each tr in the table (as a BeautifulSoup object)
# and you can search them to pull out the times
tbody = table.find('tbody')
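If only the cell values are needed rather than BeautifulSoup objects, a small sketch along the same lines (reusing the rows collected above) could be:
# Pull the plain text out of each data cell
data = []
for row in rows:
    cells = [td.get_text(strip=True) for td in row.find_all("td")]
    if cells:  # header rows contain only <th> cells and are skipped
        data.append(cells)
print(data[:3])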

Python: Dataframe, how to plot two lines?

result = df[(df['Sex']=='M')].groupby(['Year', 'Season'], as_index=False).size()
Year Season size
0 1896 Summer 380
1 1900 Summer 1903
2 1904 Summer 1285
3 1906 Summer 1722
4 1908 Summer 3054
5 1912 Summer 3953
6 1920 Summer 4158
7 1924 Summer 4989
8 1924 Winter 443
9 1928 Summer 4588
10 1928 Winter 549
11 1932 Summer 2622
12 1932 Winter 330
I need a plot with two lines, one for Winter and one for Summer, with Year on the x-axis.
So far:
result.plot.line(x='Year')
But it plots only one.
Answer:
result = df[(df['Sex']=='M')].groupby(['Year', 'Season'], as_index=False).size()
result2 = result.pivot_table(index='Year', columns='Season', values='size')
result2.plot.line()
Please try this; it should show two lines:
result.set_index("Year", inplace=True)
result.groupby("Season")["size"].plot.line(legend=True, xlabel="Year", ylabel="Size")
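For completeness, a self-contained sketch of the pivot approach, with the grouped rows from the question typed in directly (values copied, not recomputed):
import pandas as pd
import matplotlib.pyplot as plt

# The grouped data shown in the question
result = pd.DataFrame({
    "Year": [1896, 1900, 1904, 1906, 1908, 1912, 1920, 1924, 1924, 1928, 1928, 1932, 1932],
    "Season": ["Summer"] * 8 + ["Winter", "Summer", "Winter", "Summer", "Winter"],
    "size": [380, 1903, 1285, 1722, 3054, 3953, 4158, 4989, 443, 4588, 549, 2622, 330],
})

# Pivot so each Season becomes its own column, then plot one line per column
result.pivot_table(index="Year", columns="Season", values="size").plot.line()
plt.ylabel("size")
plt.show()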

Common values between multiple dataframes with different length

I have 3 huge dataframes that have different numbers of values.
For example:
A B C
2981 2952 1287
2759 2295 2952
1284 2235 1284
1295 1928 0887
2295 1284 1966
1567 1928
1287 2374
2846
2578
I want to find the common values between the three columns like this
A B C Common
2981 2952 1287 1284
2759 2295 2952 2295
1284 2235 1284
1295 1928 0887
2295 1284 1966
1567 2295
1287 2374
2846
2578
I tried (from here)
df1['Common'] = np.intersect1d(df1.A, np.intersect1d(df2.B, df3.C))
but I get this error: ValueError: Length of values does not match length of index
The idea is to create a Series whose index is df1's index sliced to the length of the array:
a = np.intersect1d(df1.A, np.intersect1d(df2.B, df3.C))
df1['Common'] = pd.Series(a, index=df1.index[:len(a)])
If the columns are in the same DataFrame:
a = np.intersect1d(df1.A, np.intersect1d(df1.B, df1.C))
df1['Common'] = pd.Series(a, index=df1.index[:len(a)])
print (df1)
A B C Common
0 2981.0 2952.0 1287 1284.0
1 2759.0 2295.0 2952 2295.0
2 1284.0 2235.0 1284 NaN
3 1295.0 1928.0 887 NaN
4 2295.0 1284.0 1966 NaN
5 NaN 1567.0 2295 NaN
6 NaN 1287.0 2374 NaN
7 NaN NaN 2846 NaN
8 NaN NaN 2578 NaN
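For the three-dataframe case, a minimal runnable sketch of the same idea, using the example values from the printed output above (each column in its own dataframe, as in the question):
import numpy as np
import pandas as pd

# Example values copied from the output above
df1 = pd.DataFrame({"A": [2981, 2759, 1284, 1295, 2295]})
df2 = pd.DataFrame({"B": [2952, 2295, 2235, 1928, 1284, 1567, 1287]})
df3 = pd.DataFrame({"C": [1287, 2952, 1284, 887, 1966, 2295, 2374, 2846, 2578]})

# Values present in all three columns (sorted and deduplicated)
a = np.intersect1d(df1.A, np.intersect1d(df2.B, df3.C))  # -> [1284, 2295]

# The intersection is usually shorter than df1, so pad it with NaN via a Series
df1["Common"] = pd.Series(a, index=df1.index[:len(a)])
print(df1)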

Pandas: combine two dataframes with same columns by picking values

I have two dataframes:
The first:
id time_begin time_end
0 1938 1946
1 1991 1991
2 1359 1991
4 1804 1937
6 1368 1949
... ... ...
Second:
id time_begin time_end
1 1946 1946
3 1940 1954
5 1804 1925
6 1978 1978
7 1912 1949
Now, I want to combine the two dataframes in such a way that I get all rows from both. But since sometimes the row will be present in both dataframes (e.g. row 1 and 6), I want to pick the minimum time_begin of the two, and the maximum time_end for the two. Thus my expected result:
id time_begin time_end
0 1938 1946
1 1946 1991
2 1359 1991
3 1940 1954
5 1804 1925
4 1804 1937
6 1368 1978
7 1912 1949
... ... ...
How can I achieve this? Normal join/combine operations do not allow for this as far as I can tell.
You could first merge the dataframes and then use groupby with agg in order to pick min(time_begin) and max(time_end)
import pandas as pd

df1 = pd.DataFrame({'id': [0, 1, 2, 4, 6],
                    'time_begin': [1938, 1991, 1359, 1804, 1368],
                    'time_end': [1946, 1991, 1991, 1937, 1949]})
df2 = pd.DataFrame({'id': [1, 3, 5, 6, 7],
                    'time_begin': [1946, 1940, 1804, 1978, 1912],
                    'time_end': [1946, 1954, 1925, 1978, 1949]})

# merge all rows from both dataframes
df = df1.merge(df2, how='outer')
# groupby id, taking the earliest begin and the latest end per id
df = df.groupby('id').agg({'time_begin': 'min', 'time_end': 'max'})
Output:
    time_begin  time_end
id
0         1938      1946
1         1946      1991
2         1359      1991
3         1940      1954
4         1804      1937
5         1804      1925
6         1368      1978
7         1912      1949
The trick is to define different aggregation functions per column:
pd.concat([df1, df2]).groupby('id').agg({'time_begin':'min', 'time_end':'max'})
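As a quick check, applying this one-liner to the df1 and df2 built above and calling reset_index() reproduces the flat layout from the question:
combined = (
    pd.concat([df1, df2])
      .groupby('id')
      .agg({'time_begin': 'min', 'time_end': 'max'})
      .reset_index()
)
print(combined)
# id 1 -> 1946..1991 and id 6 -> 1368..1978, matching the expected result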

Type issue with histogram plot in Python

I have some data:
t1['Time Delay']
Out[175]:
746 0 days
747 0 days
873 0 days
874 0 days
906 8 days
907 0 days
908 0 days
909 0 days
Name: Time Delay, dtype: timedelta64[ns]
And another column:
t1['Outcome']
Out[176]:
746 0.0
747 0.0
758 0.0
762 0.0
1422 1.0
1685 0.0
1909 0.0
1913 0.0
Name: Outcome, dtype: float64
I'm trying to plot them as histograms based on pandas - histogram from two columns?, but a bunch of type issues come up.
If I check type(t1['Time Delay']), it gives pandas.core.series.Series. Same for the other column.
Do I need to convert the timedelta to float?
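A hedged way around this kind of type issue, assuming the delay should be measured in days, is to convert the timedelta column to a plain number before plotting, e.g. with .dt.days or .dt.total_seconds():
import matplotlib.pyplot as plt

# t1 is the dataframe from the question; turn the timedelta column into a float number of days
delay_days = t1['Time Delay'].dt.total_seconds() / 86400  # or t1['Time Delay'].dt.days for whole days

fig, axes = plt.subplots(1, 2, figsize=(10, 4))
delay_days.hist(ax=axes[0], bins=20)
axes[0].set_xlabel('Time Delay (days)')
t1['Outcome'].hist(ax=axes[1], bins=2)
axes[1].set_xlabel('Outcome')
plt.show()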
