How to merge two rows in a DataFrame - Python

In a DataFrame, how can I merge two rows? For example, merge row 148 with row 142 into a new row and drop the two original rows.
title collectionsCount subscribersCount entriesCount viewsCount
148 Android 697977 100213 6803 10610138
142 Java 103821 65303 1493 1590201
161 iOS 163137 65896 3601 3739843
177 JavaScript 222100 88872 2412 3548736
16 Python 45234 45100 1007 930588
162 Swift 28498 30317 1180 928488
20 PHP 15376 25143 375 329720
62 Go 5321 12881 179 145851
41 C++ 3495 18404 101 75019
17 C 2213 14870 50 52019
63 Ruby 1543 6711 40 45162

You can use pandas.Series.replace to replace Android with Java, then use pandas.DataFrame.groupby to aggregate the data.
This should work:
rules = {'Android':'Java'}
# Assign the result back instead of calling replace(..., inplace=True) on a column selection
df['title'] = df['title'].replace(rules)
df = df.groupby('title').sum().reset_index()
print(df)
Output:
title collectionsCount subscribersCount entriesCount viewsCount
0 C 2213 14870 50 52019
1 C++ 3495 18404 101 75019
2 Go 5321 12881 179 145851
3 Java 801798 165516 8296 12200339
4 JavaScript 222100 88872 2412 3548736
5 PHP 15376 25143 375 329720
6 Python 45234 45100 1007 930588
7 Ruby 1543 6711 40 45162
8 Swift 28498 30317 1180 928488
9 iOS 163137 65896 3601 3739843
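If you would rather merge the rows by their index labels (148 and 142) than by renaming a title, a minimal sketch could look like the following. The three-row frame is a cut-down copy of the question's data, and the label given to the merged row ("Java") is an assumption about what the combined row should be called.

```python
import pandas as pd

# Cut-down copy of the question's data, keeping the original index labels
df = pd.DataFrame(
    {
        "title": ["Android", "Java", "iOS"],
        "collectionsCount": [697977, 103821, 163137],
        "viewsCount": [10610138, 1590201, 3739843],
    },
    index=[148, 142, 161],
)

to_merge = [148, 142]                   # index labels of the rows to combine
numeric_cols = df.columns.drop("title")

# Sum the numeric columns of the two rows and attach the chosen title
new_row = {"title": "Java", **df.loc[to_merge, numeric_cols].sum().to_dict()}

# Drop the two original rows and append the merged one
result = pd.concat(
    [df.drop(index=to_merge), pd.DataFrame([new_row])],
    ignore_index=True,
)
print(result)
```

This avoids touching the other rows, at the cost of renumbering the index (ignore_index=True).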

Related

Pandas interpolation creating a horizontal line

This is a continuation of the following question:
Plot a line on a curve that is undersampled
I tried the solution provided, but with real data, and I am getting a straight line. The full data is pasted below:
mdcol, tvdcol = 'md_m', 'tvd_m'
df = df[[mdcol, tvdcol]].copy().set_index(mdcol)
df = df[~df.index.duplicated()]
data_intp = (df.reindex(index = range(int(df.index.min()), int(df.index.max())))
.reset_index() # optional, you could write 'index' in the second line plot, too.
.interpolate()
)
data_intp
Dataframe is shown below:
md_m tvd_m
0 0.00 0.00
1 281.00 281.00
2 300.00 300.00
3 330.00 330.00
4 360.00 360.00
5 390.00 390.00
6 420.00 420.00
7 450.00 450.00
8 480.00 480.00
9 510.00 510.00
10 540.00 539.99
11 570.00 569.98
12 600.00 599.97
13 630.00 629.94
14 660.00 659.91
15 690.00 689.88
16 720.00 719.84
17 750.00 749.80
18 780.00 779.75
19 810.00 809.69
20 840.00 839.58
21 870.00 869.34
22 900.00 898.90
23 930.00 928.19
24 950.00 947.55
25 960.00 957.18
26 970.00 966.76
27 980.00 976.32
28 990.00 985.83
29 1000.00 995.32
30 1010.00 1004.77
31 1020.00 1014.20
32 1030.00 1023.60
33 1040.00 1032.96
34 1050.00 1042.29
35 1060.00 1051.56
36 1070.00 1060.78
37 1080.00 1069.94
38 1090.00 1079.05
39 1100.00 1088.11
40 1110.00 1097.12
41 1120.00 1106.10
42 1130.00 1115.03
43 1140.00 1123.91
44 1150.00 1132.73
45 1160.00 1141.48
46 1170.00 1150.17
47 1180.00 1158.80
48 1190.00 1167.37
49 1200.00 1175.86
50 1210.00 1184.28
51 1220.00 1192.61
52 1230.00 1200.88
53 1240.00 1209.09
54 1250.00 1217.24
55 1260.00 1225.36
56 1270.00 1233.50
57 1280.00 1241.70
58 1290.00 1249.95
59 1300.00 1258.22
60 1310.00 1266.50
61 1320.00 1274.79
62 1330.00 1283.11
63 1340.00 1291.46
64 1350.00 1299.84
65 1360.00 1308.23
66 1370.00 1316.64
67 1380.00 1325.08
68 1390.00 1333.55
69 1400.00 1342.05
70 1410.00 1350.59
71 1420.00 1359.16
72 1430.00 1367.75
73 1440.00 1376.37
74 1450.00 1385.00
75 1460.00 1393.65
76 1470.00 1402.31
77 1480.00 1411.01
78 1490.00 1419.75
79 1500.00 1428.51
80 1510.00 1437.30
81 1520.00 1446.11
82 1530.00 1454.92
83 1540.00 1463.71
84 1550.00 1472.46
85 1560.00 1481.20
86 1570.00 1489.93
87 1580.00 1498.65
88 1590.00 1507.37
89 1600.00 1516.09
90 1610.00 1524.84
91 1620.00 1533.62
92 1630.00 1542.40
93 1640.00 1551.18
94 1650.00 1559.96
95 1660.00 1568.74
96 1670.00 1577.53
97 1680.00 1586.29
98 1690.00 1595.01
99 1700.00 1603.69
100 1710.00 1612.36
101 1720.00 1621.02
102 1730.00 1629.66
103 1740.00 1638.27
104 1750.00 1646.84
105 1760.00 1655.35
106 1770.00 1663.83
107 1780.00 1672.27
108 1790.00 1680.65
109 1800.00 1688.97
110 1810.00 1697.23
111 1820.00 1705.42
112 1830.00 1713.54
113 1840.00 1721.60
114 1850.00 1729.61
115 1860.00 1737.63
116 1870.00 1745.66
117 1880.00 1753.69
118 1890.00 1761.72
119 1900.00 1769.70
120 1910.00 1777.61
121 1920.00 1785.44
122 1930.00 1793.20
123 1940.00 1800.86
124 1950.00 1808.43
125 1960.00 1815.92
126 1970.00 1823.31
127 1980.00 1830.62
128 1990.00 1837.83
129 2000.00 1844.95
130 2010.00 1851.96
131 2020.00 1858.89
132 2030.00 1865.76
133 2040.00 1872.58
134 2050.00 1879.35
135 2060.00 1886.05
136 2070.00 1892.70
137 2080.00 1899.28
138 2090.00 1905.78
139 2100.00 1912.20
140 2110.00 1918.50
141 2120.00 1924.66
142 2130.00 1930.68
143 2140.00 1936.57
144 2150.00 1942.34
145 2160.00 1947.97
146 2170.00 1953.47
147 2180.00 1958.83
148 2190.00 1964.06
149 2200.00 1969.16
150 2210.00 1974.12
151 2220.00 1978.93
152 2230.00 1983.63
153 2240.00 1988.25
154 2250.00 1992.78
155 2260.00 1997.23
156 2270.00 2001.60
157 2280.00 2005.87
158 2290.00 2010.06
159 2300.00 2014.15
160 2310.00 2018.12
161 2320.00 2021.97
162 2330.00 2025.68
163 2340.00 2029.25
164 2373.20 2039.67
165 2401.60 2047.31
166 2430.80 2054.90
167 2459.70 2062.45
168 2488.30 2069.84
169 2488.30 2069.88
170 2489.97 2070.30
171 2493.30 2071.11
172 2503.50 2073.51
173 2519.97 2077.32
174 2549.97 2083.99
175 2563.51 2086.88
176 2579.97 2090.18
177 2609.97 2095.34
178 2639.97 2099.36
179 2662.86 2101.68
180 2752.86 2109.47
181 2759.97 2110.08
182 2789.97 2112.33
183 2819.97 2114.10
184 2849.97 2115.39
185 2879.97 2116.19
186 2902.87 2116.48
187 2909.96 2116.53
188 2939.96 2116.72
189 2969.96 2116.92
190 2999.96 2117.11
191 3029.96 2117.31
192 3059.96 2117.51
193 3089.96 2117.70
194 3119.96 2117.90
195 3149.96 2118.09
196 3179.96 2118.29
197 3209.96 2118.49
198 3239.96 2118.68
199 3252.87 2118.76
200 3352.87 2119.41
201 3359.96 2119.45
202 3389.96 2119.65
203 3419.96 2119.84
204 3449.96 2120.04
205 3479.96 2120.23
206 3509.96 2120.43
207 3539.96 2120.62
208 3569.96 2120.82
209 3599.96 2121.01
210 3629.96 2121.21
211 3652.87 2121.35
212 3779.95 2122.17
213 3852.87 2122.64
Plotting shows the horizontal line:
TOOLS = ["box_zoom", "reset", "save", "crosshair", "pan", "wheel_zoom" ,"lasso_select"]
p = figure(plot_width = 1600,
plot_height = 800, tools = TOOLS,
title = 'Well Survey', toolbar_location='above')
p.line(data_intp[mdcol], data_intp[tvdcol], line_width = 2, color='red')
show(p)
Not sure where the interpolation is going wrong. I was hoping it would interpolate between the points. Anyone have an idea what I am doing wrong here?
The issue with your data (and this proposed solution) is that you have a single duplicate value in md_m:
chk = df.groupby('md_m').agg({'tvd_m':['count',lambda x: list(x)]})
print(chk[chk['tvd_m']['count']>1])
returns:
Pandas cannot "reindex from a duplicate axis", which is what this approach relies on. Linear interpolation also doesn't really work when two identical x values have two distinct y values.
An extra layer of QA could be done on the input data: inspect it beforehand, as in my snippet, and take a groupby average or something similar if appropriate.
The only other thing I'd point out is that using a range (i.e. integers) for the reindexing is unnecessary: you should be able to reindex with floats at any step size you want.
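As a rough illustration of reindexing on a float grid, here is a minimal sketch. The six sample points, the 0.5 m step, and the choice to average duplicate depths are all assumptions made for the example, not part of the original data handling.

```python
import numpy as np
import pandas as pd

# Small hypothetical sample in the shape of the survey data above
raw = pd.DataFrame({
    "md_m": [0.0, 281.0, 300.0, 2488.30, 2488.30, 2489.97],
    "tvd_m": [0.0, 281.0, 300.0, 2069.84, 2069.88, 2070.30],
})

# Collapse duplicate depths by averaging their tvd values
tvd = raw.groupby("md_m")["tvd_m"].mean()

# Regular float grid with an assumed step of 0.5 m
step = 0.5
grid = np.arange(tvd.index.min(), tvd.index.max(), step)

# Keep the measured depths in the index so they anchor the interpolation,
# then interpolate using the index values as x coordinates
combined = tvd.reindex(tvd.index.union(grid)).interpolate(method="index")

# Keep only the regular grid points
data_intp = combined.reindex(grid).rename_axis("md_m").reset_index()
print(data_intp.head())
```

Because the measured depths stay in the index while interpolating, no precision is lost by rounding md_m to integers first.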
Thanks for the answer gmerrit123, but I believe I removed that error by using this line:
df = df[~df.index.duplicated()]
What solved it for me was converting the md column to int, as the reindexing was running with a step of 1 meter, but the raw data had two decimal places.
mdcol, tvdcol = 'md_m', 'tvd_m'
df = data[[mdcol, tvdcol]].copy()
df[mdcol] = df[mdcol].astype('int')
df = df.set_index(mdcol)
df = df[~df.index.duplicated()]
data_intp = (df.reindex(index = range(int(df.index.min()), int(df.index.max()+ 1)))
.reset_index()
.interpolate()
)

How to scrape tbody from a collapsible table using BeautifulSoup library?

Recently I did a project based on a COVID-19 dashboard, where I scrape data from this website, which has a collapsible table. Everything was OK until recently, when the Heroku app started showing errors. I reran my code on my local machine and the error occurred when scraping tbody. I then figured out that the site I scrape has changed or updated the way the table looks, and my code is no longer able to grab it. I tried viewing the page source and could not find the table (tbody) there. I can find tbody and all the data if I inspect a row of the table, but not in the page source. How can I scrape the table now?
My code:
The table I have to grab:
The data you see on the page is loaded from an external URL via Ajax. You can use the requests and json modules to load it:
import json
import requests
url = 'https://www.mohfw.gov.in/data/datanew.json'
data = requests.get(url).json()
# uncomment to print all data:
# print(json.dumps(data, indent=4))
# print some data on screen:
for d in data:
    print('{:<30} {:<10} {:<10} {:<10} {:<10}'.format(d['state_name'], d['active'], d['positive'], d['cured'], d['death']))
Prints:
Andaman and Nicobar Islands 329 548 214 5
Andhra Pradesh 75720 140933 63864 1349
Arunachal Pradesh 670 1591 918 3
Assam 9814 40269 30357 98
Bihar 17579 51233 33358 296
Chandigarh 369 1051 667 15
Chhattisgarh 2803 9086 6230 53
... and so on.
Try:
import json
import requests
import pandas as pd
data = []
r = requests.get('https://www.mohfw.gov.in/data/datanew.json')
j = json.loads(r.text)
# Build one row of values per record
for i in j:
    row = []
    for k in i:
        row.append(i[k])
    data.append(row)
columns = [i for i in j[0]]
df = pd.DataFrame(data, columns=columns)
# Convert sno to numbers so the sort is numeric rather than alphabetical
df['sno'] = pd.to_numeric(df['sno'], errors='coerce')
df = df.sort_values('sno')
print(df.to_string())
prints:
sno state_name active positive cured death new_active new_positive new_cured new_death state_code
0 0 Andaman and Nicobar Islands 329 548 214 5 403 636 226 7 35
1 1 Andhra Pradesh 75720 140933 63864 1349 72188 150209 76614 1407 28
2 2 Arunachal Pradesh 670 1591 918 3 701 1673 969 3 12
3 3 Assam 9814 40269 30357 98 10183 41726 31442 101 18
4 4 Bihar 17579 51233 33358 296 18937 54240 34994 309 10
5 5 Chandigarh 369 1051 667 15 378 1079 683 18 04
6 6 Chhattisgarh 2803 9086 6230 53 2720 9385 6610 55 22
7 7 Dadra and Nagar Haveli and Daman and Diu 412 1100 686 2 418 1145 725 2 26
8 8 Delhi 10705 135598 120930 3963 10596 136716 122131 3989 07
9 9 Goa 1657 5913 4211 45 1707 6193 4438 48 30
10 10 Gujarat 14090 61438 44907 2441 14300 62463 45699 2464 24
and so on...
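Since each record in the JSON is a dictionary, the row-building loop can probably be skipped. The sketch below uses three hard-coded records instead of a live request; the values are copied from the output above, and storing them as strings is an assumption about how the feed encodes them.

```python
import pandas as pd

# Hard-coded sample records standing in for requests.get(url).json()
# (string values are an assumption about the feed's encoding)
records = [
    {"state_name": "Assam", "active": "9814", "positive": "40269", "cured": "30357", "death": "98"},
    {"state_name": "Chandigarh", "active": "369", "positive": "1051", "cured": "667", "death": "15"},
    {"state_name": "Bihar", "active": "17579", "positive": "51233", "cured": "33358", "death": "296"},
]

# A list of dicts maps directly onto rows and columns
df = pd.DataFrame(records)

# Convert the count columns to numbers so sorting is numeric
count_cols = ["active", "positive", "cured", "death"]
df[count_cols] = df[count_cols].apply(pd.to_numeric, errors="coerce")

df = df.sort_values("active", ascending=False).reset_index(drop=True)
print(df.to_string())
```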

Trying to lookup a value from a pandas dataframe within a range of two rows in the index dataframe

I have two dataframes - "grower_moo" and "pricing" in a Python notebook to analyze harvested crops and price payments to the growers.
pricing is the index dataframe, and grower_moo has various unique load tickets with information about each load.
I need to pull the price per ton from the pricing index to a new column in the load data if the Fat of that load is not greater than the next Wet Fat.
Below is a .head() sample of each dataframe and the code I tried. I received a ValueError: Can only compare identically-labeled Series objects error.
pricing
Price_Per_Ton Wet_Fat
0 306 10
1 339 11
2 382 12
3 430 13
4 481 14
5 532 15
6 580 16
7 625 17
8 665 18
9 700 19
10 728 20
11 750 21
12 766 22
13 778 23
14 788 24
15 797 25
grower_moo
Load Ticket Net Fruit Weight Net MOO Percent_MOO Fat
0 L2019000011817 56660 833 1.448872 21.92
1 L2019000011816 53680 1409 2.557679 21.12
2 L2019000011815 53560 1001 1.834644 21.36
3 L2019000011161 62320 2737 4.207080 21.41
4 L2019000011160 57940 1129 1.911324 20.06
grower_moo['price_per_ton'] = max(pricing[pricing['Wet_Fat'] < grower_moo['Fat']]['Price_Per_Ton'])
Example output - grower_moo['Fat'] of 13.60 is less than 14 Fat, therefore gets a price per ton of $430
grower_moo_with_price
Load Ticket Net Fruit Weight Net MOO Percent_MOO Fat price_per_ton
0 L2019000011817 56660 833 1.448872 21.92 750
1 L2019000011816 53680 1409 2.557679 21.12 750
2 L2019000011815 53560 1001 1.834644 21.36 750
3 L2019000011161 62320 2737 4.207080 21.41 750
4 L2019000011160 57940 1129 1.911324 20.06 728
This looks like a job for an "as of" merge, pd.merge_asof (documentation):
This is similar to a left-join except that we match on nearest key
rather than equal keys. Both DataFrames must be sorted by the key.
For each row in the left DataFrame:
A "backward" search [the default]
selects the last row in the right DataFrame whose ‘on’ key is less
than or equal to the left’s key.
In the following code, I use your example inputs, but with underscores (_) instead of spaces in the column names.
# Required by merge_asof: sort keys in left DataFrame
grower_moo = grower_moo.sort_values('Fat')
# Required by merge_asof: key column data types must match
pricing['Wet_Fat'] = pricing['Wet_Fat'].astype('float')
# Perform the asof merge
res = pd.merge_asof(grower_moo, pricing, left_on='Fat', right_on='Wet_Fat')
# Print result
res
Load_Ticket Net_Fruit_Weight Net_MOO Percent_MOO Fat Price_Per_Ton Wet_Fat
0 L2019000011160 57940 1129 1.911324 20.06 728 20.0
1 L2019000011816 53680 1409 2.557679 21.12 750 21.0
2 L2019000011815 53560 1001 1.834644 21.36 750 21.0
3 L2019000011161 62320 2737 4.207080 21.41 750 21.0
4 L2019000011817 56660 833 1.448872 21.92 750 21.0
# Optional: drop the key column from the right DataFrame
res.drop(columns='Wet_Fat')
Load_Ticket Net_Fruit_Weight Net_MOO Percent_MOO Fat Price_Per_Ton
0 L2019000011160 57940 1129 1.911324 20.06 728
1 L2019000011816 53680 1409 2.557679 21.12 750
2 L2019000011815 53560 1001 1.834644 21.36 750
3 L2019000011161 62320 2737 4.207080 21.41 750
4 L2019000011817 56660 833 1.448872 21.92 750
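To check the behaviour against the example in the question (a Fat of 13.60 should get $430), here is a small self-contained sketch. The load tickets "A", "B" and "C" and their Fat values are made up for illustration; the pricing rows are the first five from the question.

```python
import pandas as pd

# First five pricing tiers from the question, with Wet_Fat already as float
pricing = pd.DataFrame({
    "Wet_Fat": [10.0, 11.0, 12.0, 13.0, 14.0],
    "Price_Per_Ton": [306, 339, 382, 430, 481],
})

# Hypothetical loads; ticket names and Fat values are illustrative only
loads = pd.DataFrame({
    "Load_Ticket": ["A", "B", "C"],
    "Fat": [13.60, 10.00, 12.99],
})

# direction="backward" (the default) picks the last tier whose Wet_Fat <= Fat
res = pd.merge_asof(
    loads.sort_values("Fat"),
    pricing,
    left_on="Fat",
    right_on="Wet_Fat",
    direction="backward",
)
print(res[["Load_Ticket", "Fat", "Price_Per_Ton"]])
```

A Fat exactly equal to a tier boundary (10.00 here) matches that tier, because the backward search uses "less than or equal".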
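# Pair every load with every price tier, keep the tiers at or below the load's Fat,
# then keep the highest such tier for each load
concat_df = grower_moo.merge(pricing, how='cross')
concat_df = concat_df[concat_df['Wet_Fat'] <= concat_df['Fat']]
concat_df = concat_df.sort_values('Wet_Fat').groupby('Load Ticket').tail(1)
del concat_df['Wet_Fat']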

Write a pandas DataFrame mixing integers and floats in a csv file

I'm working with pandas DataFrames full of float numbers, but with integers in one of every three rows (the whole row is made of integers). When I print df, all the values are displayed as floats (the integer values have `.000000` appended), for example:
aromatics charged polar unpolar
Ac_obs_counts 712.000000 1486.000000 2688.000000 2792.000000
Ac_obs_freqs 0.092732 0.193540 0.350091 0.363636
Ac_pvalues 0.524752 0.099010 0.356436 0.495050
Am_obs_counts 10.000000 59.000000 62.000000 50.000000
Am_obs_freqs 0.055249 0.325967 0.342541 0.276243
Am_pvalues 0.495050 0.980198 0.356436 0.009901
Ap_obs_counts 18.000000 34.000000 83.000000 78.000000
Ap_obs_freqs 0.084507 0.159624 0.389671 0.366197
Ap_pvalues 0.524752 0.039604 0.980198 0.663366
When I use df.iloc[range(0, len(df.index), 3)], I see integers displayed :
aromatics charged polar unpolar
Ac_obs_counts 712 1486 2688 2792
Am_obs_counts 10 59 62 50
Ap_obs_counts 18 34 83 78
Pa_obs_counts 47 81 125 144
Pf_obs_counts 31 58 99 109
Pg_obs_counts 27 106 102 108
Ph_obs_counts 7 49 42 36
Pp_obs_counts 15 83 45 65
Ps_obs_counts 57 125 170 216
Pu_obs_counts 14 62 102 84
When I use df.to_csv("mydf.csv", sep=",", encoding="utf-8"), the integers are written as floats. How can I force these rows to be written as integers? Would it be better to split the data into two DataFrames?
Thanks in advance.
Simply cast the DataFrame to object dtype:
df.astype('object')
Out[1517]:
aromatics charged polar unpolar
Ac_obs_counts 712 1486 2688 2792
Ac_obs_freqs 0.092732 0.19354 0.350091 0.363636
Ac_pvalues 0.524752 0.09901 0.356436 0.49505
Am_obs_counts 10 59 62 50
Am_obs_freqs 0.055249 0.325967 0.342541 0.276243
Am_pvalues 0.49505 0.980198 0.356436 0.009901
Ap_obs_counts 18 34 83 78
Ap_obs_freqs 0.084507 0.159624 0.389671 0.366197
Ap_pvalues 0.524752 0.039604 0.980198 0.663366
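For the CSV itself, one option is to build an object-dtype copy in which the count rows hold real Python integers before calling to_csv. The sketch below is only one way to do it; it assumes the integer rows can be recognised by the "_obs_counts" suffix in their index label, and it writes to an in-memory buffer rather than to mydf.csv.

```python
import io
import pandas as pd

# Two columns and three rows taken from the example above
df = pd.DataFrame(
    {
        "aromatics": [712.0, 0.092732, 0.524752],
        "charged": [1486.0, 0.193540, 0.099010],
    },
    index=["Ac_obs_counts", "Ac_obs_freqs", "Ac_pvalues"],
)

# Rebuild each row as Python ints (count rows) or floats (all other rows)
rows = []
for label, row in df.iterrows():
    if label.endswith("_obs_counts"):   # assumed naming convention
        rows.append([int(v) for v in row])
    else:
        rows.append([float(v) for v in row])

# dtype=object keeps the ints and floats as they are instead of upcasting
out = pd.DataFrame(rows, index=df.index, columns=df.columns, dtype=object)

buf = io.StringIO()
out.to_csv(buf, sep=",")
print(buf.getvalue())
```

Splitting the data into a counts DataFrame and a frequencies DataFrame would also work and keeps proper numeric dtypes, at the cost of two files.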

Series of Strings to Arrays

I have some image arrays that I'm trying to run a regression on and somehow I'm importing the csv file as a series of strings instead of a series of arrays
In: image_train = pd.read_csv('image_train_data.csv')
In: image_train['image_array'].head()
Out: 0 [73 77 58 71 68 50 77 69 44 120 116 83 125 120...
1 [7 5 8 7 5 8 5 4 6 7 4 7 11 5 9 11 5 9 17 11 1...
2 [169 122 65 131 108 75 193 196 192 218 221 222...
3 [154 179 152 159 183 157 165 189 162 174 199 1...
4 [216 195 180 201 178 160 210 184 164 212 188 1...
Name: image_array, dtype: object
When I try to run the regression using image_train['image_array'] I get
ValueError: could not convert string to float: '[255 255 255 255 255 255 255 255...
The array is a string.
Is there a way to transform the strings to arrays for the entire series?
You can use converters to describe how you want to read that field in. The easiest way would be to define your own converter to treat that column as a list, e.g.:
import ast
def conv(x):
    # Turn "[73 77 58 ...]" into "[73,77,58,...]" and parse it as a Python list
    return ast.literal_eval(','.join(x.split()))
image_train = pd.read_csv('image_train_data.csv', converters={'image_array':conv})
While AChampion's solution looks good, I went ahead and found another solution:
image_train['image_array'].str.findall(r'\d+').apply(lambda x: list(map(int, x)))
Which would be useful if you already had it loaded and didn't want to/couldn't load it again.
Here's another solution that works well for just evaluating a literal string representation of a list:
pd.eval(image_train['image_array'])
However, if it's separated by spaces you could do:
pd.eval(image_train['image_array'].str.replace(' ', ','))
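If the goal is NumPy arrays rather than Python lists, a plain string split may be enough. This is a minimal sketch with two short made-up strings in the same "[73 77 58]" format; the helper name to_array is hypothetical.

```python
import numpy as np
import pandas as pd

def to_array(text):
    # Drop the surrounding brackets, split on any whitespace, convert to ints
    return np.array([int(token) for token in text.strip("[]").split()])

# Short stand-ins for the image_array column
image_strings = pd.Series(["[73 77 58]", "[7 5 8]"], name="image_array")

arrays = image_strings.apply(to_array)

# Stack into a 2-D feature matrix (only valid if every array has the same length)
X = np.stack(arrays.to_list())
print(X.shape)
```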
