I have a list of days: list = [1,5,16,29]
Considering the current month September and year 2021, I have a user-wise day df:
user_id day month year
1 1 9 2021
1 2 9 2021
1 6 9 2021
1 14 9 2021
1 22 9 2021
1 18 9 2021
2 2 9 2021
2 17 9 2021
2 3 9 2021
2 30 9 2021
2 29 9 2021
2 28 9 2021
How can I get, for each user, the days of the given month and year that are present neither in that user's df['day'] nor in the list?
Expected result
user_id remaining_days_of_month
1 3,4,7,8,9,10,11,12,13,15,17,19,20,21,23,24,25,26,27,28,30
2 4,6,7,8,9,10,11,12,13,14,15,18,19,20,21,22,23,24,25,26,27
You can use calendar.monthrange to get the number of days in a given year and month:
import calendar
import numpy as np

def get_remaining_days(group, lst):
    month = group.month.unique()[0]
    days_to_remove = np.unique(np.concatenate((group.day, lst)))
    # monthrange returns (weekday of the 1st, number of days in the month)
    num_days = calendar.monthrange(2021, month)[1]
    lst_of_days = list(range(1, num_days + 1))
    remaining_days = [i for i in lst_of_days if i not in days_to_remove]
    return remaining_days
lst = [1,5,16,29]
result = df.groupby(by=["user_id", "month"]).apply(lambda x: get_remaining_days(x, lst))
result.name = "remaining_days_of_month"
result = result.to_frame()
result
I made it work for different months for the same user. In case you happen to have different years too, it won't require much change.
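Since the day count comes from index [1] of the return value, a quick check of what calendar.monthrange actually returns:

import calendar

# monthrange(year, month) -> (weekday of the 1st, number of days).
# Monday == 0, so the 2 means September 2021 started on a Wednesday.
print(calendar.monthrange(2021, 9))  # (2, 30)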
Use calendar.monthrange to get the size of a month, then take a set difference:
import pandas as pd
import calendar
df = pd.DataFrame({'user_id': [1, 1, 1, 1, 1, 1, 2, 2, 2, 2, 2, 2],
                   'day': [1, 2, 6, 14, 22, 18, 2, 17, 3, 30, 29, 28],
                   'month': [9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9]})
month = df['month'].iloc[0]
values = [1, 5, 16, 29]
days_of_month = set(range(1, 1 + calendar.monthrange(2021, month)[1])).difference(values)
df: pd.DataFrame = df.groupby('user_id')['day'].apply(list).reset_index()
df['day'] = df['day'].apply(lambda cell: set(days_of_month).difference(cell))
   user_id                                                                              day
0        1  {3, 4, 7, 8, 9, 10, 11, 12, 13, 15, 17, 19, 20, 21, 23, 24, 25, 26, 27, 28, 30}
1        2  {4, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27}
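If you need the comma-joined string layout of the expected result rather than Python sets, a small follow-up sketch using the df produced above:

# Convert each set of remaining days into a sorted, comma-separated string.
df['remaining_days_of_month'] = df['day'].apply(
    lambda s: ",".join(str(d) for d in sorted(s)))
print(df[['user_id', 'remaining_days_of_month']])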
I am trying to filter through 3 columns of a dataframe, with conditions on all 3 columns, and return a binary value: 1 if all conditions are met and 0 if they are not. An example is shown below.
import pandas as pd
from numpy import array

data = {'PassengerId': array([2255, 2257, 2258, 2256, 2257, 2258, 2255, 2258, 2257, 2257, 2255,
2255, 2257, 2256, 2257, 2256, 2255, 2258, 2258, 2256, 2256, 2257,
2258, 2258, 2257]),
'Pclass': array([3, 2, 2, 2, 4, 3, 3, 4, 3, 1, 1, 1, 1, 2, 4, 3, 1, 2, 4, 3, 2, 3,
1, 1, 2]),
'Age': array([40, 33, 32, 40, 48, 24, 33, 29, 29, 31, 45, 47, 28, 32, 54, 39, 28,
50, 40, 31, 51, 26, 41, 46, 27]),
'SibSp': array([11, 13, 12, 19, 22, 17, 23, 12, 12, 12, 12, 24, 16, 21, 12, 15, 20,
18, 10, 17, 20, 12, 17, 17, 10]),
'Comf' : array([236.66883531, 235.46750709, 235.64574546, 241.16838089,
239.40728836, 239.95592634, 236.67806901, 237.73350635,
238.74497849, 235.17486552, 235.8457374 , 236.85133744,
240.9359547 , 236.27703374, 237.81871052, 241.62788018,
241.29185342, 235.0058136 , 240.69989317, 238.8073828 ,
238.08841364, 236.55259788, 237.58108419, 239.66916186,
241.97479544]),
'Parch': array([232.37686437, 232.39153096, 230.56566556, 232.77980061,
232.19436342, 232.2165835 , 232.28145641, 231.26988217,
230.55287196, 232.26528521, 230.45185855, 230.87525326,
231.38775744, 232.80960083, 232.33105822, 232.65782351,
231.64457366, 230.45225829, 231.05404057, 232.38229998,
232.57354117, 232.08690375, 230.40414215, 230.14361969,
231.40414745]),
'Fare': array([238.80427104, 239.32031287, 238.02212358, 238.40333494,
238.85929097, 239.51666683, 239.87771029, 238.06772515,
238.22734658, 238.54682118, 238.68880278, 239.79658425,
238.2642908 , 239.22884058, 239.84423352, 239.69438831,
238.85871719, 238.64632848, 238.7085097 , 239.5700877 ,
239.06199698, 238.37341378, 239.16126748, 239.01280153,
239.77047796])}
df = pd.DataFrame(data)
I am trying to have a condition for the first row: if "Pclass" == 1 and 'Comf' is between "Parch" and "Fare", create a new column 'Survived' and assign 1, else assign 0.
Then do the same for "Pclass" == 2, 3...
I would like to do this with pandas; however, all solutions to this problem are welcome.
If you want to do it for all rows regardless of Pclass value, then you can use
df["Survived"] = df["Comf"].between(df["Parch"], df["Fare"]).astype(int)
but if you want it only for a specific Pclass, then you can use (note the parentheses around the equality check; & binds more tightly than ==)
df["Survived"] = ((df["Pclass"] == 1) & df["Comf"].between(df["Parch"], df["Fare"])).astype(int)
Try this.
Steps.
Get indexes based on your condition.
indexesOfTrue = df[(df["Pclass"]==1) & (df["Comf"] > df["Parch"]) & (df["Comf"] < df["Fare"])].index
Fill the indexes using loc.
df.loc[indexesOfTrue, "Survived"] = 1
Fill the remaining (non-matching) indexes with 0.
df.loc[~df.index.isin(indexesOfTrue), "Survived"] = 0
Output
PassengerId Pclass Age SibSp Comf Parch Fare Survived
5 2258 3 24 17 239.955926 232.216584 239.516667 0.0
6 2255 3 33 23 236.678069 232.281456 239.877710 0.0
7 2258 4 29 12 237.733506 231.269882 238.067725 0.0
8 2257 3 29 12 238.744978 230.552872 238.227347 0.0
9 2257 1 31 12 235.174866 232.265285 238.546821 1.0
10 2255 1 45 12 235.845737 230.451859 238.688803 1.0
11 2255 1 47 24 236.851337 230.875253 239.796584 1.0
12 2257 1 28 16 240.935955 231.387757 238.264291 0.0
13 2256 2 32 21 236.277034 232.809601 239.228841 0.0
14 2257 4 54 12 237.818711 232.331058 239.844234 0.0
Hello Stack Overflow community,
I have a df which has a column called "native_country". I would like to make a new column that groups the countries into continents. For example, China would be grouped with all of the countries belonging to Asia. The code is shown below.
First I make a ContinentDict that holds the country/continent pairs:
ContinentDict = {'China': 'Asia', 'Cambodia': 'Asia', 'Hong': 'Asia',
                 'India': 'Asia', 'Japan': 'Asia', 'Laos': 'Asia',
                 'Philippines': 'Asia', 'South': 'Asia', 'Taiwan': 'Asia',
                 'Thailand': 'Asia', 'Vietnam': 'Asia',
                 'Canada': 'Canada', 'United States': 'United States',
                 'Cuba': 'Caribbean', 'Dominican-Republic': 'Caribbean',
                 'Haiti': 'Caribbean', 'Jamaica': 'Caribbean',
                 'Trinadad&Tobago': 'Caribbean',
                 'England': 'Europe', 'France': 'Europe', 'Germany': 'Europe',
                 'Greece': 'Europe', 'Holand-Netherlands': 'Europe',
                 'Hungary': 'Europe', 'Ireland': 'Europe', 'Italy': 'Europe',
                 'Poland': 'Europe', 'Portugal': 'Europe', 'Scotland': 'Europe',
                 'Yugoslavia': 'Europe',
                 'Columbia': 'Latin America', 'Ecuador': 'Latin America',
                 'El-Salvador': 'Latin America', 'Guatemala': 'Latin America',
                 'Honduras': 'Latin America', 'Nicaragua': 'Latin America',
                 'Peru': 'Latin America', 'Mexico': 'Mexico', '?': 'Unknown',
                 'Outlying-US(Guam-USVI-etc)': 'US Territories',
                 'Puerto-Rico': 'US Territories'}
Next, I assign the continents to the df:
df = df.assign(continent=df['native_country'].map(ContinentDict))
However, the continents column is filled with NaN's. Does anyone know why? Is there something I am missing?
Any help would be greatly appreciated!
map returns NaN for every value that is not an exact key of the dictionary, so the likely cause is that your native_country strings don't match the ContinentDict keys exactly (stray whitespace is a common culprit). With matching values the approach works:
df = pd.DataFrame({'native_country': list(ContinentDict.keys())})
df = df.assign(continent=df['native_country'].map(ContinentDict))
>>> df.head()
native_country continent
0 Canada Canada
1 Honduras Latin America
2 Hong Asia
3 Dominican-Republic Caribbean
4 Italy Europe
midx = pd.MultiIndex.from_arrays([df['continent'], df['native_country']])
>>> midx
MultiIndex(levels=[[u'Asia', u'Canada', u'Caribbean', u'Europe', u'Latin America', u'Mexico', u'US Territories', u'United States', u'Unknown'], [u'?', u'Cambodia', u'Canada', u'China', u'Columbia', u'Cuba', u'Dominican-Republic', u'Ecuador', u'El-Salvador', u'England', u'France', u'Germany', u'Greece', u'Guatemala', u'Haiti', u'Holand-Netherlands', u'Honduras', u'Hong', u'Hungary', u'India', u'Ireland', u'Italy', u'Jamaica', u'Japan', u'Laos', u'Mexico', u'Nicaragua', u'Outlying-US(Guam-USVI-etc)', u'Peru', u'Philippines', u'Poland', u'Portugal', u'Puerto-Rico', u'Scotland', u'South', u'Taiwan', u'Thailand', u'Trinadad&Tobago', u'United States', u'Vietnam', u'Yugoslavia']],
labels=[[1, 4, 0, 2, 3, 4, 5, 6, 0, 3, 3, 3, 0, 3, 7, 4, 0, 3, 0, 4, 0, 0, 2, 3, 3, 2, 4, 2, 0, 6, 4, 0, 3, 2, 3, 0, 0, 3, 4, 3, 8], [2, 16, 17, 6, 21, 28, 25, 27, 29, 18, 33, 40, 1, 10, 38, 7, 39, 20, 24, 4, 36, 34, 22, 9, 31, 5, 8, 14, 19, 32, 13, 3, 15, 37, 12, 23, 35, 11, 26, 30, 0]],
names=[u'continent', u'native_country'])
Once you have country and continent in the dataframe, you could just set the index:
df = df.assign(data=1)
>>> df.set_index(['continent', 'native_country']).sort_index()
data
continent native_country
Asia Cambodia 1
China 1
Hong 1
India 1
Japan 1
Laos 1
Philippines 1
South 1
Taiwan 1
Thailand 1
Vietnam 1
Canada Canada 1
Caribbean Cuba 1
Dominican-Republic 1
Haiti 1
Jamaica 1
Trinadad&Tobago 1
Europe England 1
France 1
Germany 1
Greece 1
Holand-Netherlands 1
Hungary 1
Ireland 1
Italy 1
Poland 1
Portugal 1
Scotland 1
Yugoslavia 1
Latin America Columbia 1
Ecuador 1
El-Salvador 1
Guatemala 1
Honduras 1
Nicaragua 1
Peru 1
Mexico Mexico 1
US Territories Outlying-US(Guam-USVI-etc) 1
Puerto-Rico 1
United States United States 1
Unknown ? 1
Alternatively, to order the rows by continent, sort by the mapped values:
df.iloc[df['native_country'].map(ContinentDict).argsort()]
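If your real data still yields NaN, here is a defensive sketch, assuming (hypothetically) that the mismatch comes from stray leading/trailing whitespace in the raw strings:

# Normalize the strings before mapping; strip() removes the usual culprit.
df['continent'] = df['native_country'].str.strip().map(ContinentDict)

# Whatever is still unmapped points at genuinely missing dictionary keys.
missing = df.loc[df['continent'].isna(), 'native_country'].unique()
print(missing)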
I have a 3 row x 96 column dataframe. I'm trying to compute the average of the two rows beneath the header (rows 0 and 1) for every 12 data points. Here is my dataframe:
Run 1 Run 2 Run 3 Run 4 Run 5 Run 6 \
0 1461274.92 1458079.44 1456807.1 1459216.08 1458643.24 1457145.19
1 478167.44 479528.72 480316.08 475569.52 472989.01 476054.89
2 ------ ------ ------ ------ ------ ------
Run 7 Run 8 Run 9 Run 10 ... Run 87 \
0 1458117.08 1455184.82 1455768.69 1454738.07 ... 1441822.45
1 473630.89 476282.93 475530.87 474200.22 ... 468525.2
2 ------ ------ ------ ------ ... ------
Run 88 Run 89 Run 90 Run 91 Run 92 Run 93 \
0 1445339.53 1461050.97 1446849.43 1438870.43 1431275.76 1430781.28
1 460076.8 473263.06 455885.07 475245.64 483875.35 487065.25
2 ------ ------ ------ ------ ------ ------
Run 94 Run 95 Run 96
0 1436007.32 1435238.23 1444300.51
1 474328.87 475789.12 458681.11
2 ------ ------ ------
[3 rows x 96 columns]
Currently I am trying to use df.irow(0) to select all the data in row index 0.
something along the lines of:
selection = np.arange(0, 13)
for i in selection:
    new_df = pd.DataFrame()
    data = df.irow(0)
    ........
then I get lost.
I just don't know how to link this range with the dataframe in order to compute the mean for every 12 data points in each column.
To summarize, I want the average for every 12 runs in each column, so I should end up with a separate dataframe of 2 x 8 average values (96/12).
Any ideas?
Thanks.
You can do a groupby on axis=1 (using some dummy data I made up):
>>> h = df.iloc[:2].astype(float)
>>> h.groupby(np.arange(len(h.columns))//12, axis=1).mean()
0 1 2 3 4 5 6 7
0 0.609643 0.452047 0.536786 0.377845 0.544321 0.214615 0.541185 0.544462
1 0.382945 0.596034 0.659157 0.437576 0.490161 0.435382 0.476376 0.423039
First we extract the data and force recognition of a float (the presence of the ------ row means that you've probably got an object dtype, which will make the mean unhappy.)
Then we make an array saying what groups we want to put the different columns in:
>>> np.arange(len(df.columns))//12
array([0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
1, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3,
3, 3, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 5, 5, 5, 5, 5, 5, 5, 5, 5,
5, 5, 5, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 7, 7, 7, 7, 7, 7, 7, 7,
7, 7, 7, 7], dtype=int32)
which we feed as an argument to groupby. .mean() handles the rest.
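As a side note, groupby(..., axis=1) is deprecated in recent pandas releases; a sketch of an equivalent using a transpose instead (same dummy h as above):

# Group the transposed frame by column position // 12, average, transpose back.
h.T.groupby(np.arange(len(h.columns)) // 12).mean().T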
It's always best to try to use pandas methods when you can, rather than iterating over the rows. The DataFrame's iloc method is useful for extracting any number of rows.
The following example shows you how to do what you want in a two-column DataFrame. The same technique will work independent of the number of columns:
In [14]: df = pd.DataFrame({"x": [1, 2, "-"], "y": [3, 4, "-"]})
In [15]: df
Out[15]:
x y
0 1 3
1 2 4
2 - -
In [16]: df.iloc[2] = df.iloc[0:2].sum()
In [17]: df
Out[17]:
x y
0 1 3
1 2 4
2 3 7
However, in your case you want grouped sums rather than a single row of totals, so you might do better simply to take the result of the summing expression with the statement
ds = df.iloc[0:2].sum()
which with your data will have the form
col1 0
col2 1
col3 2
col4 3
...
col93 92
col94 93
col95 94
col96 95
(These numbers are representative; you will obviously see your own column sums.) You can then turn this into a 12x8 matrix with
ds.values.reshape(12, 8)
whose value is
array([[ 0, 1, 2, 3, 4, 5, 6, 7],
[ 8, 9, 10, 11, 12, 13, 14, 15],
[16, 17, 18, 19, 20, 21, 22, 23],
[24, 25, 26, 27, 28, 29, 30, 31],
[32, 33, 34, 35, 36, 37, 38, 39],
[40, 41, 42, 43, 44, 45, 46, 47],
[48, 49, 50, 51, 52, 53, 54, 55],
[56, 57, 58, 59, 60, 61, 62, 63],
[64, 65, 66, 67, 68, 69, 70, 71],
[72, 73, 74, 75, 76, 77, 78, 79],
[80, 81, 82, 83, 84, 85, 86, 87],
[88, 89, 90, 91, 92, 93, 94, 95]])
but summing this array will give you the sum of all elements, so instead create another DataFrame with
rs = pd.DataFrame(ds.values.reshape(12, 8))
and then sum that:
rs.sum()
giving
0 528
1 540
2 552
3 564
4 576
5 588
6 600
7 612
dtype: int64
You may find in practice that it is easier to simply create two 12x8 matrices in the first place, which you can add together before creating a dataframe which you can then sum. Much depends on how you are reading your data.
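If you only need the final 2 x 8 table of means over consecutive groups of 12 runs, here is a compact sketch, assuming the first two rows hold the numeric data:

import numpy as np
import pandas as pd

# View the two numeric rows as (2 rows, 8 groups, 12 runs) and
# average over the last axis to get one mean per group of 12.
vals = df.iloc[0:2].astype(float).values
means = pd.DataFrame(vals.reshape(2, 8, 12).mean(axis=2))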
I'm trying to extract some data from two html tables in an html file with BeautifulSoup.
This is actually the first time I'm using it, and I've searched a lot of questions/examples, but none seem to work in my case.
The html contains two tables: the first holds the headers of the first column (which are always text), and the second contains the data of the following columns. Moreover, the tables contain text, numbers and also symbols, which makes everything more complicated for a novice like me. Here's the layout of the html, copied from the browser.
I was able to extract the whole html content of the rows, but only for the first table, so in reality I am not getting any data, only the content of the first column.
The output I'm trying to obtain is a string containing the "joint" information of the tables (Col1= text, Col2=number, Col3=number, Col4=number, Col5=number), such as:
Canada, 6, 5, 2, 1
Here's the list of the Xpaths for each item:
"Canada": /html/body/div/div[1]/table/tbody[2]/tr[2]/td/div/a
"6": /html/body/div/div[2]/div/table/tbody[2]/tr[2]/td[1]
"5": /html/body/div/div[2]/div/table/tbody[2]/tr[2]/td[3]
"2": /html/body/div/div[2]/div/table/tbody[2]/tr[2]/td[5]
"1": /html/body/div/div[2]/div/table/tbody[2]/tr[2]/td[7]
I would also be happy with strings in "rough" html format, as long as there is one string per row, so that I'll be able to parse them further with the methods I already know. Here's the code I have so far. Thanks!
from BeautifulSoup import BeautifulSoup

html = """
my html code
"""

soup = BeautifulSoup(html)
table = soup.find("table")
for row in table.findAll('tr'):
    col = row.findAll('td')
    print row, col
Using bs4, but this should work:
from bs4 import BeautifulSoup as bsoup

ofile = open("htmlsample.html")
soup = bsoup(ofile)
soup.prettify()

tables = soup.find_all("tbody")
storeTable = tables[0].find_all("tr")
storeValueRows = tables[2].find_all("tr")

storeRank = []
for row in storeTable:
    storeRank.append(row.get_text().strip())

storeMatrix = []
for row in storeValueRows:
    storeMatrixRow = []
    for cell in row.find_all("td")[::2]:
        storeMatrixRow.append(cell.get_text().strip())
    storeMatrix.append(", ".join(storeMatrixRow))

for record in zip(storeRank, storeMatrix):
    print " ".join(record)
The above will print out:
# of countries - rank 1 reached 0, 0, 1, 9
# of countries - rank 5 reached 0, 8, 49, 29
# of countries - rank 10 reached 25, 31, 49, 32
# of countries - rank 100 reached 49, 49, 49, 32
# of countries - rank 500 reached 49, 49, 49, 32
# of countries - rank 1000 reached 49, 49, 49, 32
[Finished in 0.5s]
Changing storeTable to tables[1] and storeValueRows to tables[3] will print out:
Country
Canada 6, 5, 2, 1
Brazil 7, 5, 2, 1
Hungary 7, 6, 2, 2
Sweden 9, 5, 1, 1
Malaysia 10, 5, 2, 1
Mexico 10, 5, 2, 2
Greece 10, 6, 2, 1
Israel 10, 6, 2, 1
Bulgaria 10, 6, 2, -
Chile 10, 6, 2, -
Vietnam 10, 6, 2, -
Ireland 10, 6, 2, -
Kuwait 10, 6, 2, -
Finland 10, 7, 2, -
United Arab Emirates 10, 7, 2, -
Argentina 10, 7, 2, -
Slovakia 10, 7, 2, -
Romania 10, 8, 2, -
Belgium 10, 9, 2, 3
New Zealand 10, 13, 2, -
Portugal 10, 14, 2, -
Indonesia 10, 14, 2, -
South Africa 10, 15, 2, -
Ukraine 10, 15, 2, -
Philippines 10, 16, 2, -
United Kingdom 11, 5, 2, 1
Denmark 11, 6, 2, 2
Australia 12, 9, 2, 3
United States 13, 9, 2, 2
Austria 13, 9, 2, 3
Turkey 14, 5, 2, 1
Egypt 14, 5, 2, 1
Netherlands 14, 8, 2, 2
Spain 14, 11, 2, 4
Thailand 15, 10, 2, 3
Singapore 16, 10, 2, 2
Switzerland 16, 10, 2, 3
Taiwan 17, 12, 2, 4
Poland 17, 13, 2, 5
France 18, 8, 2, 3
Czech Republic 18, 13, 2, 6
Germany 19, 11, 2, 3
Norway 20, 14, 2, 5
India 20, 14, 2, 5
Italy 20, 15, 2, 7
Hong Kong 26, 21, 2, -
Japan 33, 16, 4, 5
Russia 33, 17, 2, 7
South Korea 46, 27, 2, 5
[Finished in 0.6s]
Not the best code, and it can be improved further, but the logic applies well.
Hope this helps.
EDIT:
If you want the format South Korea, 46, 27, 2, 5 instead of South Korea 46, 27, 2, 5 (note the comma after the country name), just change this:
storeRank.append(row.get_text().strip())
to this:
storeRank.append(row.get_text().strip() + ",")
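A slightly cleaner variant of the same idea is to join the two fields at print time instead of mutating storeRank:

for record in zip(storeRank, storeMatrix):
    print ", ".join(record)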
Looks like you are scraping data from http://www.appannie.com.
Here is the code to get the data. I am sure some parts of the code can be improved or written in a more Pythonic way, but it gets what you want. Also, I used Beautiful Soup 4 instead of 3.
from bs4 import BeautifulSoup

html_file = open('test2.html')
soup = BeautifulSoup(html_file)

countries = []
countries_table = soup.find_all('table', attrs={'class': 'data-table table-rank'})[1]
countries_body = countries_table.find_all('tbody')[1]
countries_row = countries_body.find_all('tr', attrs={"class": "ranks"})
for row in countries_row:
    countries.append(row.div.a.text)

data = []
data_table = soup.find_all('table', attrs={'class': 'data-table table-rank'})[3]
data_body = data_table.find_all('tbody')[1]
data_row = data_body.find_all('tr', attrs={"class": "ranks"})
for row in data_row:
    tds = row.find_all('td')
    sublist = []
    for td in tds[::2]:
        sublist.append(td.text)
    data.append(sublist)

for element in zip(countries, data):
    print element
Hope this helps :)
Just thought I would put my alternate version of this here. I don't even know why people still use BeautifulSoup for web scraping; it's much easier to use XPath directly through lxml. Here is the same problem, perhaps in an easier-to-read and easier-to-update form:
from lxml import html, etree

tree = html.parse("sample.html").xpath('//body/div/div')

lxml_getValue = lambda x: etree.tostring(x, method="text", encoding='UTF-8').strip()
lxml_getData = lambda x: "{}, {}, {}, {}".format(lxml_getValue(x.xpath('.//td')[0]),
                                                 lxml_getValue(x.xpath('.//td')[2]),
                                                 lxml_getValue(x.xpath('.//td')[4]),
                                                 lxml_getValue(x.xpath('.//td')[6]))

locations = tree[0].xpath('.//tbody')[1].xpath('./tr')
locations.pop(0)  # Don't need the first row
data = tree[1].xpath('.//tbody')[1].xpath('./tr')
data.pop(0)  # Don't need the first row

for f, b in zip(locations, data):
    print(lxml_getValue(f), lxml_getData(b))
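As a further alternative, pandas can often lift simple HTML tables directly. A sketch, assuming the tables parse cleanly (the file name mirrors the one above):

import pandas as pd

# read_html returns one DataFrame per <table> element it finds.
tables = pd.read_html("sample.html")
names, values = tables[0], tables[1]  # header table and data table, assumed order
print(names.head())
print(values.head())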