Imputing missing timestamp indices in Pandas? - python

Here is a snapshot of a dataframe df I'm working with
2014-02-01 09:58:03 1.576119 0.0 8.355 0.0 0.0 1.0 0.0
2014-02-01 09:58:33 1.576119 0.0 13.371 0.0 0.0 1.0 0.0
2014-02-01 09:59:03 1.576119 0.0 13.833 0.0 0.0 1.0 0.0
With Timestamp indices spaced by 30 seconds. I'm trying to concatenate a number of rows populated by np.nan values, while keeping with the pattern of 30 second separated Timestamp indices, i.e. something that would look like
2014-02-01 09:58:03 1.576119 0.0 8.355 0.0 0.0 1.0 0.0
2014-02-01 09:58:33 1.576119 0.0 13.371 0.0 0.0 1.0 0.0
2014-02-01 09:59:03 1.576119 0.0 13.833 0.0 0.0 1.0 0.0
2014-02-01 09:59:33 NaN NaN NaN NaN NaN NaN NaN
2014-02-01 10:00:03 NaN NaN NaN NaN NaN NaN NaN
However, when I apply
df = pd.concat([df, pd.DataFrame(np.array([np.nan, np.nan]))])
I'm instead left with
2014-02-01 09:58:03 1.576119 0.0 8.355 0.0 0.0 1.0 0.0
2014-02-01 09:58:33 1.576119 0.0 13.371 0.0 0.0 1.0 0.0
2014-02-01 09:59:03 1.576119 0.0 13.833 0.0 0.0 1.0 0.0
0 NaN NaN NaN NaN NaN NaN NaN
1 NaN NaN NaN NaN NaN NaN NaN
My question is: how can I get the timestamp index pattern to continue? Is there something I should specify in the creation of the dataframe to be concatenated, or can I re-index the dataframe shown above?
For a more complete problem statement: I'm working with several time-series dataframes, each hundreds of thousands of rows long, all starting at the same time but with varying ending times and some missing values. I'm trying to get them to match lengths so I can interpolate NaN values with an np.nanmean() at that element's index over all dataframes, which I'm doing by stacking the associated numpy arrays for each dataframe. Applying this averaging procedure across the arrays requires them to have the same dimensions, hence I am padding them with NaNs and interpolating.
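(For reference, once the frames have been padded to equal length, that averaging step might look like the sketch below; dfs is a hypothetical list holding the equally-sized dataframes.)
import numpy as np

stacked = np.stack([frame.to_numpy() for frame in dfs])  # shape: (n_frames, n_rows, n_cols)
column_means = np.nanmean(stacked, axis=0)               # element-wise mean that ignores the padded NaNs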

After you're done concatenating:
Make use of pd.date_range() and the index attribute:
df.index=pd.date_range(df.index[0],periods=len(df),freq='30S')
output of df:
2014-02-01 09:58:03 1.576119 0.0 8.355 0.0 0.0 1.0 0.0
2014-02-01 09:58:33 1.576119 0.0 13.371 0.0 0.0 1.0 0.0
2014-02-01 09:59:03 1.576119 0.0 13.833 0.0 0.0 1.0 0.0
2014-02-01 09:59:33 NaN NaN NaN NaN NaN NaN NaN
2014-02-01 10:00:03 NaN NaN NaN NaN NaN NaN NaN
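For completeness, a minimal end-to-end sketch, assuming the NaN rows are built with the same columns as df:
import numpy as np
import pandas as pd

nan_rows = pd.DataFrame(np.full((2, len(df.columns)), np.nan), columns=df.columns)
df = pd.concat([df, nan_rows])
df.index = pd.date_range(df.index[0], periods=len(df), freq='30S')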

If it is the case that you have, say, two DFs, df_big and df_small, and the row indices in df_small match the beginning row indices of df_big, you could:
Add the NaN rows as you describe above so that the number of rows in df_small matches the number of rows in df_big.
Then copy the index from df_big to df_small.
df_small.index = df_big.index
A different idea:
You could use the time delta between the last two rows to generate new index entries.
Set number of entries to add.
rows_to_add = 2
Create a new and extended index based on your original DF - before you add the NaN rows:
ext_index = list(df.index) + \
    [df.index[-1] + (df.index[-1] - df.index[-2]) * x for x in range(1, rows_to_add + 1)]
[Timestamp('2014-02-01 09:58:03'),
Timestamp('2014-02-01 09:58:33'),
Timestamp('2014-02-01 09:59:03'),
Timestamp('2014-02-01 09:59:33'),
Timestamp('2014-02-01 10:00:03')]
Then add your NaN rows as in your question. (The same number of rows as the constant rows_to_add).
Then set your new index:
df.index = ext_index
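Alternatively, reindex can do both steps at once: reindexing against the extended index appends the missing timestamps as all-NaN rows and sets the index in a single call (assuming the original index has no duplicate timestamps):
df = df.reindex(ext_index)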

Another idea. (Doesn't directly answer your question but might help). This would be useful in situations where not all of your missing data is at the end of the frame.
Create a DF with date range in index:
df_nan = pd.DataFrame(
    index=pd.date_range('2014-02-01 09:58:03', periods=5, freq='30S')
)
Outer join with your smaller DF:
df.join(df_nan, how='outer')
2014-02-01 09:58:03 1.576119 0.0 8.355 0.0 0.0 1.0 0.0
2014-02-01 09:58:33 1.576119 0.0 13.371 0.0 0.0 1.0 0.0
2014-02-01 09:59:03 1.576119 0.0 13.833 0.0 0.0 1.0 0.0
2014-02-01 09:59:33 NaN NaN NaN NaN NaN NaN NaN
2014-02-01 10:00:03 NaN NaN NaN NaN NaN NaN NaN
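In the df_big / df_small situation described earlier, the same outer join works without hard-coding the range; a sketch:
df_small = df_small.join(pd.DataFrame(index=df_big.index), how='outer')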

Related

How to extract or split some days from date when date is index or string?

I have 6000 rows and 8 columns, where 'Date' is the index (or I can reset the index so it becomes the first column, with string type). I need to extract the 'Lake_Level' values for records dated on the second or seventh day of a month (and provide the top 3 and bottom 3 values of the 'Lake_Level' feature). Please show me how to do it. Thank you in advance.
Date Loc_1 Loc_2 Loc_3 Loc_4 Loc_5 Temp Lake_Level Flow_Rate
03/06/2003 NaN NaN NaN NaN NaN NaN 249.43 0.31
04/06/2003 NaN NaN NaN NaN NaN NaN 249.43 0.31
05/06/2003 NaN NaN NaN NaN NaN NaN 249.43 0.31
06/06/2003 NaN NaN NaN NaN NaN NaN 249.43 0.31
07/06/2003 NaN NaN NaN NaN NaN NaN 249.43 0.31
26/06/2021 0.0 0.0 0.0 0.0 0.0 22.50 250.85 0.60
27/06/2021 0.0 0.0 0.0 0.0 0.0 23.40 250.84 0.60
28/06/2021 0.0 0.0 0.0 0.0 0.0 21.50 250.83 0.60
29/06/2021 0.0 0.0 0.0 0.0 0.0 23.20 250.82 0.60
30/06/2021 0.0 0.0 0.0 0.0 0.0 22.75 250.80 0.60
Why don't you just filter the rows with your desired condition?
You can run queries on your dataset using a pandas DataFrame like below:
If the datetimes are in a column:
df[pd.to_datetime(df['Date'], dayfirst=True).dt.day.isin([2,7])]
If the datetimes are the index:
df[pd.to_datetime(df.index, dayfirst=True).day.isin([2,7])]
Here is an example:
In [1]: import pandas as pd
In [2]: df = pd.DataFrame({
...: 'Date': [random_date() for _ in range(100)],
...: 'Lake_Level': [random.randint(240, 260) for _ in range(100)]
...: })
In [3]: df[pd.to_datetime(df['Date'], dayfirst=True).dt.day.isin([2,7])]
Out[3]:
Date Lake_Level
2 07/08/2004 245
27 02/12/2017 249
30 02/06/2012 252
51 07/10/2013 257
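To also pull the top 3 and bottom 3 'Lake_Level' values out of those filtered rows, as the question asks, nlargest/nsmallest can be chained on; a sketch for the case where 'Date' is a column:
filtered = df[pd.to_datetime(df['Date'], dayfirst=True).dt.day.isin([2, 7])]
top3 = filtered['Lake_Level'].nlargest(3)
bottom3 = filtered['Lake_Level'].nsmallest(3)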

How do I keep the column used as groupby criteria (column 0)?

I am grouping a pandas data frame by its value in column 0, which happens to be a year, month column (formatted as a float64 like yy,mm).
Before using the groupby function, my dataframe is as follows:
0 1 2 3 4 5 6 7 8 9
0 13,09 0.00 NaN 26.0 5740.0 NaN NaN NaN NaN 26
1 13,09 0.02 NaN 26.0 5738.0 NaN NaN NaN NaN 26
2 13,09 0.00 NaN 26.0 5738.0 NaN NaN NaN NaN 26
3 13,09 0.00 NaN 29.0 NaN NaN NaN NaN NaN 29
4 13,09 0.00 NaN 25.0 NaN NaN NaN NaN NaN 25
After running my groupby code (seen here)
month_year_total = month_year.groupby(0).sum()
I am given the following dataframe
1 2 3 4 5 6 7 8 9
0
13,09 1.55 0.0 383.0 51583.0 0.0 0.0 0.0 0.0 383
13,10 12.56 0.0 2039.0 142426.0 0.0 0.0 0.0 0.0 2039
13,11 0.65 1890.0 1663.0 170038.0 0.0 0.0 0.0 0.0 3553
13,12 1.43 7014.0 1055.0 176217.0 0.0 0.0 0.0 0.0 8069
14,01 1.53 7284.0 856.0 101971.0 0.0 0.0 0.0 0.0 8140
I wish to keep column 0 when converting to numpy, as I intend it to be the x axis of my graph; however, the column is dropped when I convert data types. In fact, I cannot manipulate the column at all, even within the pandas dataframe.
How do I keep this column or add an identical column?
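The grouping key isn't actually lost: groupby(0).sum() moves it into the index, and the index is dropped when the frame is converted to a NumPy array. Two common ways to keep it as an ordinary column, sketched against the code above:
month_year_total = month_year.groupby(0, as_index=False).sum()   # keep column 0 as a regular column
# or, after the fact, move the index back into a column
month_year_total = month_year.groupby(0).sum().reset_index()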

How to Parse the MLB Team and Player data using Pandas DataFrame?

I am still learning and could use some help. I would like to parse the starting pitchers and their respective teams.
I would like the data in a Pandas Dataframe but do not know how to parse the data correctly. Any suggestions would be very helpful. Thanks for your time!
Here is an example of the desired output:
Game Team Name
OAK Chris Bassitt
1
ARI Zac Gallen
SEA Justin Dunn
2
LAD Ross Stripling
Here is my code:
#url = https://www.baseball-reference.com/previews/index.shtml
#Data needed: 1) Team 2) Pitcher Name
import pandas as pd
url = 'https://www.baseball-reference.com/previews/index.shtml'
test = pd.read_html(url)
for t in test:
    name = t[1]
    team = t[0]
    print(team)
    print(name)
I feel like I have to create a Pandas DataFrame and append the Team and Name, however, I am not sure how to parse out just the desired output.
pandas.read_html returns a list of all the tables at a given URL.
DataFrames in the list can be selected using normal list slicing and selection methods.
import pandas as pd
url = 'https://www.baseball-reference.com/previews/index.shtml'
list_of_dataframes = pd.read_html(url)
# select and combine the dataframes for games; every other dataframe from 0 (even)
games = pd.concat(list_of_dataframes[0::2])
# display(games.head())
0 1 2
0 Cubs (13-6) NaN Preview
1 Cardinals (4-4) NaN 12:00AM
0 Cardinals (4-4) NaN Preview
1 Cubs (13-6) NaN 5:15PM
0 Red Sox (6-16) NaN Preview
# select the players from list_of_dataframes; every other dataframe from 1 (odd)
players = list_of_dataframes[1::2]
# add the Game to the dataframes
for i, df in enumerate(players, 1):
    df['Game'] = i
    players[i-1] = df
# combine all the dataframes
players = pd.concat(players).reset_index(drop=True)
# create a players column for the name only
players['name'] = players[1].str.split('(', expand=True)[0]
# rename the column
players.rename(columns={0: 'Team'}, inplace=True)
# drop column 1
players.drop(columns=[1], inplace=True)
# display(players.head(6))
Team Game name
0 CHC 1 Tyson Miller
1 STL 1 Alex Reyes
2 STL 2 Kwang Hyun Kim
3 CHC 2 Kyle Hendricks
4 BOS 3 Martin Perez
5 NYY 3 Jordan Montgomery
Love those sports-reference.com sites. Trenton's solution is perfect, so don't change the accepted answer, but I just wanted to throw out this alternative data source for probable pitchers in case you were interested.
It looks like mlb.com has a publicly available API for pulling that info (I'd guess that may be where baseball-reference fills its probable pitchers page from). What I like about this is that you get much more data back to analyse, and you have the option of a wider date range for historical data, and possibly probable pitchers two or three days in advance (as well as day of). So give this code a look too, play with it, practice with it.
This could also set you up for your first machine-learning sort of project.
PS: Let me know if you figure out what strikeZoneBottom and strikeZoneTop mean here, if you even bother to look into this data. I haven't been able to work out what they mean.
I'm also wondering whether there's data on the ballparks. The pitcher stats include a fly ball : ground ball ratio, for instance; with ballpark data, a fly-ball pitcher in a venue that yields lots of home runs might look quite different from the same pitcher in a ballpark where fly balls don't travel as far or the fences are deeper (essentially, home runs turning into warning-track fly outs and vice versa).
Code:
import requests
import pandas as pd
from datetime import datetime, timedelta
url = 'https://statsapi.mlb.com/api/v1/schedule'
headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/80.0.3987.149 Safari/537.36'}
yesterday = datetime.strftime(datetime.now() - timedelta(1), '%Y-%m-%d')
today = datetime.strftime(datetime.now(), '%Y-%m-%d')
tomorrow = datetime.strftime(datetime.now() + timedelta(1), '%Y-%m-%d')
#To get 7 days earlier; notice the minus sign
#pastDate = datetime.strftime(datetime.now() - timedelta(7), '%Y-%m-%d')
#To get 3 days later; notice the plus sign
#futureDate = datetime.strftime(datetime.now() + timedelta(3), '%Y-%m-%d')
#hydrate parameter is to get back certain data elements. Not sure how to alter it exactly yet, would have to play around
#But without hydrate, it doesn't return probable pitchers
payload = {
    'sportId': '1',
    'startDate': today,  # <-- change these to get a wider range of games (e.g. historical stats for machine learning)
    'endDate': today,    # <-- or probable pitchers for the next few days; just adjust the timedelta above
    'hydrate': 'team(leaders(showOnPreview(leaderCategories=[homeRuns,runsBattedIn,battingAverage],statGroup=[pitching,hitting]))),linescore(matchup,runners),flags,liveLookin,review,broadcasts(all),venue(location),decisions,person,probablePitcher,stats,homeRuns,previousPlay,game(content(media(featured,epg),summary),tickets),seriesStatus(useOverride=true)'}
jsonData = requests.get(url, headers=headers, params=payload).json()
dates = jsonData['dates']
rows = []
for date in dates:
    games = date['games']
    for game in games:
        dayNight = game['dayNight']
        gameDate = game['gameDate']
        city = game['venue']['location']['city']
        venue = game['venue']['name']
        teams = game['teams']
        for k, v in teams.items():
            row = {}
            row.update({'dayNight':dayNight,
                        'gameDate':gameDate,
                        'city':city,
                        'venue':venue})
            homeAway = k
            teamName = v['team']['name']
            if 'probablePitcher' not in v.keys():
                row.update({'homeAway':homeAway,
                            'teamName':teamName})
                rows.append(row)
            else:
                probablePitcher = v['probablePitcher']
                fullName = probablePitcher['fullName']
                pitchHand = probablePitcher['pitchHand']['code']
                strikeZoneBottom = probablePitcher['strikeZoneBottom']
                strikeZoneTop = probablePitcher['strikeZoneTop']
                row.update({'homeAway':homeAway,
                            'teamName':teamName,
                            'probablePitcher':fullName,
                            'pitchHand':pitchHand,
                            'strikeZoneBottom':strikeZoneBottom,
                            'strikeZoneTop':strikeZoneTop})
                stats = probablePitcher['stats']
                for stat in stats:
                    if stat['type']['displayName'] == 'statsSingleSeason' and stat['group']['displayName'] == 'pitching':
                        playerStats = stat['stats']
                        row.update(playerStats)
                rows.append(row)
df = pd.DataFrame(rows)
Output: First 10 rows
print (df.head(10).to_string())
airOuts atBats balks baseOnBalls blownSaves catchersInterference caughtStealing city completeGames dayNight doubles earnedRuns era gameDate gamesFinished gamesPitched gamesPlayed gamesStarted groundOuts groundOutsToAirouts hitBatsmen hitByPitch hits hitsPer9Inn holds homeAway homeRuns homeRunsPer9 inheritedRunners inheritedRunnersScored inningsPitched intentionalWalks losses obp outs pickoffs pitchHand probablePitcher rbi runs runsScoredPer9 sacBunts sacFlies saveOpportunities saves shutouts stolenBasePercentage stolenBases strikeOuts strikeZoneBottom strikeZoneTop strikeoutWalkRatio strikeoutsPer9Inn teamName triples venue walksPer9Inn whip wildPitches winPercentage wins
0 15.0 44.0 0.0 9.0 0.0 0.0 0.0 Baltimore 0.0 day 2.0 8.0 6.00 2020-08-19T17:05:00Z 0.0 3.0 3.0 3.0 9.0 0.60 0.0 0.0 10.0 7.50 0.0 away 3.0 2.25 0.0 0.0 12.0 0.0 1.0 .358 36.0 0.0 R Tanner Roark 0.0 8.0 6.00 0.0 0.0 0.0 0.0 0.0 1.000 1.0 10.0 1.589 3.467 1.11 7.50 Toronto Blue Jays 0.0 Oriole Park at Camden Yards 6.75 1.58 0.0 .500 1.0
1 18.0 74.0 0.0 3.0 0.0 0.0 0.0 Baltimore 0.0 day 5.0 8.0 4.00 2020-08-19T17:05:00Z 0.0 4.0 4.0 4.0 18.0 1.00 1.0 1.0 22.0 11.00 0.0 home 1.0 0.50 0.0 0.0 18.0 0.0 2.0 .329 54.0 1.0 L Tommy Milone 0.0 11.0 5.50 1.0 1.0 0.0 0.0 0.0 1.000 1.0 18.0 1.535 3.371 6.00 9.00 Baltimore Orioles 1.0 Oriole Park at Camden Yards 1.50 1.39 1.0 .333 1.0
2 14.0 59.0 0.0 2.0 0.0 0.0 0.0 Boston 0.0 day 3.0 7.0 4.02 2020-08-19T17:35:00Z 0.0 3.0 3.0 3.0 14.0 1.00 0.0 0.0 17.0 9.77 0.0 away 2.0 1.15 0.0 0.0 15.2 0.0 2.0 .311 47.0 0.0 R Jake Arrieta 0.0 7.0 4.02 0.0 0.0 0.0 0.0 0.0 .--- 0.0 14.0 1.627 3.549 7.00 8.04 Philadelphia Phillies 0.0 Fenway Park 1.15 1.21 2.0 .333 1.0
3 2.0 14.0 1.0 3.0 0.0 0.0 0.0 Boston 0.0 day 1.0 5.0 22.50 2020-08-19T17:35:00Z 0.0 1.0 1.0 1.0 1.0 0.50 0.0 0.0 7.0 31.50 0.0 home 2.0 9.00 0.0 0.0 2.0 0.0 1.0 .588 6.0 0.0 L Kyle Hart 0.0 7.0 31.50 0.0 0.0 0.0 0.0 0.0 .--- 0.0 4.0 1.681 3.575 1.33 18.00 Boston Red Sox 0.0 Fenway Park 13.50 5.00 0.0 .000 0.0
4 8.0 27.0 0.0 0.0 0.0 0.0 0.0 Chicago 0.0 day 0.0 2.0 2.57 2020-08-19T18:20:00Z 0.0 1.0 1.0 1.0 7.0 0.88 0.0 0.0 6.0 7.71 0.0 away 0.0 0.00 0.0 0.0 7.0 0.0 0.0 .222 21.0 0.0 R Jack Flaherty 0.0 2.0 2.57 0.0 0.0 0.0 0.0 0.0 .--- 0.0 6.0 1.627 3.549 -.-- 7.71 St. Louis Cardinals 0.0 Wrigley Field 0.00 0.86 0.0 1.000 1.0
5 13.0 65.0 0.0 6.0 0.0 0.0 1.0 Chicago 0.0 day 2.0 6.0 2.84 2020-08-19T18:20:00Z 0.0 3.0 3.0 3.0 28.0 2.15 1.0 1.0 10.0 4.74 0.0 home 2.0 0.95 0.0 0.0 19.0 0.0 1.0 .236 57.0 0.0 R Alec Mills 0.0 6.0 2.84 0.0 0.0 0.0 0.0 0.0 .000 0.0 14.0 1.627 3.549 2.33 6.63 Chicago Cubs 0.0 Wrigley Field 2.84 0.84 0.0 .667 2.0
6 NaN NaN NaN NaN NaN NaN NaN Chicago NaN night NaN NaN NaN 2020-08-19T03:33:00Z NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN away NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN Chicago Cubs NaN Wrigley Field NaN NaN NaN NaN NaN
7 NaN NaN NaN NaN NaN NaN NaN Chicago NaN night NaN NaN NaN 2020-08-19T03:33:00Z NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN home NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN St. Louis Cardinals NaN Wrigley Field NaN NaN NaN NaN NaN
8 13.0 92.0 0.0 8.0 0.0 0.0 1.0 Kansas City 0.0 day 6.0 10.0 3.91 2020-08-19T21:05:00Z 0.0 4.0 4.0 4.0 24.0 1.85 0.0 0.0 25.0 9.78 0.0 away 1.0 0.39 0.0 0.0 23.0 0.0 2.0 .327 69.0 0.0 R Luis Castillo 0.0 12.0 4.70 0.0 1.0 0.0 0.0 0.0 .000 0.0 31.0 1.589 3.467 3.88 12.13 Cincinnati Reds 1.0 Kauffman Stadium 3.13 1.43 0.0 .000 0.0
9 10.0 36.0 0.0 5.0 0.0 0.0 0.0 Kansas City 0.0 day 0.0 0.0 0.00 2020-08-19T21:05:00Z 0.0 2.0 2.0 2.0 11.0 1.10 1.0 1.0 5.0 4.09 0.0 home 0.0 0.00 0.0 0.0 11.0 0.0 0.0 .262 33.0 0.0 R Brad Keller 0.0 0.0 0.00 0.0 0.0 0.0 0.0 0.0 .--- 0.0 10.0 1.681 3.575 2.00 8.18 Kansas City Royals 0.0 Kauffman Stadium 4.09 0.91 0.0 1.000 2.0
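To get back to something closer to the original question's desired output (just the matchup and pitcher), the wide frame can be trimmed to the columns built above; a sketch:
probables = df[['gameDate', 'teamName', 'homeAway', 'probablePitcher']]
print(probables.head(10))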

How do I transform a very large dataframe to get the count of values in all columns (without using df.stack or df.apply)?

I am working with a very large dataframe (~3 million rows) and I need the count of values from multiple columns, grouped by time-related data.
I have tried stacking the columns, but the resulting dataframe was very long and wouldn't fit in memory. Similarly, df.apply gave memory issues.
For example, my sample dataframe is:
id,date,field1,field2,field3
1,1/1/2014,abc,,abc
2,1/1/2014,abc,,abc
3,1/2/2014,,abc,abc
4,1/4/2014,xyz,abc,
1,1/1/2014,,abc,abc
1,1/1/2014,xyz,qwe,xyz
4,1/7/2014,,qwe,abc
2,1/4/2014,qwe,,qwe
2,1/4/2014,qwe,abc,qwe
2,1/5/2014,abc,,abc
3,1/5/2014,xyz,xyz,
I have written the following script that does what is needed for a small sample but fails on a large dataframe.
df.set_index(["id", "date"], inplace=True)
df = df.stack(level=[0])
df = df.groupby(level=[0,1]).value_counts()
df = df.unstack(level=[1,2])
I also have a solution via apply but it has the same complications.
The expected result is,
date 1/1/2014 1/4/2014 ... 1/5/2014 1/4/2014 1/7/2014
abc xyz qwe qwe ... xyz xyz abc qwe
id ...
1 4.0 2.0 1.0 NaN ... NaN NaN NaN NaN
2 2.0 NaN NaN 4.0 ... NaN NaN NaN NaN
3 NaN NaN NaN NaN ... 2.0 NaN NaN NaN
4 NaN NaN NaN NaN ... NaN 1.0 1.0 1.0
I am looking for a more optimized version of what I have written.
Thanks for the help !!
You don't want to use stack, so another solution is using crosstab on id with each of the date and field columns, then concatenating the results together, grouping by the index, and summing. Use a list comprehension on df.columns[2:] to create each crosstab (note: I assume the first 2 columns are id and date, as in your sample):
pd.concat([pd.crosstab([df.id], [df.date, df[col]]) for col in df.columns[2:]]).groupby(level=0).sum()
Out[497]:
1/1/2014 1/2/2014 1/4/2014 1/5/2014 1/7/2014
abc qwe xyz abc abc qwe xyz abc xyz abc qwe
id
1 4 1.0 2.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0
2 2 0.0 0.0 0.0 1.0 4.0 0.0 2.0 0.0 0.0 0.0
3 0 0.0 0.0 2.0 0.0 0.0 0.0 0.0 2.0 0.0 0.0
4 0 0.0 0.0 0.0 1.0 0.0 1.0 0.0 0.0 1.0 1.0
I think showing 0 is better than NaN. However, if you want NaN instead of 0, you just need to chain an additional replace (with numpy imported as np) as follows:
pd.concat([pd.crosstab([df.id], [df.date, df[col]]) for col in df.columns[2:]]).groupby(level=0).sum().replace({0: np.nan})
Out[501]:
1/1/2014 1/2/2014 1/4/2014 1/5/2014 1/7/2014
abc qwe xyz abc abc qwe xyz abc xyz abc qwe
id
1 4.0 1.0 2.0 NaN NaN NaN NaN NaN NaN NaN NaN
2 2.0 NaN NaN NaN 1.0 4.0 NaN 2.0 NaN NaN NaN
3 NaN NaN NaN 2.0 NaN NaN NaN NaN 2.0 NaN NaN
4 NaN NaN NaN NaN 1.0 NaN 1.0 NaN NaN 1.0 1.0
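As a quick way to try the above on the sample rows from the question, the data can be pasted inline (io.StringIO is only used here to read the CSV text):
import io
import numpy as np
import pandas as pd

sample = """id,date,field1,field2,field3
1,1/1/2014,abc,,abc
2,1/1/2014,abc,,abc
3,1/2/2014,,abc,abc
4,1/4/2014,xyz,abc,
1,1/1/2014,,abc,abc"""
df = pd.read_csv(io.StringIO(sample))
out = (pd.concat([pd.crosstab([df.id], [df.date, df[col]]) for col in df.columns[2:]])
         .groupby(level=0).sum()
         .replace({0: np.nan}))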

How to combine dataframe rows

I have the following code:
import os
import pandas as pd
from pandas import ExcelWriter
from pandas import ExcelFile
fileName= input("Enter file name here (Case Sensitve) > ")
df = pd.read_excel(fileName +'.xlsx', sheetname=None, ignore_index=True)
xl = pd.ExcelFile(fileName +'.xlsx')
SystemCount= len(xl.sheet_names)
df1 = pd.DataFrame([])
for y in range(1, int(SystemCount) + 1):
    df = pd.read_excel(xl, 'System ' + str(y))
    df['System {0}'.format(y)] = "1"
    df1 = df1.append(df)
df1 = df1.sort_values(['Email'])
df = df1['Email'].value_counts()
df1['Count'] = df1.groupby('Email')['Email'].transform('count')
print(df1)
Which prints something like this:
Email System 1 System 2 System 3 System 4 Count
test_1_#test.com NaN 1 NaN NaN 1
test_2_#test.com NaN NaN 1 NaN 3
test_2_#test.com 1 NaN NaN NaN 3
test_2_#test.com NaN NaN NaN 1 3
test_3_#test.com NaN 1 NaN NaN 1
test_4_#test.com NaN NaN 1 NaN 1
test_5_#test.com 1 NaN NaN NaN 3
test_5_#test.com NaN NaN 1 NaN 3
test_5_#test.com NaN NaN NaN 1 3
How do I combine this, so the email only shows once, with all marked systems?
I would like the output to look like this:
System1 System2 System3 System4 Count
Email
test_1_#test.com 0.0 1.0 0.0 0.0 1
test_2_#test.com 1.0 0.0 1.0 1.0 3
test_3_#test.com 0.0 1.0 0.0 0.0 1
test_4_#test.com 0.0 0.0 1.0 0.0 1
test_5_#test.com 1.0 0.0 1.0 1.0 3
If I understand it correctly:
df1=df1.apply(lambda x : pd.to_numeric(x,errors='ignore'))
d=dict(zip(df1.columns[1:],['sum']*df1.columns[1:].str.contains('System').sum()+['first']))
df1.fillna(0).groupby('Email').agg(d)
Out[95]:
System1 System2 System3 System4 Count
Email
test_1_#test.com 0.0 1.0 0.0 0.0 1
test_2_#test.com 1.0 0.0 1.0 1.0 3
test_3_#test.com 0.0 1.0 0.0 0.0 1
test_4_#test.com 0.0 0.0 1.0 0.0 1
test_5_#test.com 1.0 0.0 1.0 1.0 3
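For clarity: the apply(pd.to_numeric, ...) line converts the "1" flags from strings to numbers so they can be summed, and the zip builds a dict mapping every 'System' column to 'sum' and the trailing 'Count' column to 'first'. Written out by hand for the frame in the question, it is equivalent to this sketch:
d = {'System 1': 'sum', 'System 2': 'sum', 'System 3': 'sum', 'System 4': 'sum', 'Count': 'first'}
df1.fillna(0).groupby('Email').agg(d)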
It'd be easier to get help if you would post code to generate your input data.
But you probably want a GroupBy:
df2 = df1.groupby('Email').sum()
