I am working on a project in which I scraped NBA data from ESPN and created a DataFrame to store it. One of the columns of my DataFrame is Team. Certain players that have been traded within a season have a value such as LAL/LAC under team, rather than just having one team name like LAL. With these rows of data, I would like to make 2 entries instead of one. Both entries would have the same, original data, except for 1 of the entries the team name would be LAL and for the other entry the team name would be LAC. Some team abbreviations are 2 letters while others are 3 letters.
I have already managed to create a separate DataFrame with just these rows of data that have values in the form team1/team2. I figured a good way of getting the data the way I want it would be to first copy this DataFrame with the multiple team entries, and then with one DataFrame, keep everything in the Team column up until the /, and with the other, keep everything in the Team column after the slash. I'm not quite sure what the code would be for this in the context of a DataFrame. I tried the following but it is invalid syntax:
first_team = first_team['TEAM'].str[:first_team[first_team['TEAM'].index("/")]]
where first_team is my DataFrame with just the entries with multiple teams. Perhaps this can give you a better idea of what I'm trying to accomplish!
Thanks in advance!
You're probably better off using split first to separate the teams into columns (also see Pandas DataFrame, how do i split a column into two), something like this:
d = pd.DataFrame({'player':['jordan','johnson'],'team':['LAL/LAC','LAC']})
pd.concat([d, pd.DataFrame(d.team.str.split('/').tolist(), columns = ['team1','team2'])], axis = 1)
    player     team team1 team2
0   jordan  LAL/LAC   LAL   LAC
1  johnson      LAC   LAC  None
Then if you want separate rows, you can use append (or pd.concat on pandas 2.0 and later, where DataFrame.append was removed).
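As an alternative route to one row per team, here is a minimal sketch that skips the intermediate team1/team2 columns. It assumes pandas 0.25 or later (for explode); the name rows is only illustrative.

```python
import pandas as pd

d = pd.DataFrame({'player': ['jordan', 'johnson'], 'team': ['LAL/LAC', 'LAC']})

# Split each team string into a list, then give every list element its own row.
# All other columns are repeated for each new row.
rows = (
    d.assign(team=d['team'].str.split('/'))
     .explode('team')
     .reset_index(drop=True)
)
print(rows)
```

Because the split is on the slash itself, 2-letter and 3-letter abbreviations are handled the same way.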
I have a large data set and I want to do the following task efficiently. Suppose we have 2 data frames. For each element in df2, I want to search df1 only in the rows that share the same first 2 letters, and then choose the entry with the most tokens in common.
Let's see in an example:
df1:
common work co
summer hot su
apple ap
colorful fall co
support it su
could comp co
df2:
condition work it co
common mistakes co
could comp work co
summer su
Take the first row of df2 as an example (condition work it). I want to find the row in df1 that has the same first_two and the most tokens in common.
The first_two of condition work it is co, so I want to search df1 where first_two is co. The search is therefore done among common work, colorful fall and could comp. Since condition work it has 1 token in common with common work, that one is selected.
output:
df2:
name first_two match
condition work it co `common work`
common mistakes co `common work`
could comp work co `could comp`
summer su `summer hot`
appears ap None
The last row is None since there is no common word between appears and apple
I did following:
df3=(df1.groupby(['first_two'])
.agg({'name': lambda x: ",".join(x)})
.reset_index())
merge_=df3.merge(df2, on='first_two',how='inner')
But now I have to search in name_x for each name_y. How can I find the element of name_x that has the most tokens in common with name_y?
You have pretty much explained the most efficient method already.
1. Extract the first 2 letters using .str[:2] on the series and assign them to a new column in both dataframes.
2. Extract the unique values of the 2-letter column from df2.
3. Inner join the result of #2 onto df1.
4. Perform a group-by count on the result of #3, sort descending by the count, and drop duplicates to get the most repeated item for each 2-letter value.
5. Left join the result of #4 onto df2.
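The steps above do not spell out how the token overlap itself is counted. A minimal sketch of one way to do it follows, assuming tokens are separated by whitespace and that ties go to the candidate that appears first in df1. The names pairs, shared_tokens, best and result are illustrative, and the appears row is included because it shows up in the question's expected output.

```python
import pandas as pd

df1 = pd.DataFrame({"name": ["common work", "summer hot", "apple",
                             "colorful fall", "support it", "could comp"]})
df2 = pd.DataFrame({"name": ["condition work it", "common mistakes",
                             "could comp work", "summer", "appears"]})

# Add the two-letter key to both frames.
for frame in (df1, df2):
    frame["first_two"] = frame["name"].str[:2]

# Pair every df2 row with all df1 candidates that share its key.
pairs = df2.merge(df1, on="first_two", how="left", suffixes=("", "_cand"))

def shared_tokens(row):
    """Count whitespace-separated tokens common to both names."""
    if pd.isna(row["name_cand"]):
        return 0
    return len(set(row["name"].split()) & set(row["name_cand"].split()))

pairs["overlap"] = pairs.apply(shared_tokens, axis=1)

# Keep, per df2 name, the candidate with the largest non-zero overlap.
best = (
    pairs[pairs["overlap"] > 0]
    .sort_values("overlap", ascending=False, kind="mergesort")  # stable sort
    .drop_duplicates("name")
)

result = df2.merge(
    best[["name", "name_cand"]].rename(columns={"name_cand": "match"}),
    on="name",
    how="left",
)
print(result)
```

Rows of df2 with no overlapping candidate, such as appears, end up with a missing value in match. On a very large data set the row-wise apply may be slow, so this is a starting point rather than a tuned solution.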
I am new to Python and dataframes. I have a big pandas dataframe that I need to extract information from. I will try to explain my problem with a small example.
Say my dataframe looks like this:
name city number
Hana NYC 23
Fred London 12
Ben Paris 90
Lisa Berlin 3
Now I have a list with entries that relate to the column "number"
numbers = [3,12,23]
and I want to have the corresponding entries in another list from the "name" column
names = ['Lisa', 'Fred', 'Hana']
Is there an existing function for this problem?
df[df.number.isin(numbers)].name.tolist()
and, if you want them sorted by number (which matches the order of numbers here, since that list is already sorted):
df[df.number.isin(numbers)].sort_values("number").name.tolist()
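If the list were not sorted, a lookup by label would follow the list's own order instead. A minimal sketch, assuming the values in number are unique and every entry of numbers exists in the dataframe:

```python
import pandas as pd

df = pd.DataFrame({
    "name": ["Hana", "Fred", "Ben", "Lisa"],
    "city": ["NYC", "London", "Paris", "Berlin"],
    "number": [23, 12, 90, 3],
})
numbers = [3, 12, 23]

# .loc with a list of labels returns the rows in the order of that list.
names = df.set_index("number").loc[numbers, "name"].tolist()
print(names)
```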
You did not explain your desired output, but:
If you want to filter by one of the criteria:
df[df['number'].isin(numbers)]
will leave you with the rows whose number is in the numbers list.
If you want both:
df[(df['number'].isin(numbers)) & (df['name'].isin(names))]
Don't forget that the entries of the names list must be strings:
names = ['Lisa', 'Fred', 'Hana']
I have a table whose columns are not well organized: different years of data end up at different column positions.
So I have to access the data through specific column names.
I am using this syntax to access a column.
df = df[["2018/12"]]
But when I just want to extract numbers under that column, using
df.iloc[0,0]
it throws an error like
single positional indexer is out-of-bounds
So I am using
df.loc[0]
but it has the column name with the numeric data.
How can I extract just the number of each row?
Below is the CSV data
Closing Date,2014/12,2015/12,2016/12,2017/12,2018/12,Trend
Net Sales,"31,634","49,924","62,051","68,137","72,590",
""
Net increase,"-17,909","-16,962","-34,714","-26,220","-29,721",
Net Received,-,-,-,-,-,
Net Paid,-328,"-6,038","-9,499","-9,375","-10,661",
When writing this dumb question, I was just a beginner who did not even know what I wanted to ask.
The OP's question comes down to "getting the row as a list", since he ended his post asking how to get the numbers (though he said "number", maybe by mistake) of each row.
The answer is that he made the mistake of using double square brackets in his example, and that caused the problems.
The solution is to use df = df["2018/12"] instead of df= df[["2018/12"]]
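To make the difference concrete, here is a minimal sketch with a few made-up rows taken from the example table further down. The names as_frame, as_series, first_value and values are illustrative.

```python
import pandas as pd

df = pd.DataFrame({
    "2018/12": [809, 731, 588],
    "country": ["United States", "Australia", "Japan"],
})

as_frame = df[["2018/12"]]   # double brackets: a one-column DataFrame
as_series = df["2018/12"]    # single brackets: a Series

print(type(as_frame).__name__, type(as_series).__name__)

first_value = as_frame.iloc[0, 0]  # row 0, column 0 of the one-column frame
values = as_series.tolist()        # all numbers of the column as a plain list
print(first_value, values)
```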
As for things I(me at the time of writing this) mentioned, I will answer them one by one:
Let's say the table looks like this
Unnamed: 0 2018/12 country drives_right
0 US 809 United States True
1 AUS 731 Australia False
2 JAP 588 Japan False
3 IN 18 India False
4 RU 200 Russia True
5 MOR 70 Morocco True
6 EG 45 Egypt True
1>df = df[["2018/12"]]
: it will output a dataframe which only has the column "2018/12" and the index column on the left side.
2>df.iloc[0,0]
Now, since from 1> we have a new dataframe having only one column(except for index column mentioning index values) this will output the first element of the column.
In the example above, the outcome will be "809" since it's the first element of the column.
3>
But when I just want to extract numbers under that column, using
df.iloc[0,0]
-> doesn't make sense if you want to extract all the numbers. It will just output the single element
809 from the sub-dataframe you created using df = df[["2018/12"]].
it throws an error like
single positional indexer is out-of-bounds
Maybe you are confused about the outcome. (Maybe in this case "df" is the dataframe from before the subset assignment df = df[["2018/12"]]?) Since df = df[["2018/12"]] produces a dataframe, df.iloc[0,0] will work fine on it.
4>
So I am using
df.loc[0]
but it has the column name with the numeric data.
: Yes df.loc[0] from df = df[["2018/12"]] will return column name and the first element of that column.
5>
How can I extract just the number of each row?
You mean "numbers" of each row right?
Use this:
print(df["2018/12"].values.tolist())
In terms of finding varying names of columns or rows, and then access each rows and columns, you should think of using regex.
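As a rough illustration of the regex idea, DataFrame.filter can select columns whose names match a pattern. The sketch below assumes the year columns always look like YYYY/MM; the sample values are copied from the CSV above and the name year_cols is illustrative.

```python
import pandas as pd

df = pd.DataFrame({
    "Closing Date": ["Net Sales", "Net Paid"],
    "2017/12": ["68,137", "-9,375"],
    "2018/12": ["72,590", "-10,661"],
    "Trend": [None, None],
})

# Keep only the columns whose label is four digits, a slash, two digits.
year_cols = df.filter(regex=r"^\d{4}/\d{2}$")
print(year_cols.columns.tolist())
```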
I am trying to aggregate a dataframe based on values that are found in two columns. I am trying to aggregate the dataframe such that the rows that have some value X in either column A or column B are aggregated together.
More concretely, I am trying to do something like this. Let's say I have a dataframe gameStats:
awayTeam homeTeam awayGoals homeGoals
Chelsea Barca 1 2
R. Madrid Barca 2 5
Barca Valencia 2 2
Barca Sevilla 1 0
... and so on
I want to construct a dataframe such that among my rows I would have something like:
team goalsFor goalsAgainst
Barca 10 5
One obvious solution, since the set of unique elements is small, is something like this:
for team in teamList:
    aggregateDf = gameStats[(gameStats['homeTeam'] == team) | (gameStats['awayTeam'] == team)]
    # do other manipulations of the data then append it to a final dataframe
However, going through a loop seems less elegant. And since I have had this problem before with many unique identifiers, I was wondering if there was a way to do this without using a loop as that seems very inefficient to me.
The solution is twofold: first compute the goals for each team when they are home and when they are away, then combine them. Something like:
goals_when_away = gameStats.groupby(['awayTeam'])[['awayGoals', 'homeGoals']].agg('sum').reset_index().sort_values('awayTeam')
goals_when_home = gameStats.groupby(['homeTeam'])[['homeGoals', 'awayGoals']].agg('sum').reset_index().sort_values('homeTeam')
then combine them
np_result = goals_when_away.iloc[:, 1:].values + goals_when_home.iloc[:, 1:].values
pd_result = pd.DataFrame(np_result, columns=['goal_for', 'goal_against'])
result = pd.concat([goals_when_away.iloc[:, :1], pd_result], axis=1, ignore_index=True)
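Note the use of .values when summing, so that the result is a plain NumPy array, and ignore_index=True in concat; both are there to avoid the pandas trap of aligning on column and index names. The positional sum assumes that every team appears at least once as both home and away team, so that both grouped frames list the same teams in the same order. With axis=1, ignore_index=True also relabels the result columns as 0, 1, 2.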
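When some teams only ever appear on one side, a variant that does not depend on row positions may be easier to work with. The following is a minimal sketch rather than the method above: it stacks a home view and an away view of the same games and sums per team. The names away, home and totals are illustrative, and the output column names are the ones from the question.

```python
import pandas as pd

gameStats = pd.DataFrame({
    "awayTeam": ["Chelsea", "R. Madrid", "Barca", "Barca"],
    "homeTeam": ["Barca", "Barca", "Valencia", "Sevilla"],
    "awayGoals": [1, 2, 2, 1],
    "homeGoals": [2, 5, 2, 0],
})

cols = ["team", "goalsFor", "goalsAgainst"]

# View each game once from the away team's side and once from the home team's side.
away = gameStats.rename(columns={
    "awayTeam": "team", "awayGoals": "goalsFor", "homeGoals": "goalsAgainst",
})[cols]
home = gameStats.rename(columns={
    "homeTeam": "team", "homeGoals": "goalsFor", "awayGoals": "goalsAgainst",
})[cols]

# Stack both views and sum per team; no loop over teams is needed.
totals = pd.concat([away, home], ignore_index=True).groupby("team", as_index=False).sum()
print(totals)
```

For the four sample games this gives Barca 10 goals for and 5 against, matching the row in the question.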
I'm new to python and I would appreciate if you give me an answer as soon as possible.
I'm processing a file containing reviews for products that can belong to more than 1 category. What I need is to group the review ratings by the categories, and date at the same time. Since I don't know the exact number of categories, or dates in advance, I need to add rows and columns as I'm processing the reviews data (50 GB file).
I've seen how I can add columns, however my trouble is adding a row without knowing how many columns are currently in the dataframe.
Here is my code:
list1=['Movies & TV', 'Books'] #categories so far
dfMain=pandas.DataFrame(index=list1,columns=['2002-09']) #only one column at the beginning
print(dfMain)
This is what dfMain looks like:
If I want to add a column, I simply do this:
dfMain.insert(0, date, 0) #where date is in format like '2002-09'
But if I want to add a new category(row) and fill all the dates(columns) with zeros? How do I do that? I've tried with method append, but it asks for all the columns as parameters. Method Insert doesn't seem to work either..
Here's a possible solution (written for pandas versions before 2.0, where DataFrame.append still exists):
dfMain.append(pd.Series(index=dfMain.columns, name='NewRow').fillna(0))
2002-09
Movies & TV NaN
Books NaN
NewRow 0.0
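Another option that does not rely on append is to assign to a new row label with .loc, which fills every existing column regardless of how many there are. A minimal sketch, where the category 'Music' and the date '2002-10' are made up for illustration:

```python
import pandas as pd

list1 = ['Movies & TV', 'Books']  # categories so far
dfMain = pd.DataFrame(index=list1, columns=['2002-09'])
dfMain.insert(0, '2002-10', 0)    # add a date column, as in the question

# Assigning a scalar to a label that does not exist yet adds a new row
# with that value in every current column.
dfMain.loc['Music'] = 0
print(dfMain)
```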