Say I have the dataframe below:
import pandas as pd

rankings = {'Team': ['A', 'A', 'B', 'B', 'C', 'C', 'C'],
            'Description': ['Aggressive', 'Strong', 'Passive',
                            'Strong', 'Bad', 'Loser Streak', 'Injured'],
            'Description Continued': ['Aggressive ...', 'Strong ...', 'Passive ...',
                                      'Strong ...', 'Bad ...', 'Loser Streak ...', 'Injured ...']}
rankings_pd = pd.DataFrame(rankings)
There are multiple rows for each team. I want one row per team, with the extra information spread across additional columns (Description1, Description2, and so on, and likewise for Description Continued); that is the desired output.
How can I achieve this?
How to pivot a dataframe?
That approach doesn't work for multiple value columns as in the example above.
The key is to create a numeric index per group before pivoting:
rankings_pd["count"] = rankings_pd.groupby("Team").cumcount() + 1
df = rankings_pd.pivot(
index="Team", columns="count", values=["Description", "Description Continued"]
)
df.columns = [f"{x}{y}" for x, y in df.columns]
I reset my flow with '000' whenever the patient has that value in the pattern column. But my summary table mixes all the patterns together and gives only one value, as shown in the out data frame. Instead, I would like to count the same patient in each pattern individually, as shown in the 'desired' data frame. Please help.
df2 = pd.DataFrame({'patient': ['one', 'one', 'one', 'one', 'one', 'one',
                                'one', 'one', 'one', 'one', 'one', 'one'],
                    'pattern': ['A', 'B', '000', 'B', 'B', '000', 'D', 'A', 'C', '000', 'A', 'B'],
                    'date': ['11/1/2022', '11/2/2022', '11/3/2022', '11/4/2022',
                             '11/5/2022', '11/6/2022', '11/7/2022', '11/8/2022',
                             '11/9/2022', '11/10/2022', '11/11/2022', '11/12/2022']})
m = df2['pattern'] == '000'
display(df2)
out = (
    df2[~m].sort_values(['patient', 'date'], ascending=True)
           .groupby(["patient"])
           .agg(pattern=("pattern", ",".join),
                patients=("patient", "nunique"))
           .reset_index(drop=True)
           .groupby(["pattern"]).agg({'patients': 'sum'}).reset_index()
)
display(out)
I would like to tweak my output to match the desired data frame below:
desired = pd.DataFrame({'pattern': ['A,B', 'B,B', 'D,A,C'],
'patients': [2, 1, 1]})
desired.head()
You are close; you need a new Series for grouping, built from the mask with a cumulative sum:
out = (df2[~m].sort_values(['patient', 'date'], ascending=True)
              .groupby(["patient", m.cumsum()])
              .agg(pattern=("pattern", ",".join),
                   patients=("patient", "nunique"))
              .reset_index(drop=True)
              .groupby(["pattern"])
              .agg({'patients': 'sum'})
              .reset_index())
print(out)
pattern patients
0 A,B 2
1 B,B 1
2 D,A,C 1
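For illustration, the cumulative sum of the '000' mask is what separates the runs: every '000' row bumps a counter, so the rows between resets share a group label. A quick way to see this with df2 and m from above:

print(df2.assign(flow=m.cumsum())[['patient', 'pattern', 'flow']])
#    patient pattern  flow
# 0      one       A     0
# 1      one       B     0
# 2      one     000     1
# 3      one       B     1
# 4      one       B     1
# 5      one     000     2
# 6      one       D     2
# 7      one       A     2
# 8      one       C     2
# 9      one     000     3
# 10     one       A     3
# 11     one       B     3

The '000' rows themselves are dropped by df2[~m], so each remaining group is exactly one run of patterns between resets.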
Is there a way to filter Pandas DataFrame rows using wildcard patterns?
Example initial state of the data.
df = pd.DataFrame([
    ['noun', 'nominative', 'singular', 'm', '', ''],
    ['noun', 'nominative', 'singular', 'f', '', ''],
    ['noun', 'nominative', 'singular', 'n', '', ''],
    ['noun', 'accusative', 'singular', 'n', '', ''],
    ['noun', 'accusative', 'singular', 'n', '', ''],
    ['noun', 'accusative', 'singular', 'n', '', ''],
    ['verb', '', 'singular', '', 'present', '1per'],
    ['verb', '', 'singular', '', 'present', '2per'],
    ['verb', '', 'singular', '', 'present', '3per'],
    ['verb', '', 'plural', '', 'present', '1per'],
    ['verb', '', 'plural', '', 'present', '2per'],
    ['verb', '', 'plural', '', 'present', '3per'],
], columns=['pos', 'case', 'number', 'gender', 'tense', 'person'])
mask = pd.Series(['noun','nominative','singular','*','',''])
Objective end state of data:
['noun','nominative','singular','m','',''],
['noun','nominative','singular','f','',''],
['noun','nominative','singular','n','',''],
You can just leave out the wildcard column when you do the comparison:
pattern = ['noun', 'nominative', 'singular', '', '']
cols_to_match = ['pos', 'case', 'number', 'tense', 'person']
mask = (df[cols_to_match] == pattern).all(axis=1)
df_filtered = df[mask]
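If you want the '*' in the mask Series from the question to behave like a real wildcard, one option is to build the boolean mask programmatically and skip any column whose pattern value is '*'. A sketch under that assumption (the '*' convention is carried over from the question, not a pandas feature):

# pattern values aligned with df's columns; '*' means "match anything" here
pattern = pd.Series(['noun', 'nominative', 'singular', '*', '', ''], index=df.columns)

mask = pd.Series(True, index=df.index)
for col, val in pattern.items():
    if val != '*':          # skip wildcard columns entirely
        mask &= df[col] == val

df_filtered = df[mask]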
I have the following task.
I have this data:
import pandas
import numpy as np
data = {'name': ['Todd', 'Chris', 'Jackie', 'Ben', 'Richard', 'Susan', 'Joe', 'Rick'],
        'phone': [912341.0, np.nan, 912343.0, np.nan, 912345.0, 912345.0, 912347.0, np.nan],
        ' email': ['todd#gmail.com', 'chris#gmail.com', np.nan, 'ben#gmail.com', np.nan, np.nan, 'joe#gmail.com', 'rick#gmail.com'],
        'most_visited_airport': ['Heathrow', 'Beijing', 'Heathrow', np.nan, 'Tokyo', 'Beijing', 'Tokyo', 'Heathrow'],
        'most_visited_place': ['Turkey', 'Spain', np.nan, 'Germany', 'Germany', 'Spain', np.nan, 'Spain']
        }
df = pandas.DataFrame(data)
For every feature column (most_visited_airport etc.) and each of its values (Heathrow, Beijing, Tokyo), I have to extract the personal information and write it to a file.
E.g. for most_visited_airport and Heathrow, I need to output three files containing the names, emails and phones of the people who visited that airport the most.
Currently, I have this code to do the operation for both columns and all the values:
columns_to_iterate = [x for x in df.columns if 'most' in x]

for each in df[columns_to_iterate]:
    values = df[each].dropna().unique()
    for i in values:
        df1 = df.loc[df[each] == i, 'name']
        df2 = df.loc[df[each] == i, ' email']
        df3 = df.loc[df[each] == i, 'phone']
        df1.to_csv(f'{each}_{i}_{df1.name}.csv')
        df2.to_csv(f'{each}_{i}_{df2.name}.csv')
        df3.to_csv(f'{each}_{i}_{df3.name}.csv')
Is it possible to do this in a more elegant and maybe faster way? Currently I have a small dataset, but I am not sure this code will perform well with big data. My particular concern is the nested loops.
Thank you in advance!
You could replace the call to unique with a groupby, which would not only get the unique values, but split up the dataframe for you:
for column in df.filter(regex='^most'):
    for key, group in df.groupby(column):
        # note the leading space in the ' email' column name from the question
        for attr in ('name', 'phone', ' email'):
            group[attr].dropna().to_csv(f'{column}_{key}_{attr.strip()}.csv')
You can do it this way.
cols = df.filter(regex='most').columns.values

def func_current_cols_to_csv(most_col):
    place = [i for i in df[most_col].dropna().unique().tolist()]
    csv_cols = ['name', 'phone', ' email']
    result = [df[df[most_col] == i][j].dropna().to_csv(f'{most_col}_{i}_{j}.csv', index=False)
              for i in place for j in csv_cols]
    return result

[func_current_cols_to_csv(i) for i in cols]
Also, when writing to CSV you can keep the index in the to_csv options, but do not forget to reset it before writing.
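For example, something along these lines keeps a clean, renumbered index in each output file (a sketch using one column/value pair from the data above):

subset = df.loc[df['most_visited_airport'] == 'Heathrow', 'name'].dropna().reset_index(drop=True)
subset.to_csv('most_visited_airport_Heathrow_name.csv')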
I have two data frames, and for each row in one I want to locate the matching row in the other by a certain column (containing some id). I thought I would go over the rows of df1 and use loc to find the matching row in df2.
The problem is that some of the ids in df2 have some extra information besides the id itself.
For example:
df1 has the id: 1234,
df2 has the id: 1234-KF
How can I locate this id for example with loc? Can loc somehow match only by prefixes?
The extra information can be removed using e.g. a regular expression (or a substring):
import pandas as pd
import re

df1 = pd.DataFrame({
    'id': ['123', '124', '125'],
    'data': ['A', 'B', 'C']
})
df2 = pd.DataFrame({
    'id': ['123-AA', '124-AA', '125-AA'],
    'data': ['1', '2', '3']
})

# strip everything that is not a digit from df2's ids, then compare row by row
# (this relies on both frames being aligned on the same index)
df2.loc[df2.id.apply(lambda s: re.sub("[^0-9]", "", s)) == df1.id]
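Alternatively, if the goal is to pull matching rows from both frames together rather than compare them row by row, merging on a cleaned-up id works without relying on index alignment. A sketch, assuming the part before '-' is the id to match on:

# derive a join key from df2 by keeping only the prefix before '-'
df2_clean = df2.assign(id_prefix=df2['id'].str.split('-').str[0])
merged = df1.merge(df2_clean, left_on='id', right_on='id_prefix', suffixes=('_df1', '_df2'))
print(merged)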
I currently have a csv file as follows. The first part just shows the column names.
"f","p","g"
"foo","in","void"
"foo","out","void"
"foo","length","void"
...
The g column values are the same for every f value. The only unique part is p.
Using Python, how could I combine this as follows:
"foo","in","out","length","void"
One thing to note is that the csv file is much larger and that some f values might have more p values. For example, it could be like this:
"goo","a","int"
"goo","b","int"
"goo","c","int"
"goo","d","int"
"goo","e","int"
"goo","f","int"
...
I hope I've understood your question right. You can group by the "f" and "g" columns and then aggregate the "p" values into a list:
x = df.groupby(["f", "g"], as_index=False)["p"].agg(list)

for vals in x.apply(lambda row: [row["f"], *row["p"], row["g"]], axis=1):
    print(vals)
Prints:
['foo', 'in', 'out', 'length', 'void']
['goo', 'a', 'b', 'c', 'd', 'e', 'f', 'int']
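Since the combined rows can have different lengths, the csv module is a simple way to write them back out in the same quoted style as the input. A sketch (the output filename is just an example):

import csv

with open('combined.csv', 'w', newline='') as fh:
    writer = csv.writer(fh, quoting=csv.QUOTE_ALL)
    for vals in x.apply(lambda row: [row["f"], *row["p"], row["g"]], axis=1):
        writer.writerow(vals)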