Add Several Columns to Pandas DataFrame Based on Existing Columns

How can I label my x-axis with multiple columns? Here's an example that works:
import pandas as pd
import matplotlib.pyplot as plt

df = pd.DataFrame({"player_name": ["Alan", "Bob", "Carl", "Dan", "Earl"],
                   "jersey_number": ['1', '2', '3', '4', '5'],
                   "hits": [2, 3, 1, 2, 4],
                   "at_bats": [7, 6, 8, 7, 8]})
df["label"] = df["player_name"] + "-" + df["jersey_number"]
df.plot(x="label", y=["hits", "at_bats"])
plt.show()
But this has a couple of weaknesses. First, the line that creates the label column is tedious. Second, string concatenation is finicky: if the jersey_number values aren't strings (e.g. ints instead), the concat fails. I could write a subroutine that takes a list of columns, casts them all to strings, and concatenates them, but that seems like it should be unnecessary; there should be some built-in way to do this, something like:
df = pd.DataFrame({"player_name": ["Alan", "Bob", "Carl", "Dan", "Earl"],
                   "jersey_number": ['1', '2', '3', '4', '5'],
                   "hits": [2, 3, 1, 2, 4],
                   "at_bats": [7, 6, 8, 7, 8]})
df.plot(x=["player_name", "jersey_number"], y=["hits", "at_bats"])
plt.show()
This doesn't work; it throws ValueError: x must be a label or position.
My google-fu hasn't been strong enough to discover the correct syntax. Does it exist, and if so, what is it? Thanks

One option is to set those columns as the index, then plot:
df.set_index(["player_name", "jersey_number"]).plot(y=["hits", "at_bats"])
which gives a plot with the (player_name, jersey_number) MultiIndex along the x-axis. Although I would prefer your first approach, since it gives a better representation:
df["label"] = df[["player_name","jersey_number"]].astype(str).agg('-'.join)
or
df['label'] = [f'{x}-{y}' for x,y in zip(df["player_name"],df["jersey_number"]) ]
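For reference, a minimal runnable sketch of the label-column approach, using the question's data but with jersey_number deliberately stored as ints to show the cast doing its job:

import pandas as pd
import matplotlib.pyplot as plt

df = pd.DataFrame({"player_name": ["Alan", "Bob", "Carl", "Dan", "Earl"],
                   "jersey_number": [1, 2, 3, 4, 5],  # ints, not strings
                   "hits": [2, 3, 1, 2, 4],
                   "at_bats": [7, 6, 8, 7, 8]})

# astype(str) casts every selected column, so the row-wise join
# works regardless of the original dtypes
df["label"] = df[["player_name", "jersey_number"]].astype(str).agg("-".join, axis=1)
df.plot(x="label", y=["hits", "at_bats"])
plt.show()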

How do I pull the index(es) and column(s) of a specific value from a dataframe?

Hello, everyone! New student of Python's pandas here.
I have a dataframe I artificially constructed here: https://i.stack.imgur.com/cWgiB.png. Below is a text reconstruction.
import pandas as pd

df_dict = {
    'header0': [55, 12, 13, 14, 15],
    'header1': [21, 22, 23, 24, 25],
    'header2': [31, 32, 55, 34, 35],
    'header3': [41, 42, 43, 44, 45],
    'header4': [51, 52, 53, 54, 33]
}
index_list = {
    0: 'index0',
    1: 'index1',
    2: 'index2',
    3: 'index3',
    4: 'index4'
}
df = pd.DataFrame(df_dict).rename(index = index_list)
GOAL:
I want to pull the index row(s) and column header(s) of any ARBITRARY value(s) (int, float, str, etc.). So, for example, if I want the value 55, this code will return: header0, index0, header2, index2 in some format. It could be a list, a tuple, printed output, etc.
CLARIFICATIONS:
Imagine the dataframe is of a large enough size that I cannot "just find it manually"
I do not know how large this value is in comparison to other values (so a "simple .idxmax()" probably won't cut it)
I do not know where this value is column or index wise (so "just .loc,.iloc where the value is" won't help either)
I do not know whether this value has duplicates or not, but if it does, return all its column/indexes.
WHAT I'VE TRIED SO FAR:
I've played around with .columns, .index, and .loc, but just can't seem to get the answer. The farthest I've gotten is creating a boolean dataframe with df.values == 55 or df == 55, but I can't seem to do anything with it.
Another "farthest" way I've gotten is using df.unstack.idxmax(), which would return a tuple of the column and header, but has 2 major problems:
Only returns the max/min as per the .idxmax(), .idxmin() functions
Only returns the FIRST column/index matching my value, which doesn't help if there are duplicates
I know I could do a for loop to iterate through the entire dataframe, tracking which column and index I am on in temporary variables. Once I hit the value I am looking for, I'll break and return the current column and index. Was just hoping there was a less brute-force-y method out there, since I'd like a "high-speed calculation" method that would work on any dataframe of any size.
Thanks.
EDIT: Added text reconstruction of the dataframe, clarified questions.
Use np.where:
import numpy as np

r, c = np.where(df == 55)
list(zip(df.index[r], df.columns[c]))
Output:
[('index0', 'header0'), ('index2', 'header2')]
There is a function in pandas that gives duplicate rows.
duplicate = df[df.duplicated()]
print(duplicate)
Use DataFrame.unstack to get a Series with a MultiIndex, then filter duplicates with Series.duplicated using keep=False:
s = df.unstack()
out = s[s.duplicated(keep=False)].index.tolist()
If you also need the duplicated values alongside their locations:
df1 = (s[s.duplicated(keep=False)]
         .sort_values()
         .rename_axis(['cols', 'idx'])   # level 0 = original columns, level 1 = original index
         .reset_index(name='val'))
If you need a specific value, change the mask to Series.eq (==):
s = df.unstack()
out = s[s.eq(55)].index.tolist()
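For the sample frame above, this should give (unstack puts the original columns in the first index level, so each tuple reads (column, index)):

[('header0', 'index0'), ('header2', 'index2')]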
The code below also iterates, but only over the columns rather than over every cell of the DataFrame: .any() checks whether a column contains the desired value at all, and .loc then locates the matching rows and returns their indexes.
wanted_value = 55
for col in df.columns:
    if df[col].eq(wanted_value).any():
        print("row:", *list(df.loc[df[col].eq(wanted_value)].index), ' col', col)

JSON normalizing with pandas json_normalize for dynamic records and inner arrays

I have the following problems: I am reading records from a Mongo collection, which are in fact JSON records, for example the following (test) record:
d = {"accounts" : [{"Email" : [{"testclass1":"6b296577-d437-4209-9e91-2fd67b5e7f1e#gmail.com","testclass2" : "6b296577-d437-4209-9e91-2fd67b5e7f1e__PI_1_1","testclass3" : "6b296577-d437-4209-9e91-2fd67b5e7f1e__PI_1_1"}],"bank_id" : "test14-bank","views_available" : [{"is_public" : "True","short_name" : "HHH","id" : "1"}],"Data_Type_test_Boolean" : "True","Data_Type_test_Number" : "44444444","pi_values" : ["6b296577-d437-4209-9e91-2fd67b5e7f1e#gmail.com","6b296577-d437-4209-9e91-2fd67b5e7f1e__PI_1_1","6b296577-d437-4209-9e91-2fd67b5e7f1e__PI_1_1"],"id" : "e01e1118-6143-428d-881c-b04a20b54076","label" : "My account label","sensitivity" : "sensitivity_data"}]}
Problem #1:
I want to ultimately receive a completely flattened pandas dataframe with columns such as accounts.Email.testclass1, ..., accounts.views_available.is_public, ...
I tried:
>>> pd.json_normalize(d, record_path="accounts").columns
Index(['Email', 'bank_id', 'views_available', 'Data_Type_test_Boolean',
       'Data_Type_test_Number', 'pi_values', 'id', 'label', 'sensitivity',
       'accounts'],
      dtype='object')
but it does not suffice, as it misses the entire accounts.views_available.is_public hierarchy, which is in fact an inner-array object.
I searched quite a lot and did not find any way to handle this scenario.
Is there a way to pass to the json_normalize function, in one line, all the record_paths ("outer" and "inner") needed for the result I want?
I am aware of solutions such as suggested here Normalizing json list as values, yet wondering if there is a cleaner (as in fewer-lines) option.
Problem #2:
As described, I am reading multiple records from a Mongo collection into a dataframe, where the JSON structure may change from one record to another. That is, the dict d in the example may have a different structure in a different record (part of the reason for representing it as JSON...).
Ideally, I would like to be able to use the json_normalize function on the entire set of Mongo records (and provide it with a list of dictionaries), but since the JSON structure may vary across records, I'm not sure it's feasible.
Any advice will be much appreciated.
I don't think a one-liner for json_normalize() is going to cut it, unfortunately. It's not clear what would change from one record to the next. This particular record has this structure:
list(d['accounts'][0].keys())
['Email',
'bank_id',
'views_available',
'Data_Type_test_Boolean',
'Data_Type_test_Number',
'pi_values',
'id',
'label',
'sensitivity']
where Email and views_available are both lists of dictionaries, but pi_values is a plain list of strings, which takes more work to flatten. You can play around with the following to see if anything sticks.
dfe = pd.json_normalize(d['accounts'], record_path=['Email'],
                        meta=['bank_id', 'Data_Type_test_Boolean', 'Data_Type_test_Number',
                              'id', 'label', 'sensitivity'],
                        record_prefix='accounts.Email.')
print(dfe)
dfv = pd.json_normalize(d['accounts'], record_path=['views_available'],
                        record_prefix='accounts.views_available.')
print(dfv)
dfp = pd.json_normalize(d['accounts'], record_path=['pi_values'],
                        record_prefix='accounts.pi_values.').T.reset_index(drop=True)
new_cols = ['accounts.pi_values.' + str(v) for v in dfp.columns.tolist()]
dfp.columns = new_cols
print(dfp)
df_final = pd.concat([dfe, dfv, dfp], axis=1)
print(df_final)
Column names in the final flattened dataframe:
Index(['accounts.Email.testclass1', 'accounts.Email.testclass2',
'accounts.Email.testclass3', 'bank_id', 'Data_Type_test_Boolean',
'Data_Type_test_Number', 'id', 'label', 'sensitivity',
'accounts.views_available.is_public',
'accounts.views_available.short_name', 'accounts.views_available.id',
'accounts.pi_values.0', 'accounts.pi_values.1', 'accounts.pi_values.2'],
dtype='object')
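For Problem #2, nothing in this thread settles it, but one option is to flatten each Mongo document separately and let pd.concat align the differing column sets, filling missing fields with NaN. A rough sketch; records is a hypothetical name standing in for the list of documents read from the collection, and here we just reuse the test record d:

import pandas as pd

# 'records' is assumed to be the list of dicts fetched from Mongo
records = [d]
frames = [pd.json_normalize(r['accounts']) for r in records]
# concat aligns frames by column name; fields absent from a record become NaN
df_all = pd.concat(frames, ignore_index=True, sort=False)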

Why do I get a Series inside an apply/assign function in pandas? Want to use each value to look up a dict

I have a dict of countries and population:
population_dict = {"Germany": 1111, .... }
In my df (sort_countries) I have a column called 'country' and I want to add another column called 'population' from the dictionary above (matching 'country' with 'population'):
population_df = sort_countries.assign(
population=lambda x: population_dict[x["country"]], axis = 1)
population_df.head()
which gives the error: TypeError: 'Series' objects are mutable, thus they cannot be hashed.
Why is x["country"] a Series when I would imagine it should return just the name of the country?
This bit of pandas always confuses me. In my lambdas I would expect x to be a row and I just select the country from that row. Instead len(x["country"]) gives me 192 (the number of my countries, the whole Series).
How else can I match them using lambdas and not a separate function?
Note that assign passes the entire DataFrame to your lambda, so x["country"] is the full country column, a Series, and a Series is unhashable, so it cannot be used to index the dictionary. If the Series held a single element, x["country"].item() would extract the value, but that is not the case here.
However, a better approach tailor-made for this kind of thing is Series.map:
population_df["population"] = population_df["country"].map(population_dict)
map looks up each element of population_df["country"] as a key in population_dict and fills in the corresponding value.
Also:
population_df["population"] = population_df.apply(lambda x: population_dict[x["country"]], axis=1)
works.
Or:
population_df["population"] = population_df[["country"]].applymap(lambda x: population_dict[x])

Stacked Plot With Python

I'm using Python modules like pandas and Matplotlib to make charts for a university project.
I have some problems ordering the result in the pivot table.
This is the body of a function that takes three lists as input (e.g. ['2017-03-03', ...], ['Username1', 'Username2', ...], [1012020, 103024, ...]), analyzes the data, and makes a chart from it.
data = [date_list,username,field]
username_no_dup = list(set(username))
rows = zip(data[0], data[1], data[2])
headers = ['Date', 'Username', 'Value']
df = pd.DataFrame(rows, columns=headers)
df = df.sort_values('Value', ascending=False)
# sort_values works, but the ordering is lost when converting to a pivot table
pivot_df = pd.pivot_table(df ,index='Date', columns='Username', values='Value')
pivot_df.loc[:,username_no_dup].plot(kind='bar', stacked=True, color=color_list, figsize=(15,7))
I would like to order the stacks by value, with the greatest values nearest the x-axis. Has anyone solved this problem? Thank you.
Here are the top rows of df sorted by value:
[['2017-03-15','SSL1_APP',1515091]
['2017-03-16','SSL1_APP',1373827]
['2017-03-18','SSL1_APP',1136483]
['2017-03-21','SSL1_APP',601810]
['2017-03-17','SSL1_APP',325561]
['2017-03-15','KE77_APP',284971]
['2017-03-16','AF77_APP',222588]
['2017-03-16','MI77_APP',222148]
['2017-03-15','AF77_APP',202224]
['2017-03-15','MI77_APP',191791]
['2017-03-17','AF77_APP',187709]
['2017-03-16','PC77_APP',185766]
['2017-03-15','NE77_APP',177475]
['2017-03-18','FBW2_APP',175156]
['2017-03-16','NE77_APP',174570]
['2017-03-17','BFD1_APP',164238]
['2017-03-15','BFD1_APP',162931]
['2017-03-20','AF77_APP',152186]
['2017-03-17','PC77_APP',148727]
['2017-03-18','MI77_APP',147460]
['2017-03-16','BFD1_APP',145815]
['2017-03-20','BFD1_APP',145449]
['2017-03-15','PC77_APP',144959]
['2017-03-20','SSL1_APP',141719]]
The first pic is the plot I have created; the second one is the result I want, plotted with Excel.
Note: this is a Python answer on the subject of sorting your input.
One way of doing this would be to use a two-dimensional list (a list of lists) and then sort it.
This is how you've been using it:
data = [date0,username0,randint0,date1,username1, ....
Try a bidimensional list instead:
data = [[date0,username0,randint0], [date1,username1,randint1]...
Use the .sort() method and change the syntax to look like this:
data.sort()  # sorts in place; by default Python sorts lists in ascending order
rows = data  # each inner list is already a (date, username, value) row
The standard .sort() method has its limitations (mixed types, for one), so if it doesn't return a desirable output, try .sort()'s key and reverse parameters; here is an insight on the subject: How to use .sort()
If you are having trouble with floats, check an answer that will help you here.
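If the goal is simply to get the biggest contributors drawn nearest the x-axis, a pandas-side sketch (not part of the answer above) is to reorder the pivoted columns by their totals before plotting, since pandas stacks columns in order with the first column at the bottom. Assuming the pivot_df from the question:

# columns with the largest overall totals first, i.e. at the bottom of each stack
order = pivot_df.sum().sort_values(ascending=False).index
pivot_df[order].plot(kind='bar', stacked=True, figsize=(15, 7))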

Pandas MultiIndex with integer labels

I have a MultiIndex with some levels labeled with strings, and others with integers:
import pandas as pd
metrics = ['PT', 'TF', 'AF']
n_replicates = 3
n_nodes = 6
cols = [(r,m,n) for r in range(n_replicates) for m in metrics for n in range(n_nodes)]
cols = pd.MultiIndex.from_tuples(cols, names=['Replicates', 'Metrics', 'Nodes'])
ind = range(5)
df = pd.DataFrame(columns=cols, index=ind)
df = df.sort_index(axis=1, level=0)  # sortlevel was removed from pandas; sort_index is the replacement
If I want to select a single column with an integer label, no problem:
df[2, 'AF', 5]
If I try to select a range, though:
df[1:4, 'AF', 5]
TypeError:
(No message given)
If I leave out the last level, I get a different error:
df = df.sort_index(axis=1, level=0)
df[1:4, 'AF']
TypeError: unhashable type
I suspect I'm playing with fire when I'm using integers as column labels. Is the "safe" route to simply have them all as strings? Or are there other ways of indexing MultiIndex dataframes with integer labels?
Edit:
It's now clear to me that I should be using .loc. Good. However, it's still not clear to me how to interact with the lower levels of the MultiIndex.
df.loc[:,:] #Good
df.loc[:,1:2] #Good
df.loc[:,[1:2, 'AF']]
SyntaxError: invalid syntax
df.loc[:,1:2].xs('AF', level='Metrics', axis=1) #Good
Is the last line just what I need to use? If so, fine. It's just sufficiently long that it makes me feel I'm ignorant of a better way. Thanks for the help!
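For what it's worth, one common alternative the thread doesn't mention is pd.IndexSlice, which lets you express a slice for every level in a single .loc call (the columns must be lexsorted, as they are after the sort above). A minimal sketch against the df built in the question:

idx = pd.IndexSlice
# replicates 1 through 2, metric 'AF', all nodes, in one .loc call
df.loc[:, idx[1:2, 'AF', :]]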
