Pandas apply method: get an index label from a pivot_table - python

I am using a well-known dataset as an example: the most popular baby names given to newborn babies in New York City, by ethnicity. The dataset is available at this address: https://data.cityofnewyork.us/api/views/25th-nujf/rows.csv
I have been using Pandas for a few months, and still have issues with the pivot_table.
I wanted to know for each year, what is the most popular first name, and I did this (it works):
import pandas as pd
url = "https://data.cityofnewyork.us/api/views/25th-nujf/rows.csv"
df = pd.read_csv(url)
pt = df.pivot_table(index="Child's First Name", columns="Year of Birth", values="Gender", aggfunc="count", fill_value=0, margins=True)
The variable pt gives me the first names as rows and the years as columns; the values are the number of times each first name was given in a given year.
Now, I want to do the opposite: from a value in a given year, find the corresponding index (and therefore the first name).
For example, I want to know which first names were given 4 times in 2015.
To do this I do:
condition = pt[2015] == 4
result = pt[condition]
print(result)
Now I want to use an apply function that returns, for each row, the first name in question.
I did this, but it doesn't work:
pt["First Name First Letter"] = pt.apply(lambda x: x.index[0], axis=1)
I definitely want to use the apply function because I feel that there is always something that works differently when dealing with a pivot_table...
Could anyone help me, please?
Here is the incorrect result I get (screenshot not reproduced here).
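A likely explanation, sketched under the assumption that result is the filtered pivot table from the step above (the "First Name" column name is just illustrative): with axis=1, apply passes each row to the lambda as a Series indexed by the column labels, so x.index[0] returns the first year column rather than the first name. The row's own index label is available as x.name:
result["First Name"] = result.apply(lambda x: x.name, axis=1)  # x.name holds the row's index label
# or, without apply at all:
names = result.index.tolist()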


Splitting a large pandas datafile based on the data in one column

I have a large-ish csv file that I want to split into separate data files based on the data in one of the columns, so that all related data can be analyzed.
ie. [name, color, number, state;
bob, green, 21, TX;
joe, red, 33, TX;
sue, blue, 22, NY;
....]
I'd like to have it put each state's worth of data into its own data subfile:
df[1] = [bob, green, 21, TX] [joe, red, 33, TX]
df[2] = [sue, blue, 22, NY]
Pandas seems like the best option for this, as the csv file given is about 500 lines long.
You could try something like:
import pandas as pd
for state, df in pd.read_csv("file.csv").groupby("state"):
    df.to_csv(f"file_{state}.csv", index=False)
Here file.csv is your base file. If it looks like
name,color,number,state
bob,green,21,TX
joe,red,33,TX
sue,blue,22,NY
the output would be 2 files:
file_TX.csv:
name,color,number,state
bob,green,21,TX
joe,red,33,TX
file_NY.csv:
name,color,number,state
sue,blue,22,NY
There are different methods for reading CSV files; you may find them all at the following link:
(https://www.analyticsvidhya.com/blog/2021/08/python-tutorial-working-with-csv-file-for-data-science/)
Since you want to work with dataframes, using pandas is indeed a practical choice. To start, you may do:
import pandas as pd
df = pd.read_csv(r"file_path")
Now let's assume after these lines, you have the following dataframe:
  name  color  number  state
0 bob   green  21      TX
1 joe   red    33      TX
2 sue   blue   22      NY
  ...   ...    ...     ...
From your question, I understand that you want to dissect the information based on different states. The state data may be mixed (e.g. TX-NY-TX-DZ-TX), so sorting alphabetically and resetting the index may be the first step:
df = df.sort_values(by=['state'])
df.reset_index(drop=True, inplace=True)
Now, there are several methods we may use. From your question, I did not quite understand the df[1] = two lists, df[2] = one list notation; I am assuming you meant a list of lists per state. In that case, let's use the following method:
Method 1 - Making a List of Lists for Different States
First, let's get the state list without duplicates:
s_list = list(dict.fromkeys(df.loc[:,"state"].tolist()))
Now we can use a list comprehension:
lol = [[df.iloc[i2, :].tolist() for i2 in range(df.shape[0])
        if state == df.loc[i2, "state"]] for state in s_list]
The lol (list of lists) variable contains one inner list per state, and each inner list holds that state's rows (themselves lists). So you may reach a state by writing lol[0], lol[1], etc.
Method 2 - Making Different Dataframes for Different States
In this method, if there are 20 states, we get 20 dataframes, which we may combine in a list. First, we need the state names again:
s_list = list(dict.fromkeys(df.loc[:,"state"].tolist()))
We need to get the row index values (as a list of lists) for the different states. (For example, NY is in rows 3, 6, 7, ...)
r_index = [[i for i in range(df.shape[0])
            if df.loc[i, "state"] == state] for state in s_list]
Let's make different dataframes for the different states (and reset the index):
dfs = [df.loc[rows, :] for rows in r_index]
for frame in dfs:
    frame.reset_index(drop=True, inplace=True)
Now you have a list which contains n (the number of states) dataframes. After this point, you may sort each dataframe by name, for example, as sketched below.
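For instance, a one-line sketch of that sort, assuming the column is literally called "name":
dfs = [frame.sort_values(by="name").reset_index(drop=True) for frame in dfs]  # sort each state's rows alphabetically by name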
Method 3 - My Recommendation
Firstly, I would recommend you split the data based on name, since it is a great identifier. But I am assuming you need to use the state information. I would add the state column as an index and make a nested dictionary:
import pandas as pd

df = pd.read_csv(r"path")
df = df.sort_values(by=['state'])
df.reset_index(drop=True, inplace=True)

# we know state is in column 3
states = list(dict.fromkeys(df.iloc[:, 3].tolist()))
rows = [[i for i in range(df.shape[0]) if df.iloc[i, 3] == s] for s in states]

# running counter within each state, inserted as a "No" column and used as a second index level
temp = [[i2 for i2 in range(len(rows[i]))] for i in range(len(rows))]
into = [inner for outer in temp for inner in outer]
df.insert(4, "No", into)
df.set_index(pd.MultiIndex.from_arrays([df.iloc[:, no] for no in [3, 4]]), inplace=True)
df.drop(df.columns[[3, 4]], axis=1, inplace=True)

# one dataframe per state, melted so that (state, No, column name) indexes each value
dfs = [df.iloc[row, :] for row in rows]
for i in range(len(dfs)):
    dfs[i] = dfs[i].melt(var_name="app", ignore_index=False).set_index("app", append=True)

# recursively turn the MultiIndex levels into nested dictionary keys
def call(df):
    if df.index.nlevels == 1:
        return df.to_dict()[df.columns[0]]
    return {key: call(df_gr.droplevel(0, axis=0)) for key, df_gr in df.groupby(level=0)}

data = {}
for i in range(len(states)):
    data.update(call(dfs[i]))
I may have made some typos, but I hope you get the idea.
This code gives a nested dictionary in which the first key is the state (TX, NY, ...), the next key is the within-state index (0, 1, 2, ...), and the last key is name, color, or number.
Now that I look back at the number column in the csv file: if it has no duplicates, you may avoid making a new column by using number directly.
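To illustrate the resulting structure, a hypothetical lookup on the sample data above:
print(data["TX"][0]["name"])   # -> 'bob'
print(data["TX"][1]["color"])  # -> 'red'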

Assigning a list of strings as names for dataframes

I have searched and searched and not found what I would think was a common question, which makes me think I'm going about this wrong. So I humbly ask these two versions of the same question.
I have a list of currency names, as strings. A short version would look like this:
col_names = ['australian_dollar', 'bulgarian_lev', 'brazilian_real']
I also have a list of dataframes (df_list). Each one has a column for date, currency exchange rate, etc. Here's the head of one of them (screenshot not reproduced here).
I would be stoked to assign each one of those strings in col_names as a variable name for a dataframe in df_list. I did make a dictionary where the key/value pairs were the currency name and the corresponding df, but I didn't really know how to use it, primarily because it was unordered. Is there a way to zip col_names and df_list together? I could also just unpack each df in df_list and use the title of the second column as the title of the frame. That seems really cool.
So instead I just wrote something that gave me index numbers and then hand put them into the function I needed. Super kludgy but I want to make the overall project work for now. I end up with this in my figure code:
for ax, currency in zip((ax1, ax2, ax3, ax4), (df_list[38], df_list[19], df_list[10], df_list[0])):
    ax.plot(currency["date"], currency["rolling_mean_30"])
And that's OK. I'm learning, not delivering something to a client. I can use it to make eight line plots. But I want to do this with 40 frames so I can get the annual or monthly volatility. I have to take a list of data frames and unpack them by hand.
Here is the second version of my question. Take df_list and:
def framer(currency):
    index = col_names.index(currency)
    df = df_list[index]  # a dataframe containing a single currency and the columns built in cell 3
    return df
brazilian_real = framer("brazilian_real")
Which unpacks a df (but only if I type out the name), and then:
def volatizer(currency):
    all_the_years = [currency[currency['year'] == y] for y in currency['year'].unique()]  # list of dataframes for each year
    c_name = currency.columns[1]
    df_dict = {}
    for frame in all_the_years:
        year_name = frame.iat[0, 4]  # the year for each df, becomes the "year" cell for the annual volatility df
        annual_volatility = frame["log_rate"].std() * 253 ** .5  # volatility measured by standard deviation * (253 trading days per year) ** 0.5
        df_dict[year_name] = annual_volatility
    df = pd.DataFrame.from_dict(df_dict, orient="index", columns=[c_name + "_annual_vol"])  # indexing on year, not sure if this is cool
    return df
br_vol = volatizer(brazilian_real)
which returns a df with a row for each year and its annual volatility. Then I want to concatenate them and use that for more charts, and ultimately make a little dashboard that lets you switch between weekly, monthly, and annual views and maybe set date limits.
So maybe there's some cool way to run those functions on the original df or on the lists of dfs that I don't know about. I have started using df.map and df.apply some.
But it seems to me it would be pretty handy to be able to unpack the one list using the names from the other. Basically same question, how do I get the dataframes in df_list out and attached to variable names?
Sorry if this is waaaay too long or a really bad way to do this. Thanks ahead of time!
Do you want something like this?
dfs = {df.columns[1]: df for df in df_list}
Then you can reference them like this for example:
dfs['brazilian_real']
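If you would rather pair up the two lists you already have (the "zip col_names and df_list together" idea), a sketch that assumes both lists are the same length and in the same order:
dfs = dict(zip(col_names, df_list))  # currency name -> dataframe
brazilian_real = dfs['brazilian_real']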
This is how I took the approach suggested by Kelvin:
def volatizer(currency):
    annual_df_list = [currency[currency['year'] == y] for y in currency['year'].unique()]  # list of annual dfs
    c_name = currency.columns[1]
    row_dict = {}  # dictionary with year:annual_volatility as key:value
    for frame in annual_df_list:
        year_name = frame.iat[0, 4]  # first cell of the "year" column, becomes the "year" key for row_dict
        annual_volatility = frame["log_rate"].std() * 253 ** .5  # volatility measured by standard deviation * (253 trading days per year) ** 0.5
        row_dict[year_name] = annual_volatility
    df = pd.DataFrame.from_dict(row_dict, orient="index", columns=[c_name + "_annual_vol"])  # new df from dictionary, indexed on year
    return df

# apply volatizer to each currency df
for key in df_dict:
    df_dict[key] = volatizer(df_dict[key])
It worked fine. I can use a list of strings to access any of the key:value pairs. It feels like a better way than trying to instantiate a bunch of new objects.
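To then concatenate the per-currency volatility frames for further charting, a sketch assuming every frame is indexed by year as above:
annual_vol = pd.concat(list(df_dict.values()), axis=1)  # one _annual_vol column per currency, rows aligned on year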

Selective summation of columns in a pandas dataframe

The COVID-19 tracking project (api described here) provides data on many aspects of the pandemic. Each row of the JSON is one day's data for one state. As many people know, the pandemic is hitting different states differently -- New York and its neighbors hardest first, with other states being hit later. Here is a subset of the data:
date,state,positive,negative
20200505,AK,371,22321
20200505,CA,56212,723690
20200505,NY,321192,707707
20200505,WY,596,10319
20200504,AK,370,21353
20200504,CA,54937,692937
20200504,NY,318953,688357
20200504,WY,586,9868
20200503,AK,368,21210
20200503,CA,53616,662135
20200503,NY,316415,669496
20200503,WY,579,9640
20200502,AK,365,21034
20200502,CA,52197,634606
20200502,NY,312977,646094
20200502,WY,566,9463
To get the entire data set I am doing this:
import pandas as pd
all_states = pd.read_json("https://covidtracking.com/api/v1/states/daily.json")
I would like to be able to summarize the data by adding up the values for one column, but only for certain states; and then adding up the same column, for the states not included before. I was able to do this, for instance:
not_NY = all_states[all_states['state'] != 'NY'].groupby(['date'], as_index = False).hospitalizedCurrently.sum()
This creates a new dataframe from all_states, grouped by date, and summing for all the states that are not "NY". What I want to do, though, is exclude multiple states with something like a "not in" function (this doesn't work):
not_tristate = all_states[all_states['state'] not in ['NY','NJ','CT']].groupby(['date'], as_index = False).hospitalizedCurrently.sum()
Is there a way to do that? An alternate approach I tried is to create a new dataframe as a pivot table, with one row per date, one column per state, like this:
pivot_states = all_states.pivot_table(index = 'gooddate', columns = 'state', values = 'hospitalizedCurrently', aggfunc='sum')
but this still leaves me with the problem of creating new columns by summing only some of the state columns. In SQL, I would solve the problem like this:
SELECT all_states.Date AS [Date], Sum(IIf([all_states]![state] In ("NY","NJ","CT"),[all_states]![hospitalizedCurrently],0)) AS tristate, Sum(IIf([all_states]![state] Not In ("NY","NJ","CT"),[all_states]![hospitalizedCurrently],0)) AS not_tristate
FROM all_states
GROUP BY all_states.Date
ORDER BY all_states.Date;
The end result I am looking for is like this (using the sample data above and summing on the 'positive' column, with 'NY' standing in for 'tristate'):
date,not_tristate,tristate,total
20200502,53128,312977,366105
20200503,54563,316415,370978
20200504,55893,318953,374846
20200505,57179,321192,378371
Any help would be welcome.
To get the expected output, you can use groupby on the date and on np.where of whether the state isin the states you want, sum the positive column, then unstack and assign to get the total column:
import numpy as np

df_f = all_states.groupby(['date',
                           np.where(all_states['state'].isin(["NY", "NJ", "CT"]),
                                    'tristate', 'not_tristate')])\
                 ['positive'].sum()\
                 .unstack()\
                 .assign(total=lambda x: x.sum(axis=1))
print(df_f)
          not_tristate  tristate   total
date
20200502         53128    312977  366105
20200503         54563    316415  370978
20200504         55893    318953  374846
20200505         57179    321192  378371
Or with pivot_table, you get a similar result with:
print(all_states.assign(state=np.where(all_states['state'].isin(["NY", "NJ", "CT"]),
                                       'tristate', 'not_tristate'))
                .pivot_table(index='date', columns='state', values='positive',
                             aggfunc='sum', margins=True))
state     not_tristate  tristate      All
date
20200502         53128    312977   366105
20200503         54563    316415   370978
20200504         55893    318953   374846
20200505         57179    321192   378371
All             220763   1269537  1490300
You can exclude multiple state values by using isin with a NOT (~) sign:
all_states[~(all_states['state'].isin(["NY", "NJ", "CT"]))]
So, your code would be:
not_tristate = all_states[~(all_states['state'].isin(['NY','NJ','CT']))].groupby(['date'], as_index = False).hospitalizedCurrently.sum()
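If you also need the tristate side and a combined table with this approach, a sketch following the same pattern (the merge suffixes here are illustrative):
tristate = all_states[all_states['state'].isin(['NY', 'NJ', 'CT'])].groupby(['date'], as_index=False).hospitalizedCurrently.sum()
combined = not_tristate.merge(tristate, on='date', suffixes=('_not_tristate', '_tristate'))  # one row per date, both sums side by side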

Given an index label, how would you extract the index position in a dataframe?

New to Python; I'm trying to take a csv and get the country that has the max number of gold medals. I can get the country name as type Index, but I need a string value for the submission.
The csv has countries as the row indices, and columns with stats.
ind = DataFrame.index.get_loc(index_result) doesn't work because it doesn't have a valid key.
If I run dataframe.loc[ind], it returns the entire row.
df = read_csv('csv', index_col=0, skiprows=1)

# for loop to get the most gold medals:
mostMedals = iterator
getIndex = df[df['medals'] == mostMedals].index  # check the medals column to see what country won that many
ind = dataframe.index.get_loc[getIndex]  # doesn't like the key
What I'm going for is to get the index position of getIndex, so I can run something like dataframe.index[getIndex] to get the string I need, but I can't figure out how to get that index position integer.
Expanding on my comments above, this is how I would approach it. There may be better/other ways, pandas is a pretty enormous library with lots of neat functionality that I don't know yet, either!
import pandas as pd

df = pd.read_csv('csv', index_col=0, skiprows=1)
max_medals = df['medals'].max()
countries = list(df.where(df['medals'] == max_medals).dropna().index)
Unpacking that expression: the where method returns a frame based on df that matches the condition expressed, dropna() removes any rows with NaN values, and index returns the remaining row index. Finally, I wrap that all in list, which isn't strictly necessary, but I prefer working with simple built-in types unless I have a greater need.
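For what it's worth, a shorter alternative, assuming the same df: Series.idxmax returns the index label of the maximum directly, though unlike the where approach it returns only the first label if there are ties:
country = df['medals'].idxmax()  # index label (country name) of the first row with the most medals
print(str(country))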

Python pandas: how to merge/join two tables based on substring?

Let's say I have two dataframes, and the column names for both are:
table 1 columns:
[ShipNumber, TrackNumber, Comment, ShipDate, Quantity, Weight]
table 2 columns:
[ShipNumber, TrackNumber, AmountReceived]
I want to merge the two tables when either 'ShipNumber' or 'TrackNumber' from table 2 can be found in 'Comment' from table 1.
Also, I'll explain why
merged = pd.merge(df1,df2,how='left',left_on='Comment',right_on='ShipNumber')
does not work in this case.
The "Comment" column is a block of text that can contain anything, so I cannot do an exact match like tab2.ShipNumber == tab1.Comment; tab2.ShipNumber or tab2.TrackNumber can appear as a substring inside tab1.Comment.
The desired output table should have all the unique columns from two tables:
output table column names:
[ShipNumber, TrackNumber, Comment, ShipDate, Quantity, Weight, AmountReceived]
I hope my question makes sense...
Any help is really really appreciated!
Note: the ultimate goal is to merge the two sets with (shipnumber == shipnumber | tracknumber == tracknumber | shipnumber in comments | tracknumber in comments), but I've created two subsets for the first two conditions, and now I'm working on the 3rd and 4th conditions.
Why not do something like:
Count = 0

def MergeFunction(rowElement):
    global Count
    df2_row = df2.iloc[Count]  # pairs each df1 row with the df2 row at the same position
    if (df2_row['ShipNumber'] in rowElement['Comment'] or
            df2_row['TrackNumber'] in rowElement['Comment']):
        rowElement['AmountReceived'] = df2_row['AmountReceived']
    Count += 1
    return rowElement

df1['AmountReceived'] = 0  # fill with zeros
new_df = df1.apply(MergeFunction, axis=1)
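An alternative sketch that does not rely on the two tables being row-aligned: build all row pairs with a cross join, then keep the pairs where either identifier appears in the comment. This assumes pandas >= 1.2 (for how='cross') and string-typed identifiers, and it materializes len(df1) * len(df2) rows, so it suits modest table sizes:
cross = df1.merge(df2, how='cross', suffixes=('', '_t2'))  # every df1 row paired with every df2 row
mask = cross.apply(lambda r: str(r['ShipNumber_t2']) in r['Comment']
                   or str(r['TrackNumber_t2']) in r['Comment'], axis=1)
matched = cross[mask]  # rows where df2's ShipNumber or TrackNumber occurs in df1's Comment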
Here is an example based on some made-up data. Ignore the complete nonsense I've put in the dataframes; I was just typing in random stuff to get a sample df to play with.
import pandas as pd

x = pd.DataFrame({'Location': ['Chicago', 'Houston', 'Los Angeles', 'Boston', 'NYC', 'blah'],
                  'Comments': ['chicago is winter', 'la is summer', 'boston is winter',
                               'dallas is spring', 'NYC is spring', 'seattle foo'],
                  'Dir': ['N', 'S', 'E', 'W', 'S', 'E']})

y = pd.DataFrame({'Location': ['Miami', 'Dallas'],
                  'Season': ['Spring', 'Fall']})
def findval(row):
    comment, location, season = map(lambda v: str(v).lower(), row)
    return location in comment or season in comment
merged = pd.concat([x,y])
merged['Helper'] = merged[['Comments','Location','Season']].apply(findval,axis=1)
print(merged)
filtered = merged[merged['Helper'] == True]
print(filtered)
Rather than joining, you can concatenate the dataframes and then create a helper column to see whether the string of one column is found in another. Once you have that helper column, just keep the Trues.
You could index the comments field using a library like Whoosh and then do a text search for each shipment number that you want to search by.
