I have a large-ish csv file that I want to split into separate data files based on the data in one of the columns, so that all related data can be analyzed together.
ie. [name, color, number, state;
bob, green, 21, TX;
joe, red, 33, TX;
sue, blue, 22, NY;
....]
I'd like to have it put each state's worth of data into its own data sub-file:
df[1] = [bob, green, 21, TX] [joe, red, 33, TX]
df[2] = [sue, blue, 22, NY]
Pandas seems like the best option for this, as the csv file given is about 500 lines long.
You could try something like:
import pandas as pd

for state, df in pd.read_csv("file.csv").groupby("state"):
    df.to_csv(f"file_{state}.csv", index=False)
Here file.csv is your base file. If it looks like
name,color,number,state
bob,green,21,TX
joe,red,33,TX
sue,blue,22,NY
the output would be 2 files:
file_TX.csv:
name,color,number,state
bob,green,21,TX
joe,red,33,TX
file_NY.csv:
name,color,number,state
sue,blue,22,NY
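If you'd rather keep the groups in memory as separate dataframes instead of writing files, a dict comprehension over the same groupby works. A minimal sketch, using an inline string to stand in for file.csv:

```python
import io
import pandas as pd

# sample data standing in for file.csv
csv_text = """name,color,number,state
bob,green,21,TX
joe,red,33,TX
sue,blue,22,NY
"""

# one dataframe per state, keyed by the state abbreviation
dfs = {state: g.reset_index(drop=True)
       for state, g in pd.read_csv(io.StringIO(csv_text)).groupby("state")}

print(dfs["TX"])  # the two TX rows
print(dfs["NY"])  # the single NY row
```

Each value in dfs is a regular dataframe, so you can analyze one state's data without touching the others.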
There are different methods for reading csv files; you can find an overview of them at the following link:
(https://www.analyticsvidhya.com/blog/2021/08/python-tutorial-working-with-csv-file-for-data-science/)
Since you want to work with a dataframe, using pandas is indeed a practical choice. To start, you may do:
import pandas as pd
df = pd.read_csv(r"file_path")
Now let's assume that after these lines you have the following dataframe:

  name  color  number  state
0  bob  green      21     TX
1  joe    red      33     TX
2  sue   blue      22     NY
...
From your question, I understand that you want to split the information up by state. The state data may be mixed (e.g. TX-NY-TX-DZ-TX), so sorting alphabetically and resetting the index may be a first step. Note that sort_values returns a new dataframe, so assign it back:

df = df.sort_values(by=['state'])
df.reset_index(drop=True, inplace=True)
Now, there are several methods we may use. From your question, I did not fully understand df[1] = two lists, df[2] = one list. I am assuming you meant a list of row-lists for each state. In that case, let's use the following method:
Method 1- Making List of Lists for Different States
First, let's get state list without duplicates:
s_list = list(dict.fromkeys(df.loc[:,"state"].tolist()))
Now we can build the nested structure with a list comprehension:
lol = [[df.iloc[i2,:].tolist() for i2 in range(df.shape[0]) \
if state==df.loc[i2,"state"]] for state in s_list]
The lol (list of lists) variable contains x inner lists, one per state. Each inner list holds that state's rows, themselves as lists. So you may reach a state by writing lol[0], lol[1], etc.
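The same list of lists can be built more directly with groupby. A sketch, assuming the sorted dataframe from above:

```python
import pandas as pd

df = pd.DataFrame({"name": ["bob", "joe", "sue"],
                   "color": ["green", "red", "blue"],
                   "number": [21, 33, 22],
                   "state": ["TX", "TX", "NY"]})
df = df.sort_values(by=["state"]).reset_index(drop=True)

# one inner list of row-lists per state, keeping the sorted state order
lol = [g.values.tolist() for _, g in df.groupby("state", sort=False)]

print(lol[0])  # NY rows come first after the alphabetical sort
```

This avoids the nested loop over every row for every state, which matters once the file grows.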
Method 2- Making Different Dataframes for Different States
In this method, if there are 20 states, we get 20 dataframes, which we can keep together in a list. First, we need the state names again:
s_list = list(dict.fromkeys(df.loc[:,"state"].tolist()))
We need to get the row index values (as a list of lists) for the different states. (For example, NY is in rows 3, 6, 7, ...)

r_index = [[i for i in range(df.shape[0]) \
            if df.loc[i,"state"]==state] for state in s_list]
Let's make different dataframes for different states: (and reset index)
dfs = [df.loc[rows,:] for rows in r_index]
for d in dfs:
    d.reset_index(drop=True, inplace=True)
Now you have a list which contains n (state number) of dataframes inside. After this point, you may sort dataframes for name for example.
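The same list of per-state dataframes can be produced in one line with groupby. A sketch, assuming the same sample columns:

```python
import pandas as pd

df = pd.DataFrame({"name": ["bob", "joe", "sue"],
                   "color": ["green", "red", "blue"],
                   "number": [21, 33, 22],
                   "state": ["TX", "TX", "NY"]})

# one dataframe per state, each with a fresh 0..n index;
# groupby sorts the state keys alphabetically by default
dfs = [g.reset_index(drop=True) for _, g in df.groupby("state")]
```

Note that groupby orders the groups alphabetically by key, whereas the r_index approach keeps the order of s_list.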
Method 3 - My Recommendation
Firstly, I would recommend splitting the data by name, since it is a good unique identifier. But I am assuming you need to use the state information. I would add the state column as an index and build a nested dictionary:
import pandas as pd
df = pd.read_csv(r"path")
df = df.sort_values(by=['state'])
df.reset_index(drop = True, inplace = True)
# we know state is in column 3
states = list(dict.fromkeys(df.iloc[:,3].tolist()))
rows = [[i for i in range(df.shape[0]) if df.iloc[i,3]==s] for s in states]
temp = [[i2 for i2 in range(len(rows[i]))] for i in range(len(rows))]
into = [inner for outer in temp for inner in outer]
df.insert(4, "No", into)
df.set_index(pd.MultiIndex.from_arrays([df.iloc[:,no] for no in [3,4]]),inplace=True)
df.drop(df.columns[[3,4]], axis=1, inplace=True)
dfs = [df.iloc[row,:] for row in rows]
for i in range(len(dfs)): dfs[i] = dfs[i]\
.melt(var_name="app",ignore_index=False).set_index("app",append=True)
def call(df):
    if df.index.nlevels == 1:
        return df.to_dict()[df.columns[0]]
    return {key: call(df_gr.droplevel(0, axis=0)) for key, df_gr in df.groupby(level=0)}

data = {}
for i in range(len(states)):
    data.update(call(dfs[i]))
I may have made some typos, but I hope the idea is clear.
This code gives a nested dictionary where the first key is the state (TX, NY, ...), the next key is the row number within that state (0, 1, 2, ...), and the last key is name, color, or number.
Looking back at the number column in the csv file: if it contains no duplicates, you could use it directly as the second index level and avoid creating the new "No" column.
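If all you need is the state → row → column lookup, a much shorter route (my sketch, not the exact multi-index structure above) is groupby combined with to_dict:

```python
import pandas as pd

df = pd.DataFrame({"name": ["bob", "joe", "sue"],
                   "color": ["green", "red", "blue"],
                   "number": [21, 33, 22],
                   "state": ["TX", "TX", "NY"]})

# state -> renumbered row -> {column: value}
data = {state: g.drop(columns="state").reset_index(drop=True).to_dict("index")
        for state, g in df.groupby("state")}

print(data["TX"][0]["name"])  # the first TX row's name
```

to_dict("index") produces an {index: {column: value}} mapping per state, which matches the three-level choice described above.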
I'm relatively new to python and am really having trouble working with lists.
I have a dataframe (df1) with a column for 'actors' with many actors in a string and I have a separate dataframe (df2) that lists actors that have received an award.
I want to add a column to df1 that will indicate whether an actor has received an award or not, so for example 1=award, 0=no award.
I am trying to use for loops but it is not iterating in the way I want.
In my example, only 'Betty' has an award, so the 'actors_with_awards' column should display a 0 for the first row and 1 for the second, but the result is a 1 for both rows.
I suspect this is because it's looking at the string in its entirety, for example "is 'Alexander, Ann' in the list" vs. "is 'Alexander' or 'Ann' in the list". I thought splitting the strings would solve this (maybe I did that step wrong?), so I'm not sure how to fix it.
My full code is below:
import pandas as pd
# Creating sample dataframes
df1 = pd.DataFrame()
df1['cast']=['Alexander, Ann','Bob, Bill, Benedict, Betty']
df2 = pd.DataFrame()
df2['awards']=['Betty']
# Creating lists of actors, and Splitting up the string
actor_split=[]
for x in df1['cast']:
    actor_split.append(x.split(','))

# Creating a list of actors who have received an award
award=[]
for x in df2['awards']:
    award.append(x)

# Attempting to create a list of actors in Df1 who have received an award
actors_with_awards = []
for item in actor_split:
    if x in item not in award:
        actors_with_awards.append(0)
    else:
        actors_with_awards.append(1)

df1['actors_with_awards']=actors_with_awards
df1
Current Output Df1:

cast                        actors_with_awards
Alexander, Ann              1
Bob, Bill, Benedict, Betty  1

Expected Output Df1:

cast                        actors_with_awards
Alexander, Ann              0
Bob, Bill, Benedict, Betty  1
When trying out your program, a couple of things popped up. First was your comparison of "x" to see if it was contained in the awards database.
for item in actor_split:
    if x in item not in award:
        actors_with_awards.append(0)
    else:
        actors_with_awards.append(1)
The issue here was that x still contains the value "Betty" left over from populating the award list; it is not the "x" value for each split actor list. The other issue was that leading and/or trailing spaces in the actor names were throwing off the membership checks against the award list.
With that in mind I made a few tweaks to your code to address those situations as follows in the code snippet.
import pandas as pd

# Creating sample dataframes
df1 = pd.DataFrame()
df1['cast']=['Alexander, Ann','Bob, Bill, Benedict, Betty']
df2 = pd.DataFrame()
df2['awards']=['Betty']

# Creating lists of actors, and splitting up the string
actor_split=[]
for x in df1['cast']:
    actor_split.append(x.split(','))

# Creating a list of actors who have received an award
award=[]
for x in df2['awards']:
    award.append(x.strip())  # Make sure no leading or trailing spaces exist for the subsequent test

# Creating a list of actors in df1 who have received an award
actors_with_awards = []
for item in actor_split:
    y = 0
    for x in item:  # Reworked this so that "x" is associated with the selected actor set
        if x.strip() in award:  # Again, make sure no leading or trailing spaces are in the comparison
            y = 1
            break  # stop once any awarded actor is found, so a later name can't reset y
    actors_with_awards.append(y)

df1['actors_with_awards']=actors_with_awards
print(df1)  # Changed this so as to print out the data to a terminal
To ensure that leading or trailing spaces would not trip up comparisons or list checks, I added the ".strip()" function where needed so that just the name value is stored and nothing more. Secondly, so that the proper name value was placed into variable "x", an additional for loop was added, along with a work variable to be populated with the proper "0" or "1" value. Adding those tweaks resulted in the following raw data output on the terminal:
cast actors_with_awards
0 Alexander, Ann 0
1 Bob, Bill, Benedict, Betty 1
You may want to give that a try. Please note that this may be just one way to address this issue.
One possible solution is to convert the award winners from df2 to a set, split the column df1['cast'], and check the intersection between each row's actor list and the set:
awards = set(df2["awards"].values)
df1["actors_with_awards"] = [
int(bool(awards.intersection(a)))
for a in df1["cast"].str.split(r"\s*,\s*", regex=True)
]
print(df1)
Prints:
cast actors_with_awards
0 Alexander, Ann 0
1 Bob, Bill, Benedict, Betty 1
I have searched and searched and not found what I would think was a common question. Which makes me think I'm going about this wrong. So I humbly ask these two versions of the same question.
I have a list of currency names, as strings. A short version would look like this:
col_names = ['australian_dollar', 'bulgarian_lev', 'brazilian_real']
I also have a list of dataframes (df_list). Each one has a column for date, currency exchange rate, etc. (I had a screenshot of the head of one of them here, but it was too blurry to be useful.)
I would be stoked to assign each one of those strings in col_names as a variable name for a dataframe in df_list. I did make a dictionary whose key/value pairs were currency name and the corresponding df, but I didn't really know how to use it, primarily because it was unordered. Is there a way to zip col_names and df_list together? I could also just unpack each df in df_list and make the title of its second column the name of the frame. That seems really cool.
So instead I just wrote something that gave me index numbers and then hand put them into the function I needed. Super kludgy but I want to make the overall project work for now. I end up with this in my figure code:
for ax, currency in zip((ax1, ax2, ax3, ax4), (df_list[38], df_list[19], df_list[10], df_list[0])):
ax.plot(currency["date"], currency["rolling_mean_30"])
And that's OK. I'm learning, not delivering something to a client. I can use it to make eight line plots. But I want to do this with 40 frames so I can get the annual or monthly volatility. I have to take a list of data frames and unpack them by hand.
Here is the second version of my question. Take df_list and:
def framer(currency):
    index = col_names.index(currency)
    df = df_list[index]  # a dataframe containing a single currency and the columns built in cell 3
    return df
brazilian_real = framer("brazilian_real")
Which unpacks a df (but only if I type out the name), and then:
def volatizer(currency):
    all_the_years = [currency[currency['year'] == y] for y in currency['year'].unique()]  # list of dataframes for each year
    c_name = currency.columns[1]
    df_dict = {}
    for frame in all_the_years:
        year_name = frame.iat[0,4]  # the year for each df, becomes the "year" cell for the annual volatility df
        annual_volatility = frame["log_rate"].std()*253**.5  # volatility measured by standard deviation * 253 trading days per year raised to the 0.5 power
        df_dict[year_name] = annual_volatility
    df = pd.DataFrame.from_dict(df_dict, orient="index", columns=[c_name+"_annual_vol"])  # indexing on year, not sure if this is cool
    return df
br_vol = volatizer(brazilian_real)
which returns a df with a row for each year and annual volatility. Then I want to concatenate them and use that for more charts. Ultimately make a little dashboard that lets you switch between weekly, monthly, annual and maybe set date lims.
So maybe there's some cool way to run those functions on the original df or on the lists of dfs that I don't know about. I have started using df.map and df.apply some.
But it seems to me it would be pretty handy to be able to unpack the one list using the names from the other. Basically same question, how do I get the dataframes in df_list out and attached to variable names?
Sorry if this is waaaay too long or a really bad way to do this. Thanks ahead of time!
Do you want something like this?
dfs = {df.columns[1]: df for df in df_list}
Then you can reference them like this for example:
dfs['brazilian_real']
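A small sketch of the idea, using two made-up frames in place of the real df_list (the column names here are assumptions based on the question):

```python
import pandas as pd

# stand-ins for the real currency dataframes: column 1 names the currency
df_list = [pd.DataFrame({"date": ["2020-01-01"], "australian_dollar": [0.70]}),
           pd.DataFrame({"date": ["2020-01-01"], "brazilian_real": [0.25]})]

# key each frame by the name of its second column
dfs = {df.columns[1]: df for df in df_list}

print(dfs["brazilian_real"])
```

Since Python 3.7 regular dicts preserve insertion order, so the earlier worry about an "unordered" dictionary largely goes away.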
This is how I took the approach suggested by Kelvin:
def volatizer(currency):
    annual_df_list = [currency[currency['year'] == y] for y in currency['year'].unique()]  # list of annual dfs
    c_name = currency.columns[1]
    row_dict = {}  # dictionary with year:annual_volatility as key:value
    for frame in annual_df_list:
        year_name = frame.iat[0,4]  # first cell of the "year" column, becomes the "year" key for row_dict
        annual_volatility = frame["log_rate"].std()*253**.5  # volatility measured by standard deviation * 253 trading days per year raised to the 0.5 power
        row_dict[year_name] = annual_volatility
    df = pd.DataFrame.from_dict(row_dict, orient="index", columns=[c_name+"_annual_vol"])  # new df from the dictionary, indexing on year
    return df

# apply volatizer to each currency df
for key in df_dict:
    df_dict[key] = volatizer(df_dict[key])
It worked fine. I can use a list of strings to access any of the key:value pairs. It feels like a better way than trying to instantiate a bunch of new objects.
New to python, trying to take a csv and get the country that has the max number of gold medals. I can get the country name as a type Index but need a string value for the submission.
csv has rows of countries as the indices, and columns with stats.
ind = DataFrame.index.get_loc(index_result) doesn't work because it doesn't have a valid key.
If I run dataframe.loc[ind], it returns the entire row.
df = read_csv('csv', index_col=0,skiprows=1)
# for loop to get the most gold medals:
mostMedals = iterator
getIndex = (df[df['medals'] == mostMedals]).index  # check the medals column
# for the mostMedals cell to see which country won that many
ind = dataframe.index.get_loc(getIndex)  # doesn't like the key
What I'm going for is to get the index position of the getIndex so I can run something like dataframe.index[getIndex] and that will give me the string I need but I can't figure out how to get that index position integer.
Expanding on my comments above, this is how I would approach it. There may be better/other ways, pandas is a pretty enormous library with lots of neat functionality that I don't know yet, either!
df = read_csv('csv', index_col=0,skiprows=1)
max_medals = df['medals'].max()
countries = list(df.where(df['medals'] == max_medals).dropna().index)
Unpacking that expression, the where method returns a frame based on df that matches the condition expressed. dropna() tells us to remove any rows that are NaN values, and index returns the remaining row index. Finally, I wrap that all in list, which isn't strictly necessary but I prefer working with simple built-in types unless I have a greater need.
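If a single top country is enough, idxmax gives you the index label of the maximum row directly as a plain value, with no need to recover an integer position first. A sketch with made-up medal counts:

```python
import pandas as pd

# stand-in for the csv: countries as the index, medal counts as a column
df = pd.DataFrame({"medals": [10, 46, 27]},
                  index=["Kenya", "USA", "France"])

top = df["medals"].idxmax()  # index label of the row with the max value
print(top)
```

Since the index holds country names, the result is already the string the submission needs. The where/dropna approach above is still the one to use if ties should return every top country.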
I am reading in data from an excel file. And currently I am breaking down it several different DFs based on the row numbers.
What I want to do is create a loop which will iterate over the imputed row numbers and create different Dfs with the appropriate suffixes.
Currently I am creating separate Dfs by passing in row numbers in each line.
NHE_17 = NHE_data.parse('NHE17')

# Slice DataFrame for only Total National Health Expenditure data,
# from row 0 to 37 (Population): total_nhe
total_nhe = NHE_17.iloc[0:37]
print(total_nhe.iloc[0,-1])

# Slice DataFrame for only Health Consumption Expenditures, from row 38
# to 70 (Total CMS Programs (Medicaid, CHIP and Medicare)): total_hce
total_hce = NHE_17.iloc[38:70]
I want to be able call the function with the row numbers and suffix to create the specific DF.
that function would look like:
def row_slicer(slice_tuple):
    # Slice NHE_17 according to the given slice parameters
    # Input: slice_tuple = [x1, x2]
    df_temp = NHE_17.iloc[slice_tuple[0]:slice_tuple[1]]
    return df_temp

dict_dataframes = {}
# assuming this is a dictionary, else you can zip lists with pandas columns
name_list_row = [['total_nhe',[0,37]], ['total_hce',[38,70]], ...]

for name, slice_tuple in name_list_row:
    df = row_slicer(slice_tuple)
    dict_dataframes[name] = df
Hope this helps!
Let's say I have two dataframes, and the column names for both are:
table 1 columns:
[ShipNumber, TrackNumber, Comment, ShipDate, Quantity, Weight]
table 2 columns:
[ShipNumber, TrackNumber, AmountReceived]
I want to merge the two tables when either 'ShipNumber' or 'TrackNumber' from table 2 can be found in 'Comment' from table 1.
Also, I'll explain why
merged = pd.merge(df1,df2,how='left',left_on='Comment',right_on='ShipNumber')
does not work in this case.
"Comment" column is a block of texts that can contain anything, so I cannot do an exact match like tab2.ShipNumber == tab1.Comment, because tab2.ShipNumber or tab2.TrackNumber can be found as a substring in tab1.Comment.
The desired output table should have all the unique columns from two tables:
output table column names:
[ShipNumber, TrackNumber, Comment, ShipDate, Quantity, Weight, AmountReceived]
I hope my question makes sense...
Any help is really really appreciated!
note
The ultimate goal is to merge two sets with (shipnumber==shipnumber |tracknumber == tracknumber | shipnumber in comments | tracknumber in comments), but I've created two subsets for the first two conditions, and now I'm working on the 3rd and 4th conditions.
Why not do something like:

Count = 0
def MergeFunction(rowElement):
    global Count
    df2_row = df2.iloc[Count]
    if (df2_row['ShipNumber'] in rowElement['Comment']
            or df2_row['TrackNumber'] in rowElement['Comment']):
        rowElement['Amount'] = df2_row['Amount']
    Count += 1
    return rowElement

df1['Amount'] = 0  # fill with zeros first
new_df = df1.apply(MergeFunction, axis=1)

Note this walks df1 and df2 in lockstep, so it only compares row i of df1 against row i of df2.
Here is an example based on some made up data. Ignore the complete nonsense I've put in the dataframes, I was just typing in random stuff to get a sample df to play with.
import pandas as pd
import re
x = pd.DataFrame({'Location': ['Chicago','Houston','Los Angeles','Boston','NYC','blah'],
                  'Comments': ['chicago is winter','la is summer','boston is winter','dallas is spring','NYC is spring','seattle foo'],
                  'Dir': ['N','S','E','W','S','E']})
y = pd.DataFrame({'Location': ['Miami','Dallas'],
                  'Season': ['Spring','Fall']})

def findval(row):
    comment, location, season = map(lambda x: str(x).lower(), row)
    return location in comment or season in comment

merged = pd.concat([x,y])
merged['Helper'] = merged[['Comments','Location','Season']].apply(findval, axis=1)
print(merged)

filtered = merged[merged['Helper'] == True]
print(filtered)
Rather than joining, you can concatenate the dataframes and then create a helper column recording whether the string of one column is found in another. Once you have that helper column, just keep the rows where it is True.
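If the goal is to pull AmountReceived across on the substring match rather than just flag rows, one sketch (with made-up data, and a simple first-match rule that I am assuming is acceptable) is to scan df2 for each comment:

```python
import pandas as pd

df1 = pd.DataFrame({"ShipNumber": ["S1", "S2"],
                    "Comment": ["ref S9 in text", "track T2 arrived"]})
df2 = pd.DataFrame({"ShipNumber": ["S9", "S5"],
                    "TrackNumber": ["T7", "T2"],
                    "AmountReceived": [100.0, 250.0]})

def match_amount(comment):
    # return the first df2 row whose ShipNumber or TrackNumber
    # appears as a substring of the comment
    for _, r in df2.iterrows():
        if r["ShipNumber"] in comment or r["TrackNumber"] in comment:
            return r["AmountReceived"]
    return None

df1["AmountReceived"] = df1["Comment"].apply(match_amount)
print(df1)
```

This is O(len(df1) * len(df2)), which is fine for modest tables; for large ones an indexed text search (as suggested below) scales better.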
You could index the comments field using a library like Whoosh and then do a text search for each shipment number that you want to search by.