apply function in pandas to create two columns - python

I have a pandas DataFrame called ebola, as seen below. The variable column holds two pieces of information: the status (either Cases or Deaths) and the country name. I am trying to create two new columns, status and country, from the variable column by using the .apply() function. However, since I am trying to extract two values, it does not work.
# let's create a splitter function
def splitter(column):
    status, country = column.split("_")
    return status, country
# apply this function to that column and assign to two new columns
ebola[['status', 'country']] = ebola['variable'].apply(splitter)
The error I get is
ValueError: Must have equal len keys and value when setting with an iterable
I want my output to be like this

Use Series.str.split
ebola[['status','country']]=ebola['variable'].str.split(pat='_',expand=True)
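As a small sketch of the same call, the n parameter of Series.str.split limits the number of splits, which matters if a country name could itself contain an underscore. The value 'Cases_Sierra_Leone' below is a hypothetical example and is not taken from the question's data.

```python
import pandas as pd

# Hypothetical sample; 'Cases_Sierra_Leone' is made up to show the effect of n=1
ebola = pd.DataFrame(
    {"variable": ["Cases_Guinea", "Deaths_Liberia", "Cases_Sierra_Leone"]}
)

# n=1 splits only on the first underscore, so exactly two columns come back
ebola[["status", "country"]] = ebola["variable"].str.split(
    pat="_", n=1, expand=True
)
print(ebola)
```

Without n=1, the third row would split into three parts and expand=True would produce three columns, which no longer match the two target column names.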

This is a very late post to the original question. Thanks to @ansev, whose solution worked out great. While going back through my question, I tried to develop a solution based on my first approach. I was able to work it out, and I wanted to share it for anyone who might want to see a different perspective on this.
Update to my code:
# let's create a splitter function
def splitter(column):
    for row in column:
        status, country = row.split("_")
    return status, country
# apply this function to that column and assign to two new columns
ebola[['status', 'country']] = ebola['variable'].to_frame().apply(splitter, axis=1, result_type='expand')
I made two changes to my code so that it would work:
Instead of working on the Series, I converted it to a DataFrame using the .to_frame() method.
In my splitter function, I had to iterate over the values in each row, because with axis=1 the function receives a whole row (a Series) rather than a single string. Therefore, I added the for row in column line.
To replicate all of this:
import numpy as np
import pandas as pd
# create the data
ebola_dict = {'Date': ['3/24/2014', '3/22/2014', '1/15/2015', '1/4/2015'],
              'variable': ['Cases_Guinea', 'Cases_Guinea', 'Cases_Liberia', 'Cases_Liberia']}
ebola = pd.DataFrame(ebola_dict)
print(ebola)
# let's create a splitter function
def splitter(column):
    for row in column:
        status, country = row.split("_")
    return status, country
# apply this function to that column and assign to two new columns
ebola[['status', 'country']] = ebola['variable'].to_frame().apply(splitter, axis=1, result_type='expand')
# check if it worked
print(ebola)
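For anyone who would rather keep the original one-string splitter, here is a minimal sketch of another route: let .apply() return a Series of tuples, turn those tuples into a two-column DataFrame, and join it back. The names pairs and parts are introduced here for illustration only.

```python
import pandas as pd

ebola = pd.DataFrame({
    "Date": ["3/24/2014", "3/22/2014", "1/15/2015", "1/4/2015"],
    "variable": ["Cases_Guinea", "Cases_Guinea", "Cases_Liberia", "Cases_Liberia"],
})

def splitter(value):
    # value is a single string such as "Cases_Guinea"
    status, country = value.split("_")
    return status, country

# Series.apply returns one tuple per row (a Series of tuples)
pairs = ebola["variable"].apply(splitter)

# Expand the tuples into two named columns, keeping the original index
parts = pd.DataFrame(pairs.tolist(), index=ebola.index,
                     columns=["status", "country"])

ebola = ebola.join(parts)
print(ebola)
```

This avoids the .to_frame() step and the loop inside the function, though Series.str.split is still the shorter option for a plain delimiter.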

Related

Using .str.contains to filter a df of FRED series

I am trying to download a data series for each state from the FRED API. I have loaded all the data series containing 'Housing Inventory: Active Listing Count state' into a df, but there are still over 1000 rows. Is there a way I can search the title of each series to see if it contains the name of a state?
I have tried
df=df.loc[df['title'].str.contains(["Alaska","Alabama",...,"Wyoming"])]
Series ID = ACTLISCOU
Assuming you have a list with all the states, you can define a custom function to filter your title column and use it calling pd.Series.apply:
state_list = ["Alaska","Alabama",...,"Wyoming"]
def my_filter(value):
    # return True if any state is in the value
    return any(state in value for state in state_list)
# Call apply to filter DF based on True|False by your filter
df_filtered = df[df['title'].apply(my_filter)]
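The original attempt fails because Series.str.contains expects a single pattern string, not a list. A possible alternative to the custom function, sketched here under the assumption that plain substring matching is enough, is to join the state names into one regular expression. The shortened state list and the sample titles below are hypothetical.

```python
import re
import pandas as pd

# Hypothetical, shortened state list and made-up titles for illustration
state_list = ["Alaska", "Alabama", "Wyoming"]
df = pd.DataFrame({"title": [
    "Housing Inventory: Active Listing Count in Alaska",
    "Housing Inventory: Active Listing Count in Anchorage, AK (CBSA)",
    "Housing Inventory: Active Listing Count in Wyoming",
]})

# Build "Alaska|Alabama|Wyoming"; re.escape guards against special characters
pattern = "|".join(re.escape(state) for state in state_list)

# na=False treats missing titles as non-matches instead of propagating NaN
df_filtered = df[df["title"].str.contains(pattern, regex=True, na=False)]
print(df_filtered)
```

Note that substring matching would also keep a title that merely contains a state name inside a longer word or place name, so the pattern may need tightening for real data.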
The following code returns the state contained in the ACTLISCOUXX dataset, in this case California:
df = pd.read_csv('ACTLISCOUCA.csv',sep=';',header=None)
us_country_list=["Arizona","California","Oregon"]
country=[i for i in us_country_list if i in df.dropna().iloc[0][1]][0]
print(country)
How it works
The CSV file is imported as a pandas DataFrame.
A list comprehension builds a list of the states involved by checking each name in us_country_list against the second column of the first row in which both columns are populated (df.dropna().iloc[0][1]). This list should contain only one element if only one state is mentioned. Only its first element is saved in the country variable.

pandas: return mutated column into original dataframe

I've attempted to search the forum for this question, but I believe I may not be asking it correctly. So here it goes.
I have a large data set with many columns. Originally, I needed to sum, for each row, all columns whose names match a pattern, grouped by multiple variables. I was able to do so via:
cols = data.filter(regex=r'_name$').columns
data['sum'] = data.groupby(['id','group'],as_index=False)[cols].sum().assign(sum = lambda x: x.sum(axis=1))
By running this code, I receive a modified dataframe grouped by my two factor variables (group and id), with all the columns and the final sum column I need. However, I now want to put only the final sum column back into the original dataframe. The above code returns the entire modified dataframe into my sum column. I know this is achievable in R by simply adding a .$sum at the end of a piped expression. Any ideas on how to get this in pandas?
My hoped-for output is just the addition of the final "sum" variable from the above lines of code to my original dataframe.
Edit: To clarify, the code above returns this entire dataframe:
All I want returned is the column in yellow
Is this what you need?
data['sum'] = data.groupby(['id','group'])[cols].transform('sum').sum(axis = 1)
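To see why transform fits here, the sketch below runs that line on a small made-up frame. transform('sum') returns a result with the same index and length as the original data, so the row-wise total can be assigned straight back as a column. The column names x_name and y_name and all values are hypothetical.

```python
import pandas as pd

# Hypothetical data; only the "_name" suffix mirrors the question
data = pd.DataFrame({
    "id": [1, 1, 2, 2],
    "group": ["a", "a", "a", "b"],
    "x_name": [1, 2, 5, 4],
    "y_name": [10, 20, 30, 40],
})

cols = data.filter(regex=r"_name$").columns

# Per-group column sums, broadcast back to every row of the group
group_sums = data.groupby(["id", "group"])[cols].transform("sum")

# Add the per-group sums across columns to get one total per row
data["sum"] = group_sums.sum(axis=1)
print(data)
```

Rows that share the same id and group receive the same total, and the original row count is preserved.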

Return Name of first record df.iloc[0] using Python numpy

I am new to Python and am trying to create a function that returns the Name of a record in a dataset using numpy. I have sorted the data in descending order based on the column 'silver medals' to retrieve the record that I would like to return. I'm sure there is an easier way to get the top value, but I'm new to this and am trying to learn one step at a time.
df.sort_values(by=['silver medals'], inplace =True, ascending=False)
When I use
df.iloc[0]
to return the record details, I can see that at the bottom of the information it says:
Detail.....
Name: Country Name, dtype: object
I can use the below to return the abbreviated country name
df['ID'].iloc[0]
However, I am trying to return the full name. I believe the column that has the full name in it is at index 0 and does not have any header, so I'm not sure how to reference the column.
I have tried the following, but none of them seems to work. What am I doing incorrectly? Any help would be appreciated.
df[0].iloc[0]
df[''].iloc[0]
df[' '].iloc[0]
Your index contains the country names.
The index cannot be accessed like a column because it's not a column.
You can put the index back into the columns by doing: df = df.reset_index()
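As a minimal sketch of both options, the example below reads the label directly from the index and then shows the reset_index route. The country names, IDs, and medal counts are made up, and the column name Country is a hypothetical choice.

```python
import pandas as pd

# Hypothetical data: full country names live in the index, not in a column
df = pd.DataFrame(
    {"ID": ["USA", "NOR"], "silver medals": [5, 9]},
    index=["United States", "Norway"],
)
df = df.sort_values(by="silver medals", ascending=False)

# Option 1: read the label of the first row straight from the index
top_name = df.index[0]

# The same label is what shows up as "Name: ..." when printing df.iloc[0]
same_name = df.iloc[0].name

# Option 2: turn the index into a regular column and access it by name;
# an unnamed index becomes a column called "index" after reset_index()
flat = df.reset_index().rename(columns={"index": "Country"})
print(top_name, flat["Country"].iloc[0])
```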

I have multiple columns in csv. How do I match row values to a reference column using python?

I have a csv file with 367 columns. The first column has 15 unique values, and each subsequent column has some subset of those 15 values. No unique value is ever found more than once in a column. Each column is sorted. How do I get the rows to line up? My end goal is to make a presence/absence heat map, but I need to get the data matrix in the right format first, which I am struggling with.
Here is a small example of the type of data I have:
1,2,1,2
2,3,2,5
3,4,3,
4,,5,
5,,,
I need the rows to match the reference but stay in the same column like so:
1,,1,
2,2,2,2
3,3,3,
4,4,,
5,,5,5
My thought was to use the pandas library, but I could not figure out how to approach this problem, as I am very new to using python. I am using python2.7.
So your problem is definitely solvable via pandas:
Code:
# Create the sample data into a data frame
import pandas as pd
from io import StringIO
df = pd.read_csv(StringIO(u"""
1,2,1,2
2,3,2,5
3,4,3,
4,,5,
5,,,"""), header=None, skip_blank_lines=1).fillna(0)
for column in df:
    df[column] = pd.to_numeric(df[column], downcast='integer')
# set the first column as an index
df = df.set_index([0])
# create a frame which we will build up
results = pd.DataFrame(index=df.index)
# add each column to the dataframe indicating if the desired value is present
for col in df.columns:
    results[col] = df.index.isin(df[col])
# output the dataframe in the desired format
for idx, row in results.iterrows():
    result = '%s,%s' % (idx, ','.join(str(idx) if x else ''
                                      for x in row.values))
    print(result)
Results:
1,,1,
2,2,2,2
3,3,3,
4,4,,
5,,5,5
How does it work?:
Pandas can be a little daunting when first approached, even for someone who knows Python well, so I will try to walk through this. I encourage you to do what you need to get over the learning curve, because pandas is ridiculously powerful for this sort of data manipulation.
Get the data into a frame:
This first bit of code does nothing but get your sample data into a pandas.DataFrame. Your data format was not specified, so I will assume that you can get it into a frame or, if you cannot, that you will ask another question here on SO about getting the data into a frame.
import pandas as pd
from io import StringIO
df = pd.read_csv(StringIO(u"""
1,2,1,2
2,3,2,5
3,4,3,
4,,5,
5,,,"""), header=None, skip_blank_lines=1).fillna(0)
for column in df:
    df[column] = pd.to_numeric(df[column], downcast='integer')
# set the first column as an index
df = df.set_index([0])
Build a result frame:
Start with a result frame that is just the index
# create a frame which we will build up
results = pd.DataFrame(index=df.index)
For each column in the source data, see if the value is in the index
# add each column to the dataframe indicating if the desired value is present
for col in df.columns:
    results[col] = df.index.isin(df[col])
That's it, with three lines of code, we have calculated our results.
Output the results:
Now iterate through each row, which contains booleans, and output the values in the desired format (as ints)
# output the dataframe in the desired format
for idx, row in results.iterrows():
    result = '%s,%s' % (idx, ','.join(str(idx) if x else ''
                                      for x in row.values))
    print(result)
This outputs the index value first, and then for each True value outputs the index again, and for False values outputs an empty string.
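Since the stated end goal is a presence/absence heat map, a 0/1 matrix may be more convenient than the printed text. The following is a sketch of one way to get there from the same sample data, written for Python 3 (the question mentions Python 2.7, so this is an assumption); the names reference, presence, and matrix are introduced here for illustration.

```python
import pandas as pd
from io import StringIO

# Same sample data as above; blank cells are read as NaN
raw = pd.read_csv(StringIO("1,2,1,2\n2,3,2,5\n3,4,3,\n4,,5,\n5,,,"),
                  header=None)

# The first column holds the reference values
reference = raw[0]

# For every other column, mark which reference values occur in it
presence = pd.DataFrame(
    {col: reference.isin(raw[col].dropna()) for col in raw.columns[1:]}
)
presence.index = pd.Index(reference, name="value")

# Convert booleans to 0/1, a shape most heat map functions accept
matrix = presence.astype(int)
print(matrix)
```

Each row of matrix corresponds to one reference value and each column to one of the original data columns, so it lines up with the printed results above.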
Postscript:
There are quite a few people here on SO who are way better at pandas than I am, but since you did not tag your question with the pandas keyword, they likely did not notice it. That allowed me to take my cut at answering before they did. The pandas tag is very well covered for well-formed questions, so I am pretty sure that if this answer is not optimal, someone else will come by and improve it. In the future, be sure to tag your question with pandas to get the best response.
Also, you mentioned that you are new to Python, so I will just put in a plug to make sure that you are using a good IDE. I use PyCharm, and it and other good IDEs can make working in Python even more productive, so I highly recommend them.

pandas str.lower() not working on dataframe column

I'm working with the Titanic dataset available from Kaggle. I have it in a dataframe and I want to change the case of the "sex" column to lowercase. I'm using the following code:
import pandas as pd
df = pd.read_csv('titanic.csv')
print(df['sex'].unique())
df.sex.str.lower()
# check if it worked
print(df['sex'].unique())
and also trying
df['sex'].str.lower()
But when I run df['sex'].unique(), I get three unique values: [male, female, Female].
Why does my code not lowercase the strings and save them back to the dataframe, so that I get [male, female] from the .unique method?
str.lower() does not modify the existing column. It just returns a new Series with the lowercase transform applied. If you want to overwrite the original column, you need to assign the result back to it:
df['sex'] = df.sex.str.lower()
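A quick way to confirm this behaviour is sketched below on a tiny stand-in frame. The three values mirror the ones reported in the question rather than the actual Titanic file.

```python
import pandas as pd

# Stand-in for the Titanic data, using the values reported in the question
df = pd.DataFrame({"sex": ["male", "female", "Female"]})

# str.lower() builds a new Series; df itself is left untouched
lowered = df["sex"].str.lower()
before = df["sex"].unique().tolist()

# Assigning the result back is what actually changes the column
df["sex"] = lowered
after = df["sex"].unique().tolist()

print(before)
print(after)
```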
