Is there a way to rename multiple df columns in Python? - python

I'm trying to rename multiple columns in a dataframe to certain dates with Python.
Currently, the columns are as such: 2016-04, 2016-05, 2016-06....
I would like the columns to read: April2016, May2016, June2016...
There are around 40 columns. I am guessing a for loop would be the most efficient way to do this, but I'm relatively new to Python and not sure how to concatenate the column names correctly.

You can use loops or comprehensions along with a month dictionary to split, reorder, and replace the string column names
#in your case this would be cols=df.columns
cols=['2016-04', '2016-05', '2016-06']
rpl={'04':'April','05':'May','06':'June'}
cols=[\
''.join(\
[rpl[i.split('-')[1]],
i.split('-')[0]]) \
for i in cols]
cols
['April2016', 'May2016', 'June2016']
#then you would assign it back with df.columns = cols

You didn't share your dataframe so I used basic dataframe to explain how to get month is given date.I supposed your dataframe likes:
d = {'dates': ['2016-04', '2016-05','2016-06']} #just 3 of them
so all code :
import datetime
import pandas as pd
d = {'dates': ['2016-04', '2016-05','2016-06']}
df = pd.DataFrame(d)
for index, row in df.iterrows():
get_date= row['dates'].split('-')
get_month = get_date[1]
month = datetime.date(1900, int(get_month), 1).strftime('%B')
print (month+get_date[0])
OUTPUT :
2016April
2016May
2016June

Related

Best way to move an unexpected column in a Pandas DF to a new DF?

Wondering what the best way to tackle this issue is. If I have a DF with the following columns
df1()
type_of_fruit name_of_fruit price
..... ..... .....
and a list called
expected_cols = ['name_of_fruit','price']
whats the best way to automate the check of df1 against the expected_cols list? I was trying something like
df_cols=df1.columns.values.tolist()
if df_cols != expected_cols:
And then try to drop to another df any columns not in expected_cols, but this doesn't seem like a great idea to me. Is there a way to save the "dropped" columns?
df2 = df1.drop(columns=expected_cols)
But then this seems problematic depending on column ordering, and also in cases where the columns could have either more values than expected, or less values than expected. In cases where there are less values than expected (ie the df1 only contains the column name_of_fruit) I'm planning on using
df1.reindex(columns=expected_cols)
But a bit iffy on how to do this programatically, and then how to handle the issue where there are more columns than expected.
You can use set difference using -:
Assuming df1 having cols:
In [542]: df1_cols = df1.columns # ['type_of_fruit', 'name_of_fruit', 'price']
In [539]: expected_cols = ['name_of_fruit','price']
In [541]: unwanted_cols = list(set(d1_cols) - set(expected_cols))
In [542]: df2 = df1[unwanted_cols]
In [543]: df1.drop(unwanted_cols, 1, inplace=True)
Use groupby along the columns axis to split the DataFrame succinctly. In this case, check whether the columns are in your list to form the grouper, and you can store the results in a dict where the True key gets the DataFrame with the subset of columns in the list and the False key has the subset of columns not in the list.
Sample Data
import pandas as pd
df = pd.DataFrame(data = [[1,2,3]],
columns=['type_of_fruit', 'name_of_fruit', 'price'])
expected_cols = ['name_of_fruit','price']
Code
d = dict(tuple(df.groupby(df.columns.isin(expected_cols), axis=1)))
# If you need to ensure columns are always there then do
#d[True] = d[True].reindex(expected_cols)
d[True]
# name_of_fruit price
#0 2 3
d[False]
# type_of_fruit
#0 1

Creating a table for time series analysis including counts

I got a dataframe on which I would like to perform some analysis. An easy example of what I would like to achieve is, having the dataframe:
data = ['2017-02-13', '2017-02-13', '2017-02-13', '2017-02-15', '2017-02-16']
df = pd.DataFrame(data = data, columns = ['date'])
I would like to create a new dataframe from this. The new dataframe should contain 2 columns, the entire date span. So it should also include 2017-02-14 and the number of times each date appears in the original data.
I managed to construct a dataframe that includes all the dates as so:
dates = pd.to_datetime(df['date'], format = "%Y-%m-%d")
dateRange = pd.date_range(start = dates.min(), end = dates.max()).tolist()
df2 = pd.DataFrame(data = datumRange, columns = ['datum'])
My question is, how would I add the counts of each date from df to df 2? I've been messing around trying to write my own functions but have not managed to achieve it. I am assuming this needs to be done more often and that I am thinking to difficult...
Try this:
df2['counts'] = df2['datum'].map(pd.to_datetime(df['date']).value_counts()).fillna(0).astype(int)

How do I filter out multiple columns witha certain string in Python

I'm new to python and especially to pandas so I don't really know what I'm doing. I have 10 columns with 100000 rows and 4 letter strings. I need to filter out rows which don't contain 'DDD' in all of the columns/rows.
I tried to do it with iloc and loc, but it doesn't work:
import pandas as pd
df = pd.read_csv("data_3.csv", delimiter = '!')
df.iloc[:,10:20].str.contains('DDD', regex= False, na = False)
df.head()
It returns me an error: 'DataFrame' object has no attribute 'str'
I suggest doing it without a for loop like this:
df[df.apply(lambda x: x.str.contains('DDD')).all(axis=1)]
To select only string columns
df[df.select_dtypes(include='object').apply(lambda x: x.str.contains('DDD')).all(axis=1)]
To select only some string columns
selected_cols = ['A','B']
df[df[selected_cols].apply(lambda x: x.str.contains('DDD')).all(axis=1)]
You can do this but if your all column type is StringType:
for column in foo.columns:
df = df[~df[c].str.contains('DDD')]
You can use str.contains, but only on Series not on DataFrames. So to use it we look at each column (which is a series) one by one by for looping over them:
>>> import pandas as pd
>>> df = pd.DataFrame([['DDDA', 'DDDB', 'DDDC', 'DDDD'],
['DDDE', 'DDDF', 'DDDG', 'DHDD'],
['DDDI', 'DDDJ', 'DDDK', 'DDDL'],
['DMDD', 'DNDN', 'DDOD', 'DDDP']],
columns=['A', 'B', 'C', 'D'])
>>> for column in df.columns:
df = df[df[column].str.contains('DDD')]
In our for loop we're overwriting the DataFrame df with df where the column contains 'DDD'. By looping over each column we cut out rows that don't contain 'DDD' in that column until we've looked in all of our columns, leaving only rows that contain 'DDD' in every column.
This gives you:
>>> print(df)
A B C D
0 DDDA DDDB DDDC DDDD
2 DDDI DDDJ DDDK DDDL
As you're only looping over 10 columns this shouldn't be too slow.
Edit: You should probably do it without a for loop as explained by Christian Sloper as it's likely to be faster, but I'll leave this up as it's slightly easier to understand without knowledge of lambda functions.

iterate through csv columns to create multiple python dataframe

I am trying to create multiple data frames using the columns of a excel csv file. This is where I have been able to get to
import pandas as pd
file = pd.read_csv('file.csv')
df = pd.DataFrame(file)
cols = df.columns
#column names are 'Date', 'Stock 1', 'Stock 2', etc - I have 1000 columns
for i in range(len(cols)):
df[i] = df[['Date',b(i)]]
So the end result is I want multiple dataframes. The first dataframe is with columns 1 and 2 (so Date and Stock 1), the second dataframe is with columns 1 and 3 (so Date and Stock 2), the third dataframe is with columns 1 and 3, creating new dataframe all the way to Columns 1 and 1000.
I have tried several ways and either get index in not callable or I tried with usecols and I get usecols must be strings or integers.
Can anyone help me with this. Conceptually it is easy but I can not get the code right. Thank you.
This does what you are asking:
all_dfs = []
for col in df.columns:
if col != 'Date':
df_current = df[['Date', col]]
all_dfs.append(df_current)
Or as one line:
all_dfs = [df[['Date', col]] for col in df.columns if col != 'Date']
But you probably don't want to do that. There's not much point. What are you really trying to do?

"Expanding" pandas dataframe by using cell-contained list

I have a dataframe in which third column is a list:
import pandas as pd
pd.DataFrame([[1,2,['a','b','c']]])
I would like to separate that nest and create more rows with identical values of first and second column.
The end result should be something like:
pd.DataFrame([[[1,2,'a']],[[1,2,'b']],[[1,2,'c']]])
Note, this is simplified example. In reality I have multiple rows that I would like to "expand".
Regarding my progress, I have no idea how to solve this. Well, I imagine that I could take each member of nested list while having other column values in mind. Then I would use the list comprehension to make more list. I would continue so by and add many lists to create a new dataframe... But this seems just a bit too complex. What about simpler solution?
Create the dataframe with a single column, then add columns with constant values:
import pandas as pd
df = pd.DataFrame({"data": ['a', 'b', 'c']})
df['col1'] = 1
df['col2'] = 2
print df
This prints:
data col1 col2
0 a 1 2
1 b 1 2
2 c 1 2
Not exactly the same issue that the OR described, but related - and more pandas-like - is the situation where you have a dict of lists with lists of unequal lengths. In that case, you can create a DataFrame like this in long format.
import pandas as pd
my_dict = {'a': [1,2,3,4], 'b': [2,3]}
df = pd.DataFrame.from_dict(my_dict, orient='index')
df = df.unstack() # to format it in long form
df = df.dropna() # to drop nan values which were generated by having lists of unequal length
df.index = df.index.droplevel(level=0) # if you don't want to store the index in the list
# NOTE this last step results duplicate indexes

Categories